Re: [PATCH net-next 1/8] net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
From: Gavin Shan
Date: Thu, 29 Sep 2016 15:03:08 +1000

> This replaces the atomic access to NCSI channel's state with READ_ONCE()
> and WRITE_ONCE() to avoid the above build warning. We needn't hold the
> channel's lock when updating its state as well. No logical changes
> introduced.

I don't understand this.  If it's important to take the lock for the
list add/del, then it must be important to make the state change appear
atomic wrt. that lock as well.

Can parallel threads of control enter these functions which change the
state?  If so, then you need to make the state changes under the lock.
In fact, you probably have to make the state tests under the locks as
well.

If not, please explain what prevents it from happening.

Thanks.
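The pattern being asked for here — state transition and queue manipulation made atomic with respect to the same lock — can be sketched in userspace with a pthread mutex standing in for the channel spinlock. The struct and field names below are hypothetical, loosely modeled on `struct ncsi_channel`; this is an illustration of the locking discipline, not the NCSI code itself.

```c
#include <pthread.h>

#define CHANNEL_INACTIVE 0
#define CHANNEL_ACTIVE   1

/* Stand-in for struct ncsi_channel: state and queue membership are
 * guarded by the same lock, so a reader holding the lock never sees
 * them out of sync. */
struct channel {
	pthread_mutex_t lock;
	int state;
	int on_queue;	/* stand-in for !list_empty(&nc->link) */
};

/* The state change and the queue insertion happen in one critical
 * section; this is what makes a state test under the lock meaningful. */
static void channel_deactivate_and_queue(struct channel *nc)
{
	pthread_mutex_lock(&nc->lock);
	nc->state = CHANNEL_INACTIVE;
	nc->on_queue = 1;
	pthread_mutex_unlock(&nc->lock);
}
```

With READ_ONCE()/WRITE_ONCE() outside the lock, another thread can observe the state already changed while the list is not yet updated (or vice versa), which is the race the reply is pointing at.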
Re: [PATCH 0/3] [v2] Add basic ACPI support to the Qualcomm Technologies EMAC driver
From: Timur Tabi
Date: Wed, 28 Sep 2016 11:58:41 -0500

> This patch series adds support to the EMAC driver for extracting addresses,
> interrupts, and some _DSDs (properties) from ACPI.  The first two patches
> clean up the code, and the third patch adds ACPI-specific functionality.
>
> The first patch fixes a bug with handling the platform_device for the
> internal PHY.  This PHY is treated as a separate device in both DT and
> ACPI, but since the platform device is not released automatically when the
> driver unloads, managed functions like devm_ioremap_resource cannot be
> used.
>
> The second patch replaces of_get_mac_address with its platform-independent
> equivalent, device_get_mac_address.
>
> The third patch parses the ACPI tables to obtain the platform_device for
> the primary EMAC node ("QCOM8070") and the internal PHY node ("QCOM8071").

Series applied, thanks.
Re: [PATCH net-next v5] bpf: allow access into map value arrays
From: Josef Bacik
Date: Wed, 28 Sep 2016 10:54:32 -0400

> Suppose you have a map array value that is something like this
>
> struct foo {
> 	unsigned iter;
> 	int array[SOME_CONSTANT];
> };
>
> You can easily insert this into an array, but you cannot modify the contents
> of foo->array[] after the fact.  This is because we have no way to verify
> we won't go off the end of the array at verification time.  This patch
> provides a start for this work.  We accomplish this by keeping track of a
> minimum and maximum value a register could be while we're checking the code.
> Then at the time we try to do an access into a MAP_VALUE we verify that the
> maximum offset into that region is a valid access into that memory region.
> So in practice, code such as this
>
> unsigned index = 0;
>
> if (foo->iter >= SOME_CONSTANT)
> 	foo->iter = index;
> else
> 	index = foo->iter++;
> foo->array[index] = bar;
>
> would be allowed, as we can verify that index will always be between 0 and
> SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an
> extra check to make sure the index isn't less than 0, or do something like
> index %= SOME_CONSTANT.
>
> Signed-off-by: Josef Bacik
> Acked-by: Alexei Starovoitov

Applied, thanks.
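The quoted access pattern can be exercised as ordinary C. The sketch below is a userspace illustration of the two cases the commit message describes — the unsigned index that is provably in range on every path, and the signed index that needs its lower bound re-established before the access. `SOME_CONSTANT` and the helper names are arbitrary choices for the example, not part of the patch.

```c
#define SOME_CONSTANT 16

struct foo {
	unsigned iter;
	int array[SOME_CONSTANT];
};

/* The verifier-friendly pattern from the commit message: after the
 * branch, 'index' is in [0, SOME_CONSTANT - 1] on every path. */
static void store_unsigned(struct foo *f, int bar)
{
	unsigned index = 0;

	if (f->iter >= SOME_CONSTANT)
		f->iter = index;
	else
		index = f->iter++;
	f->array[index] = bar;
}

/* With a signed index, a modulo alone is not enough in C, since '%'
 * can yield a negative result; the explicit check restores the lower
 * bound, mirroring the "extra check" the commit message mentions. */
static void store_signed(struct foo *f, int index, int bar)
{
	index %= SOME_CONSTANT;
	if (index < 0)
		index += SOME_CONSTANT;
	f->array[index] = bar;
}
```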
[PATCH kernel v2] PCI: Enable access to custom VPD for Chelsio devices (cxgb3)
There is at least one Chelsio 10Gb card which uses the VPD area to store
some custom blocks (example below). However pci_vpd_size() returns the
length of the first block only, assuming that there can be only one VPD
"End Tag", and VFIO blocks access beyond that offset (since 4e1a63555),
which leads to the situation where the guest "cxgb3" driver fails to
probe the device. The host system does not have this problem as the
driver accesses the config space directly without pci_read_vpd()/...

This adds a quirk to override the VPD size to a bigger value. The
maximum size is taken from EEPROMSIZE in
drivers/net/ethernet/chelsio/cxgb3/common.h. We do not read the tag as
the cxgb3 driver does, as that driver supports writing to EEPROM/VPD
and, when it writes, it only checks for the 8192 bytes boundary. The
quirk is registered for all devices supported by the cxgb3 driver.

This adds the quirk to the PCI layer (not to the cxgb3 driver) as the
cxgb3 driver itself accesses VPD directly and the problem only exists
with the vfio-pci driver (when cxgb3 is not running on the host and may
not even be loaded), which blocks accesses beyond the first block of
VPD data. However vfio-pci itself does not have a quirks mechanism, so
we add it to PCI.
Tested on:
Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port Adapter [1425:0030]

This is its VPD:

 Large item 42 bytes; name 0x2 Identifier String
	b'10 Gigabit Ethernet-SR PCI Express Adapter'
 #00 [EC] len=7: b'D76809 '
 #0a [FN] len=7: b'46K7897'
 #14 [PN] len=7: b'46K7897'
 #1e [MN] len=4: b'1037'
 #25 [FC] len=4: b'5769'
 #2c [SN] len=12: b'YL102035603V'
 #3b [NA] len=12: b'00145E992ED1'
0c00 Large item 16 bytes; name 0x2 Identifier String
	b'S310E-SR-X '
0c13 Large item 234 bytes; name 0x10
 #00 [PN] len=16: b'TBD '
 #13 [EC] len=16: b'110107730D2 '
 #26 [SN] len=16: b'97YL102035603V '
 #39 [NA] len=12: b'00145E992ED1'
 #48 [V0] len=6: b'175000'
 #51 [V1] len=6: b'26'
 #5a [V2] len=6: b'26'
 #63 [V3] len=6: b'2000 '
 #6c [V4] len=2: b'1 '
 #71 [V5] len=6: b'c2'
 #7a [V6] len=6: b'0 '
 #83 [V7] len=2: b'1 '
 #88 [V8] len=2: b'0 '
 #8d [V9] len=2: b'0 '
 #92 [VA] len=2: b'0 '
 #97 [RV] len=80: b's\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
0d00 Large item 252 bytes; name 0x11
 #00 [VC] len=16: b'122310_1222 dp '
 #13 [VD] len=16: b'610-0001-00 H1\x00\x00'
 #26 [VE] len=16: b'122310_1353 fp '
 #39 [VF] len=16: b'610-0001-00 H1\x00\x00'
 #4c [RW] len=173: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
0dff Small item 0 bytes; name 0xf End Tag

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v2:
* used pci_set_vpd_size() helper
* added explicit list of IDs from cxgb3 driver
* added a note in the commit log why the quirk is not in cxgb3
---
 drivers/pci/quirks.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44e0ff3..b22fce5 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3243,6 +3243,28 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
 			quirk_thunderbolt_hotplug_msi);
 
+static void quirk_chelsio_extend_vpd(struct pci_dev *dev)
+{
+	if (!dev->vpd)
+		return;
+
+	pci_set_vpd_size(dev, max_t(unsigned int, dev->vpd->len, 8192));
+}
+
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x20, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x21, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x22, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x23, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x24, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x25, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x26, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x30, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x31, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x32, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x35, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x36, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x37, quirk_chelsio_extend_vpd);
+
 #ifdef CONFIG_ACPI
 /*
  * Apple: Shutdown Cactus Ridge Thunderbolt controller.
-- 
2.5.0.rc3
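The dump above uses the standard VPD resource-tag framing: a large item has bit 7 set, a name in bits 6:0, and a 2-byte little-endian length; a small item encodes the name in bits 6:3 and the length in bits 2:0, with name 0xf being the End Tag. A minimal userspace walker over that framing (a simplified sketch of the kind of size computation pci_vpd_size() performs, not the kernel code itself) looks like this:

```c
#include <stddef.h>

#define VPD_END_TAG 0x0f

/* Walk VPD resource tags and return the offset just past the End Tag,
 * i.e. the usable VPD size; returns 0 on truncated/malformed data. */
static size_t vpd_size(const unsigned char *vpd, size_t max)
{
	size_t off = 0;

	while (off < max) {
		unsigned char tag = vpd[off];

		if (tag & 0x80) {		/* large resource item */
			if (off + 3 > max)
				return 0;
			/* 2-byte little-endian length after the tag byte */
			off += 3 + (vpd[off + 1] | (vpd[off + 2] << 8));
		} else {			/* small resource item */
			if (((tag >> 3) & 0x0f) == VPD_END_TAG)
				return off + 1;
			off += 1 + (tag & 0x07);
		}
	}
	return 0;
}
```

On the card above the only End Tag is at 0dff, so a walker that gives up early (or a hard-coded first-block assumption) undershoots the real size — which is what the quirk compensates for.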
[PATCH net-next 7/8] net/ncsi: Introduce ncsi_stop_dev()
This introduces ncsi_stop_dev(), as counterpart to ncsi_start_dev(), to
stop the NCSI device so that it can be reenabled in future. This API
should be called when the network device driver is going to shut down
the device. There are 3 things done in the function: stop the channel
monitoring; reset channels to inactive state; report NCSI link down.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 include/net/ncsi.h     |  5 +++++
 net/ncsi/ncsi-manage.c | 33 ++++++++++++++++++++-------------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/net/ncsi.h b/include/net/ncsi.h
index 1dbf42f..68680ba 100644
--- a/include/net/ncsi.h
+++ b/include/net/ncsi.h
@@ -31,6 +31,7 @@ struct ncsi_dev {
 struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
 				   void (*notifier)(struct ncsi_dev *nd));
 int ncsi_start_dev(struct ncsi_dev *nd);
+void ncsi_stop_dev(struct ncsi_dev *nd);
 void ncsi_unregister_dev(struct ncsi_dev *nd);
 #else /* !CONFIG_NET_NCSI */
 static inline struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
@@ -44,6 +45,10 @@ static inline int ncsi_start_dev(struct ncsi_dev *nd)
 	return -ENOTTY;
 }
 
+static inline void ncsi_stop_dev(struct ncsi_dev *nd)
+{
+}
+
 static inline void ncsi_unregister_dev(struct ncsi_dev *nd)
 {
 }
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 1b797c9..6a96873 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1161,9 +1161,7 @@ EXPORT_SYMBOL_GPL(ncsi_register_dev);
 int ncsi_start_dev(struct ncsi_dev *nd)
 {
 	struct ncsi_dev_priv *ndp = TO_NCSI_DEV_PRIV(nd);
-	struct ncsi_package *np;
-	struct ncsi_channel *nc;
-	int old_state, ret;
+	int ret;
 
 	if (nd->state != ncsi_dev_state_registered &&
 	    nd->state != ncsi_dev_state_functional)
@@ -1175,9 +1173,27 @@ int ncsi_start_dev(struct ncsi_dev *nd)
 		return 0;
 	}
 
-	/* Reset channel's state and start over */
+	if (ndp->flags & NCSI_DEV_HWA)
+		ret = ncsi_enable_hwa(ndp);
+	else
+		ret = ncsi_choose_active_channel(ndp);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ncsi_start_dev);
+
+void ncsi_stop_dev(struct ncsi_dev *nd)
+{
+	struct ncsi_dev_priv *ndp = TO_NCSI_DEV_PRIV(nd);
+	struct ncsi_package *np;
+	struct ncsi_channel *nc;
+	int old_state;
+
+	/* Stop the channel monitor and reset channel's state */
 	NCSI_FOR_EACH_PACKAGE(ndp, np) {
 		NCSI_FOR_EACH_CHANNEL(np, nc) {
+			ncsi_stop_channel_monitor(nc);
+
+			old_state = READ_ONCE(nc->state);
 			WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
 			WARN_ON_ONCE(!list_empty(&nc->link) ||
@@ -1185,14 +1201,9 @@ int ncsi_start_dev(struct ncsi_dev *nd)
 		}
 	}
 
-	if (ndp->flags & NCSI_DEV_HWA)
-		ret = ncsi_enable_hwa(ndp);
-	else
-		ret = ncsi_choose_active_channel(ndp);
-
-	return ret;
+	ncsi_report_link(ndp, true);
 }
-EXPORT_SYMBOL_GPL(ncsi_start_dev);
+EXPORT_SYMBOL_GPL(ncsi_stop_dev);
 
 void ncsi_unregister_dev(struct ncsi_dev *nd)
 {
-- 
2.1.0
[PATCH net-next 8/8] net/faraday: Stop NCSI device on shutdown
This stops the NCSI device when closing the network device so that the
NCSI device can be reenabled later.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 drivers/net/ethernet/faraday/ftgmac100.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 90f9c54..2625872 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1190,6 +1190,8 @@ static int ftgmac100_stop(struct net_device *netdev)
 	napi_disable(&priv->napi);
 	if (netdev->phydev)
 		phy_stop(netdev->phydev);
+	else if (priv->use_ncsi)
+		ncsi_stop_dev(priv->ndev);
 
 	ftgmac100_stop_hw(priv);
 	free_irq(priv->irq, netdev);
-- 
2.1.0
[PATCH net-next 0/8] net/ncsi: NCSI improvements and bug fixes
This series of patches improves the NCSI stack according to the comments
I received after the NCSI code was merged into 4.8-rc1:

* PATCH[1/8] fixes the build warning caused by xchg() with
  ia64-linux-gcc. The atomic operations are replaced with
  {READ,WRITE}_ONCE().
* Channel ID 0x1f is reserved and cannot be a valid channel ID, so we
  needn't probe the channel whose ID is 0x1f. PATCH[2/8] and PATCH[3/8]
  address this issue.
* The request IDs are assigned in round-robin fashion, but it's broken.
  PATCH[4/8] makes it work.
* PATCH[5/8] and PATCH[6/8] rework the channel monitoring to improve
  the code's readability and robustness.
* PATCH[7/8] and PATCH[8/8] introduce ncsi_stop_dev() so that the
  network device can be closed and opened afterwards without errors.

Gavin Shan (8):
  net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
  net/ncsi: Introduce NCSI_RESERVED_CHANNEL
  net/ncsi: Don't probe on the reserved channel ID (0x1f)
  net/ncsi: Rework request index allocation
  net/ncsi: Allow to extend NCSI request properties
  net/ncsi: Rework the channel monitoring
  net/ncsi: Introduce ncsi_stop_dev()
  net/faraday: Stop NCSI device on shutdown

 drivers/net/ethernet/faraday/ftgmac100.c |   2 +
 include/net/ncsi.h                       |   5 +
 net/ncsi/internal.h                      |  22 ++--
 net/ncsi/ncsi-aen.c                      |  19 ++--
 net/ncsi/ncsi-cmd.c                      |   2 +-
 net/ncsi/ncsi-manage.c                   | 166 +++++++++++++++--------
 net/ncsi/ncsi-rsp.c                      |  17 +--
 7 files changed, 137 insertions(+), 96 deletions(-)

-- 
2.1.0
[PATCH net-next 1/8] net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
xchg() is used to set the NCSI channel's state in order to have
consistent access to the state. xchg()'s return value should be used;
otherwise a build warning will be raised (with -Wunused-value), as the
message below indicates. It is reported by ia64-linux-gcc (GCC) 4.9.0.

net/ncsi/ncsi-manage.c: In function 'ncsi_channel_monitor':
arch/ia64/include/uapi/asm/cmpxchg.h:56:2: warning: value computed is \
not used [-Wunused-value]
  ((__typeof__(*(ptr))) __xchg((unsigned long) (x), (ptr), sizeof(*(ptr))))
  ^
net/ncsi/ncsi-manage.c:202:3: note: in expansion of macro 'xchg'
   xchg(&nc->state, NCSI_CHANNEL_INACTIVE);

This replaces the atomic access to the NCSI channel's state with
READ_ONCE() and WRITE_ONCE() to avoid the above build warning. We
needn't hold the channel's lock when updating its state either. No
logical changes introduced.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 net/ncsi/ncsi-aen.c    | 19 +++++++++++--------
 net/ncsi/ncsi-manage.c | 37 +++++++++++++++++++------------------
 net/ncsi/ncsi-rsp.c    | 13 +++++++++----
 3 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/net/ncsi/ncsi-aen.c b/net/ncsi/ncsi-aen.c
index d463468..abc5473 100644
--- a/net/ncsi/ncsi-aen.c
+++ b/net/ncsi/ncsi-aen.c
@@ -53,6 +53,7 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv *ndp,
 	struct ncsi_aen_lsc_pkt *lsc;
 	struct ncsi_channel *nc;
 	struct ncsi_channel_mode *ncm;
+	int old_state;
 	unsigned long old_data;
 	unsigned long flags;
 
@@ -70,12 +71,14 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv *ndp,
 	if (!((old_data ^ ncm->data[2]) & 0x1) ||
 	    !list_empty(&nc->link))
 		return 0;
-	if (!(nc->state == NCSI_CHANNEL_INACTIVE && (ncm->data[2] & 0x1)) &&
-	    !(nc->state == NCSI_CHANNEL_ACTIVE && !(ncm->data[2] & 0x1)))
+
+	old_state = READ_ONCE(nc->state);
+	if (!(old_state == NCSI_CHANNEL_INACTIVE && (ncm->data[2] & 0x1)) &&
+	    !(old_state == NCSI_CHANNEL_ACTIVE && !(ncm->data[2] & 0x1)))
 		return 0;
 
 	if (!(ndp->flags & NCSI_DEV_HWA) &&
-	    nc->state == NCSI_CHANNEL_ACTIVE)
+	    old_state == NCSI_CHANNEL_ACTIVE)
 		ndp->flags |= NCSI_DEV_RESHUFFLE;
 
 	ncsi_stop_channel_monitor(nc);
@@ -90,6 +93,7 @@ static int ncsi_aen_handler_cr(struct ncsi_dev_priv *ndp,
 			       struct ncsi_aen_pkt_hdr *h)
 {
 	struct ncsi_channel *nc;
+	int old_state;
 	unsigned long flags;
 
 	/* Find the NCSI channel */
@@ -97,13 +101,14 @@ static int ncsi_aen_handler_cr(struct ncsi_dev_priv *ndp,
 	if (!nc)
 		return -ENODEV;
 
+	old_state = READ_ONCE(nc->state);
 	if (!list_empty(&nc->link) ||
-	    nc->state != NCSI_CHANNEL_ACTIVE)
+	    old_state != NCSI_CHANNEL_ACTIVE)
 		return 0;
 
 	ncsi_stop_channel_monitor(nc);
+	WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
 	spin_lock_irqsave(&ndp->lock, flags);
-	xchg(&nc->state, NCSI_CHANNEL_INACTIVE);
 	list_add_tail_rcu(&nc->link, &ndp->channel_queue);
 	spin_unlock_irqrestore(&ndp->lock, flags);
 
@@ -116,6 +121,7 @@ static int ncsi_aen_handler_hncdsc(struct ncsi_dev_priv *ndp,
 	struct ncsi_channel *nc;
 	struct ncsi_channel_mode *ncm;
 	struct ncsi_aen_hncdsc_pkt *hncdsc;
+	int old_state;
 	unsigned long flags;
 
 	/* Find the NCSI channel */
@@ -127,8 +133,9 @@ static int ncsi_aen_handler_hncdsc(struct ncsi_dev_priv *ndp,
 	ncm = &nc->modes[NCSI_MODE_LINK];
 	hncdsc = (struct ncsi_aen_hncdsc_pkt *)h;
 	ncm->data[3] = ntohl(hncdsc->status);
+	old_state = READ_ONCE(nc->state);
 	if (!list_empty(&nc->link) ||
-	    nc->state != NCSI_CHANNEL_ACTIVE ||
+	    old_state != NCSI_CHANNEL_ACTIVE ||
 	    (ncm->data[3] & 0x1))
 		return 0;
 
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index ef017b8..fc0ae54 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -143,7 +143,7 @@ static void ncsi_report_link(struct ncsi_dev_priv *ndp, bool force_down)
 	NCSI_FOR_EACH_PACKAGE(ndp, np) {
 		NCSI_FOR_EACH_CHANNEL(np, nc) {
 			if (!list_empty(&nc->link) ||
-			    nc->state != NCSI_CHANNEL_ACTIVE)
+			    READ_ONCE(nc->state) != NCSI_CHANNEL_ACTIVE)
 				continue;
 
 			if (nc->modes[NCSI_MODE_LINK].data[2] & 0x1) {
@@ -166,7 +166,7 @@ static void ncsi_channel_monitor(unsigned long data)
 	bool enabled;
 	unsigned int timeout;
 	unsigned long flags;
-	int ret;
+	int old_state, ret;
 
 	spin_lock_irqsave(&nc->lock, flags);
 	timeout = nc->timeout;
@@ -175,8 +175,10 @@ static void ncsi_channel_monitor(unsigned long data)
 	if (!enabled || !list_empty(&nc->link))
[PATCH net-next 6/8] net/ncsi: Rework the channel monitoring
The original NCSI channel monitoring was implemented based on a backoff
algorithm: the GLS response should be received in the specified
interval, otherwise the channel is regarded as dead and failover should
be taken if the current channel is an active one. There are several
problems in the implementation:

(A) On BCM5718, we found that when the IID (Instance ID) in the GLS
    command packet changes from 255 to 1, the response corresponding to
    IID#1 never comes in. It means we cannot simply judge the channel
    dead when one response is missed.
(B) The code's readability should be improved.
(C) We should do failover when the current channel is an active one,
    and channel monitoring should be marked as disabled before doing
    failover.

This reworks the channel monitoring to address all of the above issues.
The fields for channel monitoring are put into a separate struct and
the states of channel monitoring are predefined. The channel is
regarded as alive if the network controller responds to one of the two
GLS commands, or both of them, within 5 seconds.
Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 net/ncsi/internal.h    | 12 +++++++---
 net/ncsi/ncsi-manage.c | 46 ++++++++++++++++++++++++------------------
 net/ncsi/ncsi-rsp.c    |  2 +-
 3 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 26e9295..13290a7 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -187,9 +187,15 @@ struct ncsi_channel {
 	struct ncsi_channel_mode    modes[NCSI_MODE_MAX];
 	struct ncsi_channel_filter *filters[NCSI_FILTER_MAX];
 	struct ncsi_channel_stats   stats;
-	struct timer_list           timer;   /* Link monitor timer  */
-	bool                        enabled; /* Timer is enabled    */
-	unsigned int                timeout; /* Times of timeout    */
+	struct {
+		struct timer_list   timer;
+		bool                enabled;
+		unsigned int        state;
+#define NCSI_CHANNEL_MONITOR_START	0
+#define NCSI_CHANNEL_MONITOR_RETRY	1
+#define NCSI_CHANNEL_MONITOR_WAIT	2
+#define NCSI_CHANNEL_MONITOR_WAIT_MAX	5
+	} monitor;
 	struct list_head            node;
 	struct list_head            link;
 };
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 41d6af9..1b797c9 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -164,13 +164,13 @@ static void ncsi_channel_monitor(unsigned long data)
 	struct ncsi_dev_priv *ndp = np->ndp;
 	struct ncsi_cmd_arg nca;
 	bool enabled;
-	unsigned int timeout;
+	unsigned int monitor_state;
 	unsigned long flags;
 	int old_state, ret;
 
 	spin_lock_irqsave(&nc->lock, flags);
-	timeout = nc->timeout;
-	enabled = nc->enabled;
+	enabled = nc->monitor.enabled;
+	monitor_state = nc->monitor.state;
 	spin_unlock_irqrestore(&nc->lock, flags);
 
 	if (!enabled || !list_empty(&nc->link))
@@ -181,7 +181,9 @@ static void ncsi_channel_monitor(unsigned long data)
 	    old_state != NCSI_CHANNEL_ACTIVE)
 		return;
 
-	if (!(timeout % 2)) {
+	switch (monitor_state) {
+	case NCSI_CHANNEL_MONITOR_START:
+	case NCSI_CHANNEL_MONITOR_RETRY:
 		nca.ndp = ndp;
 		nca.package = np->id;
 		nca.channel = nc->id;
@@ -193,14 +195,18 @@ static void ncsi_channel_monitor(unsigned long data)
 				   ret);
 			return;
 		}
-	}
 
-	if (timeout + 1 >= 3) {
+		break;
+	case NCSI_CHANNEL_MONITOR_WAIT ... NCSI_CHANNEL_MONITOR_WAIT_MAX:
+		break;
+	default:
 		if (!(ndp->flags & NCSI_DEV_HWA) &&
-		    old_state == NCSI_CHANNEL_ACTIVE)
+		    old_state == NCSI_CHANNEL_ACTIVE) {
 			ncsi_report_link(ndp, true);
+			ndp->flags |= NCSI_DEV_RESHUFFLE;
+		}
 
-		WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
+		WRITE_ONCE(nc->state, NCSI_CHANNEL_ACTIVE);
 		spin_lock_irqsave(&ndp->lock, flags);
 		list_add_tail_rcu(&nc->link, &ndp->channel_queue);
 		spin_unlock_irqrestore(&ndp->lock, flags);
@@ -209,10 +215,9 @@ static void ncsi_channel_monitor(unsigned long data)
 	}
 
 	spin_lock_irqsave(&nc->lock, flags);
-	nc->timeout = timeout + 1;
-	nc->enabled = true;
+	nc->monitor.state++;
 	spin_unlock_irqrestore(&nc->lock, flags);
-	mod_timer(&nc->timer, jiffies + HZ * (1 << (nc->timeout / 2)));
+	mod_timer(&nc->monitor.timer, jiffies + HZ);
 }
 
 void ncsi_start_channel_monitor(struct ncsi_channel *nc)
@@ -220,12 +225,12 @@ void ncsi_start_channel_monitor(struct ncsi_channel *nc)
 	unsigned long flags;
 
 	spin_lock_irqsave(&nc->lock, flags);
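The monitor's decision logic described above — send GLS in the START/RETRY states, sit idle through WAIT..WAIT_MAX, and declare the channel dead beyond that, with a 1-second timer between ticks — can be reduced to a small userspace state machine. This is a hypothetical simplification for illustration; the real code also checks the channel state and queue membership, and a GLS response elsewhere resets the state to START.

```c
#include <stdbool.h>

/* Monitor states as defined in the patch above. */
enum {
	MONITOR_START    = 0,
	MONITOR_RETRY    = 1,
	MONITOR_WAIT     = 2,
	MONITOR_WAIT_MAX = 5,
};

/* One timer tick: returns false when the channel is declared dead
 * (report link down, fail over); *send_gls tells the caller whether
 * a GLS command would be sent on this tick. */
static bool monitor_tick(unsigned int *state, bool *send_gls)
{
	*send_gls = false;

	if (*state == MONITOR_START || *state == MONITOR_RETRY)
		*send_gls = true;
	else if (*state > MONITOR_WAIT_MAX)
		return false;	/* no GLS response within ~5 seconds */

	(*state)++;		/* re-armed with a 1s (HZ) timer */
	return true;
}
```

Ticking once per second, the channel survives two GLS sends plus the WAIT window and is only declared dead after the full 5-second grace period, matching the commit message.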
[PATCH net-next 5/8] net/ncsi: Allow to extend NCSI request properties
There is only one NCSI request property for now: whether the response
to the sent command needs to drive the workqueue or not, so we had one
field (@driven) for the purpose. We lost the flexibility to extend NCSI
request properties.

This replaces @driven with @flags and @req_flags in the NCSI request
and NCSI command argument structs. Each bit of the newly introduced
fields can be used for one property. No functional changes introduced.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 net/ncsi/internal.h    |  8 +++++---
 net/ncsi/ncsi-cmd.c    |  2 +-
 net/ncsi/ncsi-manage.c | 19 ++++++++++---------
 net/ncsi/ncsi-rsp.c    |  2 +-
 4 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index c956fe8..26e9295 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -207,7 +207,8 @@ struct ncsi_package {
 struct ncsi_request {
 	unsigned char        id;      /* Request ID - 0 to 255           */
 	bool                 used;    /* Request that has been assigned  */
-	bool                 driven;  /* Drive state machine             */
+	unsigned int         flags;   /* NCSI request property           */
+#define NCSI_REQ_FLAG_EVENT_DRIVEN	1
 	struct ncsi_dev_priv *ndp;    /* Associated NCSI device          */
 	struct sk_buff       *cmd;    /* Associated NCSI command packet  */
 	struct sk_buff       *rsp;    /* Associated NCSI response packet */
@@ -276,7 +277,7 @@ struct ncsi_cmd_arg {
 	unsigned char        package;   /* Destination package ID        */
 	unsigned char        channel;   /* Detination channel ID or 0x1f */
 	unsigned short       payload;   /* Command packet payload length */
-	bool                 driven;    /* Drive the state machine?      */
+	unsigned int         req_flags; /* NCSI request properties       */
 	union {
 		unsigned char  bytes[16]; /* Command packet specific data */
 		unsigned short words[8];
@@ -315,7 +316,8 @@ void ncsi_find_package_and_channel(struct ncsi_dev_priv *ndp,
 				   unsigned char id,
 				   struct ncsi_package **np,
 				   struct ncsi_channel **nc);
-struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool driven);
+struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp,
+					unsigned int req_flags);
 void ncsi_free_request(struct ncsi_request *nr);
 struct ncsi_dev *ncsi_find_dev(struct net_device *dev);
 int ncsi_process_next_channel(struct ncsi_dev_priv *ndp);
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 21057a8..db7083b 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -272,7 +272,7 @@ static struct ncsi_request *ncsi_alloc_command(struct ncsi_cmd_arg *nca)
 	struct sk_buff *skb;
 	struct ncsi_request *nr;
 
-	nr = ncsi_alloc_request(ndp, nca->driven);
+	nr = ncsi_alloc_request(ndp, nca->req_flags);
 	if (!nr)
 		return NULL;
 
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 64aab38..41d6af9 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -186,7 +186,7 @@ static void ncsi_channel_monitor(unsigned long data)
 		nca.package = np->id;
 		nca.channel = nc->id;
 		nca.type = NCSI_PKT_CMD_GLS;
-		nca.driven = false;
+		nca.req_flags = 0;
 		ret = ncsi_xmit_cmd(&nca);
 		if (ret) {
 			netdev_err(ndp->ndev.dev, "Error %d sending GLS\n",
@@ -407,7 +407,8 @@ void ncsi_find_package_and_channel(struct ncsi_dev_priv *ndp,
  * be same. Otherwise, the bogus response might be replied. So
  * the available IDs are allocated in round-robin fashion.
  */
-struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool driven)
+struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp,
+					unsigned int req_flags)
 {
 	struct ncsi_request *nr = NULL;
 	int i, limit = ARRAY_SIZE(ndp->requests);
@@ -421,7 +422,7 @@ struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool driven)
 
 		nr = &ndp->requests[i];
 		nr->used = true;
-		nr->driven = driven;
+		nr->flags = req_flags;
 		ndp->request_id = i + 1;
 		goto found;
 	}
@@ -433,7 +434,7 @@ struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool driven)
 
 		nr = &ndp->requests[i];
 		nr->used = true;
-		nr->driven = driven;
+		nr->flags = req_flags;
 		ndp->request_id = i + 1;
 		goto found;
 	}
@@ -461,7 +462,7 @@ void ncsi_free_request(struct ncsi_request *nr)
 	nr->cmd = NULL;
 	nr->rsp =
[PATCH net-next 4/8] net/ncsi: Rework request index allocation
The NCSI request index (struct ncsi_request::id) is put into the
instance ID (IID) field while sending the NCSI command packet. It was
designed so that the available IDs are given out in round-robin
fashion. @ndp->request_id was introduced to represent the next
available ID, but it has been used as the number of successively
allocated IDs, which breaks the round-robin design. Besides, we
shouldn't put 0 into the NCSI command packet's IID field, meaning ID#0
should be reserved according to section 6.3.1.1 in the NCSI spec
(v1.1.0).

This fixes the above two issues. With it applied, the available IDs
will be assigned in round-robin fashion and ID#0 won't be assigned.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 net/ncsi/internal.h    |  1 +
 net/ncsi/ncsi-manage.c | 17 +++++++++--------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 66dc851..c956fe8 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -259,6 +259,7 @@ struct ncsi_dev_priv {
 	struct list_head    packages;        /* List of packages           */
 	struct ncsi_request requests[256];   /* Request table              */
 	unsigned int        request_id;      /* Last used request ID       */
+#define NCSI_REQ_START_IDX	1
 	unsigned int        pending_req_num; /* Number of pending requests */
 	struct ncsi_package *active_package; /* Currently handled package  */
 	struct ncsi_channel *active_channel; /* Currently handled channel  */
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3eeb915..64aab38 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -415,30 +415,31 @@ struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool driven)
 
 	/* Check if there is one available request until the ceiling */
 	spin_lock_irqsave(&ndp->lock, flags);
-	for (i = ndp->request_id; !nr && i < limit; i++) {
+	for (i = ndp->request_id; i < limit; i++) {
 		if (ndp->requests[i].used)
 			continue;
 
 		nr = &ndp->requests[i];
 		nr->used = true;
 		nr->driven = driven;
-		if (++ndp->request_id >= limit)
-			ndp->request_id = 0;
+		ndp->request_id = i + 1;
+		goto found;
 	}
 
 	/* Fail back to check from the starting cursor */
-	for (i = 0; !nr && i < ndp->request_id; i++) {
+	for (i = NCSI_REQ_START_IDX; i < ndp->request_id; i++) {
 		if (ndp->requests[i].used)
 			continue;
 
 		nr = &ndp->requests[i];
 		nr->used = true;
 		nr->driven = driven;
-		if (++ndp->request_id >= limit)
-			ndp->request_id = 0;
+		ndp->request_id = i + 1;
+		goto found;
 	}
-	spin_unlock_irqrestore(&ndp->lock, flags);
 
+found:
+	spin_unlock_irqrestore(&ndp->lock, flags);
 	return nr;
 }
 
@@ -1122,7 +1123,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
 	/* Initialize private NCSI device */
 	spin_lock_init(&ndp->lock);
 	INIT_LIST_HEAD(&ndp->packages);
-	ndp->request_id = 0;
+	ndp->request_id = NCSI_REQ_START_IDX;
 	for (i = 0; i < ARRAY_SIZE(ndp->requests); i++) {
 		ndp->requests[i].id = i;
 		ndp->requests[i].ndp = ndp;
-- 
2.1.0
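The allocation scheme after this patch — scan from the cursor up to the table's ceiling, then wrap around starting from index 1 so that the reserved ID#0 is never handed out — can be sketched in userspace. Table size and names are chosen for the example (the kernel table holds 256 entries), and the kernel's locking is omitted.

```c
#include <stdbool.h>

#define NUM_REQUESTS  8		/* 256 in the kernel; small here */
#define REQ_START_IDX 1		/* ID#0 reserved per NCSI spec 6.3.1.1 */

static bool used[NUM_REQUESTS];
static unsigned int request_id = REQ_START_IDX;

/* Hand out the next free ID in round-robin order, never ID#0.
 * Returns -1 when the table is exhausted. */
static int alloc_request_id(void)
{
	unsigned int i;

	/* Scan from the cursor up to the ceiling... */
	for (i = request_id; i < NUM_REQUESTS; i++)
		if (!used[i])
			goto found;
	/* ...then wrap around, skipping the reserved ID#0. */
	for (i = REQ_START_IDX; i < request_id; i++)
		if (!used[i])
			goto found;
	return -1;
found:
	used[i] = true;
	request_id = i + 1;	/* cursor moves, giving round-robin order */
	return (int)i;
}
```

Because the cursor advances past each allocation, consecutive commands get consecutive IIDs even when earlier IDs have been freed, which is what keeps responses matched to the right requests.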
[PATCH net-next 3/8] net/ncsi: Don't probe on the reserved channel ID (0x1f)
We needn't send the CIS (Clear Initial State) command to the NCSI
reserved channel (0x1f) during enumeration, as we shouldn't receive a
valid response to CIS on NCSI channel 0x1f.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 net/ncsi/ncsi-manage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 73883ed..3eeb915 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -887,12 +887,12 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
 		nd->state = ncsi_dev_state_probe_cis;
 		break;
 	case ncsi_dev_state_probe_cis:
-		ndp->pending_req_num = 32;
+		ndp->pending_req_num = NCSI_RESERVED_CHANNEL;
 
 		/* Clear initial state */
 		nca.type = NCSI_PKT_CMD_CIS;
 		nca.package = ndp->active_package->id;
-		for (index = 0; index < 0x20; index++) {
+		for (index = 0; index < NCSI_RESERVED_CHANNEL; index++) {
 			nca.channel = index;
 			ret = ncsi_xmit_cmd(&nca);
 			if (ret)
-- 
2.1.0
[PATCH net-next 2/8] net/ncsi: Introduce NCSI_RESERVED_CHANNEL
This defines NCSI_RESERVED_CHANNEL as the reserved NCSI channel ID
(0x1f). No logical changes introduced.

Signed-off-by: Gavin Shan
Reviewed-by: Joel Stanley
---
 net/ncsi/internal.h    |  1 +
 net/ncsi/ncsi-manage.c | 14 +++++++-------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 33738c0..66dc851 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -170,6 +170,7 @@ struct ncsi_package;
 #define NCSI_PACKAGE_SHIFT	5
 #define NCSI_PACKAGE_INDEX(c)	(((c) >> NCSI_PACKAGE_SHIFT) & 0x7)
+#define NCSI_RESERVED_CHANNEL	0x1f
 #define NCSI_CHANNEL_INDEX(c)	((c) & ((1 << NCSI_PACKAGE_SHIFT) - 1))
 #define NCSI_TO_CHANNEL(p, c)	(((p) << NCSI_PACKAGE_SHIFT) | (c))
 
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index fc0ae54..73883ed 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -529,7 +529,7 @@ static void ncsi_suspend_channel(struct ncsi_dev_priv *ndp)
 	nca.package = np->id;
 	if (nd->state == ncsi_dev_state_suspend_select) {
 		nca.type = NCSI_PKT_CMD_SP;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		if (ndp->flags & NCSI_DEV_HWA)
 			nca.bytes[0] = 0;
 		else
@@ -546,7 +546,7 @@ static void ncsi_suspend_channel(struct ncsi_dev_priv *ndp)
 		nd->state = ncsi_dev_state_suspend_deselect;
 	} else if (nd->state == ncsi_dev_state_suspend_deselect) {
 		nca.type = NCSI_PKT_CMD_DP;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		nd->state = ncsi_dev_state_suspend_done;
 	}
 
@@ -592,7 +592,7 @@ static void ncsi_configure_channel(struct ncsi_dev_priv *ndp)
 		else
 			nca.bytes[0] = 1;
 		nca.package = np->id;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		ret = ncsi_xmit_cmd(&nca);
 		if (ret)
 			goto error;
@@ -810,7 +810,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
 
 		/* Deselect all possible packages */
 		nca.type = NCSI_PKT_CMD_DP;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		for (index = 0; index < 8; index++) {
 			nca.package = index;
 			ret = ncsi_xmit_cmd(&nca);
@@ -826,7 +826,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
 
 		/* Select all possible packages */
 		nca.type = NCSI_PKT_CMD_SP;
 		nca.bytes[0] = 1;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		for (index = 0; index < 8; index++) {
 			nca.package = index;
 			ret = ncsi_xmit_cmd(&nca);
@@ -879,7 +879,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
 		nca.type = NCSI_PKT_CMD_SP;
 		nca.bytes[0] = 1;
 		nca.package = ndp->active_package->id;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		ret = ncsi_xmit_cmd(&nca);
 		if (ret)
 			goto error;
@@ -936,7 +936,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
 		/* Deselect the active package */
 		nca.type = NCSI_PKT_CMD_DP;
 		nca.package = ndp->active_package->id;
-		nca.channel = 0x1f;
+		nca.channel = NCSI_RESERVED_CHANNEL;
 		ret = ncsi_xmit_cmd(&nca);
 		if (ret)
 			goto error;
-- 
2.1.0
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> xps_flows maintains a per device flow table that is indexed by the
> skbuff hash. The table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
>
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
>
> The new function get_xps_flows_index looks up flows in the
> xps_flows table. The entry returned gives the last queue a matching flow
> used. The returned queue is compared against the normal XPS queue. If
> they are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.

There is something I do not understand with this.

If this OOO avoidance is tied to BQL, it means that packets sitting in
a qdisc won't be part of the detection.

So packets of flow X could possibly be queued on multiple qdiscs.
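The queue-switch rule Tom describes in the quoted text — stick with the flow's last queue unless the TX queue's tail (completion counter) has advanced past the head recorded when the flow last queued — reduces to a small comparison. The sketch below is a hypothetical userspace reduction for illustration; names follow the description, not the actual xps_flows code, and it models exactly the BQL-level view that the reply points out is blind to packets still sitting in a qdisc.

```c
/* Per-flow record: last queue used and the old queue's head (bytes
 * enqueued counter) captured at the time of the last enqueue. */
struct flow_entry {
	int queue;
	unsigned long enqueued_at;
};

/* Switch to the XPS-preferred queue only when everything this flow
 * queued earlier has completed, i.e. the old queue's tail has reached
 * the recorded head.  Otherwise keep the old queue to avoid OOO.
 * The signed subtraction handles counter wrap-around. */
static int select_queue(struct flow_entry *fe, int xps_queue,
			unsigned long old_q_tail, unsigned long new_q_head)
{
	if (fe->queue != xps_queue &&
	    (long)(old_q_tail - fe->enqueued_at) >= 0) {
		fe->queue = xps_queue;
		fe->enqueued_at = new_q_head;
	}
	return fe->queue;
}
```

Since the tail/head counters here are BQL's, packets handed to a qdisc but not yet to the driver are invisible to the comparison, which is the gap the reply is raising.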
[PATCH v2 net-next] mlx5: Add ndo_poll_controller() implementation
This implements ndo_poll_controller in net_device_ops callbacks for mlx5,
which is necessary to use netconsole with this driver.

Acked-by: Saeed Mahameed
Signed-off-by: Calvin Owens
---
Changes in v2:
 * Only iterate channels to avoid redundant napi_schedule() calls

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b58cfe3..7eaf380 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3188,6 +3188,20 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	}
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->params.num_channels; i++)
+		napi_schedule(&priv->channel[i]->napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open		 = mlx5e_open,
 	.ndo_stop		 = mlx5e_close,
@@ -3208,6 +3222,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 #endif
 	.ndo_tx_timeout		 = mlx5e_tx_timeout,
 	.ndo_xdp		 = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller	 = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3240,6 +3257,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_get_vf_stats	 = mlx5e_get_vf_stats,
 	.ndo_tx_timeout		 = mlx5e_tx_timeout,
 	.ndo_xdp		 = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller	 = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3
Re: Explaining RX-stages for XDP
On Wed, Sep 28, 2016 at 12:44:31PM +0200, Jesper Dangaard Brouer wrote:
>
> The idea is quite different. It has nothing to do with Edward's
> proposal[3]. The RX packet-vector is simply an array, either of pointers
> or index numbers (into the RX-ring). The needed changes are completely
> contained inside the driver.
>
> > As far as intermixed XDP vs stack traffic, I think for DoS case the
> > traffic patterns are binary. Either all of it is good or under attack
> > most of the traffic is bad, so makes sense to optimize for these two.
> > 50/50 case I think is artificial and not worth optimizing for.
>
> Sorry, but I feel you have completely misunderstood the concept of my
> idea. It does not matter what traffic pattern you believe or don't
> believe in, it is irrelevant. The fact is that intermixed traffic is
> possible with the current solution. The core of my idea is to remove
> the possibility for this intermixed traffic to occur, simply by seeing
> XDP as a RX-stage before the stack.

Is the idea to add two extra 'to_stack' and 'not_to_stack' arrays after
the XDP program has made a decision? Sounds interesting, but I cannot
evaluate it properly, since I fail to see what problem it solves and how
it can be benchmarked. Since you're saying the percentage of good vs bad
traffic is irrelevant, then this idea should help all types of traffic in
all cases? That would be truly awesome.

I think the discussion around performance optimizations should start from:

1. describing the problem: What is the use case? What is the traffic
   pattern? After the problem is understood and accepted as a valid
   problem and not some research subject, then
2. define a benchmark that simulates it. When the benchmarking
   methodology is agreed upon we can go to the next step of
3. come up with multiple ideas/proposals to solve the problem from 1,
   and finally discuss
4. different implementations of the best proposal from step 3.

Without 1 it's easy to come up with a fake benchmark in step 2 that can
justify the addition of any code in 4. Without 2, the comparison of
different proposals will be based on subjective opinions instead of hard
data. Without 3 the proposed patch will be a hard sell, since other
alternatives were not on the table. For new features the steps are
different, of course.

When we started on XDP, among everything else, the first two problems
that it's supposed to solve were DoS and loadbalancer/ila_router. For the
former, the attack traffic is mostly dropped after parsing. That behavior
is simulated by samples/bpf/xdp1_kern.c. For the load balancer most of
the traffic is parsed and transmitted out on the same device after packet
rewrite. That scenario is simulated by samples/bpf/xdp2_kern.c. Therefore
the benchmarking and performance numbers were centered around these two
programs, and I believe XDP itself is a good solution for DoS and
ila_router based on the performance numbers from these two benchmarks. I
think XDP is _not_ a good solution for many other use cases.

The DoS protection feature when not under attack was not benchmarked, so
this is the area to work on in the future. (The appropriate benchmark is
xdp_pass+tcp+netperf.) Another area that was not benchmarked is how XDP
drop/tx interacts with normal control plane traffic. We assumed that
different hw tx queues should provide sufficient separation. That is
certainly to be tested. A lot of work to do, no doubt.

> > Optimizing xdp for 'mostly good' traffic is indeed a challenge.
> > We'd need all the tricks to make it as good as normal skb-based traffic.
> >
> > I haven't seen any tests yet comparing xdp with 'return XDP_PASS' program
> > vs no xdp at all running netperf tcp/udp in user space. It shouldn't
> > be too far off.
> Well, I did post numbers to the list with a 'return XDP_PASS' program[4]:
> https://mid.mail-archive.com/netdev@vger.kernel.org/msg122350.html
>
> Wake-up and smell the coffee, please revise your assumptions:
> * It showed that the performance reduction is 25.98%!!!
>   (AB comparison dropping packets in iptables raw)

sure. iptables drop is slow with xdp_pass vs iptables without xdp.
This is not a benchmark I was interested in, since I don't understand
what use case it simulates.

> Conclusion: These measurements confirm that we need a page recycle
> facility for the drivers before switching to order-0 allocations.
...
> page_pool work
> - iptables-raw-drop: driver mlx5
>   * 4,487,518 pps - baseline-before      => 100.0%
>   * 3,624,237 pps - mlx5 order0-patch    => -19.2% (slower)
>   * 4,806,142 pps - PoC page_pool patch  =>  +7.1% (faster)

I agree that generic page_pool is a useful facility, but I don't think
it's the right approach to design it based on the iptables_drop+xdp_pass
benchmark. You're saying the prototype gives a 7.1% improvement. Ok, but
what is the problem being solved? If the use case is DoS and
fast-as-possible drop, then XDP_DROP is a better alternative. Why
design/benchmark page_pool based on iptables_drop? If page_pool replaces
[PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering
Signed-off-by: Tom Herbert
---
 Documentation/ABI/testing/sysfs-class-net |  8 ++++++++
 Documentation/networking/scaling.txt      | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index 668604f..0d2a6a9 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -251,3 +251,11 @@ Contact:	netdev@vger.kernel.org
 Description:
 		Indicates the unique physical switch identifier of a switch this
 		port belongs to, as a string.
+
+What:		/sys/class/net/<iface>/xps_dev_flow_table_cnt
+Date:		October 2016
+KernelVersion:	4.8
+Contact:	netdev@vger.kernel.org
+Description:
+		Indicates the number of entries in the XPS socketless flow
+		table.
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 59f4db2..7f9590c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,6 +400,26 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+XPS: Transmit Socketless Flow Steering
+======================================
+
+Not all flows have an associated socket in which the transmit queue and
+ooo information can be saved. For instance, packets being transmitted over
+OVS for VMs do not have a socket that is visible to the kernel. To allow
+XPS to be effectively used a flow hash table is employed to save queue and
+ooo information for such socketless flows.
+
+The idea is that transmit queues are saved for socketless flows in a
+flow table that is indexed by skbuff hash. The flow table entries have
+two values: the queue_index and the head cnt of packets from the TX
+queue. When a packet is being sent without an associated socket the
+hash table is consulted which returns the queue_index to use. The
+returned queue is compared to the queue that would be selected by XPS.
+If they are different the XPS queue is selected only if the tail cnt
+in the TX queue advances beyond the recorded head cnt. This checks if
+sending the packet on the new queue could cause ooo packets for the
+flow.
+
 XPS Configuration
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
@@ -409,6 +429,12 @@ queue is configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+Transmit socketless flow steering is enabled by setting the number of
+entries in the XPS flow table. This is configured in the sysfs file
+entry:
+
+/sys/class/net/<dev>/xps_dev_flow_table_cnt
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
-- 
2.9.3
[PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
xps_flows maintains a per device flow table that is indexed by the skbuff
hash. The table is only consulted when there is no queue saved in a
transmit socket for an skbuff.

Each entry in the flow table contains a queue index and a queue pointer.
The queue pointer is set when a queue is chosen using a flow table entry.
This pointer is set to the head pointer in the transmit queue (which is
maintained by BQL).

The new function get_xps_flows_index looks up flows in the xps_flows
table. The entry returned gives the last queue a matching flow used. The
returned queue is compared against the normal XPS queue. If they are
different, then we only switch if the tail pointer in the TX queue has
advanced past the pointer saved in the entry. In this way OOO should be
avoided when XPS wants to use a different queue.

Signed-off-by: Tom Herbert
---
 net/Kconfig    |  6 ++++++
 net/core/dev.c | 87 ++++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..f77fad1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -255,6 +255,12 @@ config XPS
 	depends on SMP
 	default y
 
+config XPS_FLOWS
+	bool
+	depends on XPS
+	depends on BQL
+	default y
+
 config HWBM
 	bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..1ca08b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
+/* Must be called with RCU read_lock */
 static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
@@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
-	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
 		map = rcu_dereference(
@@ -3228,15 +3228,62 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			else
 				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
 									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+			if (queue_index >= 0 &&
+			    likely(queue_index < dev->real_num_tx_queues))
+				return queue_index;
 		}
 	}
-	rcu_read_unlock();
+#endif
+	return skb_tx_hash(dev, skb);
+}
+
+/* Must be called with RCU read_lock */
+static int get_xps_flows_index(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table *flow_table;
+	struct xps_dev_flow ent;
+	int queue_index;
+	struct netdev_queue *txq;
+	u32 hash;
+
+	queue_index = get_xps_queue(dev, skb);
+	if (queue_index < 0)
+		return -1;
+
+	flow_table = rcu_dereference(dev->xps_flow_table);
+	if (!flow_table)
+		return queue_index;
+
+	hash = skb_get_hash(skb);
+	if (!hash)
+		return queue_index;
+
+	ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
+
+	if (queue_index != ent.queue_index &&
+	    ent.queue_index >= 0 &&
+	    ent.queue_index < dev->real_num_tx_queues) {
+		txq = netdev_get_tx_queue(dev, ent.queue_index);
+		if ((int)(txq->dql.num_completed_ops - ent.queue_ptr) < 0) {
+			/* The current queue's tail has not advanced beyond the
+			 * last packet that was enqueued using the table entry.
+			 * We can't change queues without risking OOO. Stick
+			 * with the queue listed in the flow table.
+			 */
+			queue_index = ent.queue_index;
+		}
+	}
+
+	/* Save the updated entry */
+	txq = netdev_get_tx_queue(dev, queue_index);
+	ent.queue_index = queue_index;
+	ent.queue_ptr = txq->dql.num_enqueue_ops;
+	flow_table->flows[hash & flow_table->mask].v64 = ent.v64;
 
 	return queue_index;
 #else
-	return -1;
+	return get_xps_queue(dev, skb);
 #endif
 }
 
@@ -3244,22 +3291,24 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
-
-	if (queue_index < 0 || skb->ooo_okay ||
-	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
-
-		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
-
-		if (queue_index != new_index && sk &&
-
[PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk
Use the __skb_set_sw_hash to set the hash in an skbuff from the socket
txhash.

Signed-off-by: Tom Herbert
---
 include/net/sock.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ebf75db..17d379a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1920,10 +1920,8 @@ static inline void sock_poll_wait(struct file *filp,
 
 static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 {
-	if (sk->sk_txhash) {
-		skb->l4_hash = 1;
-		skb->hash = sk->sk_txhash;
-	}
+	if (sk->sk_txhash)
+		__skb_set_sw_hash(skb, sk->sk_txhash, true);
 }
 
 void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
-- 
2.9.3
[PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations
Add two new counters to struct dql that are num_enqueue_ops and
num_completed_ops. num_enqueue_ops is incremented by one in each call to
dql_queued. num_completed_ops is incremented in dql_completed, which takes
an argument indicating the number of operations completed. These counters
are only intended for statistics and do not impact the BQL algorithm.

We add a new sysfs entry in byte_queue_limits named inflight_pkts. This
provides the number of packets in flight for the queue, computed as
dql->num_enqueue_ops - dql->num_completed_ops.

Signed-off-by: Tom Herbert
---
 include/linux/dynamic_queue_limits.h |  7 ++++++-
 include/linux/netdevice.h            |  2 +-
 lib/dynamic_queue_limits.c           |  3 ++-
 net/core/net-sysfs.c                 | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index a4be703..b6a4804 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -43,6 +43,8 @@ struct dql {
 	unsigned int	adj_limit;		/* limit + num_completed */
 	unsigned int	last_obj_cnt;		/* Count at last queuing */
 
+	unsigned int	num_enqueue_ops;	/* Number of queue operations */
+
 	/* Fields accessed only by completion path (dql_completed) */
 
 	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
@@ -55,6 +57,8 @@ struct dql {
 	unsigned int	lowest_slack;		/* Lowest slack found */
 	unsigned long	slack_start_time;	/* Time slacks seen */
 
+	unsigned int	num_completed_ops;	/* Number of complete ops */
+
 	/* Configuration */
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
@@ -83,6 +87,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 	barrier();
 
 	dql->num_queued += count;
+	dql->num_enqueue_ops++;
 }
 
 /* Returns how many objects can be queued, < 0 indicates over limit. */
@@ -92,7 +97,7 @@ static inline int dql_avail(const struct dql *dql)
 }
 
 /* Record number of completed objects and recalculate the limit. */
-void dql_completed(struct dql *dql, unsigned int count);
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops);
 
 /* Reset dql state */
 void dql_reset(struct dql *dql);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..9567107 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3015,7 +3015,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue,
 	if (unlikely(!bytes))
 		return;
 
-	dql_completed(&dev_queue->dql, bytes);
+	dql_completed(&dev_queue->dql, bytes, pkts);
 
 	/*
 	 * Without the memory barrier there is a small possiblity that
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index f346715..d5e7a27 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -14,7 +14,7 @@
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
-void dql_completed(struct dql *dql, unsigned int count)
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops)
 {
 	unsigned int inprogress, prev_inprogress, limit;
 	unsigned int ovlimit, completed, num_queued;
@@ -108,6 +108,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 	dql->prev_ovlimit = ovlimit;
 	dql->prev_last_obj_cnt = dql->last_obj_cnt;
 	dql->num_completed = completed;
+	dql->num_completed_ops += ops;
 	dql->prev_num_queued = num_queued;
 }
 EXPORT_SYMBOL(dql_completed);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..ab7b0b6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1147,6 +1147,19 @@ static ssize_t bql_show_inflight(struct netdev_queue *queue,
 static struct netdev_queue_attribute bql_inflight_attribute =
 	__ATTR(inflight, S_IRUGO, bql_show_inflight, NULL);
 
+static ssize_t bql_show_inflight_pkts(struct netdev_queue *queue,
+				      struct netdev_queue_attribute *attr,
+				      char *buf)
+{
+	struct dql *dql = &queue->dql;
+
+	return sprintf(buf, "%u\n",
+		       dql->num_enqueue_ops - dql->num_completed_ops);
+}
+
+static struct netdev_queue_attribute bql_inflight_pkts_attribute =
+	__ATTR(inflight_pkts, S_IRUGO, bql_show_inflight_pkts, NULL);
+
 #define BQL_ATTR(NAME, FIELD)						\
 static ssize_t bql_show_ ## NAME(struct netdev_queue *queue,		\
 				 struct netdev_queue_attribute *attr,	\
@@ -1176,6 +1189,7 @@ static struct attribute *dql_attrs[] = {
 	&bql_limit_min_attribute.attr,
[PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt
Add infrastructure and definitions to create XPS flow tables. This
creates the new sysfs entry /sys/class/net/eth*/xps_dev_flow_table_cnt

Signed-off-by: Tom Herbert
---
 include/linux/netdevice.h | 24 +++++++++++++++++
 net/core/net-sysfs.c      | 89 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9567107..60063b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -736,6 +736,27 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_XPS_FLOWS
+struct xps_dev_flow {
+	union {
+		u64		v64;
+		struct {
+			int		queue_index;
+			unsigned int	queue_ptr;
+		};
+	};
+};
+
+struct xps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct xps_dev_flow flows[0];
+};
+#define XPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct xps_dev_flow_table) + \
+    ((_num) * sizeof(struct xps_dev_flow)))
+
+#endif /* CONFIG_XPS_FLOWS */
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -1825,6 +1846,9 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps;
 #endif
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table __rcu *xps_flow_table;
+#endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct tcf_proto __rcu *egress_cl_list;
 #endif
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ab7b0b6..0d00b9c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -503,6 +503,92 @@ static ssize_t phys_switch_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_switch_id);
 
+#ifdef CONFIG_XPS_FLOWS
+static void xps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct xps_dev_flow_table *table = container_of(rcu,
+	    struct xps_dev_flow_table, rcu);
+
+	vfree(table);
+}
+
+static int change_xps_dev_flow_table_cnt(struct net_device *dev,
+					 unsigned long count)
+{
+	unsigned long mask;
+	struct xps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(xps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (count) {
+		mask = count - 1;
+		/* mask = roundup_pow_of_two(count) - 1;
+		 * without overflows...
+		 */
+		while ((mask | (mask >> 1)) != mask)
+			mask |= (mask >> 1);
+		/* On 64 bit arches, must check mask fits in table->mask (u32),
+		 * and on 32bit arches, must check
+		 * XPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow.
+		 */
+#if BITS_PER_LONG > 32
+		if (mask > (unsigned long)(u32)mask)
+			return -EINVAL;
+#else
+		if (mask > (ULONG_MAX - XPS_DEV_FLOW_TABLE_SIZE(1))
+		    / sizeof(struct xps_dev_flow)) {
+			/* Enforce a limit to prevent overflow */
+			return -EINVAL;
+		}
+#endif
+		table = vmalloc(XPS_DEV_FLOW_TABLE_SIZE(mask + 1));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = mask;
+		for (count = 0; count <= mask; count++)
+			table->flows[count].queue_index = -1;
+	} else
+		table = NULL;
+
+	spin_lock(&xps_dev_flow_lock);
+	old_table = rcu_dereference_protected(dev->xps_flow_table,
+	    lockdep_is_held(&xps_dev_flow_lock));
+	rcu_assign_pointer(dev->xps_flow_table, table);
+	spin_unlock(&xps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, xps_dev_flow_table_release);
+
+	return 0;
+}
+
+static ssize_t xps_dev_flow_table_cnt_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_xps_dev_flow_table_cnt);
+}
+
+static ssize_t xps_dev_flow_table_cnt_show(struct device *dev,
+					   struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct xps_dev_flow_table *table;
+	unsigned int cnt = 0;
+
+	rcu_read_lock();
+	table = rcu_dereference(netdev->xps_flow_table);
+	if (table)
+		cnt = table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, fmt_dec, cnt);
+}
+DEVICE_ATTR_RW(xps_dev_flow_table_cnt);
+#endif /* CONFIG_XPS_FLOWS */
+
 static struct attribute *net_class_attrs[] = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -531,6 +617,9
[PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket
This patch set introduces transmit flow steering for socketless packets.
The idea is that we record the transmit queues in a flow table that is
indexed by skbuff hash. The flow table entries have two values: the
queue_index and the head cnt of packets from the TX queue. We only allow
a queue to change for a flow if the tail cnt in the TX queue advances
beyond the recorded head cnt. That is the condition that should indicate
that all outstanding packets for the flow have completed transmission so
the queue can change.

Tracking the inflight queue is performed as part of DQL. Two fields are
added to the dql structure: num_enqueue_ops and num_completed_ops.
num_enqueue_ops is incremented in dql_queued and num_completed_ops is
incremented in dql_completed by the number of operations completed (a new
argument to the function).

This patch set creates /sys/class/net/eth*/xps_dev_flow_table_cnt which
gives the number of entries in the XPS flow table.

Note that the functionality here is technically best effort (for instance
we don't obtain a lock while processing a flow table entry). Under high
load it is possible that OOO packets can still be generated due to XPS if
two threads are hammering on the same flow table entry. The assumption of
these patches is that OOO packets are not the end of the world and these
should prevent OOO in most common use cases with XPS.

This is a followup to the previous RFC version. Fixes from RFC are:
- Move counters to DQL
- Fixed typo
- Simplified get flow index function
- Fixed sysfs flow_table_cnt to properly use DEVICE_ATTR_RW
- Renamed the mechanism

V2:
- Added documentation in scaling.txt and sysfs documentation
- Call skb_tx_hash directly from get_xps_queue. This allows the
  socketless transmit flow steering to work properly if a flow is
  bouncing between non-XPS and XPS CPUs. (suggested by Alexander Duyck)
- Added a whole bunch of test results provided by Rick Jones
  (Thanks Rick!)

Tested: Manually forced all packets to go through the xps_flows path.
Observed that some flows were deferred to change queues because packets
were in flight within the flow bucket.

Testing done by Rick Jones:

Here is a quick look at performance tests for the result of trying the
prototype fix for the packet reordering problem with VMs sending over an
XPS-configured NIC. In particular, the Emulex/Avago/Broadcom Skyhawk. The
fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After:  8897 Mbit/s

That was from a VM on a node with a Skyhawk and 2 E5-2640 processors to
baremetal E5-2640 with a BE3. Physical MTU was 1500, the VM's vNIC's MTU
was 1400. Systems were HPE ProLiants in OS Control Mode for power
management, with the "performance" frequency governor loaded. An
OpenStack Mitaka setup with Distributed Virtual Router.

We had some other NIC types in the setup as well. XPS was also enabled on
the ConnectX3-Pro. It was not enabled on the 82599ES (a function of the
kernel being used, which had it disabled from the first reports of XPS
negatively affecting VM traffic at the beginning of the year).

Average Mbit/s From NIC type To Bare Metal BE3:

NIC Type, CPU on VM Host      Before   After
ConnectX-3 Pro, E5-2670v3       9224    9271
BE3, E5-2640                    9016    9022
82599, E5-2640                  9192    9003
BCM57840, E5-2640               9213    9153
Skyhawk, E5-2640                3884    8897

For completeness:

Average Mbit/s To NIC type from Bare Metal BE3:

NIC Type, CPU on VM Host      Before   After
ConnectX-3 Pro, E5-2670v3       9322    9144
BE3, E5-2640                    9074    9017
82599, E5-2640                  8670    8564
BCM57840, E5-2640               2468*   7979
Skyhawk, E5-2640                8897    9269

* This is the Busted bnx2x NIC FW GRO implementation issue. It was not
visible in the "After" because the system was setup to disable the NIC FW
GRO by the time it booted on the fix kernel.
Average Transactions/s Between NIC type and Bare Metal BE3:

NIC Type, CPU on VM Host      Before   After
ConnectX-3 Pro, E5-2670v3      12421   12612
BE3, E5-2640                    8178    8484
82599, E5-2640                  8499    8549
BCM57840, E5-2640               8544    8560
Skyhawk, E5-2640                8537    8701

Tom Herbert (5):
  net: Set SW hash in skb_set_hash_from_sk
  dql: Add counters for number of queuing and completion operations
  net: Add xps_dev_flow_table_cnt
  xps_flows: XPS for packets that don't have a socket
  xps: Documentation for transmit socketless flow steering

 Documentation/ABI/testing/sysfs-class-net |  8 +++
 Documentation/networking/scaling.txt      | 26 ++++++++
 include/linux/dynamic_queue_limits.h
Re: [PATCH net v2] L2TP:Adjust intf MTU,factor underlay L3,overlay L2
Hi David,

Please see inline:

On Wed, 28 Sep 2016, David Miller wrote:

> From: "R. Parameswaran"
> Date: Tue, 27 Sep 2016 12:17:21 -0700 (PDT)
>
> > Later, in vxlan_dev_configure(), called from vxlan_dev_create(), it gets
> > adjusted to account for the headers:
> >
> > vxlan_dev_configure():
> > ...
> >	if (!conf->mtu)
> >		dev->mtu = lowerdev->mtu - (use_ipv6 ?
> >			VXLAN6_HEADROOM : VXLAN_HEADROOM);
> >
> > where VXLAN_HEADROOM is defined as follows:
> >
> > /* IP header + UDP + VXLAN + Ethernet header */
> > #define VXLAN_HEADROOM (20 + 8 + 8 + 14)
> > /* IPv6 header + UDP + VXLAN + Ethernet header */
> > #define VXLAN6_HEADROOM (40 + 8 + 8 + 14)
>
> Right, but I don't see it going through the effort to make use of the
> PMTU like you are.
>
> I have another strong concern related to this. There seems to be no
> mechanism used to propagate any PMTU events into the device's MTU.
>
> Because if there is a limiting nexthop in the route to the other end
> of the UDP tunnel, you won't learn the PMTU until you (or some other
> entity on the machine) actually starts sending traffic to the tunnel's
> endpoint.
>
> If the PMTU events aren't propagated into the tunnel's MTU or similar
> I think this is an ad-hoc solution.
>
> I would suggest that you either:
>
> 1) Do what VXLAN appears to do and ignore the PMTU.
>

I'd like to point out one difference with VXLAN - in VXLAN, the local
physical interface is directly specified at the time of creation of the
tunnel, and the data structure seems to have the ifindex of the local
interface, with which it is able to directly pull up the underlay
interface device. Whereas in L2TP, we only have the IP address of the
remote tunnel end-point, and thus only the socket and the dst from which
we need to derive this.

Also, dst_mtu references dst->ops->mtu, which, if I followed the pointer
chain correctly, will dereference to ipv4_mtu() (for the IPv4 case, as an
example).
The code in ipv4_mtu() looks like the following:

ipv4_mtu():

	unsigned int mtu = rt->rt_pmtu;

	if (!mtu || time_after_eq(jiffies, rt->dst.expires))
		mtu = dst_metric_raw(dst, RTAX_MTU);

	if (mtu)
		return mtu;

	mtu = dst->dev->mtu;

	if (unlikely(dst_metric_locked(dst, RTAX_MTU))) {
		if (rt->rt_uses_gateway && mtu > 576)
			mtu = 576;
	}

	return min_t(unsigned int, mtu, IP_MAX_MTU);

The code above does not depend on PMTU to be working. If no
PMTU-discovered MTU exists, it eventually falls back to the local
underlay device MTU - and this is the mode in which I tested the fix -
PMTU was off in my testbed, but it was picking up the local device MTU
correctly.

Basically, this looks better than the VXLAN handling as far as I can tell
- at least it will pick up the existing discovered PMTU on a best effort
basis, while falling back to the underlay device if all else fails.

I agree that something like 2. below would be needed in the long run (it
will need some effort and redesign - e.g. how do I look up the parent
tunnel from the socket when receiving a PMTU update? The existing pointer
chain runs from tunnel to socket). But since the existing (Ethernet over
L2TP) MTU derivation is incorrect, I am hoping this may be acceptable as
an interim solution.

thanks,

Ramkumar

> 2) Add code to handle PMTU events that land on the UDP tunnel
>    socket.
>
> Thanks.
>
Re: [PATCH] netfilter: bridge: clarify bridge/netfilter message
Stefan Agner wrote:
> When using bridge without bridge netfilter enabled the message
> displayed is rather confusing and leads to believe that a deprecated
> feature is in use. Use IS_MODULE to be explicit that the message only
> affects users which use bridge netfilter as module and reword the
> message.

Acked-by: Florian Westphal
Re: UDP wierdness around skb_copy_and_csum_datagram_msg()
Actually, on a little more searching of this list's archives, I think
that this discussion:

https://patchwork.kernel.org/patch/9260733/

is about exactly the same issue I've found, except from the TCP side. I'm
cc'ing a few of the participants from that discussion.

So is the patch proposed there (copying and restoring the entire iov_iter
in skb_copy_and_csum_datagram_msg()) being considered as a fix? If not,
would an alternate one that concealed the save-and-restore logic inside
iov_iter.c be more acceptable? I'd be happy to produce whatever's needed,
or yield to someone with stronger feelings about where it should go...

On Wed, Sep 28, 2016 at 6:24 PM, Eric Dumazet wrote:
> On Wed, 2016-09-28 at 17:18 -0700, Jay Smith wrote:
>> I've spent the last week or so trying to track down a recurring
>> problem I'm seeing with UDP datagram handling. I'm new to the
>> internals of the Linux network stack, but it appears to me that
>> there's a substantial error in recent kernels' handling of UDP
>> checksum errors.
>>
>> The behavior I'm seeing: I have a UDP server that receives lots of
>> datagrams from many devices on a single port. A small number of those
>> devices occasionally send packets with bad UDP checksums. After I
>> receive one of these bad packets, the next recvmsg() made on the
>> socket returns data from the bad-checksum packet, but the
>> source-address and length of the next (good) packet that arrived at
>> the port.
>>
>> I believe this issue was introduced between linux 3.18 and 3.19, by a
>> set of changes to net/core/datagram.c that made
>> skb_copy_and_csum_datagram_msg() and related functions use the
>> iov_iter structure to copy data to user buffers.
In the case where >> those functions copy a datagram and then conclude that the checksum is >> invalid, they don't remove the already-copied data from the user's >> iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a >> second time, looking at the next datagram on the queue, that second >> datagram's data is appended to the first datagram's. So when >> recvmsg() returns to the user, the return value and msg_name reflect >> the second datagram, but the first bytes in the user's iovec come from >> the first (bad) datagram. >> >> (I've attached a test program that demonstrates the problem. Note >> that it sees correct behavior unless the bad-checksum packet has > 68 >> bytes of UDP data; otherwise, the packet doesn't make it past the >> CHECKSUM_BREAK test, and never enters >> skp_copy_and_csum_datagram_msg().) >> >> The fix for UDP seems pretty simple; the iov_iter's iov_offset member >> just needs to be set back to zero on a checksum failure. But it looks >> like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c, >> where I assume that multiple sk_buffs can be copied-and-csum'd into >> the same iov -- if that's right, it seems like iov_iter needs some >> additional state to support rolling-back the most recent copy without >> losing previous ones. >> >> Any thoughts? > > Nice catch ! > > What about clearing iov_offset from UDP (v4 & v6) only ? > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index > 7d96dc2d3d08fa909f247dfbcbd0fc1eeb59862b..928da2fbb3caa6de4d0e1d889c237590f71607ea > 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -1342,6 +1342,7 @@ csum_copy_err: > /* starting over for a new packet, but check if we need to yield */ > cond_resched(); > msg->msg_flags &= ~MSG_TRUNC; > + msg->msg_iter.iov_offset = 0; > goto try_again; > } > > > (similar for ipv6) > >
Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers
On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote:
> +static void udp_rmem_release(struct sock *sk, int partial)
> +{
> +	struct udp_sock *up = udp_sk(sk);
> +	int fwd, amt;
> +
> +	if (partial && !udp_under_memory_pressure(sk))
> +		return;
> +
> +	/* we can have concurrent release; if we catch any conflict
> +	 * we let only one of them do the work
> +	 */
> +	if (atomic_dec_if_positive(&up->can_reclaim) < 0)
> +		return;
> +
> +	fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc));
> +	if (fwd < SK_MEM_QUANTUM + partial) {
> +		atomic_inc(&up->can_reclaim);
> +		return;
> +	}
> +
> +	amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> +	atomic_sub(amt, &up->mem_allocated);
> +	atomic_inc(&up->can_reclaim);
> +
> +	__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> +	sk->sk_forward_alloc = fwd - amt;
> +}

This is racy... all these atomics make me nervous...
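The "only one of them does the work" guard Paolo builds with atomic_dec_if_positive() can be modeled in userspace C11 atomics. The names below mirror the patch, but this is a standalone sketch, not kernel code:

```c
#include <stdatomic.h>

/* can_reclaim starts at 1: the thread that wins the decrement performs
 * the reclaim and re-increments on the way out; losers bail immediately. */
static atomic_int can_reclaim = 1;
static int reclaim_runs;

/* Userspace stand-in for the kernel's atomic_dec_if_positive(): decrement
 * only if the result stays non-negative, and return the new value (old - 1). */
static int dec_if_positive(atomic_int *v)
{
    int old = atomic_load(v);
    while (old > 0 && !atomic_compare_exchange_weak(v, &old, old - 1))
        ;   /* CAS failure reloads 'old'; retry while still positive */
    return old - 1;
}

static void rmem_release(void)
{
    if (dec_if_positive(&can_reclaim) < 0)
        return;                 /* another thread is already reclaiming */
    reclaim_runs++;             /* the real reclaim accounting goes here */
    atomic_fetch_add(&can_reclaim, 1);
}
```

The guard itself is sound; Eric's unease is presumably about the unlocked reads and the plain write to sk_forward_alloc happening outside any lock, not about the dec-if-positive pattern.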
Re: UDP weirdness around skb_copy_and_csum_datagram_msg()
On Wed, 2016-09-28 at 17:18 -0700, Jay Smith wrote: > I've spent the last week or so trying to track down a recurring > problem I'm seeing with UDP datagram handling. I'm new to the > internals of the Linux network stack, but it appears to me that > there's a substantial error in recent kernels' handling of UDP > checksum errors. > > The behavior I'm seeing: I have a UDP server that receives lots of > datagrams from many devices on a single port. A small number of those > devices occasionally send packets with bad UDP checksums. After I > receive one of these bad packets, the next recvmsg() made on the > socket returns data from the bad-checksum packet, but the > source-address and length of the next (good) packet that arrived at > the port. > > I believe this issue was introduced between linux 3.18 and 3.19, by a > set of changes to net/core/datagram.c that made > skb_copy_and_csum_datagram_msg() and related functions use the > iov_iter structure to copy data to user buffers. In the case where > those functions copy a datagram and then conclude that the checksum is > invalid, they don't remove the already-copied data from the user's > iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a > second time, looking at the next datagram on the queue, that second > datagram's data is appended to the first datagram's. So when > recvmsg() returns to the user, the return value and msg_name reflect > the second datagram, but the first bytes in the user's iovec come from > the first (bad) datagram. > > (I've attached a test program that demonstrates the problem. Note > that it sees correct behavior unless the bad-checksum packet has > 68 > bytes of UDP data; otherwise, the packet doesn't make it past the > CHECKSUM_BREAK test, and never enters > skp_copy_and_csum_datagram_msg().) > > The fix for UDP seems pretty simple; the iov_iter's iov_offset member > just needs to be set back to zero on a checksum failure. 
But it looks > like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c, > where I assume that multiple sk_buffs can be copied-and-csum'd into > the same iov -- if that's right, it seems like iov_iter needs some > additional state to support rolling-back the most recent copy without > losing previous ones. > > Any thoughts? Nice catch ! What about clearing iov_offset from UDP (v4 & v6) only ? diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 7d96dc2d3d08fa909f247dfbcbd0fc1eeb59862b..928da2fbb3caa6de4d0e1d889c237590f71607ea 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1342,6 +1342,7 @@ csum_copy_err: /* starting over for a new packet, but check if we need to yield */ cond_resched(); msg->msg_flags &= ~MSG_TRUNC; + msg->msg_iter.iov_offset = 0; goto try_again; } (similar for ipv6)
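For the TCP side of Jay's question, the rollback amounts to snapshotting the iterator before the copy and restoring it when the checksum turns out bad. A minimal userspace model of that save-and-restore (the struct and names are illustrative, not the kernel's iov_iter API):

```c
#include <stddef.h>

/* Simplified iterator state: offset into the current buffer plus bytes left. */
struct iter_state {
    size_t iov_offset;
    size_t count;
};

/* Copy 'len' bytes "to the user buffer", then validate the checksum.
 * On a bad checksum, restore the saved state so the consumed bytes are
 * given back and the next datagram starts where this one began. */
static int copy_and_csum(struct iter_state *it, size_t len, int csum_ok)
{
    struct iter_state saved = *it;      /* snapshot before touching it */

    it->iov_offset += len;
    it->count -= len;
    if (!csum_ok) {
        *it = saved;                    /* roll back the partial copy */
        return -1;
    }
    return 0;
}
```

Eric's UDP one-liner is the degenerate case of this: since udp_recvmsg() starts each message from offset 0, resetting iov_offset to 0 is a full restore, whereas TCP can have earlier successful copies in the same iov that must survive the rollback.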
Re: linux-next: manual merge of the netfilter-next tree with the net tree
Hi all, On Tue, 13 Sep 2016 10:12:50 +1000 Stephen Rothwellwrote: > > Today's linux-next merge of the netfilter-next tree got a conflict in: > > net/netfilter/nf_tables_netdev.c > > between commit: > > c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr > assignment") > > from the net tree and commit: > > ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, > ipv6}_validate()") > > from the netfilter-next tree. > > I fixed it up (I used the latter version of this file and applied the > patch below) and can carry the fix as necessary. This is now fixed as > far as linux-next is concerned, but any non trivial conflicts should be > mentioned to your upstream maintainer when your tree is submitted for > merging. You may also want to consider cooperating with the maintainer > of the conflicting tree to minimise any particularly complex conflicts. > > From: Stephen Rothwell > Date: Tue, 13 Sep 2016 10:08:58 +1000 > Subject: [PATCH] netfilter: merge fixup for "nf_tables_netdev: remove > redundant ip_hdr assignment" > > Signed-off-by: Stephen Rothwell > --- > include/net/netfilter/nf_tables_ipv4.h | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/include/net/netfilter/nf_tables_ipv4.h > b/include/net/netfilter/nf_tables_ipv4.h > index 968f00b82fb5..25e33aee91e7 100644 > --- a/include/net/netfilter/nf_tables_ipv4.h > +++ b/include/net/netfilter/nf_tables_ipv4.h > @@ -33,7 +33,6 @@ __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt, > if (!iph) > return -1; > > - iph = ip_hdr(skb); > if (iph->ihl < 5 || iph->version != 4) > return -1; > > -- > 2.8.1 The above merge fix patch is now needed when the net-next tree is merged with Linus' tree. -- Cheers, Stephen Rothwell
Re: [PATCH net-next] net: do not export sk_stream_write_space
From: Eric DumazetDate: Wed, 28 Sep 2016 08:41:16 -0700 > From: Eric Dumazet > > Since commit 900f65d361d3 ("tcp: move duplicate code from > tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need > to export sk_stream_write_space() > > From: Eric Dumazet Applied, thanks Eric.
UDP weirdness around skb_copy_and_csum_datagram_msg()
I've spent the last week or so trying to track down a recurring problem I'm seeing with UDP datagram handling. I'm new to the internals of the Linux network stack, but it appears to me that there's a substantial error in recent kernels' handling of UDP checksum errors.

The behavior I'm seeing: I have a UDP server that receives lots of datagrams from many devices on a single port. A small number of those devices occasionally send packets with bad UDP checksums. After I receive one of these bad packets, the next recvmsg() made on the socket returns data from the bad-checksum packet, but the source-address and length of the next (good) packet that arrived at the port.

I believe this issue was introduced between Linux 3.18 and 3.19, by a set of changes to net/core/datagram.c that made skb_copy_and_csum_datagram_msg() and related functions use the iov_iter structure to copy data to user buffers. In the case where those functions copy a datagram and then conclude that the checksum is invalid, they don't remove the already-copied data from the user's iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a second time, looking at the next datagram on the queue, that second datagram's data is appended to the first datagram's. So when recvmsg() returns to the user, the return value and msg_name reflect the second datagram, but the first bytes in the user's iovec come from the first (bad) datagram.

(I've attached a test program that demonstrates the problem. Note that it sees correct behavior unless the bad-checksum packet has > 68 bytes of UDP data; otherwise, the packet doesn't make it past the CHECKSUM_BREAK test, and never enters skb_copy_and_csum_datagram_msg().)

The fix for UDP seems pretty simple; the iov_iter's iov_offset member just needs to be set back to zero on a checksum failure.
But it looks like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c, where I assume that multiple sk_buffs can be copied-and-csum'd into the same iov -- if that's right, it seems like iov_iter needs some additional state to support rolling-back the most recent copy without losing previous ones. Any thoughts?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
	int ret;

	if (argc < 2) {
		fprintf(stderr, "Usage: csumtest <badBytes>\n");
		exit(1);
	}

	struct sockaddr_in addr;
	socklen_t addrLen = sizeof(addr);
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = inet_addr("127.0.0.1");
	addr.sin_port = htons(0);

	int listen = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (listen < 0) {
		fprintf(stderr, "failed to create listening socket (err=%d: %s)\n",
			errno, strerror(errno));
		exit(1);
	}

	ret = bind(listen, (struct sockaddr *) &addr, addrLen);
	if (ret != 0) {
		fprintf(stderr, "failed to bind listening socket (err=%d: %s)\n",
			errno, strerror(errno));
		exit(1);
	}

	ret = getsockname(listen, (struct sockaddr *) &addr, &addrLen);
	if (ret != 0) {
		fprintf(stderr, "getsockname failed for listening socket (err=%d: %s)\n",
			errno, strerror(errno));
		exit(1);
	} else {
		printf("listening on port %d\n", ntohs(addr.sin_port));
	}

	// Send a packet with deliberately bad checksum
	char rawBuf[4096];
	memset(rawBuf, 0, 4096);
	struct udphdr *udph = (struct udphdr *) rawBuf;

	// Send as UDP data the string "BAD DATA", repeated as necessary
	// to fill the number of bytes requested on the command-line
	char *data = rawBuf + sizeof(struct udphdr);
	int badBytes = atoi(argv[1]);
	int i;
	for (i = 0; i < badBytes; i += 8) {
		strncpy(data + i, "BAD DATA", 8);
	}

	// UDP headers contain a bogus checksum, but are otherwise reasonable
	udph->check = htons(0xbadd);
	udph->source = htons(18);
	udph->dest = addr.sin_port;
	udph->len = htons(sizeof(struct udphdr) + badBytes);

	int raw = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
	ret = sendto(raw, rawBuf, sizeof(struct udphdr) + badBytes, 0,
		     (struct sockaddr *) &addr, addrLen);
	if (ret < 0) {
		fprintf(stderr, "raw send failed (err=%d: %s)\n",
			errno, strerror(errno));
		exit(1);
	}

	// Send a second, legitimate UDP packet
	int cooked = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	char *second = "Good data";
	ret = sendto(cooked, second, strlen(second), 0,
		     (struct sockaddr *) &addr, addrLen);
	if (ret < 0) {
		fprintf(stderr, "second cooked send failed (err=%d: %s)\n",
			errno, strerror(errno));
		exit(1);
	}

	// Read what's queued up for us.
	while (1) {
		struct msghdr msg;
		struct iovec iov;
		char readBuf[4096];
		ssize_t bytes;

		memset(readBuf, 0, 4096);
		iov.iov_base = readBuf;
		iov.iov_len = sizeof(readBuf);
		msg.msg_name = NULL;
		msg.msg_namelen = 0;
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = NULL;
		msg.msg_controllen = 0;
		msg.msg_flags = 0;

		bytes = recvmsg(listen, &msg, 0);
Re: mv643xxx_eth driver, failed to linearize skb with tiny unaligned fragment
On Wed, 2016-09-28 at 22:37 +0200, Frans van de Wiel wrote:
> We compiled kernel 4.6.6 for our NAS devices running on ARM Kirkwood CPUs.
> These NAS devices have only 256 MB of RAM, so limited memory resources.
> When we fully load the CPU and are copying files via the interface, we
> frequently get this "error" message in the dmesg log:
>
> [ 1978.149929] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment
> [ 1978.150138] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment
> [ 1978.150318] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment
> [ 1978.150360] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment
>
> ...
>
> CPU[|||98.9%]    Tasks: 29, 0 thr; 2 running
> Mem[30/244MB]    Load average: 1.36 1.29 0.92
> Swp[ 0/511MB]    Uptime: 00:38:11
>
> We analyzed the driver code and think that it does not seem to be an
> error but a warning; we want to verify this, and whether this warning can
> be taken out without risk.
>
> The message originates from this part of the driver code:
>
> 1023 if (has_tiny_unaligned_frags(skb) && __skb_linearize(skb)) {
> 1024         netdev_printk(KERN_DEBUG, dev,
> 1025                       "failed to linearize skb with tiny unaligned fragment\n");
> 1026         return NETDEV_TX_BUSY;
> 1027 }
>
> __skb_linearize is defined in skbuff.h and the comments are useful:
>
> static inline int __skb_linearize(struct sk_buff *skb)
> 1636 {
> 1637         return __pskb_pull_tail(skb, skb->data_len) ? 0 : -ENOMEM;
> 1638 }
> 1639
> 1640 /**
> 1641  * skb_linearize - convert paged skb to linear one
> 1642  * @skb: buffer to linarize
> 1643  *
> 1644  * If there is no free memory -ENOMEM is returned, otherwise zero
> 1645  * is returned and the old skb data released.
> 1646  */
> 1647 static inline int skb_linearize(struct sk_buff *skb)
> 1648 {
>
> So it returns a false state (0) if there is enough free memory, and true
> otherwise.
>
> Please note mv643xx_eth.c also returns a busy state, so I assume in this
> case the driver repeats this step until it succeeds.
>
> So in my opinion, this is not an error message but a warning, it does not
> mean corrupted data, and I think it should be possible to remove it.
> Is this conclusion correct?

Well, linearizing an skb only because one fragment is tiny and not aligned is unfortunate. This driver should copy the ~7 bytes into temp storage instead. It would be faster, and would avoid dropping packets when the (possibly huge) allocation fails.

Strange thing is that txq_submit_tso() can happily present 'tiny frags' after segmentation is done on a 'big frag', so I wonder if the has_tiny_unaligned_frags() is really needed. If this is needed, then a fix in txq_submit_tso() is also needed.
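For reference, the condition the driver trips over is a fragment that is both small and not 8-byte aligned. A standalone sketch of that test, mirroring has_tiny_unaligned_frags() in spirit (the struct and names here are illustrative, and the exact threshold is the driver's):

```c
#include <stdint.h>
#include <stddef.h>

struct frag {
    uintptr_t addr;   /* start address of the fragment's data */
    size_t len;       /* fragment length in bytes */
};

/* A fragment is problematic for this hardware when it is tiny (<= 8 bytes)
 * AND its start is not 8-byte aligned; the driver then linearizes the skb. */
static int frag_is_tiny_and_unaligned(const struct frag *f)
{
    return f->len <= 8 && (f->addr & 7) != 0;
}
```

Eric's suggestion is to special-case exactly these fragments by copying their few bytes into aligned temporary storage, instead of linearizing (and possibly failing to allocate) the whole skb.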
How to submit potential patch in linux kernel
Hi everyone, I'm Shyam, a final-year undergraduate student. I wanted to know how one can submit a potential Linux kernel patch to the networking subsystem. Thanks, Shyam
[PATCH net-next] nfp: bpf: zero extend 4 byte context loads
Set upper 32 bits of destination register to zeros after load from the context structure.

Signed-off-by: Jakub Kicinski
---
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
index 3de819acd68c..f8df5300f49c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
@@ -1134,6 +1134,8 @@ static int mem_ldx4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 	else
 		return -ENOTSUPP;
 
+	wrp_immed(nfp_prog, reg_both(meta->insn.dst_reg * 2 + 1), 0);
+
 	return 0;
 }
-- 
1.9.1
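The semantics being enforced: an eBPF 4-byte load must leave the upper 32 bits of the 64-bit destination register zeroed, which the extra wrp_immed() achieves by writing 0 to the register's high half. A userspace model of the before/after behavior (illustrative, not the NFP JIT):

```c
#include <stdint.h>

/* Without the fix: a 32-bit load only overwrites the low half, leaving
 * whatever was previously in the upper 32 bits of the destination. */
static uint64_t ldx4_buggy(uint64_t dst, uint32_t val)
{
    return (dst & ~(uint64_t)0xffffffff) | val;
}

/* With the fix: the JIT additionally emits a write of 0 to the high half,
 * so the register ends up properly zero-extended. */
static uint64_t ldx4_fixed(uint64_t dst, uint32_t val)
{
    dst = (dst & ~(uint64_t)0xffffffff) | val;  /* the 4-byte load */
    dst &= 0xffffffff;                          /* wrp_immed(high half, 0) */
    return dst;
}
```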
Re: [PATCH] Add netdev all_adj_list refcnt propagation to fix panic
On 09/28/2016 12:06 PM, David Ahern wrote: Andrew Collins posted this patch as RFC in March: http://patchwork.ozlabs.org/patch/603101/ It has apparently fallen through the cracks and never applied. It solves a refcnt problem (thanks Nik for pointing out this patch) I have been running the patch in production for the past few months without issue, in spite of my concerns about the potential side effects of the change. That said, I suspect my use cases are a subset of yours. If nobody has a better fix yet, I'm happy to respin this patch as non-RFC with proper signoff. Regards, Andrew Collins
[PATCH v2 net-next 2/2] net: deprecate eth_change_mtu, remove usage
With centralized MTU checking, there's nothing productive done by eth_change_mtu that isn't already done in dev_set_mtu, so mark it as deprecated and remove all usage of it in the kernel. All callers have been audited for calls to alloc_etherdev* or ether_setup directly, which means they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu prints out a netdev_warn about being deprecated, for the benefit of out-of-tree drivers that might be utilizing it. Of note, dvb_net.c actually had dev->mtu = 4096, while using eth_change_mtu, meaning that if you ever tried changing its mtu, you couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set to 4096 to remedy that. v2: fix up lantiq_etop, missed breakage due to the driver not compiling on x86 CC: "David S. Miller" CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson --- arch/m68k/emu/nfeth.c | 1 - drivers/isdn/hysdn/hysdn_net.c| 1 - drivers/media/dvb-core/dvb_net.c | 2 +- drivers/net/appletalk/ipddp.c | 1 - drivers/net/cris/eth_v10.c| 1 - drivers/net/ethernet/3com/3c509.c | 1 - drivers/net/ethernet/3com/3c515.c | 1 - drivers/net/ethernet/3com/3c574_cs.c | 1 - drivers/net/ethernet/3com/3c589_cs.c | 1 - drivers/net/ethernet/3com/3c59x.c | 2 -- drivers/net/ethernet/3com/typhoon.c | 1 - drivers/net/ethernet/8390/8390.c | 1 - drivers/net/ethernet/8390/8390p.c | 1 - drivers/net/ethernet/8390/ax88796.c | 1 - drivers/net/ethernet/8390/axnet_cs.c | 1 - drivers/net/ethernet/8390/etherh.c| 1 - drivers/net/ethernet/8390/hydra.c | 1 - drivers/net/ethernet/8390/mac8390.c | 1 - drivers/net/ethernet/8390/mcf8390.c | 1 - drivers/net/ethernet/8390/ne2k-pci.c | 1 - drivers/net/ethernet/8390/pcnet_cs.c | 1 - drivers/net/ethernet/8390/smc-ultra.c | 1 - drivers/net/ethernet/8390/wd.c| 1 - drivers/net/ethernet/8390/zorro8390.c | 1 - drivers/net/ethernet/adaptec/starfire.c | 1 - drivers/net/ethernet/adi/bfin_mac.c | 1 - drivers/net/ethernet/allwinner/sun4i-emac.c | 1 - drivers/net/ethernet/amd/a2065.c | 1 -
drivers/net/ethernet/amd/am79c961a.c | 1 - drivers/net/ethernet/amd/ariadne.c| 1 - drivers/net/ethernet/amd/atarilance.c | 1 - drivers/net/ethernet/amd/au1000_eth.c | 1 - drivers/net/ethernet/amd/declance.c | 1 - drivers/net/ethernet/amd/hplance.c| 1 - drivers/net/ethernet/amd/lance.c | 1 - drivers/net/ethernet/amd/mvme147.c| 1 - drivers/net/ethernet/amd/ni65.c | 1 - drivers/net/ethernet/amd/nmclan_cs.c | 1 - drivers/net/ethernet/amd/pcnet32.c| 1 - drivers/net/ethernet/amd/sun3lance.c | 1 - drivers/net/ethernet/amd/sunlance.c | 1 - drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 1 - drivers/net/ethernet/apple/bmac.c | 1 - drivers/net/ethernet/apple/mace.c | 1 - drivers/net/ethernet/apple/macmace.c | 1 - drivers/net/ethernet/aurora/nb8800.c | 1 - drivers/net/ethernet/cadence/macb.c | 1 - drivers/net/ethernet/cirrus/cs89x0.c | 1 - drivers/net/ethernet/cirrus/ep93xx_eth.c | 1 - drivers/net/ethernet/cirrus/mac89x0.c | 1 - drivers/net/ethernet/davicom/dm9000.c | 1 - drivers/net/ethernet/dec/tulip/de2104x.c | 1 - drivers/net/ethernet/dec/tulip/de4x5.c| 1 - drivers/net/ethernet/dec/tulip/dmfe.c | 1 - drivers/net/ethernet/dec/tulip/tulip_core.c | 1 - drivers/net/ethernet/dec/tulip/uli526x.c | 1 - drivers/net/ethernet/dec/tulip/winbond-840.c | 1 - drivers/net/ethernet/dec/tulip/xircom_cb.c| 1 - drivers/net/ethernet/dnet.c | 1 - drivers/net/ethernet/ec_bhf.c | 1 - drivers/net/ethernet/fealnx.c | 1 - drivers/net/ethernet/freescale/fec_main.c | 1 - drivers/net/ethernet/freescale/fec_mpc52xx.c | 1 - drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c | 1 - drivers/net/ethernet/freescale/ucc_geth.c | 1 - drivers/net/ethernet/fujitsu/fmvj18x_cs.c | 1 -
[PATCH v2 net-next 0/2] net: centralize net_device MTU bounds checking
While looking into an MTU issue with sfc, I started noticing that almost every NIC driver with an ndo_change_mtu function implemented almost exactly the same range checks, and in many cases, that was the only practical thing their ndo_change_mtu function was doing. Quite a few drivers have either 68, 64, 60 or 46 as their minimum MTU value checked, and then various sizes from 1500 to 65535 for their maximum MTU value. We can remove a whole lot of redundant code here if we simply store min_mtu and max_mtu in net_device, and check against those in net/core/dev.c's dev_set_mtu().

This pair of patches looks to introduce centralized MTU range checking infrastructure, while maintaining compatibility with all existing drivers, and start to make use of it, converting all eth_change_mtu/ether_setup users over to this new infra. Assuming these pass review muster, I've got a ton of follow-on patches to clean up MTU settings for everything in the kernel with an ndo_change_mtu.

This work is all staged in a (rebasing) git tree here: https://github.com/jarodwilson/linux-muck The master branch is based on net-next from today, and carries these two patches, plus a ton of follow-on patches to eliminate MTU range checks and change_mtu functions where possible. All patches were successfully built across 160 various arch and config combos by the 0-day folks. (Thanks to Andrew Lunn for the suggestion to get that going). Resending, because I forgot to add CC list to the actual patches.
Jarod Wilson (2): net: centralize net_device min/max MTU checking net: deprecate eth_change_mtu, remove usage arch/m68k/emu/nfeth.c | 1 - drivers/isdn/hysdn/hysdn_net.c| 1 - drivers/media/dvb-core/dvb_net.c | 2 +- drivers/net/appletalk/ipddp.c | 1 - drivers/net/cris/eth_v10.c| 1 - drivers/net/ethernet/3com/3c509.c | 1 - drivers/net/ethernet/3com/3c515.c | 1 - drivers/net/ethernet/3com/3c574_cs.c | 1 - drivers/net/ethernet/3com/3c589_cs.c | 1 - drivers/net/ethernet/3com/3c59x.c | 2 -- drivers/net/ethernet/3com/typhoon.c | 1 - drivers/net/ethernet/8390/8390.c | 1 - drivers/net/ethernet/8390/8390p.c | 1 - drivers/net/ethernet/8390/ax88796.c | 1 - drivers/net/ethernet/8390/axnet_cs.c | 1 - drivers/net/ethernet/8390/etherh.c| 1 - drivers/net/ethernet/8390/hydra.c | 1 - drivers/net/ethernet/8390/mac8390.c | 1 - drivers/net/ethernet/8390/mcf8390.c | 1 - drivers/net/ethernet/8390/ne2k-pci.c | 1 - drivers/net/ethernet/8390/pcnet_cs.c | 1 - drivers/net/ethernet/8390/smc-ultra.c | 1 - drivers/net/ethernet/8390/wd.c| 1 - drivers/net/ethernet/8390/zorro8390.c | 1 - drivers/net/ethernet/adaptec/starfire.c | 1 - drivers/net/ethernet/adi/bfin_mac.c | 1 - drivers/net/ethernet/allwinner/sun4i-emac.c | 1 - drivers/net/ethernet/amd/a2065.c | 1 - drivers/net/ethernet/amd/am79c961a.c | 1 - drivers/net/ethernet/amd/ariadne.c| 1 - drivers/net/ethernet/amd/atarilance.c | 1 - drivers/net/ethernet/amd/au1000_eth.c | 1 - drivers/net/ethernet/amd/declance.c | 1 - drivers/net/ethernet/amd/hplance.c| 1 - drivers/net/ethernet/amd/lance.c | 1 - drivers/net/ethernet/amd/mvme147.c| 1 - drivers/net/ethernet/amd/ni65.c | 1 - drivers/net/ethernet/amd/nmclan_cs.c | 1 - drivers/net/ethernet/amd/pcnet32.c| 1 - drivers/net/ethernet/amd/sun3lance.c | 1 - drivers/net/ethernet/amd/sunlance.c | 1 - drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 1 - drivers/net/ethernet/apple/bmac.c | 1 - drivers/net/ethernet/apple/mace.c | 1 - drivers/net/ethernet/apple/macmace.c | 1 - drivers/net/ethernet/aurora/nb8800.c 
| 1 - drivers/net/ethernet/cadence/macb.c | 1 - drivers/net/ethernet/cirrus/cs89x0.c | 1 - drivers/net/ethernet/cirrus/ep93xx_eth.c | 1 - drivers/net/ethernet/cirrus/mac89x0.c | 1 - drivers/net/ethernet/davicom/dm9000.c | 1 - drivers/net/ethernet/dec/tulip/de2104x.c | 1 - drivers/net/ethernet/dec/tulip/de4x5.c| 1 - drivers/net/ethernet/dec/tulip/dmfe.c | 1 -
[PATCH v2 net-next 1/2] net: centralize net_device min/max MTU checking
While looking into an MTU issue with sfc, I started noticing that almost every NIC driver with an ndo_change_mtu function implemented almost exactly the same range checks, and in many cases, that was the only practical thing their ndo_change_mtu function was doing. Quite a few drivers have either 68, 64, 60 or 46 as their minimum MTU value checked, and then various sizes from 1500 to 65535 for their maximum MTU value. We can remove a whole lot of redundant code here if we simply store min_mtu and max_mtu in net_device, and check against those in net/core/dev.c's dev_set_mtu().

In theory, there should be zero functional change with this patch, it just puts the infrastructure in place. Subsequent patches will attempt to start using said infrastructure, with theoretically zero change in functionality.

CC: "David S. Miller"
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson
---
 include/linux/netdevice.h | 4 ++++
 net/core/dev.c            | 12 ++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..fbdf923 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1506,6 +1506,8 @@ enum netdev_priv_flags {
  *	@if_port:	Selectable AUI, TP, ...
  *	@dma:		DMA channel
  *	@mtu:		Interface MTU value
+ *	@min_mtu:	Interface Minimum MTU value
+ *	@max_mtu:	Interface Maximum MTU value
  *	@type:		Interface hardware type
  *	@hard_header_len: Maximum hardware header length.
@@ -1726,6 +1728,8 @@ struct net_device {
 	unsigned char		dma;
 
 	unsigned int		mtu;
+	unsigned int		min_mtu;
+	unsigned int		max_mtu;
 	unsigned short		type;
 	unsigned short		hard_header_len;

diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..5343799 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6493,9 +6493,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 	if (new_mtu == dev->mtu)
 		return 0;
 
-	/* MTU must be positive. */
-	if (new_mtu < 0)
+	if (new_mtu < dev->min_mtu) {
+		net_err_ratelimited("%s: Invalid MTU %d requested, hw min %d\n",
+				    dev->name, new_mtu, dev->min_mtu);
 		return -EINVAL;
+	}
+
+	if (dev->max_mtu > 0 && new_mtu > dev->max_mtu) {
+		net_err_ratelimited("%s: Invalid MTU %d requested, hw max %d\n",
+				    dev->name, new_mtu, dev->max_mtu);
+		return -EINVAL;
+	}
 
 	if (!netif_device_present(dev))
 		return -ENODEV;
-- 
2.10.0
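Stripped of the kernel plumbing, the centralized check reduces to a pair of range comparisons, with max_mtu == 0 meaning "no upper bound configured". A userspace sketch (the field names follow the patch; everything else is illustrative):

```c
#include <errno.h>

struct net_dev {
    unsigned int mtu;
    unsigned int min_mtu;   /* 0 if the driver sets no floor */
    unsigned int max_mtu;   /* 0 means "no ceiling configured" */
};

static int dev_set_mtu_model(struct net_dev *dev, int new_mtu)
{
    if (new_mtu == (int)dev->mtu)
        return 0;
    if (new_mtu < (int)dev->min_mtu)
        return -EINVAL;                /* below the hardware minimum */
    if (dev->max_mtu > 0 && new_mtu > (int)dev->max_mtu)
        return -EINVAL;                /* above the hardware maximum */
    dev->mtu = new_mtu;
    return 0;
}
```

Treating max_mtu == 0 as "unbounded" is what keeps unconverted drivers working: a driver that never fills in the new fields still accepts everything the old "MTU must be positive" check did.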
[PATCH] netfilter: bridge: clarify bridge/netfilter message
When using bridge without bridge netfilter enabled the message displayed is rather confusing and leads to believe that a deprecated feature is in use. Use IS_MODULE to be explicit that the message only affects users which use bridge netfilter as module and reword the message.

Signed-off-by: Stefan Agner
---
 net/bridge/br.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br.c b/net/bridge/br.c
index 3addc05..889e564 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -227,9 +227,11 @@ static int __init br_init(void)
 	br_fdb_test_addr_hook = br_fdb_test_addr;
 #endif
 
-	pr_info("bridge: automatic filtering via arp/ip/ip6tables has been "
-		"deprecated. Update your scripts to load br_netfilter if you "
+#if IS_MODULE(CONFIG_BRIDGE_NETFILTER)
+	pr_info("bridge: filtering via arp/ip/ip6tables is no longer available "
+		"by default. Update your scripts to load br_netfilter if you "
 		"need this.\n");
+#endif
 
 	return 0;
-- 
2.10.0
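The IS_MODULE() test works because kbuild defines CONFIG_FOO for =y builds and CONFIG_FOO_MODULE for =m builds. The preprocessor trick behind it, reproduced standalone (this mirrors the machinery in <linux/kconfig.h>; the CONFIG_BRIDGE_NETFILTER_MODULE define below simulates a =m build):

```c
/* Simulate CONFIG_BRIDGE_NETFILTER=m, as kbuild would via autoconf.h. */
#define CONFIG_BRIDGE_NETFILTER_MODULE 1

/* __is_defined(x) expands to 1 when x is defined to 1, and to 0 otherwise:
 * a defined option pastes into __ARG_PLACEHOLDER_1, which injects an extra
 * argument so __take_second_arg() picks the 1; otherwise it picks the 0. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x)        ___is_defined(x)
#define ___is_defined(val)     ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

#define IS_BUILTIN(option) __is_defined(option)
#define IS_MODULE(option)  __is_defined(option##_MODULE)
#define IS_ENABLED(option) (IS_BUILTIN(option) || IS_MODULE(option))
```

So the #if IS_MODULE(...) in the patch compiles the pr_info() in only when br_netfilter is modular, which is exactly the case where a user's script might still need updating.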
Re: [PATCH RFC 5/6] net: phy: Trigger state machine on state change and not polling.
On 09/28/2016 01:32 AM, Andrew Lunn wrote:
> The phy_start() is used to indicate the PHY is now ready to do its
> work. The state is changed, normally to PHY_UP which means that both
> the MAC and the PHY are ready.
>
> If the phy driver is using polling, when the next poll happens, the
> state machine notices the PHY is now in PHY_UP, and kicks off
> auto-negotiation, if needed.
>
> If however, the PHY is using interrupts, there is no polling. The phy
> is stuck in PHY_UP until the next interrupt comes along. And there is
> no reason for the PHY to interrupt.
>
> Have phy_start() schedule the state machine to run, which both speeds
> up the polling use case, and makes the interrupt use case actually
> work.
>
> This problem exists whenever there is a state change which will not
> cause an interrupt. Trigger the state machine in these cases,
> e.g. phy_error().
>
> Signed-off-by: Andrew Lunn

No particular objections, this should also fix this:
http://lists.openwall.net/netdev/2016/05/17/147

> ---
> drivers/net/phy/phy.c | 22 ++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 940450c6cd2c..f446ce04caf3 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -608,6 +608,21 @@ void phy_start_machine(struct phy_device *phydev)
>  }
>
>  /**
> + * phy_trigger_machine - trigger the state machine to run
> + *
> + * @phydev: the phy_device struct
> + *
> + * Description: There has been a change in state which requires that the
> + * state machine runs.
> + */
> +
> +static void phy_trigger_machine(struct phy_device *phydev)
> +{
> +	cancel_delayed_work_sync(&phydev->state_queue);
> +	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
> +}
> +
> +/**
>  * phy_stop_machine - stop the PHY state machine tracking
>  * @phydev: target phy_device struct
>  *
> @@ -639,6 +654,8 @@ static void phy_error(struct phy_device *phydev)
>  	mutex_lock(&phydev->lock);
>  	phydev->state = PHY_HALTED;
>  	mutex_unlock(&phydev->lock);
> +
> +	phy_trigger_machine(phydev);
>  }
>
>  /**
> @@ -785,8 +802,7 @@ void phy_change(struct phy_device *phydev)
>  	}
>
>  	/* reschedule state queue work to run as soon as possible */
> -	cancel_delayed_work_sync(&phydev->state_queue);
> -	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
> +	phy_trigger_machine(phydev);
>  	return;
>
>  ignore:
> @@ -875,6 +891,8 @@ void phy_start(struct phy_device *phydev)
>  	/* if phy was suspended, bring the physical link up again */
>  	if (do_resume)
>  		phy_resume(phydev);
> +
> +	phy_trigger_machine(phydev);
>  }
>  EXPORT_SYMBOL(phy_start);
>

-- Florian
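The failure mode Andrew describes shows up even in a toy model of the state machine: with polling, a periodic run notices PHY_UP, but with interrupts only, nothing runs it, so phy_start() must queue one run itself. The names below mirror the patch, but this is a simplified model, not the driver:

```c
enum phy_state { PHY_DOWN, PHY_UP, PHY_AN };

struct phy {
    enum phy_state state;
    int machine_runs;
};

/* One pass of the state machine: in PHY_UP it kicks off auto-negotiation. */
static void phy_state_machine(struct phy *p)
{
    p->machine_runs++;
    if (p->state == PHY_UP)
        p->state = PHY_AN;
}

/* 'kick' models the added phy_trigger_machine() call: without it, an
 * interrupt-driven PHY sits in PHY_UP until some unrelated IRQ happens
 * to run the machine, which may never occur. */
static void phy_start(struct phy *p, int kick)
{
    p->state = PHY_UP;
    if (kick)
        phy_state_machine(p);
}
```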
[PATCH v2 net-next 0/2] net: centralize net_device MTU bounds checking
While looking into an MTU issue with sfc, I started noticing that almost every NIC driver with an ndo_change_mtu function implemented almost exactly the same range checks, and in many cases, that was the only practical thing their ndo_change_mtu function was doing. Quite a few drivers have either 68, 64, 60 or 46 as their minimum MTU value checked, and then various sizes from 1500 to 65535 for their maximum MTU value. We can remove a whole lot of redundant code here if we simply store min_mtu and max_mtu in net_device, and check against those in net/core/dev.c's dev_set_mtu().

This pair of patches looks to introduce centralized MTU range checking infrastructure, while maintaining compatibility with all existing drivers, and start to make use of it, converting all eth_change_mtu/ether_setup users over to this new infra. Assuming these pass review muster, I've got a ton of follow-on patches to clean up MTU settings for everything in the kernel with an ndo_change_mtu.

This work is all staged in a (rebasing) git tree here: https://github.com/jarodwilson/linux-muck The master branch is based on net-next from today, and carries these two patches, plus a ton of follow-on patches to eliminate MTU range checks and change_mtu functions where possible. All patches were successfully built across 160 various arch and config combos by the 0-day folks.
Jarod Wilson (2): net: centralize net_device min/max MTU checking net: deprecate eth_change_mtu, remove usage arch/m68k/emu/nfeth.c | 1 - drivers/isdn/hysdn/hysdn_net.c| 1 - drivers/media/dvb-core/dvb_net.c | 2 +- drivers/net/appletalk/ipddp.c | 1 - drivers/net/cris/eth_v10.c| 1 - drivers/net/ethernet/3com/3c509.c | 1 - drivers/net/ethernet/3com/3c515.c | 1 - drivers/net/ethernet/3com/3c574_cs.c | 1 - drivers/net/ethernet/3com/3c589_cs.c | 1 - drivers/net/ethernet/3com/3c59x.c | 2 -- drivers/net/ethernet/3com/typhoon.c | 1 - drivers/net/ethernet/8390/8390.c | 1 - drivers/net/ethernet/8390/8390p.c | 1 - drivers/net/ethernet/8390/ax88796.c | 1 - drivers/net/ethernet/8390/axnet_cs.c | 1 - drivers/net/ethernet/8390/etherh.c| 1 - drivers/net/ethernet/8390/hydra.c | 1 - drivers/net/ethernet/8390/mac8390.c | 1 - drivers/net/ethernet/8390/mcf8390.c | 1 - drivers/net/ethernet/8390/ne2k-pci.c | 1 - drivers/net/ethernet/8390/pcnet_cs.c | 1 - drivers/net/ethernet/8390/smc-ultra.c | 1 - drivers/net/ethernet/8390/wd.c| 1 - drivers/net/ethernet/8390/zorro8390.c | 1 - drivers/net/ethernet/adaptec/starfire.c | 1 - drivers/net/ethernet/adi/bfin_mac.c | 1 - drivers/net/ethernet/allwinner/sun4i-emac.c | 1 - drivers/net/ethernet/amd/a2065.c | 1 - drivers/net/ethernet/amd/am79c961a.c | 1 - drivers/net/ethernet/amd/ariadne.c| 1 - drivers/net/ethernet/amd/atarilance.c | 1 - drivers/net/ethernet/amd/au1000_eth.c | 1 - drivers/net/ethernet/amd/declance.c | 1 - drivers/net/ethernet/amd/hplance.c| 1 - drivers/net/ethernet/amd/lance.c | 1 - drivers/net/ethernet/amd/mvme147.c| 1 - drivers/net/ethernet/amd/ni65.c | 1 - drivers/net/ethernet/amd/nmclan_cs.c | 1 - drivers/net/ethernet/amd/pcnet32.c| 1 - drivers/net/ethernet/amd/sun3lance.c | 1 - drivers/net/ethernet/amd/sunlance.c | 1 - drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 1 - drivers/net/ethernet/apple/bmac.c | 1 - drivers/net/ethernet/apple/mace.c | 1 - drivers/net/ethernet/apple/macmace.c | 1 - drivers/net/ethernet/aurora/nb8800.c 
| 1 - drivers/net/ethernet/cadence/macb.c | 1 - drivers/net/ethernet/cirrus/cs89x0.c | 1 - drivers/net/ethernet/cirrus/ep93xx_eth.c | 1 - drivers/net/ethernet/cirrus/mac89x0.c | 1 - drivers/net/ethernet/davicom/dm9000.c | 1 - drivers/net/ethernet/dec/tulip/de2104x.c | 1 - drivers/net/ethernet/dec/tulip/de4x5.c| 1 - drivers/net/ethernet/dec/tulip/dmfe.c | 1 - drivers/net/ethernet/dec/tulip/tulip_core.c | 1 - drivers/net/ethernet/dec/tulip/uli526x.c | 1 -
Re: [PATCH RFC 2/6] net: phy: Use threaded IRQ, to allow IRQ from sleeping devices
On Wed, Sep 28, 2016 at 02:38:03PM +0300, Sergei Shtylyov wrote: > Hello. > > On 9/28/2016 11:32 AM, Andrew Lunn wrote: > > >The interrupt lines from PHYs may be connected to I2C bus expanders, or > >from switches on MDIO busses. Such interrupts are sourced from devices > >which sleep, so use threaded interrupts. Threaded interrupts require > >that the interrupt requester also uses the threaded API. Change the > >phylib to use the threaded API, which is backwards compatible with > >non-threaded IRQs. > > > >Signed-off-by: Andrew Lunn > >--- > > drivers/net/phy/phy.c | 7 +++ > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > >diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c > >index c6f66832a1a6..5c29ed72f721 100644 > >--- a/drivers/net/phy/phy.c > >+++ b/drivers/net/phy/phy.c > >@@ -722,10 +722,9 @@ phy_err: > > int phy_start_interrupts(struct phy_device *phydev) > > { > > atomic_set(&phydev->irq_disable, 0); > >-if (request_irq(phydev->irq, phy_interrupt, > >-IRQF_SHARED, > >-"phy_interrupt", > >-phydev) < 0) { > >+if (request_threaded_irq(phydev->irq, NULL, phy_interrupt, > >+ IRQF_ONESHOT, "phy_interrupt", > >What about IRQF_SHARED? IRQF_ONESHOT | IRQF_SHARED does work. However, it requires that all users of the interrupt use the same flags. The commit message suggests IRQF_SHARED is there because of boards which have multiple PHYs using one IRQ. That case will work, since all PHYs will be using this code. What will be an issue is if the sharing is between a PHY and something else. That something else will also need to use IRQF_ONESHOT. Anybody know if this situation actually exists? Thanks Andrew
mv643xx_eth driver, failed to linearize skb with tiny unaligned fragment
We compiled kernel 4.6.6 for our NAS devices running on ARM Kirkwood CPUs. These NAS devices have only 256 MB of RAM, so memory resources are limited. When we fully load the CPU while copying files via the interface, we frequently get this "error" message in the dmesg log: [ 1978.149929] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment [ 1978.150138] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment [ 1978.150318] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment [ 1978.150360] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize skb with tiny unaligned fragment ... CPU[|||98.9%] Tasks: 29, 0 thr; 2 running Mem[30/244MB] Load average: 1.36 1.29 0.92 Swp[ 0/511MB] Uptime: 00:38:11 We analyzed the driver code and think this does not seem to be an error but a warning; we would like to verify that, and whether the warning can be taken out without risk. The message originates from this part of the driver code: 1023 if (has_tiny_unaligned_frags(skb) && __skb_linearize(skb)) { 1024 netdev_printk(KERN_DEBUG, dev, 1025 "failed to linearize skb with tiny unaligned fragment\n"); 1026 return NETDEV_TX_BUSY; 1027 } __skb_linearize is defined in skbuff.h, and the comments there are useful: static inline int __skb_linearize(struct sk_buff *skb) 1636 { 1637 return __pskb_pull_tail(skb, skb->data_len) ? 0 : -ENOMEM; 1638 } 1639 1640 /** 1641 * skb_linearize - convert paged skb to linear one 1642 * @skb: buffer to linarize 1643 * 1644 * If there is no free memory -ENOMEM is returned, otherwise zero 1645 * is returned and the old skb data released. 1646 */ 1647 static inline int skb_linearize(struct sk_buff *skb) 1648 { So it returns 0 if there is enough free memory, and nonzero otherwise. Please note that mv643xx_eth.c also returns a busy state.
So I assume in this case the driver repeats this step until it succeeds. In my opinion, this is not an error message but a warning: it does not mean corrupted data, and I think it should be possible to remove it. Is this conclusion correct?
Re: [PATCH net-next 2/2] net: phy: Add PHY Auto/Mdi/Mdix set driver for Microsemi PHYs.
> + reg_val = phy_read(phydev, MSCC_PHY_BYPASS_CONTROL); > + if ((mdix == ETH_TP_MDI) || (mdix == ETH_TP_MDI_X)) { > + reg_val |= (DISABLE_PAIR_SWAP_CORR_MASK | > + DISABLE_POLARITY_CORR_MASK | > + DISABLE_HP_AUTO_MDIX_MASK); > + } else { > + reg_val &= ~(DISABLE_PAIR_SWAP_CORR_MASK | > + DISABLE_POLARITY_CORR_MASK | > + DISABLE_HP_AUTO_MDIX_MASK); > + } > + rc = phy_write(phydev, MSCC_PHY_BYPASS_CONTROL, reg_val); > + if (rc != 0) > + goto out_unlock; > + > + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED); > + if (rc != 0) > + goto out_unlock; > + > + reg_val = phy_read(phydev, MSCC_PHY_EXT_MODE_CNTL); > + reg_val &= ~(FORCE_MDI_CROSSOVER_MASK); > + if (mdix == ETH_TP_MDI) > + reg_val |= FORCE_MDI_CROSSOVER_MDI; > + else if (mdix == ETH_TP_MDI_X) > + reg_val |= FORCE_MDI_CROSSOVER_MDIX; > + rc = phy_write(phydev, MSCC_PHY_EXT_MODE_CNTL, reg_val); > + if (rc != 0) > + goto out_unlock; > + > + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD); > + > +out_unlock: out_unlock seems a bit of an odd name, since you are not unlocking anything. I also wonder if you should try to reset to MSCC_PHY_PAGE_STANDARD in the error condition? > + > + return rc; > +} > + > static int vsc85xx_wol_set(struct phy_device *phydev, > struct ethtool_wolinfo *wol) > { > @@ -227,6 +281,7 @@ static int vsc85xx_default_config(struct phy_device > *phydev) > int rc; > u16 reg_val; > > + phydev->mdix = ETH_TP_MDI_AUTO; Humm, interesting. The only other driver supporting mdix is the Marvell one. It does not do this, it leaves it to its default value of ETH_TP_MDI_INVALID. It does however interpret ETH_TP_MDI_INVALID as meaning ETH_TP_MDI_AUTO. There needs to be consistency here. You either need to do the same as the Marvell driver, or you need to modify the Marvell driver to also set phydev->mdix to ETH_TP_MDI_AUTO. I don't yet know which of these two is the right thing to do. Florian? Andrew
Re: [PATCH] fs/select: add vmalloc fallback for select(2)
On 09/28/2016 06:30 PM, David Laight wrote: > From: Vlastimil Babka >> Sent: 27 September 2016 12:51 > ... >> Process name suggests it's part of db2 database. It seems it has to implement >> its own interface to select() syscall, because glibc itself seems to have a >> FD_SETSIZE limit of 1024, which is probably why this wasn't an issue for all >> the >> years... > > ISTR the canonical way to increase the size being to set FD_SETSIZE > to a larger value before including any of the headers. > > Or doesn't that work with linux and glibc ?? Doesn't seem so. > > David >
Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()
On (09/28/16 11:08), Alexander Duyck wrote: > > - then udp_gro_receive -> vxlan_gro_receive pulls up vxlan header > > into linear part, and then.. > > This is the point where we need to stop, drop the existing headers, > call skb_reserve(NET_IP_ALIGN), and then pick back up where we left > off. We just have to make sure the skbuff isn't shared. : > In general the interface behind the tunnel shouldn't need to know > about any of the data in front of the tunnel, so you should be able to > drop the original header offsets and reset things if you want so you > could overwrite the old headers. Except that some of these encapsulations may be doing things like [inner-mac, outer-ip] fdb learning, so we may not be able to lose the outer headers at this early point in all cases. > If I am not mistaken I think the multi-buffer approach is the approach > taken by other OSes, although for us it is more difficult since we yes, I think that's approximately what gets done with mblk/mbufs (with pullups as needed). But as you point out, this type of ->frag_list chaining is a non-trivial change. > I'm sure we are all going to be talking about this in great detail > next week at netdev/netconf.. :-) Agree. --Sowmini
[PATCH net] sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock
When sctp dumps all the ep->assocs, it needs to lock_sock first, but now it locks sock in rcu_read_lock, and lock_sock may sleep, which would break rcu_read_lock. This patch is to get and hold one sock when traversing the list. After that, it gets out of rcu_read_lock, then locks and dumps it. Then it will traverse the list again to get the next one until all sctp socks are dumped. For sctp_diag_dump_one, it fixes this issue by holding asoc and moving cb() out of rcu_read_lock in sctp_transport_lookup_process. Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Signed-off-by: Xin Long --- net/sctp/sctp_diag.c | 58 net/sctp/socket.c| 10 ++--- 2 files changed, 47 insertions(+), 21 deletions(-) diff --git a/net/sctp/sctp_diag.c b/net/sctp/sctp_diag.c index f3508aa..cef0cee 100644 --- a/net/sctp/sctp_diag.c +++ b/net/sctp/sctp_diag.c @@ -272,28 +272,17 @@ out: return err; } -static int sctp_tsp_dump(struct sctp_transport *tsp, void *p) +static int sctp_sock_dump(struct sock *sk, void *p) { - struct sctp_endpoint *ep = tsp->asoc->ep; + struct sctp_endpoint *ep = sctp_sk(sk)->ep; struct sctp_comm_param *commp = p; - struct sock *sk = ep->base.sk; struct sk_buff *skb = commp->skb; struct netlink_callback *cb = commp->cb; const struct inet_diag_req_v2 *r = commp->r; - struct sctp_association *assoc = - list_entry(ep->asocs.next, struct sctp_association, asocs); + struct sctp_association *assoc; int err = 0; - /* find the ep only once through the transports by this condition */ - if (tsp->asoc != assoc) - goto out; - - if (r->sdiag_family != AF_UNSPEC && sk->sk_family != r->sdiag_family) - goto out; - lock_sock(sk); - if (sk != assoc->base.sk) - goto release; list_for_each_entry(assoc, &ep->asocs, asocs) { if (cb->args[4] < cb->args[1]) goto next; @@ -312,7 +301,7 @@ static int sctp_tsp_dump(struct sctp_transport *tsp, void *p) cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh) < 0) { cb->args[3] = 1; - err = 2; + err = 1; goto release; } cb->args[3] = 1; @@ -321,7 +310,7 @@ static int
sctp_tsp_dump(struct sctp_transport *tsp, void *p) sk_user_ns(NETLINK_CB(cb->skb).sk), NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, 0, cb->nlh) < 0) { - err = 2; + err = 1; goto release; } next: @@ -333,10 +322,35 @@ next: cb->args[4] = 0; release: release_sock(sk); + sock_put(sk); return err; +} + +static int sctp_get_sock(struct sctp_transport *tsp, void *p) +{ + struct sctp_endpoint *ep = tsp->asoc->ep; + struct sctp_comm_param *commp = p; + struct sock *sk = ep->base.sk; + struct netlink_callback *cb = commp->cb; + const struct inet_diag_req_v2 *r = commp->r; + struct sctp_association *assoc = + list_entry(ep->asocs.next, struct sctp_association, asocs); + + /* find the ep only once through the transports by this condition */ + if (tsp->asoc != assoc) + goto out; + + if (r->sdiag_family != AF_UNSPEC && sk->sk_family != r->sdiag_family) + goto out; + + sock_hold(sk); + cb->args[5] = (long)sk; + + return 1; + out: cb->args[2]++; - return err; + return 0; } static int sctp_ep_dump(struct sctp_endpoint *ep, void *p) @@ -472,10 +486,18 @@ skip: * 2 : to record the transport pos of this time's traversal * 3 : to mark if we have dumped the ep info of the current asoc * 4 : to work as a temporary variable to traversal list +* 5 : to save the sk we get from traversing the tsp list. */ if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE))) goto done; - sctp_for_each_transport(sctp_tsp_dump, net, cb->args[2], &commp); + +next: + cb->args[5] = 0; + sctp_for_each_transport(sctp_get_sock, net, cb->args[2], &commp); + + if (cb->args[5] && !sctp_sock_dump((struct sock *)cb->args[5], &commp)) + goto next; + done: cb->args[1] = cb->args[4]; cb->args[4] = 0; diff --git a/net/sctp/socket.c b/net/sctp/socket.c index 9fc417a..8ed2d99 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -4469,17 +4469,21 @@ int sctp_transport_lookup_process(int (*cb)(struct sctp_transport *, void *), const union sctp_addr *paddr, void *p) { struct
[PATCHv3 net 0/3] sctp: a bunch of fixes for prsctp policies
This patchset fixes 2 issues for prsctp policies: 1. patches 1 and 2 fix the "netperf-Throughput_Mbps -37.2% regression" issue when overloading the CPU. 2. patch 3 fixes "prsctp policies should check both sides' prsctp_capable, instead of only local side". Xin Long (3): sctp: move sent_count to the memory hole in sctp_chunk sctp: remove prsctp_param from sctp_chunk sctp: change to check peer prsctp_capable when using prsctp policies include/net/sctp/structs.h | 13 +++-- net/sctp/chunk.c | 11 --- net/sctp/outqueue.c| 12 ++-- net/sctp/sm_make_chunk.c | 15 --- 4 files changed, 17 insertions(+), 34 deletions(-) -- 2.1.0
[PATCHv3 net 3/3] sctp: change to check peer prsctp_capable when using prsctp policies
Now, before using prsctp policies, sctp uses asoc->prsctp_enable to check if prsctp is enabled. However, asoc->prsctp_enable being set only means the local host supports prsctp; sctp should not abandon packets if the peer host doesn't enable prsctp. So this patch uses asoc->peer.prsctp_capable to check if prsctp is enabled on both sides, instead of asoc->prsctp_enable, as asoc's peer.prsctp_capable is set only when both the local and peer hosts enable prsctp. Fixes: a6c2f792873a ("sctp: implement prsctp TTL policy") Signed-off-by: Xin Long --- net/sctp/chunk.c| 4 ++-- net/sctp/outqueue.c | 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c index 14990e2..0a3dbec 100644 --- a/net/sctp/chunk.c +++ b/net/sctp/chunk.c @@ -179,7 +179,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc, msg, msg->expires_at, jiffies); } - if (asoc->prsctp_enable && + if (asoc->peer.prsctp_capable && SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags)) msg->expires_at = jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive); @@ -340,7 +340,7 @@ errout: /* Check whether this message has expired.
*/ int sctp_chunk_abandoned(struct sctp_chunk *chunk) { - if (!chunk->asoc->prsctp_enable || + if (!chunk->asoc->peer.prsctp_capable || !SCTP_PR_POLICY(chunk->sinfo.sinfo_flags)) { struct sctp_datamsg *msg = chunk->msg; diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c index e084f35..107233d 100644 --- a/net/sctp/outqueue.c +++ b/net/sctp/outqueue.c @@ -326,7 +326,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp) sctp_chunk_hold(chunk); sctp_outq_tail_data(q, chunk); - if (chunk->asoc->prsctp_enable && + if (chunk->asoc->peer.prsctp_capable && SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags)) chunk->asoc->sent_cnt_removable++; if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED) @@ -442,7 +442,7 @@ void sctp_prsctp_prune(struct sctp_association *asoc, { struct sctp_transport *transport; - if (!asoc->prsctp_enable || !asoc->sent_cnt_removable) + if (!asoc->peer.prsctp_capable || !asoc->sent_cnt_removable) return; msg_len = sctp_prsctp_prune_sent(asoc, sinfo, @@ -1055,7 +1055,7 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp) /* Mark as failed send. */ sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM); - if (asoc->prsctp_enable && + if (asoc->peer.prsctp_capable && SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags)) asoc->sent_cnt_removable--; sctp_chunk_free(chunk); @@ -1347,7 +1347,7 @@ int sctp_outq_sack(struct sctp_outq *q, struct sctp_chunk *chunk) tsn = ntohl(tchunk->subh.data_hdr->tsn); if (TSN_lte(tsn, ctsn)) { list_del_init(&tchunk->transmitted_list); - if (asoc->prsctp_enable && + if (asoc->peer.prsctp_capable && SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags)) asoc->sent_cnt_removable--; sctp_chunk_free(tchunk); -- 2.1.0
[PATCHv3 net 1/3] sctp: move sent_count to the memory hole in sctp_chunk
Running pahole on sctp_chunk shows it has 2 memory holes: struct sctp_chunk { struct list_head list; atomic_t refcnt; /* XXX 4 bytes hole, try to pack */ ... long unsigned int prsctp_param; int sent_count; /* XXX 4 bytes hole, try to pack */ This patch is to move up sent_count to fill the 1st one and eliminate the 2nd one. It's not just another struct compaction, it also fixes the "netperf-Throughput_Mbps -37.2% regression" issue when overloading the CPU. Fixes: a6c2f792873a ("sctp: implement prsctp TTL policy") Signed-off-by: Xin Long --- include/net/sctp/structs.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index ce93c4b..4f097f5 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -554,6 +554,9 @@ struct sctp_chunk { atomic_t refcnt; + /* How many times this chunk have been sent, for prsctp RTX policy */ + int sent_count; + /* This is our link to the per-transport transmitted list. */ struct list_head transmitted_list; @@ -610,9 +613,6 @@ struct sctp_chunk { */ unsigned long prsctp_param; - /* How many times this chunk have been sent, for prsctp RTX policy */ - int sent_count; - /* Which association does this belong to? */ struct sctp_association *asoc; -- 2.1.0
[PATCHv3 net 2/3] sctp: remove prsctp_param from sctp_chunk
Now sctp uses chunk->prsctp_param to save the prsctp param for all the prsctp policies, but we don't actually need a separate prsctp_param in sctp_chunk. We can just use chunk->sinfo.sinfo_timetolive for the RTX and BUF policies, and reuse msg->expires_at for the TTL policy, as the prsctp policies and the old expires policy are mutually exclusive. This patch removes prsctp_param from sctp_chunk, and reuses msg's expires_at for TTL and chunk's sinfo.sinfo_timetolive for the RTX and BUF policies. Note that sctp can't use chunk's sinfo.sinfo_timetolive for the TTL policy, as it needs a u64 variable to save the expires_at time. This one also fixes the "netperf-Throughput_Mbps -37.2% regression" issue. Fixes: a6c2f792873a ("sctp: implement prsctp TTL policy") Signed-off-by: Xin Long --- include/net/sctp/structs.h | 7 --- net/sctp/chunk.c | 9 +++-- net/sctp/outqueue.c| 4 ++-- net/sctp/sm_make_chunk.c | 15 --- 4 files changed, 9 insertions(+), 26 deletions(-) diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index 4f097f5..ced0df3 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -606,13 +606,6 @@ struct sctp_chunk { /* This needs to be recoverable for SCTP_SEND_FAILED events. */ struct sctp_sndrcvinfo sinfo; - /* We use this field to record param for prsctp policies, -* for TTL policy, it is the time_to_drop of this chunk, -* for RTX policy, it is the max_sent_count of this chunk, -* for PRIO policy, it is the priority of this chunk. -*/ - unsigned long prsctp_param; - /* Which association does this belong to?
*/ struct sctp_association *asoc; diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c index a55e547..14990e2 100644 --- a/net/sctp/chunk.c +++ b/net/sctp/chunk.c @@ -179,6 +179,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc, msg, msg->expires_at, jiffies); } + if (asoc->prsctp_enable && + SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags)) + msg->expires_at = + jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive); + /* This is the biggest possible DATA chunk that can fit into * the packet */ @@ -349,14 +354,14 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk) } if (SCTP_PR_TTL_ENABLED(chunk->sinfo.sinfo_flags) && - time_after(jiffies, chunk->prsctp_param)) { + time_after(jiffies, chunk->msg->expires_at)) { if (chunk->sent_count) chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++; else chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++; return 1; } else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) && - chunk->sent_count > chunk->prsctp_param) { + chunk->sent_count > chunk->sinfo.sinfo_timetolive) { chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++; return 1; } diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c index 72e54a4..e084f35 100644 --- a/net/sctp/outqueue.c +++ b/net/sctp/outqueue.c @@ -383,7 +383,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc, list_for_each_entry_safe(chk, temp, queue, transmitted_list) { if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) || - chk->prsctp_param <= sinfo->sinfo_timetolive) + chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive) continue; list_del_init(&chk->transmitted_list); @@ -418,7 +418,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc, list_for_each_entry_safe(chk, temp, queue, list) { if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) || - chk->prsctp_param <= sinfo->sinfo_timetolive) + chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive) continue; list_del_init(&chk->list); diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c index
8c77b87..46ffecc 100644 --- a/net/sctp/sm_make_chunk.c +++ b/net/sctp/sm_make_chunk.c @@ -706,20 +706,6 @@ nodata: return retval; } -static void sctp_set_prsctp_policy(struct sctp_chunk *chunk, - const struct sctp_sndrcvinfo *sinfo) -{ - if (!chunk->asoc->prsctp_enable) - return; - - if (SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags)) - chunk->prsctp_param = - jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive); - else if (SCTP_PR_RTX_ENABLED(sinfo->sinfo_flags) || -SCTP_PR_PRIO_ENABLED(sinfo->sinfo_flags)) - chunk->prsctp_param = sinfo->sinfo_timetolive; -} - /* Make a DATA chunk for the given association from the
[PATCH] tc: f_u32: Fill in 'linkid' provided by user
Currently, 'linkid' input by the user is parsed but 'handle' is appended to the netlink message. # tc filter add dev enp1s0f1 protocol ip parent : prio 99 u32 ht 800: \ order 1 link 1: offset at 0 mask 0f00 shift 6 plus 0 eat match ip \ protocol 6 ff resulted in: filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0 match 0006/00ff at 8 offset 0f00>>6 at 0 eat This patch results in: filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1: match 0006/00ff at 8 offset 0f00>>6 at 0 eat Signed-off-by: Sushma Sitaram --- tc/f_u32.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tc/f_u32.c b/tc/f_u32.c index 0ad7ed2..92c1fcd 100644 --- a/tc/f_u32.c +++ b/tc/f_u32.c @@ -1071,7 +1071,7 @@ static int u32_parse_opt(struct filter_util *qu, char *handle, fprintf(stderr, "\"link\" must be a hash table.\n"); return -1; } - addattr_l(n, MAX_MSG, TCA_U32_LINK, &handle, 4); + addattr_l(n, MAX_MSG, TCA_U32_LINK, &linkid, 4); } else if (strcmp(*argv, "ht") == 0) { unsigned int ht;
Re: [PATCH v1] mlx4: remove unused fields
On Wed, 2016-09-28 at 11:00 -0700, David Decotigny wrote: > From: David Decotigny> > This also can address following UBSAN warnings: > [ 36.640343] > > [ 36.648772] UBSAN: Undefined behaviour in > drivers/net/ethernet/mellanox/mlx4/fw.c:857:26 > [ 36.656853] shift exponent 64 is too large for 32-bit type 'int' > [ 36.663348] > > [ 36.671783] > > [ 36.680213] UBSAN: Undefined behaviour in > drivers/net/ethernet/mellanox/mlx4/fw.c:861:27 > [ 36.688297] shift exponent 35 is too large for 32-bit type 'int' > [ 36.694702] > > > Tested: > reboot with UBSAN, no warning. > > Signed-off-by: David Decotigny > --- CC: Tariq Toukan (mlx4 maintainer) Note this patch was cooked/tested using net-next, but can be applied to net tree, with minor fuzz. Acked-by: Eric Dumazet Thanks David.
Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()
On Wed, Sep 28, 2016 at 10:03 AM, Sowmini Varadhanwrote: > On (09/23/16 17:43), Alexander Duyck wrote: >> > On (09/23/16 10:38), Alexander Duyck wrote: > ; >> >> almost think of it as us doing something like the inverse of >> >> pskb_pull_tail. The general idea here is we want to actually leave >> >> the data in skb->data, but just reference it from frag[0] so that we >> >> don't accidentally pull in the 2 byte padding for alignment when >> >> transmitting the frame. > > Some additional findings.. > > Just to recap how we got here: for the Rx path, the inner packet has > been set up as an ethernet frame with the IP header at an aligned address > when it hits vxlan_build_skb. But that means the (inner) mac address > was offset by NET_IP_ALIGN so vxlan_build_skb needs to pad the data > by NET_IP_ALIGN to make the vxh outer ip header align. > > Then we'd need to do something like the suggestion above (keep some > pointers in frag[0]? do the reverse of a pskb_expand_head to push out > the inner ip header to the skb_frag_t?), to have the driver skip over the > pad.. > > I tried the following for a hack, and it takes care of the tx side > unaligned access, though, clearly, the memmove needs to be avoided > > @@ -1750,10 +1825,38 @@ static int vxlan_build_skb(struct sk_buff *skb, > struct d > if (err) > goto out_free; > > + > +#if (NET_IP_ALIGN != 0) > + { > + unsigned char *data; > + > + /* inner packet is an ethernet frame that was set up > +* so that the IP header is aligned. But that means the > +* mac address was offset by NET_IP_ALIGN, so we need > +* to move things up so that the vxh and outer ip header > +* are now aligned > +* XXX The Alexander Duyck idea was to only do the > +* extra __skb_push() for NET_IP_ALIGN, and avoid the > +* extram memmove and ->inner* adjustments. Plus keep > +* additional pointers in frag[0] and have the driver pick > +* up pointers from frag[0] .. need to investigate > +* that suggestion further. 
> +*/ > + data = skb->data; > + skb->data -= NET_IP_ALIGN; > + memmove(skb->data, data, skb_headlen(skb)); > + skb->inner_network_header -= NET_IP_ALIGN; > + skb->inner_mac_header -= NET_IP_ALIGN; > + skb->inner_transport_header -= NET_IP_ALIGN; > + } > +#endif > vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh)); > vxh->vx_flags = VXLAN_HF_VNI; > vxh->vx_vni = vxlan_vni_field(vni); > > In the general case (when the skb passed to vxlan_build_skb is already > non-linear), wouldn't we end up having to shift all the frags by 1 and/or > do some type of memory copy of the inner packet? However, I think > there are some clever things we can do in general, to avoid the memmove.. Right, basically my idea was to just skip the memmove, pull the data out and add pointers to this spot in the fraglist. Doing that you should be pulling tail back so it is equal to data. Then you just do an skb_reserve(skb, -NET_IP_ALIGN) and you should be all set to start adding outer headers. The problem is you end up having to disable any offsets such as GSO or checksum offload since you can't really use the inner header offsets anymore. > I also looked at the Rx path. Here the suggestion was: > "we should only pull the outer headers from the page frag, and then > when the time is right we drop the outer headers and pull the inner > headers from the page frag. That way we can keep all the headers > aligned." > I hacked up ixgbe_add_rx_frag to always only create nonlinear skb's, > i.e., always avoid the >memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long))); > and then I end up with > - ixgbe_clean_rx_irq copies outer header ether/ip/udp headers > into linear part as needed, > - then udp_gro_receive -> vxlan_gro_receive pulls up vxlan header > into linear part, and then.. This is the point where we need to stop, drop the existing headers, call skb_reserve(NET_IP_ALIGN), and then pick back up where we left off. We just have to make sure the skbuff isn't shared. 
> - eth_gro_receive pulls up another 34 bytes for the eth + ip header. > This last pull ends up being unaligned. > I dont know if we can safely drop the outer ip headers at this point > (have not tried this yet, and I'm not sure we can do this in all udp > encaps cases..) In general the interface behind the tunnel shouldn't need to know about any of the data in front of the tunnel, so you should be able to drop the original header offsets and reset things if you want so you could overwrite the old headers. > one other possibility is to set up the inner frame as part of the > ->frag_list (note, this is *not* skb_frag_t). I suspect that is
[PATCH] Add netdev all_adj_list refcnt propagation to fix panic
Andrew Collins posted this patch as RFC in March: http://patchwork.ozlabs.org/patch/603101/ It has apparently fallen through the cracks and never applied. It solves a refcnt problem (thanks Nik for pointing out this patch) with stacked devices that involves macvlan on a bridge, a bond into the bridge, and the bridge and macvlan are enslaved to a vrf: ++ | myvrf | ++ || | +-+ | | macvlan | | +-+ || +--+ | bridge | +--+ | ++ | bond0 | ++ | ++ | swp3 | ++ Deleting bond0 hangs waiting for bond0 to become free. The splat in dmesg is: [ 206.485340] [ cut here ] [ 206.486052] WARNING: CPU: 0 PID: 746 at /home/dsa/kernel-3.git/net/core/dev.c:6772 rollback_registered_many+0x28a/0x2da [ 206.487563] Modules linked in: macvlan bonding bridge stp llc vrf [ 206.488946] CPU: 0 PID: 746 Comm: ifdown Not tainted 4.8.0-rc7+ #144 [ 206.489768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 206.490973] 88003b3a3ca8 81287ad3 [ 206.491939] 88003b3a3ce8 8104f19d 1a743b3a3b58 [ 206.492914] 88003c93e000 88003b3a3d48 88003c877400 88003b3a3d48 [ 206.493873] Call Trace: [ 206.494184] [] dump_stack+0x81/0xb6 [ 206.494805] [] __warn+0xc5/0xe0 [ 206.495382] [] warn_slowpath_null+0x18/0x1a [ 206.496094] [] rollback_registered_many+0x28a/0x2da [ 206.496888] [] rollback_registered+0x29/0x36 [ 206.497596] [] unregister_netdevice_queue+0x6a/0xa6 [ 206.498383] [] bonding_store_bonds+0x132/0x186 [bonding] [ 206.499209] [] class_attr_store+0x1e/0x20 [ 206.499893] [] sysfs_kf_write+0x44/0x4b [ 206.500558] [] kernfs_fop_write+0x113/0x15d [ 206.501269] [] __vfs_write+0x21/0xa0 [ 206.501897] [] ? percpu_down_read+0x4e/0x7a [ 206.502595] [] ? __sb_start_write+0x5a/0xab [ 206.503302] [] ? 
__sb_start_write+0x5a/0xab [ 206.504015] [] vfs_write+0xa2/0xc6 [ 206.504637] [] SyS_write+0x4b/0x79 [ 206.505251] [] entry_SYSCALL_64_fastpath+0x1f/0xbd [ 206.506054] ---[ end trace a578aa9ea7e7176a ]--- [ 206.512017] PF_BRIDGE: RTM_SETLINK with unknown ifindex [ 216.765567] unregister_netdevice: waiting for bond0 to become free. Usage count = 1 The splat is from this line in rollback_registered_many(): WARN_ON(netdev_has_any_upper_dev(dev)); --- This is the original commit message from Andrew: This is an RFC patch to fix a relatively easily reproducible kernel panic related to the all_adj_list handling for netdevs in recent kernels. This is more to generate discussion than anything else. I don't particularly like this approach, I'm hoping someone has a better idea. The following sequence of commands will reproduce the issue: ip link add link eth0 name eth0.100 type vlan id 100 ip link add link eth0 name eth0.200 type vlan id 200 ip link add name testbr type bridge ip link set eth0.100 master testbr ip link set eth0.200 master testbr ip link add link testbr mac0 type macvlan ip link delete dev testbr This creates an upper/lower tree of (excuse the poor ASCII art): /---eth0.100-eth0 mac0-testbr- \---eth0.200-eth0 When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one reference to eth0 is added, so this results in a panic. This change adds reference count propagation so things are handled properly. Matthias Schiffer reported a similar crash in batman-adv: https://github.com/freifunk-gluon/gluon/issues/680 https://www.open-mesh.org/issues/247 which this patch also seems to resolve. Patch is from Andrew Collins, but did not have a formal sign-off. The patch applies to top of tree, so no change was made on my part; I just pulled it off the web.
From: Andrew Collins

Signed-off-by: David Ahern
---
 net/core/dev.c | 69 --
 1 file changed, 38 insertions(+), 31 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f721d6..84963c99763a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5589,6 +5589,7 @@ static inline bool netdev_adjacent_is_neigh_list(struct net_device *dev,
 
 static int __netdev_adjacent_dev_insert(struct net_device *dev,
 					struct net_device *adj_dev,
+					u16 ref_nr,
 					struct list_head *dev_list,
 					void *private, bool master)
 {
@@ -5598,7 +5599,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev,
Re: [PATCH 3/3] net: fec: align IP header in hardware
Thanks Russell,

On 09/28/2016 10:25 AM, Russell King - ARM Linux wrote:
> On Wed, Sep 28, 2016 at 10:14:52AM -0700, Eric Nelson wrote:
>> Thanks David,
>>
>> On 09/28/2016 09:42 AM, David Laight wrote:
>>> From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
>>> add two bytes of 'junk' to the start of every receive frame.
>>
>> That's right. Two bytes of junk between the MAC header and the
>> IP header.
>
> That's wrong. FEC_RACC_SHIFT16 adds two bytes to the _beginning_ of
> the packet, not in the middle of the packet:
>
>   7  RX FIFO Shift-16
>      SHIFT16
>      When this field is set, the actual frame data starts at bit 16
>      of the first word read from the RX FIFO aligning the Ethernet
>      payload on a 32-bit boundary.
>
>      NOTE: This function only affects the FIFO storage and has no
>      influence on the statistics, which use the actual length
>      of the frame received.
>
>      0  Disabled.
>      1  Instructs the MAC to write two additional bytes in front
>         of each frame received into the RX FIFO.
>
> *in front* of the frame - that's before the Ethernet header. Not between
> the ethernet and IP headers.
>

I obviously misread this, and haven't dumped any packets to straighten myself out.
[PATCH v1] mlx4: remove unused fields
From: David Decotigny

This can also address the following UBSAN warnings:

[   36.640343]
[   36.648772] UBSAN: Undefined behaviour in drivers/net/ethernet/mellanox/mlx4/fw.c:857:26
[   36.656853] shift exponent 64 is too large for 32-bit type 'int'
[   36.663348]
[   36.671783]
[   36.680213] UBSAN: Undefined behaviour in drivers/net/ethernet/mellanox/mlx4/fw.c:861:27
[   36.688297] shift exponent 35 is too large for 32-bit type 'int'
[   36.694702]

Tested: reboot with UBSAN, no warning.

Signed-off-by: David Decotigny
---
 drivers/net/ethernet/mellanox/mlx4/fw.c | 4 ----
 drivers/net/ethernet/mellanox/mlx4/fw.h | 2 --
 2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 090bf81..f9cbc67 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -853,12 +853,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	dev_cap->max_eqs = 1 << (field & 0xf);
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MTT_OFFSET);
 	dev_cap->reserved_mtts = 1 << (field >> 4);
-	MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MRW_SZ_OFFSET);
-	dev_cap->max_mrw_sz = 1 << field;
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MRW_OFFSET);
 	dev_cap->reserved_mrws = 1 << (field & 0xf);
-	MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MTT_SEG_OFFSET);
-	dev_cap->max_mtt_seg = 1 << (field & 0x3f);
 	MLX4_GET(size, outbox, QUERY_DEV_CAP_NUM_SYS_EQ_OFFSET);
 	dev_cap->num_sys_eqs = size & 0xfff;
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_REQ_QP_OFFSET);
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.h b/drivers/net/ethernet/mellanox/mlx4/fw.h
index f11614f..5343a05 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.h
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.h
@@ -80,9 +80,7 @@ struct mlx4_dev_cap {
 	int max_eqs;
 	int num_sys_eqs;
 	int reserved_mtts;
-	int max_mrw_sz;
 	int reserved_mrws;
-	int max_mtt_seg;
 	int max_requester_per_qp;
 	int max_responder_per_qp;
 	int max_rdma_global;
-- 
2.8.0.rc3.226.g39d4020
Re: ISDN-Gigaset: Fine-tuning for three function implementations
On Wed, 2016-09-28 at 18:50 +0200, SF Markus Elfring wrote:
> Would you like to look once more into an improved patch series for this
> software module a bit later?

I'm afraid I'm not looking forward to receiving an update of this series, sorry.

Thanks,

Paul Bolle
Re: [PATCH 1/5] ISDN-Gigaset: Use kmalloc_array() in two functions
On Wed, 2016-09-28 at 18:38 +0200, SF Markus Elfring wrote:
> > I'm not going to change code just because some checker suggests to
> > do so.
>
> The script "checkpatch.pl" can point information out like the
> following.
>
> WARNING: Prefer kmalloc_array over kmalloc with multiply

Am I being trolled?

Paul Bolle
Re: [PATCH v2 net] net: skbuff: skb_vlan_push: Fix wrong unwinding of skb->data after __vlan_insert_tag call
On Wed, 28 Sep 2016 16:43:38 +0200 Daniel Borkmann wrote:
> Couldn't we end up with 1) for the act_vlan case when we'd have the
> offset-adjusted skb_vlan_push() fix from here, where we'd then redirect
> to ingress where skb_vlan_pop() would be called? If I'm not missing
> something, skb_vlan_push() would then point to the data location of 1)
> and with your other proposed direct netif_receive_skb() patch, no
> further skb->data adjustments would be done, right?

Right. Then skb_vlan_pop() should expect either (1) or (2).

> Another potential issue (but unrelated to this fix here) I just noticed
> is, whether act_vlan might have the same problem as we fixed in 8065694e6519
> ("bpf: fix checksum for vlan push/pop helper"). So potentially, we could
> end up fixing CHECKSUM_COMPLETE wrongly on ingress, since these 14 bytes
> are already pulled out of the sum at that point.
>
> > Should we adjust "offset" back, only if resulting offset is >= 14?
>
> If also the checksum one might end up as an issue, maybe it's just best
> to go through the pain and do the push/pull for data plus csum, so both
> skb_vlan_*() functions see the frame starting from mac header temporarily?

Although not related to this specific fix, I see 2 ways of addressing the rcsum problem:

1. Per your suggestion, skb_vlan_*() to expect 'data' at mac_header.
   That would simplify things; for this suggested 'data unwind' fix as well.

2. Within skb_vlan_*(), deduce (according to initial offset) whether we're
   already "pulled out" of the rcsum, and not invoke the
   skb_postpull/push_rcsum update.

Will meditate some more.

Thanks
Shmulik
Re: [PATCH net-next 1/2] net: phy: Add Wake-on-LAN driver for Microsemi PHYs.
On 09/28/2016 05:01 AM, Raju Lakkaraju wrote:
> From: Raju Lakkaraju
>
> Wake-on-LAN (WoL) is an Ethernet networking standard that allows
> a computer/device to be turned on or awakened by a network message.
> VSC8531 PHY can support this feature configure by driver set function.
> WoL status get by driver get function.
>
> Tested on Beaglebone Black with VSC 8531 PHY.
>
> Signed-off-by: Raju Lakkaraju
> ---
>  drivers/net/phy/mscc.c | 132 +
>  1 file changed, 132 insertions(+)
>
> diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
> index d350deb..ca6ea23 100644
> --- a/drivers/net/phy/mscc.c
> +++ b/drivers/net/phy/mscc.c
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  enum rgmii_rx_clock_delay {
>  	RGMII_RX_CLK_DELAY_0_2_NS = 0,
> @@ -35,6 +36,7 @@ enum rgmii_rx_clock_delay {
>
>  #define MII_VSC85XX_INT_MASK		25
>  #define MII_VSC85XX_INT_MASK_MASK	0xa000
> +#define MII_VSC85XX_INT_MASK_WOL	0x0040
>  #define MII_VSC85XX_INT_STATUS		26
>
>  #define MSCC_EXT_PAGE_ACCESS		31
> @@ -46,6 +48,19 @@ enum rgmii_rx_clock_delay {
>  #define RGMII_RX_CLK_DELAY_MASK	0x0070
>  #define RGMII_RX_CLK_DELAY_POS		4
>
> +#define MSCC_PHY_WOL_LOWER_MAC_ADDR	21
> +#define MSCC_PHY_WOL_MID_MAC_ADDR	22
> +#define MSCC_PHY_WOL_UPPER_MAC_ADDR	23
> +#define MSCC_PHY_WOL_LOWER_PASSWD	24
> +#define MSCC_PHY_WOL_MID_PASSWD	25
> +#define MSCC_PHY_WOL_UPPER_PASSWD	26
> +
> +#define MSCC_PHY_WOL_MAC_CONTROL	27
> +#define EDGE_RATE_CNTL_POS		5
> +#define EDGE_RATE_CNTL_MASK		0x00E0
> +#define SECURE_ON_ENABLE		0x8000
> +#define SECURE_ON_PASSWD_LEN_4	0x4000
> +
>  /* Microsemi PHY ID's */
>  #define PHY_ID_VSC8531			0x00070570
>  #define PHY_ID_VSC8541			0x00070770
> @@ -58,6 +73,119 @@ static int vsc85xx_phy_page_set(struct phy_device *phydev, u8 page)
>  	return rc;
>  }
>
> +static int vsc85xx_wol_set(struct phy_device *phydev,
> +			   struct ethtool_wolinfo *wol)
> +{
> +	int rc;
> +	u16 reg_val;
> +	struct ethtool_wolinfo *wol_conf = wol;
> +
> +	mutex_lock(&phydev->lock);

This mutex is used here because you are using an indirect page access, right? This is not to protect against multiple calls of wol_set from different executing threads?

> +	rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED_2);
> +	if (rc != 0)
> +		goto out_unlock;
> +
> +	if (wol->wolopts & WAKE_MAGIC) {
> +		/* Store the device address for the magic packet */
> +		reg_val = phydev->attached_dev->dev_addr[4] << 8;
> +		reg_val |= phydev->attached_dev->dev_addr[5];
> +		phy_write(phydev, MSCC_PHY_WOL_LOWER_MAC_ADDR, reg_val);
> +		reg_val = phydev->attached_dev->dev_addr[2] << 8;
> +		reg_val |= phydev->attached_dev->dev_addr[3];
> +		phy_write(phydev, MSCC_PHY_WOL_MID_MAC_ADDR, reg_val);
> +		reg_val = phydev->attached_dev->dev_addr[0] << 8;
> +		reg_val |= phydev->attached_dev->dev_addr[1];
> +		phy_write(phydev, MSCC_PHY_WOL_UPPER_MAC_ADDR, reg_val);
> +	} else {
> +		phy_write(phydev, MSCC_PHY_WOL_LOWER_MAC_ADDR, 0);
> +		phy_write(phydev, MSCC_PHY_WOL_MID_MAC_ADDR, 0);
> +		phy_write(phydev, MSCC_PHY_WOL_UPPER_MAC_ADDR, 0);
> +	}
> +
> +	reg_val = phy_read(phydev, MSCC_PHY_WOL_MAC_CONTROL);
> +	if (wol_conf->wolopts & WAKE_MAGICSECURE)
> +		reg_val |= SECURE_ON_ENABLE;
> +	else
> +		reg_val &= ~SECURE_ON_ENABLE;
> +	phy_write(phydev, MSCC_PHY_WOL_MAC_CONTROL, reg_val);
> +
> +	if (wol_conf->wolopts & WAKE_MAGICSECURE) {
> +		reg_val = wol_conf->sopass[4] << 8;
> +		reg_val |= wol_conf->sopass[5];
> +		phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, reg_val);
> +		reg_val = wol_conf->sopass[2] << 8;
> +		reg_val |= wol_conf->sopass[3];
> +		phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, reg_val);
> +		reg_val = wol_conf->sopass[0] << 8;
> +		reg_val |= wol_conf->sopass[1];
> +		phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, reg_val);
> +	} else {
> +		phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, 0);
> +		phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, 0);
> +		phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, 0);
> +	}

How about making the code a little simpler in both cases with something like the following:

	u16 pwd = { };
	unsigned int i;

	if (wol_conf->wolopts & WAKE_MAGICSECURE)
		for (i = 0; i < ARRAY_SIZE(pwd); i++)
			pwd[i] =
Re: [PATCH 3/3] net: fec: align IP header in hardware
On Wed, Sep 28, 2016 at 10:14:52AM -0700, Eric Nelson wrote:
> Thanks David,
>
> On 09/28/2016 09:42 AM, David Laight wrote:
> > From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
> > add two bytes of 'junk' to the start of every receive frame.
>
> That's right. Two bytes of junk between the MAC header and the
> IP header.

That's wrong. FEC_RACC_SHIFT16 adds two bytes to the _beginning_ of the packet, not in the middle of the packet:

  7  RX FIFO Shift-16
     SHIFT16
     When this field is set, the actual frame data starts at bit 16
     of the first word read from the RX FIFO aligning the Ethernet
     payload on a 32-bit boundary.

     NOTE: This function only affects the FIFO storage and has no
     influence on the statistics, which use the actual length
     of the frame received.

     0  Disabled.
     1  Instructs the MAC to write two additional bytes in front
        of each frame received into the RX FIFO.

*in front* of the frame - that's before the Ethernet header. Not between the ethernet and IP headers.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [PATCH RFC 4/6] net: phy: Use phy name when requesting the interrupt
On 09/28/2016 01:32 AM, Andrew Lunn wrote:
> Using the fixed name "phy_interrupt" is not very informative in
> /proc/interrupts when there are a lot of phys, e.g. a device with an
> Ethernet switch. So when requesting the interrupt, use the name of the
> phy.
>
> Signed-off-by: Andrew Lunn

Acked-by: Florian Fainelli
-- 
Florian
Re: [PATCH 3/3] net: fec: align IP header in hardware
Thanks David,

On 09/28/2016 09:42 AM, David Laight wrote:
> From: Eric Nelson
>> Sent: 26 September 2016 19:40
>> Hi David,
>>
>> On 09/26/2016 02:26 AM, David Laight wrote:
>>> From: Eric Nelson
>>>> Sent: 24 September 2016 15:42
>>>> The FEC receive accelerator (RACC) supports shifting the data payload
>>>> of received packets by 16-bits, which aligns the payload (IP header)
>>>> on a 4-byte boundary, which is, if not required, at least strongly
>>>> suggested by the Linux networking layer.
>>> ...
>>>> +	/* align IP header */
>>>> +	val |= FEC_RACC_SHIFT16;
>>>
>>> I can't help feeling that there needs to be corresponding
>>> changes to increase the buffer size by 2 (maybe for large mtu)
>>> and to discard two bytes from the frame length.
>>>
>>
>> In the normal case, the fec driver over-allocates all receive packets to
>> be of size FEC_ENET_RX_FRSIZE (2048) minus the value of rx_align,
>> which is either 0x0f (ARM) or 0x03 (PPC).
>>
>> If the frame length is less than rx_copybreak (typically 256), then
>> the frame length from the receive buffer descriptor is used to
>> control the allocation size for a copied buffer, and this will include
>> the two bytes of padding if RACC_SHIFT16 is set.
>>
>>> If probably ought to be predicated on NET_IP_ALIGN as well.
>>>
>> Can you elaborate?
>
> From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
> add two bytes of 'junk' to the start of every receive frame.
>

That's right. Two bytes of junk between the MAC header and the IP header.

> In the 'copybreak' case the new skb would need to be 2 bytes shorter
> than the length reported by the hardware, and the data copied from
> 2 bytes into the dma buffer.
>

As it stands, the skb allocated by the copybreak routine will include the two bytes of padding, and the call to skb_pull_inline will ignore them.

> The extra 2 bytes also mean that the maximum mtu that can be received
> into a buffer is two bytes less.
>

Right, but I think the max is already high enough that this isn't a problem.
> If someone sets the mtu to (say) 9k for jumbo frames this might matter.
> Even with fixed 2048 byte buffers it reduces the maximum value the mtu
> can be set to by 2.
>

As far as I can tell, the fec driver doesn't support jumbo frames, and the max frame length is currently hard-coded at PKT_MAXBUF_SIZE (1522). This is well within the 2048-byte allocation, even with optional headers for VLAN etc.

> Now if NET_IP_ALIGN is zero then it is fine for the rx frame to start
> on a 4n boundary, and the skb are likely to be allocated that way.
> In this case you don't want the extra two bytes of 'junk'.
>

NET_IP_ALIGN is defaulting to 2 by the conditional in skbuff.h.

> OTOH if NET_IP_ALIGN is 2 then you need to 'fiddle' things so that
> the data is dma'd to offset -2 in the skb and then ensure that the
> end of frame is set correctly.
>

That's what the RACC SHIFT16 bit does. The FEC hardware isn't capable of DMA'ing to an un-aligned address. On ARM, it requires 64-bit alignment, but suggests 128-bit alignment. On other (PPC?) architectures, it requires 32-bit alignment. This is handled by the rx_align field.

Regards,
Eric
Re: [PATCH RFC 3/6] net: phy: Threaded interrupts allow some simplification
On 09/28/2016 06:38 AM, Sergei Shtylyov wrote:
> On 09/28/2016 03:28 PM, Andrew Lunn wrote:
>>>>> The PHY interrupts are now handled in a threaded interrupt handler,
>>>>> which can sleep. The work queue is no longer needed, phy_change() can
>>>>> be called directly. Additionally, none of the callers of
>>>>> phy_mac_interrupt() did so in interrupt context, so fully remove the
>>>>
>>>> I did intend to call it from interrupt context (from the ravb driver).
>>>>
>>>>> work queue, and document that phy_mac_interrupt() should not be called
>>>>> in interrupt context.
>>>>
>>>> It was intentionally made callable from the interrupt context, I'd
>>>> prefer if you wouldn't change that.
>>>
>>> OTOH, it's still not very handy to call because of the 'new_link'
>>> parameter which I'm not sure I can provide...
>>
>> Hi Sergei
>>
>> If there is a need for it, i will leave the work queue and keep this
>> code unchanged.
>
> Let's hear what Florian says...

The intent is really to have phy_mac_interrupt() callable from hard IRQ context, not that this matters really too much because link events already occur in the slow path, but it's nice to have that property retained IMHO.
-- 
Florian
Re: [PATCH v2 net] net: skbuff: skb_vlan_push: Fix wrong unwinding of skb->data after __vlan_insert_tag call
On Wed, 28 Sep 2016 16:43:38 +0200 Daniel Borkmann wrote:
> > (1) suppose upon entry we have
> >
> >     DA,SA,0x8100,TCI,0x0800,
> >                             ^^
> >     mac_hdr                 data
> >
> > initial offset is 18, and after current unwinding code we'll get
>
> You mean data points after the 0x0800, right?

Sorry. Yes, exactly as you say. Initially 18 bytes ahead:

    DA,SA,0x8100,TCI,0x0800,
    ^                       ^
    mac_hdr                 data
Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()
On (09/23/16 17:43), Alexander Duyck wrote:
> > On (09/23/16 10:38), Alexander Duyck wrote:
> >> almost think of it as us doing something like the inverse of
> >> pskb_pull_tail. The general idea here is we want to actually leave
> >> the data in skb->data, but just reference it from frag[0] so that we
> >> don't accidentally pull in the 2 byte padding for alignment when
> >> transmitting the frame.

Some additional findings.. Just to recap how we got here: for the Rx path, the inner packet has been set up as an ethernet frame with the IP header at an aligned address when it hits vxlan_build_skb. But that means the (inner) mac address was offset by NET_IP_ALIGN, so vxlan_build_skb needs to pad the data by NET_IP_ALIGN to make the vxh outer ip header align. Then we'd need to do something like the suggestion above (keep some pointers in frag[0]? do the reverse of a pskb_expand_head to push out the inner ip header to the skb_frag_t?), to have the driver skip over the pad..

I tried the following for a hack, and it takes care of the tx side unaligned access, though, clearly, the memmove needs to be avoided:

@@ -1750,10 +1825,38 @@ static int vxlan_build_skb(struct sk_buff *skb, struct d
 	if (err)
 		goto out_free;
 
+
+#if (NET_IP_ALIGN != 0)
+	{
+		unsigned char *data;
+
+		/* inner packet is an ethernet frame that was set up
+		 * so that the IP header is aligned. But that means the
+		 * mac address was offset by NET_IP_ALIGN, so we need
+		 * to move things up so that the vxh and outer ip header
+		 * are now aligned
+		 * XXX The Alexander Duyck idea was to only do the
+		 * extra __skb_push() for NET_IP_ALIGN, and avoid the
+		 * extra memmove and ->inner* adjustments. Plus keep
+		 * additional pointers in frag[0] and have the driver pick
+		 * up pointers from frag[0] .. need to investigate
+		 * that suggestion further.
+		 */
+		data = skb->data;
+		skb->data -= NET_IP_ALIGN;
+		memmove(skb->data, data, skb_headlen(skb));
+		skb->inner_network_header -= NET_IP_ALIGN;
+		skb->inner_mac_header -= NET_IP_ALIGN;
+		skb->inner_transport_header -= NET_IP_ALIGN;
+	}
+#endif
 	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
 	vxh->vx_flags = VXLAN_HF_VNI;
 	vxh->vx_vni = vxlan_vni_field(vni);

In the general case (when the skb passed to vxlan_build_skb is already non-linear), wouldn't we end up having to shift all the frags by 1 and/or do some type of memory copy of the inner packet? However, I think there are some clever things we can do in general, to avoid the memmove..

I also looked at the Rx path. Here the suggestion was: "we should only pull the outer headers from the page frag, and then when the time is right we drop the outer headers and pull the inner headers from the page frag. That way we can keep all the headers aligned."

I hacked up ixgbe_add_rx_frag to always only create nonlinear skb's, i.e., always avoid the
	memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
and then I end up with
- ixgbe_clean_rx_irq copies outer header ether/ip/udp headers into linear part as needed,
- then udp_gro_receive -> vxlan_gro_receive pulls up vxlan header into linear part, and then..
- eth_gro_receive pulls up another 34 bytes for the eth + ip header.

This last pull ends up being unaligned. I don't know if we can safely drop the outer ip headers at this point (have not tried this yet, and I'm not sure we can do this in all udp encaps cases..)

One other possibility is to set up the inner frame as part of the ->frag_list (note, this is *not* skb_frag_t). I suspect that is going to cause other inefficiencies. But, as Tom has been saying all along, a big part of this problem is that we are tripping up on the ethernet header in the middle of an IP packet. Unfortunately I don't think the IETF is going to agree to never ever do that, so I'm not sure we can win that architectural battle.
[PATCH 0/3] [v2] Add basic ACPI support to the Qualcomm Technologies EMAC driver
This patch series adds support to the EMAC driver for extracting addresses, interrupts, and some _DSDs (properties) from ACPI. The first two patches clean up the code, and the third patch adds ACPI-specific functionality. The first patch fixes a bug with handling the platform_device for the internal PHY. This phy is treated as a separate device in both DT and ACPI, but since the platform is not released automatically when the driver unloads, managed functions like devm_ioremap_resource cannot be used. The second patch replaces of_get_mac_address with its platform-independent equivalent device_get_mac_address. The third patch parses the ACPI tables to obtain the platform_device for the primary EMAC node ("QCOM8070") and the internal phy node ("QCOM8071"). Timur Tabi (3): [v2] net: qcom/emac: do not use devm on internal phy pdev [v2] net: qcom/emac: use device_get_mac_address [v2] net: qcom/emac: initial ACPI support drivers/net/ethernet/qualcomm/emac/emac-phy.c | 37 ++-- drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 110 ++-- drivers/net/ethernet/qualcomm/emac/emac.c | 26 -- 3 files changed, 134 insertions(+), 39 deletions(-) -- Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH 2/3] [v2] net: qcom/emac: use device_get_mac_address
Replace the DT-specific of_get_mac_address() function with device_get_mac_address(), which works on both DT and ACPI platforms. This change makes it easier to add ACPI support.

Signed-off-by: Timur Tabi
---
 drivers/net/ethernet/qualcomm/emac/emac.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index 429b4cb..551df1c 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -531,18 +531,16 @@ static void emac_clks_teardown(struct emac_adapter *adpt)
 static int emac_probe_resources(struct platform_device *pdev,
 				struct emac_adapter *adpt)
 {
-	struct device_node *node = pdev->dev.of_node;
 	struct net_device *netdev = adpt->netdev;
 	struct resource *res;
-	const void *maddr;
+	char maddr[ETH_ALEN];
 	int ret = 0;
 
 	/* get mac address */
-	maddr = of_get_mac_address(node);
-	if (!maddr)
-		eth_hw_addr_random(netdev);
-	else
+	if (device_get_mac_address(&pdev->dev, maddr, ETH_ALEN))
 		ether_addr_copy(netdev->dev_addr, maddr);
+	else
+		eth_hw_addr_random(netdev);
 
 	/* Core 0 interrupt */
 	ret = platform_get_irq(pdev, 0);
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH 1/3] [v2] net: qcom/emac: do not use devm on internal phy pdev
The platform_device returned by of_find_device_by_node() is not automatically released when the driver unprobes. Therefore, managed calls like devm_ioremap_resource() should not be used. Instead, we manually allocate the resources and then free them on driver release.

Signed-off-by: Timur Tabi
---
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 42 +++++++++++++++++++++--
 drivers/net/ethernet/qualcomm/emac/emac.c       |  4 +++
 2 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index 6ab0a3c..ad0e420 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -681,6 +681,7 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 	struct resource *res;
 	const struct of_device_id *match;
 	struct device_node *np;
+	int ret;
 
 	np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
 	if (!np) {
@@ -697,25 +698,48 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 	match = of_match_device(emac_sgmii_dt_match, &sgmii_pdev->dev);
 	if (!match) {
 		dev_err(&pdev->dev, "unrecognized internal phy node\n");
-		return -ENODEV;
+		ret = -ENODEV;
+		goto error_put_device;
 	}
 
 	phy->initialize = (emac_sgmii_initialize)match->data;
 
 	/* Base address is the first address */
 	res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 0);
-	phy->base = devm_ioremap_resource(&sgmii_pdev->dev, res);
-	if (IS_ERR(phy->base))
-		return PTR_ERR(phy->base);
+	phy->base = ioremap(res->start, resource_size(res));
+	if (IS_ERR(phy->base)) {
+		ret = PTR_ERR(phy->base);
+		goto error_put_device;
+	}
 
 	/* v2 SGMII has a per-lane digital digital, so parse it if it exists */
 	res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 1);
 	if (res) {
-		phy->digital = devm_ioremap_resource(&sgmii_pdev->dev, res);
-		if (IS_ERR(phy->base))
-			return PTR_ERR(phy->base);
-
+		phy->digital = ioremap(res->start, resource_size(res));
+		if (IS_ERR(phy->digital)) {
+			ret = PTR_ERR(phy->digital);
+			goto error_unmap_base;
+		}
 	}
 
-	return phy->initialize(adpt);
+	ret = phy->initialize(adpt);
+	if (ret)
+		goto error;
+
+	/* We've remapped the addresses, so we don't need the device any
+	 * more. of_find_device_by_node() says we should release it.
+	 */
+	put_device(&sgmii_pdev->dev);
+
+	return 0;
+
+error:
+	if (phy->digital)
+		iounmap(phy->digital);
+error_unmap_base:
+	iounmap(phy->base);
+error_put_device:
+	put_device(&sgmii_pdev->dev);
+
+	return ret;
 }
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index e47d387..429b4cb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -723,6 +723,10 @@ static int emac_remove(struct platform_device *pdev)
 	mdiobus_unregister(adpt->mii_bus);
 	free_netdev(netdev);
 
+	if (adpt->phy.digital)
+		iounmap(adpt->phy.digital);
+	iounmap(adpt->phy.base);
+
 	return 0;
 }
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH 3/3] [v2] net: qcom/emac: initial ACPI support
Add support for reading addresses, interrupts, and _DSD properties from ACPI tables, just like with device tree. The HID for the EMAC device itself is QCOM8070. The internal PHY is represented by a child node with a HID of QCOM8071.

The EMAC also has some complex clock initialization requirements that are not represented by this patch. This will be addressed in a future patch.

Signed-off-by: Timur Tabi
---
 drivers/net/ethernet/qualcomm/emac/emac-phy.c   | 37 ++---
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 72 ++---
 drivers/net/ethernet/qualcomm/emac/emac.c       | 12 +
 3 files changed, 95 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-phy.c b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
index c412ba9..da4e90d 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-phy.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include "emac.h"
 #include "emac-mac.h"
 #include "emac-phy.h"
@@ -167,7 +168,6 @@ static int emac_mdio_write(struct mii_bus *bus, int addr, int regnum, u16 val)
 int emac_phy_config(struct platform_device *pdev, struct emac_adapter *adpt)
 {
 	struct device_node *np = pdev->dev.of_node;
-	struct device_node *phy_np;
 	struct mii_bus *mii_bus;
 	int ret;
@@ -183,14 +183,37 @@ int emac_phy_config(struct platform_device *pdev, struct emac_adapter *adpt)
 	mii_bus->parent = &pdev->dev;
 	mii_bus->priv = adpt;
 
-	ret = of_mdiobus_register(mii_bus, np);
-	if (ret) {
-		dev_err(&pdev->dev, "could not register mdio bus\n");
-		return ret;
+	if (has_acpi_companion(&pdev->dev)) {
+		u32 phy_addr;
+
+		ret = mdiobus_register(mii_bus);
+		if (ret) {
+			dev_err(&pdev->dev, "could not register mdio bus\n");
+			return ret;
+		}
+		ret = device_property_read_u32(&pdev->dev, "phy-channel",
+					       &phy_addr);
+		if (ret)
+			/* If we can't read a valid phy address, then assume
+			 * that there is only one phy on this mdio bus.
+			 */
+			adpt->phydev = phy_find_first(mii_bus);
+		else
+			adpt->phydev = mdiobus_get_phy(mii_bus, phy_addr);
+
+	} else {
+		struct device_node *phy_np;
+
+		ret = of_mdiobus_register(mii_bus, np);
+		if (ret) {
+			dev_err(&pdev->dev, "could not register mdio bus\n");
+			return ret;
+		}
+
+		phy_np = of_parse_phandle(np, "phy-handle", 0);
+		adpt->phydev = of_phy_find_device(phy_np);
 	}
 
-	phy_np = of_parse_phandle(np, "phy-handle", 0);
-	adpt->phydev = of_phy_find_device(phy_np);
 	if (!adpt->phydev) {
 		dev_err(&pdev->dev, "could not find external phy\n");
 		mdiobus_unregister(mii_bus);
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index ad0e420..3d2c05a 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -14,6 +14,7 @@
  */
 
 #include
+#include
 #include
 #include "emac.h"
 #include "emac-mac.h"
@@ -662,6 +663,24 @@ void emac_sgmii_reset(struct emac_adapter *adpt)
 	clk_set_rate(adpt->clk[EMAC_CLK_HIGH_SPEED], 12500);
 }
 
+static int emac_sgmii_acpi_match(struct device *dev, void *data)
+{
+	static const struct acpi_device_id match_table[] = {
+		{
+			.id = "QCOM8071",
+			.driver_data = (kernel_ulong_t)emac_sgmii_init_v2,
+		},
+		{}
+	};
+	const struct acpi_device_id *id = acpi_match_device(match_table, dev);
+	emac_sgmii_initialize *initialize = data;
+
+	if (id)
+		*initialize = (emac_sgmii_initialize)id->driver_data;
+
+	return !!id;
+}
+
 static const struct of_device_id emac_sgmii_dt_match[] = {
 	{
 		.compatible = "qcom,fsm9900-emac-sgmii",
@@ -679,30 +698,45 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 	struct platform_device *sgmii_pdev = NULL;
 	struct emac_phy *phy = &adpt->phy;
 	struct resource *res;
-	const struct of_device_id *match;
-	struct device_node *np;
 	int ret;
 
-	np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
-	if (!np) {
-		dev_err(&pdev->dev, "missing internal-phy property\n");
-		return -ENODEV;
-	}
+	if (has_acpi_companion(&pdev->dev)) {
+		struct device *dev;
 
-	sgmii_pdev = of_find_device_by_node(np);
-	if (!sgmii_pdev) {
-		dev_err(&pdev->dev,
Re: ISDN-Gigaset: Fine-tuning for three function implementations
> Two of the five patches introduced bugs. The rest of the series isn't
> free of various nits either. Of course, I was in no mood to be lenient
> when I looked at those three patches.
>
> I won't take any of these patches, sorry.

Would you like to look once more into an improved patch series for this software module a bit later?

Regards,
Markus
Re: [PATCHv2 net 0/5] sctp: some fixes of prsctp polices
> > That doesn't matter. This happens all the time.
>
> The way you handle this is you submit patch #1 targetting 'net' and
> wait for it to get into 'net' and propagate into 'net-next'. When
> that happens you can submit patches #2-5 as a second patch series
> targetting 'net-next'.

Thanks, will reorganize them.
RE: [PATCH 3/3] net: fec: align IP header in hardware
From: Eric Nelson > Sent: 26 September 2016 19:40 > Hi David, > > On 09/26/2016 02:26 AM, David Laight wrote: > > From: Eric Nelson > >> Sent: 24 September 2016 15:42 > >> The FEC receive accelerator (RACC) supports shifting the data payload of > >> received packets by 16-bits, which aligns the payload (IP header) on a > >> 4-byte boundary, which is, if not required, at least strongly suggested > >> by the Linux networking layer. > > ... > >> + /* align IP header */ > >> + val |= FEC_RACC_SHIFT16; > > > > I can't help feeling that there needs to be corresponding > > changes to increase the buffer size by 2 (maybe for large mtu) > > and to discard two bytes from the frame length. > > > > In the normal case, the fec driver over-allocates all receive packets to > be of size FEC_ENET_RX_FRSIZE (2048) minus the value of rx_align, > which is either 0x0f (ARM) or 0x03 (PPC). > > If the frame length is less than rx_copybreak (typically 256), then > the frame length from the receive buffer descriptor is used to > control the allocation size for a copied buffer, and this will include > the two bytes of padding if RACC_SHIFT16 is set. > > > If probably ought to be predicated on NET_IP_ALIGN as well. > > > Can you elaborate? >From reading this it seems that the effect of FEC_RACC_SHIFT16 is to add two bytes of 'junk' to the start of every receive frame. In the 'copybreak' case the new skb would need to be 2 bytes shorter than the length reported by the hardware, and the data copied from 2 bytes into the dma buffer. The extra 2 bytes also mean the that maximum mtu that can be received into a buffer is two bytes less. If someone sets the mtu to (say) 9k for jumbo frames this might matter. Even with fixed 2048 byte buffers it reduces the maximum value the mtu can be set to by 2. Now if NET_IP_ALIGN is zero then it is fine for the rx frame to start on a 4n boundary, and the skb are likely to be allocated that way. In this case you don't want to extra two bytes of 'junk'. 
OTOH if NET_IP_ALIGN is 2 then you need to 'fiddle' things so that the data is dma'd to offset -2 in the skb and then ensure that the end of frame is set correctly. David
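David's copybreak point can be sketched in a small userspace model: with SHIFT16 enabled the hardware reports a frame length that includes two bytes of padding, so a copybreak-style copy should skip the pad bytes and report a length two bytes shorter. This is an illustrative sketch, not the fec driver's actual code (the function name and the runt/capacity checks are assumptions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SHIFT16_PAD 2   /* bytes prepended when FEC_RACC_SHIFT16 is set */

/* Copy a received frame out of the DMA buffer, dropping the two pad
 * bytes and returning the corrected length (0 on error). */
static size_t copybreak_copy(const uint8_t *dma_buf, size_t hw_len,
                             uint8_t *dst, size_t dst_cap)
{
    size_t len;

    if (hw_len <= SHIFT16_PAD)
        return 0;                 /* runt frame: nothing usable */
    len = hw_len - SHIFT16_PAD;   /* length reported to the stack */
    if (len > dst_cap)
        return 0;                 /* destination buffer too small */
    memcpy(dst, dma_buf + SHIFT16_PAD, len);
    return len;
}
```

The same arithmetic is where the MTU observation comes from: with a fixed buffer, the usable frame size shrinks by `SHIFT16_PAD`.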
Re: [PATCH 1/5] ISDN-Gigaset: Use kmalloc_array() in two functions
>> * Multiplications for the size determination of memory allocations >> indicated that array data structures should be processed. >> Thus use the corresponding function "kmalloc_array". > > Was the current code incorrect? I suggest using a safer interface for array allocations. > What makes kmalloc_array() better? 1. What do you think about the safety checks that this function provides? 2. Will you also be affected by further software evolution here? 2016-07-26 mm: faster kmalloc_array(), kcalloc() https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=91c6a05f72a996bee5133e76374ab3ad7d3b9b72 > I'm not going to change code just because some checker suggests to do so. The script "checkpatch.pl" can point out information like the following. WARNING: Prefer kmalloc_array over kmalloc with multiply >> This issue was detected by using the Coccinelle software. > > So? And which coccinelle script was actually used? What do you think about looking into related information sources? https://github.com/coccinelle/coccinelle/issues/81 Would you like to experiment any further with an excerpt?

@replacement1@
expression count, target;
type T;
@@
 target =
-          kmalloc(sizeof(T) * (count)
+          kmalloc_array(count, sizeof(T)
           , ...);

@replacement2@
expression count, pointer, target;
@@
 target =
-          kmalloc(sizeof(*pointer) * (count)
+          kmalloc_array(count, sizeof(*pointer)
           , ...);

> I couldn't spot a coccinelle script doing that in the current tree. This is true for such a software update opportunity. >> * Replace the specification of a data structure by a pointer dereference >> to make the corresponding size determination a bit safer according to >> the Linux coding style convention. > > I'm not happy with you mixing this with the above, less trivial, change. I find that it is a useful combination. - A parameter is adjusted together with a special function name. 
>> -drv->cs = kmalloc(minors * sizeof *drv->cs, GFP_KERNEL); >> +drv->cs = kmalloc_array(minors, sizeof(*drv->cs), GFP_KERNEL); > > For "minors" the same holds as for "channels", above. > > And you snuck in a parentheses change. That should have probably been > merged with 5/5. Would you prefer to add them in another update step? Regards, Markus
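The safety argument for kmalloc_array() comes down to the overflow check it performs before multiplying. A userspace sketch of that check (the kernel's own version in include/linux/slab.h is equivalent in spirit; names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace model of kmalloc_array(): refuse the allocation when
 * n * size would overflow size_t, instead of silently wrapping and
 * returning a buffer smaller than the caller believes it is. */
static void *alloc_array(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;            /* multiplication would overflow */
    return malloc(n * size);
}
```

With a plain `kmalloc(minors * sizeof(*drv->cs), ...)` an attacker- or caller-controlled `minors` can wrap the multiplication; the array helper turns that into a clean allocation failure.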
RE: [PATCH] fs/select: add vmalloc fallback for select(2)
From: Vlastimil Babka > Sent: 27 September 2016 12:51 ... > Process name suggests it's part of db2 database. It seems it has to implement > its own interface to select() syscall, because glibc itself seems to have a > FD_SETSIZE limit of 1024, which is probably why this wasn't an issue for all > the > years... ISTR the canonical way to increase the size being to set FD_SETSIZE to a larger value before including any of the headers. Or doesn't that work with linux and glibc ?? David
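On the FD_SETSIZE question: on glibc (and musl) `fd_set` is a fixed 1024-bit array, so defining FD_SETSIZE before the headers does not actually grow it, which is presumably why the application implements its own select() interface. The usual userspace workaround is to allocate a larger bitmap yourself and pass it to select(), since the kernel only examines the first `nfds` bits. A hedged sketch (the sizing and helper are illustrative):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/select.h>
#include <unistd.h>

/* Demonstrate select() on a heap-allocated fd bitmap sized for more
 * than the glibc default of 1024 fds.  Returns 1 if the readable
 * pipe end is reported, 0 otherwise. */
static int big_fdset_demo(void)
{
    int pipefd[2];
    struct timeval tv = { 0, 0 };
    /* room for fds 0..4095; the kernel only looks at nfds bits */
    fd_set *rfds = calloc(1, 4096 / 8);
    int ok = 0;

    if (rfds && pipe(pipefd) == 0 && write(pipefd[1], "x", 1) == 1) {
        FD_SET(pipefd[0], rfds);    /* fd is small, so the macro is safe */
        ok = select(pipefd[0] + 1, rfds, NULL, NULL, &tv) == 1 &&
             FD_ISSET(pipefd[0], rfds);
        close(pipefd[0]);
        close(pipefd[1]);
    }
    free(rfds);
    return ok;
}
```

Caveat: glibc's FD_SET/FD_ISSET macros do not bounds-check against the larger allocation (and fortified builds may check against FD_SETSIZE), so fds above 1023 need hand-rolled bit operations.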
Re: [PATCH net-next 1/2] net: phy: Add Wake-on-LAN driver for Microsemi PHYs.
> +#define MSCC_PHY_WOL_MAC_CONTROL 27 > +#define EDGE_RATE_CNTL_POS 5 > +#define EDGE_RATE_CNTL_MASK 0x00E0 This patch does not require these two #defines. Please indicate in the cover note if the patches depend on other patches in order to cleanly apply. Or if these patches are going to conflict with some other patches. > + reg_val = phy_read(phydev, MSCC_PHY_WOL_MAC_CONTROL); > + if (wol_conf->wolopts & WAKE_MAGICSECURE) > + reg_val |= SECURE_ON_ENABLE; > + else > + reg_val &= ~SECURE_ON_ENABLE; > + phy_write(phydev, MSCC_PHY_WOL_MAC_CONTROL, reg_val); > + > + if (wol_conf->wolopts & WAKE_MAGICSECURE) { > + reg_val = wol_conf->sopass[4] << 8; > + reg_val |= wol_conf->sopass[5]; > + phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, reg_val); > + reg_val = wol_conf->sopass[2] << 8; > + reg_val |= wol_conf->sopass[3]; > + phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, reg_val); > + reg_val = wol_conf->sopass[0] << 8; > + reg_val |= wol_conf->sopass[1]; > + phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, reg_val); > + } else { > + phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, 0); > + phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, 0); > + phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, 0); > + } Wouldn't it be better to set the password, and then enable the password feature? I don't know much about WOL. Hopefully Florian will add further comments. Andrew
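Andrew's ordering suggestion can be sketched as follows: program (or clear) the SecureOn password words first, and only then touch the enable bit, so the compare logic is never armed against a stale password. This is a toy model, not the driver: the password register numbers and the SECURE_ON_ENABLE bit position are assumptions, and `regs[]`/`phy_read()`/`phy_write()` stand in for the MDIO accessors.

```c
#include <assert.h>
#include <stdint.h>

#define MSCC_PHY_WOL_LOWER_PASSWD 21      /* register numbers assumed */
#define MSCC_PHY_WOL_MID_PASSWD   22
#define MSCC_PHY_WOL_UPPER_PASSWD 23
#define MSCC_PHY_WOL_MAC_CONTROL  27
#define SECURE_ON_ENABLE          0x8000  /* bit position assumed */

static uint16_t regs[32];                          /* stub PHY registers */
static uint16_t phy_read(int reg)              { return regs[reg]; }
static void     phy_write(int reg, uint16_t v) { regs[reg] = v; }

/* Password first, enable bit last. */
static void wol_set_secure_on(const uint8_t sopass[6], int enable)
{
    uint16_t ctl;

    if (enable) {
        phy_write(MSCC_PHY_WOL_UPPER_PASSWD, sopass[0] << 8 | sopass[1]);
        phy_write(MSCC_PHY_WOL_MID_PASSWD,   sopass[2] << 8 | sopass[3]);
        phy_write(MSCC_PHY_WOL_LOWER_PASSWD, sopass[4] << 8 | sopass[5]);
    } else {
        phy_write(MSCC_PHY_WOL_UPPER_PASSWD, 0);
        phy_write(MSCC_PHY_WOL_MID_PASSWD,   0);
        phy_write(MSCC_PHY_WOL_LOWER_PASSWD, 0);
    }

    ctl = phy_read(MSCC_PHY_WOL_MAC_CONTROL);
    if (enable)
        ctl |= SECURE_ON_ENABLE;
    else
        ctl &= ~SECURE_ON_ENABLE;
    phy_write(MSCC_PHY_WOL_MAC_CONTROL, ctl);
}
```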
Re: [net-next 5/5] PCI: disable FLR for 82579 device
On Wed, Sep 28, 2016 at 03:33:52PM +, Neftin, Sasha wrote: > > Since I worked with Sasha on this I will provide a bit of information from > what I understand of this bug as well. > > On Tue, Sep 27, 2016 at 12:13 PM, Alex Williamson >wrote: > > On Tue, 27 Sep 2016 13:17:02 -0500 > > Bjorn Helgaas wrote: > > > >> On Sun, Sep 25, 2016 at 10:02:43AM +0300, Neftin, Sasha wrote: > >> > On 9/24/2016 12:05 AM, Jeff Kirsher wrote: > >> > >On Fri, 2016-09-23 at 09:01 -0500, Bjorn Helgaas wrote: > >> > >>On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote: > >> > >>>From: Sasha Neftin > >> > >>> > >> > >>>82579 has a problem reattaching itself after the device is detached. > >> > >>>The bug was reported by Redhat. The suggested fix is to disable > >> > >>>FLR capability in PCIe configuration space. > >> > >>> > >> > >>>Reproduction: > >> > >>>Attach the device to a VM, then detach and try to attach again. > >> > >>> > >> > >>>Fix: > >> > >>>Disable FLR capability to prevent the 82579 from hanging. > >> > >>Is there a bugzilla or other reference URL to include here? > >> > >>Should this be marked for stable? > >> > >So the author is in Israel, meaning it is their weekend now. I do > >> > >not believe Sasha monitors email over the weekend, so a response > >> > >to your questions won't happen for a few days. > >> > > > >> > >I tried searching my archives for more information, but had no > >> > >luck finding any additional information. > >> > > > > I agree that we do probably need to update the patch description since it > isn't exactly clear what this is fixing or what was actually broken. 
> > >> > >>>Signed-off-by: Sasha Neftin > >> > >>>Tested-by: Aaron Brown > >> > >>>Signed-off-by: Jeff Kirsher > >> > >>>--- > >> > >>> drivers/pci/quirks.c | 21 + > >> > >>> 1 file changed, 21 insertions(+) > >> > >>> > >> > >>>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index > >> > >>>44e0ff3..59fba6e 100644 > >> > >>>--- a/drivers/pci/quirks.c > >> > >>>+++ b/drivers/pci/quirks.c > >> > >>>@@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct > >> > >>>pci_dev *pdev) > >> > >>> } > >> > >>> } > >> > >>> DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, > >> > >>>quirk_intel_qat_vf_cap); > >> > >>>+/* > >> > >>>+ * Workaround FLR issues for 82579 > >> > >>>+ * This code disables the FLR (Function Level Reset) via PCIe, > >> > >>>+in > >> > >>>order > >> > >>>+ * to workaround a bug found while using device passthrough, > >> > >>>+ where the > >> > >>>+ * interface would become non-responsive. > >> > >>>+ * NOTE: the FLR bit is Read/Write Once (RWO) in config space, > >> > >>>+ so if > >> > >>>+ * the BIOS or kernel writes this register * then this > >> > >>>+ workaround will > >> > >>>+ * not work. > >> > >>This doesn't sound like a root cause. Is the issue a hardware > >> > >>erratum? Linux PCI core bug? VFIO bug? Device firmware bug? > >> > >> > >> > >>The changelog suggests that the problem only affects passthrough, > >> > >>which suggests some sort of kernel bug related to how passthrough > >> > >>is implemented. > >> > >> If this bug affects all scenarios, not just passthrough, the > >> changelog should not mention passthrough. 
> >> > >> > >>>+ */ > >> > >>>+static void quirk_intel_flr_cap_dis(struct pci_dev *dev) { > >> > >>>+int pos = pci_find_capability(dev, PCI_CAP_ID_AF); > >> > >>>+if (pos) { > >> > >>>+u8 cap; > >> > >>>+pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap); > >> > >>>+cap = cap & (~PCI_AF_CAP_FLR); > >> > >>>+pci_write_config_byte(dev, pos + PCI_AF_CAP, cap); > >> > >>>+} > >> > >>>+} > >> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, > >> > >>>quirk_intel_flr_cap_dis); > >> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, > >> > >>>quirk_intel_flr_cap_dis); > >> > >>>-- > >> > >>>2.7.4 > >> > >>> > >> > >>>-- > >> > >>>To unsubscribe from this list: send the line "unsubscribe > >> > >>>linux-pci" in the body of a message to majord...@vger.kernel.org > >> > >>>More majordomo info at > >> > >>>http://vger.kernel.org/majordomo-info.html > >> > > >> > Hello, > >> > > >> > Original bugzilla thread could be found here: > >> > https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=966840 > >> > >> That bugzilla is private and I can't read it. > > > > Hmm, I can, but I don't see anything in it that supports this. Is > > that really the right bz? It's the right hardware, but has all sorts > > of FUD about the version of various other components in the stack. > > It looks like we had a local copy of the bugzilla saved here, though it only > goes up to comment 13 which is where I think we started working this on our > side. I believe what this
Re: [PATCH v4 net-next] net: phy: Add Edge-rate driver for Microsemi PHYs.
> +Optional properties: > +- vsc8531,vddmac : The vddmac in mV. > +- vsc8531,edge-slowdown : % the edge should be slowed down relative to > + the fastest possible edge time. Native sign > + need not enter. > + Edge rate sets the drive strength of the MAC > + interface output signals. Changing the drive > + strength will affect the edge rate of the output > + signal. The goal of this setting is to help > + reduce electrical emission (EMI) by being able > + to reprogram drive strength and in effect slow > + down the edge rate if desired. Table 5 shows the Hi Raju There is no table five here? Is that a reference to a data sheet table? > +Example: > + > +vsc8531_0: ethernet-phy@0 { > +compatible = "ethernet-phy-id0007.0570"; > +vsc8531,vddmac = /bits/ 16 ; > +vsc8531,edge-slowdown= /bits/ 8 <17>; No, real values please: vsc8531,vddmac = <2000>; vsc8531,edge-slowdown = <21>; The driver should then do the maths to figure out the nearest magic value to write to the register, or complain the settings are out of range with an -EINVAL. FYI: No floating point maths are allowed in the kernel. Andrew
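Andrew's "do the maths in the driver" request can be sketched with integer-only arithmetic. The table below is invented for illustration; the real slowdown percentages come from the PHY data sheet and also depend on the vddmac supply.

```c
#include <assert.h>

#define EINVAL 22

/* Hypothetical mapping from register code (0..7) to achieved edge
 * slowdown in percent.  Values are made up for this sketch. */
static const int slowdown_pct[8] = { 0, 2, 4, 7, 10, 17, 29, 53 };

/* Pick the register code whose slowdown is nearest the request,
 * using integer math only (no floating point in the kernel), and
 * reject out-of-range requests with -EINVAL. */
static int edge_rate_code(int requested_pct)
{
    int i, best = 0, best_diff = requested_pct;   /* slowdown_pct[0] == 0 */

    if (requested_pct < 0 || requested_pct > slowdown_pct[7])
        return -EINVAL;         /* out of range: reject, don't guess */

    for (i = 1; i < 8; i++) {
        int diff = requested_pct - slowdown_pct[i];

        if (diff < 0)
            diff = -diff;
        if (diff < best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return best;
}
```

In the real driver the chosen code would then be masked and shifted into the edge-rate control field rather than returned.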
Local variable ordering (was Re: [PATCH v5 0/7] Reduce cache miss for snmp_fold_field)
On 28/09/16 14:45, hejianet wrote: > > > On 9/28/16 5:08 PM, David Miller wrote: >> From: Jia He >> Date: Wed, 28 Sep 2016 14:22:21 +0800 >> >>> v5: >>> - order local variables from longest to shortest line >> I still see many cases where this problem still exists. Please >> do not resubmit this patch series until you fix all of them. >> >> Patch #2: >> >> -static int snmp_seq_show(struct seq_file *seq, void *v) >> +static int snmp_seq_show_ipstats(struct seq_file *seq, void *v) >> { >> int i; >> +u64 buff64[IPSTATS_MIB_MAX]; >> struct net *net = seq->private; >> >> The order should be "net" then "buff64" then "i". > Sorry for my bad eyesight and quick hand :( > B.R. > Jia A few months back I wrote a script to help find these: https://github.com/ecree-solarflare/xmastree It can check a source file or a patch. Posting in the hope that people will find it useful. -Ed
Re: [PATCHv2 net 0/5] sctp: some fixes of prsctp policies
On Wed, Sep 28, 2016 at 11:17:16AM -0400, David Miller wrote: > From: Xin Long > Date: Wed, 28 Sep 2016 20:35:40 +0800 > > >> > >>> This patchset is to improve some code about prsctp policies, and also > >>> to fix some issues. > >>> > >>> v1->v2: > >>> - wrap the check of chunk->sent_count in a macro: > >>> sctp_chunk_retransmitted in patch 2/5. > >> > >> This series is a mix of bug fixes (patch #1) which should be targeting > >> 'net' and simplifications/cleanups (patch #2-5) which should be targeting > >> 'net-next'. > >> > >> Please do not mix things up like this, and submit patches targeting > >> the appropriate tree. > >> > > understand, I wanted to divide them. > > but this patchset is special, the following patches > > depend on patch 1/5. > > That doesn't matter. This happens all the time. > > The way you handle this is you submit patch #1 targeting 'net' and > wait for it to get into 'net' and propagate into 'net-next'. When > that happens you can submit patches #2-5 as a second patch series > targeting 'net-next'. Thanks for your patience for explaining this, David, appreciated. Marcelo
[PATCH net-next] net: do not export sk_stream_write_space
From: Eric Dumazet Since commit 900f65d361d3 ("tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need to export sk_stream_write_space(). Signed-off-by: Eric Dumazet Cc: Neal Cardwell --- net/core/stream.c |1 - 1 file changed, 1 deletion(-) diff --git a/net/core/stream.c b/net/core/stream.c index 159516a11b7e84f7e64e0acfa2dd55dab82307bd..1086c8b280a868101df7b44691eccac40bd7b3e1 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -43,7 +43,6 @@ void sk_stream_write_space(struct sock *sk) rcu_read_unlock(); } } -EXPORT_SYMBOL(sk_stream_write_space); /** * sk_stream_wait_connect - Wait for a socket to get into the connected state
[PATCH nf-next v4 0/2] fixes for recent nf_compact hooks
Two possible error conditions were caught during an extended testing session, and by a build robot. These patches fix the two issues (a missing handler when config is changed, and a potential NULL dereference). Aaron Conole (2): netfilter: Fix potential null pointer dereference nf_set_hooks_head: accommodate different kconfig net/netfilter/core.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) -- 2.5.5
[PATCH nf-next v4 1/2] netfilter: Fix potential null pointer dereference
It's possible for nf_hook_entry_head to return NULL. If two nf_unregister_net_hook calls happen simultaneously with a single hook entry in the list, both will enter the nf_hook_mutex critical section. The first will successfully delete the head, but the second will see this NULL pointer and attempt to dereference. This fix ensures that no null pointer dereference could occur when such a condition happens. Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list") Signed-off-by: Aaron Conole --- net/netfilter/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 360c63d..e58e420 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg) mutex_lock(&nf_hook_mutex); hooks_entry = nf_hook_entry_head(net, reg); - if (hooks_entry->orig_ops == reg) { + if (hooks_entry && hooks_entry->orig_ops == reg) { nf_set_hooks_head(net, reg, nf_entry_dereference(hooks_entry->next)); goto unlock; -- 2.7.4
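The race being guarded against can be modeled with a toy singly linked list (this is not the netfilter data structure, just the shape of the bug): with one entry in the list, a second unregister finds an empty head and must not dereference it.

```c
#include <assert.h>
#include <stddef.h>

struct hook_entry {
    struct hook_entry *next;
    const void *orig_ops;
};

static struct hook_entry *head;   /* stands in for the per-net hook head */

/* Unlink the head entry matching @ops; returns 1 on success, 0 if the
 * list is empty or the head does not match.  The NULL test is exactly
 * the check the patch adds. */
static int unregister_hook(const void *ops)
{
    struct hook_entry *e = head;

    if (e && e->orig_ops == ops) {
        head = e->next;
        return 1;
    }
    /* the real function also walks the rest of the list; omitted */
    return 0;
}

/* Two back-to-back unregisters of the same single entry: the second
 * must fail cleanly rather than crash on the NULL head. */
static int double_unregister_demo(void)
{
    static struct hook_entry e = { NULL, (const void *)0x1 };

    head = &e;
    return unregister_hook((const void *)0x1) == 1 &&
           unregister_hook((const void *)0x1) == 0;
}
```

In the kernel the two calls are serialized by nf_hook_mutex, so the second caller observes the empty list rather than a half-updated one; the fix handles that observation.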
Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference
Eric Dumazet writes: > On Wed, 2016-09-28 at 10:56 -0400, Aaron Conole wrote: >> Eric Dumazet writes: >> >> > On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote: >> >> It's possible for nf_hook_entry_head to return NULL. If two >> >> nf_unregister_net_hook calls happen simultaneously with a single hook >> >> entry in the list, both will enter the nf_hook_mutex critical section. >> >> The first will successfully delete the head, but the second will see >> >> this NULL pointer and attempt to dereference. >> >> >> >> This fix ensures that no null pointer dereference could occur when such >> >> a condition happens. >> >> >> >> Signed-off-by: Aaron Conole >> >> --- >> >> net/netfilter/core.c | 2 +- >> >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> >> >> diff --git a/net/netfilter/core.c b/net/netfilter/core.c >> >> index 360c63d..e58e420 100644 >> >> --- a/net/netfilter/core.c >> >> +++ b/net/netfilter/core.c >> >> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const >> >> struct nf_hook_ops *reg) >> >> >> >> mutex_lock(&nf_hook_mutex); >> >> hooks_entry = nf_hook_entry_head(net, reg); >> >> - if (hooks_entry->orig_ops == reg) { >> >> + if (hooks_entry && hooks_entry->orig_ops == reg) { >> >> nf_set_hooks_head(net, reg, >> >> nf_entry_dereference(hooks_entry->next)); >> >> goto unlock; >> > >> > When was the bug added exactly ? >> >> Sunday, on the nf-next tree. >> >> > For all bug fixes, you need to add a Fixes: tag. >> > >> > Like : >> > >> > Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked >> > list") >> >> I would but it's in nf-next tree, and I'm not sure how pulls go. If >> they are done via patch imports, then the sha sums will be wrong and the >> commit message will be misleading. If the sums are preserved, then I >> can resubmit with this information. >> > > I gave the (12 digits) sha-1 as present in David Miller net-next tree. 
> > https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d > > > This wont change, because David never rebases his tree under normal > operations. > > Thanks. Thank you very much, Eric. I've reposted. -Aaron
[PATCH nf-next v4 2/2] nf_set_hooks_head: accommodate different kconfig
When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle the request for registration properly by dropping the hook. This releases the entry during the set. Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list") Signed-off-by: Aaron Conole--- net/netfilter/core.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/net/netfilter/core.c b/net/netfilter/core.c index e58e420..61e8a9d 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -90,10 +90,12 @@ static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg, { switch (reg->pf) { case NFPROTO_NETDEV: +#ifdef CONFIG_NETFILTER_INGRESS /* We already checked in nf_register_net_hook() that this is * used from ingress. */ rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry); +#endif break; default: rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum], @@ -107,10 +109,15 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg) struct nf_hook_entry *hooks_entry; struct nf_hook_entry *entry; - if (reg->pf == NFPROTO_NETDEV && - (reg->hooknum != NF_NETDEV_INGRESS || -!reg->dev || dev_net(reg->dev) != net)) - return -EINVAL; + if (reg->pf == NFPROTO_NETDEV) { +#ifndef CONFIG_NETFILTER_INGRESS + if (reg->hooknum == NF_NETDEV_INGRESS) + return -EOPNOTSUPP; +#endif + if (reg->hooknum != NF_NETDEV_INGRESS || + !reg->dev || dev_net(reg->dev) != net) + return -EINVAL; + } entry = kmalloc(sizeof(*entry), GFP_KERNEL); if (!entry) -- 2.7.4
RE: [net-next 5/5] PCI: disable FLR for 82579 device
Since I worked with Sasha on this I will provide a bit of information from what I understand of this bug as well. On Tue, Sep 27, 2016 at 12:13 PM, Alex Williamsonwrote: > On Tue, 27 Sep 2016 13:17:02 -0500 > Bjorn Helgaas wrote: > >> On Sun, Sep 25, 2016 at 10:02:43AM +0300, Neftin, Sasha wrote: >> > On 9/24/2016 12:05 AM, Jeff Kirsher wrote: >> > >On Fri, 2016-09-23 at 09:01 -0500, Bjorn Helgaas wrote: >> > >>On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote: >> > >>>From: Sasha Neftin >> > >>> >> > >>>82579 has a problem reattaching itself after the device is detached. >> > >>>The bug was reported by Redhat. The suggested fix is to disable >> > >>>FLR capability in PCIe configuration space. >> > >>> >> > >>>Reproduction: >> > >>>Attach the device to a VM, then detach and try to attach again. >> > >>> >> > >>>Fix: >> > >>>Disable FLR capability to prevent the 82579 from hanging. >> > >>Is there a bugzilla or other reference URL to include here? >> > >>Should this be marked for stable? >> > >So the author is in Israel, meaning it is their weekend now. I do >> > >not believe Sasha monitors email over the weekend, so a response >> > >to your questions won't happen for a few days. >> > > >> > >I tried searching my archives for more information, but had no >> > >luck finding any additional information. >> > > I agree that we do probably need to update the patch description since it isn't exactly clear what this is fixing or what was actually broken. 
>> > >>>Signed-off-by: Sasha Neftin >> > >>>Tested-by: Aaron Brown >> > >>>Signed-off-by: Jeff Kirsher >> > >>>--- >> > >>> drivers/pci/quirks.c | 21 + >> > >>> 1 file changed, 21 insertions(+) >> > >>> >> > >>>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index >> > >>>44e0ff3..59fba6e 100644 >> > >>>--- a/drivers/pci/quirks.c >> > >>>+++ b/drivers/pci/quirks.c >> > >>>@@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct >> > >>>pci_dev *pdev) >> > >>> } >> > >>> } >> > >>> DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, >> > >>>quirk_intel_qat_vf_cap); >> > >>>+/* >> > >>>+ * Workaround FLR issues for 82579 >> > >>>+ * This code disables the FLR (Function Level Reset) via PCIe, >> > >>>+in >> > >>>order >> > >>>+ * to workaround a bug found while using device passthrough, >> > >>>+ where the >> > >>>+ * interface would become non-responsive. >> > >>>+ * NOTE: the FLR bit is Read/Write Once (RWO) in config space, >> > >>>+ so if >> > >>>+ * the BIOS or kernel writes this register * then this >> > >>>+ workaround will >> > >>>+ * not work. >> > >>This doesn't sound like a root cause. Is the issue a hardware >> > >>erratum? Linux PCI core bug? VFIO bug? Device firmware bug? >> > >> >> > >>The changelog suggests that the problem only affects passthrough, >> > >>which suggests some sort of kernel bug related to how passthrough >> > >>is implemented. >> >> If this bug affects all scenarios, not just passthrough, the >> changelog should not mention passthrough. 
>> >> > >>>+ */ >> > >>>+static void quirk_intel_flr_cap_dis(struct pci_dev *dev) { >> > >>>+int pos = pci_find_capability(dev, PCI_CAP_ID_AF); >> > >>>+if (pos) { >> > >>>+u8 cap; >> > >>>+pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap); >> > >>>+cap = cap & (~PCI_AF_CAP_FLR); >> > >>>+pci_write_config_byte(dev, pos + PCI_AF_CAP, cap); >> > >>>+} >> > >>>+} >> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, >> > >>>quirk_intel_flr_cap_dis); >> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, >> > >>>quirk_intel_flr_cap_dis); >> > >>>-- >> > >>>2.7.4 >> > >>> >> > >>>-- >> > >>>To unsubscribe from this list: send the line "unsubscribe >> > >>>linux-pci" in the body of a message to majord...@vger.kernel.org >> > >>>More majordomo info at >> > >>>http://vger.kernel.org/majordomo-info.html >> > >> > Hello, >> > >> > Original bugzilla thread could be found here: >> > https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=966840 >> >> That bugzilla is private and I can't read it. > > Hmm, I can, but I don't see anything in it that supports this. Is > that really the right bz? It's the right hardware, but has all sorts > of FUD about the version of various other components in the stack. It looks like we had a local copy of the bugzilla saved here, though it only goes up to comment 13 which is where I think we started working this on our side. I believe what this patch is attempting to resolve is related to comment 8 where the driver returned "probe of 0000:00:19.0 failed with error -2" instead of correctly probing the interface. So the bug as reported was that e1000e had a problem reattaching itself to the PHY after it was attached to a VM.
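The quirk itself is a plain read-modify-write of the Advanced Features capability byte. A userspace model against a stub config space (the PCI_AF_* offsets and bits below match include/uapi/linux/pci_regs.h; the `cfg[]` array stands in for pci_read_config_byte()/pci_write_config_byte()):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_AF_CAP      3       /* AF capabilities byte, from pci_regs.h */
#define PCI_AF_CAP_TP   0x01    /* transactions pending supported */
#define PCI_AF_CAP_FLR  0x02    /* function level reset supported */

static uint8_t cfg[256];        /* stub PCI config space */

/* Clear the FLR bit in the AF capability at @pos so the PCI core never
 * attempts a Function Level Reset on this device.  As the changelog
 * notes, the real register is write-once, so this only helps if the
 * BIOS/kernel has not already written it. */
static uint8_t flr_cap_disable(unsigned int pos)
{
    uint8_t cap = cfg[pos + PCI_AF_CAP];   /* pci_read_config_byte() */

    cap &= ~PCI_AF_CAP_FLR;
    cfg[pos + PCI_AF_CAP] = cap;           /* pci_write_config_byte() */
    return cap;
}
```

Other capability bits (such as TP) are preserved; only FLR advertisement is removed, which makes VFIO fall back to a different reset method for the 82579.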
Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference
On Wed, 2016-09-28 at 10:56 -0400, Aaron Conole wrote: > Eric Dumazet writes: > > > On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote: > >> It's possible for nf_hook_entry_head to return NULL. If two > >> nf_unregister_net_hook calls happen simultaneously with a single hook > >> entry in the list, both will enter the nf_hook_mutex critical section. > >> The first will successfully delete the head, but the second will see > >> this NULL pointer and attempt to dereference. > >> > >> This fix ensures that no null pointer dereference could occur when such > >> a condition happens. > >> > >> Signed-off-by: Aaron Conole > >> --- > >> net/netfilter/core.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/netfilter/core.c b/net/netfilter/core.c > >> index 360c63d..e58e420 100644 > >> --- a/net/netfilter/core.c > >> +++ b/net/netfilter/core.c > >> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const > >> struct nf_hook_ops *reg) > >> > >>mutex_lock(&nf_hook_mutex); > >>hooks_entry = nf_hook_entry_head(net, reg); > >> - if (hooks_entry->orig_ops == reg) { > >> + if (hooks_entry && hooks_entry->orig_ops == reg) { > >>nf_set_hooks_head(net, reg, > >> nf_entry_dereference(hooks_entry->next)); > >>goto unlock; > > > > When was the bug added exactly ? > > Sunday, on the nf-next tree. > > > For all bug fixes, you need to add a Fixes: tag. > > > > Like : > > > > Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list") > > I would but it's in nf-next tree, and I'm not sure how pulls go. If > they are done via patch imports, then the sha sums will be wrong and the > commit message will be misleading. If the sums are preserved, then I > can resubmit with this information. > I gave the (12 digits) sha-1 as present in David Miller net-next tree. 
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d This wont change, because David never rebases his tree under normal operations. Thanks.
Re: [PATCH] ipv6 addrconf: enable use of proc_dointvec_minmax in addrconf_sysctl
From: Maciej Żenczykowski Date: Wed, 28 Sep 2016 22:23:10 +0900 > Anyway, enough, I give up, this isn't worth my time, and it's also not > worth your time. > I removed the dependency from the other patches and squashed them all into > 1 to make reviewing easier. Splitting things up is sometimes warranted, but you have to keep together the things that strongly need each other. And new facilities require examples of their use. They really do. I simply can't look at your patch and empty commit message and figure out why you needed to do what you were doing. Any developer should be able to look at a patch being proposed and be able to understand it standing on its own. This means they shouldn't have to know what happened in this discussion thread or that one in order to evaluate and audit the patch properly.
Re: [PATCHv2 net 0/5] sctp: some fixes of prsctp policies
From: Xin Long Date: Wed, 28 Sep 2016 20:35:40 +0800 >> >>> This patchset is to improve some code about prsctp policies, and also >>> to fix some issues. >>> >>> v1->v2: >>> - wrap the check of chunk->sent_count in a macro: >>> sctp_chunk_retransmitted in patch 2/5. >> >> This series is a mix of bug fixes (patch #1) which should be targeting >> 'net' and simplifications/cleanups (patch #2-5) which should be targeting >> 'net-next'. >> >> Please do not mix things up like this, and submit patches targeting >> the appropriate tree. >> > understand, I wanted to divide them. > but this patchset is special, the following patches > depend on patch 1/5. That doesn't matter. This happens all the time. The way you handle this is you submit patch #1 targeting 'net' and wait for it to get into 'net' and propagate into 'net-next'. When that happens you can submit patches #2-5 as a second patch series targeting 'net-next'.
Re: [PATCH RFC 0/4] xfs: Transmit flow steering
Here is a quick look at performance tests for the result of trying the prototype fix for the packet reordering problem with VMs sending over an XPS-configured NIC, in particular the Emulex/Avago/Broadcom Skyhawk. The fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After: 8897 Mbit/s

That was from a VM on a node with a Skyhawk and 2 E5-2640 processors to bare metal E5-2640 with a BE3. Physical MTU was 1500, the VM's vNIC's MTU was 1400. Systems were HPE ProLiants in OS Control Mode for power management, with the "performance" frequency governor loaded. An OpenStack Mitaka setup with Distributed Virtual Router. We had some other NIC types in the setup as well. XPS was also enabled on the ConnectX3-Pro. It was not enabled on the 82599ES (a function of the kernel being used, which had it disabled from the first reports of XPS negatively affecting VM traffic at the beginning of the year).

Average Mbit/s From NIC type To Bare Metal BE3:

NIC Type, CPU on VM Host     Before   After
ConnectX-3 Pro, E5-2670v3    9224     9271
BE3, E5-2640                 9016     9022
82599, E5-2640               9192     9003
BCM57840, E5-2640            9213     9153
Skyhawk, E5-2640             3884     8897

For completeness: Average Mbit/s To NIC type from Bare Metal BE3:

NIC Type, CPU on VM Host     Before   After
ConnectX-3 Pro, E5-2670v3    9322     9144
BE3, E5-2640                 9074     9017
82599, E5-2640               8670     8564
BCM57840, E5-2640            2468 *   7979
Skyhawk, E5-2640             8897     9269

* This is the busted bnx2x NIC FW GRO implementation issue. It was not visible in the "After" because the system was set up to disable the NIC FW GRO by the time it booted on the fix kernel. 
Average Transactions/s Between NIC type and Bare Metal BE3:

NIC Type, CPU on VM Host     Before   After
ConnectX-3 Pro, E5-2670v3    12421    12612
BE3, E5-2640                 8178     8484
82599, E5-2640               8499     8549
BCM57840, E5-2640            8544     8560
Skyhawk, E5-2640             8537     8701

happy benchmarking,

Drew Balliet
Juerg Haefliger
rick jones

The semi-cooked results with additional statistics:

554M - BE3
544+M - ConnectX-3 Pro
560M - 82599ES
630M - BCM57840
650M - Skyhawk

(substitute is simply replacing a system name with the model of NIC and CPU)

Bulk To (South) and From (North) VM, Before:

$ ../substitute.sh vxlan_554m_control_performance_gvnr_dvr_northsouth_stream.log | ~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 4 -f 7 -f 8
Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,8148.090,9048.830,9235.400,9192.868,9315.980,9338.845,9339.500,113
North,630M,E5-2640,554FLB,E5-2640,8909.980,9113.238,9234.750,9213.140,9299.442,9336.206,9337.830,47
North,544+M,E5-2670v3,554FLB,E5-2640,9013.740,9182.546,9229.620,9224.025,9264.036,9299.206,9301.970,99
North,650M,E5-2640,554FLB,E5-2640,3187.680,3393.724,3796.160,3884.765,4405.096,4941.391,4956.300,129
North,554M,E5-2640,554FLB,E5-2640,8700.930,8855.768,9026.030,9016.061,9158.846,9213.687,9226.150,135
South,554FLB,E5-2640,560M,E5-2640,7754.350,8193.114,8718.540,8670.612,9026.436,9262.355,9285.010,113
South,554FLB,E5-2640,630M,E5-2640,1897.660,2068.290,2514.430,2468.323,2787.162,2942.934,2957.250,53
South,554FLB,E5-2640,544+M,E5-2670v3,9298.260,9314.432,9323.220,9322.207,9328.324,9330.704,9331.080,100
South,554FLB,E5-2640,650M,E5-2640,8407.050,8907.136,9304.390,9206.776,9321.320,9325.347,9326.410,103
South,554FLB,E5-2640,554M,E5-2640,7844.900,8632.530,9199.385,9074.535,9308.070,9319.224,9322.360,132
0 too-short lines ignored. 
Bulk To (South) and From (North) VM, After:

$ ../substitute.sh vxlan_554m_control_performance_gvnr_xpsfix_dvr_northsouth_stream.log | ~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 4 -f 7 -f 8
Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,7576.790,8213.890,9182.870,9003.190,9295.975,9315.878,9318.160,36
North,630M,E5-2640,554FLB,E5-2640,8811.800,8924.000,9206.660,9153.076,9306.287,9315.152,9315.790,12
North,544+M,E5-2670v3,554FLB,E5-2640,9135.990,9228.520,9277.465,9271.875,9324.545,9339.604,9339.780,46
North,650M,E5-2640,554FLB,E5-2640,8133.420,8483.340,8995.040,8897.779,9129.056,9165.230,9165.860,43
North,554M,E5-2640,554FLB,E5-2640,8438.390,8879.150,9048.590,9022.813,9181.540,9248.650,9297.660,101
South,554FLB,E5-2640,630M,E5-2640,7347.120,7592.565,7951.325,7979.951,8365.400,8575.837,8579.890,16
South,554FLB,E5-2640,560M,E5-2640,7719.510,8044.496,8602.750,8564.741,9172.824,9248.686,9259.070,45
Re: [RFC PATCH net-next 2/2] sfc: report 4-tuple UDP hashing to ethtool, if it's enabled
On 28/09/16 10:12, David Laight wrote: > If you invert the above and add a goto... > if (!efx->rx_hash_udp_4tuple) > goto set_ip; I don't mind gotos... >> case SCTP_V4_FLOW: >> case AH_ESP_V4_FLOW: >> case IPV4_FLOW: > set_ip: ...but this adds a label where we effectively already have one. I wish C allowed goto case labels. > It might look better. > David It just bugs me that it would have this unnecessary goto and label. Alternate ways to maybe make it look better, or not: * Remove the /* else fall further */ comment; does this make the indentation more or less confusing? * Include braces on the if, even though there's only one statement inside. Also, how strong are people's reactions to this? If it's just "I personally wouldn't do it that way", then I'm tempted to go ahead anyway. But if it's "NAK NAK NAK burn the heretic", that's another matter. -Ed
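For concreteness, the shape being debated is roughly the following: the UDP case conditionally sets the layer-4 bits and then falls through to the common IP src/dst bits, which is exactly where the proposed `set_ip:` label would land. The flow-type enum and hash-field constants below are stand-ins, not the kernel's real ethtool.h values.

```c
#include <assert.h>
#include <stdint.h>

/* stand-in hash-field bits (not the real RXH_* values) */
#define HASH_IP_SRC   (1u << 0)
#define HASH_IP_DST   (1u << 1)
#define HASH_L4_B_0_1 (1u << 2)
#define HASH_L4_B_2_3 (1u << 3)

enum flow { TCP_V4, UDP_V4, SCTP_V4, IPV4 };

static uint32_t rx_hash_fields(enum flow f, int udp_4tuple)
{
    uint32_t data = 0;

    switch (f) {
    case UDP_V4:
        if (udp_4tuple)
            data |= HASH_L4_B_0_1 | HASH_L4_B_2_3;
        /* else fall further */
        /* fall through */
    case SCTP_V4:
    case IPV4:
        data |= HASH_IP_SRC | HASH_IP_DST;
        break;
    case TCP_V4:
        data |= HASH_IP_SRC | HASH_IP_DST |
                HASH_L4_B_0_1 | HASH_L4_B_2_3;
        break;
    }
    return data;
}
```

Either form, the conditional-fall-through or the inverted-condition-plus-goto, computes the same result; the disagreement is purely about which reads better.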
Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference
Eric Dumazet writes:

> On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote:
>> It's possible for nf_hook_entry_head to return NULL.  If two
>> nf_unregister_net_hook calls happen simultaneously with a single hook
>> entry in the list, both will enter the nf_hook_mutex critical section.
>> The first will successfully delete the head, but the second will see
>> this NULL pointer and attempt to dereference.
>>
>> This fix ensures that no null pointer dereference could occur when such
>> a condition happens.
>>
>> Signed-off-by: Aaron Conole
>> ---
>>  net/netfilter/core.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
>> index 360c63d..e58e420 100644
>> --- a/net/netfilter/core.c
>> +++ b/net/netfilter/core.c
>> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>>
>>  	mutex_lock(&nf_hook_mutex);
>>  	hooks_entry = nf_hook_entry_head(net, reg);
>> -	if (hooks_entry->orig_ops == reg) {
>> +	if (hooks_entry && hooks_entry->orig_ops == reg) {
>>  		nf_set_hooks_head(net, reg,
>>  				  nf_entry_dereference(hooks_entry->next));
>>  		goto unlock;
>
> When was the bug added exactly ?

Sunday, on the nf-next tree.

> For all bug fixes, you need to add a Fixes: tag.
>
> Like :
>
> Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")

I would but it's in nf-next tree, and I'm not sure how pulls go.  If they
are done via patch imports, then the sha sums will be wrong and the commit
message will be misleading.  If the sums are preserved, then I can resubmit
with this information.

> So that 100 different people in stable teams do not have to do the
> archeology themselves ...
>
> Thanks.

Thanks for the review, Eric.
Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets
On Wed, Sep 28, 2016 at 06:49:51AM -0700, Eric Dumazet wrote:
> >
> > 4.7 is pretty widespread, so I've to think...
>
> Sorry, 4.4.7 it was
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg128714.html

Ah, thanks for info!
[PATCH net-next v5] bpf: allow access into map value arrays
Suppose you have a map array value that is something like this

struct foo {
	unsigned iter;
	int array[SOME_CONSTANT];
};

You can easily insert this into an array, but you cannot modify the contents of
foo->array[] after the fact.  This is because we have no way to verify we won't
go off the end of the array at verification time.  This patch provides a start
for this work.  We accomplish this by keeping track of a minimum and maximum
value a register could be while we're checking the code.  Then at the time we
try to do an access into a MAP_VALUE we verify that the maximum offset into that
region is a valid access into that memory region.  So in practice, code such as
this

unsigned index = 0;

if (foo->iter >= SOME_CONSTANT)
	foo->iter = index;
else
	index = foo->iter++;
foo->array[index] = bar;

would be allowed, as we can verify that index will always be between 0 and
SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an extra
check to make sure the index isn't less than 0, or do something like
index %= SOME_CONSTANT.

Signed-off-by: Josef Bacik
Acked-by: Alexei Starovoitov
---
V4->V5:
-rebased onto net-next.
-dealt with adding map values to other map values, made it so we only adjust
 min/max if we are messing with CONST or INV variables, otherwise we set it to
 RANGE_MIN/RANGE_MAX.
-added a new testcase for the adding map values to map values case.
-moved a reset of the register values until after the register validation
 checks for BPF_LD

V3->V4:
-fixed the alignment check to deal with MAP_VALUE_ADJ properly.
-updated the error messages for invalid ranges to be more clear.
-fixed test_verifier.c to work with the updated error messages.
-fixed a couple of typos.

V2->V3:
-added testcases to test_verifier.c
-fixed a problem where we wouldn't adjust min/max values if the val == *_RANGE
 values, which would cause us to think any access to the map was a-ok.
-Handle the case where the CONST_IMM register is the dst_reg instead of the
 src_reg.
-Deal with overflows from the jmp handling.
-Fixed the states_equal logic to make more sense.

V1->V2:
-Instead of using (u64)-1 as the limit which could overflow in some cases and
 give us weird behavior, make the limit 1gib so that any math doesn't overflow
 and it's still outside of the realm of reasonable offsets.

 include/linux/bpf.h          |   7 +
 include/linux/bpf_verifier.h |  12 ++
 kernel/bpf/verifier.c        | 329 ---
 samples/bpf/libbpf.h         |   8 ++
 samples/bpf/test_verifier.c  | 243 +++-
 5 files changed, 577 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5691fdc..c201017 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -139,6 +139,13 @@ enum bpf_reg_type {
 	 */
 	PTR_TO_PACKET,
 	PTR_TO_PACKET_END,	 /* skb->data + headlen */
+
+	/* PTR_TO_MAP_VALUE_ADJ is used for doing pointer math inside of a map
+	 * elem value. We only allow this if we can statically verify that
+	 * access from this register are going to fall within the size of the
+	 * map element.
+	 */
+	PTR_TO_MAP_VALUE_ADJ,
 };
 
 struct bpf_prog;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c5cb661..7035b99 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -10,8 +10,19 @@
 #include <linux/bpf.h> /* for enum bpf_reg_type */
 #include <linux/filter.h> /* for MAX_BPF_STACK */
 
+/* Just some arbitrary values so we can safely do math without overflowing and
+ * are obviously wrong for any sort of memory access.
+ */
+#define BPF_REGISTER_MAX_RANGE (1024 * 1024 * 1024)
+#define BPF_REGISTER_MIN_RANGE -(1024 * 1024 * 1024)
+
 struct bpf_reg_state {
 	enum bpf_reg_type type;
+	/*
+	 * Used to determine if any memory access using this register will
+	 * result in a bad access.
+	 */
+	u64 min_value, max_value;
 	union {
 		/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
 		s64 imm;
@@ -81,6 +92,7 @@ struct bpf_verifier_env {
 	u32 id_gen;			/* used to generate unique reg IDs */
 	bool allow_ptr_leaks;
 	bool seen_direct_write;
+	bool varlen_map_value_access;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 };
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7ada315..99a7e5b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -182,6 +182,7 @@ static const char * const reg_type_str[] = {
 	[CONST_PTR_TO_MAP]	= "map_ptr",
 	[PTR_TO_MAP_VALUE]	= "map_value",
 	[PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
+	[PTR_TO_MAP_VALUE_ADJ]	= "map_value_adj",
 	[FRAME_PTR]		= "fp",
 	[PTR_TO_STACK]		= "fp",
 	[CONST_IMM]
[PATCH v8 1/8] thunderbolt: Macro rename
This first patch updates the NHI Thunderbolt controller registers file to
reflect that it is not only for Cactus Ridge.  No functional change intended.

Signed-off-by: Amir Levy
Signed-off-by: Andreas Noever
---
 drivers/thunderbolt/nhi_regs.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/thunderbolt/nhi_regs.h b/drivers/thunderbolt/nhi_regs.h
index 86b996c..75cf069 100644
--- a/drivers/thunderbolt/nhi_regs.h
+++ b/drivers/thunderbolt/nhi_regs.h
@@ -1,11 +1,11 @@
 /*
- * Thunderbolt Cactus Ridge driver - NHI registers
+ * Thunderbolt driver - NHI registers
  *
  * Copyright (c) 2014 Andreas Noever
  */
 
-#ifndef DSL3510_REGS_H_
-#define DSL3510_REGS_H_
+#ifndef NHI_REGS_H_
+#define NHI_REGS_H_
 
 #include
-- 
2.7.4
[PATCH v8 4/8] thunderbolt: Networking state machine
This patch builds the peer to peer communication path.  Communication is
established by a negotiation process whereby messages are sent back and forth
between the peers until a connection is established.  This includes the
Thunderbolt Network driver communication with the second peer via the Intel
Connection Manager (ICM) firmware.

   +------------------+        +------------------+
   |Host 1            |        |Host 2            |
   |                  |        |                  |
   |  +------------+  |        |  +------------+  |
   |  |Thunderbolt |  |        |  |Thunderbolt |  |
   |  |Networking  |  |        |  |Networking  |  |
   |  |Driver      |  |        |  |Driver      |  |
   |  +------------+  |        |  +------------+  |
   |        ^         |        |        ^         |
   |        |         |        |        |         |
   |  +------------+  |        |  +------------+  |
   |  |Thunderbolt |  |        |  |Thunderbolt |  |
   |  |Controller  |  |        |  |Controller  |  |
   |  |  +-----+   |  |        |  |  +-----+   |  |
   |  |  |ICM  |<--+--+--------+--+->|ICM  |   |  |
   |  |  +-----+   |  |        |  |  +-----+   |  |
   |  +------------+  |        |  +------------+  |
   +------------------+        +------------------+

Note that this patch only establishes the link between the two hosts and not
Network Packet handling - this is dealt with in the next patch.

Signed-off-by: Amir Levy
---
 drivers/thunderbolt/icm/Makefile  |   2 +-
 drivers/thunderbolt/icm/icm_nhi.c | 262 -
 drivers/thunderbolt/icm/net.c     | 783 ++
 drivers/thunderbolt/icm/net.h     |  70
 4 files changed, 1109 insertions(+), 8 deletions(-)
 create mode 100644 drivers/thunderbolt/icm/net.c

diff --git a/drivers/thunderbolt/icm/Makefile b/drivers/thunderbolt/icm/Makefile
index f0d0fbb..94a2797 100644
--- a/drivers/thunderbolt/icm/Makefile
+++ b/drivers/thunderbolt/icm/Makefile
@@ -1,2 +1,2 @@
 obj-${CONFIG_THUNDERBOLT_ICM} += thunderbolt-icm.o
-thunderbolt-icm-objs := icm_nhi.o
+thunderbolt-icm-objs := icm_nhi.o net.o
diff --git a/drivers/thunderbolt/icm/icm_nhi.c b/drivers/thunderbolt/icm/icm_nhi.c
index 23047d3..a849698 100644
--- a/drivers/thunderbolt/icm/icm_nhi.c
+++ b/drivers/thunderbolt/icm/icm_nhi.c
@@ -64,6 +64,13 @@ static const struct nla_policy nhi_genl_policy[NHI_ATTR_MAX + 1] = {
 					.len = TBT_ICM_RING_MAX_FRAME_SIZE },
 	[NHI_ATTR_MSG_FROM_ICM]		= { .type = NLA_BINARY,
 					.len = TBT_ICM_RING_MAX_FRAME_SIZE },
+	[NHI_ATTR_LOCAL_ROUTE_STRING]	= {
+					.len = sizeof(struct route_string) },
+	[NHI_ATTR_LOCAL_UUID]		= { .len = sizeof(uuid_be) },
+	[NHI_ATTR_REMOTE_UUID]		= { .len = sizeof(uuid_be) },
+	[NHI_ATTR_LOCAL_DEPTH]		= { .type = NLA_U8, },
+	[NHI_ATTR_ENABLE_FULL_E2E]	= { .type = NLA_FLAG, },
+	[NHI_ATTR_MATCH_FRAME_ID]	= { .type = NLA_FLAG, },
 };
 
 /* NHI genetlink family */
@@ -480,6 +487,29 @@ int nhi_mailbox(struct tbt_nhi_ctxt *nhi_ctxt, u32 cmd, u32 data, bool deinit)
 	return 0;
 }
 
+static inline bool nhi_is_path_disconnected(u32 cmd, u8 num_ports)
+{
+	return (cmd >= DISCONNECT_PORT_A_INTER_DOMAIN_PATH &&
+		cmd < (DISCONNECT_PORT_A_INTER_DOMAIN_PATH + num_ports));
+}
+
+static int nhi_mailbox_disconn_path(struct tbt_nhi_ctxt *nhi_ctxt, u32 cmd)
+	__releases(&controllers_list_mutex)
+{
+	struct port_net_dev *port;
+	u32 port_num = cmd - DISCONNECT_PORT_A_INTER_DOMAIN_PATH;
+
+	port = &(nhi_ctxt->net_devices[port_num]);
+	mutex_lock(&port->state_mutex);
+
+	mutex_unlock(&controllers_list_mutex);
+	port->medium_sts = MEDIUM_READY_FOR_APPROVAL;
+	if (port->net_dev)
+		negotiation_events(port->net_dev, MEDIUM_DISCONNECTED);
+	mutex_unlock(&port->state_mutex);
+	return 0;
+}
+
 static int nhi_mailbox_generic(struct tbt_nhi_ctxt *nhi_ctxt, u32 mb_cmd)
 	__releases(&controllers_list_mutex)
 {
@@ -526,13 +556,90 @@ static int nhi_genl_mailbox(__always_unused struct sk_buff *u_skb,
 		return -ERESTART;
 
 	nhi_ctxt = nhi_search_ctxt(*(u32 *)info->userhdr);
-	if (nhi_ctxt && !nhi_ctxt->d0_exit)
-		return nhi_mailbox_generic(nhi_ctxt, mb_cmd);
+	if (nhi_ctxt && !nhi_ctxt->d0_exit) {
+
+		/* rwsem is released later by the below functions */
+		if (nhi_is_path_disconnected(cmd, nhi_ctxt->num_ports))
+			return nhi_mailbox_disconn_path(nhi_ctxt, cmd);
+		else
+			return nhi_mailbox_generic(nhi_ctxt, mb_cmd);
+
+	}
 
 	mutex_unlock(&controllers_list_mutex);
 
 	return -ENODEV;
 }
 
+static int
[PATCH v8 6/8] thunderbolt: Kconfig for Thunderbolt Networking
Update to the Kconfig Thunderbolt description to add Thunderbolt networking as
an option.  The menu item "Thunderbolt support" now offers:
"Apple Hardware Support" (existing) and/or "Thunderbolt Networking" (new).

You can choose the driver for your platform or build both drivers - each
driver will detect if it can run on the specific platform.

If the Thunderbolt Networking option is chosen, Thunderbolt Networking will be
enabled between Linux non-Apple systems, macOS and Windows based systems.
Thunderbolt Networking will not affect any other Thunderbolt feature that was
previously available to Linux users on either Apple or non-Apple platforms.

Signed-off-by: Amir Levy
---
 drivers/thunderbolt/Kconfig  | 27 +++
 drivers/thunderbolt/Makefile |  3 ++-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/thunderbolt/Kconfig b/drivers/thunderbolt/Kconfig
index c121acc..376e5bb 100644
--- a/drivers/thunderbolt/Kconfig
+++ b/drivers/thunderbolt/Kconfig
@@ -1,13 +1,32 @@
-menuconfig THUNDERBOLT
-	tristate "Thunderbolt support for Apple devices"
+config THUNDERBOLT
+	tristate "Thunderbolt support"
 	depends on PCI
 	select CRC32
 	help
-	  Cactus Ridge Thunderbolt Controller driver
+	  Thunderbolt Controller driver
+
+if THUNDERBOLT
+
+config THUNDERBOLT_APPLE
+	tristate "Apple hardware support"
+	help
 	  This driver is required if you want to hotplug Thunderbolt devices on
 	  Apple hardware.
 
 	  Device chaining is currently not supported.
 
-	  To compile this driver a module, choose M here. The module will be
+	  To compile this driver as a module, choose M here. The module will be
 	  called thunderbolt.
+
+config THUNDERBOLT_ICM
+	tristate "Thunderbolt Networking"
+	help
+	  This driver is required if you want Thunderbolt Networking on
+	  non-Apple hardware.
+	  It creates a virtual Ethernet device that enables computer to
+	  computer communication over a Thunderbolt cable.
+
+	  To compile this driver as a module, choose M here. The module will be
+	  called thunderbolt_icm.
+
+endif
diff --git a/drivers/thunderbolt/Makefile b/drivers/thunderbolt/Makefile
index 5d1053c..b6aa6a3 100644
--- a/drivers/thunderbolt/Makefile
+++ b/drivers/thunderbolt/Makefile
@@ -1,3 +1,4 @@
-obj-${CONFIG_THUNDERBOLT} := thunderbolt.o
+obj-${CONFIG_THUNDERBOLT_APPLE} := thunderbolt.o
 thunderbolt-objs := nhi.o ctl.o tb.o switch.o cap.o path.o tunnel_pci.o eeprom.o
+obj-${CONFIG_THUNDERBOLT_ICM} += icm/
-- 
2.7.4
[PATCH v8 7/8] thunderbolt: Networking doc
Adding Thunderbolt(TM) networking documentation.

Signed-off-by: Amir Levy
---
 Documentation/00-INDEX                   |   2 +
 Documentation/thunderbolt/networking.txt | 132 +++
 2 files changed, 134 insertions(+)
 create mode 100644 Documentation/thunderbolt/networking.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index cb9a6c6..a448ba1 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -439,6 +439,8 @@ this_cpu_ops.txt
 	- List rationale behind and the way to use this_cpu operations.
 thermal/
 	- directory with information on managing thermal issues (CPU/temp)
+thunderbolt/
+	- directory with info regarding Thunderbolt.
 trace/
 	- directory with info on tracing technologies within linux
 unaligned-memory-access.txt
diff --git a/Documentation/thunderbolt/networking.txt b/Documentation/thunderbolt/networking.txt
new file mode 100644
index 000..88d1c12
--- /dev/null
+++ b/Documentation/thunderbolt/networking.txt
@@ -0,0 +1,132 @@
+Intel Thunderbolt(TM) Networking driver
+=======================================
+
+Copyright(c) 2013 - 2016 Intel Corporation.
+
+Contact Information:
+Intel Thunderbolt mailing list
+Edited by Amir Levy
+
+Overview
+========
+
+* The Thunderbolt Networking driver enables peer to peer networking on
+  non-Apple platforms running Linux.
+
+* The driver creates a virtual Ethernet device that enables computer to
+  computer communication over the Thunderbolt cable.
+
+* Using Thunderbolt Networking you can perform high speed file transfers
+  between computers, perform PC migrations and/or set up small workgroups
+  with shared storage without compromising any other Thunderbolt
+  functionality.
+
+* The driver is located in drivers/thunderbolt/icm.
+
+* This driver will function only on non-Apple platforms with firmware based
+  Thunderbolt controllers that support Thunderbolt Networking.
+
+   +------------------+        +------------------+
+   |Host 1            |        |Host 2            |
+   |                  |        |                  |
+   |  +------------+  |        |  +------------+  |
+   |  |Network     |  |        |  |Network     |  |
+   |  |Stack       |  |        |  |Stack       |  |
+   |  +------------+  |        |  +------------+  |
+   |        ^         |        |        ^         |
+   |        |         |        |        |         |
+   |        v         |        |        v         |
+   |  +------------+  |        |  +------------+  |
+   |  |Thunderbolt |  |        |  |Thunderbolt |  |
+   |  |Networking  |  |        |  |Networking  |  |
+   |  |Driver      |  |        |  |Driver      |  |
+   |  +------------+  |        |  +------------+  |
+   |        ^         |        |        ^         |
+   |        |         |        |        |         |
+   |        v         |        |        v         |
+   |  +------------+  |        |  +------------+  |
+   |  |Thunderbolt |  |        |  |Thunderbolt |  |
+   |  |Controller  |<-+--------+->|Controller  |  |
+   |  |with ICM    |  |        |  |with ICM    |  |
+   |  |enabled     |  |        |  |enabled     |  |
+   |  +------------+  |        |  +------------+  |
+   +------------------+        +------------------+
+
+Files
+=====
+
+The following files are located in the drivers/thunderbolt/icm directory:
+
+- icm_nhi.c/h: These files allow communication with the firmware (Intel
+  Connection Manager) based controller. They also create an interface for
+  netlink communication with a user space daemon.
+
+- net.c/net.h: These files implement the 'eth' interface for the
+  Thunderbolt(TM) Networking.
+
+Interface to User Space
+=======================
+
+The interface to the user space module is implemented through a Generic
+Netlink. This is the communications protocol between the Thunderbolt driver
+and the user space application.
+
+Note that this interface mediates user space communication with ICM.
+(Existing Linux tools can be used to configure the network interface.)
+
+The Thunderbolt Daemon utilizes this interface to communicate with the driver.
+To be accessed by the user space module, both kernel and user space modules
+have to register with the same GENL_NAME.
+For the purpose of the Thunderbolt Network driver, "thunderbolt" is used.
+The registration is done at driver initialization time for all instances
+of the Thunderbolt controllers. The communication is carried through
+pre-defined Thunderbolt messages. Each specific message has a callback
+function that is called when the related message is received.
+
+Message Definitions:
+* NHI_CMD_UNSPEC: Not used.
+* NHI_CMD_SUBSCRIBE: Subscription request from daemon to driver to open the
+  communication channel.
+* NHI_CMD_UNSUBSCRIBE: Request from daemon to driver to unsubscribe and
+  to close communication channel.
+* NHI_CMD_QUERY_INFORMATION: Request information from the driver
[PATCH v8 3/8] thunderbolt: Communication with the ICM (firmware)
This patch provides the communication protocol with the Intel Connection
Manager (ICM) firmware that is operational in the Thunderbolt controller in
non-Apple hardware.  The ICM firmware-based controller is used for
establishing and maintaining the Thunderbolt Networking connection - we need
to be able to communicate with it.

Signed-off-by: Amir Levy
---
 drivers/thunderbolt/icm/Makefile  |    2 +
 drivers/thunderbolt/icm/icm_nhi.c | 1251 +
 drivers/thunderbolt/icm/icm_nhi.h |   82 +++
 drivers/thunderbolt/icm/net.h     |  217 +++
 4 files changed, 1552 insertions(+)
 create mode 100644 drivers/thunderbolt/icm/Makefile
 create mode 100644 drivers/thunderbolt/icm/icm_nhi.c
 create mode 100644 drivers/thunderbolt/icm/icm_nhi.h
 create mode 100644 drivers/thunderbolt/icm/net.h

diff --git a/drivers/thunderbolt/icm/Makefile b/drivers/thunderbolt/icm/Makefile
new file mode 100644
index 000..f0d0fbb
--- /dev/null
+++ b/drivers/thunderbolt/icm/Makefile
@@ -0,0 +1,2 @@
+obj-${CONFIG_THUNDERBOLT_ICM} += thunderbolt-icm.o
+thunderbolt-icm-objs := icm_nhi.o
diff --git a/drivers/thunderbolt/icm/icm_nhi.c b/drivers/thunderbolt/icm/icm_nhi.c
new file mode 100644
index 000..23047d3
--- /dev/null
+++ b/drivers/thunderbolt/icm/icm_nhi.c
@@ -0,0 +1,1251 @@
+/*******************************************************************************
+ *
+ * Intel Thunderbolt(TM) driver
+ * Copyright(c) 2014 - 2016 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ ******************************************************************************/
+
+#include
+#include
+#include
+#include
+#include "icm_nhi.h"
+#include "net.h"
+
+#define NHI_GENL_VERSION 1
+#define NHI_GENL_NAME "thunderbolt"
+
+#define DEVICE_DATA(num_ports, dma_port, nvm_ver_offset, nvm_auth_on_boot,\
+		    support_full_e2e) \
+	((num_ports) | ((dma_port) << 4) | ((nvm_ver_offset) << 10) | \
+	 ((nvm_auth_on_boot) << 22) | ((support_full_e2e) << 23))
+#define DEVICE_DATA_NUM_PORTS(device_data) ((device_data) & 0xf)
+#define DEVICE_DATA_DMA_PORT(device_data) (((device_data) >> 4) & 0x3f)
+#define DEVICE_DATA_NVM_VER_OFFSET(device_data) (((device_data) >> 10) & 0xfff)
+#define DEVICE_DATA_NVM_AUTH_ON_BOOT(device_data) (((device_data) >> 22) & 0x1)
+#define DEVICE_DATA_SUPPORT_FULL_E2E(device_data) (((device_data) >> 23) & 0x1)
+
+#define USEC_TO_256_NSECS(usec) DIV_ROUND_UP((usec) * NSEC_PER_USEC, 256)
+
+/* NHI genetlink commands */
+enum {
+	NHI_CMD_UNSPEC,
+	NHI_CMD_SUBSCRIBE,
+	NHI_CMD_UNSUBSCRIBE,
+	NHI_CMD_QUERY_INFORMATION,
+	NHI_CMD_MSG_TO_ICM,
+	NHI_CMD_MSG_FROM_ICM,
+	NHI_CMD_MAILBOX,
+	NHI_CMD_APPROVE_TBT_NETWORKING,
+	NHI_CMD_ICM_IN_SAFE_MODE,
+	__NHI_CMD_MAX,
+};
+#define NHI_CMD_MAX (__NHI_CMD_MAX - 1)
+
+/* NHI genetlink policy */
+static const struct nla_policy nhi_genl_policy[NHI_ATTR_MAX + 1] = {
+	[NHI_ATTR_DRV_VERSION]		= { .type = NLA_NUL_STRING, },
+	[NHI_ATTR_NVM_VER_OFFSET]	= { .type = NLA_U16, },
+	[NHI_ATTR_NUM_PORTS]		= { .type = NLA_U8, },
+	[NHI_ATTR_DMA_PORT]		= { .type = NLA_U8, },
+	[NHI_ATTR_SUPPORT_FULL_E2E]	= { .type = NLA_FLAG, },
+	[NHI_ATTR_MAILBOX_CMD]		= { .type = NLA_U32, },
+	[NHI_ATTR_PDF]			= { .type = NLA_U32, },
+	[NHI_ATTR_MSG_TO_ICM]		= { .type = NLA_BINARY,
+					.len = TBT_ICM_RING_MAX_FRAME_SIZE },
+	[NHI_ATTR_MSG_FROM_ICM]		= { .type = NLA_BINARY,
+					.len = TBT_ICM_RING_MAX_FRAME_SIZE },
+};
+
+/* NHI genetlink family */
+static struct genl_family nhi_genl_family = {
+	.id		= GENL_ID_GENERATE,
+	.hdrsize	= FIELD_SIZEOF(struct tbt_nhi_ctxt, id),
+	.name		= NHI_GENL_NAME,
+	.version	= NHI_GENL_VERSION,
+	.maxattr	= NHI_ATTR_MAX,
+};
+
+static LIST_HEAD(controllers_list);
+static DEFINE_MUTEX(controllers_list_mutex);
+static atomic_t subscribers = ATOMIC_INIT(0);
+/*
+ * Some of the received generic netlink messages are replied in a different
+ * context. The reply has to include the netlink portid of sender, therefore
+ * saving it in global variable (current assumption is one sender).
+ */
+static u32 portid;
+
+static bool nhi_nvm_authenticated(struct tbt_nhi_ctxt *nhi_ctxt)
+{
+	enum icm_operation_mode op_mode;
+	u32 *msg_head, port_id, reg;
+
[PATCH v8 8/8] thunderbolt: Adding maintainer entry
Add Amir Levy as maintainer for the Thunderbolt(TM) ICM driver.

Signed-off-by: Amir Levy
---
 MAINTAINERS | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 01bff8e..a4a4614 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10358,7 +10358,13 @@ F:	include/uapi/linux/stm.h
 THUNDERBOLT DRIVER
 M:	Andreas Noever
 S:	Maintained
-F:	drivers/thunderbolt/
+F:	drivers/thunderbolt/*
+
+THUNDERBOLT ICM DRIVER
+M:	Amir Levy
+S:	Maintained
+F:	drivers/thunderbolt/icm/
+F:	Documentation/thunderbolt/networking.txt
 
 TI BQ27XXX POWER SUPPLY DRIVER
 R:	Andrew F. Davis
-- 
2.7.4