Re: [PATCH nf-next,v2] gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)
On Sun, May 8, 2016 at 3:55 PM, Pablo Neira Ayusowrote: > +static int gtp_genl_new_pdp(struct sk_buff *skb, struct genl_info *info) > +{ ... > + > + net = gtp_genl_get_net(sock_net(skb->sk), info->attrs); > + if (IS_ERR(net)) > + return PTR_ERR(net); > + > + /* Check if there's an existing gtpX device to configure */ > + dev = gtp_find_dev(net, nla_get_u32(info->attrs[GTPA_LINK])); > + if (dev == NULL) > + return -ENODEV; > + > + return ipv4_pdp_add(dev, info); Seems you are leaking struct net at least in the error path.
Re: [PATCH] cfg80211/nl80211: add wifi tx power mode switching support
On Fri, May 6, 2016 at 4:19 PM, Wei-Ning Huangwrote: > On Fri, May 6, 2016 at 12:07 AM, Dan Williams wrote: >> >> On Thu, 2016-05-05 at 14:44 +0800, Wei-Ning Huang wrote: >> > Recent new hardware has the ability to switch between tablet mode and >> > clamshell mode. To optimize WiFi performance, we want to be able to >> > use >> > different power table between modes. This patch adds a new netlink >> > message type and cfg80211_ops function to allow userspace to trigger >> > a >> > power mode switch for a given wireless interface. >> > >> > Signed-off-by: Wei-Ning Huang >> > --- >> > include/net/cfg80211.h | 11 +++ >> > include/uapi/linux/nl80211.h | 21 + >> > net/wireless/nl80211.c | 16 >> > net/wireless/rdev-ops.h | 22 ++ >> > net/wireless/trace.h | 20 >> > 5 files changed, 90 insertions(+) >> > >> > diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h >> > index 9e1b24c..aa77fa0 100644 >> > --- a/include/net/cfg80211.h >> > +++ b/include/net/cfg80211.h >> > @@ -2370,6 +2370,12 @@ struct cfg80211_qos_map { >> > * @get_tx_power: store the current TX power into the dbm variable; >> > * return 0 if successful >> > * >> > + * @set_tx_power_mode: set the transmit power mode. Some device have >> > the ability >> > + * to transform between different mode such as clamshell and >> > tablet mode. >> > + * set_tx_power_mode allows setting of different TX power >> > mode at runtime. >> > + * @get_tx_power_mode: store the current TX power mode into the mode >> > variable; >> > + * return 0 if successful >> > + * >> > * @set_wds_peer: set the WDS peer for a WDS interface >> > * >> > * @rfkill_poll: polls the hw rfkill line, use cfg80211 reporting >> > @@ -2631,6 +2637,11 @@ struct cfg80211_ops { >> > int (*get_tx_power)(struct wiphy *wiphy, struct >> > wireless_dev *wdev, >> > int *dbm); >> > >> > + int (*set_tx_power_mode)(struct wiphy *wiphy, >> > + enum nl80211_tx_power_mode >> > mode); >> > + int (*get_tx_power_mode)(struct wiphy *wiphy, >> > + enum nl80211_tx_power_mode >> > *mode); >> > + >> > int (*set_wds_peer)(struct wiphy *wiphy, struct >> > net_device *dev, >> > const u8 *addr); >> > >> > diff --git a/include/uapi/linux/nl80211.h >> > b/include/uapi/linux/nl80211.h >> > index 5a30a75..9b1888a 100644 >> > --- a/include/uapi/linux/nl80211.h >> > +++ b/include/uapi/linux/nl80211.h >> > @@ -1796,6 +1796,9 @@ enum nl80211_commands { >> > * connecting to a PCP, and in %NL80211_CMD_START_AP to start >> > * a PCP instead of AP. Relevant for DMG networks only. >> > * >> > + * @NL80211_ATTR_WIPHY_TX_POWER_MODE: Transmit power mode. See >> > + * nl80211_tx_power_mode for possible values. >> > + * >> > * @NUM_NL80211_ATTR: total number of nl80211_attrs available >> > * @NL80211_ATTR_MAX: highest attribute number currently defined >> > * @__NL80211_ATTR_AFTER_LAST: internal use >> > @@ -2172,6 +2175,8 @@ enum nl80211_attrs { >> > >> > NL80211_ATTR_PBSS, >> > >> > + NL80211_ATTR_WIPHY_TX_POWER_MODE, >> > + >> > /* add attributes here, update the policy in nl80211.c */ >> > >> > __NL80211_ATTR_AFTER_LAST, >> > @@ -3703,6 +3708,22 @@ enum nl80211_tx_power_setting { >> > }; >> > >> > /** >> > + * enum nl80211_tx_power_mode - TX power mode setting >> > + * @NL80211_TX_POWER_LOW: general low TX power mode >> > + * @NL80211_TX_POWER_MEDIUM: general medium TX power mode >> > + * @NL80211_TX_POWER_HIGH: general high TX power mode >> > + * @NL80211_TX_POWER_CLAMSHELL: clamshell mode TX power mode >> > + * @NL80211_TX_POWER_TABLET: tablet mode TX power mode >> > + */ >> > +enum nl80211_tx_power_mode { >> > + NL80211_TX_POWER_LOW, >> > + NL80211_TX_POWER_MEDIUM, >> > + NL80211_TX_POWER_HIGH, >> > + NL80211_TX_POWER_CLAMSHELL, >> > + NL80211_TX_POWER_TABLET, >> > >> "clamshell" and "tablet" probably mean many different things to many >> different people with respect to whether or not they should do anything >> with power saving or wifi. I feel like a more generic interface is >> needed here. > We could probably drop those two CLAMSHELL and TABLET constant or > describing what they mean > in more detail? > >> >> Could this be already done by: >> @NL80211_ATTR_WIPHY_TX_POWER_SETTING = NL80211_TX_POWER_LIMITED >> @NL80211_ATTR_WIPHY_TX_POWER_LEVEL = >> >> and then the device would be able to change its TX power as it saw fit >> up to that limit set by your application which figures out whether it's >> in clamshell or tablet mode? > > We usually want different power settings in different band/channel. > For example, we can have three different power settings > in 2.4Ghz, channels 36-64 & channels 100+ on
Re: [Intel-wired-lan] [PATCH] e1000e: prevent division by zero if TIMINCA is zero
Jarod Wilsonwrote: On Fri, May 06, 2016 at 11:43:17PM +, Rustad, Mark D wrote: Denys Vlasenko wrote: Users report that under VMWare, er32(TIMINCA) returns zero. This causes division by zero at init time as follows: ==>incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK; for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) { /* latch SYSTIMH on read of SYSTIML */ systim_next = (cycle_t)er32(SYSTIML); systim_next |= (cycle_t)er32(SYSTIMH) << 32; time_delta = systim_next - systim; temp = time_delta; > rem = do_div(temp, incvalue); This change makes kernel survive this, and users report that NIC does work after this change. Since on real hardware incvalue is never zero, this should not affect real hardware use case. ... I seem to recall that this was rejected before because it really is VMWare's bug and, if they fix it, any existing VMs that use this will just work. Changing the driver will only fix it for vms that install a new driver. I don't object to doing it, it just seems like not the most effective place to address the issue. You could also have people who never update VMWare, for whom a kernel work-around would be better. I think it'd be best to address it both at the driver level and the emulated hardware level, to improve things for the most possible users. Those who update neither hypervisor or kernel/driver, well, they reap what they sow. That is a sound argument for doing both. I would expect that there are more frozen VM images than host environments, but I can certainly imagine that some choose to freeze their host. Of course if everything is frozen there is no point at all. :-) I am on an extended vacation, and don't work on e1000e anyway, so I will quit my kibitzing here. -- Mark Rustad, mrus...@gmail.com signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [net PATCH v2 0/6] net sched: Fix broken late binding of actions
From: Jamal Hadi SalimDate: Tue, 10 May 2016 16:49:25 -0400 > Some actions were broken in allowing for late binding of actions. > Late binding workflow is as follows: > a) create an action and provide all necessary parameters for it > Optionally provide an index or let the kernel give you one. > Example: > sudo tc actions add action police rate 1kbit burst 90k drop index 1 > > b) later on bind to the pre-created action from a filter definition > by merely specifying the index. > Example: > sudo tc filter add dev lo parent : protocol ip prio 8 \ > u32 match ip src 127.0.0.8/32 flowid 1:8 action police index 1 Series applied, thanks Jamal.
Re: [PATCH v3 net-next] tcp: replace cnt & rtt with struct in pkts_acked()
From: David MillerDate: Tue, 10 May 2016 23:47:58 -0400 (EDT) > From: Lawrence Brakmo > Date: Tue, 10 May 2016 13:11:09 -0700 > >> Replace 2 arguments (cnt and rtt) in the congestion control modules' >> pkts_acked() function with a struct. This will allow adding more >> information without having to modify existing congestion control >> modules (tcp_nv in particular needs bytes in flight when packet >> was sent). >> >> As proposed by Neal Cardwell in his comments to the tcp_nv patch. >> >> Signed-off-by: Lawrence Brakmo >> Acked-by: Yuchung Cheng > > Applied, thanks. Actually I had to revert, this breaks the build: net/ipv4/tcp_illinois.c: In function ‘tcp_illinois_acked’: net/ipv4/tcp_illinois.c:97:18: error: assignment of member ‘rtt_us’ in read-only object sample->rtt_us = RTT_MAX; You can't mark this argument const.
RE: [PATCH] net: phylib: fix interrupts re-enablement in phy_start
> -Original Message- > From: Florian Fainelli [mailto:f.faine...@gmail.com] > Sent: Wednesday, May 11, 2016 2:25 AM > To: shh@gmail.com; netdev@vger.kernel.org; da...@davemloft.net > Cc: Shaohui Xie; Andrew Lunn > Subject: Re: [PATCH] net: phylib: fix interrupts re-enablement in phy_start > > On 05/10/2016 02:42 AM, shh@gmail.com wrote: > > From: Shaohui Xie > > > > If phy was suspended and is starting, current driver always enable > > phy's interrupts, if phy works in polling, phy can raise unexpected > > interrupt which will not be handled, the interrupt will block system > > enter suspend again. So interrupts should only be re-enabled if phy > > works in interrupt. > > Your explanation makes sense. The commit message could you use some > improvements like s/phy/PHY/ and "works in interrupt mode", but that is not a > huge thing. [S.H] Thank you for the comment. I should make the commit message more precise and will make improvement in future patch submission. Regards, Shaohui
Re: [PATCH v3 net-next] tcp: replace cnt & rtt with struct in pkts_acked()
From: Lawrence BrakmoDate: Tue, 10 May 2016 13:11:09 -0700 > Replace 2 arguments (cnt and rtt) in the congestion control modules' > pkts_acked() function with a struct. This will allow adding more > information without having to modify existing congestion control > modules (tcp_nv in particular needs bytes in flight when packet > was sent). > > As proposed by Neal Cardwell in his comments to the tcp_nv patch. > > Signed-off-by: Lawrence Brakmo > Acked-by: Yuchung Cheng Applied, thanks.
Re: pull request: batman-adv 20160511
From: Antonio QuartulliDate: Wed, 11 May 2016 03:29:48 +0800 > here you have a pull request intended for net-next. > There are 17 patches in this batch, but most of them are cleanups > and minor code re-arrangement. > > The more detailed description follows in the git tag. > > Please pull or let me know of any problem. Pulled, thanks Antonio.
Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support
Hi Jiri, On Tue, May 10, 2016 at 02:06:18PM +0200, Jiri Benc wrote: > On Mon, 9 May 2016 17:18:20 +0900, Simon Horman wrote: > > On Fri, May 06, 2016 at 11:35:04AM +0200, Jiri Benc wrote: > > > In addition, we should check whether mac_len > 0 and in such case, > > > change skb->protocol to ETH_P_TEB first (and store that value in the > > > pushed eth header). > > > > > > Similarly on pop_eth, we need to check skb->protocol and if it is > > > ETH_P_TEB, call eth_type_trans on the modified frame to set the new > > > skb->protocol correctly. It's probably not that simple, as we'd need a > > > version of eth_type_trans that doesn't need a net device. > > > > I'm not sure I understand the interaction with ETH_P_TEB here. > > > > In my mind skb->protocol == ETH_P_TEB may be used early on in OvS's receive > > processing to find the inner protocol from the packet and at that point > > skb->protocol is set to that value. And that for further packet processing > > the fact the packet was received as TEB is transparent. > > Yes but think about the case when you have two Ethernet headers pushed. > > We can either disallow it, or do what I described. > > Specifically, let's assume we have an IP packet with an Ethernet > header present. skb->protocol is ETH_P_IP. Now, when there's skb_push, > the correct operation would be setting skb->protocol to ETH_P_TEB, > pushing a new Ethernet header and filing ETH_P_TEB to the ethertype > field in the new header. The packet is not ETH_P_IP anymore, as the L2 > header is followed by another Ethernet header now. Thanks for the clarification, I had not considered the case of two ethernet headers when I wrote my previous email. I think that at this stage I would prefer to prohibit push_eth() acting on a packet which already has an ethernet header. Indeed that is what my patch-set already does in its modifications of __ovs_nla_copy_actions(). The reason that I lean towards prohibiting this is that I do not have an easy way to exercise this case within the current patch-set. And thus this extra complexity seems well suited to being handled handled incrementally as further work.
Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0
On Thu, 2016-05-05 at 13:10 +0200, Arnd Bergmann wrote: > On Thursday 05 May 2016 09:41:32 Yangbo Lu wrote: > > > -Original Message- > > > From: Arnd Bergmann [mailto:a...@arndb.de] > > > Sent: Thursday, May 05, 2016 4:32 PM > > > To: linuxppc-...@lists.ozlabs.org > > > Cc: Yangbo Lu; linux-...@vger.kernel.org; devicet...@vger.kernel.org; > > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; > > > linux-...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux- > > > foundation.org; netdev@vger.kernel.org; Mark Rutland; > > > ulf.hans...@linaro.org; Russell King; Bhupesh Sharma; Joerg Roedel; > > > Santosh Shilimkar; Yang-Leo Li; Scott Wood; Rob Herring; Claudiu Manoil; > > > Kumar Gala; Xiaobo Xie; Qiang Zhao > > > Subject: Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for T4240- > > > R1.0-R2.0 > > > > > > On Thursday 05 May 2016 11:12:30 Yangbo Lu wrote: > > > > IIRC, it is the same IP block as i.MX and Arnd's point is this won't > > > > even compile on !PPC. It is things like this that prevent sharing the > > > > driver. > > > > The whole point of using the MMIO SVR instead of the PPC SPR is so that > > it will work on ARM... The guts driver should build on any platform as > > long as OF is enabled, and if it doesn't find a node to bind to it will > > return 0 for SVR, and the eSDHC driver will continue (after printing an > > error that should be removed) without the ability to test for errata > > based on SVR. > > It feels like a bad design to have to come up with a different > method for each SoC type here when they all do the same thing > and want to identify some variant of the chip to do device > specific quirks. > > As far as I'm concerned, every driver in drivers/soc that needs to > export a symbol to be used by a device driver is an indication that > we don't have the right set of abstractions yet. There are cases > that are not worth abstracting because the functionality is rather > obscure and only a couple of drivers for one particular chip > ever need it. > > Finding out the version of the SoC does not look like this case. I'm open to new ways of abstracting this, but can that please be discussed after these patches are merged? This patchset is fixing a problem, the existing abstraction is unappealing and not widely adopted, a new abstraction is not ready, and we're only touching code for our hardware. Oh, and the existing abstraction isn't even "existing". I don't see any examples where soc_device is being used like this -- or even any way for a driver (the one consuming the information, not the soc "driver") to get a reference to the soc_device that's been registered short of searching for the device object by name -- and you're asking for new functionality in drivers/base/soc.c. > > > I think the first four patches take care of building for ARM, > > > but the problem remains if you want to enable COMPILE_TEST as > > > we need for certain automated checking. > > > > What specific problem is there with COMPILE_TEST? > > COMPILE_TEST is solvable here and the way it is implemented in this > case (selecting FSL_GUTS from the driver) indeed looks like it works > correctly, but it's still awkward that this means building the > SoC specific ID stuff into the vmlinux binary for any driver that > uses something like that for a particular SoC. Please keep in mind that this is a Freescale-specific driver... it's not as if we're attaching this dependency to common SDHCI code. > > > > > Dealing with Si revs is a common problem. We should have a > > > > common solution. There is soc_device for this purpose. > > > > > > Exactly. The last time this came up, I think we agreed to implement a > > > helper using glob_match() on the soc_device strings. Unfortunately > > > this hasn't happened then, but I'd still prefer that over yet another > > > vendor-specific way of dealing with the generic issue. > > > > soc_device would require encoding the SVR as a string and then decoding > > the string, which is more complicated and error prone than having > > platform-specific code test a platform-specific number. > > You already need to encode it as a string to register the soc_device, No we don't, because we don't already register a soc_device on arm64 or ppc (and it looks like whatever does get registered on at least some relevant arm32 chips is not particularly useful). > and the driver just needs to pass a glob string, so the only part that > is missing is the generic function that takes the string from the > driver and passes that to glob_match for the soc_device. "just" And what would the glob look like? I'd rather not write kernel code as if it were a shell/Perl script. > > And when would it get registered on arm64, which doesn't have > > platform code? > > Whenever the soc driver is loaded, as is the case now. The match > function can return -EPROBE_DEFER if no SoC device is registered > yet. That's too late for some places where we need
Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support
Hi Jiri, On Wed, May 11, 2016 at 10:50:09AM +0900, Simon Horman wrote: [...] > > > Its possible that I've overlooked something but as things stand I think > > > things look like this: > > > > > > * ovs_flow_key_extract() keys off dev->type and skb->protocol. > > > * ovs_flow_key_extract() calls key_extract() which amongst other things > > > sets up the skb->mac_header and skb->mac_len of the skb. > > > * ovs_flow_key_extract() sets skb->protocol to that of the inner packet > > > in the case of TEB > > > * Actions update the above mentioned skb fields as appropriate. > > > > Okay, that actually eases things somewhat. > > > > > So it seems to me that it should be safe to rely on skb->protocol > > > in the receive path. Or more specifically, in netdev_port_receive(). > > > > > > If mac_len is also able to be used then I think fine. But it seems to me > > > that it needs to be set up by OvS at least for the ARPHRD_NONE case. This > > > could be done early on, say in netdev_port_receive(). But it seems that > > > would involve duplicating some of what is already occurring in > > > key_extract(). > > > > I'd actually prefer doing this earlier, netdev_port_receive looks like > > the right place. Just set mac_len there (or whatever) and let > > key_extract do the rest of the work, not depending on dev->type in > > there. > > > > My point about recirculation was not actually valid, as I missed you're > > doing this in ovs_flow_key_extract and not in key_extract. Still, > > I think the special handling of particular interface types belongs to > > the tx processing on those interfaces, not to the common code. > > Sure, if that is your preference I think it should be simple enough to > implement. I agree that netdev_port_receive() looks like a good place for > this. So far I have the following, which seems to work changes to make dev->type ARPHRD_NONE for compat GRE vports. Is this close to what you had in mind? diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index d320c2657627..4d2698596033 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -700,8 +700,6 @@ int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key) int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info, struct sk_buff *skb, struct sw_flow_key *key) { - bool is_layer3 = false; - bool is_teb = false; int err; /* Extract metadata from packet. */ @@ -709,14 +707,6 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info, key->tun_proto = ip_tunnel_info_af(tun_info); memcpy(>tun_key, _info->key, sizeof(key->tun_key)); - if (OVS_CB(skb)->input_vport->dev->type != ARPHRD_ETHER) { - if (skb->protocol == htons(ETH_P_TEB)) - is_teb = true; - else - is_layer3 = true; - } - - if (tun_info->options_len) { BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) * 8)) - 1 @@ -739,17 +729,17 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info, key->phy.skb_mark = skb->mark; ovs_ct_fill_key(skb, key); key->ovs_flow_hash = 0; - key->phy.is_layer3 = is_layer3; + key->phy.is_layer3 = (tun_info && skb->mac_len == 0); key->recirc_id = 0; err = key_extract(skb, key); if (err < 0) return err; - if (is_teb) - skb->protocol = key->eth.type; - else if (is_layer3) + if (key->phy.is_layer3) key->eth.type = skb->protocol; + else if (tun_info && skb->protocol == htons(ETH_P_TEB)) + skb->protocol = key->eth.type; return err; } diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c index 7d54414b35eb..ac8178ac2c81 100644 --- a/net/openvswitch/vport-netdev.c +++ b/net/openvswitch/vport-netdev.c @@ -60,7 +60,21 @@ static void netdev_port_receive(struct sk_buff *skb) if (vport->dev->type == ARPHRD_ETHER) { skb_push(skb, ETH_HLEN); skb_postpush_rcsum(skb, skb->data, ETH_HLEN); + } else if (vport->dev->type == ARPHRD_NONE) { + if (skb->protocol == htons(ETH_P_TEB)) { + struct ethhdr *eth = eth_hdr(skb); + + if (unlikely(skb->len < ETH_HLEN)) + goto error; + + skb->mac_len = ETH_HLEN; + if (eth->h_proto == htons(ETH_P_8021Q)) + skb->mac_len += VLAN_HLEN; + } else { + skb->mac_len = 0; + } } + ovs_vport_receive(vport, skb, skb_tunnel_info(skb)); return; error:
Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver
Florian Fainelli wrote: The Ethernet MAC should be started in ndo_open() and stopped in ndo_close(), in between, there are link state changes, but you are not supposed to stop or start your Ethernet MAC and its DMA for instance during link change, if that is a HW requirement, your HW is pretty funky. I think the problem is that the current driver seems to be too eager to start/stop the MAC. Please take a look at emac_work_thread_link_check() at https://lkml.org/lkml/2016/4/13/670. Every time the PHY link goes up, it does this: if (phy->link_up) { if (netif_carrier_ok(netdev)) goto link_task_done; pm_runtime_get_sync(netdev->dev.parent); netif_info(adpt, timer, adpt->netdev, "NIC Link is Up %s\n", speed); emac_mac_start(adpt); netif_carrier_on(netdev); netif_wake_queue(netdev); The call to emac_mac_start seems wrong to me here. -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation.
Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support
Hi Jiri, On Tue, May 10, 2016 at 02:01:06PM +0200, Jiri Benc wrote: > On Mon, 9 May 2016 17:04:22 +0900, Simon Horman wrote: > > It seems to be caused by the following: > > > > 1. __ipgre_rcv() calls skb_pop_mac_header() which > >sets skb->mac_header to the skb->network_header. > > > > 2. __ipgre_rcv() then calls ip_tunnel_rcv() which calls > >skb_reset_network_header(). This updates skb->network_header to > >just after the end of the GRE header. > > > >This is 28 bytes after the old skb->network_header > >as there is a 20 byte IPv4 header followed by an > >8 byte GRE header. > > > > 3. Later, dev_gro_receive() calls skb_reset_mac_len(). > >This calculates skb->mac_len based on skb->network_header and > >skb->mac_header. I.e. 28 bytes. > > Right. Thanks for tracking this down! > > > I think this may be possible to address by calling > > skb_reset_network_header() instead of skb_pop_mac_header() > > in __ipgre_rcv(). > > We can't do that. The interface type is ARPHRD_IPGRE and not > ARPHRD_NONE, so the current behavior makes pretty good sense. See > e.g. commit 0e3da5bb8da45. > > We have two options here: > > 1. As for metadata tunnels all the info is in metadata_dst and we >don't need the IP/GRE header for anything, we can make the ipgre >interface ARPHRD_NONE in metadata based mode. > > 2. We can fix this up in ovs after receiving the packet from >ARPHRD_IPGRE interface. > > I think the first option is the correct one. We already don't assign > dev->header_ops in metadata mode. I'll prepare a patch. I agree that 1. seems to be the better approach. > > Its possible that I've overlooked something but as things stand I think > > things look like this: > > > > * ovs_flow_key_extract() keys off dev->type and skb->protocol. > > * ovs_flow_key_extract() calls key_extract() which amongst other things > > sets up the skb->mac_header and skb->mac_len of the skb. > > * ovs_flow_key_extract() sets skb->protocol to that of the inner packet > > in the case of TEB > > * Actions update the above mentioned skb fields as appropriate. > > Okay, that actually eases things somewhat. > > > So it seems to me that it should be safe to rely on skb->protocol > > in the receive path. Or more specifically, in netdev_port_receive(). > > > > If mac_len is also able to be used then I think fine. But it seems to me > > that it needs to be set up by OvS at least for the ARPHRD_NONE case. This > > could be done early on, say in netdev_port_receive(). But it seems that > > would involve duplicating some of what is already occurring in > > key_extract(). > > I'd actually prefer doing this earlier, netdev_port_receive looks like > the right place. Just set mac_len there (or whatever) and let > key_extract do the rest of the work, not depending on dev->type in > there. > > My point about recirculation was not actually valid, as I missed you're > doing this in ovs_flow_key_extract and not in key_extract. Still, > I think the special handling of particular interface types belongs to > the tx processing on those interfaces, not to the common code. Sure, if that is your preference I think it should be simple enough to implement. I agree that netdev_port_receive() looks like a good place for this.
Re: [PATCH net] openvswitch: Fix cached ct with helper.
On 10 May 2016 at 16:55, Jarno Rajahalmewrote: > This would result in inconsistent helper assignment if a first CT action > assigns a helper and a further CT action tries to assign a different helper; > Typically the second helper assignment would be ignored, but if the > unconfirmed conntrack entry is lost due to an upcall the second helper > assignment would be successful. This is best resolved by allowing helper > assignment by a committing CT action only by testing the 'info->commit' flag > in addition to the conditions you have there already. It may also be helpful > to fail helper assignment without the commit flag in parse_ct(). Strictly speaking I think that skb_nfct_cached() handles at least some of those cases but I agree it should be more explicit here. I'll send a v2. We can follow up separately on net-next to improve the parsing/make that more user-friendly.
linux-next: manual merge of the net-next tree with the net tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: net/netfilter/nf_conntrack_core.c between commit: 70d72b7e060e ("netfilter: conntrack: init all_locks to avoid debug warning") from the net tree and commit: a3efd81205b1 ("netfilter: conntrack: move generation seqcnt out of netns_ct") 56d52d4892d0 ("netfilter: conntrack: use a single hashtable for all namespaces") 0c5366b3a8c7 ("netfilter: conntrack: use single slab cache") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc net/netfilter/nf_conntrack_core.c index 895d11dced3c,566c64e3ec50.. --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@@ -66,7 -69,12 +69,12 @@@ EXPORT_SYMBOL_GPL(nf_conntrack_locks) __cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock); EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock); + struct hlist_nulls_head *nf_conntrack_hash __read_mostly; + EXPORT_SYMBOL_GPL(nf_conntrack_hash); + + static __read_mostly struct kmem_cache *nf_conntrack_cachep; -static __read_mostly spinlock_t nf_conntrack_locks_all_lock; +static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock); + static __read_mostly seqcount_t nf_conntrack_generation; static __read_mostly bool nf_conntrack_locks_all; void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
Re: [PATCH net] openvswitch: Fix cached ct with helper.
This would result in inconsistent helper assignment if a first CT action assigns a helper and a further CT action tries to assign a different helper; Typically the second helper assignment would be ignored, but if the unconfirmed conntrack entry is lost due to an upcall the second helper assignment would be successful. This is best resolved by allowing helper assignment by a committing CT action only by testing the 'info->commit' flag in addition to the conditions you have there already. It may also be helpful to fail helper assignment without the commit flag in parse_ct(). Jarno > On May 10, 2016, at 11:40 AM, Joe Stringerwrote: > > When using conntrack helpers from OVS, a common configuration is to > perform a lookup without specifying a helper, then go through a > firewalling policy, only to decide to attach a helper afterwards. > > In this case, the initial lookup will cause a ct entry to be attached to > the skb, then the later commit with helper should attach the helper and > confirm the connection. However, the helper attachment has been missing. > If the user has enabled automatic helper attachment, then this issue > will be masked as it will be applied in init_conntrack(). It is also > masked if the action is executed from ovs_packet_cmd_execute() as that > will construct a fresh skb. > > This patch fixes the issue by making an explicit call to try to assign > the helper if there is a discrepancy between the action's helper and the > current skb->nfct. > > Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") > Signed-off-by: Joe Stringer > --- > net/openvswitch/conntrack.c | 12 > 1 file changed, 12 insertions(+) > > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c > index b5fea1101faa..89f61a1720eb 100644 > --- a/net/openvswitch/conntrack.c > +++ b/net/openvswitch/conntrack.c > @@ -776,6 +776,18 @@ static int __ovs_ct_lookup(struct net *net, struct > sw_flow_key *key, > return -EINVAL; > } > > + /* Userspace may decide to perform a ct lookup without a helper > + * specified followed by a (recirculate and) commit with one. > + * Therefore, for unconfirmed connections we need to attach the > + * helper here. > + */ > + if (!nf_ct_is_confirmed(ct) && info->helper && !nfct_help(ct)) { > + int err = __nf_ct_try_assign_helper(ct, info->ct, > + GFP_ATOMIC); > + if (err) > + return err; > + } > + > /* Call the helper only if: >* - nf_conntrack_in() was executed above ("!cached") for a >* confirmed connection, or > -- > 2.1.4 >
Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver
On 05/10/2016 04:18 PM, Timur Tabi wrote: > Florian Fainelli wrote: >> Are you utilizing the PHYLIB APIs properly? You need at least a >> phy_start() to start the PHY state machine, and an adjust_link callback >> to be provided to phy_connect() (or of_phy_connect()) to manage link >> state changes. And that's the very basic minimum here, there could be >> additional APIs that you may end up using. >> >> There are tons of example in tree of drivers doing this, bcmgenet, >> bcmsysport, tg3 etc. > > Thank you. I think I finally got phylib working, more or less. > > Unfortunately, it seems I have some kind of race condition. The driver > has a lot that's wrong with it, and I'm trying to fix it all. One crazy > the driver does is it create a workqueue to handle a lot of the tasks > that would normally be handled in the interrupt handler itself. That sounds like a typicall top half/bottom half split, fair enough. > > With phylib support, I know my driver can call phy_mac_interrupt() when > it gets a link status change interrupt. I then have an .adjust_link > callback which starts or stops the mac accordingly. The Ethernet MAC should be started in ndo_open() and stopped in ndo_close(), in between, there are link state changes, but you are not supposed to stop or start your Ethernet MAC and its DMA for instance during link change, if that is a HW requirement, your HW is pretty funky. > > My problem is that I'm not really sure what adjust_link is supposed to > be doing. Well, it's pretty simple, it is about re-configuring your Ethernet MAC based on what the PHY link state mandates: duplex, pause, speed changes, EEE etc is what this callback is supposed to take care of, at the Ethernet MAC level. > In addition, it seems that I need to keep the workqueue > running, otherwise the interface will not function. I bring the > interface up, and the driver reports success, but pings do not work. > > I'm getting really frustrated. The sample code isn't really helping a > whole lot, because I lack a fundamental understanding of what needs to > be done. None of the documentation I've read is helpful, and I don't > know how to debug it. Seriously, no documentation is helpful? The PHY library seems pretty well documented to me, but I suppose I have a bias, oh, and patches are welcome of course. > > Can you give me some advice on how to debug this? Take a look at drivers/net/ethernet/broadcom/genet/bcmgenet.c and see how it deals with managing link state changes for instance. The code is pretty straight forward: link interrupt (and other causes) trigger a workqueue schedule, which then processes link state changes and calls phy_mac_interrupt(), which in turn makes the PHY library adjust the interface carrier state. -- Florian
Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver
Florian Fainelli wrote: Are you utilizing the PHYLIB APIs properly? You need at least a phy_start() to start the PHY state machine, and an adjust_link callback to be provided to phy_connect() (or of_phy_connect()) to manage link state changes. And that's the very basic minimum here, there could be additional APIs that you may end up using. There are tons of example in tree of drivers doing this, bcmgenet, bcmsysport, tg3 etc. Thank you. I think I finally got phylib working, more or less. Unfortunately, it seems I have some kind of race condition. The driver has a lot that's wrong with it, and I'm trying to fix it all. One crazy the driver does is it create a workqueue to handle a lot of the tasks that would normally be handled in the interrupt handler itself. With phylib support, I know my driver can call phy_mac_interrupt() when it gets a link status change interrupt. I then have an .adjust_link callback which starts or stops the mac accordingly. My problem is that I'm not really sure what adjust_link is supposed to be doing. In addition, it seems that I need to keep the workqueue running, otherwise the interface will not function. I bring the interface up, and the driver reports success, but pings do not work. I'm getting really frustrated. The sample code isn't really helping a whole lot, because I lack a fundamental understanding of what needs to be done. None of the documentation I've read is helpful, and I don't know how to debug it. Can you give me some advice on how to debug this? -- Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation collaborative project.
[PATCH net 0/2] bnxt_en: Add workaround to detect bad opaque in rx completion.
2-part workaround for this hardware bug. Michael Chan (2): bnxt_en: Add workaround to detect bad opaque in rx completion (part 1) bnxt_en: Add workaround to detect bad opaque in rx completion (part 2) drivers/net/ethernet/broadcom/bnxt/bnxt.c | 63 +++ drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 + 2 files changed, 65 insertions(+) -- 1.8.3.1
[PATCH net 1/2] bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)
There is a rare hardware bug that can cause a bad opaque value in the RX or TPA completion. When this happens, the hardware may have used the same buffer twice for 2 rx packets. In addition, the driver will also crash later using the bad opaque as the index into the ring. The rx opaque value is predictable and is always monotonically increasing. The workaround is to keep track of the expected next opaque value and compare it with the one returned by hardware during RX and TPA start completions. If they miscompare, we will not process any more RX and TPA completions and exit NAPI. We will then schedule a workqueue to reset the function. This patch adds the logic to keep track of the next rx consumer index. Signed-off-by: Michael Chan--- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++ drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 9d4e8e1..58999cd 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -867,6 +867,7 @@ static void bnxt_tpa_start(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, rxr->rx_prod = NEXT_RX(prod); cons = NEXT_RX(cons); + rxr->rx_next_cons = NEXT_RX(cons); cons_rx_buf = >rx_buf_ring[cons]; bnxt_reuse_rx_data(rxr, cons, cons_rx_buf->data); @@ -1245,6 +1246,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_napi *bnapi, u32 *raw_cons, next_rx: rxr->rx_prod = NEXT_RX(prod); + rxr->rx_next_cons = NEXT_RX(cons); next_rx_no_prod: *raw_cons = tmp_raw_cons; @@ -2486,6 +2488,7 @@ static void bnxt_clear_ring_indices(struct bnxt *bp) rxr->rx_prod = 0; rxr->rx_agg_prod = 0; rxr->rx_sw_agg_prod = 0; + rxr->rx_next_cons = 0; } } } diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index 8b823ff..52f2d74 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -584,6 +584,7 @@ struct bnxt_rx_ring_info { u16 rx_prod; u16 rx_agg_prod; u16 rx_sw_agg_prod; + u16 rx_next_cons; void __iomem*rx_doorbell; void __iomem*rx_agg_doorbell; -- 1.8.3.1
[PATCH net 2/2] bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)
Add detection and recovery code when the hardware returned opaque value does not match the expected consumer index. Once the issue is detected, we skip the processing of all RX and LRO/GRO packets. These completion entries are discarded without sending the SKB to the stack and without producing new buffers. The function will be reset from a workqueue. Signed-off-by: Michael Chan--- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 60 +++ drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 + 2 files changed, 61 insertions(+) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 58999cd..c39a7f5 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -813,6 +813,46 @@ static inline struct sk_buff *bnxt_copy_skb(struct bnxt_napi *bnapi, u8 *data, return skb; } +static int bnxt_discard_rx(struct bnxt *bp, struct bnxt_napi *bnapi, + u32 *raw_cons, void *cmp) +{ + struct bnxt_cp_ring_info *cpr = >cp_ring; + struct rx_cmp *rxcmp = cmp; + u32 tmp_raw_cons = *raw_cons; + u8 cmp_type, agg_bufs = 0; + + cmp_type = RX_CMP_TYPE(rxcmp); + + if (cmp_type == CMP_TYPE_RX_L2_CMP) { + agg_bufs = (le32_to_cpu(rxcmp->rx_cmp_misc_v1) & + RX_CMP_AGG_BUFS) >> + RX_CMP_AGG_BUFS_SHIFT; + } else if (cmp_type == CMP_TYPE_RX_L2_TPA_END_CMP) { + struct rx_tpa_end_cmp *tpa_end = cmp; + + agg_bufs = (le32_to_cpu(tpa_end->rx_tpa_end_cmp_misc_v1) & + RX_TPA_END_CMP_AGG_BUFS) >> + RX_TPA_END_CMP_AGG_BUFS_SHIFT; + } + + if (agg_bufs) { + if (!bnxt_agg_bufs_valid(bp, cpr, agg_bufs, _raw_cons)) + return -EBUSY; + } + *raw_cons = tmp_raw_cons; + return 0; +} + +static void bnxt_sched_reset(struct bnxt *bp, struct bnxt_rx_ring_info *rxr) +{ + if (!rxr->bnapi->in_reset) { + rxr->bnapi->in_reset = true; + set_bit(BNXT_RESET_TASK_SP_EVENT, >sp_event); + schedule_work(>sp_task); + } + rxr->rx_next_cons = 0x; +} + static void bnxt_tpa_start(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, struct rx_tpa_start_cmp *tpa_start, struct rx_tpa_start_cmp_ext *tpa_start1) @@ -830,6 +870,11 @@ static void bnxt_tpa_start(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, prod_rx_buf = >rx_buf_ring[prod]; tpa_info = >rx_tpa[agg_id]; + if (unlikely(cons != rxr->rx_next_cons)) { + bnxt_sched_reset(bp, rxr); + return; + } + prod_rx_buf->data = tpa_info->data; mapping = tpa_info->mapping; @@ -981,6 +1026,14 @@ static inline struct sk_buff *bnxt_tpa_end(struct bnxt *bp, dma_addr_t mapping; struct sk_buff *skb; + if (unlikely(bnapi->in_reset)) { + int rc = bnxt_discard_rx(bp, bnapi, raw_cons, tpa_end); + + if (rc < 0) + return ERR_PTR(-EBUSY); + return NULL; + } + tpa_info = >rx_tpa[agg_id]; data = tpa_info->data; prefetch(data); @@ -1147,6 +1200,12 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_napi *bnapi, u32 *raw_cons, cons = rxcmp->rx_cmp_opaque; rx_buf = >rx_buf_ring[cons]; data = rx_buf->data; + if (unlikely(cons != rxr->rx_next_cons)) { + int rc1 = bnxt_discard_rx(bp, bnapi, raw_cons, rxcmp); + + bnxt_sched_reset(bp, rxr); + return rc1; + } prefetch(data); agg_bufs = (le32_to_cpu(rxcmp->rx_cmp_misc_v1) & RX_CMP_AGG_BUFS) >> @@ -4465,6 +4524,7 @@ static void bnxt_enable_napi(struct bnxt *bp) int i; for (i = 0; i < bp->cp_nr_rings; i++) { + bp->bnapi[i]->in_reset = false; bnxt_enable_poll(bp->bnapi[i]); napi_enable(>bnapi[i]->napi); } diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index 52f2d74..de9d53e 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -637,6 +637,7 @@ struct bnxt_napi { #ifdef CONFIG_NET_RX_BUSY_POLL atomic_tpoll_state; #endif + boolin_reset; }; #ifdef CONFIG_NET_RX_BUSY_POLL -- 1.8.3.1
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote: > Not only did we want to present this solely as a bugfix but also as as > performance enhancements in case of virtio (as you can see in the cover > letter). Given that a long time ago there was a tendency to remove > softirqs completely, we thought it might be very interesting, that a > threaded napi in general seems to be absolutely viable nowadays and > might offer new features. Well, you did not fix the bug, you worked around by adding yet another layer, with another sysctl that admins or programs have to manage. If you have a special need for virtio, do not hide it behind a 'bug fix' but add it as a features request. This ksoftirqd issue is real and a fix looks very reasonable. Please try this patch, as I had very good success with it. Thanks. diff --git a/kernel/softirq.c b/kernel/softirq.c index 17caf4b63342..22463217e3cf 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat); static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp; DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +DEFINE_PER_CPU(bool, ksoftirqd_scheduled); const char * const softirq_to_name[NR_SOFTIRQS] = { "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL", @@ -73,8 +74,10 @@ static void wakeup_softirqd(void) /* Interrupts are disabled: no need to stop preemption */ struct task_struct *tsk = __this_cpu_read(ksoftirqd); - if (tsk && tsk->state != TASK_RUNNING) + if (tsk && tsk->state != TASK_RUNNING) { + __this_cpu_write(ksoftirqd_scheduled, true); wake_up_process(tsk); + } } /* @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) */ preempt_count_sub(cnt - 1); - if (unlikely(!in_interrupt() && local_softirq_pending())) { + if (unlikely(!in_interrupt() && +local_softirq_pending() && +!__this_cpu_read(ksoftirqd_scheduled))) { /* * Run softirq if any pending. And do it in its own stack * as we may be calling this deep in a task call stack already. @@ -340,6 +345,9 @@ void irq_enter(void) static inline void invoke_softirq(void) { + if (__this_cpu_read(ksoftirqd_scheduled)) + return; + if (!force_irqthreads) { #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK /* @@ -660,6 +668,8 @@ static void run_ksoftirqd(unsigned int cpu) * in the task stack here. */ __do_softirq(); + if (!local_softirq_pending()) + __this_cpu_write(ksoftirqd_scheduled, false); local_irq_enable(); cond_resched_rcu_qs(); return;
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 15:02 -0700, Eric Dumazet wrote: > On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote: > > On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote: > > > > > You might need another one of these in invoke_softirq() > > > > > > > Excellent. > > > > I gave it a quick try (without your suggestion), and host seems to > > survive a stress test. > > > > Of course we do have to fix these problems : > > > > [ 147.781629] NOHZ: local_softirq_pending 48 > > [ 147.785546] NOHZ: local_softirq_pending 48 > > [ 147.788344] NOHZ: local_softirq_pending 48 > > [ 147.788992] NOHZ: local_softirq_pending 48 > > [ 147.790943] NOHZ: local_softirq_pending 48 > > [ 147.791232] NOHZ: local_softirq_pending 24a > > [ 147.791258] NOHZ: local_softirq_pending 48 > > [ 147.791366] NOHZ: local_softirq_pending 48 > > [ 147.792118] NOHZ: local_softirq_pending 48 > > [ 147.793428] NOHZ: local_softirq_pending 48 > > > Well, with your suggestion, these warnings disappear ;) This is really nice. Under stress number of context switches is really small. ksoftirqd and my netserver compete equally to get the cpu cycles (on CPU0) lpaa23:~# vmstat 1 10 procs ---memory-- ---swap-- -io -system-- cpu r b swpd free buff cache si sobibo in cs us sy id wa 2 0 0 260668416 37240 24144280021 0 329 349 0 3 96 0 1 0 0 260667904 37240 241442800 012 193126 1050 0 2 98 0 1 0 0 260667904 37240 241442800 0 0 194354 1056 0 2 98 0 1 0 0 260669104 37240 241449200 0 0 200897 1095 0 2 98 0 1 0 0 260668592 37240 241449200 0 0 205731 964 0 2 98 0 1 0 0 260678832 37240 241449200 0 0 201689 981 0 2 98 0 1 0 0 260678832 37240 241449200 0 0 204899 742 0 2 98 0 1 0 0 260678320 37240 241449200 0 0 199148 792 0 3 97 0 1 0 0 260678832 37240 241449200 0 0 196398 766 0 2 98 0 1 0 0 260678832 37240 241449200 0 0 201930 858 0 2 98 0 And we can see that ksoftirqd/0 runs for longer periods (~500 usec), instead of stupid 4 usec before the patch. Less overhead. lpaa23:~# cat /proc/3/sched ksoftirqd/0 (3, #threads: 1) --- se.exec_start: 1552401.399526 se.vruntime :237599.421560 se.sum_exec_runtime : 75432.494199 se.nr_migrations :0 nr_switches : 144333 nr_voluntary_switches: 143828 nr_involuntary_switches : 505 se.load.weight : 1024 se.avg.load_sum :10445 se.avg.util_sum :10445 se.avg.load_avg :0 se.avg.util_avg :0 se.avg.last_update_time :1552401399526 policy :0 prio : 120 clock-delta : 47 lpaa23:~# echo 75432.494199/144333|bc -l .52262818758703830724 And yes indeed, user space can progress way faster under flood. lpaa23:~# nstat >/dev/null;sleep 1;nstat | grep Udp UdpInDatagrams 186132 0.0 UdpInErrors 735462 0.0 UdpOutDatagrams 10 0.0 UdpRcvbufErrors 735461 0.0
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On 10.05.2016 23:09, Eric Dumazet wrote: > On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa >wrote: > >> I agree here, but I don't think this patch particularly is a lot of >> bloat and something very interesting people can play with and extend upon. >> > > Sure, very rarely patch authors think their stuff is bloat. > > I prefer to fix kernel softirq.c, or at least show me that you tried > hard enough. > > I am pretty sure that the following would work : > > When ksoftirqd is scheduled, remember this in a per cpu variable > (ksoftiqd_scheduled) > > When enabling BH , do not call do_softirq() if this variable is set. > > ksoftirqd would clear the variable at the right place (probably in > run_ksoftirqd()) > > Sure, this might add a lot of latency regressions, but lets fix them. Probably, yes. We had a version which limited the number of restarts if softirqs were invoked from local_bh_enable (so that at least timers etc. would run) and would defer all other work to ksoftirqd. That also solved the initial live lock problem. I do have concerns about the fairness of this approach, but we now have to investigate this. ;) Not only did we want to present this solely as a bugfix but also as as performance enhancements in case of virtio (as you can see in the cover letter). Given that a long time ago there was a tendency to remove softirqs completely, we thought it might be very interesting, that a threaded napi in general seems to be absolutely viable nowadays and might offer new features. Bye, Hannes
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote: > On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote: > > > You might need another one of these in invoke_softirq() > > > > Excellent. > > I gave it a quick try (without your suggestion), and host seems to > survive a stress test. > > Of course we do have to fix these problems : > > [ 147.781629] NOHZ: local_softirq_pending 48 > [ 147.785546] NOHZ: local_softirq_pending 48 > [ 147.788344] NOHZ: local_softirq_pending 48 > [ 147.788992] NOHZ: local_softirq_pending 48 > [ 147.790943] NOHZ: local_softirq_pending 48 > [ 147.791232] NOHZ: local_softirq_pending 24a > [ 147.791258] NOHZ: local_softirq_pending 48 > [ 147.791366] NOHZ: local_softirq_pending 48 > [ 147.792118] NOHZ: local_softirq_pending 48 > [ 147.793428] NOHZ: local_softirq_pending 48 Well, with your suggestion, these warnings disappear ;)
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote: > On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote: > > > > > You might need another one of these in invoke_softirq() > > > Excellent. > > I gave it a quick try (without your suggestion), and host seems to > survive a stress test. > > Of course we do have to fix these problems : > > [ 147.781629] NOHZ: local_softirq_pending 48 > [ 147.785546] NOHZ: local_softirq_pending 48 > [ 147.788344] NOHZ: local_softirq_pending 48 > [ 147.788992] NOHZ: local_softirq_pending 48 > [ 147.790943] NOHZ: local_softirq_pending 48 > [ 147.791232] NOHZ: local_softirq_pending 24a > [ 147.791258] NOHZ: local_softirq_pending 48 > [ 147.791366] NOHZ: local_softirq_pending 48 > [ 147.792118] NOHZ: local_softirq_pending 48 > [ 147.793428] NOHZ: local_softirq_pending 48 As long as ksoftirqd is running, that should not be an actual problem, just a false positive. -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote: > You might need another one of these in invoke_softirq() > Excellent. I gave it a quick try (without your suggestion), and host seems to survive a stress test. Of course we do have to fix these problems : [ 147.781629] NOHZ: local_softirq_pending 48 [ 147.785546] NOHZ: local_softirq_pending 48 [ 147.788344] NOHZ: local_softirq_pending 48 [ 147.788992] NOHZ: local_softirq_pending 48 [ 147.790943] NOHZ: local_softirq_pending 48 [ 147.791232] NOHZ: local_softirq_pending 24a [ 147.791258] NOHZ: local_softirq_pending 48 [ 147.791366] NOHZ: local_softirq_pending 48 [ 147.792118] NOHZ: local_softirq_pending 48 [ 147.793428] NOHZ: local_softirq_pending 48 Thanks.
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 14:31 -0700, Eric Dumazet wrote: > On Tue, 2016-05-10 at 14:09 -0700, Eric Dumazet wrote: > > > > On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa > >wrote: > > > > > > > > I agree here, but I don't think this patch particularly is a lot > > > of > > > bloat and something very interesting people can play with and > > > extend upon. > > > > > Sure, very rarely patch authors think their stuff is bloat. > > > > I prefer to fix kernel softirq.c, or at least show me that you > > tried > > hard enough. > > > > I am pretty sure that the following would work : > > > > When ksoftirqd is scheduled, remember this in a per cpu variable > > (ksoftiqd_scheduled) > > > > When enabling BH , do not call do_softirq() if this variable is > > set. > > > > ksoftirqd would clear the variable at the right place (probably in > > run_ksoftirqd()) > > > > Sure, this might add a lot of latency regressions, but lets fix > > them. > Only to give the idea (it is completely untested and probably buggy) > > diff --git a/kernel/softirq.c b/kernel/softirq.c > index 17caf4b63342..cb30cfd76687 100644 > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat); > static struct softirq_action softirq_vec[NR_SOFTIRQS] > __cacheline_aligned_in_smp; > > DEFINE_PER_CPU(struct task_struct *, ksoftirqd); > +DEFINE_PER_CPU(bool, ksoftirqd_scheduled); > > const char * const softirq_to_name[NR_SOFTIRQS] = { > "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL", > @@ -73,8 +74,10 @@ static void wakeup_softirqd(void) > /* Interrupts are disabled: no need to stop preemption */ > struct task_struct *tsk = __this_cpu_read(ksoftirqd); > > - if (tsk && tsk->state != TASK_RUNNING) > + if (tsk && tsk->state != TASK_RUNNING) { > + __this_cpu_write(ksoftirqd_scheduled, true); > wake_up_process(tsk); > + } > } > > /* > @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, > unsigned int cnt) > */ > preempt_count_sub(cnt - 1); > > - if (unlikely(!in_interrupt() && local_softirq_pending())) { > + if (unlikely(!in_interrupt() && > + local_softirq_pending() && > + !__this_cpu_read(ksoftirqd_scheduled))) { > /* > You might need another one of these in invoke_softirq() -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
[PATCH net-next 3/3] net/mlx5e: Enable CQE compression when PCI is slower than link
We turn the feature ON, only for servers with PCI BW < MAX LINK BW, as it helps reducing PCI pressure on weak PCI slots, but it adds some software overhead. Signed-off-by: Saeed MahameedSigned-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 + .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 19 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 52 ++ 3 files changed, 72 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index e05abad..e8a6c33 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -645,6 +645,7 @@ int mlx5e_close_locked(struct net_device *netdev); void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev, u32 *indirection_rqt, int len, int num_channels); +int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed); static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq, struct mlx5_wqe_ctrl_seg *ctrl, int bf_sz) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c index 534d99e..fc7dcc0 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c @@ -613,6 +613,25 @@ static u32 ptys2ethtool_supported_port(u32 eth_proto_cap) return 0; } +int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed) +{ + u32 max_speed = 0; + u32 proto_cap; + int err; + int i; + + err = mlx5_query_port_proto_cap(mdev, _cap, MLX5_PTYS_EN); + if (err) + return err; + + for (i = 0; i < MLX5E_LINK_MODES_NUMBER; ++i) + if (proto_cap & MLX5E_PROT_MASK(i)) + max_speed = max(max_speed, ptys2ethtool_table[i].speed); + + *speed = max_speed; + return 0; +} + static void get_speed_duplex(struct net_device *netdev, u32 eth_proto_oper, struct ethtool_cmd *cmd) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index e40e59a..0804070 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -2716,11 +2716,49 @@ static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev) MLX5_CAP_ETH(mdev, reg_umr_sq); } +static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw) +{ + enum pcie_link_width width; + enum pci_bus_speed speed; + int err = 0; + + err = pcie_get_minimum_link(mdev->pdev, , ); + if (err) + return err; + + if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) + return -EINVAL; + + switch (speed) { + case PCIE_SPEED_2_5GT: + *pci_bw = 2500 * width; + break; + case PCIE_SPEED_5_0GT: + *pci_bw = 5000 * width; + break; + case PCIE_SPEED_8_0GT: + *pci_bw = 8000 * width; + break; + default: + return -EINVAL; + } + + return 0; +} + +static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw) +{ + return (link_speed && pci_bw && + (pci_bw < 4) && (pci_bw < link_speed)); +} + static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev, struct net_device *netdev, int num_channels) { struct mlx5e_priv *priv = netdev_priv(netdev); + u32 link_speed = 0; + u32 pci_bw = 0; priv->params.log_sq_size = MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE; @@ -2728,6 +2766,20 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev, MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ : MLX5_WQ_TYPE_LINKED_LIST; + /* set CQE compression */ + priv->params.rx_cqe_compress_admin = false; + if (MLX5_CAP_GEN(mdev, cqe_compression) && + MLX5_CAP_GEN(mdev, vport_group_manager)) { + mlx5e_get_max_linkspeed(mdev, _speed); + mlx5e_get_pci_bw(mdev, _bw); + mlx5_core_dbg(mdev, "Max link speed = %d, PCI BW = %d\n", + link_speed, pci_bw); + priv->params.rx_cqe_compress_admin = + cqe_compress_heuristic(link_speed, pci_bw); + } + + priv->params.rx_cqe_compress = priv->params.rx_cqe_compress_admin; + switch (priv->params.rq_wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW; -- 2.8.0
[PATCH net-next 2/3] net/mlx5e: Expand WQE stride when CQE compression is enabled
From: Tariq ToukanMake the MPWQE/Striding RQ default configuration dynamic and not statically set at compile time. Now at driver load we set stride size and num strides dynamically. By default we use same values as before, but when CQE compression is enabled, we set larger stride size to benefit from CQE compression for larger packets. Signed-off-by: Tariq Toukan Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 13 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 23 ++--- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 39 --- 3 files changed, 46 insertions(+), 29 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 19f0d8d..e05abad 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -64,12 +64,9 @@ #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW0x4 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW0x6 -#define MLX5_MPWRQ_LOG_NUM_STRIDES 11 /* >= 9, HW restriction */ #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */ -#define MLX5_MPWRQ_NUM_STRIDES BIT(MLX5_MPWRQ_LOG_NUM_STRIDES) -#define MLX5_MPWRQ_STRIDE_SIZE BIT(MLX5_MPWRQ_LOG_STRIDE_SIZE) -#define MLX5_MPWRQ_LOG_WQE_SZ (MLX5_MPWRQ_LOG_NUM_STRIDES +\ -MLX5_MPWRQ_LOG_STRIDE_SIZE) +#define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS8 /* >= 6, HW restriction */ +#define MLX5_MPWRQ_LOG_WQE_SZ 17 #define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \ MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0) #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER) @@ -154,6 +151,8 @@ struct mlx5e_umr_wqe { struct mlx5e_params { u8 log_sq_size; u8 rq_wq_type; + u8 mpwqe_log_stride_sz; + u8 mpwqe_log_num_strides; u8 log_rq_size; u16 num_channels; u8 num_tc; @@ -249,6 +248,8 @@ struct mlx5e_rq { /* control */ struct mlx5_wq_ctrlwq_ctrl; u8 wq_type; + u32mpwqe_stride_sz; + u32mpwqe_num_strides; u32rqn; struct mlx5e_channel *channel; struct mlx5e_priv *priv; @@ -272,7 +273,7 @@ struct mlx5e_mpw_info { void (*dma_pre_sync)(struct device *pdev, struct mlx5e_mpw_info *wi, u32 wqe_offset, u32 len); - void (*add_skb_frag)(struct device *pdev, + void (*add_skb_frag)(struct mlx5e_rq *rq, struct sk_buff *skb, struct mlx5e_mpw_info *wi, u32 page_idx, u32 frag_offset, u32 len); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 0ea4c03..e40e59a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -307,7 +307,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq; rq->alloc_wqe = mlx5e_alloc_rx_mpwqe; - rq->wqe_sz = MLX5_MPWRQ_NUM_STRIDES * MLX5_MPWRQ_STRIDE_SIZE; + rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz); + rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides); + rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides; byte_count = rq->wqe_sz; break; default: /* MLX5_WQ_TYPE_LINKED_LIST */ @@ -1130,9 +1132,9 @@ static void mlx5e_build_rq_param(struct mlx5e_priv *priv, switch (priv->params.rq_wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: MLX5_SET(wq, wq, log_wqe_num_of_strides, -MLX5_MPWRQ_LOG_NUM_STRIDES - 9); +priv->params.mpwqe_log_num_strides - 9); MLX5_SET(wq, wq, log_wqe_stride_size, -MLX5_MPWRQ_LOG_STRIDE_SIZE - 6); +priv->params.mpwqe_log_stride_sz - 6); MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ); break; default: /* MLX5_WQ_TYPE_LINKED_LIST */ @@ -1199,7 +1201,7 @@ static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv, switch (priv->params.rq_wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: log_cq_size = priv->params.log_rq_size + - MLX5_MPWRQ_LOG_NUM_STRIDES; + priv->params.mpwqe_log_num_strides; break; default: /*
Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add STU capability
On Tue, May 10, 2016 at 03:44:29PM -0400, Vivien Didelot wrote: > Some switch models have a STU (per VLAN port state database). Add a new > capability flag to switches info, instead of checking their family. > > Also if the 6165 family has an STU, it must have a VTU, so add the > MV88E6XXX_FLAG_VTU to its family flags. > > Signed-off-by: Vivien DidelotReviewed-by: Andrew Lunn Andrew
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 14:09 -0700, Eric Dumazet wrote: > On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa >wrote: > > > I agree here, but I don't think this patch particularly is a lot of > > bloat and something very interesting people can play with and extend upon. > > > > Sure, very rarely patch authors think their stuff is bloat. > > I prefer to fix kernel softirq.c, or at least show me that you tried > hard enough. > > I am pretty sure that the following would work : > > When ksoftirqd is scheduled, remember this in a per cpu variable > (ksoftiqd_scheduled) > > When enabling BH , do not call do_softirq() if this variable is set. > > ksoftirqd would clear the variable at the right place (probably in > run_ksoftirqd()) > > Sure, this might add a lot of latency regressions, but lets fix them. Only to give the idea (it is completely untested and probably buggy) diff --git a/kernel/softirq.c b/kernel/softirq.c index 17caf4b63342..cb30cfd76687 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat); static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp; DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +DEFINE_PER_CPU(bool, ksoftirqd_scheduled); const char * const softirq_to_name[NR_SOFTIRQS] = { "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL", @@ -73,8 +74,10 @@ static void wakeup_softirqd(void) /* Interrupts are disabled: no need to stop preemption */ struct task_struct *tsk = __this_cpu_read(ksoftirqd); - if (tsk && tsk->state != TASK_RUNNING) + if (tsk && tsk->state != TASK_RUNNING) { + __this_cpu_write(ksoftirqd_scheduled, true); wake_up_process(tsk); + } } /* @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) */ preempt_count_sub(cnt - 1); - if (unlikely(!in_interrupt() && local_softirq_pending())) { + if (unlikely(!in_interrupt() && +local_softirq_pending() && +!__this_cpu_read(ksoftirqd_scheduled))) { /* * Run softirq if any pending. And do it in its own stack * as we may be calling this deep in a task call stack already. @@ -660,6 +665,8 @@ static void run_ksoftirqd(unsigned int cpu) * in the task stack here. */ __do_softirq(); + if (!local_softirq_pending()) + __this_cpu_write(ksoftirqd_scheduled, false); local_irq_enable(); cond_resched_rcu_qs(); return;
[PATCH net-next 1/3] net/mlx5e: CQE compression
From: Tariq ToukanCQE compression feature is meant to save PCIe bandwidth by compressing few CQEs into smaller amount of bytes on PCIe. CQE compression can be selectively enabled per CQ. By default is disabled for now and will be enabled later on. Signed-off-by: Tariq Toukan Signed-off-by: Eugenia Emantayev Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 10 ++ drivers/net/ethernet/mellanox/mlx5/core/en_clock.c | 4 + drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 6 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 151 - drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 8 ++ include/linux/mlx5/device.h| 34 + 6 files changed, 211 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index bfa5daa..19f0d8d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -157,6 +157,8 @@ struct mlx5e_params { u8 log_rq_size; u16 num_channels; u8 num_tc; + bool rx_cqe_compress_admin; + bool rx_cqe_compress; u16 rx_cq_moderation_usec; u16 rx_cq_moderation_pkts; u16 tx_cq_moderation_usec; @@ -202,6 +204,13 @@ struct mlx5e_cq { struct mlx5e_channel *channel; struct mlx5e_priv *priv; + /* cqe decompression */ + struct mlx5_cqe64 title; + struct mlx5_mini_cqe8 mini_arr[MLX5_MINI_CQE_ARRAY_SIZE]; + u8 mini_arr_idx; + u16decmprs_left; + u16decmprs_wqe_counter; + /* control */ struct mlx5_wq_ctrlwq_ctrl; } cacheline_aligned_in_smp; @@ -616,6 +625,7 @@ void mlx5e_timestamp_init(struct mlx5e_priv *priv); void mlx5e_timestamp_cleanup(struct mlx5e_priv *priv); int mlx5e_hwstamp_set(struct net_device *dev, struct ifreq *ifr); int mlx5e_hwstamp_get(struct net_device *dev, struct ifreq *ifr); +void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val); int mlx5e_vlan_rx_add_vid(struct net_device *dev, __always_unused __be16 proto, u16 vid); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c b/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c index 2018eeb..847a8f3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c @@ -93,6 +93,8 @@ int mlx5e_hwstamp_set(struct net_device *dev, struct ifreq *ifr) /* RX HW timestamp */ switch (config.rx_filter) { case HWTSTAMP_FILTER_NONE: + /* Reset CQE compression to Admin default */ + mlx5e_modify_rx_cqe_compression(priv, priv->params.rx_cqe_compress_admin); break; case HWTSTAMP_FILTER_ALL: case HWTSTAMP_FILTER_SOME: @@ -108,6 +110,8 @@ int mlx5e_hwstamp_set(struct net_device *dev, struct ifreq *ifr) case HWTSTAMP_FILTER_PTP_V2_EVENT: case HWTSTAMP_FILTER_PTP_V2_SYNC: case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ: + /* Disable CQE compression */ + mlx5e_modify_rx_cqe_compression(priv, false); config.rx_filter = HWTSTAMP_FILTER_ALL; break; default: diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 1c70e51..0ea4c03 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -114,6 +114,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv) s->rx_mpwqe_filler += rq_stats->mpwqe_filler; s->rx_mpwqe_frag += rq_stats->mpwqe_frag; s->rx_buff_alloc_err += rq_stats->buff_alloc_err; + s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks; + s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts; for (j = 0; j < priv->params.num_tc; j++) { sq_stats = >channel[i]->sq[j].stats; @@ -1204,6 +1206,10 @@ static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv, } MLX5_SET(cqc, cqc, log_cq_size, log_cq_size); + if (priv->params.rx_cqe_compress) { + MLX5_SET(cqc, cqc, mini_cqe_res_format, MLX5_CQE_FORMAT_CSUM); + MLX5_SET(cqc, cqc, cqe_comp_en, 1); + } mlx5e_build_common_cq_param(priv, param); } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 23adfe2..c8b8d45 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -42,6 +42,143 @@ static inline bool mlx5e_rx_hw_stamp(struct
[PATCH net-next 0/3] Mellanox 100G mlx5 CQE compression
Hi Dave, Introducing ConnectX-4 CQE (Completion Queue Entry) compression feature for mlx5 etherent driver. CQE Compressing reduces PCI overhead by coalescing and compressing multiple CQEs into a single merged CQE. Successful compressing improves message rate especially for small packet traffic. CQE Compressing in details: Instead of writing full CQEs to memory, multiple almost identical CQEs are merged and compressed. Information that is shared between the CQEs is written once, regardless of the number of compressed CQEs. In addition, only the unique information (small amount of bytes compared to full CQE size) is written per CQE. CQE Compression Block: This block contains multiple compressed CQEs. CQE Compression Block contains a single copy of CQEs properties which are shared between all the compressed CQEs (called Title, see below) and multiple mini CQEs (CQEs in compressed form). Title: The Title holds information which is shared between all the compressed CQEs in the CQE Compression Block. In each Compression Block there is only a single Title regardless of the number of compressed CQEs. Mini CQE: A CQE in compressed form that holds some data needed to extract a single full CQE, for example 8 Bytes instead of 64 Bytes. The shared information between all compressed CQEs, which belong to the same CQE Compression Block called Title, is written once, and only the unique information in each compressed CQE, for example 8 bytes, is written per compressed CQE, called mini CQE. Since CQE Compression can add overhead to the software (CPU), it will be only enabled on "weak/slow" PCI slots, where it can actually help. Applied on top: c047c3b1af62 ('netfilter: conntrack: remove uninitialized shadow variable') Thanks, Saeed. Saeed Mahameed (1): net/mlx5e: Enable CQE compression when PCI is slower than link Tariq Toukan (2): net/mlx5e: CQE compression net/mlx5e: Expand WQE stride when CQE compression is enabled drivers/net/ethernet/mellanox/mlx5/core/en.h | 24 ++- drivers/net/ethernet/mellanox/mlx5/core/en_clock.c | 4 + .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 19 +++ drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 81 - drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 188 ++--- drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 8 + include/linux/mlx5/device.h| 34 7 files changed, 328 insertions(+), 30 deletions(-) -- 2.8.0
[PATCH v1 net-next 6/7] dsa: Rename switch chip data to cd
The dsa_switch structure contains a dsa_chip_data member called pd. However in the rest of the code, pd is used for dsa_platform_data. This is confusing. Rename it cd, which is already often used in dsa.c and slave.c for this data type. Signed-off-by: Andrew Lunn--- drivers/net/dsa/bcm_sf2.c | 4 ++-- drivers/net/dsa/mv88e6xxx.c | 4 ++-- include/net/dsa.h | 4 ++-- net/dsa/dsa.c | 18 +- net/dsa/slave.c | 10 +- 5 files changed, 20 insertions(+), 20 deletions(-) diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index 448deb59b9a4..10ddd5a5dfb6 100644 --- a/drivers/net/dsa/bcm_sf2.c +++ b/drivers/net/dsa/bcm_sf2.c @@ -949,8 +949,8 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds) /* All the interesting properties are at the parent device_node * level */ - dn = ds->pd->of_node->parent; - bcm_sf2_identify_ports(priv, ds->pd->of_node); + dn = ds->cd->of_node->parent; + bcm_sf2_identify_ports(priv, ds->cd->of_node); priv->irq0 = irq_of_parse_and_map(dn, 0); priv->irq1 = irq_of_parse_and_map(dn, 1); diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 428d213ed4fc..c41ec38613c7 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -3009,9 +3009,9 @@ static int mv88e6xxx_setup_global(struct mv88e6xxx_priv_state *ps) for (i = 0; i < 32; i++) { int nexthop = 0x1f; - if (ps->ds->pd->rtable && + if (ps->ds->cd->rtable && i != ps->ds->index && i < ps->ds->dst->pd->nr_chips) - nexthop = ps->ds->pd->rtable[i] & 0x1f; + nexthop = ps->ds->cd->rtable[i] & 0x1f; err = _mv88e6xxx_reg_write( ps, REG_GLOBAL2, diff --git a/include/net/dsa.h b/include/net/dsa.h index f4c0bff8d9d6..17c3d37b6779 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -137,7 +137,7 @@ struct dsa_switch { /* * Configuration data for this switch. */ - struct dsa_chip_data*pd; + struct dsa_chip_data*cd; /* * The used switch driver. @@ -190,7 +190,7 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds) if (dst->cpu_switch == ds->index) return dst->cpu_port; else - return ds->pd->rtable[dst->cpu_switch]; + return ds->cd->rtable[dst->cpu_switch]; } struct switchdev_trans; diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 5db779c69a68..eff5dfc2e33f 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -182,7 +182,7 @@ __ATTRIBUTE_GROUPS(dsa_hwmon); /* basic switch operations **/ static int dsa_cpu_dsa_setup(struct dsa_switch *ds, struct net_device *master) { - struct dsa_chip_data *cd = ds->pd; + struct dsa_chip_data *cd = ds->cd; struct device_node *port_dn; struct phy_device *phydev; int ret, port, mode; @@ -219,7 +219,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent) { struct dsa_switch_driver *drv = ds->drv; struct dsa_switch_tree *dst = ds->dst; - struct dsa_chip_data *pd = ds->pd; + struct dsa_chip_data *cd = ds->cd; bool valid_name_found = false; int index = ds->index; int i, ret; @@ -230,7 +230,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent) for (i = 0; i < DSA_MAX_PORTS; i++) { char *name; - name = pd->port_names[i]; + name = cd->port_names[i]; if (name == NULL) continue; @@ -328,10 +328,10 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent) if (!(ds->enabled_port_mask & (1 << i))) continue; - ret = dsa_slave_create(ds, parent, i, pd->port_names[i]); + ret = dsa_slave_create(ds, parent, i, cd->port_names[i]); if (ret < 0) { netdev_err(dst->master_netdev, "[%d]: can't create dsa slave device for port %d(%s): %d\n", - index, i, pd->port_names[i], ret); + index, i, cd->port_names[i], ret); ret = 0; } } @@ -379,7 +379,7 @@ static struct dsa_switch * dsa_switch_setup(struct dsa_switch_tree *dst, int index, struct device *parent, struct device *host_dev) { - struct dsa_chip_data *pd = dst->pd->chip + index; + struct dsa_chip_data *cd = dst->pd->chip + index; struct dsa_switch_driver *drv; struct dsa_switch *ds; int ret; @@ -389,7 +389,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index, /* * Probe for switch
[PATCH v1 net-next 5/7] dsa: Remove master_dev from switch structure
The switch drivers only use the master_dev member for dev_info() messages. Now that the device is passed to the old style probe, and new style drivers are probed as true linux drivers, this is no longer needed. Signed-off-by: Andrew Lunn--- drivers/net/dsa/mv88e6xxx.c | 1 + include/net/dsa.h | 7 ++- net/dsa/dsa.c | 2 +- net/dsa/slave.c | 2 +- 4 files changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index f4bb3bed812f..428d213ed4fc 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -3628,6 +3628,7 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev) ps = (struct mv88e6xxx_priv_state *)(ds + 1); ds->priv = ps; + ds->dev = dev; ps->dev = dev; ps->ds = ds; ps->bus = mdiodev->bus; diff --git a/include/net/dsa.h b/include/net/dsa.h index ecb52e265cc3..f4c0bff8d9d6 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -120,6 +120,8 @@ struct dsa_switch_tree { }; struct dsa_switch { + struct device *dev; + /* * Parent switch tree, and switch index. */ @@ -142,11 +144,6 @@ struct dsa_switch { */ struct dsa_switch_driver*drv; - /* -* Reference to host device to use. -*/ - struct device *master_dev; - #ifdef CONFIG_NET_DSA_HWMON /* * Hardware monitoring information diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index df169811f26d..5db779c69a68 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -411,7 +411,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index, ds->pd = pd; ds->drv = drv; ds->priv = priv; - ds->master_dev = host_dev; + ds->dev = parent; ret = dsa_switch_setup_one(ds, parent); if (ret) diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 5ea8a40c8d33..f25dcd9e814a 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -51,7 +51,7 @@ void dsa_slave_mii_bus_init(struct dsa_switch *ds) ds->slave_mii_bus->write = dsa_slave_phy_write; snprintf(ds->slave_mii_bus->id, MII_BUS_ID_SIZE, "dsa-%d:%.2x", ds->index, ds->pd->sw_addr); - ds->slave_mii_bus->parent = ds->master_dev; + ds->slave_mii_bus->parent = ds->dev; ds->slave_mii_bus->phy_mask = ~ds->phys_mii_mask; } -- 2.8.1
[PATCH v1 net-next 7/7] dsa: mv88e6xxx: Handle eeprom-length property
A switch can export an attached EEPROM using the standard ethtool API. However the switch itself cannot determine the size of the EEPROM, and multiple sizes are allowed. Thus a device tree property is supported to indicate the length of the EEPROM. Parse this property during device probe, and implement a callback function to retrieve it. Signed-off-by: Andrew Lunn--- drivers/net/dsa/mv88e6xxx.c | 17 + drivers/net/dsa/mv88e6xxx.h | 3 +++ 2 files changed, 20 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index c41ec38613c7..6f9be26d002f 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -879,6 +879,16 @@ error: return ret; } +static int mv88e6xxx_get_eeprom_len(struct dsa_switch *ds) +{ + struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); + + if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_EEPROM)) + return ps->eeprom_len; + + return 0; +} + static int mv88e6xxx_get_eeprom(struct dsa_switch *ds, struct ethtool_eeprom *eeprom, u8 *data) { @@ -3596,6 +3606,7 @@ struct dsa_switch_driver mv88e6xxx_switch_driver = { .set_temp_limit = mv88e6xxx_set_temp_limit, .get_temp_alarm = mv88e6xxx_get_temp_alarm, #endif + .get_eeprom_len = mv88e6xxx_get_eeprom_len, .get_eeprom = mv88e6xxx_get_eeprom, .set_eeprom = mv88e6xxx_set_eeprom, .get_regs_len = mv88e6xxx_get_regs_len, @@ -3617,9 +3628,11 @@ struct dsa_switch_driver mv88e6xxx_switch_driver = { int mv88e6xxx_probe(struct mdio_device *mdiodev) { struct device *dev = >dev; + struct device_node *np = dev->of_node; struct mv88e6xxx_priv_state *ps; int id, prod_num, rev; struct dsa_switch *ds; + u32 eeprom_len; int err; ds = devm_kzalloc(dev, sizeof(*ds) + sizeof(*ps), GFP_KERNEL); @@ -3662,6 +3675,10 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev) } } + if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_EEPROM) && + !of_property_read_u32(np, "eeprom-length", _len)) + ps->eeprom_len = eeprom_len; + dev_set_drvdata(dev, ds); dev_info(dev, "switch 0x%x probed: %s, revision %u\n", diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h index df226c151d9c..d83565ac684e 100644 --- a/drivers/net/dsa/mv88e6xxx.h +++ b/drivers/net/dsa/mv88e6xxx.h @@ -590,6 +590,9 @@ struct mv88e6xxx_priv_state { * switch soft reset. */ struct gpio_desc *reset; + + /* set to size of eeprom if supported by the switch */ + int eeprom_len; }; enum stat_type { -- 2.8.1
[PATCH v1 net-next 1/7] dsa: mv88e6xxx: Initialise the mutex as soon as it is created
By initialising immediately it, we don't run the danger of using it before it is initialised. Signed-off-by: Andrew Lunn--- drivers/net/dsa/mv88e6xxx.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 1e5ca8e0f48e..b2d27bfd53c2 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -3118,8 +3118,6 @@ static int mv88e6xxx_setup(struct dsa_switch *ds) ps->ds = ds; - mutex_init(>smi_mutex); - INIT_WORK(>bridge_work, mv88e6xxx_bridge_work); if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_EEPROM)) @@ -3566,6 +3564,7 @@ static const char *mv88e6xxx_probe(struct device *dsa_dev, ps->bus = bus; ps->sw_addr = sw_addr; ps->info = info; + mutex_init(>smi_mutex); *priv = ps; -- 2.8.1
Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: abstract VTU/STU data access
On Tue, May 10, 2016 at 03:44:28PM -0400, Vivien Didelot wrote: > Both VTU and STU operations use the same routine to access their > (common) data registers, with a different offset. > > Add VTU and STU specific read and write functions to the data registers > to abstract the required offset. > > Signed-off-by: Vivien DidelotHi Vivien Improves the readability. Reviewed-by: Andrew Lunn Thanks Andrew
[PATCH v1 net-next 2/7] dsa: mv88e6xxx: Rename probe function to fit the normal pattern
All other DSA drivers use _drv_ in there DSA probe function name, thus allowing for a true linux driver probe function to use the conventional name. Make mv88e6xxx fit this pattern. Signed-off-by: Andrew Lunn--- drivers/net/dsa/mv88e6xxx.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index b2d27bfd53c2..199a8628df2b 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -3529,9 +3529,9 @@ mv88e6xxx_lookup_info(unsigned int prod_num, const struct mv88e6xxx_info *table, return NULL; } -static const char *mv88e6xxx_probe(struct device *dsa_dev, - struct device *host_dev, int sw_addr, - void **priv) +static const char *mv88e6xxx_drv_probe(struct device *dsa_dev, + struct device *host_dev, int sw_addr, + void **priv) { const struct mv88e6xxx_info *info; struct mv88e6xxx_priv_state *ps; @@ -3576,7 +3576,7 @@ static const char *mv88e6xxx_probe(struct device *dsa_dev, struct dsa_switch_driver mv88e6xxx_switch_driver = { .tag_protocol = DSA_TAG_PROTO_EDSA, - .probe = mv88e6xxx_probe, + .probe = mv88e6xxx_drv_probe, .setup = mv88e6xxx_setup, .set_addr = mv88e6xxx_set_addr, .phy_read = mv88e6xxx_phy_read, -- 2.8.1
[PATCH v1 net-next 0/7] More enabler patches for DSA probing
The complete set of patches for the reworked DSA probing is too big to post as once. These subset contains some enablers which are easy to review. Eventually, the Marvell driver will instantiate its own internal MDIO bus, rather than have the framework do it, thus allows devices on the bus to be listed in the device tree. Initialize the main mutex as soon as it is created, to avoid lifetime issues with the mdio bus. A previous patch renamed all the DSA probe functions to make room for a true device probe. However the recent merging of all the Marvell switch drivers resulted in mv88e6xxx going back to the old probe name. Rename it again, so we can have a driver probe function. Add minimum support for the Marvell switch driver to probe as an MDIO device, as well as an DSA driver. Later patches will then register this device with the new DSA core framework. Move the GPIO reset code out of the DSA code. Different drivers may need different reset mechanisms, e.g. via a reset controller for memory mapped devices. Don't clutter up the core with this. Let each driver implement what it needs. master_dev is no longer needed in the switch drivers, since they have access to a device pointer from the probe function. Remove it. Let the switch parse the eeprom length from its one device tree node. This is required with the new binding when the central DSA platform device no longer exists. Andrew Lunn (7): dsa: mv88e6xxx: Initialise the mutex as soon as it is created dsa: mv88e6xxx: Rename probe function to fit the normal pattern dsa: Add mdio device support to Marvell switches dsa: Move gpio reset into switch driver dsa: Remove master_dev from switch structure dsa: Rename switch chip data to cd dsa: mv88e6xxx: Handle eeprom-length property Documentation/devicetree/bindings/net/dsa/dsa.txt | 2 - .../devicetree/bindings/net/dsa/marvell.txt| 35 ++ drivers/net/dsa/bcm_sf2.c | 4 +- drivers/net/dsa/mv88e6xxx.c| 137 + drivers/net/dsa/mv88e6xxx.h| 10 ++ include/net/dsa.h | 19 +-- net/dsa/dsa.c | 36 ++ net/dsa/slave.c| 12 +- 8 files changed, 177 insertions(+), 78 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/dsa/marvell.txt -- 2.8.1
[PATCH v1 net-next 3/7] dsa: Add mdio device support to Marvell switches
Allow Marvell switches to be mdio devices. Currently the driver just allocate the private structure and detects what device is on the bus. Later patches will make them register with the DSA framework. Signed-off-by: Andrew Lunn--- .../devicetree/bindings/net/dsa/marvell.txt| 27 +++ drivers/net/dsa/mv88e6xxx.c| 90 +- 2 files changed, 99 insertions(+), 18 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/dsa/marvell.txt diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt b/Documentation/devicetree/bindings/net/dsa/marvell.txt new file mode 100644 index ..cdd70cebdea7 --- /dev/null +++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt @@ -0,0 +1,27 @@ +Marvell DSA Switch Device Tree Bindings +--- + +WARNING: This binding is currently unstable. Do not program it into a +FLASH never to be changed again. Once this binding is stable, this +warning will be removed. + +If you need a stable binding, use the old dsa.txt binding. + +Marvell Switches are MDIO devices. The following properties should be +placed as a child node of an mdio device. + +Required properties: +- compatible : Should be one of "marvell,mv88e6085", +- reg : Address on the MII bus for the switch. + +Example: + + mdio { + #address-cells = <1>; + #size-cells = <0>; + + switch0: switch@0 { + compatible = "marvell,mv88e6085"; + reg = <0>; + }; + }; diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 199a8628df2b..e5ba04132140 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -5,6 +5,8 @@ * Copyright (c) 2015 CMC Electronics, Inc. * Added support for VLAN Table Unit operations * + * Copyright (c) 2016 Andrew Lunn + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -17,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -3611,36 +3614,87 @@ struct dsa_switch_driver mv88e6xxx_switch_driver = { .port_fdb_dump = mv88e6xxx_port_fdb_dump, }; -static int __init mv88e6xxx_init(void) +int mv88e6xxx_probe(struct mdio_device *mdiodev) { - register_switch_driver(_switch_driver); + struct device *dev = >dev; + struct mv88e6xxx_priv_state *ps; + int id, prod_num, rev; + struct dsa_switch *ds; + + ds = devm_kzalloc(dev, sizeof(*ds) + sizeof(*ps), GFP_KERNEL); + if (!ds) + return -ENOMEM; + + ps = (struct mv88e6xxx_priv_state *)(ds + 1); + ds->priv = ps; + ps->dev = dev; + ps->ds = ds; + ps->bus = mdiodev->bus; + ps->sw_addr = mdiodev->addr; + mutex_init(>smi_mutex); + + get_device(>bus->dev); + + ds->drv = _switch_driver; + + id = mv88e6xxx_reg_read(ps, REG_PORT(0), PORT_SWITCH_ID); + if (id < 0) + return id; + + prod_num = (id & 0xfff0) >> 4; + rev = id & 0x000f; + + ps->info = mv88e6xxx_lookup_info(prod_num, mv88e6xxx_table, +ARRAY_SIZE(mv88e6xxx_table)); + if (!ps->info) + return -ENODEV; + + dev_set_drvdata(dev, ds); + + dev_info(dev, "switch 0x%x probed: %s, revision %u\n", +prod_num, ps->info->name, rev); return 0; } + +static void mv88e6xxx_remove(struct mdio_device *mdiodev) +{ + struct dsa_switch *ds = dev_get_drvdata(>dev); + struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); + + put_device(>bus->dev); +} + +static const struct of_device_id mv88e6xxx_of_match[] = { + { .compatible = "marvell,mv88e6085" }, + { /* sentinel */ }, +}; + +MODULE_DEVICE_TABLE(of, mv88e6xxx_of_match); + +static struct mdio_driver mv88e6xxx_driver = { + .probe = mv88e6xxx_probe, + .remove = mv88e6xxx_remove, + .mdiodrv.driver = { + .name = "mv88e6085", + .of_match_table = mv88e6xxx_of_match, + }, +}; + +static int __init mv88e6xxx_init(void) +{ + register_switch_driver(_switch_driver); + return mdio_driver_register(_driver); +} module_init(mv88e6xxx_init); static void __exit mv88e6xxx_cleanup(void) { + mdio_driver_unregister(_driver); unregister_switch_driver(_switch_driver); } module_exit(mv88e6xxx_cleanup); -MODULE_ALIAS("platform:mv88e6085"); -MODULE_ALIAS("platform:mv88e6095"); -MODULE_ALIAS("platform:mv88e6095f"); -MODULE_ALIAS("platform:mv88e6123"); -MODULE_ALIAS("platform:mv88e6131"); -MODULE_ALIAS("platform:mv88e6161"); -MODULE_ALIAS("platform:mv88e6165"); -MODULE_ALIAS("platform:mv88e6171");
Re: [net PATCH v2 6/6] net sched: ife action fix late binding
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salimwrote: > From: Jamal Hadi Salim > > The process below was broken and is fixed with this patch. > > //add an ife action and give it an instance id of 1 > sudo tc actions add action ife encode \ > type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1 > > //create a filter which binds to ife action id 1 > sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ > match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1 > > Message before fix was: > RTNETLINK answers: Invalid argument > We have an error talking to the kernel > > Signed-off-by: Jamal Hadi Salim Reviewed-by: Cong Wang
[PATCH v1 net-next 4/7] dsa: Move gpio reset into switch driver
Resetting the switch is something the driver does, not the framework. So move the parsing of this property into the driver. There are no in kernel users of this property, so moving it does not break anything. There is however a board which will make use of this property making its way into the kernel. Signed-off-by: Andrew Lunn--- Documentation/devicetree/bindings/net/dsa/dsa.txt | 2 -- Documentation/devicetree/bindings/net/dsa/marvell.txt | 8 drivers/net/dsa/mv88e6xxx.c | 14 +- drivers/net/dsa/mv88e6xxx.h | 7 +++ include/net/dsa.h | 8 net/dsa/dsa.c | 16 6 files changed, 28 insertions(+), 27 deletions(-) diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt index 5fdbbcdf8c4b..9f4807f90c31 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa.txt +++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt @@ -31,8 +31,6 @@ A switch child node has the following optional property: switch. Must be set if the switch can not detect the presence and/or size of a connected EEPROM, otherwise optional. -- reset-gpios : phandle and specifier to a gpio line connected to - reset pin of the switch chip. A switch may have multiple "port" children nodes diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt b/Documentation/devicetree/bindings/net/dsa/marvell.txt index cdd70cebdea7..7629189398aa 100644 --- a/Documentation/devicetree/bindings/net/dsa/marvell.txt +++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt @@ -10,10 +10,17 @@ If you need a stable binding, use the old dsa.txt binding. Marvell Switches are MDIO devices. The following properties should be placed as a child node of an mdio device. +The properties described here are those specific to Marvell devices. +Additional required and optional properties can be found in dsa.txt. + Required properties: - compatible : Should be one of "marvell,mv88e6085", - reg : Address on the MII bus for the switch. +Optional properties: + +- reset-gpios : Should be a gpio specifier for a reset line + Example: mdio { @@ -23,5 +30,6 @@ Example: switch0: switch@0 { compatible = "marvell,mv88e6085"; reg = <0>; + reset-gpios = < 1 GPIO_ACTIVE_LOW>; }; }; diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index e5ba04132140..f4bb3bed812f 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -2568,7 +2568,7 @@ static int mv88e6xxx_switch_reset(struct mv88e6xxx_priv_state *ps) { bool ppu_active = mv88e6xxx_has(ps, MV88E6XXX_FLAG_PPU_ACTIVE); u16 is_reset = (ppu_active ? 0x8800 : 0xc800); - struct gpio_desc *gpiod = ps->ds->pd->reset; + struct gpio_desc *gpiod = ps->reset; unsigned long timeout; int ret; int i; @@ -3620,6 +3620,7 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev) struct mv88e6xxx_priv_state *ps; int id, prod_num, rev; struct dsa_switch *ds; + int err; ds = devm_kzalloc(dev, sizeof(*ds) + sizeof(*ps), GFP_KERNEL); if (!ds) @@ -3649,6 +3650,17 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev) if (!ps->info) return -ENODEV; + ps->reset = devm_gpiod_get(>dev, "reset", GPIOD_ASIS); + if (IS_ERR(ps->reset)) { + err = PTR_ERR(ps->reset); + if (err == -ENOENT) { + /* Optional, so not an error */ + ps->reset = NULL; + } else { + return err; + } + } + dev_set_drvdata(dev, ds); dev_info(dev, "switch 0x%x probed: %s, revision %u\n", diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h index ca69a93a42a0..df226c151d9c 100644 --- a/drivers/net/dsa/mv88e6xxx.h +++ b/drivers/net/dsa/mv88e6xxx.h @@ -12,6 +12,7 @@ #define __MV88E6XXX_H #include +#include #ifndef UINT64_MAX #define UINT64_MAX (u64)(~((u64)0)) @@ -583,6 +584,12 @@ struct mv88e6xxx_priv_state { DECLARE_BITMAP(port_state_update_mask, DSA_MAX_PORTS); struct work_struct bridge_work; + + /* A switch may have a GPIO line tied to its reset pin. Parse +* this from the device tree, and use it before performing +* switch soft reset. +*/ + struct gpio_desc *reset; }; enum stat_type { diff --git a/include/net/dsa.h b/include/net/dsa.h index 8e86af87c84f..ecb52e265cc3 100644 --- a/include/net/dsa.h +++ b/include/net/dsa.h @@ -16,7 +16,6 @@
Re: [net PATCH v2 5/6] net sched: skbedit action fix late binding
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salimwrote: > From: Jamal Hadi Salim > > The process below was broken and is fixed with this patch. > > //add a skbedit action and give it an instance id of 1 > sudo tc actions add action skbedit mark 10 index 1 > //create a filter which binds to skbedit action id 1 > sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ > match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1 > > Message before fix was: > RTNETLINK answers: Invalid argument > We have an error talking to the kernel > > Signed-off-by: Jamal Hadi Salim Reviewed-by: Cong Wang
Re: [net PATCH v2 4/6] net sched: simple action fix late binding
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salimwrote: > From: Jamal Hadi Salim > > The process below was broken and is fixed with this patch. > > //add a simple action and give it an instance id of 1 > sudo tc actions add action simple sdata "foobar" index 1 > //create a filter which binds to simple action id 1 > sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ > match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1 > > Message before fix was: > RTNETLINK answers: Invalid argument > We have an error talking to the kernel > > Signed-off-by: Jamal Hadi Salim Reviewed-by: Cong Wang
Re: [net PATCH v2 3/6] net sched: mirred action fix late binding
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salimwrote: > From: Jamal Hadi Salim > > The process below was broken and is fixed with this patch. > > //add an mirred action and give it an instance id of 1 > sudo tc actions add action mirred egress mirror dev $MDEV index 1 > //create a filter which binds to mirred action id 1 > sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ > match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1 > > Message before bug fix was: > RTNETLINK answers: Invalid argument > We have an error talking to the kernel > > Signed-off-by: Jamal Hadi Salim Reviewed-by: Cong Wang
Re: [net PATCH v2 2/6] net sched: ipt action fix late binding
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salimwrote: > From: Jamal Hadi Salim > > This was broken and is fixed with this patch. > > //add an ipt action and give it an instance id of 1 > sudo tc actions add action ipt -j mark --set-mark 2 index 1 > //create a filter which binds to ipt action id 1 > sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ > match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1 > > Message before bug fix was: > RTNETLINK answers: Invalid argument > We have an error talking to the kernel > > Signed-off-by: Jamal Hadi Salim Reviewed-by: Cong Wang
Re: [net PATCH v2 1/6] net sched: vlan action fix late binding
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salimwrote: > From: Jamal Hadi Salim > > Late vlan action binding was broken and is fixed with this patch. > > //add a vlan action to pop and give it an instance id of 1 > sudo tc actions add action vlan pop index 1 > //create filter which binds to vlan action id 1 > sudo tc filter add dev $DEV parent : protocol ip prio 1 u32 \ > match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1 > > current message(before bug fix) was: > RTNETLINK answers: Invalid argument > We have an error talking to the kernel > > Signed-off-by: Jamal Hadi Salim Reviewed-by: Cong Wang
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowawrote: > I agree here, but I don't think this patch particularly is a lot of > bloat and something very interesting people can play with and extend upon. > Sure, very rarely patch authors think their stuff is bloat. I prefer to fix kernel softirq.c, or at least show me that you tried hard enough. I am pretty sure that the following would work : When ksoftirqd is scheduled, remember this in a per cpu variable (ksoftiqd_scheduled) When enabling BH , do not call do_softirq() if this variable is set. ksoftirqd would clear the variable at the right place (probably in run_ksoftirqd()) Sure, this might add a lot of latency regressions, but lets fix them.
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 16:52 -0400, David Miller wrote: > From: Rik van Riel> Date: Tue, 10 May 2016 16:50:56 -0400 > > > On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote: > >> From: Paolo Abeni > >> Date: Tue, 10 May 2016 22:22:50 +0200 > >> > >> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote: > >> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote: > >> >> > >> >> > If a single core host is under network flood, i.e. ksoftirqd > is > >> >> > scheduled and it eventually (after processing ~640 packets) > will > >> let the > >> >> > user space process run. The latter will execute a syscall to > >> receive a > >> >> > packet, which will have to disable/enable bh at least once > and > >> that will > >> >> > cause the processing of another ~640 packets. To receive a > >> single packet > >> >> > in user space, the kernel has to process more than one > thousand > >> packets. > >> >> > >> >> Looks you found the bug then. Have you tried to fix it ? > >> ... > >> > The ksoftirq and the local_bh_enable() design are the root of > the > >> > problem, they need to be touched/affected to solve it. > >> > >> That's not what I read from your description, processing 640 > packets > >> before going to ksoftirqd seems to the be the absolute root > problem. > > > > What would a fix for that look like? > > > > Keep track of the number of processed incoming packets, > > and the number of packets handed off, and defer to > > ksoftirqd earlier if the statistics suggest packets are > > getting dropped on the floor? > > Not by packet count but by something more easily to measure and > scalable to fairness like processing time. I need to get back to fixing irq & softirq time accounting, which does not currently work correctly in all time keeping modes... -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
Re: [Intel-wired-lan] [PATCH] e1000e: prevent division by zero if TIMINCA is zero
On Fri, May 06, 2016 at 11:43:17PM +, Rustad, Mark D wrote: > Denys Vlasenkowrote: > > >Users report that under VMWare, er32(TIMINCA) returns zero. > >This causes division by zero at init time as follows: > > > > ==>incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK; > >for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) { > >/* latch SYSTIMH on read of SYSTIML */ > >systim_next = (cycle_t)er32(SYSTIML); > >systim_next |= (cycle_t)er32(SYSTIMH) << 32; > > > >time_delta = systim_next - systim; > >temp = time_delta; > > > rem = do_div(temp, incvalue); > > > >This change makes kernel survive this, and users report that > >NIC does work after this change. > > > >Since on real hardware incvalue is never zero, this should not affect > >real hardware use case. ... > I seem to recall that this was rejected before because it really is VMWare's > bug and, if they fix it, any existing VMs that use this will just work. > Changing the driver will only fix it for vms that install a new driver. I > don't object to doing it, it just seems like not the most effective place to > address the issue. You could also have people who never update VMWare, for whom a kernel work-around would be better. I think it'd be best to address it both at the driver level and the emulated hardware level, to improve things for the most possible users. Those who update neither hypervisor or kernel/driver, well, they reap what they sow. -- Jarod Wilson ja...@redhat.com
Re: [RFC PATCH 0/2] net: threadable napi poll loop
From: Rik van RielDate: Tue, 10 May 2016 16:50:56 -0400 > On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote: >> From: Paolo Abeni >> Date: Tue, 10 May 2016 22:22:50 +0200 >> >> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote: >> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote: >> >> >> >> > If a single core host is under network flood, i.e. ksoftirqd is >> >> > scheduled and it eventually (after processing ~640 packets) will >> let the >> >> > user space process run. The latter will execute a syscall to >> receive a >> >> > packet, which will have to disable/enable bh at least once and >> that will >> >> > cause the processing of another ~640 packets. To receive a >> single packet >> >> > in user space, the kernel has to process more than one thousand >> packets. >> >> >> >> Looks you found the bug then. Have you tried to fix it ? >> ... >> > The ksoftirq and the local_bh_enable() design are the root of the >> > problem, they need to be touched/affected to solve it. >> >> That's not what I read from your description, processing 640 packets >> before going to ksoftirqd seems to the be the absolute root problem. > > What would a fix for that look like? > > Keep track of the number of processed incoming packets, > and the number of packets handed off, and defer to > ksoftirqd earlier if the statistics suggest packets are > getting dropped on the floor? Not by packet count but by something more easily to measure and scalable to fairness like processing time.
Re: [net-next PATCH v2 1/6] net sched: vlan action fix late binding
On 16-05-08 11:08 PM, Cong Wang wrote: + if (aexists) + tcf_hash_release(a, bind); Introduce a goto to reduce duplicated cleanup code? I addressed all your comments except for above goto. There are different return codes for all failures - and the cleverness of a goto seems unecessary. cheers, jamal
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote: > From: Paolo Abeni> Date: Tue, 10 May 2016 22:22:50 +0200 > > > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote: > >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote: > >> > >> > If a single core host is under network flood, i.e. ksoftirqd is > >> > scheduled and it eventually (after processing ~640 packets) will > let the > >> > user space process run. The latter will execute a syscall to > receive a > >> > packet, which will have to disable/enable bh at least once and > that will > >> > cause the processing of another ~640 packets. To receive a > single packet > >> > in user space, the kernel has to process more than one thousand > packets. > >> > >> Looks you found the bug then. Have you tried to fix it ? > ... > > The ksoftirq and the local_bh_enable() design are the root of the > > problem, they need to be touched/affected to solve it. > > That's not what I read from your description, processing 640 packets > before going to ksoftirqd seems to the be the absolute root problem. What would a fix for that look like? Keep track of the number of processed incoming packets, and the number of packets handed off, and defer to ksoftirqd earlier if the statistics suggest packets are getting dropped on the floor? Is there a cheap way to do that kind of thing? -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
[net PATCH v2 5/6] net sched: skbedit action fix late binding
From: Jamal Hadi SalimThe process below was broken and is fixed with this patch. //add a skbedit action and give it an instance id of 1 sudo tc actions add action skbedit mark 10 index 1 //create a filter which binds to skbedit action id 1 sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim --- net/sched/act_skbedit.c | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index cfcdbdc..69da5a8 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -69,7 +69,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, struct tcf_skbedit *d; u32 flags = 0, *priority = NULL, *mark = NULL; u16 *queue_mapping = NULL; - int ret = 0, err; + int ret = 0, err, exists = 0; if (nla == NULL) return -EINVAL; @@ -96,12 +96,18 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, mark = nla_data(tb[TCA_SKBEDIT_MARK]); } - if (!flags) - return -EINVAL; - parm = nla_data(tb[TCA_SKBEDIT_PARMS]); - if (!tcf_hash_check(tn, parm->index, a, bind)) { + exists = tcf_hash_check(tn, parm->index, a, bind); + if (exists && bind) + return 0; + + if (!flags) { + tcf_hash_release(a, bind); + return -EINVAL; + } + + if (!exists) { ret = tcf_hash_create(tn, parm->index, est, a, sizeof(*d), bind, false); if (ret) @@ -111,8 +117,6 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, ret = ACT_P_CREATED; } else { d = to_skbedit(a); - if (bind) - return 0; tcf_hash_release(a, bind); if (!ovr) return -EEXIST; -- 1.9.1
[net PATCH v2 2/6] net sched: ipt action fix late binding
From: Jamal Hadi SalimThis was broken and is fixed with this patch. //add an ipt action and give it an instance id of 1 sudo tc actions add action ipt -j mark --set-mark 2 index 1 //create a filter which binds to ipt action id 1 sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1 Message before bug fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim --- net/sched/act_ipt.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c index 350e134..8b52700 100644 --- a/net/sched/act_ipt.c +++ b/net/sched/act_ipt.c @@ -96,7 +96,7 @@ static int __tcf_ipt_init(struct tc_action_net *tn, struct nlattr *nla, struct tcf_ipt *ipt; struct xt_entry_target *td, *t; char *tname; - int ret = 0, err; + int ret = 0, err, exists = 0; u32 hook = 0; u32 index = 0; @@ -107,18 +107,23 @@ static int __tcf_ipt_init(struct tc_action_net *tn, struct nlattr *nla, if (err < 0) return err; - if (tb[TCA_IPT_HOOK] == NULL) - return -EINVAL; - if (tb[TCA_IPT_TARG] == NULL) + if (tb[TCA_IPT_INDEX] != NULL) + index = nla_get_u32(tb[TCA_IPT_INDEX]); + + exists = tcf_hash_check(tn, index, a, bind); + if (exists && bind) + return 0; + + if (tb[TCA_IPT_HOOK] == NULL || tb[TCA_IPT_TARG] == NULL) { + if (exists) + tcf_hash_release(a, bind); return -EINVAL; + } td = (struct xt_entry_target *)nla_data(tb[TCA_IPT_TARG]); if (nla_len(tb[TCA_IPT_TARG]) < td->u.target_size) return -EINVAL; - if (tb[TCA_IPT_INDEX] != NULL) - index = nla_get_u32(tb[TCA_IPT_INDEX]); - if (!tcf_hash_check(tn, index, a, bind)) { ret = tcf_hash_create(tn, index, est, a, sizeof(*ipt), bind, false); -- 1.9.1
[net PATCH v2 6/6] net sched: ife action fix late binding
From: Jamal Hadi SalimThe process below was broken and is fixed with this patch. //add an ife action and give it an instance id of 1 sudo tc actions add action ife encode \ type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1 //create a filter which binds to ife action id 1 sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim --- net/sched/act_ife.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c index c589a9b..343d011 100644 --- a/net/sched/act_ife.c +++ b/net/sched/act_ife.c @@ -423,7 +423,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, u16 ife_type = 0; u8 *daddr = NULL; u8 *saddr = NULL; - int ret = 0; + int ret = 0, exists = 0; int err; err = nla_parse_nested(tb, TCA_IFE_MAX, nla, ife_policy); @@ -435,25 +435,29 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, parm = nla_data(tb[TCA_IFE_PARMS]); + exists = tcf_hash_check(tn, parm->index, a, bind); + if (exists && bind) + return 0; + if (parm->flags & IFE_ENCODE) { /* Until we get issued the ethertype, we cant have * a default.. **/ if (!tb[TCA_IFE_TYPE]) { + if (exists) + tcf_hash_release(a, bind); pr_info("You MUST pass etherype for encoding\n"); return -EINVAL; } } - if (!tcf_hash_check(tn, parm->index, a, bind)) { + if (!exists) { ret = tcf_hash_create(tn, parm->index, est, a, sizeof(*ife), bind, false); if (ret) return ret; ret = ACT_P_CREATED; } else { - if (bind) /* dont override defaults */ - return 0; tcf_hash_release(a, bind); if (!ovr) return -EEXIST; @@ -495,6 +499,8 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla, NULL); if (err) { metadata_parse_err: + if (exists) + tcf_hash_release(a, bind); if (ret == ACT_P_CREATED) _tcf_ife_cleanup(a, bind); -- 1.9.1
[net PATCH v2 4/6] net sched: simple action fix late binding
From: Jamal Hadi SalimThe process below was broken and is fixed with this patch. //add a simple action and give it an instance id of 1 sudo tc actions add action simple sdata "foobar" index 1 //create a filter which binds to simple action id 1 sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim --- net/sched/act_simple.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c index 75b2be1..3a33fb6 100644 --- a/net/sched/act_simple.c +++ b/net/sched/act_simple.c @@ -87,7 +87,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, struct tc_defact *parm; struct tcf_defact *d; char *defdata; - int ret = 0, err; + int ret = 0, err, exists = 0; if (nla == NULL) return -EINVAL; @@ -99,13 +99,21 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, if (tb[TCA_DEF_PARMS] == NULL) return -EINVAL; - if (tb[TCA_DEF_DATA] == NULL) - return -EINVAL; parm = nla_data(tb[TCA_DEF_PARMS]); + exists = tcf_hash_check(tn, parm->index, a, bind); + if (exists && bind) + return 0; + + if (tb[TCA_DEF_DATA] == NULL) { + if (exists) + tcf_hash_release(a, bind); + return -EINVAL; + } + defdata = nla_data(tb[TCA_DEF_DATA]); - if (!tcf_hash_check(tn, parm->index, a, bind)) { + if (!exists) { ret = tcf_hash_create(tn, parm->index, est, a, sizeof(*d), bind, false); if (ret) @@ -122,8 +130,6 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla, } else { d = to_defact(a); - if (bind) - return 0; tcf_hash_release(a, bind); if (!ovr) return -EEXIST; -- 1.9.1
[net PATCH v2 0/6] net sched: Fix broken late binding of actions
From: Jamal Hadi SalimSome actions were broken in allowing for late binding of actions. Late binding workflow is as follows: a) create an action and provide all necessary parameters for it Optionally provide an index or let the kernel give you one. Example: sudo tc actions add action police rate 1kbit burst 90k drop index 1 b) later on bind to the pre-created action from a filter definition by merely specifying the index. Example: sudo tc filter add dev lo parent : protocol ip prio 8 \ u32 match ip src 127.0.0.8/32 flowid 1:8 action police index 1 Jamal Hadi Salim (6): net sched: vlan action fix late binding net sched: ipt action fix late binding net sched: mirred action fix late binding net sched: simple action fix late binding net sched: skbedit action fix late binding net sched: ife action fix late binding net/sched/act_ife.c | 14 ++ net/sched/act_ipt.c | 19 --- net/sched/act_mirred.c | 19 +-- net/sched/act_simple.c | 18 -- net/sched/act_skbedit.c | 18 +++--- net/sched/act_vlan.c| 22 -- 6 files changed, 74 insertions(+), 36 deletions(-) -- 1.9.1
[net PATCH v2 1/6] net sched: vlan action fix late binding
From: Jamal Hadi SalimLate vlan action binding was broken and is fixed with this patch. //add a vlan action to pop and give it an instance id of 1 sudo tc actions add action vlan pop index 1 //create filter which binds to vlan action id 1 sudo tc filter add dev $DEV parent : protocol ip prio 1 u32 \ match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1 current message(before bug fix) was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim --- net/sched/act_vlan.c | 22 -- 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c index bab8ae0..c45f926 100644 --- a/net/sched/act_vlan.c +++ b/net/sched/act_vlan.c @@ -77,7 +77,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla, int action; __be16 push_vid = 0; __be16 push_proto = 0; - int ret = 0; + int ret = 0, exists = 0; int err; if (!nla) @@ -90,15 +90,25 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla, if (!tb[TCA_VLAN_PARMS]) return -EINVAL; parm = nla_data(tb[TCA_VLAN_PARMS]); + exists = tcf_hash_check(tn, parm->index, a, bind); + if (exists && bind) + return 0; + switch (parm->v_action) { case TCA_VLAN_ACT_POP: break; case TCA_VLAN_ACT_PUSH: - if (!tb[TCA_VLAN_PUSH_VLAN_ID]) + if (!tb[TCA_VLAN_PUSH_VLAN_ID]) { + if (exists) + tcf_hash_release(a, bind); return -EINVAL; + } push_vid = nla_get_u16(tb[TCA_VLAN_PUSH_VLAN_ID]); - if (push_vid >= VLAN_VID_MASK) + if (push_vid >= VLAN_VID_MASK) { + if (exists) + tcf_hash_release(a, bind); return -ERANGE; + } if (tb[TCA_VLAN_PUSH_VLAN_PROTOCOL]) { push_proto = nla_get_be16(tb[TCA_VLAN_PUSH_VLAN_PROTOCOL]); @@ -114,11 +124,13 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla, } break; default: + if (exists) + tcf_hash_release(a, bind); return -EINVAL; } action = parm->v_action; - if (!tcf_hash_check(tn, parm->index, a, bind)) { + if (!exists) { ret = tcf_hash_create(tn, parm->index, est, a, sizeof(*v), bind, false); if (ret) @@ -126,8 +138,6 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla, ret = ACT_P_CREATED; } else { - if (bind) - return 0; tcf_hash_release(a, bind); if (!ovr) return -EEXIST; -- 1.9.1
[net PATCH v2 3/6] net sched: mirred action fix late binding
From: Jamal Hadi SalimThe process below was broken and is fixed with this patch. //add an mirred action and give it an instance id of 1 sudo tc actions add action mirred egress mirror dev $MDEV index 1 //create a filter which binds to mirred action id 1 sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1 Message before bug fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim --- net/sched/act_mirred.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c index e8a760c..8f3948d 100644 --- a/net/sched/act_mirred.c +++ b/net/sched/act_mirred.c @@ -61,7 +61,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, struct tc_mirred *parm; struct tcf_mirred *m; struct net_device *dev; - int ret, ok_push = 0; + int ret, ok_push = 0, exists = 0; if (nla == NULL) return -EINVAL; @@ -71,17 +71,27 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, if (tb[TCA_MIRRED_PARMS] == NULL) return -EINVAL; parm = nla_data(tb[TCA_MIRRED_PARMS]); + + exists = tcf_hash_check(tn, parm->index, a, bind); + if (exists && bind) + return 0; + switch (parm->eaction) { case TCA_EGRESS_MIRROR: case TCA_EGRESS_REDIR: break; default: + if (exists) + tcf_hash_release(a, bind); return -EINVAL; } if (parm->ifindex) { dev = __dev_get_by_index(net, parm->ifindex); - if (dev == NULL) + if (dev == NULL) { + if (exists) + tcf_hash_release(a, bind); return -ENODEV; + } switch (dev->type) { case ARPHRD_TUNNEL: case ARPHRD_TUNNEL6: @@ -99,7 +109,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, dev = NULL; } - if (!tcf_hash_check(tn, parm->index, a, bind)) { + if (!exists) { if (dev == NULL) return -EINVAL; ret = tcf_hash_create(tn, parm->index, est, a, @@ -108,9 +118,6 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, return ret; ret = ACT_P_CREATED; } else { - if (bind) - return 0; - tcf_hash_release(a, bind); if (!ovr) return -EEXIST; -- 1.9.1
Re: [RFC PATCH 0/2] net: threadable napi poll loop
Hello, On 10.05.2016 16:29, Eric Dumazet wrote: > On Tue, 2016-05-10 at 16:11 +0200, Paolo Abeni wrote: >> Currently, the softirq loop can be scheduled both inside the ksofirqd kernel >> thread and inside any running process. This makes nearly impossible for the >> process scheduler to balance in a fair way the amount of time that >> a given core spends performing the softirq loop. >> >> Under high network load, the softirq loop can take nearly 100% of a given >> CPU, >> leaving very little time for use space processing. On single core hosts, this >> means that the user space can nearly starve; for example super_netperf >> UDP_STREAM tests towards a remote single core vCPU guest[1] can measure an >> aggregated throughput of a few thousands pps, and the same behavior can be >> reproduced even on bare-metal, eventually simulating a single core with >> taskset >> and/or sysfs configuration. > > I hate these patches and ideas guys, sorry. That is before my breakfast, > but still... :) > I have enough hard time dealing with loads where ksoftirqd has to > compete with user threads that thought that playing with priorities was > a nice idea. We tried a lot of approaches so far and this seemed to be the best architectural RFC we could post. I was quite surprised to see such good performance numbers with threaded NAPI, thus I think it could be a way forward. Your mentioned problem above seems to be a configuration mistake, no? Otherwise isn't that something user space/cgroups might solve? > Guess what, when they lose networking they complain. > > We already have ksoftirqd to normally cope with the case you are > describing. Indeed, but the time until we wake up ksoftirqd can be already quite long and for every packet we get in udp_recvmsg the local_bh_enable call let's us pick up quite a lot of new packets, which we drop before user space can make any progress. By being more fair between user space and "napid" we hoped to solve this. We also want more feedback from the scheduler people, so we Cc'ed them also. > If it is not working as intended, please identify the bugs and fix them, > instead of adding yet another tests in fast path and extra complexity in > the stack. We could use _local_bh_enable instead of local_bh_enable in udp_recvmsg, which certainly wouldn't branch down to softirqs as often, but this feels wrong to me and certainly is. After the discussion on netdev@ with Peter Hurley here [1] about "Softirq priority inversion from "softirq: reduce latencies"", we didn't want to propose some patch looking like this again, but this could help. The idea would be to limit the number we recheck for softirqs but give back control to user space. [1] https://lkml.org/lkml/2016/2/27/152 If I remember local_bh_enable in kernel-rt processes one softirq directly and defers its work to ksoftirqd much more quickly. > In the one vcpu case, allowing the user thread to consume more UDP > packets from the target UDP socket will also make your NIC drop more > packets, that are not necessarily packets for the same socket. > > So you are shifting the attack to a different target, > at the expense of more kernel bloat. I agree here, but I don't think this patch particularly is a lot of bloat and something very interesting people can play with and extend upon. Thanks, Hannes
Re: [RFC PATCH 0/2] net: threadable napi poll loop
From: Paolo AbeniDate: Tue, 10 May 2016 22:22:50 +0200 > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote: >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote: >> >> > If a single core host is under network flood, i.e. ksoftirqd is >> > scheduled and it eventually (after processing ~640 packets) will let the >> > user space process run. The latter will execute a syscall to receive a >> > packet, which will have to disable/enable bh at least once and that will >> > cause the processing of another ~640 packets. To receive a single packet >> > in user space, the kernel has to process more than one thousand packets. >> >> Looks you found the bug then. Have you tried to fix it ? ... > The ksoftirq and the local_bh_enable() design are the root of the > problem, they need to be touched/affected to solve it. That's not what I read from your description, processing 640 packets before going to ksoftirqd seems to the be the absolute root problem.
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 17:57 +0200, Thomas Gleixner wrote: > On Tue, 10 May 2016, Paolo Abeni wrote: > > Nice patch set and very promising results! > > > At this point we are not really sure if we should go with this simpler > > approach by putting NAPI itself into kthreads or leverage the threadirqs > > function by putting the whole interrupt into a thread and signaling NAPI > > that it does not reschedule itself in a softirq but to simply run at > > this particular context of the interrupt handler. > > > > While the threaded irq way seems to better integrate into the kernel and > > also other devices could move their interrupts into the threads easily > > on a common policy, we don't know how to really express the necessary > > knobs with the current device driver model (module parameters, sysfs > > attributes, etc.). This is where we would like to hear some opinions. > > NAPI would e.g. have to query the kernel if the particular IRQ/MSI if it > > should be scheduled in a softirq or in a thread, so we don't have to > > rewrite all device drivers. This might even be needed on a per rx-queue > > granularity. > > Utilizing threaded irqs should be halfways simple even without touching the > device driver at all. > > We can do the switch to threading in two ways: > > 1) Let the driver request the interrupt(s) as it does now and then have a >/proc/irq/NNN/threaded file which converts it to a threaded interrupt on >the fly. That should be fairly trivial. > > 2) Let the driver request the interrupt(s) as it does now and retrieve the >interrupt number which belongs to the device/queue from the network core >and let the irq core switch it over to threaded. Thank you for the feedback. We actually experimented something similar to (2). In our implementation we needed a per device chunk of code to do the actual irq number -> queue mapping (and than we performed as well the switch in the device code). > You surely need some way to figure out whether the interrupt is threaded when > you set up the device in order to flag your napi struct, but that should be > not too hard to achieve. This is the part that required per device changes and complicated a bit the implementation. We can research further to simplify it, according to the overall discussion. Cheers, Paolo
Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote: > On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote: > > > If a single core host is under network flood, i.e. ksoftirqd is > > scheduled and it eventually (after processing ~640 packets) will let the > > user space process run. The latter will execute a syscall to receive a > > packet, which will have to disable/enable bh at least once and that will > > cause the processing of another ~640 packets. To receive a single packet > > in user space, the kernel has to process more than one thousand packets. > > Looks you found the bug then. Have you tried to fix it ? The core functionality is implemented in ~100 lines of code, is that the kind of bloat that do concerns you ? That could probably be improved removing some code duplication, i.e. factorizing napi_thread_wait() with irq_wait_for_interrupt() and possibly napi_threaded_poll() with net_rx_action(). If the additional test inside napi_schedule() is really scaring, it can be guarded with a static_key. The ksoftirq and the local_bh_enable() design are the root of the problem, they need to be touched/affected to solve it. We actually experimented several different options. Limiting the amount of work performed by local_bh_enable() somewhat mitigate the issue, but it adds just another kernel parameter difficult to be tuned. Running the softirq loop exclusively inside the ksoftirqd will solve the issue, but this is a very invasive approach, affecting all others subsystem. The above can be restricted to the net_rx_action only (i.e. running net_rx_action always in ksoftirqd context). The related patch isn't really much simpler than this and will add at least the same number of additional tests in fast path. Running the napi loop in a thread that can be migrated gives additional benefit in the hyper-visor/VM scenario, which can't be achieved elsewhere. Would you consider the threaded irq alternative more viable ? Cheers, Paolo
[PATCH v3 net-next] tcp: replace cnt & rtt with struct in pkts_acked()
Replace 2 arguments (cnt and rtt) in the congestion control modules' pkts_acked() function with a struct. This will allow adding more information without having to modify existing congestion control modules (tcp_nv in particular needs bytes in flight when packet was sent). As proposed by Neal Cardwell in his comments to the tcp_nv patch. Signed-off-by: Lawrence BrakmoAcked-by: Yuchung Cheng --- include/net/tcp.h | 7 ++- net/ipv4/tcp_bic.c | 6 +++--- net/ipv4/tcp_cdg.c | 14 +++--- net/ipv4/tcp_cubic.c| 6 +++--- net/ipv4/tcp_htcp.c | 10 +- net/ipv4/tcp_illinois.c | 20 ++-- net/ipv4/tcp_input.c| 8 ++-- net/ipv4/tcp_lp.c | 6 +++--- net/ipv4/tcp_vegas.c| 6 +++--- net/ipv4/tcp_vegas.h| 2 +- net/ipv4/tcp_veno.c | 7 --- net/ipv4/tcp_westwood.c | 7 --- net/ipv4/tcp_yeah.c | 7 --- 13 files changed, 59 insertions(+), 47 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 24ec804..dc588c3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -849,6 +849,11 @@ enum tcp_ca_ack_event_flags { union tcp_cc_info; +struct ack_sample { + u32 pkts_acked; + s32 rtt_us; +}; + struct tcp_congestion_ops { struct list_headlist; u32 key; @@ -872,7 +877,7 @@ struct tcp_congestion_ops { /* new value of cwnd after loss (optional) */ u32 (*undo_cwnd)(struct sock *sk); /* hook for packet ack accounting (optional) */ - void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us); + void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample); /* get info for inet_diag (optional) */ size_t (*get_info)(struct sock *sk, u32 ext, int *attr, union tcp_cc_info *info); diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c index fd1405d..36087bc 100644 --- a/net/ipv4/tcp_bic.c +++ b/net/ipv4/tcp_bic.c @@ -197,15 +197,15 @@ static void bictcp_state(struct sock *sk, u8 new_state) /* Track delayed acknowledgment ratio using sliding window * ratio = (15*ratio + sample) / 16 */ -static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt) +static void bictcp_acked(struct sock *sk, const struct ack_sample *sample) { const struct inet_connection_sock *icsk = inet_csk(sk); if (icsk->icsk_ca_state == TCP_CA_Open) { struct bictcp *ca = inet_csk_ca(sk); - cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT; - ca->delayed_ack += cnt; + ca->delayed_ack += sample->pkts_acked - + (ca->delayed_ack >> ACK_RATIO_SHIFT); } } diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c index ccce8a5..03725b2 100644 --- a/net/ipv4/tcp_cdg.c +++ b/net/ipv4/tcp_cdg.c @@ -294,12 +294,12 @@ static void tcp_cdg_cong_avoid(struct sock *sk, u32 ack, u32 acked) ca->shadow_wnd = max(ca->shadow_wnd, ca->shadow_wnd + incr); } -static void tcp_cdg_acked(struct sock *sk, u32 num_acked, s32 rtt_us) +static void tcp_cdg_acked(struct sock *sk, const struct ack_sample *sample) { struct cdg *ca = inet_csk_ca(sk); struct tcp_sock *tp = tcp_sk(sk); - if (rtt_us <= 0) + if (sample->rtt_us <= 0) return; /* A heuristic for filtering delayed ACKs, adapted from: @@ -307,20 +307,20 @@ static void tcp_cdg_acked(struct sock *sk, u32 num_acked, s32 rtt_us) * delay and rate based TCP mechanisms." TR 100219A. CAIA, 2010. */ if (tp->sacked_out == 0) { - if (num_acked == 1 && ca->delack) { + if (sample->pkts_acked == 1 && ca->delack) { /* A delayed ACK is only used for the minimum if it is * provenly lower than an existing non-zero minimum. */ - ca->rtt.min = min(ca->rtt.min, rtt_us); + ca->rtt.min = min(ca->rtt.min, sample->rtt_us); ca->delack--; return; - } else if (num_acked > 1 && ca->delack < 5) { + } else if (sample->pkts_acked > 1 && ca->delack < 5) { ca->delack++; } } - ca->rtt.min = min_not_zero(ca->rtt.min, rtt_us); - ca->rtt.max = max(ca->rtt.max, rtt_us); + ca->rtt.min = min_not_zero(ca->rtt.min, sample->rtt_us); + ca->rtt.max = max(ca->rtt.max, sample->rtt_us); } static u32 tcp_cdg_ssthresh(struct sock *sk) diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 0ce946e..c99230e 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -437,21 +437,21 @@ static void hystart_update(struct sock *sk, u32 delay) /* Track delayed acknowledgment ratio using sliding window * ratio = (15*ratio + sample) / 16 */ -static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)
Re: [PATCH net-next] skbuff: remove unused variable `doff'
From: Sowmini VaradhanDate: Tue, 10 May 2016 12:38:08 -0400 > There are two instances of an unused variable, `doff' added by > commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function") > in pskb_carve_inside_header() and pskb_carve_inside_nonlinear(). > Remove these instances, they are not used. > > Reported by: Daniel Borkmann > Signed-off-by: Sowmini Varadhan Applied.
Re: [PATCH net-next v2] ila: ipv6/ila: fix nlsize calculation for lwtunnel
From: Nicolas DichtelDate: Tue, 10 May 2016 11:56:32 +0200 > From: Tom Herbert > > The handler 'ila_fill_encap_info' adds two attributes: ILA_ATTR_LOCATOR > and ILA_ATTR_CSUM_MODE. > > nla_total_size_64bit() must be use for ILA_ATTR_LOCATOR. > > Also, do nla_put_u8 instead of nla_put_u64 for ILA_ATTR_CSUM_MODE. > > Fixes: f13a82d87b21 ("ipv6: use nla_put_u64_64bit()") > Fixes: 90bfe662db13 ("ila: add checksum neutral ILA translations") > Reported-by: Nicolas Dichtel > Signed-off-by: Tom Herbert > Signed-off-by: Nicolas Dichtel Applied.
Re: [PATCH] net: phylib: fix interrupts re-enablement in phy_start
From:Date: Tue, 10 May 2016 17:42:26 +0800 > From: Shaohui Xie > > If phy was suspended and is starting, current driver always enable > phy's interrupts, if phy works in polling, phy can raise unexpected > interrupt which will not be handled, the interrupt will block system > enter suspend again. So interrupts should only be re-enabled if phy > works in interrupt. > > Signed-off-by: Shaohui Xie Applied.
Re: [PATCH net] tcp: refresh skb timestamp at retransmit time
From: Eric DumazetDate: Mon, 09 May 2016 20:55:16 -0700 > From: Eric Dumazet > > In the very unlikely case __tcp_retransmit_skb() can not use the cloning > done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing > the copy and transmit, otherwise TCP TS val will be an exact copy of > original transmit. > > Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when") > Signed-off-by: Eric Dumazet > Cc: Yuchung Cheng Applied and queued up for -stable, thanks.
Re: [PATCH net-next] gtp: reload GTPv1 header after pskb_may_pull()
From: Pablo Neira AyusoDate: Tue, 10 May 2016 21:33:38 +0200 > The GTPv1 header flags indicate the presence of optional extensions > after this header. Refresh the pointer to the GTPv1 header as skb->head > might have be reallocated via pskb_may_pull(). > > Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling > Protocol (GTP-U)") > Reported-by: Eric Dumazet > Signed-off-by: Pablo Neira Ayuso Applied.
Re: [PATCH 5/6] drivers: net: xgene: Using static MSS values
From: Iyappan SubramanianDate: Mon, 9 May 2016 17:04:15 -0700 > Due to the nature of hardware design for TSO, if the MSS values that are > stored in the register, changes during TSO operation, data corruption may > occur. > > This patch fixes the issue by using one of the predefined MSS values. > > Signed-off-by: Iyappan Subramanian > Tested-by: Toan Le This is a very serious quality of implementation issue. And it could quietly kill performance for some users, and there is no way for them to find out that this is happening. If you use a predefined set of MSS values, if the MSS value we need is between two of them then there is going to be wasted space on the wire. It's can therefore certainly be better to not do TSO in that case. I think you absolutely need to disable TSO by default, and require the user to explicitly turn it on, unless you can fix this problem in another way. Thanks.
[PATCH net-next 2/2] net: dsa: mv88e6xxx: add STU capability
Some switch models have a STU (per VLAN port state database). Add a new capability flag to switches info, instead of checking their family. Also if the 6165 family has an STU, it must have a VTU, so add the MV88E6XXX_FLAG_VTU to its family flags. Signed-off-by: Vivien Didelot--- drivers/net/dsa/mv88e6xxx.c | 14 ++ drivers/net/dsa/mv88e6xxx.h | 16 ++-- 2 files changed, 16 insertions(+), 14 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 92be27d..835126e 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -453,16 +453,6 @@ static bool mv88e6xxx_has_fid_reg(struct mv88e6xxx_priv_state *ps) return false; } -static bool mv88e6xxx_has_stu(struct mv88e6xxx_priv_state *ps) -{ - /* Does the device have STU and dedicated SID registers for VTU ops? */ - if (mv88e6xxx_6097_family(ps) || mv88e6xxx_6165_family(ps) || - mv88e6xxx_6351_family(ps) || mv88e6xxx_6352_family(ps)) - return true; - - return false; -} - /* We expect the switch to perform auto negotiation if there is a real * phy. However, in the case of a fixed link phy, we force the port * settings from the fixed link settings. @@ -1599,7 +1589,7 @@ static int _mv88e6xxx_vtu_getnext(struct mv88e6xxx_priv_state *ps, next.fid |= ret & 0xf; } - if (mv88e6xxx_has_stu(ps)) { + if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_STU)) { ret = _mv88e6xxx_reg_read(ps, REG_GLOBAL, GLOBAL_VTU_SID); if (ret < 0) @@ -1686,7 +1676,7 @@ static int _mv88e6xxx_vtu_loadpurge(struct mv88e6xxx_priv_state *ps, if (ret < 0) return ret; - if (mv88e6xxx_has_stu(ps)) { + if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_STU)) { reg = entry->sid & GLOBAL_VTU_SID_MASK; ret = _mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_VTU_SID, reg); if (ret < 0) diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h index ca69a93..5f09a4e 100644 --- a/drivers/net/dsa/mv88e6xxx.h +++ b/drivers/net/dsa/mv88e6xxx.h @@ -403,6 +403,12 @@ enum mv88e6xxx_cap { */ MV88E6XXX_CAP_SMI_PHY, + /* Per VLAN Spanning Tree Unit (STU). +* The Port State database, if present, is accessed through VTU +* operations and dedicated SID registers. See GLOBAL_VTU_SID. +*/ + MV88E6XXX_CAP_STU, + /* Switch MAC/WoL/WoF register. * This requires an indirect access to set the switch MAC address * through GLOBAL2_SWITCH_MAC, otherwise GLOBAL_MAC_01, GLOBAL_MAC_23, @@ -436,6 +442,7 @@ enum mv88e6xxx_cap { #define MV88E6XXX_FLAG_PPU BIT(MV88E6XXX_CAP_PPU) #define MV88E6XXX_FLAG_PPU_ACTIVE BIT(MV88E6XXX_CAP_PPU_ACTIVE) #define MV88E6XXX_FLAG_SMI_PHY BIT(MV88E6XXX_CAP_SMI_PHY) +#define MV88E6XXX_FLAG_STU BIT(MV88E6XXX_CAP_STU) #define MV88E6XXX_FLAG_SWITCH_MAC BIT(MV88E6XXX_CAP_SWITCH_MAC_WOL_WOF) #define MV88E6XXX_FLAG_TEMPBIT(MV88E6XXX_CAP_TEMP) #define MV88E6XXX_FLAG_TEMP_LIMIT BIT(MV88E6XXX_CAP_TEMP_LIMIT) @@ -451,12 +458,15 @@ enum mv88e6xxx_cap { #define MV88E6XXX_FLAGS_FAMILY_6097\ (MV88E6XXX_FLAG_ATU | \ MV88E6XXX_FLAG_PPU | \ +MV88E6XXX_FLAG_STU | \ MV88E6XXX_FLAG_VLANTABLE | \ MV88E6XXX_FLAG_VTU) #define MV88E6XXX_FLAGS_FAMILY_6165\ - (MV88E6XXX_FLAG_SWITCH_MAC |\ -MV88E6XXX_FLAG_TEMP) + (MV88E6XXX_FLAG_STU | \ +MV88E6XXX_FLAG_SWITCH_MAC |\ +MV88E6XXX_FLAG_TEMP | \ +MV88E6XXX_FLAG_VTU) #define MV88E6XXX_FLAGS_FAMILY_6185\ (MV88E6XXX_FLAG_ATU | \ @@ -482,6 +492,7 @@ enum mv88e6xxx_cap { MV88E6XXX_FLAG_PORTSTATE | \ MV88E6XXX_FLAG_PPU_ACTIVE |\ MV88E6XXX_FLAG_SMI_PHY | \ +MV88E6XXX_FLAG_STU | \ MV88E6XXX_FLAG_SWITCH_MAC |\ MV88E6XXX_FLAG_TEMP | \ MV88E6XXX_FLAG_VLANTABLE | \ @@ -494,6 +505,7 @@ enum mv88e6xxx_cap { MV88E6XXX_FLAG_PORTSTATE | \ MV88E6XXX_FLAG_PPU_ACTIVE |\ MV88E6XXX_FLAG_SMI_PHY | \ +MV88E6XXX_FLAG_STU | \ MV88E6XXX_FLAG_SWITCH_MAC |\ MV88E6XXX_FLAG_TEMP | \ MV88E6XXX_FLAG_TEMP_LIMIT |\ -- 2.8.2
[PATCH net-next 1/2] net: dsa: mv88e6xxx: abstract VTU/STU data access
Both VTU and STU operations use the same routine to access their (common) data registers, with a different offset. Add VTU and STU specific read and write functions to the data registers to abstract the required offset. Signed-off-by: Vivien Didelot--- drivers/net/dsa/mv88e6xxx.c | 32 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 1e5ca8e..92be27d 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -1498,6 +1498,18 @@ static int _mv88e6xxx_vtu_stu_data_read(struct mv88e6xxx_priv_state *ps, return 0; } +static int mv88e6xxx_vtu_data_read(struct mv88e6xxx_priv_state *ps, + struct mv88e6xxx_vtu_stu_entry *entry) +{ + return _mv88e6xxx_vtu_stu_data_read(ps, entry, 0); +} + +static int mv88e6xxx_stu_data_read(struct mv88e6xxx_priv_state *ps, + struct mv88e6xxx_vtu_stu_entry *entry) +{ + return _mv88e6xxx_vtu_stu_data_read(ps, entry, 2); +} + static int _mv88e6xxx_vtu_stu_data_write(struct mv88e6xxx_priv_state *ps, struct mv88e6xxx_vtu_stu_entry *entry, unsigned int nibble_offset) @@ -1523,6 +1535,18 @@ static int _mv88e6xxx_vtu_stu_data_write(struct mv88e6xxx_priv_state *ps, return 0; } +static int mv88e6xxx_vtu_data_write(struct mv88e6xxx_priv_state *ps, + struct mv88e6xxx_vtu_stu_entry *entry) +{ + return _mv88e6xxx_vtu_stu_data_write(ps, entry, 0); +} + +static int mv88e6xxx_stu_data_write(struct mv88e6xxx_priv_state *ps, + struct mv88e6xxx_vtu_stu_entry *entry) +{ + return _mv88e6xxx_vtu_stu_data_write(ps, entry, 2); +} + static int _mv88e6xxx_vtu_vid_write(struct mv88e6xxx_priv_state *ps, u16 vid) { return _mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_VTU_VID, @@ -1551,7 +1575,7 @@ static int _mv88e6xxx_vtu_getnext(struct mv88e6xxx_priv_state *ps, next.valid = !!(ret & GLOBAL_VTU_VID_VALID); if (next.valid) { - ret = _mv88e6xxx_vtu_stu_data_read(ps, , 0); + ret = mv88e6xxx_vtu_data_read(ps, ); if (ret < 0) return ret; @@ -1658,7 +1682,7 @@ static int _mv88e6xxx_vtu_loadpurge(struct mv88e6xxx_priv_state *ps, goto loadpurge; /* Write port member tags */ - ret = _mv88e6xxx_vtu_stu_data_write(ps, entry, 0); + ret = mv88e6xxx_vtu_data_write(ps, entry); if (ret < 0) return ret; @@ -1724,7 +1748,7 @@ static int _mv88e6xxx_stu_getnext(struct mv88e6xxx_priv_state *ps, u8 sid, next.valid = !!(ret & GLOBAL_VTU_VID_VALID); if (next.valid) { - ret = _mv88e6xxx_vtu_stu_data_read(ps, , 2); + ret = mv88e6xxx_stu_data_read(ps, ); if (ret < 0) return ret; } @@ -1747,7 +1771,7 @@ static int _mv88e6xxx_stu_loadpurge(struct mv88e6xxx_priv_state *ps, goto loadpurge; /* Write port states */ - ret = _mv88e6xxx_vtu_stu_data_write(ps, entry, 2); + ret = mv88e6xxx_stu_data_write(ps, entry); if (ret < 0) return ret; -- 2.8.2
[PATCH 14/17] batman-adv: Use kref_get for _batadv_update_route
From: Sven Eckelmann_batadv_update_route requires that the caller already has a valid reference for neigh_node. It is therefore not possible that it has an reference counter of 0 and was still given to this function The kref_get function instead WARNs (with debug information) when the reference counter would still be 0. This makes a bug in batman-adv better visible because kref_get_unless_zero would have ignored this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/routing.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c index 2ecfca246be4..b494e435686f 100644 --- a/net/batman-adv/routing.c +++ b/net/batman-adv/routing.c @@ -100,10 +100,6 @@ static void _batadv_update_route(struct batadv_priv *bat_priv, if (curr_router) batadv_neigh_node_put(curr_router); - /* increase refcount of new best neighbor */ - if (neigh_node && !kref_get_unless_zero(_node->refcount)) - neigh_node = NULL; - spin_lock_bh(_node->neigh_list_lock); /* curr_router used earlier may not be the current orig_ifinfo->router * anymore because it was dereferenced outside of the neigh_list_lock @@ -114,6 +110,10 @@ static void _batadv_update_route(struct batadv_priv *bat_priv, */ curr_router = rcu_dereference_protected(orig_ifinfo->router, true); + /* increase refcount of new best neighbor */ + if (neigh_node) + kref_get(_node->refcount); + rcu_assign_pointer(orig_ifinfo->router, neigh_node); spin_unlock_bh(_node->neigh_list_lock); batadv_orig_ifinfo_put(orig_ifinfo); -- 2.8.2
[PATCH 15/17] batman-adv: Use bool as return type for boolean functions
From: Sven EckelmannIt is easier to understand that the returned value of a specific function doesn't have to be 0 when the functions was successful when the actual return type is bool. This is especially true when all surrounding functions with return type int use negative values to return the error code. Reported-by: Nicholas Krause Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/bat_iv_ogm.c| 23 ++--- net/batman-adv/bitarray.c | 16 +-- net/batman-adv/bitarray.h | 15 +-- net/batman-adv/bridge_loop_avoidance.c | 175 + net/batman-adv/bridge_loop_avoidance.h | 43 net/batman-adv/debugfs.c | 2 +- net/batman-adv/distributed-arp-table.c | 6 +- net/batman-adv/hard-interface.c| 15 ++- net/batman-adv/hash.h | 6 +- net/batman-adv/main.h | 2 +- net/batman-adv/network-coding.c| 12 +-- net/batman-adv/originator.c| 4 +- net/batman-adv/originator.h| 2 +- net/batman-adv/routing.c | 37 +++ net/batman-adv/routing.h | 6 +- net/batman-adv/soft-interface.c| 6 +- net/batman-adv/soft-interface.h| 3 +- net/batman-adv/translation-table.c | 31 +++--- 18 files changed, 205 insertions(+), 199 deletions(-) diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c index eb3435de54b5..7f98a9d39883 100644 --- a/net/batman-adv/bat_iv_ogm.c +++ b/net/batman-adv/bat_iv_ogm.c @@ -1168,13 +1168,13 @@ out: * @if_incoming: interface where the packet was received * @if_outgoing: interface for which the retransmission should be considered * - * Return: 1 if the link can be considered bidirectional, 0 otherwise + * Return: true if the link can be considered bidirectional, false otherwise */ -static int batadv_iv_ogm_calc_tq(struct batadv_orig_node *orig_node, -struct batadv_orig_node *orig_neigh_node, -struct batadv_ogm_packet *batadv_ogm_packet, -struct batadv_hard_iface *if_incoming, -struct batadv_hard_iface *if_outgoing) +static bool batadv_iv_ogm_calc_tq(struct batadv_orig_node *orig_node, + struct batadv_orig_node *orig_neigh_node, + struct batadv_ogm_packet *batadv_ogm_packet, + struct batadv_hard_iface *if_incoming, + struct batadv_hard_iface *if_outgoing) { struct batadv_priv *bat_priv = netdev_priv(if_incoming->soft_iface); struct batadv_neigh_node *neigh_node = NULL, *tmp_neigh_node; @@ -1182,9 +1182,10 @@ static int batadv_iv_ogm_calc_tq(struct batadv_orig_node *orig_node, u8 total_count; u8 orig_eq_count, neigh_rq_count, neigh_rq_inv, tq_own; unsigned int neigh_rq_inv_cube, neigh_rq_max_cube; - int tq_asym_penalty, inv_asym_penalty, if_num, ret = 0; + int tq_asym_penalty, inv_asym_penalty, if_num; unsigned int combined_tq; int tq_iface_penalty; + bool ret = false; /* find corresponding one hop neighbor */ rcu_read_lock(); @@ -1296,7 +1297,7 @@ static int batadv_iv_ogm_calc_tq(struct batadv_orig_node *orig_node, * consider it bidirectional */ if (batadv_ogm_packet->tq >= BATADV_TQ_TOTAL_BIDRECT_LIMIT) - ret = 1; + ret = true; out: if (neigh_node) @@ -1325,9 +1326,9 @@ batadv_iv_ogm_update_seqnos(const struct ethhdr *ethhdr, struct batadv_orig_ifinfo *orig_ifinfo = NULL; struct batadv_neigh_node *neigh_node; struct batadv_neigh_ifinfo *neigh_ifinfo; - int is_dup; + bool is_dup; s32 seq_diff; - int need_update = 0; + bool need_update = false; int set_mark; enum batadv_dup_status ret = BATADV_NO_DUP; u32 seqno = ntohl(batadv_ogm_packet->seqno); @@ -1437,7 +1438,7 @@ batadv_iv_ogm_process_per_outif(const struct sk_buff *skb, int ogm_offset, struct sk_buff *skb_priv; struct ethhdr *ethhdr; u8 *prev_sender; - int is_bidirect; + bool is_bidirect; /* create a private copy of the skb, as some functions change tq value * and/or flags. diff --git a/net/batman-adv/bitarray.c b/net/batman-adv/bitarray.c index b56bb000a0ab..a0c7913837a5 100644 --- a/net/batman-adv/bitarray.c +++ b/net/batman-adv/bitarray.c @@ -38,11 +38,11 @@ static void batadv_bitmap_shift_left(unsigned long *seq_bits, s32 n) * the last sequence number * @set_mark: whether this packet should be marked in seq_bits * - * Return: 1 if the window was moved (either new or very
[PATCH 10/17] batman-adv: Use kref_get for batadv_nc_get_nc_node
From: Sven Eckelmannbatadv_nc_get_nc_node requires that the caller already has a valid reference for orig_neigh_node. It is therefore not possible that it has an reference counter of 0 and was still given to this function The kref_get function instead WARNs (with debug information) when the reference counter would still be 0. This makes a bug in batman-adv better visible because kref_get_unless_zero would have ignored this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/network-coding.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c index 1da8e0e1b18f..953dff1ad43b 100644 --- a/net/batman-adv/network-coding.c +++ b/net/batman-adv/network-coding.c @@ -856,8 +856,7 @@ batadv_nc_get_nc_node(struct batadv_priv *bat_priv, if (!nc_node) return NULL; - if (!kref_get_unless_zero(_neigh_node->refcount)) - goto free; + kref_get(_neigh_node->refcount); /* Initialize nc_node */ INIT_LIST_HEAD(_node->list); @@ -884,10 +883,6 @@ batadv_nc_get_nc_node(struct batadv_priv *bat_priv, spin_unlock_bh(lock); return nc_node; - -free: - kfree(nc_node); - return NULL; } /** -- 2.8.2
[PATCH 13/17] batman-adv: Use kref_get for hard_iface subfunctions
From: Sven EckelmannThe callers of the functions using batadv_hard_iface objects already make sure that they hold a valid reference. The subfunctions don't have to check whether the reference counter is > 0 because this was checked by the callers. The kref_get function instead WARNs (with debug information) when the reference counter would still be 0. This makes a bug in batman-adv better visible because kref_get_unless_zero would have ignored this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/bat_iv_ogm.c | 14 +++--- net/batman-adv/hard-interface.c | 7 +++ net/batman-adv/originator.c | 30 +++--- 3 files changed, 13 insertions(+), 38 deletions(-) diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c index 57e9962c7090..eb3435de54b5 100644 --- a/net/batman-adv/bat_iv_ogm.c +++ b/net/batman-adv/bat_iv_ogm.c @@ -681,18 +681,12 @@ static void batadv_iv_ogm_aggregate_new(const unsigned char *packet_buff, unsigned char *skb_buff; unsigned int skb_size; - if (!kref_get_unless_zero(_incoming->refcount)) - return; - - if (!kref_get_unless_zero(_outgoing->refcount)) - goto out_free_incoming; - /* own packet should always be scheduled */ if (!own_packet) { if (!batadv_atomic_dec_not_zero(_priv->batman_queue_left)) { batadv_dbg(BATADV_DBG_BATMAN, bat_priv, "batman packet queue full\n"); - goto out_free_outgoing; + return; } } @@ -718,6 +712,8 @@ static void batadv_iv_ogm_aggregate_new(const unsigned char *packet_buff, forw_packet_aggr->packet_len = packet_len; memcpy(skb_buff, packet_buff, packet_len); + kref_get(_incoming->refcount); + kref_get(_outgoing->refcount); forw_packet_aggr->own = own_packet; forw_packet_aggr->if_incoming = if_incoming; forw_packet_aggr->if_outgoing = if_outgoing; @@ -747,10 +743,6 @@ out_free_forw_packet: out_nomem: if (!own_packet) atomic_inc(_priv->batman_queue_left); -out_free_outgoing: - batadv_hardif_put(if_outgoing); -out_free_incoming: - batadv_hardif_put(if_incoming); } /* aggregate a new packet into the existing ogm packet */ diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c index d3d37f3f99cf..7c1d8d7ac548 100644 --- a/net/batman-adv/hard-interface.c +++ b/net/batman-adv/hard-interface.c @@ -236,8 +236,8 @@ static void batadv_primary_if_select(struct batadv_priv *bat_priv, ASSERT_RTNL(); - if (new_hard_iface && !kref_get_unless_zero(_hard_iface->refcount)) - new_hard_iface = NULL; + if (new_hard_iface) + kref_get(_hard_iface->refcount); curr_hard_iface = rcu_dereference_protected(bat_priv->primary_if, 1); rcu_assign_pointer(bat_priv->primary_if, new_hard_iface); @@ -467,8 +467,7 @@ int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface, if (hard_iface->if_status != BATADV_IF_NOT_IN_USE) goto out; - if (!kref_get_unless_zero(_iface->refcount)) - goto out; + kref_get(_iface->refcount); soft_iface = dev_get_by_name(net, iface_name); diff --git a/net/batman-adv/originator.c b/net/batman-adv/originator.c index 2ed2cc89a669..04fa139911c3 100644 --- a/net/batman-adv/originator.c +++ b/net/batman-adv/originator.c @@ -374,12 +374,8 @@ batadv_orig_ifinfo_new(struct batadv_orig_node *orig_node, if (!orig_ifinfo) goto out; - if (if_outgoing != BATADV_IF_DEFAULT && - !kref_get_unless_zero(_outgoing->refcount)) { - kfree(orig_ifinfo); - orig_ifinfo = NULL; - goto out; - } + if (if_outgoing != BATADV_IF_DEFAULT) + kref_get(_outgoing->refcount); reset_time = jiffies - 1; reset_time -= msecs_to_jiffies(BATADV_RESET_PROTECTION_MS); @@ -455,11 +451,8 @@ batadv_neigh_ifinfo_new(struct batadv_neigh_node *neigh, if (!neigh_ifinfo) goto out; - if (if_outgoing && !kref_get_unless_zero(_outgoing->refcount)) { - kfree(neigh_ifinfo); - neigh_ifinfo = NULL; - goto out; - } + if (if_outgoing) + kref_get(_outgoing->refcount); INIT_HLIST_NODE(_ifinfo->list); kref_init(_ifinfo->refcount); @@ -532,15 +525,11 @@ batadv_hardif_neigh_create(struct batadv_hard_iface *hard_iface, if (hardif_neigh) goto out; - if (!kref_get_unless_zero(_iface->refcount)) - goto out; - hardif_neigh =
[PATCH 17/17] batman-adv: use batadv_compare_eth when possible
When comparing Ethernet address it is better to use the more generic batadv_compare_eth. The latter is also optimised for architectures having a fast unaligned access. Signed-off-by: Antonio Quartulli[s...@narfation.org: fix conflicts with current version] Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner --- net/batman-adv/network-coding.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c index df5ae9c7e507..678f06865312 100644 --- a/net/batman-adv/network-coding.c +++ b/net/batman-adv/network-coding.c @@ -521,12 +521,10 @@ static bool batadv_nc_hash_compare(const struct hlist_node *node, nc_path2 = data2; /* Return 1 if the two keys are identical */ - if (memcmp(nc_path1->prev_hop, nc_path2->prev_hop, - sizeof(nc_path1->prev_hop)) != 0) + if (!batadv_compare_eth(nc_path1->prev_hop, nc_path2->prev_hop)) return false; - if (memcmp(nc_path1->next_hop, nc_path2->next_hop, - sizeof(nc_path1->next_hop)) != 0) + if (!batadv_compare_eth(nc_path1->next_hop, nc_path2->next_hop)) return false; return true; -- 2.8.2
[PATCH 16/17] batman-adv: replace ethertype variable with ETH_P_BATMAN for readability
From: Marek LindnerSigned-off-by: Marek Lindner Reviewed-by: Sven Eckelmann Signed-off-by: Antonio Quartulli --- net/batman-adv/soft-interface.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index 3a0fc3c18444..343d2c904399 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -186,7 +186,6 @@ static int batadv_interface_tx(struct sk_buff *skb, struct batadv_priv *bat_priv = netdev_priv(soft_iface); struct batadv_hard_iface *primary_if = NULL; struct batadv_bcast_packet *bcast_packet; - __be16 ethertype = htons(ETH_P_BATMAN); static const u8 stp_addr[ETH_ALEN] = {0x01, 0x80, 0xC2, 0x00, 0x00, 0x00}; static const u8 ectp_addr[ETH_ALEN] = {0xCF, 0x00, 0x00, 0x00, @@ -216,7 +215,8 @@ static int batadv_interface_tx(struct sk_buff *skb, case ETH_P_8021Q: vhdr = vlan_eth_hdr(skb); - if (vhdr->h_vlan_encapsulated_proto != ethertype) { + /* drop batman-in-batman packets to prevent loops */ + if (vhdr->h_vlan_encapsulated_proto != htons(ETH_P_BATMAN)) { network_offset += VLAN_HLEN; break; } @@ -404,7 +404,6 @@ void batadv_interface_rx(struct net_device *soft_iface, { struct batadv_bcast_packet *batadv_bcast_packet; struct batadv_priv *bat_priv = netdev_priv(soft_iface); - __be16 ethertype = htons(ETH_P_BATMAN); struct vlan_ethhdr *vhdr; struct ethhdr *ethhdr; unsigned short vid; @@ -434,7 +433,8 @@ void batadv_interface_rx(struct net_device *soft_iface, vhdr = (struct vlan_ethhdr *)skb->data; - if (vhdr->h_vlan_encapsulated_proto != ethertype) + /* drop batman-in-batman packets to prevent loops */ + if (vhdr->h_vlan_encapsulated_proto != htons(ETH_P_BATMAN)) break; /* fall through */ -- 2.8.2
[PATCH 12/17] batman-adv: Use kref_get for batadv_gw_node_add
From: Sven Eckelmannbatadv_gw_node_add requires that the caller already has a valid reference for orig_node. It is therefore not possible that it has an reference counter of 0 and was still given to this function The kref_get function instead WARNs (with debug information) when the reference counter would still be 0. This makes a bug in batman-adv better visible because kref_get_unless_zero would have ignored this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/gateway_client.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c index bb1c4f37716e..5839c569f769 100644 --- a/net/batman-adv/gateway_client.c +++ b/net/batman-adv/gateway_client.c @@ -440,15 +440,11 @@ static void batadv_gw_node_add(struct batadv_priv *bat_priv, if (gateway->bandwidth_down == 0) return; - if (!kref_get_unless_zero(_node->refcount)) - return; - gw_node = kzalloc(sizeof(*gw_node), GFP_ATOMIC); - if (!gw_node) { - batadv_orig_node_put(orig_node); + if (!gw_node) return; - } + kref_get(_node->refcount); INIT_HLIST_NODE(_node->list); gw_node->orig_node = orig_node; gw_node->bandwidth_down = ntohl(gateway->bandwidth_down); -- 2.8.2
[PATCH 11/17] batman-adv: Use kref_get for batadv_gw_select
From: Sven Eckelmannbatadv_gw_select requires that the caller already has a valid reference for new_gw_node. It is therefore not possible that it has an reference counter of 0 and was still given to this function The kref_get function instead WARNs (with debug information) when the reference counter would still be 0. This makes a bug in batman-adv better visible because kref_get_unless_zero would have ignored this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/gateway_client.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c index c59aff5ccac8..bb1c4f37716e 100644 --- a/net/batman-adv/gateway_client.c +++ b/net/batman-adv/gateway_client.c @@ -135,8 +135,8 @@ static void batadv_gw_select(struct batadv_priv *bat_priv, spin_lock_bh(_priv->gw.list_lock); - if (new_gw_node && !kref_get_unless_zero(_gw_node->refcount)) - new_gw_node = NULL; + if (new_gw_node) + kref_get(_gw_node->refcount); curr_gw_node = rcu_dereference_protected(bat_priv->gw.curr_gw, 1); rcu_assign_pointer(bat_priv->gw.curr_gw, new_gw_node); -- 2.8.2
[PATCH net-next] gtp: reload GTPv1 header after pskb_may_pull()
The GTPv1 header flags indicate the presence of optional extensions after this header. Refresh the pointer to the GTPv1 header as skb->head might have be reallocated via pskb_may_pull(). Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Reported-by: Eric DumazetSigned-off-by: Pablo Neira Ayuso --- drivers/net/gtp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index 8ce1104..f7caf1e 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -253,6 +253,8 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb, if (!pskb_may_pull(skb, hdrlen)) return -1; + gtp1 = (struct gtp1_header *)(skb->data + sizeof(struct udphdr)); + rcu_read_lock(); pctx = gtp1_pdp_find(gtp, ntohl(gtp1->tid)); if (!pctx) { -- 2.1.4
[PATCH 05/17] batman-adv: add detection for complex bridge loops
From: Simon WunderlichThere are network setups where the current bridge loop avoidance can't detect bridge loops. The minimal setup affected would consist of two LANs and two separate meshes, connected in a ring like that: A...(mesh1)...B | | (LAN1)(LAN2) | | C...(mesh2)...D Since both the meshes and backbones are separate, the bridge loop avoidance has not enough information to detect and avoid the loop in this case. Even if these scenarios can't be fixed easily, these kind of loops can be detected. This patch implements a periodic check (running every 60 seconds for now) which sends a broadcast frame with a random MAC address on each backbone VLAN. If a broadcast frame with the same MAC address is received shortly after on the mesh, we know that there must be a loop and report that incident as well as throw an uevent to let others handle that problem. Signed-off-by: Simon Wunderlich [s...@narfation.org: fix conflicts with current version] Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/bridge_loop_avoidance.c | 139 + net/batman-adv/main.h | 4 + net/batman-adv/packet.h| 1 + net/batman-adv/sysfs.c | 6 +- net/batman-adv/types.h | 8 ++ 5 files changed, 156 insertions(+), 2 deletions(-) diff --git a/net/batman-adv/bridge_loop_avoidance.c b/net/batman-adv/bridge_loop_avoidance.c index 2c9aa671a49b..5064ae5e9b34 100644 --- a/net/batman-adv/bridge_loop_avoidance.c +++ b/net/batman-adv/bridge_loop_avoidance.c @@ -50,6 +50,7 @@ #include "hash.h" #include "originator.h" #include "packet.h" +#include "sysfs.h" #include "translation-table.h" static const u8 batadv_announce_mac[4] = {0x43, 0x05, 0x43, 0x05}; @@ -407,6 +408,14 @@ static void batadv_bla_send_claim(struct batadv_priv *bat_priv, u8 *mac, ethhdr->h_source, ethhdr->h_dest, BATADV_PRINT_VID(vid)); break; + case BATADV_CLAIM_TYPE_LOOPDETECT: + ether_addr_copy(ethhdr->h_source, mac); + batadv_dbg(BATADV_DBG_BLA, bat_priv, + "bla_send_claim(): LOOPDETECT of %pM to %pM on vid %d\n", + ethhdr->h_source, ethhdr->h_dest, + BATADV_PRINT_VID(vid)); + + break; } if (vid & BATADV_VLAN_HAS_TAG) @@ -427,6 +436,36 @@ out: } /** + * batadv_bla_loopdetect_report - worker for reporting the loop + * @work: work queue item + * + * Throws an uevent, as the loopdetect check function can't do that itself + * since the kernel may sleep while throwing uevents. + */ +static void batadv_bla_loopdetect_report(struct work_struct *work) +{ + struct batadv_bla_backbone_gw *backbone_gw; + struct batadv_priv *bat_priv; + char vid_str[6] = { '\0' }; + + backbone_gw = container_of(work, struct batadv_bla_backbone_gw, + report_work); + bat_priv = backbone_gw->bat_priv; + + batadv_info(bat_priv->soft_iface, + "Possible loop on VLAN %d detected which can't be handled by BLA - please check your network setup!\n", + BATADV_PRINT_VID(backbone_gw->vid)); + snprintf(vid_str, sizeof(vid_str), "%d", +BATADV_PRINT_VID(backbone_gw->vid)); + vid_str[sizeof(vid_str) - 1] = 0; + + batadv_throw_uevent(bat_priv, BATADV_UEV_BLA, BATADV_UEV_LOOPDETECT, + vid_str); + + batadv_backbone_gw_put(backbone_gw); +} + +/** * batadv_bla_get_backbone_gw - finds or creates a backbone gateway * @bat_priv: the bat priv with all the soft interface information * @orig: the mac address of the originator @@ -464,6 +503,7 @@ batadv_bla_get_backbone_gw(struct batadv_priv *bat_priv, u8 *orig, atomic_set(>request_sent, 0); atomic_set(>wait_periods, 0); ether_addr_copy(entry->orig, orig); + INIT_WORK(>report_work, batadv_bla_loopdetect_report); /* one for the hash, one for returning */ kref_init(>refcount); @@ -1060,6 +1100,10 @@ static int batadv_bla_process_claim(struct batadv_priv *bat_priv, if (vlan_depth > 1) return 1; + /* Let the loopdetect frames on the mesh in any case. */ + if (bla_dst->type == BATADV_CLAIM_TYPE_LOOPDETECT) + return 0; + /* check if it is a claim frame. */ ret = batadv_check_claim_group(bat_priv, primary_if, hw_src, hw_dst, ethhdr); @@ -1265,6 +1309,26 @@ void batadv_bla_update_orig_address(struct batadv_priv *bat_priv, } /** + * batadv_bla_send_loopdetect - send a loopdetect frame + * @bat_priv:
[PATCH 06/17] batman-adv: Check hard_iface refcnt before calling function
From: Sven EckelmannThe batadv_hardif_list list is checked in many situations and the items in this list are given to specialized functions to modify the routing behavior. At the moment each of these called functions has to check itself whether the received batadv_hard_iface has a refcount > 0 before it can increase the reference counter and use it in other objects. This can easily lead to problems because it is not easily visible where all callers of a function got the batadv_hard_iface object from and whether they already hold a valid reference. Checking the reference counter directly before calling a subfunction with a pointer from the batadv_hardif_list avoids this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/bat_iv_ogm.c | 11 +++ net/batman-adv/bat_v_ogm.c | 14 +- net/batman-adv/originator.c | 5 + net/batman-adv/send.c | 6 ++ 4 files changed, 35 insertions(+), 1 deletion(-) diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c index 8c1710bba803..57e9962c7090 100644 --- a/net/batman-adv/bat_iv_ogm.c +++ b/net/batman-adv/bat_iv_ogm.c @@ -987,9 +987,15 @@ static void batadv_iv_ogm_schedule(struct batadv_hard_iface *hard_iface) list_for_each_entry_rcu(tmp_hard_iface, _hardif_list, list) { if (tmp_hard_iface->soft_iface != hard_iface->soft_iface) continue; + + if (!kref_get_unless_zero(_hard_iface->refcount)) + continue; + batadv_iv_ogm_queue_add(bat_priv, *ogm_buff, *ogm_buff_len, hard_iface, tmp_hard_iface, 1, send_time); + + batadv_hardif_put(tmp_hard_iface); } rcu_read_unlock(); @@ -1767,8 +1773,13 @@ static void batadv_iv_ogm_process(const struct sk_buff *skb, int ogm_offset, if (hard_iface->soft_iface != bat_priv->soft_iface) continue; + if (!kref_get_unless_zero(_iface->refcount)) + continue; + batadv_iv_ogm_process_per_outif(skb, ogm_offset, orig_node, if_incoming, hard_iface); + + batadv_hardif_put(hard_iface); } rcu_read_unlock(); diff --git a/net/batman-adv/bat_v_ogm.c b/net/batman-adv/bat_v_ogm.c index 4155fa57cf6d..473ebb9a0e73 100644 --- a/net/batman-adv/bat_v_ogm.c +++ b/net/batman-adv/bat_v_ogm.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -176,6 +177,9 @@ static void batadv_v_ogm_send(struct work_struct *work) if (hard_iface->soft_iface != bat_priv->soft_iface) continue; + if (!kref_get_unless_zero(_iface->refcount)) + continue; + batadv_dbg(BATADV_DBG_BATMAN, bat_priv, "Sending own OGM2 packet (originator %pM, seqno %u, throughput %u, TTL %d) on interface %s [%pM]\n", ogm_packet->orig, ntohl(ogm_packet->seqno), @@ -185,10 +189,13 @@ static void batadv_v_ogm_send(struct work_struct *work) /* this skb gets consumed by batadv_v_ogm_send_to_if() */ skb_tmp = skb_clone(skb, GFP_ATOMIC); - if (!skb_tmp) + if (!skb_tmp) { + batadv_hardif_put(hard_iface); break; + } batadv_v_ogm_send_to_if(skb_tmp, hard_iface); + batadv_hardif_put(hard_iface); } rcu_read_unlock(); @@ -704,9 +711,14 @@ static void batadv_v_ogm_process(const struct sk_buff *skb, int ogm_offset, if (hard_iface->soft_iface != bat_priv->soft_iface) continue; + if (!kref_get_unless_zero(_iface->refcount)) + continue; + batadv_v_ogm_process_per_outif(bat_priv, ethhdr, ogm_packet, orig_node, neigh_node, if_incoming, hard_iface); + + batadv_hardif_put(hard_iface); } rcu_read_unlock(); out: diff --git a/net/batman-adv/originator.c b/net/batman-adv/originator.c index f885a41d06d5..2ed2cc89a669 100644 --- a/net/batman-adv/originator.c +++ b/net/batman-adv/originator.c @@ -1160,6 +1160,9 @@ static bool batadv_purge_orig_node(struct batadv_priv *bat_priv, if (hard_iface->soft_iface != bat_priv->soft_iface) continue; + if (!kref_get_unless_zero(_iface->refcount)) + continue; + best_neigh_node = batadv_find_best_neighbor(bat_priv,
[PATCH 07/17] batman-adv: Check hard_iface refcnt when receiving skb
From: Sven EckelmannThe receive function may start processing an incoming packet while the hard_iface is shut down in a different context. All called functions called with the batadv_hard_iface object belonging to the incoming interface would have to check whether the reference counter is still > 0. This is rather error-prone because this check can be forgotten easily. Instead check the reference counter when receiving the object to make sure that all called functions have a valid reference. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/main.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c index 78c05a91ae6f..c8d8bc78a518 100644 --- a/net/batman-adv/main.c +++ b/net/batman-adv/main.c @@ -401,11 +401,19 @@ int batadv_batman_skb_recv(struct sk_buff *skb, struct net_device *dev, hard_iface = container_of(ptype, struct batadv_hard_iface, batman_adv_ptype); + + /* Prevent processing a packet received on an interface which is getting +* shut down otherwise the packet may trigger de-reference errors +* further down in the receive path. +*/ + if (!kref_get_unless_zero(_iface->refcount)) + goto err_out; + skb = skb_share_check(skb, GFP_ATOMIC); /* skb was released by skb_share_check() */ if (!skb) - goto err_out; + goto err_put; /* packet should hold at least type and version */ if (unlikely(!pskb_may_pull(skb, 2))) @@ -448,6 +456,8 @@ int batadv_batman_skb_recv(struct sk_buff *skb, struct net_device *dev, if (ret == NET_RX_DROP) kfree_skb(skb); + batadv_hardif_put(hard_iface); + /* return NET_RX_SUCCESS in any case as we * most probably dropped the packet for * routing-logical reasons. @@ -456,6 +466,8 @@ int batadv_batman_skb_recv(struct sk_buff *skb, struct net_device *dev, err_free: kfree_skb(skb); +err_put: + batadv_hardif_put(hard_iface); err_out: return NET_RX_DROP; } -- 2.8.2
[PATCH 04/17] batman-adv: Create batman soft interfaces within correct netns.
From: Andrew LunnWhen creating a soft interface, create it in the same netns as the hard interface. Replace all references to init_net with the correct name space for the interface being manipulated. Suggested-by: Daniel Ehlers Signed-off-by: Andrew Lunn Acked-by: Antonio Quartulli Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/hard-interface.c| 10 +- net/batman-adv/hard-interface.h| 3 ++- net/batman-adv/soft-interface.c| 7 +-- net/batman-adv/soft-interface.h| 3 ++- net/batman-adv/sysfs.c | 3 ++- net/batman-adv/translation-table.c | 4 ++-- 6 files changed, 18 insertions(+), 12 deletions(-) diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c index 0a7deaf2670a..f0e1899e5b6b 100644 --- a/net/batman-adv/hard-interface.c +++ b/net/batman-adv/hard-interface.c @@ -36,7 +36,6 @@ #include #include #include -#include #include "bridge_loop_avoidance.h" #include "debugfs.h" @@ -121,6 +120,7 @@ static bool batadv_mutual_parents(const struct net_device *dev1, static bool batadv_is_on_batman_iface(const struct net_device *net_dev) { struct net_device *parent_dev; + struct net *net = dev_net(net_dev); bool ret; /* check if this is a batman-adv mesh interface */ @@ -133,7 +133,7 @@ static bool batadv_is_on_batman_iface(const struct net_device *net_dev) return false; /* recurse over the parent device */ - parent_dev = __dev_get_by_index(_net, dev_get_iflink(net_dev)); + parent_dev = __dev_get_by_index(net, dev_get_iflink(net_dev)); /* if we got a NULL parent_dev there is something broken.. */ if (WARN(!parent_dev, "Cannot find parent device")) return false; @@ -456,7 +456,7 @@ static int batadv_master_del_slave(struct batadv_hard_iface *slave, } int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface, - const char *iface_name) + struct net *net, const char *iface_name) { struct batadv_priv *bat_priv; struct net_device *soft_iface, *master; @@ -470,10 +470,10 @@ int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface, if (!kref_get_unless_zero(_iface->refcount)) goto out; - soft_iface = dev_get_by_name(_net, iface_name); + soft_iface = dev_get_by_name(net, iface_name); if (!soft_iface) { - soft_iface = batadv_softif_create(iface_name); + soft_iface = batadv_softif_create(net, iface_name); if (!soft_iface) { ret = -ENOMEM; diff --git a/net/batman-adv/hard-interface.h b/net/batman-adv/hard-interface.h index d74f1983f33e..a76724d369bf 100644 --- a/net/batman-adv/hard-interface.h +++ b/net/batman-adv/hard-interface.h @@ -28,6 +28,7 @@ #include struct net_device; +struct net; enum batadv_hard_if_state { BATADV_IF_NOT_IN_USE, @@ -55,7 +56,7 @@ bool batadv_is_wifi_iface(int ifindex); struct batadv_hard_iface* batadv_hardif_get_by_netdev(const struct net_device *net_dev); int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface, - const char *iface_name); + struct net *net, const char *iface_name); void batadv_hardif_disable_interface(struct batadv_hard_iface *hard_iface, enum batadv_hard_if_cleanup autodel); void batadv_hardif_remove_interfaces(void); diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index 66dd0aac480a..04866c9b860a 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -885,13 +885,14 @@ static int batadv_softif_slave_add(struct net_device *dev, struct net_device *slave_dev) { struct batadv_hard_iface *hard_iface; + struct net *net = dev_net(dev); int ret = -EINVAL; hard_iface = batadv_hardif_get_by_netdev(slave_dev); if (!hard_iface || hard_iface->soft_iface) goto out; - ret = batadv_hardif_enable_interface(hard_iface, dev->name); + ret = batadv_hardif_enable_interface(hard_iface, net, dev->name); out: if (hard_iface) @@ -988,7 +989,7 @@ static void batadv_softif_init_early(struct net_device *dev) memset(priv, 0, sizeof(*priv)); } -struct net_device *batadv_softif_create(const char *name) +struct net_device *batadv_softif_create(struct net *net, const char *name) { struct net_device *soft_iface; int ret; @@ -998,6 +999,8 @@ struct net_device *batadv_softif_create(const char *name) if (!soft_iface) return NULL;
[PATCH 09/17] batman-adv: Use kref_get for batadv_tvlv_container_get
From: Sven Eckelmannbatadv_tvlv_container_get requires that tvlv.container_list_lock is held by the caller. It is therefore not possible that an item in tvlv.container_list has an reference counter of 0 and is still in the list The kref_get function instead WARNs (with debug information) when the reference counter would still be 0. This makes a bug in batman-adv better visible because kref_get_unless_zero would have ignored this problem. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/main.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c index c8d8bc78a518..5f2974bd1227 100644 --- a/net/batman-adv/main.c +++ b/net/batman-adv/main.c @@ -748,9 +748,7 @@ batadv_tvlv_container_get(struct batadv_priv *bat_priv, u8 type, u8 version) if (tvlv_tmp->tvlv_hdr.version != version) continue; - if (!kref_get_unless_zero(_tmp->refcount)) - continue; - + kref_get(_tmp->refcount); tvlv = tvlv_tmp; break; } -- 2.8.2
[PATCH 02/17] batman-adv: Remove hdr_size skb size check in batadv_interface_rx
From: Sven EckelmannThe callers of batadv_interface_rx have to make sure that enough data can be pulled from the skb when they read the batman-adv header. The only two functions using it are either calling pskb_may_pull with hdr_size directly (batadv_recv_bcast_packet) or indirectly via batadv_check_unicast_packet (batadv_recv_unicast_packet). Reported-by: Marek Lindner Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/soft-interface.c | 4 1 file changed, 4 deletions(-) diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index dc9a61a5122d..d72f88707736 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -413,10 +413,6 @@ void batadv_interface_rx(struct net_device *soft_iface, batadv_bcast_packet = (struct batadv_bcast_packet *)skb->data; is_bcast = (batadv_bcast_packet->packet_type == BATADV_BCAST); - /* check if enough space is available for pulling, and pull */ - if (!pskb_may_pull(skb, hdr_size)) - goto dropped; - skb_pull_rcsum(skb, hdr_size); skb_reset_mac_header(skb); -- 2.8.2
[PATCH 08/17] batman-adv: Increase hard_iface refcnt for ptype
From: Sven EckelmannThe hard_iface is referenced in the packet_type for batman-adv. Increase the refcounter of the hard_interface for it to have an explicit reference for it in case this functionality gets refactorted and the currently used implicit reference for it will be removed. Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/hard-interface.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c index f0e1899e5b6b..d3d37f3f99cf 100644 --- a/net/batman-adv/hard-interface.c +++ b/net/batman-adv/hard-interface.c @@ -522,6 +522,7 @@ int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface, goto err_upper; } + kref_get(_iface->refcount); hard_iface->batman_adv_ptype.type = ethertype; hard_iface->batman_adv_ptype.func = batadv_batman_skb_recv; hard_iface->batman_adv_ptype.dev = hard_iface->net_dev; @@ -583,6 +584,7 @@ void batadv_hardif_disable_interface(struct batadv_hard_iface *hard_iface, batadv_info(hard_iface->soft_iface, "Removing interface: %s\n", hard_iface->net_dev->name); dev_remove_pack(_iface->batman_adv_ptype); + batadv_hardif_put(hard_iface); bat_priv->num_ifaces--; batadv_orig_hash_del_if(hard_iface, bat_priv->num_ifaces); -- 2.8.2
[PATCH 03/17] batman-adv: NETIF_F_NETNS_LOCAL feature to prevent netns moves
From: Andrew LunnThe batX soft interface should not be moved between network name spaces. This is similar to bridges, bonds, tunnels, which are not allowed to move between network namespaces. Suggested-by: Daniel Ehlers Signed-off-by: Andrew Lunn Acked-by: Antonio Quartulli Reviewed-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/soft-interface.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index d72f88707736..66dd0aac480a 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -972,7 +972,7 @@ static void batadv_softif_init_early(struct net_device *dev) dev->netdev_ops = _netdev_ops; dev->destructor = batadv_softif_free; - dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER; + dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_NETNS_LOCAL; dev->priv_flags |= IFF_NO_QUEUE; /* can't call min_mtu, because the needed variables -- 2.8.2
[PATCH 01/17] batman-adv: Remove unused parameter recv_if of batadv_interface_rx
From: Sven EckelmannSigned-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli --- net/batman-adv/routing.c| 5 ++--- net/batman-adv/soft-interface.c | 5 ++--- net/batman-adv/soft-interface.h | 4 ++-- 3 files changed, 6 insertions(+), 8 deletions(-) diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c index b781bf753250..2ecfca246be4 100644 --- a/net/batman-adv/routing.c +++ b/net/batman-adv/routing.c @@ -912,7 +912,7 @@ int batadv_recv_unicast_packet(struct sk_buff *skb, hdr_size)) goto rx_success; - batadv_interface_rx(recv_if->soft_iface, skb, recv_if, hdr_size, + batadv_interface_rx(recv_if->soft_iface, skb, hdr_size, orig_node); rx_success: @@ -1122,8 +1122,7 @@ int batadv_recv_bcast_packet(struct sk_buff *skb, goto rx_success; /* broadcast for me */ - batadv_interface_rx(recv_if->soft_iface, skb, recv_if, hdr_size, - orig_node); + batadv_interface_rx(recv_if->soft_iface, skb, hdr_size, orig_node); rx_success: ret = NET_RX_SUCCESS; diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index dfb4d56120b6..dc9a61a5122d 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -385,7 +385,6 @@ end: * batadv_interface_rx - receive ethernet frame on local batman-adv interface * @soft_iface: local interface which will receive the ethernet frame * @skb: ethernet frame for @soft_iface - * @recv_if: interface on which the batman-adv packet was received * @hdr_size: size of already parsed batman-adv header * @orig_node: originator from which the batman-adv packet was sent * @@ -400,8 +399,8 @@ end: * isolated clients. */ void batadv_interface_rx(struct net_device *soft_iface, -struct sk_buff *skb, struct batadv_hard_iface *recv_if, -int hdr_size, struct batadv_orig_node *orig_node) +struct sk_buff *skb, int hdr_size, +struct batadv_orig_node *orig_node) { struct batadv_bcast_packet *batadv_bcast_packet; struct batadv_priv *bat_priv = netdev_priv(soft_iface); diff --git a/net/batman-adv/soft-interface.h b/net/batman-adv/soft-interface.h index 9ae265703d23..5942da3d03d5 100644 --- a/net/batman-adv/soft-interface.h +++ b/net/batman-adv/soft-interface.h @@ -27,8 +27,8 @@ struct sk_buff; int batadv_skb_head_push(struct sk_buff *skb, unsigned int len); void batadv_interface_rx(struct net_device *soft_iface, -struct sk_buff *skb, struct batadv_hard_iface *recv_if, -int hdr_size, struct batadv_orig_node *orig_node); +struct sk_buff *skb, int hdr_size, +struct batadv_orig_node *orig_node); struct net_device *batadv_softif_create(const char *name); void batadv_softif_destroy_sysfs(struct net_device *soft_iface); int batadv_softif_is_valid(const struct net_device *net_dev); -- 2.8.2
pull request: batman-adv 20160511
Hi David, here you have a pull request intended for net-next. There are 17 patches in this batch, but most of them are cleanups and minor code re-arrangement. The more detailed description follows in the git tag. Please pull or let me know of any problem. Thanks a lot, Antonio The following changes since commit c047c3b1af6214b447e353527e394fa3f3e86397: netfilter: conntrack: remove uninitialized shadow variable (2016-05-10 01:04:04 -0400) are available in the git repository at: git://git.open-mesh.org/linux-merge.git tags/batman-adv-for-davem for you to fetch changes up to 676970e55b1033af7f0a03d4037b4d9b76327ded: batman-adv: use batadv_compare_eth when possible (2016-05-10 18:28:54 +0800) Included changes: - remove useless skb size check in batadv_interface_rx - basic netns support introduced by Andrew Lunn: - prevent virtual interface from changing netns by setting NETIF_F_NETNS_LOCAL - create virtual interface within the netns of the first hard-interface - introduce detection of complex bridge loops and report event to the user (via udev) when the Bridge Loop Avoidance mechanism can't prevent them - minor reference counting bugfixes for the hard_iface object that couldn't make it via the net tree - use kref_get() instead of kref_get_unless_zero() to make reference counting bug more visible - use batadv_compare_eth() all over the code when possible instead of plain memcmp() - minor code cleanup and style adjustments Andrew Lunn (2): batman-adv: NETIF_F_NETNS_LOCAL feature to prevent netns moves batman-adv: Create batman soft interfaces within correct netns. Antonio Quartulli (1): batman-adv: use batadv_compare_eth when possible Marek Lindner (1): batman-adv: replace ethertype variable with ETH_P_BATMAN for readability Simon Wunderlich (1): batman-adv: add detection for complex bridge loops Sven Eckelmann (12): batman-adv: Remove unused parameter recv_if of batadv_interface_rx batman-adv: Remove hdr_size skb size check in batadv_interface_rx batman-adv: Check hard_iface refcnt before calling function batman-adv: Check hard_iface refcnt when receiving skb batman-adv: Increase hard_iface refcnt for ptype batman-adv: Use kref_get for batadv_tvlv_container_get batman-adv: Use kref_get for batadv_nc_get_nc_node batman-adv: Use kref_get for batadv_gw_select batman-adv: Use kref_get for batadv_gw_node_add batman-adv: Use kref_get for hard_iface subfunctions batman-adv: Use kref_get for _batadv_update_route batman-adv: Use bool as return type for boolean functions net/batman-adv/bat_iv_ogm.c| 48 ++--- net/batman-adv/bat_v_ogm.c | 14 +- net/batman-adv/bitarray.c | 16 +- net/batman-adv/bitarray.h | 15 +- net/batman-adv/bridge_loop_avoidance.c | 314 - net/batman-adv/bridge_loop_avoidance.h | 43 ++--- net/batman-adv/debugfs.c | 2 +- net/batman-adv/distributed-arp-table.c | 6 +- net/batman-adv/gateway_client.c| 12 +- net/batman-adv/hard-interface.c| 34 ++-- net/batman-adv/hard-interface.h| 3 +- net/batman-adv/hash.h | 6 +- net/batman-adv/main.c | 18 +- net/batman-adv/main.h | 6 +- net/batman-adv/network-coding.c| 25 +-- net/batman-adv/originator.c| 39 ++-- net/batman-adv/originator.h| 2 +- net/batman-adv/packet.h| 1 + net/batman-adv/routing.c | 50 +++--- net/batman-adv/routing.h | 6 +- net/batman-adv/send.c | 6 + net/batman-adv/soft-interface.c| 32 ++-- net/batman-adv/soft-interface.h| 10 +- net/batman-adv/sysfs.c | 9 +- net/batman-adv/translation-table.c | 35 ++-- net/batman-adv/types.h | 8 + 26 files changed, 465 insertions(+), 295 deletions(-)
Re: [Drbd-dev] [PATCH net-next v3] block/drbd: align properly u64 in nl messages
From: Lars EllenbergDate: Tue, 10 May 2016 21:09:03 +0200 > On Tue, May 10, 2016 at 11:39:49AM -0400, David Miller wrote: >> From: Lars Ellenberg >> Date: Tue, 10 May 2016 11:40:23 +0200 > > excuse me for reordering the original: > >> Anyways, back to the topic, can you please just relent and come to >> some kind of agreement about the fix for this alignment bug? > > I thought we did? I'm fine with the "v3", > it even carries my signed-of-by. My bad, I missed that, I'll apply v3 thanks a lot!
[patch -mainline] qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()
If qlcnic_fw_cmd_get_minidump_temp() fails then "fw_dump->tmpl_hdr" is NULL or possibly freed. It can lead to an oops later. Fixes: d01a6d3c8ae1 ('qlcnic: Add support to enable capability to extend minidump for iSCSI') Signed-off-by: Dan Carpenterdiff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c index cda9e60..0844b7c 100644 --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c @@ -1417,6 +1417,7 @@ void qlcnic_83xx_get_minidump_template(struct qlcnic_adapter *adapter) struct qlcnic_fw_dump *fw_dump = >fw_dump; struct pci_dev *pdev = adapter->pdev; bool extended = false; + int ret; prev_version = adapter->fw_version; current_version = qlcnic_83xx_get_fw_version(adapter); @@ -1427,8 +1428,11 @@ void qlcnic_83xx_get_minidump_template(struct qlcnic_adapter *adapter) if (qlcnic_83xx_md_check_extended_dump_capability(adapter)) extended = !qlcnic_83xx_extend_md_capab(adapter); - if (!qlcnic_fw_cmd_get_minidump_temp(adapter)) - dev_info(>dev, "Supports FW dump capability\n"); + ret = qlcnic_fw_cmd_get_minidump_temp(adapter); + if (ret) + return; + + dev_info(>dev, "Supports FW dump capability\n"); /* Once we have minidump template with extended iSCSI dump * capability, update the minidump capture mask to 0x1f as
Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
On 05/10/2016 12:11 PM, Sergei Shtylyov wrote: > Hello. > > On 05/10/2016 09:32 PM, Florian Fainelli wrote: > >>> The PHY devices sometimes do have their reset signal (maybe even power >>> supply?) tied to some GPIO and sometimes it also does happen that a boot >>> loader does not leave it deasserted. So far this issue has been attacked >>> from (as I believe) a wrong angle: by teaching the MAC driver to >>> manipulate >>> the GPIO in question; that solution, when applied to the device >>> trees, led >>> to adding the PHY reset GPIO properties to the MAC device node, with one >>> exception: Cadence MACB driver which could handle the "reset-gpios" prop >>> in a PHY device subnode. I believe that the correct approach is to teach >>> the 'phylib' to get the MDIO device reset GPIO from the device tree node >>> corresponding to this device -- which this patch is doing... >>> >>> Note that I had to modify the AT803x PHY driver as it would stop >>> working >>> otherwise as it made use of the reset GPIO for its own purposes... >>> >>> Signed-off-by: Sergei Shtylyov>> >> This looks good to me: >> >> Acked-by: Florian Fainelli > >Thank you! I'll send v3 without [RFT] then. > >> Can you follow up with changes in phy_{suspend,resume} > >I'm not sure what changes you mean -- powering down the PHYs? Yes, powering down, conversely up the PHY. The whole point of putting this in PHYLIB is to be able to perform things like that. We do not need this right now, but it would be nice if we saw that materialize at some point. -- Florian
Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
Hello. On 05/10/2016 09:32 PM, Florian Fainelli wrote: The PHY devices sometimes do have their reset signal (maybe even power supply?) tied to some GPIO and sometimes it also does happen that a boot loader does not leave it deasserted. So far this issue has been attacked from (as I believe) a wrong angle: by teaching the MAC driver to manipulate the GPIO in question; that solution, when applied to the device trees, led to adding the PHY reset GPIO properties to the MAC device node, with one exception: Cadence MACB driver which could handle the "reset-gpios" prop in a PHY device subnode. I believe that the correct approach is to teach the 'phylib' to get the MDIO device reset GPIO from the device tree node corresponding to this device -- which this patch is doing... Note that I had to modify the AT803x PHY driver as it would stop working otherwise as it made use of the reset GPIO for its own purposes... Signed-off-by: Sergei ShtylyovThis looks good to me: Acked-by: Florian Fainelli Thank you! I'll send v3 without [RFT] then. Can you follow up with changes in phy_{suspend,resume} I'm not sure what changes you mean -- powering down the PHYs? if that is also an use case that you have? No, I'm not into power management. MBR, Sergei
Re: [PATCH net-next 01/14] qed: Add CONFIG_QED_SRIOV
From: Yuval MintzDate: Tue, 10 May 2016 18:15:08 + >> > I'm not entirely convinced this is true; If we'll not enforce the >> > alignment of this 64-bit field, it's possible there will be >> > differences between 32-bit and 64-bit machines versions of this struct. >> > You have to recall that this is going to be copied via DMA between PF >> > and VF, so they must have the exact same representation of the structure. >> >> Then use properly sized types to fill in all the space in the structure, >> that's how >> you guarantee layout, not aligned_u64. Also, do not use the packed >> attribute. >> >> struct foo { >> u32 x; >> u32 y; >> u64 z; >> }; >> >> 'z' will always be 64-bit aligned. > > Perhaps my bit-numeric is a bit weak - why is it so? foo is 64-bit aligned, therefore a properly aligned struct member will be so as well. We absolutely depend upon this for several data structures in the kernel.
Re: [Drbd-dev] [PATCH net-next v3] block/drbd: align properly u64 in nl messages
On Tue, May 10, 2016 at 11:39:49AM -0400, David Miller wrote: > From: Lars Ellenberg> Date: Tue, 10 May 2016 11:40:23 +0200 excuse me for reordering the original: > Anyways, back to the topic, can you please just relent and come to > some kind of agreement about the fix for this alignment bug? I thought we did? I'm fine with the "v3", it even carries my signed-of-by. Whether or not Nicholas wants to prefix those headers with drbd_, I don't really care. > This is taking a very long time and patches are just rotting in > patchwork with no resolution. Why would Nicholas asked how to go about DRBD, I suggested to use 0 as a padding attribute, and after taking a detour, he did. All good. Rest of original: > > If we introduce a new config option, > > we have to add it to the config scanner (one line), > > define min, max, default and scale (four short defines), > > and add it to the netlink definition here (one line). > > Done, rest of the code is generated, > > both on the kernel side, > > and on the drbd-utils side used to talk to the kernel. > > We found that to be very convenient. > > But it entirely misses the core design point of netlink. > > Sender and receive _DO NOT_ need to coordinate at all. That's the > whole point. So tightly coupling such coordination is going to run > you into all kinds of problems. > > When implemented properly, the sender can emit whatever attributes it > knows about and can generate, and the receive scans the attributes one > by one and picks out the ones it understands and processes them. > > If you go against this model > then you have no clean way to We don't. We extend (not violate) that model, so the sender *may* indicate to the recipient that for some particular attribute, the sender would rather have an "I don't understand this" return than a silent ignore. And that we can indicate in the definition of the attributes which ones are required to make a message meaningful. > extend things whilst allowing existing software to continue working. *that* is exactly why we use netlink, and why we do things with it the way we do. Actually I think what we are doing there is, comparatively, "elegant". You obviously don't have to agree. I could discuss this in more detail, but I assume you are not really interested, at least not here and now. Thanks, Lars