Re: [PATCH nf-next,v2] gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)

2016-05-10 Thread Cong Wang
On Sun, May 8, 2016 at 3:55 PM, Pablo Neira Ayuso  wrote:
> +static int gtp_genl_new_pdp(struct sk_buff *skb, struct genl_info *info)
> +{
...
> +
> +   net = gtp_genl_get_net(sock_net(skb->sk), info->attrs);
> +   if (IS_ERR(net))
> +   return PTR_ERR(net);
> +
> +   /* Check if there's an existing gtpX device to configure */
> +   dev = gtp_find_dev(net, nla_get_u32(info->attrs[GTPA_LINK]));
> +   if (dev == NULL)
> +   return -ENODEV;
> +
> +   return ipv4_pdp_add(dev, info);


Seems you are leaking struct net at least in the error path.


Re: [PATCH] cfg80211/nl80211: add wifi tx power mode switching support

2016-05-10 Thread Wei-Ning Huang
On Fri, May 6, 2016 at 4:19 PM, Wei-Ning Huang  wrote:
> On Fri, May 6, 2016 at 12:07 AM, Dan Williams  wrote:
>>
>> On Thu, 2016-05-05 at 14:44 +0800, Wei-Ning Huang wrote:
>> > Recent new hardware has the ability to switch between tablet mode and
>> > clamshell mode. To optimize WiFi performance, we want to be able to
>> > use
>> > different power table between modes. This patch adds a new netlink
>> > message type and cfg80211_ops function to allow userspace to trigger
>> > a
>> > power mode switch for a given wireless interface.
>> >
>> > Signed-off-by: Wei-Ning Huang 
>> > ---
>> >  include/net/cfg80211.h   | 11 +++
>> >  include/uapi/linux/nl80211.h | 21 +
>> >  net/wireless/nl80211.c   | 16 
>> >  net/wireless/rdev-ops.h  | 22 ++
>> >  net/wireless/trace.h | 20 
>> >  5 files changed, 90 insertions(+)
>> >
>> > diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
>> > index 9e1b24c..aa77fa0 100644
>> > --- a/include/net/cfg80211.h
>> > +++ b/include/net/cfg80211.h
>> > @@ -2370,6 +2370,12 @@ struct cfg80211_qos_map {
>> >   * @get_tx_power: store the current TX power into the dbm variable;
>> >   *   return 0 if successful
>> >   *
>> > + * @set_tx_power_mode: set the transmit power mode. Some device have
>> > the ability
>> > + *   to transform between different mode such as clamshell and
>> > tablet mode.
>> > + *   set_tx_power_mode allows setting of different TX power
>> > mode at runtime.
>> > + * @get_tx_power_mode: store the current TX power mode into the mode
>> > variable;
>> > + *   return 0 if successful
>> > + *
>> >   * @set_wds_peer: set the WDS peer for a WDS interface
>> >   *
>> >   * @rfkill_poll: polls the hw rfkill line, use cfg80211 reporting
>> > @@ -2631,6 +2637,11 @@ struct cfg80211_ops {
>> >   int (*get_tx_power)(struct wiphy *wiphy, struct
>> > wireless_dev *wdev,
>> >   int *dbm);
>> >
>> > + int (*set_tx_power_mode)(struct wiphy *wiphy,
>> > +  enum nl80211_tx_power_mode
>> > mode);
>> > + int (*get_tx_power_mode)(struct wiphy *wiphy,
>> > +  enum nl80211_tx_power_mode
>> > *mode);
>> > +
>> >   int (*set_wds_peer)(struct wiphy *wiphy, struct
>> > net_device *dev,
>> >   const u8 *addr);
>> >
>> > diff --git a/include/uapi/linux/nl80211.h
>> > b/include/uapi/linux/nl80211.h
>> > index 5a30a75..9b1888a 100644
>> > --- a/include/uapi/linux/nl80211.h
>> > +++ b/include/uapi/linux/nl80211.h
>> > @@ -1796,6 +1796,9 @@ enum nl80211_commands {
>> >   *   connecting to a PCP, and in %NL80211_CMD_START_AP to start
>> >   *   a PCP instead of AP. Relevant for DMG networks only.
>> >   *
>> > + * @NL80211_ATTR_WIPHY_TX_POWER_MODE: Transmit power mode. See
>> > + *   nl80211_tx_power_mode for possible values.
>> > + *
>> >   * @NUM_NL80211_ATTR: total number of nl80211_attrs available
>> >   * @NL80211_ATTR_MAX: highest attribute number currently defined
>> >   * @__NL80211_ATTR_AFTER_LAST: internal use
>> > @@ -2172,6 +2175,8 @@ enum nl80211_attrs {
>> >
>> >   NL80211_ATTR_PBSS,
>> >
>> > + NL80211_ATTR_WIPHY_TX_POWER_MODE,
>> > +
>> >   /* add attributes here, update the policy in nl80211.c */
>> >
>> >   __NL80211_ATTR_AFTER_LAST,
>> > @@ -3703,6 +3708,22 @@ enum nl80211_tx_power_setting {
>> >  };
>> >
>> >  /**
>> > + * enum nl80211_tx_power_mode - TX power mode setting
>> > + * @NL80211_TX_POWER_LOW: general low TX power mode
>> > + * @NL80211_TX_POWER_MEDIUM: general medium TX power mode
>> > + * @NL80211_TX_POWER_HIGH: general high TX power mode
>> > + * @NL80211_TX_POWER_CLAMSHELL: clamshell mode TX power mode
>> > + * @NL80211_TX_POWER_TABLET: tablet mode TX power mode
>> > + */
>> > +enum nl80211_tx_power_mode {
>> > + NL80211_TX_POWER_LOW,
>> > + NL80211_TX_POWER_MEDIUM,
>> > + NL80211_TX_POWER_HIGH,
>> > + NL80211_TX_POWER_CLAMSHELL,
>> > + NL80211_TX_POWER_TABLET,
>>
>
>> "clamshell" and "tablet" probably mean many different things to many
>> different people with respect to whether or not they should do anything
>> with power saving or wifi.  I feel like a more generic interface is
>> needed here.
> We could probably drop those two CLAMSHELL and TABLET constant or
> describing what they mean
> in more detail?
>
>>
>> Could this be already done by:
>> @NL80211_ATTR_WIPHY_TX_POWER_SETTING = NL80211_TX_POWER_LIMITED
>> @NL80211_ATTR_WIPHY_TX_POWER_LEVEL = 
>>
>> and then the device would be able to change its TX power as it saw fit
>> up to that limit set by your application which figures out whether it's
>> in clamshell or tablet mode?
>
> We usually want different power settings in different band/channel.
> For example, we can have three different power settings
> in 2.4Ghz, channels 36-64 & channels 100+ on 

Re: [Intel-wired-lan] [PATCH] e1000e: prevent division by zero if TIMINCA is zero

2016-05-10 Thread Mark D Rustad

Jarod Wilson  wrote:


On Fri, May 06, 2016 at 11:43:17PM +, Rustad, Mark D wrote:

Denys Vlasenko  wrote:


Users report that under VMWare, er32(TIMINCA) returns zero.
This causes division by zero at init time as follows:

==>incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
   for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
   /* latch SYSTIMH on read of SYSTIML */
   systim_next = (cycle_t)er32(SYSTIML);
   systim_next |= (cycle_t)er32(SYSTIMH) << 32;

   time_delta = systim_next - systim;
   temp = time_delta;
>  rem = do_div(temp, incvalue);

This change makes kernel survive this, and users report that
NIC does work after this change.

Since on real hardware incvalue is never zero, this should not affect
real hardware use case.

...
I seem to recall that this was rejected before because it really is  
VMWare's

bug and, if they fix it, any existing VMs that use this will just work.
Changing the driver will only fix it for vms that install a new driver. I
don't object to doing it, it just seems like not the most effective  
place to

address the issue.


You could also have people who never update VMWare, for whom a kernel
work-around would be better. I think it'd be best to address it both at
the driver level and the emulated hardware level, to improve things for
the most possible users. Those who update neither hypervisor or
kernel/driver, well, they reap what they sow.


That is a sound argument for doing both. I would expect that there are more  
frozen VM images than host environments, but I can certainly imagine that  
some choose to freeze their host. Of course if everything is frozen there  
is no point at all. :-)


I am on an extended vacation, and don't work on e1000e anyway, so I will  
quit my kibitzing here.


--
Mark Rustad, mrus...@gmail.com


signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [net PATCH v2 0/6] net sched: Fix broken late binding of actions

2016-05-10 Thread David Miller
From: Jamal Hadi Salim 
Date: Tue, 10 May 2016 16:49:25 -0400

> Some actions were broken in allowing for late binding of actions.
> Late binding workflow is as follows:
> a) create an action and provide all necessary parameters for it
> Optionally provide an index or let the kernel give you one.
> Example:
> sudo tc actions add action police rate 1kbit burst 90k drop index 1
> 
> b) later on bind to the pre-created action from a filter definition
> by merely specifying the index.
> Example:
> sudo tc filter add dev lo parent : protocol ip prio 8 \
> u32 match ip src 127.0.0.8/32 flowid 1:8 action police index 1

Series applied, thanks Jamal.


Re: [PATCH v3 net-next] tcp: replace cnt & rtt with struct in pkts_acked()

2016-05-10 Thread David Miller
From: David Miller 
Date: Tue, 10 May 2016 23:47:58 -0400 (EDT)

> From: Lawrence Brakmo 
> Date: Tue, 10 May 2016 13:11:09 -0700
> 
>> Replace 2 arguments (cnt and rtt) in the congestion control modules'
>> pkts_acked() function with a struct. This will allow adding more
>> information without having to modify existing congestion control
>> modules (tcp_nv in particular needs bytes in flight when packet
>> was sent).
>> 
>> As proposed by Neal Cardwell in his comments to the tcp_nv patch.
>> 
>> Signed-off-by: Lawrence Brakmo 
>> Acked-by: Yuchung Cheng 
> 
> Applied, thanks.

Actually I had to revert, this breaks the build:

net/ipv4/tcp_illinois.c: In function ‘tcp_illinois_acked’:
net/ipv4/tcp_illinois.c:97:18: error: assignment of member ‘rtt_us’ in 
read-only object
   sample->rtt_us = RTT_MAX;

You can't mark this argument const.


RE: [PATCH] net: phylib: fix interrupts re-enablement in phy_start

2016-05-10 Thread Shaohui Xie
> -Original Message-
> From: Florian Fainelli [mailto:f.faine...@gmail.com]
> Sent: Wednesday, May 11, 2016 2:25 AM
> To: shh@gmail.com; netdev@vger.kernel.org; da...@davemloft.net
> Cc: Shaohui Xie ; Andrew Lunn 
> Subject: Re: [PATCH] net: phylib: fix interrupts re-enablement in phy_start
> 
> On 05/10/2016 02:42 AM, shh@gmail.com wrote:
> > From: Shaohui Xie 
> >
> > If phy was suspended and is starting, current driver always enable
> > phy's interrupts, if phy works in polling, phy can raise unexpected
> > interrupt which will not be handled, the interrupt will block system
> > enter suspend again. So interrupts should only be re-enabled if phy
> > works in interrupt.
> 
> Your explanation makes sense. The commit message could you use some
> improvements like s/phy/PHY/ and "works in interrupt mode", but that is not a
> huge thing.
[S.H] Thank you for the comment.

I should make the commit message more precise and will make improvement in
future patch submission.

Regards,
Shaohui


Re: [PATCH v3 net-next] tcp: replace cnt & rtt with struct in pkts_acked()

2016-05-10 Thread David Miller
From: Lawrence Brakmo 
Date: Tue, 10 May 2016 13:11:09 -0700

> Replace 2 arguments (cnt and rtt) in the congestion control modules'
> pkts_acked() function with a struct. This will allow adding more
> information without having to modify existing congestion control
> modules (tcp_nv in particular needs bytes in flight when packet
> was sent).
> 
> As proposed by Neal Cardwell in his comments to the tcp_nv patch.
> 
> Signed-off-by: Lawrence Brakmo 
> Acked-by: Yuchung Cheng 

Applied, thanks.


Re: pull request: batman-adv 20160511

2016-05-10 Thread David Miller
From: Antonio Quartulli 
Date: Wed, 11 May 2016 03:29:48 +0800

> here you have a pull request intended for net-next.
> There are 17 patches in this batch, but most of them are cleanups
> and minor code re-arrangement.
> 
> The more detailed description follows in the git tag.
> 
> Please pull or let me know of any problem.

Pulled, thanks Antonio.


Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-10 Thread Simon Horman
Hi Jiri,

On Tue, May 10, 2016 at 02:06:18PM +0200, Jiri Benc wrote:
> On Mon, 9 May 2016 17:18:20 +0900, Simon Horman wrote:
> > On Fri, May 06, 2016 at 11:35:04AM +0200, Jiri Benc wrote:
> > > In addition, we should check whether mac_len > 0 and in such case,
> > > change skb->protocol to ETH_P_TEB first (and store that value in the
> > > pushed eth header).
> > > 
> > > Similarly on pop_eth, we need to check skb->protocol and if it is
> > > ETH_P_TEB, call eth_type_trans on the modified frame to set the new
> > > skb->protocol correctly. It's probably not that simple, as we'd need a
> > > version of eth_type_trans that doesn't need a net device.
> > 
> > I'm not sure I understand the interaction with ETH_P_TEB here.
> > 
> > In my mind skb->protocol == ETH_P_TEB may be used early on in OvS's receive
> > processing to find the inner protocol from the packet and at that point
> > skb->protocol is set to that value. And that for further packet processing
> > the fact the packet was received as TEB is transparent.
> 
> Yes but think about the case when you have two Ethernet headers pushed.
> 
> We can either disallow it, or do what I described.
> 
> Specifically, let's assume we have an IP packet with an Ethernet
> header present. skb->protocol is ETH_P_IP. Now, when there's skb_push,
> the correct operation would be setting skb->protocol to ETH_P_TEB,
> pushing a new Ethernet header and filing ETH_P_TEB to the ethertype
> field in the new header. The packet is not ETH_P_IP anymore, as the L2
> header is followed by another Ethernet header now.

Thanks for the clarification, I had not considered the case of two
ethernet headers when I wrote my previous email.

I think that at this stage I would prefer to prohibit push_eth() acting
on a packet which already has an ethernet header. Indeed that is what
my patch-set already does in its modifications of __ovs_nla_copy_actions().

The reason that I lean towards prohibiting this is that I do not
have an easy way to exercise this case within the current patch-set.
And thus this extra complexity seems well suited to being handled handled
incrementally as further work.


Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

2016-05-10 Thread Scott Wood
On Thu, 2016-05-05 at 13:10 +0200, Arnd Bergmann wrote:
> On Thursday 05 May 2016 09:41:32 Yangbo Lu wrote:
> > > -Original Message-
> > > From: Arnd Bergmann [mailto:a...@arndb.de]
> > > Sent: Thursday, May 05, 2016 4:32 PM
> > > To: linuxppc-...@lists.ozlabs.org
> > > Cc: Yangbo Lu; linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > > linux-...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux-
> > > foundation.org; netdev@vger.kernel.org; Mark Rutland;
> > > ulf.hans...@linaro.org; Russell King; Bhupesh Sharma; Joerg Roedel;
> > > Santosh Shilimkar; Yang-Leo Li; Scott Wood; Rob Herring; Claudiu Manoil;
> > > Kumar Gala; Xiaobo Xie; Qiang Zhao
> > > Subject: Re: [v10, 7/7] mmc: sdhci-of-esdhc: fix host version for T4240-
> > > R1.0-R2.0
> > > 
> > > On Thursday 05 May 2016 11:12:30 Yangbo Lu wrote:
> > > > IIRC, it is the same IP block as i.MX and Arnd's point is this won't
> > > > even compile on !PPC. It is things like this that prevent sharing the
> > > > driver.
> > 
> > The whole point of using the MMIO SVR instead of the PPC SPR is so that
> > it will work on ARM...  The guts driver should build on any platform as
> > long as OF is enabled, and if it doesn't find a node to bind to it will
> > return 0 for SVR, and the eSDHC driver will continue (after printing an
> > error that should be removed) without the ability to test for errata
> > based on SVR.
> 
> It feels like a bad design to have to come up with a different
> method for each SoC type here when they all do the same thing
> and want to identify some variant of the chip to do device
> specific quirks.
> 
> As far as I'm concerned, every driver in drivers/soc that needs to
> export a symbol to be used by a device driver is an indication that
> we don't have the right set of abstractions yet. There are cases
> that are not worth abstracting because the functionality is rather
> obscure and only a couple of drivers for one particular chip
> ever need it.
> 
> Finding out the version of the SoC does not look like this case.

I'm open to new ways of abstracting this, but can that please be discussed
after these patches are merged?  This patchset is fixing a problem, the
existing abstraction is unappealing and not widely adopted, a new abstraction
is not ready, and we're only touching code for our hardware.

Oh, and the existing abstraction isn't even "existing".  I don't see any
examples where soc_device is being used like this -- or even any way for a
driver (the one consuming the information, not the soc "driver") to get a
reference to the soc_device that's been registered short of searching for the
device object by name -- and you're asking for new functionality in
drivers/base/soc.c.

> > > I think the first four patches take care of building for ARM,
> > > but the problem remains if you want to enable COMPILE_TEST as
> > > we need for certain automated checking.
> > 
> > What specific problem is there with COMPILE_TEST?
> 
> COMPILE_TEST is solvable here and the way it is implemented in this
> case (selecting FSL_GUTS from the driver) indeed looks like it works
> correctly, but it's still awkward that this means building the
> SoC specific ID stuff into the vmlinux binary for any driver that
> uses something like that for a particular SoC.

Please keep in mind that this is a Freescale-specific driver... it's not as if
we're attaching this dependency to common SDHCI code.

> 
> > > > Dealing with Si revs is a common problem. We should have a
> > > > common solution. There is soc_device for this purpose.
> > > 
> > > Exactly. The last time this came up, I think we agreed to implement a
> > > helper using glob_match() on the soc_device strings. Unfortunately
> > > this hasn't happened then, but I'd still prefer that over yet another
> > > vendor-specific way of dealing with the generic issue.
> > 
> > soc_device would require encoding the SVR as a string and then decoding
> > the string, which is more complicated and error prone than having
> > platform-specific code test a platform-specific number. 
> 
> You already need to encode it as a string to register the soc_device,

No we don't, because we don't already register a soc_device on arm64 or ppc
(and it looks like whatever does get registered on at least some relevant
arm32 chips is not particularly useful).

> and the driver just needs to pass a glob string, so the only part that
> is missing is the generic function that takes the string from the
> driver and passes that to glob_match for the soc_device.

"just"

And what would the glob look like?

I'd rather not write kernel code as if it were a shell/Perl script.

> > And when would it get registered on arm64, which doesn't have
> > platform code?
> 
> Whenever the soc driver is loaded, as is the case now. The match
> function can return -EPROBE_DEFER if no SoC device is registered
> yet.

That's too late for some places where we need 

Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-10 Thread Simon Horman
Hi Jiri,

On Wed, May 11, 2016 at 10:50:09AM +0900, Simon Horman wrote:

[...]

> > > Its possible that I've overlooked something but as things stand I think
> > > things look like this:
> > > 
> > > * ovs_flow_key_extract() keys off dev->type and skb->protocol.
> > > * ovs_flow_key_extract() calls key_extract() which amongst other things
> > >   sets up the skb->mac_header and skb->mac_len of the skb.
> > > * ovs_flow_key_extract() sets skb->protocol to that of the inner packet
> > >   in the case of TEB
> > > * Actions update the above mentioned skb fields as appropriate.
> > 
> > Okay, that actually eases things somewhat.
> > 
> > > So it seems to me that it should be safe to rely on skb->protocol
> > > in the receive path. Or more specifically, in netdev_port_receive().
> > > 
> > > If mac_len is also able to be used then I think fine. But it seems to me
> > > that it needs to be set up by OvS at least for the ARPHRD_NONE case. This
> > > could be done early on, say in netdev_port_receive(). But it seems that
> > > would involve duplicating some of what is already occurring in
> > > key_extract().
> > 
> > I'd actually prefer doing this earlier, netdev_port_receive looks like
> > the right place. Just set mac_len there (or whatever) and let
> > key_extract do the rest of the work, not depending on dev->type in
> > there.
> > 
> > My point about recirculation was not actually valid, as I missed you're
> > doing this in ovs_flow_key_extract and not in key_extract. Still,
> > I think the special handling of particular interface types belongs to
> > the tx processing on those interfaces, not to the common code.
> 
> Sure, if that is your preference I think it should be simple enough to
> implement. I agree that netdev_port_receive() looks like a good place for
> this.

So far I have the following, which seems to work
changes to make dev->type ARPHRD_NONE for compat GRE vports.

Is this close to what you had in mind?

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index d320c2657627..4d2698596033 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -700,8 +700,6 @@ int ovs_flow_key_update(struct sk_buff *skb, struct 
sw_flow_key *key)
 int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 struct sk_buff *skb, struct sw_flow_key *key)
 {
-   bool is_layer3 = false;
-   bool is_teb = false;
int err;
 
/* Extract metadata from packet. */
@@ -709,14 +707,6 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
*tun_info,
key->tun_proto = ip_tunnel_info_af(tun_info);
memcpy(>tun_key, _info->key, sizeof(key->tun_key));
 
-   if (OVS_CB(skb)->input_vport->dev->type != ARPHRD_ETHER) {
-   if (skb->protocol == htons(ETH_P_TEB))
-   is_teb = true;
-   else
-   is_layer3 = true;
-   }
-
-
if (tun_info->options_len) {
BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) *
   8)) - 1
@@ -739,17 +729,17 @@ int ovs_flow_key_extract(const struct ip_tunnel_info 
*tun_info,
key->phy.skb_mark = skb->mark;
ovs_ct_fill_key(skb, key);
key->ovs_flow_hash = 0;
-   key->phy.is_layer3 = is_layer3;
+   key->phy.is_layer3 = (tun_info && skb->mac_len == 0);
key->recirc_id = 0;
 
err = key_extract(skb, key);
if (err < 0)
return err;
 
-   if (is_teb)
-   skb->protocol = key->eth.type;
-   else if (is_layer3)
+   if (key->phy.is_layer3)
key->eth.type = skb->protocol;
+   else if (tun_info && skb->protocol == htons(ETH_P_TEB))
+   skb->protocol = key->eth.type;
 
return err;
 }
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 7d54414b35eb..ac8178ac2c81 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -60,7 +60,21 @@ static void netdev_port_receive(struct sk_buff *skb)
if (vport->dev->type == ARPHRD_ETHER) {
skb_push(skb, ETH_HLEN);
skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
+   } else if (vport->dev->type == ARPHRD_NONE) {
+   if (skb->protocol == htons(ETH_P_TEB)) {
+   struct ethhdr *eth = eth_hdr(skb);
+
+   if (unlikely(skb->len < ETH_HLEN))
+   goto error;
+
+   skb->mac_len = ETH_HLEN;
+   if (eth->h_proto == htons(ETH_P_8021Q))
+   skb->mac_len += VLAN_HLEN;
+   } else {
+   skb->mac_len = 0;
+   }
}
+
ovs_vport_receive(vport, skb, skb_tunnel_info(skb));
return;
 error:


Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver

2016-05-10 Thread Timur Tabi

Florian Fainelli wrote:

The Ethernet MAC should be started in ndo_open() and stopped in
ndo_close(), in between, there are link state changes, but you are not
supposed to stop or start your Ethernet MAC and its DMA for instance
during link change, if that is a HW requirement, your HW is pretty funky.


I think the problem is that the current driver seems to be too eager to 
start/stop the MAC.


Please take a look at emac_work_thread_link_check() at 
https://lkml.org/lkml/2016/4/13/670.  Every time the PHY link goes up, 
it does this:


if (phy->link_up) {
if (netif_carrier_ok(netdev))
goto link_task_done;

pm_runtime_get_sync(netdev->dev.parent);
netif_info(adpt, timer, adpt->netdev, "NIC Link is Up %s\n",
   speed);

emac_mac_start(adpt);
netif_carrier_on(netdev);
netif_wake_queue(netdev);


The call to emac_mac_start seems wrong to me here.

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.


Re: [PATCH v9 net-next 4/7] openvswitch: add layer 3 flow/port support

2016-05-10 Thread Simon Horman
Hi Jiri,

On Tue, May 10, 2016 at 02:01:06PM +0200, Jiri Benc wrote:
> On Mon, 9 May 2016 17:04:22 +0900, Simon Horman wrote:
> > It seems to be caused by the following:
> > 
> > 1. __ipgre_rcv() calls skb_pop_mac_header() which
> >sets skb->mac_header to the skb->network_header.
> > 
> > 2. __ipgre_rcv() then calls ip_tunnel_rcv() which calls
> >skb_reset_network_header(). This updates skb->network_header to
> >just after the end of the GRE header.
> > 
> >This is 28 bytes after the old skb->network_header
> >as there is a 20 byte IPv4 header followed by an
> >8 byte GRE header.
> > 
> > 3. Later, dev_gro_receive() calls skb_reset_mac_len().
> >This calculates skb->mac_len based on skb->network_header and
> >skb->mac_header. I.e. 28 bytes.
> 
> Right. Thanks for tracking this down!
> 
> > I think this may be possible to address by calling
> > skb_reset_network_header() instead of skb_pop_mac_header()
> > in __ipgre_rcv().
> 
> We can't do that. The interface type is ARPHRD_IPGRE and not
> ARPHRD_NONE, so the current behavior makes pretty good sense. See
> e.g. commit 0e3da5bb8da45.
> 
> We have two options here:
> 
> 1. As for metadata tunnels all the info is in metadata_dst and we
>don't need the IP/GRE header for anything, we can make the ipgre
>interface ARPHRD_NONE in metadata based mode.
> 
> 2. We can fix this up in ovs after receiving the packet from
>ARPHRD_IPGRE interface.
> 
> I think the first option is the correct one. We already don't assign
> dev->header_ops in metadata mode. I'll prepare a patch.

I agree that 1. seems to be the better approach.

> > Its possible that I've overlooked something but as things stand I think
> > things look like this:
> > 
> > * ovs_flow_key_extract() keys off dev->type and skb->protocol.
> > * ovs_flow_key_extract() calls key_extract() which amongst other things
> >   sets up the skb->mac_header and skb->mac_len of the skb.
> > * ovs_flow_key_extract() sets skb->protocol to that of the inner packet
> >   in the case of TEB
> > * Actions update the above mentioned skb fields as appropriate.
> 
> Okay, that actually eases things somewhat.
> 
> > So it seems to me that it should be safe to rely on skb->protocol
> > in the receive path. Or more specifically, in netdev_port_receive().
> > 
> > If mac_len is also able to be used then I think fine. But it seems to me
> > that it needs to be set up by OvS at least for the ARPHRD_NONE case. This
> > could be done early on, say in netdev_port_receive(). But it seems that
> > would involve duplicating some of what is already occurring in
> > key_extract().
> 
> I'd actually prefer doing this earlier, netdev_port_receive looks like
> the right place. Just set mac_len there (or whatever) and let
> key_extract do the rest of the work, not depending on dev->type in
> there.
> 
> My point about recirculation was not actually valid, as I missed you're
> doing this in ovs_flow_key_extract and not in key_extract. Still,
> I think the special handling of particular interface types belongs to
> the tx processing on those interfaces, not to the common code.

Sure, if that is your preference I think it should be simple enough to
implement. I agree that netdev_port_receive() looks like a good place for
this.


Re: [PATCH net] openvswitch: Fix cached ct with helper.

2016-05-10 Thread Joe Stringer
On 10 May 2016 at 16:55, Jarno Rajahalme  wrote:
> This would result in inconsistent helper assignment if a first CT action 
> assigns a helper and a further CT action tries to assign a different helper; 
> Typically the second helper assignment would be ignored, but if the 
> unconfirmed conntrack entry is lost due to an upcall the second helper 
> assignment would be successful.  This is best resolved by allowing helper 
> assignment by a committing CT action only by testing the 'info->commit' flag 
> in addition to the conditions you have there already. It may also be helpful 
> to fail helper assignment without the commit flag in parse_ct().

Strictly speaking I think that skb_nfct_cached() handles at least some
of those cases but I agree it should be more explicit here. I'll send
a v2. We can follow up separately on net-next to improve the
parsing/make that more user-friendly.


linux-next: manual merge of the net-next tree with the net tree

2016-05-10 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/netfilter/nf_conntrack_core.c

between commit:

  70d72b7e060e ("netfilter: conntrack: init all_locks to avoid debug warning")

from the net tree and commit:

  a3efd81205b1 ("netfilter: conntrack: move generation seqcnt out of netns_ct")
  56d52d4892d0 ("netfilter: conntrack: use a single hashtable for all 
namespaces")
  0c5366b3a8c7 ("netfilter: conntrack: use single slab cache")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/netfilter/nf_conntrack_core.c
index 895d11dced3c,566c64e3ec50..
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@@ -66,7 -69,12 +69,12 @@@ EXPORT_SYMBOL_GPL(nf_conntrack_locks)
  __cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock);
  EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
  
+ struct hlist_nulls_head *nf_conntrack_hash __read_mostly;
+ EXPORT_SYMBOL_GPL(nf_conntrack_hash);
+ 
+ static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 -static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 +static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
+ static __read_mostly seqcount_t nf_conntrack_generation;
  static __read_mostly bool nf_conntrack_locks_all;
  
  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)


Re: [PATCH net] openvswitch: Fix cached ct with helper.

2016-05-10 Thread Jarno Rajahalme
This would result in inconsistent helper assignment if a first CT action 
assigns a helper and a further CT action tries to assign a different helper; 
Typically the second helper assignment would be ignored, but if the unconfirmed 
conntrack entry is lost due to an upcall the second helper assignment would be 
successful.  This is best resolved by allowing helper assignment by a 
committing CT action only by testing the 'info->commit' flag in addition to the 
conditions you have there already. It may also be helpful to fail helper 
assignment without the commit flag in parse_ct().

  Jarno

> On May 10, 2016, at 11:40 AM, Joe Stringer  wrote:
> 
> When using conntrack helpers from OVS, a common configuration is to
> perform a lookup without specifying a helper, then go through a
> firewalling policy, only to decide to attach a helper afterwards.
> 
> In this case, the initial lookup will cause a ct entry to be attached to
> the skb, then the later commit with helper should attach the helper and
> confirm the connection. However, the helper attachment has been missing.
> If the user has enabled automatic helper attachment, then this issue
> will be masked as it will be applied in init_conntrack(). It is also
> masked if the action is executed from ovs_packet_cmd_execute() as that
> will construct a fresh skb.
> 
> This patch fixes the issue by making an explicit call to try to assign
> the helper if there is a discrepancy between the action's helper and the
> current skb->nfct.
> 
> Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action")
> Signed-off-by: Joe Stringer 
> ---
> net/openvswitch/conntrack.c | 12 
> 1 file changed, 12 insertions(+)
> 
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index b5fea1101faa..89f61a1720eb 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -776,6 +776,18 @@ static int __ovs_ct_lookup(struct net *net, struct 
> sw_flow_key *key,
>   return -EINVAL;
>   }
> 
> + /* Userspace may decide to perform a ct lookup without a helper
> +  * specified followed by a (recirculate and) commit with one.
> +  * Therefore, for unconfirmed connections we need to attach the
> +  * helper here.
> +  */
> + if (!nf_ct_is_confirmed(ct) && info->helper && !nfct_help(ct)) {
> + int err = __nf_ct_try_assign_helper(ct, info->ct,
> + GFP_ATOMIC);
> + if (err)
> + return err;
> + }
> +
>   /* Call the helper only if:
>* - nf_conntrack_in() was executed above ("!cached") for a
>*   confirmed connection, or
> -- 
> 2.1.4
> 



Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver

2016-05-10 Thread Florian Fainelli
On 05/10/2016 04:18 PM, Timur Tabi wrote:
> Florian Fainelli wrote:
>> Are you utilizing the PHYLIB APIs properly? You need at least a
>> phy_start() to start the PHY state machine, and an adjust_link callback
>> to be provided to phy_connect() (or of_phy_connect()) to manage link
>> state changes. And that's the very basic minimum here, there could be
>> additional APIs that you may end up using.
>>
>> There are tons of example in tree of drivers doing this, bcmgenet,
>> bcmsysport, tg3 etc.
> 
> Thank you.  I think I finally got phylib working, more or less.
> 
> Unfortunately, it seems I have some kind of race condition.  The driver
> has a lot that's wrong with it, and I'm trying to fix it all.  One crazy
> the driver does is it create a workqueue to handle a lot of the tasks
> that would normally be handled in the interrupt handler itself.

That sounds like a typicall top half/bottom half split, fair enough.

> 
> With phylib support, I know my driver can call phy_mac_interrupt() when
> it gets a link status change interrupt.  I then have an .adjust_link
> callback which starts or stops the mac accordingly.

The Ethernet MAC should be started in ndo_open() and stopped in
ndo_close(), in between, there are link state changes, but you are not
supposed to stop or start your Ethernet MAC and its DMA for instance
during link change, if that is a HW requirement, your HW is pretty funky.

> 
> My problem is that I'm not really sure what adjust_link is supposed to
> be doing.

Well, it's pretty simple, it is about re-configuring your Ethernet MAC
based on what the PHY link state mandates: duplex, pause, speed changes,
EEE etc is what this callback is supposed to take care of, at the
Ethernet MAC level.

>  In addition, it seems that I need to keep the workqueue
> running, otherwise the interface will not function.  I bring the
> interface up, and the driver reports success, but pings do not work.
> 
> I'm getting really frustrated.  The sample code isn't really helping a
> whole lot, because I lack a fundamental understanding of what needs to
> be done.  None of the documentation I've read is helpful, and I don't
> know how to debug it.

Seriously, no documentation is helpful? The PHY library seems pretty
well documented to me, but I suppose I have a bias, oh, and patches are
welcome of course.

> 
> Can you give me some advice on how to debug this?

Take a look at drivers/net/ethernet/broadcom/genet/bcmgenet.c and see
how it deals with managing link state changes for instance. The code is
pretty straight forward: link interrupt (and other causes) trigger a
workqueue schedule, which then processes link state changes and calls
phy_mac_interrupt(), which in turn makes the PHY library adjust the
interface carrier state.
-- 
Florian


Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver

2016-05-10 Thread Timur Tabi

Florian Fainelli wrote:

Are you utilizing the PHYLIB APIs properly? You need at least a
phy_start() to start the PHY state machine, and an adjust_link callback
to be provided to phy_connect() (or of_phy_connect()) to manage link
state changes. And that's the very basic minimum here, there could be
additional APIs that you may end up using.

There are tons of example in tree of drivers doing this, bcmgenet,
bcmsysport, tg3 etc.


Thank you.  I think I finally got phylib working, more or less.

Unfortunately, it seems I have some kind of race condition.  The driver 
has a lot that's wrong with it, and I'm trying to fix it all.  One crazy 
the driver does is it create a workqueue to handle a lot of the tasks 
that would normally be handled in the interrupt handler itself.


With phylib support, I know my driver can call phy_mac_interrupt() when 
it gets a link status change interrupt.  I then have an .adjust_link 
callback which starts or stops the mac accordingly.


My problem is that I'm not really sure what adjust_link is supposed to 
be doing.  In addition, it seems that I need to keep the workqueue 
running, otherwise the interface will not function.  I bring the 
interface up, and the driver reports success, but pings do not work.


I'm getting really frustrated.  The sample code isn't really helping a 
whole lot, because I lack a fundamental understanding of what needs to 
be done.  None of the documentation I've read is helpful, and I don't 
know how to debug it.


Can you give me some advice on how to debug this?

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.


[PATCH net 0/2] bnxt_en: Add workaround to detect bad opaque in rx completion.

2016-05-10 Thread Michael Chan
2-part workaround for this hardware bug.

Michael Chan (2):
  bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)
  bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 63 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  2 +
 2 files changed, 65 insertions(+)

-- 
1.8.3.1



[PATCH net 1/2] bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)

2016-05-10 Thread Michael Chan
There is a rare hardware bug that can cause a bad opaque value in the RX
or TPA completion.  When this happens, the hardware may have used the
same buffer twice for 2 rx packets.  In addition, the driver will also
crash later using the bad opaque as the index into the ring.

The rx opaque value is predictable and is always monotonically increasing.
The workaround is to keep track of the expected next opaque value and
compare it with the one returned by hardware during RX and TPA start
completions.  If they miscompare, we will not process any more RX and
TPA completions and exit NAPI.  We will then schedule a workqueue to
reset the function.

This patch adds the logic to keep track of the next rx consumer index.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9d4e8e1..58999cd 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -867,6 +867,7 @@ static void bnxt_tpa_start(struct bnxt *bp, struct 
bnxt_rx_ring_info *rxr,
 
rxr->rx_prod = NEXT_RX(prod);
cons = NEXT_RX(cons);
+   rxr->rx_next_cons = NEXT_RX(cons);
cons_rx_buf = >rx_buf_ring[cons];
 
bnxt_reuse_rx_data(rxr, cons, cons_rx_buf->data);
@@ -1245,6 +1246,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_napi 
*bnapi, u32 *raw_cons,
 
 next_rx:
rxr->rx_prod = NEXT_RX(prod);
+   rxr->rx_next_cons = NEXT_RX(cons);
 
 next_rx_no_prod:
*raw_cons = tmp_raw_cons;
@@ -2486,6 +2488,7 @@ static void bnxt_clear_ring_indices(struct bnxt *bp)
rxr->rx_prod = 0;
rxr->rx_agg_prod = 0;
rxr->rx_sw_agg_prod = 0;
+   rxr->rx_next_cons = 0;
}
}
 }
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 8b823ff..52f2d74 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -584,6 +584,7 @@ struct bnxt_rx_ring_info {
u16 rx_prod;
u16 rx_agg_prod;
u16 rx_sw_agg_prod;
+   u16 rx_next_cons;
void __iomem*rx_doorbell;
void __iomem*rx_agg_doorbell;
 
-- 
1.8.3.1



[PATCH net 2/2] bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)

2016-05-10 Thread Michael Chan
Add detection and recovery code when the hardware returned opaque value
does not match the expected consumer index.  Once the issue is detected,
we skip the processing of all RX and LRO/GRO packets.  These completion
entries are discarded without sending the SKB to the stack and without
producing new buffers.  The function will be reset from a workqueue.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 60 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 +
 2 files changed, 61 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 58999cd..c39a7f5 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -813,6 +813,46 @@ static inline struct sk_buff *bnxt_copy_skb(struct 
bnxt_napi *bnapi, u8 *data,
return skb;
 }
 
+static int bnxt_discard_rx(struct bnxt *bp, struct bnxt_napi *bnapi,
+  u32 *raw_cons, void *cmp)
+{
+   struct bnxt_cp_ring_info *cpr = >cp_ring;
+   struct rx_cmp *rxcmp = cmp;
+   u32 tmp_raw_cons = *raw_cons;
+   u8 cmp_type, agg_bufs = 0;
+
+   cmp_type = RX_CMP_TYPE(rxcmp);
+
+   if (cmp_type == CMP_TYPE_RX_L2_CMP) {
+   agg_bufs = (le32_to_cpu(rxcmp->rx_cmp_misc_v1) &
+   RX_CMP_AGG_BUFS) >>
+  RX_CMP_AGG_BUFS_SHIFT;
+   } else if (cmp_type == CMP_TYPE_RX_L2_TPA_END_CMP) {
+   struct rx_tpa_end_cmp *tpa_end = cmp;
+
+   agg_bufs = (le32_to_cpu(tpa_end->rx_tpa_end_cmp_misc_v1) &
+   RX_TPA_END_CMP_AGG_BUFS) >>
+  RX_TPA_END_CMP_AGG_BUFS_SHIFT;
+   }
+
+   if (agg_bufs) {
+   if (!bnxt_agg_bufs_valid(bp, cpr, agg_bufs, _raw_cons))
+   return -EBUSY;
+   }
+   *raw_cons = tmp_raw_cons;
+   return 0;
+}
+
+static void bnxt_sched_reset(struct bnxt *bp, struct bnxt_rx_ring_info *rxr)
+{
+   if (!rxr->bnapi->in_reset) {
+   rxr->bnapi->in_reset = true;
+   set_bit(BNXT_RESET_TASK_SP_EVENT, >sp_event);
+   schedule_work(>sp_task);
+   }
+   rxr->rx_next_cons = 0x;
+}
+
 static void bnxt_tpa_start(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
   struct rx_tpa_start_cmp *tpa_start,
   struct rx_tpa_start_cmp_ext *tpa_start1)
@@ -830,6 +870,11 @@ static void bnxt_tpa_start(struct bnxt *bp, struct 
bnxt_rx_ring_info *rxr,
prod_rx_buf = >rx_buf_ring[prod];
tpa_info = >rx_tpa[agg_id];
 
+   if (unlikely(cons != rxr->rx_next_cons)) {
+   bnxt_sched_reset(bp, rxr);
+   return;
+   }
+
prod_rx_buf->data = tpa_info->data;
 
mapping = tpa_info->mapping;
@@ -981,6 +1026,14 @@ static inline struct sk_buff *bnxt_tpa_end(struct bnxt 
*bp,
dma_addr_t mapping;
struct sk_buff *skb;
 
+   if (unlikely(bnapi->in_reset)) {
+   int rc = bnxt_discard_rx(bp, bnapi, raw_cons, tpa_end);
+
+   if (rc < 0)
+   return ERR_PTR(-EBUSY);
+   return NULL;
+   }
+
tpa_info = >rx_tpa[agg_id];
data = tpa_info->data;
prefetch(data);
@@ -1147,6 +1200,12 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_napi 
*bnapi, u32 *raw_cons,
cons = rxcmp->rx_cmp_opaque;
rx_buf = >rx_buf_ring[cons];
data = rx_buf->data;
+   if (unlikely(cons != rxr->rx_next_cons)) {
+   int rc1 = bnxt_discard_rx(bp, bnapi, raw_cons, rxcmp);
+
+   bnxt_sched_reset(bp, rxr);
+   return rc1;
+   }
prefetch(data);
 
agg_bufs = (le32_to_cpu(rxcmp->rx_cmp_misc_v1) & RX_CMP_AGG_BUFS) >>
@@ -4465,6 +4524,7 @@ static void bnxt_enable_napi(struct bnxt *bp)
int i;
 
for (i = 0; i < bp->cp_nr_rings; i++) {
+   bp->bnapi[i]->in_reset = false;
bnxt_enable_poll(bp->bnapi[i]);
napi_enable(>bnapi[i]->napi);
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 52f2d74..de9d53e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -637,6 +637,7 @@ struct bnxt_napi {
 #ifdef CONFIG_NET_RX_BUSY_POLL
atomic_tpoll_state;
 #endif
+   boolin_reset;
 };
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
-- 
1.8.3.1



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Eric Dumazet
On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:

> Not only did we want to present this solely as a bugfix but also as as
> performance enhancements in case of virtio (as you can see in the cover
> letter). Given that a long time ago there was a tendency to remove
> softirqs completely, we thought it might be very interesting, that a
> threaded napi in general seems to be absolutely viable nowadays and
> might offer new features.

Well, you did not fix the bug, you worked around by adding yet another
layer, with another sysctl that admins or programs have to manage.

If you have a special need for virtio, do not hide it behind a 'bug fix'
but add it as a features request.

This ksoftirqd issue is real and a fix looks very reasonable.

Please try this patch, as I had very good success with it.

Thanks.


diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..22463217e3cf 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] 
__cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
/* Interrupts are disabled: no need to stop preemption */
struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-   if (tsk && tsk->state != TASK_RUNNING)
+   if (tsk && tsk->state != TASK_RUNNING) {
+   __this_cpu_write(ksoftirqd_scheduled, true);
wake_up_process(tsk);
+   }
 }
 
 /*
@@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int 
cnt)
 */
preempt_count_sub(cnt - 1);
 
-   if (unlikely(!in_interrupt() && local_softirq_pending())) {
+   if (unlikely(!in_interrupt() &&
+local_softirq_pending() &&
+!__this_cpu_read(ksoftirqd_scheduled))) {
/*
 * Run softirq if any pending. And do it in its own stack
 * as we may be calling this deep in a task call stack already.
@@ -340,6 +345,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+   if (__this_cpu_read(ksoftirqd_scheduled))
+   return;
+
if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
/*
@@ -660,6 +668,8 @@ static void run_ksoftirqd(unsigned int cpu)
 * in the task stack here.
 */
__do_softirq();
+   if (!local_softirq_pending())
+   __this_cpu_write(ksoftirqd_scheduled, false);
local_irq_enable();
cond_resched_rcu_qs();
return;




Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Eric Dumazet
On Tue, 2016-05-10 at 15:02 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> > On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> > 
> > > You might need another one of these in invoke_softirq()
> > > 
> > 
> > Excellent.
> > 
> > I gave it a quick try (without your suggestion), and host seems to
> > survive a stress test.
> > 
> > Of course we do have to fix these problems :
> > 
> > [  147.781629] NOHZ: local_softirq_pending 48
> > [  147.785546] NOHZ: local_softirq_pending 48
> > [  147.788344] NOHZ: local_softirq_pending 48
> > [  147.788992] NOHZ: local_softirq_pending 48
> > [  147.790943] NOHZ: local_softirq_pending 48
> > [  147.791232] NOHZ: local_softirq_pending 24a
> > [  147.791258] NOHZ: local_softirq_pending 48
> > [  147.791366] NOHZ: local_softirq_pending 48
> > [  147.792118] NOHZ: local_softirq_pending 48
> > [  147.793428] NOHZ: local_softirq_pending 48
> 
> 
> Well, with your suggestion, these warnings disappear ;)

This is really nice.

Under stress number of context switches is really small.

ksoftirqd and my netserver compete equally to get the cpu cycles (on
CPU0)

lpaa23:~# vmstat 1 10
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 260668416  37240 24144280021 0  329  349  0  3 96  0
 1  0  0 260667904  37240 241442800 012 193126 1050  0  2 
98  0
 1  0  0 260667904  37240 241442800 0 0 194354 1056  0  2 
98  0
 1  0  0 260669104  37240 241449200 0 0 200897 1095  0  2 
98  0
 1  0  0 260668592  37240 241449200 0 0 205731  964  0  2 
98  0
 1  0  0 260678832  37240 241449200 0 0 201689  981  0  2 
98  0
 1  0  0 260678832  37240 241449200 0 0 204899  742  0  2 
98  0
 1  0  0 260678320  37240 241449200 0 0 199148  792  0  3 
97  0
 1  0  0 260678832  37240 241449200 0 0 196398  766  0  2 
98  0
 1  0  0 260678832  37240 241449200 0 0 201930  858  0  2 
98  0


And we can see that ksoftirqd/0 runs for longer periods (~500 usec),
instead of stupid 4 usec before the patch. Less overhead.

lpaa23:~# cat /proc/3/sched
ksoftirqd/0 (3, #threads: 1)
---
se.exec_start:   1552401.399526
se.vruntime  :237599.421560
se.sum_exec_runtime  : 75432.494199
se.nr_migrations :0
nr_switches  :   144333
nr_voluntary_switches:   143828
nr_involuntary_switches  :  505
se.load.weight   : 1024
se.avg.load_sum  :10445
se.avg.util_sum  :10445
se.avg.load_avg  :0
se.avg.util_avg  :0
se.avg.last_update_time  :1552401399526
policy   :0
prio :  120
clock-delta  :   47
lpaa23:~# echo 75432.494199/144333|bc -l
.52262818758703830724

And yes indeed, user space can progress way faster under flood.

lpaa23:~# nstat >/dev/null;sleep 1;nstat | grep Udp
UdpInDatagrams  186132 0.0
UdpInErrors 735462 0.0
UdpOutDatagrams 10 0.0
UdpRcvbufErrors 735461 0.0




Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Hannes Frederic Sowa
On 10.05.2016 23:09, Eric Dumazet wrote:
> On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
>  wrote:
> 
>> I agree here, but I don't think this patch particularly is a lot of
>> bloat and something very interesting people can play with and extend upon.
>>
> 
> Sure, very rarely patch authors think their stuff is bloat.
> 
> I prefer to fix kernel softirq.c, or at least show me that you tried
> hard enough.
> 
> I am pretty sure that the following would work :
> 
> When ksoftirqd is scheduled, remember this in a per cpu variable
> (ksoftiqd_scheduled)
> 
> When enabling BH , do not call do_softirq() if this variable is set.
> 
> ksoftirqd would clear the variable at the right place (probably in
> run_ksoftirqd())
> 
> Sure, this might add a lot of latency regressions, but lets fix them.

Probably, yes.

We had a version which limited the number of restarts if softirqs were
invoked from local_bh_enable (so that at least timers etc. would run)
and would defer all other work to ksoftirqd. That also solved the
initial live lock problem. I do have concerns about the fairness of this
approach, but we now have to investigate this. ;)

Not only did we want to present this solely as a bugfix but also as as
performance enhancements in case of virtio (as you can see in the cover
letter). Given that a long time ago there was a tendency to remove
softirqs completely, we thought it might be very interesting, that a
threaded napi in general seems to be absolutely viable nowadays and
might offer new features.

Bye,
Hannes



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Eric Dumazet
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> 
> > You might need another one of these in invoke_softirq()
> > 
> 
> Excellent.
> 
> I gave it a quick try (without your suggestion), and host seems to
> survive a stress test.
> 
> Of course we do have to fix these problems :
> 
> [  147.781629] NOHZ: local_softirq_pending 48
> [  147.785546] NOHZ: local_softirq_pending 48
> [  147.788344] NOHZ: local_softirq_pending 48
> [  147.788992] NOHZ: local_softirq_pending 48
> [  147.790943] NOHZ: local_softirq_pending 48
> [  147.791232] NOHZ: local_softirq_pending 24a
> [  147.791258] NOHZ: local_softirq_pending 48
> [  147.791366] NOHZ: local_softirq_pending 48
> [  147.792118] NOHZ: local_softirq_pending 48
> [  147.793428] NOHZ: local_softirq_pending 48


Well, with your suggestion, these warnings disappear ;)






Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Rik van Riel
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> 
> > 
> > You might need another one of these in invoke_softirq()
> > 
> Excellent.
> 
> I gave it a quick try (without your suggestion), and host seems to
> survive a stress test.
> 
> Of course we do have to fix these problems :
> 
> [  147.781629] NOHZ: local_softirq_pending 48
> [  147.785546] NOHZ: local_softirq_pending 48
> [  147.788344] NOHZ: local_softirq_pending 48
> [  147.788992] NOHZ: local_softirq_pending 48
> [  147.790943] NOHZ: local_softirq_pending 48
> [  147.791232] NOHZ: local_softirq_pending 24a
> [  147.791258] NOHZ: local_softirq_pending 48
> [  147.791366] NOHZ: local_softirq_pending 48
> [  147.792118] NOHZ: local_softirq_pending 48
> [  147.793428] NOHZ: local_softirq_pending 48

As long as ksoftirqd is running, that should not be
an actual problem, just a false positive.

-- 
All Rights Reversed.



signature.asc
Description: This is a digitally signed message part


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Eric Dumazet
On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:

> You might need another one of these in invoke_softirq()
> 

Excellent.

I gave it a quick try (without your suggestion), and host seems to
survive a stress test.

Of course we do have to fix these problems :

[  147.781629] NOHZ: local_softirq_pending 48
[  147.785546] NOHZ: local_softirq_pending 48
[  147.788344] NOHZ: local_softirq_pending 48
[  147.788992] NOHZ: local_softirq_pending 48
[  147.790943] NOHZ: local_softirq_pending 48
[  147.791232] NOHZ: local_softirq_pending 24a
[  147.791258] NOHZ: local_softirq_pending 48
[  147.791366] NOHZ: local_softirq_pending 48
[  147.792118] NOHZ: local_softirq_pending 48
[  147.793428] NOHZ: local_softirq_pending 48

Thanks.





Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Rik van Riel
On Tue, 2016-05-10 at 14:31 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 14:09 -0700, Eric Dumazet wrote:
> > 
> > On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
> >  wrote:
> > 
> > > 
> > > I agree here, but I don't think this patch particularly is a lot
> > > of
> > > bloat and something very interesting people can play with and
> > > extend upon.
> > > 
> > Sure, very rarely patch authors think their stuff is bloat.
> > 
> > I prefer to fix kernel softirq.c, or at least show me that you
> > tried
> > hard enough.
> > 
> > I am pretty sure that the following would work :
> > 
> > When ksoftirqd is scheduled, remember this in a per cpu variable
> > (ksoftiqd_scheduled)
> > 
> > When enabling BH , do not call do_softirq() if this variable is
> > set.
> > 
> > ksoftirqd would clear the variable at the right place (probably in
> > run_ksoftirqd())
> > 
> > Sure, this might add a lot of latency regressions, but lets fix
> > them.
> Only to give the idea (it is completely untested and probably buggy)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342..cb30cfd76687 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
>  static struct softirq_action softirq_vec[NR_SOFTIRQS]
> __cacheline_aligned_in_smp;
>  
>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
>  
>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>   "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
>   /* Interrupts are disabled: no need to stop preemption */
>   struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>  
> - if (tsk && tsk->state != TASK_RUNNING)
> + if (tsk && tsk->state != TASK_RUNNING) {
> + __this_cpu_write(ksoftirqd_scheduled, true);
>   wake_up_process(tsk);
> + }
>  }
>  
>  /*
> @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip,
> unsigned int cnt)
>    */
>   preempt_count_sub(cnt - 1);
>  
> - if (unlikely(!in_interrupt() && local_softirq_pending())) {
> + if (unlikely(!in_interrupt() &&
> +  local_softirq_pending() &&
> +  !__this_cpu_read(ksoftirqd_scheduled))) {
>   /*
> 

You might need another one of these in invoke_softirq()

-- 
All Rights Reversed.



signature.asc
Description: This is a digitally signed message part


[PATCH net-next 3/3] net/mlx5e: Enable CQE compression when PCI is slower than link

2016-05-10 Thread Saeed Mahameed
We turn the feature ON, only for servers with PCI BW < MAX LINK BW, as it
helps reducing PCI pressure on weak PCI slots, but it adds some software
overhead.

Signed-off-by: Saeed Mahameed 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 19 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 52 ++
 3 files changed, 72 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index e05abad..e8a6c33 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -645,6 +645,7 @@ int mlx5e_close_locked(struct net_device *netdev);
 void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
   u32 *indirection_rqt, int len,
   int num_channels);
+int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
 
 static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
  struct mlx5_wqe_ctrl_seg *ctrl, int bf_sz)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 534d99e..fc7dcc0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -613,6 +613,25 @@ static u32 ptys2ethtool_supported_port(u32 eth_proto_cap)
return 0;
 }
 
+int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed)
+{
+   u32 max_speed = 0;
+   u32 proto_cap;
+   int err;
+   int i;
+
+   err = mlx5_query_port_proto_cap(mdev, _cap, MLX5_PTYS_EN);
+   if (err)
+   return err;
+
+   for (i = 0; i < MLX5E_LINK_MODES_NUMBER; ++i)
+   if (proto_cap & MLX5E_PROT_MASK(i))
+   max_speed = max(max_speed, ptys2ethtool_table[i].speed);
+
+   *speed = max_speed;
+   return 0;
+}
+
 static void get_speed_duplex(struct net_device *netdev,
 u32 eth_proto_oper,
 struct ethtool_cmd *cmd)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e40e59a..0804070 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2716,11 +2716,49 @@ static bool 
mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
MLX5_CAP_ETH(mdev, reg_umr_sq);
 }
 
+static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
+{
+   enum pcie_link_width width;
+   enum pci_bus_speed speed;
+   int err = 0;
+
+   err = pcie_get_minimum_link(mdev->pdev, , );
+   if (err)
+   return err;
+
+   if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
+   return -EINVAL;
+
+   switch (speed) {
+   case PCIE_SPEED_2_5GT:
+   *pci_bw = 2500 * width;
+   break;
+   case PCIE_SPEED_5_0GT:
+   *pci_bw = 5000 * width;
+   break;
+   case PCIE_SPEED_8_0GT:
+   *pci_bw = 8000 * width;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw)
+{
+   return (link_speed && pci_bw &&
+   (pci_bw < 4) && (pci_bw < link_speed));
+}
+
 static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
struct net_device *netdev,
int num_channels)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
+   u32 link_speed = 0;
+   u32 pci_bw = 0;
 
priv->params.log_sq_size   =
MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
@@ -2728,6 +2766,20 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev 
*mdev,
MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
MLX5_WQ_TYPE_LINKED_LIST;
 
+   /* set CQE compression */
+   priv->params.rx_cqe_compress_admin = false;
+   if (MLX5_CAP_GEN(mdev, cqe_compression) &&
+   MLX5_CAP_GEN(mdev, vport_group_manager)) {
+   mlx5e_get_max_linkspeed(mdev, _speed);
+   mlx5e_get_pci_bw(mdev, _bw);
+   mlx5_core_dbg(mdev, "Max link speed = %d, PCI BW = %d\n",
+ link_speed, pci_bw);
+   priv->params.rx_cqe_compress_admin =
+   cqe_compress_heuristic(link_speed, pci_bw);
+   }
+
+   priv->params.rx_cqe_compress = priv->params.rx_cqe_compress_admin;
+
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
-- 
2.8.0



[PATCH net-next 2/3] net/mlx5e: Expand WQE stride when CQE compression is enabled

2016-05-10 Thread Saeed Mahameed
From: Tariq Toukan 

Make the MPWQE/Striding RQ default configuration dynamic and not
statically set at compile time.  Now at driver load we set
stride size and num strides dynamically.

By default we use same values as before, but when CQE compression
is enabled, we set larger stride size to benefit from CQE
compression for larger packets.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 13 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 23 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 39 ---
 3 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 19f0d8d..e05abad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -64,12 +64,9 @@
 #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW0x4
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW0x6
 
-#define MLX5_MPWRQ_LOG_NUM_STRIDES 11 /* >= 9, HW restriction */
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6  /* >= 6, HW restriction */
-#define MLX5_MPWRQ_NUM_STRIDES BIT(MLX5_MPWRQ_LOG_NUM_STRIDES)
-#define MLX5_MPWRQ_STRIDE_SIZE BIT(MLX5_MPWRQ_LOG_STRIDE_SIZE)
-#define MLX5_MPWRQ_LOG_WQE_SZ  (MLX5_MPWRQ_LOG_NUM_STRIDES +\
-MLX5_MPWRQ_LOG_STRIDE_SIZE)
+#define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS8  /* >= 6, HW 
restriction */
+#define MLX5_MPWRQ_LOG_WQE_SZ  17
 #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
 #define MLX5_MPWRQ_PAGES_PER_WQE   BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
@@ -154,6 +151,8 @@ struct mlx5e_umr_wqe {
 struct mlx5e_params {
u8  log_sq_size;
u8  rq_wq_type;
+   u8  mpwqe_log_stride_sz;
+   u8  mpwqe_log_num_strides;
u8  log_rq_size;
u16 num_channels;
u8  num_tc;
@@ -249,6 +248,8 @@ struct mlx5e_rq {
/* control */
struct mlx5_wq_ctrlwq_ctrl;
u8 wq_type;
+   u32mpwqe_stride_sz;
+   u32mpwqe_num_strides;
u32rqn;
struct mlx5e_channel  *channel;
struct mlx5e_priv *priv;
@@ -272,7 +273,7 @@ struct mlx5e_mpw_info {
void (*dma_pre_sync)(struct device *pdev,
 struct mlx5e_mpw_info *wi,
 u32 wqe_offset, u32 len);
-   void (*add_skb_frag)(struct device *pdev,
+   void (*add_skb_frag)(struct mlx5e_rq *rq,
 struct sk_buff *skb,
 struct mlx5e_mpw_info *wi,
 u32 page_idx, u32 frag_offset, u32 len);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0ea4c03..e40e59a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -307,7 +307,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
 
-   rq->wqe_sz = MLX5_MPWRQ_NUM_STRIDES * MLX5_MPWRQ_STRIDE_SIZE;
+   rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
+   rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
+   rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
byte_count = rq->wqe_sz;
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
@@ -1130,9 +1132,9 @@ static void mlx5e_build_rq_param(struct mlx5e_priv *priv,
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
MLX5_SET(wq, wq, log_wqe_num_of_strides,
-MLX5_MPWRQ_LOG_NUM_STRIDES - 9);
+priv->params.mpwqe_log_num_strides - 9);
MLX5_SET(wq, wq, log_wqe_stride_size,
-MLX5_MPWRQ_LOG_STRIDE_SIZE - 6);
+priv->params.mpwqe_log_stride_sz - 6);
MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ);
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
@@ -1199,7 +1201,7 @@ static void mlx5e_build_rx_cq_param(struct mlx5e_priv 
*priv,
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
log_cq_size = priv->params.log_rq_size +
-   MLX5_MPWRQ_LOG_NUM_STRIDES;
+   priv->params.mpwqe_log_num_strides;
break;
default: /* 

Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add STU capability

2016-05-10 Thread Andrew Lunn
On Tue, May 10, 2016 at 03:44:29PM -0400, Vivien Didelot wrote:
> Some switch models have a STU (per VLAN port state database). Add a new
> capability flag to switches info, instead of checking their family.
> 
> Also if the 6165 family has an STU, it must have a VTU, so add the
> MV88E6XXX_FLAG_VTU to its family flags.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

 Andrew


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Eric Dumazet
On Tue, 2016-05-10 at 14:09 -0700, Eric Dumazet wrote:
> On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
>  wrote:
> 
> > I agree here, but I don't think this patch particularly is a lot of
> > bloat and something very interesting people can play with and extend upon.
> >
> 
> Sure, very rarely patch authors think their stuff is bloat.
> 
> I prefer to fix kernel softirq.c, or at least show me that you tried
> hard enough.
> 
> I am pretty sure that the following would work :
> 
> When ksoftirqd is scheduled, remember this in a per cpu variable
> (ksoftiqd_scheduled)
> 
> When enabling BH , do not call do_softirq() if this variable is set.
> 
> ksoftirqd would clear the variable at the right place (probably in
> run_ksoftirqd())
> 
> Sure, this might add a lot of latency regressions, but lets fix them.

Only to give the idea (it is completely untested and probably buggy)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..cb30cfd76687 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] 
__cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
/* Interrupts are disabled: no need to stop preemption */
struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-   if (tsk && tsk->state != TASK_RUNNING)
+   if (tsk && tsk->state != TASK_RUNNING) {
+   __this_cpu_write(ksoftirqd_scheduled, true);
wake_up_process(tsk);
+   }
 }
 
 /*
@@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int 
cnt)
 */
preempt_count_sub(cnt - 1);
 
-   if (unlikely(!in_interrupt() && local_softirq_pending())) {
+   if (unlikely(!in_interrupt() &&
+local_softirq_pending() &&
+!__this_cpu_read(ksoftirqd_scheduled))) {
/*
 * Run softirq if any pending. And do it in its own stack
 * as we may be calling this deep in a task call stack already.
@@ -660,6 +665,8 @@ static void run_ksoftirqd(unsigned int cpu)
 * in the task stack here.
 */
__do_softirq();
+   if (!local_softirq_pending())
+   __this_cpu_write(ksoftirqd_scheduled, false);
local_irq_enable();
cond_resched_rcu_qs();
return;




[PATCH net-next 1/3] net/mlx5e: CQE compression

2016-05-10 Thread Saeed Mahameed
From: Tariq Toukan 

CQE compression feature is meant to save PCIe bandwidth by
compressing few CQEs into smaller amount of bytes on PCIe.
CQE compression can be selectively enabled per CQ.  By default
is disabled for now and will be enabled later on.

Signed-off-by: Tariq Toukan 
Signed-off-by: Eugenia Emantayev 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  10 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_clock.c |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 151 -
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   8 ++
 include/linux/mlx5/device.h|  34 +
 6 files changed, 211 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index bfa5daa..19f0d8d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -157,6 +157,8 @@ struct mlx5e_params {
u8  log_rq_size;
u16 num_channels;
u8  num_tc;
+   bool rx_cqe_compress_admin;
+   bool rx_cqe_compress;
u16 rx_cq_moderation_usec;
u16 rx_cq_moderation_pkts;
u16 tx_cq_moderation_usec;
@@ -202,6 +204,13 @@ struct mlx5e_cq {
struct mlx5e_channel  *channel;
struct mlx5e_priv *priv;
 
+   /* cqe decompression */
+   struct mlx5_cqe64  title;
+   struct mlx5_mini_cqe8  mini_arr[MLX5_MINI_CQE_ARRAY_SIZE];
+   u8 mini_arr_idx;
+   u16decmprs_left;
+   u16decmprs_wqe_counter;
+
/* control */
struct mlx5_wq_ctrlwq_ctrl;
 } cacheline_aligned_in_smp;
@@ -616,6 +625,7 @@ void mlx5e_timestamp_init(struct mlx5e_priv *priv);
 void mlx5e_timestamp_cleanup(struct mlx5e_priv *priv);
 int mlx5e_hwstamp_set(struct net_device *dev, struct ifreq *ifr);
 int mlx5e_hwstamp_get(struct net_device *dev, struct ifreq *ifr);
+void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val);
 
 int mlx5e_vlan_rx_add_vid(struct net_device *dev, __always_unused __be16 proto,
  u16 vid);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c
index 2018eeb..847a8f3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_clock.c
@@ -93,6 +93,8 @@ int mlx5e_hwstamp_set(struct net_device *dev, struct ifreq 
*ifr)
/* RX HW timestamp */
switch (config.rx_filter) {
case HWTSTAMP_FILTER_NONE:
+   /* Reset CQE compression to Admin default */
+   mlx5e_modify_rx_cqe_compression(priv, 
priv->params.rx_cqe_compress_admin);
break;
case HWTSTAMP_FILTER_ALL:
case HWTSTAMP_FILTER_SOME:
@@ -108,6 +110,8 @@ int mlx5e_hwstamp_set(struct net_device *dev, struct ifreq 
*ifr)
case HWTSTAMP_FILTER_PTP_V2_EVENT:
case HWTSTAMP_FILTER_PTP_V2_SYNC:
case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+   /* Disable CQE compression */
+   mlx5e_modify_rx_cqe_compression(priv, false);
config.rx_filter = HWTSTAMP_FILTER_ALL;
break;
default:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1c70e51..0ea4c03 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -114,6 +114,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
+   s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
+   s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
 
for (j = 0; j < priv->params.num_tc; j++) {
sq_stats = >channel[i]->sq[j].stats;
@@ -1204,6 +1206,10 @@ static void mlx5e_build_rx_cq_param(struct mlx5e_priv 
*priv,
}
 
MLX5_SET(cqc, cqc, log_cq_size, log_cq_size);
+   if (priv->params.rx_cqe_compress) {
+   MLX5_SET(cqc, cqc, mini_cqe_res_format, MLX5_CQE_FORMAT_CSUM);
+   MLX5_SET(cqc, cqc, cqe_comp_en, 1);
+   }
 
mlx5e_build_common_cq_param(priv, param);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 23adfe2..c8b8d45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -42,6 +42,143 @@ static inline bool mlx5e_rx_hw_stamp(struct 

[PATCH net-next 0/3] Mellanox 100G mlx5 CQE compression

2016-05-10 Thread Saeed Mahameed
Hi Dave,

Introducing ConnectX-4 CQE (Completion Queue Entry) compression feature
for mlx5 etherent driver.

CQE Compressing reduces PCI overhead by coalescing and compressing multiple 
CQEs into a
single merged CQE.  Successful compressing improves message rate especially for 
small packet
traffic.

CQE Compressing in details:

Instead of writing full CQEs to memory, multiple almost identical CQEs are 
merged and compressed.
Information that is shared between the CQEs is written once, regardless of the 
number of
compressed CQEs.  In addition, only the unique information (small amount of 
bytes compared to
full CQE size) is written per CQE. 


CQE Compression Block:

This block contains multiple compressed CQEs.  CQE Compression Block contains a 
single copy
of CQEs properties which are shared between all the compressed CQEs (called 
Title, see below)
and multiple mini CQEs (CQEs in compressed form).


Title:

The Title holds information which is shared between all the compressed CQEs in 
the CQE Compression
Block.  In each Compression Block there is only a single Title regardless of 
the number
of compressed CQEs.


Mini CQE:

A CQE in compressed form that holds some data needed to extract a single full 
CQE, for example
8 Bytes instead of 64 Bytes.
The shared information between all compressed CQEs, which belong to the same 
CQE Compression
Block called Title, is written once, and only the unique information in each 
compressed
CQE, for example 8 bytes, is written per compressed CQE, called mini CQE.


Since CQE Compression can add overhead to the software (CPU),
it will be only enabled on "weak/slow" PCI slots, where it can actually help.

Applied on top: c047c3b1af62 ('netfilter: conntrack: remove uninitialized 
shadow variable')

Thanks,
Saeed.

Saeed Mahameed (1):
  net/mlx5e: Enable CQE compression when PCI is slower than link

Tariq Toukan (2):
  net/mlx5e: CQE compression
  net/mlx5e: Expand WQE stride when CQE compression is enabled

 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  24 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_clock.c |   4 +
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  19 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  81 -
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 188 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   8 +
 include/linux/mlx5/device.h|  34 
 7 files changed, 328 insertions(+), 30 deletions(-)

-- 
2.8.0



[PATCH v1 net-next 6/7] dsa: Rename switch chip data to cd

2016-05-10 Thread Andrew Lunn
The dsa_switch structure contains a dsa_chip_data member called pd.
However in the rest of the code, pd is used for dsa_platform_data.
This is confusing. Rename it cd, which is already often used in dsa.c
and slave.c for this data type.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/bcm_sf2.c   |  4 ++--
 drivers/net/dsa/mv88e6xxx.c |  4 ++--
 include/net/dsa.h   |  4 ++--
 net/dsa/dsa.c   | 18 +-
 net/dsa/slave.c | 10 +-
 5 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 448deb59b9a4..10ddd5a5dfb6 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -949,8 +949,8 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds)
/* All the interesting properties are at the parent device_node
 * level
 */
-   dn = ds->pd->of_node->parent;
-   bcm_sf2_identify_ports(priv, ds->pd->of_node);
+   dn = ds->cd->of_node->parent;
+   bcm_sf2_identify_ports(priv, ds->cd->of_node);
 
priv->irq0 = irq_of_parse_and_map(dn, 0);
priv->irq1 = irq_of_parse_and_map(dn, 1);
diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 428d213ed4fc..c41ec38613c7 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -3009,9 +3009,9 @@ static int mv88e6xxx_setup_global(struct 
mv88e6xxx_priv_state *ps)
for (i = 0; i < 32; i++) {
int nexthop = 0x1f;
 
-   if (ps->ds->pd->rtable &&
+   if (ps->ds->cd->rtable &&
i != ps->ds->index && i < ps->ds->dst->pd->nr_chips)
-   nexthop = ps->ds->pd->rtable[i] & 0x1f;
+   nexthop = ps->ds->cd->rtable[i] & 0x1f;
 
err = _mv88e6xxx_reg_write(
ps, REG_GLOBAL2,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index f4c0bff8d9d6..17c3d37b6779 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -137,7 +137,7 @@ struct dsa_switch {
/*
 * Configuration data for this switch.
 */
-   struct dsa_chip_data*pd;
+   struct dsa_chip_data*cd;
 
/*
 * The used switch driver.
@@ -190,7 +190,7 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds)
if (dst->cpu_switch == ds->index)
return dst->cpu_port;
else
-   return ds->pd->rtable[dst->cpu_switch];
+   return ds->cd->rtable[dst->cpu_switch];
 }
 
 struct switchdev_trans;
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 5db779c69a68..eff5dfc2e33f 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -182,7 +182,7 @@ __ATTRIBUTE_GROUPS(dsa_hwmon);
 /* basic switch operations **/
 static int dsa_cpu_dsa_setup(struct dsa_switch *ds, struct net_device *master)
 {
-   struct dsa_chip_data *cd = ds->pd;
+   struct dsa_chip_data *cd = ds->cd;
struct device_node *port_dn;
struct phy_device *phydev;
int ret, port, mode;
@@ -219,7 +219,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, 
struct device *parent)
 {
struct dsa_switch_driver *drv = ds->drv;
struct dsa_switch_tree *dst = ds->dst;
-   struct dsa_chip_data *pd = ds->pd;
+   struct dsa_chip_data *cd = ds->cd;
bool valid_name_found = false;
int index = ds->index;
int i, ret;
@@ -230,7 +230,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, 
struct device *parent)
for (i = 0; i < DSA_MAX_PORTS; i++) {
char *name;
 
-   name = pd->port_names[i];
+   name = cd->port_names[i];
if (name == NULL)
continue;
 
@@ -328,10 +328,10 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, 
struct device *parent)
if (!(ds->enabled_port_mask & (1 << i)))
continue;
 
-   ret = dsa_slave_create(ds, parent, i, pd->port_names[i]);
+   ret = dsa_slave_create(ds, parent, i, cd->port_names[i]);
if (ret < 0) {
netdev_err(dst->master_netdev, "[%d]: can't create dsa 
slave device for port %d(%s): %d\n",
-  index, i, pd->port_names[i], ret);
+  index, i, cd->port_names[i], ret);
ret = 0;
}
}
@@ -379,7 +379,7 @@ static struct dsa_switch *
 dsa_switch_setup(struct dsa_switch_tree *dst, int index,
 struct device *parent, struct device *host_dev)
 {
-   struct dsa_chip_data *pd = dst->pd->chip + index;
+   struct dsa_chip_data *cd = dst->pd->chip + index;
struct dsa_switch_driver *drv;
struct dsa_switch *ds;
int ret;
@@ -389,7 +389,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
/*
 * Probe for switch 

[PATCH v1 net-next 5/7] dsa: Remove master_dev from switch structure

2016-05-10 Thread Andrew Lunn
The switch drivers only use the master_dev member for dev_info()
messages.  Now that the device is passed to the old style probe, and
new style drivers are probed as true linux drivers, this is no longer
needed.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx.c | 1 +
 include/net/dsa.h   | 7 ++-
 net/dsa/dsa.c   | 2 +-
 net/dsa/slave.c | 2 +-
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index f4bb3bed812f..428d213ed4fc 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -3628,6 +3628,7 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev)
 
ps = (struct mv88e6xxx_priv_state *)(ds + 1);
ds->priv = ps;
+   ds->dev = dev;
ps->dev = dev;
ps->ds = ds;
ps->bus = mdiodev->bus;
diff --git a/include/net/dsa.h b/include/net/dsa.h
index ecb52e265cc3..f4c0bff8d9d6 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -120,6 +120,8 @@ struct dsa_switch_tree {
 };
 
 struct dsa_switch {
+   struct device *dev;
+
/*
 * Parent switch tree, and switch index.
 */
@@ -142,11 +144,6 @@ struct dsa_switch {
 */
struct dsa_switch_driver*drv;
 
-   /*
-* Reference to host device to use.
-*/
-   struct device   *master_dev;
-
 #ifdef CONFIG_NET_DSA_HWMON
/*
 * Hardware monitoring information
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index df169811f26d..5db779c69a68 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -411,7 +411,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
ds->pd = pd;
ds->drv = drv;
ds->priv = priv;
-   ds->master_dev = host_dev;
+   ds->dev = parent;
 
ret = dsa_switch_setup_one(ds, parent);
if (ret)
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 5ea8a40c8d33..f25dcd9e814a 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -51,7 +51,7 @@ void dsa_slave_mii_bus_init(struct dsa_switch *ds)
ds->slave_mii_bus->write = dsa_slave_phy_write;
snprintf(ds->slave_mii_bus->id, MII_BUS_ID_SIZE, "dsa-%d:%.2x",
ds->index, ds->pd->sw_addr);
-   ds->slave_mii_bus->parent = ds->master_dev;
+   ds->slave_mii_bus->parent = ds->dev;
ds->slave_mii_bus->phy_mask = ~ds->phys_mii_mask;
 }
 
-- 
2.8.1



[PATCH v1 net-next 7/7] dsa: mv88e6xxx: Handle eeprom-length property

2016-05-10 Thread Andrew Lunn
A switch can export an attached EEPROM using the standard ethtool API.
However the switch itself cannot determine the size of the EEPROM, and
multiple sizes are allowed. Thus a device tree property is supported
to indicate the length of the EEPROM. Parse this property during
device probe, and implement a callback function to retrieve it.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx.c | 17 +
 drivers/net/dsa/mv88e6xxx.h |  3 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index c41ec38613c7..6f9be26d002f 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -879,6 +879,16 @@ error:
return ret;
 }
 
+static int mv88e6xxx_get_eeprom_len(struct dsa_switch *ds)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+
+   if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_EEPROM))
+   return ps->eeprom_len;
+
+   return 0;
+}
+
 static int mv88e6xxx_get_eeprom(struct dsa_switch *ds,
struct ethtool_eeprom *eeprom, u8 *data)
 {
@@ -3596,6 +3606,7 @@ struct dsa_switch_driver mv88e6xxx_switch_driver = {
.set_temp_limit = mv88e6xxx_set_temp_limit,
.get_temp_alarm = mv88e6xxx_get_temp_alarm,
 #endif
+   .get_eeprom_len = mv88e6xxx_get_eeprom_len,
.get_eeprom = mv88e6xxx_get_eeprom,
.set_eeprom = mv88e6xxx_set_eeprom,
.get_regs_len   = mv88e6xxx_get_regs_len,
@@ -3617,9 +3628,11 @@ struct dsa_switch_driver mv88e6xxx_switch_driver = {
 int mv88e6xxx_probe(struct mdio_device *mdiodev)
 {
struct device *dev = >dev;
+   struct device_node *np = dev->of_node;
struct mv88e6xxx_priv_state *ps;
int id, prod_num, rev;
struct dsa_switch *ds;
+   u32 eeprom_len;
int err;
 
ds = devm_kzalloc(dev, sizeof(*ds) + sizeof(*ps), GFP_KERNEL);
@@ -3662,6 +3675,10 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev)
}
}
 
+   if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_EEPROM) &&
+   !of_property_read_u32(np, "eeprom-length", _len))
+   ps->eeprom_len = eeprom_len;
+
dev_set_drvdata(dev, ds);
 
dev_info(dev, "switch 0x%x probed: %s, revision %u\n",
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index df226c151d9c..d83565ac684e 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -590,6 +590,9 @@ struct mv88e6xxx_priv_state {
 * switch soft reset.
 */
struct gpio_desc *reset;
+
+   /* set to size of eeprom if supported by the switch */
+   int eeprom_len;
 };
 
 enum stat_type {
-- 
2.8.1



[PATCH v1 net-next 1/7] dsa: mv88e6xxx: Initialise the mutex as soon as it is created

2016-05-10 Thread Andrew Lunn
By initialising immediately it, we don't run the danger of using it
before it is initialised.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 1e5ca8e0f48e..b2d27bfd53c2 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -3118,8 +3118,6 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
 
ps->ds = ds;
 
-   mutex_init(>smi_mutex);
-
INIT_WORK(>bridge_work, mv88e6xxx_bridge_work);
 
if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_EEPROM))
@@ -3566,6 +3564,7 @@ static const char *mv88e6xxx_probe(struct device *dsa_dev,
ps->bus = bus;
ps->sw_addr = sw_addr;
ps->info = info;
+   mutex_init(>smi_mutex);
 
*priv = ps;
 
-- 
2.8.1



Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: abstract VTU/STU data access

2016-05-10 Thread Andrew Lunn
On Tue, May 10, 2016 at 03:44:28PM -0400, Vivien Didelot wrote:
> Both VTU and STU operations use the same routine to access their
> (common) data registers, with a different offset.
> 
> Add VTU and STU specific read and write functions to the data registers
> to abstract the required offset.
> 
> Signed-off-by: Vivien Didelot 

Hi Vivien

Improves the readability.

Reviewed-by: Andrew Lunn 

Thanks
Andrew


[PATCH v1 net-next 2/7] dsa: mv88e6xxx: Rename probe function to fit the normal pattern

2016-05-10 Thread Andrew Lunn
All other DSA drivers use _drv_ in there DSA probe function name, thus
allowing for a true linux driver probe function to use the
conventional name. Make mv88e6xxx fit this pattern.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index b2d27bfd53c2..199a8628df2b 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -3529,9 +3529,9 @@ mv88e6xxx_lookup_info(unsigned int prod_num, const struct 
mv88e6xxx_info *table,
return NULL;
 }
 
-static const char *mv88e6xxx_probe(struct device *dsa_dev,
-  struct device *host_dev, int sw_addr,
-  void **priv)
+static const char *mv88e6xxx_drv_probe(struct device *dsa_dev,
+  struct device *host_dev, int sw_addr,
+  void **priv)
 {
const struct mv88e6xxx_info *info;
struct mv88e6xxx_priv_state *ps;
@@ -3576,7 +3576,7 @@ static const char *mv88e6xxx_probe(struct device *dsa_dev,
 
 struct dsa_switch_driver mv88e6xxx_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_EDSA,
-   .probe  = mv88e6xxx_probe,
+   .probe  = mv88e6xxx_drv_probe,
.setup  = mv88e6xxx_setup,
.set_addr   = mv88e6xxx_set_addr,
.phy_read   = mv88e6xxx_phy_read,
-- 
2.8.1



[PATCH v1 net-next 0/7] More enabler patches for DSA probing

2016-05-10 Thread Andrew Lunn
The complete set of patches for the reworked DSA probing is too big to
post as once. These subset contains some enablers which are easy to
review.

Eventually, the Marvell driver will instantiate its own internal MDIO
bus, rather than have the framework do it, thus allows devices on the
bus to be listed in the device tree. Initialize the main mutex as soon
as it is created, to avoid lifetime issues with the mdio bus.

A previous patch renamed all the DSA probe functions to make room for
a true device probe. However the recent merging of all the Marvell
switch drivers resulted in mv88e6xxx going back to the old probe
name. Rename it again, so we can have a driver probe function.

Add minimum support for the Marvell switch driver to probe as an MDIO
device, as well as an DSA driver. Later patches will then register
this device with the new DSA core framework.

Move the GPIO reset code out of the DSA code. Different drivers may
need different reset mechanisms, e.g. via a reset controller for
memory mapped devices. Don't clutter up the core with this. Let each
driver implement what it needs.

master_dev is no longer needed in the switch drivers, since they have
access to a device pointer from the probe function. Remove it.

Let the switch parse the eeprom length from its one device tree
node. This is required with the new binding when the central DSA
platform device no longer exists.

Andrew Lunn (7):
  dsa: mv88e6xxx: Initialise the mutex as soon as it is created
  dsa: mv88e6xxx: Rename probe function to fit the normal pattern
  dsa: Add mdio device support to Marvell switches
  dsa: Move gpio reset into switch driver
  dsa: Remove master_dev from switch structure
  dsa: Rename switch chip data to cd
  dsa: mv88e6xxx: Handle eeprom-length property

 Documentation/devicetree/bindings/net/dsa/dsa.txt  |   2 -
 .../devicetree/bindings/net/dsa/marvell.txt|  35 ++
 drivers/net/dsa/bcm_sf2.c  |   4 +-
 drivers/net/dsa/mv88e6xxx.c| 137 +
 drivers/net/dsa/mv88e6xxx.h|  10 ++
 include/net/dsa.h  |  19 +--
 net/dsa/dsa.c  |  36 ++
 net/dsa/slave.c|  12 +-
 8 files changed, 177 insertions(+), 78 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/marvell.txt

-- 
2.8.1



[PATCH v1 net-next 3/7] dsa: Add mdio device support to Marvell switches

2016-05-10 Thread Andrew Lunn
Allow Marvell switches to be mdio devices. Currently the driver just
allocate the private structure and detects what device is on the
bus. Later patches will make them register with the DSA framework.

Signed-off-by: Andrew Lunn 
---
 .../devicetree/bindings/net/dsa/marvell.txt| 27 +++
 drivers/net/dsa/mv88e6xxx.c| 90 +-
 2 files changed, 99 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/marvell.txt

diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt 
b/Documentation/devicetree/bindings/net/dsa/marvell.txt
new file mode 100644
index ..cdd70cebdea7
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt
@@ -0,0 +1,27 @@
+Marvell DSA Switch Device Tree Bindings
+---
+
+WARNING: This binding is currently unstable. Do not program it into a
+FLASH never to be changed again. Once this binding is stable, this
+warning will be removed.
+
+If you need a stable binding, use the old dsa.txt binding.
+
+Marvell Switches are MDIO devices. The following properties should be
+placed as a child node of an mdio device.
+
+Required properties:
+- compatible   : Should be one of "marvell,mv88e6085",
+- reg  : Address on the MII bus for the switch.
+
+Example:
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   switch0: switch@0 {
+   compatible = "marvell,mv88e6085";
+   reg = <0>;
+   };
+   };
diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 199a8628df2b..e5ba04132140 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -5,6 +5,8 @@
  * Copyright (c) 2015 CMC Electronics, Inc.
  * Added support for VLAN Table Unit operations
  *
+ * Copyright (c) 2016 Andrew Lunn 
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -17,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3611,36 +3614,87 @@ struct dsa_switch_driver mv88e6xxx_switch_driver = {
.port_fdb_dump  = mv88e6xxx_port_fdb_dump,
 };
 
-static int __init mv88e6xxx_init(void)
+int mv88e6xxx_probe(struct mdio_device *mdiodev)
 {
-   register_switch_driver(_switch_driver);
+   struct device *dev = >dev;
+   struct mv88e6xxx_priv_state *ps;
+   int id, prod_num, rev;
+   struct dsa_switch *ds;
+
+   ds = devm_kzalloc(dev, sizeof(*ds) + sizeof(*ps), GFP_KERNEL);
+   if (!ds)
+   return -ENOMEM;
+
+   ps = (struct mv88e6xxx_priv_state *)(ds + 1);
+   ds->priv = ps;
+   ps->dev = dev;
+   ps->ds = ds;
+   ps->bus = mdiodev->bus;
+   ps->sw_addr = mdiodev->addr;
+   mutex_init(>smi_mutex);
+
+   get_device(>bus->dev);
+
+   ds->drv = _switch_driver;
+
+   id = mv88e6xxx_reg_read(ps, REG_PORT(0), PORT_SWITCH_ID);
+   if (id < 0)
+   return id;
+
+   prod_num = (id & 0xfff0) >> 4;
+   rev = id & 0x000f;
+
+   ps->info = mv88e6xxx_lookup_info(prod_num, mv88e6xxx_table,
+ARRAY_SIZE(mv88e6xxx_table));
+   if (!ps->info)
+   return -ENODEV;
+
+   dev_set_drvdata(dev, ds);
+
+   dev_info(dev, "switch 0x%x probed: %s, revision %u\n",
+prod_num, ps->info->name, rev);
 
return 0;
 }
+
+static void mv88e6xxx_remove(struct mdio_device *mdiodev)
+{
+   struct dsa_switch *ds = dev_get_drvdata(>dev);
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+
+   put_device(>bus->dev);
+}
+
+static const struct of_device_id mv88e6xxx_of_match[] = {
+   { .compatible = "marvell,mv88e6085" },
+   { /* sentinel */ },
+};
+
+MODULE_DEVICE_TABLE(of, mv88e6xxx_of_match);
+
+static struct mdio_driver mv88e6xxx_driver = {
+   .probe  = mv88e6xxx_probe,
+   .remove = mv88e6xxx_remove,
+   .mdiodrv.driver = {
+   .name = "mv88e6085",
+   .of_match_table = mv88e6xxx_of_match,
+   },
+};
+
+static int __init mv88e6xxx_init(void)
+{
+   register_switch_driver(_switch_driver);
+   return mdio_driver_register(_driver);
+}
 module_init(mv88e6xxx_init);
 
 static void __exit mv88e6xxx_cleanup(void)
 {
+   mdio_driver_unregister(_driver);
unregister_switch_driver(_switch_driver);
 }
 module_exit(mv88e6xxx_cleanup);
 
-MODULE_ALIAS("platform:mv88e6085");
-MODULE_ALIAS("platform:mv88e6095");
-MODULE_ALIAS("platform:mv88e6095f");
-MODULE_ALIAS("platform:mv88e6123");
-MODULE_ALIAS("platform:mv88e6131");
-MODULE_ALIAS("platform:mv88e6161");
-MODULE_ALIAS("platform:mv88e6165");
-MODULE_ALIAS("platform:mv88e6171");

Re: [net PATCH v2 6/6] net sched: ife action fix late binding

2016-05-10 Thread Cong Wang
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> The process below was broken and is fixed with this patch.
>
> //add an ife action and give it an instance id of 1
> sudo tc actions add action ife encode \
> type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
>
> //create a filter which binds to ife action id 1
> sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
> match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1
>
> Message before fix was:
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> Signed-off-by: Jamal Hadi Salim 

Reviewed-by: Cong Wang 


[PATCH v1 net-next 4/7] dsa: Move gpio reset into switch driver

2016-05-10 Thread Andrew Lunn
Resetting the switch is something the driver does, not the framework.
So move the parsing of this property into the driver.

There are no in kernel users of this property, so moving it does not
break anything. There is however a board which will make use of this
property making its way into the kernel.

Signed-off-by: Andrew Lunn 
---
 Documentation/devicetree/bindings/net/dsa/dsa.txt |  2 --
 Documentation/devicetree/bindings/net/dsa/marvell.txt |  8 
 drivers/net/dsa/mv88e6xxx.c   | 14 +-
 drivers/net/dsa/mv88e6xxx.h   |  7 +++
 include/net/dsa.h |  8 
 net/dsa/dsa.c | 16 
 6 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt 
b/Documentation/devicetree/bindings/net/dsa/dsa.txt
index 5fdbbcdf8c4b..9f4807f90c31 100644
--- a/Documentation/devicetree/bindings/net/dsa/dsa.txt
+++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt
@@ -31,8 +31,6 @@ A switch child node has the following optional property:
  switch. Must be set if the switch can not detect
  the presence and/or size of a connected EEPROM,
  otherwise optional.
-- reset-gpios  : phandle and specifier to a gpio line connected to
- reset pin of the switch chip.
 
 A switch may have multiple "port" children nodes
 
diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt 
b/Documentation/devicetree/bindings/net/dsa/marvell.txt
index cdd70cebdea7..7629189398aa 100644
--- a/Documentation/devicetree/bindings/net/dsa/marvell.txt
+++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt
@@ -10,10 +10,17 @@ If you need a stable binding, use the old dsa.txt binding.
 Marvell Switches are MDIO devices. The following properties should be
 placed as a child node of an mdio device.
 
+The properties described here are those specific to Marvell devices.
+Additional required and optional properties can be found in dsa.txt.
+
 Required properties:
 - compatible   : Should be one of "marvell,mv88e6085",
 - reg  : Address on the MII bus for the switch.
 
+Optional properties:
+
+- reset-gpios  : Should be a gpio specifier for a reset line
+
 Example:
 
mdio {
@@ -23,5 +30,6 @@ Example:
switch0: switch@0 {
compatible = "marvell,mv88e6085";
reg = <0>;
+  reset-gpios = < 1 GPIO_ACTIVE_LOW>;
};
};
diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index e5ba04132140..f4bb3bed812f 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2568,7 +2568,7 @@ static int mv88e6xxx_switch_reset(struct 
mv88e6xxx_priv_state *ps)
 {
bool ppu_active = mv88e6xxx_has(ps, MV88E6XXX_FLAG_PPU_ACTIVE);
u16 is_reset = (ppu_active ? 0x8800 : 0xc800);
-   struct gpio_desc *gpiod = ps->ds->pd->reset;
+   struct gpio_desc *gpiod = ps->reset;
unsigned long timeout;
int ret;
int i;
@@ -3620,6 +3620,7 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev)
struct mv88e6xxx_priv_state *ps;
int id, prod_num, rev;
struct dsa_switch *ds;
+   int err;
 
ds = devm_kzalloc(dev, sizeof(*ds) + sizeof(*ps), GFP_KERNEL);
if (!ds)
@@ -3649,6 +3650,17 @@ int mv88e6xxx_probe(struct mdio_device *mdiodev)
if (!ps->info)
return -ENODEV;
 
+   ps->reset = devm_gpiod_get(>dev, "reset", GPIOD_ASIS);
+   if (IS_ERR(ps->reset)) {
+   err = PTR_ERR(ps->reset);
+   if (err == -ENOENT) {
+   /* Optional, so not an error */
+   ps->reset = NULL;
+   } else {
+   return err;
+   }
+   }
+
dev_set_drvdata(dev, ds);
 
dev_info(dev, "switch 0x%x probed: %s, revision %u\n",
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index ca69a93a42a0..df226c151d9c 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -12,6 +12,7 @@
 #define __MV88E6XXX_H
 
 #include 
+#include 
 
 #ifndef UINT64_MAX
 #define UINT64_MAX (u64)(~((u64)0))
@@ -583,6 +584,12 @@ struct mv88e6xxx_priv_state {
DECLARE_BITMAP(port_state_update_mask, DSA_MAX_PORTS);
 
struct work_struct bridge_work;
+
+   /* A switch may have a GPIO line tied to its reset pin. Parse
+* this from the device tree, and use it before performing
+* switch soft reset.
+*/
+   struct gpio_desc *reset;
 };
 
 enum stat_type {
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 8e86af87c84f..ecb52e265cc3 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -16,7 +16,6 @@
 

Re: [net PATCH v2 5/6] net sched: skbedit action fix late binding

2016-05-10 Thread Cong Wang
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> The process below was broken and is fixed with this patch.
>
> //add a skbedit action and give it an instance id of 1
> sudo tc actions add action skbedit mark 10 index 1
> //create a filter which binds to skbedit action id 1
> sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
> match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1
>
> Message before fix was:
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> Signed-off-by: Jamal Hadi Salim 

Reviewed-by: Cong Wang 


Re: [net PATCH v2 4/6] net sched: simple action fix late binding

2016-05-10 Thread Cong Wang
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> The process below was broken and is fixed with this patch.
>
> //add a simple action and give it an instance id of 1
> sudo tc actions add action simple sdata "foobar" index 1
> //create a filter which binds to simple action id 1
> sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
> match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1
>
> Message before fix was:
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> Signed-off-by: Jamal Hadi Salim 

Reviewed-by: Cong Wang 


Re: [net PATCH v2 3/6] net sched: mirred action fix late binding

2016-05-10 Thread Cong Wang
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> The process below was broken and is fixed with this patch.
>
> //add an mirred action and give it an instance id of 1
> sudo tc actions add action mirred egress mirror dev $MDEV  index 1
> //create a filter which binds to mirred action id 1
> sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
> match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1
>
> Message before bug fix was:
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> Signed-off-by: Jamal Hadi Salim 

Reviewed-by: Cong Wang 


Re: [net PATCH v2 2/6] net sched: ipt action fix late binding

2016-05-10 Thread Cong Wang
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> This was broken and is fixed with this patch.
>
> //add an ipt action and give it an instance id of 1
> sudo tc actions add action ipt -j mark --set-mark 2 index 1
> //create a filter which binds to ipt action id 1
> sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
> match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1
>
> Message before bug fix was:
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> Signed-off-by: Jamal Hadi Salim 


Reviewed-by: Cong Wang 


Re: [net PATCH v2 1/6] net sched: vlan action fix late binding

2016-05-10 Thread Cong Wang
On Tue, May 10, 2016 at 1:49 PM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> Late vlan action binding was broken and is fixed with this patch.
>
> //add a vlan action to pop and give it an instance id of 1
> sudo tc actions add action vlan pop index 1
> //create filter which binds to vlan action id 1
> sudo tc filter add dev $DEV parent : protocol ip prio 1 u32 \
> match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1
>
> current message(before bug fix) was:
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
>
> Signed-off-by: Jamal Hadi Salim 

Reviewed-by: Cong Wang 


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Eric Dumazet
On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
 wrote:

> I agree here, but I don't think this patch particularly is a lot of
> bloat and something very interesting people can play with and extend upon.
>

Sure, very rarely patch authors think their stuff is bloat.

I prefer to fix kernel softirq.c, or at least show me that you tried
hard enough.

I am pretty sure that the following would work :

When ksoftirqd is scheduled, remember this in a per cpu variable
(ksoftiqd_scheduled)

When enabling BH , do not call do_softirq() if this variable is set.

ksoftirqd would clear the variable at the right place (probably in
run_ksoftirqd())

Sure, this might add a lot of latency regressions, but lets fix them.


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Rik van Riel
On Tue, 2016-05-10 at 16:52 -0400, David Miller wrote:
> From: Rik van Riel 
> Date: Tue, 10 May 2016 16:50:56 -0400
> 
> > On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote:
> >> From: Paolo Abeni 
> >> Date: Tue, 10 May 2016 22:22:50 +0200
> >> 
> >> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
> >> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> >> >> 
> >> >> > If a single core host is under network flood, i.e. ksoftirqd
> is
> >> >> > scheduled and it eventually (after processing ~640 packets)
> will
> >> let the
> >> >> > user space process run. The latter will execute a syscall to
> >> receive a
> >> >> > packet, which will have to disable/enable bh at least once
> and
> >> that will
> >> >> > cause the processing of another ~640 packets. To receive a
> >> single packet
> >> >> > in user space, the kernel has to process more than one
> thousand
> >> packets.
> >> >> 
> >> >> Looks you found the bug then. Have you tried to fix it ?
> >>  ...
> >> > The ksoftirq and the local_bh_enable() design are the root of
> the
> >> > problem, they need to be touched/affected to solve it.
> >> 
> >> That's not what I read from your description, processing 640
> packets
> >> before going to ksoftirqd seems to the be the absolute root
> problem.
> > 
> > What would a fix for that look like?
> > 
> > Keep track of the number of processed incoming packets,
> > and the number of packets handed off, and defer to
> > ksoftirqd earlier if the statistics suggest packets are
> > getting dropped on the floor?
> 
> Not by packet count but by something more easily to measure and
> scalable to fairness like processing time.

I need to get back to fixing irq & softirq time accounting,
which does not currently work correctly in all time keeping
modes...

-- 
All Rights Reversed.



signature.asc
Description: This is a digitally signed message part


Re: [Intel-wired-lan] [PATCH] e1000e: prevent division by zero if TIMINCA is zero

2016-05-10 Thread Jarod Wilson
On Fri, May 06, 2016 at 11:43:17PM +, Rustad, Mark D wrote:
> Denys Vlasenko  wrote:
> 
> >Users report that under VMWare, er32(TIMINCA) returns zero.
> >This causes division by zero at init time as follows:
> >
> > ==>incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
> >for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
> >/* latch SYSTIMH on read of SYSTIML */
> >systim_next = (cycle_t)er32(SYSTIML);
> >systim_next |= (cycle_t)er32(SYSTIMH) << 32;
> >
> >time_delta = systim_next - systim;
> >temp = time_delta;
> > >  rem = do_div(temp, incvalue);
> >
> >This change makes kernel survive this, and users report that
> >NIC does work after this change.
> >
> >Since on real hardware incvalue is never zero, this should not affect
> >real hardware use case.
...
> I seem to recall that this was rejected before because it really is VMWare's
> bug and, if they fix it, any existing VMs that use this will just work.
> Changing the driver will only fix it for vms that install a new driver. I
> don't object to doing it, it just seems like not the most effective place to
> address the issue.

You could also have people who never update VMWare, for whom a kernel
work-around would be better. I think it'd be best to address it both at
the driver level and the emulated hardware level, to improve things for
the most possible users. Those who update neither hypervisor or
kernel/driver, well, they reap what they sow.

-- 
Jarod Wilson
ja...@redhat.com



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread David Miller
From: Rik van Riel 
Date: Tue, 10 May 2016 16:50:56 -0400

> On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote:
>> From: Paolo Abeni 
>> Date: Tue, 10 May 2016 22:22:50 +0200
>> 
>> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
>> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
>> >> 
>> >> > If a single core host is under network flood, i.e. ksoftirqd is
>> >> > scheduled and it eventually (after processing ~640 packets) will
>> let the
>> >> > user space process run. The latter will execute a syscall to
>> receive a
>> >> > packet, which will have to disable/enable bh at least once and
>> that will
>> >> > cause the processing of another ~640 packets. To receive a
>> single packet
>> >> > in user space, the kernel has to process more than one thousand
>> packets.
>> >> 
>> >> Looks you found the bug then. Have you tried to fix it ?
>>  ...
>> > The ksoftirq and the local_bh_enable() design are the root of the
>> > problem, they need to be touched/affected to solve it.
>> 
>> That's not what I read from your description, processing 640 packets
>> before going to ksoftirqd seems to the be the absolute root problem.
> 
> What would a fix for that look like?
> 
> Keep track of the number of processed incoming packets,
> and the number of packets handed off, and defer to
> ksoftirqd earlier if the statistics suggest packets are
> getting dropped on the floor?

Not by packet count but by something more easily to measure and
scalable to fairness like processing time.


Re: [net-next PATCH v2 1/6] net sched: vlan action fix late binding

2016-05-10 Thread Jamal Hadi Salim

On 16-05-08 11:08 PM, Cong Wang wrote:


+   if (aexists)
+   tcf_hash_release(a, bind);



Introduce a goto to reduce duplicated cleanup code?



I addressed all your comments except for above goto. There are different
return codes for all failures - and the cleverness of a goto seems
unecessary.

cheers,
jamal



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Rik van Riel
On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote:
> From: Paolo Abeni 
> Date: Tue, 10 May 2016 22:22:50 +0200
> 
> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> >> 
> >> > If a single core host is under network flood, i.e. ksoftirqd is
> >> > scheduled and it eventually (after processing ~640 packets) will
> let the
> >> > user space process run. The latter will execute a syscall to
> receive a
> >> > packet, which will have to disable/enable bh at least once and
> that will
> >> > cause the processing of another ~640 packets. To receive a
> single packet
> >> > in user space, the kernel has to process more than one thousand
> packets.
> >> 
> >> Looks you found the bug then. Have you tried to fix it ?
>  ...
> > The ksoftirq and the local_bh_enable() design are the root of the
> > problem, they need to be touched/affected to solve it.
> 
> That's not what I read from your description, processing 640 packets
> before going to ksoftirqd seems to the be the absolute root problem.

What would a fix for that look like?

Keep track of the number of processed incoming packets,
and the number of packets handed off, and defer to
ksoftirqd earlier if the statistics suggest packets are
getting dropped on the floor?

Is there a cheap way to do that kind of thing?

-- 
All Rights Reversed.



signature.asc
Description: This is a digitally signed message part


[net PATCH v2 5/6] net sched: skbedit action fix late binding

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

The process below was broken and is fixed with this patch.

//add a skbedit action and give it an instance id of 1
sudo tc actions add action skbedit mark 10 index 1
//create a filter which binds to skbedit action id 1
sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1

Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_skbedit.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index cfcdbdc..69da5a8 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -69,7 +69,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
struct tcf_skbedit *d;
u32 flags = 0, *priority = NULL, *mark = NULL;
u16 *queue_mapping = NULL;
-   int ret = 0, err;
+   int ret = 0, err, exists = 0;
 
if (nla == NULL)
return -EINVAL;
@@ -96,12 +96,18 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
mark = nla_data(tb[TCA_SKBEDIT_MARK]);
}
 
-   if (!flags)
-   return -EINVAL;
-
parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
 
-   if (!tcf_hash_check(tn, parm->index, a, bind)) {
+   exists = tcf_hash_check(tn, parm->index, a, bind);
+   if (exists && bind)
+   return 0;
+
+   if (!flags) {
+   tcf_hash_release(a, bind);
+   return -EINVAL;
+   }
+
+   if (!exists) {
ret = tcf_hash_create(tn, parm->index, est, a,
  sizeof(*d), bind, false);
if (ret)
@@ -111,8 +117,6 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
ret = ACT_P_CREATED;
} else {
d = to_skbedit(a);
-   if (bind)
-   return 0;
tcf_hash_release(a, bind);
if (!ovr)
return -EEXIST;
-- 
1.9.1



[net PATCH v2 2/6] net sched: ipt action fix late binding

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

This was broken and is fixed with this patch.

//add an ipt action and give it an instance id of 1
sudo tc actions add action ipt -j mark --set-mark 2 index 1
//create a filter which binds to ipt action id 1
sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1

Message before bug fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_ipt.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 350e134..8b52700 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -96,7 +96,7 @@ static int __tcf_ipt_init(struct tc_action_net *tn, struct 
nlattr *nla,
struct tcf_ipt *ipt;
struct xt_entry_target *td, *t;
char *tname;
-   int ret = 0, err;
+   int ret = 0, err, exists = 0;
u32 hook = 0;
u32 index = 0;
 
@@ -107,18 +107,23 @@ static int __tcf_ipt_init(struct tc_action_net *tn, 
struct nlattr *nla,
if (err < 0)
return err;
 
-   if (tb[TCA_IPT_HOOK] == NULL)
-   return -EINVAL;
-   if (tb[TCA_IPT_TARG] == NULL)
+   if (tb[TCA_IPT_INDEX] != NULL)
+   index = nla_get_u32(tb[TCA_IPT_INDEX]);
+
+   exists = tcf_hash_check(tn, index, a, bind);
+   if (exists && bind)
+   return 0;
+
+   if (tb[TCA_IPT_HOOK] == NULL || tb[TCA_IPT_TARG] == NULL) {
+   if (exists)
+   tcf_hash_release(a, bind);
return -EINVAL;
+   }
 
td = (struct xt_entry_target *)nla_data(tb[TCA_IPT_TARG]);
if (nla_len(tb[TCA_IPT_TARG]) < td->u.target_size)
return -EINVAL;
 
-   if (tb[TCA_IPT_INDEX] != NULL)
-   index = nla_get_u32(tb[TCA_IPT_INDEX]);
-
if (!tcf_hash_check(tn, index, a, bind)) {
ret = tcf_hash_create(tn, index, est, a, sizeof(*ipt), bind,
  false);
-- 
1.9.1



[net PATCH v2 6/6] net sched: ife action fix late binding

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

The process below was broken and is fixed with this patch.

//add an ife action and give it an instance id of 1
sudo tc actions add action ife encode \
type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1

//create a filter which binds to ife action id 1
sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1

Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_ife.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index c589a9b..343d011 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -423,7 +423,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
u16 ife_type = 0;
u8 *daddr = NULL;
u8 *saddr = NULL;
-   int ret = 0;
+   int ret = 0, exists = 0;
int err;
 
err = nla_parse_nested(tb, TCA_IFE_MAX, nla, ife_policy);
@@ -435,25 +435,29 @@ static int tcf_ife_init(struct net *net, struct nlattr 
*nla,
 
parm = nla_data(tb[TCA_IFE_PARMS]);
 
+   exists = tcf_hash_check(tn, parm->index, a, bind);
+   if (exists && bind)
+   return 0;
+
if (parm->flags & IFE_ENCODE) {
/* Until we get issued the ethertype, we cant have
 * a default..
**/
if (!tb[TCA_IFE_TYPE]) {
+   if (exists)
+   tcf_hash_release(a, bind);
pr_info("You MUST pass etherype for encoding\n");
return -EINVAL;
}
}
 
-   if (!tcf_hash_check(tn, parm->index, a, bind)) {
+   if (!exists) {
ret = tcf_hash_create(tn, parm->index, est, a, sizeof(*ife),
  bind, false);
if (ret)
return ret;
ret = ACT_P_CREATED;
} else {
-   if (bind)   /* dont override defaults */
-   return 0;
tcf_hash_release(a, bind);
if (!ovr)
return -EEXIST;
@@ -495,6 +499,8 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
   NULL);
if (err) {
 metadata_parse_err:
+   if (exists)
+   tcf_hash_release(a, bind);
if (ret == ACT_P_CREATED)
_tcf_ife_cleanup(a, bind);
 
-- 
1.9.1



[net PATCH v2 4/6] net sched: simple action fix late binding

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

The process below was broken and is fixed with this patch.

//add a simple action and give it an instance id of 1
sudo tc actions add action simple sdata "foobar" index 1
//create a filter which binds to simple action id 1
sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1

Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_simple.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 75b2be1..3a33fb6 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -87,7 +87,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
struct tc_defact *parm;
struct tcf_defact *d;
char *defdata;
-   int ret = 0, err;
+   int ret = 0, err, exists = 0;
 
if (nla == NULL)
return -EINVAL;
@@ -99,13 +99,21 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
if (tb[TCA_DEF_PARMS] == NULL)
return -EINVAL;
 
-   if (tb[TCA_DEF_DATA] == NULL)
-   return -EINVAL;
 
parm = nla_data(tb[TCA_DEF_PARMS]);
+   exists = tcf_hash_check(tn, parm->index, a, bind);
+   if (exists && bind)
+   return 0;
+
+   if (tb[TCA_DEF_DATA] == NULL) {
+   if (exists)
+   tcf_hash_release(a, bind);
+   return -EINVAL;
+   }
+
defdata = nla_data(tb[TCA_DEF_DATA]);
 
-   if (!tcf_hash_check(tn, parm->index, a, bind)) {
+   if (!exists) {
ret = tcf_hash_create(tn, parm->index, est, a,
  sizeof(*d), bind, false);
if (ret)
@@ -122,8 +130,6 @@ static int tcf_simp_init(struct net *net, struct nlattr 
*nla,
} else {
d = to_defact(a);
 
-   if (bind)
-   return 0;
tcf_hash_release(a, bind);
if (!ovr)
return -EEXIST;
-- 
1.9.1



[net PATCH v2 0/6] net sched: Fix broken late binding of actions

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Some actions were broken in allowing for late binding of actions.
Late binding workflow is as follows:
a) create an action and provide all necessary parameters for it
Optionally provide an index or let the kernel give you one.
Example:
sudo tc actions add action police rate 1kbit burst 90k drop index 1

b) later on bind to the pre-created action from a filter definition
by merely specifying the index.
Example:
sudo tc filter add dev lo parent : protocol ip prio 8 \
u32 match ip src 127.0.0.8/32 flowid 1:8 action police index 1


Jamal Hadi Salim (6):
  net sched: vlan action fix late binding
  net sched: ipt action fix late binding
  net sched: mirred action fix late binding
  net sched: simple action fix late binding
  net sched: skbedit action fix late binding
  net sched: ife action fix late binding

 net/sched/act_ife.c | 14 ++
 net/sched/act_ipt.c | 19 ---
 net/sched/act_mirred.c  | 19 +--
 net/sched/act_simple.c  | 18 --
 net/sched/act_skbedit.c | 18 +++---
 net/sched/act_vlan.c| 22 --
 6 files changed, 74 insertions(+), 36 deletions(-)

-- 
1.9.1



[net PATCH v2 1/6] net sched: vlan action fix late binding

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Late vlan action binding was broken and is fixed with this patch.

//add a vlan action to pop and give it an instance id of 1
sudo tc actions add action vlan pop index 1
//create filter which binds to vlan action id 1
sudo tc filter add dev $DEV parent : protocol ip prio 1 u32 \
match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1

current message(before bug fix) was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_vlan.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index bab8ae0..c45f926 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -77,7 +77,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
int action;
__be16 push_vid = 0;
__be16 push_proto = 0;
-   int ret = 0;
+   int ret = 0, exists = 0;
int err;
 
if (!nla)
@@ -90,15 +90,25 @@ static int tcf_vlan_init(struct net *net, struct nlattr 
*nla,
if (!tb[TCA_VLAN_PARMS])
return -EINVAL;
parm = nla_data(tb[TCA_VLAN_PARMS]);
+   exists = tcf_hash_check(tn, parm->index, a, bind);
+   if (exists && bind)
+   return 0;
+
switch (parm->v_action) {
case TCA_VLAN_ACT_POP:
break;
case TCA_VLAN_ACT_PUSH:
-   if (!tb[TCA_VLAN_PUSH_VLAN_ID])
+   if (!tb[TCA_VLAN_PUSH_VLAN_ID]) {
+   if (exists)
+   tcf_hash_release(a, bind);
return -EINVAL;
+   }
push_vid = nla_get_u16(tb[TCA_VLAN_PUSH_VLAN_ID]);
-   if (push_vid >= VLAN_VID_MASK)
+   if (push_vid >= VLAN_VID_MASK) {
+   if (exists)
+   tcf_hash_release(a, bind);
return -ERANGE;
+   }
 
if (tb[TCA_VLAN_PUSH_VLAN_PROTOCOL]) {
push_proto = 
nla_get_be16(tb[TCA_VLAN_PUSH_VLAN_PROTOCOL]);
@@ -114,11 +124,13 @@ static int tcf_vlan_init(struct net *net, struct nlattr 
*nla,
}
break;
default:
+   if (exists)
+   tcf_hash_release(a, bind);
return -EINVAL;
}
action = parm->v_action;
 
-   if (!tcf_hash_check(tn, parm->index, a, bind)) {
+   if (!exists) {
ret = tcf_hash_create(tn, parm->index, est, a,
  sizeof(*v), bind, false);
if (ret)
@@ -126,8 +138,6 @@ static int tcf_vlan_init(struct net *net, struct nlattr 
*nla,
 
ret = ACT_P_CREATED;
} else {
-   if (bind)
-   return 0;
tcf_hash_release(a, bind);
if (!ovr)
return -EEXIST;
-- 
1.9.1



[net PATCH v2 3/6] net sched: mirred action fix late binding

2016-05-10 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

The process below was broken and is fixed with this patch.

//add an mirred action and give it an instance id of 1
sudo tc actions add action mirred egress mirror dev $MDEV  index 1
//create a filter which binds to mirred action id 1
sudo tc filter add dev $DEV parent : protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1

Message before bug fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_mirred.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index e8a760c..8f3948d 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -61,7 +61,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
struct tc_mirred *parm;
struct tcf_mirred *m;
struct net_device *dev;
-   int ret, ok_push = 0;
+   int ret, ok_push = 0, exists = 0;
 
if (nla == NULL)
return -EINVAL;
@@ -71,17 +71,27 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
if (tb[TCA_MIRRED_PARMS] == NULL)
return -EINVAL;
parm = nla_data(tb[TCA_MIRRED_PARMS]);
+
+   exists = tcf_hash_check(tn, parm->index, a, bind);
+   if (exists && bind)
+   return 0;
+
switch (parm->eaction) {
case TCA_EGRESS_MIRROR:
case TCA_EGRESS_REDIR:
break;
default:
+   if (exists)
+   tcf_hash_release(a, bind);
return -EINVAL;
}
if (parm->ifindex) {
dev = __dev_get_by_index(net, parm->ifindex);
-   if (dev == NULL)
+   if (dev == NULL) {
+   if (exists)
+   tcf_hash_release(a, bind);
return -ENODEV;
+   }
switch (dev->type) {
case ARPHRD_TUNNEL:
case ARPHRD_TUNNEL6:
@@ -99,7 +109,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
dev = NULL;
}
 
-   if (!tcf_hash_check(tn, parm->index, a, bind)) {
+   if (!exists) {
if (dev == NULL)
return -EINVAL;
ret = tcf_hash_create(tn, parm->index, est, a,
@@ -108,9 +118,6 @@ static int tcf_mirred_init(struct net *net, struct nlattr 
*nla,
return ret;
ret = ACT_P_CREATED;
} else {
-   if (bind)
-   return 0;
-
tcf_hash_release(a, bind);
if (!ovr)
return -EEXIST;
-- 
1.9.1



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Hannes Frederic Sowa
Hello,

On 10.05.2016 16:29, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 16:11 +0200, Paolo Abeni wrote:
>> Currently, the softirq loop can be scheduled both inside the ksofirqd kernel
>> thread and inside any running process. This makes nearly impossible for the
>> process scheduler to balance in a fair way the amount of time that
>> a given core spends performing the softirq loop.
>>
>> Under high network load, the softirq loop can take nearly 100% of a given 
>> CPU,
>> leaving very little time for use space processing. On single core hosts, this
>> means that the user space can nearly starve; for example super_netperf
>> UDP_STREAM tests towards a remote single core vCPU guest[1] can measure an
>> aggregated throughput of a few thousands pps, and the same behavior can be
>> reproduced even on bare-metal, eventually simulating a single core with 
>> taskset
>> and/or sysfs configuration.
> 
> I hate these patches and ideas guys, sorry. That is before my breakfast,
> but still...

:)

> I have enough hard time dealing with loads where ksoftirqd has to
> compete with user threads that thought that playing with priorities was
> a nice idea.

We tried a lot of approaches so far and this seemed to be the best
architectural RFC we could post. I was quite surprised to see such good
performance numbers with threaded NAPI, thus I think it could be a way
forward.

Your mentioned problem above seems to be a configuration mistake, no?
Otherwise isn't that something user space/cgroups might solve?

> Guess what, when they lose networking they complain.
> 
> We already have ksoftirqd to normally cope with the case you are
> describing.

Indeed, but the time until we wake up ksoftirqd can be already quite
long and for every packet we get in udp_recvmsg the local_bh_enable call
let's us pick up quite a lot of new packets, which we drop before user
space can make any progress. By being more fair between user space and
"napid" we hoped to solve this. We also want more feedback from the
scheduler people, so we Cc'ed them also.

> If it is not working as intended, please identify the bugs and fix them,
> instead of adding yet another tests in fast path and extra complexity in
> the stack.

We could use _local_bh_enable instead of local_bh_enable in udp_recvmsg,
which certainly wouldn't branch down to softirqs as often, but this
feels wrong to me and certainly is.

After the discussion on netdev@ with Peter Hurley here [1] about
"Softirq priority inversion from "softirq: reduce latencies"", we didn't
want to propose some patch looking like this again, but this could help.
The idea would be to limit the number we recheck for softirqs but give
back control to user space.

[1] https://lkml.org/lkml/2016/2/27/152

If I remember local_bh_enable in kernel-rt processes one softirq
directly and defers its work to ksoftirqd much more quickly.

> In the one vcpu case, allowing the user thread to consume more UDP
> packets from the target UDP socket will also make your NIC drop more
> packets, that are not necessarily packets for the same socket.
> 
> So you are shifting the attack to a different target,
> at the expense of more kernel bloat.

I agree here, but I don't think this patch particularly is a lot of
bloat and something very interesting people can play with and extend upon.

Thanks,
Hannes



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread David Miller
From: Paolo Abeni 
Date: Tue, 10 May 2016 22:22:50 +0200

> On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
>> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
>> 
>> > If a single core host is under network flood, i.e. ksoftirqd is
>> > scheduled and it eventually (after processing ~640 packets) will let the
>> > user space process run. The latter will execute a syscall to receive a
>> > packet, which will have to disable/enable bh at least once and that will
>> > cause the processing of another ~640 packets. To receive a single packet
>> > in user space, the kernel has to process more than one thousand packets.
>> 
>> Looks you found the bug then. Have you tried to fix it ?
 ...
> The ksoftirq and the local_bh_enable() design are the root of the
> problem, they need to be touched/affected to solve it.

That's not what I read from your description, processing 640 packets
before going to ksoftirqd seems to the be the absolute root problem.


Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Paolo Abeni
On Tue, 2016-05-10 at 17:57 +0200, Thomas Gleixner wrote:
> On Tue, 10 May 2016, Paolo Abeni wrote:
> 
> Nice patch set and very promising results! 
> 
> > At this point we are not really sure if we should go with this simpler
> > approach by putting NAPI itself into kthreads or leverage the threadirqs
> > function by putting the whole interrupt into a thread and signaling NAPI
> > that it does not reschedule itself in a softirq but to simply run at
> > this particular context of the interrupt handler.
> > 
> > While the threaded irq way seems to better integrate into the kernel and
> > also other devices could move their interrupts into the threads easily
> > on a common policy, we don't know how to really express the necessary
> > knobs with the current device driver model (module parameters, sysfs
> > attributes, etc.). This is where we would like to hear some opinions.
> > NAPI would e.g. have to query the kernel if the particular IRQ/MSI if it
> > should be scheduled in a softirq or in a thread, so we don't have to
> > rewrite all device drivers. This might even be needed on a per rx-queue
> > granularity.
> 
> Utilizing threaded irqs should be halfways simple even without touching the
> device driver at all.
> 
> We can do the switch to threading in two ways:
> 
> 1) Let the driver request the interrupt(s) as it does now and then have a
>/proc/irq/NNN/threaded file which converts it to a threaded interrupt on
>the fly. That should be fairly trivial.
> 
> 2) Let the driver request the interrupt(s) as it does now and retrieve the
>interrupt number which belongs to the device/queue from the network core
>and let the irq core switch it over to threaded.

Thank you for the feedback.

We actually experimented something similar to (2). In our implementation
we needed a per device chunk of code to do the actual irq number ->
queue mapping (and than we performed as well the switch in the device
code).

> You surely need some way to figure out whether the interrupt is threaded when
> you set up the device in order to flag your napi struct, but that should be
> not too hard to achieve.

This is the part that required per device changes and complicated a bit
the implementation. We can research further to simplify it, according to
the overall discussion.

Cheers,

Paolo





Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Paolo Abeni
On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> 
> > If a single core host is under network flood, i.e. ksoftirqd is
> > scheduled and it eventually (after processing ~640 packets) will let the
> > user space process run. The latter will execute a syscall to receive a
> > packet, which will have to disable/enable bh at least once and that will
> > cause the processing of another ~640 packets. To receive a single packet
> > in user space, the kernel has to process more than one thousand packets.
> 
> Looks you found the bug then. Have you tried to fix it ?

The core functionality is implemented in ~100 lines of code, is that
the kind of bloat that do concerns you ?

That could probably be improved removing some code duplication, i.e.
factorizing napi_thread_wait() with irq_wait_for_interrupt() and
possibly napi_threaded_poll() with net_rx_action(). 

If the additional test inside napi_schedule() is really scaring, it can
be guarded with a static_key.

The ksoftirq and the local_bh_enable() design are the root of the
problem, they need to be touched/affected to solve it.

We actually experimented several different options.

Limiting the amount of work performed by local_bh_enable() somewhat
mitigate the issue, but it adds just another kernel parameter difficult
to be tuned.

Running the softirq loop exclusively inside the ksoftirqd will solve the
issue, but this is a very invasive approach, affecting all others
subsystem.

The above can be restricted to the net_rx_action only (i.e. running
net_rx_action always in ksoftirqd context). The related patch isn't
really much simpler than this and will add at least the same number of
additional tests in fast path.

Running the napi loop in a thread that can be migrated gives additional
benefit in the hyper-visor/VM scenario, which can't be achieved
elsewhere.

Would you consider the threaded irq alternative more viable ?

Cheers,

Paolo



[PATCH v3 net-next] tcp: replace cnt & rtt with struct in pkts_acked()

2016-05-10 Thread Lawrence Brakmo
Replace 2 arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This will allow adding more
information without having to modify existing congestion control
modules (tcp_nv in particular needs bytes in flight when packet
was sent).

As proposed by Neal Cardwell in his comments to the tcp_nv patch.

Signed-off-by: Lawrence Brakmo 
Acked-by: Yuchung Cheng 
---
 include/net/tcp.h   |  7 ++-
 net/ipv4/tcp_bic.c  |  6 +++---
 net/ipv4/tcp_cdg.c  | 14 +++---
 net/ipv4/tcp_cubic.c|  6 +++---
 net/ipv4/tcp_htcp.c | 10 +-
 net/ipv4/tcp_illinois.c | 20 ++--
 net/ipv4/tcp_input.c|  8 ++--
 net/ipv4/tcp_lp.c   |  6 +++---
 net/ipv4/tcp_vegas.c|  6 +++---
 net/ipv4/tcp_vegas.h|  2 +-
 net/ipv4/tcp_veno.c |  7 ---
 net/ipv4/tcp_westwood.c |  7 ---
 net/ipv4/tcp_yeah.c |  7 ---
 13 files changed, 59 insertions(+), 47 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 24ec804..dc588c3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -849,6 +849,11 @@ enum tcp_ca_ack_event_flags {
 
 union tcp_cc_info;
 
+struct ack_sample {
+   u32 pkts_acked;
+   s32 rtt_us;
+};
+
 struct tcp_congestion_ops {
struct list_headlist;
u32 key;
@@ -872,7 +877,7 @@ struct tcp_congestion_ops {
/* new value of cwnd after loss (optional) */
u32  (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
-   void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
+   void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_bic.c b/net/ipv4/tcp_bic.c
index fd1405d..36087bc 100644
--- a/net/ipv4/tcp_bic.c
+++ b/net/ipv4/tcp_bic.c
@@ -197,15 +197,15 @@ static void bictcp_state(struct sock *sk, u8 new_state)
 /* Track delayed acknowledgment ratio using sliding window
  * ratio = (15*ratio + sample) / 16
  */
-static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt)
+static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
 
if (icsk->icsk_ca_state == TCP_CA_Open) {
struct bictcp *ca = inet_csk_ca(sk);
 
-   cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT;
-   ca->delayed_ack += cnt;
+   ca->delayed_ack += sample->pkts_acked -
+   (ca->delayed_ack >> ACK_RATIO_SHIFT);
}
 }
 
diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index ccce8a5..03725b2 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -294,12 +294,12 @@ static void tcp_cdg_cong_avoid(struct sock *sk, u32 ack, 
u32 acked)
ca->shadow_wnd = max(ca->shadow_wnd, ca->shadow_wnd + incr);
 }
 
-static void tcp_cdg_acked(struct sock *sk, u32 num_acked, s32 rtt_us)
+static void tcp_cdg_acked(struct sock *sk, const struct ack_sample *sample)
 {
struct cdg *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (rtt_us <= 0)
+   if (sample->rtt_us <= 0)
return;
 
/* A heuristic for filtering delayed ACKs, adapted from:
@@ -307,20 +307,20 @@ static void tcp_cdg_acked(struct sock *sk, u32 num_acked, 
s32 rtt_us)
 * delay and rate based TCP mechanisms." TR 100219A. CAIA, 2010.
 */
if (tp->sacked_out == 0) {
-   if (num_acked == 1 && ca->delack) {
+   if (sample->pkts_acked == 1 && ca->delack) {
/* A delayed ACK is only used for the minimum if it is
 * provenly lower than an existing non-zero minimum.
 */
-   ca->rtt.min = min(ca->rtt.min, rtt_us);
+   ca->rtt.min = min(ca->rtt.min, sample->rtt_us);
ca->delack--;
return;
-   } else if (num_acked > 1 && ca->delack < 5) {
+   } else if (sample->pkts_acked > 1 && ca->delack < 5) {
ca->delack++;
}
}
 
-   ca->rtt.min = min_not_zero(ca->rtt.min, rtt_us);
-   ca->rtt.max = max(ca->rtt.max, rtt_us);
+   ca->rtt.min = min_not_zero(ca->rtt.min, sample->rtt_us);
+   ca->rtt.max = max(ca->rtt.max, sample->rtt_us);
 }
 
 static u32 tcp_cdg_ssthresh(struct sock *sk)
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 0ce946e..c99230e 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -437,21 +437,21 @@ static void hystart_update(struct sock *sk, u32 delay)
 /* Track delayed acknowledgment ratio using sliding window
  * ratio = (15*ratio + sample) / 16
  */
-static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us)

Re: [PATCH net-next] skbuff: remove unused variable `doff'

2016-05-10 Thread David Miller
From: Sowmini Varadhan 
Date: Tue, 10 May 2016 12:38:08 -0400

> There are two instances of an unused variable, `doff' added by
> commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function")
> in pskb_carve_inside_header() and pskb_carve_inside_nonlinear().
> Remove these instances, they are not used.
> 
> Reported by: Daniel Borkmann 
> Signed-off-by: Sowmini Varadhan 

Applied.


Re: [PATCH net-next v2] ila: ipv6/ila: fix nlsize calculation for lwtunnel

2016-05-10 Thread David Miller
From: Nicolas Dichtel 
Date: Tue, 10 May 2016 11:56:32 +0200

> From: Tom Herbert 
> 
> The handler 'ila_fill_encap_info' adds two attributes: ILA_ATTR_LOCATOR
> and ILA_ATTR_CSUM_MODE.
> 
> nla_total_size_64bit() must be use for ILA_ATTR_LOCATOR.
> 
> Also, do nla_put_u8 instead of nla_put_u64 for ILA_ATTR_CSUM_MODE.
> 
> Fixes: f13a82d87b21 ("ipv6: use nla_put_u64_64bit()")
> Fixes: 90bfe662db13 ("ila: add checksum neutral ILA translations")
> Reported-by: Nicolas Dichtel 
> Signed-off-by: Tom Herbert 
> Signed-off-by: Nicolas Dichtel 

Applied.


Re: [PATCH] net: phylib: fix interrupts re-enablement in phy_start

2016-05-10 Thread David Miller
From: 
Date: Tue, 10 May 2016 17:42:26 +0800

> From: Shaohui Xie 
> 
> If phy was suspended and is starting, current driver always enable
> phy's interrupts, if phy works in polling, phy can raise unexpected
> interrupt which will not be handled, the interrupt will block system
> enter suspend again. So interrupts should only be re-enabled if phy
> works in interrupt.
> 
> Signed-off-by: Shaohui Xie 

Applied.


Re: [PATCH net] tcp: refresh skb timestamp at retransmit time

2016-05-10 Thread David Miller
From: Eric Dumazet 
Date: Mon, 09 May 2016 20:55:16 -0700

> From: Eric Dumazet 
> 
> In the very unlikely case __tcp_retransmit_skb() can not use the cloning
> done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing
> the copy and transmit, otherwise TCP TS val will be an exact copy of
> original transmit.
> 
> Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
> Signed-off-by: Eric Dumazet 
> Cc: Yuchung Cheng 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next] gtp: reload GTPv1 header after pskb_may_pull()

2016-05-10 Thread David Miller
From: Pablo Neira Ayuso 
Date: Tue, 10 May 2016 21:33:38 +0200

> The GTPv1 header flags indicate the presence of optional extensions
> after this header. Refresh the pointer to the GTPv1 header as skb->head
> might have be reallocated via pskb_may_pull().
> 
> Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling 
> Protocol (GTP-U)")
> Reported-by: Eric Dumazet 
> Signed-off-by: Pablo Neira Ayuso 

Applied.


Re: [PATCH 5/6] drivers: net: xgene: Using static MSS values

2016-05-10 Thread David Miller
From: Iyappan Subramanian 
Date: Mon,  9 May 2016 17:04:15 -0700

> Due to the nature of hardware design for TSO, if the MSS values that are
> stored in the register, changes during TSO operation, data corruption may
> occur.
> 
> This patch fixes the issue by using one of the predefined MSS values.
> 
> Signed-off-by: Iyappan Subramanian 
> Tested-by: Toan Le 

This is a very serious quality of implementation issue.

And it could quietly kill performance for some users, and there is no
way for them to find out that this is happening.

If you use a predefined set of MSS values, if the MSS value we need is
between two of them then there is going to be wasted space on the
wire.  It's can therefore certainly be better to not do TSO in that
case.

I think you absolutely need to disable TSO by default, and require the
user to explicitly turn it on, unless you can fix this problem in
another way.

Thanks.


[PATCH net-next 2/2] net: dsa: mv88e6xxx: add STU capability

2016-05-10 Thread Vivien Didelot
Some switch models have a STU (per VLAN port state database). Add a new
capability flag to switches info, instead of checking their family.

Also if the 6165 family has an STU, it must have a VTU, so add the
MV88E6XXX_FLAG_VTU to its family flags.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 14 ++
 drivers/net/dsa/mv88e6xxx.h | 16 ++--
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 92be27d..835126e 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -453,16 +453,6 @@ static bool mv88e6xxx_has_fid_reg(struct 
mv88e6xxx_priv_state *ps)
return false;
 }
 
-static bool mv88e6xxx_has_stu(struct mv88e6xxx_priv_state *ps)
-{
-   /* Does the device have STU and dedicated SID registers for VTU ops? */
-   if (mv88e6xxx_6097_family(ps) || mv88e6xxx_6165_family(ps) ||
-   mv88e6xxx_6351_family(ps) || mv88e6xxx_6352_family(ps))
-   return true;
-
-   return false;
-}
-
 /* We expect the switch to perform auto negotiation if there is a real
  * phy. However, in the case of a fixed link phy, we force the port
  * settings from the fixed link settings.
@@ -1599,7 +1589,7 @@ static int _mv88e6xxx_vtu_getnext(struct 
mv88e6xxx_priv_state *ps,
next.fid |= ret & 0xf;
}
 
-   if (mv88e6xxx_has_stu(ps)) {
+   if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_STU)) {
ret = _mv88e6xxx_reg_read(ps, REG_GLOBAL,
  GLOBAL_VTU_SID);
if (ret < 0)
@@ -1686,7 +1676,7 @@ static int _mv88e6xxx_vtu_loadpurge(struct 
mv88e6xxx_priv_state *ps,
if (ret < 0)
return ret;
 
-   if (mv88e6xxx_has_stu(ps)) {
+   if (mv88e6xxx_has(ps, MV88E6XXX_FLAG_STU)) {
reg = entry->sid & GLOBAL_VTU_SID_MASK;
ret = _mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_VTU_SID, reg);
if (ret < 0)
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index ca69a93..5f09a4e 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -403,6 +403,12 @@ enum mv88e6xxx_cap {
 */
MV88E6XXX_CAP_SMI_PHY,
 
+   /* Per VLAN Spanning Tree Unit (STU).
+* The Port State database, if present, is accessed through VTU
+* operations and dedicated SID registers. See GLOBAL_VTU_SID.
+*/
+   MV88E6XXX_CAP_STU,
+
/* Switch MAC/WoL/WoF register.
 * This requires an indirect access to set the switch MAC address
 * through GLOBAL2_SWITCH_MAC, otherwise GLOBAL_MAC_01, GLOBAL_MAC_23,
@@ -436,6 +442,7 @@ enum mv88e6xxx_cap {
 #define MV88E6XXX_FLAG_PPU BIT(MV88E6XXX_CAP_PPU)
 #define MV88E6XXX_FLAG_PPU_ACTIVE  BIT(MV88E6XXX_CAP_PPU_ACTIVE)
 #define MV88E6XXX_FLAG_SMI_PHY BIT(MV88E6XXX_CAP_SMI_PHY)
+#define MV88E6XXX_FLAG_STU BIT(MV88E6XXX_CAP_STU)
 #define MV88E6XXX_FLAG_SWITCH_MAC  BIT(MV88E6XXX_CAP_SWITCH_MAC_WOL_WOF)
 #define MV88E6XXX_FLAG_TEMPBIT(MV88E6XXX_CAP_TEMP)
 #define MV88E6XXX_FLAG_TEMP_LIMIT  BIT(MV88E6XXX_CAP_TEMP_LIMIT)
@@ -451,12 +458,15 @@ enum mv88e6xxx_cap {
 #define MV88E6XXX_FLAGS_FAMILY_6097\
(MV88E6XXX_FLAG_ATU |   \
 MV88E6XXX_FLAG_PPU |   \
+MV88E6XXX_FLAG_STU |   \
 MV88E6XXX_FLAG_VLANTABLE | \
 MV88E6XXX_FLAG_VTU)
 
 #define MV88E6XXX_FLAGS_FAMILY_6165\
-   (MV88E6XXX_FLAG_SWITCH_MAC |\
-MV88E6XXX_FLAG_TEMP)
+   (MV88E6XXX_FLAG_STU |   \
+MV88E6XXX_FLAG_SWITCH_MAC |\
+MV88E6XXX_FLAG_TEMP |  \
+MV88E6XXX_FLAG_VTU)
 
 #define MV88E6XXX_FLAGS_FAMILY_6185\
(MV88E6XXX_FLAG_ATU |   \
@@ -482,6 +492,7 @@ enum mv88e6xxx_cap {
 MV88E6XXX_FLAG_PORTSTATE | \
 MV88E6XXX_FLAG_PPU_ACTIVE |\
 MV88E6XXX_FLAG_SMI_PHY |   \
+MV88E6XXX_FLAG_STU |   \
 MV88E6XXX_FLAG_SWITCH_MAC |\
 MV88E6XXX_FLAG_TEMP |  \
 MV88E6XXX_FLAG_VLANTABLE | \
@@ -494,6 +505,7 @@ enum mv88e6xxx_cap {
 MV88E6XXX_FLAG_PORTSTATE | \
 MV88E6XXX_FLAG_PPU_ACTIVE |\
 MV88E6XXX_FLAG_SMI_PHY |   \
+MV88E6XXX_FLAG_STU |   \
 MV88E6XXX_FLAG_SWITCH_MAC |\
 MV88E6XXX_FLAG_TEMP |  \
 MV88E6XXX_FLAG_TEMP_LIMIT |\
-- 
2.8.2



[PATCH net-next 1/2] net: dsa: mv88e6xxx: abstract VTU/STU data access

2016-05-10 Thread Vivien Didelot
Both VTU and STU operations use the same routine to access their
(common) data registers, with a different offset.

Add VTU and STU specific read and write functions to the data registers
to abstract the required offset.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 32 
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 1e5ca8e..92be27d 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1498,6 +1498,18 @@ static int _mv88e6xxx_vtu_stu_data_read(struct 
mv88e6xxx_priv_state *ps,
return 0;
 }
 
+static int mv88e6xxx_vtu_data_read(struct mv88e6xxx_priv_state *ps,
+  struct mv88e6xxx_vtu_stu_entry *entry)
+{
+   return _mv88e6xxx_vtu_stu_data_read(ps, entry, 0);
+}
+
+static int mv88e6xxx_stu_data_read(struct mv88e6xxx_priv_state *ps,
+  struct mv88e6xxx_vtu_stu_entry *entry)
+{
+   return _mv88e6xxx_vtu_stu_data_read(ps, entry, 2);
+}
+
 static int _mv88e6xxx_vtu_stu_data_write(struct mv88e6xxx_priv_state *ps,
 struct mv88e6xxx_vtu_stu_entry *entry,
 unsigned int nibble_offset)
@@ -1523,6 +1535,18 @@ static int _mv88e6xxx_vtu_stu_data_write(struct 
mv88e6xxx_priv_state *ps,
return 0;
 }
 
+static int mv88e6xxx_vtu_data_write(struct mv88e6xxx_priv_state *ps,
+   struct mv88e6xxx_vtu_stu_entry *entry)
+{
+   return _mv88e6xxx_vtu_stu_data_write(ps, entry, 0);
+}
+
+static int mv88e6xxx_stu_data_write(struct mv88e6xxx_priv_state *ps,
+   struct mv88e6xxx_vtu_stu_entry *entry)
+{
+   return _mv88e6xxx_vtu_stu_data_write(ps, entry, 2);
+}
+
 static int _mv88e6xxx_vtu_vid_write(struct mv88e6xxx_priv_state *ps, u16 vid)
 {
return _mv88e6xxx_reg_write(ps, REG_GLOBAL, GLOBAL_VTU_VID,
@@ -1551,7 +1575,7 @@ static int _mv88e6xxx_vtu_getnext(struct 
mv88e6xxx_priv_state *ps,
next.valid = !!(ret & GLOBAL_VTU_VID_VALID);
 
if (next.valid) {
-   ret = _mv88e6xxx_vtu_stu_data_read(ps, , 0);
+   ret = mv88e6xxx_vtu_data_read(ps, );
if (ret < 0)
return ret;
 
@@ -1658,7 +1682,7 @@ static int _mv88e6xxx_vtu_loadpurge(struct 
mv88e6xxx_priv_state *ps,
goto loadpurge;
 
/* Write port member tags */
-   ret = _mv88e6xxx_vtu_stu_data_write(ps, entry, 0);
+   ret = mv88e6xxx_vtu_data_write(ps, entry);
if (ret < 0)
return ret;
 
@@ -1724,7 +1748,7 @@ static int _mv88e6xxx_stu_getnext(struct 
mv88e6xxx_priv_state *ps, u8 sid,
next.valid = !!(ret & GLOBAL_VTU_VID_VALID);
 
if (next.valid) {
-   ret = _mv88e6xxx_vtu_stu_data_read(ps, , 2);
+   ret = mv88e6xxx_stu_data_read(ps, );
if (ret < 0)
return ret;
}
@@ -1747,7 +1771,7 @@ static int _mv88e6xxx_stu_loadpurge(struct 
mv88e6xxx_priv_state *ps,
goto loadpurge;
 
/* Write port states */
-   ret = _mv88e6xxx_vtu_stu_data_write(ps, entry, 2);
+   ret = mv88e6xxx_stu_data_write(ps, entry);
if (ret < 0)
return ret;
 
-- 
2.8.2



[PATCH 14/17] batman-adv: Use kref_get for _batadv_update_route

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

_batadv_update_route requires that the caller already has a valid reference
for neigh_node. It is therefore not possible that it has an reference
counter of 0 and was still given to this function

The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/routing.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
index 2ecfca246be4..b494e435686f 100644
--- a/net/batman-adv/routing.c
+++ b/net/batman-adv/routing.c
@@ -100,10 +100,6 @@ static void _batadv_update_route(struct batadv_priv 
*bat_priv,
if (curr_router)
batadv_neigh_node_put(curr_router);
 
-   /* increase refcount of new best neighbor */
-   if (neigh_node && !kref_get_unless_zero(_node->refcount))
-   neigh_node = NULL;
-
spin_lock_bh(_node->neigh_list_lock);
/* curr_router used earlier may not be the current orig_ifinfo->router
 * anymore because it was dereferenced outside of the neigh_list_lock
@@ -114,6 +110,10 @@ static void _batadv_update_route(struct batadv_priv 
*bat_priv,
 */
curr_router = rcu_dereference_protected(orig_ifinfo->router, true);
 
+   /* increase refcount of new best neighbor */
+   if (neigh_node)
+   kref_get(_node->refcount);
+
rcu_assign_pointer(orig_ifinfo->router, neigh_node);
spin_unlock_bh(_node->neigh_list_lock);
batadv_orig_ifinfo_put(orig_ifinfo);
-- 
2.8.2



[PATCH 15/17] batman-adv: Use bool as return type for boolean functions

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

It is easier to understand that the returned value of a specific function
doesn't have to be 0 when the functions was successful when the actual
return type is bool. This is especially true when all surrounding functions
with return type int use negative values to return the error code.

Reported-by: Nicholas Krause 
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bat_iv_ogm.c|  23 ++---
 net/batman-adv/bitarray.c  |  16 +--
 net/batman-adv/bitarray.h  |  15 +--
 net/batman-adv/bridge_loop_avoidance.c | 175 +
 net/batman-adv/bridge_loop_avoidance.h |  43 
 net/batman-adv/debugfs.c   |   2 +-
 net/batman-adv/distributed-arp-table.c |   6 +-
 net/batman-adv/hard-interface.c|  15 ++-
 net/batman-adv/hash.h  |   6 +-
 net/batman-adv/main.h  |   2 +-
 net/batman-adv/network-coding.c|  12 +--
 net/batman-adv/originator.c|   4 +-
 net/batman-adv/originator.h|   2 +-
 net/batman-adv/routing.c   |  37 +++
 net/batman-adv/routing.h   |   6 +-
 net/batman-adv/soft-interface.c|   6 +-
 net/batman-adv/soft-interface.h|   3 +-
 net/batman-adv/translation-table.c |  31 +++---
 18 files changed, 205 insertions(+), 199 deletions(-)

diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c
index eb3435de54b5..7f98a9d39883 100644
--- a/net/batman-adv/bat_iv_ogm.c
+++ b/net/batman-adv/bat_iv_ogm.c
@@ -1168,13 +1168,13 @@ out:
  * @if_incoming: interface where the packet was received
  * @if_outgoing: interface for which the retransmission should be considered
  *
- * Return: 1 if the link can be considered bidirectional, 0 otherwise
+ * Return: true if the link can be considered bidirectional, false otherwise
  */
-static int batadv_iv_ogm_calc_tq(struct batadv_orig_node *orig_node,
-struct batadv_orig_node *orig_neigh_node,
-struct batadv_ogm_packet *batadv_ogm_packet,
-struct batadv_hard_iface *if_incoming,
-struct batadv_hard_iface *if_outgoing)
+static bool batadv_iv_ogm_calc_tq(struct batadv_orig_node *orig_node,
+ struct batadv_orig_node *orig_neigh_node,
+ struct batadv_ogm_packet *batadv_ogm_packet,
+ struct batadv_hard_iface *if_incoming,
+ struct batadv_hard_iface *if_outgoing)
 {
struct batadv_priv *bat_priv = netdev_priv(if_incoming->soft_iface);
struct batadv_neigh_node *neigh_node = NULL, *tmp_neigh_node;
@@ -1182,9 +1182,10 @@ static int batadv_iv_ogm_calc_tq(struct batadv_orig_node 
*orig_node,
u8 total_count;
u8 orig_eq_count, neigh_rq_count, neigh_rq_inv, tq_own;
unsigned int neigh_rq_inv_cube, neigh_rq_max_cube;
-   int tq_asym_penalty, inv_asym_penalty, if_num, ret = 0;
+   int tq_asym_penalty, inv_asym_penalty, if_num;
unsigned int combined_tq;
int tq_iface_penalty;
+   bool ret = false;
 
/* find corresponding one hop neighbor */
rcu_read_lock();
@@ -1296,7 +1297,7 @@ static int batadv_iv_ogm_calc_tq(struct batadv_orig_node 
*orig_node,
 * consider it bidirectional
 */
if (batadv_ogm_packet->tq >= BATADV_TQ_TOTAL_BIDRECT_LIMIT)
-   ret = 1;
+   ret = true;
 
 out:
if (neigh_node)
@@ -1325,9 +1326,9 @@ batadv_iv_ogm_update_seqnos(const struct ethhdr *ethhdr,
struct batadv_orig_ifinfo *orig_ifinfo = NULL;
struct batadv_neigh_node *neigh_node;
struct batadv_neigh_ifinfo *neigh_ifinfo;
-   int is_dup;
+   bool is_dup;
s32 seq_diff;
-   int need_update = 0;
+   bool need_update = false;
int set_mark;
enum batadv_dup_status ret = BATADV_NO_DUP;
u32 seqno = ntohl(batadv_ogm_packet->seqno);
@@ -1437,7 +1438,7 @@ batadv_iv_ogm_process_per_outif(const struct sk_buff 
*skb, int ogm_offset,
struct sk_buff *skb_priv;
struct ethhdr *ethhdr;
u8 *prev_sender;
-   int is_bidirect;
+   bool is_bidirect;
 
/* create a private copy of the skb, as some functions change tq value
 * and/or flags.
diff --git a/net/batman-adv/bitarray.c b/net/batman-adv/bitarray.c
index b56bb000a0ab..a0c7913837a5 100644
--- a/net/batman-adv/bitarray.c
+++ b/net/batman-adv/bitarray.c
@@ -38,11 +38,11 @@ static void batadv_bitmap_shift_left(unsigned long 
*seq_bits, s32 n)
  *  the last sequence number
  * @set_mark: whether this packet should be marked in seq_bits
  *
- * Return: 1 if the window was moved (either new or very 

[PATCH 10/17] batman-adv: Use kref_get for batadv_nc_get_nc_node

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

batadv_nc_get_nc_node requires that the caller already has a valid
reference for orig_neigh_node. It is therefore not possible that it has an
reference counter of 0 and was still given to this function

The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/network-coding.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c
index 1da8e0e1b18f..953dff1ad43b 100644
--- a/net/batman-adv/network-coding.c
+++ b/net/batman-adv/network-coding.c
@@ -856,8 +856,7 @@ batadv_nc_get_nc_node(struct batadv_priv *bat_priv,
if (!nc_node)
return NULL;
 
-   if (!kref_get_unless_zero(_neigh_node->refcount))
-   goto free;
+   kref_get(_neigh_node->refcount);
 
/* Initialize nc_node */
INIT_LIST_HEAD(_node->list);
@@ -884,10 +883,6 @@ batadv_nc_get_nc_node(struct batadv_priv *bat_priv,
spin_unlock_bh(lock);
 
return nc_node;
-
-free:
-   kfree(nc_node);
-   return NULL;
 }
 
 /**
-- 
2.8.2



[PATCH 13/17] batman-adv: Use kref_get for hard_iface subfunctions

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

The callers of the functions using batadv_hard_iface objects already make
sure that they hold a valid reference. The subfunctions don't have
to check whether the reference counter is > 0 because this was checked by
the callers.

The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bat_iv_ogm.c | 14 +++---
 net/batman-adv/hard-interface.c |  7 +++
 net/batman-adv/originator.c | 30 +++---
 3 files changed, 13 insertions(+), 38 deletions(-)

diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c
index 57e9962c7090..eb3435de54b5 100644
--- a/net/batman-adv/bat_iv_ogm.c
+++ b/net/batman-adv/bat_iv_ogm.c
@@ -681,18 +681,12 @@ static void batadv_iv_ogm_aggregate_new(const unsigned 
char *packet_buff,
unsigned char *skb_buff;
unsigned int skb_size;
 
-   if (!kref_get_unless_zero(_incoming->refcount))
-   return;
-
-   if (!kref_get_unless_zero(_outgoing->refcount))
-   goto out_free_incoming;
-
/* own packet should always be scheduled */
if (!own_packet) {
if (!batadv_atomic_dec_not_zero(_priv->batman_queue_left)) {
batadv_dbg(BATADV_DBG_BATMAN, bat_priv,
   "batman packet queue full\n");
-   goto out_free_outgoing;
+   return;
}
}
 
@@ -718,6 +712,8 @@ static void batadv_iv_ogm_aggregate_new(const unsigned char 
*packet_buff,
forw_packet_aggr->packet_len = packet_len;
memcpy(skb_buff, packet_buff, packet_len);
 
+   kref_get(_incoming->refcount);
+   kref_get(_outgoing->refcount);
forw_packet_aggr->own = own_packet;
forw_packet_aggr->if_incoming = if_incoming;
forw_packet_aggr->if_outgoing = if_outgoing;
@@ -747,10 +743,6 @@ out_free_forw_packet:
 out_nomem:
if (!own_packet)
atomic_inc(_priv->batman_queue_left);
-out_free_outgoing:
-   batadv_hardif_put(if_outgoing);
-out_free_incoming:
-   batadv_hardif_put(if_incoming);
 }
 
 /* aggregate a new packet into the existing ogm packet */
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index d3d37f3f99cf..7c1d8d7ac548 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -236,8 +236,8 @@ static void batadv_primary_if_select(struct batadv_priv 
*bat_priv,
 
ASSERT_RTNL();
 
-   if (new_hard_iface && !kref_get_unless_zero(_hard_iface->refcount))
-   new_hard_iface = NULL;
+   if (new_hard_iface)
+   kref_get(_hard_iface->refcount);
 
curr_hard_iface = rcu_dereference_protected(bat_priv->primary_if, 1);
rcu_assign_pointer(bat_priv->primary_if, new_hard_iface);
@@ -467,8 +467,7 @@ int batadv_hardif_enable_interface(struct batadv_hard_iface 
*hard_iface,
if (hard_iface->if_status != BATADV_IF_NOT_IN_USE)
goto out;
 
-   if (!kref_get_unless_zero(_iface->refcount))
-   goto out;
+   kref_get(_iface->refcount);
 
soft_iface = dev_get_by_name(net, iface_name);
 
diff --git a/net/batman-adv/originator.c b/net/batman-adv/originator.c
index 2ed2cc89a669..04fa139911c3 100644
--- a/net/batman-adv/originator.c
+++ b/net/batman-adv/originator.c
@@ -374,12 +374,8 @@ batadv_orig_ifinfo_new(struct batadv_orig_node *orig_node,
if (!orig_ifinfo)
goto out;
 
-   if (if_outgoing != BATADV_IF_DEFAULT &&
-   !kref_get_unless_zero(_outgoing->refcount)) {
-   kfree(orig_ifinfo);
-   orig_ifinfo = NULL;
-   goto out;
-   }
+   if (if_outgoing != BATADV_IF_DEFAULT)
+   kref_get(_outgoing->refcount);
 
reset_time = jiffies - 1;
reset_time -= msecs_to_jiffies(BATADV_RESET_PROTECTION_MS);
@@ -455,11 +451,8 @@ batadv_neigh_ifinfo_new(struct batadv_neigh_node *neigh,
if (!neigh_ifinfo)
goto out;
 
-   if (if_outgoing && !kref_get_unless_zero(_outgoing->refcount)) {
-   kfree(neigh_ifinfo);
-   neigh_ifinfo = NULL;
-   goto out;
-   }
+   if (if_outgoing)
+   kref_get(_outgoing->refcount);
 
INIT_HLIST_NODE(_ifinfo->list);
kref_init(_ifinfo->refcount);
@@ -532,15 +525,11 @@ batadv_hardif_neigh_create(struct batadv_hard_iface 
*hard_iface,
if (hardif_neigh)
goto out;
 
-   if (!kref_get_unless_zero(_iface->refcount))
-   goto out;
-
hardif_neigh = 

[PATCH 17/17] batman-adv: use batadv_compare_eth when possible

2016-05-10 Thread Antonio Quartulli
When comparing Ethernet address it is better to use the more
generic batadv_compare_eth. The latter is also optimised for
architectures having a fast unaligned access.

Signed-off-by: Antonio Quartulli 
[s...@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
---
 net/batman-adv/network-coding.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c
index df5ae9c7e507..678f06865312 100644
--- a/net/batman-adv/network-coding.c
+++ b/net/batman-adv/network-coding.c
@@ -521,12 +521,10 @@ static bool batadv_nc_hash_compare(const struct 
hlist_node *node,
nc_path2 = data2;
 
/* Return 1 if the two keys are identical */
-   if (memcmp(nc_path1->prev_hop, nc_path2->prev_hop,
-  sizeof(nc_path1->prev_hop)) != 0)
+   if (!batadv_compare_eth(nc_path1->prev_hop, nc_path2->prev_hop))
return false;
 
-   if (memcmp(nc_path1->next_hop, nc_path2->next_hop,
-  sizeof(nc_path1->next_hop)) != 0)
+   if (!batadv_compare_eth(nc_path1->next_hop, nc_path2->next_hop))
return false;
 
return true;
-- 
2.8.2



[PATCH 16/17] batman-adv: replace ethertype variable with ETH_P_BATMAN for readability

2016-05-10 Thread Antonio Quartulli
From: Marek Lindner 

Signed-off-by: Marek Lindner 
Reviewed-by: Sven Eckelmann 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/soft-interface.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index 3a0fc3c18444..343d2c904399 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -186,7 +186,6 @@ static int batadv_interface_tx(struct sk_buff *skb,
struct batadv_priv *bat_priv = netdev_priv(soft_iface);
struct batadv_hard_iface *primary_if = NULL;
struct batadv_bcast_packet *bcast_packet;
-   __be16 ethertype = htons(ETH_P_BATMAN);
static const u8 stp_addr[ETH_ALEN] = {0x01, 0x80, 0xC2, 0x00,
  0x00, 0x00};
static const u8 ectp_addr[ETH_ALEN] = {0xCF, 0x00, 0x00, 0x00,
@@ -216,7 +215,8 @@ static int batadv_interface_tx(struct sk_buff *skb,
case ETH_P_8021Q:
vhdr = vlan_eth_hdr(skb);
 
-   if (vhdr->h_vlan_encapsulated_proto != ethertype) {
+   /* drop batman-in-batman packets to prevent loops */
+   if (vhdr->h_vlan_encapsulated_proto != htons(ETH_P_BATMAN)) {
network_offset += VLAN_HLEN;
break;
}
@@ -404,7 +404,6 @@ void batadv_interface_rx(struct net_device *soft_iface,
 {
struct batadv_bcast_packet *batadv_bcast_packet;
struct batadv_priv *bat_priv = netdev_priv(soft_iface);
-   __be16 ethertype = htons(ETH_P_BATMAN);
struct vlan_ethhdr *vhdr;
struct ethhdr *ethhdr;
unsigned short vid;
@@ -434,7 +433,8 @@ void batadv_interface_rx(struct net_device *soft_iface,
 
vhdr = (struct vlan_ethhdr *)skb->data;
 
-   if (vhdr->h_vlan_encapsulated_proto != ethertype)
+   /* drop batman-in-batman packets to prevent loops */
+   if (vhdr->h_vlan_encapsulated_proto != htons(ETH_P_BATMAN))
break;
 
/* fall through */
-- 
2.8.2



[PATCH 12/17] batman-adv: Use kref_get for batadv_gw_node_add

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

batadv_gw_node_add requires that the caller already has a valid reference
for orig_node. It is therefore not possible that it has an reference
counter of 0 and was still given to this function

The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/gateway_client.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c
index bb1c4f37716e..5839c569f769 100644
--- a/net/batman-adv/gateway_client.c
+++ b/net/batman-adv/gateway_client.c
@@ -440,15 +440,11 @@ static void batadv_gw_node_add(struct batadv_priv 
*bat_priv,
if (gateway->bandwidth_down == 0)
return;
 
-   if (!kref_get_unless_zero(_node->refcount))
-   return;
-
gw_node = kzalloc(sizeof(*gw_node), GFP_ATOMIC);
-   if (!gw_node) {
-   batadv_orig_node_put(orig_node);
+   if (!gw_node)
return;
-   }
 
+   kref_get(_node->refcount);
INIT_HLIST_NODE(_node->list);
gw_node->orig_node = orig_node;
gw_node->bandwidth_down = ntohl(gateway->bandwidth_down);
-- 
2.8.2



[PATCH 11/17] batman-adv: Use kref_get for batadv_gw_select

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

batadv_gw_select requires that the caller already has a valid reference for
new_gw_node. It is therefore not possible that it has an reference counter
of 0 and was still given to this function

The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/gateway_client.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c
index c59aff5ccac8..bb1c4f37716e 100644
--- a/net/batman-adv/gateway_client.c
+++ b/net/batman-adv/gateway_client.c
@@ -135,8 +135,8 @@ static void batadv_gw_select(struct batadv_priv *bat_priv,
 
spin_lock_bh(_priv->gw.list_lock);
 
-   if (new_gw_node && !kref_get_unless_zero(_gw_node->refcount))
-   new_gw_node = NULL;
+   if (new_gw_node)
+   kref_get(_gw_node->refcount);
 
curr_gw_node = rcu_dereference_protected(bat_priv->gw.curr_gw, 1);
rcu_assign_pointer(bat_priv->gw.curr_gw, new_gw_node);
-- 
2.8.2



[PATCH net-next] gtp: reload GTPv1 header after pskb_may_pull()

2016-05-10 Thread Pablo Neira Ayuso
The GTPv1 header flags indicate the presence of optional extensions
after this header. Refresh the pointer to the GTPv1 header as skb->head
might have be reallocated via pskb_may_pull().

Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling 
Protocol (GTP-U)")
Reported-by: Eric Dumazet 
Signed-off-by: Pablo Neira Ayuso 
---
 drivers/net/gtp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
index 8ce1104..f7caf1e 100644
--- a/drivers/net/gtp.c
+++ b/drivers/net/gtp.c
@@ -253,6 +253,8 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb,
if (!pskb_may_pull(skb, hdrlen))
return -1;
 
+   gtp1 = (struct gtp1_header *)(skb->data + sizeof(struct udphdr));
+
rcu_read_lock();
pctx = gtp1_pdp_find(gtp, ntohl(gtp1->tid));
if (!pctx) {
-- 
2.1.4



[PATCH 05/17] batman-adv: add detection for complex bridge loops

2016-05-10 Thread Antonio Quartulli
From: Simon Wunderlich 

There are network setups where the current bridge loop avoidance can't
detect bridge loops. The minimal setup affected would consist of two
LANs and two separate meshes, connected in a ring like that:

   A...(mesh1)...B
   | |
 (LAN1)(LAN2)
   | |
   C...(mesh2)...D

Since both the meshes and backbones are separate, the bridge loop
avoidance has not enough information to detect and avoid the loop
in this case. Even if these scenarios can't be fixed easily,
these kind of loops can be detected.

This patch implements a periodic check (running every 60 seconds for
now) which sends a broadcast frame with a random MAC address on
each backbone VLAN. If a broadcast frame with the same MAC address
is received shortly after on the mesh, we know that there must be a
loop and report that incident as well as throw an uevent to let others
handle that problem.

Signed-off-by: Simon Wunderlich 
[s...@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bridge_loop_avoidance.c | 139 +
 net/batman-adv/main.h  |   4 +
 net/batman-adv/packet.h|   1 +
 net/batman-adv/sysfs.c |   6 +-
 net/batman-adv/types.h |   8 ++
 5 files changed, 156 insertions(+), 2 deletions(-)

diff --git a/net/batman-adv/bridge_loop_avoidance.c 
b/net/batman-adv/bridge_loop_avoidance.c
index 2c9aa671a49b..5064ae5e9b34 100644
--- a/net/batman-adv/bridge_loop_avoidance.c
+++ b/net/batman-adv/bridge_loop_avoidance.c
@@ -50,6 +50,7 @@
 #include "hash.h"
 #include "originator.h"
 #include "packet.h"
+#include "sysfs.h"
 #include "translation-table.h"
 
 static const u8 batadv_announce_mac[4] = {0x43, 0x05, 0x43, 0x05};
@@ -407,6 +408,14 @@ static void batadv_bla_send_claim(struct batadv_priv 
*bat_priv, u8 *mac,
   ethhdr->h_source, ethhdr->h_dest,
   BATADV_PRINT_VID(vid));
break;
+   case BATADV_CLAIM_TYPE_LOOPDETECT:
+   ether_addr_copy(ethhdr->h_source, mac);
+   batadv_dbg(BATADV_DBG_BLA, bat_priv,
+  "bla_send_claim(): LOOPDETECT of %pM to %pM on vid 
%d\n",
+  ethhdr->h_source, ethhdr->h_dest,
+  BATADV_PRINT_VID(vid));
+
+   break;
}
 
if (vid & BATADV_VLAN_HAS_TAG)
@@ -427,6 +436,36 @@ out:
 }
 
 /**
+ * batadv_bla_loopdetect_report - worker for reporting the loop
+ * @work: work queue item
+ *
+ * Throws an uevent, as the loopdetect check function can't do that itself
+ * since the kernel may sleep while throwing uevents.
+ */
+static void batadv_bla_loopdetect_report(struct work_struct *work)
+{
+   struct batadv_bla_backbone_gw *backbone_gw;
+   struct batadv_priv *bat_priv;
+   char vid_str[6] = { '\0' };
+
+   backbone_gw = container_of(work, struct batadv_bla_backbone_gw,
+  report_work);
+   bat_priv = backbone_gw->bat_priv;
+
+   batadv_info(bat_priv->soft_iface,
+   "Possible loop on VLAN %d detected which can't be handled 
by BLA - please check your network setup!\n",
+   BATADV_PRINT_VID(backbone_gw->vid));
+   snprintf(vid_str, sizeof(vid_str), "%d",
+BATADV_PRINT_VID(backbone_gw->vid));
+   vid_str[sizeof(vid_str) - 1] = 0;
+
+   batadv_throw_uevent(bat_priv, BATADV_UEV_BLA, BATADV_UEV_LOOPDETECT,
+   vid_str);
+
+   batadv_backbone_gw_put(backbone_gw);
+}
+
+/**
  * batadv_bla_get_backbone_gw - finds or creates a backbone gateway
  * @bat_priv: the bat priv with all the soft interface information
  * @orig: the mac address of the originator
@@ -464,6 +503,7 @@ batadv_bla_get_backbone_gw(struct batadv_priv *bat_priv, u8 
*orig,
atomic_set(>request_sent, 0);
atomic_set(>wait_periods, 0);
ether_addr_copy(entry->orig, orig);
+   INIT_WORK(>report_work, batadv_bla_loopdetect_report);
 
/* one for the hash, one for returning */
kref_init(>refcount);
@@ -1060,6 +1100,10 @@ static int batadv_bla_process_claim(struct batadv_priv 
*bat_priv,
if (vlan_depth > 1)
return 1;
 
+   /* Let the loopdetect frames on the mesh in any case. */
+   if (bla_dst->type == BATADV_CLAIM_TYPE_LOOPDETECT)
+   return 0;
+
/* check if it is a claim frame. */
ret = batadv_check_claim_group(bat_priv, primary_if, hw_src, hw_dst,
   ethhdr);
@@ -1265,6 +1309,26 @@ void batadv_bla_update_orig_address(struct batadv_priv 
*bat_priv,
 }
 
 /**
+ * batadv_bla_send_loopdetect - send a loopdetect frame
+ * @bat_priv: 

[PATCH 06/17] batman-adv: Check hard_iface refcnt before calling function

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

The batadv_hardif_list list is checked in many situations and the items
in this list are given to specialized functions to modify the routing
behavior. At the moment each of these called functions has to check
itself whether the received batadv_hard_iface has a refcount > 0 before
it can increase the reference counter and use it in other objects.

This can easily lead to problems because it is not easily visible where
all callers of a function got the batadv_hard_iface object from and
whether they already hold a valid reference.

Checking the reference counter directly before calling a subfunction
with a pointer from the batadv_hardif_list avoids this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/bat_iv_ogm.c | 11 +++
 net/batman-adv/bat_v_ogm.c  | 14 +-
 net/batman-adv/originator.c |  5 +
 net/batman-adv/send.c   |  6 ++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c
index 8c1710bba803..57e9962c7090 100644
--- a/net/batman-adv/bat_iv_ogm.c
+++ b/net/batman-adv/bat_iv_ogm.c
@@ -987,9 +987,15 @@ static void batadv_iv_ogm_schedule(struct 
batadv_hard_iface *hard_iface)
list_for_each_entry_rcu(tmp_hard_iface, _hardif_list, list) {
if (tmp_hard_iface->soft_iface != hard_iface->soft_iface)
continue;
+
+   if (!kref_get_unless_zero(_hard_iface->refcount))
+   continue;
+
batadv_iv_ogm_queue_add(bat_priv, *ogm_buff,
*ogm_buff_len, hard_iface,
tmp_hard_iface, 1, send_time);
+
+   batadv_hardif_put(tmp_hard_iface);
}
rcu_read_unlock();
 
@@ -1767,8 +1773,13 @@ static void batadv_iv_ogm_process(const struct sk_buff 
*skb, int ogm_offset,
if (hard_iface->soft_iface != bat_priv->soft_iface)
continue;
 
+   if (!kref_get_unless_zero(_iface->refcount))
+   continue;
+
batadv_iv_ogm_process_per_outif(skb, ogm_offset, orig_node,
if_incoming, hard_iface);
+
+   batadv_hardif_put(hard_iface);
}
rcu_read_unlock();
 
diff --git a/net/batman-adv/bat_v_ogm.c b/net/batman-adv/bat_v_ogm.c
index 4155fa57cf6d..473ebb9a0e73 100644
--- a/net/batman-adv/bat_v_ogm.c
+++ b/net/batman-adv/bat_v_ogm.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -176,6 +177,9 @@ static void batadv_v_ogm_send(struct work_struct *work)
if (hard_iface->soft_iface != bat_priv->soft_iface)
continue;
 
+   if (!kref_get_unless_zero(_iface->refcount))
+   continue;
+
batadv_dbg(BATADV_DBG_BATMAN, bat_priv,
   "Sending own OGM2 packet (originator %pM, seqno %u, 
throughput %u, TTL %d) on interface %s [%pM]\n",
   ogm_packet->orig, ntohl(ogm_packet->seqno),
@@ -185,10 +189,13 @@ static void batadv_v_ogm_send(struct work_struct *work)
 
/* this skb gets consumed by batadv_v_ogm_send_to_if() */
skb_tmp = skb_clone(skb, GFP_ATOMIC);
-   if (!skb_tmp)
+   if (!skb_tmp) {
+   batadv_hardif_put(hard_iface);
break;
+   }
 
batadv_v_ogm_send_to_if(skb_tmp, hard_iface);
+   batadv_hardif_put(hard_iface);
}
rcu_read_unlock();
 
@@ -704,9 +711,14 @@ static void batadv_v_ogm_process(const struct sk_buff 
*skb, int ogm_offset,
if (hard_iface->soft_iface != bat_priv->soft_iface)
continue;
 
+   if (!kref_get_unless_zero(_iface->refcount))
+   continue;
+
batadv_v_ogm_process_per_outif(bat_priv, ethhdr, ogm_packet,
   orig_node, neigh_node,
   if_incoming, hard_iface);
+
+   batadv_hardif_put(hard_iface);
}
rcu_read_unlock();
 out:
diff --git a/net/batman-adv/originator.c b/net/batman-adv/originator.c
index f885a41d06d5..2ed2cc89a669 100644
--- a/net/batman-adv/originator.c
+++ b/net/batman-adv/originator.c
@@ -1160,6 +1160,9 @@ static bool batadv_purge_orig_node(struct batadv_priv 
*bat_priv,
if (hard_iface->soft_iface != bat_priv->soft_iface)
continue;
 
+   if (!kref_get_unless_zero(_iface->refcount))
+   continue;
+
best_neigh_node = batadv_find_best_neighbor(bat_priv,
   

[PATCH 07/17] batman-adv: Check hard_iface refcnt when receiving skb

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

The receive function may start processing an incoming packet while the
hard_iface is shut down in a different context. All called functions called
with the batadv_hard_iface object belonging to the incoming interface would
have to check whether the reference counter is still > 0.

This is rather error-prone because this check can be forgotten easily.
Instead check the reference counter when receiving the object to make sure
that all called functions have a valid reference.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/main.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c
index 78c05a91ae6f..c8d8bc78a518 100644
--- a/net/batman-adv/main.c
+++ b/net/batman-adv/main.c
@@ -401,11 +401,19 @@ int batadv_batman_skb_recv(struct sk_buff *skb, struct 
net_device *dev,
 
hard_iface = container_of(ptype, struct batadv_hard_iface,
  batman_adv_ptype);
+
+   /* Prevent processing a packet received on an interface which is getting
+* shut down otherwise the packet may trigger de-reference errors
+* further down in the receive path.
+*/
+   if (!kref_get_unless_zero(_iface->refcount))
+   goto err_out;
+
skb = skb_share_check(skb, GFP_ATOMIC);
 
/* skb was released by skb_share_check() */
if (!skb)
-   goto err_out;
+   goto err_put;
 
/* packet should hold at least type and version */
if (unlikely(!pskb_may_pull(skb, 2)))
@@ -448,6 +456,8 @@ int batadv_batman_skb_recv(struct sk_buff *skb, struct 
net_device *dev,
if (ret == NET_RX_DROP)
kfree_skb(skb);
 
+   batadv_hardif_put(hard_iface);
+
/* return NET_RX_SUCCESS in any case as we
 * most probably dropped the packet for
 * routing-logical reasons.
@@ -456,6 +466,8 @@ int batadv_batman_skb_recv(struct sk_buff *skb, struct 
net_device *dev,
 
 err_free:
kfree_skb(skb);
+err_put:
+   batadv_hardif_put(hard_iface);
 err_out:
return NET_RX_DROP;
 }
-- 
2.8.2



[PATCH 04/17] batman-adv: Create batman soft interfaces within correct netns.

2016-05-10 Thread Antonio Quartulli
From: Andrew Lunn 

When creating a soft interface, create it in the same netns as the
hard interface. Replace all references to init_net with the correct
name space for the interface being manipulated.

Suggested-by: Daniel Ehlers 
Signed-off-by: Andrew Lunn 
Acked-by: Antonio Quartulli 
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/hard-interface.c| 10 +-
 net/batman-adv/hard-interface.h|  3 ++-
 net/batman-adv/soft-interface.c|  7 +--
 net/batman-adv/soft-interface.h|  3 ++-
 net/batman-adv/sysfs.c |  3 ++-
 net/batman-adv/translation-table.c |  4 ++--
 6 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index 0a7deaf2670a..f0e1899e5b6b 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -36,7 +36,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "bridge_loop_avoidance.h"
 #include "debugfs.h"
@@ -121,6 +120,7 @@ static bool batadv_mutual_parents(const struct net_device 
*dev1,
 static bool batadv_is_on_batman_iface(const struct net_device *net_dev)
 {
struct net_device *parent_dev;
+   struct net *net = dev_net(net_dev);
bool ret;
 
/* check if this is a batman-adv mesh interface */
@@ -133,7 +133,7 @@ static bool batadv_is_on_batman_iface(const struct 
net_device *net_dev)
return false;
 
/* recurse over the parent device */
-   parent_dev = __dev_get_by_index(_net, dev_get_iflink(net_dev));
+   parent_dev = __dev_get_by_index(net, dev_get_iflink(net_dev));
/* if we got a NULL parent_dev there is something broken.. */
if (WARN(!parent_dev, "Cannot find parent device"))
return false;
@@ -456,7 +456,7 @@ static int batadv_master_del_slave(struct batadv_hard_iface 
*slave,
 }
 
 int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface,
-  const char *iface_name)
+  struct net *net, const char *iface_name)
 {
struct batadv_priv *bat_priv;
struct net_device *soft_iface, *master;
@@ -470,10 +470,10 @@ int batadv_hardif_enable_interface(struct 
batadv_hard_iface *hard_iface,
if (!kref_get_unless_zero(_iface->refcount))
goto out;
 
-   soft_iface = dev_get_by_name(_net, iface_name);
+   soft_iface = dev_get_by_name(net, iface_name);
 
if (!soft_iface) {
-   soft_iface = batadv_softif_create(iface_name);
+   soft_iface = batadv_softif_create(net, iface_name);
 
if (!soft_iface) {
ret = -ENOMEM;
diff --git a/net/batman-adv/hard-interface.h b/net/batman-adv/hard-interface.h
index d74f1983f33e..a76724d369bf 100644
--- a/net/batman-adv/hard-interface.h
+++ b/net/batman-adv/hard-interface.h
@@ -28,6 +28,7 @@
 #include 
 
 struct net_device;
+struct net;
 
 enum batadv_hard_if_state {
BATADV_IF_NOT_IN_USE,
@@ -55,7 +56,7 @@ bool batadv_is_wifi_iface(int ifindex);
 struct batadv_hard_iface*
 batadv_hardif_get_by_netdev(const struct net_device *net_dev);
 int batadv_hardif_enable_interface(struct batadv_hard_iface *hard_iface,
-  const char *iface_name);
+  struct net *net, const char *iface_name);
 void batadv_hardif_disable_interface(struct batadv_hard_iface *hard_iface,
 enum batadv_hard_if_cleanup autodel);
 void batadv_hardif_remove_interfaces(void);
diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index 66dd0aac480a..04866c9b860a 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -885,13 +885,14 @@ static int batadv_softif_slave_add(struct net_device *dev,
   struct net_device *slave_dev)
 {
struct batadv_hard_iface *hard_iface;
+   struct net *net = dev_net(dev);
int ret = -EINVAL;
 
hard_iface = batadv_hardif_get_by_netdev(slave_dev);
if (!hard_iface || hard_iface->soft_iface)
goto out;
 
-   ret = batadv_hardif_enable_interface(hard_iface, dev->name);
+   ret = batadv_hardif_enable_interface(hard_iface, net, dev->name);
 
 out:
if (hard_iface)
@@ -988,7 +989,7 @@ static void batadv_softif_init_early(struct net_device *dev)
memset(priv, 0, sizeof(*priv));
 }
 
-struct net_device *batadv_softif_create(const char *name)
+struct net_device *batadv_softif_create(struct net *net, const char *name)
 {
struct net_device *soft_iface;
int ret;
@@ -998,6 +999,8 @@ struct net_device *batadv_softif_create(const char *name)
if (!soft_iface)
return NULL;
 

[PATCH 09/17] batman-adv: Use kref_get for batadv_tvlv_container_get

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

batadv_tvlv_container_get requires that tvlv.container_list_lock is held by
the caller. It is therefore not possible that an item in
tvlv.container_list has an reference counter of 0 and is still in the list

The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c
index c8d8bc78a518..5f2974bd1227 100644
--- a/net/batman-adv/main.c
+++ b/net/batman-adv/main.c
@@ -748,9 +748,7 @@ batadv_tvlv_container_get(struct batadv_priv *bat_priv, u8 
type, u8 version)
if (tvlv_tmp->tvlv_hdr.version != version)
continue;
 
-   if (!kref_get_unless_zero(_tmp->refcount))
-   continue;
-
+   kref_get(_tmp->refcount);
tvlv = tvlv_tmp;
break;
}
-- 
2.8.2



[PATCH 02/17] batman-adv: Remove hdr_size skb size check in batadv_interface_rx

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

The callers of batadv_interface_rx have to make sure that enough data can
be pulled from the skb when they read the batman-adv header. The only two
functions using it are either calling pskb_may_pull with hdr_size directly
(batadv_recv_bcast_packet) or indirectly via batadv_check_unicast_packet
(batadv_recv_unicast_packet).

Reported-by: Marek Lindner 
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/soft-interface.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index dc9a61a5122d..d72f88707736 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -413,10 +413,6 @@ void batadv_interface_rx(struct net_device *soft_iface,
batadv_bcast_packet = (struct batadv_bcast_packet *)skb->data;
is_bcast = (batadv_bcast_packet->packet_type == BATADV_BCAST);
 
-   /* check if enough space is available for pulling, and pull */
-   if (!pskb_may_pull(skb, hdr_size))
-   goto dropped;
-
skb_pull_rcsum(skb, hdr_size);
skb_reset_mac_header(skb);
 
-- 
2.8.2



[PATCH 08/17] batman-adv: Increase hard_iface refcnt for ptype

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

The hard_iface is referenced in the packet_type for batman-adv. Increase
the refcounter of the hard_interface for it to have an explicit reference
for it in case this functionality gets refactorted and the currently
used implicit reference for it will be removed.

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/hard-interface.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index f0e1899e5b6b..d3d37f3f99cf 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -522,6 +522,7 @@ int batadv_hardif_enable_interface(struct batadv_hard_iface 
*hard_iface,
goto err_upper;
}
 
+   kref_get(_iface->refcount);
hard_iface->batman_adv_ptype.type = ethertype;
hard_iface->batman_adv_ptype.func = batadv_batman_skb_recv;
hard_iface->batman_adv_ptype.dev = hard_iface->net_dev;
@@ -583,6 +584,7 @@ void batadv_hardif_disable_interface(struct 
batadv_hard_iface *hard_iface,
batadv_info(hard_iface->soft_iface, "Removing interface: %s\n",
hard_iface->net_dev->name);
dev_remove_pack(_iface->batman_adv_ptype);
+   batadv_hardif_put(hard_iface);
 
bat_priv->num_ifaces--;
batadv_orig_hash_del_if(hard_iface, bat_priv->num_ifaces);
-- 
2.8.2



[PATCH 03/17] batman-adv: NETIF_F_NETNS_LOCAL feature to prevent netns moves

2016-05-10 Thread Antonio Quartulli
From: Andrew Lunn 

The batX soft interface should not be moved between network name
spaces. This is similar to bridges, bonds, tunnels, which are not
allowed to move between network namespaces.

Suggested-by: Daniel Ehlers 
Signed-off-by: Andrew Lunn 
Acked-by: Antonio Quartulli 
Reviewed-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/soft-interface.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index d72f88707736..66dd0aac480a 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -972,7 +972,7 @@ static void batadv_softif_init_early(struct net_device *dev)
 
dev->netdev_ops = _netdev_ops;
dev->destructor = batadv_softif_free;
-   dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
+   dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_NETNS_LOCAL;
dev->priv_flags |= IFF_NO_QUEUE;
 
/* can't call min_mtu, because the needed variables
-- 
2.8.2



[PATCH 01/17] batman-adv: Remove unused parameter recv_if of batadv_interface_rx

2016-05-10 Thread Antonio Quartulli
From: Sven Eckelmann 

Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Antonio Quartulli 
---
 net/batman-adv/routing.c| 5 ++---
 net/batman-adv/soft-interface.c | 5 ++---
 net/batman-adv/soft-interface.h | 4 ++--
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
index b781bf753250..2ecfca246be4 100644
--- a/net/batman-adv/routing.c
+++ b/net/batman-adv/routing.c
@@ -912,7 +912,7 @@ int batadv_recv_unicast_packet(struct sk_buff *skb,
hdr_size))
goto rx_success;
 
-   batadv_interface_rx(recv_if->soft_iface, skb, recv_if, hdr_size,
+   batadv_interface_rx(recv_if->soft_iface, skb, hdr_size,
orig_node);
 
 rx_success:
@@ -1122,8 +1122,7 @@ int batadv_recv_bcast_packet(struct sk_buff *skb,
goto rx_success;
 
/* broadcast for me */
-   batadv_interface_rx(recv_if->soft_iface, skb, recv_if, hdr_size,
-   orig_node);
+   batadv_interface_rx(recv_if->soft_iface, skb, hdr_size, orig_node);
 
 rx_success:
ret = NET_RX_SUCCESS;
diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index dfb4d56120b6..dc9a61a5122d 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -385,7 +385,6 @@ end:
  * batadv_interface_rx - receive ethernet frame on local batman-adv interface
  * @soft_iface: local interface which will receive the ethernet frame
  * @skb: ethernet frame for @soft_iface
- * @recv_if: interface on which the batman-adv packet was received
  * @hdr_size: size of already parsed batman-adv header
  * @orig_node: originator from which the batman-adv packet was sent
  *
@@ -400,8 +399,8 @@ end:
  * isolated clients.
  */
 void batadv_interface_rx(struct net_device *soft_iface,
-struct sk_buff *skb, struct batadv_hard_iface *recv_if,
-int hdr_size, struct batadv_orig_node *orig_node)
+struct sk_buff *skb, int hdr_size,
+struct batadv_orig_node *orig_node)
 {
struct batadv_bcast_packet *batadv_bcast_packet;
struct batadv_priv *bat_priv = netdev_priv(soft_iface);
diff --git a/net/batman-adv/soft-interface.h b/net/batman-adv/soft-interface.h
index 9ae265703d23..5942da3d03d5 100644
--- a/net/batman-adv/soft-interface.h
+++ b/net/batman-adv/soft-interface.h
@@ -27,8 +27,8 @@ struct sk_buff;
 
 int batadv_skb_head_push(struct sk_buff *skb, unsigned int len);
 void batadv_interface_rx(struct net_device *soft_iface,
-struct sk_buff *skb, struct batadv_hard_iface *recv_if,
-int hdr_size, struct batadv_orig_node *orig_node);
+struct sk_buff *skb, int hdr_size,
+struct batadv_orig_node *orig_node);
 struct net_device *batadv_softif_create(const char *name);
 void batadv_softif_destroy_sysfs(struct net_device *soft_iface);
 int batadv_softif_is_valid(const struct net_device *net_dev);
-- 
2.8.2



pull request: batman-adv 20160511

2016-05-10 Thread Antonio Quartulli
Hi David,

here you have a pull request intended for net-next.
There are 17 patches in this batch, but most of them are cleanups
and minor code re-arrangement.

The more detailed description follows in the git tag.

Please pull or let me know of any problem.

Thanks a lot,
Antonio


The following changes since commit c047c3b1af6214b447e353527e394fa3f3e86397:

  netfilter: conntrack: remove uninitialized shadow variable (2016-05-10 
01:04:04 -0400)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batman-adv-for-davem

for you to fetch changes up to 676970e55b1033af7f0a03d4037b4d9b76327ded:

  batman-adv: use batadv_compare_eth when possible (2016-05-10 18:28:54 +0800)


Included changes:
- remove useless skb size check in batadv_interface_rx
- basic netns support introduced by Andrew Lunn:
- prevent virtual interface from changing netns by setting
  NETIF_F_NETNS_LOCAL
- create virtual interface within the netns of the first
  hard-interface
- introduce detection of complex bridge loops and report event
  to the user (via udev) when the Bridge Loop Avoidance mechanism
  can't prevent them
- minor reference counting bugfixes for the hard_iface object that
  couldn't make it via the net tree
- use kref_get() instead of kref_get_unless_zero() to make reference
  counting bug more visible
- use batadv_compare_eth() all over the code when possible instead of
  plain memcmp()
- minor code cleanup and style adjustments


Andrew Lunn (2):
  batman-adv: NETIF_F_NETNS_LOCAL feature to prevent netns moves
  batman-adv: Create batman soft interfaces within correct netns.

Antonio Quartulli (1):
  batman-adv: use batadv_compare_eth when possible

Marek Lindner (1):
  batman-adv: replace ethertype variable with ETH_P_BATMAN for readability

Simon Wunderlich (1):
  batman-adv: add detection for complex bridge loops

Sven Eckelmann (12):
  batman-adv: Remove unused parameter recv_if of batadv_interface_rx
  batman-adv: Remove hdr_size skb size check in batadv_interface_rx
  batman-adv: Check hard_iface refcnt before calling function
  batman-adv: Check hard_iface refcnt when receiving skb
  batman-adv: Increase hard_iface refcnt for ptype
  batman-adv: Use kref_get for batadv_tvlv_container_get
  batman-adv: Use kref_get for batadv_nc_get_nc_node
  batman-adv: Use kref_get for batadv_gw_select
  batman-adv: Use kref_get for batadv_gw_node_add
  batman-adv: Use kref_get for hard_iface subfunctions
  batman-adv: Use kref_get for _batadv_update_route
  batman-adv: Use bool as return type for boolean functions

 net/batman-adv/bat_iv_ogm.c|  48 ++---
 net/batman-adv/bat_v_ogm.c |  14 +-
 net/batman-adv/bitarray.c  |  16 +-
 net/batman-adv/bitarray.h  |  15 +-
 net/batman-adv/bridge_loop_avoidance.c | 314 -
 net/batman-adv/bridge_loop_avoidance.h |  43 ++---
 net/batman-adv/debugfs.c   |   2 +-
 net/batman-adv/distributed-arp-table.c |   6 +-
 net/batman-adv/gateway_client.c|  12 +-
 net/batman-adv/hard-interface.c|  34 ++--
 net/batman-adv/hard-interface.h|   3 +-
 net/batman-adv/hash.h  |   6 +-
 net/batman-adv/main.c  |  18 +-
 net/batman-adv/main.h  |   6 +-
 net/batman-adv/network-coding.c|  25 +--
 net/batman-adv/originator.c|  39 ++--
 net/batman-adv/originator.h|   2 +-
 net/batman-adv/packet.h|   1 +
 net/batman-adv/routing.c   |  50 +++---
 net/batman-adv/routing.h   |   6 +-
 net/batman-adv/send.c  |   6 +
 net/batman-adv/soft-interface.c|  32 ++--
 net/batman-adv/soft-interface.h|  10 +-
 net/batman-adv/sysfs.c |   9 +-
 net/batman-adv/translation-table.c |  35 ++--
 net/batman-adv/types.h |   8 +
 26 files changed, 465 insertions(+), 295 deletions(-)


Re: [Drbd-dev] [PATCH net-next v3] block/drbd: align properly u64 in nl messages

2016-05-10 Thread David Miller
From: Lars Ellenberg 
Date: Tue, 10 May 2016 21:09:03 +0200

> On Tue, May 10, 2016 at 11:39:49AM -0400, David Miller wrote:
>> From: Lars Ellenberg 
>> Date: Tue, 10 May 2016 11:40:23 +0200
> 
> excuse me for reordering the original:
> 
>> Anyways, back to the topic, can you please just relent and come to
>> some kind of agreement about the fix for this alignment bug?
> 
> I thought we did?  I'm fine with the "v3",
> it even carries my signed-of-by.

My bad, I missed that, I'll apply v3 thanks a lot!


[patch -mainline] qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()

2016-05-10 Thread Dan Carpenter
If qlcnic_fw_cmd_get_minidump_temp() fails then "fw_dump->tmpl_hdr" is
NULL or possibly freed.  It can lead to an oops later.

Fixes: d01a6d3c8ae1 ('qlcnic: Add support to enable capability to extend 
minidump for iSCSI')
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
index cda9e60..0844b7c 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_minidump.c
@@ -1417,6 +1417,7 @@ void qlcnic_83xx_get_minidump_template(struct 
qlcnic_adapter *adapter)
struct qlcnic_fw_dump *fw_dump = >fw_dump;
struct pci_dev *pdev = adapter->pdev;
bool extended = false;
+   int ret;
 
prev_version = adapter->fw_version;
current_version = qlcnic_83xx_get_fw_version(adapter);
@@ -1427,8 +1428,11 @@ void qlcnic_83xx_get_minidump_template(struct 
qlcnic_adapter *adapter)
if (qlcnic_83xx_md_check_extended_dump_capability(adapter))
extended = !qlcnic_83xx_extend_md_capab(adapter);
 
-   if (!qlcnic_fw_cmd_get_minidump_temp(adapter))
-   dev_info(>dev, "Supports FW dump capability\n");
+   ret = qlcnic_fw_cmd_get_minidump_temp(adapter);
+   if (ret)
+   return;
+
+   dev_info(>dev, "Supports FW dump capability\n");
 
/* Once we have minidump template with extended iSCSI dump
 * capability, update the minidump capture mask to 0x1f as


Re: [PATCH RFT 1/2] phylib: add device reset GPIO support

2016-05-10 Thread Florian Fainelli
On 05/10/2016 12:11 PM, Sergei Shtylyov wrote:
> Hello.
> 
> On 05/10/2016 09:32 PM, Florian Fainelli wrote:
> 
>>> The PHY devices sometimes do have their reset signal (maybe even power
>>> supply?) tied to some GPIO and sometimes it also does happen that a boot
>>> loader does not leave it deasserted. So far this issue has been attacked
>>> from (as I believe) a wrong angle: by teaching the MAC driver to
>>> manipulate
>>> the GPIO in question; that solution, when applied to the device
>>> trees, led
>>> to adding the PHY reset GPIO properties to the MAC device node, with one
>>> exception: Cadence MACB driver which could handle the "reset-gpios" prop
>>> in a PHY device subnode. I believe that the correct approach is to teach
>>> the 'phylib' to get the MDIO device reset GPIO from the device tree node
>>> corresponding to this device -- which this patch is doing...
>>>
>>> Note that I had to modify the  AT803x PHY driver as it would stop
>>> working
>>> otherwise as it made use of the reset GPIO for its own purposes...
>>>
>>> Signed-off-by: Sergei Shtylyov 
>>
>> This looks good to me:
>>
>> Acked-by: Florian Fainelli 
> 
>Thank you! I'll send v3 without [RFT] then.
> 
>> Can you follow up with changes in phy_{suspend,resume}
> 
>I'm not sure what changes you mean -- powering down the PHYs?

Yes, powering down, conversely up the PHY. The whole point of putting
this in PHYLIB is to be able to perform things like that. We do not need
this right now, but it would be nice if we saw that materialize at some
point.
-- 
Florian


Re: [PATCH RFT 1/2] phylib: add device reset GPIO support

2016-05-10 Thread Sergei Shtylyov

Hello.

On 05/10/2016 09:32 PM, Florian Fainelli wrote:


The PHY devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
the GPIO in question; that solution, when applied to the device trees, led
to adding the PHY reset GPIO properties to the MAC device node, with one
exception: Cadence MACB driver which could handle the "reset-gpios" prop
in a PHY device subnode. I believe that the correct approach is to teach
the 'phylib' to get the MDIO device reset GPIO from the device tree node
corresponding to this device -- which this patch is doing...

Note that I had to modify the  AT803x PHY driver as it would stop working
otherwise as it made use of the reset GPIO for its own purposes...

Signed-off-by: Sergei Shtylyov 


This looks good to me:

Acked-by: Florian Fainelli 


   Thank you! I'll send v3 without [RFT] then.


Can you follow up with changes in phy_{suspend,resume}


   I'm not sure what changes you mean -- powering down the PHYs?


if that is also
an use case that you have?


   No, I'm not into power management.

MBR, Sergei



Re: [PATCH net-next 01/14] qed: Add CONFIG_QED_SRIOV

2016-05-10 Thread David Miller
From: Yuval Mintz 
Date: Tue, 10 May 2016 18:15:08 +

>> > I'm not entirely convinced this is true; If we'll not enforce the
>> > alignment of this 64-bit field, it's possible there will be
>> > differences between 32-bit and 64-bit machines versions of this struct.
>> > You have to recall that this is going to be copied via DMA between PF
>> > and VF, so they must have the exact same representation of the structure.
>> 
>> Then use properly sized types to fill in all the space in the structure, 
>> that's how
>> you guarantee layout, not aligned_u64.  Also, do not use the packed 
>> attribute.
>> 
>> struct foo {
>>  u32 x;
>>  u32 y;
>>  u64 z;
>> };
>> 
>> 'z' will always be 64-bit aligned.
> 
> Perhaps my bit-numeric is a bit weak - why is it so?

foo is 64-bit aligned, therefore a properly aligned struct member will
be so as well.

We absolutely depend upon this for several data structures in the kernel.


Re: [Drbd-dev] [PATCH net-next v3] block/drbd: align properly u64 in nl messages

2016-05-10 Thread Lars Ellenberg
On Tue, May 10, 2016 at 11:39:49AM -0400, David Miller wrote:
> From: Lars Ellenberg 
> Date: Tue, 10 May 2016 11:40:23 +0200

excuse me for reordering the original:

> Anyways, back to the topic, can you please just relent and come to
> some kind of agreement about the fix for this alignment bug?

I thought we did?  I'm fine with the "v3",
it even carries my signed-of-by.

Whether or not Nicholas wants to prefix those headers with drbd_,
I don't really care.

> This is taking a very long time and patches are just rotting in
> patchwork with no resolution.  Why would 

Nicholas asked how to go about DRBD,
I suggested to use 0 as a padding attribute,
and after taking a detour, he did. All good.


Rest of original:

> > If we introduce a new config option,
> > we have to add it to the config scanner (one line),
> > define min, max, default and scale (four short defines),
> > and add it to the netlink definition here (one line).
> > Done, rest of the code is generated,
> > both on the kernel side,
> > and on the drbd-utils side used to talk to the kernel.
> > We found that to be very convenient.
> 
> But it entirely misses the core design point of netlink.
> 
> Sender and receive _DO NOT_ need to coordinate at all.  That's the
> whole point.  So tightly coupling such coordination is going to run
> you into all kinds of problems.
> 
> When implemented properly, the sender can emit whatever attributes it
> knows about and can generate, and the receive scans the attributes one
> by one and picks out the ones it understands and processes them.
> 
> If you go against this model
> then you have no clean way to

We don't.
We extend (not violate) that model, so the sender *may* indicate
to the recipient that for some particular attribute, the sender would
rather have an "I don't understand this" return than a silent ignore.
And that we can indicate in the definition of the attributes which ones
are required to make a message meaningful.

> extend things whilst allowing existing software to continue working.

*that* is exactly why we use netlink,
and why we do things with it the way we do.
Actually I think what we are doing there is, comparatively, "elegant".
You obviously don't have to agree.

I could discuss this in more detail,
but I assume you are not really interested,
at least not here and now.

Thanks,

Lars



  1   2   >