Re: [PATCH net-next v2 7/9] net: use core MTU range checking in misc drivers

2016-10-20 Thread Rémi Denis-Courmont
Acked-by: Rémi Denis-Courmont 

-- 
Rémi Denis-Courmont
http://www.remlab.net/CV.pdf



[PATCH net 1/1] bnx2x: Use the correct divisor value for PHC clock readings.

2016-10-20 Thread Sony.Chacko
From: Sudarsana Reddy Kalluru 

Time Sync (PTP) implementation uses the divisor/shift value for converting
the clock ticks to nanoseconds. Driver currently defines shift value as 1,
this results in the nanoseconds value to be calculated as half the actual
value. Hence the user application fails to synchronize the device clock
value with the PTP master device clock. Need to use the 'shift' value of 0.

Signed-off-by: Sony.Chacko 
Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Yuval Mintz 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 20fe6a8..0cee4c0 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -15241,7 +15241,7 @@ static void bnx2x_init_cyclecounter(struct bnx2x *bp)
memset(&bp->cyclecounter, 0, sizeof(bp->cyclecounter));
bp->cyclecounter.read = bnx2x_cyclecounter_read;
bp->cyclecounter.mask = CYCLECOUNTER_MASK(64);
-   bp->cyclecounter.shift = 1;
+   bp->cyclecounter.shift = 0;
bp->cyclecounter.mult = 1;
 }
 
-- 
1.8.3.1



Re: [PATCH] net: cpsw: fix obtaining mac address for am3517

2016-10-20 Thread Tony Lindgren
* Jeroen Hofstee  [161020 12:57]:
> Commit b6745f6e4e63 ("drivers: net: cpsw: davinci_emac: move reading mac
> id to common file") did not only move the code for an am3517, it also
> added the slave parameter, resulting in a invalid (all zero) mac address
> being returned. So change it back to always read from slave zero, so it
> works again.

Hmm doesn't this now break it for cpsw with two instances? We may need
am3517 specific quirk flag instead?

Regards,

Tony


network packet kernel rate limit

2016-10-20 Thread YUAN Linyu
User space application require linux kernel to do rate limit to some type of 
Ethernet packet(ARP, ICMP, TCP, for example 1000 packet/s) when system run 
under Ethernet packet flooding test.
Do you know if kernel already exist a feature to implement this requirement?

Thanks


[PATCH net-next 1/1] driver: tun: Forbid to set IFF_TUN and IFF_TAP at the same time

2016-10-20 Thread fgao
From: Gao Feng 

Current tun driver permits the ifr_flags is set with IFF_TUN and
IFF_TAP at the same time. But actually there is only IFF_TUN flag
works. And it does not make sense these two flags are set, so add
this check.

Signed-off-by: Gao Feng 
---
 drivers/net/tun.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8093e39..c1f89c1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1752,6 +1752,10 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
if (err < 0)
return err;
 
+   if ((ifr->ifr_flags & (IFF_TUN | IFF_TAP)) == (IFF_TUN | 
IFF_TAP)) {
+   return -EINVAL;
+   }
+
/* Set dev type */
if (ifr->ifr_flags & IFF_TUN) {
/* TUN device */
-- 
1.9.1




Re: [PATCH v6 0/6] Add eBPF hooks for cgroups

2016-10-20 Thread David Ahern
On 9/19/16 10:43 AM, Daniel Mack wrote:
> This is v6 of the patch set to allow eBPF programs for network
> filtering and accounting to be attached to cgroups, so that they apply
> to all sockets of all tasks placed in that cgroup. The logic also
> allows to be extendeded for other cgroup based eBPF logic.
> 
> 
> Changes from v5:
> 
> * The eBPF programs now operate on L3 rather than on L2 of the packets,
>   and the egress hooks were moved from __dev_queue_xmit() to
>   ip*_output().
> 
> * For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
>   through BPF_LD_[ABS|IND] instructions, but hook up the
>   bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
>   for the help.

It's been a month since the last response or update to this series. Any 
progress in resolving the resistance to hook locations?

As I mentioned in Tokyo I need a solution for VRF that allows running processes 
in a VRF context -- meaning a process inherits a default sk_bound_dev_if for 
any AF_INET{6} sockets opened. Right now we (Cumulus) are using an l3mdev 
cgroup, something that Tejun pushed back on earlier this year. I strongly 
believe that cgroups provide the right infrastructure for this feature and 
looking at options. I'm sure a few people will chuckle about this, but I do 
have another solution that leverages this patchset -- a bpf program on a cgroup 
that sets sk_bound_dev_if. So, what's the likelihood of this patchset making 
4.10 (or any other release)?

Thanks,
David


Re: [PATCH] kexec: Export kexec_in_progress to modules

2016-10-20 Thread Eric W. Biederman
David Miller  writes:

> From: Florian Fainelli 
> Date: Thu, 20 Oct 2016 18:15:16 -0700
>
>> The bcm_sf2 driver uses kexec_in_progress to know whether it can power
>> down an integrated PHY during shutdown, and can be built as a module.
>> Other modules may be using this in the future, so export it.
>> 
>> Fixes: 2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
>> kernels")
>> Signed-off-by: Florian Fainelli 
>> ---
>> Eric, David, Stephen,
>> 
>> The offending commit is in David's net.git tree, so it would probably make
>> sense to route the fix through the same tree.
>
> Ok, I'll apply this, thanks Florian.

Florian

I am completely confused why any driver would want to do this.

A reboot is semantically identical to a kexec restart.  Always has been.
That is pwoering down your hardware during reboot is not safe.

The only thing that might save you is the hardware reset line being
toggled at which point your hardware is powered up again anyway.

So as far as I can tell you are advocating for a change to support a
driver doing something that is completely pointless.  So no let's not
export this symbol.  Please fix the driver to do something less
pointless instead.

Eric




Re: [PATCH] net: skip genenerating uevents for network namespaces that are exiting

2016-10-20 Thread Andrey Vagin
On Thu, Oct 20, 2016 at 8:10 PM, Cong Wang  wrote:
> On Thu, Oct 20, 2016 at 7:46 PM, Andrei Vagin  wrote:
>> No one can see these events, because a network namespace can not be
>> destroyed, if it has sockets.
>>
>
> Are you sure? kobject_uevent_env() seems sending uevents to all
> network namespaces.

kobj_bcast_filter() checks that a kobject namespace is equal to a
socket namespace.


> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


Re: [Patch net] net: saving irq context for peernet2id()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 4:35 PM, Cong Wang  wrote:
> Since you want to test SELinux anyway, please test the attached one.
>

Finally my kernel config is friendly to SELinux, and now there are several
tests fails:


Test Summary Report
---
sysctl/test   (Wstat: 0 Tests: 4 Failed: 2)
  Failed tests:  1-2
mmap/test (Wstat: 0 Tests: 44 Failed: 2)
  Failed tests:  20, 26
overlay/test  (Wstat: 65280 Tests: 0 Failed: 0)
  Non-zero exit status: 255
  Parse errors: Bad plan.  You planned 121 tests but ran 0.
Files=44, Tests=306, 47 wallclock secs ( 0.23 usr  0.30 sys +  1.00
cusr  7.35 csys =  8.88 CPU)
Result: FAIL

Seems not related to my patch?

And, there is a kernel warning which is totally unrelated to my code:


[  136.788836] [ cut here ]
[  136.793046] WARNING: CPU: 1 PID: 1345 at fs/locks.c:615
locks_unlink_lock_ctx+0xc7/0xcc
[  136.799829] CPU: 1 PID: 1345 Comm: test_lease Not tainted 4.8.0+ #271
[  136.803814] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  136.803814]  c900036bfd38 8152c08e 

[  136.803814]  c900036bfd78 810841c0 0267036bfc88
8800db2371e8
[  136.803814]  8800db26f4c0 c900036bfe00 c900036bfe00
8800db239240
[  136.803814] Call Trace:
[  136.803814]  [] dump_stack+0x67/0x90
[  136.803814]  [] __warn+0xbd/0xd8
[  136.803814]  [] warn_slowpath_null+0x1d/0x1f
[  136.803814]  [] locks_unlink_lock_ctx+0xc7/0xcc
[  136.803814]  [] locks_delete_lock_ctx+0x17/0x3a
[  136.803814]  [] lease_modify+0xb0/0xb9
[  136.803814]  [] locks_remove_file+0x93/0xbd
[  136.803814]  [] __fput+0xd0/0x19b
[  136.803814]  [] fput+0xe/0x10
[  136.803814]  [] task_work_run+0x6c/0x97
[  136.803814]  [] prepare_exit_to_usermode+0xb0/0xcc
[  136.803814]  [] syscall_return_slowpath+0x15e/0x1c2
[  136.803814]  [] do_syscall_64+0x6f/0x76
[  136.803814]  [] entry_SYSCALL64_slow_path+0x25/0x25
[  138.538370] ---[ end trace 90914a16bd3272b8 ]---


Re: [PATCH net-next v12 8/9] openvswitch: allow L3 netdev ports

2016-10-20 Thread Pravin Shelar
On Mon, Oct 17, 2016 at 6:02 AM, Jiri Benc  wrote:
> Allow ARPHRD_NONE interfaces to be added to ovs bridge.
>
> Based on previous versions by Lorand Jakab and Simon Horman.
>
> Signed-off-by: Lorand Jakab 
> Signed-off-by: Simon Horman 
> Signed-off-by: Jiri Benc 

Acked-by: Pravin B Shelar 


Re: [PATCH net-next v12 7/9] openvswitch: add Ethernet push and pop actions

2016-10-20 Thread Pravin Shelar
On Mon, Oct 17, 2016 at 6:02 AM, Jiri Benc  wrote:
> It's not allowed to push Ethernet header in front of another Ethernet
> header.
>
> It's not allowed to pop Ethernet header if there's a vlan tag. This
> preserves the invariant that L3 packet never has a vlan tag.
>
> Based on previous versions by Lorand Jakab and Simon Horman.
>
> Signed-off-by: Lorand Jakab 
> Signed-off-by: Simon Horman 
> Signed-off-by: Jiri Benc 
> ---
>  include/uapi/linux/openvswitch.h | 15 
>  net/openvswitch/actions.c| 49 
> 
>  net/openvswitch/flow_netlink.c   | 18 +++
>  3 files changed, 82 insertions(+)
>
...
...
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 064cbcb7b0c5..a63572fb878e 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -317,6 +317,47 @@ static int set_eth_addr(struct sk_buff *skb, struct 
> sw_flow_key *flow_key,
> return 0;
>  }
>
> +/* pop_eth does not support VLAN packets as this action is never called
> + * for them.
> + */
> +static int pop_eth(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +   skb_pull_rcsum(skb, ETH_HLEN);
> +   skb_reset_mac_header(skb);
> +   skb->mac_len -= ETH_HLEN;
> +
> +   /* safe right before invalidate_flow_key */
> +   key->mac_proto = MAC_PROTO_NONE;
> +   invalidate_flow_key(key);
> +   return 0;
> +}
> +
> +static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
> +   const struct ovs_action_push_eth *ethh)
> +{
> +   struct ethhdr *hdr;
> +
> +   /* Add the new Ethernet header */
> +   if (skb_cow_head(skb, ETH_HLEN) < 0)
> +   return -ENOMEM;
> +
> +   skb_push(skb, ETH_HLEN);
> +   skb_reset_mac_header(skb);
> +   skb->mac_len = ETH_HLEN;
> +

The eth pop substracts ETH_HLEN but here the length is set. I think it
should be consistent with respect to eth-pop.


Re: [PATCH net-next v12 6/9] openvswitch: netlink: support L3 packets

2016-10-20 Thread Pravin Shelar
On Mon, Oct 17, 2016 at 6:02 AM, Jiri Benc  wrote:
> Extend the ovs flow netlink protocol to support L3 packets. Packets without
> OVS_KEY_ATTR_ETHERNET attribute specify L3 packets; for those, the
> OVS_KEY_ATTR_ETHERTYPE attribute is mandatory.
>
> Push/pop vlan actions are only supported for Ethernet packets.
>
> Based on previous versions by Lorand Jakab and Simon Horman.
>
> Signed-off-by: Lorand Jakab 
> Signed-off-by: Simon Horman 
> Signed-off-by: Jiri Benc 

Acked-by: Pravin B Shelar 


Re: [PATCH net-next v12 5/9] openvswitch: add processing of L3 packets

2016-10-20 Thread Pravin Shelar
On Mon, Oct 17, 2016 at 6:02 AM, Jiri Benc  wrote:
> Support receiving, extracting flow key and sending of L3 packets (packets
> without an Ethernet header).
>
> Note that even after this patch, non-Ethernet interfaces are still not
> allowed to be added to bridges. Similarly, netlink interface for sending and
> receiving L3 packets to/from user space is not in place yet.
>
> Based on previous versions by Lorand Jakab and Simon Horman.
>
> Signed-off-by: Lorand Jakab 
> Signed-off-by: Simon Horman 
> Signed-off-by: Jiri Benc 
> ---
>  net/openvswitch/datapath.c |  17 ++--
>  net/openvswitch/flow.c | 101 
> ++---
>  net/openvswitch/vport.c|  16 +++
>  3 files changed, 96 insertions(+), 38 deletions(-)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 4d67ea856067..c5b719fca8d4 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
...
...
> @@ -609,8 +597,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, 
> struct genl_info *info)
>
> err = ovs_flow_key_extract_userspace(net, a[OVS_PACKET_ATTR_KEY],
>  packet, &flow->key, log);
> -   if (err)
> +   if (err) {
> +   packet = NULL;
> goto err_flow_free;
> +   }
>
Why packet is set to NULL? This would leak skb memory in case of error.

> err = ovs_nla_copy_actions(net, a[OVS_PACKET_ATTR_ACTIONS],
>&flow->key, &acts, log);
...
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 96c8c4716603..7aee6e273765 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
...

> @@ -721,6 +734,20 @@ int ovs_flow_key_update(struct sk_buff *skb, struct 
> sw_flow_key *key)
> return key_extract(skb, key);
>  }
>
> +static u8 key_extract_mac_proto(struct sk_buff *skb)
> +{
> +   switch (skb->dev->type) {
> +   case ARPHRD_ETHER:
> +   return MAC_PROTO_ETHERNET;
> +   case ARPHRD_NONE:
> +   if (skb->protocol == htons(ETH_P_TEB))
> +   return MAC_PROTO_ETHERNET;
> +   return MAC_PROTO_NONE;
> +   }
> +   WARN_ON_ONCE(1);
> +   return MAC_PROTO_ETHERNET;
This packet should be dropped.

> +}
> +
>  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>  struct sk_buff *skb, struct sw_flow_key *key)
>  {


> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index dc0e2212edfc..8361b62a47c2 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -502,6 +502,22 @@ void ovs_vport_send(struct vport *vport, struct sk_buff 
> *skb, u8 mac_proto)
>  {
> int mtu = vport->dev->mtu;
>
> +   switch (vport->dev->type) {
> +   case ARPHRD_NONE:
> +   if (mac_proto != MAC_PROTO_NONE) {
> +   WARN_ON_ONCE(mac_proto != MAC_PROTO_ETHERNET);
> +
It would be easy to read if you check mac_proto for MAC_PROTO_ETHERNET
and then warn otherwise.
another issue is the packet is not dropped if mac_proto is not
MAC_PROTO_ETHERNET

> +   skb_reset_network_header(skb);
> +   skb_reset_mac_len(skb);
> +   skb->protocol = htons(ETH_P_TEB);
> +   }
> +   break;
> +   case ARPHRD_ETHER:
> +   if (mac_proto != MAC_PROTO_ETHERNET)
> +   goto drop;
> +   break;
> +   }
> +
> if (unlikely(packet_length(skb, vport->dev) > mtu &&
>  !skb_is_gso(skb))) {
> net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
> --
> 1.8.3.1
>


Re: [PATCH net] net: remove MTU limits on a few ether_setup callers

2016-10-20 Thread Florian Fainelli
Le 20/10/2016 à 20:25, Jarod Wilson a écrit :
> These few drivers call ether_setup(), but have no ndo_change_mtu, and thus
> were overlooked for changes to MTU range checking behavior. They
> previously had no range checks, so for feature-parity, set their min_mtu
> to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500
> inherited from the ether_setup() changes. Fine-tuning can come after we get
> back to full feature-parity here.
> 
> CC: netdev@vger.kernel.org
> Reported-by: Asbjoern Sloth Toennesen 
> CC: Asbjoern Sloth Toennesen 
> CC: R Parameswaran 
> Signed-off-by: Jarod Wilson 
> ---

> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 68714a5..d0c7bce 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1247,6 +1247,8 @@ int dsa_slave_create(struct dsa_switch *ds, struct 
> device *parent,
>   slave_dev->priv_flags |= IFF_NO_QUEUE;
>   slave_dev->netdev_ops = &dsa_slave_netdev_ops;
>   slave_dev->switchdev_ops = &dsa_slave_switchdev_ops;
> + slave_dev->min_mtu = 0;
> + slave_dev->max_mtu = ETH_MAX_MTU;
>   SET_NETDEV_DEVTYPE(slave_dev, &dsa_type);

Actually for DSA, a reasonable minimum is probably 68 for the following
reasons: ETH_ZLEN of 60 bytes + FCS (4 bytes) + 2/4/8 bytes of Ethernet
switch tag (which is dependent on the tagging protocol used). Such an
Ethernet interface would submit packets through the conduit interface
which is connected to an Ethernet switches, and those typically discard
packets smaller than 64 bytes +/- their custom tag length.

One driver for sure pads the packets to the appropriate length:
bcmsysport, but I don't know about e1000e or e.g: mv643xx_eth which are
also commonly used for the same purposes.
-- 
Florian


Re: [PATCH net-next v2 6/9] net: use core MTU range checking in virt drivers

2016-10-20 Thread Michael S. Tsirkin
On Thu, Oct 20, 2016 at 10:37:20PM -0400, Jarod Wilson wrote:
> On Thu, Oct 20, 2016 at 11:23:54PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 20, 2016 at 01:55:21PM -0400, Jarod Wilson wrote:
> ...
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index fad84f3..720809f 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -1419,17 +1419,6 @@ static const struct ethtool_ops 
> > > virtnet_ethtool_ops = {
> > >   .set_settings = virtnet_set_settings,
> > >  };
> > >  
> > > -#define MIN_MTU 68
> > > -#define MAX_MTU 65535
> > > -
> > > -static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
> > > -{
> > > - if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
> > > - return -EINVAL;
> > > - dev->mtu = new_mtu;
> > > - return 0;
> > > -}
> > > -
> > >  static const struct net_device_ops virtnet_netdev = {
> > >   .ndo_open= virtnet_open,
> > >   .ndo_stop= virtnet_close,
> > > @@ -1437,7 +1426,6 @@ static const struct net_device_ops virtnet_netdev = 
> > > {
> > >   .ndo_validate_addr   = eth_validate_addr,
> > >   .ndo_set_mac_address = virtnet_set_mac_address,
> > >   .ndo_set_rx_mode = virtnet_set_rx_mode,
> > > - .ndo_change_mtu  = virtnet_change_mtu,
> > >   .ndo_get_stats64 = virtnet_stats,
> > >   .ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
> > >   .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> > > @@ -1748,6 +1736,9 @@ static bool virtnet_validate_features(struct 
> > > virtio_device *vdev)
> > >   return true;
> > >  }
> > >  
> > > +#define MIN_MTU ETH_MIN_MTU
> > > +#define MAX_MTU ETH_MAX_MTU
> > > +
> > 
> > Can we drop these btw?
> 
> Bah. Yeah. Should have just used them directly. I didn't add ETH_MAX_MTU
> until after doing the virtio_net changes, so I missed that.
> 
> > >  static int virtnet_probe(struct virtio_device *vdev)
> > >  {
> > >   int i, err;
> > > @@ -1821,6 +1812,10 @@ static int virtnet_probe(struct virtio_device 
> > > *vdev)
> > >  
> > >   dev->vlan_features = dev->features;
> > >  
> > > + /* MTU range: 68 - 65535 */
> > > + dev->min_mtu = MIN_MTU;
> > > + dev->max_mtu = MAX_MTU;
> > > +
> > >   /* Configuration may specify what MAC to use.  Otherwise random. */
> > >   if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
> > >   virtio_cread_bytes(vdev,
> > > @@ -1875,8 +1870,10 @@ static int virtnet_probe(struct virtio_device 
> > > *vdev)
> > >   mtu = virtio_cread16(vdev,
> > >offsetof(struct virtio_net_config,
> > > mtu));
> > > - if (virtnet_change_mtu(dev, mtu))
> > > + if (mtu < dev->min_mtu || mtu > dev->max_mtu)
> > 
> > In fact the > max_mtu branch does not make sense since a 16 bit
> > value can't exceed MAX_MTU.
> 
> Hm. mtu is declared as an int, not sure if there's any sort of type
> promotion to be worried about (not an area I know much/anything about).

Not by design, that's for sure.

> Certainly something that could be looked into as a minor optimization,
> though it's only in a probe path and shouldn't hurt anything, so ... meh?

Right. Aaron said he's working on a patch that essentially does
dev->max_mtu = mtu after validation, so this part will look
a bit silly there.

> -- 
> Jarod Wilson
> ja...@redhat.com


[PATCH net] net: remove MTU limits on a few ether_setup callers

2016-10-20 Thread Jarod Wilson
These few drivers call ether_setup(), but have no ndo_change_mtu, and thus
were overlooked for changes to MTU range checking behavior. They
previously had no range checks, so for feature-parity, set their min_mtu
to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500
inherited from the ether_setup() changes. Fine-tuning can come after we get
back to full feature-parity here.

CC: netdev@vger.kernel.org
Reported-by: Asbjoern Sloth Toennesen 
CC: Asbjoern Sloth Toennesen 
CC: R Parameswaran 
Signed-off-by: Jarod Wilson 
---
 net/atm/br2684.c| 4 +++-
 net/bluetooth/bnep/netdev.c | 2 ++
 net/dsa/slave.c | 2 ++
 net/irda/irlan/irlan_eth.c  | 3 ++-
 net/l2tp/l2tp_eth.c | 2 ++
 5 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/atm/br2684.c b/net/atm/br2684.c
index c7d82f4..fca84e1 100644
--- a/net/atm/br2684.c
+++ b/net/atm/br2684.c
@@ -649,7 +649,9 @@ static void br2684_setup_routed(struct net_device *netdev)
netdev->hard_header_len = sizeof(llc_oui_ipv4); /* worst case */
netdev->netdev_ops = &br2684_netdev_ops_routed;
netdev->addr_len = 0;
-   netdev->mtu = 1500;
+   netdev->mtu = ETH_DATA_LEN;
+   netdev->min_mtu = 0;
+   netdev->max_mtu = ETH_MAX_MTU;
netdev->type = ARPHRD_PPP;
netdev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
netdev->tx_queue_len = 100;
diff --git a/net/bluetooth/bnep/netdev.c b/net/bluetooth/bnep/netdev.c
index 0f25ddc..2b875ed 100644
--- a/net/bluetooth/bnep/netdev.c
+++ b/net/bluetooth/bnep/netdev.c
@@ -221,6 +221,8 @@ void bnep_net_setup(struct net_device *dev)
dev->addr_len = ETH_ALEN;
 
ether_setup(dev);
+   dev->min_mtu = 0;
+   dev->max_mtu = ETH_MAX_MTU;
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
dev->netdev_ops = &bnep_netdev_ops;
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 68714a5..d0c7bce 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1247,6 +1247,8 @@ int dsa_slave_create(struct dsa_switch *ds, struct device 
*parent,
slave_dev->priv_flags |= IFF_NO_QUEUE;
slave_dev->netdev_ops = &dsa_slave_netdev_ops;
slave_dev->switchdev_ops = &dsa_slave_switchdev_ops;
+   slave_dev->min_mtu = 0;
+   slave_dev->max_mtu = ETH_MAX_MTU;
SET_NETDEV_DEVTYPE(slave_dev, &dsa_type);
 
netdev_for_each_tx_queue(slave_dev, dsa_slave_set_lockdep_class_one,
diff --git a/net/irda/irlan/irlan_eth.c b/net/irda/irlan/irlan_eth.c
index 8192eae..74d09f9 100644
--- a/net/irda/irlan/irlan_eth.c
+++ b/net/irda/irlan/irlan_eth.c
@@ -66,7 +66,8 @@ static void irlan_eth_setup(struct net_device *dev)
 
dev->netdev_ops = &irlan_eth_netdev_ops;
dev->destructor = free_netdev;
-
+   dev->min_mtu= 0;
+   dev->max_mtu= ETH_MAX_MTU;
 
/*
 * Lets do all queueing in IrTTP instead of this device driver.
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 965f7e3..e2c6ae0 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -259,6 +259,8 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, 
u32 session_id, u32 p
session->mtu = dev->mtu - session->hdr_len;
dev->mtu = session->mtu;
dev->needed_headroom += session->hdr_len;
+   dev->min_mtu = 0;
+   dev->max_mtu = ETH_MAX_MTU;
 
priv = netdev_priv(dev);
priv->dev = dev;
-- 
2.10.0



Re: [PATCH 1/4] kconfig: introduce the "imply" keyword

2016-10-20 Thread Nicolas Pitre
On Thu, 20 Oct 2016, Nicolas Pitre wrote:

> On Thu, 20 Oct 2016, Edward Cree wrote:
> 
> > I'm interpreting "imply" as being more a way of saying "if you want FOO you
> > probably want BAZ as well".  But maybe that should be yet another new
> > keyword if it's so different from what you want "imply" to be.  "suggests",
> > perhaps.
> 
> Indeed. That's exactly the keyword that came to my mind after I sent my 
> previous reply.

So what about the following on top of my previous series:

From: Nicolas Pitre 
Date: Thu, 20 Oct 2016 23:04:46 -0400
Subject: [PATCH] kconfig: introduce the "suggest" keyword

Similar to "imply" but with no added value restrictions on the target
symbol. Useful for providing a default value to another symbol.

Signed-off-by: Nicolas Pitre 

diff --git a/Documentation/kbuild/kconfig-language.txt 
b/Documentation/kbuild/kconfig-language.txt
index 5ee0dd3c85..b7f4f0ca1d 100644
--- a/Documentation/kbuild/kconfig-language.txt
+++ b/Documentation/kbuild/kconfig-language.txt
@@ -140,6 +140,12 @@ applicable everywhere (see syntax).
   ability to hook into a given subsystem while still being able to
   configure that subsystem out and keep those drivers selected.
 
+- even weaker reverse dependencies: "suggest"  ["if" ]
+  This is similar to "imply" except that this doesn't add any restrictions
+  on the value the suggested symbol may use. In other words this only
+  provides a default for the specified symbol based on the value for the
+  config entry where this is used.
+
 - limiting menu display: "visible if" 
   This attribute is only applicable to menu blocks, if the condition is
   false, the menu block is not displayed to the user (the symbols
diff --git a/scripts/kconfig/expr.h b/scripts/kconfig/expr.h
index a73f762c48..eea3aa3c7a 100644
--- a/scripts/kconfig/expr.h
+++ b/scripts/kconfig/expr.h
@@ -86,6 +86,7 @@ struct symbol {
struct expr_value dir_dep;
struct expr_value rev_dep;
struct expr_value implied;
+   struct expr_value suggested;
 };
 
 #define for_all_symbols(i, sym) for (i = 0; i < SYMBOL_HASHSIZE; i++) for (sym 
= symbol_hash[i]; sym; sym = sym->next) if (sym->type != S_OTHER)
@@ -138,6 +139,7 @@ enum prop_type {
P_CHOICE,   /* choice value */
P_SELECT,   /* select BAR */
P_IMPLY,/* imply BAR */
+   P_SUGGEST,  /* suggest BAR */
P_RANGE,/* range 7..100 (for a symbol) */
P_ENV,  /* value from environment variable */
P_SYMBOL,   /* where a symbol is defined */
diff --git a/scripts/kconfig/menu.c b/scripts/kconfig/menu.c
index e9357931b4..3abc5c85ac 100644
--- a/scripts/kconfig/menu.c
+++ b/scripts/kconfig/menu.c
@@ -255,7 +255,9 @@ static void sym_check_prop(struct symbol *sym)
break;
case P_SELECT:
case P_IMPLY:
-   use = prop->type == P_SELECT ? "select" : "imply";
+   case P_SUGGEST:
+   use = prop->type == P_SELECT ? "select" :
+ prop->type == P_IMPLY ? "imply" : "suggest";
sym2 = prop_get_symbol(prop);
if (sym->type != S_BOOLEAN && sym->type != S_TRISTATE)
prop_warn(prop,
@@ -341,6 +343,10 @@ void menu_finalize(struct menu *parent)
struct symbol *es = 
prop_get_symbol(prop);
es->implied.expr = 
expr_alloc_or(es->implied.expr,

expr_alloc_and(expr_alloc_symbol(menu->sym), expr_copy(dep)));
+   } else if (prop->type == P_SUGGEST) {
+   struct symbol *es = 
prop_get_symbol(prop);
+   es->suggested.expr = 
expr_alloc_or(es->suggested.expr,
+   
expr_alloc_and(expr_alloc_symbol(menu->sym), expr_copy(dep)));
}
}
}
@@ -687,6 +693,13 @@ static void get_symbol_str(struct gstr *r, struct symbol 
*sym,
str_append(r, "\n");
}
 
+   get_symbol_props_str(r, sym, P_SUGGEST, _("  Suggests: "));
+   if (sym->suggested.expr) {
+   str_append(r, _("  Suggested by: "));
+   expr_gstr_print(sym->suggested.expr, r);
+   str_append(r, "\n");
+   }
+
str_append(r, "\n\n");
 }
 
diff --git a/scripts/kconfig/symbol.c b/scripts/kconfig/symbol.c
index 074fb66d9a..235f11e3f9 100644
--- a/scripts/kconfig/symbol.c
+++ b/scripts/kconfig/symbol.c
@@ -267,6 +267,16 @@ static void sym_calc_visibility(struct symbol *sym)
sym->implied.tri = tri;
sym_set_changed(sym);
}
+   tri = no;
+   if (sym->suggested.expr)
+   tri = expr_calc_value(sym->suggested.expr);
+   tri = EXPR_AND(tri, sym->visible);
+   if 

Re: [PATCH] net: l2tp_eth: fix max_mtu

2016-10-20 Thread Jarod Wilson
On Thu, Oct 20, 2016 at 09:19:29PM +, Asbjoern Sloth Toennesen wrote:
> On Thu, 20 Oct 2016 20:08:26 + (UTC), a...@asbjorn.biz wrote:
> > diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
> > index 965f7e3..ba82dcc 100644
> > --- a/net/l2tp/l2tp_eth.c
> > +++ b/net/l2tp/l2tp_eth.c
> > @@ -259,6 +259,7 @@ static int l2tp_eth_create(struct net *net, u32 
> > tunnel_id, u32 session_id, u32 p
> > session->mtu = dev->mtu - session->hdr_len;
> > dev->mtu = session->mtu;
> > dev->needed_headroom += session->hdr_len;
> > +   dev->max_mtu = ETH_MAX_MTU - dev->needed_headroom;
> 
> I forgot to add that subtracting dev->needed_headroom doesn't
> give the exact maximum MTU, but one that is a bit too high.
> It could also just be set to ETH_MAX_MTU, to indicate that
> the encapsulation is not included.
> 
> Calculations like in R. Parameswaran's patch is needed for the
> exact one.
> https://lkml.org/lkml/2016/10/17/3

Argh. So there are a tiny handful of drivers that call ether_setup, but
have no ndo_change_mtu at all, which thus previously, had no mtu range
checks other than mtu < 0, and now have a range check of 68 to 1500...

Since there was no range at all before, feature-parity here should be
min_mtu = 0, max_mtu = ETH_MAX_MTU. I think I see five other cases that
are in a similar situation:

net/atm/br2684.c
net/bluetooth/bnep/netdev.c
net/dsa/slave.c
net/irda/irda_device.c
net/irda/irlan/irlan_eth.c

Honestly, I think dsa is the only one where it likely actually matters,
but I'll send a patch addressing all of those plus l2tp_eth in half a sec.

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH] net: skip genenerating uevents for network namespaces that are exiting

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 7:46 PM, Andrei Vagin  wrote:
> No one can see these events, because a network namespace can not be
> destroyed, if it has sockets.
>

Are you sure? kobject_uevent_env() seems sending uevents to all
network namespaces.


[PATCH net-next] net: allow to kill a task which waits net_mutex in copy_new_ns

2016-10-20 Thread Andrei Vagin
From: Andrey Vagin 

net_mutex can be locked for a long time. It may be because many
namespaces are being destroyed or many processes decide to create
a network namespace.

Both these operations are heavy, so it is better to have an ability to
kill a process which is waiting net_mutex.

Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Signed-off-by: Andrei Vagin 
---
 net/core/net_namespace.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 989434f..b9243b1 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -379,7 +379,14 @@ struct net *copy_net_ns(unsigned long flags,
 
get_user_ns(user_ns);
 
-   mutex_lock(&net_mutex);
+   rv = mutex_lock_killable(&net_mutex);
+   if (rv < 0) {
+   net_free(net);
+   dec_net_namespaces(ucounts);
+   put_user_ns(user_ns);
+   return ERR_PTR(rv);
+   }
+
net->ucounts = ucounts;
rv = setup_net(net, user_ns);
if (rv == 0) {
-- 
2.7.4



[PATCH] net: skip genenerating uevents for network namespaces that are exiting

2016-10-20 Thread Andrei Vagin
No one can see these events, because a network namespace can not be
destroyed, if it has sockets.

My experiments shows that net namespaces are destroyed more 30% faster
with this optimization.

Here is a perf output for destroying network namespaces without this
patch.

-   94.76% 0.02%  kworker/u48:1  [kernel.kallsyms] [k] cleanup_net
   - 94.74% cleanup_net
  - 94.64% ops_exit_list.isra.4
 - 41.61% default_device_exit_batch
- 41.47% unregister_netdevice_many
   - rollback_registered_many
  - 40.36% netdev_unregister_kobject
 - 14.55% device_del
+ 13.71% kobject_uevent
 - 13.04% netdev_queue_update_kobjects
+ 12.96% kobject_put
 - 12.72% net_rx_queue_update_kobjects
  kobject_put
- kobject_release
   + 12.69% kobject_uevent
  + 0.80% call_netdevice_notifiers_info
 + 19.57% nfsd_exit_net
 + 11.15% tcp_net_metrics_exit
 + 8.25% rpcsec_gss_exit_net

It's very critical to optimize the exit path for network namespaces,
because they are destroyed under net_mutex and many namespaces can be
destroyed for one iteration.

Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Signed-off-by: Andrei Vagin 
---
 net/core/net-sysfs.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..c02515e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -950,10 +950,13 @@ net_rx_queue_update_kobjects(struct net_device *dev, int 
old_num, int new_num)
}
 
while (--i >= new_num) {
+   struct kobject *kobj = &dev->_rx[i].kobj;
+
+   if (!list_empty(&dev_net(dev)->exit_list))
+   kobj->uevent_suppress = 1;
if (dev->sysfs_rx_queue_group)
-   sysfs_remove_group(&dev->_rx[i].kobj,
-  dev->sysfs_rx_queue_group);
-   kobject_put(&dev->_rx[i].kobj);
+   sysfs_remove_group(kobj, dev->sysfs_rx_queue_group);
+   kobject_put(kobj);
}
 
return error;
@@ -1340,6 +1343,8 @@ netdev_queue_update_kobjects(struct net_device *dev, int 
old_num, int new_num)
while (--i >= new_num) {
struct netdev_queue *queue = dev->_tx + i;
 
+   if (!list_empty(&dev_net(dev)->exit_list))
+   queue->kobj.uevent_suppress = 1;
 #ifdef CONFIG_BQL
sysfs_remove_group(&queue->kobj, &dql_group);
 #endif
@@ -1525,6 +1530,9 @@ void netdev_unregister_kobject(struct net_device *ndev)
 {
struct device *dev = &(ndev->dev);
 
+   if (!list_empty(&dev_net(ndev)->exit_list))
+   dev->kobj.uevent_suppress = 1;
+
kobject_get(&dev->kobj);
 
remove_queue_kobjects(ndev);
-- 
2.7.4



Re: [PATCH net-next v2 6/9] net: use core MTU range checking in virt drivers

2016-10-20 Thread Jarod Wilson
On Thu, Oct 20, 2016 at 11:23:54PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 20, 2016 at 01:55:21PM -0400, Jarod Wilson wrote:
...
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index fad84f3..720809f 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1419,17 +1419,6 @@ static const struct ethtool_ops virtnet_ethtool_ops 
> > = {
> > .set_settings = virtnet_set_settings,
> >  };
> >  
> > -#define MIN_MTU 68
> > -#define MAX_MTU 65535
> > -
> > -static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
> > -{
> > -   if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
> > -   return -EINVAL;
> > -   dev->mtu = new_mtu;
> > -   return 0;
> > -}
> > -
> >  static const struct net_device_ops virtnet_netdev = {
> > .ndo_open= virtnet_open,
> > .ndo_stop= virtnet_close,
> > @@ -1437,7 +1426,6 @@ static const struct net_device_ops virtnet_netdev = {
> > .ndo_validate_addr   = eth_validate_addr,
> > .ndo_set_mac_address = virtnet_set_mac_address,
> > .ndo_set_rx_mode = virtnet_set_rx_mode,
> > -   .ndo_change_mtu  = virtnet_change_mtu,
> > .ndo_get_stats64 = virtnet_stats,
> > .ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
> > .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> > @@ -1748,6 +1736,9 @@ static bool virtnet_validate_features(struct 
> > virtio_device *vdev)
> > return true;
> >  }
> >  
> > +#define MIN_MTU ETH_MIN_MTU
> > +#define MAX_MTU ETH_MAX_MTU
> > +
> 
> Can we drop these btw?

Bah. Yeah. Should have just used them directly. I didn't add ETH_MAX_MTU
until after doing the virtio_net changes, so I missed that.

> >  static int virtnet_probe(struct virtio_device *vdev)
> >  {
> > int i, err;
> > @@ -1821,6 +1812,10 @@ static int virtnet_probe(struct virtio_device *vdev)
> >  
> > dev->vlan_features = dev->features;
> >  
> > +   /* MTU range: 68 - 65535 */
> > +   dev->min_mtu = MIN_MTU;
> > +   dev->max_mtu = MAX_MTU;
> > +
> > /* Configuration may specify what MAC to use.  Otherwise random. */
> > if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
> > virtio_cread_bytes(vdev,
> > @@ -1875,8 +1870,10 @@ static int virtnet_probe(struct virtio_device *vdev)
> > mtu = virtio_cread16(vdev,
> >  offsetof(struct virtio_net_config,
> >   mtu));
> > -   if (virtnet_change_mtu(dev, mtu))
> > +   if (mtu < dev->min_mtu || mtu > dev->max_mtu)
> 
> In fact the > max_mtu branch does not make sense since a 16 bit
> value can't exceed MAX_MTU.

Hm. mtu is declared as an int, not sure if there's any sort of type
promotion to be worried about (not an area I know much/anything about).
Certainly something that could be looked into as a minor optimization,
though it's only in a probe path and shouldn't hurt anything, so ... meh?

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net] udp: must lock the socket in udp_disconnect()

2016-10-20 Thread Eric Dumazet
On Thu, 2016-10-20 at 18:12 -0700, Cong Wang wrote:

> If this is the cause of the hashlist corruption (I am still unsure about 
> this),
> then why only UDP? Don't all of those using ip4_datagram_connect()
> as ->connect() and using udp_disconnect() as ->disconnect() need this fix?
> 
> For example, after your patch,
> 
> .connect =  ip4_datagram_connect,
> -   .disconnect =   udp_disconnect,
> +   .disconnect =   __udp_disconnect,
> 
> Ping socket still doesn't have sock lock for ->disconnect() but has it for
> ->connect()? I must miss something...

ping is less complex, it is protected by a single ping_table.lock

While UDP has to maintain two hash chains with may locks, and has to
handle rehash(), because autobind happens before full 4-tuple is setup.







Re: [PATCH] kexec: Export kexec_in_progress to modules

2016-10-20 Thread David Miller
From: Florian Fainelli 
Date: Thu, 20 Oct 2016 18:15:16 -0700

> The bcm_sf2 driver uses kexec_in_progress to know whether it can power
> down an integrated PHY during shutdown, and can be built as a module.
> Other modules may be using this in the future, so export it.
> 
> Fixes: 2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
> kernels")
> Signed-off-by: Florian Fainelli 
> ---
> Eric, David, Stephen,
> 
> The offending commit is in David's net.git tree, so it would probably make
> sense to route the fix through the same tree.

Ok, I'll apply this, thanks Florian.


linux-next: manual merge of the staging tree with the net-next tree

2016-10-20 Thread Stephen Rothwell
Hi Greg,

Today's linux-next merge of the staging tree got a conflict in:

  drivers/staging/wlan-ng/p80211netdev.c

between commit:

  9c22b4a34edd ("net: use core MTU range checking in wireless drivers")

from the net-next tree and commit:

  84ad1efa7d2b ("staging: wlan-ng: fix block comment warnings in 
p80211netdev.c")

from the staging tree.

I fixed it up (the former removed the code updates by the latter, so I
just did that) and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging.  You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell


[PATCH] kexec: Export kexec_in_progress to modules

2016-10-20 Thread Florian Fainelli
The bcm_sf2 driver uses kexec_in_progress to know whether it can power
down an integrated PHY during shutdown, and can be built as a module.
Other modules may be using this in the future, so export it.

Fixes: 2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
kernels")
Signed-off-by: Florian Fainelli 
---
Eric, David, Stephen,

The offending commit is in David's net.git tree, so it would probably make
sense to route the fix through the same tree.

Thanks!

 kernel/kexec_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 561675589511..786ab85a5452 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -59,6 +59,7 @@ size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
+EXPORT_SYMBOL_GPL(kexec_in_progress);
 
 
 /* Location of the reserved area for the crash kernel */
-- 
2.7.4



Re: [PATCH net] udp: must lock the socket in udp_disconnect()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 9:39 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> Baozeng Ding reported KASAN traces showing uses after free in
> udp_lib_get_port() and other related UDP functions.
>
> A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
>
> I could write a reproducer with two threads doing :
>
> static int sock_fd;
> static void *thr1(void *arg)
> {
> for (;;) {
> connect(sock_fd, (const struct sockaddr *)arg,
> sizeof(struct sockaddr_in));
> }
> }
>
> static void *thr2(void *arg)
> {
> struct sockaddr_in unspec;
>
> for (;;) {
> memset(&unspec, 0, sizeof(unspec));
> connect(sock_fd, (const struct sockaddr *)&unspec,
> sizeof(unspec));
> }
> }
>
> Problem is that udp_disconnect() could run without holding socket lock,
> and this was causing list corruptions.

If this is the cause of the hashlist corruption (I am still unsure about this),
then why only UDP? Don't all of those using ip4_datagram_connect()
as ->connect() and using udp_disconnect() as ->disconnect() need this fix?

For example, after your patch,

.connect =  ip4_datagram_connect,
-   .disconnect =   udp_disconnect,
+   .disconnect =   __udp_disconnect,

Ping socket still doesn't have sock lock for ->disconnect() but has it for
->connect()? I must miss something...


Re: linux-next: build failure after merge of the net tree

2016-10-20 Thread Florian Fainelli
On 10/20/2016 05:43 PM, Stephen Rothwell wrote:
> Hi Florian,
> 
> On Thu, 20 Oct 2016 17:30:33 -0700 Florian Fainelli  
> wrote:
>>
>> On 10/20/2016 03:42 PM, Florian Fainelli wrote:
>>> On 10/20/2016 03:27 PM, Stephen Rothwell wrote:  
>>>>
>>>> After merging the net tree, today's linux-next build (arm
>>>> multi_v7_defconfig) failed like this:
>>>>
>>>> ERROR: "kexec_in_progress" [drivers/net/dsa/bcm_sf2.ko] undefined!
>>>>
>>>> Caused by commit
>>>>
>>>>   2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
>>>> kernels")
>>>>
>>>> I used the version of the net tree from next-20161020 for today.  
>>>
>>> OK, seems like we need an ifdef CONFIG_KEXEC_CORE, let me fix that.
>>> Thanks Stephen!  
>>
>> Stephen, can you send me the .config file? I could not reproduce it with
>> next-20161020 and doing a make multi_v7_defconfig:
> 
> This only started with the net tree as of today (HEAD a681574c99be).
> The problem is simply that kexec_in_progress is not exported to modules.

Realized that right after sending you the email.

> 
> Sorry, getting the actual .config file is a pain at this point since my
> builds have moved on.
> 
> It probably had (the current .config has):
> 
> CONFIG_NET_DSA_BCM_SF2=m
> CONFIG_KEXEC_CORE=y
> 
> If CONFIG_KEXEC_CORE is not set, then kexec_in_progress is defined to false.

Yes indeed, I have a fix ready, thanks!
-- 
Florian


RE: [PATCH net-next v2 6/9] net: use core MTU range checking in virt drivers

2016-10-20 Thread Kershner, David A


> -Original Message-
> From: Haiyang Zhang [mailto:haiya...@microsoft.com]
> Sent: Thursday, October 20, 2016 2:05 PM
> To: Jarod Wilson ; linux-ker...@vger.kernel.org
> Cc: netdev@vger.kernel.org; virtualizat...@lists.linux-foundation.org; KY
> Srinivasan ; Michael S. Tsirkin ;
> Shrikrishna Khare ; VMware, Inc.  driv...@vmware.com>; Wei Liu ; Paul Durrant
> ; Kershner, David A
> 
> Subject: RE: [PATCH net-next v2 6/9] net: use core MTU range checking in virt
> drivers
> 
> 
> 
> > -Original Message-
> > From: Jarod Wilson [mailto:ja...@redhat.com]
> > Sent: Thursday, October 20, 2016 1:55 PM
> > To: linux-ker...@vger.kernel.org
> > Cc: Jarod Wilson ; netdev@vger.kernel.org;
> > virtualizat...@lists.linux-foundation.org; KY Srinivasan
> > ; Haiyang Zhang ;
> Michael S.
> > Tsirkin ; Shrikrishna Khare ;
> VMware,
> > Inc. ; Wei Liu ; Paul
> > Durrant ; David Kershner
> > 
> > Subject: [PATCH net-next v2 6/9] net: use core MTU range checking in
> > virt drivers
> >
> > hyperv_net:
> > - set min/max_mtu, per Haiyang, after rndis_filter_device_add
> >
> > virtio_net:
> > - set min/max_mtu
> > - remove virtnet_change_mtu
> >
> > vmxnet3:
> > - set min/max_mtu
> >
> > xen-netback:
> > - min_mtu = 0, max_mtu = 65517
> >
> > xen-netfront:
> > - min_mtu = 0, max_mtu = 65535
> >
> > unisys/visor:
> > - clean up defines a little to not clash with network core or add
> >   redundat definitions
> >
> > CC: netdev@vger.kernel.org
> > CC: virtualizat...@lists.linux-foundation.org
> > CC: "K. Y. Srinivasan" 
> > CC: Haiyang Zhang 
> > CC: "Michael S. Tsirkin" 
> > CC: Shrikrishna Khare 
> > CC: "VMware, Inc." 
> > CC: Wei Liu 
> > CC: Paul Durrant 
> > CC: David Kershner 
> > Signed-off-by: Jarod Wilson 
> > ---
> 
> The hv_netvsc changes look fine. Thanks.
> 
> Reviewed-by: Haiyang Zhang 
> 

The visornic changes look good.

Reviewed-by: David Kershner 


 


Re: linux-next: build failure after merge of the net tree

2016-10-20 Thread Stephen Rothwell
Hi Florian,

On Thu, 20 Oct 2016 17:30:33 -0700 Florian Fainelli  
wrote:
>
> On 10/20/2016 03:42 PM, Florian Fainelli wrote:
> > On 10/20/2016 03:27 PM, Stephen Rothwell wrote:  
> >>
> >> After merging the net tree, today's linux-next build (arm
> >> multi_v7_defconfig) failed like this:
> >>
> >> ERROR: "kexec_in_progress" [drivers/net/dsa/bcm_sf2.ko] undefined!
> >>
> >> Caused by commit
> >>
> >>   2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
> >> kernels")
> >>
> >> I used the version of the net tree from next-20161020 for today.  
> > 
> > OK, seems like we need an ifdef CONFIG_KEXEC_CORE, let me fix that.
> > Thanks Stephen!  
> 
> Stephen, can you send me the .config file? I could not reproduce it with
> next-20161020 and doing a make multi_v7_defconfig:

This only started with the net tree as of today (HEAD a681574c99be).
The problem is simply that kexec_in_progress is not exported to modules.

Sorry, getting the actual .config file is a pain at this point since my
builds have moved on.

It probably had (the current .config has):

CONFIG_NET_DSA_BCM_SF2=m
CONFIG_KEXEC_CORE=y

If CONFIG_KEXEC_CORE is not set, then kexec_in_progress is defined to false.

-- 
Cheers,
Stephen Rothwell


Re: linux-next: build failure after merge of the net tree

2016-10-20 Thread Florian Fainelli
On 10/20/2016 03:42 PM, Florian Fainelli wrote:
> On 10/20/2016 03:27 PM, Stephen Rothwell wrote:
>> Hi all,
>>
>> After merging the net tree, today's linux-next build (arm
>> multi_v7_defconfig) failed like this:
>>
>> ERROR: "kexec_in_progress" [drivers/net/dsa/bcm_sf2.ko] undefined!
>>
>> Caused by commit
>>
>>   2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
>> kernels")
>>
>> I used the version of the net tree from next-20161020 for today.
> 
> OK, seems like we need an ifdef CONFIG_KEXEC_CORE, let me fix that.
> Thanks Stephen!

Stephen, can you send me the .config file? I could not reproduce it with
next-20161020 and doing a make multi_v7_defconfig:

fainelli@fainelli-desktop:[~/../linux]$ ls -l drivers/net/dsa/*.ko
-rw-rw-r-- 1 fainelli fainelli 17K Oct 20 17:30 drivers/net/dsa/bcm_sf2.ko

Thanks
-- 
Florian


Re: [Patch net v2] ipv6: fix a potential deadlock in do_ipv6_setsockopt()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 3:10 PM, Eric Dumazet  wrote:
>
> I used 'I wonder if' to say that we might have some better way to code
> this test nowadays, but this can be done in a separate patch of course.

OK, let's keep one bugfix in one patch. ;)


linux-next: manual merge of the net-next tree with the net tree

2016-10-20 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/ibm/ibmvnic.c

between commit:

  87737f8810db ("ibmvnic: Update MTU after device initialization")

from the net tree and commit:

  d894be57ca92 ("ethernet: use net core MTU range checking in more drivers")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/ibm/ibmvnic.c
index 213162df1a9b,657206be7ba9..
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@@ -3654,7 -3644,8 +3644,9 @@@ static void handle_crq_init_rsp(struct 
goto task_failed;
  
netdev->real_num_tx_queues = adapter->req_tx_queues;
 +  netdev->mtu = adapter->req_mtu;
+   netdev->min_mtu = adapter->min_mtu;
+   netdev->max_mtu = adapter->max_mtu;
  
if (adapter->failover) {
adapter->failover = false;


Re: [Patch net] net: saving irq context for peernet2id()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 12:07 PM, Paul Moore  wrote:
> On Thu, Oct 20, 2016 at 2:29 PM, Cong Wang  wrote:
>> On Thu, Oct 20, 2016 at 7:58 AM, Stephen Smalley  wrote:
>>> On 10/20/2016 02:52 AM, Cong Wang wrote:
 A kernel warning inside __local_bh_enable_ip() was reported by people
 running SELinux, this is caused due to some SELinux functions
 (indirectly) call peernet2id() with IRQ disabled in process context,
 when we re-enable BH with IRQ disabled kernel complains. Shut up this
 warning by saving IRQ context in peernet2id(), BH is still implicitly
 disabled.
>>>
>>> Not sure this suffices; kill_fasync() -> send_sigio() ->
>>> send_sigio_to_task() -> sigio_perm() -> security_file_send_sigiotask()
>>> -> selinux_file_send_sigiotask() -> ... -> audit_log() -> ... ->
>>> peernet2id()
>>
>> Oh, this is a new one. kill_fasync() is called in IRQ handler, so we actually
>> do multicast in IRQ context It makes no sense, netlink multicast could
>> be very expensive if we have many listeners.
>
> I'm sure there are a few others I don't know about, but I believe the
> only commonly used audit multicast listener is systemd.

But user-space is free to listen to this group, right? If so this is just open
for a potential DDOS attack.

>
>> I am Cc'ing Richard who added that multicast in audit_log_end(). It seems
>> not easy to just move the multicast to a workqueue, since the skb is copied
>> from audit_buffer which is freed immediately after that, probably need 
>> another
>> queue like audit_skb_queue.
>
> This approach would double the queue size which is something I want to
> avoid.  I would suggest sticking with a single queue and dealing with
> the netlink message link fixup and multicast send in the existing
> netlink unicast thread; basically we would just be moving the
> multicast code from audit_log_end() into kauditd_thread().  This is
> the same approach I mentioned earlier off-list.

This is what I did in the follow up patch. I attach the updated version
in this email for you to review, I still can't make selinux-testsuites working
on my Fedora even though I have SELinux=enforcing, anyhow I don't
see any kernel warning in my dmesg at least.

>
> However, that isn't something I want to mess with as a regression fix,
> mostly because I really want to see this regression gone by -rc2 as it
> is making SELinux testing a real pain.  If the patch posted at the top
> of this thread isn't a suitable fix, we really should revert the
> original patch.

Since you want to test SELinux anyway, please test the attached one.

Thanks.
diff --git a/kernel/audit.c b/kernel/audit.c
index f1ca116..cdc5a91 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -139,6 +139,7 @@ static int audit_freelist_count;
 static LIST_HEAD(audit_freelist);
 
 static struct sk_buff_head audit_skb_queue;
+static struct sk_buff_head audit_skb_multicast_queue;
 /* queue of skbs to send to auditd when/if it comes back */
 static struct sk_buff_head audit_skb_hold_queue;
 static struct task_struct *kauditd_task;
@@ -468,7 +469,8 @@ static void kauditd_send_multicast_skb(struct sk_buff *skb, 
gfp_t gfp_mask)
if (!copy)
return;
 
-   nlmsg_multicast(sock, copy, 0, AUDIT_NLGRP_READLOG, gfp_mask);
+   skb_queue_tail(&audit_skb_multicast_queue, copy);
+   wake_up_interruptible(&kauditd_wait);
 }
 
 /*
@@ -509,6 +511,26 @@ static void flush_hold_queue(void)
consume_skb(skb);
 }
 
+static void flush_multicast_queue(void)
+{
+   struct audit_net *aunet = net_generic(&init_net, audit_net_id);
+   struct sock *sock = aunet->nlsk;
+   struct sk_buff *skb = skb_dequeue(&audit_skb_multicast_queue);
+
+   if (!netlink_has_listeners(sock, AUDIT_NLGRP_READLOG)) {
+   while (skb) {
+   consume_skb(skb);
+   skb = skb_dequeue(&audit_skb_multicast_queue);
+   }
+   return;
+   }
+
+   while (skb) {
+   nlmsg_multicast(sock, skb, 0, AUDIT_NLGRP_READLOG, GFP_KERNEL);
+   skb = skb_dequeue(&audit_skb_multicast_queue);
+   }
+}
+
 static int kauditd_thread(void *dummy)
 {
set_freezable();
@@ -517,6 +539,8 @@ static int kauditd_thread(void *dummy)
 
flush_hold_queue();
 
+   flush_multicast_queue();
+
skb = skb_dequeue(&audit_skb_queue);
 
if (skb) {
@@ -530,7 +554,8 @@ static int kauditd_thread(void *dummy)
continue;
}
 
-   wait_event_freezable(kauditd_wait, 
skb_queue_len(&audit_skb_queue));
+   wait_event_freezable(kauditd_wait, 
skb_queue_len(&audit_skb_queue)
+  || 
skb_queue_len(&audit_skb_multicast_queue));
}
return 0;
 }
@@ -1197,6 +1222,7 @@ static int __init audit_init(void)
register_pernet_subsys(&audit_net_ops);
 
skb_queue_head_init(&audit_skb_queue)

RE: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF support

2016-10-20 Thread Vatsavayi, Raghu


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, October 20, 2016 1:57 PM
> To: Vatsavayi, Raghu
> Cc: netdev@vger.kernel.org; Chickles, Derek; Burla, Satananda; Manlunas,
> Felix
> Subject: Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF
> support
> 
> From: "Vatsavayi, Raghu" 
> Date: Thu, 20 Oct 2016 20:01:37 +
> 
> >
> >
> >> -Original Message-
> >> From: David Miller [mailto:da...@davemloft.net]
> >> Sent: Thursday, October 20, 2016 11:13 AM
> >> To: Vatsavayi, Raghu
> >> Cc: netdev@vger.kernel.org; Vatsavayi, Raghu; Chickles, Derek; Burla,
> >> Satananda; Manlunas, Felix
> >> Subject: Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for
> >> VF support
> >>
> >> From: Raghu Vatsavayi 
> >> Date: Wed, 19 Oct 2016 22:40:38 -0700
> >>
> >> > +/* Default behaviour of Liquidio is to provide one queue per VF.
> >> > +But Liquidio
> >> > + * can also provide multiple queues to each VF. If user wants to
> >> > +change the
> >> > + * default behaviour HW should be provided configuration info at
> >> > +init time,
> >> > + * based on which it will create control queues for communicating
> >> > +with
> >> FW.
> >> > + */
> >> > +static u32 max_vfs[2] = { 0, 0 };
> >> > +module_param_array(max_vfs, int, NULL, 0444);
> >> > +MODULE_PARM_DESC(max_vfs, "Assign two comma-separated
> unsigned
> >> > +integers that specify max number of VFs for PF0 (left of the
> >> > +comma) and PF1 (right of the comma); for 23xx only. By default HW
> >> > +will configure as many VFs as queues after allocating PF queues.To
> >> > +increase queues for VF use this parameter. Use sysfs to create
> >> > +these VFs.");
> >> > +
> >> > +static unsigned int num_queues_per_pf[2] = { 0, 0 };
> >> > +module_param_array(num_queues_per_pf, uint, NULL, 0444);
> >> > +MODULE_PARM_DESC(num_queues_per_pf, "two comma-separated
> >> unsigned
> >> > +integers that specify number of queues per PF0 (left of the comma)
> >> > +and PF1 (right of the comma); for 23xx only");
> >> > +
> >> >  static int ptp_enable = 1;
> >>
> >> We cannot continue to allow drivers to add custom module parameters
> >> to control this.  It is the worst user experience possible.
> >>
> >> We need a tree-wide generic, consistent, manner in which to configure
> >> and control this kind of thing.
> >
> > Sure Dave, I will remove max_vfs module parameter and will use tree
> > wide generic sysfs interface to enable VFs.
> 
> That's not what I meant.
> 
> I mean there needs to be a generic mechanism that isn't a per-device knob
> (be it a module parameter or a sysctl, to me these are identical functionality
> and user experience wise).
> 
> Something like ethtool or netlink.

Dave, I will remove max_vfs module parameter and will just use the generic 
mechanism that all drivers do like:

" echo 10 > /sys/devices/pci:00/:00:03.0/:03:00.1/sriov_numvfs"

Regarding other module parameters, in non-default case if user wants to have 
multiple queues then because of the way Liquidio HW works we need 
num_queues_per_pf and num_queues_per_vf module parameters at HW/module init 
time. This is because in multi-queues per VF scenario, HW has to carve these 
queues before FW can start communicating with PF/VF host drivers, so we must 
include these two parameters. 

Please confirm that having these two module parameters is fine for non-default 
case. I will soon forward you the patches with these changes that you have 
recommended.

Thanks Much
Raghu.


Re: [PATCH 05/10] gpio: Introduce SAM gpio driver

2016-10-20 Thread Linus Walleij
n Fri, Oct 7, 2016 at 5:18 PM, Pantelis Antoniou
 wrote:

> From: Guenter Roeck 
>
> The SAM GPIO IP block is present in the Juniper PTX series
> of routers as part of the SAM FPGA.
>
> Signed-off-by: Georgi Vlaev 
> Signed-off-by: Guenter Roeck 
> Signed-off-by: Rajat Jain 
> [Ported from Juniper kernel]
> Signed-off-by: Pantelis Antoniou 

First copy/paste my other review comments on the previous driver
I reviewed, this seems to have pretty much all the same issues.

> +config GPIO_SAM
> +   tristate "SAM FPGA GPIO"
> +   depends on MFD_JUNIPER_SAM
> +   default y if MFD_JUNIPER_SAM

I suspect this should use
select GPIOLIB_IRQCHIP

> +#include 
> +#include 
> +#include 
> +#include 



> +#include 
> +#include 

Not needed with GPIOLIB_IRQCHIP

> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Why?

> +#include 
> +
> +/* gpio status/configuration */
> +#define SAM_GPIO_NEG_EDGE  (1 << 8)
> +#define SAM_GPIO_NEG_EDGE_EN   (1 << 7)
> +#define SAM_GPIO_POS_EDGE  (1 << 6)
> +#define SAM_GPIO_POS_EDGE_EN   (1 << 5)

Interrupt triggers I suppose.

> +#define SAM_GPIO_BLINK (1 << 4)

Cool, we don't support that in gpiolib as of now.
Maybe we should.

> +#define SAM_GPIO_OUT   (1 << 3)
> +#define SAM_GPIO_OUT_TS(1 << 2)

OUT_TS ... what does TS mean here?

> +#define SAM_GPIO_DEBOUNCE_EN   (1 << 1)
> +#define SAM_GPIO_IN(1 << 0)
> +
> +#define SAM_GPIO_BASE  0x1000

Base into what, and why is this not coming from PCI
or the device tree?

> +#define SAM_MAX_NGPIO  512

Why do we need to know this and what does it really mean?
That is the max number the GPIO subsystem can handle by
the way.

> +#define SAM_GPIO_ADDR(addr, nr)((addr) + SAM_GPIO_BASE + (nr) * 
> sizeof(u32))

Why can't we just offset the address earlier, ah well it's OK.

> +struct sam_gpio_irq_group {
> +   int start;  /* 1st gpio pin */
> +   int count;  /* # of pins in group */
> +   int num_enabled;/* # of enabled interrupts */
> +};
> +
> +/**
> + * struct sam_gpio - GPIO private data structure.
> + * @base:  PCI base address of Memory mapped I/O 
> register.
> + * @dev:   Pointer to device structure.
> + * @gpio:  Data for GPIO infrastructure.
> + * @gpio_base: 1st gpio pin
> + * @gpio_count:# of gpio pins
> + * @irq_lock:  Lock used by interrupt subsystem
> + * @domain:Pointer to interrupt domain
> + * @irq:   Interrupt # from parent
> + * @irq_high:  Second interrupt # from parent
> + * (currently unused)
> + * @irq_group: Interrupt group descriptions
> + * (one group per interrupt bit)
> + * @irq_type:  The interrupt type for each gpio pin
> + */

Why do you need to keep all of this around? Is is all really
used? gpio_base makes me nervous we generally use dynamic
allocation of GPIO numbers these days.

> +struct sam_gpio {
> +   void __iomem *base;
> +   struct device *dev;
> +   struct gpio_chip gpio;
> +   int gpio_base;
> +   int gpio_count;
> +   struct mutex irq_lock;
> +   struct irq_domain *domain;
> +   int irq;
> +   int irq_high;
> +   struct sam_gpio_irq_group irq_group[18];
> +   u8 irq_type[SAM_MAX_NGPIO];
> +   struct sam_platform_data *pdata;
> +   const char **names;
> +   u32 *export_flags;
> +};
> +#define to_sam(chip)   container_of((chip), struct sam_gpio, gpio)

Instead of this use gpiochip_get_data(). Applies everywhere.

> +static void sam_gpio_bitop(struct sam_gpio *sam, unsigned int nr,
> +  u32 bit, bool set)
> +{
> +   u32 reg;
> +
> +   reg = ioread32(SAM_GPIO_ADDR(sam->base, nr));
> +   if (set)
> +   reg |= bit;
> +   else
> +   reg &= ~bit;
> +   iowrite32(reg, SAM_GPIO_ADDR(sam->base, nr));
> +   ioread32(SAM_GPIO_ADDR(sam->base, nr));
> +}

Does that rally need a helper function?

Use BIT() and inline like I explained in the previous patch.

> +static void sam_gpio_setup(struct sam_gpio *sam)
> +{
> +   struct gpio_chip *chip = &sam->gpio;
> +
> +   chip->parent = sam->dev;
> +   chip->label = dev_name(sam->dev);
> +   chip->owner = THIS_MODULE;
> +   chip->direction_input = sam_gpio_direction_input;
> +   chip->get = sam_gpio_get;
> +   chip->direction_output = sam_gpio_direction_output;

Implement also chip->get_direction

> +   chip->set = sam_gpio_set;
> +   chip->set_debounce = sam_gpio_debounce;
> +   chip->dbg_show = NULL;
> +   chip->base = sam->gpio_base;

Oh no, why. Use -1 please and let gpiolib decide...

> +   chip->ngpio = sam->gpio_count;
> +#ifdef CONFIG_OF_GPIO
> +   chip->of_node = sam

Re: linux-next: build failure after merge of the net tree

2016-10-20 Thread Florian Fainelli
On 10/20/2016 03:27 PM, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> ERROR: "kexec_in_progress" [drivers/net/dsa/bcm_sf2.ko] undefined!
> 
> Caused by commit
> 
>   2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd 
> kernels")
> 
> I used the version of the net tree from next-20161020 for today.

OK, seems like we need an ifdef CONFIG_KEXEC_CORE, let me fix that.
Thanks Stephen!
-- 
Florian


[PATCH net-next 1/2] bpf: add helper for retrieving current numa node id

2016-10-20 Thread Daniel Borkmann
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.

Suggested-by: Eric Dumazet 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |  1 +
 include/uapi/linux/bpf.h |  6 ++
 kernel/bpf/helpers.c | 12 
 kernel/trace/bpf_trace.c |  2 ++
 net/core/filter.c|  2 ++
 5 files changed, 23 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c201017..edcd96d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -319,6 +319,7 @@ static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog 
*prog)
 
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
+extern const struct bpf_func_proto bpf_get_numa_node_id_proto;
 extern const struct bpf_func_proto bpf_tail_call_proto;
 extern const struct bpf_func_proto bpf_ktime_get_ns_proto;
 extern const struct bpf_func_proto bpf_get_current_pid_tgid_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..4ae1b66 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -426,6 +426,12 @@ enum bpf_func_id {
 */
BPF_FUNC_set_hash_invalid,
 
+   /**
+* bpf_get_numa_node_id()
+* Returns the id of the current NUMA node.
+*/
+   BPF_FUNC_get_numa_node_id,
+
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 3991840..045cbe6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -92,6 +93,17 @@
.ret_type   = RET_INTEGER,
 };
 
+BPF_CALL_0(bpf_get_numa_node_id)
+{
+   return numa_node_id();
+}
+
+const struct bpf_func_proto bpf_get_numa_node_id_proto = {
+   .func   = bpf_get_numa_node_id,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+};
+
 BPF_CALL_0(bpf_ktime_get_ns)
 {
/* NMI safe access to clock monotonic */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 5dcb992..fa77311 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -422,6 +422,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum 
bpf_func_id func_id)
return bpf_get_trace_printk_proto();
case BPF_FUNC_get_smp_processor_id:
return &bpf_get_smp_processor_id_proto;
+   case BPF_FUNC_get_numa_node_id:
+   return &bpf_get_numa_node_id_proto;
case BPF_FUNC_perf_event_read:
return &bpf_perf_event_read_proto;
case BPF_FUNC_probe_write_user:
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..cd9e2ba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2492,6 +2492,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
return &bpf_get_prandom_u32_proto;
case BPF_FUNC_get_smp_processor_id:
return &bpf_get_raw_smp_processor_id_proto;
+   case BPF_FUNC_get_numa_node_id:
+   return &bpf_get_numa_node_id_proto;
case BPF_FUNC_tail_call:
return &bpf_tail_call_proto;
case BPF_FUNC_ktime_get_ns:
-- 
1.9.3



[PATCH net-next 0/2] Add BPF numa id helper

2016-10-20 Thread Daniel Borkmann
This patch set adds a helper for retrieving current numa node
id and a test case for SO_REUSEPORT.

Thanks!

Daniel Borkmann (2):
  bpf: add helper for retrieving current numa node id
  reuseport, bpf: add test case for bpf_get_numa_node_id

 include/linux/bpf.h  |   1 +
 include/uapi/linux/bpf.h |   6 +
 kernel/bpf/helpers.c |  12 ++
 kernel/trace/bpf_trace.c |   2 +
 net/core/filter.c|   2 +
 tools/testing/selftests/net/.gitignore   |   1 +
 tools/testing/selftests/net/Makefile |  11 +-
 tools/testing/selftests/net/reuseport_bpf_numa.c | 255 +++
 8 files changed, 286 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_bpf_numa.c

-- 
1.9.3



[PATCH net-next 2/2] reuseport, bpf: add test case for bpf_get_numa_node_id

2016-10-20 Thread Daniel Borkmann
The test case is very similar to reuseport_bpf_cpu, only that here
we select socket members based on current numa node id.

  # numactl -H
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
  node 0 size: 128867 MB
  node 0 free: 120080 MB
  node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
  node 1 size: 96765 MB
  node 1 free: 87504 MB
  node distances:
  node   0   1
0:  10  20
1:  20  10

  # ./reuseport_bpf_numa
   IPv4 UDP 
  send node 0, receive socket 0
  send node 1, receive socket 1
  send node 1, receive socket 1
  send node 0, receive socket 0
   IPv6 UDP 
  send node 0, receive socket 0
  send node 1, receive socket 1
  send node 1, receive socket 1
  send node 0, receive socket 0
   IPv4 TCP 
  send node 0, receive socket 0
  send node 1, receive socket 1
  send node 1, receive socket 1
  send node 0, receive socket 0
   IPv6 TCP 
  send node 0, receive socket 0
  send node 1, receive socket 1
  send node 1, receive socket 1
  send node 0, receive socket 0
  SUCCESS

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/net/.gitignore   |   1 +
 tools/testing/selftests/net/Makefile |  11 +-
 tools/testing/selftests/net/reuseport_bpf_numa.c | 255 +++
 3 files changed, 263 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_bpf_numa.c

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index 0840684..afe109e 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -3,4 +3,5 @@ psock_fanout
 psock_tpacket
 reuseport_bpf
 reuseport_bpf_cpu
+reuseport_bpf_numa
 reuseport_dualstack
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 0e53407..e24e4c8 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -1,14 +1,17 @@
 # Makefile for net selftests
 
-CFLAGS = -Wall -O2 -g
-
+CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
-NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu 
reuseport_dualstack
+NET_PROGS =  socket
+NET_PROGS += psock_fanout psock_tpacket
+NET_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
+NET_PROGS += reuseport_dualstack
 
 all: $(NET_PROGS)
+reuseport_bpf_numa: LDFLAGS += -lnuma
 %: %.c
-   $(CC) $(CFLAGS) -o $@ $^
+   $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $^
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh
 TEST_FILES := $(NET_PROGS)
diff --git a/tools/testing/selftests/net/reuseport_bpf_numa.c 
b/tools/testing/selftests/net/reuseport_bpf_numa.c
new file mode 100644
index 000..6f20bc9
--- /dev/null
+++ b/tools/testing/selftests/net/reuseport_bpf_numa.c
@@ -0,0 +1,255 @@
+/*
+ * Test functionality of BPF filters with SO_REUSEPORT. Same test as
+ * in reuseport_bpf_cpu, only as one socket per NUMA node.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const int PORT = ;
+
+static void build_rcv_group(int *rcv_fd, size_t len, int family, int proto)
+{
+   struct sockaddr_storage addr;
+   struct sockaddr_in  *addr4;
+   struct sockaddr_in6 *addr6;
+   size_t i;
+   int opt;
+
+   switch (family) {
+   case AF_INET:
+   addr4 = (struct sockaddr_in *)&addr;
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_ANY);
+   addr4->sin_port = htons(PORT);
+   break;
+   case AF_INET6:
+   addr6 = (struct sockaddr_in6 *)&addr;
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_any;
+   addr6->sin6_port = htons(PORT);
+   break;
+   default:
+   error(1, 0, "Unsupported family %d", family);
+   }
+
+   for (i = 0; i < len; ++i) {
+   rcv_fd[i] = socket(family, proto, 0);
+   if (rcv_fd[i] < 0)
+   error(1, errno, "failed to create receive socket");
+
+   opt = 1;
+   if (setsockopt(rcv_fd[i], SOL_SOCKET, SO_REUSEPORT, &opt,
+  sizeof(opt)))
+   error(1, errno, "failed to set SO_REUSEPORT");
+
+   if (bind(rcv_fd[i], (struct sockaddr *)&addr, sizeof(addr)))
+   error(1, errno, "failed to bind receive socket");
+
+   if (proto == SOCK_STREAM && listen(rcv_fd[i], len * 10))
+   error(1, errno, "failed to listen on receive port");
+   }
+}
+
+static void attach_bpf(int fd)
+{
+   static char bpf_log_buf[65536];
+   static const char bpf_license[] = "";
+
+   int bpf_fd;
+   const struct bpf_insn

linux-next: build failure after merge of the net tree

2016-10-20 Thread Stephen Rothwell
Hi all,

After merging the net tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

ERROR: "kexec_in_progress" [drivers/net/dsa/bcm_sf2.ko] undefined!

Caused by commit

  2399d6143f85 ("net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd kernels")

I used the version of the net tree from next-20161020 for today.

-- 
Cheers,
Stephen Rothwell


Unexpected behaviour of suppress_prefixlength 0

2016-10-20 Thread Matthias Peter Walther

Hello,

I'm Matthias and I'm new to this list. I just signed up, to ask the 
following question.


I have a configuration like this:

root@des1 ~ # ip rule
0:from all lookup local
32765:from all iif lo lookup ffnet suppress_prefixlength 0
32766:from all lookup main
32767:from all lookup default
(ffnet is table 42)
root@des1 ~ # ip r s
default via 5.9.86.151 dev eth0
5.9.86.151 dev eth0  proto kernel  scope link  src 5.9.86.144
root@des1 ~ # ip r s t 42
blackhole default

I have the default routing table, and a routing table number 42. I could 
use an ip rule filtering by destination ip, but I wanted to try 
suppress_prefixlength.


Let's say I want to ping 8.8.8.8. What I expect is, that the package is 
put into routing table 42 by the ip rule 32765. As there is no more 
specific route for 8.8.8.8 than the default route in table 42, I expect 
the suppress_prefixlength 0 option to put it back to the default routing 
table and then to be send out through eth0.


Instead this configuration takes the whole machine offline:

root@des1 ~ # ping 8.8.8.8
connect: Invalid argument

When I delete the ip rule 32765 containing the suppress_prefixlength, 
the machine is back online.


Do I not understand the suppress_prefixlength-feature correctly or is 
this a bug? I tested with Kernel 4.7 and 4.6, both show the same 
behaviour as described above.


Thanks for any replies in advance.

Regards,
Matthias


Re: [Patch net v2] ipv6: fix a potential deadlock in do_ipv6_setsockopt()

2016-10-20 Thread Eric Dumazet
On Thu, 2016-10-20 at 14:35 -0700, Cong Wang wrote:
> On Thu, Oct 20, 2016 at 6:47 AM, Eric Dumazet  wrote:
> > On Wed, 2016-10-19 at 23:35 -0700, Cong Wang wrote:
> >> Baozeng reported this deadlock case:
> >
> > ...
> >
> >> +
> >> +void ipv6_sock_mc_close(struct sock *sk)
> >> +{
> >> + struct ipv6_pinfo *np = inet6_sk(sk);
> >> +
> >> + if (!rcu_access_pointer(np->ipv6_mc_list))
> >> + return;
> >
> > I wonder if rcu_dereference_protected(..., lockdep_sock_is_held(sk))
> > could be used instead, to get lockdep support ?
> 
> Maybe, but this "problem" exists without my patch too, right?

I used 'I wonder if' to say that we might have some better way to code
this test nowadays, but this can be done in a separate patch of course.




Re: [Patch net v2] ipv6: fix a potential deadlock in do_ipv6_setsockopt()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 6:47 AM, Eric Dumazet  wrote:
> On Wed, 2016-10-19 at 23:35 -0700, Cong Wang wrote:
>> Baozeng reported this deadlock case:
>
> ...
>
>> +
>> +void ipv6_sock_mc_close(struct sock *sk)
>> +{
>> + struct ipv6_pinfo *np = inet6_sk(sk);
>> +
>> + if (!rcu_access_pointer(np->ipv6_mc_list))
>> + return;
>
> I wonder if rcu_dereference_protected(..., lockdep_sock_is_held(sk))
> could be used instead, to get lockdep support ?

Maybe, but this "problem" exists without my patch too, right?


Re: [PATCH] net: l2tp_eth: fix max_mtu

2016-10-20 Thread Asbjoern Sloth Toennesen
On Thu, 20 Oct 2016 20:08:26 + (UTC), a...@asbjorn.biz wrote:
> diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
> index 965f7e3..ba82dcc 100644
> --- a/net/l2tp/l2tp_eth.c
> +++ b/net/l2tp/l2tp_eth.c
> @@ -259,6 +259,7 @@ static int l2tp_eth_create(struct net *net, u32 
> tunnel_id, u32 session_id, u32 p
>   session->mtu = dev->mtu - session->hdr_len;
>   dev->mtu = session->mtu;
>   dev->needed_headroom += session->hdr_len;
> + dev->max_mtu = ETH_MAX_MTU - dev->needed_headroom;

I forgot to add that subtracting dev->needed_headroom doesn't
give the exact maximum MTU, but one that is a bit too high.
It could also just be set to ETH_MAX_MTU, to indicate that
the encapsulation is not included.

Calculations like in R. Parameswaran's patch is needed for the
exact one.
https://lkml.org/lkml/2016/10/17/3

-- 
Best regards
Asbjørn Sloth Tønnesen


RE: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF support

2016-10-20 Thread Vatsavayi, Raghu


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, October 20, 2016 1:57 PM
> To: Vatsavayi, Raghu
> Cc: netdev@vger.kernel.org; Chickles, Derek; Burla, Satananda; Manlunas,
> Felix
> Subject: Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF
> support
> 
> From: "Vatsavayi, Raghu" 
> Date: Thu, 20 Oct 2016 20:01:37 +
> 
> >
> >
> >> -Original Message-
> >> From: David Miller [mailto:da...@davemloft.net]
> >> Sent: Thursday, October 20, 2016 11:13 AM
> >> To: Vatsavayi, Raghu
> >> Cc: netdev@vger.kernel.org; Vatsavayi, Raghu; Chickles, Derek; Burla,
> >> Satananda; Manlunas, Felix
> >> Subject: Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for
> >> VF support
> >>
> >> From: Raghu Vatsavayi 
> >> Date: Wed, 19 Oct 2016 22:40:38 -0700
> >>
> >> > +/* Default behaviour of Liquidio is to provide one queue per VF.
> >> > +But Liquidio
> >> > + * can also provide multiple queues to each VF. If user wants to
> >> > +change the
> >> > + * default behaviour HW should be provided configuration info at
> >> > +init time,
> >> > + * based on which it will create control queues for communicating
> >> > +with
> >> FW.
> >> > + */
> >> > +static u32 max_vfs[2] = { 0, 0 };
> >> > +module_param_array(max_vfs, int, NULL, 0444);
> >> > +MODULE_PARM_DESC(max_vfs, "Assign two comma-separated
> unsigned
> >> > +integers that specify max number of VFs for PF0 (left of the
> >> > +comma) and PF1 (right of the comma); for 23xx only. By default HW
> >> > +will configure as many VFs as queues after allocating PF queues.To
> >> > +increase queues for VF use this parameter. Use sysfs to create
> >> > +these VFs.");
> >> > +
> >> > +static unsigned int num_queues_per_pf[2] = { 0, 0 };
> >> > +module_param_array(num_queues_per_pf, uint, NULL, 0444);
> >> > +MODULE_PARM_DESC(num_queues_per_pf, "two comma-separated
> >> unsigned
> >> > +integers that specify number of queues per PF0 (left of the comma)
> >> > +and PF1 (right of the comma); for 23xx only");
> >> > +
> >> >  static int ptp_enable = 1;
> >>
> >> We cannot continue to allow drivers to add custom module parameters
> >> to control this.  It is the worst user experience possible.
> >>
> >> We need a tree-wide generic, consistent, manner in which to configure
> >> and control this kind of thing.
> >
> > Sure Dave, I will remove max_vfs module parameter and will use tree
> > wide generic sysfs interface to enable VFs.
> 
> That's not what I meant.
> 
> I mean there needs to be a generic mechanism that isn't a per-device knob
> (be it a module parameter or a sysctl, to me these are identical functionality
> and user experience wise).
> 
> Something like ethtool or netlink.

Dave, I will remove max_vfs module parameter and will just use the generic 
mechanism that 
all drivers do like:

" echo 10 > /sys/devices/pci:00/:00:03.0/:03:00.1/sriov_numvfs"

Regards
Raghu.




[Patch net] ipv4: use the right lock for ping_group_range

2016-10-20 Thread Cong Wang
This reverts commit a681574c99be23e4d20b769bf0e543239c364af5
("ipv4: disable BH in set_ping_group_range()") because we never
read ping_group_range in BH context (unlike local_port_range).

Then, since we already have a lock for ping_group_range, those
using ip_local_ports.lock for ping_group_range are clearly typos.

We might consider to share a same lock for both ping_group_range
and local_port_range w.r.t. space saving, but that should be for
net-next.

Fixes: a681574c99be ("ipv4: disable BH in set_ping_group_range()")
Fixes: ba6b918ab234 ("ping: move ping_group_range out of CONFIG_SYSCTL")
Cc: Eric Dumazet 
Cc: Eric Salo 
Signed-off-by: Cong Wang 
---
 net/ipv4/sysctl_net_ipv4.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 500ae40..80bc36b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -96,11 +96,11 @@ static void inet_get_ping_group_range_table(struct 
ctl_table *table, kgid_t *low
container_of(table->data, struct net, 
ipv4.ping_group_range.range);
unsigned int seq;
do {
-   seq = read_seqbegin(&net->ipv4.ip_local_ports.lock);
+   seq = read_seqbegin(&net->ipv4.ping_group_range.lock);
 
*low = data[0];
*high = data[1];
-   } while (read_seqretry(&net->ipv4.ip_local_ports.lock, seq));
+   } while (read_seqretry(&net->ipv4.ping_group_range.lock, seq));
 }
 
 /* Update system visible IP port range */
@@ -109,10 +109,10 @@ static void set_ping_group_range(struct ctl_table *table, 
kgid_t low, kgid_t hig
kgid_t *data = table->data;
struct net *net =
container_of(table->data, struct net, 
ipv4.ping_group_range.range);
-   write_seqlock_bh(&net->ipv4.ip_local_ports.lock);
+   write_seqlock(&net->ipv4.ping_group_range.lock);
data[0] = low;
data[1] = high;
-   write_sequnlock_bh(&net->ipv4.ip_local_ports.lock);
+   write_sequnlock(&net->ipv4.ping_group_range.lock);
 }
 
 /* Validate changes from /proc interface. */
-- 
2.1.0



[PATCH net-next] net/sched: em_meta: Fix 'meta vlan' to correctly recognize zero VID frames

2016-10-20 Thread Shmulik Ladkani
META_COLLECTOR int_vlan_tag() assumes that if the accel tag (vlan_tci)
is zero, then no vlan accel tag is present.

This is incorrect for zero VID vlan accel packets, making the following
match fail:
  tc filter add ... basic match 'meta(vlan mask 0xfff eq 0)' ...

Apparently 'int_vlan_tag' was implemented prior VLAN_TAG_PRESENT was
introduced in 05423b2 "vlan: allow null VLAN ID to be used"
(and at time introduced, the 'vlan_tx_tag_get' call in em_meta was not
 adapted).

Fix, testing skb_vlan_tag_present instead of testing skb_vlan_tag_get's
value.

Fixes: 05423b2413 ("vlan: allow null VLAN ID to be used")
Fixes: 1a31f2042e ("netsched: Allow meta match on vlan tag on receive")

Signed-off-by: Shmulik Ladkani 
Cc: Eric Dumazet 
Cc: Stephen Hemminger 
---
 net/sched/em_meta.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index a309a07ccb..41c80b6c39 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -176,11 +176,12 @@ META_COLLECTOR(int_vlan_tag)
 {
unsigned short tag;
 
-   tag = skb_vlan_tag_get(skb);
-   if (!tag && __vlan_get_tag(skb, &tag))
-   *err = -1;
-   else
+   if (skb_vlan_tag_present(skb))
+   dst->value = skb_vlan_tag_get(skb);
+   else if (!__vlan_get_tag(skb, &tag))
dst->value = tag;
+   else
+   *err = -1;
 }
 
 
-- 
2.7.4



Re: [PATCH net-next v5 2/3] udp: implement memory accounting helpers

2016-10-20 Thread Eric Dumazet
On Thu, 2016-10-20 at 22:31 +0200, Paolo Abeni wrote:

> +
> +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> +{
> + struct sk_buff_head *list = &sk->sk_receive_queue;
> + int rmem, delta, amt, err = -ENOMEM;
> + int size = skb->truesize;
> +
> + /* try to avoid the costly atomic add/sub pair when the receive
> +  * queue is full; always allow at least a packet
> +  */
> + rmem = atomic_read(&sk->sk_rmem_alloc);
> + if (rmem && (rmem + size > sk->sk_rcvbuf))
> + goto drop;
> +
> + /* we drop only if the receive buf is full and the receive
> +  * queue contains some other skb
> +  */
> + rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> + if ((rmem > sk->sk_rcvbuf) && (rmem > size))
> + goto uncharge_drop;
> +
> + skb_orphan(skb);

Minor point :

UDP should already have orphaned skbs ? (it uses skb_steal_sock())





Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread Eric Dumazet
On Thu, 2016-10-20 at 14:00 -0700, Cong Wang wrote:

> Error prone vs. space saving, it's up to you...
> 
> But clearly current code is still broken even after your patch. I will send
> a revert + previous typo fix.

Yes please do.





Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 1:43 PM, Eric Dumazet  wrote:
> On Thu, 2016-10-20 at 12:44 -0700, Cong Wang wrote:
>> On Thu, Oct 20, 2016 at 12:40 PM, Cong Wang  wrote:
>> > On Thu, Oct 20, 2016 at 12:32 PM, Cong Wang  
>> > wrote:
>> >> On Thu, Oct 20, 2016 at 10:26 AM, Eric Dumazet  
>> >> wrote:
>> >>> From: Eric Dumazet 
>> >>>
>> >>> In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
>> >>> range") Cong added BH protection in set_local_port_range() but missed
>> >>> that same fix was needed in set_ping_group_range()
>> >>
>> >> Don't know why ping_group_range shares the same lock with 
>> >> local_port_range...
>> >> Perhaps just for saving a few bytes, but that is why I missed this place.
>> >
>> > Hold on... We clearly have typos there... Your fix is not correct.
>>
>> We need the attached patch, your patch should be reverted, because
>> unlike local_port_range we never read it in BH context, no need to bother 
>> _bh.
>
> Well, we do not change this sysctl very often, so I am not sure why we
> need different seqlocks to protect these ranges.
>
> Seems a waste of space really (per netns)

Error prone vs. space saving, it's up to you...

But clearly current code is still broken even after your patch. I will send
a revert + previous typo fix.


Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF support

2016-10-20 Thread David Miller
From: "Vatsavayi, Raghu" 
Date: Thu, 20 Oct 2016 20:01:37 +

> 
> 
>> -Original Message-
>> From: David Miller [mailto:da...@davemloft.net]
>> Sent: Thursday, October 20, 2016 11:13 AM
>> To: Vatsavayi, Raghu
>> Cc: netdev@vger.kernel.org; Vatsavayi, Raghu; Chickles, Derek; Burla,
>> Satananda; Manlunas, Felix
>> Subject: Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF
>> support
>> 
>> From: Raghu Vatsavayi 
>> Date: Wed, 19 Oct 2016 22:40:38 -0700
>> 
>> > +/* Default behaviour of Liquidio is to provide one queue per VF. But
>> > +Liquidio
>> > + * can also provide multiple queues to each VF. If user wants to
>> > +change the
>> > + * default behaviour HW should be provided configuration info at init
>> > +time,
>> > + * based on which it will create control queues for communicating with
>> FW.
>> > + */
>> > +static u32 max_vfs[2] = { 0, 0 };
>> > +module_param_array(max_vfs, int, NULL, 0444);
>> > +MODULE_PARM_DESC(max_vfs, "Assign two comma-separated unsigned
>> > +integers that specify max number of VFs for PF0 (left of the comma)
>> > +and PF1 (right of the comma); for 23xx only. By default HW will
>> > +configure as many VFs as queues after allocating PF queues.To
>> > +increase queues for VF use this parameter. Use sysfs to create these
>> > +VFs.");
>> > +
>> > +static unsigned int num_queues_per_pf[2] = { 0, 0 };
>> > +module_param_array(num_queues_per_pf, uint, NULL, 0444);
>> > +MODULE_PARM_DESC(num_queues_per_pf, "two comma-separated
>> unsigned
>> > +integers that specify number of queues per PF0 (left of the comma)
>> > +and PF1 (right of the comma); for 23xx only");
>> > +
>> >  static int ptp_enable = 1;
>> 
>> We cannot continue to allow drivers to add custom module parameters to
>> control this.  It is the worst user experience possible.
>> 
>> We need a tree-wide generic, consistent, manner in which to configure and
>> control this kind of thing.
> 
> Sure Dave, I will remove max_vfs module parameter and will use tree wide 
> generic 
> sysfs interface to enable VFs.

That's not what I meant.

I mean there needs to be a generic mechanism that isn't a per-device
knob (be it a module parameter or a sysctl, to me these are identical
functionality and user experience wise).

Something like ethtool or netlink.



Re: [PATCH net] udp: must lock the socket in udp_disconnect()

2016-10-20 Thread Eric Dumazet
On Thu, 2016-10-20 at 14:46 -0400, David Miller wrote:

> 
> Applied, sounds like I should queue this up for -stable too right?

Yes, I believe all stable versions have this bug.
Thanks.




Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread Eric Dumazet
On Thu, 2016-10-20 at 12:44 -0700, Cong Wang wrote:
> On Thu, Oct 20, 2016 at 12:40 PM, Cong Wang  wrote:
> > On Thu, Oct 20, 2016 at 12:32 PM, Cong Wang  
> > wrote:
> >> On Thu, Oct 20, 2016 at 10:26 AM, Eric Dumazet  
> >> wrote:
> >>> From: Eric Dumazet 
> >>>
> >>> In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
> >>> range") Cong added BH protection in set_local_port_range() but missed
> >>> that same fix was needed in set_ping_group_range()
> >>
> >> Don't know why ping_group_range shares the same lock with 
> >> local_port_range...
> >> Perhaps just for saving a few bytes, but that is why I missed this place.
> >
> > Hold on... We clearly have typos there... Your fix is not correct.
> 
> We need the attached patch, your patch should be reverted, because
> unlike local_port_range we never read it in BH context, no need to bother _bh.

Well, we do not change this sysctl very often, so I am not sure why we
need different seqlocks to protect these ranges.

Seems a waste of space really (per netns)

 



[PATCH net-next v5 3/3] udp: use it's own memory accounting schema

2016-10-20 Thread Paolo Abeni
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model encapsulates completely
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.

Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr readers  Kpps (vanilla)  Kpps (patched)
1   170 440
3   12502150
6   30003650
9   42004450
12  57006250

v4 -> v5:
  - avoid unneeded test in first_packet_length

v3 -> v4:
  - remove useless sk_rcvqueues_full() call

v2 -> v3:
  - do not set the now unsed backlog_rcv callback

v1 -> v2:
  - add memory pressure support
  - fixed dropwatch accounting for ipv6

Acked-by: Hannes Frederic Sowa 
Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c| 43 ---
 net/ipv6/udp.c| 34 --
 net/sunrpc/svcsock.c  | 20 
 net/sunrpc/xprtsock.c |  2 +-
 4 files changed, 33 insertions(+), 66 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5e79992..73d33a8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1309,13 +1309,7 @@ static int first_packet_length(struct sock *sk)
res = skb ? skb->len : -1;
spin_unlock_bh(&rcvq->lock);
 
-   if (!skb_queue_empty(&list_kill)) {
-   bool slow = lock_sock_fast(sk);
-
-   __skb_queue_purge(&list_kill);
-   sk_mem_reclaim_partial(sk);
-   unlock_sock_fast(sk, slow);
-   }
+   __skb_queue_purge(&list_kill);
return res;
 }
 
@@ -1364,7 +1358,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
int err;
int is_udplite = IS_UDPLITE(sk);
bool checksum_valid = false;
-   bool slow;
 
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len, addr_len);
@@ -1405,13 +1398,12 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
}
 
if (unlikely(err)) {
-   trace_kfree_skb(skb, udp_recvmsg);
if (!peeked) {
atomic_inc(&sk->sk_drops);
UDP_INC_STATS(sock_net(sk),
  UDP_MIB_INERRORS, is_udplite);
}
-   skb_free_datagram_locked(sk, skb);
+   kfree_skb(skb);
return err;
}
 
@@ -1436,16 +1428,15 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
if (flags & MSG_TRUNC)
err = ulen;
 
-   __skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+   skb_consume_udp(sk, skb, peeking ? -err : err);
return err;
 
 csum_copy_err:
-   slow = lock_sock_fast(sk);
-   if (!skb_kill_datagram(sk, skb, flags)) {
+   if (!__sk_queue_drop_skb(sk, skb, flags)) {
UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
}
-   unlock_sock_fast(sk, slow);
+   kfree_skb(skb);
 
/* starting over for a new packet, but check if we need to yield */
cond_resched();
@@ -1564,7 +1555,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
sk_incoming_cpu_update(sk);
}
 
-   rc = __sock_queue_rcv_skb(sk, skb);
+   rc = __udp_enqueue_schedule_skb(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);
 
@@ -1579,7 +1570,6 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
}
 
return 0;
-
 }
 
 static struct static_key udp_encap_needed __read_mostly;
@@ -1601,7 +1591,6 @@ void udp_encap_enable(void)
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
-   int rc;
int is_udplite = IS_UDPLITE(sk);
 
/*
@@ -1688,25 +1677,9 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff 
*skb)
goto drop;
 
udp_csum_pull_header(skb);
-   if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
-   __UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
-   is_udplite);
-   goto drop;
-   }
-
-   rc = 0;
 
ipv4_pktinfo_prepare(sk, skb);
-   bh_lock_sock(sk);
-   if (!sock_owned_by_user(sk))
-   rc = __udp_queue_rcv_skb(sk, skb);
-   else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-   bh_unlock_sock(sk);
-   goto drop;
-   }
-  

[PATCH net-next v5 2/3] udp: implement memory accounting helpers

2016-10-20 Thread Paolo Abeni
Avoid using the generic helpers.
Use the receive queue spin lock to protect the memory
accounting operation, both on enqueue and on dequeue.

On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.

On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
  costly spin_lock_irqsave(),
- avoid dst_force check, since the calling code has already
  dropped the skb dst

The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock().

v4 -> v5:
  - replace the mem_lock with the receive queue spin lock
  - ensure that the bh is always allowed to enqueue at least
a skb, even if sk_rcvbuf is exceeded

v3 -> v4:
  - reworked memory accunting, simplifying the schema
  - provide an helper for both memory scheduling and enqueuing

v1 -> v2:
  - use a udp specific destrctor to perform memory reclaiming
  - remove a couple of helpers, unneeded after the above cleanup
  - do not reclaim memory on dequeue if not under memory
pressure
  - reworked the fwd accounting schema to avoid potential
integer overflow

Acked-by: Hannes Frederic Sowa 
Signed-off-by: Paolo Abeni 
---
 include/net/udp.h |   4 ++
 net/ipv4/udp.c| 108 ++
 2 files changed, 112 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index ea53a87..18f1e6b 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -246,6 +246,9 @@ static inline __be16 udp_flow_src_port(struct net *net, 
struct sk_buff *skb,
 }
 
 /* net/ipv4/udp.c */
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
+int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
+
 void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 int (*saddr_cmp)(const struct sock *,
@@ -258,6 +261,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
+int udp_init_sock(struct sock *sk);
 int udp_disconnect(struct sock *sk, int flags);
 unsigned int udp_poll(struct file *file, struct socket *sock, poll_table 
*wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7d96dc2..5e79992 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1172,6 +1172,114 @@ int udp_sendpage(struct sock *sk, struct page *page, 
int offset,
return ret;
 }
 
+static void udp_rmem_release(struct sock *sk, int size, int partial)
+{
+   int amt;
+
+   atomic_sub(size, &sk->sk_rmem_alloc);
+
+   spin_lock_bh(&sk->sk_receive_queue.lock);
+   sk->sk_forward_alloc += size;
+   amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
+   sk->sk_forward_alloc -= amt;
+   spin_unlock_bh(&sk->sk_receive_queue.lock);
+
+   if (amt)
+   __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+}
+
+static void udp_rmem_free(struct sk_buff *skb)
+{
+   udp_rmem_release(skb->sk, skb->truesize, 1);
+}
+
+int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
+{
+   struct sk_buff_head *list = &sk->sk_receive_queue;
+   int rmem, delta, amt, err = -ENOMEM;
+   int size = skb->truesize;
+
+   /* try to avoid the costly atomic add/sub pair when the receive
+* queue is full; always allow at least a packet
+*/
+   rmem = atomic_read(&sk->sk_rmem_alloc);
+   if (rmem && (rmem + size > sk->sk_rcvbuf))
+   goto drop;
+
+   /* we drop only if the receive buf is full and the receive
+* queue contains some other skb
+*/
+   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
+   if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+   goto uncharge_drop;
+
+   skb_orphan(skb);
+
+   spin_lock(&list->lock);
+   if (size >= sk->sk_forward_alloc) {
+   amt = sk_mem_pages(size);
+   delta = amt << SK_MEM_QUANTUM_SHIFT;
+   if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
+   err = -ENOBUFS;
+   spin_unlock(&list->lock);
+   goto uncharge_drop;
+   }
+
+   sk->sk_forward_alloc += delta;
+   }
+
+   sk->sk_forward_alloc -= size;
+
+   /* the skb owner in now the udp socket */
+   skb->sk = sk;
+   skb->destructor = udp_rmem_free;
+   skb->dev = NULL;
+   sock_skb_set_dropcount(sk, skb);
+
+   __skb_queue_tail(list, skb);
+   spin_unlock(&list->lock);
+
+   if (!sock_flag(sk, SOCK_DEAD))
+   sk->sk_data_ready(sk);
+
+   return 0;
+
+uncharge_drop:
+   atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+
+drop:
+   atomic_inc(&sk->sk_drops);
+   return err;
+}
+EXPORT_SYMB

[PATCH net-next v5 1/3] net/socket: factor out helpers for memory and queue manipulation

2016-10-20 Thread Paolo Abeni
Basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.

v4 -> v5:
  - avoid whitespace changes

v2 -> v4:
  - avoid exporting __sock_enqueue_skb

v1 -> v2:
  - avoid export sock_rmem_free

Acked-by: Hannes Frederic Sowa 
Signed-off-by: Paolo Abeni 
---
 include/net/sock.h  |  4 
 net/core/datagram.c | 36 ++
 net/core/sock.c | 64 +
 3 files changed, 71 insertions(+), 33 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ebf75db..2764895 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1274,7 +1274,9 @@ static inline struct inode *SOCK_INODE(struct socket 
*socket)
 /*
  * Functions for memory accounting
  */
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind);
 int __sk_mem_schedule(struct sock *sk, int size, int kind);
+void __sk_mem_reduce_allocated(struct sock *sk, int amount);
 void __sk_mem_reclaim(struct sock *sk, int amount);
 
 #define SK_MEM_QUANTUM ((int)PAGE_SIZE)
@@ -1950,6 +1952,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list 
*timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+   unsigned int flags);
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b7de71f..bfb973a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -323,6 +323,27 @@ void __skb_free_datagram_locked(struct sock *sk, struct 
sk_buff *skb, int len)
 }
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+   unsigned int flags)
+{
+   int err = 0;
+
+   if (flags & MSG_PEEK) {
+   err = -ENOENT;
+   spin_lock_bh(&sk->sk_receive_queue.lock);
+   if (skb == skb_peek(&sk->sk_receive_queue)) {
+   __skb_unlink(skb, &sk->sk_receive_queue);
+   atomic_dec(&skb->users);
+   err = 0;
+   }
+   spin_unlock_bh(&sk->sk_receive_queue.lock);
+   }
+
+   atomic_inc(&sk->sk_drops);
+   return err;
+}
+EXPORT_SYMBOL(__sk_queue_drop_skb);
+
 /**
  * skb_kill_datagram - Free a datagram skbuff forcibly
  * @sk: socket
@@ -346,23 +367,10 @@ void __skb_free_datagram_locked(struct sock *sk, struct 
sk_buff *skb, int len)
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-   int err = 0;
-
-   if (flags & MSG_PEEK) {
-   err = -ENOENT;
-   spin_lock_bh(&sk->sk_receive_queue.lock);
-   if (skb == skb_peek(&sk->sk_receive_queue)) {
-   __skb_unlink(skb, &sk->sk_receive_queue);
-   atomic_dec(&skb->users);
-   err = 0;
-   }
-   spin_unlock_bh(&sk->sk_receive_queue.lock);
-   }
+   int err = __sk_queue_drop_skb(sk, skb, flags);
 
kfree_skb(skb);
-   atomic_inc(&sk->sk_drops);
sk_mem_reclaim_partial(sk);
-
return err;
 }
 EXPORT_SYMBOL(skb_kill_datagram);
diff --git a/net/core/sock.c b/net/core/sock.c
index c73e28f..d8e4532e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2091,24 +2091,18 @@ int sk_wait_data(struct sock *sk, long *timeo, const 
struct sk_buff *skb)
 EXPORT_SYMBOL(sk_wait_data);
 
 /**
- * __sk_mem_schedule - increase sk_forward_alloc and memory_allocated
+ * __sk_mem_raise_allocated - increase memory_allocated
  * @sk: socket
  * @size: memory size to allocate
+ * @amt: pages to allocate
  * @kind: allocation type
  *
- * If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
- * rmem allocation. This function assumes that protocols which have
- * memory_pressure use sk_wmem_queued as write buffer accounting.
+ * Similar to __sk_mem_schedule(), but does not update sk_forward_alloc
  */
-int __sk_mem_schedule(struct sock *sk, int size, int kind)
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
struct proto *prot = sk->sk_prot;
-   int amt = sk_mem_pages(size);
-   long allocated;
-
-   sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-
-   allocated = sk_memory_allocated_add(sk, amt);
+   long allocated = sk_memory_allocated_add(sk, amt);
 
if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
!mem_cgroup_charge_skmem(sk->sk_memcg, amt))
@@ -2169,9 +2163,6 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 
trace_sock_exceed_buf_limit(sk, prot, allocated);
 
-   /* Alas. Undo changes. */
-   sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-
sk_memory_allocated_sub(sk, a

[PATCH net-next v5 0/3] udp: refactor memory accounting

2016-10-20 Thread Paolo Abeni
This patch series refactor the udp memory accounting, replacing the
generic implementation with a custom one, in order to remove the needs for
locking the socket on the enqueue and dequeue operations. The socket backlog
usage is dropped, as well.

The first patch factor out pieces of some queue and memory management
socket helpers, so that they can later be used by the udp memory accounting
functions.
The second patch adds the memory account helpers, without using them.
The third patch replacse the old rx memory accounting path for udp over ipv4 and
udp over ipv6. In kernel UDP users are updated, as well.

The memory accounting schema is described in detail in the individual patch
commit message.

The performance gain depends on the specific scenario; with few flows (and
little contention in the original code) the differences are in the noise range,
while with several flows contending the same socket, the measured speed-up
is relevant (e.g. even over 100% in case of extreme contention)

v4 -> v5:
 - use the receive queue spin lock to protect the memory accounting
 - several minor clean-up

v3 -> v4:
 - simplified the locking schema, always use a plain spinlock

v2 -> v3:
 - do not set the now unsed backlog_rcv callback

v1 -> v2:
 - changed slighly the memory accounting schema, we now perform lazy reclaim
 - fixed forward_alloc updating issue
 - fixed memory counter integer overflows

Paolo Abeni (3):
  net/socket: factor out helpers for memory and queue manipulation
  udp: implement memory accounting helpers
  udp: use it's own memory accounting schema

 include/net/sock.h|   4 ++
 include/net/udp.h |   4 ++
 net/core/datagram.c   |  36 +++-
 net/core/sock.c   |  64 ++---
 net/ipv4/udp.c| 151 ++
 net/ipv6/udp.c|  34 +++-
 net/sunrpc/svcsock.c  |  20 +--
 net/sunrpc/xprtsock.c |   2 +-
 8 files changed, 216 insertions(+), 99 deletions(-)

-- 
1.8.3.1



Re: [PATCH net-next v2 6/9] net: use core MTU range checking in virt drivers

2016-10-20 Thread Michael S. Tsirkin
On Thu, Oct 20, 2016 at 01:55:21PM -0400, Jarod Wilson wrote:
> hyperv_net:
> - set min/max_mtu, per Haiyang, after rndis_filter_device_add
> 
> virtio_net:
> - set min/max_mtu
> - remove virtnet_change_mtu
> vmxnet3:
> - set min/max_mtu
> 
> xen-netback:
> - min_mtu = 0, max_mtu = 65517
> 
> xen-netfront:
> - min_mtu = 0, max_mtu = 65535
> 
> unisys/visor:
> - clean up defines a little to not clash with network core or add
>   redundat definitions
> 
> CC: netdev@vger.kernel.org
> CC: virtualizat...@lists.linux-foundation.org
> CC: "K. Y. Srinivasan" 
> CC: Haiyang Zhang 
> CC: "Michael S. Tsirkin" 
> CC: Shrikrishna Khare 
> CC: "VMware, Inc." 
> CC: Wei Liu 
> CC: Paul Durrant 
> CC: David Kershner 
> Signed-off-by: Jarod Wilson 
> ---
>  drivers/net/hyperv/hyperv_net.h |  4 ++--
>  drivers/net/hyperv/netvsc_drv.c | 14 +++---
>  drivers/net/virtio_net.c| 23 ++-
>  drivers/net/vmxnet3/vmxnet3_drv.c   |  7 ---
>  drivers/net/xen-netback/interface.c |  5 -
>  drivers/net/xen-netfront.c  |  2 ++
>  drivers/staging/unisys/include/iochannel.h  | 10 --
>  drivers/staging/unisys/visornic/visornic_main.c |  4 ++--
>  8 files changed, 35 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index f4fbcb5..3958ada 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -606,8 +606,8 @@ struct nvsp_message {
>  } __packed;
>  
>  
> -#define NETVSC_MTU 65536
> -#define NETVSC_MTU_MIN 68
> +#define NETVSC_MTU 65535
> +#define NETVSC_MTU_MIN ETH_MIN_MTU
>  
>  #define NETVSC_RECEIVE_BUFFER_SIZE   (1024*1024*16)  /* 16MB */
>  #define NETVSC_RECEIVE_BUFFER_SIZE_LEGACY(1024*1024*15)  /* 15MB */
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index f0919bd..3b28cf1 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -872,19 +872,12 @@ static int netvsc_change_mtu(struct net_device *ndev, 
> int mtu)
>   struct netvsc_device *nvdev = ndevctx->nvdev;
>   struct hv_device *hdev = ndevctx->device_ctx;
>   struct netvsc_device_info device_info;
> - int limit = ETH_DATA_LEN;
>   u32 num_chn;
>   int ret = 0;
>  
>   if (ndevctx->start_remove || !nvdev || nvdev->destroy)
>   return -ENODEV;
>  
> - if (nvdev->nvsp_version >= NVSP_PROTOCOL_VERSION_2)
> - limit = NETVSC_MTU - ETH_HLEN;
> -
> - if (mtu < NETVSC_MTU_MIN || mtu > limit)
> - return -EINVAL;
> -
>   ret = netvsc_close(ndev);
>   if (ret)
>   goto out;
> @@ -1402,6 +1395,13 @@ static int netvsc_probe(struct hv_device *dev,
>   netif_set_real_num_tx_queues(net, nvdev->num_chn);
>   netif_set_real_num_rx_queues(net, nvdev->num_chn);
>  
> + /* MTU range: 68 - 1500 or 65521 */
> + net->min_mtu = NETVSC_MTU_MIN;
> + if (nvdev->nvsp_version >= NVSP_PROTOCOL_VERSION_2)
> + net->max_mtu = NETVSC_MTU - ETH_HLEN;
> + else
> + net->max_mtu = ETH_DATA_LEN;
> +
>   ret = register_netdev(net);
>   if (ret != 0) {
>   pr_err("Unable to register netdev.\n");
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index fad84f3..720809f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1419,17 +1419,6 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>   .set_settings = virtnet_set_settings,
>  };
>  
> -#define MIN_MTU 68
> -#define MAX_MTU 65535
> -
> -static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
> -{
> - if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
> - return -EINVAL;
> - dev->mtu = new_mtu;
> - return 0;
> -}
> -
>  static const struct net_device_ops virtnet_netdev = {
>   .ndo_open= virtnet_open,
>   .ndo_stop= virtnet_close,
> @@ -1437,7 +1426,6 @@ static const struct net_device_ops virtnet_netdev = {
>   .ndo_validate_addr   = eth_validate_addr,
>   .ndo_set_mac_address = virtnet_set_mac_address,
>   .ndo_set_rx_mode = virtnet_set_rx_mode,
> - .ndo_change_mtu  = virtnet_change_mtu,
>   .ndo_get_stats64 = virtnet_stats,
>   .ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>   .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> @@ -1748,6 +1736,9 @@ static bool virtnet_validate_features(struct 
> virtio_device *vdev)
>   return true;
>  }
>  
> +#define MIN_MTU ETH_MIN_MTU
> +#define MAX_MTU ETH_MAX_MTU
> +

Can we drop these btw?

>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>   int i, err;
> @@ -1821,6 +1812,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>  
>   dev->vlan_features = dev->features;
>  
> + /* MTU range: 68 - 65535 */
> + dev->min_mtu = MIN_MTU;
> + dev->max_

RE: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF support

2016-10-20 Thread Vatsavayi, Raghu


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, October 20, 2016 11:13 AM
> To: Vatsavayi, Raghu
> Cc: netdev@vger.kernel.org; Vatsavayi, Raghu; Chickles, Derek; Burla,
> Satananda; Manlunas, Felix
> Subject: Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF
> support
> 
> From: Raghu Vatsavayi 
> Date: Wed, 19 Oct 2016 22:40:38 -0700
> 
> > +/* Default behaviour of Liquidio is to provide one queue per VF. But
> > +Liquidio
> > + * can also provide multiple queues to each VF. If user wants to
> > +change the
> > + * default behaviour HW should be provided configuration info at init
> > +time,
> > + * based on which it will create control queues for communicating with
> FW.
> > + */
> > +static u32 max_vfs[2] = { 0, 0 };
> > +module_param_array(max_vfs, int, NULL, 0444);
> > +MODULE_PARM_DESC(max_vfs, "Assign two comma-separated unsigned
> > +integers that specify max number of VFs for PF0 (left of the comma)
> > +and PF1 (right of the comma); for 23xx only. By default HW will
> > +configure as many VFs as queues after allocating PF queues.To
> > +increase queues for VF use this parameter. Use sysfs to create these
> > +VFs.");
> > +
> > +static unsigned int num_queues_per_pf[2] = { 0, 0 };
> > +module_param_array(num_queues_per_pf, uint, NULL, 0444);
> > +MODULE_PARM_DESC(num_queues_per_pf, "two comma-separated
> unsigned
> > +integers that specify number of queues per PF0 (left of the comma)
> > +and PF1 (right of the comma); for 23xx only");
> > +
> >  static int ptp_enable = 1;
> 
> We cannot continue to allow drivers to add custom module parameters to
> control this.  It is the worst user experience possible.
> 
> We need a tree-wide generic, consistent, manner in which to configure and
> control this kind of thing.

Sure Dave, I will remove max_vfs module parameter and will use tree wide 
generic 
sysfs interface to enable VFs. Also if user wants to have multiple queues then 
because 
of the way Liquidio HW works we need num_queues_per_pf and num_queues_per_vf 
module parameters at HW/module init time. This is required only in case of 
non-default
case of multi-queues per VF because HW has to carve these queues before FW can 
start
communicating with PF/VF host drivers, so we may include these two. 

I will soon forward you the patches with the changes that you recommended.

Thanks Much.
Raghu.


[PATCH] net: l2tp_eth: fix max_mtu

2016-10-20 Thread Asbjoern Sloth Toennesen
Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking")
CC: netdev@vger.kernel.org
CC: Jarod Wilson 
Signed-off-by: Asbjoern Sloth Toennesen 
---
 net/l2tp/l2tp_eth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 965f7e3..ba82dcc 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -259,6 +259,7 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, 
u32 session_id, u32 p
session->mtu = dev->mtu - session->hdr_len;
dev->mtu = session->mtu;
dev->needed_headroom += session->hdr_len;
+   dev->max_mtu = ETH_MAX_MTU - dev->needed_headroom;
 
priv = netdev_priv(dev);
priv->dev = dev;
-- 
2.9.3



Re: [PATCH 1/4] kconfig: introduce the "imply" keyword

2016-10-20 Thread Nicolas Pitre
On Thu, 20 Oct 2016, Edward Cree wrote:

> On 20/10/16 19:29, Nicolas Pitre wrote:
> > On Thu, 20 Oct 2016, Edward Cree wrote:
> >> But the desire is a property of the user, not of the driver.  If you're
> >> willing to add CONFIG_FOO_BAZ to every combination of (driver, subsystem)
> >> then "imply" becomes unnecessary, doesn't it?
> > Absolutely.  And if that's something that inspires you please be my
> > guest.  So far, though, this apparently didn't inspire the majority of
> > driver authors who preferred to have a smaller set of config options and
> > forcefully pull in the BAZ features with a "select".  But "select" comes
> > with its set of evils which "imply" is meant to overcome.
> It really doesn't inspire me either ;)
> I was using it as a way to set up the converse, rather than as any kind of
> serious suggestion.
> And I agree that "imply", as it stands, is an improvement over "select" for
> these kinds of cases.  I just think it could be further improved.
> >> Conversely, if you *don't*
> >> want to have to do that, then "imply" needs to only ever deal in defaults,
> >> not in limitations.
> > As I explained, It still has to prevent BAZ=m if FOO moves from m to y
> > otherwise this would effectively have the same result as BAZ=n in
> > practice and that is not what people expect if BAZ actually isn't n in
> > your .config file.  That's why "select" also has that particular
> > semantic.
> If FOO "moves from" m to y (by which I'm assuming you mean a call to
> sym_set_tristate_value()), then by all means set BAZ=y.  But the user
> should then still be allowed to move BAZ from y to m without having to
> change FOO (though hopefully they will get warned about it somehow).

The case I'm most worried about is:

- open .config in $EDITOR
- change "CONFIG_FOO=m" to "CONFIG_FOO=y"
- save and exit
- make

The warning is most likely to be missed in that case, and if .config 
doesn't list CONFIG_BAZ as unset then this is likely to confuse people.

> > Here "imply" is meant to be a weaker form of "select".  If you prefer
> > not to have that limitation imposed by either "select" and "imply" then
> > simply don't use them at all.  Nothing forces you to use any of them if
> > your code can cope with any config combination.
> I'm interpreting "imply" as being more a way of saying "if you want FOO you
> probably want BAZ as well".  But maybe that should be yet another new
> keyword if it's so different from what you want "imply" to be.  "suggests",
> perhaps.

Indeed. That's exactly the keyword that came to my mind after I sent my 
previous reply.

> >> Right, so those drivers can use PTP if they're y and PTP is m, as long
> >> as the PTP module is loaded when they probe.
> > Not at the moment. There is no way for PTP to dynamically signal to
> > interested drivers its presence at run time.  And drivers, when
> > built-in, typically probe their hardware during the boot process even
> > before you have the chance to load any module. If that ever changes,
> > then the imply or select statement could simply be dropped.
> At least for PCIe devices, the driver can be un- and re-bound (despite
> being built-in) through sysfs.  So you (already) can re-probe them after
> loading PTP.  So driver=y && PTP=m is valid, today.

I agree with the principle, but the implementation just doesn't allow it 
at the moment. In a simplified form, what's there now in ptp.h is:

#if IS_REACHABLE(CONFIG_PTP)
extern struct ptp_clock *ptp_clock_register(...);
#else
static inline struct ptp_clock *ptp_clock_register(...) { return NULL; }
#endif

i.e. drivers may cope at runtime with PTP being absent, but there is no 
support for PTP to be loaded after drivers referring to it. Hence the 
use of "select" that I simply converted to "imply".

> >> I think that Josh's suggestion (have the UI warn you if you set BAZ to m
> >> while FOO=y) is the right approach, but I also think it should be done
> >> now rather than at some unspecified future time.
> > Please advocate this with kconfig UI authors.  My recursion stack is
> > already quite deep.
> If I'm reading your patch correctly, your symbol.c additions enforce this
> restriction, and AFAIK the UI can't override that by saying "Yeah I warned
> the user and he said it was fine".
> The kconfig UI API would need to change; sym_set_tristate_value() could
> grow an 'override-imply' flag, for instance.

The UI may tell the user about the restriction and suggest a way to 
overcome it by also changing the "impliers" to m.

Otherwise I much prefer to have a kconfig keyword that expresses the 
fact that it is actually OK to override it, like this "suggests" idea 
could do.

Sidenote: the kconfig language isn't coherent wrt english grammar as 
there is "depends on" but "select" rather than "selects". I interpreted 
"select" as using the imperative mood, hence the addition of "imply" 
rather than "implies".


Nicolas


Re: [PATCH] davinci_emac: fix setting the mac from DT

2016-10-20 Thread Jeroen Hofstee

Hi,


On 20-10-16 14:41, Tony Lindgren wrote:

* Jeroen Hofstee  [161019 12:39]:

commit 9120bd6e9f77 ("net: davinci_emac: Get device dm816x MAC address
using the cpsw code") sets the mac address to the one stored in the chip
unconditionally, overwritten the one already set from the device tree.
This patch makes sure the mac from DT is preserved.

On a am3517 this address is incorrectly read as all zeros, making it
impossible to set a valid mac address without this patch.

OK, at least I don't have better ideas for fixing this:


more details about the am3517 specific issue can be found
here http://marc.info/?l=linux-omap&m=147678889732646&w=2

Regards,
Jeroen



Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 12:40 PM, Cong Wang  wrote:
> On Thu, Oct 20, 2016 at 12:32 PM, Cong Wang  wrote:
>> On Thu, Oct 20, 2016 at 10:26 AM, Eric Dumazet  
>> wrote:
>>> From: Eric Dumazet 
>>>
>>> In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
>>> range") Cong added BH protection in set_local_port_range() but missed
>>> that same fix was needed in set_ping_group_range()
>>
>> Don't know why ping_group_range shares the same lock with local_port_range...
>> Perhaps just for saving a few bytes, but that is why I missed this place.
>
> Hold on... We clearly have typos there... Your fix is not correct.

We need the attached patch, your patch should be reverted, because
unlike local_port_range we never read it in BH context, no need to bother _bh.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cb67de..80bc36b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -96,11 +96,11 @@ static void inet_get_ping_group_range_table(struct 
ctl_table *table, kgid_t *low
container_of(table->data, struct net, 
ipv4.ping_group_range.range);
unsigned int seq;
do {
-   seq = read_seqbegin(&net->ipv4.ip_local_ports.lock);
+   seq = read_seqbegin(&net->ipv4.ping_group_range.lock);
 
*low = data[0];
*high = data[1];
-   } while (read_seqretry(&net->ipv4.ip_local_ports.lock, seq));
+   } while (read_seqretry(&net->ipv4.ping_group_range.lock, seq));
 }
 
 /* Update system visible IP port range */
@@ -109,10 +109,10 @@ static void set_ping_group_range(struct ctl_table *table, 
kgid_t low, kgid_t hig
kgid_t *data = table->data;
struct net *net =
container_of(table->data, struct net, 
ipv4.ping_group_range.range);
-   write_seqlock(&net->ipv4.ip_local_ports.lock);
+   write_seqlock(&net->ipv4.ping_group_range.lock);
data[0] = low;
data[1] = high;
-   write_sequnlock(&net->ipv4.ip_local_ports.lock);
+   write_sequnlock(&net->ipv4.ping_group_range.lock);
 }
 
 /* Validate changes from /proc interface. */


Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 12:32 PM, Cong Wang  wrote:
> On Thu, Oct 20, 2016 at 10:26 AM, Eric Dumazet  wrote:
>> From: Eric Dumazet 
>>
>> In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
>> range") Cong added BH protection in set_local_port_range() but missed
>> that same fix was needed in set_ping_group_range()
>
> Don't know why ping_group_range shares the same lock with local_port_range...
> Perhaps just for saving a few bytes, but that is why I missed this place.

Hold on... We clearly have typos there... Your fix is not correct.


Re: [RFC PATCH net-next] bpf: fix potential percpu map overcopy to user.

2016-10-20 Thread Daniel Borkmann

On 10/20/2016 08:41 PM, William Tu wrote:

On Thu, Oct 20, 2016 at 9:58 AM, Alexei Starovoitov
 wrote:

On Thu, Oct 20, 2016 at 06:04:38PM +0200, Daniel Borkmann wrote:


diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index ee384f0..d4832e8 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -25,6 +25,33 @@

  static int map_flags;

+static unsigned int num_possible_cpus(void)
+{
+ static const char *fcpu = "/sys/devices/system/cpu/possible";
+ unsigned int val, possible_cpus = 0;
+ char buff[128];
+ FILE *fp;
+
+ fp = fopen(fcpu, "r");
+ if (!fp) {
+ printf("Failed to open %s: '%s'!\n", fcpu, strerror(errno));
+ exit(1);
+ }
+
+ while (fgets(buff, sizeof(buff), fp)) {
+ if (sscanf(buff, "%*u-%u", &val) == 1)
+ possible_cpus = val;
+ }


looks great to me.
Could you move it into bpf_sys.h or somehow make it common in libbpf
and reuse it in samples/bpf/ ?
Since quite a few samples need this fix as well.


Ahh, true.


Thanks!


Looks good to me. I tested it and it works fine.


Okay, thanks. I'll fix that up, mid-term we should try and move most of
that over to kernel selftests/bpf, and reuse tools/lib/bpf/.


Thanks!
William


Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 10:26 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
> range") Cong added BH protection in set_local_port_range() but missed
> that same fix was needed in set_ping_group_range()

Don't know why ping_group_range shares the same lock with local_port_range...
Perhaps just for saving a few bytes, but that is why I missed this place.


Re: [RFC PATCH net-next] bpf: fix potential percpu map overcopy to user.

2016-10-20 Thread William Tu
> Documentation/cputopology.txt +106 says /sys/devices/system/cpu/possible
> outputs cpu_possible_mask. That is the same as in num_possible_cpus(), so
> first step would be to fix the buggy example code, imho.
>
> What perhaps could be done in a second step to reduce overhead is an option
> for bpf(2) to pass in a cpu mask similarly as for sched_{get,set}affinity()
> syscalls, where user space can construct a mask via CPU_SET(3). For the
> syscall time, kernel would lock hot plugging via get_online_cpus() and
> put_online_cpus(), it would check whether passed CPUs are online to query
> and if so then it would copy the values into the user provided buffer. I'd
> think this might be useful in a number of ways anyway.
>

I like this idea. So in this case, the only the data at the cpu
specified by user in the CPU_SET is copied to userspace, potentially
have better performance than always copying the data *
num_possible_cpus() bytes.

Regards,
William


Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-20 Thread Michal Hocko
On Wed 19-10-16 10:23:55, Dave Hansen wrote:
> On 10/19/2016 10:01 AM, Michal Hocko wrote:
> > The question I had earlier was whether this has to be an explicit FOLL
> > flag used by g-u-p users or we can just use it internally when mm !=
> > current->mm
> 
> The reason I chose not to do that was that deferred work gets run under
> a basically random 'current'.  If we just use 'mm != current->mm', then
> the deferred work will sometimes have pkeys enforced and sometimes not,
> basically randomly.

OK, I see (async_pf_execute and ksm ). It makes more sense to me. Thanks
for the clarification.

-- 
Michal Hocko
SUSE Labs


Re: [Patch net] net: saving irq context for peernet2id()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 11:29 AM, Cong Wang  wrote:
> On Thu, Oct 20, 2016 at 7:58 AM, Stephen Smalley  wrote:
>> On 10/20/2016 02:52 AM, Cong Wang wrote:
>>> A kernel warning inside __local_bh_enable_ip() was reported by people
>>> running SELinux, this is caused due to some SELinux functions
>>> (indirectly) call peernet2id() with IRQ disabled in process context,
>>> when we re-enable BH with IRQ disabled kernel complains. Shut up this
>>> warning by saving IRQ context in peernet2id(), BH is still implicitly
>>> disabled.
>>
>> Not sure this suffices; kill_fasync() -> send_sigio() ->
>> send_sigio_to_task() -> sigio_perm() -> security_file_send_sigiotask()
>> -> selinux_file_send_sigiotask() -> ... -> audit_log() -> ... ->
>> peernet2id()
>
> Oh, this is a new one. kill_fasync() is called in IRQ handler, so we actually
> do multicast in IRQ context It makes no sense, netlink multicast could
> be very expensive if we have many listeners.
>
> I am Cc'ing Richard who added that multicast in audit_log_end(). It seems
> not easy to just move the multicast to a workqueue, since the skb is copied
> from audit_buffer which is freed immediately after that, probably need another
> queue like audit_skb_queue.

Please let me know if the attached patch makes any sense to you, before
I give it a serious test.

Thanks!
diff --git a/kernel/audit.c b/kernel/audit.c
index f1ca116..cb2b31b 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -139,6 +139,7 @@ static int audit_freelist_count;
 static LIST_HEAD(audit_freelist);
 
 static struct sk_buff_head audit_skb_queue;
+static struct sk_buff_head audit_skb_multicast_queue;
 /* queue of skbs to send to auditd when/if it comes back */
 static struct sk_buff_head audit_skb_hold_queue;
 static struct task_struct *kauditd_task;
@@ -468,7 +469,8 @@ static void kauditd_send_multicast_skb(struct sk_buff *skb, 
gfp_t gfp_mask)
if (!copy)
return;
 
-   nlmsg_multicast(sock, copy, 0, AUDIT_NLGRP_READLOG, gfp_mask);
+   skb_queue_tail(&audit_skb_multicast_queue, copy);
+   wake_up_interruptible(&kauditd_wait);
 }
 
 /*
@@ -509,6 +511,25 @@ static void flush_hold_queue(void)
consume_skb(skb);
 }
 
+static void flush_multicast_queue(void)
+{
+   struct audit_net *aunet = net_generic(&init_net, audit_net_id);
+   struct sock *sock = aunet->nlsk;
+   struct sk_buff *skb;
+
+   if (!netlink_has_listeners(sock, AUDIT_NLGRP_READLOG))
+   return;
+
+   skb = skb_dequeue(&audit_skb_multicast_queue);
+   if (likely(!skb))
+   return;
+
+   while (skb) {
+   nlmsg_multicast(sock, skb, 0, AUDIT_NLGRP_READLOG, GFP_KERNEL);
+   skb = skb_dequeue(&audit_skb_multicast_queue);
+   }
+}
+
 static int kauditd_thread(void *dummy)
 {
set_freezable();
@@ -517,6 +538,8 @@ static int kauditd_thread(void *dummy)
 
flush_hold_queue();
 
+   flush_multicast_queue();
+
skb = skb_dequeue(&audit_skb_queue);
 
if (skb) {
@@ -530,7 +553,8 @@ static int kauditd_thread(void *dummy)
continue;
}
 
-   wait_event_freezable(kauditd_wait, 
skb_queue_len(&audit_skb_queue));
+   wait_event_freezable(kauditd_wait, 
skb_queue_len(&audit_skb_queue)
+  || 
skb_queue_len(&audit_skb_multicast_queue));
}
return 0;
 }


Re: [PATCH 1/4] kconfig: introduce the "imply" keyword

2016-10-20 Thread Edward Cree
On 20/10/16 19:29, Nicolas Pitre wrote:
> On Thu, 20 Oct 2016, Edward Cree wrote:
>> But the desire is a property of the user, not of the driver.  If you're
>> willing to add CONFIG_FOO_BAZ to every combination of (driver, subsystem)
>> then "imply" becomes unnecessary, doesn't it?
> Absolutely.  And if that's something that inspires you please be my 
> guest.  So far, though, this apparently didn't inspire the majority of 
> driver authors who preferred to have a smaller set of config options and 
> forcefully pull in the BAZ features with a "select".  But "select" comes 
> with its set of evils which "imply" is meant to overcome.
It really doesn't inspire me either ;)
I was using it as a way to set up the converse, rather than as any kind of
serious suggestion.
And I agree that "imply", as it stands, is an improvement over "select" for
these kinds of cases.  I just think it could be further improved.
>> Conversely, if you *don't*
>> want to have to do that, then "imply" needs to only ever deal in defaults,
>> not in limitations.
> As I explained, It still has to prevent BAZ=m if FOO moves from m to y 
> otherwise this would effectively have the same result as BAZ=n in 
> practice and that is not what people expect if BAZ actually isn't n in 
> your .config file.  That's why "select" also has that particular 
> semantic.
If FOO "moves from" m to y (by which I'm assuming you mean a call to
sym_set_tristate_value()), then by all means set BAZ=y.  But the user
should then still be allowed to move BAZ from y to m without having to
change FOO (though hopefully they will get warned about it somehow).
> Here "imply" is meant to be a weaker form of "select".  If you prefer 
> not to have that limitation imposed by either "select" and "imply" then 
> simply don't use them at all.  Nothing forces you to use any of them if 
> your code can cope with any config combination.
I'm interpreting "imply" as being more a way of saying "if you want FOO you
probably want BAZ as well".  But maybe that should be yet another new
keyword if it's so different from what you want "imply" to be.  "suggests",
perhaps.
>> Right, so those drivers can use PTP if they're y and PTP is m, as long 
>> as the PTP module is loaded when they probe.
> Not at the moment. There is no way for PTP to dynamically signal to 
> interested drivers its presence at run time.  And drivers, when 
> built-in, typically probe their hardware during the boot process even 
> before you have the chance to load any module. If that ever changes, 
> then the imply or select statement could simply be dropped.
At least for PCIe devices, the driver can be un- and re-bound (despite
being built-in) through sysfs.  So you (already) can re-probe them after
loading PTP.  So driver=y && PTP=m is valid, today.
>> I think that Josh's suggestion (have the UI warn you if you set BAZ to m
>> while FOO=y) is the right approach, but I also think it should be done
>> now rather than at some unspecified future time.
> Please advocate this with kconfig UI authors.  My recursion stack is 
> already quite deep.
If I'm reading your patch correctly, your symbol.c additions enforce this
restriction, and AFAIK the UI can't override that by saying "Yeah I warned
the user and he said it was fine".
The kconfig UI API would need to change; sym_set_tristate_value() could
grow an 'override-imply' flag, for instance.

But I'm not married to the idea; I don't have a real-world use-case this
interferes with, so if you still think I'm wrong, feel free to ignore me :)

-Ed


Re: [Patch net] net: saving irq context for peernet2id()

2016-10-20 Thread Paul Moore
On Thu, Oct 20, 2016 at 2:29 PM, Cong Wang  wrote:
> On Thu, Oct 20, 2016 at 7:58 AM, Stephen Smalley  wrote:
>> On 10/20/2016 02:52 AM, Cong Wang wrote:
>>> A kernel warning inside __local_bh_enable_ip() was reported by people
>>> running SELinux, this is caused due to some SELinux functions
>>> (indirectly) call peernet2id() with IRQ disabled in process context,
>>> when we re-enable BH with IRQ disabled kernel complains. Shut up this
>>> warning by saving IRQ context in peernet2id(), BH is still implicitly
>>> disabled.
>>
>> Not sure this suffices; kill_fasync() -> send_sigio() ->
>> send_sigio_to_task() -> sigio_perm() -> security_file_send_sigiotask()
>> -> selinux_file_send_sigiotask() -> ... -> audit_log() -> ... ->
>> peernet2id()
>
> Oh, this is a new one. kill_fasync() is called in IRQ handler, so we actually
> do multicast in IRQ context It makes no sense, netlink multicast could
> be very expensive if we have many listeners.

I'm sure there are a few others I don't know about, but I believe the
only commonly used audit multicast listener is systemd.

> I am Cc'ing Richard who added that multicast in audit_log_end(). It seems
> not easy to just move the multicast to a workqueue, since the skb is copied
> from audit_buffer which is freed immediately after that, probably need another
> queue like audit_skb_queue.

This approach would double the queue size which is something I want to
avoid.  I would suggest sticking with a single queue and dealing with
the netlink message link fixup and multicast send in the existing
netlink unicast thread; basically we would just be moving the
multicast code from audit_log_end() into kauditd_thread().  This is
the same approach I mentioned earlier off-list.

However, that isn't something I want to mess with as a regression fix,
mostly because I really want to see this regression gone by -rc2 as it
is making SELinux testing a real pain.  If the patch posted at the top
of this thread isn't a suitable fix, we really should revert the
original patch.

-- 
paul moore
www.paul-moore.com


Re: [PATCH net-next v2 0/9] net: use core MTU range checking everywhere

2016-10-20 Thread David Miller
From: Jarod Wilson 
Date: Thu, 20 Oct 2016 13:55:15 -0400

> This stack of patches should get absolutely everything in the kernel
> converted from doing their own MTU range checking to the core MTU range
> checking. This second spin includes alterations to hopefully fix all
> concerns raised with the first, as well as including some additional
> changes to drivers and infrastructure where I completely missed necessary
> updates.
> 
> These have all been built through the 0-day build infrastructure via the
> (rebasing) master branch at https://github.com/jarodwilson/linux-muck, which
> at the time of the most recent compile across 147 configs, was based on
> net-next at commit 7b1536ef0aa0.

Series applied, hopefully this gets most of the fallout.

Thanks Jarod.


Re: [PATCH 0/4] make POSIX timers optional with some Kconfig help

2016-10-20 Thread Josh Triplett
On Wed, Oct 19, 2016 at 07:42:49PM -0400, Nicolas Pitre wrote:
> Many embedded systems don't need the full POSIX timer support.
> Configuring them out provides a nice kernel image size reduction.
> 
> When POSIX timers are configured out, the PTP clock subsystem should be
> left out as well. However a bunch of ethernet drivers currently *select*
> the later in their Kconfig entries. Therefore some more work was needed
> to break that hard dependency from those drivers without preventing their
> usage altogether.
> 
> Therefore this series also includes kconfig changes to implement a new
> keyword to express some reverse dependencies like "select" does, named
> "imply", and still allowing for the target config symbol to be disabled
> if the user or a direct dependency says so.
> 
> How to deal with the dependencies across three subsystems for potential
> upstream merging needs to be figured out.

This looks good to me, and I like the new "imply" approach.

I'd still like to see a more general solution for reporting the use of
compiled-out syscalls, but I don't think that needs to block this patch
series.

Reviewed-by: Josh Triplett 


Re: [PATCH net] ipv4: disable BH in set_ping_group_range()

2016-10-20 Thread David Miller
From: Eric Dumazet 
Date: Thu, 20 Oct 2016 10:26:48 -0700

> From: Eric Dumazet 
> 
> In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
> range") Cong added BH protection in set_local_port_range() but missed
> that same fix was needed in set_ping_group_range()
> 
> Fixes: b8f1a55639e6 ("udp: Add function to make source port for UDP tunnels")
> Signed-off-by: Eric Dumazet 
> Reported-by: Eric Salo 

Applied and queued up for -stable.


Re: [PATCH -next] net: ethernet: mediatek: use dev_kfree_skb_any instead of dev_kfree_skb

2016-10-20 Thread David Miller
From: Wei Yongjun 
Date: Thu, 20 Oct 2016 17:00:32 +

> From: Wei Yongjun 
> 
> Replace dev_kfree_skb with dev_kfree_skb_any in mtk_start_xmit()
> which can be called from hard irq context (netpoll) and from
> other contexts. mtk_start_xmit() only frees skbs that it has
> dropped.
> 
> This is detected by Coccinelle semantic patch.
> 
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH -next] myri10ge: fix typo in parameter description

2016-10-20 Thread David Miller
From: Wei Yongjun 
Date: Thu, 20 Oct 2016 17:01:56 +

> From: Wei Yongjun 
> 
> Fix typo in parameter description.
> 
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH -next] dwc_eth_qos: use dev_kfree_skb_any instead of dev_kfree_skb

2016-10-20 Thread David Miller
From: Wei Yongjun 
Date: Thu, 20 Oct 2016 16:59:49 +

> From: Wei Yongjun 
> 
> Replace dev_kfree_skb with dev_kfree_skb_any in dwceqos_start_xmit()
> which can be called from hard irq context (netpoll) and from
> other contexts. dwceqos_start_xmit() only frees skbs that it has
> dropped.
> 
> This is detected by Coccinelle semantic patch.
> 
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH net-next v12 5/9] openvswitch: add processing of L3 packets

2016-10-20 Thread Pravin Shelar
On Wed, Oct 19, 2016 at 9:52 AM, Jiri Benc  wrote:
> On Tue, 18 Oct 2016 22:13:45 -0700, Pravin Shelar wrote:
>> On Mon, Oct 17, 2016 at 6:02 AM, Jiri Benc  wrote:
>> > -   skb_reset_network_header(skb);
>> > +   skb->protocol = parse_ethertype(skb);
>>
>> I am not sure about changing skb->protocol here.
>> By changing this skb loosing information about packet type. Therefore
>> if packet re-enters OVS (through different bridge), this packet would
>> look like L3 packet. function key_extract_mac_proto() would not see
>> TEB type packet.
>
> This should be okay. If the packet is sent out to an Ethernet interface
> (whatever interface it is), skb->protocol needs to contain the payload
> type. We're not interested in ETH_P_TEB. If the packet is sent out to
> an ARPHRD_NONE interface, ETH_P_TEB is pushed back.
>
I see, vport send is restoring the skb protocol field. It should be fine then.

> Basically, what we're doing here is unconditionally converting
> ETH_P_TEB packets *coming from ARPHRD_NONE interfaces* (this is
> important) into regular Ethernet packets. Which is exactly what we want.
>
> Am I missing something?
>
>  Jiri


Re: [PATCH net] net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd kernels

2016-10-20 Thread Florian Fainelli
On 10/20/2016 11:44 AM, David Miller wrote:
> From: Florian Fainelli 
> Date: Thu, 20 Oct 2016 09:32:19 -0700
> 
>> For a kernel that is being kexec'd we re-enable the integrated GPHY in
>> order for the subsequent MDIO bus scan to succeed and properly bind to
>> the bcm7xxx PHY driver. If we did not do that, the GPHY would be shut
>> down by the time the MDIO driver is probing the bus, and it would fail
>> to read the correct PHY OUI and therefore bind to an appropriate PHY
>> driver. Later on, this would cause DSA not to be able to successfully
>> attach to the PHY, and the interface would not be created at all.
>>
>> Signed-off-by: Florian Fainelli 
> 
> Applied, but I have to wonder...
> 
> If enabling the GPHY is necessary for proper probing, why isn't the
> kexec kernel enabling it properly?

The GPHY enable control is unfortunately located in the switch register
block space and is dependent upon the switch port to be
enabled/accessible, which the DSA layer won't create if the GPHY is not
successfully probed and bound to a PHY driver. It did not appear that
probe deferral could help solve that problem, since MDIO and switch are
reasonable independent from each other.

This was the easiest way I could come up with, without requiring DT
changes and references to register blocks that are not quite relevant to
each other.

HTH
-- 
Florian


Re: [PATCH net] udp: must lock the socket in udp_disconnect()

2016-10-20 Thread David Miller
From: Eric Dumazet 
Date: Thu, 20 Oct 2016 09:39:40 -0700

> From: Eric Dumazet 
> 
> Baozeng Ding reported KASAN traces showing uses after free in
> udp_lib_get_port() and other related UDP functions.
> 
> A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
> 
> I could write a reproducer with two threads doing :
> 
> static int sock_fd;
> static void *thr1(void *arg)
> {
>   for (;;) {
>   connect(sock_fd, (const struct sockaddr *)arg,
>   sizeof(struct sockaddr_in));
>   }
> }
> 
> static void *thr2(void *arg)
> {
>   struct sockaddr_in unspec;
> 
>   for (;;) {
>   memset(&unspec, 0, sizeof(unspec));
>   connect(sock_fd, (const struct sockaddr *)&unspec,
>   sizeof(unspec));
> }
> }
> 
> Problem is that udp_disconnect() could run without holding socket lock,
> and this was causing list corruptions.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Baozeng Ding 

Applied, sounds like I should queue this up for -stable too right?


Re: [PATCH net] net: dsa: bcm_sf2: Prevent GPHY shutdown for kexec'd kernels

2016-10-20 Thread David Miller
From: Florian Fainelli 
Date: Thu, 20 Oct 2016 09:32:19 -0700

> For a kernel that is being kexec'd we re-enable the integrated GPHY in
> order for the subsequent MDIO bus scan to succeed and properly bind to
> the bcm7xxx PHY driver. If we did not do that, the GPHY would be shut
> down by the time the MDIO driver is probing the bus, and it would fail
> to read the correct PHY OUI and therefore bind to an appropriate PHY
> driver. Later on, this would cause DSA not to be able to successfully
> attach to the PHY, and the interface would not be created at all.
> 
> Signed-off-by: Florian Fainelli 

Applied, but I have to wonder...

If enabling the GPHY is necessary for proper probing, why isn't the
kexec kernel enabling it properly?


Re: [RFC PATCH net-next] bpf: fix potential percpu map overcopy to user.

2016-10-20 Thread William Tu
On Thu, Oct 20, 2016 at 9:58 AM, Alexei Starovoitov
 wrote:
> On Thu, Oct 20, 2016 at 06:04:38PM +0200, Daniel Borkmann wrote:
>>
>> diff --git a/tools/testing/selftests/bpf/test_maps.c 
>> b/tools/testing/selftests/bpf/test_maps.c
>> index ee384f0..d4832e8 100644
>> --- a/tools/testing/selftests/bpf/test_maps.c
>> +++ b/tools/testing/selftests/bpf/test_maps.c
>> @@ -25,6 +25,33 @@
>>
>>  static int map_flags;
>>
>> +static unsigned int num_possible_cpus(void)
>> +{
>> + static const char *fcpu = "/sys/devices/system/cpu/possible";
>> + unsigned int val, possible_cpus = 0;
>> + char buff[128];
>> + FILE *fp;
>> +
>> + fp = fopen(fcpu, "r");
>> + if (!fp) {
>> + printf("Failed to open %s: '%s'!\n", fcpu, strerror(errno));
>> + exit(1);
>> + }
>> +
>> + while (fgets(buff, sizeof(buff), fp)) {
>> + if (sscanf(buff, "%*u-%u", &val) == 1)
>> + possible_cpus = val;
>> + }
>
> looks great to me.
> Could you move it into bpf_sys.h or somehow make it common in libbpf
> and reuse it in samples/bpf/ ?
> Since quite a few samples need this fix as well.
> Thanks!
>

Looks good to me. I tested it and it works fine.
Thanks!
William


Re: [PATCH 0/4] STM32F429: Add Ethernet fixes

2016-10-20 Thread David Miller
From: Alexandre TORGUE 
Date: Thu, 20 Oct 2016 17:21:22 +0200

> This series adds several fixes for Ethernet for stm32f429 MCU.
> First 2 patches have already been reviewed some months ago when 
> stm32 Ethernet glue has been pushed (I added in this series to keep
> history). Fixes are:
>  -Change DT to be compliant to stm32 ethernet glue binding
>  -Add phy-handle to correctly use mdio subnode
>  -Remove WoL support

I'm assuming this will be merged via the ARM tree.


Re: [PATCH net-next v2 3/9] net: use core MTU range checking in wireless drivers

2016-10-20 Thread David Miller
From: Johannes Berg 
Date: Thu, 20 Oct 2016 20:22:35 +0200

> On Thu, 2016-10-20 at 13:55 -0400, Jarod Wilson wrote:
>> - set max_mtu in wil6210 driver
>> - set max_mtu in atmel driver
>> - set min/max_mtu in cisco airo driver, remove airo_change_mtu
>> - set min/max_mtu in ipw2100/ipw2200 drivers, remove
>> libipw_change_mtu
>> - set min/max_mtu in p80211netdev, remove wlan_change_mtu
>> - set min/max_mtu in net/mac80211/iface.c and remove ieee80211_change_mtu
> 
> For the mac80211 part,
> 
> Acked-by: Johannes Berg 
> 
> Dave, I'm assuming you'll pick this up, but if you prefer not to I can
> also coordinate with Kalle to take this through our trees.

Yeah I'll get this, thanks for asking.


Re: [PATCH net] bpf, test: fix ld_abs + vlan push/pop stress test

2016-10-20 Thread David Miller
From: Daniel Borkmann 
Date: Thu, 20 Oct 2016 17:13:53 +0200

> After commit 636c2628086e ("net: skbuff: Remove errornous length
> validation in skb_vlan_pop()") mentioned test case stopped working,
> throwing a -12 (ENOMEM) return code. The issue however is not due to
> 636c2628086e, but rather due to a buggy test case that got uncovered
> from the change in behaviour in 636c2628086e.
> 
> The data_size of that test case for the skb was set to 1. In the
> bpf_fill_ld_abs_vlan_push_pop() handler bpf insns are generated that
> loop with: reading skb data, pushing 68 tags, reading skb data,
> popping 68 tags, reading skb data, etc, in order to force a skb
> expansion and thus trigger that JITs recache skb->data. Problem is
> that initial data_size is too small.
> 
> While before 636c2628086e, the test silently bailed out due to the
> skb->len < VLAN_ETH_HLEN check with returning 0, and now throwing an
> error from failing skb_ensure_writable(). Set at least minimum of
> ETH_HLEN as an initial length so that on first push of data, equivalent
> pop will succeed.
> 
> Fixes: 4d9c5c53ac99 ("test_bpf: add bpf_skb_vlan_push/pop() tests")
> Signed-off-by: Daniel Borkmann 

Applied, thanks Daniel.


Re: [PATCH] netfilter: don't permit unprivileged writes to global state via sysctls

2016-10-20 Thread David Miller
From: Pablo Neira Ayuso 
Date: Thu, 20 Oct 2016 20:22:24 +0200

> On Sat, Sep 24, 2016 at 12:21:04AM +0200, Jann Horn wrote:
>> This prevents the modification of nf_conntrack_max in unprivileged network
>> namespaces. For unprivileged network namespaces, ip_conntrack_max is kept
>> as a readonly sysctl in order to minimize potential compatibility issues.
>> 
>> This patch should apply cleanly to the net tree.
> 
> For the record: This patch looks good to me, but this legacy
> ip_conntrack sysctl code is now gone.
> 
> I don't know what is the procedure to get this to -stable branches now
> that this cannot be pushed upstream.

In the commit message for the -stable submission simply say "Not
applicable" in the upstream commit reference.  Like:

[ Upstream commit: Not applicable ]

or something like that.


Re: [PATCH 0/4] make POSIX timers optional with some Kconfig help

2016-10-20 Thread Nicolas Pitre
On Thu, 20 Oct 2016, Thomas Gleixner wrote:

> On Wed, 19 Oct 2016, Nicolas Pitre wrote:
> > Therefore this series also includes kconfig changes to implement a new
> > keyword to express some reverse dependencies like "select" does, named
> > "imply", and still allowing for the target config symbol to be disabled
> > if the user or a direct dependency says so.
> 
> That's really nice work! Thanks for doing that. It makes the whole thing
> more palatable.

Thanks.

Now I'd need some review tags...  ;-)


Nicolas


Re: [PATCH net v3] net: add recursion limit to GRO

2016-10-20 Thread David Miller
From: Sabrina Dubroca 
Date: Thu, 20 Oct 2016 15:58:02 +0200

> Currently, GRO can do unlimited recursion through the gro_receive
> handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
> to one level with encap_mark, but both VLAN and TEB still have this
> problem.  Thus, the kernel is vulnerable to a stack overflow, if we
> receive a packet composed entirely of VLAN headers.
> 
> This patch adds a recursion counter to the GRO layer to prevent stack
> overflow.  When a gro_receive function hits the recursion limit, GRO is
> aborted for this skb and it is processed normally.  This recursion
> counter is put in the GRO CB, but could be turned into a percpu counter
> if we run out of space in the CB.
> 
> Thanks to Vladimír Beneš  for the initial bug report.
> 
> Fixes: CVE-2016-7039
> Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.")
> Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated 
> vlan")
> Signed-off-by: Sabrina Dubroca 
> Reviewed-by: Jiri Benc 
> Acked-by: Hannes Frederic Sowa 

Applied and queued up for -stable, thanks!


Re: [PATCH] ipv6: properly prevent temp_prefered_lft sysctl race

2016-10-20 Thread David Miller
From: Jiri Bohac 
Date: Thu, 20 Oct 2016 12:29:26 +0200

> The check for an underflow of tmp_prefered_lft is always false
> because tmp_prefered_lft is unsigned. The intention of the check
> was to guard against racing with an update of the
> temp_prefered_lft sysctl, potentially resulting in an underflow.
> 
> As suggested by David Miller, the best way to prevent the race is
> by reading the sysctl variable using READ_ONCE.
> 
> Signed-off-by: Jiri Bohac 
> Reported-by: Julia Lawall 
> Fixes: 76506a986dc3 ("IPv6: fix DESYNC_FACTOR")

Applied, thanks Jiri.


Re: [Patch net] net: saving irq context for peernet2id()

2016-10-20 Thread Cong Wang
On Thu, Oct 20, 2016 at 7:58 AM, Stephen Smalley  wrote:
> On 10/20/2016 02:52 AM, Cong Wang wrote:
>> A kernel warning inside __local_bh_enable_ip() was reported by people
>> running SELinux, this is caused due to some SELinux functions
>> (indirectly) call peernet2id() with IRQ disabled in process context,
>> when we re-enable BH with IRQ disabled kernel complains. Shut up this
>> warning by saving IRQ context in peernet2id(), BH is still implicitly
>> disabled.
>
> Not sure this suffices; kill_fasync() -> send_sigio() ->
> send_sigio_to_task() -> sigio_perm() -> security_file_send_sigiotask()
> -> selinux_file_send_sigiotask() -> ... -> audit_log() -> ... ->
> peernet2id()

Oh, this is a new one. kill_fasync() is called in IRQ handler, so we actually
do multicast in IRQ context It makes no sense, netlink multicast could
be very expensive if we have many listeners.

I am Cc'ing Richard who added that multicast in audit_log_end(). It seems
not easy to just move the multicast to a workqueue, since the skb is copied
from audit_buffer which is freed immediately after that, probably need another
queue like audit_skb_queue.


Re: [PATCH 1/4] kconfig: introduce the "imply" keyword

2016-10-20 Thread Nicolas Pitre
On Thu, 20 Oct 2016, Edward Cree wrote:

> On 20/10/16 18:04, Nicolas Pitre wrote:
> > On Thu, 20 Oct 2016, Edward Cree wrote:
> >> Also, I don't think having any FOO=y should preclude BAZ=m.  Suppose both
> >> FOO and FOO2 imply BAZ, FOO=y and FOO2=m.
> > Some people didn't like the fact that you could turn a driver from m to
> > y and silently lose some features if they were provided by a subsystem
> > that also used to be m, which arguably is not the same as being
> > explicitly disabled.  With "select" this is not a problem as the target
> > symbol is also promoted to y in that case, so I wanted to preserve that
> > property.
> Right, but that's an argument for pushing the subsystem's default to y,
> not for preventing changing the subsystem back to m afterwards.
> >> Then if BAZ-features are only
> >> desired for driver FOO2, BAz=m makes sense.
> > In that case it would make more sense to add a config option related to
> > FOO asking if BAZ features are desired for that driver (there is already
> > one occurrence of that with PTP).  Or you could simply drop the "imply"
> > statement from the FOO config entry.
> But the desire is a property of the user, not of the driver.  If you're
> willing to add CONFIG_FOO_BAZ to every combination of (driver, subsystem)
> then "imply" becomes unnecessary, doesn't it?

Absolutely.  And if that's something that inspires you please be my 
guest.  So far, though, this apparently didn't inspire the majority of 
driver authors who preferred to have a smaller set of config options and 
forcefully pull in the BAZ features with a "select".  But "select" comes 
with its set of evils which "imply" is meant to overcome.

> Conversely, if you *don't*
> want to have to do that, then "imply" needs to only ever deal in defaults,
> not in limitations.

As I explained, It still has to prevent BAZ=m if FOO moves from m to y 
otherwise this would effectively have the same result as BAZ=n in 
practice and that is not what people expect if BAZ actually isn't n in 
your .config file.  That's why "select" also has that particular 
semantic.

Here "imply" is meant to be a weaker form of "select".  If you prefer 
not to have that limitation imposed by either "select" and "imply" then 
simply don't use them at all.  Nothing forces you to use any of them if 
your code can cope with any config combination.

In those cases where "imply" is used, you could drop it altogether 
already. But that's for driver authors to decide. If they went with 
"select" in the first place, there might be a reason, and "imply" is 
there to preserve that reason, semantically at least, without the 
handcuff effect that "select" imposes on the whole thing.

> >> There is also the case of drivers with the ability to detect at runtime
> >> whether BAZ is present, rather than making the decision at build time, but
> >> I'm not sure how common that is.
> > Right now that's how PTP support is done.  Drivers can optimize things
> > at build time, but most of them simply cope with a NULL return from
> > ptp_clock_register().  Hence the imply statement becomes a big
> > configuration hint rather than some hard build dependency.
> Right, so those drivers can use PTP if they're y and PTP is m, as long 
> as the PTP module is loaded when they probe.

Not at the moment. There is no way for PTP to dynamically signal to 
interested drivers its presence at run time.  And drivers, when 
built-in, typically probe their hardware during the boot process even 
before you have the chance to load any module. If that ever changes, 
then the imply or select statement could simply be dropped.

> But current "imply" semantics won't allow that...

And that's on purpose.

> I think that Josh's suggestion (have the UI warn you if you set BAZ to m
> while FOO=y) is the right approach, but I also think it should be done
> now rather than at some unspecified future time.

Please advocate this with kconfig UI authors.  My recursion stack is 
already quite deep.

> Otherwise you forbid
> potentially valid configs.

Like I said, if FOO=y and BAZ=m is a valid config, all you have to do is 
omit "imply BAZ" or "select BAZ" from the FOO config entry.  It's as 
simple as that.


Nicolas


Re: [PATCH] bnx2x: Replace semaphore stats_lock with mutex

2016-10-20 Thread David Miller
From: Binoy Jayan 
Date: Thu, 20 Oct 2016 14:22:12 +0530

> stats_lock is used as a simple mutex

No, it is not.

> @@ -1976,8 +1973,8 @@ int bnx2x_stats_safe_exec(struct bnx2x *bp,
>   /* Wait for statistics to end [while blocking further requests],
>* then run supplied function 'safely'.
>*/
> - rc = down_timeout(&bp->stats_lock, HZ / 10);
> - if (unlikely(rc)) {
> + rc = mutex_trylock(&bp->stats_lock);
> + if (unlikely(!rc)) {

It uses timeouts therefore this conversion is not 1 to 1.

You're losing functionality and potentially adding a regression.


Re: [PATCH net-next] net: phy: aquantia: add PHY ID of AQR106 and AQR107

2016-10-20 Thread David Miller
From: 
Date: Thu, 20 Oct 2016 16:30:31 +0800

> From: Shaohui Xie 
> 
> The AQR106 and AQR107 can use the existing driver.
> 
> Signed-off-by: Shaohui Xie 

Applied.


Re: [PATCH net-next v2 3/9] net: use core MTU range checking in wireless drivers

2016-10-20 Thread Johannes Berg
On Thu, 2016-10-20 at 13:55 -0400, Jarod Wilson wrote:
> - set max_mtu in wil6210 driver
> - set max_mtu in atmel driver
> - set min/max_mtu in cisco airo driver, remove airo_change_mtu
> - set min/max_mtu in ipw2100/ipw2200 drivers, remove
> libipw_change_mtu
> - set min/max_mtu in p80211netdev, remove wlan_change_mtu
> - set min/max_mtu in net/mac80211/iface.c and remove ieee80211_change_mtu

For the mac80211 part,

Acked-by: Johannes Berg 

Dave, I'm assuming you'll pick this up, but if you prefer not to I can
also coordinate with Kalle to take this through our trees.

johannes


Re: [PATCH] netfilter: don't permit unprivileged writes to global state via sysctls

2016-10-20 Thread Pablo Neira Ayuso
On Sat, Sep 24, 2016 at 12:21:04AM +0200, Jann Horn wrote:
> This prevents the modification of nf_conntrack_max in unprivileged network
> namespaces. For unprivileged network namespaces, ip_conntrack_max is kept
> as a readonly sysctl in order to minimize potential compatibility issues.
> 
> This patch should apply cleanly to the net tree.

For the record: This patch looks good to me, but this legacy
ip_conntrack sysctl code is now gone.

I don't know what is the procedure to get this to -stable branches now
that this cannot be pushed upstream.

> Signed-off-by: Jann Horn 
> ---
>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c 
> b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> index ae1a71a..a639e94 100644
> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> @@ -358,6 +358,9 @@ static int ipv4_init_net(struct net *net)
>   if (!in->ctl_table)
>   return -ENOMEM;
>  
> + if (net->user_ns != &init_user_ns)
> + in->ctl_table[0].mode = 0444;
> +
>   in->ctl_table[0].data = &nf_conntrack_max;
>   in->ctl_table[1].data = &net->ct.count;
>   in->ctl_table[2].data = &nf_conntrack_htable_size;
> -- 
> 2.1.4
> 


Re: [PATCH] net: fec: drop check for clk==NULL before calling clk_*

2016-10-20 Thread David Miller
From: Uwe Kleine-König 
Date: Thu, 20 Oct 2016 10:28:27 +0200

> clk_prepare, clk_enable and their counterparts (at least the common clk
> ones, but also most others) do check for the clk being NULL anyhow (and
> return 0 then), so there is no gain when the caller checks, too.
> 
> Signed-off-by: Uwe Kleine-König 

Applied to net-next.


Re: [PATCH net-next V2 1/9] liquidio CN23XX: HW config for VF support

2016-10-20 Thread David Miller
From: Raghu Vatsavayi 
Date: Wed, 19 Oct 2016 22:40:38 -0700

> +/* Default behaviour of Liquidio is to provide one queue per VF. But Liquidio
> + * can also provide multiple queues to each VF. If user wants to change the
> + * default behaviour HW should be provided configuration info at init time,
> + * based on which it will create control queues for communicating with FW.
> + */
> +static u32 max_vfs[2] = { 0, 0 };
> +module_param_array(max_vfs, int, NULL, 0444);
> +MODULE_PARM_DESC(max_vfs, "Assign two comma-separated unsigned integers that 
> specify max number of VFs for PF0 (left of the comma) and PF1 (right of the 
> comma); for 23xx only. By default HW will configure as many VFs as queues 
> after allocating PF queues.To increase queues for VF use this parameter. Use 
> sysfs to create these VFs.");
> +
> +static unsigned int num_queues_per_pf[2] = { 0, 0 };
> +module_param_array(num_queues_per_pf, uint, NULL, 0444);
> +MODULE_PARM_DESC(num_queues_per_pf, "two comma-separated unsigned integers 
> that specify number of queues per PF0 (left of the comma) and PF1 (right of 
> the comma); for 23xx only");
> +
>  static int ptp_enable = 1;

We cannot continue to allow drivers to add custom module parameters to
control this.  It is the worst user experience possible.

We need a tree-wide generic, consistent, manner in which to configure
and control this kind of thing.


RE: [PATCH net-next v2 6/9] net: use core MTU range checking in virt drivers

2016-10-20 Thread Haiyang Zhang


> -Original Message-
> From: Jarod Wilson [mailto:ja...@redhat.com]
> Sent: Thursday, October 20, 2016 1:55 PM
> To: linux-ker...@vger.kernel.org
> Cc: Jarod Wilson ; netdev@vger.kernel.org;
> virtualizat...@lists.linux-foundation.org; KY Srinivasan
> ; Haiyang Zhang ; Michael S.
> Tsirkin ; Shrikrishna Khare ; VMware,
> Inc. ; Wei Liu ; Paul
> Durrant ; David Kershner
> 
> Subject: [PATCH net-next v2 6/9] net: use core MTU range checking in
> virt drivers
> 
> hyperv_net:
> - set min/max_mtu, per Haiyang, after rndis_filter_device_add
> 
> virtio_net:
> - set min/max_mtu
> - remove virtnet_change_mtu
> 
> vmxnet3:
> - set min/max_mtu
> 
> xen-netback:
> - min_mtu = 0, max_mtu = 65517
> 
> xen-netfront:
> - min_mtu = 0, max_mtu = 65535
> 
> unisys/visor:
> - clean up defines a little to not clash with network core or add
>   redundat definitions
> 
> CC: netdev@vger.kernel.org
> CC: virtualizat...@lists.linux-foundation.org
> CC: "K. Y. Srinivasan" 
> CC: Haiyang Zhang 
> CC: "Michael S. Tsirkin" 
> CC: Shrikrishna Khare 
> CC: "VMware, Inc." 
> CC: Wei Liu 
> CC: Paul Durrant 
> CC: David Kershner 
> Signed-off-by: Jarod Wilson 
> ---

The hv_netvsc changes look fine. Thanks.

Reviewed-by: Haiyang Zhang 




Re: [PATCH net-next] hv_netvsc: fix a race between netvsc_send() and netvsc_init_buf()

2016-10-20 Thread David Miller
From: Vitaly Kuznetsov 
Date: Thu, 20 Oct 2016 10:51:04 +0200

> Stephen Hemminger  writes:
> 
>> Do we need ACCESS_ONCE() here to avoid check/use issues?
>>
> 
> I think we don't: this is the only place in the function where we read
> the variable so we'll get normal read. We're not trying to syncronize
> with netvsc_init_buf() as that would require locking, if we read stale
> NULL value after it was already updated on a different CPU we're fine,
> we'll just return -EAGAIN.

The concern is if we race with netvsc_destroy_buf() and this pointer
becomes NULL after the test you are adding.


  1   2   >