[GIT] Networking

2017-11-02 Thread David Miller

Hopefully this is the last batch of networking fixes for 4.14

Fingers crossed...

1) Fix stmmac to use the proper sized OF property read, from Bhadram
   Varka.

2) Fix use after free in net scheduler tc action code, from Cong
   Wang.

3) Fix SKB control block mangling in tcp_make_synack().

4) Use proper locking in fib_dump_info(), from Florian Westphal.

5) Fix IPG encodings in systemport driver, from Florian Fainelli.

6) Fix division by zero in NV TCP congestion control module, from
   Konstantin Khlebnikov.

7) Fix use after free in nf_reject_ipv4, from Tejaswi Tanikella.

Please pull, thanks a lot!

The following changes since commit 3a99df9a3d14cd866b5516f8cba515a3bfd554ab:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace (2017-11-01 16:04:27 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 93824c80bf47ebe087414b3a40ca0ff9aab7d1fb:

  net: systemport: Correct IPG length settings (2017-11-03 14:30:02 +0900)


Anatole Denis (1):
  netfilter: nft_set_hash: disable fast_ops for 2-len keys

Bhadram Varka (1):
  stmmac: use of_property_read_u32 instead of read_u8

Cong Wang (2):
  net_sched: acquire RTNL in tc_action_net_exit()
  net_sched: hold netns refcnt for each action

David S. Miller (2):
  Merge git://git.kernel.org/.../pablo/nf
  Merge branch 'net-sched-use-after-free'

Eric Dumazet (1):
  tcp: do not mangle skb->cb[] in tcp_make_synack()

Florian Fainelli (1):
  net: systemport: Correct IPG length settings

Florian Westphal (1):
  fib: fib_dump_info can no longer use __in_dev_get_rtnl

Jeff Barnhill (1):
  net: vrf: correct FRA_L3MDEV encode type

Konstantin Khlebnikov (1):
  tcp_nv: fix division by zero in tcpnv_acked()

Tejaswi Tanikella (1):
  netfilter: nf_reject_ipv4: Fix use-after-free in send_reset

 drivers/net/ethernet/broadcom/bcmsysport.c            | 10 ++
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 16
 drivers/net/vrf.c                                     |  2 +-
 include/linux/stmmac.h                                |  8
 include/net/act_api.h                                 |  6 +-
 net/ipv4/fib_semantics.c                              | 16 ++--
 net/ipv4/netfilter/nf_reject_ipv4.c                   |  2 ++
 net/ipv4/tcp_nv.c                                     |  2 +-
 net/ipv4/tcp_output.c                                 |  9 ++---
 net/netfilter/nft_set_hash.c                          |  1 -
 net/sched/act_api.c                                   |  4
 net/sched/act_bpf.c                                   |  2 +-
 net/sched/act_connmark.c                              |  2 +-
 net/sched/act_csum.c                                  |  2 +-
 net/sched/act_gact.c                                  |  2 +-
 net/sched/act_ife.c                                   |  2 +-
 net/sched/act_ipt.c                                   |  4 ++--
 net/sched/act_mirred.c                                |  2 +-
 net/sched/act_nat.c                                   |  2 +-
 net/sched/act_pedit.c                                 |  2 +-
 net/sched/act_police.c                                |  2 +-
 net/sched/act_sample.c                                |  2 +-
 net/sched/act_simple.c                                |  2 +-
 net/sched/act_skbedit.c                               |  2 +-
 net/sched/act_skbmod.c                                |  2 +-
 net/sched/act_tunnel_key.c                            |  2 +-
 net/sched/act_vlan.c                                  |  2 +-
 27 files changed, 60 insertions(+), 50 deletions(-)


Re: [PATCH net] tcp: do not mangle skb->cb[] in tcp_make_synack()

2017-11-02 Thread David Miller
From: Eric Dumazet 
Date: Thu, 02 Nov 2017 12:30:25 -0700

> From: Eric Dumazet 
> 
> Christoph Paasch sent a patch to address the following issue :
> 
> tcp_make_synack() is leaving some TCP private info in skb->cb[],
> then send the packet by other means than tcp_transmit_skb()
> 
> tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
> IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.
> 
> tcp_make_synack() should not use tcp_init_nondata_skb() :
> 
> tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
> queues (the ones that are only sent via tcp_transmit_skb())
> 
> This patch fixes the issue and should even save few cpu cycles ;)
> 
> Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
> Signed-off-by: Eric Dumazet 
> Reported-by: Christoph Paasch 

Applied and queued up for -stable.


Re: [PATCH net v2] net: systemport: Correct IPG length settings

2017-11-02 Thread David Miller
From: Florian Fainelli 
Date: Thu,  2 Nov 2017 16:08:40 -0700

> Due to a documentation mistake, the IPG length was set to 0x12 while it
> should have been 12 (decimal). This would affect short packet (64B
> typically) performance since the IPG was bigger than necessary.
> 
> Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
> Signed-off-by: Florian Fainelli 

Applied and queued up for -stable.
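
To make the size of the mistake concrete, here is a minimal sketch in plain C. It assumes the IPG field is expressed in byte times and uses the standard Ethernet framing overhead (8 bytes of preamble plus SFD, 12 bytes of inter-packet gap). The helper names are hypothetical and are not taken from the bcmsysport driver.

```c
#include <assert.h>

/* Hypothetical helper: extra inter-packet gap, in byte times, that a
 * programmed value adds on top of the intended 12-byte IPG. */
static unsigned int ipg_excess_bytes(unsigned int programmed)
{
	const unsigned int intended = 12;	/* decimal, as in the fix */

	assert(programmed >= intended);
	return programmed - intended;
}

/* Hypothetical helper: on-wire cost of one frame in byte times,
 * counted as preamble + SFD (8) + frame + inter-packet gap. */
static unsigned int wire_bytes(unsigned int frame_len, unsigned int ipg)
{
	return 8 + frame_len + ipg;
}
```

Under these assumptions, ipg_excess_bytes(0x12) is 6, and a 64-byte frame occupies wire_bytes(64, 12) = 84 byte times with the intended gap versus wire_bytes(64, 0x12) = 90 with the mistaken one, which is roughly a 7 percent loss for minimum-size packets at line rate.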


Re: [PATCH net] fib: fib_dump_info can no longer use __in_dev_get_rtnl

2017-11-02 Thread David Miller
From: Florian Westphal 
Date: Thu,  2 Nov 2017 16:02:20 +0100

> syzbot reported yet another regression added with DOIT_UNLOCKED.
> When nexthop is marked as dead, fib_dump_info uses __in_dev_get_rtnl():
> 
> ./include/linux/inetdevice.h:230 suspicious rcu_dereference_protected() usage!
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syz-executor2/23859:
>  #0:  (rcu_read_lock){}, at: [] inet_rtm_getroute+0xaa0/0x2d70 net/ipv4/route.c:2738
> [..]
>   lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4665
>   __in_dev_get_rtnl include/linux/inetdevice.h:230 [inline]
>   fib_dump_info+0x1136/0x13d0 net/ipv4/fib_semantics.c:1377
>   inet_rtm_getroute+0xf97/0x2d70 net/ipv4/route.c:2785
> ..
> 
> This isn't safe anymore, callers either hold RTNL mutex or rcu read lock,
> so these spots must use rcu_dereference_rtnl() or plain rcu_dereference()
> (plus unconditional rcu read lock).
> 
> This does the latter.
> 
> Fixes: 394f51abb3d04f ("ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl")
> Reported-by: syzbot 
> Signed-off-by: Florian Westphal 

Applied, thanks Florian.


Re: [PATCH 1/2] net: bridge: Convert timers to use timer_setup()

2017-11-02 Thread Allen

>> switch to using the new timer_setup() and from_timer() api's.
>>
>> Signed-off-by: Allen Pais
>
> These two patches do not apply cleanly to net-next, please respin.

Sure.


Re: [PATCH net-next] cxgb4: fix error return code in cxgb4_set_hash_filter()

2017-11-02 Thread David Miller
From: Wei Yongjun 
Date: Thu, 2 Nov 2017 11:15:07 +

> Fix to return a negative error code from the cxgb4_alloc_atid()
> error handling case instead of 0.
> 
> Fixes: 12b276fbf6e0 ("cxgb4: add support to create hash filters")
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH 1/2] [net-next] bpf: fix link error without CONFIG_NET

2017-11-02 Thread David Miller
From: Arnd Bergmann 
Date: Thu,  2 Nov 2017 12:05:51 +0100

> I ran into this link error with the latest net-next plus linux-next
> trees when networking is disabled:
> 
> kernel/bpf/verifier.o:(.rodata+0x2958): undefined reference to `tc_cls_act_analyzer_ops'
> kernel/bpf/verifier.o:(.rodata+0x2970): undefined reference to `xdp_analyzer_ops'
> 
> It seems that the code was written to deal with varying contents of
> the array, but the actual #ifdef was missing. Both tc_cls_act_analyzer_ops
> and xdp_analyzer_ops are defined in the core networking code, so adding
> a check for CONFIG_NET seems appropriate here, and I've verified this with
> many randconfig builds
> 
> Fixes: 4f9218aaf8a4 ("bpf: move knowledge about post-translation offsets out of verifier")
> Signed-off-by: Arnd Bergmann 

Applied.


Re: [PATCH 2/2] [net-next] bpf: fix out-of-bounds access warning in bpf_check

2017-11-02 Thread David Miller
From: Arnd Bergmann 
Date: Thu,  2 Nov 2017 12:05:52 +0100

> The bpf_verifier_ops array is generated dynamically and may be
> empty depending on configuration, which then causes an out
> of bounds access:
> 
> kernel/bpf/verifier.c: In function 'bpf_check':
> kernel/bpf/verifier.c:4320:29: error: array subscript is above array bounds [-Werror=array-bounds]
> 
> This adds a check to the start of the function as a workaround.
> I would assume that the function is never called in that configuration,
> so the warning is probably harmless.
> 
> Fixes: 00176a34d9e2 ("bpf: remove the verifier ops from program structure")
> Signed-off-by: Arnd Bergmann 

Applied.


Re: [PATCH] stmmac: use of_property_read_u32 instead of read_u8

2017-11-02 Thread David Miller
From: Bhadram Varka 
Date: Thu, 2 Nov 2017 12:52:13 +0530

> Numbers in DT are stored in “cells” which are 32-bits
> in size. of_property_read_u8 does not work properly
> because of endianness problem.
> 
> This causes it to always return 0 with little-endian
> architectures.
> 
> Fix it by using of_property_read_u32() OF API.
> 
> Signed-off-by: Bhadram Varka 

Applied, thank you.
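
The failure mode described in the commit message can be modelled in user space. The sketch below is a simplified model, not the kernel's OF implementation: it assumes a property written as a single cell is stored as four big-endian bytes and that a u8 read simply returns the first byte of the property. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one devicetree cell: a 32-bit value stored big-endian. */
static uint32_t cell_read_u32(const uint8_t cell[4])
{
	return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8) | (uint32_t)cell[3];
}

/* Simplified model of a u8 read: return the first byte of the property. */
static uint8_t cell_read_first_byte(const uint8_t cell[4])
{
	return cell[0];
}
```

For a property holding the value 8, the stored bytes are { 0, 0, 0, 8 }: cell_read_u32() recovers 8, while cell_read_first_byte() sees only the most significant byte and returns 0. Any value below 2^24 behaves the same way in this model.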


Re: [PATCH net-next] net: Define eth_stp_addr in linux/etherdevice.h

2017-11-02 Thread David Miller
From: Egil Hjelmeland 
Date: Thu,  2 Nov 2017 10:36:48 +0100

> The lan9303 driver defines eth_stp_addr as a synonym to
> eth_reserved_addr_base to get the STP ethernet address 01:80:c2:00:00:00.
> 
> eth_reserved_addr_base is also used to define the start of the Bridge Reserved
> ethernet address range, which happens to be the STP address.
> 
> br_dev_setup refers to eth_reserved_addr_base as a definition of the STP
> address.
> 
> Clean up by:
>  - Move the eth_stp_addr definition to linux/etherdevice.h
>  - Use eth_stp_addr instead of eth_reserved_addr_base in br_dev_setup.
> 
> Signed-off-by: Egil Hjelmeland 

Applied, thank you.


Re: [PATCH 1/2] net: bridge: Convert timers to use timer_setup()

2017-11-02 Thread David Miller
From: Allen Pais 
Date: Thu,  2 Nov 2017 10:58:50 +0530

> switch to using the new timer_setup() and from_timer() api's.
> 
> Signed-off-by: Allen Pais 

These two patches do not apply cleanly to net-next, please respin.


Re: [PATCH net-next] liquidio: bump up driver version to 1.7.0 to match newer NIC firmware

2017-11-02 Thread David Miller
From: Felix Manlunas 
Date: Wed, 1 Nov 2017 18:14:49 -0700

> Signed-off-by: Felix Manlunas 
> Acked-by: Derek Chickles 

Applied.


Re: [PATCH net 0/2] NULL pointer dereference in {ipvlan|macvlan}_port_destroy

2017-11-02 Thread David Miller
From: Girish Moodalbail 
Date: Tue, 31 Oct 2017 09:39:45 -0700

> When call to register_netdevice() (called from ipvlan_link_new())
> fails, inside that function we call ipvlan_uninit() (through
> ndo_uninit()) to destroy the ipvlan port. Upon returning
> unsuccessfully from register_netdevice() we go ahead and call
> ipvlan_port_destroy() again which causes NULL pointer dereference
> panic.

The problem is that ipvlan doesn't follow the proper convention that
->ndo_uninit() must only release resources allocated by ->ndo_init().

What needs to happen is that the port allocation occur in
->ndo_init().

Your fix, while solving some cases, does not fully cover all of the
possibilities due to this bug.

Please fix this correctly by moving the port allocation and related
setup from link creation to ->ndo_init().

Thank you.
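
The convention can be sketched with a toy user-space model. Everything below is hypothetical (toy_dev, toy_init and the other names are not ipvlan code); the only behaviour it borrows from the kernel is that register_netdevice() calls ->ndo_uninit() itself when it fails after ->ndo_init() has succeeded.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dev {
	int *port;		/* resource owned by the device */
	int fail_register;	/* force a late registration failure */
};

/* ->ndo_init() analogue: allocates the port. */
static int toy_init(struct toy_dev *dev)
{
	dev->port = malloc(sizeof(*dev->port));
	return dev->port ? 0 : -1;
}

/* ->ndo_uninit() analogue: releases only what toy_init() allocated. */
static void toy_uninit(struct toy_dev *dev)
{
	free(dev->port);
	dev->port = NULL;
}

/* register_netdevice() analogue: on a failure after init has run it
 * calls uninit itself, so the caller has nothing left to undo. */
static int toy_register(struct toy_dev *dev)
{
	int err = toy_init(dev);

	if (err)
		return err;
	if (dev->fail_register) {
		toy_uninit(dev);
		return -1;
	}
	return 0;
}

/* Link-creation analogue: no port cleanup on the error path, because
 * the port's lifetime is tied to init/uninit. */
static int toy_link_new(struct toy_dev *dev)
{
	return toy_register(dev);
}
```

With the allocation inside the init hook, a failed toy_link_new() leaves dev->port released exactly once and set to NULL, so there is no second destroy call that could dereference a stale or NULL pointer.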


Re: Possible unsafe usage of skb->cb in virtio-net

2017-11-02 Thread Willem de Bruijn
On Thu, Nov 2, 2017 at 10:01 PM, Michael S. Tsirkin  wrote:
> On Thu, Nov 02, 2017 at 11:40:36AM +, Ilya Lesokhin wrote:
>> Hi,
>> I've noticed that the virtio-net uses skb->cb.
>>
>> I don't know all the details, but my understanding is that it caused a problem
>> with the mlx5 driver
>> and was fixed here:
>> https://github.com/torvalds/linux/commit/34802a42b3528b0e18ea4517c8b23e1214a09332
>>
>> Thanks,
>> Ilya
>
> Thanks a lot for the pointer.
>
> I think this was in response to this:
> https://patchwork.ozlabs.org/patch/558324/
>
>> >
>> > + skb_push(skb, skb->data - skb_data_orig);
>> >   sq->skb[pi] = skb;
>> >
>> >   MLX5E_TX_SKB_CB(skb)->num_wqebbs = DIV_ROUND_UP(ds_cnt,
>>
>> And in the middle of this we have:
>>
>> skb_pull_inline(skb, ihs);
>>
>> This looks illegal.
>>
>> You must not modify the data pointers of any SKB that you receive for
>> sending via ->ndo_start_xmit() unless you know that absolutely you are
>> the one and only reference that exists to that SKB.
>>
>> And exactly for the case you are trying to "fix" here, you do not.  If
>> the SKB is cloned, or has an elevated users count, someone else can be
>> looking at it exactly at the same time you are messing with the data
>> pointers.
>>
>> I bet mlx4 has this bug too.
>>
>> You must fix this properly, by keeping track of an offset or similar
>> internally to your driver, rather than changing the SKB data pointers.
>
> What virtio does is this:
>
> 	can_push = vi->any_header_sg &&
> 		!((unsigned long)skb->data & (__alignof__(*hdr) - 1)) &&
> 		!skb_header_cloned(skb) && skb_headroom(skb) >= hdr_len;
> 	/* Even if we can, don't push here yet as this would skew
> 	 * csum_start offset below. */
> 	if (can_push)
> 		hdr = (struct virtio_net_hdr_mrg_rxbuf *)(skb->data - hdr_len);
> 	else
> 		hdr = skb_vnet_hdr(skb);
>
>
> This doesn't change the data pointers in a cloned skb but it does change the cb.
> Is it true that it's illegal to touch the cb in a cloned skb then?

I don't have all the context for this bug. But in general, clones do not share
the struct sk_buff, which holds the CB. So skb_push and skb_pull_inline
cannot affect the view of other clones. If an skb is shared, that's a different
story.
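
The distinction can be illustrated with a toy model in user-space C. The structures and helpers below are hypothetical stand-ins, not the kernel's struct sk_buff API: a clone gets its own header (including the 48-byte cb area and the data pointer) while the payload buffer is shared and reference counted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_buf {
	int refcnt;
	unsigned char payload[64];
};

struct toy_skb {
	unsigned char cb[48];	/* per-header scratch area */
	unsigned char *data;	/* per-header view into the payload */
	struct toy_buf *buf;	/* payload, shared between clones */
};

/* skb_clone() analogue: new header (private cb and data pointer),
 * same payload buffer with its reference count bumped. */
static struct toy_skb *toy_clone(const struct toy_skb *skb)
{
	struct toy_skb *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	memcpy(c, skb, sizeof(*c));
	c->buf->refcnt++;
	return c;
}

/* skb_pull() analogue: advances only this header's data pointer. */
static void toy_pull(struct toy_skb *skb, size_t len)
{
	skb->data += len;
}
```

In this model, pulling 14 bytes from a clone or writing to its cb leaves the original header untouched. A shared skb is the other case mentioned above: several users hold a reference to the same header, so a change to cb or the data pointers is visible to all of them.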


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-02 Thread Willem de Bruijn
On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 40x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it.
>
> The main difference between V4 and V2/V3 is that TX and RX descriptors
> are separated from packet buffers.

Cool feature. I'm looking forward to the netdev talk. Aside from the
inline comments in the patches, a few architecture questions.

Is TX support needed? Existing PACKET_TX_RING already sends out
packets without copying directly from the tx_ring. Indirection through a
descriptor ring is not helpful on TX if all packets still have to come from
a pre-registered packet pool. The patch set adds a lot of tx-only code
and is complex enough without it.

Can you use the existing PACKET_V2 format for the packet pool? The
v4 format is nearly the same as V2. Using the same version might avoid
some code duplication and simplify upgrading existing legacy code.
Instead of continuing to add new versions whose behavior is implicit,
perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.

Finally, is it necessary to define a new descriptor ring format? Same for the
packet array and frame set. The kernel already has a few, such as virtio for
the first, skb_array/ptr_ring, even linux list for the second. These containers
add a lot of new boilerplate code. If new formats are absolutely necessary,
at least we should consider making them generic (like skb_array and
ptr_ring). But I'd like to understand first why, e.g., virtio cannot be used.


Re: Regression in throughput between kvm guests over virtual bridge

2017-11-02 Thread Matthew Rosato
On 10/31/2017 03:07 AM, Wei Xu wrote:
> On Thu, Oct 26, 2017 at 01:53:12PM -0400, Matthew Rosato wrote:
>>
>>>
>>> Are you using the same binding as mentioned in previous mail sent by you? it
>>> might be caused by cpu contention between pktgen and vhost, could you please
>>> try to run pktgen from another idle cpu by adjusting the binding? 
>>
>> I don't think that's the case -- I can cause pktgen to hang in the guest
>> without any cpu binding, and with vhost disabled even.
> 
> Yes, I did a test and it also hangs in guest, before we figure it out,
> maybe you try udp with uperf with this case?
> 
> VM   -> Host
> Host -> VM
> VM   -> VM
> 

Here are averaged run numbers (Gbps throughput) across 4.12, 4.13 and
net-next with and without Jason's recent "vhost_net: conditionally
enable tx polling" applied (referred to as 'patch' below).  1 uperf
instance in each case:

uperf TCP:
          4.12    4.13    4.13+patch  net-next  net-next+patch
--------------------------------------------------------------
VM->VM    35.2    16.5    20.84       22.2      24.36
VM->Host  42.15   43.57   44.90       30.83     32.26
Host->VM  53.17   41.51   42.18       37.05     37.30

uperf UDP:
          4.12    4.13    4.13+patch  net-next  net-next+patch
--------------------------------------------------------------
VM->VM    24.93   21.63   25.09       8.86      9.62
VM->Host  40.21   38.21   39.72       8.74      9.35
Host->VM  31.26   30.18   31.25       7.2       9.26

The net is that Jason's recent patch definitely improves things across
the board at 4.13 as well as at net-next -- But the VM<->VM TCP numbers
I am observing are still lower than base 4.12.

A separate concern is why my UDP numbers look so bad on net-next (have
not bisected this yet).



Re: [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings

2017-11-02 Thread Willem de Bruijn
> +/**
> + * tp4q_enqueue_from_array - Enqueue entries from packet array to tp4 queue
> + *
> + * @a: Pointer to the packet array to enqueue from
> + * @dcnt: Max number of entries to enqueue
> + *
> + * Returns 0 for success or an errno at failure
> + **/
> +static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
> +					   u32 dcnt)
> +{
> +	struct tp4_queue *q = a->tp4q;
> +	unsigned int used_idx = q->used_idx;
> +	struct tpacket4_desc *d = a->items;
> +	int i;
> +
> +	if (q->num_free < dcnt)
> +		return -ENOSPC;
> +
> +	q->num_free -= dcnt;

perhaps annotate with a lockdep_is_held to document which lock
ensures mutual exclusion on the ring. Different for tx and rx?

> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index b39be424ec0e..190598eb3461 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -189,6 +189,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
>  #define BLOCK_O2PRIV(x)	((x)->offset_to_priv)
>  #define BLOCK_PRIV(x)		((void *)((char *)(x) + BLOCK_O2PRIV(x)))
>
> +#define RX_RING 0
> +#define TX_RING 1
> +

Not needed if using bool for tx_ring below. The test effectively already
treats it as bool: does not explicitly test these constants.

> +static void packet_clear_ring(struct sock *sk, int tx_ring)
> +{
> +	struct packet_sock *po = pkt_sk(sk);
> +	struct packet_ring_buffer *rb;
> +	union tpacket_req_u req_u;
> +
> +	rb = tx_ring ? &po->tx_ring : &po->rx_ring;


I meant here.


[PATCH net-next 4/6] net: hns3: add support for set_link_ksettings

2017-11-02 Thread Lipeng
From: Fuyun Liang 

This patch adds set_link_ksettings support for ethtool cmd.

Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
index c7b8ebd..7fe193b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
@@ -653,6 +653,16 @@ static int hns3_get_link_ksettings(struct net_device *netdev,
 	return 0;
 }
 
+static int hns3_set_link_ksettings(struct net_device *netdev,
+				   const struct ethtool_link_ksettings *cmd)
+{
+	/* Only support ksettings_set for netdev with phy attached for now */
+	if (netdev->phydev)
+		return phy_ethtool_ksettings_set(netdev->phydev, cmd);
+
+	return -EOPNOTSUPP;
+}
+
 static u32 hns3_get_rss_key_size(struct net_device *netdev)
 {
 	struct hnae3_handle *h = hns3_get_handle(netdev);
@@ -839,6 +849,7 @@ static int hns3_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
 	.get_rxfh = hns3_get_rss,
 	.set_rxfh = hns3_set_rss,
 	.get_link_ksettings = hns3_get_link_ksettings,
+	.set_link_ksettings = hns3_set_link_ksettings,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
-- 
1.9.1



[PATCH net-next 1/6] net: hns3: fix for getting autoneg in hns3_get_link_ksettings

2017-11-02 Thread Lipeng
From: Fuyun Liang 

This patch fixes a bug in ethtool's get_link_ksettings().
When a phy exists, we should get autoneg from the phy rather than from the
mac, because the value of mac.autoneg is invalid when a phy exists.

Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver)
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 .../ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c  | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
index 5cd163b..367b20c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 
 #include "hns3_enet.h"
 
@@ -571,26 +572,25 @@ static int hns3_get_link_ksettings(struct net_device *netdev,
 	u32 advertised_caps;
 	u8 media_type = HNAE3_MEDIA_TYPE_UNKNOWN;
 	u8 link_stat;
-	u8 auto_neg;
-	u8 duplex;
-	u32 speed;
 
 	if (!h->ae_algo || !h->ae_algo->ops)
 		return -EOPNOTSUPP;
 
 	/* 1.auto_neg & speed & duplex from cmd */
-	if (h->ae_algo->ops->get_ksettings_an_result) {
-		h->ae_algo->ops->get_ksettings_an_result(h, &auto_neg,
-							 &speed, &duplex);
-		cmd->base.autoneg = auto_neg;
-		cmd->base.speed = speed;
-		cmd->base.duplex = duplex;
-
-		link_stat = hns3_get_link(netdev);
-		if (!link_stat) {
-			cmd->base.speed = (u32)SPEED_UNKNOWN;
-			cmd->base.duplex = DUPLEX_UNKNOWN;
-		}
+	if (netdev->phydev)
+		phy_ethtool_ksettings_get(netdev->phydev, cmd);
+	else if (h->ae_algo->ops->get_ksettings_an_result)
+		h->ae_algo->ops->get_ksettings_an_result(h,
+							 &cmd->base.autoneg,
+							 &cmd->base.speed,
+							 &cmd->base.duplex);
+	else
+		return -EOPNOTSUPP;
+
+	link_stat = hns3_get_link(netdev);
+	if (!link_stat) {
+		cmd->base.speed = SPEED_UNKNOWN;
+		cmd->base.duplex = DUPLEX_UNKNOWN;
 	}
 
 	/* 2.media_type get from bios parameter block */
-- 
1.9.1



[PATCH net-next 0/6] net: hns3: support set_link_ksettings and for nway_reset ethtool command

2017-11-02 Thread Lipeng
This patch-set adds support for the set_link_ksettings and nway_reset
ethtool commands and fixes some related ethtool bugs.
1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
2, patch[5/6] adds support for ethtool_ops.nway_reset.
3, patch[1/6,2/6,3/6,6/6] fix some bugs in getting port information with
   the ethtool command (ethtool ethx).

Fuyun Liang (6):
  net: hns3: fix for getting autoneg in hns3_get_link_ksettings
  net: hns3: fix for getting advertised_caps in hns3_get_link_ksettings
  net: hns3: fix a bug in hns3_driv_to_eth_caps
  net: hns3: add support for set_link_ksettings
  net: hns3: add support for nway_reset
  net: hns3: fix a bug for phy supported feature initialization

 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c| 10 +++
 .../ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c  | 71 +++---
 2 files changed, 59 insertions(+), 22 deletions(-)

-- 
1.9.1



[PATCH net-next 5/6] net: hns3: add support for nway_reset

2017-11-02 Thread Lipeng
From: Fuyun Liang 

This patch adds nway_reset support for ethtool cmd.

Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c  | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
index 7fe193b..a21470c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
@@ -832,6 +832,23 @@ static int hns3_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
 	}
 }
 
+static int hns3_nway_reset(struct net_device *netdev)
+{
+	struct phy_device *phy = netdev->phydev;
+
+	if (!netif_running(netdev))
+		return 0;
+
+	/* Only support nway_reset for netdev with phy attached for now */
+	if (!phy)
+		return -EOPNOTSUPP;
+
+	if (phy->autoneg != AUTONEG_ENABLE)
+		return -EINVAL;
+
+	return genphy_restart_aneg(phy);
+}
+
 static const struct ethtool_ops hns3_ethtool_ops = {
 	.self_test = hns3_self_test,
 	.get_drvinfo = hns3_get_drvinfo,
@@ -850,6 +867,7 @@ static int hns3_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
 	.set_rxfh = hns3_set_rss,
 	.get_link_ksettings = hns3_get_link_ksettings,
 	.set_link_ksettings = hns3_set_link_ksettings,
+	.nway_reset = hns3_nway_reset,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
-- 
1.9.1



[PATCH net-next 2/6] net: hns3: fix for getting advertised_caps in hns3_get_link_ksettings

2017-11-02 Thread Lipeng
From: Fuyun Liang 

This patch fixes a bug in ethtool's get_link_ksettings().
The autoneg advertising bit is always added to advertised_caps,
whether autoneg is enabled or disabled. This patch fixes it.

Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver)
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
index 367b20c..0e10a43 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
@@ -640,6 +640,9 @@ static int hns3_get_link_ksettings(struct net_device *netdev,
 		break;
 	}
 
+	if (!cmd->base.autoneg)
+		advertised_caps &= ~HNS3_LM_AUTONEG_BIT;
+
 	/* now, map driver link modes to ethtool link modes */
 	hns3_driv_to_eth_caps(supported_caps, cmd, false);
 	hns3_driv_to_eth_caps(advertised_caps, cmd, true);
-- 
1.9.1



[PATCH net-next 6/6] net: hns3: fix a bug for phy supported feature initialization

2017-11-02 Thread Lipeng
From: Fuyun Liang 

This patch fixes a bug for phy supported feature initialization.
Currently, the value of phydev->supported is initialized by kernel.
So it includes many features that we do not support, such as
SUPPORTED_FIBRE and SUPPORTED_BNC. This patch fixes it.

Fixes: 256727d (net: hns3: Add MDIO support to HNS3 Ethernet driver for hip08 SoC)
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
index f32d719..7069e94 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
@@ -14,6 +14,13 @@
 #include "hclge_main.h"
 #include "hclge_mdio.h"
 
+#define HCLGE_PHY_SUPPORTED_FEATURES	(SUPPORTED_Autoneg | \
+					 SUPPORTED_TP | \
+					 SUPPORTED_Pause | \
+					 PHY_10BT_FEATURES | \
+					 PHY_100BT_FEATURES | \
+					 PHY_1000BT_FEATURES)
+
 enum hclge_mdio_c22_op_seq {
 	HCLGE_MDIO_C22_WRITE = 1,
 	HCLGE_MDIO_C22_READ = 2
@@ -195,6 +202,9 @@ int hclge_mac_start_phy(struct hclge_dev *hdev)
 		return ret;
 	}
 
+	phydev->supported &= HCLGE_PHY_SUPPORTED_FEATURES;
+	phydev->advertising = phydev->supported;
+
 	phy_start(phydev);
 
 	return 0;
-- 
1.9.1



[PATCH net-next 3/6] net: hns3: fix a bug in hns3_driv_to_eth_caps

2017-11-02 Thread Lipeng
From: Fuyun Liang 

The values of link_modes.advertising and link_modes.supported are reset
to zero on every iteration of the for loop in hns3_driv_to_eth_caps().
But we only want to set the specified bits in them, so the reset is
unnecessary. This patch removes it.

Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver)
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
index 0e10a43..c7b8ebd 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
@@ -359,17 +359,12 @@ static void hns3_driv_to_eth_caps(u32 caps, struct ethtool_link_ksettings *cmd,
 		if (!(caps & hns3_lm_map[i].hns3_link_mode))
 			continue;
 
-		if (is_advertised) {
-			ethtool_link_ksettings_zero_link_mode(cmd,
-							      advertising);
+		if (is_advertised)
 			__set_bit(hns3_lm_map[i].ethtool_link_mode,
 				  cmd->link_modes.advertising);
-		} else {
-			ethtool_link_ksettings_zero_link_mode(cmd,
-							      supported);
+		else
 			__set_bit(hns3_lm_map[i].ethtool_link_mode,
 				  cmd->link_modes.supported);
-		}
 	}
 }
 }
 
-- 
1.9.1
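
The effect of the stray zeroing in patch 3/6 can be reproduced with a small user-space sketch. It is not the driver code: the mapping is assumed to be the identity (bit i maps to bit i) and both function names are hypothetical, but the loop shape matches the one in the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy shape: the output mask is cleared on every matching iteration,
 * so only the last matching bit survives. */
static uint32_t map_caps_buggy(uint32_t caps)
{
	uint32_t out = 0;
	unsigned int i;

	for (i = 0; i < 32; i++) {
		if (!(caps & (UINT32_C(1) << i)))
			continue;
		out = 0;			/* stray re-initialization */
		out |= UINT32_C(1) << i;
	}
	return out;
}

/* Fixed shape: bits accumulate across iterations. */
static uint32_t map_caps_fixed(uint32_t caps)
{
	uint32_t out = 0;
	unsigned int i;

	for (i = 0; i < 32; i++) {
		if (!(caps & (UINT32_C(1) << i)))
			continue;
		out |= UINT32_C(1) << i;
	}
	return out;
}
```

With three capability bits set (0x7), the buggy variant reports only the highest one (0x4) while the fixed variant reports all three.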



Re: [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4

2017-11-02 Thread Willem de Bruijn
On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This commit adds support for zerocopy mode. Note that zerocopy mode
> requires that the network interface has been bound to the socket using
> the bind syscall, and that the corresponding netdev implements the
> AF_PACKET V4 ndos.
>
> Signed-off-by: Björn Töpel 
> ---
> +
> +static void packet_v4_disable_zerocopy(struct net_device *dev,
> +				       struct tp4_netdev_parms *zc)
> +{
> +	struct tp4_netdev_parms params;
> +
> +	params = *zc;
> +	params.command  = TP4_DISABLE;
> +
> +	(void)dev->netdev_ops->ndo_tp4_zerocopy(dev, &params);

Don't ignore error return codes.

> +static int packet_v4_zerocopy(struct sock *sk, int qp)
> +{
> +	struct packet_sock *po = pkt_sk(sk);
> +	struct socket *sock = sk->sk_socket;
> +	struct tp4_netdev_parms *zc = NULL;
> +	struct net_device *dev;
> +	bool if_up;
> +	int ret = 0;
> +
> +	/* Currently, only RAW sockets are supported.*/
> +	if (sock->type != SOCK_RAW)
> +		return -EINVAL;
> +
> +	rtnl_lock();
> +	dev = packet_cached_dev_get(po);
> +
> +	/* Socket needs to be bound to an interface. */
> +	if (!dev) {
> +		rtnl_unlock();
> +		return -EISCONN;
> +	}
> +
> +	/* The device needs to have both the NDOs implemented. */
> +	if (!(dev->netdev_ops->ndo_tp4_zerocopy &&
> +	      dev->netdev_ops->ndo_tp4_xmit)) {
> +		ret = -EOPNOTSUPP;
> +		goto out_unlock;
> +	}

Inconsistent error handling with above test.

> +
> +	if (!(po->rx_ring.pg_vec && po->tx_ring.pg_vec)) {
> +		ret = -EOPNOTSUPP;
> +		goto out_unlock;
> +	}

A ring can be unmapped later with packet_set_ring. Should that operation
fail if zerocopy is enabled? After that, it can also change version with
PACKET_VERSION.

> +
> +	if_up = dev->flags & IFF_UP;
> +	zc = rtnl_dereference(po->zc);
> +
> +	/* Disable */
> +	if (qp <= 0) {
> +		if (!zc)
> +			goto out_unlock;
> +
> +		packet_v4_disable_zerocopy(dev, zc);
> +		rcu_assign_pointer(po->zc, NULL);
> +
> +		if (if_up) {
> +			spin_lock(&po->bind_lock);
> +			register_prot_hook(sk);
> +			spin_unlock(&po->bind_lock);
> +		}

There have been a bunch of race conditions in this bind code. We need
to be very careful with adding more states to the locking, especially when
open coding in multiple locations, as this patch does. I counted at least
four bind locations. See for instance also
http://patchwork.ozlabs.org/patch/813945/


> +
> +		goto out_unlock;
> +	}
> +
> +	/* Enable */
> +	if (!zc) {
> +		zc = kzalloc(sizeof(*zc), GFP_KERNEL);
> +		if (!zc) {
> +			ret = -ENOMEM;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	if (zc->queue_pair >= 0)
> +		packet_v4_disable_zerocopy(dev, zc);

This calls disable even if zc was freshly allocated.
Should be > 0?
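
A small user-space sketch of why the >= 0 test fires on a fresh allocation. The struct and helpers are hypothetical; calloc() stands in for kzalloc(), which likewise returns zeroed memory.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_parms {
	int queue_pair;
};

/* kzalloc() analogue: the returned object is zero-filled. */
static struct toy_parms *toy_alloc(void)
{
	return calloc(1, sizeof(struct toy_parms));
}

/* Test as written in the quoted patch. */
static int needs_disable_ge(const struct toy_parms *zc)
{
	return zc->queue_pair >= 0;
}

/* Test as suggested in the review comment. */
static int needs_disable_gt(const struct toy_parms *zc)
{
	return zc->queue_pair > 0;
}
```

A freshly allocated object has queue_pair == 0, so needs_disable_ge() returns 1 and the disable path would run before anything was enabled, while needs_disable_gt() returns 0. The quoted disable branch tests qp <= 0, which suggests that valid queue pairs are positive, but that reading is an assumption.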

>  static int packet_release(struct socket *sock)
>  {
> +	struct tp4_netdev_parms *zc;
>  	struct sock *sk = sock->sk;
> +	struct net_device *dev;
>  	struct packet_sock *po;
>  	struct packet_fanout *f;
>  	struct net *net;
> @@ -3337,6 +3541,20 @@ static int packet_release(struct socket *sock)
>  	sock_prot_inuse_add(net, sk->sk_prot, -1);
>  	preempt_enable();
>
> +	rtnl_lock();
> +	zc = rtnl_dereference(po->zc);
> +	dev = packet_cached_dev_get(po);
> +	if (zc && dev)
> +		packet_v4_disable_zerocopy(dev, zc);
> +	if (dev)
> +		dev_put(dev);
> +	rtnl_unlock();
> +
> +	if (zc) {
> +		synchronize_rcu();
> +		kfree(zc);
> +	}

Please use a helper function for anything this complex.


Re: [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt

2017-11-02 Thread Willem de Bruijn
On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> Here, the PACKET_MEMREG setsockopt is implemented for the AF_PACKET
> protocol family. PACKET_MEMREG allows the user to register memory
> regions that can be used by AF_PACKET V4 as packet data buffers.
>
> Signed-off-by: Björn Töpel 
> ---
> +/*** V4 QUEUE OPERATIONS ***/
> +
> +/**
> + * tp4q_umem_new - Creates a new umem (packet buffer)
> + *
> + * @addr: The address to the umem
> + * @size: The size of the umem
> + * @frame_size: The size of each frame, between 2K and PAGE_SIZE
> + * @data_headroom: The desired data headroom before start of the packet
> + *
> + * Returns a pointer to the new umem or NULL for failure
> + **/
> +static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t size,
> +unsigned int frame_size,
> +unsigned int data_headroom)
> +{
> +   struct tp4_umem *umem;
> +   unsigned int nframes;
> +
> +   if (frame_size < TP4_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +   /* Strictly speaking we could support this, if:
> +* - huge pages, or*
> +* - using an IOMMU, or
> +* - making sure the memory area is consecutive
> +* but for now, we simply say "computer says no".
> +*/
> +   return ERR_PTR(-EINVAL);
> +   }
> +
> +   if (!is_power_of_2(frame_size))
> +   return ERR_PTR(-EINVAL);
> +
> +   if (!PAGE_ALIGNED(addr)) {
> +   /* Memory area has to be page size aligned. For
> +* simplicity, this might change.
> +*/
> +   return ERR_PTR(-EINVAL);
> +   }
> +
> +   if ((addr + size) < addr)
> +   return ERR_PTR(-EINVAL);
> +
> +   nframes = size / frame_size;
> +   if (nframes == 0)
> +   return ERR_PTR(-EINVAL);
> +
> +   data_headroom = ALIGN(data_headroom, 64);
> +
> +   if (frame_size - data_headroom - TP4_KERNEL_HEADROOM < 0)
> +   return ERR_PTR(-EINVAL);

signed comparison on unsigned int
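
To spell this out: all three operands are unsigned, so the subtraction
wraps around instead of going negative and the "< 0" branch is dead code.
A minimal sketch of the problem and one possible rearrangement follows;
KERNEL_HEADROOM is a placeholder value, not the real TP4_KERNEL_HEADROOM,
and the function names are made up for illustration.

```c
#include <stdbool.h>

/* Placeholder value; the real TP4_KERNEL_HEADROOM is defined by the patch. */
#define KERNEL_HEADROOM 256u

/* As posted: the unsigned difference wraps, so "< 0" is never true. */
static bool headroom_too_big_posted(unsigned int frame_size,
				    unsigned int data_headroom)
{
	return frame_size - data_headroom - KERNEL_HEADROOM < 0;
}

/* Rearranged so that no subtraction can wrap. */
static bool headroom_too_big_fixed(unsigned int frame_size,
				   unsigned int data_headroom)
{
	return frame_size < KERNEL_HEADROOM ||
	       data_headroom > frame_size - KERNEL_HEADROOM;
}
```

A compiler may warn that the first comparison is always false, which is
exactly the point.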


Re: [net-next v2 3/4] openvswitch: Add meter infrastructure

2017-11-02 Thread Andy Zhou
On Thu, Nov 2, 2017 at 5:07 AM, Pravin Shelar  wrote:
> On Thu, Nov 2, 2017 at 3:07 AM, Andy Zhou  wrote:
>> On Fri, Oct 20, 2017 at 8:32 PM, Pravin Shelar  wrote:
>>> On Thu, Oct 19, 2017 at 5:58 PM, Andy Zhou  wrote:

 On Thu, Oct 19, 2017 at 02:47 Pravin Shelar  wrote:
>
> On Tue, Oct 17, 2017 at 12:36 AM, Andy Zhou  wrote:
> > OVS kernel datapath so far does not support Openflow meter action.
> > This is the first stab at adding kernel datapath meter support.
> > This implementation supports only drop band type.
> >
> > Signed-off-by: Andy Zhou 
> > ---
> >  net/openvswitch/Makefile   |   1 +
> >  net/openvswitch/datapath.c |  14 +-
> >  net/openvswitch/datapath.h |   3 +
> >  net/openvswitch/meter.c| 604
> > +
> >  net/openvswitch/meter.h|  54 
> >  5 files changed, 674 insertions(+), 2 deletions(-)
> >  create mode 100644 net/openvswitch/meter.c
> >  create mode 100644 net/openvswitch/meter.h
> >
> This patch mostly looks good. I have one comment below.
>
> > +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info
> > *info)
> > +{
> > +   struct nlattr **a = info->attrs;
> > +   struct dp_meter *meter, *old_meter;
> > +   struct sk_buff *reply;
> > +   struct ovs_header *ovs_reply_header;
> > +   struct ovs_header *ovs_header = info->userhdr;
> > +   struct datapath *dp;
> > +   int err;
> > +   u32 meter_id;
> > +   bool failed;
> > +
> > +   meter = dp_meter_create(a);
> > +   if (IS_ERR_OR_NULL(meter))
> > +   return PTR_ERR(meter);
> > +
> > +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
> > + &ovs_reply_header);
> > +   if (IS_ERR(reply)) {
> > +   err = PTR_ERR(reply);
> > +   goto exit_free_meter;
> > +   }
> > +
> > +   ovs_lock();
> > +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> > +   if (!dp) {
> > +   err = -ENODEV;
> > +   goto exit_unlock;
> > +   }
> > +
> > +   if (!a[OVS_METER_ATTR_ID]) {
> > +   err = -ENODEV;
> > +   goto exit_unlock;
> > +   }
> > +
> > +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
> > +
> > +   /* Cannot fail after this. */
> > +   old_meter = lookup_meter(dp, meter_id);
> I do not see RCU read lock taken here. This is not correctness issue
> but it could cause RCU checker to spit out warning message. You could
> do same trick that is done in get_dp() to avoid this issue.

 O.K.
>
>
>
> Can you also test the code with rcu sparse check config option enabled?


 Do you mean to sparse compile with CONFIG_PROVE_LOCKING and
 CONFIG_DEBUG_OBJECTS_RCU_HEAD?
>>>
>>> You could use all following options simultaneously:
>>> CONFIG_PREEMPT
>>> CONFIG_DEBUG_PREEMPT
>>> CONFIG_DEBUG_SPINLOCK
>>> CONFIG_DEBUG_ATOMIC_SLEEP
>>> CONFIG_PROVE_RCU
>>> CONFIG_DEBUG_OBJECTS_RCU_HEAD
>>
>> Thanks, I turned on those flags but did not get any error message. Do you
>> mind sharing the RCU checker message?
>
> There would be assert failure and stack trace. so it would be pretty
> obvious in kernel log messages.
> Let me know if you do not see any stack trace while running meter
> create, delete and execute.

No I did not see them.


Re: [PATCH net-next] tcp: tcp_fragment() should not assume rtx skbs

2017-11-02 Thread Soheil Hassas Yeganeh
On Thu, Nov 2, 2017 at 9:16 PM, Neal Cardwell  wrote:
> On Thu, Nov 2, 2017 at 9:10 PM, Eric Dumazet  wrote:
>> From: Eric Dumazet 
>>
>> While stress testing MTU probing, we had crashes in list_del() that we 
>> root-caused
>> to the fact that tcp_fragment() is unconditionally inserting the freshly 
>> allocated
>> skb into tsorted_sent_queue list.
>>
>> But this list is supposed to contain skbs that were sent.
>> This was mostly harmless until MTU probing was enabled.
>>
>> Fortunately we can use the tcp_queue enum added later (but in same linux 
>> version)
>> for rtx-rb-tree to fix the bug.
>>
>> Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK 
>> recovery")
>> Signed-off-by: Eric Dumazet 
>
> Acked-by: Neal Cardwell 

Acked-by: Soheil Hassas Yeganeh 

Nice! Thank you, Eric!


Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API

2017-11-02 Thread Willem de Bruijn
>>> +/*
>>> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
>>> + * to register user memory which should be used to store the packet
>>> + * data.
>>> + *
>>> + * There are some constraints for the memory being registered:
>>> + * - The memory area has to be memory page size aligned.
>>> + * - The frame size has to be a power of 2.
>>> + * - The frame size cannot be smaller than 2048B.
>>> + * - The frame size cannot be larger than the memory page size.
>>> + *
>>> + * Corollary: The number of frames that can be stored is
>>> + * len / frame_size.
>>> + *
>>> + */
>>> +struct tpacket_memreg_req {
>>> +   unsigned long   addr;   /* Start of packet data area */
>>> +   unsigned long   len;/* Length of packet data area */
>>> +   unsigned intframe_size; /* Frame size */
>>> +   unsigned intdata_headroom;  /* Frame head room */
>>> +};
>>
>> Existing packet sockets take a tpacket_req, allocate memory and let the
>> user process mmap this. I understand that TPACKET_V4 distinguishes
>> the descriptor from packet pools, but could both use the existing structs
>> and logic (packet_mmap)? That would avoid introducing a lot of new code
>> just for granting user pages to the kernel.
>>
>
> We could certainly pass the "tpacket_memreg_req" fields as part of
> descriptor ring setup ("tpacket_req4"), but we went with having the
> memory register as a new separate setsockopt. Having it separated,
> makes it easier to compare regions at the kernel side of things. "Is
> this the same umem as another one?" If we go the path of passing the
> range at descriptor ring setup, we need to handle all kind of
> overlapping ranges to determine when a copy is needed or not, in those
> cases where the packet buffer (i.e. umem) is shared between processes.

That's not what I meant. Both descriptor rings and packet pools are
memory regions. Packet sockets already have logic to allocate regions
and make them available to userspace with mmap(). Packet v4 reuses
that logic for its descriptor rings. Can it use the same for its packet
pool? Why does the kernel map user memory, instead? That is a lot of
non-trivial new logic.
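
For reference, the existing path alluded to here looks roughly like this
from userspace: the process describes a ring with struct tpacket_req, the
kernel allocates it, and the process maps it with mmap(). This is a sketch
of the long-standing PACKET_RX_RING interface only; the helper names
ring_req_init() and rx_ring_map() are made up for illustration, and the
socket call needs CAP_NET_RAW.

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a ring request; the frame count must equal frames-per-block
 * times the number of blocks.
 */
static void ring_req_init(struct tpacket_req *req, unsigned int block_size,
			  unsigned int block_nr, unsigned int frame_size)
{
	memset(req, 0, sizeof(*req));
	req->tp_block_size = block_size;
	req->tp_block_nr = block_nr;
	req->tp_frame_size = frame_size;
	req->tp_frame_nr = (block_size / frame_size) * block_nr;
}

/* Ask the kernel to allocate an RX ring and map it into this process.
 * Returns the mapping, or MAP_FAILED on error (for example without
 * CAP_NET_RAW). On success *fd_out holds the socket.
 */
static void *rx_ring_map(const struct tpacket_req *req, int *fd_out)
{
	size_t len = (size_t)req->tp_block_size * req->tp_block_nr;
	void *ring;
	int fd;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		return MAP_FAILED;

	if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0) {
		close(fd);
		return MAP_FAILED;
	}

	ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ring == MAP_FAILED) {
		close(fd);
		return MAP_FAILED;
	}

	*fd_out = fd;
	return ring;
}
```

The memory here is owned and allocated by the kernel, which is the
opposite direction from PACKET_MEMREG, where the user hands pages in.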


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-02 Thread Yang, Yi
On Thu, Nov 02, 2017 at 05:06:47AM -0700, Pravin Shelar wrote:
> On Wed, Nov 1, 2017 at 7:50 PM, Yang, Yi  wrote:
> > On Thu, Nov 02, 2017 at 08:52:40AM +0800, Pravin Shelar wrote:
> >> On Tue, Oct 31, 2017 at 9:03 PM, Yi Yang  wrote:
> >> >
> >> > OVS master and 2.8 branch has merged NSH userspace
> >> > patch series, this patch is to enable NSH support
> >> > in kernel data path in order that OVS can support
> >> > NSH in compat mode by porting this.
> >> >
> >> > Signed-off-by: Yi Yang 
> >> > ---
> >> I have comment related to checksum, otherwise patch looks good to me.
> >
> > Pravin, thank you for your comments, the below part is incremental patch
> > for checksum, please help check it, I'll send out v16 with this after
> > you confirm.
> >
> This change looks good to me.
> I noticed couple of more issues.
> 1. Can you move the ovs_key_nsh to the union of ipv4 and ipv6?
> ipv4/ipv6/nsh key data is mutually exclusive so there is no need for
> separate space for nsh key in the ovs key.
> 2. We need to fix match_validate() with nsh check. Datapath can not
> allow any l3 or l4 match if the flow key contains nsh match and
> vice-versa. such flow key should be rejected.

Pravin, the below incremental patch should fix the issues you pointed
out, please help confirm/ack, then I'll send out v16 with all acks
from you all for merge. BTW, it has been verified in my sfc test
environment.
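
As a rough illustration of why moving the NSH key into the union helps,
here is a userspace sketch with simplified placeholder structs. The field
layouts are made up and do not match the real sw_flow_key members; only
the effect on size is of interest.

```c
#include <stdint.h>

/* Simplified placeholders; layouts are illustrative only. */
struct key_ipv4 { uint32_t src, dst; };
struct key_ipv6 { uint8_t src[16], dst[16]; };
struct key_nsh  { uint32_t path_hdr; uint32_t context[4]; };

/* NSH kept outside the union: every key pays for it. */
struct flow_key_separate {
	union {
		struct key_ipv4 ipv4;
		struct key_ipv6 ipv6;
	};
	struct key_nsh nsh;
};

/* NSH inside the union: it shares storage with ipv4/ipv6, which is
 * safe only because the three are mutually exclusive.
 */
struct flow_key_shared {
	union {
		struct key_ipv4 ipv4;
		struct key_ipv6 ipv6;
		struct key_nsh nsh;
	};
};
```

In this sketch the shared layout is no larger than the biggest member,
whereas the separate layout adds the full NSH key on top.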

diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 8eeae749..c670dd2 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -149,8 +149,8 @@ struct sw_flow_key {
} nd;
};
} ipv6;
+   struct ovs_key_nsh nsh; /* network service header */
};
-   struct ovs_key_nsh nsh; /* network service header */
struct {
/* Connection tracking fields not packed above. */
struct {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 0d7d4ae..090103c 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -178,7 +178,8 @@ static bool match_validate(const struct sw_flow_match *match,
| (1 << OVS_KEY_ATTR_ICMPV6)
| (1 << OVS_KEY_ATTR_ARP)
| (1 << OVS_KEY_ATTR_ND)
-   | (1 << OVS_KEY_ATTR_MPLS));
+   | (1 << OVS_KEY_ATTR_MPLS)
+   | (1 << OVS_KEY_ATTR_NSH));
 
/* Always allowed mask fields. */
mask_allowed |= ((1 << OVS_KEY_ATTR_TUNNEL)
@@ -287,6 +288,14 @@ static bool match_validate(const struct sw_flow_match *match,
}
}
 
+   if (match->key->eth.type == htons(ETH_P_NSH)) {
+   key_expected |= 1 << OVS_KEY_ATTR_NSH;
+   if (match->mask &&
+   match->mask->key.eth.type == htons(0xffff)) {
+   mask_allowed |= 1 << OVS_KEY_ATTR_NSH;
+   }
+   }
+
if ((key_attrs & key_expected) != key_expected) {
/* Key attributes check failed. */
OVS_NLERR(log, "Missing key (keys=%llx, expected=%llx)",


Re: [PATCH net] add support of IFF_XMIT_DST_RELEASE bit in vlan

2017-11-02 Thread Eric Dumazet
On Fri, 2017-11-03 at 01:39 +0300, Vadim Fedorenko wrote:

> Do you mean what happens with vlan device with real_dev is bonding ?
> 
> With patches:
> 1) A is added
>bond_enslave()
>  bond_compute_features()
>  -> bond_dev IFF_XMIT_DST_RELEASE is not changed (set)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>  -> vlan_dev IFF_XMIT_DST_RELEASE is not changed (still set)
> Then B is added
>bond_enslave()
>  bond_compute_features()
>  -> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>  -> vlan_dev IFF_XMIT_DST_RELEASE is changed (cleared)
> 
> 2) B is added
>bond_enslave()
>  bond_compute_features()
>  -> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>  -> vlan_dev IFF_XMIT_DST_RELEASE is changed (cleared)
> Then A is added
>bond_enslave()
>  bond_compute_features()
>  -> bond_dev IFF_XMIT_DST_RELEASE is not changed (cleared)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>  -> vlan_dev IFF_XMIT_DST_RELEASE is not changed (cleared).
> 
> Without patches:
> 1) A is added
>bond_enslave()
>  bond_compute_features()
>  -> bond_dev IFF_XMIT_DST_RELEASE is not changed (set)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>   -> vlan_dev IFF_XMIT_DST_RELEASE is not changed (cleared)
> 
> Then B is added
>bond_enslave()
>  bond_compute_features()
>   -> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>   -> vlan_dev IFF_XMIT_DST_RELEASE is not changed(cleared)
> 2) B is added
>bond_enslave()
>  bond_compute_features()
>   -> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>   -> vlan_dev IFF_XMIT_DST_RELEASE is not changed(cleared)
> Then A is added
>bond_enslave()
>  bond_compute_features()
>   -> bond_dev IFF_XMIT_DST_RELEASE is not changed (cleared)
>netdev_change_features()
>  vlan_device_event(event=NETDEV_FEAT_CHANGE)
>vlan_transfer_features()
>   -> vlan_dev IFF_XMIT_DST_RELEASE is not changed (cleared).
> 

Thanks for this investigation !

Reviewed-by: Eric Dumazet 




Re: [Patch net 0/2] net_sched: fix a use-after-free for tc actions

2017-11-02 Thread David Miller
From: Cong Wang 
Date: Wed,  1 Nov 2017 10:23:48 -0700

> This patchset fixes a use-after-free reported by Lucas
> and closes potential races too.
> 
> Please see each patch for details.
> 
> Cc: Jamal Hadi Salim 
> Cc: Jiri Pirko 
> Signed-off-by: Cong Wang 

Series applied, thanks.


Re: [patch net-next v3 0/2] net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

2017-11-02 Thread David Miller
From: Jiri Pirko 
Date: Tue, 31 Oct 2017 16:12:20 +0100

> From: Jiri Pirko 
> 
> This patchset's main patch is patch number 2. It carries the
> description and changelog. Patch 1 is just a dependency.

This no longer applies cleanly and will require a respin.

Thanks.


Re: [PATCH net-next] tcp: tcp_fragment() should not assume rtx skbs

2017-11-02 Thread Neal Cardwell
On Thu, Nov 2, 2017 at 9:10 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> While stress testing MTU probing, we had crashes in list_del() that we 
> root-caused
> to the fact that tcp_fragment() is unconditionally inserting the freshly 
> allocated
> skb into tsorted_sent_queue list.
>
> But this list is supposed to contain skbs that were sent.
> This was mostly harmless until MTU probing was enabled.
>
> Fortunately we can use the tcp_queue enum added later (but in same linux 
> version)
> for rtx-rb-tree to fix the bug.
>
> Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK 
> recovery")
> Signed-off-by: Eric Dumazet 

Acked-by: Neal Cardwell 

Thanks, Eric!

neal


Re: [PATCH v2 net-next] tcp: add tracepoint trace_tcp_retransmit_synack()

2017-11-02 Thread David Miller
From: Song Liu 
Date: Mon, 30 Oct 2017 14:41:35 -0700

> This tracepoint can be used to trace synack retransmits. It maintains
> pointer to struct request_sock.
> 
> We cannot simply reuse trace_tcp_retransmit_skb() here, because the
> sk here is the LISTEN socket. The IP addresses and ports should be
> extracted from struct request_sock.
> 
> Note that, like many other tracepoints, this patch uses IS_ENABLED
> in TP_fast_assign macro, which triggers sparse warning like:
> 
> ./include/trace/events/tcp.h:274:1: error: directive in argument list
> ./include/trace/events/tcp.h:281:1: error: directive in argument list
> 
> However, there is no good solution to avoid these warnings. To the
> best of our knowledge, these warnings are harmless.
> 
> Signed-off-by: Song Liu 
> Acked-by: Alexei Starovoitov 
> Acked-by: Martin KaFai Lau 

Applied.


Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-02 Thread Jay Vosburgh
Alex Sidorenko  wrote:
>On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>> Jarod Wilson  wrote:
>>
>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
 Jay Vosburgh  wrote:

> Alex Sidorenko  wrote:
>
>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>> platform, it is seen both in customer's environment and in HPE test lab.
>>
>> There are several bonds configured in TLB mode and miimon=100, all other
>> options are default. Slaves are connected to VirtualConnect
>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>> temporarily, but not another one (ens3f1). But what we see is
>>
>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms 
>> for interface ens3f1

In net-next, I don't see a path in the code that will lead to
 this message, as it would apparently require entering
 bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
 If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
 remain in _FAIL state.
>>>
>>> The kernel in question is laden with a fair bit of additional debug spew,
>>> as we were going back and forth, trying to isolate where things were going
>>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>>> commit++, there's a continue, which ... does what now? Doesn't it take us
>>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>>> next few lines of code that would have led to a transition to
>>> BOND_LINK_DOWN?
>>
>>  Just to confirm: your downdelay is 0, correct?
>
>Correct.
>
>>
>>  And, do you get any other log messages other than "link status
>> up again after 0 ms"?
>
>Yes, here are some messages (from an early instrumentation):
[...]
>That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues
>staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL
>
>
>>
>>  To answer your question, yes, the "if (link_state) {" block in
>> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
>> this path is nominally for the downdelay logic.  If downdelay is active
>> and the link recovers before the delay expires, the link should never
>> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
>> to return nonzero, causing in turn the bond_propose_link_state change to
>> BOND_LINK_FAIL state to be committed.  This path deliberately does not
>> set slave->new_link, as downdelay is purposely delaying the transition
>> to BOND_LINK_DOWN.
>>
>>  If downdelay is 0, the slave->link should not persist in
>> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
>> cause a transition in bond_miimon_commit.  The bond_propose_link_state
>> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
>> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
>> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
>> "if (link_state) {" block should never be entered.
>
>I totally agree with your description of transition logic, and this is why
>we were puzzled by how this can occur until we noticed NetworkManager
>messages around this time and decided to run a test without it.
>Without NM, everything works as expected. After that, adding more
>instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
>bond_miimon_inspect() but elsewhere (NetworkManager?).

I think I see the flaw in the logic.

1) bond_miimon_inspect finds link_state = 0, then makes a call
to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
returns non-zero.

2) bond_mii_monitor rtnl_trylock fails, it reschedules.

3) bond_mii_monitor runs again, and calls bond_miimon_inspect.

4) the slave's link has recovered, so link_state != 0.
slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
bond_miimon_inspect returns 0, so nothing is committed.

5) step 4 can repeat indefinitely.

6) eventually, the other slave does something that causes
commit++, making bond_mii_monitor call bond_commit_link_state and then
bond_miimon_commit.  The slave in question from steps 1-4 still has
link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends
up in BOND_LINK_FAIL state.

I think step 6 could also occur concurrently with the initial
pass through step 4 to induce the problem.
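
The sequence above can be replayed with a small userspace model. This is a
deliberately simplified sketch, not the bonding code: it tracks one slave
with downdelay == 0, and the names model_inspect() and model_commit() are
made up to mirror the inspect and commit phases described in the steps.

```c
enum link_state { LINK_UP, LINK_FAIL, LINK_DOWN, LINK_NOCHANGE };

struct slave_model {
	enum link_state link;           /* committed state */
	enum link_state link_new_state; /* last proposed state */
	enum link_state new_link;       /* transition requested for commit */
};

/* Inspect phase for one slave; returns non-zero if a commit is needed. */
static int model_inspect(struct slave_model *s, int carrier_ok)
{
	s->new_link = LINK_NOCHANGE;

	if (s->link == LINK_UP && !carrier_ok) {
		s->link_new_state = LINK_FAIL;  /* proposal, not yet applied */
		s->new_link = LINK_DOWN;
		return 1;
	}
	return 0;
}

/* Commit phase: apply the proposed state, then act on new_link. */
static void model_commit(struct slave_model *s)
{
	s->link = s->link_new_state;
	if (s->new_link == LINK_DOWN)
		s->link = LINK_DOWN;
}

/* Steps 1-6: the proposal from step 1 goes stale and is applied late. */
static enum link_state replay_stale_proposal(void)
{
	struct slave_model s = { LINK_UP, LINK_UP, LINK_NOCHANGE };

	model_inspect(&s, 0);  /* step 1; step 2: trylock fails, no commit */
	model_inspect(&s, 1);  /* steps 3-4: link is back, returns 0 */
	model_commit(&s);      /* step 6: commit triggered by another slave */
	return s.link;
}

/* Normal case: the commit follows the failing inspect immediately. */
static enum link_state replay_normal(void)
{
	struct slave_model s = { LINK_UP, LINK_UP, LINK_NOCHANGE };

	if (model_inspect(&s, 0))
		model_commit(&s);
	return s.link;
}
```

In this model the normal path ends in LINK_DOWN, while the replayed path
leaves the slave in LINK_FAIL with nothing scheduled to move it on.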

It looks like Mahesh mostly fixed this in

commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb
Author: Mahesh Bandewar 
Date:   Tue Apr 11 22:36:00 2017 -0700

bonding: handle link transition from FAIL to UP 

Re: [PATCH v2 net-next] ipv6: Implement limits on Hop-by-Hop and Destination options

2017-11-02 Thread David Miller
From: Tom Herbert 
Date: Mon, 30 Oct 2017 14:16:00 -0700

> RFC 8200 (IPv6) defines Hop-by-Hop options and Destination options
> extension headers. Both of these carry a list of TLVs which is
> only limited by the maximum length of the extension header (2048
> bytes). By the spec a host must process all the TLVs in these
> options, however these could be used as a fairly obvious
> denial of service attack. I think this could in fact be
> a significant DOS vector on the Internet, one mitigating
> factor might be that many FWs drop all packets with EH (and
> obviously this is only IPv6) so an Internet wide attack might not
> be so effective (yet!).
 ...
> This patch adds configurable limits to Destination and Hop-by-Hop
> options. There are three limits that may be set:
>   - Limit the number of options in a Hop-by-Hop or Destination options
> extension header.
>   - Limit the byte length of a Hop-by-Hop or Destination options
> extension header.
>   - Disallow unrecognized options in a Hop-by-Hop or Destination
> options extension header.
> 
> The limits are set in corresponding sysctls:
> 
>   ipv6.sysctl.max_dst_opts_cnt
>   ipv6.sysctl.max_hbh_opts_cnt
>   ipv6.sysctl.max_dst_opts_len
>   ipv6.sysctl.max_hbh_opts_len
 ...
> Signed-off-by: Tom Herbert 

Applied to net-next, let's see how this goes.

Thanks.
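
For readers unfamiliar with the TLV format, the option-count limit
described in the quoted commit message can be sketched as below. This is
an illustration under stated assumptions, not the kernel implementation:
count_tlvs() is a made-up helper, it walks only the options area (after
the two-byte extension header prefix), and it assumes that Pad1 and PadN
padding do not count against the limit.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_PAD1 0	/* single zero byte, no length field */
#define OPT_PADN 1	/* type, length, then "length" bytes of padding */

/* Walk the TLVs in an options area. Returns the number of non-padding
 * options, or -1 if the area is malformed or holds more than max_cnt
 * non-padding options.
 */
static int count_tlvs(const uint8_t *opts, size_t len, int max_cnt)
{
	size_t off = 0;
	int cnt = 0;

	while (off < len) {
		uint8_t type = opts[off];
		size_t optlen;

		if (type == OPT_PAD1) {
			off++;
			continue;
		}
		if (off + 2 > len)
			return -1;	/* no room for the length byte */

		optlen = (size_t)opts[off + 1] + 2;
		if (optlen > len - off)
			return -1;	/* option runs past the header */

		if (type != OPT_PADN && ++cnt > max_cnt)
			return -1;	/* over the configured limit */

		off += optlen;
	}
	return cnt;
}
```

Without such a cap, the cost of a packet is bounded only by the maximum
extension header length.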


[PATCH net-next] tcp: tcp_fragment() should not assume rtx skbs

2017-11-02 Thread Eric Dumazet
From: Eric Dumazet 

While stress testing MTU probing, we had crashes in list_del() that we
root-caused to the fact that tcp_fragment() is unconditionally inserting
the freshly allocated skb into tsorted_sent_queue list.

But this list is supposed to contain skbs that were sent.
This was mostly harmless until MTU probing was enabled.

Fortunately we can use the tcp_queue enum added later (but in same linux
version) for rtx-rb-tree to fix the bug.

Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK recovery")
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Soheil Hassas Yeganeh 
Cc: Alexei Starovoitov 
Cc: Priyaranjan Jha 
---
 net/ipv4/tcp_output.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a85e8a282d173983e35a2a1e3135ca2a63f1699e..36a3e7c909caacd981b84d1d8820a33d922f5a4e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1395,7 +1395,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	/* Link BUFF into the send queue. */
 	__skb_header_release(buff);
 	tcp_insert_write_queue_after(skb, buff, sk, tcp_queue);
-	list_add(&buff->tcp_tsorted_anchor, &skb->tcp_tsorted_anchor);
+	if (tcp_queue == TCP_FRAG_IN_RTX_QUEUE)
+		list_add(&buff->tcp_tsorted_anchor, &skb->tcp_tsorted_anchor);
 
 	return 0;
 }




Re: [PATCH] Net: netfilter: Moved vmalloc call to kmalloc call

2017-11-02 Thread David Miller
From: Charlie Sale 
Date: Thu,  2 Nov 2017 19:17:27 -0400

> Fixed FIXME comment in code by changing a vmalloc call
> to a kmalloc call. Thought it would be a good place to
> start for a first patch.
> 
> Signed-off-by: Charlie Sale 

Since this code you are posting doesn't even compile, we have to
assume you didn't functionally test it either.


Re: [PATCH net-next 09/12] tools: bpftool: turn err() and info() macros into functions

2017-11-02 Thread Joe Perches
On Mon, 2017-10-23 at 09:24 -0700, Jakub Kicinski wrote:
> From: Quentin Monnet 
> 
> Turn err() and info() macros into functions.
> 
> In order to avoid naming conflicts with variables in the code, rename
> them as p_err() and p_info() respectively.
> 
> The behavior of these functions is similar to the one of the macros for
> plain output. However, when JSON output is requested, these macros
> return a JSON-formatted "error" object instead of printing a message to
> stderr.
> 
> To handle error messages correctly with JSON, a modification was brought
> to their behavior nonetheless: the functions now append an end-of-line
> character at the end of the message. This way, we can remove end-of-line
> characters at the end of the argument strings, and not have them in the
> JSON output.
> 
> All error messages are formatted to hold in a single call to p_err(), in
> order to produce a single JSON field.

> Signed-off-by: Quentin Monnet 
> Acked-by: Jakub Kicinski 
[]
> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
[]
> @@ -97,4 +93,35 @@ int prog_parse_fd(int *argc, char ***argv);
>  void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes);
>  void print_hex_data_json(uint8_t *data, size_t len);
>  
> +static inline void p_err(const char *fmt, ...)
> +{
> + va_list ap;
> +
> + va_start(ap, fmt);
> + if (json_output) {
> + jsonw_start_object(json_wtr);
> + jsonw_name(json_wtr, "error");
> + jsonw_vprintf_enquote(json_wtr, fmt, ap);
> + jsonw_end_object(json_wtr);
> + } else {
> + fprintf(stderr, "Error: ");
> + vfprintf(stderr, fmt, ap);
> + fprintf(stderr, "\n");
> + }
> + va_end(ap);
> +}

inline seems very wasteful.

Why not move p_err and p_info to common.c ?

> +
> +static inline void p_info(const char *fmt, ...)
> +{
> + va_list ap;
> +
> + if (json_output)
> + return;
> +
> + va_start(ap, fmt);
> + vfprintf(stderr, fmt, ap);
> + fprintf(stderr, "\n");
> + va_end(ap);
> +}
> 


Re: [PATCH net-next] tools: bpf: handle long path in jit disasm

2017-11-02 Thread David Miller
From: "Rustad, Mark D" 
Date: Thu, 2 Nov 2017 21:19:44 +

> 
>> On Nov 2, 2017, at 1:09 AM, Prashant Bhole  
>> wrote:
>> 
>> Use PATH_MAX instead of hardcoded array size 256
>> 
>> Signed-off-by: Prashant Bhole 
 ...
>> static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
>> {
>>  int count, i, pc = 0;
>> -char tpath[256];
>> +char tpath[PATH_MAX];
> 
> Seems like such a nice thing, *but* PATH_MAX is 4096. Can things really 
> tolerate 4k on the stack here?

This is userland code, why wouldn't it be able to handle 4K on the
stack?



[PATCH resend 0/2] capability controlled user-namespaces

2017-11-02 Thread Mahesh Bandewar
From: Mahesh Bandewar 

TL;DR version
-------------
Creating a sandbox environment with namespaces is challenging
considering what these sandboxed processes can get up to, e.g.
CVE-2017-6074, CVE-2017-7184, CVE-2017-7308, just to name a few.
The current form of user-namespaces, however, can with a small change
allow us to create a sandbox environment without locking down
user-namespaces.

Detailed version
----------------

Problem
-------
User-namespaces in their current form have increased the attack surface, as
any process can acquire capabilities which are not available to it (by
default) by performing a combination of clone()/unshare()/setns() syscalls.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int ac, char **av)
{
	int sock = -1;

	printf("Attempting to open RAW socket before unshare()...\n");
	sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
	if (sock < 0) {
		perror("socket() SOCK_RAW failed");
	} else {
		printf("Successfully opened RAW-Sock before unshare().\n");
		close(sock);
		sock = -1;
	}

	/* A new user namespace grants a full capability set inside it. */
	if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
		perror("unshare() failed");
		return 1;
	}

	printf("Attempting to open RAW socket after unshare()...\n");
	sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
	if (sock < 0) {
		perror("socket() SOCK_RAW failed");
	} else {
		printf("Successfully opened RAW-Sock after unshare().\n");
		close(sock);
		sock = -1;
	}

	return 0;
}

The above example shows how easy it is to acquire the NET_RAW capability.
Once it is acquired, a process with malicious intent could take advantage
of the above-mentioned issues, or of similar ones, discovered or not. Note
that this is just an example and the problem/solution is not limited
to the NET_RAW capability *only*.

The easiest fix one can apply here is to lock-down user-namespaces which
many of the distros do (i.e. don't allow users to create user namespaces),
but unfortunately that prevents everyone from using them.

Approach
--------
Introduce a notion of 'controlled' user-namespaces. Every process on
the host is allowed to create user-namespaces (governed by the limit
imposed by per-ns sysctl) however, mark user-namespaces created by
sandboxed processes as 'controlled'. Use this 'mark' at the time of
capability check in conjunction with a global capability whitelist.
If the capability is not whitelisted, processes that belong to
controlled user-namespaces will not be granted it.

Once a user-ns is marked as 'controlled'; all its child user-
namespaces are marked as 'controlled' too.

The global whitelist is a list of capabilities governed by a sysctl,
which a (privileged) user in init-ns can modify and which applies
to all controlled user-namespaces on the host.

Marking user-namespaces controlled without modifying the whitelist is
equivalent to the current behavior. The default value of the whitelist
includes all capabilities so that compatibility is maintained. However, it
gives admins a fine-grained ability to control various capabilities
system-wide without locking down user-namespaces.

Please see individual patches in this series.

Mahesh Bandewar (2):
  capability: introduce sysctl for controlled user-ns capability
whitelist
  userns: control capabilities of some user namespaces

 Documentation/sysctl/kernel.txt | 21 +
 include/linux/capability.h  |  4 
 include/linux/user_namespace.h  | 20 
 kernel/capability.c | 52 +
 kernel/sysctl.c |  5 
 kernel/user_namespace.c |  3 +++
 security/commoncap.c|  8 +++
 7 files changed, 113 insertions(+)

-- 
2.15.0.403.gc27cc4dac6-goog



[PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-02 Thread Mahesh Bandewar
From: Mahesh Bandewar 

With this new notion of "controlled" user-namespaces, the controlled
user-namespaces are marked at the time of their creation while the
capabilities of processes that belong to them are controlled using the
global mask.

Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
that belongs to uncontrolled user-ns can create another (child) user-
namespace that is uncontrolled. Any other process (that either does
not have SYS_ADMIN or belongs to a controlled user-ns) can only
create a user-ns that is controlled.

global-capability-whitelist (controlled_userns_caps_whitelist) is used
at the capability check-time and keeps the semantics for the processes
that belong to uncontrolled user-ns as it is. Processes that belong to
controlled user-ns however are subjected to different checks-

   (a) if the capability in question is controlled and process belongs
   to controlled user-ns, then it's always denied.
   (b) if the capability in question is NOT controlled then fall back
   to the traditional check.
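
The creation rule and checks (a) and (b) can be modelled in userspace as
follows. This is only a sketch of the described semantics: struct
userns_model, userns_create() and cap_allowed() are hypothetical names,
the whitelist is modelled as a 64-bit mask rather than the kernel's
capability set type, and the "traditional check" is reduced to a boolean
passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CAP_NET_RAW 13	/* numeric value of CAP_NET_RAW */

struct userns_model {
	bool controlled;
};

/* A capability is "controlled" when its whitelist bit is clear. */
static bool cap_controlled(uint64_t whitelist, int cap)
{
	return !(whitelist & (UINT64_C(1) << cap));
}

/* Creation rule: the child is uncontrolled only if the parent is
 * uncontrolled and the creator holds SYS_ADMIN there.
 */
static struct userns_model userns_create(const struct userns_model *parent,
					 bool creator_has_sys_admin)
{
	struct userns_model child;

	child.controlled = parent->controlled || !creator_has_sys_admin;
	return child;
}

/* Capability check for a process in namespace ns. */
static bool cap_allowed(const struct userns_model *ns, uint64_t whitelist,
			int cap, bool traditional_ok)
{
	if (ns->controlled && cap_controlled(whitelist, cap))
		return false;		/* rule (a) */
	return traditional_ok;		/* rule (b) */
}
```

With a whitelist that has every bit set, cap_allowed() reduces to the
traditional result, which matches the compatibility claim for the default.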

Signed-off-by: Mahesh Bandewar 
---
 include/linux/capability.h |  1 +
 include/linux/user_namespace.h | 20 
 kernel/capability.c|  5 +
 kernel/user_namespace.c|  3 +++
 security/commoncap.c   |  8 
 5 files changed, 37 insertions(+)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 6c0b9677c03f..b8c6cac18658 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct 
user_namespace *ns);
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
 void __user *buff, size_t *lenp, loff_t *ppos);
+bool is_capability_controlled(int cap);
 
 extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
size);
 
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c18e01252346..e890fe81b47e 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -22,6 +22,7 @@ struct uid_gid_map {  /* 64 bytes -- 1 cache line */
 };
 
 #define USERNS_SETGROUPS_ALLOWED 1UL
+#define USERNS_CONTROLLED   2UL
 
 #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
 
@@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns)
__put_user_ns(ns);
 }
 
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+   return ns->flags & USERNS_CONTROLLED;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+   ns->flags |= USERNS_CONTROLLED;
+}
+
 struct seq_operations;
 extern const struct seq_operations proc_uid_seq_operations;
 extern const struct seq_operations proc_gid_seq_operations;
@@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct 
ns_common *ns)
 {
return ERR_PTR(-EPERM);
 }
+
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+   return false;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+}
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/capability.c b/kernel/capability.c
index 62dbe3350c1b..40a38cc4ff43 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct 
user_namespace *ns)
 }
 
 /* Controlled-userns capabilities routines */
+bool is_capability_controlled(int cap)
+{
+   return !cap_raised(controlled_userns_caps_whitelist, cap);
+}
+
 #ifdef CONFIG_SYSCTL
 int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
 void __user *buff, size_t *lenp, loff_t *ppos)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..f393ea5108f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct 
user_namespace *user_ns)
cred->cap_effective = CAP_FULL_SET;
cred->cap_ambient = CAP_EMPTY_SET;
cred->cap_bset = CAP_FULL_SET;
+   if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
+   is_user_ns_controlled(user_ns->parent))
+   mark_user_ns_controlled(user_ns);
 #ifdef CONFIG_KEYS
key_put(cred->request_key_auth);
cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index fc46f5b85251..89103f16ac37 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
user_namespace *targ_ns,
 {
struct user_namespace *ns = targ_ns;
 
+   /* If the capability is controlled and user-ns that process
+* belongs-to is 'controlled' then return EPERM and no need
+* to check 

[PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-02 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
takes input as capability mask expressed as two comma separated hex
u32 words. The mask, however, is stored in kernel as kernel_cap_t type.

Any capabilities that are not part of this mask will be controlled and
will not be allowed to processes in controlled user-ns.
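
As a rough illustration of the input format, the following userspace sketch parses two comma-separated hex u32 words into one 64-bit mask and tests a single capability bit. It assumes the most significant word is written first, as the kernel's bitmap string format does; parse_caps_mask() and cap_in_mask() are hypothetical helpers, not functions from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse "hhhhhhhh,llllllll" (two hex u32 words, high word first) into a
 * 64-bit mask. Returns false if the string is malformed.
 */
static bool parse_caps_mask(const char *s, uint64_t *mask)
{
    unsigned int hi, lo;
    char trailing;

    /* Exactly two words must convert; any trailing non-space text is an error. */
    if (sscanf(s, "%8x,%8x %c", &hi, &lo, &trailing) != 2)
        return false;

    *mask = ((uint64_t)hi << 32) | lo;
    return true;
}

/* True if capability number 'cap' is present in the mask. */
static bool cap_in_mask(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (mask >> cap) & 1;
}
```

For example, the string "0000003f,ffffdfff" clears only bit 13 (CAP_NET_RAW) in the low word, so that capability would be controlled while the others remain whitelisted.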

Signed-off-by: Mahesh Bandewar 
---
 Documentation/sysctl/kernel.txt | 21 ++
 include/linux/capability.h  |  3 +++
 kernel/capability.c | 47 +
 kernel/sysctl.c |  5 +
 4 files changed, 76 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c7523c..a1d39dbae847 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
 - bootloader_version[ X86 only ]
 - callhome  [ S390 only ]
 - cap_last_cap
+- controlled_userns_caps_whitelist
 - core_pattern
 - core_pipe_limit
 - core_uses_pid
@@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
 
 ==
 
+controlled_userns_caps_whitelist
+
+Capability mask that is whitelisted for "controlled" user namespaces.
+Any capability that is missing from this mask will not be allowed to
+any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
+is not part of this mask, then processes running inside any controlled
+userns's will not be allowed to perform action that needs CAP_NET_RAW
+capability. However, processes that are attached to a parent user-ns
+hierarchy that is *not* controlled and has CAP_NET_RAW can continue
+performing those actions. User-namespaces are marked "controlled" at
+the time of their creation based on the capabilities of the creator.
+A process that does not have CAP_SYS_ADMIN will create user-namespaces
+that are controlled.
+
+The value is expressed as two comma separated hex words (u32). This
+sysctl is available in init-ns and users with CAP_SYS_ADMIN in init-ns
+are allowed to make changes.
+
+==
+
 core_pattern:
 
 core_pattern is used to specify a core dumpfile pattern name.
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b52e278e4744..6c0b9677c03f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,6 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include 
+#include 
 
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
@@ -247,6 +248,8 @@ extern bool ptracer_capable(struct task_struct *tsk, struct 
user_namespace *ns);
 
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
+int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
+void __user *buff, size_t *lenp, loff_t *ppos);
 
 extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
size);
 
diff --git a/kernel/capability.c b/kernel/capability.c
index f97fe77ceb88..62dbe3350c1b 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -28,6 +28,8 @@ EXPORT_SYMBOL(__cap_empty_set);
 
 int file_caps_enabled = 1;
 
+kernel_cap_t controlled_userns_caps_whitelist = CAP_FULL_SET;
+
 static int __init file_caps_disable(char *str)
 {
file_caps_enabled = 0;
@@ -506,3 +508,48 @@ bool ptracer_capable(struct task_struct *tsk, struct 
user_namespace *ns)
rcu_read_unlock();
return (ret == 0);
 }
+
+/* Controlled-userns capabilities routines */
+#ifdef CONFIG_SYSCTL
+int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
+void __user *buff, size_t *lenp, loff_t *ppos)
+{
+   DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
+   struct ctl_table caps_table;
+   char tbuf[NAME_MAX];
+   int ret;
+
+   ret = bitmap_from_u32array(caps_bitmap, CAP_LAST_CAP,
+  controlled_userns_caps_whitelist.cap,
+  _KERNEL_CAPABILITY_U32S);
+   if (ret != CAP_LAST_CAP)
+   return -1;
+
+   scnprintf(tbuf, NAME_MAX, "%*pb", CAP_LAST_CAP, caps_bitmap);
+
+   caps_table.data = tbuf;
+   caps_table.maxlen = NAME_MAX;
+   caps_table.mode = table->mode;
+   ret = proc_dostring(&caps_table, write, buff, lenp, ppos);
+   if (ret)
+   return ret;
+   if (write) {
+   kernel_cap_t tmp;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   ret = bitmap_parse_user(buff, *lenp, caps_bitmap, CAP_LAST_CAP);
+   if (ret)
+   return ret;
+
+   ret = bitmap_to_u32array(tmp.cap, _KERNEL_CAPABILITY_U32S,
+

[Patch net-next] net_sched: check NULL in tcf_block_put()

2017-11-02 Thread Cong Wang
Callers of tcf_block_put() could pass NULL so
we can't use block->q before checking if block is
NULL or not.

tcf_block_put_ext() callers are fine, it is always
non-NULL.

Fixes: 8c4083b30e56 ("net: sched: add block bind/unbind notif. and extended 
block_get/put")
Reported-by: Dave Taht 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/cls_api.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a26c690b48ac..ad35bb4dffaa 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -340,9 +340,6 @@ void tcf_block_put_ext(struct tcf_block *block,
 {
struct tcf_chain *chain, *tmp;
 
-   if (!block)
-   return;
-
tcf_block_offload_unbind(block, q, ei);
 
list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
@@ -362,6 +359,8 @@ void tcf_block_put(struct tcf_block *block)
 {
struct tcf_block_ext_info ei = {0, };
 
+   if (!block)
+   return;
tcf_block_put_ext(block, NULL, block->q, &ei);
 }
 
-- 
2.13.0



Re: Oops with HTB on net-next

2017-11-02 Thread Cong Wang
On Thu, Nov 2, 2017 at 4:34 PM, Dave Taht  wrote:
> On Thu, Nov 2, 2017 at 11:09 AM, Cong Wang  wrote:
>> On Wed, Nov 1, 2017 at 1:17 PM, Dave Taht  wrote:
>>>
>>> That is not in net-next, and the "net" version of that one patch does
>>> not apply to net-next. The relevant thread says "... another fun merge
>>> into net-next".
>>>
>>> Please let me know when the fun is done, and I'll retest.
>>
>> -net is merged into -net-next now.
>
> retested with net-next as of commit: 2d2faaf0568b4946d9abeb4e541227b4ca259840
>
> Run after boot, with the system fairly idle, sqm-scripts works in
> setting up htb, fq_codel, filters, iptables rules, etc.
>
> If I run sqm-scripts in early boot, (run out of
> /etc/network/if-up.d/sqm) with all the other stuff going on then, it
> still fails.
>
> sqm does lots of complicated stuff in rapid succession, and I'm not
> sure how to go about reproducing this more simply than saying  grab
> those, and hand them conf files for one existing and one non-existing
> device.
>
> I'll try to make it happen at later times, and try ripping out (for
> example) the ifb setup and tc_mirred, etc, for the early boot
> scenario. Can you suggest other means of debugging?

Ah, I know what's the bug now... Sending the patch now...


Re: [PATCH] Net: netfilter: Moved vmalloc call to kmalloc call

2017-11-02 Thread Florian Westphal
Charlie Sale  wrote:
> Fixed FIXME comment in code by changing a vmalloc call
> to a kmalloc call. Thought it would be a good place to
> start for a first patch.

Please at least compile test your patches.

> - /* FIXME: don't use vmalloc() here or anywhere else -HW */
> - hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) +
> - sizeof(struct hlist_head) * size);
> +
> + hinfo = kmalloc(sizeof(*hinfo) +
> + sizeof(struct hlist_head) * size, GPT_KERNEL);

If anything this should be switched to kvmalloc, not kmalloc.

Also, hinfo cannot be free'd via vfree after this change, so you need to
adjust all free operations too.


[jkirsher/next-queue PATCH 4/5] dev: Clean-up __skb_tx_hash to match up with traffic class based configs

2017-11-02 Thread Alexander Duyck
From: Alexander Duyck 

This patch is mostly just a minor clean-up so that we avoid letting a
packet jump from one traffic class to another just based on the Rx queue.
Instead we now use that queue number as an offset within the traffic class.
Handling it this way allows us to operate more cleanly in a mixed
environment that is doing routing over multiple interfaces that may not
have the same queue configuration.

This patch includes a minor clean-up of variable declaration as well to get
things into the reverse xmas tree format.
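
The change in queue selection can be sketched in userspace C as follows. This is a simplified model of the Rx-queue path only, assuming a traffic class is described by an offset and a count as in the diff below; struct tc_txq, pick_txq_old() and pick_txq_new() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one traffic class: a contiguous range of Tx queues. */
struct tc_txq {
    uint16_t offset;    /* first Tx queue of the class */
    uint16_t count;     /* number of Tx queues in the class */
};

/* Old behaviour: wrap the Rx queue over all Tx queues, ignoring the class. */
static uint16_t pick_txq_old(uint16_t rx_queue, uint16_t num_tx_queues)
{
    uint32_t hash = rx_queue;

    assert(num_tx_queues > 0);
    while (hash >= num_tx_queues)
        hash -= num_tx_queues;
    return (uint16_t)hash;
}

/* New behaviour: wrap within the class, then add the class offset. */
static uint16_t pick_txq_new(uint16_t rx_queue, struct tc_txq tc)
{
    uint32_t hash = rx_queue;

    assert(tc.count > 0);
    while (hash >= tc.count)
        hash -= tc.count;
    return (uint16_t)(hash + tc.offset);
}
```

With a class covering queues 8 to 11, a packet received on Rx queue 5 previously went out on Tx queue 5, outside its class; in this model it now maps to queue 9.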

Signed-off-by: Alexander Duyck 
---
 net/core/dev.c |   18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 24ac9083bc13..fd51b8703277 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2573,16 +2573,9 @@ void netif_device_attach(struct net_device *dev)
 u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
  unsigned int num_tx_queues)
 {
-   u32 hash;
-   u16 qoffset = 0;
u16 qcount = num_tx_queues;
-
-   if (skb_rx_queue_recorded(skb)) {
-   hash = skb_get_rx_queue(skb);
-   while (unlikely(hash >= num_tx_queues))
-   hash -= num_tx_queues;
-   return hash;
-   }
+   u16 qoffset = 0;
+   u32 hash;
 
if (dev->num_tc) {
u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
@@ -2591,6 +2584,13 @@ u16 __skb_tx_hash(const struct net_device *dev, struct 
sk_buff *skb,
qcount = dev->tc_to_txq[tc].count;
}
 
+   if (skb_rx_queue_recorded(skb)) {
+   hash = skb_get_rx_queue(skb);
+   while (unlikely(hash >= qcount))
+   hash -= qcount;
+   return hash + qoffset;
+   }
+
return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
 }
 EXPORT_SYMBOL(__skb_tx_hash);



[jkirsher/next-queue PATCH 5/5] dev: Cap number of queues even with accel_priv

2017-11-02 Thread Alexander Duyck
From: Alexander Duyck 

With the recent fix to ixgbe we can cap the number of queues always
regardless of if accel_priv is being used or not since the actual number of
queues are being reported via real_num_tx_queues.

Signed-off-by: Alexander Duyck 
---
 net/core/dev.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index fd51b8703277..59dcc1b26ae2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3393,8 +3393,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device 
*dev,
else
queue_index = __netdev_pick_tx(dev, skb);
 
-   if (!accel_priv)
-   queue_index = netdev_cap_txqueue(dev, queue_index);
+   queue_index = netdev_cap_txqueue(dev, queue_index);
}
 
skb_set_queue_mapping(skb, queue_index);



[jkirsher/next-queue PATCH 3/5] ixgbe: Fix handling of macvlan Tx offload

2017-11-02 Thread Alexander Duyck
From: Alexander Duyck 

This update makes it so that we report the actual number of Tx queues via
real_num_tx_queues but are still restricted to RSS on only the first pool
by setting num_tc equal to 1. Doing this locks us into only having the
ability to setup XPS on the queues in that pool, and only those queues
should be used for transmitting anything other than macvlan traffic.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 69ef35d13c36..b22ec4b9d02c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6638,8 +6638,9 @@ int ixgbe_open(struct net_device *netdev)
goto err_req_irq;
 
/* Notify the stack of the actual queue counts. */
-   if (adapter->num_rx_pools > 1)
-   queues = adapter->num_rx_queues_per_pool;
+   if (adapter->num_rx_pools > 1 &&
+   adapter->num_tx_queues > IXGBE_MAX_L2A_QUEUES)
+   queues = IXGBE_MAX_L2A_QUEUES;
else
queues = adapter->num_tx_queues;
 
@@ -8901,6 +8902,18 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
if (adapter->hw.mac.type == ixgbe_mac_82598EB)
adapter->hw.fc.requested_mode = adapter->last_lfc_mode;
 
+   /* To support macvlan offload we have to use num_tc to
+* restrict the queues that can be used by the device.
+* By doing this we can avoid reporting a false number of
+* queues.
+*/
+   if (adapter->num_rx_pools > 1) {
+   u16 qpp = adapter->num_rx_queues_per_pool;
+
+   netdev_set_num_tc(dev, 1);
+   netdev_set_tc_queue(dev, 0, qpp, 0);
+   }
+
adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
 
adapter->temp_dcb_cfg.pfc_mode_enable = false;



Re: Oops with HTB on net-next

2017-11-02 Thread Dave Taht
On Thu, Nov 2, 2017 at 11:09 AM, Cong Wang  wrote:
> On Wed, Nov 1, 2017 at 1:17 PM, Dave Taht  wrote:
>>
>> That is not in net-next, and the "net" version of that one patch does
>> not apply to net-next. The relevant thread says "... another fun merge
>> into net-next".
>>
>> Please let me know when the fun is done, and I'll retest.
>
> -net is merged into -net-next now.

retested with net-next as of commit: 2d2faaf0568b4946d9abeb4e541227b4ca259840

Run after boot, with the system fairly idle, sqm-scripts works in
setting up htb, fq_codel, filters, iptables rules, etc.

If I run sqm-scripts in early boot, (run out of
/etc/network/if-up.d/sqm) with all the other stuff going on then, it
still fails.

sqm does lots of complicated stuff in rapid succession, and I'm not
sure how to go about reproducing this more simply than saying  grab
those, and hand them conf files for one existing and one non-existing
device.

I'll try to make it happen at later times, and try ripping out (for
example) the ifb setup and tc_mirred, etc, for the early boot
scenario. Can you suggest other means of debugging?

sqm-scripts repo: https://github.com/tohojo/sqm-scripts
my current sqm conf files: http://www.taht.net/~d/mysqmconfig.tgz
my current net-next kernel config:
http://www.taht.net/~d/mytcffailingkernel.config

Here's a complete dmesg of the most recent failure:

[0.00] Linux version 4.14.0-rc7-netem-4 (dave@nemesis) (gcc
version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) #1 SMP PREEMPT
Thu Nov 2 14:19:17 PDT 2017
[0.00] Command line:
BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc7-netem-4
root=UUID=ab3ceeb5-6d85-4c1c-8e0a-7e9e10949bae ro quiet splash
vt.handoff=7
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating
point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is
832 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009efff] usable
[0.00] BIOS-e820: [mem 0x0009f000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0xa7edafff] usable
[0.00] BIOS-e820: [mem 0xa7edb000-0xa81bbfff] reserved
[0.00] BIOS-e820: [mem 0xa81bc000-0xabc08fff] usable
[0.00] BIOS-e820: [mem 0xabc09000-0xabc67fff] reserved
[0.00] BIOS-e820: [mem 0xabc68000-0xabfd4fff] usable
[0.00] BIOS-e820: [mem 0xabfd5000-0xacd4efff] ACPI NVS
[0.00] BIOS-e820: [mem 0xacd4f000-0xacfa9fff] reserved
[0.00] BIOS-e820: [mem 0xacfaa000-0xacffefff] type 20
[0.00] BIOS-e820: [mem 0xacfff000-0xacff] usable
[0.00] BIOS-e820: [mem 0xad80-0xafff] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00014eff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] efi: EFI v2.40 by American Megatrends
[0.00] efi:  ESRT=0xacfa6018  ACPI=0xac71b000  ACPI
2.0=0xac71b000  SMBIOS=0xf05b0  MPS=0xfd640
[0.00] random: fast init done
[0.00] SMBIOS 2.8 present.
[0.00] DMI: Intel Corporation SharkBay Platform/WhiteTip
Mountain1 Fab2, BIOS 5.6.5 06/18/2015
[0.00] tsc: Fast TSC calibration using PIT
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x14f000 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask 7F8000 write-back
[0.00]   1 base 008000 mask 7FE000 

[jkirsher/next-queue PATCH 1/5] ixgbe: Fix interaction between SR-IOV and macvlan offload

2017-11-02 Thread Alexander Duyck
From: Alexander Duyck 

When SR-IOV was enabled the macvlan offload was configuring several filters
with the wrong pool value. This would result in the macvlan interfaces not
being able to receive traffic that had to pass over the physical interface.

To fix it wrap the pool argument in the VMDQ_P macro which will add the
necessary offset to get to the actual VMDq pool

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2d0232254a7a..69ef35d13c36 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5426,10 +5426,11 @@ static int ixgbe_fwd_ring_up(struct net_device *vdev,
goto fwd_queue_err;
 
if (is_valid_ether_addr(vdev->dev_addr))
-   ixgbe_add_mac_filter(adapter, vdev->dev_addr, accel->pool);
+   ixgbe_add_mac_filter(adapter, vdev->dev_addr,
+VMDQ_P(accel->pool));
 
ixgbe_fwd_psrtype(accel);
-   ixgbe_macvlan_set_rx_mode(vdev, accel->pool, adapter);
+   ixgbe_macvlan_set_rx_mode(vdev, VMDQ_P(accel->pool), adapter);
return err;
 fwd_queue_err:
ixgbe_fwd_ring_down(vdev, accel);



[jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN configuration for macvlan offload

2017-11-02 Thread Alexander Duyck
From: Alexander Duyck 

The fm10k driver didn't work correctly when macvlan offload was enabled.
Specifically what would occur is that we would see no unicast packets being
received. This was traced down to us not correctly configuring the default
VLAN ID for the port and defaulting to 0.

To correct this we either use the default ID provided by the switch or
simply use 1. With that we are able to pass and receive traffic without any
issues.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 81e4425f0529..1280127077de 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1490,7 +1490,7 @@ static void *fm10k_dfwd_add_station(struct net_device 
*dev,
hw->mac.ops.update_xcast_mode(hw, glort,
  FM10K_XCAST_MODE_MULTI);
fm10k_queue_mac_request(interface, glort, sdev->dev_addr,
-   0, true);
+   hw->mac.default_vid ? : 1, true);
}
 
fm10k_mbx_unlock(interface);
@@ -1530,7 +1530,7 @@ static void fm10k_dfwd_del_station(struct net_device 
*dev, void *priv)
hw->mac.ops.update_xcast_mode(hw, glort,
  FM10K_XCAST_MODE_NONE);
fm10k_queue_mac_request(interface, glort, sdev->dev_addr,
-   0, false);
+   hw->mac.default_vid ? : 1, false);
}
 
fm10k_mbx_unlock(interface);



[jkirsher/next-queue PATCH 0/5] macvlan offload fixes

2017-11-02 Thread Alexander Duyck
I'm looking at performing a refactor of the macvlan offload code. However
before I started I wanted to at least get things into a running state. The
patches in this set are needed to address a number of issues that were
preventing things from working as they were supposed to.

With these changes in place I seem to be able to receive traffic as I am
supposed to in the case of ixgbe and fm10k with the offload enabled, and I
am now transmitting to the correct Tx ring in the case of ixgbe.

The last two patches in the set are what I consider to be minor clean-ups
to address the fact that we don't want packets to somehow stray and end up
being transmitted on a queue that is supposed to be in use by a macvlan
instead of the lowerdev itself.

---

Alexander Duyck (5):
  ixgbe: Fix interaction between SR-IOV and macvlan offload
  fm10k: Fix VLAN configuration for macvlan offload
  ixgbe: Fix handling of macvlan Tx offload
  dev: Clean-up __skb_tx_hash to match up with traffic class based configs
  dev: Cap number of queues even with accel_priv


 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c |4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   22 ++
 net/core/dev.c  |   21 ++---
 3 files changed, 30 insertions(+), 17 deletions(-)

--


[PATCH] Net: netfilter: Moved vmalloc call to kmalloc call

2017-11-02 Thread Charlie Sale
Fixed FIXME comment in code by changing a vmalloc call
to a kmalloc call. Thought it would be a good place to
start for a first patch.

Signed-off-by: Charlie Sale 

---
 net/netfilter/xt_hashlimit.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 5da8746f7b88..4eab1befe03c 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -286,9 +286,9 @@ static int htable_create(struct net *net, struct 
hashlimit_cfg3 *cfg,
if (size < 16)
size = 16;
}
-   /* FIXME: don't use vmalloc() here or anywhere else -HW */
-   hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) +
-   sizeof(struct hlist_head) * size);
+
+   hinfo = kmalloc(sizeof(*hinfo) +
+   sizeof(struct hlist_head) * size, GPT_KERNEL);
if (hinfo == NULL)
return -ENOMEM;
*out_hinfo = hinfo;
-- 
2.13.6



[PATCH] mISDN: hfcpci: Convert timers to use timer_setup()

2017-11-02 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Karsten Keil 
Cc: "David S. Miller" 
Cc: Arvind Yadav 
Cc: Geliang Tang 
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook 
---
 drivers/isdn/hardware/mISDN/hfcpci.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/isdn/hardware/mISDN/hfcpci.c 
b/drivers/isdn/hardware/mISDN/hfcpci.c
index d2e401a8090e..e4ebbee863a1 100644
--- a/drivers/isdn/hardware/mISDN/hfcpci.c
+++ b/drivers/isdn/hardware/mISDN/hfcpci.c
@@ -2265,7 +2265,7 @@ static struct pci_driver hfc_driver = {
 };
 
 static int
-_hfcpci_softirq(struct device *dev, void *arg)
+_hfcpci_softirq(struct device *dev, void *unused)
 {
struct hfc_pci  *hc = dev_get_drvdata(dev);
struct bchannel *bch;
@@ -2290,9 +2290,9 @@ _hfcpci_softirq(struct device *dev, void *arg)
 }
 
 static void
-hfcpci_softirq(void *arg)
+hfcpci_softirq(struct timer_list *unused)
 {
-   WARN_ON_ONCE(driver_for_each_device(&hfc_driver.driver, NULL, arg,
+   WARN_ON_ONCE(driver_for_each_device(&hfc_driver.driver, NULL, NULL,
  _hfcpci_softirq) != 0);
 
/* if next event would be in the past ... */
@@ -2327,9 +2327,7 @@ HFC_init(void)
if (poll != HFCPCI_BTRANS_THRESHOLD) {
printk(KERN_INFO "%s: Using alternative poll value of %d\n",
   __func__, poll);
-   hfc_tl.function = (void *)hfcpci_softirq;
-   hfc_tl.data = 0;
-   init_timer(&hfc_tl);
+   timer_setup(&hfc_tl, hfcpci_softirq, 0);
hfc_tl.expires = jiffies + tics;
hfc_jiffies = hfc_tl.expires;
add_timer(&hfc_tl);
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-02 Thread Daniel Borkmann

Hi Josef,

one more issue I just noticed, see comment below:

On 11/02/2017 03:37 PM, Josef Bacik wrote:
[...]

diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? 
*/
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   kprobe_override:1; /* Do we override a kprobe? 
*/
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */

[...]

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d906775e12c1..f8f7927a9152 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4189,6 +4189,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
prog->dst_needed = 1;
if (insn->imm == BPF_FUNC_get_prandom_u32)
bpf_user_rnd_init_once();
+   if (insn->imm == BPF_FUNC_override_return)
+   prog->kprobe_override = 1;
if (insn->imm == BPF_FUNC_tail_call) {
/* If we tail call into other programs, we
 * cannot make any assumptions since they can
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9660ee65fbef..0d7fce52391d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8169,6 +8169,13 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
return -EINVAL;
}

+   /* Kprobe override only works for kprobes, not uprobes. */
+   if (prog->kprobe_override &&
+   !(event->tp_event->flags & TRACE_EVENT_FL_KPROBE)) {
+   bpf_prog_put(prog);
+   return -EINVAL;
+   }


Can we somehow avoid the prog->kprobe_override flag here completely
and also same in the perf_event_attach_bpf_prog() handler?

Reason is that it's not reliable for bailing out this way: Think of
the main program you're attaching doesn't use bpf_override_return()
helper, but it tail-calls into other BPF progs that make use of it
instead. So above check would be useless and will fail and we continue
to attach the prog for probes where it's not intended to be used.

We've had similar issues in the past e.g. c2002f983767 ("bpf: fix
checking xdp_adjust_head on tail calls") is just one of those. Thus,
can we avoid the flag altogether and handle such error case differently?


if (is_tracepoint || is_syscall_tp) {
int off = trace_event_get_offsets(event->tp_event);


Thanks,
Daniel


[PATCH net v2] net: systemport: Correct IPG length settings

2017-11-02 Thread Florian Fainelli
Due to a documentation mistake, the IPG length was set to 0x12 while it
should have been 12 (decimal). This would affect short packet (64B
typically) performance since the IPG was bigger than necessary.
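
The hex/decimal mix-up and the read-modify-write used to correct it can be sketched as below. The field mask and shift are hypothetical placeholders chosen for the example, not the real GIB_CONTROL layout; only the masking pattern mirrors the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout, for illustration only. */
#define MODEL_IPG_LEN_MASK  0x3fu
#define MODEL_IPG_LEN_SHIFT 6

/* Replace the IPG length field and leave every other bit untouched. */
static uint32_t set_ipg_len(uint32_t reg, uint32_t ipg)
{
    assert(ipg <= MODEL_IPG_LEN_MASK);
    reg &= ~(MODEL_IPG_LEN_MASK << MODEL_IPG_LEN_SHIFT);
    reg |= ipg << MODEL_IPG_LEN_SHIFT;
    return reg;
}

/* Extract the IPG length field. */
static uint32_t get_ipg_len(uint32_t reg)
{
    return (reg >> MODEL_IPG_LEN_SHIFT) & MODEL_IPG_LEN_MASK;
}
```

Writing 0x12 programs a gap of 18 rather than the intended 12, which is why short packets were affected most.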

Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
Signed-off-by: Florian Fainelli 
---
Changes in v2:

- move the IPG length setting outside of netdev_uses_dsa() branch

 drivers/net/ethernet/broadcom/bcmsysport.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 83eec9a8c275..eb441e5e2cd8 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1809,15 +1809,17 @@ static inline void bcm_sysport_mask_all_intrs(struct 
bcm_sysport_priv *priv)
 
 static inline void gib_set_pad_extension(struct bcm_sysport_priv *priv)
 {
-   u32 __maybe_unused reg;
+   u32 reg;
 
-   /* Include Broadcom tag in pad extension */
+   reg = gib_readl(priv, GIB_CONTROL);
+   /* Include Broadcom tag in pad extension and fix up IPG_LENGTH */
if (netdev_uses_dsa(priv->netdev)) {
-   reg = gib_readl(priv, GIB_CONTROL);
reg &= ~(GIB_PAD_EXTENSION_MASK << GIB_PAD_EXTENSION_SHIFT);
reg |= ENET_BRCM_TAG_LEN << GIB_PAD_EXTENSION_SHIFT;
-   gib_writel(priv, reg, GIB_CONTROL);
}
+   reg &= ~(GIB_IPG_LEN_MASK << GIB_IPG_LEN_SHIFT);
+   reg |= 12 << GIB_IPG_LEN_SHIFT;
+   gib_writel(priv, reg, GIB_CONTROL);
 }
 
 static int bcm_sysport_open(struct net_device *dev)
-- 
2.9.3



Re: [PATCH ipsec] xfrm: do unconditional template resolution before pcpu cache check

2017-11-02 Thread Paul Moore
On Thu, Nov 2, 2017 at 11:46 AM, Florian Westphal  wrote:
> Stephen Smalley says:
>  Since 4.14-rc1, the selinux-testsuite has been encountering sporadic
>  failures during testing of labeled IPSEC. git bisect pointed to
>  commit ec30d ("xfrm: add xdst pcpu cache").
>  The xdst pcpu cache is only checking that the policies are the same,
>  but does not validate that the policy, state, and flow match with respect
>  to security context labeling.
>  As a result, the wrong SA could be used and the receiver could end up
>  performing permission checking and providing SO_PEERSEC or SCM_SECURITY
>  values for the wrong security context.
>
> This fix makes it so that we always do the template resolution, and
> then checks that the found states match those in the pcpu bundle.
>
> This has the disadvantage of doing a bit more work (lookup in state hash
> table) if we can reuse the xdst entry (we only avoid xdst alloc/free)
> but we don't add a lot of extra work in case we can't reuse.
>
> xfrm_pol_dead() check is removed, reasoning is that
> xfrm_tmpl_resolve does all needed checks.
>
> Cc: Paul Moore 
> Fixes: ec30d78c14a813db39a647b6a348b428 ("xfrm: add xdst pcpu cache")
> Reported-by: Stephen Smalley 
> Tested-by: Stephen Smalley 
> Signed-off-by: Florian Westphal 
> ---
>  net/xfrm/xfrm_policy.c | 42 --
>  1 file changed, 24 insertions(+), 18 deletions(-)

This looks reasonable and seems like probably the simplest approach to
me.  I'm building a test kernel with it now, but considering the time
of day here, I probably will not be able to test it until tomorrow
morning; however it is important to note that Stephen did test this
already so please don't wait on my test results - we are likely to be
running the same tests anyway.

Acked-by: Paul Moore 

> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 8cafb3c0a4ac..a2e531bf4f97 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -1787,19 +1787,23 @@ void xfrm_policy_cache_flush(void)
> put_online_cpus();
>  }
>
> -static bool xfrm_pol_dead(struct xfrm_dst *xdst)
> +static bool xfrm_xdst_can_reuse(struct xfrm_dst *xdst,
> +   struct xfrm_state * const xfrm[],
> +   int num)
>  {
> -   unsigned int num_pols = xdst->num_pols;
> -   unsigned int pol_dead = 0, i;
> +   const struct dst_entry *dst = &xdst->u.dst;
> +   int i;
>
> -   for (i = 0; i < num_pols; i++)
> -   pol_dead |= xdst->pols[i]->walk.dead;
> +   if (xdst->num_xfrms != num)
> +   return false;
>
> -   /* Mark DST_OBSOLETE_DEAD to fail the next xfrm_dst_check() */
> -   if (pol_dead)
> -   xdst->u.dst.obsolete = DST_OBSOLETE_DEAD;
> +   for (i = 0; i < num; i++) {
> +   if (!dst || dst->xfrm != xfrm[i])
> +   return false;
> +   dst = dst->child;
> +   }
>
> -   return pol_dead;
> +   return xfrm_bundle_ok(xdst);
>  }
>
>  static struct xfrm_dst *
> @@ -1813,26 +1817,28 @@ xfrm_resolve_and_create_bundle(struct xfrm_policy 
> **pols, int num_pols,
> struct dst_entry *dst;
> int err;
>
> +   /* Try to instantiate a bundle */
> +   err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family);
> +   if (err <= 0) {
> +   if (err != 0 && err != -EAGAIN)
> +   XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
> +   return ERR_PTR(err);
> +   }
> +
> xdst = this_cpu_read(xfrm_last_dst);
> if (xdst &&
> xdst->u.dst.dev == dst_orig->dev &&
> xdst->num_pols == num_pols &&
> -   !xfrm_pol_dead(xdst) &&
> memcmp(xdst->pols, pols,
>sizeof(struct xfrm_policy *) * num_pols) == 0 &&
> -   xfrm_bundle_ok(xdst)) {
> +   xfrm_xdst_can_reuse(xdst, xfrm, err)) {
> dst_hold(&xdst->u.dst);
> +   while (err > 0)
> +   xfrm_state_put(xfrm[--err]);
> return xdst;
> }
>
> old = xdst;
> -   /* Try to instantiate a bundle */
> -   err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family);
> -   if (err <= 0) {
> -   if (err != 0 && err != -EAGAIN)
> -   XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
> -   return ERR_PTR(err);
> -   }
>
> dst = xfrm_bundle_create(pols[0], xfrm, err, fl, dst_orig);
> if (IS_ERR(dst)) {
> --
> 2.13.6
>
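
The reuse test in the patch boils down to: the cached bundle may be reused only if it holds exactly the states that template resolution just returned, in the same order. A stand-alone sketch of that comparison is below. The `ex_*` types are simplified stand-ins, not the kernel structures, and the final `xfrm_bundle_ok()` validity check from the patch is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for a state and a dst chain entry. */
struct ex_state {
	int id;
};

struct ex_dst {
	const struct ex_state *state;
	const struct ex_dst *child;
};

/* True only if the cached chain holds exactly the resolved states, in order. */
bool ex_can_reuse(const struct ex_dst *cached, int cached_num,
		  const struct ex_state *const resolved[], int num)
{
	if (cached_num != num)
		return false;

	for (int i = 0; i < num; i++) {
		if (!cached || cached->state != resolved[i])
			return false;
		cached = cached->child;
	}
	return true;
}
```

Comparing pointers rather than policies is the point: two flows can share a policy and still resolve to different states.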



-- 
paul moore
www.paul-moore.com


Re: [PATCH 6/7] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-02 Thread David Daney

On 11/02/2017 12:13 PM, Florian Fainelli wrote:

On 11/01/2017 05:36 PM, David Daney wrote:

From: Carlos Munoz 

The Cavium OCTEON cn78xx and cn73xx SoCs have network packet I/O
hardware that is significantly different from previous generations of
the family.

Add a new driver for this hardware.  The Ethernet MAC is called BGX on
these devices.  Common code for the MAC is in octeon3-bgx-port.c.
Four of these BGX MACs are grouped together and managed as a group by
octeon3-bgx-nexus.c.  Ingress packet classification is done by the PKI
unit initialized in octeon3-pki.c.  Queue management is done in the
SSO, initialized by octeon3-sso.c.  Egress is handled by the PKO,
initialized in octeon3-pko.c.

Signed-off-by: Carlos Munoz 
Signed-off-by: Steven J. Hill 
Signed-off-by: David Daney 
---



+static char *mix_port;
+module_param(mix_port, charp, 0444);
+MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX interfaces.");


Can you derive this from Device Tree /platform data configuration?


+
+static char *pki_port;
+module_param(pki_port, charp, 0444);
+MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");


Likewise


The SoC is flexible in how it is configured.  Technically the device 
tree should only be used to specify information about the physical 
configuration of the system that cannot be probed for, and this is about 
policy rather than physical wiring.  That said, we do take the default 
configuration from the device tree, but give the option here to override 
via the module command line.





+
+#define MAX_MIX_PER_NODE   2
+
+#define MAX_MIX            (MAX_NODES * MAX_MIX_PER_NODE)
+
+/**
+ * struct mix_port_lmac - Describes a lmac that connects to a mix
+ *   port. The lmac must be on the same node as
+ *   the mix.
+ * @node:  Node of the lmac.
+ * @bgx:   Bgx of the lmac.
+ * @lmac:  Lmac index.
+ */
+struct mix_port_lmac {
+   int node;
+   int bgx;
+   int lmac;
+};
+
+/* mix_ports_lmacs contains all the lmacs connected to mix ports */
+static struct mix_port_lmac mix_port_lmacs[MAX_MIX];
+
+/* pki_ports keeps track of the lmacs connected to the pki */
+static bool pki_ports[MAX_NODES][MAX_BGX_PER_NODE][MAX_LMAC_PER_BGX];
+
+/* Created platform devices get added to this list */
+static struct list_head pdev_list;
+static struct mutex pdev_list_lock;
+
+/* Created platform device use this structure to add themselves to the list */
+struct pdev_list_item {
+   struct list_head        list;
+   struct platform_device  *pdev;
+};


Don't you have a top-level platform device that you could use which
would hold this data instead of having it here?


This is the top-level platform device.




[snip]


+/* Registers are accessed via xkphys */
+#define SSO_BASE   0x16700ull
+#define SSO_ADDR(node) (SET_XKPHYS + NODE_OFFSET(node) +  \
+SSO_BASE)
+#define GRP_OFFSET(grp)        ((grp) << 16)
+#define GRP_ADDR(n, g) (SSO_ADDR(n) + GRP_OFFSET(g))
+#define SSO_GRP_AQ_CNT(n, g)   (GRP_ADDR(n, g) + 0x2700)
+
+#define MIO_PTP_BASE   0x10700ull
+#define MIO_PTP_ADDR(node) (SET_XKPHYS + NODE_OFFSET(node) +  \
+MIO_PTP_BASE)
+#define MIO_PTP_CLOCK_CFG(node)        (MIO_PTP_ADDR(node) + 0xf00)
+#define MIO_PTP_CLOCK_HI(node) (MIO_PTP_ADDR(node) + 0xf10)
+#define MIO_PTP_CLOCK_COMP(node)   (MIO_PTP_ADDR(node) + 0xf18)


I am sure this will work great on anything but MIPS64 ;)


Sarcasm duly noted.

That said, by definition it is exactly an OCTEON-III/MIPS64, and can 
never be anything else.  It is known a priori that the hardware and this 
driver will never be used anywhere else.





+
+struct octeon3_ethernet;
+
+struct octeon3_rx {
+   struct napi_struct  napi;
+   struct octeon3_ethernet *parent;
+   int rx_grp;
+   int rx_irq;
+   cpumask_t rx_affinity_hint;
+} cacheline_aligned_in_smp;
+
+struct octeon3_ethernet {
+   struct bgx_port_netdev_priv bgx_priv; /* Must be first element. */
+   struct list_head list;
+   struct net_device *netdev;
+   enum octeon3_mac_type mac_type;
+   struct octeon3_rx rx_cxt[MAX_RX_QUEUES];
+   struct ptp_clock_info ptp_info;
+   struct ptp_clock *ptp_clock;
+   struct cyclecounter cc;
+   struct timecounter tc;
+   spinlock_t ptp_lock;    /* Serialize ptp clock adjustments */
+   int num_rx_cxt;
+   int pki_aura;
+   int pknd;
+   int pko_queue;
+   int node;
+   int interface;
+   int index;
+   int rx_buf_count;
+   int tx_complete_grp;
+   int rx_timestamp_hw:1;
+   

Re: [PATCH net-next 1/1] net sched qdisc: pass netlink message flags in event notification

2017-11-02 Thread Roman Mashak
Cong Wang  writes:

> On Mon, Oct 30, 2017 at 2:17 PM, Roman Mashak  wrote:
>> Cong Wang  writes:
>>
>>> On Mon, Oct 30, 2017 at 11:07 AM, Roman Mashak  wrote:
>>>> Cong Wang  writes:
>>>>
>>>>> On Sat, Oct 28, 2017 at 8:36 PM, Roman Mashak  wrote:
>>>>>> Cong Wang  writes:
>>>>>
>>>>> Hmm, I thought you use RTM_NEWQDISC+RTM_DELQDISC to
>>>>> determine it is replacement, no?
>>>>
>>>> Create is RTM_NEWQDISC and NLM_F_EXCL|NLM_F_CREATE, replacement is
>>>> RTM_NEWQDISC and NLM_F_REPLACE in netlink flags.
>>>
>>> Is there any reason we can't use RTM_NEWQDISC+RTM_DELQDISC
>>> rather than NLM_F_REPLACE to determine it is replacement?
>>>
>>
>> I'm not sure this would be valid semantics for replace operation, look at
>> the rfc3549:
>>
>> Additional flag bits for NEW requests
>>   NLM_F_REPLACE   Replace existing matching config object with
>>   this request.
>>
>
> I am not saying NLM_F_REPLACE is not correct, I am saying the
> RTM_NEWQDISC+RTM_DELQDISC in a same message probably
> exists for a reason.
>
>
>>> Note, RTM_NEWQDISC+RTM_DELQDISC are put in a same
>>> message not two.
>>
>> Hmm, could you clarify how do you expect to put two event IDs in nlmsg_type?
>
> Looking at qdisc_notify(), it is essentially two skb_put() on a same
> skb, right? So two nlmsghdr in one skb? Or I read it wrong?

So there will be two netlink messages in a single skb, and the user
receives two events. But apparently this only happens when a new
_egress_ qdisc is being added and the default egress qdisc is deleted.
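
To make "two netlink messages in a single skb" concrete, here is a small sketch of how a receiver walks several length-prefixed messages packed back to back in one buffer. `struct ex_nlmsghdr` is a simplified stand-in for `struct nlmsghdr`, and `ex_walk()` is a hypothetical helper; real user-space code would normally use the `NLMSG_OK()` and `NLMSG_NEXT()` macros from `<linux/netlink.h>` instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct nlmsghdr. */
struct ex_nlmsghdr {
	uint32_t len;	/* length of this message, header included */
	uint16_t type;
	uint16_t flags;
	uint32_t seq;
	uint32_t pid;
};

/* Messages are padded to a 4-byte boundary. */
#define EX_ALIGN(n) (((n) + 3u) & ~3u)

/* Count the messages in buf and record up to max message types. */
int ex_walk(const unsigned char *buf, size_t len, uint16_t *types, int max)
{
	int n = 0;

	while (len >= sizeof(struct ex_nlmsghdr)) {
		struct ex_nlmsghdr h;

		memcpy(&h, buf, sizeof(h));
		if (h.len < sizeof(h) || h.len > len)
			break;		/* truncated or corrupt message */
		if (n < max)
			types[n] = h.type;
		n++;
		if (EX_ALIGN(h.len) >= len)
			break;		/* that was the last message */
		buf += EX_ALIGN(h.len);
		len -= EX_ALIGN(h.len);
	}
	return n;
}
```

A listener built this way sees each header as its own event, which is why the delete and the add show up as two notifications.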


Re: [PATCH net] add support of IFF_XMIT_DST_RELEASE bit in vlan

2017-11-02 Thread Vadim Fedorenko



On 02.11.2017 19:25, Eric Dumazet wrote:

On Thu, 2017-11-02 at 17:47 +0300, Vadim Fedorenko wrote:

On Thu, 2017-11-02 at 07:33 -0700, Eric Dumazet wrote:

On Thu, 2017-11-02 at 15:49 +0300, Vadim Fedorenko wrote:

Some time ago Eric Dumazet suggested a "hack the IFF_XMIT_DST_RELEASE
flag on the vlan netdev". But the last comment was "does not support
properly bonding/team.(If the real_dev->privflags IFF_XMIT_DST_RELEASE
bit changes, we want to update all the vlans at the same time )"

I've extended that patch to support changes of IFF_XMIT_DST_RELEASE in
bonding/team.
Both bonding and team call netdev_change_features() after recalculation
of features including priv_flags IFF_XMIT_DST_RELEASE bit. So the only
thing needed to support is to recheck this bit in
vlan_transfer_features().

Suggested-by: Eric Dumazet 
Signed-off-by: Vadim Fedorenko 
---
  net/8021q/vlan.c | 3 +++
  net/8021q/vlan_netlink.c | 1 +
  2 files changed, 4 insertions(+)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 9649579..510986c 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -328,6 +328,9 @@ static void vlan_transfer_features(struct net_device *dev,
vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
  #endif
  
+	vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;

+   vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
+
netdev_update_features(vlandev);
  }
  
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c

index 5e831de..9472de8 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct 
net_device *dev,
vlan->vlan_proto = proto;
vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
vlan->real_dev= real_dev;
+   dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
vlan->flags   = VLAN_FLAG_REORDER_HDR;
  
  	err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);


What happens if a bonding is composed of two slaves, with different
IFF_XMIT_DST_RELEASE settings ?


According to bond_compute_features(), the IFF_XMIT_DST_RELEASE flag is cleared
from priv_flags in such a case, so it is also cleared in
vlan_transfer_features() and the vlan device works as it did before the patch.
The same code exists in the team netdev.
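
A small model of the behaviour described here may help. It assumes, as a simplification of the description above, that the aggregate device keeps the flag only when every slave has it, and that the vlan device then copies that single bit from its lower device. The bit value and all `ex_*` names are hypothetical.

```c
#include <stdint.h>

#define EX_XMIT_DST_RELEASE 0x1u	/* hypothetical bit value */

/* Simplified: the aggregate keeps the flag only if every slave has it. */
uint32_t ex_bond_flags(const uint32_t *slave_flags, int n)
{
	uint32_t f = EX_XMIT_DST_RELEASE;

	for (int i = 0; i < n; i++)
		f &= slave_flags[i];
	return f;
}

/* Copy just this one bit from the lower device; other bits are untouched. */
uint32_t ex_vlan_transfer(uint32_t vlan_flags, uint32_t real_flags)
{
	vlan_flags &= ~EX_XMIT_DST_RELEASE;
	vlan_flags |= real_flags & EX_XMIT_DST_RELEASE;
	return vlan_flags;
}
```

Under this model the result does not depend on the order in which the slaves are added, because the AND over the slave set is order independent.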


I repeat the question.

What happens if a bonding is composed of two slaves A & B,
A has IFF_XMIT_DST_RELEASE set
B has IFF_XMIT_DST_RELEASE cleared.

1) A is added , then B is added.
2) B is added to the bond, then A is added.
3) Before, and after your patches.

Thanks.



Do you mean what happens to a vlan device whose real_dev is a bonding device?

With patches:
1) A is added
  bond_enslave()
bond_compute_features()
-> bond_dev IFF_XMIT_DST_RELEASE is not changed (set)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
-> vlan_dev IFF_XMIT_DST_RELEASE is not changed (still set)
Then B is added
  bond_enslave()
bond_compute_features()
-> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
-> vlan_dev IFF_XMIT_DST_RELEASE is changed (cleared)

2) B is added
  bond_enslave()
bond_compute_features()
-> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
-> vlan_dev IFF_XMIT_DST_RELEASE is changed (cleared)
Then A is added
  bond_enslave()
bond_compute_features()
-> bond_dev IFF_XMIT_DST_RELEASE is not changed (cleared)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
-> vlan_dev IFF_XMIT_DST_RELEASE is not changed (cleared).

Without patches:
1) A is added
  bond_enslave()
bond_compute_features()
-> bond_dev IFF_XMIT_DST_RELEASE is not changed (set)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
 -> vlan_dev IFF_XMIT_DST_RELEASE is not changed (cleared)

Then B is added
  bond_enslave()
bond_compute_features()
 -> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
 -> vlan_dev IFF_XMIT_DST_RELEASE is not changed(cleared)
2) B is added
  bond_enslave()
bond_compute_features()
 -> bond_dev IFF_XMIT_DST_RELEASE is changed (cleared)
  netdev_change_features()
vlan_device_event(event=NETDEV_FEAT_CHANGE)
  vlan_transfer_features()
 -> vlan_dev IFF_XMIT_DST_RELEASE is not changed(cleared)
Then A is added
  

Re: [RFC PATCH] xfrm: fix regression introduced by xdst pcpu cache

2017-11-02 Thread Paul Moore
On Thu, Nov 2, 2017 at 8:58 AM, Stephen Smalley  wrote:
> On Wed, 2017-11-01 at 17:39 -0400, Paul Moore wrote:
>> On Tue, Oct 31, 2017 at 7:08 PM, Florian Westphal 
>> wrote:
>> > Paul Moore  wrote:
>> > > On Mon, Oct 30, 2017 at 10:58 AM, Stephen Smalley wrote:
>> > > > matching before (as in this patch) or after calling
>> > > > xfrm_bundle_ok()?
>> > >
>> > > I would probably make the LSM call the last check, as you've
>> > > done; but
>> > > I have to say that is just so it is consistent with the "LSM
>> > > last"
>> > > philosophy and not because of any performance related argument.
>> > >
>> > > > ... Also,
>> > > > do we need to test xfrm->sel.family before calling
>> > > > xfrm_selector_match
>> > > > (as in this patch) or not - xfrm_state_look_at() does so when
>> > > > the
>> > > > state is XFRM_STATE_VALID but not when it is _ERROR or
>> > > > _EXPIRED?
>> > >
>> > > Speaking purely from a SELinux perspective, I'm not sure it
>> > > matters:
>> > > as long as the labels match we are happy.  However, from a
>> > > general
>> > > IPsec perspective it does seem like a reasonable thing.
>> > >
>> > > Granted I'm probably missing something, but it seems a little odd
>> > > that
>> > > the code isn't already checking that the selectors match (...
>> > > what am
>> > > I missing?).  It does check the policies, maybe that is enough in
>> > > the
>> > > normal IPsec case?
>> >
>> > The assumption was that identical policies would yield the same
>> > SAs,
>> > but that's not correct.
>>
>> Well, to be fair, I think the assumption is valid for normal,
>> unlabeled IPsec.  The problem comes when SELinux starts labeling SAs
>> and now you have multiple SAs for a given policy, each differing only
>> in the SELinux/LSM label.
>
> No, it is invalid for normal, unlabeled IPSEC too, in the case where
> one has defined xfrm state selectors.  That's what my other testsuite
> patch (which is presently only on the xfrmselectortest branch) is
> exercising - matching of xfrm state selectors.  But in any event,
> Florian's patch fixes both, so I'm fine with it.  I don't know though
> how it compares performance-wise with walking the bundle and just
> calling security_xfrm_state_pol_flow_match() and xfrm_selector_match()
> on each one.
>
>> Considering that adding the SELinux/LSM label effectively adds an
>> additional selector, I'm wondering if we should simply add the
>> SELinux/LSM label matching to xfrm_selector_match()?  Looking quickly
>> at the code it seems as though we always follow xfrm_selector_match()
>> with a LSM check anyway, the one exception being in
>> __xfrm_policy_check() ... which *might* be a valid exception, as we
>> don't do our access checks for inbound traffic at that point in the
>> stack.
>
> Possibly, but that should probably be a separate patch. We should just
> fix this regression for 4.14, either via Florian's patch or by
> augmenting my patch to perform the matching calls on all of the xfrms.

I agree that v4.14 should get the smallest patch possible that fixes
the problem.  I was just looking at the patches presented so far and
thinking out loud.

-- 
paul moore
www.paul-moore.com


Re: [PATCH 2/2] [net-next] bpf: fix out-of-bounds access warning in bpf_check

2017-11-02 Thread Daniel Borkmann

On 11/02/2017 12:05 PM, Arnd Bergmann wrote:

The bpf_verifier_ops array is generated dynamically and may be
empty depending on configuration, which then causes an out
of bounds access:

kernel/bpf/verifier.c: In function 'bpf_check':
kernel/bpf/verifier.c:4320:29: error: array subscript is above array bounds 
[-Werror=array-bounds]

This adds a check to the start of the function as a workaround.
I would assume that the function is never called in that configuration,
so the warning is probably harmless.

Fixes: 00176a34d9e2 ("bpf: remove the verifier ops from program structure")
Signed-off-by: Arnd Bergmann 


Acked-by: Daniel Borkmann 

LGTM, and bpf_analyzer() already has proper logic to bail out for
such cases (although only used by nfp right now, which is there
when NET is configured anyway).
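
As an illustration of the kind of guard being discussed, the sketch below validates an index against the compiled size of a table before dereferencing it. This is not the code from the patch: `ex_table`, `EX_TABLE_LEN` and `ex_lookup()` are hypothetical names, and the table is kept non-empty because standard C does not allow zero-length arrays.

```c
#include <stddef.h>

struct ex_ops {
	int id;
};

static const struct ex_ops ex_a = { 1 };

/* Table whose size is decided at build time. */
static const struct ex_ops *const ex_table[] = { &ex_a };
#define EX_TABLE_LEN (sizeof(ex_table) / sizeof(ex_table[0]))

/* Bail out before indexing if the type cannot be in the table. */
const struct ex_ops *ex_lookup(size_t type)
{
	if (type >= EX_TABLE_LEN || !ex_table[type])
		return NULL;
	return ex_table[type];
}
```

With the size check up front, the compiler can also see that every subscript is in range, which is what silences an array-bounds diagnostic.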


Re: [PATCH 1/2] [net-next] bpf: fix link error without CONFIG_NET

2017-11-02 Thread Daniel Borkmann

On 11/02/2017 12:05 PM, Arnd Bergmann wrote:

I ran into this link error with the latest net-next plus linux-next
trees when networking is disabled:

kernel/bpf/verifier.o:(.rodata+0x2958): undefined reference to 
`tc_cls_act_analyzer_ops'
kernel/bpf/verifier.o:(.rodata+0x2970): undefined reference to 
`xdp_analyzer_ops'

It seems that the code was written to deal with varying contents of
the array, but the actual #ifdef was missing. Both tc_cls_act_analyzer_ops
and xdp_analyzer_ops are defined in the core networking code, so adding
a check for CONFIG_NET seems appropriate here, and I've verified this with
many randconfig builds.

Fixes: 4f9218aaf8a4 ("bpf: move knowledge about post-translation offsets out of 
verifier")
Signed-off-by: Arnd Bergmann 


Acked-by: Daniel Borkmann 
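
The shape of this fix can be sketched as a table whose networking entries only exist when the networking code is built. Everything here is illustrative: `EX_CONFIG_NET` stands in for the real config symbol, and the `ex_*` objects are placeholders for the analyzer ops named in the link error.

```c
#include <stddef.h>

enum ex_type { EX_TYPE_TRACE, EX_TYPE_CLS, EX_TYPE_XDP, EX_TYPE_MAX };

struct ex_ops {
	const char *name;
};

static const struct ex_ops ex_trace_ops = { "trace" };
#ifdef EX_CONFIG_NET
/* Only defined when the networking code is part of the build. */
static const struct ex_ops ex_cls_ops = { "cls" };
static const struct ex_ops ex_xdp_ops = { "xdp" };
#endif

/* Slots without an initializer stay NULL. */
static const struct ex_ops *const ex_analyzer_ops[EX_TYPE_MAX] = {
	[EX_TYPE_TRACE] = &ex_trace_ops,
#ifdef EX_CONFIG_NET
	[EX_TYPE_CLS] = &ex_cls_ops,
	[EX_TYPE_XDP] = &ex_xdp_ops,
#endif
};

const struct ex_ops *ex_get_ops(enum ex_type t)
{
	return (unsigned int)t < EX_TYPE_MAX ? ex_analyzer_ops[t] : NULL;
}
```

Without the `#ifdef` around the table entries, the references would remain while the definitions disappear, which is exactly an undefined-reference link error.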


Re: [PATCH net-next v2] bpf: fix verifier NULL pointer dereference

2017-11-02 Thread Daniel Borkmann

On 11/02/2017 04:18 PM, Craig Gallek wrote:

From: Craig Gallek 

do_check() can fail early without allocating env->cur_state under
memory pressure.  Syzkaller found the stack below on the linux-next
tree because of this.

   kasan: CONFIG_KASAN_INLINE enabled
   kasan: GPF could be caused by NULL-ptr deref or user memory access
   general protection fault:  [#1] SMP KASAN
   Dumping ftrace buffer:
  (ftrace buffer empty)
   Modules linked in:
   CPU: 1 PID: 27062 Comm: syz-executor5 Not tainted 4.14.0-rc7+ #106
   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 01/01/2011
   task: 8801c2c74700 task.stack: 8801c3e28000
   RIP: 0010:free_verifier_state kernel/bpf/verifier.c:347 [inline]
   RIP: 0010:bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533
   RSP: 0018:8801c3e2f5c8 EFLAGS: 00010202
   RAX: dc00 RBX: fff4 RCX: 
   RDX: 0070 RSI: 817d5aa9 RDI: 0380
   RBP: 8801c3e2f668 R08:  R09: 1100387c5d9f
   R10: 218c4e80 R11: 85b34380 R12: 8801c4dc6a28
   R13:  R14: 8801c4dc6a00 R15: 8801c4dc6a20
   FS:  7f311079b700() GS:8801db30() knlGS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 004d4a24 CR3: 0001cbcd CR4: 001406e0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   Call Trace:
bpf_prog_load+0xcbb/0x18e0 kernel/bpf/syscall.c:1166
SYSC_bpf kernel/bpf/syscall.c:1690 [inline]
SyS_bpf+0xae9/0x4620 kernel/bpf/syscall.c:1652
entry_SYSCALL_64_fastpath+0x1f/0xbe
   RIP: 0033:0x452869
   RSP: 002b:7f311079abe8 EFLAGS: 0212 ORIG_RAX: 0141
   RAX: ffda RBX: 00758020 RCX: 00452869
   RDX: 0030 RSI: 20168000 RDI: 0005
   RBP: 7f311079aa20 R08:  R09: 
   R10:  R11: 0212 R12: 004b7550
   R13: 7f311079ab58 R14: 004b7560 R15: 
   Code: df 48 c1 ea 03 80 3c 02 00 0f 85 e6 0b 00 00 4d 8b 6e 20 48 b8 00 00 00 00 
00 fc ff df 49 8d bd 80 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b6 0b 
00 00 49 8b bd 80 03 00 00 e8 d6 0c 26
   RIP: free_verifier_state kernel/bpf/verifier.c:347 [inline] RSP: 
8801c3e2f5c8
   RIP: bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533 RSP: 8801c3e2f5c8
   ---[ end trace c8d37f339dc64004 ]---

Fixes: 638f5b90d460 ("bpf: reduce verifier memory consumption")
Fixes: 1969db47f8d0 ("bpf: fix verifier memory leaks")
Signed-off-by: Craig Gallek 


Acked-by: Daniel Borkmann 
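
The crash pattern is a cleanup path that dereferences a state object which an early failure never allocated. A minimal user-space sketch of a cleanup routine that tolerates this is below. It is not the verifier code: the `ex_*` structures are hypothetical, and `simulate_oom` only exists to exercise the failure path.

```c
#include <stdlib.h>

struct ex_state {
	int *regs;
};

struct ex_env {
	struct ex_state *cur_state;
};

/* Tolerates a state that was never allocated. */
void ex_free_state(struct ex_state *st)
{
	if (!st)
		return;
	free(st->regs);
	free(st);
}

/* Returns 0 on success, -1 if the allocation failed early. */
int ex_check(struct ex_env *env, int simulate_oom)
{
	int ret = -1;

	env->cur_state = simulate_oom ? NULL :
			 calloc(1, sizeof(*env->cur_state));
	if (env->cur_state)
		ret = 0;

	/* Common exit path: must be safe whether or not we got memory. */
	ex_free_state(env->cur_state);
	env->cur_state = NULL;
	return ret;
}
```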


Re: [PATCH][net-next] net: sched: cls_bpf: use bitwise & rather than logical && on gen_flags

2017-11-02 Thread Daniel Borkmann

On 11/02/2017 09:04 PM, Colin King wrote:

From: Colin Ian King 

Currently gen_flags is being operated on by a logical && operator rather
than a bitwise & operator. This looks incorrect as these should be bit
flag operations. Fix this.

Detected by CoverityScan, CID#1460305 ("Logical vs. bitwise operator")

Fixes: 3f7889c4c79b ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Colin Ian King 


Acked-by: Daniel Borkmann 
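
For readers who want to see why the operator matters, the sketch below contrasts the two tests. The flag values and function names are hypothetical; only the `&&` versus `&` difference is the point.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_FLAG_SKIP_HW 0x1u	/* hypothetical flag values */
#define EX_FLAG_SKIP_SW 0x2u

/* Buggy form: true whenever both operands are non-zero. */
bool ex_test_logical(uint32_t gen_flags, uint32_t flag)
{
	return gen_flags && flag;
}

/* Intended form: true only when the specific bit is set. */
bool ex_test_bitwise(uint32_t gen_flags, uint32_t flag)
{
	return (gen_flags & flag) != 0;
}
```

With only `EX_FLAG_SKIP_HW` set, the logical form still reports `EX_FLAG_SKIP_SW` as present, while the bitwise form does not.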


Re: [PATCH] netfilter: ipvs: Convert timers to use timer_setup()

2017-11-02 Thread Kees Cook
On Thu, Nov 2, 2017 at 7:42 AM, Simon Horman  wrote:
> On Tue, Oct 24, 2017 at 10:07:03PM +0300, Julian Anastasov wrote:
>>
>>   Hello,
>>
>> On Tue, 24 Oct 2017, Kees Cook wrote:
>>
>> > In preparation for unconditionally passing the struct timer_list pointer to
>> > all timer callbacks, switch to using the new timer_setup() and from_timer()
>> > to pass the timer pointer explicitly.
>> >
>> > Cc: Wensong Zhang 
>> > Cc: Simon Horman 
>> > Cc: Julian Anastasov 
>> > Cc: Pablo Neira Ayuso 
>> > Cc: Jozsef Kadlecsik 
>> > Cc: Florian Westphal 
>> > Cc: "David S. Miller" 
>> > Cc: netdev@vger.kernel.org
>> > Cc: lvs-de...@vger.kernel.org
>> > Cc: netfilter-de...@vger.kernel.org
>> > Cc: coret...@netfilter.org
>> > Signed-off-by: Kees Cook 
>>
>>   Looks good to me,
>>
>> Acked-by: Julian Anastasov 
>
> Signed-off-by: Simon Horman 
>
> Pablo, could you take this?

If it's helpful, we could take this via the timers tree. Either way. :) Thanks!

-Kees

-- 
Kees Cook
Pixel Security
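
The conversion relies on the callback receiving a pointer to the timer itself and recovering the enclosing object from it. A user-space sketch of that idea follows. The `ex_*` names are stand-ins, not the kernel timer API, and `ex_container_of()` is a plain `offsetof()` based version of what `from_timer()` is assumed to do.

```c
#include <stddef.h>

struct ex_timer {
	void (*fn)(struct ex_timer *t);
};

/* An object that embeds its timer, as a connection entry might. */
struct ex_conn {
	int hits;
	struct ex_timer timer;
};

/* Recover the enclosing object from a pointer to one of its members. */
#define ex_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback gets the timer pointer, not an opaque data value. */
static void ex_conn_expire(struct ex_timer *t)
{
	struct ex_conn *c = ex_container_of(t, struct ex_conn, timer);

	c->hits++;
}

void ex_timer_setup(struct ex_timer *t, void (*fn)(struct ex_timer *))
{
	t->fn = fn;
}

void ex_timer_fire(struct ex_timer *t)
{
	t->fn(t);
}
```

Because the callback argument is typed, no cast from an integer to a pointer is needed anywhere.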


RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-11-02 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Toshiaki Makita
> Sent: Thursday, November 02, 2017 2:23 AM
> To: Keller, Jacob E ; netdev@vger.kernel.org
> Cc: vyase...@redhat.com; Malek, Patryk 
> Subject: Re: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> On 2017/11/02 7:25, Keller, Jacob E wrote:
> ...
> >> If we skip adding them, we cannot receive frames which should be
> >> received on the bridge device during non-promiscuous mode.
> >>
> >> --
> >> Toshiaki Makita
> >
> > This makes sense, but then what removes the addresses upon bridge deletion
> or exiting static mode?
> >
> > We want to make sure we remove the correct addresses but don't request a
> delete of the permanent MAC address? Or, do we just completely assume that a
> device will never actually delete its own permanent address, and thus say 
> this is
> a driver's fault for allowing a delete request of its permanent address to do
> anything..?
> 
> We may be able to skip adding or deleting local address which is
> identical to dev_addr in bridge code.
> Having said that I feel like drivers should ensure not to remove their
> permanent address even when the same address is removed from the uc
> list, since currently it is not prohibited to do that kind of admin
> operation through bridge command (bridge fdb add|del self).
> Note that "bridge fdb ... self" is a command which modifies device's uc
> filter, not modify bridge's fdb entries.
> 
> --
> Toshiaki Makita   

Ok. I'll go ahead and cook a patch for preventing such a removal from deleting 
the permanent address from i40e. That sounds like the most reasonable approach 
given that from digging into other drivers, they don't store the permanent 
address in the regular UC table anyways.

Thanks,
Jake
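
A rough sketch of the approach, assuming a driver-side filter list and a stored permanent address: a delete request that names the permanent address is simply refused. This is not i40e code; `struct ex_dev` and `ex_uc_del()` are hypothetical, and the table here is a fixed-size array only to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EX_ETH_ALEN 6

struct ex_dev {
	uint8_t perm_addr[EX_ETH_ALEN];
	uint8_t uc[4][EX_ETH_ALEN];	/* programmed unicast filters */
	int n_uc;
};

/* Returns true if a filter was removed, false if refused or not found. */
bool ex_uc_del(struct ex_dev *d, const uint8_t *addr)
{
	/* Never drop the filter for the device's own permanent address. */
	if (memcmp(addr, d->perm_addr, EX_ETH_ALEN) == 0)
		return false;

	for (int i = 0; i < d->n_uc; i++) {
		if (memcmp(d->uc[i], addr, EX_ETH_ALEN) == 0) {
			/* Fill the hole with the last entry. */
			memmove(d->uc[i], d->uc[d->n_uc - 1], EX_ETH_ALEN);
			d->n_uc--;
			return true;
		}
	}
	return false;
}
```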



[PATCH net-next 2/4] ila: add checksum neutral map auto

2017-11-02 Thread Tom Herbert
Add checksum neutral auto that performs checksum neutral mapping
without using the C-bit. This is enabled by configuration of
a mapping.

The checksum neutral function has been split into
ila_csum_do_neutral_fmt and ila_csum_do_neutral_nofmt. The former
handles the C-bit and includes it in the adjustment value. The latter
just sets the adjustment value on the locator diff only.

Added configuration for checksum neutral map auto in ila_lwt
and ila_xlat.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/ila.h  |  1 +
 net/ipv6/ila/ila_common.c | 65 ---
 net/ipv6/ila/ila_lwt.c| 29 +++--
 net/ipv6/ila/ila_xlat.c   | 10 +---
 4 files changed, 61 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h
index 948c0a91e11b..0132b14a556f 100644
--- a/include/uapi/linux/ila.h
+++ b/include/uapi/linux/ila.h
@@ -40,6 +40,7 @@ enum {
ILA_CSUM_ADJUST_TRANSPORT,
ILA_CSUM_NEUTRAL_MAP,
ILA_CSUM_NO_ACTION,
+   ILA_CSUM_NEUTRAL_MAP_AUTO,
 };
 
 #endif /* _UAPI_LINUX_ILA_H */
diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index f1d9248d8b86..8c88ecf29b93 100644
--- a/net/ipv6/ila/ila_common.c
+++ b/net/ipv6/ila/ila_common.c
@@ -37,8 +37,8 @@ static __wsum get_csum_diff(struct ipv6hdr *ip6h, struct 
ila_params *p)
return get_csum_diff_iaddr(ila_a2i(&ip6h->daddr), p);
 }
 
-static void ila_csum_do_neutral(struct ila_addr *iaddr,
-   struct ila_params *p)
+static void ila_csum_do_neutral_fmt(struct ila_addr *iaddr,
+   struct ila_params *p)
 {
__sum16 *adjust = (__force __sum16 *)&iaddr->ident.v16[3];
__wsum diff, fval;
@@ -60,13 +60,23 @@ static void ila_csum_do_neutral(struct ila_addr *iaddr,
iaddr->ident.csum_neutral ^= 1;
 }
 
-static void ila_csum_adjust_transport(struct sk_buff *skb,
+static void ila_csum_do_neutral_nofmt(struct ila_addr *iaddr,
  struct ila_params *p)
 {
+   __sum16 *adjust = (__force __sum16 *)&iaddr->ident.v16[3];
__wsum diff;
-   struct ipv6hdr *ip6h = ipv6_hdr(skb);
-   struct ila_addr *iaddr = ila_a2i(&ip6h->daddr);
+
+   diff = get_csum_diff_iaddr(iaddr, p);
+
+   *adjust = ~csum_fold(csum_add(diff, csum_unfold(*adjust)));
+}
+
+static void ila_csum_adjust_transport(struct sk_buff *skb,
+ struct ila_params *p)
+{
size_t nhoff = sizeof(struct ipv6hdr);
+   struct ipv6hdr *ip6h = ipv6_hdr(skb);
+   __wsum diff;
 
switch (ip6h->nexthdr) {
case NEXTHDR_TCP:
@@ -105,36 +115,39 @@ static void ila_csum_adjust_transport(struct sk_buff *skb,
}
break;
}
-
-   /* Now change destination address */
-   iaddr->loc = p->locator;
 }
 
 void ila_update_ipv6_locator(struct sk_buff *skb, struct ila_params *p,
-bool set_csum_neutral)
+bool sir2ila)
 {
struct ipv6hdr *ip6h = ipv6_hdr(skb);
struct ila_addr *iaddr = ila_a2i(&ip6h->daddr);
 
-   /* First deal with the transport checksum */
-   if (ila_csum_neutral_set(iaddr->ident)) {
-   /* C-bit is set in the locator indicating that this
-* is a locator being translated to a SIR address.
-* Perform (receiver) checksum-neutral translation.
-*/
-   if (!set_csum_neutral)
-   ila_csum_do_neutral(iaddr, p);
-   } else {
-   switch (p->csum_mode) {
-   case ILA_CSUM_ADJUST_TRANSPORT:
-   ila_csum_adjust_transport(skb, p);
-   break;
-   case ILA_CSUM_NEUTRAL_MAP:
-   ila_csum_do_neutral(iaddr, p);
-   break;
-   case ILA_CSUM_NO_ACTION:
+   switch (p->csum_mode) {
+   case ILA_CSUM_ADJUST_TRANSPORT:
+   ila_csum_adjust_transport(skb, p);
+   break;
+   case ILA_CSUM_NEUTRAL_MAP:
+   if (sir2ila) {
+   if (WARN_ON(ila_csum_neutral_set(iaddr->ident))) {
+   /* Checksum flag should never be
+* set in a formatted SIR address.
+*/
+   break;
+   }
+   } else if (!ila_csum_neutral_set(iaddr->ident)) {
+   /* ILA to SIR translation and C-bit isn't
+* set so we're good.
+*/
break;
}
+   ila_csum_do_neutral_fmt(iaddr, p);
+   break;
+   case ILA_CSUM_NEUTRAL_MAP_AUTO:
+   ila_csum_do_neutral_nofmt(iaddr, p);
+   break;
+   case ILA_CSUM_NO_ACTION:
+   break;
}
 
/* Now 

[PATCH net-next 1/4] ila: cleanup checksum diff

2017-11-02 Thread Tom Herbert
Consolidate computing checksum diff into one function.

Add get_csum_diff_iaddr that computes the checksum diff between
an address argument and locator being written. get_csum_diff
calls this using the destination address in the IP header as
the argument.

Also moved ila_init_saved_csum to be close to the checksum
diff functions.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/ila_common.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index aba0998ddbfb..f1d9248d8b86 100644
--- a/net/ipv6/ila/ila_common.c
+++ b/net/ipv6/ila/ila_common.c
@@ -13,15 +13,28 @@
 #include 
 #include "ila.h"
 
-static __wsum get_csum_diff(struct ipv6hdr *ip6h, struct ila_params *p)
+void ila_init_saved_csum(struct ila_params *p)
 {
-   struct ila_addr *iaddr = ila_a2i(&ip6h->daddr);
+   if (!p->locator_match.v64)
+   return;
+
+   p->csum_diff = compute_csum_diff8(
+   (__be32 *)&p->locator,
+   (__be32 *)&p->locator_match);
+}
 
+static __wsum get_csum_diff_iaddr(struct ila_addr *iaddr, struct ila_params *p)
+{
if (p->locator_match.v64)
return p->csum_diff;
else
-   return compute_csum_diff8((__be32 *)&iaddr->loc,
- (__be32 *)&p->locator);
+   return compute_csum_diff8((__be32 *)&p->locator,
+ (__be32 *)&iaddr->loc);
+}
+
+static __wsum get_csum_diff(struct ipv6hdr *ip6h, struct ila_params *p)
+{
+   return get_csum_diff_iaddr(ila_a2i(&ip6h->daddr), p);
 }
 
 static void ila_csum_do_neutral(struct ila_addr *iaddr,
@@ -30,13 +43,7 @@ static void ila_csum_do_neutral(struct ila_addr *iaddr,
__sum16 *adjust = (__force __sum16 *)&iaddr->ident.v16[3];
__wsum diff, fval;
 
-   /* Check if checksum adjust value has been cached */
-   if (p->locator_match.v64) {
-   diff = p->csum_diff;
-   } else {
-   diff = compute_csum_diff8((__be32 *)&p->locator,
- (__be32 *)iaddr);
-   }
+   diff = get_csum_diff_iaddr(iaddr, p);
 
fval = (__force __wsum)(ila_csum_neutral_set(iaddr->ident) ?
CSUM_NEUTRAL_FLAG : ~CSUM_NEUTRAL_FLAG);
@@ -134,16 +141,6 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct 
ila_params *p,
iaddr->loc = p->locator;
 }
 
-void ila_init_saved_csum(struct ila_params *p)
-{
-   if (!p->locator_match.v64)
-   return;
-
-   p->csum_diff = compute_csum_diff8(
-   (__be32 *)&p->locator,
-   (__be32 *)&p->locator_match);
-}
-
 static int __init ila_init(void)
 {
int ret;
-- 
2.11.0
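
The checksum diff computed in this series feeds checksum-neutral mapping: when the locator (upper 64 bits of the address) is rewritten, a 16-bit word in the identifier is adjusted by the one's-complement difference so that a transport checksum covering the address stays valid. The sketch below shows the arithmetic only. It works on host-order 16-bit words, ignores the C-bit handling and the cached `csum_diff`, and all `ex_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t ex_fold(uint32_t s)
{
	while (s >> 16)
		s = (s & 0xffffu) + (s >> 16);
	return (uint16_t)s;
}

/* One's-complement sum over n 16-bit words. */
uint16_t ex_sum16(const uint16_t *w, size_t n)
{
	uint32_t s = 0;

	for (size_t i = 0; i < n; i++)
		s += w[i];
	return ex_fold(s);
}

/*
 * Overwrite the locator (words 0..3 of a 128-bit address) and fold the
 * difference into word 7 so the sum over all eight words is unchanged.
 */
void ex_rewrite_locator(uint16_t addr[8], const uint16_t new_loc[4])
{
	uint16_t old_sum = ex_sum16(addr, 4);
	uint16_t new_sum = ex_sum16(new_loc, 4);
	/* diff = old - new in one's-complement arithmetic */
	uint16_t diff = ex_fold((uint32_t)old_sum + (uint16_t)~new_sum);

	for (int i = 0; i < 4; i++)
		addr[i] = new_loc[i];
	addr[7] = ex_fold((uint32_t)addr[7] + diff);
}
```

Since the difference depends only on the old and new locator, it can be computed once per mapping when the old locator is known in advance, which is what caching the diff buys.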



[PATCH net-next 3/4] ila: allow configuration of identifier type

2017-11-02 Thread Tom Herbert
Allow identifier to be explicitly configured for a mapping.
This can either be one of the identifier types specified in the
ILA draft or a value of ILA_ATYPE_USE_FORMAT which means the
identifier type is inferred from the identifier type field.
If a value other than ILA_ATYPE_USE_FORMAT is set for a
mapping then it is assumed that the identifier type field is
not present in an identifier.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/ila.h | 13 +
 net/ipv6/ila/ila.h   | 12 +---
 net/ipv6/ila/ila_lwt.c   | 37 +
 net/ipv6/ila/ila_xlat.c  | 18 +-
 4 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h
index 0132b14a556f..de88b2c7ca37 100644
--- a/include/uapi/linux/ila.h
+++ b/include/uapi/linux/ila.h
@@ -16,6 +16,7 @@ enum {
ILA_ATTR_DIR,   /* u32 */
ILA_ATTR_PAD,
ILA_ATTR_CSUM_MODE, /* u8 */
+   ILA_ATTR_IDENT_TYPE,/* u8 */
 
__ILA_ATTR_MAX,
 };
@@ -43,4 +44,16 @@ enum {
ILA_CSUM_NEUTRAL_MAP_AUTO,
 };
 
+enum {
+   ILA_ATYPE_IID = 0,
+   ILA_ATYPE_LUID,
+   ILA_ATYPE_VIRT_V4,
+   ILA_ATYPE_VIRT_UNI_V6,
+   ILA_ATYPE_VIRT_MULTI_V6,
+   ILA_ATYPE_NONLOCAL_ADDR,
+   ILA_ATYPE_RSVD_1,
+   ILA_ATYPE_RSVD_2,
+
+   ILA_ATYPE_USE_FORMAT = 32, /* Get type from type field in identifier */
+};
 #endif /* _UAPI_LINUX_ILA_H */
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index e0170f62bc39..3c7a11b62334 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -55,17 +55,6 @@ struct ila_identifier {
};
 };
 
-enum {
-   ILA_ATYPE_IID = 0,
-   ILA_ATYPE_LUID,
-   ILA_ATYPE_VIRT_V4,
-   ILA_ATYPE_VIRT_UNI_V6,
-   ILA_ATYPE_VIRT_MULTI_V6,
-   ILA_ATYPE_RSVD_1,
-   ILA_ATYPE_RSVD_2,
-   ILA_ATYPE_RSVD_3,
-};
-
 #define CSUM_NEUTRAL_FLAG  htonl(0x1000)
 
 struct ila_addr {
@@ -93,6 +82,7 @@ struct ila_params {
struct ila_locator locator_match;
__wsum csum_diff;
u8 csum_mode;
+   u8 ident_type;
 };
 
 static inline __wsum compute_csum_diff8(const __be32 *from, const __be32 *to)
diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index 92269b85281e..291d591a06c0 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -113,6 +113,7 @@ static int ila_input(struct sk_buff *skb)
 static const struct nla_policy ila_nl_policy[ILA_ATTR_MAX + 1] = {
[ILA_ATTR_LOCATOR] = { .type = NLA_U64, },
[ILA_ATTR_CSUM_MODE] = { .type = NLA_U8, },
+   [ILA_ATTR_IDENT_TYPE] = { .type = NLA_U8, },
 };
 
 static int ila_build_state(struct nlattr *nla,
@@ -126,7 +127,9 @@ static int ila_build_state(struct nlattr *nla,
struct lwtunnel_state *newts;
const struct fib6_config *cfg6 = cfg;
struct ila_addr *iaddr;
+   u8 ident_type = ILA_ATYPE_USE_FORMAT;
u8 csum_mode = ILA_CSUM_NO_ACTION;
+   u8 eff_ident_type;
int ret;
 
if (family != AF_INET6)
@@ -148,6 +151,34 @@ static int ila_build_state(struct nlattr *nla,
 
iaddr = (struct ila_addr *)&cfg6->fc_dst;
 
+   if (tb[ILA_ATTR_IDENT_TYPE])
+   ident_type = nla_get_u8(tb[ILA_ATTR_IDENT_TYPE]);
+
+   if (ident_type == ILA_ATYPE_USE_FORMAT) {
+   /* Infer identifier type from type field in formatted
+* identifier.
+*/
+
+   eff_ident_type = iaddr->ident.type;
+   } else {
+   eff_ident_type = ident_type;
+   }
+
+   switch (eff_ident_type) {
+   case ILA_ATYPE_IID:
+   /* Don't allow ILA for IID type */
+   return -EINVAL;
+   case ILA_ATYPE_LUID:
+   break;
+   case ILA_ATYPE_VIRT_V4:
+   case ILA_ATYPE_VIRT_UNI_V6:
+   case ILA_ATYPE_VIRT_MULTI_V6:
+   case ILA_ATYPE_NONLOCAL_ADDR:
+   /* These ILA formats are not supported yet. */
+   default:
+   return -EINVAL;
+   }
+
if (tb[ILA_ATTR_CSUM_MODE])
csum_mode = nla_get_u8(tb[ILA_ATTR_CSUM_MODE]);
 
@@ -173,6 +204,7 @@ static int ila_build_state(struct nlattr *nla,
p = ila_params_lwtunnel(newts);
 
p->csum_mode = csum_mode;
+   p->ident_type = ident_type;
p->locator.v64 = (__force __be64)nla_get_u64(tb[ILA_ATTR_LOCATOR]);
 
/* Precompute checksum difference for translation since we
@@ -207,9 +239,13 @@ static int ila_fill_encap_info(struct sk_buff *skb,
if (nla_put_u64_64bit(skb, ILA_ATTR_LOCATOR, (__force 
u64)p->locator.v64,
  ILA_ATTR_PAD))
goto nla_put_failure;
+
if (nla_put_u8(skb, ILA_ATTR_CSUM_MODE, (__force u8)p->csum_mode))
goto nla_put_failure;
 
+   if (nla_put_u8(skb, ILA_ATTR_IDENT_TYPE, (__force u8)p->ident_type))
+   

[PATCH net-next 4/4] ila: Add ila.txt

2017-11-02 Thread Tom Herbert
Add documentation for kernel ILA. This describes ILA, its features, and
its configuration, and gives some examples.

Signed-off-by: Tom Herbert 
---
 Documentation/networking/ila.txt | 286 +++
 1 file changed, 286 insertions(+)
 create mode 100644 Documentation/networking/ila.txt

diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt
new file mode 100644
index ..e9923218cd99
--- /dev/null
+++ b/Documentation/networking/ila.txt
@@ -0,0 +1,286 @@
+Identifier Locator Addressing (ILA)
+===================================
+
+
+Introduction
+============
+Identifier-locator addressing (ILA) is a technique used with IPv6 that
+differentiates between location and identity of a network node. Part of an
+address expresses the immutable identity of the node, and another part
+indicates the location of the node which can be dynamic. Identifier-locator
+addressing can be used to efficiently implement overlay networks for
+network virtualization as well as solutions for use cases in mobility.
+
+ILA can be thought of as means to implement an overlay network without
+encapsulation. This is accomplished by performing network address
+translation on destination addresses as a packet traverses a network. To
+the network, an ILA translated packet appears to be no different than any
+other IPv6 packet. For instance, if the transport protocol is TCP then an
+ILA translated packet looks like just another TCP/IPv6 packet. The
+advantage of this is that ILA is transparent to the network so that
+optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work.
+
+The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila.
+
+
+ILA terminology
+===============
+
+  - Identifier A number that identifies an addressable node in the network
+   independent of its location. ILA identifiers are sixty-four
+   bit values.
+
+  - Locator A network prefix that routes to a physical host. Locators
+   provide the topological location of an addressed node. ILA
+   locators are sixty-four bit prefixes.
+
+  - ILA mapping
+   A mapping of an ILA identifier to a locator (or to a
+   locator and meta data). An ILA domain maintains a database
+   that contains mappings for all destinations in the domain.
+
+  - SIR address
+   An IPv6 address composed of a SIR prefix (upper sixty-
+   four bits) and an identifier (lower sixty-four bits).
+   SIR addresses are visible to applications and provide a
+   means for them to address nodes independent of their
+   location.
+
+  - ILA address
+   An IPv6 address composed of a locator (upper sixty-four
+   bits) and an identifier (low order sixty-four bits). ILA
+   addresses are never visible to an application.
+
+  - ILA host   An end host that is capable of performing ILA translations
+   on transmit or receive.
+
+  - ILA router A network node that performs ILA translation and forwarding
+   of translated packets.
+
+  - ILA forwarding cache
+   A type of ILA router that only maintains a working set
+   cache of mappings.
+
+  - ILA node   A network node capable of performing ILA translations. This
+   can be an ILA router, ILA forwarding cache, or ILA host.
+
+
+Operation
+=========
+
+There are two fundamental operations with ILA:
+
+  - Translate a SIR address to an ILA address. This is performed on ingress
+to an ILA overlay.
+
+  - Translate an ILA address to a SIR address. This is performed on egress
+from the ILA overlay.
+
+ILA can be deployed either on end hosts or intermediate devices in the
+network; these are provided by "ILA hosts" and "ILA routers" respectively.
+Configuration and datapath for these two points of deployment is somewhat
+different.
+
+The diagram below illustrates the flow of packets through ILA as well
+as showing ILA hosts and routers.
+
++--------+                                                +--------+
+| Host A +-+                                         +--->| Host B |
+|        | |              (2) ILA                   (')   |        |
++--------+ |            ...addressed....           (   )  +--------+
+           V  +---+--+  .  packet      .  +---+--+  (_)
+(1) SIR    |  | ILA  |----->-------->---->| ILA  |   |   (3) SIR
+addressed  +->|router|  .              .  |router|->-+    addressed
+packet        +---+--+  .     IPv6     .  +---+--+        packet
+               /        .    Network   .
+              /         .              .   +--+-++--------+
++--------+   /          .              .   |ILA ||  Host  |
+|  Host  +--+           .              .- -|host||        |
+|        |              .              .   +--+-++--------+
++--------+              ................
+
+
+Transport 

[PATCH net-next 0/4] ila: make identifier format optional and other fixes

2017-11-02 Thread Tom Herbert
The identifier type and checksum neutral mapping bits are optional
in identifier formats. This patch set fixes the implementation to
make them optional and configurable.

Specific items:

  - Clean up checksum diff code in ILA
  - Add checksum neutral mapping auto so that checksum neutral
mapping can be configured without requiring use of the C-bit
  - Add identifier type configuration and allow identifier
type to be configured so that the identifier type field does
not need to be present
  - Added ILA documentation: ila.txt
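
For readers unfamiliar with the term, here is a small user-space sketch
of what "checksum neutral" means. It is not the kernel code: the address
is modelled as eight 16-bit words (words 0-3 the locator, words 4-7 the
identifier), the helper names are hypothetical, and the adjustment is
folded into the last identifier word, assuming that matches the
v16[3] position used by ila_csum_do_neutral().

```c
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement 16-bit addition with end-around carry. */
static uint16_t oc_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffff) + (s >> 16));
}

/* Ones'-complement sum of n 16-bit words. */
static uint16_t oc_sum(const uint16_t *w, size_t n)
{
	uint16_t s = 0;

	for (size_t i = 0; i < n; i++)
		s = oc_add(s, w[i]);
	return s;
}

/* Hypothetical helper: overwrite the locator (words 0-3) and fold the
 * difference "old locator sum minus new locator sum" into the last
 * identifier word, so the ones'-complement sum of the whole address
 * does not change.
 */
static void translate_neutral(uint16_t addr[8], const uint16_t new_loc[4])
{
	uint16_t diff = oc_add(oc_sum(addr, 4), (uint16_t)~oc_sum(new_loc, 4));

	for (size_t i = 0; i < 4; i++)
		addr[i] = new_loc[i];
	addr[7] = oc_add(addr[7], diff);
}
```

Because the sum over the address is unchanged, a transport checksum that
covers the address through the pseudo header stays valid after
translation without touching the transport header. Note that 0x0000 and
0xffff both represent zero in ones'-complement arithmetic, so sums
should be compared modulo 0xffff.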

I have patches for ILA in iproute2 that will be posted separately.

Tested: Ran netperf TCP_RR on various combinations of checksum
mode and the two supported identifier types.

Tom Herbert (4):
  ila: cleanup checksum diff
  ila: add checksum neutral map auto
  ila: allow configuration of identifier type
  ila: Add ila.txt

 Documentation/networking/ila.txt | 286 +++
 include/uapi/linux/ila.h |  14 ++
 net/ipv6/ila/ila.h   |  12 +-
 net/ipv6/ila/ila_common.c| 104 +++---
 net/ipv6/ila/ila_lwt.c   |  62 +++--
 net/ipv6/ila/ila_xlat.c  |  26 ++--
 6 files changed, 426 insertions(+), 78 deletions(-)
 create mode 100644 Documentation/networking/ila.txt

-- 
2.11.0



Re: [PATCH net-next] tools: bpf: handle long path in jit disasm

2017-11-02 Thread Rustad, Mark D

> On Nov 2, 2017, at 1:09 AM, Prashant Bhole  
> wrote:
> 
> Use PATH_MAX instead of hardcoded array size 256
> 
> Signed-off-by: Prashant Bhole 
> ---
> tools/bpf/bpf_jit_disasm.c | 3 ++-
> tools/bpf/bpftool/jit_disasm.c | 3 ++-
> 2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
> index 422d9abd666a..75bf526a0168 100644
> --- a/tools/bpf/bpf_jit_disasm.c
> +++ b/tools/bpf/bpf_jit_disasm.c
> @@ -27,6 +27,7 @@
> #include 
> #include 
> #include 
> +#include 
> 
> #define CMD_ACTION_SIZE_BUFFER10
> #define CMD_ACTION_READ_ALL   3
> @@ -51,7 +52,7 @@ static void get_exec_path(char *tpath, size_t size)
> static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
> {
>   int count, i, pc = 0;
> - char tpath[256];
> + char tpath[PATH_MAX];

Seems like such a nice thing, *but* PATH_MAX is 4096. Can things really 
tolerate 4k on the stack here?

>   struct disassemble_info info;
>   disassembler_ftype disassemble;
>   bfd *bfdf;
> diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
> index 5937e134e408..1551d3918d4c 100644
> --- a/tools/bpf/bpftool/jit_disasm.c
> +++ b/tools/bpf/bpftool/jit_disasm.c
> @@ -21,6 +21,7 @@
> #include 
> #include 
> #include 
> +#include 
> 
> #include "json_writer.h"
> #include "main.h"
> @@ -80,7 +81,7 @@ void disasm_print_insn(unsigned char *image, ssize_t len, 
> int opcodes)
>   disassembler_ftype disassemble;
>   struct disassemble_info info;
>   int count, i, pc = 0;
> - char tpath[256];
> + char tpath[PATH_MAX];

Same comment here.

>   bfd *bfdf;
> 
>   if (!len)

--
Mark Rustad, Networking Division, Intel Corporation





[PATCH] ISDN: eicon: message: mark expected switch fall-throughs

2017-11-02 Thread Gustavo A. R. Silva
In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
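
As a stand-alone illustration of the annotation, loosely modelled on the
DTMF_LISTEN_* cases in this file (the enum and listen_mask() are
hypothetical names used only for this sketch):

```c
/* Hypothetical command codes, loosely modelled on the DTMF_LISTEN_* cases. */
enum { LISTEN_START, LISTEN_MF_START, LISTEN_TONE_START };

/* Each earlier case shifts the mask once more before falling into the
 * next one; the comments tell the compiler and the reader that the
 * missing break is intentional.
 */
static unsigned int listen_mask(int cmd)
{
	unsigned int mask = 1;

	switch (cmd) {
	case LISTEN_TONE_START:
		mask <<= 1;
		/* fall through */
	case LISTEN_MF_START:
		mask <<= 1;
		/* fall through */
	case LISTEN_START:
		break;
	default:
		mask = 0;
		break;
	}

	return mask;
}
```

Here LISTEN_START yields 1, LISTEN_MF_START yields 2 and
LISTEN_TONE_START yields 4.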

Addresses-Coverity-ID: 114780
Addresses-Coverity-ID: 114781
Addresses-Coverity-ID: 114782
Addresses-Coverity-ID: 114783
Addresses-Coverity-ID: 114784
Addresses-Coverity-ID: 114785
Addresses-Coverity-ID: 114786
Addresses-Coverity-ID: 114787
Addresses-Coverity-ID: 114788
Addresses-Coverity-ID: 114789
Addresses-Coverity-ID: 114790
Addresses-Coverity-ID: 114791
Addresses-Coverity-ID: 114792
Addresses-Coverity-ID: 114793
Addresses-Coverity-ID: 114794
Addresses-Coverity-ID: 114795
Addresses-Coverity-ID: 200521
Signed-off-by: Gustavo A. R. Silva 
---
 drivers/isdn/hardware/eicon/message.c | 70 +--
 1 file changed, 58 insertions(+), 12 deletions(-)

diff --git a/drivers/isdn/hardware/eicon/message.c 
b/drivers/isdn/hardware/eicon/message.c
index eadd1ed..def7992 100644
--- a/drivers/isdn/hardware/eicon/message.c
+++ b/drivers/isdn/hardware/eicon/message.c
@@ -4501,6 +4501,7 @@ static void control_rc(PLCI *plci, byte req, byte rc, 
byte ch, byte global_req,
plci->channels++;
a->ncci_state[ncci] = OUTG_CON_PENDING;
}
+   /* fall through */
 
default:
if (plci->internal_command_queue[0])
@@ -7020,6 +7021,7 @@ static void nl_ind(PLCI *plci)
plci->NL.RNum = 1;
return;
}
+   /* fall through */
case N_BDATA:
case N_DATA:
if (((a->ncci_state[ncci] != CONNECTED) && (plci->B2_prot == 
1)) /* transparent */
@@ -9626,9 +9628,9 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
{
 
case DTMF_LISTEN_TONE_START:
-   mask <<= 1;
+   mask <<= 1; /* fall through */
case DTMF_LISTEN_MF_START:
-   mask <<= 1;
+   mask <<= 1; /* fall through */
 
case DTMF_LISTEN_START:
switch (internal_command)
@@ -9636,6 +9638,7 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
default:
adjust_b1_resource(Id, plci, NULL, 
(word)(plci->B1_facilities |
  
B1_FACILITY_DTMFR), DTMF_COMMAND_1);
+   /* fall through */
case DTMF_COMMAND_1:
if (adjust_b_process(Id, plci, Rc) != GOOD)
{
@@ -9646,6 +9649,7 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
}
if (plci->internal_command)
return;
+   /* fall through */
case DTMF_COMMAND_2:
if (plci_nl_busy(plci))
{
@@ -9673,9 +9677,9 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
 
 
case DTMF_LISTEN_TONE_STOP:
-   mask <<= 1;
+   mask <<= 1; /* fall through */
case DTMF_LISTEN_MF_STOP:
-   mask <<= 1;
+   mask <<= 1; /* fall through */
 
case DTMF_LISTEN_STOP:
switch (internal_command)
@@ -9710,6 +9714,7 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
 */
adjust_b1_resource(Id, plci, NULL, 
(word)(plci->B1_facilities &
  
~(B1_FACILITY_DTMFX | B1_FACILITY_DTMFR)), DTMF_COMMAND_3);
+   /* fall through */
case DTMF_COMMAND_3:
if (adjust_b_process(Id, plci, Rc) != GOOD)
{
@@ -9726,9 +9731,9 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
 
 
case DTMF_SEND_TONE:
-   mask <<= 1;
+   mask <<= 1; /* fall through */
case DTMF_SEND_MF:
-   mask <<= 1;
+   mask <<= 1; /* fall through */
 
case DTMF_DIGITS_SEND:
switch (internal_command)
@@ -9737,6 +9742,7 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
adjust_b1_resource(Id, plci, NULL, 
(word)(plci->B1_facilities |
  
((plci->dtmf_parameter_length != 0) ? B1_FACILITY_DTMFX | B1_FACILITY_DTMFR : 
B1_FACILITY_DTMFX)),
   DTMF_COMMAND_1);
+   /* fall through */
case DTMF_COMMAND_1:
if (adjust_b_process(Id, plci, Rc) != GOOD)
{
@@ -9747,6 +9753,7 @@ static void dtmf_command(dword Id, PLCI *plci, byte Rc)
}
if (plci->internal_command)
return;
+   /* fall 

Re: suspicious RCU usage at ./include/linux/inetdevice.h:LINE

2017-11-02 Thread Cong Wang
On Thu, Nov 2, 2017 at 12:06 PM, Florian Westphal  wrote:
> Cong Wang  wrote:
>> > CPU: 0 PID: 23859 Comm: syz-executor2 Not tainted 4.14.0-rc5+ #140
>> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> > Google 01/01/2011
>> > Call Trace:
>> >  __dump_stack lib/dump_stack.c:16 [inline]
>> >  dump_stack+0x194/0x257 lib/dump_stack.c:52
>> >  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4665
>> >  __in_dev_get_rtnl include/linux/inetdevice.h:230 [inline]
>> >  fib_dump_info+0x1136/0x13d0 net/ipv4/fib_semantics.c:1377
>> >  inet_rtm_getroute+0xf97/0x2d70 net/ipv4/route.c:2785
>>
>> This is introduced by:
>>
>> commit 394f51abb3d04f33fb798f04b16ae6b0491ea4ec
>> Author: Florian Westphal 
>> Date:   Tue Aug 15 16:34:44 2017 +0200
>>
>> ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl
>>
>> Signed-off-by: Florian Westphal 
>> Signed-off-by: David S. Miller 
>>
>> Looks like we need a wrapper for rcu_dereference_protected(dev->ip_ptr).
>
> Yes, that's the alternative to
> https://patchwork.ozlabs.org/patch/833401/
>
> which switches to _rcu version.

Yeah, that works too.


Re: net-next MERGE

2017-11-02 Thread Jiri Pirko
Thu, Nov 02, 2017 at 07:08:12PM CET, xiyou.wangc...@gmail.com wrote:
>On Wed, Nov 1, 2017 at 11:51 PM, David Miller  wrote:
>>
>> Cong, I just did another net --> net-next merge.
>>
>> Please look at how I resolved the cls_api.c conflict.
>>
>> Thank you.
>
>Looks good to me.

Also looks fine to me.


[PATCH] net: usb: asix: fix null-ptr-deref in asix_suspend

2017-11-02 Thread Andrey Konovalov
When asix_suspend() is called, dev->driver_priv might not yet have been
assigned a value, so we need to check that it is not NULL.
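
The guard can be shown in isolation with a user-space sketch. The struct
and function names below are simplified stand-ins, not the usbnet or
asix types:

```c
#include <stddef.h>

struct dev;

/* Stand-in for the driver private data with an optional callback. */
struct priv {
	void (*suspend)(struct dev *dev);
};

/* Stand-in for the device; driver_priv may still be NULL. */
struct dev {
	struct priv *driver_priv;
	int suspended;
};

static void mark_suspended(struct dev *dev)
{
	dev->suspended = 1;
}

/* Only call the callback when both the private data and the callback
 * pointer are set; otherwise skip it and carry on.
 */
static int do_suspend(struct dev *dev)
{
	struct priv *priv = dev->driver_priv;

	if (priv && priv->suspend)
		priv->suspend(dev);

	return 0;
}
```

Without the "priv &&" part, a device whose driver_priv is still NULL
would dereference a NULL pointer when the callback is looked up.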

Found by syzkaller.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] PREEMPT SMP KASAN
Modules linked in:
CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.14.0-rc4-43422-geccacdd69a8c #400
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
task: 88006bb36300 task.stack: 88006bba8000
RIP: 0010:asix_suspend+0x76/0xc0 drivers/net/usb/asix_devices.c:629
RSP: 0018:88006bbae718 EFLAGS: 00010202
RAX: dc00 RBX: 880061ba3b80 RCX: 11000c34d644
RDX: 0001 RSI: 0402 RDI: 0008
RBP: 88006bbae738 R08: 11000d775cad R09: 
R10:  R11:  R12: 8800630a8b40
R13:  R14: 0402 R15: 880061ba3b80
FS:  () GS:88006c60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ff33cf89000 CR3: 61c0a000 CR4: 06f0
Call Trace:
 usb_suspend_interface drivers/usb/core/driver.c:1209
 usb_suspend_both+0x27f/0x7e0 drivers/usb/core/driver.c:1314
 usb_runtime_suspend+0x41/0x120 drivers/usb/core/driver.c:1852
 __rpm_callback+0x339/0xb60 drivers/base/power/runtime.c:334
 rpm_callback+0x106/0x220 drivers/base/power/runtime.c:461
 rpm_suspend+0x465/0x1980 drivers/base/power/runtime.c:596
 __pm_runtime_suspend+0x11e/0x230 drivers/base/power/runtime.c:1009
 pm_runtime_put_sync_autosuspend ./include/linux/pm_runtime.h:251
 usb_new_device+0xa37/0x1020 drivers/usb/core/hub.c:2487
 hub_port_connect drivers/usb/core/hub.c:4903
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x194d/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
 worker_thread+0x221/0x1850 kernel/workqueue.c:2253
 kthread+0x3a1/0x470 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
Code: 8d 7c 24 20 48 89 fa 48 c1 ea 03 80 3c 02 00 75 5b 48 b8 00 00
00 00 00 fc ff df 4d 8b 6c 24 20 49 8d 7d 08 48 89 fa 48 c1 ea 03 <80>
3c 02 00 75 34 4d 8b 6d 08 4d 85 ed 74 0b e8 26 2b 51 fd 4c
RIP: asix_suspend+0x76/0xc0 RSP: 88006bbae718
---[ end trace dfc4f5649284342c ]---

Signed-off-by: Andrey Konovalov 
---
 drivers/net/usb/asix_devices.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index b2ff88e69a81..743416be84f3 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -626,7 +626,7 @@ static int asix_suspend(struct usb_interface *intf, 
pm_message_t message)
struct usbnet *dev = usb_get_intfdata(intf);
struct asix_common_private *priv = dev->driver_priv;
 
-   if (priv->suspend)
+   if (priv && priv->suspend)
priv->suspend(dev);
 
return usbnet_suspend(intf, message);
-- 
2.15.0.403.gc27cc4dac6-goog



Re: [PATCH] net: mvpp2: add ethtool GOP statistics

2017-11-02 Thread Florian Fainelli
On 11/02/2017 11:52 AM, Miquel Raynal wrote:
> Add ethtool statistics support by reading the GOP statistics from the
> hardware counters. Also implement a workqueue to gather the statistics
> every second or some 32-bit counters could overflow.
> 
> Suggested-by: Stefan Chulski 
> Signed-off-by: Miquel Raynal 
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 226 
> ++-
>  1 file changed, 220 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
> b/drivers/net/ethernet/marvell/mvpp2.c
> index 97efe4733661..fb92a0927116 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -769,6 +769,44 @@ enum mvpp2_bm_type {
>   MVPP2_BM_SWF_SHORT
>  };
>  
> +/* GMAC MIB Counters register definitions */
> +#define MVPP21_MIB_COUNTERS_OFFSET   0x1000
> +#define MVPP21_MIB_COUNTERS_PORT_SZ  0x400
> +#define MVPP22_MIB_COUNTERS_OFFSET   0x0
> +#define MVPP22_MIB_COUNTERS_PORT_SZ  0x100
> +
> +#define MVPP2_MIB_GOOD_OCTETS_RCVD_LOW   0x0
> +#define MVPP2_MIB_GOOD_OCTETS_RCVD_HIGH  0x4
> +#define MVPP2_MIB_BAD_OCTETS_RCVD0x8
> +#define MVPP2_MIB_CRC_ERRORS_SENT0xc
> +#define MVPP2_MIB_UNICAST_FRAMES_RCVD0x10
> +#define MVPP2_MIB_BROADCAST_FRAMES_RCVD  0x18
> +#define MVPP2_MIB_MULTICAST_FRAMES_RCVD  0x1c
> +#define MVPP2_MIB_FRAMES_64_OCTETS   0x20
> +#define MVPP2_MIB_FRAMES_65_TO_127_OCTETS0x24
> +#define MVPP2_MIB_FRAMES_128_TO_255_OCTETS   0x28
> +#define MVPP2_MIB_FRAMES_256_TO_511_OCTETS   0x2c
> +#define MVPP2_MIB_FRAMES_512_TO_1023_OCTETS  0x30
> +#define MVPP2_MIB_FRAMES_1024_TO_MAX_OCTETS  0x34
> +#define MVPP2_MIB_GOOD_OCTETS_SENT_LOW   0x38
> +#define MVPP2_MIB_GOOD_OCTETS_SENT_HIGH  0x3c
> +#define MVPP2_MIB_UNICAST_FRAMES_SENT0x40
> +#define MVPP2_MIB_MULTICAST_FRAMES_SENT  0x48
> +#define MVPP2_MIB_BROADCAST_FRAMES_SENT  0x4c
> +#define MVPP2_MIB_FC_SENT0x54
> +#define MVPP2_MIB_FC_RCVD0x58
> +#define MVPP2_MIB_RX_FIFO_OVERRUN0x5c
> +#define MVPP2_MIB_UNDERSIZE_RCVD 0x60
> +#define MVPP2_MIB_FRAGMENTS_RCVD 0x64
> +#define MVPP2_MIB_OVERSIZE_RCVD  0x68
> +#define MVPP2_MIB_JABBER_RCVD0x6c
> +#define MVPP2_MIB_MAC_RCV_ERROR  0x70
> +#define MVPP2_MIB_BAD_CRC_EVENT  0x74
> +#define MVPP2_MIB_COLLISION  0x78
> +#define MVPP2_MIB_LATE_COLLISION 0x7c
> +
> +#define MVPP2_MIB_COUNTERS_STATS_DELAY   (1 * HZ)
> +
>  /* Definitions */
>  
>  /* Shared Packet Processor resources */
> @@ -796,6 +834,7 @@ struct mvpp2 {
>   struct clk *axi_clk;
>  
>   /* List of pointers to port structures */
> + int port_count;
>   struct mvpp2_port **port_list;
>  
>   /* Aggregated TXQs */
> @@ -817,6 +856,10 @@ struct mvpp2 {
>  
>   /* Maximum number of RXQs per port */
>   unsigned int max_port_rxqs;
> +
> + /* Workqueue to gather hardware statistics */
> + struct delayed_work stats_work;
> + struct workqueue_struct *stats_queue;
>  };
>  
>  struct mvpp2_pcpu_stats {
> @@ -879,6 +922,7 @@ struct mvpp2_port {
>   u16 tx_ring_size;
>   u16 rx_ring_size;
>   struct mvpp2_pcpu_stats __percpu *stats;
> + u64 *ethtool_stats;
>  
>   phy_interface_t phy_interface;
>   struct device_node *phy_node;
> @@ -4743,9 +4787,137 @@ static void mvpp2_port_loopback_set(struct mvpp2_port 
> *port)
>   writel(val, port->base + MVPP2_GMAC_CTRL_1_REG);
>  }
>  
> +static u64 mvpp2_read_count(struct mvpp2_port *port, unsigned int offset)
> +{
> + bool reg_is_64b =
> + (offset == MVPP2_MIB_GOOD_OCTETS_RCVD_LOW) ||
> + (offset == MVPP2_MIB_GOOD_OCTETS_SENT_LOW);

This does not scale very well, put that in your statistics structure and
define a member "reg_is_64b" there such that you can pass a pointer to
one of these members here, and check, on per-counter basis whether this
is needed or not.
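
Something along these lines, as a rough user-space sketch only. The
struct layout, the table contents and the use of a plain array in place
of readl() are assumptions made for illustration; offsets are taken from
the defines quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-counter description: the width lives next to the offset. */
struct mib_stat {
	unsigned int offset;
	const char *name;
	bool reg_is_64b;
};

static const struct mib_stat mib_stats[] = {
	{ 0x00, "good_octets_received", true },
	{ 0x08, "bad_octets_received", false },
	{ 0x38, "good_octets_sent", true },
};

/* regs is a simulated register block of 32-bit words standing in for
 * readl(base + offset); a real driver would do MMIO reads here.
 */
static uint64_t read_count(const uint32_t *regs, const struct mib_stat *st)
{
	uint64_t val = regs[st->offset / 4];

	if (st->reg_is_64b)
		val += (uint64_t)regs[st->offset / 4 + 1] << 32;

	return val;
}
```

Adding another 64-bit counter then only means adding a table entry, and
the read function no longer needs to know specific offsets.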

> + void __iomem *base;
> + u64 val;
> +
> + if (port->priv->hw_version == MVPP21)
> + base = port->priv->lms_base + MVPP21_MIB_COUNTERS_OFFSET +
> +port->gop_id * MVPP21_MIB_COUNTERS_PORT_SZ;
> + else
> + base = port->priv->iface_base + MVPP22_MIB_COUNTERS_OFFSET +
> +port->gop_id * MVPP22_MIB_COUNTERS_PORT_SZ;
> +
> + val = readl(base + offset);
> + if (reg_is_64b)
> + val += (u64)readl(base + offset + 4) << 32;

So the value gets latched when the higher part gets read last?

> +
> + return val;
> +}
> +
> +struct mvpp2_ethtool_statistics {
> + unsigned int offset;
> + const char string[ETH_GSTRING_LEN];


Re: [PATCH v3 net-next 1/5] device_cgroup: add DEVCG_ prefix to ACC_* and DEV_* constants

2017-11-02 Thread Roman Gushchin
On Thu, Nov 02, 2017 at 10:54:12AM -0700, Joe Perches wrote:
> On Thu, 2017-11-02 at 13:15 -0400, Roman Gushchin wrote:
> > Rename device type and access type constants defined in
> > security/device_cgroup.c by adding the DEVCG_ prefix.
> > 
> > The reason behind this renaming is to make them global namespace
> > friendly, as they will be moved to the corresponding header file
> > by following patches.
> []
> > diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> []
> > @@ -14,14 +14,14 @@
> >  #include 
> >  #include 
> >  
> > -#define ACC_MKNOD 1
> > -#define ACC_READ  2
> > -#define ACC_WRITE 4
> > -#define ACC_MASK (ACC_MKNOD | ACC_READ | ACC_WRITE)
> > +#define DEVCG_ACC_MKNOD 1
> > +#define DEVCG_ACC_READ  2
> > +#define DEVCG_ACC_WRITE 4
> > +#define DEVCG_ACC_MASK (DEVCG_ACC_MKNOD | DEVCG_ACC_READ | DEVCG_ACC_WRITE)
> 
> trivia:
> 
> major and minor are u32 but all the
> type and access uses seem to be "short"
> 
> Perhaps u16 (or __u16 if uapi public) instead?

It has been that way for a while, and it doesn't seem to be related to this
patchset. So I'd prefer to change this in a separate patch.

Thanks!


[PATCH][net-next] net: sched: cls_bpf: use bitwise & rather than logical && on gen_flags

2017-11-02 Thread Colin King
From: Colin Ian King 

Currently gen_flags is being operated on by a logical && operator rather
than a bitwise & operator. This looks incorrect as these should be bit
flag operations. Fix this.
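
To make the difference concrete, here is a small stand-alone sketch. The
flag macros and their values are hypothetical and chosen only for
illustration; they are not taken from the uapi headers.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration. */
#define FLAG_SKIP_SW	(1U << 1)
#define FLAG_IN_HW	(1U << 2)

/* Buggy form: logical AND is true whenever flags is non-zero, because
 * the constant on the right is always non-zero.
 */
static bool in_hw_logical(unsigned int flags)
{
	return flags && FLAG_IN_HW;
}

/* Intended form: bitwise AND tests only the FLAG_IN_HW bit. */
static bool in_hw_bitwise(unsigned int flags)
{
	return flags & FLAG_IN_HW;
}
```

With only FLAG_SKIP_SW set, in_hw_logical() reports true while
in_hw_bitwise() reports false, so with the logical operator the
-EINVAL path could never be taken for any non-zero gen_flags.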

Detected by CoverityScan, CID#1460305 ("Logical vs. bitwise operator")

Fixes: 3f7889c4c79b ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Colin Ian King 
---
 net/sched/cls_bpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 5f701c8670a2..bc3edde1b9d7 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -174,7 +174,7 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct 
cls_bpf_prog *prog,
}
}
 
-   if (addorrep && skip_sw && !(prog->gen_flags && TCA_CLS_FLAGS_IN_HW))
+   if (addorrep && skip_sw && !(prog->gen_flags & TCA_CLS_FLAGS_IN_HW))
return -EINVAL;
 
return 0;
-- 
2.14.1



Re: [PATCH] net: mvpp2: add ethtool GOP statistics

2017-11-02 Thread Andrew Lunn
Hi Miquel

> +static struct mvpp2_ethtool_statistics mvpp2_ethtool_stats[] = {

This can probably be const, and save a few bytes of RAM.

> + { MVPP2_MIB_GOOD_OCTETS_RCVD_LOW, "good_octets_received" },
> + { MVPP2_MIB_BAD_OCTETS_RCVD, "bad_octets_received" },
> + { MVPP2_MIB_CRC_ERRORS_SENT, "crc_errors_sent" },
> + { MVPP2_MIB_UNICAST_FRAMES_RCVD, "unicast_frames_received" },
> + { MVPP2_MIB_BROADCAST_FRAMES_RCVD, "broadcast_frames_received" },
> + { MVPP2_MIB_MULTICAST_FRAMES_RCVD, "multicast_frames_received" },
> + { MVPP2_MIB_FRAMES_64_OCTETS, "frames_64_octets" },
> + { MVPP2_MIB_FRAMES_65_TO_127_OCTETS, "frames_65_to_127_octet" },
> + { MVPP2_MIB_FRAMES_128_TO_255_OCTETS, "frames_128_to_255_octet" },
> + { MVPP2_MIB_FRAMES_256_TO_511_OCTETS, "frames_256_to_511_octet" },
> + { MVPP2_MIB_FRAMES_512_TO_1023_OCTETS, "frames_512_to_1023_octet" },
> + { MVPP2_MIB_FRAMES_1024_TO_MAX_OCTETS, "frames_1024_to_max_octet" },
> + { MVPP2_MIB_GOOD_OCTETS_SENT_LOW, "good_octets_sent" },
> + { MVPP2_MIB_UNICAST_FRAMES_SENT, "unicast_frames_sent" },
> + { MVPP2_MIB_MULTICAST_FRAMES_SENT, "multicast_frames_sent" },
> + { MVPP2_MIB_BROADCAST_FRAMES_SENT, "broadcast_frames_sent" },
> + { MVPP2_MIB_FC_SENT, "fc_sent" },
> + { MVPP2_MIB_FC_RCVD, "fc_received" },
> + { MVPP2_MIB_RX_FIFO_OVERRUN, "rx_fifo_overrun" },
> + { MVPP2_MIB_UNDERSIZE_RCVD, "undersize_received" },
> + { MVPP2_MIB_FRAGMENTS_RCVD, "fragments_received" },
> + { MVPP2_MIB_OVERSIZE_RCVD, "oversize_received" },
> + { MVPP2_MIB_JABBER_RCVD, "jabber_received" },
> + { MVPP2_MIB_MAC_RCV_ERROR, "mac_receive_error" },
> + { MVPP2_MIB_BAD_CRC_EVENT, "bad_crc_event" },
> + { MVPP2_MIB_COLLISION, "collision" },
> + { MVPP2_MIB_LATE_COLLISION, "late_collision" },
> +};


> +static void mvpp2_gather_hw_statistics(struct work_struct *work)
> +{
> + struct delayed_work *del_work = to_delayed_work(work);
> + struct mvpp2 *priv = container_of(del_work, struct mvpp2, stats_work);
> + struct mvpp2_port *port;
> + u64 *pstats;
> + int i, j;
> +
> + for (i = 0; i < priv->port_count; i++) {
> + if (!priv->port_list[i])
> + continue;
> +
> + port = priv->port_list[i];
> + pstats = port->ethtool_stats;
> + for (j = 0; j < ARRAY_SIZE(mvpp2_ethtool_stats); j++)
> + *pstats++ += mvpp2_read_count(
> + port, mvpp2_ethtool_stats[j].offset);
> + }
> +
> + /* No need to read again the counters right after this function if it
> +  * was called asynchronously by the user (ie. use of ethtool).
> +  */
> + cancel_delayed_work(&priv->stats_work);
> + queue_delayed_work(priv->stats_queue, &priv->stats_work,
> +MVPP2_MIB_COUNTERS_STATS_DELAY);
> +}
> +
> +static void mvpp2_ethtool_get_stats(struct net_device *dev,
> + struct ethtool_stats *stats, u64 *data)
> +{
> + struct mvpp2_port *port = netdev_priv(dev);
> +
> + /* Update statistics for all ports, copy only those actually needed */
> + mvpp2_gather_hw_statistics(&port->priv->stats_work.work);

Shouldn't there be some locking here? What if
mvpp2_gather_hw_statistics() is already running?
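
To sketch the kind of serialization meant here, a user-space analogue
with a pthread mutex follows. This is only an illustration: the struct
and function names are hypothetical, the hw array stands in for the
hardware counters, and a driver would use a kernel locking primitive
rather than pthreads.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NSTATS 3

/* Accumulated totals, protected by lock. */
struct stats_ctx {
	pthread_mutex_t lock;
	uint64_t totals[NSTATS];
};

/* Periodic worker and on-demand path both come through here, so the
 * read-and-accumulate step must not run twice concurrently.
 */
static void gather(struct stats_ctx *c, const uint32_t *hw)
{
	pthread_mutex_lock(&c->lock);
	for (size_t i = 0; i < NSTATS; i++)
		c->totals[i] += hw[i];
	pthread_mutex_unlock(&c->lock);
}

/* On-demand path: refresh the totals, then copy a consistent snapshot. */
static void get_stats(struct stats_ctx *c, const uint32_t *hw, uint64_t *out)
{
	gather(c, hw);

	pthread_mutex_lock(&c->lock);
	for (size_t i = 0; i < NSTATS; i++)
		out[i] = c->totals[i];
	pthread_mutex_unlock(&c->lock);
}
```

Without the lock, the periodic work and an ethtool request could both
add the same counter read to the totals, or interleave their updates.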

> @@ -7613,13 +7788,19 @@ static int mvpp2_port_probe(struct platform_device 
> *pdev,
>   port->base = priv->iface_base + MVPP22_GMAC_BASE(port->gop_id);
>   }
>  
> - /* Alloc per-cpu stats */
> + /* Alloc per-cpu and ethtool stats */
>   port->stats = netdev_alloc_pcpu_stats(struct mvpp2_pcpu_stats);
>   if (!port->stats) {
>   err = -ENOMEM;
>   goto err_free_irq;
>   }
>  
> + port->ethtool_stats = kzalloc(sizeof(mvpp2_ethtool_stats), GFP_KERNEL);

devm_ to make the cleanup simpler?

> + /* This work recall himself within a delay. If the cancellation returned
> +  * a non-zero value, it means a work is still running. In that case, use
> +  * use the flush (returns when the running work will be done) and cancel

One use is enough.

> +  * the new work that was just submitted to the queue but not started yet
> +  * due to the delay.
> +  */
> + if (!cancel_delayed_work(&priv->stats_work)) {
> + flush_workqueue(priv->stats_queue);
> + cancel_delayed_work(&priv->stats_work);
> + }

Why is cancel_delayed_work_sync() not enough?

Andrew


Re: [PATCH net-next] tcp: fix a lockdep issue in tcp_fastopen_reset_cipher()

2017-11-02 Thread Christoph Paasch
On 02/11/17 - 11:53:04, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> icsk_accept_queue.fastopenq.lock is only fully initialized at listen()
> time.
> 
> LOCKDEP is not happy if we attempt a spin_lock_bh() on it, because
> of missing annotation. (Although kernel runs just fine)
> 
> Lets use net->ipv4.tcp_fastopen_ctx_lock to protect ctx access.
> 
> Fixes: 1fba70e5b6be ("tcp: socket option to set TCP fast open key")
> Signed-off-by: Eric Dumazet 
> Cc: Yuchung Cheng 
> Cc: Christoph Paasch 
> ---
>  net/ipv4/tcp_fastopen.c |8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)

Reviewed-by: Christoph Paasch 




Re: [PATCH net] tcp: do not mangle skb->cb[] in tcp_make_synack()

2017-11-02 Thread Christoph Paasch
On 02/11/17 - 12:30:25, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Christoph Paasch sent a patch to address the following issue :
> 
> tcp_make_synack() is leaving some TCP private info in skb->cb[],
> then send the packet by other means than tcp_transmit_skb()
> 
> tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
> IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.
> 
> tcp_make_synack() should not use tcp_init_nondata_skb() :
> 
> tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
> queues (the ones that are only sent via tcp_transmit_skb())
> 
> This patch fixes the issue and should even save few cpu cycles ;)
> 
> Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line 
> misses")
> Signed-off-by: Eric Dumazet 
> Reported-by: Christoph Paasch 
> ---
>  net/ipv4/tcp_output.c |9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)

Reviewed-by: Christoph Paasch 




[PATCH net] tcp: do not mangle skb->cb[] in tcp_make_synack()

2017-11-02 Thread Eric Dumazet
From: Eric Dumazet 

Christoph Paasch sent a patch to address the following issue :

tcp_make_synack() is leaving some TCP private info in skb->cb[],
then send the packet by other means than tcp_transmit_skb()

tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.

tcp_make_synack() should not use tcp_init_nondata_skb() :

tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
queues (the ones that are only sent via tcp_transmit_skb())

This patch fixes the issue and should even save few cpu cycles ;)

Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet 
Reported-by: Christoph Paasch 
---
 net/ipv4/tcp_output.c |9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 823003eef3a21a5cc5c27e0be9f46159afa060df..478909f4694d00076c96b7a3be1eda62b6be8bef 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3180,13 +3180,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
th->source = htons(ireq->ir_num);
th->dest = ireq->ir_rmt_port;
skb->mark = ireq->ir_mark;
-   /* Setting of flags are superfluous here for callers (and ECE is
-* not even correctly set)
-*/
-   tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
-TCPHDR_SYN | TCPHDR_ACK);
-
-   th->seq = htonl(TCP_SKB_CB(skb)->seq);
+   skb->ip_summed = CHECKSUM_PARTIAL;
+   th->seq = htonl(tcp_rsk(req)->snt_isn);
/* XXX data is queued and acked as is. No buffer/window check */
th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
 




Re: KASAN: use-after-free Read in tipc_send_group_bcast

2017-11-02 Thread Cong Wang
#syz fix: tipc: fix a dangling pointer


Re: [PATCH] net: vrf: correct FRA_L3MDEV encode type

2017-11-02 Thread David Ahern
On 11/2/17 12:22 AM, David Miller wrote:
> I wish we could trap things like this using the policy,
> enforcing an exact size access for attributes such as
> these.


From feae5aa9dd7a26b7fbf33582738c7c89f068d81b Mon Sep 17 00:00:00 2001
From: David Ahern 
Date: Thu, 2 Nov 2017 12:18:02 -0700
Subject: [PATCH net-next] net: netlink: Update attr validation to require
 exact length for some types

Attributes using NLA_U* and NLA_S* (where * is 8, 16, 32 and 64) are
expected to be an exact length. Split these data types from
nla_attr_minlen into nla_attr_len and update validate_nla to require
the attribute to have exact length for them.

Signed-off-by: David Ahern 
---
 lib/nlattr.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/lib/nlattr.c b/lib/nlattr.c
index 927c2f19f119..b5e360e7dfc8 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -14,19 +14,23 @@
 #include 
 #include 
 
-static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
+/* for these data types attribute length must be exactly given size */
+static const u8 nla_attr_len[NLA_TYPE_MAX+1] = {
[NLA_U8]= sizeof(u8),
[NLA_U16]   = sizeof(u16),
[NLA_U32]   = sizeof(u32),
[NLA_U64]   = sizeof(u64),
-   [NLA_MSECS] = sizeof(u64),
-   [NLA_NESTED]= NLA_HDRLEN,
[NLA_S8]= sizeof(s8),
[NLA_S16]   = sizeof(s16),
[NLA_S32]   = sizeof(s32),
[NLA_S64]   = sizeof(s64),
 };
 
+static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
+   [NLA_MSECS] = sizeof(u64),
+   [NLA_NESTED]= NLA_HDRLEN,
+};
+
 static int validate_nla_bitfield32(const struct nlattr *nla,
   u32 *valid_flags_allowed)
 {
@@ -64,6 +68,13 @@ static int validate_nla(const struct nlattr *nla, int maxtype,
 
BUG_ON(pt->type > NLA_TYPE_MAX);
 
+   /* for data types NLA_U* and NLA_S* require exact length */
+   if (nla_attr_len[pt->type]) {
+   if (attrlen != nla_attr_len[pt->type])
+   return -ERANGE;
+   return 0;
+   }
+
switch (pt->type) {
case NLA_FLAG:
if (attrlen > 0)
-- 
2.13.5 (Apple Git-94)
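
The difference between the two tables can be illustrated with a small user-space model. This is a sketch under assumptions, not the lib/nlattr.c code: the toy_* names and the reduced type enum are hypothetical, and the minimum-length check is folded into the same helper for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the NLA_* types, for illustration only. */
enum toy_nla_type {
	TOY_U8,
	TOY_U16,
	TOY_U32,
	TOY_U64,
	TOY_MSECS,
	TOY_NESTED,
	TOY_TYPE_MAX = TOY_NESTED,
};

#define TOY_NLA_HDRLEN 4

/* For these types the payload must be exactly this long. */
static const uint8_t toy_attr_len[TOY_TYPE_MAX + 1] = {
	[TOY_U8]  = sizeof(uint8_t),
	[TOY_U16] = sizeof(uint16_t),
	[TOY_U32] = sizeof(uint32_t),
	[TOY_U64] = sizeof(uint64_t),
};

/* For these types the payload must be at least this long. */
static const uint8_t toy_attr_minlen[TOY_TYPE_MAX + 1] = {
	[TOY_MSECS]  = sizeof(uint64_t),
	[TOY_NESTED] = TOY_NLA_HDRLEN,
};

/* Returns 0 if attrlen is acceptable for the type, -ERANGE otherwise. */
static int toy_validate_len(enum toy_nla_type type, size_t attrlen)
{
	assert(type <= TOY_TYPE_MAX);

	/* Exact-length types: anything shorter or longer is rejected. */
	if (toy_attr_len[type])
		return attrlen == toy_attr_len[type] ? 0 : -ERANGE;

	/* Remaining types: only a lower bound applies. */
	if (attrlen < toy_attr_minlen[type])
		return -ERANGE;

	return 0;
}
```

With a check of this shape, a one-byte payload sent for an attribute whose policy says u32 is rejected up front rather than read as four bytes, which is the class of mistake the FRA_L3MDEV fix addressed.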



Re: [PATCH 6/7] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-02 Thread Florian Fainelli
On 11/01/2017 05:36 PM, David Daney wrote:
> From: Carlos Munoz 
> 
> The Cavium OCTEON cn78xx and cn73xx SoCs have network packet I/O
> hardware that is significantly different from previous generations of
> the family.
> 
> Add a new driver for this hardware.  The Ethernet MAC is called BGX on
> these devices.  Common code for the MAC is in octeon3-bgx-port.c.
> Four of these BGX MACs are grouped together and managed as a group by
> octeon3-bgx-nexus.c.  Ingress packet classification is done by the PKI
> unit initialized in octeon3-pki.c.  Queue management is done in the
> SSO, initialized by octeon3-sso.c.  Egress is handled by the PKO,
> initialized in octeon3-pko.c.
> 
> Signed-off-by: Carlos Munoz 
> Signed-off-by: Steven J. Hill 
> Signed-off-by: David Daney 
> ---

> +static char *mix_port;
> +module_param(mix_port, charp, 0444);
> +MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX interfaces.");

Can you derive this from Device Tree /platform data configuration?

> +
> +static char *pki_port;
> +module_param(pki_port, charp, 0444);
> +MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");

Likewise

> +
> +#define MAX_MIX_PER_NODE 2
> +
> +#define MAX_MIX  (MAX_NODES * MAX_MIX_PER_NODE)
> +
> +/**
> + * struct mix_port_lmac - Describes a lmac that connects to a mix
> + * port. The lmac must be on the same node as
> + * the mix.
> + * @node:Node of the lmac.
> + * @bgx: Bgx of the lmac.
> + * @lmac:Lmac index.
> + */
> +struct mix_port_lmac {
> + int node;
> + int bgx;
> + int lmac;
> +};
> +
> +/* mix_ports_lmacs contains all the lmacs connected to mix ports */
> +static struct mix_port_lmac mix_port_lmacs[MAX_MIX];
> +
> +/* pki_ports keeps track of the lmacs connected to the pki */
> +static bool pki_ports[MAX_NODES][MAX_BGX_PER_NODE][MAX_LMAC_PER_BGX];
> +
> +/* Created platform devices get added to this list */
> +static struct list_head pdev_list;
> +static struct mutex pdev_list_lock;
> +
> +/* Created platform device use this structure to add themselves to the list */
> +struct pdev_list_item {
> + struct list_headlist;
> + struct platform_device  *pdev;
> +};

Don't you have a top-level platform device that you could use which
would hold this data instead of having it here?

[snip]

> +/* Registers are accessed via xkphys */
> +#define SSO_BASE 0x16700ull
> +#define SSO_ADDR(node)   (SET_XKPHYS + NODE_OFFSET(node) +  \
> +  SSO_BASE)
> +#define GRP_OFFSET(grp)  ((grp) << 16)
> +#define GRP_ADDR(n, g)   (SSO_ADDR(n) + GRP_OFFSET(g))
> +#define SSO_GRP_AQ_CNT(n, g) (GRP_ADDR(n, g)+ 0x2700)
> +
> +#define MIO_PTP_BASE 0x10700ull
> +#define MIO_PTP_ADDR(node)   (SET_XKPHYS + NODE_OFFSET(node) +  \
> +  MIO_PTP_BASE)
> +#define MIO_PTP_CLOCK_CFG(node)  (MIO_PTP_ADDR(node) + 0xf00)
> +#define MIO_PTP_CLOCK_HI(node)   (MIO_PTP_ADDR(node) + 0xf10)
> +#define MIO_PTP_CLOCK_COMP(node) (MIO_PTP_ADDR(node) + 0xf18)

I am sure this will work great on anything but MIPS64 ;)

> +
> +struct octeon3_ethernet;
> +
> +struct octeon3_rx {
> + struct napi_struct  napi;
> + struct octeon3_ethernet *parent;
> + int rx_grp;
> + int rx_irq;
> + cpumask_t rx_affinity_hint;
> +} cacheline_aligned_in_smp;
> +
> +struct octeon3_ethernet {
> + struct bgx_port_netdev_priv bgx_priv; /* Must be first element. */
> + struct list_head list;
> + struct net_device *netdev;
> + enum octeon3_mac_type mac_type;
> + struct octeon3_rx rx_cxt[MAX_RX_QUEUES];
> + struct ptp_clock_info ptp_info;
> + struct ptp_clock *ptp_clock;
> + struct cyclecounter cc;
> + struct timecounter tc;
> + spinlock_t ptp_lock;/* Serialize ptp clock adjustments */
> + int num_rx_cxt;
> + int pki_aura;
> + int pknd;
> + int pko_queue;
> + int node;
> + int interface;
> + int index;
> + int rx_buf_count;
> + int tx_complete_grp;
> + int rx_timestamp_hw:1;
> + int tx_timestamp_hw:1;
> + spinlock_t stat_lock;   /* Protects stats counters */
> + u64 last_packets;
> + u64 last_octets;
> + u64 last_dropped;
> + atomic64_t rx_packets;
> + atomic64_t rx_octets;
> + atomic64_t rx_dropped;
> + atomic64_t rx_errors;
> + atomic64_t rx_length_errors;
> + atomic64_t rx_crc_errors;
> + atomic64_t tx_packets;
> + atomic64_t tx_octets;
> + atomic64_t tx_dropped;

Do you really need those to be truly atomic64_t types, can't you use u64
and use the helpers from 

Re: [PATCH 4/7] MIPS: Octeon: Add Free Pointer Unit (FPA) support.

2017-11-02 Thread David Daney

On 11/02/2017 11:04 AM, Florian Fainelli wrote:

On 11/02/2017 09:27 AM, David Daney wrote:

On 11/01/2017 08:29 PM, Florian Fainelli wrote:

On 11/01/17 at 17:36, David Daney wrote:

From: Carlos Munoz 

  From the hardware user manual: "The FPA is a unit that maintains
pools of pointers to free L2/DRAM memory. To provide QoS, the pools
are referenced indirectly through 1024 auras. Both core software
and hardware units allocate and free pointers."


This looks possibly similar to what
drivers/net/ethernet/marvell/mvneta_bm.c implements; can you see if you can
make any use of genpool_* and include/net/hwbm.h here as well?


Yikes!  Is it permitted to put function definitions that are not "static
inline" in header files?


Meh well, this is not even resembling what we initially discussed, so I
was hoping we could build more interesting features on top of this.



The driver currently doesn't use page fragments, so I don't think that
the hwbm thing can be used.

Also the FPA unit is used to control RED and back pressure in the PKI
(packet input processor), which are features not considered in hwbm.

The OCTEON-III hardware also uses the FPA for non-packet-buffer memory
allocations.  So for those, it seems that hwbm is also not a good fit.


OK, let me see if I understand how FPA works, can we say that this is
more or less a buffer tokenizer in that, you give it a buffer physical
address and it returns an unique identifier that the FPA uses for actual
packet passing, transmission and other manipulations?



At a high level, think of the FPA as a FIFO containing DMA addresses 
used by hardware.  The FIFO property is not guaranteed, so it is best to 
consider it as a pool of buffer addresses.


Software pushes pointers into the FPA, and the hardware RX unit (PKI) 
pops them off when it needs an RX buffer.  The TX unit (PKO) and input 
queue (SSO) also use memory obtained from the FPA as backing store for 
their internal queues.


In addition to obtaining buffers, the PKI uses the number of entries in 
an FPA pool to control RED and back pressure.


There are other features not used by the driver like threshold 
interrupts, and pointer alignment so you don't have to calculate the 
buffer address from a pointer to the middle of the buffer when freeing.
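
As a rough software model of the description above (a sketch only; the toy_* names, the fixed capacity and the threshold parameter are hypothetical and say nothing about the real FPA registers or the driver API), the pool can be pictured as an unordered stack of DMA addresses whose fill level also drives back pressure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POOL_CAPACITY 8	/* hypothetical; real pools are far larger */

/* A pool of free buffer addresses; ordering is deliberately unspecified. */
struct toy_fpa_pool {
	uint64_t addrs[TOY_POOL_CAPACITY];
	size_t count;
};

static void toy_pool_init(struct toy_fpa_pool *p)
{
	p->count = 0;
}

/* Software side: hand a free buffer to the pool. */
static bool toy_pool_push(struct toy_fpa_pool *p, uint64_t dma_addr)
{
	if (p->count == TOY_POOL_CAPACITY)
		return false;
	p->addrs[p->count++] = dma_addr;
	return true;
}

/* Consumer side (the PKI in the description): take a buffer for RX. */
static bool toy_pool_pop(struct toy_fpa_pool *p, uint64_t *dma_addr)
{
	assert(dma_addr != NULL);
	if (p->count == 0)
		return false;
	*dma_addr = p->addrs[--p->count];
	return true;
}

/* Back pressure is asserted when too few free buffers remain. */
static bool toy_pool_backpressure(const struct toy_fpa_pool *p, size_t threshold)
{
	return p->count < threshold;
}
```

The model pops in LIFO order on purpose: as noted above, callers should not rely on FIFO behaviour, only on getting some free buffer back.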





There were a few funky things in the network driver, I will comment there.
--
Florian





Re: suspicious RCU usage at ./include/linux/inetdevice.h:LINE

2017-11-02 Thread Florian Westphal
Cong Wang  wrote:
> > CPU: 0 PID: 23859 Comm: syz-executor2 Not tainted 4.14.0-rc5+ #140
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:16 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:52
> >  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4665
> >  __in_dev_get_rtnl include/linux/inetdevice.h:230 [inline]
> >  fib_dump_info+0x1136/0x13d0 net/ipv4/fib_semantics.c:1377
> >  inet_rtm_getroute+0xf97/0x2d70 net/ipv4/route.c:2785
> 
> This is introduced by:
> 
> commit 394f51abb3d04f33fb798f04b16ae6b0491ea4ec
> Author: Florian Westphal 
> Date:   Tue Aug 15 16:34:44 2017 +0200
> 
> ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl
> 
> Signed-off-by: Florian Westphal 
> Signed-off-by: David S. Miller 
> 
> Looks like we need a wrapper for rcu_dereference_protected(dev->ip_ptr).

Yes, that's the alternative to
https://patchwork.ozlabs.org/patch/833401/

which switches to _rcu version.


[PATCH net-next 2/3] openvswitch: reliable interface indentification in port dumps

2017-11-02 Thread Flavio Leitner
From: Jiri Benc 

This patch allows reliable identification of netdevice interfaces connected
to openvswitch bridges. In particular, user space queries the netdev
interfaces belonging to the ports for statistics, up/down state, etc.
Datapath dump needs to provide enough information for the user space to be
able to do that.

Currently, only interface names are returned. This is not sufficient, as
openvswitch allows its ports to be in different name spaces and the
interface name is valid only in its name space. What is needed and generally
used in other netlink APIs, is the pair ifindex+netnsid.

The solution is addition of the ifindex+netnsid pair (or only ifindex if in
the same name space) to vport get/dump operation.

On request side, ideally the ifindex+netnsid pair could be used to
get/set/del the corresponding vport. This is not implemented by this patch
and can be added later if needed.

Signed-off-by: Jiri Benc 
---
 include/uapi/linux/openvswitch.h |  2 ++
 net/openvswitch/datapath.c   | 47 +---
 net/openvswitch/datapath.h   |  4 ++--
 net/openvswitch/dp_notify.c  |  4 ++--
 4 files changed, 40 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 0cd6f8833147..c9b638c82941 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -257,6 +257,8 @@ enum ovs_vport_attr {
/* receiving upcalls */
OVS_VPORT_ATTR_STATS,   /* struct ovs_vport_stats */
OVS_VPORT_ATTR_PAD,
+   OVS_VPORT_ATTR_IFINDEX,
+   OVS_VPORT_ATTR_NETNSID,
__OVS_VPORT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index c3aec6227c91..4d38ac044cee 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1848,7 +1848,8 @@ static struct genl_family dp_datapath_genl_family __ro_after_init = {
 
 /* Called with ovs_mutex or RCU read lock. */
 static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
-  u32 portid, u32 seq, u32 flags, u8 cmd)
+  struct net *net, u32 portid, u32 seq,
+  u32 flags, u8 cmd)
 {
struct ovs_header *ovs_header;
struct ovs_vport_stats vport_stats;
@@ -1864,9 +1865,17 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
if (nla_put_u32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no) ||
nla_put_u32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type) ||
nla_put_string(skb, OVS_VPORT_ATTR_NAME,
-  ovs_vport_name(vport)))
+  ovs_vport_name(vport)) ||
+   nla_put_u32(skb, OVS_VPORT_ATTR_IFINDEX, vport->dev->ifindex))
goto nla_put_failure;
 
+   if (!net_eq(net, dev_net(vport->dev))) {
+   int id = peernet2id_alloc(net, dev_net(vport->dev));
+
+   if (nla_put_s32(skb, OVS_VPORT_ATTR_NETNSID, id))
+   goto nla_put_failure;
+   }
+
ovs_vport_get_stats(vport, &vport_stats);
if (nla_put_64bit(skb, OVS_VPORT_ATTR_STATS,
  sizeof(struct ovs_vport_stats), &vport_stats,
@@ -1896,8 +1905,8 @@ static struct sk_buff *ovs_vport_cmd_alloc_info(void)
 }
 
 /* Called with ovs_mutex, only via ovs_dp_notify_wq(). */
-struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, u32 portid,
-u32 seq, u8 cmd)
+struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, struct net *net,
+u32 portid, u32 seq, u8 cmd)
 {
struct sk_buff *skb;
int retval;
@@ -1906,7 +1915,7 @@ struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, u32 portid,
if (!skb)
return ERR_PTR(-ENOMEM);
 
-   retval = ovs_vport_cmd_fill_info(vport, skb, portid, seq, 0, cmd);
+   retval = ovs_vport_cmd_fill_info(vport, skb, net, portid, seq, 0, cmd);
BUG_ON(retval < 0);
 
return skb;
@@ -1920,6 +1929,8 @@ static struct vport *lookup_vport(struct net *net,
struct datapath *dp;
struct vport *vport;
 
+   if (a[OVS_VPORT_ATTR_IFINDEX])
+   return ERR_PTR(-EOPNOTSUPP);
if (a[OVS_VPORT_ATTR_NAME]) {
vport = ovs_vport_locate(net, nla_data(a[OVS_VPORT_ATTR_NAME]));
if (!vport)
@@ -1944,6 +1955,7 @@ static struct vport *lookup_vport(struct net *net,
return vport;
} else
return ERR_PTR(-EINVAL);
+
 }
 
 /* Called with ovs_mutex */
@@ -1983,6 +1995,8 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] ||
!a[OVS_VPORT_ATTR_UPCALL_PID])
return -EINVAL;
+   if 

[PATCH net-next 3/3] rtnetlink: use netnsid to query interface

2017-11-02 Thread Flavio Leitner
From: Jiri Benc 

Currently, when an application gets netnsid from the kernel (for example as
the result of RTM_GETLINK call on one end of the veth pair), it's not very
useful. There's no reliable way to get to the netns fd from the netnsid, nor
does any kernel API accept netnsid.

Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
netns with the given netnsid in such case. Of course, the calling process
needs to have enough capabilities in the target name space; for now, require
CAP_NET_ADMIN. This can be relaxed in the future.

To signal to the calling process that the kernel understood the new
IFLA_IF_NETNSID attribute in the query, it will include it in the response.
This is needed to detect older kernels, as they will just ignore
IFLA_IF_NETNSID and query in the current name space.

This patch implements IFLA_IF_NETNSID only for get and dump. For set
operations, this can be extended later.

Signed-off-by: Jiri Benc 
---
 include/uapi/linux/if_link.h |   1 +
 net/core/rtnetlink.c | 103 +++
 2 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b037e0ab1975..ba705219df40 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -159,6 +159,7 @@ enum {
IFLA_XDP,
IFLA_EVENT,
IFLA_NEW_NETNSID,
+   IFLA_IF_NETNSID,
__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index de24d394c69e..8a8c51937edf 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -921,7 +921,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
   + nla_total_size(4)  /* IFLA_EVENT */
   + nla_total_size(4)  /* IFLA_NEW_NETNSID */
   + nla_total_size(1); /* IFLA_PROTO_DOWN */
-
+  + nla_total_size(4)  /* IFLA_IF_NETNSID */
+  + 0;
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -1370,13 +1371,14 @@ static noinline_for_stack int nla_put_ifalias(struct sk_buff *skb,
 }
 
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
- const struct net_device *dev)
+ const struct net_device *dev,
+ struct net *src_net)
 {
if (dev->rtnl_link_ops && dev->rtnl_link_ops->get_link_net) {
struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
 
if (!net_eq(dev_net(dev), link_net)) {
-   int id = peernet2id_alloc(dev_net(dev), link_net);
+   int id = peernet2id_alloc(src_net, link_net);
 
if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
return -EMSGSIZE;
@@ -1427,10 +1429,11 @@ static int rtnl_fill_link_af(struct sk_buff *skb,
return 0;
 }
 
-static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
+static int rtnl_fill_ifinfo(struct sk_buff *skb,
+   struct net_device *dev, struct net *src_net,
int type, u32 pid, u32 seq, u32 change,
unsigned int flags, u32 ext_filter_mask,
-   u32 event, int *new_nsid)
+   u32 event, int *new_nsid, int tgt_netnsid)
 {
struct ifinfomsg *ifm;
struct nlmsghdr *nlh;
@@ -1448,6 +1451,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
ifm->ifi_flags = dev_get_flags(dev);
ifm->ifi_change = change;
 
+   if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
+   goto nla_put_failure;
+
if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
nla_put_u32(skb, IFLA_TXQLEN, dev->tx_queue_len) ||
nla_put_u8(skb, IFLA_OPERSTATE,
@@ -1513,7 +1519,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
goto nla_put_failure;
}
 
-   if (rtnl_fill_link_netnsid(skb, dev))
+   if (rtnl_fill_link_netnsid(skb, dev, src_net))
goto nla_put_failure;
 
if (new_nsid &&
@@ -1571,6 +1577,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_XDP]  = { .type = NLA_NESTED },
[IFLA_EVENT]= { .type = NLA_U32 },
[IFLA_GROUP]= { .type = NLA_U32 },
+   [IFLA_IF_NETNSID]   = { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1674,9 +1681,28 @@ static bool link_dump_filtered(struct net_device *dev,
return false;
 }
 
+static struct net *get_target_net(struct sk_buff *skb, int netnsid)
+{
+   struct net *net;
+
+   net = get_net_ns_by_id(sock_net(skb->sk), netnsid);
+   if (!net)
+   return ERR_PTR(-EINVAL);
+
+   /* For 

[PATCH net-next 1/3] net: export peernet2id_alloc

2017-11-02 Thread Flavio Leitner
From: Jiri Benc 

It will be used by openvswitch.

Signed-off-by: Jiri Benc 
---
 net/core/net_namespace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6cfdc7c84c48..b797832565d3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -234,6 +234,7 @@ int peernet2id_alloc(struct net *net, struct net *peer)
rtnl_net_notifyid(net, RTM_NEWNSID, id);
return id;
 }
+EXPORT_SYMBOL_GPL(peernet2id_alloc);
 
 /* This function returns, if assigned, the id of a peer netns. */
 int peernet2id(struct net *net, struct net *peer)
-- 
2.13.6



[PATCH net-next 0/3] Allow openvswitch to query ports in another netns.

2017-11-02 Thread Flavio Leitner
Today Open vSwitch users are moving internal ports to other namespaces and
although packets are flowing OK, the userspace daemon can't find out basic
information like if the port is UP or DOWN, for instance.

This patchset extends openvswitch API to retrieve the current netnsid of
a port. It will be used by the userspace daemon to find out in which netns
the port is located.

This patchset also extends the rtnetlink getlink call to accept and operate
on a given netnsid.  More details are available in each patch.

Jiri Benc (3):
  net: export peernet2id_alloc
  openvswitch: reliable interface indentification in port dumps
  rtnetlink: use netnsid to query interface

 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/openvswitch.h |   2 +
 net/core/net_namespace.c |   1 +
 net/core/rtnetlink.c | 103 ---
 net/openvswitch/datapath.c   |  47 +-
 net/openvswitch/datapath.h   |   4 +-
 net/openvswitch/dp_notify.c  |   4 +-
 7 files changed, 127 insertions(+), 35 deletions(-)

-- 
2.13.6
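
To make the motivation concrete, here is a small user-space sketch of why a name alone is ambiguous once ports live in different name spaces, while the ifindex+netnsid pair is not. All toy_* names are hypothetical and TOY_NETNSID_LOCAL is an invented marker for "same netns as the querier"; none of this is the openvswitch or rtnetlink API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NETNSID_LOCAL (-1)	/* hypothetical: port is in the querier's netns */

/* What a port dump entry would carry once the pair is reported. */
struct toy_port {
	const char *name;
	int netnsid;
	int ifindex;
};

/* Name lookup: returns the first match, which may be the wrong device. */
static const struct toy_port *toy_find_by_name(const struct toy_port *ports,
						size_t n, const char *name)
{
	assert(ports != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (strcmp(ports[i].name, name) == 0)
			return &ports[i];
	return NULL;
}

/* Pair lookup: netnsid plus ifindex identifies one device at most. */
static const struct toy_port *toy_find_by_id(const struct toy_port *ports,
					      size_t n, int netnsid, int ifindex)
{
	assert(ports != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (ports[i].netnsid == netnsid && ports[i].ifindex == ifindex)
			return &ports[i];
	return NULL;
}
```

Two ports named "eth0" in different name spaces both match the name lookup, so the caller cannot tell which one it got; the pair lookup returns exactly the intended entry or nothing.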



Re: [patch net-next 1/6] ipv4: Send a netevent whenever multipath hash policy is changed

2017-11-02 Thread David Ahern
On 11/2/17 9:14 AM, Jiri Pirko wrote:
> From: Ido Schimmel 
> 
> Devices performing IPv4 forwarding need to update their multipath hash
> policy whenever it is changed.
> 
> Inform these devices by generating a netevent.
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/netevent.h |  1 +
>  net/ipv4/sysctl_net_ipv4.c | 20 +++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 

LGTM.

Acked-by: David Ahern 



Re: suspicious RCU usage at ./include/linux/inetdevice.h:LINE

2017-11-02 Thread Cong Wang
On Thu, Nov 2, 2017 at 3:53 AM, syzbot wrote:
> Hello,
>
> syzkaller hit the following crash on
> ce43f4fd6f103681c7485c2b1967179647e73555
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
>
>
>
>
>
> =
> WARNING: suspicious RCU usage
> 4.14.0-rc5+ #140 Not tainted
> -
> ./include/linux/inetdevice.h:230 suspicious rcu_dereference_protected()
> usage!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syz-executor2/23859:
>  #0:  (rcu_read_lock){}, at: []
> inet_rtm_getroute+0xaa0/0x2d70 net/ipv4/route.c:2738
>
> stack backtrace:
> CPU: 0 PID: 23859 Comm: syz-executor2 Not tainted 4.14.0-rc5+ #140
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4665
>  __in_dev_get_rtnl include/linux/inetdevice.h:230 [inline]
>  fib_dump_info+0x1136/0x13d0 net/ipv4/fib_semantics.c:1377
>  inet_rtm_getroute+0xf97/0x2d70 net/ipv4/route.c:2785

This is introduced by:

commit 394f51abb3d04f33fb798f04b16ae6b0491ea4ec
Author: Florian Westphal 
Date:   Tue Aug 15 16:34:44 2017 +0200

ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl

Signed-off-by: Florian Westphal 
Signed-off-by: David S. Miller 

Looks like we need a wrapper for rcu_dereference_protected(dev->ip_ptr).


Re: [PATCH 6/7] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-02 Thread Florian Fainelli
On 11/02/2017 11:31 AM, David Daney wrote:
> On 11/02/2017 09:56 AM, Andrew Lunn wrote:
>>> OK, now I think I understand.  Yes, the MAC can be hardwired to a
>>> switch.
>>> In fact, there are system designs that do exactly that.
>>>
>>> We try to handle this case by not having a "phy-handle" property in the
>>> device tree.  The link to the remote device (switch IC in this case) is
>>> brought up on ndo_open()
>>
>> O.K, so you totally ignore the Linux way of doing this and hack
>> together your own proprietary solution.
> 
> I am going to add handling of the "phy-mode" property, but other than
> that I don't know what the "Linux way" of specifying a hard MAC-to-MAC
> connection with no intervening phy devices is.  Whether the remote MAC is
> a switch, or something else, would seem to be irrelevant.  All we are
> concerned about in this code is putting the thing into a state where
> data flows in both directions through the MAC.

The canonical way to support that type of connection is to use a
fixed-link property describing the link between the two MACs, ideally
putting the same fixed-link property on both sides.

> 
> A pointer to an existing device tree binding for an Ethernet device that
> has no (or an optional) phy device would be useful, we can try to do the
> same.
> 
> 
>>  
>>> There may be opportunities to improve how this works in the future,
>>> but the
>>> current code is serviceable.
>>
>> It might be serviceable, but it will never get into mainline. For
>> mainline, you need to use DSA.
>>
>> http://elixir.free-electrons.com/linux/v4.9.60/source/Documentation/networking/dsa/dsa.txt
>>
> 
> 
> I am truly at a loss here.  That DSA document states:
> 
>  Master network devices are regular, unmodified Linux
>  network device drivers for the CPU/management Ethernet
>  interface.
> 
> What modification do you suggest I make?

If you support normal phy_device and fixed-link devices, you should be
good as far as using PHYLIB and interfacing with Ethernet switches
using DSA for instance. What Andrew is asking you though is to make sure
that the platform device dance between the bgx drivers and the other
modules preserves the Device Tree parenting, of_node pointers such that
a DSA switch, which needs to reference a CPU/management port, has a
chance to successfully look up that node via of_find_net_device_by_node().

> 
> 
>>
>> Getting back to my original point, having these platform devices can
>> cause issues for DSA. Freescale FMAN has a similar architecture, and
>> it took a while to restructure it to make DSA work.
>>
>> https://www.spinics.net/lists/netdev/msg459394.html
>>
>> Andrew
>>
> 


-- 
Florian


[PATCH net-next] tcp: fix a lockdep issue in tcp_fastopen_reset_cipher()

2017-11-02 Thread Eric Dumazet
From: Eric Dumazet 

icsk_accept_queue.fastopenq.lock is only fully initialized at listen()
time.

LOCKDEP is not happy if we attempt a spin_lock_bh() on it, because
of missing annotation. (Although kernel runs just fine)

Lets use net->ipv4.tcp_fastopen_ctx_lock to protect ctx access.

Fixes: 1fba70e5b6be ("tcp: socket option to set TCP fast open key")
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Christoph Paasch 
---
 net/ipv4/tcp_fastopen.c |8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index e0a4b56644aa0f8ccd384644fde7e4841da29d3f..91762be58accaed7d8974e723b580fb3d7922fca 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -92,20 +92,18 @@ error:  kfree(ctx);
memcpy(ctx->key, key, len);
 
 
+   spin_lock(&net->ipv4.tcp_fastopen_ctx_lock);
if (sk) {
q = &inet_csk(sk)->icsk_accept_queue.fastopenq;
-   spin_lock_bh(&q->lock);
octx = rcu_dereference_protected(q->ctx,
-lockdep_is_held(&q->lock));
+   lockdep_is_held(&net->ipv4.tcp_fastopen_ctx_lock));
rcu_assign_pointer(q->ctx, ctx);
-   spin_unlock_bh(&q->lock);
} else {
-   spin_lock(&net->ipv4.tcp_fastopen_ctx_lock);
octx = rcu_dereference_protected(net->ipv4.tcp_fastopen_ctx,
lockdep_is_held(&net->ipv4.tcp_fastopen_ctx_lock));
rcu_assign_pointer(net->ipv4.tcp_fastopen_ctx, ctx);
-   spin_unlock(&net->ipv4.tcp_fastopen_ctx_lock);
}
+   spin_unlock(&net->ipv4.tcp_fastopen_ctx_lock);
 
if (octx)
call_rcu(&octx->rcu, tcp_fastopen_ctx_free);



