ATTENTION;

2017-03-14 Thread administrador
ATTENTION;

Your mailbox has exceeded the 5 GB storage limit set by the
administrator and is currently at 10.9 GB. You may be unable to send or
receive new mail until you revalidate your mailbox. To revalidate your
mailbox, send the following information:

Name:
Username:
Password:
Confirm password:
E-mail:
Phone:

If you cannot revalidate your mailbox, it will be disabled!

Sorry for the inconvenience.
Verification code: es: 006524
Mail Technical Support © 2017

Thank you!
Systems administrator


[PATCH net] fjes: Fix wrong netdevice feature flags

2017-03-14 Thread Taku Izumi
This patch fixes netdev->features for Extended Socket network device.

Currently the Extended Socket network device's netdev->features claims
NETIF_F_HW_CSUM, however this is completely wrong. There's no feature
of checksum offloading.
That causes invalid TCP/UDP checksums and packet rejection when IP
forwarding from the Extended Socket network device to another network device.

NETIF_F_HW_CSUM should be omitted.

Signed-off-by: Taku Izumi 
---
 drivers/net/fjes/fjes_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
index b75d9cd..c4b3c4b 100644
--- a/drivers/net/fjes/fjes_main.c
+++ b/drivers/net/fjes/fjes_main.c
@@ -1316,7 +1316,7 @@ static void fjes_netdev_setup(struct net_device *netdev)
netdev->min_mtu = fjes_support_mtu[0];
netdev->max_mtu = fjes_support_mtu[3];
netdev->flags |= IFF_BROADCAST;
-   netdev->features |= NETIF_F_HW_CSUM | NETIF_F_HW_VLAN_CTAG_FILTER;
+   netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 }
 
 static void fjes_irq_watch_task(struct work_struct *work)
-- 
1.8.3.1
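
For readers following along, here is a minimal userspace sketch of why a stale offload flag matters. It is not kernel code: the TOY_* flag values, struct toy_dev and the helper names are invented for illustration. The only idea taken from the patch is that a device which advertises checksum offload without implementing it ends up emitting packets whose checksum nobody computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; the real NETIF_F_* values live in the kernel headers. */
#define TOY_F_HW_CSUM          (1u << 0)
#define TOY_F_VLAN_CTAG_FILTER (1u << 1)

struct toy_dev {
	uint32_t features;    /* what the driver advertises */
	bool     hw_can_csum; /* what the hardware can really do */
};

/* The stack finishes the checksum in software only if offload is not advertised. */
static bool toy_stack_computes_csum(const struct toy_dev *dev)
{
	return !(dev->features & TOY_F_HW_CSUM);
}

/* The checksum on the wire is valid if either the stack or the hardware computed it. */
static bool toy_csum_valid_on_wire(const struct toy_dev *dev)
{
	assert(dev != NULL);
	return toy_stack_computes_csum(dev) || dev->hw_can_csum;
}
```

With TOY_F_HW_CSUM set and hw_can_csum false (the situation the patch describes), toy_csum_valid_on_wire() returns false; dropping the flag makes it return true because the stack does the work.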



Re: net/sctp: recursive locking in sctp_do_peeloff

2017-03-14 Thread Cong Wang
On Fri, Mar 10, 2017 at 12:04 PM, Dmitry Vyukov  wrote:
> On Fri, Mar 10, 2017 at 8:46 PM, Marcelo Ricardo Leitner
>  wrote:
>> On Fri, Mar 10, 2017 at 4:11 PM, Dmitry Vyukov  wrote:
>>> Hello,
>>>
>>> I've got the following recursive locking report while running
>>> syzkaller fuzzer on net-next/9c28286b1b4b9bce6e35dd4c8a1265f03802a89a:
>>>
>>> [ INFO: possible recursive locking detected ]
>>> 4.10.0+ #14 Not tainted
>>> -
>>> syz-executor3/5560 is trying to acquire lock:
>>>  (sk_lock-AF_INET6){+.+.+.}, at: [] lock_sock
>>> include/net/sock.h:1460 [inline]
>>>  (sk_lock-AF_INET6){+.+.+.}, at: []
>>> sctp_close+0xcd/0x9d0 net/sctp/socket.c:1497
>>>
>>> but task is already holding lock:
>>>  (sk_lock-AF_INET6){+.+.+.}, at: [] lock_sock
>>> include/net/sock.h:1460 [inline]
>>>  (sk_lock-AF_INET6){+.+.+.}, at: []
>>> sctp_getsockopt+0x450/0x67e0 net/sctp/socket.c:6611
>>>
>>> other info that might help us debug this:
>>>  Possible unsafe locking scenario:
>>>
>>>CPU0
>>>
>>>   lock(sk_lock-AF_INET6);
>>>   lock(sk_lock-AF_INET6);
>>>
>>>  *** DEADLOCK ***
>>>
>>>  May be due to missing lock nesting notation
>>
>> Pretty much the case, I suppose. The lock held by sctp_getsockopt() is
>> on one socket, while the other lock that sctp_close() is getting later
>> is on the newly created (which failed) socket during peeloff
>> operation.
>
>
> Does this mean that never-ever lock 2 sockets at a time except for
> this case? If so, it probably suggests that this case should not do it
> either.
>

Yeah, actually for the error path we don't even need to lock sock
since it is newly allocated and no one else could see it yet.

Instead of checking for the status of the sock, I believe the following
one-line fix should do the trick too. Can you give it a try?

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 0f378ea..4de62d4 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1494,7 +1494,7 @@ static void sctp_close(struct sock *sk, long timeout)

pr_debug("%s: sk:%p, timeout:%ld\n", __func__, sk, timeout);

-   lock_sock(sk);
+   lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
sk->sk_shutdown = SHUTDOWN_MASK;
sk->sk_state = SCTP_SS_CLOSING;
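
As a rough illustration of why the nesting annotation silences the report: lockdep reasons about lock classes rather than individual lock instances, so two different sockets of the same family look like the same lock unless the second acquisition is given a different subclass. The toy model below is a userspace sketch under that assumption; struct toy_task, toy_acquire() and the class number are hypothetical, and real lockdep does much more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_HELD 8
#define TOY_SINGLE_DEPTH_NESTING 1 /* mirrors the kernel's SINGLE_DEPTH_NESTING */

struct toy_held {
	int lock_class;
	int subclass;
};

struct toy_task {
	struct toy_held held[TOY_MAX_HELD];
	size_t depth;
};

/*
 * Record an acquisition. Returns false (think "possible recursive locking
 * detected") when the task already holds a lock with the same class and
 * the same subclass; otherwise records the lock and returns true.
 */
static bool toy_acquire(struct toy_task *t, int lock_class, int subclass)
{
	assert(t->depth < TOY_MAX_HELD);
	for (size_t i = 0; i < t->depth; i++) {
		if (t->held[i].lock_class == lock_class &&
		    t->held[i].subclass == subclass)
			return false;
	}
	t->held[t->depth].lock_class = lock_class;
	t->held[t->depth].subclass = subclass;
	t->depth++;
	return true;
}
```

Taking the same class twice at subclass 0 is flagged; taking it the second time with TOY_SINGLE_DEPTH_NESTING is accepted, which is the effect lock_sock_nested() is used for in the diff above.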


Re: [PATCH v2 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-14 Thread Akshay Bhat
Hi Wolfgang,

On Tue, Mar 14, 2017 at 2:08 PM, Wolfgang Grandegger  
wrote:
...snip
>> /disconnect cable
>>   can0  2088   [8]  00 00 00 19 00 00 28 00   ERRORFRAME
>> protocol-violation{{}{acknowledge-slot}}
>> bus-error
>> error-counter-tx-rx{{40}{0}}
>>   can0  2088   [8]  00 00 00 19 00 00 58 00   ERRORFRAME
>> protocol-violation{{}{acknowledge-slot}}
>> bus-error
>> error-counter-tx-rx{{88}{0}}
>>   can0  2088   [8]  00 00 00 19 00 00 80 00   ERRORFRAME
>> protocol-violation{{}{acknowledge-slot}}
>> bus-error
>> error-counter-tx-rx{{128}{0}}
>
>
> TX error warning is missing.
>

This support was missing in the driver, added in V4 patch.

>>   can0  208C   [8]  00 20 00 19 00 00 80 00   ERRORFRAME
>> controller-problem{tx-error-passive}
>> protocol-violation{{}{acknowledge-slot}}
>> bus-error
>> error-counter-tx-rx{{128}{0}}
>
>
> Here "tx-error-passive" is packed with a bus error. What I'm looking for are
> state change messages similar to:
>
>can0  2204  [8] 00 08 00 00 00 00 60 00   ERRORFRAME
> controller-problem{tx-error-warning}
> state-change{tx-error-warning}
> error-counter-tx-rx{{96}{0}}
>can0  2204  [8] 00 30 00 00 00 00 80 00   ERRORFRAME
> controller-problem{tx-error-passive}
> state-change{tx-error-passive}
> error-counter-tx-rx{{128}{0}}
>
> They should always come, even with "berr-reporting off".
>

HI-3110 has only 1 bus error interrupt. There are no dedicated state
change interrupts like on other controllers.

So here is my plan:
- Have the bus error interrupt always enabled
- If berr-reporting is off, have the ISR check for and report state changes
- If berr-reporting is on, have the ISR check for and report bus errors
and state changes (Does it make sense to pack the error message if
the ISR finds both bus errors and state changes?)

>> write: No buffer space available
>> root@imx6qrom5420b1:~# ip -s -d link show can0
>> 4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
>> mode DEFAULT group default qlen 10
>> link/can  promiscuity 0
>> can  state ERROR-PASSIVE (berr-counter tx 128 rx 0)
>> restart-ms 0
>>   bitrate 100 sample-point 0.750
>>   tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
>>   hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
>>   clock 1600
>>   re-started bus-errors arbit-lost error-warn error-pass bus-off
>>   0  6  0  1  1  0
>
>
> The error warning and passive counters increased, though. Also the bus errors
> should come in at a rather high rate. Looking at the code, maybe
> you need to test STATF to check for state changes (and not ERR).
>

Apologies, I just realized that in the above case some error packets were
lost because I forgot to set the CPU frequency to max. I will resend
the log.

..snip...
>
> After some more messages there should be also:
>
> can0  2200  [8] 00 40 00 00 00 00 5F 00   ERRORFRAME
> state-change{back-to-error-active}
> error-counter-tx-rx{{95}{0}}
>
> For each message sent, the error counter decreases by 8.
>

The HI-3110 controller decrements the error counter by 1 for every message sent.
The error count increments by 8 when there is an error.
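
A small sketch of the counter behaviour described above, for reference. It assumes the standard CAN fault-confinement thresholds (error warning at 96, error passive at 128, bus-off above 255) and the +8 / -1 steps stated for the HI-3110; the toy_* names are made up and this is not driver code.

```c
#include <assert.h>

enum toy_can_state {
	TOY_ERROR_ACTIVE,
	TOY_ERROR_WARNING,
	TOY_ERROR_PASSIVE,
	TOY_BUS_OFF,
};

/* Map a transmit error counter (TEC) to a state using the usual thresholds. */
static enum toy_can_state toy_state_from_tec(unsigned int tec)
{
	if (tec > 255)
		return TOY_BUS_OFF;
	if (tec >= 128)
		return TOY_ERROR_PASSIVE;
	if (tec >= 96)
		return TOY_ERROR_WARNING;
	return TOY_ERROR_ACTIVE;
}

/* One step: +8 on a transmit error, -1 (not below 0) on a successful message. */
static unsigned int toy_tec_update(unsigned int tec, int tx_error)
{
	if (tx_error)
		return tec + 8;
	return tec > 0 ? tec - 1 : 0;
}

/* Apply n_errors failed transmissions followed by n_sent successful ones. */
static unsigned int toy_tec_run(unsigned int tec, int n_errors, int n_sent)
{
	assert(n_errors >= 0 && n_sent >= 0);
	while (n_errors-- > 0)
		tec = toy_tec_update(tec, 1);
	while (n_sent-- > 0)
		tec = toy_tec_update(tec, 0);
	return tec;
}
```

Under these assumptions, 12 errors from zero reach the warning threshold, 16 reach error passive, and it then takes 33 successful messages from 128 to drop back to 95 (error active), compared with far fewer if the counter fell by 8 per message.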

>
>>
>> root@imx6qrom5420b1:~# ip -s -d link show can0
>> 4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
>> mode DEFAULT group default qlen 10
>> link/can  promiscuity 0
>> can state ERROR-ACTIVE (berr-counter tx 117 rx 0) restart-ms 0
>>   bitrate 100 sample-point 0.750
>>   tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
>>   hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
>>   clock 1600
>>   re-started bus-errors arbit-lost error-warn error-pass bus-off
>>   0  1  0  0  0  0
>
>
> Strange, some counters got lost.
>

This was a bug introduced when adding berr-reporting; I have fixed it in the v4 patch.

>>
>> I have not been able to check the bus-off condition by (short-circuiting
>> CAN low and high). The tec error count remains at 128 when I short the
>> CAN low and high pins and the status never goes BUSOFF.
>
>
> You also need to send a message and the short-circuit should be at the
> connector of the sending host. What transceiver is used? Do you know?
>

ADM3054 transceiver is used with HI-3111. I connected the
HI-3111/ADM3054 board to kvaser leaf and ran "cangen -i can0" and
"candump -e any,0:0,#FFF" on the board. Removed the cable and
shorted the CAN_H/L pins coming out of ADM3054. I will try your
suggestion of using a different bit-rate on the Kvaser leaf instead.

I appreciate your continued feedback, it has helped significantly
improve the error handling of the driver. Looking back I should have
based it 

Re: openvswitch conntrack and nat problem in first packet reply with RST

2017-03-14 Thread wenxu

You are correct! Thanks very much.

It works. I set up a new example as follows.

ip,in_port=2 actions=ct(table=1,zone=1,nat)
ip,in_port=3 actions=ct(table=1,zone=1,nat)

table=1, ct_state=+new+trk,tcp,in_port=2,tp_dst=123 
actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:3
table=1, ct_state=+new+trk,icmp,in_port=2 
actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:3
table=1, ct_state=+new+trk,ip,in_port=3 
actions=ct(commit,zone=1,nat(dst=192.168.0.7)),output:2
table=1, ct_state=+new+trk, priority=100, tcp,in_port=3,tp_dst=123 actions=drop
table=1, ct_state=+est+trk,ip,in_port=3 actions=output:2
table=1, ct_state=+est+trk,ip,in_port=2 actions=output:3





> On 13 March 2017 at 20:18, wenxu  wrote:
>> Hi all,
>>
>> There is a simple test for conntrack and nat in openvswitch.  I want to do 
>> stateful
>> firewall with conntrack then do nat
>>
>> netns1 port1 with ip 10.0.0.7
>> netns2 port2 with ip 1.1.1.7
>>
>> netns1 10.0.0.7 src -nat to 2.2.1.7 access netns2 1.1.1.7
>>
>> 1. # ovs-ofctl add-flow br0  'ip,in_port=1 actions=ct(table=1,zone=1)'
>> 2. # ovs-ofctl add-flow br0  'ip,in_port=2 actions=ct(table=1,zone=1)'
>> 3. # ovs-ofctl add-flow br0  'table=1, 
>> ct_state=+new+trk,tcp,in_port=1,tp_dst=123 
>> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
>> 4. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=2 
>> actions=ct(commit,zone=1,nat(dst=10.0.0.7)),output:1'
>> 5. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=1  
>> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
>>
>>
>> I found that netns1 can access 1.1.1.7:123 when something is listening on
>> port 123 on 1.1.1.7 in netns2.
>>
>> But if nothing is listening on port 123, the first RST packet sent in reply by
>> 1.1.1.7 (no datapath kernel rule yet) can't be dst-NATed back to 10.0.0.7. The
>> second RST packet is OK (there is a datapath kernel rule, which comes from the
>> first RST packet).
>>
>> # tcpdump -i eth0 -nnn
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
>> 14:44:13.575200 IP 10.0.0.7.39891 > 1.1.1.7.123: Flags [S], seq 93585, 
>> win 29200, options [mss 1460,sackOK,TS val 584707316 ecr 0,nop,wscale 7], 
>> length 0
>> 14:44:13.576036 IP 1.1.1.7.123 > 2.2.1.7.39891: Flags [R.], seq 0, ack 
>> 93586, win 0, length 0
>>
>> But the datapath flow is correct
>> # ovs-dpctl dump-flows
>> recirc_id(0),in_port(7),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
>> used:never, actions:ct(zone=1),recirc(0x5a)
>> recirc_id(0x5a),in_port(7),ct_state(+new+trk),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=123),
>>  packets:0, bytes:0, used:never,
>> actions:ct(commit,zone=1,nat(src=2.2.1.7)),8
>> recirc_id(0),in_port(8),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
>> used:never, actions:ct(zone=1),recirc(0x5b)
>> recirc_id(0x5b),in_port(8),ct_state(-new+est+trk),eth_type(0x0800),ipv4(frag=no),
>>  packets:0, bytes:0, used:never,
>> actions:ct(commit,zone=1,nat(dst=10.0.0.7)),7
>>
>>
>> I think it's an issue with the PACKET-OUT and the RST packet.
>>
>> There are two packet-outs, for rule2 and rule4. Rule2 goes through connection
>> tracking, finds it is an RST packet, and then deletes the conntrack entry. As a
>> result the second packet (coming from rule4) can't find the conntrack entry to
>> do dst-nat.
>>
>> In the netfilter/nf_conntrack_proto_tcp.c file:
>>  if (!test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
>> /* If only reply is a RST, we can consider ourselves not to
>>have an established connection: this is a fairly common
>>problem case, so we can delete the conntrack
>>immediately.  --RR */
>> if (th->rst ) {
>> nf_ct_kill_acct(ct, ctinfo, skb);
>> return NF_ACCEPT;
>> }
>> }
>>
>>
>> A switch should be added to avoid deleting this conntrack entry.
>>
>> if (!test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
>> /* If only reply is a RST, we can consider ourselves not to
>>have an established connection: this is a fairly common
>>problem case, so we can delete the conntrack
>>immediately.  --RR */
>> -if (th->rst ) {
>> +if (th->rst && !nf_ct_tcp_rst_no_kill) {
>> nf_ct_kill_acct(ct, ctinfo, skb);
>> return NF_ACCEPT;
>> }
> How would you know to not kill the entry? How would you ensure it's
> properly cleaned up later? I'm not sure if there's a way to implement
> this without some fairly serious plumbing.
>
> If you look at the examples in the OVS testsuite[0], it is suggested
> to use "ct(nat)" with no options early in your rules. This ensures
> that the connection is looked up, and if necessary, NAT is applied at
> the same time - meaning that the RST can be NATed back AND the
> connection is deleted. In 

Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

2017-03-14 Thread Alexei Starovoitov
On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> + struct mlx4_en_rx_ring *ring,
> + dma_addr_t *dma,
> + unsigned int node, gfp_t gfp)
>  {
> + if (unlikely(!ring->pre_allocated_count)) {
> + unsigned int order = READ_ONCE(ring->rx_alloc_order);
> +
> + page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> + __GFP_NOMEMALLOC |
> + __GFP_NOWARN |
> + __GFP_NORETRY,
> +   order);
> + if (page) {
> + split_page(page, order);
> + ring->pre_allocated_count = 1U << order;
> + } else {
> + if (order > 1)
> + ring->rx_alloc_order--;
> + page = __alloc_pages_node(node, gfp, 0);
> + if (unlikely(!page))
> + return NULL;
> + ring->pre_allocated_count = 1U;
>   }
> + ring->pre_allocated = page;
> + ring->rx_alloc_pages += ring->pre_allocated_count;
>   }
> + page = ring->pre_allocated++;
> + ring->pre_allocated_count--;

do you think this style of allocation can be moved into net common?
If it's a good thing then other drivers should be using it too, right?

> + ring->cons = 0;
> + ring->prod = 0;
> +
> + /* Page recycling works best if we have enough pages in the pool.
> +  * Apply a factor of two on the minimal allocations required to
> +  * populate RX rings.
> +  */

I'm not sure how the above comment matches the code below.
If my math is correct a ring of 1k elements will ask for 1024
contiguous pages.

> +retry:
> + total = 0;
> + pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> + pages_per_ring = roundup_pow_of_two(pages_per_ring);
> +
> + order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);

if you're sure it doesn't hurt the rest of the system,
why use MAX_ORDER - 1? why not MAX_ORDER?
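
A quick check of the arithmetic in the quoted hunk, as a standalone sketch. The 2048-byte stride, 4096-byte page size and MAX_ORDER of 11 are assumptions chosen for illustration (they are not stated in the patch), and the toy_* helpers only imitate roundup_pow_of_two() and ilog2().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size */
#define TOY_MAX_ORDER 11u   /* assumed MAX_ORDER */

/* Smallest power of two that is >= n (n must be non-zero). */
static uint32_t toy_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Integer log2, rounded down (n must be non-zero). */
static uint32_t toy_ilog2(uint32_t n)
{
	uint32_t l = 0;

	assert(n > 0);
	while (n >>= 1)
		l++;
	return l;
}

/* pages_per_ring as computed in the quoted code, including the factor of two. */
static uint32_t toy_pages_per_ring(uint32_t ring_size, uint32_t stride_bytes)
{
	uint32_t pages = ring_size * stride_bytes * 2 / TOY_PAGE_SIZE;

	return toy_roundup_pow_of_two(pages);
}

/* Initial allocation order, capped at MAX_ORDER - 1 as in the quoted code. */
static uint32_t toy_initial_order(uint32_t pages_per_ring)
{
	uint32_t order = toy_ilog2(pages_per_ring);

	return order < TOY_MAX_ORDER - 1 ? order : TOY_MAX_ORDER - 1;
}
```

With a 1024-entry ring and the assumed stride, toy_pages_per_ring() gives 1024 pages and toy_initial_order() gives order 10, i.e. one request for 1024 contiguous pages, which is the number questioned above.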

>  
> -/* We recover from out of memory by scheduling our napi poll
> - * function (mlx4_en_process_cq), which tries to allocate
> - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> + * to very small values. Periodically increase t to initial value for
> + * optimal allocations, in case stress is over.
>   */
> + for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> + ring = priv->rx_ring[ring_ind];
> + order = min_t(unsigned int, ring->rx_alloc_order + 1,
> +   ring->rx_pref_alloc_order);
> + WRITE_ONCE(ring->rx_alloc_order, order);

when recycling is effective in a matter of few seconds it will
increase the order back to 10 and the first time the driver needs
to allocate, it will start that tedious failure loop all over again.
How about removing this periodic mlx4_en_recover_from_oom() completely
and switch to increase the order inside mlx4_alloc_page().
Like N successful __alloc_pages_node() with order X will bump it
into order X+1. If it fails next time it will do only one failed attempt.
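
To make the suggestion concrete, here is one possible shape of such a heuristic as a userspace sketch. Everything in it is hypothetical: struct toy_ring, toy_order_feedback() and the threshold TOY_BUMP_AFTER (the "N" above) are not from the patch, and the floor of order 0 is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BUMP_AFTER 16u /* hypothetical N: successes needed before raising the order */

struct toy_ring {
	unsigned int order;     /* order tried on the next allocation */
	unsigned int max_order; /* preferred (initial) order, never exceeded */
	unsigned int successes; /* consecutive successes at the current order */
};

/* Feed back the result of one allocation attempt made at ring->order. */
static void toy_order_feedback(struct toy_ring *ring, bool alloc_ok)
{
	assert(ring->order <= ring->max_order);

	if (!alloc_ok) {
		/* Back off one order and start counting again. */
		if (ring->order > 0)
			ring->order--;
		ring->successes = 0;
		return;
	}

	ring->successes++;
	if (ring->successes < TOY_BUMP_AFTER)
		return;

	/* Enough successes in a row: probe the next order up, if any. */
	ring->successes = 0;
	if (ring->order < ring->max_order)
		ring->order++;
}
```

The point of this design is that after a bump, a failure costs a single failed attempt before the ring falls back to the order that was working, instead of walking down from the maximum each time.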

> +static bool mlx4_replenish(struct mlx4_en_priv *priv,
> +struct mlx4_en_rx_ring *ring,
> +struct mlx4_en_frag_info *frag_info)
>  {
> + struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
> + if (!mlx4_page_is_reusable(en_page->page)) {
> + page = mlx4_alloc_page(priv, ring, , numa_mem_id(),
> +GFP_ATOMIC | __GFP_MEMALLOC);

I don't understand why page_is_reusable is doing !page_is_pfmemalloc(page))
check, but here you're asking for MEMALLOC pages too, so
under memory pressure the hw will dma the packet into this page,
but then the stack will still drop it, so under pressure
we'll keep alloc/free the pages from reserve. Isn't it better
to let the hw drop (since we cannot alloc and rx ring is empty) ?
What am I missing?

> @@ -767,10 +820,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, 
> struct mlx4_en_cq *cq, int bud
>   case XDP_PASS:
>   break;
>   case XDP_TX:
> + /* Make sure we have one page ready to replace 
> this one */
> + npage = NULL;
> + if (!ring->page_cache.index) {
> + npage = mlx4_alloc_page(priv, ring,
> + , 
> numa_mem_id(),
> + GFP_ATOMIC | 
> 

[GIT] Networking

2017-03-14 Thread David Miller

1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver, from
   Steffen Klassert.

2) Fix crashes when user tries to get_next_key on an LPM bpf map, from Alexei
   Starovoitov.

3) Fix detection of VLAN filtering feature for bnx2x VF devices, from Michal
   Schmidt.

4) We can get a divide by zero when TCP sockets are morphed into listening
   state, fix from Eric Dumazet.

5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
   skb_complete_tx_timestamp().  From Eric Dumazet.

6) Use after free in dccp_feat_activate_values(), also from Eric Dumazet.

7) Like bonding, team needs to use ETH_MAX_MTU as netdev->max_mtu, from
   Jarod Wilson.

8) Fix use after free in vrf_xmit(), from David Ahern.

9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
   Alexey Kodanev.

10) Properly check napi_complete_done() return value in order to decide
whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
Lendacky.

11) Fix double free of hwmon device in marvell phy driver, from Andrew
Lunn.

12) Don't crash on malformed netlink attributes in act_connmark, from
Etienne Noss.

13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
from Sabrina Dubroca.

14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
Florian Westphal.

15) Fix routing redirect races in dccp and tcp, basically the ICMP handler
can't modify the socket's cached route if it's locked by the user
at this moment.  From Jon Maxwell.

Please pull, thanks a lot!

The following changes since commit 8d70eeb84ab277377c017af6a21d0a337025dede:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-03-04 
17:31:39 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 

for you to fetch changes up to 1e6a1cd888de06b09d2341d782aadb20c6034210:

  Merge branch 'qed-fixes' (2017-03-14 11:37:06 -0700)


Alexander Potapenko (1):
  net: initialize msg.msg_flags in recvfrom

Alexei Starovoitov (4):
  bpf: add get_next_key callback to LPM map
  bpf: fix struct htab_elem layout
  bpf: convert htab map to hlist_nulls
  selftests/bpf: fix broken build

Alexey Khoroshilov (1):
  net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump

Alexey Kodanev (1):
  udp: avoid ufo handling on IP payload compression packets

Andrew Lunn (1):
  net: phy: marvell: Fix double free of hwmon device

Andrey Vagin (1):
  net: use net->count to check whether a netns is alive or not

Arnd Bergmann (1):
  net/mlx5e: add IPV6 dependency

Blomme, Maarten (2):
  spi_ks8995: fix "BUG: key accdaa28 not in .data!"
  spi_ks8995: regs_size incorrect for some devices

Christian Lamparter (2):
  dt: emac: document device-tree based phy discovery and setup
  net: ibm: emac: fix regression caused by emac_dt_phy_probe()

Daniel Borkmann (1):
  bpf: improve read-only handling

Daniel Jurgens (1):
  net/mlx5: Don't save PCI state when PCI error is detected

David Ahern (4):
  vrf: Fix use-after-free in vrf_xmit
  net: ipv6: Remove redundant RTA_OIF in multipath routes
  mpls: Send route delete notifications when router module is unloaded
  mpls: Do not decrement alive counter for unregister events

David Arcari (1):
  net: ethernet: aquantia: call set_irq_affinity_hint before free_irq

David Howells (4):
  rxrpc: Call state should be read with READ_ONCE() under some circumstances
  net: Work around lockdep limitation in sockets that use sockets
  rxrpc: rxrpc_kernel_send_data() needs to handle failed call better
  rxrpc: Wake up the transmitter if Rx window size increases on the peer

David S. Miller (12):
  Merge branch 'bnx2x-fixes'
  Merge branch 'sock_hold-misuses'
  Merge branch 'rds-fixes'
  Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
  net: Revert ksettings conversions.
  Merge branch 'thunderx-misc-fixes'
  Merge branch 'bpf-htab-fixes'
  Merge branch 'bnxt_en-misc-small-fixes'
  Merge branch 'bcmgenet-minor-bug-fixes'
  Merge branch 'mlx5-fixes'
  Merge branch 'mlxsw-small-fixes'
  Merge branch 'qed-fixes'

Dmitry V. Levin (1):
  uapi: fix linux/packet_diag.h userspace compilation error

Doug Berger (7):
  net: bcmgenet: correct the RBUF_OVFL_CNT and RBUF_ERR_CNT MIB values
  net: bcmgenet: correct MIB access of UniMAC RUNT counters
  net: bcmgenet: reserved phy revisions must be checked first
  net: bcmgenet: power down internal phy if open or resume fails
  net: bcmgenet: synchronize irq0 status between the isr and task
  net: bcmgenet: Power up the internal PHY before probing the MII
  net: bcmgenet: decouple flow control from bcmgenet_tx_reclaim

Edwin Chan (1):
  net: bcmgenet: add begin/complete ethtool ops

Eric Dumazet (4):
  tcp: 

[PATCH net] bridge: ebtables: fix reception of frames DNAT-ed to bridge device

2017-03-14 Thread Linus Lüssing
When trying to redirect bridged frames to the bridge device itself
via the ebtables nat-prerouting chain and the dnat target then this
currently fails:

The ethernet destination of the frame is dnat'ed to the MAC address of
the bridge itself just fine and the correctly altered frame can even
be captured via a tcpdump on br0 (with or without promisc mode).

However, the IP code drops it in the beginning of ip_input.c/ip_rcv()
as the dnat target did not update the skb->pkt_type. If after
dnat'ing the packet is now destined to us then the skb->pkt_type
needs to be updated from PACKET_OTHERHOST to PACKET_HOST, too.

Signed-off-by: Linus Lüssing 
---
 net/bridge/br_input.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 013f2290b..ec83175 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -198,8 +198,12 @@ int br_handle_frame_finish(struct net *net, struct sock 
*sk, struct sk_buff *skb
if (dst) {
unsigned long now = jiffies;
 
-   if (dst->is_local)
+   if (dst->is_local) {
+   /* fix up potential DNAT mess */
+   skb->pkt_type = PACKET_HOST;
+
return br_pass_frame_up(skb);
+   }
 
if (now != dst->used)
dst->used = now;
-- 
2.1.4
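
The following userspace sketch models the failure described in the commit message. It is only an illustration: the toy_* types and helpers are invented, and the three steps stand in for receive-time classification, the ebtables dnat target, and the fix-up added before br_pass_frame_up().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6

enum toy_pkt_type { TOY_PACKET_HOST, TOY_PACKET_OTHERHOST };

struct toy_frame {
	uint8_t dest[TOY_ETH_ALEN];
	enum toy_pkt_type pkt_type;
};

/* Receive-time classification: compare the destination MAC with the device MAC. */
static void toy_classify(struct toy_frame *f, const uint8_t *dev_addr)
{
	f->pkt_type = memcmp(f->dest, dev_addr, TOY_ETH_ALEN) == 0 ?
		      TOY_PACKET_HOST : TOY_PACKET_OTHERHOST;
}

/* DNAT that rewrites only the destination address and leaves pkt_type stale. */
static void toy_dnat(struct toy_frame *f, const uint8_t *new_dest)
{
	memcpy(f->dest, new_dest, TOY_ETH_ALEN);
}

/* The fix-up: a frame that is about to be delivered locally is marked as ours. */
static void toy_pass_up_fixed(struct toy_frame *f)
{
	f->pkt_type = TOY_PACKET_HOST;
}

/* An ip_rcv()-like check: frames still marked for another host are dropped. */
static int toy_ip_accepts(const struct toy_frame *f)
{
	assert(f != NULL);
	return f->pkt_type != TOY_PACKET_OTHERHOST;
}
```

A frame classified as OTHERHOST stays rejected after toy_dnat() even though its destination now matches the bridge, and is accepted only once toy_pass_up_fixed() has updated the type.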



[PATCH] net: unix: properly re-increment inflight counter of GC discarded candidates

2017-03-14 Thread Andrey Ulanov
Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().

The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.

Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that unsets this flag _before_ invoking
scan_children(, dec_inflight, ). This may lead to incorrect inflight
counters for some sockets.

This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.

  kernel BUG at net/unix/garbage.c:149!
  RIP: 0010:[]  []
  unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
  Call Trace:
   [] unix_detach_fds.isra.19+0xff/0x170 
net/unix/af_unix.c:1487
   [] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
   [] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
   [] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
   [] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
   [] kfree_skb+0x184/0x570 net/core/skbuff.c:705
   [] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
   [] unix_release+0x49/0x90 net/unix/af_unix.c:836
   [] sock_release+0x92/0x1f0 net/socket.c:570
   [] sock_close+0x1b/0x20 net/socket.c:1017
   [] __fput+0x34e/0x910 fs/file_table.c:208
   [] fput+0x1a/0x20 fs/file_table.c:244
   [] task_work_run+0x1a0/0x280 kernel/task_work.c:116
   [< inline >] exit_task_work include/linux/task_work.h:21
   [] do_exit+0x183a/0x2640 kernel/exit.c:828
   [] do_group_exit+0x14e/0x420 kernel/exit.c:931
   [] get_signal+0x663/0x1880 kernel/signal.c:2307
   [] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
   [] exit_to_usermode_loop+0x1ea/0x2d0
  arch/x86/entry/common.c:156
   [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
   [] syscall_return_slowpath+0x4d3/0x570
  arch/x86/entry/common.c:259
   [] entry_SYSCALL_64_fastpath+0xc4/0xc6

Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov 
Reported-by: Dmitry Vyukov 
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
---
 net/unix/garbage.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 6a0d48525fcf..c36757e72844 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -146,6 +146,7 @@ void unix_notinflight(struct user_struct *user, struct file 
*fp)
if (s) {
struct unix_sock *u = unix_sk(s);
 
+   BUG_ON(!atomic_long_read(&u->inflight));
BUG_ON(list_empty(&u->link));
 
if (atomic_long_dec_and_test(&u->inflight))
@@ -341,6 +342,14 @@ void unix_gc(void)
}
list_del(&cursor);
 
+   /* Now gc_candidates contains only garbage.  Restore original
+* inflight counters for these as well, and remove the skbuffs
+* which are creating the cycle(s).
+*/
+   skb_queue_head_init(&hitlist);
+   list_for_each_entry(u, &gc_candidates, link)
+   scan_children(&u->sk, inc_inflight, &hitlist);
+
/* not_cycle_list contains those sockets which do not make up a
 * cycle.  Restore these to the inflight list.
 */
@@ -350,14 +359,6 @@ void unix_gc(void)
list_move_tail(&u->link, &gc_inflight_list);
}
 
-   /* Now gc_candidates contains only garbage.  Restore original
-* inflight counters for these as well, and remove the skbuffs
-* which are creating the cycle(s).
-*/
-   skb_queue_head_init(&hitlist);
-   list_for_each_entry(u, &gc_candidates, link)
-   scan_children(&u->sk, inc_inflight, &hitlist);
-
spin_unlock(&unix_gc_lock);
 
/* Here we are. Hitlist is filled. Die. */
-- 
2.12.0.367.g23dc2f6d3c-goog





Projects Finance & Business expansion

2017-03-14 Thread Slegt Financiële Dienstverlening B . V .
Dear Sir/Madam

we are Slegt Financiële Dienstverlening B.V. 

Re: Projects Finance & Business expansion.

Kindly be informed that we are investment company and we offer a wide range of 
project finance packages/services to Private, Corporate & Government Agencies.

And Our Financing Options and services are well known for the following: 
1-Project Development finance, 2-Structured finance / Debt Finance, 3-Private 
Equity investment, 4-Collateral Management, 5-Credit enhancements, 6- Leveraged 
Financing, 7-Corporate Finance

Should you be interested in any of our financial option’s, direct further 
inquiries and your Executive Summary to: slegtf...@gmail.com

For further details contact me directly for more information.

Best Regards,
Dr John Mahama.
e-Mail: slegtf...@gmail.com
Slegt Financiële Dienstverlening B.V.
Chairman/Director (GMAC-RFCIBV)
GMAC-RFC Investments BV.


Re: [PATCHv2 net-next 0/4] update ipvs sysctl document

2017-03-14 Thread Hangbin Liu
2017-03-13 17:14 GMT+08:00 Simon Horman :
>> >
>> > Hangbin Liu (4):
>> >   ipvs: fix sync_threshold description and add sync_refresh_period, 
>> > sync_retries
>> >   ipvs: Document sysctl sync_qlen_max and sync_sock_size
>> >   ipvs: Document sysctl sync_ports
>> >   ipvs: Document sysctl pmtu_disc
>>
>>   All patches look ok to me, thanks!
>>
>> Acked-by: Julian Anastasov 
>>
>>   Simon, can we keep this patchset while in merge window?
>
> Hi,
>
> I apologise for the extended delay in dealing with this - I have been ill.
>
> I have queued these up for v4.12 in ipvs-next as that seems to be the most
> appropriate course of action at this time. Please let me know if you'd like
> to explore a different approach.

Hi Simon,

Thanks a lot for taking care of this. I'm OK with the schedule.

Hope you get well soon.

Best Wishes
Hangbin


Re: [PATCH net-next 4/4] net-next: dsa: add dsa support for Mediatek MT7530 switch

2017-03-14 Thread kbuild test robot
Hi Sean,

[auto build test WARNING on robh/for-next]
[also build test WARNING on v4.11-rc2 next-20170310]
[cannot apply to net-next/master net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/sean-wang-mediatek-com/dt-bindings-net-dsa-add-Mediatek-MT7530-binding/20170315-083834
base:   https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/net/dsa/mt7530.c: In function 'mt7530_probe':
   drivers/net/dsa/mt7530.c:1076:27: warning: unused variable 'mdio' 
[-Wunused-variable]
 struct device_node *dn, *mdio;
  ^~~~
   drivers/net/dsa/mt7530.c: In function 'mt7530_remove':
>> drivers/net/dsa/mt7530.c:1173:9: warning: 'return' with a value, in function 
>> returning void
 return ret;
^~~
   drivers/net/dsa/mt7530.c:1153:1: note: declared here
mt7530_remove(struct mdio_device *mdiodev)
^

vim +/return +1173 drivers/net/dsa/mt7530.c

  1070  };
  1071  
  1072  static int
  1073  mt7530_probe(struct mdio_device *mdiodev)
  1074  {
  1075  struct mt7530_priv *priv;
> 1076  struct device_node *dn, *mdio;
  1077  int ret;
  1078  const char *pm;
  1079  
  1080  dn = mdiodev->dev.of_node;
  1081  
  1082  priv = devm_kzalloc(&mdiodev->dev, sizeof(*priv), GFP_KERNEL);
  1083  if (!priv)
  1084  return -ENOMEM;
  1085  
  1086  priv->ds = devm_kzalloc(&mdiodev->dev, sizeof(*priv->ds), GFP_KERNEL);
  1087  if (!priv->ds)
  1088  return -ENOMEM;
  1089  
  1090  /* Use medatek,mcm property to distinguish hardware type that 
would
  1091   * casues a little bit differences on power-on sequence.
  1092   */
  1093  ret = of_property_read_string(dn, "mediatek,mcm", &pm);
  1094  if (!ret && !strcasecmp(pm, "enabled")) {
  1095  priv->mcm = true;
  1096  dev_info(&mdiodev->dev, "MT7530 adapts as multi-chip module\n");
  1097  }
  1098  
  1099  priv->core_pwr = devm_regulator_get(&mdiodev->dev, "core");
  1100  if (IS_ERR(priv->core_pwr))
  1101  return PTR_ERR(priv->core_pwr);
  1102  
  1103  priv->io_pwr = devm_regulator_get(&mdiodev->dev, "io");
  1104  if (IS_ERR(priv->io_pwr))
  1105  return PTR_ERR(priv->io_pwr);
  1106  
  1107  /* MT7530 shares the certain address space with Mediatek 
Ethernet
  1108   * driver for controling TRGMII. Here we create syscon regmap 
for
  1109   * access and control these parameters up on TRGMII.
  1110   */
  1111  priv->ethsys = syscon_regmap_lookup_by_phandle(dn,
  1112  "mediatek,ethsys");
  1113  if (IS_ERR(priv->ethsys))
  1114  return PTR_ERR(priv->ethsys);
  1115  
  1116  priv->ethernet = syscon_regmap_lookup_by_phandle(dn,
  1117 
"mediatek,ethernet");
  1118  if (IS_ERR(priv->ethernet))
  1119  return PTR_ERR(priv->ethernet);
  1120  
  1121  /* Not MCM that indicates switch works as the remote standalone
  1122   * integrated circuit so the GPIO pin would be used to complete
  1123   * the reset, otherwise memory-mapped register accessing used
  1124   * through syscon provides in the case of MCM.
  1125   */
  1126  if (!priv->mcm) {
  1127  priv->reset = of_get_named_gpio(dn, 
"mediatek,reset-pin", 0);
  1128  if (!gpio_is_valid(priv->reset))
  1129  return priv->reset;
  1130  
  1131  ret = devm_gpio_request_one(&mdiodev->dev,
  1132  priv->reset, GPIOF_OUT_INIT_LOW,
  1133  "mediatek,reset-pin");
  1134  if (ret < 0) {
  1135  dev_err(&mdiodev->dev,
  1136  "fail to devm_gpio_request reset\n");
  1137  return ret;
  1138  }
  1139  }
  1140  
  1141  priv->bus = mdiodev->bus;
  1142  priv->dev = &mdiodev->dev;
  1143  priv->ds->priv = priv;
  1144  priv->ds->dev = &mdiodev->dev;
  1145  priv->ds->ops = &mt7530_switch_ops;
  1146  mutex_init(&priv->reg_mutex);
  1147  dev_set_drvdata(&mdiodev->dev, priv);
  1148  
  1149  return dsa_register_switch(priv->ds, priv->ds->dev->of_node);
  1150  }
  1151  
  1152  static void
  1153  mt7530_remove(struct mdio_device *mdiodev)
  1154  {
  1155  struct 

Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Tom Herbert
On Tue, Mar 14, 2017 at 5:24 PM, David Miller  wrote:
> From: Stephen Hemminger 
> Date: Tue, 14 Mar 2017 13:25:06 -0700
>
>> On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
>> David Miller  wrote:
>>
>>> From: Nikolay Aleksandrov 
>>> Date: Tue, 14 Mar 2017 17:58:46 +0200
>>>
>>> > On 14/03/17 17:55, Stephen Hemminger wrote:
>>> >> On Tue, 14 Mar 2017 17:36:15 +0200
>>> >> Nikolay Aleksandrov  wrote:
>>> >>
>>> >>> This patch adds support for ECMP hash policy choice via a new sysctl
>>> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>>> >>> The current values for fib_multipath_hash_policy are:
>>> >>>  0 - layer 3 (default)
>>> >>>  1 - layer 4
>>> >>> If there's an skb hash already set and it matches the chosen policy 
>>> >>> then it
>>> >>> will be used instead of being calculated (currently only for L4).
>>> >>> In L3 mode we always calculate the hash due to the ICMP error special
>>> >>> case, the flow dissector's field consistentification should handle the
>>> >>> address order thus we can remove the address reversals.
>>> >>>
>>> >>> Signed-off-by: Nikolay Aleksandrov 
>>> >>
>>> >> It is good to see ECMP come back from the grave.
>>> >> Linux used to support it long ago but was abandoned after it was unstable
>>> >> and removed from iproute2 in 2012.
>>> >>
>>> >> The old API was through route attributes which makes more sense than
>>> >> doing it with sysctl. It makes more sense to use netlink instead.
>>> >> Therefore please go back and do something like the old API rather than 
>>> >> doing it through
>>> >> sysctl.
>>> >>
>>> >
>>> > That's what my initial version did, but this was discussed during NetConf 
>>> > in Seville
>>> > and it was decided that it's best to make a global sysctl, thus the 
>>> > change.
>>>
>>> Correct, we discussed this, and we all agreed to only have a sysctl for now.
>>
>> Why? If you are going to have private discussions please post the rationale
>> in public.
>
> The idea is that we couldn't come up with an immediate use case, and if one
> came up we could easily add the per-route or per-fib-table attribute.
>
> Most people want the entire system to behave a certain way wrt. ECMP, rather
> than have fine granularity.  For example, the case being discussed here is
> to simply have software's behavior match that of hardware offloads.
>
Agreed, but then why do we even need any complexity here by that
argument? RSS is specifically defined to do 5-tuple hashing for TCP
(and UDP), and 3-tuple. No one has ever complained that doing per flow
RSS for TCP is a bad thing AFAIK. We followed that same model for RPS,
RFS, and XPS via state in the connection context. The skb_hash is
often given to us for free, whereas in order to do a 3-tuple we have
to actually do more work and do at least an extra jhash. I suppose the
argument is probably that switches allow this configuration and
somehow we want to have feature parity, but it would be very
interesting to know if anyone is not doing per flow ECMP in real life
and why...

Tom
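
For readers following the L3 versus L4 discussion above, here is a minimal userspace sketch of the difference between the two policies. Everything in it is an assumption made for illustration: the `struct flow` layout, the FNV-1a hash (a stand-in for the kernel's jhash), and the modulo-based nexthop choice are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow description; field names are illustrative only. */
struct flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t proto;
};

/* Mirrors the two sysctl values described in the thread. */
enum hash_policy { HASH_L3 = 0, HASH_L4 = 1 };

/* FNV-1a over a byte range; a stand-in for the kernel's jhash. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* L3 hashes the addresses only; L4 also mixes in ports and protocol. */
static uint32_t flow_hash(const struct flow *f, enum hash_policy policy)
{
    uint32_t h = 2166136261u;

    h = fnv1a(h, &f->saddr, sizeof(f->saddr));
    h = fnv1a(h, &f->daddr, sizeof(f->daddr));
    if (policy == HASH_L4) {
        h = fnv1a(h, &f->sport, sizeof(f->sport));
        h = fnv1a(h, &f->dport, sizeof(f->dport));
        h = fnv1a(h, &f->proto, sizeof(f->proto));
    }
    return h;
}

/* Simplified nexthop selection: plain modulo over the path count. */
static unsigned int select_nexthop(const struct flow *f,
                                   enum hash_policy policy,
                                   unsigned int num_nexthops)
{
    assert(num_nexthops > 0);
    return flow_hash(f, policy) % num_nexthops;
}
```

With the L3 policy, two connections between the same pair of hosts always land on the same path regardless of their ports; with the L4 policy each connection gets its own hash and may be spread across paths.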


Re: [PATCH net-next 4/4] net-next: dsa: add dsa support for Mediatek MT7530 switch

2017-03-14 Thread kbuild test robot
Hi Sean,

[auto build test WARNING on robh/for-next]
[also build test WARNING on v4.11-rc2 next-20170310]
[cannot apply to net-next/master net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/sean-wang-mediatek-com/dt-bindings-net-dsa-add-Mediatek-MT7530-binding/20170315-083834
base:   https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from drivers/net/dsa/mt7530.c:25:0:
   drivers/net/dsa/mt7530.c: In function 'mt7530_setup':
>> drivers/net/dsa/mt7530.c:699:19: warning: large integer implicitly truncated 
>> to unsigned type [-Woverflow]
   RESET_MCM, ~RESET_MCM);
  ^
   include/linux/regmap.h:70:42: note: in definition of macro 
'regmap_update_bits'
 regmap_update_bits_base(map, reg, mask, val, NULL, false, false)
 ^~~
   drivers/net/dsa/mt7530.c: In function 'mt7530_probe':
   drivers/net/dsa/mt7530.c:1076:27: warning: unused variable 'mdio' 
[-Wunused-variable]
 struct device_node *dn, *mdio;
  ^~~~
   drivers/net/dsa/mt7530.c: In function 'mt7530_remove':
   drivers/net/dsa/mt7530.c:1173:9: warning: 'return' with a value, in function 
returning void
 return ret;
^~~
   drivers/net/dsa/mt7530.c:1153:1: note: declared here
mt7530_remove(struct mdio_device *mdiodev)
^

vim +699 drivers/net/dsa/mt7530.c

   683  ret = regulator_enable(priv->io_pwr);
   684  if (ret < 0) {
   685  dev_err(priv->dev, "Failed to enable io pwr: %d\n",
   686  ret);
   687  return ret;
   688  }
   689  
   690  /* Reset whole chip through gpio pin or
   691   * memory-mapped registers for different
   692   * type of hardware
   693   */
   694  if (priv->mcm) {
   695  regmap_update_bits(priv->ethsys, SYSC_REG_RSTCTRL,
   696 RESET_MCM, RESET_MCM);
   697  usleep_range(1000, 1100);
   698  regmap_update_bits(priv->ethsys, SYSC_REG_RSTCTRL,
 > 699 RESET_MCM, ~RESET_MCM);
   700  } else {
   701  gpio_direction_output(priv->reset, 0);
   702  usleep_range(1000, 1100);
   703  gpio_set_value(priv->reset, 1);
   704  }
   705  
   706  /* Wait until the reset completion */
   707  ret = wait_condition_timeout(mt7530_read(priv, MT7530_HWTRAP) 
!= 0,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip
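
As background on the -Woverflow warning at mt7530.c:699 above, the following is a userspace sketch, not the driver's code. It assumes that `RESET_MCM` is defined with a kernel-style `BIT()` macro (so it has type `unsigned long`) and that it is bit 2; `update_bits()` is a hypothetical stand-in for the read-modify-write that `regmap_update_bits()` performs on `unsigned int` values. On a 64-bit build, `~RESET_MCM` has all upper bits set and no longer fits the `unsigned int` value parameter, which is what the compiler reports. Because only the bits selected by the mask are taken from the value, passing 0 clears the bit just as well.

```c
#include <assert.h>

/* Kernel-style BIT(): the result has type unsigned long. */
#define BIT(n) (1UL << (n))

/* Assumption for illustration: the reset bit is bit 2. */
#define RESET_MCM BIT(2)

/*
 * Userspace stand-in for a masked register update: only the bits set
 * in mask are taken from val, all other bits of old are preserved.
 */
static unsigned int update_bits(unsigned int old, unsigned int mask,
                                unsigned int val)
{
    assert(mask != 0); /* an empty mask would make the call a no-op */
    return (old & ~mask) | (val & mask);
}

/* Set the reset bit. */
static unsigned int assert_reset(unsigned int reg)
{
    return update_bits(reg, RESET_MCM, RESET_MCM);
}

/* Clear the reset bit; val = 0 avoids the truncated ~RESET_MCM constant. */
static unsigned int release_reset(unsigned int reg)
{
    return update_bits(reg, RESET_MCM, 0);
}
```

Whether the driver should pass 0 or cast the constant is a choice for the patch author; the sketch only shows that the two are equivalent for a masked update.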


[PATCH-v5 1/4] vsock: track pkt owner vsock

2017-03-14 Thread Peng Tao
So that we can cancel a queued pkt later if necessary.

Signed-off-by: Peng Tao 
---
 include/linux/virtio_vsock.h| 3 +++
 net/vmw_vsock/virtio_transport_common.c | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 9638bfe..584f9a6 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -48,6 +48,8 @@ struct virtio_vsock_pkt {
struct virtio_vsock_hdr hdr;
struct work_struct work;
struct list_head list;
+   /* socket refcnt not held, only use for cancellation */
+   struct vsock_sock *vsk;
void *buf;
u32 len;
u32 off;
@@ -56,6 +58,7 @@ struct virtio_vsock_pkt {
 
 struct virtio_vsock_pkt_info {
u32 remote_cid, remote_port;
+   struct vsock_sock *vsk;
struct msghdr *msg;
u32 pkt_len;
u16 type;
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 8d592a4..af087b4 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -58,6 +58,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
pkt->len= len;
pkt->hdr.len= cpu_to_le32(len);
pkt->reply  = info->reply;
+   pkt->vsk= info->vsk;
 
if (info->msg && len > 0) {
pkt->buf = kmalloc(len, GFP_KERNEL);
@@ -180,6 +181,7 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_CREDIT_UPDATE,
.type = type,
+   .vsk = vsk,
};
 
return virtio_transport_send_pkt_info(vsk, &info);
@@ -519,6 +521,7 @@ int virtio_transport_connect(struct vsock_sock *vsk)
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_REQUEST,
.type = VIRTIO_VSOCK_TYPE_STREAM,
+   .vsk = vsk,
};
 
return virtio_transport_send_pkt_info(vsk, &info);
@@ -534,6 +537,7 @@ int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
  VIRTIO_VSOCK_SHUTDOWN_RCV : 0) |
 (mode & SEND_SHUTDOWN ?
  VIRTIO_VSOCK_SHUTDOWN_SEND : 0),
+   .vsk = vsk,
};
 
return virtio_transport_send_pkt_info(vsk, &info);
@@ -560,6 +564,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
.type = VIRTIO_VSOCK_TYPE_STREAM,
.msg = msg,
.pkt_len = len,
+   .vsk = vsk,
};
 
return virtio_transport_send_pkt_info(vsk, &info);
@@ -581,6 +586,7 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
.op = VIRTIO_VSOCK_OP_RST,
.type = VIRTIO_VSOCK_TYPE_STREAM,
.reply = !!pkt,
+   .vsk = vsk,
};
 
/* Send RST only if the original pkt is not a RST pkt */
@@ -826,6 +832,7 @@ virtio_transport_send_response(struct vsock_sock *vsk,
.remote_cid = le64_to_cpu(pkt->hdr.src_cid),
.remote_port = le32_to_cpu(pkt->hdr.src_port),
.reply = true,
+   .vsk = vsk,
};
 
return virtio_transport_send_pkt_info(vsk, &info);
-- 
2.7.4



[PATCH-v5 2/4] vhost-vsock: add pkt cancel capability

2017-03-14 Thread Peng Tao
To allow canceling all packets of a connection.

Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jorgen Hansen 
Signed-off-by: Peng Tao 
---
 drivers/vhost/vsock.c  | 41 +
 include/net/af_vsock.h |  3 +++
 2 files changed, 44 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ce5e63d..57babce 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -223,6 +223,46 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
return len;
 }
 
+static int
+vhost_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+   struct vhost_vsock *vsock;
+   struct virtio_vsock_pkt *pkt, *n;
+   int cnt = 0;
+   LIST_HEAD(freeme);
+
+   /* Find the vhost_vsock according to guest context id  */
+   vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+   if (!vsock)
+   return -ENODEV;
+
+   spin_lock_bh(&vsock->send_pkt_list_lock);
+   list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
+   if (pkt->vsk != vsk)
+   continue;
+   list_move(&pkt->list, &freeme);
+   }
+   spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+   list_for_each_entry_safe(pkt, n, &freeme, list) {
+   if (pkt->reply)
+   cnt++;
+   list_del(&pkt->list);
+   virtio_transport_free_pkt(pkt);
+   }
+
+   if (cnt) {
+   struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
+   int new_cnt;
+
+   new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
+   if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
+   vhost_poll_queue(&tx_vq->poll);
+   }
+
+   return 0;
+}
+
 static struct virtio_vsock_pkt *
 vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
  unsigned int out, unsigned int in)
@@ -675,6 +715,7 @@ static struct virtio_transport vhost_transport = {
.release  = virtio_transport_release,
.connect  = virtio_transport_connect,
.shutdown = virtio_transport_shutdown,
+   .cancel_pkt   = vhost_transport_cancel_pkt,
 
.dgram_enqueue= virtio_transport_dgram_enqueue,
.dgram_dequeue= virtio_transport_dgram_dequeue,
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f275896..f32ed9a 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -100,6 +100,9 @@ struct vsock_transport {
void (*destruct)(struct vsock_sock *);
void (*release)(struct vsock_sock *);
 
+   /* Cancel all pending packets sent on vsock. */
+   int (*cancel_pkt)(struct vsock_sock *vsk);
+
/* Connections. */
int (*connect)(struct vsock_sock *);
 
-- 
2.7.4



[PATCH-v5 3/4] vsock: add pkt cancel capability

2017-03-14 Thread Peng Tao
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Peng Tao 
---
 net/vmw_vsock/virtio_transport.c | 42 
 1 file changed, 42 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 9d24c0e..bcab8f2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -213,6 +213,47 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
return len;
 }
 
+static int
+virtio_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+   struct virtio_vsock *vsock;
+   struct virtio_vsock_pkt *pkt, *n;
+   int cnt = 0;
+   LIST_HEAD(freeme);
+
+   vsock = virtio_vsock_get();
+   if (!vsock) {
+   return -ENODEV;
+   }
+
+   spin_lock_bh(&vsock->send_pkt_list_lock);
+   list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
+   if (pkt->vsk != vsk)
+   continue;
+   list_move(&pkt->list, &freeme);
+   }
+   spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+   list_for_each_entry_safe(pkt, n, &freeme, list) {
+   if (pkt->reply)
+   cnt++;
+   list_del(&pkt->list);
+   virtio_transport_free_pkt(pkt);
+   }
+
+   if (cnt) {
+   struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
+   int new_cnt;
+
+   new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
+   if (new_cnt + cnt >= virtqueue_get_vring_size(rx_vq) &&
+   new_cnt < virtqueue_get_vring_size(rx_vq))
+   queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+   }
+
+   return 0;
+}
+
 static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
 {
int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
@@ -462,6 +503,7 @@ static struct virtio_transport virtio_transport = {
.release  = virtio_transport_release,
.connect  = virtio_transport_connect,
.shutdown = virtio_transport_shutdown,
+   .cancel_pkt   = virtio_transport_cancel_pkt,
 
.dgram_bind   = virtio_transport_dgram_bind,
.dgram_dequeue= virtio_transport_dgram_dequeue,
-- 
2.7.4



[PATCH-v5 4/4] vsock: cancel packets when failing to connect

2017-03-14 Thread Peng Tao
Otherwise we'll leave the packets queued until the vsock device is released.
E.g., if the guest is slow to start up, resulting in ETIMEDOUT on connect, the
guest will still get the connect requests from the failed host sockets.

Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jorgen Hansen 
Signed-off-by: Peng Tao 
---
 net/vmw_vsock/af_vsock.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9192ead..756542a 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1102,10 +1102,19 @@ static const struct proto_ops vsock_dgram_ops = {
.sendpage = sock_no_sendpage,
 };
 
+static int vsock_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+   if (!transport->cancel_pkt)
+   return -EOPNOTSUPP;
+
+   return transport->cancel_pkt(vsk);
+}
+
 static void vsock_connect_timeout(struct work_struct *work)
 {
struct sock *sk;
struct vsock_sock *vsk;
+   int cancel = 0;
 
vsk = container_of(work, struct vsock_sock, dwork.work);
sk = sk_vsock(vsk);
@@ -1116,8 +1125,11 @@ static void vsock_connect_timeout(struct work_struct *work)
sk->sk_state = SS_UNCONNECTED;
sk->sk_err = ETIMEDOUT;
sk->sk_error_report(sk);
+   cancel = 1;
}
release_sock(sk);
+   if (cancel)
+   vsock_transport_cancel_pkt(vsk);
 
sock_put(sk);
 }
@@ -1224,11 +1236,13 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
err = sock_intr_errno(timeout);
sk->sk_state = SS_UNCONNECTED;
sock->state = SS_UNCONNECTED;
+   vsock_transport_cancel_pkt(vsk);
goto out_wait;
} else if (timeout == 0) {
err = -ETIMEDOUT;
sk->sk_state = SS_UNCONNECTED;
sock->state = SS_UNCONNECTED;
+   vsock_transport_cancel_pkt(vsk);
goto out_wait;
}
 
-- 
2.7.4
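
The wrapper added in this patch treats `cancel_pkt` as an optional transport callback. A minimal userspace sketch of that pattern follows; the types and names (`struct conn`, `struct transport_ops`, `transport_cancel_pkt`, `demo_cancel`) are hypothetical and only mirror the shape of the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical connection object standing in for struct vsock_sock. */
struct conn {
    int id;
};

/* Hypothetical ops table; cancel_pkt is optional and may be NULL. */
struct transport_ops {
    int (*cancel_pkt)(struct conn *c);
};

/* Call the optional callback, or report that it is not supported. */
static int transport_cancel_pkt(const struct transport_ops *ops,
                                struct conn *c)
{
    assert(ops != NULL);
    if (!ops->cancel_pkt)
        return -EOPNOTSUPP;
    return ops->cancel_pkt(c);
}

/* Example implementation: mark the connection as cancelled. */
static int demo_cancel(struct conn *c)
{
    c->id = -1;
    return 0;
}
```

Transports that do not implement the callback keep working unchanged; the callers in the patch ignore the return value, so `-EOPNOTSUPP` simply means nothing was cancelled.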



[PATCH-v5 0/4] vsock: cancel connect packets when failing to connect

2017-03-14 Thread Peng Tao
Currently, if a connect call fails on a signal or timeout (e.g., the guest is
still in the process of starting up), we just return to the caller and leave
the connect packets queued. They are still sent even though the connection is
considered a failure, which can confuse applications with unwanted, false
connect attempts.

The patchset enables vsock (both host and guest) to cancel queued packets when
a connect attempt is considered to fail.

v5 changelog:
  - change virtio_vsock_pkt->cancel_token back to virtio_vsock_pkt->vsk
v4 changelog:
  - drop two unnecessary void * cast
  - update new callback comment
v3 changelog:
  - define cancel_pkt callback in struct vsock_transport rather than struct virtio_transport
  - rename virtio_vsock_pkt->vsk to virtio_vsock_pkt->cancel_token
v2 changelog:
  - fix queued_replies counting and resume tx/rx when necessary

Cheers,
Tao

Peng Tao (4):
  vsock: track pkt owner vsock
  vhost-vsock: add pkt cancel capability
  vsock: add pkt cancel capability
  vsock: cancel packets when failing to connect

 drivers/vhost/vsock.c   | 41 
 include/linux/virtio_vsock.h|  3 +++
 include/net/af_vsock.h  |  3 +++
 net/vmw_vsock/af_vsock.c| 14 +++
 net/vmw_vsock/virtio_transport.c| 42 +
 net/vmw_vsock/virtio_transport_common.c |  7 ++
 6 files changed, 110 insertions(+)

-- 
2.7.4



Re: [net-next 01/13] i40evf: fix client warnings

2017-03-14 Thread David Miller
From: Jeff Kirsher 
Date: Tue, 14 Mar 2017 17:44:44 -0700

> On Tue, 2017-03-14 at 17:39 -0700, David Miller wrote:
>> From: Jeff Kirsher 
>> Date: Tue, 14 Mar 2017 15:32:56 -0700
>> 
>> > From: Faisal Latif 
>> > 
>> > The function prototype in i40evf_client.h are giving warnings while
>> > compiling i40iwvf module. Move these function prototypes to
>> > i40evf.h.
>> > Also fix return code from u32 to int and this return code is
>> > consistent with i40e_client.h
>> > 
>> > Change-Id: Ie3757f844993aabc27654aaf02ec14fb985ad2c4
>> > Signed-off-by: Faisal Latif 
>> > Tested-by: Andrew Bowers 
>> > Signed-off-by: Jeff Kirsher 
>> 
>> There is no such i40evf_client.h header file in the tree.
> 
> Not sure how this got out of order, Mitch adds this file in patch #4

Ok, please fix the order so that the dependencies are correct.

Thanks.


Re: [net-next 01/13] i40evf: fix client warnings

2017-03-14 Thread Jeff Kirsher
On Tue, 2017-03-14 at 17:39 -0700, David Miller wrote:
> From: Jeff Kirsher 
> Date: Tue, 14 Mar 2017 15:32:56 -0700
> 
> > From: Faisal Latif 
> > 
> > The function prototype in i40evf_client.h are giving warnings while
> > compiling i40iwvf module. Move these function prototypes to
> > i40evf.h.
> > Also fix return code from u32 to int and this return code is
> > consistent with i40e_client.h
> > 
> > Change-Id: Ie3757f844993aabc27654aaf02ec14fb985ad2c4
> > Signed-off-by: Faisal Latif 
> > Tested-by: Andrew Bowers 
> > Signed-off-by: Jeff Kirsher 
> 
> There is no such i40evf_client.h header file in the tree.

Not sure how this got out of order, Mitch adds this file in patch #4

> 
> >   int i40evf_config_rss(struct i40evf_adapter *adapter);
> > +int i40evf_lan_add_device(struct i40evf_adapter *adapter);
> > +int i40evf_lan_del_device(struct i40evf_adapter *adapter);
> > +void i40evf_client_subtask(struct i40evf_adapter *adapter);
> > +void i40evf_notify_client_message(struct i40e_vsi *vsi, u8 *msg,
> > u16 len);
> > +void i40evf_notify_client_l2_params(struct i40e_vsi *vsi);
> > +void i40evf_notify_client_open(struct i40e_vsi *vsi);
> > +void i40evf_notify_client_close(struct i40e_vsi *vsi);
> >   #endif /* _I40EVF_H_ */
> 
> And these functions do not exist in the tree either.
> 
> This really isn't acceptable.





Re: [net-next 01/13] i40evf: fix client warnings

2017-03-14 Thread David Miller
From: Jeff Kirsher 
Date: Tue, 14 Mar 2017 15:32:56 -0700

> From: Faisal Latif 
> 
> The function prototype in i40evf_client.h are giving warnings while
> compiling i40iwvf module. Move these function prototypes to i40evf.h.
> Also fix return code from u32 to int and this return code is
> consistent with i40e_client.h
> 
> Change-Id: Ie3757f844993aabc27654aaf02ec14fb985ad2c4
> Signed-off-by: Faisal Latif 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 

There is no such i40evf_client.h header file in the tree.

>  int i40evf_config_rss(struct i40evf_adapter *adapter);
> +int i40evf_lan_add_device(struct i40evf_adapter *adapter);
> +int i40evf_lan_del_device(struct i40evf_adapter *adapter);
> +void i40evf_client_subtask(struct i40evf_adapter *adapter);
> +void i40evf_notify_client_message(struct i40e_vsi *vsi, u8 *msg, u16 len);
> +void i40evf_notify_client_l2_params(struct i40e_vsi *vsi);
> +void i40evf_notify_client_open(struct i40e_vsi *vsi);
> +void i40evf_notify_client_close(struct i40e_vsi *vsi);
>  #endif /* _I40EVF_H_ */

And these functions do not exist in the tree either.

This really isn't acceptable.


Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread David Miller
From: David Daney 
Date: Tue, 14 Mar 2017 17:34:02 -0700

> On 03/14/2017 05:29 PM, David Miller wrote:
>> From: David Daney 
>> Date: Tue, 14 Mar 2017 14:21:39 -0700
>>
>>> Changes from v1:
>>>
>>>   - Use unsigned access for SKF_AD_HATYPE
>>>
>>>   - Added three more patches for other problems found.
>>>
>>>
>>> Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
>>> identified some failures and unimplemented features.
>>>
>>> With this patch set we get:
>>>
>>>  test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]
>>>
>>> Both big and little endian tested.
>>>
>>> We still lack eBPF support, but this is better than nothing.
>>
>> What tree are you targeting with these changes?  Do you expect
>> them to go via the MIPS or the net-next tree?
>>
>> Please be explicit about this in the future.
>>
> 
> Sorry I didn't mention it.
> 
> My expectation is that Ralf would merge it via the MIPS tree, as it is
> fully contained within arch/mips/*

Great, thanks for letting me know.


Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread David Daney

On 03/14/2017 05:29 PM, David Miller wrote:
> From: David Daney
> Date: Tue, 14 Mar 2017 14:21:39 -0700
>
>> Changes from v1:
>>
>>   - Use unsigned access for SKF_AD_HATYPE
>>
>>   - Added three more patches for other problems found.
>>
>>
>> Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
>> identified some failures and unimplemented features.
>>
>> With this patch set we get:
>>
>>  test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]
>>
>> Both big and little endian tested.
>>
>> We still lack eBPF support, but this is better than nothing.
>
> What tree are you targeting with these changes?  Do you expect
> them to go via the MIPS or the net-next tree?
>
> Please be explicit about this in the future.

Sorry I didn't mention it.

My expectation is that Ralf would merge it via the MIPS tree, as it is
fully contained within arch/mips/*

David Daney

> Thank you.





Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread David Miller
From: David Daney 
Date: Tue, 14 Mar 2017 14:21:39 -0700

> Changes from v1:
> 
>   - Use unsigned access for SKF_AD_HATYPE
> 
>   - Added three more patches for other problems found.
> 
> 
> Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
> identified some failures and unimplemented features.
> 
> With this patch set we get:
> 
>  test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]
> 
> Both big and little endian tested.
> 
> We still lack eBPF support, but this is better than nothing.

What tree are you targeting with these changes?  Do you expect
them to go via the MIPS or the net-next tree?

Please be explicit about this in the future.

Thank you.


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread David Miller
From: Stephen Hemminger 
Date: Tue, 14 Mar 2017 13:25:06 -0700

> On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
> David Miller  wrote:
> 
>> From: Nikolay Aleksandrov 
>> Date: Tue, 14 Mar 2017 17:58:46 +0200
>> 
>> > On 14/03/17 17:55, Stephen Hemminger wrote:  
>> >> On Tue, 14 Mar 2017 17:36:15 +0200
>> >> Nikolay Aleksandrov  wrote:
>> >>   
>> >>> This patch adds support for ECMP hash policy choice via a new sysctl
>> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>> >>> The current values for fib_multipath_hash_policy are:
>> >>>  0 - layer 3 (default)
>> >>>  1 - layer 4
>> >>> If there's an skb hash already set and it matches the chosen policy then 
>> >>> it
>> >>> will be used instead of being calculated (currently only for L4).
>> >>> In L3 mode we always calculate the hash due to the ICMP error special
>> >>> case, the flow dissector's field consistentification should handle the
>> >>> address order thus we can remove the address reversals.
>> >>>
>> >>> Signed-off-by: Nikolay Aleksandrov   
>> >> 
>> >> It is good to see ECMP come back from the grave.
>> >> Linux used to support it long ago but was abandoned after it was unstable
>> >> and removed from iproute2 in 2012.
>> >> 
>> >> The old API was through route attributes which makes more sense than
>> >> doing it with sysctl. It makes more sense to use netlink instead.
>> >> Therefore please go back and do something like the old API rather than 
>> >> doing it through
>> >> sysctl.
>> >>   
>> > 
>> > That's what my initial version did, but this was discussed during NetConf 
>> > in Seville
>> > and it was decided that it's best to make a global sysctl, thus the 
>> > change.  
>> 
>> Correct, we discussed this, and we all agreed to only have a sysctl for now.
> 
> Why? If you are going to have private discussions please post the rationale
> in public.

The idea is that we couldn't come up with an immediate use case, and if one
came up we could easily add the per-route or per-fib-table attribute.

Most people want the entire system to behave a certain way wrt. ECMP, rather
than have fine granularity.  For example, the case being discussed here is
to simply have software's behavior match that of hardware offloads.

We shouldn't add things until there is a real demonstrated and requested need.


Re: [PATCH v2 2/2] iproute2: add support for invisible qdisc dumping

2017-03-14 Thread Stephen Hemminger
On Wed, 8 Mar 2017 13:04:42 +0100 (CET)
Jiri Kosina  wrote:

> From: Jiri Kosina 
> 
> Support the new TCA_DUMP_INVISIBLE netlink attribute that allows asking 
> kernel to perform 'full qdisc dump', as for historical reasons some of the 
> default qdiscs are being hidden by the kernel.
> 
> The command syntax is being extended by voluntary 'invisible' argument to
> 'tc qdisc show'.
> 
> Signed-off-by: Jiri Kosina 

Applied to net-next thanks.


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread David Ahern
On 3/14/17 5:27 PM, Stephen Hemminger wrote:
> On Tue, 14 Mar 2017 15:38:40 -0700
> Roopa Prabhu  wrote:
> 
>>> That's what my initial version did, but this was discussed during 
>>> NetConf in Seville
>>> and it was decided that it's best to make a global sysctl, thus the 
>>> change.  
>>
>> Correct, we discussed this, and we all agreed to only have a sysctl for 
>> now.  
>
> Why? If you are going to have private discussions please post the 
> rationale
> in public.  

>>>> Stephen, is there any reason to have a per ecmp route multipath algo
>>>> selection ?.
>>>> All platforms have a global multipath selection algo. I also don't see
>>>> routing daemons ready or willing to specify a per ecmp route multipath
>>>> selection algo attribute.
>>>
>>> There is no compelling reason to make the attribute per route. But the
>>> issue is more that configuration through sysctl's is problematic. It doesn't
>>> fit into the standard API paradigm. Sysctl's are like routing patches not
>>> part of the real CLI. Trying to trap sysctl's for things like switchedev
>>> offload is particularly problematic. I can see the case for either way,
>>> and don't have a fixed opinion.  
>>
>> ok. understand the switchdev offload part. It was that way in the past...but
>> today you can listen to sysctl updates on the netconf netlink channel.
>> it works pretty well.
> 
> Is there another patch to add the NETCONFA_ECMP support?
> 

does userspace care?

switchdev uses notifiers for in kernel notification of FIB changes
(routes and rules).


Re: [PATCH iproute2] man: add examples to ip.8

2017-03-14 Thread Stephen Hemminger
On Sun, 12 Mar 2017 21:41:16 +0100
Alexander Alemayhu  wrote:

> Having some examples in the top level man page might make it a little bit 
> easier
> for new users to get started. Reused some words / sentences from the existing
> man pages.
> 
> Suggested-by: 積丹尼 Dan Jacobson 
> Signed-off-by: Alexander Alemayhu 
> ---
> This is my first man page patch, hopefully I've done everything correctly.
> If not please let me know, thanks.
> 
>  man/man8/ip.8 | 28 
>  1 file changed, 28 insertions(+)
> 

Sure looks good, applied.


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Stephen Hemminger
On Tue, 14 Mar 2017 15:38:40 -0700
Roopa Prabhu  wrote:

> >> >> > That's what my initial version did, but this was discussed during 
> >> >> > NetConf in Seville
> >> >> > and it was decided that it's best to make a global sysctl, thus the 
> >> >> > change.  
> >> >>
> >> >> Correct, we discussed this, and we all agreed to only have a sysctl for 
> >> >> now.  
> >> >
> >> > Why? If you are going to have private discussions please post the 
> >> > rationale
> >> > in public.  
> >>
> >> Stephen, is there any reason to have a per ecmp route multipath algo
> >> selection ?.
> >> All platforms have a global multipath selection algo. I also don't see
> >> routing daemons ready or willing to specify a per ecmp route multipath
> >> selection algo attribute.  
> >
> > There is no compelling reason to make the attribute per route. But the
> > issue is more that configuration through sysctl's is problematic. It doesn't
> > fit into the standard API paradigm. Sysctl's are like routing patches not
> > part of the real CLI. Trying to trap sysctl's for things like switchedev
> > offload is particularly problematic. I can see the case for either way,
> > and don't have a fixed opinion.  
> 
> ok. understand the switchdev offload part. It was that way in the past...but
> today you can listen to sysctl updates on the netconf netlink channel.
> it works pretty well.

Is there another patch to add the NETCONFA_ECMP support?


[PATCH v4.11] cgroup, net_cls: iterate the fds of only the tasks which are being migrated

2017-03-14 Thread Tejun Heo
The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51de ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is borne by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the tasks being migrated don't have
many fds to scan, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61fc ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those.  The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61fc
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().

Reported-by: David Goode 
Fixes: 3b13758f51de ("cgroups: Allow dynamically changing net_classid")
Cc: sta...@vger.kernel.org # v4.4+
Cc: Nina Schiff 
Cc: David S. Miller 
Signed-off-by: Tejun Heo 
---
Hello, Dave.

Can you please route this fix for v4.11?  I can also take it through
cgroup/for-4.11-fixes, if that's preferable.

Thanks.
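
For illustration only, the cost difference described above can be modeled
in plain userspace C. This is a sketch, not kernel code: struct task,
fds_scanned_all() and fds_scanned_migrated() are hypothetical names, and
each task is reduced to an fd count plus a flag saying whether it is part
of the migration.

```c
#include <stddef.h>

/* Hypothetical userspace model: one entry per task in the target cgroup. */
struct task {
	size_t nr_fds;     /* number of open fds the task holds */
	int    migrating;  /* non-zero if the task is in the migration set */
};

/* Old attach path: visit every task in the target css, as update_classid() did. */
static size_t fds_scanned_all(const struct task *tasks, size_t n)
{
	size_t scanned = 0;

	for (size_t i = 0; i < n; i++)
		scanned += tasks[i].nr_fds;
	return scanned;
}

/* New attach path: visit only the tasks that are actually being migrated. */
static size_t fds_scanned_migrated(const struct task *tasks, size_t n)
{
	size_t scanned = 0;

	for (size_t i = 0; i < n; i++)
		if (tasks[i].migrating)
			scanned += tasks[i].nr_fds;
	return scanned;
}
```

With two fd-heavy residents and one small incoming task, the first
function counts every fd in the cgroup while the second counts only the
fds of the incoming task.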

 net/core/netclassid_cgroup.c |   32 
 1 file changed, 16 insertions(+), 16 deletions(-)

--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -69,27 +69,17 @@ static int update_classid_sock(const voi
return 0;
 }
 
-static void update_classid(struct cgroup_subsys_state *css, void *v)
+static void cgrp_attach(struct cgroup_taskset *tset)
 {
-   struct css_task_iter it;
+   struct cgroup_subsys_state *css;
struct task_struct *p;
 
-   css_task_iter_start(css, &it);
-   while ((p = css_task_iter_next(&it))) {
+   cgroup_taskset_for_each(p, css, tset) {
task_lock(p);
-   iterate_fd(p->files, 0, update_classid_sock, v);
+   iterate_fd(p->files, 0, update_classid_sock,
+  (void *)(unsigned long)css_cls_state(css)->classid);
task_unlock(p);
}
-   css_task_iter_end(&it);
-}
-
-static void cgrp_attach(struct cgroup_taskset *tset)
-{
-   struct cgroup_subsys_state *css;
-
-   cgroup_taskset_first(tset, &css);
-   update_classid(css,
-  (void *)(unsigned long)css_cls_state(css)->classid);
 }
 
 static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
@@ -101,12 +91,22 @@ static int write_classid(struct cgroup_s
 u64 value)
 {
struct cgroup_cls_state *cs = css_cls_state(css);
+   struct css_task_iter it;
+   struct task_struct *p;
 
cgroup_sk_alloc_disable();
 
cs->classid = (u32)value;
 
-   update_classid(css, (void *)(unsigned long)cs->classid);
+   css_task_iter_start(css, &it);
+   while ((p = css_task_iter_next(&it))) {
+   task_lock(p);
+   iterate_fd(p->files, 0, update_classid_sock,
+  (void *)(unsigned long)cs->classid);
+   task_unlock(p);
+   }
+   

Re: [net-next sample action optimization 3/3] openvswitch: Optimize sample action for the clone use cases

2017-03-14 Thread Andy Zhou
> Actions parameter which hints if it is recirc or sample. We can add
> recirc-id param and set it if it is the recirc case.
> Will this work?
>
Just posted v2.  I added a patch in the end to implement this
refactoring as suggested.


[net-next sample action optimization v2 0/4]

2017-03-14 Thread Andy Zhou
The sample action can be used for translating the OpenFlow 'clone' action.
However, its implementation has not been sufficiently optimized for this
use case. This series attempts to close the gap.

Patch 3 commit message has more details on the specific optimizations
implemented.

Andy Zhou (4):
  openvswitch: Deferred fifo API change.
  openvswitch: Refactor recirc key allocation.
  openvswitch: Optimize sample action for the clone use cases
  Openvswitch: Refactor sample and recirc actions implementation

 include/uapi/linux/openvswitch.h |  13 +++
 net/openvswitch/actions.c| 234 +++
 net/openvswitch/datapath.h   |   2 -
 net/openvswitch/flow_netlink.c   | 141 +++
 4 files changed, 249 insertions(+), 141 deletions(-)

-- 
1.8.3.1



[net-next sample action optimization v2 3/4] openvswitch: Optimize sample action for the clone use cases

2017-03-14 Thread Andy Zhou
With the introduction of the OpenFlow 'clone' action, the OVS user space
can now translate the 'clone' action into the kernel datapath 'sample'
action, with 100% probability, to ensure that the clone semantics,
which is that the packet seen by the clone action is the same as the
packet seen by the action after clone, is faithfully carried out
in the datapath.

While the sample action in the datapath has the matching semantics,
its implementation is only optimized for its original use.
Specifically, there are two limitations: First, there is a three-level
nesting restriction, enforced at flow downloading time. This
limit turns out to be too restrictive for the 'clone' use case.
Second, the implementation avoids a recursive call only if the sample
action list has a single userspace action.

The main optimization implemented in this series removes the static
nesting limit check and instead implements a run-time recursion limit
check and recursion avoidance similar to that of the 'recirc' action.
This optimization solves both limitations above.

One related optimization attempts to avoid copying the flow key as
long as the enclosed actions do not change the flow key. The
detection is performed only once at flow downloading time.

Another related optimization is to rewrite the action list
at flow downloading time in order to save the fast path from parsing
the sample action list in its original form repeatedly.

Signed-off-by: Andy Zhou 
---
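
As an aside, the probability test in the sample() hunk of this patch can
be sketched in standalone userspace C. The helper name
sample_should_execute() is hypothetical, and 'rnd' stands in for the value
prandom_u32() would return in the kernel; the sketch only mirrors the
condition, not the skb handling around it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace sketch of the probability test in the reworked sample().
 * 'rnd' stands in for the random value drawn in the kernel.
 */
static bool sample_should_execute(uint32_t probability, uint32_t rnd)
{
	/* UINT32_MAX means "always": the clone case needs no random draw. */
	if (probability == UINT32_MAX)
		return true;

	/* Zero means "never"; otherwise execute only when rnd <= probability. */
	if (probability == 0 || rnd > probability)
		return false;

	return true;
}
```

A 'clone' translated to 'sample' with 100% probability always takes the
first branch, so the fast path for that case is a single comparison.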
 include/uapi/linux/openvswitch.h |  13 
 net/openvswitch/actions.c| 108 +++---
 net/openvswitch/datapath.h   |   2 -
 net/openvswitch/flow_netlink.c   | 141 +++
 4 files changed, 166 insertions(+), 98 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 7f41f7d..0dfe69b 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -578,10 +578,23 @@ enum ovs_sample_attr {
OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
OVS_SAMPLE_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */
__OVS_SAMPLE_ATTR_MAX,
+
+#ifdef __KERNEL__
+   OVS_SAMPLE_ATTR_ARG  /* struct sample_arg  */
+#endif
 };
 
 #define OVS_SAMPLE_ATTR_MAX (__OVS_SAMPLE_ATTR_MAX - 1)
 
+#ifdef __KERNEL__
+struct sample_arg {
+   bool exec;   /* When true, actions in sample will not
+   change flow keys. False otherwise. */
+   u32  probability;/* Same value as
+   'OVS_SAMPLE_ATTR_PROBABILITY'. */
+};
+#endif
+
 /**
  * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
  * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 8c9c60c..1638370 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -928,73 +928,71 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
return ovs_dp_upcall(dp, skb, key, &upcall, cutlen);
 }
 
+
+/* When 'last' is true, sample() should always consume the 'skb'.
+ * Otherwise, sample() should keep 'skb' intact regardless what
+ * actions are executed within sample().
+ */
 static int sample(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key, const struct nlattr *attr,
- const struct nlattr *actions, int actions_len)
+ bool last)
 {
-   const struct nlattr *acts_list = NULL;
-   const struct nlattr *a;
-   int rem;
-   u32 cutlen = 0;
+   struct nlattr *actions;
+   struct nlattr *sample_arg;
+   struct sw_flow_key *orig_key = key;
+   int rem = nla_len(attr);
+   int err = 0;
+   const struct sample_arg *arg;
 
-   for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
-a = nla_next(a, &rem)) {
-   u32 probability;
+   /* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
+   sample_arg = nla_data(attr);
+   arg = nla_data(sample_arg);
+   actions = nla_next(sample_arg, &rem);
 
-   switch (nla_type(a)) {
-   case OVS_SAMPLE_ATTR_PROBABILITY:
-   probability = nla_get_u32(a);
-   if (!probability || prandom_u32() > probability)
-   return 0;
-   break;
-
-   case OVS_SAMPLE_ATTR_ACTIONS:
-   acts_list = a;
-   break;
-   }
+   if ((arg->probability != U32_MAX) &&
+   (!arg->probability || prandom_u32() > arg->probability)) {
+   if (last)
+   consume_skb(skb);
+   return 0;
}
 
-   rem = nla_len(acts_list);
-   a = nla_data(acts_list);
-
-   /* Actions list is empty, do nothing */
-   if (unlikely(!rem))
+   /* Unless 

[net-next sample action optimization v2 4/4] Openvswitch: Refactor sample and recirc actions implementation

2017-03-14 Thread Andy Zhou
Add execute_or_defer_actions(), which the implementations of both the
sample and recirc actions can use.

Signed-off-by: Andy Zhou 
---
 net/openvswitch/actions.c | 96 +--
 1 file changed, 59 insertions(+), 37 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1638370..fd7d903 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -44,10 +44,6 @@
 #include "conntrack.h"
 #include "vport.h"
 
-static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
- struct sw_flow_key *key,
- const struct nlattr *attr, int len);
-
 struct deferred_action {
struct sk_buff *skb;
const struct nlattr *actions;
@@ -166,6 +162,12 @@ static bool is_flow_key_valid(const struct sw_flow_key *key)
return !(key->mac_proto & SW_FLOW_KEY_INVALID);
 }
 
+static int execute_or_defer_actions(struct datapath *dp, struct sk_buff *skb,
+   struct sw_flow_key *clone,
+   struct sw_flow_key *key,
+   u32 *recirc_id,
+   const struct nlattr *actions, int len);
+
 static void update_ethertype(struct sk_buff *skb, struct ethhdr *hdr,
 __be16 ethertype)
 {
@@ -941,7 +943,7 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
struct nlattr *sample_arg;
struct sw_flow_key *orig_key = key;
int rem = nla_len(attr);
-   int err = 0;
+   int err;
const struct sample_arg *arg;
 
/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
@@ -979,15 +981,8 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
key = clone_key(key);
}
 
-   if (key) {
-   err = do_execute_actions(dp, skb, key, actions, rem);
-   } else if (!add_deferred_actions(skb, orig_key, actions, rem)) {
-
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop sample action\n",
-   ovs_dp_name(dp));
-   kfree_skb(skb);
-   }
+   err = execute_or_defer_actions(dp, skb, key, orig_key, NULL,
+  actions, rem);
 
if (!arg->exec)
__this_cpu_dec(exec_actions_level);
@@ -1105,8 +1100,7 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key,
  const struct nlattr *a, int rem)
 {
-   struct sw_flow_key *recirc_key;
-   struct deferred_action *da;
+   u32 recirc_id;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1130,27 +1124,9 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
return 0;
}
 
-   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
-* recirc immediately, otherwise, defer it for later execution.
-*/
-   recirc_key = clone_key(key);
-   if (recirc_key) {
-   recirc_key->recirc_id = nla_get_u32(a);
-   ovs_dp_process_packet(skb, recirc_key);
-   } else {
-   da = add_deferred_actions(skb, key, NULL, 0);
-   if (da) {
-   recirc_key = &da->pkt_key;
-   recirc_key->recirc_id = nla_get_u32(a);
-   } else {
-   /* Log an error in case action fifo is full.  */
-   kfree_skb(skb);
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop recirc action\n",
-   ovs_dp_name(dp));
-   }
-   }
-   return 0;
+   recirc_id = nla_get_u32(a);
+   return execute_or_defer_actions(dp, skb, clone_key(key),
+   key, &recirc_id, NULL, 0);
 }
 
 /* Execute a list of actions against 'skb'. */
@@ -1286,6 +1262,52 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
return 0;
 }
 
+static int execute_or_defer_actions(struct datapath *dp, struct sk_buff *skb,
+   struct sw_flow_key *clone,
+   struct sw_flow_key *key,
+   u32 *recirc_id,
+   const struct nlattr *actions, int len)
+{
+   struct deferred_action *da;
+
+   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
+* recirc immediately, otherwise, defer it for later execution.
+*/
+   if (clone) {
+   if (recirc_id) {
+   clone->recirc_id = *recirc_id;
+   ovs_dp_process_packet(skb, clone);
+   return 0;
+   } else {
+   return 

[net-next sample action optimization v2 2/4] openvswitch: Refactor recirc key allocation.

2017-03-14 Thread Andy Zhou
The logic for allocating and copying a key for each 'exec_actions_level'
was specific to execute_recirc(). However, future patches will reuse
it as well.  Refactor the logic into its own function clone_key().

Signed-off-by: Andy Zhou 
---
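
To make the slot selection easier to follow, here is a userspace sketch
of the idea behind clone_key(). It is an approximation under stated
assumptions: KEY_THRESHOLD, struct flow_key, key_slots, exec_level and
clone_key_sketch() are hypothetical stand-ins for
OVS_DEFERRED_ACTION_THRESHOLD, struct sw_flow_key, the percpu flow_keys
and exec_actions_level, and the threshold value is chosen arbitrarily.

```c
#include <stddef.h>

#define KEY_THRESHOLD 3  /* assumed value, stands in for OVS_DEFERRED_ACTION_THRESHOLD */

struct flow_key {
	unsigned int recirc_id;  /* hypothetical, heavily reduced flow key */
};

static struct flow_key key_slots[KEY_THRESHOLD];  /* models the percpu flow_keys */
static int exec_level;                            /* models exec_actions_level */

/* Copy 'src' into the slot that belongs to the current nesting level.
 * Returns NULL when the nesting level has no slot left. The lower bound
 * check is an addition for this sketch; the patch only tests the upper bound.
 */
static struct flow_key *clone_key_sketch(const struct flow_key *src)
{
	struct flow_key *key = NULL;

	if (exec_level >= 1 && exec_level <= KEY_THRESHOLD) {
		key = &key_slots[exec_level - 1];
		*key = *src;
	}

	return key;
}
```

Callers treat a NULL return as the signal to defer the work instead of
recursing further, which is how execute_recirc() uses the result in the
diff below.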
 net/openvswitch/actions.c | 66 ---
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 75182e9..8c9c60c 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2007-2014 Nicira, Inc.
+ * Copyright (c) 2007-2017 Nicira, Inc.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -83,14 +83,31 @@ struct action_fifo {
struct deferred_action fifo[DEFERRED_ACTION_FIFO_SIZE];
 };
 
-struct recirc_keys {
+struct action_flow_keys {
struct sw_flow_key key[OVS_DEFERRED_ACTION_THRESHOLD];
 };
 
 static struct action_fifo __percpu *action_fifos;
-static struct recirc_keys __percpu *recirc_keys;
+static struct action_flow_keys __percpu *flow_keys;
 static DEFINE_PER_CPU(int, exec_actions_level);
 
+/* Make a clone of the 'key', using the pre-allocated percpu 'flow_keys'
+ * space. Return NULL if out of key spaces.
+ */
+static struct sw_flow_key *clone_key(const struct sw_flow_key *key_)
+{
+   struct action_flow_keys *keys = this_cpu_ptr(flow_keys);
+   int level = this_cpu_read(exec_actions_level);
+   struct sw_flow_key *key = NULL;
+
+   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
+   key = &keys->key[level - 1];
+   *key = *key_;
+   }
+
+   return key;
+}
+
 static void action_fifo_init(struct action_fifo *fifo)
 {
fifo->head = 0;
@@ -1090,8 +1107,8 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
  struct sw_flow_key *key,
  const struct nlattr *a, int rem)
 {
+   struct sw_flow_key *recirc_key;
struct deferred_action *da;
-   int level;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1115,29 +1132,26 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
return 0;
}
 
-   level = this_cpu_read(exec_actions_level);
-   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
-   struct recirc_keys *rks = this_cpu_ptr(recirc_keys);
-   struct sw_flow_key *recirc_key = &rks->key[level - 1];
-
-   *recirc_key = *key;
+   /* If within the limit of 'OVS_DEFERRED_ACTION_THRESHOLD',
+* recirc immediately, otherwise, defer it for later execution.
+*/
+   recirc_key = clone_key(key);
+   if (recirc_key) {
recirc_key->recirc_id = nla_get_u32(a);
ovs_dp_process_packet(skb, recirc_key);
-
-   return 0;
-   }
-
-   da = add_deferred_actions(skb, key, NULL, 0);
-   if (da) {
-   da->pkt_key.recirc_id = nla_get_u32(a);
} else {
-   kfree_skb(skb);
-
-   if (net_ratelimit())
-   pr_warn("%s: deferred action limit reached, drop recirc action\n",
-   ovs_dp_name(dp));
+   da = add_deferred_actions(skb, key, NULL, 0);
+   if (da) {
+   recirc_key = &da->pkt_key;
+   recirc_key->recirc_id = nla_get_u32(a);
+   } else {
+   /* Log an error in case action fifo is full.  */
+   kfree_skb(skb);
+   if (net_ratelimit())
+   pr_warn("%s: deferred action limit reached, drop recirc action\n",
+   ovs_dp_name(dp));
+   }
}
-
return 0;
 }
 
@@ -1327,8 +1341,8 @@ int action_fifos_init(void)
if (!action_fifos)
return -ENOMEM;
 
-   recirc_keys = alloc_percpu(struct recirc_keys);
-   if (!recirc_keys) {
+   flow_keys = alloc_percpu(struct action_flow_keys);
+   if (!flow_keys) {
free_percpu(action_fifos);
return -ENOMEM;
}
@@ -1339,5 +1353,5 @@ int action_fifos_init(void)
 void action_fifos_exit(void)
 {
free_percpu(action_fifos);
-   free_percpu(recirc_keys);
+   free_percpu(flow_keys);
 }
-- 
1.8.3.1



[net-next sample action optimization v2 1/4] openvswitch: Deferred fifo API change.

2017-03-14 Thread Andy Zhou
The add_deferred_actions() API currently requires actions to be passed in
as a fully encoded netlink message. So far both 'sample' and 'recirc'
actions happen to carry actions as fully encoded netlink messages.
However, this requirement is more restrictive than necessary; a future
patch will need to pass in action lists that are not fully encoded
by themselves.

Signed-off-by: Andy Zhou 
Acked-by: Joe Stringer 
---
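
The API change can be pictured with a small userspace model. This is only
a sketch under assumptions: FIFO_SIZE, struct deferred, struct fifo,
fifo_put() and add_deferred() are hypothetical names, the capacity is
arbitrary, and the skb and flow key that the kernel structure also
carries are left out.

```c
#include <stddef.h>

#define FIFO_SIZE 10  /* assumed capacity, stands in for DEFERRED_ACTION_FIFO_SIZE */

struct deferred {
	const void *actions;  /* start of the action list, may be NULL */
	int actions_len;      /* length in bytes, now passed explicitly */
};

struct fifo {
	int head;
	struct deferred slot[FIFO_SIZE];
};

/* Reserve the next free slot, or return NULL when the fifo is full. */
static struct deferred *fifo_put(struct fifo *f)
{
	if (f->head >= FIFO_SIZE)
		return NULL;
	return &f->slot[f->head++];
}

/* Record a pointer and a length instead of one self-describing attribute,
 * so the caller may point into the middle of a larger encoded message.
 */
static struct deferred *add_deferred(struct fifo *f, const void *actions,
				     int actions_len)
{
	struct deferred *da = fifo_put(f);

	if (da) {
		da->actions = actions;
		da->actions_len = actions_len;
	}
	return da;
}
```

Because the length travels with the pointer, the consumer no longer needs
to call nla_len() on the stored pointer, which is what the
process_deferred_actions() hunk below changes.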
 net/openvswitch/actions.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index c82301c..75182e9 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -51,6 +51,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 struct deferred_action {
struct sk_buff *skb;
const struct nlattr *actions;
+   int actions_len;
 
/* Store pkt_key clone when creating deferred action. */
struct sw_flow_key pkt_key;
@@ -119,8 +120,9 @@ static struct deferred_action *action_fifo_put(struct action_fifo *fifo)
 
 /* Return true if fifo is not full */
 static struct deferred_action *add_deferred_actions(struct sk_buff *skb,
-   const struct sw_flow_key *key,
-   const struct nlattr *attr)
+   const struct sw_flow_key *key,
+   const struct nlattr *actions,
+   const int actions_len)
 {
struct action_fifo *fifo;
struct deferred_action *da;
@@ -129,7 +131,8 @@ static struct deferred_action *add_deferred_actions(struct sk_buff *skb,
da = action_fifo_put(fifo);
if (da) {
da->skb = skb;
-   da->actions = attr;
+   da->actions = actions;
+   da->actions_len = actions_len;
da->pkt_key = *key;
}
 
@@ -966,7 +969,8 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
/* Skip the sample action when out of memory. */
return 0;
 
-   if (!add_deferred_actions(skb, key, a)) {
+   if (!add_deferred_actions(skb, key, nla_data(acts_list),
+ nla_len(acts_list))) {
if (net_ratelimit())
pr_warn("%s: deferred actions limit reached, dropping sample action\n",
ovs_dp_name(dp));
@@ -1123,7 +1127,7 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
return 0;
}
 
-   da = add_deferred_actions(skb, key, NULL);
+   da = add_deferred_actions(skb, key, NULL, 0);
if (da) {
da->pkt_key.recirc_id = nla_get_u32(a);
} else {
@@ -1278,10 +1282,10 @@ static void process_deferred_actions(struct datapath *dp)
struct sk_buff *skb = da->skb;
struct sw_flow_key *key = &da->pkt_key;
const struct nlattr *actions = da->actions;
+   int actions_len = da->actions_len;
 
if (actions)
-   do_execute_actions(dp, skb, key, actions,
-  nla_len(actions));
+   do_execute_actions(dp, skb, key, actions, actions_len);
else
ovs_dp_process_packet(skb, key);
} while (!action_fifo_is_empty(fifo));
-- 
1.8.3.1



Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread David Daney

On 03/14/2017 03:49 PM, Alexei Starovoitov wrote:
> On Tue, Mar 14, 2017 at 02:21:39PM -0700, David Daney wrote:
>> Changes from v1:
>>
>>   - Use unsigned access for SKF_AD_HATYPE
>>
>>   - Added three more patches for other problems found.
>>
>>
>> Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
>> identified some failures and unimplemented features.
>>
>> With this patch set we get:
>>
>>  test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]
>>
>> Both big and little endian tested.
>>
>> We still lack eBPF support, but this is better than nothing.
>>
>> David Daney (5):
>>   MIPS: uasm:  Add support for LHU.
>>   MIPS: BPF: Add JIT support for SKF_AD_HATYPE.
>>   MIPS: BPF: Use unsigned access for unsigned SKB fields.
>>   MIPS: BPF: Quit clobbering callee saved registers in JIT code.
>>   MIPS: BPF: Fix multiple problems in JIT skb access helpers.
>
> Thanks. Nice set of fixes. Especially patch 4.
> Did you see crashes because of it?

Only when running the test-bpf module.

The "JMP_JA: Jump, gap, jump, ..." test doesn't actually use any
registers, which I think is somewhat uncommon in BPF code.  The system
would either crash or have weird behavior after running this test.

> Acked-by: Alexei Starovoitov


Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread Alexei Starovoitov
On Tue, Mar 14, 2017 at 02:21:39PM -0700, David Daney wrote:
> Changes from v1:
> 
>   - Use unsigned access for SKF_AD_HATYPE
> 
>   - Added three more patches for other problems found.
> 
> 
> Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
> identified some failures and unimplemented features.
> 
> With this patch set we get:
> 
>  test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]
> 
> Both big and little endian tested.
> 
> We still lack eBPF support, but this is better than nothing.
> 
> David Daney (5):
>   MIPS: uasm:  Add support for LHU.
>   MIPS: BPF: Add JIT support for SKF_AD_HATYPE.
>   MIPS: BPF: Use unsigned access for unsigned SKB fields.
>   MIPS: BPF: Quit clobbering callee saved registers in JIT code.
>   MIPS: BPF: Fix multiple problems in JIT skb access helpers.

Thanks. Nice set of fixes. Especially patch 4.
Did you see crashes because of it?
Acked-by: Alexei Starovoitov 



Re: openvswitch conntrack and nat problem in first packet reply with RST

2017-03-14 Thread Joe Stringer
On 13 March 2017 at 20:18, wenxu  wrote:
> Hi all,
>
> There is a simple test for conntrack and nat in openvswitch.  I want to do 
> stateful
> firewall with conntrack then do nat
>
> netns1 port1 with ip 10.0.0.7
> netns2 port2 with ip 1.1.1.7
>
> netns1 10.0.0.7 src -nat to 2.2.1.7 access netns2 1.1.1.7
>
> 1. # ovs-ofctl add-flow br0  'ip,in_port=1 actions=ct(table=1,zone=1)'
> 2. # ovs-ofctl add-flow br0  'ip,in_port=2 actions=ct(table=1,zone=1)'
> 3. # ovs-ofctl add-flow br0  'table=1, 
> ct_state=+new+trk,tcp,in_port=1,tp_dst=123 
> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
> 4. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=2 
> actions=ct(commit,zone=1,nat(dst=10.0.0.7)),output:1'
> 5. # ovs-ofctl add-flow br0  'table=1, ct_state=+est+trk,ip,in_port=1  
> actions=ct(commit,zone=1,nat(src=2.2.1.7)),output:2'
>
>
> I found that netns1 can access 1.1.1.7:123 when something is listening on
> port 123 on 1.1.1.7 in netns2.
>
> But if nothing is listening on port 123, the first RST packet sent in reply by
> 1.1.1.7 (no kernel datapath rule yet) can't be dst-NATed back to 10.0.0.7.  The
> second RST packet is OK (a kernel datapath rule exists, created by the first RST packet).
>
> # tcpdump -i eth0 -nnn
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
> 14:44:13.575200 IP 10.0.0.7.39891 > 1.1.1.7.123: Flags [S], seq 93585, 
> win 29200, options [mss 1460,sackOK,TS val 584707316 ecr 0,nop,wscale 7], 
> length 0
> 14:44:13.576036 IP 1.1.1.7.123 > 2.2.1.7.39891: Flags [R.], seq 0, ack 
> 93586, win 0, length 0
>
> But the datapath flow is correct
> # ovs-dpctl dump-flows
> recirc_id(0),in_port(7),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
> used:never, actions:ct(zone=1),recirc(0x5a)
> recirc_id(0x5a),in_port(7),ct_state(+new+trk),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=123),
>  packets:0, bytes:0, used:never,
> actions:ct(commit,zone=1,nat(src=2.2.1.7)),8
> recirc_id(0),in_port(8),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, 
> used:never, actions:ct(zone=1),recirc(0x5b)
> recirc_id(0x5b),in_port(8),ct_state(-new+est+trk),eth_type(0x0800),ipv4(frag=no),
>  packets:0, bytes:0, used:never,
> actions:ct(commit,zone=1,nat(dst=10.0.0.7)),7
>
>
> I think it is a problem with the PACKET-OUT and the RST packet.
>
> There are two packet-outs, for rule 2 and rule 4. Rule 2 goes through connection
> tracking, finds that it is an RST packet, and deletes the conntrack entry. As a
> result, the second packet (from rule 4) can't find the conntrack entry to do dst-nat.
>
> In the netfilter/nf_conntrack_proto_tcp.c file:
>  if (!test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
> /* If only reply is a RST, we can consider ourselves not to
>have an established connection: this is a fairly common
>problem case, so we can delete the conntrack
>immediately.  --RR */
> if (th->rst ) {
> nf_ct_kill_acct(ct, ctinfo, skb);
> return NF_ACCEPT;
> }
> }
>
>
> A switch should be added to keep this conntrack entry from being deleted.
>
> if (!test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
> /* If only reply is a RST, we can consider ourselves not to
>have an established connection: this is a fairly common
>problem case, so we can delete the conntrack
>immediately.  --RR */
> -if (th->rst ) {
> +if (th->rst && !nf_ct_tcp_rst_no_kill) {
> nf_ct_kill_acct(ct, ctinfo, skb);
> return NF_ACCEPT;
> }

How would you know to not kill the entry? How would you ensure it's
properly cleaned up later? I'm not sure if there's a way to implement
this without some fairly serious plumbing.

If you look at the examples in the OVS testsuite[0], it is suggested
to use "ct(nat)" with no options early in your rules. This ensures
that the connection is looked up, and if necessary, NAT is applied at
the same time - meaning that the RST can be NATed back AND the
connection is deleted. In the later table you need to differentiate
the connections based on whether they were already statefully NATed or
not. For new connections, it would be handled by your rule #3 (which
would then perform the nat as part of that rule's actions). For
existing connections, the packet is already NATed by the time it
reaches table 1, and your rules 4-5 shouldn't need to apply the nat.
If you still need access to the original tuple for matching purposes,
the new 'ct_nw_src', 'ct_nw_dst', etc. fields will provide the
original ct 5-tuple. Note, however, that those are only available on OVS
master and should be part of OVS 2.8.

[0] 
https://github.com/openvswitch/ovs/blob/branch-2.7/tests/system-traffic.at#L2331
[1] http://openvswitch.org/support/dist-docs/ovs-fields.7.html


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Roopa Prabhu
On Tue, Mar 14, 2017 at 2:42 PM, Stephen Hemminger
 wrote:
> On Tue, 14 Mar 2017 14:10:22 -0700
> Roopa Prabhu  wrote:
>
>> On Tue, Mar 14, 2017 at 1:25 PM, Stephen Hemminger
>>  wrote:
>> > On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
>> > David Miller  wrote:
>> >
>> >> From: Nikolay Aleksandrov 

[snip]


>> >> > That's what my initial version did, but this was discussed during 
>> >> > NetConf in Seville
>> >> > and it was decided that it's best to make a global sysctl, thus the 
>> >> > change.
>> >>
>> >> Correct, we discussed this, and we all agreed to only have a sysctl for 
>> >> now.
>> >
>> > Why? If you are going to have private discussions please post the rationale
>> > in public.
>>
>> Stephen, is there any reason to have a per ecmp route multipath algo
>> selection ?.
>> All platforms have a global multipath selection algo. I also don't see
>> routing daemons ready or willing to specify a per ecmp route multipath
>> selection algo attribute.
>
> There is no compelling reason to make the attribute per route. But the
> issue is more that configuration through sysctl's is problematic. It doesn't
> fit into the standard API paradigm. Sysctl's are like routing patches not
> part of the real CLI. Trying to trap sysctl's for things like switchdev
> offload is particularly problematic. I can see the case for either way,
> and don't have a fixed opinion.

ok. understand the switchdev offload part. It was that way in the past...but
today you can listen to sysctl updates on the netconf netlink channel.
it works pretty well.

>
> The bigger discussion is trying to keep a record of the rationale for 
> decisions
> such that there isn't buried tribal knowledge. This is why Dave has always 
> been
> quite insistent on having discussions on the mailing list. There doesn't seem 
> to
> be a good long term record other than Documentation/networking or commit logs.
>

agree. Most of the discussion around this has happened on the netdev
mailing list so far.
The previous set that updated ecmp algo tried to add a per route
attribute and after review on this list was moved to
on by default. Nikolay's first version included a per route
attribute...to follow the previous ecmp work.
Before Nikolay posted the second version, my feedback was to make it
global. And this feedback was only re-iterated at
last netconf/netdev. so, we can continue the discussion here if there
are other opinions.


Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread David Daney

On 03/14/2017 03:29 PM, Daniel Borkmann wrote:
> On 03/14/2017 10:21 PM, David Daney wrote:
>> Changes from v1:
>>
>>    - Use unsigned access for SKF_AD_HATYPE
>>
>>    - Added three more patches for other problems found.
>>
>>
>> Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
>> identified some failures and unimplemented features.
>
> Nice, thanks for working on this! If you see specific test
> cases for the JIT missing, please also feel free to extend
> the test_bpf suite, so this gets exposed further to other
> JITs, too.
>
>> With this patch set we get:
>>
>>   test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]
>>
>> Both big and little endian tested.
>>
>> We still lack eBPF support, but this is better than nothing.
>
> Any future plans on this one?

Yes, my plan is to fully implement eBPF support for 64-bit MIPS, let's
see if I get enough time to do it.


David.


[net-next 10/13] i40e/i40evf: Add support for mapping pages with DMA attributes

2017-03-14 Thread Jeff Kirsher
From: Alexander Duyck 

This patch adds support for DMA_ATTR_SKIP_CPU_SYNC and
DMA_ATTR_WEAK_ORDERING. By enabling both of these for the Rx path we
are able to see performance improvements on architectures that implement
either one, because page mapping and unmapping only has to
sync what is actually being used instead of the entire buffer. In addition,
enabling the weak ordering attribute gives a performance improvement
on architectures that can associate a memory ordering with a DMA buffer,
such as Sparc.

Change-ID: If176824e8231c5b24b8a5d55b339a6026738fc75
Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 31 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  3 +++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 31 ++-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |  3 +++
 4 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 97d46058d71d..86e4991a03a7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1010,7 +1010,6 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
  **/
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 {
-   struct device *dev = rx_ring->dev;
unsigned long bi_size;
u16 i;
 
@@ -1030,7 +1029,20 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
if (!rx_bi->page)
continue;
 
-   dma_unmap_page(dev, rx_bi->dma, PAGE_SIZE, DMA_FROM_DEVICE);
+   /* Invalidate cache lines that may have been written to by
+* device so that we avoid corrupting memory.
+*/
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ rx_bi->dma,
+ rx_bi->page_offset,
+ I40E_RXBUFFER_2048,
+ DMA_FROM_DEVICE);
+
+   /* free resources associated with mapping */
+   dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
+PAGE_SIZE,
+DMA_FROM_DEVICE,
+I40E_RX_DMA_ATTR);
__free_pages(rx_bi->page, 0);
 
rx_bi->page = NULL;
@@ -1159,7 +1171,10 @@ static bool i40e_alloc_mapped_page(struct i40e_ring 
*rx_ring,
}
 
/* map page for use */
-   dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+   dma = dma_map_page_attrs(rx_ring->dev, page, 0,
+PAGE_SIZE,
+DMA_FROM_DEVICE,
+I40E_RX_DMA_ATTR);
 
/* if mapping failed free memory back to system since
 * there isn't much point in holding memory we can't use
@@ -1219,6 +1234,12 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rx_ring, 
u16 cleaned_count)
if (!i40e_alloc_mapped_page(rx_ring, bi))
goto no_buffers;
 
+   /* sync the buffer for use by the device */
+   dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
+bi->page_offset,
+I40E_RXBUFFER_2048,
+DMA_FROM_DEVICE);
+
/* Refresh the desc even if buffer_addrs didn't change
 * because each write-back erases this info.
 */
@@ -1685,8 +1706,8 @@ struct sk_buff *i40e_fetch_rx_buffer(struct i40e_ring 
*rx_ring,
rx_ring->rx_stats.page_reuse_count++;
} else {
/* we are not reusing the buffer so unmap it */
-   dma_unmap_page(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
-  DMA_FROM_DEVICE);
+   dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma, PAGE_SIZE,
+DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
}
 
/* clear contents of buffer_info */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index f80979025c01..49c7b2089d8e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -133,6 +133,9 @@ enum i40e_dyn_idx_t {
 #define I40E_RX_HDR_SIZE I40E_RXBUFFER_256
 #define i40e_rx_desc i40e_32byte_rx_desc
 
+#define I40E_RX_DMA_ATTR \
+   (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
+
 /**
  * i40e_test_staterr - tests bits in Rx descriptor status and error fields
  * @rx_desc: pointer to receive 

[net-next 11/13] i40e: Allow untrusted VFs to have more filters

2017-03-14 Thread Jeff Kirsher
From: Mitch Williams 

Our original filter limit of 8 was based on behavior that we saw from
Linux VMs. Now we're running Other Operating Systems under KVM and we
see that they commonly use more MAC filters. Since it seems weird to
require people to enable trusted VFs just to boot their OS, bump the
number of filters allowed by default.

Change-ID: I76b2dcb2ad6017e39231ad3096c3fb6f065eef5e
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 115a7286ab8f..cfe8b78dac0e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1851,7 +1851,7 @@ static int i40e_vc_get_stats_msg(struct i40e_vf *vf, u8 
*msg, u16 msglen)
 }
 
 /* If the VF is not trusted restrict the number of MAC/VLAN it can program */
-#define I40E_VC_MAX_MAC_ADDR_PER_VF 8
+#define I40E_VC_MAX_MAC_ADDR_PER_VF 12
 #define I40E_VC_MAX_VLAN_PER_VF 8
 
 /**
-- 
2.12.0



[net-next 06/13] i40e: don't add more vectors to num_lan_msix than number of CPUs

2017-03-14 Thread Jeff Kirsher
From: Jacob Keller 

This is a solution to avoid adding too many queues to num_lan_msix.
A recent refactor of queue pairs accidentally added all remaining
vectors to the num_lan_msix which can have adverse performance issues,
due to enabling more queues than the number of CPU cores.

This patch removes the old calculation, and replaces it with a simple
algorithm.

1) add queue pairs up to num_online_cpus(), but capped at half of total
   vectors
2) then add alternative features such as flow director and similar
3) finally, add the remaining vectors back to queue pairs, but capped
   such that the total number of queue pairs does not exceed
   num_online_cpus().
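
The three steps above can be sketched in plain C as follows. This is an
illustrative userspace model, not the driver code: `struct msix_budget`,
`split_lan_vectors()` and the `other_features` parameter are hypothetical
names, and the model assumes the other features can be treated as one lump
sum, whereas the driver reserves vectors for each feature separately.

```c
#include <assert.h>

/* Hypothetical result holder; not a driver structure. */
struct msix_budget {
	int num_lan_msix;   /* vectors given to LAN queue pairs */
	int vectors_left;   /* vectors still unassigned */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * cpus:           stand-in for num_online_cpus()
 * vectors_left:   vectors available before queue pairs are assigned
 * other_features: vectors wanted by other features (assumed lump sum)
 */
static struct msix_budget split_lan_vectors(int cpus, int vectors_left,
					    int other_features)
{
	struct msix_budget b;
	int extra;

	/* Step 1: at most half of the vectors, and never more than the CPUs. */
	b.num_lan_msix = min_int(cpus, vectors_left / 2);
	vectors_left -= b.num_lan_msix;

	/* Step 2: other features take what they need from the remainder. */
	vectors_left -= min_int(other_features, vectors_left);

	/* Step 3: top up the queue pairs, still capped at the CPU count. */
	extra = min_int(cpus - b.num_lan_msix, vectors_left);
	b.num_lan_msix += extra;
	vectors_left -= extra;

	/* Mirrors the WARN() in the patch: the accounting must not underflow. */
	assert(vectors_left >= 0);
	b.vectors_left = vectors_left;
	return b;
}
```

With 16 CPUs, 20 free vectors and 3 vectors wanted by other features, this
model assigns 10 vectors in step 1 and 6 more in step 3, ending at 16
queue-pair vectors with one vector to spare.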

Change-ID: I668abf67d5011a1248866daba8885f4ff00cb8d9
Signed-off-by: Jacob Keller 
Signed-off-by: Harshitha Ramamurthy 
Signed-off-by: Carolyn Wyborny 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 30 ++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 3d7f179af6be..cb678ed7a2ad 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7819,6 +7819,7 @@ static int i40e_reserve_msix_vectors(struct i40e_pf *pf, 
int vectors)
 static int i40e_init_msix(struct i40e_pf *pf)
 {
struct i40e_hw *hw = &pf->hw;
+   int cpus, extra_vectors;
int vectors_left;
int v_budget, i;
int v_actual;
@@ -7854,10 +7855,16 @@ static int i40e_init_msix(struct i40e_pf *pf)
vectors_left--;
}
 
-   /* reserve vectors for the main PF traffic queues */
-   pf->num_lan_msix = min_t(int, num_online_cpus(), vectors_left);
+   /* reserve some vectors for the main PF traffic queues. Initially we
+* only reserve at most 50% of the available vectors, in the case that
+* the number of online CPUs is large. This ensures that we can enable
+* extra features as well. Once we've enabled the other features, we
+* will use any remaining vectors to reach as close as we can to the
+* number of online CPUs.
+*/
+   cpus = num_online_cpus();
+   pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
vectors_left -= pf->num_lan_msix;
-   v_budget += pf->num_lan_msix;
 
/* reserve one vector for sideband flow director */
if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
@@ -7920,6 +7927,23 @@ static int i40e_init_msix(struct i40e_pf *pf)
}
}
 
+   /* On systems with a large number of SMP cores, we previously limited
+* the number of vectors for num_lan_msix to be at most 50% of the
+* available vectors, to allow for other features. Now, we add back
+* the remaining vectors. However, we ensure that the total
+* num_lan_msix will not exceed num_online_cpus(). To do this, we
+* calculate the number of vectors we can add without going over the
+* cap of CPUs. For systems with a small number of CPUs this will be
+* zero.
+*/
+   extra_vectors = min_t(int, cpus - pf->num_lan_msix, vectors_left);
+   pf->num_lan_msix += extra_vectors;
+   vectors_left -= extra_vectors;
+
+   WARN(vectors_left < 0,
+"Calculation of remaining vectors underflowed. This is an 
accounting bug when determining total MSI-X vectors.\n");
+
+   v_budget += pf->num_lan_msix;
pf->msix_entries = kcalloc(v_budget, sizeof(struct msix_entry),
   GFP_KERNEL);
if (!pf->msix_entries)
-- 
2.12.0



[net-next 13/13] i40e: rename auto_disable_flags to hw_disabled_flags

2017-03-14 Thread Jeff Kirsher
From: Harshitha Ramamurthy 

A previous commit introduced a field that tracks the features
that are disabled due to HW resource limitations as opposed
to the features disabled by the user. This patch changes the
name of the field to make it more readable, since it might get
confusing when looking at code containing both the flags
field and the auto_disable_flags field together.
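
As a rough illustration of how the two fields are read together, the sketch
below models the check that i40e_update_pf_stats() performs in this patch:
a feature counts as active only when its bit is set in flags and clear in
hw_disabled_flags. The `ex_` names and the bit positions are hypothetical
and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; the driver defines its flags with BIT_ULL(). */
#define EX_FLAG_FD_SB_ENABLED  (1ULL << 0)
#define EX_FLAG_FD_ATR_ENABLED (1ULL << 1)

/* Hypothetical stand-in for the two relevant fields of struct i40e_pf. */
struct ex_pf {
	uint64_t flags;             /* features that are enabled */
	uint64_t hw_disabled_flags; /* features turned off due to hw limitations */
};

/* A feature is usable only if enabled and not disabled by hw limitations. */
static bool ex_feature_active(const struct ex_pf *pf, uint64_t feature)
{
	return (pf->flags & feature) && !(pf->hw_disabled_flags & feature);
}

/* Small self-check showing the intended reading of the two fields. */
void ex_flags_demo(void)
{
	struct ex_pf pf = {
		.flags = EX_FLAG_FD_SB_ENABLED | EX_FLAG_FD_ATR_ENABLED,
		.hw_disabled_flags = EX_FLAG_FD_ATR_ENABLED,
	};

	assert(ex_feature_active(&pf, EX_FLAG_FD_SB_ENABLED));
	assert(!ex_feature_active(&pf, EX_FLAG_FD_ATR_ENABLED));
}
```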

Change-ID: Idcc9888659698f6fe3ccff17c8c3f09b5026f708
Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  8 ++--
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 10 -
 drivers/net/ethernet/intel/i40e/i40e_main.c| 28 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 20 +-
 4 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 9b2bb8d971cc..a5cf5d11d0e7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -353,8 +353,12 @@ struct i40e_pf {
 #define I40E_FLAG_CLIENT_L2_CHANGE BIT_ULL(56)
 #define I40E_FLAG_WOL_MC_MAGIC_PKT_WAKE BIT_ULL(57)
 
-   /* tracks features that get auto disabled by errors */
-   u64 auto_disable_flags;
+   /* Tracks features that are disabled due to hw limitations.
+* If a bit is set here, it means that the corresponding
+* bit in the 'flags' field is cleared i.e that feature
+* is disabled
+*/
+   u64 hw_disabled_flags;
 
 #ifdef I40E_FCOE
struct i40e_fcoe fcoe;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 3aefc9e20439..a933c6c2aff8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2717,7 +2717,7 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
if (!(pf->flags & I40E_FLAG_FD_SB_ENABLED))
return -EOPNOTSUPP;
 
-   if (pf->auto_disable_flags & I40E_FLAG_FD_SB_ENABLED)
+   if (pf->hw_disabled_flags & I40E_FLAG_FD_SB_ENABLED)
return -ENOSPC;
 
if (test_bit(__I40E_RESET_RECOVERY_PENDING, >state) ||
@@ -3059,7 +3059,7 @@ static u32 i40e_get_priv_flags(struct net_device *dev)
I40E_PRIV_FLAGS_FD_ATR : 0;
ret_flags |= pf->flags & I40E_FLAG_VEB_STATS_ENABLED ?
I40E_PRIV_FLAGS_VEB_STATS : 0;
-   ret_flags |= pf->auto_disable_flags & I40E_FLAG_HW_ATR_EVICT_CAPABLE ?
+   ret_flags |= pf->hw_disabled_flags & I40E_FLAG_HW_ATR_EVICT_CAPABLE ?
0 : I40E_PRIV_FLAGS_HW_ATR_EVICT;
if (pf->hw.pf_id == 0) {
ret_flags |= pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT ?
@@ -3099,7 +3099,7 @@ static int i40e_set_priv_flags(struct net_device *dev, 
u32 flags)
pf->flags |= I40E_FLAG_FD_ATR_ENABLED;
} else {
pf->flags &= ~I40E_FLAG_FD_ATR_ENABLED;
-   pf->auto_disable_flags |= I40E_FLAG_FD_ATR_ENABLED;
+   pf->hw_disabled_flags |= I40E_FLAG_FD_ATR_ENABLED;
 
/* flush current ATR settings */
set_bit(__I40E_FD_FLUSH_REQUESTED, >state);
@@ -3144,9 +3144,9 @@ static int i40e_set_priv_flags(struct net_device *dev, 
u32 flags)
 
if ((flags & I40E_PRIV_FLAGS_HW_ATR_EVICT) &&
(pf->flags & I40E_FLAG_HW_ATR_EVICT_CAPABLE))
-   pf->auto_disable_flags &= ~I40E_FLAG_HW_ATR_EVICT_CAPABLE;
+   pf->hw_disabled_flags &= ~I40E_FLAG_HW_ATR_EVICT_CAPABLE;
else
-   pf->auto_disable_flags |= I40E_FLAG_HW_ATR_EVICT_CAPABLE;
+   pf->hw_disabled_flags |= I40E_FLAG_HW_ATR_EVICT_CAPABLE;
 
/* if needed, issue reset to cause things to take effect */
if (reset_required)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 4d305fb1f188..113b32911f1b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1101,13 +1101,13 @@ static void i40e_update_pf_stats(struct i40e_pf *pf)
   >rx_lpi_count, >rx_lpi_count);
 
if (pf->flags & I40E_FLAG_FD_SB_ENABLED &&
-   !(pf->auto_disable_flags & I40E_FLAG_FD_SB_ENABLED))
+   !(pf->hw_disabled_flags & I40E_FLAG_FD_SB_ENABLED))
nsd->fd_sb_status = true;
else
nsd->fd_sb_status = false;
 
if (pf->flags & I40E_FLAG_FD_ATR_ENABLED &&
-   !(pf->auto_disable_flags & I40E_FLAG_FD_ATR_ENABLED))
+   !(pf->hw_disabled_flags & I40E_FLAG_FD_ATR_ENABLED))
nsd->fd_atr_status = true;
else
nsd->fd_atr_status = false;
@@ -5467,7 

[net-next 12/13] i40e/i40evf: Change version from 1.6.27 to 2.1.7

2017-03-14 Thread Jeff Kirsher
From: Bimmy Pujari 

Signed-off-by: Bimmy Pujari 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 +++---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 414685c683d7..4d305fb1f188 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -39,9 +39,9 @@ static const char i40e_driver_string[] =
 
 #define DRV_KERN "-k"
 
-#define DRV_VERSION_MAJOR 1
-#define DRV_VERSION_MINOR 6
-#define DRV_VERSION_BUILD 27
+#define DRV_VERSION_MAJOR 2
+#define DRV_VERSION_MINOR 1
+#define DRV_VERSION_BUILD 7
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD)DRV_KERN
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 9492b20da557..6d666bde9df5 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -37,9 +37,9 @@ static const char i40evf_driver_string[] =
 
 #define DRV_KERN "-k"
 
-#define DRV_VERSION_MAJOR 1
-#define DRV_VERSION_MINOR 6
-#define DRV_VERSION_BUILD 27
+#define DRV_VERSION_MAJOR 2
+#define DRV_VERSION_MINOR 1
+#define DRV_VERSION_BUILD 7
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD) \
-- 
2.12.0



[net-next 08/13] i40e: fix RSS queues only operating on PF0

2017-03-14 Thread Jeff Kirsher
From: Lihong Yang 

This patch fixes the issue that RSS offloading only works on PF0 by
writing the hash keys for the VFs directly to the registers instead
of using the admin queue command to do so.

Change-ID: Ia02cda7dbaa23def342e8786097a2c03db6f580b
Signed-off-by: Lihong Yang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c| 11 +++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  6 ++
 2 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index cb678ed7a2ad..e577ff8a9c76 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8394,13 +8394,10 @@ static int i40e_config_rss_reg(struct i40e_vsi *vsi, 
const u8 *seed,
 
if (vsi->type == I40E_VSI_MAIN) {
for (i = 0; i <= I40E_PFQF_HKEY_MAX_INDEX; i++)
-   i40e_write_rx_ctl(hw, I40E_PFQF_HKEY(i),
- seed_dw[i]);
+   wr32(hw, I40E_PFQF_HKEY(i), seed_dw[i]);
} else if (vsi->type == I40E_VSI_SRIOV) {
for (i = 0; i <= I40E_VFQF_HKEY1_MAX_INDEX; i++)
-   i40e_write_rx_ctl(hw,
- I40E_VFQF_HKEY1(i, vf_id),
- seed_dw[i]);
+   wr32(hw, I40E_VFQF_HKEY1(i, vf_id), seed_dw[i]);
} else {
dev_err(>pdev->dev, "Cannot set RSS seed - invalid 
VSI type\n");
}
@@ -8418,9 +8415,7 @@ static int i40e_config_rss_reg(struct i40e_vsi *vsi, 
const u8 *seed,
if (lut_size != I40E_VF_HLUT_ARRAY_SIZE)
return -EINVAL;
for (i = 0; i <= I40E_VFQF_HLUT_MAX_INDEX; i++)
-   i40e_write_rx_ctl(hw,
- I40E_VFQF_HLUT1(i, vf_id),
- lut_dw[i]);
+   wr32(hw, I40E_VFQF_HLUT1(i, vf_id), lut_dw[i]);
} else {
dev_err(>pdev->dev, "Cannot set RSS LUT - invalid 
VSI type\n");
}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 25ee5af2d136..115a7286ab8f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -702,10 +702,8 @@ static int i40e_alloc_vsi_res(struct i40e_vf *vf, enum 
i40e_vsi_type type)
dev_info(>pdev->dev,
 "Could not allocate VF broadcast filter\n");
spin_unlock_bh(>mac_filter_hash_lock);
-   i40e_write_rx_ctl(>hw, I40E_VFQF_HENA1(0, vf->vf_id),
- (u32)hena);
-   i40e_write_rx_ctl(>hw, I40E_VFQF_HENA1(1, vf->vf_id),
- (u32)(hena >> 32));
+   wr32(>hw, I40E_VFQF_HENA1(0, vf->vf_id), (u32)hena);
+   wr32(>hw, I40E_VFQF_HENA1(1, vf->vf_id), (u32)(hena >> 32));
}
 
/* program mac filter */
-- 
2.12.0



[net-next 04/13] i40evf: add client interface

2017-03-14 Thread Jeff Kirsher
From: Mitch Williams 

In preparation for upcoming RDMA-capable hardware, add a client
interface to the VF driver. This is a slightly-simplified version
of the PF client interface, with the names changed to protect the
innocent.

Due to the nature of the VF<->PF interactions, the client interface
sometimes needs to call back into itself to pass messages. Because
of this, we can't use the coarse-grained locking like the PF's
client interface uses. Instead, we handle all client interactions
in a separate thread so the watchdog can still run and process
virtual channel messages.

Signed-off-by: Mitch Williams 
Signed-off-by: Jesse Brandeburg 
Signed-off-by: Anjali Singhai Jain 
Signed-off-by: Avinash Dayanand 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/Makefile |   2 +-
 drivers/net/ethernet/intel/i40evf/i40e_virtchnl.h  |  33 ++
 drivers/net/ethernet/intel/i40evf/i40evf.h |  24 +-
 drivers/net/ethernet/intel/i40evf/i40evf_client.c  | 563 +
 drivers/net/ethernet/intel/i40evf/i40evf_client.h  | 166 ++
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  83 ++-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  13 +-
 7 files changed, 873 insertions(+), 11 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40evf_client.c
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40evf_client.h

diff --git a/drivers/net/ethernet/intel/i40evf/Makefile 
b/drivers/net/ethernet/intel/i40evf/Makefile
index 3a423836a565..827c7a6ed0ba 100644
--- a/drivers/net/ethernet/intel/i40evf/Makefile
+++ b/drivers/net/ethernet/intel/i40evf/Makefile
@@ -32,5 +32,5 @@
 obj-$(CONFIG_I40EVF) += i40evf.o
 
 i40evf-objs := i40evf_main.o i40evf_ethtool.o i40evf_virtchnl.o \
-   i40e_txrx.o i40e_common.o i40e_adminq.o
+   i40e_txrx.o i40e_common.o i40e_adminq.o i40evf_client.o
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_virtchnl.h 
b/drivers/net/ethernet/intel/i40evf/i40e_virtchnl.h
index d38a2b2aea2b..f431fbc4a3e7 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_virtchnl.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_virtchnl.h
@@ -81,7 +81,9 @@ enum i40e_virtchnl_ops {
I40E_VIRTCHNL_OP_GET_STATS = 15,
I40E_VIRTCHNL_OP_FCOE = 16,
I40E_VIRTCHNL_OP_EVENT = 17, /* must ALWAYS be 17 */
+   I40E_VIRTCHNL_OP_IWARP = 20,
I40E_VIRTCHNL_OP_CONFIG_IWARP_IRQ_MAP = 21,
+   I40E_VIRTCHNL_OP_RELEASE_IWARP_IRQ_MAP = 22,
I40E_VIRTCHNL_OP_CONFIG_RSS_KEY = 23,
I40E_VIRTCHNL_OP_CONFIG_RSS_LUT = 24,
I40E_VIRTCHNL_OP_GET_RSS_HENA_CAPS = 25,
@@ -393,6 +395,37 @@ struct i40e_virtchnl_pf_event {
int severity;
 };
 
+/* I40E_VIRTCHNL_OP_CONFIG_IWARP_IRQ_MAP
+ * VF uses this message to request PF to map IWARP vectors to IWARP queues.
+ * The request for this originates from the VF IWARP driver through
+ * a client interface between VF LAN and VF IWARP driver.
+ * A vector could have an AEQ and CEQ attached to it although
+ * there is a single AEQ per VF IWARP instance in which case
+ * most vectors will have an INVALID_IDX for aeq and valid idx for ceq.
+ * There will never be a case where there will be multiple CEQs attached
+ * to a single vector.
+ * PF configures interrupt mapping and returns status.
+ */
+
+/* HW does not define a type value for AEQ; only for RX/TX and CEQ.
+ * In order for us to keep the interface simple, SW will define a
+ * unique type value for AEQ.
+ */
+#define I40E_QUEUE_TYPE_PE_AEQ  0x80
+#define I40E_QUEUE_INVALID_IDX  0x
+
+struct i40e_virtchnl_iwarp_qv_info {
+   u32 v_idx; /* msix_vector */
+   u16 ceq_idx;
+   u16 aeq_idx;
+   u8 itr_idx;
+};
+
+struct i40e_virtchnl_iwarp_qvlist_info {
+   u32 num_vectors;
+   struct i40e_virtchnl_iwarp_qv_info qv_info[1];
+};
+
 /* VF reset states - these are written into the RSTAT register:
  * I40E_VFGEN_RSTAT1 on the PF
  * I40E_VFGEN_RSTAT on the VF
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf.h 
b/drivers/net/ethernet/intel/i40evf/i40evf.h
index f16d9d1ec403..b2b48511f457 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf.h
+++ b/drivers/net/ethernet/intel/i40evf/i40evf.h
@@ -60,6 +60,7 @@ struct i40e_vsi {
int base_vector;
u16 work_limit;
u16 qs_handle;
+   void *priv; /* client driver data reference. */
 };
 
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
@@ -169,6 +170,7 @@ enum i40evf_state_t {
 
 enum i40evf_critical_section_t {
__I40EVF_IN_CRITICAL_TASK,  /* cannot be interrupted */
+   __I40EVF_IN_CLIENT_TASK,
 };
 /* make common code happy */
 #define __I40E_DOWN __I40EVF_DOWN
@@ -178,6 +180,7 @@ struct i40evf_adapter {
struct timer_list watchdog_timer;
   

[net-next 09/13] i40e: Clarify steps in MAC/VLAN filters initialization routine

2017-03-14 Thread Jeff Kirsher
From: Filip Sadowski 

This patch clarifies the reason for the removal of the filter that is
automatically generated by firmware and the explicit addition of a
filter which accepts frames with any VLAN id.

Change-ID: Iabf180b6d61c4d8a36d3bcf8457c377a6f2aca0e
Signed-off-by: Filip Sadowski 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e577ff8a9c76..414685c683d7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9461,10 +9461,10 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
if (vsi->type == I40E_VSI_MAIN) {
SET_NETDEV_DEV(netdev, >pdev->dev);
ether_addr_copy(mac_addr, hw->mac.perm_addr);
-   /* The following steps are necessary to prevent reception
-* of tagged packets - some older NVM configurations load a
-* default a MAC-VLAN filter that accepts any tagged packet
-* which must be replaced by a normal filter.
+   /* The following steps are necessary to properly keep track of
+* MAC-VLAN filters loaded into firmware - first we remove
+* filter that is automatically generated by firmware and then
+* add new filter both to the driver hash table and firmware.
 */
i40e_rm_default_mac_filter(vsi, mac_addr);
spin_lock_bh(>mac_filter_hash_lock);
-- 
2.12.0



[net-next 00/13][pull request] 40GbE Intel Wired LAN Driver Updates 2017-03-14

2017-03-14 Thread Jeff Kirsher
This series contains updates to i40e and i40evf only.

Faisal fixes an RDMA/iWARP compile warning by making sure the function
prototypes are available in the client hooks in the VF driver.

Aaron fixes an issue on x710 devices where simultaneous read accesses
were interfering with each other, by making sure all devices acquire
the NVM lock before reads.

Shannon adds Wake on LAN support for x722 devices and cleans up the
opcodes so that they are in numerical order.

Mitch adds a client interface to the VF driver, in preparation for the
upcoming RDMA-capable hardware (and client driver).  Cleans up the
client interface in the PF driver: it was originally over-engineered
to handle multiple clients on multiple netdevs, but that did not happen
and there will now be one client per driver, so the "KISS" (Keep It
Simple, Stupid) principle is applied to the i40e client interface.
Bumps the number of MAC filters an untrusted VF can create.

Jake fixes an issue where a recent refactor of queue pairs accidentally
added all remaining vectors to the num_lan_msix, which can adversely
affect performance.

Lihong fixes an ethtool issue with x722 devices where "-e" will error
out since its EEPROM has a scope limit at offset 0x5B9FFF, so set the
EEPROM length to the scope limit.  Also fixes an issue where RSS
offloading only worked on PF0.

Filip cleans up and clarifies code comment so there is no confusion
about MAC/VLAN filter initialization routine.

Alex adds support for DMA_ATTR_SKIP_CPU_SYNC and DMA_ATTR_WEAK_ORDERING,
which improves performance on architectures that implement either one.

Harshitha cleans up confusion between flags disabled due to hardware
limitations and features disabled by the user, renaming
auto_disable_flags to hw_disabled_flags to avoid the confusion.

The following are changes since commit 9c79ddaa0f962d1f26537a670b0652ff509a6fe0:
  qed*: Add support for QL41xxx adapters
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Aaron Salter (1):
  i40e: Acquire NVM lock before reads on all devices

Alexander Duyck (1):
  i40e/i40evf: Add support for mapping pages with DMA attributes

Bimmy Pujari (1):
  i40e/i40evf: Change version from 1.6.27 to 2.1.7

Faisal Latif (1):
  i40evf: fix client warnings

Filip Sadowski (1):
  i40e: Clarify steps in MAC/VLAN filters initialization routine

Harshitha Ramamurthy (1):
  i40e: rename auto_disable_flags to hw_disabled_flags

Jacob Keller (1):
  i40e: don't add more vectors to num_lan_msix than number of CPUs

Lihong Yang (2):
  i40e: fix ethtool to get EEPROM data from X722 interface
  i40e: fix RSS queues only operating on PF0

Mitch Williams (3):
  i40evf: add client interface
  i40e: KISS the client interface
  i40e: Allow untrusted VFs to have more filters

Shannon Nelson (1):
  i40e: fix up recent proxy and wol bits for X722_SUPPORT

 drivers/net/ethernet/intel/i40e/i40e.h |  16 +-
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |  65 ++-
 drivers/net/ethernet/intel/i40e/i40e_client.c  | 457 ++---
 drivers/net/ethernet/intel/i40e/i40e_client.h  |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  15 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 115 +++--
 drivers/net/ethernet/intel/i40e/i40e_nvm.c |  12 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  51 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   3 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  10 +-
 drivers/net/ethernet/intel/i40evf/Makefile |   2 +-
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|  65 ++-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |  31 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |   3 +
 drivers/net/ethernet/intel/i40evf/i40e_virtchnl.h  |  33 ++
 drivers/net/ethernet/intel/i40evf/i40evf.h |  29 +-
 drivers/net/ethernet/intel/i40evf/i40evf_client.c  | 563 +
 drivers/net/ethernet/intel/i40evf/i40evf_client.h  | 166 ++
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  89 +++-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  13 +-
 20 files changed, 1333 insertions(+), 413 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40evf_client.c
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40evf_client.h

-- 
2.12.0



[net-next 07/13] i40e: fix ethtool to get EEPROM data from X722 interface

2017-03-14 Thread Jeff Kirsher
From: Lihong Yang 

Currently ethtool -e will error out on an X722 interface
as its EEPROM has a scope limit at offset 0x5B9FFF.
This patch fixes the issue by setting the EEPROM length to
the scope limit to avoid NVM read failure beyond that.
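
A minimal sketch of the length calculation, assuming a simplified model of
the hardware: `enum ex_mac_type`, `ex_eeprom_len()` and the `register_len`
parameter are hypothetical names for this example, and `register_len`
stands in for the value the driver otherwise derives from the
I40E_GLPCI_LBARCTRL register. Only the limit constant comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define X722_EEPROM_SCOPE_LIMIT 0x5B9FFF /* last readable offset, from the patch */

/* Hypothetical MAC type enum; the driver checks hw->mac.type instead. */
enum ex_mac_type { EX_MAC_XL710, EX_MAC_X722 };

/* Returns the EEPROM length that would be reported to ethtool in this model. */
static uint32_t ex_eeprom_len(enum ex_mac_type type, uint32_t register_len)
{
	if (type == EX_MAC_X722)
		return X722_EEPROM_SCOPE_LIMIT + 1; /* offsets 0..limit inclusive */
	return register_len;
}

/* Small self-check: X722 ignores the register-derived length. */
void ex_eeprom_demo(void)
{
	assert(ex_eeprom_len(EX_MAC_X722, 0x800000) == 0x5BA000);
	assert(ex_eeprom_len(EX_MAC_XL710, 0x800000) == 0x800000);
}
```

The length is the limit plus one because the limit is the last valid
offset, so reads of offsets 0 through 0x5B9FFF stay in range.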

Change-ID: I0b7d4dd6c7f2a57cace438af5dffa0f44c229372
Signed-off-by: Lihong Yang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index a22e26200bcc..3aefc9e20439 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1165,6 +1165,11 @@ static int i40e_get_eeprom_len(struct net_device *netdev)
struct i40e_hw *hw = >vsi->back->hw;
u32 val;
 
+#define X722_EEPROM_SCOPE_LIMIT 0x5B9FFF
+   if (hw->mac.type == I40E_MAC_X722) {
+   val = X722_EEPROM_SCOPE_LIMIT + 1;
+   return val;
+   }
val = (rd32(hw, I40E_GLPCI_LBARCTRL)
& I40E_GLPCI_LBARCTRL_FL_SIZE_MASK)
>> I40E_GLPCI_LBARCTRL_FL_SIZE_SHIFT;
-- 
2.12.0



[net-next 01/13] i40evf: fix client warnings

2017-03-14 Thread Jeff Kirsher
From: Faisal Latif 

The function prototypes in i40evf_client.h give warnings while
compiling the i40iwvf module. Move these function prototypes to i40evf.h.
Also change the return type from u32 to int, so that it is
consistent with i40e_client.h.

Change-Id: Ie3757f844993aabc27654aaf02ec14fb985ad2c4
Signed-off-by: Faisal Latif 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf.h 
b/drivers/net/ethernet/intel/i40evf/i40evf.h
index 00c42d803276..f16d9d1ec403 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf.h
+++ b/drivers/net/ethernet/intel/i40evf/i40evf.h
@@ -337,4 +337,11 @@ void i40evf_virtchnl_completion(struct i40evf_adapter 
*adapter,
enum i40e_virtchnl_ops v_opcode,
i40e_status v_retval, u8 *msg, u16 msglen);
 int i40evf_config_rss(struct i40evf_adapter *adapter);
+int i40evf_lan_add_device(struct i40evf_adapter *adapter);
+int i40evf_lan_del_device(struct i40evf_adapter *adapter);
+void i40evf_client_subtask(struct i40evf_adapter *adapter);
+void i40evf_notify_client_message(struct i40e_vsi *vsi, u8 *msg, u16 len);
+void i40evf_notify_client_l2_params(struct i40e_vsi *vsi);
+void i40evf_notify_client_open(struct i40e_vsi *vsi);
+void i40evf_notify_client_close(struct i40e_vsi *vsi);
 #endif /* _I40EVF_H_ */
-- 
2.12.0



[net-next 02/13] i40e: Acquire NVM lock before reads on all devices

2017-03-14 Thread Jeff Kirsher
From: Aaron Salter 

Acquire NVM lock before reads on all devices.  Previously, locks were
only used for X722 and later.  Fixes an issue where simultaneous X710
NVM accesses were interfering with each other.
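
The resulting control flow can be sketched as a small userspace model:
take the lock first, pick the read path second, and release the lock on
the way out. All `ex_` names are hypothetical stand-ins, and the two read
functions are stubs that only check that the lock is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct i40e_hw. */
struct ex_hw {
	bool use_aq;      /* models I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE */
	bool nvm_locked;  /* models ownership of the NVM resource */
	int lock_count;   /* how many times the lock was taken */
};

static int ex_acquire_nvm(struct ex_hw *hw)
{
	if (hw->nvm_locked)
		return -1; /* already held: treated as failure in this model */
	hw->nvm_locked = true;
	hw->lock_count++;
	return 0;
}

static void ex_release_nvm(struct ex_hw *hw)
{
	hw->nvm_locked = false;
}

/* Stub read paths; both require the lock and return the offset as data. */
static int ex_read_word_aq(struct ex_hw *hw, uint16_t offset, uint16_t *data)
{
	assert(hw->nvm_locked);
	*data = offset;
	return 0;
}

static int ex_read_word_srctl(struct ex_hw *hw, uint16_t offset, uint16_t *data)
{
	assert(hw->nvm_locked);
	*data = offset;
	return 0;
}

/* Lock unconditionally, then choose the access method. */
static int ex_read_nvm_word(struct ex_hw *hw, uint16_t offset, uint16_t *data)
{
	int ret = ex_acquire_nvm(hw);

	if (ret)
		return ret;

	if (hw->use_aq)
		ret = ex_read_word_aq(hw, offset, data);
	else
		ret = ex_read_word_srctl(hw, offset, data);

	ex_release_nvm(hw);
	return ret;
}
```

Before the patch, only the admin queue path took the lock, which in this
model corresponds to calling the srctl stub without `nvm_locked` set.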

Change-ID: If570bb7acf958cef58725ec2a2011cead6f80638
Signed-off-by: Aaron Salter 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 38ee18f11124..800bd55d0159 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -292,14 +292,14 @@ i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 
offset,
 {
enum i40e_status_code ret_code = 0;
 
-   if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) {
-   ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
-   if (!ret_code) {
+   ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (!ret_code) {
+   if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) {
ret_code = i40e_read_nvm_word_aq(hw, offset, data);
-   i40e_release_nvm(hw);
+   } else {
+   ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
}
-   } else {
-   ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
+   i40e_release_nvm(hw);
}
return ret_code;
 }
-- 
2.12.0



[net-next 03/13] i40e: fix up recent proxy and wol bits for X722_SUPPORT

2017-03-14 Thread Jeff Kirsher
From: Shannon Nelson 

Some opcodes are added and reordered to be in numerical order with the
rest of the opcodes.
This patch adds admin queue structs to support the Wake on LAN feature
for X722.

Signed-off-by: Shannon Nelson 
Signed-off-by: Carolyn Wyborny 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  | 65 +-
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h| 65 +-
 2 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 451f48b7540a..251074c677c4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -132,6 +132,10 @@ enum i40e_admin_queue_opc {
i40e_aqc_opc_list_func_capabilities = 0x000A,
i40e_aqc_opc_list_dev_capabilities  = 0x000B,
 
+   /* Proxy commands */
+   i40e_aqc_opc_set_proxy_config   = 0x0104,
+   i40e_aqc_opc_set_ns_proxy_table_entry   = 0x0105,
+
/* LAA */
i40e_aqc_opc_mac_address_read   = 0x0107,
i40e_aqc_opc_mac_address_write  = 0x0108,
@@ -139,6 +143,10 @@ enum i40e_admin_queue_opc {
/* PXE */
i40e_aqc_opc_clear_pxe_mode = 0x0110,
 
+   /* WoL commands */
+   i40e_aqc_opc_set_wol_filter = 0x0120,
+   i40e_aqc_opc_get_wake_reason= 0x0121,
+
/* internal switch commands */
i40e_aqc_opc_get_switch_config  = 0x0200,
i40e_aqc_opc_add_statistics = 0x0201,
@@ -177,6 +185,7 @@ enum i40e_admin_queue_opc {
i40e_aqc_opc_remove_control_packet_filter   = 0x025B,
i40e_aqc_opc_add_cloud_filters  = 0x025C,
i40e_aqc_opc_remove_cloud_filters   = 0x025D,
+   i40e_aqc_opc_clear_wol_switch_filters   = 0x025E,
 
i40e_aqc_opc_add_mirror_rule= 0x0260,
i40e_aqc_opc_delete_mirror_rule = 0x0261,
@@ -563,6 +572,56 @@ struct i40e_aqc_clear_pxe {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_clear_pxe);
 
+/* Set WoL Filter (0x0120) */
+
+struct i40e_aqc_set_wol_filter {
+   __le16 filter_index;
+#define I40E_AQC_MAX_NUM_WOL_FILTERS   8
+#define I40E_AQC_SET_WOL_FILTER_TYPE_MAGIC_SHIFT   15
+#define I40E_AQC_SET_WOL_FILTER_TYPE_MAGIC_MASK(0x1 << \
+   I40E_AQC_SET_WOL_FILTER_TYPE_MAGIC_SHIFT)
+
+#define I40E_AQC_SET_WOL_FILTER_INDEX_SHIFT0
+#define I40E_AQC_SET_WOL_FILTER_INDEX_MASK (0x7 << \
+   I40E_AQC_SET_WOL_FILTER_INDEX_SHIFT)
+   __le16 cmd_flags;
+#define I40E_AQC_SET_WOL_FILTER0x8000
+#define I40E_AQC_SET_WOL_FILTER_NO_TCO_WOL 0x4000
+#define I40E_AQC_SET_WOL_FILTER_ACTION_CLEAR   0
+#define I40E_AQC_SET_WOL_FILTER_ACTION_SET 1
+   __le16 valid_flags;
+#define I40E_AQC_SET_WOL_FILTER_ACTION_VALID   0x8000
+#define I40E_AQC_SET_WOL_FILTER_NO_TCO_ACTION_VALID0x4000
+   u8 reserved[2];
+   __le32  address_high;
+   __le32  address_low;
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_set_wol_filter);
+
+struct i40e_aqc_set_wol_filter_data {
+   u8 filter[128];
+   u8 mask[16];
+};
+
+I40E_CHECK_STRUCT_LEN(0x90, i40e_aqc_set_wol_filter_data);
+
+/* Get Wake Reason (0x0121) */
+
+struct i40e_aqc_get_wake_reason_completion {
+   u8 reserved_1[2];
+   __le16 wake_reason;
+#define I40E_AQC_GET_WAKE_UP_REASON_WOL_REASON_MATCHED_INDEX_SHIFT 0
+#define I40E_AQC_GET_WAKE_UP_REASON_WOL_REASON_MATCHED_INDEX_MASK (0xFF << \
+   I40E_AQC_GET_WAKE_UP_REASON_WOL_REASON_MATCHED_INDEX_SHIFT)
+#define I40E_AQC_GET_WAKE_UP_REASON_WOL_REASON_RESERVED_SHIFT  8
+#define I40E_AQC_GET_WAKE_UP_REASON_WOL_REASON_RESERVED_MASK   (0xFF << \
+   I40E_AQC_GET_WAKE_UP_REASON_WOL_REASON_RESERVED_SHIFT)
+   u8 reserved_2[12];
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_get_wake_reason_completion);
+
 /* Switch configuration commands (0x02xx) */
 
 /* Used by many indirect commands that only pass an seid and a buffer in the
@@ -645,6 +704,8 @@ struct i40e_aqc_set_port_parameters {
 #define I40E_AQ_SET_P_PARAMS_PAD_SHORT_PACKETS 2 /* must set! */
 #define I40E_AQ_SET_P_PARAMS_DOUBLE_VLAN_ENA   4
__le16  bad_frame_vsi;
+#define I40E_AQ_SET_P_PARAMS_BFRAME_SEID_SHIFT 0x0
+#define I40E_AQ_SET_P_PARAMS_BFRAME_SEID_MASK  0x3FF
__le16  default_seid;/* reserved for command */
u8  reserved[10];
 };
@@ -696,6 +757,7 @@ I40E_CHECK_STRUCT_LEN(0x10, i40e_aqc_switch_resource_alloc_element_resp);
 /* Set Switch Configuration (direct 0x0205) */
 struct i40e_aqc_set_switch_config {
__le16  flags;
+/* flags used for both fields below */
 #define I40E_AQ_SET_SWITCH_CFG_PROMISC 0x0001
 #define 
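
To illustrate how the filter_index shift/mask pairs in the Set WoL Filter struct above are presumably meant to be combined, here is a minimal host-order sketch. The shortened macro names, the mirror struct, and wol_filter_index() are hypothetical. The real struct uses __le16/__le32 fields, and the 16-byte size check is my reading of what I40E_CHECK_CMD_LENGTH enforces.

```c
#include <stdint.h>

/* Values copied from the patch above; names shortened for the sketch. */
#define WOL_FILTER_TYPE_MAGIC_SHIFT 15
#define WOL_FILTER_TYPE_MAGIC_MASK  (0x1 << WOL_FILTER_TYPE_MAGIC_SHIFT)
#define WOL_FILTER_INDEX_SHIFT      0
#define WOL_FILTER_INDEX_MASK       (0x7 << WOL_FILTER_INDEX_SHIFT)

/* Host-order mirror of the command body (hypothetical, for layout only). */
struct wol_filter_cmd {
	uint16_t filter_index;
	uint16_t cmd_flags;
	uint16_t valid_flags;
	uint8_t  reserved[2];
	uint32_t address_high;
	uint32_t address_low;
};

/* Assumption: the command body must be exactly 16 bytes. */
_Static_assert(sizeof(struct wol_filter_cmd) == 16,
	       "command body must be 16 bytes");

/* Pack a filter slot (0..7) and the magic-packet flag into filter_index. */
static uint16_t wol_filter_index(unsigned int index, int magic)
{
	uint16_t v = (uint16_t)((index << WOL_FILTER_INDEX_SHIFT) &
				WOL_FILTER_INDEX_MASK);

	if (magic)
		v |= (uint16_t)WOL_FILTER_TYPE_MAGIC_MASK;
	return v;
}
```

A driver would additionally convert the result with cpu_to_le16() before storing it; that step is omitted here.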

[net-next 05/13] i40e: KISS the client interface

2017-03-14 Thread Jeff Kirsher
From: Mitch Williams 

(KISS is Keep It Simple, Stupid. Or is it?)

The client interface is vastly overengineered for what it needs to do.
It was originally designed to support multiple clients on multiple
netdevs, possibly even with multiple drivers. None of this happened,
and now we know that there will only ever be one client for i40e
(i40iw) and one for i40evf (i40iwvf). So, time for some KISS. Since
i40e and i40evf are a Dynasty, we'll simplify this one to match the
VF interface.

First, be a Destroyer and remove all of the lists and locks required
to support multiple clients. Keep one static around to keep track of
one client, and track the client instances for each netdev in the
driver's pf (or adapter) struct. Now it's Almost Human.

Since we already know the client type is iWarp, get rid of any checks
for this. Same for VSI type - it's always going to be the same type,
so it's just a Parasite.

While we're at it, fix up some comments. This makes the function
headers actually match the functions.

These changes reduce code complexity, simplify maintenance,
squash some lurking timing bugs, and allow us to Rock and Roll All
Nite.

Change-ID: I1ea79948ad73b8685272451440a34507f9a9012e
Signed-off-by: Mitch Williams 
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_client.c  | 457 +++--
 drivers/net/ethernet/intel/i40e/i40e_client.h  |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|  32 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   2 +-
 5 files changed, 179 insertions(+), 328 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 82d8040fa418..9b2bb8d971cc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -348,8 +348,10 @@ struct i40e_pf {
 #define I40E_FLAG_TRUE_PROMISC_SUPPORT BIT_ULL(51)
 #define I40E_FLAG_HAVE_CRT_RETIMER BIT_ULL(52)
 #define I40E_FLAG_PTP_L4_CAPABLE   BIT_ULL(53)
-#define I40E_FLAG_WOL_MC_MAGIC_PKT_WAKEBIT_ULL(54)
+#define I40E_FLAG_CLIENT_RESET BIT_ULL(54)
 #define I40E_FLAG_TEMP_LINK_POLLINGBIT_ULL(55)
+#define I40E_FLAG_CLIENT_L2_CHANGE BIT_ULL(56)
+#define I40E_FLAG_WOL_MC_MAGIC_PKT_WAKEBIT_ULL(57)
 
/* tracks features that get auto disabled by errors */
u64 auto_disable_flags;
@@ -358,6 +360,7 @@ struct i40e_pf {
struct i40e_fcoe fcoe;
 
 #endif /* I40E_FCOE */
+   struct i40e_client_instance *cinst;
bool stat_offsets_loaded;
struct i40e_hw_port_stats stats;
struct i40e_hw_port_stats stats_offsets;
@@ -813,8 +816,7 @@ void i40e_notify_client_of_l2_param_changes(struct i40e_vsi *vsi);
 void i40e_notify_client_of_netdev_close(struct i40e_vsi *vsi, bool reset);
 void i40e_notify_client_of_vf_enable(struct i40e_pf *pf, u32 num_vfs);
 void i40e_notify_client_of_vf_reset(struct i40e_pf *pf, u32 vf_id);
-int i40e_vf_client_capable(struct i40e_pf *pf, u32 vf_id,
-  enum i40e_client_type type);
+int i40e_vf_client_capable(struct i40e_pf *pf, u32 vf_id);
 /**
  * i40e_irq_dynamic_enable - Enable default interrupt generation settings
  * @vsi: pointer to a vsi
diff --git a/drivers/net/ethernet/intel/i40e/i40e_client.c b/drivers/net/ethernet/intel/i40e/i40e_client.c
index d570219efd9f..a9f0d22a7cf4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_client.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_client.c
@@ -32,16 +32,10 @@
 #include "i40e_client.h"
 
 static const char i40e_client_interface_version_str[] = I40E_CLIENT_VERSION_STR;
-
+static struct i40e_client *registered_client;
 static LIST_HEAD(i40e_devices);
 static DEFINE_MUTEX(i40e_device_mutex);
 
-static LIST_HEAD(i40e_clients);
-static DEFINE_MUTEX(i40e_client_mutex);
-
-static LIST_HEAD(i40e_client_instances);
-static DEFINE_MUTEX(i40e_client_instance_mutex);
-
 static int i40e_client_virtchnl_send(struct i40e_info *ldev,
 struct i40e_client *client,
 u32 vf_id, u8 *msg, u16 len);
@@ -67,28 +61,6 @@ static struct i40e_ops i40e_lan_ops = {
 };
 
 /**
- * i40e_client_type_to_vsi_type - convert client type to vsi type
- * @client_type: the i40e_client type
- *
- * returns the related vsi type value
- **/
-static
-enum i40e_vsi_type i40e_client_type_to_vsi_type(enum i40e_client_type type)
-{
-   switch (type) {
-   case I40E_CLIENT_IWARP:
-   return I40E_VSI_IWARP;
-
-   case I40E_CLIENT_VMDQ2:
-   return I40E_VSI_VMDQ2;
-
-   default:
-   pr_err("i40e: Client type unknown\n");
-   return 
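
The commit message describes replacing the client lists and mutexes with one static pointer that tracks a single client. A minimal sketch of that idea is below. The struct and function names are hypothetical and the sketch ignores locking entirely, so it should not be read as the driver's actual registration code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct i40e_client. */
struct client {
	const char *name;
};

/* One static slot instead of a list of clients. */
static struct client *registered_client;

/* Accept a client only if no other client is currently registered. */
static int client_register(struct client *c)
{
	if (!c)
		return -EINVAL;
	if (registered_client)
		return -EEXIST;
	registered_client = c;
	return 0;
}

/* Release the slot only if the caller is the registered client. */
static int client_unregister(struct client *c)
{
	if (registered_client != c)
		return -ENODEV;
	registered_client = NULL;
	return 0;
}
```

With a single slot there is nothing to iterate over, which is where most of the removed list and mutex code would go away.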

Re: [PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread Daniel Borkmann

On 03/14/2017 10:21 PM, David Daney wrote:

Changes from v1:

   - Use unsigned access for SKF_AD_HATYPE

   - Added three more patches for other problems found.


Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
identified some failures and unimplemented features.


Nice, thanks for working on this! If you see specific test
cases for the JIT missing, please also feel free to extend
the test_bpf suite, so this gets exposed further to other
JITs, too.


With this patch set we get:

  test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]

Both big and little endian tested.

We still lack eBPF support, but this is better than nothing.


Any future plans on this one?

Thanks,
Daniel


Re: [RFC PATCH] sock: add SO_RCVQUEUE_SIZE getsockopt

2017-03-14 Thread Josh Hunt

On 03/13/2017 07:10 PM, David Miller wrote:

From: Josh Hunt 
Date: Mon, 13 Mar 2017 18:34:41 -0500


In this particular case they really do want to know total # of bytes
in the receive queue, not the data bytes they can consume from an
application pov. The kernel currently only exposes this value through
netlink or /proc/net/udp from what I saw.


Can you explain in what way this is useful?

The difference between skb->len and skb->truesize is really kernel
internal implementation detail, and I'm trying to figure out why
this would be useful to an application.



First, it looks like my original patch was against an old kernel which 
did not have the updated UDP accounting code. Not sure how that 
happened. Apologies for that. There's no need to add in the backlog, at 
least for UDP now; sk_rmem_alloc is all that is needed for my case.


The application here is interested in monitoring the amount of data in 
the receive buffer: looking for and identifying overflows, and also 
understanding how full it is. I know we already have SO_RXQ_OVFL, but 
this only shows the # of drops on overflow.


We expose this (skmem) information via /proc and netlink today. It seems 
like unnecessary overhead to require an application to also create a 
netlink socket to get this data.


Creating a socket option to mimic the behavior of 
sock_diag_put_meminfo() and export all meminfo_vars would be great if 
that's something you'd accept.


Josh
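
For context on the /proc route mentioned above, here is a minimal sketch of pulling the queue columns out of one data line of /proc/net/udp. The column layout (sl, local address, remote address, state, then tx_queue:rx_queue in hex) is an assumption based on the header that file prints, and parse_udp_queues() is a hypothetical helper, not an existing API.

```c
#include <stdio.h>

/*
 * Extract the tx_queue and rx_queue columns (both hex) from one data
 * line of /proc/net/udp. Returns 0 on success, -1 if the line does not
 * match the assumed layout (for example the header line).
 */
static int parse_udp_queues(const char *line, unsigned int *tx,
			    unsigned int *rx)
{
	/* Skip: slot, local addr:port, remote addr:port, state. */
	int n = sscanf(line, " %*d: %*x:%*x %*x:%*x %*x %x:%x", tx, rx);

	return n == 2 ? 0 : -1;
}
```

An application would still have to open the file, find the line for its own socket, and re-read it on every sample, which is the overhead the proposed socket option is meant to avoid.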


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Stephen Hemminger
On Tue, 14 Mar 2017 14:10:22 -0700
Roopa Prabhu  wrote:

> On Tue, Mar 14, 2017 at 1:25 PM, Stephen Hemminger
>  wrote:
> > On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
> > David Miller  wrote:
> >  
> >> From: Nikolay Aleksandrov 
> >> Date: Tue, 14 Mar 2017 17:58:46 +0200
> >>  
> >> > On 14/03/17 17:55, Stephen Hemminger wrote:  
> >> >> On Tue, 14 Mar 2017 17:36:15 +0200
> >> >> Nikolay Aleksandrov  wrote:
> >> >>  
> >> >>> This patch adds support for ECMP hash policy choice via a new sysctl
> >> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
> >> >>> The current values for fib_multipath_hash_policy are:
> >> >>>  0 - layer 3 (default)
> >> >>>  1 - layer 4
> >> >>> If there's an skb hash already set and it matches the chosen policy 
> >> >>> then it
> >> >>> will be used instead of being calculated (currently only for L4).
> >> >>> In L3 mode we always calculate the hash due to the ICMP error special
> >> >>> case, the flow dissector's field consistentification should handle the
> >> >>> address order thus we can remove the address reversals.
> >> >>>
> >> >>> Signed-off-by: Nikolay Aleksandrov   
> >> >>
> >> >> It is good to see ECMP come back from the grave.
> >> >> Linux used to support it long ago but was abandoned after it was 
> >> >> unstable
> >> >> and removed from iproute2 in 2012.
> >> >>
> >> >> The old API was through route attributes which makes more sense than
> >> >> doing it with sysctl. It makes more sense to use netlink instead.
> >> >> Therefore please go back and do something like the old API rather than 
> >> >> doing it through
> >> >> sysctl.
> >> >>  
> >> >
> >> > That's what my initial version did, but this was discussed during 
> >> > NetConf in Seville
> >> > and it was decided that it's best to make a global sysctl, thus the 
> >> > change.  
> >>
> >> Correct, we discussed this, and we all agreed to only have a sysctl for 
> >> now.  
> >
> > Why? If you are going to have private discussions please post the rationale
> > in public.  
> 
> Stephen, is there any reason to have a per ecmp route multipath algo
> selection?
> All platforms have a global multipath selection algo. I also don't see
> routing daemons ready or willing to specify a per ecmp route multipath
> selection algo attribute.

There is no compelling reason to make the attribute per route. But the
issue is more that configuration through sysctls is problematic. It doesn't
fit into the standard API paradigm. Sysctls are like routing patches, not
part of the real CLI. Trying to trap sysctls for things like switchdev
offload is particularly problematic. I can see the case for either way,
and don't have a fixed opinion.

The bigger discussion is trying to keep a record of the rationale for decisions
such that there isn't buried tribal knowledge. This is why Dave has always been
quite insistent on having discussions on the mailing list. There doesn't seem to
be a good long term record other than Documentation/networking or commit logs.
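
For readers new to the thread, the difference between the two fib_multipath_hash_policy values under discussion (0 = layer 3, 1 = layer 4) can be sketched as follows. This is only an illustration: the FNV-1a style mixing is a stand-in chosen for brevity, not the kernel's flow hash, and struct flow, flow_hash() and select_path() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical flow key; the kernel uses its own flow structures. */
struct flow {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

/* FNV-1a style byte mixing, used here only as a stand-in hash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

/* policy 0: addresses only (L3); policy 1: also ports and protocol (L4). */
static uint32_t flow_hash(const struct flow *f, int policy)
{
	uint32_t h = 2166136261u;

	h = mix(h, f->saddr);
	h = mix(h, f->daddr);
	if (policy == 1) {
		h = mix(h, ((uint32_t)f->sport << 16) | f->dport);
		h = mix(h, f->proto);
	}
	return h;
}

/* Map the hash onto one of npaths equal-cost next hops (npaths > 0). */
static unsigned int select_path(const struct flow *f, int policy,
				unsigned int npaths)
{
	return flow_hash(f, policy) % npaths;
}
```

Under the L3 policy every connection between the same two hosts lands on the same next hop; under L4, separate connections between those hosts can be spread across paths.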



[PATCH v2 0/5] MIPS: BPF: JIT fixes and improvements.

2017-03-14 Thread David Daney
Changes from v1:

  - Use unsigned access for SKF_AD_HATYPE

  - Added three more patches for other problems found.


Testing the BPF JIT on Cavium OCTEON (mips64) with the test-bpf module
identified some failures and unimplemented features.

With this patch set we get:

 test_bpf: Summary: 305 PASSED, 0 FAILED, [85/297 JIT'ed]

Both big and little endian tested.

We still lack eBPF support, but this is better than nothing.

David Daney (5):
  MIPS: uasm:  Add support for LHU.
  MIPS: BPF: Add JIT support for SKF_AD_HATYPE.
  MIPS: BPF: Use unsigned access for unsigned SKB fields.
  MIPS: BPF: Quit clobbering callee saved registers in JIT code.
  MIPS: BPF: Fix multiple problems in JIT skb access helpers.

 arch/mips/include/asm/uasm.h |  1 +
 arch/mips/mm/uasm-mips.c |  1 +
 arch/mips/mm/uasm.c  |  3 ++-
 arch/mips/net/bpf_jit.c  | 41 +++--
 arch/mips/net/bpf_jit_asm.S  | 23 ---
 5 files changed, 47 insertions(+), 22 deletions(-)

-- 
2.9.3



[PATCH v2 2/5] MIPS: BPF: Add JIT support for SKF_AD_HATYPE.

2017-03-14 Thread David Daney
This lets us pass some additional "modprobe test-bpf" tests with JIT
enabled.

Reuse the code for SKF_AD_IFINDEX, but substitute the offset and size
of the "type" field.

Signed-off-by: David Daney 
---
 arch/mips/net/bpf_jit.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
index 49a2e22..880e329 100644
--- a/arch/mips/net/bpf_jit.c
+++ b/arch/mips/net/bpf_jit.c
@@ -365,6 +365,12 @@ static inline void emit_half_load(unsigned int reg, unsigned int base,
emit_instr(ctx, lh, reg, offset, base);
 }
 
+static inline void emit_half_load_unsigned(unsigned int reg, unsigned int base,
+  unsigned int offset, struct jit_ctx *ctx)
+{
+   emit_instr(ctx, lhu, reg, offset, base);
+}
+
 static inline void emit_mul(unsigned int dst, unsigned int src1,
unsigned int src2, struct jit_ctx *ctx)
 {
@@ -1112,6 +1118,8 @@ static int build_body(struct jit_ctx *ctx)
break;
case BPF_ANC | SKF_AD_IFINDEX:
/* A = skb->dev->ifindex */
+   case BPF_ANC | SKF_AD_HATYPE:
+   /* A = skb->dev->type */
ctx->flags |= SEEN_SKB | SEEN_A;
off = offsetof(struct sk_buff, dev);
/* Load *dev pointer */
@@ -1120,10 +1128,15 @@ static int build_body(struct jit_ctx *ctx)
emit_bcond(MIPS_COND_EQ, r_s0, r_zero,
   b_imm(prog->len, ctx), ctx);
emit_reg_move(r_ret, r_zero, ctx);
-   BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
- ifindex) != 4);
-   off = offsetof(struct net_device, ifindex);
-   emit_load(r_A, r_s0, off, ctx);
+   if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
+   BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, ifindex) != 4);
+   off = offsetof(struct net_device, ifindex);
+   emit_load(r_A, r_s0, off, ctx);
+   } else { /* (code == (BPF_ANC | SKF_AD_HATYPE) */
+   BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, type) != 2);
+   off = offsetof(struct net_device, type);
+   emit_half_load_unsigned(r_A, r_s0, off, ctx);
+   }
break;
case BPF_ANC | SKF_AD_MARK:
ctx->flags |= SEEN_SKB | SEEN_A;
-- 
2.9.3
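
The patch above shares one code path for two ancillary loads and only varies the field offset and access width, with compile-time checks on the field sizes. A minimal user-space sketch of the same idea follows. struct fake_net_device, enum anc_field and load_anc() are hypothetical stand-ins; the real code emits MIPS load instructions rather than calling memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-down stand-in for struct net_device. */
struct fake_net_device {
	int ifindex;
	unsigned short type;
};

/* Compile-time size checks, in the spirit of BUILD_BUG_ON(FIELD_SIZEOF()). */
_Static_assert(sizeof(((struct fake_net_device *)0)->ifindex) == 4,
	       "ifindex must be 32 bits wide");
_Static_assert(sizeof(((struct fake_net_device *)0)->type) == 2,
	       "type must be 16 bits wide");

enum anc_field { ANC_IFINDEX, ANC_HATYPE };

/* Load the requested field by offset, using a width that matches it. */
static uint32_t load_anc(const struct fake_net_device *dev,
			 enum anc_field which)
{
	const unsigned char *base = (const unsigned char *)dev;

	if (which == ANC_IFINDEX) {
		uint32_t v;

		memcpy(&v, base + offsetof(struct fake_net_device, ifindex),
		       sizeof(v));
		return v;
	} else {
		uint16_t v;

		/* 16-bit unsigned load: the value is zero-extended. */
		memcpy(&v, base + offsetof(struct fake_net_device, type),
		       sizeof(v));
		return v;
	}
}
```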



Re: [PATCH v2 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-14 Thread Wolfgang Grandegger

Am 14.03.2017 um 19:08 schrieb Wolfgang Grandegger:

Hello Akshay,

Am 14.03.2017 um 17:20 schrieb Akshay Bhat:


Hi Wolfgang,

On 03/14/2017 08:11 AM, Wolfgang Grandegger wrote:

... snip ...

A few other things to check:

Run "cangen" and monitor the message with "candump -e
any,0:0,#FFF".
Then 1) disconnect the cable or 2) short-circuit CAN low and high
at the
connector. You should see error messages. After reconnection or
removing
the short-circuit (and bus-off recovery) the state should go back to
"active".



With the above sequence, candump reports "ERRORFRAME" with
protocol-violation{{}{acknowledge-slot}}, bus-error. On re-connecting
the cable the can state goes back to ACTIVE and I see the messages that
were in the queue being sent.


Do you get the ACK error also with berr-reporting off? Would be nice if
you could show a candump log here.



Below is a log for disconnecting and re-connecting CAN cable scenario:
(Note this is on a 4.1.18 kernel with RT patch)

root@imx6qrom5420b1:~# ip link set can0 up type can bitrate 100
berr-reporting on
root@imx6qrom5420b1:~# candump -e any,0:0,#FFF &


Please add "-td" ...


[1] 768
root@imx6qrom5420b1:~# cangen can0


and "-i" here.


  can0  21C   [8]  35 98 C0 7A 95 03 E6 2A
  can0  6E6   [1]  F2
  can0  5C7   [2]  42 50
  can0  57C   [8]  83 7A E4 0C 03 8B 90 45
  can0  55C   [8]  B9 74 87 52 D8 F4 64 04
  can0  014   [8]  28 CB 96 57 3B 80 67 4F
  can0  6AF   [1]  35
  can0  51E   [8]  B6 C8 6C 1D 3A 87 ED 2E
  can0  527   [8]  D0 8A D3 59 0E 34 40 78
  can0  30C   [2]  6A 12
  can0  145   [8]  CB 6E FF 55 C1 BE C3 22
  can0  5A5   [8]  C4 49 54 68 02 63 F9 35
  can0  0BA   [8]  DA 57 5E 3A CE 88 20 1C
  can0  516   [2]  09 09
  can0  743   [8]  7C 4D 25 47 61 4C 56 3D
  can0  31D   [2]  9C D3
  can0  71E   [8]  53 7C 97 2A 2A F2 9F 56
  can0  52E   [8]  FE DA 2D 51 73 96 DF 79
/disconnect cable
  can0  2088   [8]  00 00 00 19 00 00 28 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{40}{0}}
  can0  2088   [8]  00 00 00 19 00 00 58 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{88}{0}}
  can0  2088   [8]  00 00 00 19 00 00 80 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{128}{0}}


TX error warning is missing.


  can0  208C   [8]  00 20 00 19 00 00 80 00   ERRORFRAME
controller-problem{tx-error-passive}
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{128}{0}}


Here "tx-error-passive" is packed with a bus error. What I'm looking for
are state change messages similar to:

   can0  2204  [8] 00 08 00 00 00 00 60 00   ERRORFRAME
controller-problem{tx-error-warning}
state-change{tx-error-warning}
error-counter-tx-rx{{96}{0}}
   can0  2204  [8] 00 30 00 00 00 00 80 00   ERRORFRAME
controller-problem{tx-error-passive}
state-change{tx-error-passive}
error-counter-tx-rx{{128}{0}

They should always come, even with "berr-reporting off".
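
As a reference for the state changes being asked for here, this is a minimal sketch of how the error counters map to controller states. The thresholds (warning at 96, passive at 128, bus-off once the TX counter exceeds 255) are the usual CAN fault-confinement limits. The enum and function names are hypothetical, and reading the counters from data bytes 6 and 7 of the error frame is inferred from the candump lines in this thread.

```c
#include <stdint.h>

/* Hypothetical names; the kernel has its own enum can_state. */
enum bus_state {
	BUS_ERROR_ACTIVE,
	BUS_ERROR_WARNING,
	BUS_ERROR_PASSIVE,
	BUS_OFF
};

/* Classify the controller state from the TX and RX error counters. */
static enum bus_state classify_counters(unsigned int txerr,
					unsigned int rxerr)
{
	unsigned int worst = txerr > rxerr ? txerr : rxerr;

	if (txerr > 255)	/* only the TX counter can cause bus-off */
		return BUS_OFF;
	if (worst >= 128)
		return BUS_ERROR_PASSIVE;
	if (worst >= 96)
		return BUS_ERROR_WARNING;
	return BUS_ERROR_ACTIVE;
}

/* Assumption: data[6] is the TX counter and data[7] the RX counter. */
static enum bus_state classify_error_frame(const uint8_t data[8])
{
	return classify_counters(data[6], data[7]);
}
```

Applied to the log above, the frames with counters 40 and 88 are still error-active, so a warning state change would be expected somewhere between the 88 and 128 frames.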


write: No buffer space available
root@imx6qrom5420b1:~# ip -s -d link show can0
4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
mode DEFAULT group default qlen 10
link/can  promiscuity 0
can  state ERROR-PASSIVE (berr-counter tx 128 rx 0)
restart-ms 0
  bitrate 100 sample-point 0.750
  tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
  hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
  clock 1600
  re-started bus-errors arbit-lost error-warn error-pass bus-off
  0  6  0  1  1  0


The error warning and passive counters increased, though. Also, the bus
errors should come in at a rather high rate. Looking at the code, maybe
you need to test STATF to check for state changes (and not ERR).


Likely the ERR bits are only valid if the BUSERR bit in INTF is set.


RX: bytes  packets  errors  dropped overrun mcast
0  06   0   0   0
TX: bytes  packets  errors  dropped carrier collsns
10618   0   0   0   0
root@imx6qrom5420b1:~#
/re-connect cable
  can0  169   [8]  35 55 A3 1C 0F 47 2E 5B
  can0  318   [8]  11 AA 27 11 D2 1B CE 34
  can0  577   [8]  A0 A4 EE 50 8D A2 E1 3E
  can0  4ED   [8]  52 96 17 7E 31 FC 7D 7C
  can0  2E7   [8]  92 48 D4 39 05 1E 9F 50
  can0  200   [8]  4A 66 F6 02 1E 71 8E 26
  can0  29A   [8]  49 63 2E 7D C9 77 85 7A
  can0  15A   [7]  3C 0E 65 74 C3 62 80
  can0  011   [1]  D2
  can0  26B   [3]  FC D6 68
  can0  5CE   [8]  6F 02 B5 14 BC 7A D7 02

root@imx6qrom5420b1:~# ip -s -d link show can0
4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
mode DEFAULT group default qlen 10
link/can  promiscuity 0
can  state ERROR-ACTIVE (berr-counter tx 117 rx 0)
restart-ms 0
  bitrate 100 sample-point 0.750
  tq 62 

[PATCH] net: sun: sungem: fix a possible null dereference

2017-03-14 Thread Philippe Reynes
The function gem_begin_auto_negotiation dereferences
the pointer ep before testing if it's null. This
patch adds a check on ep before dereferencing it.

This issue was added by the patch 92552fdd557:
"net: sun: sungem: use new api ethtool_{get|set}_link_ksettings".

Reported-by: Dan Carpenter 
Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/sun/sungem.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index dbfca04..fa607d0 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -1259,8 +1259,9 @@ static void gem_begin_auto_negotiation(struct gem *gp,
int duplex;
u32 advertising;
 
-   ethtool_convert_link_mode_to_legacy_u32(&advertising,
-   ep->link_modes.advertising);
+   if (ep)
+   ethtool_convert_link_mode_to_legacy_u32(
+   &advertising, ep->link_modes.advertising);
 
if (gp->phy_type != phy_mii_mdio0 &&
gp->phy_type != phy_mii_mdio1)
-- 
1.7.4.4



[PATCH v2 4/5] MIPS: BPF: Quit clobbering callee saved registers in JIT code.

2017-03-14 Thread David Daney
If bpf_needs_clear_a() returns true, only actually clear it if it is
ever used.  If it is not used, we don't save and restore it, so the
clearing has the nasty side effect of clobbering caller state.

Also, don't emit stack pointer adjustment instructions if the
adjustment amount is zero.

Signed-off-by: David Daney 
---
 arch/mips/net/bpf_jit.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
index a68cd36..44b9250 100644
--- a/arch/mips/net/bpf_jit.c
+++ b/arch/mips/net/bpf_jit.c
@@ -532,7 +532,8 @@ static void save_bpf_jit_regs(struct jit_ctx *ctx, unsigned offset)
u32 sflags, tmp_flags;
 
/* Adjust the stack pointer */
-   emit_stack_offset(-align_sp(offset), ctx);
+   if (offset)
+   emit_stack_offset(-align_sp(offset), ctx);
 
tmp_flags = sflags = ctx->flags >> SEEN_SREG_SFT;
/* sflags is essentially a bitmap */
@@ -584,7 +585,8 @@ static void restore_bpf_jit_regs(struct jit_ctx *ctx,
emit_load_stack_reg(r_ra, r_sp, real_off, ctx);
 
/* Restore the sp and discard the scrach memory */
-   emit_stack_offset(align_sp(offset), ctx);
+   if (offset)
+   emit_stack_offset(align_sp(offset), ctx);
 }
 
 static unsigned int get_stack_depth(struct jit_ctx *ctx)
@@ -631,8 +633,14 @@ static void build_prologue(struct jit_ctx *ctx)
if (ctx->flags & SEEN_X)
emit_jit_reg_move(r_X, r_zero, ctx);
 
-   /* Do not leak kernel data to userspace */
-   if (bpf_needs_clear_a(&ctx->skf->insns[0]))
+   /*
+* Do not leak kernel data to userspace, we only need to clear
+* r_A if it is ever used.  In fact if it is never used, we
+* will not save/restore it, so clearing it in this case would
+* corrupt the state of the caller.
+*/
+   if (bpf_needs_clear_a(&ctx->skf->insns[0]) &&
+   (ctx->flags & SEEN_A))
emit_jit_reg_move(r_A, r_zero, ctx);
 }
 
-- 
2.9.3



[PATCH v2 5/5] MIPS: BPF: Fix multiple problems in JIT skb access helpers.

2017-03-14 Thread David Daney
 o Socket data is unsigned, so use unsigned accessor instructions.

 o Fix path result pointer generation arithmetic.

 o Fix half-word byte swapping code for unsigned semantics.

Signed-off-by: David Daney 
---
 arch/mips/net/bpf_jit_asm.S | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/mips/net/bpf_jit_asm.S b/arch/mips/net/bpf_jit_asm.S
index 5d2e0c8..88a2075 100644
--- a/arch/mips/net/bpf_jit_asm.S
+++ b/arch/mips/net/bpf_jit_asm.S
@@ -90,18 +90,14 @@ FEXPORT(sk_load_half_positive)
is_offset_in_header(2, half)
/* Offset within header boundaries */
PTR_ADDU t1, $r_skb_data, offset
-   .setreorder
-   lh  $r_A, 0(t1)
-   .setnoreorder
+   lhu $r_A, 0(t1)
 #ifdef CONFIG_CPU_LITTLE_ENDIAN
 # if defined(__mips_isa_rev) && (__mips_isa_rev >= 2)
-   wsbht0, $r_A
-   seh $r_A, t0
+   wsbh$r_A, $r_A
 # else
-   sll t0, $r_A, 24
-   andit1, $r_A, 0xff00
-   sra t0, t0, 16
-   srl t1, t1, 8
+   sll t0, $r_A, 8
+   srl t1, $r_A, 8
+   andit0, t0, 0xff00
or  $r_A, t0, t1
 # endif
 #endif
@@ -115,7 +111,7 @@ FEXPORT(sk_load_byte_positive)
is_offset_in_header(1, byte)
/* Offset within header boundaries */
PTR_ADDU t1, $r_skb_data, offset
-   lb  $r_A, 0(t1)
+   lbu $r_A, 0(t1)
jr  $r_ra
 move   $r_ret, zero
END(sk_load_byte)
@@ -139,6 +135,11 @@ FEXPORT(sk_load_byte_positive)
  * (void *to) is returned in r_s0
  *
  */
+#ifdef CONFIG_CPU_LITTLE_ENDIAN
+#define DS_OFFSET(SIZE) (4 * SZREG)
+#else
+#define DS_OFFSET(SIZE) ((4 * SZREG) + (4 - SIZE))
+#endif
 #define bpf_slow_path_common(SIZE) \
/* Quick check. Are we within reasonable boundaries? */ \
LONG_ADDIU  $r_s1, $r_skb_len, -SIZE;   \
@@ -150,7 +151,7 @@ FEXPORT(sk_load_byte_positive)
PTR_LA  t0, skb_copy_bits;  \
PTR_S   $r_ra, (5 * SZREG)($r_sp);  \
/* Assign low slot to a2 */ \
-   movea2, $r_sp;  \
+   PTR_ADDIU   a2, $r_sp, DS_OFFSET(SIZE); \
jalrt0; \
/* Reset our destination slot (DS but it's ok) */   \
 INT_S  zero, (4 * SZREG)($r_sp);   \
-- 
2.9.3
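
The pre-R2 byte-swap change in the hunk above can be checked in plain C. The sketch below models the old and new instruction sequences on a 32-bit register value. The function names are hypothetical, and sra32() emulates the MIPS arithmetic right shift so the result does not depend on how the host compiler shifts negative values.

```c
#include <stdint.h>

/* Emulate a 32-bit arithmetic right shift (MIPS sra), 0 < n < 32. */
static uint32_t sra32(uint32_t v, unsigned int n)
{
	uint32_t r = v >> n;

	if (v & 0x80000000u)
		r |= ~(0xFFFFFFFFu >> n);
	return r;
}

/* New sequence: sll 8, srl 8, andi 0xff00, or. Input is zero-extended. */
static uint32_t swap_half_new(uint32_t a)
{
	uint32_t t0 = a << 8;
	uint32_t t1 = a >> 8;

	t0 &= 0xff00;
	return t0 | t1;
}

/* Old sequence: sll 24, andi 0xff00, sra 16, srl 8, or. */
static uint32_t swap_half_old(uint32_t a)
{
	uint32_t t0 = a << 24;
	uint32_t t1 = a & 0xff00;

	t0 = sra32(t0, 16);	/* sign-extends when bit 7 of a is set */
	t1 >>= 8;
	return t0 | t1;
}
```

Both sequences agree when the low input byte is below 0x80; the old one sets the upper 16 bits of the result otherwise, which is the signed behaviour the patch removes.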



[PATCH v2 3/5] MIPS: BPF: Use unsigned access for unsigned SKB fields.

2017-03-14 Thread David Daney
The SKB vlan_tci and queue_mapping fields are unsigned, don't sign
extend these in the BPF JIT.  In the vlan_tci case, the value gets
masked so the change is not needed for correctness, but do it anyway
for agreement with the types defined in struct sk_buff.

Signed-off-by: David Daney 
---
 arch/mips/net/bpf_jit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
index 880e329..a68cd36 100644
--- a/arch/mips/net/bpf_jit.c
+++ b/arch/mips/net/bpf_jit.c
@@ -1156,7 +1156,7 @@ static int build_body(struct jit_ctx *ctx)
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
  vlan_tci) != 2);
off = offsetof(struct sk_buff, vlan_tci);
-   emit_half_load(r_s0, r_skb, off, ctx);
+   emit_half_load_unsigned(r_s0, r_skb, off, ctx);
if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) {
emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx);
} else {
@@ -1183,7 +1183,7 @@ static int build_body(struct jit_ctx *ctx)
BUILD_BUG_ON(offsetof(struct sk_buff,
  queue_mapping) > 0xff);
off = offsetof(struct sk_buff, queue_mapping);
-   emit_half_load(r_A, r_skb, off, ctx);
+   emit_half_load_unsigned(r_A, r_skb, off, ctx);
break;
default:
pr_debug("%s: Unhandled opcode: 0x%02x\n", __FILE__,
-- 
2.9.3
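
The difference between the signed and unsigned halfword loads (lh versus lhu) that this patch switches between can be shown in a few lines of C. This is an illustrative sketch with hypothetical helper names; it models the register contents after the load rather than the JIT itself.

```c
#include <stdint.h>
#include <string.h>

/* Models lh: the 16-bit value is sign-extended to 32 bits. */
static int32_t load_half_signed(const void *p)
{
	int16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Models lhu: the 16-bit value is zero-extended to 32 bits. */
static uint32_t load_half_unsigned(const void *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

For a field such as queue_mapping holding 0xFFFF, the signed load yields 0xFFFFFFFF in the register while the unsigned load yields 0x0000FFFF, which matches the u16 type of the field.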



[PATCH v2 1/5] MIPS: uasm: Add support for LHU.

2017-03-14 Thread David Daney
The follow-on BPF JIT patches use the LHU instruction, so add it.

Signed-off-by: David Daney 
---
 arch/mips/include/asm/uasm.h | 1 +
 arch/mips/mm/uasm-mips.c | 1 +
 arch/mips/mm/uasm.c  | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/mips/include/asm/uasm.h b/arch/mips/include/asm/uasm.h
index f7929f6..d7e84f1 100644
--- a/arch/mips/include/asm/uasm.h
+++ b/arch/mips/include/asm/uasm.h
@@ -135,6 +135,7 @@ Ip_u2s3u1(_lb);
 Ip_u2s3u1(_ld);
 Ip_u3u1u2(_ldx);
 Ip_u2s3u1(_lh);
+Ip_u2s3u1(_lhu);
 Ip_u2s3u1(_ll);
 Ip_u2s3u1(_lld);
 Ip_u1s2(_lui);
diff --git a/arch/mips/mm/uasm-mips.c b/arch/mips/mm/uasm-mips.c
index 763d3f1..2277499 100644
--- a/arch/mips/mm/uasm-mips.c
+++ b/arch/mips/mm/uasm-mips.c
@@ -103,6 +103,7 @@ static struct insn insn_table[] = {
{ insn_ld,  M(ld_op, 0, 0, 0, 0, 0),  RS | RT | SIMM },
{ insn_ldx, M(spec3_op, 0, 0, 0, ldx_op, lx_op), RS | RT | RD },
{ insn_lh,  M(lh_op, 0, 0, 0, 0, 0),  RS | RT | SIMM },
+   { insn_lhu,  M(lhu_op, 0, 0, 0, 0, 0),  RS | RT | SIMM },
 #ifndef CONFIG_CPU_MIPSR6
{ insn_lld,  M(lld_op, 0, 0, 0, 0, 0),  RS | RT | SIMM },
{ insn_ll,  M(ll_op, 0, 0, 0, 0, 0),  RS | RT | SIMM },
diff --git a/arch/mips/mm/uasm.c b/arch/mips/mm/uasm.c
index a829704..7f400c8 100644
--- a/arch/mips/mm/uasm.c
+++ b/arch/mips/mm/uasm.c
@@ -61,7 +61,7 @@ enum opcode {
insn_sllv, insn_slt, insn_sltiu, insn_sltu, insn_sra, insn_srl,
insn_srlv, insn_subu, insn_sw, insn_sync, insn_syscall, insn_tlbp,
insn_tlbr, insn_tlbwi, insn_tlbwr, insn_wait, insn_wsbh, insn_xor,
-   insn_xori, insn_yield, insn_lddir, insn_ldpte,
+   insn_xori, insn_yield, insn_lddir, insn_ldpte, insn_lhu,
 };
 
 struct insn {
@@ -297,6 +297,7 @@ I_u1(_jr)
 I_u2s3u1(_lb)
 I_u2s3u1(_ld)
 I_u2s3u1(_lh)
+I_u2s3u1(_lhu)
 I_u2s3u1(_ll)
 I_u2s3u1(_lld)
 I_u1s2(_lui)
-- 
2.9.3



Re: [PATCH net-next 1/1 v2] net: rmnet_data: Initial implementation

2017-03-14 Thread Subash Abhinov Kasiviswanathan

I believe that this code should be a part of that driver, not a generic
code in net/




ok, so?


Hi Jiri

I will move it to drivers/net and rename it to rmnet.

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Roopa Prabhu
On Tue, Mar 14, 2017 at 1:25 PM, Stephen Hemminger
 wrote:
> On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
> David Miller  wrote:
>
>> From: Nikolay Aleksandrov 
>> Date: Tue, 14 Mar 2017 17:58:46 +0200
>>
>> > On 14/03/17 17:55, Stephen Hemminger wrote:
>> >> On Tue, 14 Mar 2017 17:36:15 +0200
>> >> Nikolay Aleksandrov  wrote:
>> >>
>> >>> This patch adds support for ECMP hash policy choice via a new sysctl
>> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>> >>> The current values for fib_multipath_hash_policy are:
>> >>>  0 - layer 3 (default)
>> >>>  1 - layer 4
>> >>> If there's an skb hash already set and it matches the chosen policy then 
>> >>> it
>> >>> will be used instead of being calculated (currently only for L4).
>> >>> In L3 mode we always calculate the hash due to the ICMP error special
>> >>> case, the flow dissector's field consistentification should handle the
>> >>> address order thus we can remove the address reversals.
>> >>>
>> >>> Signed-off-by: Nikolay Aleksandrov 
>> >>
>> >> It is good to see ECMP come back from the grave.
>> >> Linux used to support it long ago but was abandoned after it was unstable
>> >> and removed from iproute2 in 2012.
>> >>
>> >> The old API was through route attributes which makes more sense than
>> >> doing it with sysctl. It makes more sense to use netlink instead.
>> >> Therefore please go back and do something like the old API rather than 
>> >> doing it through
>> >> sysctl.
>> >>
>> >
>> > That's what my initial version did, but this was discussed during NetConf 
>> > in Seville
>> > and it was decided that it's best to make a global sysctl, thus the change.
>>
>> Correct, we discussed this, and we all agreed to only have a sysctl for now.
>
> Why? If you are going to have private discussions please post the rationale
> in public.

Stephen, is there any reason to have a per ecmp route multipath algo
selection?
All platforms have a global multipath selection algo. I also don't see
routing daemons ready or willing to specify a per ecmp route multipath
selection algo attribute.


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Stephen Hemminger
On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
David Miller  wrote:

> From: Nikolay Aleksandrov 
> Date: Tue, 14 Mar 2017 17:58:46 +0200
> 
> > On 14/03/17 17:55, Stephen Hemminger wrote:  
> >> On Tue, 14 Mar 2017 17:36:15 +0200
> >> Nikolay Aleksandrov  wrote:
> >>   
> >>> This patch adds support for ECMP hash policy choice via a new sysctl
> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
> >>> The current values for fib_multipath_hash_policy are:
> >>>  0 - layer 3 (default)
> >>>  1 - layer 4
> >>> If there's an skb hash already set and it matches the chosen policy then 
> >>> it
> >>> will be used instead of being calculated (currently only for L4).
> >>> In L3 mode we always calculate the hash due to the ICMP error special
> >>> case, the flow dissector's field consistentification should handle the
> >>> address order thus we can remove the address reversals.
> >>>
> >>> Signed-off-by: Nikolay Aleksandrov   
> >> 
> >> It is good to see ECMP come back from the grave.
> >> Linux used to support it long ago but was abandoned after it was unstable
> >> and removed from iproute2 in 2012.
> >> 
> >> The old API was through route attributes which makes more sense than
> >> doing it with sysctl. It makes more sense to use netlink instead.
> >> Therefore please go back and do something like the old API rather than 
> >> doing it through
> >> sysctl.
> >>   
> > 
> > That's what my initial version did, but this was discussed during NetConf 
> > in Seville
> > and it was decided that it's best to make a global sysctl, thus the change. 
> >  
> 
> Correct, we discussed this, and we all agreed to only have a sysctl for now.

Why? If you are going to have private discussions please post the rationale
in public.


Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread Nikolay Aleksandrov

> On Mar 14, 2017, at 5:36 PM, Nikolay Aleksandrov 
>  wrote:
> 
> This patch adds support for ECMP hash policy choice via a new sysctl
> called fib_multipath_hash_policy and also adds support for L4 hashes.
> The current values for fib_multipath_hash_policy are:
> 0 - layer 3 (default)
> 1 - layer 4
> If there's an skb hash already set and it matches the chosen policy then it
> will be used instead of being calculated (currently only for L4).
> In L3 mode we always calculate the hash due to the ICMP error special
> case, the flow dissector's field consistentification should handle the
> address order thus we can remove the address reversals.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
> v3:
> - keep the ICMP error special handling and always calc L3 hash
>   Jakub, could you please run your tests with this version ?
> 
> v2:
> - removed the output_key_hash as it's not needed anymore
> - reverted to my original/internal patch with L3 as default hash
> 
> Documentation/networking/ip-sysctl.txt |  8 +++
> include/net/ip_fib.h   | 14 ++
> include/net/netns/ipv4.h   |  1 +
> include/net/route.h|  9 +---
> net/ipv4/fib_semantics.c   | 11 ++--
> net/ipv4/icmp.c| 19 +--
> net/ipv4/route.c   | 92 ++
> net/ipv4/sysctl_net_ipv4.c |  9 
> 8 files changed, 98 insertions(+), 65 deletions(-)
> 
[snip]
> /* To make ICMP packets follow the right flow, the multipath hash is
> - * calculated from the inner IP addresses in reverse order.
> + * calculated from the inner IP addresses.
>  */
> -static int ip_multipath_icmp_hash(struct sk_buff *skb)
> +static void ip_multipath_icmp_hash(const struct sk_buff *skb,
> +struct flowi4 *fl4)
> {
>   const struct iphdr *outer_iph = ip_hdr(skb);
>   struct icmphdr _icmph;
> @@ -1746,33 +1746,85 @@ static int ip_multipath_icmp_hash(struct sk_buff *skb)
>   struct iphdr _inner_iph;
>   const struct iphdr *inner_iph;
> 
> + fl4->saddr = outer_iph->saddr;
> + fl4->daddr = outer_iph->daddr;
>   if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
> - goto standard_hash;
> + return;
> 
>   icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
>  &_icmph);
>   if (!icmph)
> - goto standard_hash;
> + return;
> 
>   if (icmph->type != ICMP_DEST_UNREACH &&
>   icmph->type != ICMP_REDIRECT &&
>   icmph->type != ICMP_TIME_EXCEEDED &&
> - icmph->type != ICMP_PARAMETERPROB) {
> - goto standard_hash;
> - }
> + icmph->type != ICMP_PARAMETERPROB)
> + return;
> 
>   inner_iph = skb_header_pointer(skb,
>  outer_iph->ihl * 4 + sizeof(_icmph),
>  sizeof(_inner_iph), &_inner_iph);
>   if (!inner_iph)
> - goto standard_hash;
> + return;
> + fl4->saddr = inner_iph->saddr;
> + fl4->daddr = inner_iph->daddr;
> +}
> 
> - return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
> +int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
> +const struct sk_buff *skb)
> +{
> + struct net *net = fi->fib_net;
> + struct flow_keys hash_keys;
> + u32 mhash;
> 
> -standard_hash:
> - return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
> -}
> + switch (net->ipv4.sysctl_fib_multipath_hash_policy) {
> + case 0:
> + memset(&hash_keys, 0, sizeof(hash_keys));
> + hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> + if (skb && ip_hdr(skb)->protocol == IPPROTO_ICMP) {
> + struct flowi4 _fl4;
> 
> + ip_multipath_icmp_hash(skb, &_fl4);

Ugh, obviously I could’ve just passed hash_keys here. I will change this in v4,
but I’ll wait to see if there are any other comments or issues.

Thanks,
 Nik

> + hash_keys.addrs.v4addrs.src = _fl4.saddr;
> + hash_keys.addrs.v4addrs.dst = _fl4.daddr;
> + } else {
> + hash_keys.addrs.v4addrs.src = fl4->saddr;
> + hash_keys.addrs.v4addrs.dst = fl4->daddr;
> + }
> + break;
> + case 1:
> + /* skb is currently provided only when forwarding */
> + if (skb) {
> + unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> + struct flow_keys keys;
> +
> + /* short-circuit if we already have L4 hash present */
> + if (skb->l4_hash)
> + return skb_get_hash_raw(skb) >> 1;
> + memset(&hash_keys, 0, sizeof(hash_keys));
> +   

Re: [PATCH net-next v3] net: ipv4: add support for ECMP hash policy choice

2017-03-14 Thread David Miller
From: Nikolay Aleksandrov 
Date: Tue, 14 Mar 2017 17:58:46 +0200

> On 14/03/17 17:55, Stephen Hemminger wrote:
>> On Tue, 14 Mar 2017 17:36:15 +0200
>> Nikolay Aleksandrov  wrote:
>> 
>>> This patch adds support for ECMP hash policy choice via a new sysctl
>>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>>> The current values for fib_multipath_hash_policy are:
>>>  0 - layer 3 (default)
>>>  1 - layer 4
>>> If there's an skb hash already set and it matches the chosen policy then it
>>> will be used instead of being calculated (currently only for L4).
>>> In L3 mode we always calculate the hash due to the ICMP error special
>>> case, the flow dissector's field consistentification should handle the
>>> address order thus we can remove the address reversals.
>>>
>>> Signed-off-by: Nikolay Aleksandrov 
>> 
>> It is good to see ECMP come back from the grave.
>> Linux used to support it long ago but was abandoned after it was unstable
>> and removed from iproute2 in 2012.
>> 
>> The old API was through route attributes which makes more sense than
>> doing it with sysctl. It makes more sense to use netlink instead.
>> Therefore please go back and do something like the old API rather than doing 
>> it through
>> sysctl.
>> 
> 
> That's what my initial version did, but this was discussed during NetConf in 
> Seville
> and it was decided that it's best to make a global sysctl, thus the change.

Correct, we discussed this, and we all agreed to only have a sysctl for now.
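
The quoted description mentions the ICMP error special case in L3 mode. As a rough userspace illustration (not the kernel's ip_multipath_icmp_hash()), the sketch below picks the address pair to hash from a raw IPv4 packet: the outer addresses by default, and the addresses of the embedded header when the packet is an ICMP error, so that the error follows the same path as the flow it refers to. The names l3_hash_addrs(), struct addr_pair and the SK_ constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	SK_PROTO_ICMP = 1,
	SK_ICMP_DEST_UNREACH = 3,
	SK_ICMP_REDIRECT = 5,
	SK_ICMP_TIME_EXCEEDED = 11,
	SK_ICMP_PARAMETERPROB = 12,
};

struct addr_pair {
	uint32_t saddr, daddr;	/* host byte order */
};

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

/* Return the addresses an L3 multipath hash would be computed from. */
static struct addr_pair l3_hash_addrs(const uint8_t *pkt, size_t len)
{
	struct addr_pair ap = { 0, 0 };
	size_t ihl;
	uint8_t type;

	assert(pkt != NULL);
	if (len < 20)
		return ap;
	ihl = (size_t)(pkt[0] & 0x0f) * 4;
	if (ihl < 20 || len < ihl)
		return ap;

	/* Default: hash on the outer header. */
	ap.saddr = be32(pkt + 12);
	ap.daddr = be32(pkt + 16);

	if (pkt[9] != SK_PROTO_ICMP)
		return ap;
	/* Non-first fragments carry no ICMP header. */
	if (((((unsigned int)pkt[6] & 0x1f) << 8) | pkt[7]) != 0)
		return ap;
	/* Need the 8-byte ICMP header plus a minimal inner IPv4 header. */
	if (len < ihl + 8 + 20)
		return ap;

	type = pkt[ihl];
	if (type != SK_ICMP_DEST_UNREACH && type != SK_ICMP_REDIRECT &&
	    type != SK_ICMP_TIME_EXCEEDED && type != SK_ICMP_PARAMETERPROB)
		return ap;

	/* ICMP error: use the addresses of the embedded (inner) header. */
	ap.saddr = be32(pkt + ihl + 8 + 12);
	ap.daddr = be32(pkt + ihl + 8 + 16);
	return ap;
}
```

The inner addresses are taken as they are, without reversing them; the sketch assumes that a direction-independent hash (as described in the patch text) is applied afterwards.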


Re: [PATCH net-next] qed*: Add support for QL41xxx adapters

2017-03-14 Thread David Miller
From: Yuval Mintz 
Date: Tue, 14 Mar 2017 16:23:54 +0200

> This adds the necessary infrastructure changes for initializing
> and working with the new series of QL41xxx adapters.
> 
> It also adds 2 new PCI device-IDs to qede:
>   - 0x8070 for QL41xxx PFs
>   - 0x8090 for VFs spawning from QL41xxx PFs
> 
> Signed-off-by: Tomer Tayar 
> Signed-off-by: Yuval Mintz 

Applied, thanks.


Re: [PATCH net 0/7] qed: Fixes series

2017-03-14 Thread David Miller
From: Yuval Mintz 
Date: Tue, 14 Mar 2017 15:25:57 +0200

> This addresses several different issues in qed.
> The more significant portions:
> 
> Patch #1 would cause timeout when qedr utilizes the highest
> CIDs available for it [or when future qede adapters would utilize
> queues in some constellations].
> 
> Patch #4 fixes a leak of mapped addresses; When iommu is enabled,
> offloaded storage protocols might eventually run out of resources
> and fail to map additional buffers.
> 
> Patches #6,#7 were missing in the initial iSCSI infrastructure
> submissions, and would hamper qedi's stability when it reaches
> out-of-order scenarios.

Series applied, thank you.


Re: [patch net 0/2] mlxsw: Couple of fixes

2017-03-14 Thread David Miller
From: Jiri Pirko 
Date: Tue, 14 Mar 2017 13:59:59 +0100

> Couple of small fixes.

Series applied, thanks Jiri.


Re: [PATCH] net: Resend IGMP memberships upon peer notification.

2017-03-14 Thread David Miller
From: Vladislav Yasevich 
Date: Tue, 14 Mar 2017 08:58:08 -0400

> When we notify peers of potential changes,  it's also good to update
> IGMP memberships.  For example, during VM migration, updating IGMP
> memberships will redirect existing multicast streams to the VM at the
> new location.
> 
> Signed-off-by: Vladislav Yasevich 

Applied, thanks Vlad.


Re: [PATCH net-next 0/4] gtp: support multiple APN's per GTP endpoint

2017-03-14 Thread David Miller
From: Andreas Schultz 
Date: Tue, 14 Mar 2017 13:42:44 +0100 (CET)

> The specific use case of the API that is no longer supported was never used by
> anyone. The only supported and documented API for the GTP module is libgtpnl.
> libgtpnl has always required the now mandatory fields. Therefore the externally
> supported API does not change.

That's not how kernel development works, sorry.

Any user visible interface the kernel exports is not to be broken,
even if you think you control the one library that makes use of it.

This is especially important for netlink because netlink is more like
a networking protocol, that arbitrary programs can listen to for
events.


Re: [PATCH v2 net-next 00/11] net: stmmac: prepare dma operations for multiple queues

2017-03-14 Thread Joao Pinto
At 6:21 PM on 3/14/2017, David Miller wrote:
> From: Joao Pinto 
> Date: Tue, 14 Mar 2017 10:24:22 +
> 
>> As agreed with David Miller, this patch-set is the second of 3 to enable
>> multiple queues in stmmac.
>>
>> This second one concentrates on dma operations adding functionalities as:
>> a) DMA Operation Mode configuration per channel and done in the multiple
>> queues configuration function
>> b) DMA IRQ enable and Disable by channel
>> c) DMA start and stop by channel
>> d) RX and TX ring length configuration by channel
>> e) RX and TX set tail pointer by channel
>> f) DMA Channel initialization broken into Channel common, RX and TX
>> initialization
>> g) TSO being configured for all available channels
>> h) DMA interrupt treatment by channel
> 
> Patch #5 doesn't apply cleanly to net-next, please respin this series.
> 
> Thank you.
> 

Ok, I will rebase it and send you a v3 tomorrow. Thanks.


Re: [PATCH v2 net-next 00/11] net: stmmac: prepare dma operations for multiple queues

2017-03-14 Thread David Miller
From: Joao Pinto 
Date: Tue, 14 Mar 2017 10:24:22 +

> As agreed with David Miller, this patch-set is the second of 3 to enable
> multiple queues in stmmac.
> 
> This second one concentrates on dma operations adding functionalities as:
> a) DMA Operation Mode configuration per channel and done in the multiple
> queues configuration function
> b) DMA IRQ enable and Disable by channel
> c) DMA start and stop by channel
> d) RX and TX ring length configuration by channel
> e) RX and TX set tail pointer by channel
> f) DMA Channel initialization broken into Channel common, RX and TX
> initialization
> g) TSO being configured for all available channels
> h) DMA interrupt treatment by channel

Patch #5 doesn't apply cleanly to net-next, please respin this series.

Thank you.


Re: [PATCH v2 16/20] ARM: dts: sun50i-a64: enable dwmac-sun8i on pine64

2017-03-14 Thread Florian Fainelli
On 03/14/2017 07:18 AM, Corentin Labbe wrote:
> The dwmac-sun8i hardware is present on the pine64
> It uses an external PHY via RMII.
> 
> Signed-off-by: Corentin Labbe 
> ---
>  arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts 
> b/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts
> index c680ed3..b53994d 100644
> --- a/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts
> +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts
> @@ -109,3 +109,18 @@
>   {
>   status = "okay";
>  };
> +
> + {
> + ext_rmii_phy1: ethernet-phy@1 {
> +   reg = <1>;

Even though it's optional, it's nice to have a:

compatible = "ethernet-phy-ieee802.3-c22"

string here.

This applies to all DTS files that you have in subsequent patches.

Thanks!
-- 
Florian


Re: [PATCH net-next] bonding: add 802.3ad support for 25G speeds

2017-03-14 Thread Andy Gospodarek
On Tue, Mar 14, 2017 at 11:48:32AM -0400, Jarod Wilson wrote:
> Cut-n-paste enablement of 802.3ad bonding on 25G NICs, which currently
> report 0 as their bandwidth.
> 
> CC: Jay Vosburgh 
> CC: Veaceslav Falico 
> CC: Andy Gospodarek 
> CC: netdev@vger.kernel.org
> Signed-off-by: Jarod Wilson 

Looks good, Jarod.  Probably time for a 50Gbps option, too.

Acked-by: Andy Gospodarek 

> ---
> note: I swear I saw a patch for this already, but I don't see it on the
> list and it isn't committed, so here's this...
> 
>  drivers/net/bonding/bond_3ad.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index 431926bba9f4..508713b4e533 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -92,6 +92,7 @@ enum ad_link_speed_type {
>   AD_LINK_SPEED_2500MBPS,
>   AD_LINK_SPEED_10000MBPS,
>   AD_LINK_SPEED_20000MBPS,
> + AD_LINK_SPEED_25000MBPS,
>   AD_LINK_SPEED_40000MBPS,
>   AD_LINK_SPEED_56000MBPS,
>   AD_LINK_SPEED_100000MBPS,
> @@ -260,6 +261,7 @@ static inline int __check_agg_selection_timer(struct port *port)
>   * %AD_LINK_SPEED_2500MBPS,
>   * %AD_LINK_SPEED_10000MBPS
>   * %AD_LINK_SPEED_20000MBPS
> + * %AD_LINK_SPEED_25000MBPS
>   * %AD_LINK_SPEED_40000MBPS
>   * %AD_LINK_SPEED_56000MBPS
>   * %AD_LINK_SPEED_100000MBPS
> @@ -302,6 +304,10 @@ static u16 __get_link_speed(struct port *port)
>   speed = AD_LINK_SPEED_20000MBPS;
>   break;
>  
> + case SPEED_25000:
> + speed = AD_LINK_SPEED_25000MBPS;
> + break;
> +
>   case SPEED_40000:
>   speed = AD_LINK_SPEED_40000MBPS;
>   break;
> @@ -707,6 +713,9 @@ static u32 __get_agg_bandwidth(struct aggregator *aggregator)
>   case AD_LINK_SPEED_20000MBPS:
>   bandwidth = nports * 20000;
>   break;
> + case AD_LINK_SPEED_25000MBPS:
> + bandwidth = nports * 25000;
> + break;
>   case AD_LINK_SPEED_40000MBPS:
>   bandwidth = nports * 40000;
>   break;
> -- 
> 2.11.0
> 
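
The effect of the quoted patch can be illustrated with a small userspace sketch of the bandwidth calculation. This is not the bond_3ad.c code: agg_bandwidth() and the known_speeds_mbps table are hypothetical, and the list of speeds is an assumption based on the enum shown in the patch. A speed that is missing from the table yields 0, which is what 25G NICs reported before the change and what a 50G link would still report here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Speeds (in Mb/s) the sketch knows about; 25000 is the new entry. */
static const uint32_t known_speeds_mbps[] = {
	1, 10, 100, 1000, 2500, 10000, 20000, 25000, 40000, 56000, 100000,
};

/* Aggregate bandwidth in Mb/s of nports links of the same speed. */
static uint32_t agg_bandwidth(uint32_t speed_mbps, unsigned int nports)
{
	size_t i;

	/* Guard against overflow of the 32-bit product. */
	assert(nports <= UINT32_MAX / 100000);

	for (i = 0; i < sizeof(known_speeds_mbps) / sizeof(known_speeds_mbps[0]); i++) {
		if (known_speeds_mbps[i] == speed_mbps)
			return nports * speed_mbps;
	}
	return 0;	/* unknown speed: the aggregator reports no bandwidth */
}
```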


Re: [PATCH v2 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-14 Thread Wolfgang Grandegger

Hello Akshay,

On 14.03.2017 at 17:20, Akshay Bhat wrote:


Hi Wolfgang,

On 03/14/2017 08:11 AM, Wolfgang Grandegger wrote:

... snip ...

A few other things to check:

Run "cangen" and monitor the message with "candump -e any,0:0,#FFF".
Then 1) disconnect the cable or 2) short-circuit CAN low and high at the
connector. You should see error messages. After reconnection or removing
the short-circuit (and bus-off recovery) the state should go back to
"active".



With the above sequence, candump reports "ERRORFRAME" with
protocol-violation{{}{acknowledge-slot}}, bus-error. On re-connecting
the cable the can state goes back to ACTIVE and I see the messages that
were in the queue being sent.


Do you get the ACK error also with berr-reporting off? Would be nice if
you could show a candump log here.



Below is a log for disconnecting and re-connecting CAN cable scenario:
(Note this is on a 4.1.18 kernel with RT patch)

root@imx6qrom5420b1:~# ip link set can0 up type can bitrate 100
berr-reporting on
root@imx6qrom5420b1:~# candump -e any,0:0,#FFF &


Please add "-td" ...


[1] 768
root@imx6qrom5420b1:~# cangen can0


and "-i" here.


  can0  21C   [8]  35 98 C0 7A 95 03 E6 2A
  can0  6E6   [1]  F2
  can0  5C7   [2]  42 50
  can0  57C   [8]  83 7A E4 0C 03 8B 90 45
  can0  55C   [8]  B9 74 87 52 D8 F4 64 04
  can0  014   [8]  28 CB 96 57 3B 80 67 4F
  can0  6AF   [1]  35
  can0  51E   [8]  B6 C8 6C 1D 3A 87 ED 2E
  can0  527   [8]  D0 8A D3 59 0E 34 40 78
  can0  30C   [2]  6A 12
  can0  145   [8]  CB 6E FF 55 C1 BE C3 22
  can0  5A5   [8]  C4 49 54 68 02 63 F9 35
  can0  0BA   [8]  DA 57 5E 3A CE 88 20 1C
  can0  516   [2]  09 09
  can0  743   [8]  7C 4D 25 47 61 4C 56 3D
  can0  31D   [2]  9C D3
  can0  71E   [8]  53 7C 97 2A 2A F2 9F 56
  can0  52E   [8]  FE DA 2D 51 73 96 DF 79
/disconnect cable
  can0  2088   [8]  00 00 00 19 00 00 28 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{40}{0}}
  can0  2088   [8]  00 00 00 19 00 00 58 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{88}{0}}
  can0  2088   [8]  00 00 00 19 00 00 80 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{128}{0}}


TX error warning is missing.


  can0  208C   [8]  00 20 00 19 00 00 80 00   ERRORFRAME
controller-problem{tx-error-passive}
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{128}{0}}


Here "tx-error-passive" is packed with a bus error. What I'm looking for
are state change messages similar to:


   can0  2204  [8] 00 08 00 00 00 00 60 00   ERRORFRAME
controller-problem{tx-error-warning}
state-change{tx-error-warning}
error-counter-tx-rx{{96}{0}}
   can0  2204  [8] 00 30 00 00 00 00 80 00   ERRORFRAME
controller-problem{tx-error-passive}
state-change{tx-error-passive}
error-counter-tx-rx{{128}{0}}

They should always come, even with "berr-reporting off".
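
To make the expected state changes concrete, here is a small sketch that derives the controller state from the TX/RX error counters and from the payload of an error frame as printed by candump above. The thresholds (96 for warning, 128 for passive, above 255 on TX for bus-off) are the conventional CAN limits and are an assumption here, not something taken from the hi311x driver; the function and enum names are hypothetical. The sketch reads the counters from payload bytes 6 (TX) and 7 (RX), which matches the error-counter-tx-rx values in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum can_state_sketch {
	ST_ERROR_ACTIVE,
	ST_ERROR_WARNING,
	ST_ERROR_PASSIVE,
	ST_BUS_OFF,
};

/* Map error counters to a controller state (conventional thresholds). */
static enum can_state_sketch state_from_counters(unsigned int txerr,
						 unsigned int rxerr)
{
	if (txerr > 255)
		return ST_BUS_OFF;
	if (txerr >= 128 || rxerr >= 128)
		return ST_ERROR_PASSIVE;
	if (txerr >= 96 || rxerr >= 96)
		return ST_ERROR_WARNING;
	return ST_ERROR_ACTIVE;
}

/* Error frames in the log carry the counters in bytes 6 (TX) and 7 (RX). */
static enum can_state_sketch state_from_error_frame(const uint8_t data[8])
{
	assert(data != NULL);
	return state_from_counters(data[6], data[7]);
}

/* Non-zero when a state-change message should be sent, bus errors or not. */
static int needs_state_change_msg(enum can_state_sketch prev,
				  unsigned int txerr, unsigned int rxerr)
{
	return state_from_counters(txerr, rxerr) != prev;
}
```

A driver following this scheme would emit one message when the TX counter crosses 96 and another when it crosses 128, independent of the berr-reporting setting.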


write: No buffer space available
root@imx6qrom5420b1:~# ip -s -d link show can0
4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
mode DEFAULT group default qlen 10
link/can  promiscuity 0
can  state ERROR-PASSIVE (berr-counter tx 128 rx 0)
restart-ms 0
  bitrate 100 sample-point 0.750
  tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
  hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
  clock 1600
  re-started bus-errors arbit-lost error-warn error-pass bus-off
  0  6  0  1  1  0


The error-warning and error-passive counters increased, though. Also, bus
errors should come in at a rather high rate. Looking at the code, maybe
you need to test STATF to check for state changes (and not ERR).


RX: bytes  packets  errors  dropped overrun mcast
0  06   0   0   0
TX: bytes  packets  errors  dropped carrier collsns
10618   0   0   0   0
root@imx6qrom5420b1:~#
/re-connect cable
  can0  169   [8]  35 55 A3 1C 0F 47 2E 5B
  can0  318   [8]  11 AA 27 11 D2 1B CE 34
  can0  577   [8]  A0 A4 EE 50 8D A2 E1 3E
  can0  4ED   [8]  52 96 17 7E 31 FC 7D 7C
  can0  2E7   [8]  92 48 D4 39 05 1E 9F 50
  can0  200   [8]  4A 66 F6 02 1E 71 8E 26
  can0  29A   [8]  49 63 2E 7D C9 77 85 7A
  can0  15A   [7]  3C 0E 65 74 C3 62 80
  can0  011   [1]  D2
  can0  26B   [3]  FC D6 68
  can0  5CE   [8]  6F 02 B5 14 BC 7A D7 02

root@imx6qrom5420b1:~# ip -s -d link show can0
4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
mode DEFAULT group default qlen 10
link/can  promiscuity 0
can  state ERROR-ACTIVE (berr-counter tx 117 rx 0)
restart-ms 0
  bitrate 100 sample-point 0.750
  tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 

Re: [PATCH v4] {net,IB}/{rxe,usnic}: Utilize generic mac to eui32 function

2017-03-14 Thread Leon Romanovsky
On Tue, Mar 14, 2017 at 04:01:57PM +0200, Yuval Shaia wrote:
> This logic seems to be duplicated in (at least) three separate files.
> Move it to one place so the code can be reused.
>
> Signed-off-by: Yuval Shaia 
> ---
> v0 -> v1:
>   * Add missing #include
>   * Rename to genaddrconf_ifid_eui48
> v1 -> v2:
>   * Reset eui[0] to default if dev_id is used
> v2 -> v3:
>   * Add helper function to avoid re-setting eui[0] to default if
> dev_id is used
> v3 -> v4:
>   * Remove RXE wrappers
>   * Remove addrconf_addr_eui48_xor and do the eui[0] ^= 2 in the
> basic implementation
> ---
>  drivers/infiniband/hw/usnic/usnic_common_util.h | 11 +++---
>  drivers/infiniband/sw/rxe/rxe.c |  4 +++-
>  drivers/infiniband/sw/rxe/rxe_loc.h |  2 --
>  drivers/infiniband/sw/rxe/rxe_net.c | 28 
> -
>  drivers/infiniband/sw/rxe/rxe_verbs.c   |  4 +++-
>  include/net/addrconf.h  | 22 +++
>  6 files changed, 27 insertions(+), 44 deletions(-)
>

Thanks, Yuval.
Reviewed-by: Leon Romanovsky 




Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-14 Thread Eric W. Biederman
Hannes Frederic Sowa  writes:

> On 13.03.2017 23:06, Eric W. Biederman wrote:
>> Michael Kerrisk  writes:
>> 
>>> On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
>>>  wrote:
 Hi,

 On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Mon, 13 Mar 2017 00:01:24 +0100
>
>> afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>> can work with afnetns with one limitation: one cannot cross the realm
>> of a network namespace while changing the afnetns compartment. To get
>> into a new afnetns in a different net namespace, one must first change
>> to the net namespace and afterwards switch to the desired afnetns.
>
> Please explain why this is useful, who wants this kind of facility,
> and how it will be used.

 Yes, I have to enhance the cover letter:

 The work behind all this is to provide more dense container hosting.
 Right now we lose performance, because all packets need to be forwarded
 through either a bridge or must be routed until they reach the
 containers. For example, we can't make use of early demuxing for the
 incoming packets. We basically pass the networking stack twice for
 every packet.

 The usage is very much in line with how network namespaces are used
 nowadays:

 ip afnetns add afns-1
 ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
 ip afnetns exec afns-1 /usr/sbin/httpd

 this spawns a shell where all child processes will only have access to
 the specific ip addresses, even though they do a wildcard bind. Source
 address selection will also use only the ip addresses available to the
 children.

 In some sense it has lots of characteristics like ipvlan, allowing a
 single MAC address to host lots of IP addresses which will end up in
 different namespaces. Unlike ipvlan, however, it will also solve the
 problem around duplicate address detection and multiplexing packets to
 the IGMP or MLD state machines.

 The resource consumption in comparison with ordinary namespaces will be
 much lower. All in all, we will have far less networking subsystems to
 cross compared to normal netns solutions.

 Some more information also in the first patch, which adds a
 Documentation.
>> 
>> If the goal is one ip address per network namespace with a network
>> device and mac address on the network I have something that I was
>> working on that I believe is in the end a much simpler solution.
>
> Actually, it should be possible to use more than one IP address per
> namespace, proper source address selection should deal with that and
> also correctly select the higher scored ones, based on output device and
> distance to the remote ip address.

Definitely.  I should have said at least one.  Some people want address
sharing, and that precludes several kinds of optimizations.

>> Add routes in the routing table between network namespaces.
>> 
>> AKA in the initial network namespace with the network device have
>> an input route not towards the local loopback device but towards
>> the network namespaces loopback device.
>> 
>> Before other issues took precedence I made it half way to implementing
>> that.   The ip input path won't get confused if the destination network
>> device is not in the same network namespace as the device.  Last I
>> looked the ip output path still had a few places where confusion was
>> possible between the network socket and the output device.
>
> The ip afnetns input path is also of no concern to me and will work
> quite easily. Right now, the different semantics and rules for selecting
> a source address are the more problematic ones. I think, that in the
> case of directly routing from one ns into another this will be the same
> and the most complex case to deal with?

With what I am proposing that case should be drop dead simple and cause
no confusion.  The extra routes should look like ordinary routes
for forwarding packets, not local addresses and as such should cause
no confusion.  So source address selection should work perfectly as is.

>> As long as installing such routes is conditional upon having
>> CAP_NET_ADMIN in both network namespaces you should be fine and things
>> should be very simple and very fast.  Because that won't take a special
>> case through the network stack.
>> 
>> Given that performance is your primary motive I suspect this will yield
>> the fastest possible path through the network stack as no extra steps
>> need to be taken, and can benefit from any routing improvements to the
>> ordinary network stack.
>
> The major performance improvements come from socket early demuxing,
> which actually requires the remote netns socket being visible in the
> initial netns esock tables. We need the 

[PATCH v2 net-next 1/5] ldmvsw: better use of link up and down on ldom vswitch

2017-03-14 Thread Shannon Nelson
When an ldom VM is bound, the network vswitch infrastructure is set up for
it, but was being forced 'UP' by the userland switch configuration script.
When 'UP' but not actually connected to a running VM, the ipv6 neighbor
probes fail (not a horrible thing) and start cluttering up the kernel logs.
Funny thing: these are debug messages that never actually show up, but
we do see the net_ratelimited messages that say N callbacks were
suppressed.

This patch defers the netif_carrier_on() until an actual link has been
established with the VM, as indicated by receiving an LDC_EVENT_UP from
the underlying LDC protocol.  Similarly, we take the link down when we
see the LDC_EVENT_RESET.  Now when we see the ndo_open(), we reset the
link to get things talking again.

Orabug: 25525312

Signed-off-by: Shannon Nelson 
---
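A rough userspace model of the carrier handling described above, for illustration only. It is not the driver code: struct vsw_port_sketch, port_probe() and handle_events() are hypothetical names, and the EVT_ flags merely stand in for LDC_EVENT_RESET and LDC_EVENT_UP. The carrier stays off after probe, goes on only when an UP event arrives, and a RESET wins over any other event in the same batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EVT_RESET	0x1u	/* stands in for LDC_EVENT_RESET */
#define EVT_UP		0x2u	/* stands in for LDC_EVENT_UP */

struct vsw_port_sketch {
	bool is_vsw;		/* only vswitch ports manage carrier this way */
	bool carrier_ok;
	bool tx_running;
};

/* Probe: no carrier until the peer actually comes up. */
static void port_probe(struct vsw_port_sketch *p, bool is_vsw)
{
	assert(p != NULL);
	p->is_vsw = is_vsw;
	p->carrier_ok = false;
	p->tx_running = false;
}

/* Event handling: RESET takes precedence over any other event. */
static void handle_events(struct vsw_port_sketch *p, unsigned int events)
{
	assert(p != NULL);

	if (events & EVT_RESET) {
		if (p->is_vsw) {
			p->tx_running = false;
			p->carrier_ok = false;
		}
		return;
	}

	if (events & EVT_UP) {
		if (p->is_vsw) {
			p->carrier_ok = true;
			p->tx_running = true;
		}
	}
}
```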
 drivers/net/ethernet/sun/ldmvsw.c |   27 +++
 drivers/net/ethernet/sun/sunvnet_common.c |   20 +---
 drivers/net/ethernet/sun/sunvnet_common.h |1 +
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/sun/ldmvsw.c 
b/drivers/net/ethernet/sun/ldmvsw.c
index 89952de..121927b 100644
--- a/drivers/net/ethernet/sun/ldmvsw.c
+++ b/drivers/net/ethernet/sun/ldmvsw.c
@@ -1,6 +1,6 @@
 /* ldmvsw.c: Sun4v LDOM Virtual Switch Driver.
  *
- * Copyright (C) 2016 Oracle. All rights reserved.
+ * Copyright (C) 2016-2017 Oracle. All rights reserved.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -41,8 +41,8 @@
 static u8 vsw_port_hwaddr[ETH_ALEN] = {0xFE, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
 
 #define DRV_MODULE_NAME"ldmvsw"
-#define DRV_MODULE_VERSION "1.1"
-#define DRV_MODULE_RELDATE "February 3, 2017"
+#define DRV_MODULE_VERSION "1.2"
+#define DRV_MODULE_RELDATE "March 4, 2017"
 
 static char version[] =
DRV_MODULE_NAME " " DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")";
@@ -123,6 +123,20 @@ static void vsw_set_rx_mode(struct net_device *dev)
return sunvnet_set_rx_mode_common(dev, port->vp);
 }
 
+int ldmvsw_open(struct net_device *dev)
+{
+   struct vnet_port *port = netdev_priv(dev);
+   struct vio_driver_state *vio = &port->vio;
+
+   /* reset the channel */
+   vio_link_state_change(vio, LDC_EVENT_RESET);
+   vnet_port_reset(port);
+   vio_port_up(vio);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ldmvsw_open);
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void vsw_poll_controller(struct net_device *dev)
 {
@@ -133,7 +147,7 @@ static void vsw_poll_controller(struct net_device *dev)
 #endif
 
 static const struct net_device_ops vsw_ops = {
-   .ndo_open   = sunvnet_open_common,
+   .ndo_open   = ldmvsw_open,
.ndo_stop   = sunvnet_close_common,
.ndo_set_rx_mode= vsw_set_rx_mode,
.ndo_set_mac_address= sunvnet_set_mac_addr_common,
@@ -365,6 +379,11 @@ static int vsw_port_probe(struct vio_dev *vdev, const 
struct vio_device_id *id)
napi_enable(&port->napi);
vio_port_up(&port->vio);
 
+   /* assure no carrier until we receive an LDC_EVENT_UP,
+* even if the vsw config script tries to force us up
+*/
+   netif_carrier_off(dev);
+
netdev_info(dev, "LDOM vsw-port %pM\n", dev->dev_addr);
 
pr_info("%s: PORT ( remote-mac %pM%s )\n", dev->name,
diff --git a/drivers/net/ethernet/sun/sunvnet_common.c 
b/drivers/net/ethernet/sun/sunvnet_common.c
index fa2d11c..1a65892 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1,7 +1,7 @@
 /* sunvnet.c: Sun LDOM Virtual Network Driver.
  *
  * Copyright (C) 2007, 2008 David S. Miller 
- * Copyright (C) 2016 Oracle. All rights reserved.
+ * Copyright (C) 2016-2017 Oracle. All rights reserved.
  */
 
 #include 
@@ -43,7 +43,6 @@
 MODULE_VERSION("1.1");
 
 static int __vnet_tx_trigger(struct vnet_port *port, u32 start);
-static void vnet_port_reset(struct vnet_port *port);
 
 static inline u32 vnet_tx_dring_avail(struct vio_dring_state *dr)
 {
@@ -747,6 +746,13 @@ static int vnet_event_napi(struct vnet_port *port, int 
budget)
 
/* RESET takes precedent over any other event */
if (port->rx_event & LDC_EVENT_RESET) {
+   /* a link went down */
+
+   if (port->vsw == 1) {
+   netif_tx_stop_all_queues(dev);
+   netif_carrier_off(dev);
+   }
+
vio_link_state_change(vio, LDC_EVENT_RESET);
vnet_port_reset(port);
vio_port_up(vio);
@@ -766,6 +772,13 @@ static int vnet_event_napi(struct vnet_port *port, int 
budget)
}
 
if (port->rx_event & LDC_EVENT_UP) {
+   /* a link came up */
+
+   if (port->vsw == 1) {
+   netif_carrier_on(port->dev);
+   netif_tx_start_all_queues(port->dev);
+   }
+
 

[PATCH v2 net-next 3/5] sunvnet: track port queues correctly

2017-03-14 Thread Shannon Nelson
Track our used and unused queue indices correctly.  Otherwise, as ports
dropped out and returned, they all eventually ended up with the same
queue index.

Orabug: 25190537

Signed-off-by: Shannon Nelson 
---
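A small userspace sketch of the least-used queue selection, for illustration only. It follows the logic of the diff below but is not the driver code: struct vnet_sketch, struct txq_port and SKETCH_MAX_TXQS are hypothetical stand-ins for the driver's structures and for VNET_MAX_TXQS.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_TXQS 4	/* hypothetical; stands in for VNET_MAX_TXQS */

struct vnet_sketch {
	uint8_t q_used[SKETCH_MAX_TXQS];	/* ports currently on each queue */
	int nports;
};

struct txq_port {
	int q_index;
};

/* Assign the first unused queue, or else the first least-used one. */
static void port_add_txq(struct vnet_sketch *vp, struct txq_port *port)
{
	int smallest = 0;
	int i;

	for (i = 0; i < SKETCH_MAX_TXQS; i++) {
		if (vp->q_used[i] == 0) {
			smallest = i;
			break;
		}
		if (vp->q_used[i] < vp->q_used[smallest])
			smallest = i;
	}

	vp->nports++;
	vp->q_used[smallest]++;
	port->q_index = smallest;
}

/* Release the port's queue so a later port can reuse it. */
static void port_rm_txq(struct vnet_sketch *vp, struct txq_port *port)
{
	assert(vp->q_used[port->q_index] > 0);
	vp->nports--;
	vp->q_used[port->q_index]--;
	port->q_index = 0;
}
```

When a port drops out and another one arrives, the newcomer takes over the freed queue instead of doubling up on a queue that is already in use.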
 drivers/net/ethernet/sun/sunvnet_common.c |   24 
 drivers/net/ethernet/sun/sunvnet_common.h |   11 ++-
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet_common.c 
b/drivers/net/ethernet/sun/sunvnet_common.c
index d3dc8ed..5e1d016 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1728,11 +1728,25 @@ void sunvnet_poll_controller_common(struct net_device 
*dev, struct vnet *vp)
 void sunvnet_port_add_txq_common(struct vnet_port *port)
 {
struct vnet *vp = port->vp;
-   int n;
+   int smallest = 0;
+   int i;
+
+   /* find the first least-used q
+* When there are more ldoms than q's, we start to
+* double up on ports per queue.
+*/
+   for (i = 0; i < VNET_MAX_TXQS; i++) {
+   if (vp->q_used[i] == 0) {
+   smallest = i;
+   break;
+   }
+   if (vp->q_used[i] < vp->q_used[smallest])
+   smallest = i;
+   }
 
-   n = vp->nports++;
-   n = n & (VNET_MAX_TXQS - 1);
-   port->q_index = n;
+   vp->nports++;
+   vp->q_used[smallest]++;
+   port->q_index = smallest;
netif_tx_wake_queue(netdev_get_tx_queue(VNET_PORT_TO_NET_DEVICE(port),
port->q_index));
 }
@@ -1743,5 +1757,7 @@ void sunvnet_port_rm_txq_common(struct vnet_port *port)
port->vp->nports--;
netif_tx_stop_queue(netdev_get_tx_queue(VNET_PORT_TO_NET_DEVICE(port),
port->q_index));
+   port->vp->q_used[port->q_index]--;
+   port->q_index = 0;
 }
 EXPORT_SYMBOL_GPL(sunvnet_port_rm_txq_common);
diff --git a/drivers/net/ethernet/sun/sunvnet_common.h 
b/drivers/net/ethernet/sun/sunvnet_common.h
index c0fac03..b20d6fa 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.h
+++ b/drivers/net/ethernet/sun/sunvnet_common.h
@@ -112,22 +112,15 @@ struct vnet_mcast_entry {
 };
 
 struct vnet {
-   /* Protects port_list and port_hash.  */
-   spinlock_t  lock;
-
+   spinlock_t  lock; /* Protects port_list and port_hash.  */
struct net_device   *dev;
-
u32 msg_enable;
-
+   u8  q_used[VNET_MAX_TXQS];
struct list_headport_list;
-
struct hlist_head   port_hash[VNET_PORT_HASH_SIZE];
-
struct vnet_mcast_entry *mcast_list;
-
struct list_headlist;
u64 local_mac;
-
int nports;
 };
 
-- 
1.7.1
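
The queue selection described in this patch can be sketched outside the kernel as follows. This is only an illustrative, self-contained approximation: the function names pick_least_used_queue(), port_add() and port_remove() are hypothetical and do not appear in the driver, and the q_used array is passed in explicitly instead of living in struct vnet.

```c
#include <stdint.h>

/* Return the index of the first least-used queue.
 * An unused queue (count 0) wins immediately; otherwise the first
 * queue with the lowest use count is chosen, so ports only start to
 * share a queue once every queue has at least one port.
 */
static int pick_least_used_queue(const uint8_t q_used[], int nqueues)
{
	int smallest = 0;
	int i;

	for (i = 0; i < nqueues; i++) {
		if (q_used[i] == 0)
			return i;
		if (q_used[i] < q_used[smallest])
			smallest = i;
	}
	return smallest;
}

/* Hypothetical helper: assign a queue to a new port and count it. */
static int port_add(uint8_t q_used[], int nqueues)
{
	int q = pick_least_used_queue(q_used, nqueues);

	q_used[q]++;
	return q;
}

/* Hypothetical helper: release the queue a departing port was using. */
static void port_remove(uint8_t q_used[], int q)
{
	if (q_used[q] > 0)
		q_used[q]--;
}
```

With per-queue counts, a port that leaves frees its slot, and the next port to arrive reuses that slot instead of piling onto an already busy queue, which is the failure mode the commit message describes.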



Re: [PATCH net-next 09/12] net: bcmgenet: return EOPNOTSUPP for unknown ioctl commands

2017-03-14 Thread Doug Berger
On 03/14/2017 04:04 AM, David Laight wrote:
> From: Doug Berger
>> Sent: 14 March 2017 00:42
>> This commit changes the ioctl handling behavior to return the
>> EOPNOTSUPP error code instead of the EINVAL error code when an
>> unknown ioctl command value is detected.
>>
>> It also removes some redundant parsing of the ioctl command value
>> and allows the SIOCSHWTSTAMP value to be handled.
> 
> A better description would seem to be:
> Remove checks on ioctl command and just forward all ioctl requests
> to phy_mii_ioctl().
That is a good description of the code change, but I felt that was
clearly conveyed by the patch content.  I thought it would be a better
use of the comment to describe the more subtle functional change that
might be less clear.

> 
> I also thought the 'generic' response to an unknown ioctl command
> was ENOTTY.
and I think it probably helped solicit this feedback :).  I would have
thought that error makes more sense if there is no ioctl handler, but I
will definitely look into it.

Thanks for the feedback,
Doug


[PATCH v2 net-next 2/5] sunvnet: add stats to track ldom to ldom packets and bytes

2017-03-14 Thread Shannon Nelson
In this driver, there is a "port" created for the connection to each of
the other ldoms; a netdev queue is mapped to each port, and they are
collected under a single netdev.  The generic netdev statistics show
us all the traffic in and out of our network device, but don't show
individual queue/port stats.  This patch breaks out the traffic counts
for the individual ports and gives us a little view into the state of
those connections.

Orabug: 25190537

Signed-off-by: Shannon Nelson 
---
 drivers/net/ethernet/sun/sunvnet.c|  116 -
 drivers/net/ethernet/sun/sunvnet_common.c |6 ++
 drivers/net/ethernet/sun/sunvnet_common.h |   15 
 3 files changed, 136 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c 
b/drivers/net/ethernet/sun/sunvnet.c
index 4cc2571..7543bdd 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -1,7 +1,7 @@
 /* sunvnet.c: Sun LDOM Virtual Network Driver.
  *
  * Copyright (C) 2007, 2008 David S. Miller 
- * Copyright (C) 2016 Oracle. All rights reserved.
+ * Copyright (C) 2016-2017 Oracle. All rights reserved.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -77,11 +77,125 @@ static void vnet_set_msglevel(struct net_device *dev, u32 
value)
vp->msg_enable = value;
 }
 
+static const struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { "rx_packets" },
+   { "tx_packets" },
+   { "rx_bytes" },
+   { "tx_bytes" },
+   { "rx_errors" },
+   { "tx_errors" },
+   { "rx_dropped" },
+   { "tx_dropped" },
+   { "multicast" },
+   { "rx_length_errors" },
+   { "rx_frame_errors" },
+   { "rx_missed_errors" },
+   { "tx_carrier_errors" },
+   { "nports" },
+};
+
+static int vnet_get_sset_count(struct net_device *dev, int sset)
+{
+   struct vnet *vp = (struct vnet *)netdev_priv(dev);
+
+   switch (sset) {
+   case ETH_SS_STATS:
+   return ARRAY_SIZE(ethtool_stats_keys)
+   + (NUM_VNET_PORT_STATS * vp->nports);
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static void vnet_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   struct vnet *vp = (struct vnet *)netdev_priv(dev);
+   struct vnet_port *port;
+   char *p = (char *)buf;
+
+   switch (stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, &ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   p += sizeof(ethtool_stats_keys);
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(port, &vp->port_list, list) {
+   snprintf(p, ETH_GSTRING_LEN, "p%u.%s-%pM",
+port->q_index, port->switch_port ? "s" : "q",
+port->raddr);
+   p += ETH_GSTRING_LEN;
+   snprintf(p, ETH_GSTRING_LEN, "p%u.rx_packets",
+port->q_index);
+   p += ETH_GSTRING_LEN;
+   snprintf(p, ETH_GSTRING_LEN, "p%u.tx_packets",
+port->q_index);
+   p += ETH_GSTRING_LEN;
+   snprintf(p, ETH_GSTRING_LEN, "p%u.rx_bytes",
+port->q_index);
+   p += ETH_GSTRING_LEN;
+   snprintf(p, ETH_GSTRING_LEN, "p%u.tx_bytes",
+port->q_index);
+   p += ETH_GSTRING_LEN;
+   snprintf(p, ETH_GSTRING_LEN, "p%u.event_up",
+port->q_index);
+   p += ETH_GSTRING_LEN;
+   snprintf(p, ETH_GSTRING_LEN, "p%u.event_reset",
+port->q_index);
+   p += ETH_GSTRING_LEN;
+   }
+   rcu_read_unlock();
+   break;
+   default:
+   WARN_ON(1);
+   break;
+   }
+}
+
+static void vnet_get_ethtool_stats(struct net_device *dev,
+  struct ethtool_stats *estats, u64 *data)
+{
+   struct vnet *vp = (struct vnet *)netdev_priv(dev);
+   struct vnet_port *port;
+   int i = 0;
+
+   data[i++] = dev->stats.rx_packets;
+   data[i++] = dev->stats.tx_packets;
+   data[i++] = dev->stats.rx_bytes;
+   data[i++] = dev->stats.tx_bytes;
+   data[i++] = dev->stats.rx_errors;
+   data[i++] = dev->stats.tx_errors;
+   data[i++] = dev->stats.rx_dropped;
+   data[i++] = dev->stats.tx_dropped;
+   data[i++] = dev->stats.multicast;
+   data[i++] = dev->stats.rx_length_errors;
+   data[i++] = dev->stats.rx_frame_errors;
+   data[i++] = dev->stats.rx_missed_errors;
+   data[i++] = dev->stats.tx_carrier_errors;
+   data[i++] = vp->nports;
+
+   rcu_read_lock();
+   

[PATCH v2 net-next 4/5] sunvnet: count multicast packets

2017-03-14 Thread Shannon Nelson
Make sure multicast packets get counted in the device.

Orabug: 25190537

Signed-off-by: Shannon Nelson 
---
 drivers/net/ethernet/sun/sunvnet_common.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet_common.c 
b/drivers/net/ethernet/sun/sunvnet_common.c
index 5e1d016..0c35a9a 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -409,6 +409,8 @@ static int vnet_rx_one(struct vnet_port *port, struct 
vio_net_desc *desc)
 
skb->ip_summed = port->switch_port ? CHECKSUM_NONE : CHECKSUM_PARTIAL;
 
+   if (unlikely(is_multicast_ether_addr(eth_hdr(skb)->h_dest)))
+   dev->stats.multicast++;
dev->stats.rx_packets++;
dev->stats.rx_bytes += len;
port->stats.rx_packets++;
-- 
1.7.1
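
As a rough illustration of what the added check counts: an Ethernet destination address is a group (multicast) address when the least significant bit of its first octet is set, and the broadcast address is a special case of that. The sketch below is a userspace stand-in, not the kernel helper; is_multicast_mac(), struct rx_stats and count_rx() are hypothetical names used only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's multicast test: the I/G bit
 * is the lowest bit of the first octet of the destination address.
 */
static bool is_multicast_mac(const uint8_t addr[6])
{
	return (addr[0] & 0x01) != 0;
}

/* Hypothetical, minimal receive counters for the example. */
struct rx_stats {
	unsigned long rx_packets;
	unsigned long multicast;
};

/* Count one received frame, bumping the multicast counter as well
 * when the destination is a group address.
 */
static void count_rx(struct rx_stats *stats, const uint8_t dest[6])
{
	if (is_multicast_mac(dest))
		stats->multicast++;
	stats->rx_packets++;
}
```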



[PATCH v2 net-next 5/5] sunvnet: xoff not needed when removing port link

2017-03-14 Thread Shannon Nelson
The sunvnet netdev is connected to the controlling ldom's vswitch
for network bridging.  However, for higher performance between ldoms,
there also is a channel between each client ldom.  These connections are
represented in the sunvnet driver by a queue for each ldom.  The driver
uses select_queue to tell the stack which queue to use by tracking the mac
addresses on the other end of each port.  When a connected ldom shuts down,
the driver receives an LDC_EVENT_RESET and the port is removed from the
driver, thus a queue with no ldom on the other end will never be selected
for Tx.

The driver was trying to reinforce the "don't use this queue" notion with
netif_tx_stop_queue() and netif_tx_wake_queue(), which really should only
be used to signal a Tx queue is full (aka XOFF).  This misuse of queue
state resulted in NETDEV WATCHDOG messages and lots of unnecessary calls
into the driver's tx_timeout handler.  Simply removing these takes care
of the problem.

Orabug: 25190537

Signed-off-by: Shannon Nelson 
---
 drivers/net/ethernet/sun/sunvnet_common.c |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet_common.c 
b/drivers/net/ethernet/sun/sunvnet_common.c
index 0c35a9a..7febfb6 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1749,16 +1749,12 @@ void sunvnet_port_add_txq_common(struct vnet_port *port)
vp->nports++;
vp->q_used[smallest]++;
port->q_index = smallest;
-   netif_tx_wake_queue(netdev_get_tx_queue(VNET_PORT_TO_NET_DEVICE(port),
-   port->q_index));
 }
 EXPORT_SYMBOL_GPL(sunvnet_port_add_txq_common);
 
 void sunvnet_port_rm_txq_common(struct vnet_port *port)
 {
port->vp->nports--;
-   netif_tx_stop_queue(netdev_get_tx_queue(VNET_PORT_TO_NET_DEVICE(port),
-   port->q_index));
port->vp->q_used[port->q_index]--;
port->q_index = 0;
 }
-- 
1.7.1



[PATCH v2 net-next 0/5] sunvnet: better connection management

2017-03-14 Thread Shannon Nelson
These patches remove some problems in handling of carrier state
with the ldmvsw vswitch, remove an xoff misuse in sunvnet, and
add stats for debug and tracking of point-to-point connections
between the ldom VMs.

v2:
 - added ldmvsw ndo_open to reset the LDC channel
 - updated copyrights

Shannon Nelson (5):
  ldmvsw: better use of link up and down on ldom vswitch
  sunvnet: add stats to track ldom to ldom packets and bytes
  sunvnet: track port queues correctly
  sunvnet: count multicast packets
  sunvnet: xoff not needed when removing port link

 drivers/net/ethernet/sun/ldmvsw.c |   27 ++-
 drivers/net/ethernet/sun/sunvnet.c|  116 -
 drivers/net/ethernet/sun/sunvnet_common.c |   56 +++---
 drivers/net/ethernet/sun/sunvnet_common.h |   27 +--
 4 files changed, 201 insertions(+), 25 deletions(-)


Re: [PATCH net-next 02/12] net: phy: bcm7xxx: add support for 28nm EPHY

2017-03-14 Thread Doug Berger
On 03/13/2017 07:43 PM, Andrew Lunn wrote:
> On Mon, Mar 13, 2017 at 07:06:25PM -0700, Doug Berger wrote:
>> On 03/13/2017 06:06 PM, Andrew Lunn wrote:
>>> On Mon, Mar 13, 2017 at 05:41:32PM -0700, Doug Berger wrote:
>>>> +static int bcm7xxx_28nm_ephy_01_afe_config_init(struct phy_device *phydev)
>>>> +{
>>>> +  int ret;
>>>> +
>>>> +  /* set shadow mode 2 */
>>>> +  ret = phy_set_clr_bits(phydev, MII_BCM7XXX_TEST,
>>>> + MII_BCM7XXX_SHD_MODE_2, 0);
>>>> +  if (ret < 0)
>>>> +  return ret;
>>>> +
>>>> +  /* Set current trim values INT_trim = -1, Ext_trim =0 */
>>>> +  ret = phy_write(phydev, MII_BCM7XXX_SHD_2_BIAS_TRIM, 0x3BE0);
>>>> +  if (ret < 0)
>>>> +  goto reset_shadow_mode;
>>>> +
>>>> +  /* Cal reset */
>>>> +  ret = phy_write(phydev, MII_BCM7XXX_SHD_2_ADDR_CTRL,
>>>> +  MII_BCM7XXX_SHD_3_TL4);
>>>> +  if (ret < 0)
>>>> +  goto reset_shadow_mode;
>>>
>>> Hi Doug
>>>
>>> It would be nice to have a few blank lines here and there...
>>>
>> Thanks for taking the time to review this.
>>
>> In general I try to keep lines of related functionality together and use
>> the blank lines to help identify boundaries.  In this particular case, I
>> believe it is clearer to keep the code that may return an error code
>> together with the code that tests for the error.
> 
> Hi Doug
> 
> I agree with that. Which is why i placed the comment between the goto
> and the next block of code. This is where i think there should be a
> blank line, to separate it from setting the trim values.
> 
OK, I see.  I thought you were referring to the code blocks above the
comment.  In that case, as described earlier, the code below the comment
is tightly coupled with the code above the comment since the pair of
transactions are how we "/* Cal reset */".  The idea of introducing a
subroutine/helper function for these paired (addr/data) transactions
might help readability so I will consider it for a future patch.

Thanks again for the feedback,
Doug
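
The helper idea mentioned above, one call that performs the paired address/data writes and stops at the first error, might look roughly like the sketch below. Everything here is hypothetical: struct mock_phy and mock_write() only stand in for a real PHY device and its register write so the example is self-contained, and write_addr_data() is not a function from the patch or from the kernel.

```c
#include <stdint.h>

#define MOCK_LOG_SIZE 8

/* Hypothetical mock of a PHY: records writes and can be told to fail
 * at a given write index (fail_at < 0 means never fail).
 */
struct mock_phy {
	int fail_at;
	int nwrites;
	uint16_t regs[MOCK_LOG_SIZE];
	uint16_t vals[MOCK_LOG_SIZE];
};

/* Hypothetical register write: 0 on success, negative on error. */
static int mock_write(struct mock_phy *phy, uint16_t reg, uint16_t val)
{
	if (phy->nwrites == phy->fail_at || phy->nwrites >= MOCK_LOG_SIZE)
		return -1;
	phy->regs[phy->nwrites] = reg;
	phy->vals[phy->nwrites] = val;
	phy->nwrites++;
	return 0;
}

/* Paired transaction: select the target through the address register,
 * then write the data register.  The error from the first write is
 * returned without attempting the second, so the caller keeps a
 * single test-and-goto per logical operation.
 */
static int write_addr_data(struct mock_phy *phy,
			   uint16_t addr_reg, uint16_t addr,
			   uint16_t data_reg, uint16_t data)
{
	int ret;

	ret = mock_write(phy, addr_reg, addr);
	if (ret < 0)
		return ret;

	return mock_write(phy, data_reg, data);
}
```

A caller would then express a step such as the calibration reset as one call followed by one error check, which keeps the call and its test together as preferred in the discussion above.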


Re: [PATCH] net: usb: rtl8150: use new api ethtool_{get|set}_link_ksettings

2017-03-14 Thread Petko Manolov
On 17-03-13 17:00:20, Petko Manolov wrote:
> On 17-03-12 23:16:25, Philippe Reynes wrote:
> > The ethtool api {get|set}_settings is deprecated. We move this driver to 
> > new 
> > api {get|set}_link_ksettings.
> > 
> > As I don't have the hardware, I'd be very pleased if someone may test this 
> > patch.
> 
> I've got some old adapters around and will drop you a line when i test the 
> patch.

The adapter is working fine with your patch.  You may add:

Acked-by: Petko Manolov 


cheers,
Petko


> > Signed-off-by: Philippe Reynes 
> > ---
> >  drivers/net/usb/rtl8150.c |   35 ---
> >  1 files changed, 20 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/net/usb/rtl8150.c b/drivers/net/usb/rtl8150.c
> > index c81c791..daaa88a 100644
> > --- a/drivers/net/usb/rtl8150.c
> > +++ b/drivers/net/usb/rtl8150.c
> > @@ -791,47 +791,52 @@ static void rtl8150_get_drvinfo(struct net_device 
> > *netdev, struct ethtool_drvinf
> > usb_make_path(dev->udev, info->bus_info, sizeof(info->bus_info));
> >  }
> >  
> > -static int rtl8150_get_settings(struct net_device *netdev, struct 
> > ethtool_cmd *ecmd)
> > +static int rtl8150_get_link_ksettings(struct net_device *netdev,
> > + struct ethtool_link_ksettings *ecmd)
> >  {
> > rtl8150_t *dev = netdev_priv(netdev);
> > short lpa, bmcr;
> > +   u32 supported;
> >  
> > -   ecmd->supported = (SUPPORTED_10baseT_Half |
> > +   supported = (SUPPORTED_10baseT_Half |
> >   SUPPORTED_10baseT_Full |
> >   SUPPORTED_100baseT_Half |
> >   SUPPORTED_100baseT_Full |
> >   SUPPORTED_Autoneg |
> >   SUPPORTED_TP | SUPPORTED_MII);
> > -   ecmd->port = PORT_TP;
> > -   ecmd->transceiver = XCVR_INTERNAL;
> > -   ecmd->phy_address = dev->phy;
> > +   ecmd->base.port = PORT_TP;
> > +   ecmd->base.phy_address = dev->phy;
> > get_registers(dev, BMCR, 2, &bmcr);
> > get_registers(dev, ANLP, 2, &lpa);
> > if (bmcr & BMCR_ANENABLE) {
> > u32 speed = ((lpa & (LPA_100HALF | LPA_100FULL)) ?
> >  SPEED_100 : SPEED_10);
> > -   ethtool_cmd_speed_set(ecmd, speed);
> > -   ecmd->autoneg = AUTONEG_ENABLE;
> > +   ecmd->base.speed = speed;
> > +   ecmd->base.autoneg = AUTONEG_ENABLE;
> > if (speed == SPEED_100)
> > -   ecmd->duplex = (lpa & LPA_100FULL) ?
> > +   ecmd->base.duplex = (lpa & LPA_100FULL) ?
> > DUPLEX_FULL : DUPLEX_HALF;
> > else
> > -   ecmd->duplex = (lpa & LPA_10FULL) ?
> > +   ecmd->base.duplex = (lpa & LPA_10FULL) ?
> > DUPLEX_FULL : DUPLEX_HALF;
> > } else {
> > -   ecmd->autoneg = AUTONEG_DISABLE;
> > -   ethtool_cmd_speed_set(ecmd, ((bmcr & BMCR_SPEED100) ?
> > -SPEED_100 : SPEED_10));
> > -   ecmd->duplex = (bmcr & BMCR_FULLDPLX) ?
> > +   ecmd->base.autoneg = AUTONEG_DISABLE;
> > +   ecmd->base.speed = ((bmcr & BMCR_SPEED100) ?
> > +SPEED_100 : SPEED_10);
> > +   ecmd->base.duplex = (bmcr & BMCR_FULLDPLX) ?
> > DUPLEX_FULL : DUPLEX_HALF;
> > }
> > +
> > +   ethtool_convert_legacy_u32_to_link_mode(ecmd->link_modes.supported,
> > +   supported);
> > +
> > return 0;
> >  }
> >  
> >  static const struct ethtool_ops ops = {
> > .get_drvinfo = rtl8150_get_drvinfo,
> > -   .get_settings = rtl8150_get_settings,
> > -   .get_link = ethtool_op_get_link
> > +   .get_link = ethtool_op_get_link,
> > +   .get_link_ksettings = rtl8150_get_link_ksettings,
> >  };
> >  
> >  static int rtl8150_ioctl(struct net_device *netdev, struct ifreq *rq, int 
> > cmd)
> > -- 
> > 1.7.4.4
> > 
> > 
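
For readers following the speed/duplex logic in the quoted patch, here is a standalone sketch of the same decision tree. The register bit values match the standard MII definitions, but struct link_state and decode_link() are hypothetical names for this example and are not part of the driver or of the ethtool API.

```c
#include <stdint.h>

/* Standard MII register bits (values as in the MII definitions). */
#define BMCR_FULLDPLX	0x0100	/* full duplex forced */
#define BMCR_ANENABLE	0x1000	/* autonegotiation enabled */
#define BMCR_SPEED100	0x2000	/* 100 Mb/s forced */
#define LPA_10FULL	0x0040	/* partner can do 10 Mb/s full duplex */
#define LPA_100HALF	0x0080	/* partner can do 100 Mb/s half duplex */
#define LPA_100FULL	0x0100	/* partner can do 100 Mb/s full duplex */

/* Hypothetical result type for the example. */
struct link_state {
	int speed;		/* 10 or 100, in Mb/s */
	int full_duplex;	/* 1 for full, 0 for half */
	int autoneg;		/* 1 if autonegotiation is enabled */
};

/* With autonegotiation on, derive speed and duplex from the link
 * partner ability register; otherwise use the forced BMCR bits.
 */
static struct link_state decode_link(uint16_t bmcr, uint16_t lpa)
{
	struct link_state st;

	if (bmcr & BMCR_ANENABLE) {
		st.autoneg = 1;
		st.speed = (lpa & (LPA_100HALF | LPA_100FULL)) ? 100 : 10;
		if (st.speed == 100)
			st.full_duplex = (lpa & LPA_100FULL) ? 1 : 0;
		else
			st.full_duplex = (lpa & LPA_10FULL) ? 1 : 0;
	} else {
		st.autoneg = 0;
		st.speed = (bmcr & BMCR_SPEED100) ? 100 : 10;
		st.full_duplex = (bmcr & BMCR_FULLDPLX) ? 1 : 0;
	}
	return st;
}
```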


Re: net: deadlock between ip_expire/sch_direct_xmit

2017-03-14 Thread Cong Wang
On Tue, Mar 14, 2017 at 7:56 AM, Eric Dumazet  wrote:
> On Tue, Mar 14, 2017 at 7:46 AM, Dmitry Vyukov  wrote:
>
>> I am confused. Lockdep has observed both of these stacks:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&(&q->lock)->rlock);
>>                                lock(_xmit_ETHER#2);
>>                                lock(&(&q->lock)->rlock);
>>   lock(_xmit_ETHER#2);
>>
>>
>> So it somehow happened. Or what do you mean?
>>
>
> Lockdep said " possible circular locking dependency detected " .
> It is not an actual deadlock, but lockdep machinery firing.
>
> For a deadlock to happen, this would require that the ICMP message
> sent by ip_expire() is itself fragmented and reassembled.
> This cannot be, because ICMP messages are not candidates for
> fragmentation, but lockdep can not know that of course...

It doesn't have to be ICMP; as long as we get the same hash for
the inet_frag_queue, we will need to take the same lock and
a deadlock will happen.

hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);

So it is really up to this hash function.


Re: [RFC v1 for accelerated IPoIB 25/25] mlx5_ib: skeleton for mlx5_ib to support ipoib_ops

2017-03-14 Thread Erez Shitrit
On Tue, Mar 14, 2017 at 6:10 PM, Jason Gunthorpe
 wrote:
> On Tue, Mar 14, 2017 at 04:53:24PM +0200, Erez Shitrit wrote:
>
>> > Why isn't this stuff in open/close?
>>
>> According to the ipoib control flows, there is a difference between
>> open/close and init/cleanup. For example, in open/close the driver
>> doesn't destroy hw resources, it just changes the state; it destroys
>> them in cleanup.
>
> So put it in mlx5_alloc_rdma_netdev then?
>
> Or ndo.init as was suggested?

I can do that. As I said in reply to your previous suggestion, I will
add the ib_device to the rdma_netdev and use ndo.init.

>
> Or in the void (*setup)(struct net_device *)
>
>> >> + param.size_base_priv = sizeof(struct ipoib_rdma_netdev);
>> >
>> > This is really weird, the code in mlx5i_create_netdev calls
>> > ipoib_dev_priv so it must assume the struct is a ipoib_rdma_netdev.
>>
>> It is the same approach as in the vnic/hfi
>> (https://patchwork.kernel.org/patch/9587815/)
>
> Not quite, they call alloc_netdev_mqs directly, here indirects through
> mlx5i_create_netdev which assumes a priv layout, Just drop
> param.size_base_priv and put that same calculation in
> mlx5i_create_netdev..

We are sharing 2 drivers as the low level driver; anyway, I will find
a way to do that.

>
> Jason
>


Re: [PATCH] net: Resend IGMP memberships upon peer notification.

2017-03-14 Thread Michael S. Tsirkin
On Tue, Mar 14, 2017 at 08:58:08AM -0400, Vladislav Yasevich wrote:
> When we notify peers of potential changes,  it's also good to update
> IGMP memberships.  For example, during VM migration, updating IGMP
> memberships will redirect existing multicast streams to the VM at the
> new location.
> 
> Signed-off-by: Vladislav Yasevich 

Seems to make sense

Acked-by: Michael S. Tsirkin 

but I also think there's another problem: source does not
leave the groups on migration. So I think we should add code on the
host - it's snooping IGMPs so it should be able to leave groups on the
source when VM is disconnected.



> ---
>  net/core/dev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a229bf0..1ed927d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1272,6 +1272,7 @@ void netdev_notify_peers(struct net_device *dev)
>  {
>   rtnl_lock();
>   call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
> + call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
>   rtnl_unlock();
>  }
>  EXPORT_SYMBOL(netdev_notify_peers);
> -- 
> 2.7.4


Re: [PATCH] tcp_westwood: fix tcp_westwood_info() style mistakes

2017-03-14 Thread Stephen Hemminger
On Tue, 14 Mar 2017 15:26:24 +0800
Chun Long  wrote:

> From: chun Long 
> 
> Replace commas with semicolons in tcp_westwood_info().
> 
> ---
>  net/ipv4/tcp_westwood.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
> index fed66dc..9775453 100644
> --- a/net/ipv4/tcp_westwood.c
> +++ b/net/ipv4/tcp_westwood.c
> @@ -265,8 +265,8 @@ static size_t tcp_westwood_info(struct sock *sk, u32 ext, 
> int *attr,
>   if (ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
>   info->vegas.tcpv_enabled = 1;
>   info->vegas.tcpv_rttcnt = 0;
> - info->vegas.tcpv_rtt= jiffies_to_usecs(ca->rtt),
> - info->vegas.tcpv_minrtt = jiffies_to_usecs(ca->rtt_min),
> + info->vegas.tcpv_rtt= jiffies_to_usecs(ca->rtt);
> + info->vegas.tcpv_minrtt = jiffies_to_usecs(ca->rtt_min);
>  
>   *attr = INET_DIAG_VEGASINFO;
>   return sizeof(struct tcpvegas_info);

Looks fine.

Acked-by: Stephen Hemminger 


Re: [PATCH v2 2/2] can: spi: hi311x: Add Holt HI-311x CAN driver

2017-03-14 Thread Akshay Bhat

Hi Wolfgang,

On 03/14/2017 08:11 AM, Wolfgang Grandegger wrote:
> ... snip ...
>>> A few other things to check:
>>>
>>> Run "cangen" and monitor the message with "candump -e any,0:0,#FFF".
>>> Then 1) disconnect the cable or 2) short-circuit CAN low and high at the
>>> connector. You should see error messages. After reconnection or removing
>>> the short-circuit (and bus-off recovery) the state should go back to
>>> "active".
>>>
>>
>> With the above sequence, candump reports "ERRORFRAME" with
>> protocol-violation{{}{acknowledge-slot}}, bus-error. On re-connecting
>> the cable the can state goes back to ACTIVE and I see the messages that
>> were in the queue being sent.
> 
> Do you get the ACK error also with berr-reporting off? Would be nice if
> you could show a candump log here.
> 

Below is a log for disconnecting and re-connecting CAN cable scenario:
(Note this is on a 4.1.18 kernel with RT patch)

root@imx6qrom5420b1:~# ip link set can0 up type can bitrate 100
berr-reporting on
root@imx6qrom5420b1:~# candump -e any,0:0,#FFF &
[1] 768
root@imx6qrom5420b1:~# cangen can0
  can0  21C   [8]  35 98 C0 7A 95 03 E6 2A
  can0  6E6   [1]  F2
  can0  5C7   [2]  42 50
  can0  57C   [8]  83 7A E4 0C 03 8B 90 45
  can0  55C   [8]  B9 74 87 52 D8 F4 64 04
  can0  014   [8]  28 CB 96 57 3B 80 67 4F
  can0  6AF   [1]  35
  can0  51E   [8]  B6 C8 6C 1D 3A 87 ED 2E
  can0  527   [8]  D0 8A D3 59 0E 34 40 78
  can0  30C   [2]  6A 12
  can0  145   [8]  CB 6E FF 55 C1 BE C3 22
  can0  5A5   [8]  C4 49 54 68 02 63 F9 35
  can0  0BA   [8]  DA 57 5E 3A CE 88 20 1C
  can0  516   [2]  09 09
  can0  743   [8]  7C 4D 25 47 61 4C 56 3D
  can0  31D   [2]  9C D3
  can0  71E   [8]  53 7C 97 2A 2A F2 9F 56
  can0  52E   [8]  FE DA 2D 51 73 96 DF 79
/disconnect cable
  can0  2088   [8]  00 00 00 19 00 00 28 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{40}{0}}
  can0  2088   [8]  00 00 00 19 00 00 58 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{88}{0}}
  can0  2088   [8]  00 00 00 19 00 00 80 00   ERRORFRAME
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{128}{0}}
  can0  208C   [8]  00 20 00 19 00 00 80 00   ERRORFRAME
controller-problem{tx-error-passive}
protocol-violation{{}{acknowledge-slot}}
bus-error
error-counter-tx-rx{{128}{0}}
write: No buffer space available
root@imx6qrom5420b1:~# ip -s -d link show can0
4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
mode DEFAULT group default qlen 10
link/can  promiscuity 0
can  state ERROR-PASSIVE (berr-counter tx 128 rx 0)
restart-ms 0
  bitrate 100 sample-point 0.750
  tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
  hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
  clock 1600
  re-started bus-errors arbit-lost error-warn error-pass bus-off
  0  6  0  1  1  0
RX: bytes  packets  errors  dropped overrun mcast
0  06   0   0   0
TX: bytes  packets  errors  dropped carrier collsns
10618   0   0   0   0
root@imx6qrom5420b1:~#
/re-connect cable
  can0  169   [8]  35 55 A3 1C 0F 47 2E 5B
  can0  318   [8]  11 AA 27 11 D2 1B CE 34
  can0  577   [8]  A0 A4 EE 50 8D A2 E1 3E
  can0  4ED   [8]  52 96 17 7E 31 FC 7D 7C
  can0  2E7   [8]  92 48 D4 39 05 1E 9F 50
  can0  200   [8]  4A 66 F6 02 1E 71 8E 26
  can0  29A   [8]  49 63 2E 7D C9 77 85 7A
  can0  15A   [7]  3C 0E 65 74 C3 62 80
  can0  011   [1]  D2
  can0  26B   [3]  FC D6 68
  can0  5CE   [8]  6F 02 B5 14 BC 7A D7 02

root@imx6qrom5420b1:~# ip -s -d link show can0
4: can0:  mtu 16 qdisc pfifo_fast state UNKNOWN
mode DEFAULT group default qlen 10
link/can  promiscuity 0
can  state ERROR-ACTIVE (berr-counter tx 117 rx 0)
restart-ms 0
  bitrate 100 sample-point 0.750
  tq 62 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
  hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
  clock 1600
  re-started bus-errors arbit-lost error-warn error-pass bus-off
  0  7  0  1  1  0
RX: bytes  packets  errors  dropped overrun mcast
0  07   0   0   0
TX: bytes  packets  errors  dropped carrier collsns
18129   0   0   0   0


//Reboot the board and test with bus error reporting off

root@imx6qrom5420b1:~# ip link set can0 up type can bitrate 100
berr-reporting off
root@imx6qrom5420b1:~# candump -e any,0:0,#FFF &
[1] 782
root@imx6qrom5420b1:~# cangen can0
  can0  1FA   [3]  C9 FE C2
  can0  3E2   [5]  85 37 03 5B 6F
  can0  289   [8]  A4 F6 BF 4A 3F 70 65 1B
  can0  12D   [8]  B2 72 10 33 AB B4 68 64
 

Re: [4.10+] sctp lockdep trace

2017-03-14 Thread Dave Jones
On Tue, Mar 14, 2017 at 11:35:33AM +0800, Xin Long wrote:
 > >> > [  245.416594]  (
 > >> > [  245.424928] sk_lock-AF_INET
 > >> > [  245.433279] ){+.+.+.}
 > >> > [  245.441889] , at: [] sctp_sendmsg+0x330/0xfe0 
 > >> > [sctp]
 > >> > [  245.450167]
 > >> >stack backtrace:
 > >> > [  245.466352] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 
 > >> > 4.10.0-think+ #7
 > >> > [  245.482894] Call Trace:
 > >> > [  245.491096]  dump_stack+0x68/0x93
 > >> > [  245.499314]  lockdep_rcu_suspicious+0xce/0xf0
 > >> > [  245.507610]  sctp_hash_transport+0x6c0/0x7e0 [sctp]
 > >> > [  245.515972]  ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp]
 > >> > [  245.524366]  sctp_assoc_add_peer+0x290/0x3c0 [sctp]
 > >> > [  245.532736]  sctp_sendmsg+0x8f7/0xfe0 [sctp]
 > >> > [  245.541040]  ? rw_copy_check_uvector+0x8e/0x190
 > >> > [  245.549402]  ? import_iovec+0x3a/0xe0
 > >> > [  245.557679]  inet_sendmsg+0x49/0x1e0
 > >> > [  245.565887]  ___sys_sendmsg+0x2d4/0x300
 > >> > [  245.574092]  ? debug_smp_processor_id+0x17/0x20
 > >> > [  245.582342]  ? debug_smp_processor_id+0x17/0x20
 > >> > [  245.590508]  ? get_lock_stats+0x19/0x50
 > >> > [  245.598641]  __sys_sendmsg+0x54/0x90
 > >> > [  245.606745]  SyS_sendmsg+0x12/0x20
 > >> > [  245.614784]  do_syscall_64+0x66/0x1d0
 > >> > [  245.622828]  entry_SYSCALL64_slow_path+0x25/0x25
 > >> > [  245.630894] RIP: 0033:0x7fe095fcb0f9
 > >> > [  245.638962] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246
 > >> > [  245.647071]  ORIG_RAX: 002e
 > >> > [  245.655186] RAX: ffda RBX: 002e RCX: 
 > >> > 7fe095fcb0f9
 > >> > [  245.663435] RDX: 0080 RSI: 5592de12ddc0 RDI: 
 > >> > 012d
 > >> > [  245.671776] RBP: 7fe0965c8000 R08: c000 R09: 
 > >> > 00dc
 > >> > [  245.680111] R10: 000302120088 R11: 0246 R12: 
 > >> > 0002
 > >> > [  245.688460] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: 
 > >> > 7fe0965c8000
 > >> >
 > >>
 > >> Cc'ing Xin and linux-sctp@ mailing list.
 > >
 > > Seems the same as Andrey Konovalov had reported?
 > >
 > I would think so, this patch has fixed it:
 > 
 > commit 5179b26694c92373275e4933f5d0ff32d585c675
 > Author: Xin Long 
 > Date:   Tue Feb 28 12:41:29 2017 +0800
 > 
 > sctp: call rcu_read_lock before checking for duplicate transport nodes
 > 
 > not sure which commit your tests are based on, Dave, can you
 > check if this fix has been in your test kernel?

Haven't seen this in a while. Let's call it fixed.

Dave


Re: [RFC v1 for accelerated IPoIB 04/25] IB/verb: Add ipoib_options struct and API

2017-03-14 Thread Jason Gunthorpe
On Tue, Mar 14, 2017 at 12:01:09AM -0700, Vishwanathapura, Niranjana wrote:
> On Mon, Mar 13, 2017 at 02:01:36PM -0600, Jason Gunthorpe wrote:
> >>+   /* multicast */
> >>+   int (*attach_mcast)(struct net_device *dev, struct ib_device *hca,
> >>+   union ib_gid *gid, u16 lid, int set_qkey);
> >>+   int (*detach_mcast)(struct net_device *dev, struct ib_device *hca,
> >>+   union ib_gid *gid, u16 lid);
> >
> >It would make more sense to store the struct ib_device pointer in the
> >struct rdma_netdev.
> >
> 
> Agreed that it shouldn't be a function parameter.
> For opa_vnic, I found it convenient to store ib_device pointer in client and
> device private structures as those will be available in most places anyhow.

If vnic uses it too, then let's add the ib_device and port num to
rdma_netdev itself?

Jason

