date:20170808

From: Vincent Bernat 
Date: Tue,  8 Aug 2017 20:23:49 +0200

> If the user hasn't installed any custom rules, don't go through the
> whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4:
> Avoid overhead when no custom FIB rules are installed).
> 
> Using a micro-benchmark module [1], timing ip6_route_output() with
> get_cycles(), with 40,000 routes in the main routing table, before this
> patch:
> 
> min=606 max=12911 count=627 average=1959 95th=4903 90th=3747 50th=1602 
> mad=821
> table=254 avgdepth=21.8 maxdepth=39
> value │ ┊count
>   600 │▒▒▒ 199
>   880 │▒▒▒  43
>  1160 │▒▒▒  48
>  1440 │▒▒▒░░░   43
>  1720 │░░░  59
>  2000 │▒▒▒  50
>  2280 │▒▒░░░26
>  2560 │▒▒░  31
>  2840 │▒▒   28
>  3120 │▒░░  17
>  3400 │▒░░░ 17
>  3680 │░ 8
>  3960 │░░   11
>  4240 │░░6
>  4520 │░░░   6
>  4800 │░░░   9
> 
> After:
> 
> min=544 max=11687 count=627 average=1776 95th=4546 90th=3585 50th=1227 
> mad=565
> table=254 avgdepth=21.8 maxdepth=39
> value │ ┊count
>   540 │201
>   800 │▒63
>  1060 │▒░   68
>  1320 │▒▒▒░░39
>  1580 │▒▒░░ 32
>  1840 │▒▒   32
>  2100 │▒▒░░░34
>  2360 │▒▒░░ 33
>  2620 │▒▒   26
>  2880 │▒░░  22
>  3140 │  9
>  3400 │░ 8
>  3660 │░ 9
>  3920 │░░8
>  4180 │░░░   8
>  4440 │░░░   8
> 
> At the frequency of the host during the bench (~ 3.7 GHz), this is
> about a 100 ns difference on the median value.
> 
> A next step would be to collapse local and main tables, as in
> 0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse).
> 
> [1]: 
> https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c
> 
> Signed-off-by: Vincent Bernat 
> Reviewed-by: Jiri Pirko 

Looks great, applied, thanks!

Re: [PATCH net] net: avoid skb_warn_bad_offload false positives on UFO

From: Willem de Bruijn 
Date: Tue,  8 Aug 2017 14:22:55 -0400

> From: Willem de Bruijn 
> 
> skb_warn_bad_offload triggers a warning when an skb enters the GSO
> stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
> checksum offload set.
> 
> Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
> observed that SKB_GSO_DODGY producers can trigger the check and
> that passing those packets through the GSO handlers will fix it
> up. But, the software UFO handler will set ip_summed to
> CHECKSUM_NONE.
> 
> When __skb_gso_segment is called from the receive path, this
> triggers the warning again.
> 
> Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
> Tx these two are equivalent. On Rx, this better matches the
> skb state (checksum computed), as CHECKSUM_NONE here means no
> checksum computed.
> 
> See also this thread for context:
> http://patchwork.ozlabs.org/patch/799015/
> 
> Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
> Signed-off-by: Willem de Bruijn 

Applied and queued up for -stable, thank you.

Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-08-08 Thread Michael S. Tsirkin

On Wed, Aug 09, 2017 at 10:38:10AM +0800, Jason Wang wrote:
> I think don't think current code can work well if vq.num is grater than
> 2^15. Since all cached idx is u16. This looks like a bug which needs to be
> fixed.

That's a limitation of virtio 1.0.

> > * else if the interval of vq.num is [2^15, 2^16):
> > the logic in the original patch (809ecb9bca6a9) suffices
> > * else (= less than 2^15) (optional):
> > checking only (vring_need_event(vq->last_used_event, new + vq->num, new)
> > would suffice.
> > 
> > Am I missing something, or is this irrelevant?

Could you pls repost the suggestion copying virtio-dev mailing list
(subscriber only, sorry about that, but host/guest ABI discussions
need to copy that list)?

> Looks not, I think this may work. Let me do some test.
> 
> Thanks

I think that at this point it's prudent to add a feature bit
as the virtio spec does not require to never move the event index back.

Re: [PATCH net-next 0/7] rtnetlink: allow to run selected handlers without rtnl

From: Florian Westphal 
Date: Tue,  8 Aug 2017 18:02:29 +0200

> Unfortunately RTNL mutex is a performance issue, e.g. a cpu adding
> an ip address prevents other cpus from seemingly unrelated tasks
> such as dumping tc classifiers.

It is related if somehow the TC entries refer to IP addresses.

Someone could create something like that.

> Initial no-rtnl spots are ip6 fib add/del and netns new/getid.

I could see the netns stuff being ok, but IPv6 route add/del I'm
not so sure of.

Because of things like nexthops etc. there are dependencies on
other configuration things.

That's the whole reason we have this unfortunate global
synchronization point.  If I'm changing some aspect of network
configuration, I know I can atomically test any piece of networking
configuration state.

If I test a network address to make sure I can properly reacy X and
use X as a nexthop in the route I'm adding, it will be there
throughout the entire operation.

There really is a hierachy of these dependencies.  Device state, up
to neighbour table state, up to protocol address state, up to routes,
up to FIB tables, etc. etc. etc.

I'd really like to make this operate more freely, but this is an
extremely delicate area which has been bottled up like this for
two decades so good luck :-)

Re: [PATCH net,stable] qmi_wwan: fix NULL deref on disconnect

From: Bjørn Mork 
Date: Tue,  8 Aug 2017 18:02:11 +0200

> qmi_wwan_disconnect is called twice when disconnecting devices with
> separate control and data interfaces.  The first invocation will set
> the interface data to NULL for both interfaces to flag that the
> disconnect has been handled.  But the matching NULL check was left
> out when qmi_wwan_disconnect was added, resulting in this oops:
> 
>   usb 2-1.4: USB disconnect, device number 4
>   qmi_wwan 2-1.4:1.6 wwp0s29u1u4i6: unregister 'qmi_wwan' 
> usb-:00:1d.0-1.4, WWAN/QMI device
>   BUG: unable to handle kernel NULL pointer dereference at 00e0
>   IP: qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan]
>   PGD 0
>   P4D 0
>   Oops:  [#1] SMP
>   Modules linked in: 
>   CPU: 2 PID: 33 Comm: kworker/2:1 Tainted: GE   
> 4.12.3-nr44-normandy-r1500619820+ #1
>   Hardware name: LENOVO 4291LR7/4291LR7, BIOS CBET4000 4.6-810-g50522254fb 
> 07/21/2017
>   Workqueue: usb_hub_wq hub_event [usbcore]
>   task: 8c882b716040 task.stack: b8e800d84000
>   RIP: 0010:qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan]
>   RSP: 0018:b8e800d87b38 EFLAGS: 00010246
>   RAX:  RBX:  RCX: 
>   RDX: 0001 RSI: 8c8824f3f1d0 RDI: 8c8824ef6400
>   RBP: 8c8824ef6400 R08:  R09: 
>   R10: b8e800d87780 R11: 0011 R12: c07ea0e8
>   R13: 8c8824e2e000 R14: 8c8824e2e098 R15: 
>   FS:  () GS:8c883530() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 00e0 CR3: 000229ca5000 CR4: 000406e0
>   Call Trace:
>? usb_unbind_interface+0x71/0x270 [usbcore]
>? device_release_driver_internal+0x154/0x210
>? qmi_wwan_unbind+0x6d/0xc0 [qmi_wwan]
>? usbnet_disconnect+0x6c/0xf0 [usbnet]
>? qmi_wwan_disconnect+0x87/0xc0 [qmi_wwan]
>? usb_unbind_interface+0x71/0x270 [usbcore]
>? device_release_driver_internal+0x154/0x210
> 
> Reported-and-tested-by: Nathaniel Roach 
> Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support")
> Cc: Daniele Palmas 
> Signed-off-by: Bjørn Mork 

Applied and queued up for -stable, thanks.

Re: [PATCH] hv_set_ifconfig.sh double check before setting ip

From: Eduardo Otubo 
Date: Tue,  8 Aug 2017 15:53:45 +0200

> This patch fixes the behavior of the hv_set_ifconfig script when setting
> the interface ip. Sometimes the interface has already been configured by
> network daemon, in this case hv_set_ifconfig causes "RTNETLINK: file
> exists error"; in order to avoid this error this patch makes sure double
> checks the interface before trying anything.
> 
> Signed-off-by: Eduardo Otubo 

And if the daemon sets the address after you test it but before
you try to set it in the script, what happens?

This is why I hate changes like this.  They don't remove the problem,
they make it smaller.  And smaller in a bad way.  Smaller makes the
problem even more harder to diagnose when it happens.

There is implicitly no synchonization between network configuration
daemons and things people run by hand like this script.

So, caveat emptor.

I'm not applying this, sorry.

Re: [PATCH 33/35] wireless: realtek: rtl8192cu: constify usb_device_id

2017-08-08 Thread Larry Finger


On 08/08/2017 11:05 AM, Arvind Yadav wrote:

usb_device_id are not supposed to change at runtime. All functions
working with usb_device_id provided by  work with
const usb_device_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav 
---


Acked-by: Larry Finger 

Thanks,

Larry


  drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c 
b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c
index 96c923b..e673bc2 100644
--- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c
+++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c
@@ -275,7 +275,7 @@ static struct rtl_hal_cfg rtl92cu_hal_cfg = {
  #define USB_VENDER_ID_REALTEK 0x0bda
  
  /* 2010-10-19 DID_USB_V3.4 */

-static struct usb_device_id rtl8192c_usb_ids[] = {
+static const struct usb_device_id rtl8192c_usb_ids[] = {
  
  	/*=== Realtek demoboard ===*/

/* Default ID */

Re: [PATCH][next] net: phy: mdio-bcm-unimac: fix unsigned wrap-around when decrementing timeout

From: Colin King 
Date: Tue,  8 Aug 2017 10:52:32 +0100

> From: Colin Ian King 
> 
> Change post-decrement compare to pre-decrement to avoid an
> unsigned integer wrap-around on timeout. This leads to the following
> !timeout check to never to be true so -ETIMEDOUT is never returned.
> 
> Detected by CoverityScan, CID#1452623 ("Logically dead code")
> 
> Fixes: 69a60b0579a4 ("net: phy: mdio-bcm-unimac: factor busy polling loop")
> Signed-off-by: Colin Ian King 

Applied.

Re: [PATCH net] ppp: fix xmit recursion detection on ppp channels

From: Guillaume Nault 
Date: Tue, 8 Aug 2017 11:43:24 +0200

> Commit e5dadc65f9e0 ("ppp: Fix false xmit recursion detect with two ppp
> devices") dropped the xmit_recursion counter incrementation in
> ppp_channel_push() and relied on ppp_xmit_process() for this task.
> But __ppp_channel_push() can also send packets directly (using the
> .start_xmit() channel callback), in which case the xmit_recursion
> counter isn't incremented anymore. If such packets get routed back to
> the parent ppp unit, ppp_xmit_process() won't notice the recursion and
> will call ppp_channel_push() on the same channel, effectively creating
> the deadlock situation that the xmit_recursion mechanism was supposed
> to prevent.
> 
> This patch re-introduces the xmit_recursion counter incrementation in
> ppp_channel_push(). Since the xmit_recursion variable is now part of
> the parent ppp unit, incrementation is skipped if the channel doesn't
> have any. This is fine because only packets routed through the parent
> unit may enter the channel recursively.
> 
> Finally, we have to ensure that pch->ppp is not going to be modified
> while executing ppp_channel_push(). Instead of taking this lock only
> while calling ppp_xmit_process(), we now have to hold it for the full
> ppp_channel_push() execution. This respects the ppp locks ordering
> which requires locking ->upl before ->downl.
> 
> Fixes: e5dadc65f9e0 ("ppp: Fix false xmit recursion detect with two ppp 
> devices")
> Signed-off-by: Guillaume Nault 

Applied, thank you.

Re: [PATCH] net: phy: Use tab for indentation in Kconfig

From: Michal Simek 
Date: Tue,  8 Aug 2017 11:32:25 +0200

> Using tabs instead of space for indentaion
> 
> Signed-off-by: Michal Simek 

This really isn't appropriate for the 'net' tree, it doesn't fix
anything it just makes the spacing consistent.

Please respin this patch against net-next, thank you.

Re: [PATCH net] rds: Reintroduce statistics counting

From: Håkon Bugge 
Date: Tue,  8 Aug 2017 11:13:32 +0200

> In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection
> while senders are present"), refilling the receive queue was removed
> from rds_ib_recv(), along with the increment of
> s_ib_rx_refill_from_thread.
> 
> Commit 73ce4317bf98 ("RDS: make sure we post recv buffers")
> re-introduces filling the receive queue from rds_ib_recv(), but does
> not add the statistics counter. rds_ib_recv() was later renamed to
> rds_ib_recv_path().
> 
> This commit reintroduces the statistics counting of
> s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq.
> 
> Signed-off-by: Håkon Bugge 
> Reviewed-by: Knut Omang 
> Reviewed-by: Wei Lin Guay 

Applied.

Re: [PATCH net] tcp: fastopen: tcp_connect() must refresh the route

From: Eric Dumazet 
Date: Tue, 08 Aug 2017 01:41:58 -0700

> From: Eric Dumazet 
> 
> With new TCP_FASTOPEN_CONNECT socket option, there is a possibility
> to call tcp_connect() while socket sk_dst_cache is either NULL
> or invalid.
> 
>  +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>  +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>  +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
>  +0 connect(4, ..., ...) = 0
> 
> << sk->sk_dst_cache becomes obsolete, or even set to NULL >>
> 
>  +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> 
> 
> We need to refresh the route otherwise bad things can happen,
> especially when syzkaller is running on the host :/
> 
> Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable.

[PATCH iproute2 master] examples/bpf: update list of examples

2017-08-08 Thread Alexander Alemayhu

Remove deleted examples and add the new map in map example.

Signed-off-by: Alexander Alemayhu 
---
 examples/bpf/README | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/examples/bpf/README b/examples/bpf/README
index 4247257850eb..1bbdda3f8dc1 100644
--- a/examples/bpf/README
+++ b/examples/bpf/README
@@ -1,13 +1,8 @@
 eBPF toy code examples (running in kernel) to familiarize yourself
 with syntax and features:
 
- - bpf_prog.c  -> Classifier examples with using maps
  - bpf_shared.c-> Ingress/egress map sharing example
  - bpf_tailcall.c  -> Using tail call chains
  - bpf_cyclic.c-> Simple cycle as tail calls
  - bpf_graft.c -> Demo on altering runtime behaviour
-
-User space code example:
-
- - bpf_agent.c -> Counterpart to bpf_prog.c for user
-   space to transfer/read out map data
+ - bpf_map_in_map.c -> Using map in map example
-- 
2.13.3

Re: [PATCHv3 net] net: sched: set xt_tgchk_param par.net properly in ipt_init_target

From: Xin Long 
Date: Tue,  8 Aug 2017 15:25:25 +0800

> Now xt_tgchk_param par in ipt_init_target is a local varibale,
> par.net is not initialized there. Later when xt_check_target
> calls target's checkentry in which it may access par.net, it
> would cause kernel panic.
> 
> Jaroslav found this panic when running:
> 
>   # ip link add TestIface type dummy
>   # tc qd add dev TestIface ingress handle :
>   # tc filter add dev TestIface parent : u32 match u32 0 0 \
> action xt -j CONNMARK --set-mark 4
> 
> This patch is to pass net param into ipt_init_target and set
> par.net with it properly in there.
> 
> v1->v2:
>   As Wang Cong pointed, I missed ipt_net_id != xt_net_id, so fix
>   it by also passing net_id to __tcf_ipt_init.
> v2->v3:
>   Missed the fixes tag, so add it.
> 
> Fixes: ecb2421b5ddf ("netfilter: add and use nf_ct_netns_get/put")
> Reported-by: Jaroslav Aster 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.

Re: [PATCH net-next] cxgb4: Clear On FLASH config file after a FW upgrade

From: Ganesh Goudar 
Date: Tue,  8 Aug 2017 11:20:52 +0530

> From: Arjun Vynipadath 
> 
> Because Firmware and the Firmware Configuration File need to be
> in sync; clear out any On-FLASH Firmware Configuration File when new
> Firmware is loaded.  This will avoid difficult to diagnose and fix
> problems with a mis-matched Firmware Configuration File which prevents the
> adapter from being initialized.
> 
> Original work by: Casey Leedom 
> Signed-off-by: Arjun Vynipadath 
> Signed-off-by: Ganesh Goudar 

Applied, thanks.

Re: [PATCH RFC v2 3/5] samples/bpf: Fix inline asm issues building samples on arm64

From: Joel Fernandes 
Date: Mon, 7 Aug 2017 18:20:49 -0700

> On Mon, Aug 7, 2017 at 11:28 AM, David Miller  wrote:
>> The amount of hellish hacks we are adding to deal with this is getting
>> way out of control.
> 
> I agree with you that hellish hacks are being added which is why it
> keeps breaking. I think one of the things my series does is to add
> back inclusion of asm headers that were previously removed (that is
> the worst hellish hack in my opinion that existing in mainline). So in
> that respect my patch is an improvement and makes it possible to build
> for arm64 platforms (which is currently broken in mainline).

Yeah that is a problem.

Perhaps another avenue of attack is to separate "type" header files from
stuff that has functiond declarations and inline assembler code.

Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported

On Tue, Aug 08, 2017 at 09:22:39PM -0500, Bjorn Helgaas wrote:
> On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote:
> > When bit4 is set in the PCIe Device Control register, it indicates
> > whether the device is permitted to use relaxed ordering.
> > On some platforms using relaxed ordering can have performance issues or
> > due to erratum can cause data-corruption. In such cases devices must avoid
> > using relaxed ordering.
> > 
> > This patch checks if there is any node in the hierarchy that indicates that
> > using relaxed ordering is not safe. 
...

> > +EXPORT_SYMBOL(pcie_relaxed_ordering_supported);
> 
> This is misnamed.  This doesn't tell us anything about whether the
> device *supports* relaxed ordering.  It only tells us whether the
> device is *permitted* to use it.
> 
> When a device initiates a transaction, the hardware should set the RO
> bit in the TLP with logic something like this:
> 
>   RO =  &&
> &&
>
> 
> The issue you're fixing is that some Completers don't handle RO
> correctly.  The determining factor is not the Requester, but the
> Completer (for this series, a Root Port).  So I think this should be
> something like:
> 
>   int pcie_relaxed_ordering_broken(struct pci_dev *completer)
>   {
> if (!completer)
>   return 0;
> 
> return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
>   }
> 
> and the caller should do something like this:
> 
>  if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev)))
>adapter->flags |= ROOT_NO_RELAXED_ORDERING;
> 
> That way it's obvious where the issue is, and it's obvious that the
> answer might be different for peer-to-peer transactions than it is for
> transactions to the root port, i.e., to coherent memory.

After looking at the driver, I wonder if it would be simpler like
this:

  int pcie_relaxed_ordering_enabled(struct pci_dev *dev)
  {
u16 ctl;

pcie_capability_read_word(dev, PCI_EXP_DEVCTL, );
return ctl & PCI_EXP_DEVCTL_RELAX_EN;
  }
  EXPORT_SYMBOL(pcie_relaxed_ordering_enabled);

  static void pci_configure_relaxed_ordering(struct pci_dev *dev)
  {
struct pci_dev *root;

if (dev->is_virtfn)
  return;  /* PCI_EXP_DEVCTL_RELAX_EN is RsvdP in VFs */

if (!pcie_relaxed_ordering_enabled(dev))
  return;

/*
 * For now, we only deal with Relaxed Ordering issues with Root
 * Ports.  Peer-to-peer DMA is another can of worms.
 */
root = pci_find_pcie_root_port(dev);
if (!root)
  return;

if (root->relaxed_ordering_broken)
  pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
 PCI_EXP_DEVCTL_RELAX_EN);
  }

This doesn't check every intervening switch, but I don't think we know
about any issues except with root ports.

And the driver could do:

  if (!pcie_relaxed_ordering_enabled(pdev))
adapter->flags |= ROOT_NO_RELAXED_ORDERING;

The driver code wouldn't show anything about coherent memory vs.
peer-to-peer, but we really don't have a clue about how to handle that
yet anyway.

I guess this is back to exactly what you proposed, except that I
changed the name of pcie_relaxed_ordering_supported() to
pcie_relaxed_ordering_enabled(), which I think is slightly more
specific from the device's point of view.

Bjorn

Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

On Wed, Aug 09, 2017 at 01:40:01AM +, Casey Leedom wrote:
> | From: Bjorn Helgaas 
> | Sent: Tuesday, August 8, 2017 4:22 PM
> | 
> | This needs to include a link to the Intel spec
> | 
> (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
> | sec 3.9.1).
> 
>   In the commit message or as a comment?  Regardless, I agree.  It's always
> nice to be able to go back and see what the official documentation says.
> However, that said, links on the internet are ... fragile as time goes by,
> so we might want to simply quote section 3.9.1 in the commit message since
> it's relatively short:
> 
> 3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
>   and Toward MMIO Regions (P2P)
> 
> In order to maximize performance for PCIe devices in the processors
> listed in Table 3-6 below, the soft- ware should determine whether the
> accesses are toward coherent memory (system memory) or toward MMIO
> regions (P2P access to other devices). If the access is toward MMIO
> region, then software can command HW to set the RO bit in the TLP
> header, as this would allow hardware to achieve maximum throughput for
> these types of accesses. For accesses toward coherent memory, software
> can command HW to clear the RO bit in the TLP header (no RO), as this
> would allow hardware to achieve maximum throughput for these types of
> accesses.
> 
> Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
>PCIe Performance
> 
> ProcessorCPU RP Device IDs
> 
> Intel Xeon processors based on   6F01H-6F0EH
> Broadwell microarchitecture
> 
> Intel Xeon processors based on   2F01H-2F0EH
> Haswell microarchitecture

Agreed, links are prone to being broken.  I would include in the
changelog the complete title and order number, along with the link as
a footnote.  Wouldn't hurt to quote the section too, since it's short.

> | It should also include a pointer to the AMD erratum, if available, or
> | at least some reference to how we know it doesn't obey the rules.
> 
>   Getting an ACK from AMD seems like a forlorn cause at this point.  My
> contact was Bob Shaw  and he stopped responding to me
> messages almost a year ago saying that all of AMD's energies were being
> redirected towards upcoming x86 products (likely Ryzen as we now know).  As
> far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
> SoC.
> 
>   On the specific issue, I can certainly write up somthing even more
> extensive than I wrote up for the comment in drivers/pci/quirks.c.  Please
> review the comment I wrote up and tell me if you'd like something even more
> detailed -- I'm usually acused of writing comments which are too long, so
> this would be a new one on me ... :-)

If you have any bug reports with info about how you debugged it and
concluded that Seattle is broken, you could include a link (probably
in the changelog).  But if there isn't anything, there isn't anything.

I might reorganize those patches as:

  1) Add a PCI_DEV_FLAGS_RELAXED_ORDERING_BROKEN flag, the quirk that
  sets it, and the current patch [2/4] that uses it.

  2) Add the Intel DECLARE_PCI_FIXUP_CLASS_EARLY()s with the Intel
  details.

  3) Add the AMD DECLARE_PCI_FIXUP_CLASS_EARLY()s with the AMD
  details.

[PATCH 1/3] security: keys: Replace time_t/timespec with time64_t

The 'struct key' will use 'time_t' which we try to remove in the
kernel, since 'time_t' is not year 2038 safe on 32bit systems.
Also the 'struct keyring_search_context' will use 'timespec' type
to record current time, which is also not year 2038 safe on 32bit
systems.

Thus this patch replaces 'time_t' with 'time64_t' which is year 2038
safe for 'struct key', and replace 'timespec' with 'time64_t' for the
'struct keyring_search_context', since we only look at the the seconds
part of 'timespec' variable. Moreover we also change the codes where
using the 'time_t' and 'timespec', and we can get current time by
ktime_get_real_seconds() instead of current_kernel_time(), and use
'TIME64_MAX' macro to initialize the 'time64_t' type variable.

Especially in proc.c file, we have replaced 'unsigned long' and 'timespec'
type with 'u64' and 'time64_t' type to save the timeout value, which means
user will get one 'u64' type timeout value by issuing proc_keys_show()
function.

Signed-off-by: Baolin Wang 
---
 include/linux/key.h  |7 ---
 security/keys/gc.c   |   20 ++--
 security/keys/internal.h |8 
 security/keys/key.c  |   19 ++-
 security/keys/keyring.c  |   18 +-
 security/keys/permission.c   |3 +--
 security/keys/proc.c |   20 ++--
 security/keys/process_keys.c |2 +-
 8 files changed, 45 insertions(+), 52 deletions(-)

diff --git a/include/linux/key.h b/include/linux/key.h
index 0441141..6d10f84 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef __KERNEL__
 #include 
@@ -157,10 +158,10 @@ struct key {
struct key_user *user;  /* owner of this key */
void*security;  /* security data for this key */
union {
-   time_t  expiry; /* time at which key expires 
(or 0) */
-   time_t  revoked_at; /* time at which key was 
revoked */
+   time64_texpiry; /* time at which key expires 
(or 0) */
+   time64_trevoked_at; /* time at which key was 
revoked */
};
-   time_t  last_used_at;   /* last time used for LRU 
keyring discard */
+   time64_tlast_used_at;   /* last time used for LRU 
keyring discard */
kuid_t  uid;
kgid_t  gid;
key_perm_t  perm;   /* access permissions */
diff --git a/security/keys/gc.c b/security/keys/gc.c
index 87cb260..c99700e 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -32,7 +32,7 @@
 static void key_gc_timer_func(unsigned long);
 static DEFINE_TIMER(key_gc_timer, key_gc_timer_func, 0, 0);
 
-static time_t key_gc_next_run = LONG_MAX;
+static time64_t key_gc_next_run = TIME64_MAX;
 static struct key_type *key_gc_dead_keytype;
 
 static unsigned long key_gc_flags;
@@ -53,12 +53,12 @@ struct key_type key_type_dead = {
  * Schedule a garbage collection run.
  * - time precision isn't particularly important
  */
-void key_schedule_gc(time_t gc_at)
+void key_schedule_gc(time64_t gc_at)
 {
unsigned long expires;
-   time_t now = current_kernel_time().tv_sec;
+   time64_t now = ktime_get_real_seconds();
 
-   kenter("%ld", gc_at - now);
+   kenter("%lld", gc_at - now);
 
if (gc_at <= now || test_bit(KEY_GC_REAP_KEYTYPE, _gc_flags)) {
kdebug("IMMEDIATE");
@@ -87,7 +87,7 @@ void key_schedule_gc_links(void)
 static void key_gc_timer_func(unsigned long data)
 {
kenter("");
-   key_gc_next_run = LONG_MAX;
+   key_gc_next_run = TIME64_MAX;
key_schedule_gc_links();
 }
 
@@ -184,11 +184,11 @@ static void key_garbage_collector(struct work_struct 
*work)
 
struct rb_node *cursor;
struct key *key;
-   time_t new_timer, limit;
+   time64_t new_timer, limit;
 
kenter("[%lx,%x]", key_gc_flags, gc_state);
 
-   limit = current_kernel_time().tv_sec;
+   limit = ktime_get_real_seconds();
if (limit > key_gc_delay)
limit -= key_gc_delay;
else
@@ -204,7 +204,7 @@ static void key_garbage_collector(struct work_struct *work)
gc_state |= KEY_GC_REAPING_DEAD_1;
kdebug("new pass %x", gc_state);
 
-   new_timer = LONG_MAX;
+   new_timer = TIME64_MAX;
 
/* As only this function is permitted to remove things from the key
 * serial tree, if cursor is non-NULL then it will always point to a
@@ -235,7 +235,7 @@ static void key_garbage_collector(struct work_struct *work)
 
if (gc_state & KEY_GC_SET_TIMER) {
if (key->expiry > limit && key->expiry < new_timer) {
-   kdebug("will expire %x in %ld",
+   kdebug("will expire %x in

[PATCH 0/3] Fix y2038 issues for security/keys subsystem

Since 'time_t', 'timeval' and 'timespec' types are not year 2038 safe on
32 bits system, this patchset tries to fix this issues for security/keys
subsystem and net/rxrpc subsystem which is connected with security/keys
subsystem.

Baolin Wang (3):
  security: keys: Replace time_t/timespec with time64_t
  security: keys: Replace time_t with time64_t for struct
key_preparsed_payload
  net: rxrpc: Replace time_t type with time64_t type

 include/keys/rxrpc-type.h|   21 +
 include/linux/key-type.h |2 +-
 include/linux/key.h  |7 ---
 net/rxrpc/ar-internal.h  |2 +-
 net/rxrpc/key.c  |   22 ++
 net/rxrpc/rxkad.c|   14 +++---
 security/keys/gc.c   |   20 ++--
 security/keys/internal.h |8 
 security/keys/key.c  |   27 ++-
 security/keys/keyring.c  |   18 +-
 security/keys/permission.c   |3 +--
 security/keys/proc.c |   20 ++--
 security/keys/process_keys.c |2 +-
 13 files changed, 93 insertions(+), 73 deletions(-)

-- 
1.7.9.5

[PATCH 2/3] security: keys: Replace time_t with time64_t for struct key_preparsed_payload

The 'struct key_preparsed_payload' will use 'time_t' which we will
try to remove in the kernel, since 'time_t' is not year 2038 safe on
32bits systems.

Thus this patch replaces 'time_t' with 'time64_t' which is year 2038
safe on 32 bits system for 'struct key_preparsed_payload', moreover
we should use the 'TIME64_MAX' macro to initialize the 'time64_t'
type variable.

Signed-off-by: Baolin Wang 
---
 include/linux/key-type.h |2 +-
 security/keys/key.c  |8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/key-type.h b/include/linux/key-type.h
index 8496cf6..4beb006 100644
--- a/include/linux/key-type.h
+++ b/include/linux/key-type.h
@@ -44,7 +44,7 @@ struct key_preparsed_payload {
const void  *data;  /* Raw data */
size_t  datalen;/* Raw datalen */
size_t  quotalen;   /* Quota length for proposed payload */
-   time_t  expiry; /* Expiry time of key */
+   time64_texpiry; /* Expiry time of key */
 };
 
 typedef int (*request_key_actor_t)(struct key_construction *key,
diff --git a/security/keys/key.c b/security/keys/key.c
index 291a67c..d5c8941 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -446,7 +446,7 @@ static int __key_instantiate_and_link(struct key *key,
if (authkey)
key_revoke(authkey);
 
-   if (prep->expiry != TIME_T_MAX) {
+   if (prep->expiry != TIME64_MAX) {
key->expiry = prep->expiry;
key_schedule_gc(prep->expiry + key_gc_delay);
}
@@ -492,7 +492,7 @@ int key_instantiate_and_link(struct key *key,
prep.data = data;
prep.datalen = datalen;
prep.quotalen = key->type->def_datalen;
-   prep.expiry = TIME_T_MAX;
+   prep.expiry = TIME64_MAX;
if (key->type->preparse) {
ret = key->type->preparse();
if (ret < 0)
@@ -834,7 +834,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
prep.data = payload;
prep.datalen = plen;
prep.quotalen = index_key.type->def_datalen;
-   prep.expiry = TIME_T_MAX;
+   prep.expiry = TIME64_MAX;
if (index_key.type->preparse) {
ret = index_key.type->preparse();
if (ret < 0) {
@@ -968,7 +968,7 @@ int key_update(key_ref_t key_ref, const void *payload, 
size_t plen)
prep.data = payload;
prep.datalen = plen;
prep.quotalen = key->type->def_datalen;
-   prep.expiry = TIME_T_MAX;
+   prep.expiry = TIME64_MAX;
if (key->type->preparse) {
ret = key->type->preparse();
if (ret < 0)
-- 
1.7.9.5

[PATCH 3/3] net: rxrpc: Replace time_t type with time64_t type

Since the 'expiry' variable of 'struct key_preparsed_payload' has been
changed to 'time64_t' type, which is year 2038 safe on 32bits system.

In net/rxrpc subsystem, we need convert 'u32' type to 'time64_t' type
when copying ticket expires time to 'prep->expiry', then this patch
introduces two helper functions to help convert 'u32' to 'time64_t'
type.

This patch also uses ktime_get_real_seconds() to get current time instead
of get_seconds() which is not year 2038 safe on 32bits system.

Signed-off-by: Baolin Wang 
---
 include/keys/rxrpc-type.h |   21 +
 net/rxrpc/ar-internal.h   |2 +-
 net/rxrpc/key.c   |   22 ++
 net/rxrpc/rxkad.c |   14 +++---
 4 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h
index 5de0673..76421e2 100644
--- a/include/keys/rxrpc-type.h
+++ b/include/keys/rxrpc-type.h
@@ -127,4 +127,25 @@ struct rxrpc_key_data_v1 {
 #define AFSTOKEN_K5_ADDRESSES_MAX  16  /* max K5 addresses */
 #define AFSTOKEN_K5_AUTHDATA_MAX   16  /* max K5 pieces of auth data */
 
+/*
+ * truncate a time64_t to the range from 1970 to 2106 as
+ * in the network protocol
+ */
+static inline u32 rxrpc_time64_to_u32(time64_t time)
+{
+   if (time < 0)
+   return 0;
+
+   if (time > UINT_MAX)
+   return UINT_MAX;
+
+   return (u32)time;
+}
+
+/* extend u32 back to time64_t using the same 1970-2106 range */
+static inline time64_t rxrpc_u32_to_time64(u32 time)
+{
+   return (time64_t)time;
+}
+
 #endif /* _KEYS_RXRPC_TYPE_H */
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 69b9733..3c11443 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -894,7 +894,7 @@ void rxrpc_new_incoming_connection(struct rxrpc_sock *,
 
 int rxrpc_request_key(struct rxrpc_sock *, char __user *, int);
 int rxrpc_server_keyring(struct rxrpc_sock *, char __user *, int);
-int rxrpc_get_server_data_key(struct rxrpc_connection *, const void *, time_t,
+int rxrpc_get_server_data_key(struct rxrpc_connection *, const void *, 
time64_t,
  u32);
 
 /*
diff --git a/net/rxrpc/key.c b/net/rxrpc/key.c
index 5436922..e2d3661 100644
--- a/net/rxrpc/key.c
+++ b/net/rxrpc/key.c
@@ -92,6 +92,7 @@ static int rxrpc_preparse_xdr_rxkad(struct 
key_preparsed_payload *prep,
const __be32 *xdr, unsigned int toklen)
 {
struct rxrpc_key_token *token, **pptoken;
+   time64_t expiry;
size_t plen;
u32 tktlen;
 
@@ -158,8 +159,9 @@ static int rxrpc_preparse_xdr_rxkad(struct 
key_preparsed_payload *prep,
 pptoken = &(*pptoken)->next)
continue;
*pptoken = token;
-   if (token->kad->expiry < prep->expiry)
-   prep->expiry = token->kad->expiry;
+   expiry = rxrpc_u32_to_time64(token->kad->expiry);
+   if (expiry < prep->expiry)
+   prep->expiry = expiry;
 
_leave(" = 0");
return 0;
@@ -433,6 +435,7 @@ static int rxrpc_preparse_xdr_rxk5(struct 
key_preparsed_payload *prep,
struct rxrpc_key_token *token, **pptoken;
struct rxk5_key *rxk5;
const __be32 *end_xdr = xdr + (toklen >> 2);
+   time64_t expiry;
int ret;
 
_enter(",{%x,%x,%x,%x},%u",
@@ -533,8 +536,9 @@ static int rxrpc_preparse_xdr_rxk5(struct 
key_preparsed_payload *prep,
 pptoken = &(*pptoken)->next)
continue;
*pptoken = token;
-   if (token->kad->expiry < prep->expiry)
-   prep->expiry = token->kad->expiry;
+   expiry = rxrpc_u32_to_time64(token->kad->expiry);
+   if (expiry < prep->expiry)
+   prep->expiry = expiry;
 
_leave(" = 0");
return 0;
@@ -691,6 +695,7 @@ static int rxrpc_preparse(struct key_preparsed_payload 
*prep)
 {
const struct rxrpc_key_data_v1 *v1;
struct rxrpc_key_token *token, **pp;
+   time64_t expiry;
size_t plen;
u32 kver;
int ret;
@@ -777,8 +782,9 @@ static int rxrpc_preparse(struct key_preparsed_payload 
*prep)
while (*pp)
pp = &(*pp)->next;
*pp = token;
-   if (token->kad->expiry < prep->expiry)
-   prep->expiry = token->kad->expiry;
+   expiry = rxrpc_u32_to_time64(token->kad->expiry);
+   if (expiry < prep->expiry)
+   prep->expiry = expiry;
token = NULL;
ret = 0;
 
@@ -955,7 +961,7 @@ int rxrpc_server_keyring(struct rxrpc_sock *rx, char __user 
*optval,
  */
 int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
  const void *session_key,
- time_t expiry,
+ time64_t expiry,
  u32 kvno)
 {
const struct cred *cred = current_cred();
@@ -982,7 +988,7 @@ int rxrpc_get_server_data_key(struct

Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support

2017-08-08 Thread Ben Pfaff

To be clear, the OVS implementation is a placeholder.  It will get
replaced by whatever netdev implements, and that's OK.  I didn't focus
on making it perfect because I knew that.  Instead, I just made sure it
was good enough for an internal OVS implementation that doesn't fix any
ABI or API.  OVS can even change the user-visible action names, as long
as we do that soon (and encap/decap versus push/pop doesn't matter to
me).

The considerations for netdev are different and more permanent.

On Wed, Aug 09, 2017 at 02:05:12AM +, Yang, Yi Y wrote:
> Hi, Jiri
> 
> Thank you for your comments.
> 
> __be32 c[4] is the name Ben Pfaff suggested, the original name is c1, c2, c3, 
> c4, they are context data, so c seems ok, too :-)
> 
> OVS has merged it and has the same name, maybe the better way is adding 
> comment /* Context data */ after it.
> 
> For MD type 2, struct ovs_key_nsh is very difficult to cover it, so far we 
> don't know how to support MD type 2 better, Geneve defined 64 
> tun_metadata0-63 to handle this, those keys are parts of struct flow_tnl, the 
> highest possibility is to reuse those keys.
> 
> So for future MD type 2, we will have two parts of keys, one is from struct 
> ovs_key_nsh, another is from struct flow_tnl, this won't break the old uAPI.
> 
> "#define OVS_ENCAP_NSH_MAX_MD_LEN 16" is changed per Ben's comment from 256, 
> Ben thinks 256 is too big but we only support MD type 1 now. We still have 
> ways to extend it, for example:
> 
> struct ovs_action_encap_nsh * oaen = (struct ovs_action_encap_nsh *) malloc 
> (sizeof(struct ovs_action_encap_nsh) + ANY_SIZE);
> 
> nl_msg_put_unspec(actions, OVS_ACTION_ATTR_ENCAP_NSH,
>   oaen, sizeof(struct ovs_action_encap_nsh) + 
> ANY_SIZE);
> 
> In addition, we also need to consider, OVS userspace code must be consistent 
> with here, so keeping it intact will be better, we have way to support 
> dynamically extension when we add MD type 2 support.
> 
> About action name, unfortunately, userspace data plane has named them as 
> encap_nsh & decap_nsh, Jan, what do you think about Jiri's suggestion?
> 
> But from my understanding, encap_* & decap_* are better because they 
> corresponding to generic encap & decap actions, in addition, encap semantics 
> are different from push, encap just pushed an empty header with default 
> values, users must use set_field to set the content of the header.
> 
> Again, OVS userspace code must be consistent with here, so keeping it intact 
> will be better because OVS userspace code was there.
> 
> 
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
> Behalf Of Jiri Benc
> Sent: Tuesday, August 8, 2017 10:28 PM
> To: Yang, Yi Y 
> Cc: netdev@vger.kernel.org; d...@openvswitch.org; da...@davemloft.net
> Subject: Re: [PATCH net-next] openvswitch: add NSH support
> 
> On Tue,  8 Aug 2017 12:59:40 +0800, Yi Yang wrote:
> > +struct ovs_key_nsh {
> > +   __u8 flags;
> > +   __u8 mdtype;
> > +   __u8 np;
> > +   __u8 pad;
> > +   __be32 path_hdr;
> > +   __be32 c[4];
> 
> "c" is a very poor name. Please rename it to something that expresses what 
> this field contains.
> 
> Also, this looks like MD type 1 only. How are those fields going to work with 
> MD type 2? I don't think MD type 2 implementation is necessary in this patch 
> but I'd like to know how this is going to work - it's uAPI and thus set in 
> stone once this is merged. The uAPI needs to be designed with future use in 
> mind.
> 
> > +#define OVS_ENCAP_NSH_MAX_MD_LEN 16
> > +/*
> > + * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
> > + * @flags: NSH header flags.
> > + * @mdtype: NSH metadata type.
> > + * @mdlen: Length of NSH metadata in bytes.
> > + * @np: NSH next_protocol: Inner packet type.
> > + * @path_hdr: NSH service path id and service index.
> > + * @metadata: NSH metadata for MD type 1 or 2  */ struct 
> > +ovs_action_encap_nsh {
> > +   __u8 flags;
> > +   __u8 mdtype;
> > +   __u8 mdlen;
> > +   __u8 np;
> > +   __be32 path_hdr;
> > +   __u8 metadata[OVS_ENCAP_NSH_MAX_MD_LEN];
> 
> This is wrong. The metadata size is set to a fixed size by this and cannot be 
> ever extended, or at least not easily. Netlink has attributes for exactly 
> these cases and that's what needs to be used here.
> 
> > @@ -835,6 +866,8 @@ enum ovs_action_attr {
> > OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
> > OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
> > OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
> > +   OVS_ACTION_ATTR_ENCAP_NSH,/* struct ovs_action_encap_nsh. */
> > +   OVS_ACTION_ATTR_DECAP_NSH,/* No argument. */
> 
> Use "push" and "pop", not "encap" and "decap" to be consistent with the 
> naming in the rest of the file. We use encap and decap for tunnel operations. 
> This code does not use lwtunnels, thus push and pop is more appropriate.
> 
>  Jiri
>

Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-08-08 Thread Jason Wang




On 2017年07月30日 14:26, K. Den wrote:

On Wed, 2017-07-26 at 19:08 +0300, Michael S. Tsirkin wrote:

On Wed, Jul 26, 2017 at 09:37:15PM +0800, Jason Wang wrote:


On 2017年07月26日 21:18, Jason Wang wrote:


On 2017年07月26日 20:57, Michael S. Tsirkin wrote:

On Wed, Jul 26, 2017 at 04:03:17PM +0800, Jason Wang wrote:

This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92. Since it
was reported to break vhost_net. We want to cache used event and use
it to check for notification. We try to valid cached used event by
checking whether or not it was ahead of new, but this is not correct
all the time, it could be stale and there's no way to know about this.

Signed-off-by: Jason Wang

Could you supply a bit more data here please?  How does it get stale?
What does guest need to do to make it stale?  This will be helpful if
anyone wants to bring it back, or if we want to extend the protocol.


The problem we don't know whether or not guest has published a new used
event. The check vring_need_event(vq->last_used_event, new + vq->num,
new) is not sufficient to check for this.

Thanks

More notes, the previous assumption is that we don't move used event back,
but this could happen in fact if idx is wrapper around.

You mean if the 16 bit index wraps around after 64K entries.
Makes sense.


Will repost and add
this into commit log.

Thanks

Hi,


Hi, sorry for the late reply, was on vacation last week.



I am just curious but I have got a question:
AFAIU, if you wanted to keep the caching mechanism alive in the code base,
the following two changes could clear off the issue, or not?:
(1) Always fetch the latest event value from guest when signalled_used event is
invalid, which includes last_used_idx wraps-around case. Otherwise we might need
changes which would complicate too much the logic to properly decide whether or
not to skip signalling in the next vhost_notify round.
(2) On top of that, split the signal-postponing logic to three cases like:
* if the interval of vq.num is [2^16, UINT_MAX]:
any cached event is in should-postpone-signalling interval, so paradoxically
must always do signalling.


I think don't think current code can work well if vq.num is grater than 
2^15. Since all cached idx is u16. This looks like a bug which needs to 
be fixed.



* else if the interval of vq.num is [2^15, 2^16):
the logic in the original patch (809ecb9bca6a9) suffices
* else (= less than 2^15) (optional):
checking only (vring_need_event(vq->last_used_event, new + vq->num, new)
would suffice.

Am I missing something, or is this irrelevant?


Looks not, I think this may work. Let me do some test.

Thanks


I would appreciate if you could elaborate a bit more how the situation where
event idx wraps around and moves back would make trouble.

Thanks.

[PATCH net-next] liquidio: napi cleanup

2017-08-08 Thread Felix Manlunas

From: Intiyaz Basha 

Disable napi when interface is going down.
Delete napi when destroying the interface.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 15 +++
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 14 ++
 2 files changed, 29 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 3ec0dd9..cbd6287 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -1736,6 +1736,10 @@ static void liquidio_destroy_nic_device(struct 
octeon_device *oct, int ifidx)
oct->droq[0]->ops.poll_mode = 0;
}
 
+   /* Delete NAPI */
+   list_for_each_entry_safe(napi, n, >napi_list, dev_list)
+   netif_napi_del(napi);
+
if (atomic_read(>ifstate) & LIO_IFSTATE_REGISTERED)
unregister_netdev(netdev);
 
@@ -2770,6 +2774,17 @@ static int liquidio_stop(struct net_device *netdev)
 {
struct lio *lio = GET_LIO(netdev);
struct octeon_device *oct = lio->oct_dev;
+   struct napi_struct *napi, *n;
+
+   if (oct->props[lio->ifidx].napi_enabled) {
+   list_for_each_entry_safe(napi, n, >napi_list, dev_list)
+   napi_disable(napi);
+
+   oct->props[lio->ifidx].napi_enabled = 0;
+
+   if (OCTEON_CN23XX_PF(oct))
+   oct->droq[0]->ops.poll_mode = 0;
+   }
 
ifstate_reset(lio, LIO_IFSTATE_RUNNING);
 
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 935ff29..c6f52f2 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -1137,6 +1137,10 @@ static void liquidio_destroy_nic_device(struct 
octeon_device *oct, int ifidx)
oct->droq[0]->ops.poll_mode = 0;
}
 
+   /* Delete NAPI */
+   list_for_each_entry_safe(napi, n, >napi_list, dev_list)
+   netif_napi_del(napi);
+
if (atomic_read(>ifstate) & LIO_IFSTATE_REGISTERED)
unregister_netdev(netdev);
 
@@ -1784,6 +1788,16 @@ static int liquidio_stop(struct net_device *netdev)
 {
struct lio *lio = GET_LIO(netdev);
struct octeon_device *oct = lio->oct_dev;
+   struct napi_struct *napi, *n;
+
+   if (oct->props[lio->ifidx].napi_enabled) {
+   list_for_each_entry_safe(napi, n, >napi_list, dev_list)
+   napi_disable(napi);
+
+   oct->props[lio->ifidx].napi_enabled = 0;
+
+   oct->droq[0]->ops.poll_mode = 0;
+   }
 
netif_info(lio, ifdown, lio->netdev, "Stopping interface!\n");
/* Inform that netif carrier is down */

Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported

On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote:
> When bit4 is set in the PCIe Device Control register, it indicates
> whether the device is permitted to use relaxed ordering.
> On some platforms using relaxed ordering can have performance issues or
> due to erratum can cause data-corruption. In such cases devices must avoid
> using relaxed ordering.
> 
> This patch checks if there is any node in the hierarchy that indicates that
> using relaxed ordering is not safe. 

I think you only check the devices between the root port and the
target device.  For example, you don't check siblings or cousins of
the target device.

> In such cases the patch turns off the
> relaxed ordering by clearing the eapability for this device.

s/eapability/capability/

> And if the
> device is probably running in a guest machine, we should do nothing.

I don't know what this sentence means.  "Probably running in a guest
machine" doesn't really make sense, and there's nothing in your patch
that explicitly checks for being in a guest machine.

> Signed-off-by: Ding Tianhong 
> Acked-by: Alexander Duyck 
> Acked-by: Ashok Raj 
> ---
>  drivers/pci/pci.c   | 29 +
>  drivers/pci/probe.c | 37 +
>  include/linux/pci.h |  2 ++
>  3 files changed, 68 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index af0cc34..4f9d7c1 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4854,6 +4854,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
>  EXPORT_SYMBOL(pcie_set_mps);
>  
>  /**
> + * pcie_clear_relaxed_ordering - clear PCI Express relaxed ordering bit
> + * @dev: PCI device to query
> + *
> + * If possible clear relaxed ordering

Why "If possible"?  The bit is required to be RW or hardwired to zero,
so PCI_EXP_DEVCTL_RELAX_EN should *always* be zero when this returns.

> + */
> +int pcie_clear_relaxed_ordering(struct pci_dev *dev)
> +{
> + return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
> +   PCI_EXP_DEVCTL_RELAX_EN);
> +}
> +EXPORT_SYMBOL(pcie_clear_relaxed_ordering);

The current series doesn't add any callers of this except
pci_configure_relaxed_ordering(), so it doesn't need to be exported to
modules.

I think I would put both of these functions in drivers/pci/probe.c.
Then this one could be static and you'd only have to add
pcie_relaxed_ordering_supported() to include/linux/pci.h.

> +
> +/**
> + * pcie_relaxed_ordering_supported - Probe for PCIe relexed ordering support

s/relexed/relaxed/

> + * @dev: PCI device to query
> + *
> + * Returns true if the device support relaxed ordering attribute.
> + */
> +bool pcie_relaxed_ordering_supported(struct pci_dev *dev)
> +{
> + u16 v;
> +
> + pcie_capability_read_word(dev, PCI_EXP_DEVCTL, );
> +
> + return !!(v & PCI_EXP_DEVCTL_RELAX_EN);
> +}
> +EXPORT_SYMBOL(pcie_relaxed_ordering_supported);

This is misnamed.  This doesn't tell us anything about whether the
device *supports* relaxed ordering.  It only tells us whether the
device is *permitted* to use it.

When a device initiates a transaction, the hardware should set the RO
bit in the TLP with logic something like this:

  RO =  &&
&&

The issue you're fixing is that some Completers don't handle RO
correctly.  The determining factor is not the Requester, but the
Completer (for this series, a Root Port).  So I think this should be
something like:

  int pcie_relaxed_ordering_broken(struct pci_dev *completer)
  {
if (!completer)
  return 0;

return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
  }

and the caller should do something like this:

 if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev)))
   adapter->flags |= ROOT_NO_RELAXED_ORDERING;

That way it's obvious where the issue is, and it's obvious that the
answer might be different for peer-to-peer transactions than it is for
transactions to the root port, i.e., to coherent memory.

> +
> +/**
>   * pcie_get_minimum_link - determine minimum link settings of a PCI device
>   * @dev: PCI device to query
>   * @speed: storage for minimum speed
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index c31310d..48df012 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1762,6 +1762,42 @@ static void pci_configure_extended_tags(struct pci_dev 
> *dev)
>PCI_EXP_DEVCTL_EXT_TAG);
>  }
>  
> +/**
> + * pci_dev_should_disable_relaxed_ordering - check if the PCI device
> + * should disable the relaxed ordering attribute.
> + * @dev: PCI device
> + *
> + * Return true if any of the PCI devices above us do not support
> + * relaxed ordering.
> + */
> +static bool pci_dev_should_disable_relaxed_ordering(struct pci_dev *dev)
> +{
> + while (dev) {
> + if (dev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING)
> +

RE: [PATCH net-next] openvswitch: add NSH support

2017-08-08 Thread Yang, Yi Y

Hi, Jiri

Thank you for your comments.

__be32 c[4] is the name Ben Pfaff suggested, the original name is c1, c2, c3, 
c4, they are context data, so c seems ok, too :-)

OVS has merged it and has the same name, maybe the better way is adding comment 
/* Context data */ after it.

For MD type 2, struct ovs_key_nsh is very difficult to cover it, so far we 
don't know how to support MD type 2 better, Geneve defined 64 tun_metadata0-63 
to handle this, those keys are parts of struct flow_tnl, the highest 
possibility is to reuse those keys.

So for future MD type 2, we will have two parts of keys, one is from struct 
ovs_key_nsh, another is from struct flow_tnl, this won't break the old uAPI.

"#define OVS_ENCAP_NSH_MAX_MD_LEN 16" is changed per Ben's comment from 256, 
Ben thinks 256 is too big but we only support MD type 1 now. We still have ways 
to extend it, for example:

struct ovs_action_encap_nsh * oaen = (struct ovs_action_encap_nsh *) malloc 
(sizeof(struct ovs_action_encap_nsh) + ANY_SIZE);

nl_msg_put_unspec(actions, OVS_ACTION_ATTR_ENCAP_NSH,
  oaen, sizeof(struct ovs_action_encap_nsh) + ANY_SIZE);

In addition, we also need to consider, OVS userspace code must be consistent 
with here, so keeping it intact will be better, we have way to support 
dynamically extension when we add MD type 2 support.

About action name, unfortunately, userspace data plane has named them as 
encap_nsh & decap_nsh, Jan, what do you think about Jiri's suggestion?

But from my understanding, encap_* & decap_* are better because they 
corresponding to generic encap & decap actions, in addition, encap semantics 
are different from push, encap just pushed an empty header with default values, 
users must use set_field to set the content of the header.

Again, OVS userspace code must be consistent with here, so keeping it intact 
will be better because OVS userspace code was there.

-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
Behalf Of Jiri Benc
Sent: Tuesday, August 8, 2017 10:28 PM
To: Yang, Yi Y 
Cc: netdev@vger.kernel.org; d...@openvswitch.org; da...@davemloft.net
Subject: Re: [PATCH net-next] openvswitch: add NSH support

On Tue,  8 Aug 2017 12:59:40 +0800, Yi Yang wrote:
> +struct ovs_key_nsh {
> + __u8 flags;
> + __u8 mdtype;
> + __u8 np;
> + __u8 pad;
> + __be32 path_hdr;
> + __be32 c[4];

"c" is a very poor name. Please rename it to something that expresses what this 
field contains.

Also, this looks like MD type 1 only. How are those fields going to work with 
MD type 2? I don't think MD type 2 implementation is necessary in this patch 
but I'd like to know how this is going to work - it's uAPI and thus set in 
stone once this is merged. The uAPI needs to be designed with future use in 
mind.

> +#define OVS_ENCAP_NSH_MAX_MD_LEN 16
> +/*
> + * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
> + * @flags: NSH header flags.
> + * @mdtype: NSH metadata type.
> + * @mdlen: Length of NSH metadata in bytes.
> + * @np: NSH next_protocol: Inner packet type.
> + * @path_hdr: NSH service path id and service index.
> + * @metadata: NSH metadata for MD type 1 or 2  */ struct 
> +ovs_action_encap_nsh {
> + __u8 flags;
> + __u8 mdtype;
> + __u8 mdlen;
> + __u8 np;
> + __be32 path_hdr;
> + __u8 metadata[OVS_ENCAP_NSH_MAX_MD_LEN];

This is wrong. The metadata size is set to a fixed size by this and cannot be 
ever extended, or at least not easily. Netlink has attributes for exactly these 
cases and that's what needs to be used here.

> @@ -835,6 +866,8 @@ enum ovs_action_attr {
>   OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
>   OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
>   OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
> + OVS_ACTION_ATTR_ENCAP_NSH,/* struct ovs_action_encap_nsh. */
> + OVS_ACTION_ATTR_DECAP_NSH,/* No argument. */

Use "push" and "pop", not "encap" and "decap" to be consistent with the naming 
in the rest of the file. We use encap and decap for tunnel operations. This 
code does not use lwtunnels, thus push and pop is more appropriate.

 Jiri

Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-08 Thread Casey Leedom

| From: Bjorn Helgaas
| Sent: Tuesday, August 8, 2017 4:22 PM
|
| This needs to include a link to the Intel spec
|
(https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
| sec 3.9.1).

In the commit message or as a comment? Regardless, I agree. It's always
nice to be able to go back and see what the official documentation says.
However, that said, links on the internet are ... fragile as time goes by,
so we might want to simply quote section 3.9.1 in the commit message since
it's relatively short:

3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
and Toward MMIO Regions (P2P)

In order to maximize performance for PCIe devices in the processors
listed in Table 3-6 below, the soft- ware should determine whether the
accesses are toward coherent memory (system memory) or toward MMIO
regions (P2P access to other devices). If the access is toward MMIO
region, then software can command HW to set the RO bit in the TLP
header, as this would allow hardware to achieve maximum throughput for
these types of accesses. For accesses toward coherent memory, software
can command HW to clear the RO bit in the TLP header (no RO), as this
would allow hardware to achieve maximum throughput for these types of
accesses.

Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
PCIe Performance

ProcessorCPU RP Device IDs

Intel Xeon processors based on 6F01H-6F0EH
Broadwell microarchitecture

Intel Xeon processors based on 2F01H-2F0EH
Haswell microarchitecture

| It should also include a pointer to the AMD erratum, if available, or
| at least some reference to how we know it doesn't obey the rules.

Getting an ACK from AMD seems like a forlorn cause at this point. My
contact was Bob Shaw and he stopped responding to me
messages almost a year ago saying that all of AMD's energies were being
redirected towards upcoming x86 products (likely Ryzen as we now know). As
far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
SoC.

On the specific issue, I can certainly write up somthing even more
extensive than I wrote up for the comment in drivers/pci/quirks.c. Please
review the comment I wrote up and tell me if you'd like something even more
detailed -- I'm usually acused of writing comments which are too long, so
this would be a new one on me ... :-)

| Ashok, thanks for chiming in. Now that you have, I have a few more
| questions for you:

I can answer a few of these:

| - Is the above doc the one you mentioned as being now public?

Yes. Ashok worked with me to the extent he was allowed prior to the
publishing of the public technocal note, but he couldn't say much. (Believe
it or not, it is possible to say less than the quoted section above.) When
the note was published, Patrick Cramer sent me the note about it and pointed
me at section 3.9.1.

| - Is this considered a hardware erratum?

I certainly consider it a Hardware Bug. And I'm really hoping that Ashok
will be able to find a "Chicken Bit" which allows the broken feature to be
turned off. Remember, the Relaxed Ordering Attribute on a Transaction Layer
Packet is simply a HINT. It is perfectly reasonable for a compliant
implementation to simply ignore the Relaxed Ordering Attribute on an
incoming TLP Request. The sole responsibility of a compliant implementation
is to return the exact same Relaxed Ordering and No Snoop Attributes in any
TLP Response (The rules for ID-Based Ordering Attribute are more complex.)

Earlier Intel Root Complexes did exactly this: they ignored the Relaxed
Ordering Attribute and there was no performance difference for
using/not-using it. It's pretty obvious that an attempt was made to
implement optimizations surounding the use of Relaxed Ordering and they
didn't work.

| - If so, is there a pointer to that as well?

Intel is historically tight-lipped about admiting any bugs/errata in their
products. I'm guessing that the above quoted Section 3.9.1 is likely to be
all we ever get. The language above regarding TLPs targetting Coherent
Shared Memory are basically as much of an admission that they got it wrong
as we're going to get. But heck, maybe we'll get lucky ... Especially with
regard to the hoped for "Chicken Bit" ...

| - If this is not considered an erratum, can you provide any guidance
|about how an OS should determine when it should use RO?

Software? We don't need no stinking software!

Sorry, I couldn't resist.

| Relying on a list of device IDs in an optimization manual is OK for an
| erratum, but if it's *not* an erratum, it seems like a hole in the specs
| because as far as I know there's no generic way for the OS to discover
| whether to use RO.

Well, here's to hoping that Ashok and/or Patrick are able to offer

Re: [Patch net-next] net_sched: get rid of some forward declarations

From: Cong Wang 
Date: Mon,  7 Aug 2017 15:26:50 -0700

> If we move up tcf_fill_node() we can get rid of these
> forward declarations.
> 
> Also, move down tfilter_notify_chain() to group them together.
> 
> Reported-by: Jamal Hadi Salim 
> Cc: Jamal Hadi Salim 
> Signed-off-by: Cong Wang 

Applied.

Re: [RFC PATCH] net: don't set __LINK_STATE_START until after dev->open() call

From: Jacob Keller 
Date: Mon,  7 Aug 2017 15:24:21 -0700

> Fix an issue with relying on netif_running() which could be true during
> when dev->open() handler is being called, even if it would exit with
> a failure. This ensures the state does not get set and removed with
> a narrow race for other callers to read it as open when infact it never
> finished opening.
> 
> Signed-off-by: Jacob Keller 
> ---
> I found this as a result of debugging a race condition in the i40evf
> driver, in which we assumed that netif_running() would not be true until
> after dev->open() had been called and succeeded. Unfortunately we can't
> hold the rtnl_lock() while checking netif_running() because it would
> cause a deadlock between our reset task and our ndo_open handler.
> 
> I am wondering whether the proposed change is acceptable here, or
> whether some ndo_open handlers rely on __LINK_STATE_START being true
> prior to their being called?

I think this has the potential to break a bunch of drivers, but I
cannot prove this.

A lot of drivers have several pieces of state setup when they bring
the device up.  And these routines are also invoked from other code
paths like suspend/resume, PCI-E error recovery, etc. and they
probably do netif_running() calls here and there.

This behavior has been this way for a very long time, so the risk is
quite high I think.

Re: [PATCH net-next] net: dsa: lan9303: Only allocate 3 ports

From: Egil Hjelmeland 
Date: Tue,  8 Aug 2017 00:22:21 +0200

> Save 2628 bytes on arm eabi by allocate only the required 3 ports.
> 
> Now that ds->num_ports is correct: In net/dsa/tag_lan9303.c
> eliminate duplicate LAN9303_MAX_PORTS, use ds->num_ports.
> (Matching the pattern of other net/dsa/tag_xxx.c files.)
> 
> Signed-off-by: Egil Hjelmeland 

Applied.

Re: [PATCH net-next] liquidio: fix misspelled firmware image filenames

From: Felix Manlunas 
Date: Mon, 7 Aug 2017 12:22:15 -0700

> From: Derek Chickles 
> 
> Fix misspelled firmware image filenames advertised via MODULE_FIRMWARE().
> 
> Signed-off-by: Derek Chickles 
> Signed-off-by: Felix Manlunas 

Applied.

Re: [PATCH net-next] selftests: bpf: add a test for XDP redirect

From: William Tu 
Date: Mon,  7 Aug 2017 13:14:42 -0700

> Add test for xdp_redirect by creating two namespaces with two
> veth peers, then forward packets in-between.
> 
> Signed-off-by: William Tu 

Applied, thank you.

Re: [PATCH net-next v2 1/2] bpf: Move check_uarg_tail_zero() upward

From: Mickaël Salaün 
Date: Mon,  7 Aug 2017 20:45:19 +0200

> The function check_uarg_tail_zero() may be useful for other part of the
> code in the syscall.c file. Move this function at the beginning of the
> file.
> 
> Signed-off-by: Mickaël Salaün 
> Acked-by: Daniel Borkmann 

Applied.

Re: [PATCH net-next v2 2/2] bpf: Extend check_uarg_tail_zero() checks

From: Mickaël Salaün 
Date: Mon,  7 Aug 2017 20:45:20 +0200

> The function check_uarg_tail_zero() was created from bpf(2) for
> BPF_OBJ_GET_INFO_BY_FD without taking the access_ok() nor the PAGE_SIZE
> checks. Make this checks more generally available while unlikely to be
> triggered, extend the memory range check and add an explanation
> including why the ToCToU should not be a security concern.
> 
> Signed-off-by: Mickaël Salaün 
> Acked-by: Daniel Borkmann 
> Cc: Alexei Starovoitov 
> Cc: David S. Miller 
> Cc: Kees Cook 
> Cc: Martin KaFai Lau 
> Link: 
> https://lkml.kernel.org/r/CAGXu5j+vRGFvJZmjtAcT8Hi8B+Wz0e1b6VKYZHfQP_=dxzc...@mail.gmail.com

Applied.

Re: [PATCH net-next 1/1] netvsc: make sure and unregister datapath

From: Stephen Hemminger 
Date: Mon,  7 Aug 2017 11:30:00 -0700

> Go back to switching datapath directly in the notifier callback.
> Otherwise datapath might not get switched on unregister.
> 
> No need for calling the NOTIFY_PEERS notifier since that is only for
> a gratitious ARP/ND packet; but that is not required with Hyper-V
> because both VF and synthetic NIC have the same MAC address.
> 
> Reported-by: Vitaly Kuznetsov 
> Fixes: 0c195567a8f6 ("netvsc: transparent VF management")
> Signed-off-by: Stephen Hemminger 

Applied, thanks Stephen.

Re: [PATCH net-next] liquidio: fix wrong info about vf rx/tx ring parameters reported to ethtool

From: Felix Manlunas 
Date: Mon, 7 Aug 2017 10:39:00 -0700

> From: Intiyaz Basha 
> 
> Information reported to ethtool about vf rx/tx ring parameters is wrong.
> Fix it by adding the missing initializations.
> 
> Signed-off-by: Intiyaz Basha 
> Signed-off-by: Felix Manlunas 

Applied.

Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure

From: Edward Cree 
Date: Tue, 8 Aug 2017 21:23:03 +0100

> In any case, if you go with the enum approach and later it _does_ prove
>  necessary to have more flexibility, you can have enum values dynamically
>  assigned (like genetlink manages to do); and programs using the existing
>  fixed IDs will continue to work.

Indeed:

> It's much harder to go the other way...

 THIS!

Re: [PATCH] net: dsa: mediatek: add adjust link support for user ports

From: John Crispin 
Date: Mon,  7 Aug 2017 16:20:49 +0200

> Manually adjust the port settings of user ports once PHY polling has
> completed. This patch extends the adjust_link callback to configure the
> per port PMCR register, applying the proper values polled from the PHY.
> Without this patch flow control was not always getting setup properly.
> 
> Signed-off-by: Shashidhar Lakkavalli 
> Signed-off-by: Muciri Gatimu 
> Signed-off-by: John Crispin 

Applied, thank you.

Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets

From: Saeed Mahameed 
Date: Tue, 8 Aug 2017 19:16:52 +0300

> On Thu, Aug 3, 2017 at 11:54 PM, Davide Caratti  wrote:
>> if the NIC fails to validate the checksum on TCP/UDP, and validation of IP
>> checksum is successful, the driver subtracts the pseudo-header checksum
>> from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't
>> do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails.
>>
>> V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set
>>
>> Reported-by: Shuang Li 
>> Fixes: f8c6455bb04b ("net/mlx4_en: Extend checksum offloading by CHECKSUM 
>> COMPLETE")
>> Signed-off-by: Davide Caratti 
> 
> Acked-by: Saeed Mahameed 

Applied and queued up for -stable.

Re: [PATCH v5 net-next 00/12] bpf: rewrite value tracking in verifier

From: Daniel Borkmann 
Date: Tue, 08 Aug 2017 02:46:16 +0200

> On 08/07/2017 04:21 PM, Edward Cree wrote:
>> This series simplifies alignment tracking, generalises bounds tracking
>> and
>>   fixes some bounds-tracking bugs in the BPF verifier.  Pointer
>>   arithmetic on
>>   packet pointers, stack pointers, map value pointers and context
>>   pointers has
>>   been unified, and bounds on these pointers are only checked when the
>>   pointer
>>   is dereferenced.
>> Operations on pointers which destroy all relation to the original
>> pointer
>>   (such as multiplies and shifts) are disallowed if
>>   !env->allow_ptr_leaks,
>>   otherwise they convert the pointer to an unknown scalar and feed it to
>>   the
>>   normal scalar arithmetic handling.
>> Pointer types have been unified with the corresponding
>> adjusted-pointer types
>>   where those existed (e.g. PTR_TO_MAP_VALUE[_ADJ] or FRAME_PTR vs
>>   PTR_TO_STACK); similarly, CONST_IMM and UNKNOWN_VALUE have been
>>   unified into
>>   SCALAR_VALUE.
>> Pointer types (except CONST_PTR_TO_MAP, PTR_TO_MAP_VALUE_OR_NULL and
>>   PTR_TO_PACKET_END, which do not allow arithmetic) have a 'fixed
>>   offset' and
>>   a 'variable offset'; the former is used when e.g. adding an immediate
>>   or a
>>   known-constant register, as long as it does not overflow.  Otherwise
>>   the
>>   latter is used, and any operation creating a new variable offset
>>   creates a
>>   new 'id' (and, for PTR_TO_PACKET, clears the 'range').
>> SCALAR_VALUEs use the 'variable offset' fields to track the range of
>> possible
>>   values; the 'fixed offset' should never be set on a scalar.
> 
> Been testing and reviewing the series over the last several days,
> looks
> reasonable to me as far as I can tell. Thanks for all the hard work on
> unifying this, Edward!
> 
> Acked-by: Daniel Borkmann 

Series applied, thanks everyone!

Re: [PATCH 0/6] In-kernel QMI handling

2017-08-08 Thread Dan Williams

On Tue, 2017-08-08 at 15:42 -0700, Bjorn Andersson wrote:
> On Tue 08 Aug 04:02 PDT 2017, Bj?rn Mork wrote:
> 
> > Bjorn Andersson  writes:
> > 
> > > This series starts by moving the common definitions of the QMUX
> > > protocol to the
> > > uapi header, as they are shared with clients - both in kernel and
> > > userspace.
> > > 
> > > This series then introduces in-kernel helper functions for aiding
> > > the handling
> > > of QMI encoded messages in the kernel. QMI encoding is a wire-
> > > format used in
> > > exchanging messages between the majority of QRTR clients and
> > > services.
> > 
> > Interesting!  I tried to add some QMI handling in the kernel a few
> > years
> > ago, but was thankfully voted down.  See
> > https://www.spinics.net/lists/netdev/msg183101.html and the
> > following
> > discussion. I am convinced that was the right decision, for the
> > client
> > side at least. The protocol is just too extensive and ever-growing
> > to be
> > implemented in the kernel. We would be catching up forever.
> > 
> > Note that I had very limited knowledge of the protocol at the time
> > I
> > wrote that driver.  Still have, in fact :-)
> > 
> 
> Thanks for the pointer, I definitely think there's more work to be
> done
> here to figure out the proper way to interact with these devices.
> 
> But I think that Dan's reply shows a huge source of confusion here;
> the
> acronym "QMI" covers a large amount of different things - and means
> different things for different people.

I would agree, sorry for any confusion caused.  Great discussion so
far.

> In the modem world QMI seems to mean a defined set of logical
> endpoints
> that accepts TLV-encoded messages to do modem-related things. But the
> TLV-encoding is used for non-modem related services and the only
> common
> denominator of everything called QMI is the TLV-encoding.
> 
> 
> Due to my limited exposure to the USB attached "QMI thingies" I
> haven't
> previously looked into the exact differences. The proposed patches
> aimed
> to support implementing a few non-modem-related clients using
> QMI-encoded messages over ipcrouter.
> 
> Looking at your patch above, and oPhono, seems to highlight a few
> important differences that will take some thinking to overcome.
> 
> = Transport
> The transport header in the USB case is your struct qmux, which
> contains
> the type of message (in "flags") and the transaction id. The
> "service"
> in the QMUX header matches the service id being communicated with.
> But
> in order to communicate with a service it seems like one requests a
> client-id from the control service.

Correct.  You cannot talk to a service on the modem without getting an
allocated client ID from the CTL service, which has a well-defined
client ID.

> In the smartphone world (with shared memory communication) the
> transport
> is ipcrouter - with a header very similar to UDP - and there's no
> information about the payload, it provides only the means of
> delivering

Can you explain a bit about the relationship of SMD to [I/R]PC, qrtr,
and QMI?  A couple years ago there was smd_qmi.c (like for the Nexus 4
with APQ8064 and a discrete MDM9215) which from a 10 minute fresh look
appears to just push QMUX+QMI via SMD rather than being backed by the
RPC/IPC stuff.  I could be wrong, there's a lot of indirection there
and it may well end up going over the router.  But that's buried deeper
than a 10m look for me.

Is it perhaps only with on-chip blocks where the QMUX/QMI/qrtr/irpc
stuff you describe here is used?  If so, perhaps that's the distinction
to be made.  I'll let you correct me here since you clearly know more
than I about the internals of these devices.

> messages from one address/port to another address/port. A typical
> smartphone has 3-4 nodes (modem, sensors, audio etc) and ports are
> dynamically allocated. The control messages in the QMUX protocol (not
> the same QMUX protocol as in the USB case!) are used for clients to
> find
> the mapping from service id to a port on the given address.  The
> source
> port is dynamically allocated in this case.
> 
> = QMI-encoded messages
> The list of TLV-entries have a "QMI header" prepended in both cases,
> but
> in the QMUX case the header consists only of "msgid" and length.
> 
> In the ipcrouter case the transport doesn't carry any information
> regarding the payload, so the header prepended the TLV entries
> includes
> "type", transaction id, "msg_id" and length.

I'll assume that in this case, because the client has already found out
how to contact the target service directly, that it has no use for a
"fat" QMUX header that includes the client ID and service stuff.

I don't really have an issue with the kernel doing "thin" QMUX-related
stuff.  That's pretty simple.

> It looks as if once past the differences in the transport and QMI
> message header the messages (TLV-encoded data) are the same. But I'm
> not
> yet sure about how we can hide the transport

[PATCH net] geneve: maximum value of VNI cannot be used

2017-08-08 Thread Girish Moodalbail

Geneve's Virtual Network Identifier (VNI) is 24 bit long, so the range
of values for it would be from 0 to 16777215 (2^24 -1).  However, one
cannot create a geneve device with VNI set to 16777215. This patch fixes
this issue.

Signed-off-by: Girish Moodalbail 
---
 drivers/net/geneve.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 745d57ae..8b8565d 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1099,7 +1099,7 @@ static int geneve_validate(struct nlattr *tb[], struct 
nlattr *data[],
if (data[IFLA_GENEVE_ID]) {
__u32 vni =  nla_get_u32(data[IFLA_GENEVE_ID]);
 
-   if (vni >= GENEVE_VID_MASK)
+   if (vni >= GENEVE_N_VID)
return -ERANGE;
}
 
-- 
1.8.3.1

Re: [PATCH] netfilter: nf_nat_h323: fix logical-not-parentheses warning

2017-08-08 Thread Nick Desaulniers

bumping for review

On Mon, Jul 31, 2017 at 11:39 AM, Nick Desaulniers
 wrote:
> Clang produces the following warning:
>
> net/ipv4/netfilter/nf_nat_h323.c:553:6: error:
> logical not is only applied to the left hand side of this comparison
>   [-Werror,-Wlogical-not-parentheses]
> if (!set_h225_addr(skb, protoff, data, dataoff, taddr,
> ^
> add parentheses after the '!' to evaluate the comparison first
> add parentheses around left hand side expression to silence this warning
>
> There's not necessarily a bug here, but it's cleaner to use the form:
>
> if (x != 0)
>
> rather than:
>
> if (!x == 0)
>
> Signed-off-by: Nick Desaulniers 
> ---
> Also, it's even cleaner to use the form:
>
> if (x)
>
> but then if the return codes change from treating 0 as success (unlikely),
> then all call sites must be updated.
>
> I'm happy to send v2 that changes to that form, and updates the other call
> sites to be:
>
> if (set_h225_addr())
>   handle_failures()
> else
>   handle_success()
>
>  net/ipv4/netfilter/nf_nat_h323.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/netfilter/nf_nat_h323.c 
> b/net/ipv4/netfilter/nf_nat_h323.c
> index 574f7ebba0b6..d8fb251fa6e3 100644
> --- a/net/ipv4/netfilter/nf_nat_h323.c
> +++ b/net/ipv4/netfilter/nf_nat_h323.c
> @@ -550,9 +550,9 @@ static int nat_callforwarding(struct sk_buff *skb, struct 
> nf_conn *ct,
> }
>
> /* Modify signal */
> -   if (!set_h225_addr(skb, protoff, data, dataoff, taddr,
> -  >tuplehash[!dir].tuple.dst.u3,
> -  htons(nated_port)) == 0) {
> +   if (set_h225_addr(skb, protoff, data, dataoff, taddr,
> + >tuplehash[!dir].tuple.dst.u3,
> + htons(nated_port)) != 0) {
> nf_ct_unexpect_related(exp);
> return -1;
> }
> --
> 2.14.0.rc0.400.g1c36432dff-goog
>



-- 
Thanks,
~Nick Desaulniers

Re: [PATCH v06 35/36] uapi linux/tls.h: don't include in user space

2017-08-08 Thread Dmitry V. Levin

On Sun, Aug 06, 2017 at 06:44:26PM +0200, Mikko Rapeli wrote:
> It is not needed and not part of uapi headers, but causes
> user space compilation error:
> 
> fatal error: net/tcp.h: No such file or directory
>  #include 
>  ^
> 
> Signed-off-by: Mikko Rapeli 
> Cc: Dave Watson 
> Cc: Ilya Lesokhin 
> Cc: Aviad Yehezkel 
> ---
>  include/uapi/linux/tls.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
> index cc1d21db35d8..d87c698623f2 100644
> --- a/include/uapi/linux/tls.h
> +++ b/include/uapi/linux/tls.h
> @@ -37,7 +37,9 @@
>  #include 
>  #include 
>  #include 
> +#ifdef __KERNEL__
>  #include 
> +#endif

Let's move it to include/net/tls.h instead.


-- 
ldv


signature.asc
Description: PGP signature

Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

On Sat, Aug 05, 2017 at 03:15:10PM +0800, Ding Tianhong wrote:
> From: Casey Leedom 
> 
> The patch adds a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING to indicate that
> Relaxed Ordering (RO) attribute should not be used for Transaction Layer
> Packets (TLP) targetted towards these affected root complexes. Current list
> of affected parts include some Intel Xeon processors root complex which 
> suffers from
> flow control credits that result in performance issues. On these affected
> parts RO can still be used for peer-2-peer traffic. AMD A1100 ARM ("SEATTLE")
> Root complexes don't obey PCIe 3.0 ordering rules, hence could lead to
> data-corruption.

This needs to include a link to the Intel spec
(https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
sec 3.9.1).

It should also include a pointer to the AMD erratum, if available, or
at least some reference to how we know it doesn't obey the rules.

Ashok, thanks for chiming in.  Now that you have, I have a few more
questions for you:

  - Is the above doc the one you mentioned as being now public?

  - Is this considered a hardware erratum?

  - If so, is there a pointer to that as well?

  - If this is not considered an erratum, can you provide any guidance
about how an OS should determine when it should use RO?

Relying on a list of device IDs in an optimization manual is OK for an
erratum, but if it's *not* an erratum, it seems like a hole in the
specs because as far as I know there's no generic way for the OS to
discover whether to use RO.

Bjorn

Re: [PATCH net-next v2] wan: dscc4: add checks for dma mapping errors

2017-08-08 Thread Francois Romieu

Alexey Khoroshilov  :
[...]
> diff --git a/drivers/net/wan/dscc4.c b/drivers/net/wan/dscc4.c
> index 799830f..6a9ffac 100644
> --- a/drivers/net/wan/dscc4.c
> +++ b/drivers/net/wan/dscc4.c
> @@ -518,23 +518,31 @@ static void dscc4_release_ring(struct dscc4_dev_priv 
> *dpriv)
>  static inline int try_get_rx_skb(struct dscc4_dev_priv *dpriv,
>struct net_device *dev)
>  {
> + struct pci_dev *pdev = dpriv->pci_priv->pdev;
>   unsigned int dirty = dpriv->rx_dirty%RX_RING_SIZE;
>   struct RxFD *rx_fd = dpriv->rx_fd + dirty;
>   const int len = RX_MAX(HDLC_MAX_MRU);

For the edification of the masses, you may follow a strict inverted
xmas tree style (aka longer lines first as long as it does not bug).

[...]
> @@ -1147,14 +1155,22 @@ static netdev_tx_t dscc4_start_xmit(struct sk_buff 
> *skb,
>   struct dscc4_dev_priv *dpriv = dscc4_priv(dev);
>   struct dscc4_pci_priv *ppriv = dpriv->pci_priv;
>   struct TxFD *tx_fd;
> + dma_addr_t addr;
>   int next;
>  
> + addr = pci_map_single(ppriv->pdev, skb->data, skb->len,
> +   PCI_DMA_TODEVICE);

Use a local struct pci_dev *pdev and it fits on a single line.

At some point it will probably be converted to plain dma api and use a 'd' dev.

[...]
> @@ -1887,16 +1903,22 @@ static struct sk_buff *dscc4_init_dummy_skb(struct 
> dscc4_dev_priv *dpriv)
>  
>   skb = dev_alloc_skb(DUMMY_SKB_SIZE);
>   if (skb) {
> + struct pci_dev *pdev = dpriv->pci_priv->pdev;
>   int last = dpriv->tx_dirty%TX_RING_SIZE;
>   struct TxFD *tx_fd = dpriv->tx_fd + last;
> + dma_addr_t addr;
>  
>   skb->len = DUMMY_SKB_SIZE;
>   skb_copy_to_linear_data(skb, version,
>   strlen(version) % DUMMY_SKB_SIZE);
>   tx_fd->state = FrameEnd | TO_STATE_TX(DUMMY_SKB_SIZE);
> - tx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev,
> -  skb->data, DUMMY_SKB_SIZE,
> -  PCI_DMA_TODEVICE));
> + addr = pci_map_single(pdev, skb->data, DUMMY_SKB_SIZE,
> +   PCI_DMA_TODEVICE);
> + if (pci_dma_mapping_error(pdev, addr)) {
> + dev_kfree_skb_any(skb);
> + return NULL;
> + }
> + tx_fd->data = cpu_to_le32(addr);
>   dpriv->tx_skbuff[last] = skb;
>   }
>   return skb;

It isn't technically wrong but please don't update tx_fd before the mapping
succeeds. It will look marginally better.

Thanks.

-- 
Ueimor

Re: [PATCH v06 15/36] uapi linux/socket.h: include sys/socket.h in user space

2017-08-08 Thread Dmitry V. Levin

On Sun, Aug 06, 2017 at 06:44:06PM +0200, Mikko Rapeli wrote:
> This libc header has sockaddr definition in user space.
> 
> Fixes user space compilation errors like these from kernel headers including
> only linux/socket.h:
> 
> error: field ‘ifru_addr’ has incomplete type
> struct sockaddr ifru_addr;
> error: field ‘_sockaddr’ has incomplete type
> struct sockaddr  _sockaddr;
> error: invalid application of ‘sizeof’ to incomplete type ‘struct sockaddr’
> 
> With this following uapi headers now compile in user space:
> 
> rdma/rdma_user_rxe.h
> linux/vm_sockets.h
> linux/ncp_fs.h
> linux/nfc.h
> linux/phonet.h
> 
> Signed-off-by: Mikko Rapeli 
> Cc: netdev@vger.kernel.org
> Cc: Dmitry V. Levin 
> ---
>  include/uapi/linux/socket.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
> index 76ab0c68561e..8a81197cc08b 100644
> --- a/include/uapi/linux/socket.h
> +++ b/include/uapi/linux/socket.h
> @@ -1,6 +1,10 @@
>  #ifndef _UAPI_LINUX_SOCKET_H
>  #define _UAPI_LINUX_SOCKET_H
>  
> +#ifndef __KERNEL__
> +#include 
> +#endif

This is scary because of infamous libc vs uapi interoperability issues.
Couldn't we fix affected headers instead?


-- 
ldv


signature.asc
Description: PGP signature

Re: skb allocation from interrupt handler?

From: Murali Karicheri 
Date: Tue, 8 Aug 2017 18:17:52 -0400

> Is there an skb_alloc function that can be used from interrupt handler? Looks 
> like netdev_alloc_skb()
> can't be used since I see following trace with kernel hack debug options 
> enabled.
> 
> [  652.481713] [] (unwind_backtrace) from [] 
> (show_stack+0x10/0x14)
> [  652.481725] [] (show_stack) from [] 
> (dump_stack+0x98/0xc4)
> [  652.481736] [] (dump_stack) from [] 
> (___might_sleep+0x1b8/0x2a4)
> [  652.481746] [] (___might_sleep) from [] 
> (rt_spin_lock+0x24/0x5c)
> [  652.481755] [] (rt_spin_lock) from [] 
> (__netdev_alloc_skb+0xd0/0x254)
> [  652.481774] [] (__netdev_alloc_skb) from [] 
> (emac_rx_hardirq+0x374/0x554 [prueth])
> [  652.481793] [] (emac_rx_hardirq [prueth]) from [] 
> (__handle_irq_event_percpu+0x9c/0x128)
> 
> This is running under RT kernel off 4.9.y

Your receive handler should be running from a NAPI poll, which is in
software interrupt.  You should not be doing packet processing in
hardware interrupt context as hardware interrupts should be as short
as possible, and with NAPI polling packet input processing can be
properly distributed amongst several devices, and if the system is
overloaded such processing can be deferred to a kernel thread.

NAPI polling has a large number of other advantages as well, more
streamlined GRO support, automatic support for busypolling... the
list goes on and on and on.

I could show you how to do an SKB allocation in a hardware interrupt,
but instead I'd rather teach you how to fish properly, and encourage
you to convert your driver to NAPI polling instead.

Thanks.

Re: [PATCH 0/6] In-kernel QMI handling

2017-08-08 Thread Bjorn Andersson

On Tue 08 Aug 04:02 PDT 2017, Bj?rn Mork wrote:

> Bjorn Andersson  writes:
> 
> > This series starts by moving the common definitions of the QMUX protocol to 
> > the
> > uapi header, as they are shared with clients - both in kernel and userspace.
> >
> > This series then introduces in-kernel helper functions for aiding the 
> > handling
> > of QMI encoded messages in the kernel. QMI encoding is a wire-format used in
> > exchanging messages between the majority of QRTR clients and services.
> 
> Interesting!  I tried to add some QMI handling in the kernel a few years
> ago, but was thankfully voted down.  See
> https://www.spinics.net/lists/netdev/msg183101.html and the following
> discussion. I am convinced that was the right decision, for the client
> side at least. The protocol is just too extensive and ever-growing to be
> implemented in the kernel. We would be catching up forever.
> 
> Note that I had very limited knowledge of the protocol at the time I
> wrote that driver.  Still have, in fact :-)
> 

Thanks for the pointer, I definitely think there's more work to be done
here to figure out the proper way to interact with these devices.

But I think that Dan's reply shows a huge source of confusion here; the
acronym "QMI" covers a large amount of different things - and means
different things for different people.

In the modem world QMI seems to mean a defined set of logical endpoints
that accepts TLV-encoded messages to do modem-related things. But the
TLV-encoding is used for non-modem related services and the only common
denominator of everything called QMI is the TLV-encoding.

Due to my limited exposure to the USB attached "QMI thingies" I haven't
previously looked into the exact differences. The proposed patches aimed
to support implementing a few non-modem-related clients using
QMI-encoded messages over ipcrouter.

Looking at your patch above, and oPhono, seems to highlight a few
important differences that will take some thinking to overcome.

= Transport
The transport header in the USB case is your struct qmux, which contains
the type of message (in "flags") and the transaction id. The "service"
in the QMUX header matches the service id being communicated with. But
in order to communicate with a service it seems like one requests a
client-id from the control service.

In the smartphone world (with shared memory communication) the transport
is ipcrouter - with a header very similar to UDP - and there's no
information about the payload, it provides only the means of delivering
messages from one address/port to another address/port. A typical
smartphone has 3-4 nodes (modem, sensors, audio etc) and ports are
dynamically allocated. The control messages in the QMUX protocol (not
the same QMUX protocol as in the USB case!) are used for clients to find
the mapping from service id to a port on the given address.  The source
port is dynamically allocated in this case.

= QMI-encoded messages
The list of TLV-entries have a "QMI header" prepended in both cases, but
in the QMUX case the header consists only of "msgid" and length.

In the ipcrouter case the transport doesn't carry any information
regarding the payload, so the header prepended the TLV entries includes
"type", transaction id, "msg_id" and length.

It looks as if once past the differences in the transport and QMI
message header the messages (TLV-encoded data) are the same. But I'm not
yet sure about how we can hide the transport differences.

Regards,
Bjorn

Re: skb allocation from interrupt handler?

2017-08-08 Thread Matteo Croce

Il giorno mar, 08/08/2017 alle 18.17 -0400, Murali Karicheri ha
scritto:
> Is there an skb_alloc function that can be used from interrupt
> handler? Looks like netdev_alloc_skb()
> can't be used since I see following trace with kernel hack debug
> options enabled.
> 
> [  652.481713] [] (unwind_backtrace) from []
> (show_stack+0x10/0x14)
> [  652.481725] [] (show_stack) from []
> (dump_stack+0x98/0xc4)
> [  652.481736] [] (dump_stack) from []
> (___might_sleep+0x1b8/0x2a4)
> [  652.481746] [] (___might_sleep) from []
> (rt_spin_lock+0x24/0x5c)
> [  652.481755] [] (rt_spin_lock) from []
> (__netdev_alloc_skb+0xd0/0x254)
> [  652.481774] [] (__netdev_alloc_skb) from []
> (emac_rx_hardirq+0x374/0x554 [prueth])
> [  652.481793] [] (emac_rx_hardirq [prueth]) from
> [] (__handle_irq_event_percpu+0x9c/0x128)
> 
> This is running under RT kernel off 4.9.y
> 

netdev_alloc_skb() passes GFP_ATOMIC to alloc_skb() so it should work
in an interrupt handler too.

-- 
Matteo Croce
per aspera ad upstream

skb allocation from interrupt handler?

2017-08-08 Thread Murali Karicheri

Is there an skb_alloc function that can be used from interrupt handler? Looks 
like netdev_alloc_skb()
can't be used since I see following trace with kernel hack debug options 
enabled.

[  652.481713] [] (unwind_backtrace) from [] 
(show_stack+0x10/0x14)
[  652.481725] [] (show_stack) from [] 
(dump_stack+0x98/0xc4)
[  652.481736] [] (dump_stack) from [] 
(___might_sleep+0x1b8/0x2a4)
[  652.481746] [] (___might_sleep) from [] 
(rt_spin_lock+0x24/0x5c)
[  652.481755] [] (rt_spin_lock) from [] 
(__netdev_alloc_skb+0xd0/0x254)
[  652.481774] [] (__netdev_alloc_skb) from [] 
(emac_rx_hardirq+0x374/0x554 [prueth])
[  652.481793] [] (emac_rx_hardirq [prueth]) from [] 
(__handle_irq_event_percpu+0x9c/0x128)

This is running under RT kernel off 4.9.y

-- 
Murali Karicheri
Linux Kernel, Keystone

Re: [PATCH] net: systemport: Fix software statistics for SYSTEMPORT Lite

From: Florian Fainelli 
Date: Tue, 8 Aug 2017 14:44:40 -0700

> On 08/08/2017 02:39 PM, Florian Fainelli wrote:
>> With SYSTEMPORT Lite we have holes in our statistics layout that make us
>> skip over the hardware MIB counters, bcm_sysport_get_stats() was not
>> taking that into account, resulting in reporting 0 for all SW-maintained
>> statistics, fix this by skipping accordingly.
>> 
>> Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
>> Signed-off-by: Florian Fainelli 
> 
> David, please ignore this version, I accidentally sent it with the
> debugging left.

Ok.

Re: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk

From: James Hogan 
Date: Tue, 08 Aug 2017 22:20:05 +0100

> cool, i hadn't realised unmentioned elements in an initialiser are
> always zeroed, even when non-global/static, so had interpreted the
> whole array as uninitialised. learn something new every day :-)
> sorry for the noise.

You didn't have to know in the first place, you could have simply
compiled the code into assembler by running:

make kernel/trace/bpf_trace.s

and seen for yourself before putting all of this time and effort into
this patch and discussion.

If you don't know what the compiler does, simply look!

[PATCH net-next] net: ipv6: lower ndisc notifier priority below addrconf

2017-08-08 Thread David Ahern

ndisc_notify is used to send unsolicited neighbor advertisements
(e.g., on a link up). Currently, the ndisc notifier is run before the
addrconf notifer which means NA's are not sent for link-local addresses
which are added by the addrconf notifier.

Fix by lowering the priority of the ndisc notifier. Setting the priority
to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before
the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10.

Signed-off-by: David Ahern 
---
 net/ipv6/ndisc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 0327c1f2e6fc..5e338eb89509 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1779,6 +1779,7 @@ static int ndisc_netdev_event(struct notifier_block 
*this, unsigned long event,
 
 static struct notifier_block ndisc_netdev_notifier = {
.notifier_call = ndisc_netdev_event,
+   .priority = ADDRCONF_NOTIFY_PRIORITY - 5,
 };
 
 #ifdef CONFIG_SYSCTL
-- 
2.9.3

[PATCH net v2] net: systemport: Fix software statistics for SYSTEMPORT Lite

2017-08-08 Thread Florian Fainelli

With SYSTEMPORT Lite we have holes in our statistics layout that make us
skip over the hardware MIB counters, bcm_sysport_get_stats() was not
taking that into account resulting in reporting 0 for all SW-maintained
statistics, fix this by skipping accordingly.

Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
Signed-off-by: Florian Fainelli 
---
Changes in v2:

- no debugging

 drivers/net/ethernet/broadcom/bcmsysport.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 5333601f855f..dc3052751bc1 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -449,6 +449,10 @@ static void bcm_sysport_get_stats(struct net_device *dev,
p = (char *)>stats;
else
p = (char *)priv;
+
+   if (priv->is_lite && !bcm_sysport_lite_stat_valid(s->type))
+   continue;
+
p += s->stat_offset;
data[j] = *(unsigned long *)p;
j++;
-- 
2.9.3

Re: [PATCH] net: systemport: Fix software statistics for SYSTEMPORT Lite

2017-08-08 Thread Florian Fainelli

On 08/08/2017 02:39 PM, Florian Fainelli wrote:
> With SYSTEMPORT Lite we have holes in our statistics layout that make us
> skip over the hardware MIB counters, bcm_sysport_get_stats() was not
> taking that into account, resulting in reporting 0 for all SW-maintained
> statistics, fix this by skipping accordingly.
> 
> Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
> Signed-off-by: Florian Fainelli 

David, please ignore this version, I accidentally sent it with the
debugging left.
-- 
Florian

[PATCH] net: systemport: Fix software statistics for SYSTEMPORT Lite

2017-08-08 Thread Florian Fainelli

With SYSTEMPORT Lite we have holes in our statistics layout that make us
skip over the hardware MIB counters, bcm_sysport_get_stats() was not
taking that into account, resulting in reporting 0 for all SW-maintained
statistics, fix this by skipping accordingly.

Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 5333601f855f..abf175372719 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -449,6 +449,10 @@ static void bcm_sysport_get_stats(struct net_device *dev,
p = (char *)>stats;
else
p = (char *)priv;
+
+   if (priv->is_lite && !bcm_sysport_lite_stat_valid(s->type))
+   continue;
+
p += s->stat_offset;
data[j] = *(unsigned long *)p;
j++;
@@ -608,8 +612,8 @@ static struct sk_buff *bcm_sysport_rx_refill(struct 
bcm_sysport_priv *priv,
 
/* Allocate a new SKB for a new packet */
skb = netdev_alloc_skb(priv->netdev, RX_BUF_LENGTH);
+   priv->mib.alloc_rx_buff_failed++;
if (!skb) {
-   priv->mib.alloc_rx_buff_failed++;
netif_err(priv, rx_err, ndev, "SKB alloc failed\n");
return NULL;
}
-- 
2.9.3

Re: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk

2017-08-08 Thread James Hogan

On 8 August 2017 17:48:57 BST, David Miller  wrote:
>From: Daniel Borkmann 
>Date: Tue, 08 Aug 2017 10:46:52 +0200
>
>> On 08/08/2017 12:25 AM, James Hogan wrote:
>>> In bpf_trace_printk(), the elements in mod[] are left uninitialised,
>>> but
>>> they are then incremented to track the width of the formats. Zero
>>> initialise the array just in case the memory contains non-zero
>values
>>> on
>>> entry.
>>>
>>> Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call
>>> bpf_trace_printk()")
>>> Signed-off-by: James Hogan 
>>> Cc: Alexei Starovoitov 
>>> Cc: Daniel Borkmann 
>>> Cc: Steven Rostedt 
>>> Cc: Ingo Molnar 
>>> Cc: netdev@vger.kernel.org
>>> ---
>>> When I checked (on MIPS32), the elements tended to have the value
>zero
>>> anyway (does BPF zero the stack or something clever?), so this is a
>>> purely theoretical fix.
>>> ---
>>>   kernel/trace/bpf_trace.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index 32dcbe1b48f2..86a52857d941 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -129,7 +129,7 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32,
>>> fmt_size, u64, arg1,
>>>u64, arg2, u64, arg3)
>>>   {
>>> bool str_seen = false;
>>> -   int mod[3] = {};
>>> +   int mod[3] = { 0, 0, 0 };
>> 
>> I'm probably missing something, but is the behavior of gcc wrt
>> above initializers different on mips (it zeroes just fine on x86
>> at least)? If yes, we'd probably need a cocci script to also check
>> rest of the kernel given this is used in a number of places. Hm,
>> could you elaborate?
>
>This change is not necessary at all.
>
>An empty initializer must clear the whole object to zero.
>
>"theoretical" fix indeed... :-(

cool, i hadn't realised unmentioned elements in an initialiser are always 
zeroed, even when non-global/static, so had interpreted the whole array as 
uninitialised. learn something new every day :-) sorry for the noise.

cheers
James

Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure

On Tue, Aug 8, 2017 at 1:23 PM, Edward Cree  wrote:
> On 08/08/17 20:50, Tom Herbert wrote:
>> It's a tradeoff. The nice thing about using strings is that we don't
>> need maintain a universal enum.
> Hmm, that makes it sound as though you're intending for random out-of-tree
>  modules to add these things; since if they're in-tree it's easy for them
>  to get enum values assigned when they're added.  Do we really want to
>  encourage sticking random module code into the network stack like this?
>
> In any case, if you go with the enum approach and later it _does_ prove
>  necessary to have more flexibility, you can have enum values dynamically
>  assigned (like genetlink manages to do); and programs using the existing
>  fixed IDs will continue to work.  It's much harder to go the other way...
>
There is history and precedence. The string mechanism for ulp_ops  a
direct port of the original ULP infrastructure done for kTLS. That
code based the mechanism on TCP congestion ops and that was introduced
into the kernel twelve years ago. This method doesn't seem to have
been viewed as a problem before now...

Tom

> -Ed

[PATCH ericsson v2 1/1] tipc: remove premature ESTABLISH FSM event at link synchronization

2017-08-08 Thread Jon Maloy

When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before  initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index aeef801..9b4dcb6 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1455,10 +1455,8 @@ static bool tipc_node_check_state(struct tipc_node *n, 
struct sk_buff *skb,
/* Initiate synch mode if applicable */
if ((usr == TUNNEL_PROTOCOL) && (mtyp == SYNCH_MSG) && (oseqno == 1)) {
syncpt = iseqno + exp_pkts - 1;
-   if (!tipc_link_is_up(l)) {
-   tipc_link_fsm_evt(l, LINK_ESTABLISH_EVT);
+   if (!tipc_link_is_up(l))
__tipc_node_link_up(n, bearer_id, xmitq);
-   }
if (n->state == SELF_UP_PEER_UP) {
n->sync_point = syncpt;
tipc_link_fsm_evt(l, LINK_SYNCH_BEGIN_EVT);
-- 
2.1.4

[net 1/1] tipc: remove premature ESTABLISH FSM event at link synchronization

2017-08-08 Thread Jon Maloy

When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before  initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy 
---
 net/tipc/node.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index aeef801..9b4dcb6 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1455,10 +1455,8 @@ static bool tipc_node_check_state(struct tipc_node *n, 
struct sk_buff *skb,
/* Initiate synch mode if applicable */
if ((usr == TUNNEL_PROTOCOL) && (mtyp == SYNCH_MSG) && (oseqno == 1)) {
syncpt = iseqno + exp_pkts - 1;
-   if (!tipc_link_is_up(l)) {
-   tipc_link_fsm_evt(l, LINK_ESTABLISH_EVT);
+   if (!tipc_link_is_up(l))
__tipc_node_link_up(n, bearer_id, xmitq);
-   }
if (n->state == SELF_UP_PEER_UP) {
n->sync_point = syncpt;
tipc_link_fsm_evt(l, LINK_SYNCH_BEGIN_EVT);
-- 
2.1.4

Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure

2017-08-08 Thread Edward Cree

On 08/08/17 20:50, Tom Herbert wrote:
> It's a tradeoff. The nice thing about using strings is that we don't
> need maintain a universal enum.
Hmm, that makes it sound as though you're intending for random out-of-tree
 modules to add these things; since if they're in-tree it's easy for them
 to get enum values assigned when they're added.  Do we really want to
 encourage sticking random module code into the network stack like this?

In any case, if you go with the enum approach and later it _does_ prove
 necessary to have more flexibility, you can have enum values dynamically
 assigned (like genetlink manages to do); and programs using the existing
 fixed IDs will continue to work.  It's much harder to go the other way...

-Ed

Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure

On Tue, Aug 8, 2017 at 12:30 PM, John Fastabend
 wrote:
> On 08/08/2017 10:04 AM, Tom Herbert wrote:
>> On Tue, Aug 8, 2017 at 8:31 AM, John Fastabend  
>> wrote:
>>> On 08/07/2017 10:28 AM, Tom Herbert wrote:
 Generalize the ULP infrastructure that was recently introduced to
 support kTLS. This adds a SO_ULP socket option and creates new fields in
 sock structure for ULP ops and ULP data. Also, the interface allows
 additional per ULP parameters to be set so that a ULP can be pushed
 and operations started in one shot.

 In this patch set:
   - Minor dependency fix in inet_common.h
   - Implement ULP infrastructure as a socket mechanism
   - Fixes TCP and TLS to use the new method (maintaining backwards
 API compatibility)
   - Adds a ulp.txt document

 Tested: Ran simple ULP. Dave Watson verified kTLS works.

 -v2: Fix compilation errors when CONFIG_ULP_SOCK not set.
 -v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP
  in tsl_init. Also, fix a couple of minor issues related to
  introducing locked versions of sendmsg, send page. Thanks to
  Dave Watson, John Fastabend, and Mat Martineau for testing and
  providing fixes.

>>>
>>>
>>> Hi Tom, Dave,
>>>
>>> I'm concerned about the performance impact of walking a list and
>>> doing string compares on every socket we create with kTLS. Dave
>>> do you have any request/response tests for kTLS that would put pressure
>>> on the create/destroy time of this infrastructure? We should do some
>>> tests with dummy entries in the ULP list to understand the impact of
>>> this list walk.
>>>
>>> I like the underlying TCP generalized hooks, but do we really expect a
>>> lot of these hooks to exist? If we only have two on the roadmap
>>> (kTLS and socktap) it seems a bit overkill. Further, if we really expect
>>> many ULP objects then the list walk and compare will become more expensive
>>> perhaps becoming noticeable in request per second metrics.
>>>
>>> Why not just create another socktap socketopt? That will be better from
>>> complexity and likely performance sides.
>>>
>> IMO, given that there is at most two even proposed at this point I
>> don't there's much point addressing performance. When ULP feature
>> catches on and we start see a whole bunch of them then it's
>> straightforward to use a hash table or some more efficient mechanism.
>>
>
> OTOH these optimizations are usually easiest to do at the beginning. And
> building an enum of ULP types would allow removing string comparisons and
> to do simpler unsigned comparisons. I wont complain too much here though
> because this series didn't introduce the lists.
>
Hi John,

It's a tradeoff. The nice thing about using strings is that we don't
need maintain a universal enum.

A related problem is how to combine different ULPs on the same socket.
For instance, I might want to do filtering on the application layer
messages being sent over TLS (stap+kTLS ULPs). So far I don't see an
obvious way to do that. The buffering requirement for crypto seems to
convolute this some.

Tom

Re: Re:Re: Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan

2017-08-08 Thread Cong Wang

On Mon, Aug 7, 2017 at 6:10 PM, Gao Feng  wrote:
>
> Sorry, I don't get you clearly. Why the sock_hold() isn't helpful?

I already told you, the dereference happends before sock_hold().

sock = rcu_dereference(callid_sock[call_id]);
if (sock) {
opt = >proto.pptp;
if (opt->dst_addr.sin_addr.s_addr != s_addr) <=== HERE
sock = NULL;
else
sock_hold(sk_pppox(sock));
}

If we don't wait for readers properly, sock could be freed at the
same time when deference it.

> The pptp_release invokes synchronize_rcu after del_chan, it could make sure 
> the others has increased the sock refcnt if necessary
> and the lookup is over.
> There is no one could get the sock after synchronize_rcu in pptp_release.

If this were true, then this code in pptp_sock_destruct()
would be unneeded:

if (!(sk->sk_state & PPPOX_DEAD)) {
del_chan(pppox_sk(sk));
pppox_unbind_sock(sk);
}

>
>
> But I think about another problem.
> It seems the pptp_sock_destruct should not invoke del_chan and 
> pppox_unbind_sock.
> Because when the sock refcnt is 0, the pptp_release must have be invoked 
> already.
>

I don't know. Looks like sock_orphan() is only called
in pptp_release(), but I am not sure if there is a case
we call sock destructor before release.

Also note, this socket is very special, it doesn't support
poll(), sendmsg() or recvmsg()..

Re: [PATCH v3 net-next 4/5] tcp: Adjust TCP ULP to defer to sockets ULP

2017-08-08 Thread John Fastabend

On 08/07/2017 10:28 AM, Tom Herbert wrote:
> Fix TCP and TLS to use the newer ULP infrastructure in sockets.
> 
> Signed-off-by: Tom Herbert 
> ---

Looks OK to me.

Acked-by: John Fastabend

Re: [PATCH net] net: sched: fix NULL pointer dereference when action calls some targets

2017-08-08 Thread Cong Wang

On Mon, Aug 7, 2017 at 7:33 PM, Xin Long  wrote:
> On Tue, Aug 8, 2017 at 9:15 AM, Cong Wang  wrote:
>> This looks like a completely API burden?
> netfilter xt targets are not really compatible with netsched action.
> I've got to say, the patch is just a way to make checkentry return
> false and avoid panic. like [1] said

I don't doubt you fix a crash, I am thinking if we can
"fix" the API instead of fixing the caller.

I am not familiar with this API, so just my 2 cents...

Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure

2017-08-08 Thread John Fastabend

On 08/08/2017 10:04 AM, Tom Herbert wrote:
> On Tue, Aug 8, 2017 at 8:31 AM, John Fastabend  
> wrote:
>> On 08/07/2017 10:28 AM, Tom Herbert wrote:
>>> Generalize the ULP infrastructure that was recently introduced to
>>> support kTLS. This adds a SO_ULP socket option and creates new fields in
>>> sock structure for ULP ops and ULP data. Also, the interface allows
>>> additional per ULP parameters to be set so that a ULP can be pushed
>>> and operations started in one shot.
>>>
>>> In this patch set:
>>>   - Minor dependency fix in inet_common.h
>>>   - Implement ULP infrastructure as a socket mechanism
>>>   - Fixes TCP and TLS to use the new method (maintaining backwards
>>> API compatibility)
>>>   - Adds a ulp.txt document
>>>
>>> Tested: Ran simple ULP. Dave Watson verified kTLS works.
>>>
>>> -v2: Fix compilation errors when CONFIG_ULP_SOCK not set.
>>> -v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP
>>>  in tsl_init. Also, fix a couple of minor issues related to
>>>  introducing locked versions of sendmsg, send page. Thanks to
>>>  Dave Watson, John Fastabend, and Mat Martineau for testing and
>>>  providing fixes.
>>>
>>
>>
>> Hi Tom, Dave,
>>
>> I'm concerned about the performance impact of walking a list and
>> doing string compares on every socket we create with kTLS. Dave
>> do you have any request/response tests for kTLS that would put pressure
>> on the create/destroy time of this infrastructure? We should do some
>> tests with dummy entries in the ULP list to understand the impact of
>> this list walk.
>>
>> I like the underlying TCP generalized hooks, but do we really expect a
>> lot of these hooks to exist? If we only have two on the roadmap
>> (kTLS and socktap) it seems a bit overkill. Further, if we really expect
>> many ULP objects then the list walk and compare will become more expensive
>> perhaps becoming noticeable in request per second metrics.
>>
>> Why not just create another socktap socketopt? That will be better from
>> complexity and likely performance sides.
>>
> IMO, given that there is at most two even proposed at this point I
> don't there's much point addressing performance. When ULP feature
> catches on and we start see a whole bunch of them then it's
> straightforward to use a hash table or some more efficient mechanism.
> 

OTOH these optimizations are usually easiest to do at the beginning. And
building an enum of ULP types would allow removing string comparisons and
to do simpler unsigned comparisons. I wont complain too much here though
because this series didn't introduce the lists.

> Tom
> 
>> Thanks,
>> .John
>>

[PATCH net-next v2] wan: dscc4: add checks for dma mapping errors

2017-08-08 Thread Alexey Khoroshilov

The driver does not check if mapping dma memory succeed.
The patch adds the checks and failure handling.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov 
---
v2: Fix issues noted by David Miller and Francois Romieu.

 drivers/net/wan/dscc4.c | 52 +++--
 1 file changed, 37 insertions(+), 15 deletions(-)

diff --git a/drivers/net/wan/dscc4.c b/drivers/net/wan/dscc4.c
index 799830f..6a9ffac 100644
--- a/drivers/net/wan/dscc4.c
+++ b/drivers/net/wan/dscc4.c
@@ -518,23 +518,31 @@ static void dscc4_release_ring(struct dscc4_dev_priv 
*dpriv)
 static inline int try_get_rx_skb(struct dscc4_dev_priv *dpriv,
 struct net_device *dev)
 {
+   struct pci_dev *pdev = dpriv->pci_priv->pdev;
unsigned int dirty = dpriv->rx_dirty%RX_RING_SIZE;
struct RxFD *rx_fd = dpriv->rx_fd + dirty;
const int len = RX_MAX(HDLC_MAX_MRU);
struct sk_buff *skb;
-   int ret = 0;
+   dma_addr_t addr;
 
skb = dev_alloc_skb(len);
+   if (!skb)
+   goto err_out;
+
+   skb->protocol = hdlc_type_trans(skb, dev);
+   addr = pci_map_single(pdev, skb->data, len, PCI_DMA_FROMDEVICE);
+   if (pci_dma_mapping_error(pdev, addr))
+   goto err_free_skb;
+
dpriv->rx_skbuff[dirty] = skb;
-   if (skb) {
-   skb->protocol = hdlc_type_trans(skb, dev);
-   rx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev,
- skb->data, len, PCI_DMA_FROMDEVICE));
-   } else {
-   rx_fd->data = 0;
-   ret = -1;
-   }
-   return ret;
+   rx_fd->data = cpu_to_le32(addr);
+   return 0;
+
+err_free_skb:
+   dev_kfree_skb_any(skb);
+err_out:
+   rx_fd->data = 0;
+   return -1;
 }
 
 /*
@@ -1147,14 +1155,22 @@ static netdev_tx_t dscc4_start_xmit(struct sk_buff *skb,
struct dscc4_dev_priv *dpriv = dscc4_priv(dev);
struct dscc4_pci_priv *ppriv = dpriv->pci_priv;
struct TxFD *tx_fd;
+   dma_addr_t addr;
int next;
 
+   addr = pci_map_single(ppriv->pdev, skb->data, skb->len,
+ PCI_DMA_TODEVICE);
+   if (pci_dma_mapping_error(ppriv->pdev, addr)) {
+   dev_kfree_skb_any(skb);
+   dev->stats.tx_dropped++;
+   return NETDEV_TX_OK;
+   }
+
next = dpriv->tx_current%TX_RING_SIZE;
dpriv->tx_skbuff[next] = skb;
tx_fd = dpriv->tx_fd + next;
tx_fd->state = FrameEnd | TO_STATE_TX(skb->len);
-   tx_fd->data = cpu_to_le32(pci_map_single(ppriv->pdev, skb->data, 
skb->len,
-PCI_DMA_TODEVICE));
+   tx_fd->data = cpu_to_le32(addr);
tx_fd->complete = 0x;
tx_fd->jiffies = jiffies;
mb();
@@ -1887,16 +1903,22 @@ static struct sk_buff *dscc4_init_dummy_skb(struct 
dscc4_dev_priv *dpriv)
 
skb = dev_alloc_skb(DUMMY_SKB_SIZE);
if (skb) {
+   struct pci_dev *pdev = dpriv->pci_priv->pdev;
int last = dpriv->tx_dirty%TX_RING_SIZE;
struct TxFD *tx_fd = dpriv->tx_fd + last;
+   dma_addr_t addr;
 
skb->len = DUMMY_SKB_SIZE;
skb_copy_to_linear_data(skb, version,
strlen(version) % DUMMY_SKB_SIZE);
tx_fd->state = FrameEnd | TO_STATE_TX(DUMMY_SKB_SIZE);
-   tx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev,
-skb->data, DUMMY_SKB_SIZE,
-PCI_DMA_TODEVICE));
+   addr = pci_map_single(pdev, skb->data, DUMMY_SKB_SIZE,
+ PCI_DMA_TODEVICE);
+   if (pci_dma_mapping_error(pdev, addr)) {
+   dev_kfree_skb_any(skb);
+   return NULL;
+   }
+   tx_fd->data = cpu_to_le32(addr);
dpriv->tx_skbuff[last] = skb;
}
return skb;
-- 
2.7.4

Re: [PATCH net-next v2] net: ipv6: avoid overhead when no custom FIB rules are installed

2017-08-08 Thread David Ahern

On 8/8/17 12:23 PM, Vincent Bernat wrote:
> If the user hasn't installed any custom rules, don't go through the
> whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4:
> Avoid overhead when no custom FIB rules are installed).
> 
> Using a micro-benchmark module [1], timing ip6_route_output() with
> get_cycles(), with 40,000 routes in the main routing table, before this
> patch:
...
> At the frequency of the host during the bench (~ 3.7 GHz), this is
> about a 100 ns difference on the median value.
> 
> A next step would be to collapse local and main tables, as in
> 0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse).
> 
> [1]: 
> https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c
> 
> Signed-off-by: Vincent Bernat 
> Reviewed-by: Jiri Pirko 
> ---
>  include/net/netns/ipv6.h |  1 +
>  net/ipv6/fib6_rules.c| 40 +++-
>  net/ipv6/route.c |  1 +
>  3 files changed, 29 insertions(+), 13 deletions(-)
> 

LGTM.

Acked-by: David Ahern

Re: sysctl, argument parsing, possible bug

2017-08-08 Thread Stephen Hemminger

On Tue, 8 Aug 2017 20:26:36 +0200
Massimo Sala  wrote:

> I make another test with kernel 4.9.32-15.41
> 
> sysctl procps version 3.2.8
> 
> sysctl net.ipv4.conf.eth0.100.forwarding
> error: "net.ipv4.conf.eth0.100.forwarding" is an unknown key
> 
> 
> so I install busybox :
> 
> BusyBox v1.19.3
> 
> busybox sysctl net.ipv4.conf.eth0.100.forwarding
> net.ipv4.conf.eth0.100.forwarding = 0
> 
> It is working, as I expect reading busybox source sysctl.c
> 
> 
> 
> Stephen, I test
> sysctl net/ipv4/conf/eth0.100/forwarding
> I confirm it  works.
> 
> 
> What is the problem ?
> 
> As sysctl, also automation tools and scripts cannot be "netdev names
> aware", and so they fail using the usual dot notation.
> 
> 
> I don't pretend to change sysctl to read from the /proc/sys/
> directory, as busybox does.
> 
> I suggest to add a  remark to the man page of sysctl, reporting the
> difference between the two tools and an example of the alternate
> syntax :
> sysctl net/ipv4/conf/eth0.100/forwarding
> 
> 
> Thank you for your attention.
> Best regards, Massimo

Busybox has always been a restricted subset of the upstream standard tools.
If you have problems with busybox take it up with those developers directly;
this is not the right mailing list for that.

Re: [PATCH 00/35] constify net usb_device_id

2017-08-08 Thread Kalle Valo

Arvind Yadav  writes:

> usb_device_id are not supposed to change at runtime. All functions
> working with usb_device_id provided by  work with
> const usb_device_id. So mark the non-const structs as const.

[...]

>   [PATCH 16/35] wireless: ath: ar5523: constify usb_device_id
>   [PATCH 17/35] wireless: ath: ath6kl: constify usb_device_id
>   [PATCH 18/35] wireless: ath: ath9k: constify usb_device_id
>   [PATCH 19/35] wireless: ath: carl9170: constify usb_device_id
>   [PATCH 20/35] wireless: atmel: at76c50x: constify usb_device_id
>   [PATCH 21/35] wireless: broadcom: brcm80211: constify usb_device_id
>   [PATCH 22/35] wireless: intersil: orinoco: constify usb_device_id
>   [PATCH 23/35] wireless: intersil: p54: constify usb_device_id
>   [PATCH 24/35] wireless: marvell: libertas: constify usb_device_id
>   [PATCH 25/35] wireless: marvell: libertas_tf: constify usb_device_id
>   [PATCH 26/35] wireless: marvell: mwifiex: constify usb_device_id
>   [PATCH 27/35] wireless: mediatek: mt7601u: constify usb_device_id
>   [PATCH 28/35] wireless: ralink: rt2500usb: constify usb_device_id
>   [PATCH 29/35] wireless: ralink: rt2800usb: constify usb_device_id
>   [PATCH 30/35] wireless: ralink: rt73usb: constify usb_device_id
>   [PATCH 31/35] wireless: realtek: rtl8187: constify usb_device_id
>   [PATCH 32/35] wireless: realtek: rtl8xxxu: constify usb_device_id
>   [PATCH 33/35] wireless: realtek: rtl8192cu: constify usb_device_id
>   [PATCH 34/35] wireless: zydas: zd1201: constify usb_device_id
>   [PATCH 35/35] wireless: zydas: zd1211rw: constify usb_device_id

No need to put the whole path to the title, it's enough to have the name
of the driver there. For example:

[PATCH 16/35] ar5523: constify usb_device_id
[PATCH 17/35] ath6kl: constify usb_device_id
[PATCH 18/35] ath9k: constify usb_device_id

Please resubmit the wireless patches separately in their own patchset.

-- 
Kalle Valo

Re: sysctl, argument parsing, possible bug

2017-08-08 Thread Massimo Sala

I make another test with kernel 4.9.32-15.41

sysctl procps version 3.2.8

sysctl net.ipv4.conf.eth0.100.forwarding
error: "net.ipv4.conf.eth0.100.forwarding" is an unknown key


so I install busybox :

BusyBox v1.19.3

busybox sysctl net.ipv4.conf.eth0.100.forwarding
net.ipv4.conf.eth0.100.forwarding = 0

It is working, as I expect reading busybox source sysctl.c



Stephen, I test
sysctl net/ipv4/conf/eth0.100/forwarding
I confirm it  works.


What is the problem ?

As sysctl, also automation tools and scripts cannot be "netdev names
aware", and so they fail using the usual dot notation.


I don't pretend to change sysctl to read from the /proc/sys/
directory, as busybox does.

I suggest to add a  remark to the man page of sysctl, reporting the
difference between the two tools and an example of the alternate
syntax :
sysctl net/ipv4/conf/eth0.100/forwarding


Thank you for your attention.
Best regards, Massimo

[PATCH net-next v2] net: ipv6: avoid overhead when no custom FIB rules are installed

2017-08-08 Thread Vincent Bernat

If the user hasn't installed any custom rules, don't go through the
whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4:
Avoid overhead when no custom FIB rules are installed).

Using a micro-benchmark module [1], timing ip6_route_output() with
get_cycles(), with 40,000 routes in the main routing table, before this
patch:

min=606 max=12911 count=627 average=1959 95th=4903 90th=3747 50th=1602 
mad=821
table=254 avgdepth=21.8 maxdepth=39
value │ ┊count
  600 │▒▒▒ 199
  880 │▒▒▒  43
 1160 │▒▒▒  48
 1440 │▒▒▒░░░   43
 1720 │░░░  59
 2000 │▒▒▒  50
 2280 │▒▒░░░26
 2560 │▒▒░  31
 2840 │▒▒   28
 3120 │▒░░  17
 3400 │▒░░░ 17
 3680 │░ 8
 3960 │░░   11
 4240 │░░6
 4520 │░░░   6
 4800 │░░░   9

After:

min=544 max=11687 count=627 average=1776 95th=4546 90th=3585 50th=1227 
mad=565
table=254 avgdepth=21.8 maxdepth=39
value │ ┊count
  540 │201
  800 │▒63
 1060 │▒░   68
 1320 │▒▒▒░░39
 1580 │▒▒░░ 32
 1840 │▒▒   32
 2100 │▒▒░░░34
 2360 │▒▒░░ 33
 2620 │▒▒   26
 2880 │▒░░  22
 3140 │  9
 3400 │░ 8
 3660 │░ 9
 3920 │░░8
 4180 │░░░   8
 4440 │░░░   8

At the frequency of the host during the bench (~ 3.7 GHz), this is
about a 100 ns difference on the median value.

A next step would be to collapse local and main tables, as in
0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse).

[1]: 
https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c

Signed-off-by: Vincent Bernat 
Reviewed-by: Jiri Pirko 
---
 include/net/netns/ipv6.h |  1 +
 net/ipv6/fib6_rules.c| 40 +++-
 net/ipv6/route.c |  1 +
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index abdf3b40303b..0e50bf3ed097 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -65,6 +65,7 @@ struct netns_ipv6 {
unsigned int ip6_rt_gc_expire;
unsigned longip6_rt_last_gc;
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
+   bool fib6_has_custom_rules;
struct rt6_info *ip6_prohibit_entry;
struct rt6_info *ip6_blk_hole_entry;
struct fib6_table   *fib6_local_tbl;
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 2f29e4e33bd3..b240f24a6e52 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -63,19 +63,32 @@ unsigned int fib6_rules_seq_read(struct net *net)
 struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
   int flags, pol_lookup_t lookup)
 {
-   struct fib_lookup_arg arg = {
-   .lookup_ptr = lookup,
-   .flags = FIB_LOOKUP_NOREF,
-   };
-
-   /* update flow if oif or iif point to device enslaved to l3mdev */
-   l3mdev_update_flow(net, flowi6_to_flowi(fl6));
-
-   fib_rules_lookup(net->ipv6.fib6_rules_ops,
-flowi6_to_flowi(fl6), flags, );
-
-   if (arg.result)
-   return arg.result;
+   if

[PATCH net] net: avoid skb_warn_bad_offload false positives on UFO

2017-08-08 Thread Willem de Bruijn

From: Willem de Bruijn 

skb_warn_bad_offload triggers a warning when an skb enters the GSO
stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
checksum offload set.

Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
observed that SKB_GSO_DODGY producers can trigger the check and
that passing those packets through the GSO handlers will fix it
up. But, the software UFO handler will set ip_summed to
CHECKSUM_NONE.

When __skb_gso_segment is called from the receive path, this
triggers the warning again.

Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
Tx these two are equivalent. On Rx, this better matches the
skb state (checksum computed), as CHECKSUM_NONE here means no
checksum computed.

See also this thread for context:
http://patchwork.ozlabs.org/patch/799015/

Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
Signed-off-by: Willem de Bruijn 
---
 net/core/dev.c | 2 +-
 net/ipv4/udp_offload.c | 2 +-
 net/ipv6/udp_offload.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8515f8fe0460..ce15a06d5558 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2739,7 +2739,7 @@ static inline bool skb_needs_check(struct sk_buff *skb, 
bool tx_path)
 {
if (tx_path)
return skb->ip_summed != CHECKSUM_PARTIAL &&
-  skb->ip_summed != CHECKSUM_NONE;
+  skb->ip_summed != CHECKSUM_UNNECESSARY;
 
return skb->ip_summed == CHECKSUM_NONE;
 }
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 781250151d40..0932c85b42af 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -235,7 +235,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
if (uh->check == 0)
uh->check = CSUM_MANGLED_0;
 
-   skb->ip_summed = CHECKSUM_NONE;
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 
/* If there is no outer header we can fake a checksum offload
 * due to the fact that we have already done the checksum in
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index a2267f80febb..e7d378c032cb 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -72,7 +72,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
if (uh->check == 0)
uh->check = CSUM_MANGLED_0;
 
-   skb->ip_summed = CHECKSUM_NONE;
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
 
/* If there is no outer header we can fake a checksum offload
 * due to the fact that we have already done the checksum in
-- 
2.14.0.rc1.383.gd1ce394fe2-goog

Re: [PATCH net] tcp: fastopen: tcp_connect() must refresh the route

2017-08-08 Thread Yuchung Cheng

On Tue, Aug 8, 2017 at 1:41 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> With new TCP_FASTOPEN_CONNECT socket option, there is a possibility
> to call tcp_connect() while socket sk_dst_cache is either NULL
> or invalid.
>
>  +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>  +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>  +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
>  +0 connect(4, ..., ...) = 0
>
> << sk->sk_dst_cache becomes obsolete, or even set to NULL >>
>
>  +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>
>
> We need to refresh the route otherwise bad things can happen,
> especially when syzkaller is running on the host :/
>
> Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Eric Dumazet 
> Cc: Wei Wang 
> Cc: Yuchung Cheng 
> ---
Acked-by: Yuchung Cheng 

Thanks for the fix!

>  net/ipv4/tcp_output.c |4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 276406a83a37..b7661a68d498 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3436,6 +3436,10 @@ int tcp_connect(struct sock *sk)
> int err;
>
> tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB);
> +
> +   if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
> +   return -EHOSTUNREACH; /* Routing failure or similar. */
> +
> tcp_connect_init(sk);
>
> if (unlikely(tp->repair)) {
>
>

Re: [PATCH] net: Reduce skb_warn_bad_offload() noise.

2017-08-08 Thread Willem de Bruijn

>> @@ -2670,6 +2670,7 @@ static inline bool skb_needs_check(struct
>> sk_buff *skb, bool tx_path)
>>  {
>> if (tx_path)
>> return skb->ip_summed != CHECKSUM_PARTIAL &&
>> +  skb->ip_summed != CHECKSUM_UNNECESSARY &&
>>skb->ip_summed != CHECKSUM_NONE;
>
> Good catch. Only, the CHECKSUM_NONE case was added specifically to
> work around this UFO issue on the tx path in commit 6e7bc478c9a0
> ("net: skb_needs_check() accepts CHECKSUM_NONE for tx"). If we change
> the value generated by UFO, we can remove that statement, so
>
> +  skb->ip_summed != CHECKSUM_UNNECESSARY;
> -skb->ip_summed != CHECKSUM_NONE;
>
> Else the entire check becomes a NOOP. These are the only three valid
> states on tx. With very few codepaths generating CHECKSUM_UNNECESSARY
> to begin with, it arguably already is practically a NOOP. I need to
> look more closely what the statement is intended to protect against,
> before we relax it even further.

On transmit, packets entering skb_gso_segment are expected to always
have ip_summed CHECKSUM_PARTIAL. This check was added to track down
unexpected exceptions in commit 67fd1a731ff1 ("net: Add debug info to
track down GSO checksum bug").

Only when called for the second time, after skb_mac_gso_segment, do we
have to possibly handle the case where the GSO layer computes the
checksum and changes ip_summed.

Since this only goes into 4.11 to 4.13, making two separate
skb_needs_check variants for these two call sites seems overkill. I
will send the simple fix to convert CHECKSUM_NONE to
CHECKSUM_UNNECESSARY.

As a side effect of removing UFO in 4.14-rc1, we can also revert
commit 6e7bc478c9a0 ("net: skb_needs_check() accepts CHECKSUM_NONE for
tx") in net-next.

Re: [PATCH net-next] ibmvnic: Add netdev_dbg output for debugging

2017-08-08 Thread Nathan Fontenot

On 08/08/2017 11:27 AM, Stephen Hemminger wrote:
> On Mon, 07 Aug 2017 15:02:58 -0400
> Nathan Fontenot  wrote:
> 
>> To ease debugging of the ibmvnic driver add a series of netdev_dbg()
>> statements to track driver status, especially during initialization,
>> removal, and resetting of the driver.
>>
>> Signed-off-by: Nathan Fontenot 
> 
> Maybe use netif_dbg() and the message type stuff.

Oh! I like this even more. This just begs to update all of the message
reporting in the driver though.

Let's scrap this patch. I'll work on a new patchset based on using
the netif_*() infrastructure.

-Nathan

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-08 Thread Yuchung Cheng

On Mon, Aug 7, 2017 at 11:16 AM, Rao Shoaib  wrote:
> Change from version 0: Rationale behind the change:
>
> The man page for tcp(7) states
>
> when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
> override keepalive to  determine  when to close a connection due to keepalive
> failure.
>
> This is ambigious at best. user expectation is most likely that the connection
> will be reset after TCP_USER_TIMEOUT milliseconds of inactivity.
ccing the original author Jerry Chu who can tell more.

>
> The code however waits for the keepalive to kick-in (default 2hrs) and than
> after one failure resets the conenction.
>
> What is the rationale for that ? The same effect can be obtained by simply
> changing the value of tcp_keep_alive_probes.
>
> Since the TCP_USER_TIMEOUT option was added based on RFC 5482 we need to 
> follow
> the RFC. Which states
>
> 4.2 TCP keep-Alives:
>Some TCP implementations, such as those in BSD systems, use a
>different abort policy for TCP keep-alives than for user data.  Thus,
>the TCP keep-alive mechanism might abort a connection that would
>otherwise have survived the transient period without connectivity.
>Therefore, if a connection that enables keep-alives is also using the
>TCP User Timeout Option, then the keep-alive timer MUST be set to a
>value larger than that of the adopted USER TIMEOUT.
>
> This patch enforces the MUST and also dis-associates user timeout from keep
> alive.  A man page patch will be submitted separately.
>
> Signed-off-by: Rao Shoaib 
> ---
>  net/ipv4/tcp.c   | 10 --
>  net/ipv4/tcp_timer.c |  9 +
>  2 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 71ce33d..f2af44d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2628,7 +2628,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> break;
>
> case TCP_KEEPIDLE:
> -   if (val < 1 || val > MAX_TCP_KEEPIDLE)
> +   /* Per RFC5482 keepalive_time must be > user_timeout */
> +   if (val < 1 || val > MAX_TCP_KEEPIDLE ||
> +   ((val * HZ) <= icsk->icsk_user_timeout))
> err = -EINVAL;
> else {
> tp->keepalive_time = val * HZ;
> @@ -2724,8 +2726,12 @@ static int do_tcp_setsockopt(struct sock *sk, int 
> level,
> case TCP_USER_TIMEOUT:
> /* Cap the max time in ms TCP will retry or probe the window
>  * before giving up and aborting (ETIMEDOUT) a connection.
> +* Per RFC5482 TCP user timeout must be < keepalive_time.
> +* If the default value changes later -- all bets are off.
>  */
> -   if (val < 0)
> +   if (val < 0 || (tp->keepalive_time &&
> +   tp->keepalive_time <= msecs_to_jiffies(val)) 
> ||
> +  net->ipv4.sysctl_tcp_keepalive_time <= 
> msecs_to_jiffies(val))
> err = -EINVAL;
> else
> icsk->icsk_user_timeout = msecs_to_jiffies(val);
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index c0f..d39fe60 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -664,14 +664,7 @@ static void tcp_keepalive_timer (unsigned long data)
> elapsed = keepalive_time_elapsed(tp);
>
> if (elapsed >= keepalive_time_when(tp)) {
> -   /* If the TCP_USER_TIMEOUT option is enabled, use that
> -* to determine when to timeout instead.
> -*/
> -   if ((icsk->icsk_user_timeout != 0 &&
> -   elapsed >= icsk->icsk_user_timeout &&
> -   icsk->icsk_probes_out > 0) ||
> -   (icsk->icsk_user_timeout == 0 &&
> -   icsk->icsk_probes_out >= keepalive_probes(tp))) {
> +   if (icsk->icsk_probes_out >= keepalive_probes(tp)) {
> tcp_send_active_reset(sk, GFP_ATOMIC);
> tcp_write_err(sk);
> goto out;
> --
> 2.7.4
>

Re: [PATCH v3 net-next 3/5] sock: ULP infrastructure

On Tue, Aug 8, 2017 at 9:38 AM, Hannes Frederic Sowa
 wrote:
> Tom Herbert  writes:
>
>> +#ifdef CONFIG_MODULES
>> + if (!ulp && capable(CAP_NET_ADMIN)) {
>> + rcu_read_unlock();
>> + request_module("%s", name);
>> + rcu_read_lock();
>> + ulp = ulp_find(name);
>> + }
>> +#endif
>
> It looks to me that this allows users with only CAP_NET_ADMIN
> privileges to load every module?

It's a carryover. Probably should remove the check.

Tom

Re: [PATCH v3 net-next 0/5] ulp: Generalize ULP infrastructure

On Tue, Aug 8, 2017 at 8:31 AM, John Fastabend  wrote:
> On 08/07/2017 10:28 AM, Tom Herbert wrote:
>> Generalize the ULP infrastructure that was recently introduced to
>> support kTLS. This adds a SO_ULP socket option and creates new fields in
>> sock structure for ULP ops and ULP data. Also, the interface allows
>> additional per ULP parameters to be set so that a ULP can be pushed
>> and operations started in one shot.
>>
>> In this patch set:
>>   - Minor dependency fix in inet_common.h
>>   - Implement ULP infrastructure as a socket mechanism
>>   - Fixes TCP and TLS to use the new method (maintaining backwards
>> API compatibility)
>>   - Adds a ulp.txt document
>>
>> Tested: Ran simple ULP. Dave Watson verified kTLS works.
>>
>> -v2: Fix compilation errors when CONFIG_ULP_SOCK not set.
>> -v3: Fix one more build issue, check that sk_protocol is IPPROTO_TCP
>>  in tsl_init. Also, fix a couple of minor issues related to
>>  introducing locked versions of sendmsg, send page. Thanks to
>>  Dave Watson, John Fastabend, and Mat Martineau for testing and
>>  providing fixes.
>>
>
>
> Hi Tom, Dave,
>
> I'm concerned about the performance impact of walking a list and
> doing string compares on every socket we create with kTLS. Dave
> do you have any request/response tests for kTLS that would put pressure
> on the create/destroy time of this infrastructure? We should do some
> tests with dummy entries in the ULP list to understand the impact of
> this list walk.
>
> I like the underlying TCP generalized hooks, but do we really expect a
> lot of these hooks to exist? If we only have two on the roadmap
> (kTLS and socktap) it seems a bit overkill. Further, if we really expect
> many ULP objects then the list walk and compare will become more expensive
> perhaps becoming noticeable in request per second metrics.
>
> Why not just create another socktap socketopt? That will be better from
> complexity and likely performance sides.
>
IMO, given that there is at most two even proposed at this point I
don't there's much point addressing performance. When ULP feature
catches on and we start see a whole bunch of them then it's
straightforward to use a hash table or some more efficient mechanism.

Tom

> Thanks,
> .John
>

Re: [PATCH net] tcp: fastopen: tcp_connect() must refresh the route

2017-08-08 Thread Wei Wang

On Tue, Aug 8, 2017 at 1:41 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> With new TCP_FASTOPEN_CONNECT socket option, there is a possibility
> to call tcp_connect() while socket sk_dst_cache is either NULL
> or invalid.
>
>  +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>  +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>  +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
>  +0 connect(4, ..., ...) = 0
>
> << sk->sk_dst_cache becomes obsolete, or even set to NULL >>
>
>  +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>
>
> We need to refresh the route otherwise bad things can happen,
> especially when syzkaller is running on the host :/
>
> Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Eric Dumazet 
> Cc: Wei Wang 
> Cc: Yuchung Cheng 
> ---

Thanks a lot for the fix, Eric.

Acked-by: Wei Wang

Re: [PATCH V3 net-next 03/21] net-next/hinic: Initialize api cmd resources

From: Aviad Krawczyk 
Date: Tue, 8 Aug 2017 19:23:32 +0300

> Is it a new rule?

It's at least 10 years old.

> We can fix it, but I can look over the Linux Kernel source and there are
> a lot of examples that have the same problem.

Stop right there.

Just because there is code in the kernel with a problem doesn't mean
you can submit new code with that problem.

[PATCH] isdn: hisax: hfc_usb: constify usb_device_id

usb_device_id are not supposed to change at runtime. All functions
working with usb_device_id provided by  work with
const usb_device_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/hfc_usb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/hfc_usb.c b/drivers/isdn/hisax/hfc_usb.c
index ef47480..e821218 100644
--- a/drivers/isdn/hisax/hfc_usb.c
+++ b/drivers/isdn/hisax/hfc_usb.c
@@ -65,7 +65,7 @@ typedef struct {
 } hfcsusb_vdata;
 
 /* VID/PID device list */
-static struct usb_device_id hfcusb_idtab[] = {
+static const struct usb_device_id hfcusb_idtab[] = {
{
USB_DEVICE(0x0959, 0x2bd0),
.driver_info = (unsigned long) &((hfcsusb_vdata)
-- 
2.7.4

[PATCH] isdn: hfcsusb: constify usb_device_id

usb_device_id are not supposed to change at runtime. All functions
working with usb_device_id provided by  work with
const usb_device_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hardware/mISDN/hfcsusb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hardware/mISDN/hfcsusb.h 
b/drivers/isdn/hardware/mISDN/hfcsusb.h
index 4157311..5f8f1d9 100644
--- a/drivers/isdn/hardware/mISDN/hfcsusb.h
+++ b/drivers/isdn/hardware/mISDN/hfcsusb.h
@@ -337,7 +337,7 @@ static const char 
*HFC_NT_LAYER1_STATES[HFC_MAX_NT_LAYER1_STATE + 1] = {
 };
 
 /* supported devices */
-static struct usb_device_id hfcsusb_idtab[] = {
+static const struct usb_device_id hfcsusb_idtab[] = {
{
USB_DEVICE(0x0959, 0x2bd0),
.driver_info = (unsigned long) &((struct hfcsusb_vdata)
-- 
2.7.4

Re: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk

From: Daniel Borkmann 
Date: Tue, 08 Aug 2017 10:46:52 +0200

> On 08/08/2017 12:25 AM, James Hogan wrote:
>> In bpf_trace_printk(), the elements in mod[] are left uninitialised,
>> but
>> they are then incremented to track the width of the formats. Zero
>> initialise the array just in case the memory contains non-zero values
>> on
>> entry.
>>
>> Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call
>> bpf_trace_printk()")
>> Signed-off-by: James Hogan 
>> Cc: Alexei Starovoitov 
>> Cc: Daniel Borkmann 
>> Cc: Steven Rostedt 
>> Cc: Ingo Molnar 
>> Cc: netdev@vger.kernel.org
>> ---
>> When I checked (on MIPS32), the elements tended to have the value zero
>> anyway (does BPF zero the stack or something clever?), so this is a
>> purely theoretical fix.
>> ---
>>   kernel/trace/bpf_trace.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index 32dcbe1b48f2..86a52857d941 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -129,7 +129,7 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32,
>> fmt_size, u64, arg1,
>> u64, arg2, u64, arg3)
>>   {
>>  bool str_seen = false;
>> -int mod[3] = {};
>> +int mod[3] = { 0, 0, 0 };
> 
> I'm probably missing something, but is the behavior of gcc wrt
> above initializers different on mips (it zeroes just fine on x86
> at least)? If yes, we'd probably need a cocci script to also check
> rest of the kernel given this is used in a number of places. Hm,
> could you elaborate?

This change is not necessary at all.

An empty initializer must clear the whole object to zero.

"theoretical" fix indeed... :-(

Re: [PATCH net] rds: Reintroduce statistics counting

2017-08-08 Thread Santosh Shilimkar


On 8/8/2017 2:13 AM, Håkon Bugge wrote:

In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection
while senders are present"), refilling the receive queue was removed
from rds_ib_recv(), along with the increment of
s_ib_rx_refill_from_thread.

Commit 73ce4317bf98 ("RDS: make sure we post recv buffers")
re-introduces filling the receive queue from rds_ib_recv(), but does
not add the statistics counter. rds_ib_recv() was later renamed to
rds_ib_recv_path().

This commit reintroduces the statistics counting of
s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
Reviewed-by: Wei Lin Guay 
---

Looks fine.
Acked-by: Santosh Shilimkar

Re: [PATCH v3 net-next 3/5] sock: ULP infrastructure

2017-08-08 Thread Hannes Frederic Sowa

Tom Herbert  writes:

> +#ifdef CONFIG_MODULES
> + if (!ulp && capable(CAP_NET_ADMIN)) {
> + rcu_read_unlock();
> + request_module("%s", name);
> + rcu_read_lock();
> + ulp = ulp_find(name);
> + }
> +#endif

It looks to me that this allows users with only CAP_NET_ADMIN
privileges to load every module?

Re: [PATCH net-next] ibmvnic: Add netdev_dbg output for debugging

2017-08-08 Thread Stephen Hemminger

On Mon, 07 Aug 2017 15:02:58 -0400
Nathan Fontenot  wrote:

> To ease debugging of the ibmvnic driver add a series of netdev_dbg()
> statements to track driver status, especially during initialization,
> removal, and resetting of the driver.
> 
> Signed-off-by: Nathan Fontenot 

Maybe use netif_dbg() and the message type stuff.

Re: [PATCH V3 net-next 03/21] net-next/hinic: Initialize api cmd resources

2017-08-08 Thread Aviad Krawczyk

Hi David,

Is it a new rule?

We can fix it, but I can look over the Linux Kernel source and there are
a lot of examples that have the same problem. I can see even code that inserted
to 4.10 with the same problem.

Thanks for your time and review,
Aviad

On 8/4/2017 1:35 AM, David Miller wrote:
> From: Aviad Krawczyk 
> Date: Thu, 3 Aug 2017 17:54:09 +0800
> 
>> +static int alloc_cmd_buf(struct hinic_api_cmd_chain *chain,
>> + struct hinic_api_cmd_cell *cell, int cell_idx)
>> +{
>> +struct hinic_hwif *hwif = chain->hwif;
>> +struct pci_dev *pdev = hwif->pdev;
>> +struct hinic_api_cmd_cell_ctxt *cell_ctxt;
>> +dma_addr_t cmd_paddr;
>> +u8 *cmd_vaddr;
>> +int err = 0;
> 
> Order local variables from longest to shortest line.
> 
>> +static int api_cmd_create_cell(struct hinic_api_cmd_chain *chain,
>> +   int cell_idx,
>> +   struct hinic_api_cmd_cell *pre_node,
>> +   struct hinic_api_cmd_cell **node_vaddr)
>> +{
>> +struct hinic_hwif *hwif = chain->hwif;
>> +struct pci_dev *pdev = hwif->pdev;
>> +struct hinic_api_cmd_cell_ctxt *cell_ctxt;
>> +struct hinic_api_cmd_cell *node;
>> +dma_addr_t node_paddr;
>> +int err;
> 
> Likewise.
>> +static void api_cmd_destroy_cell(struct hinic_api_cmd_chain *chain,
>> + int cell_idx)
>> +{
>> +struct hinic_hwif *hwif = chain->hwif;
>> +struct pci_dev *pdev = hwif->pdev;
>> +struct hinic_api_cmd_cell_ctxt *cell_ctxt;
>> +struct hinic_api_cmd_cell *node;
>> +dma_addr_t node_paddr;
>> +size_t node_size;
> 
> Likewise.
> 
> etc. etc. etc.
> 
> Please audit your entire submission for this problem.
> 
> Thanks.
> 
> .
>

Question about via-ircc.ko

2017-08-08 Thread Anton Volkov


Hello.

While searching for races in the Linux kernel I've come across 
"drivers/net/irda/via-ircc.ko" module. Here are questions that I came up 
with while analyzing results. Lines are given using the info from Linux 
v4.12.


Consider the following case:

Thread 1:Thread 2:
via_ircc_net_open
  request_irq
  
 via_ircc_interrupt
-> via_ircc_dma_receive  -> RxTimerHandler
   (via-ircc.c: line 1488)  (via-ircc.c: line 1315)
 self->... = ...  ... = self->...

In the via_ircc_dma_receive a lot of fields of 'self' structure are 
initialized and via_ircc_interrupt with RxTimerHandler use those fields. 
If no initialization happened interrupt handler and other functions that 
it calls may work with incorrect data. I'm not sure how bad this case 
can be and thus here are my questions. Is this situation feasible from 
your point of view? If it is feasible, is it a benign race or something 
serious?


Thank you for your time.

-- Anton Volkov
Linux Verification Center, ISPRAS
web: http://linuxtesting.org
e-mail: avol...@ispras.ru

Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets

2017-08-08 Thread Saeed Mahameed

On Thu, Aug 3, 2017 at 11:54 PM, Davide Caratti  wrote:
> if the NIC fails to validate the checksum on TCP/UDP, and validation of IP
> checksum is successful, the driver subtracts the pseudo-header checksum
> from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't
> do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails.
>
> V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set
>
> Reported-by: Shuang Li 
> Fixes: f8c6455bb04b ("net/mlx4_en: Extend checksum offloading by CHECKSUM 
> COMPLETE")
> Signed-off-by: Davide Caratti 

Acked-by: Saeed Mahameed

Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets

2017-08-08 Thread Saeed Mahameed

On Tue, Aug 8, 2017 at 7:06 PM, Davide Caratti  wrote:
> On Tue, 2017-08-08 at 18:07 +0300, Saeed Mahameed wrote:
>> >   {
>> >  __u16 length_for_csum = 0;
>> >  __wsum csum_pseudo_header = 0;
>> > +   __u8 ipproto = iph->protocol;
>> > +
>> > +   if (unlikely(ipproto == IPPROTO_SCTP))
>> > +   return -1;
>> >
>>
>> Hi Davide
>>
>
> hi Saeed,
>
> thank you for looking at this!
>
>> If you got to here then it means this is a non UDP/TCP ipv4 packet and
>> the HW failed to validate it's checksum but
>> you get from the connectx3 HW a 1's complement 16-bit sum of IP
>> Payload + IP pseudo-header.
>> so if you return -1 here the driver will report checksum none for this
>> packet (and you will abandon any checsum offload/help from HW).
>>
>> I am not an SCTP expert but it seems that you decided here that
>> connectX3 hw checksum (described above) can't be used to calculate
>> SCTP packet checksum
>> is that correct?
>>
> Yes, that's correct. SCTP uses CRC32c, not 1's complement (and does not use
> pseudo-headers): therefore, the checksum computed by the NIC hardware can't
> be used.
>
> The issue I observed is skb->ip_summed set to CHECKSUM_COMPLETE, that made
> CRC32c validation fail in my setup (that was a netfilter REJECT rule, matching
> SCTP packets). AFAIK, CHECKSUM_COMPLETE is valid only for the Internet 
> Checksum;
> setting CHECKSUM_NONE on rx packets carrying IPPROTO_SCTP fixed my test 
> scenario.
>
>> If so, then i am ok with this patch.
>
> I planned to post this some weeks ago, after agreeing v2 with Tariq
> (https://www.spinics.net/lists/netdev/msg441231.html), but took some time to
> find a ConnectX-3 (from what I saw the issue is not present on ConnectX-3 Pro,
> since it has MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP bit set to 0).
>

Thanks for the clarification, I will ack the patch.

> regards,
> --
> davide
>

[PATCH 04/35] net: irda: irda-usb: constify usb_device_id

usb_device_id are not supposed to change at runtime. All functions
working with usb_device_id provided by  work with
const usb_device_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav 
---
 drivers/net/irda/irda-usb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/irda/irda-usb.c b/drivers/net/irda/irda-usb.c
index 6f3c805..723e49b 100644
--- a/drivers/net/irda/irda-usb.c
+++ b/drivers/net/irda/irda-usb.c
@@ -72,7 +72,7 @@
 static int qos_mtt_bits = 0;
 
 /* These are the currently known IrDA USB dongles. Add new dongles here */
-static struct usb_device_id dongles[] = {
+static const struct usb_device_id dongles[] = {
/* ACTiSYS Corp.,  ACT-IR2000U FIR-USB Adapter */
{ USB_DEVICE(0x9c4, 0x011), .driver_info = IUC_SPEED_BUG | 
IUC_NO_WINDOW },
/* Look like ACTiSYS, Report : IBM Corp., IBM UltraPort IrDA */
-- 
2.7.4

[PATCH 05/35] net: irda: kingsun: constify usb_device_id

usb_device_id are not supposed to change at runtime. All functions
working with usb_device_id provided by  work with
const usb_device_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav 
---
 drivers/net/irda/kingsun-sir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/irda/kingsun-sir.c b/drivers/net/irda/kingsun-sir.c
index 24c0f16..4fd4ac2 100644
--- a/drivers/net/irda/kingsun-sir.c
+++ b/drivers/net/irda/kingsun-sir.c
@@ -85,7 +85,7 @@
 #define KING_PRODUCT_ID 0x4200
 
 /* These are the currently known USB ids */
-static struct usb_device_id dongles[] = {
+static const struct usb_device_id dongles[] = {
 /* KingSun Co,Ltd  IrDA/USB Bridge */
 { USB_DEVICE(KING_VENDOR_ID, KING_PRODUCT_ID) },
 { }
-- 
2.7.4

[PATCH 07/35] net: irda: ksdazzle: constify usb_device_id