date:20171227

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Masami Hiramatsu

On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 6:34 PM, Masami Hiramatsu wrote:
> > On Wed, 27 Dec 2017 14:46:24 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On 12/26/17 9:56 PM, Masami Hiramatsu wrote:
> >>> On Tue, 26 Dec 2017 17:57:32 -0800
> >>> Alexei Starovoitov  wrote:
> >>>
>  On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:
> > Check whether error injectable event is on function entry or not.
> > Currently it checks the event is ftrace-based kprobes or not,
> > but that is wrong. It should check if the event is on the entry
> > of target function. Since error injection will override a function
> > to just return with modified return value, that operation must
> > be done before the target function starts making stackframe.
> >
> > As a side effect, bpf error injection is no need to depend on
> > function-tracer. It can work with sw-breakpoint based kprobe
> > events too.
> >
> > Signed-off-by: Masami Hiramatsu 
> > ---
> >  kernel/trace/Kconfig|2 --
> >  kernel/trace/bpf_trace.c|6 +++---
> >  kernel/trace/trace_kprobe.c |8 +---
> >  kernel/trace/trace_probe.h  |   12 ++--
> >  4 files changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index ae3a2d519e50..6400e1bf97c5 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -533,9 +533,7 @@ config FUNCTION_PROFILER
> >  config BPF_KPROBE_OVERRIDE
> > bool "Enable BPF programs to override a kprobed function"
> > depends on BPF_EVENTS
> > -   depends on KPROBES_ON_FTRACE
> > depends on HAVE_KPROBE_OVERRIDE
> > -   depends on DYNAMIC_FTRACE_WITH_REGS
> > default n
> > help
> >  Allows BPF to override the execution of a probed function and
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index f6d2327ecb59..d663660f8392 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event 
> > *event,
> > int ret = -EEXIST;
> >
> > /*
> > -* Kprobe override only works for ftrace based kprobes, and 
> > only if they
> > -* are on the opt-in list.
> > +* Kprobe override only works if they are on the function entry,
> > +* and only if they are on the opt-in list.
> >  */
> > if (prog->kprobe_override &&
> > -   (!trace_kprobe_ftrace(event->tp_event) ||
> > +   (!trace_kprobe_on_func_entry(event->tp_event) ||
> >  !trace_kprobe_error_injectable(event->tp_event)))
> > return -EINVAL;
> >
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index 91f4b57dab82..265e3e27e8dc 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long 
> > trace_kprobe_nhit(struct trace_kprobe *tk)
> > return nhit;
> >  }
> >
> > -int trace_kprobe_ftrace(struct trace_event_call *call)
> > +bool trace_kprobe_on_func_entry(struct trace_event_call *call)
> >  {
> > struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> > -   return kprobe_ftrace(>rp.kp);
> > +
> > +   return kprobe_on_func_entry(tk->rp.kp.addr, 
> > tk->rp.kp.symbol_name,
> > +   tk->rp.kp.offset);
> 
>  That would be nice, but did you test this?
> >>>
> >>> Yes, because the jprobe, which was only official user of modifying 
> >>> execution
> >>> path using kprobe, did same way to check. (and kretprobe also does it)
> >>>
>  My understanding that kprobe will restore all regs and
>  here we need to override return ip _and_ value.
> >>>
> >>> yes, no problem. kprobe restore all regs from pt_regs, including regs->ip.
> >>>
>  Could you add a patch with the test the way Josef did
>  or describe the steps to test this new mode?
> >>>
> >>> Would you mean below patch? If so, it should work without any change.
> >>>
> >>>  [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return
> >>
> >> yeah. I expect bpf_override_return test to work as-is.
> >> I'm asking for the test for new functionality added by this patch.
> >> In particular kprobe on func entry without ftrace.
> >> How did you test it?
> >
> > This function is used in kretprobe and jprobe. Jprobe was the user of
> > "modifying instruction pointer to another function" in kprobes.
> > If it doesn't work, jprobe also doesn't work, this means you can not
> > modify IP by

Re: WARNING in strp_data_ready

2017-12-27 Thread Dmitry Vyukov

On Thu, Dec 28, 2017 at 2:19 AM, Tom Herbert  wrote:
> On Wed, Dec 27, 2017 at 12:20 PM, Ozgur  wrote:
>>
>>
>> 27.12.2017, 23:14, "Dmitry Vyukov" :
>>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur  wrote:
  27.12.2017, 22:21, "Dmitry Vyukov" :
>  On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert  
> wrote:
>>   Did you try the patch I posted?
>
>  Hi Tom,

  Hello Dmitry,

>  No. And I didn't know I need to. Why?
>  If you think the patch needs additional testing, you can ask syzbot to
>  test it. See 
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>  Otherwise proceed with committing it. Or what are we waiting for?
>
>  Thanks

  I think we need to fixed patch for crash, in fact check to patch code and 
 test solve the bug.
  How do test it because there is no patch in the following bug?
>>>
>>> Hi Ozgur,
>>>
>>> I am not sure I completely understand what you mean. But the
>>> reproducer for this bug (which one can use for testing) is here:
>>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
>>> Tom also mentions there is some patch for this, but I don't know where
>>> it is, it doesn't seem to be referenced from this thread.
>>
>> Hello Dmitry,
>>
>> Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :)
>> I think Tom send patch to only you and are you tested?
>>
>> kcmsock.c will change and strp_data_ready I think locked.
>>
>> Tom, please send a patch for me? I can test and inform you.
>>
> Hi Ozgur,
>
> I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you 
> can!

OK, I will work as your typist this time:

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
master

But I wonder what part of the following you don't understand? Do we
need to improve wording or something?

> If you think the patch needs additional testing, you can ask syzbot to test 
> it.
> See 
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot

Also I don't know what git repo/branch you have in mind. Kernel
patches don't generally apply just to any tree. Fingers crossed that I
guessed correctly and it will apply.
diff --git a/include/net/sock.h b/include/net/sock.h
index 9155da422692..7a7b14e9628a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1514,6 +1514,11 @@ static inline bool sock_owned_by_user(const struct sock *sk)
 	return sk->sk_lock.owned;
 }
 
+static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
+{
+	return sk->sk_lock.owned;
+}
+
 /* no reclassification while locks are held */
 static inline bool sock_allow_reclassification(const struct sock *csk)
 {
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index c5fda15ba319..1fdab5c4eda8 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp)
 	 * allows a thread in BH context to safely check if the process
 	 * lock is held. In this case, if the lock is held, queue work.
 	 */
-	if (sock_owned_by_user(strp->sk)) {
+	if (sock_owned_by_user_nocheck(strp->sk)) {
 		queue_work(strp_wq, >work);
 		return;
 	}

Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Masami Hiramatsu

On Wed, 27 Dec 2017 19:49:28 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 5:38 PM, Masami Hiramatsu wrote:
> > On Wed, 27 Dec 2017 14:49:46 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On 12/27/17 12:09 AM, Masami Hiramatsu wrote:
> >>> On Tue, 26 Dec 2017 18:12:56 -0800
> >>> Alexei Starovoitov  wrote:
> >>>
>  On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
> > Support in-kernel fault-injection framework via debugfs.
> > This allows you to inject a conditional error to specified
> > function using debugfs interfaces.
> >
> > Signed-off-by: Masami Hiramatsu 
> > ---
> >  Documentation/fault-injection/fault-injection.txt |5 +
> >  kernel/Makefile   |1
> >  kernel/fail_function.c|  169 
> > +
> >  lib/Kconfig.debug |   10 +
> >  4 files changed, 185 insertions(+)
> >  create mode 100644 kernel/fail_function.c
> >
> > diff --git a/Documentation/fault-injection/fault-injection.txt 
> > b/Documentation/fault-injection/fault-injection.txt
> > index 918972babcd8..6243a588dd71 100644
> > --- a/Documentation/fault-injection/fault-injection.txt
> > +++ b/Documentation/fault-injection/fault-injection.txt
> > @@ -30,6 +30,11 @@ o fail_mmc_request
> >injects MMC data errors on devices permitted by setting
> >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
> >
> > +o fail_function
> > +
> > +  injects error return on specific functions by setting debugfs entries
> > +  under /sys/kernel/debug/fail_function. No boot option supported.
> 
>  I like it.
>  Could you document it a bit better?
> >>>
> >>> Yes, I will do in next series.
> >>>
>  In particular retval is configurable, but without an example no one
>  will be able to figure out how to use it.
> >>>
> >>> Ah, right. BTW, as I pointed in the covermail, should we store the
> >>> expected error value range into the injectable list? e.g.
> >>>
> >>> ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)
> >>>
> >>> And provide APIs to check/get it.
> >>
> >> I'm afraid such check would be too costly.
> >> Right now we have only two functions marked but I expect hundreds more
> >> will be added in the near future as soon as developers realize the
> >> potential of such error injection.
> >> All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data.
> >> Multiple by 1k and we have 8k of data spent on marks.
> >> If we add max/min range marks that doubles it for very little use.
> >> I think marking function only is enough.
> >
> > Sorry, I don't think so.
> > Even if it takes 16 bytes more for each points, I don't think it is
> > any overhead for machines in these days. Even if so, we can provide
> > a kconfig to reduce it.
> > I mean, we are living in GB-order memory are, and it will be bigger
> > in the future. Why we have to worry about hundreds of 16bytes memory
> > pieces? It will take a few KB, and even if we mark thousands of
> > functions, it never reaches 1MB, in GB memory pool. :)
> >
> > Of course, for many small-footprint embedded devices (like having
> > less than 128MB memory), this feature can be a overhead. But they
> > can cut off the table by kconfig.
> 
> I still disagree on wasting 16-byte * num_of_funcs of .data here.
> The trade-off of usability vs memory just not worth it. Sorry.
> Please focus on testing your changes instead.

Then what happen if the user set invalid retval to those functions?
even if we limit the injectable functions, it can cause a problem,

for example, 

 obj = func_return_object();
 if (!obj) {
handling_error...;
 }
 obj->field = x;

In this case, obviously func_return_object() must return NULL if there is
an error, not -ENOMEM. But without the correct retval information, how would
you check the BPF code doesn't cause a trouble?
Currently it seems you are expecting only the functions which return error code.

 ret = func_return_state();
 if (ret < 0) {
handling_error...;
 }

But how we can distinguish those?

If we have the error range for each function, we can ensure what is
*correct* error code, NULL or errno, or any other error numbers. :)

At least fail_function needs this feature because it can check
return value when setting it up.

Thank you,

-- 
Masami Hiramatsu

Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface

2017-12-27 Thread Andrew Lunn

On Wed, Dec 27, 2017 at 10:24:01PM +, Russell King - ARM Linux wrote:
> On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
> > This patch enables the fourth network interface on the Marvell
> > Macchiatobin. It is configured in the 2500Base-X PHY mode.
> > 
> > Signed-off-by: Antoine Tenart 
> > ---
> >  arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts 
> > b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> > index b3350827ee55..c51efd511324 100644
> > --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> > +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> > @@ -279,6 +279,14 @@
> > phys = <_comphy0 1>;
> >  };
> >  
> > +_eth2 {
> > +   /* CPS Lane 5 */
> > +   status = "okay";
> > +   phy-mode = "2500base-x";
> > +   /* Generic PHY, providing serdes lanes */
> > +   phys = <_comphy5 2>;
> > +};
> > +
> 
> This is wrong.  This lane is connected to a SFP cage which can support
> more than just 2500base-X.  Tying it in this way to 2500base-X means
> that this port does not support conenctions at 1000base-X, despite
> that's one of the most popular and more standardised speeds.
> 

Hi Antoine

I agree with Russell here. SFP modules are hot pluggable, and support
a range of interface modes. You need to query what the SFP module is
in order to know how to configure the SERDES interface. The phylink
infrastructure does that for you.

   Andrew

Re: [PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum

2017-12-27 Thread Andrew Lunn

On Wed, Dec 27, 2017 at 11:14:41PM +0100, Antoine Tenart wrote:
> This patch adds one more generic PHY mode to the phy_mode enum, to allow
> configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
> callback.
> 
> Signed-off-by: Antoine Tenart 
> ---
>  include/linux/phy/phy.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
> index 4f8423a948d5..70459a28f3a1 100644
> --- a/include/linux/phy/phy.h
> +++ b/include/linux/phy/phy.h
> @@ -28,6 +28,7 @@ enum phy_mode {
>   PHY_MODE_USB_DEVICE,
>   PHY_MODE_USB_OTG,
>   PHY_MODE_SGMII,
> + PHY_MODE_SGMII_2_5G,
>   PHY_MODE_10GKR,
>   PHY_MODE_UFS_HS_A,
>   PHY_MODE_UFS_HS_B,

Hi Antoine

There was a discussion maybe last month about adding 2.5G SGMII. I
would prefer 2500SGMII. Putting the number first makes it uniform with
the other defines, 1000BASEX, 25000BASEX, 10GKR.

Andrew

Re: [PATCH] Staging: ipx: fixed several no space before tabs coding style issues

2017-12-27 Thread Greg KH

On Wed, Dec 27, 2017 at 09:25:44PM +, Jianshen Liu wrote:
> Fixed several coding style warnings of "please, no space before tabs".
> 
> Signed-off-by: Jianshen Liu 
> ---
>  drivers/staging/ipx/af_ipx.c| 56 
> -
>  drivers/staging/ipx/ipx_proc.c  |  2 +-
>  drivers/staging/ipx/ipx_route.c |  6 ++---
>  3 files changed, 32 insertions(+), 32 deletions(-)

Please read drivers/staging/ipx/TODO

[PATCH net-next v9 0/2] add UniPhier AVE ethernet support

2017-12-27 Thread Kunihiko Hayashi

This series adds support for Socionext AVE ethernet controller implemented
on UniPhier SoCs. This driver supports RGMII/RMII modes.

v8: https://www.spinics.net/lists/netdev/msg474374.html

The PHY patch included in v1 has already separated in:
http://www.spinics.net/lists/netdev/msg454595.html

Changes since v8:
- move operators at the beginning of the line to the end of the previous line
- dt-bindings: add blank lines before mdio and phy subnodes

Changes since v7:
- dt-bindings: fix mdio subnode description

Changes since v6:
- sort the order of local variables from longest to shortest line
- fix ave_probe() which calls register_netdev() at the end of initialization
- dt-bindings: remove phy node descriptions in mdio node

Changes since v5:
- replace license boilerplate with SPDX Identifier
- remove inline directives and an unused function

Changes since v4:
- fix larger integer warning on AVE_PFMBYTE_MASK0

Changes since v3:
- remove checking dma address and use dma_set_mask() to restirct address
- replace ave_mdio_busywait() with read_poll_timeout()
- replace functions to access to registers with readl/writel() directly
- replace a function to access to macaddr with ave_hw_write_macaddr()
- change return value of ave_dma_map() to error value
- move mdiobus_unregister() from ave_remove() to ave_uninit()
- eliminate else block at the end of ave_dma_map()
- add mask definitions for packet filter
- sort bitmap definitions in descending order
- add error check to some functions
- rename and sort functions to clear sub-categories
- fix error value consistency
- remove unneeded initializers
- change type of constant arrays

Changes since v2:
- replace clk_get() with devm_clk_get()
- replace reset_control_get() with devm_reset_control_get_optional_shared()
- add error return when the error occurs on the above *_get functions
- sort soc data and compatible strings
- remove clearly obvious comments
- modify dt-bindings document consistent with these modifications

Changes since v1:
- add/remove devicetree properties and sub-node
  - remove "internal-phy-interrupt" and "desc-bits" property
  - add SoC data structures based on compatible strings
  - add node operation to apply "mdio" sub-node
- add support for features
  - add support for {get,set}_pauseparam and pause frame operations
  - add support for ndo_get_stats64 instead of ndo_get_stats
- replace with desiable functions
  - replace check for valid phy_mode with phy_interface{_mode}_is_rgmii()
  - replace phy attach message with phy_attached_info()
  - replace 32bit operation with {upper,lower}_32_bits() on ave_wdesc_addr()
  - replace nway_reset and get_link with generic functions
- move operations to proper functions
  - move phy_start_aneg() to ndo_open,
and remove unnecessary PHY interrupt operations
See http://www.spinics.net/lists/netdev/msg454590.html
  - move irq initialization and descriptor memory allocation to ndo_open
  - move initialization of reset and clock and mdiobus to ndo_init
- fix skbuffer operations
  - fix skb alignment operations and add Rx buffer adjustment for descriptor
See http://www.spinics.net/lists/netdev/msg456014.html
  - add error returns when dma_map_single() failed 
- clean up code structures
  - clean up wait-loop and wake-queue conditions
  - add ave_wdesc_addr() and offset definitions
  - add ave_macaddr_init() to clean up mac-address operation
  - fix checking whether Tx entry is not enough
  - fix supported features of phydev
  - add necessary free/disable operations
  - add phydev check on ave_{get,set}_wol()
  - remove netif_carrier functions, phydev initializer, and Tx budget check
- change obsolate codes
  - replace ndev->{base_addr,irq} with the members of ave_private
- rename goto labels and mask definitions, and remove unused codes

Kunihiko Hayashi (2):
  dt-bindings: net: add DT bindings for Socionext UniPhier AVE
  net: ethernet: socionext: add AVE ethernet driver

 .../bindings/net/socionext,uniphier-ave4.txt   |   48 +
 drivers/net/ethernet/Kconfig   |1 +
 drivers/net/ethernet/Makefile  |1 +
 drivers/net/ethernet/socionext/Kconfig |   22 +
 drivers/net/ethernet/socionext/Makefile|5 +
 drivers/net/ethernet/socionext/sni_ave.c   | 1736 
 6 files changed, 1813 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
 create mode 100644 drivers/net/ethernet/socionext/Kconfig
 create mode 100644 drivers/net/ethernet/socionext/Makefile
 create mode 100644 drivers/net/ethernet/socionext/sni_ave.c

-- 
2.7.4

[PATCH net-next v9 1/2] dt-bindings: net: add DT bindings for Socionext UniPhier AVE

2017-12-27 Thread Kunihiko Hayashi

DT bindings for the AVE ethernet controller found on Socionext's
UniPhier platforms.

Signed-off-by: Kunihiko Hayashi 
Signed-off-by: Jassi Brar 
Acked-by: Rob Herring 
Reviewed-by: Florian Fainelli 
---
 .../bindings/net/socionext,uniphier-ave4.txt   | 48 ++
 1 file changed, 48 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt

diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt 
b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
new file mode 100644
index 000..270ea4e
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
@@ -0,0 +1,48 @@
+* Socionext AVE ethernet controller
+
+This describes the devicetree bindings for AVE ethernet controller
+implemented on Socionext UniPhier SoCs.
+
+Required properties:
+ - compatible: Should be
+   - "socionext,uniphier-pro4-ave4" : for Pro4 SoC
+   - "socionext,uniphier-pxs2-ave4" : for PXs2 SoC
+   - "socionext,uniphier-ld11-ave4" : for LD11 SoC
+   - "socionext,uniphier-ld20-ave4" : for LD20 SoC
+ - reg: Address where registers are mapped and size of region.
+ - interrupts: Should contain the MAC interrupt.
+ - phy-mode: See ethernet.txt in the same directory. Allow to choose
+   "rgmii", "rmii", or "mii" according to the PHY.
+ - phy-handle: Should point to the external phy device.
+   See ethernet.txt file in the same directory.
+ - clocks: A phandle to the clock for the MAC.
+
+Optional properties:
+ - resets: A phandle to the reset control for the MAC.
+ - local-mac-address: See ethernet.txt in the same directory.
+
+Required subnode:
+ - mdio: A container for child nodes representing phy nodes.
+ See phy.txt in the same directory.
+
+Example:
+
+   ether: ethernet@6500 {
+   compatible = "socionext,uniphier-ld20-ave4";
+   reg = <0x6500 0x8500>;
+   interrupts = <0 66 4>;
+   phy-mode = "rgmii";
+   phy-handle = <>;
+   clocks = <_clk 6>;
+   resets = <_rst 6>;
+   local-mac-address = [00 00 00 00 00 00];
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   ethphy: ethphy@1 {
+   reg = <1>;
+   };
+   };
+   };
-- 
2.7.4

[PATCH net-next v9 2/2] net: ethernet: socionext: add AVE ethernet driver

2017-12-27 Thread Kunihiko Hayashi

The UniPhier platform from Socionext provides the AVE ethernet
controller that includes MAC and MDIO bus supporting RGMII/RMII
modes. The controller is named AVE.

Signed-off-by: Kunihiko Hayashi 
Signed-off-by: Jassi Brar 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/Kconfig |1 +
 drivers/net/ethernet/Makefile|1 +
 drivers/net/ethernet/socionext/Kconfig   |   22 +
 drivers/net/ethernet/socionext/Makefile  |5 +
 drivers/net/ethernet/socionext/sni_ave.c | 1736 ++
 5 files changed, 1765 insertions(+)
 create mode 100644 drivers/net/ethernet/socionext/Kconfig
 create mode 100644 drivers/net/ethernet/socionext/Makefile
 create mode 100644 drivers/net/ethernet/socionext/sni_ave.c

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index c604213..d50519e 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -170,6 +170,7 @@ source "drivers/net/ethernet/sis/Kconfig"
 source "drivers/net/ethernet/sfc/Kconfig"
 source "drivers/net/ethernet/sgi/Kconfig"
 source "drivers/net/ethernet/smsc/Kconfig"
+source "drivers/net/ethernet/socionext/Kconfig"
 source "drivers/net/ethernet/stmicro/Kconfig"
 source "drivers/net/ethernet/sun/Kconfig"
 source "drivers/net/ethernet/tehuti/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index 39f62733..6cf5ade 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_SFC) += sfc/
 obj-$(CONFIG_SFC_FALCON) += sfc/falcon/
 obj-$(CONFIG_NET_VENDOR_SGI) += sgi/
 obj-$(CONFIG_NET_VENDOR_SMSC) += smsc/
+obj-$(CONFIG_NET_VENDOR_SOCIONEXT) += socionext/
 obj-$(CONFIG_NET_VENDOR_STMICRO) += stmicro/
 obj-$(CONFIG_NET_VENDOR_SUN) += sun/
 obj-$(CONFIG_NET_VENDOR_TEHUTI) += tehuti/
diff --git a/drivers/net/ethernet/socionext/Kconfig 
b/drivers/net/ethernet/socionext/Kconfig
new file mode 100644
index 000..3a1829e
--- /dev/null
+++ b/drivers/net/ethernet/socionext/Kconfig
@@ -0,0 +1,22 @@
+config NET_VENDOR_SOCIONEXT
+   bool "Socionext ethernet drivers"
+   default y
+   ---help---
+ Option to select ethernet drivers for Socionext platforms.
+
+ Note that the answer to this question doesn't directly affect the
+ kernel: saying N will just cause the configurator to skip all
+ the questions about Socionext devices. If you say Y, you will be asked
+ for your specific card in the following questions.
+
+if NET_VENDOR_SOCIONEXT
+
+config SNI_AVE
+   tristate "Socionext AVE ethernet support"
+   depends on (ARCH_UNIPHIER || COMPILE_TEST) && OF
+   select PHYLIB
+   ---help---
+ Driver for gigabit ethernet MACs, called AVE, in the
+ Socionext UniPhier family.
+
+endif #NET_VENDOR_SOCIONEXT
diff --git a/drivers/net/ethernet/socionext/Makefile 
b/drivers/net/ethernet/socionext/Makefile
new file mode 100644
index 000..ab83df6
--- /dev/null
+++ b/drivers/net/ethernet/socionext/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for all ethernet ip drivers on Socionext platforms
+#
+obj-$(CONFIG_SNI_AVE) += sni_ave.o
diff --git a/drivers/net/ethernet/socionext/sni_ave.c 
b/drivers/net/ethernet/socionext/sni_ave.c
new file mode 100644
index 000..111e7ca
--- /dev/null
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -0,0 +1,1736 @@
+// SPDX-License-Identifier: GPL-2.0
+/**
+ * sni_ave.c - Socionext UniPhier AVE ethernet driver
+ * Copyright 2014 Panasonic Corporation
+ * Copyright 2015-2017 Socionext Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* General Register Group */
+#define AVE_IDR0x000   /* ID */
+#define AVE_VR 0x004   /* Version */
+#define AVE_GRR0x008   /* Global Reset */
+#define AVE_CFGR   0x00c   /* Configuration */
+
+/* Interrupt Register Group */
+#define AVE_GIMR   0x100   /* Global Interrupt Mask */
+#define AVE_GISR   0x104   /* Global Interrupt Status */
+
+/* MAC Register Group */
+#define AVE_TXCR   0x200   /* TX Setup */
+#define AVE_RXCR   0x204   /* RX Setup */
+#define AVE_RXMAC1R0x208   /* MAC address (lower) */
+#define AVE_RXMAC2R0x20c   /* MAC address (upper) */
+#define AVE_MDIOCTR0x214   /* MDIO Control */
+#define AVE_MDIOAR 0x218   /* MDIO Address */
+#define AVE_MDIOWDR0x21c   /* MDIO Data */
+#define AVE_MDIOSR 0x220   /* MDIO Status */
+#define AVE_MDIORDR0x224   /* MDIO Rd Data */
+
+/* Descriptor Control Register Group */
+#define AVE_DESCC  0x300   /* Descriptor Control */
+#define AVE_TXDC   0x304

[PATCH net-next] cxgb4/cxgb4vf: support for XLAUI Port Type

2017-12-27 Thread Ganesh Goudar

Add support for new Backplane XLAUI port type.

Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c  | 10 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  1 +
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 10 +-
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index 541419b..7852d98 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -517,7 +517,8 @@ static int from_fw_port_mod_type(enum fw_port_type 
port_type,
else
return PORT_OTHER;
} else if (port_type == FW_PORT_TYPE_KR4_100G ||
-  port_type == FW_PORT_TYPE_KR_SFP28) {
+  port_type == FW_PORT_TYPE_KR_SFP28 ||
+  port_type == FW_PORT_TYPE_KR_XLAUI) {
return PORT_NONE;
}
 
@@ -645,6 +646,13 @@ static void fw_caps_to_lmm(enum fw_port_type port_type,
FW_CAPS_TO_LMM(SPEED_25G, 25000baseKR_Full);
break;
 
+   case FW_PORT_TYPE_KR_XLAUI:
+   SET_LMM(Backplane);
+   FW_CAPS_TO_LMM(SPEED_1G, 1000baseKX_Full);
+   FW_CAPS_TO_LMM(SPEED_10G, 1baseKR_Full);
+   FW_CAPS_TO_LMM(SPEED_40G, 4baseKR4_Full);
+   break;
+
case FW_PORT_TYPE_CR2_QSFP:
SET_LMM(FIBRE);
SET_LMM(5baseSR2_Full);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 44930ca..f2a60e0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -6084,6 +6084,7 @@ const char *t4_get_port_type_description(enum 
fw_port_type port_type)
"CR2_QSFP",
"SFP28",
"KR_SFP28",
+   "KR_XLAUI"
};
 
if (port_type < ARRAY_SIZE(port_type_description))
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index 01f5a5e..427f252 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -2829,6 +2829,7 @@ enum fw_port_type {
FW_PORT_TYPE_CR2_QSFP,
FW_PORT_TYPE_SFP28,
FW_PORT_TYPE_KR_SFP28,
+   FW_PORT_TYPE_KR_XLAUI,
 
FW_PORT_TYPE_NONE = FW_PORT_CMD_PTYPE_M
 };
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c 
b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index b48361c..96f69f8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -1229,7 +1229,8 @@ static int from_fw_port_mod_type(enum fw_port_type 
port_type,
else
return PORT_OTHER;
} else if (port_type == FW_PORT_TYPE_KR4_100G ||
-  port_type == FW_PORT_TYPE_KR_SFP28) {
+  port_type == FW_PORT_TYPE_KR_SFP28 ||
+  port_type == FW_PORT_TYPE_KR_XLAUI) {
return PORT_NONE;
}
 
@@ -1323,6 +1324,13 @@ static void fw_caps_to_lmm(enum fw_port_type port_type,
SET_LMM(25000baseKR_Full);
break;
 
+   case FW_PORT_TYPE_KR_XLAUI:
+   SET_LMM(Backplane);
+   FW_CAPS_TO_LMM(SPEED_1G, 1000baseKX_Full);
+   FW_CAPS_TO_LMM(SPEED_10G, 1baseKR_Full);
+   FW_CAPS_TO_LMM(SPEED_40G, 4baseKR4_Full);
+   break;
+
case FW_PORT_TYPE_CR2_QSFP:
SET_LMM(FIBRE);
SET_LMM(5baseSR2_Full);
-- 
2.1.0

[PATCH net-next v6 6/6] net: dccp: Remove dccpprobe module

2017-12-27 Thread Masami Hiramatsu

Remove DCCP probe module since jprobe has been deprecated.
That function is now replaced by dccp/dccp_probe trace-event.
You can use it via ftrace or perftools.

Signed-off-by: Masami Hiramatsu 
---
 Changes in v5:
  - Fix a conflict with previous change in Makefile.
---
 net/dccp/Kconfig  |   17 
 net/dccp/Makefile |2 -
 net/dccp/probe.c  |  203 -
 3 files changed, 222 deletions(-)
 delete mode 100644 net/dccp/probe.c

diff --git a/net/dccp/Kconfig b/net/dccp/Kconfig
index 8c0ef71bed2f..b270e84d9c13 100644
--- a/net/dccp/Kconfig
+++ b/net/dccp/Kconfig
@@ -39,23 +39,6 @@ config IP_DCCP_DEBUG
 
  Just say N.
 
-config NET_DCCPPROBE
-   tristate "DCCP connection probing"
-   depends on PROC_FS && KPROBES
-   ---help---
-   This module allows for capturing the changes to DCCP connection
-   state in response to incoming packets. It is used for debugging
-   DCCP congestion avoidance modules. If you don't understand
-   what was just said, you don't need it: say N.
-
-   Documentation on how to use DCCP connection probing can be found
-   at:
-   
- 
http://www.linuxfoundation.org/collaborate/workgroups/networking/dccpprobe
-
-   To compile this code as a module, choose M here: the
-   module will be called dccp_probe.
-
 
 endmenu
 
diff --git a/net/dccp/Makefile b/net/dccp/Makefile
index 4215f13a63af..5b4ff37bc806 100644
--- a/net/dccp/Makefile
+++ b/net/dccp/Makefile
@@ -21,12 +21,10 @@ obj-$(subst y,$(CONFIG_IP_DCCP),$(CONFIG_IPV6)) += 
dccp_ipv6.o
 dccp_ipv6-y := ipv6.o
 
 obj-$(CONFIG_INET_DCCP_DIAG) += dccp_diag.o
-obj-$(CONFIG_NET_DCCPPROBE) += dccp_probe.o
 
 dccp-$(CONFIG_SYSCTL) += sysctl.o
 
 dccp_diag-y := diag.o
-dccp_probe-y := probe.o
 
 # build with local directory for trace.h
 CFLAGS_proto.o := -I$(src)
diff --git a/net/dccp/probe.c b/net/dccp/probe.c
deleted file mode 100644
index 3d3fda05b32d..
--- a/net/dccp/probe.c
+++ /dev/null
@@ -1,203 +0,0 @@
-/*
- * dccp_probe - Observe the DCCP flow with kprobes.
- *
- * The idea for this came from Werner Almesberger's umlsim
- * Copyright (C) 2004, Stephen Hemminger 
- *
- * Modified for DCCP from Stephen Hemminger's code
- * Copyright (C) 2006, Ian McDonald 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "dccp.h"
-#include "ccid.h"
-#include "ccids/ccid3.h"
-
-static int port;
-
-static int bufsize = 64 * 1024;
-
-static const char procname[] = "dccpprobe";
-
-static struct {
-   struct kfifo  fifo;
-   spinlock_tlock;
-   wait_queue_head_t wait;
-   struct timespec64 tstart;
-} dccpw;
-
-static void printl(const char *fmt, ...)
-{
-   va_list args;
-   int len;
-   struct timespec64 now;
-   char tbuf[256];
-
-   va_start(args, fmt);
-   getnstimeofday64();
-
-   now = timespec64_sub(now, dccpw.tstart);
-
-   len = sprintf(tbuf, "%lu.%06lu ",
- (unsigned long) now.tv_sec,
- (unsigned long) now.tv_nsec / NSEC_PER_USEC);
-   len += vscnprintf(tbuf+len, sizeof(tbuf)-len, fmt, args);
-   va_end(args);
-
-   kfifo_in_locked(, tbuf, len, );
-   wake_up();
-}
-
-static int jdccp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
-{
-   const struct inet_sock *inet = inet_sk(sk);
-   struct ccid3_hc_tx_sock *hc = NULL;
-
-   if (ccid_get_current_tx_ccid(dccp_sk(sk)) == DCCPC_CCID3)
-   hc = ccid3_hc_tx_sk(sk);
-
-   if (port == 0 || ntohs(inet->inet_dport) == port ||
-   ntohs(inet->inet_sport) == port) {
-   if (hc)
-   printl("%pI4:%u %pI4:%u %d %d %d %d %u %llu %llu %d\n",
-  >inet_saddr, ntohs(inet->inet_sport),
-  >inet_daddr, ntohs(inet->inet_dport), size,
-  hc->tx_s, hc->tx_rtt, hc->tx_p,
-  hc->tx_x_calc, hc->tx_x_recv >> 6,
-  hc->tx_x >> 6, hc->tx_t_ipi);
-   else
-   printl("%pI4:%u %pI4:%u

[PATCH net-next v6 1/6] net: tcp: Add trace events for TCP congestion window tracing

2017-12-27 Thread Masami Hiramatsu

This adds an event to trace TCP stat variables with
slightly intrusive trace-event. This uses ftrace/perf
event log buffer to trace those state, no needs to
prepare own ring-buffer, nor custom user apps.

User can use ftrace to trace this event as below;

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/tcp/tcp_probe/enable
  (run workloads)
  # cat trace

Signed-off-by: Masami Hiramatsu 
---
 Changes in v6:
  - Avoid preprocessor directives in tracepoint macro args as
Mat did on net tree.
---
 include/trace/events/tcp.h |   97 
 net/ipv4/tcp_input.c   |3 +
 2 files changed, 100 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 8e88a1671538..4dea6342f7d4 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM tcp
 
@@ -8,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * tcp event with arguments sk and skb
@@ -277,6 +279,101 @@ TRACE_EVENT(tcp_retransmit_synack,
  __entry->saddr_v6, __entry->daddr_v6)
 );
 
+
+#define TP_STORE_ADDR_PORTS_V4(__entry, inet, sk)  \
+   do {\
+   struct sockaddr_in *v4 = (void *)__entry->saddr;\
+   \
+   v4->sin_family = AF_INET;   \
+   v4->sin_port = inet->inet_sport;\
+   v4->sin_addr.s_addr = inet->inet_saddr; \
+   v4 = (void *)__entry->daddr;\
+   v4->sin_family = AF_INET;   \
+   v4->sin_port = inet->inet_dport;\
+   v4->sin_addr.s_addr = inet->inet_daddr; \
+   } while (0)
+
+#if IS_ENABLED(CONFIG_IPV6)
+
+#define TP_STORE_ADDR_PORTS(__entry, inet, sk) \
+   do {\
+   if (sk->sk_family == AF_INET6) {\
+   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+   \
+   v6->sin6_family = AF_INET6; \
+   v6->sin6_port = inet->inet_sport;   \
+   v6->sin6_addr = inet6_sk(sk)->saddr;\
+   v6 = (void *)__entry->daddr;\
+   v6->sin6_family = AF_INET6; \
+   v6->sin6_port = inet->inet_dport;   \
+   v6->sin6_addr = sk->sk_v6_daddr;\
+   } else  \
+   TP_STORE_ADDR_PORTS_V4(__entry, inet, sk);  \
+   } while (0)
+
+#else
+
+#define TP_STORE_ADDR_PORTS(__entry, inet, sk) \
+   TP_STORE_ADDR_PORTS_V4(__entry, inet, sk);
+
+#endif
+
+TRACE_EVENT(tcp_probe,
+
+   TP_PROTO(struct sock *sk, struct sk_buff *skb),
+
+   TP_ARGS(sk, skb),
+
+   TP_STRUCT__entry(
+   /* sockaddr_in6 is always bigger than sockaddr_in */
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(__u32, mark)
+   __field(__u16, length)
+   __field(__u32, snd_nxt)
+   __field(__u32, snd_una)
+   __field(__u32, snd_cwnd)
+   __field(__u32, ssthresh)
+   __field(__u32, snd_wnd)
+   __field(__u32, srtt)
+   __field(__u32, rcv_wnd)
+   ),
+
+   TP_fast_assign(
+   const struct tcp_sock *tp = tcp_sk(sk);
+   const struct inet_sock *inet = inet_sk(sk);
+
+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   TP_STORE_ADDR_PORTS(__entry, inet, sk);
+
+   /* For filtering use */
+   __entry->sport = ntohs(inet->inet_sport);
+   __entry->dport = ntohs(inet->inet_dport);
+   __entry->mark = skb->mark;
+
+   __entry->length = skb->len;
+   __entry->snd_nxt = tp->snd_nxt;
+   __entry->snd_una = tp->snd_una;
+   __entry->snd_cwnd = tp->snd_cwnd;
+   __entry->snd_wnd = tp->snd_wnd;
+   __entry->rcv_wnd = tp->rcv_wnd;
+   __entry->ssthresh = tcp_current_ssthresh(sk);
+   __entry->srtt = tp->srtt_us >> 3;
+   ),
+
+

[PATCH net-next v6 4/6] net: sctp: Remove debug SCTP probe module

2017-12-27 Thread Masami Hiramatsu

Remove SCTP probe module since jprobe has been deprecated.
That function is now replaced by sctp/sctp_probe and
sctp/sctp_probe_path trace-events.
You can use it via ftrace or perftools.

Signed-off-by: Masami Hiramatsu 
---
 net/sctp/Kconfig  |   12 ---
 net/sctp/Makefile |3 -
 net/sctp/probe.c  |  244 -
 3 files changed, 259 deletions(-)
 delete mode 100644 net/sctp/probe.c

diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig
index d9c04dc1b3f3..c740b189d4ba 100644
--- a/net/sctp/Kconfig
+++ b/net/sctp/Kconfig
@@ -37,18 +37,6 @@ menuconfig IP_SCTP
 
 if IP_SCTP
 
-config NET_SCTPPROBE
-   tristate "SCTP: Association probing"
-depends on PROC_FS && KPROBES
----help---
-This module allows for capturing the changes to SCTP association
-state in response to incoming packets. It is used for debugging
-SCTP congestion control algorithms. If you don't understand
-what was just said, you don't need it: say N.
-
-To compile this code as a module, choose M here: the
-module will be called sctp_probe.
-
 config SCTP_DBG_OBJCNT
bool "SCTP: Debug object counts"
depends on PROC_FS
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 54bd9c1a8aa1..6776582ec449 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -4,7 +4,6 @@
 #
 
 obj-$(CONFIG_IP_SCTP) += sctp.o
-obj-$(CONFIG_NET_SCTPPROBE) += sctp_probe.o
 obj-$(CONFIG_INET_SCTP_DIAG) += sctp_diag.o
 
 sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
@@ -16,8 +15,6 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
  offload.o stream_sched.o stream_sched_prio.o \
  stream_sched_rr.o stream_interleave.o
 
-sctp_probe-y := probe.o
-
 sctp-$(CONFIG_SCTP_DBG_OBJCNT) += objcnt.o
 sctp-$(CONFIG_PROC_FS) += proc.o
 sctp-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/net/sctp/probe.c b/net/sctp/probe.c
deleted file mode 100644
index 1280f85a598d..
--- a/net/sctp/probe.c
+++ /dev/null
@@ -1,244 +0,0 @@
-/*
- * sctp_probe - Observe the SCTP flow with kprobes.
- *
- * The idea for this came from Werner Almesberger's umlsim
- * Copyright (C) 2004, Stephen Hemminger 
- *
- * Modified for SCTP from Stephen Hemminger's code
- * Copyright (C) 2010, Wei Yongjun 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include 
-
-MODULE_SOFTDEP("pre: sctp");
-MODULE_AUTHOR("Wei Yongjun ");
-MODULE_DESCRIPTION("SCTP snooper");
-MODULE_LICENSE("GPL");
-
-static int port __read_mostly = 0;
-MODULE_PARM_DESC(port, "Port to match (0=all)");
-module_param(port, int, 0);
-
-static unsigned int fwmark __read_mostly = 0;
-MODULE_PARM_DESC(fwmark, "skb mark to match (0=no mark)");
-module_param(fwmark, uint, 0);
-
-static int bufsize __read_mostly = 64 * 1024;
-MODULE_PARM_DESC(bufsize, "Log buffer size (default 64k)");
-module_param(bufsize, int, 0);
-
-static int full __read_mostly = 1;
-MODULE_PARM_DESC(full, "Full log (1=every ack packet received,  0=only cwnd 
changes)");
-module_param(full, int, 0);
-
-static const char procname[] = "sctpprobe";
-
-static struct {
-   struct kfifo  fifo;
-   spinlock_tlock;
-   wait_queue_head_t wait;
-   struct timespec64 tstart;
-} sctpw;
-
-static __printf(1, 2) void printl(const char *fmt, ...)
-{
-   va_list args;
-   int len;
-   char tbuf[256];
-
-   va_start(args, fmt);
-   len = vscnprintf(tbuf, sizeof(tbuf), fmt, args);
-   va_end(args);
-
-   kfifo_in_locked(, tbuf, len, );
-   wake_up();
-}
-
-static int sctpprobe_open(struct inode *inode, struct file *file)
-{
-   kfifo_reset();
-   ktime_get_ts64();
-
-   return 0;
-}
-
-static ssize_t sctpprobe_read(struct file *file, char __user *buf,
- size_t len, loff_t *ppos)
-{
-   int error = 0, cnt = 0;
-   unsigned char *tbuf;
-
-   if (!buf)
-   return -EINVAL;
-
-   if (len == 0)
-   return 0;
-
-   tbuf = vmalloc(len);
-   if (!tbuf)
-   return

[PATCH net-next v6 2/6] net: tcp: Remove TCP probe module

2017-12-27 Thread Masami Hiramatsu

Remove TCP probe module since jprobe has been deprecated.
That function is now replaced by tcp/tcp_probe trace-event.
You can use it via ftrace or perftools.

Signed-off-by: Masami Hiramatsu 
---
 net/Kconfig  |   17 ---
 net/ipv4/Makefile|1 
 net/ipv4/tcp_probe.c |  301 --
 3 files changed, 319 deletions(-)
 delete mode 100644 net/ipv4/tcp_probe.c

diff --git a/net/Kconfig b/net/Kconfig
index 9dba2715919d..efe930db3c08 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -336,23 +336,6 @@ config NET_PKTGEN
  To compile this code as a module, choose M here: the
  module will be called pktgen.
 
-config NET_TCPPROBE
-   tristate "TCP connection probing"
-   depends on INET && PROC_FS && KPROBES
-   ---help---
-   This module allows for capturing the changes to TCP connection
-   state in response to incoming packets. It is used for debugging
-   TCP congestion avoidance modules. If you don't understand
-   what was just said, you don't need it: say N.
-
-   Documentation on how to use TCP connection probing can be found
-   at:
-   
- 
http://www.linuxfoundation.org/collaborate/workgroups/networking/tcpprobe
-
-   To compile this code as a module, choose M here: the
-   module will be called tcp_probe.
-
 config NET_DROP_MONITOR
tristate "Network packet drop alerting service"
depends on INET && TRACEPOINTS
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index c6c8ad1d4b6d..47a0a6649a9d 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -43,7 +43,6 @@ obj-$(CONFIG_INET_DIAG) += inet_diag.o
 obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
 obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
-obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
 obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
 obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
 obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
deleted file mode 100644
index 697f4c67b2e3..
--- a/net/ipv4/tcp_probe.c
+++ /dev/null
@@ -1,301 +0,0 @@
-/*
- * tcpprobe - Observe the TCP flow with kprobes.
- *
- * The idea for this came from Werner Almesberger's umlsim
- * Copyright (C) 2004, Stephen Hemminger 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-MODULE_AUTHOR("Stephen Hemminger ");
-MODULE_DESCRIPTION("TCP cwnd snooper");
-MODULE_LICENSE("GPL");
-MODULE_VERSION("1.1");
-
-static int port __read_mostly;
-MODULE_PARM_DESC(port, "Port to match (0=all)");
-module_param(port, int, 0);
-
-static unsigned int bufsize __read_mostly = 4096;
-MODULE_PARM_DESC(bufsize, "Log buffer size in packets (4096)");
-module_param(bufsize, uint, 0);
-
-static unsigned int fwmark __read_mostly;
-MODULE_PARM_DESC(fwmark, "skb mark to match (0=no mark)");
-module_param(fwmark, uint, 0);
-
-static int full __read_mostly;
-MODULE_PARM_DESC(full, "Full log (1=every ack packet received,  0=only cwnd 
changes)");
-module_param(full, int, 0);
-
-static const char procname[] = "tcpprobe";
-
-struct tcp_log {
-   ktime_t tstamp;
-   union {
-   struct sockaddr raw;
-   struct sockaddr_in  v4;
-   struct sockaddr_in6 v6;
-   }   src, dst;
-   u16 length;
-   u32 snd_nxt;
-   u32 snd_una;
-   u32 snd_wnd;
-   u32 rcv_wnd;
-   u32 snd_cwnd;
-   u32 ssthresh;
-   u32 srtt;
-};
-
-static struct {
-   spinlock_t  lock;
-   wait_queue_head_t wait;
-   ktime_t start;
-   u32 lastcwnd;
-
-   unsigned long   head, tail;
-   struct tcp_log  *log;
-} tcp_probe;
-
-static inline int tcp_probe_used(void)
-{
-   return (tcp_probe.head - tcp_probe.tail) & (bufsize - 1);
-}
-
-static inline int tcp_probe_avail(void)
-{
-   return bufsize - tcp_probe_used() - 1;
-}
-
-#define tcp_probe_copy_fl_to_si4(inet, si4, mem)   \
-   do {\
-   si4.sin_family = AF_INET;

[PATCH net-next v6 3/6] net: sctp: Add SCTP ACK tracking trace event

2017-12-27 Thread Masami Hiramatsu

Add SCTP ACK tracking trace event to trace the changes of SCTP
association state in response to incoming packets.
It is used for debugging SCTP congestion control algorithms,
and will replace sctp_probe module.

Note that this event a bit tricky. Since this consists of 2
events (sctp_probe and sctp_probe_path) so you have to enable
both events as below.

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/sctp/sctp_probe/enable
  # echo 1 > events/sctp/sctp_probe_path/enable

Or, you can enable all the events under sctp.

  # echo 1 > events/sctp/enable

Since sctp_probe_path event is always invoked from sctp_probe
event, you can not see any output if you only enable
sctp_probe_path.

Signed-off-by: Masami Hiramatsu 
---
  Changes in v3:
   - Add checking whether sctp_probe_path event is enabled
 before iterating sctp paths to record. Thanks Steven.
  Changes in v4:
   - Move a temporal variable definition in the block.
   - Fix to cast pointer to unsigned long instead of __u64
 for 32bit environment.
---
 include/trace/events/sctp.h |   99 +++
 net/sctp/sm_statefuns.c |5 ++
 2 files changed, 104 insertions(+)
 create mode 100644 include/trace/events/sctp.h

diff --git a/include/trace/events/sctp.h b/include/trace/events/sctp.h
new file mode 100644
index ..7475c7be165a
--- /dev/null
+++ b/include/trace/events/sctp.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM sctp
+
+#if !defined(_TRACE_SCTP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SCTP_H
+
+#include 
+#include 
+
+TRACE_EVENT(sctp_probe_path,
+
+   TP_PROTO(struct sctp_transport *sp,
+const struct sctp_association *asoc),
+
+   TP_ARGS(sp, asoc),
+
+   TP_STRUCT__entry(
+   __field(__u64, asoc)
+   __field(__u32, primary)
+   __array(__u8, ipaddr, sizeof(union sctp_addr))
+   __field(__u32, state)
+   __field(__u32, cwnd)
+   __field(__u32, ssthresh)
+   __field(__u32, flight_size)
+   __field(__u32, partial_bytes_acked)
+   __field(__u32, pathmtu)
+   ),
+
+   TP_fast_assign(
+   __entry->asoc = (unsigned long)asoc;
+   __entry->primary = (sp == asoc->peer.primary_path);
+   memcpy(__entry->ipaddr, >ipaddr, sizeof(union sctp_addr));
+   __entry->state = sp->state;
+   __entry->cwnd = sp->cwnd;
+   __entry->ssthresh = sp->ssthresh;
+   __entry->flight_size = sp->flight_size;
+   __entry->partial_bytes_acked = sp->partial_bytes_acked;
+   __entry->pathmtu = sp->pathmtu;
+   ),
+
+   TP_printk("asoc=%#llx%s ipaddr=%pISpc state=%u cwnd=%u ssthresh=%u "
+ "flight_size=%u partial_bytes_acked=%u pathmtu=%u",
+ __entry->asoc, __entry->primary ? "(*)" : "",
+ __entry->ipaddr, __entry->state, __entry->cwnd,
+ __entry->ssthresh, __entry->flight_size,
+ __entry->partial_bytes_acked, __entry->pathmtu)
+);
+
+TRACE_EVENT(sctp_probe,
+
+   TP_PROTO(const struct sctp_endpoint *ep,
+const struct sctp_association *asoc,
+struct sctp_chunk *chunk),
+
+   TP_ARGS(ep, asoc, chunk),
+
+   TP_STRUCT__entry(
+   __field(__u64, asoc)
+   __field(__u32, mark)
+   __field(__u16, bind_port)
+   __field(__u16, peer_port)
+   __field(__u32, pathmtu)
+   __field(__u32, rwnd)
+   __field(__u16, unack_data)
+   ),
+
+   TP_fast_assign(
+   struct sk_buff *skb = chunk->skb;
+
+   __entry->asoc = (unsigned long)asoc;
+   __entry->mark = skb->mark;
+   __entry->bind_port = ep->base.bind_addr.port;
+   __entry->peer_port = asoc->peer.port;
+   __entry->pathmtu = asoc->pathmtu;
+   __entry->rwnd = asoc->peer.rwnd;
+   __entry->unack_data = asoc->unack_data;
+
+   if (trace_sctp_probe_path_enabled()) {
+   struct sctp_transport *sp;
+
+   list_for_each_entry(sp, >peer.transport_addr_list,
+   transports) {
+   trace_sctp_probe_path(sp, asoc);
+   }
+   }
+   ),
+
+   TP_printk("asoc=%#llx mark=%#x bind_port=%d peer_port=%d pathmtu=%d "
+ "rwnd=%u unack_data=%d",
+ __entry->asoc, __entry->mark, __entry->bind_port,
+ __entry->peer_port, __entry->pathmtu, __entry->rwnd,
+ __entry->unack_data)
+);
+
+#endif /* _TRACE_SCTP_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index

[PATCH net-next v6 5/6] net: dccp: Add DCCP sendmsg trace event

2017-12-27 Thread Masami Hiramatsu

Add DCCP sendmsg trace event (dccp/dccp_probe) for
replacing dccpprobe. User can trace this event via
ftrace or perftools.

Signed-off-by: Masami Hiramatsu 
---
  Changes in v5
   - Fix to add local directory to include for trace.h.
 Thanks Steven!
---
 net/dccp/Makefile |3 ++
 net/dccp/proto.c  |5 +++
 net/dccp/trace.h  |  105 +
 3 files changed, 113 insertions(+)
 create mode 100644 net/dccp/trace.h

diff --git a/net/dccp/Makefile b/net/dccp/Makefile
index 2e7b56097bc4..4215f13a63af 100644
--- a/net/dccp/Makefile
+++ b/net/dccp/Makefile
@@ -27,3 +27,6 @@ dccp-$(CONFIG_SYSCTL) += sysctl.o
 
 dccp_diag-y := diag.o
 dccp_probe-y := probe.o
+
+# build with local directory for trace.h
+CFLAGS_proto.o := -I$(src)
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7a75a1d3568b..fa7e92e08920 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -38,6 +38,9 @@
 #include "dccp.h"
 #include "feat.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 DEFINE_SNMP_STAT(struct dccp_mib, dccp_statistics) __read_mostly;
 
 EXPORT_SYMBOL_GPL(dccp_statistics);
@@ -761,6 +764,8 @@ int dccp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
int rc, size;
long timeo;
 
+   trace_dccp_probe(sk, len);
+
if (len > dp->dccps_mss_cache)
return -EMSGSIZE;
 
diff --git a/net/dccp/trace.h b/net/dccp/trace.h
new file mode 100644
index ..aa01321a6c37
--- /dev/null
+++ b/net/dccp/trace.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dccp
+
+#if !defined(_TRACE_DCCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DCCP_H
+
+#include 
+#include "dccp.h"
+#include "ccids/ccid3.h"
+#include 
+
+TRACE_EVENT(dccp_probe,
+
+   TP_PROTO(struct sock *sk, size_t size),
+
+   TP_ARGS(sk, size),
+
+   TP_STRUCT__entry(
+   /* sockaddr_in6 is always bigger than sockaddr_in */
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(__u16, size)
+   __field(__u16, tx_s)
+   __field(__u32, tx_rtt)
+   __field(__u32, tx_p)
+   __field(__u32, tx_x_calc)
+   __field(__u64, tx_x_recv)
+   __field(__u64, tx_x)
+   __field(__u32, tx_t_ipi)
+   ),
+
+   TP_fast_assign(
+   const struct inet_sock *inet = inet_sk(sk);
+   struct ccid3_hc_tx_sock *hc = NULL;
+
+   if (ccid_get_current_tx_ccid(dccp_sk(sk)) == DCCPC_CCID3)
+   hc = ccid3_hc_tx_sk(sk);
+
+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   if (sk->sk_family == AF_INET) {
+   struct sockaddr_in *v4 = (void *)__entry->saddr;
+
+   v4->sin_family = AF_INET;
+   v4->sin_port = inet->inet_sport;
+   v4->sin_addr.s_addr = inet->inet_saddr;
+   v4 = (void *)__entry->daddr;
+   v4->sin_family = AF_INET;
+   v4->sin_port = inet->inet_dport;
+   v4->sin_addr.s_addr = inet->inet_daddr;
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (sk->sk_family == AF_INET6) {
+   struct sockaddr_in6 *v6 = (void *)__entry->saddr;
+
+   v6->sin6_family = AF_INET6;
+   v6->sin6_port = inet->inet_sport;
+   v6->sin6_addr = inet6_sk(sk)->saddr;
+   v6 = (void *)__entry->daddr;
+   v6->sin6_family = AF_INET6;
+   v6->sin6_port = inet->inet_dport;
+   v6->sin6_addr = sk->sk_v6_daddr;
+#endif
+   }
+
+   /* For filtering use */
+   __entry->sport = ntohs(inet->inet_sport);
+   __entry->dport = ntohs(inet->inet_dport);
+
+   __entry->size = size;
+   if (hc) {
+   __entry->tx_s = hc->tx_s;
+   __entry->tx_rtt = hc->tx_rtt;
+   __entry->tx_p = hc->tx_p;
+   __entry->tx_x_calc = hc->tx_x_calc;
+   __entry->tx_x_recv = hc->tx_x_recv >> 6;
+   __entry->tx_x = hc->tx_x >> 6;
+   __entry->tx_t_ipi = hc->tx_t_ipi;
+   } else {
+   __entry->tx_s = 0;
+   memset(&__entry->tx_rtt, 0, (void *)&__entry->tx_t_ipi -
+  (void *)&__entry->tx_rtt +
+  sizeof(__entry->tx_t_ipi));
+   }
+   ),
+
+   TP_printk("src=%pISpc dest=%pISpc

[PATCH net-next v6 0/6] net: tcp: sctp: dccp: Replace jprobe usage with trace events

2017-12-27 Thread Masami Hiramatsu

Hi,

This series is v6 of the replacement of jprobe usage with trace
events. This version fixes trace/events/tcp.h to avoid sparse
warning, as same as Mat Martineau's patch.

Previous version is here;
 https://www.spinics.net/lists/netdev/msg474057.html

Changes from v5:
  [1/6]: Avoid preprocessor directives in tracepoint macro args

Thank you,

---

Masami Hiramatsu (6):
  net: tcp: Add trace events for TCP congestion window tracing
  net: tcp: Remove TCP probe module
  net: sctp: Add SCTP ACK tracking trace event
  net: sctp: Remove debug SCTP probe module
  net: dccp: Add DCCP sendmsg trace event
  net: dccp: Remove dccpprobe module


 include/trace/events/sctp.h |   99 ++
 include/trace/events/tcp.h  |   97 ++
 net/Kconfig |   17 --
 net/dccp/Kconfig|   17 --
 net/dccp/Makefile   |5 -
 net/dccp/probe.c|  203 -
 net/dccp/proto.c|5 +
 net/dccp/trace.h|  105 +++
 net/ipv4/Makefile   |1 
 net/ipv4/tcp_input.c|3 
 net/ipv4/tcp_probe.c|  301 ---
 net/sctp/Kconfig|   12 --
 net/sctp/Makefile   |3 
 net/sctp/probe.c|  244 ---
 net/sctp/sm_statefuns.c |5 +
 15 files changed, 317 insertions(+), 800 deletions(-)
 create mode 100644 include/trace/events/sctp.h
 delete mode 100644 net/dccp/probe.c
 create mode 100644 net/dccp/trace.h
 delete mode 100644 net/ipv4/tcp_probe.c
 delete mode 100644 net/sctp/probe.c

--
Masami Hiramatsu (Linaro)

[PATCH net-next] cxgb4: display VNI correctly

2017-12-27 Thread Ganesh Goudar

Fix incorrect VNI display in mps_tcam

Signed-off-by: Santosh Rastapur 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index d3ced04..4ea76c1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1743,7 +1743,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
 */
if (lookup_type && (lookup_type != DATALKPTYPE_M)) {
/* Inner header VNI */
-   vniy = ((data2 & DATAVIDH2_F) << 23) |
+   vniy = (data2 & DATAVIDH2_F) |
   (DATAVIDH1_G(data2) << 16) | VIDL_G(val);
dip_hit = data2 & DATADIPHIT_F;
} else {
@@ -1753,6 +1753,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
port_num = DATAPORTNUM_G(data2);
 
/* Read tcamx. Change the control param */
+   vnix = 0;
ctl |= CTLXYBITSEL_V(1);
t4_write_reg(adap, MPS_CLS_TCAM_DATA2_CTL_A, ctl);
val = t4_read_reg(adap, MPS_CLS_TCAM_DATA1_A);
@@ -1761,7 +1762,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
data2 = t4_read_reg(adap, MPS_CLS_TCAM_DATA2_CTL_A);
if (lookup_type && (lookup_type != DATALKPTYPE_M)) {
/* Inner header VNI mask */
-   vnix = ((data2 & DATAVIDH2_F) << 23) |
+   vnix = (data2 & DATAVIDH2_F) |
   (DATAVIDH1_G(data2) << 16) | VIDL_G(val);
}
} else {
@@ -1834,7 +1835,8 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
   addr[1], addr[2], addr[3],
   addr[4], addr[5],
   (unsigned long long)mask,
-  vniy, vnix, dip_hit ? 'Y' : 'N',
+  vniy, (vnix | vniy),
+  dip_hit ? 'Y' : 'N',
   port_num,
   (cls_lo & T6_SRAM_VLD_F) ? 'Y' : 'N',
   PORTMAP_G(cls_hi),
-- 
2.1.0

Re: [PATCH iproute2 0/3] ip/tunnel: Fix noencap- and document "external" parameter handling.

2017-12-27 Thread Serhey Popovych



Stephen Hemminger wrote:
> On Wed, 27 Dec 2017 13:28:13 +0200
> Serhey Popovych  wrote:
> 
>> In this series I present next set of improvements/fixes:
>>
>>   1) Fix noencap- option handling: we need to clear bit, instead
>>  of seting all, but one we expect to clear.
>>
>>   2) Document "external" parameter both in ip-link(8) and help
>>  output for link_gre.c. Add "noexternal" option variant to
>>  bring inline with GENEVE and VXLAN.
>>
>>   3) Trivial: clear flowlabel/tclass from flowinfo in case of
>>  inherit to stop sending garbage to the kernel. It has no
>>  functional change, but follows similar behaviour in link_ip6tnl.c
>>
>> See individual patch description message for details.
>>
>> Thanks,
>> Serhii
>>
>> Serhey Popovych (3):
>>   gre,ip6tnl/tunnel: Fix noencap- support
>>   gre6/tunnel: Do not submit garbage in flowinfo
>>   ip/tunnel: Document "external" parameter
>>
>>  ip/link_gre.c |7 +--
>>  ip/link_gre6.c|4 ++--
>>  ip/link_ip6tnl.c  |6 --
>>  ip/link_iptnl.c   |4 +++-
>>  man/man8/ip-link.8.in |6 ++
>>  5 files changed, 20 insertions(+), 7 deletions(-)
>>
> 
> These are really disjoint. I applied the noencap and the flowinfo patch.
> Agree with William that having noexternal which does nothing useful is of 
> little value now.

Agree with you guys, just added it to match GENEVE/VXLAN implementation
which is has "noexternal".

Will send v2 to just document "external" and clear flowlabel/tclass in
flowinfo in case of inherit.

I have plans for iproute2-next to merge link_{gre,vti}{,6}.c and
link_ip{6,}tnl.c functionality together and just want to do things
as clear as possible before doing merge.

This is to reduce size of ip(8) binary to help openwrt like solutions to
survive JSON print support that increases size of binary.

Thank you for quick review and good feed back!

> 



signature.asc
Description: OpenPGP digital signature

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Alexei Starovoitov


On 12/27/17 8:16 PM, Steven Rostedt wrote:

On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:


I don't think that's the case. My reading of current
trace_kprobe_ftrace() -> arch_check_ftrace_location()
is that it will not be true for old mcount case.


In the old mcount case, you can't use ftrace to return without calling
the function. That is, no modification of the return ip, unless you
created a trampoline that could handle arbitrary stack frames, and
remove them from the stack before returning back to the function.


correct. I was saying that trace_kprobe_ftrace() won't let us do
bpf_override_return with old mcount.



As far as the rest of your arguments it very much puzzles me that
you claim that this patch suppose to work based on historical
reasoning whereas you did NOT test it.


I believe that Masami is saying that the modification of the IP from
kprobes has been very well tested. But I'm guessing that you still want
a test case for using kprobes in this particular instance. It's not the
implementation of modifying the IP that you are worried about, but the
implementation of BPF using it in this case. Right?


exactly. No doubt that old code works.
But it doesn't mean that bpf_override_return() will continue to
work in kprobes that are not ftrace based.
I suspect Josef's existing test case will cover this situation.
Probably only special .config is needed to disable ftrace, so
"kprobe on entry but not ftrace" check will kick in.
But I didn't get an impression that this situation was tested.
Instead I see only logical reasoning that it's _supposed_ to work.
That's not enough.

Re: [PATCH v3 bpf-next 2/2] tools/bpftool: fix bpftool build with bintutils >= 2.9

2017-12-27 Thread Jakub Kicinski

On Wed, 27 Dec 2017 19:16:29 +, Roman Gushchin wrote:
> Bpftool build is broken with binutils version 2.29 and later.
> The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
> in the binutils repo, which changed the disassembler() function
> signature.
> 
> Fix this by adding a new "feature" to the tools/build/features
> infrastructure and make it responsible for decision which
> disassembler() function signature to use.
> 
> Signed-off-by: Roman Gushchin 
> Cc: Jakub Kicinski 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 

Acked-by: Jakub Kicinski

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Steven Rostedt

On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov  wrote:

> I don't think that's the case. My reading of current
> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> is that it will not be true for old mcount case.

In the old mcount case, you can't use ftrace to return without calling
the function. That is, no modification of the return ip, unless you
created a trampoline that could handle arbitrary stack frames, and
remove them from the stack before returning back to the function.

> 
> As far as the rest of your arguments it very much puzzles me that
> you claim that this patch suppose to work based on historical
> reasoning whereas you did NOT test it.

I believe that Masami is saying that the modification of the IP from
kprobes has been very well tested. But I'm guessing that you still want
a test case for using kprobes in this particular instance. It's not the
implementation of modifying the IP that you are worried about, but the
implementation of BPF using it in this case. Right?

-- Steve

Re: [patch net-next v5 05/10] net: sched: keep track of offloaded filters and check tc offload feature

2017-12-27 Thread Jakub Kicinski

On Tue, 26 Dec 2017 15:15:59 +0100, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> During block bind, we need to check tc offload feature. If it is
> disabled yet still the block contains offloaded filters, forbid the
> bind. Also forbid to register callback for a block that already
> containes offloaded filters, as the play back is not supported now.
> For keeping track of offloaded filters there is a new counter
> introduced, alongside with couple of helpers called from cls_* code.
> These helpers set and clear TCA_CLS_FLAGS_IN_HW flag.
> 
> Signed-off-by: Jiri Pirko 

LGTM, thanks!

Re: [PATCH net-next] nfp: bpf: allocate vNIC priv for keeping track of the offloaded program

2017-12-27 Thread Jakub Kicinski

On Wed, 27 Dec 2017 20:38:11 -0500 (EST), David Miller wrote:
> From: Jakub Kicinski 
> Date: Wed, 27 Dec 2017 15:36:49 -0800
> 
> > After TC offloads were converted to callbacks we have no choice
> > but keep track of the offloaded filter in the driver.
> > 
> > Since this change came a little late in the release cycle
> > there were a number of conflicts and allocation of vNIC priv
> > structure seems to have slipped away in linux-next.
> > 
> > Signed-off-by: Jakub Kicinski   
> 
> Oh well, I thought I resolved the merge conflict properly in this
> case.  Apparently not :-)

The resolution looks suspiciously similar to what Stephen did in
linux-next, but I wasn't sure if you reuse his resolutions hence 
no Fixes: 903628bbc3a7 ;)

Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Alexei Starovoitov


On 12/27/17 5:38 PM, Masami Hiramatsu wrote:

On Wed, 27 Dec 2017 14:49:46 -0800
Alexei Starovoitov  wrote:


On 12/27/17 12:09 AM, Masami Hiramatsu wrote:

On Tue, 26 Dec 2017 18:12:56 -0800
Alexei Starovoitov  wrote:


On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:

Support in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error to specified
function using debugfs interfaces.

Signed-off-by: Masami Hiramatsu 
---
 Documentation/fault-injection/fault-injection.txt |5 +
 kernel/Makefile   |1
 kernel/fail_function.c|  169 +
 lib/Kconfig.debug |   10 +
 4 files changed, 185 insertions(+)
 create mode 100644 kernel/fail_function.c

diff --git a/Documentation/fault-injection/fault-injection.txt 
b/Documentation/fault-injection/fault-injection.txt
index 918972babcd8..6243a588dd71 100644
--- a/Documentation/fault-injection/fault-injection.txt
+++ b/Documentation/fault-injection/fault-injection.txt
@@ -30,6 +30,11 @@ o fail_mmc_request
   injects MMC data errors on devices permitted by setting
   debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request

+o fail_function
+
+  injects error return on specific functions by setting debugfs entries
+  under /sys/kernel/debug/fail_function. No boot option supported.


I like it.
Could you document it a bit better?


Yes, I will do in next series.


In particular retval is configurable, but without an example no one
will be able to figure out how to use it.


Ah, right. BTW, as I pointed in the covermail, should we store the
expected error value range into the injectable list? e.g.

ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)

And provide APIs to check/get it.


I'm afraid such check would be too costly.
Right now we have only two functions marked but I expect hundreds more
will be added in the near future as soon as developers realize the
potential of such error injection.
All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data.
Multiple by 1k and we have 8k of data spent on marks.
If we add max/min range marks that doubles it for very little use.
I think marking function only is enough.


Sorry, I don't think so.
Even if it takes 16 bytes more for each points, I don't think it is
any overhead for machines in these days. Even if so, we can provide
a kconfig to reduce it.
I mean, we are living in GB-order memory are, and it will be bigger
in the future. Why we have to worry about hundreds of 16bytes memory
pieces? It will take a few KB, and even if we mark thousands of
functions, it never reaches 1MB, in GB memory pool. :)

Of course, for many small-footprint embedded devices (like having
less than 128MB memory), this feature can be a overhead. But they
can cut off the table by kconfig.


I still disagree on wasting 16-byte * num_of_funcs of .data here.
The trade-off of usability vs memory just not worth it. Sorry.
Please focus on testing your changes instead.

[net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds

2017-12-27 Thread John Fastabend

When running consumer and/or producer operations and empty checks in
parallel its possible to have the empty check run past the end of the
array. The scenario occurs when an empty check is run while
__ptr_ring_discard_one() is in progress. Specifically after the
consumer_head is incremented but before (consumer_head >= ring_size)
check is made and the consumer head is zeroe'd.

To resolve this, without having to rework how consumer/producer ops
work on the array, simply add an extra dummy slot to the end of the
array. Even if we did a rework to avoid the extra slot it looks
like the normal case checks would suffer some so best to just
allocate an extra pointer.

Reported-by: Jakub Kicinski 
Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: John Fastabend 
---
 include/linux/ptr_ring.h |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6866df4..13fb06a 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -447,7 +447,12 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
-   return kcalloc(size, sizeof(void *), gfp);
+   /* Allocate an extra dummy element at end of ring to avoid consumer head
+* or produce head access past the end of the array. Possible when
+* producer/consumer operations and __ptr_ring_peek operations run in
+* parallel.
+*/
+   return kcalloc(size + 1, sizeof(void *), gfp);
 }
 
 static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Alexei Starovoitov


On 12/27/17 6:34 PM, Masami Hiramatsu wrote:

On Wed, 27 Dec 2017 14:46:24 -0800
Alexei Starovoitov  wrote:


On 12/26/17 9:56 PM, Masami Hiramatsu wrote:

On Tue, 26 Dec 2017 17:57:32 -0800
Alexei Starovoitov  wrote:


On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:

Check whether error injectable event is on function entry or not.
Currently it checks the event is ftrace-based kprobes or not,
but that is wrong. It should check if the event is on the entry
of target function. Since error injection will override a function
to just return with modified return value, that operation must
be done before the target function starts making stackframe.

As a side effect, bpf error injection is no need to depend on
function-tracer. It can work with sw-breakpoint based kprobe
events too.

Signed-off-by: Masami Hiramatsu 
---
 kernel/trace/Kconfig|2 --
 kernel/trace/bpf_trace.c|6 +++---
 kernel/trace/trace_kprobe.c |8 +---
 kernel/trace/trace_probe.h  |   12 ++--
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ae3a2d519e50..6400e1bf97c5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -533,9 +533,7 @@ config FUNCTION_PROFILER
 config BPF_KPROBE_OVERRIDE
bool "Enable BPF programs to override a kprobed function"
depends on BPF_EVENTS
-   depends on KPROBES_ON_FTRACE
depends on HAVE_KPROBE_OVERRIDE
-   depends on DYNAMIC_FTRACE_WITH_REGS
default n
help
 Allows BPF to override the execution of a probed function and
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f6d2327ecb59..d663660f8392 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
int ret = -EEXIST;

/*
-* Kprobe override only works for ftrace based kprobes, and only if they
-* are on the opt-in list.
+* Kprobe override only works if they are on the function entry,
+* and only if they are on the opt-in list.
 */
if (prog->kprobe_override &&
-   (!trace_kprobe_ftrace(event->tp_event) ||
+   (!trace_kprobe_on_func_entry(event->tp_event) ||
 !trace_kprobe_error_injectable(event->tp_event)))
return -EINVAL;

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 91f4b57dab82..265e3e27e8dc 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -88,13 +88,15 @@ static nokprobe_inline unsigned long 
trace_kprobe_nhit(struct trace_kprobe *tk)
return nhit;
 }

-int trace_kprobe_ftrace(struct trace_event_call *call)
+bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 {
struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
-   return kprobe_ftrace(>rp.kp);
+
+   return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
+   tk->rp.kp.offset);


That would be nice, but did you test this?


Yes, because the jprobe, which was only official user of modifying execution
path using kprobe, did same way to check. (and kretprobe also does it)


My understanding that kprobe will restore all regs and
here we need to override return ip _and_ value.


yes, no problem. kprobe restore all regs from pt_regs, including regs->ip.


Could you add a patch with the test the way Josef did
or describe the steps to test this new mode?


Would you mean below patch? If so, it should work without any change.

 [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return


yeah. I expect bpf_override_return test to work as-is.
I'm asking for the test for new functionality added by this patch.
In particular kprobe on func entry without ftrace.
How did you test it?


This function is used in kretprobe and jprobe. Jprobe was the user of
"modifying instruction pointer to another function" in kprobes.
If it doesn't work, jprobe also doesn't work, this means you can not
modify IP by kprobes anymore.
Anyway, until linux-4.13, that was well tested by kprobe smoke test.


and how I can repeat the test?
I'm still not sure that it works correctly.


That works correctly because it checks given address is on the entry
point (the 1st instruction) of a function, using kallsyms.

The reason why I made another flag for ftrace was, there are 2 modes
for ftrace dynamic instrumentation, fentry and mcount.
With new fentry mode, ftrace will be put on the first instruction
of the function, so it will work as you expected.
With traditional gcc mcount, ftrace will be called after making call
frame for _mcount(). This means if you modify ip, it will not work
or cause a trouble because _mcount call frame is still on the stack.

So, current ftrace-based checker doesn't work, it depends on the case.
Of course, in most

[PATCH bpf-next v3 1/9] bpf: offload: don't require rtnl for dev list manipulation

2017-12-27 Thread Jakub Kicinski

We don't need the RTNL lock for all operations on offload state.
We only need to hold it around ndo calls.  The device offload
initialization doesn't require it.  The soon-to-come querying
of the offload info will only need it partially.  We will also
be able to remove the waitqueue in following patches.

Use struct rw_semaphore because map offload will require sleeping
with the semaphore held for read.

Suggested-by: Kirill Tkhai 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
v3:
 - use dev_get_by_index() (Kirill).
v2:
 - use dev_get_by_index_rcu() instead of implicit lock dependencies;
 - use DECLARE_RWSEM() instead of init_rwsem() (Kirill).
---
 kernel/bpf/offload.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 8455b89d1bbf..032079754d88 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -20,13 +20,16 @@
 #include 
 #include 
 #include 
+#include 
 
-/* protected by RTNL */
+/* Protects bpf_prog_offload_devs and offload members of all progs.
+ * RTNL lock cannot be taken when holding this lock.
+ */
+static DECLARE_RWSEM(bpf_devs_lock);
 static LIST_HEAD(bpf_prog_offload_devs);
 
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
-   struct net *net = current->nsproxy->net_ns;
struct bpf_dev_offload *offload;
 
if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
@@ -43,19 +46,26 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union 
bpf_attr *attr)
offload->prog = prog;
init_waitqueue_head(>verifier_done);
 
-   rtnl_lock();
-   offload->netdev = __dev_get_by_index(net, attr->prog_ifindex);
-   if (!offload->netdev) {
-   rtnl_unlock();
-   kfree(offload);
-   return -EINVAL;
-   }
+   offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
+  attr->prog_ifindex);
+   if (!offload->netdev)
+   goto err_free;
 
+   down_write(_devs_lock);
+   if (offload->netdev->reg_state != NETREG_REGISTERED)
+   goto err_unlock;
prog->aux->offload = offload;
list_add_tail(>offloads, _prog_offload_devs);
-   rtnl_unlock();
+   dev_put(offload->netdev);
+   up_write(_devs_lock);
 
return 0;
+err_unlock:
+   up_write(_devs_lock);
+   dev_put(offload->netdev);
+err_free:
+   kfree(offload);
+   return -EINVAL;
 }
 
 static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command 
cmd,
@@ -126,7 +136,9 @@ void bpf_prog_offload_destroy(struct bpf_prog *prog)
wake_up(>verifier_done);
 
rtnl_lock();
+   down_write(_devs_lock);
__bpf_prog_offload_destroy(prog);
+   up_write(_devs_lock);
rtnl_unlock();
 
kfree(offload);
@@ -181,11 +193,13 @@ static int bpf_offload_notification(struct notifier_block 
*notifier,
if (netdev->reg_state != NETREG_UNREGISTERING)
break;
 
+   down_write(_devs_lock);
list_for_each_entry_safe(offload, tmp, _prog_offload_devs,
 offloads) {
if (offload->netdev == netdev)
__bpf_prog_offload_destroy(offload->prog);
}
+   up_write(_devs_lock);
break;
default:
break;
-- 
2.15.1

[PATCH bpf-next v3 5/9] bpf: offload: free program id when device disappears

2017-12-27 Thread Jakub Kicinski

Bound programs are quite useless after their device disappears.
They are simply waiting for reference count to go to zero,
don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID
early.

Note that orphaned offload programs will return -ENODEV on
BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  | 2 ++
 kernel/bpf/offload.c | 3 +++
 kernel/bpf/syscall.c | 9 +++--
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 669549f7e3e8..9a916ab34299 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -357,6 +357,8 @@ void bpf_prog_put(struct bpf_prog *prog);
 int __bpf_prog_charge(struct user_struct *user, u32 pages);
 void __bpf_prog_uncharge(struct user_struct *user, u32 pages);
 
+void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
+
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
 struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref);
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 3126e1a842e6..e4f1668a021c 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -130,6 +130,9 @@ static void __bpf_prog_offload_destroy(struct bpf_prog 
*prog)
if (offload->dev_state)
WARN_ON(__bpf_offload_ndo(prog, BPF_OFFLOAD_DESTROY, ));
 
+   /* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */
+   bpf_prog_free_id(prog, true);
+
list_del_init(>offloads);
kfree(offload);
prog->aux->offload = NULL;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1143db61584c..7d9f5b0f0e49 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -905,9 +905,13 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog)
return id > 0 ? 0 : id;
 }
 
-static void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
+void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
 {
-   /* cBPF to eBPF migrations are currently not in the idr store. */
+   /* cBPF to eBPF migrations are currently not in the idr store.
+* Offloaded programs are removed from the store when their device
+* disappears - even if someone grabs an fd to them they are unusable,
+* simply waiting for refcnt to drop to be freed.
+*/
if (!prog->aux->id)
return;
 
@@ -917,6 +921,7 @@ static void bpf_prog_free_id(struct bpf_prog *prog, bool 
do_idr_lock)
__acquire(_idr_lock);
 
idr_remove(_idr, prog->aux->id);
+   prog->aux->id = 0;
 
if (do_idr_lock)
spin_unlock_bh(_idr_lock);
-- 
2.15.1

[PATCH bpf-next v3 4/9] bpf: offload: free prog->aux->offload when device disappears

2017-12-27 Thread Jakub Kicinski

All bpf offload operations should now be under bpf_devs_lock,
it's safe to free and clear the entire offload structure,
not only the netdev pointer.

__bpf_prog_offload_destroy() will no longer be called multiple
times.

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/offload.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 69ddc3899bab..3126e1a842e6 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -70,12 +70,14 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union 
bpf_attr *attr)
 static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command 
cmd,
 struct netdev_bpf *data)
 {
-   struct net_device *netdev = prog->aux->offload->netdev;
+   struct bpf_dev_offload *offload = prog->aux->offload;
+   struct net_device *netdev;
 
ASSERT_RTNL();
 
-   if (!netdev)
+   if (!offload)
return -ENODEV;
+   netdev = offload->netdev;
if (!netdev->netdev_ops->ndo_bpf)
return -EOPNOTSUPP;
 
@@ -111,7 +113,7 @@ int bpf_prog_offload_verify_insn(struct bpf_verifier_env 
*env,
 
down_read(_devs_lock);
offload = env->prog->aux->offload;
-   if (offload->netdev)
+   if (offload)
ret = offload->dev_ops->insn_hook(env, insn_idx, prev_insn_idx);
up_read(_devs_lock);
 
@@ -123,31 +125,24 @@ static void __bpf_prog_offload_destroy(struct bpf_prog 
*prog)
struct bpf_dev_offload *offload = prog->aux->offload;
struct netdev_bpf data = {};
 
-   /* Caution - if netdev is destroyed before the program, this function
-* will be called twice.
-*/
-
data.offload.prog = prog;
 
if (offload->dev_state)
WARN_ON(__bpf_offload_ndo(prog, BPF_OFFLOAD_DESTROY, ));
 
-   offload->dev_state = false;
list_del_init(>offloads);
-   offload->netdev = NULL;
+   kfree(offload);
+   prog->aux->offload = NULL;
 }
 
 void bpf_prog_offload_destroy(struct bpf_prog *prog)
 {
-   struct bpf_dev_offload *offload = prog->aux->offload;
-
rtnl_lock();
down_write(_devs_lock);
-   __bpf_prog_offload_destroy(prog);
+   if (prog->aux->offload)
+   __bpf_prog_offload_destroy(prog);
up_write(_devs_lock);
rtnl_unlock();
-
-   kfree(offload);
 }
 
 static int bpf_prog_offload_translate(struct bpf_prog *prog)
-- 
2.15.1

[PATCH bpf-next v3 3/9] bpf: offload: allow netdev to disappear while verifier is running

2017-12-27 Thread Jakub Kicinski

To allow verifier instruction callbacks without any extra locking
NETDEV_UNREGISTER notification would wait on a waitqueue for verifier
to finish.  This design decision was made when rtnl lock was providing
all the locking.  Use the read/write lock instead and remove the
workqueue.

Verifier will now call into the offload code, so dev_ops are moved
to offload structure.  Since verifier calls are all under
bpf_prog_is_dev_bound() we no longer need static inline implementations
to please builds with CONFIG_NET=n.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  2 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |  2 +-
 drivers/net/netdevsim/bpf.c   |  2 +-
 include/linux/bpf.h   |  9 +--
 include/linux/bpf_verifier.h  | 16 ++--
 include/linux/netdevice.h |  4 +--
 kernel/bpf/offload.c  | 30 ---
 kernel/bpf/verifier.c | 20 ++-
 8 files changed, 37 insertions(+), 48 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index aae1be9ed056..89a9b6393882 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -238,7 +238,7 @@ struct nfp_bpf_vnic {
 
 int nfp_bpf_jit(struct nfp_prog *prog);
 
-extern const struct bpf_ext_analyzer_ops nfp_bpf_analyzer_ops;
+extern const struct bpf_prog_offload_ops nfp_bpf_analyzer_ops;
 
 struct netdev_bpf;
 struct nfp_app;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c 
b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index 9c2608445bd8..d8870c2f11f3 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
@@ -260,6 +260,6 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, 
int prev_insn_idx)
return 0;
 }
 
-const struct bpf_ext_analyzer_ops nfp_bpf_analyzer_ops = {
+const struct bpf_prog_offload_ops nfp_bpf_analyzer_ops = {
.insn_hook = nfp_verify_insn,
 };
diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index a243fa7ae02f..5134d5c1306c 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -66,7 +66,7 @@ nsim_bpf_verify_insn(struct bpf_verifier_env *env, int 
insn_idx, int prev_insn)
return 0;
 }
 
-static const struct bpf_ext_analyzer_ops nsim_bpf_analyzer_ops = {
+static const struct bpf_prog_offload_ops nsim_bpf_analyzer_ops = {
.insn_hook = nsim_bpf_verify_insn,
 };
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 838eee10e979..669549f7e3e8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 
+struct bpf_verifier_env;
 struct perf_event;
 struct bpf_prog;
 struct bpf_map;
@@ -184,14 +185,18 @@ struct bpf_verifier_ops {
  struct bpf_prog *prog, u32 *target_size);
 };
 
+struct bpf_prog_offload_ops {
+   int (*insn_hook)(struct bpf_verifier_env *env,
+int insn_idx, int prev_insn_idx);
+};
+
 struct bpf_dev_offload {
struct bpf_prog *prog;
struct net_device   *netdev;
void*dev_priv;
struct list_headoffloads;
booldev_state;
-   boolverifier_running;
-   wait_queue_head_t   verifier_done;
+   const struct bpf_prog_offload_ops *dev_ops;
 };
 
 struct bpf_prog_aux {
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c009e472f647..70237bb29eb7 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -166,12 +166,6 @@ static inline bool bpf_verifier_log_full(const struct 
bpf_verifer_log *log)
return log->len_used >= log->len_total - 1;
 }
 
-struct bpf_verifier_env;
-struct bpf_ext_analyzer_ops {
-   int (*insn_hook)(struct bpf_verifier_env *env,
-int insn_idx, int prev_insn_idx);
-};
-
 #define BPF_MAX_SUBPROGS 256
 
 /* single container for all structs
@@ -185,7 +179,6 @@ struct bpf_verifier_env {
bool strict_alignment;  /* perform strict pointer alignment 
checks */
struct bpf_verifier_state *cur_state; /* current verifier state */
struct bpf_verifier_state_list **explored_states; /* search pruning 
optimization */
-   const struct bpf_ext_analyzer_ops *dev_ops; /* device analyzer ops */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by 
eBPF program */
u32 used_map_cnt;   /* number of used maps */
u32 id_gen; /* used to generate unique reg IDs */
@@ -205,13 +198,8 @@

[PATCH bpf-next v3 9/9] selftests/bpf: test device info reporting for bound progs

2017-12-27 Thread Jakub Kicinski

Check if bound programs report correct device info.  Test
in local namespace, in remote one, back to the local ns,
remove the device and check that information is cleared.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
--
v2:
 - check the error code from "prog show pin XX" with device
   removed is -ENODEV.
---
 tools/testing/selftests/bpf/test_offload.py | 112 +---
 1 file changed, 101 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_offload.py 
b/tools/testing/selftests/bpf/test_offload.py
index c940505c2978..e3c750f17cb8 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -18,6 +18,8 @@ import argparse
 import json
 import os
 import pprint
+import random
+import string
 import subprocess
 import time
 
@@ -27,6 +29,7 @@ bpf_test_dir = os.path.dirname(os.path.realpath(__file__))
 pp = pprint.PrettyPrinter()
 devs = [] # devices we created for clean up
 files = [] # files to be removed
+netns = [] # net namespaces to be removed
 
 def log_get_sec(level=0):
 return "*" * (log_level + level)
@@ -128,22 +131,25 @@ files = [] # files to be removed
 if f in files:
 files.remove(f)
 
-def tool(name, args, flags, JSON=True, fail=True):
+def tool(name, args, flags, JSON=True, ns="", fail=True):
 params = ""
 if JSON:
 params += "%s " % (flags["json"])
 
-ret, out = cmd(name + " " + params + args, fail=fail)
+if ns != "":
+ns = "ip netns exec %s " % (ns)
+
+ret, out = cmd(ns + name + " " + params + args, fail=fail)
 if JSON and len(out.strip()) != 0:
 return ret, json.loads(out)
 else:
 return ret, out
 
-def bpftool(args, JSON=True, fail=True):
-return tool("bpftool", args, {"json":"-p"}, JSON=JSON, fail=fail)
+def bpftool(args, JSON=True, ns="", fail=True):
+return tool("bpftool", args, {"json":"-p"}, JSON=JSON, ns=ns, fail=fail)
 
-def bpftool_prog_list(expected=None):
-_, progs = bpftool("prog show", JSON=True, fail=True)
+def bpftool_prog_list(expected=None, ns=""):
+_, progs = bpftool("prog show", JSON=True, ns=ns, fail=True)
 if expected is not None:
 if len(progs) != expected:
 fail(True, "%d BPF programs loaded, expected %d" %
@@ -158,13 +164,13 @@ files = [] # files to be removed
 time.sleep(0.05)
 raise Exception("Time out waiting for program counts to stabilize want %d, 
have %d" % (expected, nprogs))
 
-def ip(args, force=False, JSON=True, fail=True):
+def ip(args, force=False, JSON=True, ns="", fail=True):
 if force:
 args = "-force " + args
-return tool("ip", args, {"json":"-j"}, JSON=JSON, fail=fail)
+return tool("ip", args, {"json":"-j"}, JSON=JSON, ns=ns, fail=fail)
 
-def tc(args, JSON=True, fail=True):
-return tool("tc", args, {"json":"-p"}, JSON=JSON, fail=fail)
+def tc(args, JSON=True, ns="", fail=True):
+return tool("tc", args, {"json":"-p"}, JSON=JSON, ns=ns, fail=fail)
 
 def ethtool(dev, opt, args, fail=True):
 return cmd("ethtool %s %s %s" % (opt, dev["ifname"], args), fail=fail)
@@ -178,6 +184,15 @@ files = [] # files to be removed
 def bpf_bytecode(bytecode):
 return "bytecode \"%s\"" % (bytecode)
 
+def mknetns(n_retry=10):
+for i in range(n_retry):
+name = ''.join([random.choice(string.ascii_letters) for i in range(8)])
+ret, _ = ip("netns add %s" % (name), fail=False)
+if ret == 0:
+netns.append(name)
+return name
+return None
+
 class DebugfsDir:
 """
 Class for accessing DebugFS directories as a dictionary.
@@ -237,6 +252,8 @@ files = [] # files to be removed
 self.dev = self._netdevsim_create()
 devs.append(self)
 
+self.ns = ""
+
 self.dfs_dir = '/sys/kernel/debug/netdevsim/%s' % (self.dev['ifname'])
 self.dfs_refresh()
 
@@ -257,7 +274,7 @@ files = [] # files to be removed
 
 def remove(self):
 devs.remove(self)
-ip("link del dev %s" % (self.dev["ifname"]))
+ip("link del dev %s" % (self.dev["ifname"]), ns=self.ns)
 
 def dfs_refresh(self):
 self.dfs = DebugfsDir(self.dfs_dir)
@@ -285,6 +302,11 @@ files = [] # files to be removed
 time.sleep(0.05)
 raise Exception("Time out waiting for program counts to stabilize want 
%d/%d, have %d bound, %d loaded" % (bound, total, nbound, nprogs))
 
+def set_ns(self, ns):
+name = "1" if ns == "" else ns
+ip("link set dev %s netns %s" % (self.dev["ifname"], name), ns=self.ns)
+self.ns = ns
+
 def set_mtu(self, mtu, fail=True):
 return ip("link set dev %s mtu %d" % (self.dev["ifname"], mtu),
   fail=fail)
@@ -372,6 +394,8 @@ files = [] # files to be removed
 dev.remove()
 for f in files:
 cmd("rm -f %s" % (f))
+for ns in netns:
+cmd("ip netns delete

[PATCH bpf-next v3 7/9] bpf: offload: report device information for offloaded programs

2017-12-27 Thread Jakub Kicinski

Report to the user ifindex and namespace information of offloaded
programs.  If device has disappeared return -ENODEV.  Specify the
namespace using dev/inode combination.

CC: Eric W. Biederman 
Signed-off-by: Jakub Kicinski 
---
v3:
 - path_put() (Daniel);
 - move more of the logic to nsfs.c (Daniel).
v2:
 - take RTNL lock to grab a coherent snapshot of device state
   (ifindex vs name space) and avoid races with name space
   moves (based on Eric's comment on Kirill's patch to
   peernet2id_alloc()).
---
 include/linux/bpf.h|  2 ++
 include/uapi/linux/bpf.h   |  3 +++
 kernel/bpf/offload.c   | 59 ++
 kernel/bpf/syscall.c   |  6 +
 tools/include/uapi/linux/bpf.h |  3 +++
 5 files changed, 73 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9a916ab34299..7810ae57b357 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -531,6 +531,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 
 int bpf_prog_offload_compile(struct bpf_prog *prog);
 void bpf_prog_offload_destroy(struct bpf_prog *prog);
+int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
+  struct bpf_prog *prog);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d01f1cb3cfc0..72b37fc3bc0c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -921,6 +921,9 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
+   __u32 ifindex;
+   __u64 netns_dev;
+   __u64 netns_ino;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index e4f1668a021c..040d4e0edf3f 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -16,9 +16,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -176,6 +178,63 @@ int bpf_prog_offload_compile(struct bpf_prog *prog)
return bpf_prog_offload_translate(prog);
 }
 
+struct ns_get_path_bpf_prog_args {
+   struct bpf_prog *prog;
+   struct bpf_prog_info *info;
+};
+
+static struct ns_common *bpf_prog_offload_info_fill_ns(void *private_data)
+{
+   struct ns_get_path_bpf_prog_args *args = private_data;
+   struct bpf_prog_aux *aux = args->prog->aux;
+   struct ns_common *ns;
+   struct net *net;
+
+   rtnl_lock();
+   down_read(_devs_lock);
+
+   if (aux->offload) {
+   args->info->ifindex = aux->offload->netdev->ifindex;
+   net = dev_net(aux->offload->netdev);
+   get_net(net);
+   ns = >ns;
+   } else {
+   args->info->ifindex = 0;
+   ns = NULL;
+   }
+
+   up_read(_devs_lock);
+   rtnl_unlock();
+
+   return ns;
+}
+
+int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
+  struct bpf_prog *prog)
+{
+   struct ns_get_path_bpf_prog_args args = {
+   .prog   = prog,
+   .info   = info,
+   };
+   struct inode *ns_inode;
+   struct path ns_path;
+   void *res;
+
+   res = ns_get_path_cb(_path, bpf_prog_offload_info_fill_ns, );
+   if (IS_ERR(res)) {
+   if (!info->ifindex)
+   return -ENODEV;
+   return PTR_ERR(res);
+   }
+
+   ns_inode = ns_path.dentry->d_inode;
+   info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
+   info->netns_ino = ns_inode->i_ino;
+   path_put(_path);
+
+   return 0;
+}
+
 const struct bpf_prog_ops bpf_offload_prog_ops = {
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7d9f5b0f0e49..20444fd678d0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1624,6 +1624,12 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
return -EFAULT;
}
 
+   if (bpf_prog_is_dev_bound(prog->aux)) {
+   err = bpf_prog_offload_info_fill(, prog);
+   if (err)
+   return err;
+   }
+
 done:
if (copy_to_user(uinfo, , info_len) ||
put_user(info_len, >info.info_len))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db1b0923a308..4e8c60acfa32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -921,6 +921,9 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
+   __u32 ifindex;
+   __u64 netns_dev;
+   __u64 netns_ino;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
-- 
2.15.1

[PATCH bpf-next v3 6/9] nsfs: generalize ns_get_path() for path resolution with a task

2017-12-27 Thread Jakub Kicinski

ns_get_path() takes struct task_struct and proc_ns_ops as its
parameters.  For path resolution directly from a namespace,
e.g. based on a networking device's net name space, we need
more flexibility.  Add a ns_get_path_cb() helper which will
allow callers to use any method of obtaining the name space
reference.  Convert ns_get_path() to use ns_get_path_cb().

Following patches will bring a networking user.

CC: Eric W. Biederman 
Suggested-by: Daniel Borkmann 
Signed-off-by: Jakub Kicinski 
---
 fs/nsfs.c   | 29 ++---
 include/linux/proc_ns.h |  3 +++
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 7c6f76d29f56..36b0772701a0 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -103,14 +103,14 @@ static void *__ns_get_path(struct path *path, struct 
ns_common *ns)
goto got_it;
 }
 
-void *ns_get_path(struct path *path, struct task_struct *task,
-   const struct proc_ns_operations *ns_ops)
+void *ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
+void *private_data)
 {
struct ns_common *ns;
void *ret;
 
 again:
-   ns = ns_ops->get(task);
+   ns = ns_get_cb(private_data);
if (!ns)
return ERR_PTR(-ENOENT);
 
@@ -120,6 +120,29 @@ void *ns_get_path(struct path *path, struct task_struct 
*task,
return ret;
 }
 
+struct ns_get_path_task_args {
+   const struct proc_ns_operations *ns_ops;
+   struct task_struct *task;
+};
+
+static struct ns_common *ns_get_path_task(void *private_data)
+{
+   struct ns_get_path_task_args *args = private_data;
+
+   return args->ns_ops->get(args->task);
+}
+
+void *ns_get_path(struct path *path, struct task_struct *task,
+ const struct proc_ns_operations *ns_ops)
+{
+   struct ns_get_path_task_args args = {
+   .ns_ops = ns_ops,
+   .task   = task,
+   };
+
+   return ns_get_path_cb(path, ns_get_path_task, );
+}
+
 int open_related_ns(struct ns_common *ns,
   struct ns_common *(*get_ns)(struct ns_common *ns))
 {
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 2ff18c9840a7..d31cb6215905 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -78,6 +78,9 @@ extern struct file *proc_ns_fget(int fd);
 #define get_proc_ns(inode) ((struct ns_common *)(inode)->i_private)
 extern void *ns_get_path(struct path *path, struct task_struct *task,
const struct proc_ns_operations *ns_ops);
+typedef struct ns_common *ns_get_path_helper_t(void *);
+extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb,
+   void *private_data);
 
 extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
const struct proc_ns_operations *ns_ops);
-- 
2.15.1

[PATCH bpf-next v3 8/9] tools: bpftool: report device information for offloaded programs

2017-12-27 Thread Jakub Kicinski

Print the just-exposed device information about device to which
program is bound.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/common.c | 52 ++
 tools/bpf/bpftool/main.h   |  2 ++
 tools/bpf/bpftool/prog.c   |  3 +++
 3 files changed, 57 insertions(+)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index b62c94e3997a..6601c95a9258 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -44,7 +44,9 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 
@@ -412,3 +414,53 @@ void delete_pinned_obj_table(struct pinned_obj_table *tab)
free(obj);
}
 }
+
+static char *
+ifindex_to_name_ns(__u32 ifindex, __u32 ns_dev, __u32 ns_ino, char *buf)
+{
+   struct stat st;
+   int err;
+
+   err = stat("/proc/self/ns/net", );
+   if (err) {
+   p_err("Can't stat /proc/self: %s", strerror(errno));
+   return NULL;
+   }
+
+   if (st.st_dev != ns_dev || st.st_ino != ns_ino)
+   return NULL;
+
+   return if_indextoname(ifindex, buf);
+}
+
+void print_dev_plain(__u32 ifindex, __u64 ns_dev, __u64 ns_inode)
+{
+   char name[IF_NAMESIZE];
+
+   if (!ifindex)
+   return;
+
+   printf(" dev ");
+   if (ifindex_to_name_ns(ifindex, ns_dev, ns_inode, name))
+   printf("%s", name);
+   else
+   printf("ifindex %u ns_dev %llu ns_ino %llu",
+  ifindex, ns_dev, ns_inode);
+}
+
+void print_dev_json(__u32 ifindex, __u64 ns_dev, __u64 ns_inode)
+{
+   char name[IF_NAMESIZE];
+
+   if (!ifindex)
+   return;
+
+   jsonw_name(json_wtr, "dev");
+   jsonw_start_object(json_wtr);
+   jsonw_uint_field(json_wtr, "ifindex", ifindex);
+   jsonw_uint_field(json_wtr, "ns_dev", ns_dev);
+   jsonw_uint_field(json_wtr, "ns_inode", ns_inode);
+   if (ifindex_to_name_ns(ifindex, ns_dev, ns_inode, name))
+   jsonw_string_field(json_wtr, "ifname", name);
+   jsonw_end_object(json_wtr);
+}
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 8f6d3cac0347..65b526fe6e7e 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -96,6 +96,8 @@ struct pinned_obj {
 int build_pinned_obj_table(struct pinned_obj_table *table,
   enum bpf_obj_type type);
 void delete_pinned_obj_table(struct pinned_obj_table *tab);
+void print_dev_plain(__u32 ifindex, __u64 ns_dev, __u64 ns_inode);
+void print_dev_json(__u32 ifindex, __u64 ns_dev, __u64 ns_inode);
 
 struct cmd {
const char *cmd;
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 037484ceaeaf..4ccf6301f0fe 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -230,6 +230,8 @@ static void print_prog_json(struct bpf_prog_info *info, int 
fd)
 info->tag[0], info->tag[1], info->tag[2], info->tag[3],
 info->tag[4], info->tag[5], info->tag[6], info->tag[7]);
 
+   print_dev_json(info->ifindex, info->netns_dev, info->netns_ino);
+
if (info->load_time) {
char buf[32];
 
@@ -287,6 +289,7 @@ static void print_prog_plain(struct bpf_prog_info *info, 
int fd)
 
printf("tag ");
fprint_hex(stdout, info->tag, BPF_TAG_SIZE, "");
+   print_dev_plain(info->ifindex, info->netns_dev, info->netns_ino);
printf("\n");
 
if (info->load_time) {
-- 
2.15.1

[PATCH bpf-next v3 2/9] bpf: offload: don't use prog->aux->offload as boolean

2017-12-27 Thread Jakub Kicinski

We currently use aux->offload to indicate that program is bound
to a specific device.  This forces us to keep the offload structure
around even after the device is gone.  Add a bool member to
struct bpf_prog_aux to indicate if offload was requested.

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  | 3 ++-
 kernel/bpf/syscall.c | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index da54ef644fcd..838eee10e979 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -201,6 +201,7 @@ struct bpf_prog_aux {
u32 stack_depth;
u32 id;
u32 func_cnt;
+   bool offload_requested;
struct bpf_prog **func;
void *jit_data; /* JIT specific data. arch dependent */
struct latch_tree_node ksym_tnode;
@@ -529,7 +530,7 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union 
bpf_attr *attr);
 
 static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
 {
-   return aux->offload;
+   return aux->offload_requested;
 }
 #else
 static inline int bpf_prog_offload_init(struct bpf_prog *prog,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e2e1c78ce1dc..1143db61584c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1151,6 +1151,8 @@ static int bpf_prog_load(union bpf_attr *attr)
if (!prog)
return -ENOMEM;
 
+   prog->aux->offload_requested = !!attr->prog_ifindex;
+
err = security_bpf_prog_alloc(prog->aux);
if (err)
goto free_prog_nouncharge;
@@ -1172,7 +1174,7 @@ static int bpf_prog_load(union bpf_attr *attr)
atomic_set(>aux->refcnt, 1);
prog->gpl_compatible = is_gpl ? 1 : 0;
 
-   if (attr->prog_ifindex) {
+   if (bpf_prog_is_dev_bound(prog->aux)) {
err = bpf_prog_offload_init(prog, attr);
if (err)
goto free_prog;
-- 
2.15.1

[PATCH bpf-next v3 0/9] bpf: offload: report device back to user space (take 2)

2017-12-27 Thread Jakub Kicinski

Hi!

This series is a redo of reporting offload device information to
user space after the first attempt did not take into account name
spaces.  As requested by Kirill offloads are now protected by an
r/w sem.  This allows us to remove the workqueue and free the
offload state fully when device is removed (suggested by Alexei).

Net namespace is reported with a device/inode pair.

The accompanying bpftool support is placed in common code because
maps will have very similar info.  Note that the UAPI information
can't be nicely encapsulated into a struct, because in case we
need to grow the device information the new fields will have to
be added at the end of struct bpf_prog_info, we can't grow
structures in the middle of bpf_prog_info.

v3:
 - use dev_get_by_index();
 - redo ns code (new patch 6).
v2:
 - rework the locking in patch 1 (use RCU instead of locking
   dependencies);
 - grab RTNL for a short time in patch 6;
 - minor update to the test in patch 8.

Jakub Kicinski (9):
  bpf: offload: don't require rtnl for dev list manipulation
  bpf: offload: don't use prog->aux->offload as boolean
  bpf: offload: allow netdev to disappear while verifier is running
  bpf: offload: free prog->aux->offload when device disappears
  bpf: offload: free program id when device disappears
  nsfs: generalize ns_get_path() for path resolution with a task
  bpf: offload: report device information for offloaded programs
  tools: bpftool: report device information for offloaded programs
  selftests/bpf: test device info reporting for bound progs

 drivers/net/ethernet/netronome/nfp/bpf/main.h |   2 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |   2 +-
 drivers/net/netdevsim/bpf.c   |   2 +-
 fs/nsfs.c |  29 -
 include/linux/bpf.h   |  16 ++-
 include/linux/bpf_verifier.h  |  16 +--
 include/linux/netdevice.h |   4 +-
 include/linux/proc_ns.h   |   3 +
 include/uapi/linux/bpf.h  |   3 +
 kernel/bpf/offload.c  | 147 --
 kernel/bpf/syscall.c  |  19 ++-
 kernel/bpf/verifier.c |  20 ++-
 tools/bpf/bpftool/common.c|  52 
 tools/bpf/bpftool/main.h  |   2 +
 tools/bpf/bpftool/prog.c  |   3 +
 tools/include/uapi/linux/bpf.h|   3 +
 tools/testing/selftests/bpf/test_offload.py   | 112 +++--
 17 files changed, 346 insertions(+), 89 deletions(-)

-- 
2.15.1

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Masami Hiramatsu

On Wed, 27 Dec 2017 14:46:24 -0800
Alexei Starovoitov  wrote:

> On 12/26/17 9:56 PM, Masami Hiramatsu wrote:
> > On Tue, 26 Dec 2017 17:57:32 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:
> >>> Check whether error injectable event is on function entry or not.
> >>> Currently it checks the event is ftrace-based kprobes or not,
> >>> but that is wrong. It should check if the event is on the entry
> >>> of target function. Since error injection will override a function
> >>> to just return with modified return value, that operation must
> >>> be done before the target function starts making stackframe.
> >>>
> >>> As a side effect, bpf error injection is no need to depend on
> >>> function-tracer. It can work with sw-breakpoint based kprobe
> >>> events too.
> >>>
> >>> Signed-off-by: Masami Hiramatsu 
> >>> ---
> >>>  kernel/trace/Kconfig|2 --
> >>>  kernel/trace/bpf_trace.c|6 +++---
> >>>  kernel/trace/trace_kprobe.c |8 +---
> >>>  kernel/trace/trace_probe.h  |   12 ++--
> >>>  4 files changed, 14 insertions(+), 14 deletions(-)
> >>>
> >>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> >>> index ae3a2d519e50..6400e1bf97c5 100644
> >>> --- a/kernel/trace/Kconfig
> >>> +++ b/kernel/trace/Kconfig
> >>> @@ -533,9 +533,7 @@ config FUNCTION_PROFILER
> >>>  config BPF_KPROBE_OVERRIDE
> >>>   bool "Enable BPF programs to override a kprobed function"
> >>>   depends on BPF_EVENTS
> >>> - depends on KPROBES_ON_FTRACE
> >>>   depends on HAVE_KPROBE_OVERRIDE
> >>> - depends on DYNAMIC_FTRACE_WITH_REGS
> >>>   default n
> >>>   help
> >>>Allows BPF to override the execution of a probed function and
> >>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> >>> index f6d2327ecb59..d663660f8392 100644
> >>> --- a/kernel/trace/bpf_trace.c
> >>> +++ b/kernel/trace/bpf_trace.c
> >>> @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event 
> >>> *event,
> >>>   int ret = -EEXIST;
> >>>
> >>>   /*
> >>> -  * Kprobe override only works for ftrace based kprobes, and only if they
> >>> -  * are on the opt-in list.
> >>> +  * Kprobe override only works if they are on the function entry,
> >>> +  * and only if they are on the opt-in list.
> >>>*/
> >>>   if (prog->kprobe_override &&
> >>> - (!trace_kprobe_ftrace(event->tp_event) ||
> >>> + (!trace_kprobe_on_func_entry(event->tp_event) ||
> >>>!trace_kprobe_error_injectable(event->tp_event)))
> >>>   return -EINVAL;
> >>>
> >>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> >>> index 91f4b57dab82..265e3e27e8dc 100644
> >>> --- a/kernel/trace/trace_kprobe.c
> >>> +++ b/kernel/trace/trace_kprobe.c
> >>> @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long 
> >>> trace_kprobe_nhit(struct trace_kprobe *tk)
> >>>   return nhit;
> >>>  }
> >>>
> >>> -int trace_kprobe_ftrace(struct trace_event_call *call)
> >>> +bool trace_kprobe_on_func_entry(struct trace_event_call *call)
> >>>  {
> >>>   struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> >>> - return kprobe_ftrace(>rp.kp);
> >>> +
> >>> + return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
> >>> + tk->rp.kp.offset);
> >>
> >> That would be nice, but did you test this?
> >
> > Yes, because the jprobe, which was only official user of modifying execution
> > path using kprobe, did same way to check. (and kretprobe also does it)
> >
> >> My understanding that kprobe will restore all regs and
> >> here we need to override return ip _and_ value.
> >
> > yes, no problem. kprobe restore all regs from pt_regs, including regs->ip.
> >
> >> Could you add a patch with the test the way Josef did
> >> or describe the steps to test this new mode?
> >
> > Would you mean below patch? If so, it should work without any change.
> >
> >  [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return
> 
> yeah. I expect bpf_override_return test to work as-is.
> I'm asking for the test for new functionality added by this patch.
> In particular kprobe on func entry without ftrace.
> How did you test it?

This function is used in kretprobe and jprobe. Jprobe was the user of
"modifying instruction pointer to another function" in kprobes.
If it doesn't work, jprobe also doesn't work, this means you can not
modify IP by kprobes anymore.
Anyway, until linux-4.13, that was well tested by kprobe smoke test.

> and how I can repeat the test?
> I'm still not sure that it works correctly.

That works correctly because it checks given address is on the entry
point (the 1st instruction) of a function, using kallsyms.

The reason why I made another flag for ftrace was, there are 2 modes
for ftrace dynamic instrumentation, fentry and mcount.
With new fentry mode, ftrace will be put on the first instruction
of the function, so it will work as you

Re: pull-request: bpf-next 2017-12-28

2017-12-27 Thread David Miller

From: Daniel Borkmann 
Date: Thu, 28 Dec 2017 01:18:21 +0100

> The following pull-request contains BPF updates for your *net-next*
> tree.

Pulled.

Any progress on those tests failing on strict alignment architectures?

Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Masami Hiramatsu

On Wed, 27 Dec 2017 14:49:46 -0800
Alexei Starovoitov  wrote:

> On 12/27/17 12:09 AM, Masami Hiramatsu wrote:
> > On Tue, 26 Dec 2017 18:12:56 -0800
> > Alexei Starovoitov  wrote:
> >
> >> On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
> >>> Support in-kernel fault-injection framework via debugfs.
> >>> This allows you to inject a conditional error to specified
> >>> function using debugfs interfaces.
> >>>
> >>> Signed-off-by: Masami Hiramatsu 
> >>> ---
> >>>  Documentation/fault-injection/fault-injection.txt |5 +
> >>>  kernel/Makefile   |1
> >>>  kernel/fail_function.c|  169 
> >>> +
> >>>  lib/Kconfig.debug |   10 +
> >>>  4 files changed, 185 insertions(+)
> >>>  create mode 100644 kernel/fail_function.c
> >>>
> >>> diff --git a/Documentation/fault-injection/fault-injection.txt 
> >>> b/Documentation/fault-injection/fault-injection.txt
> >>> index 918972babcd8..6243a588dd71 100644
> >>> --- a/Documentation/fault-injection/fault-injection.txt
> >>> +++ b/Documentation/fault-injection/fault-injection.txt
> >>> @@ -30,6 +30,11 @@ o fail_mmc_request
> >>>injects MMC data errors on devices permitted by setting
> >>>debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
> >>>
> >>> +o fail_function
> >>> +
> >>> +  injects error return on specific functions by setting debugfs entries
> >>> +  under /sys/kernel/debug/fail_function. No boot option supported.
> >>
> >> I like it.
> >> Could you document it a bit better?
> >
> > Yes, I will do in next series.
> >
> >> In particular retval is configurable, but without an example no one
> >> will be able to figure out how to use it.
> >
> > Ah, right. BTW, as I pointed in the covermail, should we store the
> > expected error value range into the injectable list? e.g.
> >
> > ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)
> >
> > And provide APIs to check/get it.
> 
> I'm afraid such check would be too costly.
> Right now we have only two functions marked but I expect hundreds more
> will be added in the near future as soon as developers realize the
> potential of such error injection.
> All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data.
> Multiple by 1k and we have 8k of data spent on marks.
> If we add max/min range marks that doubles it for very little use.
> I think marking function only is enough.

Sorry, I don't think so.
Even if it takes 16 bytes more for each points, I don't think it is
any overhead for machines in these days. Even if so, we can provide
a kconfig to reduce it.
I mean, we are living in GB-order memory are, and it will be bigger
in the future. Why we have to worry about hundreds of 16bytes memory
pieces? It will take a few KB, and even if we mark thousands of
functions, it never reaches 1MB, in GB memory pool. :)

Of course, for many small-footprint embedded devices (like having
less than 128MB memory), this feature can be a overhead. But they
can cut off the table by kconfig.

Thank you,

-- 
Masami Hiramatsu

Re: [RFC PATCH net-next] net: hns3: hns3_get_channels() can be static

2017-12-27 Thread David Miller

From: kbuild test robot 
Date: Thu, 28 Dec 2017 09:09:59 +0800

> 
> Fixes: 482d2e9c1cc7 ("net: hns3: add support to query tqps number")
> Signed-off-by: Fengguang Wu 

Applied, thanks.

Re: [PATCH net-next] nfp: bpf: allocate vNIC priv for keeping track of the offloaded program

2017-12-27 Thread David Miller

From: Jakub Kicinski 
Date: Wed, 27 Dec 2017 15:36:49 -0800

> After TC offloads were converted to callbacks we have no choice
> but keep track of the offloaded filter in the driver.
> 
> Since this change came a little late in the release cycle
> there were a number of conflicts and allocation of vNIC priv
> structure seems to have slipped away in linux-next.
> 
> Signed-off-by: Jakub Kicinski 

Oh well, I thought I resolved the merge conflict properly in this
case.  Apparently not :-)

Applied, thanks Jakub.

Re: pull-request: bpf 2017-12-28

2017-12-27 Thread David Miller

From: Daniel Borkmann 
Date: Thu, 28 Dec 2017 00:27:23 +0100

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Two small fixes for bpftool. Fix otherwise broken output if any of
>the system calls failed when listing maps in json format and instead
>of bailing out, skip maps or progs that disappeared between fetching
>next id and getting an fd for that id, both from Jakub.
> 
> 2) Small fix in BPF selftests to respect LLC passed from command line
>when testing for -mcpu=probe presence, from Quentin.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

Re: WARNING in strp_data_ready

2017-12-27 Thread Tom Herbert

On Wed, Dec 27, 2017 at 12:20 PM, Ozgur  wrote:
>
>
> 27.12.2017, 23:14, "Dmitry Vyukov" :
>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur  wrote:
>>>  27.12.2017, 22:21, "Dmitry Vyukov" :
  On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert  wrote:
>   Did you try the patch I posted?

  Hi Tom,
>>>
>>>  Hello Dmitry,
>>>
  No. And I didn't know I need to. Why?
  If you think the patch needs additional testing, you can ask syzbot to
  test it. See 
 https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
  Otherwise proceed with committing it. Or what are we waiting for?

  Thanks
>>>
>>>  I think we need to fixed patch for crash, in fact check to patch code and 
>>> test solve the bug.
>>>  How do test it because there is no patch in the following bug?
>>
>> Hi Ozgur,
>>
>> I am not sure I completely understand what you mean. But the
>> reproducer for this bug (which one can use for testing) is here:
>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
>> Tom also mentions there is some patch for this, but I don't know where
>> it is, it doesn't seem to be referenced from this thread.
>
> Hello Dmitry,
>
> Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :)
> I think Tom send patch to only you and are you tested?
>
> kcmsock.c will change and strp_data_ready I think locked.
>
> Tom, please send a patch for me? I can test and inform you.
>
Hi Ozgur,

I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you can!

Thanks,
Tom

> Regards
>
> Ozgur
>
>>>  The fix patch should be for this net/kcm/kcmsock.c file and lock functions 
>>> must be added calling sk_data_ready ().
>>>  Regards
>>>
>>>  Ozgur
>>>
>   On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov  
> wrote:
>>   On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov  
>> wrote:
    wrote:
>   On 10/24/2017 08:20 AM, syzbot wrote:
>>   Hello,
>>
>>   syzkaller hit the following crash on 
>> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>   
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>   compiler: gcc (GCC) 7.1.1 20170620
>>   .config is attached
>>   Raw console output is attached.
>>   C reproducer is attached
>>   syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>   for information about syzkaller reproducers
>>
>>   WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
>> sock_owned_by_me include/net/sock.h:1505 [inline]
>>   WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
>> sock_owned_by_user include/net/sock.h:1511 [inline]
>>   WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
>> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>   Kernel panic - not syncing: panic_on_warn set ...
>>
>>   CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>   Hardware name: Google Google Compute Engine/Google Compute Engine, 
>> BIOS Google 01/01/2011
>>   Call Trace:
>>
>>__dump_stack lib/dump_stack.c:16 [inline]
>>dump_stack+0x194/0x257 lib/dump_stack.c:52
>>panic+0x1e4/0x417 kernel/panic.c:181
>>__warn+0x1c4/0x1d9 kernel/panic.c:542
>>report_bug+0x211/0x2d0 lib/bug.c:183
>>fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>   RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>   RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>   RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>   RSP: 0018:8801db206b18 EFLAGS: 00010206
>>   RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: 
>>   RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0
>>   RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd
>>   R10: 8801db206a00 R11: dc00 R12: 8801d29fa000
>>   R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001
>>psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>
>   Looks like KCM is calling sk_data_ready() without first taking the
>   sock lock.
>
>   /* Called with lower sock held */
>   static void kcm_rcv_strparser(struct

[PATCH RFC 0/2] kcm: Fix lockdep issue

2017-12-27 Thread Tom Herbert

When sock_owned_by_user returns true in strparser. Fix is to add and
call sock_owned_by_user_nocheck since the check for owned by user is
not an error condition in this case.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Reported-by: syzbot 

Tom Herbert (2):
  sock: Add sock_owned_by_user_nocheck
  strparser: Call sock_owned_by_user_nocheck

 include/net/sock.h| 5 +
 net/strparser/strparser.c | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

-- 
2.11.0

[PATCH RFC 1/2] sock: Add sock_owned_by_user_nocheck

2017-12-27 Thread Tom Herbert

This allows checking socket lock ownership with producing lockdep
warnings.

Signed-off-by: Tom Herbert 
---
 include/net/sock.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 6c1db823f8b9..66fd3951e6f3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1515,6 +1515,11 @@ static inline bool sock_owned_by_user(const struct sock 
*sk)
return sk->sk_lock.owned;
 }
 
+static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
+{
+   return sk->sk_lock.owned;
+}
+
 /* no reclassification while locks are held */
 static inline bool sock_allow_reclassification(const struct sock *csk)
 {
-- 
2.11.0

[PATCH RFC 2/2] strparser: Call sock_owned_by_user_nocheck

2017-12-27 Thread Tom Herbert

strparser wants to check socket ownership without producing any
warnings. As indicated by the comment in the code, it is permissible
for owned_by_user to return true.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Reported-by: syzbot 
Signed-off-by: Tom Herbert 
---
 net/strparser/strparser.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index c5fda15ba319..1fdab5c4eda8 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp)
 * allows a thread in BH context to safely check if the process
 * lock is held. In this case, if the lock is held, queue work.
 */
-   if (sock_owned_by_user(strp->sk)) {
+   if (sock_owned_by_user_nocheck(strp->sk)) {
queue_work(strp_wq, >work);
return;
}
-- 
2.11.0

[net-next:master 731/763] drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c:852:6: sparse: symbol 'hns3_get_channels' was not declared. Should it be static?

2017-12-27 Thread kbuild test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   55b07a65e15bea6e253a907dacaf89b61fe504ca
commit: 482d2e9c1cc7c0e154464e3e052db09e5e62541f [731/763] net: hns3: add 
support to query tqps number
reproduce:
# apt-get install sparse
git checkout 482d2e9c1cc7c0e154464e3e052db09e5e62541f
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)


Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

[RFC PATCH net-next] net: hns3: hns3_get_channels() can be static

2017-12-27 Thread kbuild test robot


Fixes: 482d2e9c1cc7 ("net: hns3: add support to query tqps number")
Signed-off-by: Fengguang Wu 
---
 hns3_ethtool.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 2ae4d397..379c01d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -876,8 +876,8 @@ static int hns3_nway_reset(struct net_device *netdev)
return genphy_restart_aneg(phy);
 }
 
-void hns3_get_channels(struct net_device *netdev,
-  struct ethtool_channels *ch)
+static void hns3_get_channels(struct net_device *netdev,
+ struct ethtool_channels *ch)
 {
struct hnae3_handle *h = hns3_get_handle(netdev);

Re: [PATCH 1/4] libbpf: add function to setup XDP

2017-12-27 Thread Toshiaki Makita

On 2017/12/28 3:02, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
> 
> Signed-off-by: Eric Leblond 
> ---
...
> +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
...
> + if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) {
> + ret = -errno;
> + goto cleanup;
> + }
> +
> + addrlen = sizeof(sa);
> + if (getsockname(sock, (struct sockaddr *), ) < 0) {
> + ret = errno;

forgot to prepend '-'?

> + goto cleanup;
> + }
> +
> + if (addrlen != sizeof(sa)) {
> + ret = errno;

errno is not set?

> + goto cleanup;
> + }

-- 
Toshiaki Makita

Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-27 Thread महेश बंडेवार

On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages)
 wrote:
> Hello Mahesh,
>
> On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
>  wrote:
>> Hello James,
>>
>> Seems like I missed your name to be added into the review of this
>> patch series. Would you be willing be pull this into the security
>> tree? Serge Hallyn has already ACKed it.
>
> We seem to have no formal documentation/specification of this feature.
> I think that should be written up before this patch goes into
> mainline...
>
absolutely. I have added enough information into the Documentation dir
relevant to this feature (please look at the  individual patches),
that could be used. I could help if needed.

thanks,
--mahesh..

> Cheers,
>
> Michael
>
>
>>
>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
>>> From: Mahesh Bandewar 
>>>
>>> TL;DR version
>>> -
>>> Creating a sandbox environment with namespaces is challenging
>>> considering what these sandboxed processes can engage into. e.g.
>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>>> Current form of user-namespaces, however, if changed a bit can allow
>>> us to create a sandbox environment without locking down user-
>>> namespaces.
>>>
>>> Detailed version
>>> 
>>>
>>> Problem
>>> ---
>>> User-namespaces in the current form have increased the attack surface as
>>> any process can acquire capabilities which are not available to them (by
>>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>>
>>> #define _GNU_SOURCE
>>> #include 
>>> #include 
>>> #include 
>>>
>>> int main(int ac, char **av)
>>> {
>>> int sock = -1;
>>>
>>> printf("Attempting to open RAW socket before unshare()...\n");
>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>> if (sock < 0) {
>>> perror("socket() SOCK_RAW failed: ");
>>> } else {
>>> printf("Successfully opened RAW-Sock before unshare().\n");
>>> close(sock);
>>> sock = -1;
>>> }
>>>
>>> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>> perror("unshare() failed: ");
>>> return 1;
>>> }
>>>
>>> printf("Attempting to open RAW socket after unshare()...\n");
>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>> if (sock < 0) {
>>> perror("socket() SOCK_RAW failed: ");
>>> } else {
>>> printf("Successfully opened RAW-Sock after unshare().\n");
>>> close(sock);
>>> sock = -1;
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> The above example shows how easy it is to acquire NET_RAW capabilities
>>> and once acquired, these processes could take benefit of above mentioned
>>> or similar issues discovered/undiscovered with malicious intent. Note
>>> that this is just an example and the problem/solution is not limited
>>> to NET_RAW capability *only*.
>>>
>>> The easiest fix one can apply here is to lock-down user-namespaces which
>>> many of the distros do (i.e. don't allow users to create user namespaces),
>>> but unfortunately that prevents everyone from using them.
>>>
>>> Approach
>>> 
>>> Introduce a notion of 'controlled' user-namespaces. Every process on
>>> the host is allowed to create user-namespaces (governed by the limit
>>> imposed by per-ns sysctl) however, mark user-namespaces created by
>>> sandboxed processes as 'controlled'. Use this 'mark' at the time of
>>> capability check in conjunction with a global capability whitelist.
>>> If the capability is not whitelisted, processes that belong to
>>> controlled user-namespaces will not be allowed.
>>>
>>> Once a user-ns is marked as 'controlled'; all its child user-
>>> namespaces are marked as 'controlled' too.
>>>
>>> A global whitelist is list of capabilities governed by the
>>> sysctl which is available to (privileged) user in init-ns to modify
>>> while it's applicable to all controlled user-namespaces on the host.
>>>
>>> Marking user-namespaces controlled without modifying the whitelist is
>>> equivalent of the current behavior. The default value of whitelist includes
>>> all capabilities so that the compatibility is maintained. However it gives
>>> admins fine-grained ability to control various capabilities system wide
>>> without locking down user-namespaces.
>>>
>>> Please see individual patches in this series.
>>>
>>> Mahesh Bandewar (2):
>>>   capability: introduce sysctl for controlled user-ns capability whitelist
>>>   userns: control capabilities of some user namespaces
>>>
>>>  Documentation/sysctl/kernel.txt | 21 +
>>>  include/linux/capability.h  |  7 ++
>>>  include/linux/user_namespace.h  | 25 
>>>  kernel/capability.c | 52 
>>> +
>>>

pull-request: bpf-next 2017-12-28

2017-12-27 Thread Daniel Borkmann

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Fix incorrect state pruning related to recognition of zero initialized
   stack slots, where stacksafe exploration would mistakenly return a
   positive pruning verdict too early ignoring other slots, from Gianluca.

2) Various BPF to BPF calls related follow-up fixes. Fix an off-by-one
   in maximum call depth check, and rework maximum stack depth tracking
   logic to fix a bypass of the total stack size check reported by Jann.
   Also fix a bug in arm64 JIT where prog->jited_len was uninitialized.
   Addition of various test cases to BPF selftests, from Alexei.

3) Addition of a BPF selftest to test_verifier that is related to BPF to
   BPF calls which demonstrates a late caller stack size increase and
   thus out of bounds access. Fixed above in 2). Test case from Jann.

4) Addition of correlating BPF helper calls, BPF to BPF calls as well
   as BPF maps to bpftool xlated dump in order to allow for better
   BPF program introspection and debugging, from Daniel.

5) Fixing several bugs in BPF to BPF calls kallsyms handling in order
   to get it actually to work for subprogs, from Daniel.

6) Extending sparc64 JIT support for BPF to BPF calls and fix a couple
   of build errors for libbpf on sparc64, from David.

7) Allow narrower context access for BPF dev cgroup typed programs in
   order to adapt to LLVM code generation. Also adjust memlock rlimit
   in the test_dev_cgroup BPF selftest, from Yonghong.

8) Add netdevsim Kconfig entry to BPF selftests since test_offload.py
   relies on netdevsim device being available, from Jakub.

9) Reduce scope of xdp_do_generic_redirect_map() to being static,
   from Xiongwei.

10) Minor cleanups and spelling fixes in BPF verifier, from Colin.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot & have a happy new year!



The following changes since commit 962b582785b60a2b420b0636ad762959c72406f6:

  cxgb4: Simplify PCIe Completion Timeout setting (2017-12-18 15:12:57 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 624588d9d6cc0a1a270a65fb4d5220f1ceddcf38:

  Merge branch 'bpf-stack-depth-tracking-fixes' (2017-12-27 18:36:24 +0100)


Alexei Starovoitov (5):
  bpf: arm64: fix uninitialized variable
  Merge branch 'bpftool-improvements-kallsymfix'
  bpf: fix maximum stack depth tracking logic
  selftests/bpf: additional stack depth tests
  bpf: fix max call depth check

Colin Ian King (2):
  bpf: fix spelling mistake: "funcation"-> "function"
  bpf: make function skip_callee static and return NULL rather than 0

Daniel Borkmann (3):
  bpf: fix kallsyms handling for subprogs
  bpf: allow for correlation of maps and helpers in dump
  Merge branch 'bpf-stack-depth-tracking-fixes'

David Miller (2):
  libbpf: Fix build errors.
  bpf: sparc64: Add JIT support for multi-function programs.

Gianluca Borello (1):
  bpf: fix stacksafe exploration when comparing states

Jakub Kicinski (1):
  selftests/bpf: add netdevsim to config

Jann Horn (1):
  bpf: selftest for late caller stack size increase

Xiongwei Song (1):
  bpf: make function xdp_do_generic_redirect_map() static

Yonghong Song (2):
  bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog
  tools/bpf: adjust rlimit RLIMIT_MEMLOCK for test_dev_cgroup

 arch/arm64/net/bpf_jit_comp.c |   1 +
 arch/sparc/net/bpf_jit_comp_64.c  |  44 -
 include/linux/bpf_verifier.h  |   1 +
 include/linux/filter.h|   9 +
 include/uapi/linux/bpf.h  |   3 +-
 kernel/bpf/cgroup.c   |  15 +-
 kernel/bpf/core.c |   4 +-
 kernel/bpf/disasm.c   |  65 +--
 kernel/bpf/disasm.h   |  29 +++-
 kernel/bpf/syscall.c  |  93 +-
 kernel/bpf/verifier.c | 126 +++---
 net/core/filter.c |   5 +-
 tools/bpf/bpftool/prog.c  | 181 ++-
 tools/lib/bpf/libbpf.c|   5 +-
 tools/testing/selftests/bpf/config|   1 +
 tools/testing/selftests/bpf/test_dev_cgroup.c |   9 +-
 tools/testing/selftests/bpf/test_verifier.c   | 241 ++
 17 files changed, 764 insertions(+), 68 deletions(-)

Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()

2017-12-27 Thread John Fastabend

On 12/27/2017 10:29 AM, Cong Wang wrote:
> On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
>  wrote:
>> On 12/22/2017 12:31 PM, Cong Wang wrote:
>>> I understand why you had it, but it is just not safe. You don't want
>>> to achieve performance gain by crashing system, right?
>>
>> huh? So my point is the patch you submit here is not a
>> real fix but a work around. To peek the head of a consumer/producer ring
>> without a lock, _should_ be fine. This _should_ work as well with
>> consumer or producer operations happening at the same time. After some
>> digging the issue is in the ptr_ring code.
> 
> 
> The comments disagree with you:
> 
> /* Might be slightly faster than skb_array_empty below, but only safe if the
>  * array is never resized. Also, callers invoking this in a loop must take 
> care
>  * to use a compiler barrier, for example cpu_relax().
>  */
> 
> If the comments are right, you miss a barrier here too.
> 

The comment talks about resizing arrays not consumer/producer operations
so I think it is OK. We don't call skb_array_empty on the same skb_array
in a loop here, its different skb arrays. So probably don't need a barrier.

> 
>>
>> The peek code (what empty check calls) is the following,
>>
>> static inline void *__ptr_ring_peek(struct ptr_ring *r)
>> {
>> if (likely(r->size))
>> return r->queue[r->consumer_head];
>> return NULL;
>> }
>>
>> So what the splat is detecting is consumer head being 'out of bounds'.
>> This happens because ptr_ring_discard_one increments the consumer_head
>> and then checks to see if it overran the array size. If above peek
>> happens after the increment, but before the size check we get the
>> splat. There are two ways, as far as I can see, to fix this. First
>> do the check before incrementing the consumer head. Or the easier
>> fix,
>>
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
>> ptr_ring *r,
>>
>>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
>> gfp_t gfp)
>>  {
>> -   return kcalloc(size, sizeof(void *), gfp);
>> +   return kcalloc(size + 1, sizeof(void *), gfp);
>>  }
>>
>> With Jakub's help (Thanks!) I was able to reproduce the original splat
>> and also verify the above removes it.
>>
>> To be clear "resizing" a skb_array only refers to changing the actual
>> array size not adding/removing elements.
> 
> I never look into the implementation, just simply trust the comments.
> 
> At least the comments above __skb_array_empty() need to improve.
> > 
>>
>>>

 Although its not logical IMO to have both reset and dequeue running at
 the same time. Some skbs would get through others would get sent, sort
 of a mess. I don't see how it can be an issue. The never resized bit
 in the documentation is referring to resizing the ring size _not_ popping
 off elements of the ring. array_empty just reads the consumer head.
 The only ring resizing in pfifo fast should be at init and destroy where
 enqueue/dequeue should be disconnected by then. Although based on the
 trace I missed a case.
>>>
>>>
>>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>>
>>
>> Sorry not following.
>>
>>> And ->reset() is called in qdisc_graft() too. Let's say we have 
>>> htb+pfifo_fast,
>>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
>>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
>>> concurrently.
>>
>> Yes and this _should_ be perfectly fine for pfifo_fast. I'm wondering
>> though if this API can be cleaned up. What are the paths that do a reset
>> without a destroy.. Do we really need to have this pattern where reset
>> is called then later destroy. Seems destroy could do the entire cleanup
>> and this would simplify things. None of this has to do with the splat
>> though.
> 
> I don't follow your point any more.> 
> We are talking about ->reset() race with ->dequeue() which is the
> cause of the bug, right?> 

Yes. But it should be OK for qdiscs to have reset and dequeue run
in parallel. So the bug is in ptr_ring.

> If you expect ->reset() runs in parallel with ->dequeue() for pfifo_fast,
> why did you even mention synchronize_net() from the beginning???

A mistake probably I am tracking two separate issues. First this one
with skb_array out of bounds access and the tx_action being called
while destroy is in progress. When I was first looking at this I thought
the splat was from the tx_action issue so probably started talking
about how to sync destroy and tx_action paths. Either way it was
a mistake to mention in this thread.

> Also you changed the code too, to adjust rcu grace period.
> 
> 
>>
>>>
>>>

 I think the right fix is to only call reset/destroy patterns after
 waiting a grace period and for all tx_action

[PATCH net-next] nfp: bpf: allocate vNIC priv for keeping track of the offloaded program

2017-12-27 Thread Jakub Kicinski

After TC offloads were converted to callbacks we have no choice
but keep track of the offloaded filter in the driver.

Since this change came a little late in the release cycle
there were a number of conflicts and allocation of vNIC priv
structure seems to have slipped away in linux-next.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 30 ++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 214b02a3acdd..4b63167906ca 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -84,6 +84,33 @@ static const char *nfp_bpf_extra_cap(struct nfp_app *app, 
struct nfp_net *nn)
return nfp_net_ebpf_capable(nn) ? "BPF" : "";
 }
 
+static int
+nfp_bpf_vnic_alloc(struct nfp_app *app, struct nfp_net *nn, unsigned int id)
+{
+   int err;
+
+   nn->app_priv = kzalloc(sizeof(struct nfp_bpf_vnic), GFP_KERNEL);
+   if (!nn->app_priv)
+   return -ENOMEM;
+
+   err = nfp_app_nic_vnic_alloc(app, nn, id);
+   if (err)
+   goto err_free_priv;
+
+   return 0;
+err_free_priv:
+   kfree(nn->app_priv);
+   return err;
+}
+
+static void nfp_bpf_vnic_free(struct nfp_app *app, struct nfp_net *nn)
+{
+   struct nfp_bpf_vnic *bv = nn->app_priv;
+
+   WARN_ON(bv->tc_prog);
+   kfree(bv);
+}
+
 static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type,
 void *type_data, void *cb_priv)
 {
@@ -286,7 +313,8 @@ const struct nfp_app_type app_bpf = {
 
.extra_cap  = nfp_bpf_extra_cap,
 
-   .vnic_alloc = nfp_app_nic_vnic_alloc,
+   .vnic_alloc = nfp_bpf_vnic_alloc,
+   .vnic_free  = nfp_bpf_vnic_free,
 
.setup_tc   = nfp_bpf_setup_tc,
.tc_busy= nfp_bpf_tc_busy,
-- 
2.15.1

pull-request: bpf 2017-12-28

2017-12-27 Thread Daniel Borkmann

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Two small fixes for bpftool. Fix otherwise broken output if any of
   the system calls failed when listing maps in json format and instead
   of bailing out, skip maps or progs that disappeared between fetching
   next id and getting an fd for that id, both from Jakub.

2) Small fix in BPF selftests to respect LLC passed from command line
   when testing for -mcpu=probe presence, from Quentin.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit c50b7c473f609189da3bccd28ee5dcf3b55109cd:

  Merge branch 'net-zerocopy-fixes' (2017-12-21 15:00:59 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to aee657460a8ce66443d5e7413046d02d7b2165db:

  Merge branch 'bpf-bpftool-various-fixes' (2017-12-23 01:09:53 +0100)


Daniel Borkmann (1):
  Merge branch 'bpf-bpftool-various-fixes'

Jakub Kicinski (2):
  tools: bpftool: maps: close json array on error paths of show
  tools: bpftool: protect against races with disappearing objects

Quentin Monnet (1):
  selftests/bpf: fix Makefile for passing LLC to the command line

 tools/bpf/bpftool/map.c  | 8 +---
 tools/bpf/bpftool/prog.c | 2 ++
 tools/testing/selftests/bpf/Makefile | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface

2017-12-27 Thread Russell King - ARM Linux

On Wed, Dec 27, 2017 at 11:42:52PM +0100, Antoine Tenart wrote:
> Hi Russell,
> 
> On Wed, Dec 27, 2017 at 10:24:01PM +, Russell King - ARM Linux wrote:
> > On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
> > >  
> > > +_eth2 {
> > > + /* CPS Lane 5 */
> > > + status = "okay";
> > > + phy-mode = "2500base-x";
> > > + /* Generic PHY, providing serdes lanes */
> > > + phys = <_comphy5 2>;
> > > +};
> > > +
> > 
> > This is wrong.  This lane is connected to a SFP cage which can support
> > more than just 2500base-X.  Tying it in this way to 2500base-X means
> > that this port does not support conenctions at 1000base-X, despite
> > that's one of the most popular and more standardised speeds.
> 
> What do you suggest to describe this in the dt, to enable a port using
> the current PPv2 driver?

I don't - I'm merely pointing out that you're bodging support for the
SFP cage rather than productively discussing phylink for mvpp2.

As far as I remember, the discussion stalled at this point:

- You think there's modes that mvpp2 supports that are not supportable
  if you use phylink.

- I've described what phylink supports, and I've asked you for details
  about what you can't support.

Unfortunately, no details have been forthcoming, and no further
discussion has occurred - the ball is entirely in your court to
progress this issue since I requested information from you and that
is where things seem to have stalled.

The result is that, with your patch, you're locking the port to 2.5G
speeds, meaning that only 4.3Mbps Fibrechannel SFPs can be used with
the port, and it can only be used with another device that supports
2.5G speeds.  You can't use a copper RJ45 module, and you can't use
a standard 1000base-X module either in this configuration.

What I'm most concerned about, given the bindings for comphy that
have been merged, is that Free Electrons is pushing forward seemingly
with no regard to the requirement that the serdes lanes are dynamically
reconfigurable, and that's a basic requirement for SFP, and for the
88x3310 PHYs on the Macchiatobin platform.

So, my question to you is: what is Free Electrons plans to properly
support the ethernet ports on the Macchiatobin platform?

For those on the Cc list who don't know, phylink is part of full support
for SFP and SFP+ cages, sponsored (in terms of hardware including SFP
modules) by SolidRun on both SolidRun's Clearfog and Macchiatobin
platforms, supporting a wide range of SFP modules including:

- 1G Optical ethernet modules (duplex and bidi modules)
- 10/100/1G RJ45 modules
- 10G SFP+ modules
- 2.5Gbase-X using 4.3Mbps Fibrechannel modules
- Direct attach cables

There is work ongoing between Florian, Andrew and myself to switch DSA
to phylink, and have SFP modules working with both Marvell and Broadcom
DSA switches.

Phylink and SFP was already merged into mainline, and has been usable
(provided that the network driver is converted to phylink rather than
phylib) since 4.14-rc1.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

Re: [PATCH bpf-next v2 6/8] bpf: offload: report device information for offloaded programs

2017-12-27 Thread Jakub Kicinski

On Wed, 27 Dec 2017 18:26:16 +0100, Daniel Borkmann wrote:
> On 12/21/2017 10:01 PM, Jakub Kicinski wrote:
> > Report to the user ifindex and namespace information of offloaded
> > programs.  If device has disappeared return -ENODEV.  Specify the
> > namespace using dev/inode combination.
> > 
> > CC: Eric W. Biederman 
> > Signed-off-by: Jakub Kicinski 
> > Reviewed-by: Quentin Monnet 
> > ---
> > v2:
> >  - take RTNL lock to grab a coherent snapshot of device state
> >(ifindex vs name space) and avoid races with name space
> >moves (based on Eric's comment on Kirill's patch to
> >peernet2id_alloc()).
> > ---
> >  fs/nsfs.c  |  2 +-
> >  include/linux/bpf.h|  2 ++
> >  include/linux/proc_ns.h|  1 +
> >  include/uapi/linux/bpf.h   |  3 +++
> >  kernel/bpf/offload.c   | 44 
> > ++
> >  kernel/bpf/syscall.c   |  6 ++
> >  tools/include/uapi/linux/bpf.h |  3 +++
> >  7 files changed, 60 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nsfs.c b/fs/nsfs.c
> > index 7c6f76d29f56..e50628675935 100644
> > --- a/fs/nsfs.c
> > +++ b/fs/nsfs.c
> > @@ -51,7 +51,7 @@ static void nsfs_evict(struct inode *inode)
> > ns->ops->put(ns);
> >  }
> >  
> > -static void *__ns_get_path(struct path *path, struct ns_common *ns)
> > +void *__ns_get_path(struct path *path, struct ns_common *ns)
> >  {
> > struct vfsmount *mnt = nsfs_mnt;
> > struct dentry *dentry;
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 9a916ab34299..7810ae57b357 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -531,6 +531,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 
> > ufd,
> >  
> >  int bpf_prog_offload_compile(struct bpf_prog *prog);
> >  void bpf_prog_offload_destroy(struct bpf_prog *prog);
> > +int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
> > +  struct bpf_prog *prog);
> >  
> >  #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
> >  int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
> > diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> > index 2ff18c9840a7..1733359cf713 100644
> > --- a/include/linux/proc_ns.h
> > +++ b/include/linux/proc_ns.h
> > @@ -76,6 +76,7 @@ static inline int ns_alloc_inum(struct ns_common *ns)
> >  
> >  extern struct file *proc_ns_fget(int fd);
> >  #define get_proc_ns(inode) ((struct ns_common *)(inode)->i_private)
> > +extern void *__ns_get_path(struct path *path, struct ns_common *ns);
> >  extern void *ns_get_path(struct path *path, struct task_struct *task,
> > const struct proc_ns_operations *ns_ops);
> >  
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index d01f1cb3cfc0..72b37fc3bc0c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -921,6 +921,9 @@ struct bpf_prog_info {
> > __u32 nr_map_ids;
> > __aligned_u64 map_ids;
> > char name[BPF_OBJ_NAME_LEN];
> > +   __u32 ifindex;
> > +   __u64 netns_dev;
> > +   __u64 netns_ino;
> >  } __attribute__((aligned(8)));
> >  
> >  struct bpf_map_info {
> > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > index 1e6064ea3609..4d5bd1e3 100644
> > --- a/kernel/bpf/offload.c
> > +++ b/kernel/bpf/offload.c
> > @@ -16,9 +16,11 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  
> > @@ -181,6 +183,48 @@ int bpf_prog_offload_compile(struct bpf_prog *prog)
> > return bpf_prog_offload_translate(prog);
> >  }
> >  
> > +int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
> > +  struct bpf_prog *prog)
> > +{
> > +   struct bpf_dev_offload *offload;
> > +   struct inode *ns_inode;
> > +   struct path ns_path;
> > +   int ifindex, err;
> > +   struct net *net;
> > +
> > +again:
> > +   rtnl_lock();
> > +   down_read(_devs_lock);
> > +
> > +   offload = prog->aux->offload;
> > +   if (!offload) {
> > +   up_read(_devs_lock);
> > +   rtnl_unlock();
> > +   return -ENODEV;
> > +   }
> > +
> > +   ifindex = offload->netdev->ifindex;
> > +   net = dev_net(offload->netdev);
> > +   get_net(net); /* __ns_get_path() drops the reference */
> > +
> > +   up_read(_devs_lock);
> > +   rtnl_unlock();
> > +
> > +   err = PTR_ERR_OR_ZERO(__ns_get_path(_path, >ns));
> > +   if (err) {
> > +   if (err == -EAGAIN)
> > +   goto again;
> > +   return err;
> > +   }
> > +   ns_inode = ns_path.dentry->d_inode;
> > +
> > +   info->ifindex = ifindex;
> > +   info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
> > +   info->netns_ino = ns_inode->i_ino;  
> 
> I think here, we're still missing a:
> 
>   path_put(_path);
> 
> Otherwise we're leaking that

Re: [PATCH v2 bpf-next 0/3] bpf: two stack check fixes

2017-12-27 Thread Daniel Borkmann

On 12/25/2017 10:15 PM, Alexei Starovoitov wrote:
> Jann reported an issue with stack depth tracking. Fix it and add tests.
> Also fix off-by-one error in MAX_CALL_FRAMES check.
> This set is on top of Jann's "selftest for late caller stack size increase" 
> test.
> 
> Alexei Starovoitov (3):
>   bpf: fix maximum stack depth tracking logic
>   selftests/bpf: additional stack depth tests
>   bpf: fix max call depth check
> 
>  include/linux/bpf_verifier.h|   1 +
>  kernel/bpf/verifier.c   |  86 +++
>  tools/testing/selftests/bpf/test_verifier.c | 156 
> 
>  3 files changed, 225 insertions(+), 18 deletions(-)

Applied to bpf-next, thanks Alexei!

Re: [PATCH] bpf: selftest for late caller stack size increase

2017-12-27 Thread Daniel Borkmann

On 12/22/2017 07:12 PM, Jann Horn wrote:
> This checks that it is not possible to bypass the total stack size check in
> update_stack_depth() by calling a function that uses a large amount of
> stack memory *before* using a large amount of stack memory in the caller.
> 
> Currently, the first added testcase causes a rejection as expected, but
> the second testcase is (AFAICS incorrectly) accepted:
> 
> [...]
> #483/p calls: stack overflow using two frames (post-call access) FAIL
> Unexpected success to load!
> 0: (85) call pc+2
> caller:
>  R10=fp0,call_-1
> callee:
>  frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
> 3: (72) *(u8 *)(r10 -300) = 0
> 4: (b7) r0 = 0
> 5: (95) exit
> returning from callee:
>  frame1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
> to caller at 1:
>  R0_w=inv0 R10=fp0,call_-1
> 
> from 5 to 1: R0=inv0 R10=fp0,call_-1
> 1: (72) *(u8 *)(r10 -300) = 0
> 2: (95) exit
> processed 6 insns, stack depth 300+300
> [...]
> Summary: 704 PASSED, 1 FAILED
> 
> AFAICS the JIT-generated code for the second testcase shows that this
> really causes the stack pointer to be decremented by 300+300:
> 
> first function:
>   55push rbp
> 0001  4889E5mov rbp,rsp
> 0004  4881EC5801sub rsp,0x158
> 000B  4883ED28  sub rbp,byte +0x28
> [...]
> 0025  E89AB3AFE5call 0xe5afb3c4
> 002A  C685D4FE00mov byte [rbp-0x12c],0x0
> [...]
> 0041  4883C528  add rbp,byte +0x28
> 0045  C9leave
> 0046  C3ret
> 
> second function:
>   55push rbp
> 0001  4889E5mov rbp,rsp
> 0004  4881EC5801sub rsp,0x158
> 000B  4883ED28  sub rbp,byte +0x28
> [...]
> 0025  C685D4FE00mov byte [rbp-0x12c],0x0
> [...]
> 003E  4883C528  add rbp,byte +0x28
> 0042  C9leave
> 0043  C3ret
> 
> Signed-off-by: Jann Horn 

Applied to bpf-next, thanks a lot Jann!

Re: [PATCH 3/4] libbpf: break loop earlier

2017-12-27 Thread Daniel Borkmann

On 12/27/2017 09:30 PM, Eric Leblond wrote:
> On Wed, 2017-12-27 at 11:00 -0800, Alexei Starovoitov wrote:
>> On Wed, Dec 27, 2017 at 07:02:28PM +0100, Eric Leblond wrote:
>>> Get out of the loop when we have a match.
>>>
>>> Signed-off-by: Eric Leblond 
>>> ---
>>>  tools/lib/bpf/libbpf.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>> index 5fe8aaa2123e..d263748aa341 100644
>>> --- a/tools/lib/bpf/libbpf.c
>>> +++ b/tools/lib/bpf/libbpf.c
>>> @@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object
>>> *obj)
>>>prog->section_name);
>>> return -LIBBPF_ERRNO__LIBELF;
>>> }
>>> +   break;
>>
>> why this is needed?
> 
> It was just cosmetic, no related bug.
> 
>> The top of the loop is:
>>  for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name;
>>
>> so as soon as name is found the loop will exit.
> 
> OK, I've missed that. Please disregard this patch.

Ok, please respin the series one last time w/o this patch then.

Please also keep the already given ACKs on the other patches.

Thanks, Eric!

Re: [PATCH net-next 09/10] net: qualcomm: rmnet: Add support for TX checksum offload

2017-12-27 Thread kbuild test robot

Hi Subash,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Enable-csum-offloads/20171228-041216
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)


vim +187 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c

   106  
   107  #if IS_ENABLED(CONFIG_IPV6)
   108  static int
   109  rmnet_map_ipv6_dl_csum_trailer(struct sk_buff *skb,
   110 struct rmnet_map_dl_csum_trailer 
*csum_trailer)
   111  {
   112  u16 ip_pseudo_payload_csum, pseudo_csum, ip6_hdr_csum, 
*csum_field;
   113  u16 csum_value, ip6_payload_csum, csum_value_final;
   114  struct ipv6hdr *ip6h;
   115  void *txporthdr;
   116  u32 length;
   117  
   118  ip6h = (struct ipv6hdr *)(skb->data);
   119  
   120  txporthdr = skb->data + sizeof(struct ipv6hdr);
   121  csum_field = rmnet_map_get_csum_field(ip6h->nexthdr, txporthdr);
   122  
   123  if (!csum_field)
   124  return -EPROTONOSUPPORT;
   125  
   126  csum_value = ~ntohs(csum_trailer->csum_value);
   127  ip6_hdr_csum = ~ntohs(ip_compute_csum(ip6h,
   128(int)(txporthdr - (void *)(skb->data;
   129  ip6_payload_csum = csum16_sub(csum_value, ip6_hdr_csum);
   130  
   131  length = (ip6h->nexthdr == IPPROTO_UDP) ?
   132   ntohs(((struct udphdr *)txporthdr)->len) :
   133   ntohs(ip6h->payload_len);
   134  pseudo_csum = ~ntohs(csum_ipv6_magic(>saddr, >daddr,
   135   length, ip6h->nexthdr, 0));
   136  ip_pseudo_payload_csum = csum16_add(ip6_payload_csum, 
pseudo_csum);
   137  
 > 138  csum_value_final = ~csum16_sub(ip_pseudo_payload_csum,
 > 139 ntohs(*csum_field));
   140  
   141  if (unlikely(csum_value_final == 0)) {
   142  switch (ip6h->nexthdr) {
   143  case IPPROTO_UDP:
   144  /* RFC 2460 section 8.1
   145   * DL6 One's complement rule for UDP checksum 0
   146   */
   147  csum_value_final = ~csum_value_final;
   148  break;
   149  
   150  case IPPROTO_TCP:
   151  /* DL6 Non-RFC compliant TCP checksum found */
   152  if (*csum_field == 0x)
   153  csum_value_final = ~csum_value_final;
   154  break;
   155  }
   156  }
   157  
   158  if (csum_value_final == ntohs(*csum_field))
   159  return 0;
   160  else
   161  return -EINVAL;
   162  }
   163  #endif
   164  
   165  static void rmnet_map_complement_ipv4_txporthdr_csum_field(void *iphdr)
   166  {
   167  struct iphdr *ip4h = (struct iphdr *)iphdr;
   168  void *txphdr;
   169  u16 *csum;
   170  
   171  txphdr = iphdr + ip4h->ihl * 4;
   172  
   173  if (ip4h->protocol == IPPROTO_TCP || ip4h->protocol == 
IPPROTO_UDP) {
   174  csum = (u16 *)rmnet_map_get_csum_field(ip4h->protocol, 
txphdr);
   175  *csum = ~(*csum);
   176  }
   177  }
   178  
   179  static void
   180  rmnet_map_ipv4_ul_csum_header(void *iphdr,
   181struct rmnet_map_ul_csum_header 
*ul_header,
   182struct sk_buff *skb)
   183  {
   184  struct iphdr *ip4h = (struct iphdr *)iphdr;
   185  u16 *hdr = (u16 *)ul_header;
   186  
 > 187  ul_header->csum_start_offset = 
 > htons((u16)(skb_transport_header(skb) -
   188 (unsigned char 
*)iphdr));
   189  ul_header->csum_insert_offset = skb->csum_offset;
   190  ul_header->csum_enabled = 1;
   191  if (ip4h->protocol == IPPROTO_UDP)
   192  ul_header->udp_ip4_ind = 1;
   193  else
   194  ul_header->udp_ip4_ind = 0;
   195  
   196  /* Changing remaining fields to network order */
   197  hdr++;
 > 198  *hdr = htons(*hdr);
   199  
   200  skb->ip_summed = CHECKSUM_NONE;
   201  
   202  rmnet_map_complement_ipv4_txporthdr_csum_field(iphdr);
   203  }
   204  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework

2017-12-27 Thread Alexei Starovoitov


On 12/27/17 12:09 AM, Masami Hiramatsu wrote:

On Tue, 26 Dec 2017 18:12:56 -0800
Alexei Starovoitov  wrote:


On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:

Support in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error to specified
function using debugfs interfaces.

Signed-off-by: Masami Hiramatsu 
---
 Documentation/fault-injection/fault-injection.txt |5 +
 kernel/Makefile   |1
 kernel/fail_function.c|  169 +
 lib/Kconfig.debug |   10 +
 4 files changed, 185 insertions(+)
 create mode 100644 kernel/fail_function.c

diff --git a/Documentation/fault-injection/fault-injection.txt 
b/Documentation/fault-injection/fault-injection.txt
index 918972babcd8..6243a588dd71 100644
--- a/Documentation/fault-injection/fault-injection.txt
+++ b/Documentation/fault-injection/fault-injection.txt
@@ -30,6 +30,11 @@ o fail_mmc_request
   injects MMC data errors on devices permitted by setting
   debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request

+o fail_function
+
+  injects error return on specific functions by setting debugfs entries
+  under /sys/kernel/debug/fail_function. No boot option supported.


I like it.
Could you document it a bit better?


Yes, I will do in next series.


In particular retval is configurable, but without an example no one
will be able to figure out how to use it.


Ah, right. BTW, as I pointed in the covermail, should we store the
expected error value range into the injectable list? e.g.

ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)

And provide APIs to check/get it.


I'm afraid such check would be too costly.
Right now we have only two functions marked but I expect hundreds more
will be added in the near future as soon as developers realize the
potential of such error injection.
All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data.
Multiple by 1k and we have 8k of data spent on marks.
If we add max/min range marks that doubles it for very little use.
I think marking function only is enough.

Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry

2017-12-27 Thread Alexei Starovoitov


On 12/26/17 9:56 PM, Masami Hiramatsu wrote:

On Tue, 26 Dec 2017 17:57:32 -0800
Alexei Starovoitov  wrote:


On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:

Check whether error injectable event is on function entry or not.
Currently it checks the event is ftrace-based kprobes or not,
but that is wrong. It should check if the event is on the entry
of target function. Since error injection will override a function
to just return with modified return value, that operation must
be done before the target function starts making stackframe.

As a side effect, bpf error injection is no need to depend on
function-tracer. It can work with sw-breakpoint based kprobe
events too.

Signed-off-by: Masami Hiramatsu 
---
 kernel/trace/Kconfig|2 --
 kernel/trace/bpf_trace.c|6 +++---
 kernel/trace/trace_kprobe.c |8 +---
 kernel/trace/trace_probe.h  |   12 ++--
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ae3a2d519e50..6400e1bf97c5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -533,9 +533,7 @@ config FUNCTION_PROFILER
 config BPF_KPROBE_OVERRIDE
bool "Enable BPF programs to override a kprobed function"
depends on BPF_EVENTS
-   depends on KPROBES_ON_FTRACE
depends on HAVE_KPROBE_OVERRIDE
-   depends on DYNAMIC_FTRACE_WITH_REGS
default n
help
 Allows BPF to override the execution of a probed function and
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f6d2327ecb59..d663660f8392 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
int ret = -EEXIST;

/*
-* Kprobe override only works for ftrace based kprobes, and only if they
-* are on the opt-in list.
+* Kprobe override only works if they are on the function entry,
+* and only if they are on the opt-in list.
 */
if (prog->kprobe_override &&
-   (!trace_kprobe_ftrace(event->tp_event) ||
+   (!trace_kprobe_on_func_entry(event->tp_event) ||
 !trace_kprobe_error_injectable(event->tp_event)))
return -EINVAL;

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 91f4b57dab82..265e3e27e8dc 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -88,13 +88,15 @@ static nokprobe_inline unsigned long 
trace_kprobe_nhit(struct trace_kprobe *tk)
return nhit;
 }

-int trace_kprobe_ftrace(struct trace_event_call *call)
+bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 {
struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
-   return kprobe_ftrace(>rp.kp);
+
+   return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
+   tk->rp.kp.offset);


That would be nice, but did you test this?


Yes, because the jprobe, which was only official user of modifying execution
path using kprobe, did same way to check. (and kretprobe also does it)


My understanding that kprobe will restore all regs and
here we need to override return ip _and_ value.


yes, no problem. kprobe restore all regs from pt_regs, including regs->ip.


Could you add a patch with the test the way Josef did
or describe the steps to test this new mode?


Would you mean below patch? If so, it should work without any change.

 [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return


yeah. I expect bpf_override_return test to work as-is.
I'm asking for the test for new functionality added by this patch.
In particular kprobe on func entry without ftrace.
How did you test it?
and how I can repeat the test?
I'm still not sure that it works correctly.

Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface

2017-12-27 Thread Antoine Tenart

Hi Russell,

On Wed, Dec 27, 2017 at 10:24:01PM +, Russell King - ARM Linux wrote:
> On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
> >  
> > +_eth2 {
> > +   /* CPS Lane 5 */
> > +   status = "okay";
> > +   phy-mode = "2500base-x";
> > +   /* Generic PHY, providing serdes lanes */
> > +   phys = <_comphy5 2>;
> > +};
> > +
> 
> This is wrong.  This lane is connected to a SFP cage which can support
> more than just 2500base-X.  Tying it in this way to 2500base-X means
> that this port does not support conenctions at 1000base-X, despite
> that's one of the most popular and more standardised speeds.

What do you suggest to describe this in the dt, to enable a port using
the current PPv2 driver?

Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[PATCH ipsec] xfrm: skip policies marked as dead while rehashing

2017-12-27 Thread Florian Westphal

syzkaller triggered following KASAN splat:

BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 
net/xfrm/xfrm_policy.c:618
read of size 2 at addr 8801c8e92fe4 by task kworker/1:1/23 [..]
Workqueue: events xfrm_hash_rebuild [..]
 __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
 xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618
 process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112
 worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..]

The reproducer triggers:
1016 if (error) {
1017 list_move_tail(>walk.all, >all);
1018 goto out;
1019 }

in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump
callback returns -ENOBUFS).

In this case, *walk is located the pfkey socket struct, so this socket
becomes visible in the global policy list.

It looks like this is intentional -- phony walker has walk.dead set to 1
and all other places skip such "policies".

Ccing original authors of the two commits that seem to expose this
issue (first patch missed ->dead check, second patch adds pfkey
sockets to policies dumper list).

Fixes: 880a6fab8f6ba5b ("xfrm: configure policy hash table thresholds by 
netlink")
Fixes: 12a169e7d8f4b1c ("ipsec: Put dumpers on the dump list")
Cc: Herbert Xu 
Cc: Timo Teras 
Cc: Christophe Gouault 
Reported-by: syzbot 

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 9542975eb2f9..181bc6181789 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -609,7 +609,8 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 
/* re-insert all policies by order of creation */
list_for_each_entry_reverse(policy, >xfrm.policy_all, walk.all) {
-   if (xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) {
+   if (policy->walk.dead ||
+   xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) {
/* skip socket policies */
continue;
}
-- 
2.13.6

Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface

2017-12-27 Thread Russell King - ARM Linux

On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
> This patch enables the fourth network interface on the Marvell
> Macchiatobin. It is configured in the 2500Base-X PHY mode.
> 
> Signed-off-by: Antoine Tenart 
> ---
>  arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts 
> b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> index b3350827ee55..c51efd511324 100644
> --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> @@ -279,6 +279,14 @@
>   phys = <_comphy0 1>;
>  };
>  
> +_eth2 {
> + /* CPS Lane 5 */
> + status = "okay";
> + phy-mode = "2500base-x";
> + /* Generic PHY, providing serdes lanes */
> + phys = <_comphy5 2>;
> +};
> +

This is wrong.  This lane is connected to a SFP cage which can support
more than just 2500base-X.  Tying it in this way to 2500base-X means
that this port does not support conenctions at 1000base-X, despite
that's one of the most popular and more standardised speeds.

So, I feel I have to NAK this patch in its current form.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

[PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum

2017-12-27 Thread Antoine Tenart

This patch adds one more generic PHY mode to the phy_mode enum, to allow
configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
callback.

Signed-off-by: Antoine Tenart 
---
 include/linux/phy/phy.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
index 4f8423a948d5..70459a28f3a1 100644
--- a/include/linux/phy/phy.h
+++ b/include/linux/phy/phy.h
@@ -28,6 +28,7 @@ enum phy_mode {
PHY_MODE_USB_DEVICE,
PHY_MODE_USB_OTG,
PHY_MODE_SGMII,
+   PHY_MODE_SGMII_2_5G,
PHY_MODE_10GKR,
PHY_MODE_UFS_HS_A,
PHY_MODE_UFS_HS_B,
-- 
2.14.3

[PATCH net-next 3/6] net: mvpp2: 1000baseX support

2017-12-27 Thread Antoine Tenart

This patch adds the 1000Base-X PHY mode support in the Marvell PPv2
driver. 1000Base-X is quite close the SGMII and uses nearly the same
code path.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/marvell/mvpp2.c | 45 
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index a19760736b71..094db9dd633f 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4501,6 +4501,7 @@ static int mvpp22_gop_init(struct mvpp2_port *port)
mvpp22_gop_init_rgmii(port);
break;
case PHY_INTERFACE_MODE_SGMII:
+   case PHY_INTERFACE_MODE_1000BASEX:
mvpp22_gop_init_sgmii(port);
break;
case PHY_INTERFACE_MODE_10GKR:
@@ -4538,7 +4539,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
u32 val;
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
/* Enable the GMAC link status irq for this port */
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
@@ -4568,7 +4570,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
}
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
@@ -4580,7 +4583,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
u32 val;
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_MASK);
val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_MASK);
@@ -4605,6 +4609,7 @@ static int mvpp22_comphy_init(struct mvpp2_port *port)
 
switch (port->phy_interface) {
case PHY_INTERFACE_MODE_SGMII:
+   case PHY_INTERFACE_MODE_1000BASEX:
mode = PHY_MODE_SGMII;
break;
case PHY_INTERFACE_MODE_10GKR:
@@ -4625,7 +4630,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
 {
u32 val;
 
-   if (port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
+   if (port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
val = readl(port->base + MVPP22_GMAC_CTRL_4_REG);
val |= MVPP22_CTRL4_SYNC_BYPASS_DIS | MVPP22_CTRL4_DP_CLK_SEL |
   MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE;
@@ -4640,9 +4646,11 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
writel(val, port->base + MVPP22_GMAC_CTRL_4_REG);
}
 
-   /* The port is connected to a copper PHY */
val = readl(port->base + MVPP2_GMAC_CTRL_0_REG);
-   val &= ~MVPP2_GMAC_PORT_TYPE_MASK;
+   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
+   val |= MVPP2_GMAC_PORT_TYPE_MASK;
+   else
+   val &= ~MVPP2_GMAC_PORT_TYPE_MASK;
writel(val, port->base + MVPP2_GMAC_CTRL_0_REG);
 
val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
@@ -4651,6 +4659,19 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
   MVPP2_GMAC_AN_DUPLEX_EN;
if (port->phy_interface == PHY_INTERFACE_MODE_SGMII)
val |= MVPP2_GMAC_IN_BAND_AUTONEG;
+
+   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
+   /* 1000BaseX port cannot negotiate speed nor can it
+* negotiate duplex: they are always operating with a
+* fixed speed of 1000Mbps in full duplex, so force
+* 1000 speed and full duplex here.
+*/
+   val |= MVPP2_GMAC_CONFIG_GMII_SPEED |
+  MVPP2_GMAC_CONFIG_FULL_DUPLEX;
+   else
+   val |= MVPP2_GMAC_AN_SPEED_EN |
+  MVPP2_GMAC_AN_DUPLEX_EN;
+
writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
 }
 
@@ -4671,7 +4692,8 @@ static void mvpp2_port_mii_gmac_configure(struct 
mvpp2_port *port)
 
/* Configure the PCS and in-band AN */

[PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface

2017-12-27 Thread Antoine Tenart

This patch enables the fourth network interface on the Marvell
Macchiatobin. It is configured in the 2500Base-X PHY mode.

Signed-off-by: Antoine Tenart 
---
 arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts 
b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
index b3350827ee55..c51efd511324 100644
--- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
+++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
@@ -279,6 +279,14 @@
phys = <_comphy0 1>;
 };
 
+_eth2 {
+   /* CPS Lane 5 */
+   status = "okay";
+   phy-mode = "2500base-x";
+   /* Generic PHY, providing serdes lanes */
+   phys = <_comphy5 2>;
+};
+
 _pinctrl {
cps_spi1_pins: spi1-pins {
marvell,pins = "mpp12", "mpp13", "mpp14", "mpp15", "mpp16";
-- 
2.14.3

[PATCH net-next 0/6] net: mvpp2: 1000BaseX and 2000BaseX support

2017-12-27 Thread Antoine Tenart

Hi all,

This series adds 1000BaseX and 2500BaseX support to the Marvell PPv2
driver. In order to use it, the 2.5 SGMII mode is added in the Marvell
common PHY driver (cp110-comphy).

Thanks to theses patches the fourth network interface can be used on the
mcbin, and two patches are attached: one to describe the interface in
the mcbin device tree, and another one adding Ethernet aliases now that
the four interfaces are described.

This was tested on a mcbin.

Patches 1 and 2 should go through the PHY tree, patches 3 and 4 through
the net-next tree and patches 5 and 6 through the mvebu one.

Please note the two mvpp2 patches do not conflict with the ACPI series
Marcin sent a few days ago, and the two series can be processed in
parallel. (Marcin is aware of me sending this series).

Thanks!
Antoine

Antoine Tenart (5):
  phy: add 2.5G SGMII mode to the phy_mode enum
  phy: cp110-comphy: 2.5G SGMII mode
  net: mvpp2: 1000baseX support
  net: mvpp2: 2500baseX support
  arm64: dts: marvell: mcbin: enable the fourth network interface

Yan Markman (1):
  arm64: dts: marvell: add Ethernet aliases

 arch/arm64/boot/dts/marvell/armada-7040-db.dts|  6 ++
 arch/arm64/boot/dts/marvell/armada-8040-db.dts|  7 +++
 arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 15 +
 drivers/net/ethernet/marvell/mvpp2.c  | 67 +++
 drivers/phy/marvell/phy-mvebu-cp110-comphy.c  | 17 +-
 include/linux/phy/phy.h   |  1 +
 6 files changed, 100 insertions(+), 13 deletions(-)

-- 
2.14.3

[PATCH net-next 2/6] phy: cp110-comphy: 2.5G SGMII mode

2017-12-27 Thread Antoine Tenart

This patch allow the CP100 comphy to configure some lanes in the
2.5G SGMII mode. This mode is quite close to SGMII and uses nearly the
same code path.

Signed-off-by: Antoine Tenart 
---
 drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c 
b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
index a0d522154cdf..946a6ed7b66f 100644
--- a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
+++ b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
@@ -135,19 +135,25 @@ struct mvebu_comhy_conf {
 static const struct mvebu_comhy_conf mvebu_comphy_cp110_modes[] = {
/* lane 0 */
MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII_2_5G, 0x1),
/* lane 1 */
MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII_2_5G, 0x1),
/* lane 2 */
MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII_2_5G, 0x1),
MVEBU_COMPHY_CONF(2, 0, PHY_MODE_10GKR, 0x1),
/* lane 3 */
MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII, 0x2),
+   MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII_2_5G, 0x2),
/* lane 4 */
MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII, 0x2),
+   MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII_2_5G, 0x2),
MVEBU_COMPHY_CONF(4, 0, PHY_MODE_10GKR, 0x2),
MVEBU_COMPHY_CONF(4, 1, PHY_MODE_SGMII, 0x1),
/* lane 5 */
MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII, 0x1),
+   MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII_2_5G, 0x1),
 };
 
 struct mvebu_comphy_priv {
@@ -206,6 +212,10 @@ static void mvebu_comphy_ethernet_init_reset(struct 
mvebu_comphy_lane *lane,
if (mode == PHY_MODE_10GKR)
val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0xe) |
   MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0xe);
+   else if (mode == PHY_MODE_SGMII_2_5G)
+   val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0x8) |
+  MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0x8) |
+  MVEBU_COMPHY_SERDES_CFG0_HALF_BUS;
else if (mode == PHY_MODE_SGMII)
val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0x6) |
   MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0x6) |
@@ -296,13 +306,13 @@ static int mvebu_comphy_init_plls(struct 
mvebu_comphy_lane *lane,
return 0;
 }
 
-static int mvebu_comphy_set_mode_sgmii(struct phy *phy)
+static int mvebu_comphy_set_mode_sgmii(struct phy *phy, enum phy_mode mode)
 {
struct mvebu_comphy_lane *lane = phy_get_drvdata(phy);
struct mvebu_comphy_priv *priv = lane->priv;
u32 val;
 
-   mvebu_comphy_ethernet_init_reset(lane, PHY_MODE_SGMII);
+   mvebu_comphy_ethernet_init_reset(lane, mode);
 
val = readl(priv->base + MVEBU_COMPHY_RX_CTRL1(lane->id));
val &= ~MVEBU_COMPHY_RX_CTRL1_CLK8T_EN;
@@ -487,7 +497,8 @@ static int mvebu_comphy_power_on(struct phy *phy)
 
switch (lane->mode) {
case PHY_MODE_SGMII:
-   ret = mvebu_comphy_set_mode_sgmii(phy);
+   case PHY_MODE_SGMII_2_5G:
+   ret = mvebu_comphy_set_mode_sgmii(phy, lane->mode);
break;
case PHY_MODE_10GKR:
ret = mvebu_comphy_set_mode_10gkr(phy);
-- 
2.14.3

[PATCH net-next 4/6] net: mvpp2: 2500baseX support

2017-12-27 Thread Antoine Tenart

This patch adds the 2500Base-X PHY mode support in the Marvell PPv2
driver. 2500Base-X is quite close to 1000Base-X and SGMII modes and uses
nearly the same code path.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/marvell/mvpp2.c | 40 
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 094db9dd633f..5d2a6f5a52b6 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4502,6 +4502,7 @@ static int mvpp22_gop_init(struct mvpp2_port *port)
break;
case PHY_INTERFACE_MODE_SGMII:
case PHY_INTERFACE_MODE_1000BASEX:
+   case PHY_INTERFACE_MODE_2500BASEX:
mvpp22_gop_init_sgmii(port);
break;
case PHY_INTERFACE_MODE_10GKR:
@@ -4540,7 +4541,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
/* Enable the GMAC link status irq for this port */
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
@@ -4571,7 +4573,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
@@ -4584,7 +4587,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
 
if (phy_interface_mode_is_rgmii(port->phy_interface) ||
port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
val = readl(port->base + MVPP22_GMAC_INT_MASK);
val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
writel(val, port->base + MVPP22_GMAC_INT_MASK);
@@ -4612,6 +4616,9 @@ static int mvpp22_comphy_init(struct mvpp2_port *port)
case PHY_INTERFACE_MODE_1000BASEX:
mode = PHY_MODE_SGMII;
break;
+   case PHY_INTERFACE_MODE_2500BASEX:
+   mode = PHY_MODE_SGMII_2_5G;
+   break;
case PHY_INTERFACE_MODE_10GKR:
mode = PHY_MODE_10GKR;
break;
@@ -4631,7 +4638,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
u32 val;
 
if (port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) {
+   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
val = readl(port->base + MVPP22_GMAC_CTRL_4_REG);
val |= MVPP22_CTRL4_SYNC_BYPASS_DIS | MVPP22_CTRL4_DP_CLK_SEL |
   MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE;
@@ -4647,7 +4655,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
}
 
val = readl(port->base + MVPP2_GMAC_CTRL_0_REG);
-   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
+   if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
+   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX)
val |= MVPP2_GMAC_PORT_TYPE_MASK;
else
val &= ~MVPP2_GMAC_PORT_TYPE_MASK;
@@ -4660,6 +4669,11 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
if (port->phy_interface == PHY_INTERFACE_MODE_SGMII)
val |= MVPP2_GMAC_IN_BAND_AUTONEG;
 
+   /* Clear all fields we may want to explicitly set below */
+   val &= ~(MVPP2_GMAC_CONFIG_FULL_DUPLEX | MVPP2_GMAC_CONFIG_GMII_SPEED |
+MVPP2_GMAC_CONFIG_MII_SPEED | MVPP2_GMAC_AN_SPEED_EN |
+MVPP2_GMAC_AN_DUPLEX_EN);
+
if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX)
/* 1000BaseX port cannot negotiate speed nor can it
 * negotiate duplex: they are always operating with a
@@ -4668,6 +4682,10 @@ static void mvpp2_port_mii_gmac_configure_mode(struct 
mvpp2_port *port)
 */
val |=

[PATCH net-next 6/6] arm64: dts: marvell: add Ethernet aliases

2017-12-27 Thread Antoine Tenart

From: Yan Markman 

This patch adds Ethernet aliases in the Marvell Aramada 7040 DB, 8040 DB
and 8040 mcbin device trees so that the bootloader setup the MAC
addresses correctly.

Signed-off-by: Yan Markman 
[Antoine: commit message, small fixes]
Signed-off-by: Antoine Tenart 
---
 arch/arm64/boot/dts/marvell/armada-7040-db.dts| 6 ++
 arch/arm64/boot/dts/marvell/armada-8040-db.dts| 7 +++
 arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 7 +++
 3 files changed, 20 insertions(+)

diff --git a/arch/arm64/boot/dts/marvell/armada-7040-db.dts 
b/arch/arm64/boot/dts/marvell/armada-7040-db.dts
index 52b5341cb270..62b83416b30c 100644
--- a/arch/arm64/boot/dts/marvell/armada-7040-db.dts
+++ b/arch/arm64/boot/dts/marvell/armada-7040-db.dts
@@ -61,6 +61,12 @@
reg = <0x0 0x0 0x0 0x8000>;
};
 
+   aliases {
+   ethernet0 = _eth0;
+   ethernet1 = _eth1;
+   ethernet2 = _eth2;
+   };
+
cpm_reg_usb3_0_vbus: cpm-usb3-0-vbus {
compatible = "regulator-fixed";
regulator-name = "usb3h0-vbus";
diff --git a/arch/arm64/boot/dts/marvell/armada-8040-db.dts 
b/arch/arm64/boot/dts/marvell/armada-8040-db.dts
index d97b72bed662..d9fffde64c44 100644
--- a/arch/arm64/boot/dts/marvell/armada-8040-db.dts
+++ b/arch/arm64/boot/dts/marvell/armada-8040-db.dts
@@ -61,6 +61,13 @@
reg = <0x0 0x0 0x0 0x8000>;
};
 
+   aliases {
+   ethernet0 = _eth0;
+   ethernet1 = _eth2;
+   ethernet2 = _eth0;
+   ethernet3 = _eth1;
+   };
+
cpm_reg_usb3_0_vbus: cpm-usb3-0-vbus {
compatible = "regulator-fixed";
regulator-name = "cpm-usb3h0-vbus";
diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts 
b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
index c51efd511324..7a23bb279e1c 100644
--- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
+++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
@@ -62,6 +62,13 @@
reg = <0x0 0x0 0x0 0x8000>;
};
 
+   aliases {
+   ethernet0 = _eth0;
+   ethernet1 = _eth0;
+   ethernet2 = _eth1;
+   ethernet3 = _eth2;
+   };
+
/* Regulator labels correspond with schematics */
v_3_3: regulator-3-3v {
compatible = "regulator-fixed";
-- 
2.14.3

Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode

2017-12-27 Thread Marcelo Ricardo Leitner

On Wed, Dec 27, 2017 at 01:40:24PM -0800, Stephen Hemminger wrote:
> On Wed, 27 Dec 2017 18:39:29 -0200
> Marcelo Ricardo Leitner  wrote:
> 
> > > > +   send = false;
> > > > +   else
> > > > +   send = true;
> > > > +
> > > > +   ret = do_cmd(largc, largv, batch_size, msg_iov_index++, 
> > > > send);  
> > > 
> > > What happens if tc commands are interlaced in the file -- qdisc add,
> > > class add, filter add, then a delete, show, exec, etc.? Right now each
> > > command is handled one at a time so an add followed by a delete will
> > > work. Your proposed batching loop won't work for this case as some
> > > commands are executed when that line is reached and others are batched
> > > for later send. Not all of the tc commands need to be batched in a
> > > single message so perhaps those commands cause the queue to be flushed
> > > (ie, message sent), then that command is executed and you start the
> > > batching over.
> > > 
> > > Further, I really think the batching can be done without the global
> > > variables and without the command handlers knowing it is batching or
> > > part of an iov. e.g., in the case of batching try having the commands
> > > malloc the request buffer and return the pointer back to this loop in
> > > which case this loop calls rtnl_talk_msg and frees the buffers.  
> > 
> > Sounds like the batching is being done at the wrong level. If it was
> > done by rtnl_talk(), it should be easier.
> > We can keep rtnl_talk() for previous users and make rtnl_talk_msg() do
> > the batching, mostly independent of which kind of msg it it.
> > 
> > As you need to inform it that it was the last entry, that may be
> > detected with feof(stdin). Just add a 'bool flush' parameter to it.
> >rtnl_talk_msg(, flush=feof(stdin));
> > 
> > Next step then would be to add a memory manager layer to it, so
> > libnetlink wouldn't need to copy the messages but recycle pointers:
> >   rtnl_get_msgbuf(): returns a buffer that one can use to fill in the
> > msg and use with rtnl_talk_msg()
> >   and the free is done by libnetlink itself when the message is
> >   finally sent, so no need to keep track of what one needs to free or
> >   can reuse.
> 
> 
> What about using sendmmsg instead?
> That woudl allow sending multiple messages in one syscall.

Could be. Although the batching effect would be very different.
sendmmsg calls cond_resched() between messages, for instance.

Re: [PATCH v2 net-next] net/trace: fix printk format in inet_sock_set_state

2017-12-27 Thread David Miller

From: Yafang Shao 
Date: Sun, 24 Dec 2017 23:10:39 +0800

> There's a space character missed in the printk messages.
> 
> Put the message into one line could simplify searching for
> the messages in the kernel source.
> 
> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint with
> inet_sock_set_state tracepoint")

Please do not break up long "Fixes: " tag lines, for the same reason
you shouldn't break up long kernel log message and trace message
strings.

I fixed it this time.

> Cc: Sergei Shtylyov 
> Signed-off-by: Yafang Shao 

Applied.

Re: [PATCH net-next 07/10] net: qualcomm: rmnet: Add support for RX checksum offload

2017-12-27 Thread kbuild test robot

Hi Subash,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Enable-csum-offloads/20171228-041216
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)


vim +30 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c

26  
27  static u16 *rmnet_map_get_csum_field(unsigned char protocol,
28   const void *txporthdr)
29  {
  > 30  u16 *check = 0;
31  
32  switch (protocol) {
33  case IPPROTO_TCP:
  > 34  check = &(((struct tcphdr *)txporthdr)->check);
35  break;
36  
37  case IPPROTO_UDP:
  > 38  check = &(((struct udphdr *)txporthdr)->check);
39  break;
40  
41  default:
42  check = 0;
43  break;
44  }
45  
46  return check;
47  }
48  
49  static int
50  rmnet_map_ipv4_dl_csum_trailer(struct sk_buff *skb,
51 struct rmnet_map_dl_csum_trailer 
*csum_trailer)
52  {
53  u16 ip_pseudo_payload_csum, pseudo_csum, ip_hdr_csum, 
*csum_field;
54  u16 csum_value, ip_payload_csum, csum_value_final;
55  struct iphdr *ip4h;
56  void *txporthdr;
57  
58  ip4h = (struct iphdr *)(skb->data);
59  if ((ntohs(ip4h->frag_off) & IP_MF) ||
60  ((ntohs(ip4h->frag_off) & IP_OFFSET) > 0))
61  return -EOPNOTSUPP;
62  
63  txporthdr = skb->data + ip4h->ihl * 4;
64  
65  csum_field = rmnet_map_get_csum_field(ip4h->protocol, 
txporthdr);
66  
67  if (!csum_field)
68  return -EPROTONOSUPPORT;
69  
70  /* RFC 768 - Skip IPv4 UDP packets where sender checksum field 
is 0 */
71  if (*csum_field == 0 && ip4h->protocol == IPPROTO_UDP)
72  return 0;
73  
  > 74  csum_value = ~ntohs(csum_trailer->csum_value);
  > 75  ip_hdr_csum = ~ip_fast_csum(ip4h, (int)ip4h->ihl);
  > 76  ip_payload_csum = csum16_sub(csum_value, ip_hdr_csum);
77  
  > 78  pseudo_csum = ~ntohs(csum_tcpudp_magic(ip4h->saddr, ip4h->daddr,
79   (u16)(ntohs(ip4h->tot_len) - ip4h->ihl * 
4),
80   (u16)ip4h->protocol, 0));
  > 81  ip_pseudo_payload_csum = csum16_add(ip_payload_csum, 
pseudo_csum);
82  
  > 83  csum_value_final = ~csum16_sub(ip_pseudo_payload_csum,
  > 84 ntohs(*csum_field));
85  
86  if (unlikely(csum_value_final == 0)) {
87  switch (ip4h->protocol) {
88  case IPPROTO_UDP:
89  /* RFC 768 - DL4 1's complement rule for UDP 
csum 0 */
90  csum_value_final = ~csum_value_final;
91  break;
92  
93  case IPPROTO_TCP:
94  /* DL4 Non-RFC compliant TCP checksum found */
95  if (*csum_field == 0x)
96  csum_value_final = ~csum_value_final;
97  break;
98  }
99  }
   100  
   101  if (csum_value_final == ntohs(*csum_field))
   102  return 0;
   103  else
   104  return -EINVAL;
   105  }
   106  
   107  #if IS_ENABLED(CONFIG_IPV6)
   108  static int
   109  rmnet_map_ipv6_dl_csum_trailer(struct sk_buff *skb,
   110 struct rmnet_map_dl_csum_trailer 
*csum_trailer)
   111  {
   112  u16 ip_pseudo_payload_csum, pseudo_csum, ip6_hdr_csum, 
*csum_field;
   113  u16 csum_value, ip6_payload_csum, csum_value_final;
   114  struct ipv6hdr *ip6h;
   115  void *txporthdr;
   116  u32 length;
   117  
   118  ip6h = (struct ipv6hdr *)(skb->data);
   119  
   120  txporthdr = skb->data + sizeof(struct ipv6hdr);
   121  csum_field = rmnet_map_get_csum_field(ip6h->nexthdr, txporthdr);
   122  
   123  if (!csum_field)
   124  return -EPROTONOSUPPORT;
   125  
   126  csum_value = ~ntohs(csum_trailer->csum_value);
 > 127  ip6_hdr_csum = ~ntohs(ip_compute_csum(ip6h,
   128(int)(txporthdr - (void *)(skb->data;
 > 129  ip6_payload_csum = csum16_sub(csum_value, ip6_hdr_csum);
   130  
   131  length = (ip6h->nexthdr == IPPROTO_UDP) ?
   132   ntohs(((struct udphdr *)txporthdr)->len) :

Re: [pull request][for-next V2 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19

2017-12-27 Thread David Miller

From: David Miller 
Date: Wed, 27 Dec 2017 17:01:22 -0500 (EST)

> Pulled, thank you.

Actually, I had to revert.  Please fix this and resubmit:

drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c: In function 
‘esw_offloads_load_reps’:
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:774:2: warning: this 
‘for’ clause does not guard... [-Wmisleading-indentation]
  for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++)
  ^~~
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:776:3: note: ...this 
statement, but the latter is misleadingly indented as if it is guarded by the 
‘for’
   if (err)
   ^~

Re: [pull request][for-next V2 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19

2017-12-27 Thread David Miller

From: Saeed Mahameed 
Date: Sun, 24 Dec 2017 15:45:36 +0200

> Hi Dave and Doug,
> 
> ==
> This series includes updates for mlx5 E-Switch infrastructures,
> to be merged into net-next and rdma-next trees.
> 
> Mark's patches provide E-Switch refactoring that generalize the mlx5
> E-Switch vf representors interfaces and data structures. The serious is
> mainly focused on moving ethernet (netdev) specific representors logic out
> of E-Switch (eswitch.c) into mlx5e representor module (en_rep.c), which
> provides better separation and allows future support for other types of vf
> representors (e.g. RDMA).
> 
> Gal's patches at the end of this serious, provide a simple syntax fix and
> two other patches that handles vport ingress/egress ACL steering name
> spaces to be aligned with the Firmware/Hardware specs.
> ===
> 
> V1->V2:
>  - Addressed coding style comments in patches #1 and #7
>  - The series is still based on rc4, as now I see net-next is also @rc4.
> 
> Please pull and let me know if there's any problem.

Pulled, thank you.

Re: [PATCH iproute] qdisc: Print offload indication

2017-12-27 Thread Stephen Hemminger

On Tue, 26 Dec 2017 11:48:45 +0200
Yuval Mintz  wrote:

> Use the newly added TCA_HW_OFFLOAD indication from kernel
> to print a consistent 'offloaded' message to user when listing qdiscs.
> 
> Signed-off-by: Yuval Mintz 

Applied to master (since TCA_HW_OFFLOAD is already present).

Re: [PATCH v2 net-next 0/2] kcm: Fix two locking issues

2017-12-27 Thread David Miller

From: Tom Herbert 
Date: Sat, 23 Dec 2017 09:17:14 -0800

> One issue is lockdep warnings when sock_owned_by_user returns true
> in strparser. Fix is to add and call sock_owned_by_user_nocheck since
> the check for owned by user is not an error condition in this case.
> 
> The other issue is a potential deadlock between TX and RX paths
> 
> KCM socket lock and the psock socket lock are acquired in both
> the RX and TX path, however they take the locks in opposite order
> which can lead to deadlock. The fix is to add try_sock_lock to see
> if psock socket lock can get acquired in the TX path with KCM lock
> held. If not, then KCM socket is released and the psock socket lock
> and KCM socket lock are acquired in the same order as the RX path.
> 
> Tested:
> 
> Ran KCM traffic without incident.
> 
> v2: Remove patches to address potential deadlock. I couldn't convince
> myself this is an issue after looking at the code some more.

If this fixes real locking bugs you should target them at 'net' not
'net-next'.

I also see you telling some people hitting kcm locking problems to
test "the patch" but you give them no idea what patch you are even
talking about.

Is it this series?  Nobody knows.

Please poing them at exactly what patch you want them to test, get
their testing results, and add appropriate Tested-by: tags as you
respin this for 'net'.

Thanks.

Re: [PATCH iproute2 0/3] ip/tunnel: Fix noencap- and document "external" parameter handling.

2017-12-27 Thread Stephen Hemminger

On Wed, 27 Dec 2017 13:28:13 +0200
Serhey Popovych  wrote:

> In this series I present next set of improvements/fixes:
> 
>   1) Fix noencap- option handling: we need to clear bit, instead
>  of seting all, but one we expect to clear.
> 
>   2) Document "external" parameter both in ip-link(8) and help
>  output for link_gre.c. Add "noexternal" option variant to
>  bring inline with GENEVE and VXLAN.
> 
>   3) Trivial: clear flowlabel/tclass from flowinfo in case of
>  inherit to stop sending garbage to the kernel. It has no
>  functional change, but follows similar behaviour in link_ip6tnl.c
> 
> See individual patch description message for details.
> 
> Thanks,
> Serhii
> 
> Serhey Popovych (3):
>   gre,ip6tnl/tunnel: Fix noencap- support
>   gre6/tunnel: Do not submit garbage in flowinfo
>   ip/tunnel: Document "external" parameter
> 
>  ip/link_gre.c |7 +--
>  ip/link_gre6.c|4 ++--
>  ip/link_ip6tnl.c  |6 --
>  ip/link_iptnl.c   |4 +++-
>  man/man8/ip-link.8.in |6 ++
>  5 files changed, 20 insertions(+), 7 deletions(-)
> 

These are really disjoint. I applied the noencap and the flowinfo patch.
Agree with William that having noexternal which does nothing useful is of 
little value now.

Re: [PATCH net-next 0/4] zerocopy refinements

2017-12-27 Thread David Miller

From: Willem de Bruijn 
Date: Fri, 22 Dec 2017 19:00:16 -0500

> From: Willem de Bruijn 
> 
> 1/4 is a small optimization follow-up to the earlier fix to skb_segment:
> check skb state once per skb, instead of once per frag.
> 2/4 makes behavior more consistent between standard and zerocopy send:
> set the PSH bit when hitting MAX_SKB_FRAGS. This helps GRO.
> 3/4 resolves a surprising inconsistency in notification:
> because small packets were not stored in frags, they would not set
> the copied error code over loopback. This change also optimizes
> the path by removing copying and making tso_fragment cheaper.
> 4/4 follows-up to 3/4 by no longer allocated now unused memory.
> this was actually already in RFC patches, but dropped as I pared
> down the patch set during revisions.

Looks good, series applied, thanks.

Re: [PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2017-12-27 Thread David Miller

From: Jason Baron 
Date: Fri, 22 Dec 2017 16:54:01 -0500

> The ability to set speed and duplex for virtio_net in useful in various
> scenarios as described here:
> 
> 16032be virtio_net: add ethtool support for set and get of settings
> 
> However, it would be nice to be able to set this from the hypervisor,
> such that virtio_net doesn't require custom guest ethtool commands.
> 
> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
> the hypervisor to export a linkspeed and duplex setting. The user can
> subsequently overwrite it later if desired via: 'ethtool -s'.
> 
> Signed-off-by: Jason Baron 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 

Looks mostly fine to me but need some virtio_net reviewers on this one.

> @@ -57,6 +57,8 @@
>* Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */
>  
> +#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Host set linkspeed and duplex */
> +

Why use a value so far away from the largest existing one?

Just curious.

Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode

2017-12-27 Thread Stephen Hemminger

On Wed, 27 Dec 2017 18:39:29 -0200
Marcelo Ricardo Leitner  wrote:

> > > + send = false;
> > > + else
> > > + send = true;
> > > +
> > > + ret = do_cmd(largc, largv, batch_size, msg_iov_index++, send);  
> > 
> > What happens if tc commands are interlaced in the file -- qdisc add,
> > class add, filter add, then a delete, show, exec, etc.? Right now each
> > command is handled one at a time so an add followed by a delete will
> > work. Your proposed batching loop won't work for this case as some
> > commands are executed when that line is reached and others are batched
> > for later send. Not all of the tc commands need to be batched in a
> > single message so perhaps those commands cause the queue to be flushed
> > (ie, message sent), then that command is executed and you start the
> > batching over.
> > 
> > Further, I really think the batching can be done without the global
> > variables and without the command handlers knowing it is batching or
> > part of an iov. e.g., in the case of batching try having the commands
> > malloc the request buffer and return the pointer back to this loop in
> > which case this loop calls rtnl_talk_msg and frees the buffers.  
> 
> Sounds like the batching is being done at the wrong level. If it was
> done by rtnl_talk(), it should be easier.
> We can keep rtnl_talk() for previous users and make rtnl_talk_msg() do
> the batching, mostly independent of which kind of msg it it.
> 
> As you need to inform it that it was the last entry, that may be
> detected with feof(stdin). Just add a 'bool flush' parameter to it.
>rtnl_talk_msg(, flush=feof(stdin));
> 
> Next step then would be to add a memory manager layer to it, so
> libnetlink wouldn't need to copy the messages but recycle pointers:
>   rtnl_get_msgbuf(): returns a buffer that one can use to fill in the
> msg and use with rtnl_talk_msg()
>   and the free is done by libnetlink itself when the message is
>   finally sent, so no need to keep track of what one needs to free or
>   can reuse.


What about using sendmmsg instead?
That woudl allow sending multiple messages in one syscall.

Re: lost connection to test machine (3)

2017-12-27 Thread Florian Westphal

Dmitry Vyukov  wrote:
> On Wed, Dec 27, 2017 at 7:18 PM, syzbot
>  wrote:
> > Hello,
> >
> > syzkaller hit the following crash on
> > beacbc68ac3e23821a681adb30b45dc55b17488d
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> >
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: 
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> 
> +netfilter maintainers
> 
> Here is cleaned reproducer:
> 
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> int main()
> {
>   int fd;
> 
>   fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
>   struct ipt_replace opt = {};
>   opt.num_counters = 1;
>   opt.size = -1;
>   setsockopt(fd, SOL_IP, 0x40, , 0x4);
>   return 0;
> }
> 
> 
> What happens there is that here:
> 
> struct xt_table_info *xt_alloc_table_info(unsigned int size)
> {
> ...
> if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> return NULL;
> 
> size = -1 and SMP_ALIGN(size) = 0, so this still tries to allocate
> 4GB+delta bytes.
> 
> I don't understand why this uses SMP_ALIGN since we add 2 pages on
> top, it seems that we could just drop SMP_ALIGN and local SMP_ALIGN
> definition altogether.

Looking at history.git this seems to be a left over from back when
iptables allocated size * num_cpus() (and used an SMP_ALIGN based offset
for each cpu).

So yes, I think we can just toss/drop this.

[PATCH] Staging: ipx: fixed several no space before tabs coding style issues

2017-12-27 Thread Jianshen Liu

Fixed several coding style warnings of "please, no space before tabs".

Signed-off-by: Jianshen Liu 
---
 drivers/staging/ipx/af_ipx.c| 56 -
 drivers/staging/ipx/ipx_proc.c  |  2 +-
 drivers/staging/ipx/ipx_route.c |  6 ++---
 3 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/drivers/staging/ipx/af_ipx.c b/drivers/staging/ipx/af_ipx.c
index d21a9d1..d8be06c 100644
--- a/drivers/staging/ipx/af_ipx.c
+++ b/drivers/staging/ipx/af_ipx.c
@@ -2,7 +2,7 @@
  * Implements an IPX socket layer.
  *
  * This code is derived from work by
- * Ross Biro   :   Writing the original IP stack
+ * Ross Biro   :   Writing the original IP stack
  * Fred Van Kempen :   Tidying up the TCP/IP
  *
  * Many thanks go to Keith Baker, Institute For Industrial Information
@@ -20,7 +20,7 @@
  * provide warranty for any of this software. This material is provided
  * "AS-IS" and at no charge.
  *
- * Portions Copyright (c) 1995 Caldera, Inc. 
+ * Portions Copyright (c) 1995 Caldera, Inc. 
  * Neither Greg Page nor Caldera, Inc. admit liability nor provide
  * warranty for any of this software. This material is provided
  * "AS-IS" and at no charge.
@@ -758,7 +758,7 @@ static void ipxitf_discover_netnum(struct ipx_interface 
*intrfc,
 
 /**
  * ipxitf_pprop - Process packet propagation IPX packet type 0x14, used for
- *   NetBIOS broadcasts
+ *   NetBIOS broadcasts
  * @intrfc: IPX interface receiving this packet
  * @skb: Received packet
  *
@@ -870,11 +870,11 @@ static struct ipx_interface *ipxitf_alloc(struct 
net_device *dev, __be32 netnum,
if (intrfc) {
intrfc->if_dev  = dev;
intrfc->if_netnum   = netnum;
-   intrfc->if_dlink_type   = dlink_type;
-   intrfc->if_dlink= dlink;
-   intrfc->if_internal = internal;
-   intrfc->if_ipx_offset   = ipx_offset;
-   intrfc->if_sknum= IPX_MIN_EPHEMERAL_SOCKET;
+   intrfc->if_dlink_type   = dlink_type;
+   intrfc->if_dlink= dlink;
+   intrfc->if_internal = internal;
+   intrfc->if_ipx_offset   = ipx_offset;
+   intrfc->if_sknum= IPX_MIN_EPHEMERAL_SOCKET;
INIT_HLIST_HEAD(>if_sklist);
refcount_set(>refcnt, 1);
spin_lock_init(>if_sklist_lock);
@@ -965,23 +965,23 @@ static int ipxitf_create(struct ipx_interface_definition 
*idef)
 
switch (idef->ipx_dlink_type) {
case IPX_FRAME_8022:
-   dlink_type  = htons(ETH_P_802_2);
-   datalink= p8022_datalink;
+   dlink_type  = htons(ETH_P_802_2);
+   datalink= p8022_datalink;
break;
case IPX_FRAME_ETHERII:
if (dev->type != ARPHRD_IEEE802) {
-   dlink_type  = htons(ETH_P_IPX);
-   datalink= pEII_datalink;
+   dlink_type  = htons(ETH_P_IPX);
+   datalink= pEII_datalink;
break;
}
/* fall through */
case IPX_FRAME_SNAP:
-   dlink_type  = htons(ETH_P_SNAP);
-   datalink= pSNAP_datalink;
+   dlink_type  = htons(ETH_P_SNAP);
+   datalink= pSNAP_datalink;
break;
case IPX_FRAME_8023:
-   dlink_type  = htons(ETH_P_802_3);
-   datalink= p8023_datalink;
+   dlink_type  = htons(ETH_P_802_3);
+   datalink= p8023_datalink;
break;
case IPX_FRAME_NONE:
default:
@@ -1522,7 +1522,7 @@ static int ipx_connect(struct socket *sock, struct 
sockaddr *uaddr,
struct ipx_route *rt;
 
sk->sk_state= TCP_CLOSE;
-   sock->state = SS_UNCONNECTED;
+   sock->state = SS_UNCONNECTED;
 
lock_sock(sk);
if (addr_len != sizeof(*addr))
@@ -1534,7 +1534,7 @@ static int ipx_connect(struct socket *sock, struct 
sockaddr *uaddr,
struct sockaddr_ipx uaddr;
 
uaddr.sipx_port = 0;
-   uaddr.sipx_network  = 0;
+   uaddr.sipx_network  = 0;
 
 #ifdef CONFIG_IPX_INTERN
rc = -ENETDOWN;
@@ -1563,8 +1563,8 @@ static int ipx_connect(struct socket *sock, struct 
sockaddr *uaddr,
ipxs->type = addr->sipx_type;
 
if (sock->type == SOCK_DGRAM) {
-   sock->state = SS_CONNECTED;
-   sk->sk_state= TCP_ESTABLISHED;
+   sock->state = SS_CONNECTED;
+   sk->sk_state= TCP_ESTABLISHED;
}
 
if (rt)
@@ -1736,10 +1736,10 @@

Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode

2017-12-27 Thread Marcelo Ricardo Leitner

On Wed, Dec 27, 2017 at 09:39:15AM -0600, David Ahern wrote:
> On 12/25/17 2:46 AM, Chris Mi wrote:
> > Signed-off-by: Chris Mi 
> > ---
> >  tc/m_action.c  |  91 +--
> >  tc/tc.c|  47 ++
> >  tc/tc_common.h |   8 +++-
> >  tc/tc_filter.c | 121 
> > +
> >  4 files changed, 204 insertions(+), 63 deletions(-)
> > 
> > diff --git a/tc/m_action.c b/tc/m_action.c
> > index fc422364..c4c3b862 100644
> > --- a/tc/m_action.c
> > +++ b/tc/m_action.c
> > @@ -23,6 +23,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "utils.h"
> >  #include "tc_common.h"
> > @@ -546,40 +547,88 @@ bad_val:
> > return ret;
> >  }
> >  
> > +typedef struct {
> > +   struct nlmsghdr n;
> > +   struct tcamsg   t;
> > +   charbuf[MAX_MSG];
> > +} tc_action_req;
> > +
> > +static tc_action_req *action_reqs;
> > +static struct iovec msg_iov[MSG_IOV_MAX];
> > +
> > +void free_action_reqs(void)
> > +{
> > +   free(action_reqs);
> > +}
> > +
> > +static tc_action_req *get_action_req(int batch_size, int index)
> > +{
> > +   tc_action_req *req;
> > +
> > +   if (action_reqs == NULL) {
> > +   action_reqs = malloc(batch_size * sizeof (tc_action_req));
> > +   if (action_reqs == NULL)
> > +   return NULL;
> > +   }
> > +   req = _reqs[index];
> > +   memset(req, 0, sizeof (*req));
> > +
> > +   return req;
> > +}
> > +
> >  static int tc_action_modify(int cmd, unsigned int flags,
> > -   int *argc_p, char ***argv_p)
> > +   int *argc_p, char ***argv_p,
> > +   int batch_size, int index, bool send)
> >  {
> > int argc = *argc_p;
> > char **argv = *argv_p;
> > int ret = 0;
> > -   struct {
> > -   struct nlmsghdr n;
> > -   struct tcamsg   t;
> > -   charbuf[MAX_MSG];
> > -   } req = {
> > -   .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcamsg)),
> > -   .n.nlmsg_flags = NLM_F_REQUEST | flags,
> > -   .n.nlmsg_type = cmd,
> > -   .t.tca_family = AF_UNSPEC,
> > +   tc_action_req *req;
> > +   struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
> > +   struct iovec *iov = _iov[index];
> > +
> > +   req = get_action_req(batch_size, index);
> > +   if (req == NULL) {
> > +   fprintf(stderr, "get_action_req error: not enough buffer\n");
> > +   return -ENOMEM;
> > +   }
> > +
> > +   req->n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcamsg));
> > +   req->n.nlmsg_flags = NLM_F_REQUEST | flags;
> > +   req->n.nlmsg_type = cmd;
> > +   req->t.tca_family = AF_UNSPEC;
> > +   struct rtattr *tail = NLMSG_TAIL(>n);
> > +
> > +   struct msghdr msg = {
> > +   .msg_name = ,
> > +   .msg_namelen = sizeof(nladdr),
> > +   .msg_iov = msg_iov,
> > +   .msg_iovlen = index + 1,
> > };
> > -   struct rtattr *tail = NLMSG_TAIL();
> >  
> > argc -= 1;
> > argv += 1;
> > -   if (parse_action(, , TCA_ACT_TAB, )) {
> > +   if (parse_action(, , TCA_ACT_TAB, >n)) {
> > fprintf(stderr, "Illegal \"action\"\n");
> > return -1;
> > }
> > -   tail->rta_len = (void *) NLMSG_TAIL() - (void *) tail;
> > +   tail->rta_len = (void *) NLMSG_TAIL(>n) - (void *) tail;
> > +
> > +   *argc_p = argc;
> > +   *argv_p = argv;
> > +
> > +   iov->iov_base = >n;
> > +   iov->iov_len = req->n.nlmsg_len;
> > +
> > +   if (!send)
> > +   return 0;
> >  
> > -   if (rtnl_talk(, , NULL) < 0) {
> > +   ret = rtnl_talk_msg(, , NULL);
> > +   if (ret < 0) {
> > fprintf(stderr, "We have an error talking to the kernel\n");
> > ret = -1;
> > }
> >  
> > -   *argc_p = argc;
> > -   *argv_p = argv;
> > -
> > return ret;
> >  }
> >  
> > @@ -679,7 +728,7 @@ bad_val:
> > return ret;
> >  }
> >  
> > -int do_action(int argc, char **argv)
> > +int do_action(int argc, char **argv, int batch_size, int index, bool send)
> >  {
> >  
> > int ret = 0;
> > @@ -689,12 +738,14 @@ int do_action(int argc, char **argv)
> > if (matches(*argv, "add") == 0) {
> > ret =  tc_action_modify(RTM_NEWACTION,
> > NLM_F_EXCL | NLM_F_CREATE,
> > -   , );
> > +   , , batch_size,
> > +   index, send);
> > } else if (matches(*argv, "change") == 0 ||
> >   matches(*argv, "replace") == 0) {
> > ret = tc_action_modify(RTM_NEWACTION,
> >NLM_F_CREATE | NLM_F_REPLACE,
> > -  , );
> > +  , , batch_size,
> > +

Re: [PATCH v3 1/4] security: Add support for SCTP security hooks

2017-12-27 Thread Paul Moore

On Wed, Dec 27, 2017 at 11:22 AM, Richard Haines
 wrote:
> On Fri, 2017-12-22 at 15:45 -0200, Marcelo Ricardo Leitner wrote:
>> On Fri, Dec 22, 2017 at 09:20:45AM -0800, Casey Schaufler wrote:
>> > On 12/22/2017 5:05 AM, Marcelo Ricardo Leitner wrote:
>> > > From: Richard Haines 
>> > >
>> > > The SCTP security hooks are explained in:
>> > > Documentation/security/LSM-sctp.rst
>>
>> Thanks Casey for your comments. However, I'm not that acquainted with
>> these area of codes and I cannot work on them. I'll just wait for
>> Richard then.
>
> I'm back online and will post a V4 set of patches within a week. These
> will address Paul's comments as per [1] and Casey's regarding the
> documentation.
> Sorry for the delay

No worries, thanks.

-- 
paul moore
www.paul-moore.com

Re: [PATCH 3/4] libbpf: break loop earlier

2017-12-27 Thread Eric Leblond

Hello,

On Wed, 2017-12-27 at 11:00 -0800, Alexei Starovoitov wrote:
> On Wed, Dec 27, 2017 at 07:02:28PM +0100, Eric Leblond wrote:
> > Get out of the loop when we have a match.
> > 
> > Signed-off-by: Eric Leblond 
> > ---
> >  tools/lib/bpf/libbpf.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 5fe8aaa2123e..d263748aa341 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object
> > *obj)
> >prog->section_name);
> > return -LIBBPF_ERRNO__LIBELF;
> > }
> > +   break;
> 
> why this is needed?

It was just cosmetic, no related bug.

> The top of the loop is:
>  for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name;
> 
> so as soon as name is found the loop will exit.

OK, I've missed that. Please disregard this patch.

BR,
-- 
Eric Leblond

Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-27 Thread Michael Kerrisk (man-pages)

Hello Mahesh,

On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> Hello James,
>
> Seems like I missed your name to be added into the review of this
> patch series. Would you be willing be pull this into the security
> tree? Serge Hallyn has already ACKed it.

We seem to have no formal documentation/specification of this feature.
I think that should be written up before this patch goes into
mainline...

Cheers,

Michael


>
> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
>> From: Mahesh Bandewar 
>>
>> TL;DR version
>> -
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage into. e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> Current form of user-namespaces, however, if changed a bit can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>>
>> Detailed version
>> 
>>
>> Problem
>> ---
>> User-namespaces in the current form have increased the attack surface as
>> any process can acquire capabilities which are not available to them (by
>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>
>> #define _GNU_SOURCE
>> #include 
>> #include 
>> #include 
>>
>> int main(int ac, char **av)
>> {
>> int sock = -1;
>>
>> printf("Attempting to open RAW socket before unshare()...\n");
>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>> if (sock < 0) {
>> perror("socket() SOCK_RAW failed: ");
>> } else {
>> printf("Successfully opened RAW-Sock before unshare().\n");
>> close(sock);
>> sock = -1;
>> }
>>
>> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>> perror("unshare() failed: ");
>> return 1;
>> }
>>
>> printf("Attempting to open RAW socket after unshare()...\n");
>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>> if (sock < 0) {
>> perror("socket() SOCK_RAW failed: ");
>> } else {
>> printf("Successfully opened RAW-Sock after unshare().\n");
>> close(sock);
>> sock = -1;
>> }
>>
>> return 0;
>> }
>>
>> The above example shows how easy it is to acquire NET_RAW capabilities
>> and once acquired, these processes could take benefit of above mentioned
>> or similar issues discovered/undiscovered with malicious intent. Note
>> that this is just an example and the problem/solution is not limited
>> to NET_RAW capability *only*.
>>
>> The easiest fix one can apply here is to lock-down user-namespaces which
>> many of the distros do (i.e. don't allow users to create user namespaces),
>> but unfortunately that prevents everyone from using them.
>>
>> Approach
>> 
>> Introduce a notion of 'controlled' user-namespaces. Every process on
>> the host is allowed to create user-namespaces (governed by the limit
>> imposed by per-ns sysctl) however, mark user-namespaces created by
>> sandboxed processes as 'controlled'. Use this 'mark' at the time of
>> capability check in conjunction with a global capability whitelist.
>> If the capability is not whitelisted, processes that belong to
>> controlled user-namespaces will not be allowed.
>>
>> Once a user-ns is marked as 'controlled'; all its child user-
>> namespaces are marked as 'controlled' too.
>>
>> A global whitelist is list of capabilities governed by the
>> sysctl which is available to (privileged) user in init-ns to modify
>> while it's applicable to all controlled user-namespaces on the host.
>>
>> Marking user-namespaces controlled without modifying the whitelist is
>> equivalent of the current behavior. The default value of whitelist includes
>> all capabilities so that the compatibility is maintained. However it gives
>> admins fine-grained ability to control various capabilities system wide
>> without locking down user-namespaces.
>>
>> Please see individual patches in this series.
>>
>> Mahesh Bandewar (2):
>>   capability: introduce sysctl for controlled user-ns capability whitelist
>>   userns: control capabilities of some user namespaces
>>
>>  Documentation/sysctl/kernel.txt | 21 +
>>  include/linux/capability.h  |  7 ++
>>  include/linux/user_namespace.h  | 25 
>>  kernel/capability.c | 52 
>> +
>>  kernel/sysctl.c |  5 
>>  kernel/user_namespace.c |  4 
>>  security/commoncap.c|  8 +++
>>  7 files changed, 122 insertions(+)
>>
>> --
>> 2.15.0.531.g2ccb3012c9-goog
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--

Re: [patch net-next v2 00/10] Add support for resource abstraction

2017-12-27 Thread David Ahern

On 12/27/17 1:31 PM, Andrew Lunn wrote:
>> Hmm. That documents mainly sysfs. No mention of Netlink at all. But
>> maybe I missed it. Also, that defines the interface as is. However we
>> are talking about the data exchanged over the interface, not the
>> interface itself. I don't see how ASIC/HW specific thing, like for
>> example KVD in our case could be part of kernel ABI.
> 
> You need to be very careful here. As soon as somebody starts using it,
> it might become an ABI. Or you need to clearly document it is not ABI,
> there is no guarantee it will not disappear or change its meaning in
> the next kernel, and it should be used with extreme caution.
> 

+1

Once the names go in, people can write scripts that invoke devlink at
boot to partition resources. With the proposed patch set, the name
(e.g., kvd/linear) becomes part of the ABI.

Re: [patch net-next v2 00/10] Add support for resource abstraction

2017-12-27 Thread Arkadi Sharshevsky



On 12/27/2017 06:34 PM, David Ahern wrote:
> On 12/27/17 2:09 AM, Jiri Pirko wrote:
>> Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote:
>>> On 12/26/17 5:23 AM, Jiri Pirko wrote:
 From: Jiri Pirko 

 Many of the ASIC's internal resources are limited and are shared between
 several hardware procedures. For example, unified hash-based memory can
 be used for many lookup purposes, like FDB and LPM. In many cases the user
 can provide a partitioning scheme for such a resource in order to perform
 fine tuning for his application. In such cases performing driver reload is
 needed for the changes to take place, thus this patchset also adds support
 for hot reload.

 Such an abstraction can be coupled with devlink's dpipe interface, which
 models the ASIC's pipeline as a graph of match/action tables. By modeling
 the hardware resource object, and by coupling it to several dpipe tables,
 further visibility can be achieved in order to debug ASIC-wide issues.

 The proposed interface will provide the user the ability to understand the
 limitations of the hardware, and receive notification regarding its 
 occupancy.
 Furthermore, monitoring the resource occupancy can be done in real-time and
 can be useful in many cases.
>>>
>>> In the last RFC (not v1, but RFC) I asked for some kind of description
>>> for each resource, and you and Arkadi have pushed back. Let's walk
>>> through an example to see what I mean:
>>>
>>> $ devlink resource show pci/:03:00.0
>>> pci/:03:00.0:
>>>  name kvd size 245760 size_valid true
>>>  resources:
>>>name linear size 98304 occ 0
>>>name hash_double size 60416
>>>name hash_single size 87040
>>>
>>> So this 2700 has 3 resources that can be managed -- some table or
>>> resource or something named 'kvd' with linear, hash_double and
>>> hash_single sub-resources. What are these names referring too? The above
>>> output gives no description, and 'kvd' is not an industry term. Further,
>>
>> This are internal resources specific to the ASIC. Would you like some
>> description to each or something like that?
> 
> devlink has some nice self-documenting capabilities. What's missing here
> is a description of what the resource is used for in standard terms --
> ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> short list versus an exhaustive list of everything it is used for. e.g.,
> Why would a user decrease linear and increase hash_single or vice versa?
> 
>>
>>
>>> what are these sizes that a user can control? The output contains no
>>> units, no description, nothing. In short, the above output provides
>>> random numbers associated with random names.
>>
>> Units are now exposed from kernel, just this version of iproute2 patch
>> does not display it.
> 
> please provide an iproute2 patch that does so the full context if this
> patch set can be reviewed from a user perspective.
> 
>>
>>
>>>
>>> I can see dpipe tables exported by this device:
>>>
>>> $ devlink dpipe header show pci/:03:00.0
>>>
>>> pci/:03:00.0:
>>>  name mlxsw_meta
>>>  field:
>>>name erif_port bitwidth 32 mapping_type ifindex
>>>name l3_forward bitwidth 1
>>>name l3_drop bitwidth 1
>>>name adj_index bitwidth 32
>>>name adj_size bitwidth 32
>>>name adj_hash_index bitwidth 32
>>>
>>>  name ipv6
>>>  field:
>>>name destination ip bitwidth 128
>>>
>>>  name ipv4
>>>  field:
>>>name destination ip bitwidth 32
>>>
>>>  name ethernet
>>>  field:
>>>name destination mac bitwidth 48
>>>
>>> but none mention 'kvd' or 'linear' or 'hash" and none of the other
>>> various devlink options:
>>>
>>> $ devlink
>>> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }
>>> where  OBJECT := { dev | port | sb | monitor | dpipe }
>>>
>>> seem to related to resources.
>>>
>>> So how does a user know what they are controlling by this 'resource'
>>> option? Is the user expected to have a PRM or user guide on hand for the
>>> specific device model that is being configured?
>>
>> The relation of specific dpipe table to specific resource is exposed by
>> the kernel as well. Probably the iproute2 patch just does not display
>> it.
> 
> please provide an iproute2 patch that does so the full context if this
> patch set can be reviewed from a user perspective.
>

As Yuval stated you are using the wrong command here.
You are printing the headers not the tables. On each dpipe
table you can see the resource  it is using (the resource
path aka host table uses /kvd/hash_single for example).

This is already working. Just try it.

>>
>>
>>>
>>> Again, I have no objections to kvd, linear, hash, etc terms as they do
>>> relate to Mellanox products. But kvd/linear, for example, does correlate
>>> to industry standard concepts in some way. My request is that the
>>> resource listing guide the user in some way, stating what these
>>> resources mean.
>>
>> So the

Re: WARNING in strp_data_ready

2017-12-27 Thread Dmitry Vyukov

On Wed, Dec 27, 2017 at 9:08 PM, Ozgur  wrote:
>
>
> 27.12.2017, 22:21, "Dmitry Vyukov" :
>> On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert  wrote:
>>>  Did you try the patch I posted?
>>
>> Hi Tom,
>
> Hello Dmitry,
>
>> No. And I didn't know I need to. Why?
>> If you think the patch needs additional testing, you can ask syzbot to
>> test it. See 
>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>> Otherwise proceed with committing it. Or what are we waiting for?
>>
>> Thanks
>
> I think we need to fixed patch for crash, in fact check to patch code and 
> test solve the bug.
> How do test it because there is no patch in the following bug?

Hi Ozgur,

I am not sure I completely understand what you mean. But the
reproducer for this bug (which one can use for testing) is here:
https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
Tom also mentions there is some patch for this, but I don't know where
it is, it doesn't seem to be referenced from this thread.


> The fix patch should be for this net/kcm/kcmsock.c file and  lock functions 
> must be added calling sk_data_ready ().
> Regards
>
> Ozgur
>
>>>  On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov  wrote:
  On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov  wrote:
>>   wrote:
>>>  On 10/24/2017 08:20 AM, syzbot wrote:
  Hello,

  syzkaller hit the following crash on 
 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
  
 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
  compiler: gcc (GCC) 7.1.1 20170620
  .config is attached
  Raw console output is attached.
  C reproducer is attached
  syzkaller reproducer is attached. See https://goo.gl/kgGztJ
  for information about syzkaller reproducers

  WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
 sock_owned_by_me include/net/sock.h:1505 [inline]
  WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
 sock_owned_by_user include/net/sock.h:1511 [inline]
  WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
  Hardware name: Google Google Compute Engine/Google Compute Engine, 
 BIOS Google 01/01/2011
  Call Trace:
   
   __dump_stack lib/dump_stack.c:16 [inline]
   dump_stack+0x194/0x257 lib/dump_stack.c:52
   panic+0x1e4/0x417 kernel/panic.c:181
   __warn+0x1c4/0x1d9 kernel/panic.c:542
   report_bug+0x211/0x2d0 lib/bug.c:183
   fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
   do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
   do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
   do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
   invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
  RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
  RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
  RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
  RSP: 0018:8801db206b18 EFLAGS: 00010206
  RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: 
  RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0
  RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd
  R10: 8801db206a00 R11: dc00 R12: 8801d29fa000
  R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001
   psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>
>>>  Looks like KCM is calling sk_data_ready() without first taking the
>>>  sock lock.
>>>
>>>  /* Called with lower sock held */
>>>  static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff 
>>> *skb)
>>>  {
>>>   [...]
>>>  if (kcm_queue_rcv_skb(>sk, skb)) {
>>>
>>>  In this case kcm->sk is not the same lock the comment is referring to.
>>>  And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>
>>>  @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>  I don't have anything better in mind immediately.
>>  The sock locks are taken in reverse order in the send path so so
>>  grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>  lead to deadlock like I think.
>>
>>  It might be possible to change the order in the send path to do this.
>>  Something like:
>>
>>  trylock on lower socket lock
>>  -if trylock fails

Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode

2017-12-27 Thread Marcelo Ricardo Leitner

On Mon, Dec 25, 2017 at 05:46:57PM +0900, Chris Mi wrote:
> @@ -267,6 +287,7 @@ int main(int argc, char **argv)
>  {
>   int ret;
>   char *batch_file = NULL;
> + int batch_size = 1;
>  
>   while (argc > 1) {
>   if (argv[1][0] != '-')
> @@ -297,6 +318,14 @@ int main(int argc, char **argv)
>   if (argc <= 1)
>   usage();
>   batch_file = argv[1];
> + } else if (matches(argv[1], "-batchsize") == 0 ||
> + matches(argv[1], "-bs") == 0) {
> + argc--; argv++;
> + if (argc <= 1)
> + usage();
> + batch_size = atoi(argv[1]);
> + if (batch_size > MSG_IOV_MAX)
> + batch_size = MSG_IOV_MAX;

what about
if (batch_size < 1)
batch_size = 1;

>   } else if (matches(argv[1], "-netns") == 0) {
>   NEXT_ARG();
>   if (netns_switch(argv[1]))

Re: [patch net-next v2 00/10] Add support for resource abstraction

2017-12-27 Thread Roopa Prabhu

On Wed, Dec 27, 2017 at 8:34 AM, David Ahern  wrote:
> On 12/27/17 2:09 AM, Jiri Pirko wrote:
>> Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote:
>>> On 12/26/17 5:23 AM, Jiri Pirko wrote:
 From: Jiri Pirko 

 Many of the ASIC's internal resources are limited and are shared between
 several hardware procedures. For example, unified hash-based memory can
 be used for many lookup purposes, like FDB and LPM. In many cases the user
 can provide a partitioning scheme for such a resource in order to perform
 fine tuning for his application. In such cases performing driver reload is
 needed for the changes to take place, thus this patchset also adds support
 for hot reload.

 Such an abstraction can be coupled with devlink's dpipe interface, which
 models the ASIC's pipeline as a graph of match/action tables. By modeling
 the hardware resource object, and by coupling it to several dpipe tables,
 further visibility can be achieved in order to debug ASIC-wide issues.

 The proposed interface will provide the user the ability to understand the
 limitations of the hardware, and receive notification regarding its 
 occupancy.
 Furthermore, monitoring the resource occupancy can be done in real-time and
 can be useful in many cases.
>>>
>>> In the last RFC (not v1, but RFC) I asked for some kind of description
>>> for each resource, and you and Arkadi have pushed back. Let's walk
>>> through an example to see what I mean:
>>>
>>> $ devlink resource show pci/:03:00.0
>>> pci/:03:00.0:
>>>  name kvd size 245760 size_valid true
>>>  resources:
>>>name linear size 98304 occ 0
>>>name hash_double size 60416
>>>name hash_single size 87040
>>>
>>> So this 2700 has 3 resources that can be managed -- some table or
>>> resource or something named 'kvd' with linear, hash_double and
>>> hash_single sub-resources. What are these names referring too? The above
>>> output gives no description, and 'kvd' is not an industry term. Further,
>>
>> This are internal resources specific to the ASIC. Would you like some
>> description to each or something like that?
>
> devlink has some nice self-documenting capabilities. What's missing here
> is a description of what the resource is used for in standard terms --
> ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> short list versus an exhaustive list of everything it is used for. e.g.,
> Why would a user decrease linear and increase hash_single or vice versa?

Arkadi, on what david says above, can the resource names and ids not
be driver specific, but moved up to the switchdev layer and just map
to fdb, host routes, nexthops table sizes etc ?.
Can these generic networking resources then in-turn be mapped to kvd
sizes by the driver ?

Re: [patch net-next v2 00/10] Add support for resource abstraction

2017-12-27 Thread Andrew Lunn

> Hmm. That documents mainly sysfs. No mention of Netlink at all. But
> maybe I missed it. Also, that defines the interface as is. However we
> are talking about the data exchanged over the interface, not the
> interface itself. I don't see how ASIC/HW specific thing, like for
> example KVD in our case could be part of kernel ABI.

You need to be very careful here. As soon as somebody starts using it,
it might become an ABI. Or you need to clearly document it is not ABI,
there is no guarantee it will not disappear or change its meaning in
the next kernel, and it should be used with extreme caution.

Personally, if DSA drivers were to expose such settings, i would
consider them ABI, which people can rely on to remain stable.

Andrew

Re: [PATCH net] sfp: fix sfp-bus oops when removing socket/upstream

2017-12-27 Thread Florian Fainelli



On 12/26/2017 03:15 PM, Russell King wrote:
> When we remove a socket or upstream, and the other side isn't
> registered, we dereference a NULL pointer, causing a kernel oops.
> Fix this.
> 
> Signed-off-by: Russell King 

Reviewed-by: Florian Fainelli 
Fixes: ce0aa27ff3f6 ("sfp: add sfp-bus to bridge between network devices
and sfp cages")
-- 
Florian

Re: [PATCH net] phylink: ensure we report link down when LOS asserted

2017-12-27 Thread Florian Fainelli



On 12/26/2017 03:15 PM, Russell King wrote:
> Although we disable the netdev carrier, we fail to report in the kernel
> log that the link went down.  Fix this.
> 
> Signed-off-by: Russell King 

Reviewed-by: Florian Fainelli 
Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
-- 
Florian

Re: WARNING in strp_data_ready

2017-12-27 Thread Dmitry Vyukov

On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert  wrote:
> Did you try the patch I posted?

Hi Tom,

No. And I didn't know I need to. Why?
If you think the patch needs additional testing, you can ask syzbot to
test it. See 
https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
Otherwise proceed with committing it. Or what are we waiting for?

Thanks



> On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov  wrote:
>> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov  wrote:
  wrote:
> On 10/24/2017 08:20 AM, syzbot wrote:
>> Hello,
>>
>> syzkaller hit the following crash on 
>> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>> C reproducer is attached
>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> for information about syzkaller reproducers
>>
>>
>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me 
>> include/net/sock.h:1505 [inline]
>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
>> sock_owned_by_user include/net/sock.h:1511 [inline]
>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 
>> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>> Kernel panic - not syncing: panic_on_warn set ...
>>
>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> Google 01/01/2011
>> Call Trace:
>>  
>>  __dump_stack lib/dump_stack.c:16 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>  panic+0x1e4/0x417 kernel/panic.c:181
>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>> RSP: 0018:8801db206b18 EFLAGS: 00010206
>> RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: 
>> RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0
>> RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd
>> R10: 8801db206a00 R11: dc00 R12: 8801d29fa000
>> R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001
>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>
> Looks like KCM is calling sk_data_ready() without first taking the
> sock lock.
>
> /* Called with lower sock held */
> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
> {
>  [...]
> if (kcm_queue_rcv_skb(>sk, skb)) {
>
> In this case kcm->sk is not the same lock the comment is referring to.
> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>
> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
> I don't have anything better in mind immediately.
>
 The sock locks are taken in reverse order in the send path so so
 grabbing kcm sock lock with lower lock held to call sk_data_ready may
 lead to deadlock like I think.

 It might be possible to change the order in the send path to do this.
 Something like:

 trylock on lower socket lock
 -if trylock fails
   - release kcm sock lock
   - lock lower sock
   - lock kcm sock
 - call sendpage locked function

 I admit that dealing with two levels of socket locks in the data path
 is quite a pain :-)
>>>
>>> up
>>>
>>> still happening and we've lost 50K+ test VMs on this
>>
>> up
>>
>> Still happens and number of crashes crossed 60K, can we do something
>> with this please?

[PATCH v3 bpf-next 1/2] tools/bpftool: use version from the kernel source tree

2017-12-27 Thread Roman Gushchin

Bpftool determines it's own version based on the kernel
version, which is picked from the linux/version.h header.

It's strange to use the version of the installed kernel
headers, and makes much more sense to use the version
of the actual source tree, where bpftool sources are.

Fix this by building kernelversion target and use
the resulting string as bpftool version.

Example:
before:

$ bpftool version
bpftool v4.14.6

after:
$ bpftool version
bpftool v4.15.0-rc3

$bpftool version --json
{"version":"4.15.0-rc3"}

Signed-off-by: Roman Gushchin 
Reviewed-by: Jakub Kicinski 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
---
 tools/bpf/bpftool/Makefile |  3 +++
 tools/bpf/bpftool/main.c   | 13 ++---
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 3f17ad317512..f8f31a8d9269 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -23,6 +23,8 @@ endif
 
 LIBBPF = $(BPF_PATH)libbpf.a
 
+BPFTOOL_VERSION=$(shell make --no-print-directory -sC ../../.. kernelversion)
+
 $(LIBBPF): FORCE
$(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) $(OUTPUT)libbpf.a 
FEATURES_DUMP=$(FEATURE_DUMP_EXPORT)
 
@@ -38,6 +40,7 @@ CC = gcc
 CFLAGS += -O2
 CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi 
-I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
+CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"'
 LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
 
 INSTALL ?= install
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index ecd53ccf1239..3a0396d87c42 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -38,7 +38,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -95,21 +94,13 @@ static int do_help(int argc, char **argv)
 
 static int do_version(int argc, char **argv)
 {
-   unsigned int version[3];
-
-   version[0] = LINUX_VERSION_CODE >> 16;
-   version[1] = LINUX_VERSION_CODE >> 8 & 0xf;
-   version[2] = LINUX_VERSION_CODE & 0xf;
-
if (json_output) {
jsonw_start_object(json_wtr);
jsonw_name(json_wtr, "version");
-   jsonw_printf(json_wtr, "\"%u.%u.%u\"",
-version[0], version[1], version[2]);
+   jsonw_printf(json_wtr, "\"%s\"", BPFTOOL_VERSION);
jsonw_end_object(json_wtr);
} else {
-   printf("%s v%u.%u.%u\n", bin_name,
-  version[0], version[1], version[2]);
+   printf("%s v%s\n", bin_name, BPFTOOL_VERSION);
}
return 0;
 }
-- 
2.14.3

1 2 >

1 - 100 of 177 matches

Mail list logo