Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On Wed, 27 Dec 2017 19:45:42 -0800 Alexei Starovoitovwrote: > On 12/27/17 6:34 PM, Masami Hiramatsu wrote: > > On Wed, 27 Dec 2017 14:46:24 -0800 > > Alexei Starovoitov wrote: > > > >> On 12/26/17 9:56 PM, Masami Hiramatsu wrote: > >>> On Tue, 26 Dec 2017 17:57:32 -0800 > >>> Alexei Starovoitov wrote: > >>> > On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote: > > Check whether error injectable event is on function entry or not. > > Currently it checks the event is ftrace-based kprobes or not, > > but that is wrong. It should check if the event is on the entry > > of target function. Since error injection will override a function > > to just return with modified return value, that operation must > > be done before the target function starts making stackframe. > > > > As a side effect, bpf error injection is no need to depend on > > function-tracer. It can work with sw-breakpoint based kprobe > > events too. > > > > Signed-off-by: Masami Hiramatsu > > --- > > kernel/trace/Kconfig|2 -- > > kernel/trace/bpf_trace.c|6 +++--- > > kernel/trace/trace_kprobe.c |8 +--- > > kernel/trace/trace_probe.h | 12 ++-- > > 4 files changed, 14 insertions(+), 14 deletions(-) > > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > index ae3a2d519e50..6400e1bf97c5 100644 > > --- a/kernel/trace/Kconfig > > +++ b/kernel/trace/Kconfig > > @@ -533,9 +533,7 @@ config FUNCTION_PROFILER > > config BPF_KPROBE_OVERRIDE > > bool "Enable BPF programs to override a kprobed function" > > depends on BPF_EVENTS > > - depends on KPROBES_ON_FTRACE > > depends on HAVE_KPROBE_OVERRIDE > > - depends on DYNAMIC_FTRACE_WITH_REGS > > default n > > help > > Allows BPF to override the execution of a probed function and > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > index f6d2327ecb59..d663660f8392 100644 > > --- a/kernel/trace/bpf_trace.c > > +++ b/kernel/trace/bpf_trace.c > > @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event > > *event, > > int ret = -EEXIST; > > > > /* > > -* Kprobe override only works for ftrace based kprobes, and > > only if they > > -* are on the opt-in list. > > +* Kprobe override only works if they are on the function entry, > > +* and only if they are on the opt-in list. > > */ > > if (prog->kprobe_override && > > - (!trace_kprobe_ftrace(event->tp_event) || > > + (!trace_kprobe_on_func_entry(event->tp_event) || > > !trace_kprobe_error_injectable(event->tp_event))) > > return -EINVAL; > > > > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c > > index 91f4b57dab82..265e3e27e8dc 100644 > > --- a/kernel/trace/trace_kprobe.c > > +++ b/kernel/trace/trace_kprobe.c > > @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long > > trace_kprobe_nhit(struct trace_kprobe *tk) > > return nhit; > > } > > > > -int trace_kprobe_ftrace(struct trace_event_call *call) > > +bool trace_kprobe_on_func_entry(struct trace_event_call *call) > > { > > struct trace_kprobe *tk = (struct trace_kprobe *)call->data; > > - return kprobe_ftrace(>rp.kp); > > + > > + return kprobe_on_func_entry(tk->rp.kp.addr, > > tk->rp.kp.symbol_name, > > + tk->rp.kp.offset); > > That would be nice, but did you test this? > >>> > >>> Yes, because the jprobe, which was only official user of modifying > >>> execution > >>> path using kprobe, did same way to check. (and kretprobe also does it) > >>> > My understanding that kprobe will restore all regs and > here we need to override return ip _and_ value. > >>> > >>> yes, no problem. kprobe restore all regs from pt_regs, including regs->ip. > >>> > Could you add a patch with the test the way Josef did > or describe the steps to test this new mode? > >>> > >>> Would you mean below patch? If so, it should work without any change. > >>> > >>> [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return > >> > >> yeah. I expect bpf_override_return test to work as-is. > >> I'm asking for the test for new functionality added by this patch. > >> In particular kprobe on func entry without ftrace. > >> How did you test it? > > > > This function is used in kretprobe and jprobe. Jprobe was the user of > > "modifying instruction pointer to another function" in kprobes. > > If it doesn't work, jprobe also doesn't work, this means you can not > > modify IP by
Re: WARNING in strp_data_ready
On Thu, Dec 28, 2017 at 2:19 AM, Tom Herbertwrote: > On Wed, Dec 27, 2017 at 12:20 PM, Ozgur wrote: >> >> >> 27.12.2017, 23:14, "Dmitry Vyukov" : >>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur wrote: 27.12.2017, 22:21, "Dmitry Vyukov" : > On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert > wrote: >> Did you try the patch I posted? > > Hi Tom, Hello Dmitry, > No. And I didn't know I need to. Why? > If you think the patch needs additional testing, you can ask syzbot to > test it. See > https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot > Otherwise proceed with committing it. Or what are we waiting for? > > Thanks I think we need to fixed patch for crash, in fact check to patch code and test solve the bug. How do test it because there is no patch in the following bug? >>> >>> Hi Ozgur, >>> >>> I am not sure I completely understand what you mean. But the >>> reproducer for this bug (which one can use for testing) is here: >>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY >>> Tom also mentions there is some patch for this, but I don't know where >>> it is, it doesn't seem to be referenced from this thread. >> >> Hello Dmitry, >> >> Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :) >> I think Tom send patch to only you and are you tested? >> >> kcmsock.c will change and strp_data_ready I think locked. >> >> Tom, please send a patch for me? I can test and inform you. >> > Hi Ozgur, > > I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you > can! OK, I will work as your typist this time: #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master But I wonder what part of the following you don't understand? Do we need to improve wording or something? > If you think the patch needs additional testing, you can ask syzbot to test > it. > See > https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot Also I don't know what git repo/branch you have in mind. Kernel patches don't generally apply just to any tree. Fingers crossed that I guessed correctly and it will apply. diff --git a/include/net/sock.h b/include/net/sock.h index 9155da422692..7a7b14e9628a 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1514,6 +1514,11 @@ static inline bool sock_owned_by_user(const struct sock *sk) return sk->sk_lock.owned; } +static inline bool sock_owned_by_user_nocheck(const struct sock *sk) +{ + return sk->sk_lock.owned; +} + /* no reclassification while locks are held */ static inline bool sock_allow_reclassification(const struct sock *csk) { diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c index c5fda15ba319..1fdab5c4eda8 100644 --- a/net/strparser/strparser.c +++ b/net/strparser/strparser.c @@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp) * allows a thread in BH context to safely check if the process * lock is held. In this case, if the lock is held, queue work. */ - if (sock_owned_by_user(strp->sk)) { + if (sock_owned_by_user_nocheck(strp->sk)) { queue_work(strp_wq, >work); return; }
Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework
On Wed, 27 Dec 2017 19:49:28 -0800 Alexei Starovoitovwrote: > On 12/27/17 5:38 PM, Masami Hiramatsu wrote: > > On Wed, 27 Dec 2017 14:49:46 -0800 > > Alexei Starovoitov wrote: > > > >> On 12/27/17 12:09 AM, Masami Hiramatsu wrote: > >>> On Tue, 26 Dec 2017 18:12:56 -0800 > >>> Alexei Starovoitov wrote: > >>> > On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote: > > Support in-kernel fault-injection framework via debugfs. > > This allows you to inject a conditional error to specified > > function using debugfs interfaces. > > > > Signed-off-by: Masami Hiramatsu > > --- > > Documentation/fault-injection/fault-injection.txt |5 + > > kernel/Makefile |1 > > kernel/fail_function.c| 169 > > + > > lib/Kconfig.debug | 10 + > > 4 files changed, 185 insertions(+) > > create mode 100644 kernel/fail_function.c > > > > diff --git a/Documentation/fault-injection/fault-injection.txt > > b/Documentation/fault-injection/fault-injection.txt > > index 918972babcd8..6243a588dd71 100644 > > --- a/Documentation/fault-injection/fault-injection.txt > > +++ b/Documentation/fault-injection/fault-injection.txt > > @@ -30,6 +30,11 @@ o fail_mmc_request > >injects MMC data errors on devices permitted by setting > >debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request > > > > +o fail_function > > + > > + injects error return on specific functions by setting debugfs entries > > + under /sys/kernel/debug/fail_function. No boot option supported. > > I like it. > Could you document it a bit better? > >>> > >>> Yes, I will do in next series. > >>> > In particular retval is configurable, but without an example no one > will be able to figure out how to use it. > >>> > >>> Ah, right. BTW, as I pointed in the covermail, should we store the > >>> expected error value range into the injectable list? e.g. > >>> > >>> ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO) > >>> > >>> And provide APIs to check/get it. > >> > >> I'm afraid such check would be too costly. > >> Right now we have only two functions marked but I expect hundreds more > >> will be added in the near future as soon as developers realize the > >> potential of such error injection. > >> All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data. > >> Multiple by 1k and we have 8k of data spent on marks. > >> If we add max/min range marks that doubles it for very little use. > >> I think marking function only is enough. > > > > Sorry, I don't think so. > > Even if it takes 16 bytes more for each points, I don't think it is > > any overhead for machines in these days. Even if so, we can provide > > a kconfig to reduce it. > > I mean, we are living in GB-order memory are, and it will be bigger > > in the future. Why we have to worry about hundreds of 16bytes memory > > pieces? It will take a few KB, and even if we mark thousands of > > functions, it never reaches 1MB, in GB memory pool. :) > > > > Of course, for many small-footprint embedded devices (like having > > less than 128MB memory), this feature can be a overhead. But they > > can cut off the table by kconfig. > > I still disagree on wasting 16-byte * num_of_funcs of .data here. > The trade-off of usability vs memory just not worth it. Sorry. > Please focus on testing your changes instead. Then what happen if the user set invalid retval to those functions? even if we limit the injectable functions, it can cause a problem, for example, obj = func_return_object(); if (!obj) { handling_error...; } obj->field = x; In this case, obviously func_return_object() must return NULL if there is an error, not -ENOMEM. But without the correct retval information, how would you check the BPF code doesn't cause a trouble? Currently it seems you are expecting only the functions which return error code. ret = func_return_state(); if (ret < 0) { handling_error...; } But how we can distinguish those? If we have the error range for each function, we can ensure what is *correct* error code, NULL or errno, or any other error numbers. :) At least fail_function needs this feature because it can check return value when setting it up. Thank you, -- Masami Hiramatsu
Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
On Wed, Dec 27, 2017 at 10:24:01PM +, Russell King - ARM Linux wrote: > On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote: > > This patch enables the fourth network interface on the Marvell > > Macchiatobin. It is configured in the 2500Base-X PHY mode. > > > > Signed-off-by: Antoine Tenart> > --- > > arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > > b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > > index b3350827ee55..c51efd511324 100644 > > --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > > +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > > @@ -279,6 +279,14 @@ > > phys = <_comphy0 1>; > > }; > > > > +_eth2 { > > + /* CPS Lane 5 */ > > + status = "okay"; > > + phy-mode = "2500base-x"; > > + /* Generic PHY, providing serdes lanes */ > > + phys = <_comphy5 2>; > > +}; > > + > > This is wrong. This lane is connected to a SFP cage which can support > more than just 2500base-X. Tying it in this way to 2500base-X means > that this port does not support conenctions at 1000base-X, despite > that's one of the most popular and more standardised speeds. > Hi Antoine I agree with Russell here. SFP modules are hot pluggable, and support a range of interface modes. You need to query what the SFP module is in order to know how to configure the SERDES interface. The phylink infrastructure does that for you. Andrew
Re: [PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum
On Wed, Dec 27, 2017 at 11:14:41PM +0100, Antoine Tenart wrote: > This patch adds one more generic PHY mode to the phy_mode enum, to allow > configuring generic PHYs to the 2.5G SGMII mode by using the set_mode > callback. > > Signed-off-by: Antoine Tenart> --- > include/linux/phy/phy.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h > index 4f8423a948d5..70459a28f3a1 100644 > --- a/include/linux/phy/phy.h > +++ b/include/linux/phy/phy.h > @@ -28,6 +28,7 @@ enum phy_mode { > PHY_MODE_USB_DEVICE, > PHY_MODE_USB_OTG, > PHY_MODE_SGMII, > + PHY_MODE_SGMII_2_5G, > PHY_MODE_10GKR, > PHY_MODE_UFS_HS_A, > PHY_MODE_UFS_HS_B, Hi Antoine There was a discussion maybe last month about adding 2.5G SGMII. I would prefer 2500SGMII. Putting the number first makes it uniform with the other defines, 1000BASEX, 25000BASEX, 10GKR. Andrew
Re: [PATCH] Staging: ipx: fixed several no space before tabs coding style issues
On Wed, Dec 27, 2017 at 09:25:44PM +, Jianshen Liu wrote: > Fixed several coding style warnings of "please, no space before tabs". > > Signed-off-by: Jianshen Liu> --- > drivers/staging/ipx/af_ipx.c| 56 > - > drivers/staging/ipx/ipx_proc.c | 2 +- > drivers/staging/ipx/ipx_route.c | 6 ++--- > 3 files changed, 32 insertions(+), 32 deletions(-) Please read drivers/staging/ipx/TODO
[PATCH net-next v9 0/2] add UniPhier AVE ethernet support
This series adds support for Socionext AVE ethernet controller implemented on UniPhier SoCs. This driver supports RGMII/RMII modes. v8: https://www.spinics.net/lists/netdev/msg474374.html The PHY patch included in v1 has already separated in: http://www.spinics.net/lists/netdev/msg454595.html Changes since v8: - move operators at the beginning of the line to the end of the previous line - dt-bindings: add blank lines before mdio and phy subnodes Changes since v7: - dt-bindings: fix mdio subnode description Changes since v6: - sort the order of local variables from longest to shortest line - fix ave_probe() which calls register_netdev() at the end of initialization - dt-bindings: remove phy node descriptions in mdio node Changes since v5: - replace license boilerplate with SPDX Identifier - remove inline directives and an unused function Changes since v4: - fix larger integer warning on AVE_PFMBYTE_MASK0 Changes since v3: - remove checking dma address and use dma_set_mask() to restirct address - replace ave_mdio_busywait() with read_poll_timeout() - replace functions to access to registers with readl/writel() directly - replace a function to access to macaddr with ave_hw_write_macaddr() - change return value of ave_dma_map() to error value - move mdiobus_unregister() from ave_remove() to ave_uninit() - eliminate else block at the end of ave_dma_map() - add mask definitions for packet filter - sort bitmap definitions in descending order - add error check to some functions - rename and sort functions to clear sub-categories - fix error value consistency - remove unneeded initializers - change type of constant arrays Changes since v2: - replace clk_get() with devm_clk_get() - replace reset_control_get() with devm_reset_control_get_optional_shared() - add error return when the error occurs on the above *_get functions - sort soc data and compatible strings - remove clearly obvious comments - modify dt-bindings document consistent with these modifications Changes since v1: - add/remove devicetree properties and sub-node - remove "internal-phy-interrupt" and "desc-bits" property - add SoC data structures based on compatible strings - add node operation to apply "mdio" sub-node - add support for features - add support for {get,set}_pauseparam and pause frame operations - add support for ndo_get_stats64 instead of ndo_get_stats - replace with desiable functions - replace check for valid phy_mode with phy_interface{_mode}_is_rgmii() - replace phy attach message with phy_attached_info() - replace 32bit operation with {upper,lower}_32_bits() on ave_wdesc_addr() - replace nway_reset and get_link with generic functions - move operations to proper functions - move phy_start_aneg() to ndo_open, and remove unnecessary PHY interrupt operations See http://www.spinics.net/lists/netdev/msg454590.html - move irq initialization and descriptor memory allocation to ndo_open - move initialization of reset and clock and mdiobus to ndo_init - fix skbuffer operations - fix skb alignment operations and add Rx buffer adjustment for descriptor See http://www.spinics.net/lists/netdev/msg456014.html - add error returns when dma_map_single() failed - clean up code structures - clean up wait-loop and wake-queue conditions - add ave_wdesc_addr() and offset definitions - add ave_macaddr_init() to clean up mac-address operation - fix checking whether Tx entry is not enough - fix supported features of phydev - add necessary free/disable operations - add phydev check on ave_{get,set}_wol() - remove netif_carrier functions, phydev initializer, and Tx budget check - change obsolate codes - replace ndev->{base_addr,irq} with the members of ave_private - rename goto labels and mask definitions, and remove unused codes Kunihiko Hayashi (2): dt-bindings: net: add DT bindings for Socionext UniPhier AVE net: ethernet: socionext: add AVE ethernet driver .../bindings/net/socionext,uniphier-ave4.txt | 48 + drivers/net/ethernet/Kconfig |1 + drivers/net/ethernet/Makefile |1 + drivers/net/ethernet/socionext/Kconfig | 22 + drivers/net/ethernet/socionext/Makefile|5 + drivers/net/ethernet/socionext/sni_ave.c | 1736 6 files changed, 1813 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt create mode 100644 drivers/net/ethernet/socionext/Kconfig create mode 100644 drivers/net/ethernet/socionext/Makefile create mode 100644 drivers/net/ethernet/socionext/sni_ave.c -- 2.7.4
[PATCH net-next v9 1/2] dt-bindings: net: add DT bindings for Socionext UniPhier AVE
DT bindings for the AVE ethernet controller found on Socionext's UniPhier platforms. Signed-off-by: Kunihiko HayashiSigned-off-by: Jassi Brar Acked-by: Rob Herring Reviewed-by: Florian Fainelli --- .../bindings/net/socionext,uniphier-ave4.txt | 48 ++ 1 file changed, 48 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt new file mode 100644 index 000..270ea4e --- /dev/null +++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt @@ -0,0 +1,48 @@ +* Socionext AVE ethernet controller + +This describes the devicetree bindings for AVE ethernet controller +implemented on Socionext UniPhier SoCs. + +Required properties: + - compatible: Should be + - "socionext,uniphier-pro4-ave4" : for Pro4 SoC + - "socionext,uniphier-pxs2-ave4" : for PXs2 SoC + - "socionext,uniphier-ld11-ave4" : for LD11 SoC + - "socionext,uniphier-ld20-ave4" : for LD20 SoC + - reg: Address where registers are mapped and size of region. + - interrupts: Should contain the MAC interrupt. + - phy-mode: See ethernet.txt in the same directory. Allow to choose + "rgmii", "rmii", or "mii" according to the PHY. + - phy-handle: Should point to the external phy device. + See ethernet.txt file in the same directory. + - clocks: A phandle to the clock for the MAC. + +Optional properties: + - resets: A phandle to the reset control for the MAC. + - local-mac-address: See ethernet.txt in the same directory. + +Required subnode: + - mdio: A container for child nodes representing phy nodes. + See phy.txt in the same directory. + +Example: + + ether: ethernet@6500 { + compatible = "socionext,uniphier-ld20-ave4"; + reg = <0x6500 0x8500>; + interrupts = <0 66 4>; + phy-mode = "rgmii"; + phy-handle = <>; + clocks = <_clk 6>; + resets = <_rst 6>; + local-mac-address = [00 00 00 00 00 00]; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + + ethphy: ethphy@1 { + reg = <1>; + }; + }; + }; -- 2.7.4
[PATCH net-next v9 2/2] net: ethernet: socionext: add AVE ethernet driver
The UniPhier platform from Socionext provides the AVE ethernet controller that includes MAC and MDIO bus supporting RGMII/RMII modes. The controller is named AVE. Signed-off-by: Kunihiko HayashiSigned-off-by: Jassi Brar Reviewed-by: Andrew Lunn --- drivers/net/ethernet/Kconfig |1 + drivers/net/ethernet/Makefile|1 + drivers/net/ethernet/socionext/Kconfig | 22 + drivers/net/ethernet/socionext/Makefile |5 + drivers/net/ethernet/socionext/sni_ave.c | 1736 ++ 5 files changed, 1765 insertions(+) create mode 100644 drivers/net/ethernet/socionext/Kconfig create mode 100644 drivers/net/ethernet/socionext/Makefile create mode 100644 drivers/net/ethernet/socionext/sni_ave.c diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig index c604213..d50519e 100644 --- a/drivers/net/ethernet/Kconfig +++ b/drivers/net/ethernet/Kconfig @@ -170,6 +170,7 @@ source "drivers/net/ethernet/sis/Kconfig" source "drivers/net/ethernet/sfc/Kconfig" source "drivers/net/ethernet/sgi/Kconfig" source "drivers/net/ethernet/smsc/Kconfig" +source "drivers/net/ethernet/socionext/Kconfig" source "drivers/net/ethernet/stmicro/Kconfig" source "drivers/net/ethernet/sun/Kconfig" source "drivers/net/ethernet/tehuti/Kconfig" diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile index 39f62733..6cf5ade 100644 --- a/drivers/net/ethernet/Makefile +++ b/drivers/net/ethernet/Makefile @@ -82,6 +82,7 @@ obj-$(CONFIG_SFC) += sfc/ obj-$(CONFIG_SFC_FALCON) += sfc/falcon/ obj-$(CONFIG_NET_VENDOR_SGI) += sgi/ obj-$(CONFIG_NET_VENDOR_SMSC) += smsc/ +obj-$(CONFIG_NET_VENDOR_SOCIONEXT) += socionext/ obj-$(CONFIG_NET_VENDOR_STMICRO) += stmicro/ obj-$(CONFIG_NET_VENDOR_SUN) += sun/ obj-$(CONFIG_NET_VENDOR_TEHUTI) += tehuti/ diff --git a/drivers/net/ethernet/socionext/Kconfig b/drivers/net/ethernet/socionext/Kconfig new file mode 100644 index 000..3a1829e --- /dev/null +++ b/drivers/net/ethernet/socionext/Kconfig @@ -0,0 +1,22 @@ +config NET_VENDOR_SOCIONEXT + bool "Socionext ethernet drivers" + default y + ---help--- + Option to select ethernet drivers for Socionext platforms. + + Note that the answer to this question doesn't directly affect the + kernel: saying N will just cause the configurator to skip all + the questions about Socionext devices. If you say Y, you will be asked + for your specific card in the following questions. + +if NET_VENDOR_SOCIONEXT + +config SNI_AVE + tristate "Socionext AVE ethernet support" + depends on (ARCH_UNIPHIER || COMPILE_TEST) && OF + select PHYLIB + ---help--- + Driver for gigabit ethernet MACs, called AVE, in the + Socionext UniPhier family. + +endif #NET_VENDOR_SOCIONEXT diff --git a/drivers/net/ethernet/socionext/Makefile b/drivers/net/ethernet/socionext/Makefile new file mode 100644 index 000..ab83df6 --- /dev/null +++ b/drivers/net/ethernet/socionext/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for all ethernet ip drivers on Socionext platforms +# +obj-$(CONFIG_SNI_AVE) += sni_ave.o diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c new file mode 100644 index 000..111e7ca --- /dev/null +++ b/drivers/net/ethernet/socionext/sni_ave.c @@ -0,0 +1,1736 @@ +// SPDX-License-Identifier: GPL-2.0 +/** + * sni_ave.c - Socionext UniPhier AVE ethernet driver + * Copyright 2014 Panasonic Corporation + * Copyright 2015-2017 Socionext Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* General Register Group */ +#define AVE_IDR0x000 /* ID */ +#define AVE_VR 0x004 /* Version */ +#define AVE_GRR0x008 /* Global Reset */ +#define AVE_CFGR 0x00c /* Configuration */ + +/* Interrupt Register Group */ +#define AVE_GIMR 0x100 /* Global Interrupt Mask */ +#define AVE_GISR 0x104 /* Global Interrupt Status */ + +/* MAC Register Group */ +#define AVE_TXCR 0x200 /* TX Setup */ +#define AVE_RXCR 0x204 /* RX Setup */ +#define AVE_RXMAC1R0x208 /* MAC address (lower) */ +#define AVE_RXMAC2R0x20c /* MAC address (upper) */ +#define AVE_MDIOCTR0x214 /* MDIO Control */ +#define AVE_MDIOAR 0x218 /* MDIO Address */ +#define AVE_MDIOWDR0x21c /* MDIO Data */ +#define AVE_MDIOSR 0x220 /* MDIO Status */ +#define AVE_MDIORDR0x224 /* MDIO Rd Data */ + +/* Descriptor Control Register Group */ +#define AVE_DESCC 0x300 /* Descriptor Control */ +#define AVE_TXDC 0x304
[PATCH net-next] cxgb4/cxgb4vf: support for XLAUI Port Type
Add support for new Backplane XLAUI port type. Signed-off-by: Casey LeedomSigned-off-by: Ganesh Goudar --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c | 10 +- drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 1 + drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h | 1 + drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 10 +- 4 files changed, 20 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c index 541419b..7852d98 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c @@ -517,7 +517,8 @@ static int from_fw_port_mod_type(enum fw_port_type port_type, else return PORT_OTHER; } else if (port_type == FW_PORT_TYPE_KR4_100G || - port_type == FW_PORT_TYPE_KR_SFP28) { + port_type == FW_PORT_TYPE_KR_SFP28 || + port_type == FW_PORT_TYPE_KR_XLAUI) { return PORT_NONE; } @@ -645,6 +646,13 @@ static void fw_caps_to_lmm(enum fw_port_type port_type, FW_CAPS_TO_LMM(SPEED_25G, 25000baseKR_Full); break; + case FW_PORT_TYPE_KR_XLAUI: + SET_LMM(Backplane); + FW_CAPS_TO_LMM(SPEED_1G, 1000baseKX_Full); + FW_CAPS_TO_LMM(SPEED_10G, 1baseKR_Full); + FW_CAPS_TO_LMM(SPEED_40G, 4baseKR4_Full); + break; + case FW_PORT_TYPE_CR2_QSFP: SET_LMM(FIBRE); SET_LMM(5baseSR2_Full); diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c index 44930ca..f2a60e0 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c +++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c @@ -6084,6 +6084,7 @@ const char *t4_get_port_type_description(enum fw_port_type port_type) "CR2_QSFP", "SFP28", "KR_SFP28", + "KR_XLAUI" }; if (port_type < ARRAY_SIZE(port_type_description)) diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h index 01f5a5e..427f252 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h +++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h @@ -2829,6 +2829,7 @@ enum fw_port_type { FW_PORT_TYPE_CR2_QSFP, FW_PORT_TYPE_SFP28, FW_PORT_TYPE_KR_SFP28, + FW_PORT_TYPE_KR_XLAUI, FW_PORT_TYPE_NONE = FW_PORT_CMD_PTYPE_M }; diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c index b48361c..96f69f8 100644 --- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c @@ -1229,7 +1229,8 @@ static int from_fw_port_mod_type(enum fw_port_type port_type, else return PORT_OTHER; } else if (port_type == FW_PORT_TYPE_KR4_100G || - port_type == FW_PORT_TYPE_KR_SFP28) { + port_type == FW_PORT_TYPE_KR_SFP28 || + port_type == FW_PORT_TYPE_KR_XLAUI) { return PORT_NONE; } @@ -1323,6 +1324,13 @@ static void fw_caps_to_lmm(enum fw_port_type port_type, SET_LMM(25000baseKR_Full); break; + case FW_PORT_TYPE_KR_XLAUI: + SET_LMM(Backplane); + FW_CAPS_TO_LMM(SPEED_1G, 1000baseKX_Full); + FW_CAPS_TO_LMM(SPEED_10G, 1baseKR_Full); + FW_CAPS_TO_LMM(SPEED_40G, 4baseKR4_Full); + break; + case FW_PORT_TYPE_CR2_QSFP: SET_LMM(FIBRE); SET_LMM(5baseSR2_Full); -- 2.1.0
[PATCH net-next v6 6/6] net: dccp: Remove dccpprobe module
Remove DCCP probe module since jprobe has been deprecated. That function is now replaced by dccp/dccp_probe trace-event. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu--- Changes in v5: - Fix a conflict with previous change in Makefile. --- net/dccp/Kconfig | 17 net/dccp/Makefile |2 - net/dccp/probe.c | 203 - 3 files changed, 222 deletions(-) delete mode 100644 net/dccp/probe.c diff --git a/net/dccp/Kconfig b/net/dccp/Kconfig index 8c0ef71bed2f..b270e84d9c13 100644 --- a/net/dccp/Kconfig +++ b/net/dccp/Kconfig @@ -39,23 +39,6 @@ config IP_DCCP_DEBUG Just say N. -config NET_DCCPPROBE - tristate "DCCP connection probing" - depends on PROC_FS && KPROBES - ---help--- - This module allows for capturing the changes to DCCP connection - state in response to incoming packets. It is used for debugging - DCCP congestion avoidance modules. If you don't understand - what was just said, you don't need it: say N. - - Documentation on how to use DCCP connection probing can be found - at: - - http://www.linuxfoundation.org/collaborate/workgroups/networking/dccpprobe - - To compile this code as a module, choose M here: the - module will be called dccp_probe. - endmenu diff --git a/net/dccp/Makefile b/net/dccp/Makefile index 4215f13a63af..5b4ff37bc806 100644 --- a/net/dccp/Makefile +++ b/net/dccp/Makefile @@ -21,12 +21,10 @@ obj-$(subst y,$(CONFIG_IP_DCCP),$(CONFIG_IPV6)) += dccp_ipv6.o dccp_ipv6-y := ipv6.o obj-$(CONFIG_INET_DCCP_DIAG) += dccp_diag.o -obj-$(CONFIG_NET_DCCPPROBE) += dccp_probe.o dccp-$(CONFIG_SYSCTL) += sysctl.o dccp_diag-y := diag.o -dccp_probe-y := probe.o # build with local directory for trace.h CFLAGS_proto.o := -I$(src) diff --git a/net/dccp/probe.c b/net/dccp/probe.c deleted file mode 100644 index 3d3fda05b32d.. --- a/net/dccp/probe.c +++ /dev/null @@ -1,203 +0,0 @@ -/* - * dccp_probe - Observe the DCCP flow with kprobes. - * - * The idea for this came from Werner Almesberger's umlsim - * Copyright (C) 2004, Stephen Hemminger - * - * Modified for DCCP from Stephen Hemminger's code - * Copyright (C) 2006, Ian McDonald - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "dccp.h" -#include "ccid.h" -#include "ccids/ccid3.h" - -static int port; - -static int bufsize = 64 * 1024; - -static const char procname[] = "dccpprobe"; - -static struct { - struct kfifo fifo; - spinlock_tlock; - wait_queue_head_t wait; - struct timespec64 tstart; -} dccpw; - -static void printl(const char *fmt, ...) -{ - va_list args; - int len; - struct timespec64 now; - char tbuf[256]; - - va_start(args, fmt); - getnstimeofday64(); - - now = timespec64_sub(now, dccpw.tstart); - - len = sprintf(tbuf, "%lu.%06lu ", - (unsigned long) now.tv_sec, - (unsigned long) now.tv_nsec / NSEC_PER_USEC); - len += vscnprintf(tbuf+len, sizeof(tbuf)-len, fmt, args); - va_end(args); - - kfifo_in_locked(, tbuf, len, ); - wake_up(); -} - -static int jdccp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) -{ - const struct inet_sock *inet = inet_sk(sk); - struct ccid3_hc_tx_sock *hc = NULL; - - if (ccid_get_current_tx_ccid(dccp_sk(sk)) == DCCPC_CCID3) - hc = ccid3_hc_tx_sk(sk); - - if (port == 0 || ntohs(inet->inet_dport) == port || - ntohs(inet->inet_sport) == port) { - if (hc) - printl("%pI4:%u %pI4:%u %d %d %d %d %u %llu %llu %d\n", - >inet_saddr, ntohs(inet->inet_sport), - >inet_daddr, ntohs(inet->inet_dport), size, - hc->tx_s, hc->tx_rtt, hc->tx_p, - hc->tx_x_calc, hc->tx_x_recv >> 6, - hc->tx_x >> 6, hc->tx_t_ipi); - else - printl("%pI4:%u %pI4:%u
[PATCH net-next v6 1/6] net: tcp: Add trace events for TCP congestion window tracing
This adds an event to trace TCP stat variables with slightly intrusive trace-event. This uses ftrace/perf event log buffer to trace those state, no needs to prepare own ring-buffer, nor custom user apps. User can use ftrace to trace this event as below; # cd /sys/kernel/debug/tracing # echo 1 > events/tcp/tcp_probe/enable (run workloads) # cat trace Signed-off-by: Masami Hiramatsu--- Changes in v6: - Avoid preprocessor directives in tracepoint macro args as Mat did on net tree. --- include/trace/events/tcp.h | 97 net/ipv4/tcp_input.c |3 + 2 files changed, 100 insertions(+) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 8e88a1671538..4dea6342f7d4 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM tcp @@ -8,6 +9,7 @@ #include #include #include +#include /* * tcp event with arguments sk and skb @@ -277,6 +279,101 @@ TRACE_EVENT(tcp_retransmit_synack, __entry->saddr_v6, __entry->daddr_v6) ); + +#define TP_STORE_ADDR_PORTS_V4(__entry, inet, sk) \ + do {\ + struct sockaddr_in *v4 = (void *)__entry->saddr;\ + \ + v4->sin_family = AF_INET; \ + v4->sin_port = inet->inet_sport;\ + v4->sin_addr.s_addr = inet->inet_saddr; \ + v4 = (void *)__entry->daddr;\ + v4->sin_family = AF_INET; \ + v4->sin_port = inet->inet_dport;\ + v4->sin_addr.s_addr = inet->inet_daddr; \ + } while (0) + +#if IS_ENABLED(CONFIG_IPV6) + +#define TP_STORE_ADDR_PORTS(__entry, inet, sk) \ + do {\ + if (sk->sk_family == AF_INET6) {\ + struct sockaddr_in6 *v6 = (void *)__entry->saddr; \ + \ + v6->sin6_family = AF_INET6; \ + v6->sin6_port = inet->inet_sport; \ + v6->sin6_addr = inet6_sk(sk)->saddr;\ + v6 = (void *)__entry->daddr;\ + v6->sin6_family = AF_INET6; \ + v6->sin6_port = inet->inet_dport; \ + v6->sin6_addr = sk->sk_v6_daddr;\ + } else \ + TP_STORE_ADDR_PORTS_V4(__entry, inet, sk); \ + } while (0) + +#else + +#define TP_STORE_ADDR_PORTS(__entry, inet, sk) \ + TP_STORE_ADDR_PORTS_V4(__entry, inet, sk); + +#endif + +TRACE_EVENT(tcp_probe, + + TP_PROTO(struct sock *sk, struct sk_buff *skb), + + TP_ARGS(sk, skb), + + TP_STRUCT__entry( + /* sockaddr_in6 is always bigger than sockaddr_in */ + __array(__u8, saddr, sizeof(struct sockaddr_in6)) + __array(__u8, daddr, sizeof(struct sockaddr_in6)) + __field(__u16, sport) + __field(__u16, dport) + __field(__u32, mark) + __field(__u16, length) + __field(__u32, snd_nxt) + __field(__u32, snd_una) + __field(__u32, snd_cwnd) + __field(__u32, ssthresh) + __field(__u32, snd_wnd) + __field(__u32, srtt) + __field(__u32, rcv_wnd) + ), + + TP_fast_assign( + const struct tcp_sock *tp = tcp_sk(sk); + const struct inet_sock *inet = inet_sk(sk); + + memset(__entry->saddr, 0, sizeof(struct sockaddr_in6)); + memset(__entry->daddr, 0, sizeof(struct sockaddr_in6)); + + TP_STORE_ADDR_PORTS(__entry, inet, sk); + + /* For filtering use */ + __entry->sport = ntohs(inet->inet_sport); + __entry->dport = ntohs(inet->inet_dport); + __entry->mark = skb->mark; + + __entry->length = skb->len; + __entry->snd_nxt = tp->snd_nxt; + __entry->snd_una = tp->snd_una; + __entry->snd_cwnd = tp->snd_cwnd; + __entry->snd_wnd = tp->snd_wnd; + __entry->rcv_wnd = tp->rcv_wnd; + __entry->ssthresh = tcp_current_ssthresh(sk); + __entry->srtt = tp->srtt_us >> 3; + ), + +
[PATCH net-next v6 4/6] net: sctp: Remove debug SCTP probe module
Remove SCTP probe module since jprobe has been deprecated. That function is now replaced by sctp/sctp_probe and sctp/sctp_probe_path trace-events. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu--- net/sctp/Kconfig | 12 --- net/sctp/Makefile |3 - net/sctp/probe.c | 244 - 3 files changed, 259 deletions(-) delete mode 100644 net/sctp/probe.c diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig index d9c04dc1b3f3..c740b189d4ba 100644 --- a/net/sctp/Kconfig +++ b/net/sctp/Kconfig @@ -37,18 +37,6 @@ menuconfig IP_SCTP if IP_SCTP -config NET_SCTPPROBE - tristate "SCTP: Association probing" -depends on PROC_FS && KPROBES ----help--- -This module allows for capturing the changes to SCTP association -state in response to incoming packets. It is used for debugging -SCTP congestion control algorithms. If you don't understand -what was just said, you don't need it: say N. - -To compile this code as a module, choose M here: the -module will be called sctp_probe. - config SCTP_DBG_OBJCNT bool "SCTP: Debug object counts" depends on PROC_FS diff --git a/net/sctp/Makefile b/net/sctp/Makefile index 54bd9c1a8aa1..6776582ec449 100644 --- a/net/sctp/Makefile +++ b/net/sctp/Makefile @@ -4,7 +4,6 @@ # obj-$(CONFIG_IP_SCTP) += sctp.o -obj-$(CONFIG_NET_SCTPPROBE) += sctp_probe.o obj-$(CONFIG_INET_SCTP_DIAG) += sctp_diag.o sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \ @@ -16,8 +15,6 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \ offload.o stream_sched.o stream_sched_prio.o \ stream_sched_rr.o stream_interleave.o -sctp_probe-y := probe.o - sctp-$(CONFIG_SCTP_DBG_OBJCNT) += objcnt.o sctp-$(CONFIG_PROC_FS) += proc.o sctp-$(CONFIG_SYSCTL) += sysctl.o diff --git a/net/sctp/probe.c b/net/sctp/probe.c deleted file mode 100644 index 1280f85a598d.. --- a/net/sctp/probe.c +++ /dev/null @@ -1,244 +0,0 @@ -/* - * sctp_probe - Observe the SCTP flow with kprobes. - * - * The idea for this came from Werner Almesberger's umlsim - * Copyright (C) 2004, Stephen Hemminger - * - * Modified for SCTP from Stephen Hemminger's code - * Copyright (C) 2010, Wei Yongjun - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - */ - -#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include - -MODULE_SOFTDEP("pre: sctp"); -MODULE_AUTHOR("Wei Yongjun "); -MODULE_DESCRIPTION("SCTP snooper"); -MODULE_LICENSE("GPL"); - -static int port __read_mostly = 0; -MODULE_PARM_DESC(port, "Port to match (0=all)"); -module_param(port, int, 0); - -static unsigned int fwmark __read_mostly = 0; -MODULE_PARM_DESC(fwmark, "skb mark to match (0=no mark)"); -module_param(fwmark, uint, 0); - -static int bufsize __read_mostly = 64 * 1024; -MODULE_PARM_DESC(bufsize, "Log buffer size (default 64k)"); -module_param(bufsize, int, 0); - -static int full __read_mostly = 1; -MODULE_PARM_DESC(full, "Full log (1=every ack packet received, 0=only cwnd changes)"); -module_param(full, int, 0); - -static const char procname[] = "sctpprobe"; - -static struct { - struct kfifo fifo; - spinlock_tlock; - wait_queue_head_t wait; - struct timespec64 tstart; -} sctpw; - -static __printf(1, 2) void printl(const char *fmt, ...) -{ - va_list args; - int len; - char tbuf[256]; - - va_start(args, fmt); - len = vscnprintf(tbuf, sizeof(tbuf), fmt, args); - va_end(args); - - kfifo_in_locked(, tbuf, len, ); - wake_up(); -} - -static int sctpprobe_open(struct inode *inode, struct file *file) -{ - kfifo_reset(); - ktime_get_ts64(); - - return 0; -} - -static ssize_t sctpprobe_read(struct file *file, char __user *buf, - size_t len, loff_t *ppos) -{ - int error = 0, cnt = 0; - unsigned char *tbuf; - - if (!buf) - return -EINVAL; - - if (len == 0) - return 0; - - tbuf = vmalloc(len); - if (!tbuf) - return
[PATCH net-next v6 2/6] net: tcp: Remove TCP probe module
Remove TCP probe module since jprobe has been deprecated. That function is now replaced by tcp/tcp_probe trace-event. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu--- net/Kconfig | 17 --- net/ipv4/Makefile|1 net/ipv4/tcp_probe.c | 301 -- 3 files changed, 319 deletions(-) delete mode 100644 net/ipv4/tcp_probe.c diff --git a/net/Kconfig b/net/Kconfig index 9dba2715919d..efe930db3c08 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -336,23 +336,6 @@ config NET_PKTGEN To compile this code as a module, choose M here: the module will be called pktgen. -config NET_TCPPROBE - tristate "TCP connection probing" - depends on INET && PROC_FS && KPROBES - ---help--- - This module allows for capturing the changes to TCP connection - state in response to incoming packets. It is used for debugging - TCP congestion avoidance modules. If you don't understand - what was just said, you don't need it: say N. - - Documentation on how to use TCP connection probing can be found - at: - - http://www.linuxfoundation.org/collaborate/workgroups/networking/tcpprobe - - To compile this code as a module, choose M here: the - module will be called tcp_probe. - config NET_DROP_MONITOR tristate "Network packet drop alerting service" depends on INET && TRACEPOINTS diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index c6c8ad1d4b6d..47a0a6649a9d 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -43,7 +43,6 @@ obj-$(CONFIG_INET_DIAG) += inet_diag.o obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o -obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c deleted file mode 100644 index 697f4c67b2e3.. --- a/net/ipv4/tcp_probe.c +++ /dev/null @@ -1,301 +0,0 @@ -/* - * tcpprobe - Observe the TCP flow with kprobes. - * - * The idea for this came from Werner Almesberger's umlsim - * Copyright (C) 2004, Stephen Hemminger - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - */ - -#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include - -MODULE_AUTHOR("Stephen Hemminger "); -MODULE_DESCRIPTION("TCP cwnd snooper"); -MODULE_LICENSE("GPL"); -MODULE_VERSION("1.1"); - -static int port __read_mostly; -MODULE_PARM_DESC(port, "Port to match (0=all)"); -module_param(port, int, 0); - -static unsigned int bufsize __read_mostly = 4096; -MODULE_PARM_DESC(bufsize, "Log buffer size in packets (4096)"); -module_param(bufsize, uint, 0); - -static unsigned int fwmark __read_mostly; -MODULE_PARM_DESC(fwmark, "skb mark to match (0=no mark)"); -module_param(fwmark, uint, 0); - -static int full __read_mostly; -MODULE_PARM_DESC(full, "Full log (1=every ack packet received, 0=only cwnd changes)"); -module_param(full, int, 0); - -static const char procname[] = "tcpprobe"; - -struct tcp_log { - ktime_t tstamp; - union { - struct sockaddr raw; - struct sockaddr_in v4; - struct sockaddr_in6 v6; - } src, dst; - u16 length; - u32 snd_nxt; - u32 snd_una; - u32 snd_wnd; - u32 rcv_wnd; - u32 snd_cwnd; - u32 ssthresh; - u32 srtt; -}; - -static struct { - spinlock_t lock; - wait_queue_head_t wait; - ktime_t start; - u32 lastcwnd; - - unsigned long head, tail; - struct tcp_log *log; -} tcp_probe; - -static inline int tcp_probe_used(void) -{ - return (tcp_probe.head - tcp_probe.tail) & (bufsize - 1); -} - -static inline int tcp_probe_avail(void) -{ - return bufsize - tcp_probe_used() - 1; -} - -#define tcp_probe_copy_fl_to_si4(inet, si4, mem) \ - do {\ - si4.sin_family = AF_INET;
[PATCH net-next v6 3/6] net: sctp: Add SCTP ACK tracking trace event
Add SCTP ACK tracking trace event to trace the changes of SCTP association state in response to incoming packets. It is used for debugging SCTP congestion control algorithms, and will replace sctp_probe module. Note that this event a bit tricky. Since this consists of 2 events (sctp_probe and sctp_probe_path) so you have to enable both events as below. # cd /sys/kernel/debug/tracing # echo 1 > events/sctp/sctp_probe/enable # echo 1 > events/sctp/sctp_probe_path/enable Or, you can enable all the events under sctp. # echo 1 > events/sctp/enable Since sctp_probe_path event is always invoked from sctp_probe event, you can not see any output if you only enable sctp_probe_path. Signed-off-by: Masami Hiramatsu--- Changes in v3: - Add checking whether sctp_probe_path event is enabled before iterating sctp paths to record. Thanks Steven. Changes in v4: - Move a temporal variable definition in the block. - Fix to cast pointer to unsigned long instead of __u64 for 32bit environment. --- include/trace/events/sctp.h | 99 +++ net/sctp/sm_statefuns.c |5 ++ 2 files changed, 104 insertions(+) create mode 100644 include/trace/events/sctp.h diff --git a/include/trace/events/sctp.h b/include/trace/events/sctp.h new file mode 100644 index ..7475c7be165a --- /dev/null +++ b/include/trace/events/sctp.h @@ -0,0 +1,99 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM sctp + +#if !defined(_TRACE_SCTP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SCTP_H + +#include +#include + +TRACE_EVENT(sctp_probe_path, + + TP_PROTO(struct sctp_transport *sp, +const struct sctp_association *asoc), + + TP_ARGS(sp, asoc), + + TP_STRUCT__entry( + __field(__u64, asoc) + __field(__u32, primary) + __array(__u8, ipaddr, sizeof(union sctp_addr)) + __field(__u32, state) + __field(__u32, cwnd) + __field(__u32, ssthresh) + __field(__u32, flight_size) + __field(__u32, partial_bytes_acked) + __field(__u32, pathmtu) + ), + + TP_fast_assign( + __entry->asoc = (unsigned long)asoc; + __entry->primary = (sp == asoc->peer.primary_path); + memcpy(__entry->ipaddr, >ipaddr, sizeof(union sctp_addr)); + __entry->state = sp->state; + __entry->cwnd = sp->cwnd; + __entry->ssthresh = sp->ssthresh; + __entry->flight_size = sp->flight_size; + __entry->partial_bytes_acked = sp->partial_bytes_acked; + __entry->pathmtu = sp->pathmtu; + ), + + TP_printk("asoc=%#llx%s ipaddr=%pISpc state=%u cwnd=%u ssthresh=%u " + "flight_size=%u partial_bytes_acked=%u pathmtu=%u", + __entry->asoc, __entry->primary ? "(*)" : "", + __entry->ipaddr, __entry->state, __entry->cwnd, + __entry->ssthresh, __entry->flight_size, + __entry->partial_bytes_acked, __entry->pathmtu) +); + +TRACE_EVENT(sctp_probe, + + TP_PROTO(const struct sctp_endpoint *ep, +const struct sctp_association *asoc, +struct sctp_chunk *chunk), + + TP_ARGS(ep, asoc, chunk), + + TP_STRUCT__entry( + __field(__u64, asoc) + __field(__u32, mark) + __field(__u16, bind_port) + __field(__u16, peer_port) + __field(__u32, pathmtu) + __field(__u32, rwnd) + __field(__u16, unack_data) + ), + + TP_fast_assign( + struct sk_buff *skb = chunk->skb; + + __entry->asoc = (unsigned long)asoc; + __entry->mark = skb->mark; + __entry->bind_port = ep->base.bind_addr.port; + __entry->peer_port = asoc->peer.port; + __entry->pathmtu = asoc->pathmtu; + __entry->rwnd = asoc->peer.rwnd; + __entry->unack_data = asoc->unack_data; + + if (trace_sctp_probe_path_enabled()) { + struct sctp_transport *sp; + + list_for_each_entry(sp, >peer.transport_addr_list, + transports) { + trace_sctp_probe_path(sp, asoc); + } + } + ), + + TP_printk("asoc=%#llx mark=%#x bind_port=%d peer_port=%d pathmtu=%d " + "rwnd=%u unack_data=%d", + __entry->asoc, __entry->mark, __entry->bind_port, + __entry->peer_port, __entry->pathmtu, __entry->rwnd, + __entry->unack_data) +); + +#endif /* _TRACE_SCTP_H */ + +/* This part must be outside protection */ +#include diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c index
[PATCH net-next v6 5/6] net: dccp: Add DCCP sendmsg trace event
Add DCCP sendmsg trace event (dccp/dccp_probe) for replacing dccpprobe. User can trace this event via ftrace or perftools. Signed-off-by: Masami Hiramatsu--- Changes in v5 - Fix to add local directory to include for trace.h. Thanks Steven! --- net/dccp/Makefile |3 ++ net/dccp/proto.c |5 +++ net/dccp/trace.h | 105 + 3 files changed, 113 insertions(+) create mode 100644 net/dccp/trace.h diff --git a/net/dccp/Makefile b/net/dccp/Makefile index 2e7b56097bc4..4215f13a63af 100644 --- a/net/dccp/Makefile +++ b/net/dccp/Makefile @@ -27,3 +27,6 @@ dccp-$(CONFIG_SYSCTL) += sysctl.o dccp_diag-y := diag.o dccp_probe-y := probe.o + +# build with local directory for trace.h +CFLAGS_proto.o := -I$(src) diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 7a75a1d3568b..fa7e92e08920 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -38,6 +38,9 @@ #include "dccp.h" #include "feat.h" +#define CREATE_TRACE_POINTS +#include "trace.h" + DEFINE_SNMP_STAT(struct dccp_mib, dccp_statistics) __read_mostly; EXPORT_SYMBOL_GPL(dccp_statistics); @@ -761,6 +764,8 @@ int dccp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) int rc, size; long timeo; + trace_dccp_probe(sk, len); + if (len > dp->dccps_mss_cache) return -EMSGSIZE; diff --git a/net/dccp/trace.h b/net/dccp/trace.h new file mode 100644 index ..aa01321a6c37 --- /dev/null +++ b/net/dccp/trace.h @@ -0,0 +1,105 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM dccp + +#if !defined(_TRACE_DCCP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_DCCP_H + +#include +#include "dccp.h" +#include "ccids/ccid3.h" +#include + +TRACE_EVENT(dccp_probe, + + TP_PROTO(struct sock *sk, size_t size), + + TP_ARGS(sk, size), + + TP_STRUCT__entry( + /* sockaddr_in6 is always bigger than sockaddr_in */ + __array(__u8, saddr, sizeof(struct sockaddr_in6)) + __array(__u8, daddr, sizeof(struct sockaddr_in6)) + __field(__u16, sport) + __field(__u16, dport) + __field(__u16, size) + __field(__u16, tx_s) + __field(__u32, tx_rtt) + __field(__u32, tx_p) + __field(__u32, tx_x_calc) + __field(__u64, tx_x_recv) + __field(__u64, tx_x) + __field(__u32, tx_t_ipi) + ), + + TP_fast_assign( + const struct inet_sock *inet = inet_sk(sk); + struct ccid3_hc_tx_sock *hc = NULL; + + if (ccid_get_current_tx_ccid(dccp_sk(sk)) == DCCPC_CCID3) + hc = ccid3_hc_tx_sk(sk); + + memset(__entry->saddr, 0, sizeof(struct sockaddr_in6)); + memset(__entry->daddr, 0, sizeof(struct sockaddr_in6)); + + if (sk->sk_family == AF_INET) { + struct sockaddr_in *v4 = (void *)__entry->saddr; + + v4->sin_family = AF_INET; + v4->sin_port = inet->inet_sport; + v4->sin_addr.s_addr = inet->inet_saddr; + v4 = (void *)__entry->daddr; + v4->sin_family = AF_INET; + v4->sin_port = inet->inet_dport; + v4->sin_addr.s_addr = inet->inet_daddr; +#if IS_ENABLED(CONFIG_IPV6) + } else if (sk->sk_family == AF_INET6) { + struct sockaddr_in6 *v6 = (void *)__entry->saddr; + + v6->sin6_family = AF_INET6; + v6->sin6_port = inet->inet_sport; + v6->sin6_addr = inet6_sk(sk)->saddr; + v6 = (void *)__entry->daddr; + v6->sin6_family = AF_INET6; + v6->sin6_port = inet->inet_dport; + v6->sin6_addr = sk->sk_v6_daddr; +#endif + } + + /* For filtering use */ + __entry->sport = ntohs(inet->inet_sport); + __entry->dport = ntohs(inet->inet_dport); + + __entry->size = size; + if (hc) { + __entry->tx_s = hc->tx_s; + __entry->tx_rtt = hc->tx_rtt; + __entry->tx_p = hc->tx_p; + __entry->tx_x_calc = hc->tx_x_calc; + __entry->tx_x_recv = hc->tx_x_recv >> 6; + __entry->tx_x = hc->tx_x >> 6; + __entry->tx_t_ipi = hc->tx_t_ipi; + } else { + __entry->tx_s = 0; + memset(&__entry->tx_rtt, 0, (void *)&__entry->tx_t_ipi - + (void *)&__entry->tx_rtt + + sizeof(__entry->tx_t_ipi)); + } + ), + + TP_printk("src=%pISpc dest=%pISpc
[PATCH net-next v6 0/6] net: tcp: sctp: dccp: Replace jprobe usage with trace events
Hi, This series is v6 of the replacement of jprobe usage with trace events. This version fixes trace/events/tcp.h to avoid sparse warning, as same as Mat Martineau's patch. Previous version is here; https://www.spinics.net/lists/netdev/msg474057.html Changes from v5: [1/6]: Avoid preprocessor directives in tracepoint macro args Thank you, --- Masami Hiramatsu (6): net: tcp: Add trace events for TCP congestion window tracing net: tcp: Remove TCP probe module net: sctp: Add SCTP ACK tracking trace event net: sctp: Remove debug SCTP probe module net: dccp: Add DCCP sendmsg trace event net: dccp: Remove dccpprobe module include/trace/events/sctp.h | 99 ++ include/trace/events/tcp.h | 97 ++ net/Kconfig | 17 -- net/dccp/Kconfig| 17 -- net/dccp/Makefile |5 - net/dccp/probe.c| 203 - net/dccp/proto.c|5 + net/dccp/trace.h| 105 +++ net/ipv4/Makefile |1 net/ipv4/tcp_input.c|3 net/ipv4/tcp_probe.c| 301 --- net/sctp/Kconfig| 12 -- net/sctp/Makefile |3 net/sctp/probe.c| 244 --- net/sctp/sm_statefuns.c |5 + 15 files changed, 317 insertions(+), 800 deletions(-) create mode 100644 include/trace/events/sctp.h delete mode 100644 net/dccp/probe.c create mode 100644 net/dccp/trace.h delete mode 100644 net/ipv4/tcp_probe.c delete mode 100644 net/sctp/probe.c -- Masami Hiramatsu (Linaro)
[PATCH net-next] cxgb4: display VNI correctly
Fix incorrect VNI display in mps_tcam Signed-off-by: Santosh RastapurSigned-off-by: Ganesh Goudar --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c index d3ced04..4ea76c1 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c @@ -1743,7 +1743,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v) */ if (lookup_type && (lookup_type != DATALKPTYPE_M)) { /* Inner header VNI */ - vniy = ((data2 & DATAVIDH2_F) << 23) | + vniy = (data2 & DATAVIDH2_F) | (DATAVIDH1_G(data2) << 16) | VIDL_G(val); dip_hit = data2 & DATADIPHIT_F; } else { @@ -1753,6 +1753,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v) port_num = DATAPORTNUM_G(data2); /* Read tcamx. Change the control param */ + vnix = 0; ctl |= CTLXYBITSEL_V(1); t4_write_reg(adap, MPS_CLS_TCAM_DATA2_CTL_A, ctl); val = t4_read_reg(adap, MPS_CLS_TCAM_DATA1_A); @@ -1761,7 +1762,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v) data2 = t4_read_reg(adap, MPS_CLS_TCAM_DATA2_CTL_A); if (lookup_type && (lookup_type != DATALKPTYPE_M)) { /* Inner header VNI mask */ - vnix = ((data2 & DATAVIDH2_F) << 23) | + vnix = (data2 & DATAVIDH2_F) | (DATAVIDH1_G(data2) << 16) | VIDL_G(val); } } else { @@ -1834,7 +1835,8 @@ static int mps_tcam_show(struct seq_file *seq, void *v) addr[1], addr[2], addr[3], addr[4], addr[5], (unsigned long long)mask, - vniy, vnix, dip_hit ? 'Y' : 'N', + vniy, (vnix | vniy), + dip_hit ? 'Y' : 'N', port_num, (cls_lo & T6_SRAM_VLD_F) ? 'Y' : 'N', PORTMAP_G(cls_hi), -- 2.1.0
Re: [PATCH iproute2 0/3] ip/tunnel: Fix noencap- and document "external" parameter handling.
Stephen Hemminger wrote: > On Wed, 27 Dec 2017 13:28:13 +0200 > Serhey Popovychwrote: > >> In this series I present next set of improvements/fixes: >> >> 1) Fix noencap- option handling: we need to clear bit, instead >> of seting all, but one we expect to clear. >> >> 2) Document "external" parameter both in ip-link(8) and help >> output for link_gre.c. Add "noexternal" option variant to >> bring inline with GENEVE and VXLAN. >> >> 3) Trivial: clear flowlabel/tclass from flowinfo in case of >> inherit to stop sending garbage to the kernel. It has no >> functional change, but follows similar behaviour in link_ip6tnl.c >> >> See individual patch description message for details. >> >> Thanks, >> Serhii >> >> Serhey Popovych (3): >> gre,ip6tnl/tunnel: Fix noencap- support >> gre6/tunnel: Do not submit garbage in flowinfo >> ip/tunnel: Document "external" parameter >> >> ip/link_gre.c |7 +-- >> ip/link_gre6.c|4 ++-- >> ip/link_ip6tnl.c |6 -- >> ip/link_iptnl.c |4 +++- >> man/man8/ip-link.8.in |6 ++ >> 5 files changed, 20 insertions(+), 7 deletions(-) >> > > These are really disjoint. I applied the noencap and the flowinfo patch. > Agree with William that having noexternal which does nothing useful is of > little value now. Agree with you guys, just added it to match GENEVE/VXLAN implementation which is has "noexternal". Will send v2 to just document "external" and clear flowlabel/tclass in flowinfo in case of inherit. I have plans for iproute2-next to merge link_{gre,vti}{,6}.c and link_ip{6,}tnl.c functionality together and just want to do things as clear as possible before doing merge. This is to reduce size of ip(8) binary to help openwrt like solutions to survive JSON print support that increases size of binary. Thank you for quick review and good feed back! > signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On 12/27/17 8:16 PM, Steven Rostedt wrote: On Wed, 27 Dec 2017 19:45:42 -0800 Alexei Starovoitovwrote: I don't think that's the case. My reading of current trace_kprobe_ftrace() -> arch_check_ftrace_location() is that it will not be true for old mcount case. In the old mcount case, you can't use ftrace to return without calling the function. That is, no modification of the return ip, unless you created a trampoline that could handle arbitrary stack frames, and remove them from the stack before returning back to the function. correct. I was saying that trace_kprobe_ftrace() won't let us do bpf_override_return with old mcount. As far as the rest of your arguments it very much puzzles me that you claim that this patch suppose to work based on historical reasoning whereas you did NOT test it. I believe that Masami is saying that the modification of the IP from kprobes has been very well tested. But I'm guessing that you still want a test case for using kprobes in this particular instance. It's not the implementation of modifying the IP that you are worried about, but the implementation of BPF using it in this case. Right? exactly. No doubt that old code works. But it doesn't mean that bpf_override_return() will continue to work in kprobes that are not ftrace based. I suspect Josef's existing test case will cover this situation. Probably only special .config is needed to disable ftrace, so "kprobe on entry but not ftrace" check will kick in. But I didn't get an impression that this situation was tested. Instead I see only logical reasoning that it's _supposed_ to work. That's not enough.
Re: [PATCH v3 bpf-next 2/2] tools/bpftool: fix bpftool build with bintutils >= 2.9
On Wed, 27 Dec 2017 19:16:29 +, Roman Gushchin wrote: > Bpftool build is broken with binutils version 2.29 and later. > The cause is commit 003ca0fd2286 ("Refactor disassembler selection") > in the binutils repo, which changed the disassembler() function > signature. > > Fix this by adding a new "feature" to the tools/build/features > infrastructure and make it responsible for decision which > disassembler() function signature to use. > > Signed-off-by: Roman Gushchin> Cc: Jakub Kicinski > Cc: Alexei Starovoitov > Cc: Daniel Borkmann Acked-by: Jakub Kicinski
Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On Wed, 27 Dec 2017 19:45:42 -0800 Alexei Starovoitovwrote: > I don't think that's the case. My reading of current > trace_kprobe_ftrace() -> arch_check_ftrace_location() > is that it will not be true for old mcount case. In the old mcount case, you can't use ftrace to return without calling the function. That is, no modification of the return ip, unless you created a trampoline that could handle arbitrary stack frames, and remove them from the stack before returning back to the function. > > As far as the rest of your arguments it very much puzzles me that > you claim that this patch suppose to work based on historical > reasoning whereas you did NOT test it. I believe that Masami is saying that the modification of the IP from kprobes has been very well tested. But I'm guessing that you still want a test case for using kprobes in this particular instance. It's not the implementation of modifying the IP that you are worried about, but the implementation of BPF using it in this case. Right? -- Steve
Re: [patch net-next v5 05/10] net: sched: keep track of offloaded filters and check tc offload feature
On Tue, 26 Dec 2017 15:15:59 +0100, Jiri Pirko wrote: > From: Jiri Pirko> > During block bind, we need to check tc offload feature. If it is > disabled yet still the block contains offloaded filters, forbid the > bind. Also forbid to register callback for a block that already > containes offloaded filters, as the play back is not supported now. > For keeping track of offloaded filters there is a new counter > introduced, alongside with couple of helpers called from cls_* code. > These helpers set and clear TCA_CLS_FLAGS_IN_HW flag. > > Signed-off-by: Jiri Pirko LGTM, thanks!
Re: [PATCH net-next] nfp: bpf: allocate vNIC priv for keeping track of the offloaded program
On Wed, 27 Dec 2017 20:38:11 -0500 (EST), David Miller wrote: > From: Jakub Kicinski> Date: Wed, 27 Dec 2017 15:36:49 -0800 > > > After TC offloads were converted to callbacks we have no choice > > but keep track of the offloaded filter in the driver. > > > > Since this change came a little late in the release cycle > > there were a number of conflicts and allocation of vNIC priv > > structure seems to have slipped away in linux-next. > > > > Signed-off-by: Jakub Kicinski > > Oh well, I thought I resolved the merge conflict properly in this > case. Apparently not :-) The resolution looks suspiciously similar to what Stephen did in linux-next, but I wasn't sure if you reuse his resolutions hence no Fixes: 903628bbc3a7 ;)
Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework
On 12/27/17 5:38 PM, Masami Hiramatsu wrote: On Wed, 27 Dec 2017 14:49:46 -0800 Alexei Starovoitovwrote: On 12/27/17 12:09 AM, Masami Hiramatsu wrote: On Tue, 26 Dec 2017 18:12:56 -0800 Alexei Starovoitov wrote: On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote: Support in-kernel fault-injection framework via debugfs. This allows you to inject a conditional error to specified function using debugfs interfaces. Signed-off-by: Masami Hiramatsu --- Documentation/fault-injection/fault-injection.txt |5 + kernel/Makefile |1 kernel/fail_function.c| 169 + lib/Kconfig.debug | 10 + 4 files changed, 185 insertions(+) create mode 100644 kernel/fail_function.c diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt index 918972babcd8..6243a588dd71 100644 --- a/Documentation/fault-injection/fault-injection.txt +++ b/Documentation/fault-injection/fault-injection.txt @@ -30,6 +30,11 @@ o fail_mmc_request injects MMC data errors on devices permitted by setting debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request +o fail_function + + injects error return on specific functions by setting debugfs entries + under /sys/kernel/debug/fail_function. No boot option supported. I like it. Could you document it a bit better? Yes, I will do in next series. In particular retval is configurable, but without an example no one will be able to figure out how to use it. Ah, right. BTW, as I pointed in the covermail, should we store the expected error value range into the injectable list? e.g. ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO) And provide APIs to check/get it. I'm afraid such check would be too costly. Right now we have only two functions marked but I expect hundreds more will be added in the near future as soon as developers realize the potential of such error injection. All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data. Multiple by 1k and we have 8k of data spent on marks. If we add max/min range marks that doubles it for very little use. I think marking function only is enough. Sorry, I don't think so. Even if it takes 16 bytes more for each points, I don't think it is any overhead for machines in these days. Even if so, we can provide a kconfig to reduce it. I mean, we are living in GB-order memory are, and it will be bigger in the future. Why we have to worry about hundreds of 16bytes memory pieces? It will take a few KB, and even if we mark thousands of functions, it never reaches 1MB, in GB memory pool. :) Of course, for many small-footprint embedded devices (like having less than 128MB memory), this feature can be a overhead. But they can cut off the table by kconfig. I still disagree on wasting 16-byte * num_of_funcs of .data here. The trade-off of usability vs memory just not worth it. Sorry. Please focus on testing your changes instead.
[net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds
When running consumer and/or producer operations and empty checks in parallel its possible to have the empty check run past the end of the array. The scenario occurs when an empty check is run while __ptr_ring_discard_one() is in progress. Specifically after the consumer_head is incremented but before (consumer_head >= ring_size) check is made and the consumer head is zeroe'd. To resolve this, without having to rework how consumer/producer ops work on the array, simply add an extra dummy slot to the end of the array. Even if we did a rework to avoid the extra slot it looks like the normal case checks would suffer some so best to just allocate an extra pointer. Reported-by: Jakub KicinskiFixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array") Signed-off-by: John Fastabend --- include/linux/ptr_ring.h |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 6866df4..13fb06a 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -447,7 +447,12 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { - return kcalloc(size, sizeof(void *), gfp); + /* Allocate an extra dummy element at end of ring to avoid consumer head +* or produce head access past the end of the array. Possible when +* producer/consumer operations and __ptr_ring_peek operations run in +* parallel. +*/ + return kcalloc(size + 1, sizeof(void *), gfp); } static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On 12/27/17 6:34 PM, Masami Hiramatsu wrote: On Wed, 27 Dec 2017 14:46:24 -0800 Alexei Starovoitovwrote: On 12/26/17 9:56 PM, Masami Hiramatsu wrote: On Tue, 26 Dec 2017 17:57:32 -0800 Alexei Starovoitov wrote: On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote: Check whether error injectable event is on function entry or not. Currently it checks the event is ftrace-based kprobes or not, but that is wrong. It should check if the event is on the entry of target function. Since error injection will override a function to just return with modified return value, that operation must be done before the target function starts making stackframe. As a side effect, bpf error injection is no need to depend on function-tracer. It can work with sw-breakpoint based kprobe events too. Signed-off-by: Masami Hiramatsu --- kernel/trace/Kconfig|2 -- kernel/trace/bpf_trace.c|6 +++--- kernel/trace/trace_kprobe.c |8 +--- kernel/trace/trace_probe.h | 12 ++-- 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index ae3a2d519e50..6400e1bf97c5 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -533,9 +533,7 @@ config FUNCTION_PROFILER config BPF_KPROBE_OVERRIDE bool "Enable BPF programs to override a kprobed function" depends on BPF_EVENTS - depends on KPROBES_ON_FTRACE depends on HAVE_KPROBE_OVERRIDE - depends on DYNAMIC_FTRACE_WITH_REGS default n help Allows BPF to override the execution of a probed function and diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f6d2327ecb59..d663660f8392 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event, int ret = -EEXIST; /* -* Kprobe override only works for ftrace based kprobes, and only if they -* are on the opt-in list. +* Kprobe override only works if they are on the function entry, +* and only if they are on the opt-in list. */ if (prog->kprobe_override && - (!trace_kprobe_ftrace(event->tp_event) || + (!trace_kprobe_on_func_entry(event->tp_event) || !trace_kprobe_error_injectable(event->tp_event))) return -EINVAL; diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 91f4b57dab82..265e3e27e8dc 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk) return nhit; } -int trace_kprobe_ftrace(struct trace_event_call *call) +bool trace_kprobe_on_func_entry(struct trace_event_call *call) { struct trace_kprobe *tk = (struct trace_kprobe *)call->data; - return kprobe_ftrace(>rp.kp); + + return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name, + tk->rp.kp.offset); That would be nice, but did you test this? Yes, because the jprobe, which was only official user of modifying execution path using kprobe, did same way to check. (and kretprobe also does it) My understanding that kprobe will restore all regs and here we need to override return ip _and_ value. yes, no problem. kprobe restore all regs from pt_regs, including regs->ip. Could you add a patch with the test the way Josef did or describe the steps to test this new mode? Would you mean below patch? If so, it should work without any change. [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return yeah. I expect bpf_override_return test to work as-is. I'm asking for the test for new functionality added by this patch. In particular kprobe on func entry without ftrace. How did you test it? This function is used in kretprobe and jprobe. Jprobe was the user of "modifying instruction pointer to another function" in kprobes. If it doesn't work, jprobe also doesn't work, this means you can not modify IP by kprobes anymore. Anyway, until linux-4.13, that was well tested by kprobe smoke test. and how I can repeat the test? I'm still not sure that it works correctly. That works correctly because it checks given address is on the entry point (the 1st instruction) of a function, using kallsyms. The reason why I made another flag for ftrace was, there are 2 modes for ftrace dynamic instrumentation, fentry and mcount. With new fentry mode, ftrace will be put on the first instruction of the function, so it will work as you expected. With traditional gcc mcount, ftrace will be called after making call frame for _mcount(). This means if you modify ip, it will not work or cause a trouble because _mcount call frame is still on the stack. So, current ftrace-based checker doesn't work, it depends on the case. Of course, in most
[PATCH bpf-next v3 1/9] bpf: offload: don't require rtnl for dev list manipulation
We don't need the RTNL lock for all operations on offload state. We only need to hold it around ndo calls. The device offload initialization doesn't require it. The soon-to-come querying of the offload info will only need it partially. We will also be able to remove the waitqueue in following patches. Use struct rw_semaphore because map offload will require sleeping with the semaphore held for read. Suggested-by: Kirill TkhaiSigned-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- v3: - use dev_get_by_index() (Kirill). v2: - use dev_get_by_index_rcu() instead of implicit lock dependencies; - use DECLARE_RWSEM() instead of init_rwsem() (Kirill). --- kernel/bpf/offload.c | 34 -- 1 file changed, 24 insertions(+), 10 deletions(-) diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 8455b89d1bbf..032079754d88 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -20,13 +20,16 @@ #include #include #include +#include -/* protected by RTNL */ +/* Protects bpf_prog_offload_devs and offload members of all progs. + * RTNL lock cannot be taken when holding this lock. + */ +static DECLARE_RWSEM(bpf_devs_lock); static LIST_HEAD(bpf_prog_offload_devs); int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr) { - struct net *net = current->nsproxy->net_ns; struct bpf_dev_offload *offload; if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS && @@ -43,19 +46,26 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr) offload->prog = prog; init_waitqueue_head(>verifier_done); - rtnl_lock(); - offload->netdev = __dev_get_by_index(net, attr->prog_ifindex); - if (!offload->netdev) { - rtnl_unlock(); - kfree(offload); - return -EINVAL; - } + offload->netdev = dev_get_by_index(current->nsproxy->net_ns, + attr->prog_ifindex); + if (!offload->netdev) + goto err_free; + down_write(_devs_lock); + if (offload->netdev->reg_state != NETREG_REGISTERED) + goto err_unlock; prog->aux->offload = offload; list_add_tail(>offloads, _prog_offload_devs); - rtnl_unlock(); + dev_put(offload->netdev); + up_write(_devs_lock); return 0; +err_unlock: + up_write(_devs_lock); + dev_put(offload->netdev); +err_free: + kfree(offload); + return -EINVAL; } static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd, @@ -126,7 +136,9 @@ void bpf_prog_offload_destroy(struct bpf_prog *prog) wake_up(>verifier_done); rtnl_lock(); + down_write(_devs_lock); __bpf_prog_offload_destroy(prog); + up_write(_devs_lock); rtnl_unlock(); kfree(offload); @@ -181,11 +193,13 @@ static int bpf_offload_notification(struct notifier_block *notifier, if (netdev->reg_state != NETREG_UNREGISTERING) break; + down_write(_devs_lock); list_for_each_entry_safe(offload, tmp, _prog_offload_devs, offloads) { if (offload->netdev == netdev) __bpf_prog_offload_destroy(offload->prog); } + up_write(_devs_lock); break; default: break; -- 2.15.1
[PATCH bpf-next v3 5/9] bpf: offload: free program id when device disappears
Bound programs are quite useless after their device disappears. They are simply waiting for reference count to go to zero, don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID early. Note that orphaned offload programs will return -ENODEV on BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet Acked-by: Alexei Starovoitov --- include/linux/bpf.h | 2 ++ kernel/bpf/offload.c | 3 +++ kernel/bpf/syscall.c | 9 +++-- 3 files changed, 12 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 669549f7e3e8..9a916ab34299 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -357,6 +357,8 @@ void bpf_prog_put(struct bpf_prog *prog); int __bpf_prog_charge(struct user_struct *user, u32 pages); void __bpf_prog_uncharge(struct user_struct *user, u32 pages); +void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock); + struct bpf_map *bpf_map_get_with_uref(u32 ufd); struct bpf_map *__bpf_map_get(struct fd f); struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref); diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 3126e1a842e6..e4f1668a021c 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -130,6 +130,9 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog) if (offload->dev_state) WARN_ON(__bpf_offload_ndo(prog, BPF_OFFLOAD_DESTROY, )); + /* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */ + bpf_prog_free_id(prog, true); + list_del_init(>offloads); kfree(offload); prog->aux->offload = NULL; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 1143db61584c..7d9f5b0f0e49 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -905,9 +905,13 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog) return id > 0 ? 0 : id; } -static void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock) +void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock) { - /* cBPF to eBPF migrations are currently not in the idr store. */ + /* cBPF to eBPF migrations are currently not in the idr store. +* Offloaded programs are removed from the store when their device +* disappears - even if someone grabs an fd to them they are unusable, +* simply waiting for refcnt to drop to be freed. +*/ if (!prog->aux->id) return; @@ -917,6 +921,7 @@ static void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock) __acquire(_idr_lock); idr_remove(_idr, prog->aux->id); + prog->aux->id = 0; if (do_idr_lock) spin_unlock_bh(_idr_lock); -- 2.15.1
[PATCH bpf-next v3 4/9] bpf: offload: free prog->aux->offload when device disappears
All bpf offload operations should now be under bpf_devs_lock, it's safe to free and clear the entire offload structure, not only the netdev pointer. __bpf_prog_offload_destroy() will no longer be called multiple times. Suggested-by: Alexei StarovoitovSigned-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet Acked-by: Alexei Starovoitov --- kernel/bpf/offload.c | 23 +-- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 69ddc3899bab..3126e1a842e6 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -70,12 +70,14 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr) static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd, struct netdev_bpf *data) { - struct net_device *netdev = prog->aux->offload->netdev; + struct bpf_dev_offload *offload = prog->aux->offload; + struct net_device *netdev; ASSERT_RTNL(); - if (!netdev) + if (!offload) return -ENODEV; + netdev = offload->netdev; if (!netdev->netdev_ops->ndo_bpf) return -EOPNOTSUPP; @@ -111,7 +113,7 @@ int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, down_read(_devs_lock); offload = env->prog->aux->offload; - if (offload->netdev) + if (offload) ret = offload->dev_ops->insn_hook(env, insn_idx, prev_insn_idx); up_read(_devs_lock); @@ -123,31 +125,24 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog) struct bpf_dev_offload *offload = prog->aux->offload; struct netdev_bpf data = {}; - /* Caution - if netdev is destroyed before the program, this function -* will be called twice. -*/ - data.offload.prog = prog; if (offload->dev_state) WARN_ON(__bpf_offload_ndo(prog, BPF_OFFLOAD_DESTROY, )); - offload->dev_state = false; list_del_init(>offloads); - offload->netdev = NULL; + kfree(offload); + prog->aux->offload = NULL; } void bpf_prog_offload_destroy(struct bpf_prog *prog) { - struct bpf_dev_offload *offload = prog->aux->offload; - rtnl_lock(); down_write(_devs_lock); - __bpf_prog_offload_destroy(prog); + if (prog->aux->offload) + __bpf_prog_offload_destroy(prog); up_write(_devs_lock); rtnl_unlock(); - - kfree(offload); } static int bpf_prog_offload_translate(struct bpf_prog *prog) -- 2.15.1
[PATCH bpf-next v3 3/9] bpf: offload: allow netdev to disappear while verifier is running
To allow verifier instruction callbacks without any extra locking NETDEV_UNREGISTER notification would wait on a waitqueue for verifier to finish. This design decision was made when rtnl lock was providing all the locking. Use the read/write lock instead and remove the workqueue. Verifier will now call into the offload code, so dev_ops are moved to offload structure. Since verifier calls are all under bpf_prog_is_dev_bound() we no longer need static inline implementations to please builds with CONFIG_NET=n. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet Acked-by: Alexei Starovoitov --- drivers/net/ethernet/netronome/nfp/bpf/main.h | 2 +- drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 2 +- drivers/net/netdevsim/bpf.c | 2 +- include/linux/bpf.h | 9 +-- include/linux/bpf_verifier.h | 16 ++-- include/linux/netdevice.h | 4 +-- kernel/bpf/offload.c | 30 --- kernel/bpf/verifier.c | 20 ++- 8 files changed, 37 insertions(+), 48 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h index aae1be9ed056..89a9b6393882 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h @@ -238,7 +238,7 @@ struct nfp_bpf_vnic { int nfp_bpf_jit(struct nfp_prog *prog); -extern const struct bpf_ext_analyzer_ops nfp_bpf_analyzer_ops; +extern const struct bpf_prog_offload_ops nfp_bpf_analyzer_ops; struct netdev_bpf; struct nfp_app; diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c index 9c2608445bd8..d8870c2f11f3 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c @@ -260,6 +260,6 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx) return 0; } -const struct bpf_ext_analyzer_ops nfp_bpf_analyzer_ops = { +const struct bpf_prog_offload_ops nfp_bpf_analyzer_ops = { .insn_hook = nfp_verify_insn, }; diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index a243fa7ae02f..5134d5c1306c 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -66,7 +66,7 @@ nsim_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn) return 0; } -static const struct bpf_ext_analyzer_ops nsim_bpf_analyzer_ops = { +static const struct bpf_prog_offload_ops nsim_bpf_analyzer_ops = { .insn_hook = nsim_bpf_verify_insn, }; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 838eee10e979..669549f7e3e8 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -17,6 +17,7 @@ #include #include +struct bpf_verifier_env; struct perf_event; struct bpf_prog; struct bpf_map; @@ -184,14 +185,18 @@ struct bpf_verifier_ops { struct bpf_prog *prog, u32 *target_size); }; +struct bpf_prog_offload_ops { + int (*insn_hook)(struct bpf_verifier_env *env, +int insn_idx, int prev_insn_idx); +}; + struct bpf_dev_offload { struct bpf_prog *prog; struct net_device *netdev; void*dev_priv; struct list_headoffloads; booldev_state; - boolverifier_running; - wait_queue_head_t verifier_done; + const struct bpf_prog_offload_ops *dev_ops; }; struct bpf_prog_aux { diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index c009e472f647..70237bb29eb7 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -166,12 +166,6 @@ static inline bool bpf_verifier_log_full(const struct bpf_verifer_log *log) return log->len_used >= log->len_total - 1; } -struct bpf_verifier_env; -struct bpf_ext_analyzer_ops { - int (*insn_hook)(struct bpf_verifier_env *env, -int insn_idx, int prev_insn_idx); -}; - #define BPF_MAX_SUBPROGS 256 /* single container for all structs @@ -185,7 +179,6 @@ struct bpf_verifier_env { bool strict_alignment; /* perform strict pointer alignment checks */ struct bpf_verifier_state *cur_state; /* current verifier state */ struct bpf_verifier_state_list **explored_states; /* search pruning optimization */ - const struct bpf_ext_analyzer_ops *dev_ops; /* device analyzer ops */ struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ u32 used_map_cnt; /* number of used maps */ u32 id_gen; /* used to generate unique reg IDs */ @@ -205,13 +198,8 @@
[PATCH bpf-next v3 9/9] selftests/bpf: test device info reporting for bound progs
Check if bound programs report correct device info. Test in local namespace, in remote one, back to the local ns, remove the device and check that information is cleared. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet -- v2: - check the error code from "prog show pin XX" with device removed is -ENODEV. --- tools/testing/selftests/bpf/test_offload.py | 112 +--- 1 file changed, 101 insertions(+), 11 deletions(-) diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py index c940505c2978..e3c750f17cb8 100755 --- a/tools/testing/selftests/bpf/test_offload.py +++ b/tools/testing/selftests/bpf/test_offload.py @@ -18,6 +18,8 @@ import argparse import json import os import pprint +import random +import string import subprocess import time @@ -27,6 +29,7 @@ bpf_test_dir = os.path.dirname(os.path.realpath(__file__)) pp = pprint.PrettyPrinter() devs = [] # devices we created for clean up files = [] # files to be removed +netns = [] # net namespaces to be removed def log_get_sec(level=0): return "*" * (log_level + level) @@ -128,22 +131,25 @@ files = [] # files to be removed if f in files: files.remove(f) -def tool(name, args, flags, JSON=True, fail=True): +def tool(name, args, flags, JSON=True, ns="", fail=True): params = "" if JSON: params += "%s " % (flags["json"]) -ret, out = cmd(name + " " + params + args, fail=fail) +if ns != "": +ns = "ip netns exec %s " % (ns) + +ret, out = cmd(ns + name + " " + params + args, fail=fail) if JSON and len(out.strip()) != 0: return ret, json.loads(out) else: return ret, out -def bpftool(args, JSON=True, fail=True): -return tool("bpftool", args, {"json":"-p"}, JSON=JSON, fail=fail) +def bpftool(args, JSON=True, ns="", fail=True): +return tool("bpftool", args, {"json":"-p"}, JSON=JSON, ns=ns, fail=fail) -def bpftool_prog_list(expected=None): -_, progs = bpftool("prog show", JSON=True, fail=True) +def bpftool_prog_list(expected=None, ns=""): +_, progs = bpftool("prog show", JSON=True, ns=ns, fail=True) if expected is not None: if len(progs) != expected: fail(True, "%d BPF programs loaded, expected %d" % @@ -158,13 +164,13 @@ files = [] # files to be removed time.sleep(0.05) raise Exception("Time out waiting for program counts to stabilize want %d, have %d" % (expected, nprogs)) -def ip(args, force=False, JSON=True, fail=True): +def ip(args, force=False, JSON=True, ns="", fail=True): if force: args = "-force " + args -return tool("ip", args, {"json":"-j"}, JSON=JSON, fail=fail) +return tool("ip", args, {"json":"-j"}, JSON=JSON, ns=ns, fail=fail) -def tc(args, JSON=True, fail=True): -return tool("tc", args, {"json":"-p"}, JSON=JSON, fail=fail) +def tc(args, JSON=True, ns="", fail=True): +return tool("tc", args, {"json":"-p"}, JSON=JSON, ns=ns, fail=fail) def ethtool(dev, opt, args, fail=True): return cmd("ethtool %s %s %s" % (opt, dev["ifname"], args), fail=fail) @@ -178,6 +184,15 @@ files = [] # files to be removed def bpf_bytecode(bytecode): return "bytecode \"%s\"" % (bytecode) +def mknetns(n_retry=10): +for i in range(n_retry): +name = ''.join([random.choice(string.ascii_letters) for i in range(8)]) +ret, _ = ip("netns add %s" % (name), fail=False) +if ret == 0: +netns.append(name) +return name +return None + class DebugfsDir: """ Class for accessing DebugFS directories as a dictionary. @@ -237,6 +252,8 @@ files = [] # files to be removed self.dev = self._netdevsim_create() devs.append(self) +self.ns = "" + self.dfs_dir = '/sys/kernel/debug/netdevsim/%s' % (self.dev['ifname']) self.dfs_refresh() @@ -257,7 +274,7 @@ files = [] # files to be removed def remove(self): devs.remove(self) -ip("link del dev %s" % (self.dev["ifname"])) +ip("link del dev %s" % (self.dev["ifname"]), ns=self.ns) def dfs_refresh(self): self.dfs = DebugfsDir(self.dfs_dir) @@ -285,6 +302,11 @@ files = [] # files to be removed time.sleep(0.05) raise Exception("Time out waiting for program counts to stabilize want %d/%d, have %d bound, %d loaded" % (bound, total, nbound, nprogs)) +def set_ns(self, ns): +name = "1" if ns == "" else ns +ip("link set dev %s netns %s" % (self.dev["ifname"], name), ns=self.ns) +self.ns = ns + def set_mtu(self, mtu, fail=True): return ip("link set dev %s mtu %d" % (self.dev["ifname"], mtu), fail=fail) @@ -372,6 +394,8 @@ files = [] # files to be removed dev.remove() for f in files: cmd("rm -f %s" % (f)) +for ns in netns: +cmd("ip netns delete
[PATCH bpf-next v3 7/9] bpf: offload: report device information for offloaded programs
Report to the user ifindex and namespace information of offloaded programs. If device has disappeared return -ENODEV. Specify the namespace using dev/inode combination. CC: Eric W. BiedermanSigned-off-by: Jakub Kicinski --- v3: - path_put() (Daniel); - move more of the logic to nsfs.c (Daniel). v2: - take RTNL lock to grab a coherent snapshot of device state (ifindex vs name space) and avoid races with name space moves (based on Eric's comment on Kirill's patch to peernet2id_alloc()). --- include/linux/bpf.h| 2 ++ include/uapi/linux/bpf.h | 3 +++ kernel/bpf/offload.c | 59 ++ kernel/bpf/syscall.c | 6 + tools/include/uapi/linux/bpf.h | 3 +++ 5 files changed, 73 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 9a916ab34299..7810ae57b357 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -531,6 +531,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd, int bpf_prog_offload_compile(struct bpf_prog *prog); void bpf_prog_offload_destroy(struct bpf_prog *prog); +int bpf_prog_offload_info_fill(struct bpf_prog_info *info, + struct bpf_prog *prog); #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL) int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d01f1cb3cfc0..72b37fc3bc0c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -921,6 +921,9 @@ struct bpf_prog_info { __u32 nr_map_ids; __aligned_u64 map_ids; char name[BPF_OBJ_NAME_LEN]; + __u32 ifindex; + __u64 netns_dev; + __u64 netns_ino; } __attribute__((aligned(8))); struct bpf_map_info { diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index e4f1668a021c..040d4e0edf3f 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -16,9 +16,11 @@ #include #include #include +#include #include #include #include +#include #include #include @@ -176,6 +178,63 @@ int bpf_prog_offload_compile(struct bpf_prog *prog) return bpf_prog_offload_translate(prog); } +struct ns_get_path_bpf_prog_args { + struct bpf_prog *prog; + struct bpf_prog_info *info; +}; + +static struct ns_common *bpf_prog_offload_info_fill_ns(void *private_data) +{ + struct ns_get_path_bpf_prog_args *args = private_data; + struct bpf_prog_aux *aux = args->prog->aux; + struct ns_common *ns; + struct net *net; + + rtnl_lock(); + down_read(_devs_lock); + + if (aux->offload) { + args->info->ifindex = aux->offload->netdev->ifindex; + net = dev_net(aux->offload->netdev); + get_net(net); + ns = >ns; + } else { + args->info->ifindex = 0; + ns = NULL; + } + + up_read(_devs_lock); + rtnl_unlock(); + + return ns; +} + +int bpf_prog_offload_info_fill(struct bpf_prog_info *info, + struct bpf_prog *prog) +{ + struct ns_get_path_bpf_prog_args args = { + .prog = prog, + .info = info, + }; + struct inode *ns_inode; + struct path ns_path; + void *res; + + res = ns_get_path_cb(_path, bpf_prog_offload_info_fill_ns, ); + if (IS_ERR(res)) { + if (!info->ifindex) + return -ENODEV; + return PTR_ERR(res); + } + + ns_inode = ns_path.dentry->d_inode; + info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev); + info->netns_ino = ns_inode->i_ino; + path_put(_path); + + return 0; +} + const struct bpf_prog_ops bpf_offload_prog_ops = { }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7d9f5b0f0e49..20444fd678d0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1624,6 +1624,12 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, return -EFAULT; } + if (bpf_prog_is_dev_bound(prog->aux)) { + err = bpf_prog_offload_info_fill(, prog); + if (err) + return err; + } + done: if (copy_to_user(uinfo, , info_len) || put_user(info_len, >info.info_len)) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index db1b0923a308..4e8c60acfa32 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -921,6 +921,9 @@ struct bpf_prog_info { __u32 nr_map_ids; __aligned_u64 map_ids; char name[BPF_OBJ_NAME_LEN]; + __u32 ifindex; + __u64 netns_dev; + __u64 netns_ino; } __attribute__((aligned(8))); struct bpf_map_info { -- 2.15.1
[PATCH bpf-next v3 6/9] nsfs: generalize ns_get_path() for path resolution with a task
ns_get_path() takes struct task_struct and proc_ns_ops as its parameters. For path resolution directly from a namespace, e.g. based on a networking device's net name space, we need more flexibility. Add a ns_get_path_cb() helper which will allow callers to use any method of obtaining the name space reference. Convert ns_get_path() to use ns_get_path_cb(). Following patches will bring a networking user. CC: Eric W. BiedermanSuggested-by: Daniel Borkmann Signed-off-by: Jakub Kicinski --- fs/nsfs.c | 29 ++--- include/linux/proc_ns.h | 3 +++ 2 files changed, 29 insertions(+), 3 deletions(-) diff --git a/fs/nsfs.c b/fs/nsfs.c index 7c6f76d29f56..36b0772701a0 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -103,14 +103,14 @@ static void *__ns_get_path(struct path *path, struct ns_common *ns) goto got_it; } -void *ns_get_path(struct path *path, struct task_struct *task, - const struct proc_ns_operations *ns_ops) +void *ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb, +void *private_data) { struct ns_common *ns; void *ret; again: - ns = ns_ops->get(task); + ns = ns_get_cb(private_data); if (!ns) return ERR_PTR(-ENOENT); @@ -120,6 +120,29 @@ void *ns_get_path(struct path *path, struct task_struct *task, return ret; } +struct ns_get_path_task_args { + const struct proc_ns_operations *ns_ops; + struct task_struct *task; +}; + +static struct ns_common *ns_get_path_task(void *private_data) +{ + struct ns_get_path_task_args *args = private_data; + + return args->ns_ops->get(args->task); +} + +void *ns_get_path(struct path *path, struct task_struct *task, + const struct proc_ns_operations *ns_ops) +{ + struct ns_get_path_task_args args = { + .ns_ops = ns_ops, + .task = task, + }; + + return ns_get_path_cb(path, ns_get_path_task, ); +} + int open_related_ns(struct ns_common *ns, struct ns_common *(*get_ns)(struct ns_common *ns)) { diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index 2ff18c9840a7..d31cb6215905 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -78,6 +78,9 @@ extern struct file *proc_ns_fget(int fd); #define get_proc_ns(inode) ((struct ns_common *)(inode)->i_private) extern void *ns_get_path(struct path *path, struct task_struct *task, const struct proc_ns_operations *ns_ops); +typedef struct ns_common *ns_get_path_helper_t(void *); +extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, + void *private_data); extern int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops); -- 2.15.1
[PATCH bpf-next v3 8/9] tools: bpftool: report device information for offloaded programs
Print the just-exposed device information about device to which program is bound. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet --- tools/bpf/bpftool/common.c | 52 ++ tools/bpf/bpftool/main.h | 2 ++ tools/bpf/bpftool/prog.c | 3 +++ 3 files changed, 57 insertions(+) diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c index b62c94e3997a..6601c95a9258 100644 --- a/tools/bpf/bpftool/common.c +++ b/tools/bpf/bpftool/common.c @@ -44,7 +44,9 @@ #include #include #include +#include #include +#include #include #include @@ -412,3 +414,53 @@ void delete_pinned_obj_table(struct pinned_obj_table *tab) free(obj); } } + +static char * +ifindex_to_name_ns(__u32 ifindex, __u32 ns_dev, __u32 ns_ino, char *buf) +{ + struct stat st; + int err; + + err = stat("/proc/self/ns/net", ); + if (err) { + p_err("Can't stat /proc/self: %s", strerror(errno)); + return NULL; + } + + if (st.st_dev != ns_dev || st.st_ino != ns_ino) + return NULL; + + return if_indextoname(ifindex, buf); +} + +void print_dev_plain(__u32 ifindex, __u64 ns_dev, __u64 ns_inode) +{ + char name[IF_NAMESIZE]; + + if (!ifindex) + return; + + printf(" dev "); + if (ifindex_to_name_ns(ifindex, ns_dev, ns_inode, name)) + printf("%s", name); + else + printf("ifindex %u ns_dev %llu ns_ino %llu", + ifindex, ns_dev, ns_inode); +} + +void print_dev_json(__u32 ifindex, __u64 ns_dev, __u64 ns_inode) +{ + char name[IF_NAMESIZE]; + + if (!ifindex) + return; + + jsonw_name(json_wtr, "dev"); + jsonw_start_object(json_wtr); + jsonw_uint_field(json_wtr, "ifindex", ifindex); + jsonw_uint_field(json_wtr, "ns_dev", ns_dev); + jsonw_uint_field(json_wtr, "ns_inode", ns_inode); + if (ifindex_to_name_ns(ifindex, ns_dev, ns_inode, name)) + jsonw_string_field(json_wtr, "ifname", name); + jsonw_end_object(json_wtr); +} diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h index 8f6d3cac0347..65b526fe6e7e 100644 --- a/tools/bpf/bpftool/main.h +++ b/tools/bpf/bpftool/main.h @@ -96,6 +96,8 @@ struct pinned_obj { int build_pinned_obj_table(struct pinned_obj_table *table, enum bpf_obj_type type); void delete_pinned_obj_table(struct pinned_obj_table *tab); +void print_dev_plain(__u32 ifindex, __u64 ns_dev, __u64 ns_inode); +void print_dev_json(__u32 ifindex, __u64 ns_dev, __u64 ns_inode); struct cmd { const char *cmd; diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 037484ceaeaf..4ccf6301f0fe 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -230,6 +230,8 @@ static void print_prog_json(struct bpf_prog_info *info, int fd) info->tag[0], info->tag[1], info->tag[2], info->tag[3], info->tag[4], info->tag[5], info->tag[6], info->tag[7]); + print_dev_json(info->ifindex, info->netns_dev, info->netns_ino); + if (info->load_time) { char buf[32]; @@ -287,6 +289,7 @@ static void print_prog_plain(struct bpf_prog_info *info, int fd) printf("tag "); fprint_hex(stdout, info->tag, BPF_TAG_SIZE, ""); + print_dev_plain(info->ifindex, info->netns_dev, info->netns_ino); printf("\n"); if (info->load_time) { -- 2.15.1
[PATCH bpf-next v3 2/9] bpf: offload: don't use prog->aux->offload as boolean
We currently use aux->offload to indicate that program is bound to a specific device. This forces us to keep the offload structure around even after the device is gone. Add a bool member to struct bpf_prog_aux to indicate if offload was requested. Suggested-by: Alexei StarovoitovSigned-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet Acked-by: Alexei Starovoitov --- include/linux/bpf.h | 3 ++- kernel/bpf/syscall.c | 4 +++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index da54ef644fcd..838eee10e979 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -201,6 +201,7 @@ struct bpf_prog_aux { u32 stack_depth; u32 id; u32 func_cnt; + bool offload_requested; struct bpf_prog **func; void *jit_data; /* JIT specific data. arch dependent */ struct latch_tree_node ksym_tnode; @@ -529,7 +530,7 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr); static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux) { - return aux->offload; + return aux->offload_requested; } #else static inline int bpf_prog_offload_init(struct bpf_prog *prog, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e2e1c78ce1dc..1143db61584c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1151,6 +1151,8 @@ static int bpf_prog_load(union bpf_attr *attr) if (!prog) return -ENOMEM; + prog->aux->offload_requested = !!attr->prog_ifindex; + err = security_bpf_prog_alloc(prog->aux); if (err) goto free_prog_nouncharge; @@ -1172,7 +1174,7 @@ static int bpf_prog_load(union bpf_attr *attr) atomic_set(>aux->refcnt, 1); prog->gpl_compatible = is_gpl ? 1 : 0; - if (attr->prog_ifindex) { + if (bpf_prog_is_dev_bound(prog->aux)) { err = bpf_prog_offload_init(prog, attr); if (err) goto free_prog; -- 2.15.1
[PATCH bpf-next v3 0/9] bpf: offload: report device back to user space (take 2)
Hi! This series is a redo of reporting offload device information to user space after the first attempt did not take into account name spaces. As requested by Kirill offloads are now protected by an r/w sem. This allows us to remove the workqueue and free the offload state fully when device is removed (suggested by Alexei). Net namespace is reported with a device/inode pair. The accompanying bpftool support is placed in common code because maps will have very similar info. Note that the UAPI information can't be nicely encapsulated into a struct, because in case we need to grow the device information the new fields will have to be added at the end of struct bpf_prog_info, we can't grow structures in the middle of bpf_prog_info. v3: - use dev_get_by_index(); - redo ns code (new patch 6). v2: - rework the locking in patch 1 (use RCU instead of locking dependencies); - grab RTNL for a short time in patch 6; - minor update to the test in patch 8. Jakub Kicinski (9): bpf: offload: don't require rtnl for dev list manipulation bpf: offload: don't use prog->aux->offload as boolean bpf: offload: allow netdev to disappear while verifier is running bpf: offload: free prog->aux->offload when device disappears bpf: offload: free program id when device disappears nsfs: generalize ns_get_path() for path resolution with a task bpf: offload: report device information for offloaded programs tools: bpftool: report device information for offloaded programs selftests/bpf: test device info reporting for bound progs drivers/net/ethernet/netronome/nfp/bpf/main.h | 2 +- drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 2 +- drivers/net/netdevsim/bpf.c | 2 +- fs/nsfs.c | 29 - include/linux/bpf.h | 16 ++- include/linux/bpf_verifier.h | 16 +-- include/linux/netdevice.h | 4 +- include/linux/proc_ns.h | 3 + include/uapi/linux/bpf.h | 3 + kernel/bpf/offload.c | 147 -- kernel/bpf/syscall.c | 19 ++- kernel/bpf/verifier.c | 20 ++- tools/bpf/bpftool/common.c| 52 tools/bpf/bpftool/main.h | 2 + tools/bpf/bpftool/prog.c | 3 + tools/include/uapi/linux/bpf.h| 3 + tools/testing/selftests/bpf/test_offload.py | 112 +++-- 17 files changed, 346 insertions(+), 89 deletions(-) -- 2.15.1
Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On Wed, 27 Dec 2017 14:46:24 -0800 Alexei Starovoitovwrote: > On 12/26/17 9:56 PM, Masami Hiramatsu wrote: > > On Tue, 26 Dec 2017 17:57:32 -0800 > > Alexei Starovoitov wrote: > > > >> On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote: > >>> Check whether error injectable event is on function entry or not. > >>> Currently it checks the event is ftrace-based kprobes or not, > >>> but that is wrong. It should check if the event is on the entry > >>> of target function. Since error injection will override a function > >>> to just return with modified return value, that operation must > >>> be done before the target function starts making stackframe. > >>> > >>> As a side effect, bpf error injection is no need to depend on > >>> function-tracer. It can work with sw-breakpoint based kprobe > >>> events too. > >>> > >>> Signed-off-by: Masami Hiramatsu > >>> --- > >>> kernel/trace/Kconfig|2 -- > >>> kernel/trace/bpf_trace.c|6 +++--- > >>> kernel/trace/trace_kprobe.c |8 +--- > >>> kernel/trace/trace_probe.h | 12 ++-- > >>> 4 files changed, 14 insertions(+), 14 deletions(-) > >>> > >>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > >>> index ae3a2d519e50..6400e1bf97c5 100644 > >>> --- a/kernel/trace/Kconfig > >>> +++ b/kernel/trace/Kconfig > >>> @@ -533,9 +533,7 @@ config FUNCTION_PROFILER > >>> config BPF_KPROBE_OVERRIDE > >>> bool "Enable BPF programs to override a kprobed function" > >>> depends on BPF_EVENTS > >>> - depends on KPROBES_ON_FTRACE > >>> depends on HAVE_KPROBE_OVERRIDE > >>> - depends on DYNAMIC_FTRACE_WITH_REGS > >>> default n > >>> help > >>>Allows BPF to override the execution of a probed function and > >>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > >>> index f6d2327ecb59..d663660f8392 100644 > >>> --- a/kernel/trace/bpf_trace.c > >>> +++ b/kernel/trace/bpf_trace.c > >>> @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event > >>> *event, > >>> int ret = -EEXIST; > >>> > >>> /* > >>> - * Kprobe override only works for ftrace based kprobes, and only if they > >>> - * are on the opt-in list. > >>> + * Kprobe override only works if they are on the function entry, > >>> + * and only if they are on the opt-in list. > >>>*/ > >>> if (prog->kprobe_override && > >>> - (!trace_kprobe_ftrace(event->tp_event) || > >>> + (!trace_kprobe_on_func_entry(event->tp_event) || > >>>!trace_kprobe_error_injectable(event->tp_event))) > >>> return -EINVAL; > >>> > >>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c > >>> index 91f4b57dab82..265e3e27e8dc 100644 > >>> --- a/kernel/trace/trace_kprobe.c > >>> +++ b/kernel/trace/trace_kprobe.c > >>> @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long > >>> trace_kprobe_nhit(struct trace_kprobe *tk) > >>> return nhit; > >>> } > >>> > >>> -int trace_kprobe_ftrace(struct trace_event_call *call) > >>> +bool trace_kprobe_on_func_entry(struct trace_event_call *call) > >>> { > >>> struct trace_kprobe *tk = (struct trace_kprobe *)call->data; > >>> - return kprobe_ftrace(>rp.kp); > >>> + > >>> + return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name, > >>> + tk->rp.kp.offset); > >> > >> That would be nice, but did you test this? > > > > Yes, because the jprobe, which was only official user of modifying execution > > path using kprobe, did same way to check. (and kretprobe also does it) > > > >> My understanding that kprobe will restore all regs and > >> here we need to override return ip _and_ value. > > > > yes, no problem. kprobe restore all regs from pt_regs, including regs->ip. > > > >> Could you add a patch with the test the way Josef did > >> or describe the steps to test this new mode? > > > > Would you mean below patch? If so, it should work without any change. > > > > [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return > > yeah. I expect bpf_override_return test to work as-is. > I'm asking for the test for new functionality added by this patch. > In particular kprobe on func entry without ftrace. > How did you test it? This function is used in kretprobe and jprobe. Jprobe was the user of "modifying instruction pointer to another function" in kprobes. If it doesn't work, jprobe also doesn't work, this means you can not modify IP by kprobes anymore. Anyway, until linux-4.13, that was well tested by kprobe smoke test. > and how I can repeat the test? > I'm still not sure that it works correctly. That works correctly because it checks given address is on the entry point (the 1st instruction) of a function, using kallsyms. The reason why I made another flag for ftrace was, there are 2 modes for ftrace dynamic instrumentation, fentry and mcount. With new fentry mode, ftrace will be put on the first instruction of the function, so it will work as you
Re: pull-request: bpf-next 2017-12-28
From: Daniel BorkmannDate: Thu, 28 Dec 2017 01:18:21 +0100 > The following pull-request contains BPF updates for your *net-next* > tree. Pulled. Any progress on those tests failing on strict alignment architectures?
Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework
On Wed, 27 Dec 2017 14:49:46 -0800 Alexei Starovoitovwrote: > On 12/27/17 12:09 AM, Masami Hiramatsu wrote: > > On Tue, 26 Dec 2017 18:12:56 -0800 > > Alexei Starovoitov wrote: > > > >> On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote: > >>> Support in-kernel fault-injection framework via debugfs. > >>> This allows you to inject a conditional error to specified > >>> function using debugfs interfaces. > >>> > >>> Signed-off-by: Masami Hiramatsu > >>> --- > >>> Documentation/fault-injection/fault-injection.txt |5 + > >>> kernel/Makefile |1 > >>> kernel/fail_function.c| 169 > >>> + > >>> lib/Kconfig.debug | 10 + > >>> 4 files changed, 185 insertions(+) > >>> create mode 100644 kernel/fail_function.c > >>> > >>> diff --git a/Documentation/fault-injection/fault-injection.txt > >>> b/Documentation/fault-injection/fault-injection.txt > >>> index 918972babcd8..6243a588dd71 100644 > >>> --- a/Documentation/fault-injection/fault-injection.txt > >>> +++ b/Documentation/fault-injection/fault-injection.txt > >>> @@ -30,6 +30,11 @@ o fail_mmc_request > >>>injects MMC data errors on devices permitted by setting > >>>debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request > >>> > >>> +o fail_function > >>> + > >>> + injects error return on specific functions by setting debugfs entries > >>> + under /sys/kernel/debug/fail_function. No boot option supported. > >> > >> I like it. > >> Could you document it a bit better? > > > > Yes, I will do in next series. > > > >> In particular retval is configurable, but without an example no one > >> will be able to figure out how to use it. > > > > Ah, right. BTW, as I pointed in the covermail, should we store the > > expected error value range into the injectable list? e.g. > > > > ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO) > > > > And provide APIs to check/get it. > > I'm afraid such check would be too costly. > Right now we have only two functions marked but I expect hundreds more > will be added in the near future as soon as developers realize the > potential of such error injection. > All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data. > Multiple by 1k and we have 8k of data spent on marks. > If we add max/min range marks that doubles it for very little use. > I think marking function only is enough. Sorry, I don't think so. Even if it takes 16 bytes more for each points, I don't think it is any overhead for machines in these days. Even if so, we can provide a kconfig to reduce it. I mean, we are living in GB-order memory are, and it will be bigger in the future. Why we have to worry about hundreds of 16bytes memory pieces? It will take a few KB, and even if we mark thousands of functions, it never reaches 1MB, in GB memory pool. :) Of course, for many small-footprint embedded devices (like having less than 128MB memory), this feature can be a overhead. But they can cut off the table by kconfig. Thank you, -- Masami Hiramatsu
Re: [RFC PATCH net-next] net: hns3: hns3_get_channels() can be static
From: kbuild test robotDate: Thu, 28 Dec 2017 09:09:59 +0800 > > Fixes: 482d2e9c1cc7 ("net: hns3: add support to query tqps number") > Signed-off-by: Fengguang Wu Applied, thanks.
Re: [PATCH net-next] nfp: bpf: allocate vNIC priv for keeping track of the offloaded program
From: Jakub KicinskiDate: Wed, 27 Dec 2017 15:36:49 -0800 > After TC offloads were converted to callbacks we have no choice > but keep track of the offloaded filter in the driver. > > Since this change came a little late in the release cycle > there were a number of conflicts and allocation of vNIC priv > structure seems to have slipped away in linux-next. > > Signed-off-by: Jakub Kicinski Oh well, I thought I resolved the merge conflict properly in this case. Apparently not :-) Applied, thanks Jakub.
Re: pull-request: bpf 2017-12-28
From: Daniel BorkmannDate: Thu, 28 Dec 2017 00:27:23 +0100 > The following pull-request contains BPF updates for your *net* tree. > > The main changes are: > > 1) Two small fixes for bpftool. Fix otherwise broken output if any of >the system calls failed when listing maps in json format and instead >of bailing out, skip maps or progs that disappeared between fetching >next id and getting an fd for that id, both from Jakub. > > 2) Small fix in BPF selftests to respect LLC passed from command line >when testing for -mcpu=probe presence, from Quentin. > > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Pulled, thanks Daniel.
Re: WARNING in strp_data_ready
On Wed, Dec 27, 2017 at 12:20 PM, Ozgurwrote: > > > 27.12.2017, 23:14, "Dmitry Vyukov" : >> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur wrote: >>> 27.12.2017, 22:21, "Dmitry Vyukov" : On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert wrote: > Did you try the patch I posted? Hi Tom, >>> >>> Hello Dmitry, >>> No. And I didn't know I need to. Why? If you think the patch needs additional testing, you can ask syzbot to test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot Otherwise proceed with committing it. Or what are we waiting for? Thanks >>> >>> I think we need to fixed patch for crash, in fact check to patch code and >>> test solve the bug. >>> How do test it because there is no patch in the following bug? >> >> Hi Ozgur, >> >> I am not sure I completely understand what you mean. But the >> reproducer for this bug (which one can use for testing) is here: >> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY >> Tom also mentions there is some patch for this, but I don't know where >> it is, it doesn't seem to be referenced from this thread. > > Hello Dmitry, > > Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :) > I think Tom send patch to only you and are you tested? > > kcmsock.c will change and strp_data_ready I think locked. > > Tom, please send a patch for me? I can test and inform you. > Hi Ozgur, I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you can! Thanks, Tom > Regards > > Ozgur > >>> The fix patch should be for this net/kcm/kcmsock.c file and lock functions >>> must be added calling sk_data_ready (). >>> Regards >>> >>> Ozgur >>> > On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov > wrote: >> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov >> wrote: wrote: > On 10/24/2017 08:20 AM, syzbot wrote: >> Hello, >> >> syzkaller hit the following crash on >> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb >> >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master >> compiler: gcc (GCC) 7.1.1 20170620 >> .config is attached >> Raw console output is attached. >> C reproducer is attached >> syzkaller reproducer is attached. See https://goo.gl/kgGztJ >> for information about syzkaller reproducers >> >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 >> sock_owned_by_me include/net/sock.h:1505 [inline] >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 >> sock_owned_by_user include/net/sock.h:1511 [inline] >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 >> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >> Kernel panic - not syncing: panic_on_warn set ... >> >> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 >> Hardware name: Google Google Compute Engine/Google Compute Engine, >> BIOS Google 01/01/2011 >> Call Trace: >> >>__dump_stack lib/dump_stack.c:16 [inline] >>dump_stack+0x194/0x257 lib/dump_stack.c:52 >>panic+0x1e4/0x417 kernel/panic.c:181 >>__warn+0x1c4/0x1d9 kernel/panic.c:542 >>report_bug+0x211/0x2d0 lib/bug.c:183 >>fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 >>do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] >>do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 >>do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 >>do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 >>invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 >> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] >> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] >> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >> RSP: 0018:8801db206b18 EFLAGS: 00010206 >> RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: >> RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0 >> RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd >> R10: 8801db206a00 R11: dc00 R12: 8801d29fa000 >> R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001 >>psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 > > Looks like KCM is calling sk_data_ready() without first taking the > sock lock. > > /* Called with lower sock held */ > static void kcm_rcv_strparser(struct
[PATCH RFC 0/2] kcm: Fix lockdep issue
When sock_owned_by_user returns true in strparser. Fix is to add and call sock_owned_by_user_nocheck since the check for owned by user is not an error condition in this case. Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages") Reported-by: syzbotTom Herbert (2): sock: Add sock_owned_by_user_nocheck strparser: Call sock_owned_by_user_nocheck include/net/sock.h| 5 + net/strparser/strparser.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) -- 2.11.0
[PATCH RFC 1/2] sock: Add sock_owned_by_user_nocheck
This allows checking socket lock ownership with producing lockdep warnings. Signed-off-by: Tom Herbert--- include/net/sock.h | 5 + 1 file changed, 5 insertions(+) diff --git a/include/net/sock.h b/include/net/sock.h index 6c1db823f8b9..66fd3951e6f3 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1515,6 +1515,11 @@ static inline bool sock_owned_by_user(const struct sock *sk) return sk->sk_lock.owned; } +static inline bool sock_owned_by_user_nocheck(const struct sock *sk) +{ + return sk->sk_lock.owned; +} + /* no reclassification while locks are held */ static inline bool sock_allow_reclassification(const struct sock *csk) { -- 2.11.0
[PATCH RFC 2/2] strparser: Call sock_owned_by_user_nocheck
strparser wants to check socket ownership without producing any warnings. As indicated by the comment in the code, it is permissible for owned_by_user to return true. Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages") Reported-by: syzbotSigned-off-by: Tom Herbert --- net/strparser/strparser.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c index c5fda15ba319..1fdab5c4eda8 100644 --- a/net/strparser/strparser.c +++ b/net/strparser/strparser.c @@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp) * allows a thread in BH context to safely check if the process * lock is held. In this case, if the lock is held, queue work. */ - if (sock_owned_by_user(strp->sk)) { + if (sock_owned_by_user_nocheck(strp->sk)) { queue_work(strp_wq, >work); return; } -- 2.11.0
[net-next:master 731/763] drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c:852:6: sparse: symbol 'hns3_get_channels' was not declared. Should it be static?
tree: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master head: 55b07a65e15bea6e253a907dacaf89b61fe504ca commit: 482d2e9c1cc7c0e154464e3e052db09e5e62541f [731/763] net: hns3: add support to query tqps number reproduce: # apt-get install sparse git checkout 482d2e9c1cc7c0e154464e3e052db09e5e62541f make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) Please review and possibly fold the followup patch. --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[RFC PATCH net-next] net: hns3: hns3_get_channels() can be static
Fixes: 482d2e9c1cc7 ("net: hns3: add support to query tqps number") Signed-off-by: Fengguang Wu--- hns3_ethtool.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 2ae4d397..379c01d 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -876,8 +876,8 @@ static int hns3_nway_reset(struct net_device *netdev) return genphy_restart_aneg(phy); } -void hns3_get_channels(struct net_device *netdev, - struct ethtool_channels *ch) +static void hns3_get_channels(struct net_device *netdev, + struct ethtool_channels *ch) { struct hnae3_handle *h = hns3_get_handle(netdev);
Re: [PATCH 1/4] libbpf: add function to setup XDP
On 2017/12/28 3:02, Eric Leblond wrote: > Most of the code is taken from set_link_xdp_fd() in bpf_load.c and > slightly modified to be library compliant. > > Signed-off-by: Eric Leblond> --- ... > +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) ... > + if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) { > + ret = -errno; > + goto cleanup; > + } > + > + addrlen = sizeof(sa); > + if (getsockname(sock, (struct sockaddr *), ) < 0) { > + ret = errno; forgot to prepend '-'? > + goto cleanup; > + } > + > + if (addrlen != sizeof(sa)) { > + ret = errno; errno is not set? > + goto cleanup; > + } -- Toshiaki Makita
Re: [PATCHv3 0/2] capability controlled user-namespaces
On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages)wrote: > Hello Mahesh, > > On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार) > wrote: >> Hello James, >> >> Seems like I missed your name to be added into the review of this >> patch series. Would you be willing be pull this into the security >> tree? Serge Hallyn has already ACKed it. > > We seem to have no formal documentation/specification of this feature. > I think that should be written up before this patch goes into > mainline... > absolutely. I have added enough information into the Documentation dir relevant to this feature (please look at the individual patches), that could be used. I could help if needed. thanks, --mahesh.. > Cheers, > > Michael > > >> >> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar wrote: >>> From: Mahesh Bandewar >>> >>> TL;DR version >>> - >>> Creating a sandbox environment with namespaces is challenging >>> considering what these sandboxed processes can engage into. e.g. >>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few. >>> Current form of user-namespaces, however, if changed a bit can allow >>> us to create a sandbox environment without locking down user- >>> namespaces. >>> >>> Detailed version >>> >>> >>> Problem >>> --- >>> User-namespaces in the current form have increased the attack surface as >>> any process can acquire capabilities which are not available to them (by >>> default) by performing combination of clone()/unshare()/setns() syscalls. >>> >>> #define _GNU_SOURCE >>> #include >>> #include >>> #include >>> >>> int main(int ac, char **av) >>> { >>> int sock = -1; >>> >>> printf("Attempting to open RAW socket before unshare()...\n"); >>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); >>> if (sock < 0) { >>> perror("socket() SOCK_RAW failed: "); >>> } else { >>> printf("Successfully opened RAW-Sock before unshare().\n"); >>> close(sock); >>> sock = -1; >>> } >>> >>> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) { >>> perror("unshare() failed: "); >>> return 1; >>> } >>> >>> printf("Attempting to open RAW socket after unshare()...\n"); >>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); >>> if (sock < 0) { >>> perror("socket() SOCK_RAW failed: "); >>> } else { >>> printf("Successfully opened RAW-Sock after unshare().\n"); >>> close(sock); >>> sock = -1; >>> } >>> >>> return 0; >>> } >>> >>> The above example shows how easy it is to acquire NET_RAW capabilities >>> and once acquired, these processes could take benefit of above mentioned >>> or similar issues discovered/undiscovered with malicious intent. Note >>> that this is just an example and the problem/solution is not limited >>> to NET_RAW capability *only*. >>> >>> The easiest fix one can apply here is to lock-down user-namespaces which >>> many of the distros do (i.e. don't allow users to create user namespaces), >>> but unfortunately that prevents everyone from using them. >>> >>> Approach >>> >>> Introduce a notion of 'controlled' user-namespaces. Every process on >>> the host is allowed to create user-namespaces (governed by the limit >>> imposed by per-ns sysctl) however, mark user-namespaces created by >>> sandboxed processes as 'controlled'. Use this 'mark' at the time of >>> capability check in conjunction with a global capability whitelist. >>> If the capability is not whitelisted, processes that belong to >>> controlled user-namespaces will not be allowed. >>> >>> Once a user-ns is marked as 'controlled'; all its child user- >>> namespaces are marked as 'controlled' too. >>> >>> A global whitelist is list of capabilities governed by the >>> sysctl which is available to (privileged) user in init-ns to modify >>> while it's applicable to all controlled user-namespaces on the host. >>> >>> Marking user-namespaces controlled without modifying the whitelist is >>> equivalent of the current behavior. The default value of whitelist includes >>> all capabilities so that the compatibility is maintained. However it gives >>> admins fine-grained ability to control various capabilities system wide >>> without locking down user-namespaces. >>> >>> Please see individual patches in this series. >>> >>> Mahesh Bandewar (2): >>> capability: introduce sysctl for controlled user-ns capability whitelist >>> userns: control capabilities of some user namespaces >>> >>> Documentation/sysctl/kernel.txt | 21 + >>> include/linux/capability.h | 7 ++ >>> include/linux/user_namespace.h | 25 >>> kernel/capability.c | 52 >>> + >>>
pull-request: bpf-next 2017-12-28
Hi David, The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Fix incorrect state pruning related to recognition of zero initialized stack slots, where stacksafe exploration would mistakenly return a positive pruning verdict too early ignoring other slots, from Gianluca. 2) Various BPF to BPF calls related follow-up fixes. Fix an off-by-one in maximum call depth check, and rework maximum stack depth tracking logic to fix a bypass of the total stack size check reported by Jann. Also fix a bug in arm64 JIT where prog->jited_len was uninitialized. Addition of various test cases to BPF selftests, from Alexei. 3) Addition of a BPF selftest to test_verifier that is related to BPF to BPF calls which demonstrates a late caller stack size increase and thus out of bounds access. Fixed above in 2). Test case from Jann. 4) Addition of correlating BPF helper calls, BPF to BPF calls as well as BPF maps to bpftool xlated dump in order to allow for better BPF program introspection and debugging, from Daniel. 5) Fixing several bugs in BPF to BPF calls kallsyms handling in order to get it actually to work for subprogs, from Daniel. 6) Extending sparc64 JIT support for BPF to BPF calls and fix a couple of build errors for libbpf on sparc64, from David. 7) Allow narrower context access for BPF dev cgroup typed programs in order to adapt to LLVM code generation. Also adjust memlock rlimit in the test_dev_cgroup BPF selftest, from Yonghong. 8) Add netdevsim Kconfig entry to BPF selftests since test_offload.py relies on netdevsim device being available, from Jakub. 9) Reduce scope of xdp_do_generic_redirect_map() to being static, from Xiongwei. 10) Minor cleanups and spelling fixes in BPF verifier, from Colin. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot & have a happy new year! The following changes since commit 962b582785b60a2b420b0636ad762959c72406f6: cxgb4: Simplify PCIe Completion Timeout setting (2017-12-18 15:12:57 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git for you to fetch changes up to 624588d9d6cc0a1a270a65fb4d5220f1ceddcf38: Merge branch 'bpf-stack-depth-tracking-fixes' (2017-12-27 18:36:24 +0100) Alexei Starovoitov (5): bpf: arm64: fix uninitialized variable Merge branch 'bpftool-improvements-kallsymfix' bpf: fix maximum stack depth tracking logic selftests/bpf: additional stack depth tests bpf: fix max call depth check Colin Ian King (2): bpf: fix spelling mistake: "funcation"-> "function" bpf: make function skip_callee static and return NULL rather than 0 Daniel Borkmann (3): bpf: fix kallsyms handling for subprogs bpf: allow for correlation of maps and helpers in dump Merge branch 'bpf-stack-depth-tracking-fixes' David Miller (2): libbpf: Fix build errors. bpf: sparc64: Add JIT support for multi-function programs. Gianluca Borello (1): bpf: fix stacksafe exploration when comparing states Jakub Kicinski (1): selftests/bpf: add netdevsim to config Jann Horn (1): bpf: selftest for late caller stack size increase Xiongwei Song (1): bpf: make function xdp_do_generic_redirect_map() static Yonghong Song (2): bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog tools/bpf: adjust rlimit RLIMIT_MEMLOCK for test_dev_cgroup arch/arm64/net/bpf_jit_comp.c | 1 + arch/sparc/net/bpf_jit_comp_64.c | 44 - include/linux/bpf_verifier.h | 1 + include/linux/filter.h| 9 + include/uapi/linux/bpf.h | 3 +- kernel/bpf/cgroup.c | 15 +- kernel/bpf/core.c | 4 +- kernel/bpf/disasm.c | 65 +-- kernel/bpf/disasm.h | 29 +++- kernel/bpf/syscall.c | 93 +- kernel/bpf/verifier.c | 126 +++--- net/core/filter.c | 5 +- tools/bpf/bpftool/prog.c | 181 ++- tools/lib/bpf/libbpf.c| 5 +- tools/testing/selftests/bpf/config| 1 + tools/testing/selftests/bpf/test_dev_cgroup.c | 9 +- tools/testing/selftests/bpf/test_verifier.c | 241 ++ 17 files changed, 764 insertions(+), 68 deletions(-)
Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
On 12/27/2017 10:29 AM, Cong Wang wrote: > On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend >wrote: >> On 12/22/2017 12:31 PM, Cong Wang wrote: >>> I understand why you had it, but it is just not safe. You don't want >>> to achieve performance gain by crashing system, right? >> >> huh? So my point is the patch you submit here is not a >> real fix but a work around. To peek the head of a consumer/producer ring >> without a lock, _should_ be fine. This _should_ work as well with >> consumer or producer operations happening at the same time. After some >> digging the issue is in the ptr_ring code. > > > The comments disagree with you: > > /* Might be slightly faster than skb_array_empty below, but only safe if the > * array is never resized. Also, callers invoking this in a loop must take > care > * to use a compiler barrier, for example cpu_relax(). > */ > > If the comments are right, you miss a barrier here too. > The comment talks about resizing arrays not consumer/producer operations so I think it is OK. We don't call skb_array_empty on the same skb_array in a loop here, its different skb arrays. So probably don't need a barrier. > >> >> The peek code (what empty check calls) is the following, >> >> static inline void *__ptr_ring_peek(struct ptr_ring *r) >> { >> if (likely(r->size)) >> return r->queue[r->consumer_head]; >> return NULL; >> } >> >> So what the splat is detecting is consumer head being 'out of bounds'. >> This happens because ptr_ring_discard_one increments the consumer_head >> and then checks to see if it overran the array size. If above peek >> happens after the increment, but before the size check we get the >> splat. There are two ways, as far as I can see, to fix this. First >> do the check before incrementing the consumer head. Or the easier >> fix, >> >> --- a/include/linux/ptr_ring.h >> +++ b/include/linux/ptr_ring.h >> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct >> ptr_ring *r, >> >> static inline void **__ptr_ring_init_queue_alloc(unsigned int size, >> gfp_t gfp) >> { >> - return kcalloc(size, sizeof(void *), gfp); >> + return kcalloc(size + 1, sizeof(void *), gfp); >> } >> >> With Jakub's help (Thanks!) I was able to reproduce the original splat >> and also verify the above removes it. >> >> To be clear "resizing" a skb_array only refers to changing the actual >> array size not adding/removing elements. > > I never look into the implementation, just simply trust the comments. > > At least the comments above __skb_array_empty() need to improve. > > >> >>> Although its not logical IMO to have both reset and dequeue running at the same time. Some skbs would get through others would get sent, sort of a mess. I don't see how it can be an issue. The never resized bit in the documentation is referring to resizing the ring size _not_ popping off elements of the ring. array_empty just reads the consumer head. The only ring resizing in pfifo fast should be at init and destroy where enqueue/dequeue should be disconnected by then. Although based on the trace I missed a case. >>> >>> >>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call >>> skb_array_consume_bh(), so there is no difference w.r.t. resizing. >>> >> >> Sorry not following. >> >>> And ->reset() is called in qdisc_graft() too. Let's say we have >>> htb+pfifo_fast, >>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast, >>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue() >>> concurrently. >> >> Yes and this _should_ be perfectly fine for pfifo_fast. I'm wondering >> though if this API can be cleaned up. What are the paths that do a reset >> without a destroy.. Do we really need to have this pattern where reset >> is called then later destroy. Seems destroy could do the entire cleanup >> and this would simplify things. None of this has to do with the splat >> though. > > I don't follow your point any more.> > We are talking about ->reset() race with ->dequeue() which is the > cause of the bug, right?> Yes. But it should be OK for qdiscs to have reset and dequeue run in parallel. So the bug is in ptr_ring. > If you expect ->reset() runs in parallel with ->dequeue() for pfifo_fast, > why did you even mention synchronize_net() from the beginning??? A mistake probably I am tracking two separate issues. First this one with skb_array out of bounds access and the tx_action being called while destroy is in progress. When I was first looking at this I thought the splat was from the tx_action issue so probably started talking about how to sync destroy and tx_action paths. Either way it was a mistake to mention in this thread. > Also you changed the code too, to adjust rcu grace period. > > >> >>> >>> I think the right fix is to only call reset/destroy patterns after waiting a grace period and for all tx_action
[PATCH net-next] nfp: bpf: allocate vNIC priv for keeping track of the offloaded program
After TC offloads were converted to callbacks we have no choice but keep track of the offloaded filter in the driver. Since this change came a little late in the release cycle there were a number of conflicts and allocation of vNIC priv structure seems to have slipped away in linux-next. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/bpf/main.c | 30 ++- 1 file changed, 29 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 214b02a3acdd..4b63167906ca 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -84,6 +84,33 @@ static const char *nfp_bpf_extra_cap(struct nfp_app *app, struct nfp_net *nn) return nfp_net_ebpf_capable(nn) ? "BPF" : ""; } +static int +nfp_bpf_vnic_alloc(struct nfp_app *app, struct nfp_net *nn, unsigned int id) +{ + int err; + + nn->app_priv = kzalloc(sizeof(struct nfp_bpf_vnic), GFP_KERNEL); + if (!nn->app_priv) + return -ENOMEM; + + err = nfp_app_nic_vnic_alloc(app, nn, id); + if (err) + goto err_free_priv; + + return 0; +err_free_priv: + kfree(nn->app_priv); + return err; +} + +static void nfp_bpf_vnic_free(struct nfp_app *app, struct nfp_net *nn) +{ + struct nfp_bpf_vnic *bv = nn->app_priv; + + WARN_ON(bv->tc_prog); + kfree(bv); +} + static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type, void *type_data, void *cb_priv) { @@ -286,7 +313,8 @@ const struct nfp_app_type app_bpf = { .extra_cap = nfp_bpf_extra_cap, - .vnic_alloc = nfp_app_nic_vnic_alloc, + .vnic_alloc = nfp_bpf_vnic_alloc, + .vnic_free = nfp_bpf_vnic_free, .setup_tc = nfp_bpf_setup_tc, .tc_busy= nfp_bpf_tc_busy, -- 2.15.1
pull-request: bpf 2017-12-28
Hi David, The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Two small fixes for bpftool. Fix otherwise broken output if any of the system calls failed when listing maps in json format and instead of bailing out, skip maps or progs that disappeared between fetching next id and getting an fd for that id, both from Jakub. 2) Small fix in BPF selftests to respect LLC passed from command line when testing for -mcpu=probe presence, from Quentin. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! The following changes since commit c50b7c473f609189da3bccd28ee5dcf3b55109cd: Merge branch 'net-zerocopy-fixes' (2017-12-21 15:00:59 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git for you to fetch changes up to aee657460a8ce66443d5e7413046d02d7b2165db: Merge branch 'bpf-bpftool-various-fixes' (2017-12-23 01:09:53 +0100) Daniel Borkmann (1): Merge branch 'bpf-bpftool-various-fixes' Jakub Kicinski (2): tools: bpftool: maps: close json array on error paths of show tools: bpftool: protect against races with disappearing objects Quentin Monnet (1): selftests/bpf: fix Makefile for passing LLC to the command line tools/bpf/bpftool/map.c | 8 +--- tools/bpf/bpftool/prog.c | 2 ++ tools/testing/selftests/bpf/Makefile | 2 +- 3 files changed, 8 insertions(+), 4 deletions(-)
Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
On Wed, Dec 27, 2017 at 11:42:52PM +0100, Antoine Tenart wrote: > Hi Russell, > > On Wed, Dec 27, 2017 at 10:24:01PM +, Russell King - ARM Linux wrote: > > On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote: > > > > > > +_eth2 { > > > + /* CPS Lane 5 */ > > > + status = "okay"; > > > + phy-mode = "2500base-x"; > > > + /* Generic PHY, providing serdes lanes */ > > > + phys = <_comphy5 2>; > > > +}; > > > + > > > > This is wrong. This lane is connected to a SFP cage which can support > > more than just 2500base-X. Tying it in this way to 2500base-X means > > that this port does not support conenctions at 1000base-X, despite > > that's one of the most popular and more standardised speeds. > > What do you suggest to describe this in the dt, to enable a port using > the current PPv2 driver? I don't - I'm merely pointing out that you're bodging support for the SFP cage rather than productively discussing phylink for mvpp2. As far as I remember, the discussion stalled at this point: - You think there's modes that mvpp2 supports that are not supportable if you use phylink. - I've described what phylink supports, and I've asked you for details about what you can't support. Unfortunately, no details have been forthcoming, and no further discussion has occurred - the ball is entirely in your court to progress this issue since I requested information from you and that is where things seem to have stalled. The result is that, with your patch, you're locking the port to 2.5G speeds, meaning that only 4.3Mbps Fibrechannel SFPs can be used with the port, and it can only be used with another device that supports 2.5G speeds. You can't use a copper RJ45 module, and you can't use a standard 1000base-X module either in this configuration. What I'm most concerned about, given the bindings for comphy that have been merged, is that Free Electrons is pushing forward seemingly with no regard to the requirement that the serdes lanes are dynamically reconfigurable, and that's a basic requirement for SFP, and for the 88x3310 PHYs on the Macchiatobin platform. So, my question to you is: what is Free Electrons plans to properly support the ethernet ports on the Macchiatobin platform? For those on the Cc list who don't know, phylink is part of full support for SFP and SFP+ cages, sponsored (in terms of hardware including SFP modules) by SolidRun on both SolidRun's Clearfog and Macchiatobin platforms, supporting a wide range of SFP modules including: - 1G Optical ethernet modules (duplex and bidi modules) - 10/100/1G RJ45 modules - 10G SFP+ modules - 2.5Gbase-X using 4.3Mbps Fibrechannel modules - Direct attach cables There is work ongoing between Florian, Andrew and myself to switch DSA to phylink, and have SFP modules working with both Marvell and Broadcom DSA switches. Phylink and SFP was already merged into mainline, and has been usable (provided that the network driver is converted to phylink rather than phylib) since 4.14-rc1. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
Re: [PATCH bpf-next v2 6/8] bpf: offload: report device information for offloaded programs
On Wed, 27 Dec 2017 18:26:16 +0100, Daniel Borkmann wrote: > On 12/21/2017 10:01 PM, Jakub Kicinski wrote: > > Report to the user ifindex and namespace information of offloaded > > programs. If device has disappeared return -ENODEV. Specify the > > namespace using dev/inode combination. > > > > CC: Eric W. Biederman> > Signed-off-by: Jakub Kicinski > > Reviewed-by: Quentin Monnet > > --- > > v2: > > - take RTNL lock to grab a coherent snapshot of device state > >(ifindex vs name space) and avoid races with name space > >moves (based on Eric's comment on Kirill's patch to > >peernet2id_alloc()). > > --- > > fs/nsfs.c | 2 +- > > include/linux/bpf.h| 2 ++ > > include/linux/proc_ns.h| 1 + > > include/uapi/linux/bpf.h | 3 +++ > > kernel/bpf/offload.c | 44 > > ++ > > kernel/bpf/syscall.c | 6 ++ > > tools/include/uapi/linux/bpf.h | 3 +++ > > 7 files changed, 60 insertions(+), 1 deletion(-) > > > > diff --git a/fs/nsfs.c b/fs/nsfs.c > > index 7c6f76d29f56..e50628675935 100644 > > --- a/fs/nsfs.c > > +++ b/fs/nsfs.c > > @@ -51,7 +51,7 @@ static void nsfs_evict(struct inode *inode) > > ns->ops->put(ns); > > } > > > > -static void *__ns_get_path(struct path *path, struct ns_common *ns) > > +void *__ns_get_path(struct path *path, struct ns_common *ns) > > { > > struct vfsmount *mnt = nsfs_mnt; > > struct dentry *dentry; > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 9a916ab34299..7810ae57b357 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -531,6 +531,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 > > ufd, > > > > int bpf_prog_offload_compile(struct bpf_prog *prog); > > void bpf_prog_offload_destroy(struct bpf_prog *prog); > > +int bpf_prog_offload_info_fill(struct bpf_prog_info *info, > > + struct bpf_prog *prog); > > > > #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL) > > int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr); > > diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h > > index 2ff18c9840a7..1733359cf713 100644 > > --- a/include/linux/proc_ns.h > > +++ b/include/linux/proc_ns.h > > @@ -76,6 +76,7 @@ static inline int ns_alloc_inum(struct ns_common *ns) > > > > extern struct file *proc_ns_fget(int fd); > > #define get_proc_ns(inode) ((struct ns_common *)(inode)->i_private) > > +extern void *__ns_get_path(struct path *path, struct ns_common *ns); > > extern void *ns_get_path(struct path *path, struct task_struct *task, > > const struct proc_ns_operations *ns_ops); > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index d01f1cb3cfc0..72b37fc3bc0c 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -921,6 +921,9 @@ struct bpf_prog_info { > > __u32 nr_map_ids; > > __aligned_u64 map_ids; > > char name[BPF_OBJ_NAME_LEN]; > > + __u32 ifindex; > > + __u64 netns_dev; > > + __u64 netns_ino; > > } __attribute__((aligned(8))); > > > > struct bpf_map_info { > > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c > > index 1e6064ea3609..4d5bd1e3 100644 > > --- a/kernel/bpf/offload.c > > +++ b/kernel/bpf/offload.c > > @@ -16,9 +16,11 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > +#include > > #include > > #include > > > > @@ -181,6 +183,48 @@ int bpf_prog_offload_compile(struct bpf_prog *prog) > > return bpf_prog_offload_translate(prog); > > } > > > > +int bpf_prog_offload_info_fill(struct bpf_prog_info *info, > > + struct bpf_prog *prog) > > +{ > > + struct bpf_dev_offload *offload; > > + struct inode *ns_inode; > > + struct path ns_path; > > + int ifindex, err; > > + struct net *net; > > + > > +again: > > + rtnl_lock(); > > + down_read(_devs_lock); > > + > > + offload = prog->aux->offload; > > + if (!offload) { > > + up_read(_devs_lock); > > + rtnl_unlock(); > > + return -ENODEV; > > + } > > + > > + ifindex = offload->netdev->ifindex; > > + net = dev_net(offload->netdev); > > + get_net(net); /* __ns_get_path() drops the reference */ > > + > > + up_read(_devs_lock); > > + rtnl_unlock(); > > + > > + err = PTR_ERR_OR_ZERO(__ns_get_path(_path, >ns)); > > + if (err) { > > + if (err == -EAGAIN) > > + goto again; > > + return err; > > + } > > + ns_inode = ns_path.dentry->d_inode; > > + > > + info->ifindex = ifindex; > > + info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev); > > + info->netns_ino = ns_inode->i_ino; > > I think here, we're still missing a: > > path_put(_path); > > Otherwise we're leaking that
Re: [PATCH v2 bpf-next 0/3] bpf: two stack check fixes
On 12/25/2017 10:15 PM, Alexei Starovoitov wrote: > Jann reported an issue with stack depth tracking. Fix it and add tests. > Also fix off-by-one error in MAX_CALL_FRAMES check. > This set is on top of Jann's "selftest for late caller stack size increase" > test. > > Alexei Starovoitov (3): > bpf: fix maximum stack depth tracking logic > selftests/bpf: additional stack depth tests > bpf: fix max call depth check > > include/linux/bpf_verifier.h| 1 + > kernel/bpf/verifier.c | 86 +++ > tools/testing/selftests/bpf/test_verifier.c | 156 > > 3 files changed, 225 insertions(+), 18 deletions(-) Applied to bpf-next, thanks Alexei!
Re: [PATCH] bpf: selftest for late caller stack size increase
On 12/22/2017 07:12 PM, Jann Horn wrote: > This checks that it is not possible to bypass the total stack size check in > update_stack_depth() by calling a function that uses a large amount of > stack memory *before* using a large amount of stack memory in the caller. > > Currently, the first added testcase causes a rejection as expected, but > the second testcase is (AFAICS incorrectly) accepted: > > [...] > #483/p calls: stack overflow using two frames (post-call access) FAIL > Unexpected success to load! > 0: (85) call pc+2 > caller: > R10=fp0,call_-1 > callee: > frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0 > 3: (72) *(u8 *)(r10 -300) = 0 > 4: (b7) r0 = 0 > 5: (95) exit > returning from callee: > frame1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0 > to caller at 1: > R0_w=inv0 R10=fp0,call_-1 > > from 5 to 1: R0=inv0 R10=fp0,call_-1 > 1: (72) *(u8 *)(r10 -300) = 0 > 2: (95) exit > processed 6 insns, stack depth 300+300 > [...] > Summary: 704 PASSED, 1 FAILED > > AFAICS the JIT-generated code for the second testcase shows that this > really causes the stack pointer to be decremented by 300+300: > > first function: > 55push rbp > 0001 4889E5mov rbp,rsp > 0004 4881EC5801sub rsp,0x158 > 000B 4883ED28 sub rbp,byte +0x28 > [...] > 0025 E89AB3AFE5call 0xe5afb3c4 > 002A C685D4FE00mov byte [rbp-0x12c],0x0 > [...] > 0041 4883C528 add rbp,byte +0x28 > 0045 C9leave > 0046 C3ret > > second function: > 55push rbp > 0001 4889E5mov rbp,rsp > 0004 4881EC5801sub rsp,0x158 > 000B 4883ED28 sub rbp,byte +0x28 > [...] > 0025 C685D4FE00mov byte [rbp-0x12c],0x0 > [...] > 003E 4883C528 add rbp,byte +0x28 > 0042 C9leave > 0043 C3ret > > Signed-off-by: Jann HornApplied to bpf-next, thanks a lot Jann!
Re: [PATCH 3/4] libbpf: break loop earlier
On 12/27/2017 09:30 PM, Eric Leblond wrote: > On Wed, 2017-12-27 at 11:00 -0800, Alexei Starovoitov wrote: >> On Wed, Dec 27, 2017 at 07:02:28PM +0100, Eric Leblond wrote: >>> Get out of the loop when we have a match. >>> >>> Signed-off-by: Eric Leblond>>> --- >>> tools/lib/bpf/libbpf.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c >>> index 5fe8aaa2123e..d263748aa341 100644 >>> --- a/tools/lib/bpf/libbpf.c >>> +++ b/tools/lib/bpf/libbpf.c >>> @@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object >>> *obj) >>>prog->section_name); >>> return -LIBBPF_ERRNO__LIBELF; >>> } >>> + break; >> >> why this is needed? > > It was just cosmetic, no related bug. > >> The top of the loop is: >> for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name; >> >> so as soon as name is found the loop will exit. > > OK, I've missed that. Please disregard this patch. Ok, please respin the series one last time w/o this patch then. Please also keep the already given ACKs on the other patches. Thanks, Eric!
Re: [PATCH net-next 09/10] net: qualcomm: rmnet: Add support for TX checksum offload
Hi Subash, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Enable-csum-offloads/20171228-041216 reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) vim +187 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 106 107 #if IS_ENABLED(CONFIG_IPV6) 108 static int 109 rmnet_map_ipv6_dl_csum_trailer(struct sk_buff *skb, 110 struct rmnet_map_dl_csum_trailer *csum_trailer) 111 { 112 u16 ip_pseudo_payload_csum, pseudo_csum, ip6_hdr_csum, *csum_field; 113 u16 csum_value, ip6_payload_csum, csum_value_final; 114 struct ipv6hdr *ip6h; 115 void *txporthdr; 116 u32 length; 117 118 ip6h = (struct ipv6hdr *)(skb->data); 119 120 txporthdr = skb->data + sizeof(struct ipv6hdr); 121 csum_field = rmnet_map_get_csum_field(ip6h->nexthdr, txporthdr); 122 123 if (!csum_field) 124 return -EPROTONOSUPPORT; 125 126 csum_value = ~ntohs(csum_trailer->csum_value); 127 ip6_hdr_csum = ~ntohs(ip_compute_csum(ip6h, 128(int)(txporthdr - (void *)(skb->data; 129 ip6_payload_csum = csum16_sub(csum_value, ip6_hdr_csum); 130 131 length = (ip6h->nexthdr == IPPROTO_UDP) ? 132 ntohs(((struct udphdr *)txporthdr)->len) : 133 ntohs(ip6h->payload_len); 134 pseudo_csum = ~ntohs(csum_ipv6_magic(>saddr, >daddr, 135 length, ip6h->nexthdr, 0)); 136 ip_pseudo_payload_csum = csum16_add(ip6_payload_csum, pseudo_csum); 137 > 138 csum_value_final = ~csum16_sub(ip_pseudo_payload_csum, > 139 ntohs(*csum_field)); 140 141 if (unlikely(csum_value_final == 0)) { 142 switch (ip6h->nexthdr) { 143 case IPPROTO_UDP: 144 /* RFC 2460 section 8.1 145 * DL6 One's complement rule for UDP checksum 0 146 */ 147 csum_value_final = ~csum_value_final; 148 break; 149 150 case IPPROTO_TCP: 151 /* DL6 Non-RFC compliant TCP checksum found */ 152 if (*csum_field == 0x) 153 csum_value_final = ~csum_value_final; 154 break; 155 } 156 } 157 158 if (csum_value_final == ntohs(*csum_field)) 159 return 0; 160 else 161 return -EINVAL; 162 } 163 #endif 164 165 static void rmnet_map_complement_ipv4_txporthdr_csum_field(void *iphdr) 166 { 167 struct iphdr *ip4h = (struct iphdr *)iphdr; 168 void *txphdr; 169 u16 *csum; 170 171 txphdr = iphdr + ip4h->ihl * 4; 172 173 if (ip4h->protocol == IPPROTO_TCP || ip4h->protocol == IPPROTO_UDP) { 174 csum = (u16 *)rmnet_map_get_csum_field(ip4h->protocol, txphdr); 175 *csum = ~(*csum); 176 } 177 } 178 179 static void 180 rmnet_map_ipv4_ul_csum_header(void *iphdr, 181struct rmnet_map_ul_csum_header *ul_header, 182struct sk_buff *skb) 183 { 184 struct iphdr *ip4h = (struct iphdr *)iphdr; 185 u16 *hdr = (u16 *)ul_header; 186 > 187 ul_header->csum_start_offset = > htons((u16)(skb_transport_header(skb) - 188 (unsigned char *)iphdr)); 189 ul_header->csum_insert_offset = skb->csum_offset; 190 ul_header->csum_enabled = 1; 191 if (ip4h->protocol == IPPROTO_UDP) 192 ul_header->udp_ip4_ind = 1; 193 else 194 ul_header->udp_ip4_ind = 0; 195 196 /* Changing remaining fields to network order */ 197 hdr++; > 198 *hdr = htons(*hdr); 199 200 skb->ip_summed = CHECKSUM_NONE; 201 202 rmnet_map_complement_ipv4_txporthdr_csum_field(iphdr); 203 } 204 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework
On 12/27/17 12:09 AM, Masami Hiramatsu wrote: On Tue, 26 Dec 2017 18:12:56 -0800 Alexei Starovoitovwrote: On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote: Support in-kernel fault-injection framework via debugfs. This allows you to inject a conditional error to specified function using debugfs interfaces. Signed-off-by: Masami Hiramatsu --- Documentation/fault-injection/fault-injection.txt |5 + kernel/Makefile |1 kernel/fail_function.c| 169 + lib/Kconfig.debug | 10 + 4 files changed, 185 insertions(+) create mode 100644 kernel/fail_function.c diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt index 918972babcd8..6243a588dd71 100644 --- a/Documentation/fault-injection/fault-injection.txt +++ b/Documentation/fault-injection/fault-injection.txt @@ -30,6 +30,11 @@ o fail_mmc_request injects MMC data errors on devices permitted by setting debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request +o fail_function + + injects error return on specific functions by setting debugfs entries + under /sys/kernel/debug/fail_function. No boot option supported. I like it. Could you document it a bit better? Yes, I will do in next series. In particular retval is configurable, but without an example no one will be able to figure out how to use it. Ah, right. BTW, as I pointed in the covermail, should we store the expected error value range into the injectable list? e.g. ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO) And provide APIs to check/get it. I'm afraid such check would be too costly. Right now we have only two functions marked but I expect hundreds more will be added in the near future as soon as developers realize the potential of such error injection. All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data. Multiple by 1k and we have 8k of data spent on marks. If we add max/min range marks that doubles it for very little use. I think marking function only is enough.
Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
On 12/26/17 9:56 PM, Masami Hiramatsu wrote: On Tue, 26 Dec 2017 17:57:32 -0800 Alexei Starovoitovwrote: On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote: Check whether error injectable event is on function entry or not. Currently it checks the event is ftrace-based kprobes or not, but that is wrong. It should check if the event is on the entry of target function. Since error injection will override a function to just return with modified return value, that operation must be done before the target function starts making stackframe. As a side effect, bpf error injection is no need to depend on function-tracer. It can work with sw-breakpoint based kprobe events too. Signed-off-by: Masami Hiramatsu --- kernel/trace/Kconfig|2 -- kernel/trace/bpf_trace.c|6 +++--- kernel/trace/trace_kprobe.c |8 +--- kernel/trace/trace_probe.h | 12 ++-- 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index ae3a2d519e50..6400e1bf97c5 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -533,9 +533,7 @@ config FUNCTION_PROFILER config BPF_KPROBE_OVERRIDE bool "Enable BPF programs to override a kprobed function" depends on BPF_EVENTS - depends on KPROBES_ON_FTRACE depends on HAVE_KPROBE_OVERRIDE - depends on DYNAMIC_FTRACE_WITH_REGS default n help Allows BPF to override the execution of a probed function and diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f6d2327ecb59..d663660f8392 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event, int ret = -EEXIST; /* -* Kprobe override only works for ftrace based kprobes, and only if they -* are on the opt-in list. +* Kprobe override only works if they are on the function entry, +* and only if they are on the opt-in list. */ if (prog->kprobe_override && - (!trace_kprobe_ftrace(event->tp_event) || + (!trace_kprobe_on_func_entry(event->tp_event) || !trace_kprobe_error_injectable(event->tp_event))) return -EINVAL; diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 91f4b57dab82..265e3e27e8dc 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk) return nhit; } -int trace_kprobe_ftrace(struct trace_event_call *call) +bool trace_kprobe_on_func_entry(struct trace_event_call *call) { struct trace_kprobe *tk = (struct trace_kprobe *)call->data; - return kprobe_ftrace(>rp.kp); + + return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name, + tk->rp.kp.offset); That would be nice, but did you test this? Yes, because the jprobe, which was only official user of modifying execution path using kprobe, did same way to check. (and kretprobe also does it) My understanding that kprobe will restore all regs and here we need to override return ip _and_ value. yes, no problem. kprobe restore all regs from pt_regs, including regs->ip. Could you add a patch with the test the way Josef did or describe the steps to test this new mode? Would you mean below patch? If so, it should work without any change. [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return yeah. I expect bpf_override_return test to work as-is. I'm asking for the test for new functionality added by this patch. In particular kprobe on func entry without ftrace. How did you test it? and how I can repeat the test? I'm still not sure that it works correctly.
Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
Hi Russell, On Wed, Dec 27, 2017 at 10:24:01PM +, Russell King - ARM Linux wrote: > On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote: > > > > +_eth2 { > > + /* CPS Lane 5 */ > > + status = "okay"; > > + phy-mode = "2500base-x"; > > + /* Generic PHY, providing serdes lanes */ > > + phys = <_comphy5 2>; > > +}; > > + > > This is wrong. This lane is connected to a SFP cage which can support > more than just 2500base-X. Tying it in this way to 2500base-X means > that this port does not support conenctions at 1000base-X, despite > that's one of the most popular and more standardised speeds. What do you suggest to describe this in the dt, to enable a port using the current PPv2 driver? Antoine -- Antoine Ténart, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com
[PATCH ipsec] xfrm: skip policies marked as dead while rehashing
syzkaller triggered following KASAN splat: BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618 read of size 2 at addr 8801c8e92fe4 by task kworker/1:1/23 [..] Workqueue: events xfrm_hash_rebuild [..] __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428 xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618 process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112 worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..] The reproducer triggers: 1016 if (error) { 1017 list_move_tail(>walk.all, >all); 1018 goto out; 1019 } in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump callback returns -ENOBUFS). In this case, *walk is located the pfkey socket struct, so this socket becomes visible in the global policy list. It looks like this is intentional -- phony walker has walk.dead set to 1 and all other places skip such "policies". Ccing original authors of the two commits that seem to expose this issue (first patch missed ->dead check, second patch adds pfkey sockets to policies dumper list). Fixes: 880a6fab8f6ba5b ("xfrm: configure policy hash table thresholds by netlink") Fixes: 12a169e7d8f4b1c ("ipsec: Put dumpers on the dump list") Cc: Herbert XuCc: Timo Teras Cc: Christophe Gouault Reported-by: syzbot Signed-off-by: Florian Westphal --- net/xfrm/xfrm_policy.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 9542975eb2f9..181bc6181789 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -609,7 +609,8 @@ static void xfrm_hash_rebuild(struct work_struct *work) /* re-insert all policies by order of creation */ list_for_each_entry_reverse(policy, >xfrm.policy_all, walk.all) { - if (xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) { + if (policy->walk.dead || + xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) { /* skip socket policies */ continue; } -- 2.13.6
Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote: > This patch enables the fourth network interface on the Marvell > Macchiatobin. It is configured in the 2500Base-X PHY mode. > > Signed-off-by: Antoine Tenart> --- > arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 > 1 file changed, 8 insertions(+) > > diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > index b3350827ee55..c51efd511324 100644 > --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts > @@ -279,6 +279,14 @@ > phys = <_comphy0 1>; > }; > > +_eth2 { > + /* CPS Lane 5 */ > + status = "okay"; > + phy-mode = "2500base-x"; > + /* Generic PHY, providing serdes lanes */ > + phys = <_comphy5 2>; > +}; > + This is wrong. This lane is connected to a SFP cage which can support more than just 2500base-X. Tying it in this way to 2500base-X means that this port does not support conenctions at 1000base-X, despite that's one of the most popular and more standardised speeds. So, I feel I have to NAK this patch in its current form. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
[PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum
This patch adds one more generic PHY mode to the phy_mode enum, to allow configuring generic PHYs to the 2.5G SGMII mode by using the set_mode callback. Signed-off-by: Antoine Tenart--- include/linux/phy/phy.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h index 4f8423a948d5..70459a28f3a1 100644 --- a/include/linux/phy/phy.h +++ b/include/linux/phy/phy.h @@ -28,6 +28,7 @@ enum phy_mode { PHY_MODE_USB_DEVICE, PHY_MODE_USB_OTG, PHY_MODE_SGMII, + PHY_MODE_SGMII_2_5G, PHY_MODE_10GKR, PHY_MODE_UFS_HS_A, PHY_MODE_UFS_HS_B, -- 2.14.3
[PATCH net-next 3/6] net: mvpp2: 1000baseX support
This patch adds the 1000Base-X PHY mode support in the Marvell PPv2 driver. 1000Base-X is quite close the SGMII and uses nearly the same code path. Signed-off-by: Antoine Tenart--- drivers/net/ethernet/marvell/mvpp2.c | 45 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c index a19760736b71..094db9dd633f 100644 --- a/drivers/net/ethernet/marvell/mvpp2.c +++ b/drivers/net/ethernet/marvell/mvpp2.c @@ -4501,6 +4501,7 @@ static int mvpp22_gop_init(struct mvpp2_port *port) mvpp22_gop_init_rgmii(port); break; case PHY_INTERFACE_MODE_SGMII: + case PHY_INTERFACE_MODE_1000BASEX: mvpp22_gop_init_sgmii(port); break; case PHY_INTERFACE_MODE_10GKR: @@ -4538,7 +4539,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port) u32 val; if (phy_interface_mode_is_rgmii(port->phy_interface) || - port->phy_interface == PHY_INTERFACE_MODE_SGMII) { + port->phy_interface == PHY_INTERFACE_MODE_SGMII || + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { /* Enable the GMAC link status irq for this port */ val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK); val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT; @@ -4568,7 +4570,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port) } if (phy_interface_mode_is_rgmii(port->phy_interface) || - port->phy_interface == PHY_INTERFACE_MODE_SGMII) { + port->phy_interface == PHY_INTERFACE_MODE_SGMII || + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK); val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT; writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK); @@ -4580,7 +4583,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port) u32 val; if (phy_interface_mode_is_rgmii(port->phy_interface) || - port->phy_interface == PHY_INTERFACE_MODE_SGMII) { + port->phy_interface == PHY_INTERFACE_MODE_SGMII || + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { val = readl(port->base + MVPP22_GMAC_INT_MASK); val |= MVPP22_GMAC_INT_MASK_LINK_STAT; writel(val, port->base + MVPP22_GMAC_INT_MASK); @@ -4605,6 +4609,7 @@ static int mvpp22_comphy_init(struct mvpp2_port *port) switch (port->phy_interface) { case PHY_INTERFACE_MODE_SGMII: + case PHY_INTERFACE_MODE_1000BASEX: mode = PHY_MODE_SGMII; break; case PHY_INTERFACE_MODE_10GKR: @@ -4625,7 +4630,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) { u32 val; - if (port->phy_interface == PHY_INTERFACE_MODE_SGMII) { + if (port->phy_interface == PHY_INTERFACE_MODE_SGMII || + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { val = readl(port->base + MVPP22_GMAC_CTRL_4_REG); val |= MVPP22_CTRL4_SYNC_BYPASS_DIS | MVPP22_CTRL4_DP_CLK_SEL | MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE; @@ -4640,9 +4646,11 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) writel(val, port->base + MVPP22_GMAC_CTRL_4_REG); } - /* The port is connected to a copper PHY */ val = readl(port->base + MVPP2_GMAC_CTRL_0_REG); - val &= ~MVPP2_GMAC_PORT_TYPE_MASK; + if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) + val |= MVPP2_GMAC_PORT_TYPE_MASK; + else + val &= ~MVPP2_GMAC_PORT_TYPE_MASK; writel(val, port->base + MVPP2_GMAC_CTRL_0_REG); val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG); @@ -4651,6 +4659,19 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) MVPP2_GMAC_AN_DUPLEX_EN; if (port->phy_interface == PHY_INTERFACE_MODE_SGMII) val |= MVPP2_GMAC_IN_BAND_AUTONEG; + + if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) + /* 1000BaseX port cannot negotiate speed nor can it +* negotiate duplex: they are always operating with a +* fixed speed of 1000Mbps in full duplex, so force +* 1000 speed and full duplex here. +*/ + val |= MVPP2_GMAC_CONFIG_GMII_SPEED | + MVPP2_GMAC_CONFIG_FULL_DUPLEX; + else + val |= MVPP2_GMAC_AN_SPEED_EN | + MVPP2_GMAC_AN_DUPLEX_EN; + writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG); } @@ -4671,7 +4692,8 @@ static void mvpp2_port_mii_gmac_configure(struct mvpp2_port *port) /* Configure the PCS and in-band AN */
[PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
This patch enables the fourth network interface on the Marvell Macchiatobin. It is configured in the 2500Base-X PHY mode. Signed-off-by: Antoine Tenart--- arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 1 file changed, 8 insertions(+) diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts index b3350827ee55..c51efd511324 100644 --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts @@ -279,6 +279,14 @@ phys = <_comphy0 1>; }; +_eth2 { + /* CPS Lane 5 */ + status = "okay"; + phy-mode = "2500base-x"; + /* Generic PHY, providing serdes lanes */ + phys = <_comphy5 2>; +}; + _pinctrl { cps_spi1_pins: spi1-pins { marvell,pins = "mpp12", "mpp13", "mpp14", "mpp15", "mpp16"; -- 2.14.3
[PATCH net-next 0/6] net: mvpp2: 1000BaseX and 2000BaseX support
Hi all, This series adds 1000BaseX and 2500BaseX support to the Marvell PPv2 driver. In order to use it, the 2.5 SGMII mode is added in the Marvell common PHY driver (cp110-comphy). Thanks to theses patches the fourth network interface can be used on the mcbin, and two patches are attached: one to describe the interface in the mcbin device tree, and another one adding Ethernet aliases now that the four interfaces are described. This was tested on a mcbin. Patches 1 and 2 should go through the PHY tree, patches 3 and 4 through the net-next tree and patches 5 and 6 through the mvebu one. Please note the two mvpp2 patches do not conflict with the ACPI series Marcin sent a few days ago, and the two series can be processed in parallel. (Marcin is aware of me sending this series). Thanks! Antoine Antoine Tenart (5): phy: add 2.5G SGMII mode to the phy_mode enum phy: cp110-comphy: 2.5G SGMII mode net: mvpp2: 1000baseX support net: mvpp2: 2500baseX support arm64: dts: marvell: mcbin: enable the fourth network interface Yan Markman (1): arm64: dts: marvell: add Ethernet aliases arch/arm64/boot/dts/marvell/armada-7040-db.dts| 6 ++ arch/arm64/boot/dts/marvell/armada-8040-db.dts| 7 +++ arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 15 + drivers/net/ethernet/marvell/mvpp2.c | 67 +++ drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 17 +- include/linux/phy/phy.h | 1 + 6 files changed, 100 insertions(+), 13 deletions(-) -- 2.14.3
[PATCH net-next 2/6] phy: cp110-comphy: 2.5G SGMII mode
This patch allow the CP100 comphy to configure some lanes in the 2.5G SGMII mode. This mode is quite close to SGMII and uses nearly the same code path. Signed-off-by: Antoine Tenart--- drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c index a0d522154cdf..946a6ed7b66f 100644 --- a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c +++ b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c @@ -135,19 +135,25 @@ struct mvebu_comhy_conf { static const struct mvebu_comhy_conf mvebu_comphy_cp110_modes[] = { /* lane 0 */ MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII, 0x1), + MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII_2_5G, 0x1), /* lane 1 */ MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII, 0x1), + MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII_2_5G, 0x1), /* lane 2 */ MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII, 0x1), + MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII_2_5G, 0x1), MVEBU_COMPHY_CONF(2, 0, PHY_MODE_10GKR, 0x1), /* lane 3 */ MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII, 0x2), + MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII_2_5G, 0x2), /* lane 4 */ MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII, 0x2), + MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII_2_5G, 0x2), MVEBU_COMPHY_CONF(4, 0, PHY_MODE_10GKR, 0x2), MVEBU_COMPHY_CONF(4, 1, PHY_MODE_SGMII, 0x1), /* lane 5 */ MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII, 0x1), + MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII_2_5G, 0x1), }; struct mvebu_comphy_priv { @@ -206,6 +212,10 @@ static void mvebu_comphy_ethernet_init_reset(struct mvebu_comphy_lane *lane, if (mode == PHY_MODE_10GKR) val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0xe) | MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0xe); + else if (mode == PHY_MODE_SGMII_2_5G) + val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0x8) | + MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0x8) | + MVEBU_COMPHY_SERDES_CFG0_HALF_BUS; else if (mode == PHY_MODE_SGMII) val |= MVEBU_COMPHY_SERDES_CFG0_GEN_RX(0x6) | MVEBU_COMPHY_SERDES_CFG0_GEN_TX(0x6) | @@ -296,13 +306,13 @@ static int mvebu_comphy_init_plls(struct mvebu_comphy_lane *lane, return 0; } -static int mvebu_comphy_set_mode_sgmii(struct phy *phy) +static int mvebu_comphy_set_mode_sgmii(struct phy *phy, enum phy_mode mode) { struct mvebu_comphy_lane *lane = phy_get_drvdata(phy); struct mvebu_comphy_priv *priv = lane->priv; u32 val; - mvebu_comphy_ethernet_init_reset(lane, PHY_MODE_SGMII); + mvebu_comphy_ethernet_init_reset(lane, mode); val = readl(priv->base + MVEBU_COMPHY_RX_CTRL1(lane->id)); val &= ~MVEBU_COMPHY_RX_CTRL1_CLK8T_EN; @@ -487,7 +497,8 @@ static int mvebu_comphy_power_on(struct phy *phy) switch (lane->mode) { case PHY_MODE_SGMII: - ret = mvebu_comphy_set_mode_sgmii(phy); + case PHY_MODE_SGMII_2_5G: + ret = mvebu_comphy_set_mode_sgmii(phy, lane->mode); break; case PHY_MODE_10GKR: ret = mvebu_comphy_set_mode_10gkr(phy); -- 2.14.3
[PATCH net-next 4/6] net: mvpp2: 2500baseX support
This patch adds the 2500Base-X PHY mode support in the Marvell PPv2 driver. 2500Base-X is quite close to 1000Base-X and SGMII modes and uses nearly the same code path. Signed-off-by: Antoine Tenart--- drivers/net/ethernet/marvell/mvpp2.c | 40 1 file changed, 31 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c index 094db9dd633f..5d2a6f5a52b6 100644 --- a/drivers/net/ethernet/marvell/mvpp2.c +++ b/drivers/net/ethernet/marvell/mvpp2.c @@ -4502,6 +4502,7 @@ static int mvpp22_gop_init(struct mvpp2_port *port) break; case PHY_INTERFACE_MODE_SGMII: case PHY_INTERFACE_MODE_1000BASEX: + case PHY_INTERFACE_MODE_2500BASEX: mvpp22_gop_init_sgmii(port); break; case PHY_INTERFACE_MODE_10GKR: @@ -4540,7 +4541,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port) if (phy_interface_mode_is_rgmii(port->phy_interface) || port->phy_interface == PHY_INTERFACE_MODE_SGMII || - port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX || + port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) { /* Enable the GMAC link status irq for this port */ val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK); val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT; @@ -4571,7 +4573,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port) if (phy_interface_mode_is_rgmii(port->phy_interface) || port->phy_interface == PHY_INTERFACE_MODE_SGMII || - port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX || + port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) { val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK); val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT; writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK); @@ -4584,7 +4587,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port) if (phy_interface_mode_is_rgmii(port->phy_interface) || port->phy_interface == PHY_INTERFACE_MODE_SGMII || - port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX || + port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) { val = readl(port->base + MVPP22_GMAC_INT_MASK); val |= MVPP22_GMAC_INT_MASK_LINK_STAT; writel(val, port->base + MVPP22_GMAC_INT_MASK); @@ -4612,6 +4616,9 @@ static int mvpp22_comphy_init(struct mvpp2_port *port) case PHY_INTERFACE_MODE_1000BASEX: mode = PHY_MODE_SGMII; break; + case PHY_INTERFACE_MODE_2500BASEX: + mode = PHY_MODE_SGMII_2_5G; + break; case PHY_INTERFACE_MODE_10GKR: mode = PHY_MODE_10GKR; break; @@ -4631,7 +4638,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) u32 val; if (port->phy_interface == PHY_INTERFACE_MODE_SGMII || - port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) { + port->phy_interface == PHY_INTERFACE_MODE_1000BASEX || + port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) { val = readl(port->base + MVPP22_GMAC_CTRL_4_REG); val |= MVPP22_CTRL4_SYNC_BYPASS_DIS | MVPP22_CTRL4_DP_CLK_SEL | MVPP22_CTRL4_QSGMII_BYPASS_ACTIVE; @@ -4647,7 +4655,8 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) } val = readl(port->base + MVPP2_GMAC_CTRL_0_REG); - if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) + if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX || + port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) val |= MVPP2_GMAC_PORT_TYPE_MASK; else val &= ~MVPP2_GMAC_PORT_TYPE_MASK; @@ -4660,6 +4669,11 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) if (port->phy_interface == PHY_INTERFACE_MODE_SGMII) val |= MVPP2_GMAC_IN_BAND_AUTONEG; + /* Clear all fields we may want to explicitly set below */ + val &= ~(MVPP2_GMAC_CONFIG_FULL_DUPLEX | MVPP2_GMAC_CONFIG_GMII_SPEED | +MVPP2_GMAC_CONFIG_MII_SPEED | MVPP2_GMAC_AN_SPEED_EN | +MVPP2_GMAC_AN_DUPLEX_EN); + if (port->phy_interface == PHY_INTERFACE_MODE_1000BASEX) /* 1000BaseX port cannot negotiate speed nor can it * negotiate duplex: they are always operating with a @@ -4668,6 +4682,10 @@ static void mvpp2_port_mii_gmac_configure_mode(struct mvpp2_port *port) */ val |=
[PATCH net-next 6/6] arm64: dts: marvell: add Ethernet aliases
From: Yan MarkmanThis patch adds Ethernet aliases in the Marvell Aramada 7040 DB, 8040 DB and 8040 mcbin device trees so that the bootloader setup the MAC addresses correctly. Signed-off-by: Yan Markman [Antoine: commit message, small fixes] Signed-off-by: Antoine Tenart --- arch/arm64/boot/dts/marvell/armada-7040-db.dts| 6 ++ arch/arm64/boot/dts/marvell/armada-8040-db.dts| 7 +++ arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 7 +++ 3 files changed, 20 insertions(+) diff --git a/arch/arm64/boot/dts/marvell/armada-7040-db.dts b/arch/arm64/boot/dts/marvell/armada-7040-db.dts index 52b5341cb270..62b83416b30c 100644 --- a/arch/arm64/boot/dts/marvell/armada-7040-db.dts +++ b/arch/arm64/boot/dts/marvell/armada-7040-db.dts @@ -61,6 +61,12 @@ reg = <0x0 0x0 0x0 0x8000>; }; + aliases { + ethernet0 = _eth0; + ethernet1 = _eth1; + ethernet2 = _eth2; + }; + cpm_reg_usb3_0_vbus: cpm-usb3-0-vbus { compatible = "regulator-fixed"; regulator-name = "usb3h0-vbus"; diff --git a/arch/arm64/boot/dts/marvell/armada-8040-db.dts b/arch/arm64/boot/dts/marvell/armada-8040-db.dts index d97b72bed662..d9fffde64c44 100644 --- a/arch/arm64/boot/dts/marvell/armada-8040-db.dts +++ b/arch/arm64/boot/dts/marvell/armada-8040-db.dts @@ -61,6 +61,13 @@ reg = <0x0 0x0 0x0 0x8000>; }; + aliases { + ethernet0 = _eth0; + ethernet1 = _eth2; + ethernet2 = _eth0; + ethernet3 = _eth1; + }; + cpm_reg_usb3_0_vbus: cpm-usb3-0-vbus { compatible = "regulator-fixed"; regulator-name = "cpm-usb3h0-vbus"; diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts index c51efd511324..7a23bb279e1c 100644 --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts @@ -62,6 +62,13 @@ reg = <0x0 0x0 0x0 0x8000>; }; + aliases { + ethernet0 = _eth0; + ethernet1 = _eth0; + ethernet2 = _eth1; + ethernet3 = _eth2; + }; + /* Regulator labels correspond with schematics */ v_3_3: regulator-3-3v { compatible = "regulator-fixed"; -- 2.14.3
Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode
On Wed, Dec 27, 2017 at 01:40:24PM -0800, Stephen Hemminger wrote: > On Wed, 27 Dec 2017 18:39:29 -0200 > Marcelo Ricardo Leitnerwrote: > > > > > + send = false; > > > > + else > > > > + send = true; > > > > + > > > > + ret = do_cmd(largc, largv, batch_size, msg_iov_index++, > > > > send); > > > > > > What happens if tc commands are interlaced in the file -- qdisc add, > > > class add, filter add, then a delete, show, exec, etc.? Right now each > > > command is handled one at a time so an add followed by a delete will > > > work. Your proposed batching loop won't work for this case as some > > > commands are executed when that line is reached and others are batched > > > for later send. Not all of the tc commands need to be batched in a > > > single message so perhaps those commands cause the queue to be flushed > > > (ie, message sent), then that command is executed and you start the > > > batching over. > > > > > > Further, I really think the batching can be done without the global > > > variables and without the command handlers knowing it is batching or > > > part of an iov. e.g., in the case of batching try having the commands > > > malloc the request buffer and return the pointer back to this loop in > > > which case this loop calls rtnl_talk_msg and frees the buffers. > > > > Sounds like the batching is being done at the wrong level. If it was > > done by rtnl_talk(), it should be easier. > > We can keep rtnl_talk() for previous users and make rtnl_talk_msg() do > > the batching, mostly independent of which kind of msg it it. > > > > As you need to inform it that it was the last entry, that may be > > detected with feof(stdin). Just add a 'bool flush' parameter to it. > >rtnl_talk_msg(, flush=feof(stdin)); > > > > Next step then would be to add a memory manager layer to it, so > > libnetlink wouldn't need to copy the messages but recycle pointers: > > rtnl_get_msgbuf(): returns a buffer that one can use to fill in the > > msg and use with rtnl_talk_msg() > > and the free is done by libnetlink itself when the message is > > finally sent, so no need to keep track of what one needs to free or > > can reuse. > > > What about using sendmmsg instead? > That woudl allow sending multiple messages in one syscall. Could be. Although the batching effect would be very different. sendmmsg calls cond_resched() between messages, for instance.
Re: [PATCH v2 net-next] net/trace: fix printk format in inet_sock_set_state
From: Yafang ShaoDate: Sun, 24 Dec 2017 23:10:39 +0800 > There's a space character missed in the printk messages. > > Put the message into one line could simplify searching for > the messages in the kernel source. > > Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint with > inet_sock_set_state tracepoint") Please do not break up long "Fixes: " tag lines, for the same reason you shouldn't break up long kernel log message and trace message strings. I fixed it this time. > Cc: Sergei Shtylyov > Signed-off-by: Yafang Shao Applied.
Re: [PATCH net-next 07/10] net: qualcomm: rmnet: Add support for RX checksum offload
Hi Subash, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Enable-csum-offloads/20171228-041216 reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) vim +30 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 26 27 static u16 *rmnet_map_get_csum_field(unsigned char protocol, 28 const void *txporthdr) 29 { > 30 u16 *check = 0; 31 32 switch (protocol) { 33 case IPPROTO_TCP: > 34 check = &(((struct tcphdr *)txporthdr)->check); 35 break; 36 37 case IPPROTO_UDP: > 38 check = &(((struct udphdr *)txporthdr)->check); 39 break; 40 41 default: 42 check = 0; 43 break; 44 } 45 46 return check; 47 } 48 49 static int 50 rmnet_map_ipv4_dl_csum_trailer(struct sk_buff *skb, 51 struct rmnet_map_dl_csum_trailer *csum_trailer) 52 { 53 u16 ip_pseudo_payload_csum, pseudo_csum, ip_hdr_csum, *csum_field; 54 u16 csum_value, ip_payload_csum, csum_value_final; 55 struct iphdr *ip4h; 56 void *txporthdr; 57 58 ip4h = (struct iphdr *)(skb->data); 59 if ((ntohs(ip4h->frag_off) & IP_MF) || 60 ((ntohs(ip4h->frag_off) & IP_OFFSET) > 0)) 61 return -EOPNOTSUPP; 62 63 txporthdr = skb->data + ip4h->ihl * 4; 64 65 csum_field = rmnet_map_get_csum_field(ip4h->protocol, txporthdr); 66 67 if (!csum_field) 68 return -EPROTONOSUPPORT; 69 70 /* RFC 768 - Skip IPv4 UDP packets where sender checksum field is 0 */ 71 if (*csum_field == 0 && ip4h->protocol == IPPROTO_UDP) 72 return 0; 73 > 74 csum_value = ~ntohs(csum_trailer->csum_value); > 75 ip_hdr_csum = ~ip_fast_csum(ip4h, (int)ip4h->ihl); > 76 ip_payload_csum = csum16_sub(csum_value, ip_hdr_csum); 77 > 78 pseudo_csum = ~ntohs(csum_tcpudp_magic(ip4h->saddr, ip4h->daddr, 79 (u16)(ntohs(ip4h->tot_len) - ip4h->ihl * 4), 80 (u16)ip4h->protocol, 0)); > 81 ip_pseudo_payload_csum = csum16_add(ip_payload_csum, pseudo_csum); 82 > 83 csum_value_final = ~csum16_sub(ip_pseudo_payload_csum, > 84 ntohs(*csum_field)); 85 86 if (unlikely(csum_value_final == 0)) { 87 switch (ip4h->protocol) { 88 case IPPROTO_UDP: 89 /* RFC 768 - DL4 1's complement rule for UDP csum 0 */ 90 csum_value_final = ~csum_value_final; 91 break; 92 93 case IPPROTO_TCP: 94 /* DL4 Non-RFC compliant TCP checksum found */ 95 if (*csum_field == 0x) 96 csum_value_final = ~csum_value_final; 97 break; 98 } 99 } 100 101 if (csum_value_final == ntohs(*csum_field)) 102 return 0; 103 else 104 return -EINVAL; 105 } 106 107 #if IS_ENABLED(CONFIG_IPV6) 108 static int 109 rmnet_map_ipv6_dl_csum_trailer(struct sk_buff *skb, 110 struct rmnet_map_dl_csum_trailer *csum_trailer) 111 { 112 u16 ip_pseudo_payload_csum, pseudo_csum, ip6_hdr_csum, *csum_field; 113 u16 csum_value, ip6_payload_csum, csum_value_final; 114 struct ipv6hdr *ip6h; 115 void *txporthdr; 116 u32 length; 117 118 ip6h = (struct ipv6hdr *)(skb->data); 119 120 txporthdr = skb->data + sizeof(struct ipv6hdr); 121 csum_field = rmnet_map_get_csum_field(ip6h->nexthdr, txporthdr); 122 123 if (!csum_field) 124 return -EPROTONOSUPPORT; 125 126 csum_value = ~ntohs(csum_trailer->csum_value); > 127 ip6_hdr_csum = ~ntohs(ip_compute_csum(ip6h, 128(int)(txporthdr - (void *)(skb->data; > 129 ip6_payload_csum = csum16_sub(csum_value, ip6_hdr_csum); 130 131 length = (ip6h->nexthdr == IPPROTO_UDP) ? 132 ntohs(((struct udphdr *)txporthdr)->len) :
Re: [pull request][for-next V2 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19
From: David MillerDate: Wed, 27 Dec 2017 17:01:22 -0500 (EST) > Pulled, thank you. Actually, I had to revert. Please fix this and resubmit: drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c: In function ‘esw_offloads_load_reps’: drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:774:2: warning: this ‘for’ clause does not guard... [-Wmisleading-indentation] for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++) ^~~ drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:776:3: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘for’ if (err) ^~
Re: [pull request][for-next V2 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19
From: Saeed MahameedDate: Sun, 24 Dec 2017 15:45:36 +0200 > Hi Dave and Doug, > > == > This series includes updates for mlx5 E-Switch infrastructures, > to be merged into net-next and rdma-next trees. > > Mark's patches provide E-Switch refactoring that generalize the mlx5 > E-Switch vf representors interfaces and data structures. The serious is > mainly focused on moving ethernet (netdev) specific representors logic out > of E-Switch (eswitch.c) into mlx5e representor module (en_rep.c), which > provides better separation and allows future support for other types of vf > representors (e.g. RDMA). > > Gal's patches at the end of this serious, provide a simple syntax fix and > two other patches that handles vport ingress/egress ACL steering name > spaces to be aligned with the Firmware/Hardware specs. > === > > V1->V2: > - Addressed coding style comments in patches #1 and #7 > - The series is still based on rc4, as now I see net-next is also @rc4. > > Please pull and let me know if there's any problem. Pulled, thank you.
Re: [PATCH iproute] qdisc: Print offload indication
On Tue, 26 Dec 2017 11:48:45 +0200 Yuval Mintzwrote: > Use the newly added TCA_HW_OFFLOAD indication from kernel > to print a consistent 'offloaded' message to user when listing qdiscs. > > Signed-off-by: Yuval Mintz Applied to master (since TCA_HW_OFFLOAD is already present).
Re: [PATCH v2 net-next 0/2] kcm: Fix two locking issues
From: Tom HerbertDate: Sat, 23 Dec 2017 09:17:14 -0800 > One issue is lockdep warnings when sock_owned_by_user returns true > in strparser. Fix is to add and call sock_owned_by_user_nocheck since > the check for owned by user is not an error condition in this case. > > The other issue is a potential deadlock between TX and RX paths > > KCM socket lock and the psock socket lock are acquired in both > the RX and TX path, however they take the locks in opposite order > which can lead to deadlock. The fix is to add try_sock_lock to see > if psock socket lock can get acquired in the TX path with KCM lock > held. If not, then KCM socket is released and the psock socket lock > and KCM socket lock are acquired in the same order as the RX path. > > Tested: > > Ran KCM traffic without incident. > > v2: Remove patches to address potential deadlock. I couldn't convince > myself this is an issue after looking at the code some more. If this fixes real locking bugs you should target them at 'net' not 'net-next'. I also see you telling some people hitting kcm locking problems to test "the patch" but you give them no idea what patch you are even talking about. Is it this series? Nobody knows. Please poing them at exactly what patch you want them to test, get their testing results, and add appropriate Tested-by: tags as you respin this for 'net'. Thanks.
Re: [PATCH iproute2 0/3] ip/tunnel: Fix noencap- and document "external" parameter handling.
On Wed, 27 Dec 2017 13:28:13 +0200 Serhey Popovychwrote: > In this series I present next set of improvements/fixes: > > 1) Fix noencap- option handling: we need to clear bit, instead > of seting all, but one we expect to clear. > > 2) Document "external" parameter both in ip-link(8) and help > output for link_gre.c. Add "noexternal" option variant to > bring inline with GENEVE and VXLAN. > > 3) Trivial: clear flowlabel/tclass from flowinfo in case of > inherit to stop sending garbage to the kernel. It has no > functional change, but follows similar behaviour in link_ip6tnl.c > > See individual patch description message for details. > > Thanks, > Serhii > > Serhey Popovych (3): > gre,ip6tnl/tunnel: Fix noencap- support > gre6/tunnel: Do not submit garbage in flowinfo > ip/tunnel: Document "external" parameter > > ip/link_gre.c |7 +-- > ip/link_gre6.c|4 ++-- > ip/link_ip6tnl.c |6 -- > ip/link_iptnl.c |4 +++- > man/man8/ip-link.8.in |6 ++ > 5 files changed, 20 insertions(+), 7 deletions(-) > These are really disjoint. I applied the noencap and the flowinfo patch. Agree with William that having noexternal which does nothing useful is of little value now.
Re: [PATCH net-next 0/4] zerocopy refinements
From: Willem de BruijnDate: Fri, 22 Dec 2017 19:00:16 -0500 > From: Willem de Bruijn > > 1/4 is a small optimization follow-up to the earlier fix to skb_segment: > check skb state once per skb, instead of once per frag. > 2/4 makes behavior more consistent between standard and zerocopy send: > set the PSH bit when hitting MAX_SKB_FRAGS. This helps GRO. > 3/4 resolves a surprising inconsistency in notification: > because small packets were not stored in frags, they would not set > the copied error code over loopback. This change also optimizes > the path by removing copying and making tso_fragment cheaper. > 4/4 follows-up to 3/4 by no longer allocated now unused memory. > this was actually already in RFC patches, but dropped as I pared > down the patch set during revisions. Looks good, series applied, thanks.
Re: [PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
From: Jason BaronDate: Fri, 22 Dec 2017 16:54:01 -0500 > The ability to set speed and duplex for virtio_net in useful in various > scenarios as described here: > > 16032be virtio_net: add ethtool support for set and get of settings > > However, it would be nice to be able to set this from the hypervisor, > such that virtio_net doesn't require custom guest ethtool commands. > > Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows > the hypervisor to export a linkspeed and duplex setting. The user can > subsequently overwrite it later if desired via: 'ethtool -s'. > > Signed-off-by: Jason Baron > Cc: "Michael S. Tsirkin" > Cc: Jason Wang Looks mostly fine to me but need some virtio_net reviewers on this one. > @@ -57,6 +57,8 @@ >* Steering */ > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */ > > +#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Host set linkspeed and duplex */ > + Why use a value so far away from the largest existing one? Just curious.
Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode
On Wed, 27 Dec 2017 18:39:29 -0200 Marcelo Ricardo Leitnerwrote: > > > + send = false; > > > + else > > > + send = true; > > > + > > > + ret = do_cmd(largc, largv, batch_size, msg_iov_index++, send); > > > > What happens if tc commands are interlaced in the file -- qdisc add, > > class add, filter add, then a delete, show, exec, etc.? Right now each > > command is handled one at a time so an add followed by a delete will > > work. Your proposed batching loop won't work for this case as some > > commands are executed when that line is reached and others are batched > > for later send. Not all of the tc commands need to be batched in a > > single message so perhaps those commands cause the queue to be flushed > > (ie, message sent), then that command is executed and you start the > > batching over. > > > > Further, I really think the batching can be done without the global > > variables and without the command handlers knowing it is batching or > > part of an iov. e.g., in the case of batching try having the commands > > malloc the request buffer and return the pointer back to this loop in > > which case this loop calls rtnl_talk_msg and frees the buffers. > > Sounds like the batching is being done at the wrong level. If it was > done by rtnl_talk(), it should be easier. > We can keep rtnl_talk() for previous users and make rtnl_talk_msg() do > the batching, mostly independent of which kind of msg it it. > > As you need to inform it that it was the last entry, that may be > detected with feof(stdin). Just add a 'bool flush' parameter to it. >rtnl_talk_msg(, flush=feof(stdin)); > > Next step then would be to add a memory manager layer to it, so > libnetlink wouldn't need to copy the messages but recycle pointers: > rtnl_get_msgbuf(): returns a buffer that one can use to fill in the > msg and use with rtnl_talk_msg() > and the free is done by libnetlink itself when the message is > finally sent, so no need to keep track of what one needs to free or > can reuse. What about using sendmmsg instead? That woudl allow sending multiple messages in one syscall.
Re: lost connection to test machine (3)
Dmitry Vyukovwrote: > On Wed, Dec 27, 2017 at 7:18 PM, syzbot > wrote: > > Hello, > > > > syzkaller hit the following crash on > > beacbc68ac3e23821a681adb30b45dc55b17488d > > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master > > compiler: gcc (GCC) 7.1.1 20170620 > > .config is attached > > Raw console output is attached. > > C reproducer is attached > > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > > for information about syzkaller reproducers > > > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: > > It will help syzbot understand when the bug is fixed. See footer for > > details. > > If you forward the report, please keep this part and the footer. > > +netfilter maintainers > > Here is cleaned reproducer: > > // autogenerated by syzkaller (http://github.com/google/syzkaller) > #include > #include > #include > #include > #include > #include > > int main() > { > int fd; > > fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP); > struct ipt_replace opt = {}; > opt.num_counters = 1; > opt.size = -1; > setsockopt(fd, SOL_IP, 0x40, , 0x4); > return 0; > } > > > What happens there is that here: > > struct xt_table_info *xt_alloc_table_info(unsigned int size) > { > ... > if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages) > return NULL; > > size = -1 and SMP_ALIGN(size) = 0, so this still tries to allocate > 4GB+delta bytes. > > I don't understand why this uses SMP_ALIGN since we add 2 pages on > top, it seems that we could just drop SMP_ALIGN and local SMP_ALIGN > definition altogether. Looking at history.git this seems to be a left over from back when iptables allocated size * num_cpus() (and used an SMP_ALIGN based offset for each cpu). So yes, I think we can just toss/drop this.
[PATCH] Staging: ipx: fixed several no space before tabs coding style issues
Fixed several coding style warnings of "please, no space before tabs". Signed-off-by: Jianshen Liu--- drivers/staging/ipx/af_ipx.c| 56 - drivers/staging/ipx/ipx_proc.c | 2 +- drivers/staging/ipx/ipx_route.c | 6 ++--- 3 files changed, 32 insertions(+), 32 deletions(-) diff --git a/drivers/staging/ipx/af_ipx.c b/drivers/staging/ipx/af_ipx.c index d21a9d1..d8be06c 100644 --- a/drivers/staging/ipx/af_ipx.c +++ b/drivers/staging/ipx/af_ipx.c @@ -2,7 +2,7 @@ * Implements an IPX socket layer. * * This code is derived from work by - * Ross Biro : Writing the original IP stack + * Ross Biro : Writing the original IP stack * Fred Van Kempen : Tidying up the TCP/IP * * Many thanks go to Keith Baker, Institute For Industrial Information @@ -20,7 +20,7 @@ * provide warranty for any of this software. This material is provided * "AS-IS" and at no charge. * - * Portions Copyright (c) 1995 Caldera, Inc. + * Portions Copyright (c) 1995 Caldera, Inc. * Neither Greg Page nor Caldera, Inc. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. @@ -758,7 +758,7 @@ static void ipxitf_discover_netnum(struct ipx_interface *intrfc, /** * ipxitf_pprop - Process packet propagation IPX packet type 0x14, used for - * NetBIOS broadcasts + * NetBIOS broadcasts * @intrfc: IPX interface receiving this packet * @skb: Received packet * @@ -870,11 +870,11 @@ static struct ipx_interface *ipxitf_alloc(struct net_device *dev, __be32 netnum, if (intrfc) { intrfc->if_dev = dev; intrfc->if_netnum = netnum; - intrfc->if_dlink_type = dlink_type; - intrfc->if_dlink= dlink; - intrfc->if_internal = internal; - intrfc->if_ipx_offset = ipx_offset; - intrfc->if_sknum= IPX_MIN_EPHEMERAL_SOCKET; + intrfc->if_dlink_type = dlink_type; + intrfc->if_dlink= dlink; + intrfc->if_internal = internal; + intrfc->if_ipx_offset = ipx_offset; + intrfc->if_sknum= IPX_MIN_EPHEMERAL_SOCKET; INIT_HLIST_HEAD(>if_sklist); refcount_set(>refcnt, 1); spin_lock_init(>if_sklist_lock); @@ -965,23 +965,23 @@ static int ipxitf_create(struct ipx_interface_definition *idef) switch (idef->ipx_dlink_type) { case IPX_FRAME_8022: - dlink_type = htons(ETH_P_802_2); - datalink= p8022_datalink; + dlink_type = htons(ETH_P_802_2); + datalink= p8022_datalink; break; case IPX_FRAME_ETHERII: if (dev->type != ARPHRD_IEEE802) { - dlink_type = htons(ETH_P_IPX); - datalink= pEII_datalink; + dlink_type = htons(ETH_P_IPX); + datalink= pEII_datalink; break; } /* fall through */ case IPX_FRAME_SNAP: - dlink_type = htons(ETH_P_SNAP); - datalink= pSNAP_datalink; + dlink_type = htons(ETH_P_SNAP); + datalink= pSNAP_datalink; break; case IPX_FRAME_8023: - dlink_type = htons(ETH_P_802_3); - datalink= p8023_datalink; + dlink_type = htons(ETH_P_802_3); + datalink= p8023_datalink; break; case IPX_FRAME_NONE: default: @@ -1522,7 +1522,7 @@ static int ipx_connect(struct socket *sock, struct sockaddr *uaddr, struct ipx_route *rt; sk->sk_state= TCP_CLOSE; - sock->state = SS_UNCONNECTED; + sock->state = SS_UNCONNECTED; lock_sock(sk); if (addr_len != sizeof(*addr)) @@ -1534,7 +1534,7 @@ static int ipx_connect(struct socket *sock, struct sockaddr *uaddr, struct sockaddr_ipx uaddr; uaddr.sipx_port = 0; - uaddr.sipx_network = 0; + uaddr.sipx_network = 0; #ifdef CONFIG_IPX_INTERN rc = -ENETDOWN; @@ -1563,8 +1563,8 @@ static int ipx_connect(struct socket *sock, struct sockaddr *uaddr, ipxs->type = addr->sipx_type; if (sock->type == SOCK_DGRAM) { - sock->state = SS_CONNECTED; - sk->sk_state= TCP_ESTABLISHED; + sock->state = SS_CONNECTED; + sk->sk_state= TCP_ESTABLISHED; } if (rt) @@ -1736,10 +1736,10 @@
Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode
On Wed, Dec 27, 2017 at 09:39:15AM -0600, David Ahern wrote: > On 12/25/17 2:46 AM, Chris Mi wrote: > > Signed-off-by: Chris Mi> > --- > > tc/m_action.c | 91 +-- > > tc/tc.c| 47 ++ > > tc/tc_common.h | 8 +++- > > tc/tc_filter.c | 121 > > + > > 4 files changed, 204 insertions(+), 63 deletions(-) > > > > diff --git a/tc/m_action.c b/tc/m_action.c > > index fc422364..c4c3b862 100644 > > --- a/tc/m_action.c > > +++ b/tc/m_action.c > > @@ -23,6 +23,7 @@ > > #include > > #include > > #include > > +#include > > > > #include "utils.h" > > #include "tc_common.h" > > @@ -546,40 +547,88 @@ bad_val: > > return ret; > > } > > > > +typedef struct { > > + struct nlmsghdr n; > > + struct tcamsg t; > > + charbuf[MAX_MSG]; > > +} tc_action_req; > > + > > +static tc_action_req *action_reqs; > > +static struct iovec msg_iov[MSG_IOV_MAX]; > > + > > +void free_action_reqs(void) > > +{ > > + free(action_reqs); > > +} > > + > > +static tc_action_req *get_action_req(int batch_size, int index) > > +{ > > + tc_action_req *req; > > + > > + if (action_reqs == NULL) { > > + action_reqs = malloc(batch_size * sizeof (tc_action_req)); > > + if (action_reqs == NULL) > > + return NULL; > > + } > > + req = _reqs[index]; > > + memset(req, 0, sizeof (*req)); > > + > > + return req; > > +} > > + > > static int tc_action_modify(int cmd, unsigned int flags, > > - int *argc_p, char ***argv_p) > > + int *argc_p, char ***argv_p, > > + int batch_size, int index, bool send) > > { > > int argc = *argc_p; > > char **argv = *argv_p; > > int ret = 0; > > - struct { > > - struct nlmsghdr n; > > - struct tcamsg t; > > - charbuf[MAX_MSG]; > > - } req = { > > - .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcamsg)), > > - .n.nlmsg_flags = NLM_F_REQUEST | flags, > > - .n.nlmsg_type = cmd, > > - .t.tca_family = AF_UNSPEC, > > + tc_action_req *req; > > + struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK }; > > + struct iovec *iov = _iov[index]; > > + > > + req = get_action_req(batch_size, index); > > + if (req == NULL) { > > + fprintf(stderr, "get_action_req error: not enough buffer\n"); > > + return -ENOMEM; > > + } > > + > > + req->n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcamsg)); > > + req->n.nlmsg_flags = NLM_F_REQUEST | flags; > > + req->n.nlmsg_type = cmd; > > + req->t.tca_family = AF_UNSPEC; > > + struct rtattr *tail = NLMSG_TAIL(>n); > > + > > + struct msghdr msg = { > > + .msg_name = , > > + .msg_namelen = sizeof(nladdr), > > + .msg_iov = msg_iov, > > + .msg_iovlen = index + 1, > > }; > > - struct rtattr *tail = NLMSG_TAIL(); > > > > argc -= 1; > > argv += 1; > > - if (parse_action(, , TCA_ACT_TAB, )) { > > + if (parse_action(, , TCA_ACT_TAB, >n)) { > > fprintf(stderr, "Illegal \"action\"\n"); > > return -1; > > } > > - tail->rta_len = (void *) NLMSG_TAIL() - (void *) tail; > > + tail->rta_len = (void *) NLMSG_TAIL(>n) - (void *) tail; > > + > > + *argc_p = argc; > > + *argv_p = argv; > > + > > + iov->iov_base = >n; > > + iov->iov_len = req->n.nlmsg_len; > > + > > + if (!send) > > + return 0; > > > > - if (rtnl_talk(, , NULL) < 0) { > > + ret = rtnl_talk_msg(, , NULL); > > + if (ret < 0) { > > fprintf(stderr, "We have an error talking to the kernel\n"); > > ret = -1; > > } > > > > - *argc_p = argc; > > - *argv_p = argv; > > - > > return ret; > > } > > > > @@ -679,7 +728,7 @@ bad_val: > > return ret; > > } > > > > -int do_action(int argc, char **argv) > > +int do_action(int argc, char **argv, int batch_size, int index, bool send) > > { > > > > int ret = 0; > > @@ -689,12 +738,14 @@ int do_action(int argc, char **argv) > > if (matches(*argv, "add") == 0) { > > ret = tc_action_modify(RTM_NEWACTION, > > NLM_F_EXCL | NLM_F_CREATE, > > - , ); > > + , , batch_size, > > + index, send); > > } else if (matches(*argv, "change") == 0 || > > matches(*argv, "replace") == 0) { > > ret = tc_action_modify(RTM_NEWACTION, > >NLM_F_CREATE | NLM_F_REPLACE, > > - , ); > > + , , batch_size, > > +
Re: [PATCH v3 1/4] security: Add support for SCTP security hooks
On Wed, Dec 27, 2017 at 11:22 AM, Richard Haineswrote: > On Fri, 2017-12-22 at 15:45 -0200, Marcelo Ricardo Leitner wrote: >> On Fri, Dec 22, 2017 at 09:20:45AM -0800, Casey Schaufler wrote: >> > On 12/22/2017 5:05 AM, Marcelo Ricardo Leitner wrote: >> > > From: Richard Haines >> > > >> > > The SCTP security hooks are explained in: >> > > Documentation/security/LSM-sctp.rst >> >> Thanks Casey for your comments. However, I'm not that acquainted with >> these area of codes and I cannot work on them. I'll just wait for >> Richard then. > > I'm back online and will post a V4 set of patches within a week. These > will address Paul's comments as per [1] and Casey's regarding the > documentation. > Sorry for the delay No worries, thanks. -- paul moore www.paul-moore.com
Re: [PATCH 3/4] libbpf: break loop earlier
Hello, On Wed, 2017-12-27 at 11:00 -0800, Alexei Starovoitov wrote: > On Wed, Dec 27, 2017 at 07:02:28PM +0100, Eric Leblond wrote: > > Get out of the loop when we have a match. > > > > Signed-off-by: Eric Leblond> > --- > > tools/lib/bpf/libbpf.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c > > index 5fe8aaa2123e..d263748aa341 100644 > > --- a/tools/lib/bpf/libbpf.c > > +++ b/tools/lib/bpf/libbpf.c > > @@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object > > *obj) > >prog->section_name); > > return -LIBBPF_ERRNO__LIBELF; > > } > > + break; > > why this is needed? It was just cosmetic, no related bug. > The top of the loop is: > for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name; > > so as soon as name is found the loop will exit. OK, I've missed that. Please disregard this patch. BR, -- Eric Leblond
Re: [PATCHv3 0/2] capability controlled user-namespaces
Hello Mahesh, On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)wrote: > Hello James, > > Seems like I missed your name to be added into the review of this > patch series. Would you be willing be pull this into the security > tree? Serge Hallyn has already ACKed it. We seem to have no formal documentation/specification of this feature. I think that should be written up before this patch goes into mainline... Cheers, Michael > > On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar wrote: >> From: Mahesh Bandewar >> >> TL;DR version >> - >> Creating a sandbox environment with namespaces is challenging >> considering what these sandboxed processes can engage into. e.g. >> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few. >> Current form of user-namespaces, however, if changed a bit can allow >> us to create a sandbox environment without locking down user- >> namespaces. >> >> Detailed version >> >> >> Problem >> --- >> User-namespaces in the current form have increased the attack surface as >> any process can acquire capabilities which are not available to them (by >> default) by performing combination of clone()/unshare()/setns() syscalls. >> >> #define _GNU_SOURCE >> #include >> #include >> #include >> >> int main(int ac, char **av) >> { >> int sock = -1; >> >> printf("Attempting to open RAW socket before unshare()...\n"); >> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); >> if (sock < 0) { >> perror("socket() SOCK_RAW failed: "); >> } else { >> printf("Successfully opened RAW-Sock before unshare().\n"); >> close(sock); >> sock = -1; >> } >> >> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) { >> perror("unshare() failed: "); >> return 1; >> } >> >> printf("Attempting to open RAW socket after unshare()...\n"); >> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); >> if (sock < 0) { >> perror("socket() SOCK_RAW failed: "); >> } else { >> printf("Successfully opened RAW-Sock after unshare().\n"); >> close(sock); >> sock = -1; >> } >> >> return 0; >> } >> >> The above example shows how easy it is to acquire NET_RAW capabilities >> and once acquired, these processes could take benefit of above mentioned >> or similar issues discovered/undiscovered with malicious intent. Note >> that this is just an example and the problem/solution is not limited >> to NET_RAW capability *only*. >> >> The easiest fix one can apply here is to lock-down user-namespaces which >> many of the distros do (i.e. don't allow users to create user namespaces), >> but unfortunately that prevents everyone from using them. >> >> Approach >> >> Introduce a notion of 'controlled' user-namespaces. Every process on >> the host is allowed to create user-namespaces (governed by the limit >> imposed by per-ns sysctl) however, mark user-namespaces created by >> sandboxed processes as 'controlled'. Use this 'mark' at the time of >> capability check in conjunction with a global capability whitelist. >> If the capability is not whitelisted, processes that belong to >> controlled user-namespaces will not be allowed. >> >> Once a user-ns is marked as 'controlled'; all its child user- >> namespaces are marked as 'controlled' too. >> >> A global whitelist is list of capabilities governed by the >> sysctl which is available to (privileged) user in init-ns to modify >> while it's applicable to all controlled user-namespaces on the host. >> >> Marking user-namespaces controlled without modifying the whitelist is >> equivalent of the current behavior. The default value of whitelist includes >> all capabilities so that the compatibility is maintained. However it gives >> admins fine-grained ability to control various capabilities system wide >> without locking down user-namespaces. >> >> Please see individual patches in this series. >> >> Mahesh Bandewar (2): >> capability: introduce sysctl for controlled user-ns capability whitelist >> userns: control capabilities of some user namespaces >> >> Documentation/sysctl/kernel.txt | 21 + >> include/linux/capability.h | 7 ++ >> include/linux/user_namespace.h | 25 >> kernel/capability.c | 52 >> + >> kernel/sysctl.c | 5 >> kernel/user_namespace.c | 4 >> security/commoncap.c| 8 +++ >> 7 files changed, 122 insertions(+) >> >> -- >> 2.15.0.531.g2ccb3012c9-goog >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html --
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/27/17 1:31 PM, Andrew Lunn wrote: >> Hmm. That documents mainly sysfs. No mention of Netlink at all. But >> maybe I missed it. Also, that defines the interface as is. However we >> are talking about the data exchanged over the interface, not the >> interface itself. I don't see how ASIC/HW specific thing, like for >> example KVD in our case could be part of kernel ABI. > > You need to be very careful here. As soon as somebody starts using it, > it might become an ABI. Or you need to clearly document it is not ABI, > there is no guarantee it will not disappear or change its meaning in > the next kernel, and it should be used with extreme caution. > +1 Once the names go in, people can write scripts that invoke devlink at boot to partition resources. With the proposed patch set, the name (e.g., kvd/linear) becomes part of the ABI.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/27/2017 06:34 PM, David Ahern wrote: > On 12/27/17 2:09 AM, Jiri Pirko wrote: >> Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote: >>> On 12/26/17 5:23 AM, Jiri Pirko wrote: From: Jiri PirkoMany of the ASIC's internal resources are limited and are shared between several hardware procedures. For example, unified hash-based memory can be used for many lookup purposes, like FDB and LPM. In many cases the user can provide a partitioning scheme for such a resource in order to perform fine tuning for his application. In such cases performing driver reload is needed for the changes to take place, thus this patchset also adds support for hot reload. Such an abstraction can be coupled with devlink's dpipe interface, which models the ASIC's pipeline as a graph of match/action tables. By modeling the hardware resource object, and by coupling it to several dpipe tables, further visibility can be achieved in order to debug ASIC-wide issues. The proposed interface will provide the user the ability to understand the limitations of the hardware, and receive notification regarding its occupancy. Furthermore, monitoring the resource occupancy can be done in real-time and can be useful in many cases. >>> >>> In the last RFC (not v1, but RFC) I asked for some kind of description >>> for each resource, and you and Arkadi have pushed back. Let's walk >>> through an example to see what I mean: >>> >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 size_valid true >>> resources: >>>name linear size 98304 occ 0 >>>name hash_double size 60416 >>>name hash_single size 87040 >>> >>> So this 2700 has 3 resources that can be managed -- some table or >>> resource or something named 'kvd' with linear, hash_double and >>> hash_single sub-resources. What are these names referring too? The above >>> output gives no description, and 'kvd' is not an industry term. Further, >> >> This are internal resources specific to the ASIC. Would you like some >> description to each or something like that? > > devlink has some nice self-documenting capabilities. What's missing here > is a description of what the resource is used for in standard terms -- > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a > short list versus an exhaustive list of everything it is used for. e.g., > Why would a user decrease linear and increase hash_single or vice versa? > >> >> >>> what are these sizes that a user can control? The output contains no >>> units, no description, nothing. In short, the above output provides >>> random numbers associated with random names. >> >> Units are now exposed from kernel, just this version of iproute2 patch >> does not display it. > > please provide an iproute2 patch that does so the full context if this > patch set can be reviewed from a user perspective. > >> >> >>> >>> I can see dpipe tables exported by this device: >>> >>> $ devlink dpipe header show pci/:03:00.0 >>> >>> pci/:03:00.0: >>> name mlxsw_meta >>> field: >>>name erif_port bitwidth 32 mapping_type ifindex >>>name l3_forward bitwidth 1 >>>name l3_drop bitwidth 1 >>>name adj_index bitwidth 32 >>>name adj_size bitwidth 32 >>>name adj_hash_index bitwidth 32 >>> >>> name ipv6 >>> field: >>>name destination ip bitwidth 128 >>> >>> name ipv4 >>> field: >>>name destination ip bitwidth 32 >>> >>> name ethernet >>> field: >>>name destination mac bitwidth 48 >>> >>> but none mention 'kvd' or 'linear' or 'hash" and none of the other >>> various devlink options: >>> >>> $ devlink >>> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } >>> where OBJECT := { dev | port | sb | monitor | dpipe } >>> >>> seem to related to resources. >>> >>> So how does a user know what they are controlling by this 'resource' >>> option? Is the user expected to have a PRM or user guide on hand for the >>> specific device model that is being configured? >> >> The relation of specific dpipe table to specific resource is exposed by >> the kernel as well. Probably the iproute2 patch just does not display >> it. > > please provide an iproute2 patch that does so the full context if this > patch set can be reviewed from a user perspective. > As Yuval stated you are using the wrong command here. You are printing the headers not the tables. On each dpipe table you can see the resource it is using (the resource path aka host table uses /kvd/hash_single for example). This is already working. Just try it. >> >> >>> >>> Again, I have no objections to kvd, linear, hash, etc terms as they do >>> relate to Mellanox products. But kvd/linear, for example, does correlate >>> to industry standard concepts in some way. My request is that the >>> resource listing guide the user in some way, stating what these >>> resources mean. >> >> So the
Re: WARNING in strp_data_ready
On Wed, Dec 27, 2017 at 9:08 PM, Ozgurwrote: > > > 27.12.2017, 22:21, "Dmitry Vyukov" : >> On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert wrote: >>> Did you try the patch I posted? >> >> Hi Tom, > > Hello Dmitry, > >> No. And I didn't know I need to. Why? >> If you think the patch needs additional testing, you can ask syzbot to >> test it. See >> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot >> Otherwise proceed with committing it. Or what are we waiting for? >> >> Thanks > > I think we need to fixed patch for crash, in fact check to patch code and > test solve the bug. > How do test it because there is no patch in the following bug? Hi Ozgur, I am not sure I completely understand what you mean. But the reproducer for this bug (which one can use for testing) is here: https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY Tom also mentions there is some patch for this, but I don't know where it is, it doesn't seem to be referenced from this thread. > The fix patch should be for this net/kcm/kcmsock.c file and lock functions > must be added calling sk_data_ready (). > Regards > > Ozgur > >>> On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov wrote: On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov wrote: >> wrote: >>> On 10/24/2017 08:20 AM, syzbot wrote: Hello, syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. C reproducer is attached syzkaller reproducer is attached. See https://goo.gl/kgGztJ for information about syzkaller reproducers WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline] WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline] WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x417 kernel/panic.c:181 __warn+0x1c4/0x1d9 kernel/panic.c:542 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 RSP: 0018:8801db206b18 EFLAGS: 00010206 RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0 RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd R10: 8801db206a00 R11: dc00 R12: 8801d29fa000 R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001 psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 >>> >>> Looks like KCM is calling sk_data_ready() without first taking the >>> sock lock. >>> >>> /* Called with lower sock held */ >>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff >>> *skb) >>> { >>> [...] >>> if (kcm_queue_rcv_skb(>sk, skb)) { >>> >>> In this case kcm->sk is not the same lock the comment is referring to. >>> And kcm_queue_rcv_skb() will eventually call sk_data_ready(). >>> >>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock? >>> I don't have anything better in mind immediately. >> The sock locks are taken in reverse order in the send path so so >> grabbing kcm sock lock with lower lock held to call sk_data_ready may >> lead to deadlock like I think. >> >> It might be possible to change the order in the send path to do this. >> Something like: >> >> trylock on lower socket lock >> -if trylock fails
Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode
On Mon, Dec 25, 2017 at 05:46:57PM +0900, Chris Mi wrote: > @@ -267,6 +287,7 @@ int main(int argc, char **argv) > { > int ret; > char *batch_file = NULL; > + int batch_size = 1; > > while (argc > 1) { > if (argv[1][0] != '-') > @@ -297,6 +318,14 @@ int main(int argc, char **argv) > if (argc <= 1) > usage(); > batch_file = argv[1]; > + } else if (matches(argv[1], "-batchsize") == 0 || > + matches(argv[1], "-bs") == 0) { > + argc--; argv++; > + if (argc <= 1) > + usage(); > + batch_size = atoi(argv[1]); > + if (batch_size > MSG_IOV_MAX) > + batch_size = MSG_IOV_MAX; what about if (batch_size < 1) batch_size = 1; > } else if (matches(argv[1], "-netns") == 0) { > NEXT_ARG(); > if (netns_switch(argv[1]))
Re: [patch net-next v2 00/10] Add support for resource abstraction
On Wed, Dec 27, 2017 at 8:34 AM, David Ahernwrote: > On 12/27/17 2:09 AM, Jiri Pirko wrote: >> Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote: >>> On 12/26/17 5:23 AM, Jiri Pirko wrote: From: Jiri Pirko Many of the ASIC's internal resources are limited and are shared between several hardware procedures. For example, unified hash-based memory can be used for many lookup purposes, like FDB and LPM. In many cases the user can provide a partitioning scheme for such a resource in order to perform fine tuning for his application. In such cases performing driver reload is needed for the changes to take place, thus this patchset also adds support for hot reload. Such an abstraction can be coupled with devlink's dpipe interface, which models the ASIC's pipeline as a graph of match/action tables. By modeling the hardware resource object, and by coupling it to several dpipe tables, further visibility can be achieved in order to debug ASIC-wide issues. The proposed interface will provide the user the ability to understand the limitations of the hardware, and receive notification regarding its occupancy. Furthermore, monitoring the resource occupancy can be done in real-time and can be useful in many cases. >>> >>> In the last RFC (not v1, but RFC) I asked for some kind of description >>> for each resource, and you and Arkadi have pushed back. Let's walk >>> through an example to see what I mean: >>> >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 size_valid true >>> resources: >>>name linear size 98304 occ 0 >>>name hash_double size 60416 >>>name hash_single size 87040 >>> >>> So this 2700 has 3 resources that can be managed -- some table or >>> resource or something named 'kvd' with linear, hash_double and >>> hash_single sub-resources. What are these names referring too? The above >>> output gives no description, and 'kvd' is not an industry term. Further, >> >> This are internal resources specific to the ASIC. Would you like some >> description to each or something like that? > > devlink has some nice self-documenting capabilities. What's missing here > is a description of what the resource is used for in standard terms -- > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a > short list versus an exhaustive list of everything it is used for. e.g., > Why would a user decrease linear and increase hash_single or vice versa? Arkadi, on what david says above, can the resource names and ids not be driver specific, but moved up to the switchdev layer and just map to fdb, host routes, nexthops table sizes etc ?. Can these generic networking resources then in-turn be mapped to kvd sizes by the driver ?
Re: [patch net-next v2 00/10] Add support for resource abstraction
> Hmm. That documents mainly sysfs. No mention of Netlink at all. But > maybe I missed it. Also, that defines the interface as is. However we > are talking about the data exchanged over the interface, not the > interface itself. I don't see how ASIC/HW specific thing, like for > example KVD in our case could be part of kernel ABI. You need to be very careful here. As soon as somebody starts using it, it might become an ABI. Or you need to clearly document it is not ABI, there is no guarantee it will not disappear or change its meaning in the next kernel, and it should be used with extreme caution. Personally, if DSA drivers were to expose such settings, i would consider them ABI, which people can rely on to remain stable. Andrew
Re: [PATCH net] sfp: fix sfp-bus oops when removing socket/upstream
On 12/26/2017 03:15 PM, Russell King wrote: > When we remove a socket or upstream, and the other side isn't > registered, we dereference a NULL pointer, causing a kernel oops. > Fix this. > > Signed-off-by: Russell KingReviewed-by: Florian Fainelli Fixes: ce0aa27ff3f6 ("sfp: add sfp-bus to bridge between network devices and sfp cages") -- Florian
Re: [PATCH net] phylink: ensure we report link down when LOS asserted
On 12/26/2017 03:15 PM, Russell King wrote: > Although we disable the netdev carrier, we fail to report in the kernel > log that the link went down. Fix this. > > Signed-off-by: Russell KingReviewed-by: Florian Fainelli Fixes: 9525ae83959b ("phylink: add phylink infrastructure") -- Florian
Re: WARNING in strp_data_ready
On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbertwrote: > Did you try the patch I posted? Hi Tom, No. And I didn't know I need to. Why? If you think the patch needs additional testing, you can ask syzbot to test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot Otherwise proceed with committing it. Or what are we waiting for? Thanks > On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov wrote: >> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov wrote: wrote: > On 10/24/2017 08:20 AM, syzbot wrote: >> Hello, >> >> syzkaller hit the following crash on >> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master >> compiler: gcc (GCC) 7.1.1 20170620 >> .config is attached >> Raw console output is attached. >> C reproducer is attached >> syzkaller reproducer is attached. See https://goo.gl/kgGztJ >> for information about syzkaller reproducers >> >> >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me >> include/net/sock.h:1505 [inline] >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 >> sock_owned_by_user include/net/sock.h:1511 [inline] >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 >> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >> Kernel panic - not syncing: panic_on_warn set ... >> >> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS >> Google 01/01/2011 >> Call Trace: >> >> __dump_stack lib/dump_stack.c:16 [inline] >> dump_stack+0x194/0x257 lib/dump_stack.c:52 >> panic+0x1e4/0x417 kernel/panic.c:181 >> __warn+0x1c4/0x1d9 kernel/panic.c:542 >> report_bug+0x211/0x2d0 lib/bug.c:183 >> fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 >> do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] >> do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 >> do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 >> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 >> invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 >> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] >> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] >> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >> RSP: 0018:8801db206b18 EFLAGS: 00010206 >> RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: >> RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0 >> RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd >> R10: 8801db206a00 R11: dc00 R12: 8801d29fa000 >> R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001 >> psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 > > Looks like KCM is calling sk_data_ready() without first taking the > sock lock. > > /* Called with lower sock held */ > static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb) > { > [...] > if (kcm_queue_rcv_skb(>sk, skb)) { > > In this case kcm->sk is not the same lock the comment is referring to. > And kcm_queue_rcv_skb() will eventually call sk_data_ready(). > > @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock? > I don't have anything better in mind immediately. > The sock locks are taken in reverse order in the send path so so grabbing kcm sock lock with lower lock held to call sk_data_ready may lead to deadlock like I think. It might be possible to change the order in the send path to do this. Something like: trylock on lower socket lock -if trylock fails - release kcm sock lock - lock lower sock - lock kcm sock - call sendpage locked function I admit that dealing with two levels of socket locks in the data path is quite a pain :-) >>> >>> up >>> >>> still happening and we've lost 50K+ test VMs on this >> >> up >> >> Still happens and number of crashes crossed 60K, can we do something >> with this please?
[PATCH v3 bpf-next 1/2] tools/bpftool: use version from the kernel source tree
Bpftool determines it's own version based on the kernel version, which is picked from the linux/version.h header. It's strange to use the version of the installed kernel headers, and makes much more sense to use the version of the actual source tree, where bpftool sources are. Fix this by building kernelversion target and use the resulting string as bpftool version. Example: before: $ bpftool version bpftool v4.14.6 after: $ bpftool version bpftool v4.15.0-rc3 $bpftool version --json {"version":"4.15.0-rc3"} Signed-off-by: Roman GushchinReviewed-by: Jakub Kicinski Cc: Alexei Starovoitov Cc: Daniel Borkmann --- tools/bpf/bpftool/Makefile | 3 +++ tools/bpf/bpftool/main.c | 13 ++--- 2 files changed, 5 insertions(+), 11 deletions(-) diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile index 3f17ad317512..f8f31a8d9269 100644 --- a/tools/bpf/bpftool/Makefile +++ b/tools/bpf/bpftool/Makefile @@ -23,6 +23,8 @@ endif LIBBPF = $(BPF_PATH)libbpf.a +BPFTOOL_VERSION=$(shell make --no-print-directory -sC ../../.. kernelversion) + $(LIBBPF): FORCE $(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) $(OUTPUT)libbpf.a FEATURES_DUMP=$(FEATURE_DUMP_EXPORT) @@ -38,6 +40,7 @@ CC = gcc CFLAGS += -O2 CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/ +CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"' LIBS = -lelf -lbfd -lopcodes $(LIBBPF) INSTALL ?= install diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c index ecd53ccf1239..3a0396d87c42 100644 --- a/tools/bpf/bpftool/main.c +++ b/tools/bpf/bpftool/main.c @@ -38,7 +38,6 @@ #include #include #include -#include #include #include #include @@ -95,21 +94,13 @@ static int do_help(int argc, char **argv) static int do_version(int argc, char **argv) { - unsigned int version[3]; - - version[0] = LINUX_VERSION_CODE >> 16; - version[1] = LINUX_VERSION_CODE >> 8 & 0xf; - version[2] = LINUX_VERSION_CODE & 0xf; - if (json_output) { jsonw_start_object(json_wtr); jsonw_name(json_wtr, "version"); - jsonw_printf(json_wtr, "\"%u.%u.%u\"", -version[0], version[1], version[2]); + jsonw_printf(json_wtr, "\"%s\"", BPFTOOL_VERSION); jsonw_end_object(json_wtr); } else { - printf("%s v%u.%u.%u\n", bin_name, - version[0], version[1], version[2]); + printf("%s v%s\n", bin_name, BPFTOOL_VERSION); } return 0; } -- 2.14.3