[PATCH net] ipv6: addrconf: increment ifp refcount before ipv6_del_addr()
From: Eric DumazetIn the (unlikely) event fixup_permanent_addr() returns a failure, addrconf_permanent_addr() calls ipv6_del_addr() without the mandatory call to in6_ifa_hold(), leading to a refcount error, spotted by syzkaller : WARNING: CPU: 1 PID: 3142 at lib/refcount.c:227 refcount_dec+0x4c/0x50 lib/refcount.c:227 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3142 Comm: ip Not tainted 4.14.0-rc4-next-20171009+ #33 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x41c kernel/panic.c:181 __warn+0x1c4/0x1e0 kernel/panic.c:544 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:refcount_dec+0x4c/0x50 lib/refcount.c:227 RSP: 0018:8801ca49e680 EFLAGS: 00010286 RAX: 002c RBX: 8801d07cfcdc RCX: RDX: 002c RSI: 110039493c90 RDI: ed0039493cc4 RBP: 8801ca49e688 R08: 8801ca49dd70 R09: R10: 8801ca49df58 R11: R12: 110039493cd9 R13: 8801ca49e6e8 R14: 8801ca49e7e8 R15: 8801d07cfcdc __in6_ifa_put include/net/addrconf.h:369 [inline] ipv6_del_addr+0x42b/0xb60 net/ipv6/addrconf.c:1208 addrconf_permanent_addr net/ipv6/addrconf.c:3327 [inline] addrconf_notify+0x1c66/0x2190 net/ipv6/addrconf.c:3393 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x60 net/core/dev.c:1697 call_netdevice_notifiers net/core/dev.c:1715 [inline] __dev_notify_flags+0x15d/0x430 net/core/dev.c:6843 dev_change_flags+0xf5/0x140 net/core/dev.c:6879 do_setlink+0xa1b/0x38e0 net/core/rtnetlink.c:2113 rtnl_newlink+0xf0d/0x1a40 net/core/rtnetlink.c:2661 rtnetlink_rcv_msg+0x733/0x1090 net/core/rtnetlink.c:4301 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2408 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4313 netlink_unicast_kernel net/netlink/af_netlink.c:1273 [inline] netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1299 netlink_sendmsg+0xa4a/0xe70 net/netlink/af_netlink.c:1862 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2049 __sys_sendmsg+0xe5/0x210 net/socket.c:2083 SYSC_sendmsg net/socket.c:2094 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2090 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x7fa9174d3320 RSP: 002b:7ffe302ae9e8 EFLAGS: 0246 ORIG_RAX: 002e RAX: ffda RBX: 7ffe302b2ae0 RCX: 7fa9174d3320 RDX: RSI: 7ffe302aea20 RDI: 0016 RBP: 0082 R08: R09: 000f R10: R11: 0246 R12: 7ffe302b32a0 R13: R14: 7ffe302b2ab8 R15: 7ffe302b32b8 Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: Eric Dumazet Cc: David Ahern --- net/ipv6/addrconf.c |1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 4a96ebbf8eda5f59a6ff88e836d666a404d2bf0d..8a1c846d3df949a4638589f187120db22a3525ba 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3335,6 +3335,7 @@ static void addrconf_permanent_addr(struct net_device *dev) if ((ifp->flags & IFA_F_PERMANENT) && fixup_permanent_addr(idev, ifp) < 0) { write_unlock_bh(>lock); + in6_ifa_hold(ifp); ipv6_del_addr(ifp); write_lock_bh(>lock);
Re: [Patch net 00/16] net_sched: fix races with RCU callbacks
On Mon, Oct 30, 2017 at 6:03 PM, Lucas Bateswrote: > On Oct 30, 2017 19:13, "Cong Wang" wrote: >> >> On Mon, Oct 30, 2017 at 3:39 PM, Lucas Bates wrote: >> > e.On Thu, Oct 26, 2017 at 9:24 PM, Cong Wang >> > wrote: >> >> Recently, the RCU callbacks used in TC filters and TC actions keep >> >> drawing my attention, they introduce at least 4 race condition bugs: >> > >> >> As suggested by Paul, we could defer the work to a workqueue and >> >> gain the permission of holding RTNL again without any performance >> >> impact, however, in tcf_block_put() we could have a deadlock when >> >> flushing workqueue while hodling RTNL lock, the trick here is to >> >> defer the work itself in workqueue and make it queued after all >> >> other works so that we keep the same ordering to avoid any >> >> use-after-free. Please see the first patch for details. >> > >> > Cong, I don't believe the problem's been resolved just yet I have >> > a new kernel, compiled just today and I'm still tripping over a kernel >> > bug in this scenario when I run Chris' new test case. >> > >> >> Without a stack trace, I can't do anything. "a kernel bug" could >> be anything, why do you believe it is caused by this patchset? >> > The stack trace I saw touched on the same code that was affected by the > patchset. I've attached a photo of the trace - sorry, I should have sent it > earlier. I also apologize for having to send a photo but I was doing the > testing on a small device lacking any kind of serial console access. Can you try this patch? From your stack trace it is not clear where the cause is, but we know that the crash is in __tcf_idr_release(), this is how I came up with the following patch: diff --git a/include/net/act_api.h b/include/net/act_api.h index b944e0eb93be..5072446d5f06 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -122,7 +122,9 @@ void tcf_idrinfo_destroy(const struct tc_action_ops *ops, static inline void tc_action_net_exit(struct tc_action_net *tn) { + rtnl_lock(); tcf_idrinfo_destroy(tn->ops, tn->idrinfo); + rtnl_unlock(); kfree(tn->idrinfo); } > > If you need more information or need me to try something else, I'll be able > to tomorrow. > I will look deeper tomorrow. It doesn't look like caused by this patchset so far, probably yet another missing rtnl like what the above patch shows. Thanks!
linux-next: Signed-off-by missing for commit in the net-next tree
Hi all, Commit 509708310cf9 ("r8169: Add support for interrupt coalesce tuning (ethtool -C)") is missing a Signed-off-by from its author (or its author is wrong). -- Cheers, Stephen Rothwell
Re: [BUG ?] ipv6: addrconf: Adds a missing in6_ifa_hold()
On Mon, 2017-10-30 at 22:49 -0600, David Ahern wrote: > On 10/30/17 9:53 PM, Eric Dumazet wrote: > > David, I was looking at addrconf_permanent_addr() and wondered > > if there is not some problem with it. > > > > It seems we need to increment ifp refcount before calling > > ipv6_del_addr() > > > > Could you double check if this patch is needed, I am guessing you have a > > test suite exercising this code path ? > > A lot has changed in 20 months since the patch that added the code. For > instance, taking down the 'lo' device no longer affects host routes on > other interfaces. Also, fixup_permanent_addr only fails on memory > allocation. Did you hit this with a test case because I do not have a > general one that causes the memory failure (hard coding a failure for an > address is the only way). > > > > > Thanks. > > > > PS : Presumably CONFIG_REFCOUNT_FULL=y should have warned you of the > > problem. > > I have not run a debug kernel in a while -- and did not have this option > set. Added it to my debug config. > > > > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > > index > > 4a96ebbf8eda5f59a6ff88e836d666a404d2bf0d..8a1c846d3df949a4638589f187120db22a3525ba > > 100644 > > --- a/net/ipv6/addrconf.c > > +++ b/net/ipv6/addrconf.c > > @@ -3335,6 +3335,7 @@ static void addrconf_permanent_addr(struct net_device > > *dev) > > if ((ifp->flags & IFA_F_PERMANENT) && > > fixup_permanent_addr(idev, ifp) < 0) { > > write_unlock_bh(>lock); > > + in6_ifa_hold(ifp); > > ipv6_del_addr(ifp); > > write_lock_bh(>lock); > > > > Yes, forcing a failure here does trigger refcnt warning, but then you > knew that. ;-) > Okay ;) > > PS. is the following a known failure? I triggered it looking into your > report > > [ 170.385741] == > [ 170.387490] WARNING: possible circular locking dependency detected > [ 170.389214] 4.14.0-rc5+ #338 Not tainted > [ 170.390323] -- > [ 170.392017] swapper/0/0 is trying to acquire lock: > [ 170.393408] (slock-AF_INET){+.-.}, at: [] > tcp_delack_timer+0x29/0xb1 > [ 170.395622] > but task is already holding lock: > [ 170.396943] ((timer)){+.-.}, at: [] > call_timer_fn+0x5/0x36b > [ 170.397912] > which lock already depends on the new lock. > > [ 170.398986] > the existing dependency chain (in reverse order) is: > [ 170.399965] > -> #1 ((timer)){+.-.}: > [ 170.400629]lock_acquire+0x154/0x220 > [ 170.401198]del_timer_sync+0x47/0xbd > [ 170.401760]inet_csk_reqsk_queue_drop+0x109/0x141 > [ 170.402464]inet_csk_complete_hashdance+0x3b/0x68 > [ 170.403173]tcp_check_req+0x517/0x5f1 > [ 170.403746]tcp_v4_rcv+0x6ad/0xce7 > [ 170.404287]ip_local_deliver_finish+0x1d4/0x281 > [ 170.404985]ip_local_deliver+0xaf/0xcf > [ 170.405571]ip_rcv_finish+0x632/0x6ff > [ 170.406140]ip_rcv+0x45d/0x4a6 > ... Yeah, please look at https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=timers/core=52f737c2da40259ac9962170ce608b6fb1b55ee4
Re: [BUG ?] ipv6: addrconf: Adds a missing in6_ifa_hold()
On 10/30/17 9:53 PM, Eric Dumazet wrote: > David, I was looking at addrconf_permanent_addr() and wondered > if there is not some problem with it. > > It seems we need to increment ifp refcount before calling > ipv6_del_addr() > > Could you double check if this patch is needed, I am guessing you have a > test suite exercising this code path ? A lot has changed in 20 months since the patch that added the code. For instance, taking down the 'lo' device no longer affects host routes on other interfaces. Also, fixup_permanent_addr only fails on memory allocation. Did you hit this with a test case because I do not have a general one that causes the memory failure (hard coding a failure for an address is the only way). > > Thanks. > > PS : Presumably CONFIG_REFCOUNT_FULL=y should have warned you of the > problem. I have not run a debug kernel in a while -- and did not have this option set. Added it to my debug config. > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index > 4a96ebbf8eda5f59a6ff88e836d666a404d2bf0d..8a1c846d3df949a4638589f187120db22a3525ba > 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -3335,6 +3335,7 @@ static void addrconf_permanent_addr(struct net_device > *dev) > if ((ifp->flags & IFA_F_PERMANENT) && > fixup_permanent_addr(idev, ifp) < 0) { > write_unlock_bh(>lock); > + in6_ifa_hold(ifp); > ipv6_del_addr(ifp); > write_lock_bh(>lock); > Yes, forcing a failure here does trigger refcnt warning, but then you knew that. ;-) PS. is the following a known failure? I triggered it looking into your report [ 170.385741] == [ 170.387490] WARNING: possible circular locking dependency detected [ 170.389214] 4.14.0-rc5+ #338 Not tainted [ 170.390323] -- [ 170.392017] swapper/0/0 is trying to acquire lock: [ 170.393408] (slock-AF_INET){+.-.}, at: [] tcp_delack_timer+0x29/0xb1 [ 170.395622] but task is already holding lock: [ 170.396943] ((timer)){+.-.}, at: [] call_timer_fn+0x5/0x36b [ 170.397912] which lock already depends on the new lock. [ 170.398986] the existing dependency chain (in reverse order) is: [ 170.399965] -> #1 ((timer)){+.-.}: [ 170.400629]lock_acquire+0x154/0x220 [ 170.401198]del_timer_sync+0x47/0xbd [ 170.401760]inet_csk_reqsk_queue_drop+0x109/0x141 [ 170.402464]inet_csk_complete_hashdance+0x3b/0x68 [ 170.403173]tcp_check_req+0x517/0x5f1 [ 170.403746]tcp_v4_rcv+0x6ad/0xce7 [ 170.404287]ip_local_deliver_finish+0x1d4/0x281 [ 170.404985]ip_local_deliver+0xaf/0xcf [ 170.405571]ip_rcv_finish+0x632/0x6ff [ 170.406140]ip_rcv+0x45d/0x4a6 ...
[PATCH net-next 1/2] net: phy: Cosmetic fixes to phylink/sfp/sfp-bus.c
Perform a number of stylistic changes to phylink.c, sfp.c and sfp-bus.c: - align with netdev-style comments - align function arguments to the opening parenthesis - remove blank lines - fixup a few lines over 80 columns Signed-off-by: Florian Fainelli--- drivers/net/phy/phylink.c | 13 ++--- drivers/net/phy/sfp-bus.c | 11 +++ drivers/net/phy/sfp.c | 27 +-- 3 files changed, 22 insertions(+), 29 deletions(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index bcb4755bcd95..05c8f1c10e36 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -357,7 +357,7 @@ static void phylink_get_fixed_state(struct phylink *pl, struct phylink_link_stat *1 1 0 1 TX */ static void phylink_resolve_flow(struct phylink *pl, - struct phylink_link_state *state) +struct phylink_link_state *state) { int new_pause = 0; @@ -506,7 +506,8 @@ static int phylink_register_sfp(struct phylink *pl, struct device_node *np) } struct phylink *phylink_create(struct net_device *ndev, struct device_node *np, - phy_interface_t iface, const struct phylink_mac_ops *ops) + phy_interface_t iface, + const struct phylink_mac_ops *ops) { struct phylink *pl; int ret; @@ -585,7 +586,7 @@ void phylink_phy_change(struct phy_device *phydev, bool up, bool do_carrier) phylink_run_resolve(pl); netdev_dbg(pl->netdev, "phy link %s %s/%s/%s\n", up ? "up" : "down", - phy_modes(phydev->interface), + phy_modes(phydev->interface), phy_speed_to_str(phydev->speed), phy_duplex_to_str(phydev->duplex)); } @@ -823,7 +824,7 @@ static void phylink_get_ksettings(const struct phylink_link_state *state, } int phylink_ethtool_ksettings_get(struct phylink *pl, - struct ethtool_link_ksettings *kset) + struct ethtool_link_ksettings *kset) { struct phylink_link_state link_state; @@ -870,7 +871,7 @@ int phylink_ethtool_ksettings_get(struct phylink *pl, EXPORT_SYMBOL_GPL(phylink_ethtool_ksettings_get); int phylink_ethtool_ksettings_set(struct phylink *pl, - const struct ethtool_link_ksettings *kset) + const struct ethtool_link_ksettings *kset) { struct ethtool_link_ksettings our_kset; struct phylink_link_state config; @@ -1337,8 +1338,6 @@ int phylink_mii_ioctl(struct phylink *pl, struct ifreq *ifr, int cmd) } EXPORT_SYMBOL_GPL(phylink_mii_ioctl); - - static int phylink_sfp_module_insert(void *upstream, const struct sfp_eeprom_id *id) { diff --git a/drivers/net/phy/sfp-bus.c b/drivers/net/phy/sfp-bus.c index 5cb5384697ea..8a1b1f4c1b7c 100644 --- a/drivers/net/phy/sfp-bus.c +++ b/drivers/net/phy/sfp-bus.c @@ -26,7 +26,6 @@ struct sfp_bus { bool started; }; - int sfp_parse_port(struct sfp_bus *bus, const struct sfp_eeprom_id *id, unsigned long *support) { @@ -208,7 +207,6 @@ void sfp_parse_support(struct sfp_bus *bus, const struct sfp_eeprom_id *id, } EXPORT_SYMBOL_GPL(sfp_parse_support); - static LIST_HEAD(sfp_buses); static DEFINE_MUTEX(sfp_mutex); @@ -295,7 +293,6 @@ static void sfp_unregister_bus(struct sfp_bus *bus) bus->registered = false; } - int sfp_get_module_info(struct sfp_bus *bus, struct ethtool_modinfo *modinfo) { if (!bus->registered) @@ -305,7 +302,7 @@ int sfp_get_module_info(struct sfp_bus *bus, struct ethtool_modinfo *modinfo) EXPORT_SYMBOL_GPL(sfp_get_module_info); int sfp_get_module_eeprom(struct sfp_bus *bus, struct ethtool_eeprom *ee, - u8 *data) + u8 *data) { if (!bus->registered) return -ENOIOCTLCMD; @@ -330,8 +327,8 @@ void sfp_upstream_stop(struct sfp_bus *bus) EXPORT_SYMBOL_GPL(sfp_upstream_stop); struct sfp_bus *sfp_register_upstream(struct device_node *np, - struct net_device *ndev, void *upstream, - const struct sfp_upstream_ops *ops) + struct net_device *ndev, void *upstream, + const struct sfp_upstream_ops *ops) { struct sfp_bus *bus = sfp_bus_get(np); int ret = 0; @@ -368,7 +365,6 @@ void sfp_unregister_upstream(struct sfp_bus *bus) } EXPORT_SYMBOL_GPL(sfp_unregister_upstream); - /* Socket driver entry points */ int sfp_add_phy(struct sfp_bus *bus, struct phy_device *phydev) { @@ -395,7 +391,6 @@ void sfp_remove_phy(struct sfp_bus *bus) } EXPORT_SYMBOL_GPL(sfp_remove_phy); - void sfp_link_up(struct sfp_bus *bus) { const struct sfp_upstream_ops *ops = sfp_get_upstream_ops(bus); diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c index baee371bf767..448465da0422 100644 ---
[PATCH net-next 0/2] PHYLINK cosmetic and build fixes
Hi David, Russell, Please find two small "fixes" one that corrects some stylistic changes and another one that fixes an actual build failure in sfp.c. Since PHYLINK is not directly visible to user, and there are no in-tree users yet (coming) this is not targeted at "net" but "net-next" instead. Thanks Florian Fainelli (2): net: phy: Cosmetic fixes to phylink/sfp/sfp-bus.c net: phy: Fix sfp.c build against GPIO definitions drivers/net/phy/phylink.c | 13 ++--- drivers/net/phy/sfp-bus.c | 11 +++ drivers/net/phy/sfp.c | 29 ++--- 3 files changed, 23 insertions(+), 30 deletions(-) -- 2.14.1
[PATCH net-next 2/2] net: phy: Fix sfp.c build against GPIO definitions
include/gpio.h does not contain the references we want, we should be including linux/gpio/consumer.h instead. Fixes: 73970055450e ("sfp: add SFP module support") Signed-off-by: Florian Fainelli--- drivers/net/phy/sfp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c index 448465da0422..e381811e5f11 100644 --- a/drivers/net/phy/sfp.c +++ b/drivers/net/phy/sfp.c @@ -1,5 +1,5 @@ #include -#include +#include #include #include #include -- 2.14.1
Re: [PATCH] net: caif: chnl_net: remove unnecessary null check in chnl_recv_cb
Quoting "Gustavo A. R. Silva": Quoting "Gustavo A. R. Silva" : Hi, Quoting "Gustavo A. R. Silva" : container_of is never null, so this null check is unnecessary. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva --- net/caif/chnl_net.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c index 922ac1d..489298d 100644 --- a/net/caif/chnl_net.c +++ b/net/caif/chnl_net.c @@ -77,8 +77,6 @@ static int chnl_recv_cb(struct cflayer *layr, struct cfpkt *pkt) u8 buf; priv = container_of(layr, struct chnl_net, chnl); - if (!priv) - return -EINVAL; But the driver only passes in to container_of the type of driver structure registered with it. So this type of manipulation must be completely safe. Right? Now it seems to me (again) that the patch is fine. I'm sorry, I'm a little bit confused here. skb = (struct sk_buff *) cfpkt_tonative(pkt); -- 2.7.4 Please, ignore this patch. I just realized that function chnl_recv_cb is being called only during initialization: chnl_init_module() -> rtnl_link_register() -> ipcaif_net_setup() -> chnl_recv_cb(): Well, here ipcaif_net_setup stores a pointer to chnl_recv_cb in a structure. It doesn't call it. static void ipcaif_net_setup(struct net_device *dev) { [...] priv = netdev_priv(dev); priv->chnl.receive = chnl_recv_cb; [...] } static struct rtnl_link_ops ipcaif_link_ops __read_mostly = { .kind = "caif", .priv_size = sizeof(struct chnl_net), .setup = ipcaif_net_setup, .maxtype= IFLA_CAIF_MAX, .policy = ipcaif_policy, .newlink= ipcaif_newlink, .changelink = ipcaif_changelink, .get_size = ipcaif_get_size, .fill_info = ipcaif_fill_info, }; static int __init chnl_init_module(void) { return rtnl_link_register(_link_ops); } -- Gustavo A. R. Silva
Re: Using the aesni generic gcm(aes) aead in atomic context
On Mon, Oct 30, 2017 at 03:18:21PM +, Ilya Lesokhin wrote: > Hi, > I've tried using the aesni generic gcm(aes) aead to implement TLS SW fallback > and > I'm getting > [ 3356.839506] BUG: sleeping function called from invalid context at > ./include/crypto/algapi.h:417 > > The warning is coming from a ___might_sleep() macro that is called if > CRYPTO_TFM_REQ_MAY_SLEEP is set. > I'm getting the warning regardless of if pass CRYPTO_ALG_ASYNC or 0 as flags > to crypto_alloc_aead("gcm(aes)", 0, flags). > > I've also noticed that rfc4106_encrypt() includes a irq_fpu_usable() check > while generic_gcmaes_encrypt() doesn't. > Is the generic gcm(aes) aead unsafe in atomic context? > And if so which aead should I use? > > Finally, out of curiosity, doesn't macsec crypto run in atomic context? Are you allocating the tfm from atomic context? That is not allowed. Normally you would allocate the tfm in process context, e.g., when the connection is setup. Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[BUG ?] ipv6: addrconf: Adds a missing in6_ifa_hold()
David, I was looking at addrconf_permanent_addr() and wondered if there is not some problem with it. It seems we need to increment ifp refcount before calling ipv6_del_addr() Could you double check if this patch is needed, I am guessing you have a test suite exercising this code path ? Thanks. PS : Presumably CONFIG_REFCOUNT_FULL=y should have warned you of the problem. diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 4a96ebbf8eda5f59a6ff88e836d666a404d2bf0d..8a1c846d3df949a4638589f187120db22a3525ba 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3335,6 +3335,7 @@ static void addrconf_permanent_addr(struct net_device *dev) if ((ifp->flags & IFA_F_PERMANENT) && fixup_permanent_addr(idev, ifp) < 0) { write_unlock_bh(>lock); + in6_ifa_hold(ifp); ipv6_del_addr(ifp); write_lock_bh(>lock);
Re: [PATCH net v4 2/3] dt-binding: net: sfp binding documentation
Hi Baruch, On 09/07/2017 02:25 AM, Baruch Siach wrote: > Add device-tree binding documentation SFP transceivers. Support for SFP > transceivers has been recently introduced (drivers/net/phy/sfp.c). > > Signed-off-by: Baruch Siach> --- > v4: > Remove redundant 'single' from the gpio specifier > Rename 'moddef0-gpios' property to 'mod-def0-gpios' > Remove 'phy-mode' property from the example; SFP determines the mode > > v3: > Mention gpios phandle and specifier > Mention the polarity of each gpio > Fix example property names > > v2: > Rename -gpio properties to -gpios > Rename the rate-select-gpio property to rate-select0-gpios > Add the rate-select1-gpios property > Add examples > --- > Documentation/devicetree/bindings/net/sff,sfp.txt | 76 > +++ > 1 file changed, 76 insertions(+) > create mode 100644 Documentation/devicetree/bindings/net/sff,sfp.txt > > diff --git a/Documentation/devicetree/bindings/net/sff,sfp.txt > b/Documentation/devicetree/bindings/net/sff,sfp.txt > new file mode 100644 > index ..60e970ce10ee > --- /dev/null > +++ b/Documentation/devicetree/bindings/net/sff,sfp.txt > @@ -0,0 +1,76 @@ > +Small Form Factor (SFF) Committee Small Form-factor Pluggable (SFP) > +Transceiver > + > +Required properties: > + > +- compatible : must be "sff,sfp" > + > +Optional Properties: > + > +- i2c-bus : phandle of an I2C bus controller for the SFP two wire serial > + interface What was the reasoning behind using this property instead of making the SFP a child of the i2c bus directly? Were you thinking that there could be systems where the SFP is not i2c-addresable, but another 2-wire protocol is used instead? This is not particularly wrong per-se I guess, but usually, the parent/child relationship should make that more obvious. Right now, we have to have a series of platform devices matching sff,sfp, so that puts some constraints on where these devices can be within a Device Tree. Sorry for catching up on this so late... Thanks! -- Florian
Re: [PATCH v2 net-next] ipv6: Implement limits on Hop-by-Hop and Destination options
On Tue, 2017-10-31 at 11:10 +0900, David Miller wrote: > From: Tom Herbert> Date: Mon, 30 Oct 2017 14:16:00 -0700 > > > I wrote a quick test program that floods a whole bunch of these > > packets to a host and sure enough there is substantial time spent > > in ip6_parse_tlv. > ... > > 25.38% [kernel][k] __fib6_clean_all > > 21.63% [kernel][k] ip6_parse_tlv > > Yet the routing code still dominates the cost. I am guessing the ip6_rt_max_size defaulting to 4096 needs to be revisited, after all per-cpu added stuff.
[PATCH net-next] bpf: document answers to common questions about BPF
to address common misconceptions about what BPF is and what it's not add short BPF Q that clarifies core BPF design principles and answers some common questions. Signed-off-by: Alexei StarovoitovAcked-by: Daniel Borkmann Acked-by: John Fastabend Acked-by: Jakub Kicinski --- Documentation/bpf/bpf_design_QA.txt | 156 MAINTAINERS | 1 + 2 files changed, 157 insertions(+) create mode 100644 Documentation/bpf/bpf_design_QA.txt diff --git a/Documentation/bpf/bpf_design_QA.txt b/Documentation/bpf/bpf_design_QA.txt new file mode 100644 index ..f3e458a0bb2f --- /dev/null +++ b/Documentation/bpf/bpf_design_QA.txt @@ -0,0 +1,156 @@ +BPF extensibility and applicability to networking, tracing, security +in the linux kernel and several user space implementations of BPF +virtual machine led to a number of misunderstanding on what BPF actually is. +This short QA is an attempt to address that and outline a direction +of where BPF is heading long term. + +Q: Is BPF a generic instruction set similar to x64 and arm64? +A: NO. + +Q: Is BPF a generic virtual machine ? +A: NO. + +BPF is generic instruction set _with_ C calling convention. + +Q: Why C calling convention was chosen? +A: Because BPF programs are designed to run in the linux kernel + which is written in C, hence BPF defines instruction set compatible + with two most used architectures x64 and arm64 (and takes into + consideration important quirks of other architectures) and + defines calling convention that is compatible with C calling + convention of the linux kernel on those architectures. + +Q: can multiple return values be supported in the future? +A: NO. BPF allows only register R0 to be used as return value. + +Q: can more than 5 function arguments be supported in the future? +A: NO. BPF calling convention only allows registers R1-R5 to be used + as arguments. BPF is not a standalone instruction set. + (unlike x64 ISA that allows msft, cdecl and other conventions) + +Q: can BPF programs access instruction pointer or return address? +A: NO. + +Q: can BPF programs access stack pointer ? +A: NO. Only frame pointer (register R10) is accessible. + From compiler point of view it's necessary to have stack pointer. + For example LLVM defines register R11 as stack pointer in its + BPF backend, but it makes sure that generated code never uses it. + +Q: Does C-calling convention diminishes possible use cases? +A: YES. BPF design forces addition of major functionality in the form + of kernel helper functions and kernel objects like BPF maps with + seamless interoperability between them. It lets kernel call into + BPF programs and programs call kernel helpers with zero overhead. + As all of them were native C code. That is particularly the case + for JITed BPF programs that are indistinguishable from + native kernel C code. + +Q: Does it mean that 'innovative' extensions to BPF code are disallowed? +A: Soft yes. At least for now until BPF core has support for + bpf-to-bpf calls, indirect calls, loops, global variables, + jump tables, read only sections and all other normal constructs + that C code can produce. + +Q: Can loops be supported in a safe way? +A: It's not clear yet. BPF developers are trying to find a way to + support bounded loops where the verifier can guarantee that + the program terminates in less than 4096 instructions. + +Q: How come LD_ABS and LD_IND instruction are present in BPF whereas + C code cannot express them and has to use builtin intrinsics? +A: This is artifact of compatibility with classic BPF. Modern + networking code in BPF performs better without them. + See 'direct packet access'. + +Q: It seems not all BPF instructions are one-to-one to native CPU. + For example why BPF_JNE and other compare and jumps are not cpu-like? +A: This was necessary to avoid introducing flags into ISA which are + impossible to make generic and efficient across CPU architectures. + +Q: why BPF_DIV instruction doesn't map to x64 div? +A: Because if we picked one-to-one relationship to x64 it would have made + it more complicated to support on arm64 and other archs. Also it + needs div-by-zero runtime check. + +Q: why there is no BPF_SDIV for signed divide operation? +A: Because it would be rarely used. llvm errors in such case and + prints a suggestion to use unsigned divide instead + +Q: Why BPF has implicit prologue and epilogue? +A: Because architectures like sparc have register windows and in general + there are enough subtle differences between architectures, so naive + store return address into stack won't work. Another reason is BPF has + to be safe from division by zero (and legacy exception path + of LD_ABS insn). Those instructions need to invoke epilogue and + return implicitly. + +Q: Why BPF_JLT and BPF_JLE
Re: [PATCH v2 net-next] ipv6: Implement limits on Hop-by-Hop and Destination options
On Mon, Oct 30, 2017 at 7:10 PM, David Millerwrote: > From: Tom Herbert > Date: Mon, 30 Oct 2017 14:16:00 -0700 > >> I wrote a quick test program that floods a whole bunch of these >> packets to a host and sure enough there is substantial time spent >> in ip6_parse_tlv. > ... >> 25.38% [kernel][k] __fib6_clean_all >> 21.63% [kernel][k] ip6_parse_tlv > > Yet the routing code still dominates the cost. I wouldn't read too much into that. This was unconnected UDP on VMs and the only purpose here was to demonstrate that ip6_parse_tlv does get a lot of work with a lot of TLVs. Martin's results listed in the tested section are probably a more accurate gauge of the impact and potential to mitigate DOS. Tom
Re: [PATCH] net: caif: chnl_net: remove unnecessary null check in chnl_recv_cb
Quoting "Gustavo A. R. Silva": Hi, Quoting "Gustavo A. R. Silva" : container_of is never null, so this null check is unnecessary. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva --- net/caif/chnl_net.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c index 922ac1d..489298d 100644 --- a/net/caif/chnl_net.c +++ b/net/caif/chnl_net.c @@ -77,8 +77,6 @@ static int chnl_recv_cb(struct cflayer *layr, struct cfpkt *pkt) u8 buf; priv = container_of(layr, struct chnl_net, chnl); - if (!priv) - return -EINVAL; skb = (struct sk_buff *) cfpkt_tonative(pkt); -- 2.7.4 Please, ignore this patch. I just realized that function chnl_recv_cb is being called only during initialization: chnl_init_module() -> rtnl_link_register() -> ipcaif_net_setup() -> chnl_recv_cb(): Well, here ipcaif_net_setup stores a pointer to chnl_recv_cb in a structure. It doesn't call it. static void ipcaif_net_setup(struct net_device *dev) { [...] priv = netdev_priv(dev); priv->chnl.receive = chnl_recv_cb; [...] } static struct rtnl_link_ops ipcaif_link_ops __read_mostly = { .kind = "caif", .priv_size = sizeof(struct chnl_net), .setup = ipcaif_net_setup, .maxtype= IFLA_CAIF_MAX, .policy = ipcaif_policy, .newlink= ipcaif_newlink, .changelink = ipcaif_changelink, .get_size = ipcaif_get_size, .fill_info = ipcaif_fill_info, }; static int __init chnl_init_module(void) { return rtnl_link_register(_link_ops); } -- Gustavo A. R. Silva
Re: [PATCH net] tc-testing: fix arg to ip command: -s -> -n
From: "Brenda J. Butler"Date: Mon, 30 Oct 2017 17:59:22 -0400 > Fixes: 31c2611b66e0 ("selftests: Introduce a new test case to tc testsuite") > Fixes: 76b903ee198d ("selftests: Introduce tc testsuite") > Signed-off-by: Brenda J. Butler Applied, thanks Brenda.
Re: [PATCH v2 net-next] ipv6: Implement limits on Hop-by-Hop and Destination options
From: Tom HerbertDate: Mon, 30 Oct 2017 14:16:00 -0700 > I wrote a quick test program that floods a whole bunch of these > packets to a host and sure enough there is substantial time spent > in ip6_parse_tlv. ... > 25.38% [kernel][k] __fib6_clean_all > 21.63% [kernel][k] ip6_parse_tlv Yet the routing code still dominates the cost.
Re: [Patch net] net_sched: remove tcf_block_put_deferred()
From: Cong WangDate: Mon, 30 Oct 2017 11:10:09 -0700 > In commit 7aa0045dadb6 ("net_sched: introduce a workqueue for RCU callbacks > of tc filter") > I defer tcf_chain_flush() to a workqueue, this causes a use-after-free > because qdisc is already destroyed after we queue this work. > > The tcf_block_put_deferred() is no longer necessary after we get RTNL > for each tc filter destroy work, no others could jump in at this point. > Same for tcf_chain_hold(), we are fully serialized now. > > This also reduces one indirection therefore makes the code more > readable. Note this brings back a rcu_barrier(), however comparing > to the code prior to commit 7aa0045dadb6 we still reduced one > rcu_barrier(). For net-next, we can consider to refcnt tcf block to > avoid it. > > Fixes: 7aa0045dadb6 ("net_sched: introduce a workqueue for RCU callbacks of > tc filter") > Cc: Daniel Borkmann > Cc: Jiri Pirko > Cc: John Fastabend > Cc: Jamal Hadi Salim > Cc: "Paul E. McKenney" > Cc: Eric Dumazet > Signed-off-by: Cong Wang Applied, thanks for fixing this use-after-free so quickly. This will be another fun merge into net-next :-)
Re: [PATCH net] l2tp: hold tunnel in pppol2tp_connect()
From: Guillaume NaultDate: Mon, 30 Oct 2017 17:58:58 +0100 > Use l2tp_tunnel_get() in pppol2tp_connect() to ensure the tunnel isn't > going to disappear while processing the rest of the function. > > Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp > parts") > Signed-off-by: Guillaume Nault > --- > > Note: in case of backporting to -stable, this patch only makes sense if > f3c66d4e144a ("l2tp: prevent creation of sessions on terminated tunnels") > is already present in the tree, as we need this issue to be fixed > before fixing the current one. > > The reason is that when connecting a session, we don't only depend on > the tunnel, but also on its socket. Therefore, holding a reference on > the tunnel is not enough, we also have to make sure that it's not going > to drop its socket before the session is registered. Applied and thank you for the detailed -stable explanation, it really helps.
Re: [PATCH net-next] tc-testing: correction to docstring in get_unique_item
On Mon, Oct 30, 2017 at 8:37 PM, Brenda J. Butlerwrote: > The docstring says the function returns a "set" but it returns > a "list". These are both python data types, so we should refer to > the right one that is being returned. > > Signed-off-by: Brenda J. Butler Acked-by: Lucas Bates
Re: [Patch net 00/16] net_sched: fix races with RCU callbacks
On Mon, Oct 30, 2017 at 7:12 PM, Cong Wangwrote: > On Mon, Oct 30, 2017 at 3:39 PM, Lucas Bates wrote: >> e.On Thu, Oct 26, 2017 at 9:24 PM, Cong Wang >> wrote: >>> Recently, the RCU callbacks used in TC filters and TC actions keep >>> drawing my attention, they introduce at least 4 race condition bugs: >> >>> As suggested by Paul, we could defer the work to a workqueue and >>> gain the permission of holding RTNL again without any performance >>> impact, however, in tcf_block_put() we could have a deadlock when >>> flushing workqueue while hodling RTNL lock, the trick here is to >>> defer the work itself in workqueue and make it queued after all >>> other works so that we keep the same ordering to avoid any >>> use-after-free. Please see the first patch for details. >> >> Cong, I don't believe the problem's been resolved just yet I have >> a new kernel, compiled just today and I'm still tripping over a kernel >> bug in this scenario when I run Chris' new test case. >> > > Without a stack trace, I can't do anything. "a kernel bug" could > be anything, why do you believe it is caused by this patchset? > (Sending again due to email format issues) The stack trace I saw touched on the same code that was affected by the patchset. I've sent a photo of the trace to you directly - I'm sorry, I should have sent it earlier. I also apologize for having to send the photo but I was doing the testing on a small device lacking any kind of serial console access. >> I'm doing this on a machine where I don't have a spare device to use >> on the run. Instead I created a veth pair that will have one end >> migrated into the container. >> >> The bug isn't consistent. I'm running into it anywhere between one and >> four runs of the d052 test case. >> >> Steps to reproduce: >> >> sudo ip li add foo type veth >> sudo ./tdc.py -d foo -c flower >> [repeat until kernel bug encountered] > > I tested with basic and u32 filter, since it is not specific to any type > of filters, all filters had the same problem. I will try flower filter too > before you provide a stack trace or a detailed report. If you need more information or need me to try something else, I'll be able to tomorrow.
Re: [PATCH 2/2] samples/bpf: add a test for bpf_override_return
On 10/30/17 2:19 PM, Josef Bacik wrote: + +rm -f testfile.img +dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1 +DEVICE=$(losetup --show -f testfile.img) +mkfs.btrfs -f $DEVICE +mkdir tmpmnt +./tracex7 $DEVICE +if [ $? -eq 0 ] +then + echo "SUCCESS!" +else + echo "FAILED!" +fi +losetup -d $DEVICE nice test! diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c new file mode 100644 index ..a2f74f736e66 --- /dev/null +++ b/samples/bpf/tracex7_kern.c @@ -0,0 +1,15 @@ +#include +#include +#include +#include "bpf_helpers.h" + +SEC("kprobe/open_ctree") +int bpf_prog1(struct pt_regs *ctx) +{ + unsigned long rc = -12; pls add empty line here. + bpf_override_return(ctx, rc); + return 0; +}
Re: [PATCH 1/2] bpf: add a bpf_override_function helper
On 10/30/17 2:19 PM, Josef Bacik wrote: From: Josef BacikError injection is sloppy and very ad-hoc. BPF could fill this niche perfectly with it's kprobe functionality. We could make sure errors are only triggered in specific call chains that we care about with very specific situations. Accomplish this with the bpf_override_funciton helper. This will modify the probe'd callers return value to the specified value and set the PC to an override function that simply returns, bypassing the originally probed function. This gives us a nice clean way to implement systematic error injection for all of our code paths. Signed-off-by: Josef Bacik --- arch/Kconfig | 3 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/kprobes.h | 4 arch/x86/include/asm/ptrace.h| 5 + arch/x86/kernel/kprobes/ftrace.c | 14 include/uapi/linux/bpf.h | 7 +- kernel/trace/Kconfig | 11 ++ kernel/trace/bpf_trace.c | 47 +++- kernel/trace/trace.h | 6 + kernel/trace/trace_kprobe.c | 23 ++-- 10 files changed, 108 insertions(+), 13 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index d789a89cb32c..4fb618082259 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -195,6 +195,9 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_KPROBE_OVERRIDE + bool + config HAVE_NMI bool diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 971feac13506..5126d2750dd0 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -152,6 +152,7 @@ config X86 select HAVE_KERNEL_XZ select HAVE_KPROBES select HAVE_KPROBES_ON_FTRACE + select HAVE_KPROBE_OVERRIDE select HAVE_KRETPROBES select HAVE_KVM select HAVE_LIVEPATCH if X86_64 diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h index 6cf65437b5e5..c6c3b1f4306a 100644 --- a/arch/x86/include/asm/kprobes.h +++ b/arch/x86/include/asm/kprobes.h @@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size; void arch_remove_kprobe(struct kprobe *p); asmlinkage void kretprobe_trampoline(void); +#ifdef CONFIG_KPROBES_ON_FTRACE +extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs); +#endif + /* Architecture specific copy of original instruction*/ struct arch_specific_insn { /* copy of the original instruction */ diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 91c04c8e67fa..f04e71800c2f 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs) return regs->ax; } +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) +{ + regs->ax = rc; +} + /* * user_mode(regs) determines whether a register set came from user * mode. On x86_32, this is true if V8086 mode was enabled OR if the diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c index 041f7b6dfa0f..3c455bf490cb 100644 --- a/arch/x86/kernel/kprobes/ftrace.c +++ b/arch/x86/kernel/kprobes/ftrace.c @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p) p->ainsn.boostable = false; return 0; } + +asmlinkage void override_func(void); +asm( + ".type override_func, @function\n" + "override_func:\n" + " ret\n" + ".size override_func, .-override_func\n" +); + +void arch_ftrace_kprobe_override_function(struct pt_regs *regs) +{ + regs->ip = (unsigned long)_func; +} +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0b7b54d898bd..1ad5b87a42f6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -673,6 +673,10 @@ union bpf_attr { * @buf: buf to fill * @buf_size: size of the buf * Return : 0 on success or negative error code + * + * int bpf_override_return(pt_regs, rc) + * @pt_regs: pointer to struct pt_regs + * @rc: the return value to set */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -732,7 +736,8 @@ union bpf_attr { FN(xdp_adjust_meta),\ FN(perf_event_read_value), \ FN(perf_prog_read_value), \ - FN(getsockopt), + FN(getsockopt), \ + FN(override_return), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 434c840e2d82..9dc0deeaad2b 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -518,6 +518,17 @@ config FUNCTION_PROFILER If in doubt, say N. +config BPF_KPROBE_OVERRIDE + bool "Enable BPF programs to override a
[PATCH net-next] tc-testing: correction to docstring in get_unique_item
The docstring says the function returns a "set" but it returns a "list". These are both python data types, so we should refer to the right one that is being returned. Signed-off-by: Brenda J. Butler--- This patch was formerly sent as part of a 15-patch series. It is being split off as a stand-alone patch. tools/testing/selftests/tc-testing/tdc_helper.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/tc-testing/tdc_helper.py b/tools/testing/selftests/tc-testing/tdc_helper.py index c3254f861fb2..6f99a4efe761 100644 --- a/tools/testing/selftests/tc-testing/tdc_helper.py +++ b/tools/testing/selftests/tc-testing/tdc_helper.py @@ -15,7 +15,7 @@ def get_categorized_testlist(alltests, ucat): def get_unique_item(lst): -""" For a list, return a set of the unique items in the list. """ +""" For a list, return a list of the unique items in the list. """ return list(set(lst)) -- 2.15.0.rc0
Re: [PATCH net-next] net: filter: remove unused variable and fix warning
From: Jakub KicinskiDate: Mon, 30 Oct 2017 13:46:47 -0700 > bpf_getsockopt bpf call sets the ret variable to zero and > never changes it. What's worse in case CONFIG_INET is > not selected the variable is completely unused generating > a warning. > > Signed-off-by: Jakub Kicinski > Reviewed-by: Quentin Monnet Applied.
Re: [PATCH] net: caif: chnl_net: remove unnecessary null check in chnl_recv_cb
Hi, Quoting "Gustavo A. R. Silva": container_of is never null, so this null check is unnecessary. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva --- net/caif/chnl_net.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c index 922ac1d..489298d 100644 --- a/net/caif/chnl_net.c +++ b/net/caif/chnl_net.c @@ -77,8 +77,6 @@ static int chnl_recv_cb(struct cflayer *layr, struct cfpkt *pkt) u8 buf; priv = container_of(layr, struct chnl_net, chnl); - if (!priv) - return -EINVAL; skb = (struct sk_buff *) cfpkt_tonative(pkt); -- 2.7.4 Please, ignore this patch. I just realized that function chnl_recv_cb is being called only during initialization: chnl_init_module() -> rtnl_link_register() -> ipcaif_net_setup() -> chnl_recv_cb(): static void ipcaif_net_setup(struct net_device *dev) { [...] priv = netdev_priv(dev); priv->chnl.receive = chnl_recv_cb; [...] } static struct rtnl_link_ops ipcaif_link_ops __read_mostly = { .kind = "caif", .priv_size = sizeof(struct chnl_net), .setup = ipcaif_net_setup, .maxtype= IFLA_CAIF_MAX, .policy = ipcaif_policy, .newlink= ipcaif_newlink, .changelink = ipcaif_changelink, .get_size = ipcaif_get_size, .fill_info = ipcaif_fill_info, }; static int __init chnl_init_module(void) { return rtnl_link_register(_link_ops); } Thanks -- Gustavo A. R. Silva
Re: [PATCH v2 net-next] net: display hw address of source machine during ipv6 DAD failure
On 10/30/17 5:38 PM, Vishwanath Pai wrote: > This patch updates the error messages displayed in kernel log to include > hwaddress of the source machine that caused ipv6 duplicate address > detection failures. > > Examples: > > a) When we receive a NA packet from another machine advertising our > address: > > ICMPv6: NA: 34:ab:cd:56:11:e8 advertised our address 2001:db8:: on eth0! > > b) When we detect DAD failure during address assignment to an interface: > > IPv6: eth0: IPv6 duplicate address 2001:db8:: used by 34:ab:cd:56:11:e8 > detected! > > v2: > Changed %pI6 to %pI6c in ndisc_recv_na() > Chaged the v6 address in the commit message to 2001:db8:: > > Suggested-by: Igor Lubashev> Signed-off-by: Vishwanath Pai > --- > include/net/addrconf.h | 2 +- > net/ipv6/addrconf.c| 6 +++--- > net/ipv6/ndisc.c | 9 + > 3 files changed, 9 insertions(+), 8 deletions(-) LGTM Acked-by: David Ahern
[PATCH v2 net-next] net: display hw address of source machine during ipv6 DAD failure
This patch updates the error messages displayed in kernel log to include hwaddress of the source machine that caused ipv6 duplicate address detection failures. Examples: a) When we receive a NA packet from another machine advertising our address: ICMPv6: NA: 34:ab:cd:56:11:e8 advertised our address 2001:db8:: on eth0! b) When we detect DAD failure during address assignment to an interface: IPv6: eth0: IPv6 duplicate address 2001:db8:: used by 34:ab:cd:56:11:e8 detected! v2: Changed %pI6 to %pI6c in ndisc_recv_na() Chaged the v6 address in the commit message to 2001:db8:: Suggested-by: Igor LubashevSigned-off-by: Vishwanath Pai --- include/net/addrconf.h | 2 +- net/ipv6/addrconf.c| 6 +++--- net/ipv6/ndisc.c | 9 + 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index 15b5ffd..2a616ea 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -208,7 +208,7 @@ bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr, void ipv6_mc_init_dev(struct inet6_dev *idev); void ipv6_mc_destroy_dev(struct inet6_dev *idev); int ipv6_mc_check_mld(struct sk_buff *skb, struct sk_buff **skb_trimmed); -void addrconf_dad_failure(struct inet6_ifaddr *ifp); +void addrconf_dad_failure(struct sk_buff *skb, struct inet6_ifaddr *ifp); bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, const struct in6_addr *src_addr); diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 5a8a102..cfa374c 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1987,7 +1987,7 @@ static int addrconf_dad_end(struct inet6_ifaddr *ifp) return err; } -void addrconf_dad_failure(struct inet6_ifaddr *ifp) +void addrconf_dad_failure(struct sk_buff *skb, struct inet6_ifaddr *ifp) { struct inet6_dev *idev = ifp->idev; struct net *net = dev_net(ifp->idev->dev); @@ -1997,8 +1997,8 @@ void addrconf_dad_failure(struct inet6_ifaddr *ifp) return; } - net_info_ratelimited("%s: IPv6 duplicate address %pI6c detected!\n", -ifp->idev->dev->name, >addr); + net_info_ratelimited("%s: IPv6 duplicate address %pI6c used by %pM detected!\n", +ifp->idev->dev->name, >addr, eth_hdr(skb)->h_source); spin_lock_bh(>lock); diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index 266a530..f9c3ffe 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -46,6 +46,7 @@ #endif #include +#include #include #include #include @@ -822,7 +823,7 @@ static void ndisc_recv_ns(struct sk_buff *skb) * who is doing DAD * so fail our DAD process */ - addrconf_dad_failure(ifp); + addrconf_dad_failure(skb, ifp); return; } else { /* @@ -975,7 +976,7 @@ static void ndisc_recv_na(struct sk_buff *skb) if (ifp) { if (skb->pkt_type != PACKET_LOOPBACK && (ifp->flags & IFA_F_TENTATIVE)) { - addrconf_dad_failure(ifp); + addrconf_dad_failure(skb, ifp); return; } /* What should we make now? The advertisement @@ -989,8 +990,8 @@ static void ndisc_recv_na(struct sk_buff *skb) */ if (skb->pkt_type != PACKET_LOOPBACK) ND_PRINTK(1, warn, - "NA: someone advertises our address %pI6 on %s!\n", - >addr, ifp->idev->dev->name); + "NA: %pM advertised our address %pI6c on %s!\n", + eth_hdr(skb)->h_source, >addr, ifp->idev->dev->name); in6_ifa_put(ifp); return; } -- 1.9.1
Re: [Patch net 00/16] net_sched: fix races with RCU callbacks
On Mon, Oct 30, 2017 at 3:39 PM, Lucas Bateswrote: > e.On Thu, Oct 26, 2017 at 9:24 PM, Cong Wang wrote: >> Recently, the RCU callbacks used in TC filters and TC actions keep >> drawing my attention, they introduce at least 4 race condition bugs: > >> As suggested by Paul, we could defer the work to a workqueue and >> gain the permission of holding RTNL again without any performance >> impact, however, in tcf_block_put() we could have a deadlock when >> flushing workqueue while hodling RTNL lock, the trick here is to >> defer the work itself in workqueue and make it queued after all >> other works so that we keep the same ordering to avoid any >> use-after-free. Please see the first patch for details. > > Cong, I don't believe the problem's been resolved just yet I have > a new kernel, compiled just today and I'm still tripping over a kernel > bug in this scenario when I run Chris' new test case. > Without a stack trace, I can't do anything. "a kernel bug" could be anything, why do you believe it is caused by this patchset? Note, there is a use-after-free bug caused by this patchset, the fix is already on netdev in case you hit the same bug. > I'm doing this on a machine where I don't have a spare device to use > on the run. Instead I created a veth pair that will have one end > migrated into the container. > > The bug isn't consistent. I'm running into it anywhere between one and > four runs of the d052 test case. > > Steps to reproduce: > > sudo ip li add foo type veth > sudo ./tdc.py -d foo -c flower > [repeat until kernel bug encountered] I tested with basic and u32 filter, since it is not specific to any type of filters, all filters had the same problem. I will try flower filter too before you provide a stack trace or a detailed report.
Re: [PATCH net] tun/tap: sanitize TUNSETSNDBUF input
On Mon, 2017-10-30 at 18:50 -0400, Craig Gallek wrote: > From: Craig Gallek> > Syzkaller found several variants of the lockup below by setting negative > values with the TUNSETSNDBUF ioctl. This patch adds a sanity check > to both the tun and tap versions of this ioctl. > Reviewed-by: Eric Dumazet Thanks !
[PATCH net] tun/tap: sanitize TUNSETSNDBUF input
From: Craig GallekSyzkaller found several variants of the lockup below by setting negative values with the TUNSETSNDBUF ioctl. This patch adds a sanity check to both the tun and tap versions of this ioctl. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [repro:2389] Modules linked in: irq event stamp: 329692056 hardirqs last enabled at (329692055): [] _raw_spin_unlock_irqrestore+0x31/0x75 hardirqs last disabled at (329692056): [] apic_timer_interrupt+0x98/0xb0 softirqs last enabled at (35659740): [] __do_softirq+0x328/0x48c softirqs last disabled at (35659731): [] irq_exit+0xbc/0xd0 CPU: 0 PID: 2389 Comm: repro Not tainted 4.14.0-rc7 #23 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: 880009452140 task.stack: 880006a2 RIP: 0010:_raw_spin_lock_irqsave+0x11/0x80 RSP: 0018:880006a27c50 EFLAGS: 0282 ORIG_RAX: ff10 RAX: 880009ac68d0 RBX: 880006a27ce0 RCX: RDX: 0001 RSI: 880006a27ce0 RDI: 880009ac6900 RBP: 880006a27c60 R08: R09: R10: 0001 R11: 0063ff00 R12: 880009ac6900 R13: 880006a27cf8 R14: 0001 R15: 880006a27cf8 FS: 7f4be4838700() GS:88000cc0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 20101000 CR3: 09616000 CR4: 06f0 Call Trace: prepare_to_wait+0x26/0xc0 sock_alloc_send_pskb+0x14e/0x270 ? remove_wait_queue+0x60/0x60 tun_get_user+0x2cc/0x19d0 ? __tun_get+0x60/0x1b0 tun_chr_write_iter+0x57/0x86 __vfs_write+0x156/0x1e0 vfs_write+0xf7/0x230 SyS_write+0x57/0xd0 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x7f4be4356df9 RSP: 002b:7ffc18101c08 EFLAGS: 0293 ORIG_RAX: 0001 RAX: ffda RBX: RCX: 7f4be4356df9 RDX: 0046 RSI: 20101000 RDI: 0005 RBP: 7ffc18101c40 R08: 0001 R09: 0001 R10: 0001 R11: 0293 R12: 559c75f64780 R13: 7ffc18101d30 R14: R15: Fixes: 33dccbb050bb ("tun: Limit amount of queued packets per device") Fixes: 20d29d7a916a ("net: macvtap driver") Signed-off-by: Craig Gallek --- drivers/net/tap.c | 2 ++ drivers/net/tun.c | 4 2 files changed, 6 insertions(+) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index 1b10fcc6a58d..6c0c84c33e1f 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1032,6 +1032,8 @@ static long tap_ioctl(struct file *file, unsigned int cmd, case TUNSETSNDBUF: if (get_user(s, sp)) return -EFAULT; + if (s <= 0) + return -EINVAL; q->sk.sk_sndbuf = s; return 0; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 5550f56cb895..42bb820a56c9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2429,6 +2429,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, ret = -EFAULT; break; } + if (sndbuf <= 0) { + ret = -EINVAL; + break; + } tun->sndbuf = sndbuf; tun_set_sndbuf(tun); -- 2.15.0.rc2.357.g7e34df9404-goog
[PATCH] net: caif: chnl_net: remove unnecessary null check in chnl_recv_cb
container_of is never null, so this null check is unnecessary. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva--- net/caif/chnl_net.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c index 922ac1d..489298d 100644 --- a/net/caif/chnl_net.c +++ b/net/caif/chnl_net.c @@ -77,8 +77,6 @@ static int chnl_recv_cb(struct cflayer *layr, struct cfpkt *pkt) u8 buf; priv = container_of(layr, struct chnl_net, chnl); - if (!priv) - return -EINVAL; skb = (struct sk_buff *) cfpkt_tonative(pkt); -- 2.7.4
Re: [Patch net 00/16] net_sched: fix races with RCU callbacks
e.On Thu, Oct 26, 2017 at 9:24 PM, Cong Wangwrote: > Recently, the RCU callbacks used in TC filters and TC actions keep > drawing my attention, they introduce at least 4 race condition bugs: > As suggested by Paul, we could defer the work to a workqueue and > gain the permission of holding RTNL again without any performance > impact, however, in tcf_block_put() we could have a deadlock when > flushing workqueue while hodling RTNL lock, the trick here is to > defer the work itself in workqueue and make it queued after all > other works so that we keep the same ordering to avoid any > use-after-free. Please see the first patch for details. Cong, I don't believe the problem's been resolved just yet I have a new kernel, compiled just today and I'm still tripping over a kernel bug in this scenario when I run Chris' new test case. I'm doing this on a machine where I don't have a spare device to use on the run. Instead I created a veth pair that will have one end migrated into the container. The bug isn't consistent. I'm running into it anywhere between one and four runs of the d052 test case. Steps to reproduce: sudo ip li add foo type veth sudo ./tdc.py -d foo -c flower [repeat until kernel bug encountered]
[PATCH] netfilter: ipset: use swap macro instead of _manually_ swapping values
Make use of the swap macro and remove unnecessary variables tmp. This makes the code easier to read and maintain. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva--- net/netfilter/ipset/ip_set_bitmap_ip.c| 8 ++-- net/netfilter/ipset/ip_set_bitmap_ipmac.c | 8 ++-- net/netfilter/ipset/ip_set_bitmap_port.c | 8 ++-- 3 files changed, 6 insertions(+), 18 deletions(-) diff --git a/net/netfilter/ipset/ip_set_bitmap_ip.c b/net/netfilter/ipset/ip_set_bitmap_ip.c index d8975a0..488d6d0 100644 --- a/net/netfilter/ipset/ip_set_bitmap_ip.c +++ b/net/netfilter/ipset/ip_set_bitmap_ip.c @@ -263,12 +263,8 @@ bitmap_ip_create(struct net *net, struct ip_set *set, struct nlattr *tb[], ret = ip_set_get_hostipaddr4(tb[IPSET_ATTR_IP_TO], _ip); if (ret) return ret; - if (first_ip > last_ip) { - u32 tmp = first_ip; - - first_ip = last_ip; - last_ip = tmp; - } + if (first_ip > last_ip) + swap(first_ip, last_ip); } else if (tb[IPSET_ATTR_CIDR]) { u8 cidr = nla_get_u8(tb[IPSET_ATTR_CIDR]); diff --git a/net/netfilter/ipset/ip_set_bitmap_ipmac.c b/net/netfilter/ipset/ip_set_bitmap_ipmac.c index 4c279fb..c00b6a2 100644 --- a/net/netfilter/ipset/ip_set_bitmap_ipmac.c +++ b/net/netfilter/ipset/ip_set_bitmap_ipmac.c @@ -337,12 +337,8 @@ bitmap_ipmac_create(struct net *net, struct ip_set *set, struct nlattr *tb[], ret = ip_set_get_hostipaddr4(tb[IPSET_ATTR_IP_TO], _ip); if (ret) return ret; - if (first_ip > last_ip) { - u32 tmp = first_ip; - - first_ip = last_ip; - last_ip = tmp; - } + if (first_ip > last_ip) + swap(first_ip, last_ip); } else if (tb[IPSET_ATTR_CIDR]) { u8 cidr = nla_get_u8(tb[IPSET_ATTR_CIDR]); diff --git a/net/netfilter/ipset/ip_set_bitmap_port.c b/net/netfilter/ipset/ip_set_bitmap_port.c index 7f9bbd7..b561ca8 100644 --- a/net/netfilter/ipset/ip_set_bitmap_port.c +++ b/net/netfilter/ipset/ip_set_bitmap_port.c @@ -238,12 +238,8 @@ bitmap_port_create(struct net *net, struct ip_set *set, struct nlattr *tb[], first_port = ip_set_get_h16(tb[IPSET_ATTR_PORT]); last_port = ip_set_get_h16(tb[IPSET_ATTR_PORT_TO]); - if (first_port > last_port) { - u16 tmp = first_port; - - first_port = last_port; - last_port = tmp; - } + if (first_port > last_port) + swap(first_port, last_port); elements = last_port - first_port + 1; set->dsize = ip_set_elem_len(set, tb, 0, 0); -- 2.7.4
[PATCH net-next] net: dsa: b53: Have b53_hdr_setup() enable/disable tagging
Have b53_hdr_setup() check what kind of tagging protocol is configured (Broadcom or none) and apply the correct settings in both cases. Signed-off-by: Florian Fainelli--- drivers/net/dsa/b53/b53_common.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c index fa37f501f10b..a7ca62ba27b7 100644 --- a/drivers/net/dsa/b53/b53_common.c +++ b/drivers/net/dsa/b53/b53_common.c @@ -545,6 +545,7 @@ EXPORT_SYMBOL(b53_disable_port); void b53_brcm_hdr_setup(struct dsa_switch *ds, int port) { + bool tag_en = !!(ds->ops->get_tag_protocol(ds) == DSA_TAG_PROTO_BRCM); struct b53_device *dev = ds->priv; u8 hdr_ctl, val; u16 reg; @@ -567,7 +568,10 @@ void b53_brcm_hdr_setup(struct dsa_switch *ds, int port) /* Enable Broadcom tags for IMP port */ b53_read8(dev, B53_MGMT_PAGE, B53_BRCM_HDR, _ctl); - hdr_ctl |= val; + if (tag_en) + hdr_ctl |= val; + else + hdr_ctl &= ~val; b53_write8(dev, B53_MGMT_PAGE, B53_BRCM_HDR, hdr_ctl); /* Registers below are only accessible on newer devices */ @@ -578,14 +582,20 @@ void b53_brcm_hdr_setup(struct dsa_switch *ds, int port) * allow us to tag outgoing frames */ b53_read16(dev, B53_MGMT_PAGE, B53_BRCM_HDR_RX_DIS, ); - reg &= ~BIT(port); + if (tag_en) + reg &= ~BIT(port); + else + reg |= BIT(port); b53_write16(dev, B53_MGMT_PAGE, B53_BRCM_HDR_RX_DIS, reg); /* Enable transmission of Broadcom tags from the switch (CPU RX) to * allow delivering frames to the per-port net_devices */ b53_read16(dev, B53_MGMT_PAGE, B53_BRCM_HDR_TX_DIS, ); - reg &= ~BIT(port); + if (tag_en) + reg &= ~BIT(port); + else + reg |= BIT(port); b53_write16(dev, B53_MGMT_PAGE, B53_BRCM_HDR_TX_DIS, reg); } EXPORT_SYMBOL(b53_brcm_hdr_setup); -- 2.9.3
Re: [PATCH net-next v5 01/10] devlink: Add permanent config parameter get/set operations
On Mon, 30 Oct 2017 18:03:01 +0100, Jiri Pirko wrote: > >+cfgparam_attr = nla_nest_start(msg, DEVLINK_ATTR_PERM_CONFIG); > >+/* Update restart reqd - if any param needs restart, should be set */ > >+if (need_restart) { > > You should propagate this out so the caller would fill it to the > message. This is a global thing, not per-param, shoult not be nested. Right, I think Jiri has already asked for this but I feel like we should provide the ability to query running/pending configurations independently. I don't see it in this patch set? I'm not sure what the status of the reconfig trigger patches for mlxsw is, but we actually may need 3 config sets: - current/runtime configurable, - requiring soft reset of the device/driver reinit; - requiring hard reset/set on boot. Secondly, IMHO calling set/get parameters "permanent" is a bit backwards. One device may not be able to change max VF counts or MSIX allocation without full reinit of PCIe blocks, but for others soft reset is more than enough. Port splitting is another example. For NICs port splitting at runtime is usually not a priority in HW/FW development, so some form of reset is generally required, while switches can split a port at runtime. IOW we should define parameters without assigning them to config sets in the ABI itself. And also we should make it in a way which will allow existing parameters to be reused in permanent/sort reset required/runtime modes. Does that make sense?
Re: [PATCH v2 net-next] tcp: add tracepoint trace_tcp_retransmit_synack()
On Mon, 2017-10-30 at 21:43 +, Song Liu wrote: > CCing key audience of the patch. > Please stop doing this for every patch you are sending. Include all relevant CC to your patch submission, and that's enough. Thanks.
Re: WARNING in strp_data_ready
On Mon, Oct 30, 2017 at 2:44 PM, John Fastabendwrote: > On 10/24/2017 08:20 AM, syzbot wrote: >> Hello, >> >> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master >> compiler: gcc (GCC) 7.1.1 20170620 >> .config is attached >> Raw console output is attached. >> C reproducer is attached >> syzkaller reproducer is attached. See https://goo.gl/kgGztJ >> for information about syzkaller reproducers >> >> >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me >> include/net/sock.h:1505 [inline] >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user >> include/net/sock.h:1511 [inline] >> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 >> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >> Kernel panic - not syncing: panic_on_warn set ... >> >> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS >> Google 01/01/2011 >> Call Trace: >> >> __dump_stack lib/dump_stack.c:16 [inline] >> dump_stack+0x194/0x257 lib/dump_stack.c:52 >> panic+0x1e4/0x417 kernel/panic.c:181 >> __warn+0x1c4/0x1d9 kernel/panic.c:542 >> report_bug+0x211/0x2d0 lib/bug.c:183 >> fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 >> do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] >> do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 >> do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 >> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 >> invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 >> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] >> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] >> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >> RSP: 0018:8801db206b18 EFLAGS: 00010206 >> RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: >> RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0 >> RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd >> R10: 8801db206a00 R11: dc00 R12: 8801d29fa000 >> R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001 >> psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 > > Looks like KCM is calling sk_data_ready() without first taking the > sock lock. > > /* Called with lower sock held */ > static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb) > { > [...] > if (kcm_queue_rcv_skb(>sk, skb)) { > > In this case kcm->sk is not the same lock the comment is referring to. > And kcm_queue_rcv_skb() will eventually call sk_data_ready(). > > @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock? > I don't have anything better in mind immediately. > The sock locks are taken in reverse order in the send path so so grabbing kcm sock lock with lower lock held to call sk_data_ready may lead to deadlock like I think. It might be possible to change the order in the send path to do this. Something like: trylock on lower socket lock -if trylock fails - release kcm sock lock - lock lower sock - lock kcm sock - call sendpage locked function I admit that dealing with two levels of socket locks in the data path is quite a pain :-) Tom > Thanks, > John
[PATCH net] tc-testing: fix arg to ip command: -s -> -n
Fixes: 31c2611b66e0 ("selftests: Introduce a new test case to tc testsuite") Fixes: 76b903ee198d ("selftests: Introduce tc testsuite") Signed-off-by: Brenda J. Butler--- We want to run the command inside the container, not in the host. tools/testing/selftests/tc-testing/tdc.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py index 5f11f5d7456e..a8981c5d0aaf 100755 --- a/tools/testing/selftests/tc-testing/tdc.py +++ b/tools/testing/selftests/tc-testing/tdc.py @@ -152,11 +152,11 @@ def ns_create(): exec_cmd(cmd, False) cmd = 'ip link set $DEV0 up' exec_cmd(cmd, False) -cmd = 'ip -s $NS link set $DEV1 up' +cmd = 'ip -n $NS link set $DEV1 up' exec_cmd(cmd, False) cmd = 'ip link set $DEV2 netns $NS' exec_cmd(cmd, False) -cmd = 'ip -s $NS link set $DEV2 up' +cmd = 'ip -n $NS link set $DEV2 up' exec_cmd(cmd, False) -- 2.15.0.rc0
Re: [lkp-robot] [bpf] 76cdd39f41: WARNING:trinity-c0_still_has_locks_held
I am not able to reproduce the bug with latest net-next. Thanks Daniel to point it out, looks like your test uses an old-version unapplied patchset: https://github.com/0day-ci/linux/commit/76cdd39f4117a6cbd520b5d09993ac87acbdcfd8 which yes, there is a bug to leak the mutext lock and the bug is fixed in subsequent patch set. FYI, we noticed the following commit (built with gcc-4.8): commit: 76cdd39f4117a6cbd520b5d09993ac87acbdcfd8 ("bpf: permit multiple bpf attachments for a single perf event") url: https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-permit-multiple-bpf-attachments-for-a-single-perf-tracepoint-event/20171024-080608 in testcase: boot on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 512M caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): ++++ || 052bd6a4fb | 76cdd39f41 | ++++ | boot_successes | 64 | 66 | ++++ [ 50.247341] WARNING: trinity-c0/3371 still has locks held! [main] Random reseed: 4199236289 [ 50.258948] 4.14.0-rc5-01671-g76cdd39 #1 Not tainted [ 50.260888] [child0:3373] io_getevents (208) returned ENOSYS, marking as inactive. [child0:3373] sysfs (139) returned ENOSYS, marking as inactive. [ 50.273066] 1 lock held by trinity-c0/3371: [ 50.274247] #0: (bpf_event_mutex){}, at: [] perf_event_detach_bpf_prog+0x17/0xd0 [child0:3373] fanotify_mark (301) returned ENOSYS, marking as inactive. *** glibc detected *** /trinity: double free or corruption (out): 0x01d51000 *** [ 50.288585] [ 50.288585] stack backtrace: [ 50.290123] CPU: 1 PID: 3371 Comm: trinity-c0 Not tainted 4.14.0-rc5-01671-g76cdd39 #1 [ 50.292804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 50.295763] Call Trace: [ 50.296791] dump_stack+0xcb/0x13f [ 50.298244] do_exit+0xf1f/0x1380 [ 50.299525] ? syscall_trace_enter+0x3ce/0x490 [ 50.301186] do_group_exit+0x9d/0x120 [ 50.302531] SyS_exit_group+0xb/0x10 [ 50.303746] do_syscall_64+0xa6/0x240 [ 50.305256] entry_SYSCALL64_slow_path+0x25/0x25 [ 50.306847] RIP: 0033:0x7f60b8dae408 [ 50.307944] RSP: 002b:7fff020656e8 EFLAGS: 0206 ORIG_RAX: 00e7 [ 50.310457] RAX: ffda RBX: RCX: 7f60b8dae408 [ 50.312482] RDX: RSI: 003c RDI: [ 50.314526] RBP: 7fff020665a0 R08: 00e7 R09: ffa0 [ 50.316981] R10: 7fff02065480 R11: 0206 R12: 0299 [ 50.319413] R13: 0059 R14: 7fff02065d10 R15: 0002 === Backtrace: = To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k job-script # job-script is attached in this email Thanks, Xiaolong
Re: [PATCH net-next] net: filter: remove unused variable and fix warning
On Mon, Oct 30, 2017 at 01:46:47PM -0700, Jakub Kicinski wrote: > bpf_getsockopt bpf call sets the ret variable to zero and > never changes it. What's worse in case CONFIG_INET is > not selected the variable is completely unused generating > a warning. > > Signed-off-by: Jakub Kicinski> Reviewed-by: Quentin Monnet Acked-by: Alexei Starovoitov
Re: [PATCH net-next] net: filter: remove unused variable and fix warning
On 10/30/2017 09:46 PM, Jakub Kicinski wrote: bpf_getsockopt bpf call sets the ret variable to zero and never changes it. What's worse in case CONFIG_INET is not selected the variable is completely unused generating a warning. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet Acked-by: Daniel Borkmann
Re: [PATCH] netfilter: ipset: ip_set_bitmap_ipmac: use swap macro in bitmap_ipmac_create
Hi Jozsef, Quoting Jozsef Kadlecsik: Hi, On Sat, 28 Oct 2017, Gustavo A. R. Silva wrote: Make use of the swap macro and remove unnecessary variable tmp. This makes the code easier to read and maintain. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva Please resubmit the tree patches as a single one, they do the same thing. Thanks! OK. I'll do that. Thanks -- Gustavo A. R. Silva
[PATCH net-next] bpf: reduce verifier memory consumption
the verifier got progressively smarter over time and size of its internal state grew as well. Time to reduce the memory consumption. Before: sizeof(struct bpf_verifier_state) = 6520 After: sizeof(struct bpf_verifier_state) = 896 It's done by observing that majority of BPF programs use little to no stack whereas verifier kept all of 512 stack slots ready always. Instead dynamically reallocate struct verifier state when stack access is detected. Besides memory savings such approach gives few % runtime perf improvement as well and small reduction in number of processed insns: before after bpf_lb-DLB_L3.o2285 2043 bpf_lb-DLB_L4.o3723 3570 bpf_lb-DUNKNOWN.o 1110 1109 bpf_lxc-DDROP_ALL.o 27954 27849 bpf_lxc-DUNKNOWN.o38954 38724 bpf_netdev.o 16943 16131 bpf_overlay.o 7929 7733 Signed-off-by: Alexei StarovoitovAcked-by: Daniel Borkmann --- drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 8 +- include/linux/bpf_verifier.h | 18 +- kernel/bpf/verifier.c | 390 ++ 3 files changed, 268 insertions(+), 148 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c index 3d3dcac1c942..a8c7615546a9 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c @@ -76,9 +76,9 @@ nfp_bpf_goto_meta(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, static int nfp_bpf_check_exit(struct nfp_prog *nfp_prog, - const struct bpf_verifier_env *env) + struct bpf_verifier_env *env) { - const struct bpf_reg_state *reg0 = >cur_state.regs[0]; + const struct bpf_reg_state *reg0 = cur_regs(env) + BPF_REG_0; u64 imm; if (nfp_prog->act == NN_ACT_XDP) @@ -144,9 +144,9 @@ nfp_bpf_check_stack_access(struct nfp_prog *nfp_prog, static int nfp_bpf_check_ptr(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta, - const struct bpf_verifier_env *env, u8 reg_no) + struct bpf_verifier_env *env, u8 reg_no) { - const struct bpf_reg_state *reg = >cur_state.regs[reg_no]; + const struct bpf_reg_state *reg = cur_regs(env) + reg_no; int err; if (reg->type != PTR_TO_CTX && diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index feeaea93d959..c1430701742c 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -88,20 +88,25 @@ enum bpf_stack_slot_type { #define BPF_REG_SIZE 8 /* size of eBPF register in bytes */ +struct bpf_verifier_stack { + struct bpf_reg_state spilled_ptr; + u8 slot_type[BPF_REG_SIZE]; +}; + /* state of the program: * type of all registers and stack info */ struct bpf_verifier_state { struct bpf_reg_state regs[MAX_BPF_REG]; - u8 stack_slot_type[MAX_BPF_STACK]; - struct bpf_reg_state spilled_regs[MAX_BPF_STACK / BPF_REG_SIZE]; struct bpf_verifier_state *parent; + int allocated_stack; + struct bpf_verifier_stack stack[0]; }; /* linked list of verifier states used to prune search */ struct bpf_verifier_state_list { - struct bpf_verifier_state state; struct bpf_verifier_state_list *next; + struct bpf_verifier_state state; }; struct bpf_insn_aux_data { @@ -145,7 +150,7 @@ struct bpf_verifier_env { struct bpf_verifier_stack_elem *head; /* stack of verifier states to be processed */ int stack_size; /* number of states to be processed */ bool strict_alignment; /* perform strict pointer alignment checks */ - struct bpf_verifier_state cur_state; /* current verifier state */ + struct bpf_verifier_state *cur_state; /* current verifier state */ struct bpf_verifier_state_list **explored_states; /* search pruning optimization */ const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer ops */ void *analyzer_priv; /* pointer to external analyzer's private data */ @@ -159,6 +164,11 @@ struct bpf_verifier_env { struct bpf_verifer_log log; }; +static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) +{ + return env->cur_state->regs; +} + int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops, void *priv); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index d906775e12c1..a9fbcf9812d0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -145,10 +145,10 @@ struct bpf_verifier_stack_elem { * before processing instruction 'insn_idx' * and after processing instruction 'prev_insn_idx' */ - struct bpf_verifier_state st; int insn_idx; int prev_insn_idx; struct bpf_verifier_stack_elem *next; + struct
Re: WARNING in strp_data_ready
On 10/24/2017 08:20 AM, syzbot wrote: > Hello, > > syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > > WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me > include/net/sock.h:1505 [inline] > WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user > include/net/sock.h:1511 [inline] > WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 > strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > Â > Â __dump_stack lib/dump_stack.c:16 [inline] > Â dump_stack+0x194/0x257 lib/dump_stack.c:52 > Â panic+0x1e4/0x417 kernel/panic.c:181 > Â __warn+0x1c4/0x1d9 kernel/panic.c:542 > Â report_bug+0x211/0x2d0 lib/bug.c:183 > Â fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 > Â do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] > Â do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 > Â do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 > Â do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 > Â invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 > RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] > RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] > RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 > RSP: 0018:8801db206b18 EFLAGS: 00010206 > RAX: 8801d1e02080 RBX: 8801dad74c48 RCX: > RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0 > RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd > R10: 8801db206a00 R11: dc00 R12: 8801d29fa000 > R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001 > Â psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 Looks like KCM is calling sk_data_ready() without first taking the sock lock. /* Called with lower sock held */ static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb) { [...] if (kcm_queue_rcv_skb(>sk, skb)) { In this case kcm->sk is not the same lock the comment is referring to. And kcm_queue_rcv_skb() will eventually call sk_data_ready(). @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock? I don't have anything better in mind immediately. Thanks, John
Re: [PATCH v2 net-next] tcp: add tracepoint trace_tcp_retransmit_synack()
CCing key audience of the patch. Thanks, Song > On Oct 30, 2017, at 2:41 PM, Song Liuwrote: > > This tracepoint can be used to trace synack retransmits. It maintains > pointer to struct request_sock. > > We cannot simply reuse trace_tcp_retransmit_skb() here, because the > sk here is the LISTEN socket. The IP addresses and ports should be > extracted from struct request_sock. > > Note that, like many other tracepoints, this patch uses IS_ENABLED > in TP_fast_assign macro, which triggers sparse warning like: > > ./include/trace/events/tcp.h:274:1: error: directive in argument list > ./include/trace/events/tcp.h:281:1: error: directive in argument list > > However, there is no good solution to avoid these warnings. To the > best of our knowledge, these warnings are harmless. > > Signed-off-by: Song Liu > Acked-by: Alexei Starovoitov > Acked-by: Martin KaFai Lau > --- > include/trace/events/tcp.h | 56 ++ > net/ipv4/tcp_output.c | 1 + > 2 files changed, 57 insertions(+) > > diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h > index 03699ba..07a 100644 > --- a/include/trace/events/tcp.h > +++ b/include/trace/events/tcp.h > @@ -237,6 +237,62 @@ TRACE_EVENT(tcp_set_state, > show_tcp_state_name(__entry->newstate)) > ); > > +TRACE_EVENT(tcp_retransmit_synack, > + > + TP_PROTO(const struct sock *sk, const struct request_sock *req), > + > + TP_ARGS(sk, req), > + > + TP_STRUCT__entry( > + __field(const void *, skaddr) > + __field(const void *, req) > + __field(__u16, sport) > + __field(__u16, dport) > + __array(__u8, saddr, 4) > + __array(__u8, daddr, 4) > + __array(__u8, saddr_v6, 16) > + __array(__u8, daddr_v6, 16) > + ), > + > + TP_fast_assign( > + struct inet_request_sock *ireq = inet_rsk(req); > + struct in6_addr *pin6; > + __be32 *p32; > + > + __entry->skaddr = sk; > + __entry->req = req; > + > + __entry->sport = ireq->ir_num; > + __entry->dport = ntohs(ireq->ir_rmt_port); > + > + p32 = (__be32 *) __entry->saddr; > + *p32 = ireq->ir_loc_addr; > + > + p32 = (__be32 *) __entry->daddr; > + *p32 = ireq->ir_rmt_addr; > + > +#if IS_ENABLED(CONFIG_IPV6) > + if (sk->sk_family == AF_INET6) { > + pin6 = (struct in6_addr *)__entry->saddr_v6; > + *pin6 = ireq->ir_v6_loc_addr; > + pin6 = (struct in6_addr *)__entry->daddr_v6; > + *pin6 = ireq->ir_v6_rmt_addr; > + } else > +#endif > + { > + pin6 = (struct in6_addr *)__entry->saddr_v6; > + ipv6_addr_set_v4mapped(ireq->ir_loc_addr, pin6); > + pin6 = (struct in6_addr *)__entry->daddr_v6; > + ipv6_addr_set_v4mapped(ireq->ir_rmt_addr, pin6); > + } > + ), > + > + TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c > daddrv6=%pI6c", > + __entry->sport, __entry->dport, > + __entry->saddr, __entry->daddr, > + __entry->saddr_v6, __entry->daddr_v6) > +); > + > #endif /* _TRACE_TCP_H */ > > /* This part must be outside protection */ > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index a69a34f..74d1c4d 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3781,6 +3781,7 @@ int tcp_rtx_synack(const struct sock *sk, struct > request_sock *req) > __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); > if (unlikely(tcp_passive_fastopen(sk))) > tcp_sk(sk)->total_retrans++; > + trace_tcp_retransmit_synack(sk, req); > } > return res; > } > -- > 2.9.5 >
[PATCH v2 net-next] tcp: add tracepoint trace_tcp_retransmit_synack()
This tracepoint can be used to trace synack retransmits. It maintains pointer to struct request_sock. We cannot simply reuse trace_tcp_retransmit_skb() here, because the sk here is the LISTEN socket. The IP addresses and ports should be extracted from struct request_sock. Note that, like many other tracepoints, this patch uses IS_ENABLED in TP_fast_assign macro, which triggers sparse warning like: ./include/trace/events/tcp.h:274:1: error: directive in argument list ./include/trace/events/tcp.h:281:1: error: directive in argument list However, there is no good solution to avoid these warnings. To the best of our knowledge, these warnings are harmless. Signed-off-by: Song LiuAcked-by: Alexei Starovoitov Acked-by: Martin KaFai Lau --- include/trace/events/tcp.h | 56 ++ net/ipv4/tcp_output.c | 1 + 2 files changed, 57 insertions(+) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 03699ba..07a 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -237,6 +237,62 @@ TRACE_EVENT(tcp_set_state, show_tcp_state_name(__entry->newstate)) ); +TRACE_EVENT(tcp_retransmit_synack, + + TP_PROTO(const struct sock *sk, const struct request_sock *req), + + TP_ARGS(sk, req), + + TP_STRUCT__entry( + __field(const void *, skaddr) + __field(const void *, req) + __field(__u16, sport) + __field(__u16, dport) + __array(__u8, saddr, 4) + __array(__u8, daddr, 4) + __array(__u8, saddr_v6, 16) + __array(__u8, daddr_v6, 16) + ), + + TP_fast_assign( + struct inet_request_sock *ireq = inet_rsk(req); + struct in6_addr *pin6; + __be32 *p32; + + __entry->skaddr = sk; + __entry->req = req; + + __entry->sport = ireq->ir_num; + __entry->dport = ntohs(ireq->ir_rmt_port); + + p32 = (__be32 *) __entry->saddr; + *p32 = ireq->ir_loc_addr; + + p32 = (__be32 *) __entry->daddr; + *p32 = ireq->ir_rmt_addr; + +#if IS_ENABLED(CONFIG_IPV6) + if (sk->sk_family == AF_INET6) { + pin6 = (struct in6_addr *)__entry->saddr_v6; + *pin6 = ireq->ir_v6_loc_addr; + pin6 = (struct in6_addr *)__entry->daddr_v6; + *pin6 = ireq->ir_v6_rmt_addr; + } else +#endif + { + pin6 = (struct in6_addr *)__entry->saddr_v6; + ipv6_addr_set_v4mapped(ireq->ir_loc_addr, pin6); + pin6 = (struct in6_addr *)__entry->daddr_v6; + ipv6_addr_set_v4mapped(ireq->ir_rmt_addr, pin6); + } + ), + + TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c", + __entry->sport, __entry->dport, + __entry->saddr, __entry->daddr, + __entry->saddr_v6, __entry->daddr_v6) +); + #endif /* _TRACE_TCP_H */ /* This part must be outside protection */ diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index a69a34f..74d1c4d 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3781,6 +3781,7 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); if (unlikely(tcp_passive_fastopen(sk))) tcp_sk(sk)->total_retrans++; + trace_tcp_retransmit_synack(sk, req); } return res; } -- 2.9.5
[PATCH v2 net-next] tcp: add tracepoint trace_tcp_retransmit_synack
Change from v1: Updated commit message to highlight potential sparse warning. Song Liu (1): tcp: add tracepoint trace_tcp_retransmit_synack() include/trace/events/tcp.h | 56 ++ net/ipv4/tcp_output.c | 1 + 2 files changed, 57 insertions(+) -- 2.9.5
[PATCH 0/2] Add the ability to do BPF directed error injection
A lot of our error paths are not well tested because we have no good way of injecting errors generically. Some subystems (block, memory) have ways to inject errors, but they are random so it's hard to get reproduceable results. With BPF we can add determinism to our error injection. We can use kprobes and other things to verify we are injecting errors at the exact case we are trying to test. This patch gives us the tool to actual do the error injection part. It is very simple, we just set the return value of the pt_regs we're given to whatever we provide, and then override the PC with a dummy function that simply returns. Right now this only works on x86, but it would be simple enough to expand to other architectures. Thanks, Josef
[PATCH 2/2] samples/bpf: add a test for bpf_override_return
From: Josef BacikThis adds a basic test for bpf_override_return to verify it works. We override the main function for mounting a btrfs fs so it'll return -ENOMEM and then make sure that trying to mount a btrfs fs will fail. Signed-off-by: Josef Bacik --- samples/bpf/Makefile | 4 samples/bpf/test_override_return.sh | 15 +++ samples/bpf/tracex7_kern.c| 15 +++ samples/bpf/tracex7_user.c| 28 tools/include/uapi/linux/bpf.h| 7 ++- tools/testing/selftests/bpf/bpf_helpers.h | 3 ++- 6 files changed, 70 insertions(+), 2 deletions(-) create mode 100755 samples/bpf/test_override_return.sh create mode 100644 samples/bpf/tracex7_kern.c create mode 100644 samples/bpf/tracex7_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index ea2b9e6135f3..83d06bc1f710 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -14,6 +14,7 @@ hostprogs-y += tracex3 hostprogs-y += tracex4 hostprogs-y += tracex5 hostprogs-y += tracex6 +hostprogs-y += tracex7 hostprogs-y += test_probe_write_user hostprogs-y += trace_output hostprogs-y += lathist @@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o +tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o @@ -100,6 +102,7 @@ always += tracex3_kern.o always += tracex4_kern.o always += tracex5_kern.o always += tracex6_kern.o +always += tracex7_kern.o always += sock_flags_kern.o always += test_probe_write_user_kern.o always += trace_output_kern.o @@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf HOSTLOADLIBES_tracex4 += -lelf -lrt HOSTLOADLIBES_tracex5 += -lelf HOSTLOADLIBES_tracex6 += -lelf +HOSTLOADLIBES_tracex7 += -lelf HOSTLOADLIBES_test_cgrp2_sock2 += -lelf HOSTLOADLIBES_load_sock_ops += -lelf HOSTLOADLIBES_test_probe_write_user += -lelf diff --git a/samples/bpf/test_override_return.sh b/samples/bpf/test_override_return.sh new file mode 100755 index ..e68b9ee6814b --- /dev/null +++ b/samples/bpf/test_override_return.sh @@ -0,0 +1,15 @@ +#!/bin/bash + +rm -f testfile.img +dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1 +DEVICE=$(losetup --show -f testfile.img) +mkfs.btrfs -f $DEVICE +mkdir tmpmnt +./tracex7 $DEVICE +if [ $? -eq 0 ] +then + echo "SUCCESS!" +else + echo "FAILED!" +fi +losetup -d $DEVICE diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c new file mode 100644 index ..a2f74f736e66 --- /dev/null +++ b/samples/bpf/tracex7_kern.c @@ -0,0 +1,15 @@ +#include +#include +#include +#include "bpf_helpers.h" + +SEC("kprobe/open_ctree") +int bpf_prog1(struct pt_regs *ctx) +{ + unsigned long rc = -12; + bpf_override_return(ctx, rc); + return 0; +} + +char _license[] SEC("license") = "GPL"; +u32 _version SEC("version") = LINUX_VERSION_CODE; diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c new file mode 100644 index ..8a52ac492e8b --- /dev/null +++ b/samples/bpf/tracex7_user.c @@ -0,0 +1,28 @@ +#define _GNU_SOURCE + +#include +#include +#include +#include "libbpf.h" +#include "bpf_load.h" + +int main(int argc, char **argv) +{ + FILE *f; + char filename[256]; + char command[256]; + int ret; + + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(filename)) { + printf("%s", bpf_log_buf); + return 1; + } + + snprintf(command, 256, "mount %s tmpmnt/", argv[1]); + f = popen(command, "r"); + ret = pclose(f); + + return ret ? 0 : 1; +} diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 4a4b6e78c977..3756dde69834 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -673,6 +673,10 @@ union bpf_attr { * @buf: buf to fill * @buf_size: size of the buf * Return : 0 on success or negative error code + * + * int bpf_override_return(pt_regs, rc) + * @pt_regs: pointer to struct pt_regs + * @rc: the return value to set */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -732,7 +736,8 @@ union bpf_attr { FN(xdp_adjust_meta),\ FN(perf_event_read_value), \ FN(perf_prog_read_value), \ - FN(getsockopt), + FN(getsockopt), \ + FN(override_return), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git
[PATCH 1/2] bpf: add a bpf_override_function helper
From: Josef BacikError injection is sloppy and very ad-hoc. BPF could fill this niche perfectly with it's kprobe functionality. We could make sure errors are only triggered in specific call chains that we care about with very specific situations. Accomplish this with the bpf_override_funciton helper. This will modify the probe'd callers return value to the specified value and set the PC to an override function that simply returns, bypassing the originally probed function. This gives us a nice clean way to implement systematic error injection for all of our code paths. Signed-off-by: Josef Bacik --- arch/Kconfig | 3 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/kprobes.h | 4 arch/x86/include/asm/ptrace.h| 5 + arch/x86/kernel/kprobes/ftrace.c | 14 include/uapi/linux/bpf.h | 7 +- kernel/trace/Kconfig | 11 ++ kernel/trace/bpf_trace.c | 47 +++- kernel/trace/trace.h | 6 + kernel/trace/trace_kprobe.c | 23 ++-- 10 files changed, 108 insertions(+), 13 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index d789a89cb32c..4fb618082259 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -195,6 +195,9 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_KPROBE_OVERRIDE + bool + config HAVE_NMI bool diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 971feac13506..5126d2750dd0 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -152,6 +152,7 @@ config X86 select HAVE_KERNEL_XZ select HAVE_KPROBES select HAVE_KPROBES_ON_FTRACE + select HAVE_KPROBE_OVERRIDE select HAVE_KRETPROBES select HAVE_KVM select HAVE_LIVEPATCH if X86_64 diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h index 6cf65437b5e5..c6c3b1f4306a 100644 --- a/arch/x86/include/asm/kprobes.h +++ b/arch/x86/include/asm/kprobes.h @@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size; void arch_remove_kprobe(struct kprobe *p); asmlinkage void kretprobe_trampoline(void); +#ifdef CONFIG_KPROBES_ON_FTRACE +extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs); +#endif + /* Architecture specific copy of original instruction*/ struct arch_specific_insn { /* copy of the original instruction */ diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 91c04c8e67fa..f04e71800c2f 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs) return regs->ax; } +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) +{ + regs->ax = rc; +} + /* * user_mode(regs) determines whether a register set came from user * mode. On x86_32, this is true if V8086 mode was enabled OR if the diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c index 041f7b6dfa0f..3c455bf490cb 100644 --- a/arch/x86/kernel/kprobes/ftrace.c +++ b/arch/x86/kernel/kprobes/ftrace.c @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p) p->ainsn.boostable = false; return 0; } + +asmlinkage void override_func(void); +asm( + ".type override_func, @function\n" + "override_func:\n" + " ret\n" + ".size override_func, .-override_func\n" +); + +void arch_ftrace_kprobe_override_function(struct pt_regs *regs) +{ + regs->ip = (unsigned long)_func; +} +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0b7b54d898bd..1ad5b87a42f6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -673,6 +673,10 @@ union bpf_attr { * @buf: buf to fill * @buf_size: size of the buf * Return : 0 on success or negative error code + * + * int bpf_override_return(pt_regs, rc) + * @pt_regs: pointer to struct pt_regs + * @rc: the return value to set */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -732,7 +736,8 @@ union bpf_attr { FN(xdp_adjust_meta),\ FN(perf_event_read_value), \ FN(perf_prog_read_value), \ - FN(getsockopt), + FN(getsockopt), \ + FN(override_return), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 434c840e2d82..9dc0deeaad2b 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -518,6 +518,17 @@ config FUNCTION_PROFILER If in doubt, say N. +config BPF_KPROBE_OVERRIDE + bool "Enable BPF programs to override a kprobed function" +
Re: [PATCH] ravb: Use common error handling code in ravb_probe()
On 10/30/2017 11:51 PM, Sergei Shtylyov wrote: From: Markus ElfringDate: Sat, 28 Oct 2017 19:10:08 +0200 Add a jump target so that a bit of exception handling can be better reused at the end of this function. Have you actually tried to see if there's any change in the resulting object code? I've looked at ARM gcc 4.8.5's output and it didn't really change -- gcc was able to optimize these repetitive assignments pretty well... This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring --- drivers/net/ethernet/renesas/ravb_main.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) Diffstat also speaks about the futility of this patch. Ah, you've added empty lines where there was none... Please don't do this. I'll ACK the v2 patch, given my comments are addressed. [...] MBR, Sergei
Re: [PATCH net-next 1/1] net sched qdisc: pass netlink message flags in event notification
Cong Wangwrites: > On Mon, Oct 30, 2017 at 11:07 AM, Roman Mashak wrote: >> Cong Wang writes: >> >>> On Sat, Oct 28, 2017 at 8:36 PM, Roman Mashak wrote: Cong Wang writes: >>> >>> Hmm, I thought you use RTM_NEWQDISC+RTM_DELQDISC to >>> determine it is replacement, no? >> >> Create is RTM_NEWQDISC and NLM_F_EXCL|NLM_F_CREATE, replacement is >> RTM_NEWQDISC and NLM_F_REPLACE in netlink flags. > > Is there any reason we can't use RTM_NEWQDISC+RTM_DELQDISC > rather than NLM_F_REPLACE to determine it is replacement? > I'm not sure this would be valid semantics for replace operation, look at the rfc3549: Additional flag bits for NEW requests NLM_F_REPLACE Replace existing matching config object with this request. > Note, RTM_NEWQDISC+RTM_DELQDISC are put in a same > message not two. Hmm, could you clarify how do you expect to put two event IDs in nlmsg_type?
[PATCH v2 net-next] ipv6: Implement limits on Hop-by-Hop and Destination options
RFC 8200 (IPv6) defines Hop-by-Hop options and Destination options extension headers. Both of these carry a list of TLVs which is only limited by the maximum length of the extension header (2048 bytes). By the spec a host must process all the TLVs in these options, however these could be used as a fairly obvious denial of service attack. I think this could in fact be a significant DOS vector on the Internet, one mitigating factor might be that many FWs drop all packets with EH (and obviously this is only IPv6) so an Internet wide attack might not be so effective (yet!). By my calculation, the worse case packet with TLVs in a standard 1500 byte MTU packet that would be processed by the stack contains 1282 invidual TLVs (including pad TLVS) or 724 two byte TLVs. I wrote a quick test program that floods a whole bunch of these packets to a host and sure enough there is substantial time spent in ip6_parse_tlv. These packets contain nothing but unknown TLVS (that are ignored), TLV padding, and bogus UDP header with zero payload length. 25.38% [kernel][k] __fib6_clean_all 21.63% [kernel][k] ip6_parse_tlv 4.21% [kernel][k] __local_bh_enable_ip 2.18% [kernel][k] ip6_pol_route.isra.39 1.98% [kernel][k] fib6_walk_continue 1.88% [kernel][k] _raw_write_lock_bh 1.65% [kernel][k] dst_release This patch adds configurable limits to Destination and Hop-by-Hop options. There are three limits that may be set: - Limit the number of options in a Hop-by-Hop or Destination options extension header. - Limit the byte length of a Hop-by-Hop or Destination options extension header. - Disallow unrecognized options in a Hop-by-Hop or Destination options extension header. The limits are set in corresponding sysctls: ipv6.sysctl.max_dst_opts_cnt ipv6.sysctl.max_hbh_opts_cnt ipv6.sysctl.max_dst_opts_len ipv6.sysctl.max_hbh_opts_len If a max_*_opts_cnt is less than zero then unknown TLVs are disallowed. The number of known TLVs that are allowed is the absolute value of this number. If a limit is exceeded when processing an extension header the packet is dropped. Default values are set to 8 for options counts, and set to INT_MAX for maximum length. Note the choice to limit options to 8 is an arbitrary guess (roughly based on the fact that the stack supports three HBH options and just one destination option). These limits have being proposed in draft-ietf-6man-rfc6434-bis. Tested (by Martin Lau) I tested out 1 thread (i.e. one raw_udp process). I changed the net.ipv6.max_dst_(opts|hbh)_number between 8 to 2048. With sysctls setting to 2048, the softirq% is packed to 100%. With 8, the softirq% is almost unnoticable from mpstat. v2; - Code and documention cleanup. - Change references of RFC2460 to be RFC8200. - Add reference to RFC6434-bis where the limits will be in standard. Signed-off-by: Tom Herbert--- Documentation/networking/ip-sysctl.txt | 24 include/net/ipv6.h | 40 include/net/netns/ipv6.h | 4 ++ net/ipv6/af_inet6.c| 4 ++ net/ipv6/exthdrs.c | 67 -- net/ipv6/sysctl_net_ipv6.c | 32 6 files changed, 159 insertions(+), 12 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 77f4de59dc9c..e6661b205f72 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1385,6 +1385,30 @@ mld_qrv - INTEGER Default: 2 (as specified by RFC3810 9.1) Minimum: 1 (as specified by RFC6636 4.5) +max_dst_opts_cnt - INTEGER + Maximum number of non-padding TLVs allowed in a Destination + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + Default: 8 + +max_hbh_opts_cnt - INTEGER + Maximum number of non-padding TLVs allowed in a Hop-by-Hop + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + Default: 8 + +max dst_opts_len - INTEGER + Maximum length allowed for a Destination options extension + header. + Default: INT_MAX (unlimited) + +max hbh_opts_len - INTEGER + Maximum length allowed for a Hop-by-Hop options extension + header. + Default: INT_MAX (unlimited) + IPv6 Fragmentation: ip6frag_high_thresh - INTEGER diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 3cda3b521c36..fb6d67012de6 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -51,6 +51,46 @@ #define
Re: [PATCH net-next] net: filter: remove unused variable and fix warning
Good catch, thanks! On 10/30/17, 1:47 PM, "Jakub Kicinski"wrote: bpf_getsockopt bpf call sets the ret variable to zero and never changes it. What's worse in case CONFIG_INET is not selected the variable is completely unused generating a warning. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet Acked-by: Lawrence Brakmo --- net/core/filter.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 721c30889033..a0112168d6f9 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3288,7 +3288,6 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { struct sock *sk = bpf_sock->sk; - int ret = 0; if (!sk_fullsock(sk)) goto err_clear; @@ -3308,7 +3307,7 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, } else { goto err_clear; } - return ret; + return 0; #endif err_clear: memset(optval, 0, optlen); -- 2.14.1
Re: [PATCH] ravb: Use common error handling code in ravb_probe()
On 10/29/2017 02:00 PM, Geert Uytterhoeven wrote: From: Markus ElfringDate: Sat, 28 Oct 2017 19:10:08 +0200 Add a jump target so that a bit of exception handling can be better reused at the end of this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring --- drivers/net/ethernet/renesas/ravb_main.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index a8822a756e08..62dbdf7de6cd 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -2069,10 +2069,9 @@ static int ravb_probe(struct platform_device *pdev) irq = platform_get_irq_byname(pdev, "ch22"); else irq = platform_get_irq(pdev, 0); - if (irq < 0) { - error = irq; - goto out_release; - } + if (irq < 0) + goto failure_indication; IMHO, it's really confusing that "irq" contains the error code, not "error". I think it would have been equally confusing if 'error' was assigned to 'ndev->irq', etc. It's just the duality of the result of these functions that makes them confusing... Especially when jumping to a meaningless label named "failure_indication" ("irq_failure" would be more intuitive). Yeah, the label sucks. :-) So I prefer the original code, regardless of the label name. On the 2nd thought, the patch can be fixed up and then merged. Gr{oetje,eeting}s, Geert MBR, Sergei
Re: [PATCH] net: tipc: Convert timers to use timer_setup()
On Mon, Oct 30, 2017 at 2:57 AM, Jon Maloywrote: > > >> -Original Message- >> From: keesc...@google.com [mailto:keesc...@google.com] On Behalf Of >> Kees Cook >> Sent: Friday, October 27, 2017 06:58 >> To: Jon Maloy >> Cc: David S. Miller ; Ying Xue >> ; netdev@vger.kernel.org; tipc- >> discuss...@lists.sourceforge.net; linux-ker...@vger.kernel.org >> Subject: Re: [PATCH] net: tipc: Convert timers to use timer_setup() >> >> On Tue, Oct 24, 2017 at 11:44 AM, Jon Maloy >> wrote: >> > NAK. It doesn't sound like a good idea to send this to net. Especially >> > since >> one of these timers has already been refactored in net-next. >> >> Hi! I'm not sure what you mean about the one timer issue. I don't see any >> use of timer_setup() in net/tipc (and no recent conversions to the older >> setup_timer() API). What's the preferred path for landing this API conversion >> in net/tipc/? >> >> And, just to note, these changes are almost entirely mechanical. The only >> "special" case is in tipc_sk_timeout() where the argument needs to be >> slightly adjusted to fetch the tsk from the sk again. > > I am referring to commit 0d5fcebf3c37 ("tipc: refactor tipc_sk_timeout() > function") which is in "net-next" but not yet in "net". > I guess David M is the one who decides whether you deliver to net or > net-next, but in the former case you will have to wait until David merges in > the above commit, and then add your change again. Ah, thanks, I see the change now. These patches passed in the night, it seems. I've sent a v2 now for net-next. Thanks! -Kees -- Kees Cook Pixel Security
[PATCH v2] net: tipc: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Jon MaloyCc: Ying Xue Cc: "David S. Miller" Cc: netdev@vger.kernel.org Cc: tipc-discuss...@lists.sourceforge.net Signed-off-by: Kees Cook --- Rebased on commit 0d5fcebf3c37 ("tipc: refactor tipc_sk_timeout() function"). --- net/tipc/discover.c | 6 +++--- net/tipc/monitor.c | 6 +++--- net/tipc/node.c | 8 net/tipc/socket.c | 10 +- net/tipc/subscr.c | 6 +++--- 5 files changed, 18 insertions(+), 18 deletions(-) diff --git a/net/tipc/discover.c b/net/tipc/discover.c index 02462d67d191..92e4828c6b09 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -224,9 +224,9 @@ void tipc_disc_remove_dest(struct tipc_link_req *req) * * Called whenever a link setup request timer associated with a bearer expires. */ -static void disc_timeout(unsigned long data) +static void disc_timeout(struct timer_list *t) { - struct tipc_link_req *req = (struct tipc_link_req *)data; + struct tipc_link_req *req = from_timer(req, t, timer); struct sk_buff *skb; int max_delay; @@ -292,7 +292,7 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b, req->num_nodes = 0; req->timer_intv = TIPC_LINK_REQ_INIT; spin_lock_init(>lock); - setup_timer(>timer, disc_timeout, (unsigned long)req); + timer_setup(>timer, disc_timeout, 0); mod_timer(>timer, jiffies + req->timer_intv); b->link_req = req; *skb = skb_clone(req->buf, GFP_ATOMIC); diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c index 9e109bb1a207..b9c32557d73c 100644 --- a/net/tipc/monitor.c +++ b/net/tipc/monitor.c @@ -578,9 +578,9 @@ void tipc_mon_get_state(struct net *net, u32 addr, read_unlock_bh(>lock); } -static void mon_timeout(unsigned long m) +static void mon_timeout(struct timer_list *t) { - struct tipc_monitor *mon = (void *)m; + struct tipc_monitor *mon = from_timer(mon, t, timer); struct tipc_peer *self; int best_member_cnt = dom_size(mon->peer_cnt) - 1; @@ -623,7 +623,7 @@ int tipc_mon_create(struct net *net, int bearer_id) self->is_up = true; self->is_head = true; INIT_LIST_HEAD(>list); - setup_timer(>timer, mon_timeout, (unsigned long)mon); + timer_setup(>timer, mon_timeout, 0); mon->timer_intv = msecs_to_jiffies(MON_TIMEOUT + (tn->random & 0x)); mod_timer(>timer, jiffies + mon->timer_intv); return 0; diff --git a/net/tipc/node.c b/net/tipc/node.c index 89f8ac73bf65..009a81631280 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -153,7 +153,7 @@ static void tipc_node_link_down(struct tipc_node *n, int bearer_id, bool delete); static void node_lost_contact(struct tipc_node *n, struct sk_buff_head *inputq); static void tipc_node_delete(struct tipc_node *node); -static void tipc_node_timeout(unsigned long data); +static void tipc_node_timeout(struct timer_list *t); static void tipc_node_fsm_evt(struct tipc_node *n, int evt); static struct tipc_node *tipc_node_find(struct net *net, u32 addr); static void tipc_node_put(struct tipc_node *node); @@ -361,7 +361,7 @@ struct tipc_node *tipc_node_create(struct net *net, u32 addr, u16 capabilities) goto exit; } tipc_node_get(n); - setup_timer(>timer, tipc_node_timeout, (unsigned long)n); + timer_setup(>timer, tipc_node_timeout, 0); n->keepalive_intv = U32_MAX; hlist_add_head_rcu(>hash, >node_htable[tipc_hashfn(addr)]); list_for_each_entry_rcu(temp_node, >node_list, list) { @@ -500,9 +500,9 @@ void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port) /* tipc_node_timeout - handle expiration of node timer */ -static void tipc_node_timeout(unsigned long data) +static void tipc_node_timeout(struct timer_list *t) { - struct tipc_node *n = (struct tipc_node *)data; + struct tipc_node *n = from_timer(n, t, timer); struct tipc_link_entry *le; struct sk_buff_head xmitq; int bearer_id; diff --git a/net/tipc/socket.c b/net/tipc/socket.c index ea61c32f6b80..5d18c0caa92b 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -125,7 +125,7 @@ static void tipc_sock_destruct(struct sock *sk); static int tipc_release(struct socket *sock); static int tipc_accept(struct socket *sock, struct socket *new_sock, int flags, bool kern); -static void tipc_sk_timeout(unsigned long data); +static void tipc_sk_timeout(struct timer_list *t); static int tipc_sk_publish(struct tipc_sock *tsk, uint scope, struct tipc_name_seq const *seq); static int tipc_sk_withdraw(struct tipc_sock *tsk, uint scope, @@ -464,7
[PATCH] drivers/net: tundra: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: "David S. Miller"Cc: Philippe Reynes Cc: "yuval.sh...@oracle.com" Cc: Eric Dumazet Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook --- drivers/net/ethernet/tundra/tsi108_eth.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/tundra/tsi108_eth.c b/drivers/net/ethernet/tundra/tsi108_eth.c index c2d15d9c0c33..0624b71ab5d4 100644 --- a/drivers/net/ethernet/tundra/tsi108_eth.c +++ b/drivers/net/ethernet/tundra/tsi108_eth.c @@ -164,7 +164,7 @@ static struct platform_driver tsi_eth_driver = { }, }; -static void tsi108_timed_checker(unsigned long dev_ptr); +static void tsi108_timed_checker(struct timer_list *t); #ifdef DEBUG static void dump_eth_one(struct net_device *dev) @@ -1370,7 +1370,7 @@ static int tsi108_open(struct net_device *dev) napi_enable(>napi); - setup_timer(>timer, tsi108_timed_checker, (unsigned long)dev); + timer_setup(>timer, tsi108_timed_checker, 0); mod_timer(>timer, jiffies + 1); tsi108_restart_rx(data, dev); @@ -1666,10 +1666,10 @@ tsi108_init_one(struct platform_device *pdev) * Thus, we have to do it using a timer. */ -static void tsi108_timed_checker(unsigned long dev_ptr) +static void tsi108_timed_checker(struct timer_list *t) { - struct net_device *dev = (struct net_device *)dev_ptr; - struct tsi108_prv_data *data = netdev_priv(dev); + struct tsi108_prv_data *data = from_timer(data, t, timer); + struct net_device *dev = data->dev; tsi108_check_phy(dev); tsi108_check_rxring(dev); -- 2.7.4 -- Kees Cook Pixel Security
[PATCH] drivers/net: ntb_netdev: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Jon MasonCc: Dave Jiang Cc: Allen Hubbe Cc: linux-...@googlegroups.com Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook --- drivers/net/ntb_netdev.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ntb_netdev.c b/drivers/net/ntb_netdev.c index 0250aa9ae2cb..9f6f7ccd44f7 100644 --- a/drivers/net/ntb_netdev.c +++ b/drivers/net/ntb_netdev.c @@ -230,10 +230,10 @@ static netdev_tx_t ntb_netdev_start_xmit(struct sk_buff *skb, return NETDEV_TX_BUSY; } -static void ntb_netdev_tx_timer(unsigned long data) +static void ntb_netdev_tx_timer(struct timer_list *t) { - struct net_device *ndev = (struct net_device *)data; - struct ntb_netdev *dev = netdev_priv(ndev); + struct ntb_netdev *dev = from_timer(dev, t, tx_timer); + struct net_device *ndev = dev->ndev; if (ntb_transport_tx_free_entry(dev->qp) < tx_stop) { mod_timer(>tx_timer, jiffies + msecs_to_jiffies(tx_time)); @@ -269,7 +269,7 @@ static int ntb_netdev_open(struct net_device *ndev) } } - setup_timer(>tx_timer, ntb_netdev_tx_timer, (unsigned long)ndev); + timer_setup(>tx_timer, ntb_netdev_tx_timer, 0); netif_carrier_off(ndev); ntb_transport_link_up(dev->qp); -- 2.7.4 -- Kees Cook Pixel Security
Re: [PATCH net-next] net: display hw address of source machine during ipv6 DAD failure
On 10/30/17 2:50 PM, Vishwanath Pai wrote: > It does print out the full uncompressed IPv6 address. I modified the > message manually by hand while copying it out to the commit message in > order to hide the real IP address of my machines, but did not realize > that it was different from what the kernel would actually print out. use 2001:db8::/32 (https://tools.ietf.org/html/rfc3849)
[PATCH] drivers/net: cris: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: "David S. Miller"Cc: Kalle Valo Cc: Arnd Bergmann Cc: Greg Kroah-Hartman Cc: "yuval.sh...@oracle.com" Cc: Paul Gortmaker Cc: Philippe Reynes Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook --- This patch depends on the tip/timers/core DEFINE_TIMER() cleanups, so I intend to carry this in the timer tree. An Ack or Review would be appreciated. Thanks! --- drivers/net/cris/eth_v10.c | 36 +--- 1 file changed, 17 insertions(+), 19 deletions(-) diff --git a/drivers/net/cris/eth_v10.c b/drivers/net/cris/eth_v10.c index 1fcc86fa4e05..7193dfafa577 100644 --- a/drivers/net/cris/eth_v10.c +++ b/drivers/net/cris/eth_v10.c @@ -164,9 +164,16 @@ static unsigned int network_rec_config_shadow = 0; static unsigned int network_tr_ctrl_shadow = 0; +/* Timers */ +static void e100_check_speed(struct timer_list *unused); +static void e100_clear_network_leds(struct timer_list *unused); +static void e100_check_duplex(struct timer_list *unused); +static DEFINE_TIMER(speed_timer, e100_check_speed); +static DEFINE_TIMER(clear_led_timer, e100_clear_network_leds); +static DEFINE_TIMER(duplex_timer, e100_check_duplex); +static struct net_device *timer_dev; + /* Network speed indication. */ -static DEFINE_TIMER(speed_timer, NULL); -static DEFINE_TIMER(clear_led_timer, NULL); static int current_speed; /* Speed read from transceiver */ static int current_speed_selection; /* Speed selected by user */ static unsigned long led_next_time; @@ -174,7 +181,6 @@ static int led_active; static int rx_queue_len; /* Duplex */ -static DEFINE_TIMER(duplex_timer, NULL); static int full_duplex; static enum duplex current_duplex; @@ -199,9 +205,7 @@ static void update_rx_stats(struct net_device_stats *); static void update_tx_stats(struct net_device_stats *); static int e100_probe_transceiver(struct net_device* dev); -static void e100_check_speed(unsigned long priv); static void e100_set_speed(struct net_device* dev, unsigned long speed); -static void e100_check_duplex(unsigned long priv); static void e100_set_duplex(struct net_device* dev, enum duplex); static void e100_negotiate(struct net_device* dev); @@ -213,7 +217,6 @@ static void e100_send_mdio_bit(unsigned char bit); static unsigned char e100_receive_mdio_bit(void); static void e100_reset_transceiver(struct net_device* net); -static void e100_clear_network_leds(unsigned long dummy); static void e100_set_network_leds(int active); static const struct ethtool_ops e100_ethtool_ops; @@ -380,17 +383,12 @@ etrax_ethernet_init(void) current_speed = 10; current_speed_selection = 0; /* Auto */ speed_timer.expires = jiffies + NET_LINK_UP_CHECK_INTERVAL; - speed_timer.data = (unsigned long)dev; - speed_timer.function = e100_check_speed; - - clear_led_timer.function = e100_clear_network_leds; - clear_led_timer.data = (unsigned long)dev; full_duplex = 0; current_duplex = autoneg; duplex_timer.expires = jiffies + NET_DUPLEX_CHECK_INTERVAL; -duplex_timer.data = (unsigned long)dev; - duplex_timer.function = e100_check_duplex; + + timer_dev = dev; /* Initialize mii interface */ np->mii_if.phy_id_mask = 0x1f; @@ -679,9 +677,9 @@ intel_check_speed(struct net_device* dev) } #endif static void -e100_check_speed(unsigned long priv) +e100_check_speed(struct timer_list *unused) { - struct net_device* dev = (struct net_device*)priv; + struct net_device *dev = timer_dev; struct net_local *np = netdev_priv(dev); static int led_initiated = 0; unsigned long data; @@ -798,9 +796,9 @@ e100_set_speed(struct net_device* dev, unsigned long speed) } static void -e100_check_duplex(unsigned long priv) +e100_check_duplex(struct timer_list *unused) { - struct net_device *dev = (struct net_device *)priv; + struct net_device *dev = timer_dev; struct net_local *np = netdev_priv(dev); int old_duplex; @@ -1668,9 +1666,9 @@ e100_hardware_send_packet(struct net_local *np, char *buf, int length) } static void -e100_clear_network_leds(unsigned long dummy) +e100_clear_network_leds(struct timer_list *unused) { - struct net_device *dev = (struct net_device *)dummy; + struct net_device *dev = timer_dev; struct net_local *np = netdev_priv(dev); spin_lock(>led_lock); -- 2.7.4 -- Kees Cook Pixel Security
[PATCH] ip/ipvlan: enhance ability to add mode flags to existing modes
From: Mahesh BandewarIPvlan supported bridge-only functionality prior to commits a190d04db937 ('ipvlan: introduce 'private' attribute for all existing modes.') and fe89aa6b250c ('ipvlan: implement VEPA mode'). These two commits allow to configure the VEPA and private modes now. This patch adds those options in ip command. e.g. bash:~# ip link add link eth0 name ipvl0 type ipvlan mode l2 private -or- bash:~# ip link add link eth0 type ipvl0 type ipvlan mode l2 vepa Also the output will reflect the mode and the mode-flag accordingly. e.g. bash:~# ip -details link show ipvl0 4: ipvl0@eth0: mtu 1500 qdisc ... link/ether 00:1a:11:44:a5:3e brd ff:ff:ff:ff:ff:ff promiscuity 0 ipvlan mode l2 private addrgenmode eui64 numtxqueues 1 ... Signed-off-by: Mahesh Bandewar --- include/uapi/linux/if_link.h | 4 ip/iplink_ipvlan.c | 35 ++- 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index dafe0a6e0421..35bc598566e0 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -463,6 +463,7 @@ enum macsec_validation_type { enum { IFLA_IPVLAN_UNSPEC, IFLA_IPVLAN_MODE, + IFLA_IPVLAN_FLAGS, __IFLA_IPVLAN_MAX }; @@ -475,6 +476,9 @@ enum ipvlan_mode { IPVLAN_MODE_MAX }; +#define IPVLAN_F_PRIVATE 0x01 +#define IPVLAN_F_VEPA 0x02 + /* VXLAN section */ enum { IFLA_VXLAN_UNSPEC, diff --git a/ip/iplink_ipvlan.c b/ip/iplink_ipvlan.c index 9f48309ee030..8889808508fe 100644 --- a/ip/iplink_ipvlan.c +++ b/ip/iplink_ipvlan.c @@ -20,12 +20,21 @@ static void ipvlan_explain(FILE *f) { - fprintf(f, "Usage: ... ipvlan [ mode { l2 | l3 | l3s } ]\n"); + fprintf(f, + "Usage: ... ipvlan [ mode MODE ] [ FLAGS ]\n" + "\n" + "MODE: l3 | l3s | l2\n" + "FLAGS: bridge | private | vepa\n" + "(first values are the defaults if nothing is specified).\n" + ); } static int ipvlan_parse_opt(struct link_util *lu, int argc, char **argv, struct nlmsghdr *n) { + __u16 flags = 0; + bool mflag_given = false; + while (argc > 0) { if (matches(*argv, "mode") == 0) { __u16 mode = 0; @@ -43,6 +52,14 @@ static int ipvlan_parse_opt(struct link_util *lu, int argc, char **argv, return -1; } addattr16(n, 1024, IFLA_IPVLAN_MODE, mode); + } else if (matches(*argv, "private") == 0 && !mflag_given) { + flags |= IPVLAN_F_PRIVATE; + mflag_given = true; + } else if (matches(*argv, "vepa") == 0 && !mflag_given) { + flags |= IPVLAN_F_VEPA; + mflag_given = true; + } else if (matches(*argv, "bridge") == 0 && !mflag_given) { + mflag_given = true; } else if (matches(*argv, "help") == 0) { ipvlan_explain(stderr); return -1; @@ -55,6 +72,7 @@ static int ipvlan_parse_opt(struct link_util *lu, int argc, char **argv, argc--; argv++; } + addattr16(n, 1024, IFLA_IPVLAN_FLAGS, flags); return 0; } @@ -75,6 +93,21 @@ static void ipvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) print_string(PRINT_ANY, "mode", " mode %s ", mode_str); } } + if (tb[IFLA_IPVLAN_FLAGS]) { + if (RTA_PAYLOAD(tb[IFLA_IPVLAN_FLAGS]) == sizeof(__u16)) { + __u16 flags = rta_getattr_u16(tb[IFLA_IPVLAN_FLAGS]); + + if (flags & IPVLAN_F_PRIVATE) + print_bool(PRINT_ANY, "private", "private ", + true); + else if (flags & IPVLAN_F_VEPA) + print_bool(PRINT_ANY, "vepa", "vepa ", + true); + else + print_bool(PRINT_ANY, "bridge", "bridge ", + true); + } + } } static void ipvlan_print_help(struct link_util *lu, int argc, char **argv, -- 2.15.0.rc2.357.g7e34df9404-goog
Re: [net-next] tcp_delack_timer circular locking dependancy
On Mon, 2017-10-30 at 15:20 -0400, Dave Jones wrote: > [ 105.316650] == > [ 105.316818] WARNING: possible circular locking dependency detected > [ 105.316986] 4.14.0-rc7-think+ #1 Not tainted > [ 105.317108] -- > [ 105.317273] swapper/2/0 is trying to acquire lock: > [ 105.317407] ( > [ 105.317476] slock-AF_INET6 > [ 105.317564] ){+.-.} > [ 105.317642] , at: [] tcp_delack_timer+0x26/0x130 > [ 105.317807] >but task is already holding lock: > [ 105.317961] ( > [ 105.318024] (timer) > [ 105.318097] #5 > [ 105.318168] ){+.-.} > [ 105.318241] , at: [] call_timer_fn+0x5/0x5e0 > [ 105.318393] >which lock already depends on the new lock. > > [ 105.318594] >the existing dependency chain (in reverse order) is: > [ 105.318781] >-> #1 > [ 105.318879] ( > [ 105.318939] (timer) > [ 105.319009] #5 > [ 105.319068] ){+.-.} > [ 105.319137] : > [ 105.319195]del_timer_sync+0x3c/0xb0 > [ 105.319313]inet_csk_reqsk_queue_drop+0x26c/0x4e0 > [ 105.319459]inet_csk_complete_hashdance+0x1e/0x90 > [ 105.319598]tcp_check_req+0x787/0x9a0 > [ 105.319716]tcp_v6_rcv+0x914/0x1060 > [ 105.319828]ip6_input_finish+0x291/0xba0 > [ 105.319950]ip6_input+0xb2/0x380 > [ 105.320059]ip6_rcv_finish+0x103/0x350 > [ 105.320180]ipv6_rcv+0x93f/0xff0 > [ 105.320291]__netif_receive_skb_core+0x13ef/0x1900 > [ 105.320436]netif_receive_skb_internal+0xea/0x4c0 > [ 105.320579]napi_gro_receive+0x28e/0x320 > [ 105.320705]e1000_clean_rx_irq+0x3e9/0x6f0 > [ 105.320838]e1000e_poll+0x14e/0x570 > [ 105.320954]net_rx_action+0x4db/0xc80 > [ 105.321075]__do_softirq+0x1ca/0x7bf > [ 105.321194]irq_exit+0x104/0x110 > [ 105.321303]do_IRQ+0xb2/0x130 > [ 105.321407]ret_from_intr+0x0/0x19 > [ 105.321523]cpuidle_enter_state+0x223/0x5b0 > [ 105.321655]do_idle+0x110/0x1b0 > [ 105.321766]cpu_startup_entry+0xdb/0xe0 > [ 105.321891]start_secondary+0x2e9/0x360 > [ 105.322014]verify_cpu+0x0/0xf1 > [ 105.322121] >-> #0 > [ 105.322215] ( > [ 105.322276] slock-AF_INET6 > [ 105.322359] ){+.-.} > [ 105.322428] : > [ 105.322487]lock_acquire+0x12e/0x350 > [ 105.322602]_raw_spin_lock+0x30/0x70 > [ 105.322722]tcp_delack_timer+0x26/0x130 > [ 105.322846]call_timer_fn+0x188/0x5e0 > [ 105.322966]__run_timers+0x54d/0x670 > [ 105.323084]run_timer_softirq+0x2a/0x50 > [ 105.323208]__do_softirq+0x1ca/0x7bf > [ 105.323325]irq_exit+0x104/0x110 > [ 105.323435]smp_apic_timer_interrupt+0x14b/0x510 > [ 105.323576]apic_timer_interrupt+0x9a/0xa0 > [ 105.323705]cpuidle_enter_state+0x223/0x5b0 > [ 105.323836]do_idle+0x110/0x1b0 > [ 105.323944]cpu_startup_entry+0xdb/0xe0 > [ 105.324067]start_secondary+0x2e9/0x360 > [ 105.324189]verify_cpu+0x0/0xf1 > [ 105.324295] >other info that might help us debug this: > > [ 105.324489] Possible unsafe locking scenario: > > [ 105.324644]CPU0CPU1 > [ 105.324767] > [ 105.324890] lock( > [ 105.324963] (timer) > [ 105.325033] #5 > [ 105.325093] ); > [ 105.325152]lock( > [ 105.325278] slock-AF_INET6 > [ 105.325360] ); > [ 105.325419]lock( > [ 105.325544] (timer) > [ 105.325612] #5 > [ 105.325670] ); > [ 105.325729] lock( > [ 105.325797] slock-AF_INET6 > [ 105.325879] ); > [ 105.325938] > *** DEADLOCK *** > > [ 105.326086] 1 lock held by swapper/2/0: > [ 105.326193] #0: > [ 105.326257] ( > [ 105.331697] (timer) > [ 105.337038] #5 > [ 105.342339] ){+.-.} > [ 105.347620] , at: [] call_timer_fn+0x5/0x5e0 > [ 105.353021] >stack backtrace: > [ 105.363515] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.0-rc7-think+ #1 > [ 105.368886] Hardware name: LENOVO ThinkServer TS140/ThinkServer TS140, > BIOS FBKTB3AUS 06/16/2015 > [ 105.374330] Call Trace: > [ 105.379697] > [ 105.384997] dump_stack+0xbc/0x145 > [ 105.390339] ? dma_virt_map_sg+0xfb/0xfb > [ 105.395733] ? call_timer_fn+0x5/0x5e0 > [ 105.401076] ? print_lock+0x54/0x68 > [ 105.406344] print_circular_bug.isra.42+0x283/0x2bc > [ 105.411695] ? print_circular_bug_header+0xda/0xda > [ 105.417054] ? graph_lock+0x8d/0x100 > [ 105.422419] ? check_noncircular+0x20/0x20 > [ 105.427857] ? sched_clock_cpu+0x14/0xf0 > [ 105.433309] __lock_acquire+0x1f4a/0x2050 > [ 105.438725] ? debug_check_no_locks_freed+0x1a0/0x1a0 > [ 105.444160] ? __lock_acquire+0x6b3/0x2050 > [ 105.449580] ? debug_check_no_locks_freed+0x1a0/0x1a0 > [ 105.455015] ?
Re: [PATCH] ravb: Use common error handling code in ravb_probe()
Hello! On 10/28/2017 08:19 PM, SF Markus Elfring wrote: From: Markus ElfringDate: Sat, 28 Oct 2017 19:10:08 +0200 Add a jump target so that a bit of exception handling can be better reused at the end of this function. Have you actually tried to see if there's any change in the resulting object code? I've looked at ARM gcc 4.8.5's output and it didn't really change -- gcc was able to optimize these repetitive assignments pretty well... This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring --- drivers/net/ethernet/renesas/ravb_main.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) Diffstat also speaks about the futility of this patch. diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index a8822a756e08..62dbdf7de6cd 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -2226,6 +,10 @@ static int ravb_probe(struct platform_device *pdev) pm_runtime_put(>dev); pm_runtime_disable(>dev); return error; + +failure_indication: Pretty poor label name -- I would've preferred 'no_irq' or smth... + error = irq; + goto out_release; The only good thing is that looking at the assembly, I was able to figure out that 'ndev' check at 'out_release' is futile... :-) [...] MBR, Sergei
Re: [PATCH net-next] net: display hw address of source machine during ipv6 DAD failure
On 10/30/2017 04:43 PM, David Ahern wrote: > On 10/30/17 2:29 PM, Vishwanath Pai wrote: >> This patch updates the error messages displayed in kernel log to include >> hwaddress of the source machine that caused ipv6 duplicate address >> detection failures. >> >> Examples: >> >> a) When we receive a NA packet from another machine advertising our >> address: >> >> ICMPv6: NA: 34:ab:cd:56:11:e8 advertised our address 2601::2bb4 on eth0! > > your example above does not agree with the format below. You have the > compressed IPv6 address, yet the format ... > >> >> b) When we detect DAD failure during address assignment to an interface: >> >> IPv6: eth0: IPv6 duplicate address 2601::2b78 used by 34:ab:cd:56:11:e8 >> detected! >> >> Suggested-by: Igor Lubashev>> Signed-off-by: Vishwanath Pai >> --- > > >> @@ -989,8 +990,8 @@ static void ndisc_recv_na(struct sk_buff *skb) >> */ >> if (skb->pkt_type != PACKET_LOOPBACK) >> ND_PRINTK(1, warn, >> - "NA: someone advertises our address %pI6 on >> %s!\n", >> - >addr, ifp->idev->dev->name); >> + "NA: %pM advertised our address %pI6 on >> %s!\n", > > ... is uncompressed. Please make that '%pI6c' instead of pI6 in the line > above > > >> + eth_hdr(skb)->h_source, >addr, >> ifp->idev->dev->name); >> in6_ifa_put(ifp); >> return; >> } > > > It does print out the full uncompressed IPv6 address. I modified the message manually by hand while copying it out to the commit message in order to hide the real IP address of my machines, but did not realize that it was different from what the kernel would actually print out. -Vishwanath
[PATCH net-next] bpf: avoid rcu_dereference inside bpf_event_mutex lock region
During perf event attaching/detaching bpf programs, the tp_event->prog_array change is protected by the bpf_event_mutex lock in both attaching and deteching functions. Although tp_event->prog_array is a rcu pointer, rcu_derefrence is not needed to access it since mutex lock will guarantee ordering. Verified through "make C=2" that sparse locking check still happy with the new change. Also change the label name in perf_event_{attach,detach}_bpf_prog from "out" to "unlock" to reflect the code action after the label. Signed-off-by: Yonghong SongAcked-by: Alexei Starovoitov Acked-by: Martin KaFai Lau --- kernel/trace/bpf_trace.c | 17 +++-- 1 file changed, 7 insertions(+), 10 deletions(-) Peter, Could you check whether the below change to remove rcu_dereference_protected is what you wanted or not? Thanks! diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b65011d..e7685c5 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -767,20 +767,19 @@ int perf_event_attach_bpf_prog(struct perf_event *event, mutex_lock(_event_mutex); if (event->prog) - goto out; + goto unlock; - old_array = rcu_dereference_protected(event->tp_event->prog_array, - lockdep_is_held(_event_mutex)); + old_array = event->tp_event->prog_array; ret = bpf_prog_array_copy(old_array, NULL, prog, _array); if (ret < 0) - goto out; + goto unlock; /* set the new array to event->tp_event and set event->prog */ event->prog = prog; rcu_assign_pointer(event->tp_event->prog_array, new_array); bpf_prog_array_free(old_array); -out: +unlock: mutex_unlock(_event_mutex); return ret; } @@ -794,11 +793,9 @@ void perf_event_detach_bpf_prog(struct perf_event *event) mutex_lock(_event_mutex); if (!event->prog) - goto out; - - old_array = rcu_dereference_protected(event->tp_event->prog_array, - lockdep_is_held(_event_mutex)); + goto unlock; + old_array = event->tp_event->prog_array; ret = bpf_prog_array_copy(old_array, event->prog, NULL, _array); if (ret < 0) { bpf_prog_array_delete_safe(old_array, event->prog); @@ -810,6 +807,6 @@ void perf_event_detach_bpf_prog(struct perf_event *event) bpf_prog_put(event->prog); event->prog = NULL; -out: +unlock: mutex_unlock(_event_mutex); } -- 2.9.5
[PATCH net-next] net: filter: remove unused variable and fix warning
bpf_getsockopt bpf call sets the ret variable to zero and never changes it. What's worse in case CONFIG_INET is not selected the variable is completely unused generating a warning. Signed-off-by: Jakub KicinskiReviewed-by: Quentin Monnet --- net/core/filter.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 721c30889033..a0112168d6f9 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3288,7 +3288,6 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { struct sock *sk = bpf_sock->sk; - int ret = 0; if (!sk_fullsock(sk)) goto err_clear; @@ -3308,7 +3307,7 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, } else { goto err_clear; } - return ret; + return 0; #endif err_clear: memset(optval, 0, optlen); -- 2.14.1
Re: [PATCH net-next] net: display hw address of source machine during ipv6 DAD failure
On 10/30/17 2:29 PM, Vishwanath Pai wrote: > This patch updates the error messages displayed in kernel log to include > hwaddress of the source machine that caused ipv6 duplicate address > detection failures. > > Examples: > > a) When we receive a NA packet from another machine advertising our > address: > > ICMPv6: NA: 34:ab:cd:56:11:e8 advertised our address 2601::2bb4 on eth0! your example above does not agree with the format below. You have the compressed IPv6 address, yet the format ... > > b) When we detect DAD failure during address assignment to an interface: > > IPv6: eth0: IPv6 duplicate address 2601::2b78 used by 34:ab:cd:56:11:e8 > detected! > > Suggested-by: Igor Lubashev> Signed-off-by: Vishwanath Pai > --- > @@ -989,8 +990,8 @@ static void ndisc_recv_na(struct sk_buff *skb) >*/ > if (skb->pkt_type != PACKET_LOOPBACK) > ND_PRINTK(1, warn, > - "NA: someone advertises our address %pI6 on > %s!\n", > - >addr, ifp->idev->dev->name); > + "NA: %pM advertised our address %pI6 on > %s!\n", ... is uncompressed. Please make that '%pI6c' instead of pI6 in the line above > + eth_hdr(skb)->h_source, >addr, > ifp->idev->dev->name); > in6_ifa_put(ifp); > return; > }
Re: [run_timer_softirq] BUG: unable to handle kernel paging request at 0000000000010007
On Mon, Oct 30, 2017 at 12:29:47PM -0700, Linus Torvalds wrote: On Sun, Oct 29, 2017 at 4:48 PM, Fengguang Wuwrote: Here are 3 dmesgs related to wireless and 1 from ethernet. Fengguang, these would be lovelier still _if_ you have DEBUG_INFO enabled on the kernel, and your script were to find things like "symbol+0xhex/0xhex", and run "./scripts/faddr2line" on them. So [ 235.425464] BUG: unable to handle kernel paging request at 00010007 [ 235.425470] IP: run_timer_softirq+0x13a/0x470 would also then have run_timer_softirq at timer.c:XYZ which would make it easier to see exactly _what_ it is that faults. As it is, I think there's a fair number of inlining that makes it hard to see the cause, but that faddrtoline would make very obvious. Good idea and tips! It'll definitely help debug the issues where bisect cannot help. Finding that "symbol+xyz/abc" pattern should be fairly easy to automate, and should fit the 0day model fairly well. No? Sure. We'll add DEBUG_INFO and automate faddr2line. Regards, Fengguang
[PATCH net-next] net: display hw address of source machine during ipv6 DAD failure
This patch updates the error messages displayed in kernel log to include hwaddress of the source machine that caused ipv6 duplicate address detection failures. Examples: a) When we receive a NA packet from another machine advertising our address: ICMPv6: NA: 34:ab:cd:56:11:e8 advertised our address 2601::2bb4 on eth0! b) When we detect DAD failure during address assignment to an interface: IPv6: eth0: IPv6 duplicate address 2601::2b78 used by 34:ab:cd:56:11:e8 detected! Suggested-by: Igor LubashevSigned-off-by: Vishwanath Pai --- include/net/addrconf.h | 2 +- net/ipv6/addrconf.c| 6 +++--- net/ipv6/ndisc.c | 9 + 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index 15b5ffd..2a616ea 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -208,7 +208,7 @@ bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr, void ipv6_mc_init_dev(struct inet6_dev *idev); void ipv6_mc_destroy_dev(struct inet6_dev *idev); int ipv6_mc_check_mld(struct sk_buff *skb, struct sk_buff **skb_trimmed); -void addrconf_dad_failure(struct inet6_ifaddr *ifp); +void addrconf_dad_failure(struct sk_buff *skb, struct inet6_ifaddr *ifp); bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, const struct in6_addr *src_addr); diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 5a8a102..cfa374c 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1987,7 +1987,7 @@ static int addrconf_dad_end(struct inet6_ifaddr *ifp) return err; } -void addrconf_dad_failure(struct inet6_ifaddr *ifp) +void addrconf_dad_failure(struct sk_buff *skb, struct inet6_ifaddr *ifp) { struct inet6_dev *idev = ifp->idev; struct net *net = dev_net(ifp->idev->dev); @@ -1997,8 +1997,8 @@ void addrconf_dad_failure(struct inet6_ifaddr *ifp) return; } - net_info_ratelimited("%s: IPv6 duplicate address %pI6c detected!\n", -ifp->idev->dev->name, >addr); + net_info_ratelimited("%s: IPv6 duplicate address %pI6c used by %pM detected!\n", +ifp->idev->dev->name, >addr, eth_hdr(skb)->h_source); spin_lock_bh(>lock); diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index 266a530..9d6ea9d 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -46,6 +46,7 @@ #endif #include +#include #include #include #include @@ -822,7 +823,7 @@ static void ndisc_recv_ns(struct sk_buff *skb) * who is doing DAD * so fail our DAD process */ - addrconf_dad_failure(ifp); + addrconf_dad_failure(skb, ifp); return; } else { /* @@ -975,7 +976,7 @@ static void ndisc_recv_na(struct sk_buff *skb) if (ifp) { if (skb->pkt_type != PACKET_LOOPBACK && (ifp->flags & IFA_F_TENTATIVE)) { - addrconf_dad_failure(ifp); + addrconf_dad_failure(skb, ifp); return; } /* What should we make now? The advertisement @@ -989,8 +990,8 @@ static void ndisc_recv_na(struct sk_buff *skb) */ if (skb->pkt_type != PACKET_LOOPBACK) ND_PRINTK(1, warn, - "NA: someone advertises our address %pI6 on %s!\n", - >addr, ifp->idev->dev->name); + "NA: %pM advertised our address %pI6 on %s!\n", + eth_hdr(skb)->h_source, >addr, ifp->idev->dev->name); in6_ifa_put(ifp); return; } -- 1.9.1
Re: [PATCH net-next v5 06/10] bnxt: Add devlink support for config get/set
On Mon, Oct 30, 2017 at 1:35 PM, Jiri Pirkowrote: > Mon, Oct 30, 2017 at 03:46:12PM CET, steven.l...@broadcom.com wrote: >>Implements get and set of configuration parameters using new devlink >>config get/set API. Parameters themselves defined in later patches. >> >>Signed-off-by: Steve Lin >>Acked-by: Andy Gospodarek >>--- >> drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c | 258 >> +- >> drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.h | 17 ++ >> 2 files changed, 263 insertions(+), 12 deletions(-) >> >>diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c >>b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c >>index 402fa32..deb24e0 100644 > > [...] > >>+static int bnxt_dl_perm_config_set(struct devlink *devlink, >>+ enum devlink_perm_config_param param, >>+ enum devlink_perm_config_type type, >>+ union devlink_perm_config_value *value, >>+ bool *restart_reqd) >>+{ >>+ struct bnxt *bp = bnxt_get_bp_from_dl(devlink); >>+ struct bnxt_drv_cfgparam *entry = NULL; >>+ u32 value_int; >>+ u32 bytesize; >>+ int idx = 0; >>+ int ret = 0; >>+ u16 val16; >>+ u8 val8; >>+ int i; >>+ >>+ *restart_reqd = 0; >>+ >>+ /* Find parameter in table */ >>+ for (i = 0; i < BNXT_NUM_DRV_CFGPARAM; i++) { >>+ if (param == bnxt_drv_cfgparam_list[i].param) { >>+ entry = _drv_cfgparam_list[i]; >>+ break; >>+ } >>+ } >>+ >>+ /* Not found */ >>+ if (!entry) >>+ return -EINVAL; >>+ >>+ /* Check to see if this func type can access variable */ >>+ if (BNXT_PF(bp) && !(entry->func & BNXT_DRV_PF)) >>+ return -EOPNOTSUPP; >>+ if (BNXT_VF(bp) && !(entry->func & BNXT_DRV_VF)) >>+ return -EOPNOTSUPP; >>+ >>+ /* If parameter is per port or function, compute index */ >>+ if (entry->appl == BNXT_DRV_APPL_PORT) { >>+ idx = bp->pf.port_id; >>+ } else if (entry->appl == BNXT_DRV_APPL_FUNCTION) { >>+ if (BNXT_PF(bp)) >>+ idx = bp->pf.fw_fid - BNXT_FIRST_PF_FID; >> } >> >>+ bytesize = roundup(entry->bitlength, BITS_PER_BYTE) / BITS_PER_BYTE; >>+ >>+ /* Type expected by device may or may not be the same as type passed >>+ * in from devlink API. So first convert to an interval u32 value, >>+ * then to type needed by device. >>+ */ >>+ if (type == DEVLINK_PERM_CONFIG_TYPE_U8) { >>+ value_int = value->value8; >>+ } else if (type == DEVLINK_PERM_CONFIG_TYPE_U16) { >>+ value_int = value->value16; >>+ } else if (type == DEVLINK_PERM_CONFIG_TYPE_U32) { >>+ value_int = value->value32; >>+ } else { >>+ /* Unsupported type */ >>+ return -EINVAL; >>+ } >>+ >>+ if (bytesize == 1) { >>+ val8 = value_int; > >>+ ret = bnxt_nvm_write(bp, entry->nvm_param, idx, , >>+ entry->bitlength); >>+ } else if (bytesize == 2) { >>+ val16 = value_int; >>+ ret = bnxt_nvm_write(bp, entry->nvm_param, idx, , >>+ entry->bitlength); >>+ } else { >>+ ret = bnxt_nvm_write(bp, entry->nvm_param, idx, _int, >>+ entry->bitlength); >>+ } >>+ >>+ /* Restart required for all nvm parameter writes */ >>+ *restart_reqd = 1; >>+ >>+ return ret; >>+} > > I don't like this swissknife approach for setters and getters. It's > messy, and hard to read. There should be clean get/set pair function > per parameter defined in array. The advantage of the swiss kinfe approach is that you don't have a large number of almost-identical functions (i.e. any "set" that operates on a u32 will be doing essentially the exact same thing) in the driver, which could lead to code bloat and also make it harder to catch errors (since any bug that may creep in during the copy/paste to make so many functions will only be caught when that specific parameter is set, rather than when any parameter is set). In general, I very much appreciate the thorough review. I do feel, though, like some of this basic architectural stuff could have been addressed more efficiently in v1 or v2, rather than v5. I admit that when the comments to v4 included suggestions like removing extra spaces in the commit message, I was thinking that we were close to reaching consensus. :) I just want to make sure that you feel that the overall goal of this effort is worthwhile - I will make the changes you've suggested here and in the other replies to v5, but I am hopeful that v6 does not result in another set of big architectural changes. > Something
Re: [PATCH net-next v5 02/10] devlink: Adding SR-IOV enablement perm config param
On Mon, Oct 30, 2017 at 1:20 PM, Jiri Pirkowrote: > Mon, Oct 30, 2017 at 03:46:08PM CET, steven.l...@broadcom.com wrote: >>Adding DEVLINK_PERM_CONFIG_SRIOV_ENABLED permanent config >>parameter; this parameter controls whether the device >>advertises the SR-IOV PCI capability. Value is permanent, so >>becomes the new default value for this device. >> >> DEVLINK_PERM_CONFIG_DISABLE = Disable SR-IOV >> DEVLINK_PERM_CONFIG_ENABLE = Enable SR-IOV >> >>Signed-off-by: Steve Lin >>Acked-by: Andy Gospodarek >>--- >> include/uapi/linux/devlink.h | 18 +- >> net/core/devlink.c | 1 + >> 2 files changed, 18 insertions(+), 1 deletion(-) >> >>diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h >>index 55e35f2..1c5d4ae 100644 >>--- a/include/uapi/linux/devlink.h >>+++ b/include/uapi/linux/devlink.h >>@@ -256,8 +256,24 @@ enum devlink_dpipe_header_id { >> DEVLINK_DPIPE_HEADER_IPV6, >> }; >> >>-/* Permanent config parameters */ >>+/* Permanent config parameter enable/disable >>+ * DEVLINK_PERM_CONFIG_DISABLE = disable a permanent config parameter >>+ * DEVLINK_PERM_CONFIG_ENBALE = enable a permanent config parameter >>+ */ >>+enum devlink_perm_config_enabled { >>+ DEVLINK_PERM_CONFIG_DISABLE, >>+ DEVLINK_PERM_CONFIG_ENABLE, >>+}; > > > Basically, this is "bool" > > Why don't you use some own enum instead of NLA_U*. Like team does for > example: > > enum team_option_type { > TEAM_OPTION_TYPE_U32, > TEAM_OPTION_TYPE_STRING, > TEAM_OPTION_TYPE_BINARY, > TEAM_OPTION_TYPE_BOOL, > TEAM_OPTION_TYPE_S32, > }; > > I had added enum devlink_perm_config_type in v5 based on your comments in v4 - I will add BOOL if it helps clarity. > >>+ >>+/* Permanent config parameters: >>+ * DEVLINK_PERM_CONFIG_SRIOV_ENABLED: Configures whether SR-IOV PCI >>capability >>+ * is advertised by the device. >>+ * DEVLINK_PERM_CONFIG_DISABLE = disable SR-IOV >>+ * DEVLINK_PERM_CONFIG_ENABLE = enable SR-IOV >>+ */ >> enum devlink_perm_config_param { >>+ DEVLINK_PERM_CONFIG_SRIOV_ENABLED, > > > Please align "ENABLED" vs "ENABLE". > The rest of devlink code uses "ENABLED" > Ok. > >>+ >> __DEVLINK_PERM_CONFIG_MAX, >> DEVLINK_PERM_CONFIG_MAX = __DEVLINK_PERM_CONFIG_MAX - 1 >> }; >>diff --git a/net/core/devlink.c b/net/core/devlink.c >>index 5e77408..b4d43c29 100644 >>--- a/net/core/devlink.c >>+++ b/net/core/devlink.c >>@@ -1569,6 +1569,7 @@ static int devlink_nl_cmd_eswitch_set_doit(struct >>sk_buff *skb, >> static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1]; >> >> static const u8 devlink_perm_cfg_param_types[DEVLINK_PERM_CONFIG_MAX + 1] = { >>+ [DEVLINK_PERM_CONFIG_SRIOV_ENABLED] = NLA_U8, >> }; >> >> static int devlink_nl_single_param_get(struct sk_buff *msg, >>-- >>2.7.4 >>
Re: [PATCH net-next v5 01/10] devlink: Add permanent config parameter get/set operations
On Mon, Oct 30, 2017 at 1:03 PM, Jiri Pirkowrote: > Mon, Oct 30, 2017 at 03:46:07PM CET, steven.l...@broadcom.com wrote: >>Add support for permanent config parameter get/set commands. Used >>for persistent device configuration parameters. >> >>Signed-off-by: Steve Lin >>Acked-by: Andy Gospodarek > > In general, I don't like how you use netlink. Please see the rest of > this code. We should do things in the common way. More info inlined. > > > [...] > >> >>+/* Permanent config parameters */ >>+enum devlink_perm_config_param { >>+ __DEVLINK_PERM_CONFIG_MAX, >>+ DEVLINK_PERM_CONFIG_MAX = __DEVLINK_PERM_CONFIG_MAX - 1 >>+}; >>+ >>+/* Permanent config types */ >>+enum devlink_perm_config_type { >>+ DEVLINK_PERM_CONFIG_TYPE_UNSPEC, > > You should not need this. Type should be always specified. This was an attempt to align enum devlink_perm_config_type with the NLA_* enums in netlink.h, which include the unspecified type. I will remove the unspecified enum, but it seems like having multiple type enums that are not aligned with each other could be problematic. > > >>+ DEVLINK_PERM_CONFIG_TYPE_U8, >>+ DEVLINK_PERM_CONFIG_TYPE_U16, >>+ DEVLINK_PERM_CONFIG_TYPE_U32, >>+}; > > This definitelly should not be in uapi header. > > >>+ >>+/* Permanent config value */ >>+union devlink_perm_config_value { >>+ __u8value8; >>+ __u16 value16; >>+ __u32 value32; >>+}; > > This definitelly should not be in uapi header. Ok, will move these. > > > >>+ >> #endif /* _UAPI_LINUX_DEVLINK_H_ */ > > [...] > > >>+static int devlink_nl_config_get_fill(struct sk_buff *msg, >>+struct devlink *devlink, >>+enum devlink_command cmd, >>+struct genl_info *info) >>+{ >>+ struct nlattr *tb[DEVLINK_ATTR_MAX + 1]; >>+ enum devlink_perm_config_param param; >>+ enum devlink_perm_config_type type; >>+ struct nlattr *cfgparam_attr; >>+ struct nlattr *attr; >>+ void *hdr; >>+ int err; >>+ int rem; >>+ >>+ hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq, >>+_nl_family, 0, cmd); >>+ if (!hdr) >>+ return -EMSGSIZE; >>+ >>+ err = devlink_nl_put_handle(msg, devlink); >>+ if (err) >>+ goto nla_put_failure; >>+ >>+ if (!info->attrs[DEVLINK_ATTR_PERM_CONFIGS]) { >>+ err = -EINVAL; >>+ goto nla_put_failure; >>+ } >>+ >>+ cfgparam_attr = nla_nest_start(msg, DEVLINK_ATTR_PERM_CONFIGS); >>+ >>+ nla_for_each_nested(attr, info->attrs[DEVLINK_ATTR_PERM_CONFIGS], > > > Okay. So I have to know what is in kernel in order to get the value. > > We need a dump operation. In fact, why can't we just have dump operation? Ok, I can add a dump operation. > > > >>+ rem) { >>+ err = nla_parse_nested(tb, DEVLINK_ATTR_MAX, attr, >>+ devlink_nl_policy, NULL); >>+ if (err) >>+ goto nla_nest_failure; >>+ if (!tb[DEVLINK_ATTR_PERM_CONFIG_PARAMETER] || >>+ !tb[DEVLINK_ATTR_PERM_CONFIG_TYPE]) >>+ continue; >>+ >>+ param = nla_get_u32(tb[DEVLINK_ATTR_PERM_CONFIG_PARAMETER]); >>+ type = nla_get_u8(tb[DEVLINK_ATTR_PERM_CONFIG_TYPE]); > > So to get parameter, I have to specify a type? What if I specify wrong > type? That is wrong. The type should be read attr for get. For set, seems like the user would need to know and set the type. So, I was just doing the same for get(). But I can make the type a read attribute for get. > > You should have a param -> type table and always use that. > > >>+ if (param > DEVLINK_PERM_CONFIG_MAX || >>+ type > NLA_TYPE_MAX) { >>+ continue; >>+ } >>+ /* Note if single_param_get fails, that param won't be in >>+ * response msg, so caller will know which param(s) failed to >>+ * get. >>+ */ >>+ devlink_nl_single_param_get(msg, devlink, param, type); >>+ } >>+ >>+ nla_nest_end(msg, cfgparam_attr); >>+ >>+ genlmsg_end(msg, hdr); >>+ return 0; >>+ >>+nla_nest_failure: >>+ nla_nest_cancel(msg, cfgparam_attr); >>+nla_put_failure: > > > You should make this multipart aware. If config parameters won't fit > into a single skb, you have to allocate another one. See > devlink_dpipe_tables_fill > as an example how to do it. It is important to have this from the > beginning as the userspace knows to expect multipart message. In a response to a similar comment from v2, I tried to describe why I felt multipart messages wouldn't be required. There was no response to my reply, so I thought that my argument was accepted. Apparently not, so I will
Re: [PATCH net-next] net: bcmgenet: Avoid calling platform_device_put() twice in bcmgenet_mii_exit()
On 10/27/2017 10:05 PM, Wei Yongjun wrote: > Remove platform_device_put() call after platform_device_unregister() > from function bcmgenet_mii_exit(), otherwise, we will call > platform_device_put() twice. > > Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC > MDIO controller driver") Please don't force a wrap to 80 columns for commit subjects. > Signed-off-by: Wei YongjunAcked-by: Florian Fainelli > --- > drivers/net/ethernet/broadcom/genet/bcmmii.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/drivers/net/ethernet/broadcom/genet/bcmmii.c > b/drivers/net/ethernet/broadcom/genet/bcmmii.c > index ba3fcfd..5333274 100644 > --- a/drivers/net/ethernet/broadcom/genet/bcmmii.c > +++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c > @@ -571,5 +571,4 @@ void bcmgenet_mii_exit(struct net_device *dev) > of_phy_deregister_fixed_link(dn); > of_node_put(priv->phy_dn); > platform_device_unregister(priv->mii_pdev); > - platform_device_put(priv->mii_pdev); > } > -- Florian
Re: [run_timer_softirq] BUG: unable to handle kernel paging request at 0000000000010007
On Sun, Oct 29, 2017 at 4:48 PM, Fengguang Wuwrote: > > Here are 3 dmesgs related to wireless and 1 from ethernet. Fengguang, these would be lovelier still _if_ you have DEBUG_INFO enabled on the kernel, and your script were to find things like "symbol+0xhex/0xhex", and run "./scripts/faddr2line" on them. So > [ 235.425464] BUG: unable to handle kernel paging request at 00010007 > [ 235.425470] IP: run_timer_softirq+0x13a/0x470 would also then have run_timer_softirq at timer.c:XYZ which would make it easier to see exactly _what_ it is that faults. As it is, I think there's a fair number of inlining that makes it hard to see the cause, but that faddrtoline would make very obvious. Finding that "symbol+xyz/abc" pattern should be fairly easy to automate, and should fit the 0day model fairly well. No? In this case, the trapping instruction ends up decoding to 0: 4c 8d 6c c5 90lea-0x70(%rbp,%rax,8),%r13 5: 49 8b 45 00 mov0x0(%r13),%rax 9: 48 85 c0 test %rax,%rax c: 74 deje 0xffec e: 4d 8b 7d 00 mov0x0(%r13),%r15 12: 4d 89 7c 24 08mov%r15,0x8(%r12) 17: 0f 1f 44 00 00nopl 0x0(%rax,%rax,1) 1c: 49 8b 07 mov(%r15),%rax 1f: 49 8b 57 08 mov0x8(%r15),%rdx 23: 48 85 c0 test %rax,%rax 26: 48 89 02 mov%rax,(%rdx) 29: 74 04je 0x2f 2b:* 48 89 50 08 mov%rdx,0x8(%rax) <-- trapping instruction 2f: 41 f6 47 2a 20testb $0x20,0x2a(%r15) 34: 49 c7 47 08 00 00 00 movq $0x0,0x8(%r15) and %rax has the value 0x, so yes, it will trap at 0x10007. It's not trivial to see just *wjhat* access it is. I *think* that "testb $32" is checking for TIMER_IRQSAFE in expire_timers(), and that the oops is due to the list operations in detach_timer() (inlined). Which doesn't really help: it looks like the timer lists are corrupt. With some luck, some register state could have the timer function pointer in it, and we'd get a hint of *which* timer this is, but that doesn't look to be the case here either. I'm not seeing anything to really help debug this here. Linus
Re: [patch net-next RFC 1/9] net_sch: red: Add offload ability to RED qdisc
On 10/30/17 7:54 AM, Nogah Frankel wrote: >>> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h >>> index 0e88cc2..743c42a 100644 >>> --- a/include/uapi/linux/pkt_sched.h >>> +++ b/include/uapi/linux/pkt_sched.h >>> @@ -255,6 +255,7 @@ struct tc_red_qopt { >>> #define TC_RED_ECN 1 >>> #define TC_RED_HARDDROP2 >>> #define TC_RED_ADAPTATIVE 4 >>> +#define TC_RED_OFFLOADED 8 >>> }; >>> >>> struct tc_red_xstats { >> >> What keeps a user from setting this flag in the tc_red_qopt it >> passes into the a change operation? > > Nothing keeps the user from doing it, but it has no effect. > The decision to offload is the driver's only. > It is basically a read-only flag. > If it is read-only, then attempts to set it from userspace should cause the command to fail.
Re: [PATCH net-next 1/1] net sched qdisc: pass netlink message flags in event notification
On Mon, Oct 30, 2017 at 11:07 AM, Roman Mashakwrote: > Cong Wang writes: > >> On Sat, Oct 28, 2017 at 8:36 PM, Roman Mashak wrote: >>> Cong Wang writes: >> >> Hmm, I thought you use RTM_NEWQDISC+RTM_DELQDISC to >> determine it is replacement, no? > > Create is RTM_NEWQDISC and NLM_F_EXCL|NLM_F_CREATE, replacement is > RTM_NEWQDISC and NLM_F_REPLACE in netlink flags. Is there any reason we can't use RTM_NEWQDISC+RTM_DELQDISC rather than NLM_F_REPLACE to determine it is replacement? Note, RTM_NEWQDISC+RTM_DELQDISC are put in a same message not two.
[net-next] tcp_delack_timer circular locking dependancy
[ 105.316650] == [ 105.316818] WARNING: possible circular locking dependency detected [ 105.316986] 4.14.0-rc7-think+ #1 Not tainted [ 105.317108] -- [ 105.317273] swapper/2/0 is trying to acquire lock: [ 105.317407] ( [ 105.317476] slock-AF_INET6 [ 105.317564] ){+.-.} [ 105.317642] , at: [] tcp_delack_timer+0x26/0x130 [ 105.317807] but task is already holding lock: [ 105.317961] ( [ 105.318024] (timer) [ 105.318097] #5 [ 105.318168] ){+.-.} [ 105.318241] , at: [] call_timer_fn+0x5/0x5e0 [ 105.318393] which lock already depends on the new lock. [ 105.318594] the existing dependency chain (in reverse order) is: [ 105.318781] -> #1 [ 105.318879] ( [ 105.318939] (timer) [ 105.319009] #5 [ 105.319068] ){+.-.} [ 105.319137] : [ 105.319195]del_timer_sync+0x3c/0xb0 [ 105.319313]inet_csk_reqsk_queue_drop+0x26c/0x4e0 [ 105.319459]inet_csk_complete_hashdance+0x1e/0x90 [ 105.319598]tcp_check_req+0x787/0x9a0 [ 105.319716]tcp_v6_rcv+0x914/0x1060 [ 105.319828]ip6_input_finish+0x291/0xba0 [ 105.319950]ip6_input+0xb2/0x380 [ 105.320059]ip6_rcv_finish+0x103/0x350 [ 105.320180]ipv6_rcv+0x93f/0xff0 [ 105.320291]__netif_receive_skb_core+0x13ef/0x1900 [ 105.320436]netif_receive_skb_internal+0xea/0x4c0 [ 105.320579]napi_gro_receive+0x28e/0x320 [ 105.320705]e1000_clean_rx_irq+0x3e9/0x6f0 [ 105.320838]e1000e_poll+0x14e/0x570 [ 105.320954]net_rx_action+0x4db/0xc80 [ 105.321075]__do_softirq+0x1ca/0x7bf [ 105.321194]irq_exit+0x104/0x110 [ 105.321303]do_IRQ+0xb2/0x130 [ 105.321407]ret_from_intr+0x0/0x19 [ 105.321523]cpuidle_enter_state+0x223/0x5b0 [ 105.321655]do_idle+0x110/0x1b0 [ 105.321766]cpu_startup_entry+0xdb/0xe0 [ 105.321891]start_secondary+0x2e9/0x360 [ 105.322014]verify_cpu+0x0/0xf1 [ 105.322121] -> #0 [ 105.322215] ( [ 105.322276] slock-AF_INET6 [ 105.322359] ){+.-.} [ 105.322428] : [ 105.322487]lock_acquire+0x12e/0x350 [ 105.322602]_raw_spin_lock+0x30/0x70 [ 105.322722]tcp_delack_timer+0x26/0x130 [ 105.322846]call_timer_fn+0x188/0x5e0 [ 105.322966]__run_timers+0x54d/0x670 [ 105.323084]run_timer_softirq+0x2a/0x50 [ 105.323208]__do_softirq+0x1ca/0x7bf [ 105.323325]irq_exit+0x104/0x110 [ 105.323435]smp_apic_timer_interrupt+0x14b/0x510 [ 105.323576]apic_timer_interrupt+0x9a/0xa0 [ 105.323705]cpuidle_enter_state+0x223/0x5b0 [ 105.323836]do_idle+0x110/0x1b0 [ 105.323944]cpu_startup_entry+0xdb/0xe0 [ 105.324067]start_secondary+0x2e9/0x360 [ 105.324189]verify_cpu+0x0/0xf1 [ 105.324295] other info that might help us debug this: [ 105.324489] Possible unsafe locking scenario: [ 105.324644]CPU0CPU1 [ 105.324767] [ 105.324890] lock( [ 105.324963] (timer) [ 105.325033] #5 [ 105.325093] ); [ 105.325152]lock( [ 105.325278] slock-AF_INET6 [ 105.325360] ); [ 105.325419]lock( [ 105.325544] (timer) [ 105.325612] #5 [ 105.325670] ); [ 105.325729] lock( [ 105.325797] slock-AF_INET6 [ 105.325879] ); [ 105.325938] *** DEADLOCK *** [ 105.326086] 1 lock held by swapper/2/0: [ 105.326193] #0: [ 105.326257] ( [ 105.331697] (timer) [ 105.337038] #5 [ 105.342339] ){+.-.} [ 105.347620] , at: [] call_timer_fn+0x5/0x5e0 [ 105.353021] stack backtrace: [ 105.363515] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.0-rc7-think+ #1 [ 105.368886] Hardware name: LENOVO ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015 [ 105.374330] Call Trace: [ 105.379697] [ 105.384997] dump_stack+0xbc/0x145 [ 105.390339] ? dma_virt_map_sg+0xfb/0xfb [ 105.395733] ? call_timer_fn+0x5/0x5e0 [ 105.401076] ? print_lock+0x54/0x68 [ 105.406344] print_circular_bug.isra.42+0x283/0x2bc [ 105.411695] ? print_circular_bug_header+0xda/0xda [ 105.417054] ? graph_lock+0x8d/0x100 [ 105.422419] ? check_noncircular+0x20/0x20 [ 105.427857] ? sched_clock_cpu+0x14/0xf0 [ 105.433309] __lock_acquire+0x1f4a/0x2050 [ 105.438725] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 105.444160] ? __lock_acquire+0x6b3/0x2050 [ 105.449580] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 105.455015] ? sched_clock_cpu+0x14/0xf0 [ 105.460514] ? __lock_acquire+0x6b3/0x2050 [ 105.465984] ? cyc2ns_read_end+0x10/0x10 [ 105.471395] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 105.476934] ? mark_lock+0x16f/0x9b0 [ 105.482507] ? print_irqtrace_events+0x110/0x110 [ 105.488150] ?
Re: [PATCH net-next] net: bcmgenet: Avoid calling platform_device_put() twice in bcmgenet_mii_exit()
On 10/27/2017 10:05 PM, Wei Yongjun wrote: > Remove platform_device_put() call after platform_device_unregister() > from function bcmgenet_mii_exit(), otherwise, we will call > platform_device_put() twice. > > Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC > MDIO controller driver") > Signed-off-by: Wei Yongjun> --- > drivers/net/ethernet/broadcom/genet/bcmmii.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/drivers/net/ethernet/broadcom/genet/bcmmii.c > b/drivers/net/ethernet/broadcom/genet/bcmmii.c > index ba3fcfd..5333274 100644 > --- a/drivers/net/ethernet/broadcom/genet/bcmmii.c > +++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c > @@ -571,5 +571,4 @@ void bcmgenet_mii_exit(struct net_device *dev) > of_phy_deregister_fixed_link(dn); > of_node_put(priv->phy_dn); > platform_device_unregister(priv->mii_pdev); > - platform_device_put(priv->mii_pdev); > } > Acked-by: Doug Berger
Re: [PATCH] net: phy: leds: Add support for "link" trigger
On 10/30/2017 11:46 AM, Maciej S. Szmigiero wrote: > On 30.10.2017 19:36, Florian Fainelli wrote: >> On 10/28/2017 08:27 AM, Maciej S. Szmigiero wrote: >>> Currently, we create a LED trigger for any link speed known to a PHY. >>> These triggers only fire when their exact link speed had been negotiated >>> (they aren't cumulative, that is, they don't fire for "their or any higher" >>> link speed). >>> >>> What we are missing, however, is a trigger which will fire on any link >>> speed known to the PHY. Such trigger can then be used for implementing a >>> poor man's substitute of the "link" LED on boards that lack it. >>> Let's add it. >> >> The use case definitively makes sense, but the implementation a little less. >> >>> >>> Signed-off-by: Maciej S. Szmigiero>>> --- >>> drivers/net/phy/Kconfig| 7 +-- >>> drivers/net/phy/phy_led_triggers.c | 19 --- >>> 2 files changed, 21 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig >>> index cd931cf9dcc2..3bcc2107ad77 100644 >>> --- a/drivers/net/phy/Kconfig >>> +++ b/drivers/net/phy/Kconfig >>> @@ -191,11 +191,14 @@ config LED_TRIGGER_PHY >>> Adds support for a set of LED trigger events per-PHY. Link >>> state change will trigger the events, for consumption by an >>> LED class driver. There are triggers for each link speed currently >>> - supported by the phy, and are of the form: >>> + supported by the PHY and also a one common "link" trigger as a >>> + logical-or of all the link speed ones. >>> + All these triggers are named according to the following pattern: >>> :: >>> >>> Where speed is in the form: >>> - Mbps or Gbps >>> + Mbps OR Gbps OR link >>> + for any speed known to the PHY. >>> >>> >>> comment "MII PHY device drivers" >>> diff --git a/drivers/net/phy/phy_led_triggers.c >>> b/drivers/net/phy/phy_led_triggers.c >>> index 94ca42e630bb..5b6f1876f514 100644 >>> --- a/drivers/net/phy/phy_led_triggers.c >>> +++ b/drivers/net/phy/phy_led_triggers.c >>> @@ -20,7 +20,8 @@ static struct phy_led_trigger >>> *phy_speed_to_led_trigger(struct phy_device *phy, >>> { >>> unsigned int i; >>> >>> - for (i = 0; i < phy->phy_num_led_triggers; i++) { >>> + /* the first (i = 0) trigger is for "any" speed */ >>> + for (i = 1; i < phy->phy_num_led_triggers; i++) { >>> if (phy->phy_led_triggers[i].speed == speed) >>> return >phy_led_triggers[i]; >> >> This looks unnecessary, because existing LED triggers are all relative >> to a particular speed, you need to coerce the existing code into >> accepting SPEED_UNKNOWN as a hint to mean "link". Just implement "link" >> as a different trigger that does not need the speed argument/lookup to >> be happening and don't change how the indexing into the array of speed >> triggers is done. Just have a separate "link" trigger that is stored >> separately from that existing array. > > This way of implementation (overloading SPEED_UNKNOWN) needed least > modification to the driver, but if a purer way is preferred then ok. Sure, that makes it simpler to implement, but it seems to me this would be cleaner if the "link" trigger was handled separately, the next thing that comes to mind is actually implementing an "activity" trigger, and this one is likely also reasonably speed independent. > >> Also, what happens if I set both "link" and a given "speed" and my RJ-45 >> connector only has two LEDs, and one of them is already used for "activity"? > > One LED can have only one trigger attached AFAIK. True, so we should be fine, thanks! -- Florian
Re: [PATCH v9 06/10] arm64: dts: allwinner: Restore EMAC changes
On Fri, Oct 27, 2017 at 05:11:14PM +0200, Maxime Ripard wrote: > On Tue, Oct 24, 2017 at 07:57:10PM +0200, Corentin Labbe wrote: > > The original dwmac-sun8i DT bindings have some issue on how to handle > > integrated PHY and was reverted in last RC of 4.13. > > But now we have a solution so we need to get back that was reverted. > > > > This patch restore arm64 DT about dwmac-sun8i > > This reverts commit 87e1f5e8bb4b ("arm64: dts: allwinner: Revert EMAC > > changes") > > > > Signed-off-by: Corentin Labbe> > --- > > .../boot/dts/allwinner/sun50i-a64-bananapi-m64.dts | 16 > > .../boot/dts/allwinner/sun50i-a64-pine64-plus.dts| 15 +++ > > arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts | 17 + > > .../dts/allwinner/sun50i-a64-sopine-baseboard.dts| 16 > > arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi| 20 > > > > .../boot/dts/allwinner/sun50i-h5-nanopi-neo2.dts | 17 + > > .../boot/dts/allwinner/sun50i-h5-orangepi-pc2.dts| 17 + > > .../boot/dts/allwinner/sun50i-h5-orangepi-prime.dts | 17 + > > 8 files changed, 135 insertions(+) > > Can you split the changes between the A64 and the H5? It's going to be > difficult to merge otherwise. > > (You also forgot to add Florian's Acked-by on your whole serie). > Will do Thanks Regards
Re: [PATCH] net: phy: leds: Add support for "link" trigger
On 30.10.2017 19:36, Florian Fainelli wrote: > On 10/28/2017 08:27 AM, Maciej S. Szmigiero wrote: >> Currently, we create a LED trigger for any link speed known to a PHY. >> These triggers only fire when their exact link speed had been negotiated >> (they aren't cumulative, that is, they don't fire for "their or any higher" >> link speed). >> >> What we are missing, however, is a trigger which will fire on any link >> speed known to the PHY. Such trigger can then be used for implementing a >> poor man's substitute of the "link" LED on boards that lack it. >> Let's add it. > > The use case definitively makes sense, but the implementation a little less. > >> >> Signed-off-by: Maciej S. Szmigiero>> --- >> drivers/net/phy/Kconfig| 7 +-- >> drivers/net/phy/phy_led_triggers.c | 19 --- >> 2 files changed, 21 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig >> index cd931cf9dcc2..3bcc2107ad77 100644 >> --- a/drivers/net/phy/Kconfig >> +++ b/drivers/net/phy/Kconfig >> @@ -191,11 +191,14 @@ config LED_TRIGGER_PHY >>Adds support for a set of LED trigger events per-PHY. Link >>state change will trigger the events, for consumption by an >>LED class driver. There are triggers for each link speed currently >> - supported by the phy, and are of the form: >> + supported by the PHY and also a one common "link" trigger as a >> + logical-or of all the link speed ones. >> + All these triggers are named according to the following pattern: >>:: >> >>Where speed is in the form: >> -Mbps or Gbps >> +Mbps OR Gbps OR link >> +for any speed known to the PHY. >> >> >> comment "MII PHY device drivers" >> diff --git a/drivers/net/phy/phy_led_triggers.c >> b/drivers/net/phy/phy_led_triggers.c >> index 94ca42e630bb..5b6f1876f514 100644 >> --- a/drivers/net/phy/phy_led_triggers.c >> +++ b/drivers/net/phy/phy_led_triggers.c >> @@ -20,7 +20,8 @@ static struct phy_led_trigger >> *phy_speed_to_led_trigger(struct phy_device *phy, >> { >> unsigned int i; >> >> -for (i = 0; i < phy->phy_num_led_triggers; i++) { >> +/* the first (i = 0) trigger is for "any" speed */ >> +for (i = 1; i < phy->phy_num_led_triggers; i++) { >> if (phy->phy_led_triggers[i].speed == speed) >> return >phy_led_triggers[i]; > > This looks unnecessary, because existing LED triggers are all relative > to a particular speed, you need to coerce the existing code into > accepting SPEED_UNKNOWN as a hint to mean "link". Just implement "link" > as a different trigger that does not need the speed argument/lookup to > be happening and don't change how the indexing into the array of speed > triggers is done. Just have a separate "link" trigger that is stored > separately from that existing array. This way of implementation (overloading SPEED_UNKNOWN) needed least modification to the driver, but if a purer way is preferred then ok. > Also, what happens if I set both "link" and a given "speed" and my RJ-45 > connector only has two LEDs, and one of them is already used for "activity"? One LED can have only one trigger attached AFAIK. Maciej
Re: [PATCH net-next 3/7] net: dsa: get port type at parse time
On 10/30/2017 11:46 AM, Vivien Didelot wrote: > Hi Florian, > > Florian Fainelliwrites: > >> Reviewed-by: Florian Fainelli >> >> One nit below: >> >>> static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn) >>> { >>> + struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0); >>> + struct device_node *link = of_parse_phandle(dn, "link", 0); >>> + >>> + if (ethernet) { >>> + dp->type = DSA_PORT_TYPE_CPU; >>> + } else if (link) { >>> + dp->type = DSA_PORT_TYPE_DSA; >>> + } else { >>> + dp->type = DSA_PORT_TYPE_USER; >>> + } >>> + >> >> The curly braces are probably not necessary since all of these are >> single line statements. > > I didn't mention it in the commit message because it was obvious that > the next patches will extend these condition arms. Fair enough then ;) -- Florian
Re: [PATCH net-next 3/7] net: dsa: get port type at parse time
Hi Florian, Florian Fainelliwrites: > Reviewed-by: Florian Fainelli > > One nit below: > >> static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn) >> { >> +struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0); >> +struct device_node *link = of_parse_phandle(dn, "link", 0); >> + >> +if (ethernet) { >> +dp->type = DSA_PORT_TYPE_CPU; >> +} else if (link) { >> +dp->type = DSA_PORT_TYPE_DSA; >> +} else { >> +dp->type = DSA_PORT_TYPE_USER; >> +} >> + > > The curly braces are probably not necessary since all of these are > single line statements. I didn't mention it in the commit message because it was obvious that the next patches will extend these condition arms. Thanks, Vivien
Re: [PATCH net-next 7/7] net: dsa: remove name arg from slave create
On 10/27/2017 12:55 PM, Vivien Didelot wrote: > Now that slave dsa_port always have their name set, there is no need to > pass it to dsa_slave_create() anymore. Remove this argument. > > Signed-off-by: Vivien DidelotReviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next 6/7] net: dsa: get port name at parse time
On 10/27/2017 12:55 PM, Vivien Didelot wrote: > Get the optional "label" property and assign a default one directly at > parse time instead of doing it when creating the slave. > > For legacy, simply assign the port name stored in cd->port_names. > > Signed-off-by: Vivien DidelotReviewed-by: Florian Fainelli -- Florian
Re: KASAN: use-after-free Read in sit_tunnel_xmit
On Mon, Oct 30, 2017 at 8:34 AM, syzbotwrote: > Hello, > > syzkaller hit the following crash on > 4dc12ffeaeac939097a3f55c881d3dc3523dff0c > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > > > > > skbuff: bad partial csum: csum=53081/14726 len=2273 > == > BUG: KASAN: use-after-free in ipv6_get_dsfield include/net/dsfield.h:23 > [inline] > BUG: KASAN: use-after-free in ipip6_tunnel_xmit net/ipv6/sit.c:968 [inline] > BUG: KASAN: use-after-free in sit_tunnel_xmit+0x2a41/0x3130 > net/ipv6/sit.c:1016 > Read of size 2 at addr 8801c64afd00 by task syz-executor3/16942 > > CPU: 0 PID: 16942 Comm: syz-executor3 Not tainted 4.14.0-rc5+ #97 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:16 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:52 > print_address_description+0x73/0x250 mm/kasan/report.c:252 > kasan_report_error mm/kasan/report.c:351 [inline] > kasan_report+0x25b/0x340 mm/kasan/report.c:409 > __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428 > ipv6_get_dsfield include/net/dsfield.h:23 [inline] > ipip6_tunnel_xmit net/ipv6/sit.c:968 [inline] > sit_tunnel_xmit+0x2a41/0x3130 net/ipv6/sit.c:1016 > __netdev_start_xmit include/linux/netdevice.h:4022 [inline] > netdev_start_xmit include/linux/netdevice.h:4031 [inline] > xmit_one net/core/dev.c:3008 [inline] > dev_hard_start_xmit+0x248/0xac0 net/core/dev.c:3024 > __dev_queue_xmit+0x17d2/0x2070 net/core/dev.c:3505 > dev_queue_xmit+0x17/0x20 net/core/dev.c:3538 > neigh_direct_output+0x15/0x20 net/core/neighbour.c:1390 > neigh_output include/net/neighbour.h:481 [inline] > ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120 > ip6_fragment+0x25ae/0x3420 net/ipv6/ip6_output.c:723 > ip6_finish_output+0x319/0x920 net/ipv6/ip6_output.c:144 > NF_HOOK_COND include/linux/netfilter.h:238 [inline] > ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163 > dst_output include/net/dst.h:459 [inline] > ip6_local_out+0x95/0x160 net/ipv6/output_core.c:176 > ip6_send_skb+0xa1/0x330 net/ipv6/ip6_output.c:1658 > ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1678 > rawv6_push_pending_frames net/ipv6/raw.c:616 [inline] > rawv6_sendmsg+0x2eb9/0x3e40 net/ipv6/raw.c:935 > inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763 > sock_sendmsg_nosec net/socket.c:633 [inline] > sock_sendmsg+0xca/0x110 net/socket.c:643 > SYSC_sendto+0x352/0x5a0 net/socket.c:1750 > SyS_sendto+0x40/0x50 net/socket.c:1718 > entry_SYSCALL_64_fastpath+0x1f/0xbe > RIP: 0033:0x452869 > RSP: 002b:7fe3c12e5be8 EFLAGS: 0212 ORIG_RAX: 002c > RAX: ffda RBX: 007580d8 RCX: 00452869 > RDX: 07f1 RSI: 2013b7ff RDI: 0014 > RBP: 0161 R08: 204e8fe4 R09: 001c > R10: 0100 R11: 0212 R12: 006f01b8 > R13: R14: 7fe3c12e66d4 R15: 0017 > > Allocated by task 16924: > save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 > __do_kmalloc_node mm/slab.c:3689 [inline] > __kmalloc_node_track_caller+0x47/0x70 mm/slab.c:3703 > __kmalloc_reserve.isra.40+0x41/0xd0 net/core/skbuff.c:138 > __alloc_skb+0x13b/0x780 net/core/skbuff.c:206 > alloc_skb include/linux/skbuff.h:985 [inline] > sock_wmalloc+0x140/0x1d0 net/core/sock.c:1932 > __ip6_append_data.isra.43+0x2681/0x3340 net/ipv6/ip6_output.c:1397 > ip6_append_data+0x189/0x290 net/ipv6/ip6_output.c:1552 > rawv6_sendmsg+0x1dd9/0x3e40 net/ipv6/raw.c:928 > inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763 > sock_sendmsg_nosec net/socket.c:633 [inline] > sock_sendmsg+0xca/0x110 net/socket.c:643 > SYSC_sendto+0x352/0x5a0 net/socket.c:1750 > SyS_sendto+0x40/0x50 net/socket.c:1718 > entry_SYSCALL_64_fastpath+0x1f/0xbe > > Freed by task 16942: > save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 > __cache_free mm/slab.c:3503 [inline] > kfree+0xca/0x250 mm/slab.c:3820 > skb_free_head+0x74/0xb0 net/core/skbuff.c:554 > pskb_expand_head+0x36b/0x1210 net/core/skbuff.c:1494 > __pskb_pull_tail+0x14a/0x17c0 net/core/skbuff.c:1877 > pskb_may_pull include/linux/skbuff.h:2102 [inline] > _decode_session6+0x8a4/0x1100 net/ipv6/xfrm6_policy.c:148 Looks like we should indicate error by returning some int for afinfo->decode_session(). > __xfrm_decode_session+0x63/0x100 net/xfrm/xfrm_policy.c:2343 > xfrm_decode_session_reverse include/net/xfrm.h:1181
Re: [PATCH net-next 5/7] net: dsa: get master device at port parsing time
On 10/27/2017 12:55 PM, Vivien Didelot wrote: > Fetching the master device can be done directly when a port is parsed > from device tree or pdata, instead of waiting until dsa_dst_parse. > > Now that -EPROBE_DEFER is returned before we add the switch to the tree, > there is no need to check for this error after dsa_dst_parse. > > Signed-off-by: Vivien DidelotReviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next 3/7] net: dsa: get port type at parse time
On 10/27/2017 12:55 PM, Vivien Didelot wrote: > Assign a port's type at parsed time instead of waiting for the tree to > be completed. > > Because this is now done earlier, we can use the port's type in > dsa_port_is_* helpers instead of digging again in topology description. > > Signed-off-by: Vivien DidelotReviewed-by: Florian Fainelli One nit below: > static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn) > { > + struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0); > + struct device_node *link = of_parse_phandle(dn, "link", 0); > + > + if (ethernet) { > + dp->type = DSA_PORT_TYPE_CPU; > + } else if (link) { > + dp->type = DSA_PORT_TYPE_DSA; > + } else { > + dp->type = DSA_PORT_TYPE_USER; > + } > + The curly braces are probably not necessary since all of these are single line statements. > dp->dn = dn; > > return 0; > @@ -630,6 +629,14 @@ static int dsa_parse_ports_of(struct device_node *dn, > struct dsa_switch *ds) > static int dsa_port_parse(struct dsa_port *dp, const char *name, > struct device *dev) > { > + if (!strcmp(name, "cpu")) { > + dp->type = DSA_PORT_TYPE_CPU; > + } else if (!strcmp(name, "dsa")) { > + dp->type = DSA_PORT_TYPE_DSA; > + } else { > + dp->type = DSA_PORT_TYPE_USER; > + } Likewise. -- Florian
Re: [PATCH net-next 2/7] net: dsa: add port parse functions
On 10/27/2017 12:55 PM, Vivien Didelot wrote: > Add symmetrical DSA port parsing functions for pdata and device tree, > used to parse and validate a given port node or platform data. > > They don't do much for the moment but will be extended later on to > assign a port type and get device references. > > Signed-off-by: Vivien DidelotReviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next 1/7] net: dsa: get ports within parsing code
On 10/27/2017 12:55 PM, Vivien Didelot wrote: > There is no point into hiding the -EINVAL error code in ERR_PTR from a > dsa_get_ports function, simply get the "ports" node directly from within > the dsa_parse_ports_dn function. > > This also has the effect to make the pdata and device tree handling code > symmetrical inside _dsa_register_switch. > > At the same time, rename dsa_parse_ports_dn to dsa_parse_ports_of > because _of is a more common suffix for device tree parsing functions. > > Signed-off-by: Vivien DidelotReviewed-by: Florian Fainelli -- Florian
Re: [PATCH] net: phy: leds: Add support for "link" trigger
On 10/28/2017 08:27 AM, Maciej S. Szmigiero wrote: > Currently, we create a LED trigger for any link speed known to a PHY. > These triggers only fire when their exact link speed had been negotiated > (they aren't cumulative, that is, they don't fire for "their or any higher" > link speed). > > What we are missing, however, is a trigger which will fire on any link > speed known to the PHY. Such trigger can then be used for implementing a > poor man's substitute of the "link" LED on boards that lack it. > Let's add it. The use case definitively makes sense, but the implementation a little less. > > Signed-off-by: Maciej S. Szmigiero> --- > drivers/net/phy/Kconfig| 7 +-- > drivers/net/phy/phy_led_triggers.c | 19 --- > 2 files changed, 21 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig > index cd931cf9dcc2..3bcc2107ad77 100644 > --- a/drivers/net/phy/Kconfig > +++ b/drivers/net/phy/Kconfig > @@ -191,11 +191,14 @@ config LED_TRIGGER_PHY > Adds support for a set of LED trigger events per-PHY. Link > state change will trigger the events, for consumption by an > LED class driver. There are triggers for each link speed currently > - supported by the phy, and are of the form: > + supported by the PHY and also a one common "link" trigger as a > + logical-or of all the link speed ones. > + All these triggers are named according to the following pattern: > :: > > Where speed is in the form: > - Mbps or Gbps > + Mbps OR Gbps OR link > + for any speed known to the PHY. > > > comment "MII PHY device drivers" > diff --git a/drivers/net/phy/phy_led_triggers.c > b/drivers/net/phy/phy_led_triggers.c > index 94ca42e630bb..5b6f1876f514 100644 > --- a/drivers/net/phy/phy_led_triggers.c > +++ b/drivers/net/phy/phy_led_triggers.c > @@ -20,7 +20,8 @@ static struct phy_led_trigger > *phy_speed_to_led_trigger(struct phy_device *phy, > { > unsigned int i; > > - for (i = 0; i < phy->phy_num_led_triggers; i++) { > + /* the first (i = 0) trigger is for "any" speed */ > + for (i = 1; i < phy->phy_num_led_triggers; i++) { > if (phy->phy_led_triggers[i].speed == speed) > return >phy_led_triggers[i]; This looks unnecessary, because existing LED triggers are all relative to a particular speed, you need to coerce the existing code into accepting SPEED_UNKNOWN as a hint to mean "link". Just implement "link" as a different trigger that does not need the speed argument/lookup to be happening and don't change how the indexing into the array of speed triggers is done. Just have a separate "link" trigger that is stored separately from that existing array. Also, what happens if I set both "link" and a given "speed" and my RJ-45 connector only has two LEDs, and one of them is already used for "activity"? -- Florian
Re: [PATCH v9 02/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY
On Fri, Oct 27, 2017 at 09:37:10AM -0500, Rob Herring wrote: > On Tue, Oct 24, 2017 at 07:57:06PM +0200, Corentin Labbe wrote: > > This patch add documentation about the MDIO switch used on sun8i-h3-emac > > for integrated PHY. > > > > Signed-off-by: Corentin Labbe> > --- > > .../devicetree/bindings/net/dwmac-sun8i.txt| 145 > > +++-- > > 1 file changed, 133 insertions(+), 12 deletions(-) > > > > diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > index 725f3b187886..2600ce9ad3cc 100644 > > --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > @@ -4,18 +4,18 @@ This device is a platform glue layer for stmmac. > > Please see stmmac.txt for the other unchanged properties. > > > > Required properties: > > -- compatible: should be one of the following string: > > +- compatible: must be one of the following string: > > "allwinner,sun8i-a83t-emac" > > "allwinner,sun8i-h3-emac" > > "allwinner,sun8i-v3s-emac" > > "allwinner,sun50i-a64-emac" > > - reg: address and length of the register for the device. > > - interrupts: interrupt for the device > > -- interrupt-names: should be "macirq" > > +- interrupt-names: must be "macirq" > > - clocks: A phandle to the reference clock for this device > > -- clock-names: should be "stmmaceth" > > +- clock-names: must be "stmmaceth" > > - resets: A phandle to the reset control for this device > > -- reset-names: should be "stmmaceth" > > +- reset-names: must be "stmmaceth" > > - phy-mode: See ethernet.txt > > - phy-handle: See ethernet.txt > > - #address-cells: shall be 1 > > @@ -39,23 +39,42 @@ Optional properties for the following compatibles: > > - allwinner,leds-active-low: EPHY LEDs are active low > > > > Required child node of emac: > > -- mdio bus node: should be named mdio > > +- mdio bus node: with compatible "snps,dwmac-mdio" > > It should still be named mdio. > > > > > Required properties of the mdio node: > > - #address-cells: shall be 1 > > - #size-cells: shall be 0 > > > > -The device node referenced by "phy" or "phy-handle" should be a child node > > +The device node referenced by "phy" or "phy-handle" must be a child node > > of the mdio node. See phy.txt for the generic PHY bindings. > > > > -Required properties of the phy node with the following compatibles: > > +The following compatibles require that the emac node have a mdio-mux child > > +node called "mdio-mux": > > + - "allwinner,sun8i-h3-emac" > > + - "allwinner,sun8i-v3s-emac": > > +Required properties for the mdio-mux node: > > + - compatible = "allwinner,sun8i-h3-mdio-mux" > > + - mdio-parent-bus: a phandle to EMAC mdio > > + - one child mdio for the integrated mdio with the compatible > > +"allwinner,sun8i-h3-mdio-internal" > > + - one child mdio for the external mdio if present (V3s have none) > > +Required properties for the mdio-mux children node: > > + - reg: 1 for internal MDIO bus, 2 for external MDIO bus > > + > > +The following compatibles require a PHY node representing the integrated > > +PHY, under the integrated MDIO bus node if an mdio-mux node is used: > >- "allwinner,sun8i-h3-emac", > >- "allwinner,sun8i-v3s-emac": > > + > > +Additional information regarding generic multiplexer properties can be > > found > > +at Documentation/devicetree/bindings/net/mdio-mux.txt > > + > > +Required properties of the integrated phy node: > > - clocks: a phandle to the reference clock for the EPHY > > - resets: a phandle to the reset control for the EPHY > > +- Must be a child of the integrated mdio > > > > -Example: > > - > > +Example with integrated PHY: > > emac: ethernet@1c0b000 { > > compatible = "allwinner,sun8i-h3-emac"; > > syscon = <>; > > @@ -72,13 +91,115 @@ emac: ethernet@1c0b000 { > > phy-handle = <_mii_phy>; > > phy-mode = "mii"; > > allwinner,leds-active-low; > > + > > + mdio0: mdio { > > + #address-cells = <1>; > > + #size-cells = <0>; > > + compatible = "snps,dwmac-mdio"; > > + }; > > + > > + mdio-mux { > > + compatible = "mdio-mux", "allwinner,sun8i-h3-mdio-mux"; > > Drop mdio-mux. > > With those fixed, > > Acked-by: Rob Herring Will change what you ask. Thanks Regards
Re: KASAN: use-after-free Write in detach_if_pending
On Mon, 2017-10-30 at 20:52 +0300, Dmitry Vyukov wrote: > syz fix: patch title > > then that's doable. If we agree on this format, then I am ready to > implement this. Yes please. David Miller and Linus never rebase their trees, for good reasons.
Re: [PATCH] net: ethernet: slicoss: remove redundant initialization of idx
On 30.10.2017 19:04, Jakub Kicinski wrote: > On Sun, 29 Oct 2017 13:38:09 +, Colin King wrote: >> From: Colin Ian King>> >> Variable idx is being initialized and later on over-written by >> a new value in a do-loop without the initial value ever being >> read. Hence the initializion is redundant and can be removed. >> Cleans up clang warning: >> >> drivers/net/ethernet/alacritech/slicoss.c:358:15: warning: Value >> stored to 'idx' during its initialization is never read >> >> Signed-off-by: Colin Ian King >> --- >> drivers/net/ethernet/alacritech/slicoss.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/net/ethernet/alacritech/slicoss.c >> b/drivers/net/ethernet/alacritech/slicoss.c >> index 15a8096c60df..ac8004186fa8 100644 >> --- a/drivers/net/ethernet/alacritech/slicoss.c >> +++ b/drivers/net/ethernet/alacritech/slicoss.c >> @@ -355,7 +355,7 @@ static void slic_xmit_complete(struct slic_device *sdev) >> { >> struct slic_tx_queue *txq = >txq; >> struct net_device *dev = sdev->netdev; >> -unsigned int idx = txq->done_idx; >> +unsigned int idx; >> struct slic_tx_buffer *buff; >> unsigned int frames = 0; >> unsigned int bytes = 0; > > You should probably reorder the variables now so they stay longest to > shortest. > I agree. While the change itself is ok, it would be nice if we could keep the order of the variables by length. Regards, Lino
[PATCH iproute2 3/3] xfrm_{state,policy}: Allow to deleteall polices/states with marks
Using 'ip deleteall' with policies that have marks, fails unless you eplicitely specify the mark values. This is very uncomfortable when bulk-deleting policies and states. With this patch all relevant states and policies are wiped by 'ip deleteall' regardless of their mark values. Signed-off-by: Thomas Egerer--- ip/xfrm_policy.c | 9 + ip/xfrm_state.c | 12 2 files changed, 21 insertions(+) diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c index e2fa771..d544026 100644 --- a/ip/xfrm_policy.c +++ b/ip/xfrm_policy.c @@ -753,6 +753,15 @@ static int xfrm_policy_keep(const struct sockaddr_nl *who, xpid->dir = xpinfo->dir; xpid->index = xpinfo->index; + if (tb[XFRMA_MARK]) { + int r = addattr_l(new_n, xb->size, XFRMA_MARK, + (void *)RTA_DATA(tb[XFRMA_MARK]), tb[XFRMA_MARK]->rta_len); + if (r < 0) { + fprintf(stderr, "%s: XFRMA_MARK failed\n", __func__); + exit(1); + } + } + xb->offset += new_n->nlmsg_len; xb->nlmsg_count++; diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c index 3e20d6e..85d959c 100644 --- a/ip/xfrm_state.c +++ b/ip/xfrm_state.c @@ -1081,6 +1081,7 @@ static int xfrm_state_keep(const struct sockaddr_nl *who, int len = n->nlmsg_len; struct nlmsghdr *new_n; struct xfrm_usersa_id *xsid; + struct rtattr *tb[XFRMA_MAX+1]; if (n->nlmsg_type != XFRM_MSG_NEWSA) { fprintf(stderr, "Not a state: %08x %08x %08x\n", @@ -1117,6 +1118,17 @@ static int xfrm_state_keep(const struct sockaddr_nl *who, addattr_l(new_n, xb->size, XFRMA_SRCADDR, >saddr, sizeof(xsid->daddr)); + parse_rtattr(tb, XFRMA_MAX, XFRMS_RTA(xsinfo), len); + + if (tb[XFRMA_MARK]) { + int r = addattr_l(new_n, xb->size, XFRMA_MARK, + (void *)RTA_DATA(tb[XFRMA_MARK]), tb[XFRMA_MARK]->rta_len); + if (r < 0) { + fprintf(stderr, "%s: XFRMA_MARK failed\n", __func__); + exit(1); + } + } + xb->offset += new_n->nlmsg_len; xb->nlmsg_count++; -- 2.6.4
[PATCH iproute2 2/3] xfrm_policy: Do not attempt to deleteall a socket policy
Socket polices are added to a socket using setsockopt(2). They cannot be deleted by iproute2. The attempt to delete them causes an error (EINVAL). To avoid this unnecessary error message all socket policies are skipped in xfrm_policy_keep. Signed-off-by: Thomas Egerer--- ip/xfrm_policy.c | 4 1 file changed, 4 insertions(+) diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c index 7d2139e..e2fa771 100644 --- a/ip/xfrm_policy.c +++ b/ip/xfrm_policy.c @@ -735,6 +735,10 @@ static int xfrm_policy_keep(const struct sockaddr_nl *who, if (!xfrm_policy_filter_match(xpinfo, ptype)) return 0; + /* can't delete socket policies */ + if (xpinfo->dir >= XFRM_POLICY_MAX) + return 0; + if (xb->offset + NLMSG_LENGTH(sizeof(*xpid)) > xb->size) return 0; -- 2.6.4
[PATCH iproute2 1/3] xfrm_policy: Add filter option for socket policies
Listing policies on systems with a lot of socket policies can be confusing due to the number of returned polices. Even if socket polices are not of interest, they cannot be filtered. This patch adds an option to filter all socket policies from the output. Signed-off-by: Thomas Egerer--- ip/xfrm.h| 1 + ip/xfrm_policy.c | 8 +++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/ip/xfrm.h b/ip/xfrm.h index 54d80ce..8566d63 100644 --- a/ip/xfrm.h +++ b/ip/xfrm.h @@ -90,6 +90,7 @@ struct xfrm_filter { __u8 action_mask; __u32 priority_mask; __u8 policy_flags_mask; + __u8 filter_socket; __u8 ptype; __u8 ptype_mask; diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c index 98460a0..7d2139e 100644 --- a/ip/xfrm_policy.c +++ b/ip/xfrm_policy.c @@ -58,7 +58,7 @@ static void usage(void) fprintf(stderr, "[ LIMIT-LIST ] [ TMPL-LIST ]\n"); fprintf(stderr, "Usage: ip xfrm policy { delete | get } { SELECTOR | index INDEX } dir DIR\n"); fprintf(stderr, "[ ctx CTX ] [ mark MARK [ mask MASK ] ] [ ptype PTYPE ]\n"); - fprintf(stderr, "Usage: ip xfrm policy { deleteall | list } [ SELECTOR ] [ dir DIR ]\n"); + fprintf(stderr, "Usage: ip xfrm policy { deleteall | list } [ nosock ] [ SELECTOR ] [ dir DIR ]\n"); fprintf(stderr, "[ index INDEX ] [ ptype PTYPE ] [ action ACTION ] [ priority PRIORITY ]\n"); fprintf(stderr, "[ flag FLAG-LIST ]\n"); fprintf(stderr, "Usage: ip xfrm policy flush [ ptype PTYPE ]\n"); @@ -403,6 +403,9 @@ static int xfrm_policy_filter_match(struct xfrm_userpolicy_info *xpinfo, if ((xpinfo->dir^filter.xpinfo.dir)_mask) return 0; + if (filter.filter_socket && (xpinfo->dir >= XFRM_POLICY_MAX)) + return 0; + if ((ptype^filter.ptype)_mask) return 0; @@ -806,6 +809,9 @@ static int xfrm_policy_list_or_deleteall(int argc, char **argv, int deleteall) filter.policy_flags_mask = XFRM_FILTER_MASK_FULL; + } else if (strcmp(*argv, "nosock") == 0) { + /* filter all socket-based policies */ + filter.filter_socket = 1; } else { if (selp) invarg("unknown", *argv); -- 2.6.4