Re: [PATCHv4 net-next 2/2] openvswitch: add erspan version I and II support
On Thu, Jan 18, 2018 at 2:04 PM, William Tuwrote: > The patch adds support for openvswitch to configure erspan > v1 and v2. The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added > to uapi as a nested attribute to support all ERSPAN v1 and v2's > fields. Note taht Previous commit "openvswitch: Add erspan tunnel > support." was reverted since it does not design properly. > > Signed-off-by: William Tu > --- > include/uapi/linux/openvswitch.h | 11 +++ > net/openvswitch/flow_netlink.c | 154 > ++- > 2 files changed, 164 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/openvswitch.h > b/include/uapi/linux/openvswitch.h > index dcfab5e3b55c..f1f98fd703fe 100644 > --- a/include/uapi/linux/openvswitch.h > +++ b/include/uapi/linux/openvswitch.h > @@ -273,6 +273,16 @@ enum { > > #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1) > > +enum { > + OVS_ERSPAN_OPT_UNSPEC, > + OVS_ERSPAN_OPT_IDX, /* u32 index */ > + OVS_ERSPAN_OPT_VER, /* u8 version number */ > + OVS_ERSPAN_OPT_DIR, /* u8 direction */ > + OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */ > + __OVS_ERSPAN_OPT_MAX, > +}; > + > +#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1) > > /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels. > */ > @@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr { > OVS_TUNNEL_KEY_ATTR_IPV6_SRC, /* struct in6_addr src IPv6 > address. */ > OVS_TUNNEL_KEY_ATTR_IPV6_DST, /* struct in6_addr dst IPv6 > address. */ > OVS_TUNNEL_KEY_ATTR_PAD, > + OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */ > __OVS_TUNNEL_KEY_ATTR_MAX > }; > Rather than passing individual members of erspan_metadata, can you just pass whole structure between kernel and userspace. So that ovs kernel module can just handle all erspan options as one binary blob and does not need to interpret it.
Re: [Patch net-next 1/3] net: introduce helper dev_change_tx_queue_len()
On 01/19/2018 03:09 PM, Cong Wang wrote: > This patch promotes the local change_tx_queue_len() to a core > helper function, dev_change_tx_queue_len(), so that rtnetlink > and net-sysfs could share the code. This also prepares for the > following patch. > > Note, the -EFAULT in the original code doesn't make sense, > we should propagate the errno from notifiers. > > Cc: John Fastabend> Signed-off-by: Cong Wang > --- [...] > static ssize_t tx_queue_len_store(struct device *dev, > struct device_attribute *attr, > const char *buf, size_t len) > @@ -355,7 +332,7 @@ static ssize_t tx_queue_len_store(struct device *dev, > if (!capable(CAP_NET_ADMIN)) > return -EPERM; > > - return netdev_store(dev, attr, buf, len, change_tx_queue_len); > + return netdev_store(dev, attr, buf, len, dev_change_tx_queue_len); > } Is this protected by RTNL lock? If not what happens if this and do_setlink both try to change tx queue length at the same time? Seems we could get a race with multiple dev_deactivate/dev_activate sequences in-flight in the following 2/3 patch. Thanks, John > NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec); >
Re: [PATCH net] net/mlx5e: Fix fixpoint divide exception in mlx5e_am_stats_compare
On 01/20/2018 12:10 AM, Sergei Shtylyov wrote: > Hello! > > On 1/19/2018 11:10 PM, Saeed Mahameed wrote: > >> From: Talat Batheesh>> >> Helmut reported a bug about devision by zero while > > Division. > >> running traffic and doing physical cable pull test. >> >> When the cable unplugged the ppms become zero, so when >> dividing the current ppms by the previous ppms in the > > You got it right the on the 2nd time. :-) > BTW, you missed one ;-) Fixed both in V2. Thanks !
[PATCH net V2] net/mlx5e: Fix fixpoint divide exception in mlx5e_am_stats_compare
From: Talat BatheeshHelmut reported a bug about division by zero while running traffic and doing physical cable pull test. When the cable unplugged the ppms become zero, so when dividing the current ppms by the previous ppms in the next dim iteration there is division by zero. This patch prevent this division for both ppms and epms. Fixes: c3164d2fc48f ("net/mlx5e: Added BW check for DIM decision mechanism") Reported-by: Helmut Grauer Signed-off-by: Talat Batheesh Signed-off-by: Saeed Mahameed --- v2: - Fix spelling in commit message. This patch is the equivalent of: ("net/dim: Fix fixpoint divide exception in net_dim_stats_compare") which was already submitted to net-next. Since mlx5 irq moderation code got moved in net-next, this patch should only go to net and skip net-next. for -stable: 4.12.y and later. drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c index e401d9d245f3..b69a705fd787 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c @@ -201,9 +201,15 @@ static int mlx5e_am_stats_compare(struct mlx5e_rx_am_stats *curr, return (curr->bpms > prev->bpms) ? MLX5E_AM_STATS_BETTER : MLX5E_AM_STATS_WORSE; + if (!prev->ppms) + return curr->ppms ? MLX5E_AM_STATS_BETTER : + MLX5E_AM_STATS_SAME; + if (IS_SIGNIFICANT_DIFF(curr->ppms, prev->ppms)) return (curr->ppms > prev->ppms) ? MLX5E_AM_STATS_BETTER : MLX5E_AM_STATS_WORSE; + if (!prev->epms) + return MLX5E_AM_STATS_SAME; if (IS_SIGNIFICANT_DIFF(curr->epms, prev->epms)) return (curr->epms < prev->epms) ? MLX5E_AM_STATS_BETTER : -- 2.13.0
Re: pull-request: bpf-next 2018-01-19
From: Alexei StarovoitovDate: Fri, 19 Jan 2018 21:12:29 -0800 > The following pull-request contains BPF updates for your *net-next* tree. > > The main changes are: > > 1) bpf array map HW offload, from Jakub. > > 2) support for bpf_get_next_key() for LPM map, from Yonghong. > > 3) test_verifier now runs loaded programs, from Alexei. > > 4) xdp cpumap monitoring, from Jesper. > > 5) variety of tests, cleanups and small x64 JIT optimization, from Daniel. > > 6) user space can now retrieve HW JITed program, from Jiong. > > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git > > Note there is a minor conflict between Russell's arm32 JIT fixes > and removal of bpf_jit_enable variable by Daniel which should > be resolved by keeping Russell's comment and removing that variable. Pulled, thanks for the guidance on the merge conflict. I hope I interpreted it correctly :-)
Re: [net-next: PATCH 0/8] Armada 7k/8k PP2 ACPI support
> I'm not familiar with MDIO bus but an alternative to GeneriSerialBus > would be to follow what SDIO is doing, e.g have the PHY devices listed > below the MDIO controller and use _ADR to describe their "address" on > that bus. You can see how _ADR applies to SDIO bus from ACPI spec. Hi Mika SDIO is not a serial bus, well it can be in its simplest form, but high speed implementations have 4 data lines. So i can understand them not using GenericSerialBus. MDIO is a serial bus, very similar to SPI, I2C, and UART. > If you go with the SDIO way then each PHY is described as normal ACPI > device and you can use ACPI _HID/_CID to match the device to the > corresponding driver. Just some background here. If you have a plain PHY as a device on an MDIO bus, you don't need to match it to a driver within ACPI. Registers 2 and 3 contain a vendor and product ID. That is what it used to match the device to the driver. What you might need to know is the protocol to talk on the bus. Most devices use clause 22 protocol. A few devices are clause 45. 22 is the default in Linux, and you need to indicate if 45 should be used. You can also indicate 22. It gets more complex when the device on the bus is not a PHY. It is a generic bus, you can connect anything to it. Ethernet switches can be on the bus. They generally cannot be identified using registers 2 and 3. So you do need to match the device to the driver. Most do have ID registers, so the driver can work out what specific device is on the bus. However, Marvell moved the ID registers on there newer generation of devices, so we need to give the driver a hint where to look. So in device tree, we have two different compatible string. Broadcom really do use it as a generic bus. They have their USB PHYs and PCIE PHYs on an MDIO bus. In DT, they have compatible strings to match the device to the driver, as normal. We need to ensure what we define for ACPI has the same level of flexibility. Andrew
Memory corruption with r8169 across several device revisions and kernels
Dear network experts, please redirect me if this is the wrong place. I have reproduced the following issue across three devices with different Realtek card revisions and different Distros (Debian 9, Ubuntu 17.04, Gentoo with kernels 4.9, 4.11.3, 4.14.12). It's safely reproducible with at least: Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06) Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12) Memory corruption at physical addresses either in low memory or kernel memory or user space memory occurs when reading from: /proc/self/net/dev The physical memory addresses which get corrupted change with each boot of the system, and also appear to change with each reload of the kernel module (I have only one data point on that). To reproduce, execute: $ while true; do cat /proc/self/net/dev > /dev/null; done and in parallel, scan memory for corruption, e.g. $ memtester 15G Of course, one should try to map all system memory here. It usually shows up in the first loop iteration if the "while" loop is executed in parallel. Depending on the actual memory being corrupted, it may also become visible via Corrupted low memory at 8800b000 (b000 phys) = 0016e109 in klog, if the low memory corruption scanning is activated. The values found in overwritten memory match those contained in /proc/self/net/dev for the realtek ethernet device. Unloading r8169 or disabling the card in bios "fixes" this issue. I have already ended up with two corrupted btrfs filesystems due to this issue, and many segfaults in userspace. Please include me directly in replies, I may not stay subscribed to the list. Cheers, Oliver
Re: net merged into net-next
On Fri, Jan 19, 2018 at 8:02 PM, David Millerwrote: > > Cong, please check my conflict resolution of drivers/net/tun.c, thank > you. It looks good to me except I am not sure about the xdp_rxq_info_unreg() inside tun_cleanup_tx_ring().
backport of "netfilter: nf_tables: fix mismatch in big-endian system" to stable
nftables does not work correctly on big endian systems without this patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10596608c4d62cb8c1c2b806debcbd32fe657e71 With nftables v0.8.1 this rule does not work: nft add rule ip foo bar ct state new tcp dport 22 accept I can set it successfully, but it will just not match to port 22. The examples from this tutorial will not work with kernel 4.9 + nftables v0.8.1: https://people.netfilter.org/pablo/nft-tutorial.pdf This nftables commit broke nftables for me on my MIPS BE system: https://git.netfilter.org/nftables/commit/?id=0825c57d571bb7121e7048e198b9b023f7e7f358 but the root cause is in the kernel. I would like to get commit 10596608c4d62cb8c1c2b806debcbd32fe657e71 ("netfilter: nf_tables: fix mismatch in big-endian system") backported to kernel 4.9 or even earlier to make nftable work correctly on big endian systems. Hauke
Re: [net-next: PATCH 0/8] Armada 7k/8k PP2 ACPI support
On Fri, Jan 19, 2018 at 07:07:29PM +0100, Marcin Wojtas wrote: > Hi Mika, Hi, > 2018-01-18 14:00 GMT+01:00 Andrew Lunn: > >> I CC'ed Mika since he is more familiar with handling these bits of ACPI > >> specs - I wonder whether this is a problem that cropped up on x86 > >> systems too. > > > > Hi Lorenzo > > > > There is nothing about MDIO, PHYs, Ethernet switches, etc in version > > 6.2 of the spec. If x86 has this, it must be after 6.2 was released. > > I would not be too surprised if x86 has none of this. If you look at > > the typical drives used on x86, i210, e1000e, ixgb, r8169, etc. They > > are all PCI devices, and hide all this. > > > >> I do not think there is one and only answer but there must be a single > >> set of bindings and if the ACPI specs already cater for some of them > >> we have to reuse them. > > > > Agreed. Due diligence so far suggests there is nothing already > > defined. But im a newbie to ACPI, so could be looking in the wrong > > place. I really hope there is somebody like Rob Herring, the DT > > maintainer, who keeps an eye on all ACPI talk and would tell us if we > > are heading off in the wrong direction. > > > > My initial approach with MDIO bus with PHYs as child nodes was super > easy to describe and handle in Linux - please refer to: > - typical representation of mdio bus with the phys - > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/net/mdio.txt?h=v4.15-rc8 > - my patches in the initial series > However I guess it would be more proper to use the > GenericSerialBus-based description, as advised in ACPI Spec. The > question is, whether we should define new type of a bus or not > (MdioSerialBus, similar to e.g. I2cSerialBus). I'm not familiar with MDIO bus but an alternative to GeneriSerialBus would be to follow what SDIO is doing, e.g have the PHY devices listed below the MDIO controller and use _ADR to describe their "address" on that bus. You can see how _ADR applies to SDIO bus from ACPI spec. Of course ACPI spec should then be updated accordingly to describe what _ADR means for devices on MDIO bus. > Since I have a code that can be tested and easily modified to use > different ACPI approaches with real platform MDIO controller > (mvmdio.c) and NIC (mvpp2.c), in coming weeks I may be able to find > some time to prepare a proof of concept based on GenericSerialBus. > Please expect some RFC patches hopefully right after the coming merge > window is closed. > > Of course, if I come up on some ACPI - specific implementation > questions, I won't hesitate to ask in this thred. I will also > appreciate any hints. For now my two main concerns are: > - The PHY address on the mdio bus - should it be put into _CRS -> > GenericSerialBus() field or separate _ADR? I'd lean towards first > option. > - The PHY type - in Linux it's resolved basing on two generic > compatible strings > (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/net/phy.txt?h=v4.15-rc8). > I'd put it as a sort of ID into GenericSerialBus(). If you agree - any > specific filed that you would try to use? If you go with the SDIO way then each PHY is described as normal ACPI device and you can use ACPI _HID/_CID to match the device to the corresponding driver. > Do I understand correctly that the MDIO controller node should > comprise OperationRegion() definition of the GenericSerialBus? I don't think OpRegions are useful in this case because they are mainly used to allow BIOS AML code to access the hardware through OS driver.
Re: ipv6_addrconf: WARNING about suspicious RCU usage
Am 20.01.2018 um 20:19 schrieb Ido Schimmel: > On Sat, Jan 20, 2018 at 10:49:03AM -0800, Eric Dumazet wrote: >> On Sat, 2018-01-20 at 15:37 +0200, Ido Schimmel wrote: >>> On Sat, Jan 20, 2018 at 12:57:01PM +0100, Heiner Kallweit wrote: Since some time (didn't bisect it yet) I get the following warning. Is it a known issue? >>> >>> [...] >>> [86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420 [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2 [86220.127082] 4 locks held by kworker/0:2/1003: [86220.127107] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [] process_one_work+0x1de/0x680 [86220.127179] #1: ((addr_chk_work).work){+.+.}, at: [ ] process_one_work+0x1de/0x680 [86220.127242] #2: (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x12/0x20 [86220.127300] #3: (rcu_read_lock_bh){}, at: [ ] addrconf_verify_rtnl+0x1e/0x510 [ipv6] [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7 [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015 [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6] [86220.127568] Call Trace: [86220.127591] dump_stack+0x70/0x9e [86220.127616] ___might_sleep+0x14d/0x240 [86220.127644] __might_sleep+0x45/0x80 [86220.127672] kmem_cache_alloc_trace+0x53/0x250 [86220.127717] ? ipv6_add_addr+0xfe/0x6e0 [ipv6] [86220.127762] ipv6_add_addr+0xfe/0x6e0 [ipv6] [86220.127807] ipv6_create_tempaddr+0x24d/0x430 [ipv6] [86220.127854] ? ipv6_create_tempaddr+0x24d/0x430 [ipv6] [86220.127903] addrconf_verify_rtnl+0x339/0x510 [ipv6] [86220.127950] ? addrconf_verify_rtnl+0x339/0x510 [ipv6] [86220.127998] addrconf_verify_work+0xe/0x20 [ipv6] [86220.128032] process_one_work+0x258/0x680 [86220.128063] worker_thread+0x35/0x3f0 [86220.128091] kthread+0x124/0x140 [86220.128117] ? process_one_work+0x680/0x680 [86220.128146] ? kthread_create_worker_on_cpu+0x40/0x40 [86220.128180] ? umh_complete+0x40/0x40 [86220.128207] ? call_usermodehelper_exec_async+0x12a/0x160 [86220.128243] ret_from_fork+0x4b/0x60 >>> >>> Can you please try attached patch (untested)? >> >> >> >> I would also/instead break rcu section. > > Thanks Eric, this should work. We can continue to block in > ipv6_create_tempaddr(). > > Heiner, can you try Eric's patch instead? > Sure, thanks for the quick response to both of you. >> >> Holding RCU (and BH) for whole hash traversal is a recipe for disaster, >> if we have thousands of IPv6 addresses. >> >> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c >> index >> ab99cb641b7cccdda0ad4ae553c09274d7dbc047..adda73466ae1dd0f3b700b3db5fbf3065e4d3f7f >> 100644 >> --- a/net/ipv6/addrconf.c >> +++ b/net/ipv6/addrconf.c >> @@ -4356,9 +4356,11 @@ static void addrconf_verify_rtnl(void) >> spin_lock(>lock); >> ifpub->regen_count = 0; >> spin_unlock(>lock); >> +rcu_read_unlock_bh(); >> ipv6_create_tempaddr(ifpub, >> ifp, true); >> in6_ifa_put(ifpub); >> in6_ifa_put(ifp); >> +rcu_read_lock_bh(); >> goto restart; >> } >> } else if (time_before(ifp->tstamp + >> ifp->prefered_lft * HZ - regen_advance * HZ, next)) >> >> >> >> >
Re: ipv6_addrconf: WARNING about suspicious RCU usage
On Sat, Jan 20, 2018 at 10:49:03AM -0800, Eric Dumazet wrote: > On Sat, 2018-01-20 at 15:37 +0200, Ido Schimmel wrote: > > On Sat, Jan 20, 2018 at 12:57:01PM +0100, Heiner Kallweit wrote: > > > Since some time (didn't bisect it yet) I get the following warning. > > > Is it a known issue? > > > > [...] > > > > > [86220.126999] BUG: sleeping function called from invalid context at > > > mm/slab.h:420 > > > [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: > > > kworker/0:2 > > > [86220.127082] 4 locks held by kworker/0:2/1003: > > > [86220.127107] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: > > > [] process_one_work+0x1de/0x680 > > > [86220.127179] #1: ((addr_chk_work).work){+.+.}, at: > > > [ ] process_one_work+0x1de/0x680 > > > [86220.127242] #2: (rtnl_mutex){+.+.}, at: [ ] > > > rtnl_lock+0x12/0x20 > > > [86220.127300] #3: (rcu_read_lock_bh){}, at: [ ] > > > addrconf_verify_rtnl+0x1e/0x510 [ipv6] > > > [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted > > > 4.15.0-rc7-next-20180110+ #7 > > > [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS > > > B246P105 06/01/2015 > > > [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6] > > > [86220.127568] Call Trace: > > > [86220.127591] dump_stack+0x70/0x9e > > > [86220.127616] ___might_sleep+0x14d/0x240 > > > [86220.127644] __might_sleep+0x45/0x80 > > > [86220.127672] kmem_cache_alloc_trace+0x53/0x250 > > > [86220.127717] ? ipv6_add_addr+0xfe/0x6e0 [ipv6] > > > [86220.127762] ipv6_add_addr+0xfe/0x6e0 [ipv6] > > > [86220.127807] ipv6_create_tempaddr+0x24d/0x430 [ipv6] > > > [86220.127854] ? ipv6_create_tempaddr+0x24d/0x430 [ipv6] > > > [86220.127903] addrconf_verify_rtnl+0x339/0x510 [ipv6] > > > [86220.127950] ? addrconf_verify_rtnl+0x339/0x510 [ipv6] > > > [86220.127998] addrconf_verify_work+0xe/0x20 [ipv6] > > > [86220.128032] process_one_work+0x258/0x680 > > > [86220.128063] worker_thread+0x35/0x3f0 > > > [86220.128091] kthread+0x124/0x140 > > > [86220.128117] ? process_one_work+0x680/0x680 > > > [86220.128146] ? kthread_create_worker_on_cpu+0x40/0x40 > > > [86220.128180] ? umh_complete+0x40/0x40 > > > [86220.128207] ? call_usermodehelper_exec_async+0x12a/0x160 > > > [86220.128243] ret_from_fork+0x4b/0x60 > > > > Can you please try attached patch (untested)? > > > > I would also/instead break rcu section. Thanks Eric, this should work. We can continue to block in ipv6_create_tempaddr(). Heiner, can you try Eric's patch instead? > > Holding RCU (and BH) for whole hash traversal is a recipe for disaster, > if we have thousands of IPv6 addresses. > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index > ab99cb641b7cccdda0ad4ae553c09274d7dbc047..adda73466ae1dd0f3b700b3db5fbf3065e4d3f7f > 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -4356,9 +4356,11 @@ static void addrconf_verify_rtnl(void) > spin_lock(>lock); > ifpub->regen_count = 0; > spin_unlock(>lock); > + rcu_read_unlock_bh(); > ipv6_create_tempaddr(ifpub, > ifp, true); > in6_ifa_put(ifpub); > in6_ifa_put(ifp); > + rcu_read_lock_bh(); > goto restart; > } > } else if (time_before(ifp->tstamp + > ifp->prefered_lft * HZ - regen_advance * HZ, next)) > > > >
Re: ipv6_addrconf: WARNING about suspicious RCU usage
On Sat, 2018-01-20 at 15:37 +0200, Ido Schimmel wrote: > On Sat, Jan 20, 2018 at 12:57:01PM +0100, Heiner Kallweit wrote: > > Since some time (didn't bisect it yet) I get the following warning. > > Is it a known issue? > > [...] > > > [86220.126999] BUG: sleeping function called from invalid context at > > mm/slab.h:420 > > [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: > > kworker/0:2 > > [86220.127082] 4 locks held by kworker/0:2/1003: > > [86220.127107] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: > > [] process_one_work+0x1de/0x680 > > [86220.127179] #1: ((addr_chk_work).work){+.+.}, at: [ ] > > process_one_work+0x1de/0x680 > > [86220.127242] #2: (rtnl_mutex){+.+.}, at: [ ] > > rtnl_lock+0x12/0x20 > > [86220.127300] #3: (rcu_read_lock_bh){}, at: [ ] > > addrconf_verify_rtnl+0x1e/0x510 [ipv6] > > [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted > > 4.15.0-rc7-next-20180110+ #7 > > [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS > > B246P105 06/01/2015 > > [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6] > > [86220.127568] Call Trace: > > [86220.127591] dump_stack+0x70/0x9e > > [86220.127616] ___might_sleep+0x14d/0x240 > > [86220.127644] __might_sleep+0x45/0x80 > > [86220.127672] kmem_cache_alloc_trace+0x53/0x250 > > [86220.127717] ? ipv6_add_addr+0xfe/0x6e0 [ipv6] > > [86220.127762] ipv6_add_addr+0xfe/0x6e0 [ipv6] > > [86220.127807] ipv6_create_tempaddr+0x24d/0x430 [ipv6] > > [86220.127854] ? ipv6_create_tempaddr+0x24d/0x430 [ipv6] > > [86220.127903] addrconf_verify_rtnl+0x339/0x510 [ipv6] > > [86220.127950] ? addrconf_verify_rtnl+0x339/0x510 [ipv6] > > [86220.127998] addrconf_verify_work+0xe/0x20 [ipv6] > > [86220.128032] process_one_work+0x258/0x680 > > [86220.128063] worker_thread+0x35/0x3f0 > > [86220.128091] kthread+0x124/0x140 > > [86220.128117] ? process_one_work+0x680/0x680 > > [86220.128146] ? kthread_create_worker_on_cpu+0x40/0x40 > > [86220.128180] ? umh_complete+0x40/0x40 > > [86220.128207] ? call_usermodehelper_exec_async+0x12a/0x160 > > [86220.128243] ret_from_fork+0x4b/0x60 > > Can you please try attached patch (untested)? I would also/instead break rcu section. Holding RCU (and BH) for whole hash traversal is a recipe for disaster, if we have thousands of IPv6 addresses. diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ab99cb641b7cccdda0ad4ae553c09274d7dbc047..adda73466ae1dd0f3b700b3db5fbf3065e4d3f7f 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4356,9 +4356,11 @@ static void addrconf_verify_rtnl(void) spin_lock(>lock); ifpub->regen_count = 0; spin_unlock(>lock); + rcu_read_unlock_bh(); ipv6_create_tempaddr(ifpub, ifp, true); in6_ifa_put(ifpub); in6_ifa_put(ifp); + rcu_read_lock_bh(); goto restart; } } else if (time_before(ifp->tstamp + ifp->prefered_lft * HZ - regen_advance * HZ, next))
Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.
On Fri, Jan 19, 2018 at 2:55 PM, Benjamin Poirierwrote: > On 2018/01/20 07:45, Benjamin Poirier wrote: > [...] >> > >> > I'm of the mind that we need to cut down on the code thrash. This >> > driver is supposed to have been in a "maintenance" mode for the last >> > year or so as there aren't being any new parts added is my >> > understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e: >> > Do not read ICR in Other interrupt", v4.5-rc1) was submitted or >> > accepted in the first place. I don't see any notes about it fixing any >> > bug or addressing any issue and it seems like that is the start of all >> > the issues we have been having recently with RXO triggering more >> > interrupts, various link issues, and this most recent VMware issue. >> >> I'm sorry to say but you're the one who suggested that change: >> >> http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html >> >> On 2015/10/28 23:08, Alexander Duyck wrote: >> > On 10/22/2015 05:32 PM, Benjamin Poirier wrote: >> [...] >> > >> > I would argue your first patch probably didn't go far enough to remove dead >> > code. Specifically you should only ever get into this function if LSC is >> > set. There are no other causes that should trigger this. As such you >> > could >> > probably remove the ICR read, and instead replace it with an ICR write of >> > the LSC bit since OTHER is already cleared via EIAC. >> > > > ... The assumption that "There are no other causes that should trigger > this." turned out to be wrong and that caused the RXO problems. Clearing > OTHER via EIAC is causing the problems with vmware now. I don't think > you foresaw those problems back in 2015 and neither did I. Well that explains why I felt like I was explaining things to a younger version of myself. I was a bit more relaxed in terms of being willing to make arbitrary changes a few years ago. I tend to be a bit more conservative now, at least as far as having justifications as to why I want to do things. With any change you end up taking on risk, and so usually a patch has a justification as to why you are making the change. Unfortunately at the time I didn't have all the information and based my suggestion on a bad assumption. I would guess at the time I was thinking of doing general code cleanup. Other drivers such as igb work this way, but it led us down the path we are on now where we are having to make one fix after another. It is leading in the opposite direction of maintainability as this is all becoming more complex. Suggesting this was a bad decision on my part at the time. I'm only human, I make mistakes.. :-) With further review of the code I am seeing various other issues that could still pop up as I am not certain we should even have the "other" interrupt messing with the NAPI polling or packet accounting logic at all. The question I would have at this point is if we revert 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1) and all the related fixes for it, what do we end up with? It seems like the code is slowly heading back in the direction of where it was originally anyway since there have been a number of partial reverts. I'm wondering what would happen if we were to just short-cut that and audit the patches involved to see what we really need and don't. Your patch as proposed is essentially another step in that direction. I'm thinking we may want to drop my currently proposed fix for now and instead look at going through and figure out what changes after that first one are still really needed. It doesn't look like my fix will make it for 4.15 anyway so we might as well focus on making certain to have things as solid as possible by the time 4.16-rc1 rolls around. Thanks. - Alex
Re: ipv6_addrconf: WARNING about suspicious RCU usage
On Sat, Jan 20, 2018 at 12:57:01PM +0100, Heiner Kallweit wrote: > Since some time (didn't bisect it yet) I get the following warning. > Is it a known issue? [...] > [86220.126999] BUG: sleeping function called from invalid context at > mm/slab.h:420 > [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: > kworker/0:2 > [86220.127082] 4 locks held by kworker/0:2/1003: > [86220.127107] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: > [] process_one_work+0x1de/0x680 > [86220.127179] #1: ((addr_chk_work).work){+.+.}, at: [ ] > process_one_work+0x1de/0x680 > [86220.127242] #2: (rtnl_mutex){+.+.}, at: [ ] > rtnl_lock+0x12/0x20 > [86220.127300] #3: (rcu_read_lock_bh){}, at: [ ] > addrconf_verify_rtnl+0x1e/0x510 [ipv6] > [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted > 4.15.0-rc7-next-20180110+ #7 > [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS > B246P105 06/01/2015 > [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6] > [86220.127568] Call Trace: > [86220.127591] dump_stack+0x70/0x9e > [86220.127616] ___might_sleep+0x14d/0x240 > [86220.127644] __might_sleep+0x45/0x80 > [86220.127672] kmem_cache_alloc_trace+0x53/0x250 > [86220.127717] ? ipv6_add_addr+0xfe/0x6e0 [ipv6] > [86220.127762] ipv6_add_addr+0xfe/0x6e0 [ipv6] > [86220.127807] ipv6_create_tempaddr+0x24d/0x430 [ipv6] > [86220.127854] ? ipv6_create_tempaddr+0x24d/0x430 [ipv6] > [86220.127903] addrconf_verify_rtnl+0x339/0x510 [ipv6] > [86220.127950] ? addrconf_verify_rtnl+0x339/0x510 [ipv6] > [86220.127998] addrconf_verify_work+0xe/0x20 [ipv6] > [86220.128032] process_one_work+0x258/0x680 > [86220.128063] worker_thread+0x35/0x3f0 > [86220.128091] kthread+0x124/0x140 > [86220.128117] ? process_one_work+0x680/0x680 > [86220.128146] ? kthread_create_worker_on_cpu+0x40/0x40 > [86220.128180] ? umh_complete+0x40/0x40 > [86220.128207] ? call_usermodehelper_exec_async+0x12a/0x160 > [86220.128243] ret_from_fork+0x4b/0x60 Can you please try attached patch (untested)? >From 70e4bc2a8fe08fb30251f786990a91d3ed2232e6 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Sat, 20 Jan 2018 15:29:33 +0200 Subject: [PATCH] ipv6: Do not sleep in RCU read-side critical section Signed-off-by: Ido Schimmel --- net/ipv6/addrconf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ab99cb641b7c..9ad1365c91ca 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4356,7 +4356,8 @@ static void addrconf_verify_rtnl(void) spin_lock(>lock); ifpub->regen_count = 0; spin_unlock(>lock); - ipv6_create_tempaddr(ifpub, ifp, true); + ipv6_create_tempaddr(ifpub, ifp, +false); in6_ifa_put(ifpub); in6_ifa_put(ifp); goto restart; -- 2.14.3
Re: KASAN: use-after-free Read in fib6_ifup (2)
On Thu, Jan 18, 2018 at 09:58:01PM -0800, syzbot wrote: > Hello, > > syzbot hit the following crash on net-next commit > 564737f981fb4b4b3266901508bb9b90d9d43de8 > > So far this crash happened 18 times on mmots, net-next. > C reproducer is attached. > syzkaller reproducer is attached. > Raw console output is attached. > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached. [...] > syzbot will keep track of this bug report. > If you forgot to add the Reported-by tag, once the fix for this bug is > merged > into any tree, please reply to this email with: > #syz fix: exact-commit-title The reproducer configures a default source-specific route which triggers a bug that is now fixed in net-next, so: #syz fix: ipv6: don't let tb6_root node share routes with other node Verified it on my machine using the reproducer attached by the bot.
ipv6_addrconf: WARNING about suspicious RCU usage
Since some time (didn't bisect it yet) I get the following warning. Is it a known issue? [86220.125562] = [86220.125586] WARNING: suspicious RCU usage [86220.125612] 4.15.0-rc7-next-20180110+ #7 Not tainted [86220.125641] - [86220.125666] kernel/sched/core.c:6026 Illegal context switch in RCU-bh read-side critical section! [86220.125711] other info that might help us debug this: [86220.125755] rcu_scheduler_active = 2, debug_locks = 1 [86220.125792] 4 locks held by kworker/0:2/1003: [86220.125817] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [] process_one_work+0x1de/0x680 [86220.125895] #1: ((addr_chk_work).work){+.+.}, at: [ ] process_one_work+0x1de/0x680 [86220.125959] #2: (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x12/0x20 [86220.126017] #3: (rcu_read_lock_bh){}, at: [ ] addrconf_verify_rtnl+0x1e/0x510 [ipv6] [86220.126111] stack backtrace: [86220.126142] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7 [86220.126185] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015 [86220.126250] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6] [86220.126288] Call Trace: [86220.126312] dump_stack+0x70/0x9e [86220.126337] lockdep_rcu_suspicious+0xce/0xf0 [86220.126365] ___might_sleep+0x1d3/0x240 [86220.126390] __might_sleep+0x45/0x80 [86220.126416] kmem_cache_alloc_trace+0x53/0x250 [86220.126458] ? ipv6_add_addr+0xfe/0x6e0 [ipv6] [86220.126498] ipv6_add_addr+0xfe/0x6e0 [ipv6] [86220.126538] ipv6_create_tempaddr+0x24d/0x430 [ipv6] [86220.126580] ? ipv6_create_tempaddr+0x24d/0x430 [ipv6] [86220.126623] addrconf_verify_rtnl+0x339/0x510 [ipv6] [86220.126664] ? addrconf_verify_rtnl+0x339/0x510 [ipv6] [86220.126708] addrconf_verify_work+0xe/0x20 [ipv6] [86220.126738] process_one_work+0x258/0x680 [86220.126765] worker_thread+0x35/0x3f0 [86220.126790] kthread+0x124/0x140 [86220.126813] ? process_one_work+0x680/0x680 [86220.126839] ? kthread_create_worker_on_cpu+0x40/0x40 [86220.126869] ? umh_complete+0x40/0x40 [86220.126893] ? call_usermodehelper_exec_async+0x12a/0x160 [86220.126926] ret_from_fork+0x4b/0x60 [86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420 [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2 [86220.127082] 4 locks held by kworker/0:2/1003: [86220.127107] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [ ] process_one_work+0x1de/0x680 [86220.127179] #1: ((addr_chk_work).work){+.+.}, at: [ ] process_one_work+0x1de/0x680 [86220.127242] #2: (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x12/0x20 [86220.127300] #3: (rcu_read_lock_bh){}, at: [ ] addrconf_verify_rtnl+0x1e/0x510 [ipv6] [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7 [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015 [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6] [86220.127568] Call Trace: [86220.127591] dump_stack+0x70/0x9e [86220.127616] ___might_sleep+0x14d/0x240 [86220.127644] __might_sleep+0x45/0x80 [86220.127672] kmem_cache_alloc_trace+0x53/0x250 [86220.127717] ? ipv6_add_addr+0xfe/0x6e0 [ipv6] [86220.127762] ipv6_add_addr+0xfe/0x6e0 [ipv6] [86220.127807] ipv6_create_tempaddr+0x24d/0x430 [ipv6] [86220.127854] ? ipv6_create_tempaddr+0x24d/0x430 [ipv6] [86220.127903] addrconf_verify_rtnl+0x339/0x510 [ipv6] [86220.127950] ? addrconf_verify_rtnl+0x339/0x510 [ipv6] [86220.127998] addrconf_verify_work+0xe/0x20 [ipv6] [86220.128032] process_one_work+0x258/0x680 [86220.128063] worker_thread+0x35/0x3f0 [86220.128091] kthread+0x124/0x140 [86220.128117] ? process_one_work+0x680/0x680 [86220.128146] ? kthread_create_worker_on_cpu+0x40/0x40 [86220.128180] ? umh_complete+0x40/0x40 [86220.128207] ? call_usermodehelper_exec_async+0x12a/0x160 [86220.128243] ret_from_fork+0x4b/0x60
[net-next:master 308/357] gemini.c:undefined reference to `devm_ioremap_resource'
tree: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master head: 8565d26bcb2ff6df646e946d2913fcf706d46b66 commit: 4d5ae32f5e1e13f7f36d6439ec3257993b9f5b88 [308/357] net: ethernet: Add a driver for Gemini gigabit ethernet config: um-allyesconfig (attached as .config) compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025 reproduce: git checkout 4d5ae32f5e1e13f7f36d6439ec3257993b9f5b88 # save the attached .config to linux build tree make ARCH=um All errors (new ones prefixed by >>): arch/um/drivers/vde.o: In function `vde_open_real': (.text+0x951): warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking (.text+0x79c): warning: Using 'getpwuid' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking (.text+0xab5): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoaddr': (.text+0xdee5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametonetaddr': (.text+0xdf85): warning: Using 'getnetbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoproto': (.text+0xe1a5): warning: Using 'getprotobyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoport': (.text+0xdfd7): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking drivers/net/ethernet/cortina/gemini.o: In function `geth_cleanup_freeq': >> gemini.c:(.text+0x97d): undefined reference to `bad_dma_ops' gemini.c:(.text+0xad5): undefined reference to `bad_dma_ops' gemini.c:(.text+0xb85): undefined reference to `bad_dma_ops' drivers/net/ethernet/cortina/gemini.o: In function `gemini_ethernet_probe': >> gemini.c:(.text+0xd2b): undefined reference to `devm_ioremap_resource' drivers/net/ethernet/cortina/gemini.o: In function `geth_freeq_alloc_map_page': gemini.c:(.text+0x11c4): undefined reference to `bad_dma_ops' gemini.c:(.text+0x11d0): undefined reference to `bad_dma_ops' gemini.c:(.text+0x13dc): undefined reference to `bad_dma_ops' drivers/net/ethernet/cortina/gemini.o: In function `geth_resize_freeq': gemini.c:(.text+0x1dcd): undefined reference to `bad_dma_ops' gemini.c:(.text+0x1fc4): undefined reference to `bad_dma_ops' drivers/net/ethernet/cortina/gemini.o:gemini.c:(.text+0x2167): more undefined references to `bad_dma_ops' follow drivers/net/ethernet/cortina/gemini.o: In function `gemini_ethernet_port_probe': gemini.c:(.text+0x2dec): undefined reference to `devm_ioremap_resource' gemini.c:(.text+0x2e4c): undefined reference to `devm_ioremap_resource' drivers/net/ethernet/cortina/gemini.o: In function `gmac_map_tx_bufs.isra.0': gemini.c:(.text+0x37fb): undefined reference to `bad_dma_ops' gemini.c:(.text+0x3813): undefined reference to `bad_dma_ops' gemini.c:(.text+0x393a): undefined reference to `bad_dma_ops' drivers/net/ethernet/cortina/gemini.o: In function `gmac_cleanup_rxq': gemini.c:(.text+0x3ffa): undefined reference to `bad_dma_ops' gemini.c:(.text+0x409d): undefined reference to `bad_dma_ops' drivers/net/ethernet/cortina/gemini.o:gemini.c:(.text+0x4245): more undefined references to `bad_dma_ops' follow collect2: error: ld returned 1 exit status --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH v6 2/3] clocksource/drivers/atcpit100: VDSO support
Correct the composition Dear Arnd: 1. Setup an additional memory mapping for clock hardware in user space when establishing vdso-needed memory mapping In arch_setup_additional_pages(), kernel establishes memory mapping for vdso's text and vdata page in user space. In order to make clock hardware be accessible in user space, we try to establish an additional memory mapping for clock hardware here based on clock information from driver. This page is located between vdata page and vdso text page. For safety, this region for clock accessing is read-only. 2. Accessing clock hardware in vdso After step 1, clock hardware is accessible in user space through memory-mapped IO. However, it is not enough to access a specific register. Therefore, we store register offset information in vdata page to make it visible in user space. Now, vdso can derive the address of counter register by summation of __get_timerpage() and counter register offset where __get_timerpage() is used to derive the virtual address of memory-mapped clock. 2018-01-20 19:11 GMT+08:00 Vincent Chen: > 2018-01-18 19:08 GMT+08:00 Arnd Bergmann : >> On Mon, Jan 15, 2018 at 6:57 AM, Greentime Hu wrote: >>> From: Rick Chen >>> >>> VDSO needs real-time cycle count to ensure the time accuracy. >>> Unlike others, nds32 architecture does not define clock source, >>> hence VDSO needs atcpit100 offering real-time cycle count >>> to derive the correct time. >>> >>> Signed-off-by: Vincent Chen >>> Signed-off-by: Rick Chen >>> Signed-off-by: Greentime Hu >> >> I'm a bit puzzled by this patch, can you explain how the vdso actually >> manages to access the clock hardware? It looks like you make the >> physical address and the register offset available to user space, but >> how does it end up accessing it? >> >> Arnd > > Dear Arnd: > > Accessing clock hardware in vdso can be divided to 2 step. > > 1. Setup an additional memory mapping for clock hardware in user space > when establishing > vdso-needed memory mapping > > In arch_setup_additional_pages(), kernel establishes memory > mapping for vdso's text and vdata page > in user space. In order to make clock hardware be accessible in > user space, we try to establish an > additional memory mapping for clock hardware here based on clock > information from driver. This page is > located between vdata page and vdso text page. For safety, this > region for clock accessing is read-only. > > 2. Accessing clock hardware in vdso > > After step 1, clock hardware is accessible in user space > through memory-mapped IO. However, it is not >enough to access a specific register. Therefore, we store register > offset information in vdata page to make it >visible in user space. Now, vdso can derive the address of counter > register by summation of __get_timerpage() >and counter register offset where __get_timerpage() is used to > derive the virtual address of memory-mapped >clock. > > > Vincent
Re: [PATCH net-next v4 6/8] net: sched: create tc_can_offload_extack() wrapper
On Sat, Jan 20, 2018 at 2:54 AM, Jiri Pirkowrote: > Sat, Jan 20, 2018 at 11:33:31AM CET, jakub.kicin...@netronome.com wrote: >>On Sat, Jan 20, 2018 at 2:22 AM, Jiri Pirko wrote: >>> Sat, Jan 20, 2018 at 11:12:25AM CET, jakub.kicin...@netronome.com wrote: net/sched/sch_prio.c: if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_prio.c: if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_red.c:if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_red.c:if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) Do you mean the qdisc offloads too? The whole lot? >>> >>> Yes. >> >>Actually looking at the qdisc code and destroy callbacks, if we plumb >>it through everywhere won't that mean user will see error messages on >>destroy of qdiscs/filters which were never offloaded? >> >>Just looking at prio_offload() as a simple example. prio_destroy() >>will always call tc_can_offload(). > > Hmmm. You are right. > Either we pass null from there (NL_SET_ERR_MSG can cope), or we leave > tc_can_offload helper as is and let the caller to set the extack. I'm afraid it doesn't stop there though :/ Even now if one install a filter on a netdev with offload abilities and no skip_* flag there is this: # tc filter add dev eth0 ingress bpf obj drop.o da Warning: TC offload is disabled on net device. I'm not sure we should warn the user if there are no skip_* flags, just because the device has *some* offload capabilities? Perhaps we should put the flags into tc_cls_common_offload and add a helper to only set extack if skip_sw is set.. > I see that tc_can_offload_extack is probably good idea then. > > But please use it in all drivers that are calling tc_can_offload so the > user experience is consistent for all. Certainly, will do.
Re: [PATCH v6 2/3] clocksource/drivers/atcpit100: VDSO support
2018-01-18 19:08 GMT+08:00 Arnd Bergmann: > On Mon, Jan 15, 2018 at 6:57 AM, Greentime Hu wrote: >> From: Rick Chen >> >> VDSO needs real-time cycle count to ensure the time accuracy. >> Unlike others, nds32 architecture does not define clock source, >> hence VDSO needs atcpit100 offering real-time cycle count >> to derive the correct time. >> >> Signed-off-by: Vincent Chen >> Signed-off-by: Rick Chen >> Signed-off-by: Greentime Hu > > I'm a bit puzzled by this patch, can you explain how the vdso actually > manages to access the clock hardware? It looks like you make the > physical address and the register offset available to user space, but > how does it end up accessing it? > > Arnd Dear Arnd: Accessing clock hardware in vdso can be divided to 2 step. 1. Setup an additional memory mapping for clock hardware in user space when establishing vdso-needed memory mapping In arch_setup_additional_pages(), kernel establishes memory mapping for vdso's text and vdata page in user space. In order to make clock hardware be accessible in user space, we try to establish an additional memory mapping for clock hardware here based on clock information from driver. This page is located between vdata page and vdso text page. For safety, this region for clock accessing is read-only. 2. Accessing clock hardware in vdso After step 1, clock hardware is accessible in user space through memory-mapped IO. However, it is not enough to access a specific register. Therefore, we store register offset information in vdata page to make it visible in user space. Now, vdso can derive the address of counter register by summation of __get_timerpage() and counter register offset where __get_timerpage() is used to derive the virtual address of memory-mapped clock. Vincent
Hello dear
Hello dear My name is abudul Ahassan, I work with one of the leading Banks here in Africa. I have a business proposal worth 18$million dollars for more information contact me on my private emailI urgently hope to get your response as soon as possible. Yours Sincerely, Abudu Ahassan
Re: [PATCH net-next v4 6/8] net: sched: create tc_can_offload_extack() wrapper
Sat, Jan 20, 2018 at 11:33:31AM CET, jakub.kicin...@netronome.com wrote: >On Sat, Jan 20, 2018 at 2:22 AM, Jiri Pirkowrote: >> Sat, Jan 20, 2018 at 11:12:25AM CET, jakub.kicin...@netronome.com wrote: >>>net/sched/sch_prio.c: if (!tc_can_offload(dev) || >>>!dev->netdev_ops->ndo_setup_tc) >>>net/sched/sch_prio.c: if (!tc_can_offload(dev) || >>>!dev->netdev_ops->ndo_setup_tc) >>>net/sched/sch_red.c:if (!tc_can_offload(dev) || >>>!dev->netdev_ops->ndo_setup_tc) >>>net/sched/sch_red.c:if (!tc_can_offload(dev) || >>>!dev->netdev_ops->ndo_setup_tc) >>> >>>Do you mean the qdisc offloads too? The whole lot? >> >> Yes. > >Actually looking at the qdisc code and destroy callbacks, if we plumb >it through everywhere won't that mean user will see error messages on >destroy of qdiscs/filters which were never offloaded? > >Just looking at prio_offload() as a simple example. prio_destroy() >will always call tc_can_offload(). Hmmm. You are right. Either we pass null from there (NL_SET_ERR_MSG can cope), or we leave tc_can_offload helper as is and let the caller to set the extack. I see that tc_can_offload_extack is probably good idea then. But please use it in all drivers that are calling tc_can_offload so the user experience is consistent for all. Thanks!
Re: [PATCH net-next v4 6/8] net: sched: create tc_can_offload_extack() wrapper
On Sat, Jan 20, 2018 at 2:22 AM, Jiri Pirkowrote: > Sat, Jan 20, 2018 at 11:12:25AM CET, jakub.kicin...@netronome.com wrote: >>net/sched/sch_prio.c: if (!tc_can_offload(dev) || >>!dev->netdev_ops->ndo_setup_tc) >>net/sched/sch_prio.c: if (!tc_can_offload(dev) || >>!dev->netdev_ops->ndo_setup_tc) >>net/sched/sch_red.c:if (!tc_can_offload(dev) || >>!dev->netdev_ops->ndo_setup_tc) >>net/sched/sch_red.c:if (!tc_can_offload(dev) || >>!dev->netdev_ops->ndo_setup_tc) >> >>Do you mean the qdisc offloads too? The whole lot? > > Yes. Actually looking at the qdisc code and destroy callbacks, if we plumb it through everywhere won't that mean user will see error messages on destroy of qdiscs/filters which were never offloaded? Just looking at prio_offload() as a simple example. prio_destroy() will always call tc_can_offload(). Hmm...
Re: [PATCH net-next v4 5/8] net: sched: add extack support for offload via tc_cls_common_offload
On Sat, Jan 20, 2018 at 12:54 AM, Jiri Pirkowrote: > Sat, Jan 20, 2018 at 02:44:47AM CET, jakub.kicin...@netronome.com wrote: >>From: Quentin Monnet >> >>Add extack support for hardware offload of classifiers. In order >>to achieve this, a pointer to a struct netlink_ext_ack is added to the >>struct tc_cls_common_offload that is passed to the callback for setting >>up the classifier. Function tc_cls_common_offload_init() is updated to >>support initialization of this new attribute. >> >>Signed-off-by: Quentin Monnet >>Reviewed-by: Jakub Kicinski > > [...] > > >>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c >>index f675a92e1b66..727c10378f37 100644 >>--- a/net/sched/cls_flower.c >>+++ b/net/sched/cls_flower.c >>@@ -223,7 +223,7 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, >>struct cls_fl_filter *f) >> struct tc_cls_flower_offload cls_flower = {}; >> struct tcf_block *block = tp->chain->block; >> >>- tc_cls_common_offload_init(_flower.common, tp); >>+ tc_cls_common_offload_init(_flower.common, tp, NULL); > > While you are at it, you should propagate extack whenever possible. For > this, you can easily pass extack to fl_hw_destroy_filter. > Same for other classifiers. > >> cls_flower.command = TC_CLSFLOWER_DESTROY; >> cls_flower.cookie = (unsigned long) f; >> >>@@ -243,7 +243,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp, >> bool skip_sw = tc_skip_sw(f->flags); >> int err; >> >>- tc_cls_common_offload_init(_flower.common, tp); >>+ tc_cls_common_offload_init(_flower.common, tp, extack); >> cls_flower.command = TC_CLSFLOWER_REPLACE; >> cls_flower.cookie = (unsigned long) f; >> cls_flower.dissector = dissector; >>@@ -272,7 +272,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, >>struct cls_fl_filter *f) >> struct tc_cls_flower_offload cls_flower = {}; >> struct tcf_block *block = tp->chain->block; >> >>- tc_cls_common_offload_init(_flower.common, tp); >>+ tc_cls_common_offload_init(_flower.common, tp, NULL); > > For this, it would be probably a bit trickier to get extack, but also > possible. > My motivation is to make the extack always available for users of > tc_cls_common_offload Not sure I can think of anything other than "IO error" that could stop us from destroy or dumping stats ;) Also I'm not sure extack fits well with dumps of multiple entities, how would user know which entity produced the message? Extack is best for configuration requests IMHO.. (I'm not saying we shouldn't plumb it through to more places, just wondering what you think.)
Re: [PATCH net-next v4 6/8] net: sched: create tc_can_offload_extack() wrapper
Sat, Jan 20, 2018 at 11:12:25AM CET, jakub.kicin...@netronome.com wrote: >On Sat, Jan 20, 2018 at 12:59 AM, Jiri Pirkowrote: >> Sat, Jan 20, 2018 at 02:44:48AM CET, jakub.kicin...@netronome.com wrote: >>>From: Quentin Monnet >>> >>>Create a wrapper around tc_can_offload() that takes an additional >>>extack pointer argument in order to output an error message if TC >>>offload is disabled on the device. >>> >>>In this way, the error message is handled by the core and can be the >>>same for all drivers. >>> >>>Signed-off-by: Quentin Monnet >>>Reviewed-by: Jakub Kicinski >>>--- >>> include/net/pkt_cls.h | 11 +++ >>> 1 file changed, 11 insertions(+) >>> >>>diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h >>>index f497f622580b..2f8f16a4d88e 100644 >>>--- a/include/net/pkt_cls.h >>>+++ b/include/net/pkt_cls.h >>>@@ -656,6 +656,17 @@ static inline bool tc_can_offload(const struct >>>net_device *dev) >>> return dev->features & NETIF_F_HW_TC; >>> } >>> >>>+static inline bool tc_can_offload_extack(const struct net_device *dev, >>>+ struct netlink_ext_ack *extack) >> >> I don't like to add tc_can_offload variant for this. It makes sense >> the original tc_can_offload to be extended and set the extack message >> always. >> >> It would require some more work in drivers (5), sure, but we endup with >> nicer and consistent code. > >$ git grep tc_can_offload >drivers/net/ethernet/broadcom/bnxt/bnxt.c: if >(!bnxt_tc_flower_enabled(bp) || !tc_can_offload(bp->dev)) >drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c: if >(!bnxt_tc_flower_enabled(vf_rep->bp) || !tc_can_offload(bp->dev)) >drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:if >(!tc_can_offload(dev)) >drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: if >(!tc_can_offload(adapter->netdev)) >drivers/net/ethernet/mellanox/mlx5/core/en_main.c: if >(!tc_can_offload(priv->netdev)) >drivers/net/ethernet/mellanox/mlx5/core/en_rep.c: if >(!tc_can_offload(priv->netdev)) >drivers/net/ethernet/mellanox/mlxsw/spectrum.c: if >(!tc_can_offload(mlxsw_sp_port->dev)) >drivers/net/ethernet/netronome/nfp/bpf/main.c: >!tc_can_offload(nn->dp.netdev) || >drivers/net/ethernet/netronome/nfp/flower/offload.c:if >(!tc_can_offload(repr->netdev)) >drivers/net/ethernet/netronome/nfp/flower/offload.c:if >(!tc_can_offload(repr->netdev)) >drivers/net/netdevsim/bpf.c:!tc_can_offload(ns->netdev) || >include/net/pkt_cls.h:static inline bool tc_can_offload(const struct >net_device *dev) >net/dsa/slave.c:if (!tc_can_offload(dev)) >net/sched/cls_api.c:if (!tc_can_offload(dev) && >tcf_block_offload_in_use(block)) >net/sched/sch_mq.c: if (!tc_can_offload(dev) || >!dev->netdev_ops->ndo_setup_tc) >net/sched/sch_prio.c: if (!tc_can_offload(dev) || >!dev->netdev_ops->ndo_setup_tc) >net/sched/sch_prio.c: if (!tc_can_offload(dev) || >!dev->netdev_ops->ndo_setup_tc) >net/sched/sch_red.c:if (!tc_can_offload(dev) || >!dev->netdev_ops->ndo_setup_tc) >net/sched/sch_red.c:if (!tc_can_offload(dev) || >!dev->netdev_ops->ndo_setup_tc) > >Do you mean the qdisc offloads too? The whole lot? Yes.
Re: [PATCH net-next v4 6/8] net: sched: create tc_can_offload_extack() wrapper
On Sat, Jan 20, 2018 at 12:59 AM, Jiri Pirkowrote: > Sat, Jan 20, 2018 at 02:44:48AM CET, jakub.kicin...@netronome.com wrote: >>From: Quentin Monnet >> >>Create a wrapper around tc_can_offload() that takes an additional >>extack pointer argument in order to output an error message if TC >>offload is disabled on the device. >> >>In this way, the error message is handled by the core and can be the >>same for all drivers. >> >>Signed-off-by: Quentin Monnet >>Reviewed-by: Jakub Kicinski >>--- >> include/net/pkt_cls.h | 11 +++ >> 1 file changed, 11 insertions(+) >> >>diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h >>index f497f622580b..2f8f16a4d88e 100644 >>--- a/include/net/pkt_cls.h >>+++ b/include/net/pkt_cls.h >>@@ -656,6 +656,17 @@ static inline bool tc_can_offload(const struct >>net_device *dev) >> return dev->features & NETIF_F_HW_TC; >> } >> >>+static inline bool tc_can_offload_extack(const struct net_device *dev, >>+ struct netlink_ext_ack *extack) > > I don't like to add tc_can_offload variant for this. It makes sense > the original tc_can_offload to be extended and set the extack message > always. > > It would require some more work in drivers (5), sure, but we endup with > nicer and consistent code. $ git grep tc_can_offload drivers/net/ethernet/broadcom/bnxt/bnxt.c: if (!bnxt_tc_flower_enabled(bp) || !tc_can_offload(bp->dev)) drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c: if (!bnxt_tc_flower_enabled(vf_rep->bp) || !tc_can_offload(bp->dev)) drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:if (!tc_can_offload(dev)) drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: if (!tc_can_offload(adapter->netdev)) drivers/net/ethernet/mellanox/mlx5/core/en_main.c: if (!tc_can_offload(priv->netdev)) drivers/net/ethernet/mellanox/mlx5/core/en_rep.c: if (!tc_can_offload(priv->netdev)) drivers/net/ethernet/mellanox/mlxsw/spectrum.c: if (!tc_can_offload(mlxsw_sp_port->dev)) drivers/net/ethernet/netronome/nfp/bpf/main.c: !tc_can_offload(nn->dp.netdev) || drivers/net/ethernet/netronome/nfp/flower/offload.c:if (!tc_can_offload(repr->netdev)) drivers/net/ethernet/netronome/nfp/flower/offload.c:if (!tc_can_offload(repr->netdev)) drivers/net/netdevsim/bpf.c:!tc_can_offload(ns->netdev) || include/net/pkt_cls.h:static inline bool tc_can_offload(const struct net_device *dev) net/dsa/slave.c:if (!tc_can_offload(dev)) net/sched/cls_api.c:if (!tc_can_offload(dev) && tcf_block_offload_in_use(block)) net/sched/sch_mq.c: if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_prio.c: if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_prio.c: if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_red.c:if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) net/sched/sch_red.c:if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc) Do you mean the qdisc offloads too? The whole lot?
[patch iproute2 net-next v12 1/3] tc: introduce tc_qdisc_block_exists helper
From: Jiri PirkoThis hepler used qdisc dump to list all qdisc and find if block index in question is used by any of them. That means the block with specified index exists. Signed-off-by: Jiri Pirko --- v11->v12: - fixed return type of tc_qdisc_block_exists - removed 0 checks for block in tc_qdisc_block_exists - rever xmas tree variables in tc_qdisc_block_exists --- tc/tc_qdisc.c | 61 +++ tc/tc_util.h | 2 ++ 2 files changed, 63 insertions(+) diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c index 70279b9d..54701c26 100644 --- a/tc/tc_qdisc.c +++ b/tc/tc_qdisc.c @@ -412,3 +412,64 @@ int do_qdisc(int argc, char **argv) fprintf(stderr, "Command \"%s\" is unknown, try \"tc qdisc help\".\n", *argv); return -1; } + +struct tc_qdisc_block_exists_ctx { + __u32 block_index; + bool found; +}; + +static int tc_qdisc_block_exists_cb(const struct sockaddr_nl *who, + struct nlmsghdr *n, void *arg) +{ + struct tc_qdisc_block_exists_ctx *ctx = arg; + struct tcmsg *t = NLMSG_DATA(n); + struct rtattr *tb[TCA_MAX+1]; + int len = n->nlmsg_len; + + if (n->nlmsg_type != RTM_NEWQDISC) + return 0; + + len -= NLMSG_LENGTH(sizeof(*t)); + if (len < 0) + return -1; + + parse_rtattr(tb, TCA_MAX, TCA_RTA(t), len); + + if (tb[TCA_KIND] == NULL) + return -1; + + if (tb[TCA_INGRESS_BLOCK] && + RTA_PAYLOAD(tb[TCA_INGRESS_BLOCK]) >= sizeof(__u32)) { + __u32 block = rta_getattr_u32(tb[TCA_INGRESS_BLOCK]); + + if (block == ctx->block_index) + ctx->found = true; + } + + if (tb[TCA_EGRESS_BLOCK] && + RTA_PAYLOAD(tb[TCA_EGRESS_BLOCK]) >= sizeof(__u32)) { + __u32 block = rta_getattr_u32(tb[TCA_EGRESS_BLOCK]); + + if (block == ctx->block_index) + ctx->found = true; + } + return 0; +} + +bool tc_qdisc_block_exists(__u32 block_index) +{ + struct tc_qdisc_block_exists_ctx ctx = { .block_index = block_index }; + struct tcmsg t = { .tcm_family = AF_UNSPEC }; + + if (rtnl_dump_request(, RTM_GETQDISC, , sizeof(t)) < 0) { + perror("Cannot send dump request"); + return false; + } + + if (rtnl_dump_filter(, tc_qdisc_block_exists_cb, ) < 0) { + perror("Dump terminated\n"); + return false; + } + + return ctx.found; +} diff --git a/tc/tc_util.h b/tc/tc_util.h index 1218610d..682dd4fd 100644 --- a/tc/tc_util.h +++ b/tc/tc_util.h @@ -132,4 +132,6 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt); int cls_names_init(char *path); void cls_names_uninit(void); +bool tc_qdisc_block_exists(__u32 block_index); + #endif -- 2.14.3
[patch iproute2 net-next v12 0/3] tc: shared block support
From: Jiri PirkoKernel allows to share all filters between qdiscs with use of shared block. Example: block number 22. "22" is just an identification: $ tc qdisc add dev ens7 ingress_block 22 ingress $ tc qdisc add dev ens8 ingress_block 22 ingress If we don't specify "block" command line option, no shared block would be created: $ tc qdisc add dev ens9 ingress Now if we list the qdiscs, we will see the block index in the output: $ tc qdisc qdisc ingress : dev ens7 parent :fff1 ingress_block 22 qdisc ingress : dev ens8 parent :fff1 ingress_block 22 qdisc ingress : dev ens9 parent :fff1 To make is more visual, the situation looks like this: ens7 ingress qdisc ens7 ingress qdisc | | | | +--> block 22 <--+ Unlimited number of qdiscs may share the same block. Block sharing is also supported for clsact qdisc: $ tc qdisc add dev ens10 ingress_block 23 egress_block 24 clsact $ tc qdisc show dev ens10 qdisc clsact : dev ens10 parent :fff1 ingress_block 23 egress_block 24 We can add filter using the block index: $ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop Note we cannot use the qdisc for filter manipulations of shared blocks: $ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 192.168.100.2 action drop Error: This filter block is shared. Please use the block index to manipulate the filters. We will see the same output if we list filters for ingress qdisc of ens7 and ens8, also for the block 22: $ tc filter show block 22 filter protocol ip pref 25 flower chain 0 filter protocol ip pref 25 flower chain 0 handle 0x1 ... $ tc filter show dev ens7 ingress filter block 22 protocol ip pref 25 flower chain 0 filter block 22 protocol ip pref 25 flower chain 0 handle 0x1 ... $ tc filter show dev ens8 ingress filter block 22 protocol ip pref 25 flower chain 0 filter block 22 protocol ip pref 25 flower chain 0 handle 0x1 ... --- v11->v12: - separate iproute2 patchset, kernel part is merged - original patch 1 is removed as the header changes are already in iproute2 code - patch 1: - fixed return type of tc_qdisc_block_exists - removed 0 checks for block in tc_qdisc_block_exists - rever xmas tree variables in tc_qdisc_block_exists - patch 2 - fixes error message when both dev and block are on the command line --- Note that if you want me to change the ">=" vs "==" thing, I will do it. But I think that it is a global iproute2 thing and should be changed in a separate patch/patchset. Jiri Pirko (3): tc: introduce tc_qdisc_block_exists helper tc: introduce support for block-handle for filter operations tc: implement ingress/egress block index attributes for qdiscs man/man8/tc.8 | 24 - tc/tc_filter.c | 105 +++-- tc/tc_qdisc.c | 97 tc/tc_util.h | 2 ++ 4 files changed, 210 insertions(+), 18 deletions(-) -- 2.14.3
[patch iproute2 net-next v12 3/3] tc: implement ingress/egress block index attributes for qdiscs
From: Jiri PirkoDuring qdisc creation it is possible to specify shared block for bot ingress and egress. Pass this values to kernel according to the command line options. Signed-off-by: Jiri Pirko --- man/man8/tc.8 | 6 +- tc/tc_qdisc.c | 36 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/man/man8/tc.8 b/man/man8/tc.8 index d65f0583..5ffea373 100644 --- a/man/man8/tc.8 +++ b/man/man8/tc.8 @@ -11,7 +11,11 @@ tc \- show / manipulate traffic control settings \fIqdisc-id\fR .B | root ] .B [ handle -\fIqdisc-id\fR ] qdisc +\fIqdisc-id\fR ] +.B [ ingress_block +\fIBLOCK_INDEX\fR ] +.B [ egress_block +\fIBLOCK_INDEX\fR ] qdisc [ qdisc specific parameters ] .P diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c index 54701c26..8cc4b73d 100644 --- a/tc/tc_qdisc.c +++ b/tc/tc_qdisc.c @@ -31,6 +31,7 @@ static int usage(void) fprintf(stderr, " [ handle QHANDLE ] [ root | ingress | clsact | parent CLASSID ]\n"); fprintf(stderr, " [ estimator INTERVAL TIME_CONSTANT ]\n"); fprintf(stderr, " [ stab [ help | STAB_OPTIONS] ]\n"); + fprintf(stderr, " [ ingress_block BLOCK_INDEX ] [ egress_block BLOCK_INDEX ]\n"); fprintf(stderr, " [ [ QDISC_KIND ] [ help | OPTIONS ] ]\n"); fprintf(stderr, "\n"); fprintf(stderr, " tc qdisc show [ dev STRING ] [ ingress | clsact ] [ invisible ]\n"); @@ -61,6 +62,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) .n.nlmsg_type = cmd, .t.tcm_family = AF_UNSPEC, }; + __u32 ingress_block = 0; + __u32 egress_block = 0; while (argc > 0) { if (strcmp(*argv, "dev") == 0) { @@ -121,6 +124,14 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) if (parse_size_table(, , ) < 0) return -1; continue; + } else if (matches(*argv, "ingress_block") == 0) { + NEXT_ARG(); + if (get_u32(_block, *argv, 0) || !ingress_block) + invarg("invalid ingress block index value", *argv); + } else if (matches(*argv, "egress_block") == 0) { + NEXT_ARG(); + if (get_u32(_block, *argv, 0) || !egress_block) + invarg("invalid egress block index value", *argv); } else if (matches(*argv, "help") == 0) { usage(); } else { @@ -138,6 +149,13 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) if (est.ewma_log) addattr_l(, sizeof(req), TCA_RATE, , sizeof(est)); + if (ingress_block) + addattr32(, sizeof(req), + TCA_INGRESS_BLOCK, ingress_block); + if (egress_block) + addattr32(, sizeof(req), + TCA_EGRESS_BLOCK, egress_block); + if (q) { if (q->parse_qopt) { if (q->parse_qopt(q, argc, argv, , d)) @@ -270,6 +288,24 @@ int print_qdisc(const struct sockaddr_nl *who, (rta_getattr_u8(tb[TCA_HW_OFFLOAD]))) print_bool(PRINT_ANY, "offloaded", "offloaded ", true); + if (tb[TCA_INGRESS_BLOCK] && + RTA_PAYLOAD(tb[TCA_INGRESS_BLOCK]) >= sizeof(__u32)) { + __u32 block = rta_getattr_u32(tb[TCA_INGRESS_BLOCK]); + + if (block) + print_uint(PRINT_ANY, "ingress_block", + "ingress_block %u ", block); + } + + if (tb[TCA_EGRESS_BLOCK] && + RTA_PAYLOAD(tb[TCA_EGRESS_BLOCK]) >= sizeof(__u32)) { + __u32 block = rta_getattr_u32(tb[TCA_EGRESS_BLOCK]); + + if (block) + print_uint(PRINT_ANY, "egress_block", + "egress_block %u ", block); + } + /* pfifo_fast is generic enough to warrant the hardcoding --JHS */ if (strcmp("pfifo_fast", RTA_DATA(tb[TCA_KIND])) == 0) q = get_qdisc_kind("prio"); -- 2.14.3
[patch iproute2 net-next v12 2/3] tc: introduce support for block-handle for filter operations
From: Jiri PirkoSo far, qdisc was the only handle that could be used to manipulate filters. Kernel added support for using block to manipulate it. So add the support to use block index to manipulate filters. The magic TCM_IFINDEX_MAGIC_BLOCK indicates the block index is in use. Signed-off-by: Jiri Pirko --- v11->v12: - fixes error message when both dev and block are on the command line --- man/man8/tc.8 | 18 ++ tc/tc_filter.c | 105 +++-- 2 files changed, 106 insertions(+), 17 deletions(-) diff --git a/man/man8/tc.8 b/man/man8/tc.8 index ff071b33..d65f0583 100644 --- a/man/man8/tc.8 +++ b/man/man8/tc.8 @@ -41,6 +41,19 @@ tc \- show / manipulate traffic control settings .B flowid \fIflow-id\fR +.B tc +.RI "[ " OPTIONS " ]" +.B filter [ add | change | replace | delete | get ] block +\fIBLOCK_INDEX\fR +.B [ handle \fIfilter-id\fR ] +.B protocol +\fIprotocol\fR +.B prio +\fIpriority\fR filtertype +[ filtertype specific parameters ] +.B flowid +\fIflow-id\fR + .B tc .RI "[ " OPTIONS " ]" .RI "[ " FORMAT " ]" @@ -58,6 +71,11 @@ tc \- show / manipulate traffic control settings .RI "[ " OPTIONS " ]" .B filter show dev \fIDEV\fR +.P +.B tc +.RI "[ " OPTIONS " ]" +.B filter show block +\fIBLOCK_INDEX\fR .P .ti 8 diff --git a/tc/tc_filter.c b/tc/tc_filter.c index 7dd123ab..c619e464 100644 --- a/tc/tc_filter.c +++ b/tc/tc_filter.c @@ -28,14 +28,17 @@ static void usage(void) { fprintf(stderr, - "Usage: tc filter [ add | del | change | replace | show ] dev STRING\n" - "Usage: tc filter get dev STRING parent CLASSID protocol PROTO handle FILTERID pref PRIO FILTER_TYPE\n" + "Usage: tc filter [ add | del | change | replace | show ] [ dev STRING ]\n" + " tc filter [ add | del | change | replace | show ] [ block BLOCK_INDEX ]\n" + " tc filter get dev STRING parent CLASSID protocol PROTO handle FILTERID pref PRIO FILTER_TYPE\n" + " tc filter get block BLOCK_INDEX protocol PROTO handle FILTERID pref PRIO FILTER_TYPE\n" " [ pref PRIO ] protocol PROTO [ chain CHAIN_INDEX ]\n" " [ estimator INTERVAL TIME_CONSTANT ]\n" " [ root | ingress | egress | parent CLASSID ]\n" " [ handle FILTERID ] [ [ FILTER_TYPE ] [ help | OPTIONS ] ]\n" "\n" " tc filter show [ dev STRING ] [ root | ingress | egress | parent CLASSID ]\n" + " tc filter show [ block BLOCK_INDEX ]\n" "Where:\n" "FILTER_TYPE := { rsvp | u32 | bpf | fw | route | etc. }\n" "FILTERID := ... format depends on classifier, see there\n" @@ -58,6 +61,7 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv, int chain_index_set = 0; char d[IFNAMSIZ] = {}; int protocol_set = 0; + __u32 block_index = 0; char *fhandle = NULL; __u32 protocol = 0; __u32 chain_index; @@ -89,7 +93,21 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv, NEXT_ARG(); if (d[0]) duparg("dev", *argv); + if (block_index) { + fprintf(stderr, "Error: \"dev\" and \"block\" are mutually exlusive\n"); + return -1; + } strncpy(d, *argv, sizeof(d)-1); + } else if (matches(*argv, "block") == 0) { + NEXT_ARG(); + if (block_index) + duparg("block", *argv); + if (d[0]) { + fprintf(stderr, "Error: \"dev\" and \"block\" are mutually exlusive\n"); + return -1; + } + if (get_u32(_index, *argv, 0) || !block_index) + invarg("invalid block index value", *argv); } else if (strcmp(*argv, "root") == 0) { if (req->t.tcm_parent) { fprintf(stderr, @@ -184,6 +202,9 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv, fprintf(stderr, "Cannot find device \"%s\"\n", d); return 1; } + } else if (block_index) { + req->t.tcm_ifindex = TCM_IFINDEX_MAGIC_BLOCK; + req->t.tcm_block_index = block_index; } if (q) { @@ -228,6 +249,7 @@ static __u32 filter_prio; static __u32 filter_protocol; static __u32 filter_chain_index; static int filter_chain_index_set; +static __u32 filter_block_index; __u16 f_proto; int print_filter(const struct
Re: [patch iproute2 net-next v11 3/4] tc: introduce support for block-handle for filter operations
Fri, Jan 19, 2018 at 09:51:22PM CET, dsah...@gmail.com wrote: >On 1/17/18 2:48 AM, Jiri Pirko wrote: >> @@ -89,7 +93,21 @@ static int tc_filter_modify(int cmd, unsigned int flags, >> int argc, char **argv, >> NEXT_ARG(); >> if (d[0]) >> duparg("dev", *argv); >> +if (block_index) { >> +fprintf(stderr, "Error: \"dev\" cannot be used >> in the same time as \"block\"\n"); > >'in the same time' does not sound right. something like: 'dev and block >are mutually exlusive' ack > >> +return -1; >> +} >> strncpy(d, *argv, sizeof(d)-1); >> +} else if (matches(*argv, "block") == 0) { >> +NEXT_ARG(); >> +if (block_index) >> +duparg("block", *argv); >> +if (d[0]) { >> +fprintf(stderr, "Error: \"block\" cannot be >> used in the same time as \"dev\"\n"); > >same here. Correct the ones below as well. ack > > >> +return -1; >> +} >> +if (get_u32(_index, *argv, 0) || !block_index) >> +invarg("invalid block index value", *argv);
Re: [patch iproute2 net-next v11 2/4] tc: introduce tc_qdisc_block_exists helper
Fri, Jan 19, 2018 at 09:45:57PM CET, dsah...@gmail.com wrote: >On 1/17/18 2:48 AM, Jiri Pirko wrote: >> From: Jiri Pirko>> >> This hepler used qdisc dump to list all qdisc and find if block index in >> question is used by any of them. That means the block with specified >> index exists. >> >> Signed-off-by: Jiri Pirko >> --- >> tc/tc_qdisc.c | 61 >> +++ >> tc/tc_util.h | 2 ++ >> 2 files changed, 63 insertions(+) >> >> diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c >> index 70279b9..3f91558 100644 >> --- a/tc/tc_qdisc.c >> +++ b/tc/tc_qdisc.c >> @@ -412,3 +412,64 @@ int do_qdisc(int argc, char **argv) >> fprintf(stderr, "Command \"%s\" is unknown, try \"tc qdisc help\".\n", >> *argv); >> return -1; >> } >> + >> +struct tc_qdisc_block_exists_ctx { >> +__u32 block_index; >> +bool found; >> +}; >> + >> +static int tc_qdisc_block_exists_cb(const struct sockaddr_nl *who, >> +struct nlmsghdr *n, void *arg) >> +{ >> +struct tc_qdisc_block_exists_ctx *ctx = arg; >> +struct tcmsg *t = NLMSG_DATA(n); >> +int len = n->nlmsg_len; >> +struct rtattr *tb[TCA_MAX+1]; > >reverse xmas tree :) Will fix > >> + >> +if (n->nlmsg_type != RTM_NEWQDISC) >> +return 0; >> + >> +len -= NLMSG_LENGTH(sizeof(*t)); >> +if (len < 0) >> +return -1; >> + >> +parse_rtattr(tb, TCA_MAX, TCA_RTA(t), len); >> + >> +if (tb[TCA_KIND] == NULL) >> +return -1; >> + >> +if (tb[TCA_INGRESS_BLOCK] && >> +RTA_PAYLOAD(tb[TCA_INGRESS_BLOCK]) >= sizeof(__u32)) { > >why not just == sizeof(__u32) since that's what it should be? test1:~/iproute2$ git grep "== sizeof(__u32)" | grep RTA_PAYLOAD| wc -l 1 test1:~/iproute2$ git grep ">= sizeof(__u32)" | grep RTA_PAYLOAD| wc -l 42 I just go with the flow :) > >> +__u32 block = rta_getattr_u32(tb[TCA_INGRESS_BLOCK]); >> + >> +if (block && block == ctx->block_index) > >block > 0 test should not be needed. If someone is calling >tc_qdisc_block_exists, then block_index should be > 0 in which case you >just need block == ctx->block_index Right. Will Fix. > > >> +ctx->found = true; >> +} >> + >> +if (tb[TCA_EGRESS_BLOCK] && >> +RTA_PAYLOAD(tb[TCA_EGRESS_BLOCK]) >= sizeof(__u32)) { >> +__u32 block = rta_getattr_u32(tb[TCA_EGRESS_BLOCK]); >> + >> +if (block && block == ctx->block_index) > >same 2 comments for this block ack > >> +ctx->found = true; >> +} >> +return 0; >> +} >> + >> +int tc_qdisc_block_exists(__u32 block_index) > >This should be bool since you are returning true / false Ah, missed that. Fixing. > >> +{ >> +struct tc_qdisc_block_exists_ctx ctx = { .block_index = block_index }; >> +struct tcmsg t = { .tcm_family = AF_UNSPEC }; >> + >> +if (rtnl_dump_request(, RTM_GETQDISC, , sizeof(t)) < 0) { >> +perror("Cannot send dump request"); >> +return false; >> +} >> + >> +if (rtnl_dump_filter(, tc_qdisc_block_exists_cb, ) < 0) { >> +perror("Dump terminated\n"); >> +return false; >> +} >> + >> +return ctx.found; >> +} >> diff --git a/tc/tc_util.h b/tc/tc_util.h >> index 1218610..72f4282 100644 >> --- a/tc/tc_util.h >> +++ b/tc/tc_util.h >> @@ -132,4 +132,6 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, >> struct rtattr *opt); >> int cls_names_init(char *path); >> void cls_names_uninit(void); >> >> +int tc_qdisc_block_exists(__u32 block_index); >> + >> #endif >> >
Re: [PATCH net-next v4 6/8] net: sched: create tc_can_offload_extack() wrapper
Sat, Jan 20, 2018 at 02:44:48AM CET, jakub.kicin...@netronome.com wrote: >From: Quentin Monnet> >Create a wrapper around tc_can_offload() that takes an additional >extack pointer argument in order to output an error message if TC >offload is disabled on the device. > >In this way, the error message is handled by the core and can be the >same for all drivers. > >Signed-off-by: Quentin Monnet >Reviewed-by: Jakub Kicinski >--- > include/net/pkt_cls.h | 11 +++ > 1 file changed, 11 insertions(+) > >diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h >index f497f622580b..2f8f16a4d88e 100644 >--- a/include/net/pkt_cls.h >+++ b/include/net/pkt_cls.h >@@ -656,6 +656,17 @@ static inline bool tc_can_offload(const struct net_device >*dev) > return dev->features & NETIF_F_HW_TC; > } > >+static inline bool tc_can_offload_extack(const struct net_device *dev, >+ struct netlink_ext_ack *extack) I don't like to add tc_can_offload variant for this. It makes sense the original tc_can_offload to be extended and set the extack message always. It would require some more work in drivers (5), sure, but we endup with nicer and consistent code. >+{ >+ bool can = tc_can_offload(dev); >+ >+ if (!can) >+ NL_SET_ERR_MSG(extack, "TC offload is disabled on net device"); >+ >+ return can; >+} >+ > static inline bool tc_skip_hw(u32 flags) > { > return (flags & TCA_CLS_FLAGS_SKIP_HW) ? true : false; >-- >2.15.1 >
Re: [PATCH net-next v4 5/8] net: sched: add extack support for offload via tc_cls_common_offload
Sat, Jan 20, 2018 at 02:44:47AM CET, jakub.kicin...@netronome.com wrote: >From: Quentin Monnet> >Add extack support for hardware offload of classifiers. In order >to achieve this, a pointer to a struct netlink_ext_ack is added to the >struct tc_cls_common_offload that is passed to the callback for setting >up the classifier. Function tc_cls_common_offload_init() is updated to >support initialization of this new attribute. > >Signed-off-by: Quentin Monnet >Reviewed-by: Jakub Kicinski >--- > include/net/pkt_cls.h| 5 - > net/sched/cls_bpf.c | 4 ++-- > net/sched/cls_flower.c | 6 +++--- > net/sched/cls_matchall.c | 4 ++-- > net/sched/cls_u32.c | 8 > 5 files changed, 15 insertions(+), 12 deletions(-) > >diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h >index 2e4b8e436d25..f497f622580b 100644 >--- a/include/net/pkt_cls.h >+++ b/include/net/pkt_cls.h >@@ -602,15 +602,18 @@ struct tc_cls_common_offload { > u32 chain_index; > __be16 protocol; > u32 prio; >+ struct netlink_ext_ack *extack; > }; > > static inline void > tc_cls_common_offload_init(struct tc_cls_common_offload *cls_common, >- const struct tcf_proto *tp) >+ const struct tcf_proto *tp, >+ struct netlink_ext_ack *extack) > { > cls_common->chain_index = tp->chain->index; > cls_common->protocol = tp->protocol; > cls_common->prio = tp->prio; >+ cls_common->extack = extack; > } [...] >diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c >index f675a92e1b66..727c10378f37 100644 >--- a/net/sched/cls_flower.c >+++ b/net/sched/cls_flower.c >@@ -223,7 +223,7 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, >struct cls_fl_filter *f) > struct tc_cls_flower_offload cls_flower = {}; > struct tcf_block *block = tp->chain->block; > >- tc_cls_common_offload_init(_flower.common, tp); >+ tc_cls_common_offload_init(_flower.common, tp, NULL); While you are at it, you should propagate extack whenever possible. For this, you can easily pass extack to fl_hw_destroy_filter. Same for other classifiers. > cls_flower.command = TC_CLSFLOWER_DESTROY; > cls_flower.cookie = (unsigned long) f; > >@@ -243,7 +243,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp, > bool skip_sw = tc_skip_sw(f->flags); > int err; > >- tc_cls_common_offload_init(_flower.common, tp); >+ tc_cls_common_offload_init(_flower.common, tp, extack); > cls_flower.command = TC_CLSFLOWER_REPLACE; > cls_flower.cookie = (unsigned long) f; > cls_flower.dissector = dissector; >@@ -272,7 +272,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, >struct cls_fl_filter *f) > struct tc_cls_flower_offload cls_flower = {}; > struct tcf_block *block = tp->chain->block; > >- tc_cls_common_offload_init(_flower.common, tp); >+ tc_cls_common_offload_init(_flower.common, tp, NULL); For this, it would be probably a bit trickier to get extack, but also possible. My motivation is to make the extack always available for users of tc_cls_common_offload > cls_flower.command = TC_CLSFLOWER_STATS; > cls_flower.cookie = (unsigned long) f; > cls_flower.exts = >exts; [...]
Re: [PATCH bpf-next 0/4] libbpf: add XDP binding support
On 01/20/2018 03:27 AM, Alexei Starovoitov wrote: > On Sat, Jan 20, 2018 at 03:00:37AM +0100, Daniel Borkmann wrote: >> On 01/19/2018 12:43 AM, Eric Leblond wrote: >>> Hello, >>> >>> This patchset rebases the libbpf code on latest bpf-next code and addresses >>> remarks by Daniel. >> >> Ok, I think it's a good start. We should later on clean up the >> netlink handling code a bit, but that's all internal and can be >> done in a second step. Applied to bpf-next, thanks Eric. > > Sorry, Eric, Daniel. > I had to revert this patch set. It breaks build on systems > where headers are not the most recent. > > Since libbpf is used by perf it has to be built cleanly on centos7 at least. > > The errors I got: > bpf.c: In function ‘bpf_set_link_xdp_fd’: > bpf.c:456:23: error: ‘SOL_NETLINK’ undeclared (first use in this function) > if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK, >^~~ > bpf.c:456:23: note: each undeclared identifier is reported only once for each > function it appears in > bpf.c:456:36: error: ‘NETLINK_EXT_ACK’ undeclared (first use in this function) > if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK, > ^~~ > nlattr.c: In function ‘nla_dump_errormsg’: > nlattr.c:152:34: error: ‘NLMSGERR_ATTR_MAX’ undeclared (first use in this > function) > struct nla_policy extack_policy[NLMSGERR_ATTR_MAX + 1] = { Yeah, fully agree, thanks for catching this, Alexei!
Re: [PATCH net] net/mlx5e: Fix fixpoint divide exception in mlx5e_am_stats_compare
Hello! On 1/19/2018 11:10 PM, Saeed Mahameed wrote: From: Talat BatheeshHelmut reported a bug about devision by zero while Division. running traffic and doing physical cable pull test. When the cable unplugged the ppms become zero, so when dividing the current ppms by the previous ppms in the You got it right the on the 2nd time. :-) next dim iteration there is devision by zero. This patch prevent this division for both ppms and epms. Fixes: c3164d2fc48f ("net/mlx5e: Added BW check for DIM decision mechanism") Reported-by: Helmut Grauer Signed-off-by: Talat Batheesh Signed-off-by: Saeed Mahameed [...] MBR, Sergei