Re: [PATCH 0/7] net: core: devname allocation cleanups

2017-11-13 Thread David Miller
From: Rasmus Villemoes 
Date: Mon, 13 Nov 2017 00:15:03 +0100

> It's somewhat confusing to have both dev_alloc_name and
> dev_get_valid_name. I can't see why the former is less strict than the
> latter, so make them (or rather dev_alloc_name_ns and
> dev_get_valid_name) equivalent, hardening dev_alloc_name() a little.
> 
> Obvious follow-up patches would be to only export one function, and
> make dev_alloc_name a static inline wrapper for that (whichever name
> is chosen for the exported interface). But maybe there is a good
> reason the two exported interfaces do different checking, so I'll
> refrain from including the trivial but tree-wide renaming in this
> series.

Series applied, thanks.
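
For readers following along, here is a minimal sketch of the follow-up Rasmus
describes (making dev_alloc_name() a thin wrapper around a single exported,
namespace-aware helper); the helper name and signature are assumptions, not
taken from the series:

/* Sketch only: whichever name ends up exported, callers keep the familiar
 * two-argument form. */
static inline int dev_alloc_name(struct net_device *dev, const char *name)
{
	return dev_alloc_name_ns(dev_net(dev), dev, name);
}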


Re: [PATCH net-next] net-sysfs: trigger netlink notification on ifalias change via sysfs

2017-11-13 Thread Jiri Pirko
Tue, Nov 14, 2017 at 08:21:36AM CET, ro...@cumulusnetworks.com wrote:
>From: Roopa Prabhu 
>
>This patch adds netlink notifications on ifalias changes via sysfs.
>This makes it consistent with the netlink path, which also calls
>netdev_state_change, and with other sysfs netdev_store operations.
>
>Signed-off-by: Roopa Prabhu 

Reviewed-by: Jiri Pirko 


Re: [PATCH net 0/6] tls: Miscellaneous fixes

2017-11-13 Thread David Miller
From: Ilya Lesokhin 
Date: Mon, 13 Nov 2017 10:22:43 +0200

> Here's a set of miscellaneous fix patches.
> 
> Patch 1 makes sure aead_request is initialized properly.
> Patches 2-3 fix a memory leak we've encountered.
> Patch 4 moves tls_make_aad to allow sharing it in the future.
> Patch 5 fixes a TOCTOU issue reported here:
> https://www.spinics.net/lists/kernel/msg2608603.html
> Patch 6 avoids callback overriding when tls_set_sw_offload fails.

Series applied, thanks.


[PATCH v3] wcn36xx: Set default BTLE coexistence config

2017-11-13 Thread Ramon Fried
From: Eyal Ilsar 

If the values for the firmware configuration parameters
BTC_STATIC_LEN_LE_BT and BTC_STATIC_LEN_LE_WLAN are not set, the duty
cycle between BT and WLAN is such that WLAN gets zero bandwidth whenever
BT (including BLE) is active. When tuning these parameters, too high a
value for WLAN means that BLE performance degrades.
A "sweet spot" of roughly half of the maximal values was empirically
found to achieve a balance between BLE and Wi-Fi coexistence
performance.

Signed-off-by: Eyal Ilsar 
Signed-off-by: Ramon Fried 
---
 drivers/net/wireless/ath/wcn36xx/smd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c b/drivers/net/wireless/ath/wcn36xx/smd.c
index 9c6590d5348a..1c7598752255 100644
--- a/drivers/net/wireless/ath/wcn36xx/smd.c
+++ b/drivers/net/wireless/ath/wcn36xx/smd.c
@@ -72,8 +72,10 @@ static struct wcn36xx_cfg_val wcn36xx_cfg_vals[] = {
WCN36XX_CFG_VAL(DYNAMIC_PS_POLL_VALUE, 0),
WCN36XX_CFG_VAL(TX_PWR_CTRL_ENABLE, 1),
WCN36XX_CFG_VAL(ENABLE_CLOSE_LOOP, 1),
-   WCN36XX_CFG_VAL(ENABLE_LPWR_IMG_TRANSITION, 0),
+   WCN36XX_CFG_VAL(BTC_STATIC_LEN_LE_BT, 12),
+   WCN36XX_CFG_VAL(BTC_STATIC_LEN_LE_WLAN, 3),
WCN36XX_CFG_VAL(MAX_ASSOC_LIMIT, 10),
+   WCN36XX_CFG_VAL(ENABLE_LPWR_IMG_TRANSITION, 0),
WCN36XX_CFG_VAL(ENABLE_MCC_ADAPTIVE_SCHEDULER, 0),
 };
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH net-next] net-sysfs: trigger netlink notification on ifalias change via sysfs

2017-11-13 Thread Roopa Prabhu
From: Roopa Prabhu 

This patch adds netlink notifications on iflias changes via sysfs.
makes it consistent with the netlink path which also calls
netdev_state_change. Also makes it consistent with other sysfs
netdev_store operations.

Signed-off-by: Roopa Prabhu 
---
 net/core/net-sysfs.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 51d5836..799b752 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -382,7 +382,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
struct net_device *netdev = to_net_dev(dev);
struct net *net = dev_net(netdev);
size_t count = len;
-   ssize_t ret;
+   ssize_t ret = 0;
 
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
@@ -391,9 +391,20 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
if (len >  0 && buf[len - 1] == '\n')
--count;
 
-   ret = dev_set_alias(netdev, buf, count);
+   if (!rtnl_trylock())
+   return restart_syscall();
+
+   if (dev_isalive(netdev)) {
+   ret = dev_set_alias(netdev, buf, count);
+   if (ret < 0)
+   goto err;
+   ret = len;
+   netdev_state_change(netdev);
+   }
+err:
+   rtnl_unlock();
 
-   return ret < 0 ? ret : len;
+   return ret;
 }
 
 static ssize_t ifalias_show(struct device *dev,
-- 
2.1.4



Re: [PATCH net-next 0/3] bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics

2017-11-13 Thread David Miller
From: Yonghong Song 
Date: Sat, 11 Nov 2017 16:00:33 -0800

> This patch set intends to change verifier ARG_CONST_SIZE_OR_ZERO
> semantics so that simpler bpf programs can be written with verifier
> acceptance. Patch #1 comment provided the detailed examples and
> the patch itself implements the new semantics. Patch #2
> changes bpf_probe_read helper arg2 type from
> ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO. Patch #3 fixed a few
> test cases and added some for better coverage.

Series applied, thank you.
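
For illustration, a hedged sketch of the kind of program the relaxed
ARG_CONST_SIZE_OR_ZERO semantics accept: a length bounded to [0, N] can be
passed straight to bpf_probe_read() without an extra non-zero branch.
get_len() and the attach point are placeholders, not from the patch set.

SEC("kprobe/some_func")
int probe(struct pt_regs *ctx)
{
	char buf[64] = {};
	/* get_len() is hypothetical; what matters is the verifier-visible
	 * bound: len is in [0, 63] and zero is now acceptable. */
	__u32 len = get_len(ctx) & (sizeof(buf) - 1);

	bpf_probe_read(buf, len, (void *)PT_REGS_PARM1(ctx));
	return 0;
}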


Re: [PATCH v2 net-next] tcp: allow drivers to tweak TSQ logic

2017-11-13 Thread David Miller
From: Eric Dumazet 
Date: Sat, 11 Nov 2017 15:54:12 -0800

> From: Eric Dumazet 
> 
> I had many reports that TSQ logic breaks wifi aggregation.
> 
> Current logic is to allow up to 1 ms of bytes to be queued into qdisc
> and drivers queues.
> 
> But Wifi aggregation needs a bigger budget to allow bigger rates to
> be discovered by various TCP Congestion Controls algorithms.
> 
> This patch adds an extra socket field, allowing wifi drivers to select
> another log scale to derive TCP Small Queue credit from current pacing
> rate.
> 
> Initial value is 10, meaning that this patch does not change current
> behavior.
> 
> We expect wifi drivers to set this field to smaller values (tests have
> been done with values from 6 to 9)
> 
> They would have to use following template :
> 
> if (skb->sk && skb->sk->sk_pacing_shift != MY_PACING_SHIFT)
>  skb->sk->sk_pacing_shift = MY_PACING_SHIFT;
> 
> 
> Ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1670041
> Signed-off-by: Eric Dumazet 
> Cc: Johannes Berg 
> Cc: Toke Høiland-Jørgensen 
> Cc: Kir Kolyshkin 
> ---
> v2: added kernel-doc comment, based on Johannes feedback.

Applied, thanks Eric.
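
For context, a rough sketch (not the exact kernel code) of where the new
sk_pacing_shift field enters the TCP Small Queues budget:

/* Roughly: the TSQ limit is derived from the pacing rate, scaled down by
 * sk_pacing_shift.  Shift 10 keeps about 1 ms of bytes queued; a wifi
 * driver lowering the shift to 6-9 allows proportionally more outstanding
 * bytes, which helps frame aggregation. */
static unsigned long tsq_budget(const struct sock *sk, const struct sk_buff *skb)
{
	return max(2UL * skb->truesize,
		   (unsigned long)sk->sk_pacing_rate >> sk->sk_pacing_shift);
}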


Re: [PATCH net-next 0/3] rxrpc: Fixes

2017-11-13 Thread David Miller
From: David Howells 
Date: Sat, 11 Nov 2017 17:57:52 +

> 
> Here are some patches that fix some things in AF_RXRPC:
> 
>  (1) Prevent notifications from being passed to a kernel service for a call
>  that it has ended.
> 
>  (2) Fix a null pointer dereference that occurs under some circumstances when an
>  ACK is generated.
> 
>  (3) Fix a number of things to do with call expiration.
> 
> The patches can be found here also:
> 
>   
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-next
> 
> Tagged thusly:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>   rxrpc-next-2017

Pulled, thanks David.


Re: [PATCHv3 1/1] bnx2x: fix slowpath null crash

2017-11-13 Thread David Miller
From: Zhu Yanjun 
Date: Sat, 11 Nov 2017 10:42:03 -0500

> When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
> BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
> bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
> NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem allocates dma
> failure. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
> allocate memory" pops out. The variable slowpath is set to NULL.
> When shutdown the NIC, the function bnx2x_nic_unload is called. In
> the function bnx2x_nic_unload, the following functions are executed.
> bnx2x_chip_cleanup
> bnx2x_set_storm_rx_mode
> bnx2x_set_q_rx_mode
> bnx2x_set_q_rx_mode
> bnx2x_config_rx_mode
> bnx2x_set_rx_mode_e2
> In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
> Then the crash occurs.
> To fix this crash, the variable slowpath is checked. And in the function
> bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
> and open NIC is executed.
> 
> CC: Joe Jin 
> CC: Junxiao Bi 
> Signed-off-by: Zhu Yanjun 
> Acked-by: Ariel Elior 

Applied.
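
A minimal sketch of the guard the changelog describes (the actual patch may
differ in detail):

	/* in the rx-mode configuration path, e.g. bnx2x_set_rx_mode_e2(): */
	if (!bp->slowpath)
		return;		/* DMA alloc failed earlier; nothing to program */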


Re: [PATCH net-next 0/2] cxgb4: collect LE-TCAM and SGE queue contexts

2017-11-13 Thread David Miller
From: Rahul Lakkireddy 
Date: Sat, 11 Nov 2017 19:48:14 +0530

> Collect hardware dumps via ethtool --get-dump facility.
> 
> Patch 1 collects LE-TCAM dump.
> 
> Patch 2 collects SGE queue context dumps.

Series applied, thanks.


Re: [PATCH net] vxlan: fix the issue that neigh proxy blocks all icmpv6 packets

2017-11-13 Thread David Miller
From: Xin Long 
Date: Sat, 11 Nov 2017 19:58:50 +0800

> Commit f1fb08f6337c ("vxlan: fix ND proxy when skb doesn't have transport
> header offset") removed icmp6_code and icmp6_type check before calling
> neigh_reduce when doing neigh proxy.
> 
> It means all icmpv6 packets would be blocked by this, not only ns packet.
> In Jianlin's env, even ping6 couldn't work through it.
> 
> This patch brings the icmp6_code and icmp6_type check back and also
> removes the same check from neigh_reduce().
> 
> Fixes: f1fb08f6337c ("vxlan: fix ND proxy when skb doesn't have transport header offset")
> Reported-by: Jianlin Shi 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.
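
For illustration, the restored check in the vxlan transmit path looks roughly
like this (a sketch assuming the pre-f1fb08f6337c field names; see the patch
itself for the exact code):

	struct nd_msg *msg = (struct nd_msg *)(ipv6_hdr(skb) + 1);

	/* Only divert neighbour solicitations to the ND proxy; let every
	 * other ICMPv6 packet (ping6, etc.) pass through untouched. */
	if (msg->icmph.icmp6_code == 0 &&
	    msg->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION)
		return neigh_reduce(dev, skb, vni);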


Re: [PATCH v2 2/2] chcr: Add support for Inline IPSec

2017-11-13 Thread David Miller
From: Atul Gupta 
Date: Thu,  9 Nov 2017 16:59:01 +0530

> Register xfrmdev_ops callbacks and send IPsec tunneled data
> to HW for inline processing.
> The driver uses the hardware crypto accelerator to encrypt and
> generate the ICV for the transmitted packet in inline mode.
> 
> Signed-off-by: Atul Gupta 
> Signed-off-by: Harsh Jain 
> Signed-off-by: Ganesh Goudar 
> ---
> V2: Fixed the build warnings and created patch against cryptodev
> to avoid possible merge conflicts

Herbert, feel free to merge these two patches via your crypto
tree.

Thanks!


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread John Fastabend
On 11/13/2017 09:33 PM, Björn Töpel wrote:
> 2017-11-14 0:50 GMT+01:00 Alexei Starovoitov :
>> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>>
>>> 2017-10-31 13:41 GMT+01:00 Björn Töpel :

 From: Björn Töpel 

>>> [...]


 We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
 Korea, and our paper with complete benchmarks will be released shortly
 on the NetDev 2.2 site.

>>>
>>> We're back in the saddle after an excellent netdevconf week. Kudos to
>>> the organizers; We had a blast! Thanks for all the constructive
>>> feedback.
>>>
>>> I'll summarize the major points, that we'll address in the next RFC
>>> below.
>>>
>>> * Instead of extending AF_PACKET with yet another version, introduce a
>>>   new address/packet family. As for naming, we had some suggestions:
>>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>>   AF_ZEROCOPY, unless there are strong opinions against it.
>>>
>>> * No explicit zerocopy enablement. Use the zerocopy path if
>>>   supported, if not -- fallback to the skb path, for netdevs that
>>>   don't support the required ndos. Further, we'll have the zerocopy
>>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>>   socket will consume the skb and we'll honor skb->queue_mapping,
>>>   meaning that we only consume the packets for the enabled queue.
>>>
>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>   in a separate patchset.
>>
>>
>> all sounds good to me except above bit.
>> I don't remember people suggesting to split it this way.
>> What's the value of it without tx?
>>
> 
> We definitely need Tx for our use-cases! I'll rephrase: the idea
> was to make the initial patch set without Tx *driver*-specific
> code, e.g. using ndo_xdp_xmit/flush at a later point.
> 
> So AF_ZEROCOPY, the socket parts, would have Tx support.
> 
> @John Did I recall that correctly?
> 

Yep, that is what I said. However, on second thought, without the
driver tx half I guess tx will be significantly slower. So in order
to get the driver API correct in the first go around lets implement
this in the first series as well.

Just try to minimize the TX driver work as much as possible.

Thanks,
John



Re: KASAN: use-after-free Read in rds_tcp_dev_event

2017-11-13 Thread Sowmini Varadhan
On (11/13/17 19:30), Girish Moodalbail wrote:
> (L538-540). However, it leaves behind some of the rds_tcp connections that
> shared the same underlying RDS connection (L534 and 535). These connections
> with pointer to stale network namespace are left behind in the global list.

It leaves behind no such thing. After mprds, you want to collect
only one instance of the conn that is being removed; that's why
lines 534-535 skip over duplicate instances of the same conn
(for multiple paths in the same conn).

> When the 2nd network namespace is deleted, we will hit the above stale
> pointer and hit UAF panic.
> I think we should move away from global list to a per-namespace list. The
> global list are used only in two places (both of which are per-namespace
> operations):

Nice try, but not so. 

Let me look at this tomorrow, I missed this mail in my mbox.

--Sowmini



Re: [PATCH] wcn36xx: Set BTLE coexistence related configuration values to defaults

2017-11-13 Thread Kalle Valo
Ramon Fried  writes:

> From: Eyal Ilsar 
>
> If the values for the firmware configuration parameters
> BTC_STATIC_LEN_LE_BT and BTC_STATIC_LEN_LE_WLAN are not set, the duty
> cycle between BT and WLAN is such that WLAN gets zero bandwidth whenever
> BT (including BLE) is active. When tuning these parameters, too high a
> value for WLAN means that BLE performance degrades. A "sweet spot" of
> roughly half of the maximal values was empirically found to achieve
> a balance between BLE and Wi-Fi coexistence performance.
>
> Signed-off-by: Eyal Ilsar 
> Signed-off-by: Ramon Fried 

When you submit a new version of the patch, please include the version
number:

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#patch_version_missing

So after fixing Bjorn's comments the next version should be v3.

-- 
Kalle Valo


Re: [PATCH v5 00/13] exit_net checks for objects initialized in net_init hook

2017-11-13 Thread David Miller
From: Vasily Averin 
Date: Sun, 12 Nov 2017 22:26:44 +0300

> The OpenVz kernel team has a long history of fighting namespace-related
> bugs, some of which could be prevented by the simple checks described below.
> 
> One typical class of errors is related to the life cycle of namespaces:
> objects created for a namespace should not live longer than the
> namespace itself.
> 
> Such issues can be invisible on usual systems where additional namespaces
> are not used, because the initial namespaces usually live forever and are
> never destroyed.
> 
> However, on systems with namespaces they can lead to memory leaks or to
> use-after-free, both of which are critical for systems running containers.
> As you know, it is quite hard to find the reason for such issues,
> especially in rarely-triggered scenarios on production nodes running
> default kernels without debug settings enabled. Any additional hints can
> be useful here.
> 
> This patch set should help to detect some of these issues.
> It is based on the assumption that objects initialized in the init hook of
> pernet_operations should have returned to their initial state by the end
> of the exit hook.
> 
> Many drivers and subsystems already have such checks; however, I've found a
> number of places where a list_empty check would be useful, at least as a
> smoke test.
> 
> These checks are also useful for long-term stable kernels, as they allow
> detecting problems caused by incomplete or incorrectly backported patches.

All applied to net-next except patch #9 and #10 which need to go via the
NFS maintainer.
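
For illustration, the kind of smoke test described above looks roughly like
this; the foo_* names are hypothetical:

static void __net_exit foo_net_exit(struct net *net)
{
	struct foo_net *fn = net_generic(net, foo_net_id);

	/* ... normal per-netns teardown ... */

	/* Everything created for this namespace must already be gone; a
	 * leftover entry points at an incomplete or buggy cleanup path. */
	WARN_ON_ONCE(!list_empty(&fn->object_list));
}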


Re: [patch net] net: forbid netdev used by mirred tc act from being moved to another netns

2017-11-13 Thread Jiri Pirko
Tue, Nov 14, 2017 at 06:51:42AM CET, xiyou.wangc...@gmail.com wrote:
>On Mon, Nov 13, 2017 at 9:17 PM, Jiri Pirko  wrote:
>> Mon, Nov 13, 2017 at 08:53:57PM CET, xiyou.wangc...@gmail.com wrote:
>>>On Mon, Nov 13, 2017 at 6:05 AM, Jiri Pirko  wrote:
 From: Jiri Pirko 

 Currently, user may choose to move device that is used by mirred action
 to another network namespace. That is wrong as the action still remains
 in the original namespace and references non-existing ifindex.
>>>
>>>It is a pure display issue, the action itself should function well
>>>because we only use ifindex to lookup netdevice once and
>>>we save the netdevice pointer in action.
>>>
>>>If you really want to fix it, just tell iproute2 to display netnsid together
>>>with ifindex.
>>
>> It is not only a display issue. I think it is wrong to let a netdevice
>
>What's wrong with it? Is it mis-functioning?

Nope.

>
>> disappear from underneath the mirred action. You certainly cannot add an
>
>
>It disappears only because we don't display it properly, nothing else.

Okay.

>
>
>> action mirred with device from another net namespace. So should we allow
>> that?
>
>On the other hand why linking a device to mirred action prevents it
>from moving to another netns? Also, device can be moved back too.
>
>I don't see anything wrong with it except displaying it.

Okay. What about my question? Should we allow adding a mirred action
pointing to a netdev in another netns? I think it would make sense if
we consider moving the mirred device legit.


[PATCH net-next] openvswitch: Using kfree_rcu() to simplify the code

2017-11-13 Thread Wei Yongjun
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.

Signed-off-by: Wei Yongjun 
---
 net/openvswitch/meter.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
index 2a5ba35..f9e2b1f 100644
--- a/net/openvswitch/meter.c
+++ b/net/openvswitch/meter.c
@@ -42,19 +42,12 @@
[OVS_BAND_ATTR_STATS] = { .len = sizeof(struct ovs_flow_stats) },
 };
 
-static void rcu_free_ovs_meter_callback(struct rcu_head *rcu)
-{
-   struct dp_meter *meter = container_of(rcu, struct dp_meter, rcu);
-
-   kfree(meter);
-}
-
 static void ovs_meter_free(struct dp_meter *meter)
 {
if (!meter)
return;
 
-   call_rcu(&meter->rcu, rcu_free_ovs_meter_callback);
+   kfree_rcu(meter, rcu);
 }
 
 static struct hlist_head *meter_hash_bucket(const struct datapath *dp,



[PATCH net-next] openvswitch: Make local function ovs_nsh_key_attr_size() static

2017-11-13 Thread Wei Yongjun
Fixes the following sparse warnings:

net/openvswitch/flow_netlink.c:340:8: warning:
 symbol 'ovs_nsh_key_attr_size' was not declared. Should it be static?

Signed-off-by: Wei Yongjun 
---
 net/openvswitch/flow_netlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 4201f92..d79c364 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -337,7 +337,7 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
 }
 
-size_t ovs_nsh_key_attr_size(void)
+static size_t ovs_nsh_key_attr_size(void)
 {
/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
 * updating this function.



Re: [PATCH V2 net] net: hns3: Updates MSI/MSI-X alloc/free APIs(depricated) to new APIs

2017-11-13 Thread Yuval Shaia
On Thu, Nov 09, 2017 at 04:38:13PM +, Salil Mehta wrote:
> This patch migrates the HNS3 driver code from use of depricated PCI
> MSI/MSI-X interrupt vector allocation/free APIs to new common APIs.
> 
> Signed-off-by: Salil Mehta 
> Suggested-by: Christoph Hellwig 
> ---
> PATCH V2: Yuval Shaia 
>   Link -> https://lkml.org/lkml/2017/11/9/138
> PATCH V1: Initial Submit
> ---
>  .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 107 +++--
>  .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  15 ++-
>  2 files changed, 42 insertions(+), 80 deletions(-)

Reviewed-by: Yuval Shaia 

> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> index c1cdbfd..d65c599 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -885,14 +885,14 @@ static int hclge_query_pf_resource(struct hclge_dev *hdev)
>   hdev->pkt_buf_size = __le16_to_cpu(req->buf_size) << HCLGE_BUF_UNIT_S;
>  
>   if (hnae3_dev_roce_supported(hdev)) {
> - hdev->num_roce_msix =
> + hdev->num_roce_msi =
>   hnae_get_field(__le16_to_cpu(req->pf_intr_vector_number),
>  HCLGE_PF_VEC_NUM_M, HCLGE_PF_VEC_NUM_S);
>  
>   /* PF should have NIC vectors and Roce vectors,
>* NIC vectors are queued before Roce vectors.
>*/
> - hdev->num_msi = hdev->num_roce_msix  + HCLGE_ROCE_VECTOR_OFFSET;
> + hdev->num_msi = hdev->num_roce_msi  + HCLGE_ROCE_VECTOR_OFFSET;
>   } else {
>   hdev->num_msi =
>   hnae_get_field(__le16_to_cpu(req->pf_intr_vector_number),
> @@ -1835,7 +1835,7 @@ static int hclge_init_roce_base_info(struct hclge_vport *vport)
>   struct hnae3_handle *roce = &vport->roce;
>   struct hnae3_handle *nic = &vport->nic;
>  
> - roce->rinfo.num_vectors = vport->back->num_roce_msix;
> + roce->rinfo.num_vectors = vport->back->num_roce_msi;
>  
>   if (vport->back->num_msi_left < vport->roce.rinfo.num_vectors ||
>   vport->back->num_msi_left == 0)
> @@ -1853,67 +1853,47 @@ static int hclge_init_roce_base_info(struct hclge_vport *vport)
>   return 0;
>  }
>  
> -static int hclge_init_msix(struct hclge_dev *hdev)
> +static int hclge_init_msi(struct hclge_dev *hdev)
>  {
>   struct pci_dev *pdev = hdev->pdev;
> - int ret, i;
> -
> - hdev->msix_entries = devm_kcalloc(&pdev->dev, hdev->num_msi,
> -   sizeof(struct msix_entry),
> -   GFP_KERNEL);
> - if (!hdev->msix_entries)
> - return -ENOMEM;
> -
> - hdev->vector_status = devm_kcalloc(&pdev->dev, hdev->num_msi,
> -sizeof(u16), GFP_KERNEL);
> - if (!hdev->vector_status)
> - return -ENOMEM;
> + int vectors;
> + int i;
>  
> - for (i = 0; i < hdev->num_msi; i++) {
> - hdev->msix_entries[i].entry = i;
> - hdev->vector_status[i] = HCLGE_INVALID_VPORT;
> + vectors = pci_alloc_irq_vectors(pdev, 1, hdev->num_msi,
> + PCI_IRQ_MSI | PCI_IRQ_MSIX);
> + if (vectors < 0) {
> + dev_err(&pdev->dev,
> + "failed(%d) to allocate MSI/MSI-X vectors\n",
> + vectors);
> + return vectors;
>   }
> + if (vectors < hdev->num_msi)
> + dev_warn(&hdev->pdev->dev,
> +  "requested %d MSI/MSI-X, but allocated %d MSI/MSI-X\n",
> +  hdev->num_msi, vectors);
>  
> - hdev->num_msi_left = hdev->num_msi;
> - hdev->base_msi_vector = hdev->pdev->irq;
> + hdev->num_msi = vectors;
> + hdev->num_msi_left = vectors;
> + hdev->base_msi_vector = pdev->irq;
>   hdev->roce_base_vector = hdev->base_msi_vector +
>   HCLGE_ROCE_VECTOR_OFFSET;
>  
> - ret = pci_enable_msix_range(hdev->pdev, hdev->msix_entries,
> - hdev->num_msi, hdev->num_msi);
> - if (ret < 0) {
> - dev_info(&hdev->pdev->dev,
> -  "MSI-X vector alloc failed: %d\n", ret);
> - return ret;
> - }
> -
> - return 0;
> -}
> -
> -static int hclge_init_msi(struct hclge_dev *hdev)
> -{
> - struct pci_dev *pdev = hdev->pdev;
> - int vectors;
> - int i;
> -
>   hdev->vector_status = devm_kcalloc(&pdev->dev, hdev->num_msi,
>  sizeof(u16), GFP_KERNEL);
> - if (!hdev->vector_status)
> + if (!hdev->vector_status) {
> + pci_free_irq_vectors(pdev);
>   return -ENOMEM;
> + }
>  
>   for (i = 0; i < hdev->num_msi; i++)
>   hdev->vector_status[i] = HCLGE_INVALID_VPORT;
>  
> - vectors = pci_alloc_irq_vectors(pdev, 1, hde

[PATCH net-next] openvswitch: Fix return value check in ovs_meter_cmd_features()

2017-11-13 Thread Wei Yongjun
In case of error, the function ovs_meter_cmd_reply_start() returns
ERR_PTR() not NULL. The NULL test in the return value check should
be replaced with IS_ERR().

Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
Signed-off-by: Wei Yongjun 
---
 net/openvswitch/meter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
index 2a5ba35..2e58b6c 100644
--- a/net/openvswitch/meter.c
+++ b/net/openvswitch/meter.c
@@ -166,7 +166,7 @@ static int ovs_meter_cmd_features(struct sk_buff *skb, struct genl_info *info)
 
reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_FEATURES,
  &ovs_reply_header);
-   if (!reply)
+   if (IS_ERR(reply))
return PTR_ERR(reply);
 
if (nla_put_u32(reply, OVS_METER_ATTR_MAX_METERS, U32_MAX) ||



[PATCH net-next] liquidio: Missing error code in liquidio_init_nic_module()

2017-11-13 Thread Dan Carpenter
We accidentally return success if lio_vf_rep_modinit() fails instead of
propogating the error code.

Fixes: e20f469660ad ("liquidio: synchronize VF representor names with NIC firmware")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index f05045a69dcc..6aa0eee88ea5 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -4038,7 +4038,8 @@ static int liquidio_init_nic_module(struct octeon_device *oct)
 */
if (!oct->octeon_id &&
oct->fw_info.app_cap_flags & LIQUIDIO_SWITCHDEV_CAP) {
-   if (lio_vf_rep_modinit()) {
+   retval = lio_vf_rep_modinit();
+   if (retval) {
liquidio_stop_nic_module(oct);
goto octnet_init_failure;
}


[PATCH net-next] xdp: sample: Missing curly braces in read_route()

2017-11-13 Thread Dan Carpenter
The assert statement is supposed to be part of the else branch but the
curly braces were accidentally left off.

Fixes: 3e29cd0e6563 ("xdp: Sample xdp program implementing ip forward")
Signed-off-by: Dan Carpenter 

diff --git a/samples/bpf/xdp_router_ipv4_user.c b/samples/bpf/xdp_router_ipv4_user.c
index 2c1fe3f4b1a4..916462112d55 100644
--- a/samples/bpf/xdp_router_ipv4_user.c
+++ b/samples/bpf/xdp_router_ipv4_user.c
@@ -206,12 +206,13 @@ static void read_route(struct nlmsghdr *nh, int nll)
direct_entry.arp.mac = 0;
direct_entry.arp.dst = 0;
if (route.dst_len == 32) {
-   if (nh->nlmsg_type == RTM_DELROUTE)
+   if (nh->nlmsg_type == RTM_DELROUTE) {
assert(bpf_map_delete_elem(map_fd[3], &route.dst) == 0);
-   else
+   } else {
if (bpf_map_lookup_elem(map_fd[2], &route.dst, &direct_entry.arp.mac) == 0)
direct_entry.arp.dst = route.dst;
assert(bpf_map_update_elem(map_fd[3], &route.dst, &direct_entry, 0) == 0);
+   }
}
for (i = 0; i < 4; i++)
prefix_key->data[i] = (route.dst >> i * 8) & 0xff;


Re: r8169 regression: UDP packets dropped intermittantly

2017-11-13 Thread Jonathan Woithe
As far as I am aware there were no follow up comments to my last post on
this subject on 24 March 2017.  The text of that post is included below for
reference.  To summarise: a short test program which reliably triggered the
problem was written in the hope it would assist in the repair of this
regression.

Today I ran the tests on the 4.14 kernel.  The problem is still present.  If
the same machine is run under a 4.3 kernel with the hacked r8169 driver, the
problem does not occur; using the stock 4.3 r8169 driver triggers the problem.
It also works without trouble under 2.6.35.11 (the kernel we've stuck with
due to the problem affecting most newer kernels).

To recap the history of this thread, the misbehaviour of the r8169 driver in
the presence of small UDP packets affects kernels newer than 3.3.  The
initial post in this thread was on 9 March 2013.  The regression was
introduced with commit da78dbff2e05630921c551dbbc70a4b7981a8fff.

Since this regression has persisted for more than 4 years, is there any
chance that it will be fixed?  The inability to run newer kernels has
prevented us from providing them as upgrades in our products.  If this
problem in the r8169 driver will never be fixed, it seems we'll have to find
a supply of a PCI/PCIe NIC which doesn't utilise this driver.  Of course
this won't help those whose systems in the field are fitted with the
r8169-based card.

Regards
  jonathan

Post from Mar 24, 2017:

> On Thu, Jun 23, 2016 at 01:22:50AM +0200, Francois Romieu wrote:
> > Jonathan Woithe  :
> > [...]
> > > to mainline (in which case I'll keep watching out for it)?  Or is the
> > > out-of-tree workaround mentioned above considered to be the long term
> > > fix for those who encounter the problem?
> > 
> > It's a workaround. Nothing less, nothing more.
> 
> Recently I have had a chance to revisit this issue.  I have written a
> program (r8196-test, source is included below) which recreates the problem
> without requiring our external hardware devices.  That is, this program
> triggers the fault when run between two networked computers.  To use, two
> PCs are needed.  One (the "master") has an rtl8169 network card fitted (ours
> has a Netgear GA311, but the problem has been seen with others too from
> memory).  The network hardware of the other computer (the "slave") isn't
> important.  First run
> 
>   ./r8196-test
> 
> on the slave, followed by 
> 
>   ./r8196-test 
> 
> on the master.  When running stock kernel version 4.3 the master stops
> reliably within a minute or so with a timeout, indicating (in this case)
> that the response packet never arrived within the 0.5 second timeout period. 
> The ID whose response was never received by the master is reported as having
> been seen (and a response sent) by the slave.
> 
> If I substitute the forward ported r8169 driver mentioned earlier in this
> thread into kernel 4.3, the above program sequence runs seemingly
> indefinitely without any timeouts (runtime is beyond two hours as of this
> writing, compared to tens of seconds with the standard driver).
> 
> This demonstrates that the problem is independent of our custom network
> devices and allows the fault to be recreated using commodity hardware.
> 
> Does this make it any easier to develop a mainline fix for the regression?
> 
> Regards
>   jonathan
> 
> /*
>  * To test, the "master" mode is run on a PC with an RTL-8169 card.
>  * The "slave" mode is run on any other PC.  "Master" mode is activated
>  * by providing the IP of the slave PC on the command line.  The slave
>  * should be started before the master; without a running slave the master
>  * will time out.
>  *
>  * This code is in the public domain.
>  */
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include 
> 
> unsigned char ping_payload[] = {
> 0x00, 0x00,
> 0x00, 0x00, 0x00, 0x00,
> };
> 
> #define PING_PAYLOAD_SIZE 6
> 
> unsigned char ack_payload[] = {
> 0x12, 0x34,
> 0x01, 0x01, 0x00, 0x00,
> 0x00, 0x00, 0x00, 0x00,
> 0x00, 0x00, 0x00, 0x00,
> };
> 
> #define ACK_PAYLOAD_SIZE 14
> 
> #define UDP_PORT 49491
> 
> signed int open_udp(const char *target_addr)
> {
> struct sockaddr_in local_addr;
> struct timeval tv;
> int sock;
> 
> sock = socket(PF_INET,SOCK_DGRAM, 0);
> if (sock < 0) {
> return -1;
> }
> 
> tv.tv_sec = 0;
> tv.tv_usec = 50;
> setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv, sizeof(tv));
> setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, (char *)&tv, sizeof(tv));
> 
> memset(&local_addr, 0, sizeof(local_addr));
> local_addr.sin_family = AF_INET;
> local_addr.sin_addr.s_addr = INADDR_ANY;
> local_addr.sin_port = htons(49491);
> if (bind(sock, (struct sockaddr *)&local_addr,
>  sizeof(struct sockaddr)) < 0) {
> return -1;
> }
> 
> if (target_addr != NULL) {
> struct sockaddr_in dest_addr;
> memset(&dest_addr, 0, sizeof(dest_addr));

Re: [PATCH v2 net-next] socket: Move the socket inuse to namespace.

2017-11-13 Thread Cong Wang
On Mon, Nov 13, 2017 at 7:23 PM, 张军伟(基础平台部)
 wrote:
>
>> On 14 Nov 2017, at 10:09, Cong Wang  wrote:
>>
>> On Mon, Nov 13, 2017 at 4:36 AM, Tonghao Zhang  
>> wrote:
>>> From: Tonghao Zhang 
>>>
>>> This patch adds a member to struct netns_core: a counter for
>>> sockets in use in the _net_ namespace. The patch increments/decrements
>>> the counter in sock_alloc and sock_release. In sock_alloc, the
>>> private data of the inode saves the specific _net_; when releasing
>>> it, we can access that _net_ and decrement the socket counter of
>>> that namespace.
>>>
>>> By the way, we don't use 'current->nsproxy->net_ns' in sock_release:
>>> when a task exits, 'do_exit' may set current->nsproxy to NULL before
>>> calling sock_release. Using the private data of the inode also saves
>>> a few bytes.
>>
>> Why do you need to hold netns refcnt for socket? sock already holds
>> it, so you can just access it by sock_net(sock->sk) ?
> I think your suggestion is to
> replace get_net(net) ===> with sock_net(sock->sk)

Not literally, but essentially yes.


>
> If so, I think it could not work, because sock->sk is assigned after
> sock_alloc, and what we change is done in sock_alloc.

You don't have to do it in sock_alloc().


> By the way, sock->sk may have been released (NULL) in sock_release.

My point is since sock is always paired with a sk and sk already
holds refcnt, you don't need to hold it again, it looks unnecessary.

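
To make the suggestion concrete, a hedged sketch of the direction Cong
describes, accounting at the struct sock level where sock_net(sk) is always
valid; the helper and counter field names are illustrative, not from the patch:

/* sock already pins its netns via sk, so no extra get_net() or inode
 * bookkeeping is needed just for the counter. */
static inline void sock_inuse_add(struct net *net, int val)
{
	atomic_add(val, &net->core.socket_inuse);	/* counter field assumed */
}

/* call sites (sketch):
 *   sk_alloc():   sock_inuse_add(net, 1);
 *   __sk_free():  sock_inuse_add(sock_net(sk), -1);
 */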

Re: [patch net] net: forbid netdev used by mirred tc act from being moved to another netns

2017-11-13 Thread Cong Wang
On Mon, Nov 13, 2017 at 9:17 PM, Jiri Pirko  wrote:
> Mon, Nov 13, 2017 at 08:53:57PM CET, xiyou.wangc...@gmail.com wrote:
>>On Mon, Nov 13, 2017 at 6:05 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> Currently, user may choose to move device that is used by mirred action
>>> to another network namespace. That is wrong as the action still remains
>>> in the original namespace and references non-existing ifindex.
>>
>>It is a pure display issue, the action itself should function well
>>because we only use ifindex to lookup netdevice once and
>>we save the netdevice pointer in action.
>>
>>If you really want to fix it, just tell iproute2 to display netnsid together
>>with ifindex.
>
> It is not only a display issue. I think it is wrong to let a netdevice

What's wrong with it? Is it mis-functioning?

> disappear from underneath the mirred action. You certainly cannot add an


It disappears only because we don't display it properly, nothing else.


> action mirred with device from another net namespace. So should we allow
> that?

On the other hand why linking a device to mirred action prevents it
from moving to another netns? Also, device can be moved back too.

I don't see anything wrong with it except displaying it.


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread Björn Töpel
2017-11-14 0:50 GMT+01:00 Alexei Starovoitov :
> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>
>> 2017-10-31 13:41 GMT+01:00 Björn Töpel :
>>>
>>> From: Björn Töpel 
>>>
>> [...]
>>>
>>>
>>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>>> Korea, and our paper with complete benchmarks will be released shortly
>>> on the NetDev 2.2 site.
>>>
>>
>> We're back in the saddle after an excellent netdevconf week. Kudos to
>> the organizers; We had a blast! Thanks for all the constructive
>> feedback.
>>
>> I'll summarize the major points, that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming, we had some suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there are strong opinions against it.
>>
>> * No explicit zerocopy enablement. Use the zerocopy path if
>>   supported, if not -- fallback to the skb path, for netdevs that
>>   don't support the required ndos. Further, we'll have the zerocopy
>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>   socket will consume the skb and we'll honor skb->queue_mapping,
>>   meaning that we only consume the packets for the enabled queue.
>>
>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>   in a separate patchset.
>
>
> all sounds good to me except above bit.
> I don't remember people suggesting to split it this way.
> What's the value of it without tx?
>

We definitely need Tx for our use-cases! I'll rephrase: the idea
was to make the initial patch set without Tx *driver*-specific
code, e.g. using ndo_xdp_xmit/flush at a later point.

So AF_ZEROCOPY, the socket parts, would have Tx support.

@John Did I recall that correctly?

>> * Minimize the size of the i40e zerocopy patches, by moving the driver
>>   specific code to separate patches.
>>
>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>>   XDP redirect map call with ingress flag.
>>
>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>>
>>
>> Björn
>>
>>
>>> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>>
>>> Please focus your review on:
>>>
>>> * The V4 user space interface
>>> * PACKET_ZEROCOPY and its semantics
>>> * Packet array interface
>>> * XDP semantics when excuting in zero-copy mode (user space passed
>>>   buffers)
>>> * XDP_PASS_TO_KERNEL semantics
>>>
>>> To do:
>>>
>>> * Investigate the user-space ring structure’s performance problems
>>> * Continue the XDP integration into packet arrays
>>> * Optimize performance
>>> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
>>> * Packet buffer is unnecessarily pinned for virtual devices
>>> * Support shared packet buffers
>>> * Unify V4 and SKB receive path in I40E driver
>>> * Support for packets spanning multiple frames
>>> * Disassociate the packet array implementation from the V4 queue
>>>   structure
>>>
>>> We would really like to thank the reviewers of the limited
>>> distribution RFC for all their comments that have helped improve the
>>> interfaces and the code significantly: Alexei Starovoitov, Alexander
>>> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
>>> at Intel that has been helping out reviewing code, writing tests, and
>>> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
>>> Zhang, your participation has really helped.
>>>
>>> Thanks: Björn and Magnus
>>>
>>> [1]
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.netdevconf.org_2.2_&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=qR6oNZj1CqLATni4ibTgAQ&m=lKyFxON3kKygiOgECLBfmqRwM7ZyXFSUvLED1vP-gos&s=44jzm1W8xkGyZSZVANRygzHz6y4XHbYrYBRM-K5RhTc&e=
>>>
>>>
>>> Björn Töpel (7):
>>>   packet: introduce AF_PACKET V4 userspace API
>>>   packet: implement PACKET_MEMREG setsockopt
>>>   packet: enable AF_PACKET V4 rings
>>>   packet: wire up zerocopy for AF_PACKET V4
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>>>   samples/tpacket4: added tpbench
>>>
>>> Magnus Karlsson (7):
>>>   packet: enable Rx for AF_PACKET V4
>>>   packet: enable Tx support for AF_PACKET V4
>>>   netdevice: add AF_PACKET V4 zerocopy ops
>>>   veth: added support for PACKET_ZEROCOPY
>>>   samples/tpacket4: added veth support
>>>   i40e: added XDP support for TP4 enabled queue pairs
>>>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>>>
>>>  drivers/net/ethernet/intel/i40e/i40e.h |3 +
>>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |9 +
>>>  drivers/net/ethernet/intel/i40e/i40e_main.c|  837 -
>>>  drive

Re: [patch net] net: forbid netdev used by mirred tc act from being moved to another netns

2017-11-13 Thread Jiri Pirko
Mon, Nov 13, 2017 at 08:53:57PM CET, xiyou.wangc...@gmail.com wrote:
>On Mon, Nov 13, 2017 at 6:05 AM, Jiri Pirko  wrote:
>> From: Jiri Pirko 
>>
>> Currently, user may choose to move device that is used by mirred action
>> to another network namespace. That is wrong as the action still remains
>> in the original namespace and references non-existing ifindex.
>
>It is a pure display issue, the action itself should function well
>because we only use ifindex to lookup netdevice once and
>we save the netdevice pointer in action.
>
>If you really want to fix it, just tell iproute2 to display netnsid together
>with ifindex.

It is not only a display issue. I think it is wrong to let a netdevice
disappear from underneath the mirred action. You certainly cannot add an
action mirred with device from another net namespace. So should we allow
that?


Re: [net-next:master 488/665] verifier.c:undefined reference to `__multi3'

2017-11-13 Thread Jim Wilson

On 11/11/2017 05:33 PM, Fengguang Wu wrote:

CC gcc list. According to Alexei:

  This is a known issue with gcc 7 on mips that is "optimizing"
  normal 64-bit multiply into 128-bit variant.
  Nothing to fix on the kernel side.


I filed a bug report.  This is now
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82981

I found a helpful thread at
https://www.linux-mips.org/archives/linux-mips/2017-08/msg00041.html
that had enough info for me to reproduce and file the bug report.

Jim



Re: Per-CPU Queueing for QoS

2017-11-13 Thread Tom Herbert
On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
 wrote:
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck :
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  
>>> wrote:
 On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
>> :
>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>> Michael Ma  wrote:
>>>
 Any comments? We plan to implement this as a qdisc and appreciate any 
 early feedback.

 Thanks,
 Michael

> On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>
> Currently txq/qdisc selection is based on flow hash so packets from
> the same flow will follow the order when they enter qdisc/txq, which
> avoids out-of-order problem.
>
> To improve the concurrency of QoS algorithm we plan to have multiple
> per-cpu queues for a single TC class and do busy polling from a
> per-class thread to drain these queues. If we can do this frequently
> enough the out-of-order situation in this polling thread should not be
> that bad.
>
> To give more details - in the send path we introduce per-cpu per-class
> queues so that packets from the same class and same core will be
> enqueued to the same place. Then a per-class thread poll the queues
> belonging to its class from all the cpus and aggregate them into
> another per-class queue. This can effectively reduce contention but
> inevitably introduces potential out-of-order issue.
>
> Any concern/suggestion for working towards this direction?
>>>
>>> In general, there is no meta design discussions in Linux development
>>> Several developers have tried to do lockless
>>> qdisc and similar things in the past.
>>>
>>> The devil is in the details, show us the code.
>>
>> Thanks for the response, Stephen. The code is fairly straightforward,
>> we have a per-cpu per-class queue defined as this:
>>
>> struct bandwidth_group
>> {
>> struct skb_list queues[MAX_CPU_COUNT];
>> struct skb_list drain;
>> }
>>
>> "drain" queue is used to aggregate per-cpu queues belonging to the
>> same class. In the enqueue function, we determine the cpu where the
>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>
>> int cpu;
>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>
>> cpu = get_cpu();
>> skb_list_append(&bwg->queues[cpu], skb);
>>
>> Here we don't check the flow of the packet so if there is task
>> migration or multiple threads sending packets through the same flow we
>> theoretically can have packets enqueued to different queues and
>> aggregated to the "drain" queue out of order.
>>
>> Also AFAIK there is no lockless htb-like qdisc implementation
>> currently, however if there is already similar effort ongoing please
>> let me know.
>
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>
> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.

 I think one of the key points was: "do busy polling from a per-class
 thread to drain these queues."

 I mentioned this idea in TX path of :

 https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>
>>> I think this is a bit different from that idea in that the busy
>>> polling is transferring packets from a per-cpu qdisc to a per traffic
>>> class queueing discipline. Basically it would be a busy_poll version
>>> of qdisc_run that would be transferring packets from one qdisc onto
>>> another instead of attempting to transmit them directly.
>>
>> The idea is to have the whole part implemented as one classful qdisc
>> and remove the qdisc root lock since all the synchronization will be
>> handled internally (let's put aside that other filters/actions/qdiscs
>> might still require a root lock for now).
>>>
>>> What I think is tripping me up is that I don't think this is even
>>> meant to work with a multiqueue device. The description seems more
>>> like a mqprio implementation feeding into a prio qdisc which is then
>>> used for dequeue. To me it seems like this solution would be pulling
>>> complexity off of the enqueue side and just adding it to the dequeue
>>> side. The use of the "busy poll" is just to try to s

Fwd: FW: [PATCH 18/31] nds32: Library functions

2017-11-13 Thread Vincent Chen
>>On Wed, Nov 08, 2017 at 01:55:06PM +0800, Greentime Hu wrote:
>
>> +#define __range_ok(addr, size) (size <= get_fs() && addr <= (get_fs()
>> +-size))
>> +
>> +#define access_ok(type, addr, size) \
>> + __range_ok((unsigned long)addr, (unsigned long)size)
>
>> +#define __get_user_x(__r2,__p,__e,__s,__i...)   
>>  \
>> +__asm__ __volatile__ (   \
>> + __asmeq("%0", "$r0") __asmeq("%1", "$r2")   \
>> + "bal__get_user_" #__s   \
>
>... which does not check access_ok() or do any visible equivalents; OK...
>
>> +#define get_user(x,p)   
>>  \
>> + ({  \
>> + const register typeof(*(p)) __user *__p asm("$r0") = (p);\
>> + register unsigned long __r2 asm("$r2"); \
>> + register int __e asm("$r0");\
>> + switch (sizeof(*(__p))) {   \
>> + case 1: \
>> + __get_user_x(__r2, __p, __e, 1, "$lp"); \
>
>... and neither does this, which is almost certainly *not* OK.
>
>> +#define put_user(x,p)   
>>  \
>
>Same here, AFAICS.

Thanks.
I will add access_ok() in get_user/put_user
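
A hedged sketch of the usual shape for folding the range check in (generic,
not nds32-specific):

/* Sketch: fail with -EFAULT before touching user memory when the range
 * check does not pass; __get_user() remains the unchecked fast path. */
#define get_user(x, p)							\
({									\
	const typeof(*(p)) __user *__gu_p = (p);			\
	access_ok(VERIFY_READ, __gu_p, sizeof(*__gu_p)) ?		\
		__get_user((x), __gu_p) : ((x) = 0, -EFAULT);		\
})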

>> +extern unsigned long __arch_copy_from_user(void *to, const void __user *from,
>> +unsigned long n);
>
>> +static inline unsigned long raw_copy_from_user(void *to,
>> +const void __user * from,
>> +unsigned long n)
>> +{
>> + return __arch_copy_from_user(to, from, n); }
>
>Er...  Why not call your __arch_... raw_... and be done with that?

Thanks.
I will modify it in the next patch version.

>> +#define INLINE_COPY_FROM_USER
>> +#define INLINE_COPY_TO_USER
>
>Are those actually worth bothering?  IOW, have you compared behaviour with and 
>without them?

We compared the assembly code of copy_from/to_user's callers,
and we think the performance is better when copy_from/to_user is
inlined.


>> +ENTRY(__arch_copy_to_user)
>> + push$r0
>> + push$r2
>> + beqz$r2, ctu_exit
>> + srli$p0, $r2, #2! $p0 = number of word to clear
>> + andi$r2, $r2, #3! Bytes less than a word to copy
>> + beqz$p0, byte_ctu   ! Only less than a word to copy
>> +word_ctu:
>> + lmw.bim $p1, [$r1], $p1 ! Load the next word
>> +USER(smw.bim,$p1, [$r0], $p1)! Store the next word
>
>Umm...  It's that happy with unaligned loads and stores?  Your memcpy seems to 
>be trying to avoid those...

Thanks.
These should be aligned loads and stores, too.
I will modify it in the next patch version.

>> +9001:
>> + pop $p1 ! Original $r2, n
>> + pop $p0 ! Original $r0, void *to
>> + sub $r1, $r0, $p0   ! Bytes copied
>> + sub $r2, $p1, $r1   ! Bytes left to copy
>> + push$lp
>> + move$r0, $p0
>> + bal memzero ! Clean up the memory
>
>Just what memory are you zeroing here?  The one you had been unable to store 
>into in the first place?
>
>> +ENTRY(__arch_copy_from_user)
>
>> +9001:
>> + pop $p1 ! Original $r2, n
>> + pop $p0 ! Original $r0, void *to
>> + sub $r1, $r1, $p0   ! Bytes copied
>> + sub $r2, $p1, $r1   ! Bytes left to copy
>> + push$lp
>> + bal memzero ! Clean up the memory
>
>Ditto, only this one is even worse - instead of just oopsing on you, it will 
>quietly destroy data past the area you've copied into.  raw_copy_..._user() 
>MUST NOT ZERO ANYTHING.  Ever.


Thanks.
So, I should keep the area that we've copied into as-is instead of
zeroing it, even if an unexpected exception happens. Right?
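
For reference, the zero-fill for a faulted copy lives one layer up, in the
generic wrapper; roughly (simplified from the common lib/usercopy.c logic):

unsigned long _copy_from_user(void *to, const void __user *from, unsigned long n)
{
	unsigned long res = n;

	if (likely(access_ok(VERIFY_READ, from, n)))
		res = raw_copy_from_user(to, from, n);	/* returns bytes NOT copied */
	if (unlikely(res))
		memset(to + (n - res), 0, res);		/* zeroing happens here */
	return n - res;
}

So the arch helper only reports how many bytes were left uncopied and leaves
the destination tail untouched.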


Best regards
Vincent


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Dave Taht
I have been thinking we'd try to submit sch_cake to mainline on this
go-around. It's been out of tree for way too long.

I look forward to understanding your patches soon in the tbf case.

(I'm only responding because cake uses a deficit scheduler rather than
a token bucket, is not reliant on the tc filter infrastructure for its
queuing, and I'd love to have it handle multiple cpus better.)

repo: https://github.com/dtaht/sch_cake.git

doc: https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/


Re: Per-CPU Queueing for QoS

2017-11-13 Thread John Fastabend
On 11/13/2017 06:18 PM, Michael Ma wrote:
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck :
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
 On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>> On Sun, 12 Nov 2017 13:43:13 -0800
>> Michael Ma  wrote:
>>
>>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>> early feedback.
>>>
>>> Thanks,
>>> Michael
>>>
 On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:

 Currently txq/qdisc selection is based on flow hash so packets from
 the same flow will follow the order when they enter qdisc/txq, which
 avoids out-of-order problem.

 To improve the concurrency of QoS algorithm we plan to have multiple
 per-cpu queues for a single TC class and do busy polling from a
 per-class thread to drain these queues. If we can do this frequently
 enough the out-of-order situation in this polling thread should not be
 that bad.

 To give more details - in the send path we introduce per-cpu per-class
 queues so that packets from the same class and same core will be
 enqueued to the same place. Then a per-class thread poll the queues
 belonging to its class from all the cpus and aggregate them into
 another per-class queue. This can effectively reduce contention but
 inevitably introduces potential out-of-order issue.

 Any concern/suggestion for working towards this direction?
>>
>> In general, there is no meta design discussions in Linux development
>> Several developers have tried to do lockless
>> qdisc and similar things in the past.
>>
>> The devil is in the details, show us the code.
>
> Thanks for the response, Stephen. The code is fairly straightforward,
> we have a per-cpu per-class queue defined as this:
>
> struct bandwidth_group
> {
> struct skb_list queues[MAX_CPU_COUNT];
> struct skb_list drain;
> }
>
> "drain" queue is used to aggregate per-cpu queues belonging to the
> same class. In the enqueue function, we determine the cpu where the
> packet is processed and enqueue it to the corresponding per-cpu queue:
>
> int cpu;
> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>
> cpu = get_cpu();
> skb_list_append(&bwg->queues[cpu], skb);
>
> Here we don't check the flow of the packet so if there is task
> migration or multiple threads sending packets through the same flow we
> theoretically can have packets enqueued to different queues and
> aggregated to the "drain" queue out of order.
>
> Also AFAIK there is no lockless htb-like qdisc implementation
> currently, however if there is already similar effort ongoing please
> let me know.

 The question I would have is how would this differ from using XPS w/
 mqprio? Would this be a classful qdisc like HTB or a classless one
 like mqprio?

 From what I can tell XPS would be able to get you your per-cpu
 functionality, the benefit of it though would be that it would avoid
 out-of-order issues for sockets originating on the local system. The
 only thing I see as an issue right now is that the rate limiting with
 mqprio is assumed to be handled via hardware due to mechanisms such as
 DCB.
>>>
>>> I think one of the key points was: "do busy polling from a per-class
>>> thread to drain these queues."
>>>
>>> I mentioned this idea in TX path of :
>>>
>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>> I think this is a bit different from that idea in that the busy
>> polling is transferring packets from a per-cpu qdisc to a per traffic
>> class queueing discipline. Basically it would be a busy_poll version
>> of qdisc_run that would be transferring packets from one qdisc onto
>> another instead of attempting to transmit them directly.
> 
> The idea is to have the whole part implemented as one classful qdisc
> and remove the qdisc root lock since all the synchronization will be
> handled internally (let's put aside that other filters/actions/qdiscs
> might still require a root lock for now).
>>
>> What I think is tripping me up is that I don't think this is even
>> meant to work with a multiqueue device. The description seems more
>> like a mqprio implementation feeding into a prio qdisc which is then
>> used for dequeue. To me it seems like this solution would be pulling
>> complexity off of the enqueue side and just adding it to the dequeue
>> side. The use of the "busy poll" is just to try to simplify things as
>> the agent would then be consolidating traffic from multiple per-cpu
>> queues onto one drain queue.
> 
> We're essentially trying to spread the complexit
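
To make the discussion concrete, a hedged sketch of the per-class drain step,
reusing the names from the quoted pseudocode; skb_list_dequeue() is assumed,
and the locking and busy-poll/kthread plumbing are omitted:

static void bwg_drain_once(struct bandwidth_group *bwg)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct sk_buff *skb;

		/* Pull everything queued on this cpu into the per-class
		 * drain queue; ordering is only preserved within a cpu,
		 * which is exactly the out-of-order concern raised above. */
		while ((skb = skb_list_dequeue(&bwg->queues[cpu])) != NULL)
			skb_list_append(&bwg->drain, skb);
	}
}

Note that the quoted enqueue snippet calls get_cpu(), which disables
preemption and needs a matching put_cpu() after the append.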

Re: [PATCH v2 net-next] socket: Move the socket inuse to namespace.

2017-11-13 Thread 基础平台部

> On 14 Nov 2017, at 10:09, Cong Wang  wrote:
> 
> On Mon, Nov 13, 2017 at 4:36 AM, Tonghao Zhang  
> wrote:
>> From: Tonghao Zhang 
>> 
>> This patch adds a member to struct netns_core: a counter for
>> sockets in use in the _net_ namespace. The patch increments/decrements
>> the counter in sock_alloc and sock_release. In sock_alloc, the
>> private data of the inode saves the specific _net_; when releasing
>> it, we can access that _net_ and decrement the socket counter of
>> that namespace.
>> 
>> By the way, we don't use 'current->nsproxy->net_ns' in sock_release:
>> when a task exits, 'do_exit' may set current->nsproxy to NULL before
>> calling sock_release. Using the private data of the inode also saves
>> a few bytes.
> 
> Why do you need to hold netns refcnt for socket? sock already holds
> it, so you can just access it by sock_net(sock->sk) ?
I think your suggestion is to
replace get_net(net) ===> with sock_net(sock->sk) 
 
If so, I think it could not work, because sock->sk is assigned after
sock_alloc, and what we change is done in sock_alloc.
By the way, sock->sk may have been released (NULL) in sock_release.

[PATCH v2] net: use ARRAY_SIZE

2017-11-13 Thread Jérémy Lefaure
Using the ARRAY_SIZE macro improves the readability of the code. Also,
it is not always necessary to use a variable to store this constant,
which is calculated at compile time.

Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
 (sizeof(E)@p /sizeof(*E))
|
 (sizeof(E)@p /sizeof(E[...]))
|
 (sizeof(E)@p /sizeof(T))
)

Signed-off-by: Jérémy Lefaure 
---
Changes in v2:
This patch was part of a bigger patch (wrongly sent in a series, sorry).
The previous patch has been split so that all changes in drivers/net/wireless
are in a separate patch sent to Kalle Valo.

 drivers/net/ethernet/emulex/benet/be_cmds.c |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_adminq.h   |  3 ++-
 drivers/net/ethernet/intel/i40evf/i40e_adminq.h |  3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c   |  3 ++-
 drivers/net/ethernet/intel/ixgbevf/vf.c | 17 -
 drivers/net/usb/kalmia.c|  9 +
 include/net/bond_3ad.h  |  3 ++-
 net/ipv6/seg6_local.c   |  6 +++---
 8 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c 
b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 02dd5246dfae..ec39363afd5f 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -15,6 +15,7 @@
  * Costa Mesa, CA 92626
  */
 
+#include 
 #include 
 #include "be.h"
 #include "be_cmds.h"
@@ -103,10 +104,9 @@ static struct be_cmd_priv_map cmd_priv_map[] = {
 static bool be_cmd_allowed(struct be_adapter *adapter, u8 opcode, u8 subsystem)
 {
int i;
-   int num_entries = sizeof(cmd_priv_map)/sizeof(struct be_cmd_priv_map);
u32 cmd_privileges = adapter->cmd_privileges;
 
-   for (i = 0; i < num_entries; i++)
+   for (i = 0; i < ARRAY_SIZE(cmd_priv_map); i++)
if (opcode == cmd_priv_map[i].opcode &&
subsystem == cmd_priv_map[i].subsystem)
if (!(cmd_privileges & cmd_priv_map[i].priv_mask))
diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
index 2349fbe04bd2..892083b65b91 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
@@ -27,6 +27,7 @@
 #ifndef _I40E_ADMINQ_H_
 #define _I40E_ADMINQ_H_
 
+#include 
 #include "i40e_osdep.h"
 #include "i40e_status.h"
 #include "i40e_adminq_cmd.h"
@@ -143,7 +144,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
return -EAGAIN;
 
-   if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
+   if (!((u32)aq_rc < ARRAY_SIZE(aq_to_posix)))
return -ERANGE;
 
return aq_to_posix[aq_rc];
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq.h 
b/drivers/net/ethernet/intel/i40evf/i40e_adminq.h
index e0bfaa3d4a21..5622a24cc74d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq.h
@@ -27,6 +27,7 @@
 #ifndef _I40E_ADMINQ_H_
 #define _I40E_ADMINQ_H_
 
+#include 
 #include "i40e_osdep.h"
 #include "i40e_status.h"
 #include "i40e_adminq_cmd.h"
@@ -143,7 +144,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
return -EAGAIN;
 
-   if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
+   if (!((u32)aq_rc < ARRAY_SIZE(aq_to_posix)))
return -ERANGE;
 
return aq_to_posix[aq_rc];
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index cb7da5f9c4da..0b804790256e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -21,6 +21,7 @@
  *  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
  *
  
**/
+#include 
 #include "ixgbe_x540.h"
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -949,7 +950,7 @@ static s32 ixgbe_checksum_ptr_x550(struct ixgbe_hw *hw, u16 
ptr,
u16 length, bufsz, i, start;
u16 *local_buffer;
 
-   bufsz = sizeof(buf) / sizeof(buf[0]);
+   bufsz = ARRAY_SIZE(buf);
 
/* Read a chunk at the pointer location */
if (!buffer) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c 
b/drivers/net/ethernet/intel/ixgbevf/vf.c
index 0c25006ce9af..96bfef92fb4c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -24,6 +24,7 @@
 
 
***/
 
+#include 
 #include "vf.h"
 #include "ixgbevf.h"
 
@@ -285,7 +286,7 @@ static s32 ixgbevf_set_uc_addr_vf(struct ixgbe_hw *hw, u32 
index, u8 *addr)
 

[PATCH] uapi: fix linux/tls.h userspace compilation error

2017-11-13 Thread Dmitry V. Levin
Move inclusion of the private kernel header <net/tcp.h>
from uapi/linux/tls.h to its only user - net/tls.h,
to fix the following linux/tls.h userspace compilation error:

/usr/include/linux/tls.h:41:21: fatal error: net/tcp.h: No such file or 
directory

Since up to this point uapi/linux/tls.h was totally unusable for userspace,
clean up this header file further by moving other redundant includes
to net/tls.h.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Cc:  # v4.13+
Signed-off-by: Dmitry V. Levin 
---
 include/net/tls.h| 4 
 include/uapi/linux/tls.h | 4 
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index b89d397dd62f..c06db1eadac2 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -35,6 +35,10 @@
 #define _TLS_OFFLOAD_H
 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 
diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
index d5e0682ab837..293b2cdad88d 100644
--- a/include/uapi/linux/tls.h
+++ b/include/uapi/linux/tls.h
@@ -35,10 +35,6 @@
 #define _UAPI_LINUX_TLS_H
 
 #include 
-#include 
-#include 
-#include 
-#include 
 
 /* TLS socket options */
 #define TLS_TX 1   /* Set transmit parameters */

-- 
ldv


Re: KASAN: use-after-free Read in rds_tcp_dev_event

2017-11-13 Thread Girish Moodalbail

On 11/7/17 12:28 PM, syzbot wrote:

Hello,

syzkaller hit the following crash on 287683d027a3ff83feb6c7044430c79881664ecf
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.




==
BUG: KASAN: use-after-free in rds_tcp_kill_sock net/rds/tcp.c:530 [inline]
BUG: KASAN: use-after-free in rds_tcp_dev_event+0xc01/0xc90 net/rds/tcp.c:568
Read of size 8 at addr 8801cd879200 by task kworker/u4:3/147

CPU: 0 PID: 147 Comm: kworker/u4:3 Not tainted 4.14.0-rc7+ #156
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011

Workqueue: netns cleanup_net
Call Trace:
  __dump_stack lib/dump_stack.c:16 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:52
  print_address_description+0x73/0x250 mm/kasan/report.c:252
  kasan_report_error mm/kasan/report.c:351 [inline]
  kasan_report+0x25b/0x340 mm/kasan/report.c:409
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
  rds_tcp_kill_sock net/rds/tcp.c:530 [inline]
  rds_tcp_dev_event+0xc01/0xc90 net/rds/tcp.c:568


The issue here is that we are trying to reference a network namespace (struct 
net *) that is long gone (i.e., L532 below -- c_net is the culprit).


528 spin_lock_irq(&rds_tcp_conn_lock);
529 list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list,
 t_tcp_node) {
530 struct net *c_net = tc->t_cpath->cp_conn->c_net;
531
532 if (net != c_net || !tc->t_sock)
533 continue;
534 if (!list_has_conn(&tmp_list, tc->t_cpath->cp_conn))
535 list_move_tail(&tc->t_tcp_node, &tmp_list);
536 }
537 spin_unlock_irq(&rds_tcp_conn_lock);
538 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) {
539 rds_tcp_conn_paths_destroy(tc->t_cpath->cp_conn);
540 rds_conn_destroy(tc->t_cpath->cp_conn);
541 }

When a network namespace is deleted, devices within that namespace are 
unregistered and removed one by one. RDS is notified about this event through 
rds_tcp_dev_event() callback. When the loopback device is removed from the 
namespace, the above RDS callback function destroys all the RDS connections in 
that namespace.


The loop@L529 above walks through each of the rds_tcp connection in the global 
list (rds_tcp_conn_list) to see if that connection belongs to the namespace in 
question. It collects all such connections and destroys them (L538-540). 
However, it leaves behind some of the rds_tcp connections that shared the same 
underlying RDS connection (L534 and 535). These connections with pointer to 
stale network namespace are left behind in the global list. When the 2nd network 
namespace is deleted, we will hit the above stale pointer and hit UAF panic.


I think we should move away from global list to a per-namespace list. The global 
list are used only in two places (both of which are per-namespace operations):


 - to destroy all the RDS connections belonging to a namespace when the
   network namespace is being deleted.
 - to reset all the RDS connections  when socket parameters for a namespace are
   modified using sysctl

Thanks,
~Girish
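
To make the per-namespace idea concrete, a rough sketch based on the
existing per-net state in rds/tcp.c (the new field and helper names are
illustrative only, not an actual fix):

struct rds_tcp_net {
        /* ... existing per-netns fields (listen socket, sysctls, ...) ... */
        struct list_head conn_list;     /* rds_tcp connections in this netns */
        spinlock_t conn_lock;
};

static void rds_tcp_conn_track(struct net *net, struct rds_tcp_connection *tc)
{
        struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);

        /* track the connection on its own namespace's list instead of the
         * global rds_tcp_conn_list */
        spin_lock_irq(&rtn->conn_lock);
        list_add_tail(&tc->t_tcp_node, &rtn->conn_list);
        spin_unlock_irq(&rtn->conn_lock);
}

rds_tcp_kill_sock() and the sysctl reset path would then walk only
rtn->conn_list for their own namespace, so no entry left behind can point
at a namespace that has already gone away.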


Re: [PATCH v2 net-next] socket: Move the socket inuse to namespace.

2017-11-13 Thread Tonghao Zhang
Yes, thanks for your advice. I will modify it, test it and then submit v3

On Tue, Nov 14, 2017 at 10:09 AM, Cong Wang  wrote:
> On Mon, Nov 13, 2017 at 4:36 AM, Tonghao Zhang  
> wrote:
>> From: Tonghao Zhang 
>>
>> This patch add a member in struct netns_core. and this is
>> a counter for socket_inuse in the _net_ namespace. The patch
>> will add/sub counter in the sock_alloc or sock_release. In
>> the sock_alloc, the private data of inode saves the special
>> _net_. When releasing it, we can access the special _net_ and
>> dec the counter of socket in that namespace.
>>
>> By the way, we dont use the 'current->nsproxy->net_ns' in the
>> sock_release. In one case,when one task exits, the 'do_exit'
>> may set the current->nsproxy NULL, and then call the sock_release.
>> Use the private data of inode, saving few bytes.
>
> Why do you need to hold netns refcnt for socket? sock already holds
> it, so you can just access it by sock_net(sock->sk) ?


Re: [PATCH net] ipv6: set all.accept_dad to 0 by default

2017-11-13 Thread Stefano Brivio
On Mon, 13 Nov 2017 14:21:52 +
Erik Kline  wrote:

> Alternatively, if we really want to make all, default, and ifname
> useful perhaps we need to investigate a tristate option (for currently
> boolean values, at least).  -1 could mean no preference, for example.

I think this would make sense in general, but on the other hand it
would be quite a big change, and Nicolas' patch has the advantages of
being small, keeping the global flag functional, and restoring the
previous default behaviour out of the box when 'accept_dad' is disabled
by the user for a given interface.

Besides, this still wouldn't solve the case where flags are >= 0 and
conflicting -- there, it's still debatable whether we want a logical
OR or a logical AND.

In the end, I would prefer either Nicolas' patch, or to get rid of the
global 'accept_dad' flag altogether. Having a flag which does
absolutely nothing, which was the case before 35e015e1f577, doesn't
sound correct by any means.

-- 
Stefano
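
For reference, the post-35e015e1f577 semantics being debated above amount to
a logical OR of the two flags; roughly (the devconf fields are the real ones,
the helper name is illustrative):

/* DAD runs on an interface if either the global or the per-interface
 * accept_dad flag is non-zero -- which is why setting only the
 * per-interface flag to 0 no longer disables it. */
static bool dad_enabled(const struct inet6_dev *idev)
{
        return dev_net(idev->dev)->ipv6.devconf_all->accept_dad ||
               idev->cnf.accept_dad;
}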


Re: [PATCH net] geneve: show remote address and checksum info even after link down

2017-11-13 Thread Hangbin Liu
Hi Eric,

Thanks for the comments.

On Mon, Nov 13, 2017 at 03:27:25PM -0500, Eric Garver wrote:
> > Fixes: 11387fe4a98 ("geneve: fix fill_info when using collect_metadata")
> > Signed-off-by: Hangbin Liu 
> > ---
> >  drivers/net/geneve.c | 28 +++-
> >  1 file changed, 11 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> > index ed51018..b010db7 100644
> > --- a/drivers/net/geneve.c
> > +++ b/drivers/net/geneve.c
> > @@ -1511,32 +1511,18 @@ static int geneve_fill_info(struct sk_buff *skb, 
> > const struct net_device *dev)
> > if (nla_put_u32(skb, IFLA_GENEVE_ID, vni))
> > goto nla_put_failure;
> >  
> > -   if (rtnl_dereference(geneve->sock4)) {
> > +   if (ip_tunnel_info_af(info) == AF_INET) {
> > if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
> > info->key.u.ipv4.dst))
> > goto nla_put_failure;
> >  
> 
> We can avoid passing up *_REMOTE{,6} for COLLECT_METADATA. They're
> mutually exclusive.

Do you mean something like

+   bool metadata = geneve->collect_md;
-   if (rtnl_dereference(geneve->sock4)) {
+   if (!metadata && ip_tunnel_info_af(info) == AF_INET) {

> 
> > -   if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
> > -  !!(info->key.tun_flags & TUNNEL_CSUM)))
> > -   goto nla_put_failure;
> > -
> > -   }
> > -
> >  #if IS_ENABLED(CONFIG_IPV6)
> > -   if (rtnl_dereference(geneve->sock6)) {
> > +   } else {

and

-   if (rtnl_dereference(geneve->sock6)) {
+   } else if (!metadata) {

?
> > +
> > +   if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
> > +  !!(info->key.tun_flags & TUNNEL_CSUM)) ||
> > +   nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
> > +  !(info->key.tun_flags & TUNNEL_CSUM)) ||
> 
> These two are nonsensical for COLLECT_METADATA as they come from the skb
> tun_info. They can be moved to inside the ipv4/ipv6 checks. For
> non-metadata, it doesn't make sense to pass them both as the tunnel is
> either ipv4 or ipv6, but not both.

OK, I will put them back to the ipv4/6 chunks.

Thanks
Hangbin
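
Putting both suggestions together, the relevant part of geneve_fill_info()
might end up looking roughly like this (a sketch of the direction discussed
above, not the final patch):

        /* geneve = netdev_priv(dev), as in the existing function */
        struct ip_tunnel_info *info = &geneve->info;
        bool metadata = geneve->collect_md;

        /* remote address and checksum attributes only make sense for
         * configured (non-metadata) tunnels, and only for the configured
         * address family */
        if (!metadata && ip_tunnel_info_af(info) == AF_INET) {
                if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
                                    info->key.u.ipv4.dst) ||
                    nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
                               !!(info->key.tun_flags & TUNNEL_CSUM)))
                        goto nla_put_failure;
#if IS_ENABLED(CONFIG_IPV6)
        } else if (!metadata) {
                if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6,
                                     &info->key.u.ipv6.dst) ||
                    nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
                               !(info->key.tun_flags & TUNNEL_CSUM)))
                        goto nla_put_failure;
#endif
        }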


Re: [PATCH net] ipv6: set all.accept_dad to 0 by default

2017-11-13 Thread Stefano Brivio
On Mon, 13 Nov 2017 14:45:36 +0100
Nicolas Dichtel  wrote:

> The commit a2d3f3e33853 modifies the way to disable dad on an interface.
> Before the patch, setting .accept_dad to 0 was enough to disable it.
> Because all.accept_dad is set to 1 by default, after the patch, the user
> needs to set both all.accept_dad and .accept_dad to 0 to disable it.

Perhaps it would make sense to be a bit more descriptive here, this
seems to have generated quite some confusion.

Besides, a2d3f3e33853 was just a fix-up for 35e015e1f577, which is
instead the change which actually changed the behaviour.

What about:

---
With commit 35e015e1f577 ("ipv6: fix net.ipv6.conf.all interface DAD
handlers"), the global 'accept_dad' flag is also taken into account
and set to 1 by default. If either global or per-interface flag is
non-zero, DAD will be enabled on a given interface.

This is not backward compatible: before 35e015e1f577, the user could
disable DAD just by setting the per-interface flag to 0. Now, the
user instead needs to set both flags to 0 to actually disable DAD.

Restore the previous behaviour by setting the default for the global
'accept_dad' flag to 0. This way, DAD is still enabled by default,
as per-interface flags are set to 1 on device creation, but setting
them to 0 is enough to disable DAD on a given interface.

- Before 35e015e1f577:
              global   per-interface   DAD enabled
  [default]      1           1             yes
                 X           0             no
                 X           1             yes

- After 35e015e1f577:
              global   per-interface   DAD enabled
                 0           0             no
                 0           1             yes
                 1           0             yes
  [default]      1           1             yes

- After this fix:
              global   per-interface   DAD enabled
                 0           0             no
  [default]      0           1             yes
                 1           0             yes
                 1           1             yes
---

> Fixes: a2d3f3e33853 ("ipv6: fix net.ipv6.conf.all.accept_dad behaviour for 
> real")

And I'd rather say:

Fixes: 35e015e1f577 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")

> CC: Stefano Brivio 
> CC: Matteo Croce 
> CC: Erik Kline 
> Signed-off-by: Nicolas Dichtel 

With a more descriptive commit message and the appropriate Fixes:
reference, FWIW,

Acked-by: Stefano Brivio 

-- 
Stefano


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 16:13 GMT-08:00 Alexander Duyck :
> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  wrote:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma  wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>> >>> early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>>> >>> > avoids out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>>> >>> > belonging to its class from all the cpus and aggregate them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there is no meta design discussions in Linux development
>>> >> Several developers have tried to do lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward,
>>> > we have a per-cpu per-class queue defined as this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> > struct skb_list queues[MAX_CPU_COUNT];
>>> > struct skb_list drain;
>>> > }
>>> >
>>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet so if there is task
>>> > migration or multiple threads sending packets through the same flow we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also AFAIK there is no lockless htb-like qdisc implementation
>>> > currently, however if there is already similar effort ongoing please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key point was in : " do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in TX path of :
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
> I think this is a bit different from that idea in that the busy
> polling is transferring packets from a per-cpu qdisc to a per traffic
> class queueing discipline. Basically it would be a busy_poll version
> of qdisc_run that would be transferring packets from one qdisc onto
> another instead of attempting to transmit them directly.

The idea is to have the whole part implemented as one classful qdisc
and remove the qdisc root lock since all the synchronization will be
handled internally (let's put aside that other filters/actions/qdiscs
might still require a root lock for now).
>
> What I think is tripping me up is that I don't think this is even
> meant to work with a multiqueue device. The description seems more
> like a mqprio implementation feeding into a prio qdisc which is then
> used for dequeue. To me it seems like this solution would be pulling
> complexity off of the enqueue side and just adding it to the dequeue
> side. The use of the "busy poll" is just to try to simplify things as
> the agent would then be consolidating traffic from multiple per-cpu
> queues onto one drain queue.

We're essentially trying to spread the complexity from enqueue to
different stages such as enqueue/aggregation and rate
limiting.

Re: [PATCH net-next] net: dsa: Fix dependencies on bridge

2017-11-13 Thread David Miller
From: Andrew Lunn 
Date: Sat, 11 Nov 2017 16:29:41 +0100

> DSA now uses one of the symbols exported by the bridge,
> br_vlan_enabled(). This has a stub, if the bridge is not
> enabled. However, if the bridge is enabled, we cannot have DSA built
> in and the bridge as a module, otherwise we get undefined symbols at
> link time:
> 
>net/dsa/port.o: In function `dsa_port_vlan_add':
>net/dsa/port.c:255: undefined reference to `br_vlan_enabled'
>net/dsa/port.o: In function `dsa_port_vlan_del':
>net/dsa/port.c:270: undefined reference to `br_vlan_enabled'
> 
> Reported-by: kbuild test robot 
> Signed-off-by: Andrew Lunn 

Applied, thanks Andrew.


Re: [PATCH v2 net-next] socket: Move the socket inuse to namespace.

2017-11-13 Thread Cong Wang
On Mon, Nov 13, 2017 at 4:36 AM, Tonghao Zhang  wrote:
> From: Tonghao Zhang 
>
> This patch add a member in struct netns_core. and this is
> a counter for socket_inuse in the _net_ namespace. The patch
> will add/sub counter in the sock_alloc or sock_release. In
> the sock_alloc, the private data of inode saves the special
> _net_. When releasing it, we can access the special _net_ and
> dec the counter of socket in that namespace.
>
> By the way, we dont use the 'current->nsproxy->net_ns' in the
> sock_release. In one case,when one task exits, the 'do_exit'
> may set the current->nsproxy NULL, and then call the sock_release.
> Use the private data of inode, saving few bytes.

Why do you need to hold netns refcnt for socket? sock already holds
it, so you can just access it by sock_net(sock->sk) ?


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 18:05 GMT-08:00 Michael Ma :
> 2017-11-13 15:08 GMT-08:00 Eric Dumazet :
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma  wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>> >>> early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>>> >>> > avoids out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>>> >>> > belonging to its class from all the cpus and aggregate them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there is no meta design discussions in Linux development
>>> >> Several developers have tried to do lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward,
>>> > we have a per-cpu per-class queue defined as this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> > struct skb_list queues[MAX_CPU_COUNT];
>>> > struct skb_list drain;
>>> > }
>>> >
>>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet so if there is task
>>> > migration or multiple threads sending packets through the same flow we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also AFAIK there is no lockless htb-like qdisc implementation
>>> > currently, however if there is already similar effort ongoing please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key point was in : " do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in TX path of :
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>>
>
> Right - this part is the key difference. With mqprio we still don't
> have the ability to explore parallelism at the level of class. The
> parallelism is restricted to the way of partitioning flows across
> queues.
>
>>

Eric - do you think if we do busy polling frequently enough
out-of-order problem will effectively be mitigated? I'll take a look
at your slides as well.


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 15:08 GMT-08:00 Eric Dumazet :
> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>> >> Michael Ma  wrote:
>> >>
>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>> >>> early feedback.
>> >>>
>> >>> Thanks,
>> >>> Michael
>> >>>
>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>> >>> >
>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>> >>> > avoids out-of-order problem.
>> >>> >
>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>> >>> > per-cpu queues for a single TC class and do busy polling from a
>> >>> > per-class thread to drain these queues. If we can do this frequently
>> >>> > enough the out-of-order situation in this polling thread should not be
>> >>> > that bad.
>> >>> >
>> >>> > To give more details - in the send path we introduce per-cpu per-class
>> >>> > queues so that packets from the same class and same core will be
>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>> >>> > belonging to its class from all the cpus and aggregate them into
>> >>> > another per-class queue. This can effectively reduce contention but
>> >>> > inevitably introduces potential out-of-order issue.
>> >>> >
>> >>> > Any concern/suggestion for working towards this direction?
>> >>
>> >> In general, there is no meta design discussions in Linux development
>> >> Several developers have tried to do lockless
>> >> qdisc and similar things in the past.
>> >>
>> >> The devil is in the details, show us the code.
>> >
>> > Thanks for the response, Stephen. The code is fairly straightforward,
>> > we have a per-cpu per-class queue defined as this:
>> >
>> > struct bandwidth_group
>> > {
>> > struct skb_list queues[MAX_CPU_COUNT];
>> > struct skb_list drain;
>> > }
>> >
>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>> > same class. In the enqueue function, we determine the cpu where the
>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>> >
>> > int cpu;
>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>> >
>> > cpu = get_cpu();
>> > skb_list_append(&bwg->queues[cpu], skb);
>> >
>> > Here we don't check the flow of the packet so if there is task
>> > migration or multiple threads sending packets through the same flow we
>> > theoretically can have packets enqueued to different queues and
>> > aggregated to the "drain" queue out of order.
>> >
>> > Also AFAIK there is no lockless htb-like qdisc implementation
>> > currently, however if there is already similar effort ongoing please
>> > let me know.
>>
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality, the benefit of it though would be that it would avoid
>> out-of-order issues for sockets originating on the local system. The
>> only thing I see as an issue right now is that the rate limiting with
>> mqprio is assumed to be handled via hardware due to mechanisms such as
>> DCB.
>
> I think one of the key point was in : " do busy polling from a per-class
> thread to drain these queues."
>
> I mentioned this idea in TX path of :
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
>

Right - this part is the key difference. With mqprio we still don't
have the ability to explore parallelism at the level of class. The
parallelism is restricted to the way of partitioning flows across
queues.

>


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 14:47 GMT-08:00 Alexander Duyck :
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>> Michael Ma  wrote:
>>>
 Any comments? We plan to implement this as a qdisc and appreciate any 
 early feedback.

 Thanks,
 Michael

 > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
 >
 > Currently txq/qdisc selection is based on flow hash so packets from
 > the same flow will follow the order when they enter qdisc/txq, which
 > avoids out-of-order problem.
 >
 > To improve the concurrency of QoS algorithm we plan to have multiple
 > per-cpu queues for a single TC class and do busy polling from a
 > per-class thread to drain these queues. If we can do this frequently
 > enough the out-of-order situation in this polling thread should not be
 > that bad.
 >
 > To give more details - in the send path we introduce per-cpu per-class
 > queues so that packets from the same class and same core will be
 > enqueued to the same place. Then a per-class thread poll the queues
 > belonging to its class from all the cpus and aggregate them into
 > another per-class queue. This can effectively reduce contention but
 > inevitably introduces potential out-of-order issue.
 >
 > Any concern/suggestion for working towards this direction?
>>>
>>> In general, there is no meta design discussions in Linux development
>>> Several developers have tried to do lockless
>>> qdisc and similar things in the past.
>>>
>>> The devil is in the details, show us the code.
>>
>> Thanks for the response, Stephen. The code is fairly straightforward,
>> we have a per-cpu per-class queue defined as this:
>>
>> struct bandwidth_group
>> {
>> struct skb_list queues[MAX_CPU_COUNT];
>> struct skb_list drain;
>> }
>>
>> "drain" queue is used to aggregate per-cpu queues belonging to the
>> same class. In the enqueue function, we determine the cpu where the
>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>
>> int cpu;
>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>
>> cpu = get_cpu();
>> skb_list_append(&bwg->queues[cpu], skb);
>>
>> Here we don't check the flow of the packet so if there is task
>> migration or multiple threads sending packets through the same flow we
>> theoretically can have packets enqueued to different queues and
>> aggregated to the "drain" queue out of order.
>>
>> Also AFAIK there is no lockless htb-like qdisc implementation
>> currently, however if there is already similar effort ongoing please
>> let me know.
>
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>

Yes, xps + mqprio will achieve a similar effect to this. However xps +
mqprio requires quite some static configuration to make things work.
Also, all the overhead associated with the qdisc root lock still exists
even though the contention might be reduced. With a thread dedicated to
each cpu we can avoid acquiring the lock for enqueue/dequeue since it's
effectively single producer and single consumer. In the case that
multiple threads do share the same flow the lock contention still
exists, and we don't have per-class parallelism for rate limiting if
classes ever share the same cores, which is avoidable by having a
per-class thread to run stuff like a leaky bucket.

We still plan to implement it as a classful qdisc.

> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.
>

mqprio can also be attached with qdiscs like HTB so this can actually
work without DCB.

> - Alex
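
For readers trying to picture the proposal, a rough sketch of the per-class
drain thread being described in this thread (skb_list_*, bwg_shape_and_xmit
and the thread setup are placeholders, not an actual implementation):

static int bwg_drain_thread(void *arg)
{
        struct bandwidth_group *bwg = arg;
        int cpu;

        while (!kthread_should_stop()) {
                bool moved = false;
                struct sk_buff *skb;

                /* Each per-cpu queue has one producer (that cpu's enqueue
                 * path) and one consumer (this thread), so no qdisc root
                 * lock is taken on either side. */
                for_each_online_cpu(cpu) {
                        while ((skb = skb_list_dequeue(&bwg->queues[cpu]))) {
                                skb_list_append(&bwg->drain, skb);
                                moved = true;
                        }
                }

                /* Rate limiting (e.g. a leaky bucket) is applied once per
                 * class on the aggregated drain queue before handing the
                 * packets to the device tx queue. */
                bwg_shape_and_xmit(bwg);

                if (!moved)
                        cond_resched(); /* or busy-poll with a small budget */
        }
        return 0;
}

The out-of-order risk discussed above comes from the window between a flow's
packets landing on different per-cpu queues (after task migration) and this
thread aggregating them into the drain queue.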


[PATCH] usbnet: ipheth: prevent TX queue timeouts when device not ready

2017-11-13 Thread Alexander Kappner
iOS devices require the host to be "trusted" before servicing network
packets. Establishing trust requires the user to confirm a dialog on the
iOS device. Until trust is established, the iOS device will silently discard
network packets from the host. Currently, the ipheth driver does not detect
whether an iOS device has established trust with the host, and immediately
sets up the transmit queues.

This causes the following problems:

- Kernel taint due to WARN() in netdev watchdog.
- Dmesg spam ("TX timeout").
- Disruption of user space networking activity (dhcpd, etc...) when new
interface comes up but cannot be used.
- Unnecessary host and device wakeups and USB traffic

Example dmesg output:

[ 1101.319778] NETDEV WATCHDOG: eth1 (ipheth): transmit queue 0 timed out
[ 1101.319817] [ cut here ]
[ 1101.319828] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 
dev_watchdog+0x20f/0x220
[ 1101.319831] Modules linked in: ipheth usbmon nvidia_drm(PO) 
nvidia_modeset(PO) nvidia(PO) iwlmvm mac80211 iwlwifi btusb btrtl btbcm btintel 
qmi_wwan bluetooth cfg80211 ecdh_generic thinkpad_acpi rfkill [last unloaded: 
ipheth]
[ 1101.319861] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P   O
4.13.12.1 #1
[ 1101.319864] Hardware name: LENOVO 20ENCTO1WW/20ENCTO1WW, BIOS N1EET62W (1.35 
) 11/10/2016
[ 1101.319867] task: 81e11500 task.stack: 81e0
[ 1101.319873] RIP: 0010:dev_watchdog+0x20f/0x220
[ 1101.319876] RSP: 0018:8810a3c03e98 EFLAGS: 00010292
[ 1101.319880] RAX: 003a RBX:  RCX: 
[ 1101.319883] RDX: 8810a3c15c48 RSI: 81ccbfc2 RDI: 
[ 1101.319886] RBP: 880c04ebc41c R08:  R09: 0379
[ 1101.319889] R10: 0100696589d0 R11: 0378 R12: 880c04ebc000
[ 1101.319892] R13:  R14: 0001 R15: 880c2865fc80
[ 1101.319896] FS:  () GS:8810a3c0() 
knlGS:
[ 1101.319899] CS:  0010 DS:  ES:  CR0: 80050033
[ 1101.319902] CR2: 7f3ff24ac000 CR3: 01e0a000 CR4: 003406f0
[ 1101.319905] DR0:  DR1:  DR2: 
[ 1101.319908] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1101.319910] Call Trace:
[ 1101.319914]  
[ 1101.319921]  ? dev_graft_qdisc+0x70/0x70
[ 1101.319928]  ? dev_graft_qdisc+0x70/0x70
[ 1101.319934]  ? call_timer_fn+0x2e/0x170
[ 1101.319939]  ? dev_graft_qdisc+0x70/0x70
[ 1101.319944]  ? run_timer_softirq+0x1ea/0x440
[ 1101.319951]  ? timerqueue_add+0x54/0x80
[ 1101.319956]  ? enqueue_hrtimer+0x38/0xa0
[ 1101.319963]  ? __do_softirq+0xed/0x2e7
[ 1101.319970]  ? irq_exit+0xb4/0xc0
[ 1101.319976]  ? smp_apic_timer_interrupt+0x39/0x50
[ 1101.319981]  ? apic_timer_interrupt+0x8c/0xa0
[ 1101.319983]  
[ 1101.319992]  ? cpuidle_enter_state+0xfa/0x2a0
[ 1101.31]  ? do_idle+0x1a3/0x1f0
[ 1101.320004]  ? cpu_startup_entry+0x5f/0x70
[ 1101.320011]  ? start_kernel+0x444/0x44c
[ 1101.320017]  ? early_idt_handler_array+0x120/0x120
[ 1101.320023]  ? x86_64_start_kernel+0x145/0x154
[ 1101.320028]  ? secondary_startup_64+0x9f/0x9f
[ 1101.320033] Code: 20 04 00 00 eb 9f 4c 89 e7 c6 05 59 44 71 00 01 e8 a7 df 
fd ff 89 d9 4c 89 e6 48 c7 c7 70 b7 cd 81 48 89 c2 31 c0 e8 97 64 90 ff <0f> ff 
eb bf 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[ 1101.320103] ---[ end trace 0cc4d251e2b57080 ]---
[ 1101.320110] ipheth 1-5:4.2: ipheth_tx_timeout: TX timeout

The last message "TX timeout" is repeated every 5 seconds until trust is
established or the device is disconnected, filling up dmesg.

The proposed patch eliminates the problem by, upon connection, keeping the
TX queue and carrier disabled until a packet is first received from the iOS
device. This is reflected by the confirmed_pairing variable in the device
structure. Only after at least one packet has been received from the iOS
device, the transmit queue and carrier are brought up during the periodic
device poll in ipheth_carrier_set. Because the iOS device will always send
a packet immediately upon trust being established, this should not delay
the interface becoming usable. To prevent failed URBs in
ipheth_rcvbulk_callback from perpetually re-enabling the queue if it was
disabled, a new check is added so only successful transfers re-enable the
queue, whereas failed transfers only trigger an immediate poll.

This has the added benefit of removing the periodic control requests to the
iOS device until trust has been established and thus should reduce wakeup
events on both the host and the iOS device.

Signed-off-by: Alexander Kappner 
---
 drivers/net/usb/ipheth.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
index d49c710..ca71f6c 100644
--- a/drivers/net/usb/ipheth.c
+++ b/drivers/net/usb/ipheth.c
@@ -149,6 +149,7 @@ struct ipheth_device 
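
The diff is truncated in this archive. As an illustration of the approach
described in the commit message only (not the actual patch), the carrier
gating could look roughly like this -- confirmed_pairing comes from the
description above, and ipheth_carrier_check() is a placeholder for the
existing carrier control request:

static int ipheth_carrier_set(struct ipheth_device *dev)
{
        int retval = ipheth_carrier_check(dev);  /* placeholder */

        /* only report link up once the device has proven it will talk to
         * us, i.e. after the first packet has been received */
        if (retval == IPHETH_CARRIER_ON && dev->confirmed_pairing) {
                netif_carrier_on(dev->net);
                netif_wake_queue(dev->net);
        } else {
                netif_carrier_off(dev->net);
        }
        return retval;
}

/* and in the rx bulk URB completion handler, on a successful transfer: */
        dev->confirmed_pairing = true;  /* first packet seen; link is usable */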

RE: Microchip KSZ* DSA drivers Re: [PATCH v1 RFC 1/1] Add Microchip KSZ8795 DSA driver

2017-11-13 Thread Tristram.Ha
> Subject: Microchip KSZ* DSA drivers Re: [PATCH v1 RFC 1/1] Add Microchip
> KSZ8795 DSA driver
> 
> Hi!
> 
> Are there any news here? Is there new release planned? Is there a git
> tree somewhere? I probably should get it working, soon.. so I guess I
> can help with testing.
> 

Reviewed patches will be submitted formally to net-next within this week.

BTW, how is the KSZ8895 driver working for you?  Are there some issues that
prevent you from using it?



Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-13 Thread Daniel Axtens
Hi Bryant,

A few things:

1) The commit message could probably be trimmed, especially the stack
traces.

2) What exactly are you changing and why does it fix the issue? I
couldn't figure that out from the commit message.

3) Does this need a Fixes: tag?


>   }
>  
> - netdev->min_mtu = IBMVETH_MIN_MTU;
> - netdev->max_mtu = ETH_MAX_MTU;
> -

4) What does the above hunk do? It seems unrelated...


Regards,
Daniel


Re: [PATCH RFC,WIP 0/5] Flow offload infrastructure

2017-11-13 Thread Jakub Kicinski
On Fri,  3 Nov 2017 16:26:31 +0100, Pablo Neira Ayuso wrote:
> I'm measuring here that the software flow table forwarding path is 2.5
> faster than the classic forwarding path in my testbed.
> 
> TODO, still many things:
> 
> * Only IPv4 at this time.
> * Only IPv4 SNAT is supported.
> * No netns support yet.
> * Missing netlink interface to operate with the flow table, to force the
>   handover of flow to the software path.
> * Higher configurability, instead of registering the flow table
>   inconditionally, add an interface to specify software flow table
>   properties.
> * No flow counters at this time.
> 
> This should serve a number of usecases where we can rely on this kernel
> bypass. Packets that need fragmentation / PMTU / IP option handling /
> ... and any specific handling, then we should pass them up to the
> forwarding classic path.

I didn't realize it from this patch set, but it was mentioned at the
conference that this patch set is completely stateless.  I.e. things
like TCP window tracking are not included here.  IMHO that's a big
concern, because offloading flows is trivial when compared to state
sync.  IMHO state sync is *the* challenge in implementing connection
tacking offload...


Re: [PATCH 12/14] nubus: Rename struct nubus_dev

2017-11-13 Thread Finn Thain
On Mon, 13 Nov 2017, Geert Uytterhoeven wrote:

> Hi Finn,
> 
> On Sat, Nov 11, 2017 at 7:12 AM, Finn Thain  
> wrote:
> > It is misleading to use "dev" to mean a functional resource. And in 
> > adopting the Linux Driver Model, struct nubus_board will embed a 
> > struct device. Drivers will then bind with boards, not with functional 
> > resources.
> >
> > Rename struct nubus_dev as struct nubus_functional_resource. This is 
> > the vendor's terminology and avoids confusion.
> 
> Isn't "struct nubus_functional_resource" a bit long? 

Right.

> What about "nubus_res"? "nubus_fres"?
> 

I think the temporary variables can remain 'fres', as that's just a 
mnemonic, though the declaration really ought to tell us something 
meaningful.

You and I both avoided 'func' as an abbreviation. I think it suggests a 
'C' function pointer. And 'res' and 'rsrc' suggest an ioport.h struct 
resource. An unambiguous compromise would be 'nubus_functional_rsrc' but 
maybe this is still too long?

Perhaps 'struct nubus_rsrc' or 'struct nubus_res' are okay if we can keep 
the distinction between 'functional resources', 'slot resources' and 
'board resources' clear by naming instances appropriately? E.g.

@@ -33,7 +33,7 @@ struct nubus_dirent {
 
 struct nubus_board {
struct nubus_board *next;
-   struct nubus_dev *first_dev;
+   struct nubus_rsrc *first_func_rsrc;
 
/* Only 9-E actually exist, though 0-8 are also theoretically
   possible, and 0 is a special case which represents the
@@ -81,8 +81,8 @@ struct nubus_dev {
struct nubus_board *board;
 };
 
-/* This is all NuBus devices (used to find devices later on) */
-extern struct nubus_dev *nubus_devices;
+/* This is all NuBus functional resources (used to find devices later on) 
*/
+extern struct nubus_rsrc *nubus_func_rsrcs;
 /* This is all NuBus cards */
 extern struct nubus_board *nubus_boards;

etc.

-- 

> Gr{oetje,eeting}s,
> 
> Geert
> 


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Alexander Duyck
On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  wrote:
> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>> >> Michael Ma  wrote:
>> >>
>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>> >>> early feedback.
>> >>>
>> >>> Thanks,
>> >>> Michael
>> >>>
>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>> >>> >
>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>> >>> > avoids out-of-order problem.
>> >>> >
>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>> >>> > per-cpu queues for a single TC class and do busy polling from a
>> >>> > per-class thread to drain these queues. If we can do this frequently
>> >>> > enough the out-of-order situation in this polling thread should not be
>> >>> > that bad.
>> >>> >
>> >>> > To give more details - in the send path we introduce per-cpu per-class
>> >>> > queues so that packets from the same class and same core will be
>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>> >>> > belonging to its class from all the cpus and aggregate them into
>> >>> > another per-class queue. This can effectively reduce contention but
>> >>> > inevitably introduces potential out-of-order issue.
>> >>> >
>> >>> > Any concern/suggestion for working towards this direction?
>> >>
>> >> In general, there is no meta design discussions in Linux development
>> >> Several developers have tried to do lockless
>> >> qdisc and similar things in the past.
>> >>
>> >> The devil is in the details, show us the code.
>> >
>> > Thanks for the response, Stephen. The code is fairly straightforward,
>> > we have a per-cpu per-class queue defined as this:
>> >
>> > struct bandwidth_group
>> > {
>> > struct skb_list queues[MAX_CPU_COUNT];
>> > struct skb_list drain;
>> > }
>> >
>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>> > same class. In the enqueue function, we determine the cpu where the
>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>> >
>> > int cpu;
>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>> >
>> > cpu = get_cpu();
>> > skb_list_append(&bwg->queues[cpu], skb);
>> >
>> > Here we don't check the flow of the packet so if there is task
>> > migration or multiple threads sending packets through the same flow we
>> > theoretically can have packets enqueued to different queues and
>> > aggregated to the "drain" queue out of order.
>> >
>> > Also AFAIK there is no lockless htb-like qdisc implementation
>> > currently, however if there is already similar effort ongoing please
>> > let me know.
>>
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality, the benefit of it though would be that it would avoid
>> out-of-order issues for sockets originating on the local system. The
>> only thing I see as an issue right now is that the rate limiting with
>> mqprio is assumed to be handled via hardware due to mechanisms such as
>> DCB.
>
> I think one of the key point was in : " do busy polling from a per-class
> thread to drain these queues."
>
> I mentioned this idea in TX path of :
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

I think this is a bit different from that idea in that the busy
polling is transferring packets from a per-cpu qdisc to a per traffic
class queueing discipline. Basically it would be a busy_poll version
of qdisc_run that would be transferring packets from one qdisc onto
another instead of attempting to transmit them directly.

What I think is tripping me up is that I don't think this is even
meant to work with a multiqueue device. The description seems more
like a mqprio implementation feeding into a prio qdisc which is then
used for dequeue. To me it seems like this solution would be pulling
complexity off of the enqueue side and just adding it to the dequeue
side. The use of the "busy poll" is just to try to simplify things as
the agent would then be consolidating traffic from multiple per-cpu
queues onto one drain queue.

Structure wise this ends up looking not too different from mqprio, the
main difference though would be that this would be a classful qdisc
and that the virtual qdiscs we have for the traffic classes would need
to be replaced with actual qdiscs for handling the "drain" aspect of
this.

- Alex


Re: len = bpf_probe_read_str(); bpf_perf_event_output(... len) == FAIL

2017-11-13 Thread Daniel Borkmann
On 11/13/2017 04:08 PM, Arnaldo Carvalho de Melo wrote:
> On Mon, Nov 13, 2017 at 03:56:14PM +0100, Daniel Borkmann wrote:
>> On 11/13/2017 03:30 PM, Arnaldo Carvalho de Melo wrote:
>>> Hi,
>>>
>>> In a5e8c07059d0 ("bpf: add bpf_probe_read_str helper") you
>>> state:
>>>
>>>"This is suboptimal because the size of the string needs to be estimated
>>> at compile time, causing more memory to be copied than often necessary,
>>> and can become more problematic if further processing on buf is done,
>>> for example by pushing it to userspace via bpf_perf_event_output(),
>>> since the real length of the string is unknown and the entire buffer
>>> must be copied (and defining an unrolled strnlen() inside the bpf
>>> program is a very inefficient and unfeasible approach)."
>>>
>>> So I went on to try this with 'perf trace' but it isn't working if I use
>>> the return from bpf_probe_read_str(), I must be missing something
>>> here... 
>>>
>>> I.e. this works:
>>>
>>> [root@jouet bpf]# cat open.c
>>> #include "bpf.h"
>>>
>>> SEC("prog=do_sys_open filename")
>>> int prog(void *ctx, int err, const char __user *filename_ptr)
>>> {
>>> char filename[128];
>>> const unsigned len = bpf_probe_read_str(filename, sizeof(filename), 
>>> filename_ptr);
>>> perf_event_output(ctx, &__bpf_stdout__, get_smp_processor_id(), 
>>> filename, 32);
>>
>> By the way, you can just use BPF_F_CURRENT_CPU flag instead of the helper
>> call get_smp_processor_id() to get current CPU.
> 
> Thanks, switched to it.
> 
>>> But then if I use the return value to push just the string lenght, it
>>> doesn't work:
>>>
>>> [root@jouet bpf]# cat open.c
>>> #include "bpf.h"
>>>
>>> SEC("prog=do_sys_open filename")
>>> int prog(void *ctx, int err, const char __user *filename_ptr)
>>> {
>>> char filename[128];
>>> const unsigned len = bpf_probe_read_str(filename, sizeof(filename), 
>>> filename_ptr);
>>> perf_event_output(ctx, &__bpf_stdout__, get_smp_processor_id(), 
>>> filename, len);
>>
>> The below issue 'invalid stack type R4 off=-128 access_size=0' is basically 
>> that
>> unsigned len is unknown at verification time, thus unbounded. Can you try the
>> following to see if that passes?
>>
>> if (len > 0 && len <= sizeof(filename))
>> perf_event_output(ctx, &__bpf_stdout__, get_smp_processor_id(), 
>> filename, len);
> 
> I had it like:
> 
>   if (len > 0 && len < 32)
> 
> And it didn't helped, now I did exactly as you suggested:
> 
> [root@jouet bpf]# cat open.c
> #include "bpf.h"
> 
> SEC("prog=do_sys_open filename")
> int prog(void *ctx, int err, const char __user *filename_ptr)
> {
>   char filename[128];
>   const unsigned len = bpf_probe_read_str(filename, sizeof(filename), 
> filename_ptr);
>   if (len > 0 && len <= sizeof(filename))
>   perf_event_output(ctx, &__bpf_stdout__, BPF_F_CURRENT_CPU, 
> filename, len);
>   return 1;
> }
> [root@jouet bpf]# trace -e open,open.c touch /etc/passwd
> bpf: builtin compilation failed: -95, try external compiler
> event syntax error: 'open.c'
>  \___ Kernel verifier blocks program loading
> 
> [root@jouet bpf]# 
> 
> The -v output looks the same:
> 
> [root@jouet bpf]# trace -v -e open,open.c touch /etc/passwd
> bpf: builtin compilation failed: -95, try external compiler
> Kernel build dir is set to /lib/modules/4.14.0-rc6+/build
> set env: KBUILD_DIR=/lib/modules/4.14.0-rc6+/build
> unset env: KBUILD_OPTS
> include option is set to  -nostdinc -isystem 
> /usr/lib/gcc/x86_64-redhat-linux/7/include 
> -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  
> -I/home/acme/git/linux/include -I./include 
> -I/home/acme/git/linux/arch/x86/include/uapi 
> -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi 
> -I./include/generated/uapi -include 
> /home/acme/git/linux/include/linux/kconfig.h 
> set env: NR_CPUS=4
> set env: LINUX_VERSION_CODE=0x40e00
> set env: CLANG_EXEC=/usr/local/bin/clang
> unset env: CLANG_OPTIONS
> set env: KERNEL_INC_OPTIONS= -nostdinc -isystem 
> /usr/lib/gcc/x86_64-redhat-linux/7/include 
> -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  
> -I/home/acme/git/linux/include -I./include 
> -I/home/acme/git/linux/arch/x86/include/uapi 
> -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi 
> -I./include/generated/uapi -include 
> /home/acme/git/linux/include/linux/kconfig.h 
> set env: WORKING_DIR=/lib/modules/4.14.0-rc6+/build
> set env: CLANG_SOURCE=/home/acme/bpf/open.c
> llvm compiling command template: $CLANG_EXEC -D__KERNEL__ 
> -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE 
> $CLANG_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign 
> -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o -
> libbpf: loading object 'open.c' from buffer
> libbpf: section .strtab, size 103, link 0, flags 0, type=3
> libbpf: section .text, size 0, link 0, flags 6, typ
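
One workaround that often satisfies the verifier in this situation is to
clamp the length with a mask against a power-of-two buffer size, so the
value passed to perf_event_output() is explicitly bounded at the call site
rather than an unknown scalar. A sketch along the lines of the program
above -- not necessarily how this particular report was resolved, and
depending on how clang materializes the bounds it may still need massaging:

#include "bpf.h"

SEC("prog=do_sys_open filename")
int prog(void *ctx, int err, const char __user *filename_ptr)
{
        char filename[128];  /* power of two, so the mask below is exact */
        int len = bpf_probe_read_str(filename, sizeof(filename), filename_ptr);

        if (len > 0 && len < sizeof(filename))
                perf_event_output(ctx, &__bpf_stdout__, BPF_F_CURRENT_CPU,
                                  filename,
                                  len & (sizeof(filename) - 1));
        return 1;
}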

Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> I wonder if there is some way to use the split format for the
> intermediate files, but then for the very final link bring them all in
> and make the end result be a traditional single binary. I'm not
> talking the separate "dwp" package that packs multiple dwo files into
> one, but to actually link them all in at the one.
> 
> Sure, it would lose some of the advantage, but I think a large portion
> of the -gsplit-dwarf advantage is about the intermediate state. No?

Not sure it's worth doing complicated workarounds. I assume
it will not be that difficult to fix binutils (after all,
gdb works), and disabling the option is a reasonable workaround.

> 
> I tried to google for it, but couldn't find anything. But apparently
> elfutils doesn't support dwo files either. It seems mainly the linker
> and gdb itself that supports it.

The original design document is

https://gcc.gnu.org/wiki/DebugFission

-Andi


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-13 Thread Alexei Starovoitov

On 11/13/17 9:07 PM, Björn Töpel wrote:

2017-10-31 13:41 GMT+01:00 Björn Töpel :

From: Björn Töpel 


[...]


We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
Korea, and our paper with complete benchmarks will be released shortly
on the NetDev 2.2 site.



We're back in the saddle after an excellent netdevconf week. Kudos to
the organizers; We had a blast! Thanks for all the constructive
feedback.

I'll summarize the major points, that we'll address in the next RFC
below.

* Instead of extending AF_PACKET with yet another version, introduce a
  new address/packet family. As for naming, we had some suggestions:
  AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
  AF_ZEROCOPY, unless there are strong opinions against it.

* No explicit zerocopy enablement. Use the zerocopy path if
  supported, if not -- fall back to the skb path, for netdevs that
  don't support the required ndos. Further, we'll have the zerocopy
  behavior for the skb path as well, meaning that an AF_ZEROCOPY
  socket will consume the skb and we'll honor skb->queue_mapping,
  meaning that we only consume the packets for the enabled queue.

* Limit the scope of the first patchset to Rx only, and introduce Tx
  in a separate patchset.


All sounds good to me except the above bit.
I don't remember people suggesting to split it this way.
What's the value of it without tx?


* Minimize the size of the i40e zerocopy patches, by moving the driver
  specific code to separate patches.

* Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
  XDP redirect map call with ingress flag.

* Extend the XDP redirect to support explicit allocator/destructor
  functions. Right now, XDP redirect assumes that the page allocator
  was used, and the XDP redirect cleanup path is decreasing the page
  count of the XDP buffer. This assumption breaks for the zerocopy
  case.


Björn



We based this patch set on net-next commit e1ea2f9856b7 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

Please focus your review on:

* The V4 user space interface
* PACKET_ZEROCOPY and its semantics
* Packet array interface
* XDP semantics when executing in zero-copy mode (user space passed
  buffers)
* XDP_PASS_TO_KERNEL semantics

To do:

* Investigate the user-space ring structure’s performance problems
* Continue the XDP integration into packet arrays
* Optimize performance
* SKB <-> V4 conversions in tp4a_populate & tp4a_flush
* Packet buffer is unnecessarily pinned for virtual devices
* Support shared packet buffers
* Unify V4 and SKB receive path in I40E driver
* Support for packets spanning multiple frames
* Disassociate the packet array implementation from the V4 queue
  structure

We would really like to thank the reviewers of the limited
distribution RFC for all their comments that have helped improve the
interfaces and the code significantly: Alexei Starovoitov, Alexander
Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
at Intel that has been helping out reviewing code, writing tests, and
sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
Zhang, your participation has really helped.

Thanks: Björn and Magnus

[1] 
https://www.netdevconf.org/2.2/

Björn Töpel (7):
  packet: introduce AF_PACKET V4 userspace API
  packet: implement PACKET_MEMREG setsockopt
  packet: enable AF_PACKET V4 rings
  packet: wire up zerocopy for AF_PACKET V4
  i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
  i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
  samples/tpacket4: added tpbench

Magnus Karlsson (7):
  packet: enable Rx for AF_PACKET V4
  packet: enable Tx support for AF_PACKET V4
  netdevice: add AF_PACKET V4 zerocopy ops
  veth: added support for PACKET_ZEROCOPY
  samples/tpacket4: added veth support
  i40e: added XDP support for TP4 enabled queue pairs
  xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use

 drivers/net/ethernet/intel/i40e/i40e.h |3 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |9 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|  837 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  582 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   38 +
 drivers/net/veth.c |  174 +++
 include/linux/netdevice.h  |   16 +
 include/linux/tpacket4.h   | 1502 
 include/uapi/linux/bpf.h   |1 +
 include/uapi/linux/if_packet.h |   65 +-
 net/packet/af_packet.c | 1252 +---
 net/packet/internal.h  |9 +
 samples/tpacket4/Makefile  |   12 +
 samples/tpacket4/bench_all.sh  |   28 +
 samples/tpacket4/tpbench.c 

[PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-13 Thread Bryant G. Ly
The following script when run (along with some iperf traffic recreates the
crash within 5-10 mins or so).

while true
do
ethtool -k ibmveth0 | grep tcp-segmentation-offload
ethtool -K ibmveth0 tso off
ethtool -k ibmveth0 | grep tcp-segmentation-offload
ethtool -K ibmveth0 tso on
done

Note: This issue happens the very first time large send offload is
turned off too (but the above script recreates the issue every time)

Stack trace output:
 [76563.914380] NIP [c0063940] memcpy_power7+0x40/0x800
[76563.914387] LR [d0d31788] ibmveth_start_xmit+0x1c8/0x8d0 [ibmveth]
[76563.914392] Call Trace:
[76563.914396] [c000feab3270] [c000feab32d0] 0xc000feab32d0 
(unreliable)
[76563.914407] [c000feab3360] [c09816f4] 
dev_hard_start_xmit+0x304/0x530
[76563.914415] [c000feab3440] [c09b6564] sch_direct_xmit+0x124/0x330
[76563.914423] [c000feab34e0] [c0981ddc] 
__dev_queue_xmit+0x26c/0x770
[76563.914431] [c000feab3580] [c0a1efc0] arp_xmit+0x30/0xd0
[76563.914438] [c000feab35f0] [c0a1f0f4] 
arp_send_dst.part.0+0x94/0xb0
[76563.914445] [c000feab3660] [c0a1fcb4] arp_solicit+0x114/0x2b0
[76563.914452] [c000feab3730] [c098d8f4] neigh_probe+0x84/0xd0
[76563.914460] [c000feab3760] [c09937cc] 
neigh_timer_handler+0xbc/0x320
[76563.914468] [c000feab37a0] [c014a3fc] call_timer_fn+0x5c/0x1c0
[76563.914474] [c000feab3830] [c014a8bc] 
run_timer_softirq+0x31c/0x3f0
[76563.914483] [c000feab3900] [c00bec58] __do_softirq+0x188/0x3e0
[76563.914490] [c000feab39f0] [c00bf128] irq_exit+0xc8/0x100
[76563.914498] [c000feab3a10] [c001f974] timer_interrupt+0xa4/0xe0
[76563.914505] [c000feab3a40] [c0002714] 
decrementer_common+0x114/0x180

Oops output:
 [76563.914173] Unable to handle kernel paging request for data at address 
0x
[76563.914197] Faulting instruction address: 0xc0063940
[76563.914205] Oops: Kernel access of bad area, sig: 11 [#1]
[76563.914210] SMP NR_CPUS=2048 NUMA pSeries
[76563.914217] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag 
udp_diag inet_diag unix_diag af_packet_diag netlink_diag nls_utf8 isofs 
binfmt_misc pseries_rng rtc_generic autofs4 ibmvfc scsi_transport_fc ibmvscsi 
ibmveth
[76563.914251] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0-34-generic 
#53-Ubuntu
[76563.914258] task: c000fe9efcc0 ti: c000feab task.ti: 
c000feab
[76563.914265] NIP: c0063940 LR: d0d31788 CTR: c0064100
[76563.914271] REGS: c000feab2ff0 TRAP: 0300   Not tainted  
(4.4.0-34-generic)
[76563.914277] MSR: 80009033   CR: 4800284e  XER: 
001a
[76563.914294] CFAR: c0008468 DAR:  DSISR: 4200 
SOFTE: 1
GPR00: f240 c000feab3270 c15b5d00 
GPR04: cd9b0004 002a 0006 c000efc0ccac
GPR08: d0d3dd28 0080  d0d34758
GPR12: c0064100 ce7f1c80 c000ffdca938 0100
GPR16: c000ffdca738 c000ffdca538 c000feab34a0 c15f4d00
GPR20:  c15f4cf0 c000f5945900 
GPR24:  8000 c000feab32d0 c000efc0ccac
GPR28: c000f23ccb00 c000f5945000 c000f23ccb00 c000efc0cc00
[76563.914380] NIP [c0063940] memcpy_power7+0x40/0x800
[76563.914387] LR [d0d31788] ibmveth_start_xmit+0x1c8/0x8d0 [ibmveth]
[76563.914392] Call Trace:
[76563.914396] [c000feab3270] [c000feab32d0] 0xc000feab32d0 
(unreliable)
[76563.914407] [c000feab3360] [c09816f4] 
dev_hard_start_xmit+0x304/0x530
[76563.914415] [c000feab3440] [c09b6564] sch_direct_xmit+0x124/0x330
[76563.914423] [c000feab34e0] [c0981ddc] 
__dev_queue_xmit+0x26c/0x770
[76563.914431] [c000feab3580] [c0a1efc0] arp_xmit+0x30/0xd0
[76563.914438] [c000feab35f0] [c0a1f0f4] 
arp_send_dst.part.0+0x94/0xb0
[76563.914445] [c000feab3660] [c0a1fcb4] arp_solicit+0x114/0x2b0
[76563.914452] [c000feab3730] [c098d8f4] neigh_probe+0x84/0xd0
[76563.914460] [c000feab3760] [c09937cc] 
neigh_timer_handler+0xbc/0x320
[76563.914468] [c000feab37a0] [c014a3fc] call_timer_fn+0x5c/0x1c0
[76563.914474] [c000feab3830] [c014a8bc] 
run_timer_softirq+0x31c/0x3f0
[76563.914483] [c000feab3900] [c00bec58] __do_softirq+0x188/0x3e0
[76563.914490] [c000feab39f0] [c00bf128] irq_exit+0xc8/0x100
[76563.914498] [c000feab3a10] [c001f974] timer_interrupt+0xa4/0xe0
[76563.914505] [c000feab3a40] [c0002714] 
decrementer_common+0x114/0x180
[76563.914515] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[76563.914515] LR = check_and_cede_processor+0x34/0x50
[76563.914525] [c000feab3d30] [c0916

Re: Per-CPU Queueing for QoS

2017-11-13 Thread Eric Dumazet
On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
> >> On Sun, 12 Nov 2017 13:43:13 -0800
> >> Michael Ma  wrote:
> >>
> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
> >>> early feedback.
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
> >>> >
> >>> > Currently txq/qdisc selection is based on flow hash so packets from
> >>> > the same flow will follow the order when they enter qdisc/txq, which
> >>> > avoids out-of-order problem.
> >>> >
> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
> >>> > per-cpu queues for a single TC class and do busy polling from a
> >>> > per-class thread to drain these queues. If we can do this frequently
> >>> > enough the out-of-order situation in this polling thread should not be
> >>> > that bad.
> >>> >
> >>> > To give more details - in the send path we introduce per-cpu per-class
> >>> > queues so that packets from the same class and same core will be
> >>> > enqueued to the same place. Then a per-class thread poll the queues
> >>> > belonging to its class from all the cpus and aggregate them into
> >>> > another per-class queue. This can effectively reduce contention but
> >>> > inevitably introduces potential out-of-order issue.
> >>> >
> >>> > Any concern/suggestion for working towards this direction?
> >>
> >> In general, there are no meta design discussions in Linux development.
> >> Several developers have tried to do lockless
> >> qdisc and similar things in the past.
> >>
> >> The devil is in the details, show us the code.
> >
> > Thanks for the response, Stephen. The code is fairly straightforward,
> > we have a per-cpu per-class queue defined as this:
> >
> > struct bandwidth_group
> > {
> > struct skb_list queues[MAX_CPU_COUNT];
> > struct skb_list drain;
> > };
> >
> > "drain" queue is used to aggregate per-cpu queues belonging to the
> > same class. In the enqueue function, we determine the cpu where the
> > packet is processed and enqueue it to the corresponding per-cpu queue:
> >
> > int cpu;
> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
> >
> > cpu = get_cpu();
> > skb_list_append(&bwg->queues[cpu], skb);
> >
> > Here we don't check the flow of the packet so if there is task
> > migration or multiple threads sending packets through the same flow we
> > theoretically can have packets enqueued to different queues and
> > aggregated to the "drain" queue out of order.
> >
> > Also AFAIK there is no lockless htb-like qdisc implementation
> > currently, however if there is already similar effort ongoing please
> > let me know.
> 
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
> 
> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.

I think one of the key points was: "do busy polling from a per-class
thread to drain these queues."

I mentioned this idea for the TX path in:

https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
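
To make the drain side of the proposal concrete, a minimal sketch of the
per-class polling thread Michael describes could look roughly like this;
bandwidth_group and the skb_list helpers are the hypothetical structures
from his snippet, not existing kernel APIs, and locking of the per-cpu
lists is hand-waved:

/* Sketch only: bandwidth_group and the skb_list_* helpers are the
 * hypothetical structures from the snippet above, not existing kernel
 * APIs; synchronisation of the per-cpu lists is not shown.
 */
static int bwg_drain_thread(void *arg)
{
	struct bandwidth_group *bwg = arg;
	int cpu;

	while (!kthread_should_stop()) {
		/* Busy-poll every per-cpu queue of this class and splice
		 * the packets into the single per-class drain queue.
		 */
		for_each_possible_cpu(cpu)
			skb_list_splice(&bwg->queues[cpu], &bwg->drain);

		/* Shaping / dequeue of bwg->drain would happen here; the
		 * out-of-order window is bounded by how often this loop
		 * runs.
		 */
		cond_resched();
	}
	return 0;
}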





Re: [RFC v2 0/6] enable creating [k,u]probe with perf_event_open

2017-11-13 Thread Song Liu

> On Nov 12, 2017, at 3:40 PM, Song Liu  wrote:
> 
> Changes v1 to v2:
>  Fix build issue reported by kbuild test bot by adding ifdef of
>  CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS.
> 
> v1 cover letter:
> 
> This is to follow up the discussion over "new kprobe api" at Linux
> Plumbers 2017:
> 
> https://www.linuxplumbersconf.org/2017/ocw/proposals/4808
> 
> With current kernel, user space tools can only create/destroy [k,u]probes
> with a text-based API (kprobe_events and uprobe_events in tracefs). This
> approach relies on user space to clean up the [k,u]probe after using them.
> However, this is not easy for user space to clean up properly.
> 
> To solve this problem, we introduce a file descriptor based API.
> Specifically, we extended perf_event_open to create [k,u]probe, and attach
> this [k,u]probe to the file descriptor created by perf_event_open. These
> [k,u]probe are associated with this file descriptor, so they are not
> available in tracefs.
> 
> We reuse large portion of existing trace_kprobe and trace_uprobe code.
> Currently, the file descriptor API does not support arguments as the
> text-based API does. This should not be a problem, as users of the file
> descriptor based API read data through other methods (bpf, etc.).
> 
> I also include a patch to bcc, and a patch to the man page perf_event_open.
> Please see the list below. A fork of bcc with this patch is also available
> on github:
> 
>  https://github.com/liu-song-6/bcc/tree/new_perf_event_opn
> 
> Thanks,
> Song
> 
> man-pages patch:
>  perf_event_open.2: add new type PERF_TYPE_PROBE
> 
> bcc patch:
>  bcc: Try use new API to create [k,u]probe with perf_event_open
> 
> kernel patches:
> 
> Song Liu (6):
>  perf: Add new type PERF_TYPE_PROBE
>  perf: copy new perf_event.h to tools/include/uapi
>  perf: implement kprobe support to PERF_TYPE_PROBE
>  perf: implement uprobe support to PERF_TYPE_PROBE
>  bpf: add option for bpf_load.c to use PERF_TYPE_PROBE
>  bpf: add new test test_many_kprobe
> 
> include/linux/trace_events.h  |   2 +
> include/uapi/linux/perf_event.h   |  35 ++-
> kernel/events/core.c  |  39 ++-
> kernel/trace/trace_event_perf.c   | 127 +++
> kernel/trace/trace_kprobe.c   |  91 +++--
> kernel/trace/trace_probe.h|  11 ++
> kernel/trace/trace_uprobe.c   |  90 +++--
> samples/bpf/Makefile  |   3 +
> samples/bpf/bpf_load.c|  61 ++-
> samples/bpf/bpf_load.h|  12 +++
> samples/bpf/test_many_kprobe_user.c   | 184 ++
> tools/include/uapi/linux/perf_event.h |  35 ++-
> 12 files changed, 643 insertions(+), 47 deletions(-)
> create mode 100644 samples/bpf/test_many_kprobe_user.c
> 
> --
> 2.9.5

Dear Peter, 

Could you please share feedback about this set? 

Thanks and Best Regards,
Song
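
For illustration, user-space usage of the proposed interface could look
roughly like the sketch below. PERF_TYPE_PROBE is the new type this series
proposes; how the probed symbol is encoded in the attr is an assumption
here, not the series' actual layout -- the point is the fd-based lifetime:

/* Illustrative sketch only: the attr encoding of the symbol is assumed,
 * and PERF_TYPE_PROBE comes from this series' uapi header, not mainline.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_kprobe(const char *func)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_PROBE;		/* proposed new type */
	attr.config = (uint64_t)(uintptr_t)func; /* assumed: symbol name pointer */

	/* pid = -1, cpu = 0: all tasks on CPU 0 (needs the usual perf privileges) */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}

/* close(fd) tears the kprobe down automatically -- the cleanup property
 * the text-based tracefs API lacks.
 */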



Re: [PATCH net-next] bpf: expose sk_priority through struct bpf_sock_ops

2017-11-13 Thread Alexei Starovoitov

On 11/14/17 4:20 AM, Daniel Borkmann wrote:

On 11/13/2017 09:09 PM, Lawrence Brakmo wrote:

On 11/13/17, 11:01 AM, "Vlad Dumitrescu"  wrote:

On Sat, Nov 11, 2017 at 2:38 PM, Alexei Starovoitov  wrote:
>
> On 11/12/17 4:46 AM, Daniel Borkmann wrote:
>>
>> On 11/11/2017 05:06 AM, Alexei Starovoitov wrote:
>>>
>>> On 11/11/17 6:07 AM, Daniel Borkmann wrote:

 On 11/10/2017 08:17 PM, Vlad Dumitrescu wrote:
>
> From: Vlad Dumitrescu 
>
> Allows BPF_PROG_TYPE_SOCK_OPS programs to read sk_priority.
>
> Signed-off-by: Vlad Dumitrescu 
> ---
>   include/uapi/linux/bpf.h   |  1 +
>   net/core/filter.c  | 11 +++
>   tools/include/uapi/linux/bpf.h |  1 +
>   3 files changed, 13 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e880ae6434ee..9757a2002513 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -947,6 +947,7 @@ struct bpf_sock_ops {
>   __u32 local_ip6[4];/* Stored in network byte order */
>   __u32 remote_port;/* Stored in network byte order */
>   __u32 local_port;/* stored in host byte order */
> +__u32 priority;
>   };
> /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 61c791f9f628..a6329642d047 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4449,6 +4449,17 @@ static u32 sock_ops_convert_ctx_access(enum
> bpf_access_type type,
>   *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> offsetof(struct sock_common, skc_num));
>   break;
> +
> +case offsetof(struct bpf_sock_ops, priority):
> +BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_priority) != 4);
> +
> +*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +struct bpf_sock_ops_kern, sk),
> +  si->dst_reg, si->src_reg,
> +  offsetof(struct bpf_sock_ops_kern, sk));
> +*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +  offsetof(struct sock, sk_priority));
> +break;


 Hm, I don't think this would work, I actually think your initial patch
 was ok.
 bpf_setsockopt() as well as bpf_getsockopt() check for sk_fullsock(sk)
 right
 before accessing options on either socket or TCP level, and bail out
 with error
 otherwise; in such cases we'd read something else here and assume it's
 sk_priority.
>>>
>>>
>>> even if it's not fullsock, it will just read zero, no? what's a problem
>>> with that?
>>> In non-fullsock hooks like BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
>>> the program author will know that it's meaningless to read sk_priority,
>>> so returning zero with minimal checks is fine.
>>> While adding extra runtime if (sk_fullsock(sk)) is unnecessary,
>>> since the safety is not compromised.
>>
>>
>> Hm, on my kernel, struct sock has the 4 bytes sk_priority at offset 440,
>> struct request_sock itself is only 232 byte long in total, and the struct
>> inet_timewait_sock is 208 byte long, so you'd be accessing out of bounds
>> that way, so it cannot be ignored and assumed zero.
>
>
> I thought we always pass fully allocated sock but technically not 
fullsock yet. My mistake. We do: tcp_timeout_init((struct sock *)req))
> so yeah ctx rewrite approach won't work.
> Let's go back to access via helper.
>

TIL. Thanks!

Is there anything else needed from me to get the helper approach accepted?

I plan to add access to TCP state variables (cwnd, rtt, etc.) and I have been 
thinking
about this issue. I think it is possible to access it directly as long as we 
use a value
like 0x to represent an invalid value (e.g. not fullsock). The ctx 
rewrite just
needs to add a conditional to determine what to return. I would probably add a
field into the internal kernel struct to indicate if it is fullsock or not (we 
should
know when we call tcp_call_bpf whether it is a fullsock or not based on 
context).

Let me do a sample patch that I can send for review and get feedback from
Alexei and Daniel.


Agree, if the mov op from the ctx rewrite to read(/write) a sk member, for
example, is just a BPF_W, then we know upper reg bits are zero anyway for the
success case, so we might be able to utilize this for writing a signed error
back to the user if !fullsk.


it can be __u64 in bpf_sock_ops too, while real read is 32-bit or less,
then guaranteed no conflicts if we return (s64)-enoent
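
A rough sketch of the direction discussed here -- carrying an explicit
fullsock flag in the kernel-side context so the ctx rewrite (or a helper)
can return an error instead of reading past a request/timewait sock --
might look like this; the field name, the helper name and where the flag
gets set are assumptions, not a final interface:

/* Sketch, assuming a new flag in the kernel-side context; names not final. */
struct bpf_sock_ops_kern {
	struct	sock *sk;
	u32	op;
	union {
		u32 reply;
		u32 replylong[4];
	};
	u32	is_fullsock;	/* assumed: 1 only when sk_fullsock(sk) */
};

/* Hypothetical init helper: wherever the context is built (e.g. in
 * tcp_call_bpf()), the flag is filled in once so later accesses can
 * test it cheaply.
 */
static void bpf_sock_ops_init(struct bpf_sock_ops_kern *sock_ops,
			      struct sock *sk, u32 op)
{
	memset(sock_ops, 0, sizeof(*sock_ops));
	sock_ops->sk = sk;
	sock_ops->op = op;
	sock_ops->is_fullsock = sk_fullsock(sk);
}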

Re: Per-CPU Queueing for QoS

2017-11-13 Thread Alexander Duyck
On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger :
>> On Sun, 12 Nov 2017 13:43:13 -0800
>> Michael Ma  wrote:
>>
>>> Any comments? We plan to implement this as a qdisc and appreciate any early 
>>> feedback.
>>>
>>> Thanks,
>>> Michael
>>>
>>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>>> >
>>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> > the same flow will follow the order when they enter qdisc/txq, which
>>> > avoids out-of-order problem.
>>> >
>>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> > per-cpu queues for a single TC class and do busy polling from a
>>> > per-class thread to drain these queues. If we can do this frequently
>>> > enough the out-of-order situation in this polling thread should not be
>>> > that bad.
>>> >
>>> > To give more details - in the send path we introduce per-cpu per-class
>>> > queues so that packets from the same class and same core will be
>>> > enqueued to the same place. Then a per-class thread poll the queues
>>> > belonging to its class from all the cpus and aggregate them into
>>> > another per-class queue. This can effectively reduce contention but
>>> > inevitably introduces potential out-of-order issue.
>>> >
>>> > Any concern/suggestion for working towards this direction?
>>
>> In general, there are no meta design discussions in Linux development.
>> Several developers have tried to do lockless
>> qdisc and similar things in the past.
>>
>> The devil is in the details, show us the code.
>
> Thanks for the response, Stephen. The code is fairly straightforward,
> we have a per-cpu per-class queue defined as this:
>
> struct bandwidth_group
> {
> struct skb_list queues[MAX_CPU_COUNT];
> struct skb_list drain;
> };
>
> "drain" queue is used to aggregate per-cpu queues belonging to the
> same class. In the enqueue function, we determine the cpu where the
> packet is processed and enqueue it to the corresponding per-cpu queue:
>
> int cpu;
> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>
> cpu = get_cpu();
> skb_list_append(&bwg->queues[cpu], skb);
>
> Here we don't check the flow of the packet so if there is task
> migration or multiple threads sending packets through the same flow we
> theoretically can have packets enqueued to different queues and
> aggregated to the "drain" queue out of order.
>
> Also AFAIK there is no lockless htb-like qdisc implementation
> currently, however if there is already similar effort ongoing please
> let me know.

The question I would have is how would this differ from using XPS w/
mqprio? Would this be a classful qdisc like HTB or a classless one
like mqprio?

From what I can tell XPS would be able to get you your per-cpu
functionality, the benefit of it though would be that it would avoid
out-of-order issues for sockets originating on the local system. The
only thing I see as an issue right now is that the rate limiting with
mqprio is assumed to be handled via hardware due to mechanisms such as
DCB.

- Alex


Re: [RFC PATCH 5/5] selinux: Add SCTP support

2017-11-13 Thread Paul Moore
On Mon, Nov 13, 2017 at 5:05 PM, Richard Haines
 wrote:
> On Mon, 2017-11-06 at 19:09 -0500, Paul Moore wrote:
>> On Tue, Oct 17, 2017 at 9:59 AM, Richard Haines
>>  wrote:
>> > The SELinux SCTP implementation is explained in:
>> > Documentation/security/SELinux-sctp.txt
>> >
>> > Signed-off-by: Richard Haines 
>> > ---
>> >  Documentation/security/SELinux-sctp.txt | 108 +
>> >  security/selinux/hooks.c| 268
>> > ++--
>> >  security/selinux/include/classmap.h |   3 +-
>> >  security/selinux/include/netlabel.h |   9 +-
>> >  security/selinux/include/objsec.h   |   5 +
>> >  security/selinux/netlabel.c |  52 ++-
>> >  6 files changed, 427 insertions(+), 18 deletions(-)
>> >  create mode 100644 Documentation/security/SELinux-sctp.txt

...

>> > +Policy Statements
>> > +==
>> > +The following class and permissions to support SCTP are available
>> > within the
>> > +kernel:
>> > +class sctp_socket inherits socket { node_bind }
>> > +
>> > +whenever the following policy capability is enabled:
>> > +policycap extended_socket_class;
>> > +
>> > +The SELinux SCTP support adds the additional permissions that are
>> > explained
>> > +in the sections below:
>> > +association bindx connectx
>>
>> Is the distinction between bind and bindx significant?  The same
>> question applies to connect/connectx.  I think we can probably just
>> reuse bind and connect in these cases.
>
> This has been discussed before with Marcelo and keeping bindx/connectx
> is a useful distinction.

My apologies, I must have forgotten/missed that discussion.  Do you
have an archive pointer?

>> > +SCTP Peer Labeling
>> > +===
>> > +An SCTP socket will only have one peer label assigned to it. This
>> > will be
>> > +assigned during the establishment of the first association. Once
>> > the peer
>> > +label has been assigned, any new associations will have the
>> > "association"
>> > +permission validated by checking the socket peer sid against the
>> > received
>> > +packets peer sid to determine whether the association should be
>> > allowed or
>> > +denied.
>> > +
>> > +NOTES:
>> > +   1) If peer labeling is not enabled, then the peer context will
>> > always be
>> > +  SECINITSID_UNLABELED (unlabeled_t in Reference Policy).
>> > +
>> > +   2) As SCTP supports multiple endpoints with multi-homing on a
>> > single socket
>> > +  it is recommended that peer labels are consistent.
>>
>> My apologies if I'm confused, but I thought it was multiple
>> associations per-endpoint, with the associations providing the
>> multi-homing functionality, no?
>
> I've reworded to:
> As SCTP can support more than one transport address per endpoint
> (multi-homing) on a single socket, it is possible to configure policy
> and NetLabel to provide different peer labels for each of these. As the
> socket peer label is determined by the first association's transport
> address, it is recommended that all peer labels are consistent.

I'm still not sure this makes complete sense to me, but since I'm
still not 100% confident in my understanding of SCTP I'm willing to
punt on this for the moment.

>> > +   3) getpeercon(3) may be used by userspace to retrieve the
>> > sockets peer
>> > +   context.
>> > +
>> > +   4) If using NetLabel be aware that if a label is assigned to a
>> > specific
>> > +  interface, and that interface 'goes down', then the NetLabel
>> > service
>> > +  will remove the entry. Therefore ensure that the network
>> > startup scripts
>> > +  call netlabelctl(8) to set the required label (see netlabel-
>> > config(8)
>> > +  helper script for details).
>>
>> Maybe this will be made clear as I work my way through this patch,
>> but
>> how is point #4 SCTP specific?
>
> It's not, I added this as a useful hint as I keep forgetting about it,
> I'll reword to: While not SCTP specific, be aware .

Okay.  Better.

>> > +/* Called whenever SCTP receives an INIT or INIT_ACK chunk */
>> > +static int selinux_sctp_assoc_request(struct sctp_endpoint *ep,
>> > + struct sk_buff *skb,
>> > + int sctp_cid)
>> > +{
>> > +   struct sk_security_struct *sksec = ep->base.sk-
>> > >sk_security;
>> > +   struct common_audit_data ad;
>> > +   struct lsm_network_audit net = {0,};
>> > +   u8 peerlbl_active;
>> > +   u32 peer_sid = SECINITSID_UNLABELED;
>> > +   u32 conn_sid;
>> > +   int err;
>> > +
>> > +   if (!selinux_policycap_extsockclass)
>> > +   return 0;
>>
>> We *may* need to protect a lot of the new SCTP code with a new policy
>> capability, I think reusing extsockclass here could be problematic.
>
> I hope not. I will need some direction here as I've not had problems
> (yet).

It's actually not that bad, take a look at the NNP/nosuid patch from
Stephen, af63f4193f9f ("selinux: Generalize support for

Re: [RFC PATCH 4/5] netlabel: Add SCTP support

2017-11-13 Thread Paul Moore
On Mon, Nov 13, 2017 at 3:50 PM, Richard Haines
 wrote:
> On Mon, 2017-11-06 at 18:15 -0500, Paul Moore wrote:
>> On Tue, Oct 17, 2017 at 9:58 AM, Richard Haines
>>  wrote:
>> > Add support to label SCTP associations and cater for a situation
>> > where
>> > family = PF_INET6 with an ip_hdr(skb)->version = 4.
>> >
>> > Signed-off-by: Richard Haines 
>> > ---
>> >  include/net/netlabel.h|  3 ++
>> >  net/netlabel/netlabel_kapi.c  | 80
>> > +++
>> >  net/netlabel/netlabel_unlabeled.c | 10 +
>> >  3 files changed, 93 insertions(+)

>> >  /**
>> > + * netlbl_sctp_setattr - Label an incoming sctp association socket
>> > using
>> > + * the correct protocol
>> > + * @sk: the socket to label
>> > + * @skb: the packet
>> > + * @secattr: the security attributes
>> > + *
>> > + * Description:
>> > + * Attach the correct label to the given socket using the security
>> > attributes
>> > + * specified in @secattr.  Returns zero on success, negative
>> > values on failure.
>> > + *
>> > + */
>> > +int netlbl_sctp_setattr(struct sock *sk,
>> > +   struct sk_buff *skb,
>> > +   const struct netlbl_lsm_secattr *secattr)
>> > +{
>> > +   int ret_val = -EINVAL;
>> > +   struct netlbl_dommap_def *entry;
>> > +   struct iphdr *hdr4;
>> > +#if IS_ENABLED(CONFIG_IPV6)
>> > +   struct ipv6hdr *hdr6;
>> > +#endif
>> > +
>> > +   rcu_read_lock();
>> > +   switch (sk->sk_family) {
>> > +   case AF_INET:
>> > +   hdr4 = ip_hdr(skb);
>> > +
>> > +   entry = netlbl_domhsh_getentry_af4(secattr->domain,
>> > +  hdr4->saddr);
>> > +   if (entry == NULL) {
>> > +   ret_val = -ENOENT;
>> > +   goto sctp_setattr_return;
>> > +   }
>> > +   switch (entry->type) {
>> > +   case NETLBL_NLTYPE_CIPSOV4:
>> > +   ret_val = cipso_v4_sock_setattr(sk, entry-
>> > >cipso,
>> > +   secattr);
>> > +   break;
>> > +   case NETLBL_NLTYPE_UNLABELED:
>> > +   netlbl_sock_delattr(sk);
>> > +   ret_val = 0;
>> > +   break;
>> > +   default:
>> > +   ret_val = -ENOENT;
>> > +   }
>> > +   break;
>> > +#if IS_ENABLED(CONFIG_IPV6)
>> > +   case AF_INET6:
>> > +   hdr6 = ipv6_hdr(skb);
>> > +   entry = netlbl_domhsh_getentry_af6(secattr->domain,
>> > +  &hdr6->saddr);
>> > +   if (entry == NULL) {
>> > +   ret_val = -ENOENT;
>> > +   goto sctp_setattr_return;
>> > +   }
>> > +   switch (entry->type) {
>> > +   case NETLBL_NLTYPE_CALIPSO:
>> > +   ret_val = calipso_sock_setattr(sk, entry-
>> > >calipso,
>> > +  secattr);
>> > +   break;
>> > +   case NETLBL_NLTYPE_UNLABELED:
>> > +   netlbl_sock_delattr(sk);
>> > +   ret_val = 0;
>> > +   break;
>> > +   default:
>> > +   ret_val = -ENOENT;
>> > +   }
>> > +   break;
>> > +#endif /* IPv6 */
>> > +   default:
>> > +   ret_val = -EPROTONOSUPPORT;
>> > +   }
>> > +
>> > +sctp_setattr_return:
>> > +   rcu_read_unlock();
>> > +   return ret_val;
>> > +}
>>
>> It seems like we should try to leverage the code in
>> netlbl_conn_setattr() a bit more.  I would suggest either tweaking
>> the
>> callers to use a sockaddr struct and netlbl_conn_setattr(), or
>> implement netlbl_sctp_setattr() as a simple wrapper around
>> netlbl_conn_setattr() ... the former seems a bit cleaner, but I
>> suspect patch 5/5 will make it clear which approach is better.
>
> I've now modified selinux_netlbl_sctp_assoc_request() to extract the ip
> address info from skb and add to a sockaddr struct, then call
> netlbl_conn_setattr(). This removes any specific SCTP code from
> netlabel_kapi.c

Great, I think that's probably the best option.
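
For reference, a rough sketch of that approach -- build a sockaddr from the
INIT/INIT_ACK skb and hand it to netlbl_conn_setattr() -- is below; the
function shape is an assumption and the secattr setup, error handling and
IPv6 branch are elided, so this is not the actual patch:

/* Simplified sketch only; secattr setup, error handling and the IPv6
 * branch are elided, and the signature is assumed.
 */
static int selinux_netlbl_sctp_assoc_request(struct sctp_endpoint *ep,
					     struct sk_buff *skb)
{
	struct netlbl_lsm_secattr secattr;
	struct sockaddr_in addr4;
	int rc;

	/* ... populate secattr from the association's peer SID ... */

	addr4.sin_family = AF_INET;
	addr4.sin_addr.s_addr = ip_hdr(skb)->saddr;
	rc = netlbl_conn_setattr(ep->base.sk,
				 (struct sockaddr *)&addr4, &secattr);
	return rc;
}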

>> > diff --git a/net/netlabel/netlabel_unlabeled.c
>> > b/net/netlabel/netlabel_unlabeled.c
>> > index 22dc1b9..c070dfc 100644
>> > --- a/net/netlabel/netlabel_unlabeled.c
>> > +++ b/net/netlabel/netlabel_unlabeled.c
>> > @@ -1472,6 +1472,16 @@ int netlbl_unlabel_getattr(const struct
>> > sk_buff *skb,
>> > iface = rcu_dereference(netlbl_unlhsh_def);
>> > if (iface == NULL || !iface->valid)
>> > goto unlabel_getattr_nolabel;
>> > +
>> > +#if IS_ENABLED(CONFIG_IPV6)
>> > +   /* When resolving a fallback label, check the sk_buff
>> > version as
>> > +* it is possibl

Re: [PATCH] netlabel: If PF_INET6, check sk_buff ip header version

2017-11-13 Thread Paul Moore
On Mon, Nov 13, 2017 at 3:54 PM, Richard Haines
 wrote:
> When resolving a fallback label, check the sk_buff version as it
> is possible (e.g. SCTP) to have family = PF_INET6 while
> receiving ip_hdr(skb)->version = 4.
>
> Signed-off-by: Richard Haines 
> ---
>  net/netlabel/netlabel_unlabeled.c | 10 ++
>  1 file changed, 10 insertions(+)

Thanks Richard.

Acked-by: Paul Moore 

> diff --git a/net/netlabel/netlabel_unlabeled.c 
> b/net/netlabel/netlabel_unlabeled.c
> index 22dc1b9..c070dfc 100644
> --- a/net/netlabel/netlabel_unlabeled.c
> +++ b/net/netlabel/netlabel_unlabeled.c
> @@ -1472,6 +1472,16 @@ int netlbl_unlabel_getattr(const struct sk_buff *skb,
> iface = rcu_dereference(netlbl_unlhsh_def);
> if (iface == NULL || !iface->valid)
> goto unlabel_getattr_nolabel;
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +   /* When resolving a fallback label, check the sk_buff version as
> +* it is possible (e.g. SCTP) to have family = PF_INET6 while
> +* receiving ip_hdr(skb)->version = 4.
> +*/
> +   if (family == PF_INET6 && ip_hdr(skb)->version == 4)
> +   family = PF_INET;
> +#endif /* IPv6 */
> +
> switch (family) {
> case PF_INET: {
> struct iphdr *hdr4;
> --
> 2.13.6

-- 
paul moore
www.paul-moore.com


[PATCH][v3] uprobes/x86: emulate push insns for uprobe on x86

2017-11-13 Thread Yonghong Song
Uprobe is a tracing mechanism for userspace programs.
Typical uprobe will incur overhead of two traps.
First trap is caused by replaced trap insn, and
the second trap is to execute the original displaced
insn in user space.

To reduce the overhead, kernel provides hooks
for architectures to emulate the original insn
and skip the second trap. In x86, emulation
is done for certain branch insns.

This patch extends the emulation to "push "
insns. These insns are typically found at the beginning
of a function. For example, the bcc repo at
https://github.com/iovisor/bcc provides
tools to measure function latency (funclatency), detect memory
leaks (memleak), etc. These tools place uprobes at the beginning
of functions and possibly uretprobes at the end of functions.
This patch is able to reduce the trap overhead for
uprobe from 2 to 1.

Without this patch, uretprobe will typically incur
three traps. With this patch, if the function starts
with "push" insn, the number of traps can be
reduced from 3 to 2.

An experiment was conducted on two local VMs,
a Fedora 26 64-bit VM and a 32-bit VM, both with 4 processors
and 4GB memory, booted with the latest tip repo (and this patch).
The host is a MacBook with an Intel i7 processor.

The test program looks like
  #include 
  #include 
  #include 
  #include 

  static void test() __attribute__((noinline));
  void test() {}
  int main() {
struct timeval start, end;

gettimeofday(&start, NULL);
for (int i = 0; i < 100; i++) {
  test();
}
gettimeofday(&end, NULL);

printf("%ld\n", ((end.tv_sec * 100 + end.tv_usec)
 - (start.tv_sec * 100 + start.tv_usec)));
return 0;
  }

The program is compiled without optimization, and
the first insn for function "test" is "push %rbp".
The host is relatively idle.

Before the test run, the uprobe is inserted as below for uprobe:
  echo 'p :' > /sys/kernel/debug/tracing/uprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
and for uretprobe:
  echo 'r :' > /sys/kernel/debug/tracing/uprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable

Unit: microsecond(usec) per loop iteration

x86_64  W/ this patch   W/O this patch
uprobe  1.553.1
uretprobe   2.0 3.6

x86_32  W/ this patch   W/O this patch
uprobe  1.413.5
uretprobe   1.754.0

You can see that this patch significantly reduced the overhead,
50% for uprobe and 44% for uretprobe on x86_64, and even more
on x86_32.

Signed-off-by: Yonghong Song 
---
 arch/x86/include/asm/uprobes.h |   4 ++
 arch/x86/kernel/uprobes.c  | 111 +++--
 2 files changed, 111 insertions(+), 4 deletions(-)

Changelogs:
v2 -> v3:
  . Do not emulate 32bit application on x86_64 platforms
v1 -> v2:
  . Address Oleg's comments

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 74f4c2f..d8bfa98 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -53,6 +53,10 @@ struct arch_uprobe {
u8  fixups;
u8  ilen;
}   defparam;
+   struct {
+   u8  reg_offset; /* to the start of pt_regs */
+   u8  ilen;
+   }   push;
};
 };
 
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index a3755d2..007d8e77 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -528,11 +528,11 @@ static int default_pre_xol_op(struct arch_uprobe 
*auprobe, struct pt_regs *regs)
return 0;
 }
 
-static int push_ret_address(struct pt_regs *regs, unsigned long ip)
+static int emulate_push_stack(struct pt_regs *regs, unsigned long val)
 {
unsigned long new_sp = regs->sp - sizeof_long();
 
-   if (copy_to_user((void __user *)new_sp, &ip, sizeof_long()))
+   if (copy_to_user((void __user *)new_sp, &val, sizeof_long()))
return -EFAULT;
 
regs->sp = new_sp;
@@ -566,7 +566,7 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, 
struct pt_regs *regs
regs->ip += correction;
} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
regs->sp += sizeof_long(); /* Pop incorrect return address */
-   if (push_ret_address(regs, utask->vaddr + 
auprobe->defparam.ilen))
+   if (emulate_push_stack(regs, utask->vaddr + 
auprobe->defparam.ilen))
return -ERESTART;
}
/* popf; tell the caller to not touch TF */
@@ -655,7 +655,7 @@ static bool branch_emulate_op(struct arch_uprobe *auprobe, 
struct pt_regs *regs)
 *
 * But there is corner case, see the comment in ->post_xol().
 */
-   if (push_ret_address(regs, new_ip))
+   if (emulate_push_stack(regs, new_ip))
return false;
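
Conceptually, the emulation step that the new push.reg_offset and
push.ilen fields feed boils down to something like the following sketch
(not the patch's verbatim code): read the saved register out of pt_regs
at the precomputed offset, push it on the user stack with
emulate_push_stack(), and advance the ip past the emulated insn.

/* Sketch of the emulate step; the real patch may differ in details. */
static bool push_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
{
	unsigned long *src_ptr = (void *)regs + auprobe->push.reg_offset;

	if (emulate_push_stack(regs, *src_ptr))
		return false;
	regs->ip += auprobe->push.ilen;
	return true;
}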
 

Re: [RFC PATCH 5/5] selinux: Add SCTP support

2017-11-13 Thread Richard Haines
On Mon, 2017-11-06 at 19:09 -0500, Paul Moore wrote:
> On Tue, Oct 17, 2017 at 9:59 AM, Richard Haines
>  wrote:
> > The SELinux SCTP implementation is explained in:
> > Documentation/security/SELinux-sctp.txt
> > 
> > Signed-off-by: Richard Haines 
> > ---
> >  Documentation/security/SELinux-sctp.txt | 108 +
> >  security/selinux/hooks.c| 268
> > ++--
> >  security/selinux/include/classmap.h |   3 +-
> >  security/selinux/include/netlabel.h |   9 +-
> >  security/selinux/include/objsec.h   |   5 +
> >  security/selinux/netlabel.c |  52 ++-
> >  6 files changed, 427 insertions(+), 18 deletions(-)
> >  create mode 100644 Documentation/security/SELinux-sctp.txt
> > 
> > diff --git a/Documentation/security/SELinux-sctp.txt
> > b/Documentation/security/SELinux-sctp.txt
> > new file mode 100644
> > index 000..32e0255
> > --- /dev/null
> > +++ b/Documentation/security/SELinux-sctp.txt
> > @@ -0,0 +1,108 @@
> 
> See my previous comments about moving to reStructuredText for these
> docs.

Done
> 
> > +   SCTP SELinux Support
> > +  ==
> > +
> > +Security Hooks
> > +===
> > +
> > +The Documentation/security/LSM-sctp.txt document describes how the
> > following
> > +sctp security hooks are utilised:
> > +security_sctp_assoc_request()
> > +security_sctp_bind_connect()
> > +security_sctp_sk_clone()
> > +
> > +security_inet_conn_established()
> > +
> > +
> > +Policy Statements
> > +==
> > +The following class and permissions to support SCTP are available
> > within the
> > +kernel:
> > +class sctp_socket inherits socket { node_bind }
> > +
> > +whenever the following policy capability is enabled:
> > +policycap extended_socket_class;
> > +
> > +The SELinux SCTP support adds the additional permissions that are
> > explained
> > +in the sections below:
> > +association bindx connectx
> 
> Is the distinction between bind and bindx significant?  The same
> question applies to connect/connectx.  I think we can probably just
> reuse bind and connect in these cases.

This has been discussed before with Marcelo and keeping bindx/connectx
is a useful distinction.
> 
> See my question on sctp_socket:association below.
> 
> > +If userspace tools have been updated, SCTP will support the
> > portcon
> > +statement as shown in the following example:
> > +portcon sctp 1024-1036 system_u:object_r:sctp_ports_t:s0
> > +
> > +
> > +SCTP Bind, Connect and ASCONF Chunk Parameter Permission Checks
> > +
> > +The hook security_sctp_bind_connect() is called by SCTP to check
> > permissions
> > +required for ipv4/ipv6 addresses based on the @optname as follows:
> > +
> > +  --
> > 
> > +  |  BINDX Permission
> > Check|
> > +  |   @optname | @address
> > contains |
> > +  ||
> > ---|
> > +  | SCTP_SOCKOPT_BINDX_ADD | One or more ipv4 / ipv6 addresses
> > |
> > +  --
> > 
> > +
> > +  --
> > 
> > +  |  BIND Permission
> > Checks|
> > +  |   @optname | @address
> > contains |
> > +  ||
> > ---|
> > +  | SCTP_PRIMARY_ADDR  | Single ipv4 or ipv6
> > address   |
> > +  | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6
> > address   |
> > +  --
> > 
> > +
> > +  --
> > 
> > +  | CONNECTX Permission
> > Check  |
> > +  |   @optname | @address
> > contains |
> > +  ||
> > ---|
> > +  | SCTP_SOCKOPT_CONNECTX  | One or more ipv4 / ipv6 addresses
> > |
> > +  --
> > 
> > +
> > +  --
> > 
> > +  | CONNECT Permission
> > Checks  |
> > +  |   @optname | @address
> > contains |
> > +  ||
> > ---|
> > +  | SCTP_SENDMSG_CONNECT   | Single ipv4 or ipv6
> > address   |
> > +  | SCTP_PARAM_ADD_IP  | One or more ipv4 / ipv6 addresses
> > |
> > +  | SCTP_PARAM_SET_PRIMARY | Single ipv4 or ipv6
> > address   |
> > +  --
> > 

Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Linus Torvalds
On Mon, Nov 13, 2017 at 1:41 PM, Andi Kleen  wrote:
>
> It seems to be broken for normal programs too

Ok.

I wonder if there is some way to use the split format for the
intermediate files, but then for the very final link bring them all in
and make the end result be a traditional single binary. I'm not
talking the separate "dwp" package that packs multiple dwo files into
one, but to actually link them all in at the one.

Sure, it would lose some of the advantage, but I think a large portion
of the -gsplit-dwarf advantage is about the intermediate state. No?

I tried to google for it, but couldn't find anything. But apparently
elfutils doesn't support dwo files either. It seems mainly the linker
and gdb itself that supports it.

  Linus


Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Sagi Grimberg



Can you explain what do you mean by "subsystem"? I thought that the
subsystem would be the irq subsystem (which means you are the one to provide
the needed input :) ) and the driver would pass in something
like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
requirements that you listed and NULL to tell the core to leave it alone
and do what it sees fit (or pass msi_irq_ops with flag that means that).

ops structure is a very common way for drivers to communicate with a
subsystem core.


So if you look at the above pseudo code then the subsys_*_move_callbacks
are probably subsystem specific, i.e. block or networking.

Those subsystem callbacks might either handle it at the subsystem level
directly or call into the particular driver.


Personally I do not think that integrating this to networking/block
stacks in that level is going to work. drivers don't communicate any
information on what they do with msi(x) vectors (and I'm not sure they
should).

I think that driver that uses managed facilities is up to its own
discretion, it interacts with the irq subsystem to allocate the vectors
so it make sense to me that it should pass in the ops directly and
handle the callouts.


That's certainly out of the scope what the generic interrupt code can do :)


Don't get me wrong, I'm all for adding useful helpers in net/ and block/
if drivers need some services from the core subsystem or if drivers end
up sharing lots of logic.

For example, drivers already take care of draining queues when device
hot unplug is triggered, so they must be able to get it right for
IRQ vector migration (at least I assume).
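
To make the msi_irq_ops idea concrete, a skeleton could look like the
following; this is purely illustrative -- neither the structure nor the
pci_alloc_irq_vectors() hookup exists today:

/* Purely illustrative, no such API exists today: a driver opting in to
 * affinity overrides would hand these callbacks to the core when it
 * allocates its managed vectors.
 */
struct irq_affinity_ops {
	/* Called before the core moves the vector; a non-zero return vetoes
	 * the move.  Expected to allocate node-affine resources and to
	 * quiesce/drain the queue bound to this vector.
	 */
	int (*pre_move)(unsigned int virq, const struct cpumask *new_mask,
			void *priv);
	/* Called after the new vector is installed; re-enable the queue. */
	void (*post_move)(unsigned int virq, const struct cpumask *new_mask,
			  void *priv);
	void *priv;
};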


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
On Mon, Nov 13, 2017 at 12:56:31PM -0800, Linus Torvalds wrote:
> On Mon, Nov 13, 2017 at 12:10 PM, Andi Kleen  wrote:
> >
> > You're right. It works for line information, but strangely not for
> > inlines. I assume it can be fixed.
> 
> So I'm not 100% sure it's strictly a addr2line bug.

It seems to be broken for normal programs too

$ cat tinline.c

int i;

static inline int finline(void)
{
i++;
}

main()
{
finline();
}

$ gcc -O2 -gsplit-dwarf tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
$ gcc -O2  -g tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
/home/ak/tsrc/tinline.c:12
$

I filed https://sourceware.org/bugzilla/show_bug.cgi?id=22434


-Andi



Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Thomas Gleixner
On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > > >  #1 Before the core tries to move the interrupt so it can veto
> > > > the
> > > >   move if it cannot allocate new resources or whatever is 
> > > > required
> > > >   to operate after the move.
> > > 
> > > What would the core do if a driver veto a move?
> > 
> > Return the error code from write_affinity() as it does with any other error
> > which fails to set the affinity.
> 
> OK, so this would mean that the driver queue no longer has a vector
> correct? so is the semantics that it needs to cleanup its resources or
> should it expect another callout for that?

The driver queue still has the old vector, i.e.

echo XXX > /proc/irq/YYY/affinity

 write_irq_affinity(newaffinity)

newvec = reserve_new_vector();

ret = subsys_pre_move_callback(, newaffinity);

if (ret) {
drop_new_vector(newvec);
return ret;
}

shutdown(oldvec);
install(newvec);

susbsys_post_move_callback()

startup(newvec);

subsys_pre_move_callback()

ret = do_whatever();
if (ret)
return ret;

/*
 * Make sure nothing is queued anymore and outstanding
 * requests are completed. Same as for managed CPU hotplug.
 */
stop_and_drain_queue();
return 0;

subsys_post_move_callback()

install_new_data();

/* Reenable queue. Same as for managed CPU hotplug */
reenable_queue();

free_old_data();
return;

Does that clarify the mechanism?

> > > This looks like it can work to me, but I'm probably not familiar enough
> > > to see the full picture here.
> > 
> > On the interrupt core side this is workable, I just need the input from the
> > driver^Wsubsystem side if this can be implemented sanely.
> 
> Can you explain what do you mean by "subsystem"? I thought that the
> subsystem would be the irq subsystem (which means you are the one to provide
> the needed input :) ) and the driver would pass in something
> like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
> requirements that you listed and NULL to tell the core to leave it alone
> and do what it sees fit (or pass msi_irq_ops with flag that means that).
> 
> ops structure is a very common way for drivers to communicate with a
> subsystem core.

So if you look at the above pseudo code then the subsys_*_move_callbacks
are probably subsystem specific, i.e. block or networking.

Those subsystem callbacks might either handle it at the subsystem level
directly or call into the particular driver.

That's certainly out of the scope what the generic interrupt code can do :)

Thanks,

tglx



RE: tipc_udp_send_msg oops in 4.4 when setting link tolerance

2017-11-13 Thread Jon Maloy
Hi Tommi,
I am not sure, but it seems like the following patch is what you need:
commit 9b3009604b8e ("tipc: add net device to skb before UDP xmit")
This was applied in tipc 4.5.

Is this a stopping problem for you?

BR
///jon

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Tommi Rantala
> Sent: Monday, November 13, 2017 11:23
> To: Jon Maloy ; Ying Xue
> ; David S. Miller ;
> netdev@vger.kernel.org; tipc-discuss...@lists.sourceforge.net; linux-
> ker...@vger.kernel.org
> Subject: tipc_udp_send_msg oops in 4.4 when setting link tolerance
> 
> Hi,
> 
> I always get an instant TIPC oops in 4.4, when I try to set the link tolerance
> (with LINKNAME != "broadcast-link"):
> 
>   $ tipc link set tolerance 1000 link $LINKNAME
> 
> Any idea what's going on? Some tipc patch missing in 4.4?
> 
> In 4.9 the "tipc" command executes just fine, but I've seen a few times that
> later some random process crashes with "BUG: Bad page state". KASAN does
> not report anything before it happens.
> 
> 4.14 is OK, could not reproduce these problems with it.
> 
> 
> 
> 
> tipc_udp_send_msg+0x102/0x4f0
> 
> matches to:
> tipc_udp_send_msg at linux-stable/net/tipc/udp_media.c:172
> 
> static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb,
>   struct tipc_bearer *b,
>   struct tipc_media_addr *dest) {
>  int ttl, err = 0;
>  struct udp_bearer *ub;
>  struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
>  struct udp_media_addr *src = (struct udp_media_addr *)&b-
> >addr.value;
>  struct rtable *rt;
> 
>  if (skb_headroom(skb) < UDP_MIN_HEADROOM) {
>  err = pskb_expand_head(skb, UDP_MIN_HEADROOM, 0,
> GFP_ATOMIC);
>  if (err)
>  goto tx_error;
>  }
> 
>  skb_set_inner_protocol(skb, htons(ETH_P_TIPC));
>  ub = rcu_dereference_rtnl(b->media_ptr);
>  if (!ub) {
>  err = -ENODEV;
>  goto tx_error;
>  }
>  if (dst->proto == htons(ETH_P_IP)) {   <-- HERE
> 
> 
> 
> [  111.423647]
> ==
> 
> [  111.424826] BUG: KASAN: null-ptr-deref on address   (null)
> [  111.425538] Read of size 2 by task tipc/2643 [  111.426215] CPU: 3 PID: 
> 2643
> Comm: tipc Not tainted 4.4.97-pc64 #1 [  111.428081]  
> 880026327478 8248005e
> 0002
> [  111.429476]  880047ad5ac0 8800263274f8 8227f5af
> 000265711040
> [  111.430728]   0297 a0387fd2
> 02090220 [  111.432051] Call Trace:
> [  111.432472]  [] dump_stack+0x86/0xc8 [  111.433208]
> [] kasan_report.part.2+0x41f/0x520 [  111.434040]
> [] ? tipc_udp_send_msg+0x102/0x4f0 [tipc] [  111.434908]
> [] kasan_report+0x25/0x30 [  111.435647]
> [] __asan_load2+0x66/0x70 [  111.436391]
> [] tipc_udp_send_msg+0x102/0x4f0 [tipc] [  111.437334]
> [] ? kasan_kmalloc+0x5e/0x70 [  111.438301]
> [] ? kasan_slab_alloc+0xd/0x10 [  111.439328]
> [] ?
> __kmalloc_node_track_caller+0xac/0x230
> [  111.440493]  [] ? kasan_kmalloc+0x5e/0x70 [
> 111.441479]  [] ? tipc_udp_disable+0xe0/0xe0 [tipc] [
> 111.442628]  [] ? kasan_kmalloc+0x5e/0x70 [  111.443598]
> [] ? kasan_krealloc+0x62/0x80 [  111.444610]
> [] ? memset+0x28/0x30 [  111.445539]  []
> ? __alloc_skb+0x2b3/0x310 [  111.446560]  [] ?
> skb_complete_tx_timestamp+0x110/0x110
> [  111.447695]  [] ? __module_text_address+0x16/0xa0 [
> 111.448735]  [] ? skb_put+0x8b/0xd0 [  111.449608]
> [] ? memcpy+0x36/0x40 [  111.450524]
> [] ?
> tipc_link_build_proto_msg+0x398/0x4c0 [tipc] [  111.451946]
> [] tipc_bearer_xmit_skb+0xa0/0xb0 [tipc] [  111.453078]
> [] tipc_link_proto_xmit+0x11b/0x160 [tipc] [  111.454218]
> [] ?
> tipc_link_build_reset_msg+0x50/0x50 [tipc] [  111.455542]
> [] tipc_nl_link_set+0x1ee/0x3b0 [tipc] [  111.456659]
> [] ? tipc_nl_parse_link_prop+0xd0/0xd0 [tipc] [
> 111.457831]  [] ? is_ftrace_trampoline+0x59/0x90 [
> 111.458884]  [] ? __kernel_text_address+0x65/0x80 [
> 111.459931]  [] ? nla_parse+0xb6/0x140 [  111.460892]
> [] genl_family_rcv_msg+0x37e/0x5e0 [  111.461948]
> [] ? set_orig_addr.isra.53+0xe5/0x120 [tipc] [  111.463107]
> [] ? genl_rcv+0x40/0x40 [  111.463987]
> [] ? alloc_debug_processing+0x154/0x180
> [  111.465048]  [] ? ___slab_alloc+0x43d/0x460 [
> 111.465986]  [] ? alloc_debug_processing+0x154/0x180
> [  111.467045]  [] ? netlink_lookup+0x19c/0x220 [
> 111.468067]  [] genl_rcv_msg+0xd8/0x110 [  111.468994]
> [] netlink_rcv_skb+0x14b/0x180 [  111.469939]
> [] ? genl_family_rcv_msg+0x5e0/0x5e0 [  111.470954]
> [] genl_rcv+0x28/0x40 [  111.471798]  []
> netlink_unicast+0x2e7/0x3a0 [  111.472806]  [] ?
> netlink_attachskb+0x330/0x330 [  111.473845]  [] ?
> copy_from_iter+0xf1/0x3b0 [  111.474847]  []
> netlink_sendmsg+0x4ad/0x620 [  111.475

Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Sagi Grimberg



Do you know if any exist? Would it make sense to have a survey to
understand if anyone relies on it?

 From what I've seen so far, drivers that were converted simply worked
with the non-managed facility and didn't have any special code for it.
Perhaps Christoph can comment as he convert most of them.

But if there aren't any drivers that absolutely rely on it, maybe its
not a bad idea to allow it by default?


Sure, I was just cautious and I have to admit that I have no insight into
the driver side details.


Christoph, feel free to chime in :)

Should I construct an email list of the driver maintainers of the
converted drivers?


* When and how is the driver informed about the change?

   When:

 #1 Before the core tries to move the interrupt so it can veto the
  move if it cannot allocate new resources or whatever is required
  to operate after the move.


What would the core do if a driver veto a move?


Return the error code from write_affinity() as it does with any other error
which fails to set the affinity.


OK, so this would mean that the driver queue no longer has a vector
correct? so is the semantics that it needs to cleanup its resources or
should it expect another callout for that?


I'm wondering in what conditions a driver will be unable to allocate
resources for move to cpu X but able to allocate for move to cpu Y.


Node affine memory allocation is the only thing which comes to my mind, or
some decision not to have a gazillion of queues on a single CPU.


Yea, makes sense.


This looks like it can work to me, but I'm probably not familiar enough
to see the full picture here.


On the interrupt core side this is workable, I just need the input from the
driver^Wsubsystem side if this can be implemented sanely.


Can you explain what do you mean by "subsystem"? I thought that the
subsystem would be the irq subsystem (which means you are the one to 
provide the needed input :) ) and the driver would pass in something

like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
requirements that you listed and NULL to tell the core to leave it alone
and do what it sees fit (or pass msi_irq_ops with flag that means that).

ops structure is a very common way for drivers to communicate with a
subsystem core.


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Linus Torvalds
On Mon, Nov 13, 2017 at 12:10 PM, Andi Kleen  wrote:
>
> You're right. It works for line information, but strangely not for
> inlines. I assume it can be fixed.

So I'm not 100% sure it's strictly a addr2line bug.

We do more than just link the vmlinux file. There's that whole
complicated script for our final link in

scripts/link-vmlinux.sh

which links in all the kallsyms information etc, but also does things
like the exception table sorting.

So it's possible that addr2line works fine on a normal executable, but
that we've done some munging that then makes it not really do the
right thing for vmlinux.

But it is also equally possible that addr2line simply doesn't
understand -gsplit-dwarf at all. I've never seen it used outside the
kernel.

   Linus


[PATCH] netlabel: If PF_INET6, check sk_buff ip header version

2017-11-13 Thread Richard Haines
When resolving a fallback label, check the sk_buff version as it
is possible (e.g. SCTP) to have family = PF_INET6 while
receiving ip_hdr(skb)->version = 4.

Signed-off-by: Richard Haines 
---
 net/netlabel/netlabel_unlabeled.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/netlabel/netlabel_unlabeled.c 
b/net/netlabel/netlabel_unlabeled.c
index 22dc1b9..c070dfc 100644
--- a/net/netlabel/netlabel_unlabeled.c
+++ b/net/netlabel/netlabel_unlabeled.c
@@ -1472,6 +1472,16 @@ int netlbl_unlabel_getattr(const struct sk_buff *skb,
iface = rcu_dereference(netlbl_unlhsh_def);
if (iface == NULL || !iface->valid)
goto unlabel_getattr_nolabel;
+
+#if IS_ENABLED(CONFIG_IPV6)
+   /* When resolving a fallback label, check the sk_buff version as
+* it is possible (e.g. SCTP) to have family = PF_INET6 while
+* receiving ip_hdr(skb)->version = 4.
+*/
+   if (family == PF_INET6 && ip_hdr(skb)->version == 4)
+   family = PF_INET;
+#endif /* IPv6 */
+
switch (family) {
case PF_INET: {
struct iphdr *hdr4;
-- 
2.13.6



Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Thomas Gleixner
On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > 3) Affinity override in managed mode
> > 
> >   Doable, but there are a couple of things to think about:
> 
> I think that it will be good to shoot for (3). Given that there are
> driver requirements I'd say that driver will expose up front if it can
> handle it, and if not we fallback to (1).
> 
> >* How is this enabled?
> > 
> >  - Opt-in by driver
> > 
> >  - Extra sysfs/procfs knob
> > 
> >  We definitely should not enable it per default because that would
> >  surprise users/drivers which work with the current managed devices and
> >  rely on the affinity files to be non writeable in managed mode.
> 
> Do you know if any exist? Would it make sense to have a survey to
> understand if anyone relies on it?
> 
> From what I've seen so far, drivers that were converted simply worked
> with the non-managed facility and didn't have any special code for it.
> Perhaps Christoph can comment as he convert most of them.
> 
> But if there aren't any drivers that absolutely rely on it, maybe its
> not a bad idea to allow it by default?

Sure, I was just cautious and I have to admit that I have no insight into
the driver side details.

> >* When and how is the driver informed about the change?
> > 
> >   When:
> > 
> > #1 Before the core tries to move the interrupt so it can veto the
> >   move if it cannot allocate new resources or whatever is required
> >   to operate after the move.
> 
> What would the core do if a driver veto a move?

Return the error code from write_affinity() as it does with any other error
which fails to set the affinity.

> I'm wondering in what conditions a driver will be unable to allocate
> resources for move to cpu X but able to allocate for move to cpu Y.

Node affine memory allocation is the only thing which comes to my mind, or
some decision not to have a gazillion of queues on a single CPU. 

> This looks like it can work to me, but I'm probably not familiar enough
> to see the full picture here.

On the interrupt core side this is workable, I just need the input from the
driver^Wsubsystem side if this can be implemented sanely.

Thanks,

tglx



Re: [RFC PATCH 4/5] netlabel: Add SCTP support

2017-11-13 Thread Richard Haines
On Mon, 2017-11-06 at 18:15 -0500, Paul Moore wrote:
> On Tue, Oct 17, 2017 at 9:58 AM, Richard Haines
>  wrote:
> > Add support to label SCTP associations and cater for a situation
> > where
> > family = PF_INET6 with an ip_hdr(skb)->version = 4.
> > 
> > Signed-off-by: Richard Haines 
> > ---
> >  include/net/netlabel.h|  3 ++
> >  net/netlabel/netlabel_kapi.c  | 80
> > +++
> >  net/netlabel/netlabel_unlabeled.c | 10 +
> >  3 files changed, 93 insertions(+)
> > 
> > diff --git a/include/net/netlabel.h b/include/net/netlabel.h
> > index 72d6435..7348966 100644
> > --- a/include/net/netlabel.h
> > +++ b/include/net/netlabel.h
> > @@ -494,6 +494,9 @@ int netlbl_conn_setattr(struct sock *sk,
> > const struct netlbl_lsm_secattr *secattr);
> >  int netlbl_req_setattr(struct request_sock *req,
> >const struct netlbl_lsm_secattr *secattr);
> > +int netlbl_sctp_setattr(struct sock *sk,
> > +   struct sk_buff *skb,
> > +   const struct netlbl_lsm_secattr *secattr);
> >  void netlbl_req_delattr(struct request_sock *req);
> >  int netlbl_skbuff_setattr(struct sk_buff *skb,
> >   u16 family,
> > diff --git a/net/netlabel/netlabel_kapi.c
> > b/net/netlabel/netlabel_kapi.c
> > index ea7c670..1c82bbe 100644
> > --- a/net/netlabel/netlabel_kapi.c
> > +++ b/net/netlabel/netlabel_kapi.c
> > @@ -1121,6 +1121,7 @@ int netlbl_conn_setattr(struct sock *sk,
> > switch (addr->sa_family) {
> > case AF_INET:
> > addr4 = (struct sockaddr_in *)addr;
> > +
> 
> I'm guessing this bit of extra whitespace was an accident; but just
> in
> case, drop it from this patch please.
Done
> 
> > entry = netlbl_domhsh_getentry_af4(secattr->domain,
> >addr4-
> > >sin_addr.s_addr);
> > if (entry == NULL) {
> > @@ -1177,6 +1178,85 @@ int netlbl_conn_setattr(struct sock *sk,
> >  }
> > 
> >  /**
> > + * netlbl_sctp_setattr - Label an incoming sctp association socket
> > using
> > + * the correct protocol
> > + * @sk: the socket to label
> > + * @skb: the packet
> > + * @secattr: the security attributes
> > + *
> > + * Description:
> > + * Attach the correct label to the given socket using the security
> > attributes
> > + * specified in @secattr.  Returns zero on success, negative
> > values on failure.
> > + *
> > + */
> > +int netlbl_sctp_setattr(struct sock *sk,
> > +   struct sk_buff *skb,
> > +   const struct netlbl_lsm_secattr *secattr)
> > +{
> > +   int ret_val = -EINVAL;
> > +   struct netlbl_dommap_def *entry;
> > +   struct iphdr *hdr4;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +   struct ipv6hdr *hdr6;
> > +#endif
> > +
> > +   rcu_read_lock();
> > +   switch (sk->sk_family) {
> > +   case AF_INET:
> > +   hdr4 = ip_hdr(skb);
> > +
> > +   entry = netlbl_domhsh_getentry_af4(secattr->domain,
> > +  hdr4->saddr);
> > +   if (entry == NULL) {
> > +   ret_val = -ENOENT;
> > +   goto sctp_setattr_return;
> > +   }
> > +   switch (entry->type) {
> > +   case NETLBL_NLTYPE_CIPSOV4:
> > +   ret_val = cipso_v4_sock_setattr(sk, entry-
> > >cipso,
> > +   secattr);
> > +   break;
> > +   case NETLBL_NLTYPE_UNLABELED:
> > +   netlbl_sock_delattr(sk);
> > +   ret_val = 0;
> > +   break;
> > +   default:
> > +   ret_val = -ENOENT;
> > +   }
> > +   break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +   case AF_INET6:
> > +   hdr6 = ipv6_hdr(skb);
> > +   entry = netlbl_domhsh_getentry_af6(secattr->domain,
> > +  &hdr6->saddr);
> > +   if (entry == NULL) {
> > +   ret_val = -ENOENT;
> > +   goto sctp_setattr_return;
> > +   }
> > +   switch (entry->type) {
> > +   case NETLBL_NLTYPE_CALIPSO:
> > +   ret_val = calipso_sock_setattr(sk, entry-
> > >calipso,
> > +  secattr);
> > +   break;
> > +   case NETLBL_NLTYPE_UNLABELED:
> > +   netlbl_sock_delattr(sk);
> > +   ret_val = 0;
> > +   break;
> > +   default:
> > +   ret_val = -ENOENT;
> > +   }
> > +   break;
> > +#endif /* IPv6 */
> > +   default:
> > +   ret_val = -EPROTONOSUPPORT;
> > +   }
> > +
> 

Re: [PATCH][v2] uprobes/x86: emulate push insns for uprobe on x86

2017-11-13 Thread Yonghong Song



On 11/13/17 4:59 AM, Oleg Nesterov wrote:

The patch looks good to me, but I have a question because I know nothing
about insn encoding,

On 11/10, Yonghong Song wrote:


+static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
+{
+   u8 opc1 = OPCODE1(insn), reg_offset = 0;
+
+   if (opc1 < 0x50 || opc1 > 0x57)
+   return -ENOSYS;
+
+   if (insn->length > 2)
+   return -ENOSYS;
+   if (insn->length == 2) {
+   /* only support rex_prefix 0x41 (x64 only) */
+#ifdef CONFIG_X86_64
+   if (insn->rex_prefix.nbytes != 1 ||
+   insn->rex_prefix.bytes[0] != 0x41)
+   return -ENOSYS;
+
+   auprobe->push.ilen = 2;


and the "else" branch does

auprobe->push.ilen = 1;

you could add
auprobe->push.ilen = insn->length;

at the end of push_setup_xol_ops() instead, but this is minor/cosmetic,


Will make this change in the next revision.





+   switch (opc1) {
+   case 0x50:
+   reg_offset = offsetof(struct pt_regs, r8);
+   break;
+   case 0x51:
+   reg_offset = offsetof(struct pt_regs, r9);
+   break;
+   case 0x52:
+   reg_offset = offsetof(struct pt_regs, r10);
+   break;
+   case 0x53:
+   reg_offset = offsetof(struct pt_regs, r11);
+   break;
+   case 0x54:
+   reg_offset = offsetof(struct pt_regs, r12);
+   break;
+   case 0x55:
+   reg_offset = offsetof(struct pt_regs, r13);
+   break;
+   case 0x56:
+   reg_offset = offsetof(struct pt_regs, r14);
+   break;
+   case 0x57:
+   reg_offset = offsetof(struct pt_regs, r15);
+   break;
+   }
+#else
+   return -ENOSYS;
+#endif


OK, but shouldn't we also return ENOSYS if CONFIG_X86_64=y but the probed task 
is 32bit?


Just tested with a 32bit app on an x86 box and it segfaults. Yes, we would
need to return ENOSYS if the app is 32bit on a 64bit system. It may not
be worthwhile to emulate this uncommon case.

I will use mmap_is_ia32 or a variant to test whether the app is
32bit or not. Please let me know whether this is the correct approach or not.
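
Roughly what I have in mind (sketch only) for the top of
push_setup_xol_ops(), before the opcode checks:

	/* Sketch: bail out for 32bit apps so the push insn is not
	 * emulated there and we fall back to the normal single-step
	 * path.  Whether mmap_is_ia32() (which looks at the current
	 * task) or a check on the probed mm is the right predicate
	 * is exactly the open question here.
	 */
	if (mmap_is_ia32())
		return -ENOSYS;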



Or in this case uprobe_init_insn(x86_64 => false) should fail and 
push_setup_xol_ops()
won't be called?

Oleg.



Re: [PATCH net] geneve: show remote address and checksum info even after link down

2017-11-13 Thread Eric Garver
On Mon, Nov 13, 2017 at 05:03:26PM +0800, Hangbin Liu wrote:
> geneve->sock4/6 were added with geneve_open and released with geneve_stop.
> So when the geneve link is down, we will not be able to show the remote address
> and checksum info after commit 11387fe4a98 ("geneve: fix fill_info when using
> collect_metadata").
> 
> Fix this by resetting back to the previous behavior: only show the corresponding
> remote address and always show checksum info.

Thanks Hangbin,

I think we can do better though and maybe avoid another round of this
churn.

> 
> Fixes: 11387fe4a98 ("geneve: fix fill_info when using collect_metadata")
> Signed-off-by: Hangbin Liu 
> ---
>  drivers/net/geneve.c | 28 +++-
>  1 file changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index ed51018..b010db7 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1511,32 +1511,18 @@ static int geneve_fill_info(struct sk_buff *skb, 
> const struct net_device *dev)
>   if (nla_put_u32(skb, IFLA_GENEVE_ID, vni))
>   goto nla_put_failure;
>  
> - if (rtnl_dereference(geneve->sock4)) {
> + if (ip_tunnel_info_af(info) == AF_INET) {
>   if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
>   info->key.u.ipv4.dst))
>   goto nla_put_failure;
>  

We can avoid passing up *_REMOTE{,6} for COLLECT_METADATA. They're
mutually exclusive.

> - if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
> -!!(info->key.tun_flags & TUNNEL_CSUM)))
> - goto nla_put_failure;
> -
> - }
> -
>  #if IS_ENABLED(CONFIG_IPV6)
> - if (rtnl_dereference(geneve->sock6)) {
> + } else {
>   if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6,
>&info->key.u.ipv6.dst))
>   goto nla_put_failure;
> -
> - if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
> -!(info->key.tun_flags & TUNNEL_CSUM)))
> - goto nla_put_failure;
> -
> - if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
> -!geneve->use_udp6_rx_checksums))
> - goto nla_put_failure;
> - }
>  #endif
> + }
>  
>   if (nla_put_u8(skb, IFLA_GENEVE_TTL, info->key.ttl) ||
>   nla_put_u8(skb, IFLA_GENEVE_TOS, info->key.tos) ||
> @@ -1550,6 +1536,14 @@ static int geneve_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
>   if (nla_put_flag(skb, IFLA_GENEVE_COLLECT_METADATA))
>   goto nla_put_failure;
>   }
> +
> + if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
> +!!(info->key.tun_flags & TUNNEL_CSUM)) ||
> + nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
> +!(info->key.tun_flags & TUNNEL_CSUM)) ||

These two are nonsensical for COLLECT_METADATA as they come from the skb
tun_info. They can be moved to inside the ipv4/ipv6 checks. For
non-metadata, it doesn't make sense to pass them both as the tunnel is
either ipv4 or ipv6, but not both.

> + nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
> +!geneve->use_udp6_rx_checksums))
> + goto nla_put_failure;
>   return 0;
>  
>  nla_put_failure:
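
Something along these lines is what I'm thinking of -- a rough sketch only,
not a tested patch, assuming a local "metadata" flag like the one that
already guards the COLLECT_METADATA attribute above:

	if (!metadata) {
		if (ip_tunnel_info_af(info) == AF_INET) {
			if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
					    info->key.u.ipv4.dst) ||
			    nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
				       !!(info->key.tun_flags & TUNNEL_CSUM)))
				goto nla_put_failure;
#if IS_ENABLED(CONFIG_IPV6)
		} else {
			if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6,
					     &info->key.u.ipv6.dst) ||
			    nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
				       !(info->key.tun_flags & TUNNEL_CSUM)))
				goto nla_put_failure;
#endif
		}
	}

	if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
		       !geneve->use_udp6_rx_checksums))
		goto nla_put_failure;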


Re: [PATCH net-next v2] net: move decnet to staging

2017-11-13 Thread Eric Dumazet
On Mon, 2017-11-13 at 11:32 -0800, Joe Perches wrote:
> On Mon, 2017-11-13 at 09:11 -0800, Stephen Hemminger wrote:
> > Support for Decnet has been orphaned for some time.
> > In the interest of reducing the potential bug surface and pre-holiday
> > cleaning, move the decnet protocol into staging for eventual removal.
> []
> > diff --git a/drivers/staging/decnet/TODO b/drivers/staging/decnet/TODO
> []
> > @@ -0,0 +1,4 @@
> > +The DecNet code will be removed soon from the kernel tree as it is old,
> > +obsolete, and buggy.
> 
> Old and obsolete, well OK, but
> what's buggy about decnet?
> 
> https://bugzilla.kernel.org/buglist.cgi?quicksearch=decnet
> 
> Zarro Boogs found.
> 

Then that means nobody uses it.

And that the syzkaller guys never bothered to add code to actually trigger
the bugs that are probably there. They probably have bigger fish to
fry at this moment.

If we leave the code there, chances are high that some hacker will be
interested in exploiting the bugs.




Re: [PATCH net-next] bpf: expose sk_priority through struct bpf_sock_ops

2017-11-13 Thread Daniel Borkmann
On 11/13/2017 09:09 PM, Lawrence Brakmo wrote:
> On 11/13/17, 11:01 AM, "Vlad Dumitrescu"  wrote:
> 
> On Sat, Nov 11, 2017 at 2:38 PM, Alexei Starovoitov  wrote:
> >
> > On 11/12/17 4:46 AM, Daniel Borkmann wrote:
> >>
> >> On 11/11/2017 05:06 AM, Alexei Starovoitov wrote:
> >>>
> >>> On 11/11/17 6:07 AM, Daniel Borkmann wrote:
> 
>  On 11/10/2017 08:17 PM, Vlad Dumitrescu wrote:
> >
> > From: Vlad Dumitrescu 
> >
> > Allows BPF_PROG_TYPE_SOCK_OPS programs to read sk_priority.
> >
> > Signed-off-by: Vlad Dumitrescu 
> > ---
> >   include/uapi/linux/bpf.h   |  1 +
> >   net/core/filter.c  | 11 +++
> >   tools/include/uapi/linux/bpf.h |  1 +
> >   3 files changed, 13 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e880ae6434ee..9757a2002513 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -947,6 +947,7 @@ struct bpf_sock_ops {
> >   __u32 local_ip6[4];/* Stored in network byte order */
> >   __u32 remote_port;/* Stored in network byte order */
> >   __u32 local_port;/* stored in host byte order */
> > +__u32 priority;
> >   };
> > /* List of known BPF sock_ops operators.
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 61c791f9f628..a6329642d047 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4449,6 +4449,17 @@ static u32 sock_ops_convert_ctx_access(enum
> > bpf_access_type type,
> >   *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> > offsetof(struct sock_common, skc_num));
> >   break;
> > +
> > +case offsetof(struct bpf_sock_ops, priority):
> > +BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_priority) != 4);
> > +
> > +*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> > +struct bpf_sock_ops_kern, sk),
> > +  si->dst_reg, si->src_reg,
> > +  offsetof(struct bpf_sock_ops_kern, sk));
> > +*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> > +  offsetof(struct sock, sk_priority));
> > +break;
> 
> 
>  Hm, I don't think this would work, I actually think your initial 
> patch
>  was ok.
>  bpf_setsockopt() as well as bpf_getsockopt() check for 
> sk_fullsock(sk)
>  right
>  before accessing options on either socket or TCP level, and bail out
>  with error
>  otherwise; in such cases we'd read something else here and assume 
> it's
>  sk_priority.
> >>>
> >>>
> >>> even if it's not fullsock, it will just read zero, no? what's a 
> problem
> >>> with that?
> >>> In non-fullsock hooks like BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
> >>> the program author will know that it's meaningless to read 
> sk_priority,
> >>> so returning zero with minimal checks is fine.
> >>> While adding extra runtime if (sk_fullsock(sk)) is unnecessary,
> >>> since the safety is not compromised.
> >>
> >>
> >> Hm, on my kernel, struct sock has the 4 bytes sk_priority at offset 
> 440,
> >> struct request_sock itself is only 232 byte long in total, and the 
> struct
> >> inet_timewait_sock is 208 byte long, so you'd be accessing out of 
> bounds
> >> that way, so it cannot be ignored and assumed zero.
> >
> >
> > I thought we always pass fully allocated sock but technically not 
> fullsock yet. My mistake. We do: tcp_timeout_init((struct sock *)req))
> > so yeah ctx rewrite approach won't work.
> > Let's go back to access via helper.
> >
> 
> TIL. Thanks!
> 
> Is there anything else needed from me to get the helper approach accepted?
> 
> I plan to add access to TCP state variables (cwnd, rtt, etc.) and I have been 
> thinking
> about this issue. I think it is possible to access it directly as long as we 
> use a value
> like 0x to represent an invalid value (e.g. not fullsock). The ctx 
> rewrite just
> needs to add a conditional to determine what to return. I would probably add a
> field into the internal kernel struct to indicate if it is fullsock or not 
> (we should
> know when we call tcp_call_bpf whether it is a fullsock or not based on 
> context).
> 
> Let me do a sample patch that I can send for review and get feedback from
> Alexi and Daniel.

Agree, if the mov op from the ctx rewrite to read(/write) a sk member, for
example, is just a BPF_W, then we know upper reg bits are zero anyway for the
success case, so we might be able to uti

Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread H.J. Lu
On Mon, Nov 13, 2017 at 12:10 PM, Andi Kleen  wrote:
>> Put another way: the CONFIG_DEBUG_INFO_SPLIT option is useless. Yes,
>> it saves time and disk space, but does so at the expense of making all
>> the debug information unavailable to basic tools.
>
> You're right. It works for line information, but strangely not for
> inlines. I assume it can be fixed.
>

Is there a binutils bug report for this?


-- 
H.J.


[RFC PATCH 17/17] net: sched: lock once per bulk dequeue

2017-11-13 Thread John Fastabend
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 683f6ec..8ab7933 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -206,33 +206,22 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
 {
const struct netdev_queue *txq = q->dev_queue;
struct sk_buff *skb = NULL;
+   spinlock_t *lock = NULL;
 
-   *packets = 1;
-   if (unlikely(!skb_queue_empty(&q->gso_skb))) {
-   spinlock_t *lock = NULL;
-
-   if (q->flags & TCQ_F_NOLOCK) {
-   lock = qdisc_lock(q);
-   spin_lock(lock);
-   }
-
-   skb = skb_peek(&q->gso_skb);
-
-   /* skb may be null if another cpu pulls gso_skb off in between
-* empty check and lock.
-*/
-   if (!skb) {
-   if (lock)
-   spin_unlock(lock);
-   goto validate;
-   }
+   if (q->flags & TCQ_F_NOLOCK) {
+   lock = qdisc_lock(q);
+   spin_lock(lock);
+   }
 
+   *packets = 1;
+   skb = skb_peek(&q->gso_skb);
+   if (unlikely(skb)) {
/* skb in gso_skb were already validated */
*validate = false;
/* check the reason of requeuing without tx lock first */
txq = skb_get_tx_queue(txq->dev, skb);
if (!netif_xmit_frozen_or_stopped(txq)) {
-   skb = __skb_dequeue(&q->gso_skb);
+   __skb_unlink(skb, &q->gso_skb);
if (qdisc_is_percpu_stats(q)) {
qdisc_qstats_cpu_backlog_dec(q, skb);
qdisc_qstats_cpu_qlen_dec(q);
@@ -247,17 +236,17 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
spin_unlock(lock);
goto trace;
}
-validate:
-   *validate = true;
 
-   if ((q->flags & TCQ_F_ONETXQUEUE) &&
-   netif_xmit_frozen_or_stopped(txq))
-   return skb;
+   *validate = true;
 
skb = qdisc_dequeue_skb_bad_txq(q);
if (unlikely(skb))
goto bulk;
-   skb = q->dequeue(q);
+
+   if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+   !netif_xmit_frozen_or_stopped(txq))
+   skb = q->dequeue(q);
+
if (skb) {
 bulk:
if (qdisc_may_bulk(q))
@@ -265,6 +254,8 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
else
try_bulk_dequeue_skb_slow(q, skb, packets);
}
+   if (lock)
+   spin_unlock(lock);
 trace:
trace_qdisc_dequeue(q, txq, *packets, skb);
return skb;
@@ -624,7 +615,7 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc 
*qdisc)
if (__skb_array_empty(q))
continue;
 
-   skb = skb_array_consume_bh(q);
+   skb = __skb_array_consume(q);
}
if (likely(skb)) {
qdisc_qstats_cpu_backlog_dec(qdisc, skb);
@@ -661,7 +652,7 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
struct skb_array *q = band2list(priv, band);
struct sk_buff *skb;
 
-   while ((skb = skb_array_consume_bh(q)) != NULL)
+   while ((skb = __skb_array_consume(q)) != NULL)
__skb_array_destroy_skb(skb);
}
 



[RFC PATCH 16/17] net: skb_array additions for unlocked consumer

2017-11-13 Thread John Fastabend
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index c7addf3..d0a240e 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -142,6 +142,11 @@ static inline int skb_array_consume_batched_bh(struct 
skb_array *a,
return ptr_ring_consume_batched_bh(&a->ring, (void **)array, n);
 }
 
+static inline struct sk_buff *__skb_array_consume(struct skb_array *a)
+{
+   return __ptr_ring_consume(&a->ring);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
if (likely(skb)) {



[RFC PATCH 14/17] net: skb_array: expose peek API

2017-11-13 Thread John Fastabend
This adds a peek routine to skb_array.h for use with qdisc.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 8621ffd..c7addf3 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -72,6 +72,11 @@ static inline bool __skb_array_empty(struct skb_array *a)
return !__ptr_ring_peek(&a->ring);
 }
 
+static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
+{
+   return __ptr_ring_peek(&a->ring);
+}
+
 static inline bool skb_array_empty(struct skb_array *a)
 {
return ptr_ring_empty(&a->ring);



[RFC PATCH 15/17] net: sched: pfifo_fast use skb_array

2017-11-13 Thread John Fastabend
This converts the pfifo_fast qdisc to use the skb_array data structure
and set the lockless qdisc bit.

This also removes the logic used to pick the next band to dequeue from
and instead just checks a per priority array for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 84c4ea1..683f6ec 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -581,93 +582,95 @@ struct Qdisc_ops noqueue_qdisc_ops __read_mostly = {
 
 /*
  * Private data for a pfifo_fast scheduler containing:
- * - queues for the three band
- * - bitmap indicating which of the bands contain skbs
+ * - rings for priority bands
  */
 struct pfifo_fast_priv {
-   u32 bitmap;
-   struct qdisc_skb_head q[PFIFO_FAST_BANDS];
+   struct skb_array q[PFIFO_FAST_BANDS];
 };
 
-/*
- * Convert a bitmap to the first band number where an skb is queued, where:
- * bitmap=0 means there are no skbs on any band.
- * bitmap=1 means there is an skb on band 0.
- * bitmap=7 means there are skbs on all 3 bands, etc.
- */
-static const int bitmap2band[] = {-1, 0, 1, 0, 2, 0, 1, 0};
-
-static inline struct qdisc_skb_head *band2list(struct pfifo_fast_priv *priv,
-int band)
+static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
+ int band)
 {
-   return priv->q + band;
+   return &priv->q[band];
 }
 
 static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
  struct sk_buff **to_free)
 {
-   if (qdisc->q.qlen < qdisc_dev(qdisc)->tx_queue_len) {
-   int band = prio2band[skb->priority & TC_PRIO_MAX];
-   struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-   struct qdisc_skb_head *list = band2list(priv, band);
-
-   priv->bitmap |= (1 << band);
-   qdisc->q.qlen++;
-   return __qdisc_enqueue_tail(skb, qdisc, list);
-   }
+   int band = prio2band[skb->priority & TC_PRIO_MAX];
+   struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
+   struct skb_array *q = band2list(priv, band);
+   int err;
 
-   return qdisc_drop(skb, qdisc, to_free);
+   err = skb_array_produce_bh(q, skb);
+
+   if (unlikely(err))
+   return qdisc_drop_cpu(skb, qdisc, to_free);
+
+   qdisc_qstats_cpu_qlen_inc(qdisc);
+   qdisc_qstats_cpu_backlog_inc(qdisc, skb);
+   return NET_XMIT_SUCCESS;
 }
 
 static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-   int band = bitmap2band[priv->bitmap];
-
-   if (likely(band >= 0)) {
-   struct qdisc_skb_head *qh = band2list(priv, band);
-   struct sk_buff *skb = __qdisc_dequeue_head(qh);
+   struct sk_buff *skb = NULL;
+   int band;
 
-   if (likely(skb != NULL)) {
-   qdisc_qstats_backlog_dec(qdisc, skb);
-   qdisc_bstats_update(qdisc, skb);
-   }
+   for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+   struct skb_array *q = band2list(priv, band);
 
-   qdisc->q.qlen--;
-   if (qh->qlen == 0)
-   priv->bitmap &= ~(1 << band);
+   if (__skb_array_empty(q))
+   continue;
 
-   return skb;
+   skb = skb_array_consume_bh(q);
+   }
+   if (likely(skb)) {
+   qdisc_qstats_cpu_backlog_dec(qdisc, skb);
+   qdisc_bstats_cpu_update(qdisc, skb);
+   qdisc_qstats_cpu_qlen_dec(qdisc);
}
 
-   return NULL;
+   return skb;
 }
 
 static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
 {
struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-   int band = bitmap2band[priv->bitmap];
+   struct sk_buff *skb = NULL;
+   int band;
 
-   if (band >= 0) {
-   struct qdisc_skb_head *qh = band2list(priv, band);
+   for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+   struct skb_array *q = band2list(priv, band);
 
-   return qh->head;
+   skb = __skb_array_peek(q);
+   if (!skb)
+   continue;
}
 
-   return NULL;
+   return skb;
 }
 
 static void pfifo_fast_reset(struct Qdisc *qdisc)
 {
-   int prio;
+   int i, band;
struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 
-   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-   __qdisc_reset_queue(band2list(priv, prio));
+   for (band = 0; band < PFIFO_FAST_BANDS; band++) {
+   struct skb_array *q = band2list(priv, ban

[RFC PATCH 13/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio

2017-11-13 Thread John Fastabend
The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index b85885a9..24474d0 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -388,22 +388,32 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff 
*skb)
struct nlattr *nla = (struct nlattr *)skb_tail_pointer(skb);
struct tc_mqprio_qopt opt = { 0 };
struct Qdisc *qdisc;
-   unsigned int i;
+   unsigned int ntx, tc;
 
sch->q.qlen = 0;
memset(&sch->bstats, 0, sizeof(sch->bstats));
memset(&sch->qstats, 0, sizeof(sch->qstats));
 
-   for (i = 0; i < dev->num_tx_queues; i++) {
-   qdisc = rtnl_dereference(netdev_get_tx_queue(dev, i)->qdisc);
+   for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+   struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+   struct gnet_stats_queue __percpu *cpu_qstats = NULL;
+   __u32 qlen = 0;
+
+   qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
spin_lock_bh(qdisc_lock(qdisc));
-   sch->q.qlen += qdisc->q.qlen;
-   sch->bstats.bytes   += qdisc->bstats.bytes;
-   sch->bstats.packets += qdisc->bstats.packets;
-   sch->qstats.backlog += qdisc->qstats.backlog;
-   sch->qstats.drops   += qdisc->qstats.drops;
-   sch->qstats.requeues+= qdisc->qstats.requeues;
-   sch->qstats.overlimits  += qdisc->qstats.overlimits;
+
+   if (qdisc_is_percpu_stats(qdisc)) {
+   cpu_bstats = qdisc->cpu_bstats;
+   cpu_qstats = qdisc->cpu_qstats;
+   }
+
+   qlen = qdisc_qlen_sum(qdisc);
+
+   __gnet_stats_copy_basic(NULL, &sch->bstats,
+   cpu_bstats, &qdisc->bstats);
+   __gnet_stats_copy_queue(&sch->qstats,
+   cpu_qstats, &qdisc->qstats, qlen);
+
spin_unlock_bh(qdisc_lock(qdisc));
}
 
@@ -411,9 +421,9 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff 
*skb)
memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
opt.hw = priv->hw_offload;
 
-   for (i = 0; i < netdev_get_num_tc(dev); i++) {
-   opt.count[i] = dev->tc_to_txq[i].count;
-   opt.offset[i] = dev->tc_to_txq[i].offset;
+   for (tc = 0; tc < netdev_get_num_tc(dev); tc++) {
+   opt.count[tc] = dev->tc_to_txq[tc].count;
+   opt.offset[tc] = dev->tc_to_txq[tc].offset;
}
 
if (nla_put(skb, TCA_OPTIONS, NLA_ALIGN(sizeof(opt)), &opt))
@@ -495,7 +505,6 @@ static int mqprio_dump_class_stats(struct Qdisc *sch, 
unsigned long cl,
if (cl >= TC_H_MIN_PRIORITY) {
int i;
__u32 qlen = 0;
-   struct Qdisc *qdisc;
struct gnet_stats_queue qstats = {0};
struct gnet_stats_basic_packed bstats = {0};
struct net_device *dev = qdisc_dev(sch);
@@ -511,18 +520,26 @@ static int mqprio_dump_class_stats(struct Qdisc *sch, 
unsigned long cl,
 
for (i = tc.offset; i < tc.offset + tc.count; i++) {
struct netdev_queue *q = netdev_get_tx_queue(dev, i);
+   struct Qdisc *qdisc = rtnl_dereference(q->qdisc);
+   struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+   struct gnet_stats_queue __percpu *cpu_qstats = NULL;
 
-   qdisc = rtnl_dereference(q->qdisc);
spin_lock_bh(qdisc_lock(qdisc));
-   qlen  += qdisc->q.qlen;
-   bstats.bytes  += qdisc->bstats.bytes;
-   bstats.packets+= qdisc->bstats.packets;
-   qstats.backlog+= qdisc->qstats.backlog;
-   qstats.drops  += qdisc->qstats.drops;
-   qstats.requeues   += qdisc->qstats.requeues;
-   qstats.overlimits += qdisc->qstats.overlimits;
+   if (qdisc_is_percpu_stats(qdisc)) {
+   cpu_bstats = qdisc->cpu_bstats;
+   cpu_qstats = qdisc->cpu_qstats;
+   }
+
+   qlen = qdisc_qlen_sum(qdisc);
+   __gnet_stats_copy_basic(NULL, &sch->bstats,
+   cpu_bstats, &qdisc->bstats);
+

[RFC PATCH 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq

2017-11-13 Thread John Fastabend
The sch_mq qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Also exports __gnet_stats_copy_queue() to use as a helper function.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 304f7aa..0304ba2 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -49,6 +49,9 @@ int gnet_stats_copy_rate_est(struct gnet_dump *d,
 int gnet_stats_copy_queue(struct gnet_dump *d,
  struct gnet_stats_queue __percpu *cpu_q,
  struct gnet_stats_queue *q, __u32 qlen);
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+const struct gnet_stats_queue __percpu *cpu_q,
+const struct gnet_stats_queue *q, __u32 qlen);
 int gnet_stats_copy_app(struct gnet_dump *d, void *st, int len);
 
 int gnet_stats_finish_copy(struct gnet_dump *d);
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 87f2855..b2b2323b 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -252,10 +252,10 @@
}
 }
 
-static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
-   const struct gnet_stats_queue __percpu *cpu,
-   const struct gnet_stats_queue *q,
-   __u32 qlen)
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+const struct gnet_stats_queue __percpu *cpu,
+const struct gnet_stats_queue *q,
+__u32 qlen)
 {
if (cpu) {
__gnet_stats_copy_queue_cpu(qstats, cpu);
@@ -269,6 +269,7 @@ static void __gnet_stats_copy_queue(struct gnet_stats_queue 
*qstats,
 
qstats->qlen = qlen;
 }
+EXPORT_SYMBOL(__gnet_stats_copy_queue);
 
 /**
  * gnet_stats_copy_queue - copy queue statistics into statistics TLV
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index 213b586..bc59f05 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mq_sched {
struct Qdisc**qdiscs;
@@ -103,15 +104,25 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
memset(&sch->qstats, 0, sizeof(sch->qstats));
 
for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+   struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+   struct gnet_stats_queue __percpu *cpu_qstats = NULL;
+   __u32 qlen = 0;
+
qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
spin_lock_bh(qdisc_lock(qdisc));
-   sch->q.qlen += qdisc->q.qlen;
-   sch->bstats.bytes   += qdisc->bstats.bytes;
-   sch->bstats.packets += qdisc->bstats.packets;
-   sch->qstats.backlog += qdisc->qstats.backlog;
-   sch->qstats.drops   += qdisc->qstats.drops;
-   sch->qstats.requeues+= qdisc->qstats.requeues;
-   sch->qstats.overlimits  += qdisc->qstats.overlimits;
+
+   if (qdisc_is_percpu_stats(qdisc)) {
+   cpu_bstats = qdisc->cpu_bstats;
+   cpu_qstats = qdisc->cpu_qstats;
+   }
+
+   qlen = qdisc_qlen_sum(qdisc);
+
+   __gnet_stats_copy_basic(NULL, &sch->bstats,
+   cpu_bstats, &qdisc->bstats);
+   __gnet_stats_copy_queue(&sch->qstats,
+   cpu_qstats, &qdisc->qstats, qlen);
+
spin_unlock_bh(qdisc_lock(qdisc));
}
return 0;



[RFC PATCH 11/17] net: sched: helper to sum qlen

2017-11-13 Thread John Fastabend
Reporting qlen when qlen is per cpu requires aggregating the per
cpu counters. This adds a helper routine for this.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index bad24a9..5824509 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -304,6 +304,21 @@ static inline int qdisc_qlen(const struct Qdisc *q)
return q->q.qlen;
 }
 
+static inline int qdisc_qlen_sum(const struct Qdisc *q)
+{
+   __u32 qlen = 0;
+   int i;
+
+   if (q->flags & TCQ_F_NOLOCK) {
+   for_each_possible_cpu(i)
+   qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;
+   } else {
+   qlen = q->q.qlen;
+   }
+
+   return qlen;
+}
+
 static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
 {
return (struct qdisc_skb_cb *)skb->cb;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b6c4f53..5cb64d2 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -797,7 +797,8 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc 
*q, u32 clid,
goto nla_put_failure;
if (q->ops->dump && q->ops->dump(q, skb) < 0)
goto nla_put_failure;
-   qlen = q->q.qlen;
+
+   qlen = qdisc_qlen_sum(q);
 
stab = rtnl_dereference(q->stab);
if (stab && qdisc_dump_stab(skb, stab) < 0)



[RFC PATCH 10/17] net: sched: qdisc_qlen for per cpu logic

2017-11-13 Thread John Fastabend
Add qdisc qlen helper routines for lockless qdiscs to use.

The qdisc qlen is no longer used in the hotpath but it is reported
via stats query on the qdisc so it still needs to be tracked. This
adds the per cpu operations needed.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4717c4b..bad24a9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -291,8 +291,16 @@ static inline void qdisc_cb_private_validate(const struct 
sk_buff *skb, int sz)
BUILD_BUG_ON(sizeof(qcb->data) < sz);
 }
 
+static inline int qdisc_qlen_cpu(const struct Qdisc *q)
+{
+   return this_cpu_ptr(q->cpu_qstats)->qlen;
+}
+
 static inline int qdisc_qlen(const struct Qdisc *q)
 {
+   if (q->flags & TCQ_F_NOLOCK)
+   return qdisc_qlen_cpu(q);
+
return q->q.qlen;
 }
 



[RFC PATCH 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc

2017-11-13 Thread John Fastabend
In dev_graft_qdisc() a "new" qdisc is attached and the 'qdisc_destroy'
operation is called on the old qdisc. The destroy operation will wait
an rcu grace period and call qdisc_rcu_free(), at which point
gso_cpu_skb is freed along with all stats, so there is no need to zero stats
and gso_cpu_skb from the graft operation itself.

Further, after dropping the qdisc locks we cannot continue to call
qdisc_reset() before waiting an rcu grace period so that the qdisc is
detached from all cpus.
the correct property of waiting an rcu grace period and letting the
qdisc_destroy operation clean up the qdisc correctly.

Note, a refcnt greater than 1 would cause the destroy operation to
be aborted however if this ever happened the reference to the qdisc
would be lost and we would have a memory leak.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1055bec..ea017ce 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -818,10 +818,6 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue 
*dev_queue,
root_lock = qdisc_lock(oqdisc);
spin_lock_bh(root_lock);
 
-   /* Prune old scheduler */
-   if (oqdisc && refcount_read(&oqdisc->refcnt) <= 1)
-   qdisc_reset(oqdisc);
-
/* ... and graft new one */
if (qdisc == NULL)
qdisc = &noop_qdisc;
@@ -976,6 +972,16 @@ static bool some_qdisc_is_busy(struct net_device *dev)
return false;
 }
 
+static void dev_qdisc_reset(struct net_device *dev,
+   struct netdev_queue *dev_queue,
+   void *none)
+{
+   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+
+   if (qdisc)
+   qdisc_reset(qdisc);
+}
+
 /**
  * dev_deactivate_many - deactivate transmissions on several devices
  * @head: list of devices to deactivate
@@ -986,7 +992,6 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 void dev_deactivate_many(struct list_head *head)
 {
struct net_device *dev;
-   bool sync_needed = false;
 
list_for_each_entry(dev, head, close_list) {
netdev_for_each_tx_queue(dev, dev_deactivate_queue,
@@ -996,20 +1001,25 @@ void dev_deactivate_many(struct list_head *head)
 &noop_qdisc);
 
dev_watchdog_down(dev);
-   sync_needed |= !dev->dismantle;
}
 
/* Wait for outstanding qdisc-less dev_queue_xmit calls.
 * This is avoided if all devices are in dismantle phase :
 * Caller will call synchronize_net() for us
 */
-   if (sync_needed)
-   synchronize_net();
+   synchronize_net();
 
/* Wait for outstanding qdisc_run calls. */
-   list_for_each_entry(dev, head, close_list)
+   list_for_each_entry(dev, head, close_list) {
while (some_qdisc_is_busy(dev))
yield();
+   /* The new qdisc is assigned at this point so we can safely
+* unwind stale skb lists and qdisc statistics
+*/
+   netdev_for_each_tx_queue(dev, dev_qdisc_reset, NULL);
+   if (dev_ingress_queue(dev))
+   dev_qdisc_reset(dev, dev_ingress_queue(dev), NULL);
+   }
 }
 
 void dev_deactivate(struct net_device *dev)



[RFC PATCH 09/17] net: sched: check for frozen queue before skb_bad_txq check

2017-11-13 Thread John Fastabend
I can not think of any reason to pull the bad txq skb off the qdisc if
the txq we plan to send this on is still frozen. So check for frozen
queue first and abort before dequeuing either skb_bad_txq skb or
normal qdisc dequeue() skb.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 92570d4..84c4ea1 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -204,7 +204,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
   int *packets)
 {
const struct netdev_queue *txq = q->dev_queue;
-   struct sk_buff *skb;
+   struct sk_buff *skb = NULL;
 
*packets = 1;
if (unlikely(!skb_queue_empty(&q->gso_skb))) {
@@ -248,12 +248,15 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
}
 validate:
*validate = true;
+
+   if ((q->flags & TCQ_F_ONETXQUEUE) &&
+   netif_xmit_frozen_or_stopped(txq))
+   return skb;
+
skb = qdisc_dequeue_skb_bad_txq(q);
if (unlikely(skb))
goto bulk;
-   if (!(q->flags & TCQ_F_ONETXQUEUE) ||
-   !netif_xmit_frozen_or_stopped(txq))
-   skb = q->dequeue(q);
+   skb = q->dequeue(q);
if (skb) {
 bulk:
if (qdisc_may_bulk(q))



Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> Put another way: the CONFIG_DEBUG_INFO_SPLIT option is useless. Yes,
> it saves time and disk space, but does so at the expense of making all
> the debug information unavailable to basic tools.

You're right. It works for line information, but strangely not for
inlines. I assume it can be fixed.

-Andi


[RFC PATCH 08/17] net: sched: use skb list for skb_bad_tx

2017-11-13 Thread John Fastabend
Similar to how gso is handled, use an skb list for skb_bad_tx. This is
required with lockless qdiscs because we may have multiple cores
attempting to push skbs into skb_bad_tx concurrently.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6e329f0..4717c4b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -95,7 +95,7 @@ struct Qdisc {
struct gnet_stats_queue qstats;
unsigned long   state;
struct Qdisc*next_sched;
-   struct sk_buff  *skb_bad_txq;
+   struct sk_buff_head skb_bad_txq;
int padded;
refcount_t  refcnt;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ea017ce..92570d4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -45,6 +45,68 @@
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
+
+static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
+{
+   const struct netdev_queue *txq = q->dev_queue;
+   spinlock_t *lock = NULL;
+   struct sk_buff *skb;
+
+   if (q->flags & TCQ_F_NOLOCK) {
+   lock = qdisc_lock(q);
+   spin_lock(lock);
+   }
+
+   skb = skb_peek(&q->skb_bad_txq);
+   if (skb) {
+   /* check the reason of requeuing without tx lock first */
+   txq = skb_get_tx_queue(txq->dev, skb);
+   if (!netif_xmit_frozen_or_stopped(txq)) {
+   skb = __skb_dequeue(&q->skb_bad_txq);
+   if (qdisc_is_percpu_stats(q)) {
+   qdisc_qstats_cpu_backlog_dec(q, skb);
+   qdisc_qstats_cpu_qlen_dec(q);
+   } else {
+   qdisc_qstats_backlog_dec(q, skb);
+   q->q.qlen--;
+   }
+   } else {
+   skb = NULL;
+   }
+   }
+
+   if (lock)
+   spin_unlock(lock);
+
+   return skb;
+}
+
+static inline struct sk_buff *qdisc_dequeue_skb_bad_txq(struct Qdisc *q)
+{
+   struct sk_buff *skb = skb_peek(&q->skb_bad_txq);
+
+   if (unlikely(skb))
+   skb = __skb_dequeue_bad_txq(q);
+
+   return skb;
+}
+
+static inline void qdisc_enqueue_skb_bad_txq(struct Qdisc *q,
+struct sk_buff *skb)
+{
+   spinlock_t *lock = NULL;
+
+   if (q->flags & TCQ_F_NOLOCK) {
+   lock = qdisc_lock(q);
+   spin_lock(lock);
+   }
+
+   __skb_queue_tail(&q->skb_bad_txq, skb);
+
+   if (lock)
+   spin_unlock(lock);
+}
+
 static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
__skb_queue_head(&q->gso_skb, skb);
@@ -117,9 +179,15 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
if (!nskb)
break;
if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
-   q->skb_bad_txq = nskb;
-   qdisc_qstats_backlog_inc(q, nskb);
-   q->q.qlen++;
+   qdisc_enqueue_skb_bad_txq(q, nskb);
+
+   if (qdisc_is_percpu_stats(q)) {
+   qdisc_qstats_cpu_backlog_inc(q, nskb);
+   qdisc_qstats_cpu_qlen_inc(q);
+   } else {
+   qdisc_qstats_backlog_inc(q, nskb);
+   q->q.qlen++;
+   }
break;
}
skb->next = nskb;
@@ -180,19 +248,9 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
}
 validate:
*validate = true;
-   skb = q->skb_bad_txq;
-   if (unlikely(skb)) {
-   /* check the reason of requeuing without tx lock first */
-   txq = skb_get_tx_queue(txq->dev, skb);
-   if (!netif_xmit_frozen_or_stopped(txq)) {
-   q->skb_bad_txq = NULL;
-   qdisc_qstats_backlog_dec(q, skb);
-   q->q.qlen--;
-   goto bulk;
-   }
-   skb = NULL;
-   goto trace;
-   }
+   skb = qdisc_dequeue_skb_bad_txq(q);
+   if (unlikely(skb))
+   goto bulk;
if (!(q->flags & TCQ_F_ONETXQUEUE) ||
!netif_xmit_frozen_or_stopped(txq))
skb = q->dequeue(q);
@@ -683,6 +741,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
sch->padded = (char *) sch - (char *) p;
}
__skb_queue_head_init(&sch->gso_skb);
+   __skb_queue_head_init(&sch->skb_bad_txq);
qdisc_skb_head_init(&sch->q);
spin_lock_init(&sch->q.lock);

Re: [PATCH net-next] bpf: expose sk_priority through struct bpf_sock_ops

2017-11-13 Thread Lawrence Brakmo
On 11/13/17, 11:01 AM, "Vlad Dumitrescu"  wrote:

On Sat, Nov 11, 2017 at 2:38 PM, Alexei Starovoitov  wrote:
>
> On 11/12/17 4:46 AM, Daniel Borkmann wrote:
>>
>> On 11/11/2017 05:06 AM, Alexei Starovoitov wrote:
>>>
>>> On 11/11/17 6:07 AM, Daniel Borkmann wrote:

 On 11/10/2017 08:17 PM, Vlad Dumitrescu wrote:
>
> From: Vlad Dumitrescu 
>
> Allows BPF_PROG_TYPE_SOCK_OPS programs to read sk_priority.
>
> Signed-off-by: Vlad Dumitrescu 
> ---
>   include/uapi/linux/bpf.h   |  1 +
>   net/core/filter.c  | 11 +++
>   tools/include/uapi/linux/bpf.h |  1 +
>   3 files changed, 13 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e880ae6434ee..9757a2002513 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -947,6 +947,7 @@ struct bpf_sock_ops {
>   __u32 local_ip6[4];/* Stored in network byte order */
>   __u32 remote_port;/* Stored in network byte order */
>   __u32 local_port;/* stored in host byte order */
> +__u32 priority;
>   };
> /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 61c791f9f628..a6329642d047 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4449,6 +4449,17 @@ static u32 sock_ops_convert_ctx_access(enum
> bpf_access_type type,
>   *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> offsetof(struct sock_common, skc_num));
>   break;
> +
> +case offsetof(struct bpf_sock_ops, priority):
> +BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_priority) != 4);
> +
> +*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +struct bpf_sock_ops_kern, sk),
> +  si->dst_reg, si->src_reg,
> +  offsetof(struct bpf_sock_ops_kern, sk));
> +*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +  offsetof(struct sock, sk_priority));
> +break;


 Hm, I don't think this would work, I actually think your initial patch
 was ok.
 bpf_setsockopt() as well as bpf_getsockopt() check for sk_fullsock(sk)
 right
 before accessing options on either socket or TCP level, and bail out
 with error
 otherwise; in such cases we'd read something else here and assume it's
 sk_priority.
>>>
>>>
>>> even if it's not fullsock, it will just read zero, no? what's a problem
>>> with that?
>>> In non-fullsock hooks like BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
>>> the program author will know that it's meaningless to read sk_priority,
>>> so returning zero with minimal checks is fine.
>>> While adding extra runtime if (sk_fullsock(sk)) is unnecessary,
>>> since the safety is not compromised.
>>
>>
>> Hm, on my kernel, struct sock has the 4 bytes sk_priority at offset 440,
>> struct request_sock itself is only 232 byte long in total, and the struct
>> inet_timewait_sock is 208 byte long, so you'd be accessing out of bounds
>> that way, so it cannot be ignored and assumed zero.
>
>
> I thought we always pass fully allocated sock but technically not 
fullsock yet. My mistake. We do: tcp_timeout_init((struct sock *)req))
> so yeah ctx rewrite approach won't work.
> Let's go back to access via helper.
>

TIL. Thanks!

Is there anything else needed from me to get the helper approach accepted?

I plan to add access to TCP state variables (cwnd, rtt, etc.) and I have been 
thinking
about this issue. I think it is possible to access it directly as long as we 
use a value
like 0x to represent an invalid value (e.g. not fullsock). The ctx 
rewrite just
needs to add a conditional to determine what to return. I would probably add a
field into the internal kernel struct to indicate if it is fullsock or not (we 
should
know when we call tcp_call_bpf whether it is a fullsock or not based on 
context).

Let me do a sample patch that I can send for review and get feedback from
Alexi and Daniel.
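
For the priority field the rewrite I have in mind would look roughly like
the following (sketch only -- "is_fullsock" is a field, assumed u8, that I
would add to struct bpf_sock_ops_kern, and for simplicity this assumes
si->dst_reg and si->src_reg differ; the real patch has to handle the
aliased case):

	case offsetof(struct bpf_sock_ops, priority):
		/* dst = ops->is_fullsock; if (!dst) dst = 0xffffffff;
		 * else dst = ops->sk->sk_priority;
		 */
		*insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->src_reg,
				      offsetof(struct bpf_sock_ops_kern,
					       is_fullsock));
		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 3);
		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, sk),
				      si->dst_reg, si->src_reg,
				      offsetof(struct bpf_sock_ops_kern, sk));
		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
				      offsetof(struct sock, sk_priority));
		*insn++ = BPF_JMP_A(1);
		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0xffffffff);
		break;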



[RFC PATCH 04/17] net: sched: provide per cpu qstat helpers

2017-11-13 Thread John Fastabend
The per cpu qstats support was added with per cpu bstat support which
is currently used by the ingress qdisc. This patch adds the set of
helpers needed so that other qdiscs can use per cpu qstats as well.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index bb806a0..7c4b96b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -631,12 +631,39 @@ static inline void qdisc_qstats_backlog_dec(struct Qdisc 
*sch,
sch->qstats.backlog -= qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_dec(struct Qdisc *sch,
+   const struct sk_buff *skb)
+{
+   this_cpu_sub(sch->cpu_qstats->backlog, qdisc_pkt_len(skb));
+}
+
 static inline void qdisc_qstats_backlog_inc(struct Qdisc *sch,
const struct sk_buff *skb)
 {
sch->qstats.backlog += qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_inc(struct Qdisc *sch,
+   const struct sk_buff *skb)
+{
+   this_cpu_add(sch->cpu_qstats->backlog, qdisc_pkt_len(skb));
+}
+
+static inline void qdisc_qstats_cpu_qlen_inc(struct Qdisc *sch)
+{
+   this_cpu_inc(sch->cpu_qstats->qlen);
+}
+
+static inline void qdisc_qstats_cpu_qlen_dec(struct Qdisc *sch)
+{
+   this_cpu_dec(sch->cpu_qstats->qlen);
+}
+
+static inline void qdisc_qstats_cpu_requeues_inc(struct Qdisc *sch)
+{
+   this_cpu_inc(sch->cpu_qstats->requeues);
+}
+
 static inline void __qdisc_qstats_drop(struct Qdisc *sch, int count)
 {
sch->qstats.drops += count;
@@ -844,6 +871,14 @@ static inline void rtnl_qdisc_drop(struct sk_buff *skb, 
struct Qdisc *sch)
qdisc_qstats_drop(sch);
 }
 
+static inline int qdisc_drop_cpu(struct sk_buff *skb, struct Qdisc *sch,
+struct sk_buff **to_free)
+{
+   __qdisc_drop(skb, to_free);
+   qdisc_qstats_cpu_drop(sch);
+
+   return NET_XMIT_DROP;
+}
 
 static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
 struct sk_buff **to_free)



[RFC PATCH 05/17] net: sched: a dflt qdisc may be used with per cpu stats

2017-11-13 Thread John Fastabend
Enable dflt qdisc support for per cpu stats. Before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.

This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 7c4b96b..7bc2826 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -179,6 +179,7 @@ struct Qdisc_ops {
const struct Qdisc_class_ops*cl_ops;
charid[IFNAMSIZ];
int priv_size;
+   unsigned intstatic_flags;
 
int (*enqueue)(struct sk_buff *skb,
   struct Qdisc *sch,
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d063f6b..131eab8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -635,6 +635,19 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
qdisc_skb_head_init(&sch->q);
spin_lock_init(&sch->q.lock);
 
+   if (ops->static_flags & TCQ_F_CPUSTATS) {
+   sch->cpu_bstats =
+   netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
+   if (!sch->cpu_bstats)
+   goto errout1;
+
+   sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
+   if (!sch->cpu_qstats) {
+   free_percpu(sch->cpu_bstats);
+   goto errout1;
+   }
+   }
+
spin_lock_init(&sch->busylock);
lockdep_set_class(&sch->busylock,
  dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
@@ -644,6 +657,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
  dev->qdisc_running_key ?: &qdisc_running_key);
 
sch->ops = ops;
+   sch->flags = ops->static_flags;
sch->enqueue = ops->enqueue;
sch->dequeue = ops->dequeue;
sch->dev_queue = dev_queue;
@@ -651,6 +665,8 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
refcount_set(&sch->refcnt, 1);
 
return sch;
+errout1:
+   kfree(p);
 errout:
return ERR_PTR(err);
 }
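
For reference, the intended use looks roughly like this (illustrative
sketch of how a qdisc opts in, not a hunk from this series; the
pfifo_fast conversion later in the series would be a natural first user):

	struct Qdisc_ops pfifo_fast_ops __read_mostly = {
		.id		= "pfifo_fast",
		.priv_size	= sizeof(struct pfifo_fast_priv),
		.enqueue	= pfifo_fast_enqueue,
		.dequeue	= pfifo_fast_dequeue,
		.peek		= pfifo_fast_peek,
		.init		= pfifo_fast_init,
		.reset		= pfifo_fast_reset,
		.dump		= pfifo_fast_dump,
		/* lockless qdisc with per-cpu bstats/qstats allocated
		 * by qdisc_alloc() via static_flags
		 */
		.static_flags	= TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
		.owner		= THIS_MODULE,
	};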



[RFC PATCH 06/17] net: sched: explicit locking in gso_cpu fallback

2017-11-13 Thread John Fastabend
This work is preparing the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless and we overrun
the netdev, for whatever reason, the netdev returns a busy
error code and the skb is parked on the gso_skb pointer. With many
cores all hitting this case at once it's possible to have multiple
sk_buffs here, so we turn gso_skb into a queue.

This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 7bc2826..6e329f0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -88,7 +88,7 @@ struct Qdisc {
/*
 * For performance sake on SMP, we put highly modified fields at the end
 */
-   struct sk_buff  *gso_skb cacheline_aligned_in_smp;
+   struct sk_buff_head gso_skb cacheline_aligned_in_smp;
struct qdisc_skb_head   q;
struct gnet_stats_basic_packed bstats;
seqcount_t  running;
@@ -795,26 +795,30 @@ static inline struct sk_buff *qdisc_peek_head(struct 
Qdisc *sch)
 /* generic pseudo peek method for non-work-conserving qdisc */
 static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
 {
+   struct sk_buff *skb = skb_peek(&sch->gso_skb);
+
/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
-   if (!sch->gso_skb) {
-   sch->gso_skb = sch->dequeue(sch);
-   if (sch->gso_skb) {
+   if (!skb) {
+   skb = sch->dequeue(sch);
+
+   if (skb) {
+   __skb_queue_head(&sch->gso_skb, skb);
/* it's still part of the queue */
-   qdisc_qstats_backlog_inc(sch, sch->gso_skb);
+   qdisc_qstats_backlog_inc(sch, skb);
sch->q.qlen++;
}
}
 
-   return sch->gso_skb;
+   return skb;
 }
 
 /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
 static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 {
-   struct sk_buff *skb = sch->gso_skb;
+   struct sk_buff *skb = skb_peek(&sch->gso_skb);
 
if (skb) {
-   sch->gso_skb = NULL;
+   skb = __skb_dequeue(&sch->gso_skb);
qdisc_qstats_backlog_dec(sch, skb);
sch->q.qlen--;
} else {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 131eab8..1055bec 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -45,10 +45,9 @@
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
-
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
-   q->gso_skb = skb;
+   __skb_queue_head(&q->gso_skb, skb);
q->qstats.requeues++;
qdisc_qstats_backlog_inc(q, skb);
q->q.qlen++;/* it's still part of the queue */
@@ -57,6 +56,30 @@ static inline int dev_requeue_skb(struct sk_buff *skb, 
struct Qdisc *q)
return 0;
 }
 
+static inline int dev_requeue_skb_locked(struct sk_buff *skb, struct Qdisc *q)
+{
+   spinlock_t *lock = qdisc_lock(q);
+
+   spin_lock(lock);
+   __skb_queue_tail(&q->gso_skb, skb);
+   spin_unlock(lock);
+
+   qdisc_qstats_cpu_requeues_inc(q);
+   qdisc_qstats_cpu_backlog_inc(q, skb);
+   qdisc_qstats_cpu_qlen_inc(q);
+   __netif_schedule(q);
+
+   return 0;
+}
+
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+{
+   if (q->flags & TCQ_F_NOLOCK)
+   return dev_requeue_skb_locked(skb, q);
+   else
+   return __dev_requeue_skb(skb, q);
+}
+
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 struct sk_buff *skb,
 const struct netdev_queue *txq,
@@ -112,23 +135,50 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
 static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
   int *packets)
 {
-   struct sk_buff *skb = q->gso_skb;
const struct netdev_queue *txq = q->dev_queue;
+   struct sk_buff *skb;
 
*packets = 1;
-   if (unlikely(skb)) {
+   if (unlikely(!skb_queue_empty(&q->gso_skb))) {
+   spinlock_t *lock = NULL;
+
+   if (q->flags & TCQ_F_NOLOCK) {
+   lock = qdisc_lock(q);
+   spin_lock(lock);
+   }
+
+   skb = skb_peek(&q->gso_skb);
+
+   /* skb may be null if another cpu pulls gso_skb off in between
+* empty check and lock.
+*/
+   if (!skb) {
+   if (lock)
+

[RFC PATCH 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path

2017-11-13 Thread John Fastabend
sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
of the routine only check if it is zero or not. Simplify the logic so
that we don't need to return an actual queue length value.

This introduces a case where sch_direct_xmit() would previously have returned
a qlen of zero but now returns true. However, in this case all
call sites of sch_direct_xmit() will attempt a dequeue() and get
a null skb and abort. This trades tracking qlen in the hotpath for
an extra dequeue operation. Overall this seems to be good for
performance.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 4eea719..2404692 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -105,9 +105,9 @@ struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec 
*r,
 void qdisc_put_rtab(struct qdisc_rate_table *tab);
 void qdisc_put_stab(struct qdisc_size_table *tab);
 void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc);
-int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
-   struct net_device *dev, struct netdev_queue *txq,
-   spinlock_t *root_lock, bool validate);
+bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
+struct net_device *dev, struct netdev_queue *txq,
+spinlock_t *root_lock, bool validate);
 
 void __qdisc_run(struct Qdisc *q);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec757f6..d063f6b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -164,12 +164,12 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
  * only one CPU can execute this function.
  *
  * Returns to the caller:
- * 0  - queue is empty or throttled.
- * >0 - queue is not empty.
+ * false  - hardware queue frozen backoff
+ * true   - feel free to send more pkts
  */
-int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
-   struct net_device *dev, struct netdev_queue *txq,
-   spinlock_t *root_lock, bool validate)
+bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
+struct net_device *dev, struct netdev_queue *txq,
+spinlock_t *root_lock, bool validate)
 {
int ret = NETDEV_TX_BUSY;
 
@@ -190,7 +190,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
} else {
if (root_lock)
spin_lock(root_lock);
-   return qdisc_qlen(q);
+   return true;
}
 
if (root_lock)
@@ -198,18 +198,19 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
-   ret = qdisc_qlen(q);
+   ret = true;
} else {
/* Driver returned NETDEV_TX_BUSY - requeue skb */
if (unlikely(ret != NETDEV_TX_BUSY))
net_warn_ratelimited("BUG %s code %d qlen %d\n",
 dev->name, ret, q->q.qlen);
 
-   ret = dev_requeue_skb(skb, q);
+   dev_requeue_skb(skb, q);
+   ret = false;
}
 
if (ret && netif_xmit_frozen_or_stopped(txq))
-   ret = 0;
+   ret = false;
 
return ret;
 }
@@ -233,7 +234,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  * >0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct Qdisc *q, int *packets)
+static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 {
spinlock_t *root_lock = NULL;
struct netdev_queue *txq;
@@ -244,7 +245,7 @@ static inline int qdisc_restart(struct Qdisc *q, int 
*packets)
/* Dequeue packet */
skb = dequeue_skb(q, &validate, packets);
if (unlikely(!skb))
-   return 0;
+   return false;
 
if (!(q->flags & TCQ_F_NOLOCK))
root_lock = qdisc_lock(q);



[RFC PATCH 02/17] net: sched: allow qdiscs to handle locking

2017-11-13 Thread John Fastabend
This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and to improve
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 65d0d25..bb806a0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -71,6 +71,7 @@ struct Qdisc {
  * qdisc_tree_decrease_qlen() should stop.
  */
 #define TCQ_F_INVISIBLE0x80 /* invisible by default in dump */
+#define TCQ_F_NOLOCK   0x100 /* qdisc does not require locking */
u32 limit;
const struct Qdisc_ops  *ops;
struct qdisc_size_table __rcu *stab;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9f5fa78..e03bee6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3179,6 +3179,21 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
struct Qdisc *q,
int rc;
 
qdisc_calculate_pkt_len(skb, q);
+
+   if (q->flags & TCQ_F_NOLOCK) {
+   if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+   __qdisc_drop(skb, &to_free);
+   rc = NET_XMIT_DROP;
+   } else {
+   rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+   __qdisc_run(q);
+   }
+
+   if (unlikely(to_free))
+   kfree_skb_list(to_free);
+   return rc;
+   }
+
/*
 * Heuristic to force contended enqueues to serialize on a
 * separate lock before trying to get qdisc main lock.
@@ -4161,19 +4176,22 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 
while (head) {
struct Qdisc *q = head;
-   spinlock_t *root_lock;
+   spinlock_t *root_lock = NULL;
 
head = head->next_sched;
 
-   root_lock = qdisc_lock(q);
-   spin_lock(root_lock);
+   if (!(q->flags & TCQ_F_NOLOCK)) {
+   root_lock = qdisc_lock(q);
+   spin_lock(root_lock);
+   }
/* We need to make sure head->next_sched is read
 * before clearing __QDISC_STATE_SCHED
 */
smp_mb__before_atomic();
clear_bit(__QDISC_STATE_SCHED, &q->state);
qdisc_run(q);
-   spin_unlock(root_lock);
+   if (root_lock)
+   spin_unlock(root_lock);
}
}
 }
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f6803e1..ec757f6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -174,7 +174,8 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
int ret = NETDEV_TX_BUSY;
 
/* And release qdisc */
-   spin_unlock(root_lock);
+   if (root_lock)
+   spin_unlock(root_lock);
 
/* Note that we validate skb (GSO, checksum, ...) outside of locks */
if (validate)
@@ -187,10 +188,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
HARD_TX_UNLOCK(dev, txq);
} else {
-   spin_lock(root_lock);
+   if (root_lock)
+   spin_lock(root_lock);
return qdisc_qlen(q);
}
-   spin_lock(root_lock);
+
+   if (root_lock)
+   spin_lock(root_lock);
 
if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
@@ -231,9 +235,9 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  */
 static inline int qdisc_restart(struct Qdisc *q, int *packets)
 {
+   spinlock_t *root_lock = NULL;
struct netdev_queue *txq;
struct net_device *dev;
-   spinlock_t *root_lock;
struct sk_buff *skb;
bool validate;
 
@@ -242,7 +246,9 @@ static inline int qdisc_restart(struct Qdisc *q, int *packets)
if (unlikely(!skb))
return 0;
 
-   root_lock = qdisc_lock(q);
+   if (!(q->flags & TCQ_F_NOLOCK))
+   root_lock = qdisc_lock(q);
+
dev = qdisc_dev(q);
txq = skb_get_tx_qu

[RFC PATCH 00/17] lockless qdisc

2017-11-13 Thread John Fastabend
Multiple folks asked me about this series at net(dev)conf, so after
a 10+ hour flight and a bit of testing once back home I think these
are ready to be submitted. Net-next is closed at the moment

  http://vger.kernel.org/~davem/net-next.html

but once it opens up we can get these in first thing and have
plenty of time to resolve any fallout, although I haven't seen any
issues in my latest testing.

My first test case uses multiple containers (via cilium) where
multiple client containers use 'wrk' to benchmark connections to
a server container running lighttpd, which is configured to use
multiple threads, one per core. Additionally this test has a proxy
agent running, so all traffic takes an extra hop through a
proxy container. In these cases each TCP packet traverses the egress
qdisc layer at least four times and the ingress qdisc layer an
additional four times. This makes for a good stress test IMO; perf
details below.

The other micro-benchmark I run injects packets directly into the
qdisc layer using pktgen. This uses the benchmark script,

 ./pktgen_bench_xmit_mode_queue_xmit.sh 

Benchmarks were taken in two cases: "base", running latest net-next with
no changes to the qdisc layer, and "qdisc", run with the lockless qdisc
updates. Numbers are reported in req/sec. All virtual 'veth' devices
run with pfifo_fast in the qdisc test case.

`wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`

conns      16     32     64   1024
---
base:   18831  20201  21393  29151
qdisc:  19309  21063  23899  29265

Notice that in all cases we see a performance improvement in the qdisc
case, ranging from roughly 0.4% at 1024 connections to about 12% at 64
connections.

Microbenchmarks using pktgen are as follows,

`pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 2000`

base(mq):  2.1Mpps
base(pfifo_fast):  2.1Mpps
qdisc(mq): 2.6Mpps
qdisc(pfifo_fast): 2.6Mpps

Notice the numbers are the same for mq and pfifo_fast because we are
only testing a single thread here; the lockless qdisc case is about
0.5Mpps faster in both.

Comments and feedback welcome. Additional testing by anyone willing
would be greatly appreciated. The patches can be pulled here,

  https://github.com/cilium/linux/tree/qdisc

Thanks,
John

---

John Fastabend (17):
  net: sched: cleanup qdisc_run and __qdisc_run semantics
  net: sched: allow qdiscs to handle locking
  net: sched: remove remaining uses for qdisc_qlen in xmit path
  net: sched: provide per cpu qstat helpers
  net: sched: a dflt qdisc may be used with per cpu stats
  net: sched: explicit locking in gso_cpu fallback
  net: sched: drop qdisc_reset from dev_graft_qdisc
  net: sched: use skb list for skb_bad_tx
  net: sched: check for frozen queue before skb_bad_txq check
  net: sched: qdisc_qlen for per cpu logic
  net: sched: helper to sum qlen
  net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
  net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
  net: skb_array: expose peek API
  net: sched: pfifo_fast use skb_array
  net: skb_array additions for unlocked consumer
  net: sched: lock once per bulk dequeue


 0 files changed

--
Signature

