[PATCH net] ip6_gre: init dev->mtu and dev->hard_header_len correctly

2018-01-11 Thread Alexey Kodanev
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path,
call common GRE functions") moved dev->mtu initialization
from ip6gre_tunnel_setup() to ip6gre_tunnel_init(). As a
result, values set before ndo_init() are reset in the
following cases:

* rtnl_create_link() can update dev->mtu from IFLA_MTU
  parameter

* ip6gre_tnl_link_config() is invoked before ndo_init() in
  netlink and ioctl setup, so ndo_init() can also reset the
  dev->mtu and dev->hard_header_len adjustments that were
  derived from the lower device MTU.

  Not applicable for ip6gretap because it has one more call
  to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().

Since net_device is initially allocated with kvzalloc(), check
that dev->mtu is still zero, i.e. not changed, before setting the
default MTU inside ndo_init(), and invoke ip6gre_tnl_link_config()
after setting the default values.

For ip6gretap, reset dev->mtu to zero in ip6gre_tap_setup()
after ether_setup(), in order for it to work with the new check
in ip6gre_tunnel_init_common().
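
The "only set the default when nothing else has" rule described above can be sketched in user space. This is an illustration, not the kernel code: struct fake_dev, init_default_mtu() and IPV6_HDR_LEN are hypothetical names, and the constants are assumed to mirror the kernel's ETH_DATA_LEN, ETH_HLEN and sizeof(struct ipv6hdr).

```c
#include <stdbool.h>

/* Hypothetical user-space stand-ins; values assumed to match the kernel. */
#define ETH_DATA_LEN 1500  /* max Ethernet payload */
#define ETH_HLEN     14    /* Ethernet header length */
#define IPV6_HDR_LEN 40    /* sizeof(struct ipv6hdr) */

struct fake_dev {
	unsigned int mtu;  /* 0 means "never set", as after kvzalloc() */
	bool is_ether;     /* models dev->type == ARPHRD_ETHER */
};

/*
 * Apply the default MTU only when nothing set it earlier.
 * Returns true when the default was applied (the patch's set_mtu).
 */
static bool init_default_mtu(struct fake_dev *dev, unsigned int gre_hlen,
			     bool ign_encap_limit)
{
	bool set_mtu = !dev->mtu;
	unsigned int t_hlen = gre_hlen + IPV6_HDR_LEN;

	if (set_mtu) {
		dev->mtu = ETH_DATA_LEN - t_hlen;
		if (dev->is_ether)
			dev->mtu -= ETH_HLEN;
		if (!ign_encap_limit)
			dev->mtu -= 8;  /* room for the encap limit option */
	}
	return set_mtu;
}
```

With a 4-byte GRE header and the encapsulation limit in use, an untouched device ends up at 1500 - 44 - 8 = 1448, while a device whose MTU was already set (for example from IFLA_MTU) keeps its value. This is also why the tap setup path has to zero dev->mtu after ether_setup().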

Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/ip6_gre.c |   22 ++
 1 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7726959..edf65d0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -337,7 +337,6 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, struct ip6_tnl *t)
 
 	nt->dev = dev;
 	nt->net = dev_net(dev);
-	ip6gre_tnl_link_config(nt, 1);
 
 	if (register_netdevice(dev) < 0)
 		goto failed_free;
@@ -1047,6 +1046,7 @@ static void ip6gre_tnl_init_features(struct net_device *dev)
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
+	int set_mtu = !dev->mtu;
 	int ret;
 	int t_hlen;
 
@@ -1072,13 +1072,16 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
 	t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
 
 	dev->hard_header_len = LL_MAX_HEADER + t_hlen;
-	dev->mtu = ETH_DATA_LEN - t_hlen;
-	if (dev->type == ARPHRD_ETHER)
-		dev->mtu -= ETH_HLEN;
-	if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
-		dev->mtu -= 8;
+	if (set_mtu) {
+		dev->mtu = ETH_DATA_LEN - t_hlen;
+		if (dev->type == ARPHRD_ETHER)
+			dev->mtu -= ETH_HLEN;
+		if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
+			dev->mtu -= 8;
+	}
 
 	ip6gre_tnl_init_features(dev);
+	ip6gre_tnl_link_config(tunnel, set_mtu);
 
 	return 0;
 }
@@ -1303,7 +1306,6 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 
 static int ip6gre_tap_init(struct net_device *dev)
 {
-	struct ip6_tnl *tunnel;
 	int ret;
 
 	ret = ip6gre_tunnel_init_common(dev);
@@ -1312,10 +1314,6 @@ static int ip6gre_tap_init(struct net_device *dev)
 
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
-	tunnel = netdev_priv(dev);
-
-	ip6gre_tnl_link_config(tunnel, 1);
-
 	return 0;
 }
 
@@ -1335,6 +1333,7 @@ static void ip6gre_tap_setup(struct net_device *dev)
 
 	ether_setup(dev);
 
+	dev->mtu = 0;
 	dev->max_mtu = 0;
 	dev->netdev_ops = &ip6gre_tap_netdev_ops;
 	dev->needs_free_netdev = true;
@@ -1408,7 +1407,6 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
 
 	nt->dev = dev;
 	nt->net = dev_net(dev);
-	ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
 	err = register_netdevice(dev);
 	if (err)
-- 
1.7.1



Re: [patch net-next v7 07/13] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-11 Thread Jamal Hadi Salim

On 18-01-09 09:07 AM, Jiri Pirko wrote:
> From: Jiri Pirko 
>
> As the tcm_ifindex 0 is invalid ifindex, reuse it to indicate that we
> work with block, instead of qdisc. So if tcm_ifindex is 0, tcm_parent is
> used to carry block_index.

Commit log still refers to ifindex of 0 instead of TCM_IFINDEX_MAGIC_BLOCK

cheers,
jamal


Re: [PATCH 26/32] aio: refactor read/write iocb setup

2018-01-11 Thread Christoph Hellwig
On Wed, Jan 10, 2018 at 04:19:53PM -0500, Jeff Moyer wrote:
> > +static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
> > +{
> > +   int ret;
> > +
> > +   req->ki_filp = fget(iocb->aio_fildes);
> > +   if (unlikely(!req->ki_filp))
> > +   return -EBADF;
> > +   req->ki_complete = aio_complete_rw;
> > +   req->ki_flags = 0;
> 
> The above assignment seems superfluous...

Thanks, fixed.


Re: [PATCH net-next v2] xfrm: Add ESN support for IPSec HW offload

2018-01-11 Thread Aviad Yehezkel


On 1/11/2018 10:28 AM, Yossi Kuperman wrote:

From: Shannon Nelson [mailto:shannon.nel...@oracle.com]
Sent: Thursday, January 11, 2018 5:21 AM

On 1/10/2018 3:09 PM, Yossi Kuperman wrote:

On 10 Jan 2018, at 19:36, Shannon Nelson  wrote:


On 1/10/2018 2:34 AM, yoss...@mellanox.com wrote:
From: Yossef Efraim 
This patch adds ESN support to IPsec device offload.
Adding new xfrm device operation to synchronize device ESN.
Signed-off-by: Yossef Efraim 
---
Changes from v1:
   - Added documentation
---
   Documentation/networking/xfrm_device.txt |  3 +++
   include/linux/netdevice.h|  1 +
   include/net/xfrm.h   | 12 
   net/xfrm/xfrm_device.c   |  4 ++--
   net/xfrm/xfrm_replay.c   |  2 ++
   5 files changed, 20 insertions(+), 2 deletions(-)

[...]


diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 7598250..704a055 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -147,8 +147,8 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	if (!x->type_offload)
 		return -EINVAL;
 
-	/* We don't yet support UDP encapsulation, TFC padding and ESN. */
-	if (x->encap || x->tfcpad || (x->props.flags & XFRM_STATE_ESN))
+	/* We don't yet support UDP encapsulation and TFC padding. */
+	if (x->encap || x->tfcpad)

As I mentioned before, this will cause issues when working with hardware that
has no ESN support, such as Intel's x540: the stack will expect the driver to
do ESN, and nothing actually happens but a rollover of the numbers.  Sure, the
driver could look for the ESN attribute and fail the add, but that's a mode
where we have to update every driver to fend off problems every time we add a
new feature.  Much better is to only update drivers that actively support the
new feature.

You are right.

I’m not sure why this check is here in the first place. IMO it should take 
place in xdo_dev_state_add—a driver-specific callback.


If you say I'm right, then why do you say it should take place in the
driver callback?  I just wrote that it should *not*.


Sorry, I wasn't clear; you are right that this change will break
Intel's x540 driver.

However, I do think that this is the purpose of xdo_dev_state_add(). Again, as
far as I understand, and please correct me if I'm wrong, this check shouldn't
be here in the first place.

Please have a look at mlx5e_xfrm_validate_state(). Currently, it returns an
error if the user requests ESN, regardless of the underlying device's
capabilities. A subsequent patch to the mlx5 driver will allow such a request
if the device supports it, maintaining backward compatibility.

Here is a code snippet:

-	if (x->props.flags & XFRM_STATE_ESN) {
+	if (x->props.flags & XFRM_STATE_ESN &&
+	    !(mlx5_accel_ipsec_device_caps(priv->mdev) & MLX5_ACCEL_IPSEC_ESN)) {
 		netdev_info(netdev, "Cannot offload ESN xfrm states\n");
 		return -EINVAL;
 	}


This code seems to be assuming that all drivers/NICs with the offload
will be able to do ESN, and this is not the case.  If this code is put
into place, suddenly the ixgbe driver's offload will have a failure
case: the driver doesn't support ESN, and doesn't know to NAK the
state_add if the ESN bit is on.  This is a generic capabilities issue
for which we already have a solution "pattern".

I guess you are right but ixgbe driver is already checking many other 
caps during add_sa callback (below code from v3 patches for ixgbe ipsec):


+	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+		netdev_err(dev, "Unsupported protocol 0x%04x for ipsec offload\n",
+			   xs->id.proto);
+		return -EINVAL;
+	}
+
+	if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) {
+		struct rx_sa rsa;
+
+		if (xs->calg) {
+			netdev_err(dev, "Compression offload not supported\n");
+			return -EINVAL;
+		}


What is the difference between checking that xs->calg exists in the state and
checking for ESN?

I think in the long term we can refactor this so that the driver declares a
capability mask and add_sa is called only if the mask allows it, but that can
be a totally different patch.



We weren't assuming that, please see above.


  > What do you suggest?
  >

There should be a capabilities/feature flag for the driver to set and
the XFRM code shouldn't try the state_add with ESN if the driver hasn't
set an ESN bit in its capabilities.  Other capabilities that might make
sense here are IPv6, TSO, and CSUM; there may be others.


Look at how feature bits are added to netdev->features to signify what the 
driver can do.  I think that's a much better approach.
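
A minimal sketch of the capability-flag pattern being suggested here, assuming a per-device mask that the core checks before calling into the driver. Everything in it is hypothetical: the XCAP_* bits, struct fake_offload_dev and core_state_add() are illustration names and are not part of the posted patch or of any existing kernel API.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical capability bits a driver could advertise. */
#define XCAP_ESN   (1u << 0)
#define XCAP_ENCAP (1u << 1)
#define XCAP_TFC   (1u << 2)

struct fake_offload_dev {
	uint32_t caps;                      /* features the driver advertises */
	int (*state_add)(uint32_t needed);  /* driver callback */
};

/* Core-side gate: call the driver only if it advertises every needed bit. */
static int core_state_add(const struct fake_offload_dev *dev, uint32_t needed)
{
	if (needed & ~dev->caps)
		return -EINVAL;  /* driver never sees unsupported requests */
	return dev->state_add(needed);
}

/* Stand-in driver callback that accepts whatever reaches it. */
static int driver_accept(uint32_t needed)
{
	(void)needed;
	return 0;
}
```

With this shape a driver that sets no bits keeps rejecting ESN without any code change, and only drivers that gain ESN support need to be touched, which is the property argued for above.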


It looks like an overkill?

Alternatively, just solve this by failing to add the SA that has ESN set
if the driver hasn't defined your new 

Re: [patch net-next v7 07/13] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 02:25:36PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> As the tcm_ifindex 0 is invalid ifindex, reuse it to indicate that we
>> work with block, instead of qdisc. So if tcm_ifindex is 0, tcm_parent is
>> used to carry block_index.
>> 
>
>Commit log still refers to ifindex of 0 instead of TCM_IFINDEX_MAGIC_BLOCK

Missed this. Will update, thanks!
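
For readers following the handle convention discussed here, a small user-space sketch of how a receiver might tell "block" requests from "qdisc" requests. The struct and function names are hypothetical, and the sentinel value is an assumption, since the thread only names the TCM_IFINDEX_MAGIC_BLOCK macro.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel value; the thread only names the macro. */
#define TCM_IFINDEX_MAGIC_BLOCK 0xFFFFFFFFu

/* Cut-down, hypothetical version of the netlink tc message header. */
struct fake_tcmsg {
	uint32_t tcm_ifindex;
	uint32_t tcm_parent;   /* qdisc parent handle, or block index */
};

struct filter_target {
	bool is_block;
	uint32_t ifindex;      /* valid when !is_block */
	uint32_t parent;       /* valid when !is_block */
	uint32_t block_index;  /* valid when is_block */
};

/* Decide whether the message addresses a shared block or a qdisc. */
static struct filter_target resolve_target(const struct fake_tcmsg *t)
{
	struct filter_target out = {0};

	if (t->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK) {
		out.is_block = true;
		out.block_index = t->tcm_parent;
	} else {
		out.ifindex = t->tcm_ifindex;
		out.parent = t->tcm_parent;
	}
	return out;
}
```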


Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 02:27:11PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Add simple block get operation which primary purpose is to check the
>> block existence by block index.
>> 
>
>block_dump missing?

It is not needed for anything now. You see all the blocks when you list
qdiscs. Yet, dump could be easily added if needed in the future.


Re: [patch net-next v7 00/13] net: sched: allow qdiscs to share filter block instances

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 02:19:16PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Currently the filters added to qdiscs are independent. So for example if you
>> have 2 netdevices and you create ingress qdisc on both and you want to add
>> identical filter rules to both, you need to add them twice. This patchset
>> makes this easier and mainly saves resources by allowing all filters
>> within a qdisc to be shared - I call it a "filter block". This also helps to
>> save resources when we offload to hw, for example to expensive TCAM.
>> 
>> So back to the example. First, we create 2 qdiscs. Both will share
>> block number 22. "22" is just an identification:
>> $ tc qdisc add dev ens7 ingress block 22
>>  
>> $ tc qdisc add dev ens8 ingress block 22
>>  
>> 
>> If we don't specify "block" command line option, no shared block would
>> be created:
>> $ tc qdisc add dev ens9 ingress
>> 
>> Now if we list the qdiscs, we will see the block index in the output:
>> 
>> $ tc qdisc
>> qdisc ingress ffff: dev ens7 parent ffff:fff1 block 22
>> qdisc ingress ffff: dev ens8 parent ffff:fff1 block 22
>> qdisc ingress ffff: dev ens9 parent ffff:fff1
>> 
>> 
>> To make it more visual, the situation looks like this:
>> 
>> ens7 ingress qdisc                 ens8 ingress qdisc
>>          |                                  |
>>          |                                  |
>>          +---------->  block 22  <----------+
>> 
>> Unlimited number of qdiscs may share the same block.
>> 
>> Now we can add filter using the block index:
>> 
>> $ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop
>> 
>> 
>> Note we cannot use the qdisc for filter manipulations of shared blocks:
>> 
>> $ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 192.168.100.2 action drop
>> Error: This filter block is shared. Please use the block index to manipulate the filters.
>> 
>> 
>> We will see the same output if we list filters for ingress qdisc of
>> ens7 and ens8, also for the block 22:
>> 
>> $ tc filter show block 22
>> filter block 22 protocol ip pref 25 flower chain 0
>> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
>> ...
>> 
>> $ tc filter show dev ens7 ingress
>> filter block 22 protocol ip pref 25 flower chain 0
>> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
>> ...
>> 
>> $ tc filter show dev ens8 ingress
>> filter block 22 protocol ip pref 25 flower chain 0
>> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
>> ...
>> 
>
>Somewhere here mention the egress issue we talked about, something
>like:

I don't understand why we should mention something that is not supported
and that needs more thought and work before it can be. Let's leave that
text for the cover letter of that patchset, could we?


>
>At the moment only ingress and clsact_xxx are well supported by the
>block infrastructure. For this to work well with egress qdisc,
>all the ports/qdiscs sharing the block will have to be symmetric.
>e.g. if ens8 and ens9 root qdiscs shared a block at their (egress)
>root qdiscs, then those qdiscs would both need to have the same
>handle id. An example of a symmetric shared block setup would look like:
>
>tc qdisc add dev ens8 root block 22 handle 1:0 prio
>tc qdisc add dev ens9 root block 22 handle 1:0 prio
>
>
>I am confident the above would work. You said you are thinking of
>getting this to always work (I cant think of a simple way to do it),
>but for the moment the above is fine.
>Most people who want this would probably use clsact egress and not
>care about queues (so it may never be "fixed")
>
>cheers,
>jamal


[PATCH 2/3] tcp: Add ESP encapsulation support

2018-01-11 Thread Herbert Xu
This patch adds the plumbing in TCP for ESP encapsulation support
per RFC8229.

The patch mostly deals with inbound processing, as well as enabling
TCP encapsulation on a socket through setsockopt.  The outbound
processing is dealt with in the ESP code as is done for UDP.

The inbound processing is split into two halves.  First of all,
the softirq path directly intercepts ESP packets and feeds them
into the IPsec stack.  Most of the time the packet will be freed
right away if it contains complete ESP packets.  However, if
the message is incomplete or it contains non-ESP data, then the
skb will be added to the receive queue.  We also add packets to
the receive queue if it is currently non-empty, in order to
preserve sequence number continuity and minimise the changes
to the TCP code.

On the user-space facing side, packets marked as ESP-only are
skipped and not visible to user-space.  However, some ESP data
may seep through.  For example, if we receive a partial message
then we will always give it to user-space regardless of whether
it turns out to be ESP or not.  So user-space should be prepared
to skip ESP messages (SPI != 0).
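
A rough user-space sketch of that skipping logic, assuming the RFC 8229 stream framing: a 16-bit big-endian length that counts itself, followed either by an ESP packet (non-zero SPI) or by four zero bytes and an IKE message. The function and enum names are hypothetical, and treating frames too short to hold a 4-byte SPI or marker as invalid is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

enum frame_kind { FRAME_INCOMPLETE, FRAME_INVALID, FRAME_IKE, FRAME_ESP };

/*
 * Classify the first frame in buf. On FRAME_IKE or FRAME_ESP, *frame_len
 * receives the full frame size, including the 2-byte length prefix.
 */
static enum frame_kind classify_frame(const uint8_t *buf, size_t avail,
				      size_t *frame_len)
{
	size_t len;
	uint32_t spi;

	if (avail < 2)
		return FRAME_INCOMPLETE;
	len = ((size_t)buf[0] << 8) | buf[1];  /* big-endian, counts itself */
	if (len < 2 + 4)
		return FRAME_INVALID;          /* no room for SPI or marker */
	if (avail < len)
		return FRAME_INCOMPLETE;

	spi = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16) |
	      ((uint32_t)buf[4] << 8) | buf[5];
	*frame_len = len;
	return spi ? FRAME_ESP : FRAME_IKE;    /* SPI != 0 means ESP */
}
```

A key manager reading from the socket would drop frames classified as FRAME_ESP and hand FRAME_IKE frames (minus the length and the zero marker) to its IKE parser.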

There is a little bit of code dealing with the encapsulation side.
In particular, if encapsulation data comes in while the socket
is owned by user-space, the packets will be stored in tp->encap_out
and processed during release_sock.

Signed-off-by: Herbert Xu 
---

 include/linux/tcp.h  |   15 ++
 include/net/tcp.h|   27 +++
 include/uapi/linux/tcp.h |1 
 include/uapi/linux/udp.h |1 
 net/ipv4/tcp.c   |   68 +
 net/ipv4/tcp_input.c |  326 +--
 net/ipv4/tcp_ipv4.c  |1 
 net/ipv4/tcp_output.c|   48 ++
 8 files changed, 473 insertions(+), 14 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index ca4a636..1360a0e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		encap:1,	    /* TCP IKE/ESP encapsulation */
+		encap_lenhi_valid:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm? */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		unused1	    : 1,
@@ -373,6 +374,16 @@ struct tcp_sock {
 */
struct request_sock *fastopen_rsk;
u32 *saved_syn;
+
+#ifdef CONFIG_XFRM
+/* TCP ESP encapsulation */
+   struct sk_buff *encap_in;
+   struct sk_buff_head encap_out;
+   u32 encap_seq;
+   u32 encap_last;
+   u16 encap_backlog;
+   u8  encap_lenhi;
+#endif
 };
 
 enum tsq_enum {
@@ -384,6 +395,7 @@ enum tsq_enum {
TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
* tcp_v{4|6}_mtu_reduced()
*/
+   TCP_ESP_DEFERRED,  /* esp_output_tcp_encap2 queued packets */
 };
 
 enum tsq_flags {
@@ -393,6 +405,7 @@ enum tsq_flags {
TCPF_WRITE_TIMER_DEFERRED   = (1UL << TCP_WRITE_TIMER_DEFERRED),
TCPF_DELACK_TIMER_DEFERRED  = (1UL << TCP_DELACK_TIMER_DEFERRED),
TCPF_MTU_REDUCED_DEFERRED   = (1UL << TCP_MTU_REDUCED_DEFERRED),
+   TCPF_ESP_DEFERRED   = (1UL << TCP_ESP_DEFERRED),
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d..6513ae2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags);
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 size_t size, int flags);
+int tcp_encap_output(struct sock *sk, struct sk_buff *skb);
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
@@ -399,6 +400,7 @@ int compat_tcp_setsockopt(struct sock *sk, int level, int optname,
  char __user *optval, unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
+void tcp_cleanup_rbuf(struct sock *sk, int copied);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
int flags, int *addr_len);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
@@ -789,7 +791,8 @@ struct tcp_skb_cb {
__u8txstamp_ack:1,  /* Record TX timestamp for ack? */
eor:1,  /* Is skb MSG_EOR marked? */
has_rxtstamp:1, /* SKB has a RX 

[PATCH 3/3] ipsec: Add ESP over TCP encapsulation support

2018-01-11 Thread Herbert Xu
This patch adds support for ESP over TCP encapsulation per RFC8229.

Most of the input processing is done in the TCP stack and not in
this patch, which is similar to UDP encapsulation.

On the output side, there are two potential levels of indirection.
Firstly all packets are fed through a tasklet in order to avoid
TCP socket lock recursion.  They're then processed directly if
the TCP socket is not owned by user-space.  If it is owned then
we'll place the packet in a queue (tp->encap_out) for processing
when the socket lock is released.

The first outbound packet will trigger a socket lookup for a
matching TCP socket.  If the TCP connection drops we will repeat
the lookup as needed.  The TCP socket is cached in the xfrm state
and is read using RCU.

Note that unlike normal IPsec packets, once we hit a TCP xfrm
state, the xfrm stack is short-circuited and its journey will
continue through the TCP stack, after which a new IPsec lookup
will be done.  This is different from how UDP encapsulation is
done.  This means that if you're doing nested IPsec then you
will need to construct the policies with this in mind.  That is,
start with a new policy whenever TCP encapsulation is done.

Signed-off-by: Herbert Xu 
---

 include/net/xfrm.h|7 +
 net/ipv4/esp4.c   |  208 --
 net/xfrm/xfrm_input.c |   21 +++--
 net/xfrm/xfrm_state.c |3 
 4 files changed, 228 insertions(+), 11 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ae35991..3694536 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -180,6 +180,7 @@ struct xfrm_state {
 
/* Data for encapsulator */
struct xfrm_encap_tmpl  *encap;
+   struct sock __rcu   *encap_sk;
 
/* Data for care-of address */
xfrm_address_t  *coaddr;
@@ -210,6 +211,9 @@ struct xfrm_state {
u32 replay_maxage;
u32 replay_maxdiff;
 
+   /* Copy of encap_type from encap to avoid locking. */
+   u16 encap_type;
+
/* Replay detection notification timer */
struct timer_list   rtimer;
 
@@ -1570,6 +1574,9 @@ struct xfrmk_spdinfo {
 int xfrm_prepare_input(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
 int xfrm_input_resume(struct sk_buff *skb, int nexthdr);
+int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
+int (*finish)(struct net *, struct sock *,
+  struct sk_buff *));
 int xfrm_trans_queue(struct sk_buff *skb,
 int (*finish)(struct net *, struct sock *,
   struct sk_buff *));
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 61fe6e4..0544e4e 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -9,13 +9,16 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -30,6 +33,11 @@ struct esp_output_extra {
u32 esphoff;
 };
 
+struct esp_tcp_sk {
+   struct sock *sk;
+   struct rcu_head rcu;
+};
+
 #define ESP_SKB_CB(__skb) ((struct esp_skb_cb *)&((__skb)->cb[0]))
 
 static u32 esp4_get_mtu(struct xfrm_state *x, int mtu);
@@ -118,6 +126,143 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
put_page(sg_page(sg));
 }
 
+static void esp_free_tcp_sk(struct rcu_head *head)
+{
+	struct esp_tcp_sk *esk = container_of(head, struct esp_tcp_sk, rcu);
+
+	sock_put(esk->sk);
+	kfree(esk);
+}
+
+static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
+{
+	struct xfrm_encap_tmpl *encap = x->encap;
+	struct esp_tcp_sk *esk;
+	__be16 sport, dport;
+	struct sock *nsk;
+	struct sock *sk;
+
+	sk = rcu_dereference(x->encap_sk);
+	if (sk && sk->sk_state == TCP_ESTABLISHED)
+		return sk;
+
+	spin_lock_bh(&x->lock);
+	sport = encap->encap_sport;
+	dport = encap->encap_dport;
+	nsk = rcu_dereference_protected(x->encap_sk,
+					lockdep_is_held(&x->lock));
+	if (sk && sk == nsk) {
+		esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
+		if (!esk) {
+			spin_unlock_bh(&x->lock);
+			return ERR_PTR(-ENOMEM);
+		}
+		RCU_INIT_POINTER(x->encap_sk, NULL);
+		esk->sk = sk;
+		call_rcu(&esk->rcu, esp_free_tcp_sk);
+	}
+	spin_unlock_bh(&x->lock);
+
+	/* XXX We don't support bound_dev_if. */
+	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+				     dport, x->props.saddr.a4, sport, 0);
+
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	if (!tcp_sk(sk)->encap) {
+   

[PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked

2018-01-11 Thread Herbert Xu
For a function that needs to be called with the socket spinlock
held, sleeping would seem to be a bad idea.  This function does
in fact avoid sleeping when calling kernel_sendpage_locked on the
page part of the skb.  However, it doesn't do that when sending
the linear part, which results in sleeping when the socket send
buffer is full.

This patch fixes it by setting the MSG_DONTWAIT flag when calling
kernel_sendmsg_locked.
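
The effect of MSG_DONTWAIT can be seen from user space with an ordinary socket. The sketch below is only an analogue of the kernel path: fill_until_would_block() is a hypothetical helper that keeps sending until the send buffer is full, at which point a non-blocking send fails instead of sleeping.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Fill a socket's send buffer with non-blocking sends.
 * Returns the errno of the first failed send (expected: EAGAIN or
 * EWOULDBLOCK once the buffer is full), or 0 if max_bytes were all queued.
 */
static int fill_until_would_block(int fd, size_t max_bytes)
{
	char chunk[4096];
	size_t sent = 0;

	memset(chunk, 0, sizeof(chunk));
	while (sent < max_bytes) {
		/* MSG_DONTWAIT: fail with EAGAIN rather than block. */
		ssize_t n = send(fd, chunk, sizeof(chunk), MSG_DONTWAIT);

		if (n < 0)
			return errno;
		sent += (size_t)n;
	}
	return 0;
}
```

Without the flag, the same loop on a blocking socket whose peer never reads would stall in send(), which is the behaviour the patch removes for the linear part of the skb.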

Signed-off-by: Herbert Xu 
---

 net/core/skbuff.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6b0ff39..8197b7a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2279,6 +2279,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 		kv.iov_base = skb->data + offset;
 		kv.iov_len = slen;
 		memset(&msg, 0, sizeof(msg));
+		msg.msg_flags = MSG_DONTWAIT;
 
 		ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
 		if (ret <= 0)


Re: [patch net-next v7 03/13] net: sched: avoid usage of tp->q in tcf_classify

2018-01-11 Thread David Ahern
On 1/11/18 2:40 AM, Jiri Pirko wrote:
> Wed, Jan 10, 2018 at 05:17:28PM CET, dsah...@gmail.com wrote:
>> On 1/9/18 7:07 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Use block index in the messages instead.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  net/sched/cls_api.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>>> index 9b45950..31e91dc 100644
>>> --- a/net/sched/cls_api.c
>>> +++ b/net/sched/cls_api.c
>>> @@ -672,8 +672,9 @@ int tcf_classify(struct sk_buff *skb, const struct 
>>> tcf_proto *tp,
>>>  #ifdef CONFIG_NET_CLS_ACT
>>>  reset:
>>> if (unlikely(limit++ >= max_reclassify_loop)) {
>>> -   net_notice_ratelimited("%s: reclassify loop, rule prio %u, 
>>> protocol %02x\n",
>>> -  tp->q->ops->id, tp->prio & 0x,
>>> +   net_notice_ratelimited("%u: reclassify loop, rule prio %u, 
>>> protocol %02x\n",
>>
>> if you are dumping index instead of prio shouldn't the 'rule prio' above
>> be adjusted?
> 
> I'm not! Why do you think so?
> 
> "%u:" is tp->chain->block->index
> "prio %u" is tp->prio & 0x
> "%02x" is ntohs(tp->protocol)
> 

Never mind. scanned that too quickly.


[PATCH] netfilter: nf_tables: fix odd_ptr_err.cocci warnings

2018-01-11 Thread Julia Lawall
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   b4464bcab38d3f7fe995a7cb960eeac6889bec08
commit: 3b49e2e94e6ebb8b23d0955d9e898254455734f8 [8286/9035] netfilter: nf_tables: add flow table netlink frontend

The following is a 0-day report generated by Coccinelle.  But from the
line before, it looks like the fix is backwards, and the test should be on
flowtable.

julia
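
For readers unfamiliar with the warning, here is a user-space imitation of the error-pointer idiom it is about. ERR_PTR(), PTR_ERR() and IS_ERR() below are simplified re-implementations, and lookup() and get_flowtable() are hypothetical. The point is only that the pointer tested with IS_ERR() should be the one passed to PTR_ERR(), which in this case would be flowtable.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup that fails with -ENOENT for a missing object. */
static void *lookup(int found, void *obj)
{
	return found ? obj : ERR_PTR(-ENOENT);
}

/* Test and decode the same pointer at each step. */
static long get_flowtable(int table_found, int flowtable_found)
{
	static int table_obj, flowtable_obj;
	void *table, *flowtable;

	table = lookup(table_found, &table_obj);
	if (IS_ERR(table))
		return PTR_ERR(table);

	flowtable = lookup(flowtable_found, &flowtable_obj);
	if (IS_ERR(flowtable))  /* same pointer in the test and in PTR_ERR */
		return PTR_ERR(flowtable);

	return 0;
}
```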


 nf_tables_api.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5419,7 +5419,7 @@ static int nf_tables_getflowtable(struct
flowtable = nf_tables_flowtable_lookup(table, nla[NFTA_FLOWTABLE_NAME],
   genmask);
if (IS_ERR(table))
-   return PTR_ERR(flowtable);
+   return PTR_ERR(table);

skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
if (!skb2)


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
>On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Benefit from the previously introduced shared filter blocks
>> infrastructure and allow ingress and clsact qdisc instances to share
>> filter blocks. The block index is coming from userspace as qdisc option.
>
>Didn't quite follow why ingress is special and needs attributes to
>set the block but other qdiscs didn't.

Jamal, again, other qdiscs do not support block sharing. This patchset
only adds support for block sharing for ingress and clsact qdiscs.
Later on, other qdiscs could also support block sharing.


>Will check again later after some coffee..
>
>cheers,
>jamal


Re: general protection fault in sctp_v6_get_dst

2018-01-11 Thread Neil Horman
On Thu, Jan 11, 2018 at 05:30:17PM +0800, Xin Long wrote:
> On Thu, Jan 11, 2018 at 2:15 AM, syzbot
>  wrote:
> > syzkaller has found reproducer for the following crash on
> > 61ad64080e039dce99a7f8d89b729bbea995e2f7
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> >
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+7b7b518b1228d2743...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed.
> >
> > device lo entered promiscuous mode
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] SMP KASAN
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Modules linked in:
> > CPU: 0 PID: 3506 Comm: syzkaller968983 Not tainted 4.15.0-rc7+ #181
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__read_once_size include/linux/compiler.h:183 [inline]
> > RIP: 0010:sctp_v6_get_dst+0x59e/0x1c60 net/sctp/ipv6.c:271
> > RSP: 0018:8801db205e20 EFLAGS: 00010206
> > RAX: dc00 RBX:  RCX: 8512e05b
> > RDX: 000f RSI: 67cf608c RDI: 8801db22376c
> > RBP: 8801db206190 R08: 11003b640b05 R09: 0002
> > R10: 8801db205cf0 R11: 8512e008 R12: 8801bf884db0
> > R13: 204e R14: 8801bfe3e680 R15: 8801bf884d80
> > FS:  7f122e219700() GS:8801db20() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 20aaff09 CR3: 0001bfdf0005 CR4: 001606f0
> >
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  
> >  sctp_transport_route+0xa8/0x430 net/sctp/transport.c:293
> >  sctp_assoc_add_peer+0x4fe/0x1190 net/sctp/associola.c:655
> >  sctp_process_init+0x119/0x2440 net/sctp/sm_make_chunk.c:2341
> >  sctp_sf_do_5_1B_init+0x8c9/0xe80 net/sctp/sm_statefuns.c:414
> >  sctp_do_sm+0x192/0x6ed0 net/sctp/sm_sideeffect.c:1178
> >  sctp_endpoint_bh_rcv+0x379/0x8f0 net/sctp/endpointola.c:456
> >  sctp_inq_push+0x23b/0x300 net/sctp/inqueue.c:95
> >  sctp_rcv+0x29f3/0x35c0 net/sctp/input.c:267
> >  sctp6_rcv+0x15/0x30 net/sctp/ipv6.c:1006
> >  ip6_input_finish+0x37e/0x17a0 net/ipv6/ip6_input.c:284
> >  NF_HOOK include/linux/netfilter.h:288 [inline]
> >  ip6_input+0xdb/0x560 net/ipv6/ip6_input.c:327
> >  dst_input include/net/dst.h:449 [inline]
> >  ip6_rcv_finish+0x1a9/0x7a0 net/ipv6/ip6_input.c:71
> >  NF_HOOK include/linux/netfilter.h:288 [inline]
> >  ipv6_rcv+0xf37/0x1fa0 net/ipv6/ip6_input.c:208
> >  __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4538
> >  __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4603
> >  process_backlog+0x203/0x740 net/core/dev.c:5283
> >  napi_poll net/core/dev.c:5681 [inline]
> >  net_rx_action+0x792/0x1910 net/core/dev.c:5747
> >  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
> >  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1133
> >  
> >  do_softirq.part.21+0x14d/0x190 kernel/softirq.c:329
> >  do_softirq kernel/softirq.c:177 [inline]
> >  __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182
> >  local_bh_enable include/linux/bottom_half.h:32 [inline]
> >  rcu_read_unlock_bh include/linux/rcupdate.h:727 [inline]
> >  ip6_finish_output2+0xba0/0x23a0 net/ipv6/ip6_output.c:121
> >  ip6_finish_output+0x698/0xaf0 net/ipv6/ip6_output.c:154
> >  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
> >  ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171
> >  dst_output include/net/dst.h:443 [inline]
> >  NF_HOOK include/linux/netfilter.h:288 [inline]
> >  ip6_xmit+0xd84/0x2090 net/ipv6/ip6_output.c:277
> >  sctp_v6_xmit+0x438/0x630 net/sctp/ipv6.c:225
> >  sctp_packet_transmit+0x225e/0x3750 net/sctp/output.c:638
> >  sctp_outq_flush+0xabb/0x4060 net/sctp/outqueue.c:911
> >  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776
> >  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1807 [inline]
> >  sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline]
> >  sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181
> >  sctp_primitive_ASSOCIATE+0x9d/0xd0 net/sctp/primitive.c:88
> >  sctp_sendmsg+0x1d2e/0x33f0 net/sctp/socket.c:2018
> >  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
> >  sock_sendmsg_nosec net/socket.c:628 [inline]
> >  sock_sendmsg+0xca/0x110 net/socket.c:638
> >  SYSC_sendto+0x361/0x5c0 net/socket.c:1719
> >  SyS_sendto+0x40/0x50 net/socket.c:1687
> >  entry_SYSCALL_64_fastpath+0x23/0x9a
> > RIP: 0033:0x4456c9
> > RSP: 002b:7f122e218d98 EFLAGS: 

[PATCH 0/3] ipsec: Add ESP over TCP encapsulation

2018-01-11 Thread Herbert Xu
Hi:

This series of patches adds basic support for ESP over TCP (RFC 8229).
Note that this does not include TLS support, but it could be added in
the future.

Here is an iproute patch to set up xfrm states with this:

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 12c2f72..f3fb1e2 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -738,6 +738,9 @@ void xfrm_xfrma_print(struct rtattr *tb[], __u16 family,
case 2:
fprintf(fp, "espinudp ");
break;
+   case 6:
+   fprintf(fp, "espintcp ");
+   break;
default:
fprintf(fp, "%u ", e->encap_type);
break;
@@ -1182,6 +1185,8 @@ int xfrm_encap_type_parse(__u16 *type, int *argcp, char ***argvp)
*type = 1;
else if (strcmp(*argv, "espinudp") == 0)
*type = 2;
+   else if (strcmp(*argv, "espintcp") == 0)
+   *type = 6;
else
invarg("ENCAP-TYPE value is invalid", *argv);
 

Here is a sample program for setting up the TCP socket to use this.
Note that it doesn't do the magic word as required by RFC 8229 so
you'll need to add that for a real key manager.

#include <arpa/inet.h>
#include <errno.h>
#include <error.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#define TCP_ENCAP 35

int main(int argc, char **argv)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(4500),
	};
	char buf[4096];
	int one = 1;
	int err;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		error(-1, errno, "socket");

	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		error(-1, errno, "bind");

	if (argc > 1) {
		addr.sin_addr.s_addr = inet_addr(argv[1]);
		if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
			error(-1, errno, "connect");
	} else {
		if (listen(s, 0) < 0)
			error(-1, errno, "listen");

		s = accept(s, NULL, 0);
		if (s < 0)
			error(-1, errno, "accept");
	}

	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
		error(-1, errno, "TCP_NODELAY");

	if (setsockopt(s, SOL_TCP, TCP_ENCAP, NULL, 0) < 0)
		error(-1, errno, "TCP_ENCAP");

	while ((err = read(s, buf, sizeof(buf))) > 0)
		;

	if (err < 0)
		error(-1, errno, "read");

	return 0;
}
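
For anyone extending the sample: the part it skips is the stream prefix
and message framing from RFC 8229. A rough sketch, not taken from the
patches, could look like the following. The initiator sends the six
octets "IKETCP" once at the start of the stream, and every message is
preceded by a 16-bit length in network byte order that counts the length
field itself. The names espintcp_prefix and espintcp_frame are made up
for illustration, and the non-ESP marker that IKE messages carry is left
out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stream prefix sent once by the initiator (RFC 8229). */
static const unsigned char espintcp_prefix[6] = { 'I', 'K', 'E', 'T', 'C', 'P' };

/*
 * Hypothetical helper: frame one message for the TCP stream.
 * Writes a 2-octet big-endian length (which includes the length
 * field itself) followed by the payload.  Returns the number of
 * octets written to out, or 0 if the message does not fit.
 */
static size_t espintcp_frame(unsigned char *out, size_t out_len,
			     const unsigned char *msg, size_t msg_len)
{
	size_t total = msg_len + 2;

	if (total > 0xffff || total > out_len)
		return 0;

	out[0] = (unsigned char)(total >> 8);
	out[1] = (unsigned char)(total & 0xff);
	memcpy(out + 2, msg, msg_len);
	return total;
}
```

A key manager would write espintcp_prefix right after connect() and then
pass each outgoing IKE message through a helper like this before
writing it to the socket.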


Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jamal Hadi Salim

On 18-01-09 09:07 AM, Jiri Pirko wrote:

From: Jiri Pirko 

Benefit from the previously introduced shared filter blocks
infrastructure and allow ingress and clsact qdisc instances to share
filter blocks. The block index is coming from userspace as qdisc option.


Didn't quite follow why ingress is special and needs attributes to
set the block but other qdiscs don't.
Will check again later after some coffee..

cheers,
jamal


Re: [PATCH net] net: ipv4: Make "ip route get" match iif lo rules again.

2018-01-11 Thread David Ahern
On 1/11/18 2:36 AM, Lorenzo Colitti wrote:
> Commit 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu
> versions of route lookup") broke "ip route get" in the presence
> of rules that specify iif lo.
> 
> Host-originated traffic always has iif lo, because
> ip_route_output_key_hash and ip6_route_output_flags set the flow
> iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a
> convenient way to select only originated traffic and not forwarded
> traffic.
> 
> inet_rtm_getroute used to match these rules correctly because
> even though it sets the flow iif to 0, it called
> ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX.
> But now that it calls ip_route_output_key_hash_rcu, the ifindex
> will remain 0 and not match the iif lo in the rule. As a result,
> "ip route get" will return ENETUNREACH.
> 
> Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of 
> route lookup")
> Tested: 
> https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py
>  passes again
> Signed-off-by: Lorenzo Colitti 
> ---
>  net/ipv4/route.c | 1 +
>  1 file changed, 1 insertion(+)
> 

Missed that. Thanks for fixing.

Acked-by: David Ahern 



Re: [PATCH] netfilter: nf_tables: fix odd_ptr_err.cocci warnings

2018-01-11 Thread Pablo Neira Ayuso
On Thu, Jan 11, 2018 at 03:02:12PM +0100, Julia Lawall wrote:
> tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> head:   b4464bcab38d3f7fe995a7cb960eeac6889bec08
> commit: 3b49e2e94e6ebb8b23d0955d9e898254455734f8 [8286/9035] netfilter:
> nf_tables: add flow table netlink frontend
> 
> The following is a 0-day report generated by Coccinelle.  But from the
> line before, it looks like the fix is backwards, and the test should be on
> flowtable.

There's a fix for this in nf-next.git

https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/commit/?id=03a0120f75dfb1807c0441376e26b36160087de4

Will pass it up to David asap.


BUG: using smp_processor_id() in preemptible

2018-01-11 Thread Ricardo Nabinger Sanchez
Greetings,

I'm getting occasional video lock-ups, and while checking logs I found
these:

===
[  297.445296] BUG: using smp_processor_id() in preemptible [] code: 
claws-mail/1635
[  297.445319] caller is jprobe_return+0x12/0x25
[  297.445332] CPU: 1 PID: 1635 Comm: claws-mail Not tainted 4.14.0 #1
[  297.445341] Hardware name: Micro-Star International Co., Ltd. 
GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011
[  297.445349] Call Trace:
[  297.445372]  dump_stack+0x9f/0xe1
[  297.445392]  check_preemption_disabled+0xec/0xf0
[  297.445409]  jprobe_return+0x12/0x25
[  297.445425]  tcp_v4_do_rcv+0x7f/0x1a0
[  297.445443]  __release_sock+0x6d/0x100
[  297.445462]  release_sock+0x2b/0xb0
[  297.445475]  tcp_recvmsg+0x300/0x8f0
[  297.445504]  ? __lock_acquire+0x3ee/0x1610
[  297.445517]  ? core_sys_select+0x240/0x3e0
[  297.445541]  inet_recvmsg+0x51/0x1b0
[  297.445566]  sock_read_iter+0x8c/0xd0
[  297.445598]  __vfs_read+0xd5/0x140
[  297.445632]  vfs_read+0x9e/0x150
[  297.445652]  SyS_read+0x45/0xa0
[  297.445675]  entry_SYSCALL_64_fastpath+0x23/0xc2
[  297.445687] RIP: 0033:0x7ff2536001b8
[  297.445696] RSP: 002b:7ff247152890 EFLAGS: 0246 ORIG_RAX: 

[  297.445713] RAX: ffda RBX: 9cd088ccbff0 RCX: 7ff2536001b8
[  297.445721] RDX: 0005 RSI: 7ff23c02bb43 RDI: 0013
[  297.445730] RBP: 7ff23c02bb43 R08:  R09: 7ff23c00e520
[  297.445738] R10: 0010 R11: 0246 R12: 0086
[  297.445746] R13: 002f R14: 7ff254d3c998 R15: 0001
...
[  366.965766] BUG: using smp_processor_id() in preemptible [] code: 
Socket Thread/1435
[  366.965769] caller is jprobe_return+0x12/0x25
[  366.965773] CPU: 0 PID: 1435 Comm: Socket Thread Not tainted 4.14.0 #1
[  366.965775] Hardware name: Micro-Star International Co., Ltd. 
GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011
[  366.965777] Call Trace:
[  366.965780]  dump_stack+0x9f/0xe1
[  366.965786]  check_preemption_disabled+0xec/0xf0
[  366.965790]  jprobe_return+0x12/0x25
[  366.965793]  tcp_v4_do_rcv+0x7f/0x1a0
[  366.965797]  __release_sock+0x6d/0x100
[  366.965811]  release_sock+0x2b/0xb0
[  366.965813]  tcp_recvmsg+0x300/0x8f0
[  366.965826]  inet_recvmsg+0x51/0x1b0
[  366.965834]  SYSC_recvfrom+0xc6/0x130
[  366.965845]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[  366.965848]  ? trace_hardirqs_on_caller+0xcb/0x200
[  366.965851]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  366.965858]  entry_SYSCALL_64_fastpath+0x23/0xc2
[  366.965860] RIP: 0033:0x7f475ab7e5da
[  366.965862] RSP: 002b:7f47438fc8b0 EFLAGS: 0246 ORIG_RAX: 
002d
[  366.965864] RAX: ffda RBX: 9cd088ae7ff0 RCX: 7f475ab7e5da
[  366.965865] RDX: 8000 RSI: 7f4721202000 RDI: 007c
[  366.965867] RBP:  R08:  R09: 
[  366.965868] R10:  R11: 0246 R12: 0086
[  366.965869] R13: 7f47212025a8 R14: 7a58 R15: 7f474ba1e5f2
[  366.966571] BUG: using smp_processor_id() in preemptible [] code: 
Socket Thread/1435
[  366.966574] caller is jprobe_return+0x12/0x25
[  366.966576] CPU: 0 PID: 1435 Comm: Socket Thread Not tainted 4.14.0 #1
[  366.966577] Hardware name: Micro-Star International Co., Ltd. 
GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011
[  366.966578] Call Trace:
[  366.966582]  dump_stack+0x9f/0xe1
[  366.966586]  check_preemption_disabled+0xec/0xf0
[  366.966592]  jprobe_return+0x12/0x25
[  366.966596]  tcp_v4_do_rcv+0x7f/0x1a0
[  366.966601]  __release_sock+0x6d/0x100
[  366.966606]  release_sock+0x2b/0xb0
[  366.966610]  tcp_recvmsg+0x300/0x8f0
[  366.966622]  inet_recvmsg+0x51/0x1b0
[  366.966630]  SYSC_recvfrom+0xc6/0x130
[  366.966643]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[  366.966647]  ? trace_hardirqs_on_caller+0xcb/0x200
[  366.966651]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  366.97]  entry_SYSCALL_64_fastpath+0x23/0xc2
[  366.99] RIP: 0033:0x7f475ab7e5da
[  366.966670] RSP: 002b:7f47438fc8b0 EFLAGS: 0246 ORIG_RAX: 
002d
[  366.966673] RAX: ffda RBX: 9cd088ae7ff0 RCX: 7f475ab7e5da
[  366.966674] RDX: 8000 RSI: 7f4721202000 RDI: 007c
[  366.966676] RBP:  R08:  R09: 
[  366.966677] R10:  R11: 0246 R12: 0086
[  366.966679] R13: 7f47438fca70 R14: 05a8 R15: 7f4721202000
[  366.979991] BUG: using smp_processor_id() in preemptible [] code: 
Socket Thread/1435
[  366.97] caller is jprobe_return+0x12/0x25
[  366.980004] CPU: 0 PID: 1435 Comm: Socket Thread Not tainted 4.14.0 #1
[  366.980007] Hardware name: Micro-Star International Co., Ltd. 
GX780/GT780/MS-1761, BIOS E1761IMS V3.01 05/02/2011
[  366.980012] Call Trace:
[  366.980023]  dump_stack+0x9f/0xe1
[  366.980033]  

Re: [patch net-next v7 00/13] net: sched: allow qdiscs to share filter block instances

2018-01-11 Thread Jamal Hadi Salim

On 18-01-09 09:07 AM, Jiri Pirko wrote:

From: Jiri Pirko 

Currently the filters added to qdiscs are independent. So for example if you
have 2 netdevices and you create ingress qdisc on both and you want to add
identical filter rules to both, you need to add them twice. This patchset
makes this easier and mainly saves resources by allowing all filters within
a qdisc to be shared - I call it a "filter block". This also helps to save
resources when we offload to hw, for example to expensive TCAM.

So back to the example. First, we create 2 qdiscs. Both will share
block number 22. "22" is just an identification:
$ tc qdisc add dev ens7 ingress block 22
 
$ tc qdisc add dev ens8 ingress block 22
 

If we don't specify "block" command line option, no shared block would
be created:
$ tc qdisc add dev ens9 ingress

Now if we list the qdiscs, we will see the block index in the output:

$ tc qdisc
qdisc ingress ffff: dev ens7 parent ffff:fff1 block 22
qdisc ingress ffff: dev ens8 parent ffff:fff1 block 22
qdisc ingress ffff: dev ens9 parent ffff:fff1


To make it more visual, the situation looks like this:

   ens7 ingress qdisc                 ens8 ingress qdisc
          |                                  |
          |                                  |
          +---------->  block 22  <----------+

Unlimited number of qdiscs may share the same block.

Now we can add filter using the block index:

$ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 
action drop


Note we cannot use the qdisc for filter manipulations of shared blocks:

$ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 192.168.100.2 
action drop
Error: This filter block is shared. Please use the block index to manipulate 
the filters.


We will see the same output if we list filters for ingress qdisc of
ens7 and ens8, also for the block 22:

$ tc filter show block 22
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

$ tc filter show dev ens7 ingress
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

$ tc filter show dev ens8 ingress
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...



Somewhere here mention the egress issue we talked about, something
like:

At the moment only ingress and clsact_xxx are well supported by the
block infrastructure. For this to work well with egress qdiscs,
all the ports/qdiscs sharing the block will have to be symmetric.
e.g. if ens8 and ens9 shared a block at their (egress)
root qdiscs, then those qdiscs would both need to have the same
handle id. An example of a symmetric shared block setup would look like:

tc qdisc add dev ens8 root block 22 handle 1:0 prio
tc qdisc add dev ens9 root block 22 handle 1:0 prio


I am confident the above would work. You said you are thinking of
getting this to always work (I can't think of a simple way to do it),
but for the moment the above is fine.
Most people who want this would probably use clsact egress and not
care about queues (so it may never be "fixed")

cheers,
jamal


Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get

2018-01-11 Thread Jamal Hadi Salim

On 18-01-09 09:07 AM, Jiri Pirko wrote:

From: Jiri Pirko 

Add a simple block get operation whose primary purpose is to check the
block existence by block index.



block_dump missing?

cheers,
jamal


Re: [PATCH 30/32] aio: add delayed cancel support

2018-01-11 Thread Christoph Hellwig
On Wed, Jan 10, 2018 at 06:26:39PM -0500, Jeff Moyer wrote:
> >> The upcoming aio poll support would like to be able to complete the
> >> iocb inline from the cancellation context, but that would cause
> >> a lock order reversal.  Add support for optionally moving the cancelation
> >> outside the context lock to avoid this reversal.
> >>
> >> Signed-off-by: Christoph Hellwig 
> >
> > Acked-by: Jeff Moyer 
> 
> Actually, let's move these two defines:
> 
> #define AIO_IOCB_DELAYED_CANCEL (1 << 0)
> #define AIO_IOCB_CANCELLED  (1 << 1)
> 
> to include/linux/aio.h so that drivers outside of fs/aio.c can make use
> of them.

struct aio_kiocb is private to aio.c, so just exposing them won't
do anything useful.  If we really need these elsewhere we'll need
to come up with a proper interface.


[PATCH net] net: ipv4: Make "ip route get" match iif lo rules again.

2018-01-11 Thread Lorenzo Colitti
Commit 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu
versions of route lookup") broke "ip route get" in the presence
of rules that specify iif lo.

Host-originated traffic always has iif lo, because
ip_route_output_key_hash and ip6_route_output_flags set the flow
iif to LOOPBACK_IFINDEX. Thus, putting "iif lo" in an ip rule is a
convenient way to select only originated traffic and not forwarded
traffic.

inet_rtm_getroute used to match these rules correctly because
even though it sets the flow iif to 0, it called
ip_route_output_key which overwrites iif with LOOPBACK_IFINDEX.
But now that it calls ip_route_output_key_hash_rcu, the ifindex
will remain 0 and not match the iif lo in the rule. As a result,
"ip route get" will return ENETUNREACH.

Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of 
route lookup")
Tested: 
https://android.googlesource.com/kernel/tests/+/master/net/test/multinetwork_test.py
 passes again
Signed-off-by: Lorenzo Colitti 
---
 net/ipv4/route.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 43b69af242..4e153b23bc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2762,6 +2762,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
if (err == 0 && rt->dst.error)
err = -rt->dst.error;
} else {
+   fl4.flowi4_iif = LOOPBACK_IFINDEX;
rt = ip_route_output_key_hash_rcu(net, &fl4, &res, skb);
err = 0;
if (IS_ERR(rt))
-- 
2.16.0.rc1.238.g530d649a79-goog
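
The iif mismatch described in the commit message can be modelled outside
the kernel. The sketch below is only an illustration under simplified
assumptions: struct flow, struct rule and rule_matches() are stand-ins
invented here, not the kernel's flowi4 or fib_rule code. It shows why a
flow whose iif stays 0 fails to match an "iif lo" rule, while a flow with
iif set to LOOPBACK_IFINDEX matches.

```c
#include <stdbool.h>

#define LOOPBACK_IFINDEX 1	/* ifindex of lo, as in the kernel */

/* Simplified stand-ins for the kernel structures; only the iif matters. */
struct flow {
	int iif;
};

struct rule {
	int iifindex;		/* 0 means the rule has no iif selector */
};

/* A rule with an iif selector only matches flows carrying that iif. */
static bool rule_matches(const struct rule *r, const struct flow *fl)
{
	if (r->iifindex && r->iifindex != fl->iif)
		return false;
	return true;
}
```

With a rule of { .iifindex = LOOPBACK_IFINDEX }, a flow left at iif 0
does not match, which corresponds to the ENETUNREACH case above; setting
the flow iif to LOOPBACK_IFINDEX before the lookup makes it match again.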



Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Jiri Kosina
On Tue, 9 Jan 2018, Josh Poimboeuf wrote:

> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina  wrote:
> > > On Fri, 5 Jan 2018, Dan Williams wrote:
> > >
> > > [ ... snip ... ]
> > >> Andi Kleen (1):
> > >>   x86, barrier: stop speculation for failed access_ok
> > >>
> > >> Dan Williams (13):
> > >>   x86: implement nospec_barrier()
> > >>   [media] uvcvideo: prevent bounds-check bypass via speculative 
> > >> execution
> > >>   carl9170: prevent bounds-check bypass via speculative execution
> > >>   p54: prevent bounds-check bypass via speculative execution
> > >>   qla2xxx: prevent bounds-check bypass via speculative execution
> > >>   cw1200: prevent bounds-check bypass via speculative execution
> > >>   Thermal/int340x: prevent bounds-check bypass via speculative 
> > >> execution
> > >>   ipv6: prevent bounds-check bypass via speculative execution
> > >>   ipv4: prevent bounds-check bypass via speculative execution
> > >>   vfs, fdtable: prevent bounds-check bypass via speculative execution
> > >>   net: mpls: prevent bounds-check bypass via speculative execution
> > >>   udf: prevent bounds-check bypass via speculative execution
> > >>   userns: prevent bounds-check bypass via speculative execution
> > >>
> > >> Mark Rutland (4):
> > >>   asm-generic/barrier: add generic nospec helpers
> > >>   Documentation: document nospec helpers
> > >>   arm64: implement nospec_ptr()
> > >>   arm: implement nospec_ptr()
> > >
> > > So considering the recent publication of [1], how come we all of a sudden
> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
> > >
> > > Is this going to be handled in eBPF in some other way?
> > >
> > > Without that in place, and considering Jann Horn's paper, it would seem
> > > like PTI doesn't really lock it down fully, right?
> > 
> > Here is the latest (v3) bpf fix:
> > 
> > https://patchwork.ozlabs.org/patch/856645/
> > 
> > I currently have v2 on my 'nospec' branch and will move that to v3 for
> > the next update, unless it goes upstream before then.

Daniel, I guess you're planning to send this still for 4.15?

> That patch seems specific to CONFIG_BPF_SYSCALL.  Is the bpf() syscall
> the only attack vector?  Or are there other ways to run bpf programs
> that we should be worried about?

Seems like Alexei is probably the only person in the whole universe who 
isn't CCed here ... let's fix that.

Thanks,

-- 
Jiri Kosina
SUSE Labs



[PATCH net-next 10/11] net: hns3: add feature check when feature changed

2018-01-11 Thread Peng Li
From: Jian Shen 

Local variable "changed" was defined to indicate which features changed,
but was used only for feature NETIF_F_HW_VLAN_CTAG_RX. Add checking
for other features.

Fixes: 052ece6dc19c ("net: hns3: add ethtool related offload command")
Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 27 ++---
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 34879c4..a7ae4f3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1118,25 +1118,28 @@ static int hns3_nic_net_set_mac_address(struct net_device *netdev, void *p)
 static int hns3_nic_set_features(struct net_device *netdev,
 netdev_features_t features)
 {
+   netdev_features_t changed = netdev->features ^ features;
struct hns3_nic_priv *priv = netdev_priv(netdev);
struct hnae3_handle *h = priv->ae_handle;
-   netdev_features_t changed;
int ret;
 
-   if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {
-   priv->ops.fill_desc = hns3_fill_desc_tso;
-   priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tso;
-   } else {
-   priv->ops.fill_desc = hns3_fill_desc;
-   priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tx;
+   if (changed & (NETIF_F_TSO | NETIF_F_TSO6)) {
+   if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {
+   priv->ops.fill_desc = hns3_fill_desc_tso;
+   priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tso;
+   } else {
+   priv->ops.fill_desc = hns3_fill_desc;
+   priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tx;
+   }
}
 
-   if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
-   h->ae_algo->ops->enable_vlan_filter(h, true);
-   else
-   h->ae_algo->ops->enable_vlan_filter(h, false);
+   if (changed & NETIF_F_HW_VLAN_CTAG_FILTER) {
+   if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
+   h->ae_algo->ops->enable_vlan_filter(h, true);
+   else
+   h->ae_algo->ops->enable_vlan_filter(h, false);
+   }
 
-   changed = netdev->features ^ features;
if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
if (features & NETIF_F_HW_VLAN_CTAG_RX)
ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, true);
-- 
1.9.1
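
The check this patch relies on is the usual XOR of the old and the
requested feature masks. A minimal userspace sketch of the idea follows;
the F_* bit values and the helper names are invented for illustration
and do not correspond to the real NETIF_F_* definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature bits; the real NETIF_F_* values differ. */
#define F_TSO		(1u << 0)
#define F_VLAN_FILTER	(1u << 1)
#define F_VLAN_RX	(1u << 2)

/* Bits that differ between the current and the requested feature set. */
static uint32_t changed_features(uint32_t current, uint32_t requested)
{
	return current ^ requested;
}

/* True only if one of the bits in mask is being toggled. */
static bool needs_update(uint32_t current, uint32_t requested, uint32_t mask)
{
	return (changed_features(current, requested) & mask) != 0;
}
```

With this, a request that only toggles F_VLAN_RX leaves the TSO and VLAN
filter settings untouched, which is the behaviour the patch is after.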



[PATCH net-next 06/11] net: hns3: refactor GL update function

2018-01-11 Thread Peng Li
From: Fuyun Liang 

The GL update function uses the max GL value between tx_int_gl and
rx_int_gl to set both new tx_int_gl and new rx_int_gl. Therefore, users
cannot enable TX GL self-adaptive or RX GL self-adaptive individually.

This patch refactors the code to update the TX GL and the RX GL
separately, so that users can enable TX GL self-adaptive or RX GL
self-adaptive individually.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 35 +++--
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 59d8d9f..2a139ef 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2459,25 +2459,22 @@ static bool hns3_get_new_int_gl(struct hns3_enet_ring_group *ring_group)
 
 static void hns3_update_new_int_gl(struct hns3_enet_tqp_vector *tqp_vector)
 {
-   u16 rx_int_gl, tx_int_gl;
-   bool rx, tx;
-
-   rx = hns3_get_new_int_gl(&tqp_vector->rx_group);
-   tx = hns3_get_new_int_gl(&tqp_vector->tx_group);
-   rx_int_gl = tqp_vector->rx_group.int_gl;
-   tx_int_gl = tqp_vector->tx_group.int_gl;
-   if (rx && tx) {
-   if (rx_int_gl > tx_int_gl) {
-   tqp_vector->tx_group.int_gl = rx_int_gl;
-   tqp_vector->tx_group.flow_level =
-   tqp_vector->rx_group.flow_level;
-   hns3_set_vector_coalesc_gl(tqp_vector, rx_int_gl);
-   } else {
-   tqp_vector->rx_group.int_gl = tx_int_gl;
-   tqp_vector->rx_group.flow_level =
-   tqp_vector->tx_group.flow_level;
-   hns3_set_vector_coalesc_gl(tqp_vector, tx_int_gl);
-   }
+   struct hns3_enet_ring_group *rx_group = &tqp_vector->rx_group;
+   struct hns3_enet_ring_group *tx_group = &tqp_vector->tx_group;
+   bool rx_update, tx_update;
+
+   if (rx_group->gl_adapt_enable) {
+   rx_update = hns3_get_new_int_gl(rx_group);
+   if (rx_update)
+   hns3_set_vector_coalesce_rx_gl(tqp_vector,
+  rx_group->int_gl);
+   }
+
+   if (tx_group->gl_adapt_enable) {
+   tx_update = hns3_get_new_int_gl(&tqp_vector->tx_group);
+   if (tx_update)
+   hns3_set_vector_coalesce_tx_gl(tqp_vector,
+  tx_group->int_gl);
}
 }
 
-- 
1.9.1



[PATCH net-next 00/11] add some new features and fix some bugs

2018-01-11 Thread Peng Li
This patchset adds some new features and fixes some bugs:
[patch 1/11] adds ethtool_ops.get_channels support for VF.
[patch 2/11] removes TSO config command from VF driver.
[patch 3/11] adds ethtool_ops.get_coalesce support to PF.
[patch 4/11] adds ethtool_ops.set_coalesce support to PF.
[patch 5/11 - 11/11] do some code improvements and fix some bugs.

Fuyun Liang (7):
  net: hns3: add ethtool_ops.get_coalesce support to PF
  net: hns3: add ethtool_ops.set_coalesce support to PF
  net: hns3: refactor interrupt coalescing init function
  net: hns3: refactor GL update function
  net: hns3: remove unused GL setup function
  net: hns3: change the unit of GL value macro
  net: hns3: add int_gl_idx setup for TX and RX queues

Jian Shen (2):
  net: hns3: fixes for feature changed checking
  net: hns3: fix possible NULL pointer in hns3_nic_set_features

Peng Li (2):
  net: hns3: add ethtool_ops.get_channels support for VF
  net: hns3: remove TSO config command from VF driver

 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   7 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 148 ++---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  26 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 179 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c|   5 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h   |   8 -
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  50 +++---
 7 files changed, 336 insertions(+), 87 deletions(-)

-- 
1.9.1



Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-11 Thread Jiri Benc
On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> I still prefer the approach I had mentioned that uses 'mtu_adj'. In
> that approach you can leave those slaves which have changed their mtu
> to be lower than masters' but if master's mtu changes to larger value
> all other slaves will get updated mtu leaving behind the slaves who
> have opted to change their mtu on their own. Also the same thing is
> true when mtu get reduced at master.

The problem with this magic behavior is, well, that it's magic. There's
no way to tell what happens with a given slave when the master's MTU
gets changed just by looking at the current configuration. There's also
no way to switch the magic behavior back on once the slave's MTU is
changed.

At minimum, you'd need some kind of indication that the slave's MTU is
following the master. And a way to toggle this back.

Keefe's patch is much saner, the behavior is completely deterministic.

 Jiri


Re: [PATCH 03/32] fs: introduce new ->get_poll_head and ->poll_mask methods

2018-01-11 Thread Christoph Hellwig
For other horrors that are even worse than any given ->poll instance
take a look at scif_poll and friends..


[PATCH 08/11] xfrm: don't call xfrm_policy_cache_flush while holding spinlock

2018-01-11 Thread Steffen Klassert
From: Florian Westphal 

xfrm_policy_cache_flush can sleep, so it cannot be called while holding
a spinlock.  We could release the lock first, but I don't see why we need
to invoke this function here in the first place; the packet path won't
reuse an xdst entry unless it's still valid.

While at it, add a might_sleep() annotation to xfrm_policy_cache_flush;
it would probably have caught this bug sooner.

Fixes: ec30d78c14a813 ("xfrm: add xdst pcpu cache")
Reported-by: syzbot+e149f7d1328c26f9c...@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 2ef6db98e9ba..bc5eae12fb09 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -975,8 +975,6 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
}
if (!cnt)
err = -ESRCH;
-   else
-   xfrm_policy_cache_flush();
 out:
spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
return err;
@@ -1744,6 +1742,8 @@ void xfrm_policy_cache_flush(void)
bool found = 0;
int cpu;
 
+   might_sleep();
+
local_bh_disable();
rcu_read_lock();
for_each_possible_cpu(cpu) {
-- 
2.14.1



[PATCH 10/11] af_key: Fix memory leak in key_notify_policy.

2018-01-11 Thread Steffen Klassert
We leak the allocated out_skb in case
pfkey_xfrm_policy2msg() fails. Fix this
by freeing it on error.

Reported-by: Dmitry Vyukov 
Signed-off-by: Steffen Klassert 
---
 net/key/af_key.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index d40861a048fe..7e2e7188e7f4 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -2202,8 +2202,10 @@ static int key_notify_policy(struct xfrm_policy *xp, int dir, const struct km_ev
return PTR_ERR(out_skb);
 
err = pfkey_xfrm_policy2msg(out_skb, xp, dir);
-   if (err < 0)
+   if (err < 0) {
+   kfree_skb(out_skb);
return err;
+   }
 
out_hdr = (struct sadb_msg *) out_skb->data;
out_hdr->sadb_msg_version = PF_KEY_V2;
-- 
2.14.1



[PATCH 09/11] esp: Fix GRO when the headers not fully in the linear part of the skb.

2018-01-11 Thread Steffen Klassert
The GRO layer does not necessarily pull the complete headers
into the linear part of the skb, a part may remain on the
first page fragment. This can lead to a crash if we try to
pull the headers, so make sure we have them on the linear
part before pulling.

Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
Reported-by: syzbot+82bbd65569c49c6c0...@syzkaller.appspotmail.com
Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4_offload.c | 3 ++-
 net/ipv6/esp6_offload.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index f8b918c766b0..b1338e576d00 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -38,7 +38,8 @@ static struct sk_buff **esp4_gro_receive(struct sk_buff **head,
__be32 spi;
int err;
 
-   skb_pull(skb, offset);
+   if (!pskb_pull(skb, offset))
+   return NULL;
 
if ((err = xfrm_parse_spi(skb, IPPROTO_ESP, &spi, &seq)) != 0)
goto out;
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 333a478aa161..dd9627490c7c 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -60,7 +60,8 @@ static struct sk_buff **esp6_gro_receive(struct sk_buff **head,
int nhoff;
int err;
 
-   skb_pull(skb, offset);
+   if (!pskb_pull(skb, offset))
+   return NULL;
 
if ((err = xfrm_parse_spi(skb, IPPROTO_ESP, &spi, &seq)) != 0)
goto out;
-- 
2.14.1



[PATCH 11/11] xfrm: Fix a race in the xdst pcpu cache.

2018-01-11 Thread Steffen Klassert
We need to run xfrm_resolve_and_create_bundle() with
bottom halves off. Otherwise we may reuse an already
released dst_entry when the xfrm lookup functions are
called from process context.

Fixes: ec30d78c14a813db39a647b6a348b428 ("xfrm: add xdst pcpu cache")
Reported-by: Darius Ski 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index bc5eae12fb09..bd6b0e7a0ee4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2063,8 +2063,11 @@ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir,
if (num_xfrms <= 0)
goto make_dummy_bundle;
 
+   local_bh_disable();
xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family,
- xflo->dst_orig);
+ xflo->dst_orig);
+   local_bh_enable();
+
if (IS_ERR(xdst)) {
err = PTR_ERR(xdst);
if (err != -EAGAIN)
@@ -2151,9 +2154,12 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
goto no_transform;
}
 
+   local_bh_disable();
xdst = xfrm_resolve_and_create_bundle(
pols, num_pols, fl,
family, dst_orig);
+   local_bh_enable();
+
if (IS_ERR(xdst)) {
xfrm_pols_put(pols, num_pols);
err = PTR_ERR(xdst);
-- 
2.14.1



[PATCH 06/11] xfrm: Use __skb_queue_tail in xfrm_trans_queue

2018-01-11 Thread Steffen Klassert
From: Herbert Xu 

We do not need locking in xfrm_trans_queue because it is designed
to use per-CPU buffers.  However, the original code incorrectly
used skb_queue_tail which takes the lock.  This patch switches
it to __skb_queue_tail instead.

Reported-and-tested-by: Artem Savkov 
Fixes: acf568ee859f ("xfrm: Reinject transport-mode packets...")
Signed-off-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 3f6f6f8c9fa5..5b2409746ae0 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -518,7 +518,7 @@ int xfrm_trans_queue(struct sk_buff *skb,
return -ENOBUFS;
 
XFRM_TRANS_SKB_CB(skb)->finish = finish;
-   skb_queue_tail(&trans->queue, skb);
+   __skb_queue_tail(&trans->queue, skb);
tasklet_schedule(&trans->tasklet);
return 0;
 }
-- 
2.14.1



[PATCH 05/11] xfrm: fix rcu usage in xfrm_get_type_offload

2018-01-11 Thread Steffen Klassert
From: Sabrina Dubroca 

request_module can sleep, thus we cannot hold rcu_read_lock() while
calling it. The function also jumps back and takes rcu_read_lock()
again (in xfrm_state_get_afinfo()), resulting in an imbalance.

This codepath is triggered whenever a new offloaded state is created.

Fixes: ffdb5211da1c ("xfrm: Auto-load xfrm offload modules")
Reported-by: 
syzbot+ca425f44816d749e8eb49755567a75ee48cf4...@syzkaller.appspotmail.com
Signed-off-by: Sabrina Dubroca 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 1e80f68e2266..429957412633 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -313,13 +313,14 @@ xfrm_get_type_offload(u8 proto, unsigned short family, bool try_load)
if ((type && !try_module_get(type->owner)))
type = NULL;
 
+   rcu_read_unlock();
+
if (!type && try_load) {
request_module("xfrm-offload-%d-%d", family, proto);
try_load = 0;
goto retry;
}
 
-   rcu_read_unlock();
return type;
 }
 
-- 
2.14.1
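
The imbalance described in the commit message can be illustrated with a
small userspace model. This is only a sketch under stated assumptions:
fake_rcu_read_lock(), fake_rcu_read_unlock(), fake_request_module() and
the two lookup_*() functions are hypothetical stand-ins, not kernel APIs.
The model counts read-side nesting depth and records whether a call that
may sleep was made while the depth was non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel primitives. */
static int rcu_depth;          /* current read-side nesting depth */
static bool slept_under_rcu;   /* set if we "sleep" while depth > 0 */

static void fake_rcu_read_lock(void)   { rcu_depth++; }
static void fake_rcu_read_unlock(void) { rcu_depth--; }

/* Models request_module(): it may sleep, so it must run at depth 0. */
static void fake_request_module(void)
{
	if (rcu_depth > 0)
		slept_under_rcu = true;
}

static void reset_model(void)
{
	rcu_depth = 0;
	slept_under_rcu = false;
}

/* Pre-patch flow: the unlock happens only on the final return path. */
static int lookup_before_fix(bool try_load)
{
	bool found;

retry:
	fake_rcu_read_lock();
	found = false;                 /* pretend the type is never found */
	if (!found && try_load) {
		fake_request_module();     /* called with the lock still held */
		try_load = false;
		goto retry;                /* takes the lock a second time */
	}
	fake_rcu_read_unlock();
	return rcu_depth;              /* leftover nesting depth */
}

/* Post-patch flow: unlock first, then decide whether to load and retry. */
static int lookup_after_fix(bool try_load)
{
	bool found;

retry:
	fake_rcu_read_lock();
	found = false;
	fake_rcu_read_unlock();
	if (!found && try_load) {
		fake_request_module();     /* now runs outside the read section */
		try_load = false;
		goto retry;
	}
	return rcu_depth;
}
```

In this model the pre-patch flow returns with one level of nesting still
held and sleeps inside the read-side section, while the post-patch flow
does neither.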



[PATCH 04/11] af_key: fix buffer overread in parse_exthdrs()

2018-01-11 Thread Steffen Klassert
From: Eric Biggers 

If a message sent to a PF_KEY socket ended with an incomplete extension
header (fewer than 4 bytes remaining), then parse_exthdrs() read past
the end of the message, into uninitialized memory.  Fix it by returning
-EINVAL in this case.

Reproducer:

#include <linux/pfkeyv2.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2);
char buf[17] = { 0 };
struct sadb_msg *msg = (void *)buf;

msg->sadb_msg_version = PF_KEY_V2;
msg->sadb_msg_type = SADB_DELETE;
msg->sadb_msg_len = 2;

write(sock, buf, 17);
}

Cc: sta...@vger.kernel.org
Signed-off-by: Eric Biggers 
Signed-off-by: Steffen Klassert 
---
 net/key/af_key.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 596499cc8b2f..d40861a048fe 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -516,6 +516,9 @@ static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void *
uint16_t ext_type;
int ext_len;
 
+   if (len < sizeof(*ehdr))
+   return -EINVAL;
+
ext_len  = ehdr->sadb_ext_len;
ext_len *= sizeof(uint64_t);
ext_type = ehdr->sadb_ext_type;
-- 
2.14.1
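
For readers who want to see the check in isolation, here is a minimal
userspace sketch of an extension-header walk with the same guard. It is
loosely modeled on the patch, not copied from af_key.c: the struct
ext_hdr_sketch layout and the name parse_exts_sketch() are assumptions
made for illustration. The reproducer above writes 17 bytes; assuming a
16-byte struct sadb_msg, that leaves a single trailing byte, which is the
"partial header" case the new check rejects.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of a PF_KEY extension header; the length field
 * is expressed in 8-byte units.
 */
struct ext_hdr_sketch {
	uint16_t len_units;
	uint16_t type;
};

/* Walk the extension area; return 0 on success, -EINVAL if malformed. */
static int parse_exts_sketch(const uint8_t *p, size_t len)
{
	while (len > 0) {
		struct ext_hdr_sketch hdr;
		size_t ext_len;

		/* The check the patch adds: a partial header must be
		 * rejected before any of its fields are read.
		 */
		if (len < sizeof(hdr))
			return -EINVAL;

		memcpy(&hdr, p, sizeof(hdr));
		ext_len = (size_t)hdr.len_units * sizeof(uint64_t);

		/* Reject zero-length or overlong extensions (assumed checks). */
		if (ext_len < sizeof(uint64_t) || ext_len > len)
			return -EINVAL;

		p += ext_len;
		len -= ext_len;
	}
	return 0;
}
```

Without the first check, the memcpy() in this sketch would read up to
three bytes past the end of the buffer when fewer than four remain.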



Re: [iptables] extensions: add support for 'srh' match

2018-01-11 Thread Pablo Neira Ayuso
On Thu, Jan 11, 2018 at 11:14:52AM +0100, Ahmed Abdelsalam wrote:
> On Wed, 10 Jan 2018 16:32:24 +0100
> Pablo Neira Ayuso  wrote:
> 
> > On Fri, Dec 29, 2017 at 12:08:25PM +0100, Ahmed Abdelsalam wrote:
> > > This patch adds a new extension to iptables to support 'srh' match
> > > The implementation considers revision 7 of the SRH draft.
> > > https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
> > > 
> > > Signed-off-by: Ahmed Abdelsalam 
> > > ---
> > >  extensions/libip6t_srh.c| 283 
> > > 
> > >  include/linux/netfilter_ipv6/ip6t_srh.h |  63 +++
> > 
> > Please add an extensions/libip6t_srh.t test file and send a v2.
> > 
> > Thanks.
> Ok, 
> Are there minimum requirements for the test cases to be added to the 
> extensions/libip6t_srh.t file?

I leave it up to you to decide what level of coverage is good enough
to make sure that future changes don't break your new feature.

Thanks!


Re: [iptables] extensions: add support for 'srh' match

2018-01-11 Thread Ahmed Abdelsalam
On Wed, 10 Jan 2018 16:32:24 +0100
Pablo Neira Ayuso  wrote:

> On Fri, Dec 29, 2017 at 12:08:25PM +0100, Ahmed Abdelsalam wrote:
> > This patch adds a new extension to iptables to support 'srh' match
> > The implementation considers revision 7 of the SRH draft.
> > https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
> > 
> > Signed-off-by: Ahmed Abdelsalam 
> > ---
> >  extensions/libip6t_srh.c| 283 
> > 
> >  include/linux/netfilter_ipv6/ip6t_srh.h |  63 +++
> 
> Please add an extensions/libip6t_srh.t test file and send a v2.
> 
> Thanks.
Ok, 
Are there minimum requirements for the test cases to be added to the 
extensions/libip6t_srh.t file?

-- 
Ahmed 


[patch net-next 5/5] mlxsw: spectrum: qdiscs: Support stats for PRIO qdisc

2018-01-11 Thread Jiri Pirko
From: Nogah Frankel 

Support basic stats for PRIO qdisc, which includes tx packets and bytes
count, drops count and backlog size. The rest of the stats are irrelevant
for this qdisc offload.
Because the backlog is a momentary value rather than a purely incremental
counter, a qdisc that stops being offloaded without being destroyed must
have the hardware backlog subtracted from its backlog value.
For that reason, an unoffload function is added to the ops struct.
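
The backlog bookkeeping can be sketched in plain C as follows. This is
an illustration only: the struct and function names are hypothetical, and
the cells-to-bytes conversion done by mlxsw_sp_cells_bytes() is left out,
so all values here are already in bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-qdisc state: the hardware backlog that has already
 * been folded into the software backlog counter.
 */
struct prio_backlog_sketch {
	uint64_t hw_base;
};

/* Stats update: add only the change in hardware backlog since last time.
 * Unsigned wraparound makes a shrinking backlog subtract correctly.
 */
static void backlog_stats_update(struct prio_backlog_sketch *s,
				 uint32_t *sw_backlog, uint64_t hw_now)
{
	*sw_backlog += (uint32_t)(hw_now - s->hw_base);
	s->hw_base = hw_now;
}

/* Unoffload: remove whatever hardware backlog is still accounted for. */
static void backlog_unoffload(const struct prio_backlog_sketch *s,
			      uint32_t *sw_backlog)
{
	*sw_backlog -= (uint32_t)s->hw_base;
}
```

After the unoffload step, the software counter holds only the backlog
that the software qdisc itself accounted for.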

Signed-off-by: Nogah Frankel 
Reviewed-by: Yuval Mintz 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c   | 92 ++
 1 file changed, 92 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 9e83edde7b35..272c04951e5d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -66,6 +66,11 @@ struct mlxsw_sp_qdisc_ops {
  void *xstats_ptr);
void (*clean_stats)(struct mlxsw_sp_port *mlxsw_sp_port,
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc);
+   /* unoffload - to be used for a qdisc that stops being offloaded without
+* being destroyed.
+*/
+   void (*unoffload)(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params);
 };
 
 struct mlxsw_sp_qdisc {
@@ -73,6 +78,9 @@ struct mlxsw_sp_qdisc {
u8 tclass_num;
union {
struct red_stats red;
+   struct mlxsw_sp_qdisc_prio_stats {
+   u64 backlog;
+   } prio;
} xstats_base;
struct mlxsw_sp_qdisc_stats {
u64 tx_bytes;
@@ -144,6 +152,9 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
 
 err_bad_param:
 err_config:
+   if (mlxsw_sp_qdisc->handle == handle && ops->unoffload)
+   ops->unoffload(mlxsw_sp_port, mlxsw_sp_qdisc, params);
+
mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
return err;
 }
@@ -450,11 +461,88 @@ mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
return 0;
 }
 
+static void
+mlxsw_sp_qdisc_prio_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+ void *params)
+{
+   struct tc_prio_qopt_offload_params *p = params;
+
+   *p->backlog -= mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
+   mlxsw_sp_qdisc->xstats_base.prio.backlog);
+}
+
+static int
+mlxsw_sp_qdisc_get_prio_stats(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+ struct tc_qopt_offload_stats *stats_ptr)
+{
+   u64 tx_bytes, tx_packets, drops = 0, backlog = 0;
+   struct mlxsw_sp_qdisc_prio_stats *prio_base;
+   struct mlxsw_sp_qdisc_stats *stats_base;
+   struct mlxsw_sp_port_xstats *xstats;
+   struct rtnl_link_stats64 *stats;
+   int i;
+
+   prio_base = &mlxsw_sp_qdisc->xstats_base.prio;
+   xstats = &mlxsw_sp_port->periodic_hw_stats.xstats;
+   stats = &mlxsw_sp_port->periodic_hw_stats.stats;
+   stats_base = &mlxsw_sp_qdisc->stats_base;
+
+   tx_bytes = stats->tx_bytes - stats_base->tx_bytes;
+   tx_packets = stats->tx_packets - stats_base->tx_packets;
+
+   for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
+   drops += xstats->tail_drop[i];
+   backlog += xstats->backlog[i];
+   }
+   drops = drops - stats_base->drops;
+
+   _bstats_update(stats_ptr->bstats, tx_bytes, tx_packets);
+   stats_ptr->qstats->drops += drops;
+   stats_ptr->qstats->backlog +=
+   mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
+backlog) -
+   mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
+prio_base->backlog);
+   prio_base->backlog = backlog;
+   stats_base->drops += drops;
+   stats_base->tx_bytes += tx_bytes;
+   stats_base->tx_packets += tx_packets;
+   return 0;
+}
+
+static void
+mlxsw_sp_setup_tc_qdisc_prio_clean_stats(struct mlxsw_sp_port *mlxsw_sp_port,
+struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+   struct mlxsw_sp_qdisc_stats *stats_base;
+   struct mlxsw_sp_port_xstats *xstats;
+   struct rtnl_link_stats64 *stats;
+   int i;
+
+   xstats = &mlxsw_sp_port->periodic_hw_stats.xstats;
+   stats = &mlxsw_sp_port->periodic_hw_stats.stats;
+   stats_base = &mlxsw_sp_qdisc->stats_base;
+
+   stats_base->tx_packets = stats->tx_packets;
+   stats_base->tx_bytes = stats->tx_bytes;
+
+   stats_base->drops = 0;
+   for (i = 0; 

[patch net-next 0/5] mlxsw: Offload PRIO qdisc

2018-01-11 Thread Jiri Pirko
From: Jiri Pirko 

Add offload support for the PRIO qdisc to the mlxsw driver.
The PRIO qdisc is offloaded using ndo_setup_tc. It has three commands:
set or tune the qdisc, remove it, and get its stats.

As with RED offloading, the driver is not required to offload this qdisc;
its offload state is determined in the dump action, when the stats are
updated.
In the driver, offloading of PRIO is supported as root qdisc only. It
supports only priorities 0-7 (the range that is used by the current static
mapping of DSCP to skb prio and by 1:1 PCP values mapping) and up to 8
bands.

Patches 1-2 offload DSCP to priority mapping in the mlxsw_sp driver.
Patch 3 adds offload support for PRIO qdisc.
Patches 4-5 add PRIO offload support in the mlxsw_sp driver.

Nogah Frankel (3):
  net: sch: prio: Add offload ability to PRIO qdisc
  mlxsw: spectrum: qdiscs: Support PRIO qdisc offload
  mlxsw: spectrum: qdiscs: Support stats for PRIO qdisc

Yuval Mintz (2):
  mlxsw: reg: add rdpm register
  mlxsw: spectrum_router: Configure default routing priority

 drivers/net/ethernet/mellanox/mlxsw/item.h |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/reg.h  |  37 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   2 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   2 +
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c   | 174 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  24 +++
 include/linux/netdevice.h  |   1 +
 include/net/pkt_cls.h  |  25 +++
 net/sched/sch_prio.c   |  59 +++
 9 files changed, 325 insertions(+), 1 deletion(-)

-- 
2.14.3



[patch net-next 3/5] net: sch: prio: Add offload ability to PRIO qdisc

2018-01-11 Thread Jiri Pirko
From: Nogah Frankel 

Add the ability to offload PRIO qdisc by using ndo_setup_tc.
There are three commands for PRIO offloading:
* TC_PRIO_REPLACE: handles set and tune
* TC_PRIO_DESTROY: handles qdisc destroy
* TC_PRIO_STATS: updates the qdisc's counters (given as reference)

As with the RED qdisc, the indication of whether PRIO is offloaded is set
and updated as part of the dump function. This is because the driver may
decide whether to offload based on the qdisc parent, which can change
without notifying the qdisc.

Signed-off-by: Nogah Frankel 
Reviewed-by: Yuval Mintz 
Signed-off-by: Jiri Pirko 
---
 include/linux/netdevice.h |  1 +
 include/net/pkt_cls.h | 25 
 net/sched/sch_prio.c  | 59 +++
 3 files changed, 85 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef7b348e8498..6d95477b962c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -780,6 +780,7 @@ enum tc_setup_type {
TC_SETUP_BLOCK,
TC_SETUP_QDISC_CBS,
TC_SETUP_QDISC_RED,
+   TC_SETUP_QDISC_PRIO,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0d1343cba84c..4ba8c3ba3dd4 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -761,4 +761,29 @@ struct tc_red_qopt_offload {
};
 };
 
+enum tc_prio_command {
+   TC_PRIO_REPLACE,
+   TC_PRIO_DESTROY,
+   TC_PRIO_STATS,
+};
+
+struct tc_prio_qopt_offload_params {
+   int bands;
+   u8 priomap[TC_PRIO_MAX + 1];
+   /* If a prio qdisc is offloaded and is then changed to a
+* non-offloadable config, the backlog value must be updated
+* to negate the HW backlog value.
+*/
+   u32 *backlog;
+};
+
+struct tc_prio_qopt_offload {
+   enum tc_prio_command command;
+   u32 handle;
+   u32 parent;
+   union {
+   struct tc_prio_qopt_offload_params replace_params;
+   struct tc_qopt_offload_stats stats;
+   };
+};
 #endif
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index fe1510eb111f..3f47a30ce72f 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -142,6 +142,31 @@ prio_reset(struct Qdisc *sch)
sch->q.qlen = 0;
 }
 
+static int prio_offload(struct Qdisc *sch, bool enable)
+{
+   struct prio_sched_data *q = qdisc_priv(sch);
+   struct net_device *dev = qdisc_dev(sch);
+   struct tc_prio_qopt_offload opt = {
+   .handle = sch->handle,
+   .parent = sch->parent,
+   };
+
+   if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+   return -EOPNOTSUPP;
+
+   if (enable) {
+   opt.command = TC_PRIO_REPLACE;
+   opt.replace_params.bands = q->bands;
+   memcpy(&opt.replace_params.priomap, q->prio2band,
+  TC_PRIO_MAX + 1);
+   opt.replace_params.backlog = &sch->qstats.backlog;
+   } else {
+   opt.command = TC_PRIO_DESTROY;
+   }
+
+   return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_PRIO, &opt);
+}
+
 static void
 prio_destroy(struct Qdisc *sch)
 {
@@ -149,6 +174,7 @@ prio_destroy(struct Qdisc *sch)
struct prio_sched_data *q = qdisc_priv(sch);
 
tcf_block_put(q->block);
+   prio_offload(sch, false);
for (prio = 0; prio < q->bands; prio++)
qdisc_destroy(q->queues[prio]);
 }
@@ -204,6 +230,7 @@ static int prio_tune(struct Qdisc *sch, struct nlattr *opt,
}
 
sch_tree_unlock(sch);
+   prio_offload(sch, true);
return 0;
 }
 
@@ -223,15 +250,47 @@ static int prio_init(struct Qdisc *sch, struct nlattr *opt,
return prio_tune(sch, opt, extack);
 }
 
+static int prio_dump_offload(struct Qdisc *sch)
+{
+   struct net_device *dev = qdisc_dev(sch);
+   struct tc_prio_qopt_offload hw_stats = {
+   .handle = sch->handle,
+   .parent = sch->parent,
+   .command = TC_PRIO_STATS,
+   .stats.bstats = &sch->bstats,
+   .stats.qstats = &sch->qstats,
+   };
+   int err;
+
+   sch->flags &= ~TCQ_F_OFFLOADED;
+   if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+   return 0;
+
+   err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_PRIO,
+   &hw_stats);
+   if (err == -EOPNOTSUPP)
+   return 0;
+
+   if (!err)
+   sch->flags |= TCQ_F_OFFLOADED;
+
+   return err;
+}
+
 static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
struct prio_sched_data *q = qdisc_priv(sch);
unsigned char *b = skb_tail_pointer(skb);
struct tc_prio_qopt opt;
+   int err;
 
opt.bands = q->bands;
memcpy(, 

[patch net-next 4/5] mlxsw: spectrum: qdiscs: Support PRIO qdisc offload

2018-01-11 Thread Jiri Pirko
From: Nogah Frankel 

Add support for offloading the PRIO qdisc as the root qdisc.
Up to 8 bands are supported.
The priority of routed packets is determined by the DSCP field with the
default translations. The priority of bridged packets is determined by the
PCP field if it exists; otherwise it is set to 0.
Since both options yield only priorities 0-7, mappings for higher
priorities are ignored.
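
To make the band numbering concrete, here is a small sketch of the
band-to-traffic-class inversion that the patch expresses with
MLXSW_SP_PRIO_BAND_TO_TCLASS(). MAX_TCS_SKETCH stands in for
IEEE_8021QAZ_MAX_TCS and is assumed to be 8; band_to_tclass() and
map_prios_sketch() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_TCS_SKETCH 8	/* assumed value of IEEE_8021QAZ_MAX_TCS */

/* Band 0 is served first, so it maps to the highest traffic class. */
static int band_to_tclass(int band)
{
	return MAX_TCS_SKETCH - band - 1;
}

/* Translate the first 8 priomap entries into per-priority traffic
 * classes; reject configurations with more bands than traffic classes.
 */
static int map_prios_sketch(int bands, const uint8_t *priomap, int *tclass)
{
	int i;

	if (bands > MAX_TCS_SKETCH)
		return -EOPNOTSUPP;

	for (i = 0; i < MAX_TCS_SKETCH; i++)
		tclass[i] = band_to_tclass(priomap[i]);

	return 0;
}
```

Only priomap entries 0-7 are consulted, which matches the statement that
mappings for higher priorities are ignored.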

Signed-off-by: Nogah Frankel 
Reviewed-by: Yuval Mintz 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  2 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  2 +
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c   | 82 ++
 3 files changed, 86 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 54c7d9202e81..f78bfe394966 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1830,6 +1830,8 @@ static int mlxsw_sp_setup_tc(struct net_device *dev, enum tc_setup_type type,
return mlxsw_sp_setup_tc_block(mlxsw_sp_port, type_data);
case TC_SETUP_QDISC_RED:
return mlxsw_sp_setup_tc_red(mlxsw_sp_port, type_data);
+   case TC_SETUP_QDISC_PRIO:
+   return mlxsw_sp_setup_tc_prio(mlxsw_sp_port, type_data);
default:
return -EOPNOTSUPP;
}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index b6f475e83474..16f8fbda0891 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -565,6 +565,8 @@ int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port);
 void mlxsw_sp_tc_qdisc_fini(struct mlxsw_sp_port *mlxsw_sp_port);
 int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
  struct tc_red_qopt_offload *p);
+int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct tc_prio_qopt_offload *p);
 
 /* spectrum_fid.c */
 int mlxsw_sp_fid_flood_set(struct mlxsw_sp_fid *fid,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 273300b75a68..9e83edde7b35 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -41,9 +41,12 @@
 #include "spectrum.h"
 #include "reg.h"
 
+#define MLXSW_SP_PRIO_BAND_TO_TCLASS(band) (IEEE_8021QAZ_MAX_TCS - band - 1)
+
 enum mlxsw_sp_qdisc_type {
MLXSW_SP_QDISC_NO_QDISC,
MLXSW_SP_QDISC_RED,
+   MLXSW_SP_QDISC_PRIO,
 };
 
 struct mlxsw_sp_qdisc_ops {
@@ -402,6 +405,85 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
}
 }
 
+static int
+mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
+   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+   int i;
+
+   for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++)
+   mlxsw_sp_port_prio_tc_set(mlxsw_sp_port, i,
+ MLXSW_SP_PORT_DEFAULT_TCLASS);
+
+   return 0;
+}
+
+static int
+mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
+struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+void *params)
+{
+   struct tc_prio_qopt_offload_params *p = params;
+
+   if (p->bands > IEEE_8021QAZ_MAX_TCS)
+   return -EOPNOTSUPP;
+
+   return 0;
+}
+
+static int
+mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+   void *params)
+{
+   struct tc_prio_qopt_offload_params *p = params;
+   int tclass, i;
+   int err;
+
+   for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
+   tclass = MLXSW_SP_PRIO_BAND_TO_TCLASS(p->priomap[i]);
+   err = mlxsw_sp_port_prio_tc_set(mlxsw_sp_port, i, tclass);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
+static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
+   .type = MLXSW_SP_QDISC_PRIO,
+   .check_params = mlxsw_sp_qdisc_prio_check_params,
+   .replace = mlxsw_sp_qdisc_prio_replace,
+   .destroy = mlxsw_sp_qdisc_prio_destroy,
+};
+
+int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct tc_prio_qopt_offload *p)
+{
+   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
+
+   if (p->parent != TC_H_ROOT)
+   return -EOPNOTSUPP;
+
+   mlxsw_sp_qdisc = mlxsw_sp_port->root_qdisc;
+   if (p->command == TC_PRIO_REPLACE)
+   return mlxsw_sp_qdisc_replace(mlxsw_sp_port, p->handle,
+ mlxsw_sp_qdisc,
+ 

[patch net-next 2/5] mlxsw: spectrum_router: Configure default routing priority

2018-01-11 Thread Jiri Pirko
From: Yuval Mintz 

When routing IP packets, the kernel sets the SKB's priority
based on the ToS field of the packet.
Imitate this behavior in the mlxsw router by determining the
internal switch priority of a routed packet from its DS field.

Signed-off-by: Yuval Mintz 
Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 24 ++
 1 file changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 434b3922b34f..8f115d1c7056 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -7008,6 +7008,24 @@ static int mlxsw_sp_mp_hash_init(struct mlxsw_sp *mlxsw_sp)
 }
 #endif
 
+static int mlxsw_sp_dscp_init(struct mlxsw_sp *mlxsw_sp)
+{
+   char rdpm_pl[MLXSW_REG_RDPM_LEN];
+   unsigned int i;
+
+   MLXSW_REG_ZERO(rdpm, rdpm_pl);
+
+   /* HW is determining switch priority based on DSCP-bits, but the
+* kernel is still doing that based on the ToS. Since there's a
+* mismatch in bits we need to make sure to translate the right
+* value ToS would observe, skipping the 2 least-significant ECN bits.
+*/
+   for (i = 0; i < MLXSW_REG_RDPM_DSCP_ENTRY_REC_MAX_COUNT; i++)
+   mlxsw_reg_rdpm_pack(rdpm_pl, i, rt_tos2priority(i << 2));
+
+   return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rdpm), rdpm_pl);
+}
+
 static int __mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 {
char rgcr_pl[MLXSW_REG_RGCR_LEN];
@@ -7020,6 +7038,7 @@ static int __mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 
mlxsw_reg_rgcr_pack(rgcr_pl, true, true);
mlxsw_reg_rgcr_max_router_interfaces_set(rgcr_pl, max_rifs);
+   mlxsw_reg_rgcr_usp_set(rgcr_pl, true);
err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rgcr), rgcr_pl);
if (err)
return err;
@@ -7095,6 +7114,10 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
if (err)
goto err_mp_hash_init;
 
+   err = mlxsw_sp_dscp_init(mlxsw_sp);
+   if (err)
+   goto err_dscp_init;
+
mlxsw_sp->router->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
err = register_fib_notifier(&mlxsw_sp->router->fib_nb,
mlxsw_sp_router_fib_dump_flush);
@@ -7104,6 +7127,7 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
return 0;
 
 err_register_fib_notifier:
+err_dscp_init:
 err_mp_hash_init:
unregister_netevent_notifier(&mlxsw_sp->router->netevent_nb);
 err_register_netevent_notifier:
-- 
2.14.3
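
The "i << 2" in mlxsw_sp_dscp_init() can be looked at in isolation with
the sketch below. It only demonstrates how a 6-bit DSCP index is turned
into the ToS byte that a ToS-based lookup would observe. The lookup itself
is passed in as a function pointer; toy_tos2prio() is a deliberately
simplified, hypothetical stand-in and is NOT the kernel's
rt_tos2priority(). All names here are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DSCP_ENTRIES_SKETCH 64	/* a 6-bit DSCP field has 64 code points */

/* DSCP occupies the upper six bits of the ToS byte; the two low bits
 * are ECN and are left as zero here.
 */
static uint8_t dscp_to_tos(uint8_t dscp)
{
	return (uint8_t)(dscp << 2);
}

typedef uint8_t (*tos2prio_fn)(uint8_t tos);

/* Build a DSCP-indexed priority table from a ToS-based lookup. */
static void fill_dscp_prio_table(uint8_t table[DSCP_ENTRIES_SKETCH],
				 tos2prio_fn tos2prio)
{
	unsigned int i;

	for (i = 0; i < DSCP_ENTRIES_SKETCH; i++)
		table[i] = tos2prio(dscp_to_tos((uint8_t)i));
}

/* Toy lookup: use the top three bits of the ToS byte as the priority.
 * Hypothetical; it only gives the table something to call.
 */
static uint8_t toy_tos2prio(uint8_t tos)
{
	return tos >> 5;
}
```

With this arrangement the hardware table is indexed by DSCP while each
entry holds the result a ToS-based lookup would give for the same packet.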



Re: [PATCH 34/38] arm: Implement thread_struct whitelist for hardened usercopy

2018-01-11 Thread Russell King - ARM Linux
On Wed, Jan 10, 2018 at 06:03:06PM -0800, Kees Cook wrote:
> ARM does not carry FPU state in the thread structure, so it can declare
> no usercopy whitelist at all.

This comment seems to be misleading.  We have stored FP state in the
thread structure for a long time - for example, VFP state is stored
in thread->vfpstate.hard, so we _do_ have floating point state in
the thread structure.

What I think this commit message needs to describe is why we don't
need a whitelist _despite_ having FP state in the thread structure.

At the moment, the commit message is making me think that this patch
is wrong and will introduce a regression.

Thanks.

> 
> Cc: Russell King 
> Cc: Ingo Molnar 
> Cc: Christian Borntraeger 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: linux-arm-ker...@lists.infradead.org
> Signed-off-by: Kees Cook 
> ---
>  arch/arm/Kconfig | 1 +
>  arch/arm/include/asm/processor.h | 7 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 51c8df561077..3ea00d65f35d 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -50,6 +50,7 @@ config ARM
>   select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
>   select HAVE_ARCH_MMAP_RND_BITS if MMU
>   select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
> + select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARM_SMCCC if CPU_V7
>   select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
> diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> index 338cbe0a18ef..01a41be58d43 100644
> --- a/arch/arm/include/asm/processor.h
> +++ b/arch/arm/include/asm/processor.h
> @@ -45,6 +45,13 @@ struct thread_struct {
>   struct debug_info   debug;
>  };
>  
> +/* Nothing needs to be usercopy-whitelisted from thread_struct. */
> +static inline void arch_thread_struct_whitelist(unsigned long *offset,
> + unsigned long *size)
> +{
> + *offset = *size = 0;
> +}
> +
>  #define INIT_THREAD  {   }
>  
>  #define start_thread(regs,pc,sp) \
> -- 
> 2.7.4
> 

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: KASAN: use-after-free Read in __bpf_prog_put

2018-01-11 Thread Daniel Borkmann
Hi Dmitry,

On 01/11/2018 11:22 AM, Dmitry Vyukov wrote:
> On Thu, Jan 11, 2018 at 11:17 AM, syzbot
>  wrote:
>> Hello,
>>
>> syzkaller hit the following crash on
>> 4147d50978df60f34d444c647dde9e5b34a4315e
>> git://git.cmpxchg.org/linux-mmots.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>> Unfortunately, I don't have any reproducer for this bug yet.
>>
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+d85bfb332db8f0794...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> netlink: 3 bytes leftover after parsing attributes in process
>> `syz-executor5'.
>> ==
>> BUG: KASAN: use-after-free in __bpf_prog_put+0x5e8/0x640
>> kernel/bpf/syscall.c:944
>> netlink: 'syz-executor5': attribute type 5 has an invalid length.
>> Read of size 8 at addr 8801d3619658 by task syz-executor0/12398
>>
>> CPU: 1 PID: 12398 Comm: syz-executor0 Not tainted 4.15.0-rc7-mm1+ #53
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  print_address_description+0x73/0x250 mm/kasan/report.c:256
>>  kasan_report_error mm/kasan/report.c:354 [inline]
>>  kasan_report+0x23b/0x360 mm/kasan/report.c:412
>>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>>  __bpf_prog_put+0x5e8/0x640 kernel/bpf/syscall.c:944
>>  bpf_prog_put+0x1a/0x20 kernel/bpf/syscall.c:961
>>  prog_fd_array_put_ptr+0x15/0x20 kernel/bpf/arraymap.c:446
>>  fd_array_map_delete_elem+0xc8/0x110 kernel/bpf/arraymap.c:420
>>  map_delete_elem kernel/bpf/syscall.c:737 [inline]
>>  SYSC_bpf kernel/bpf/syscall.c:1814 [inline]
>>  SyS_bpf+0x22ea/0x4400 kernel/bpf/syscall.c:1782
>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>> RIP: 0033:0x452ac9
>> RSP: 002b:7fb70df60c58 EFLAGS: 0212 ORIG_RAX: 0141
>> RAX: ffda RBX: 0071bea0 RCX: 00452ac9
>> RDX: 0010 RSI: 20f02ff0 RDI: 0003
>> RBP: 03aa R08:  R09: 
>> R10:  R11: 0212 R12: 006f3890
>> R13:  R14: 7fb70df616d4 R15: 
>>
>> Allocated by task 11996:
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>  set_track mm/kasan/kasan.c:459 [inline]
>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552
>>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3541
>>  kmem_cache_zalloc include/linux/slab.h:694 [inline]
>>  get_empty_filp+0xfb/0x4f0 fs/file_table.c:122
>>  path_openat+0xed/0x3530 fs/namei.c:3514
>>  do_filp_open+0x25b/0x3b0 fs/namei.c:3572
>>  do_sys_open+0x502/0x6d0 fs/open.c:1059
>>  SYSC_open fs/open.c:1077 [inline]
>>  SyS_open+0x2d/0x40 fs/open.c:1072
>>  entry_SYSCALL_64_fastpath+0x29/0xa0
>>
>> Freed by task 11994:
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>  set_track mm/kasan/kasan.c:459 [inline]
>>  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520
>>  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527
>>  __cache_free mm/slab.c:3485 [inline]
>>  kmem_cache_free+0x86/0x2b0 mm/slab.c:3743
>>  file_free_rcu+0x5c/0x70 fs/file_table.c:49
>>  __rcu_reclaim kernel/rcu/rcu.h:172 [inline]
>>  rcu_do_batch kernel/rcu/tree.c:2675 [inline]
>>  invoke_rcu_callbacks kernel/rcu/tree.c:2934 [inline]
>>  __rcu_process_callbacks kernel/rcu/tree.c:2901 [inline]
>>  rcu_process_callbacks+0xd6c/0x17f0 kernel/rcu/tree.c:2918
>>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>>
>> The buggy address belongs to the object at 8801d36195c0
>>  which belongs to the cache filp of size 456
>> The buggy address is located 152 bytes inside of
>>  456-byte region [8801d36195c0, 8801d3619788)
>> The buggy address belongs to the page:
>> page:ea00074d8640 count:1 mapcount:0 mapping:8801d36190c0 index:0x0
>> flags: 0x2fffc000100(slab)
>> raw: 02fffc000100 8801d36190c0  00010006
>> raw: ea00074c49a0 ea000747a160 8801dae30180 
>> page dumped because: kasan: bad access detected
>>
>> Memory state around the buggy address:
>>  8801d3619500: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>  8801d3619580: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>>>
>>> 8801d3619600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>> ^
>>  8801d3619680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>  8801d3619700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ==
> 
> 
> Is it the same as "general 

[PATCH net-next 03/11] net: hns3: add ethtool_ops.get_coalesce support to PF

2018-01-11 Thread Peng Li
From: Fuyun Liang 

This patch adds ethtool_ops.get_coalesce support to PF.

Whilst our hardware supports per queue values, external interfaces
support only a single shared value. As such we use the values for
queue 0.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  2 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  1 +
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 37 ++
 3 files changed, 40 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index adec88d..0bad0e3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -448,6 +448,8 @@ struct hnae3_knic_private_info {
u16 num_tqps; /* total number of TQPs in this handle */
struct hnae3_queue **tqp;  /* array base of all TQPs in this instance */
const struct hnae3_dcb_ops *dcb_ops;
+
+   u16 int_rl_setting;
 };
 
 struct hnae3_roce_private_info {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index a2a7ea3..24f6109 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -464,6 +464,7 @@ struct hns3_enet_ring_group {
u16 count;
enum hns3_flow_level_range flow_level;
u16 int_gl;
+   u8 gl_adapt_enable;
 };
 
 struct hns3_enet_tqp_vector {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index f44336c..81b4b3b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -887,6 +887,42 @@ static void hns3_get_channels(struct net_device *netdev,
h->ae_algo->ops->get_channels(h, ch);
 }
 
+static int hns3_get_coalesce_per_queue(struct net_device *netdev, u32 queue,
+  struct ethtool_coalesce *cmd)
+{
+   struct hns3_enet_tqp_vector *tx_vector, *rx_vector;
+   struct hns3_nic_priv *priv = netdev_priv(netdev);
+   struct hnae3_handle *h = priv->ae_handle;
+   u16 queue_num = h->kinfo.num_tqps;
+
+   if (queue >= queue_num) {
+   netdev_err(netdev,
+  "Invalid queue value %d! Queue max id=%d\n",
+  queue, queue_num - 1);
+   return -EINVAL;
+   }
+
+   tx_vector = priv->ring_data[queue].ring->tqp_vector;
+   rx_vector = priv->ring_data[queue_num + queue].ring->tqp_vector;
+
+   cmd->use_adaptive_tx_coalesce = tx_vector->tx_group.gl_adapt_enable;
+   cmd->use_adaptive_rx_coalesce = rx_vector->rx_group.gl_adapt_enable;
+
+   cmd->tx_coalesce_usecs = tx_vector->tx_group.int_gl;
+   cmd->rx_coalesce_usecs = rx_vector->rx_group.int_gl;
+
+   cmd->tx_coalesce_usecs_high = h->kinfo.int_rl_setting;
+   cmd->rx_coalesce_usecs_high = h->kinfo.int_rl_setting;
+
+   return 0;
+}
+
+static int hns3_get_coalesce(struct net_device *netdev,
+struct ethtool_coalesce *cmd)
+{
+   return hns3_get_coalesce_per_queue(netdev, 0, cmd);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -925,6 +961,7 @@ static void hns3_get_channels(struct net_device *netdev,
.nway_reset = hns3_nway_reset,
.get_channels = hns3_get_channels,
.set_channels = hns3_set_channels,
+   .get_coalesce = hns3_get_coalesce,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
-- 
1.9.1
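
The ring indexing used by hns3_get_coalesce_per_queue() can be sketched
as below. The layout assumption, taken from the patch, is that TX rings
occupy indices [0, num_tqps) and RX rings occupy [num_tqps, 2 * num_tqps).
The struct and function names are hypothetical and much simpler than the
driver's.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced view of a ring: only its GL value in usecs. */
struct ring_sketch {
	unsigned int int_gl;
};

struct coalesce_sketch {
	unsigned int tx_usecs;
	unsigned int rx_usecs;
};

/* Read the coalesce values of one queue; TX ring at [queue], RX ring
 * at [num_tqps + queue].
 */
static int get_coalesce_per_queue_sketch(const struct ring_sketch *rings,
					 unsigned int num_tqps,
					 unsigned int queue,
					 struct coalesce_sketch *out)
{
	if (queue >= num_tqps)
		return -EINVAL;

	out->tx_usecs = rings[queue].int_gl;
	out->rx_usecs = rings[num_tqps + queue].int_gl;
	return 0;
}

/* The device-wide getter simply reports queue 0. */
static int get_coalesce_sketch(const struct ring_sketch *rings,
			       unsigned int num_tqps,
			       struct coalesce_sketch *out)
{
	return get_coalesce_per_queue_sketch(rings, num_tqps, 0, out);
}
```

Reporting queue 0 for the whole device mirrors the commit message: the
hardware keeps per-queue values, but the interface exposes a single one.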



[PATCH net-next 08/11] net: hns3: change the unit of GL value macro

2018-01-11 Thread Peng Li
From: Fuyun Liang 

Previously, the driver used 2us as the GL unit. The ethtool "-c" and
"-C" commands use a time unit of 1us, so the driver now uses 1us as
its GL unit as well.

This patch changes the unit of the GL value macros from
2us to 1us.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 7adbda8..213f501 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -452,10 +452,10 @@ enum hns3_link_mode_bits {
 };
 
 #define HNS3_INT_GL_MAX0x1FE0
-#define HNS3_INT_GL_50K0x000A
-#define HNS3_INT_GL_20K0x0019
-#define HNS3_INT_GL_18K0x001B
-#define HNS3_INT_GL_8K 0x003E
+#define HNS3_INT_GL_50K0x0014
+#define HNS3_INT_GL_20K0x0032
+#define HNS3_INT_GL_18K0x0036
+#define HNS3_INT_GL_8K 0x007C
 
 #define HNS3_INT_RL_MAX0x00EC
 #define HNS3_INT_RL_ENABLE_MASK0x40
-- 
1.9.1
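
The macro changes in this patch are a straight doubling, which the
following sketch checks. The helper names are hypothetical. Reading the
"50K", "20K", "18K" and "8K" suffixes as approximate interrupts per second
is an assumption made here, not something the patch states.

```c
#include <assert.h>

/* Convert a GL value expressed in 2us units to 1us units. */
static unsigned int gl_2us_to_1us(unsigned int gl_2us)
{
	return gl_2us * 2;
}

/* Approximate interrupt rate (per second) for a GL gap given in usecs,
 * assuming one interrupt per gap. Integer division truncates.
 */
static unsigned int gl_us_to_rate(unsigned int gl_us)
{
	return 1000000u / gl_us;
}
```

Under that assumption, a 20us gap corresponds to 50000 interrupts per
second and a 50us gap to 20000; the 18K and 8K values come out near, not
exactly at, those rates.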



[PATCH net-next 02/11] net: hns3: remove TSO config command from VF driver

2018-01-11 Thread Peng Li
According to the hardware, only the main PF can configure the TSO MSS
length. This patch removes the TSO config command from the VF driver.

Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h |  8 
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c| 20 
 2 files changed, 28 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h
index ad8adfe..2caca93 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.h
@@ -86,8 +86,6 @@ enum hclgevf_opcode_type {
HCLGEVF_OPC_QUERY_TX_STATUS = 0x0B03,
HCLGEVF_OPC_QUERY_RX_STATUS = 0x0B13,
HCLGEVF_OPC_CFG_COM_TQP_QUEUE   = 0x0B20,
-   /* TSO cmd */
-   HCLGEVF_OPC_TSO_GENERIC_CONFIG  = 0x0C01,
/* RSS cmd */
HCLGEVF_OPC_RSS_GENERIC_CONFIG  = 0x0D01,
HCLGEVF_OPC_RSS_INDIR_TABLE = 0x0D07,
@@ -202,12 +200,6 @@ struct hclgevf_cfg_tx_queue_pointer_cmd {
u8 rsv[14];
 };
 
-#define HCLGEVF_TSO_ENABLE_B   0
-struct hclgevf_cfg_tso_status_cmd {
-   u8 tso_enable;
-   u8 rsv[23];
-};
-
 #define HCLGEVF_TYPE_CRQ   0
 #define HCLGEVF_TYPE_CSQ   1
 #define HCLGEVF_NIC_CSQ_BASEADDR_L_REG 0x27000
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 5f9afa6..3d2bc9a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -201,20 +201,6 @@ static int hclge_get_queue_info(struct hclgevf_dev *hdev)
return 0;
 }
 
-static int hclgevf_enable_tso(struct hclgevf_dev *hdev, int enable)
-{
-   struct hclgevf_cfg_tso_status_cmd *req;
-   struct hclgevf_desc desc;
-
-   req = (struct hclgevf_cfg_tso_status_cmd *)desc.data;
-
-   hclgevf_cmd_setup_basic_desc(, HCLGEVF_OPC_TSO_GENERIC_CONFIG,
-false);
-   hnae_set_bit(req->tso_enable, HCLGEVF_TSO_ENABLE_B, enable);
-
-   return hclgevf_cmd_send(>hw, , 1);
-}
-
 static int hclgevf_alloc_tqps(struct hclgevf_dev *hdev)
 {
struct hclgevf_tqp *tqp;
@@ -1375,12 +1361,6 @@ static int hclgevf_init_ae_dev(struct hnae3_ae_dev 
*ae_dev)
goto err_config;
}
 
-   ret = hclgevf_enable_tso(hdev, true);
-   if (ret) {
-   dev_err(>dev, "failed(%d) to enable tso\n", ret);
-   goto err_config;
-   }
-
/* Initialize VF's MTA */
hdev->accept_mta_mc = true;
ret = hclgevf_cfg_func_mta_filter(>nic, hdev->accept_mta_mc);
-- 
1.9.1



[PATCH net-next 07/11] net: hns3: remove unused GL setup function

2018-01-11 Thread Peng Li
From: Fuyun Liang 

Since the TX GL and the RX GL need to be set separately,
hns3_set_vector_coalesc_gl() has been replaced with
hns3_set_vector_coalesce_rx_gl() and hns3_set_vector_coalesce_tx_gl().

This patch removes hns3_set_vector_coalesc_gl().

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 2a139ef..2e9e61c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -158,18 +158,6 @@ static void hns3_vector_disable(struct 
hns3_enet_tqp_vector *tqp_vector)
napi_disable(_vector->napi);
 }
 
-static void hns3_set_vector_coalesc_gl(struct hns3_enet_tqp_vector *tqp_vector,
-  u32 gl_value)
-{
-   /* this defines the configuration for GL (Interrupt Gap Limiter)
-* GL defines inter interrupt gap.
-* GL and RL(Rate Limiter) are 2 ways to acheive interrupt coalescing
-*/
-   writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL0_OFFSET);
-   writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET);
-   writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL2_OFFSET);
-}
-
 void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector,
 u32 rl_value)
 {
-- 
1.9.1



[PATCH net-next 05/11] net: hns3: refactor interrupt coalescing init function

2018-01-11 Thread Peng Li
From: Fuyun Liang 

In the hardware, the configurable coalesce registers are GL0, GL1 and
GL2. In the driver, the TX queues use register GL1 and the RX queues
use register GL0. hns3_vector_gl_rl_init() initializes the interrupt
coalescing configuration, but does not distinguish between the TX
direction and the RX direction, which causes some confusion.

This patch refactors the function to initialize the TX GL and the RX GL
separately, and also adds the initialization of the related variables.
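
The split can be pictured with a small userspace model. This is only a
sketch under stated assumptions: the struct, the array that stands in for
the GL0..GL2 registers, and all model_* names are hypothetical and not the
driver's API.

```c
#include <assert.h>
#include <stdint.h>

/* Model only: index 0 stands in for GL0 (RX), index 1 for GL1 (TX). */
enum { MODEL_GL0_RX = 0, MODEL_GL1_TX = 1, MODEL_GL2 = 2, MODEL_GL_COUNT = 3 };

struct vector_model {
	uint32_t gl_reg[MODEL_GL_COUNT];	/* stand-in for the GL registers */
	uint32_t rx_int_gl;			/* cached RX setting */
	uint32_t tx_int_gl;			/* cached TX setting */
};

static void model_set_rx_gl(struct vector_model *v, uint32_t gl)
{
	v->gl_reg[MODEL_GL0_RX] = gl;
}

static void model_set_tx_gl(struct vector_model *v, uint32_t gl)
{
	v->gl_reg[MODEL_GL1_TX] = gl;
}

/* Initialize each direction on its own instead of writing one value
 * to every GL register.
 */
static void model_vector_gl_init(struct vector_model *v, uint32_t default_gl)
{
	v->tx_int_gl = default_gl;
	v->rx_int_gl = default_gl;
	model_set_tx_gl(v, v->tx_int_gl);
	model_set_rx_gl(v, v->rx_int_gl);
}
```

In this model the GL2 slot is never written, which is the visible
difference from a helper that writes the same value to all three registers.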

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 29 +
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 32c9f88..59d8d9f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -206,21 +206,32 @@ void hns3_set_vector_coalesce_tx_gl(struct 
hns3_enet_tqp_vector *tqp_vector,
writel(tx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET);
 }
 
-static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector)
+static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector,
+  struct hns3_nic_priv *priv)
 {
+   struct hnae3_handle *h = priv->ae_handle;
+
/* initialize the configuration for interrupt coalescing.
 * 1. GL (Interrupt Gap Limiter)
 * 2. RL (Interrupt Rate Limiter)
 */
 
-   /* Default :enable interrupt coalesce */
-   tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K;
+   /* Default: enable interrupt coalescing self-adaptive and GL */
+   tqp_vector->tx_group.gl_adapt_enable = 1;
+   tqp_vector->rx_group.gl_adapt_enable = 1;
+
tqp_vector->tx_group.int_gl = HNS3_INT_GL_50K;
-   hns3_set_vector_coalesc_gl(tqp_vector, HNS3_INT_GL_50K);
-   /* for now we are disabling Interrupt RL - we
-* will re-enable later
-*/
-   hns3_set_vector_coalesce_rl(tqp_vector, 0);
+   tqp_vector->rx_group.int_gl = HNS3_INT_GL_50K;
+
+   hns3_set_vector_coalesce_tx_gl(tqp_vector,
+  tqp_vector->tx_group.int_gl);
+   hns3_set_vector_coalesce_rx_gl(tqp_vector,
+  tqp_vector->rx_group.int_gl);
+
+   /* Default: disable RL */
+   h->kinfo.int_rl_setting = 0;
+   hns3_set_vector_coalesce_rl(tqp_vector, h->kinfo.int_rl_setting);
+
tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW;
tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW;
 }
@@ -2654,7 +2665,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv 
*priv)
tqp_vector->rx_group.total_packets = 0;
tqp_vector->tx_group.total_bytes = 0;
tqp_vector->tx_group.total_packets = 0;
-   hns3_vector_gl_rl_init(tqp_vector);
+   hns3_vector_gl_rl_init(tqp_vector, priv);
tqp_vector->handle = h;
 
ret = hns3_get_vector_ring_chain(tqp_vector,
-- 
1.9.1



[PATCH net-next 01/11] net: hns3: add ethtool_ops.get_channels support for VF

2018-01-11 Thread Peng Li
This patch adds support for ethtool's get_channels() for the VF.

Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  1 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 30 ++
 2 files changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index d3cb3ec..f44336c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -900,6 +900,7 @@ static void hns3_get_channels(struct net_device *netdev,
.get_rxfh = hns3_get_rss,
.set_rxfh = hns3_set_rss,
.get_link_ksettings = hns3_get_link_ksettings,
+   .get_channels = hns3_get_channels,
 };
 
 static const struct ethtool_ops hns3_ethtool_ops = {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 655f522..5f9afa6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1433,6 +1433,35 @@ static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev 
*ae_dev)
ae_dev->priv = NULL;
 }
 
+static u32 hclgevf_get_max_channels(struct hclgevf_dev *hdev)
+{
+   struct hnae3_handle *nic = >nic;
+   struct hnae3_knic_private_info *kinfo = >kinfo;
+
+   return min_t(u32, hdev->rss_size_max * kinfo->num_tc, hdev->num_tqps);
+}
+
+/**
+ * hclgevf_get_channels - Get the current channels enabled and max supported.
+ * @handle: hardware information for network interface
+ * @ch: ethtool channels structure
+ *
+ * We don't support separate tx and rx queues as channels. The other count
+ * represents how many queues are being used for control. max_combined counts
+ * how many queue pairs we can support. They may not be mapped 1 to 1 with
+ * q_vectors since we support a lot more queue pairs than q_vectors.
+ **/
+static void hclgevf_get_channels(struct hnae3_handle *handle,
+struct ethtool_channels *ch)
+{
+   struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+
+   ch->max_combined = hclgevf_get_max_channels(hdev);
+   ch->other_count = 0;
+   ch->max_other = 0;
+   ch->combined_count = hdev->num_tqps;
+}
+
 static const struct hnae3_ae_ops hclgevf_ops = {
.init_ae_dev = hclgevf_init_ae_dev,
.uninit_ae_dev = hclgevf_uninit_ae_dev,
@@ -1462,6 +1491,7 @@ static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev 
*ae_dev)
.get_tc_size = hclgevf_get_tc_size,
.get_fw_version = hclgevf_get_fw_version,
.set_vlan_filter = hclgevf_set_vlan_filter,
+   .get_channels = hclgevf_get_channels,
 };
 
 static struct hnae3_ae_algo ae_algovf = {
-- 
1.9.1



Re: [PATCH 03/32] fs: introduce new ->get_poll_head and ->poll_mask methods

2018-01-11 Thread Christoph Hellwig
On Wed, Jan 10, 2018 at 09:04:16PM +, Al Viro wrote:
> There's another problem with that - currently ->poll() may tell you "sod off,
> I've got nothing for you to sleep on, eat your POLLHUP|POLLERR|something
> and don't pester me again".  With your API that's hard to express sanely.

And what exactly can tell you 'sod off' right now?  ->poll
can only return the (E)POLL* mask.  But what would probably be sane
is to do the same thing in vfs_poll I already do in aio poll:  call
->poll_mask a first time before calling poll_wait to clear any
already pending events.  That way any early error gets instantly
propagated.
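
Roughly, as a userspace model of that ordering. This is a sketch, not the
proposed kernel code: struct file_model and the model_* functions are
hypothetical, and the second check after queueing is an assumption added
here to keep the model race-free.

```c
#include <assert.h>

#define M_POLLIN	0x001u
#define M_POLLERR	0x008u
#define M_POLLHUP	0x010u

struct file_model {
	unsigned int pending;	/* events the "file" currently reports */
	int waiters;		/* how often we were put on its wait queue */
};

/* Stand-in for ->poll_mask: report which requested events are pending. */
static unsigned int model_poll_mask(const struct file_model *f,
				    unsigned int events)
{
	return f->pending & events;
}

/* Stand-in for poll_wait(): register on the wait queue. */
static void model_poll_wait(struct file_model *f)
{
	f->waiters++;
}

static unsigned int model_vfs_poll(struct file_model *f, unsigned int events)
{
	/* First call: anything already pending is returned at once. */
	unsigned int mask = model_poll_mask(f, events);

	if (mask)
		return mask;

	/* Nothing pending: queue up, then check again. */
	model_poll_wait(f);
	return model_poll_mask(f, events);
}
```

With a pending POLLHUP the model returns immediately and never touches the
wait queue, which is the "early error gets instantly propagated" case.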

> Another piece of fun related to that is handling of disconnects in general -
> 
> static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct 
> *pts)
> {
> struct proc_dir_entry *pde = PDE(file_inode(file));
> __poll_t rv = DEFAULT_POLLMASK;
> __poll_t (*poll)(struct file *, struct poll_table_struct *);
> if (use_pde(pde)) {
> poll = pde->proc_fops->poll;
> if (poll)
> rv = poll(file, pts);
> unuse_pde(pde);
> }
> return rv;
> }
> 
> and similar in sysfs.  

Can't find anything in sysfs, but debugfs has an amazingly bad variant
of the above, including confidence ensuring commit description bits like:

In order not to pollute debugfs with wrapper definitions that aren't ever
needed, I chose not to define a wrapper for every struct file_operations
method possible. Instead, a wrapper is defined only for the subset of
methods which are actually set by any debugfs users.
Currently, these are:
 
  ->llseek()
  ->read()
  ->write()
  ->unlocked_ioctl()
  ->poll()

So anyone implementing say, read_iter/write_iter or compat_ioctl
silently doesn't get the magic protection..

Either way - those two will need updating for the new scheme if
we add proc/debugfs ops, so I better do them now and convert at least
one example each.

> Note, BTW, the places like wait->_qproc = NULL; in do_select() and its ilk.
> Some of them are "don't bother putting me on any queues, I won't be sleeping
> anyway".  Some are "I'm already on all queues I care about, I'm going to
> sleep now and then query everything again once woken up".  It would be nice
> to have the method splitup reflect that kind of logics...

Hmm.  ->poll_mask already is a simple 'are these events pending'
method, and thus should deal perfectly fine with both cases.  What
additional split do you think would be helpful?

> What about af_alg_poll(), BTW?  Looks like you've missed that one...

Converted for the next iteration.

> Another thing: IMO file_can_poll() should use FMODE_CAN_POLL - either as
> "true if set, otherwise check ->f_op and set accordingly" or set in
> do_dentry_open() and just check it in file_can_poll()...

I don't really see the point of wasting a fmode bit for it.  But
if you really want that I can do it.


[PATCH 02/11] xfrm: skip policies marked as dead while rehashing

2018-01-11 Thread Steffen Klassert
From: Florian Westphal 

syzkaller triggered the following KASAN splat:

BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 
net/xfrm/xfrm_policy.c:618
read of size 2 at addr 8801c8e92fe4 by task kworker/1:1/23 [..]
Workqueue: events xfrm_hash_rebuild [..]
 __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
 xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618
 process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112
 worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..]

The reproducer triggers:
1016 if (error) {
1017 list_move_tail(>walk.all, >all);
1018 goto out;
1019 }

in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump
callback returns -ENOBUFS).

In this case, *walk is located in the pfkey socket struct, so this socket
becomes visible in the global policy list.

It looks like this is intentional -- phony walker has walk.dead set to 1
and all other places skip such "policies".
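
A small model of the rehash loop shows why the dead check has to come
first. This is a sketch only: struct entry_model and the model_* names are
hypothetical, and the direction mask and limit are model values. In the
real list a dead walker is not embedded in a policy at all, so reading its
"index" is the out-of-bounds access KASAN reported.

```c
#include <assert.h>

#define MODEL_POLICY_MAX 3	/* model value for the direction limit */

struct entry_model {
	int dead;		/* 1 for a phony walker entry */
	unsigned int index;	/* only meaningful when !dead */
};

static unsigned int model_id2dir(unsigned int index)
{
	return index & 7;
}

/* Walk the list in reverse and count the entries that would be
 * re-inserted into the hash table.
 */
static int model_count_rehashed(const struct entry_model *all, int n)
{
	int rehashed = 0;

	for (int i = n - 1; i >= 0; i--) {
		/* Check ->dead before looking at any policy field. */
		if (all[i].dead ||
		    model_id2dir(all[i].index) >= MODEL_POLICY_MAX)
			continue;
		rehashed++;
	}
	return rehashed;
}
```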

Ccing original authors of the two commits that seem to expose this
issue (first patch missed ->dead check, second patch adds pfkey
sockets to policies dumper list).

Fixes: 880a6fab8f6ba5b ("xfrm: configure policy hash table thresholds by 
netlink")
Fixes: 12a169e7d8f4b1c ("ipsec: Put dumpers on the dump list")
Cc: Herbert Xu 
Cc: Timo Teras 
Cc: Christophe Gouault 
Reported-by: syzbot 

Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 70aa5cb0c659..2ef6db98e9ba 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -609,7 +609,8 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 
/* re-insert all policies by order of creation */
list_for_each_entry_reverse(policy, >xfrm.policy_all, walk.all) {
-   if (xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) {
+   if (policy->walk.dead ||
+   xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) {
/* skip socket policies */
continue;
}
-- 
2.14.1



[PATCH 07/11] xfrm: Return error on unknown encap_type in init_state

2018-01-11 Thread Steffen Klassert
From: Herbert Xu 

Currently esp will happily create an xfrm state with an unknown
encap type for IPv4, without setting the necessary state parameters.
This patch fixes it by returning -EINVAL.

There is a similar problem in IPv6 where if the mode is unknown
we will skip initialisation while returning zero.  However, this
is harmless as the mode has already been checked further up the
stack.  This patch removes this anomaly by aligning the IPv6
behaviour with IPv4 and treating unknown modes (which cannot
actually happen) as transport mode.
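
For the IPv4 side, the effect can be sketched as follows. The constants
and the function name are model values chosen for illustration, not the
kernel's definitions; the byte counts assume an 8-byte UDP header.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_ENCAP_ESPINUDP_NON_IKE	1
#define MODEL_ENCAP_ESPINUDP		2

/* Returns 0 and stores the extra header length for a known encap type,
 * or -EINVAL (leaving *extra untouched) for an unknown one.
 */
static int model_esp4_encap_len(int encap_type, int *extra)
{
	switch (encap_type) {
	default:
		/* Unknown type: fail instead of creating a half-set-up state. */
		return -EINVAL;
	case MODEL_ENCAP_ESPINUDP:
		*extra = 8;		/* UDP header */
		return 0;
	case MODEL_ENCAP_ESPINUDP_NON_IKE:
		*extra = 8 + 2 * 4;	/* UDP header plus two 32-bit words */
		return 0;
	}
}
```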

Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP")
Signed-off-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4.c | 1 +
 net/ipv6/esp6.c | 3 +--
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index d57aa64fa7c7..61fe6e4d23fc 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -981,6 +981,7 @@ static int esp_init_state(struct xfrm_state *x)
 
switch (encap->encap_type) {
default:
+   err = -EINVAL;
goto error;
case UDP_ENCAP_ESPINUDP:
x->props.header_len += sizeof(struct udphdr);
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index a902ff8f59be..1a7f00cd4803 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -890,13 +890,12 @@ static int esp6_init_state(struct xfrm_state *x)
x->props.header_len += IPV4_BEET_PHMAXLEN +
   (sizeof(struct ipv6hdr) - 
sizeof(struct iphdr));
break;
+   default:
case XFRM_MODE_TRANSPORT:
break;
case XFRM_MODE_TUNNEL:
x->props.header_len += sizeof(struct ipv6hdr);
break;
-   default:
-   goto error;
}
 
align = ALIGN(crypto_aead_blocksize(aead), 4);
-- 
2.14.1



pull request (net): ipsec 2018-01-11

2018-01-11 Thread Steffen Klassert
1) Don't allow to change the encap type on state updates.
   The encap type is set on state initialization and
   should not change anymore. From Herbert Xu.

2) Skip dead policies when rehashing to fix a
   slab-out-of-bounds bug in xfrm_hash_rebuild.
   From Florian Westphal.

3) Two buffer overread fixes in pfkey.
   From Eric Biggers.

4) Fix rcu usage in xfrm_get_type_offload,
   request_module can sleep, so can't be used
   under rcu_read_lock. From Sabrina Dubroca.

5) Fix an uninitialized lock in xfrm_trans_queue.
   Use __skb_queue_tail instead of skb_queue_tail
   in xfrm_trans_queue as we don't need the lock.
   From Herbert Xu.

6) Currently it is possible to create an xfrm state with an
   unknown encap type in ESP IPv4. Fix this by returning an
   error on unknown encap types. Also from Herbert Xu.

7) Fix sleeping inside a spinlock in xfrm_policy_cache_flush.
   From Florian Westphal.

8) Fix ESP GRO when the headers are not fully in the linear part
   of the skb. We need to pull before we can access them.

9) Fix a skb leak on error in key_notify_policy.

10) Fix a race in the xdst pcpu cache, we need to
run the resolver routines with bottom halves
off like the old flowcache did.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 2758b3e3e630ba304fc4aca434d591e70e528298:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-12-28 
23:20:21 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 76a4201191814a0061cb5c861fafb9ecaa764846:

  xfrm: Fix a race in the xdst pcpu cache. (2018-01-10 12:14:28 +0100)


Eric Biggers (2):
  af_key: fix buffer overread in verify_address_len()
  af_key: fix buffer overread in parse_exthdrs()

Florian Westphal (2):
  xfrm: skip policies marked as dead while rehashing
  xfrm: don't call xfrm_policy_cache_flush while holding spinlock

Herbert Xu (3):
  xfrm: Forbid state updates from changing encap type
  xfrm: Use __skb_queue_tail in xfrm_trans_queue
  xfrm: Return error on unknown encap_type in init_state

Sabrina Dubroca (1):
  xfrm: fix rcu usage in xfrm_get_type_offload

Steffen Klassert (3):
  esp: Fix GRO when the headers not fully in the linear part of the skb.
  af_key: Fix memory leak in key_notify_policy.
  xfrm: Fix a race in the xdst pcpu cache.

 net/ipv4/esp4.c |  1 +
 net/ipv4/esp4_offload.c |  3 ++-
 net/ipv6/esp6.c |  3 +--
 net/ipv6/esp6_offload.c |  3 ++-
 net/key/af_key.c| 12 +++-
 net/xfrm/xfrm_input.c   |  2 +-
 net/xfrm/xfrm_policy.c  | 15 +++
 net/xfrm/xfrm_state.c   | 11 +--
 8 files changed, 38 insertions(+), 12 deletions(-)


[PATCH 01/11] xfrm: Forbid state updates from changing encap type

2018-01-11 Thread Steffen Klassert
From: Herbert Xu 

Currently we allow state updates to completely replace the contents
of x->encap.  This is bad because on the user side ESP only sets up
header lengths depending on encap_type once when the state is first
created.  This could result in the header lengths getting out of
sync with the actual state configuration.

In practice key managers will never do a state update to change the
encapsulation type.  Only the port numbers need to be changed as the
peer NAT entry is updated.

Therefore this patch adds a check in xfrm_state_update to forbid
any changes to the encap_type.
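
The rule can be sketched in isolation like this. It is a userspace model
only: struct encap_model and model_update_encap() are hypothetical names,
and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct encap_model {
	int encap_type;
	int sport;
	int dport;
};

/* cur is the existing state's encap (may be NULL), upd the update's.
 * Ports may change, the encapsulation type may not.
 */
static int model_update_encap(struct encap_model *cur,
			      const struct encap_model *upd)
{
	if (cur && upd && cur->encap_type == upd->encap_type) {
		*cur = *upd;	/* same type: take over the new ports */
		return 0;
	}
	if (cur || upd)
		return -EINVAL;	/* type changed, or encap added/removed */
	return 0;		/* neither side uses encapsulation */
}
```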

Signed-off-by: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 500b3391f474..1e80f68e2266 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1534,8 +1534,12 @@ int xfrm_state_update(struct xfrm_state *x)
err = -EINVAL;
spin_lock_bh(>lock);
if (likely(x1->km.state == XFRM_STATE_VALID)) {
-   if (x->encap && x1->encap)
+   if (x->encap && x1->encap &&
+   x->encap->encap_type == x1->encap->encap_type)
memcpy(x1->encap, x->encap, sizeof(*x1->encap));
+   else if (x->encap || x1->encap)
+   goto fail;
+
if (x->coaddr && x1->coaddr) {
memcpy(x1->coaddr, x->coaddr, sizeof(*x1->coaddr));
}
@@ -1552,6 +1556,8 @@ int xfrm_state_update(struct xfrm_state *x)
x->km.state = XFRM_STATE_DEAD;
__xfrm_state_put(x);
}
+
+fail:
spin_unlock_bh(>lock);
 
xfrm_state_put(x1);
-- 
2.14.1



[PATCH 03/11] af_key: fix buffer overread in verify_address_len()

2018-01-11 Thread Steffen Klassert
From: Eric Biggers 

If a message sent to a PF_KEY socket ended with one of the extensions
that takes a 'struct sadb_address' but there were not enough bytes
remaining in the message for the ->sa_family member of the 'struct
sockaddr' which is supposed to follow, then verify_address_len() read
past the end of the message, into uninitialized memory.  Fix it by
returning -EINVAL in this case.
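
The minimum length works out as follows. The sketch uses model structs
laid out like the 8-byte 'struct sadb_address' and a sockaddr with a
16-bit family field; the model_* names and macros are hypothetical
stand-ins for the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct model_sadb_address {		/* 8 bytes */
	uint16_t len;			/* in 64-bit words */
	uint16_t exttype;
	uint8_t proto;
	uint8_t prefixlen;
	uint16_t reserved;
};

struct model_sockaddr {
	uint16_t sa_family;
	char sa_data[14];
};

#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define MODEL_OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* The extension must be long enough to hold the header plus the
 * sa_family field before sa_family may be read.
 */
static int model_check_address_len(uint16_t sadb_address_len)
{
	size_t need = MODEL_DIV_ROUND_UP(
		sizeof(struct model_sadb_address) +
		MODEL_OFFSETOFEND(struct model_sockaddr, sa_family),
		sizeof(uint64_t));

	if (sadb_address_len < need)
		return -EINVAL;
	return 0;
}
```

Here 8 + 2 = 10 bytes round up to two 64-bit words, so an extension with
sadb_address_len = 1, as in the reproducer, is rejected.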

This bug was found using syzkaller with KMSAN.

Reproducer:

#include <linux/pfkeyv2.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2);
char buf[24] = { 0 };
struct sadb_msg *msg = (void *)buf;
struct sadb_address *addr = (void *)(msg + 1);

msg->sadb_msg_version = PF_KEY_V2;
msg->sadb_msg_type = SADB_DELETE;
msg->sadb_msg_len = 3;
addr->sadb_address_len = 1;
addr->sadb_address_exttype = SADB_EXT_ADDRESS_SRC;

write(sock, buf, 24);
}

Reported-by: Alexander Potapenko 
Cc: sta...@vger.kernel.org
Signed-off-by: Eric Biggers 
Signed-off-by: Steffen Klassert 
---
 net/key/af_key.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 3dffb892d52c..596499cc8b2f 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -401,6 +401,11 @@ static int verify_address_len(const void *p)
 #endif
int len;
 
+   if (sp->sadb_address_len <
+   DIV_ROUND_UP(sizeof(*sp) + offsetofend(typeof(*addr), sa_family),
+sizeof(uint64_t)))
+   return -EINVAL;
+
switch (addr->sa_family) {
case AF_INET:
len = DIV_ROUND_UP(sizeof(*sp) + sizeof(*sin), 
sizeof(uint64_t));
-- 
2.14.1



Re: [Patch net] tipc: fix a memory leak in tipc_nl_node_get_link()

2018-01-11 Thread Ying Xue
On 01/11/2018 04:50 AM, Cong Wang wrote:
> When tipc_node_find_by_name() fails, the nlmsg is not
> freed.
> 
> While on it, switch to a goto label to properly
> free it.
> 
> Fixes: be9c086715c ("tipc: narrow down exposure of struct tipc_node")
> Reported-by: Dmitry Vyukov 
> Cc: Jon Maloy 
> Cc: Ying Xue 
> Signed-off-by: Cong Wang 

Acked-by: Ying Xue 

> ---
>  net/tipc/node.c | 26 ++
>  1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index 507017fe0f1b..9036d8756e73 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -1880,36 +1880,38 @@ int tipc_nl_node_get_link(struct sk_buff *skb, struct 
> genl_info *info)
>  
>   if (strcmp(name, tipc_bclink_name) == 0) {
>   err = tipc_nl_add_bc_link(net, );
> - if (err) {
> - nlmsg_free(msg.skb);
> - return err;
> - }
> + if (err)
> + goto err_free;
>   } else {
>   int bearer_id;
>   struct tipc_node *node;
>   struct tipc_link *link;
>  
>   node = tipc_node_find_by_name(net, name, _id);
> - if (!node)
> - return -EINVAL;
> + if (!node) {
> + err = -EINVAL;
> + goto err_free;
> + }
>  
>   tipc_node_read_lock(node);
>   link = node->links[bearer_id].link;
>   if (!link) {
>   tipc_node_read_unlock(node);
> - nlmsg_free(msg.skb);
> - return -EINVAL;
> + err = -EINVAL;
> + goto err_free;
>   }
>  
>   err = __tipc_nl_add_link(net, , link, 0);
>   tipc_node_read_unlock(node);
> - if (err) {
> - nlmsg_free(msg.skb);
> - return err;
> - }
> + if (err)
> + goto err_free;
>   }
>  
>   return genlmsg_reply(msg.skb, info);
> +
> +err_free:
> + nlmsg_free(msg.skb);
> + return err;
>  }
>  
>  int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info 
> *info)
> 


Re: [PATCH] [net-next] net: socionext: include linux/io.h to fix build

2018-01-11 Thread Ard Biesheuvel
On 11 January 2018 at 10:36, Arnd Bergmann  wrote:
> I ran into a randconfig build failure:
>
> drivers/net/ethernet/socionext/netsec.c: In function 'netsec_probe':
> drivers/net/ethernet/socionext/netsec.c:1583:17: error: implicit declaration 
> of function 'devm_ioremap'; did you mean 'ioremap'? 
> [-Werror=implicit-function-declaration]
>
> Including linux/io.h directly fixes this.
>
> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
> Signed-off-by: Arnd Bergmann 

Thanks for fixing this. This is the same issue spotted by kbuild test robot.

Acked-by: Ard Biesheuvel 

> ---
>  drivers/net/ethernet/socionext/netsec.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index a8edcf387bba..af47147dd656 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> --
> 2.9.0
>


Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 10:37:10AM CET, j...@resnulli.us wrote:
>Wed, Jan 10, 2018 at 05:48:09PM CET, dsah...@gmail.com wrote:
>>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>>> index 9c026d9..038cde7 100644
>>> --- a/include/uapi/linux/rtnetlink.h
>>> +++ b/include/uapi/linux/rtnetlink.h
>>> @@ -150,6 +150,12 @@ enum {
>>> RTM_NEWCACHEREPORT = 96,
>>>  #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
>>>  
>>> +   RTM_NEWBLOCK = 100,
>>> +#define RTM_NEWBLOCK RTM_NEWBLOCK
>>> +   RTM_DELBLOCK,
>>> +#define RTM_DELBLOCK RTM_DELBLOCK
>>> +   RTM_GETBLOCK,
>>> +#define RTM_GETBLOCK RTM_GETBLOCK
>>> __RTM_MAX,
>>>  #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
>>>  };
>>
>>Seems like this is creating an inconsistency. RTM_GETBLOCK is used to
>>dump the set of shared blocks, but RTM_NEWBLOCK / RTM_DELBLOCK are not
>>used to create / delete one.
>
>Why is it a problem? RTM_NEWBLOCK is used as a reply for RTM_GETBLOCK.
>I plan to have block notifications as a follow-up, there the RTM_GETBLOCK

I mean RTM_NEWBLOCK and RTM_DELBLOCK of course.

>and RTM_DELBLOCK will be used. The fact the user cannot create and
>delete block explicitly is no problem in my opinion. The block creation
>and deletion is done according to usage of qdiscs.
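
For reference, rtnetlink message types are allocated in groups of four,
with the low two bits selecting NEW, DEL, GET or SET, which is why the
three BLOCK values are reserved together starting at a multiple of four.
A small sketch of that arithmetic, using MODEL_ prefixed copies of the
values from the quoted hunk (the prefix and the helper name are only for
illustration):

```c
#include <assert.h>

enum {
	MODEL_RTM_NEWBLOCK = 100,
	MODEL_RTM_DELBLOCK,
	MODEL_RTM_GETBLOCK,
	MODEL_RTM_MAX_SENTINEL,		/* plays the role of __RTM_MAX */
};

/* Same rounding as the RTM_MAX macro in the quoted hunk. */
#define MODEL_RTM_MAX (((MODEL_RTM_MAX_SENTINEL + 3) & ~3) - 1)

/* 0 = NEW, 1 = DEL, 2 = GET, 3 = SET */
static int model_rtm_kind(int type)
{
	return type & 3;
}
```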


[PATCH net-next] net: phy: mdio-bcm-unimac: fix potential NULL dereference in unimac_mdio_probe()

2018-01-11 Thread Wei Yongjun
platform_get_resource() may fail and return NULL, so we should
better check its return value to avoid a NULL pointer dereference
a bit later in the code.
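
The pattern, reduced to a userspace model (a sketch only; the struct and
the model_* functions are hypothetical and merely mimic a lookup that can
return NULL):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_resource {
	unsigned long start;
	unsigned long end;
};

/* Mimics a resource lookup: NULL when the requested entry is missing. */
static struct model_resource *model_get_resource(struct model_resource *table,
						 size_t count, size_t index)
{
	return index < count ? &table[index] : NULL;
}

static int model_probe(struct model_resource *table, size_t count,
		       unsigned long *base)
{
	struct model_resource *r = model_get_resource(table, count, 0);

	if (!r)
		return -EINVAL;	/* bail out before r->start is touched */

	*base = r->start;
	return 0;
}
```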

This is detected by Coccinelle semantic patch.

@@
expression pdev, res, n, t, e, e1, e2;
@@

res = platform_get_resource(pdev, t, n);
+ if (!res)
+   return -EINVAL;
... when != res == NULL
e = devm_ioremap(e1, res->start, e2);

Signed-off-by: Wei Yongjun 
---
 drivers/net/phy/mdio-bcm-unimac.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/phy/mdio-bcm-unimac.c 
b/drivers/net/phy/mdio-bcm-unimac.c
index 08e0647..8d37066 100644
--- a/drivers/net/phy/mdio-bcm-unimac.c
+++ b/drivers/net/phy/mdio-bcm-unimac.c
@@ -205,6 +205,8 @@ static int unimac_mdio_probe(struct platform_device *pdev)
return -ENOMEM;
 
r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+   if (!r)
+   return -EINVAL;
 
/* Just ioremap, as this MDIO block is usually integrated into an
 * Ethernet MAC controller register range



[PATCH net-next] net: socionext: Fix error return code in netsec_netdev_open()

2018-01-11 Thread Wei Yongjun
Fix to return error code -ENODEV from the of_phy_connect() error
handling case instead of 0, as done elsewhere in this function.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Signed-off-by: Wei Yongjun 
---
 drivers/net/ethernet/socionext/netsec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index a8edcf3..78e4ff6 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1292,6 +1292,7 @@ static int netsec_netdev_open(struct net_device *ndev)
netsec_phy_adjust_link, 0,
priv->phy_interface)) {
netif_err(priv, link, priv->ndev, "missing PHY\n");
+   ret = -ENODEV;
goto err3;
}
} else {



Re: [Patch net v2] tun: fix a memory leak for tfile->tx_array

2018-01-11 Thread Jason Wang



On 2018-01-11 02:51, Cong Wang wrote:

tfile->tun could be detached before we close the tun fd,
via tun_detach_all(), so it should not be used to check for
tfile->tx_array.

As Jason suggested, we probably have to clean it up
unconditionally, but this requires to check if it is initialized
or not. Currently skb_array_cleanup() doesn't have such a check,
so I check it in the caller, it is ugly but we can always
improve it in net-next.


Thinking about this again, it looks like I was wrong. The case I
mentioned previously is:

open
attach
detach
close

But during close, we will try to enable tfile through tun_enable_queue()
in __tun_detach(), which means we can do the cleanup for sure.

It looks to me that what is actually missing is the cleanup in
tun_detach_all(). For me the only case that could leak is:

open
attach
ip link del link dev tap0
close or another set_iff()

So in this case, cleaning up during close is not sufficient since it
could be attached to another device.


Thanks



Reported-by: Dmitry Vyukov 
Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
Cc: Jason Wang 
Signed-off-by: Cong Wang 
---
  drivers/net/tun.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4f4a842a1c9c..4c85474ffbaf 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -657,7 +657,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
-   if (tun)
+   if (tfile->tx_array.ring.queue)
skb_array_cleanup(>tx_array);
sock_put(>sk);
}
@@ -2851,6 +2851,8 @@ static int tun_chr_open(struct inode *inode, struct file 
* file)
  
  	sock_set_flag(>sk, SOCK_ZEROCOPY);
  
+	memset(>tx_array, 0, sizeof(tfile->tx_array));

+
return 0;
  }
  




Re: KASAN: use-after-free Read in __bpf_prog_put

2018-01-11 Thread Dmitry Vyukov
On Thu, Jan 11, 2018 at 11:17 AM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> 4147d50978df60f34d444c647dde9e5b34a4315e
> git://git.cmpxchg.org/linux-mmots.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> Unfortunately, I don't have any reproducer for this bug yet.
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d85bfb332db8f0794...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> netlink: 3 bytes leftover after parsing attributes in process
> `syz-executor5'.
> ==
> BUG: KASAN: use-after-free in __bpf_prog_put+0x5e8/0x640
> kernel/bpf/syscall.c:944
> netlink: 'syz-executor5': attribute type 5 has an invalid length.
> Read of size 8 at addr 8801d3619658 by task syz-executor0/12398
>
> CPU: 1 PID: 12398 Comm: syz-executor0 Not tainted 4.15.0-rc7-mm1+ #53
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report+0x23b/0x360 mm/kasan/report.c:412
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>  __bpf_prog_put+0x5e8/0x640 kernel/bpf/syscall.c:944
>  bpf_prog_put+0x1a/0x20 kernel/bpf/syscall.c:961
>  prog_fd_array_put_ptr+0x15/0x20 kernel/bpf/arraymap.c:446
>  fd_array_map_delete_elem+0xc8/0x110 kernel/bpf/arraymap.c:420
>  map_delete_elem kernel/bpf/syscall.c:737 [inline]
>  SYSC_bpf kernel/bpf/syscall.c:1814 [inline]
>  SyS_bpf+0x22ea/0x4400 kernel/bpf/syscall.c:1782
>  entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x452ac9
> RSP: 002b:7fb70df60c58 EFLAGS: 0212 ORIG_RAX: 0141
> RAX: ffda RBX: 0071bea0 RCX: 00452ac9
> RDX: 0010 RSI: 20f02ff0 RDI: 0003
> RBP: 03aa R08:  R09: 
> R10:  R11: 0212 R12: 006f3890
> R13:  R14: 7fb70df616d4 R15: 
>
> Allocated by task 11996:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3541
>  kmem_cache_zalloc include/linux/slab.h:694 [inline]
>  get_empty_filp+0xfb/0x4f0 fs/file_table.c:122
>  path_openat+0xed/0x3530 fs/namei.c:3514
>  do_filp_open+0x25b/0x3b0 fs/namei.c:3572
>  do_sys_open+0x502/0x6d0 fs/open.c:1059
>  SYSC_open fs/open.c:1077 [inline]
>  SyS_open+0x2d/0x40 fs/open.c:1072
>  entry_SYSCALL_64_fastpath+0x29/0xa0
>
> Freed by task 11994:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520
>  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527
>  __cache_free mm/slab.c:3485 [inline]
>  kmem_cache_free+0x86/0x2b0 mm/slab.c:3743
>  file_free_rcu+0x5c/0x70 fs/file_table.c:49
>  __rcu_reclaim kernel/rcu/rcu.h:172 [inline]
>  rcu_do_batch kernel/rcu/tree.c:2675 [inline]
>  invoke_rcu_callbacks kernel/rcu/tree.c:2934 [inline]
>  __rcu_process_callbacks kernel/rcu/tree.c:2901 [inline]
>  rcu_process_callbacks+0xd6c/0x17f0 kernel/rcu/tree.c:2918
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>
> The buggy address belongs to the object at 8801d36195c0
>  which belongs to the cache filp of size 456
> The buggy address is located 152 bytes inside of
>  456-byte region [8801d36195c0, 8801d3619788)
> The buggy address belongs to the page:
> page:ea00074d8640 count:1 mapcount:0 mapping:8801d36190c0 index:0x0
> flags: 0x2fffc000100(slab)
> raw: 02fffc000100 8801d36190c0  00010006
> raw: ea00074c49a0 ea000747a160 8801dae30180 
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  8801d3619500: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  8801d3619580: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> >8801d3619600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                                  ^
>  8801d3619680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  8801d3619700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==


Is it the same as "general protection fault in __bpf_prog_put"?
https://groups.google.com/forum/#!topic/syzkaller-bugs/jUsNMmVgms0

The first stack looks similar, but alloc/free stacks looks 

[patch net-next 1/5] mlxsw: reg: add rdpm register

2018-01-11 Thread Jiri Pirko
From: Yuval Mintz 

Add rdpm definition - router DSCP to priority mapping register.

Signed-off-by: Yuval Mintz 
Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/item.h |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/reg.h  | 37 ++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
index 28427f0758c7..31c886edc791 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/item.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
@@ -42,7 +42,7 @@
 
 struct mlxsw_item {
unsigned short  offset; /* bytes in container */
-   unsigned short  step;   /* step in bytes for indexed items */
+   short   step;   /* step in bytes for indexed items */
unsigned short  in_step_offset; /* offset within one step */
unsigned char   shift;  /* shift in bits */
unsigned char   element_size;   /* size of element in bit array */
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 6c4e08b8058a..0e08be41c8e0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -4827,6 +4827,42 @@ static inline void mlxsw_reg_ratr_counter_pack(char *payload, u64 counter_index,
mlxsw_reg_ratr_counter_set_type_set(payload, set_type);
 }
 
+/* RDPM - Router DSCP to Priority Mapping
+ * --
+ * Controls the mapping from DSCP field to switch priority on routed packets
+ */
+#define MLXSW_REG_RDPM_ID 0x8009
+#define MLXSW_REG_RDPM_BASE_LEN 0x00
+#define MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN 0x01
+#define MLXSW_REG_RDPM_DSCP_ENTRY_REC_MAX_COUNT 64
+#define MLXSW_REG_RDPM_LEN 0x40
+#define MLXSW_REG_RDPM_LAST_ENTRY (MLXSW_REG_RDPM_BASE_LEN + \
+  MLXSW_REG_RDPM_LEN - \
+  MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN)
+
+MLXSW_REG_DEFINE(rdpm, MLXSW_REG_RDPM_ID, MLXSW_REG_RDPM_LEN);
+
+/* reg_dscp_entry_e
+ * Enable update of the specific entry
+ * Access: Index
+ */
+MLXSW_ITEM8_INDEXED(reg, rdpm, dscp_entry_e, MLXSW_REG_RDPM_LAST_ENTRY, 7, 1,
+   -MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN, 0x00, false);
+
+/* reg_dscp_entry_prio
+ * Switch Priority
+ * Access: RW
+ */
+MLXSW_ITEM8_INDEXED(reg, rdpm, dscp_entry_prio, MLXSW_REG_RDPM_LAST_ENTRY, 0, 4,
+   -MLXSW_REG_RDPM_DSCP_ENTRY_REC_LEN, 0x00, false);
+
+static inline void mlxsw_reg_rdpm_pack(char *payload, unsigned short index,
+  u8 prio)
+{
+   mlxsw_reg_rdpm_dscp_entry_e_set(payload, index, 1);
+   mlxsw_reg_rdpm_dscp_entry_prio_set(payload, index, prio);
+}
+
 /* RICNT - Router Interface Counter Register
  * -
  * The RICNT register retrieves per port performance counters
@@ -7640,6 +7676,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(rtar),
MLXSW_REG(ratr),
MLXSW_REG(rtdp),
+   MLXSW_REG(rdpm),
MLXSW_REG(ricnt),
MLXSW_REG(rrcr),
MLXSW_REG(ralta),
-- 
2.14.3
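
The negative step passed to the two MLXSW_ITEM8_INDEXED definitions suggests that entries are laid out from the end of the register toward its start, which would explain why 'step' in struct mlxsw_item becomes a signed short. A minimal user-space sketch of that layout follows. It assumes that the byte offset of an indexed 8-bit item is the base offset plus step times index; rdpm_entry_offset() and rdpm_pack_sketch() are hypothetical helpers written for this illustration and are not mlxsw functions.

```c
#include <stdint.h>

#define RDPM_BASE_LEN		0x00
#define RDPM_ENTRY_REC_LEN	0x01
#define RDPM_LEN		0x40
#define RDPM_LAST_ENTRY		(RDPM_BASE_LEN + RDPM_LEN - RDPM_ENTRY_REC_LEN)

/* Assumed layout rule: byte offset = base offset + step * index,
 * with base offset RDPM_LAST_ENTRY and step -RDPM_ENTRY_REC_LEN.
 */
static int rdpm_entry_offset(unsigned short index)
{
	return RDPM_LAST_ENTRY + (-RDPM_ENTRY_REC_LEN) * (int)index;
}

/* Set the enable bit (bit 7) and the 4-bit priority (bits 3:0) of one entry. */
static void rdpm_pack_sketch(uint8_t *payload, unsigned short index,
			     uint8_t prio)
{
	uint8_t *entry = &payload[rdpm_entry_offset(index)];

	*entry = (uint8_t)((*entry & ~0x8Fu) | 0x80u | (prio & 0x0Fu));
}
```

Under that assumption, DSCP 0 lands in the last byte of the 0x40-byte payload and DSCP 63 in the first.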



[PATCH] [net-next] net: socionext: include linux/io.h to fix build

2018-01-11 Thread Arnd Bergmann
I ran into a randconfig build failure:

drivers/net/ethernet/socionext/netsec.c: In function 'netsec_probe':
drivers/net/ethernet/socionext/netsec.c:1583:17: error: implicit declaration of function 'devm_ioremap'; did you mean 'ioremap'? [-Werror=implicit-function-declaration]

Including linux/io.h directly fixes this.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/socionext/netsec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index a8edcf387bba..af47147dd656 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/io.h>
 
 #include 
 #include 
-- 
2.9.0



[PATCH net-next 09/11] net: hns3: add int_gl_idx setup for TX and RX queues

2018-01-11 Thread Peng Li
From: Fuyun Liang 

If int_gl_idx is not set, the interrupt coalesce index defaults to 0,
so the TX queues and the RX queues both use GL0 as the interrupt
coalesce GL switch. It should be GL1 for TX queues and GL0 for RX
queues.

This patch adds the int_gl_idx setup for TX queues and RX queues.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h |  5 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 11 +++
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c |  5 +
 3 files changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 0bad0e3..634e932 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -133,11 +133,16 @@ struct hnae3_vector_info {
 #define HNAE3_RING_TYPE_B 0
 #define HNAE3_RING_TYPE_TX 0
 #define HNAE3_RING_TYPE_RX 1
+#define HNAE3_RING_GL_IDX_S 0
+#define HNAE3_RING_GL_IDX_M GENMASK(1, 0)
+#define HNAE3_RING_GL_RX 0
+#define HNAE3_RING_GL_TX 1
 
 struct hnae3_ring_chain_node {
struct hnae3_ring_chain_node *next;
u32 tqp_index;
u32 flag;
+   u32 int_gl_idx;
 };
 
 #define HNAE3_IS_TX_RING(node) \
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 2e9e61c..34879c4 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2523,6 +2523,8 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector,
cur_chain->tqp_index = tx_ring->tqp->tqp_index;
hnae_set_bit(cur_chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_TX);
+   hnae_set_field(cur_chain->int_gl_idx, HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_TX);
 
cur_chain->next = NULL;
 
@@ -2538,6 +2540,10 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector,
chain->tqp_index = tx_ring->tqp->tqp_index;
hnae_set_bit(chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_TX);
+   hnae_set_field(chain->int_gl_idx,
+  HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S,
+  HNAE3_RING_GL_TX);
 
cur_chain = chain;
}
@@ -2549,6 +2555,8 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector,
cur_chain->tqp_index = rx_ring->tqp->tqp_index;
hnae_set_bit(cur_chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_RX);
+   hnae_set_field(cur_chain->int_gl_idx, HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_RX);
 
rx_ring = rx_ring->next;
}
@@ -2562,6 +2570,9 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector,
chain->tqp_index = rx_ring->tqp->tqp_index;
hnae_set_bit(chain->flag, HNAE3_RING_TYPE_B,
 HNAE3_RING_TYPE_RX);
+   hnae_set_field(chain->int_gl_idx, HNAE3_RING_GL_IDX_M,
+  HNAE3_RING_GL_IDX_S, HNAE3_RING_GL_RX);
+
cur_chain = chain;
 
rx_ring = rx_ring->next;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index d7352f5..27f0ab6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3409,6 +3409,11 @@ int hclge_bind_ring_with_vector(struct hclge_vport *vport,
   hnae_get_bit(node->flag, HNAE3_RING_TYPE_B));
hnae_set_field(tqp_type_and_id, HCLGE_TQP_ID_M,
   HCLGE_TQP_ID_S, node->tqp_index);
+   hnae_set_field(tqp_type_and_id, HCLGE_INT_GL_IDX_M,
+  HCLGE_INT_GL_IDX_S,
+  hnae_get_field(node->int_gl_idx,
+ HNAE3_RING_GL_IDX_M,
+ HNAE3_RING_GL_IDX_S));
req->tqp_type_and_id[i] = cpu_to_le16(tqp_type_and_id);
if (++i >= HCLGE_VECTOR_ELEMENTS_PER_CMD) {
req->int_cause_num = HCLGE_VECTOR_ELEMENTS_PER_CMD;
-- 
1.9.1
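
The hnae_set_field()/hnae_get_field() calls in this patch follow the usual mask-and-shift pattern. The sketch below shows that pattern in plain C so that the TX and RX GL index values can be traced by hand. set_field() and get_field() are illustrative re-implementations written for this example and may differ in detail from the driver's macros; the two-bit mask covers the same bits as GENMASK(1, 0).

```c
#include <stdint.h>

#define RING_GL_IDX_S	0
#define RING_GL_IDX_M	0x3u	/* same bits as GENMASK(1, 0) */
#define RING_GL_RX	0
#define RING_GL_TX	1

/* Clear the masked bits of origin, then store val at the given shift. */
static uint32_t set_field(uint32_t origin, uint32_t mask, unsigned int shift,
			  uint32_t val)
{
	return (origin & ~mask) | ((val << shift) & mask);
}

/* Extract the field selected by mask and move it down to bit 0. */
static uint32_t get_field(uint32_t origin, uint32_t mask, unsigned int shift)
{
	return (origin & mask) >> shift;
}
```

Storing RING_GL_TX in a zeroed int_gl_idx and reading it back yields 1, while an RX node yields 0, which is the distinction the bind command needs in order to pick GL1 for TX and GL0 for RX.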



[PATCH net-next 04/11] net: hns3: add ethtool_ops.set_coalesce support to PF

2018-01-11 Thread Peng Li
From: Fuyun Liang 

This patch adds ethtool_ops.set_coalesce support to PF.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c|  34 -
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  17 +++
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 141 +
 3 files changed, 188 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 14c7625..32c9f88 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -170,14 +170,40 @@ static void hns3_set_vector_coalesc_gl(struct hns3_enet_tqp_vector *tqp_vector,
writel(gl_value, tqp_vector->mask_addr + HNS3_VECTOR_GL2_OFFSET);
 }
 
-static void hns3_set_vector_coalesc_rl(struct hns3_enet_tqp_vector *tqp_vector,
-  u32 rl_value)
+void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector,
+u32 rl_value)
 {
+   u32 rl_reg = hns3_rl_usec_to_reg(rl_value);
+
/* this defines the configuration for RL (Interrupt Rate Limiter).
 * Rl defines rate of interrupts i.e. number of interrupts-per-second
 * GL and RL(Rate Limiter) are 2 ways to acheive interrupt coalescing
 */
-   writel(rl_value, tqp_vector->mask_addr + HNS3_VECTOR_RL_OFFSET);
+
+   if (rl_reg > 0 && !tqp_vector->tx_group.gl_adapt_enable &&
+   !tqp_vector->rx_group.gl_adapt_enable)
+   /* According to the hardware, the range of rl_reg is
+* 0-59 and the unit is 4.
+*/
+   rl_reg |=  HNS3_INT_RL_ENABLE_MASK;
+
+   writel(rl_reg, tqp_vector->mask_addr + HNS3_VECTOR_RL_OFFSET);
+}
+
+void hns3_set_vector_coalesce_rx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value)
+{
+   u32 rx_gl_reg = hns3_gl_usec_to_reg(gl_value);
+
+   writel(rx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL0_OFFSET);
+}
+
+void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value)
+{
+   u32 tx_gl_reg = hns3_gl_usec_to_reg(gl_value);
+
+   writel(tx_gl_reg, tqp_vector->mask_addr + HNS3_VECTOR_GL1_OFFSET);
 }
 
 static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector)
@@ -194,7 +220,7 @@ static void hns3_vector_gl_rl_init(struct hns3_enet_tqp_vector *tqp_vector)
/* for now we are disabling Interrupt RL - we
 * will re-enable later
 */
-   hns3_set_vector_coalesc_rl(tqp_vector, 0);
+   hns3_set_vector_coalesce_rl(tqp_vector, 0);
tqp_vector->rx_group.flow_level = HNS3_FLOW_LOW;
tqp_vector->tx_group.flow_level = HNS3_FLOW_LOW;
 }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 24f6109..7adbda8 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -451,11 +451,15 @@ enum hns3_link_mode_bits {
HNS3_LM_COUNT = 15
 };
 
+#define HNS3_INT_GL_MAX0x1FE0
 #define HNS3_INT_GL_50K0x000A
 #define HNS3_INT_GL_20K0x0019
 #define HNS3_INT_GL_18K0x001B
 #define HNS3_INT_GL_8K 0x003E
 
+#define HNS3_INT_RL_MAX0x00EC
+#define HNS3_INT_RL_ENABLE_MASK0x40
+
 struct hns3_enet_ring_group {
/* array of pointers to rings */
struct hns3_enet_ring *ring;
@@ -595,6 +599,12 @@ static inline void hns3_write_reg(void __iomem *base, u32 reg, u32 value)
 #define hns3_get_handle(ndev) \
(((struct hns3_nic_priv *)netdev_priv(ndev))->ae_handle)
 
+#define hns3_gl_usec_to_reg(int_gl) (int_gl >> 1)
+#define hns3_gl_round_down(int_gl) round_down(int_gl, 2)
+
+#define hns3_rl_usec_to_reg(int_rl) (int_rl >> 2)
+#define hns3_rl_round_down(int_rl) round_down(int_rl, 4)
+
 void hns3_ethtool_set_ops(struct net_device *netdev);
 int hns3_set_channels(struct net_device *netdev,
  struct ethtool_channels *ch);
@@ -607,6 +617,13 @@ int hns3_clean_rx_ring(
struct hns3_enet_ring *ring, int budget,
void (*rx_fn)(struct hns3_enet_ring *, struct sk_buff *));
 
+void hns3_set_vector_coalesce_rx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value);
+void hns3_set_vector_coalesce_tx_gl(struct hns3_enet_tqp_vector *tqp_vector,
+   u32 gl_value);
+void hns3_set_vector_coalesce_rl(struct hns3_enet_tqp_vector *tqp_vector,
+u32 rl_value);
+
 #ifdef CONFIG_HNS3_DCB
 void hns3_dcbnl_setup(struct hnae3_handle *handle);
 #else
diff --git 
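
The conversion macros added to hns3_enet.h imply that the GL registers count in units of 2 usec and the RL register in units of 4 usec, and hns3_set_vector_coalesce_rl() sets the enable bit only for a non-zero register value when adaptive GL is off for both groups. The standalone sketch below reproduces that arithmetic so the limits can be checked by hand. The function names are made up for this example; only the shifts, the 0x40 enable mask and the condition are taken from the hunks above.

```c
#include <stdint.h>

#define INT_RL_ENABLE_MASK	0x40u

/* usec -> register units, mirroring the >> 1 and >> 2 macros in the patch */
static uint32_t gl_usec_to_reg(uint32_t usec)
{
	return usec >> 1;
}

static uint32_t rl_usec_to_reg(uint32_t usec)
{
	return usec >> 2;
}

/* Value that would be written to the RL register for a given setting. */
static uint32_t rl_reg_value(uint32_t rl_usec, int tx_adapt, int rx_adapt)
{
	uint32_t reg = rl_usec_to_reg(rl_usec);

	/* enable rate limiting only when it is non-zero and adaptive GL is off */
	if (reg > 0 && !tx_adapt && !rx_adapt)
		reg |= INT_RL_ENABLE_MASK;

	return reg;
}
```

HNS3_INT_RL_MAX is 0xEC, i.e. 236 usec, which converts to 59 and so matches the 0-59 range stated in the code comment. Values below 4 usec convert to 0 and leave the limiter disabled.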

[PATCH net-next 11/11] net: hns3: check for NULL function pointer in hns3_nic_set_features

2018-01-11 Thread Peng Li
From: Jian Shen 

Check whether a hook is defined before calling it, to improve
reliability.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index a7ae4f3..ac84816 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1133,14 +1133,16 @@ static int hns3_nic_set_features(struct net_device *netdev,
}
}
 
-   if (changed & NETIF_F_HW_VLAN_CTAG_FILTER) {
+   if ((changed & NETIF_F_HW_VLAN_CTAG_FILTER) &&
+   h->ae_algo->ops->enable_vlan_filter) {
if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
h->ae_algo->ops->enable_vlan_filter(h, true);
else
h->ae_algo->ops->enable_vlan_filter(h, false);
}
 
-   if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
+   if ((changed & NETIF_F_HW_VLAN_CTAG_RX) &&
+   h->ae_algo->ops->enable_hw_strip_rxvtag) {
if (features & NETIF_F_HW_VLAN_CTAG_RX)
ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, true);
else
-- 
1.9.1
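
The pattern in this patch, testing an optional callback before invoking it, can be shown with a small standalone ops table. Everything below (struct demo_ops, struct demo_handle, demo_set_vlan_filter()) is hypothetical and only mirrors the shape of the check; it is not the hns3 code.

```c
#include <stddef.h>

struct demo_handle;

/* Hypothetical ops table; a backend may leave optional hooks NULL. */
struct demo_ops {
	int (*enable_vlan_filter)(struct demo_handle *h, int enable);
};

struct demo_handle {
	const struct demo_ops *ops;
	int vlan_filter_on;
};

/* Example hook implementation that just records the requested state. */
static int demo_enable_vlan_filter(struct demo_handle *h, int enable)
{
	h->vlan_filter_on = enable;
	return 0;
}

/* Call the hook only if the feature changed and the hook exists.
 * Returns 1 if the hook was called, 0 if it was skipped.
 */
static int demo_set_vlan_filter(struct demo_handle *h, int changed, int enable)
{
	if (changed && h->ops->enable_vlan_filter) {
		h->ops->enable_vlan_filter(h, enable);
		return 1;
	}
	return 0;
}
```

A backend that does not implement the hook is then skipped silently instead of triggering a NULL function pointer call.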



[PATCH V5 4/4] selinux: Add SCTP support

2018-01-11 Thread Richard Haines
The SELinux SCTP implementation is explained in:
Documentation/security/SELinux-sctp.rst

Signed-off-by: Richard Haines 
---
V5 Change: Rework selinux_netlbl_socket_connect() and
selinux_netlbl_socket_connect_locked as requested by Paul.

 Documentation/security/SELinux-sctp.rst | 157 ++
 security/selinux/hooks.c| 280 +---
 security/selinux/include/classmap.h |   2 +-
 security/selinux/include/netlabel.h |  21 ++-
 security/selinux/include/objsec.h   |   4 +
 security/selinux/netlabel.c | 133 +--
 6 files changed, 565 insertions(+), 32 deletions(-)
 create mode 100644 Documentation/security/SELinux-sctp.rst

diff --git a/Documentation/security/SELinux-sctp.rst b/Documentation/security/SELinux-sctp.rst
new file mode 100644
index 000..2f66bf3
--- /dev/null
+++ b/Documentation/security/SELinux-sctp.rst
@@ -0,0 +1,157 @@
+SCTP SELinux Support
+====================
+
+Security Hooks
+==============
+
+``Documentation/security/LSM-sctp.rst`` describes the following SCTP security
+hooks with the SELinux specifics expanded below::
+
+    security_sctp_assoc_request()
+    security_sctp_bind_connect()
+    security_sctp_sk_clone()
+    security_inet_conn_established()
+
+
+security_sctp_assoc_request()
+-----------------------------
+Passes the ``@ep`` and ``@chunk->skb`` of the association INIT packet to the
+security module. Returns 0 on success, error on failure.
+::
+
+    @ep - pointer to sctp endpoint structure.
+    @skb - pointer to skbuff of association packet.
+
+The security module performs the following operations:
+ IF this is the first association on ``@ep->base.sk``, then set the peer
+ sid to that in ``@skb``. This will ensure there is only one peer sid
+ assigned to ``@ep->base.sk`` that may support multiple associations.
+
+ ELSE validate the ``@ep->base.sk peer_sid`` against the ``@skb peer sid``
+ to determine whether the association should be allowed or denied.
+
+ Set the sctp ``@ep sid`` to socket's sid (from ``ep->base.sk``) with
+ MLS portion taken from ``@skb peer sid``. This will be used by SCTP
+ TCP style sockets and peeled off connections as they cause a new socket
+ to be generated.
+
+ If IP security options are configured (CIPSO/CALIPSO), then the ip
+ options are set on the socket.
+
+
+security_sctp_bind_connect()
+----------------------------
+Checks permissions required for ipv4/ipv6 addresses based on the ``@optname``
+as follows::
+
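+  ------------------------------------------------------------------
+  |                     BIND Permission Checks                     |
+  |          @optname          |         @address contains         |
+  |----------------------------|-----------------------------------|
+  | SCTP_SOCKOPT_BINDX_ADD     | One or more ipv4 / ipv6 addresses |
+  | SCTP_PRIMARY_ADDR          | Single ipv4 or ipv6 address       |
+  | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6 address       |
+  ------------------------------------------------------------------
+
+  ------------------------------------------------------------------
+  |                   CONNECT Permission Checks                    |
+  |          @optname          |         @address contains         |
+  |----------------------------|-----------------------------------|
+  | SCTP_SOCKOPT_CONNECTX      | One or more ipv4 / ipv6 addresses |
+  | SCTP_PARAM_ADD_IP          | One or more ipv4 / ipv6 addresses |
+  | SCTP_SENDMSG_CONNECT       | Single ipv4 or ipv6 address       |
+  | SCTP_PARAM_SET_PRIMARY     | Single ipv4 or ipv6 address       |
+  ------------------------------------------------------------------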
+
+
+``Documentation/security/LSM-sctp.rst`` gives a summary of the ``@optname``
+entries and also describes ASCONF chunk processing when Dynamic Address
+Reconfiguration is enabled.
+
+
+security_sctp_sk_clone()
+------------------------
+Called whenever a new socket is created by **accept**\(2) (i.e. a TCP style
+socket) or when a socket is 'peeled off', e.g. when userspace calls
+**sctp_peeloff**\(3). ``security_sctp_sk_clone()`` will set the new
+socket's sid and peer sid to that contained in the ``@ep sid`` and
+``@ep peer sid`` respectively.
+::
+
+    @ep - pointer to current sctp endpoint structure.
+    @sk - pointer to current sock structure.
+    @newsk - pointer to new sock structure.
+
+
+security_inet_conn_established()
+--------------------------------
+Called when a COOKIE ACK is received where it sets the connection's peer sid
+to that in ``@skb``::
+
+    @sk  - pointer to sock structure.
+    @skb - pointer to skbuff of the COOKIE ACK packet.
+
+
+Policy Statements
+=================
+The following class and permissions to support SCTP are available within the
+kernel::
+
+    class sctp_socket inherits socket { node_bind }
+
+whenever the following policy capability is enabled::
+
+
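
The IF/ELSE rule described under security_sctp_assoc_request() can be sketched as follows. This is a simplified user-space illustration and not the SELinux implementation: demo_assoc_request(), struct demo_sk_sec and the 'allowed' callback are hypothetical, and the callback stands in for whatever policy check the security module actually performs when validating the peer sid.

```c
#include <stdint.h>

enum demo_assoc_state { DEMO_ASSOC_UNSET = 0, DEMO_ASSOC_SET };

/* Hypothetical per-socket security state for this sketch. */
struct demo_sk_sec {
	enum demo_assoc_state state;
	uint32_t peer_sid;
};

/* Stand-in policy check: allow only an identical peer sid. */
static int demo_same_sid(uint32_t sk_peer, uint32_t skb_peer)
{
	return sk_peer == skb_peer;
}

/* First association: adopt the peer sid from the packet.
 * Later associations: validate the packet's peer sid against the stored one.
 * Returns 0 if the association is allowed, -1 if it is denied.
 */
static int demo_assoc_request(struct demo_sk_sec *sk, uint32_t skb_peer_sid,
			      int (*allowed)(uint32_t sk_peer,
					     uint32_t skb_peer))
{
	if (sk->state == DEMO_ASSOC_UNSET) {
		sk->peer_sid = skb_peer_sid;
		sk->state = DEMO_ASSOC_SET;
		return 0;
	}

	return allowed(sk->peer_sid, skb_peer_sid) ? 0 : -1;
}
```

The point of the first branch is that the socket ends up with exactly one peer sid, however many associations it later carries.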

Re: [patch net-next v7 07/13] net: sched: use block index as a handle instead of qdisc when block is shared

2018-01-11 Thread Jiri Pirko
Wed, Jan 10, 2018 at 07:12:44PM CET, dsah...@gmail.com wrote:
>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>> index 843e29a..9c026d9 100644
>> --- a/include/uapi/linux/rtnetlink.h
>> +++ b/include/uapi/linux/rtnetlink.h
>> @@ -541,9 +541,15 @@ struct tcmsg {
>>  int tcm_ifindex;
>>  __u32   tcm_handle;
>>  __u32   tcm_parent;
>> +/* tcm_block_index is used instead of tcm_parent
>> + * in case tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK
>> + */
>> +#define tcm_block_index tcm_parent
>>  __u32   tcm_info;
>>  };
>>  
>> +#define TCM_IFINDEX_MAGIC_BLOCK (0xU)
>> +
>>  enum {
>>  TCA_UNSPEC,
>>  TCA_KIND,
>
>
>This could be more clearly documented for anyone wanting to write an app
>against the API. Something like:
>
>For shared blocks, tcm_ifindex is set to TCM_IFINDEX_MAGIC_BLOCK, and
>tcm_parent is aliased to tcm_block_index which is the block index.

Okay, will add this comment here.
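
For anyone writing an application against this API, the aliasing could be used roughly as in the sketch below. It is a simplified illustration: struct demo_tcmsg keeps only the fields quoted above and is not the real struct tcmsg, and the value of DEMO_IFINDEX_MAGIC_BLOCK is an assumption chosen for this example, so take the real constant from the uapi header.

```c
#include <stdint.h>

/* Assumed placeholder value for this sketch only. */
#define DEMO_IFINDEX_MAGIC_BLOCK	0xFFFFFFFFU

/* Reduced copy of the quoted fields; not the real struct tcmsg. */
struct demo_tcmsg {
	int		tcm_ifindex;
	uint32_t	tcm_handle;
	uint32_t	tcm_parent;
	uint32_t	tcm_info;
};

/* Same aliasing trick as in the quoted hunk. */
#define demo_block_index tcm_parent

/* Address a shared block instead of a device/qdisc pair. */
static void demo_fill_for_block(struct demo_tcmsg *t, uint32_t block_index)
{
	t->tcm_ifindex = (int)DEMO_IFINDEX_MAGIC_BLOCK;
	t->demo_block_index = block_index;
}

/* Non-zero if the message refers to a shared block. */
static int demo_is_shared_block(const struct demo_tcmsg *t)
{
	return (uint32_t)t->tcm_ifindex == DEMO_IFINDEX_MAGIC_BLOCK;
}
```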


Re: general protection fault in sctp_v6_get_dst

2018-01-11 Thread Xin Long
On Thu, Jan 11, 2018 at 2:15 AM, syzbot
 wrote:
> syzkaller has found reproducer for the following crash on
> 61ad64080e039dce99a7f8d89b729bbea995e2f7
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+7b7b518b1228d2743...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.
>
> device lo entered promiscuous mode
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 3506 Comm: syzkaller968983 Not tainted 4.15.0-rc7+ #181
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__read_once_size include/linux/compiler.h:183 [inline]
> RIP: 0010:sctp_v6_get_dst+0x59e/0x1c60 net/sctp/ipv6.c:271
> RSP: 0018:8801db205e20 EFLAGS: 00010206
> RAX: dc00 RBX:  RCX: 8512e05b
> RDX: 000f RSI: 67cf608c RDI: 8801db22376c
> RBP: 8801db206190 R08: 11003b640b05 R09: 0002
> R10: 8801db205cf0 R11: 8512e008 R12: 8801bf884db0
> R13: 204e R14: 8801bfe3e680 R15: 8801bf884d80
> FS:  7f122e219700() GS:8801db20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 20aaff09 CR3: 0001bfdf0005 CR4: 001606f0
>
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  <IRQ>
>  sctp_transport_route+0xa8/0x430 net/sctp/transport.c:293
>  sctp_assoc_add_peer+0x4fe/0x1190 net/sctp/associola.c:655
>  sctp_process_init+0x119/0x2440 net/sctp/sm_make_chunk.c:2341
>  sctp_sf_do_5_1B_init+0x8c9/0xe80 net/sctp/sm_statefuns.c:414
>  sctp_do_sm+0x192/0x6ed0 net/sctp/sm_sideeffect.c:1178
>  sctp_endpoint_bh_rcv+0x379/0x8f0 net/sctp/endpointola.c:456
>  sctp_inq_push+0x23b/0x300 net/sctp/inqueue.c:95
>  sctp_rcv+0x29f3/0x35c0 net/sctp/input.c:267
>  sctp6_rcv+0x15/0x30 net/sctp/ipv6.c:1006
>  ip6_input_finish+0x37e/0x17a0 net/ipv6/ip6_input.c:284
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip6_input+0xdb/0x560 net/ipv6/ip6_input.c:327
>  dst_input include/net/dst.h:449 [inline]
>  ip6_rcv_finish+0x1a9/0x7a0 net/ipv6/ip6_input.c:71
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ipv6_rcv+0xf37/0x1fa0 net/ipv6/ip6_input.c:208
>  __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4538
>  __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4603
>  process_backlog+0x203/0x740 net/core/dev.c:5283
>  napi_poll net/core/dev.c:5681 [inline]
>  net_rx_action+0x792/0x1910 net/core/dev.c:5747
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1133
>  </IRQ>
>  do_softirq.part.21+0x14d/0x190 kernel/softirq.c:329
>  do_softirq kernel/softirq.c:177 [inline]
>  __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182
>  local_bh_enable include/linux/bottom_half.h:32 [inline]
>  rcu_read_unlock_bh include/linux/rcupdate.h:727 [inline]
>  ip6_finish_output2+0xba0/0x23a0 net/ipv6/ip6_output.c:121
>  ip6_finish_output+0x698/0xaf0 net/ipv6/ip6_output.c:154
>  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
>  ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171
>  dst_output include/net/dst.h:443 [inline]
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip6_xmit+0xd84/0x2090 net/ipv6/ip6_output.c:277
>  sctp_v6_xmit+0x438/0x630 net/sctp/ipv6.c:225
>  sctp_packet_transmit+0x225e/0x3750 net/sctp/output.c:638
>  sctp_outq_flush+0xabb/0x4060 net/sctp/outqueue.c:911
>  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776
>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1807 [inline]
>  sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline]
>  sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181
>  sctp_primitive_ASSOCIATE+0x9d/0xd0 net/sctp/primitive.c:88
>  sctp_sendmsg+0x1d2e/0x33f0 net/sctp/socket.c:2018
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:628 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:638
>  SYSC_sendto+0x361/0x5c0 net/socket.c:1719
>  SyS_sendto+0x40/0x50 net/socket.c:1687
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x4456c9
> RSP: 002b:7f122e218d98 EFLAGS: 0212 ORIG_RAX: 002c
> RAX: ffda RBX: 006dac3c RCX: 004456c9
> RDX: 0001 RSI: 20aaff09 RDI: 0007
> RBP:  R08: 20abf000 R09: 001c
> R10: 

pull request: bluetooth-next 2018-01-11

2018-01-11 Thread Johan Hedberg
Hi Dave,

Here's likely the last bluetooth-next pull request for the 4.16 kernel.

 - Added support for Bluetooth on 2015+ MacBook (Pro)
 - Fix to QCA Rome suspend/resume handling
 - Two new QCA_ROME USB IDs in btusb
 - A few other minor fixes

Please let me know if there are any issues pulling. Thanks.

Johan

---
The following changes since commit 18feb87105c3c16dc01e6981a6aafb175679b997:

  enic: add wq clean up budget (2017-12-26 13:10:07 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream

for you to fetch changes up to ff8759609d021c0e85945fcc4a148a0e55ace70f:

  Bluetooth: btbcm: Fix sleep mode struct ordering (2018-01-10 19:00:14 +0100)


AceLan Kao (1):
  Bluetooth: btusb: Add support for 0cf3:e010

Arnd Bergmann (1):
  Bluetooth: hciuart: add nvmem dependency

Colin Ian King (2):
  Bluetooth: bpa10x: make array 'req' static, shrinks object size
  Bluetooth: btintel: make array 'param' static, shrinks object size

Hans de Goede (1):
  Bluetooth: btusb: Restore QCA Rome suspend/resume fix with a "rewritten" version

Ioan Moldovan (1):
  Bluetooth: Add a new 04ca:3015 QCA_ROME device

Kai-Heng Feng (1):
  Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

Lukas Wunner (15):
  Bluetooth: Avoid WARN splat due to missing GPIOLIB
  Bluetooth: hci_bcm: Streamline runtime PM code
  Bluetooth: Depend on rather than select GPIOLIB
  Bluetooth: hci_bcm: Mandate presence of shutdown and device wake GPIO
  Bluetooth: hci_bcm: Clean up unnecessary #ifdef
  Bluetooth: hci_bcm: Fix race on close
  Bluetooth: hci_bcm: Fix unbalanced pm_runtime_disable()
  Bluetooth: hci_bcm: Invalidate IRQ on request failure
  Bluetooth: hci_bcm: Document struct bcm_device
  Bluetooth: hci_bcm: Add callbacks to toggle GPIOs
  Bluetooth: hci_bcm: Handle errors properly
  Bluetooth: hci_bcm: Support Apple GPIO handling
  Bluetooth: hci_bcm: Silence IRQ printk
  Bluetooth: hci_bcm: Sleep instead of spinning
  Bluetooth: btbcm: Fix sleep mode struct ordering

Ronald Tschalär (1):
  Bluetooth: hci_bcm: Validate IRQ before using it

 drivers/bluetooth/Kconfig   |   4 +
 drivers/bluetooth/bpa10x.c  |   2 +-
 drivers/bluetooth/btbcm.h   |   2 +-
 drivers/bluetooth/btintel.c |   2 +-
 drivers/bluetooth/btusb.c   |  22 ++--
 drivers/bluetooth/hci_bcm.c | 239 +++-
 6 files changed, 207 insertions(+), 64 deletions(-)




Re: [patch net-next v7 03/13] net: sched: avoid usage of tp->q in tcf_classify

2018-01-11 Thread Jiri Pirko
Wed, Jan 10, 2018 at 05:17:28PM CET, dsah...@gmail.com wrote:
>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Use block index in the messages instead.
>> 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  net/sched/cls_api.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 9b45950..31e91dc 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -672,8 +672,9 @@ int tcf_classify(struct sk_buff *skb, const struct 
>> tcf_proto *tp,
>>  #ifdef CONFIG_NET_CLS_ACT
>>  reset:
>>  if (unlikely(limit++ >= max_reclassify_loop)) {
>> -net_notice_ratelimited("%s: reclassify loop, rule prio %u, 
>> protocol %02x\n",
>> -   tp->q->ops->id, tp->prio & 0x,
>> +net_notice_ratelimited("%u: reclassify loop, rule prio %u, 
>> protocol %02x\n",
>
>if you are dumping index instead of prio shouldn't the 'rule prio' above
>be adjusted?

I'm not! Why do you think so?

"%u:" is tp->chain->block->index
"prio %u" is tp->prio & 0x
"%02x" is ntohs(tp->protocol)


>
>
>> +   tp->chain->block->index,
>> +   tp->prio & 0x,
>> ntohs(tp->protocol));
>>  return TC_ACT_SHOT;
>>  }
>> 
>


[PATCH 2/2] xen-netfront: Fix race between device setup and open

2018-01-11 Thread Ross Lagerwall
When a netfront device is set up it registers a netdev fairly early on,
before it has set up the queues and is actually usable. A userspace tool
like NetworkManager will immediately try to open it and access its state
as soon as it appears. The bug can be reproduced by hotplugging VIFs
until the VM runs out of grant refs. It registers the netdev but fails
to set up any queues (since there are no more grant refs). In the
meantime, NetworkManager opens the device and the kernel crashes trying
to access the queues (of which there are none).

Fix this in two ways:
* For initial setup, register the netdev much later, after the queues
are set up. This avoids the race entirely.
* During a suspend/resume cycle, the frontend reconnects to the backend
and the queues are recreated. It is possible (though highly unlikely) to
race with something opening the device and accessing the queues after
they have been destroyed but before they have been recreated. Extend the
region covered by the rtnl semaphore to protect against this race. There
is a possibility that we fail to recreate the queues so check for this
in the open function.
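
The check added to the open path can be illustrated in isolation. The sketch below is a user-space model with hypothetical names (struct demo_netdev, struct demo_queue, demo_open()); it omits locking entirely, whereas in the driver the open callback runs with the rtnl lock held and the patch widens the locked region around queue destruction and re-creation.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the netfront queue and device state. */
struct demo_queue {
	int napi_enabled;
};

struct demo_netdev {
	struct demo_queue *queues;	/* NULL while the queues do not exist */
	unsigned int num_queues;
};

/* Refuse to open a device whose queues were never (re)created. */
static int demo_open(struct demo_netdev *np)
{
	unsigned int i;

	if (!np->queues)
		return -ENODEV;

	for (i = 0; i < np->num_queues; ++i)
		np->queues[i].napi_enabled = 1;

	return 0;
}
```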

Signed-off-by: Ross Lagerwall 
---
 drivers/net/xen-netfront.c | 46 --
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 9bd7dde..8328d39 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -351,6 +351,9 @@ static int xennet_open(struct net_device *dev)
unsigned int i = 0;
struct netfront_queue *queue = NULL;
 
+   if (!np->queues)
+   return -ENODEV;
+
for (i = 0; i < num_queues; ++i) {
queue = &np->queues[i];
napi_enable(&queue->napi);
@@ -1358,18 +1361,8 @@ static int netfront_probe(struct xenbus_device *dev,
 #ifdef CONFIG_SYSFS
info->netdev->sysfs_groups[0] = &xennet_dev_group;
 #endif
-   err = register_netdev(info->netdev);
-   if (err) {
-   pr_warn("%s: register_netdev err=%d\n", __func__, err);
-   goto fail;
-   }
 
return 0;
-
- fail:
-   xennet_free_netdev(netdev);
-   dev_set_drvdata(&dev->dev, NULL);
-   return err;
 }
 
 static void xennet_end_access(int ref, void *page)
@@ -1737,8 +1730,6 @@ static void xennet_destroy_queues(struct netfront_info *info)
 {
unsigned int i;
 
-   rtnl_lock();
-
for (i = 0; i < info->netdev->real_num_tx_queues; i++) {
struct netfront_queue *queue = &info->queues[i];
 
@@ -1747,8 +1738,6 @@ static void xennet_destroy_queues(struct netfront_info *info)
netif_napi_del(&queue->napi);
}
 
-   rtnl_unlock();
-
kfree(info->queues);
info->queues = NULL;
 }
@@ -1764,8 +1753,6 @@ static int xennet_create_queues(struct netfront_info *info,
if (!info->queues)
return -ENOMEM;
 
-   rtnl_lock();
-
for (i = 0; i < *num_queues; i++) {
struct netfront_queue *queue = &info->queues[i];
 
@@ -1774,7 +1761,7 @@ static int xennet_create_queues(struct netfront_info *info,
 
ret = xennet_init_queue(queue);
if (ret < 0) {
-   dev_warn(&info->netdev->dev,
+   dev_warn(&info->xbdev->dev,
 "only created %d queues\n", i);
*num_queues = i;
break;
@@ -1788,10 +1775,8 @@ static int xennet_create_queues(struct netfront_info *info,
 
netif_set_real_num_tx_queues(info->netdev, *num_queues);
 
-   rtnl_unlock();
-
if (*num_queues == 0) {
-   dev_err(&info->netdev->dev, "no queues\n");
+   dev_err(&info->xbdev->dev, "no queues\n");
return -EINVAL;
}
return 0;
@@ -1828,6 +1813,7 @@ static int talk_to_netback(struct xenbus_device *dev,
goto out;
}
 
+   rtnl_lock();
if (info->queues)
xennet_destroy_queues(info);
 
@@ -1838,6 +1824,7 @@ static int talk_to_netback(struct xenbus_device *dev,
info->queues = NULL;
goto out;
}
+   rtnl_unlock();
 
/* Create shared ring, alloc event channel -- for each queue */
for (i = 0; i < num_queues; ++i) {
@@ -1934,8 +1921,10 @@ static int talk_to_netback(struct xenbus_device *dev,
xenbus_transaction_end(xbt, 1);
  destroy_ring:
xennet_disconnect_backend(info);
+   rtnl_lock();
xennet_destroy_queues(info);
  out:
+   rtnl_unlock();
device_unregister(&dev->dev);
return err;
 }
@@ -1965,6 +1954,15 @@ static int xennet_connect(struct net_device *dev)
netdev_update_features(dev);
rtnl_unlock();
 
+   if (dev->reg_state == NETREG_UNINITIALIZED) {
+   err = register_netdev(dev);
+   if (err) {
+   pr_warn("%s: register_netdev err=%d\n", __func__, 

RE: [PATCH net-next v2] xfrm: Add ESN support for IPSec HW offload

2018-01-11 Thread Yossi Kuperman
> From: Shannon Nelson [mailto:shannon.nel...@oracle.com]
> Sent: Thursday, January 11, 2018 5:21 AM
> 
> On 1/10/2018 3:09 PM, Yossi Kuperman wrote:
> >> On 10 Jan 2018, at 19:36, Shannon Nelson  wrote:
> >>
> >>> On 1/10/2018 2:34 AM, yoss...@mellanox.com wrote:
> >>> From: Yossef Efraim 
> >>> This patch adds ESN support to IPsec device offload.
> >>> Adding new xfrm device operation to synchronize device ESN.
> >>> Signed-off-by: Yossef Efraim 
> >>> ---
> >>> Changes from v1:
> >>>   - Added documentation
> >>> ---
> >>>   Documentation/networking/xfrm_device.txt |  3 +++
> >>>   include/linux/netdevice.h|  1 +
> >>>   include/net/xfrm.h   | 12 
> >>>   net/xfrm/xfrm_device.c   |  4 ++--
> >>>   net/xfrm/xfrm_replay.c   |  2 ++
> >>>   5 files changed, 20 insertions(+), 2 deletions(-)
> 
> [...]
> 
> >>> diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> >>> index 7598250..704a055 100644
> >>> --- a/net/xfrm/xfrm_device.c
> >>> +++ b/net/xfrm/xfrm_device.c
> >>> @@ -147,8 +147,8 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
> >>>   if (!x->type_offload)
> >>>   return -EINVAL;
> >>>   -/* We don't yet support UDP encapsulation, TFC padding and ESN. */
> >>> -if (x->encap || x->tfcpad || (x->props.flags & XFRM_STATE_ESN))
> >>> +/* We don't yet support UDP encapsulation and TFC padding. */
> >>> +if (x->encap || x->tfcpad)
> >>
> >> As I mentioned before, this will cause issues when working with hardware
> >> that has no ESN support, such as Intel's x540: the stack will expect the
> >> driver to do ESN, and nothing actually happens but a rollover of the
> >> numbers.  Sure, the driver could look for the ESN attribute and fail the
> >> add, but that's a mode where we have to update every driver to fend off
> >> problems every time we add a new feature.  Much better is to only update
> >> drivers that actively support the new feature.
> >>
> >
> > You are right.
> >
> > I’m not sure why this check is here in the first place. IMO it should take 
> > place in xdo_dev_state_add—a driver-specific callback.
> >
> 
> If you say I'm right, then why do you say it should take place in the
> driver callback?  I just wrote that it should *not*.
> 

Sorry, I wasn't clear; you are right that this change will break Intel's
x540 driver.

However, I do think that this is the purpose of xdo_dev_state_add(). Again, as
far as I understand, and please correct me if I'm wrong, this check shouldn't
be here in the first place.

Please have a look at mlx5e_xfrm_validate_state(). Currently, it returns an
error if the user requests ESN, regardless of the underlying device's
capabilities. A subsequent patch to the mlx5 driver will allow such a request
if the device supports it, maintaining backward compatibility.

Here is a code snippet:

-   if (x->props.flags & XFRM_STATE_ESN) {
+   if (x->props.flags & XFRM_STATE_ESN &&
+   !(mlx5_accel_ipsec_device_caps(priv->mdev) & MLX5_ACCEL_IPSEC_ESN)) {
netdev_info(netdev, "Cannot offload ESN xfrm states\n");
return -EINVAL;
}
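
The check in the snippet above boils down to "reject the state only when ESN
is requested and the device lacks the capability". A minimal userspace sketch
of that pattern follows. All names here (the sketch_* types, SKETCH_STATE_ESN,
SKETCH_CAP_ESN) are hypothetical stand-ins invented for illustration; they are
not the xfrm or mlx5 definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values, chosen for this sketch only. */
#define SKETCH_STATE_ESN (1u << 0) /* the SA asks for extended sequence numbers */
#define SKETCH_CAP_ESN   (1u << 0) /* the device is able to offload ESN */

struct sketch_state {
	uint32_t flags;	/* what the user requested */
};

struct sketch_device {
	uint32_t caps;	/* what the hardware advertises */
};

/*
 * Return 0 if the state may be offloaded to this device, -EINVAL otherwise.
 * A request without ESN always passes; a request with ESN passes only if the
 * device advertises the capability.
 */
static int sketch_validate_state(const struct sketch_device *dev,
				 const struct sketch_state *x)
{
	assert(dev && x);

	if ((x->flags & SKETCH_STATE_ESN) && !(dev->caps & SKETCH_CAP_ESN))
		return -EINVAL;

	return 0;
}
```

With this shape, a device that does not set the capability bit keeps failing
ESN requests exactly as before, which is the backward compatibility argument
made above.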

> This code seems to be assuming that all drivers/NICs with the offload
> will be able to do ESN, and this is not the case.  If this code is put
> into place, suddenly the ixgbe driver's offload will have a failure
> case: the driver doesn't support ESN, and doesn't know to NAK the
> state_add if the ESN bit is on.  This is a generic capabilities issue
> for which we already have a solution "pattern".
> 

We weren't assuming that; please see above.

>  > What do you suggest?
>  >
> 
> There should be a capabilities/feature flag for the driver to set and
> the XFRM code shouldn't try the state_add with ESN if the driver hasn't
> set an ESN bit in its capabilities.  Other capabilities that might make
> sense here are IPv6, TSO, and CSUM; there may be others.
> 
> >> Look at how feature bits are added to netdev->features to signify what the 
> >> driver can do.  I think that's a much better approach.
> >>
> >
> > It looks like an overkill?
> 
> Alternatively, just solve this by failing to add the SA that has ESN set
> if the driver hasn't defined your new xdo_dev_state_advance_esn().
> 
> sln
> 
> 
> >
> >> sln
> >>
> >>
> >>>   return -EINVAL;
> >>> dev = dev_get_by_index(net, xuo->ifindex);
> >>> diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
> >>> index 0250181..1d38c6a 100644
> >>> --- a/net/xfrm/xfrm_replay.c
> >>> +++ b/net/xfrm/xfrm_replay.c
> >>> @@ -551,6 +551,8 @@ static void xfrm_replay_advance_esn(struct xfrm_state *x, __be32 net_seq)
> >>>   bitnr = replay_esn->replay_window - (diff - pos);
> >>>   }
> >>>   +xfrm_dev_state_advance_esn(x);
> >>> +
> >>>   nr = bitnr >> 5;
> >>>   bitnr = bitnr & 0x1F;
> >>>   

Re: [PATCH 03/32] fs: introduce new ->get_poll_head and ->poll_mask methods

2018-01-11 Thread Christoph Hellwig
On Thu, Jan 11, 2018 at 05:22:00AM +, Al Viro wrote:
> Whee...  The very first ->poll() instance in alphabetic order on pathnames:
> in arch/cris/arch-v10/drivers/gpio.c
> 
> static __poll_t gpio_poll(struct file *file, poll_table *wait)
> {
> __poll_t mask = 0;
> struct gpio_private *priv = file->private_data;
> unsigned long data;
> unsigned long flags;
> 
> spin_lock_irqsave(&gpio_lock, flags);
> 
> poll_wait(file, &priv->alarm_wq, wait);
> 
> IOW, we are doing poll_wait() (== possible GFP_KERNEL __get_free_page()) under
> a spinlock...

Yes.  Another good reason to separate poll_wait and the actual
event check callback.


Re: [patch net-next v7 08/13] net: sched: add rt netlink message type for block get

2018-01-11 Thread Jiri Pirko
Wed, Jan 10, 2018 at 05:48:09PM CET, dsah...@gmail.com wrote:
>On 1/9/18 7:07 AM, Jiri Pirko wrote:
>> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>> index 9c026d9..038cde7 100644
>> --- a/include/uapi/linux/rtnetlink.h
>> +++ b/include/uapi/linux/rtnetlink.h
>> @@ -150,6 +150,12 @@ enum {
>>  RTM_NEWCACHEREPORT = 96,
>>  #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
>>  
>> +RTM_NEWBLOCK = 100,
>> +#define RTM_NEWBLOCK RTM_NEWBLOCK
>> +RTM_DELBLOCK,
>> +#define RTM_DELBLOCK RTM_DELBLOCK
>> +RTM_GETBLOCK,
>> +#define RTM_GETBLOCK RTM_GETBLOCK
>>  __RTM_MAX,
>>  #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
>>  };
>
>Seems like this is creating an inconsistency. RTM_GETBLOCK is used to
>dump the set of shared blocks, but RTM_NEWBLOCK / RTM_DELBLOCK are not
>used to create / delete one.

Why is it a problem? RTM_NEWBLOCK is used as a reply for RTM_GETBLOCK.
I plan to have block notifications as a follow-up, there the RTM_GETBLOCK
and RTM_DELBLOCK will be used. The fact that the user cannot create and
delete blocks explicitly is not a problem in my opinion. Block creation
and deletion are driven by the usage of qdiscs.
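
As a side note on the numbering in the quoted hunk, here is a small sketch of
the arithmetic. It mirrors the enum values and the RTM_MAX formula from the
hunk under SKETCH_ names (the names and the helper are made up for this
example). The assumption, to be checked against the rtnetlink code, is that the
two low bits of a message type select new/del/get/set, which is why a group
starts on a multiple of four.

```c
#include <assert.h>

/* Mirrors the values in the quoted hunk; the SKETCH_ names are hypothetical. */
enum {
	SKETCH_NEWCACHEREPORT = 96,
	SKETCH_NEWBLOCK = 100,
	SKETCH_DELBLOCK,	/* 101 */
	SKETCH_GETBLOCK,	/* 102 */
	SKETCH_MAX_PLUS_ONE,	/* 103, plays the role of __RTM_MAX */
};

/* Same formula as RTM_MAX: round up to a multiple of 4, then subtract 1. */
#define SKETCH_MAX (((SKETCH_MAX_PLUS_ONE + 3) & ~3) - 1)

/* Assumed convention: 0 = new, 1 = del, 2 = get, 3 = set. */
static int sketch_msg_kind(int type)
{
	assert(type >= 0);
	return type & 3;
}
```

Here (103 + 3) & ~3 is 104, so SKETCH_MAX is 103: the fourth ("set") slot of
the block group is reserved even though no message is defined for it.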


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jamal Hadi Salim

On 18-01-11 09:41 AM, Jiri Pirko wrote:

> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:
> > On 18-01-11 09:24 AM, Jiri Pirko wrote:
> > > Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
> > > > On 18-01-09 09:07 AM, Jiri Pirko wrote:
> > > > > From: Jiri Pirko 
> > > > >
> > > > > Benefit from the previously introduced shared filter blocks
> > > > > infrastructure and allow ingress and clsact qdisc instances to share
> > > > > filter blocks. The block index is coming from userspace as qdisc option.
> > > >
> > > > Didnt quiet follow why ingress is special and needs attributes to
> > > > set the block but other qdiscs didnt.
> > >
> > > Jamal, again, other qdiscs does not support block sharing. This patchset
> > > only adds support for sharing of block for ingress and clsact qdiscs.
> > > Later on, other qdiscs could also support block sharing.
> >
> > Can you stop a config which says:
> > tc qdisc add dev ens9 root block 22 handle 1:0 prio ?
>
> Please see the iproute2 patches. Parsing of "block" command line option
> is done inside q_ingress.c

I only looked at the kernel code. Good that you can stop it in tc,
but the API does not stop it (unless you expect the rest of the
world to only use tc).
Really, there is no reason for this API to be exposed only via ingress
qdisc attributes. You can add a check in the cls API to reject any parent
that is not clsact or ingress (depending on tc doesn't sound right).

cheers,
jamal


[PATCH] usbnet: silence an unnecessary warning

2018-01-11 Thread Oliver Neukum
That a kevent could not be scheduled is not an error.
Such handlers must be able to deal with multiple events anyway.
As the successful scheduling of a work is a debug event, make
the failure debug priority, too.

Signed-off-by: Oliver Neukum 
Reported-by: Cristian Caravena 
---
 drivers/net/usb/usbnet.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index d56fe32bf48d..1e0bbe23f95c 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -458,8 +458,7 @@ void usbnet_defer_kevent (struct usbnet *dev, int work)
 {
set_bit (work, &dev->flags);
if (!schedule_work (&dev->kevent)) {
-   if (net_ratelimit())
-   netdev_err(dev->net, "kevent %d may have been dropped\n", work);
+   netdev_dbg(dev->net, "kevent %d may have been dropped\n", work);
} else {
netdev_dbg(dev->net, "kevent %d scheduled\n", work);
}
-- 
2.13.6



Re: [PATCH] usbnet: silence an unnecessary warning

2018-01-11 Thread Bjørn Mork
Oliver Neukum  writes:

> That a kevent could not be scheduled is not an error.
> Such handlers must be able to deal with multiple events anyway.
> As the successful scheduling of a work is a debug event, make
> the failure debug priority, too.
>
> Signed-off-by: Oliver Neukum 
> Reported-by: Cristian Caravena 
> ---
>  drivers/net/usb/usbnet.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index d56fe32bf48d..1e0bbe23f95c 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -458,8 +458,7 @@ void usbnet_defer_kevent (struct usbnet *dev, int work)
>  {
>   set_bit (work, &dev->flags);
>   if (!schedule_work (&dev->kevent)) {
> - if (net_ratelimit())
> - netdev_err(dev->net, "kevent %d may have been dropped\n", work);
> + netdev_dbg(dev->net, "kevent %d may have been dropped\n", work);
>   } else {
>   netdev_dbg(dev->net, "kevent %d scheduled\n", work);
>   }

Great!  But why do you drop the ratelimit?  This can be very noisy when
it hits.  I'd like to keep it ratelimited.

But if you do decide to drop the limit, then you'll have to clean up the
braces...
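
To illustrate what "ratelimited" means here, a minimal userspace sketch of an
interval/burst limiter follows. It is not the kernel's net_ratelimit()
implementation; the struct, the function name and the explicit "now" argument
are assumptions made so that the example is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter state: at most 'burst' messages per 'interval' ticks. */
struct sketch_ratelimit {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages let through in the current window */
	int missed;	/* messages suppressed so far */
};

/* Return true if the caller may log now, false if the message is dropped. */
static bool sketch_ratelimit_ok(struct sketch_ratelimit *rs, long now)
{
	assert(rs);

	/* Start a fresh window once the old one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

A caller would wrap the log statement in "if (sketch_ratelimit_ok(&rs, now))",
so a burst of dropped kevents produces a bounded number of messages per window.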



Bjørn


Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams
On Thu, Jan 11, 2018 at 1:54 AM, Jiri Kosina  wrote:
> On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
>
>> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
>> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina  wrote:
>> > > On Fri, 5 Jan 2018, Dan Williams wrote:
>> > >
>> > > [ ... snip ... ]
>> > >> Andi Kleen (1):
>> > >>   x86, barrier: stop speculation for failed access_ok
>> > >>
>> > >> Dan Williams (13):
>> > >>   x86: implement nospec_barrier()
>> > >>   [media] uvcvideo: prevent bounds-check bypass via speculative execution
>> > >>   carl9170: prevent bounds-check bypass via speculative execution
>> > >>   p54: prevent bounds-check bypass via speculative execution
>> > >>   qla2xxx: prevent bounds-check bypass via speculative execution
>> > >>   cw1200: prevent bounds-check bypass via speculative execution
>> > >>   Thermal/int340x: prevent bounds-check bypass via speculative execution
>> > >>   ipv6: prevent bounds-check bypass via speculative execution
>> > >>   ipv4: prevent bounds-check bypass via speculative execution
>> > >>   vfs, fdtable: prevent bounds-check bypass via speculative execution
>> > >>   net: mpls: prevent bounds-check bypass via speculative execution
>> > >>   udf: prevent bounds-check bypass via speculative execution
>> > >>   userns: prevent bounds-check bypass via speculative execution
>> > >>
>> > >> Mark Rutland (4):
>> > >>   asm-generic/barrier: add generic nospec helpers
>> > >>   Documentation: document nospec helpers
>> > >>   arm64: implement nospec_ptr()
>> > >>   arm: implement nospec_ptr()
>> > >
>> > > So considering the recent publication of [1], how come we all of a sudden
>> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
>> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>> > >
>> > > Is this going to be handled in eBPF in some other way?
>> > >
>> > > Without that in place, and considering Jann Horn's paper, it would seem
>> > > like PTI doesn't really lock it down fully, right?
>> >
>> > Here is the latest (v3) bpf fix:
>> >
>> > https://patchwork.ozlabs.org/patch/856645/
>> >
>> > I currently have v2 on my 'nospec' branch and will move that to v3 for
>> > the next update, unless it goes upstream before then.
>
> Daniel, I guess you're planning to send this still for 4.15?

It's pending in the bpf.git tree:


https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc9

>> That patch seems specific to CONFIG_BPF_SYSCALL.  Is the bpf() syscall
>> the only attack vector?  Or are there other ways to run bpf programs
>> that we should be worried about?
>
> Seems like Alexei is probably the only person in the whole universe who
> isn't CCed here ... let's fix that.

He will be cc'd on v2 of this series which will be available later today.


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jamal Hadi Salim

On 18-01-11 09:24 AM, Jiri Pirko wrote:

> Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
> > On 18-01-09 09:07 AM, Jiri Pirko wrote:
> > > From: Jiri Pirko 
> > >
> > > Benefit from the previously introduced shared filter blocks
> > > infrastructure and allow ingress and clsact qdisc instances to share
> > > filter blocks. The block index is coming from userspace as qdisc option.
> >
> > Didnt quiet follow why ingress is special and needs attributes to
> > set the block but other qdiscs didnt.
>
> Jamal, again, other qdiscs does not support block sharing. This patchset
> only adds support for sharing of block for ingress and clsact qdiscs.
> Later on, other qdiscs could also support block sharing.



Can you stop a config which says:
tc qdisc add dev ens9 root block 22 handle 1:0 prio ?

cheers,
jamal


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote:
>On 18-01-11 09:41 AM, Jiri Pirko wrote:
>> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:
>> > On 18-01-11 09:24 AM, Jiri Pirko wrote:
>> > > Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
>> > > > On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> > > > > From: Jiri Pirko 
>> > > > > 
>> > > > > Benefit from the previously introduced shared filter blocks
>> > > > > infrastructure and allow ingress and clsact qdisc instances to share
>> > > > > filter blocks. The block index is coming from userspace as qdisc 
>> > > > > option.
>> > > > 
>> > > > Didnt quiet follow why ingress is special and needs attributes to
>> > > > set the block but other qdiscs didnt.
>> > > 
>> > > Jamal, again, other qdiscs does not support block sharing. This patchset
>> > > only adds support for sharing of block for ingress and clsact qdiscs.
>> > > Later on, other qdiscs could also support block sharing.
>> > > 
>> > 
>> > Can you stop a config which says:
>> > tc qdisc add dev ens9 root block 22 handle 1:0 prio ?
>> 
>> Please see the iproute2 patches. Parsing of "block" command line option
>> is done inside q_ingress.c
>> 
>
>I only looked at the kernel code. Good you can stop it at tc
>but the API does not stop it (unless you expect the rest of the
>world to only use tc).

Jamal, apparently, you did not look at the kernel code either :)
Look at the changes done in net/sched/sch_ingress.c - that is where the
parsing of the block attr takes place.


>Really - there is no reason for this API to be only via ingress qdisc
>attributes. You can add a check in cls api to reject any parent that is
>not either of the clsacts + ingress (depending on tc doesnt sound
>right).

I was originally thinking of going in this direction: having another
generic attr called TCA_BLOCK or something that would be used when the
qdisc is created. For ingress, that would work. But for clsact, you need
to be able to specify 2 blocks during qdisc creation: one for ingress, one
for egress. That's when I realized this has to be a per-qdisc-type attr.


Re: [PATCH 2/2] xen-netfront: Fix race between device setup and open

2018-01-11 Thread David Miller
From: Ross Lagerwall 
Date: Thu, 11 Jan 2018 09:36:38 +

> When a netfront device is set up it registers a netdev fairly early on,
> before it has set up the queues and is actually usable. A userspace tool
> like NetworkManager will immediately try to open it and access its state
> as soon as it appears. The bug can be reproduced by hotplugging VIFs
> until the VM runs out of grant refs. It registers the netdev but fails
> to set up any queues (since there are no more grant refs). In the
> meantime, NetworkManager opens the device and the kernel crashes trying
> to access the queues (of which there are none).
> 
> Fix this in two ways:
> * For initial setup, register the netdev much later, after the queues
> are setup. This avoids the race entirely.
> * During a suspend/resume cycle, the frontend reconnects to the backend
> and the queues are recreated. It is possible (though highly unlikely) to
> race with something opening the device and accessing the queues after
> they have been destroyed but before they have been recreated. Extend the
> region covered by the rtnl semaphore to protect against this race. There
> is a possibility that we fail to recreate the queues so check for this
> in the open function.
> 
> Signed-off-by: Ross Lagerwall 

Where is patch 1/2 and the 0/2 header posting which explains what this
patch series is doing, how it is doing it, and why it is doing it that
way?

Thanks.


Re: [PATCH 1/2] xen/grant-table: Use put_page instead of free_page

2018-01-11 Thread Ross Lagerwall

+CC netdev

On 01/11/2018 09:36 AM, Ross Lagerwall wrote:

> The page given to gnttab_end_foreign_access() to free could be a
> compound page so use put_page() instead of free_page() since it can
> handle both compound and single pages correctly.
>
> This bug was discovered when migrating a Xen VM with several VIFs and
> CONFIG_DEBUG_VM enabled. It hits a BUG usually after fewer than 10
> iterations. All netfront devices disconnect from the backend during a
> suspend/resume and this will call gnttab_end_foreign_access() if a
> netfront queue has an outstanding skb. The mismatch between calling
> get_page() and free_page() on a compound page causes a reference
> counting error which is detected when DEBUG_VM is enabled.
>
> Signed-off-by: Ross Lagerwall 
> ---
>   drivers/xen/grant-table.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index f45114f..27be107 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -382,7 +382,7 @@ static void gnttab_handle_deferred(struct timer_list *unused)
> if (entry->page) {
> pr_debug("freeing g.e. %#x (pfn %#lx)\n",
>  entry->ref, page_to_pfn(entry->page));
> -   __free_page(entry->page);
> +   put_page(entry->page);
> } else
> pr_info("freeing g.e. %#x\n", entry->ref);
> kfree(entry);
> @@ -438,7 +438,7 @@ void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
> if (gnttab_end_foreign_access_ref(ref, readonly)) {
> put_free_entry(ref);
> if (page != 0)
> -   free_page(page);
> +   put_page(virt_to_page(page));
> } else
> gnttab_add_deferred(ref, readonly,
> page ? virt_to_page(page) : NULL);



Re: [PATCH 2/2] xen-netfront: Fix race between device setup and open

2018-01-11 Thread Ross Lagerwall

On 01/11/2018 03:26 PM, David Miller wrote:

> From: Ross Lagerwall 
> Date: Thu, 11 Jan 2018 09:36:38 +
>
> > When a netfront device is set up it registers a netdev fairly early on,
> > before it has set up the queues and is actually usable. A userspace tool
> > like NetworkManager will immediately try to open it and access its state
> > as soon as it appears. The bug can be reproduced by hotplugging VIFs
> > until the VM runs out of grant refs. It registers the netdev but fails
> > to set up any queues (since there are no more grant refs). In the
> > meantime, NetworkManager opens the device and the kernel crashes trying
> > to access the queues (of which there are none).
> >
> > Fix this in two ways:
> > * For initial setup, register the netdev much later, after the queues
> > are setup. This avoids the race entirely.
> > * During a suspend/resume cycle, the frontend reconnects to the backend
> > and the queues are recreated. It is possible (though highly unlikely) to
> > race with something opening the device and accessing the queues after
> > they have been destroyed but before they have been recreated. Extend the
> > region covered by the rtnl semaphore to protect against this race. There
> > is a possibility that we fail to recreate the queues so check for this
> > in the open function.
> >
> > Signed-off-by: Ross Lagerwall 
>
> Where is patch 1/2 and the 0/2 header posting which explains what this
> patch series is doing, how it is doing it, and why it is doing it that
> way?



I've now CC'd netdev on the other two.

Cheers,
--
Ross Lagerwall


Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout

2018-01-11 Thread Geert Uytterhoeven
On Thu, Jan 11, 2018 at 4:53 PM, Russell King - ARM Linux
 wrote:
> On Thu, Jan 11, 2018 at 10:48:35AM -0500, David Miller wrote:
>> From: Geert Uytterhoeven 
>> Date: Tue,  9 Jan 2018 12:11:21 +0100
>>
>> > In case of success, the return values of (__)phy_write() and
>> > (__)phy_modify() are not compatible: (__)phy_write() returns 0, while
>> > (__)phy_modify() returns the old PHY register value.
>> >
>> > Apparently this change was catered for in drivers/net/phy/marvell.c, but
>> > not in other source files.
>> >
>> > Hence genphy_restart_aneg() now returns 4416 instead zero, which is
>> > considered an error:
>> >
>> > ravb e680.ethernet eth0: failed to connect PHY
>> > IP-Config: Failed to open eth0
>> > IP-Config: No network devices available
>> >
>> > Fix this by converting positive values to zero in all callers of
>> > phy_modify().
>> >
>> > Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to 
>> > phy_modify()")
>> > Signed-off-by: Geert Uytterhoeven 
>> > ---
>> > Alternatively, __phy_modify() could be changed to follow __phy_write()
>> > semantics?
>>
>> I really want a resolution to this quickly, this broke lots of stuff
>> for people.
>>
>> __phy_modify() wants to return multiple values, so it should be coded
>> up to do so explicitly rather than trying to encode two values from
>> overlapping value spaces in one return value.
>>
>> That means the original value should be returned by-reference.  And
>> this will make the error/no-error return value unambiguous.
>>
>> int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set,
>>u16 *orig_val);
>
> I'm sorry I have no time to work on this right now due to the meltdown
> and spectre stuff that hit last week.  If you need to do something,
> please revert both the mvneta series and the series containing this
> patch.

I'll have a look into it...
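
For reference, the by-reference shape suggested above could look roughly like
the sketch below. It models a PHY as a plain register array in userspace; the
sketch_phy type and sketch_phy_modify() are hypothetical and only follow the
proposed prototype, they are not the phylib implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a PHY: a tiny register file. */
#define SKETCH_NUM_REGS 32

struct sketch_phy {
	uint16_t regs[SKETCH_NUM_REGS];
};

/*
 * Clear the bits in 'mask', set the bits in 'set', and report the previous
 * register value through 'orig_val' (which may be NULL if the caller does
 * not care). The return value carries only 0 or a negative errno, so a
 * register value such as 4416 can no longer be mistaken for an error.
 */
static int sketch_phy_modify(struct sketch_phy *phy, uint32_t regnum,
			     uint16_t mask, uint16_t set, uint16_t *orig_val)
{
	uint16_t old;

	assert(phy);

	if (regnum >= SKETCH_NUM_REGS)
		return -EINVAL;

	old = phy->regs[regnum];
	phy->regs[regnum] = (uint16_t)((old & ~mask) | set);

	if (orig_val)
		*orig_val = old;

	return 0;
}
```

Callers that only test "ret < 0" or "ret != 0" then behave the same as with
phy_write(), and the few callers that want the old value ask for it explicitly.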

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout

2018-01-11 Thread Geert Uytterhoeven
On Thu, Jan 11, 2018 at 4:54 PM, Geert Uytterhoeven
 wrote:
> On Thu, Jan 11, 2018 at 4:53 PM, Russell King - ARM Linux
>  wrote:
>> On Thu, Jan 11, 2018 at 10:48:35AM -0500, David Miller wrote:
>>> From: Geert Uytterhoeven 
>>> Date: Tue,  9 Jan 2018 12:11:21 +0100
>>>
>>> > In case of success, the return values of (__)phy_write() and
>>> > (__)phy_modify() are not compatible: (__)phy_write() returns 0, while
>>> > (__)phy_modify() returns the old PHY register value.
>>> >
>>> > Apparently this change was catered for in drivers/net/phy/marvell.c, but
>>> > not in other source files.
>>> >
>>> > Hence genphy_restart_aneg() now returns 4416 instead zero, which is
>>> > considered an error:
>>> >
>>> > ravb e680.ethernet eth0: failed to connect PHY
>>> > IP-Config: Failed to open eth0
>>> > IP-Config: No network devices available
>>> >
>>> > Fix this by converting positive values to zero in all callers of
>>> > phy_modify().
>>> >
>>> > Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to 
>>> > phy_modify()")
>>> > Signed-off-by: Geert Uytterhoeven 
>>> > ---
>>> > Alternatively, __phy_modify() could be changed to follow __phy_write()
>>> > semantics?
>>>
>>> I really want a resolution to this quickly, this broke lots of stuff
>>> for people.
>>>
>>> __phy_modify() wants to return multiple values, so it should be coded
>>> up to do so explicitly rather than trying to encode two values from
>>> overlapping value spaces in one return value.
>>>
>>> That means the original value should be returned by-reference.  And
>>> this will make the error/no-error return value unambiguous.
>>>
>>> int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set,
>>>u16 *orig_val);
>>
>> I'm sorry I have no time to work on this right now due to the meltdown
>> and spectre stuff that hit last week.  If you need to do something,
>> please revert both the mvneta series and the series containing this
>> patch.
>
> I'll have a look into it...

Sorry, the phy_restore_page() semantics are driving me crazy.
Let's revert.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [patch iproute2 v8 1/2] lib/libnetlink: Add functions rtnl_talk_msg and rtnl_talk_iov

2018-01-11 Thread Phil Sutter
On Wed, Jan 10, 2018 at 09:12:45PM +0100, Phil Sutter wrote:
> On Wed, Jan 10, 2018 at 12:20:36PM -0700, David Ahern wrote:
> [...]
> > 2. I am using a batch file with drop filters:
> > 
> > filter add dev eth2 ingress protocol ip pref 273 flower dst_ip
> > 192.168.253.0/16 action drop
> > 
> > and for each command tc is trying to dlopen m_drop.so:
> > 
> > open("/usr/lib/tc//m_drop.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such
> > file or directory)
> 
> [...]
> 
> > Can you look at a follow on patch (not part of this set) to cache status
> > of dlopen attempts?
> 
> IMHO the logic used in get_action_kind() for gact is the culprit here:
> After trying to dlopen m_drop.so, it dlopens m_gact.so although it is
> present already. (Unless I missed something.)

Not quite, m_gact.c is statically compiled in and there is logic around
dlopen(NULL, ...) to prevent calling it twice.

> I guess the better (and easier) fix would be to create some more struct
> action_util instances in m_gact.c for the primitives it supports so that
> the lookup in action_list succeeds for consecutive uses. Note that
> parse_gact() even supports this already.

Sadly, this doesn't fly: If a lookup for action 'drop' is successful,
that value is set as TCA_ACT_KIND and the kernel doesn't know about it.

I came up with an alternative solution, what do you think about attached
patch?

Thanks, Phil
diff --git a/tc/m_action.c b/tc/m_action.c
index fc4223648e8cf..d3df93c066a89 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -194,7 +194,10 @@ int parse_action(int *argc_p, char ***argv_p, int tca_id, struct nlmsghdr *n)
} else {
struct action_util *a = NULL;
 
-   strncpy(k, *argv, sizeof(k) - 1);
+   if (!action_a2n(*argv, NULL, false))
+   strncpy(k, "gact", sizeof(k) - 1);
+   else
+   strncpy(k, *argv, sizeof(k) - 1);
eap = 0;
if (argc > 0) {
a = get_action_kind(k);
diff --git a/tc/tc_util.c b/tc/tc_util.c
index ee9a70aa6830c..10e5aa91168a1 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -511,7 +511,7 @@ static const char *action_n2a(int action)
  *
  * In error case, returns -1 and does not touch @result. Otherwise returns 0.
  */
-static int action_a2n(char *arg, int *result, bool allow_num)
+int action_a2n(char *arg, int *result, bool allow_num)
 {
int n;
char dummy;
@@ -535,13 +535,15 @@ static int action_a2n(char *arg, int *result, bool allow_num)
for (iter = a2n; iter->a; iter++) {
if (matches(arg, iter->a) != 0)
continue;
-   *result = iter->n;
-   return 0;
+   n = iter->n;
+   goto out_ok;
}
if (!allow_num || sscanf(arg, "%d%c", &n, &dummy) != 1)
return -1;
 
-   *result = n;
+out_ok:
+   if (result)
+   *result = n;
return 0;
 }
 
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 1218610d77092..e354765ff1ed0 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -132,4 +132,6 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt);
 int cls_names_init(char *path);
 void cls_names_uninit(void);
 
+int action_a2n(char *arg, int *result, bool allow_num);
+
 #endif
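
The idea in the patch above, letting the name lookup accept a NULL result so
it can double as a pure "is this a gact primitive?" test, can be sketched in
isolation as follows. The table contents, the numeric values and the sketch_*
names are assumptions for illustration, and the sketch uses exact string
comparison rather than tc's matches() prefix matching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_entry {
	const char *name;
	int value;
};

/* Hypothetical subset of action names; the values are illustrative only. */
static const struct sketch_entry sketch_table[] = {
	{ "pass", 0 },
	{ "reclassify", 1 },
	{ "drop", 2 },
	{ "pipe", 3 },
};

/*
 * Look up 'arg'. Returns 0 on success and -1 if the name is unknown.
 * 'result' is optional: pass NULL to use the function as a membership test.
 * On failure '*result' is left untouched.
 */
static int sketch_a2n(const char *arg, int *result)
{
	size_t i;

	assert(arg);

	for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++) {
		if (strcmp(arg, sketch_table[i].name) != 0)
			continue;
		if (result)
			*result = sketch_table[i].value;
		return 0;
	}

	return -1;
}

/* Map a primitive such as "drop" to the "gact" kind, pass others through. */
static const char *sketch_action_kind(const char *arg)
{
	return sketch_a2n(arg, NULL) == 0 ? "gact" : arg;
}
```

With this, the kind is resolved to "gact" before any module lookup happens, so
no attempt is made to load a module named after the primitive.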


Re: [PATCH 30/32] aio: add delayed cancel support

2018-01-11 Thread Jeff Moyer
Christoph Hellwig  writes:

> On Wed, Jan 10, 2018 at 06:26:39PM -0500, Jeff Moyer wrote:
>> >> The upcoming aio poll support would like to be able to complete the
>> >> iocb inline from the cancellation context, but that would cause
>> >> a lock order reversal.  Add support for optionally moving the cancelation
>> >> outside the context lock to avoid this reversal.
>> >>
>> >> Signed-off-by: Christoph Hellwig 
>> >
>> > Acked-by: Jeff Moyer 
>> 
>> Actually, let's move these two defines:
>> 
>> #define AIO_IOCB_DELAYED_CANCEL (1 << 0)
>> #define AIO_IOCB_CANCELLED  (1 << 1)
>> 
>> to include/linux/aio.h so that drivers outside of fs/aio.c can make use
>> of them.
>
> struct aio_kiocb is private to aio.c, so just exposing them won't
> do anything useful.  If we really need these elsewhere we'll need
> to come up with a proper interface.

Duh, good point.  My main concern is that things like usb gadget will
have to deal with races between cancellation and completion on their
own.  It would be nice if we had infrastructure for them to use.  I'll
have a look through that code to see if there's something we could or
should be doing.

Cheers,
Jeff


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Roopa Prabhu
On Thu, Jan 11, 2018 at 7:07 AM, Jiri Pirko  wrote:
> Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote:
>>On 18-01-11 09:41 AM, Jiri Pirko wrote:
>>> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:
>>> > On 18-01-11 09:24 AM, Jiri Pirko wrote:
>>> > > Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
>>> > > > On 18-01-09 09:07 AM, Jiri Pirko wrote:
>>> > > > > From: Jiri Pirko 
>>> > > > >
>>> > > > > Benefit from the previously introduced shared filter blocks
>>> > > > > infrastructure and allow ingress and clsact qdisc instances to share
>>> > > > > filter blocks. The block index is coming from userspace as qdisc 
>>> > > > > option.
>>> > > >
>>> > > > Didn't quite follow why ingress is special and needs attributes to
>>> > > > set the block but other qdiscs didn't.
>>> > >
>>> > > Jamal, again, other qdiscs do not support block sharing. This patchset
>>> > > only adds support for sharing of block for ingress and clsact qdiscs.
>>> > > Later on, other qdiscs could also support block sharing.
>>> > >
>>> >
>>> > Can you stop a config which says:
>>> > tc qdisc add dev ens9 root block 22 handle 1:0 prio ?
>>>
>>> Please see the iproute2 patches. Parsing of "block" command line option
>>> is done inside q_ingress.c
>>>
>>
>>I only looked at the kernel code. Good you can stop it at tc
>>but the API does not stop it (unless you expect the rest of the
>>world to only use tc).
>
> Jamal, apparently, you did not look at the kernel code either :)
> Look at the changes done in net/sched/sch_ingress.c - there is where the
> parsing of block attr takes place.
>
>
>>Really - there is no reason for this API to be only via ingress qdisc
>>attributes. You can add a check in cls api to reject any parent that is
>>not either of the clsacts + ingress (depending on tc doesnt sound
>>right).
>
> I was thinking to take this direction originally. To have another
> generic attr called TCA_BLOCK or something that would be used when qdisc
> is created. For ingress, that would work. But for clsact, you need to be
> able to specify 2 block during qdisc creation - one for ingress, one for
> egress. That's when I realized this has to be per-qdisc-type attr.


Yeah, I see the problem, but would it help if we just introduced two
generic attrs, TCA_BLOCK_INGRESS and TCA_BLOCK_EGRESS, instead of having
to duplicate these attrs at every qdisc, and added proper validation
depending on the qdisc type?
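
To make the "validation depending on qdisc type" idea concrete, here is a
rough sketch of what such a check could look like. It is only an
illustration under stated assumptions: the enum, the struct and
demo_block_req_valid() are hypothetical names, not kernel API, and a block
index of 0 is assumed to mean "attribute not supplied".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical qdisc kinds, for this sketch only. */
enum demo_qdisc_kind {
	DEMO_QDISC_INGRESS,
	DEMO_QDISC_CLSACT,
	DEMO_QDISC_OTHER,
};

/* Assumption: index 0 means the attribute was not supplied. */
struct demo_block_req {
	uint32_t ingress_block;
	uint32_t egress_block;
};

/* Returns true if the requested block attributes make sense for @kind. */
bool demo_block_req_valid(enum demo_qdisc_kind kind,
			  const struct demo_block_req *req)
{
	assert(req != NULL);

	switch (kind) {
	case DEMO_QDISC_INGRESS:
		/* ingress has no egress side to attach a block to */
		return req->egress_block == 0;
	case DEMO_QDISC_CLSACT:
		/* clsact may take an ingress block, an egress block, or both */
		return true;
	default:
		/* qdiscs without block sharing must not carry either attr */
		return req->ingress_block == 0 && req->egress_block == 0;
	}
}
```

In this sketch, a request such as "root block 22 ... prio" would be rejected
by the default branch regardless of which userspace tool sent it.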


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jamal Hadi Salim

On 18-01-11 10:07 AM, Jiri Pirko wrote:
> Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote:
>> On 18-01-11 09:41 AM, Jiri Pirko wrote:
>>> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:
>>
>> I only looked at the kernel code. Good you can stop it at tc
>> but the API does not stop it (unless you expect the rest of the
>> world to only use tc).
>
> Jamal, apparently, you did not look at the kernel code either :)
> Look at the changes done in net/sched/sch_ingress.c - there is where the
> parsing of block attr takes place.

reason i raised it is from looking at tc_ctl_tfilter().
If i specify ifindex != TCM_IFINDEX_MAGIC_BLOCK,
parent = 0X and block = 22 that should work, no?
i.e regardless of whether parent is INGRESS etc.

And so i was confused why you had attributes in sch_ingress.c

>> Really - there is no reason for this API to be only via ingress qdisc
>> attributes. You can add a check in cls api to reject any parent that is
>> not either of the clsacts + ingress (depending on tc doesnt sound
>> right).
>
> I was thinking to take this direction originally. To have another
> generic attr called TCA_BLOCK or something that would be used when qdisc
> is created. For ingress, that would work. But for clsact, you need to be
> able to specify 2 block during qdisc creation - one for ingress, one for
> egress. That's when I realized this has to be per-qdisc-type attr.

ok for clsact - i can see that we dont have enough fields in the tcm
message.

TCA_BLOCK sounds appealing - could be a special tlv with many block ids
maybe? I really would like to use this for egress as well - and what
i described earlier should work for me.

cheers,
jamal


Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout

2018-01-11 Thread David Miller
From: Geert Uytterhoeven 
Date: Tue,  9 Jan 2018 12:11:21 +0100

> In case of success, the return values of (__)phy_write() and
> (__)phy_modify() are not compatible: (__)phy_write() returns 0, while
> (__)phy_modify() returns the old PHY register value.
> 
> Apparently this change was catered for in drivers/net/phy/marvell.c, but
> not in other source files.
> 
> Hence genphy_restart_aneg() now returns 4416 instead of zero, which is
> considered an error:
> 
> ravb e680.ethernet eth0: failed to connect PHY
> IP-Config: Failed to open eth0
> IP-Config: No network devices available
> 
> Fix this by converting positive values to zero in all callers of
> phy_modify().
> 
> Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to 
> phy_modify()")
> Signed-off-by: Geert Uytterhoeven 
> ---
> Alternatively, __phy_modify() could be changed to follow __phy_write()
> semantics?

I really want a resolution to this quickly, this broke lots of stuff
for people.

__phy_modify() wants to return multiple values, so it should be coded
up to do so explicitly rather than trying to encode two values from
overlapping value spaces in one return value.

That means the original value should be returned by-reference.  And
this will make the error/no-error return value unambiguous.

int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set,
                 u16 *orig_val);

Thank you.
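
For illustration, the by-reference form suggested above can be sketched in
plain userspace C. This is not the phylib implementation: struct demo_phy,
DEMO_NUM_REGS and demo_phy_modify() are hypothetical stand-ins, and a simple
array replaces the MDIO bus. The point is only that the return value carries
nothing but 0 or a negative errno, while the old register value travels
through an optional pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_REGS 32

/* Stand-in for a PHY: a register array instead of an MDIO bus. */
struct demo_phy {
	uint16_t regs[DEMO_NUM_REGS];
};

/*
 * Read-modify-write of register @regnum: clear the bits in @mask, then
 * set the bits in @set. Returns 0 on success or a negative errno.
 * If @orig_val is not NULL, the value read before the modification is
 * stored there; it is left untouched on error.
 */
int demo_phy_modify(struct demo_phy *phy, uint32_t regnum, uint16_t mask,
		    uint16_t set, uint16_t *orig_val)
{
	uint16_t old;

	assert(phy != NULL);

	if (regnum >= DEMO_NUM_REGS)
		return -EINVAL;

	old = phy->regs[regnum];
	phy->regs[regnum] = (uint16_t)((old & ~mask) | set);

	if (orig_val)
		*orig_val = old;
	return 0;
}
```

With this shape, a register that previously held 0x1140 (4416 decimal, the
value from the report) no longer leaks into the status: the call returns 0,
and callers that do not care about the old value simply pass NULL.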


Re: [PATCH 0/2] Fix a couple of crashes in netfront

2018-01-11 Thread Ross Lagerwall

+CC netdev

On 01/11/2018 09:36 AM, Ross Lagerwall wrote:
> Here are a couple of patches to fix two crashes in netfront.
>
> Ross Lagerwall (2):
>   xen/grant-table: Use put_page instead of free_page
>   xen-netfront: Fix race between device setup and open
>
>  drivers/net/xen-netfront.c | 46 --
>  drivers/xen/grant-table.c  |  4 ++--
>  2 files changed, 26 insertions(+), 24 deletions(-)



Re: [PATCH] net: phy: Fix phy_modify() semantic difference fallout

2018-01-11 Thread Russell King - ARM Linux
On Thu, Jan 11, 2018 at 10:48:35AM -0500, David Miller wrote:
> From: Geert Uytterhoeven 
> Date: Tue,  9 Jan 2018 12:11:21 +0100
> 
> > In case of success, the return values of (__)phy_write() and
> > (__)phy_modify() are not compatible: (__)phy_write() returns 0, while
> > (__)phy_modify() returns the old PHY register value.
> > 
> > Apparently this change was catered for in drivers/net/phy/marvell.c, but
> > not in other source files.
> > 
> > Hence genphy_restart_aneg() now returns 4416 instead of zero, which is
> > considered an error:
> > 
> > ravb e680.ethernet eth0: failed to connect PHY
> > IP-Config: Failed to open eth0
> > IP-Config: No network devices available
> > 
> > Fix this by converting positive values to zero in all callers of
> > phy_modify().
> > 
> > Fixes: fea23fb591cce995 ("net: phy: convert read-modify-write to 
> > phy_modify()")
> > Signed-off-by: Geert Uytterhoeven 
> > ---
> > Alternatively, __phy_modify() could be changed to follow __phy_write()
> > semantics?
> 
> I really want a resolution to this quickly, this broke lots of stuff
> for people.
> 
> __phy_modify() wants to return multiple values, so it should be coded
> up to do so explicitly rather than trying to encode two values from
> overlapping value spaces in one return value.
> 
> That means the original value should be returned by-reference.  And
> this will make the error/no-error return value unambiguous.
> 
> int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set,
>                  u16 *orig_val);

I'm sorry I have no time to work on this right now due to the meltdown
and spectre stuff that hit last week.  If you need to do something,
please revert both the mvneta series and the series containing this
patch.

Thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jiri Pirko
Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:
>On 18-01-11 09:24 AM, Jiri Pirko wrote:
>> Thu, Jan 11, 2018 at 02:36:01PM CET, j...@mojatatu.com wrote:
>> > On 18-01-09 09:07 AM, Jiri Pirko wrote:
>> > > From: Jiri Pirko 
>> > > 
>> > > Benefit from the previously introduced shared filter blocks
>> > > infrastructure and allow ingress and clsact qdisc instances to share
>> > > filter blocks. The block index is coming from userspace as qdisc option.
>> > 
>> > Didn't quite follow why ingress is special and needs attributes to
>> > set the block but other qdiscs didn't.
>> 
>> Jamal, again, other qdiscs do not support block sharing. This patchset
>> only adds support for sharing of block for ingress and clsact qdiscs.
>> Later on, other qdiscs could also support block sharing.
>> 
>
>Can you stop a config which says:
>tc qdisc add dev ens9 root block 22 handle 1:0 prio ?

Please see the iproute2 patches. Parsing of "block" command line option
is done inside q_ingress.c


Re: [PATCH bpf-next v4 5/5] error-injection: Support fault injection framework

2018-01-11 Thread Akinobu Mita
2018-01-11 9:51 GMT+09:00 Masami Hiramatsu :
> Support in-kernel fault-injection framework via debugfs.
> This allows you to inject a conditional error to specified
> function using debugfs interfaces.
>
> Here is the result of test script described in
> Documentation/fault-injection/fault-injection.txt
>
>   ===
>   # ./test_fail_function.sh
>   1+0 records in
>   1+0 records out
>   1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
>   btrfs-progs v4.4
>   See http://btrfs.wiki.kernel.org for more information.
>
>   Label:  (null)
>   UUID:   bfa96010-12e9-4360-aed0-42eec7af5798
>   Node size:  16384
>   Sector size:4096
>   Filesystem size:1001.00MiB
>   Block group profiles:
> Data: single8.00MiB
> Metadata: DUP  58.00MiB
> System:   DUP  12.00MiB
>   SSD detected:   no
>   Incompat features:  extref, skinny-metadata
>   Number of devices:  1
>   Devices:
>  IDSIZE  PATH
>   1  1001.00MiB  /dev/loop2
>
>   mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
>   SUCCESS!
>   ===
>
>
> Signed-off-by: Masami Hiramatsu 
> Reviewed-by: Josef Bacik 
> ---
>   Changes in v3:
>- Check and adjust error value for each target function
>- Clear kprobe flag for reuse
>- Add more documents and example
> ---
>  Documentation/fault-injection/fault-injection.txt |   62 ++
>  kernel/Makefile   |1
>  kernel/fail_function.c|  217 
> +
>  lib/Kconfig.debug |   10 +
>  4 files changed, 290 insertions(+)
>  create mode 100644 kernel/fail_function.c
>
> diff --git a/Documentation/fault-injection/fault-injection.txt 
> b/Documentation/fault-injection/fault-injection.txt
> index 918972babcd8..4aecbceef9d2 100644
> --- a/Documentation/fault-injection/fault-injection.txt
> +++ b/Documentation/fault-injection/fault-injection.txt
> @@ -30,6 +30,12 @@ o fail_mmc_request
>injects MMC data errors on devices permitted by setting
>debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
>
> +o fail_function
> +
> +  injects error return on specific functions, which are marked by
> +  ALLOW_ERROR_INJECTION() macro, by setting debugfs entries
> +  under /sys/kernel/debug/fail_function. No boot option supported.
> +
>  Configure fault-injection capabilities behavior
>  ---
>
> @@ -123,6 +129,24 @@ configuration of fault-injection capabilities.
> default is 'N', setting it to 'Y' will disable failure injections
> when dealing with private (address space) futexes.
>
> +- /sys/kernel/debug/fail_function/inject:
> +
> +   specifies the target function of error injection by name.
> +
> +- /sys/kernel/debug/fail_function/retval:
> +
> +   specifies the "error" return value to inject to the given
> +   function.
> +

Is it possible to inject errors into multiple functions at the same time?

If so, it would be more useful to support that in fault injection too,
because some kinds of bugs are caused by a combination of errors
(e.g. another error in an error path).

I suggest the following interface.

- /sys/kernel/debug/fail_function/inject:

  specifies the target function of error injection by name.
  /sys/kernel/debug/fail_function// directory will be created.

- /sys/kernel/debug/fail_function/uninject:

  specifies the target function of error injection by name that is
  currently being injected.  /sys/kernel/debug/fail_function//
  directory will be removed.

- /sys/kernel/debug/fail_function//retval:

  specifies the "error" return value to inject to the given function.


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-11 Thread महेश बंडेवार
On Thu, Jan 11, 2018 at 3:25 AM, Jiri Benc  wrote:
> On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>> I still prefer the approach I had mentioned that uses 'mtu_adj'. In
>> that approach you can leave those slaves which have changed their mtu
>> to be lower than masters' but if master's mtu changes to larger value
>> all other slaves will get updated mtu leaving behind the slaves who
>> have opted to change their mtu on their own. Also the same thing is
>> true when mtu get reduced at master.
>
> The problem with this magic behavior is, well, that it's magic. There's
> no way to tell what happens with a given slave when the master's MTU
> gets changed just by looking at the current configuration. There's also
> no way to switch the magic behavior back on once the slave's MTU is
> changed.
>
I guess the logic would be as simple as this: if mtu_adj for a slave is
set to 0, then it's following the master, otherwise not. By setting a
different mtu for a slave, you set its mtu_adj to a positive number, which
means it's not following the master. It's then subject to clamping when
the master's mtu is reduced, but should stay as it is otherwise. Also, when
the slave decides to follow the master again, it can set its mtu to be the
same as the master's (making mtu_adj == 0) and then it would start
following the master again.
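
A rough sketch of the rule described above, to make the states explicit. It
is not the ipvlan code: struct demo_slave and both helpers are hypothetical,
and it assumes mtu_adj is kept as the difference between the master's MTU
and the slave's MTU, so that a slave clamped down to the master's MTU ends
up with mtu_adj == 0 and follows the master again.

```c
#include <assert.h>
#include <stddef.h>

struct demo_slave {
	unsigned int mtu;
	unsigned int mtu_adj;	/* 0: follows master, >0: set by hand */
};

/*
 * The user sets the slave MTU. It may not exceed the master's MTU.
 * Returns 0 on success and -1 if @mtu is too large.
 */
int demo_slave_set_mtu(struct demo_slave *s, unsigned int master_mtu,
		       unsigned int mtu)
{
	assert(s != NULL);

	if (mtu > master_mtu)
		return -1;
	s->mtu = mtu;
	s->mtu_adj = master_mtu - mtu;	/* 0 when set equal to the master */
	return 0;
}

/* The master's MTU changed to @master_mtu; update one slave. */
void demo_master_mtu_changed(struct demo_slave *s, unsigned int master_mtu)
{
	assert(s != NULL);

	if (s->mtu_adj == 0 || s->mtu > master_mtu) {
		/* Either following the master, or must be clamped to it. */
		s->mtu = master_mtu;
		s->mtu_adj = 0;
	} else {
		/* Hand-set MTU still fits: keep it, refresh the delta. */
		s->mtu_adj = master_mtu - s->mtu;
	}
}
```

Under these assumptions, whether a slave currently follows the master can be
read directly from mtu_adj, and writing the master's MTU to the slave is the
way to switch following back on.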

Whether it's magic or not, it's the current behavior, and I know several
use cases that depend on it and would break otherwise. The approach I
proposed keeps that working for those who depend on it, while adding the
ability to set the mtu per slave for the use case mentioned in this
patch-set too.

> At minimum, you'd need some kind of indication that the slave's MTU is
> following the master. And a way to toggle this back.
>
> Keefe's patch is much saner, the behavior is completely deterministic.
>
>  Jiri


Re: [patch net-next v7 09/13] net: sched: allow ingress and clsact qdiscs to share filter blocks

2018-01-11 Thread Jamal Hadi Salim

On 18-01-11 11:15 AM, Jiri Pirko wrote:
> Thu, Jan 11, 2018 at 04:44:32PM CET, j...@mojatatu.com wrote:
>> On 18-01-11 10:07 AM, Jiri Pirko wrote:
>>> Thu, Jan 11, 2018 at 03:46:09PM CET, j...@mojatatu.com wrote:
>>>> On 18-01-11 09:41 AM, Jiri Pirko wrote:
>>>>> Thu, Jan 11, 2018 at 03:37:08PM CET, j...@mojatatu.com wrote:
>>>>
>>>> I only looked at the kernel code. Good you can stop it at tc
>>>> but the API does not stop it (unless you expect the rest of the
>>>> world to only use tc).
>>>
>>> Jamal, apparently, you did not look at the kernel code either :)
>>> Look at the changes done in net/sched/sch_ingress.c - there is where the
>>> parsing of block attr takes place.
>>
>> reason i raised it is from looking at tc_ctl_tfilter().
>> If i specify ifindex != TCM_IFINDEX_MAGIC_BLOCK,
>> parent = 0X and block = 22 that should work, no?
>> i.e regardless of whether parent is INGRESS etc.
>
> No, the block needs to be created first by qdisc instance.
> Seems to me that you are mixing apples and oranges a bit.

You are correct, the qdisc attachment must exist first ;->

>> TCA_BLOCK sounds appealing - could be a special tlv with many block ids
>> maybe? I really would like to use this for egress as well - and what
>> i described earlier should work for me.
>
> I don't get what you mean by "tlv with many block ids". What is it good
> for? :O

I meant a TLV with a bunch of 32 bit values, so you can have more than
one block id in it. But what Roopa suggested is more explicit (and
better).

cheers,
jamal

