Re: possible deadlock in rtnl_lock (5)

2018-03-27 Thread Dmitry Vyukov
Please keep the Reported-by notice, and the reproducer will probably be useful too:

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a46d6abf9d56b1365...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

syzbot hit the following crash on upstream commit
3eb2ce825ea1ad89d20f7a3b5780df850e4be274 (Sun Mar 25 22:44:30 2018 +)
Linux 4.16-rc7
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=a46d6abf9d56b1365a72

So far this crash happened 27 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6524202618191872
syzkaller reproducer:
https://syzkaller.appspot.com/x/repro.syz?id=5383267238805504
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5136472378179584
Kernel config: https://syzkaller.appspot.com/x/.config?id=-8440362230543204781
compiler: gcc (GCC) 7.1.1 20170620




On Tue, Mar 27, 2018 at 9:52 PM, Julian Anastasov  wrote:
>
> Hello,
>
> On Tue, 27 Mar 2018, Florian Westphal wrote:
>
>> syzbot  wrote:
>> [ cc Julian and trimming cc list ]
>>
>> > syzkaller688027/4497 is trying to acquire lock:
>> >  (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20
>> > net/core/rtnetlink.c:74
>>
>> > but task is already holding lock:
>> > IPVS: stopping backup sync thread 4495 ...
>> >  (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20
>> > net/core/rtnetlink.c:74
>> >
>> > other info that might help us debug this:
>> >  Possible unsafe locking scenario:
>> >
>> >CPU0
>> >
>> >   lock(rtnl_mutex);
>> >   lock(rtnl_mutex);
>> >
>> >  *** DEADLOCK ***
>> >
>> >  May be due to missing lock nesting notation
>>
>> Looks like this is real, commit e0b26cc997d57305b4097711e12e13992580ae34
>> ("ipvs: call rtnl_lock early") added rtnl_lock when starting sync thread
>> but socket close invokes rtnl_lock too:
>
> I see, thanks! I'll have to move the locks into
> start_sync_thread and split make_{send,receive}_sock
> into {make,setup}_{send,receive}_sock ...
>
>> > stack backtrace:
>> >  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>> >  ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
>> >  inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
>> >  sock_release+0x8d/0x1e0 net/socket.c:595
>> >  start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
>> >  do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389
>
> Regards
>
> --
> Julian Anastasov 
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/alpine.LFD.2.20.1803272227370.3460%40ja.home.ssi.bg.
> For more options, visit https://groups.google.com/d/optout.


Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation

2018-03-27 Thread Stephen Hemminger
On Thu, 22 Mar 2018 11:55:10 +0100
Jiri Pirko  wrote:

> From: Jiri Pirko 
> 
> This patchset resolves 2 issues we have right now:
> 1) There are many netdevices / ports in the system, for port, pf, vf
>representation, but the user has no way to see which is which

There already are a lot of attributes; adding more doesn't necessarily
help make things clearer.

> 2) The ndo_get_phys_port_name is implemented in each driver separately,
>which may lead to inconsistent names between drivers.

Why not address that problem instead? My concern is that your new attribute
will have the same problem.

Also, adding pf and vfNNN to the name will make the already tightly squeezed
interface name length a real problem. I have had arguments with people
trying to use VLAN 4000 together with the standard naming policy, which
means names really can't get that long.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote:
> On 3/27/2018 10:51 PM, Linus Torvalds wrote:
> > > The discussion at hand is about
> > > 
> > > dma_buffer->foo = 1;/* WB */
> > > writel(KICK, DMA_KICK_REGISTER);/* UC */
> > 
> > Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> > even if that writel() might be of type WC, because that only delays
> > writes, it doesn't move them earlier.
> 
> Now that we have clarified the x86 myth, is this guaranteed on all architectures?

If not we need to fix it. It's guaranteed on the "main" ones (arm,
arm64, powerpc, i386, x86_64). We might need to check with other arch
maintainers for the rest.

We really want Linux to provide well defined "sane" semantics for the
basic writel accessors.

Note: We still have open questions about how readl() relates to
surrounding memory accesses. It looks like ARM and powerpc do different
things here.

> We keep getting the IA64 exception example. Maybe this has also been corrected
> since then.

I would think ia64 fixed it back when it was all discussed. I was under
the impression that the only "special" thing ia64 had was the writel vs.
spin_unlock ordering, which requires mmiowb, but maybe that was never
completely fixed?

> Jose Abreu says "I don't know about x86 but arc architecture doesn't
> have a wmb() in the writel() function (in some configs)".

Well, it probably should then.

> As long as we have these exceptions, the wmb() calls in common drivers are not
> going anywhere, and relaxed arches will continue paying a performance penalty.

Well, let's fix them or leave them broken; at this point it doesn't
matter. We can give all arch maintainers a wakeup call and start making
drivers work based on the documented assumptions.

> I see a 15% performance loss on ARM64 servers using the Intel i40e network
> driver and an XL710 adapter, because the CPU keeps itself busy executing
> barriers rather than real work, due to sequences like this all over
> the place:
> 
>  dma_buffer->foo = 1;              /* WB */
>  wmb();
>  writel(KICK, DMA_KICK_REGISTER);  /* UC */
>
> I posted several patches last week to remove duplicate barriers on ARM while
> trying to make the code friendly with other architectures.
> 
> Basically changing it to
> 
> dma_buffer->foo = 1;                      /* WB */
> wmb();
> writel_relaxed(KICK, DMA_KICK_REGISTER);  /* UC */
> mmiowb();
> 
> This is a small step in the performance direction until we remove all 
> exceptions.
> 
> https://www.spinics.net/lists/netdev/msg491842.html
> https://www.spinics.net/lists/linux-rdma/msg62434.html
> https://www.spinics.net/lists/arm-kernel/msg642336.html
> 
> The discussion then started to move toward the need for a relaxed API on PPC,
> and that is how the wmb() question came up.

I'm working on the problem of relaxed APIs for powerpc, but we can keep
that orthogonal. As is, today, a wmb() + writel() and a wmb() +
writel_relaxed() on powerpc are identical. So changing them will not
break us.

But I don't see the point of doing that transformation if we can just
get the straying archs fixed. It's not like any of them has a
significant market presence these days anyway.

Cheers,
Ben.

> Sinan
> 


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
>  wrote:
> > 
> > The discussion at hand is about
> > 
> > dma_buffer->foo = 1;/* WB */
> > writel(KICK, DMA_KICK_REGISTER);/* UC */
> 
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Ok so this is our answer ...

 ... snip ... (thanks for the background info !)

> Oh, the above UC case is absoutely guaranteed.

Good.

Then

> The only issue really is that 99.9% of all testing gets done on x86
> unless you look at specific SoC drivers.
> 
> On ARM, for example, there is likely little reason to care about x86
> memory ordering, because there is almost zero driver overlap between
> x86 and ARM.
> 
> *Historically*, the reason for following the x86 IO ordering was
> simply that a lot of architectures used the drivers that were
> developed on x86. The alpha and powerpc workstations were *designed*
> with the x86 IO bus (PCI, then PCIe) and to work with the devices that
> came with it.
> 
> ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
> off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
> test coverage" is simply not an issue.
> 
> How much of an issue is it for Power? Maybe you decide it's not a big deal.
> 
> Then all the above is almost irrelevant.

So the overlap may not be that NIL in practice :-) But even then that
doesn't matter as ARM has been happily implementing the same semantic
you describe above for years, as do we powerpc.

This is why I want (with your agreement) to define, clearly and once
and for all, that the Linux semantics of writel() are that it is ordered
with previous writes to coherent memory. (*)

This is already what ARM and powerpc provide and, from what you say,
what x86 provides. I don't see any reason to keep that badly documented
and have drivers randomly growing useless wmb()'s because they don't
think it works on x86 without them!

Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock
and the problem of writel_relaxed semantics but as separate issues :-)

Also, can I assume the above ordering with writel() equally applies to
readl() or not ?

IE:
dma_buf->foo = 1;
readl(STUPID_DEVICE_DMA_KICK_ON_READ);

Also works on x86 ? (It does on power, maybe not on ARM).

Cheers,
Ben.

(*) From a Linux API perspective, all of this is only valid if the
memory was allocated by dma_alloc_coherent(). Anything obtained by
dma_map_something() might have been bounce-buffered or might require
extra cache flushes on some architectures, and thus needs
dma_sync_for_{cpu,device} calls.

Cheers,
Ben.
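For the streaming-DMA case the footnote warns about, the buffer ownership has to be handed back to the device explicitly before the kick. Roughly (a kernel-API sketch only, using the standard DMA-mapping calls; `buf`, `dma_handle` and the register names are illustrative):

```c
/* buf was mapped earlier with:
 *   dma_handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
 */
dma_sync_single_for_cpu(dev, dma_handle, len, DMA_TO_DEVICE);
buf->foo = 1;                               /* CPU owns the buffer here */
dma_sync_single_for_device(dev, dma_handle, len, DMA_TO_DEVICE);
writel(KICK, regs + DMA_KICK_REGISTER);     /* hand it back and kick */
```

With dma_alloc_coherent() memory, the sync calls are unnecessary and the writel() ordering guarantee being discussed is all that is needed.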



Re: [PATCH v13 net-next 01/12] tls: support for Inline tls record

2018-03-27 Thread Atul Gupta


On 3/27/2018 11:53 PM, Stefano Brivio wrote:
> On Tue, 27 Mar 2018 23:06:30 +0530
> Atul Gupta  wrote:
>
>> +static struct tls_context *create_ctx(struct sock *sk)
>> +{
>> +        struct inet_connection_sock *icsk = inet_csk(sk);
>> +        struct tls_context *ctx;
>> +
>> +        /* allocate tls context */
>> +        ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +        if (!ctx)
>> +                return NULL;
>> +
>> +        icsk->icsk_ulp_data = ctx;
>> +        return ctx;
>> +}
>>
>> [...]
>>
>>  static int tls_init(struct sock *sk)
>>  {
>>          int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4;
>> -        struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct tls_context *ctx;
>>          int rc = 0;
>>  
>> +        if (tls_hw_prot(sk))
>> +                goto out;
>> +
>>          /* The TLS ulp is currently supported only for TCP sockets
>>           * in ESTABLISHED state.
>>           * Supporting sockets in LISTEN state will require us
>> @@ -530,12 +624,11 @@ static int tls_init(struct sock *sk)
>>                  return -ENOTSUPP;
>>  
>>          /* allocate tls context */
>> -        ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +        ctx = create_ctx(sk);
>>          if (!ctx) {
>>                  rc = -ENOMEM;
>>                  goto out;
>>          }
>> -        icsk->icsk_ulp_data = ctx;
> Why are you changing this?
Since create_ctx() is called in two places, the assignment is done in the
allocating function rather than duplicating it.
>
> This is now equivalent to the original implementation, except that you
> are "hiding" the assignment of icsk->icsk_ulp_data into a function named
> "create_ctx".
>
> Please also note that you are duplicating the "allocate tls context"
> comment.
will remove this comment.
>



Re: [PATCH V5 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-27 Thread Shannon Nelson

On 3/27/2018 4:56 PM, Saeed Mahameed wrote:

From: Ilya Lesokhin 

This patch adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records": those
records contain plaintext instead of ciphertext, and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is acked;
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the TCP
sequence of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.
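The expected-SN bookkeeping described above could be sketched roughly as follows (driver-side pseudocode in C; tls_get_record() is the lookup this infrastructure exposes, while the other helper names are made up for illustration):

```c
/* Per offloaded TLS stream, on the transmit path: */
if (tcp_seq(skb) == ctx->expected_sn) {
        /* In-order: mark for inline crypto and send. */
        ctx->expected_sn += skb_payload_len(skb);
        xmit_with_tls_offload(skb);                      /* made-up name */
} else {
        /* Out of order (retransmit, qdisc drop): rebuild NIC state. */
        record = tls_get_record(offload_ctx, tcp_seq(skb), &record_sn);
        post_resync_wqe(txq, record);                    /* made-up name:
                                                          * context-rebuild
                                                          * work entry on the
                                                          * same TX queue */
        ctx->expected_sn = tcp_seq(skb) + skb_payload_len(skb);
        xmit_with_tls_offload(skb);
}
```

Posting the resync work entry on the same queue as the packet is what avoids flushing the transmit queue before the reconstruction request, as the commit message notes.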

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 


Acked-by: Shannon Nelson 


---
  include/net/tls.h | 120 +--
  net/tls/Kconfig   |  10 +
  net/tls/Makefile  |   2 +
  net/tls/tls_device.c  | 759 ++
  net/tls/tls_device_fallback.c | 454 +
  net/tls/tls_main.c| 120 ---
  net/tls/tls_sw.c  | 132 
  7 files changed, 1476 insertions(+), 121 deletions(-)
  create mode 100644 net/tls/tls_device.c
  create mode 100644 net/tls/tls_device_fallback.c

diff --git a/include/net/tls.h b/include/net/tls.h
index 437a746300bf..0a8529e9ec21 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -57,21 +57,10 @@
  
  #define TLS_AAD_SPACE_SIZE		13
  
-struct tls_sw_context {

+struct tls_sw_context_tx {
struct crypto_aead *aead_send;
-   struct crypto_aead *aead_recv;
struct crypto_wait async_wait;
  
-	/* Receive context */

-   struct strparser strp;
-   void (*saved_data_ready)(struct sock *sk);
-   unsigned int (*sk_poll)(struct file *file, struct socket *sock,
-   struct poll_table_struct *wait);
-   struct sk_buff *recv_pkt;
-   u8 control;
-   bool decrypted;
-
-   /* Sending context */
char aad_space[TLS_AAD_SPACE_SIZE];
  
  	unsigned int sg_plaintext_size;

@@ -88,6 +77,50 @@ struct tls_sw_context {
struct scatterlist sg_aead_out[2];
  };
  
+struct tls_sw_context_rx {

+   struct crypto_aead *aead_recv;
+   struct crypto_wait async_wait;
+
+   struct strparser strp;
+   void (*saved_data_ready)(struct sock *sk);
+   unsigned int (*sk_poll)(struct file *file, struct socket *sock,
+   struct poll_table_struct *wait);
+   struct sk_buff *recv_pkt;
+   u8 control;
+   bool decrypted;
+};
+
+struct tls_record_info {
+   struct list_head list;
+   u32 end_seq;
+   int len;

Re: [PATCH] vhost-net: add time limitation for tx polling(Internet mail)

2018-03-27 Thread 张海斌
On March 27, 2018, at 19:26, Jason wrote:
On March 27, 2018, at 17:12, haibinzhang wrote:
>> handle_tx() will delay rx for a long time when busy tx polling udp packets
>> with short length (i.e. 1-byte udp payload), because the VHOST_NET_WEIGHT
>> limit takes into account only sent bytes, not time.
>
>Interesting.
>
>Looking at vhost_can_busy_poll(), it tries to poke pending vhost work and
>exit the busy loop if it finds one. So I believe something is blocking the
>work queuing. E.g. does reverting 8241a1e466cd56e6c10472cac9c1ad4e54bc65db
>fix the issue?

"busy tx polling" means using netperf to send udp packets with a 1-byte
payload (47-byte total frame length), so handle_tx() is busy sending
packets continuously.

>
>>   It's not fair for handle_rx(),
>> so needs to limit max time of tx polling.
>>
>> ---
>>   drivers/vhost/net.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 8139bc70ad7d..dc9218a3a75b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
>>  struct socket *sock;
>>  struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>>  bool zcopy, zcopy_used;
>> +unsigned long start = jiffies;
>
>Checking jiffies is tricky, need to convert it to ms or whatever others.
>
>>   
>>  mutex_lock(&vq->mutex);
>>  sock = vq->private_data;
>> @@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net)
>>  else
>>  vhost_zerocopy_signal_used(net, vq);
>>  vhost_net_tx_packet(net);
>> -if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>> +if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies 
>> - start >= 1)) {
>
>How value 1 is determined here? And we need a complete test to make sure 
>this won't affect other use cases.

We just want <1ms ping latency, but actually we are not sure what value
is reasonable. We have some test results using netperf before this patch,
as follows:

UDP payload         1 byte    100 bytes    1000 bytes    1400 bytes
Ping avg latency    25 ms     10 ms        2 ms          1.5 ms

What are the other test cases?

>
>Another thought is introduce another limit of #packets, but this need 
>benchmark too.
>
>Thanks
>
>>  vhost_poll_queue(&vq->poll);
>>  break;
>>  }
>
>


[PATCH v2 bpf-next 6/9] bpf: Hooks for sys_connect

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 2nd
part of the problem: making an outgoing connection from a desired IP.
It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.

The local end of a connection can be bound to a desired IP using the newly
introduced BPF-helper `bpf_bind()`. It only allows binding to an IP,
though, and doesn't support binding to a port, i.e. it leverages the
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.

As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate attach types for the same reason as the
sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6
fields when the user passes sockaddr_in, since it'd be out-of-bounds.

== Implementation notes ==

The patch introduces a new field in `struct proto`: `pre_connect`, which is
a pointer to a function with the same signature as `connect` but called
before it. The reason is that in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.

`bpf_bind()` sets `force_bind_address_no_port` to bind to the IP only,
regardless of the value of the `bind_address_no_port` socket field.

`bpf_bind()` sets `with_lock` to `false` when calling to `__inet_bind()`
and `__inet6_bind()` since all call-sites, where `bpf_bind()` is called,
already hold socket lock.
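As an illustration of the attach types described above, a connect4 program could look roughly like this (a sketch only: the SEC() name follows the usual libbpf convention, and the addresses/port are made up):

```c
SEC("cgroup/connect4")
int connect_v4_prog(struct bpf_sock_addr *ctx)
{
	struct sockaddr_in sa = {};

	/* Bind the local end to a source IP owned by the cgroup.
	 * Port stays 0: bpf_bind() binds the address only
	 * (IP_BIND_ADDRESS_NO_PORT semantics, as described above). */
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = bpf_htonl(0x0A000001); /* 10.0.0.1, made up */
	if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)))
		return 0; /* reject the connect() */

	/* Rewrite the remote end: redirect to a service in the cgroup. */
	ctx->user_ip4 = bpf_htonl(0x7F000001);      /* 127.0.0.1, made up */
	ctx->user_port = bpf_htons(4040);           /* made up */
	return 1; /* allow */
}
```

The program would be loaded with `expected_attach_type = BPF_CGROUP_INET4_CONNECT` and attached to a cgroup with the matching attach type, per the load-time/attach-time checks introduced earlier in this series.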

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf-cgroup.h | 31 
 include/net/sock.h |  3 +++
 include/net/udp.h  |  1 +
 include/uapi/linux/bpf.h   | 12 ++-
 kernel/bpf/syscall.c   |  8 
 net/core/filter.c  | 50 ++
 net/ipv4/af_inet.c | 13 
 net/ipv4/tcp_ipv4.c| 16 +++
 net/ipv4/udp.c | 14 +
 net/ipv6/tcp_ipv6.c| 16 +++
 net/ipv6/udp.c | 20 +++
 11 files changed, 183 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 67dc4a6471ad..c6ab295e6dcb 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -116,12 +116,38 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 
major, u32 minor,
__ret; \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type)  \
+({\
+   int __ret = 0; \
+   if (cgroup_bpf_enabled) {  \
+   lock_sock(sk); \
+   __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);\
+   release_sock(sk);  \
+   }  \
+   __ret; \
+})
+
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) \
BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
 
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) \
BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
 
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (cgroup_bpf_enabled && \
+   sk->sk_prot->pre_connect)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr)  \
+   BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr)  

[PATCH v2 bpf-next 1/9] bpf: Check attach type at prog load time

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

== The problem ==

There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access context or to call helpers.

E.g. context structure may have fields for both IPv4 and IPv6 but it
doesn't make sense to read from / write to IPv6 field when attach point
is somewhere in IPv4 stack.

The same applies to BPF-helpers: it may make sense to call some helper
from one attach point but not from another, for the same prog type.

== The solution ==

Introduce an `expected_attach_type` field in `union bpf_attr` for the
`BPF_PROG_LOAD` command. If the scenario described in "The problem" section
applies to some prog type, the field will be checked twice:

1) At load time the prog type is checked to see if the attach type for it
   must be known to validate program permissions correctly. The prog will
   be rejected with EINVAL if that is the case and `expected_attach_type`
   is not specified or has an invalid value.

2) At attach time `attach_type` is compared with `expected_attach_type`,
   if prog type requires to have one, and, if they differ, attach will
   be rejected with EINVAL.

The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops->is_valid_access()` and
`bpf_verifier_ops->get_func_proto()`, and can be used to check context
accesses and calls to helpers correspondingly.

Initially the idea was discussed by Alexei Starovoitov  and
Daniel Borkmann  here:
https://marc.info/?l=linux-netdev=152107378717201=2

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |  5 -
 include/linux/filter.h   |  1 +
 include/uapi/linux/bpf.h |  5 +
 kernel/bpf/cgroup.c  |  3 ++-
 kernel/bpf/syscall.c | 31 ++-
 kernel/bpf/verifier.c|  6 +++---
 kernel/trace/bpf_trace.c | 21 ++---
 net/core/filter.c| 39 +--
 8 files changed, 84 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 819229c80eca..95a7abd0ee92 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -208,12 +208,15 @@ struct bpf_prog_ops {
 
 struct bpf_verifier_ops {
/* return eBPF function prototype for verification */
-   const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id 
func_id);
+   const struct bpf_func_proto *
+   (*get_func_proto)(enum bpf_func_id func_id,
+ const struct bpf_prog *prog);
 
/* return true if 'size' wide access at offset 'off' within bpf_context
 * with 'type' (read or write) is allowed
 */
bool (*is_valid_access)(int off, int size, enum bpf_access_type type,
+   const struct bpf_prog *prog,
struct bpf_insn_access_aux *info);
int (*gen_prologue)(struct bpf_insn *insn, bool direct_write,
const struct bpf_prog *prog);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 109d05ccea9a..5787280b1e88 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -469,6 +469,7 @@ struct bpf_prog {
is_func:1,  /* program is a bpf function */
kprobe_override:1; /* Do we override a kprobe? 
*/
enum bpf_prog_type  type;   /* Type of BPF program */
+   enum bpf_attach_typeexpected_attach_type; /* For some prog types */
u32 len;/* Number of filter blocks */
u32 jited_len;  /* Size of jited insns in bytes 
*/
u8  tag[BPF_TAG_SIZE];
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 18b7c510c511..dbc8a66b5d7e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -294,6 +294,11 @@ union bpf_attr {
__u32   prog_flags;
charprog_name[BPF_OBJ_NAME_LEN];
__u32   prog_ifindex;   /* ifindex of netdev to prep 
for */
+   /* For some prog types expected attach type must be known at
+* load time to verify attach type specific parts of prog
+* (context accesses, allowed helpers, etc).
+*/
+   __u32   expected_attach_type;
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index c1c0b60d3f2f..8730b24ed540 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -545,7 +545,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 
major, u32 minor,
 EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission);
 
 static const struct bpf_func_proto *
-cgroup_dev_func_proto(enum bpf_func_id func_id)

[PATCH v2 bpf-next 2/9] libbpf: Support expected_attach_type at prog load

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

Support setting `expected_attach_type` at prog load time in both
`bpf/bpf.h` and `bpf/libbpf.h`.

Since both headers already have APIs to load programs, new functions are
added so as not to break backward compatibility for the existing ones:
* `bpf_load_program_xattr()` is added to `bpf/bpf.h`;
* `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`.

Both new functions accept structures, `struct bpf_load_program_attr` and
`struct bpf_prog_load_attr` correspondingly, where new fields can be
added in the future w/o changing the API.

Standard `_xattr` suffix is used to name the new API functions.

Since `bpf_load_program_name()` is not used as heavily as
`bpf_load_program()`, it was removed in favor of the more generic
`bpf_load_program_xattr()`.

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h |   5 ++
 tools/lib/bpf/bpf.c|  44 +++--
 tools/lib/bpf/bpf.h|  17 +--
 tools/lib/bpf/libbpf.c | 105 +++--
 tools/lib/bpf/libbpf.h |   8 
 5 files changed, 133 insertions(+), 46 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d245c41213ac..b44bcd7814ee 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -294,6 +294,11 @@ union bpf_attr {
__u32   prog_flags;
charprog_name[BPF_OBJ_NAME_LEN];
__u32   prog_ifindex;   /* ifindex of netdev to prep 
for */
+   /* For some prog types expected attach type must be known at
+* load time to verify attach type specific parts of prog
+* (context accesses, allowed helpers, etc).
+*/
+   __u32   expected_attach_type;
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 592a58a2b681..e85a2191f8b5 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -146,26 +146,30 @@ int bpf_create_map_in_map(enum bpf_map_type map_type, 
const char *name,
  -1);
 }
 
-int bpf_load_program_name(enum bpf_prog_type type, const char *name,
- const struct bpf_insn *insns,
- size_t insns_cnt, const char *license,
- __u32 kern_version, char *log_buf,
- size_t log_buf_sz)
+int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
+  char *log_buf, size_t log_buf_sz)
 {
-   int fd;
union bpf_attr attr;
-   __u32 name_len = name ? strlen(name) : 0;
+   __u32 name_len;
+   int fd;
+
+   if (!load_attr)
+   return -EINVAL;
+
+   name_len = load_attr->name ? strlen(load_attr->name) : 0;
 
bzero(, sizeof(attr));
-   attr.prog_type = type;
-   attr.insn_cnt = (__u32)insns_cnt;
-   attr.insns = ptr_to_u64(insns);
-   attr.license = ptr_to_u64(license);
+   attr.prog_type = load_attr->prog_type;
+   attr.expected_attach_type = load_attr->expected_attach_type;
+   attr.insn_cnt = (__u32)load_attr->insns_cnt;
+   attr.insns = ptr_to_u64(load_attr->insns);
+   attr.license = ptr_to_u64(load_attr->license);
attr.log_buf = ptr_to_u64(NULL);
attr.log_size = 0;
attr.log_level = 0;
-   attr.kern_version = kern_version;
-   memcpy(attr.prog_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1));
+   attr.kern_version = load_attr->kern_version;
+   memcpy(attr.prog_name, load_attr->name,
+  min(name_len, BPF_OBJ_NAME_LEN - 1));
 
fd = sys_bpf(BPF_PROG_LOAD, , sizeof(attr));
if (fd >= 0 || !log_buf || !log_buf_sz)
@@ -184,8 +188,18 @@ int bpf_load_program(enum bpf_prog_type type, const struct 
bpf_insn *insns,
 __u32 kern_version, char *log_buf,
 size_t log_buf_sz)
 {
-   return bpf_load_program_name(type, NULL, insns, insns_cnt, license,
-kern_version, log_buf, log_buf_sz);
+   struct bpf_load_program_attr load_attr;
+
+   memset(_attr, 0, sizeof(struct bpf_load_program_attr));
+   load_attr.prog_type = type;
+   load_attr.expected_attach_type = 0;
+   load_attr.name = NULL;
+   load_attr.insns = insns;
+   load_attr.insns_cnt = insns_cnt;
+   load_attr.license = license;
+   load_attr.kern_version = kern_version;
+
+   return bpf_load_program_xattr(_attr, log_buf, log_buf_sz);
 }
 
 int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 8d18fb73d7fb..2f7813d5e357 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -41,13 +41,20 @@ int bpf_create_map_in_map(enum bpf_map_type 

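Typical use of the new entry point, given the fields shown in the diff (a sketch only; the BPF instructions, headers, and error handling are omitted, and the program name is made up):

```c
struct bpf_load_program_attr load_attr = {};
char log_buf[4096];
int prog_fd;

load_attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
load_attr.expected_attach_type = BPF_CGROUP_INET4_CONNECT;
load_attr.name = "connect4";       /* made-up name */
load_attr.insns = insns;           /* previously prepared instructions */
load_attr.insns_cnt = insns_cnt;
load_attr.license = "GPL";
load_attr.kern_version = 0;

prog_fd = bpf_load_program_xattr(&load_attr, log_buf, sizeof(log_buf));
if (prog_fd < 0) {
	/* log_buf now holds the verifier log */
}
```

Because the struct is zero-initialized, callers that don't care about `expected_attach_type` keep the old behavior, which is what lets `bpf_load_program()` be reimplemented on top of the new function.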
[PATCH v2 bpf-next 9/9] selftests/bpf: Selftest for sys_bind post-hooks.

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and
`BPF_CGROUP_INET6_POST_BIND`.

The main things tested are:
* prog load behaves as expected (valid/invalid accesses in prog);
* prog attach behaves as expected (load- vs attach-time attach types);
* `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible
  way;
* post-hooks return expected result and errno.

Example:
  # ./test_sock
  Test case: bind4 load with invalid access: src_ip6 .. [PASS]
  Test case: bind4 load with invalid access: mark .. [PASS]
  Test case: bind6 load with invalid access: src_ip4 .. [PASS]
  Test case: sock_create load with invalid access: src_port .. [PASS]
  Test case: sock_create load w/o expected_attach_type (compat mode) .. [PASS]
  Test case: sock_create load w/ expected_attach_type .. [PASS]
  Test case: attach type mismatch bind4 vs bind6 .. [PASS]
  Test case: attach type mismatch bind6 vs bind4 .. [PASS]
  Test case: attach type mismatch default vs bind4 .. [PASS]
  Test case: attach type mismatch bind6 vs sock_create .. [PASS]
  Test case: bind4 reject all .. [PASS]
  Test case: bind6 reject all .. [PASS]
  Test case: bind6 deny specific IP & port .. [PASS]
  Test case: bind4 allow specific IP & port .. [PASS]
  Test case: bind4 allow all .. [PASS]
  Test case: bind6 allow all .. [PASS]
  Summary: 16 PASSED, 0 FAILED

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h  |  11 +
 tools/testing/selftests/bpf/Makefile|   4 +-
 tools/testing/selftests/bpf/test_sock.c | 479 
 3 files changed, 493 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7a8314342e2d..bd9ad31c0ebc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -150,6 +150,8 @@ enum bpf_attach_type {
BPF_CGROUP_INET6_BIND,
BPF_CGROUP_INET4_CONNECT,
BPF_CGROUP_INET6_CONNECT,
+   BPF_CGROUP_INET4_POST_BIND,
+   BPF_CGROUP_INET6_POST_BIND,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -940,6 +942,15 @@ struct bpf_sock {
__u32 protocol;
__u32 mark;
__u32 priority;
+   __u32 src_ip4;  /* Allows 1,2,4-byte read.
+* Stored in network byte order.
+*/
+   __u32 src_ip6[4];   /* Allows 1,2,4-byte read.
+* Stored in network byte order.
+*/
+   __u32 src_port; /* Allows 4-byte read.
+* Stored in host byte order
+*/
 };
 
 #define XDP_PACKET_HEADROOM 256
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c64d4ebc77ff..0a315ddabbf4 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,8 @@ urandom_read: urandom_read.c
 
 # Order correspond to 'make run_tests' order
TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
-   test_align test_verifier_log test_dev_cgroup test_tcpbpf_user test_sock_addr
+   test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
+   test_sock test_sock_addr
 
TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
@@ -52,6 +53,7 @@ $(TEST_GEN_PROGS): $(BPFOBJ)
 $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
 
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
+$(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 
 .PHONY: force
diff --git a/tools/testing/selftests/bpf/test_sock.c b/tools/testing/selftests/bpf/test_sock.c
new file mode 100644
index ..73bb20cfb9b7
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock.c
@@ -0,0 +1,479 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#include "cgroup_helpers.h"
+
+#ifndef ARRAY_SIZE
+# define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+#define CG_PATH    "/foo"
+#define MAX_INSNS  512
+
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
+struct sock_test {
+   const char *descr;
+   /* BPF prog properties */
+   struct bpf_insn insns[MAX_INSNS];
+   enum bpf_attach_type expected_attach_type;
+   enum bpf_attach_type attach_type;
+   /* Socket properties */
+   int domain;
+   int type;
+   /* Endpoint to bind() to */
+   const char *ip;
+   unsigned short port;
+   /* Expected test result */
+   enum {
+   LOAD_REJECT,
+   ATTACH_REJECT,
+   BIND_REJECT,
+  

[PATCH v2 bpf-next 3/9] bpf: Hooks for sys_bind

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

== The problem ==

There is a use case where all processes inside a cgroup should use one
single IP address on a host that has multiple IPs configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever a TCP/UDP
server calls `bind(2)`, the library replaces the IP in sockaddr before
passing the arguments to the syscall. When an application calls
`connect(2)`, the library transparently binds the local end of the
connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to
avoid a performance penalty).

The shared-library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides a much more reliable in-kernel solution for the first
part of the problem: binding TCP/UDP servers to a desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).

It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and a user-provided `struct sockaddr`. Pointers to both of
them are part of the context passed to programs of the newly added type.

The new attach types provide hooks in the `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override the IP
addresses and ports a user program tries to bind to, and apply such a
program to a whole cgroup.

== Implementation notes ==

[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for the corresponding socket family. E.g. if a user passes
`sockaddr_in` it doesn't make sense to read from / write to the
`user_ip6[]` context fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with the value to write and `dst` with a pointer to the context that
can't be changed without breaking later instructions. But the fields
that are allowed to be written to are not available directly; to access
them, the address of the corresponding pointer has to be loaded first.
To get an additional register, the first one not used by `src` and
`dst` is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`,
the register is then used to load the address of the pointer field, and
finally its content is restored from the temporary field after writing
the `src` value.

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf-cgroup.h |  21 
 include/linux/bpf_types.h  |   1 +
 include/linux/filter.h |  10 ++
 include/uapi/linux/bpf.h   |  23 +
 kernel/bpf/cgroup.c|  36 +++
 kernel/bpf/syscall.c   |  36 +--
 kernel/bpf/verifier.c  |   1 +
 net/core/filter.c  | 232 +
 net/ipv4/af_inet.c |   7 ++
 net/ipv6/af_inet6.c|   7 ++
 10 files changed, 366 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 8a4566691c8f..67dc4a6471ad 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -6,6 +6,7 @@
 #include 
 
 struct sock;
+struct sockaddr;
 struct cgroup;
 struct sk_buff;
 struct bpf_sock_ops_kern;
@@ -63,6 +64,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 int __cgroup_bpf_run_filter_sk(struct sock *sk,
   enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+ struct sockaddr *uaddr,
+ enum bpf_attach_type type);
+
 int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 struct bpf_sock_ops_kern *sock_ops,
 enum bpf_attach_type type);
@@ -103,6 +108,20 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
__ret; \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type)       \
+({\
+  

[PATCH v2 bpf-next 8/9] bpf: Post-hooks for sys_bind

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.

Specifically, it can inspect what port was allocated, and if it doesn't
satisfy some policy, the BPF program can force sys_bind to fail and
return EPERM to the user.

Another example of usage is recording the IP:port pair in some map to
use it in later calls to sys_connect. E.g. if some TCP server inside a
cgroup was bound to some IP:port_n, it can be recorded in a map. Later,
when some TCP client inside the same cgroup tries to connect to
127.0.0.1:port_n, the BPF hook for sys_connect can override the
destination and connect the application to IP:port_n instead of
127.0.0.1:port_n. That helps force all applications inside a cgroup to
use the desired IP without breaking those applications if they e.g. use
localhost to communicate with each other.

== Implementation details ==

Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.

Separate attach types for IPv4 and IPv6 are introduced to avoid access
to the IPv6 fields of `struct sock` from `inet_bind()` and to the IPv4
fields from `inet6_bind()`, since those fields might not make sense in
such cases.

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf-cgroup.h |  16 +--
 include/uapi/linux/bpf.h   |  11 +
 kernel/bpf/syscall.c   |  43 +
 net/core/filter.c  | 116 +++--
 net/ipv4/af_inet.c |  18 ---
 net/ipv6/af_inet6.c|  21 +---
 6 files changed, 195 insertions(+), 30 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index c6ab295e6dcb..30d15e64b993 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -98,16 +98,24 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
__ret; \
 })
 
-#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) \
+#define BPF_CGROUP_RUN_SK_PROG(sk, type)  \
 ({\
int __ret = 0; \
if (cgroup_bpf_enabled) {  \
-   __ret = __cgroup_bpf_run_filter_sk(sk, \
-BPF_CGROUP_INET_SOCK_CREATE); \
+   __ret = __cgroup_bpf_run_filter_sk(sk, type);  \
}  \
__ret; \
 })
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) \
+   BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET_SOCK_CREATE)
+
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk)       \
+   BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET4_POST_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk)       \
+   BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET6_POST_BIND)
+
 #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type)       \
 ({\
int __ret = 0; \
@@ -183,6 +191,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { 
return 0; }
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 969091fc6ee7..9f5f166f9d73 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -150,6 +150,8 @@ enum bpf_attach_type {
BPF_CGROUP_INET6_BIND,
BPF_CGROUP_INET4_CONNECT,
BPF_CGROUP_INET6_CONNECT,
+   BPF_CGROUP_INET4_POST_BIND,
+   BPF_CGROUP_INET6_POST_BIND,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -941,6 +943,15 @@ struct bpf_sock {
__u32 protocol;
__u32 mark;
__u32 priority;
+   __u32 src_ip4;  /* Allows 1,2,4-byte read.
+  

[PATCH v2 bpf-next 4/9] selftests/bpf: Selftest for sys_bind hooks

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

Add selftest to work with bpf_sock_addr context from
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs.

Try to bind(2) on IP:port and apply:
* loads to make sure context can be read correctly, including narrow
  loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by verifier.

All combination from IPv4/IPv6 and TCP/UDP are tested.

Both scenarios are tested:
* valid programs can be loaded and attached;
* invalid programs can be neither loaded nor attached.

The test passes when the expected data can be read from the context in
the BPF program, and when, after the call to bind(2), the socket is
bound to the IP:port pair that the BPF program wrote to the context.

Example:
  # ./test_sock_addr
  Attached bind4 program.
  Test case #1 (IPv4/TCP):
  Requested: bind(192.168.1.254, 4040) ..
 Actual: bind(127.0.0.1, )
  Test case #2 (IPv4/UDP):
  Requested: bind(192.168.1.254, 4040) ..
 Actual: bind(127.0.0.1, )
  Attached bind6 program.
  Test case #3 (IPv6/TCP):
  Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
 Actual: bind(::1, )
  Test case #4 (IPv6/UDP):
  Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
 Actual: bind(::1, )
  ### SUCCESS

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h   |  23 ++
 tools/lib/bpf/libbpf.c   |   6 +
 tools/testing/selftests/bpf/Makefile |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.c | 486 +++
 4 files changed, 517 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b44bcd7814ee..a8ec89065496 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -134,6 +134,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
BPF_PROG_TYPE_SK_MSG,
+   BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 };
 
 enum bpf_attach_type {
@@ -145,6 +146,8 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
BPF_SK_MSG_VERDICT,
+   BPF_CGROUP_INET4_BIND,
+   BPF_CGROUP_INET6_BIND,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1002,6 +1005,26 @@ struct bpf_map_info {
__u64 netns_ino;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach type).
+ */
+struct bpf_sock_addr {
+   __u32 user_family;  /* Allows 4-byte read, but no write. */
+   __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
+* Stored in network byte order.
+*/
+   __u32 user_ip6[4];  /* Allows 1,2,4-byte read and 4-byte write.
+* Stored in network byte order.
+*/
+   __u32 user_port;/* Allows 4-byte read and write.
+* Stored in network byte order
+*/
+   __u32 family;   /* Allows 4-byte read, but no write */
+   __u32 type; /* Allows 4-byte read, but no write */
+   __u32 protocol; /* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
  * and their replies.
 * Some of these fields are in network (big-endian) byte order and may need
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 48e3e743ebf7..d7ce8818982c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1859,6 +1859,9 @@ static void bpf_program__set_expected_attach_type(struct 
bpf_program *prog,
 
 #define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_FULL(string, ptype, 0)
 
+#define BPF_SA_PROG_SEC(string, ptype) \
+   BPF_PROG_SEC_FULL(string, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, ptype)
+
 static const struct {
const char *sec;
size_t len;
@@ -1882,10 +1885,13 @@ static const struct {
BPF_PROG_SEC("sockops", BPF_PROG_TYPE_SOCK_OPS),
BPF_PROG_SEC("sk_skb",  BPF_PROG_TYPE_SK_SKB),
BPF_PROG_SEC("sk_msg",  BPF_PROG_TYPE_SK_MSG),
+   BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
+   BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
 };
 
 #undef BPF_PROG_SEC
 #undef BPF_PROG_SEC_FULL
+#undef BPF_SA_PROG_SEC
 
 static int bpf_program__identify_section(struct bpf_program *prog)
 {
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f35fb02bdf56..f4717c910874 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,7 @@ urandom_read: 

[PATCH v2 bpf-next 0/9] bpf: introduce cgroup-bpf bind, connect, post-bind hooks

2018-03-27 Thread Alexei Starovoitov
v1->v2:
- support expected_attach_type at prog load time so that prog (incl.
  context accesses and calls to helpers) can be validated with regard to
  specific attach point it is supposed to be attached to.
  Later, at attach time, the attach type is checked and must match the
  load-time value if one was provided
- reworked hooks to rely on expected_attach_type, and reduced number of new
  prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
- add selftests for post-sys_bind hook

For our container management we've been using a complicated and fragile
setup consisting of an LD_PRELOAD wrapper intercepting bind and connect
calls from all containerized applications. Unfortunately it doesn't work
for apps that don't use glibc, and changing all applications that run in
the datacenter is not possible due to 3rd-party code and libraries
(despite being open source) and the sheer amount of legacy code that has
to be rewritten (we're rewriting what we can in parallel).

These applications are written without containers in mind and have
builtin assumptions about network services. Like an application X
expects to connect localhost:special_port and find service Y in there.
To move application X and service Y into two different containers
LD_PRELOAD approach is used to help one service connect to another
without rewriting them.
Moving these two applications into different L2 (netns) or L3 (vrf)
network isolation scopes doesn't help to solve the problem, since
applications need to see each other like they were running on
the host without containers.
So if app X and app Y would run in different netns something
would need to punch a connectivity hole in those namespaces.
That would be real layering violation (with corresponding
network debugging pains), since clean l2, l3 abstraction would
suddenly support something that breaks through the layers.

Instead we used LD_PRELOAD (and now bpf programs) at bind/connect
time to help applications discover and connect to each other.
All applications are running in init_netns and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do and the whole system is
clean from network point of view and can be debugged with
standard tools.

We also considered resurrecting Hannes's afnetns work,
but hierarchical namespace abstractions don't work due
to these built-in networking assumptions inside the apps.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature.

BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
  to let bpf prog look and modify 'struct sockaddr' provided
  by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated

The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.

A different use case for this feature is a low-overhead firewall
that doesn't need to inspect all packets and works at bind/connect time.

Andrey Ignatov (9):
  bpf: Check attach type at prog load time
  libbpf: Support expected_attach_type at prog load
  bpf: Hooks for sys_bind
  selftests/bpf: Selftest for sys_bind hooks
  net: Introduce __inet_bind() and __inet6_bind
  bpf: Hooks for sys_connect
  selftests/bpf: Selftest for sys_connect hooks
  bpf: Post-hooks for sys_bind
  selftests/bpf: Selftest for sys_bind post-hooks.

 include/linux/bpf-cgroup.h|  68 ++-
 include/linux/bpf.h   |   5 +-
 include/linux/bpf_types.h |   1 +
 include/linux/filter.h|  11 +
 include/net/inet_common.h |   2 +
 include/net/ipv6.h|   2 +
 include/net/sock.h|   3 +
 include/net/udp.h |   1 +
 include/uapi/linux/bpf.h  |  51 ++-
 kernel/bpf/cgroup.c   |  39 +-
 kernel/bpf/syscall.c  | 102 -
 kernel/bpf/verifier.c |   7 +-
 kernel/trace/bpf_trace.c  |  21 +-
 net/core/filter.c | 435 +--
 net/ipv4/af_inet.c|  71 +++-
 net/ipv4/tcp_ipv4.c   |  16 +
 net/ipv4/udp.c|  14 +
 net/ipv6/af_inet6.c 

[PATCH v2 bpf-next 5/9] net: Introduce __inet_bind() and __inet6_bind

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These functions can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.

New functions have two additional arguments.

`force_bind_address_no_port` forces binding to IP only, w/o checking the
`inet_sock.bind_address_no_port` field. It allows `bpf_bind()` to bind
the local end of a connection to a desired IP w/o changing the socket's
`bind_address_no_port` field. That's useful since `bpf_bind()` can
return an error, and we'd otherwise need to restore the original value
of `bind_address_no_port` if we had changed it before calling the
helper.

`with_lock` specifies whether to lock the socket when working with
`struct sock` or not. The argument is set to `true` for `sk_prot->bind`,
i.e. the old behavior is preserved, but it will be set to `false` for
the `bpf_bind()` use-case. The reason is that all call sites where
`bpf_bind()` will be called already hold the socket lock.

Signed-off-by: Andrey Ignatov 
Acked-by: Alexei Starovoitov 
Signed-off-by: Alexei Starovoitov 
---
 include/net/inet_common.h |  2 ++
 include/net/ipv6.h|  2 ++
 net/ipv4/af_inet.c| 39 ---
 net/ipv6/af_inet6.c   | 37 -
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 500f81375200..384b90c62c0b 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -32,6 +32,8 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+   bool force_bind_address_no_port, bool with_lock);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr,
 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 50a6f0ddb878..2e5fedc56e59 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1066,6 +1066,8 @@ void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info);
 void ipv6_local_rxpmtu(struct sock *sk, struct flowi6 *fl6, u32 mtu);
 
 int inet6_release(struct socket *sock);
+int __inet6_bind(struct sock *sock, struct sockaddr *uaddr, int addr_len,
+bool force_bind_address_no_port, bool with_lock);
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 int inet6_getname(struct socket *sock, struct sockaddr *uaddr,
  int peer);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2dec266507dc..e203a39d6988 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -432,30 +432,37 @@ EXPORT_SYMBOL(inet_release);
 
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-   struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
struct sock *sk = sock->sk;
-   struct inet_sock *inet = inet_sk(sk);
-   struct net *net = sock_net(sk);
-   unsigned short snum;
-   int chk_addr_ret;
-   u32 tb_id = RT_TABLE_LOCAL;
int err;
 
/* If the socket has its own bind function then use it. (RAW) */
if (sk->sk_prot->bind) {
-   err = sk->sk_prot->bind(sk, uaddr, addr_len);
-   goto out;
+   return sk->sk_prot->bind(sk, uaddr, addr_len);
}
-   err = -EINVAL;
if (addr_len < sizeof(struct sockaddr_in))
-   goto out;
+   return -EINVAL;
 
/* BPF prog is run before any checks are done so that if the prog
 * changes context in a wrong way it will be caught.
 */
err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
if (err)
-   goto out;
+   return err;
+
+   return __inet_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet_bind);
+
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+   bool force_bind_address_no_port, bool with_lock)
+{
+   struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+   struct inet_sock *inet = inet_sk(sk);
+   struct net *net = sock_net(sk);
+   unsigned short snum;
+   int chk_addr_ret;
+   u32 tb_id = RT_TABLE_LOCAL;
+   int err;
 
if (addr->sin_family != AF_INET) {
/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
@@ -499,7 +506,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 *  would be illegal to use them (multicast/broadcast) in
 *  which case the sending 

[PATCH v2 bpf-next 7/9] selftests/bpf: Selftest for sys_connect hooks

2018-03-27 Thread Alexei Starovoitov
From: Andrey Ignatov 

Add selftest for BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT
attach types.

Try to connect(2) to specified IP:port and test that:
* remote IP:port pair is overridden;
* local end of connection is bound to specified IP.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Example:
  # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
  [1] 478
  # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
  Wait for testing IPv4/IPv6 to become available ... OK
  Load bind4 with invalid type (can pollute stderr) ... REJECTED
  Load bind4 with valid type ... OK
  Attach bind4 with invalid type ... REJECTED
  Attach bind4 with valid type ... OK
  Load connect4 with invalid type (can pollute stderr) libbpf: load bpf \
program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  0: (b7) r2 = 23569
  1: (63) *(u32 *)(r1 +24) = r2
  2: (b7) r2 = 16777343
  3: (63) *(u32 *)(r1 +4) = r2
  invalid bpf_context access off=4 size=4
  [ 1518.404609] random: crng init done

  libbpf: -- END LOG --
  libbpf: failed to load program 'cgroup/connect4'
  libbpf: failed to load object './connect4_prog.o'
  ... REJECTED
  Load connect4 with valid type ... OK
  Attach connect4 with invalid type ... REJECTED
  Attach connect4 with valid type ... OK
  Test case #1 (IPv4/TCP):
  Requested: bind(192.168.1.254, 4040) ..
 Actual: bind(127.0.0.1, )
  Requested: connect(192.168.1.254, 4040) from (*, *) ..
 Actual: connect(127.0.0.1, ) from (127.0.0.4, 56068)
  Test case #2 (IPv4/UDP):
  Requested: bind(192.168.1.254, 4040) ..
 Actual: bind(127.0.0.1, )
  Requested: connect(192.168.1.254, 4040) from (*, *) ..
 Actual: connect(127.0.0.1, ) from (127.0.0.4, 56447)
  Load bind6 with invalid type (can pollute stderr) ... REJECTED
  Load bind6 with valid type ... OK
  Attach bind6 with invalid type ... REJECTED
  Attach bind6 with valid type ... OK
  Load connect6 with invalid type (can pollute stderr) libbpf: load bpf \
program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  0: (b7) r6 = 0
  1: (63) *(u32 *)(r1 +12) = r6
  invalid bpf_context access off=12 size=4

  libbpf: -- END LOG --
  libbpf: failed to load program 'cgroup/connect6'
  libbpf: failed to load object './connect6_prog.o'
  ... REJECTED
  Load connect6 with valid type ... OK
  Attach connect6 with invalid type ... REJECTED
  Attach connect6 with valid type ... OK
  Test case #3 (IPv6/TCP):
  Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
 Actual: bind(::1, )
  Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
 Actual: connect(::1, ) from (::6, 37458)
  Test case #4 (IPv6/UDP):
  Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
 Actual: bind(::1, )
  Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
 Actual: connect(::1, ) from (::6, 39315)
  ### SUCCESS
  # egrep 'connect\(.*AF_INET' connect.trace | \
  > egrep -vw 'htons\(1025\)' | fold -b -s -w 72
  502   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  502   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  502   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  502   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  # fg
  tcpdump -pn -i lo -w connect.pcap 2> /dev/null
  # tcpdump -r connect.pcap -n tcp | cut -c 1-72
  reading from file connect.pcap, link-type EN10MB (Ethernet)
  17:57:40.383533 IP 127.0.0.4.56068 > 127.0.0.1.: Flags [S], seq 1333
  17:57:40.383566 IP 127.0.0.1. > 127.0.0.4.56068: Flags [S.], seq 112
  17:57:40.383589 IP 127.0.0.4.56068 > 127.0.0.1.: Flags [.], ack 1, w
  17:57:40.384578 IP 127.0.0.1. > 127.0.0.4.56068: Flags [R.], seq 1,
  17:57:40.403327 IP6 ::6.37458 > ::1.: Flags [S], seq 406513443, win
  17:57:40.403357 IP6 ::1. > ::6.37458: Flags [S.], seq 2448389240, ac
  17:57:40.403376 IP6 ::6.37458 > ::1.: Flags [.], ack 1, win 342, opt
  17:57:40.404263 IP6 ::1. > ::6.37458: Flags [R.], seq 1, ack 1, win

Signed-off-by: Andrey Ignatov 
Signed-off-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h|  12 ++-
 tools/lib/bpf/libbpf.c|   2 +
 tools/testing/selftests/bpf/Makefile  |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h |   2 +
 tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
 tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
 tools/testing/selftests/bpf/test_sock_addr.c  | 104 

Re: [PATCH v13 net-next 08/12] crypto : chtls - CPL handler definition

2018-03-27 Thread Atul Gupta


On 3/27/2018 11:12 PM, Stefano Brivio wrote:
> On Tue, 27 Mar 2018 23:06:37 +0530
> Atul Gupta  wrote:
>
>> Exchange messages with hardware to program the TLS session
>> CPL handlers for messages received from chip.
>>
>> Signed-off-by: Atul Gupta 
>> Signed-off-by: Michael Werner 
>> Reviewed-by: Sabrina Dubroca 
>> Reviewed-by: Stefano Brivio 
> No, I haven't.
Stefano,
I was not clear on protocol for reviewed-by tag, I want to acknowledge the 
feedback
you have provided to code so far. Will remove this.
Thanks
Atul
>



Re: [V9fs-developer] [PATCH] net/9p: fix potential refcnt problem of trans module

2018-03-27 Thread jiangyiwen
On 2018/3/28 10:52, cgxu...@gmx.com wrote:
> 在 2018年3月28日,上午10:10,jiangyiwen  写道:
>>
>> On 2018/3/27 20:49, Chengguang Xu wrote:
>>> When specifying trans_mod multiple times in a mount,
>>> it may cause inaccurate refcount of trans module. Also,
>>> in the error case of option parsing, we should put the
>>> trans module if we have already got it.
>>>
>>> Signed-off-by: Chengguang Xu 
>>> ---
>>> net/9p/client.c | 5 -
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/9p/client.c b/net/9p/client.c
>>> index b433aff..7ccfb4b 100644
>>> --- a/net/9p/client.c
>>> +++ b/net/9p/client.c
>>> @@ -190,7 +190,9 @@ static int parse_opts(char *opts, struct p9_client 
>>> *clnt)
>>> p9_debug(P9_DEBUG_ERROR,
>>>  "problem allocating copy of trans 
>>> arg\n");
>>> goto free_and_return;
>>> -}
>>> +   }
>>> +
>>> +   v9fs_put_trans(clnt->trans_mod);
>>
>> I think this should return an error if the option is used multiple times
>> in a mount.
> 
> Failing or retaking are just different policies for handling this kind of
> situation; most filesystems in the kernel choose to retake the new value,
> so we had better stay consistent with the others unless there is a good
> reason not to.
> 
> Thanks,
> Chengguang.
> 

Yes, you're right, most filesystems do indeed retake the new value.

Reviewed-by: Yiwen Jiang 

> 
>>
>>> clnt->trans_mod = v9fs_get_trans_by_name(s);
>>> if (clnt->trans_mod == NULL) {
>>> pr_info("Could not find request transport: 
>>> %s\n",
>>> @@ -226,6 +228,7 @@ static int parse_opts(char *opts, struct p9_client 
>>> *clnt)
>>> }
>>>
>>> free_and_return:
>>> +   v9fs_put_trans(clnt->trans_mod);
>>
>> This looks good.
>>
>>> kfree(tmp_options);
>>> return ret;
>>> }
>>>
>>
>>
> 
> 
> .
> 




Re: [SPAMMY (6.9)]Re: [PATCH v13 net-next 02/12] ethtool: enable Inline TLS in HW

2018-03-27 Thread Atul Gupta


On 3/28/2018 2:14 AM, Sabrina Dubroca wrote:
> 2018-03-27, 23:06:31 +0530, Atul Gupta wrote:
>> Ethtool option enables TLS record offload on HW, user
>> configures the feature for netdev capable of Inline TLS.
>> This allows user to define custom sk_prot for Inline TLS sock
>>
>> Signed-off-by: Atul Gupta 
>> Reviewed-by: Sabrina Dubroca 
> uh, what? I definitely didn't give my "Reviewed-by" for any of these
> patches. Please never do that again.
Sabrina,
I was not clear on the protocol. I wanted to acknowledge the valuable
feedback you have provided. Will remove this. Thanks
>



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>> dma_buffer->foo = 1;/* WB */
>> writel(KICK, DMA_KICK_REGISTER);/* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we have clarified the x86 myth, is this guaranteed on all
architectures? We keep getting the IA64 exception example; maybe that has
also been corrected since then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() calls in common drivers are
not going anywhere, and relaxed architectures will continue paying a
performance penalty.

I see a 15% performance loss on ARM64 servers using the Intel i40e network
driver with an XL710 adapter, because the CPU keeps itself busy doing
barriers most of the time rather than real work, due to sequences like this
all over the place.

 dma_buffer->foo = 1;/* WB */
 wmb()
 writel(KICK, DMA_KICK_REGISTER);/* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()

This is a small step in the performance direction until we remove all 
exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move toward the need for a relaxed API on PPC, and
then the wmb() question came up.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


RE: [PATCH] net: fec: set dma_coherent_mask

2018-03-27 Thread Andy Duan
From: Geert Uytterhoeven  Sent: March 27, 2018 20:59
> Hi Greg,
> 
> On Mon, Mar 26, 2018 at 3:36 PM, Greg Ungerer  wrote:
> > As of commit 205e1b7f51e4 ("dma-mapping: warn when there is no
> > coherent_dma_mask") the Freescale FEC driver is issuing the following
> > warning on driver initialization on ColdFire systems:
> >
> > WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 0x40159e20
> > Modules linked in:
> > CPU: 0 PID: 1 Comm: swapper Not tainted 4.16.0-rc7-dirty #4 Stack from
> > 41833dd8:
> > 41833dd8 40259c53 40025534 40279e26 0003 
> 4004e514 41827000
> > 400255de 40244e42 0204 40159e20 0009 
>  4024531d
> > 40159e20 40244e42 0204   
> 0007 
> >  40279e26 4028d040 40226576 4003ae88 40279e26
> 418273f6 41833ef8
> > 7fff 418273f2 41867028 4003c9a2 4180ac6c 0004 41833f8c
> 4013e71c
> > 40279e1c 40279e26 40226c16 4013ced2 40279e26 40279e58
> 4028d040
> >  Call Trace:
> > [<40025534>] 0x40025534
> >  [<4004e514>] 0x4004e514
> >  [<400255de>] 0x400255de
> >  [<40159e20>] 0x40159e20
> >  [<40159e20>] 0x40159e20
> >
> > It is not fatal, the driver and the system continue to function normally.
> >
> > As per the warning the coherent_dma_mask is not set on this device.
> > There is nothing special about the DMA memory coherency on this
> > hardware so we can just set the mask to 32bits during probe.
> >
> > Signed-off-by: Greg Ungerer 
> 
> Thanks for your patch!
> 
> > ---
> >  drivers/net/ethernet/freescale/fec_main.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > Is this the best way to handle this problem?
> > Comments welcome...
> >
> > diff --git a/drivers/net/ethernet/freescale/fec_main.c
> > b/drivers/net/ethernet/freescale/fec_main.c
> > index d4604bc..3cb130a 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -2702,6 +2702,8 @@ static int fec_enet_alloc_queue(struct net_device
> *ndev)
> > int ret = 0;
> > struct fec_enet_priv_tx_q *txq;
> >
> > +   dma_set_coherent_mask(&fep->pdev->dev, DMA_BIT_MASK(32));
> > +
> > for (i = 0; i < fep->num_tx_queues; i++) {
> > txq = kzalloc(sizeof(*txq), GFP_KERNEL);
> > if (!txq) {
> 
> As per your other email, this does not trigger on iMX systems using DT.
> Hence I'm wondering if the Coldfire platform code shouldn't just do the
> same as what drivers/of/device.c does, cfr.
> https://www.spinics.net/lists/linux-m68k/msg10929.html ?
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
It is better to use of_dma_configure(&pdev->dev, np); with IOMMU support,
the device can also allocate memory from 64-bit space.

Andy


Re: [V9fs-developer] [PATCH] net/9p: fix potential refcnt problem of trans module

2018-03-27 Thread cgxu...@gmx.com
On March 28, 2018, at 10:10 AM, jiangyiwen  wrote:
> 
> On 2018/3/27 20:49, Chengguang Xu wrote:
>> When specifying trans_mod multiple times in a mount,
>> it may cause inaccurate refcount of trans module. Also,
>> in the error case of option parsing, we should put the
>> trans module if we have already got it.
>> 
>> Signed-off-by: Chengguang Xu 
>> ---
>> net/9p/client.c | 5 -
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/9p/client.c b/net/9p/client.c
>> index b433aff..7ccfb4b 100644
>> --- a/net/9p/client.c
>> +++ b/net/9p/client.c
>> @@ -190,7 +190,9 @@ static int parse_opts(char *opts, struct p9_client *clnt)
>>  p9_debug(P9_DEBUG_ERROR,
>>   "problem allocating copy of trans 
>> arg\n");
>>  goto free_and_return;
>> - }
>> +}
>> +
>> +v9fs_put_trans(clnt->trans_mod);
> 
> I think this should return an error if the option is used multiple times
> in a mount.

Failing or retaking are just different policies for handling this kind of
situation; most filesystems in the kernel choose to retake the new value,
so we had better stay consistent with the others unless there is a good
reason not to.

Thanks,
Chengguang.


> 
>>  clnt->trans_mod = v9fs_get_trans_by_name(s);
>>  if (clnt->trans_mod == NULL) {
>>  pr_info("Could not find request transport: 
>> %s\n",
>> @@ -226,6 +228,7 @@ static int parse_opts(char *opts, struct p9_client *clnt)
>>  }
>> 
>> free_and_return:
>> +v9fs_put_trans(clnt->trans_mod);
> 
> This looks good.
> 
>>  kfree(tmp_options);
>>  return ret;
>> }
>> 
> 
> 



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
 wrote:
>
> The discussion at hand is about
>
> dma_buffer->foo = 1;/* WB */
> writel(KICK, DMA_KICK_REGISTER);/* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.

Whether people *do* that or not, I don't know. But I wouldn't be
surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are
ordered wrt MMIO.

Basically, to get unordered writes on x86, you need to either use
explicitly nontemporal stores, or have a writecombining region with
back-to-back writes that actually combine.

And nobody really does that nontemporal store thing any more because
the hardware that cared pretty much doesn't exist any more. It was too
much pain. People use DMA and maybe an UC store for starting the DMA
(or possibly a WC buffer that gets multiple  stores in ascending order
as a stream of commands).

Things like UC will force everything to be entirely ordered, but even
without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some
> people are now saying that x86 might not even provide that semantic
> always.

Oh, the above UC case is absoutely guaranteed.

And I think even if it's WC, the write to kick off the DMA is ordered
wrt the cached write.

On x86, I think you need barriers only if you do things like

 - do two non-temporal stores and require them to be ordered: put a
sfence or mfence in between them.

 - do two WC stores, and make sure they do not combine: put a sfence
or mfence between them.

 - do a store, and a subsequent from a different address, and neither
of them is UC: put a mfence between them. But note that this is
literally just "load after store". A "store after load" doesn't need
one.

I think that's pretty much it.

For example, the "lfence" instruction is almost entirely pointless on
x86 - it was designed back in the time when people *thought* they
might re-order loads. But loads don't get re-ordered. At least Intel
seems to document that only non-temporal *stores* can get re-ordered
wrt each other.

End result: lfence is a historical oddity that can now be used to
guarantee that a previous load has finished, and that in turn meant
that it is now used in the Spectre mitigations. But it basically has
no real memory ordering meaning since nothing passes an earlier load
anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually
_rely_ on the x86 strong ordering?"

We so support "smp_wmb()" even though x86 has strong enough ordering
that just a barrier suffices. Somebody might just say "screw the x86
memory ordering, we're relaxed, and we'll fix up the drivers we care
about".

The only issue really is that 99.9% of all testing gets done on x86
unless you look at specific SoC drivers.

On ARM, for example, there is likely little reason to care about x86
memory ordering, because there is almost zero driver overlap between
x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was
simply that a lot of architectures used the drivers that were
developed on x86. The alpha and powerpc workstations were *designed*
with the x86 IO bus (PCI, then PCIe) and to work with the devices that
came with it.

ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
test coverage" is simply not an issue.

How much of an issue is it for Power? Maybe you decide it's not a big deal.

Then all the above is almost irrelevant.

   Linus


[PATCH net-next] ipv6: export ip6 fragments sysctl to unprivileged users

2018-03-27 Thread Eric Dumazet
IPv4 was changed in commit 52a773d645e9 ("net: Export ip fragment
sysctl to unprivileged users")

The only sysctl that is not per-netns is not used:
ip6frag_secret_interval

Signed-off-by: Eric Dumazet 
Cc: Nikolay Borisov 
---
 net/ipv6/reassembly.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 
afbc000ad4f2912acd475246e2ff2368c1bbeacb..08a139f14d0f6fa8ca326088cce1144411e09bf5
 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -650,10 +650,6 @@ static int __net_init ip6_frags_ns_sysctl_register(struct 
net *net)
table[1].data = &net->ipv6.frags.low_thresh;
table[1].extra2 = &net->ipv6.frags.high_thresh;
table[2].data = &net->ipv6.frags.timeout;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns)
-   table[0].procname = NULL;
}
 
hdr = register_net_sysctl(net, "net/ipv6", table);
-- 
2.17.0.rc1.321.gba9d0f2565-goog



[PATCH net-next] liquidio: Prioritize control messages

2018-03-27 Thread Felix Manlunas
From: Intiyaz Basha 

During heavy tx traffic, control messages (sent by liquidio driver to NIC
firmware) sometimes do not get processed in a timely manner.  Reason is:
the low-level metadata of control messages and that of egress network
packets indicate that they have the same priority.

Fix it by setting a higher priority for control messages through the new
ctrl_qpg field in the oct_txpciq struct.  It is the NIC firmware that does
the actual setting of priority by writing to the new ctrl_qpg field; the
host driver treats that value as opaque and just assigns it to pki_ih3->qpg.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/liquidio_common.h | 8 ++--
 drivers/net/ethernet/cavium/liquidio/request_manager.c | 3 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h 
b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index 82a783d..75eea83 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -712,9 +712,13 @@ enum lio_phy_type {
u64 pkind:6;
u64 use_qpg:1;
u64 qpg:11;
-   u64 reserved:30;
+   u64 reserved0:10;
+   u64 ctrl_qpg:11;
+   u64 reserved:9;
 #else
-   u64 reserved:30;
+   u64 reserved:9;
+   u64 ctrl_qpg:11;
+   u64 reserved0:10;
u64 qpg:11;
u64 use_qpg:1;
u64 pkind:6;
diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c 
b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index 2766af0..b127035 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -628,7 +628,8 @@ static void check_db_timeout(struct work_struct *work)
pki_ih3->tag = LIO_CONTROL;
pki_ih3->tagtype = ATOMIC_TAG;
pki_ih3->qpg =
-   oct->instr_queue[sc->iq_no]->txpciq.s.qpg;
+   oct->instr_queue[sc->iq_no]->txpciq.s.ctrl_qpg;
+
pki_ih3->pm  = 0x7;
pki_ih3->sl  = 8;
 
-- 
1.8.3.1



[PATCH v7 bpf-next 02/10] net/mediatek: disambiguate mt76 vs mt7601u trace events

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

two trace events defined with the same name and both unused.
They conflict in allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov 
---
 drivers/net/wireless/mediatek/mt7601u/trace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt7601u/trace.h 
b/drivers/net/wireless/mediatek/mt7601u/trace.h
index 289897300ef0..82c8898b9076 100644
--- a/drivers/net/wireless/mediatek/mt7601u/trace.h
+++ b/drivers/net/wireless/mediatek/mt7601u/trace.h
@@ -34,7 +34,7 @@
 #define REG_PR_FMT "%04x=%08x"
 #define REG_PR_ARG __entry->reg, __entry->val
 
-DECLARE_EVENT_CLASS(dev_reg_evt,
+DECLARE_EVENT_CLASS(dev_reg_evtu,
TP_PROTO(struct mt7601u_dev *dev, u32 reg, u32 val),
TP_ARGS(dev, reg, val),
TP_STRUCT__entry(
@@ -51,12 +51,12 @@ DECLARE_EVENT_CLASS(dev_reg_evt,
)
 );
 
-DEFINE_EVENT(dev_reg_evt, reg_read,
+DEFINE_EVENT(dev_reg_evtu, reg_read,
TP_PROTO(struct mt7601u_dev *dev, u32 reg, u32 val),
TP_ARGS(dev, reg, val)
 );
 
-DEFINE_EVENT(dev_reg_evt, reg_write,
+DEFINE_EVENT(dev_reg_evtu, reg_write,
TP_PROTO(struct mt7601u_dev *dev, u32 reg, u32 val),
TP_ARGS(dev, reg, val)
 );
-- 
2.9.5



[PATCH v7 bpf-next 10/10] selftests/bpf: test for bpf_get_stackid() from raw tracepoints

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

similar to the traditional tracepoint test, add a bpf_get_stackid() test
from raw tracepoints,
and reduce verbosity of the existing stackmap test

Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_progs.c | 91 
 1 file changed, 70 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index e9df48b306df..faadbe233966 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -877,7 +877,7 @@ static void test_stacktrace_map()
 
err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
-   goto out;
+   return;
 
/* Get the ID for the sched/sched_switch tracepoint */
snprintf(buf, sizeof(buf),
@@ -888,8 +888,7 @@ static void test_stacktrace_map()
 
bytes = read(efd, buf, sizeof(buf));
close(efd);
-   if (CHECK(bytes <= 0 || bytes >= sizeof(buf),
- "read", "bytes %d errno %d\n", bytes, errno))
+   if (bytes <= 0 || bytes >= sizeof(buf))
goto close_prog;
 
/* Open the perf event and attach bpf progrram */
@@ -906,29 +905,24 @@ static void test_stacktrace_map()
goto close_prog;
 
err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
-   if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n",
- err, errno))
-   goto close_pmu;
+   if (err)
+   goto disable_pmu;
 
err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
-   if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n",
- err, errno))
+   if (err)
goto disable_pmu;
 
/* find map fds */
control_map_fd = bpf_find_map(__func__, obj, "control_map");
-   if (CHECK(control_map_fd < 0, "bpf_find_map control_map",
- "err %d errno %d\n", err, errno))
+   if (control_map_fd < 0)
goto disable_pmu;
 
stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
-   if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap",
- "err %d errno %d\n", err, errno))
+   if (stackid_hmap_fd < 0)
goto disable_pmu;
 
stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
-   if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n",
- err, errno))
+   if (stackmap_fd < 0)
goto disable_pmu;
 
/* give some time for bpf program run */
@@ -945,24 +939,78 @@ static void test_stacktrace_map()
err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
  "err %d errno %d\n", err, errno))
-   goto disable_pmu;
+   goto disable_pmu_noerr;
 
err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
  "err %d errno %d\n", err, errno))
-   ; /* fall through */
+   goto disable_pmu_noerr;
 
+   goto disable_pmu_noerr;
 disable_pmu:
+   error_cnt++;
+disable_pmu_noerr:
ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
-
-close_pmu:
close(pmu_fd);
-
 close_prog:
bpf_object__close(obj);
+}
 
-out:
-   return;
+static void test_stacktrace_map_raw_tp()
+{
+   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   const char *file = "./test_stacktrace_map.o";
+   int efd, err, prog_fd;
+   __u32 key, val, duration = 0;
+   struct bpf_object *obj;
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_RAW_TRACEPOINT, &obj, &prog_fd);
+   if (CHECK(err, "prog_load raw tp", "err %d errno %d\n", err, errno))
+   return;
+
+   efd = bpf_raw_tracepoint_open("sched_switch", prog_fd);
+   if (CHECK(efd < 0, "raw_tp_open", "err %d errno %d\n", efd, errno))
+   goto close_prog;
+
+   /* find map fds */
+   control_map_fd = bpf_find_map(__func__, obj, "control_map");
+   if (control_map_fd < 0)
+   goto close_prog;
+
+   stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
+   if (stackid_hmap_fd < 0)
+   goto close_prog;
+
+   stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
+   if (stackmap_fd < 0)
+   goto close_prog;
+
+   /* give some time for bpf program run */
+   sleep(1);
+
+   /* disable stack trace collection */
+   key = 0;
+   val = 1;
+   bpf_map_update_elem(control_map_fd, &key, &val, 0);
+
+   /* for every element in stackid_hmap, we can find a corresponding one
+* in stackmap, and vise versa.
+*/
+   err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
+   if (CHECK(err, "compare_map_keys 

[PATCH v7 bpf-next 04/10] net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

fix iwlwifi_dev_ucode_error tracepoint to pass pointer to a table
instead of all 17 arguments by value.
dvm/main.c and mvm/utils.c have 'struct iwl_error_event_table'
defined with very similar yet subtly different fields and offsets.
The tracepoint is still shared between the two and uses the definition of
'struct iwl_error_event_table' from dvm/commands.h while copying the fields.
Long term this tracepoint probably should be split into two.

Signed-off-by: Alexei Starovoitov 
---
 drivers/net/wireless/intel/iwlwifi/dvm/main.c  |  7 +---
 .../wireless/intel/iwlwifi/iwl-devtrace-iwlwifi.h  | 39 ++
 drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c  |  1 +
 drivers/net/wireless/intel/iwlwifi/mvm/utils.c |  7 +---
 4 files changed, 21 insertions(+), 33 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/dvm/main.c 
b/drivers/net/wireless/intel/iwlwifi/dvm/main.c
index d11d72615de2..e68254e12764 100644
--- a/drivers/net/wireless/intel/iwlwifi/dvm/main.c
+++ b/drivers/net/wireless/intel/iwlwifi/dvm/main.c
@@ -1651,12 +1651,7 @@ static void iwl_dump_nic_error_log(struct iwl_priv *priv)
priv->status, table.valid);
}
 
-   trace_iwlwifi_dev_ucode_error(trans->dev, table.error_id, table.tsf_low,
- table.data1, table.data2, table.line,
- table.blink2, table.ilink1, table.ilink2,
- table.bcon_time, table.gp1, table.gp2,
- table.gp3, table.ucode_ver, table.hw_ver,
- 0, table.brd_ver);
+   trace_iwlwifi_dev_ucode_error(trans->dev, &table, 0, table.brd_ver);
IWL_ERR(priv, "0x%08X | %-28s\n", table.error_id,
desc_lookup(table.error_id));
IWL_ERR(priv, "0x%08X | uPc\n", table.pc);
diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace-iwlwifi.h 
b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace-iwlwifi.h
index 9518a82f44c2..27e3e4e96aa2 100644
--- a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace-iwlwifi.h
+++ b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace-iwlwifi.h
@@ -126,14 +126,11 @@ TRACE_EVENT(iwlwifi_dev_tx,
  __entry->framelen, __entry->skbaddr)
 );
 
+struct iwl_error_event_table;
 TRACE_EVENT(iwlwifi_dev_ucode_error,
-   TP_PROTO(const struct device *dev, u32 desc, u32 tsf_low,
-u32 data1, u32 data2, u32 line, u32 blink2, u32 ilink1,
-u32 ilink2, u32 bcon_time, u32 gp1, u32 gp2, u32 rev_type,
-u32 major, u32 minor, u32 hw_ver, u32 brd_ver),
-   TP_ARGS(dev, desc, tsf_low, data1, data2, line,
-blink2, ilink1, ilink2, bcon_time, gp1, gp2,
-rev_type, major, minor, hw_ver, brd_ver),
+   TP_PROTO(const struct device *dev, const struct iwl_error_event_table 
*table,
+u32 hw_ver, u32 brd_ver),
+   TP_ARGS(dev, table, hw_ver, brd_ver),
TP_STRUCT__entry(
DEV_ENTRY
__field(u32, desc)
@@ -155,20 +152,20 @@ TRACE_EVENT(iwlwifi_dev_ucode_error,
),
TP_fast_assign(
DEV_ASSIGN;
-   __entry->desc = desc;
-   __entry->tsf_low = tsf_low;
-   __entry->data1 = data1;
-   __entry->data2 = data2;
-   __entry->line = line;
-   __entry->blink2 = blink2;
-   __entry->ilink1 = ilink1;
-   __entry->ilink2 = ilink2;
-   __entry->bcon_time = bcon_time;
-   __entry->gp1 = gp1;
-   __entry->gp2 = gp2;
-   __entry->rev_type = rev_type;
-   __entry->major = major;
-   __entry->minor = minor;
+   __entry->desc = table->error_id;
+   __entry->tsf_low = table->tsf_low;
+   __entry->data1 = table->data1;
+   __entry->data2 = table->data2;
+   __entry->line = table->line;
+   __entry->blink2 = table->blink2;
+   __entry->ilink1 = table->ilink1;
+   __entry->ilink2 = table->ilink2;
+   __entry->bcon_time = table->bcon_time;
+   __entry->gp1 = table->gp1;
+   __entry->gp2 = table->gp2;
+   __entry->rev_type = table->gp3;
+   __entry->major = table->ucode_ver;
+   __entry->minor = table->hw_ver;
__entry->hw_ver = hw_ver;
__entry->brd_ver = brd_ver;
),
diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c 
b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
index 50510fb6ab8c..6aa719865a58 100644
--- a/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
+++ b/drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c
@@ -30,6 +30,7 @@
 #ifndef __CHECKER__
 #include "iwl-trans.h"
 
+#include "dvm/commands.h"
 #define CREATE_TRACE_POINTS
 #include "iwl-devtrace.h"
 
diff 

[PATCH v7 bpf-next 01/10] treewide: remove large struct-pass-by-value from tracepoint arguments

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

- fix trace_hfi1_ctxt_info() to pass large struct by reference instead of by 
value
- convert 'type array[]' tracepoint arguments into 'type *array',
  since the compiler warns that sizeof('type array[]') == sizeof('type *array')
  and the latter should be used instead

The CAST_TO_U64 macro in the later patch will enforce that tracepoint
arguments can only be integers, pointers, or less than 8 byte structures.
Larger structures should be passed by reference.

Signed-off-by: Alexei Starovoitov 
---
 drivers/infiniband/hw/hfi1/file_ops.c|  2 +-
 drivers/infiniband/hw/hfi1/trace_ctxts.h | 12 ++--
 include/trace/events/f2fs.h  |  2 +-
 net/wireless/trace.h |  2 +-
 sound/firewire/amdtp-stream-trace.h  |  2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c 
b/drivers/infiniband/hw/hfi1/file_ops.c
index 41fafebe3b0d..da4aa1a95b11 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1153,7 +1153,7 @@ static int get_ctxt_info(struct hfi1_filedata *fd, 
unsigned long arg, u32 len)
cinfo.sdma_ring_size = fd->cq->nentries;
cinfo.rcvegr_size = uctxt->egrbufs.rcvtid_size;
 
-   trace_hfi1_ctxt_info(uctxt->dd, uctxt->ctxt, fd->subctxt, cinfo);
+   trace_hfi1_ctxt_info(uctxt->dd, uctxt->ctxt, fd->subctxt, &cinfo);
if (copy_to_user((void __user *)arg, &cinfo, len))
return -EFAULT;
 
diff --git a/drivers/infiniband/hw/hfi1/trace_ctxts.h 
b/drivers/infiniband/hw/hfi1/trace_ctxts.h
index 4eb4cc798035..e00c8a7d559c 100644
--- a/drivers/infiniband/hw/hfi1/trace_ctxts.h
+++ b/drivers/infiniband/hw/hfi1/trace_ctxts.h
@@ -106,7 +106,7 @@ TRACE_EVENT(hfi1_uctxtdata,
 TRACE_EVENT(hfi1_ctxt_info,
TP_PROTO(struct hfi1_devdata *dd, unsigned int ctxt,
 unsigned int subctxt,
-struct hfi1_ctxt_info cinfo),
+struct hfi1_ctxt_info *cinfo),
TP_ARGS(dd, ctxt, subctxt, cinfo),
TP_STRUCT__entry(DD_DEV_ENTRY(dd)
 __field(unsigned int, ctxt)
@@ -120,11 +120,11 @@ TRACE_EVENT(hfi1_ctxt_info,
TP_fast_assign(DD_DEV_ASSIGN(dd);
__entry->ctxt = ctxt;
__entry->subctxt = subctxt;
-   __entry->egrtids = cinfo.egrtids;
-   __entry->rcvhdrq_cnt = cinfo.rcvhdrq_cnt;
-   __entry->rcvhdrq_size = cinfo.rcvhdrq_entsize;
-   __entry->sdma_ring_size = cinfo.sdma_ring_size;
-   __entry->rcvegr_size = cinfo.rcvegr_size;
+   __entry->egrtids = cinfo->egrtids;
+   __entry->rcvhdrq_cnt = cinfo->rcvhdrq_cnt;
+   __entry->rcvhdrq_size = cinfo->rcvhdrq_entsize;
+   __entry->sdma_ring_size = cinfo->sdma_ring_size;
+   __entry->rcvegr_size = cinfo->rcvegr_size;
),
TP_printk("[%s] ctxt %u:%u " CINFO_FMT,
  __get_str(dev),
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 06c87f9f720c..795698925d20 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -491,7 +491,7 @@ DEFINE_EVENT(f2fs__truncate_node, f2fs_truncate_node,
 
 TRACE_EVENT(f2fs_truncate_partial_nodes,
 
-   TP_PROTO(struct inode *inode, nid_t nid[], int depth, int err),
+   TP_PROTO(struct inode *inode, nid_t *nid, int depth, int err),
 
TP_ARGS(inode, nid, depth, err),
 
diff --git a/net/wireless/trace.h b/net/wireless/trace.h
index 5152938b358d..018c81fa72fb 100644
--- a/net/wireless/trace.h
+++ b/net/wireless/trace.h
@@ -3137,7 +3137,7 @@ TRACE_EVENT(rdev_start_radar_detection,
 
 TRACE_EVENT(rdev_set_mcast_rate,
TP_PROTO(struct wiphy *wiphy, struct net_device *netdev,
-int mcast_rate[NUM_NL80211_BANDS]),
+int *mcast_rate),
TP_ARGS(wiphy, netdev, mcast_rate),
TP_STRUCT__entry(
WIPHY_ENTRY
diff --git a/sound/firewire/amdtp-stream-trace.h 
b/sound/firewire/amdtp-stream-trace.h
index ea0d486652c8..54cdd4ffa9ce 100644
--- a/sound/firewire/amdtp-stream-trace.h
+++ b/sound/firewire/amdtp-stream-trace.h
@@ -14,7 +14,7 @@
 #include 
 
 TRACE_EVENT(in_packet,
-   TP_PROTO(const struct amdtp_stream *s, u32 cycles, u32 cip_header[2], 
unsigned int payload_length, unsigned int index),
+   TP_PROTO(const struct amdtp_stream *s, u32 cycles, u32 *cip_header, 
unsigned int payload_length, unsigned int index),
TP_ARGS(s, cycles, cip_header, payload_length, index),
TP_STRUCT__entry(
__field(unsigned int, second)
-- 
2.9.5



[PATCH v7 bpf-next 09/10] samples/bpf: raw tracepoint test

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

add empty raw_tracepoint bpf program to test overhead similar
to kprobe and traditional tracepoint tests

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile|  1 +
 samples/bpf/bpf_load.c  | 14 ++
 samples/bpf/test_overhead_raw_tp_kern.c | 17 +
 samples/bpf/test_overhead_user.c| 12 
 4 files changed, 44 insertions(+)
 create mode 100644 samples/bpf/test_overhead_raw_tp_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 2c2a587e0942..4d6a6edd4bf6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -119,6 +119,7 @@ always += offwaketime_kern.o
 always += spintest_kern.o
 always += map_perf_test_kern.o
 always += test_overhead_tp_kern.o
+always += test_overhead_raw_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index b1a310c3ae89..bebe4188b4b3 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -61,6 +61,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+   bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
@@ -85,6 +86,8 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
prog_type = BPF_PROG_TYPE_KPROBE;
} else if (is_tracepoint) {
prog_type = BPF_PROG_TYPE_TRACEPOINT;
+   } else if (is_raw_tracepoint) {
+   prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
} else if (is_xdp) {
prog_type = BPF_PROG_TYPE_XDP;
} else if (is_perf_event) {
@@ -131,6 +134,16 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
return populate_prog_array(event, fd);
}
 
+   if (is_raw_tracepoint) {
+   efd = bpf_raw_tracepoint_open(event + 15, fd);
+   if (efd < 0) {
+   printf("tracepoint %s %s\n", event + 15, strerror(errno));
+   return -1;
+   }
+   event_fd[prog_cnt - 1] = efd;
+   return 0;
+   }
+
if (is_kprobe || is_kretprobe) {
if (is_kprobe)
event += 7;
@@ -587,6 +600,7 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
if (memcmp(shname, "kprobe/", 7) == 0 ||
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "tracepoint/", 11) == 0 ||
+   memcmp(shname, "raw_tracepoint/", 15) == 0 ||
memcmp(shname, "xdp", 3) == 0 ||
memcmp(shname, "perf_event", 10) == 0 ||
memcmp(shname, "socket", 6) == 0 ||
diff --git a/samples/bpf/test_overhead_raw_tp_kern.c b/samples/bpf/test_overhead_raw_tp_kern.c
new file mode 100644
index ..d2af8bc1c805
--- /dev/null
+++ b/samples/bpf/test_overhead_raw_tp_kern.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Facebook */
+#include 
+#include "bpf_helpers.h"
+
+SEC("raw_tracepoint/task_rename")
+int prog(struct bpf_raw_tracepoint_args *ctx)
+{
+   return 0;
+}
+
+SEC("raw_tracepoint/urandom_read")
+int prog2(struct bpf_raw_tracepoint_args *ctx)
+{
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_overhead_user.c b/samples/bpf/test_overhead_user.c
index d291167fd3c7..e1d35e07a10e 100644
--- a/samples/bpf/test_overhead_user.c
+++ b/samples/bpf/test_overhead_user.c
@@ -158,5 +158,17 @@ int main(int argc, char **argv)
unload_progs();
}
 
+   if (test_flags & 0xC0) {
+   snprintf(filename, sizeof(filename),
+"%s_raw_tp_kern.o", argv[0]);
+   if (load_bpf_file(filename)) {
+   printf("%s", bpf_log_buf);
+   return 1;
+   }
+   printf("w/RAW_TRACEPOINT\n");
+   run_perf_test(num_cpu, test_flags >> 6);
+   unload_progs();
+   }
+
return 0;
 }
-- 
2.9.5



[PATCH v7 bpf-next 08/10] libbpf: add bpf_raw_tracepoint_open helper

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

add bpf_raw_tracepoint_open(const char *name, int prog_fd) api to libbpf

Signed-off-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h | 11 +++
 tools/lib/bpf/bpf.c| 11 +++
 tools/lib/bpf/bpf.h|  1 +
 3 files changed, 23 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d245c41213ac..58060bec999d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_cmd {
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
+   BPF_RAW_TRACEPOINT_OPEN,
 };
 
 enum bpf_map_type {
@@ -134,6 +135,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
BPF_PROG_TYPE_SK_MSG,
+   BPF_PROG_TYPE_RAW_TRACEPOINT,
 };
 
 enum bpf_attach_type {
@@ -344,6 +346,11 @@ union bpf_attr {
__aligned_u64   prog_ids;
__u32   prog_cnt;
} query;
+
+   struct {
+   __u64 name;
+   __u32 prog_fd;
+   } raw_tracepoint;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
@@ -1151,4 +1158,8 @@ struct bpf_cgroup_dev_ctx {
__u32 minor;
 };
 
+struct bpf_raw_tracepoint_args {
+   __u64 args[0];
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 592a58a2b681..e0500055f1a6 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -428,6 +428,17 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
return err;
 }
 
+int bpf_raw_tracepoint_open(const char *name, int prog_fd)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.raw_tracepoint.name = ptr_to_u64(name);
+   attr.raw_tracepoint.prog_fd = prog_fd;
+
+   return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
+}
+
 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 {
struct sockaddr_nl sa;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 8d18fb73d7fb..ee59342c6f42 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -79,4 +79,5 @@ int bpf_map_get_fd_by_id(__u32 id);
 int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
+int bpf_raw_tracepoint_open(const char *name, int prog_fd);
 #endif
-- 
2.9.5



[PATCH v7 bpf-next 06/10] tracepoint: compute num_args at build time

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

compute the number of arguments passed into a tracepoint at compile time
and store it as part of 'struct tracepoint'.
The number is necessary to check the safety of bpf program accesses,
added in a subsequent patch.

Signed-off-by: Alexei Starovoitov 
Reviewed-by: Steven Rostedt (VMware) 
---
 include/linux/tracepoint-defs.h |  1 +
 include/linux/tracepoint.h  | 12 ++--
 include/trace/define_trace.h| 14 +++---
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 64ed7064f1fa..39a283c61c51 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -33,6 +33,7 @@ struct tracepoint {
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
+   u32 num_args;
 };
 
 #endif
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c94f466d57ef..c92f4adbc0d7 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -230,18 +230,18 @@ extern void syscall_unregfunc(void);
 * structures, so we create an array of pointers that will be used for iteration
  * on the tracepoints.
  */
-#define DEFINE_TRACE_FN(name, reg, unreg)   \
+#define DEFINE_TRACE_FN(name, reg, unreg, num_args) \
static const char __tpstrtab_##name[]\
__attribute__((section("__tracepoints_strings"))) = #name;   \
struct tracepoint __tracepoint_##name\
__attribute__((section("__tracepoints"))) =  \
-   { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
+   { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, num_args };\
static struct tracepoint * const __tracepoint_ptr_##name __used  \
__attribute__((section("__tracepoints_ptrs"))) = \
&__tracepoint_##name;
 
-#define DEFINE_TRACE(name) \
-   DEFINE_TRACE_FN(name, NULL, NULL);
+#define DEFINE_TRACE(name, num_args)   \
+   DEFINE_TRACE_FN(name, NULL, NULL, num_args);
 
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \
EXPORT_SYMBOL_GPL(__tracepoint_##name)
@@ -275,8 +275,8 @@ extern void syscall_unregfunc(void);
return false;   \
}
 
-#define DEFINE_TRACE_FN(name, reg, unreg)
-#define DEFINE_TRACE(name)
+#define DEFINE_TRACE_FN(name, reg, unreg, num_args)
+#define DEFINE_TRACE(name, num_args)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
 #define EXPORT_TRACEPOINT_SYMBOL(name)
 
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index d9e3d4aa3f6e..96b22ace9ae7 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -25,7 +25,7 @@
 
 #undef TRACE_EVENT
 #define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
-   DEFINE_TRACE(name)
+   DEFINE_TRACE(name, COUNT_ARGS(args))
 
 #undef TRACE_EVENT_CONDITION
#define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
@@ -39,24 +39,24 @@
 #undef TRACE_EVENT_FN
 #define TRACE_EVENT_FN(name, proto, args, tstruct, \
assign, print, reg, unreg)  \
-   DEFINE_TRACE_FN(name, reg, unreg)
+   DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
 
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,  \
assign, print, reg, unreg)  \
-   DEFINE_TRACE_FN(name, reg, unreg)
+   DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
 
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args) \
-   DEFINE_TRACE(name)
+   DEFINE_TRACE(name, COUNT_ARGS(args))
 
 #undef DEFINE_EVENT_FN
 #define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
-   DEFINE_TRACE_FN(name, reg, unreg)
+   DEFINE_TRACE_FN(name, reg, unreg, COUNT_ARGS(args))
 
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
-   DEFINE_TRACE(name)
+   DEFINE_TRACE(name, COUNT_ARGS(args))
 
 #undef DEFINE_EVENT_CONDITION
 #define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
@@ -64,7 +64,7 @@
 
 #undef DECLARE_TRACE
 #define DECLARE_TRACE(name, proto, args)   \
-   DEFINE_TRACE(name)
+   DEFINE_TRACE(name, COUNT_ARGS(args))
 
 #undef TRACE_INCLUDE
 #undef __TRACE_INCLUDE
-- 
2.9.5



[PATCH v7 bpf-next 03/10] net/mac802154: disambiguate mac80211 vs mac802154 trace events

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

Two trace events are defined with the same name and both are unused.
They conflict in an allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov 
---
 net/mac802154/trace.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/mac802154/trace.h b/net/mac802154/trace.h
index 2c8a43d3607f..df855c33daf2 100644
--- a/net/mac802154/trace.h
+++ b/net/mac802154/trace.h
@@ -33,7 +33,7 @@
 
 /* Tracing for driver callbacks */
 
-DECLARE_EVENT_CLASS(local_only_evt,
+DECLARE_EVENT_CLASS(local_only_evt4,
TP_PROTO(struct ieee802154_local *local),
TP_ARGS(local),
TP_STRUCT__entry(
@@ -45,7 +45,7 @@ DECLARE_EVENT_CLASS(local_only_evt,
TP_printk(LOCAL_PR_FMT, LOCAL_PR_ARG)
 );
 
-DEFINE_EVENT(local_only_evt, 802154_drv_return_void,
+DEFINE_EVENT(local_only_evt4, 802154_drv_return_void,
TP_PROTO(struct ieee802154_local *local),
TP_ARGS(local)
 );
@@ -65,12 +65,12 @@ TRACE_EVENT(802154_drv_return_int,
  __entry->ret)
 );
 
-DEFINE_EVENT(local_only_evt, 802154_drv_start,
+DEFINE_EVENT(local_only_evt4, 802154_drv_start,
TP_PROTO(struct ieee802154_local *local),
TP_ARGS(local)
 );
 
-DEFINE_EVENT(local_only_evt, 802154_drv_stop,
+DEFINE_EVENT(local_only_evt4, 802154_drv_stop,
TP_PROTO(struct ieee802154_local *local),
TP_ARGS(local)
 );
-- 
2.9.5



[PATCH v7 bpf-next 00/10] bpf, tracing: introduce bpf raw tracepoints

2018-03-27 Thread Alexei Starovoitov
v6->v7:
- adopted Steven's bpf_raw_tp_map section approach to find tracepoint
  and corresponding bpf probe function instead of kallsyms approach.
  dropped kernel_tracepoint_find_by_name() patch

v5->v6:
- avoid changing semantics of for_each_kernel_tracepoint() function, instead
  introduce kernel_tracepoint_find_by_name() helper

v4->v5:
- adopted Daniel's fancy REPEAT macro in bpf_trace.c in patch 7
  
v3->v4:
- adopted Linus's CAST_TO_U64 macro to cast any integer, pointer, or small
  struct to u64. That nicely reduced the size of patch 1

v2->v3:
- with Linus's suggestion introduced generic COUNT_ARGS and CONCATENATE macros
  (or rather moved them from apparmor)
  that cleaned up patches 6 and 7
- added patch 4 to refactor trace_iwlwifi_dev_ucode_error() from 17 args to 4
  Now any tracepoint with >12 args will cause a build error

v1->v2:
- simplified api by combining bpf_raw_tp_open(name) + bpf_attach(prog_fd) into
  bpf_raw_tp_open(name, prog_fd) as suggested by Daniel.
  That simplifies bpf_detach as well which is now simple close() of fd.
- fixed memory leak in error path which was spotted by Daniel.
- fixed bpf_get_stackid(), bpf_perf_event_output() called from raw tracepoints
- added more tests
- fixed allyesconfig build caught by buildbot

v1:
This patch set is a different way to address the pressing need to access
task_struct pointers in sched tracepoints from bpf programs.

The first approach simply added these pointers to sched tracepoints:
https://lkml.org/lkml/2017/12/14/753
which Peter nacked.
A few options were discussed and eventually the discussion converged on
doing bpf specific tracepoint_probe_register() probe functions.
Details here:
https://lkml.org/lkml/2017/12/20/929

Patch 1 is a kernel-wide cleanup converting pass-struct-by-value into
pass-struct-by-reference in tracepoint arguments.

Patches 2 and 3 are minor cleanups to address allyesconfig build

Patch 4 refactors trace_iwlwifi_dev_ucode_error from 17 to 4 args

Patch 5 introduces COUNT_ARGS macro

Patch 6 is minor prep work to expose the number of arguments passed
into tracepoints.

Patch 7 introduces BPF_RAW_TRACEPOINT api.
Auto-cleanup and multiple concurrent users are must-have
features of a tracing API. For bpf raw tracepoints it looks like:
  // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
  prog_fd = bpf_prog_load(...);

  // receive anon_inode fd for given bpf_raw_tracepoint
  // and attach bpf program to it
  raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool will automatically
detach bpf program, unload it and unregister tracepoint probe.
More details in patch 7.

Patch 8 - trivial support in libbpf
Patches 9, 10 - user space tests

samples/bpf/test_overhead performance on 1 cpu:

tracepoint    base   kprobe+bpf  tracepoint+bpf  raw_tracepoint+bpf
task_rename   1.1M   769K        947K            1.0M
urandom_read  789K   697K        750K            755K

Alexei Starovoitov (10):
  treewide: remove large struct-pass-by-value from tracepoint arguments
  net/mediatek: disambiguate mt76 vs mt7601u trace events
  net/mac802154: disambiguate mac80211 vs mac802154 trace events
  net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint
  macro: introduce COUNT_ARGS() macro
  tracepoint: compute num_args at build time
  bpf: introduce BPF_RAW_TRACEPOINT
  libbpf: add bpf_raw_tracepoint_open helper
  samples/bpf: raw tracepoint test
  selftests/bpf: test for bpf_get_stackid() from raw tracepoints

 drivers/infiniband/hw/hfi1/file_ops.c  |   2 +-
 drivers/infiniband/hw/hfi1/trace_ctxts.h   |  12 +-
 drivers/net/wireless/intel/iwlwifi/dvm/main.c  |   7 +-
 .../wireless/intel/iwlwifi/iwl-devtrace-iwlwifi.h  |  39 ++---
 drivers/net/wireless/intel/iwlwifi/iwl-devtrace.c  |   1 +
 drivers/net/wireless/intel/iwlwifi/mvm/utils.c |   7 +-
 drivers/net/wireless/mediatek/mt7601u/trace.h  |   6 +-
 include/asm-generic/vmlinux.lds.h  |  10 ++
 include/linux/bpf_types.h  |   1 +
 include/linux/kernel.h |   7 +
 include/linux/trace_events.h   |  42 +
 include/linux/tracepoint-defs.h|   6 +
 include/linux/tracepoint.h |  12 +-
 include/trace/bpf_probe.h  |  91 ++
 include/trace/define_trace.h   |  15 +-
 include/trace/events/f2fs.h|   2 +-
 include/uapi/linux/bpf.h   |  11 ++
 kernel/bpf/syscall.c   |  78 +
 kernel/trace/bpf_trace.c   | 183 +
 net/mac802154/trace.h  |   8 +-
 net/wireless/trace.h   |   2 +-
 samples/bpf/Makefile   |   1 +
 samples/bpf/bpf_load.c |  14 ++
 samples/bpf/test_overhead_raw_tp_kern.c|  17 ++
 

[PATCH v7 bpf-next 05/10] macro: introduce COUNT_ARGS() macro

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

move the COUNT_ARGS() macro from apparmor to a generic header and extend
it to count up to twelve.

COUNT() was an alternative name for this logic, but it's used for a
different purpose in many other places.

Similarly for CONCATENATE() macro.

Suggested-by: Linus Torvalds 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/kernel.h   | 7 +++
 security/apparmor/include/path.h | 7 +--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..293fa0677fba 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -919,6 +919,13 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
oops_dump_mode) { }
 #define swap(a, b) \
do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)
 
+/* This counts to 12. Any more, it will return 13th argument. */
+#define __COUNT_ARGS(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _n, X...) _n
+#define COUNT_ARGS(X...) __COUNT_ARGS(, ##X, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
+
+#define __CONCAT(a, b) a ## b
+#define CONCATENATE(a, b) __CONCAT(a, b)
+
 /**
  * container_of - cast a member of a structure out to the containing structure
  * @ptr:   the pointer to the member.
diff --git a/security/apparmor/include/path.h b/security/apparmor/include/path.h
index 05fb3305671e..e042b994f2b8 100644
--- a/security/apparmor/include/path.h
+++ b/security/apparmor/include/path.h
@@ -43,15 +43,10 @@ struct aa_buffers {
 
 DECLARE_PER_CPU(struct aa_buffers, aa_buffers);
 
-#define COUNT_ARGS(X...) COUNT_ARGS_HELPER(, ##X, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
-#define COUNT_ARGS_HELPER(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, n, X...) n
-#define CONCAT(X, Y) X ## Y
-#define CONCAT_AFTER(X, Y) CONCAT(X, Y)
-
 #define ASSIGN(FN, X, N) ((X) = FN(N))
 #define EVAL1(FN, X) ASSIGN(FN, X, 0) /*X = FN(0)*/
 #define EVAL2(FN, X, Y...) do { ASSIGN(FN, X, 1);  EVAL1(FN, Y); } while (0)
-#define EVAL(FN, X...) CONCAT_AFTER(EVAL, COUNT_ARGS(X))(FN, X)
+#define EVAL(FN, X...) CONCATENATE(EVAL, COUNT_ARGS(X))(FN, X)
 
 #define for_each_cpu_buffer(I) for ((I) = 0; (I) < MAX_PATH_BUFFERS; (I)++)
 
-- 
2.9.5



[PATCH v7 bpf-next 07/10] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Alexei Starovoitov
From: Alexei Starovoitov 

Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

From the bpf program point of view, access to the arguments looks like:
struct bpf_raw_tracepoint_args {
   __u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
  // program can read args[N] where N depends on tracepoint
  // and statically verified at program load+attach time
}

The kprobe+bpf infrastructure allows programs to access function arguments.
This feature allows programs to access raw tracepoint arguments.

Similar to the proposed 'dynamic ftrace events', there are no ABI guarantees
about what the tracepoint arguments are or what they mean.
The program needs to type-cast args properly and use the bpf_probe_read()
helper to access struct fields when an argument is a pointer.

For every tracepoint __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0x81132080 <+0>: mov%ecx,%ecx
   0x81132082 <+2>: jmpq   0x811231f0 

where

TRACE_EVENT(xdp_exception,
TP_PROTO(const struct net_device *dev,
 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
All ~500 __bpf_trace_*() functions are only 5-10 bytes long,
and in total this approach adds 7k bytes to .text.

This approach gives the lowest possible overhead
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since tracepoint+bpf is used at speeds of 1M+ events per second,
this is a valuable optimization.

The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns anon_inode FD of 'bpf-raw-tracepoint' object.

The user space looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint with prog attached
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool that uses this feature
will automatically detach bpf program, unload it and
unregister tracepoint probe.

On the kernel side the __bpf_raw_tp_map section of pointers to
tracepoint definition and to __bpf_trace_*() probe function is used
to find a tracepoint with "xdp_exception" name and
corresponding __bpf_trace_xdp_exception() probe function
which are passed to tracepoint_probe_register() to connect probe
with tracepoint.

Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
Each with its own bpf program. The kernel will execute
all tracepoint probes and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

__bpf_raw_tp_map section logic was contributed by Steven Rostedt

Signed-off-by: Alexei Starovoitov 
Signed-off-by: Steven Rostedt (VMware) 
---
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/bpf_types.h |   1 +
 include/linux/trace_events.h  |  42 +
 include/linux/tracepoint-defs.h   |   5 ++
 include/trace/bpf_probe.h |  91 +++
 include/trace/define_trace.h  |   1 +
 include/uapi/linux/bpf.h  |  11 +++
 kernel/bpf/syscall.c  |  78 
 kernel/trace/bpf_trace.c  | 183 ++
 9 files changed, 422 insertions(+)
 create mode 100644 include/trace/bpf_probe.h

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6fc..8add3493a202 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -178,6 +178,15 @@
 #define TRACE_SYSCALLS()
 #endif
 
+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() STRUCT_ALIGN();   \
+VMLINUX_SYMBOL(__start__bpf_raw_tp) = .;   \
+KEEP(*(__bpf_raw_tp_map))  \
+VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .;
+#else
+#define BPF_RAW_TP()
+#endif
+
 #ifdef CONFIG_SERIAL_EARLYCON
 #define EARLYCON_TABLE() STRUCT_ALIGN();   \
 VMLINUX_SYMBOL(__earlycon_table) = .;  \
@@ -249,6 +258,7 @@
LIKELY_PROFILE()\
BRANCH_PROFILE()\
TRACE_PRINTKS() \
+   BPF_RAW_TP()\
TRACEPOINT_STR()
 
 /*
diff --git a/include/linux/bpf_types.h 

Re: [V9fs-developer] [PATCH] net/9p: fix potential refcnt problem of trans module

2018-03-27 Thread jiangyiwen
On 2018/3/27 20:49, Chengguang Xu wrote:
> When specifying trans_mod multiple times in a mount,
> it may cause inaccurate refcount of trans module. Also,
> in the error case of option parsing, we should put the
> trans module if we have already got one.
> 
> Signed-off-by: Chengguang Xu 
> ---
>  net/9p/client.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index b433aff..7ccfb4b 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -190,7 +190,9 @@ static int parse_opts(char *opts, struct p9_client *clnt)
>   p9_debug(P9_DEBUG_ERROR,
>"problem allocating copy of trans 
> arg\n");
>   goto free_and_return;
> -  }
> + }
> +
> + v9fs_put_trans(clnt->trans_mod);

I think this should return an error if trans_mod is specified
multiple times in a mount.

>   clnt->trans_mod = v9fs_get_trans_by_name(s);
>   if (clnt->trans_mod == NULL) {
>   pr_info("Could not find request transport: 
> %s\n",
> @@ -226,6 +228,7 @@ static int parse_opts(char *opts, struct p9_client *clnt)
>   }
>  
>  free_and_return:
> + v9fs_put_trans(clnt->trans_mod);

This looks good.

>   kfree(tmp_options);
>   return ret;
>  }
> 




Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-03-27 Thread Jakub Kicinski
On Tue, 27 Mar 2018 18:22:00 -0700, David Ahern wrote:
> +void nsim_devlink_setup(struct netdevsim *ns)
> +{
> + struct net *net = nsim_to_net(ns);
> + bool *reg_devlink = net_generic(net, nsim_devlink_id);
> + struct devlink *devlink;
> + int err = -ENOMEM;
> +
> + /* only one device per namespace controls devlink */
> + if (!*reg_devlink) {
> + ns->devlink = NULL;
> + return;
> + }
> +
> + devlink = devlink_alloc(&nsim_devlink_ops, 0);
> + if (!devlink)
> + return;
> +
> + err = devlink_register(devlink, &ns->dev);
> + if (err)
> + goto err_devlink_free;
> +
> + err = devlink_resources_register(devlink);
> + if (err)
> + goto err_dl_unregister;
> +
> + ns->devlink = devlink;
> +
> + *reg_devlink = false;
> +
> + return;
> +
> +err_dl_unregister:
> + devlink_unregister(devlink);
> +err_devlink_free:
> + devlink_free(devlink);
> +}

nit: DaveM expressed preference to not have silent failures in a
 discussion about DebugFS, not sure it applies here, but why not
 handle errors?


[PATCH net-next 1/6] net: Fix fib notifier to return errno

2018-03-27 Thread David Ahern
Notifier handlers use notifier_from_errno to convert any potential error
to an encoded format. As a consequence the other side, call_fib_notifier{s}
in this case, needs to use notifier_to_errno to return the error from
the handler back to its caller.

Signed-off-by: David Ahern 
---
 net/core/fib_notifier.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 0c048bdeb016..b793b523aba3 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -13,16 +13,22 @@ int call_fib_notifier(struct notifier_block *nb, struct net *net,
  enum fib_event_type event_type,
  struct fib_notifier_info *info)
 {
+   int err;
+
info->net = net;
-   return nb->notifier_call(nb, event_type, info);
+   err = nb->notifier_call(nb, event_type, info);
+   return notifier_to_errno(err);
 }
 EXPORT_SYMBOL(call_fib_notifier);
 
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
   struct fib_notifier_info *info)
 {
+   int err;
+
info->net = net;
-   return atomic_notifier_call_chain(&fib_chain, event_type, info);
+   err = atomic_notifier_call_chain(&fib_chain, event_type, info);
+   return notifier_to_errno(err);
 }
 EXPORT_SYMBOL(call_fib_notifiers);
 
-- 
2.11.0



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 16:10 +0100, Will Deacon wrote:
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?
> 
> The only reason we have wmb() inside writel() on arm, arm64 and power is for
> parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
> memory by default so that it was easier to write portable drivers. The
> performance impact of that implicit barrier is non-trivial, but we want the
> driver portability and I went as far as adding generic _relaxed versions for
> the cases where ordering isn't required. You seem to be suggesting that none
> of this is necessary and drivers would already run into problems on x86 if
> they didn't use wmb() explicitly in conjunction with writel, which I find
> hard to believe and is in direct contradiction with the current Linux I/O
> memory model (modulo the broken example in the dma_*mb section of
> memory-barriers.txt).

Another clarification while we are at it:

All of this only applies to concurrent access by the CPU and the device
to memory allocated with dma_alloc_coherent().

For memory "mapped" into the DMA domain via dma_map_*, an extra
dma_sync_for_* is needed.

In most common server cases these latter are NOPs, but on
architectures without full DMA cache coherency, or ones using swiotlb,
dma_map_* might maintain bounce buffers or play additional
cache-flushing tricks.

Cheers,
Ben.



[PATCH net-next 4/6] net/ipv4: Allow notifier to fail route replace

2018-03-27 Thread David Ahern
Add checking to call to call_fib_entry_notifiers for IPv4 route replace.
Allows a notifier handler to fail the replace.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_trie.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 67116233e2bc..3dcffd3ce98c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1219,8 +1219,13 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
new_fa->tb_id = tb->tb_id;
new_fa->fa_default = -1;
 
-   call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_REPLACE,
-     key, plen, new_fa, extack);
+   err = call_fib_entry_notifiers(net,
+  FIB_EVENT_ENTRY_REPLACE,
+  key, plen, new_fa,
+  extack);
+   if (err)
+   goto out_free_new_fa;
+
rtmsg_fib(RTM_NEWROUTE, htonl(key), new_fa, plen,
  tb->tb_id, &cfg->fc_nlinfo, nlflags);
 
-- 
2.11.0



[PATCH net-next 2/6] net: Move call_fib_rule_notifiers up in fib_nl_newrule

2018-03-27 Thread David Ahern
Move call_fib_rule_notifiers up in fib_nl_newrule to the point right
before the rule is inserted into the list. At this point there are no
more failure paths within the core rule code, so if the notifier
does not fail then the rule will be inserted into the list.

Signed-off-by: David Ahern 
---
 net/core/fib_rules.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9d87ce868402..33958f84c173 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -631,6 +631,11 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
if (err < 0)
goto errout_free;
 
+   err = call_fib_rule_notifiers(net, FIB_EVENT_RULE_ADD, rule, ops,
+ extack);
+   if (err < 0)
+   goto errout_free;
+
list_for_each_entry(r, &ops->rules_list, list) {
if (r->pref > rule->pref)
break;
@@ -667,7 +672,6 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
if (rule->tun_id)
ip_tunnel_need_metadata();
 
-   call_fib_rule_notifiers(net, FIB_EVENT_RULE_ADD, rule, ops, extack);
notify_rule_change(RTM_NEWRULE, rule, ops, nlh, NETLINK_CB(skb).portid);
flush_route_cache(ops);
rules_ops_put(ops);
-- 
2.11.0



[PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-03-27 Thread David Ahern
Add devlink support to netdevsim and use it to implement a simple,
profile based resource controller. Only one controller is needed
per namespace, so the first netdevsim netdevice in a namespace
registers with devlink. If that device is deleted, the resource
settings are deleted.

The resource controller allows a user to limit the number of IPv4 and
IPv6 FIB entries and FIB rules. The resource paths are:
/IPv4
/IPv4/fib
/IPv4/fib-rules
/IPv6
/IPv6/fib
/IPv6/fib-rules

The IPv4 and IPv6 top level resources are unlimited in size and can not
be changed. From there, the number of FIB entries and FIB rule entries
are unlimited by default. A user can specify a limit for the fib and
fib-rules resources:

$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size 96
$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
$ devlink dev reload netdevsim/netdevsim0

such that the number of rules or routes is limited (96 ipv4 routes in the
example above):
$ for n in $(seq 1 32); do ip ro add 10.99.$n.0/24 dev eth1; done
Error: netdevsim: Exceeded number of supported fib entries.

$ devlink resource show netdevsim/netdevsim0
netdevsim/netdevsim0:
  name IPv4 size unlimited unit entry size_min 0 size_max unlimited size_gran 1 dpipe_tables none
resources:
  name fib size 96 occ 96 unit entry size_min 0 size_max unlimited size_gran 1 dpipe_tables
...

With this template in place for resource management, it is fairly trivial
to extend and shows one way to implement a simple counter based resource
controller typical of network profiles.

Currently, devlink only supports the initial namespace. Code is in place to
adapt netdevsim to a per namespace controller once the network namespace
issues are resolved.

Signed-off-by: David Ahern 
---
 drivers/net/Kconfig   |   1 +
 drivers/net/netdevsim/Makefile|   4 +
 drivers/net/netdevsim/devlink.c   | 294 ++
 drivers/net/netdevsim/fib.c   | 263 ++
 drivers/net/netdevsim/netdev.c|  12 +-
 drivers/net/netdevsim/netdevsim.h |  43 ++
 6 files changed, 616 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/netdevsim/devlink.c
 create mode 100644 drivers/net/netdevsim/fib.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 08b85215c2be..891846655000 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -500,6 +500,7 @@ source "drivers/net/hyperv/Kconfig"
 config NETDEVSIM
tristate "Simulated networking device"
depends on DEBUG_FS
+   depends on MAY_USE_DEVLINK
help
  This driver is a developer testing tool and software model that can
  be used to test various control path networking APIs, especially
diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 09388c06171d..449b2a1a1800 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -9,3 +9,7 @@ ifeq ($(CONFIG_BPF_SYSCALL),y)
 netdevsim-objs += \
bpf.o
 endif
+
+ifneq ($(CONFIG_NET_DEVLINK),)
+netdevsim-objs += devlink.o fib.o
+endif
diff --git a/drivers/net/netdevsim/devlink.c b/drivers/net/netdevsim/devlink.c
new file mode 100644
index ..bbdcf064ba10
--- /dev/null
+++ b/drivers/net/netdevsim/devlink.c
@@ -0,0 +1,294 @@
+/*
+ * Copyright (c) 2018 Cumulus Networks. All rights reserved.
+ * Copyright (c) 2018 David Ahern 
+ *
+ * This software is licensed under the GNU General Public License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree.
+ *
+ * THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
+ * WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
+ * BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
+ * OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
+ * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+ */
+
+#include 
+#include 
+#include 
+
+#include "netdevsim.h"
+
+static unsigned int nsim_devlink_id;
+
+/* place holder until devlink and namespaces is sorted out */
+static struct net *nsim_devlink_net(struct devlink *devlink)
+{
+   return &init_net;
+}
+
+/* IPv4
+ */
+static u64 nsim_ipv4_fib_resource_occ_get(struct devlink *devlink)
+{
+   struct net *net = nsim_devlink_net(devlink);
+
+   return nsim_fib_get_val(net, NSIM_RESOURCE_IPV4_FIB, false);
+}
+
+static struct devlink_resource_ops nsim_ipv4_fib_res_ops = {
+   .occ_get = nsim_ipv4_fib_resource_occ_get,
+};
+
+static u64 nsim_ipv4_fib_rules_res_occ_get(struct devlink *devlink)
+{
+   struct net 

[PATCH net-next 0/6] net: Allow FIB notifiers to fail add and replace

2018-03-27 Thread David Ahern
I wanted to revisit how resource overload is handled for hardware offload
of FIB entries and rules. At the moment, the in-kernel fib notifier can
tell a driver about a route or rule add, replace, and delete, but the
notifier can not affect the action. Specifically, in the case of mlxsw
if a route or rule add is going to overflow the ASIC resources the only
recourse is to abort hardware offload. Aborting offload is akin to taking
down the switch as the path from data plane to the control plane simply
can not support the traffic bandwidth of the front panel ports. Further,
the current state of FIB notifiers is inconsistent with other resources
where a driver can affect a user request - e.g., enslavement of a port
into a bridge or a VRF.

As a result of the work done over the past 3+ years, I believe we are
at a point where we can bring consistency to the stack and offloads,
and reliably allow the FIB notifiers to fail a request, pushing an error
along with a suitable error message back to the user. Rather than
aborting offload when the switch is out of resources, userspace is simply
prevented from adding more routes and has a clear indication of why.

This set does not resolve the corner case where rules or routes not
supported by the device are installed prior to the driver getting loaded
and registering for FIB notifications. In that case, hardware offload has
not been established and the driver can refuse to offload anything, sending
errors back to userspace via extack. Since conceptually the driver owns
the netdevices associated with its asic, this corner case mainly applies
to unsupported rules and any races during the bringup phase.

Patch 1 fixes call_fib_notifiers to extract the errno from the encoded
response from handlers.
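
The encoding fixed by patch 1 packs a negative errno into the notifier
chain's return value together with the stop bit; the caller must unwrap it
again. A self-contained model of that round trip (constants and helper
bodies are a simplified sketch of the idea, not verbatim copies of
include/linux/notifier.h):

```c
#include <assert.h>

/* Toy model of the notifier-chain errno encoding: a handler folds a
 * negative errno into its return value together with the stop bit so
 * the chain stops, and the caller recovers the errno afterwards.
 */
#define NOTIFY_OK	 0x0001
#define NOTIFY_STOP_MASK 0x8000

int encode_errno(int err)	/* err is 0 or a negative errno */
{
	if (!err)
		return NOTIFY_OK;
	return NOTIFY_STOP_MASK | (NOTIFY_OK - err);
}

int decode_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}
```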

Patches 2-5 allow the call to call_fib_notifiers to fail the add or
replace of a route or rule.

Patch 6 adds a simple resource controller to netdevsim to illustrate
how a FIB resource controller can limit the number of route entries.

Changes since RFC
- correct return code for call_fib_notifier
- dropped patch 6 exporting devlink symbols
- limited example resource controller to init_net only
- updated Kconfig for netdevsim to use MAY_USE_DEVLINK
- updated cover letter regarding startup case noted by Ido

David Ahern (6):
  net: Fix fib notifer to return errno
  net: Move call_fib_rule_notifiers up in fib_nl_newrule
  net/ipv4: Move call_fib_entry_notifiers up for new routes
  net/ipv4: Allow notifier to fail route replace
  net/ipv6: Move call_fib6_entry_notifiers up for route adds
  netdevsim: Add simple FIB resource controller via devlink

 drivers/net/Kconfig   |   1 +
 drivers/net/netdevsim/Makefile|   4 +
 drivers/net/netdevsim/devlink.c   | 294 ++
 drivers/net/netdevsim/fib.c   | 263 ++
 drivers/net/netdevsim/netdev.c|  12 +-
 drivers/net/netdevsim/netdevsim.h |  43 ++
 net/core/fib_notifier.c   |  10 +-
 net/core/fib_rules.c  |   6 +-
 net/ipv4/fib_trie.c   |  27 +++-
 net/ipv6/ip6_fib.c|  16 ++-
 10 files changed, 664 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/netdevsim/devlink.c
 create mode 100644 drivers/net/netdevsim/fib.c

-- 
2.11.0



[PATCH net-next 5/6] net/ipv6: Move call_fib6_entry_notifiers up for route adds

2018-03-27 Thread David Ahern
Move call to call_fib6_entry_notifiers for new IPv6 routes to right
before the insertion into the FIB. At this point notifier handlers can
decide the fate of the new route with a clean path to delete the
potential new entry if the notifier returns non-0.

Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 908b8e5b615a..deab2db6692e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1007,12 +1007,16 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
if (err)
return err;
 
+   err = call_fib6_entry_notifiers(info->nl_net,
+   FIB_EVENT_ENTRY_ADD,
+   rt, extack);
+   if (err)
+   return err;
+
rcu_assign_pointer(rt->rt6_next, iter);
atomic_inc(&rt->rt6i_ref);
rcu_assign_pointer(rt->rt6i_node, fn);
rcu_assign_pointer(*ins, rt);
-   call_fib6_entry_notifiers(info->nl_net, FIB_EVENT_ENTRY_ADD,
- rt, extack);
if (!info->skip_notify)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
info->nl_net->ipv6.rt6_stats->fib_rt_entries++;
@@ -1036,12 +1040,16 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
if (err)
return err;
 
+   err = call_fib6_entry_notifiers(info->nl_net,
+   FIB_EVENT_ENTRY_REPLACE,
+   rt, extack);
+   if (err)
+   return err;
+
atomic_inc(&rt->rt6i_ref);
rcu_assign_pointer(rt->rt6i_node, fn);
rt->rt6_next = iter->rt6_next;
rcu_assign_pointer(*ins, rt);
-   call_fib6_entry_notifiers(info->nl_net, FIB_EVENT_ENTRY_REPLACE,
- rt, extack);
if (!info->skip_notify)
inet6_rt_notify(RTM_NEWROUTE, rt, info, NLM_F_REPLACE);
if (!(fn->fn_flags & RTN_RTINFO)) {
-- 
2.11.0



[PATCH net-next 3/6] net/ipv4: Move call_fib_entry_notifiers up for new routes

2018-03-27 Thread David Ahern
Move call to call_fib_entry_notifiers for new IPv4 routes to right
before the call to fib_insert_alias. At this point the only remaining
failure path is memory allocations in fib_insert_node. Handle that
very unlikely failure with a call to call_fib_entry_notifiers to
tell drivers about it.

At this point notifier handlers can decide the fate of the new route
with a clean path to delete the potential new entry if the notifier
returns non-0.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_trie.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index fac0b73e24d1..67116233e2bc 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1065,6 +1065,9 @@ static int fib_insert_node(struct trie *t, struct key_vector *tp,
return -ENOMEM;
 }
 
+/* fib notifier for ADD is sent before calling fib_insert_alias with
+ * the expectation that the only possible failure is ENOMEM
+ */
 static int fib_insert_alias(struct trie *t, struct key_vector *tp,
struct key_vector *l, struct fib_alias *new,
struct fib_alias *fa, t_key key)
@@ -1263,21 +1266,32 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
new_fa->tb_id = tb->tb_id;
new_fa->fa_default = -1;
 
+   err = call_fib_entry_notifiers(net, event, key, plen, new_fa, extack);
+   if (err)
+   goto out_free_new_fa;
+
/* Insert new entry to the list. */
err = fib_insert_alias(t, tp, l, new_fa, fa, key);
if (err)
-   goto out_free_new_fa;
+   goto out_fib_notif;
 
if (!plen)
tb->tb_num_default++;
 
rt_cache_flush(cfg->fc_nlinfo.nl_net);
-   call_fib_entry_notifiers(net, event, key, plen, new_fa, extack);
rtmsg_fib(RTM_NEWROUTE, htonl(key), new_fa, plen, new_fa->tb_id,
  >fc_nlinfo, nlflags);
 succeeded:
return 0;
 
+out_fib_notif:
+   /* notifier was sent that entry would be added to trie, but
+* the add failed and we need to recover. Only failure for
+* fib_insert_alias is ENOMEM.
+*/
+   NL_SET_ERR_MSG(extack, "Failed to insert route into trie");
+   call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, key,
+plen, new_fa, NULL);
 out_free_new_fa:
kmem_cache_free(fn_alias_kmem, new_fa);
 out:
-- 
2.11.0



[PATCH iproute2-next 0/2] more JSON support

2018-03-27 Thread Stephen Hemminger
From: Stephen Hemminger 

Add JSON to ILA and L2TP display

Stephen Hemminger (2):
  ip/ila: support json and color
  ip/l2tp: add JSON support

 ip/ipila.c  |  76 +++---
 ip/ipl2tp.c | 152 
 2 files changed, 140 insertions(+), 88 deletions(-)

-- 
2.16.2



[PATCH iproute2-next 2/2] ip/l2tp: add JSON support

2018-03-27 Thread Stephen Hemminger
From: Stephen Hemminger 

Convert ip l2tp to use JSON output routines.

Signed-off-by: Stephen Hemminger 
---
 ip/ipl2tp.c | 152 
 1 file changed, 103 insertions(+), 49 deletions(-)

diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 8aaee747e294..750f912aa96a 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -204,15 +204,22 @@ static int delete_session(struct l2tp_parm *p)
return 0;
 }
 
-static void print_cookie(char *name, const uint8_t *cookie, int len)
+static void print_cookie(const char *name, const char *fmt,
+const uint8_t *cookie, int len)
 {
-   printf("  %s %02x%02x%02x%02x", name,
-  cookie[0], cookie[1],
-  cookie[2], cookie[3]);
+   char abuf[32];
+   size_t n;
+
+   n = snprintf(abuf, sizeof(abuf),
+"%02x%02x%02x%02x",
+cookie[0], cookie[1], cookie[2], cookie[3]);
if (len == 8)
-   printf("%02x%02x%02x%02x",
-  cookie[4], cookie[5],
-  cookie[6], cookie[7]);
+   snprintf(abuf + n, sizeof(abuf) - n,
+"%02x%02x%02x%02x",
+cookie[4], cookie[5],
+cookie[6], cookie[7]);
+
+   print_string(PRINT_ANY, name, fmt, abuf);
 }
 
 static void print_tunnel(const struct l2tp_data *data)
@@ -220,74 +227,115 @@ static void print_tunnel(const struct l2tp_data *data)
const struct l2tp_parm *p = &data->config;
char buf[INET6_ADDRSTRLEN];
 
-   printf("Tunnel %u, encap %s\n",
-  p->tunnel_id,
-  p->encap == L2TP_ENCAPTYPE_UDP ? "UDP" :
-  p->encap == L2TP_ENCAPTYPE_IP ? "IP" : "??");
-   printf("  From %s ",
-  inet_ntop(p->local_ip.family, p->local_ip.data,
-buf, sizeof(buf)));
-   printf("to %s\n",
-  inet_ntop(p->peer_ip.family, p->peer_ip.data,
-buf, sizeof(buf)));
-   printf("  Peer tunnel %u\n",
-  p->peer_tunnel_id);
+   open_json_object(NULL);
+   print_uint(PRINT_ANY, "tunnel_id", "Tunnel %u,", p->tunnel_id);
+   print_string(PRINT_ANY, "encap", " encap %s",
+p->encap == L2TP_ENCAPTYPE_UDP ? "UDP" :
+p->encap == L2TP_ENCAPTYPE_IP ? "IP" : "??");
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+
+   print_string(PRINT_ANY, "local", "  From %s ",
+inet_ntop(p->local_ip.family, p->local_ip.data,
+  buf, sizeof(buf)));
+   print_string(PRINT_ANY, "peer", "to %s",
+inet_ntop(p->peer_ip.family, p->peer_ip.data,
+  buf, sizeof(buf)));
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+
+   print_uint(PRINT_ANY, "peer_tunnel", "  Peer tunnel %u",
+  p->peer_tunnel_id);
+   print_string(PRINT_FP, NULL, "%s", _SL_);
 
if (p->encap == L2TP_ENCAPTYPE_UDP) {
-   printf("  UDP source / dest ports: %hu/%hu\n",
-  p->local_udp_port, p->peer_udp_port);
+   print_string(PRINT_FP, NULL,
+"  UDP source / dest ports:", NULL);
+
+   print_uint(PRINT_ANY, "local_port", " %hu",
+  p->local_udp_port);
+   print_uint(PRINT_ANY, "peer_port", "/%hu",
+  p->peer_udp_port);
+   print_string(PRINT_FP, NULL, "%s", _SL_);
 
switch (p->local_ip.family) {
case AF_INET:
-   printf("  UDP checksum: %s\n",
-  p->udp_csum ? "enabled" : "disabled");
+   print_bool(PRINT_JSON, "checksum",
+  NULL, p->udp_csum);
+   print_string(PRINT_FP, NULL,
+"  UDP checksum: %s\n",
+p->udp_csum ? "enabled" : "disabled");
break;
case AF_INET6:
-   printf("  UDP checksum: %s%s%s%s\n",
-  p->udp6_csum_tx && p->udp6_csum_rx
-  ? "enabled" : "",
-  p->udp6_csum_tx && !p->udp6_csum_rx
-  ? "tx" : "",
-  !p->udp6_csum_tx && p->udp6_csum_rx
-  ? "rx" : "",
-  !p->udp6_csum_tx && !p->udp6_csum_rx
-  ? "disabled" : "");
+   if (is_json_context()) {
+   print_bool(PRINT_JSON, "checksum_tx",
+  NULL, p->udp6_csum_tx);
+
+   print_bool(PRINT_JSON, "checksum_rx",
+  NULL, 

[PATCH iproute2-next 1/2] ip/ila: support json and color

2018-03-27 Thread Stephen Hemminger
From: Stephen Hemminger 

Use json print to enhance ila output.

Signed-off-by: Stephen Hemminger 
---
 ip/ipila.c | 76 ++
 1 file changed, 37 insertions(+), 39 deletions(-)

diff --git a/ip/ipila.c b/ip/ipila.c
index 9a324296ffd6..370385c0c375 100644
--- a/ip/ipila.c
+++ b/ip/ipila.c
@@ -23,6 +23,7 @@
 #include "utils.h"
 #include "ip_common.h"
 #include "ila_common.h"
+#include "json_print.h"
 
 static void usage(void)
 {
@@ -47,9 +48,7 @@ static int genl_family = -1;
 #define ILA_RTA(g) ((struct rtattr *)(((char *)(g)) +  \
NLMSG_ALIGN(sizeof(struct genlmsghdr
 
-#define ADDR_BUF_SIZE sizeof("xxxx:xxxx:xxxx:xxxx")
-
-static int print_addr64(__u64 addr, char *buff, size_t len)
+static void print_addr64(__u64 addr, char *buff, size_t len)
 {
__u16 *words = (__u16 *)&addr;
__u16 v;
@@ -64,38 +63,27 @@ static int print_addr64(__u64 addr, char *buff, size_t len)
sep = "";
 
ret = snprintf(&buff[written], len - written, "%x%s", v, sep);
-   if (ret < 0)
-   return ret;
-
written += ret;
}
-
-   return written;
 }
 
-static void print_ila_locid(FILE *fp, int attr, struct rtattr *tb[], int space)
+static void print_ila_locid(const char *tag, int attr, struct rtattr *tb[])
 {
char abuf[256];
-   size_t blen;
-   int i;
 
-   if (tb[attr]) {
-   blen = print_addr64(rta_getattr_u64(tb[attr]),
-   abuf, sizeof(abuf));
-   fprintf(fp, "%s", abuf);
-   } else {
-   fprintf(fp, "-");
-   blen = 1;
-   }
+   if (tb[attr])
+   print_addr64(rta_getattr_u64(tb[attr]),
+abuf, sizeof(abuf));
+   else
+   snprintf(abuf, sizeof(abuf), "-");
 
-   for (i = 0; i < space - blen; i++)
-   fprintf(fp, " ");
+   /* 20 = sizeof("xxxx:xxxx:xxxx:xxxx") */
+   print_string(PRINT_ANY, tag, "%-20s", abuf);
 }
 
 static int print_ila_mapping(const struct sockaddr_nl *who,
 struct nlmsghdr *n, void *arg)
 {
-   FILE *fp = (FILE *)arg;
struct genlmsghdr *ghdr;
struct rtattr *tb[ILA_ATTR_MAX + 1];
int len = n->nlmsg_len;
@@ -110,31 +98,38 @@ static int print_ila_mapping(const struct sockaddr_nl *who,
ghdr = NLMSG_DATA(n);
parse_rtattr(tb, ILA_ATTR_MAX, (void *) ghdr + GENL_HDRLEN, len);
 
-   print_ila_locid(fp, ILA_ATTR_LOCATOR_MATCH, tb, ADDR_BUF_SIZE);
-   print_ila_locid(fp, ILA_ATTR_LOCATOR, tb, ADDR_BUF_SIZE);
+   open_json_object(NULL);
+   print_ila_locid("locator_match", ILA_ATTR_LOCATOR_MATCH, tb);
+   print_ila_locid("locator", ILA_ATTR_LOCATOR, tb);
 
-   if (tb[ILA_ATTR_IFINDEX])
-   fprintf(fp, "%-16s",
-   ll_index_to_name(rta_getattr_u32(
-   tb[ILA_ATTR_IFINDEX])));
-   else
-   fprintf(fp, "%-10s ", "-");
+   if (tb[ILA_ATTR_IFINDEX]) {
+   __u32 ifindex
+   = rta_getattr_u32(tb[ILA_ATTR_IFINDEX]);
 
-   if (tb[ILA_ATTR_CSUM_MODE])
-   fprintf(fp, "%s",
-   ila_csum_mode2name(rta_getattr_u8(
-   tb[ILA_ATTR_CSUM_MODE])));
-   else
-   fprintf(fp, "%-10s ", "-");
+   print_color_string(PRINT_ANY, COLOR_IFNAME,
+  "interface", "%-16s",
+  ll_index_to_name(ifindex));
+   } else {
+   print_string(PRINT_FP, NULL, "%-10s ", "-");
+   }
+
+   if (tb[ILA_ATTR_CSUM_MODE]) {
+   __u8 csum = rta_getattr_u8(tb[ILA_ATTR_CSUM_MODE]);
+
+   print_string(PRINT_ANY, "csum_mode", "%s",
+ila_csum_mode2name(csum));
+   } else
+   print_string(PRINT_FP, NULL, "%-10s ", "-");
 
if (tb[ILA_ATTR_IDENT_TYPE])
-   fprintf(fp, "%s",
+   print_string(PRINT_ANY, "ident_type", "%s",
ila_ident_type2name(rta_getattr_u8(
tb[ILA_ATTR_IDENT_TYPE])));
else
-   fprintf(fp, "-");
+   print_string(PRINT_FP, NULL, "%s", "-");
 
-   fprintf(fp, "\n");
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+   close_json_object();
 
return 0;
 }
@@ -156,10 +151,13 @@ static int do_list(int argc, char **argv)
exit(1);
}
 
+   new_json_obj(json);
if (rtnl_dump_filter(&genl_rth, print_ila_mapping, stdout) < 0) {
fprintf(stderr, "Dump terminated\n");
return 1;
}
+   delete_json_obj();
+   fflush(stdout);
 
return 0;
 }
-- 
2.16.2



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 14:39 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
>  wrote:
> > 
> > Well, we need to clarify that once and for all, because as I wrote
> > earlier, it was decreed by Linus more than a decade ago that writel
> > would be fully ordered by itself vs. previous memory stores (at least
> > on UC memory).
> 
> Yes.
> 
> So "writel()" needs to be ordered with respect to other writel() uses
> on the same thread. Anything else *will* break drivers. Obviously, the
> drivers may then do magic to say "do write combining etc", but that
> magic will be architecture-specific.
> 
> The other issue is that "writel()" needs to be ordered wrt other CPU's
> doing "writel()" if those writel's are in a spinlocked region.

 .../...

The discussion at hand is about

dma_buffer->foo = 1;			/* WB */
writel(KICK, DMA_KICK_REGISTER);	/* UC */

(The WC case is something else, let's not mix things up just yet)

IE, a store to normal WB cache memory followed by a writel to a device
which will then DMA from that buffer.

Back in the days, we did require on powerpc a wmb() between these, but
you made the point that x86 didn't and driver writers would never get
that right.

We decided to go conservative, added the necessary barrier inside
writel, so did ARM and it became the norm that writel is also fully
ordered vs. previous stores to memory *by the same CPU* of course (or
protected by the same spinlock).

Now it appears that this wasn't fully understood back then, and some
people are now saying that x86 might not even provide that semantic
always.

So a number (fairly large) of drivers have been adding wmb() in those
case, while others haven't, and it's a mess.

The mess is compounded by the fact that if writel is now defined to
*not* provide that ordering guarantee, then writel_relaxed() is
pointless since all it is defined to relax is precisely the above
ordering guarantee.
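
As a toy model of that relationship: a fully ordered writel() behaves like
wmb() followed by writel_relaxed(), which is exactly why relaxing nothing
would make writel_relaxed() redundant. Real barriers cannot be observed
from portable user-space C, so this sketch only records the order in which
the accesses are issued; all names are stand-ins for the kernel APIs, not
the kernel implementation:

```c
#include <assert.h>

/* Single-threaded toy model: each stub appends the operation it models
 * to a log so the required ordering can be checked afterwards.
 */
enum { OP_MEM_STORE, OP_BARRIER, OP_MMIO_WRITE };

int log_ops[8];
int n_ops;

void mem_store(void)           { log_ops[n_ops++] = OP_MEM_STORE; }
void wmb_stub(void)            { log_ops[n_ops++] = OP_BARRIER; }
void writel_relaxed_stub(void) { log_ops[n_ops++] = OP_MMIO_WRITE; }

/* A fully ordered writel() behaves like "barrier, then relaxed write" */
void writel_stub(void)
{
	wmb_stub();
	writel_relaxed_stub();
}

/* The driver pattern from the thread: fill the buffer, then kick DMA */
void kick_dma(void)
{
	mem_store();	/* dma_buffer->foo = 1;             (WB memory) */
	writel_stub();	/* writel(KICK, DMA_KICK_REGISTER); (UC) */
}
```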

So I want to get to the bottom of this once and for all so we can have
well defined and documented semantics and stop having drivers do random
things that may or may not work on some or all architectures (including
x86 !).

Quick note about the spinlock case... In fact this is the only case you
did allow back then to be relaxed. In theory a writel followed by a
spin_unlock requires an mmiowb (which is the only point of that barrier
in fact). This was done because an arch (I think ia64) had a hard time
getting MMIOs from multiple CPUs to stay in order vs. a lock and required
an expensive access to the PCI host bridge to do so.

Back then, on powerpc, we chose not to allow that relaxing and instead
added code to our writel to set a per-cpu flag which would cause the
next spin_unlock to use a stronger barrier than usual.
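
Stripped to its essentials, that per-cpu flag trick reads roughly like the
following (illustrative stubs, not the actual powerpc implementation):

```c
#include <assert.h>

/* writel() sets a per-CPU flag; the next spin_unlock() notices the flag
 * and issues the stronger barrier itself (modeled here as a counter),
 * so drivers never need an explicit mmiowb().
 */
int io_pending;		/* per-CPU flag in the real implementation */
int strong_barriers;	/* number of "sync"-style barriers issued */

void writel_stub(void)
{
	/* ... MMIO store happens here ... */
	io_pending = 1;
}

void spin_unlock_stub(void)
{
	if (io_pending) {
		strong_barriers++;	/* order MMIO vs. the lock release */
		io_pending = 0;
	}
	/* ... release the lock ... */
}
```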

We do need to clarify this as well, but let's start with the most basic
one first, there is enough confusion already.

Cheers,
Ben.


[Resend Patch 2/3] Netvsc: Use the vmbus function to calculate ring buffer percentage

2018-03-27 Thread Long Li
From: Long Li 

In Vmbus, we have defined a function to calculate available ring buffer
percentage to write.

Use that function and remove netvsc's private version.

Signed-off-by: Long Li 
---
 drivers/net/hyperv/hyperv_net.h |  1 -
 drivers/net/hyperv/netvsc.c | 17 +++--
 drivers/net/hyperv/netvsc_drv.c |  3 ---
 3 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index cd538d5a7986..a0199ab13d67 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -189,7 +189,6 @@ struct netvsc_device;
 struct net_device_context;
 
 extern u32 netvsc_ring_bytes;
-extern struct reciprocal_value netvsc_ring_reciprocal;
 
 struct netvsc_device *netvsc_device_add(struct hv_device *device,
const struct netvsc_device_info *info);
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0265d703eb03..8af0069e4d8c 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 
@@ -590,17 +589,6 @@ void netvsc_device_remove(struct hv_device *device)
 #define RING_AVAIL_PERCENT_HIWATER 20
 #define RING_AVAIL_PERCENT_LOWATER 10
 
-/*
- * Get the percentage of available bytes to write in the ring.
- * The return value is in range from 0 to 100.
- */
-static u32 hv_ringbuf_avail_percent(const struct hv_ring_buffer_info *ring_info)
-{
-   u32 avail_write = hv_get_bytes_to_write(ring_info);
-
-   return reciprocal_divide(avail_write  * 100, netvsc_ring_reciprocal);
-}
-
 static inline void netvsc_free_send_slot(struct netvsc_device *net_device,
 u32 index)
 {
@@ -649,7 +637,8 @@ static void netvsc_send_tx_complete(struct netvsc_device *net_device,
wake_up(&net_device->wait_drain);
 
if (netif_tx_queue_stopped(netdev_get_tx_queue(ndev, q_idx)) &&
-   (hv_ringbuf_avail_percent(&channel->outbound) > RING_AVAIL_PERCENT_HIWATER ||
+   (hv_get_avail_to_write_percent(&channel->outbound) >
+RING_AVAIL_PERCENT_HIWATER ||
 queue_sends < 1)) {
netif_tx_wake_queue(netdev_get_tx_queue(ndev, q_idx));
ndev_ctx->eth_stats.wake_queue++;
@@ -757,7 +746,7 @@ static inline int netvsc_send_pkt(
struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
u64 req_id;
int ret;
-   u32 ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
+   u32 ring_avail = hv_get_avail_to_write_percent(&out_channel->outbound);
 
nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
if (skb)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index faea0be18924..b0b1c2fd2b7b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -35,7 +35,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -55,7 +54,6 @@ static unsigned int ring_size __ro_after_init = 128;
 module_param(ring_size, uint, S_IRUGO);
 MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)");
 unsigned int netvsc_ring_bytes __ro_after_init;
-struct reciprocal_value netvsc_ring_reciprocal __ro_after_init;
 
 static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE |
NETIF_MSG_LINK | NETIF_MSG_IFUP |
@@ -2186,7 +2184,6 @@ static int __init netvsc_drv_init(void)
ring_size);
}
netvsc_ring_bytes = ring_size * PAGE_SIZE;
-   netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes);
 
ret = vmbus_driver_register(&netvsc_drv);
if (ret)
-- 
2.14.1



[Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-03-27 Thread Long Li
From: Long Li 

This is a best effort at estimating how busy the ring buffer is for
that channel, based on the available buffer to write in percentage. It is
still possible that at the time of the actual ring buffer write, the space
may not be available because other processes may be writing at that time.

Selecting a channel based on how full it is can reduce the possibility that
a ring buffer write will fail, and avoids the situation where a channel is
overly busy.

Now it's possible that storvsc can use a smaller ring buffer size
(e.g. 40k bytes) to take advantage of cache locality.
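
The selection policy reduces to: pick the first candidate channel with more
than the low-water percentage of ring space available, otherwise fall back
to a default channel. A simplified sketch (names, the array layout and the
fixed threshold are illustrative, not the storvsc code):

```c
#include <assert.h>
#include <stddef.h>

/* Walk candidate channels and take the first whose available ring space
 * (in percent) is above the low-water mark; fall back to a default
 * channel when all candidates are busy.
 */
#define LOWATER_PERCENT 10

struct toy_channel {
	unsigned int avail_percent;	/* available ring space to write */
};

struct toy_channel *select_channel(struct toy_channel *chans, int n,
				   struct toy_channel *fallback)
{
	int i;

	for (i = 0; i < n; i++)
		if (chans[i].avail_percent > LOWATER_PERCENT)
			return &chans[i];
	return fallback;	/* every candidate is at or below low water */
}
```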

Signed-off-by: Long Li 
---
 drivers/scsi/storvsc_drv.c | 62 +-
 1 file changed, 50 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index a2ec0bc9e9fa..b1a87072b3ab 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
 
 module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to subchannels");
+
+static int ring_avail_percent_lowater = 10;
+module_param(ring_avail_percent_lowater, int, S_IRUGO);
+MODULE_PARM_DESC(ring_avail_percent_lowater,
+   "Select a channel if available ring size > this in percent");
+
 /*
  * Timeout in seconds for all devices managed by this driver.
  */
@@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device *device,
 {
struct storvsc_device *stor_device;
struct vstor_packet *vstor_packet;
-   struct vmbus_channel *outgoing_channel;
+   struct vmbus_channel *outgoing_channel, *channel;
int ret = 0;
-   struct cpumask alloced_mask;
+   struct cpumask alloced_mask, other_numa_mask;
int tgt_cpu;
 
vstor_packet = &request->vstor_packet;
@@ -1301,22 +1307,53 @@ static int storvsc_do_io(struct hv_device *device,
/*
 * Select an an appropriate channel to send the request out.
 */
-
if (stor_device->stor_chns[q_num] != NULL) {
outgoing_channel = stor_device->stor_chns[q_num];
-   if (outgoing_channel->target_cpu == smp_processor_id()) {
+   if (outgoing_channel->target_cpu == q_num) {
/*
 * Ideally, we want to pick a different channel if
 * available on the same NUMA node.
 */
cpumask_and(&alloced_mask, &stor_device->alloced_cpus,
cpumask_of_node(cpu_to_node(q_num)));
-   for_each_cpu_wrap(tgt_cpu, &alloced_mask,
-   outgoing_channel->target_cpu + 1) {
-   if (tgt_cpu != outgoing_channel->target_cpu) {
-   outgoing_channel =
-   stor_device->stor_chns[tgt_cpu];
-   break;
+
+   for_each_cpu_wrap(tgt_cpu, &alloced_mask, q_num + 1) {
+   if (tgt_cpu == q_num)
+   continue;
+   channel = stor_device->stor_chns[tgt_cpu];
+   if (hv_get_avail_to_write_percent(
+   &channel->outbound)
+   > ring_avail_percent_lowater) {
+   outgoing_channel = channel;
+   goto found_channel;
+   }
+   }
+
+   /*
+* All the other channels on the same NUMA node are
+* busy. Try to use the channel on the current CPU
+*/
+   if (hv_get_avail_to_write_percent(
+   &outgoing_channel->outbound)
+   > ring_avail_percent_lowater)
+   goto found_channel;
+
+   /*
+* If we reach here, all the channels on the current
+* NUMA node are busy. Try to find a channel in
+* other NUMA nodes
+*/
+   cpumask_andnot(&other_numa_mask,
+   &stor_device->alloced_cpus,
+   cpumask_of_node(cpu_to_node(q_num)));
+
+   for_each_cpu(tgt_cpu, &other_numa_mask) {
+   channel = stor_device->stor_chns[tgt_cpu];
+   if (hv_get_avail_to_write_percent(
+   &channel->outbound)
+   > ring_avail_percent_lowater) {
+

[Resend Patch 1/3] Vmbus: Add function to report available ring buffer to write in total ring size percentage

2018-03-27 Thread Long Li
From: Long Li 

Netvsc has a function to calculate how much ring buffer in percentage is
available to write. This function is also useful for storvsc and other
vmbus devices.

Define a similar function in vmbus to be used by other vmbus devices.
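
For reference, the arithmetic behind the new helper: avail * 100 /
ring_size is computed as 10 * avail divided by ring_size / 10, with the
multiply-by-ten done as (x << 3) + (x << 1) and the division replaced at
run time by a precomputed reciprocal. A sketch with a plain division
standing in for reciprocal_divide() (see include/linux/reciprocal_div.h
for the real fixed-point machinery):

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel helper, but with an ordinary division
 * in place of the reciprocal multiply; valid while ring_size >= 10 and
 * avail_write stays small enough that 10*avail_write fits in 32 bits.
 */
uint32_t avail_to_write_percent(uint32_t avail_write, uint32_t ring_size)
{
	uint32_t mul10 = (avail_write << 3) + (avail_write << 1); /* 10*x */

	return mul10 / (ring_size / 10);	/* == avail*100/ring_size */
}
```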

Signed-off-by: Long Li 
---
 drivers/hv/ring_buffer.c |  2 ++
 include/linux/hyperv.h   | 12 
 2 files changed, 14 insertions(+)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 8699bb969e7e..3c836c099a8f 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -227,6 +227,8 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
ring_info->ring_buffer->feature_bits.value = 1;
 
ring_info->ring_size = page_cnt << PAGE_SHIFT;
+   ring_info->ring_size_div10_reciprocal =
+   reciprocal_value(ring_info->ring_size / 10);
ring_info->ring_datasize = ring_info->ring_size -
sizeof(struct hv_ring_buffer);
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 2048f3c3b68a..eb7204851089 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define MAX_PAGE_BUFFER_COUNT  32
 #define MAX_MULTIPAGE_BUFFER_COUNT 32 /* 128K */
@@ -121,6 +122,7 @@ struct hv_ring_buffer {
 struct hv_ring_buffer_info {
struct hv_ring_buffer *ring_buffer;
u32 ring_size;  /* Include the shared header */
+   struct reciprocal_value ring_size_div10_reciprocal;
spinlock_t ring_lock;
 
u32 ring_datasize;  /* < ring_size */
@@ -155,6 +157,16 @@ static inline u32 hv_get_bytes_to_write(const struct hv_ring_buffer_info *rbi)
return write;
 }
 
+static inline u32 hv_get_avail_to_write_percent(
+   const struct hv_ring_buffer_info *rbi)
+{
+   u32 avail_write = hv_get_bytes_to_write(rbi);
+
+   return reciprocal_divide(
+   (avail_write  << 3) + (avail_write << 1),
+   rbi->ring_size_div10_reciprocal);
+}
+
 /*
  * VMBUS version is 32 bit entity broken up into
  * two 16 bit quantities: major_number. minor_number.
-- 
2.14.1



Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Alexei Starovoitov

On 3/27/18 5:44 PM, Mathieu Desnoyers wrote:

- On Mar 27, 2018, at 8:00 PM, Alexei Starovoitov a...@fb.com wrote:


On 3/27/18 4:13 PM, Mathieu Desnoyers wrote:

- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov a...@fb.com wrote:


On 3/27/18 2:04 PM, Steven Rostedt wrote:


+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() . = ALIGN(8); \


Given that the section consists of a 16-bytes structure elements
on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
be turned into a STRUCT_ALIGN(), especially given that the compiler
is free to up-align the structure on 32 bytes.


STRUCT_ALIGN fixed the 'off by 8' issue with kasan,
but it fails without kasan too.
For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp
comes inited with :
[   22.703562] i 1 btp 8288e530 btp->tp  func

[   22.704638] i 2 btp 8288e540 btp->tp  func

[   22.705599] i 3 btp 8288e550 btp->tp  func

[   22.706551] i 4 btp 8288e560 btp->tp  func

[   22.707503] i 5 btp 8288e570 btp->tp  func

[   22.708452] i 6 btp 8288e580 btp->tp  func

[   22.709406] i 7 btp 8288e590 btp->tp  func

[   22.710368] i 8 btp 8288e5a0 btp->tp  func


while gdb shows that everything is good inside vmlinux
for exactly these addresses.
Some other linker magic missing?


No, Steven's iteration code is incorrect.

+extern struct bpf_raw_event_map __start__bpf_raw_tp;
+extern struct bpf_raw_event_map __stop__bpf_raw_tp;

That should be:

extern struct bpf_raw_event_map __start__bpf_raw_tp[];
extern struct bpf_raw_event_map __stop__bpf_raw_tp[];


+
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+const struct bpf_raw_event_map *btp = &__start__bpf_raw_tp;

const struct bpf_raw_event_map *btp = __start__bpf_raw_tp;

+int i = 0;
+
+for (; btp < &__stop__bpf_raw_tp; btp++) {

for (; btp < __stop__bpf_raw_tp; btp++) {

Those start/stop symbols are given their address by the linker
automatically (this is a GNU linker extension). We don't want
pointers to the symbols, but rather the symbols per se to act
as start/stop addresses.
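
For anyone following along, the extension Mathieu refers to can be shown in
a few lines of user-space C (assuming a GCC/clang toolchain with GNU ld,
which synthesizes __start_<section>/__stop_<section> for any section whose
name is a valid C identifier; section and struct names below are made up
for the demo):

```c
#include <assert.h>

/* Declaring the linker-provided symbols as arrays rather than pointers
 * makes the symbols themselves the addresses -- the exact fix suggested
 * for __start__bpf_raw_tp above.
 */
struct rec { int val; };

static struct rec a __attribute__((used, section("demo_sec"))) = { 1 };
static struct rec b __attribute__((used, section("demo_sec"))) = { 2 };

extern struct rec __start_demo_sec[];	/* arrays, not pointers */
extern struct rec __stop_demo_sec[];

int sum_section(void)
{
	const struct rec *p;
	int sum = 0;

	for (p = __start_demo_sec; p < __stop_demo_sec; p++)
		sum += p->val;
	return sum;
}
```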


right. that part I fixed first.

Turned out it was in init.data section and got poisoned.
this fixes it:
@@ -258,6 +258,7 @@
LIKELY_PROFILE()\
BRANCH_PROFILE()\
TRACE_PRINTKS() \
+   BPF_RAW_TP()\
TRACEPOINT_STR()

 /*
@@ -585,7 +586,6 @@
*(.init.rodata) \
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
-   BPF_RAW_TP()\
KPROBE_BLACKLIST()  \
ERROR_INJECT_WHITELIST()\
MEM_DISCARD(init.rodata)\

and it works :)
I will clean few other nits I found while debugging and respin.



Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()

2018-03-27 Thread NeilBrown
On Wed, Mar 28 2018, Andreas Grünbacher wrote:

> Neil,
>
> 2018-03-27 1:33 GMT+02:00 NeilBrown :
>> The documentation for rhashtable_walk_peek() is wrong.  It claims to
>> return the *next* entry, whereas it in fact returns the *previous*
>> entry.
>> However if no entries have yet been returned - or if the iterator
>> was reset due to a resize event, then rhashtable_walk_peek()
>> *does* return the next entry, but also advances the iterator.
>>
>> I suspect that this interface should be discarded and the one user
>> should be changed to not require it.  Possibly this patch should be
>> seen as a first step in that conversation.
>>
>> This patch mostly corrects the documentation, but does make a
>> small code change so that the documentation can be correct without
>> listing too many special cases.  I don't think the one user will
>> be affected by the code change.
>
> how about I come up with a replacement so that we can remove
> rhashtable_walk_peek straight away without making it differently
> broken in the meantime?
>

Hi Andreas,
 I'd be very happy with that outcome - thanks for the offer!

NeilBrown




Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-27 Thread William Tu
> Indeed. Intel iommu has least effect on RX because of premap/recycle.
> But TX dma map and unmap is really expensive!
>
>>
>> Basically the IOMMU can make creating/destroying a DMA mapping really
>> expensive. The easiest way to work around it in the case of the Intel
>> IOMMU is to boot with "iommu=pt" which will create an identity mapping
>> for the host. The downside is though that you then have the entire
>> system accessible to the device unless a new mapping is created for it
>> by assigning it to a new IOMMU domain.
>
>
> Yeah thats what I would say, If you really want to use intel iommu and
> don't want to hit by performance , use 'iommu=pt'.
>
> Good to have confirmation from you Alex. Thanks.
>

Thanks for the suggestion! Updated my performance numbers:

without iommu=pt (posted before)
Benchmark   XDP_SKB
rxdrop  2.3 Mpps
txpush 1.05 Mpps
l2fwd0.90 Mpps

with iommu=pt (new)
Benchmark   XDP_SKB
rxdrop  2.24 Mpps
txpush 1.54 Mpps
l2fwd1.23 Mpps

TX indeed shows a better rate, while RX remains the same.
William


[net-next 07/15] net/mlx5e: Remove rq_headroom field from params

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

It can be derived from other params; calculate it
via the dedicated function when needed.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 +++-
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ba7f1ceb6dcd..ff9aeda186a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -230,7 +230,6 @@ enum mlx5e_priv_flag {
 struct mlx5e_params {
u8  log_sq_size;
u8  rq_wq_type;
-   u16 rq_headroom;
u8  log_rq_size;
u16 num_channels;
u8  num_tc;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 65e6955713e7..4907b7bb08e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -92,6 +92,19 @@ u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev 
*mdev,
mlx5e_mpwqe_get_log_stride_size(mdev, params);
 }
 
+static u16 mlx5e_get_rq_headroom(struct mlx5e_params *params)
+{
+   u16 linear_rq_headroom = params->xdp_prog ?
+   XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
+
+   linear_rq_headroom += NET_IP_ALIGN;
+
+   if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST)
+   return linear_rq_headroom;
+
+   return 0;
+}
+
 void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
   struct mlx5e_params *params, u8 rq_type)
 {
@@ -107,12 +120,9 @@ void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
params->log_rq_size = is_kdump_kernel() ?
MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE :
MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
-   params->rq_headroom = params->xdp_prog ?
-   XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-   params->rq_headroom += NET_IP_ALIGN;
 
/* Extra room needed for build_skb */
-   params->lro_wqe_sz -= params->rq_headroom +
+   params->lro_wqe_sz -= mlx5e_get_rq_headroom(params) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
}
 
@@ -441,7 +451,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
goto err_rq_wq_destroy;
 
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
-   rq->buff.headroom = params->rq_headroom;
+   rq->buff.headroom = mlx5e_get_rq_headroom(params);
 
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-- 
2.14.3



[net-next 08/15] net/mlx5e: Do not reset Receive Queue params on every type change

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

Do not implicitly call mlx5e_init_rq_type_params() upon every
change in RQ type. It should be called only on channel creation.

Fixes: 2fc4bfb7250d ("net/mlx5e: Dynamic RQ type infrastructure")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  3 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 +++
 drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  3 ++-
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ff9aeda186a1..45d0c64e77e5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -918,8 +918,7 @@ void mlx5e_set_tx_cq_mode_params(struct mlx5e_params 
*params,
 void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params,
 u8 cq_period_mode);
 void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
-  struct mlx5e_params *params,
-  u8 rq_type);
+  struct mlx5e_params *params);
 
 static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 4907b7bb08e0..ffe3b2469032 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -106,9 +106,8 @@ static u16 mlx5e_get_rq_headroom(struct mlx5e_params 
*params)
 }
 
 void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
-  struct mlx5e_params *params, u8 rq_type)
+  struct mlx5e_params *params)
 {
-   params->rq_wq_type = rq_type;
params->lro_wqe_sz = MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ;
switch (params->rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
@@ -135,15 +134,14 @@ void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
 
 static bool slow_pci_heuristic(struct mlx5_core_dev *mdev);
 
-static void mlx5e_set_rq_params(struct mlx5_core_dev *mdev,
-   struct mlx5e_params *params)
+static void mlx5e_set_rq_type(struct mlx5_core_dev *mdev,
+ struct mlx5e_params *params)
 {
-   u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) &&
+   params->rq_wq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) &&
!slow_pci_heuristic(mdev) &&
!params->xdp_prog && !MLX5_IPSEC_DEV(mdev) ?
MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
MLX5_WQ_TYPE_LINKED_LIST;
-   mlx5e_init_rq_type_params(mdev, params, rq_type);
 }
 
 static void mlx5e_update_carrier(struct mlx5e_priv *priv)
@@ -3736,7 +3734,7 @@ static int mlx5e_xdp_set(struct net_device *netdev, 
struct bpf_prog *prog)
bpf_prog_put(old_prog);
 
if (reset) /* change RQ type according to priv->xdp_prog */
-   mlx5e_set_rq_params(priv->mdev, &priv->channels.params);
+   mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
if (was_opened && reset)
mlx5e_open_locked(netdev);
@@ -4029,7 +4027,8 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS, 
params->rx_cqe_compress_def);
 
/* RQ */
-   mlx5e_set_rq_params(mdev, params);
+   mlx5e_set_rq_type(mdev, params);
+   mlx5e_init_rq_type_params(mdev, params);
 
/* HW LRO */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index f953378bd13d..870584a07c48 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -56,7 +56,8 @@ static void mlx5i_build_nic_params(struct mlx5_core_dev *mdev,
   struct mlx5e_params *params)
 {
/* Override RQ params as IPoIB supports only LINKED LIST RQ for now */
-   mlx5e_init_rq_type_params(mdev, params, MLX5_WQ_TYPE_LINKED_LIST);
+   params->rq_wq_type = MLX5_WQ_TYPE_LINKED_LIST;
+   mlx5e_init_rq_type_params(mdev, params);
 
/* RQ size in ipoib by default is 512 */
params->log_rq_size = is_kdump_kernel() ?
-- 
2.14.3



[net-next 12/15] mlx5_{ib,core}: Add query SQ state helper function

2018-03-27 Thread Saeed Mahameed
From: Eran Ben Elisha 

Move query SQ state function from mlx5_ib to mlx5_core in order to
have it in shared code.

It will be used in a downstream patch from mlx5e.

Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/qp.c| 14 +---
 drivers/net/ethernet/mellanox/mlx5/core/transobj.c | 25 ++
 include/linux/mlx5/transobj.h  |  1 +
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 85c612ac547a..0d0b0b8dad98 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -4739,26 +4739,14 @@ static int query_raw_packet_qp_sq_state(struct 
mlx5_ib_dev *dev,
struct mlx5_ib_sq *sq,
u8 *sq_state)
 {
-   void *out;
-   void *sqc;
-   int inlen;
int err;
 
-   inlen = MLX5_ST_SZ_BYTES(query_sq_out);
-   out = kvzalloc(inlen, GFP_KERNEL);
-   if (!out)
-   return -ENOMEM;
-
-   err = mlx5_core_query_sq(dev->mdev, sq->base.mqp.qpn, out);
+   err = mlx5_core_query_sq_state(dev->mdev, sq->base.mqp.qpn, sq_state);
if (err)
goto out;
-
-   sqc = MLX5_ADDR_OF(query_sq_out, out, sq_context);
-   *sq_state = MLX5_GET(sqc, sqc, state);
sq->state = *sq_state;
 
 out:
-   kvfree(out);
return err;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/transobj.c 
b/drivers/net/ethernet/mellanox/mlx5/core/transobj.c
index 9e38343a951f..c64957b5ef47 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/transobj.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/transobj.c
@@ -157,6 +157,31 @@ int mlx5_core_query_sq(struct mlx5_core_dev *dev, u32 sqn, 
u32 *out)
 }
 EXPORT_SYMBOL(mlx5_core_query_sq);
 
+int mlx5_core_query_sq_state(struct mlx5_core_dev *dev, u32 sqn, u8 *state)
+{
+   void *out;
+   void *sqc;
+   int inlen;
+   int err;
+
+   inlen = MLX5_ST_SZ_BYTES(query_sq_out);
+   out = kvzalloc(inlen, GFP_KERNEL);
+   if (!out)
+   return -ENOMEM;
+
+   err = mlx5_core_query_sq(dev, sqn, out);
+   if (err)
+   goto out;
+
+   sqc = MLX5_ADDR_OF(query_sq_out, out, sq_context);
+   *state = MLX5_GET(sqc, sqc, state);
+
+out:
+   kvfree(out);
+   return err;
+}
+EXPORT_SYMBOL_GPL(mlx5_core_query_sq_state);
+
 int mlx5_core_create_tir(struct mlx5_core_dev *dev, u32 *in, int inlen,
 u32 *tirn)
 {
diff --git a/include/linux/mlx5/transobj.h b/include/linux/mlx5/transobj.h
index 7e8f281f8c00..80d7aa8b2831 100644
--- a/include/linux/mlx5/transobj.h
+++ b/include/linux/mlx5/transobj.h
@@ -47,6 +47,7 @@ int mlx5_core_create_sq(struct mlx5_core_dev *dev, u32 *in, 
int inlen,
 int mlx5_core_modify_sq(struct mlx5_core_dev *dev, u32 sqn, u32 *in, int 
inlen);
 void mlx5_core_destroy_sq(struct mlx5_core_dev *dev, u32 sqn);
 int mlx5_core_query_sq(struct mlx5_core_dev *dev, u32 sqn, u32 *out);
+int mlx5_core_query_sq_state(struct mlx5_core_dev *dev, u32 sqn, u8 *state);
 int mlx5_core_create_tir(struct mlx5_core_dev *dev, u32 *in, int inlen,
 u32 *tirn);
 int mlx5_core_modify_tir(struct mlx5_core_dev *dev, u32 tirn, u32 *in,
-- 
2.14.3



[net-next 05/15] net/mlx5e: Use no-offset function in skb header copy

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

In copying skb header to skb->data, replace the call to
skb_copy_to_linear_data_offset() with a zero offset with
the call to the no-offset function skb_copy_to_linear_data().

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index ffcbe5c3818a..781b8f21d6d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -333,9 +333,8 @@ mlx5e_copy_skb_header_mpwqe(struct device *pdev,
len = ALIGN(headlen_pg, sizeof(long));
dma_sync_single_for_cpu(pdev, dma_info->addr + offset, len,
DMA_FROM_DEVICE);
-   skb_copy_to_linear_data_offset(skb, 0,
-  page_address(dma_info->page) + offset,
-  len);
+   skb_copy_to_linear_data(skb, page_address(dma_info->page) + offset, 
len);
+
if (unlikely(offset + headlen > PAGE_SIZE)) {
dma_info++;
headlen_pg = len;
-- 
2.14.3



Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Mathieu Desnoyers
- On Mar 27, 2018, at 8:00 PM, Alexei Starovoitov a...@fb.com wrote:

> On 3/27/18 4:13 PM, Mathieu Desnoyers wrote:
>> - On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov a...@fb.com wrote:
>>
>>> On 3/27/18 2:04 PM, Steven Rostedt wrote:

 +#ifdef CONFIG_BPF_EVENTS
 +#define BPF_RAW_TP() . = ALIGN(8);\
>>
>> Given that the section consists of 16-byte structure elements
>> on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
>> be turned into a STRUCT_ALIGN(), especially given that the compiler
>> is free to up-align the structure on 32 bytes.
> 
> STRUCT_ALIGN fixed the 'off by 8' issue with kasan,
> but it fails without kasan too.
> For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp
> comes inited with :
> [   22.703562] i 1 btp 8288e530 btp->tp  func
> 
> [   22.704638] i 2 btp 8288e540 btp->tp  func
> 
> [   22.705599] i 3 btp 8288e550 btp->tp  func
> 
> [   22.706551] i 4 btp 8288e560 btp->tp  func
> 
> [   22.707503] i 5 btp 8288e570 btp->tp  func
> 
> [   22.708452] i 6 btp 8288e580 btp->tp  func
> 
> [   22.709406] i 7 btp 8288e590 btp->tp  func
> 
> [   22.710368] i 8 btp 8288e5a0 btp->tp  func
> 
> 
> while gdb shows that everything is good inside vmlinux
> for exactly these addresses.
> Some other linker magic missing?

No, Steven's iteration code is incorrect.

+extern struct bpf_raw_event_map __start__bpf_raw_tp;
+extern struct bpf_raw_event_map __stop__bpf_raw_tp;

That should be:

extern struct bpf_raw_event_map __start__bpf_raw_tp[];
extern struct bpf_raw_event_map __stop__bpf_raw_tp[];


+
+struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name)
+{
+const struct bpf_raw_event_map *btp = &__start__bpf_raw_tp;

const struct bpf_raw_event_map *btp = __start__bpf_raw_tp;

+int i = 0;
+
+for (; btp < &__stop__bpf_raw_tp; btp++) {

for (; btp < __stop__bpf_raw_tp; btp++) {

Those start/stop symbols are given their address by the linker
automatically (this is a GNU linker extension). We don't want
pointers to the symbols, but rather the symbols per se to act
as start/stop addresses.

Thanks,

Mathieu

+i++;
+if (!strcmp(btp->tp->name, name))
+return btp;
+}
+return NULL;
+}



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[net-next 10/15] net/mlx5e: Remove unused max inline related code

2018-03-27 Thread Saeed Mahameed
From: Gal Pressman 

Commit 58d522912ac7 ("net/mlx5e: Support TX packet copy into WQE")
introduced the max inline WQE as an ethtool tunable. One commit later,
that functionality was made dependent on BlueFlame.

Commit 6982ab609768 ("net/mlx5e: Xmit, no write combining") removed
BlueFlame support, and with it the max inline WQE.
This patch cleans up the leftovers from the removed feature.

Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  3 --
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 32 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 11 
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  1 -
 4 files changed, 2 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 13dd7a97ae04..6898f5e26006 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -240,7 +240,6 @@ struct mlx5e_params {
struct net_dim_cq_moder tx_cq_moderation;
bool lro_en;
u32 lro_wqe_sz;
-   u16 tx_max_inline;
u8  tx_min_inline_mode;
u8  rss_hfunc;
u8  toeplitz_hash_key[40];
@@ -366,7 +365,6 @@ struct mlx5e_txqsq {
void __iomem  *uar_map;
struct netdev_queue   *txq;
u32sqn;
-   u16max_inline;
u8 min_inline_mode;
u16edge;
struct device *pdev;
@@ -1017,7 +1015,6 @@ int mlx5e_rx_flow_steer(struct net_device *dev, const 
struct sk_buff *skb,
u16 rxq_index, u32 flow_id);
 #endif
 
-u16 mlx5e_get_max_inline_cap(struct mlx5_core_dev *mdev);
 int mlx5e_create_tir(struct mlx5_core_dev *mdev,
 struct mlx5e_tir *tir, u32 *in, int inlen);
 void mlx5e_destroy_tir(struct mlx5_core_dev *mdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 7bfe17b7c279..c57c929d7973 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1118,13 +1118,9 @@ static int mlx5e_get_tunable(struct net_device *dev,
 const struct ethtool_tunable *tuna,
 void *data)
 {
-   const struct mlx5e_priv *priv = netdev_priv(dev);
-   int err = 0;
+   int err;
 
switch (tuna->id) {
-   case ETHTOOL_TX_COPYBREAK:
-   *(u32 *)data = priv->channels.params.tx_max_inline;
-   break;
case ETHTOOL_PFC_PREVENTION_TOUT:
err = mlx5e_get_pfc_prevention_tout(dev, data);
break;
@@ -1141,35 +1137,11 @@ static int mlx5e_set_tunable(struct net_device *dev,
 const void *data)
 {
struct mlx5e_priv *priv = netdev_priv(dev);
-   struct mlx5_core_dev *mdev = priv->mdev;
-   struct mlx5e_channels new_channels = {};
-   int err = 0;
-   u32 val;
+   int err;
 
	mutex_lock(&priv->state_lock);
 
switch (tuna->id) {
-   case ETHTOOL_TX_COPYBREAK:
-   val = *(u32 *)data;
-   if (val > mlx5e_get_max_inline_cap(mdev)) {
-   err = -EINVAL;
-   break;
-   }
-
-   new_channels.params = priv->channels.params;
-   new_channels.params.tx_max_inline = val;
-
-   if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
-   priv->channels.params = new_channels.params;
-   break;
-   }
-
-   err = mlx5e_open_channels(priv, &new_channels);
-   if (err)
-   break;
-   mlx5e_switch_priv_channels(priv, &new_channels, NULL);
-
-   break;
case ETHTOOL_PFC_PREVENTION_TOUT:
err = mlx5e_set_pfc_prevention_tout(dev, *(u16 *)data);
break;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7610a7916e96..5d8eb0a9c0f0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -993,7 +993,6 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
sq->channel   = c;
sq->txq_ix= txq_ix;
sq->uar_map   = mdev->mlx5e_res.bfreg.map;
-   sq->max_inline  = params->tx_max_inline;
sq->min_inline_mode = params->tx_min_inline_mode;
if (MLX5_IPSEC_DEV(c->priv->mdev))
set_bit(MLX5E_SQ_STATE_IPSEC, >state);
@@ -3882,15 +3881,6 @@ static int mlx5e_check_required_hca_cap(struct 
mlx5_core_dev *mdev)
return 0;
 }
 
-u16 mlx5e_get_max_inline_cap(struct 

[net-next 01/15] net/mlx5e: Unify slow PCI heuristic

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

Get the link/pci speed query and logic into a single function.
Unify the heuristics and use a single PCI threshold (16G) for all.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 31 ++
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|  5 
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1d36d7569f44..46707826f27e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3902,16 +3902,20 @@ static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, 
u32 *pci_bw)
return 0;
 }
 
-static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw)
+static bool slow_pci_heuristic(struct mlx5_core_dev *mdev)
 {
-   return (link_speed && pci_bw &&
-   (pci_bw < 40000) && (pci_bw < link_speed));
-}
+   u32 link_speed = 0;
+   u32 pci_bw = 0;
 
-static bool hw_lro_heuristic(u32 link_speed, u32 pci_bw)
-{
-   return !(link_speed && pci_bw &&
-(pci_bw <= 16000) && (pci_bw < link_speed));
+   mlx5e_get_max_linkspeed(mdev, &link_speed);
+   mlx5e_get_pci_bw(mdev, &pci_bw);
+   mlx5_core_dbg_once(mdev, "Max link speed = %d, PCI BW = %d\n",
+  link_speed, pci_bw);
+
+#define MLX5E_SLOW_PCI_RATIO (2)
+
+   return link_speed && pci_bw &&
+   link_speed > MLX5E_SLOW_PCI_RATIO * pci_bw;
 }
 
 void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 
cq_period_mode)
@@ -3980,17 +3984,10 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels)
 {
u8 cq_period_mode = 0;
-   u32 link_speed = 0;
-   u32 pci_bw = 0;
 
params->num_channels = max_channels;
params->num_tc   = 1;
 
-   mlx5e_get_max_linkspeed(mdev, &link_speed);
-   mlx5e_get_pci_bw(mdev, &pci_bw);
-   mlx5_core_dbg(mdev, "Max link speed = %d, PCI BW = %d\n",
- link_speed, pci_bw);
-
/* SQ */
params->log_sq_size = is_kdump_kernel() ?
MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE :
@@ -4000,7 +3997,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
params->rx_cqe_compress_def = false;
if (MLX5_CAP_GEN(mdev, cqe_compression) &&
MLX5_CAP_GEN(mdev, vport_group_manager))
-   params->rx_cqe_compress_def = 
cqe_compress_heuristic(link_speed, pci_bw);
+   params->rx_cqe_compress_def = slow_pci_heuristic(mdev);
 
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS, 
params->rx_cqe_compress_def);
 
@@ -4011,7 +4008,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
 
/* TODO: && MLX5_CAP_ETH(mdev, lro_cap) */
if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
-   params->lro_en = hw_lro_heuristic(link_speed, pci_bw);
+   params->lro_en = !slow_pci_heuristic(mdev);
params->lro_timeout = mlx5e_choose_lro_timeout(mdev, 
MLX5E_DEFAULT_LRO_TIMEOUT);
 
/* CQ moderation params */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 4e25f2b2e0bc..7d001fe6e631 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -50,6 +50,11 @@ extern uint mlx5_core_debug_mask;
 __func__, __LINE__, current->pid,  \
 ##__VA_ARGS__)
 
+#define mlx5_core_dbg_once(__dev, format, ...) \
+   dev_dbg_once(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format,\
+__func__, __LINE__, current->pid,  \
+##__VA_ARGS__)
+
 #define mlx5_core_dbg_mask(__dev, mask, format, ...)   \
 do {   \
if ((mask) & mlx5_core_debug_mask)  \
-- 
2.14.3



[net-next 06/15] net/mlx5e: Remove RQ MPWQE fields from params

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

Introduce functions to calculate them when needed.
They can be derived from other params.
This will simplify transition between RQ configurations.

In general, any parameter that is not explicitly set
or controlled, but derived from other parameters,
should not have a control-path field itself, but a
getter function.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  7 ++--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 13 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 38 +++---
 3 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 85767f0869d8..ba7f1ceb6dcd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -231,8 +231,6 @@ struct mlx5e_params {
u8  log_sq_size;
u8  rq_wq_type;
u16 rq_headroom;
-   u8  mpwqe_log_stride_sz;
-   u8  mpwqe_log_num_strides;
u8  log_rq_size;
u16 num_channels;
u8  num_tc;
@@ -840,6 +838,11 @@ void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 
+u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
+  struct mlx5e_params *params);
+u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
+  struct mlx5e_params *params);
+
 void mlx5e_update_stats(struct mlx5e_priv *priv);
 
 int mlx5e_create_flow_steering(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index d415e67b557b..234b5b2ebf0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -231,8 +231,8 @@ static u32 mlx5e_rx_wqes_to_packets(struct mlx5e_priv 
*priv, int rq_wq_type,
if (rq_wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
return num_wqe;
 
-   stride_size = 1 << priv->channels.params.mpwqe_log_stride_sz;
-   num_strides = 1 << priv->channels.params.mpwqe_log_num_strides;
+   stride_size = 1 << mlx5e_mpwqe_get_log_stride_size(priv->mdev, &priv->channels.params);
+   num_strides = 1 << mlx5e_mpwqe_get_log_num_strides(priv->mdev, &priv->channels.params);
wqe_size = stride_size * num_strides;
 
packets_per_wqe = wqe_size /
@@ -252,8 +252,8 @@ static u32 mlx5e_packets_to_rx_wqes(struct mlx5e_priv 
*priv, int rq_wq_type,
if (rq_wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
return num_packets;
 
-   stride_size = 1 << priv->channels.params.mpwqe_log_stride_sz;
-   num_strides = 1 << priv->channels.params.mpwqe_log_num_strides;
+   stride_size = 1 << mlx5e_mpwqe_get_log_stride_size(priv->mdev, &priv->channels.params);
+   num_strides = 1 << mlx5e_mpwqe_get_log_num_strides(priv->mdev, &priv->channels.params);
wqe_size = stride_size * num_strides;
 
num_packets = (1 << order_base_2(num_packets));
@@ -1561,11 +1561,6 @@ int mlx5e_modify_rx_cqe_compression_locked(struct 
mlx5e_priv *priv, bool new_val
new_channels.params = priv->channels.params;
	MLX5E_SET_PFLAG(&new_channels.params, MLX5E_PFLAG_RX_CQE_COMPRESS, new_val);
 
-   new_channels.params.mpwqe_log_stride_sz =
-   MLX5E_MPWQE_STRIDE_SZ(priv->mdev, new_val);
-   new_channels.params.mpwqe_log_num_strides =
-   MLX5_MPWRQ_LOG_WQE_SZ - new_channels.params.mpwqe_log_stride_sz;
-
	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
priv->channels.params = new_channels.params;
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d4dd00089eb1..65e6955713e7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -78,6 +78,20 @@ static bool mlx5e_check_fragmented_striding_rq_cap(struct 
mlx5_core_dev *mdev)
MLX5_CAP_ETH(mdev, reg_umr_sq);
 }
 
+u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
+  struct mlx5e_params *params)
+{
+   return MLX5E_MPWQE_STRIDE_SZ(mdev,
+   MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS));
+}
+
+u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
+  struct mlx5e_params *params)
+{
+   return MLX5_MPWRQ_LOG_WQE_SZ -
+   mlx5e_mpwqe_get_log_stride_size(mdev, params);
+}
+
 void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
   struct mlx5e_params *params, u8 rq_type)
 {

[net-next 02/15] net/mlx5e: Disable Striding RQ when PCI is slower than link

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

We turn the feature off for servers whose PCI BW is bounded
by a threshold (16G) and is lower than the MAX LINK BW.
This improves the effectiveness of CQE compression feature,
that is defaulted to ON for the same case.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 46707826f27e..d4dd00089eb1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -113,13 +113,16 @@ void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
   MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS));
 }
 
+static bool slow_pci_heuristic(struct mlx5_core_dev *mdev);
+
 static void mlx5e_set_rq_params(struct mlx5_core_dev *mdev,
struct mlx5e_params *params)
 {
u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) &&
-   !params->xdp_prog && !MLX5_IPSEC_DEV(mdev) ?
-   MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
-   MLX5_WQ_TYPE_LINKED_LIST;
+   !slow_pci_heuristic(mdev) &&
+   !params->xdp_prog && !MLX5_IPSEC_DEV(mdev) ?
+   MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
+   MLX5_WQ_TYPE_LINKED_LIST;
mlx5e_init_rq_type_params(mdev, params, rq_type);
 }
 
-- 
2.14.3



[net-next 11/15] net/mlx5e: Move all TX timeout logic to be under state lock

2018-03-27 Thread Saeed Mahameed
From: Eran Ben Elisha 

Driver callback for handling TX timeout should access some internal
resources (SQ, CQ) in order to decide if the tx timeout work should be
scheduled.  These resources might be unavailable if channels are closed
in parallel (ifdown for example).

The state lock is the mechanism to protect from such races.
Move all TX timeout logic to be in the work under a state lock.

In addition, move the work from the global WQ to the mlx5e WQ to make sure
this work is flushed when the device is detached.

Also, move the mlx5e_tx_timeout_work code to be next to the TX timeout
NDO for better code locality.

Fixes: 3947ca185999 ("net/mlx5e: Implement ndo_tx_timeout callback")
Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 61 +--
 1 file changed, 34 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5d8eb0a9c0f0..e0b75f52d556 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -177,26 +177,6 @@ static void mlx5e_update_carrier_work(struct work_struct 
*work)
	mutex_unlock(&priv->state_lock);
 }
 
-static void mlx5e_tx_timeout_work(struct work_struct *work)
-{
-   struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
-  tx_timeout_work);
-   int err;
-
-   rtnl_lock();
-   mutex_lock(&priv->state_lock);
-   if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
-   goto unlock;
-   mlx5e_close_locked(priv->netdev);
-   err = mlx5e_open_locked(priv->netdev);
-   if (err)
-   netdev_err(priv->netdev, "mlx5e_open_locked failed recovering 
from a tx_timeout, err(%d).\n",
-  err);
-unlock:
-   mutex_unlock(&priv->state_lock);
-   rtnl_unlock();
-}
-
 void mlx5e_update_stats(struct mlx5e_priv *priv)
 {
int i;
@@ -3658,13 +3638,19 @@ static bool mlx5e_tx_timeout_eq_recover(struct 
net_device *dev,
return true;
 }
 
-static void mlx5e_tx_timeout(struct net_device *dev)
+static void mlx5e_tx_timeout_work(struct work_struct *work)
 {
-   struct mlx5e_priv *priv = netdev_priv(dev);
+   struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
+  tx_timeout_work);
+   struct net_device *dev = priv->netdev;
bool reopen_channels = false;
-   int i;
+   int i, err;
 
-   netdev_err(dev, "TX timeout detected\n");
+   rtnl_lock();
+   mutex_lock(&priv->state_lock);
+
+   if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+   goto unlock;
 
for (i = 0; i < priv->channels.num * priv->channels.params.num_tc; i++) 
{
struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, i);
@@ -3672,7 +3658,9 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 
if (!netif_xmit_stopped(dev_queue))
continue;
-   netdev_err(dev, "TX timeout on queue: %d, SQ: 0x%x, CQ: 0x%x, 
SQ Cons: 0x%x SQ Prod: 0x%x, usecs since last trans: %u\n",
+
+   netdev_err(dev,
+  "TX timeout on queue: %d, SQ: 0x%x, CQ: 0x%x, SQ 
Cons: 0x%x SQ Prod: 0x%x, usecs since last trans: %u\n",
   i, sq->sqn, sq->cq.mcq.cqn, sq->cc, sq->pc,
   jiffies_to_usecs(jiffies - dev_queue->trans_start));
 
@@ -3685,8 +3673,27 @@ static void mlx5e_tx_timeout(struct net_device *dev)
}
}
 
-   if (reopen_channels && test_bit(MLX5E_STATE_OPENED, &priv->state))
-   schedule_work(&priv->tx_timeout_work);
+   if (!reopen_channels)
+   goto unlock;
+
+   mlx5e_close_locked(dev);
+   err = mlx5e_open_locked(dev);
+   if (err)
+   netdev_err(priv->netdev,
+  "mlx5e_open_locked failed recovering from a tx_timeout, err(%d).\n",
+  err);
+
+unlock:
+   mutex_unlock(&priv->state_lock);
+   rtnl_unlock();
+}
+
+static void mlx5e_tx_timeout(struct net_device *dev)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+
+   netdev_err(dev, "TX timeout detected\n");
+   queue_work(priv->wq, &priv->tx_timeout_work);
 }
 
 static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
-- 
2.14.3
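The patch above applies a standard deferral pattern: the `ndo_tx_timeout` callback (which must not sleep) only queues work, and the work item later takes `rtnl_lock`/`state_lock` and reopens the channels. The shape of that pattern can be sketched in plain userspace C — a single-slot stand-in for the kernel workqueue, with all names hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Single-slot deferred-work sketch: queue_work_sim() is the cheap part
 * run from "timeout" context; run_pending() plays the workqueue thread,
 * which is allowed to sleep and take locks. */
struct work_slot {
    void (*fn)(void *);
    void *arg;
    int pending;
};

static void queue_work_sim(struct work_slot *w, void (*fn)(void *), void *arg)
{
    /* timeout context: just record the work, no heavy lifting here */
    w->fn = fn;
    w->arg = arg;
    w->pending = 1;
}

static int run_pending(struct work_slot *w)
{
    if (!w->pending)
        return 0;
    w->pending = 0;
    w->fn(w->arg);    /* safe to sleep / take mutexes here */
    return 1;
}

/* Stand-in for the recovery routine (close + reopen channels). */
static int recover_runs;
static void recover(void *arg)
{
    (void)arg;
    recover_runs++;
}
```

The point of the split is that the expensive, lock-taking recovery runs exactly once per queued request, and never in the restricted timeout context.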



[net-next 15/15] net/mlx5e: Recover Send Queue (SQ) from error state

2018-03-27 Thread Saeed Mahameed
From: Eran Ben Elisha 

An error TX completion (CQE) which arrived on a specific SQ indicates
that this SQ got moved by the hardware to error state, which means all
pending and incoming TX requests are dropped or will be dropped and no
further "Good" CQEs will be generated for that SQ.

Before this patch TX completions (CQEs) were not monitored and were
handled as a regular CQE. This caused the SQ to stay in an error state,
making it useless for transmitting new packets.

Mitigation plan:
In case of an error completion, schedule a recovery work which would do
the following:
- Mark the TXQ as DRV_XOFF to disable new packets to arrive from the
  stack
- NAPI to flush all pending SQ WQEs (via flush_in_error_en bit) to
  release SW and HW resources (SKB, DMA, etc.) and have the SQ and CQ
  consumer/producer indices synced.
- Modify the SQ state ERR -> RST -> RDY (restart the SQ).
- Reactivate the SQ and reset SQ cc and pc

If we identify two consecutive requests for SQ recover in less than
500 msecs, drop the recover request to avoid CPU overload, as this
scenario most likely happened due to a severe repeated bug.

In addition, add SQ recover SW counter to monitor successful recoveries.
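The 500 msec throttle described above reduces to a timestamp comparison; a minimal userspace sketch (illustrative names, not the driver's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the MLX5E_SQ_RECOVER_MIN_INTERVAL idea from the patch. */
#define SQ_RECOVER_MIN_INTERVAL_MS 500

/* Returns true if a recover attempt at now_ms should proceed, updating
 * *last_ms; returns false (drop the request) if the previous recover
 * ran less than SQ_RECOVER_MIN_INTERVAL_MS ago -- repeated back-to-back
 * requests most likely indicate a severe recurring error. */
bool sq_recover_allowed(uint64_t now_ms, uint64_t *last_ms)
{
    if (*last_ms && now_ms - *last_ms < SQ_RECOVER_MIN_INTERVAL_MS)
        return false;
    *last_ms = now_ms;
    return true;
}
```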

Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   6 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 115 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  10 +-
 5 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6898f5e26006..353ac6daa3dc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -122,6 +122,7 @@
 #define MLX5E_MAX_NUM_SQS  (MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC)
 #define MLX5E_TX_CQ_POLL_BUDGET128
 #define MLX5E_UPDATE_STATS_INTERVAL200 /* msecs */
+#define MLX5E_SQ_RECOVER_MIN_INTERVAL  500 /* msecs */
 
 #define MLX5E_ICOSQ_MAX_WQEBBS \
(DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
@@ -332,6 +333,7 @@ struct mlx5e_sq_dma {
 
 enum {
MLX5E_SQ_STATE_ENABLED,
+   MLX5E_SQ_STATE_RECOVERING,
MLX5E_SQ_STATE_IPSEC,
 };
 
@@ -378,6 +380,10 @@ struct mlx5e_txqsq {
struct mlx5e_channel  *channel;
inttxq_ix;
u32rate_limit;
+   struct mlx5e_txqsq_recover {
+   struct work_struct recover_work;
+   u64last_recover;
+   } recover;
 } cacheline_aligned_in_smp;
 
 struct mlx5e_xdpsq {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e0b75f52d556..1b48dec67abf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -956,6 +956,7 @@ static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
return 0;
 }
 
+static void mlx5e_sq_recover(struct work_struct *work);
 static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 int txq_ix,
 struct mlx5e_params *params,
@@ -974,6 +975,7 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
sq->txq_ix= txq_ix;
sq->uar_map   = mdev->mlx5e_res.bfreg.map;
sq->min_inline_mode = params->tx_min_inline_mode;
+   INIT_WORK(&sq->recover.recover_work, mlx5e_sq_recover);
if (MLX5_IPSEC_DEV(c->priv->mdev))
set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
 
@@ -1040,6 +1042,7 @@ static int mlx5e_create_sq(struct mlx5_core_dev *mdev,
MLX5_SET(sqc,  sqc, min_wqe_inline_mode, csp->min_inline_mode);
 
MLX5_SET(sqc,  sqc, state, MLX5_SQC_STATE_RST);
+   MLX5_SET(sqc,  sqc, flush_in_error_en, 1);
 
MLX5_SET(wq,   wq, wq_type,   MLX5_WQ_TYPE_CYCLIC);
MLX5_SET(wq,   wq, uar_page,  mdev->mlx5e_res.bfreg.index);
@@ -1158,9 +1161,20 @@ static int mlx5e_open_txqsq(struct mlx5e_channel *c,
return err;
 }
 
+static void mlx5e_reset_txqsq_cc_pc(struct mlx5e_txqsq *sq)
+{
+   WARN_ONCE(sq->cc != sq->pc,
+ "SQ 0x%x: cc (0x%x) != pc (0x%x)\n",
+ sq->sqn, sq->cc, sq->pc);
+   sq->cc = 0;
+   sq->dma_fifo_cc = 0;
+   sq->pc = 0;
+}
+
 static void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq)
 {
sq->txq = netdev_get_tx_queue(sq->channel->netdev, sq->txq_ix);
+   clear_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state);
set_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
netdev_tx_reset_queue(sq->txq);
netif_tx_start_queue(sq->txq);
@@ -1205,6 +1219,107 @@ static void mlx5e_close_txqsq(struct 

[net-next 09/15] net/mlx5e: Add ethtool priv-flag for Striding RQ

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

Add a control private flag in ethtool to enable/disable
Striding RQ feature.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  7 
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 38 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 20 
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  3 +-
 4 files changed, 60 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 45d0c64e77e5..13dd7a97ae04 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -205,12 +205,14 @@ static const char mlx5e_priv_flags[][ETH_GSTRING_LEN] = {
"rx_cqe_moder",
"tx_cqe_moder",
"rx_cqe_compress",
+   "rx_striding_rq",
 };
 
 enum mlx5e_priv_flag {
MLX5E_PFLAG_RX_CQE_BASED_MODER = (1 << 0),
MLX5E_PFLAG_TX_CQE_BASED_MODER = (1 << 1),
MLX5E_PFLAG_RX_CQE_COMPRESS = (1 << 2),
+   MLX5E_PFLAG_RX_STRIDING_RQ = (1 << 3),
 };
 
 #define MLX5E_SET_PFLAG(params, pflag, enable) \
@@ -827,6 +829,10 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
 void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq);
 void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
 
+bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev);
+bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
+   struct mlx5e_params *params);
+
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle);
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -917,6 +923,7 @@ void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params,
 u8 cq_period_mode);
 void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params,
 u8 cq_period_mode);
+void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params);
 void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
   struct mlx5e_params *params);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 234b5b2ebf0f..7bfe17b7c279 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1598,6 +1598,38 @@ static int set_pflag_rx_cqe_compress(struct net_device *netdev,
return 0;
 }
 
+static int set_pflag_rx_striding_rq(struct net_device *netdev, bool enable)
+{
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+   struct mlx5_core_dev *mdev = priv->mdev;
+   struct mlx5e_channels new_channels = {};
+   int err;
+
+   if (enable) {
+   if (!mlx5e_check_fragmented_striding_rq_cap(mdev))
+   return -EOPNOTSUPP;
+   if (!mlx5e_striding_rq_possible(mdev, &priv->channels.params))
+   return -EINVAL;
+   }
+
+   new_channels.params = priv->channels.params;
+
+   MLX5E_SET_PFLAG(&new_channels.params, MLX5E_PFLAG_RX_STRIDING_RQ, enable);
+   mlx5e_set_rq_type(mdev, &new_channels.params);
+
+   if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+   priv->channels.params = new_channels.params;
+   return 0;
+   }
+
+   err = mlx5e_open_channels(priv, &new_channels);
+   if (err)
+   return err;
+
+   mlx5e_switch_priv_channels(priv, &new_channels, NULL);
+   return 0;
+}
+
 static int mlx5e_handle_pflag(struct net_device *netdev,
  u32 wanted_flags,
  enum mlx5e_priv_flag flag,
@@ -1643,6 +1675,12 @@ static int mlx5e_set_priv_flags(struct net_device *netdev, u32 pflags)
err = mlx5e_handle_pflag(netdev, pflags,
 MLX5E_PFLAG_RX_CQE_COMPRESS,
 set_pflag_rx_cqe_compress);
+   if (err)
+   goto out;
+
+   err = mlx5e_handle_pflag(netdev, pflags,
+MLX5E_PFLAG_RX_STRIDING_RQ,
+set_pflag_rx_striding_rq);
 
 out:
mutex_unlock(&priv->state_lock);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ffe3b2469032..7610a7916e96 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -71,7 +71,7 @@ struct mlx5e_channel_param {
struct mlx5e_cq_param  icosq_cq;
 };
 
-static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
+bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 {
return MLX5_CAP_GEN(mdev, striding_rq) &&
  

[net-next 03/15] net/mlx5e: Remove unused define MLX5_MPWRQ_STRIDES_PER_PAGE

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

Clean it up as it's not in use.

Fixes: d9d9f156f380 ("net/mlx5e: Expand WQE stride when CQE compression is 
enabled")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 294bc9f175a5..85767f0869d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -93,8 +93,6 @@
 #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
 #define MLX5_MPWRQ_PAGES_PER_WQE   BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
-#define MLX5_MPWRQ_STRIDES_PER_PAGE(MLX5_MPWRQ_NUM_STRIDES >> \
-MLX5_MPWRQ_WQE_PAGE_ORDER)
 
 #define MLX5_MTT_OCTW(npages) (ALIGN(npages, 8) / 2)
 #define MLX5E_REQUIRED_MTTS(wqes)  \
-- 
2.14.3



[net-next 13/15] mlx5: Move dump error CQE function out of mlx5_ib for code sharing

2018-03-27 Thread Saeed Mahameed
From: Eran Ben Elisha 

Move mlx5_ib dump error CQE implementation to mlx5 CQ header file in
order to use it in a downstream patch from mlx5e.

In addition, use print_hex_dump instead of manual dumping of the buffer.

Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/cq.c | 8 +---
 include/linux/mlx5/cq.h | 6 ++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 94a27d89a303..77d257ec899b 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -267,14 +267,8 @@ static void handle_responder(struct ib_wc *wc, struct mlx5_cqe64 *cqe,
 
 static void dump_cqe(struct mlx5_ib_dev *dev, struct mlx5_err_cqe *cqe)
 {
-   __be32 *p = (__be32 *)cqe;
-   int i;
-
mlx5_ib_warn(dev, "dump error cqe\n");
-   for (i = 0; i < sizeof(*cqe) / 16; i++, p += 4)
-   pr_info("%08x %08x %08x %08x\n", be32_to_cpu(p[0]),
-   be32_to_cpu(p[1]), be32_to_cpu(p[2]),
-   be32_to_cpu(p[3]));
+   mlx5_dump_err_cqe(dev->mdev, cqe);
 }
 
 static void mlx5_handle_error_cqe(struct mlx5_ib_dev *dev,
diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
index 445ad194e0fe..0ef6138eca49 100644
--- a/include/linux/mlx5/cq.h
+++ b/include/linux/mlx5/cq.h
@@ -193,6 +193,12 @@ int mlx5_core_modify_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
 int mlx5_core_modify_cq_moderation(struct mlx5_core_dev *dev,
   struct mlx5_core_cq *cq, u16 cq_period,
   u16 cq_max_count);
+static inline void mlx5_dump_err_cqe(struct mlx5_core_dev *dev,
+struct mlx5_err_cqe *err_cqe)
+{
+   print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET, 16, 1, err_cqe,
+  sizeof(*err_cqe), false);
+}
 int mlx5_debug_cq_add(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq);
 void mlx5_debug_cq_remove(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq);
 
-- 
2.14.3
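A userspace analogue of the `print_hex_dump()` call used above — 16 bytes per line with an offset prefix — can be sketched as follows (illustrative helper, not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format len bytes of data into buf, 16 per line, each line prefixed
 * with its offset -- roughly DUMP_PREFIX_OFFSET with groupsize 1. */
void hex_dump(const uint8_t *data, size_t len, char *buf, size_t buflen)
{
    size_t pos = 0;

    for (size_t i = 0; i < len && pos < buflen; i++) {
        if (i % 16 == 0)   /* start a new line every 16 bytes */
            pos += snprintf(buf + pos, buflen - pos, "%s%08zx: ",
                            i ? "\n" : "", i);
        pos += snprintf(buf + pos, buflen - pos, "%02x ", data[i]);
    }
}
```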



[net-next 04/15] net/mlx5e: Separate dma base address and offset in dma_sync call

2018-03-27 Thread Saeed Mahameed
From: Tariq Toukan 

Pass the base dma address and offset to dma_sync_single_range_for_cpu(),
instead of doing the pre-calculation.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8cce90dc461d..ffcbe5c3818a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -870,10 +870,8 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
data   = va + rx_headroom;
frag_size  = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 
-   dma_sync_single_range_for_cpu(rq->pdev,
- di->addr + wi->offset,
- 0, frag_size,
- DMA_FROM_DEVICE);
+   dma_sync_single_range_for_cpu(rq->pdev, di->addr, wi->offset,
+ frag_size, DMA_FROM_DEVICE);
prefetch(data);
wi->offset += frag_size;
 
-- 
2.14.3



[net-next 14/15] net/mlx5e: Dump xmit error completions

2018-03-27 Thread Saeed Mahameed
From: Eran Ben Elisha 

Monitor and dump xmit error completions. In addition, add err_cqe
counter to track the number of error completion per send queue.

Signed-off-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 19 +++
 3 files changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index c0dab9a8969e..ad91d9de0240 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -60,6 +60,7 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_wake) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_dropped) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
+   { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_cqe_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
@@ -153,6 +154,7 @@ static void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
s->tx_queue_stopped += sq_stats->stopped;
s->tx_queue_wake+= sq_stats->wake;
s->tx_queue_dropped += sq_stats->dropped;
+   s->tx_cqe_err   += sq_stats->cqe_err;
s->tx_xmit_more += sq_stats->xmit_more;
s->tx_csum_partial_inner += sq_stats->csum_partial_inner;
s->tx_csum_none += sq_stats->csum_none;
@@ -1103,6 +1105,7 @@ static const struct counter_desc sq_stats_desc[] = {
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, wake) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, dropped) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, xmit_more) },
+   { MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, cqe_err) },
 };
 
 static const struct counter_desc ch_stats_desc[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 43a72efa28c0..43dc808684c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -78,6 +78,7 @@ struct mlx5e_sw_stats {
u64 tx_queue_wake;
u64 tx_queue_dropped;
u64 tx_xmit_more;
+   u64 tx_cqe_err;
u64 rx_wqe_err;
u64 rx_mpwqe_filler;
u64 rx_buff_alloc_err;
@@ -197,6 +198,7 @@ struct mlx5e_sq_stats {
u64 stopped;
u64 wake;
u64 dropped;
+   u64 cqe_err;
 };
 
 struct mlx5e_ch_stats {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 11b4f1089d1c..88b5b7bfc9a9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -417,6 +417,18 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
return mlx5e_sq_xmit(sq, skb, wqe, pi);
 }
 
+static void mlx5e_dump_error_cqe(struct mlx5e_txqsq *sq,
+struct mlx5_err_cqe *err_cqe)
+{
+   u32 ci = mlx5_cqwq_get_ci(&sq->cq.wq);
+
+   netdev_err(sq->channel->netdev,
+  "Error cqe on cqn 0x%x, ci 0x%x, sqn 0x%x, syndrome 0x%x, vendor syndrome 0x%x\n",
+  sq->cq.mcq.cqn, ci, sq->sqn, err_cqe->syndrome,
+  err_cqe->vendor_err_synd);
+   mlx5_dump_err_cqe(sq->cq.mdev, err_cqe);
+}
+
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 {
struct mlx5e_txqsq *sq;
@@ -456,6 +468,13 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
+   if (unlikely(cqe->op_own >> 4 == MLX5_CQE_REQ_ERR)) {
+   if (!sq->stats.cqe_err)
+   mlx5e_dump_error_cqe(sq,
+(struct mlx5_err_cqe *)cqe);
+   sq->stats.cqe_err++;
+   }
+
do {
struct mlx5e_tx_wqe_info *wi;
struct sk_buff *skb;
-- 
2.14.3



[pull request][net-next 00/15] Mellanox, mlx5 mlx5-updates-2018-03-27

2018-03-27 Thread Saeed Mahameed
Hi Dave,

This series contains Misc updates and cleanups for mlx5e rx path
and SQ recovery feature for tx path.

For more information please see tag log below.

Please pull and let me know if there's any problem.

Thanks,
Saeed.

---

The following changes since commit 5d22d47b9ed96eddb35821dc2cc4f629f45827f7:

  Merge branch 'sfc-filter-locking' (2018-03-27 13:33:21 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2018-03-27

for you to fetch changes up to db75373c91b0cfb6a68ad6ae88721e4e21ae6261:

  net/mlx5e: Recover Send Queue (SQ) from error state (2018-03-27 17:29:28 -0700)


mlx5-updates-2018-03-27 (Misc updates & SQ recovery)

This series contains Misc updates and cleanups for mlx5e rx path
and SQ recovery feature for tx path.

From Tariq: (RX updates)
- Disable Striding RQ when the PCI link is slower than the network link.
  Striding RQ limits the use of the CQE compression feature, which is
  critical for performance on slow PCI devices; with this change we prefer
  CQE compression over Striding RQ on such "slow" PCIe links.
- RX path cleanups
- Private flag to enable/disable striding RQ

From Eran: (TX fast recovery)
- TX timeout logic improvements, fast SQ recovery and TX error reporting.
  Previously, if a HW error occurred while transmitting on a specific SQ,
  the driver would ignore the error and wait for a TX timeout to occur and
  reset all the rings. This series improves resiliency for such HW errors
  by detecting TX completions with errors, reporting them, and performing
  a fast recovery of the specific faulty SQ even before a TX timeout is
  detected.

Thanks,
Saeed.


Eran Ben Elisha (5):
  net/mlx5e: Move all TX timeout logic to be under state lock
  mlx5_{ib,core}: Add query SQ state helper function
  mlx5: Move dump error CQE function out of mlx5_ib for code sharing
  net/mlx5e: Dump xmit error completions
  net/mlx5e: Recover Send Queue (SQ) from error state

Gal Pressman (1):
  net/mlx5e: Remove unused max inline related code

Tariq Toukan (9):
  net/mlx5e: Unify slow PCI heuristic
  net/mlx5e: Disable Striding RQ when PCI is slower than link
  net/mlx5e: Remove unused define MLX5_MPWRQ_STRIDES_PER_PAGE
  net/mlx5e: Separate dma base address and offset in dma_sync call
  net/mlx5e: Use no-offset function in skb header copy
  net/mlx5e: Remove RQ MPWQE fields from params
  net/mlx5e: Remove rq_headroom field from params
  net/mlx5e: Do not reset Receive Queue params on every type change
  net/mlx5e: Add ethtool priv-flag for Striding RQ

 drivers/infiniband/hw/mlx5/cq.c|   8 +-
 drivers/infiniband/hw/mlx5/qp.c|  14 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  29 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  83 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 306 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  27 +-
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |   4 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|   5 +
 drivers/net/ethernet/mellanox/mlx5/core/transobj.c |  25 ++
 include/linux/mlx5/cq.h|   6 +
 include/linux/mlx5/transobj.h  |   1 +
 15 files changed, 368 insertions(+), 162 deletions(-)


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
 wrote:
>
> Well, we need to clarify that once and for all, because as I wrote
> earlier, it was decreed by Linus more than a decade ago that writel
> would be fully ordered by itself vs. previous memory stores (at least
> on UC memory).

Yes.

So "writel()" needs to be ordered with respect to other writel() uses
on the same thread. Anything else *will* break drivers. Obviously, the
drivers may then do magic to say "do write combining etc", but that
magic will be architecture-specific.

The other issue is that "writel()" needs to be ordered wrt other CPU's
doing "writel()" if those writel's are in a spinlocked region.

So it's not that "writel()" needs to be ordered wrt the spinlock
itself, but you *do* need to honor ordering if you have something like
this:

   spin_lock();
   writel(a);
   writel(b);
   spin_unlock();

and if two CPU's run the above code "at the same time", then the
*ONLY* acceptable sequence is abab.

You cannot, and must not, ever see "aabb" at the device, for example,
because of how the writel would basically leak out of the spinlock.

That sounds "obvious", but dammit, a lot of architectures got that
wrong, afaik.
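The invariant described above can be modelled in userspace, with a pthread mutex standing in for the spinlock and an in-memory log standing in for the device. This is a sketch of the ordering requirement only, not of real MMIO — all names are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* "Device" write log: the order in which writes would hit the device. */
static char dev_log[8];
static int dev_pos;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void writel_sim(char v)          /* stands in for writel() */
{
    dev_log[dev_pos++] = v;
}

static void *cpu(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);          /* stands in for spin_lock() */
    writel_sim('a');
    writel_sim('b');
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Run two "CPUs".  Because both writes happen inside the critical
 * section, "abab" is the only sequence the device may observe --
 * "aabb" would mean a write leaked out of the locked region. */
const char *run_demo(void)
{
    pthread_t t1, t2;

    dev_pos = 0;
    memset(dev_log, 0, sizeof(dev_log));
    pthread_create(&t1, NULL, cpu, NULL);
    pthread_create(&t2, NULL, cpu, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return dev_log;
}
```

On real hardware the hard part is that the MMIO write must also be ordered with respect to the lock release as seen by the *device*, which is exactly what some architectures got wrong.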

Linus


Re: [PATCH net] net: fix possible out-of-bound read in skb_network_protocol()

2018-03-27 Thread Pravin Shelar
On Mon, Mar 26, 2018 at 8:08 AM, Eric Dumazet  wrote:
> skb mac header is not necessarily set at the time skb_network_protocol()
> is called. Use skb->data instead.
>
> BUG: KASAN: slab-out-of-bounds in skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
> Read of size 2 at addr 8801b3097a0b by task syz-executor5/14242
>
> CPU: 1 PID: 14242 Comm: syz-executor5 Not tainted 4.16.0-rc6+ #280
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report+0x23c/0x360 mm/kasan/report.c:412
>  __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:443
>  skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
>  harmonize_features net/core/dev.c:2924 [inline]
>  netif_skb_features+0x509/0x9b0 net/core/dev.c:3011
>  validate_xmit_skb+0x81/0xb00 net/core/dev.c:3084
>  validate_xmit_skb_list+0xbf/0x120 net/core/dev.c:3142
>  packet_direct_xmit+0x117/0x790 net/packet/af_packet.c:256
>  packet_snd net/packet/af_packet.c:2944 [inline]
>  packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:639
>  ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
>  __sys_sendmsg+0xe5/0x210 net/socket.c:2081
>
> Fixes: 19acc327258a ("gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()")
> Signed-off-by: Eric Dumazet 
> Cc: Pravin B Shelar 
> Reported-by: syzbot 
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 12be205357146f0dcd55cc6e6f71dfb65fdeb33b..ef0cc6ea5f8da5b87c751d9eebfc0943fbe36a06 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2735,7 +2735,7 @@ __be16 skb_network_protocol(struct sk_buff *skb, int *depth)
> if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr
> return 0;
>
> -   eth = (struct ethhdr *)skb_mac_header(skb);
> +   eth = (struct ethhdr *)skb->data;
> type = eth->h_proto;
> }
>
Thanks for fixing it.
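The bug class fixed here — reading a fixed-size header without first verifying the buffer actually holds it (the job `pskb_may_pull()` does for skbs) — can be illustrated with a plain C helper (hypothetical names, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14   /* dst MAC (6) + src MAC (6) + EtherType (2) */

/* Read the EtherType only after checking the buffer contains a full
 * Ethernet header; returns false instead of reading out of bounds. */
bool parse_ethertype(const uint8_t *data, size_t len, uint16_t *proto)
{
    if (len < ETH_HDR_LEN)
        return false;
    /* EtherType is bytes 12..13, big-endian on the wire */
    *proto = (uint16_t)((data[12] << 8) | data[13]);
    return true;
}
```

The kernel fix additionally switches the read base from `skb_mac_header(skb)` to `skb->data`, because the mac header offset is not guaranteed to be set at that point.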


RE: [Intel-wired-lan] [next-queue PATCH v5 7/9] igb: Add MAC address support for ethtool nftuple filters

2018-03-27 Thread Vinicius Costa Gomes
Hi Aaron,

"Brown, Aaron F"  writes:

[...]

> And watching the rx_queue counters, traffic continues to be spread across
> the different queues.  This is with Jeff Kirsher's next-queue tree, kernel
> 4.16.0-rc4_next-queue_dev-queue_e31d20a, which has the series of 8 igb
> patches applied.
>
> When I go back and run an older build (with an earlier version of
> the series) of the same tree, 4.16.0-rc4_next-queue_dev-queue_84a3942,
> with the same procedure and same systems all the rx traffic is
> relegated to queue 0 (or whichever queue I assign it to) for either
> the src or dst filter.  Here is a sample of my counters after it had
> been running netperf_stress over the weekend:

The difference in behaviour between v4 and v5 is that v4 is configuring
(wrongly) the controller to send all the traffic directed to the
local MAC address to queue 0, while v5 allows that filter to be added but
it does nothing in reality.

I am working on a new version of this series that should work for adding
filters that involve the local MAC address. The initial use cases that I
had in mind all used MAC addresses different from the local one, but I
see that this is indeed useful (and less surprising).


Thank you,
--
Vinicius


Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-03-27 Thread William Tu
On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
 wrote:
> On Mon, 26 Mar 2018 14:58:02 -0700
> William Tu  wrote:
>
>> > Again high count for NMI ?!?
>> >
>> > Maybe you just forgot to tell perf that you want it to decode the
>> > bpf_prog correctly?
>> >
>> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>> >
>> > Enable via:
>> >  $ sysctl net/core/bpf_jit_kallsyms=1
>> >
>> > And use perf report (while BPF is STILL LOADED):
>> >
>> >  $ perf report --kallsyms=/proc/kallsyms
>> >
>> > E.g. for emailing this you can use this command:
>> >
>> >  $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms 
>> > --no-children --stdio -g none | head -n 40
>> >
>>
>> Thanks, I followed the steps, the result of l2fwd
>> # Total Lost Samples: 119
>> #
>> # Samples: 2K of event 'cycles:ppp'
>> # Event count (approx.): 25675705627
>> #
>> # Overhead  CPU  Command  Shared Object   Symbol
>> #   ...  ...  ..  
>> ..
>> #
>> 10.48%  013  xdpsock  xdpsock [.] main
>>  9.77%  013  xdpsock  [kernel.vmlinux][k] clflush_cache_range
>>  8.45%  013  xdpsock  [kernel.vmlinux][k] nmi
>>  8.07%  013  xdpsock  [kernel.vmlinux][k] xsk_sendmsg
>>  7.81%  013  xdpsock  [kernel.vmlinux][k] __domain_mapping
>>  4.95%  013  xdpsock  [kernel.vmlinux][k] ixgbe_xmit_frame_ring
>>  4.66%  013  xdpsock  [kernel.vmlinux][k] skb_store_bits
>>  4.39%  013  xdpsock  [kernel.vmlinux][k] syscall_return_via_sysret
>>  3.93%  013  xdpsock  [kernel.vmlinux][k] pfn_to_dma_pte
>>  2.62%  013  xdpsock  [kernel.vmlinux][k] __intel_map_single
>>  2.53%  013  xdpsock  [kernel.vmlinux][k] __alloc_skb
>>  2.36%  013  xdpsock  [kernel.vmlinux][k] iommu_no_mapping
>>  2.21%  013  xdpsock  [kernel.vmlinux][k] alloc_skb_with_frags
>>  2.07%  013  xdpsock  [kernel.vmlinux][k] skb_set_owner_w
>>  1.98%  013  xdpsock  [kernel.vmlinux][k] __kmalloc_node_track_caller
>>  1.94%  013  xdpsock  [kernel.vmlinux][k] ksize
>>  1.84%  013  xdpsock  [kernel.vmlinux][k] validate_xmit_skb_list
>>  1.62%  013  xdpsock  [kernel.vmlinux][k] kmem_cache_alloc_node
>>  1.48%  013  xdpsock  [kernel.vmlinux][k] __kmalloc_reserve.isra.37
>>  1.21%  013  xdpsock  xdpsock [.] xq_enq
>>  1.08%  013  xdpsock  [kernel.vmlinux][k] intel_alloc_iova
>>
>
> You did use net/core/bpf_jit_kallsyms=1 and correct perf commands decoding of
> bpf_prog, so the perf top#3 'nmi' is likely a real NMI call... which looks 
> wrong.
>
Thanks, you're right. Let me dig more on this NMI behavior.

>
>> And l2fwd under "perf stat" looks OK to me. There is little context
>> switches, cpu is fully utilized, 1.17 insn per cycle seems ok.
>>
>> Performance counter stats for 'CPU(s) 6':
>>   1.787420  cpu-clock (msec)  #1.000 CPUs utilized
>> 24  context-switches  #0.002 K/sec
>>  0  cpu-migrations#0.000 K/sec
>>  0  page-faults   #0.000 K/sec
>> 22,361,333,647  cycles#2.236 GHz
>> 13,458,442,838  stalled-cycles-frontend   #   60.19% frontend cycles idle
>> 26,251,003,067  instructions  #1.17  insn per cycle
>>   #0.51  stalled cycles per 
>> insn
>>  4,938,921,868  branches  #  493.853 M/sec
>>  7,591,739  branch-misses #0.15% of all branches
>>   10.000835769 seconds time elapsed
>
> This perf stat also indicate something is wrong.
>
> The 1.17 insn per cycle is NOT okay, it is too low (compared to what
> usually I see, e.g. 2.36  insn per cycle).
>
> It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> cycles idle'.   This means your CPU have issues/bottleneck fetching
> instructions. Explained by Andi Kleen here [1]
>
> [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
>
thanks for the link!
It's definitely weird that my frontend-cycle (fetch and decode) stall
is so high.
I assume this xdpsock code is small and should all fit into the icache.
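As a sanity check on the figures quoted earlier in this thread (26,251,003,067 instructions over 22,361,333,647 cycles, with 13,458,442,838 frontend-stalled cycles), both metrics perf prints are plain ratios over the raw counters — illustrative helpers:

```c
#include <assert.h>
#include <stdint.h>

/* insn per cycle, as perf stat reports it */
double insn_per_cycle(uint64_t insns, uint64_t cycles)
{
    return (double)insns / (double)cycles;
}

/* "% frontend cycles idle": fraction of cycles the frontend stalled */
double frontend_idle_pct(uint64_t stalled_frontend, uint64_t cycles)
{
    return 100.0 * (double)stalled_frontend / (double)cycles;
}
```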
However, doing another perf stat on xdpsock l2fwd shows

13,720,109,581  stalled-cycles-frontend   # 60.01% frontend cycles idle  (23.82%)
                stalled-cycles-backend
     7,994,837  branch-misses             # 0.16% of all branches  (23.80%)
   996,874,424  bus-cycles                # 99.679 M/sec  (23.80%)
18,942,220,445  ref-cycles                # 1894.067 M/sec  (28.56%)
   100,983,226  LLC-loads                 # 10.097 M/sec  (23.80%)
     4,897,089  LLC-load-misses           # 4.85% of all LL-cache hits  (23.80%)
    66,659,889  LLC-stores                # 6.665 M/sec

Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-27 Thread Alexei Starovoitov

On 3/27/18 4:13 PM, Mathieu Desnoyers wrote:

- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov a...@fb.com wrote:


On 3/27/18 2:04 PM, Steven Rostedt wrote:


+#ifdef CONFIG_BPF_EVENTS
+#define BPF_RAW_TP() . = ALIGN(8); \


Given that the section consists of a 16-bytes structure elements
on architectures with 8 bytes pointers, this ". = ALIGN(8)" should
be turned into a STRUCT_ALIGN(), especially given that the compiler
is free to up-align the structure on 32 bytes.


STRUCT_ALIGN fixed the 'off by 8' issue with kasan,
but it fails without kasan too.
For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp
comes initialized with:
[   22.703562] i 1 btp 8288e530 btp->tp  func 

[   22.704638] i 2 btp 8288e540 btp->tp  func 

[   22.705599] i 3 btp 8288e550 btp->tp  func 

[   22.706551] i 4 btp 8288e560 btp->tp  func 

[   22.707503] i 5 btp 8288e570 btp->tp  func 

[   22.708452] i 6 btp 8288e580 btp->tp  func 

[   22.709406] i 7 btp 8288e590 btp->tp  func 

[   22.710368] i 8 btp 8288e5a0 btp->tp  func 



while gdb shows that everything is good inside vmlinux
for exactly these addresses.
Some other linker magic missing?



[PATCH V5 net-next 00/14] TLS offload, netdev & MLX5 support

2018-03-27 Thread Saeed Mahameed
Hi Dave,

The following series from Ilya and Boris provides TLS TX inline crypto
offload.

v1->v2:
   - Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
   - File license fix
   - Fix spelling, comment by DaveW
   - Move memory allocations out of tls_set_device_offload and other misc fixes,
comments by Kiril.

v2->v3:
   - Reversed xmas tree where needed and style fixes
   - Removed the need for skb_page_frag_refill, per Eric's comment
   - IPv6 dependency fixes

v3->v4:
   - Remove “inline” from functions in C files
   - Make clean_acked_data_enabled a static variable and add enable/disable 
functions to control it.
   - Remove unnecessary variable initialization mentioned by ShannonN
   - Rebase over TLS RX
   - Refactor the tls_software_fallback to reduce the number of variables 
mentioned by KirilT

v4->v5:
   - Add missing CONFIG_TLS_DEVICE

Boris says:
===
This series adds a generic infrastructure to offload TLS crypto to
network devices. It enables the kernel TLS socket to skip encryption and
authentication operations on the transmit side of the data path, leaving
those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records": these
records contain plaintext instead of ciphertext, and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
  TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP sequence number of the packet. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queue concurrently.

We assume that packets are not rerouted to a different network device.

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

===

Thanks,
Saeed.

---

Boris Pismenny (2):
  MAINTAINERS: Update mlx5 innova driver maintainers
  MAINTAINERS: Update TLS maintainers

Ilya Lesokhin (12):
  tcp: Add clean acked data hook
  net: Rename and export copy_skb_header
  net: Add Software fallback infrastructure for socket dependent
offloads
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  net/tls: Add generic NIC offload infrastructure
  net/tls: Support TLS device offload with IPv6
  net/mlx5e: Move defines out of ipsec code
  net/mlx5: Accel, Add TLS tx offload interface
  net/mlx5e: TLS, Add Innova TLS TX support
  net/mlx5e: TLS, Add Innova TLS TX offload data path
  net/mlx5e: TLS, Add error statistics

 MAINTAINERS|  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|  11 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c|  71 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h|  86 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h   

[PATCH V5 net-next 04/14] net: Add TLS offload netdev ops

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add new netdev ops to add and delete tls context

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/linux/netdevice.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2a2d9cf50aa2..2b01e5577be3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -864,6 +864,26 @@ struct xfrmdev_ops {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+enum tls_offload_ctx_dir {
+   TLS_OFFLOAD_CTX_DIR_RX,
+   TLS_OFFLOAD_CTX_DIR_TX,
+};
+
+struct tls_crypto_info;
+struct tls_context;
+
+struct tlsdev_ops {
+   int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+  enum tls_offload_ctx_dir direction,
+  struct tls_crypto_info *crypto_info,
+  u32 start_offload_tcp_sn);
+   void (*tls_dev_del)(struct net_device *netdev,
+   struct tls_context *ctx,
+   enum tls_offload_ctx_dir direction);
+};
+#endif
+
 struct dev_ifalias {
struct rcu_head rcuhead;
char ifalias[];
@@ -1748,6 +1768,10 @@ struct net_device {
const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+   const struct tlsdev_ops *tlsdev_ops;
+#endif
+
const struct header_ops *header_ops;
 
unsigned intflags;
-- 
2.14.3



[PATCH V5 net-next 12/14] net/mlx5e: TLS, Add error statistics

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add statistics for rare TLS related errors.
Since the errors are rare, we keep a counter per netdev
rather than per SQ.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 22 ++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 22 ++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 24 +++---
 .../mellanox/mlx5/core/en_accel/tls_stats.c| 89 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 22 ++
 8 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ec785f589666..a7135f5d5cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o 
ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o 
en_accel/tls_stats.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index be588295c216..2de6f52fbb30 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -796,6 +796,9 @@ struct mlx5e_priv {
 #ifdef CONFIG_MLX5_EN_IPSEC
struct mlx5e_ipsec*ipsec;
 #endif
+#ifdef CONFIG_MLX5_EN_TLS
+   struct mlx5e_tls  *tls;
+#endif
 };
 
 struct mlx5e_profile {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index aa6981c98bdc..d167845271c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -173,3 +173,25 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
netdev->hw_features |= NETIF_F_HW_TLS_TX;
	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
+
+int mlx5e_tls_init(struct mlx5e_priv *priv)
+{
+   struct mlx5e_tls *tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+
+   if (!tls)
+   return -ENOMEM;
+
+   priv->tls = tls;
+   return 0;
+}
+
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv)
+{
+   struct mlx5e_tls *tls = priv->tls;
+
+   if (!tls)
+   return;
+
+   kfree(tls);
+   priv->tls = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index f7216b9b98e2..b6162178f621 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -38,6 +38,17 @@
 #include 
 #include "en.h"
 
+struct mlx5e_tls_sw_stats {
+   atomic64_t tx_tls_drop_metadata;
+   atomic64_t tx_tls_drop_resync_alloc;
+   atomic64_t tx_tls_drop_no_sync_data;
+   atomic64_t tx_tls_drop_bypass_required;
+};
+
+struct mlx5e_tls {
+   struct mlx5e_tls_sw_stats sw_stats;
+};
+
 struct mlx5e_tls_offload_context {
struct tls_offload_context base;
u32 expected_seq;
@@ -55,10 +66,21 @@ mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
 }
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_tls_init(struct mlx5e_priv *priv);
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv);
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv);
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data);
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data);
 
 #else
 
 static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_tls_cleanup(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_get_count(struct mlx5e_priv *priv) { return 0; }
+static inline int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t 
*data) { return 0; }
+static inline int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data) { 
return 0; }
 
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 49e8d455ebc3..ad2790fb5966 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -164,7 +164,8 @@ static struct sk_buff *
 

[PATCH V5 net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Implement the TLS tx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

Special metadata ethertype is used to pass information to
the hardware.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  15 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h |  72 ++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |   2 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 272 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h |  50 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  37 +--
 10 files changed, 455 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 50872ed30c0b..ec785f589666 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o 
ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 61a14e8cbf56..be588295c216 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -340,6 +340,7 @@ struct mlx5e_sq_dma {
 enum {
MLX5E_SQ_STATE_ENABLED,
MLX5E_SQ_STATE_IPSEC,
+   MLX5E_SQ_STATE_TLS,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -825,6 +826,8 @@ void mlx5e_build_ptys2ethtool_map(void);
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
   void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+ struct mlx5e_tx_wqe *wqe, u16 pi);
 
 void mlx5e_completion_event(struct mlx5_core_cq *mcq);
 void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
@@ -930,6 +933,18 @@ static inline bool mlx5e_tunnel_inner_ft_supported(struct 
mlx5_core_dev *mdev)
MLX5_CAP_FLOWTABLE_NIC_RX(mdev, 
ft_field_support.inner_ip_version));
 }
 
+static inline void mlx5e_sq_fetch_wqe(struct mlx5e_txqsq *sq,
+ struct mlx5e_tx_wqe **wqe,
+ u16 *pi)
+{
+   struct mlx5_wq_cyc *wq;
+
+   wq = &sq->wq;
+   *pi = sq->pc & wq->sz_m1;
+   *wqe = mlx5_wq_cyc_get_wqe(wq, *pi);
+   memset(*wqe, 0, sizeof(**wqe));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
new file mode 100644
index ..68fcb40a2847
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A 

[PATCH V5 net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add NETIF_F_HW_TLS_TX capability and expose tlsdev_ops to work with the
TLS generic NIC offload infrastructure.
The NETIF_F_HW_TLS_TX capability will be added in the next patch.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|  11 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 173 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  65 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 5 files changed, 254 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 25deaa5a534c..6befd2c381b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -85,3 +85,14 @@ config MLX5_EN_IPSEC
  Build support for IPsec cryptography-offload accelaration in the NIC.
  Note: Support for hardware with this capability needs to be selected
  for this option to become available.
+
+config MLX5_EN_TLS
+   bool "TLS cryptography-offload accelaration"
+   depends on MLX5_CORE_EN
+   depends on TLS_DEVICE
+   depends on MLX5_ACCEL
+   default n
+   ---help---
+ Build support for TLS cryptography-offload accelaration in the NIC.
+ Note: Support for hardware with this capability needs to be selected
+ for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9989e5265a45..50872ed30c0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,4 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o 
ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
en_accel/ipsec_stats.o
 
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
new file mode 100644
index ..38d88108a55a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -0,0 +1,173 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include "en_accel/tls.h"
+#include "accel/tls.h"
+
+static void mlx5e_tls_set_ipv4_flow(void *flow, struct sock *sk)
+{
+   struct inet_sock *inet = inet_sk(sk);
+
+   MLX5_SET(tls_flow, flow, ipv6, 0);
+   memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+  &inet->inet_daddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+   memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv4_layout.ipv4),
+  &inet->inet_rcv_saddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void mlx5e_tls_set_ipv6_flow(void *flow, struct sock *sk)
+{
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   MLX5_SET(tls_flow, flow, ipv6, 1);
+   memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+  &sk->sk_v6_daddr, MLX5_FLD_SZ_BYTES(ipv6_layout, 

[PATCH V5 net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 include/net/sock.h | 21 +
 net/Kconfig|  4 
 net/core/dev.c |  4 
 3 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 709311132d4c..7607eeed6be2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,6 +481,11 @@ struct sock {
void(*sk_error_report)(struct sock *sk);
int (*sk_backlog_rcv)(struct sock *sk,
  struct sk_buff *skb);
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+   struct sk_buff* (*sk_validate_xmit_skb)(struct sock *sk,
+   struct net_device *dev,
+   struct sk_buff *skb);
+#endif
void(*sk_destruct)(struct sock *sk);
struct sock_reuseport __rcu *sk_reuseport_cb;
struct rcu_head sk_rcu;
@@ -2328,6 +2333,22 @@ static inline bool sk_fullsock(const struct sock *sk)
return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+/* Checks if this SKB belongs to an HW offloaded socket
+ * and whether any SW fallbacks are required based on dev.
+ */
+static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
+  struct net_device *dev)
+{
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+   struct sock *sk = skb->sk;
+
+   if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+   skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#endif
+
+   return skb;
+}
+
 /* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
  * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
  */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..fe84cfe3260e 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -407,6 +407,10 @@ config GRO_CELLS
bool
default n
 
+config SOCK_VALIDATE_XMIT
+   bool
+   default n
+
 config NET_DEVLINK
tristate "Network physical/parent device Netlink interface"
help
diff --git a/net/core/dev.c b/net/core/dev.c
index e13807b5c84d..e8a126a09d28 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3105,6 +3105,10 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff 
*skb, struct net_device
if (unlikely(!skb))
goto out_null;
 
+   skb = sk_validate_xmit_skb(skb, dev);
+   if (unlikely(!skb))
+   goto out_null;
+
if (netif_needs_gso(skb, features)) {
struct sk_buff *segs;
 
-- 
2.14.3



[PATCH V5 net-next 07/14] net/tls: Support TLS device offload with IPv6

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Previously get_netdev_for_sock worked only with IPv4.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 net/tls/tls_device.c | 51 ++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f33cd65efa8a..4c9664e141eb 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -34,6 +34,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -97,13 +102,57 @@ static void tls_device_queue_ctx_destruction(struct 
tls_context *ctx)
	spin_unlock_irqrestore(&tls_device_lock, flags);
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static struct net_device *ipv6_get_netdev(struct sock *sk)
+{
+   struct net_device *dev = NULL;
+   struct inet_sock *inet = inet_sk(sk);
+   struct ipv6_pinfo *np = inet6_sk(sk);
+   struct flowi6 _fl6, *fl6 = &_fl6;
+   struct dst_entry *dst;
+
+   memset(fl6, 0, sizeof(*fl6));
+   fl6->flowi6_proto = sk->sk_protocol;
+   fl6->daddr = sk->sk_v6_daddr;
+   fl6->saddr = np->saddr;
+   fl6->flowlabel = np->flow_label;
+   IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+   fl6->flowi6_oif = sk->sk_bound_dev_if;
+   fl6->flowi6_mark = sk->sk_mark;
+   fl6->fl6_sport = inet->inet_sport;
+   fl6->fl6_dport = inet->inet_dport;
+   fl6->flowi6_uid = sk->sk_uid;
+   security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+
+   if (ipv6_stub->ipv6_dst_lookup(sock_net(sk), sk, &dst, fl6) < 0)
+   return NULL;
+
+   dev = dst->dev;
+   dev_hold(dev);
+   dst_release(dst);
+
+   return dev;
+}
+#endif
+
 /* We assume that the socket is already connected */
 static struct net_device *get_netdev_for_sock(struct sock *sk)
 {
struct inet_sock *inet = inet_sk(sk);
struct net_device *netdev = NULL;
 
-   netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+   if (sk->sk_family == AF_INET)
+   netdev = dev_get_by_index(sock_net(sk),
+ inet->cork.fl.flowi_oif);
+#if IS_ENABLED(CONFIG_IPV6)
+   else if (sk->sk_family == AF_INET6) {
+   netdev = ipv6_get_netdev(sk);
+   if (!netdev && !sk->sk_ipv6only &&
+   ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED)
+   netdev = dev_get_by_index(sock_net(sk),
+ inet->cork.fl.flowi_oif);
+   }
+#endif
 
return netdev;
 }
-- 
2.14.3



[PATCH iproute] arrange prefix parsing code after redundant patches

2018-03-27 Thread Alexander Zubkov
A problem was reported with parsing of prefixes all/any/default.
Commit 7696f1097f79be2ce5984a8a16103fd17391cac2 fixes the problem,
but other patches were also applied:
00b31a6b2ecf73ee477f701098164600a2bfe227, which were intended to
fix the same problem and are now redundant. This patch
reverts the changes introduced by those redundant patches.

Signed-off-by: Alexander Zubkov 
---
 ip/iproute.c | 65 ++--
 lib/utils.c  | 13 
 2 files changed, 46 insertions(+), 32 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 32c93ed..bf886fd 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -191,20 +191,42 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return 0;
if ((filter.tos^r->rtm_tos))
return 0;
-   if (filter.rdst.family &&
-   (r->rtm_family != filter.rdst.family || filter.rdst.bitlen > 
r->rtm_dst_len))
-   return 0;
-   if (filter.mdst.family &&
-   (r->rtm_family != filter.mdst.family ||
-(filter.mdst.bitlen >= 0 && filter.mdst.bitlen < r->rtm_dst_len)))
-   return 0;
-   if (filter.rsrc.family &&
-   (r->rtm_family != filter.rsrc.family || filter.rsrc.bitlen > 
r->rtm_src_len))
-   return 0;
-   if (filter.msrc.family &&
-   (r->rtm_family != filter.msrc.family ||
-(filter.msrc.bitlen >= 0 && filter.msrc.bitlen < r->rtm_src_len)))
-   return 0;
+   if (filter.rdst.family) {
+   if (r->rtm_family != filter.rdst.family ||
+   filter.rdst.bitlen > r->rtm_dst_len)
+   return 0;
+   } else if (filter.rdst.flags & PREFIXLEN_SPECIFIED) {
+   if (filter.rdst.bitlen > r->rtm_dst_len)
+   return 0;
+   }
+   if (filter.mdst.family) {
+   if (r->rtm_family != filter.mdst.family ||
+   (filter.mdst.bitlen >= 0 &&
+filter.mdst.bitlen < r->rtm_dst_len))
+   return 0;
+   } else if (filter.mdst.flags & PREFIXLEN_SPECIFIED) {
+   if (filter.mdst.bitlen >= 0 &&
+   filter.mdst.bitlen < r->rtm_dst_len)
+   return 0;
+   }
+   if (filter.rsrc.family) {
+   if (r->rtm_family != filter.rsrc.family ||
+   filter.rsrc.bitlen > r->rtm_src_len)
+   return 0;
+   } else if (filter.rsrc.flags & PREFIXLEN_SPECIFIED) {
+   if (filter.rsrc.bitlen > r->rtm_src_len)
+   return 0;
+   }
+   if (filter.msrc.family) {
+   if (r->rtm_family != filter.msrc.family ||
+   (filter.msrc.bitlen >= 0 &&
+filter.msrc.bitlen < r->rtm_src_len))
+   return 0;
+   } else if (filter.msrc.flags & PREFIXLEN_SPECIFIED) {
+   if (filter.msrc.bitlen >= 0 &&
+   filter.msrc.bitlen < r->rtm_src_len)
+   return 0;
+   }
if (filter.rvia.family) {
int family = r->rtm_family;
 
@@ -221,7 +243,9 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
 
if (tb[RTA_DST])
		memcpy(&dst, RTA_DATA(tb[RTA_DST]), (r->rtm_dst_len+7)/8);
-   if (filter.rsrc.family || filter.msrc.family) {
+   if (filter.rsrc.family || filter.msrc.family ||
+   filter.rsrc.flags & PREFIXLEN_SPECIFIED ||
+   filter.msrc.flags & PREFIXLEN_SPECIFIED) {
if (tb[RTA_SRC])
			memcpy(&src, RTA_DATA(tb[RTA_SRC]), (r->rtm_src_len+7)/8);
}
@@ -241,15 +265,18 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
			memcpy(&prefsrc, RTA_DATA(tb[RTA_PREFSRC]), host_len/8);
}
 
-   if (filter.rdst.family && inet_addr_match(&dst, &filter.rdst, filter.rdst.bitlen))
+   if ((filter.rdst.family || filter.rdst.flags & PREFIXLEN_SPECIFIED) &&
+   inet_addr_match(&dst, &filter.rdst, filter.rdst.bitlen))
return 0;
-   if (filter.mdst.family && filter.mdst.bitlen >= 0 &&
+   if ((filter.mdst.family || filter.mdst.flags & PREFIXLEN_SPECIFIED) &&
	    inet_addr_match(&dst, &filter.mdst, r->rtm_dst_len))
return 0;
 
-   if (filter.rsrc.family && inet_addr_match(&src, &filter.rsrc, filter.rsrc.bitlen))
+   if ((filter.rsrc.family || filter.rsrc.flags & PREFIXLEN_SPECIFIED) &&
+   inet_addr_match(&src, &filter.rsrc, filter.rsrc.bitlen))
return 0;
-   if (filter.msrc.family && filter.msrc.bitlen >= 0 &&
+   if ((filter.msrc.family || filter.msrc.flags & PREFIXLEN_SPECIFIED) &&
+   filter.msrc.bitlen >= 0 &&
	    inet_addr_match(&src, &filter.msrc, r->rtm_src_len))
return 0;
 
diff --git a/lib/utils.c b/lib/utils.c
index dadefb5..b9e9a6c 100644
--- a/lib/utils.c
+++ b/lib/utils.c

[PATCH V5 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

This patch adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path,
leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records": these
records contain plaintext instead of ciphertext, and placeholder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN, the
driver calls the TLS layer to obtain the TLS record that includes the
TCP sequence number of the packet. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.
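The in-order/out-of-order decision described above can be sketched as a tiny state machine. The struct and function names here are illustrative, not the actual driver symbols:

```c
#include <assert.h>
#include <stdbool.h>

struct tx_state {
	unsigned int expected_seq; /* next TCP SN the NIC expects */
	int resyncs;               /* context reconstructions issued */
};

/* Returns true if the packet can be sent with the plain TLS offload
 * indication; otherwise records that a context-reconstruction work
 * entry would be posted on the same queue before transmission. */
static bool tx_packet(struct tx_state *st, unsigned int seq,
		      unsigned int payload_len)
{
	bool in_order = (seq == st->expected_seq);

	if (!in_order)
		st->resyncs++; /* driver rebuilds NIC TLS state first */

	/* In both cases the expected SN advances past this packet. */
	st->expected_seq = seq + payload_len;
	return in_order;
}
```

Posting the reconstruction request on the transmit queue itself (rather than a side channel) is what lets the driver avoid flushing the queue, as the commit message notes.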

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/net/tls.h | 120 +--
 net/tls/Kconfig   |  10 +
 net/tls/Makefile  |   2 +
 net/tls/tls_device.c  | 759 ++
 net/tls/tls_device_fallback.c | 454 +
 net/tls/tls_main.c| 120 ---
 net/tls/tls_sw.c  | 132 
 7 files changed, 1476 insertions(+), 121 deletions(-)
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

diff --git a/include/net/tls.h b/include/net/tls.h
index 437a746300bf..0a8529e9ec21 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -57,21 +57,10 @@
 
 #define TLS_AAD_SPACE_SIZE 13
 
-struct tls_sw_context {
+struct tls_sw_context_tx {
struct crypto_aead *aead_send;
-   struct crypto_aead *aead_recv;
struct crypto_wait async_wait;
 
-   /* Receive context */
-   struct strparser strp;
-   void (*saved_data_ready)(struct sock *sk);
-   unsigned int (*sk_poll)(struct file *file, struct socket *sock,
-   struct poll_table_struct *wait);
-   struct sk_buff *recv_pkt;
-   u8 control;
-   bool decrypted;
-
-   /* Sending context */
char aad_space[TLS_AAD_SPACE_SIZE];
 
unsigned int sg_plaintext_size;
@@ -88,6 +77,50 @@ struct tls_sw_context {
struct scatterlist sg_aead_out[2];
 };
 
+struct tls_sw_context_rx {
+   struct crypto_aead *aead_recv;
+   struct crypto_wait async_wait;
+
+   struct strparser strp;
+   void (*saved_data_ready)(struct sock *sk);
+   unsigned int (*sk_poll)(struct file *file, struct socket *sock,
+   struct poll_table_struct *wait);
+   struct sk_buff *recv_pkt;
+   u8 control;
+   bool decrypted;
+};
+
+struct tls_record_info {
+   struct list_head list;
+   u32 end_seq;
+   int len;
+   int num_frags;
+   skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct tls_offload_context {
+ 

[PATCH V5 net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Add routines for manipulating TLS TX offload contexts.

In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Add implementation for Innova TLS (FPGA-based) hardware.

These routines will be used by the TLS offload support in a later patch.

mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.

In the future, when IPSec/TLS or any other acceleration gets integrated
into the ConnectX chip, the mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c|  71 +++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h|  86 
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h|   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  11 +
 include/linux/mlx5/mlx5_ifc.h  |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h |  77 +++
 9 files changed, 879 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c805769d92a9..9989e5265a45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -8,10 +8,10 @@ mlx5_core-y :=main.o cmd.o debugfs.o fw.o eq.o uar.o 
pagealloc.o \
fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
diag/fs_tracepoint.o
 
-mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o
+mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
 mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o 
\
-   fpga/ipsec.o
+   fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o 
\
en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c 
b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
new file mode 100644
index ..77ac19f38cbe
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+
+#include "accel/tls.h"
+#include "mlx5_core.h"
+#include "fpga/tls.h"
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+  struct tls_crypto_info *crypto_info,
+  u32 start_offload_tcp_sn, u32 *p_swid)
+{
+   return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info,
+start_offload_tcp_sn, p_swid);
+}
+
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid)
+{
+   mlx5_fpga_tls_del_tx_flow(mdev, swid, 

[PATCH V5 net-next 14/14] MAINTAINERS: Update TLS maintainers

2018-03-27 Thread Saeed Mahameed
From: Boris Pismenny 

Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index bb8e4db89f0b..0d43f1a1eba3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9712,7 +9712,7 @@ F:net/netfilter/xt_CONNSECMARK.c
 F: net/netfilter/xt_SECMARK.c
 
 NETWORKING [TLS]
-M: Ilya Lesokhin 
+M: Boris Pismenny 
 M: Aviad Yehezkel 
 M: Dave Watson 
 L: netdev@vger.kernel.org
-- 
2.14.3



[PATCH V5 net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers

2018-03-27 Thread Saeed Mahameed
From: Boris Pismenny 

Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 MAINTAINERS | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9107d9241564..bb8e4db89f0b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8914,26 +8914,17 @@ W:  http://www.mellanox.com
 Q: http://patchwork.ozlabs.org/project/netdev/list/
 F: drivers/net/ethernet/mellanox/mlx5/core/en_*
 
-MELLANOX ETHERNET INNOVA DRIVER
-M: Ilan Tayari 
-R: Boris Pismenny 
+MELLANOX ETHERNET INNOVA DRIVERS
+M: Boris Pismenny 
 L: netdev@vger.kernel.org
 S: Supported
 W: http://www.mellanox.com
 Q: http://patchwork.ozlabs.org/project/netdev/list/
+F: drivers/net/ethernet/mellanox/mlx5/core/en_accel/*
+F: drivers/net/ethernet/mellanox/mlx5/core/accel/*
 F: drivers/net/ethernet/mellanox/mlx5/core/fpga/*
 F: include/linux/mlx5/mlx5_ifc_fpga.h
 
-MELLANOX ETHERNET INNOVA IPSEC DRIVER
-M: Ilan Tayari 
-R: Boris Pismenny 
-L: netdev@vger.kernel.org
-S: Supported
-W: http://www.mellanox.com
-Q: http://patchwork.ozlabs.org/project/netdev/list/
-F: drivers/net/ethernet/mellanox/mlx5/core/en_ipsec/*
-F: drivers/net/ethernet/mellanox/mlx5/core/ipsec*
-
 MELLANOX ETHERNET SWITCH DRIVERS
 M: Jiri Pirko 
 M: Ido Schimmel 
-- 
2.14.3



[PATCH V5 net-next 08/14] net/mlx5e: Move defines out of ipsec code

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

The defines are not IPSEC specific.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h | 3 ---
 drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c | 5 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h   | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 294bc9f175a5..61a14e8cbf56 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,9 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
+#define MLX5E_METADATA_ETHER_LEN 8
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
index 1198fc1eba4c..93bf10e6508c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
@@ -45,9 +45,6 @@
 #define MLX5E_IPSEC_SADB_RX_BITS 10
 #define MLX5E_IPSEC_ESN_SCOPE_MID 0x8000L
 
-#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
-#define MLX5E_METADATA_ETHER_LEN 8
-
 struct mlx5e_priv;
 
 struct mlx5e_ipsec_sw_stats {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
index 0f5da499a223..3c4f1f326e13 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
@@ -43,9 +43,6 @@
 #include "fpga/sdk.h"
 #include "fpga/core.h"
 
-#define SBU_QP_QUEUE_SIZE 8
-#define MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC   (60 * 1000)
-
 enum mlx5_fpga_ipsec_cmd_status {
MLX5_FPGA_IPSEC_CMD_PENDING,
MLX5_FPGA_IPSEC_CMD_SEND_FAIL,
@@ -258,7 +255,7 @@ static int mlx5_fpga_ipsec_cmd_wait(void *ctx)
 {
struct mlx5_fpga_ipsec_cmd_context *context = ctx;
unsigned long timeout =
-   msecs_to_jiffies(MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC);
+   msecs_to_jiffies(MLX5_FPGA_CMD_TIMEOUT_MSEC);
int res;
 
	res = wait_for_completion_timeout(&context->complete, timeout);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
index baa537e54a49..a0573cc2fc9b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
@@ -41,6 +41,8 @@
  * DOC: Innova SDK
  * This header defines the in-kernel API for Innova FPGA client drivers.
  */
+#define SBU_QP_QUEUE_SIZE 8
+#define MLX5_FPGA_CMD_TIMEOUT_MSEC (60 * 1000)
 
 enum mlx5_fpga_access_type {
MLX5_FPGA_ACCESS_TYPE_I2C = 0x0,
-- 
2.14.3



[PATCH V5 net-next 05/14] net: Add TLS TX offload features

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

This patch adds a netdev feature to configure TLS TX offloads.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c  | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index db84c516bcfb..18dc34202080 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -77,6 +77,7 @@ enum {
NETIF_F_HW_ESP_BIT, /* Hardware ESP transformation offload 
*/
NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */
NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
+   NETIF_F_HW_TLS_TX_BIT,  /* Hardware TLS TX offload */
 
NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */
 
@@ -145,6 +146,7 @@ enum {
 #define NETIF_F_HW_ESP __NETIF_F(HW_ESP)
 #define NETIF_F_HW_ESP_TX_CSUM __NETIF_F(HW_ESP_TX_CSUM)
 #defineNETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
+#define NETIF_F_HW_TLS_TX  __NETIF_F(HW_TLS_TX)
 
 #define for_each_netdev_feature(mask_addr, bit)\
for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index bb6e498c6e3d..0fb7cb4b68ce 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -107,6 +107,7 @@ static const char 
netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_HW_ESP_BIT] =   "esp-hw-offload",
[NETIF_F_HW_ESP_TX_CSUM_BIT] =   "esp-tx-csum-hw-offload",
[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =   "rx-udp_tunnel-port-offload",
+   [NETIF_F_HW_TLS_TX_BIT] ="tls-hw-tx-offload",
 };
 
 static const char
-- 
2.14.3



[PATCH V5 net-next 02/14] net: Rename and export copy_skb_header

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function gives more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 9 +
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47082f54ec1f..096e7fa572d8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1031,6 +1031,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
 struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
   gfp_t gfp_mask, bool fclone);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b5c75d4fcf37..e13652b169da 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1304,7 +1304,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
__copy_skb_header(new, old);
 
@@ -1312,6 +1312,7 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1354,7 +1355,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t 
gfp_mask)
 
BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-   copy_skb_header(n, skb);
+   skb_copy_header(n, skb);
return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1418,7 +1419,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, 
int headroom,
skb_clone_fraglist(n);
}
 
-   copy_skb_header(n, skb);
+   skb_copy_header(n, skb);
 out:
return n;
 }
@@ -1598,7 +1599,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 skb->len + head_copy_len));
 
-   copy_skb_header(n, skb);
+   skb_copy_header(n, skb);
 
skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3



[PATCH V5 net-next 01/14] tcp: Add clean acked data hook

2018-03-27 Thread Saeed Mahameed
From: Ilya Lesokhin 

Called when a TCP segment is acknowledged.
Could be used by application protocols that hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.
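The enable/disable pairing around the hook can be modeled in user-space C. A plain counter stands in for the kernel's static branch (the patch uses DEFINE_STATIC_KEY_FALSE with static_branch_inc/dec), and sock_model is an invented stand-in for struct sock:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the static key: the hook is consulted only while at
 * least one socket has registered a callback. */
static int clean_acked_enabled;

struct sock_model {
	void (*clean_acked)(struct sock_model *sk, unsigned int acked_seq);
	unsigned int last_cleaned;
};

static void hook_cb(struct sock_model *sk, unsigned int acked_seq)
{
	sk->last_cleaned = acked_seq; /* release record metadata here */
}

static void enable_hook(struct sock_model *sk,
			void (*cb)(struct sock_model *, unsigned int))
{
	sk->clean_acked = cb;
	clean_acked_enabled++;
}

static void disable_hook(struct sock_model *sk)
{
	clean_acked_enabled--;
	sk->clean_acked = NULL;
}

/* Mirrors the tcp_ack() hunk: invoke the hook only when snd_una
 * advanced and a callback is registered. */
static void on_ack(struct sock_model *sk, unsigned int ack,
		   unsigned int prior_snd_una)
{
	if (ack > prior_snd_una && clean_acked_enabled && sk->clean_acked)
		sk->clean_acked(sk, ack);
}
```

The point of the static key in the real patch is that connections without TLS offload pay only a patched-out branch in the TCP ACK hot path.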

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
Signed-off-by: Aviad Yehezkel 
Signed-off-by: Saeed Mahameed 
---
 include/net/inet_connection_sock.h |  2 ++
 include/net/tcp.h  |  8 
 net/ipv4/tcp_input.c   | 23 +++
 3 files changed, 33 insertions(+)

diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index b68fea022a82..2ab6667275df 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops  Pluggable ULP control hook
  * @icsk_ulp_data ULP private data
+ * @icsk_clean_acked  Clean acked data hook
  * @icsk_listen_portaddr_node  hash to the portaddr listener hashtable
  * @icsk_ca_state:Congestion control state
  * @icsk_retransmits: Number of unrecovered [RTO] timeouts
@@ -102,6 +103,7 @@ struct inet_connection_sock {
const struct inet_connection_sock_af_ops *icsk_af_ops;
const struct tcp_ulp_ops  *icsk_ulp_ops;
void  *icsk_ulp_data;
+   void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
struct hlist_node icsk_listen_portaddr_node;
unsigned int  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
__u8  icsk_ca_state:6,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9c9b3768b350..a15e294ced66 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2101,4 +2101,12 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+void clean_acked_data_enable(struct inet_connection_sock *icsk,
+void (*cad)(struct sock *sk, u32 ack_seq));
+void clean_acked_data_disable(struct inet_connection_sock *icsk);
+
+#endif
+
 #endif /* _TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 451ef3012636..8cfc6d1ac804 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -111,6 +111,23 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define REXMIT_LOST1 /* retransmit packets marked lost */
 #define REXMIT_NEW 2 /* FRTO-style transmit of unsent/new packets */
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+static DEFINE_STATIC_KEY_FALSE(clean_acked_data_enabled);
+
+void clean_acked_data_enable(struct inet_connection_sock *icsk,
+void (*cad)(struct sock *sk, u32 ack_seq))
+{
+   icsk->icsk_clean_acked = cad;
+   static_branch_inc(&clean_acked_data_enabled);
+}
+
+void clean_acked_data_disable(struct inet_connection_sock *icsk)
+{
+   static_branch_dec(&clean_acked_data_enabled);
+   icsk->icsk_clean_acked = NULL;
+}
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 unsigned int len)
 {
@@ -3542,6 +3559,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
if (after(ack, prior_snd_una)) {
flag |= FLAG_SND_UNA_ADVANCED;
icsk->icsk_retransmits = 0;
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+   if (static_branch_unlikely(&clean_acked_data_enabled))
+   if (icsk->icsk_clean_acked)
+   icsk->icsk_clean_acked(sk, ack);
+#endif
}
 
prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
-- 
2.14.3



[iproute PATCH v2 3/3] ss: Drop filter_default_dbs()

2018-03-27 Thread Phil Sutter
Instead call filter_db_parse(..., "all"). This eliminates the duplicate
default DB definition.

Signed-off-by: Phil Sutter 
---
 misc/ss.c | 19 +--
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 83e476a0407e5..fc8e2a0d719fd 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -408,23 +408,6 @@ static int filter_af_get(struct filter *f, int af)
return !!(f->families & FAMILY_MASK(af));
 }
 
-static void filter_default_dbs(struct filter *f, bool enable)
-{
-   filter_db_set(f, UDP_DB, enable);
-   filter_db_set(f, DCCP_DB, enable);
-   filter_db_set(f, TCP_DB, enable);
-   filter_db_set(f, RAW_DB, enable);
-   filter_db_set(f, UNIX_ST_DB, enable);
-   filter_db_set(f, UNIX_DG_DB, enable);
-   filter_db_set(f, UNIX_SQ_DB, enable);
-   filter_db_set(f, PACKET_R_DB, enable);
-   filter_db_set(f, PACKET_DG_DB, enable);
-   filter_db_set(f, NETLINK_DB, enable);
-   filter_db_set(f, SCTP_DB, enable);
-   filter_db_set(f, VSOCK_ST_DB, enable);
-   filter_db_set(f, VSOCK_DG_DB, enable);
-}
-
 static void filter_states_set(struct filter *f, int states)
 {
if (states)
@@ -4934,7 +4917,7 @@ int main(int argc, char *argv[])
 
if (do_default) {
state_filter = state_filter ? state_filter : SS_CONN;
-   filter_default_dbs(&current_filter, true);
+   filter_db_parse(&current_filter, "all");
}
 
filter_states_set(_filter, state_filter);
-- 
2.16.1



[iproute PATCH v2 0/3] ss: Allow excluding a socket table from being queried

2018-03-27 Thread Phil Sutter
The first patch in this series adds the new functionality, the remaining
two refactor the code a bit.

Note that Patch 1 creates checkpatch warnings due to overlong lines,
but patch 2 removes them again, so for the sake of readability I
left this as is.

Changes since v1:
- Fixed checkpatch errors in patch 2/3.

Phil Sutter (3):
  ss: Allow excluding a socket table from being queried
  ss: Put filter DB parsing into a separate function
  ss: Drop filter_default_dbs()

 man/man8/ss.8 |   8 +++-
 misc/ss.c | 147 +++---
 2 files changed, 76 insertions(+), 79 deletions(-)

-- 
2.16.1



[iproute PATCH v2 1/3] ss: Allow excluding a socket table from being queried

2018-03-27 Thread Phil Sutter
The original problem was that a simple call to 'ss' leads to loading of
sctp_diag kernel module which might not be desired. While searching for
a workaround, it became clear how inconvenient it is to exclude a single
socket table from being queried.

This patch allows prefixing an item passed to the '-A' parameter with an
exclamation mark to invert its meaning.

Signed-off-by: Phil Sutter 
---
 man/man8/ss.8 |   8 -
 misc/ss.c | 108 --
 2 files changed, 66 insertions(+), 50 deletions(-)

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 973afbe0b386b..28033d8f01dda 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -317,7 +317,10 @@ Currently the following families are supported: unix, 
inet, inet6, link, netlink
 List of socket tables to dump, separated by commas. The following identifiers
 are understood: all, inet, tcp, udp, raw, unix, packet, netlink, unix_dgram,
 unix_stream, unix_seqpacket, packet_raw, packet_dgram, dccp, sctp,
-vsock_stream, vsock_dgram.
+vsock_stream, vsock_dgram. Any item in the list may optionally be prefixed by
+an exclamation mark
+.RB ( ! )
+to exclude that socket table from being dumped.
 .TP
 .B \-D FILE, \-\-diag=FILE
 Do not display anything, just dump raw information about TCP sockets to FILE 
after applying filters. If FILE is - stdout is used.
@@ -380,6 +383,9 @@ Find all local processes connected to X server.
 .TP
 .B ss -o state fin-wait-1 '( sport = :http or sport = :https )' dst 
193.233.7/24
 List all the tcp sockets in state FIN-WAIT-1 for our apache to network 
193.233.7/24 and look at their timers.
+.TP
+.B ss -a -A 'all,!tcp'
+List sockets in all states from all socket tables but TCP.
 .SH SEE ALSO
 .BR ip (8),
 .br
diff --git a/misc/ss.c b/misc/ss.c
index 6338820bf4a01..05522176f1e61 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -329,10 +329,14 @@ static const struct filter default_afs[AF_MAX] = {
 static int do_default = 1;
 static struct filter current_filter;
 
-static void filter_db_set(struct filter *f, int db)
+static void filter_db_set(struct filter *f, int db, bool enable)
 {
-   f->states   |= default_dbs[db].states;
-   f->dbs  |= 1 << db;
+   if (enable) {
+   f->states   |= default_dbs[db].states;
+   f->dbs  |= 1 << db;
+   } else {
+   f->dbs &= ~(1 << db);
+   }
do_default   = 0;
 }
 
@@ -349,21 +353,21 @@ static int filter_af_get(struct filter *f, int af)
return !!(f->families & FAMILY_MASK(af));
 }
 
-static void filter_default_dbs(struct filter *f)
+static void filter_default_dbs(struct filter *f, bool enable)
 {
-   filter_db_set(f, UDP_DB);
-   filter_db_set(f, DCCP_DB);
-   filter_db_set(f, TCP_DB);
-   filter_db_set(f, RAW_DB);
-   filter_db_set(f, UNIX_ST_DB);
-   filter_db_set(f, UNIX_DG_DB);
-   filter_db_set(f, UNIX_SQ_DB);
-   filter_db_set(f, PACKET_R_DB);
-   filter_db_set(f, PACKET_DG_DB);
-   filter_db_set(f, NETLINK_DB);
-   filter_db_set(f, SCTP_DB);
-   filter_db_set(f, VSOCK_ST_DB);
-   filter_db_set(f, VSOCK_DG_DB);
+   filter_db_set(f, UDP_DB, enable);
+   filter_db_set(f, DCCP_DB, enable);
+   filter_db_set(f, TCP_DB, enable);
+   filter_db_set(f, RAW_DB, enable);
+   filter_db_set(f, UNIX_ST_DB, enable);
+   filter_db_set(f, UNIX_DG_DB, enable);
+   filter_db_set(f, UNIX_SQ_DB, enable);
+   filter_db_set(f, PACKET_R_DB, enable);
+   filter_db_set(f, PACKET_DG_DB, enable);
+   filter_db_set(f, NETLINK_DB, enable);
+   filter_db_set(f, SCTP_DB, enable);
+   filter_db_set(f, VSOCK_ST_DB, enable);
+   filter_db_set(f, VSOCK_DG_DB, enable);
 }
 
 static void filter_states_set(struct filter *f, int states)
@@ -4712,19 +4716,19 @@ int main(int argc, char *argv[])
follow_events = 1;
break;
case 'd':
-   filter_db_set(&current_filter, DCCP_DB);
+   filter_db_set(&current_filter, DCCP_DB, true);
break;
case 't':
-   filter_db_set(&current_filter, TCP_DB);
+   filter_db_set(&current_filter, TCP_DB, true);
break;
case 'S':
-   filter_db_set(&current_filter, SCTP_DB);
+   filter_db_set(&current_filter, SCTP_DB, true);
break;
case 'u':
-   filter_db_set(&current_filter, UDP_DB);
+   filter_db_set(&current_filter, UDP_DB, true);
break;
case 'w':
-   filter_db_set(&current_filter, RAW_DB);
+   filter_db_set(&current_filter, RAW_DB, true);
break;
case 'x':
filter_af_set(_filter, AF_UNIX);
@@ -4781,59 +4785,65 @@ int main(int argc, char *argv[])
}
 

[iproute PATCH v2 2/3] ss: Put filter DB parsing into a separate function

2018-03-27 Thread Phil Sutter
Use a table for database name parsing. The tricky bit is to allow for
association of a (nearly) arbitrary number of DBs with each name.
Luckily the number is not fully arbitrary as there is an upper bound of
MAX_DB items. Since it is not possible to have a variable length
array inside a variable length array, use this knowledge to make the
inner array of fixed length. But since DB values start from zero, an
explicit end entry needs to be present as well, so the inner array has
to be MAX_DB + 1 in size.
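The sentinel trick described above — a fixed-size inner array terminated by MAX_DB, because 0 is itself a valid DB value — can be demonstrated in isolation. The trimmed enum and table below are illustrative, not the full set from the patch:

```c
#include <assert.h>
#include <string.h>

enum { UDP_DB, TCP_DB, RAW_DB, MAX_DB }; /* trimmed DB list */

/* Return a bitmask of the DBs named by 's', using a table whose rows
 * are fixed-length arrays terminated by the MAX_DB sentinel. */
static unsigned int parse_dbs(const char *s)
{
	static const struct {
		const char *name;
		int dbs[MAX_DB + 1]; /* +1 leaves room for the sentinel */
	} tbl[] = {
#define ENTRY(name, ...) { #name, { __VA_ARGS__, MAX_DB } }
		ENTRY(inet, UDP_DB, TCP_DB, RAW_DB),
		ENTRY(udp, UDP_DB),
		ENTRY(tcp, TCP_DB),
#undef ENTRY
	};
	unsigned int mask = 0;

	for (size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		if (strcmp(s, tbl[i].name))
			continue;
		/* MAX_DB terminates each row; 0 (UDP_DB) cannot, since
		 * it is itself a valid database identifier. */
		for (const int *db = tbl[i].dbs; *db != MAX_DB; db++)
			mask |= 1u << *db;
	}
	return mask;
}
```

Shorter rows are implicitly zero-padded after the sentinel, which is harmless because iteration stops at the first MAX_DB.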

Signed-off-by: Phil Sutter 
---
Changes since v1:
- Fix checkpatch errors.

 misc/ss.c | 114 ++
 1 file changed, 56 insertions(+), 58 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 05522176f1e61..83e476a0407e5 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -340,6 +340,61 @@ static void filter_db_set(struct filter *f, int db, bool 
enable)
do_default   = 0;
 }
 
+static int filter_db_parse(struct filter *f, const char *s)
+{
+   const struct {
+   const char *name;
+   int dbs[MAX_DB + 1];
+   } db_name_tbl[] = {
+#define ENTRY(name, ...) { #name, { __VA_ARGS__, MAX_DB } }
+   ENTRY(all, UDP_DB, DCCP_DB, TCP_DB, RAW_DB,
+  UNIX_ST_DB, UNIX_DG_DB, UNIX_SQ_DB,
+  PACKET_R_DB, PACKET_DG_DB, NETLINK_DB,
+  SCTP_DB, VSOCK_ST_DB, VSOCK_DG_DB),
+   ENTRY(inet, UDP_DB, DCCP_DB, TCP_DB, SCTP_DB, RAW_DB),
+   ENTRY(udp, UDP_DB),
+   ENTRY(dccp, DCCP_DB),
+   ENTRY(tcp, TCP_DB),
+   ENTRY(sctp, SCTP_DB),
+   ENTRY(raw, RAW_DB),
+   ENTRY(unix, UNIX_ST_DB, UNIX_DG_DB, UNIX_SQ_DB),
+   ENTRY(unix_stream, UNIX_ST_DB),
+   ENTRY(u_str, UNIX_ST_DB),   /* alias for unix_stream */
+   ENTRY(unix_dgram, UNIX_DG_DB),
+   ENTRY(u_dgr, UNIX_DG_DB),   /* alias for unix_dgram */
+   ENTRY(unix_seqpacket, UNIX_SQ_DB),
+   ENTRY(u_seq, UNIX_SQ_DB),   /* alias for unix_seqpacket */
+   ENTRY(packet, PACKET_R_DB, PACKET_DG_DB),
+   ENTRY(packet_raw, PACKET_R_DB),
+   ENTRY(p_raw, PACKET_R_DB),  /* alias for packet_raw */
+   ENTRY(packet_dgram, PACKET_DG_DB),
+   ENTRY(p_dgr, PACKET_DG_DB), /* alias for packet_dgram */
+   ENTRY(netlink, NETLINK_DB),
+   ENTRY(vsock, VSOCK_ST_DB, VSOCK_DG_DB),
+   ENTRY(vsock_stream, VSOCK_ST_DB),
+   ENTRY(v_str, VSOCK_ST_DB),  /* alias for vsock_stream */
+   ENTRY(vsock_dgram, VSOCK_DG_DB),
+   ENTRY(v_dgr, VSOCK_DG_DB),  /* alias for vsock_dgram */
+#undef ENTRY
+   };
+   bool enable = true;
+   unsigned int i;
+   const int *dbp;
+
+   if (s[0] == '!') {
+   enable = false;
+   s++;
+   }
+   for (i = 0; i < ARRAY_SIZE(db_name_tbl); i++) {
+   if (strcmp(s, db_name_tbl[i].name))
+   continue;
+   for (dbp = db_name_tbl[i].dbs; *dbp != MAX_DB; dbp++)
+   filter_db_set(f, *dbp, enable);
+   return 0;
+   }
+   return -1;
+}
+
 static void filter_af_set(struct filter *f, int af)
 {
f->states  |= default_afs[af].states;
@@ -4785,66 +4840,9 @@ int main(int argc, char *argv[])
}
p = p1 = optarg;
do {
-   bool enable = true;
-
if ((p1 = strchr(p, ',')) != NULL)
*p1 = 0;
-   if (p[0] == '!') {
-   enable = false;
-   p++;
-   }
-   if (strcmp(p, "all") == 0) {
-   filter_default_dbs(&current_filter, enable);
-   } else if (strcmp(p, "inet") == 0) {
-   filter_db_set(&current_filter, UDP_DB, enable);
-   filter_db_set(&current_filter, DCCP_DB, enable);
-   filter_db_set(&current_filter, TCP_DB, enable);
-   filter_db_set(&current_filter, SCTP_DB, enable);
-   filter_db_set(&current_filter, RAW_DB, enable);
-   } else if (strcmp(p, "udp") == 0) {
-   filter_db_set(&current_filter, UDP_DB, enable);
-   } else if (strcmp(p, "dccp") == 0) {
-   filter_db_set(&current_filter, DCCP_DB, enable);
-   } else if (strcmp(p, "tcp") == 0) {
-   filter_db_set(&current_filter, 

Re: [Xen-devel] [PATCH 1/1] xen-netback: process malformed sk_buff correctly to avoid BUG_ON()

2018-03-27 Thread Dongli Zhang
Below is the sample kernel module used to reproduce the issue on purpose,
with "vif1.0" hard-coded:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int __init test_skb_init(void)
{
struct sk_buff *skb;
struct skb_shared_info *si;
struct net_device *dev;

dev = dev_get_by_name(&init_net, "vif1.0");
if (!dev) {
pr_alert("failed to get net_device\n");
return 0;
}

skb = alloc_skb(2000, GFP_ATOMIC | __GFP_NOWARN);
if (!skb) {
pr_alert("failed to allocate sk_buff\n");
return 0;
}

si = skb_shinfo(skb);

skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);

skb->dev = dev;
skb->len = 386;
skb->data_len = 352;

skb->mac_len = 14;
skb->pkt_type = 3;
skb->protocol = 8;
skb->transport_header = 98;
skb->network_header = 78;
skb->mac_header = 64;

skb->tail = 98;
skb->end = 384;

pr_alert("skb->data = 0x%016llx\n", (u64) skb->data);

dev->netdev_ops->ndo_start_xmit(skb, dev);

return 0;
}

static void __exit test_skb_exit(void)
{
}

MODULE_LICENSE("GPL");
module_init(test_skb_init);
module_exit(test_skb_exit);

Dongli Zhang



On 03/28/2018 07:42 AM, Dongli Zhang wrote:
> The "BUG_ON(!frag_iter)" in function xenvif_rx_next_chunk() is triggered if
> the received sk_buff is malformed, that is, when the sk_buff has pattern
> (skb->data_len && !skb_shinfo(skb)->nr_frags). Below is a sample call
> stack:
> 
> [  438.652658] [ cut here ]
> [  438.652660] kernel BUG at drivers/net/xen-netback/rx.c:325!
> [  438.652714] invalid opcode:  [#1] SMP NOPTI
> [  438.652813] CPU: 0 PID: 2492 Comm: vif1.0-q0-guest Tainted: G   O  
>4.16.0-rc6+ #1
> [  438.652896] RIP: e030:xenvif_rx_skb+0x3c2/0x5e0 [xen_netback]
> [  438.652926] RSP: e02b:c90040877dc8 EFLAGS: 00010246
> [  438.652956] RAX: 0160 RBX: 0022 RCX: 
> 0001
> [  438.652993] RDX: c900402890d0 RSI:  RDI: 
> c90040889000
> [  438.653029] RBP: 88002b460040 R08: c90040877de0 R09: 
> 0100
> [  438.653065] R10: 7ff0 R11: 0002 R12: 
> c90040889000
> [  438.653100] R13: 8000 R14: 0022 R15: 
> 8000
> [  438.653149] FS:  7f15603778c0() GS:88003040() 
> knlGS:
> [  438.653188] CS:  e033 DS:  ES:  CR0: 80050033
> [  438.653219] CR2: 01832a08 CR3: 29c12000 CR4: 
> 00042660
> [  438.653262] Call Trace:
> [  438.653284]  ? xen_hypercall_event_channel_op+0xa/0x20
> [  438.653313]  xenvif_rx_action+0x41/0x80 [xen_netback]
> [  438.653341]  xenvif_kthread_guest_rx+0xb2/0x2a8 [xen_netback]
> [  438.653374]  ? __schedule+0x352/0x700
> [  438.653398]  ? wait_woken+0x80/0x80
> [  438.653421]  kthread+0xf3/0x130
> [  438.653442]  ? xenvif_rx_action+0x80/0x80 [xen_netback]
> [  438.653470]  ? kthread_destroy_worker+0x40/0x40
> [  438.653497]  ret_from_fork+0x35/0x40
> 
> The issue is hit by xen-netback when there is a bug in another networking
> interface (e.g., the dom0 physical NIC) which has generated and forwarded
> a malformed sk_buff to the dom0 vifX.Y interface. It is possible to
> reproduce the issue on purpose with the sample code below in a kernel
> module:
> 
> skb->dev = dev; // dev of vifX.Y
> skb->len = 386;
> skb->data_len = 352;
> skb->tail = 98;
> skb->end = 384;
> dev->netdev_ops->ndo_start_xmit(skb, dev);
> 
> This patch stops processing the sk_buff immediately when it is detected
> as malformed, that is, when pkt->frag_iter is NULL but pkt->remaining_len
> is still non-zero.
> 
> Signed-off-by: Dongli Zhang 
> ---
>  drivers/net/xen-netback/rx.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
> index b1cf7c6..289cc82 100644
> --- a/drivers/net/xen-netback/rx.c
> +++ b/drivers/net/xen-netback/rx.c
> @@ -369,6 +369,14 @@ static void xenvif_rx_data_slot(struct xenvif_queue 
> *queue,
>   offset += len;
>   pkt->remaining_len -= len;
>  
> + if (unlikely(!pkt->frag_iter && pkt->remaining_len)) {
> + pkt->remaining_len = 0;
> + pkt->extra_count = 0;
> + pr_err_ratelimited("malformed sk_buff at %s\n",
> +queue->name);
> + break;
> + }
> +
>   } while (offset < XEN_PAGE_SIZE && pkt->remaining_len > 0);
>  
>   if (pkt->remaining_len > 0)
> 

