Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-17 Thread Heiner Kallweit
On 18.10.2018 07:21, David Miller wrote:
> From: Francois Romieu 
> Date: Thu, 18 Oct 2018 01:30:45 +0200
> 
>> Heiner Kallweit  :
>> [...]
>>> This issue has been there more or less forever (at least it exists in
>>> 3.16 already), so I can't provide a "Fixes" tag. 
>>
>> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.
> 
> I don't see exactly how that can be true.
> 
> That commit didn't change the parts of the NAPI poll processing which
> are relevant here, mainly the guarding of the RX and TX work using
> the status bits which are cleared.
> 

AFAICS Francois is right and patch da78dbff2e05 ("r8169: remove work
from irq handler") introduced the guarding of RX and TX work.
I just checked back to 3.16, the oldest LTS kernel version.

> Maybe I'm missing something?  If so, indeed it would be nice to add
> a proper Fixes: tag here.
> 
Shall I submit a v2 including the Fixes line?

> Thanks!
> 



Re: bond: take rcu lock in netpoll_send_skb_on_dev

2018-10-17 Thread Cong Wang
On Mon, Oct 15, 2018 at 4:36 AM Eran Ben Elisha  wrote:
> Hi,
>
> This suggested fix introduced a regression while using netconsole module
> with mlx5_core module loaded.

It is already reported here:
https://marc.info/?l=linux-kernel&m=153917359528669&w=2


>
> During irq handling, we hit a warning that this rcu_read_lock_bh cannot
> be taken inside an IRQ.

Yes, I mentioned the same even before this patch was sent out:
https://marc.info/?l=linux-netdev&m=153816136624679&w=2

Thanks.


Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges

2018-10-17 Thread Cong Wang
On Wed, Oct 17, 2018 at 9:42 PM David Miller  wrote:
>
> From: Amritha Nambiar 
> Date: Fri, 12 Oct 2018 06:53:30 -0700
>
> > Added support in tc flower for filtering based on port ranges.
> > This is a rework of the RFC patch at:
> > https://patchwork.ozlabs.org/patch/969595/
>
> You never addressed Cong's feedback asking you to explain why this
> can't be simply built using existing generic filtering facilities that
> exist already.
>
> I appreciate that you addressed Jiri's feedback, but Cong's feedback is
> just as, if not more, important.
>

My objection was against introducing a new filter just for port ranges;
now that it is built on top of the flower filter, it is much better.

The u32 filter can do nearly the same, but it requires power-of-two
ranges, so this is not a complete duplicate.

Therefore, I think the idea of building it on top of flower is fine. But
I didn't read the code, only the description.

Thanks!


[PATCH v3 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB

2018-10-17 Thread Song Liu
BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.

Two helper functions bpf_compute_and_save_data_pointers() and
bpf_restore_data_pointers() are introduced. They are used in
__cgroup_bpf_run_filter_skb() to compute the proper data_end for the
BPF program, and to restore the original data pointers afterwards.

Signed-off-by: Song Liu 
---
 include/linux/filter.h | 24 ++++++++++++++++++++++++
 kernel/bpf/cgroup.c    |  6 ++++++
 net/core/filter.c      | 36 +++++++++++++++++++++++++++++++++-
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5771874bc01e..96b3ee7f14c9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -548,6 +548,30 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
cb->data_end  = skb->data + skb_headlen(skb);
 }
 
+/* Similar to bpf_compute_data_pointers(), except that it saves the
+ * original cb->data_meta and cb->data_end for a later restore.
+ */
+static inline void bpf_compute_and_save_data_pointers(
+   struct sk_buff *skb, void *saved_pointers[2])
+{
+   struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+   saved_pointers[0] = cb->data_meta;
+   saved_pointers[1] = cb->data_end;
+   cb->data_meta = skb->data - skb_metadata_len(skb);
+   cb->data_end  = skb->data + skb_headlen(skb);
+}
+
+/* Restore the data pointers saved by bpf_compute_and_save_data_pointers(). */
+static inline void bpf_restore_data_pointers(
+   struct sk_buff *skb, void *saved_pointers[2])
+{
+   struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+   cb->data_meta = saved_pointers[0];
+   cb->data_end = saved_pointers[1];
+}
+
 static inline u8 *bpf_skb_cb(struct sk_buff *skb)
 {
/* eBPF programs may read/write skb->cb[] area to transfer meta
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00f6ed2e4f9a..5f5180104ddc 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -554,6 +554,7 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
unsigned int offset = skb->data - skb_network_header(skb);
struct sock *save_sk;
struct cgroup *cgrp;
+   void *saved_pointers[2];
int ret;
 
if (!sk || !sk_fullsock(sk))
@@ -566,8 +567,13 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
save_sk = skb->sk;
skb->sk = sk;
__skb_push(skb, offset);
+
+   /* compute pointers for the bpf prog */
+   bpf_compute_and_save_data_pointers(skb, saved_pointers);
+
ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
 bpf_prog_run_save_cb);
+   bpf_restore_data_pointers(skb, saved_pointers);
__skb_pull(skb, offset);
skb->sk = save_sk;
return ret == 1 ? 0 : -EPERM;
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c46873..e3ca30bd6840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5346,6 +5346,40 @@ static bool sk_filter_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool cg_skb_is_valid_access(int off, int size,
+  enum bpf_access_type type,
+  const struct bpf_prog *prog,
+  struct bpf_insn_access_aux *info)
+{
+   switch (off) {
+   case bpf_ctx_range(struct __sk_buff, tc_classid):
+   case bpf_ctx_range(struct __sk_buff, data_meta):
+   case bpf_ctx_range(struct __sk_buff, flow_keys):
+   return false;
+   }
+   if (type == BPF_WRITE) {
+   switch (off) {
+   case bpf_ctx_range(struct __sk_buff, mark):
+   case bpf_ctx_range(struct __sk_buff, priority):
+   case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+   break;
+   default:
+   return false;
+   }
+   }
+
+   switch (off) {
+   case bpf_ctx_range(struct __sk_buff, data):
+   info->reg_type = PTR_TO_PACKET;
+   break;
+   case bpf_ctx_range(struct __sk_buff, data_end):
+   info->reg_type = PTR_TO_PACKET_END;
+   break;
+   }
+
+   return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static bool lwt_is_valid_access(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
@@ -7038,7 +7072,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
 
 const struct bpf_verifier_ops cg_skb_verifier_ops = {
.get_func_proto = cg_skb_func_proto,
-   .is_valid_access= sk_filter_is_valid_access,
+   .is_valid_access= cg_skb_is_valid_access,
.convert_ctx_access = bpf_convert_ctx_access,
 };
 
-- 
2.17.1



[PATCH v3 bpf-next 0/2] bpf: add cg_skb_is_valid_access

2018-10-17 Thread Song Liu
Changes v2 -> v3:
1. Added helper function bpf_compute_and_save_data_pointers() and
   bpf_restore_data_pointers().

Changes v1 -> v2:
1. Updated the list of read-only fields, and read-write fields.
2. Added dummy sk to bpf_prog_test_run_skb().

This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
some __skb_buff data directly.

Song Liu (2):
  bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
  bpf: add tests for direct packet access from CGROUP_SKB

 include/linux/filter.h                      |  24 ++++++
 kernel/bpf/cgroup.c                         |   6 ++
 net/bpf/test_run.c                          |   4 +
 net/core/filter.c                           |  36 +++++-
 tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
 5 files changed, 239 insertions(+), 1 deletion(-)

--
2.17.1


[PATCH v3 bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB

2018-10-17 Thread Song Liu
Tests are added to make sure CGROUP_SKB cannot access:
  tc_classid, data_meta, flow_keys

and can read and write:
  mark, priority, and cb[0-4]

and can read other fields.

To make selftest with skb->sk work, a dummy sk is added in
bpf_prog_test_run_skb().

Signed-off-by: Song Liu 
---
 net/bpf/test_run.c                          |   4 +
 tools/testing/selftests/bpf/test_verifier.c | 170 ++++++++++++++++++++
 2 files changed, 174 insertions(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0c423b8cd75c..c7210e2f1ae9 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
@@ -115,6 +116,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
u32 retval, duration;
int hh_len = ETH_HLEN;
struct sk_buff *skb;
+   struct sock sk;
void *data;
int ret;
 
@@ -142,6 +144,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
kfree(data);
return -ENOMEM;
}
+   sock_init_data(NULL, &sk);
+   skb->sk = &sk;
 
skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
__skb_put(skb, size);
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index cf4cd32b6772..5bfba7e8afd7 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4862,6 +4862,176 @@ static struct bpf_test tests[] = {
.result = REJECT,
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
+   {
+   "direct packet read test#1 for CGROUP_SKB",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+   offsetof(struct __sk_buff, len)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+   offsetof(struct __sk_buff, pkt_type)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+   offsetof(struct __sk_buff, mark)),
+   BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+   offsetof(struct __sk_buff, mark)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+   offsetof(struct __sk_buff, queue_mapping)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+   offsetof(struct __sk_buff, protocol)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+   offsetof(struct __sk_buff, vlan_present)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   },
+   {
+   "direct packet read test#2 for CGROUP_SKB",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+   offsetof(struct __sk_buff, vlan_tci)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+   offsetof(struct __sk_buff, vlan_proto)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+   offsetof(struct __sk_buff, priority)),
+   BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+   offsetof(struct __sk_buff, priority)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+   offsetof(struct __sk_buff, ingress_ifindex)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+   offsetof(struct __sk_buff, tc_index)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+   offsetof(struct __sk_buff, hash)),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   },
+   {
+   "direct packet read test#3 for CGROUP_SKB",
+   .insns 

Re: [PATCH net] net: ipmr: fix unresolved entry dumps

2018-10-17 Thread David Miller
From: Nikolay Aleksandrov 
Date: Wed, 17 Oct 2018 22:34:34 +0300

> If the skb space ends in an unresolved entry while dumping we'll miss
> some unresolved entries. The reason is due to zeroing the entry counter
> between dumping resolved and unresolved mfc entries. We should just
> keep counting until the whole table is dumped and zero when we move to
> the next as we have a separate table counter.
> 
> Reported-by: Colin Ian King 
> Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
> Signed-off-by: Nikolay Aleksandrov 

Applied and queued up for -stable.


Re: [PATCH net] net/sched: properly init chain in case of multiple control actions

2018-10-17 Thread Cong Wang
On Tue, Oct 16, 2018 at 10:38 AM Davide Caratti  wrote:
>
> On Mon, 2018-10-15 at 11:31 -0700, Cong Wang wrote:
> > On Sat, Oct 13, 2018 at 8:23 AM Davide Caratti  wrote:
> > >
> > > On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote:
> > > > Why not just validate the fallback action in each action init()?
> > > > For example, checking tcfg_paction in tcf_gact_init().
> > > >
> > > > I don't see the need of making it generic.
> ...
> > > A (legal?) trick  is to let tcf_action store the fallback action when it
> > > contains a 'goto chain' command, I just posted a proposal for gact. If you
> > > think it's ok, I will test and post the same for act_police.
> >
> > Do we really need to support TC_ACT_GOTO_CHAIN for
> > gact->tcfg_paction etc.? I mean, is it useful in practice or is it just for
> > completeness?
> >
> > IF we don't need to support it, we can just make it invalid without needing
> > to initialize it in ->init() at all.
> >
> > If we do, however, we really need to move it into each ->init(), because
> > we have to lock each action if we are modifying an existing one. With
> > your patch, tcf_action_goto_chain_init() is still called without the 
> > per-action
> > lock.
> >
> > What's more, if we support two different actions in gact, that is, 
> > tcfg_paction
> > and tcf_action, how could you still only have one a->goto_chain pointer?
> > There should be two pointers for each of them. :)
>
> whatever fixes the NULL dereference is OK for me.
> I thought that the proposal made with
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg251933.html
>
> (i.e., letting init() copy tcfg_paction to tcf_action in case it contained
> 'goto chain x') was smart enough to preserve the current behavior, and
> also let 'goto chain' work in case it was configured  *only* for the
> fallback action.
> When the action is modified, the change to tcfg_paction is done with the
> same spinlock as tcf_action, so I didn't notice anything worse than the
> current locking layout.
>
> (well, after some more thinking I looked again at that patch and yes, it
> lacked the most important thing:)

Hmm, as I said, I am not sure the logic is correct: if we have two
different goto actions, we must have two pointers.

I will re-think about it tomorrow. (I am at a conference, so I don't
have much time for reviewing this.)

Thanks.


Re: [PATCH net] sctp: fix the data size calculation in sctp_data_size

2018-10-17 Thread David Miller
From: Xin Long 
Date: Wed, 17 Oct 2018 21:11:27 +0800

> sctp data size should be calculated by subtracting data chunk header's
> length from chunk_hdr->length, not just data header.
> 
> Fixes: 668c9beb9020 ("sctp: implement assign_number for 
> sctp_stream_interleave")
> Signed-off-by: Xin Long 

Applied and queued up for -stable.


Re: [PATCH V1 net-next] net: ena: enable Low Latency Queues

2018-10-17 Thread David Miller
From: 
Date: Wed, 17 Oct 2018 15:33:23 +0300

> From: Arthur Kiyanovski 
> 
> Use the new API to enable usage of LLQ.
> 
> Signed-off-by: Arthur Kiyanovski 

Applied.


Re: [PATCH V2 net-next] net: ena: Fix Kconfig dependency on X86

2018-10-17 Thread David Miller
From: 
Date: Wed, 17 Oct 2018 10:04:21 +

> From: Netanel Belgazal 
> 
> The Kconfig limitation to X86 is too wide.
> The ENA driver only requires a little endian dependency.
> 
> Change the dependency to be on little endian CPU.
> 
> Signed-off-by: Netanel Belgazal 

Applied.


Re: [PATCH net] mlxsw: core: Fix use-after-free when flashing firmware during init

2018-10-17 Thread David Miller
From: Ido Schimmel 
Date: Wed, 17 Oct 2018 08:05:45 +

> When the switch driver (e.g., mlxsw_spectrum) determines it needs to
> flash a new firmware version it resets the ASIC after the flashing
> process. The bus driver (e.g., mlxsw_pci) then registers itself again
> with mlxsw_core, which means (among other things) that the device
> registers itself with the hwmon subsystem again.
> 
> Since the device was registered with the hwmon subsystem using
> devm_hwmon_device_register_with_groups(), then the old hwmon device
> (registered before the flashing) was never unregistered and was
> referencing stale data, resulting in a use-after-free.
> 
> Fix by removing reliance on device managed APIs in mlxsw_hwmon_init().
> 
> Fixes: c86d62cc410c ("mlxsw: spectrum: Reset FW after flash")
> Signed-off-by: Ido Schimmel 
> Reported-by: Alexander Petrovskiy 
> Tested-by: Alexander Petrovskiy 
> Reviewed-by: Petr Machata 

Applied.


Re: [PATCH net] udp6: fix encap return code for resubmitting

2018-10-17 Thread David Miller
From: Paolo Abeni 
Date: Wed, 17 Oct 2018 11:44:04 +0200

> The commit eb63f2964dbe ("udp6: add missing checks on edumux packet
> processing") used the same return code convention of the ipv4 counterpart,
> but ipv6 uses the opposite one: positive values means resubmit.
> 
> This change addresses the issue, using positive return value for
> resubmitting. Also update the related comment, which was broken, too.
> 
> Fixes: eb63f2964dbe ("udp6: add missing checks on edumux packet processing")
> Signed-off-by: Paolo Abeni 
> ---
> Note: I could not find any in kernel udp6 encap using the above
> feature, that would explain why nobody complained so far...

Applied.


Re: [PATCH net-next 0/2] tcp_bbr: TCP BBR changes for EDT pacing model

2018-10-17 Thread David Miller
From: Neal Cardwell 
Date: Tue, 16 Oct 2018 20:16:43 -0400

> Two small patches for TCP BBR to follow up with Eric's recent work to change
> the TCP and fq pacing machinery to an "earliest departure time" (EDT) model:
> 
> - The first patch adjusts the TCP BBR logic to work with the new
>   "earliest departure time" (EDT) pacing model.
> 
> - The second patch adjusts the TCP BBR logic to centralize the setting
>   of gain values, to simplify the code and prepare for future changes.

Series applied, thanks Neal.


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-17 Thread David Miller
From: Francois Romieu 
Date: Thu, 18 Oct 2018 01:30:45 +0200

> Heiner Kallweit  :
> [...]
>> This issue has been there more or less forever (at least it exists in
>> 3.16 already), so I can't provide a "Fixes" tag. 
> 
> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.

I don't see exactly how that can be true.

That commit didn't change the parts of the NAPI poll processing which
are relevant here, mainly the guarding of the RX and TX work using
the status bits which are cleared.

Maybe I'm missing something?  If so, indeed it would be nice to add
a proper Fixes: tag here.

Thanks!


[bpf-next v2 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-17 Thread John Fastabend
Before using the psock returned by sk_psock_get() when adding it to a
sockmap, we need to ensure it is actually a sockmap-based psock.
Previously we were only checking this after incrementing the reference
counter which was an error. This resulted in a slab-out-of-bounds
error when the psock was not actually a sockmap type.

This moves the check up so the reference counter is only used
if it is a sockmap psock.

Eric reported the following KASAN BUG,

BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
Read of size 4 at addr 88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818

Signed-off-by: John Fastabend 
Reported-by: Eric Dumazet 
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
---
 include/linux/skmsg.h | 25 -
 net/core/sock_map.c   | 11 ++-
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 677b673..f44ca6b 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -275,11 +275,6 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
return rcu_dereference_sk_user_data(sk);
 }
 
-static inline bool sk_has_psock(struct sock *sk)
-{
-   return sk_psock(sk) != NULL && sk->sk_prot->recvmsg == tcp_bpf_recvmsg;
-}
-
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
  struct sk_msg *msg)
 {
@@ -379,6 +374,26 @@ static inline bool sk_psock_test_state(const struct sk_psock *psock,
	return test_bit(bit, &psock->state);
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+   struct sk_psock *psock;
+
+   rcu_read_lock();
+   psock = sk_psock(sk);
+   if (psock) {
+   if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
+   psock = ERR_PTR(-EBUSY);
+   goto out;
+   }
+
+   if (!refcount_inc_not_zero(&psock->refcnt))
+   psock = NULL;
+   }
+out:
+   rcu_read_unlock();
+   return psock;
+}
+
 static inline struct sk_psock *sk_psock_get(struct sock *sk)
 {
struct sk_psock *psock;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3c0e44c..be6092a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -175,12 +175,13 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
}
}
 
-   psock = sk_psock_get(sk);
+   psock = sk_psock_get_checked(sk);
+   if (IS_ERR(psock)) {
+   ret = PTR_ERR(psock);
+   goto out_progs;
+   }
+
if (psock) {
-   if (!sk_has_psock(sk)) {
-   ret = -EBUSY;
-   goto out_progs;
-   }
if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
(skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
sk_psock_put(sk, psock);
-- 
1.9.1



[bpf-next v2 0/2] Fix kcm + sockmap by checking psock type

2018-10-17 Thread John Fastabend
We check whether the sk_user_data (the psock in skmsg) is in fact a
sockmap type too late, after we have already read the refcnt, which is
an error. This series moves the check up before reading the refcnt and
also adds a test to test_maps that tries to add a KCM socket into a
sockmap.

While reviewing this code I also found an issue with KCM and kTLS,
where each uses sk_data_ready hooks and an associated stream parser,
breaking expectations in kcm, ktls or both. But that fix will need
to go to net.

Thanks to Eric for reporting.

v2: Fix up the diffstat (+/-); my scripts lost track of the files.

John Fastabend (2):
  bpf: skmsg, fix psock create on existing kcm/tls port
  bpf: test_maps add a test to catch kcm + sockmap

 include/linux/skmsg.h | 25 +---
 net/core/sock_map.c   | 11 +++---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
 tools/testing/selftests/bpf/test_maps.c   | 64 ++-
 5 files changed, 103 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

-- 
1.9.1



[bpf-next v2 2/2] bpf: test_maps add a test to catch kcm + sockmap

2018-10-17 Thread John Fastabend
Adding a socket to both sockmap and kcm is not supported due to
collision on sk_user_data usage.

If the selftests are run without KCM support, we issue a warning
and continue with the tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
 tools/testing/selftests/bpf/test_maps.c   | 64 ++-
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index d99dd6f..f290554 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -28,7 +28,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
-	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
+	sockmap_verdict_prog.o sockmap_kcm.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
diff --git a/tools/testing/selftests/bpf/sockmap_kcm.c b/tools/testing/selftests/bpf/sockmap_kcm.c
new file mode 100644
index 000..4377adc
--- /dev/null
+++ b/tools/testing/selftests/bpf/sockmap_kcm.c
@@ -0,0 +1,14 @@
+#include 
+#include "bpf_helpers.h"
+#include "bpf_util.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+
+SEC("socket_kcm")
+int bpf_prog1(struct __sk_buff *skb)
+{
+   return skb->len;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 9b552c0..be20f1d 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -479,14 +480,16 @@ static void test_devmap(int task, void *data)
 #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
 #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
 #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
+#define KCM_PROG "./sockmap_kcm.o"
 static void test_sockmap(int tasks, void *data)
 {
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
-   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
+   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
-   int parse_prog, verdict_prog, msg_prog;
+   int parse_prog, verdict_prog, msg_prog, kcm_prog;
+   struct kcm_attach attach_info;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
@@ -744,6 +747,62 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
 
+   /* Test adding a KCM socket into map */
+#define AF_KCM 41
+   kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+   if (kcm == -1) {
+   printf("Warning, KCM+Sockmap could not be tested.\n");
+   goto skip_kcm;
+   }
+
+   err = bpf_prog_load(KCM_PROG,
+   BPF_PROG_TYPE_SOCKET_FILTER,
+   &obj, &kcm_prog);
+   if (err) {
+   printf("Failed to load KCM prog\n");
+   goto out_sockmap;
+   }
+
+   i = 2;
+   memset(&attach_info, 0, sizeof(attach_info));
+   attach_info.fd = sfd[i];
+   attach_info.bpf_fd = kcm_prog;
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (!err) {
+   printf("Failed, KCM attached to sockmap fd\n");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_delete_elem(fd, &i);
+   if (err) {
+   printf("Failed delete sockmap from empty map %i %i\n", err, errno);
+   goto out_sockmap;
+   }
+
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (err) {
+   perror("Failed KCM attach");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (!err) {
+   printf("Failed sockmap attached KCM sock!\n");
+   goto out_sockmap;
+   }
+   err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info);
+   if (err) {
+   printf("Failed detach KCM sock!\n");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (err) {
+   printf("Failed post-kcm update sockmap '%i:%i'\n",
+  i, sfd[i]);
+

Re: [PATCH net] sctp: not free the new asoc when sctp_wait_for_connect returns err

2018-10-17 Thread David Miller
From: Xin Long 
Date: Wed, 17 Oct 2018 03:06:12 +0800

> When sctp_wait_for_connect is called to wait for connect ready
> for sp->strm_interleave in sctp_sendmsg_to_asoc, a panic could
> be triggered if cpu is scheduled out and the new asoc is freed
> elsewhere, as it will return err and later the asoc gets freed
> again in sctp_sendmsg.
 ...
> This is a similar issue with the one fixed in Commit ca3af4dd28cf
> ("sctp: do not free asoc when it is already dead in sctp_sendmsg").
> But this one can't be fixed by returning -ESRCH for the dead asoc
> in sctp_wait_for_connect, as it will break sctp_connect's return
> value to users.
> 
> This patch is to simply set err to -ESRCH before it returns to
> sctp_sendmsg when any err is returned by sctp_wait_for_connect
> for sp->strm_interleave, so that no asoc would be freed due to
> this.
> 
> When users see this error, they will know the packet hasn't been
> sent. And it also makes sense to not free asoc because waiting
> connect fails, like the second call for sctp_wait_for_connect in
> sctp_sendmsg_to_asoc.
> 
> Fixes: 668c9beb9020 ("sctp: implement assign_number for 
> sctp_stream_interleave")
> Signed-off-by: Xin Long 

Applied and queued up for -stable.


Re: [PATCH net] sctp: fix race on sctp_id2asoc

2018-10-17 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Tue, 16 Oct 2018 15:18:17 -0300

> syzbot reported a use-after-free involving sctp_id2asoc.  Dmitry Vyukov
> helped to root cause it and it is because of reading the asoc after it
> was freed:
> 
> CPU 1                           CPU 2
> (working on socket 1)           (working on socket 2)
>                                 sctp_association_destroy
> sctp_id2asoc
>  spin lock
>   grab the asoc from idr
>  spin unlock
>                                 spin lock
>                                 remove asoc from idr
>                                 spin unlock
>                                 free(asoc)
>  if asoc->base.sk != sk ... [*]
> 
> This can only be hit if trying to fetch asocs from different sockets. As
> we have a single IDR for all asocs, in all SCTP sockets, their id is
> unique on the system. An application can try to send stuff on an id
> that matches on another socket, and the if in [*] will protect from such
> usage. But it didn't consider that as that asoc may belong to another
> socket, it may be freed in parallel (read: under another socket lock).
> 
> We fix it by moving the checks in [*] into the protected region. This
> fixes it because the asoc cannot be freed while the lock is held.
> 
> Reported-by: syzbot+c7dd55d7aec49d48e...@syzkaller.appspotmail.com
> Acked-by: Dmitry Vyukov 
> Signed-off-by: Marcelo Ricardo Leitner 

Applied and queued up for -stable.


Re: [PATCH net] r8169: re-enable MSI-X on RTL8168g

2018-10-17 Thread David Miller
From: Heiner Kallweit 
Date: Tue, 16 Oct 2018 19:35:17 +0200

> Similar to d49c88d7677b ("r8169: Enable MSI-X on RTL8106e") after
> e9d0ba506ea8 ("PCI: Reprogram bridge prefetch registers on resume")
> we can safely assume that this also fixes the root cause of
> the issue worked around by 7c53a722459c ("r8169: don't use MSI-X on
> RTL8168g"). So let's revert it.
> 
> Fixes: 7c53a722459c ("r8169: don't use MSI-X on RTL8168g")
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH v2 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB

2018-10-17 Thread Song Liu



> On Oct 17, 2018, at 9:44 PM, Alexei Starovoitov 
>  wrote:
> 
> On Wed, Oct 17, 2018 at 04:36:15PM -0700, Song Liu wrote:
>> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
>> skb. This patch enables direct access of skb for these programs.
>> 
>> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
>> to compute proper data_end for the BPF program.
>> 
>> Signed-off-by: Song Liu 
>> ---
>> kernel/bpf/cgroup.c |  4 
>> net/core/filter.c   | 36 +++-
>> 2 files changed, 39 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 00f6ed2e4f9a..340d496f35bd 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>  save_sk = skb->sk;
>>  skb->sk = sk;
>>  __skb_push(skb, offset);
>> +
>> +/* compute pointers for the bpf prog */
>> +bpf_compute_data_pointers(skb);
> 
> cg_skb_is_valid_access() below looks good to me now,
> but I just realized that above change is not safe for all sockets.
> After sk_filter_trim_cap() is called in udp_queue_rcv_skb()
> it needs to see valid UDP_SKB_CB.
> But sizeof(struct udp_skb_cb)==28, so bpf_compute_data_pointers()
> would mangle the end of it.
> So we have to save/restore data_end/data_meta pointers as well.
> 
> I'm thinking that new helper like:
>  bpf_compute_and_save_data_pointers(skb, _of_16_bytes);
>  BPF_PROG_RUN_ARRAY();
>  bpf_restore_data_pointers(skb, _of_16_bytes);
> would be decent interface.

Thanks Alexei!

Will send v3 shortly.

Song

Re: [PATCH net] net: bpfilter: use get_pid_task instead of pid_task

2018-10-17 Thread David Miller
From: Taehee Yoo 
Date: Wed, 17 Oct 2018 00:35:10 +0900

> pid_task() dereferences the RCU-protected tasks array, but there is no
> rcu_read_lock() in the shutdown_umh() routine, so rcu_read_lock() is
> needed.
> get_pid_task() is a wrapper around pid_task(): it holds rcu_read_lock()
> while calling pid_task() and, if the task isn't NULL, increases the
> task's reference count.
> 
> test commands:
>%modprobe bpfilter
>%modprobe -rv bpfilter
 ...
> Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Taehee Yoo 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next v2] net: dsa: mv88e6xxx: Fix 88E6141/6341 2500mbps SERDES speed

2018-10-17 Thread David Miller
From: Marek Behún 
Date: Sat, 13 Oct 2018 14:40:31 +0200

> This is a fix for the port_set_speed method for the Topaz family.
> Currently the same method is used as for the Peridot family, but
> this is wrong for the SERDES port.
> 
> On Topaz, the SERDES port is port 5, not 9 and 10 as in Peridot.
> Moreover, setting alt_bit on Topaz only makes sense for port 0 (for
> differentiating 100mbps vs 200mbps). The SERDES port does not
> support more than 2500mbps, so alt_bit does not make any difference.
> 
> Signed-off-by: Marek Behún 

Applied, thank you.


Re: [PATCH net 0/2] geneve, vxlan: Don't set exceptions if skb->len < mtu

2018-10-17 Thread David Miller
From: Stefano Brivio 
Date: Fri, 12 Oct 2018 23:53:57 +0200

> This series fixes the exception abuse described in 2/2, and 1/2
> is just a preparatory change to make 2/2 less ugly.

Series applied.


Re: [PATCH bpf] bpf: fix doc of bpf_skb_adjust_room() in uapi

2018-10-17 Thread Alexei Starovoitov
On Wed, Oct 17, 2018 at 04:24:48PM +0200, Nicolas Dichtel wrote:
> len_diff is signed.
> 
> Fixes: fa15601ab31e ("bpf: add documentation for eBPF helpers (33-41)")
> CC: Quentin Monnet 
> Signed-off-by: Nicolas Dichtel 
> ---
>  include/uapi/linux/bpf.h   | 2 +-
>  tools/include/uapi/linux/bpf.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 66917a4eba27..c4ffe91d5598 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1430,7 +1430,7 @@ union bpf_attr {
>   *   Return
>   *   0 on success, or a negative error in case of failure.
>   *
> - * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
> + * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)

Thanks. Applied to bpf-next, since we're very late into release cycle.



Re: [PATCH net-next] netpoll: allow cleanup to be synchronous

2018-10-17 Thread David Miller
From: Debabrata Banerjee 
Date: Fri, 12 Oct 2018 12:59:29 -0400

> @@ -826,7 +826,10 @@ static void netpoll_async_cleanup(struct work_struct *work)
>  
>  void __netpoll_free_async(struct netpoll *np)
>  {
> - schedule_work(&np->cleanup_work);
> + if (rtnl_is_locked())
> + __netpoll_cleanup(np);
> + else
> + schedule_work(&np->cleanup_work);
>  }

rtnl_is_locked() says only that the RTNL mutex is held by someone.

It does not necessarily say that it is held by the current execution
context.

Which means you could erroneously run this synchronously when another
thread holds the RTNL mutex, not you.

I'm not applying this, sorry.


Re: [PATCH v2 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB

2018-10-17 Thread Alexei Starovoitov
On Wed, Oct 17, 2018 at 04:36:15PM -0700, Song Liu wrote:
> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
> skb. This patch enables direct access of skb for these programs.
> 
> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
> to compute proper data_end for the BPF program.
> 
> Signed-off-by: Song Liu 
> ---
>  kernel/bpf/cgroup.c |  4 
>  net/core/filter.c   | 36 +++-
>  2 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 00f6ed2e4f9a..340d496f35bd 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>   save_sk = skb->sk;
>   skb->sk = sk;
>   __skb_push(skb, offset);
> +
> + /* compute pointers for the bpf prog */
> + bpf_compute_data_pointers(skb);

cg_skb_is_valid_access() below looks good to me now,
but I just realized that above change is not safe for all sockets.
After sk_filter_trim_cap() is called in udp_queue_rcv_skb()
it needs to see valid UDP_SKB_CB.
But sizeof(struct udp_skb_cb)==28, so bpf_compute_data_pointers()
would mangle the end of it.
So we have to save/restore data_end/data_meta pointers as well.

I'm thinking that new helper like:
  bpf_compute_and_save_data_pointers(skb, _of_16_bytes);
  BPF_PROG_RUN_ARRAY();
  bpf_restore_data_pointers(skb, _of_16_bytes);
would be decent interface.



Re: [net-next PATCH] net: sched: cls_flower: Classify packets using port ranges

2018-10-17 Thread David Miller
From: Amritha Nambiar 
Date: Fri, 12 Oct 2018 06:53:30 -0700

> Added support in tc flower for filtering based on port ranges.
> This is a rework of the RFC patch at:
> https://patchwork.ozlabs.org/patch/969595/

You never addressed Cong's feedback asking you to explain why this
can't be simply built using existing generic filtering facilities that
exist already.

I appreciate that you addressed Jiri's feedback, but Cong's feedback is
just as, if not more, important.

Thank you.


RE: [PATCH v2 2/2] netdev/phy: add MDIO bus multiplexer driven by a regmap

2018-10-17 Thread Pankaj Bansal


> -Original Message-
> From: Florian Fainelli [mailto:f.faine...@gmail.com]
> Sent: Sunday, October 7, 2018 11:39 PM
> To: Pankaj Bansal ; Andrew Lunn 
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH v2 2/2] netdev/phy: add MDIO bus multiplexer driven by a
> regmap
> 
> 
> 
> On 10/07/18 11:24, Pankaj Bansal wrote:
> > Add support for an MDIO bus multiplexer controlled by a regmap device,
> > like an FPGA.
> >
> > Tested on a NXP LX2160AQDS board which uses the "QIXIS" FPGA attached
> > to the i2c bus.
> >
> > Signed-off-by: Pankaj Bansal 
> > ---
> >
> > Notes:
> > V2:
> >  - replaced be32_to_cpup with of_property_read_u32
> >  - incorporated Andrew's comments
> >
> >  drivers/net/phy/Kconfig   |  13 +++
> >  drivers/net/phy/Makefile  |   1 +
> >  drivers/net/phy/mdio-mux-regmap.c | 171 
> >  3 files changed, 185 insertions(+)
> >
> > diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index
> > 82070792edbb..d1ac9e70cbb2 100644
> > --- a/drivers/net/phy/Kconfig
> > +++ b/drivers/net/phy/Kconfig
> > @@ -87,6 +87,19 @@ config MDIO_BUS_MUX_MMIOREG
> >
> >   Currently, only 8/16/32 bits registers are supported.
> >
> > +config MDIO_BUS_MUX_REGMAP
> > +   tristate "REGMAP controlled MDIO bus multiplexers"
> > +   depends on OF_MDIO && REGMAP
> > +   select MDIO_BUS_MUX
> > +   help
> > + This module provides a driver for MDIO bus multiplexers that
> > + are controlled via a regmap device, like an FPGA connected to i2c.
> > + The multiplexer connects one of several child MDIO busses to a
> > + parent bus. Child bus selection is under the control of one of
> > + the FPGA's registers.
> > +
> > + Currently, only registers up to 32 bits wide are supported.
> > +
> >  config MDIO_CAVIUM
> > tristate
> >
> > diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile index
> > 5805c0b7d60e..33053f9f320d 100644
> > --- a/drivers/net/phy/Makefile
> > +++ b/drivers/net/phy/Makefile
> > @@ -29,6 +29,7 @@ obj-$(CONFIG_MDIO_BUS_MUX)+= mdio-mux.o
> >  obj-$(CONFIG_MDIO_BUS_MUX_BCM_IPROC)   += mdio-mux-bcm-iproc.o
> >  obj-$(CONFIG_MDIO_BUS_MUX_GPIO)+= mdio-mux-gpio.o
> >  obj-$(CONFIG_MDIO_BUS_MUX_MMIOREG) += mdio-mux-mmioreg.o
> > +obj-$(CONFIG_MDIO_BUS_MUX_REGMAP) += mdio-mux-regmap.o
> >  obj-$(CONFIG_MDIO_CAVIUM)  += mdio-cavium.o
> >  obj-$(CONFIG_MDIO_GPIO)+= mdio-gpio.o
> >  obj-$(CONFIG_MDIO_HISI_FEMAC)  += mdio-hisi-femac.o
> > diff --git a/drivers/net/phy/mdio-mux-regmap.c
> > b/drivers/net/phy/mdio-mux-regmap.c
> > new file mode 100644
> > index ..6068d05a728a
> > --- /dev/null
> > +++ b/drivers/net/phy/mdio-mux-regmap.c
> > @@ -0,0 +1,171 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +
> > +/* Simple regmap based MDIO MUX driver
> > + *
> > + * Copyright 2018 NXP
> > + *
> > + * Based on mdio-mux-mmioreg.c by Timur Tabi
> > + *
> > + * Author:
> > + * Pankaj Bansal 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct mdio_mux_regmap_state {
> > +   void*mux_handle;
> > +   struct regmap   *regmap;
> > +   u32 mux_reg;
> > +   u32 mask;
> > +};
> > +
> > +/* MDIO multiplexing switch function
> > + *
> > + * This function is called by the mdio-mux layer when it thinks the
> > +mdio bus
> > + * multiplexer needs to switch.
> > + *
> > + * 'current_child' is the current value of the mux register (masked
> > +via
> > + * s->mask).
> > + *
> > + * 'desired_child' is the value of the 'reg' property of the target
> > +child MDIO
> > + * node.
> > + *
> > + * The first time this function is called, current_child == -1.
> > + *
> > + * If current_child == desired_child, then the mux is already set to
> > +the
> > + * correct bus.
> > + */
> > +static int mdio_mux_regmap_switch_fn(int current_child, int desired_child,
> > +void *data)
> > +{
> > +   struct mdio_mux_regmap_state *s = data;
> > +   bool change;
> > +   int ret;
> > +
> > +   ret = regmap_update_bits_check(s->regmap,
> > +  s->mux_reg,
> > +  s->mask,
> > +  desired_child,
> > +  &change);
> > +
> > +   if (ret)
> > +   return ret;
> > +   if (change)
> > +   pr_debug("%s %d -> %d\n", __func__, current_child,
> > +desired_child);
> 
> If you add a struct platform_device or struct device reference to struct
> mdio_mux_regmap_state, the you can use dev_dbg() here with the correct
> device, which would be helpful if you are debugging problems, and there are
> more than once instance of them in the system.

Ok, I will add platform_device reference to struct.

> 
> > +   return ret;
> > +}
> > +
> > +static int mdio_mux_regmap_probe(struct platform_device *pdev) {
> > +   struct device_node *np2, 

Re: [PATCH 00/16] octeontx2-af: NPA and NIX blocks initialization

2018-10-17 Thread David Miller
From: sunil.kovv...@gmail.com
Date: Tue, 16 Oct 2018 16:57:04 +0530

> This patchset is a continuation to earlier submitted patch series
> to add a new driver for Marvell's OcteonTX2 SOC's 
> Resource virtualization unit (RVU) admin function driver.

Series applied.


RE: [PATCH v2 1/2] dt-bindings: net: add MDIO bus multiplexer driven by a regmap device

2018-10-17 Thread Pankaj Bansal
Hi Florian

> -Original Message-
> From: Florian Fainelli [mailto:f.faine...@gmail.com]
> Sent: Sunday, October 7, 2018 11:32 PM
> To: Pankaj Bansal ; Andrew Lunn 
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH v2 1/2] dt-bindings: net: add MDIO bus multiplexer driven 
> by
> a regmap device
> 
> 
> 
> On 10/07/18 11:24, Pankaj Bansal wrote:
> > Add support for an MDIO bus multiplexer controlled by a regmap device,
> > like an FPGA.
> >
> > Tested on a NXP LX2160AQDS board which uses the "QIXIS" FPGA attached
> > to the i2c bus.
> >
> > Signed-off-by: Pankaj Bansal 
> > ---
> >
> > Notes:
> > V2:
> >  - Fixed formatting error caused by using space instead of tab
> >  - Add newline between property list and subnode
> >  - Add newline between two subnodes
> >
> >  .../bindings/net/mdio-mux-regmap.txt | 95 ++
> >  1 file changed, 95 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
> > b/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
> > new file mode 100644
> > index ..88ebac26c1c5
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
> > @@ -0,0 +1,95 @@
> > +Properties for an MDIO bus multiplexer controlled by a regmap
> > +
> > +This is a special case of an MDIO bus multiplexer.  A regmap device,
> > +like an FPGA, is used to control which child bus is connected.  The
> > +mdio-mux node must be a child of the device that is controlled by a regmap.
> > +The driver currently only supports devices with up to 32-bit registers.
> 
> I would omit any sort of details about Linux constructs designed to support
> specific needs (e.g: regmap) as well as putting driver limitations into a 
> binding
> document because it really ought to be restricted to describing hardware.
> 

Actually the platform driver mdio-mux-regmap.c is a generalization of
mdio-mux-mmioreg.c.
As such, it doesn't describe any new hardware, so no new properties are
described by it.
The only new property is the compatible field.
I don't know how to describe this driver otherwise. Can you please help?

> > +
> > +Required properties in addition to the generic multiplexer properties:
> > +
> > +- compatible : string, must contain "mdio-mux-regmap"
> > +
> > +- reg : integer, contains the offset of the register that controls the bus
> > +   multiplexer. It can be a 32-bit number.
> 
> Technically it could be any "reg" property size, the only requirement is that 
> it
> must be of the appropriate size with respect to what the parent node of this
> "mdio-mux-regmap" node has, determined by #address-cells/#size-cells width.

We are reading only a single cell of this property using "of_property_read_u32".
That is why I put the size in this.

> 
> > +
> > +- mux-mask : integer, contains a 32-bit mask that specifies which
> > +   bits in the register control the actual bus multiplexer.  The
> > +   'reg' property of each child mdio-mux node must be constrained by
> > +   this mask.
> 
> Same thing here.

We are reading only a single cell of this property using "of_property_read_u32".
That is why I put the size in this.

> 
> Since this is a MDIO mux, I would invite you to specify what the correct
> #address-cells/#size-cells values should be (1, and 0 respectively as your
> example correctly shows).
> 

I will mention #address-cells/#size-cells values

> > +
> > +Example:
> > +
> > +The FPGA node defines an i2c-connected FPGA with a register space of 0x30
> > +bytes.
> > +For the "EMI2" MDIO bus, register 0x54 (BRDCFG4) controls the mux on that
> > +bus.
> > +A bitmask of 0x07 means that bits 0, 1 and 2 (bit 0 is lsb) are the
> > +bits on BRDCFG4 that control the actual mux.
> > +
> > +i2c@200 {
> > +   compatible = "fsl,vf610-i2c";
> > +   #address-cells = <1>;
> > +   #size-cells = <0>;
> > +   reg = <0x0 0x200 0x0 0x1>;
> > +   interrupts = <0 34 0x4>; // Level high type
> > +   clock-names = "i2c";
> > +   clocks = < 4 7>;
> > +   fsl-scl-gpio = < 15 0>;
> > +   status = "okay";
> > +
> > +   /* The FPGA node */
> > +   fpga@66 {
> > +   compatible = "fsl,lx2160aqds-fpga", "fsl,fpga-qixis-i2c";
> > +   reg = <0x66>;
> > +   #address-cells = <1>;
> > +   #size-cells = <0>;
> > +
> > +   mdio1_mux@54 {
> > +   compatible = "mdio-mux-regmap", "mdio-mux";
> > +   mdio-parent-bus = <>; /* MDIO bus */
> > +   reg = <0x54>;/* BRDCFG4 */
> > +   mux-mask = <0x07>;  /* EMI2_MDIO */
> > +   #address-cells=<1>;
> > +   #size-cells = <0>;
> > +
> > +   mdio1_ioslot1@0 {   // Slot 1
> > +   reg = <0x00>;
> > +   #address-cells = <1>;
> > +   #size-cells = <0>;
> > +
> > +   phy1@1 {
> > +   reg = <1>;
> > +  

Re: [bpf-next PATCH 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-17 Thread Alexei Starovoitov
On Wed, Oct 17, 2018 at 08:37:39PM -0700, John Fastabend wrote:
> Before using the psock returned by sk_psock_get() when adding it to a
> sockmap we need to ensure it is actually a sockmap based psock.
> Previously we were only checking this after incrementing the reference
> counter which was an error. This resulted in a slab-out-of-bounds
> error when the psock was not actually a sockmap type.
> 
> This moves the check up so the reference counter is only used
> if it is a sockmap psock.
> 
> Eric reported the following KASAN BUG,
> 
> BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
> BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
> Read of size 4 at addr 88019548be58 by task syz-executor4/22387
> 
> CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
>  print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>  check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
>  kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
>  atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
>  refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
>  sk_psock_get include/linux/skmsg.h:379 [inline]
>  sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
>  sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
>  sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
>  map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
> 
> Signed-off-by: John Fastabend 
> Reported-by: Eric Dumazet 
> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
> ---
>  0 files changed

something isn't right here.
It messed up the cover letter as well.
Please fix your scripts and resubmit.



[bpf-next PATCH 0/2] Fix kcm + sockmap by checking psock type

2018-10-17 Thread John Fastabend
We check if the sk_user_data (the psock in skmsg) is in fact a sockmap
type too late, after we read the refcnt, which is an error. This
series moves the check up before reading the refcnt and also adds a
test to test_maps that tries to add a KCM socket into a sockmap.

While reviewing this code I also found an issue with KCM and kTLS,
where each uses sk_data_ready hooks and an associated stream parser,
breaking expectations in kcm, ktls, or both. But that fix will need
to go to net.

Thanks to Eric for reporting.

---

John Fastabend (2):
  bpf: skmsg, fix psock create on existing kcm/tls port
  bpf: test_maps add a test to catch kcm + sockmap


 tools/testing/selftests/bpf/Makefile  |2 -
 tools/testing/selftests/bpf/sockmap_kcm.c |   14 ++
 tools/testing/selftests/bpf/test_maps.c   |   64 -
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

--
Signature


[bpf-next PATCH 2/2] bpf: test_maps add a test to catch kcm + sockmap

2018-10-17 Thread John Fastabend
Adding a socket to both sockmap and kcm is not supported due to
collision on sk_user_data usage.

If selftests is run without KCM support we will issue a warning
and continue with the tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile  |2 -
 tools/testing/selftests/bpf/sockmap_kcm.c |   14 ++
 tools/testing/selftests/bpf/test_maps.c   |   64 -
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index d99dd6f..f290554 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -28,7 +28,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
-	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
+	sockmap_verdict_prog.o sockmap_kcm.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
diff --git a/tools/testing/selftests/bpf/sockmap_kcm.c b/tools/testing/selftests/bpf/sockmap_kcm.c
new file mode 100644
index 000..4377adc
--- /dev/null
+++ b/tools/testing/selftests/bpf/sockmap_kcm.c
@@ -0,0 +1,14 @@
+#include 
+#include "bpf_helpers.h"
+#include "bpf_util.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+
+SEC("socket_kcm")
+int bpf_prog1(struct __sk_buff *skb)
+{
+   return skb->len;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 9b552c0..be20f1d 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -479,14 +480,16 @@ static void test_devmap(int task, void *data)
 #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
 #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
 #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
+#define KCM_PROG "./sockmap_kcm.o"
 static void test_sockmap(int tasks, void *data)
 {
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
-   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
+   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
-   int parse_prog, verdict_prog, msg_prog;
+   int parse_prog, verdict_prog, msg_prog, kcm_prog;
+   struct kcm_attach attach_info;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
@@ -744,6 +747,62 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
 
+   /* Test adding a KCM socket into map */
+#define AF_KCM 41
+   kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+   if (kcm == -1) {
+   printf("Warning, KCM+Sockmap could not be tested.\n");
+   goto skip_kcm;
+   }
+
+   err = bpf_prog_load(KCM_PROG,
+   BPF_PROG_TYPE_SOCKET_FILTER,
+   &obj, &kcm_prog);
+   if (err) {
+   printf("Failed to load SK_SKB parse prog\n");
+   goto out_sockmap;
+   }
+
+   i = 2;
+   memset(_info, 0, sizeof(attach_info));
+   attach_info.fd = sfd[i];
+   attach_info.bpf_fd = kcm_prog;
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (!err) {
+   perror("Failed KCM attached to sockmap fd: ");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_delete_elem(fd, &i);
+   if (err) {
+   printf("Failed delete sockmap from empty map %i %i\n", err, errno);
+   goto out_sockmap;
+   }
+
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (err) {
+   perror("Failed KCM attach");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (!err) {
+   printf("Failed sockmap attached KCM sock!\n");
+   goto out_sockmap;
+   }
+   err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info);
+   if (err) {
+   printf("Failed detach KCM sock!\n");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (err) {
+   printf("Failed post-kcm update sockmap '%i:%i'\n",
+  i, sfd[i]);
+  

[bpf-next PATCH 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-17 Thread John Fastabend
Before using the psock returned by sk_psock_get() when adding it to a
sockmap we need to ensure it is actually a sockmap based psock.
Previously we were only checking this after incrementing the reference
counter which was an error. This resulted in a slab-out-of-bounds
error when the psock was not actually a sockmap type.

This moves the check up so the reference counter is only used
if it is a sockmap psock.

Eric reported the following KASAN BUG,

BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
Read of size 4 at addr 88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818

Signed-off-by: John Fastabend 
Reported-by: Eric Dumazet 
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
---
 0 files changed

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 677b673..f44ca6b 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -275,11 +275,6 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
return rcu_dereference_sk_user_data(sk);
 }
 
-static inline bool sk_has_psock(struct sock *sk)
-{
-   return sk_psock(sk) != NULL && sk->sk_prot->recvmsg == tcp_bpf_recvmsg;
-}
-
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
  struct sk_msg *msg)
 {
@@ -379,6 +374,26 @@ static inline bool sk_psock_test_state(const struct sk_psock *psock,
	return test_bit(bit, &psock->state);
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+   struct sk_psock *psock;
+
+   rcu_read_lock();
+   psock = sk_psock(sk);
+   if (psock) {
+   if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
+   psock = ERR_PTR(-EBUSY);
+   goto out;
+   }
+
+   if (!refcount_inc_not_zero(&psock->refcnt))
+   psock = NULL;
+   }
+out:
+   rcu_read_unlock();
+   return psock;
+}
+
 static inline struct sk_psock *sk_psock_get(struct sock *sk)
 {
struct sk_psock *psock;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3c0e44c..be6092a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -175,12 +175,13 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
}
}
 
-   psock = sk_psock_get(sk);
+   psock = sk_psock_get_checked(sk);
+   if (IS_ERR(psock)) {
+   ret = PTR_ERR(psock);
+   goto out_progs;
+   }
+
if (psock) {
-   if (!sk_has_psock(sk)) {
-   ret = -EBUSY;
-   goto out_progs;
-   }
if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
(skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
sk_psock_put(sk, psock);



Re: [PATCH bpf-next v3 00/13] bpf: add btf func info support

2018-10-17 Thread Alexei Starovoitov
On Wed, Oct 17, 2018 at 04:30:22PM -0700, Yonghong Song wrote:
> 
> This patch added func info support to the kernel so we can
> get better ksym's for bpf function calls. Basically,
> pairs of bpf function calls and their corresponding types
> are passed to the kernel. Extracting function names from
> the types, the kernel is able to construct a ksym for
> each function call with embedded function name.
...
> Below is a demonstration from Patch #13.
>   $ bpftool prog dump jited id 1
>   int _dummy_tracepoint(struct dummy_tracepoint_args * ):
>   bpf_prog_b07ccb89267cf242__dummy_tracepoint:
>  0:   push   %rbp
>  1:   mov%rsp,%rbp
> ..
> 3c:   add$0x28,%rbp
> 40:   leaveq
> 41:   retq
>   
>   int test_long_fname_1(struct dummy_tracepoint_args * ):
>   bpf_prog_2dcecc18072623fc_test_long_fname_1:
>  0:   push   %rbp
>  1:   mov%rsp,%rbp

Considering we only had 16-byte names for main prog and
no names at all for subprogs, this is huge improvement
in BPF introspection.
Cannot wait for the followup patches with line info support.

For the set:
Acked-by: Alexei Starovoitov 



[net-next 12/14] net/mlx5: E-Switch, Enable setting goto slow path chain action

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

A pre-step for the tc offloads code to use this when a neigh is
not available for encap rules.

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  | 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 54215f4312fa..aaafc9f17115 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -60,6 +60,7 @@
MLX5_CAP_ESW_FLOWTABLE(dev, fdb_multi_path_to_table)
 
 #define FDB_MAX_CHAIN 3
+#define FDB_SLOW_PATH_CHAIN (FDB_MAX_CHAIN + 1)
 #define FDB_MAX_PRIO 16
 
 struct vport_ingress {
@@ -356,6 +357,7 @@ static inline int  mlx5_eswitch_enable_sriov(struct mlx5_eswitch *esw, int nvfs,
 static inline void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw) {}
 
 #define FDB_MAX_CHAIN 1
+#define FDB_SLOW_PATH_CHAIN (FDB_MAX_CHAIN + 1)
 #define FDB_MAX_PRIO 1
 
 #endif /* CONFIG_MLX5_ESWITCH */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 289f1992f624..42a130455ef8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -671,6 +671,9 @@ esw_get_prio_table(struct mlx5_eswitch *esw, u32 chain, u16 prio, int level)
int table_prio, l = 0;
u32 flags = 0;
 
+   if (chain == FDB_SLOW_PATH_CHAIN)
+   return esw->fdb_table.offloads.slow_fdb;
+
	mutex_lock(&esw->fdb_table.offloads.fdb_prio_lock);
 
fdb = fdb_prio_table(esw, chain, prio, level).fdb;
@@ -730,6 +733,9 @@ esw_put_prio_table(struct mlx5_eswitch *esw, u32 chain, u16 prio, int level)
 {
int l;
 
+   if (chain == FDB_SLOW_PATH_CHAIN)
+   return;
+
	mutex_lock(&esw->fdb_table.offloads.fdb_prio_lock);
 
for (l = level; l >= 0; l--) {
-- 
2.17.2



[net-next 14/14] net/mlx5e: Support offloading tc priorities and chains for eswitch flows

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

Currently we fail when user specify a non-zero chain, this patch adds the
support for it and tc priorities. To get to a new chain, use the tc
goto action.

Currently we support a fixed prio range 1-16, and chain range 0-3.

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 --
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  6 ---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 51 ++-
 .../mellanox/mlx5/core/eswitch_offloads.c |  2 +-
 4 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c9848e333450..1243edbedc9e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3392,9 +3392,6 @@ static int mlx5e_setup_tc_block_cb(enum tc_setup_type 
type, void *type_data,
 {
struct mlx5e_priv *priv = cb_priv;
 
-   if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
-   return -EOPNOTSUPP;
-
switch (type) {
case TC_SETUP_CLSFLOWER:
return mlx5e_setup_tc_cls_flower(priv, type_data, 
MLX5E_TC_INGRESS);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 64c2b9ea8b1e..c3c657548824 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -853,9 +853,6 @@ static int mlx5e_rep_setup_tc_cb_egdev(enum tc_setup_type 
type, void *type_data,
 {
struct mlx5e_priv *priv = cb_priv;
 
-   if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
-   return -EOPNOTSUPP;
-
switch (type) {
case TC_SETUP_CLSFLOWER:
return mlx5e_rep_setup_tc_cls_flower(priv, type_data, 
MLX5E_TC_EGRESS);
@@ -869,9 +866,6 @@ static int mlx5e_rep_setup_tc_cb(enum tc_setup_type type, 
void *type_data,
 {
struct mlx5e_priv *priv = cb_priv;
 
-   if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
-   return -EOPNOTSUPP;
-
switch (type) {
case TC_SETUP_CLSFLOWER:
return mlx5e_rep_setup_tc_cls_flower(priv, type_data, 
MLX5E_TC_INGRESS);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index cb66964aa1ff..608025ca5c04 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -898,15 +898,32 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
  struct netlink_ext_ack *extack)
 {
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+   u32 max_chain = mlx5_eswitch_get_chain_range(esw);
struct mlx5_esw_flow_attr *attr = flow->esw_attr;
+   u16 max_prio = mlx5_eswitch_get_prio_range(esw);
struct net_device *out_dev, *encap_dev = NULL;
struct mlx5_fc *counter = NULL;
struct mlx5e_rep_priv *rpriv;
struct mlx5e_priv *out_priv;
int err = 0, encap_err = 0;
 
-   /* keep the old behaviour, use same prio for all offloaded rules */
-   attr->prio = 1;
+   /* if prios are not supported, keep the old behaviour of using same prio
+* for all offloaded rules.
+*/
+   if (!mlx5_eswitch_prios_supported(esw))
+   attr->prio = 1;
+
+   if (attr->chain > max_chain) {
+   NL_SET_ERR_MSG(extack, "Requested chain is out of supported 
range");
+   err = -EOPNOTSUPP;
+   goto err_max_prio_chain;
+   }
+
+   if (attr->prio > max_prio) {
+   NL_SET_ERR_MSG(extack, "Requested priority is out of supported 
range");
+   err = -EOPNOTSUPP;
+   goto err_max_prio_chain;
+   }
 
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT) {
out_dev = __dev_get_by_index(dev_net(priv->netdev),
@@ -975,6 +992,7 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT)
mlx5e_detach_encap(priv, flow);
 err_attach_encap:
+err_max_prio_chain:
return err;
 }
 
@@ -2826,6 +2844,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
struct mlx5e_tc_flow *flow,
struct netlink_ext_ack *extack)
 {
+   struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5_esw_flow_attr *attr = flow->esw_attr;
struct mlx5e_rep_priv *rpriv = priv->ppriv;
struct ip_tunnel_info *info = NULL;
@@ -2934,6 +2953,25 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
continue;
}
 
+   if (is_tcf_gact_goto_chain(a)) {
+   u32 

[net-next 13/14] net/mlx5e: Use a slow path rule instead if vxlan neighbour isn't available

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

When adding a vxlan tc rule while a neighbour isn't available, we
don't insert any rule into hardware. Once we enable offloading flows
with multiple priorities, a packet that should have matched this rule
will continue through the hardware pipeline and might match a wrong one.

This is unlike the tc software path, where the packet would be matched
and forwarded to the vxlan device (eventually causing an ARP lookup),
stopping further processing of tc filters.

To address that, when a neighbour isn't available (attach_encap returns
-EAGAIN), or gets deleted, change the original action to be a forward
to the slow path instead. A neighbour update will restore the original
action once the neighbour becomes available. This is done atomically,
so at any given time we will have the correct match.

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 104 ++
 1 file changed, 82 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 1786e25644ac..cb66964aa1ff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -74,6 +74,7 @@ enum {
MLX5E_TC_FLOW_OFFLOADED = BIT(MLX5E_TC_FLOW_BASE + 2),
MLX5E_TC_FLOW_HAIRPIN   = BIT(MLX5E_TC_FLOW_BASE + 3),
MLX5E_TC_FLOW_HAIRPIN_RSS = BIT(MLX5E_TC_FLOW_BASE + 4),
+   MLX5E_TC_FLOW_SLOW= BIT(MLX5E_TC_FLOW_BASE + 5),
 };
 
 #define MLX5E_TC_MAX_SPLITS 1
@@ -82,7 +83,7 @@ struct mlx5e_tc_flow {
struct rhash_head   node;
struct mlx5e_priv   *priv;
u64 cookie;
-   u8  flags;
+   u16 flags;
struct mlx5_flow_handle *rule[MLX5E_TC_MAX_SPLITS + 1];
struct list_headencap;   /* flows sharing the same encap ID */
struct list_headmod_hdr; /* flows sharing the same mod hdr ID */
@@ -860,6 +861,36 @@ mlx5e_tc_unoffload_fdb_rules(struct mlx5_eswitch *esw,
mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], attr);
 }
 
+static struct mlx5_flow_handle *
+mlx5e_tc_offload_to_slow_path(struct mlx5_eswitch *esw,
+ struct mlx5e_tc_flow *flow,
+ struct mlx5_flow_spec *spec,
+ struct mlx5_esw_flow_attr *slow_attr)
+{
+   struct mlx5_flow_handle *rule;
+
+   memcpy(slow_attr, flow->esw_attr, sizeof(*slow_attr));
+   slow_attr->action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+   slow_attr->mirror_count = 0,
+   slow_attr->dest_chain = FDB_SLOW_PATH_CHAIN,
+
+   rule = mlx5e_tc_offload_fdb_rules(esw, flow, spec, slow_attr);
+   if (!IS_ERR(rule))
+   flow->flags |= MLX5E_TC_FLOW_SLOW;
+
+   return rule;
+}
+
+static void
+mlx5e_tc_unoffload_from_slow_path(struct mlx5_eswitch *esw,
+ struct mlx5e_tc_flow *flow,
+ struct mlx5_esw_flow_attr *slow_attr)
+{
+   memcpy(slow_attr, flow->esw_attr, sizeof(*slow_attr));
+   mlx5e_tc_unoffload_fdb_rules(esw, flow, slow_attr);
+   flow->flags &= ~MLX5E_TC_FLOW_SLOW;
+}
+
 static int
 mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
  struct mlx5e_tc_flow_parse_attr *parse_attr,
@@ -917,15 +948,21 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
/* we get here if (1) there's no error or when
 * (2) there's an encap action and we're on -EAGAIN (no valid neigh)
 */
-   if (encap_err != -EAGAIN) {
+   if (encap_err == -EAGAIN) {
+   /* continue with goto slow path rule instead */
+   struct mlx5_esw_flow_attr slow_attr;
+
+   flow->rule[0] = mlx5e_tc_offload_to_slow_path(esw, flow, &parse_attr->spec, &slow_attr);
+   } else {
flow->rule[0] = mlx5e_tc_offload_fdb_rules(esw, flow, &parse_attr->spec, attr);
-   if (IS_ERR(flow->rule[0])) {
-   err = PTR_ERR(flow->rule[0]);
-   goto err_add_rule;
-   }
}
 
-   return encap_err;
+   if (IS_ERR(flow->rule[0])) {
+   err = PTR_ERR(flow->rule[0]);
+   goto err_add_rule;
+   }
+
+   return 0;
 
 err_add_rule:
mlx5_fc_destroy(esw->dev, counter);
@@ -946,9 +983,14 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
 {
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5_esw_flow_attr *attr = flow->esw_attr;
+   struct mlx5_esw_flow_attr slow_attr;
 
-   if (flow->flags & MLX5E_TC_FLOW_OFFLOADED)
-   mlx5e_tc_unoffload_fdb_rules(esw, flow, flow->esw_attr);
+   if (flow->flags & MLX5E_TC_FLOW_OFFLOADED) {
+   if (flow->flags & MLX5E_TC_FLOW_SLOW)
+   mlx5e_tc_unoffload_from_slow_path(esw, flow, &slow_attr);

[net-next 05/14] net/mlx5: Add cap bits for multi fdb encap

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

If set, the firmware supports creating flow tables with encap
enabled while VFs are configured, even if we have already created one
(the restriction still applies to the first creation).

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 include/linux/mlx5/mlx5_ifc.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 15e36198f85f..963611820006 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -584,7 +584,9 @@ struct mlx5_ifc_flow_table_nic_cap_bits {
 struct mlx5_ifc_flow_table_eswitch_cap_bits {
u8  reserved_at_0[0x1c];
u8  fdb_multi_path_to_table[0x1];
-   u8  reserved_at_1d[0x1e3];
+   u8  reserved_at_1d[0x1];
+   u8  multi_fdb_encap[0x1];
+   u8  reserved_at_1e[0x1e1];
 
struct mlx5_ifc_flow_table_prop_layout_bits 
flow_table_properties_nic_esw_fdb;
 
-- 
2.17.2



[net-next 03/14] net/mlx5e: Change return type of tc add flow functions

2018-10-17 Thread Saeed Mahameed
From: Rabie Loulou 

Refactor the flow add utility functions to return err code instead of rule
pointers. This will allow for simpler logic when one tc rule is
duplicated to two HW rules in downstream patches.

Signed-off-by: Rabie Loulou 
Signed-off-by: Shahar Klein 
Reviewed-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 86 +--
 1 file changed, 39 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 5ce87f54852d..861986f82844 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -673,7 +673,7 @@ static void mlx5e_hairpin_flow_del(struct mlx5e_priv *priv,
}
 }
 
-static struct mlx5_flow_handle *
+static int
 mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
  struct mlx5e_tc_flow_parse_attr *parse_attr,
  struct mlx5e_tc_flow *flow,
@@ -689,14 +689,12 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
.reformat_id = 0,
};
struct mlx5_fc *counter = NULL;
-   struct mlx5_flow_handle *rule;
bool table_created = false;
int err, dest_ix = 0;
 
if (flow->flags & MLX5E_TC_FLOW_HAIRPIN) {
err = mlx5e_hairpin_flow_add(priv, flow, parse_attr, extack);
if (err) {
-   rule = ERR_PTR(err);
goto err_add_hairpin_flow;
}
if (flow->flags & MLX5E_TC_FLOW_HAIRPIN_RSS) {
@@ -716,7 +714,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
counter = mlx5_fc_create(dev, true);
if (IS_ERR(counter)) {
-   rule = ERR_CAST(counter);
+   err = PTR_ERR(counter);
goto err_fc_create;
}
dest[dest_ix].type = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
@@ -729,10 +727,8 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
err = mlx5e_attach_mod_hdr(priv, flow, parse_attr);
flow_act.modify_id = attr->mod_hdr_id;
kfree(parse_attr->mod_hdr_actions);
-   if (err) {
-   rule = ERR_PTR(err);
+   if (err)
goto err_create_mod_hdr_id;
-   }
}
 
if (IS_ERR_OR_NULL(priv->fs.tc.t)) {
@@ -758,7 +754,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
   "Failed to create tc offload 
table\n");
netdev_err(priv->netdev,
   "Failed to create tc offload table\n");
-   rule = ERR_CAST(priv->fs.tc.t);
+   err = PTR_ERR(priv->fs.tc.t);
goto err_create_ft;
}
 
@@ -768,13 +764,15 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
if (attr->match_level != MLX5_MATCH_NONE)
parse_attr->spec.match_criteria_enable = 
MLX5_MATCH_OUTER_HEADERS;
 
-   rule = mlx5_add_flow_rules(priv->fs.tc.t, &parse_attr->spec,
-  &flow_act, dest, dest_ix);
+   flow->rule[0] = mlx5_add_flow_rules(priv->fs.tc.t, &parse_attr->spec,
+   &flow_act, dest, dest_ix);
 
-   if (IS_ERR(rule))
+   if (IS_ERR(flow->rule[0])) {
+   err = PTR_ERR(flow->rule[0]);
goto err_add_rule;
+   }
 
-   return rule;
+   return 0;
 
 err_add_rule:
if (table_created) {
@@ -790,7 +788,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
if (flow->flags & MLX5E_TC_FLOW_HAIRPIN)
mlx5e_hairpin_flow_del(priv, flow);
 err_add_hairpin_flow:
-   return rule;
+   return err;
 }
 
 static void mlx5e_tc_del_nic_flow(struct mlx5e_priv *priv,
@@ -825,7 +823,7 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
  struct mlx5e_tc_flow *flow,
  struct netlink_ext_ack *extack);
 
-static struct mlx5_flow_handle *
+static int
 mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
  struct mlx5e_tc_flow_parse_attr *parse_attr,
  struct mlx5e_tc_flow *flow,
@@ -834,21 +832,20 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5_esw_flow_attr *attr = flow->esw_attr;
struct net_device *out_dev, *encap_dev = NULL;
-   struct mlx5_flow_handle *rule = NULL;
struct mlx5_fc *counter = NULL;
struct mlx5e_rep_priv *rpriv;
struct mlx5e_priv *out_priv;
-   int err;
+   int err = 0, encap_err = 0;
 
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT) {
out_dev = 

[net-next 06/14] net/mlx5: Split FDB fast path prio to multiple namespaces

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

Towards supporting multi-chains and priorities, split the FDB fast path
to multiple namespaces (sub namespaces), each with multiple priorities.

This patch adds a new flow steering type, FS_TYPE_PRIO_CHAINS, which is
like the current FS_TYPE_PRIO but may contain only namespaces, which
will be parallel to one another in terms of managing the flow table
connections inside them. Meaning, while searching for the next or
previous flow table to connect for a new table inside such a namespace,
we skip the parallel namespaces at the same level under the
FS_TYPE_PRIO_CHAINS prio we originated from.

We use this new type for splitting the fast path prio into multiple
parallel namespaces, each containing normal prios.
The prios inside them (and their tables) will be connected to one
another, but not from one parallel namespace to another; instead, the
last prio in each namespace will be connected to the next prio in
the containing FDB namespace, which is the slow path prio.

Signed-off-by: Paul Blakey 
Acked-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  7 ++
 .../net/ethernet/mellanox/mlx5/core/fs_core.c | 88 ---
 .../net/ethernet/mellanox/mlx5/core/fs_core.h | 13 +++
 include/linux/mlx5/fs.h   |  2 +
 5 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 9c893d7d273e..d004957328f9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -263,7 +263,7 @@ static int esw_create_legacy_fdb_table(struct mlx5_eswitch 
*esw)
esw_debug(dev, "Create FDB log_max_size(%d)\n",
  MLX5_CAP_ESW_FLOWTABLE_FDB(dev, log_max_ft_size));
 
-   root_ns = mlx5_get_flow_namespace(dev, MLX5_FLOW_NAMESPACE_FDB);
+   root_ns = mlx5_get_fdb_sub_ns(dev, 0);
if (!root_ns) {
esw_warn(dev, "Failed to get FDB flow namespace\n");
return -EOPNOTSUPP;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index c1b627577003..1698a322a7c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -59,6 +59,9 @@
 #define mlx5_esw_has_fwd_fdb(dev) \
MLX5_CAP_ESW_FLOWTABLE(dev, fdb_multi_path_to_table)
 
+#define FDB_MAX_CHAIN 3
+#define FDB_MAX_PRIO 16
+
 struct vport_ingress {
struct mlx5_flow_table *acl;
struct mlx5_flow_group *allow_untagged_spoofchk_grp;
@@ -319,6 +322,10 @@ static inline void mlx5_eswitch_cleanup(struct 
mlx5_eswitch *esw) {}
 static inline void mlx5_eswitch_vport_event(struct mlx5_eswitch *esw, struct 
mlx5_eqe *eqe) {}
 static inline int  mlx5_eswitch_enable_sriov(struct mlx5_eswitch *esw, int 
nvfs, int mode) { return 0; }
 static inline void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw) {}
+
+#define FDB_MAX_CHAIN 1
+#define FDB_MAX_PRIO 1
+
 #endif /* CONFIG_MLX5_ESWITCH */
 
 #endif /* __MLX5_ESWITCH_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index cdcbf9d0ae6c..7eb6d58733ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -40,6 +40,7 @@
 #include "diag/fs_tracepoint.h"
 #include "accel/ipsec.h"
 #include "fpga/ipsec.h"
+#include "eswitch.h"
 
 #define INIT_TREE_NODE_ARRAY_SIZE(...) (sizeof((struct 
init_tree_node[]){__VA_ARGS__}) /\
 sizeof(struct init_tree_node))
@@ -713,7 +714,7 @@ static struct mlx5_flow_table 
*find_closest_ft_recursive(struct fs_node  *root,
struct fs_node *iter = list_entry(start, struct fs_node, list);
struct mlx5_flow_table *ft = NULL;
 
-   if (!root)
+   if (!root || root->type == FS_TYPE_PRIO_CHAINS)
return NULL;
 
list_for_each_advance_continue(iter, &root->children, reverse) {
@@ -1973,6 +1974,18 @@ void mlx5_destroy_flow_group(struct mlx5_flow_group *fg)
   fg->id);
 }
 
+struct mlx5_flow_namespace *mlx5_get_fdb_sub_ns(struct mlx5_core_dev *dev,
+   int n)
+{
+   struct mlx5_flow_steering *steering = dev->priv.steering;
+
+   if (!steering || !steering->fdb_sub_ns)
+   return NULL;
+
+   return steering->fdb_sub_ns[n];
+}
+EXPORT_SYMBOL(mlx5_get_fdb_sub_ns);
+
 struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
enum 
mlx5_flow_namespace_type type)
 {
@@ -2051,8 +2064,10 @@ struct mlx5_flow_namespace 
*mlx5_get_flow_vport_acl_namespace(struct mlx5_core_d
}
 }
 
-static struct fs_prio *fs_create_prio(struct 

[net-next 08/14] net/mlx5: E-Switch, Add chains and priorities

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

A chain is a group of priorities, so use the fdb parallel
sub namespaces to implement chains, and a flow table for each
priority in them.

Because these namespaces are parallel and in series to the slow path
fdb, the chains aren't connected to one another (but to the slow path),
and one must use an explicit goto action to reach a different chain.

Flow tables for the priorities will be created on demand and destroyed
once not used.

The firmware has four pools of tables for sizes S/XS/M/L (4k, 64k, 1m, 4m).
We maintain ghost copies of the pools' occupancy.

When a new table is to be created, we scan the pools from large to small
and find the first table size which can now be created. When a table is
destroyed, we update the relevant pool.

Multi chain/prio isn't enabled yet by this patch, for now all flows
will use the default chain 0, and prio 1.

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   3 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  32 +-
 .../mellanox/mlx5/core/eswitch_offloads.c | 390 ++
 3 files changed, 339 insertions(+), 86 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 7487bdd55f23..6c04e11f9a05 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -837,6 +837,9 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
struct mlx5e_priv *out_priv;
int err = 0, encap_err = 0;
 
+   /* keep the old behaviour, use same prio for all offloaded rules */
+   attr->prio = 1;
+
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT) {
out_dev = __dev_get_by_index(dev_net(priv->netdev),
 attr->parse_attr->mirred_ifindex);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 584e735bbad1..54215f4312fa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -123,6 +123,13 @@ struct mlx5_vport {
u16 enabled_events;
 };
 
+enum offloads_fdb_flags {
+   ESW_FDB_CHAINS_AND_PRIOS_SUPPORTED = BIT(0),
+};
+
+extern const unsigned int ESW_POOLS[4];
+
+#define PRIO_LEVELS 2
 struct mlx5_eswitch_fdb {
union {
struct legacy_fdb {
@@ -133,16 +140,24 @@ struct mlx5_eswitch_fdb {
} legacy;
 
struct offloads_fdb {
-   struct mlx5_flow_table *fast_fdb;
-   struct mlx5_flow_table *fwd_fdb;
struct mlx5_flow_table *slow_fdb;
struct mlx5_flow_group *send_to_vport_grp;
struct mlx5_flow_group *miss_grp;
struct mlx5_flow_handle *miss_rule_uni;
struct mlx5_flow_handle *miss_rule_multi;
int vlan_push_pop_refcount;
+
+   struct {
+   struct mlx5_flow_table *fdb;
+   u32 num_rules;
+   } fdb_prio[FDB_MAX_CHAIN + 1][FDB_MAX_PRIO + 
1][PRIO_LEVELS];
+   /* Protects fdb_prio table */
+   struct mutex fdb_prio_lock;
+
+   int fdb_left[ARRAY_SIZE(ESW_POOLS)];
} offloads;
};
+   u32 flags;
 };
 
 struct mlx5_esw_offload {
@@ -184,6 +199,7 @@ struct mlx5_eswitch {
 
struct mlx5_esw_offload offloads;
int mode;
+   int nvports;
 };
 
 void esw_offloads_cleanup(struct mlx5_eswitch *esw, int nvports);
@@ -236,6 +252,15 @@ mlx5_eswitch_del_fwd_rule(struct mlx5_eswitch *esw,
  struct mlx5_flow_handle *rule,
  struct mlx5_esw_flow_attr *attr);
 
+bool
+mlx5_eswitch_prios_supported(struct mlx5_eswitch *esw);
+
+u16
+mlx5_eswitch_get_prio_range(struct mlx5_eswitch *esw);
+
+u32
+mlx5_eswitch_get_chain_range(struct mlx5_eswitch *esw);
+
 struct mlx5_flow_handle *
 mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch *esw, int vport,
  struct mlx5_flow_destination *dest);
@@ -274,6 +299,9 @@ struct mlx5_esw_flow_attr {
u32 mod_hdr_id;
u8  match_level;
struct mlx5_fc *counter;
+   u32 chain;
+   u16 prio;
+   u32 dest_chain;
struct mlx5e_tc_flow_parse_attr *parse_attr;
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 983bb8a80f75..8501b6c31c02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -37,32 +37,59 @@
 #include 

[net-next 01/14] net/mlx5: E-Switch, Get counters for offloaded flows from callers

2018-10-17 Thread Saeed Mahameed
From: Mark Bloch 

There's no real reason for the e-switch logic to manage the creation of
counters for offloaded flows. The API already has the directive for the
caller to denote they want to attach a counter to the created flow.
As such, move the management of flow counters to the mlx5e
tc offload logic. This also lets us remove an inelegant interface where
the FS layer had to provide a way to retrieve a counter from a flow rule.

Signed-off-by: Mark Bloch 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 32 +--
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  1 +
 .../mellanox/mlx5/core/eswitch_offloads.c | 20 ++--
 .../net/ethernet/mellanox/mlx5/core/fs_core.c | 15 -
 include/linux/mlx5/fs.h   |  1 -
 5 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index acf7a847f561..8a27c0813a18 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -61,6 +61,7 @@ struct mlx5_nic_flow_attr {
u32 hairpin_tirn;
u8 match_level;
struct mlx5_flow_table  *hairpin_ft;
+   struct mlx5_fc  *counter;
 };
 
 #define MLX5E_TC_FLOW_BASE (MLX5E_TC_LAST_EXPORTED_BIT + 1)
@@ -721,6 +722,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
dest[dest_ix].type = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
dest[dest_ix].counter = counter;
dest_ix++;
+   attr->counter = counter;
}
 
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
@@ -797,7 +799,7 @@ static void mlx5e_tc_del_nic_flow(struct mlx5e_priv *priv,
struct mlx5_nic_flow_attr *attr = flow->nic_attr;
struct mlx5_fc *counter = NULL;
 
-   counter = mlx5_flow_rule_counter(flow->rule[0]);
+   counter = attr->counter;
mlx5_del_flow_rules(flow->rule[0]);
mlx5_fc_destroy(priv->mdev, counter);
 
@@ -833,6 +835,7 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
struct mlx5_esw_flow_attr *attr = flow->esw_attr;
struct net_device *out_dev, *encap_dev = NULL;
struct mlx5_flow_handle *rule = NULL;
+   struct mlx5_fc *counter = NULL;
struct mlx5e_rep_priv *rpriv;
struct mlx5e_priv *out_priv;
int err;
@@ -868,6 +871,16 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
}
}
 
+   if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
+   counter = mlx5_fc_create(esw->dev, true);
+   if (IS_ERR(counter)) {
+   rule = ERR_CAST(counter);
+   goto err_create_counter;
+   }
+
+   attr->counter = counter;
+   }
+
/* we get here if (1) there's no error (rule being null) or when
 * (2) there's an encap action and we're on -EAGAIN (no valid neigh)
 */
@@ -888,6 +901,8 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
mlx5_eswitch_del_offloaded_rule(esw, rule, attr);
rule = flow->rule[1];
 err_add_rule:
+   mlx5_fc_destroy(esw->dev, counter);
+err_create_counter:
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
mlx5e_detach_mod_hdr(priv, flow);
 err_mod_hdr:
@@ -921,6 +936,9 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
 
if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
mlx5e_detach_mod_hdr(priv, flow);
+
+   if (attr->action & MLX5_FLOW_CONTEXT_ACTION_COUNT)
+   mlx5_fc_destroy(esw->dev, attr->counter);
 }
 
 void mlx5e_tc_encap_flows_add(struct mlx5e_priv *priv,
@@ -992,6 +1010,14 @@ void mlx5e_tc_encap_flows_del(struct mlx5e_priv *priv,
}
 }
 
+static struct mlx5_fc *mlx5e_tc_get_counter(struct mlx5e_tc_flow *flow)
+{
+   if (flow->flags & MLX5E_TC_FLOW_ESWITCH)
+   return flow->esw_attr->counter;
+   else
+   return flow->nic_attr->counter;
+}
+
 void mlx5e_tc_update_neigh_used_value(struct mlx5e_neigh_hash_entry *nhe)
 {
struct mlx5e_neigh *m_neigh = &nhe->m_neigh;
@@ -1017,7 +1043,7 @@ void mlx5e_tc_update_neigh_used_value(struct 
mlx5e_neigh_hash_entry *nhe)
continue;
list_for_each_entry(flow, >flows, encap) {
if (flow->flags & MLX5E_TC_FLOW_OFFLOADED) {
-   counter = mlx5_flow_rule_counter(flow->rule[0]);
+   counter = mlx5e_tc_get_counter(flow);
mlx5_fc_query_cached(counter, &bytes, &packets, &lastuse);
if (time_after((unsigned long)lastuse, 
nhe->reported_lastuse)) {
neigh_used = true;
@@ -3019,7 +3045,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
if (!(flow->flags & MLX5E_TC_FLOW_OFFLOADED))

[net-next 02/14] net/mlx5: Use flow counter IDs and not the wrapping cache object

2018-10-17 Thread Saeed Mahameed
From: Mark Bloch 

Currently, when a flow rule is created using the FS core layer, the caller
has to pass the entire flow counter object and not just the counter HW
handle (ID). This requires both the FS core and the caller to have
knowledge about the inner implementation of the FS layer flow counters
cache and limits the possible users.

Move to using the counter ID throughout when dealing with flows.

With this decoupling in place, we can now privatize the inner
implementation of the flow counters.

Signed-off-by: Mark Bloch 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c  |  7 +--
 .../ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h   |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  4 ++--
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 10 ++
 drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c  |  6 ++
 include/linux/mlx5/fs.h|  3 ++-
 9 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 5d9b7f62a0ba..5ced0cc46ba1 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3320,15 +3320,18 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
}
 
if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
+   struct mlx5_ib_mcounters *mcounters;
+
err = flow_counters_set_data(flow_act.counters, ucmd);
if (err)
goto free;
 
+   mcounters = to_mcounters(flow_act.counters);
handler->ibcounters = flow_act.counters;
dest_arr[dest_num].type =
MLX5_FLOW_DESTINATION_TYPE_COUNTER;
-   dest_arr[dest_num].counter =
-   to_mcounters(flow_act.counters)->hw_cntrs_hndl;
+   dest_arr[dest_num].counter_id =
+   mlx5_fc_id(mcounters->hw_cntrs_hndl);
dest_num++;
}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
index e83dda441a81..d027ce00c8ce 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
@@ -252,10 +252,10 @@ TRACE_EVENT(mlx5_fs_add_rule,
   memcpy(__entry->destination,
  &rule->dest_attr,
  sizeof(__entry->destination));
-  if (rule->dest_attr.type & 
MLX5_FLOW_DESTINATION_TYPE_COUNTER &&
-  rule->dest_attr.counter)
+  if (rule->dest_attr.type &
+  MLX5_FLOW_DESTINATION_TYPE_COUNTER)
__entry->counter_id =
-   rule->dest_attr.counter->id;
+   rule->dest_attr.counter_id;
),
TP_printk("rule=%p fte=%p index=%u sw_action=<%s> [dst] %s\n",
  __entry->rule, __entry->fte, __entry->index,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 8a27c0813a18..5ce87f54852d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -720,7 +720,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
goto err_fc_create;
}
dest[dest_ix].type = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
-   dest[dest_ix].counter = counter;
+   dest[dest_ix].counter_id = mlx5_fc_id(counter);
dest_ix++;
attr->counter = counter;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index e1d47fa5ab83..9c893d7d273e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1198,7 +1198,7 @@ static int esw_vport_ingress_config(struct mlx5_eswitch 
*esw,
if (counter) {
flow_act.action |= MLX5_FLOW_CONTEXT_ACTION_COUNT;
drop_ctr_dst.type = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
-   drop_ctr_dst.counter = counter;
+   drop_ctr_dst.counter_id = mlx5_fc_id(counter);
dst = &drop_ctr_dst;
dest_num++;
}
@@ -1285,7 +1285,7 @@ static int esw_vport_egress_config(struct mlx5_eswitch 
*esw,
if (counter) {
flow_act.action |= MLX5_FLOW_CONTEXT_ACTION_COUNT;

[net-next 11/14] net/mlx5e: Avoid duplicated code for tc offloads add/del fdb rule

2018-10-17 Thread Saeed Mahameed
From: Or Gerlitz 

The code for adding/deleting an fdb flow is repeated when user-space
does flow add/del and when we add/del from the neigh update path;
unify the two to avoid the duplication.

Signed-off-by: Or Gerlitz 
Signed-off-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 91 ++-
 1 file changed, 50 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 8b25850cbf6a..1786e25644ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -823,6 +823,43 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
  struct mlx5e_tc_flow *flow,
  struct netlink_ext_ack *extack);
 
+static struct mlx5_flow_handle *
+mlx5e_tc_offload_fdb_rules(struct mlx5_eswitch *esw,
+  struct mlx5e_tc_flow *flow,
+  struct mlx5_flow_spec *spec,
+  struct mlx5_esw_flow_attr *attr)
+{
+   struct mlx5_flow_handle *rule;
+
+   rule = mlx5_eswitch_add_offloaded_rule(esw, spec, attr);
+   if (IS_ERR(rule))
+   return rule;
+
+   if (attr->mirror_count) {
+   flow->rule[1] = mlx5_eswitch_add_fwd_rule(esw, spec, attr);
+   if (IS_ERR(flow->rule[1])) {
+   mlx5_eswitch_del_offloaded_rule(esw, rule, attr);
+   return flow->rule[1];
+   }
+   }
+
+   flow->flags |= MLX5E_TC_FLOW_OFFLOADED;
+   return rule;
+}
+
+static void
+mlx5e_tc_unoffload_fdb_rules(struct mlx5_eswitch *esw,
+struct mlx5e_tc_flow *flow,
+  struct mlx5_esw_flow_attr *attr)
+{
+   flow->flags &= ~MLX5E_TC_FLOW_OFFLOADED;
+
+   if (attr->mirror_count)
+   mlx5_eswitch_del_fwd_rule(esw, flow->rule[1], attr);
+
+   mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], attr);
+}
+
 static int
 mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
  struct mlx5e_tc_flow_parse_attr *parse_attr,
@@ -881,25 +918,15 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 * (2) there's an encap action and we're on -EAGAIN (no valid neigh)
 */
if (encap_err != -EAGAIN) {
-		flow->rule[0] = mlx5_eswitch_add_offloaded_rule(esw, &parse_attr->spec, attr);
+		flow->rule[0] = mlx5e_tc_offload_fdb_rules(esw, flow, &parse_attr->spec, attr);
if (IS_ERR(flow->rule[0])) {
err = PTR_ERR(flow->rule[0]);
goto err_add_rule;
}
-
-   if (attr->mirror_count) {
-			flow->rule[1] = mlx5_eswitch_add_fwd_rule(esw, &parse_attr->spec, attr);
-   if (IS_ERR(flow->rule[1])) {
-   err = PTR_ERR(flow->rule[1]);
-   goto err_fwd_rule;
-   }
-   }
}
 
return encap_err;
 
-err_fwd_rule:
-   mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], attr);
 err_add_rule:
mlx5_fc_destroy(esw->dev, counter);
 err_create_counter:
@@ -920,12 +947,8 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5_esw_flow_attr *attr = flow->esw_attr;
 
-   if (flow->flags & MLX5E_TC_FLOW_OFFLOADED) {
-   flow->flags &= ~MLX5E_TC_FLOW_OFFLOADED;
-   if (attr->mirror_count)
-   mlx5_eswitch_del_fwd_rule(esw, flow->rule[1], attr);
-   mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], attr);
-   }
+   if (flow->flags & MLX5E_TC_FLOW_OFFLOADED)
+   mlx5e_tc_unoffload_fdb_rules(esw, flow, flow->esw_attr);
 
mlx5_eswitch_del_vlan_action(esw, attr);
 
@@ -946,6 +969,8 @@ void mlx5e_tc_encap_flows_add(struct mlx5e_priv *priv,
 {
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5_esw_flow_attr *esw_attr;
+   struct mlx5_flow_handle *rule;
+   struct mlx5_flow_spec *spec;
struct mlx5e_tc_flow *flow;
int err;
 
@@ -964,26 +989,16 @@ void mlx5e_tc_encap_flows_add(struct mlx5e_priv *priv,
	list_for_each_entry(flow, &e->flows, encap) {
esw_attr = flow->esw_attr;
esw_attr->encap_id = e->encap_id;
-		flow->rule[0] = mlx5_eswitch_add_offloaded_rule(esw, &esw_attr->parse_attr->spec, esw_attr);
-   if (IS_ERR(flow->rule[0])) {
-   err = PTR_ERR(flow->rule[0]);
+		spec = &esw_attr->parse_attr->spec;
+
+   rule = mlx5e_tc_offload_fdb_rules(esw, flow, spec, esw_attr);
+   if (IS_ERR(rule)) {
+   err = PTR_ERR(rule);
mlx5_core_warn(priv->mdev, "Failed to 

[pull request][net-next 00/14] Mellanox, mlx5 updates

2018-10-17 Thread Saeed Mahameed
Hi Dave,

This series from Paul, Or and Mark provides the support for
e-switch tc offloading of multiple priorities and chains. The series
is based on a merge commit with the mlx5-next branch for patches that
were already submitted and reviewed through the netdev and rdma
mailing lists.

For more information please see tag log below.

Please pull and let me know if there is any problem.

Thanks,
saeed.

---

The following changes since commit 186daf0c20507072e72a3c74db4ac50a5b6dae07:

  Merge branch 'mlx5-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into net-next 
(2018-10-17 14:13:36 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5-updates-2018-10-17

for you to fetch changes up to bf07aa730a04a375bc10d09df1e81357af1d4477:

  net/mlx5e: Support offloading tc priorities and chains for eswitch flows 
(2018-10-17 14:20:49 -0700)


mlx5-updates-2018-10-17



From Or Gerlitz :

This series from Paul adds support for mlx5 e-switch tc offloading of multiple
priorities and chains.

This is made of four building blocks (along with a few minor driver refactors):

[1] Split FDB fast path prio to multiple namespaces

Currently the FDB namespace contains two priorities, fast path (p0) and slow path (p1).
The slow path contains the per-representor SQ send-to-vport TX rule and the match-all
RX miss rule. As a pre-step to support multiple chains and priorities, we split the
FDB fast path into multiple namespaces (sub-namespaces), each with multiple priorities.

[2] E-Switch chains and priorities

A chain is a group of priorities. We use the fdb parallel sub-namespaces to 
implement chains,
and a flow table for each priority in them.

Because these namespaces are parallel and in series to the slow path
fdb, the chains aren't connected to each other (but to the slow path),
and one must use an explicit goto action to reach a different chain.

Flow tables for the priorities are created on demand and destroyed
once they are no longer used.

[3] Add a no-append flow insertion mode, use it for TC offloads

Enhance the driver fs core such that, if a no-append flag is set by the caller,
we add a new FTE instead of appending the actions of the inserted rule when
the same match already exists.

For encap rules, we defer the HW offloading until we have a valid neighbour.
This can result in the packet hitting a lower-priority rule in the HW DP.
Use the no-append API to push these packets to the slow path FDB table, so
they go to the TC kernel DP as was done before priorities were supported.

[4] Offloading tc priorities and chains for eswitch flows

Using [1], [2] and [3] above we add the support for offloading both chains
and priorities. To get to a new chain, use the tc goto action. We support
a fixed prio range 1-16, and chains 0-3.
=
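[Editor's note: as an illustration of how these chains and priorities are exercised from user space. This is an assumed setup: the device name eth0 and the IP address are placeholders, and actual hardware offload additionally requires a capable NIC.]

```shell
# Attach an ingress qdisc (illustrative commands, not part of this series).
tc qdisc add dev eth0 ingress

# Chain 0, priority 1: send matching TCP traffic to chain 1.
tc filter add dev eth0 ingress protocol ip prio 1 chain 0 \
    flower ip_proto tcp action goto chain 1

# Chain 1, priority 1: drop packets from a given source address.
tc filter add dev eth0 ingress protocol ip prio 1 chain 1 \
    flower src_ip 10.0.0.1 action drop
```

The prio values here fall in the supported 1-16 range and the chain ids in 0-3, matching the limits stated above.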


Mark Bloch (2):
  net/mlx5: E-Switch, Get counters for offloaded flows from callers
  net/mlx5: Use flow counter IDs and not the wrapping cache object

Or Gerlitz (2):
  net/mlx5: E-Switch, Have explicit API to delete fwd rules
  net/mlx5e: Avoid duplicated code for tc offloads add/del fdb rule

Paul Blakey (8):
  net/mlx5: Add cap bits for multi fdb encap
  net/mlx5: Split FDB fast path prio to multiple namespaces
  net/mlx5: E-Switch, Add chains and priorities
  net/mlx5: Add a no-append flow insertion mode
  net/mlx5e: For TC offloads, always add new flow instead of appending the 
actions
  net/mlx5: E-Switch, Enable setting goto slow path chain action
  net/mlx5e: Use a slow path rule instead if vxlan neighbour isn't available
  net/mlx5e: Support offloading tc priorities and chains for eswitch flows

Rabie Loulou (1):
  net/mlx5e: Change return type of tc add flow functions

Roi Dayan (1):
  net/mlx5e: Split TC add rule path for nic vs e-switch

 drivers/infiniband/hw/mlx5/main.c  |  13 +-
 .../mellanox/mlx5/core/diag/fs_tracepoint.h|   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 -
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   6 -
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 490 +++--
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  46 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 410 +
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 122 +++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |  13 +
 .../net/ethernet/mellanox/mlx5/core/fs_counters.c  |   6 +
 include/linux/mlx5/fs.h   

[net-next 07/14] net/mlx5: E-Switch, Have explicit API to delete fwd rules

2018-10-17 Thread Saeed Mahameed
From: Or Gerlitz 

Be symmetric with the e-switch API to add rules, which has a
specific function to add fwd rules that are used as part of
vport mirroring.

This patch doesn't change any functionality.

Signed-off-by: Or Gerlitz 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 4 
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c| 8 
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index a4a432f02930..7487bdd55f23 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -920,7 +920,7 @@ static void mlx5e_tc_del_fdb_flow(struct mlx5e_priv *priv,
if (flow->flags & MLX5E_TC_FLOW_OFFLOADED) {
flow->flags &= ~MLX5E_TC_FLOW_OFFLOADED;
if (attr->mirror_count)
-   mlx5_eswitch_del_offloaded_rule(esw, flow->rule[1], 
attr);
+   mlx5_eswitch_del_fwd_rule(esw, flow->rule[1], attr);
mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], attr);
}
 
@@ -996,7 +996,7 @@ void mlx5e_tc_encap_flows_del(struct mlx5e_priv *priv,
 
flow->flags &= ~MLX5E_TC_FLOW_OFFLOADED;
if (attr->mirror_count)
-   mlx5_eswitch_del_offloaded_rule(esw, 
flow->rule[1], attr);
+   mlx5_eswitch_del_fwd_rule(esw, flow->rule[1], 
attr);
mlx5_eswitch_del_offloaded_rule(esw, flow->rule[0], 
attr);
}
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 1698a322a7c4..584e735bbad1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -231,6 +231,10 @@ void
 mlx5_eswitch_del_offloaded_rule(struct mlx5_eswitch *esw,
struct mlx5_flow_handle *rule,
struct mlx5_esw_flow_attr *attr);
+void
+mlx5_eswitch_del_fwd_rule(struct mlx5_eswitch *esw,
+ struct mlx5_flow_handle *rule,
+ struct mlx5_esw_flow_attr *attr);
 
 struct mlx5_flow_handle *
 mlx5_eswitch_create_vport_rx_rule(struct mlx5_eswitch *esw, int vport,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 39932dce15cb..983bb8a80f75 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -194,6 +194,14 @@ mlx5_eswitch_del_offloaded_rule(struct mlx5_eswitch *esw,
esw->offloads.num_flows--;
 }
 
+void
+mlx5_eswitch_del_fwd_rule(struct mlx5_eswitch *esw,
+ struct mlx5_flow_handle *rule,
+ struct mlx5_esw_flow_attr *attr)
+{
+   mlx5_eswitch_del_offloaded_rule(esw, rule, attr);
+}
+
 static int esw_set_global_vlan_pop(struct mlx5_eswitch *esw, u8 val)
 {
struct mlx5_eswitch_rep *rep;
-- 
2.17.2



[net-next 10/14] net/mlx5e: For TC offloads, always add new flow instead of appending the actions

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

When replacing a tc flower rule, flower first requests to add the
new rule (new action), then deletes the old one.
But currently, when asked to add a new tc flower flow, we append the
actions (and counters) to it.

This can result in an FTE with two flow counters or with conflicting
actions (drop and encap action), which the firmware complains/errs
about, and which doesn't achieve what the user aimed for.

Instead, insert the flow using the new no-append flag, which will add a
new HW rule; the old flow and rule will be deleted later by flower.

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index a9c68b7859b4..8b25850cbf6a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -686,7 +686,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
.action = attr->action,
.flow_tag = attr->flow_tag,
.reformat_id = 0,
-   .flags= FLOW_ACT_HAS_TAG,
+   .flags= FLOW_ACT_HAS_TAG | FLOW_ACT_NO_APPEND,
};
struct mlx5_fc *counter = NULL;
bool table_created = false;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 8501b6c31c02..289f1992f624 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -80,8 +80,8 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
struct mlx5_esw_flow_attr *attr)
 {
struct mlx5_flow_destination dest[MLX5_MAX_FLOW_FWD_VPORTS + 1] = {};
+   struct mlx5_flow_act flow_act = { .flags = FLOW_ACT_NO_APPEND, };
bool mirror = !!(attr->mirror_count);
-   struct mlx5_flow_act flow_act = {0};
struct mlx5_flow_handle *rule;
struct mlx5_flow_table *fdb;
int j, i = 0;
@@ -195,7 +195,7 @@ mlx5_eswitch_add_fwd_rule(struct mlx5_eswitch *esw,
  struct mlx5_esw_flow_attr *attr)
 {
struct mlx5_flow_destination dest[MLX5_MAX_FLOW_FWD_VPORTS + 1] = {};
-   struct mlx5_flow_act flow_act = {0};
+   struct mlx5_flow_act flow_act = { .flags = FLOW_ACT_NO_APPEND, };
struct mlx5_flow_table *fast_fdb;
struct mlx5_flow_table *fwd_fdb;
struct mlx5_flow_handle *rule;
-- 
2.17.2



[net-next 09/14] net/mlx5: Add a no-append flow insertion mode

2018-10-17 Thread Saeed Mahameed
From: Paul Blakey 

If the no-append flag is set, we add a new FTE instead of appending
the actions of the inserted rule when the same match already exists.

While here, move the has_flow_tag boolean indicator to be a flag too.

This patch doesn't change any functionality.
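[Editor's note: the append vs. no-append insertion semantics described above can be sketched in plain userspace C. This is an illustrative model, not the driver's fs core; the names flow_tbl_insert, struct fte fields, and MAX_FTES are made up for the sketch.]

```c
#include <assert.h>
#include <string.h>

#define MAX_FTES 16
#define FLOW_ACT_NO_APPEND (1u << 0)

/* Toy flow-table entry: one match value and a bitmask of actions. */
struct fte {
	int match;
	unsigned int actions;
};

struct flow_tbl {
	struct fte ftes[MAX_FTES];
	int num_ftes;
};

/*
 * Insert a rule.  Without FLOW_ACT_NO_APPEND, an existing FTE with the
 * same match gets the new actions appended to it; with the flag set,
 * the search is skipped and a new FTE is always created (mirroring the
 * skip_search path in the patch).  Returns the index of the FTE used,
 * or -1 if the table is full.
 */
static int flow_tbl_insert(struct flow_tbl *tbl, int match,
			   unsigned int actions, unsigned int flags)
{
	int i;

	if (!(flags & FLOW_ACT_NO_APPEND)) {
		for (i = 0; i < tbl->num_ftes; i++) {
			if (tbl->ftes[i].match == match) {
				tbl->ftes[i].actions |= actions;
				return i;
			}
		}
	}

	if (tbl->num_ftes >= MAX_FTES)
		return -1;
	i = tbl->num_ftes++;
	tbl->ftes[i].match = match;
	tbl->ftes[i].actions = actions;
	return i;
}
```

With no-append, two rules with the same match coexist as separate FTEs instead of being merged, which is what lets flower's add-then-delete replace sequence work.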

Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Reviewed-by: Mark Bloch 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/main.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  2 +-
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  9 -
 include/linux/mlx5/fs.h| 14 +++---
 5 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 5ced0cc46ba1..af32899bb72a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2793,7 +2793,7 @@ static int parse_flow_attr(struct mlx5_core_dev *mdev, 
u32 *match_c,
return -EINVAL;
 
action->flow_tag = ib_spec->flow_tag.tag_id;
-   action->has_flow_tag = true;
+   action->flags |= FLOW_ACT_HAS_TAG;
break;
case IB_FLOW_SPEC_ACTION_DROP:
if (FIELDS_NOT_SUPPORTED(ib_spec->drop,
@@ -2886,7 +2886,7 @@ is_valid_esp_aes_gcm(struct mlx5_core_dev *mdev,
return egress ? VALID_SPEC_INVALID : VALID_SPEC_NA;
 
return is_crypto && is_ipsec &&
-   (!egress || (!is_drop && !flow_act->has_flow_tag)) ?
+   (!egress || (!is_drop && !(flow_act->flags & 
FLOW_ACT_HAS_TAG))) ?
VALID_SPEC_VALID : VALID_SPEC_INVALID;
 }
 
@@ -3349,7 +3349,7 @@ static struct mlx5_ib_flow_handler 
*_create_flow_rule(struct mlx5_ib_dev *dev,
MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO;
}
 
-   if (flow_act.has_flow_tag &&
+   if ((flow_act.flags & FLOW_ACT_HAS_TAG)  &&
(flow_attr->type == IB_FLOW_ATTR_ALL_DEFAULT ||
 flow_attr->type == IB_FLOW_ATTR_MC_DEFAULT)) {
mlx5_ib_warn(dev, "Flow tag %u and attribute type %x isn't 
allowed in leftovers\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 6c04e11f9a05..a9c68b7859b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -684,9 +684,9 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
struct mlx5_flow_destination dest[2] = {};
struct mlx5_flow_act flow_act = {
.action = attr->action,
-   .has_flow_tag = true,
.flow_tag = attr->flow_tag,
.reformat_id = 0,
+   .flags= FLOW_ACT_HAS_TAG,
};
struct mlx5_fc *counter = NULL;
bool table_created = false;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
index 5645a4facad2..28aa8c968a80 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
@@ -650,7 +650,7 @@ static bool mlx5_is_fpga_egress_ipsec_rule(struct 
mlx5_core_dev *dev,
(match_criteria_enable &
 ~(MLX5_MATCH_OUTER_HEADERS | MLX5_MATCH_MISC_PARAMETERS)) ||
(flow_act->action & ~(MLX5_FLOW_CONTEXT_ACTION_ENCRYPT | 
MLX5_FLOW_CONTEXT_ACTION_ALLOW)) ||
-flow_act->has_flow_tag)
+(flow_act->flags & FLOW_ACT_HAS_TAG))
return false;
 
return true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 7eb6d58733ac..67ba4c975d81 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1428,7 +1428,7 @@ static int check_conflicting_ftes(struct fs_fte *fte, 
const struct mlx5_flow_act
return -EEXIST;
}
 
-   if (flow_act->has_flow_tag &&
+   if ((flow_act->flags & FLOW_ACT_HAS_TAG) &&
fte->action.flow_tag != flow_act->flow_tag) {
		mlx5_core_warn(get_dev(&fte->node),
   "FTE flow tag %u already exists with different 
flow tag %u\n",
@@ -1628,6 +1628,8 @@ try_add_to_existing_fg(struct mlx5_flow_table *ft,
 
 search_again_locked:
version = matched_fgs_get_version(match_head);
+   if (flow_act->flags & FLOW_ACT_NO_APPEND)
+   goto skip_search;
/* Try to find a fg that already contains a matching fte */
list_for_each_entry(iter, match_head, list) {
struct fs_fte *fte_tmp;
@@ -1644,6 +1646,11 @@ try_add_to_existing_fg(struct mlx5_flow_table *ft,
return rule;
}
 
+skip_search:
+   /* No group with matching fte found, or we 

[net-next 04/14] net/mlx5e: Split TC add rule path for nic vs e-switch

2018-10-17 Thread Saeed Mahameed
From: Roi Dayan 

Move to a clear separation between the code paths that add nic vs e-switch
flows. While here, we break the code that deals with adding offloaded
TC flows into a few smaller stages, each in a helper function.

Besides giving us simpler and more readable code, these are pre-steps
for being able to have two HW flows serving one SW TC flow for some
e-switch use cases.

Signed-off-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 185 +-
 1 file changed, 138 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 861986f82844..a4a432f02930 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2907,34 +2907,15 @@ static struct rhashtable *get_tc_ht(struct mlx5e_priv 
*priv)
	return &priv->fs.tc.ht;
 }
 
-int mlx5e_configure_flower(struct mlx5e_priv *priv,
-  struct tc_cls_flower_offload *f, int flags)
+static int
+mlx5e_alloc_flow(struct mlx5e_priv *priv, int attr_size,
+struct tc_cls_flower_offload *f, u8 flow_flags,
+struct mlx5e_tc_flow_parse_attr **__parse_attr,
+struct mlx5e_tc_flow **__flow)
 {
-   struct netlink_ext_ack *extack = f->common.extack;
-   struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5e_tc_flow_parse_attr *parse_attr;
-   struct rhashtable *tc_ht = get_tc_ht(priv);
struct mlx5e_tc_flow *flow;
-   int attr_size, err = 0;
-   u8 flow_flags = 0;
-
-   get_flags(flags, _flags);
-
-	flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
-   if (flow) {
-   NL_SET_ERR_MSG_MOD(extack,
-  "flow cookie already exists, ignoring");
-   netdev_warn_once(priv->netdev, "flow cookie %lx already exists, 
ignoring\n", f->cookie);
-   return 0;
-   }
-
-   if (esw && esw->mode == SRIOV_OFFLOADS) {
-   flow_flags |= MLX5E_TC_FLOW_ESWITCH;
-   attr_size  = sizeof(struct mlx5_esw_flow_attr);
-   } else {
-   flow_flags |= MLX5E_TC_FLOW_NIC;
-   attr_size  = sizeof(struct mlx5_nic_flow_attr);
-   }
+   int err;
 
flow = kzalloc(sizeof(*flow) + attr_size, GFP_KERNEL);
parse_attr = kvzalloc(sizeof(*parse_attr), GFP_KERNEL);
@@ -2948,45 +2929,155 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
flow->priv = priv;
 
err = parse_cls_flower(priv, flow, _attr->spec, f);
-   if (err < 0)
+   if (err)
goto err_free;
 
-   if (flow->flags & MLX5E_TC_FLOW_ESWITCH) {
-   err = parse_tc_fdb_actions(priv, f->exts, parse_attr, flow,
-  extack);
-   if (err < 0)
-   goto err_free;
-   err = mlx5e_tc_add_fdb_flow(priv, parse_attr, flow, extack);
-   } else {
-   err = parse_tc_nic_actions(priv, f->exts, parse_attr, flow,
-  extack);
-   if (err < 0)
-   goto err_free;
-   err = mlx5e_tc_add_nic_flow(priv, parse_attr, flow, extack);
-   }
+   *__flow = flow;
+   *__parse_attr = parse_attr;
+
+   return 0;
 
+err_free:
+   kfree(flow);
+   kvfree(parse_attr);
+   return err;
+}
+
+static int
+mlx5e_add_fdb_flow(struct mlx5e_priv *priv,
+  struct tc_cls_flower_offload *f,
+  u8 flow_flags,
+  struct mlx5e_tc_flow **__flow)
+{
+   struct netlink_ext_ack *extack = f->common.extack;
+   struct mlx5e_tc_flow_parse_attr *parse_attr;
+   struct mlx5e_tc_flow *flow;
+   int attr_size, err;
+
+   flow_flags |= MLX5E_TC_FLOW_ESWITCH;
+   attr_size  = sizeof(struct mlx5_esw_flow_attr);
+   err = mlx5e_alloc_flow(priv, attr_size, f, flow_flags,
+			       &parse_attr, &flow);
+   if (err)
+   goto out;
+
+   err = parse_tc_fdb_actions(priv, f->exts, parse_attr, flow, extack);
+   if (err)
+   goto err_free;
+
+   err = mlx5e_tc_add_fdb_flow(priv, parse_attr, flow, extack);
if (err && err != -EAGAIN)
goto err_free;
 
-   if (err != -EAGAIN)
+   if (!err)
flow->flags |= MLX5E_TC_FLOW_OFFLOADED;
 
-   if (!(flow->flags & MLX5E_TC_FLOW_ESWITCH) ||
-   !(flow->esw_attr->action &
+   if (!(flow->esw_attr->action &
  MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT))
kvfree(parse_attr);
 
-	err = rhashtable_insert_fast(tc_ht, &flow->node, tc_ht_params);
-   if (err) {
-   mlx5e_tc_del_flow(priv, flow);
-   kfree(flow);
-   }
+   *__flow = flow;
 
+   return 0;
+
+err_free:
+   kfree(flow);

Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-17 Thread Francois Romieu
Holger Hoffstätte  :
[...]
> I continued to use the BQL patch in my private tree after it was reverted
> and also had occasional timeouts, but *only* after I started playing
> with ethtool to change offload settings. Without offloads or the BQL patch
> everything has been rock-solid since then.
> The other weird problem was that timeouts would occur on an otherwise
> *completely idle* system. Since that occasionally borked my NFS server
> over night I ultimately removed BQL as well. Rock-solid since then.

The bug will induce delayed rx processing when a spike of "load" is
followed by an idle period. I do not see how bql would matter per se though.

-- 
Ueimor


[PATCH v2 bpf-next 0/2] bpf: add cg_skb_is_valid_access

2018-10-17 Thread Song Liu
Changes v1 -> v2:
1. Updated the list of read-only fields, and read-write fields.
2. Added dummy sk to bpf_prog_test_run_skb().

This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
some __skb_buff data directly.

Song Liu (2):
  bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
  bpf: add tests for direct packet access from CGROUP_SKB

 kernel/bpf/cgroup.c |   4 +
 net/core/filter.c   |  37 -
 tools/testing/selftests/bpf/test_verifier.c | 159 
 3 files changed, 199 insertions(+), 1 deletion(-)

--
2.17.1


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-17 Thread Francois Romieu
Heiner Kallweit  :
[...]
> This issue has been there more or less forever (at least it exists in
> 3.16 already), so I can't provide a "Fixes" tag. 

Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.

-- 
Ueimor


[PATCH v2 bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB

2018-10-17 Thread Song Liu
BPF programs of type BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct packet access for these programs.

In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
to compute proper data_end for the BPF program.
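[Editor's note: direct packet access always goes through the data/data_end bounds check that the verifier enforces before any dereference. The pattern can be modeled in plain userspace C; the struct layout below is an illustrative stand-in, not the real skb or ethhdr.]

```c
#include <assert.h>

/* Toy Ethernet header, stand-in for struct ethhdr. */
struct toy_eth {
	unsigned char dst[6];
	unsigned char src[6];
	unsigned short proto;
};

/*
 * Model of the check a CGROUP_SKB program must emit before reading
 * packet bytes: reject unless the full header fits between data
 * (skb->data) and data_end (computed by bpf_compute_data_pointers()).
 */
static int eth_header_in_bounds(const void *data, const void *data_end)
{
	const struct toy_eth *eth = data;

	if ((const void *)(eth + 1) > data_end)
		return 0;	/* out of bounds: verifier-style reject */
	return 1;
}
```

In a real program the verifier rejects loads that are not dominated by such a comparison, which is why data_end must be valid before BPF_PROG_RUN_ARRAY is invoked.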

Signed-off-by: Song Liu 
---
 kernel/bpf/cgroup.c |  4 
 net/core/filter.c   | 36 +++-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00f6ed2e4f9a..340d496f35bd 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
save_sk = skb->sk;
skb->sk = sk;
__skb_push(skb, offset);
+
+   /* compute pointers for the bpf prog */
+   bpf_compute_data_pointers(skb);
+
ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
 bpf_prog_run_save_cb);
__skb_pull(skb, offset);
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c46873..e3ca30bd6840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5346,6 +5346,40 @@ static bool sk_filter_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool cg_skb_is_valid_access(int off, int size,
+  enum bpf_access_type type,
+  const struct bpf_prog *prog,
+  struct bpf_insn_access_aux *info)
+{
+   switch (off) {
+   case bpf_ctx_range(struct __sk_buff, tc_classid):
+   case bpf_ctx_range(struct __sk_buff, data_meta):
+   case bpf_ctx_range(struct __sk_buff, flow_keys):
+   return false;
+   }
+   if (type == BPF_WRITE) {
+   switch (off) {
+   case bpf_ctx_range(struct __sk_buff, mark):
+   case bpf_ctx_range(struct __sk_buff, priority):
+   case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+   break;
+   default:
+   return false;
+   }
+   }
+
+   switch (off) {
+   case bpf_ctx_range(struct __sk_buff, data):
+   info->reg_type = PTR_TO_PACKET;
+   break;
+   case bpf_ctx_range(struct __sk_buff, data_end):
+   info->reg_type = PTR_TO_PACKET_END;
+   break;
+   }
+
+   return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static bool lwt_is_valid_access(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
@@ -7038,7 +7072,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
 
 const struct bpf_verifier_ops cg_skb_verifier_ops = {
.get_func_proto = cg_skb_func_proto,
-   .is_valid_access= sk_filter_is_valid_access,
+   .is_valid_access= cg_skb_is_valid_access,
.convert_ctx_access = bpf_convert_ctx_access,
 };
 
-- 
2.17.1



[PATCH v2 bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB

2018-10-17 Thread Song Liu
Tests are added to make sure CGROUP_SKB cannot access:
  tc_classid, data_meta, flow_keys

and can read and write:
  mark, priority, and cb[0-4]

and can read other fields.

To make the selftests that use skb->sk work, a dummy sk is added in
bpf_prog_test_run_skb().

Signed-off-by: Song Liu 
---
 net/bpf/test_run.c  |   4 +
 tools/testing/selftests/bpf/test_verifier.c | 170 
 2 files changed, 174 insertions(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0c423b8cd75c..c7210e2f1ae9 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
@@ -115,6 +116,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const 
union bpf_attr *kattr,
u32 retval, duration;
int hh_len = ETH_HLEN;
struct sk_buff *skb;
+   struct sock sk;
void *data;
int ret;
 
@@ -142,6 +144,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const 
union bpf_attr *kattr,
kfree(data);
return -ENOMEM;
}
+	sock_init_data(NULL, &sk);
+	skb->sk = &sk;
 
skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
__skb_put(skb, size);
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index cf4cd32b6772..5bfba7e8afd7 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4862,6 +4862,176 @@ static struct bpf_test tests[] = {
.result = REJECT,
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
+   {
+   "direct packet read test#1 for CGROUP_SKB",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+   offsetof(struct __sk_buff, len)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+   offsetof(struct __sk_buff, pkt_type)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+   offsetof(struct __sk_buff, mark)),
+   BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+   offsetof(struct __sk_buff, mark)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+   offsetof(struct __sk_buff, queue_mapping)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+   offsetof(struct __sk_buff, protocol)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+   offsetof(struct __sk_buff, vlan_present)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   },
+   {
+   "direct packet read test#2 for CGROUP_SKB",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+   offsetof(struct __sk_buff, vlan_tci)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_5, BPF_REG_1,
+   offsetof(struct __sk_buff, vlan_proto)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+   offsetof(struct __sk_buff, priority)),
+   BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
+   offsetof(struct __sk_buff, priority)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+   offsetof(struct __sk_buff, 
ingress_ifindex)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+   offsetof(struct __sk_buff, tc_index)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_1,
+   offsetof(struct __sk_buff, hash)),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   },
+   {
+   "direct packet read test#3 for CGROUP_SKB",
+   .insns 

[PATCH bpf-next v3 13/13] tools/bpf: bpftool: add support for jited func types

2018-10-17 Thread Yonghong Song
This patch adds support for printing the function signature
if BTF func_info is available. Note that the ksym now uses the
function name instead of prog_name, as prog_name has a limit of
16 bytes including the terminating '\0'.

The following is a sample output for selftests
test_btf with file test_btf_haskv.o:

  $ bpftool prog dump jited id 1
  int _dummy_tracepoint(struct dummy_tracepoint_args * ):
  bpf_prog_b07ccb89267cf242__dummy_tracepoint:
 0:   push   %rbp
 1:   mov%rsp,%rbp
..
3c:   add$0x28,%rbp
40:   leaveq
41:   retq

  int test_long_fname_1(struct dummy_tracepoint_args * ):
  bpf_prog_2dcecc18072623fc_test_long_fname_1:
 0:   push   %rbp
 1:   mov%rsp,%rbp
..
3a:   add$0x28,%rbp
3e:   leaveq
3f:   retq

  int test_long_fname_2(struct dummy_tracepoint_args * ):
  bpf_prog_89d64e4abf0f0126_test_long_fname_2:
 0:   push   %rbp
 1:   mov%rsp,%rbp
..
80:   add$0x28,%rbp
84:   leaveq
85:   retq

Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/btf_dumper.c | 90 ++
 tools/bpf/bpftool/main.h   |  2 +
 tools/bpf/bpftool/prog.c   | 54 
 3 files changed, 146 insertions(+)

diff --git a/tools/bpf/bpftool/btf_dumper.c b/tools/bpf/bpftool/btf_dumper.c
index 55bc512a1831..6122b735ddcc 100644
--- a/tools/bpf/bpftool/btf_dumper.c
+++ b/tools/bpf/bpftool/btf_dumper.c
@@ -249,3 +249,93 @@ int btf_dumper_type(const struct btf_dumper *d, __u32 
type_id,
 {
return btf_dumper_do_type(d, type_id, 0, data);
 }
+
+#define BTF_PRINT_ARG(...) \
+   do {\
+   pos += snprintf(func_sig + pos, size - pos, \
+   __VA_ARGS__);   \
+   if (pos >= size)\
+   return -1;  \
+   } while (0)
+#define BTF_PRINT_TYPE(type)   \
+   do {\
+   pos = __btf_dumper_type_only(btf, type, func_sig,   \
+pos, size);\
+   if (pos == -1)  \
+   return -1;  \
+   } while (0)
+
+static int __btf_dumper_type_only(struct btf *btf, __u32 type_id,
+ char *func_sig, int pos, int size)
+{
+   const struct btf_type *t = btf__type_by_id(btf, type_id);
+   const struct btf_array *array;
+   int i, vlen;
+
+   switch (BTF_INFO_KIND(t->info)) {
+   case BTF_KIND_INT:
+   BTF_PRINT_ARG("%s ", btf__name_by_offset(btf, t->name_off));
+   break;
+   case BTF_KIND_STRUCT:
+   BTF_PRINT_ARG("struct %s ",
+ btf__name_by_offset(btf, t->name_off));
+   break;
+   case BTF_KIND_UNION:
+   BTF_PRINT_ARG("union %s ",
+ btf__name_by_offset(btf, t->name_off));
+   break;
+   case BTF_KIND_ENUM:
+   BTF_PRINT_ARG("enum %s ",
+ btf__name_by_offset(btf, t->name_off));
+   break;
+   case BTF_KIND_ARRAY:
+   array = (struct btf_array *)(t + 1);
+   BTF_PRINT_TYPE(array->type);
+   BTF_PRINT_ARG("[%d]", array->nelems);
+   break;
+   case BTF_KIND_PTR:
+   BTF_PRINT_TYPE(t->type);
+   BTF_PRINT_ARG("* ");
+   break;
+   case BTF_KIND_UNKN:
+   case BTF_KIND_FWD:
+   case BTF_KIND_TYPEDEF:
+   return -1;
+   case BTF_KIND_VOLATILE:
+   BTF_PRINT_ARG("volatile ");
+   BTF_PRINT_TYPE(t->type);
+   break;
+   case BTF_KIND_CONST:
+   BTF_PRINT_ARG("const ");
+   BTF_PRINT_TYPE(t->type);
+   break;
+   case BTF_KIND_RESTRICT:
+   BTF_PRINT_ARG("restrict ");
+   BTF_PRINT_TYPE(t->type);
+   break;
+   case BTF_KIND_FUNC:
+   case BTF_KIND_FUNC_PROTO:
+   BTF_PRINT_TYPE(t->type);
+   BTF_PRINT_ARG("%s(", btf__name_by_offset(btf, t->name_off));
+   vlen = BTF_INFO_VLEN(t->info);
+   for (i = 0; i < vlen; i++) {
+   __u32 arg_type = ((__u32 *)(t + 1))[i];
+
+   if (i)
+   BTF_PRINT_ARG(", ");
+   BTF_PRINT_TYPE(arg_type);
+   }
+   BTF_PRINT_ARG(")");
+   break;
+   default:
+   return -1;
+   }
+
+   return pos;
+}
+
+int btf_dumper_type_only(struct btf *btf, __u32 type_id, char *func_sig,
+  

[PATCH bpf-next v3 06/13] tools/bpf: sync kernel uapi bpf.h header to tools directory

2018-10-17 Thread Yonghong Song
The kernel uapi bpf.h is synced to the tools directory.

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f9187b41dff6..7ebbf4f06a65 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -332,6 +332,9 @@ union bpf_attr {
 * (context accesses, allowed helpers, etc).
 */
__u32   expected_attach_type;
+   __u32   prog_btf_fd;/* fd pointing to BTF type data */
+   __u32   func_info_len;  /* func_info length */
+   __aligned_u64   func_info;  /* func type info */
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -2585,6 +2588,9 @@ struct bpf_prog_info {
__u32 nr_jited_func_lens;
__aligned_u64 jited_ksyms;
__aligned_u64 jited_func_lens;
+   __u32 btf_id;
+   __u32 nr_jited_func_types;
+   __aligned_u64 jited_func_types;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
@@ -2896,4 +2902,9 @@ struct bpf_flow_keys {
};
 };
 
+struct bpf_func_info {
+   __u32   insn_offset;
+   __u32   type_id;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.17.1



[PATCH bpf-next v3 10/13] tools/bpf: do not use pahole if clang/llvm can generate BTF sections

2018-10-17 Thread Yonghong Song
Add additional checks in tools/testing/selftests/bpf and
samples/bpf so that pahole is not used when the clang/llvm
compiler can generate the BTF sections itself.

Signed-off-by: Yonghong Song 
---
 samples/bpf/Makefile | 8 
 tools/testing/selftests/bpf/Makefile | 8 
 2 files changed, 16 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961450bc..870fe7ee2b69 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -208,12 +208,20 @@ endif
 BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
 BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
+BTF_LLVM_PROBE := $(shell echo "int main() { return 0; }" | \
 clang -target bpf -O2 -g -c -x c - -o ./llvm_btf_verify.o; \
+ readelf -S ./llvm_btf_verify.o | grep BTF; \
+ /bin/rm -f ./llvm_btf_verify.o)
 
+ifneq ($(BTF_LLVM_PROBE),)
+   EXTRA_CFLAGS += -g
+else
 ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
EXTRA_CFLAGS += -g
LLC_FLAGS += -mattr=dwarfris
DWARF2BTF = y
 endif
+endif
 
 # Trick to allow make to be run from this directory
 all:
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index d99dd6fc3fbe..8d5612724db8 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -121,7 +121,14 @@ $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
 BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
+BTF_LLVM_PROBE := $(shell echo "int main() { return 0; }" | \
 clang -target bpf -O2 -g -c -x c - -o ./llvm_btf_verify.o; \
+ readelf -S ./llvm_btf_verify.o | grep BTF; \
+ /bin/rm -f ./llvm_btf_verify.o)
 
+ifneq ($(BTF_LLVM_PROBE),)
+   CLANG_FLAGS += -g
+else
 ifneq ($(BTF_LLC_PROBE),)
 ifneq ($(BTF_PAHOLE_PROBE),)
 ifneq ($(BTF_OBJCOPY_PROBE),)
@@ -131,6 +138,7 @@ ifneq ($(BTF_OBJCOPY_PROBE),)
 endif
 endif
 endif
+endif
 
 $(OUTPUT)/%.o: %.c
$(CLANG) $(CLANG_FLAGS) \
-- 
2.17.1



[PATCH bpf-next v3 12/13] tools/bpf: enhance test_btf file testing to test func info

2018-10-17 Thread Yonghong Song
Change the bpf programs test_btf_haskv.c and test_btf_nokv.c to
have two sections, and enhance the test_btf.c test_file feature
to test the btf func_info returned by the kernel.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_btf.c   | 87 ++--
 tools/testing/selftests/bpf/test_btf_haskv.c | 16 +++-
 tools/testing/selftests/bpf/test_btf_nokv.c  | 16 +++-
 3 files changed, 106 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index e03a8cea4bb7..38ca942eba28 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -2180,13 +2180,13 @@ static struct btf_file_test file_tests[] = {
 },
 };
 
-static int file_has_btf_elf(const char *fn)
+static int file_has_btf_elf(const char *fn, bool *has_btf_ext)
 {
Elf_Scn *scn = NULL;
GElf_Ehdr ehdr;
+   int ret = 0;
int elf_fd;
Elf *elf;
-   int ret;
 
if (CHECK(elf_version(EV_CURRENT) == EV_NONE,
  "elf_version(EV_CURRENT) == EV_NONE"))
@@ -2218,14 +2218,12 @@ static int file_has_btf_elf(const char *fn)
}
 
sh_name = elf_strptr(elf, ehdr.e_shstrndx, sh.sh_name);
-   if (!strcmp(sh_name, BTF_ELF_SEC)) {
+   if (!strcmp(sh_name, BTF_ELF_SEC))
ret = 1;
-   goto done;
-   }
+   if (!strcmp(sh_name, BTF_EXT_ELF_SEC))
+   *has_btf_ext = true;
}
 
-   ret = 0;
-
 done:
close(elf_fd);
elf_end(elf);
@@ -2235,15 +2233,22 @@ static int file_has_btf_elf(const char *fn)
 static int do_test_file(unsigned int test_num)
 {
const struct btf_file_test *test = &file_tests[test_num - 1];
+   const char *expected_fnames[] = {"_dummy_tracepoint",
+"test_long_fname_1",
+"test_long_fname_2"};
+   __u32 func_lens[10], func_types[10], info_len;
+   struct bpf_prog_info info = {};
struct bpf_object *obj = NULL;
struct bpf_program *prog;
+   bool has_btf_ext = false;
+   struct btf *btf = NULL;
struct bpf_map *map;
-   int err;
+   int i, err, prog_fd;
 
fprintf(stderr, "BTF libbpf test[%u] (%s): ", test_num,
test->file);
 
-   err = file_has_btf_elf(test->file);
+   err = file_has_btf_elf(test->file, &has_btf_ext);
if (err == -1)
return err;
 
@@ -2271,6 +2276,7 @@ static int do_test_file(unsigned int test_num)
err = bpf_object__load(obj);
if (CHECK(err < 0, "bpf_object__load: %d", err))
goto done;
+   prog_fd = bpf_program__fd(prog);
 
map = bpf_object__find_map_by_name(obj, "btf_map");
if (CHECK(!map, "btf_map not found")) {
@@ -2285,6 +2291,69 @@ static int do_test_file(unsigned int test_num)
  test->btf_kv_notfound))
goto done;
 
+   if (!jit_enabled || !has_btf_ext)
+   goto skip_jit;
+
+   info_len = sizeof(struct bpf_prog_info);
+   info.nr_jited_func_types = ARRAY_SIZE(func_types);
+   info.nr_jited_func_lens = ARRAY_SIZE(func_lens);
+   info.jited_func_types = ptr_to_u64(&func_types[0]);
+   info.jited_func_lens = ptr_to_u64(&func_lens[0]);
+
+   err = bpf_obj_get_info_by_fd(prog_fd, &info, &info_len);
+
+   if (CHECK(err == -1, "invalid get info errno:%d", errno)) {
+   fprintf(stderr, "%s\n", btf_log_buf);
+   err = -1;
+   goto done;
+   }
+   if (CHECK(info.nr_jited_func_lens != 3,
+ "incorrect info.nr_jited_func_lens %d",
+ info.nr_jited_func_lens)) {
+   err = -1;
+   goto done;
+   }
+   if (CHECK(info.nr_jited_func_types != 3,
+ "incorrect info.nr_jited_func_types %d",
+ info.nr_jited_func_types)) {
+   err = -1;
+   goto done;
+   }
+   if (CHECK(info.btf_id == 0, "incorrect btf_id = 0")) {
+   err = -1;
+   goto done;
+   }
+
+   err = btf_get_from_id(info.btf_id, &btf);
+   if (CHECK(err, "cannot get btf from kernel, err: %d", err))
+   goto done;
+
+   /* check three functions */
+   for (i = 0; i < 3; i++) {
+   const struct btf_type *t;
+   const char *fname;
+
+   t = btf__type_by_id(btf, func_types[i]);
+   if (CHECK(!t, "btf__type_by_id failure: id %u",
+ func_types[i])) {
+   err = -1;
+   goto done;
+   }
+
+   fname = btf__name_by_offset(btf, t->name_off);
+   err = strcmp(fname, expected_fnames[i]);
+   /* for the second and third functions in .text section,
+* the compiler may order them either way.
+ 

[PATCH bpf-next v3 05/13] bpf: get better bpf_prog ksyms based on btf func type_id

2018-10-17 Thread Yonghong Song
This patch adds an interface to load a program with the following
additional information:
   . prog_btf_fd
   . func_info and func_info_len
where func_info provides the insn range and btf type_id
of each function.

If the verifier agrees with the function ranges provided by the
user, the bpf_prog ksym for each function will use the func name
from the given type_id. This provides better names, since they
are not limited to the 16 bytes (including the terminating '\0')
of a program name, which especially helps bpf programs that
contain multiple subprograms.

The bpf_prog_info interface is also extended to
return btf_id and jited_func_types, so user space can
print out the function prototype of each jited function.

Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  2 +
 include/linux/bpf_verifier.h |  1 +
 include/linux/btf.h  |  2 +
 include/uapi/linux/bpf.h | 11 +
 kernel/bpf/btf.c |  4 +-
 kernel/bpf/core.c| 13 ++
 kernel/bpf/syscall.c | 86 +++-
 kernel/bpf/verifier.c| 46 +++
 8 files changed, 162 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e60fff48288b..a99e038ce9c4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -308,6 +308,8 @@ struct bpf_prog_aux {
void *security;
 #endif
struct bpf_prog_offload *offload;
+   struct btf *btf;
+   u32 type_id; /* type id for this prog/func */
union {
struct work_struct work;
struct rcu_head rcu;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9e8056ec20fa..e84782ec50ac 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -201,6 +201,7 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
 struct bpf_subprog_info {
u32 start; /* insn idx of function entry point */
u16 stack_depth; /* max. stack depth used by this function */
+   u32 type_id; /* btf type_id for this subprog */
 };
 
 /* single container for all structs
diff --git a/include/linux/btf.h b/include/linux/btf.h
index e076c4697049..7f2c0a4a45ea 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, 
void *obj,
   struct seq_file *m);
 int btf_get_fd_by_id(u32 id);
 u32 btf_id(const struct btf *btf);
+const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
+const char *btf_name_by_offset(const struct btf *btf, u32 offset);
 
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f9187b41dff6..7ebbf4f06a65 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -332,6 +332,9 @@ union bpf_attr {
 * (context accesses, allowed helpers, etc).
 */
__u32   expected_attach_type;
+   __u32   prog_btf_fd;/* fd pointing to BTF type data */
+   __u32   func_info_len;  /* func_info length */
+   __aligned_u64   func_info;  /* func type info */
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -2585,6 +2588,9 @@ struct bpf_prog_info {
__u32 nr_jited_func_lens;
__aligned_u64 jited_ksyms;
__aligned_u64 jited_func_lens;
+   __u32 btf_id;
+   __u32 nr_jited_func_types;
+   __aligned_u64 jited_func_types;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
@@ -2896,4 +2902,9 @@ struct bpf_flow_keys {
};
 };
 
+struct bpf_func_info {
+   __u32   insn_offset;
+   __u32   type_id;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 763f8e06bc91..13e83f82374b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -471,7 +471,7 @@ static bool btf_name_valid_identifier(const struct btf *btf, u32 offset)
return !*src;
 }
 
-static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
+const char *btf_name_by_offset(const struct btf *btf, u32 offset)
 {
if (!offset)
return "(anon)";
@@ -481,7 +481,7 @@ static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
return "(invalid-name-offset)";
 }
 
-static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
+const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
 {
if (type_id > btf->nr_types)
return NULL;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index defcf4df6d91..4c4d414e030a 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -21,12 +21,14 @@
  * Kris Katterjohn - Added many additional checks in bpf_check_classic()
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -387,6 +389,8 @@ 

[PATCH bpf-next v3 09/13] tools/bpf: add support to read .BTF.ext sections

2018-10-17 Thread Yonghong Song
The .BTF section is already available to encode types.
These types can be used for map pretty printing. The whole
.BTF section is also passed to the kernel, which can verify
it and return it to user space for pretty printing etc.

The llvm patch at https://reviews.llvm.org/D53261
generates the .BTF section plus one more section, .BTF.ext.
The .BTF.ext section encodes function type
information and line information. Note that
this patch set only supports function type info.
The functionality is implemented in libbpf.

The .BTF section can be loaded into the kernel directly,
but the .BTF.ext section cannot. The loader may need to do
some relocation and merging, similar to merging multiple
code sections, before loading it into the kernel.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/btf.c| 232 +
 tools/lib/bpf/btf.h|  48 +
 tools/lib/bpf/libbpf.c |  53 +-
 3 files changed, 329 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 33095fc1860b..4748e0bacd2b 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -37,6 +37,11 @@ struct btf {
int fd;
 };
 
+struct btf_ext {
+   void *func_info;
+   __u32 func_info_len;
+};
+
 static int btf_add_type(struct btf *btf, struct btf_type *t)
 {
if (btf->types_size - btf->nr_types < 2) {
@@ -397,3 +397,230 @@ const char *btf__name_by_offset(const struct btf *btf, __u32 offset)
else
return NULL;
 }
+
+static int btf_ext_validate_func_info(const struct btf_sec_func_info *sinfo,
+ __u32 size, btf_print_fn_t err_log)
+{
+   int sec_hdrlen = sizeof(struct btf_sec_func_info);
+   __u32 record_size = sizeof(struct bpf_func_info);
+   __u32 size_left = size, num_records;
+   __u64 total_record_size;
+
+   while (size_left) {
+   if (size_left < sec_hdrlen) {
+   elog("BTF.ext func_info header not found");
+   return -EINVAL;
+   }
+
+   num_records = sinfo->num_func_info;
+   if (num_records == 0) {
+   elog("incorrect BTF.ext num_func_info");
+   return -EINVAL;
+   }
+
+   total_record_size = sec_hdrlen +
+   (__u64)num_records * record_size;
+   if (size_left < total_record_size) {
+   elog("incorrect BTF.ext num_func_info");
+   return -EINVAL;
+   }
+
+   size_left -= total_record_size;
+   sinfo = (void *)sinfo + total_record_size;
+   }
+
+   return 0;
+}
+
+static int btf_ext_parse_hdr(__u8 *data, __u32 data_size,
+			     btf_print_fn_t err_log)
+{
+   const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
+   const struct btf_sec_func_info *sinfo;
+   __u32 meta_left, last_func_info_pos;
+
+   if (data_size < sizeof(*hdr)) {
+   elog("BTF.ext header not found");
+   return -EINVAL;
+   }
+
+   if (hdr->magic != BTF_MAGIC) {
+   elog("Invalid BTF.ext magic:%x\n", hdr->magic);
+   return -EINVAL;
+   }
+
+   if (hdr->version != BTF_VERSION) {
+   elog("Unsupported BTF.ext version:%u\n", hdr->version);
+   return -ENOTSUP;
+   }
+
+   if (hdr->flags) {
+   elog("Unsupported BTF.ext flags:%x\n", hdr->flags);
+   return -ENOTSUP;
+   }
+
+   meta_left = data_size - sizeof(*hdr);
+   if (!meta_left) {
+   elog("BTF.ext has no data\n");
+   return -EINVAL;
+   }
+
+   if (meta_left < hdr->func_info_off) {
+   elog("Invalid BTF.ext func_info section offset:%u\n",
+hdr->func_info_off);
+   return -EINVAL;
+   }
+
+   if (hdr->func_info_off & 0x03) {
+   elog("BTF.ext func_info section is not aligned to 4 bytes\n");
+   return -EINVAL;
+   }
+
+   last_func_info_pos = sizeof(*hdr) + hdr->func_info_off +
+hdr->func_info_len;
+   if (last_func_info_pos > data_size) {
+   elog("Invalid BTF.ext func_info section size:%u\n",
+hdr->func_info_len);
+   return -EINVAL;
+   }
+
+   sinfo = (const struct btf_sec_func_info *)(data + sizeof(*hdr) +
+  hdr->func_info_off);
+   return btf_ext_validate_func_info(sinfo, hdr->func_info_len,
+ err_log);
+}
+
+void btf_ext__free(struct btf_ext *btf_ext)
+{
+   if (!btf_ext)
+   return;
+
+   free(btf_ext->func_info);
+   free(btf_ext);
+}
+
+struct btf_ext *btf_ext__new(__u8 *data, __u32 size, btf_print_fn_t err_log)
+{
+   int hdrlen = sizeof(struct btf_ext_header);
+ 

[PATCH bpf-next v3 01/13] bpf: btf: Break up btf_type_is_void()

2018-10-17 Thread Yonghong Song
This patch breaks up btf_type_is_void() into
btf_type_is_void() and btf_type_is_fwd().

It also adds btf_type_nosize() to better convey that it
tests whether a type carries no size info.

Signed-off-by: Martin KaFai Lau 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/btf.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 378cef70341c..be406d8906ce 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -306,15 +306,22 @@ static bool btf_type_is_modifier(const struct btf_type *t)
 
 static bool btf_type_is_void(const struct btf_type *t)
 {
-   /* void => no type and size info.
-* Hence, FWD is also treated as void.
-*/
-   return t == &btf_void || BTF_INFO_KIND(t->info) == BTF_KIND_FWD;
+   return t == &btf_void;
+}
+
+static bool btf_type_is_fwd(const struct btf_type *t)
+{
+   return BTF_INFO_KIND(t->info) == BTF_KIND_FWD;
+}
+
+static bool btf_type_nosize(const struct btf_type *t)
+{
+   return btf_type_is_void(t) || btf_type_is_fwd(t);
 }
 
-static bool btf_type_is_void_or_null(const struct btf_type *t)
+static bool btf_type_nosize_or_null(const struct btf_type *t)
 {
-   return !t || btf_type_is_void(t);
+   return !t || btf_type_nosize(t);
 }
 
 /* union is only a special case of struct:
@@ -826,7 +833,7 @@ const struct btf_type *btf_type_id_size(const struct btf *btf,
u32 size = 0;
 
size_type = btf_type_by_id(btf, size_type_id);
-   if (btf_type_is_void_or_null(size_type))
+   if (btf_type_nosize_or_null(size_type))
return NULL;
 
if (btf_type_has_size(size_type)) {
@@ -842,7 +849,7 @@ const struct btf_type *btf_type_id_size(const struct btf *btf,
size = btf->resolved_sizes[size_type_id];
size_type_id = btf->resolved_ids[size_type_id];
size_type = btf_type_by_id(btf, size_type_id);
-   if (btf_type_is_void(size_type))
+   if (btf_type_nosize_or_null(size_type))
return NULL;
}
 
@@ -1164,7 +1171,7 @@ static int btf_modifier_resolve(struct btf_verifier_env *env,
}
 
/* "typedef void new_void", "const void"...etc */
-   if (btf_type_is_void(next_type))
+   if (btf_type_is_void(next_type) || btf_type_is_fwd(next_type))
goto resolved;
 
if (!env_type_is_resolve_sink(env, next_type) &&
@@ -1178,7 +1185,7 @@ static int btf_modifier_resolve(struct btf_verifier_env *env,
 * pretty print).
 */
if (!btf_type_id_size(btf, _type_id, _type_size) &&
-   !btf_type_is_void(btf_type_id_resolve(btf, _type_id))) {
+   !btf_type_nosize(btf_type_id_resolve(btf, _type_id))) {
btf_verifier_log_type(env, v->t, "Invalid type_id");
return -EINVAL;
}
@@ -1205,7 +1212,7 @@ static int btf_ptr_resolve(struct btf_verifier_env *env,
}
 
/* "void *" */
-   if (btf_type_is_void(next_type))
+   if (btf_type_is_void(next_type) || btf_type_is_fwd(next_type))
goto resolved;
 
if (!env_type_is_resolve_sink(env, next_type) &&
@@ -1235,7 +1242,7 @@ static int btf_ptr_resolve(struct btf_verifier_env *env,
}
 
if (!btf_type_id_size(btf, _type_id, _type_size) &&
-   !btf_type_is_void(btf_type_id_resolve(btf, _type_id))) {
+   !btf_type_nosize(btf_type_id_resolve(btf, _type_id))) {
btf_verifier_log_type(env, v->t, "Invalid type_id");
return -EINVAL;
}
@@ -1396,7 +1403,7 @@ static int btf_array_resolve(struct btf_verifier_env *env,
/* Check array->index_type */
index_type_id = array->index_type;
index_type = btf_type_by_id(btf, index_type_id);
-   if (btf_type_is_void_or_null(index_type)) {
+   if (btf_type_nosize_or_null(index_type)) {
btf_verifier_log_type(env, v->t, "Invalid index");
return -EINVAL;
}
@@ -1415,7 +1422,7 @@ static int btf_array_resolve(struct btf_verifier_env *env,
/* Check array->type */
elem_type_id = array->type;
elem_type = btf_type_by_id(btf, elem_type_id);
-   if (btf_type_is_void_or_null(elem_type)) {
+   if (btf_type_nosize_or_null(elem_type)) {
btf_verifier_log_type(env, v->t,
  "Invalid elem");
return -EINVAL;
@@ -1615,7 +1622,7 @@ static int btf_struct_resolve(struct btf_verifier_env *env,
const struct btf_type *member_type = btf_type_by_id(env->btf,
member_type_id);
 
-   if (btf_type_is_void_or_null(member_type)) {
+   if (btf_type_nosize_or_null(member_type)) {
btf_verifier_log_member(env, v->t, member,
"Invalid member");

[PATCH bpf-next v3 08/13] tools/bpf: extends test_btf to test load/retrieve func_type info

2018-10-17 Thread Yonghong Song
A bpf program with two functions is loaded with btf and func_info.
After a successful prog load, the bpf_get_info syscall is called
to retrieve the prog info and ensure the types returned from the
kernel match the types passed to the kernel from user space.

Several negative tests are also added to exercise loading and
retrieving of func_type info.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_btf.c | 278 -
 1 file changed, 275 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index b6461c3c5e11..e03a8cea4bb7 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -22,9 +23,13 @@
 #include "bpf_rlimit.h"
 #include "bpf_util.h"
 
+#define MAX_INSNS  512
+#define MAX_SUBPROGS   16
+
 static uint32_t pass_cnt;
 static uint32_t error_cnt;
 static uint32_t skip_cnt;
+static bool jit_enabled;
 
 #define CHECK(condition, format...) ({ \
int __ret = !!(condition);  \
@@ -60,6 +65,24 @@ static int __base_pr(const char *format, ...)
return err;
 }
 
+static bool is_jit_enabled(void)
+{
+   const char *jit_sysctl = "/proc/sys/net/core/bpf_jit_enable";
+   bool enabled = false;
+   int sysctl_fd;
+
+   sysctl_fd = open(jit_sysctl, O_RDONLY);
+   if (sysctl_fd != -1) {
+   char tmpc;
+
+   if (read(sysctl_fd, &tmpc, sizeof(tmpc)) == 1)
+   enabled = (tmpc != '0');
+   close(sysctl_fd);
+   }
+
+   return enabled;
+}
+
 #define BTF_INFO_ENC(kind, root, vlen) \
((!!(root) << 31) | ((kind) << 24) | ((vlen) & BTF_MAX_VLEN))
 
@@ -103,6 +126,7 @@ static struct args {
bool get_info_test;
bool pprint_test;
bool always_log;
+   bool func_type_test;
 } args;
 
 static char btf_log_buf[BTF_LOG_BUF_SIZE];
@@ -2693,16 +2717,256 @@ static int test_pprint(void)
return err;
 }
 
+static struct btf_func_type_test {
+   const char *descr;
+   const char *str_sec;
+   __u32 raw_types[MAX_NR_RAW_TYPES];
+   __u32 str_sec_size;
+   struct bpf_insn insns[MAX_INSNS];
+   __u32 prog_type;
+   struct bpf_func_info func_info[MAX_SUBPROGS];
+   __u32 func_info_len;
+   bool expected_prog_load_failure;
+} func_type_test[] = {
+
+{
+   .descr = "func_type test #1",
+   .str_sec = "\0int\0unsigned int\0funcA\0funcB",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(NAME_TBD, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(NAME_TBD, 0, 0, 32, 4),   /* [2] */
+   BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_FUNC, 0, 2), 1),  /* [3] */
+   1, 2,
+   BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_FUNC, 0, 2), 1),  /* [4] */
+   2, 1,
+   BTF_END_RAW,
+   },
+   .str_sec_size = sizeof("\0int\0unsigned int\0funcA\0funcB"),
+   .insns = {
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 2),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   .func_info = { {0, 3}, {3, 4} },
+   .func_info_len = 2 * sizeof(struct bpf_func_info),
+},
+
+{
+   .descr = "func_type test #2",
+   .str_sec = "\0int\0unsigned int\0funcA\0funcB",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(NAME_TBD, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(NAME_TBD, 0, 0, 32, 4),   /* [2] */
+   /* incorrect func type */
+   BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_FUNC_PROTO, 0, 2), 1),  /* [3] */
+   1, 2,
+   BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_FUNC, 0, 2), 1),  /* [4] */
+   2, 1,
+   BTF_END_RAW,
+   },
+   .str_sec_size = sizeof("\0int\0unsigned int\0funcA\0funcB"),
+   .insns = {
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 2),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   .func_info = { {0, 3}, {3, 4} },
+   .func_info_len = 2 * sizeof(struct bpf_func_info),
+   .expected_prog_load_failure = true,
+},
+
+{
+   .descr = "func_type test #3",
+   .str_sec = "\0int\0unsigned int\0funcA\0funcB",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(NAME_TBD, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(NAME_TBD, 0, 0, 32, 4),   /* [2] */
+   BTF_TYPE_ENC(NAME_TBD, 

[PATCH bpf-next v3 04/13] tools/bpf: add btf func/func_proto unit tests in selftest test_btf

2018-10-17 Thread Yonghong Song
Add several BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
unit tests in bpf selftest test_btf.

Signed-off-by: Martin KaFai Lau 
Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/btf.c|   4 +
 tools/testing/selftests/bpf/test_btf.c | 216 +
 2 files changed, 220 insertions(+)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 449591aa9900..33095fc1860b 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -165,6 +165,10 @@ static int btf_parse_type_sec(struct btf *btf, btf_print_fn_t err_log)
case BTF_KIND_ENUM:
next_type += vlen * sizeof(struct btf_enum);
break;
+   case BTF_KIND_FUNC:
+   case BTF_KIND_FUNC_PROTO:
+   next_type += vlen * sizeof(int);
+   break;
case BTF_KIND_TYPEDEF:
case BTF_KIND_PTR:
case BTF_KIND_FWD:
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index f42b3396d622..b6461c3c5e11 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -1374,6 +1374,222 @@ static struct btf_raw_test raw_tests[] = {
.map_create_err = true,
 },
 
+{
+   .descr = "func pointer #1",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),   /* [2] */
+   /* int (*func)(int, unsigned int) */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_FUNC_PROTO, 0, 2), 1),   /* [3] */
+   1, 2,
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_PTR, 0, 0), 3),   /* [4] */
+   BTF_END_RAW,
+   },
+   .str_sec = "",
+   .str_sec_size = sizeof(""),
+   .map_type = BPF_MAP_TYPE_ARRAY,
+   .map_name = "func_type_check_btf",
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .key_type_id = 1,
+   .value_type_id = 1,
+   .max_entries = 4,
+},
+
+{
+   .descr = "func pointer #2",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),   /* [2] */
+   /* void (*func)(int, unsigned int, ) */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_FUNC_PROTO, 0, 3), 0),   /* [3] */
+   1, 2, 0,
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_PTR, 0, 0), 3),   /* [4] */
+   BTF_END_RAW,
+   },
+   .str_sec = "",
+   .str_sec_size = sizeof(""),
+   .map_type = BPF_MAP_TYPE_ARRAY,
+   .map_name = "func_type_check_btf",
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .key_type_id = 1,
+   .value_type_id = 1,
+   .max_entries = 4,
+},
+
+{
+   .descr = "func pointer #3",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),   /* [2] */
+   /* void (*func)(void, int, unsigned int) */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_FUNC_PROTO, 0, 3), 0),   /* [3] */
+   1, 0, 2,
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_PTR, 0, 0), 3),   /* [4] */
+   BTF_END_RAW,
+   },
+   .str_sec = "",
+   .str_sec_size = sizeof(""),
+   .map_type = BPF_MAP_TYPE_ARRAY,
+   .map_name = "func_type_check_btf",
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .key_type_id = 1,
+   .value_type_id = 1,
+   .max_entries = 4,
+   .btf_load_err = true,
+   .err_str = "Invalid arg#2",
+},
+
+{
+   .descr = "func pointer #4",
+   .raw_types = {
+   BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+   BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),   /* [2] */
+   /*
+* Testing:
+* BTF_KIND_CONST => BTF_KIND_TYPEDEF => BTF_KIND_PTR =>
+* BTF_KIND_FUNC_PROTO
+*/
+   /* typedef void (*func_ptr)(int, unsigned int) */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_TYPEDEF, 0, 0), 5),/* [3] */
+   /* const func_ptr */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_CONST, 0, 0), 3), /* [4] */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_PTR, 0, 0), 6),   /* [5] */
+   BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_FUNC_PROTO, 0, 2), 0),   /* [6] */
+   1, 2,
+   BTF_END_RAW,
+   },
+   .str_sec = "",
+   .str_sec_size = sizeof(""),
+   .map_type = BPF_MAP_TYPE_ARRAY,
+   .map_name = "func_type_check_btf",
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .key_type_id = 1,
+   .value_type_id = 1,
+   .max_entries = 4,
+},
+
+{
+   .descr = "func pointer #5",
+ 

[PATCH bpf-next v3 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO

2018-10-17 Thread Yonghong Song
This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
support to the type section. BTF_KIND_FUNC_PROTO is used
to specify the type of a function pointer. With this,
BTF has a complete set of C types (except float).

BTF_KIND_FUNC is used to specify the signature of a
defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
by another type, e.g., a pointer type, and BTF_KIND_FUNC
type cannot be referenced by another type.

For both BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO types,
the func return type is in t->type (where t is a
"struct btf_type" object). The func args are an array of
u32s immediately following object "t".

As a concrete example, for the C program below,
  $ cat test.c
  int foo(int (*bar)(int)) { return bar(5); }
with LLVM patch https://reviews.llvm.org/D53261
in Debug mode, we have
  $ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c
  Type Table:
  [1] FUNC NameOff=1 Info=0x0c01 Size/Type=2
  ParamType=3
  [2] INT NameOff=11 Info=0x0100 Size/Type=4
  Desc=0x0120
  [3] PTR NameOff=0 Info=0x0200 Size/Type=4
  [4] FUNC_PROTO NameOff=0 Info=0x0d01 Size/Type=2
  ParamType=2

  String Table:
  0 :
  1 : foo
  5 : .text
  11 : int
  15 : test.c
  22 : int foo(int (*bar)(int)) { return bar(5); }

  FuncInfo Table:
  SecNameOff=5
  InsnOffset= TypeId=1

  ...

(Eventually we shall have bpftool to dump btf information
 like the above.)

Function "foo" has a FUNC type (type_id = 1).
The parameter of "foo" has type_id 3 which is PTR->FUNC_PROTO,
where FUNC_PROTO refers to function pointer "bar".

In FuncInfo Table, for section .text, the function,
with to-be-determined offset (marked as ),
has type_id=1 which refers to a FUNC type.
This way, the function signature is
available to both kernel and user space.
Here, the insn offset is not available during the dump time
as relocation is resolved pretty late in the compilation process.

Signed-off-by: Martin KaFai Lau 
Signed-off-by: Yonghong Song 
---
 include/uapi/linux/btf.h |   9 +-
 kernel/bpf/btf.c | 280 ++-
 2 files changed, 253 insertions(+), 36 deletions(-)

diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h
index 972265f32871..63f8500e6f34 100644
--- a/include/uapi/linux/btf.h
+++ b/include/uapi/linux/btf.h
@@ -40,7 +40,8 @@ struct btf_type {
/* "size" is used by INT, ENUM, STRUCT and UNION.
 * "size" tells the size of the type it is describing.
 *
-* "type" is used by PTR, TYPEDEF, VOLATILE, CONST and RESTRICT.
+* "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
+* FUNC and FUNC_PROTO.
 * "type" is a type_id referring to another type.
 */
union {
@@ -64,8 +65,10 @@ struct btf_type {
 #define BTF_KIND_VOLATILE  9   /* Volatile */
 #define BTF_KIND_CONST 10  /* Const*/
 #define BTF_KIND_RESTRICT  11  /* Restrict */
-#define BTF_KIND_MAX   11
-#define NR_BTF_KINDS   12
+#define BTF_KIND_FUNC  12  /* Function */
+#define BTF_KIND_FUNC_PROTO13  /* Function Prototype   */
+#define BTF_KIND_MAX   13
+#define NR_BTF_KINDS   14
 
 /* For some specific BTF_KIND, "struct btf_type" is immediately
  * followed by extra data.
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index be406d8906ce..763f8e06bc91 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -259,6 +260,8 @@ static const char * const btf_kind_str[NR_BTF_KINDS] = {
[BTF_KIND_VOLATILE] = "VOLATILE",
[BTF_KIND_CONST]= "CONST",
[BTF_KIND_RESTRICT] = "RESTRICT",
+   [BTF_KIND_FUNC] = "FUNC",
+   [BTF_KIND_FUNC_PROTO]   = "FUNC_PROTO",
 };
 
 struct btf_kind_operations {
@@ -281,6 +284,9 @@ struct btf_kind_operations {
 static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS];
 static struct btf_type btf_void;
 
+static int btf_resolve(struct btf_verifier_env *env,
+  const struct btf_type *t, u32 type_id);
+
 static bool btf_type_is_modifier(const struct btf_type *t)
 {
/* Some of them is not strictly a C modifier
@@ -314,9 +320,20 @@ static bool btf_type_is_fwd(const struct btf_type *t)
return BTF_INFO_KIND(t->info) == BTF_KIND_FWD;
 }
 
+static bool btf_type_is_func(const struct btf_type *t)
+{
+   return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC;
+}
+
+static bool btf_type_is_func_proto(const struct btf_type *t)
+{
+   return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC_PROTO;
+}
+
 static bool btf_type_nosize(const struct btf_type *t)
 {
-   return btf_type_is_void(t) || btf_type_is_fwd(t);
+   return btf_type_is_void(t) || btf_type_is_fwd(t) ||
+  btf_type_is_func(t) || btf_type_is_func_proto(t);
 }
 
 static bool btf_type_nosize_or_null(const struct btf_type *t)
@@ -433,6 

[PATCH bpf-next v3 00/13] bpf: add btf func info support

2018-10-17 Thread Yonghong Song
The BTF support was added to kernel by Commit 69b693f0aefa
("bpf: btf: Introduce BPF Type Format (BTF)"), which introduced
.BTF section into ELF file and is primarily
used for map pretty print.
pahole is used to convert dwarf to BTF for ELF files.

This patch set adds func info support to the kernel so we can
get better ksyms for bpf function calls. Basically,
pairs of bpf function calls and their corresponding types
are passed to the kernel. Extracting function names from
the types, the kernel is able to construct a ksym for
each function call with embedded function name.

This patch set added support of FUNC and FUNC_PROTO types
in the kernel. LLVM patch https://reviews.llvm.org/D53261
can generate func info, encoded in .BTF.ext ELF section.
The following is an example to show FUNC and
FUNC_PROTO difference, compiled with the above LLVM patch
with Debug mode.

  -bash-4.2$ cat test.c
  int foo(int (*bar)(int)) { return bar(5); }
  -bash-4.2$ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c
  Type Table:
  [1] FUNC name_off=1 info=0x0c01 size/type=2
param_type=3
  [2] INT name_off=11 info=0x0100 size/type=4
desc=0x0120
  [3] PTR name_off=0 info=0x0200 size/type=4
  [4] FUNC_PROTO name_off=0 info=0x0d01 size/type=2
param_type=2

  String Table:
  0 :
  1 : foo
  5 : .text
  11 : int
  15 : test.c
  22 : int foo(int (*bar)(int)) { return bar(5); }

  FuncInfo Table:
  sec_name_off=5
insn_offset= type_id=1
  ...
  
In the above, type and string tables are in .BTF section and
the func info in .BTF.ext. The "" is the
insn offset which is not available during the dump time but
resolved during the later compilation process.
Following the format specification in Patch #9 and examining the
raw data in the .BTF.ext section, we have
  FuncInfo Table:
  sec_name_off=5
insn_offset=0 type_id=1
The (insn_offset, type_id) can be passed to the kernel
so the kernel can find the func name and use it in the ksym.
Below is a demonstration from Patch #13.
  $ bpftool prog dump jited id 1
  int _dummy_tracepoint(struct dummy_tracepoint_args * ):
  bpf_prog_b07ccb89267cf242__dummy_tracepoint:
 0:   push   %rbp
 1:   mov%rsp,%rbp
..
3c:   add$0x28,%rbp
40:   leaveq
41:   retq
  
  int test_long_fname_1(struct dummy_tracepoint_args * ):
  bpf_prog_2dcecc18072623fc_test_long_fname_1:
 0:   push   %rbp
 1:   mov%rsp,%rbp
..
3a:   add$0x28,%rbp
3e:   leaveq
3f:   retq
  
  int test_long_fname_2(struct dummy_tracepoint_args * ):
  bpf_prog_89d64e4abf0f0126_test_long_fname_2:
 0:   push   %rbp
 1:   mov%rsp,%rbp
..
80:   add$0x28,%rbp
84:   leaveq
85:   retq

For the patchset,
Patch #1  refactors the code to break up btf_type_is_void().
Patch #2  introduces new BTF types BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO.
Patch #3  syncs btf.h header to tools directory.
Patch #4  adds btf func/func_proto self tests in test_btf.
Patch #5  adds kernel interface to load func_info to kernel
  and pass func_info to userspace.
Patch #6  syncs bpf.h header to tools directory.
Patch #7  adds news btf/func_info related fields in libbpf
  program load function.
Patch #8  extends selftest test_btf to test load/retrieve func_type info.
Patch #9  adds .BTF.ext func info support.
Patch #10 changes Makefile to avoid using pahole if llvm is capable of
  generating BTF sections.
Patch #11 refactors to have btf_get_from_id() in libbpf for reuse.
Patch #12 enhance test_btf file testing to test func info.
Patch #13 adds bpftool support for func signature dump.

Changelogs:
  v2 -> v3:
. Removed kernel btf extern functions btf_type_id_func()
  and btf_get_name_by_id(). Instead, exposing existing
  functions btf_type_by_id() and btf_name_by_offset().
. Added comments about ELF section .BTF.ext layout.
. Better codes in btftool as suggested by Edward Cree.
  v1 -> v2:
. Added missing sign-off.
. Limited the func_name/struct_member_name length for validity test.
. Removed/changed several verifier messages.
. Modified several commit messages to remove line_off reference.

Yonghong Song (13):
  bpf: btf: Break up btf_type_is_void()
  bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
  tools/bpf: sync kernel btf.h header
  tools/bpf: add btf func/func_proto unit tests in selftest test_btf
  bpf: get better bpf_prog ksyms based on btf func type_id
  tools/bpf: sync kernel uapi bpf.h header to tools directory
  tools/bpf: add new fields for program load in lib/bpf
  tools/bpf: extends test_btf to test load/retrieve func_type info
  tools/bpf: add support to read .BTF.ext sections
  tools/bpf: do not use pahole if clang/llvm can generate BTF sections
  tools/bpf: refactor to implement btf_get_from_id() in lib/bpf
  tools/bpf: enhance test_btf file testing to test func info
  

[PATCH bpf-next v3 11/13] tools/bpf: refactor to implement btf_get_from_id() in lib/bpf

2018-10-17 Thread Yonghong Song
The function get_btf() is implemented in tools/bpf/bpftool/map.c
to get a btf structure given a map_info. This patch
refactors that function into btf_get_from_id()
in tools/lib/bpf so that it can be reused later.

Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/map.c | 68 ++--
 tools/lib/bpf/btf.c | 69 +
 tools/lib/bpf/btf.h | 18 ++-
 3 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 7bf38f0e152e..1b8a75fa0471 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -215,70 +215,6 @@ static int do_dump_btf(const struct btf_dumper *d,
return ret;
 }
 
-static int get_btf(struct bpf_map_info *map_info, struct btf **btf)
-{
-   struct bpf_btf_info btf_info = { 0 };
-   __u32 len = sizeof(btf_info);
-   __u32 last_size;
-   int btf_fd;
-   void *ptr;
-   int err;
-
-   err = 0;
-   *btf = NULL;
-   btf_fd = bpf_btf_get_fd_by_id(map_info->btf_id);
-   if (btf_fd < 0)
-   return 0;
-
-   /* we won't know btf_size until we call bpf_obj_get_info_by_fd(). so
-* let's start with a sane default - 4KiB here - and resize it only if
-* bpf_obj_get_info_by_fd() needs a bigger buffer.
-*/
-   btf_info.btf_size = 4096;
-   last_size = btf_info.btf_size;
-   ptr = malloc(last_size);
-   if (!ptr) {
-   err = -ENOMEM;
-   goto exit_free;
-   }
-
-   bzero(ptr, last_size);
-   btf_info.btf = ptr_to_u64(ptr);
-   err = bpf_obj_get_info_by_fd(btf_fd, &btf_info, &len);
-
-   if (!err && btf_info.btf_size > last_size) {
-   void *temp_ptr;
-
-   last_size = btf_info.btf_size;
-   temp_ptr = realloc(ptr, last_size);
-   if (!temp_ptr) {
-   err = -ENOMEM;
-   goto exit_free;
-   }
-   ptr = temp_ptr;
-   bzero(ptr, last_size);
-   btf_info.btf = ptr_to_u64(ptr);
-   err = bpf_obj_get_info_by_fd(btf_fd, &btf_info, &len);
-   }
-
-   if (err || btf_info.btf_size > last_size) {
-   err = errno;
-   goto exit_free;
-   }
-
-   *btf = btf__new((__u8 *)btf_info.btf, btf_info.btf_size, NULL);
-   if (IS_ERR(*btf)) {
-   err = PTR_ERR(*btf);
-   *btf = NULL;
-   }
-
-exit_free:
-   close(btf_fd);
-   free(ptr);
-
-   return err;
-}
-
 static json_writer_t *get_btf_writer(void)
 {
json_writer_t *jw = jsonw_new(stdout);
@@ -765,7 +701,7 @@ static int do_dump(int argc, char **argv)
 
prev_key = NULL;
 
-   err = get_btf(&info, &btf);
+   err = btf_get_from_id(info.btf_id, &btf);
if (err) {
p_err("failed to get btf");
goto exit_free;
@@ -909,7 +845,7 @@ static int do_lookup(int argc, char **argv)
}
 
/* here means bpf_map_lookup_elem() succeeded */
-   err = get_btf(&info, &btf);
+   err = btf_get_from_id(info.btf_id, &btf);
if (err) {
p_err("failed to get btf");
goto exit_free;
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 4748e0bacd2b..ab654628e966 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -42,6 +42,11 @@ struct btf_ext {
__u32 func_info_len;
 };
 
+static inline __u64 ptr_to_u64(const void *ptr)
+{
+   return (__u64) (unsigned long) ptr;
+}
+
 static int btf_add_type(struct btf *btf, struct btf_type *t)
 {
if (btf->types_size - btf->nr_types < 2) {
@@ -403,6 +408,70 @@ const char *btf__name_by_offset(const struct btf *btf, 
__u32 offset)
return NULL;
 }
 
+int btf_get_from_id(__u32 id, struct btf **btf)
+{
+   struct bpf_btf_info btf_info = { 0 };
+   __u32 len = sizeof(btf_info);
+   __u32 last_size;
+   int btf_fd;
+   void *ptr;
+   int err;
+
+   err = 0;
+   *btf = NULL;
+   btf_fd = bpf_btf_get_fd_by_id(id);
+   if (btf_fd < 0)
+   return 0;
+
+   /* we won't know btf_size until we call bpf_obj_get_info_by_fd(). so
+* let's start with a sane default - 4KiB here - and resize it only if
+* bpf_obj_get_info_by_fd() needs a bigger buffer.
+*/
+   btf_info.btf_size = 4096;
+   last_size = btf_info.btf_size;
+   ptr = malloc(last_size);
+   if (!ptr) {
+   err = -ENOMEM;
+   goto exit_free;
+   }
+
+   bzero(ptr, last_size);
+   btf_info.btf = ptr_to_u64(ptr);
+   err = bpf_obj_get_info_by_fd(btf_fd, &btf_info, &len);
+
+   if (!err && btf_info.btf_size > last_size) {
+   void *temp_ptr;
+
+   last_size = btf_info.btf_size;
+   temp_ptr = realloc(ptr, last_size);
+   if (!temp_ptr) {
+   err = -ENOMEM;
+   

[PATCH bpf-next v3 03/13] tools/bpf: sync kernel btf.h header

2018-10-17 Thread Yonghong Song
The kernel uapi btf.h is synced to the tools directory.

Signed-off-by: Martin KaFai Lau 
Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/btf.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/btf.h b/tools/include/uapi/linux/btf.h
index 972265f32871..63f8500e6f34 100644
--- a/tools/include/uapi/linux/btf.h
+++ b/tools/include/uapi/linux/btf.h
@@ -40,7 +40,8 @@ struct btf_type {
/* "size" is used by INT, ENUM, STRUCT and UNION.
 * "size" tells the size of the type it is describing.
 *
-* "type" is used by PTR, TYPEDEF, VOLATILE, CONST and RESTRICT.
+* "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
+* FUNC and FUNC_PROTO.
 * "type" is a type_id referring to another type.
 */
union {
@@ -64,8 +65,10 @@ struct btf_type {
 #define BTF_KIND_VOLATILE  9   /* Volatile */
 #define BTF_KIND_CONST 10  /* Const*/
 #define BTF_KIND_RESTRICT  11  /* Restrict */
-#define BTF_KIND_MAX   11
-#define NR_BTF_KINDS   12
+#define BTF_KIND_FUNC  12  /* Function */
+#define BTF_KIND_FUNC_PROTO13  /* Function Prototype   */
+#define BTF_KIND_MAX   13
+#define NR_BTF_KINDS   14
 
 /* For some specific BTF_KIND, "struct btf_type" is immediately
  * followed by extra data.
-- 
2.17.1



[PATCH bpf-next v3 07/13] tools/bpf: add new fields for program load in lib/bpf

2018-10-17 Thread Yonghong Song
The new fields are added for program load in lib/bpf so that
an application using the api bpf_load_program_xattr() is able
to load a program with btf and func_info data.

This functionality will be used in the next patch
by bpf selftest test_btf.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/bpf.c | 3 +++
 tools/lib/bpf/bpf.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index d70a255cb05e..d8d48ab34220 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -196,6 +196,9 @@ int bpf_load_program_xattr(const struct 
bpf_load_program_attr *load_attr,
attr.log_level = 0;
attr.kern_version = load_attr->kern_version;
attr.prog_ifindex = load_attr->prog_ifindex;
+   attr.prog_btf_fd = load_attr->prog_btf_fd;
+   attr.func_info_len = load_attr->func_info_len;
+   attr.func_info = ptr_to_u64(load_attr->func_info);
memcpy(attr.prog_name, load_attr->name,
   min(name_len, BPF_OBJ_NAME_LEN - 1));
 
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 258c3c178333..d2bdaffd7712 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -74,6 +74,9 @@ struct bpf_load_program_attr {
const char *license;
__u32 kern_version;
__u32 prog_ifindex;
+   __u32 prog_btf_fd;
+   __u32 func_info_len;
+   const struct bpf_func_info *func_info;
 };
 
 /* Flags to direct loading requirements */
-- 
2.17.1



Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-17 Thread Daniel Borkmann
On 10/17/2018 05:50 PM, Peter Zijlstra wrote:
> On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
>> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap 
>> *mm)
>>  {
>>  struct perf_event_mmap_page *pc = mm->base;
>>  u64 head = READ_ONCE(pc->data_head);
>> -rmb();
>> +
>> +smp_rmb();
>>  return head;
>>  }
>>  
>> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap 
>> *md, u64 tail)
>>  /*
>>   * ensure all reads are done before we write the tail out.
>>   */
>> -mb();
>> +smp_mb();
>>  pc->data_tail = tail;
> 
> Ideally that would be a WRITE_ONCE() to avoid store tearing.

Right, agree.

> Alternatively, I think we can use smp_store_release() here, all we care
> about is that the prior loads stay prior.
> 
> Similarly, I suppose, we could use smp_load_acquire() for the data_head
> load above.

Wouldn't this then also allow the kernel side to use smp_store_release()
when it updates the head? We'd be pretty much at the model as described
in Documentation/core-api/circular-buffers.rst.

Meaning, rough pseudo-code diff would look as:

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..3d96275 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle 
*handle)
 *
 * See perf_output_begin().
 */
-   smp_wmb(); /* B, matches C */
-   rb->user_page->data_head = head;
+
+   /* B, matches C */
+   smp_store_release(&rb->user_page->data_head, head);

/*
 * Now check if we missed an update -- rely on previous implied

Plus, user space side of perf (assuming we have the barriers imported):

diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..66e1304 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -72,20 +72,15 @@ void perf_mmap__consume(struct perf_mmap *map);
 static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
struct perf_event_mmap_page *pc = mm->base;
-   u64 head = READ_ONCE(pc->data_head);
-   rmb();
-   return head;
+
+   return smp_load_acquire(&pc->data_head);
 }

 static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 {
struct perf_event_mmap_page *pc = md->base;

-   /*
-* ensure all reads are done before we write the tail out.
-*/
-   mb();
-   pc->data_tail = tail;
+   smp_store_release(&pc->data_tail, tail);
 }

 union perf_event *perf_mmap__read_forward(struct perf_mmap *map);


Re: gianfar: Implement MAC reset and reconfig procedure

2018-10-17 Thread Daniel Walker
On Tue, Oct 16, 2018 at 03:03:07PM -0700, Florian Fainelli wrote:
> On 10/16/2018 02:36 PM, Daniel Walker wrote:
> > Hi,
> > 
> > I would like to report an issue in the gianfar driver. The issue is as 
> > follows. 
> > 
> > We have a P2020 board that uses the gianfar driver, and we have an m88e1101
> > PHY connected. When the interface is initially brought up traffic flows as
> > normal. If you take the interface down then bring it back up traffic stops
> > flowing. If you do this sequence over and over up/down/up we find that the
> > interface will allow traffic to flow at a low percentage.
> > 
> > In v4.9 interface allows traffic about %10 of the time.
> > 
> > In v4.19-rc8 the allows traffic %30 of the time.
> > 
> > After bisecting I found that in v3.14 the interface was rock solid and 
> > never did
> > we see this issue. However, v3.15 we started to see this issue. After 
> > bisecting I
> > found the following change is the first one which causes the issue,
> > 
> > a328ac9 gianfar: Implement MAC reset and reconfig procedure
> > 
> > I was able to revert this in v3.15 , however with later development a revert
> > doesn't appear to be possible. We have no fix for this currently.
> > 
> > I can do testing if you have an idea what might cause the issue.
> 
> What we have seen being typically the problem is that when you have a
> PHY connection whereby the PHY provides the RX clock to the MAC (e.g:
> RGMII), it is very easy to get in a situation where the PHY clock is
> stopped, and the MAC is asked to be reset, but the HW design does not
> like that at all since it e.g: stops on packet boundaries and need some
> clock cycles to do that, and that results in all sorts of issues (in our
> case it was some FIFO corruption). We solved that in bcmgenet.c with
> looping internally the TX clock to the RX clock to make sure the
> Ethernet MAC (UniMAC in our designs) was successfully reset:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=28c2d1a7a0bfdf3617800d2beae1c67983c03d15
> 
> Could that somehow be the problem here?

A little more context on this issue after some debugging.


The patch which I quote above adds a line into int startup_gfar() which does,

gfar_mac_reset(priv);

If this line is removed then everything starts working again (this is debugging
at the v3.15 source level).

On further inspection, the block of code inside gfar_mac_reset() that causes the
problem is this one,

/* Initialize MACCFG2. */
tempval = MACCFG2_INIT_SETTINGS;
if (gfar_has_errata(priv, GFAR_ERRATA_74))
tempval |= MACCFG2_HUGEFRAME | MACCFG2_LENGTHCHECK;
gfar_write(&regs->maccfg2, tempval);

and if you change this block to this,

tempval = gfar_read(&regs->maccfg2);
if (gfar_has_errata(priv, GFAR_ERRATA_74))
tempval |= MACCFG2_HUGEFRAME | MACCFG2_LENGTHCHECK;
gfar_write(&regs->maccfg2, tempval);

Then everything starts working.

At least on my hardware, doing a gfar_read() when the hardware first comes up
doesn't cause any issues; however, I don't know about other hardware. It would
seem that MACCFG2_INIT_SETTINGS is not set up correctly or shouldn't be used
in this context.

Daniel


[net-next 06/11] igc: Add transmit and receive fastpath and interrupt handlers

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

This patch adds support for allocating, configuring, and freeing Tx/Rx ring
resources.  With these changes in place the descriptor queues are in a
state where they are ready to transmit or receive if provided buffers.

This also adds the transmit and receive fastpath and interrupt handlers.
With this code in place the network device is now able to send and receive
frames over the network interface using a single queue.

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/igc.h |   66 +
 drivers/net/ethernet/intel/igc/igc_base.h|   15 +
 drivers/net/ethernet/intel/igc/igc_defines.h |   45 +
 drivers/net/ethernet/intel/igc/igc_main.c| 1123 +-
 4 files changed, 1205 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc.h 
b/drivers/net/ethernet/intel/igc/igc.h
index 7bb19328b899..88ee451e36fd 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -32,13 +32,31 @@ extern char igc_driver_version[];
 #define IGC_START_ITR  648 /* ~6000 ints/sec */
 #define IGC_FLAG_HAS_MSI   BIT(0)
 #define IGC_FLAG_QUEUE_PAIRS   BIT(4)
+#define IGC_FLAG_NEED_LINK_UPDATE  BIT(9)
 #define IGC_FLAG_HAS_MSIX  BIT(13)
+#define IGC_FLAG_VLAN_PROMISC  BIT(15)
 
 #define IGC_START_ITR  648 /* ~6000 ints/sec */
 #define IGC_4K_ITR 980
 #define IGC_20K_ITR196
 #define IGC_70K_ITR56
 
+#define IGC_DEFAULT_ITR3 /* dynamic */
+#define IGC_MAX_ITR_USECS  1
+#define IGC_MIN_ITR_USECS  10
+#define NON_Q_VECTORS  1
+#define MAX_MSIX_ENTRIES   10
+
+/* TX/RX descriptor defines */
+#define IGC_DEFAULT_TXD256
+#define IGC_DEFAULT_TX_WORK128
+#define IGC_MIN_TXD80
+#define IGC_MAX_TXD4096
+
+#define IGC_DEFAULT_RXD256
+#define IGC_MIN_RXD80
+#define IGC_MAX_RXD4096
+
 /* Transmit and receive queues */
 #define IGC_MAX_RX_QUEUES  4
 #define IGC_MAX_TX_QUEUES  4
@@ -85,6 +103,16 @@ extern char igc_driver_version[];
 #define IGC_MAX_FRAME_BUILD_SKB (IGC_RXBUFFER_2048 - IGC_TS_HDR_LEN)
 #endif
 
+/* How many Rx Buffers do we bundle into one write to the hardware ? */
+#define IGC_RX_BUFFER_WRITE16 /* Must be power of 2 */
+
+/* igc_test_staterr - tests bits within Rx descriptor status and error fields 
*/
+static inline __le32 igc_test_staterr(union igc_adv_rx_desc *rx_desc,
+ const u32 stat_err_bits)
+{
+   return rx_desc->wb.upper.status_error & cpu_to_le32(stat_err_bits);
+}
+
 enum igc_state_t {
__IGC_TESTING,
__IGC_RESETTING,
@@ -92,6 +120,27 @@ enum igc_state_t {
__IGC_PTP_TX_IN_PROGRESS,
 };
 
+enum igc_tx_flags {
+   /* cmd_type flags */
+   IGC_TX_FLAGS_VLAN   = 0x01,
+   IGC_TX_FLAGS_TSO= 0x02,
+   IGC_TX_FLAGS_TSTAMP = 0x04,
+
+   /* olinfo flags */
+   IGC_TX_FLAGS_IPV4   = 0x10,
+   IGC_TX_FLAGS_CSUM   = 0x20,
+};
+
+/* The largest size we can write to the descriptor is 65535.  In order to
+ * maintain a power of two alignment we have to limit ourselves to 32K.
+ */
+#define IGC_MAX_TXD_PWR15
+#define IGC_MAX_DATA_PER_TXD   BIT(IGC_MAX_TXD_PWR)
+
+/* Tx Descriptors needed, worst case */
+#define TXD_USE_COUNT(S)   DIV_ROUND_UP((S), IGC_MAX_DATA_PER_TXD)
+#define DESC_NEEDED(MAX_SKB_FRAGS + 4)
+
 /* wrapper around a pointer to a socket buffer,
  * so a DMA handle can be stored along with the buffer
  */
@@ -123,6 +172,7 @@ struct igc_tx_queue_stats {
u64 packets;
u64 bytes;
u64 restart_queue;
+   u64 restart_queue2;
 };
 
 struct igc_rx_queue_stats {
@@ -181,11 +231,14 @@ struct igc_ring {
/* TX */
struct {
struct igc_tx_queue_stats tx_stats;
+   struct u64_stats_sync tx_syncp;
+   struct u64_stats_sync tx_syncp2;
};
/* RX */
struct {
struct igc_rx_queue_stats rx_stats;
struct igc_rx_packet_stats pkt_stats;
+   struct u64_stats_sync rx_syncp;
struct sk_buff *skb;
};
};
@@ -258,11 +311,17 @@ struct igc_adapter {
struct work_struct watchdog_task;
struct work_struct dma_err_task;
 
+   u8 tx_timeout_factor;
+
int msg_enable;
u32 max_frame_size;
+   u32 min_frame_size;
 
/* OS defined structs */
struct pci_dev *pdev;
+   /* lock for statistics */
+   spinlock_t stats64_lock;
+   struct rtnl_link_stats64 stats64;
 
/* structs defined in igc_hw.h */
struct igc_hw hw;
@@ -275,8 

[net-next 01/11] igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller support

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

This patch adds the beginning framework onto which I am going to add
the igc driver which supports the Intel(R) I225-LM/I225-V 2.5G
Ethernet Controller.

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/Kconfig|  16 +++
 drivers/net/ethernet/intel/Makefile   |   1 +
 drivers/net/ethernet/intel/igc/Makefile   |  10 ++
 drivers/net/ethernet/intel/igc/igc.h  |  29 +
 drivers/net/ethernet/intel/igc/igc_hw.h   |  10 ++
 drivers/net/ethernet/intel/igc/igc_main.c | 146 ++
 6 files changed, 212 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/igc/Makefile
 create mode 100644 drivers/net/ethernet/intel/igc/igc.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_hw.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_main.c

diff --git a/drivers/net/ethernet/intel/Kconfig 
b/drivers/net/ethernet/intel/Kconfig
index b542aba6f0e8..76f926d4bf13 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -287,4 +287,20 @@ config FM10K
  To compile this driver as a module, choose M here. The module
  will be called fm10k.  MSI-X interrupt support is required
 
+config IGC
+   tristate "Intel(R) Ethernet Controller I225-LM/I225-V support"
+   default n
+   depends on PCI
+   ---help---
+ This driver supports Intel(R) Ethernet Controller I225-LM/I225-V
+ family of adapters.
+
+ For more information on how to identify your adapter, go
+ to the Adapter & Driver ID Guide that can be located at:
+
+ 
+
+ To compile this driver as a module, choose M here. The module
+ will be called igc.
+
 endif # NET_VENDOR_INTEL
diff --git a/drivers/net/ethernet/intel/Makefile 
b/drivers/net/ethernet/intel/Makefile
index b91153df6ee8..3075290063f6 100644
--- a/drivers/net/ethernet/intel/Makefile
+++ b/drivers/net/ethernet/intel/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_E100) += e100.o
 obj-$(CONFIG_E1000) += e1000/
 obj-$(CONFIG_E1000E) += e1000e/
 obj-$(CONFIG_IGB) += igb/
+obj-$(CONFIG_IGC) += igc/
 obj-$(CONFIG_IGBVF) += igbvf/
 obj-$(CONFIG_IXGBE) += ixgbe/
 obj-$(CONFIG_IXGBEVF) += ixgbevf/
diff --git a/drivers/net/ethernet/intel/igc/Makefile 
b/drivers/net/ethernet/intel/igc/Makefile
new file mode 100644
index ..3d13b015d401
--- /dev/null
+++ b/drivers/net/ethernet/intel/igc/Makefile
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c)  2018 Intel Corporation
+
+#
+# Intel(R) I225-LM/I225-V 2.5G Ethernet Controller
+#
+
+obj-$(CONFIG_IGC) += igc.o
+
+igc-objs := igc_main.o
diff --git a/drivers/net/ethernet/intel/igc/igc.h 
b/drivers/net/ethernet/intel/igc/igc.h
new file mode 100644
index ..afe595cfcf63
--- /dev/null
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c)  2018 Intel Corporation */
+
+#ifndef _IGC_H_
+#define _IGC_H_
+
+#include 
+
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#define IGC_ERR(args...) pr_err("igc: " args)
+
+#define PFX "igc: "
+
+#include 
+#include 
+#include 
+
+/* main */
+extern char igc_driver_name[];
+extern char igc_driver_version[];
+
+#endif /* _IGC_H_ */
diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h 
b/drivers/net/ethernet/intel/igc/igc_hw.h
new file mode 100644
index ..aa68b4516700
--- /dev/null
+++ b/drivers/net/ethernet/intel/igc/igc_hw.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c)  2018 Intel Corporation */
+
+#ifndef _IGC_HW_H_
+#define _IGC_HW_H_
+
+#define IGC_DEV_ID_I225_LM 0x15F2
+#define IGC_DEV_ID_I225_V  0x15F3
+
+#endif /* _IGC_HW_H_ */
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c 
b/drivers/net/ethernet/intel/igc/igc_main.c
new file mode 100644
index ..753749ce5ae0
--- /dev/null
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c)  2018 Intel Corporation */
+
+#include 
+#include 
+
+#include "igc.h"
+#include "igc_hw.h"
+
+#define DRV_VERSION"0.0.1-k"
+#define DRV_SUMMARY"Intel(R) 2.5G Ethernet Linux Driver"
+
+MODULE_AUTHOR("Intel Corporation, ");
+MODULE_DESCRIPTION(DRV_SUMMARY);
+MODULE_LICENSE("GPL v2");
+MODULE_VERSION(DRV_VERSION);
+
+char igc_driver_name[] = "igc";
+char igc_driver_version[] = DRV_VERSION;
+static const char igc_driver_string[] = DRV_SUMMARY;
+static const char igc_copyright[] =
+   "Copyright(c) 2018 Intel Corporation.";
+
+static const struct pci_device_id igc_pci_tbl[] = {
+   { PCI_VDEVICE(INTEL, IGC_DEV_ID_I225_LM) },
+   { PCI_VDEVICE(INTEL, IGC_DEV_ID_I225_V) },
+   /* required last entry */
+   {0, }
+};
+
+MODULE_DEVICE_TABLE(pci, igc_pci_tbl);
+
+/**
+ * igc_probe - Device Initialization Routine
+ * @pdev: PCI device information struct
+ * 

[net-next 07/11] igc: Add HW initialization code

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

Add code for hardware initialization and reset
Add code for semaphore handling

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/Makefile  |   2 +-
 drivers/net/ethernet/intel/igc/igc_base.c| 187 +++
 drivers/net/ethernet/intel/igc/igc_base.h|   2 +
 drivers/net/ethernet/intel/igc/igc_defines.h |  36 +++
 drivers/net/ethernet/intel/igc/igc_hw.h  |  85 +
 drivers/net/ethernet/intel/igc/igc_i225.c| 141 +
 drivers/net/ethernet/intel/igc/igc_mac.c | 315 +++
 drivers/net/ethernet/intel/igc/igc_mac.h |  11 +
 drivers/net/ethernet/intel/igc/igc_main.c|  21 ++
 drivers/net/ethernet/intel/igc/igc_regs.h|  20 ++
 10 files changed, 819 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/igc/igc_i225.c

diff --git a/drivers/net/ethernet/intel/igc/Makefile 
b/drivers/net/ethernet/intel/igc/Makefile
index c32c45300692..8b8022ea590a 100644
--- a/drivers/net/ethernet/intel/igc/Makefile
+++ b/drivers/net/ethernet/intel/igc/Makefile
@@ -7,4 +7,4 @@
 
 obj-$(CONFIG_IGC) += igc.o
 
-igc-objs := igc_main.o igc_mac.o igc_base.o
+igc-objs := igc_main.o igc_mac.o igc_i225.o igc_base.o
diff --git a/drivers/net/ethernet/intel/igc/igc_base.c 
b/drivers/net/ethernet/intel/igc/igc_base.c
index 3425b7466017..4efb47497e6b 100644
--- a/drivers/net/ethernet/intel/igc/igc_base.c
+++ b/drivers/net/ethernet/intel/igc/igc_base.c
@@ -5,6 +5,184 @@
 
 #include "igc_hw.h"
 #include "igc_i225.h"
+#include "igc_mac.h"
+#include "igc_base.h"
+#include "igc.h"
+
+/**
+ * igc_set_pcie_completion_timeout - set pci-e completion timeout
+ * @hw: pointer to the HW structure
+ */
+static s32 igc_set_pcie_completion_timeout(struct igc_hw *hw)
+{
+   u32 gcr = rd32(IGC_GCR);
+   u16 pcie_devctl2;
+   s32 ret_val = 0;
+
+   /* only take action if timeout value is defaulted to 0 */
+   if (gcr & IGC_GCR_CMPL_TMOUT_MASK)
+   goto out;
+
+   /* if capabilities version is type 1 we can write the
+* timeout of 10ms to 200ms through the GCR register
+*/
+   if (!(gcr & IGC_GCR_CAP_VER2)) {
+   gcr |= IGC_GCR_CMPL_TMOUT_10ms;
+   goto out;
+   }
+
+   /* for version 2 capabilities we need to write the config space
+* directly in order to set the completion timeout value for
+* 16ms to 55ms
+*/
+   ret_val = igc_read_pcie_cap_reg(hw, PCIE_DEVICE_CONTROL2,
+   &pcie_devctl2);
+   if (ret_val)
+   goto out;
+
+   pcie_devctl2 |= PCIE_DEVICE_CONTROL2_16ms;
+
+   ret_val = igc_write_pcie_cap_reg(hw, PCIE_DEVICE_CONTROL2,
+    &pcie_devctl2);
+out:
+   /* disable completion timeout resend */
+   gcr &= ~IGC_GCR_CMPL_TMOUT_RESEND;
+
+   wr32(IGC_GCR, gcr);
+
+   return ret_val;
+}
+
+/**
+ * igc_reset_hw_base - Reset hardware
+ * @hw: pointer to the HW structure
+ *
+ * This resets the hardware into a known state.  This is a
+ * function pointer entry point called by the api module.
+ */
+static s32 igc_reset_hw_base(struct igc_hw *hw)
+{
+   s32 ret_val;
+   u32 ctrl;
+
+   /* Prevent the PCI-E bus from sticking if there is no TLP connection
+* on the last TLP read/write transaction when MAC is reset.
+*/
+   ret_val = igc_disable_pcie_master(hw);
+   if (ret_val)
+   hw_dbg("PCI-E Master disable polling has failed.\n");
+
+   /* set the completion timeout for interface */
+   ret_val = igc_set_pcie_completion_timeout(hw);
+   if (ret_val)
+   hw_dbg("PCI-E Set completion timeout has failed.\n");
+
+   hw_dbg("Masking off all interrupts\n");
+   wr32(IGC_IMC, 0x);
+
+   wr32(IGC_RCTL, 0);
+   wr32(IGC_TCTL, IGC_TCTL_PSP);
+   wrfl();
+
+   usleep_range(1, 2);
+
+   ctrl = rd32(IGC_CTRL);
+
+   hw_dbg("Issuing a global reset to MAC\n");
+   wr32(IGC_CTRL, ctrl | IGC_CTRL_RST);
+
+   ret_val = igc_get_auto_rd_done(hw);
+   if (ret_val) {
+   /* When auto config read does not complete, do not
+* return with an error. This can happen in situations
+* where there is no eeprom and prevents getting link.
+*/
+   hw_dbg("Auto Read Done did not complete\n");
+   }
+
+   /* Clear any pending interrupt events. */
+   wr32(IGC_IMC, 0x);
+   rd32(IGC_ICR);
+
+   return ret_val;
+}
+
+/**
+ * igc_init_mac_params_base - Init MAC func ptrs.
+ * @hw: pointer to the HW structure
+ */
+static s32 igc_init_mac_params_base(struct igc_hw *hw)
+{
+   struct igc_mac_info *mac = &hw->mac;
+
+   /* Set mta register count */
+   mac->mta_reg_count = 128;
+   mac->rar_entry_count = IGC_RAR_ENTRIES;
+
+   /* reset */
+   mac->ops.reset_hw = 

[net-next 08/11] igc: Add NVM support

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

Add code for NVM support and reading the MAC address, and complete
the probe method.

Signed-off-by: Sasha Neftin 
Signed-off-by: Alexander Duyck 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/Makefile  |   2 +-
 drivers/net/ethernet/intel/igc/igc.h |   6 +
 drivers/net/ethernet/intel/igc/igc_base.c| 109 ++
 drivers/net/ethernet/intel/igc/igc_defines.h |  52 +++
 drivers/net/ethernet/intel/igc/igc_hw.h  |   3 +
 drivers/net/ethernet/intel/igc/igc_i225.c| 349 +++
 drivers/net/ethernet/intel/igc/igc_i225.h|   3 +
 drivers/net/ethernet/intel/igc/igc_mac.c | 170 +
 drivers/net/ethernet/intel/igc/igc_mac.h |   6 +
 drivers/net/ethernet/intel/igc/igc_main.c|  20 +-
 drivers/net/ethernet/intel/igc/igc_nvm.c | 215 
 drivers/net/ethernet/intel/igc/igc_nvm.h |  14 +
 drivers/net/ethernet/intel/igc/igc_regs.h|   3 +
 13 files changed, 949 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/igc/igc_nvm.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_nvm.h

diff --git a/drivers/net/ethernet/intel/igc/Makefile b/drivers/net/ethernet/intel/igc/Makefile
index 8b8022ea590a..2b5378d96c7b 100644
--- a/drivers/net/ethernet/intel/igc/Makefile
+++ b/drivers/net/ethernet/intel/igc/Makefile
@@ -7,4 +7,4 @@
 
 obj-$(CONFIG_IGC) += igc.o
 
-igc-objs := igc_main.o igc_mac.o igc_i225.o igc_base.o
+igc-objs := igc_main.o igc_mac.o igc_i225.o igc_base.o igc_nvm.o
diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 88ee451e36fd..6dcf51c112f4 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -131,6 +131,10 @@ enum igc_tx_flags {
IGC_TX_FLAGS_CSUM   = 0x20,
 };
 
+enum igc_boards {
+   board_base,
+};
+
 /* The largest size we can write to the descriptor is 65535.  In order to
  * maintain a power of two alignment we have to limit ourselves to 32K.
  */
@@ -342,6 +346,8 @@ struct igc_adapter {
spinlock_t nfc_lock;
 
struct igc_mac_addr *mac_table;
+
+   struct igc_info ei;
 };
 
 /* igc_desc_unused - calculate if we have unused descriptors */
diff --git a/drivers/net/ethernet/intel/igc/igc_base.c b/drivers/net/ethernet/intel/igc/igc_base.c
index 4efb47497e6b..2d49814966d3 100644
--- a/drivers/net/ethernet/intel/igc/igc_base.c
+++ b/drivers/net/ethernet/intel/igc/igc_base.c
@@ -53,6 +53,22 @@ static s32 igc_set_pcie_completion_timeout(struct igc_hw *hw)
return ret_val;
 }
 
+/**
+ * igc_check_for_link_base - Check for link
+ * @hw: pointer to the HW structure
+ *
+ * If sgmii is enabled, then use the pcs register to determine link, otherwise
+ * use the generic interface for determining link.
+ */
+static s32 igc_check_for_link_base(struct igc_hw *hw)
+{
+   s32 ret_val = 0;
+
+   ret_val = igc_check_for_copper_link(hw);
+
+   return ret_val;
+}
+
 /**
  * igc_reset_hw_base - Reset hardware
  * @hw: pointer to the HW structure
@@ -107,12 +123,51 @@ static s32 igc_reset_hw_base(struct igc_hw *hw)
return ret_val;
 }
 
+/**
+ * igc_init_nvm_params_base - Init NVM func ptrs.
+ * @hw: pointer to the HW structure
+ */
+static s32 igc_init_nvm_params_base(struct igc_hw *hw)
+{
+   struct igc_nvm_info *nvm = &hw->nvm;
+   u32 eecd = rd32(IGC_EECD);
+   u16 size;
+
+   size = (u16)((eecd & IGC_EECD_SIZE_EX_MASK) >>
+IGC_EECD_SIZE_EX_SHIFT);
+
+   /* Added to a constant, "size" becomes the left-shift value
+* for setting word_size.
+*/
+   size += NVM_WORD_SIZE_BASE_SHIFT;
+
+   /* Just in case size is out of range, cap it to the largest
+* EEPROM size supported
+*/
+   if (size > 15)
+   size = 15;
+
+   nvm->word_size = BIT(size);
+   nvm->opcode_bits = 8;
+   nvm->delay_usec = 1;
+
+   nvm->page_size = eecd & IGC_EECD_ADDR_BITS ? 32 : 8;
+   nvm->address_bits = eecd & IGC_EECD_ADDR_BITS ?
+   16 : 8;
+
+   if (nvm->word_size == BIT(15))
+   nvm->page_size = 128;
+
+   return 0;
+}
+
 /**
  * igc_init_mac_params_base - Init MAC func ptrs.
  * @hw: pointer to the HW structure
  */
 static s32 igc_init_mac_params_base(struct igc_hw *hw)
 {
+   struct igc_dev_spec_base *dev_spec = &hw->dev_spec._base;
struct igc_mac_info *mac = &hw->mac;
 
/* Set mta register count */
@@ -125,6 +180,10 @@ static s32 igc_init_mac_params_base(struct igc_hw *hw)
mac->ops.acquire_swfw_sync = igc_acquire_swfw_sync_i225;
mac->ops.release_swfw_sync = igc_release_swfw_sync_i225;
 
+   /* Allow a single clear of the SW semaphore on I225 */
+   if (mac->type == igc_i225)
+   dev_spec->clear_semaphore_once = true;
+
return 0;
 }
 
@@ -142,10 +201,43 @@ static s32 igc_get_invariants_base(struct igc_hw *hw)
if (ret_val)

[net-next 05/11] igc: Add support for Tx/Rx rings

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

This change adds the defines and structures necessary to support both Tx
and Rx descriptor rings.

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/Makefile  |   2 +-
 drivers/net/ethernet/intel/igc/igc.h | 125 +++
 drivers/net/ethernet/intel/igc/igc_base.c|  83 ++
 drivers/net/ethernet/intel/igc/igc_base.h|  89 ++
 drivers/net/ethernet/intel/igc/igc_defines.h |  43 +
 drivers/net/ethernet/intel/igc/igc_hw.h  |   1 +
 drivers/net/ethernet/intel/igc/igc_main.c| 827 +++
 drivers/net/ethernet/intel/igc/igc_regs.h|   3 +
 8 files changed, 1172 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/igc/igc_base.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_base.h

diff --git a/drivers/net/ethernet/intel/igc/Makefile b/drivers/net/ethernet/intel/igc/Makefile
index 06e0b9e23a8c..c32c45300692 100644
--- a/drivers/net/ethernet/intel/igc/Makefile
+++ b/drivers/net/ethernet/intel/igc/Makefile
@@ -7,4 +7,4 @@
 
 obj-$(CONFIG_IGC) += igc.o
 
-igc-objs := igc_main.o igc_mac.o
+igc-objs := igc_main.o igc_mac.o igc_base.o
diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index e595d135ea7b..7bb19328b899 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -46,6 +46,45 @@ extern char igc_driver_version[];
 #define MAX_Q_VECTORS  8
 #define MAX_STD_JUMBO_FRAME_SIZE   9216
 
+/* Supported Rx Buffer Sizes */
+#define IGC_RXBUFFER_256   256
+#define IGC_RXBUFFER_2048  2048
+#define IGC_RXBUFFER_3072  3072
+
+#define IGC_RX_HDR_LEN IGC_RXBUFFER_256
+
+/* RX and TX descriptor control thresholds.
+ * PTHRESH - MAC will consider prefetch if it has fewer than this number of
+ *   descriptors available in its onboard memory.
+ *   Setting this to 0 disables RX descriptor prefetch.
+ * HTHRESH - MAC will only prefetch if there are at least this many descriptors
+ *   available in host memory.
+ *   If PTHRESH is 0, this should also be 0.
+ * WTHRESH - RX descriptor writeback threshold - MAC will delay writing back
+ *   descriptors until either it has this many to write back, or the
+ *   ITR timer expires.
+ */
+#define IGC_RX_PTHRESH 8
+#define IGC_RX_HTHRESH 8
+#define IGC_TX_PTHRESH 8
+#define IGC_TX_HTHRESH 1
+#define IGC_RX_WTHRESH 4
+#define IGC_TX_WTHRESH 16
+
+#define IGC_RX_DMA_ATTR \
+   (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
+
+#define IGC_TS_HDR_LEN 16
+
+#define IGC_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
+
+#if (PAGE_SIZE < 8192)
+#define IGC_MAX_FRAME_BUILD_SKB \
+   (SKB_WITH_OVERHEAD(IGC_RXBUFFER_2048) - IGC_SKB_PAD - IGC_TS_HDR_LEN)
+#else
+#define IGC_MAX_FRAME_BUILD_SKB (IGC_RXBUFFER_2048 - IGC_TS_HDR_LEN)
+#endif
+
 enum igc_state_t {
__IGC_TESTING,
__IGC_RESETTING,
@@ -53,6 +92,33 @@ enum igc_state_t {
__IGC_PTP_TX_IN_PROGRESS,
 };
 
+/* wrapper around a pointer to a socket buffer,
+ * so a DMA handle can be stored along with the buffer
+ */
+struct igc_tx_buffer {
+   union igc_adv_tx_desc *next_to_watch;
+   unsigned long time_stamp;
+   struct sk_buff *skb;
+   unsigned int bytecount;
+   u16 gso_segs;
+   __be16 protocol;
+
+   DEFINE_DMA_UNMAP_ADDR(dma);
+   DEFINE_DMA_UNMAP_LEN(len);
+   u32 tx_flags;
+};
+
+struct igc_rx_buffer {
+   dma_addr_t dma;
+   struct page *page;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+   __u32 page_offset;
+#else
+   __u16 page_offset;
+#endif
+   __u16 pagecnt_bias;
+};
+
 struct igc_tx_queue_stats {
u64 packets;
u64 bytes;
@@ -214,4 +280,63 @@ struct igc_adapter {
struct igc_mac_addr *mac_table;
 };
 
+/* igc_desc_unused - calculate if we have unused descriptors */
+static inline u16 igc_desc_unused(const struct igc_ring *ring)
+{
+   u16 ntc = ring->next_to_clean;
+   u16 ntu = ring->next_to_use;
+
+   return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
+}
+
+static inline struct netdev_queue *txring_txq(const struct igc_ring *tx_ring)
+{
+   return netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
+}
+
+enum igc_ring_flags_t {
+   IGC_RING_FLAG_RX_3K_BUFFER,
+   IGC_RING_FLAG_RX_BUILD_SKB_ENABLED,
+   IGC_RING_FLAG_RX_SCTP_CSUM,
+   IGC_RING_FLAG_RX_LB_VLAN_BSWAP,
+   IGC_RING_FLAG_TX_CTX_IDX,
+   IGC_RING_FLAG_TX_DETECT_HANG
+};
+
+#define ring_uses_large_buffer(ring) \
+   test_bit(IGC_RING_FLAG_RX_3K_BUFFER, &(ring)->flags)
+
+#define ring_uses_build_skb(ring) \
+   test_bit(IGC_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+
+static inline unsigned int igc_rx_bufsz(struct igc_ring 

[net-next 04/11] igc: Add interrupt support

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

This patch adds interrupt support for the igc interfaces.

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/igc.h |  127 +++
 drivers/net/ethernet/intel/igc/igc_defines.h |   40 +
 drivers/net/ethernet/intel/igc/igc_hw.h  |   84 ++
 drivers/net/ethernet/intel/igc/igc_main.c| 1016 ++
 4 files changed, 1267 insertions(+)

diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 2e819cac19e5..e595d135ea7b 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -28,6 +28,17 @@
 extern char igc_driver_name[];
 extern char igc_driver_version[];
 
+/* Interrupt defines */
+#define IGC_START_ITR  648 /* ~6000 ints/sec */
+#define IGC_FLAG_HAS_MSI   BIT(0)
+#define IGC_FLAG_QUEUE_PAIRS   BIT(4)
+#define IGC_FLAG_HAS_MSIX  BIT(13)
+
+#define IGC_START_ITR  648 /* ~6000 ints/sec */
+#define IGC_4K_ITR 980
+#define IGC_20K_ITR196
+#define IGC_70K_ITR56
+
 /* Transmit and receive queues */
 #define IGC_MAX_RX_QUEUES  4
 #define IGC_MAX_TX_QUEUES  4
@@ -42,10 +53,96 @@ enum igc_state_t {
__IGC_PTP_TX_IN_PROGRESS,
 };
 
+struct igc_tx_queue_stats {
+   u64 packets;
+   u64 bytes;
+   u64 restart_queue;
+};
+
+struct igc_rx_queue_stats {
+   u64 packets;
+   u64 bytes;
+   u64 drops;
+   u64 csum_err;
+   u64 alloc_failed;
+};
+
+struct igc_rx_packet_stats {
+   u64 ipv4_packets;  /* IPv4 headers processed */
+   u64 ipv4e_packets; /* IPv4E headers with extensions processed */
+   u64 ipv6_packets;  /* IPv6 headers processed */
+   u64 ipv6e_packets; /* IPv6E headers with extensions processed */
+   u64 tcp_packets;   /* TCP headers processed */
+   u64 udp_packets;   /* UDP headers processed */
+   u64 sctp_packets;  /* SCTP headers processed */
u64 nfs_packets;   /* NFS headers processed */
+   u64 other_packets;
+};
+
+struct igc_ring_container {
+   struct igc_ring *ring;  /* pointer to linked list of rings */
+   unsigned int total_bytes;   /* total bytes processed this int */
+   unsigned int total_packets; /* total packets processed this int */
+   u16 work_limit; /* total work allowed per interrupt */
+   u8 count;   /* total number of rings in vector */
+   u8 itr; /* current ITR setting for ring */
+};
+
+struct igc_ring {
+   struct igc_q_vector *q_vector;  /* backlink to q_vector */
+   struct net_device *netdev;  /* back pointer to net_device */
+   struct device *dev; /* device for dma mapping */
+   union { /* array of buffer info structs */
+   struct igc_tx_buffer *tx_buffer_info;
+   struct igc_rx_buffer *rx_buffer_info;
+   };
+   void *desc; /* descriptor ring memory */
+   unsigned long flags;/* ring specific flags */
+   void __iomem *tail; /* pointer to ring tail register */
+   dma_addr_t dma; /* phys address of the ring */
+   unsigned int size;  /* length of desc. ring in bytes */
+
+   u16 count;  /* number of desc. in the ring */
+   u8 queue_index; /* logical index of the ring*/
+   u8 reg_idx; /* physical index of the ring */
+
+   /* everything past this point are written often */
+   u16 next_to_clean;
+   u16 next_to_use;
+   u16 next_to_alloc;
+
+   union {
+   /* TX */
+   struct {
+   struct igc_tx_queue_stats tx_stats;
+   };
+   /* RX */
+   struct {
+   struct igc_rx_queue_stats rx_stats;
+   struct igc_rx_packet_stats pkt_stats;
+   struct sk_buff *skb;
+   };
+   };
+} cacheline_internodealigned_in_smp;
+
 struct igc_q_vector {
struct igc_adapter *adapter;/* backlink */
+   void __iomem *itr_register;
+   u32 eims_value; /* EIMS mask value */
+
+   u16 itr_val;
+   u8 set_itr;
+
+   struct igc_ring_container rx, tx;
 
struct napi_struct napi;
+
+   struct rcu_head rcu;/* to avoid race with update stats on free */
+   char name[IFNAMSIZ + 9];
+   struct net_device poll_dev;
+
+   /* for dynamic allocation of rings associated with this q_vector */
+   struct igc_ring ring[0] cacheline_internodealigned_in_smp;
 };
 
 struct igc_mac_addr {
@@ -65,13 +162,35 @@ struct igc_adapter {
unsigned long state;
unsigned int 

[net-next 10/11] igc: Add setup link functionality

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

Add link establishment methods
Add auto negotiation methods
Add read MAC address method

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/igc.h |   1 +
 drivers/net/ethernet/intel/igc/igc_base.c|  40 +++
 drivers/net/ethernet/intel/igc/igc_defines.h |  38 +++
 drivers/net/ethernet/intel/igc/igc_mac.c | 271 +++
 drivers/net/ethernet/intel/igc/igc_mac.h |   2 +
 drivers/net/ethernet/intel/igc/igc_main.c|  30 ++
 drivers/net/ethernet/intel/igc/igc_phy.c | 334 +++
 drivers/net/ethernet/intel/igc/igc_phy.h |   1 +
 8 files changed, 717 insertions(+)

diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 7cfbd83d25e4..86fa889b4ab6 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -314,6 +314,7 @@ struct igc_adapter {
struct work_struct reset_task;
struct work_struct watchdog_task;
struct work_struct dma_err_task;
+   bool fc_autoneg;
 
u8 tx_timeout_factor;
 
diff --git a/drivers/net/ethernet/intel/igc/igc_base.c b/drivers/net/ethernet/intel/igc/igc_base.c
index 55faef987479..832da609d9a7 100644
--- a/drivers/net/ethernet/intel/igc/igc_base.c
+++ b/drivers/net/ethernet/intel/igc/igc_base.c
@@ -177,6 +177,29 @@ static s32 igc_init_nvm_params_base(struct igc_hw *hw)
return 0;
 }
 
+/**
+ * igc_setup_copper_link_base - Configure copper link settings
+ * @hw: pointer to the HW structure
+ *
+ * Configures the link for auto-neg or forced speed and duplex.  Then we check
+ * for link, once link is established calls to configure collision distance
+ * and flow control are called.
+ */
+static s32 igc_setup_copper_link_base(struct igc_hw *hw)
+{
+   s32  ret_val = 0;
+   u32 ctrl;
+
+   ctrl = rd32(IGC_CTRL);
+   ctrl |= IGC_CTRL_SLU;
+   ctrl &= ~(IGC_CTRL_FRCSPD | IGC_CTRL_FRCDPX);
+   wr32(IGC_CTRL, ctrl);
+
+   ret_val = igc_setup_copper_link(hw);
+
+   return ret_val;
+}
+
 /**
  * igc_init_mac_params_base - Init MAC func ptrs.
  * @hw: pointer to the HW structure
@@ -200,6 +223,9 @@ static s32 igc_init_mac_params_base(struct igc_hw *hw)
if (mac->type == igc_i225)
dev_spec->clear_semaphore_once = true;
 
+   /* physical interface link setup */
+   mac->ops.setup_physical_interface = igc_setup_copper_link_base;
+
return 0;
 }
 
@@ -242,6 +268,8 @@ static s32 igc_init_phy_params_base(struct igc_hw *hw)
if (ret_val)
return ret_val;
 
+   igc_check_for_link_base(hw);
+
/* Verify phy id and set remaining function pointers */
switch (phy->id) {
case I225_I_PHY_ID:
@@ -258,10 +286,22 @@ static s32 igc_init_phy_params_base(struct igc_hw *hw)
 
 static s32 igc_get_invariants_base(struct igc_hw *hw)
 {
+   struct igc_mac_info *mac = &hw->mac;
u32 link_mode = 0;
u32 ctrl_ext = 0;
s32 ret_val = 0;
 
+   switch (hw->device_id) {
+   case IGC_DEV_ID_I225_LM:
+   case IGC_DEV_ID_I225_V:
+   mac->type = igc_i225;
+   break;
+   default:
+   return -IGC_ERR_MAC_INIT;
+   }
+
+   hw->phy.media_type = igc_media_type_copper;
+
ctrl_ext = rd32(IGC_CTRL_EXT);
link_mode = ctrl_ext & IGC_CTRL_EXT_LINK_MODE_MASK;
 
diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h b/drivers/net/ethernet/intel/igc/igc_defines.h
index d271671e6825..70275a0e85d7 100644
--- a/drivers/net/ethernet/intel/igc/igc_defines.h
+++ b/drivers/net/ethernet/intel/igc/igc_defines.h
@@ -13,6 +13,11 @@
 /* Physical Func Reset Done Indication */
 #define IGC_CTRL_EXT_LINK_MODE_MASK0x00C0
 
+/* Loop limit on how long we wait for auto-negotiation to complete */
+#define COPPER_LINK_UP_LIMIT   10
+#define PHY_AUTO_NEG_LIMIT 45
+#define PHY_FORCE_LIMIT20
+
 /* Number of 100 microseconds we wait for PCI Express master disable */
 #define MASTER_DISABLE_TIMEOUT 800
 /*Blocks new Master requests */
@@ -54,6 +59,12 @@
 #define IGC_CTRL_RST   0x0400  /* Global reset */
 
 #define IGC_CTRL_PHY_RST   0x8000  /* PHY Reset */
+#define IGC_CTRL_SLU   0x0040  /* Set link up (Force Link) */
+#define IGC_CTRL_FRCSPD0x0800  /* Force Speed */
+#define IGC_CTRL_FRCDPX0x1000  /* Force Duplex */
+
+#define IGC_CTRL_RFCE  0x0800  /* Receive Flow Control enable */
+#define IGC_CTRL_TFCE  0x1000  /* Transmit flow control enable */
 
 /* PBA constants */
 #define IGC_PBA_34K0x0022
@@ -66,6 +77,29 @@
 #define IGC_SWFW_EEP_SM0x1
 #define IGC_SWFW_PHY0_SM   0x2
 
+/* Autoneg Advertisement Register */
+#define NWAY_AR_10T_HD_CAPS 0x0020   /* 10T   Half Duplex Capable */
+#define NWAY_AR_10T_FD_CAPS 0x0040   

[net-next 11/11] igc: Add watchdog

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

Code completion, remove obsolete code
Add watchdog methods

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/igc.h |  12 +
 drivers/net/ethernet/intel/igc/igc_defines.h |  11 +
 drivers/net/ethernet/intel/igc/igc_hw.h  |   1 +
 drivers/net/ethernet/intel/igc/igc_main.c| 232 +++
 4 files changed, 256 insertions(+)

diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 86fa889b4ab6..cdf18a5d9e08 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -33,6 +33,8 @@ extern char igc_driver_version[];
 #define IGC_FLAG_HAS_MSI   BIT(0)
 #define IGC_FLAG_QUEUE_PAIRS   BIT(4)
 #define IGC_FLAG_NEED_LINK_UPDATE  BIT(9)
+#define IGC_FLAG_MEDIA_RESET   BIT(10)
+#define IGC_FLAG_MAS_ENABLEBIT(12)
 #define IGC_FLAG_HAS_MSIX  BIT(13)
 #define IGC_FLAG_VLAN_PROMISC  BIT(15)
 
@@ -290,6 +292,7 @@ struct igc_adapter {
 
/* TX */
u16 tx_work_limit;
+   u32 tx_timeout_count;
int num_tx_queues;
struct igc_ring *tx_ring[IGC_MAX_TX_QUEUES];
 
@@ -348,6 +351,7 @@ struct igc_adapter {
 
struct igc_mac_addr *mac_table;
 
+   unsigned long link_check_timeout;
struct igc_info ei;
 };
 
@@ -417,6 +421,14 @@ static inline unsigned int igc_rx_pg_order(struct igc_ring *ring)
return 0;
 }
 
+static inline s32 igc_read_phy_reg(struct igc_hw *hw, u32 offset, u16 *data)
+{
+   if (hw->phy.ops.read_reg)
+   return hw->phy.ops.read_reg(hw, offset, data);
+
+   return 0;
+}
+
 #define igc_rx_pg_size(_ring) (PAGE_SIZE << igc_rx_pg_order(_ring))
 
 #define IGC_TXD_DCMD   (IGC_ADVTXD_DCMD_EOP | IGC_ADVTXD_DCMD_RS)
diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h b/drivers/net/ethernet/intel/igc/igc_defines.h
index 70275a0e85d7..8740754ea1fd 100644
--- a/drivers/net/ethernet/intel/igc/igc_defines.h
+++ b/drivers/net/ethernet/intel/igc/igc_defines.h
@@ -66,6 +66,8 @@
 #define IGC_CTRL_RFCE  0x0800  /* Receive Flow Control enable */
 #define IGC_CTRL_TFCE  0x1000  /* Transmit flow control enable */
 
+#define IGC_CONNSW_AUTOSENSE_EN 0x1
+
 /* PBA constants */
 #define IGC_PBA_34K0x0022
 
@@ -94,6 +96,10 @@
 #define CR_1000T_HD_CAPS   0x0100 /* Advertise 1000T HD capability */
 #define CR_1000T_FD_CAPS   0x0200 /* Advertise 1000T FD capability  */
 
+/* 1000BASE-T Status Register */
+#define SR_1000T_REMOTE_RX_STATUS  0x1000 /* Remote receiver OK */
+#define SR_1000T_LOCAL_RX_STATUS   0x2000 /* Local receiver OK */
+
 /* PHY GPY 211 registers */
 #define STANDARD_AN_REG_MASK   0x0007 /* MMD */
 #define ANEG_MULTIGBT_AN_CTRL  0x0020 /* MULTI GBT AN Control Register */
@@ -210,6 +216,11 @@
 #define IGC_QVECTOR_MASK   0x7FFC  /* Q-vector mask */
 #define IGC_ITR_VAL_MASK   0x04/* ITR value mask */
 
+/* Interrupt Cause Set */
+#define IGC_ICS_LSC    IGC_ICR_LSC   /* Link Status Change */
+#define IGC_ICS_RXDMT0 IGC_ICR_RXDMT0/* rx desc min. threshold */
+#define IGC_ICS_DRSTA  IGC_ICR_DRSTA /* Device Reset Asserted */
+
 #define IGC_ICR_DOUTSYNC   0x1000 /* NIC DMA out of sync */
 #define IGC_EITR_CNT_IGNR  0x8000 /* Don't reset counters on write */
 #define IGC_IVAR_VALID 0x80
diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h
index 65d1446ff0c3..c50414f48f0d 100644
--- a/drivers/net/ethernet/intel/igc/igc_hw.h
+++ b/drivers/net/ethernet/intel/igc/igc_hw.h
@@ -197,6 +197,7 @@ struct igc_dev_spec_base {
bool clear_semaphore_once;
bool module_plugged;
u8 media_port;
+   bool mas_capable;
 };
 
 struct igc_hw {
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index e1a078e084f0..9d85707e8a81 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -1743,6 +1743,7 @@ static void igc_up(struct igc_adapter *adapter)
 
/* start the watchdog. */
hw->mac.get_link_status = 1;
+   schedule_work(&adapter->watchdog_task);
 }
 
 /**
@@ -2297,6 +2298,55 @@ static void igc_free_q_vector(struct igc_adapter *adapter, int v_idx)
kfree_rcu(q_vector, rcu);
 }
 
+/* Need to wait a few seconds after link up to get diagnostic information from
+ * the phy
+ */
+static void igc_update_phy_info(struct timer_list *t)
+{
+   struct igc_adapter *adapter = from_timer(adapter, t, phy_info_timer);
+
+   igc_get_phy_info(&adapter->hw);
+}
+
+/**
+ * igc_has_link - check shared code for link and determine up/down
+ * @adapter: pointer to driver private info
+ */
+static bool igc_has_link(struct igc_adapter *adapter)
+{
+   struct igc_hw *hw = &adapter->hw;
+   bool link_active = false;
+
+  

[net-next 02/11] igc: Add support for PF

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

This patch adds the basic defines and structures needed by the PF for
operation. With this it is possible to bring up the interface,
but without being able to configure any of the filters on
the interface itself.
Add a skeleton for the function pointers.

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/Makefile  |   2 +-
 drivers/net/ethernet/intel/igc/igc.h |  13 ++
 drivers/net/ethernet/intel/igc/igc_defines.h |  30 +++
 drivers/net/ethernet/intel/igc/igc_hw.h  |  82 
 drivers/net/ethernet/intel/igc/igc_i225.h|  10 +
 drivers/net/ethernet/intel/igc/igc_mac.c |   5 +
 drivers/net/ethernet/intel/igc/igc_mac.h |  11 ++
 drivers/net/ethernet/intel/igc/igc_main.c|  98 ++
 drivers/net/ethernet/intel/igc/igc_regs.h| 192 +++
 9 files changed, 442 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/igc/igc_defines.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_i225.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_mac.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_mac.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_regs.h

diff --git a/drivers/net/ethernet/intel/igc/Makefile b/drivers/net/ethernet/intel/igc/Makefile
index 3d13b015d401..06e0b9e23a8c 100644
--- a/drivers/net/ethernet/intel/igc/Makefile
+++ b/drivers/net/ethernet/intel/igc/Makefile
@@ -7,4 +7,4 @@
 
 obj-$(CONFIG_IGC) += igc.o
 
-igc-objs := igc_main.o
+igc-objs := igc_main.o igc_mac.o
diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index afe595cfcf63..481b2ee694fa 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -22,8 +22,21 @@
 #include 
 #include 
 
+#include "igc_hw.h"
+
 /* main */
 extern char igc_driver_name[];
 extern char igc_driver_version[];
 
+/* Board specific private data structure */
+struct igc_adapter {
+   u8 __iomem *io_addr;
+
+   /* OS defined structs */
+   struct pci_dev *pdev;
+
+   /* structs defined in igc_hw.h */
+   struct igc_hw hw;
+};
+
 #endif /* _IGC_H_ */
diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h b/drivers/net/ethernet/intel/igc/igc_defines.h
new file mode 100644
index ..d19dff1d6b74
--- /dev/null
+++ b/drivers/net/ethernet/intel/igc/igc_defines.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c)  2018 Intel Corporation */
+
+#ifndef _IGC_DEFINES_H_
+#define _IGC_DEFINES_H_
+
+/* PCI Bus Info */
+#define PCIE_DEVICE_CONTROL2   0x28
+#define PCIE_DEVICE_CONTROL2_16ms  0x0005
+
+/* Error Codes */
+#define IGC_SUCCESS0
+#define IGC_ERR_NVM1
+#define IGC_ERR_PHY2
+#define IGC_ERR_CONFIG 3
+#define IGC_ERR_PARAM  4
+#define IGC_ERR_MAC_INIT   5
+#define IGC_ERR_RESET  9
+
+/* Device Status */
+#define IGC_STATUS_FD  0x0001  /* Full duplex.0=half,1=full */
+#define IGC_STATUS_LU  0x0002  /* Link up.0=no,1=link */
+#define IGC_STATUS_FUNC_MASK   0x000C  /* PCI Function Mask */
+#define IGC_STATUS_FUNC_SHIFT  2
+#define IGC_STATUS_FUNC_1  0x0004  /* Function 1 */
+#define IGC_STATUS_TXOFF   0x0010  /* transmission paused */
+#define IGC_STATUS_SPEED_100   0x0040  /* Speed 100Mb/s */
+#define IGC_STATUS_SPEED_1000  0x0080  /* Speed 1000Mb/s */
+
+#endif /* _IGC_DEFINES_H_ */
diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h
index aa68b4516700..84b6067a2476 100644
--- a/drivers/net/ethernet/intel/igc/igc_hw.h
+++ b/drivers/net/ethernet/intel/igc/igc_hw.h
@@ -4,7 +4,89 @@
 #ifndef _IGC_HW_H_
 #define _IGC_HW_H_
 
+#include 
+#include 
+#include "igc_regs.h"
+#include "igc_defines.h"
+#include "igc_mac.h"
+#include "igc_i225.h"
+
 #define IGC_DEV_ID_I225_LM 0x15F2
 #define IGC_DEV_ID_I225_V  0x15F3
 
+/* Function pointers for the MAC. */
+struct igc_mac_operations {
+};
+
+enum igc_mac_type {
+   igc_undefined = 0,
+   igc_i225,
+   igc_num_macs  /* List is 1-based, so subtract 1 for true count. */
+};
+
+enum igc_phy_type {
+   igc_phy_unknown = 0,
+   igc_phy_none,
+   igc_phy_i225,
+};
+
+struct igc_mac_info {
+   struct igc_mac_operations ops;
+
+   u8 addr[ETH_ALEN];
+   u8 perm_addr[ETH_ALEN];
+
+   enum igc_mac_type type;
+
+   u32 collision_delta;
+   u32 ledctl_default;
+   u32 ledctl_mode1;
+   u32 ledctl_mode2;
+   u32 mc_filter_type;
+   u32 tx_packet_delta;
+   u32 txcw;
+
+   u16 mta_reg_count;
+   u16 uta_reg_count;
+
+   u16 rar_entry_count;
+
+   u8 forced_speed_duplex;
+
+   bool adaptive_ifs;
+   bool has_fwsm;
+   bool arc_subsystem_valid;
+
+   

[net-next 09/11] igc: Add code for PHY support

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

Add PHY's ID support
Add support for initialization, acquire and release of PHY
Enable register access

Signed-off-by: Sasha Neftin 
Signed-off-by: Alexander Duyck 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/Makefile  |   2 +-
 drivers/net/ethernet/intel/igc/igc.h |  16 +
 drivers/net/ethernet/intel/igc/igc_base.c| 122 +
 drivers/net/ethernet/intel/igc/igc_base.h|   1 +
 drivers/net/ethernet/intel/igc/igc_defines.h |  79 
 drivers/net/ethernet/intel/igc/igc_hw.h  |  54 +++
 drivers/net/ethernet/intel/igc/igc_mac.c |  45 ++
 drivers/net/ethernet/intel/igc/igc_mac.h |  11 +
 drivers/net/ethernet/intel/igc/igc_main.c|  11 +
 drivers/net/ethernet/intel/igc/igc_phy.c | 457 +++
 drivers/net/ethernet/intel/igc/igc_phy.h |  20 +
 drivers/net/ethernet/intel/igc/igc_regs.h|   3 +
 12 files changed, 820 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/igc/igc_phy.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_phy.h

diff --git a/drivers/net/ethernet/intel/igc/Makefile b/drivers/net/ethernet/intel/igc/Makefile
index 2b5378d96c7b..4387f6ba8e67 100644
--- a/drivers/net/ethernet/intel/igc/Makefile
+++ b/drivers/net/ethernet/intel/igc/Makefile
@@ -7,4 +7,4 @@
 
 obj-$(CONFIG_IGC) += igc.o
 
-igc-objs := igc_main.o igc_mac.o igc_i225.o igc_base.o igc_nvm.o
+igc-objs := igc_main.o igc_mac.o igc_i225.o igc_base.o igc_nvm.o igc_phy.o
diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 6dcf51c112f4..7cfbd83d25e4 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -359,6 +359,22 @@ static inline u16 igc_desc_unused(const struct igc_ring *ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
+static inline s32 igc_get_phy_info(struct igc_hw *hw)
+{
+   if (hw->phy.ops.get_phy_info)
+   return hw->phy.ops.get_phy_info(hw);
+
+   return 0;
+}
+
+static inline s32 igc_reset_phy(struct igc_hw *hw)
+{
+   if (hw->phy.ops.reset)
+   return hw->phy.ops.reset(hw);
+
+   return 0;
+}
+
 static inline struct netdev_queue *txring_txq(const struct igc_ring *tx_ring)
 {
return netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
diff --git a/drivers/net/ethernet/intel/igc/igc_base.c 
b/drivers/net/ethernet/intel/igc/igc_base.c
index 2d49814966d3..55faef987479 100644
--- a/drivers/net/ethernet/intel/igc/igc_base.c
+++ b/drivers/net/ethernet/intel/igc/igc_base.c
@@ -123,6 +123,22 @@ static s32 igc_reset_hw_base(struct igc_hw *hw)
return ret_val;
 }
 
+/**
+ * igc_get_phy_id_base - Retrieve PHY addr and id
+ * @hw: pointer to the HW structure
+ *
+ * Retrieves the PHY address and ID for both PHY's which do and do not use
+ * sgmi interface.
+ */
+static s32 igc_get_phy_id_base(struct igc_hw *hw)
+{
+   s32  ret_val = 0;
+
+   ret_val = igc_get_phy_id(hw);
+
+   return ret_val;
+}
+
 /**
  * igc_init_nvm_params_base - Init NVM func ptrs.
  * @hw: pointer to the HW structure
@@ -187,6 +203,59 @@ static s32 igc_init_mac_params_base(struct igc_hw *hw)
return 0;
 }
 
+/**
+ * igc_init_phy_params_base - Init PHY func ptrs.
+ * @hw: pointer to the HW structure
+ */
+static s32 igc_init_phy_params_base(struct igc_hw *hw)
+{
+   struct igc_phy_info *phy = &hw->phy;
+   s32 ret_val = 0;
+   u32 ctrl_ext;
+
+   if (hw->phy.media_type != igc_media_type_copper) {
+   phy->type = igc_phy_none;
+   goto out;
+   }
+
+   phy->autoneg_mask   = AUTONEG_ADVERTISE_SPEED_DEFAULT_2500;
+   phy->reset_delay_us = 100;
+
+   ctrl_ext = rd32(IGC_CTRL_EXT);
+
+   /* set lan id */
+   hw->bus.func = (rd32(IGC_STATUS) & IGC_STATUS_FUNC_MASK) >>
+   IGC_STATUS_FUNC_SHIFT;
+
+   /* Make sure the PHY is in a good state. Several people have reported
+* firmware leaving the PHY's page select register set to something
+* other than the default of zero, which causes the PHY ID read to
+* access something other than the intended register.
+*/
+   ret_val = hw->phy.ops.reset(hw);
+   if (ret_val) {
+   hw_dbg("Error resetting the PHY.\n");
+   goto out;
+   }
+
+   ret_val = igc_get_phy_id_base(hw);
+   if (ret_val)
+   return ret_val;
+
+   /* Verify phy id and set remaining function pointers */
+   switch (phy->id) {
+   case I225_I_PHY_ID:
+   phy->type   = igc_phy_i225;
+   break;
+   default:
+   ret_val = -IGC_ERR_PHY;
+   goto out;
+   }
+
+out:
+   return ret_val;
+}
+
 static s32 igc_get_invariants_base(struct igc_hw *hw)
 {
u32 link_mode = 0;
@@ -211,6 +280,8 @@ static s32 igc_get_invariants_base(struct igc_hw *hw)
break;
}
 

[net-next 03/11] igc: Add netdev

2018-10-17 Thread Jeff Kirsher
From: Sasha Neftin 

Now that we have the ability to configure the basic settings on the device
we can start allocating and configuring a netdev for the interface.

Signed-off-by: Sasha Neftin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igc/igc.h |  48 ++
 drivers/net/ethernet/intel/igc/igc_defines.h |  15 +
 drivers/net/ethernet/intel/igc/igc_hw.h  |   1 +
 drivers/net/ethernet/intel/igc/igc_main.c| 471 ++-
 4 files changed, 534 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igc/igc.h 
b/drivers/net/ethernet/intel/igc/igc.h
index 481b2ee694fa..2e819cac19e5 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -28,15 +28,63 @@
 extern char igc_driver_name[];
 extern char igc_driver_version[];
 
+/* Transmit and receive queues */
+#define IGC_MAX_RX_QUEUES  4
+#define IGC_MAX_TX_QUEUES  4
+
+#define MAX_Q_VECTORS  8
+#define MAX_STD_JUMBO_FRAME_SIZE   9216
+
+enum igc_state_t {
+   __IGC_TESTING,
+   __IGC_RESETTING,
+   __IGC_DOWN,
+   __IGC_PTP_TX_IN_PROGRESS,
+};
+
+struct igc_q_vector {
+   struct igc_adapter *adapter;/* backlink */
+
+   struct napi_struct napi;
+};
+
+struct igc_mac_addr {
+   u8 addr[ETH_ALEN];
+   u8 queue;
+   u8 state; /* bitmask */
+};
+
+#define IGC_MAC_STATE_DEFAULT  0x1
+#define IGC_MAC_STATE_MODIFIED 0x2
+#define IGC_MAC_STATE_IN_USE   0x4
+
 /* Board specific private data structure */
 struct igc_adapter {
+   struct net_device *netdev;
+
+   unsigned long state;
+   unsigned int flags;
+   unsigned int num_q_vectors;
+   u16 link_speed;
+   u16 link_duplex;
+
+   u8 port_num;
+
u8 __iomem *io_addr;
+   struct work_struct watchdog_task;
+
+   int msg_enable;
+   u32 max_frame_size;
 
/* OS defined structs */
struct pci_dev *pdev;
 
/* structs defined in igc_hw.h */
struct igc_hw hw;
+
+   struct igc_q_vector *q_vector[MAX_Q_VECTORS];
+
+   struct igc_mac_addr *mac_table;
 };
 
 #endif /* _IGC_H_ */
diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h 
b/drivers/net/ethernet/intel/igc/igc_defines.h
index d19dff1d6b74..c25f75ed9cd4 100644
--- a/drivers/net/ethernet/intel/igc/igc_defines.h
+++ b/drivers/net/ethernet/intel/igc/igc_defines.h
@@ -4,10 +4,22 @@
 #ifndef _IGC_DEFINES_H_
 #define _IGC_DEFINES_H_
 
+#define IGC_CTRL_EXT_DRV_LOAD  0x1000 /* Drv loaded bit for FW */
+
 /* PCI Bus Info */
 #define PCIE_DEVICE_CONTROL2   0x28
 #define PCIE_DEVICE_CONTROL2_16ms  0x0005
 
+/* Receive Address
+ * Number of high/low register pairs in the RAR. The RAR (Receive Address
+ * Registers) holds the directed and multicast addresses that we monitor.
+ * Technically, we have 16 spots.  However, we reserve one of these spots
+ * (RAR[15]) for our directed address used by controllers with
+ * manageability enabled, allowing us room for 15 multicast addresses.
+ */
+#define IGC_RAH_AV 0x8000 /* Receive descriptor valid */
+#define IGC_RAH_POOL_1 0x0004
+
 /* Error Codes */
 #define IGC_SUCCESS0
 #define IGC_ERR_NVM1
@@ -17,6 +29,9 @@
 #define IGC_ERR_MAC_INIT   5
 #define IGC_ERR_RESET  9
 
+/* PBA constants */
+#define IGC_PBA_34K0x0022
+
 /* Device Status */
 #define IGC_STATUS_FD  0x0001  /* Full duplex.0=half,1=full */
 #define IGC_STATUS_LU  0x0002  /* Link up.0=no,1=link */
diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h 
b/drivers/net/ethernet/intel/igc/igc_hw.h
index 84b6067a2476..4cac2e8868e0 100644
--- a/drivers/net/ethernet/intel/igc/igc_hw.h
+++ b/drivers/net/ethernet/intel/igc/igc_hw.h
@@ -59,6 +59,7 @@ struct igc_mac_info {
 
bool autoneg;
bool autoneg_failed;
+   bool get_link_status;
 };
 
 struct igc_bus_info {
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c 
b/drivers/net/ethernet/intel/igc/igc_main.c
index 6a881753f5ce..7c5b0d2f16bf 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -3,6 +3,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 #include "igc.h"
 #include "igc_hw.h"
@@ -10,10 +12,14 @@
 #define DRV_VERSION"0.0.1-k"
 #define DRV_SUMMARY"Intel(R) 2.5G Ethernet Linux Driver"
 
+static int debug = -1;
+
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION(DRV_SUMMARY);
 MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
+module_param(debug, int, 0);
+MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
 char igc_driver_name[] = "igc";
 char igc_driver_version[] = DRV_VERSION;
@@ -32,6 +38,364 @@ MODULE_DEVICE_TABLE(pci, igc_pci_tbl);
 
 /* forward declaration */
 static int igc_sw_init(struct igc_adapter *);
+static void igc_configure(struct igc_adapter *adapter);
+static void 

[net-next 00/11][pull request] 1GbE Intel Wired LAN Driver Updates 2018-10-17

2018-10-17 Thread Jeff Kirsher
This series adds support for the new igc driver.

The igc driver is the new client driver supporting the Intel I225
Ethernet Controller, which supports 2.5GbE speeds.  The reason for
creating a new client driver, instead of adding support for the new
device in e1000e, is that the silicon behaves more like devices
supported in igb driver.  It also did not make sense to add a client
part, to the igb driver which supports only 1GbE server parts.

This initial set of patches is designed for basic support (i.e. link and
pass traffic).  Follow-on patch series will add more advanced support
like VLAN, Wake-on-LAN, etc..

The following are changes since commit aadd4355918fe6e9044a9042fa5968e0a0901681:
  tcp, ulp: remove socket lock assertion on ULP cleanup
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 1GbE

Sasha Neftin (11):
  igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller support
  igc: Add support for PF
  igc: Add netdev
  igc: Add interrupt support
  igc: Add support for Tx/Rx rings
  igc: Add transmit and receive fastpath and interrupt handlers
  igc: Add HW initialization code
  igc: Add NVM support
  igc: Add code for PHY support
  igc: Add setup link functionality
  igc: Add watchdog

 drivers/net/ethernet/intel/Kconfig   |   16 +
 drivers/net/ethernet/intel/Makefile  |1 +
 drivers/net/ethernet/intel/igc/Makefile  |   10 +
 drivers/net/ethernet/intel/igc/igc.h |  443 ++
 drivers/net/ethernet/intel/igc/igc_base.c|  541 +++
 drivers/net/ethernet/intel/igc/igc_base.h|  107 +
 drivers/net/ethernet/intel/igc/igc_defines.h |  389 ++
 drivers/net/ethernet/intel/igc/igc_hw.h  |  321 ++
 drivers/net/ethernet/intel/igc/igc_i225.c|  490 +++
 drivers/net/ethernet/intel/igc/igc_i225.h|   13 +
 drivers/net/ethernet/intel/igc/igc_mac.c |  806 
 drivers/net/ethernet/intel/igc/igc_mac.h |   41 +
 drivers/net/ethernet/intel/igc/igc_main.c| 3901 ++
 drivers/net/ethernet/intel/igc/igc_nvm.c |  215 +
 drivers/net/ethernet/intel/igc/igc_nvm.h |   14 +
 drivers/net/ethernet/intel/igc/igc_phy.c |  791 
 drivers/net/ethernet/intel/igc/igc_phy.h |   21 +
 drivers/net/ethernet/intel/igc/igc_regs.h|  221 +
 18 files changed, 8341 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/igc/Makefile
 create mode 100644 drivers/net/ethernet/intel/igc/igc.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_base.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_base.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_defines.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_hw.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_i225.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_i225.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_mac.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_mac.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_main.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_nvm.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_nvm.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_phy.c
 create mode 100644 drivers/net/ethernet/intel/igc/igc_phy.h
 create mode 100644 drivers/net/ethernet/intel/igc/igc_regs.h

-- 
2.17.2



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Alexei Starovoitov
On 10/17/18 2:31 PM, Arnaldo Carvalho de Melo wrote:
>
> Keep all that info in a file, as I described above. Or keep it for a
> while, to give that thread in userspace time to get it and tell the
> kernel that it can trow it away.

stashing by the kernel into a file is a huge headache, since the format of
the file becomes kernel ABI.
Plus why have it in two places (in a file and in normal kernel data
structures)?

> It may well be that most of the time the 'perf record' thread catching
> those events picks that up and saves it in
> /var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded,
> no?

Whether perf user space stashes that info into perf.data as
synthetic records or stashes it somewhere in /var/tmp/ sound about
equivalent to me. Both have their pros and cons. This is certainly
a good topic to discuss further.

But asking kernel to keep JITed images and all relevant bpf data
after program has been unloaded sounds really scary to me.
I'm struggling to think through the security implications of that.
How long is the kernel supposed to keep it? Some timeout?

As I explained already, the time it takes for perf to do a
_single_ get_fd_by_id syscall when it sees RECORD_MMAP with prog_id
is pretty instant.
All other syscalls to grab JITed image and everything else can
be done later. The prog will not go away because perf will hold an fd.
If prog was somehow unloaded before perf could do get_fd_by_id
no one cares about such programs, since there is close to zero
chance that this program was attached to anything and absolutely
no chance that it ran.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 17, 2018 at 07:08:37PM +, Alexei Starovoitov escreveu:
> On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu:
> >> On 10/17/18 8:09 AM, David Ahern wrote:
> >>> On 10/16/18 11:43 PM, Song Liu wrote:
>  I agree that processing events while recording has significant overhead.
>  In this case, perf user space needs to know details about the jited BPF
>  program. It is impossible to pass all these details to user space through
>  the relatively stable ring_buffer API. Therefore, some processing of the
>  data is necessary (get bpf prog_id from ring buffer, and then fetch
>  program details via BPF_OBJ_GET_INFO_BY_FD).
> 
>  I have some idea on processing important data with relatively low 
>  overhead.
>  Let me try implement it.
> 
> >>>
> >>> As I understand it, you want this series:
> >>>
> >>>  kernel: add event to perf buffer on bpf prog load
> >>>
> >>>  userspace: perf reads the event and grabs information about the program
> >>> from the fd
> >>>
> >>> Is that correct?
> >>>
> >>> Userspace is not awakened immediately when an event is added to the
> >>> ring. It is awakened once the number of events crosses a watermark. That
> >>> means there is an unknown - and potentially long - time window where the
> >>> program can be unloaded before perf reads the event.
> >
> >>> So no matter what you do expecting perf record to be able to process the
> >>> event quickly is an unreasonable expectation.
> >
> >> yes... unless we go with threaded model as Arnaldo suggested and use
> >> single event as a watermark to wakeup our perf thread.
> >> In such case there is still a race window between user space waking up
> >> and doing _single_ bpf_get_fd_from_id() call to hold that prog
> >> and some other process trying to instantly unload the prog it
> >> just loaded.
> >> I think such race window is extremely tiny and if perf misses
> >> those load/unload events it's a good thing, since there is no chance
> >> that normal pmu event samples would be happening during prog execution.

> >> The alternative approach with no race window at all is to burden kernel
> >> RECORD_* events with _all_ information about bpf prog. Which is jited
> >> addresses, jited image itself, info about all subprogs, info about line
> >> info, all BTF data, etc. As I said earlier I'm strongly against such
> >> RECORD_* bloating.
> >> Instead we need to find a way to process new RECORD_BPF events with
> >> single prog_id field in perf user space with minimal race
> >> and threaded approach sounds like a win to me.

> > There is another alternative, I think: put just a content based hash,
> > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
> > validator does the jit, etc, it stashes the content that
> > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
> > the kernel right after getting stuff from sys_bpf() and preparing it for
> > use, then we know that in (start, end) we have blob foo with content id,
> > that we will use to retrieve information that augments what we know with
> > just (start, end, id) and allows annotation, etc.

> > That stash space for jitted stuff gets garbage collected from time to
> > time or is even completely disabled if the user is not interested in
> > such augmentation, just like one can do disabling perf's ~/.debug/
> > directory hashed by build-id.

> > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
> > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
> > cookie and we solve the other race we already have with kernel modules,
> > DSOs, etc.

> > I have mentioned this before, there were objections, perhaps this time I
> > formulated in a different way that makes it more interesting?
 
> that 'content based hash' we already have. It's called program tag.

But that was calculated by whom? Userspace? It can't do that; it's the
kernel that ultimately puts together, from what userspace gave it, what
we need to do performance analysis, line numbers, etc.

> and we already taught iovisor/bcc to stash that stuff into
> /var/tmp/bcc/bpf_prog_TAG/ directory.
> Unfortunately that approach didn't work.
> JITed image only exists in the kernel. It's there only when

That is why I said that _the_ kernel stashes that thing, not bcc/iovisor
or perf, the kernel calculates the hash, and it also puts that into the
PERF_RECORD_MMAP3, so the tool sees it and goes to get it from the place
the kernel stashed it.

> program is loaded and it's the one that matter the most for performance
> analysis, since sample IPs are pointing into it.

Agreed

> Also the line info mapping that user space knows is not helping much
> either, since verifier optimizes the instructions and then JIT
> does more. The debug_info <-> JIT relationship must be preserved
> by the kernel and returned to user space.


Re: [PATCH net] sctp: fix race on sctp_id2asoc

2018-10-17 Thread Neil Horman
On Tue, Oct 16, 2018 at 03:18:17PM -0300, Marcelo Ricardo Leitner wrote:
> syzbot reported an use-after-free involving sctp_id2asoc.  Dmitry Vyukov
> helped to root cause it and it is because of reading the asoc after it
> was freed:
> 
> CPU 1                               CPU 2
> (working on socket 1)               (working on socket 2)
>                                     sctp_association_destroy
> sctp_id2asoc
>   spin lock
>     grab the asoc from idr
>   spin unlock
>                                     spin lock
>                                     remove asoc from idr
>                                     spin unlock
>                                     free(asoc)
> if asoc->base.sk != sk ... [*]
> 
> This can only be hit if trying to fetch asocs from different sockets. As
> we have a single IDR for all asocs, in all SCTP sockets, their id is
> unique on the system. An application can try to send stuff on an id
> that matches on another socket, and the if in [*] will protect from such
> usage. But it didn't consider that as that asoc may belong to another
> socket, it may be freed in parallel (read: under another socket lock).
> 
> We fix it by moving the checks in [*] into the protected region. This
> fixes it because the asoc cannot be freed while the lock is held.
> 
> Reported-by: syzbot+c7dd55d7aec49d48e...@syzkaller.appspotmail.com
> Acked-by: Dmitry Vyukov 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/socket.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 
> f73e9d38d5ba734d7ee3347e4015fd30d355bbfa..a7722f43aa69801c31409d4914c99946ee5533f5
>  100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -271,11 +271,10 @@ struct sctp_association *sctp_id2assoc(struct sock *sk, 
> sctp_assoc_t id)
>  
>   spin_lock_bh(&sctp_assocs_id_lock);
>   asoc = (struct sctp_association *)idr_find(&sctp_assocs_id, (int)id);
> + if (asoc && (asoc->base.sk != sk || asoc->base.dead))
> + asoc = NULL;
>   spin_unlock_bh(&sctp_assocs_id_lock);
>  
> - if (!asoc || (asoc->base.sk != sk) || asoc->base.dead)
> - return NULL;
> -
>   return asoc;
>  }
>  
> -- 
> 2.17.1
> 
> 
Acked-by: Neil Horman 



Re: [PATCH linux-firmware] linux-firmware: liquidio: fix GPL compliance issue

2018-10-17 Thread John W. Linville
On Wed, Oct 17, 2018 at 07:34:42PM +, Manlunas, Felix wrote:
> On Fri, Sep 28, 2018 at 04:50:51PM -0700, Felix Manlunas wrote:
> > Part of the code inside the lio_vsw_23xx.bin firmware image is under GPL,
> > but the LICENCE.cavium file neglects to indicate that.  However,
> > LICENCE.cavium does correctly specify the license that covers the other
> > Cavium firmware images that do not contain any GPL code.
> > 
> > Fix the GPL compliance issue by adding a new file, LICENCE.cavium_liquidio,
> > which correctly shows the GPL boilerplate.  This new file specifies the
> > licenses for all liquidio firmware, including the ones that do not have
> > GPL code.
> > 
> > Change the liquidio section of WHENCE to point to LICENCE.cavium_liquidio.
> > 
> > Reported-by: Florian Weimer 
> > Signed-off-by: Manish Awasthi 
> > Signed-off-by: Manoj Panicker 
> > Signed-off-by: Faisal Masood 
> > Signed-off-by: Felix Manlunas 
> > ---
> >  LICENCE.cavium_liquidio | 429 
> > 
> >  WHENCE  |   2 +-
> >  2 files changed, 430 insertions(+), 1 deletion(-)
> >  create mode 100644 LICENCE.cavium_liquidio
> 
> Hello Maintainers of linux-firmware.git,
> 
> Any feedback about this patch?

I would prefer to see an offer that included a defined URL for anyone
to download the source for the kernel in question without having to
announce themselves. The "send an email to i...@cavium.com" offer may
(or may not) be sufficient for the letter of the law. But it seems
both fragile and prone to subjective frustrations and delays for
users to obtain the sources at some future date.

Respectfully,

John
-- 
John W. Linville        Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [PATCH 2/2] net: emac: implement TCP TSO

2018-10-17 Thread Florian Fainelli
On 10/17/2018 12:53 PM, Christian Lamparter wrote:
> This patch enables TSO(v4) hw feature for emac driver.
> At least the APM82181's TCP/IP acceleration hardware
> controller (TAH) provides TCP segmentation support in
> the transmit path.
> 
> Signed-off-by: Christian Lamparter 
> ---
>  drivers/net/ethernet/ibm/emac/core.c | 101 ++-
>  drivers/net/ethernet/ibm/emac/core.h |   4 ++
>  drivers/net/ethernet/ibm/emac/emac.h |   7 ++
>  drivers/net/ethernet/ibm/emac/tah.c  |  20 ++
>  drivers/net/ethernet/ibm/emac/tah.h  |   2 +
>  5 files changed, 133 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
> b/drivers/net/ethernet/ibm/emac/core.c
> index be560f9031f4..49ffbd6e1707 100644
> --- a/drivers/net/ethernet/ibm/emac/core.c
> +++ b/drivers/net/ethernet/ibm/emac/core.c
> @@ -38,6 +38,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1410,6 +1413,52 @@ static inline u16 emac_tx_csum(struct emac_instance 
> *dev,
>   return 0;
>  }
>  
> +const u32 tah_ss[TAH_NO_SSR] = { 9000, 4500, 1500, 1300, 576, 176 };
> +
> +static int emac_tx_tso(struct emac_instance *dev, struct sk_buff *skb,
> +u16 *ctrl)
> +{
> + if (emac_has_feature(dev, EMAC_FTR_TAH_HAS_TSO) &&
> + skb_is_gso(skb) && !!(skb_shinfo(skb)->gso_type &
> + (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
> + u32 seg_size = 0, i;
> +
> + /* Get the MTU */
> + seg_size = skb_shinfo(skb)->gso_size + tcp_hdrlen(skb)
> + + skb_network_header_len(skb);
> +
> + /* Restriction applied for the segmentation size
> +  * to use HW segmentation offload feature: the size
> +  * of the segment must not be less than 168 bytes for
> +  * DIX formatted segments, or 176 bytes for
> +  * IEEE formatted segments.
> +  *
> +  * I use value 176 to check for the segment size here
> +  * as it can cover both 2 conditions above.
> +  */
> + if (seg_size < 176)
> + return -ENODEV;
> +
> + /* Get the best suitable MTU */
> + for (i = 0; i < ARRAY_SIZE(tah_ss); i++) {
> + u32 curr_seg = tah_ss[i];
> +
> + if (curr_seg > dev->ndev->mtu ||
> + curr_seg > seg_size)
> + continue;
> +
> + *ctrl &= ~EMAC_TX_CTRL_TAH_CSUM;
> + *ctrl |= EMAC_TX_CTRL_TAH_SSR(i);
> + return 0;

This is something that you can possibly take out of your hot path and
recalculate when the MTU actually changes?

[snip]

> +static netdev_tx_t emac_sw_tso(struct sk_buff *skb, struct net_device *ndev)
> +{
> + struct emac_instance *dev = netdev_priv(ndev);
> + struct sk_buff *segs, *curr;
> +
> + segs = skb_gso_segment(skb, ndev->features &
> + ~(NETIF_F_TSO | NETIF_F_TSO6));
> + if (IS_ERR_OR_NULL(segs)) {
> + goto drop;
> + } else {
> + while (segs) {
> + /* check for overflow */
> + if (dev->tx_cnt >= NUM_TX_BUFF) {
> + dev_kfree_skb_any(segs);
> + goto drop;
> + }

Would setting dev->max_gso_segs somehow help make sure the stack does
not feed you oversized GSO'd skbs?
-- 
Florian


Re: [PATCH 1/2] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-17 Thread Florian Fainelli
On 10/17/2018 01:08 PM, Florian Fainelli wrote:
> On 10/17/2018 12:53 PM, Christian Lamparter wrote:
>> As per' APM82181 Embedded Processor User Manual 26.1 EMAC Features:
>> VLAN:
>>  - Support for VLAN tag ID in compliance with IEEE 802.3ac.
>>  - VLAN tag insertion or replacement for transmit packets
>>
>> This patch completes the missing code for the VLAN tx tagging
>> support, as the EMAC_MR1_VLE was already enabled.
>>
>> Signed-off-by: Christian Lamparter 
>> ---
>>  drivers/net/ethernet/ibm/emac/core.c | 32 
>>  drivers/net/ethernet/ibm/emac/core.h |  6 +-
>>  2 files changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
>> b/drivers/net/ethernet/ibm/emac/core.c
>> index 760b2ad8e295..be560f9031f4 100644
>> --- a/drivers/net/ethernet/ibm/emac/core.c
>> +++ b/drivers/net/ethernet/ibm/emac/core.c
>> @@ -37,6 +37,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -674,7 +675,7 @@ static int emac_configure(struct emac_instance *dev)
>>   ndev->dev_addr[5]);
>>  
>>  /* VLAN Tag Protocol ID */
>> -out_be32(&p->vtpid, 0x8100);
>> +out_be32(&p->vtpid, ETH_P_8021Q);
>>  
>>  /* Receive mode register */
>>  r = emac_iff2rmr(ndev);
>> @@ -1435,6 +1436,22 @@ static inline netdev_tx_t emac_xmit_finish(struct 
>> emac_instance *dev, int len)
>>  return NETDEV_TX_OK;
>>  }
>>  
>> +static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff 
>> *skb)
>> +{
>> +/* Handle VLAN TPID and TCI insert if this is a VLAN skb */
>> +if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX) &&
>> +skb_vlan_tag_present(skb)) {
>> +struct emac_regs __iomem *p = dev->emacp;
>> +
>> +/* update the VLAN TCI */
>> +out_be32(&p->vtci, (u32)skb_vlan_tag_get(skb));
> 
> The only case where this is likely not going to be 0x8100/ETH_P_8021Q is
> if you do 802.1ad (QinQ) and you decided to somehow offload the S-Tag
> instead of the C-Tag.

Sorry, looks like I mixed up TCI and TPID here, this looks obviously
correct ;)
-- 
Florian


Re: [PATCH 1/2] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-17 Thread Florian Fainelli
On 10/17/2018 12:53 PM, Christian Lamparter wrote:
> As per' APM82181 Embedded Processor User Manual 26.1 EMAC Features:
> VLAN:
>  - Support for VLAN tag ID in compliance with IEEE 802.3ac.
>  - VLAN tag insertion or replacement for transmit packets
> 
> This patch completes the missing code for the VLAN tx tagging
>> support, as the EMAC_MR1_VLE was already enabled.
> 
> Signed-off-by: Christian Lamparter 
> ---
>  drivers/net/ethernet/ibm/emac/core.c | 32 
>  drivers/net/ethernet/ibm/emac/core.h |  6 +-
>  2 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
> b/drivers/net/ethernet/ibm/emac/core.c
> index 760b2ad8e295..be560f9031f4 100644
> --- a/drivers/net/ethernet/ibm/emac/core.c
> +++ b/drivers/net/ethernet/ibm/emac/core.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -674,7 +675,7 @@ static int emac_configure(struct emac_instance *dev)
>ndev->dev_addr[5]);
>  
>   /* VLAN Tag Protocol ID */
> - out_be32(&p->vtpid, 0x8100);
> + out_be32(&p->vtpid, ETH_P_8021Q);
>  
>   /* Receive mode register */
>   r = emac_iff2rmr(ndev);
> @@ -1435,6 +1436,22 @@ static inline netdev_tx_t emac_xmit_finish(struct 
> emac_instance *dev, int len)
>   return NETDEV_TX_OK;
>  }
>  
> +static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff 
> *skb)
> +{
> + /* Handle VLAN TPID and TCI insert if this is a VLAN skb */
> + if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX) &&
> + skb_vlan_tag_present(skb)) {
> + struct emac_regs __iomem *p = dev->emacp;
> +
> + /* update the VLAN TCI */
> + out_be32(&p->vtci, (u32)skb_vlan_tag_get(skb));

The only case where this is likely not going to be 0x8100/ETH_P_8021Q is
if you do 802.1ad (QinQ) and you decided to somehow offload the S-Tag
instead of the C-Tag.

It would be a shame to slow down your TX path with an expensive register
write, when maybe inserting the VLAN in software amounts to the same
performance result ;)
-- 
Florian


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-17 Thread Holger Hoffstätte

On 10/17/18 21:27, Heiner Kallweit wrote:
(snip)

Good to know. What's your kernel version and RTL8168 chip version?
Regarding the chip version the dmesg line with the XID would be relevant.


4.18.15 + PDS (custom CPU scheduler) + cherry-picks from mainline.
Applied both the original patch in this thread & bql, built fine.

Server:
r8169 :04:00.0 eth0: RTL8168evl/8111evl, c8:60:00:68:33:cc, XID 2c900800, 
IRQ 30

Workstation:
r8169 :04:00.0 eth0: RTL8168evl/8111evl, 50:e5:49:41:7d:ad, XID 2c900800, 
IRQ 33

So same chipsets.

On both:
ethtool --coalesce eth0 rx-frames 0 rx-usecs 50 tx-frames 0 tx-usecs 50
ethtool --offload eth0 rx on tx on gro on gso on sg on tso on

Let's see how it goes. :)

cheers
Holger


Re: [PATCHv2 2/2] dt-bindings: can: xilinx_can: add Xilinx CAN FD 2.0 bindings

2018-10-17 Thread Rob Herring
On Fri, 12 Oct 2018 09:55:09 +0530,  wrote:
> From: Shubhrajyoti Datta 
> 
> Add compatible string and new attributes to support the Xilinx CAN
> FD 2.0.
> 
> Signed-off-by: Shubhrajyoti Datta 
> ---
>  Documentation/devicetree/bindings/net/can/xilinx_can.txt | 1 +
>  1 file changed, 1 insertion(+)
> 

Reviewed-by: Rob Herring 


[PATCH 1/2] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-17 Thread Christian Lamparter
As per' APM82181 Embedded Processor User Manual 26.1 EMAC Features:
VLAN:
 - Support for VLAN tag ID in compliance with IEEE 802.3ac.
 - VLAN tag insertion or replacement for transmit packets

This patch completes the missing code for the VLAN tx tagging
support, as the EMAC_MR1_VLE was already enabled.

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/core.c | 32 
 drivers/net/ethernet/ibm/emac/core.h |  6 +-
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c 
b/drivers/net/ethernet/ibm/emac/core.c
index 760b2ad8e295..be560f9031f4 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -674,7 +675,7 @@ static int emac_configure(struct emac_instance *dev)
 ndev->dev_addr[5]);
 
/* VLAN Tag Protocol ID */
-   out_be32(&p->vtpid, 0x8100);
+   out_be32(&p->vtpid, ETH_P_8021Q);
 
/* Receive mode register */
r = emac_iff2rmr(ndev);
@@ -1435,6 +1436,22 @@ static inline netdev_tx_t emac_xmit_finish(struct 
emac_instance *dev, int len)
return NETDEV_TX_OK;
 }
 
+static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff *skb)
+{
+   /* Handle VLAN TPID and TCI insert if this is a VLAN skb */
+   if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX) &&
+   skb_vlan_tag_present(skb)) {
+   struct emac_regs __iomem *p = dev->emacp;
+
+   /* update the VLAN TCI */
+   out_be32(&p->vtci, (u32)skb_vlan_tag_get(skb));
+
+   /* Insert VLAN tag */
+   return EMAC_TX_CTRL_IVT;
+   }
+   return 0;
+}
+
 /* Tx lock BH */
 static netdev_tx_t emac_start_xmit(struct sk_buff *skb, struct net_device 
*ndev)
 {
@@ -1443,7 +1460,7 @@ static netdev_tx_t emac_start_xmit(struct sk_buff *skb, 
struct net_device *ndev)
int slot;
 
u16 ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY |
-   MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb);
+   MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb) | emac_tx_vlan(dev, skb);
 
slot = dev->tx_slot++;
if (dev->tx_slot == NUM_TX_BUFF) {
@@ -1518,7 +1535,7 @@ emac_start_xmit_sg(struct sk_buff *skb, struct net_device *ndev)
goto stop_queue;
 
ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY |
-   emac_tx_csum(dev, skb);
+   emac_tx_csum(dev, skb) | emac_tx_vlan(dev, skb);
slot = dev->tx_slot;
 
/* skb data */
@@ -2891,7 +2908,8 @@ static int emac_init_config(struct emac_instance *dev)
if (of_device_is_compatible(np, "ibm,emac-apm821xx")) {
dev->features |= (EMAC_APM821XX_REQ_JUMBO_FRAME_SIZE |
  EMAC_FTR_APM821XX_NO_HALF_DUPLEX |
- EMAC_FTR_460EX_PHY_CLK_FIX);
+ EMAC_FTR_460EX_PHY_CLK_FIX |
+ EMAC_FTR_HAS_VLAN_CTAG_TX);
}
} else if (of_device_is_compatible(np, "ibm,emac4")) {
dev->features |= EMAC_FTR_EMAC4;
@@ -3148,6 +3166,12 @@ static int emac_probe(struct platform_device *ofdev)
 
if (dev->tah_dev) {
ndev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG;
+
+   if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX)) {
+   ndev->vlan_features |= ndev->hw_features;
+   ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
+   }
+
ndev->features |= ndev->hw_features | NETIF_F_RXCSUM;
}
ndev->watchdog_timeo = 5 * HZ;
diff --git a/drivers/net/ethernet/ibm/emac/core.h b/drivers/net/ethernet/ibm/emac/core.h
index 84caa4a3fc52..8d84d439168c 100644
--- a/drivers/net/ethernet/ibm/emac/core.h
+++ b/drivers/net/ethernet/ibm/emac/core.h
@@ -334,6 +334,8 @@ struct emac_instance {
  * APM821xx does not support Half Duplex mode
  */
 #define EMAC_FTR_APM821XX_NO_HALF_DUPLEX   0x1000
+/* EMAC can insert 802.1Q tag */
+#define EMAC_FTR_HAS_VLAN_CTAG_TX  0x2000
 
 /* Right now, we don't quite handle the always/possible masks on the
  * most optimal way as we don't have a way to say something like
@@ -363,7 +365,9 @@ enum {
EMAC_FTR_460EX_PHY_CLK_FIX |
EMAC_FTR_440EP_PHY_CLK_FIX |
EMAC_APM821XX_REQ_JUMBO_FRAME_SIZE |
-   EMAC_FTR_APM821XX_NO_HALF_DUPLEX,
+   EMAC_FTR_APM821XX_NO_HALF_DUPLEX |
+   EMAC_FTR_HAS_VLAN_CTAG_TX |
+   0,
 };
 
 static inline int emac_has_feature(struct emac_instance *dev,
-- 
2.19.1



[PATCH 2/2] net: emac: implement TCP TSO

2018-10-17 Thread Christian Lamparter
This patch enables the TSO(v4) hw feature for the emac driver,
as at least the APM82181's TCP/IP acceleration hardware
controller (TAH) provides TCP segmentation support in
the transmit path.
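
The slot-selection logic in emac_tx_tso() can be sketched as follows (userspace model with a hypothetical function name): the TAH offers six fixed segment sizes, and the driver picks the first, i.e. largest, one that fits both the interface MTU and the requested segment size, falling back to software GSO when none fits:

```c
#include <assert.h>

/* The six segment sizes the TAH can be programmed with, largest first. */
static const unsigned int tah_ss[] = { 9000, 4500, 1500, 1300, 576, 176 };

/* Return the TAH SSR index to program (via EMAC_TX_CTRL_TAH_SSR(i) in the
 * real driver), or -1 when no slot fits and software GSO must be used. */
static int pick_tah_slot(unsigned int mtu, unsigned int seg_size)
{
	for (unsigned int i = 0; i < sizeof(tah_ss) / sizeof(tah_ss[0]); i++) {
		if (tah_ss[i] > mtu || tah_ss[i] > seg_size)
			continue;
		return (int)i;
	}
	return -1;
}
```

With a standard 1500-byte MTU this selects the 1500-byte slot for full-size segments and a smaller slot for short ones.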

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/core.c | 101 ++-
 drivers/net/ethernet/ibm/emac/core.h |   4 ++
 drivers/net/ethernet/ibm/emac/emac.h |   7 ++
 drivers/net/ethernet/ibm/emac/tah.c  |  20 ++
 drivers/net/ethernet/ibm/emac/tah.h  |   2 +
 5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index be560f9031f4..49ffbd6e1707 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -38,6 +38,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -1410,6 +1413,52 @@ static inline u16 emac_tx_csum(struct emac_instance *dev, struct sk_buff *skb)
return 0;
 }
 
+const u32 tah_ss[TAH_NO_SSR] = { 9000, 4500, 1500, 1300, 576, 176 };
+
+static int emac_tx_tso(struct emac_instance *dev, struct sk_buff *skb,
+  u16 *ctrl)
+{
+   if (emac_has_feature(dev, EMAC_FTR_TAH_HAS_TSO) &&
+   skb_is_gso(skb) && !!(skb_shinfo(skb)->gso_type &
+   (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
+   u32 seg_size = 0, i;
+
+   /* Get the MTU */
+   seg_size = skb_shinfo(skb)->gso_size + tcp_hdrlen(skb)
+   + skb_network_header_len(skb);
+
+   /* Restriction applied for the segmentation size
+* to use HW segmentation offload feature: the size
+* of the segment must not be less than 168 bytes for
+* DIX formatted segments, or 176 bytes for
+* IEEE formatted segments.
+*
+* I use value 176 to check for the segment size here
+* as it can cover both 2 conditions above.
+*/
+   if (seg_size < 176)
+   return -ENODEV;
+
+   /* Get the best suitable MTU */
+   for (i = 0; i < ARRAY_SIZE(tah_ss); i++) {
+   u32 curr_seg = tah_ss[i];
+
+   if (curr_seg > dev->ndev->mtu ||
+   curr_seg > seg_size)
+   continue;
+
+   *ctrl &= ~EMAC_TX_CTRL_TAH_CSUM;
+   *ctrl |= EMAC_TX_CTRL_TAH_SSR(i);
+   return 0;
+   }
+
+   /* none found, fall back to software segmentation */
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static inline netdev_tx_t emac_xmit_finish(struct emac_instance *dev, int len)
 {
struct emac_regs __iomem *p = dev->emacp;
@@ -1452,8 +1501,46 @@ static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff *skb)
return 0;
 }
 
+static netdev_tx_t
+emac_start_xmit(struct sk_buff *skb, struct net_device *ndev);
+
+static netdev_tx_t emac_sw_tso(struct sk_buff *skb, struct net_device *ndev)
+{
+   struct emac_instance *dev = netdev_priv(ndev);
+   struct sk_buff *segs, *curr;
+
+   segs = skb_gso_segment(skb, ndev->features &
+   ~(NETIF_F_TSO | NETIF_F_TSO6));
+   if (IS_ERR_OR_NULL(segs)) {
+   goto drop;
+   } else {
+   while (segs) {
+   /* check for overflow */
+   if (dev->tx_cnt >= NUM_TX_BUFF) {
+   dev_kfree_skb_any(segs);
+   goto drop;
+   }
+
+   curr = segs;
+   segs = curr->next;
+   curr->next = NULL;
+
+   emac_start_xmit(curr, ndev);
+   }
+   dev_consume_skb_any(skb);
+   }
+
+   return NETDEV_TX_OK;
+
+drop:
+   ++dev->estats.tx_dropped;
+   dev_kfree_skb_any(skb);
+   return NETDEV_TX_OK;
+}
+
 /* Tx lock BH */
-static netdev_tx_t emac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+static netdev_tx_t
+emac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
struct emac_instance *dev = netdev_priv(ndev);
unsigned int len = skb->len;
@@ -1462,6 +1549,9 @@ static netdev_tx_t emac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
u16 ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY |
MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb) | emac_tx_vlan(dev, skb);
 
+   if (emac_tx_tso(dev, skb, &ctrl))
+   return emac_sw_tso(skb, ndev);
+
slot = dev->tx_slot++;
if (dev->tx_slot == NUM_TX_BUFF) {
dev->tx_slot = 0;
@@ -1536,6 +1626,9 @@ emac_start_xmit_sg(struct sk_buff *skb, struct net_device *ndev)
 
ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY |

Re: [PATCH net] net: ipmr: fix unresolved entry dumps

2018-10-17 Thread Nikolay Aleksandrov
On 17/10/2018 22:34, Nikolay Aleksandrov wrote:
> If the skb space ends in an unresolved entry while dumping we'll miss
> some unresolved entries. The reason is due to zeroing the entry counter
> between dumping resolved and unresolved mfc entries. We should just
> keep counting until the whole table is dumped and zero when we move to
> the next as we have a separate table counter.
> 
> Reported-by: Colin Ian King 
> Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
> Signed-off-by: Nikolay Aleksandrov 
> ---
> Dropped Yuval's mail because it bounces.
> 
>  net/ipv4/ipmr_base.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
> index 1ad9aa62a97b..eab8cd5ec2f5 100644
> --- a/net/ipv4/ipmr_base.c
> +++ b/net/ipv4/ipmr_base.c
> @@ -296,8 +296,6 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
>  next_entry:
>   e++;
>   }
> - e = 0;
> - s_e = 0;
>  
>   spin_lock_bh(lock);
>   list_for_each_entry(mfc, >mfc_unres_queue, list) {
> 

+CC Colin
Sorry about that, my script somehow missed the reported-by email.



[PATCH net] net: ipmr: fix unresolved entry dumps

2018-10-17 Thread Nikolay Aleksandrov
If the skb space ends in an unresolved entry while dumping we'll miss
some unresolved entries. The reason is due to zeroing the entry counter
between dumping resolved and unresolved mfc entries. We should just
keep counting until the whole table is dumped and zero when we move to
the next as we have a separate table counter.
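
The fixed behavior can be sketched with a toy two-list resumable dump (userspace model, hypothetical names, not kernel code): one running index `e`, checked against the saved position `s_e`, lets each call resume exactly where the previous one stopped, with no reset between the resolved and unresolved lists:

```c
#include <assert.h>

#define N_RES   3
#define N_UNRES 3

/* One dump call: walk resolved entries 0..2 then unresolved entries
 * 100..102 with a single running index, skipping entries below the saved
 * position *s_e and emitting at most 'budget' entries into out[].
 * Returns the number emitted; *s_e is updated for the next call. */
static int dump_pass(int *s_e, int budget, int *out)
{
	int e = 0, n = 0;

	for (int i = 0; i < N_RES; i++) {
		if (e < *s_e) { e++; continue; }
		if (n == budget) { *s_e = e; return n; }
		out[n++] = i;
		e++;
	}
	/* no "e = 0" here: the counter keeps running across both lists */
	for (int i = 0; i < N_UNRES; i++) {
		if (e < *s_e) { e++; continue; }
		if (n == budget) { *s_e = e; return n; }
		out[n++] = 100 + i;
		e++;
	}
	*s_e = e;
	return n;
}
```

With the removed reset in place, a call resuming inside the unresolved list would re-apply `s_e` to the zeroed counter and skip unresolved entries.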

Reported-by: Colin Ian King 
Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
Signed-off-by: Nikolay Aleksandrov 
---
Dropped Yuval's mail because it bounces.

 net/ipv4/ipmr_base.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 1ad9aa62a97b..eab8cd5ec2f5 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -296,8 +296,6 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 next_entry:
e++;
}
-   e = 0;
-   s_e = 0;
 
spin_lock_bh(lock);
list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
-- 
2.17.2



Re: [PATCH linux-firmware] linux-firmware: liquidio: fix GPL compliance issue

2018-10-17 Thread Manlunas, Felix
On Fri, Sep 28, 2018 at 04:50:51PM -0700, Felix Manlunas wrote:
> Part of the code inside the lio_vsw_23xx.bin firmware image is under GPL,
> but the LICENCE.cavium file neglects to indicate that.  However,
> LICENCE.cavium does correctly specify the license that covers the other
> Cavium firmware images that do not contain any GPL code.
> 
> Fix the GPL compliance issue by adding a new file, LICENCE.cavium_liquidio,
> which correctly shows the GPL boilerplate.  This new file specifies the
> licenses for all liquidio firmware, including the ones that do not have
> GPL code.
> 
> Change the liquidio section of WHENCE to point to LICENCE.cavium_liquidio.
> 
> Reported-by: Florian Weimer 
> Signed-off-by: Manish Awasthi 
> Signed-off-by: Manoj Panicker 
> Signed-off-by: Faisal Masood 
> Signed-off-by: Felix Manlunas 
> ---
>  LICENCE.cavium_liquidio | 429 
> 
>  WHENCE  |   2 +-
>  2 files changed, 430 insertions(+), 1 deletion(-)
>  create mode 100644 LICENCE.cavium_liquidio

Hello Maintainers of linux-firmware.git,

Any feedback about this patch?

Thanks,
Felix


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-17 Thread Heiner Kallweit
On 17.10.2018 21:11, Holger Hoffstätte wrote:
> On 10/17/18 20:12, Heiner Kallweit wrote:
>> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>>> On 10/16/18 22:37, Heiner Kallweit wrote:
 rtl_rx() and rtl_tx() are called only if the respective bits are set
 in the interrupt status register. Under high load NAPI may not be
 able to process all data (work_done == budget) and it will schedule
 subsequent calls to the poll callback.
 rtl_ack_events() however resets the bits in the interrupt status
 register, therefore subsequent calls to rtl8169_poll() won't call
 rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>>
>>> Very interesting! Could this be the reason for the mysterious
>>> hangs & resets we experienced when enabling BQL for r8169?
>>> They happened more often with TSO/GSO enabled and several people
>>> attempted to fix those hangs unsuccessfully; it was later reverted
>>> and has been since then (#87cda7cb43).
>>> If this bug has been there "forever" it might be tempting to
>>> re-apply BQL and see what happens. Any chance you could give that
>>> a try? I'll gladly test patches, just like I'll run this one.
>>>
>> After reading through the old mail threads regarding BQL on r8169
>> I don't think the fix here is related.
>> It seems that BQL on r8169 worked fine for most people, just one
>> had problems on one of his systems. I assume the issue was specific
> 
> I continued to use the BQL patch in my private tree after it was reverted
> and also had occasional timeouts, but *only* after I started playing
> with ethtool to change offload settings. Without offloads or the BQL patch
> everything has been rock-solid since then.
> The other weird problem was that timeouts would occur on an otherwise
> *completely idle* system. Since that occasionally borked my NFS server
> over night I ultimately removed BQL as well. Rock-solid since then.
> 
>> I will apply the old BQL patch and see how it's on my system
>> (with GRO and SG enabled).
> 
> I don't think it still applies cleanly, but if you cook up an updated
> version I'll gladly test it.
> 
> Thanks! :)
> Holger
> 

Good to know. What's your kernel version and RTL8168 chip version?
Regarding the chip version, the dmesg line with the XID would be relevant.

Below is the slightly modified original BQL patch, I just moved the call
to netdev_reset_queue(). This patch applies at least to latest linux-next.
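
For context, the contract behind the three BQL calls in the patch can be modeled in userspace (toy struct and function names, not the in-kernel dql implementation): every byte reported through netdev_sent_queue() must eventually come back via netdev_completed_queue(), or be dropped via netdev_reset_queue() when the ring is torn down; that is why the reset belongs in rtl8169_tx_clear(), the path that frees pending skbs without completing them:

```c
#include <assert.h>

/* Toy model of BQL byte accounting on one tx queue. */
struct byte_queue {
	unsigned int inflight;  /* bytes handed to hw, not yet completed */
	unsigned int limit;     /* dynamic limit (fixed in this sketch) */
};

/* Models netdev_sent_queue(): returns 0 when the queue should be stopped. */
static int bq_sent(struct byte_queue *q, unsigned int len)
{
	q->inflight += len;
	return q->inflight < q->limit;
}

/* Models netdev_completed_queue(): tx-completion path (rtl_tx here). */
static void bq_completed(struct byte_queue *q, unsigned int bytes)
{
	q->inflight -= bytes;
}

/* Models netdev_reset_queue(): teardown path (rtl8169_tx_clear here).
 * Skipping this after dropping pending skbs would leak 'inflight' bytes
 * and eventually stall the queue forever. */
static void bq_reset(struct byte_queue *q)
{
	q->inflight = 0;
}
```

The real kernel also adapts the limit dynamically; the sketch only shows the accounting invariant.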

My test system:
- RTL8168evl
- latest linux-next
- BQL patch applied
- SG/GRO enabled:

rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on

I briefly tested normal operation and did some tests with iperf3.
Everything looks good so far.

---
 drivers/net/ethernet/realtek/r8169.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0d8070adc..e236b46b8 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5852,6 +5852,7 @@ static void rtl8169_tx_clear(struct rtl8169_private *tp)
 {
rtl8169_tx_clear_range(tp, tp->dirty_tx, NUM_TX_DESC);
tp->cur_tx = tp->dirty_tx = 0;
+   netdev_reset_queue(tp->dev);
 }
 
 static void rtl_reset_work(struct rtl8169_private *tp)
@@ -6154,6 +6155,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
txd->opts2 = cpu_to_le32(opts[1]);
 
+   netdev_sent_queue(dev, skb->len);
+
skb_tx_timestamp(skb);
 
/* Force memory writes to complete before releasing descriptor */
@@ -6252,7 +6255,7 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev)
 
 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 {
-   unsigned int dirty_tx, tx_left;
+   unsigned int dirty_tx, tx_left, bytes_compl = 0, pkts_compl = 0;
 
dirty_tx = tp->dirty_tx;
smp_rmb();
@@ -6276,10 +6279,8 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
rtl8169_unmap_tx_skb(tp_to_dev(tp), tx_skb,
 tp->TxDescArray + entry);
if (status & LastFrag) {
-   u64_stats_update_begin(&tp->tx_stats.syncp);
-   tp->tx_stats.packets++;
-   tp->tx_stats.bytes += tx_skb->skb->len;
-   
