Re: [PATCH bpf-next 0/4] i40e AF_XDP zero-copy buffer leak fixes

2018-09-05 Thread Alexei Starovoitov
On Wed, Sep 05, 2018 at 09:15:14PM +0200, Björn Töpel wrote:
> On Wed, 5 Sep 2018 at 19:14, Jakub Kicinski wrote:
> >
> > On Tue,  4 Sep 2018 20:11:01 +0200, Björn Töpel wrote:
> > > From: Björn Töpel 
> > >
> > > This series addresses an AF_XDP zero-copy issue where buffers passed
> > > from userspace to the kernel were leaked when the hardware descriptor
> > > ring was torn down.
> > >
> > > The patches fix the i40e AF_XDP zero-copy implementation.
> > >
> > > Thanks to Jakub Kicinski for pointing this out!
> > >
> > > Some background for folks that don't know the details: A zero-copy
> > > capable driver picks buffers off the fill ring and places them on the
> > > hardware Rx ring to be completed at a later point when DMA is
> > > complete. Similar on the Tx side; The driver picks buffers off the Tx
> > > ring and places them on the Tx hardware ring.
> > >
> > > In the typical flow, the Rx buffer will be placed onto an Rx ring
> > > (completed to the user), and the Tx buffer will be placed on the
> > > completion ring to notify the user that the transfer is done.
> > >
> > > However, if the driver needs to tear down the hardware rings for some
> > > reason (interface goes down, reconfiguration and such), the userspace
> > > buffers cannot be leaked. They have to be reused or completed back to
> > > userspace.
> > >
> > > The implementation does the following:
> > >
> > > * Outstanding Tx descriptors will be passed to the completion
> > >   ring. The Tx code has back-pressure mechanism in place, so that
> > >   enough empty space in the completion ring is guaranteed.
> > >
> > > * Outstanding Rx descriptors are temporarily stored on a stash/reuse
> > >   queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
> > >   come up again, entries from the stash are used to re-populate the
> > >   ring.
> > >
> > > * When AF_XDP ZC is enabled, disallow changing the number of hardware
> > >   descriptors via ethtool. Otherwise, the size of the stash/reuse
> > >   queue can grow unbounded.
> > >
> > > Going forward, introducing a "zero-copy allocator" analogous to Jesper
> > > Brouer's page pool would be a more robust and reuseable solution.
> > >
> > > Jakub: I've made a minor checkpatch fix to your RFC, prior to adding
> > > it to this series.
> >
> > Thanks for the fix! :)
> >
> > Out of curiosity, did checking the reuse queue have a noticeable impact
> > in your test (i.e. always using the _rq() helpers)?  You seem to be
> > adding an indirect call, would that not be way worse on a retpoline
> > kernel?
> 
> Do you mean the indirection in __i40e_alloc_rx_buffers_zc (patch #3)?
> The indirect call is elided by the __always_inline -- without that
> retpoline took 2.5Mpps worth of Rx. :-(
> 
> I'm only using the _rq helpers in the configuration/slow path, so the
> fast-path is unchanged.

Applied to bpf-next. Thanks.



Re: [PATCH net 0/3] net/iucv: fixes 2018-09-05

2018-09-05 Thread David Miller
From: Julian Wiedmann 
Date: Wed,  5 Sep 2018 16:55:09 +0200

> please apply three straight-forward fixes for iucv. One that prevents
> leaking the skb on malformed inbound packets, one to fix the error
> handling on transmit error, and one to get rid of a compile warning.

Series applied, thank you.


Re: [PATCH bpf-next] bpf/verifier: properly clear union members after a ctx read

2018-09-05 Thread Alexei Starovoitov
On Wed, Sep 05, 2018 at 02:47:22PM +0100, Edward Cree wrote:
> On 05/09/18 03:23, Alexei Starovoitov wrote:
> > So would you agree it's fair to add
> > Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> > ?
> Sure.  Though I don't think it needs backporting, as it's a conservative
>  bug (i.e. it merely prevents pruning, but that's safe security-wise).

agree. No backport necessary.

> > How about patch like the following:
> > 
> > From 422fd975ed78645ab67d2eb50ff6e1ff6fb3de32 Mon Sep 17 00:00:00 2001
> > From: Alexei Starovoitov 
> > Date: Tue, 4 Sep 2018 19:13:44 -0700
> > Subject: [PATCH] bpf/verifier: fix verifier instability
> >
> > Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> > Debugged-by: Edward Cree  
> > Signed-off-by: Alexei Starovoitov 
> Acked-by: Edward Cree 

Thanks. I copy-pasted your commit log and pushed to bpf-next.

Much appreciate the time and effort spent on root causing these issues.


Re: [PATCH RFC] net/mlx5_en: switch to Toeplitz RSS hash by default

2018-09-05 Thread Saeed Mahameed
On Sun, Sep 2, 2018 at 2:55 AM, Konstantin Khlebnikov
 wrote:
> On 02.09.2018 12:29, Tariq Toukan wrote:
>>
>>
>>
>> On 31/08/2018 2:29 PM, Konstantin Khlebnikov wrote:
>>>
>>> XOR (MLX5_RX_HASH_FN_INVERTED_XOR8) gives only 8 bits.
>>> It seems not enough for RFS. All other drivers use toeplitz.
>>>
>>> Driver mlx4_en uses Toeplitz by default and warns if hash XOR is used
>>> together with NETIF_F_RXHASH (enabled by default too): "Enabling both
>>> XOR Hash function and RX Hashing can limit RPS functionality".
>>>
>>> XOR is default in mlx5_en since commit 2be6967cdbc9
>>> ("net/mlx5e: Support ETH_RSS_HASH_XOR").
>>>
>>> Hash function could be set via ethtool. But it would be nice to have
>>> single standard for drivers or proper description why this one is
>>> special.
>>>
>>> Signed-off-by: Konstantin Khlebnikov 
>>> ---
>>
>>
>> Hi Konstantin,
>>
>> Thanks for the patch.
>>
>> I understand the motivation.
>>
>> This change affects the default out-of-the-box behavior and requires a
>> full performance cycle. We'll run performance regression tomorrow, results
>> should be ready by EOW.
>> I'll update.
>
>
> Ok, thank you.
>
> The only mention I've found in your documentation
> http://www.mellanox.com/related-docs/prod_software/Mellanox_EN_for_Linux_User_Manual_v4_4.pdf
>
> is
> ---
> 1.1.10 RSS Support
> 1.1.10.1 RSS Hash Function
> The device has the ability to use XOR as the RSS distribution function,
> instead of the default Toeplitz function.
> The XOR function can be better distributed among driver's receive queues in
> small number of streams, where it distributes each TCP/UDP stream to a
> different queue.
> ---
>
> So Toeplitz is supposed to be default hash function for all versions of
> drivers and hardware.
>
> Also XOR8 seems vulnerable to DDoS: the hash is predictable, there is no
> random/secret vector, and it has only 8 bits.
> So it is easy to route all flows into one point, as we found out by accident.
>
> Moreover, in kernel 4.4.y hash switch via ethtool is broken and does not
> work =)
>

Is it broken in mlx5 only, or in the whole kernel?

If it is mlx5 then this might be the reason:
commit 2d75b2bc8a8c0ce5567a6ecef52e194d117efe3f
net/mlx5e: Add ethtool RSS configuration options

was submitted to kernel 4.3

and an important fix for hash function change was submitted to 4.5:

commit bdfc028de1b3cd59490d5413a5c87b0fa50040c2
Author: Tariq Toukan 
Date:   Mon Feb 29 21:17:12 2016 +0200

net/mlx5e: Fix ethtool RX hash func configuration change

We should modify TIRs explicitly to apply the new RSS configuration.
The light ndo close/open calls do not "refresh" them.

Fixes: 2d75b2bc8a8c ('net/mlx5e: Add ethtool RSS configuration options')


>>
>> Regards,
>> Tariq
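
For readers unfamiliar with the two hash families discussed in this thread,
here is a minimal user-space sketch. It assumes the textbook Toeplitz
construction (a secret key, one 32-bit key window XORed in per set input bit)
and a plain byte-wise XOR fold to 8 bits. The names toeplitz_hash() and
xor8_hash() are hypothetical, and xor8_hash() is only an illustration of an
8-bit keyless hash, not the exact MLX5_RX_HASH_FN_INVERTED_XOR8 hardware
function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash: for every set bit of the input (MSB first), XOR in the
 * 32-bit window of the key that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t len)
{
    /* The key must cover the input plus one trailing 32-bit window. */
    assert(key_len >= len + 4);

    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit further into the key. */
            window <<= 1;
            if (key[i + 4] & (1u << bit))
                window |= 1u;
        }
    }
    return hash;
}

/* Illustrative keyless 8-bit XOR fold: only 256 possible results, and
 * anyone can predict which inputs collide. */
static uint8_t xor8_hash(const uint8_t *data, size_t len)
{
    uint8_t h = 0;

    for (size_t i = 0; i < len; i++)
        h ^= data[i];
    return h;
}
```

With the XOR fold, any two inputs that contain the same bytes in a different
order collide, whatever the configuration. With Toeplitz, the result depends
on a key that an outside sender does not know.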


Re: [PATCH net] net/sched: fix memory leak in act_tunnel_key_init()

2018-09-05 Thread David Miller
From: Davide Caratti 
Date: Tue,  4 Sep 2018 19:00:19 +0200

> If users try to install act_tunnel_key 'set' rules with duplicate values
> of 'index', the tunnel metadata are allocated, but never released. Then,
> kmemleak complains as follows:
 ...
> This problem theoretically happens also in case users attempt to setup a
> geneve rule having wrong configuration data, or when the kernel fails to
> allocate 'params_new'. Ensure that tunnel_key_init() releases the tunnel
> metadata also in the above conditions.
> 
> Addresses-Coverity-ID: 1373974 ("Resource leak")
> Fixes: d0f6dd8a914f4 ("net/sched: Introduce act_tunnel_key")
> Fixes: 0ed5269f9e41f ("net/sched: add tunnel option support to act_tunnel_key")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable.


Re: [PATCH net-next] nfp: separate VXLAN and GRE feature handling

2018-09-05 Thread David Miller
From: Jakub Kicinski 
Date: Tue,  4 Sep 2018 08:28:33 -0700

> VXLAN and GRE FW features have to currently be both advertised
> for the driver to enable them.  Separate the handling.
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Dirk van der Merwe 

Applied.


Re: [PATCH net-next 0/3] nfp: improve the new rtsym helpers

2018-09-05 Thread David Miller
From: Jakub Kicinski 
Date: Tue,  4 Sep 2018 07:37:30 -0700

> This set fixes a bug in ABS rtsym handling I added in net-next,
> it expands the error checking and reporting on the rtsym accesses.

Series applied.


Re: [PATCH v2 net-next] failover: Add missing check to validate 'slave_dev' in net_failover_slave_unregister

2018-09-05 Thread David Miller
From: YueHaibing 
Date: Tue, 4 Sep 2018 02:56:26 +

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/net_failover.c: In function 'net_failover_slave_unregister':
> drivers/net/net_failover.c:598:35: warning:
>  variable 'primary_dev' set but not used [-Wunused-but-set-variable]
> 
> The validity of 'slave_dev' should be checked there.
> 
> Fixes: cfc80d9a1163 ("net: Introduce net_failover driver")
> 
> Signed-off-by: YueHaibing 
> ---
> v2: use WARN_ON_ONCE as Liran Alon suggested

Applied.


Re: [PATCH v2 net-next] packet: add sockopt to ignore outgoing packets

2018-09-05 Thread David Miller
From: Vincent Whitchurch 
Date: Mon,  3 Sep 2018 16:23:36 +0200

> Currently, the only way to ignore outgoing packets on a packet socket is
> via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
> AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
> if the filter run from packet_rcv() would reject them.  So the presence
> of a packet socket on the interface takes away the benefits of
> MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
> packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
> cloned, but the cost for that is much lower.)
> 
> Add a socket option to allow AF_PACKET sockets to ignore outgoing
> packets to solve this.  Note that the *BSDs already have something
> similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
> 
> The first intended user is lldpd.
> 
> Signed-off-by: Vincent Whitchurch 
> ---
> v2: Stricter value validation.
> Moved ignore check out of skb_loop_sk().

Applied, thank you.


[net-next 2/9] net/mlx5: Add new list to store deleted flow counters

2018-09-05 Thread Saeed Mahameed
From: Vlad Buslov 

To prevent the flow counter stats work function from traversing the whole
flow counter tree while searching for deleted flow counters, a new list that
stores deleted flow counters is added to struct mlx5_fc_stats. A lockless
NULL-terminated singly linked list is used for the following reasons:
 - This use case only needs to add a single element to the list and to
 remove/iterate the whole list. A lockless list doesn't require any
 additional synchronization for these operations.
 - The first cache line of the flow counter data structure only has space to
 store a single additional pointer, which precludes use of a doubly linked
 list.

Remove flow counter 'deleted' flag that is no longer needed.
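
The two operations the commit message relies on (add one element, detach the
whole list) can be sketched in user space with C11 atomics. This is only an
illustration of the idea behind the kernel's llist, not its implementation;
the names lnode, lhead, linit(), lpush() and ldel_all() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One pointer per element: this is all a NULL-terminated singly linked
 * list costs inside the containing structure. */
struct lnode {
    struct lnode *next;
};

struct lhead {
    _Atomic(struct lnode *) first;
};

static void linit(struct lhead *h)
{
    atomic_init(&h->first, (struct lnode *)NULL);
}

/* Push one node at the head. A failed compare-exchange reloads 'old',
 * so the loop retries until no other thread raced with us. */
static void lpush(struct lhead *h, struct lnode *n)
{
    struct lnode *old = atomic_load_explicit(&h->first, memory_order_relaxed);

    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&h->first, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Detach the whole list with a single atomic exchange. The caller then
 * owns every node and can walk them without further synchronization. */
static struct lnode *ldel_all(struct lhead *h)
{
    return atomic_exchange_explicit(&h->first, (struct lnode *)NULL,
                                    memory_order_acquire);
}
```

Producers only ever push, and the single consumer takes everything at once,
so neither side needs a lock. Nodes come back in reverse insertion order.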

Signed-off-by: Vlad Buslov 
Acked-by: Amir Vadai 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/fs_core.h |  2 +-
 .../ethernet/mellanox/mlx5/core/fs_counters.c | 34 +++
 include/linux/mlx5/driver.h   |  1 +
 3 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index f68590291e0c..617d6239c5f3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -141,6 +141,7 @@ struct mlx5_fc_cache {
 struct mlx5_fc {
struct rb_node node;
struct llist_node addlist;
+   struct llist_node dellist;
 
/* last{packets,bytes} members are used when calculating the delta since
 * last reading
@@ -149,7 +150,6 @@ struct mlx5_fc {
u64 lastbytes;
 
u32 id;
-   bool deleted;
bool aging;
 
struct mlx5_fc_cache cache ____cacheline_aligned_in_smp;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index d996d6cf9e19..f1266f215a31 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -58,7 +58,7 @@
  *   - spawn thread to do the actual destroy
  *
  * - destroy (user context)
- *   - mark a counter as deleted
+ *   - add a counter to lockless dellist
  *   - spawn thread to do the actual del
  *
  * - dump (user context)
@@ -171,9 +171,8 @@ static void mlx5_fc_stats_work(struct work_struct *work)
 priv.fc_stats.work.work);
struct mlx5_fc_stats *fc_stats = &dev->priv.fc_stats;
struct llist_node *tmplist = llist_del_all(&fc_stats->addlist);
+   struct mlx5_fc *counter = NULL, *last = NULL, *tmp;
unsigned long now = jiffies;
-   struct mlx5_fc *counter = NULL;
-   struct mlx5_fc *last = NULL;
struct rb_node *node;
 
if (tmplist || !RB_EMPTY_ROOT(&fc_stats->counters))
@@ -183,26 +182,17 @@ static void mlx5_fc_stats_work(struct work_struct *work)
llist_for_each_entry(counter, tmplist, addlist)
mlx5_fc_stats_insert(&fc_stats->counters, counter);
 
-   node = rb_first(&fc_stats->counters);
-   while (node) {
-   counter = rb_entry(node, struct mlx5_fc, node);
-
-   node = rb_next(node);
-
-   if (counter->deleted) {
-   rb_erase(&counter->node, &fc_stats->counters);
-
-   mlx5_cmd_fc_free(dev, counter->id);
-
-   kfree(counter);
-   continue;
-   }
+   tmplist = llist_del_all(&fc_stats->dellist);
+   llist_for_each_entry_safe(counter, tmp, tmplist, dellist) {
+   rb_erase(&counter->node, &fc_stats->counters);
 
-   last = counter;
+   mlx5_free_fc(dev, counter);
}
 
-   if (time_before(now, fc_stats->next_query) || !last)
+   node = rb_last(&fc_stats->counters);
+   if (time_before(now, fc_stats->next_query) || !node)
return;
+   last = rb_entry(node, struct mlx5_fc, node);
 
node = rb_first(&fc_stats->counters);
while (node) {
@@ -254,13 +244,12 @@ void mlx5_fc_destroy(struct mlx5_core_dev *dev, struct mlx5_fc *counter)
return;
 
if (counter->aging) {
-   counter->deleted = true;
+   llist_add(&counter->dellist, &fc_stats->dellist);
mod_delayed_work(fc_stats->wq, &fc_stats->work, 0);
return;
}
 
-   mlx5_cmd_fc_free(dev, counter->id);
-   kfree(counter);
+   mlx5_free_fc(dev, counter);
 }
 EXPORT_SYMBOL(mlx5_fc_destroy);
 
@@ -270,6 +259,7 @@ int mlx5_init_fc_stats(struct mlx5_core_dev *dev)
 
fc_stats->counters = RB_ROOT;
init_llist_head(&fc_stats->addlist);
+   init_llist_head(&fc_stats->dellist);
 
fc_stats->wq = create_singlethread_workqueue("mlx5_fc");
if (!fc_stats->wq)
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index c00549293982..4b53ac64004b 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ 

[net-next 9/9] net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets

2018-09-05 Thread Saeed Mahameed
From: Alaa Hleihel 

CHECKSUM_COMPLETE is not applicable to SCTP protocol.
Setting it for SCTP packets leads to CRC32c validation failure.
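
As background: CHECKSUM_COMPLETE hands the stack a 16-bit ones' complement
sum over the packet, which is what TCP and UDP validate against, while SCTP
protects its packets with CRC32c. The sketch below is a plain user-space
illustration of the two functions (the helper names inet_csum() and crc32c()
are hypothetical, not kernel APIs) and shows that one cannot stand in for the
other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement checksum in the style of RFC 1071. */
static uint16_t inet_csum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;   /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16); /* fold the carries back in */
    return (uint16_t)~sum;
}

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78,
 * initial value and final XOR of 0xFFFFFFFF. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

Swapping two 16-bit words of a buffer leaves the ones' complement sum
unchanged but changes the CRC32c, so a hardware-provided sum says nothing
about whether the SCTP CRC is valid.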

Fixes: bbceefce9adf ("net/mlx5e: Support RX CHECKSUM_COMPLETE")
Signed-off-by: Alaa Hleihel 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2175d6972dc3..424bc89184c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -755,6 +755,14 @@ static __be32 mlx5e_get_fcs(struct sk_buff *skb)
return fcs_bytes;
 }
 
+static u8 get_ip_proto(struct sk_buff *skb, __be16 proto)
+{
+   void *ip_p = skb->data + sizeof(struct ethhdr);
+
+   return (proto == htons(ETH_P_IP)) ? ((struct iphdr *)ip_p)->protocol :
+   ((struct ipv6hdr *)ip_p)->nexthdr;
+}
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
 struct mlx5_cqe64 *cqe,
 struct mlx5e_rq *rq,
@@ -775,6 +783,9 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
}
 
if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
+   if (unlikely(get_ip_proto(skb, proto) == IPPROTO_SCTP))
+   goto csum_unnecessary;
+
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
if (network_depth > ETH_HLEN)
@@ -792,6 +803,7 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
return;
}
 
+csum_unnecessary:
if (likely((cqe->hds_ip_ext & CQE_L3_OK) &&
   (cqe->hds_ip_ext & CQE_L4_OK))) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
-- 
2.17.1



[net-next 4/9] net/mlx5: Add flow counters idr

2018-09-05 Thread Saeed Mahameed
From: Vlad Buslov 

Previous patch in series changed flow counter storage structure from
rb_tree to linked list in order to improve flow counter traversal
performance. The drawback of such solution is that flow counter lookup by
id becomes linear in complexity.

Store pointers to flow counters in idr in order to improve lookup
performance to logarithmic again. Idr is non-intrusive data structure and
doesn't require extending flow counter struct with new elements. This means
that idr can be used for lookup, while linked list from previous patch is
used for traversal, and struct mlx5_fc size is <= 2 cache lines.

Signed-off-by: Vlad Buslov 
Acked-by: Amir Vadai 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 .../ethernet/mellanox/mlx5/core/fs_counters.c | 37 +--
 include/linux/mlx5/driver.h   |  2 +
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index 90ebfee37508..09206c4acd9a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -77,13 +77,18 @@ static struct list_head *mlx5_fc_counters_lookup_next(struct mlx5_core_dev *dev,
  u32 id)
 {
struct mlx5_fc_stats *fc_stats = &dev->priv.fc_stats;
+   unsigned long next_id = (unsigned long)id + 1;
struct mlx5_fc *counter;
 
-   list_for_each_entry(counter, &fc_stats->counters, list)
-   if (counter->id > id)
-   return &counter->list;
+   rcu_read_lock();
+   /* skip counters that are in idr, but not yet in counters list */
+   while ((counter = idr_get_next_ul(&fc_stats->counters_idr,
+ &next_id)) != NULL &&
+  list_empty(&counter->list))
+   next_id++;
+   rcu_read_unlock();
 
-   return &fc_stats->counters;
+   return counter ? &counter->list : &fc_stats->counters;
 }
 
 static void mlx5_fc_stats_insert(struct mlx5_core_dev *dev,
@@ -214,15 +219,29 @@ struct mlx5_fc *mlx5_fc_create(struct mlx5_core_dev *dev, bool aging)
counter = kzalloc(sizeof(*counter), GFP_KERNEL);
if (!counter)
return ERR_PTR(-ENOMEM);
+   INIT_LIST_HEAD(&counter->list);
 
err = mlx5_cmd_fc_alloc(dev, &counter->id);
if (err)
goto err_out;
 
if (aging) {
+   u32 id = counter->id;
+
counter->cache.lastuse = jiffies;
counter->aging = true;
 
+   idr_preload(GFP_KERNEL);
+   spin_lock(&fc_stats->counters_idr_lock);
+
+   err = idr_alloc_u32(&fc_stats->counters_idr, counter, &id, id,
+   GFP_NOWAIT);
+
+   spin_unlock(&fc_stats->counters_idr_lock);
+   idr_preload_end();
+   if (err)
+   goto err_out_alloc;
+
llist_add(&counter->addlist, &fc_stats->addlist);
 
mod_delayed_work(fc_stats->wq, &fc_stats->work, 0);
@@ -230,6 +249,8 @@ struct mlx5_fc *mlx5_fc_create(struct mlx5_core_dev *dev, bool aging)
 
return counter;
 
+err_out_alloc:
+   mlx5_cmd_fc_free(dev, counter->id);
 err_out:
kfree(counter);
 
@@ -245,6 +266,10 @@ void mlx5_fc_destroy(struct mlx5_core_dev *dev, struct mlx5_fc *counter)
return;
 
if (counter->aging) {
+   spin_lock(&fc_stats->counters_idr_lock);
+   WARN_ON(!idr_remove(&fc_stats->counters_idr, counter->id));
+   spin_unlock(&fc_stats->counters_idr_lock);
+
llist_add(&counter->dellist, &fc_stats->dellist);
mod_delayed_work(fc_stats->wq, &fc_stats->work, 0);
return;
@@ -258,6 +283,8 @@ int mlx5_init_fc_stats(struct mlx5_core_dev *dev)
 {
struct mlx5_fc_stats *fc_stats = &dev->priv.fc_stats;
 
+   spin_lock_init(&fc_stats->counters_idr_lock);
+   idr_init(&fc_stats->counters_idr);
INIT_LIST_HEAD(&fc_stats->counters);
init_llist_head(&fc_stats->addlist);
init_llist_head(&fc_stats->dellist);
@@ -283,6 +310,8 @@ void mlx5_cleanup_fc_stats(struct mlx5_core_dev *dev)
destroy_workqueue(dev->priv.fc_stats.wq);
dev->priv.fc_stats.wq = NULL;
 
+   idr_destroy(&fc_stats->counters_idr);
+
tmplist = llist_del_all(&fc_stats->addlist);
llist_for_each_entry_safe(counter, tmp, tmplist, addlist)
mlx5_free_fc(dev, counter);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 61bed33e6675..2a0c845f6bdb 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -583,6 +583,8 @@ struct mlx5_irq_info {
 };
 
 struct mlx5_fc_stats {
+   spinlock_t counters_idr_lock; /* protects counters_idr */
+   struct idr counters_idr;
struct list_head counters;
struct llist_head addlist;
struct llist_head dellist;
-- 
2.17.1



[net-next 8/9] net/mlx5e: Set ECN for received packets using CQE indication

2018-09-05 Thread Saeed Mahameed
From: Natali Shechtman 

In multi-host (MH) NIC scheme, a single HW port serves multiple hosts
or sockets on the same host.
The HW uses a mechanism in the PCIe buffer which monitors
the amount of consumed PCIe buffers per host.
On a certain configuration, under congestion,
the HW emulates a switch doing ECN marking on packets using ECN
indication on the completion descriptor (CQE).

The driver needs to set the ECN bits on the packet SKB so that the
network stack can react to the congestion; this commit does that.
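
What "setting the ECN bits" means for an IPv4 header can be sketched in user
space as follows. This is an illustration only: the helper names
ipv4_hdr_csum() and ipv4_set_ce() are hypothetical, and the header checksum
is simply recomputed from scratch, which is an assumption made for brevity
rather than a description of how the kernel helper updates it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ECN_MASK    0x03u  /* low two bits of the TOS/traffic class byte */
#define ECN_NOT_ECT 0x00u  /* sender is not ECN capable */
#define ECN_CE      0x03u  /* congestion experienced */

/* Ones' complement checksum over an IPv4 header (hlen is always even). */
static uint16_t ipv4_hdr_csum(const uint8_t *hdr, size_t hlen)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < hlen; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Mark the packet as CE if the sender declared ECN capability.
 * Returns 1 if the header now carries CE, 0 if it was left untouched. */
static int ipv4_set_ce(uint8_t *hdr)
{
    size_t hlen = (size_t)(hdr[0] & 0x0f) * 4;
    uint16_t csum;

    if ((hdr[1] & ECN_MASK) == ECN_NOT_ECT)
        return 0;

    hdr[1] |= ECN_CE;

    /* The TOS byte is covered by the header checksum, so refresh it. */
    hdr[10] = 0;
    hdr[11] = 0;
    csum = ipv4_hdr_csum(hdr, hlen);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
    return 1;
}
```

Packets from senders that did not negotiate ECN are left alone, which is why
the return value is useful for counting how many packets were actually
marked.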

Signed-off-by: Natali Shechtman 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 35 ---
 .../ethernet/mellanox/mlx5/core/en_stats.c|  3 ++
 .../ethernet/mellanox/mlx5/core/en_stats.h|  2 ++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 6a959e8b1f9d..2175d6972dc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -690,12 +691,29 @@ static inline void mlx5e_skb_set_hash(struct mlx5_cqe64 *cqe,
skb_set_hash(skb, be32_to_cpu(cqe->rss_hash_result), ht);
 }
 
-static inline bool is_last_ethertype_ip(struct sk_buff *skb, int *network_depth)
+static inline bool is_last_ethertype_ip(struct sk_buff *skb, int *network_depth,
+   __be16 *proto)
 {
-   __be16 ethertype = ((struct ethhdr *)skb->data)->h_proto;
+   *proto = ((struct ethhdr *)skb->data)->h_proto;
+   *proto = __vlan_get_protocol(skb, *proto, network_depth);
+   return (*proto == htons(ETH_P_IP) || *proto == htons(ETH_P_IPV6));
+}
+
+static inline void mlx5e_enable_ecn(struct mlx5e_rq *rq, struct sk_buff *skb)
+{
+   int network_depth = 0;
+   __be16 proto;
+   void *ip;
+   int rc;
 
-   ethertype = __vlan_get_protocol(skb, ethertype, network_depth);
-   return (ethertype == htons(ETH_P_IP) || ethertype == htons(ETH_P_IPV6));
+   if (unlikely(!is_last_ethertype_ip(skb, &network_depth, &proto)))
+   return;
+
+   ip = skb->data + network_depth;
+   rc = ((proto == htons(ETH_P_IP)) ? IP_ECN_set_ce((struct iphdr *)ip) :
+IP6_ECN_set_ce(skb, (struct ipv6hdr *)ip));
+
+   rq->stats->ecn_mark += !!rc;
 }
 
 static __be32 mlx5e_get_fcs(struct sk_buff *skb)
@@ -745,6 +763,7 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 {
struct mlx5e_rq_stats *stats = rq->stats;
int network_depth = 0;
+   __be16 proto;
 
if (unlikely(!(netdev->features & NETIF_F_RXCSUM)))
goto csum_none;
@@ -755,7 +774,7 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
return;
}
 
-   if (likely(is_last_ethertype_ip(skb, &network_depth))) {
+   if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
if (network_depth > ETH_HLEN)
@@ -790,6 +809,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
stats->csum_none++;
 }
 
+#define MLX5E_CE_BIT_MASK 0x80
+
 static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
  u32 cqe_bcnt,
  struct mlx5e_rq *rq,
@@ -834,6 +855,10 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
 
mlx5e_handle_csum(netdev, cqe, rq, skb, !!lro_num_seg);
+   /* checking CE bit in cqe - MSB in ml_path field */
+   if (unlikely(cqe->ml_path & MLX5E_CE_BIT_MASK))
+   mlx5e_enable_ecn(rq, skb);
+
skb->protocol = eth_type_trans(skb, netdev);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 6839481f7697..90c7607b1f44 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -53,6 +53,7 @@ static const struct counter_desc sw_stats_desc[] = {
 
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_packets) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_bytes) },
+   { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_ecn_mark) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_removed_vlan_packets) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_none) },
@@ -144,6 +145,7 @@ void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
s->rx_bytes += rq_stats->bytes;
s->rx_lro_packets += 

[net-next 7/9] net/mlx5e: Replace PTP clock lock from RW lock to seq lock

2018-09-05 Thread Saeed Mahameed
From: Shay Agroskin 

Changed "priv.clock.lock" lock from 'rw_lock' to 'seq_lock'
in order to improve packet rate performance.

Tested on Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz.
Sent 64b packets between two peers connected by ConnectX-5,
and measured packet rate for the receiver in three modes:
no time-stamping (base rate)
time-stamping using rw_lock (old lock) for critical region
time-stamping using seq_lock (new lock) for critical region
Only the receiver time stamped its packets.

The measured packet rate improvements are:

Single flow (multiple TX rings to single RX ring):
without timestamping: 4.26 (M packets)/sec
with rw-lock (old lock):  4.1  (M packets)/sec
with seq-lock (new lock): 4.16 (M packets)/sec
1.46% improvement

Multiple flows (multiple TX rings to six RX rings):
without timestamping: 22   (M packets)/sec
with rw-lock (old lock):  11.7 (M packets)/sec
with seq-lock (new lock): 21.3 (M packets)/sec
82.05% improvement

The packet rate improvement is due to the seq-lock requiring no atomic
operations from the 'readers'.
Since there are many more 'readers' than 'writers' contending on this
lock, almost all atomic operations are saved.
This results in a dramatic decrease in overall cache misses.
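
The reader/writer asymmetry described above can be sketched with C11 atomics.
This is a simplified user-space illustration, not the kernel's seqlock_t: it
assumes a single writer (the kernel additionally serializes writers with a
spinlock), and the names seq_clock, seq_clock_write() and seq_clock_read()
are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct seq_clock {
    atomic_uint seq;          /* even: data stable, odd: write in progress */
    _Atomic uint64_t cycles;
    _Atomic uint64_t nsec;
};

/* Writer: bump the counter to odd, update the data, bump it to even. */
static void seq_clock_write(struct seq_clock *c, uint64_t cycles, uint64_t nsec)
{
    unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->cycles, cycles, memory_order_relaxed);
    atomic_store_explicit(&c->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader: only loads, never stores to shared memory. It retries if a
 * write was in progress or completed while the data was being read. */
static void seq_clock_read(struct seq_clock *c, uint64_t *cycles, uint64_t *nsec)
{
    unsigned int s;

    do {
        s = atomic_load_explicit(&c->seq, memory_order_acquire);
        *cycles = atomic_load_explicit(&c->cycles, memory_order_relaxed);
        *nsec = atomic_load_explicit(&c->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while ((s & 1u) ||
             atomic_load_explicit(&c->seq, memory_order_relaxed) != s);
}
```

Because readers perform no stores, the cache line holding the lock is not
bounced between the CPUs that only read, which is the effect a reader/writer
lock cannot offer: there every reader has to update the lock word.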

Signed-off-by: Shay Agroskin 
Signed-off-by: Saeed Mahameed 
---
 .../ethernet/mellanox/mlx5/core/lib/clock.c   | 34 +--
 .../ethernet/mellanox/mlx5/core/lib/clock.h   |  8 +++--
 include/linux/mlx5/driver.h   |  2 +-
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index 3f767cde4c1d..0d90b1b4a3d3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -111,10 +111,10 @@ static void mlx5_pps_out(struct work_struct *work)
for (i = 0; i < clock->ptp_info.n_pins; i++) {
u64 tstart;
 
-   write_lock_irqsave(&clock->lock, flags);
+   write_seqlock_irqsave(&clock->lock, flags);
tstart = clock->pps_info.start[i];
clock->pps_info.start[i] = 0;
-   write_unlock_irqrestore(&clock->lock, flags);
+   write_sequnlock_irqrestore(&clock->lock, flags);
if (!tstart)
continue;
 
@@ -132,10 +132,10 @@ static void mlx5_timestamp_overflow(struct work_struct *work)
overflow_work);
unsigned long flags;
 
-   write_lock_irqsave(&clock->lock, flags);
+   write_seqlock_irqsave(&clock->lock, flags);
timecounter_read(&clock->tc);
mlx5_update_clock_info_page(clock->mdev);
-   write_unlock_irqrestore(&clock->lock, flags);
+   write_sequnlock_irqrestore(&clock->lock, flags);
schedule_delayed_work(&clock->overflow_work, clock->overflow_period);
 }
 
@@ -147,10 +147,10 @@ static int mlx5_ptp_settime(struct ptp_clock_info *ptp,
u64 ns = timespec64_to_ns(ts);
unsigned long flags;
 
-   write_lock_irqsave(&clock->lock, flags);
+   write_seqlock_irqsave(&clock->lock, flags);
timecounter_init(&clock->tc, &clock->cycles, ns);
mlx5_update_clock_info_page(clock->mdev);
-   write_unlock_irqrestore(&clock->lock, flags);
+   write_sequnlock_irqrestore(&clock->lock, flags);
 
return 0;
 }
@@ -162,9 +162,9 @@ static int mlx5_ptp_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
u64 ns;
unsigned long flags;
 
-   write_lock_irqsave(&clock->lock, flags);
+   write_seqlock_irqsave(&clock->lock, flags);
ns = timecounter_read(&clock->tc);
-   write_unlock_irqrestore(&clock->lock, flags);
+   write_sequnlock_irqrestore(&clock->lock, flags);
 
*ts = ns_to_timespec64(ns);
 
@@ -177,10 +177,10 @@ static int mlx5_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
ptp_info);
unsigned long flags;
 
-   write_lock_irqsave(&clock->lock, flags);
+   write_seqlock_irqsave(&clock->lock, flags);
timecounter_adjtime(&clock->tc, delta);
mlx5_update_clock_info_page(clock->mdev);
-   write_unlock_irqrestore(&clock->lock, flags);
+   write_sequnlock_irqrestore(&clock->lock, flags);
 
return 0;
 }
@@ -203,12 +203,12 @@ static int mlx5_ptp_adjfreq(struct ptp_clock_info *ptp, s32 delta)
adj *= delta;
diff = div_u64(adj, 1000000000ULL);
 
-   write_lock_irqsave(&clock->lock, flags);
+   write_seqlock_irqsave(&clock->lock, flags);
timecounter_read(&clock->tc);
clock->cycles.mult = neg_adj ? clock->nominal_c_mult - diff :
   clock->nominal_c_mult + diff;
mlx5_update_clock_info_page(clock->mdev);
-   write_unlock_irqrestore(&clock->lock, flags);
+   write_sequnlock_irqrestore(&clock->lock, flags);
 
return 0;
 }
@@ -307,12 +307,12 @@ static int 

[net-next 5/9] net/mlx5e: Move mlx5e_priv_flags into en_ethtool.c

2018-09-05 Thread Saeed Mahameed
From: Kamal Heib 

Move the definition of mlx5e_priv_flags into en_ethtool.c because it's
only used there.

Fixes: 4e59e2888139 ("net/mlx5e: Introduce net device priv flags infrastructure")
Signed-off-by: Kamal Heib 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 7 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 7 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index db2cfcd21d43..de0f7702c86a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -204,13 +204,6 @@ struct mlx5e_umr_wqe {
 
 extern const char mlx5e_self_tests[][ETH_GSTRING_LEN];
 
-static const char mlx5e_priv_flags[][ETH_GSTRING_LEN] = {
-   "rx_cqe_moder",
-   "tx_cqe_moder",
-   "rx_cqe_compress",
-   "rx_striding_rq",
-};
-
 enum mlx5e_priv_flag {
MLX5E_PFLAG_RX_CQE_BASED_MODER = (1 << 0),
MLX5E_PFLAG_TX_CQE_BASED_MODER = (1 << 1),
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 98dd3e0ada72..8cd338ceb237 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -135,6 +135,13 @@ void mlx5e_build_ptys2ethtool_map(void)
   ETHTOOL_LINK_MODE_50000baseKR2_Full_BIT);
 }
 
+static const char mlx5e_priv_flags[][ETH_GSTRING_LEN] = {
+   "rx_cqe_moder",
+   "tx_cqe_moder",
+   "rx_cqe_compress",
+   "rx_striding_rq",
+};
+
 int mlx5e_ethtool_get_sset_count(struct mlx5e_priv *priv, int sset)
 {
int i, num_stats = 0;
-- 
2.17.1



[pull request][net-next 0/9] Mellanox, mlx5 ethernet updates 2018-09-05

2018-09-05 Thread Saeed Mahameed
Hi Dave,

This pull request provides some updates to mlx5 ethernet driver.

For more information please see tag log below.

Please pull and let me know if there's any problem.

Thanks,
Saeed.

---

The following changes since commit 05dcc71298643256948a2e17db7dbecc748719d2:

  net: lan743x_ptp: make function lan743x_ptp_set_sync_ts_insert() static 
(2018-09-05 08:07:05 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5e-updates-2018-09-05

for you to fetch changes up to fe1dc069990c1f290ef6b99adb46332c03258f38:

  net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets (2018-09-05 21:14:57 
-0700)


mlx5e-updates-2018-09-05

This series provides updates to mlx5 ethernet driver.

1) Starting with a four-patch series to optimize flow counter updates,
from Vlad Buslov:
==

By default mlx5 driver updates cached counters each second. Update function
consumes noticeable amount of CPU resources. The goal of this patch series
is to optimize update function.

Investigation revealed following bottlenecks in fs counters
implementation:
 1) Update code(scheduled each second) iterates over all counters twice.
 (first for finding and deleting counters that are marked for deletion,
 second iteration is for actually updating the counters)
 2) Counters are stored in rb tree. Linear iteration over all rb tree
 elements(rb_next in profiling data) consumed ~65% of time spent in
 update function.

Following optimizations were implemented:
 1) Instead of just marking counters for deletion, store them in
 standalone list. This removes first iteration over whole counters tree.
 2) Store counters in sorted list to optimize traversing them and remove
 calls to rb_next.

The first implementation of these changes degraded performance instead of
improving it. Investigation revealed that the first cache line of struct
mlx5_fc is full, and adding anything to it causes the number of cache
misses to double. To mitigate that, the following refactorings were
implemented:
 - Change 'addlist' list type from double linked to single linked. This
 frees space for one additional pointer that is used to
 store the deletion list (optimization 1)
 - Substitute rb tree with idr. Idr is non-intrusive data structure and
 doesn't require adding any new members to struct mlx5_fc. Use free
 space that became available for double linked sorted list that is used
 for traversing all counters. (optimization 2)

Described changes reduced CPU time spent in mlx5_fc_stats_work from 70%
to 44%. (global perf profile mode)


The rest of the series are misc updates:

2) From Kamal, Move mlx5e_priv_flags into en_ethtool.c, to avoid a
compilation warning.

3) From Roi Dayan, Move Q counters allocation and drop RQ to init_rx profile
function to avoid allocating Q counters when not required.

4) From Shay Agroskin, Replace PTP clock lock from RW lock to seq lock.
Almost double the packet rate when timestamping is active on multiple TX
queues.

5) From: Natali Shechtman, set ECN for received packets using CQE indication.

6) From: Alaa Hleihel, don't set CHECKSUM_COMPLETE on SCTP packets.
CHECKSUM_COMPLETE is not applicable to SCTP protocol.


Alaa Hleihel (1):
  net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets

Kamal Heib (1):
  net/mlx5e: Move mlx5e_priv_flags into en_ethtool.c

Natali Shechtman (1):
  net/mlx5e: Set ECN for received packets using CQE indication

Roi Dayan (1):
  net/mlx5e: Move Q counters allocation and drop RQ to init_rx

Shay Agroskin (1):
  net/mlx5e: Replace PTP clock lock from RW lock to seq lock

Vlad Buslov (4):
  net/mlx5: Change flow counters addlist type to single linked list
  net/mlx5: Add new list to store deleted flow counters
  net/mlx5: Store flow counters in a list
  net/mlx5: Add flow counters idr

 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  13 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   7 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  45 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|  47 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |   5 +-
 .../net/ethernet/mellanox/mlx5/core/fs_counters.c  | 184 +++--
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  17 +-
 .../net/ethernet/mellanox/mlx5/core/lib/clock.c|  34 ++--
 .../net/ethernet/mellanox/mlx5/core/lib/clock.h|   8 +-
 include/linux/mlx5/driver.h|  11 +-
 13 files changed, 235 insertions(+), 153 deletions(-)


[net-next 3/9] net/mlx5: Store flow counters in a list

2018-09-05 Thread Saeed Mahameed
From: Vlad Buslov 

In order to improve performance of flow counter stats query loop that
traverses all configured flow counters, replace rb_tree with double-linked
list. This change improves performance of traversing flow counters by
removing the tree traversal (profiling data showed that the call to rb_next
was the top CPU consumer).

However, lookup of a flow counter in the list becomes linear, instead of
logarithmic. This problem is fixed by next patch in series, which adds idr
for fast lookup. Idr is to be used because it is not an intrusive data
structure and doesn't require adding any new members to struct mlx5_fc,
which allows its control data part to stay <= 1 cache line in size.
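
The sorted insertion described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: struct list_node,
struct counter, lookup_next() and sorted_insert() are hypothetical stand-ins
for the kernel's struct list_head, struct mlx5_fc and the helpers added by
the patch, not the driver code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a circular doubly linked list head/node. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical counter type; only the fields the sketch needs. */
struct counter {
	struct list_node list;
	uint32_t id;
};

#define counter_of(node) \
	((struct counter *)((char *)(node) - offsetof(struct counter, list)))

static void list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

/* Insert 'item' immediately before 'pos'. */
static void list_insert_before(struct list_node *item, struct list_node *pos)
{
	item->prev = pos->prev;
	item->next = pos;
	pos->prev->next = item;
	pos->prev = item;
}

/* Linear scan: first node with a larger id, or the head if there is none. */
static struct list_node *lookup_next(struct list_node *head, uint32_t id)
{
	struct list_node *n;

	for (n = head->next; n != head; n = n->next)
		if (counter_of(n)->id > id)
			return n;
	return head;
}

/* Keeps the list sorted by id; cost is O(n) per insertion. */
static void sorted_insert(struct list_node *head, struct counter *c)
{
	list_insert_before(&c->list, lookup_next(head, c->id));
}

/* Inserts ids 7, 3, 5 and returns 1 if traversal yields 3, 5, 7. */
int sorted_insert_demo(void)
{
	struct counter a = { .id = 7 }, b = { .id = 3 }, c = { .id = 5 };
	const uint32_t expect[] = { 3, 5, 7 };
	struct list_node head, *n;
	size_t i = 0;

	list_init(&head);
	sorted_insert(&head, &a);
	sorted_insert(&head, &b);
	sorted_insert(&head, &c);

	for (n = head.next; n != &head; n = n->next, i++)
		if (i >= 3 || counter_of(n)->id != expect[i])
			return 0;
	return i == 3;
}
```

Traversal is then a plain pointer walk with no rebalancing logic, at the
price of the linear lookup_next() scan that the idr in the next patch is
meant to avoid.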

Signed-off-by: Vlad Buslov 
Acked-by: Amir Vadai 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/fs_core.h |  2 +-
 .../ethernet/mellanox/mlx5/core/fs_counters.c | 88 +--
 include/linux/mlx5/driver.h   |  2 +-
 3 files changed, 42 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 617d6239c5f3..a06f83c0c2b6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -139,7 +139,7 @@ struct mlx5_fc_cache {
 };
 
 struct mlx5_fc {
-   struct rb_node node;
+   struct list_head list;
struct llist_node addlist;
struct llist_node dellist;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index f1266f215a31..90ebfee37508 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -73,36 +73,38 @@
  *   elapsed, the thread will actually query the hardware.
  */
 
-static void mlx5_fc_stats_insert(struct rb_root *root, struct mlx5_fc *counter)
+static struct list_head *mlx5_fc_counters_lookup_next(struct mlx5_core_dev 
*dev,
+ u32 id)
 {
-   struct rb_node **new = >rb_node;
-   struct rb_node *parent = NULL;
-
-   while (*new) {
-   struct mlx5_fc *this = rb_entry(*new, struct mlx5_fc, node);
-   int result = counter->id - this->id;
-
-   parent = *new;
-   if (result < 0)
-   new = &((*new)->rb_left);
-   else
-   new = &((*new)->rb_right);
-   }
+   struct mlx5_fc_stats *fc_stats = >priv.fc_stats;
+   struct mlx5_fc *counter;
+
+   list_for_each_entry(counter, _stats->counters, list)
+   if (counter->id > id)
+   return >list;
+
+   return _stats->counters;
+}
+
+static void mlx5_fc_stats_insert(struct mlx5_core_dev *dev,
+struct mlx5_fc *counter)
+{
+   struct list_head *next = mlx5_fc_counters_lookup_next(dev, counter->id);
 
-   /* Add new node and rebalance tree. */
-   rb_link_node(>node, parent, new);
-   rb_insert_color(>node, root);
+   list_add_tail(>list, next);
 }
 
-/* The function returns the last node that was queried so the caller
+/* The function returns the last counter that was queried so the caller
  * function can continue calling it till all counters are queried.
  */
-static struct rb_node *mlx5_fc_stats_query(struct mlx5_core_dev *dev,
+static struct mlx5_fc *mlx5_fc_stats_query(struct mlx5_core_dev *dev,
   struct mlx5_fc *first,
   u32 last_id)
 {
+   struct mlx5_fc_stats *fc_stats = >priv.fc_stats;
+   struct mlx5_fc *counter = NULL;
struct mlx5_cmd_fc_bulk *b;
-   struct rb_node *node = NULL;
+   bool more = false;
u32 afirst_id;
int num;
int err;
@@ -132,14 +134,16 @@ static struct rb_node *mlx5_fc_stats_query(struct 
mlx5_core_dev *dev,
goto out;
}
 
-   for (node = >node; node; node = rb_next(node)) {
-   struct mlx5_fc *counter = rb_entry(node, struct mlx5_fc, node);
+   counter = first;
+   list_for_each_entry_from(counter, _stats->counters, list) {
struct mlx5_fc_cache *c = >cache;
u64 packets;
u64 bytes;
 
-   if (counter->id > last_id)
+   if (counter->id > last_id) {
+   more = true;
break;
+   }
 
mlx5_cmd_fc_bulk_get(dev, b,
 counter->id, , );
@@ -155,7 +159,7 @@ static struct rb_node *mlx5_fc_stats_query(struct 
mlx5_core_dev *dev,
 out:
mlx5_cmd_fc_bulk_free(b);
 
-   return node;
+   return more ? counter : NULL;
 }
 
 static void mlx5_free_fc(struct mlx5_core_dev *dev,
@@ -173,33 +177,30 @@ static void mlx5_fc_stats_work(struct work_struct *work)
  

[net-next 6/9] net/mlx5e: Move Q counters allocation and drop RQ to init_rx

2018-09-05 Thread Saeed Mahameed
From: Roi Dayan 

Not all profiles query the HW Q counters in update_stats() callback.
HW Q counters are limited per device and in case of representors all
their Q counters are allocated on the parent PF device.
Avoid redundant allocation of HW Q counters by moving the allocation
to init_rx profile callback.

Signed-off-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  6 +++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 45 +--
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 12 -
 .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 17 ++-
 4 files changed, 55 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index de0f7702c86a..01a967e717e7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -898,6 +898,12 @@ void mlx5e_destroy_mdev_resources(struct mlx5_core_dev 
*mdev);
 int mlx5e_refresh_tirs(struct mlx5e_priv *priv, bool enable_uc_lb);
 
 /* common netdev helpers */
+void mlx5e_create_q_counters(struct mlx5e_priv *priv);
+void mlx5e_destroy_q_counters(struct mlx5e_priv *priv);
+int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
+  struct mlx5e_rq *drop_rq);
+void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq);
+
 int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv);
 
 int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a7939e70190..d14c4051edd8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3049,8 +3049,8 @@ static int mlx5e_alloc_drop_cq(struct mlx5_core_dev *mdev,
return mlx5e_alloc_cq_common(mdev, param, cq);
 }
 
-static int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
- struct mlx5e_rq *drop_rq)
+int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
+  struct mlx5e_rq *drop_rq)
 {
struct mlx5_core_dev *mdev = priv->mdev;
struct mlx5e_cq_param cq_param = {};
@@ -3094,7 +3094,7 @@ static int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
return err;
 }
 
-static void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq)
+void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq)
 {
mlx5e_destroy_rq(drop_rq);
mlx5e_free_rq(drop_rq);
@@ -4726,7 +4726,7 @@ static void mlx5e_build_nic_netdev(struct net_device 
*netdev)
mlx5e_tls_build_netdev(priv);
 }
 
-static void mlx5e_create_q_counters(struct mlx5e_priv *priv)
+void mlx5e_create_q_counters(struct mlx5e_priv *priv)
 {
struct mlx5_core_dev *mdev = priv->mdev;
int err;
@@ -4744,7 +4744,7 @@ static void mlx5e_create_q_counters(struct mlx5e_priv 
*priv)
}
 }
 
-static void mlx5e_destroy_q_counters(struct mlx5e_priv *priv)
+void mlx5e_destroy_q_counters(struct mlx5e_priv *priv)
 {
if (priv->q_counter)
mlx5_core_dealloc_q_counter(priv->mdev, priv->q_counter);
@@ -4783,9 +4783,17 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
struct mlx5_core_dev *mdev = priv->mdev;
int err;
 
+   mlx5e_create_q_counters(priv);
+
+   err = mlx5e_open_drop_rq(priv, >drop_rq);
+   if (err) {
+   mlx5_core_err(mdev, "open drop rq failed, %d\n", err);
+   goto err_destroy_q_counters;
+   }
+
err = mlx5e_create_indirect_rqt(priv);
if (err)
-   return err;
+   goto err_close_drop_rq;
 
err = mlx5e_create_direct_rqts(priv);
if (err)
@@ -4821,6 +4829,10 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
mlx5e_destroy_direct_rqts(priv);
 err_destroy_indirect_rqts:
mlx5e_destroy_rqt(priv, >indir_rqt);
+err_close_drop_rq:
+   mlx5e_close_drop_rq(>drop_rq);
+err_destroy_q_counters:
+   mlx5e_destroy_q_counters(priv);
return err;
 }
 
@@ -4832,6 +4844,8 @@ static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv)
mlx5e_destroy_indirect_tirs(priv);
mlx5e_destroy_direct_rqts(priv);
mlx5e_destroy_rqt(priv, >indir_rqt);
+   mlx5e_close_drop_rq(>drop_rq);
+   mlx5e_destroy_q_counters(priv);
 }
 
 static int mlx5e_init_nic_tx(struct mlx5e_priv *priv)
@@ -4975,7 +4989,6 @@ struct net_device *mlx5e_create_netdev(struct 
mlx5_core_dev *mdev,
 
 int mlx5e_attach_netdev(struct mlx5e_priv *priv)
 {
-   struct mlx5_core_dev *mdev = priv->mdev;
const struct mlx5e_profile *profile;
int err;
 
@@ -4986,28 +4999,16 @@ int mlx5e_attach_netdev(struct mlx5e_priv *priv)
if (err)
goto out;
 
-   mlx5e_create_q_counters(priv);
-
-   err = mlx5e_open_drop_rq(priv, >drop_rq);
-   if (err) {
-   mlx5_core_err(mdev, "open drop rq failed, %d\n", err);
-   goto err_destroy_q_counters;
- 

[net-next 1/9] net/mlx5: Change flow counters addlist type to single linked list

2018-09-05 Thread Saeed Mahameed
From: Vlad Buslov 

In order to prevent flow counters stats work function from traversing whole
flow counters tree while searching for deleted flow counters, new list to
store deleted flow counters will be added to struct mlx5_fc_stats. However,
the flow counter structure itself has no space left to store any more data
in first cache line. To free space that is needed to store additional list
node, convert current addlist double linked list (two pointers per node) to
atomic single linked list (one pointer per node).

Lockless NULL-terminated single linked list data type doesn't require any
additional external synchronization for operations used by flow counters
module (add single new element, remove all elements from list and traverse
them). Remove addlist_lock that is no longer needed.
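
The three operations named above (add a single element, remove all
elements, traverse them) can be sketched with C11 atomics. This is a
userspace approximation for illustration only; struct lnode, struct lhead,
lpush() and ltake_all() are hypothetical names and this is not the kernel's
llist implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One pointer per node, NULL-terminated chain. */
struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

static void lhead_init(struct lhead *h)
{
	atomic_init(&h->first, (struct lnode *)NULL);
}

/* Add a single element: retry the compare-and-swap until it succeeds. */
static void lpush(struct lhead *h, struct lnode *n)
{
	struct lnode *old = atomic_load(&h->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
}

/* Remove all elements: one atomic exchange detaches the whole chain. */
static struct lnode *ltake_all(struct lhead *h)
{
	return atomic_exchange(&h->first, (struct lnode *)NULL);
}

/* Pushes three nodes, detaches them, and returns how many were traversed. */
int llist_demo(void)
{
	struct lnode a, b, c, *n;
	struct lhead h;
	int count = 0;

	lhead_init(&h);
	lpush(&h, &a);
	lpush(&h, &b);
	lpush(&h, &c);

	n = ltake_all(&h);
	/* Elements come back in LIFO order: c, b, a. */
	if (n != &c || n->next != &b || n->next->next != &a)
		return -1;
	for (; n; n = n->next)
		count++;

	/* The head is empty after the exchange. */
	if (ltake_all(&h) != NULL)
		return -1;
	return count;
}
```

Once the chain has been detached, the consumer owns it exclusively, so the
traversal itself needs no synchronization.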

Signed-off-by: Vlad Buslov 
Acked-by: Amir Vadai 
Reviewed-by: Paul Blakey 
Reviewed-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/fs_core.h |  3 +-
 .../ethernet/mellanox/mlx5/core/fs_counters.c | 45 +--
 include/linux/mlx5/driver.h   |  4 +-
 3 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 32070e5d993d..f68590291e0c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum fs_node_type {
FS_TYPE_NAMESPACE,
@@ -139,7 +140,7 @@ struct mlx5_fc_cache {
 
 struct mlx5_fc {
struct rb_node node;
-   struct list_head list;
+   struct llist_node addlist;
 
/* last{packets,bytes} members are used when calculating the delta since
 * last reading
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index 58af6be13dfa..d996d6cf9e19 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -52,7 +52,9 @@
  * access to counter list:
  * - create (user context)
  *   - mlx5_fc_create() only adds to an addlist to be used by
- * mlx5_fc_stats_query_work(). addlist is protected by a spinlock.
+ * mlx5_fc_stats_query_work(). addlist is a lockless single linked list
+ * that doesn't require any additional synchronization when adding single
+ * node.
  *   - spawn thread to do the actual destroy
  *
  * - destroy (user context)
@@ -156,28 +158,29 @@ static struct rb_node *mlx5_fc_stats_query(struct 
mlx5_core_dev *dev,
return node;
 }
 
+static void mlx5_free_fc(struct mlx5_core_dev *dev,
+struct mlx5_fc *counter)
+{
+   mlx5_cmd_fc_free(dev, counter->id);
+   kfree(counter);
+}
+
 static void mlx5_fc_stats_work(struct work_struct *work)
 {
struct mlx5_core_dev *dev = container_of(work, struct mlx5_core_dev,
 priv.fc_stats.work.work);
struct mlx5_fc_stats *fc_stats = >priv.fc_stats;
+   struct llist_node *tmplist = llist_del_all(_stats->addlist);
unsigned long now = jiffies;
struct mlx5_fc *counter = NULL;
struct mlx5_fc *last = NULL;
struct rb_node *node;
-   LIST_HEAD(tmplist);
-
-   spin_lock(_stats->addlist_lock);
 
-   list_splice_tail_init(_stats->addlist, );
-
-   if (!list_empty() || !RB_EMPTY_ROOT(_stats->counters))
+   if (tmplist || !RB_EMPTY_ROOT(_stats->counters))
queue_delayed_work(fc_stats->wq, _stats->work,
   fc_stats->sampling_interval);
 
-   spin_unlock(_stats->addlist_lock);
-
-   list_for_each_entry(counter, , list)
+   llist_for_each_entry(counter, tmplist, addlist)
mlx5_fc_stats_insert(_stats->counters, counter);
 
node = rb_first(_stats->counters);
@@ -229,9 +232,7 @@ struct mlx5_fc *mlx5_fc_create(struct mlx5_core_dev *dev, 
bool aging)
counter->cache.lastuse = jiffies;
counter->aging = true;
 
-   spin_lock(_stats->addlist_lock);
-   list_add(>list, _stats->addlist);
-   spin_unlock(_stats->addlist_lock);
+   llist_add(>addlist, _stats->addlist);
 
mod_delayed_work(fc_stats->wq, _stats->work, 0);
}
@@ -268,8 +269,7 @@ int mlx5_init_fc_stats(struct mlx5_core_dev *dev)
struct mlx5_fc_stats *fc_stats = >priv.fc_stats;
 
fc_stats->counters = RB_ROOT;
-   INIT_LIST_HEAD(_stats->addlist);
-   spin_lock_init(_stats->addlist_lock);
+   init_llist_head(_stats->addlist);
 
fc_stats->wq = create_singlethread_workqueue("mlx5_fc");
if (!fc_stats->wq)
@@ -284,6 +284,7 @@ int mlx5_init_fc_stats(struct mlx5_core_dev *dev)
 void mlx5_cleanup_fc_stats(struct mlx5_core_dev *dev)
 {
struct mlx5_fc_stats *fc_stats = 

[net 08/10] net/mlx5: Check for error in mlx5_attach_interface

2018-09-05 Thread Saeed Mahameed
From: Huy Nguyen 

Currently, mlx5_attach_interface does not check for error
after calling intf->attach or intf->add. When these two calls
fail, the client is not initialized and will cause issues such as
kernel panic on invalid address in the teardown path (mlx5_detach_interface).
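
The idea of the fix, setting a state bit only after the callback has
succeeded, can be sketched as follows. All names here (struct iface,
struct dev_ctx, IFACE_ADDED, IFACE_ATTACHED, attach_demo) are hypothetical
simplifications for illustration and do not match the driver's types.

```c
#include <assert.h>
#include <stddef.h>

enum {
	IFACE_ADDED    = 1u << 0,
	IFACE_ATTACHED = 1u << 1,
};

struct iface {
	void *(*add)(void);        /* returns a context, or NULL on failure */
	int (*attach)(void *ctx);  /* returns 0 on success; may be NULL */
};

struct dev_ctx {
	void *context;
	unsigned int state;
};

/* Mark the interface as attached/added only when the callback succeeded. */
void attach_interface(const struct iface *intf, struct dev_ctx *ctx)
{
	if (intf->attach) {
		if (ctx->state & IFACE_ATTACHED)
			return;
		if (intf->attach(ctx->context))
			return;            /* failure: leave the bit clear */
		ctx->state |= IFACE_ATTACHED;
	} else {
		if (ctx->state & IFACE_ADDED)
			return;
		ctx->context = intf->add();
		if (!ctx->context)
			return;            /* failure: leave the bit clear */
		ctx->state |= IFACE_ADDED;
	}
}

static int dummy_ctx;

static void *add_ok(void)      { return &dummy_ctx; }
static void *add_fails(void)   { return NULL; }
static int attach_ok(void *c)    { (void)c; return 0; }
static int attach_fails(void *c) { (void)c; return -1; }

/* Returns the state bits left behind for a given callback outcome. */
unsigned int attach_demo(int has_attach, int fail)
{
	struct iface intf = {
		.add = fail ? add_fails : add_ok,
		.attach = has_attach ? (fail ? attach_fails : attach_ok) : NULL,
	};
	struct dev_ctx ctx = {
		.context = has_attach ? &dummy_ctx : NULL,
		.state = 0,
	};

	attach_interface(&intf, &ctx);
	return ctx.state;
}
```

A teardown path that tests these bits before calling detach or remove then
skips clients that were never initialized.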

Fixes: 737a234bb638 ("net/mlx5: Introduce attach/detach to interface API")
Signed-off-by: Huy Nguyen 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/dev.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c 
b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
index ada723bd91b6..37ba7c78859d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
@@ -132,11 +132,11 @@ void mlx5_add_device(struct mlx5_interface *intf, struct 
mlx5_priv *priv)
delayed_event_start(priv);
 
dev_ctx->context = intf->add(dev);
-   set_bit(MLX5_INTERFACE_ADDED, _ctx->state);
-   if (intf->attach)
-   set_bit(MLX5_INTERFACE_ATTACHED, _ctx->state);
-
if (dev_ctx->context) {
+   set_bit(MLX5_INTERFACE_ADDED, _ctx->state);
+   if (intf->attach)
+   set_bit(MLX5_INTERFACE_ATTACHED, _ctx->state);
+
spin_lock_irq(>ctx_lock);
list_add_tail(_ctx->list, >ctx_list);
 
@@ -211,12 +211,17 @@ static void mlx5_attach_interface(struct mlx5_interface 
*intf, struct mlx5_priv
if (intf->attach) {
if (test_bit(MLX5_INTERFACE_ATTACHED, _ctx->state))
goto out;
-   intf->attach(dev, dev_ctx->context);
+   if (intf->attach(dev, dev_ctx->context))
+   goto out;
+
set_bit(MLX5_INTERFACE_ATTACHED, _ctx->state);
} else {
if (test_bit(MLX5_INTERFACE_ADDED, _ctx->state))
goto out;
dev_ctx->context = intf->add(dev);
+   if (!dev_ctx->context)
+   goto out;
+
set_bit(MLX5_INTERFACE_ADDED, _ctx->state);
}
 
-- 
2.17.1



[net 09/10] net/mlx5e: Ethtool steering, fix udp source port value

2018-09-05 Thread Saeed Mahameed
A copy-and-paste bug was introduced in the offending patch.
We need to write the UDP source port value into the headers value and not
into the headers criteria ("mask").
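
To illustrate why the mix-up matters, here is a small model of mask/value
matching. It is a sketch under assumptions: struct match_spec and the
helper names are hypothetical, a full 0xffff mask is assumed, byte order
conversion is ignored, and the matching rule is the usual
"(packet & mask) == (value & mask)" convention rather than a statement
about the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-field rule: criteria (mask) and value. */
struct match_spec {
	uint16_t udp_sport_mask;
	uint16_t udp_sport_value;
};

/* A packet matches when its masked port equals the masked rule value. */
static int matches(const struct match_spec *s, uint16_t pkt_sport)
{
	return (pkt_sport & s->udp_sport_mask) ==
	       (s->udp_sport_value & s->udp_sport_mask);
}

/* Buggy variant: the second write lands in the mask, the value stays 0. */
static void set_udp_sport_buggy(struct match_spec *s, uint16_t sport)
{
	s->udp_sport_mask = 0xffff;  /* assumed full mask */
	s->udp_sport_mask = sport;   /* overwrites the mask by mistake */
}

/* Fixed variant: mask and value go to their own fields. */
static void set_udp_sport_fixed(struct match_spec *s, uint16_t sport)
{
	s->udp_sport_mask = 0xffff;  /* assumed full mask */
	s->udp_sport_value = sport;
}

/* Builds a rule for rule_sport and reports whether pkt_sport matches it. */
int rule_matches(int fixed, uint16_t rule_sport, uint16_t pkt_sport)
{
	struct match_spec s = { 0, 0 };

	if (fixed)
		set_udp_sport_fixed(&s, rule_sport);
	else
		set_udp_sport_buggy(&s, rule_sport);
	return matches(&s, pkt_sport);
}
```

In this model the buggy rule for port 4789 rejects port 4789 itself and
accepts any port that shares no set bits with it (for example 64), while
the fixed rule matches exactly the requested port.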

Fixes: 142644f8a1f8 ("net/mlx5e: Ethtool steering flow parsing refactoring")
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
index 75bb981e00b7..41cde926cdab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
@@ -191,7 +191,7 @@ set_udp(void *headers_c, void *headers_v, __be16 psrc_m, 
__be16 psrc_v,
 {
if (psrc_m) {
MLX5E_FTE_SET(headers_c, udp_sport, 0x);
-   MLX5E_FTE_SET(headers_c, udp_sport, ntohs(psrc_v));
+   MLX5E_FTE_SET(headers_v, udp_sport, ntohs(psrc_v));
}
 
if (pdst_m) {
-- 
2.17.1



[net 05/10] net/mlx5: E-Switch, Fix memory leak when creating switchdev mode FDB tables

2018-09-05 Thread Saeed Mahameed
From: Raed Salem 

The memory allocated for the slow path table flow group input structure
was not freed upon successful return, fix that.

Fixes: 1967ce6ea5c8 ("net/mlx5: E-Switch, Refactor fast path FDB table creation 
in switchdev mode")
Signed-off-by: Raed Salem 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index f72b5c9dcfe9..3028e8d90920 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -663,6 +663,7 @@ static int esw_create_offloads_fdb_tables(struct 
mlx5_eswitch *esw, int nvports)
if (err)
goto miss_rule_err;
 
+   kvfree(flow_group_in);
return 0;
 
 miss_rule_err:
-- 
2.17.1



[net 10/10] net/mlx5: Fix possible deadlock from lockdep when adding fte to fg

2018-09-05 Thread Saeed Mahameed
From: Roi Dayan 

This is a false positive report due to incorrect nested lock
annotations as we lock multiple fgs with the same subclass.
Instead of locking all fgs only lock the one being used as was
done before.

Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in 
parallel")
Signed-off-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/fs_core.c | 74 +--
 1 file changed, 37 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 384b560f2a93..37d114c668b7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1578,6 +1578,33 @@ static u64 matched_fgs_get_version(struct list_head 
*match_head)
return version;
 }
 
+static struct fs_fte *
+lookup_fte_locked(struct mlx5_flow_group *g,
+ u32 *match_value,
+ bool take_write)
+{
+   struct fs_fte *fte_tmp;
+
+   if (take_write)
+   nested_down_write_ref_node(>node, FS_LOCK_PARENT);
+   else
+   nested_down_read_ref_node(>node, FS_LOCK_PARENT);
+   fte_tmp = rhashtable_lookup_fast(>ftes_hash, match_value,
+rhash_fte);
+   if (!fte_tmp || !tree_get_node(_tmp->node)) {
+   fte_tmp = NULL;
+   goto out;
+   }
+
+   nested_down_write_ref_node(_tmp->node, FS_LOCK_CHILD);
+out:
+   if (take_write)
+   up_write_ref_node(>node);
+   else
+   up_read_ref_node(>node);
+   return fte_tmp;
+}
+
 static struct mlx5_flow_handle *
 try_add_to_existing_fg(struct mlx5_flow_table *ft,
   struct list_head *match_head,
@@ -1600,10 +1627,6 @@ try_add_to_existing_fg(struct mlx5_flow_table *ft,
if (IS_ERR(fte))
return  ERR_PTR(-ENOMEM);
 
-   list_for_each_entry(iter, match_head, list) {
-   nested_down_read_ref_node(>g->node, FS_LOCK_PARENT);
-   }
-
 search_again_locked:
version = matched_fgs_get_version(match_head);
/* Try to find a fg that already contains a matching fte */
@@ -1611,20 +1634,9 @@ try_add_to_existing_fg(struct mlx5_flow_table *ft,
struct fs_fte *fte_tmp;
 
g = iter->g;
-   fte_tmp = rhashtable_lookup_fast(>ftes_hash, 
spec->match_value,
-rhash_fte);
-   if (!fte_tmp || !tree_get_node(_tmp->node))
+   fte_tmp = lookup_fte_locked(g, spec->match_value, take_write);
+   if (!fte_tmp)
continue;
-
-   nested_down_write_ref_node(_tmp->node, FS_LOCK_CHILD);
-   if (!take_write) {
-   list_for_each_entry(iter, match_head, list)
-   up_read_ref_node(>g->node);
-   } else {
-   list_for_each_entry(iter, match_head, list)
-   up_write_ref_node(>g->node);
-   }
-
rule = add_rule_fg(g, spec->match_value,
   flow_act, dest, dest_num, fte_tmp);
up_write_ref_node(_tmp->node);
@@ -1633,19 +1645,6 @@ try_add_to_existing_fg(struct mlx5_flow_table *ft,
return rule;
}
 
-   /* No group with matching fte found. Try to add a new fte to any
-* matching fg.
-*/
-
-   if (!take_write) {
-   list_for_each_entry(iter, match_head, list)
-   up_read_ref_node(>g->node);
-   list_for_each_entry(iter, match_head, list)
-   nested_down_write_ref_node(>g->node,
-  FS_LOCK_PARENT);
-   take_write = true;
-   }
-
/* Check the ft version, for case that new flow group
 * was added while the fgs weren't locked
 */
@@ -1657,27 +1656,30 @@ try_add_to_existing_fg(struct mlx5_flow_table *ft,
/* Check the fgs version, for case the new FTE with the
 * same values was added while the fgs weren't locked
 */
-   if (version != matched_fgs_get_version(match_head))
+   if (version != matched_fgs_get_version(match_head)) {
+   take_write = true;
goto search_again_locked;
+   }
 
list_for_each_entry(iter, match_head, list) {
g = iter->g;
 
if (!g->node.active)
continue;
+
+   nested_down_write_ref_node(>node, FS_LOCK_PARENT);
+
err = insert_fte(g, fte);
if (err) {
+   up_write_ref_node(>node);
if (err == -ENOSPC)
continue;
-   list_for_each_entry(iter, match_head, list)
-  

[net 06/10] net/mlx5: Fix not releasing read lock when adding flow rules

2018-09-05 Thread Saeed Mahameed
From: Roi Dayan 

If building match list fg fails and we never jumped to
search_again_locked label then the function returned without
unlocking the read lock.

Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in 
parallel")
Signed-off-by: Roi Dayan 
Reviewed-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index f418541af7cf..384b560f2a93 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1726,6 +1726,8 @@ _mlx5_add_flow_rules(struct mlx5_flow_table *ft,
if (err) {
if (take_write)
up_write_ref_node(>node);
+   else
+   up_read_ref_node(>node);
return ERR_PTR(err);
}
 
-- 
2.17.1



Re: [PATCH mlx5-next v1 05/15] net/mlx5: Break encap/decap into two separated flow table creation flags

2018-09-05 Thread Leon Romanovsky
On Thu, Sep 06, 2018 at 12:37:17AM +0300, Or Gerlitz wrote:
> On Wed, Sep 5, 2018 at 9:11 PM, Leon Romanovsky  wrote:
> > On Wed, Sep 05, 2018 at 10:38:00AM -0600, Jason Gunthorpe wrote:
> >> On Wed, Sep 05, 2018 at 08:10:25AM +0300, Leon Romanovsky wrote:
> >> > > > -   int en_encap_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN);
> >> > > > +   int en_encap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP);
> >> > > > +   int en_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
> >> > >
> >> > > Yuk, please don't use !!.
> >> > >
> >> > >   bool en_decap = flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
> >> >
> >> > We need to provide en_encap and en_decap as an input to MLX5_SET(...)
> >> > which is passed to FW as 0 or 1.
> >> >
> >> > Boolean type is declared in C as int and treated as zero for false
> >> > and any other value for true,
> >>
> >> No, that isn't right, the kernel uses C99's _Bool intrinsic type, which
> >> is guaranteed to only hold 0 or 1 by the compiler.
> >>
> >> See types.h:
> >>
> >> typedef _Bool   bool;
> >
> > Exciting, it took me a while to find C99 standard and relevant 6.3.1.2.
> > Anyway, this patch didn't change previous functionality, which used "!!"
> > convention.
>
> so? if we didn't do things properly prior to the patch, why not fixing it 
> along
> with the patch? lets fix

Or,

What exactly "to fix"? Both code lines:
1. Have correct syntax
2. Implement proper C99
3. Give same compiler code
4. Have same readability

There is nothing to fix.

And this patch is already merged, so if you truly care about this,
please go ahead and prepare patch for whole driver, or better for
whole kernel.

 kernel git:(rdma-next) git grep "\!\!" |wc -l
8125

Thanks




[net 04/10] net/mlx5: Use u16 for Work Queue buffer strides offset

2018-09-05 Thread Saeed Mahameed
From: Tariq Toukan 

Minimal stride size is 16.
Hence, the number of strides in a fragment (of PAGE_SIZE)
is <= PAGE_SIZE / 16 <= 4K.

u16 is sufficient to represent this.
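
The bound can be checked with a few lines of C. This is only a worked
arithmetic sketch; MIN_STRIDE_SIZE and the helper names are hypothetical,
and the page sizes tried (4 KiB and 64 KiB) are assumptions about typical
configurations, not values taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STRIDE_SIZE 16u  /* minimal stride size stated above */

/* Upper bound on the number of strides in one fragment of page_size bytes. */
uint32_t max_strides_per_frag(uint32_t page_size)
{
	return page_size / MIN_STRIDE_SIZE;
}

/* Non-zero when v can be stored in a u16 without truncation. */
int fits_in_u16(uint32_t v)
{
	return v <= UINT16_MAX;
}
```

With 4 KiB pages the bound is 256 strides, and with 64 KiB pages it is
4096, both well below the u16 maximum of 65535.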

Fixes: d7037ad73daa ("net/mlx5: Fix QP fragmented buffer allocation")
Signed-off-by: Tariq Toukan 
Reviewed-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/wq.c | 2 +-
 include/linux/mlx5/driver.h  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
index d838af9539b1..68e7f8df2a6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
@@ -138,7 +138,7 @@ int mlx5_wq_qp_create(struct mlx5_core_dev *mdev, struct 
mlx5_wq_param *param,
  void *qpc, struct mlx5_wq_qp *wq,
  struct mlx5_wq_ctrl *wq_ctrl)
 {
-   u32 sq_strides_offset;
+   u16 sq_strides_offset;
u32 rq_pg_remainder;
int err;
 
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 3a1258fd8ac3..66d94b4557cf 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -363,7 +363,7 @@ struct mlx5_frag_buf_ctrl {
struct mlx5_frag_buffrag_buf;
u32 sz_m1;
u16 frag_sz_m1;
-   u32 strides_offset;
+   u16 strides_offset;
u8  log_sz;
u8  log_stride;
u8  log_frag_strides;
@@ -995,7 +995,7 @@ static inline u32 mlx5_base_mkey(const u32 key)
 }
 
 static inline void mlx5_fill_fbc_offset(u8 log_stride, u8 log_sz,
-   u32 strides_offset,
+   u16 strides_offset,
struct mlx5_frag_buf_ctrl *fbc)
 {
fbc->log_stride = log_stride;
-- 
2.17.1



[pull request][net 00/10] Mellanox, mlx5 fixes 2018-09-05

2018-09-05 Thread Saeed Mahameed
Hi Dave,

This pull request contains some fixes for mlx5 ethernet netdevice and
core driver.

Please pull and let me know if there's any problem.

For -stable v4.9:
('net/mlx5: Fix debugfs cleanup in the device init/remove flow')

For -stable v4.12:
("net/mlx5: E-Switch, Fix memory leak when creating switchdev mode FDB tables")

For -stable v4.13:
("net/mlx5: Fix use-after-free in self-healing flow")

For -stable v4.14:
("net/mlx5: Check for error in mlx5_attach_interface")

For -stable v4.15:
("net/mlx5: Fix not releasing read lock when adding flow rules")

For -stable v4.17:
("net/mlx5: Fix possible deadlock from lockdep when adding fte to fg")

For -stable v4.18:
("net/mlx5: Use u16 for Work Queue buffer fragment size")

Thanks,
Saeed.

---

The following changes since commit e65a9e480e91ddf9e15155454d370cead64689c8:

  net: qca_spi: Fix race condition in spi transfers (2018-09-05 08:09:35 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5e-fixes-2018-09-05

for you to fetch changes up to ad9421e36a77056a4f095d49b9605e80b4d216ed:

  net/mlx5: Fix possible deadlock from lockdep when adding fte to fg (2018-09-05 17:08:34 -0700)


mlx5e-fixes-2018-09-05


Daniel Jurgens (1):
  net/mlx5: Consider PCI domain in search for next dev

Huy Nguyen (1):
  net/mlx5: Check for error in mlx5_attach_interface

Jack Morgenstein (2):
  net/mlx5: Fix use-after-free in self-healing flow
  net/mlx5: Fix debugfs cleanup in the device init/remove flow

Raed Salem (1):
  net/mlx5: E-Switch, Fix memory leak when creating switchdev mode FDB tables

Roi Dayan (2):
  net/mlx5: Fix not releasing read lock when adding flow rules
  net/mlx5: Fix possible deadlock from lockdep when adding fte to fg

Saeed Mahameed (1):
  net/mlx5e: Ethtool steering, fix udp source port value

Tariq Toukan (2):
  net/mlx5: Use u16 for Work Queue buffer fragment size
  net/mlx5: Use u16 for Work Queue buffer strides offset

 drivers/net/ethernet/mellanox/mlx5/core/dev.c  | 22 ---
 .../ethernet/mellanox/mlx5/core/en_fs_ethtool.c|  2 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 76 +++---
 drivers/net/ethernet/mellanox/mlx5/core/health.c   | 10 ++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 12 ++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.c   |  6 +-
 drivers/net/ethernet/mellanox/mlx5/core/wq.h   |  2 +-
 include/linux/mlx5/driver.h|  8 +--
 9 files changed, 79 insertions(+), 60 deletions(-)


[net 07/10] net/mlx5: Consider PCI domain in search for next dev

2018-09-05 Thread Saeed Mahameed
From: Daniel Jurgens 

The PCI BDF is not unique. PCI domain must also be considered when
searching for the next physical device during lag setup. Example below:

mlx5_core 0000:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(128) RxCqeCmprss(0)
mlx5_core 0000:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(128) RxCqeCmprss(0)
mlx5_core 0001:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(128) RxCqeCmprss(0)
mlx5_core 0001:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(128) RxCqeCmprss(0)

Signed-off-by: Daniel Jurgens 
Reviewed-by: Aviv Heller 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/dev.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c 
b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
index b994b80d5714..ada723bd91b6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
@@ -391,16 +391,17 @@ void mlx5_remove_dev_by_protocol(struct mlx5_core_dev 
*dev, int protocol)
}
 }
 
-static u16 mlx5_gen_pci_id(struct mlx5_core_dev *dev)
+static u32 mlx5_gen_pci_id(struct mlx5_core_dev *dev)
 {
-   return (u16)((dev->pdev->bus->number << 8) |
+   return (u32)((pci_domain_nr(dev->pdev->bus) << 16) |
+(dev->pdev->bus->number << 8) |
 PCI_SLOT(dev->pdev->devfn));
 }
 
 /* Must be called with intf_mutex held */
 struct mlx5_core_dev *mlx5_get_next_phys_dev(struct mlx5_core_dev *dev)
 {
-   u16 pci_id = mlx5_gen_pci_id(dev);
+   u32 pci_id = mlx5_gen_pci_id(dev);
struct mlx5_core_dev *res = NULL;
struct mlx5_core_dev *tmp_dev;
struct mlx5_priv *priv;
-- 
2.17.1
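
To illustrate the collision described in the commit message above, here is a minimal userspace sketch. The function names are hypothetical and the domain is assumed to fit in 16 bits; it only mirrors the arithmetic of the old and new mlx5_gen_pci_id(), not the driver code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: bus number and slot only, truncated to 16 bits. */
static uint16_t gen_pci_id_old(unsigned int domain, uint8_t bus, uint8_t slot)
{
	(void)domain;	/* the domain never entered the old identifier */
	return (uint16_t)(((unsigned int)bus << 8) | slot);
}

/* New scheme: the domain occupies the bits above the bus number. */
static uint32_t gen_pci_id_new(unsigned int domain, uint8_t bus, uint8_t slot)
{
	return ((uint32_t)domain << 16) | ((uint32_t)bus << 8) | slot;
}
```

With the devices from the log above, domain 0 and domain 1 at bus 01, slot 00 produce the same old identifier but different new ones, which is why the lag search could previously pair ports from two different adapters.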



[net 01/10] net/mlx5: Fix use-after-free in self-healing flow

2018-09-05 Thread Saeed Mahameed
From: Jack Morgenstein 

When the mlx5 health mechanism detects a problem while the driver
is in the middle of init_one or remove_one, the driver needs to prevent
the health mechanism from scheduling future work; if future work
is scheduled, there is a problem with use-after-free: the system WQ
tries to run the work item (which has been freed) at the scheduled
future time.

Prevent this by disabling work item scheduling in the health mechanism
when the driver is in the middle of init_one() or remove_one().

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Jack Morgenstein 
Reviewed-by: Feras Daoud 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/health.c | 10 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c   |  6 +++---
 include/linux/mlx5/driver.h  |  2 +-
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c 
b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index d39b0b7011b2..9f39aeca863f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -331,9 +331,17 @@ void mlx5_start_health_poll(struct mlx5_core_dev *dev)
add_timer(&health->timer);
 }
 
-void mlx5_stop_health_poll(struct mlx5_core_dev *dev)
+void mlx5_stop_health_poll(struct mlx5_core_dev *dev, bool disable_health)
 {
struct mlx5_core_health *health = &dev->priv.health;
+   unsigned long flags;
+
+   if (disable_health) {
+   spin_lock_irqsave(&health->wq_lock, flags);
+   set_bit(MLX5_DROP_NEW_HEALTH_WORK, &health->flags);
+   set_bit(MLX5_DROP_NEW_RECOVERY_WORK, &health->flags);
+   spin_unlock_irqrestore(&health->wq_lock, flags);
+   }
 
del_timer_sync(&health->timer);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index cf3e4a659052..739aad0a0b35 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1286,7 +1286,7 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
mlx5_cleanup_once(dev);
 
 err_stop_poll:
-   mlx5_stop_health_poll(dev);
+   mlx5_stop_health_poll(dev, boot);
if (mlx5_cmd_teardown_hca(dev)) {
dev_err(&dev->pdev->dev, "tear_down_hca failed, skip cleanup\n");
goto out_err;
@@ -1346,7 +1346,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
mlx5_free_irq_vectors(dev);
if (cleanup)
mlx5_cleanup_once(dev);
-   mlx5_stop_health_poll(dev);
+   mlx5_stop_health_poll(dev, cleanup);
err = mlx5_cmd_teardown_hca(dev);
if (err) {
dev_err(&dev->pdev->dev, "tear_down_hca failed, skip cleanup\n");
@@ -1608,7 +1608,7 @@ static int mlx5_try_fast_unload(struct mlx5_core_dev *dev)
 * with the HCA, so the health polll is no longer needed.
 */
mlx5_drain_health_wq(dev);
-   mlx5_stop_health_poll(dev);
+   mlx5_stop_health_poll(dev, false);
 
ret = mlx5_cmd_force_teardown_hca(dev);
if (ret) {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 7a452716de4b..aa65f58c6610 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1052,7 +1052,7 @@ int mlx5_cmd_free_uar(struct mlx5_core_dev *dev, u32 
uarn);
 void mlx5_health_cleanup(struct mlx5_core_dev *dev);
 int mlx5_health_init(struct mlx5_core_dev *dev);
 void mlx5_start_health_poll(struct mlx5_core_dev *dev);
-void mlx5_stop_health_poll(struct mlx5_core_dev *dev);
+void mlx5_stop_health_poll(struct mlx5_core_dev *dev, bool disable_health);
 void mlx5_drain_health_wq(struct mlx5_core_dev *dev);
 void mlx5_trigger_health_work(struct mlx5_core_dev *dev);
 void mlx5_drain_health_recovery(struct mlx5_core_dev *dev);
-- 
2.17.1
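
The idea of the patch above can be sketched in userspace as follows. This is only an illustration under assumptions: the struct and function names are hypothetical, a C11 atomic flag stands in for the MLX5_DROP_NEW_*_WORK bits taken under wq_lock, and the check on the queueing side is assumed, since this excerpt does not show it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct health_sketch {
	atomic_bool drop_new_work;	/* stands in for the DROP_NEW_*_WORK bits */
	int queued;			/* work items accepted so far */
};

static void health_init(struct health_sketch *h)
{
	atomic_init(&h->drop_new_work, false);
	h->queued = 0;
}

/* Queueing side: refuse new work once the flag has been set. */
static bool health_try_queue(struct health_sketch *h)
{
	if (atomic_load(&h->drop_new_work))
		return false;
	h->queued++;
	return true;
}

/* Stop side: only the init/remove paths ask for new work to be dropped. */
static void health_stop(struct health_sketch *h, bool disable_health)
{
	if (disable_health)
		atomic_store(&h->drop_new_work, true);
	/* the driver then stops the poll timer with del_timer_sync() */
}
```

Callers that pass false, as mlx5_try_fast_unload() does in the diff, leave the flag alone, so a later health event can still be queued.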



[net 02/10] net/mlx5: Fix debugfs cleanup in the device init/remove flow

2018-09-05 Thread Saeed Mahameed
From: Jack Morgenstein 

When initializing the device (procedure init_one), the driver
calls mlx5_pci_init to perform pci initialization. As part of this
initialization, mlx5_pci_init creates a debugfs directory.
If this creation fails, init_one aborts, returning failure to
the caller (which is the probe method caller).

The main reason for such a failure to occur is if the debugfs
directory already exists. This can happen if the last time
mlx5_pci_close was called, debugfs_remove (silently) failed due
to the debugfs directory not being empty.

Guarantee that such a debugfs_remove failure will not occur by
instead calling debugfs_remove_recursive in procedure mlx5_pci_close.

Fixes: 59211bd3b632 ("net/mlx5: Split the load/unload flow into hardware and software flows")
Signed-off-by: Jack Morgenstein 
Reviewed-by: Daniel Jurgens 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 739aad0a0b35..b5e9f664fc66 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -878,8 +878,10 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct 
mlx5_priv *priv)
priv->numa_node = dev_to_node(&dev->pdev->dev);
 
priv->dbg_root = debugfs_create_dir(dev_name(&pdev->dev), mlx5_debugfs_root);
-   if (!priv->dbg_root)
+   if (!priv->dbg_root) {
+   dev_err(&pdev->dev, "Cannot create debugfs dir, aborting\n");
return -ENOMEM;
+   }
 
err = mlx5_pci_enable_device(dev);
if (err) {
@@ -928,7 +930,7 @@ static void mlx5_pci_close(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv)
pci_clear_master(dev->pdev);
release_bar(dev->pdev);
mlx5_pci_disable_device(dev);
-   debugfs_remove(priv->dbg_root);
+   debugfs_remove_recursive(priv->dbg_root);
 }
 
 static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
-- 
2.17.1
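
The failure sequence described in the commit message above has a close userspace analogy, sketched here with ordinary directories rather than debugfs. Everything in it (the helper name, the /tmp scratch path) is an assumption for illustration: rmdir() plays the role of the non-recursive removal, and the second mkdir() plays the role of the next probe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns the errno seen when re-creating a directory after a
 * non-recursive removal failed because the directory was not empty.
 * Returns -1 if the scratch setup itself fails.
 */
static int recreate_after_failed_remove(void)
{
	char dir[] = "/tmp/dbgfs-sketch-XXXXXX";
	char file[64];
	FILE *f;
	int err;

	if (!mkdtemp(dir))
		return -1;
	snprintf(file, sizeof(file), "%s/entry", dir);
	f = fopen(file, "w");
	if (!f) {
		rmdir(dir);
		return -1;
	}
	fclose(f);

	/* Non-recursive removal fails: the directory still has a child. */
	if (rmdir(dir) == 0)
		return -1;

	/* Re-creating the same name now fails because it still exists. */
	err = (mkdir(dir, 0700) == 0) ? 0 : errno;

	/* "Recursive" cleanup: remove the child first, then the directory. */
	unlink(file);
	rmdir(dir);
	return err;
}
```

Removing the children first is what the switch to debugfs_remove_recursive() achieves in mlx5_pci_close().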



[net 03/10] net/mlx5: Use u16 for Work Queue buffer fragment size

2018-09-05 Thread Saeed Mahameed
From: Tariq Toukan 

Minimal stride size is 16.
Hence, the number of strides in a fragment (of PAGE_SIZE)
is <= PAGE_SIZE / 16 <= 4K.

u16 is sufficient to represent this.

Fixes: 388ca8be0037 ("IB/mlx5: Implement fragmented completion queue (CQ)")
Signed-off-by: Tariq Toukan 
Reviewed-by: Eran Ben Elisha 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/wq.c | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.h | 2 +-
 include/linux/mlx5/driver.h  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
index c8c315eb5128..d838af9539b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
@@ -39,9 +39,9 @@ u32 mlx5_wq_cyc_get_size(struct mlx5_wq_cyc *wq)
return (u32)wq->fbc.sz_m1 + 1;
 }
 
-u32 mlx5_wq_cyc_get_frag_size(struct mlx5_wq_cyc *wq)
+u16 mlx5_wq_cyc_get_frag_size(struct mlx5_wq_cyc *wq)
 {
-   return (u32)wq->fbc.frag_sz_m1 + 1;
+   return wq->fbc.frag_sz_m1 + 1;
 }
 
 u32 mlx5_cqwq_get_size(struct mlx5_cqwq *wq)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.h 
b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
index 2bd4c3184eba..3a1a170bb2d7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
@@ -80,7 +80,7 @@ int mlx5_wq_cyc_create(struct mlx5_core_dev *mdev, struct 
mlx5_wq_param *param,
   void *wqc, struct mlx5_wq_cyc *wq,
   struct mlx5_wq_ctrl *wq_ctrl);
 u32 mlx5_wq_cyc_get_size(struct mlx5_wq_cyc *wq);
-u32 mlx5_wq_cyc_get_frag_size(struct mlx5_wq_cyc *wq);
+u16 mlx5_wq_cyc_get_frag_size(struct mlx5_wq_cyc *wq);
 
 int mlx5_wq_qp_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
  void *qpc, struct mlx5_wq_qp *wq,
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index aa65f58c6610..3a1258fd8ac3 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -362,7 +362,7 @@ struct mlx5_frag_buf {
 struct mlx5_frag_buf_ctrl {
struct mlx5_frag_buf frag_buf;
u32 sz_m1;
-   u32 frag_sz_m1;
+   u16 frag_sz_m1;
u32 strides_offset;
u8  log_sz;
u8  log_stride;
-- 
2.17.1
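
The bound in the commit message above can be checked with a few lines of arithmetic. This is a sketch with hypothetical names, assuming pages no larger than 64 KiB, which is what the "<= 4K" figure implies.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STRIDE_SZ 16u	/* minimal stride size quoted in the commit message */

/* Largest number of strides that fit in one page-sized fragment. */
static uint32_t max_strides_per_frag(uint32_t page_size)
{
	return page_size / MIN_STRIDE_SZ;
}

/* True if "strides per fragment minus one" fits in a u16 field. */
static int frag_sz_m1_fits_u16(uint32_t page_size)
{
	return max_strides_per_frag(page_size) - 1 <= UINT16_MAX;
}
```

A 4 KiB page gives at most 256 strides per fragment and a 64 KiB page gives 4096, so frag_sz_m1 never exceeds 4095 and fits comfortably in 16 bits.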



Re: [PATCH net-next v3 5/5] net: dsa: b53: Add SerDes support

2018-09-05 Thread Andrew Lunn
On Wed, Sep 05, 2018 at 12:42:15PM -0700, Florian Fainelli wrote:
> Add support for the Northstar Plus SerDes which is accessed through a
> special page of the switch. Since this is something that most people
> probably will not want to use, make it a configurable option with a
> default on ARCH_BCM_NSP where it is the most useful currently.
> 
> The SerDes supports both SGMII and 1000baseX modes for both lanes, and
> 2500baseX for one of the lanes, and is internally looking like a
> seemingly standard MII PHY, except for the few bits that got repurposed.
> 
> Signed-off-by: Florian Fainelli 
 
Reviewed-by: Andrew Lunn 

Andrew


[PATCH net-next] liquidio: Add spoof checking on a VF MAC address

2018-09-05 Thread Felix Manlunas
From: Weilin Chang 

1. Provide the API to set/unset the spoof checking feature.
2. Add a function to periodically provide the count of found
   packets with spoof VF MAC address.
3. Prevent VF MAC address changing while the spoofchk of the VF is
   on unless the changing MAC address is issued from PF.

Signed-off-by: Weilin Chang 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 74 ++
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c |  3 +-
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 63 ++
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c |  8 +++
 .../net/ethernet/cavium/liquidio/liquidio_common.h | 24 +--
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  5 ++
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  6 ++
 drivers/net/ethernet/cavium/liquidio/octeon_nic.c  | 16 +++--
 8 files changed, 187 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index cdc26ca..0284204 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -1357,6 +1357,69 @@ octnet_nic_stats_callback(struct octeon_device *oct_dev,
}
 }
 
+int lio_fetch_vf_stats(struct lio *lio)
+{
+   struct octeon_device *oct_dev = lio->oct_dev;
+   struct octeon_soft_command *sc;
+   struct oct_nic_vf_stats_resp *resp;
+
+   int retval;
+
+   /* Alloc soft command */
+   sc = (struct octeon_soft_command *)
+   octeon_alloc_soft_command(oct_dev,
+ 0,
+ sizeof(struct oct_nic_vf_stats_resp),
+ 0);
+
+   if (!sc) {
+   dev_err(&oct_dev->pci_dev->dev, "Soft command allocation failed\n");
+   retval = -ENOMEM;
+   goto lio_fetch_vf_stats_exit;
+   }
+
+   resp = (struct oct_nic_vf_stats_resp *)sc->virtrptr;
+   memset(resp, 0, sizeof(struct oct_nic_vf_stats_resp));
+
+   init_completion(&sc->complete);
+   sc->sc_status = OCTEON_REQUEST_PENDING;
+
+   sc->iq_no = lio->linfo.txpciq[0].s.q_no;
+
+   octeon_prepare_soft_command(oct_dev, sc, OPCODE_NIC,
+   OPCODE_NIC_VF_PORT_STATS, 0, 0, 0);
+
+   retval = octeon_send_soft_command(oct_dev, sc);
+   if (retval == IQ_SEND_FAILED) {
+   octeon_free_soft_command(oct_dev, sc);
+   goto lio_fetch_vf_stats_exit;
+   }
+
+   retval =
+   wait_for_sc_completion_timeout(oct_dev, sc,
+  (2 * LIO_SC_MAX_TMO_MS));
+   if (retval)  {
+   dev_err(&oct_dev->pci_dev->dev,
+   "sc OPCODE_NIC_VF_PORT_STATS command failed\n");
+   goto lio_fetch_vf_stats_exit;
+   }
+
+   if (sc->sc_status != OCTEON_REQUEST_TIMEOUT && !resp->status) {
+   octeon_swap_8B_data((u64 *)&resp->spoofmac_cnt,
+   (sizeof(u64)) >> 3);
+
+   if (resp->spoofmac_cnt != 0) {
+   dev_warn(&oct_dev->pci_dev->dev,
+"%llu Spoofed packets detected\n",
+resp->spoofmac_cnt);
+   }
+   }
+   WRITE_ONCE(sc->caller_is_done, 1);
+
+lio_fetch_vf_stats_exit:
+   return retval;
+}
+
 void lio_fetch_stats(struct work_struct *work)
 {
struct cavium_wk *wk = (struct cavium_wk *)work;
@@ -1367,6 +1430,17 @@ void lio_fetch_stats(struct work_struct *work)
unsigned long time_in_jiffies;
int retval;
 
+   if (OCTEON_CN23XX_PF(oct_dev)) {
+   /* report spoofchk every 2 seconds */
+   if (!(oct_dev->vfstats_poll % LIO_VFSTATS_POLL) &&
+   (oct_dev->fw_info.app_cap_flags & LIQUIDIO_SPOOFCHK_CAP) &&
+   oct_dev->sriov_info.num_vfs_alloced) {
+   lio_fetch_vf_stats(lio);
+   }
+
+   oct_dev->vfstats_poll++;
+   }
+
/* Alloc soft command */
sc = (struct octeon_soft_command *)
octeon_alloc_soft_command(oct_dev,
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c 
b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index 46d8379..e1e5808 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -1713,7 +1713,8 @@ static void lio_vf_get_ethtool_stats(struct net_device 
*netdev,
  */
data[i++] = lstats.rx_dropped;
/* sum of oct->instr_queue[iq_no]->stats.tx_dropped */
-   data[i++] = lstats.tx_dropped;
+   data[i++] = lstats.tx_dropped +
+   oct_dev->link_stats.fromhost.fw_err_drop;
 
data[i++] = oct_dev->link_stats.fromwire.fw_total_mcast;
data[i++] = 

[PATCH bpf-next 1/4] tools/bpf: sync kernel uapi header if_link.h to tools

2018-09-05 Thread Yonghong Song
Among others, this header will be used later for
bpftool net support.

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/if_link.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tools/include/uapi/linux/if_link.h 
b/tools/include/uapi/linux/if_link.h
index cf01b6824244..43391e2d1153 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -164,6 +164,8 @@ enum {
IFLA_CARRIER_UP_COUNT,
IFLA_CARRIER_DOWN_COUNT,
IFLA_NEW_IFINDEX,
+   IFLA_MIN_MTU,
+   IFLA_MAX_MTU,
__IFLA_MAX
 };
 
@@ -334,6 +336,7 @@ enum {
IFLA_BRPORT_GROUP_FWD_MASK,
IFLA_BRPORT_NEIGH_SUPPRESS,
IFLA_BRPORT_ISOLATED,
+   IFLA_BRPORT_BACKUP_PORT,
__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
@@ -459,6 +462,16 @@ enum {
 
 #define IFLA_MACSEC_MAX (__IFLA_MACSEC_MAX - 1)
 
+/* XFRM section */
+enum {
+   IFLA_XFRM_UNSPEC,
+   IFLA_XFRM_LINK,
+   IFLA_XFRM_IF_ID,
+   __IFLA_XFRM_MAX
+};
+
+#define IFLA_XFRM_MAX (__IFLA_XFRM_MAX - 1)
+
 enum macsec_validation_type {
MACSEC_VALIDATE_DISABLED = 0,
MACSEC_VALIDATE_CHECK = 1,
@@ -920,6 +933,7 @@ enum {
XDP_ATTACHED_DRV,
XDP_ATTACHED_SKB,
XDP_ATTACHED_HW,
+   XDP_ATTACHED_MULTI,
 };
 
 enum {
@@ -928,6 +942,9 @@ enum {
IFLA_XDP_ATTACHED,
IFLA_XDP_FLAGS,
IFLA_XDP_PROG_ID,
+   IFLA_XDP_DRV_PROG_ID,
+   IFLA_XDP_SKB_PROG_ID,
+   IFLA_XDP_HW_PROG_ID,
__IFLA_XDP_MAX,
 };
 
-- 
2.17.1
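
For readers unfamiliar with the `__IFLA_*_MAX` convention used by the new XFRM section above, here is a small sketch with a hypothetical attribute set. The DEMO_* names are not part of any uapi header; they only mirror the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute set following the same convention as IFLA_XFRM_*. */
enum {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_LINK,
	DEMO_ATTR_IF_ID,
	__DEMO_ATTR_MAX		/* sentinel: always kept last */
};

#define DEMO_ATTR_MAX (__DEMO_ATTR_MAX - 1)

/* A parse table indexed by attribute type needs MAX + 1 slots. */
static const void *demo_tb[DEMO_ATTR_MAX + 1];

static size_t demo_tb_slots(void)
{
	return sizeof(demo_tb) / sizeof(demo_tb[0]);
}
```

Appending a new attribute before the sentinel grows both the MAX macro and any table sized from it without touching other code, which is why new entries such as IFLA_MIN_MTU are added just above `__IFLA_MAX`.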



[PATCH bpf-next 3/4] tools/bpf: add more netlink functionalities in lib/bpf

2018-09-05 Thread Yonghong Song
This patch added a few netlink attribute parsing functions
and the netlink API functions to query networking links, tc classes,
tc qdiscs and tc filters. For example, the following API is
to get networking links:
  int nl_get_link(int sock, unsigned int nl_pid,
  dump_nlmsg_t dump_link_nlmsg,
  void *cookie);

Note that when the API is called, the user also provides a
callback function with the following signature:
  int (*dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb);

The "cookie" is the parameter the user passed to the API and will
be available for the callback function.
The "msg" is the information about the result, e.g., ifinfomsg or
tcmsg. The "tb" is the parsed netlink attributes.
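
The callback-plus-cookie pattern described above can be sketched without any netlink code. All demo_* names below are hypothetical stand-ins, not libbpf API; the early return on a non-zero callback result mirrors what the bpf_netlink_recv() hunk in this patch does with _fn.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoded message such as ifinfomsg. */
struct demo_msg {
	int ifindex;
};

/* Same shape as dump_nlmsg_t, minus the parsed attribute table. */
typedef int (*demo_dump_t)(void *cookie, void *msg);

/*
 * Hypothetical enumerator: invokes the callback once per message and
 * stops at the first non-zero return, handing that value to the caller.
 */
static int demo_for_each_msg(struct demo_msg *msgs, size_t n,
			     demo_dump_t fn, void *cookie)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = fn(cookie, &msgs[i]);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: the cookie is a counter owned by the caller. */
static int demo_count_cb(void *cookie, void *msg)
{
	(void)msg;
	(*(int *)cookie)++;
	return 0;
}
```

The cookie lets the caller keep its own state (a counter here, a JSON writer in a tool) without the library knowing anything about it.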

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.h   |  16 
 tools/lib/bpf/libbpf_errno.c |   1 +
 tools/lib/bpf/netlink.c  | 165 ++-
 tools/lib/bpf/nlattr.c   |  33 ---
 tools/lib/bpf/nlattr.h   |  38 
 5 files changed, 238 insertions(+), 15 deletions(-)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fac54c3..e3b00e23e181 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -46,6 +46,7 @@ enum libbpf_errno {
LIBBPF_ERRNO__PROGTYPE, /* Kernel doesn't support this program type */
LIBBPF_ERRNO__WRNGPID,  /* Wrong pid in netlink message */
LIBBPF_ERRNO__INVSEQ,   /* Invalid netlink sequence */
+   LIBBPF_ERRNO__NLPARSE,  /* netlink parsing error */
__LIBBPF_ERRNO__END,
 };
 
@@ -297,4 +298,19 @@ int bpf_perf_event_read_simple(void *mem, unsigned long 
size,
   unsigned long page_size,
   void **buf, size_t *buf_len,
   bpf_perf_event_print_t fn, void *priv);
+
+struct nlmsghdr;
+struct nlattr;
+typedef int (*dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb);
+typedef int (*__dump_nlmsg_t)(struct nlmsghdr *nlmsg, dump_nlmsg_t,
+ void *cookie);
+int bpf_netlink_open(unsigned int *nl_pid);
+int nl_get_link(int sock, unsigned int nl_pid, dump_nlmsg_t dump_link_nlmsg,
+   void *cookie);
+int nl_get_class(int sock, unsigned int nl_pid, int ifindex,
+dump_nlmsg_t dump_class_nlmsg, void *cookie);
+int nl_get_qdisc(int sock, unsigned int nl_pid, int ifindex,
+dump_nlmsg_t dump_qdisc_nlmsg, void *cookie);
+int nl_get_filter(int sock, unsigned int nl_pid, int ifindex, int handle,
+ dump_nlmsg_t dump_filter_nlmsg, void *cookie);
 #endif
diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
index d9ba851bd7f9..2464ade3b326 100644
--- a/tools/lib/bpf/libbpf_errno.c
+++ b/tools/lib/bpf/libbpf_errno.c
@@ -42,6 +42,7 @@ static const char *libbpf_strerror_table[NR_ERRNO] = {
[ERRCODE_OFFSET(PROGTYPE)]  = "Kernel doesn't support this program type",
[ERRCODE_OFFSET(WRNGPID)]   = "Wrong pid in netlink message",
[ERRCODE_OFFSET(INVSEQ)]= "Invalid netlink sequence",
+   [ERRCODE_OFFSET(NLPARSE)]   = "Incorrect netlink message parsing",
 };
 
 int libbpf_strerror(int err, char *buf, size_t size)
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index ccaa991fe9d8..469e068dd0c5 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -18,7 +18,7 @@
 #define SOL_NETLINK 270
 #endif
 
-static int bpf_netlink_open(__u32 *nl_pid)
+int bpf_netlink_open(__u32 *nl_pid)
 {
struct sockaddr_nl sa;
socklen_t addrlen;
@@ -61,7 +61,9 @@ static int bpf_netlink_open(__u32 *nl_pid)
return ret;
 }
 
-static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq)
+static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
+   __dump_nlmsg_t _fn, dump_nlmsg_t fn,
+   void *cookie)
 {
struct nlmsgerr *err;
struct nlmsghdr *nh;
@@ -98,6 +100,11 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq)
default:
break;
}
+   if (_fn) {
+   ret = _fn(nh, fn, cookie);
+   if (ret)
+   return ret;
+   }
}
}
ret = 0;
@@ -157,9 +164,161 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
ret = -errno;
goto cleanup;
}
-   ret = bpf_netlink_recv(sock, nl_pid, seq);
+   ret = bpf_netlink_recv(sock, nl_pid, seq, NULL, NULL, NULL);
 
 cleanup:
close(sock);
return ret;
 }
+
+static int __dump_link_nlmsg(struct nlmsghdr *nlh, dump_nlmsg_t 
dump_link_nlmsg,
+void *cookie)
+{
+   struct nlattr *tb[IFLA_MAX + 1], *attr;
+   struct ifinfomsg *ifi = NLMSG_DATA(nlh);
+   int len;
+
+ 

[PATCH bpf-next 2/4] tools/bpf: move bpf/lib netlink related functions into a new file

2018-09-05 Thread Yonghong Song
There is no functional change in this patch.

In the subsequent patches, more netlink related library functions
will be added and a separate file is better than cluttering bpf.c.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/Build |   2 +-
 tools/lib/bpf/bpf.c | 129 ---
 tools/lib/bpf/netlink.c | 165 
 3 files changed, 166 insertions(+), 130 deletions(-)
 create mode 100644 tools/lib/bpf/netlink.c

diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
index 13a861135127..512b2c0ba0d2 100644
--- a/tools/lib/bpf/Build
+++ b/tools/lib/bpf/Build
@@ -1 +1 @@
-libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o
+libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o netlink.o
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 60aa4ca8b2c5..3878a26a2071 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -28,16 +28,8 @@
 #include 
 #include "bpf.h"
 #include "libbpf.h"
-#include "nlattr.h"
-#include 
-#include 
-#include 
 #include 
 
-#ifndef SOL_NETLINK
-#define SOL_NETLINK 270
-#endif
-
 /*
  * When building perf, unistd.h is overridden. __NR_bpf is
  * required to be defined explicitly.
@@ -499,127 +491,6 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
 }
 
-int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
-{
-   struct sockaddr_nl sa;
-   int sock, seq = 0, len, ret = -1;
-   char buf[4096];
-   struct nlattr *nla, *nla_xdp;
-   struct {
-   struct nlmsghdr  nh;
-   struct ifinfomsg ifinfo;
-   char attrbuf[64];
-   } req;
-   struct nlmsghdr *nh;
-   struct nlmsgerr *err;
-   socklen_t addrlen;
-   int one = 1;
-
-   memset(&sa, 0, sizeof(sa));
-   sa.nl_family = AF_NETLINK;
-
-   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
-   if (sock < 0) {
-   return -errno;
-   }
-
-   if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
-  &one, sizeof(one)) < 0) {
-   fprintf(stderr, "Netlink error reporting not supported\n");
-   }
-
-   if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
-   ret = -errno;
-   goto cleanup;
-   }
-
-   addrlen = sizeof(sa);
-   if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
-   ret = -errno;
-   goto cleanup;
-   }
-
-   if (addrlen != sizeof(sa)) {
-   ret = -LIBBPF_ERRNO__INTERNAL;
-   goto cleanup;
-   }
-
-   memset(&req, 0, sizeof(req));
-   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
-   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
-   req.nh.nlmsg_type = RTM_SETLINK;
-   req.nh.nlmsg_pid = 0;
-   req.nh.nlmsg_seq = ++seq;
-   req.ifinfo.ifi_family = AF_UNSPEC;
-   req.ifinfo.ifi_index = ifindex;
-
-   /* started nested attribute for XDP */
-   nla = (struct nlattr *)(((char *)&req)
-   + NLMSG_ALIGN(req.nh.nlmsg_len));
-   nla->nla_type = NLA_F_NESTED | IFLA_XDP;
-   nla->nla_len = NLA_HDRLEN;
-
-   /* add XDP fd */
-   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
-   nla_xdp->nla_type = IFLA_XDP_FD;
-   nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
-   memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
-   nla->nla_len += nla_xdp->nla_len;
-
-   /* if user passed in any flags, add those too */
-   if (flags) {
-   nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
-   nla_xdp->nla_type = IFLA_XDP_FLAGS;
-   nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
-   memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
-   nla->nla_len += nla_xdp->nla_len;
-   }
-
-   req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
-
-   if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
-   ret = -errno;
-   goto cleanup;
-   }
-
-   len = recv(sock, buf, sizeof(buf), 0);
-   if (len < 0) {
-   ret = -errno;
-   goto cleanup;
-   }
-
-   for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
-nh = NLMSG_NEXT(nh, len)) {
-   if (nh->nlmsg_pid != sa.nl_pid) {
-   ret = -LIBBPF_ERRNO__WRNGPID;
-   goto cleanup;
-   }
-   if (nh->nlmsg_seq != seq) {
-   ret = -LIBBPF_ERRNO__INVSEQ;
-   goto cleanup;
-   }
-   switch (nh->nlmsg_type) {
-   case NLMSG_ERROR:
-   err = (struct nlmsgerr *)NLMSG_DATA(nh);
-   if (!err->error)
-   continue;
-   ret = err->error;
-   nla_dump_errormsg(nh);
-   goto 

[PATCH bpf-next 0/4] tools/bpf: add bpftool net support

2018-09-05 Thread Yonghong Song
As bpf usage becomes more pervasive, people start to worry
about its cpu and memory cost. On a particular host,
people often want to know all running bpf programs
and their attachment context, so they can quickly relate
a performance/memory anomaly to a particular bpf
program or application.

bpftool already provides pretty good coverage for perf
and cgroup related attachments. This patch set enables
dumping attachment info for xdp and tc bpf programs.

Currently, users can already use "ip link show " and
"tc filter show dev  ..." to dump bpf program attachment
information for xdp and tc bpf programs. The main reason
to implement such functionality in bpftool as well is for
better user experience. We want the bpftool to be the
ultimate tool for bpf introspection. The bpftool net
implementation will only present necessary bpf attachment
information to the user, ignoring most other ip/tc
specific information.

For example, the below is a pretty json print for xdp
and tc_filters.

  $ ./bpftool -jp net
  [{
"xdp": [{
"ifindex": 2,
"devname": "eth0",
"prog_id": 198
}
],
"tc_filters": [{
"ifindex": 2,
"kind": "qdisc_htb",
"name": "prefix_matcher.o:[cls_prefix_matcher_htb]",
"prog_id": 111727,
"tag": "d08fe3b4319bc2fd",
"act": []
},{
"ifindex": 2,
"kind": "qdisc_clsact_ingress",
"name": "fbflow_icmp",
"prog_id": 130246,
"tag": "3f265c7f26db62c9",
"act": []
},{
"ifindex": 2,
"kind": "qdisc_clsact_egress",
"name": "prefix_matcher.o:[cls_prefix_matcher_clsact]",
"prog_id": 111726,
"tag": "99a197826974c876"
},{
"ifindex": 2,
"kind": "qdisc_clsact_egress",
"name": "cls_fg_dscp",
"prog_id": 108619,
"tag": "dc4630674fd72dcc",
"act": []
},{
"ifindex": 2,
"kind": "qdisc_clsact_egress",
"name": "fbflow_egress",
"prog_id": 130245,
"tag": "72d2d830d6888d2c"
}
]
}
  ]

Patch #1 synced kernel uapi header if_link.h to tools directory.
Patch #2 moved tools/bpf/lib/bpf.c netlink related functions to
a new file. Patch #3 implemented additional functions
in libbpf which will be used in Patch #4.
Patch #4 implemented bpftool net support to dump
xdp and tc bpf program attachments.

Yonghong Song (4):
  tools/bpf: sync kernel uapi header if_link.h to tools
  tools/bpf: move bpf/lib netlink related functions into a new file
  tools/bpf: add more netlink functionalities in lib/bpf
  tools/bpf: bpftool: add net support

 .../bpf/bpftool/Documentation/bpftool-net.rst | 133 +++
 tools/bpf/bpftool/Documentation/bpftool.rst   |   6 +-
 tools/bpf/bpftool/bash-completion/bpftool |  17 +-
 tools/bpf/bpftool/main.c  |   3 +-
 tools/bpf/bpftool/main.h  |   7 +
 tools/bpf/bpftool/net.c   | 233 +
 tools/bpf/bpftool/netlink_dumper.c| 181 ++
 tools/bpf/bpftool/netlink_dumper.h| 103 ++
 tools/include/uapi/linux/if_link.h|  17 +
 tools/lib/bpf/Build   |   2 +-
 tools/lib/bpf/bpf.c   | 129 ---
 tools/lib/bpf/libbpf.h|  16 +
 tools/lib/bpf/libbpf_errno.c  |   1 +
 tools/lib/bpf/netlink.c   | 324 ++
 tools/lib/bpf/nlattr.c|  33 +-
 tools/lib/bpf/nlattr.h|  38 ++
 16 files changed, 1094 insertions(+), 149 deletions(-)
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-net.rst
 create mode 100644 tools/bpf/bpftool/net.c
 create mode 100644 tools/bpf/bpftool/netlink_dumper.c
 create mode 100644 tools/bpf/bpftool/netlink_dumper.h
 create mode 100644 tools/lib/bpf/netlink.c

-- 
2.17.1



[PATCH bpf-next 4/4] tools/bpf: bpftool: add net support

2018-09-05 Thread Yonghong Song
Add "bpftool net" support. Networking devices are enumerated
to dump device index/name associated with xdp progs.

For each networking device, tc classes and qdiscs are enumerated
in order to check their bpf filters.
In addition, root handle and clsact ingress/egress are also checked for
bpf filters.  Not all filter information is printed out. Only ifindex,
kind, filter name, prog_id and tag are printed out, which are good
enough to show attachment information. If the filter action
is a bpf action, its bpf program id, bpf name and tag will be
printed out as well.

For example,
  $ ./bpftool net
  xdp [
  ifindex 2 devname eth0 prog_id 198
  ]
  tc_filters [
  ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
prog_id 111727 tag d08fe3b4319bc2fd act []
  ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
prog_id 130246 tag 3f265c7f26db62c9 act []
  ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
prog_id 111726 tag 99a197826974c876
  ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
prog_id 108619 tag dc4630674fd72dcc act []
  ifindex 2 kind qdisc_clsact_egress name fbflow_egress
prog_id 130245 tag 72d2d830d6888d2c
  ]
  $ ./bpftool -jp net
  [{
"xdp": [{
"ifindex": 2,
"devname": "eth0",
"prog_id": 198
}
],
"tc_filters": [{
"ifindex": 2,
"kind": "qdisc_htb",
"name": "prefix_matcher.o:[cls_prefix_matcher_htb]",
"prog_id": 111727,
"tag": "d08fe3b4319bc2fd",
"act": []
},{
"ifindex": 2,
"kind": "qdisc_clsact_ingress",
"name": "fbflow_icmp",
"prog_id": 130246,
"tag": "3f265c7f26db62c9",
"act": []
},{
"ifindex": 2,
"kind": "qdisc_clsact_egress",
"name": "prefix_matcher.o:[cls_prefix_matcher_clsact]",
"prog_id": 111726,
"tag": "99a197826974c876"
},{
"ifindex": 2,
"kind": "qdisc_clsact_egress",
"name": "cls_fg_dscp",
"prog_id": 108619,
"tag": "dc4630674fd72dcc",
"act": []
},{
"ifindex": 2,
"kind": "qdisc_clsact_egress",
"name": "fbflow_egress",
"prog_id": 130245,
"tag": "72d2d830d6888d2c"
}
]
}
  ]

Signed-off-by: Yonghong Song 
---
 .../bpf/bpftool/Documentation/bpftool-net.rst | 133 ++
 tools/bpf/bpftool/Documentation/bpftool.rst   |   6 +-
 tools/bpf/bpftool/bash-completion/bpftool |  17 +-
 tools/bpf/bpftool/main.c  |   3 +-
 tools/bpf/bpftool/main.h  |   7 +
 tools/bpf/bpftool/net.c   | 233 ++
 tools/bpf/bpftool/netlink_dumper.c| 181 ++
 tools/bpf/bpftool/netlink_dumper.h| 103 
 8 files changed, 676 insertions(+), 7 deletions(-)
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-net.rst
 create mode 100644 tools/bpf/bpftool/net.c
 create mode 100644 tools/bpf/bpftool/netlink_dumper.c
 create mode 100644 tools/bpf/bpftool/netlink_dumper.h

diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst
new file mode 100644
index ..48a61837a264
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst
@@ -0,0 +1,133 @@
+
+bpftool-net
+
+---
+tool for inspection of netdev/tc related bpf prog attachments
+---
+
+:Manual section: 8
+
+SYNOPSIS
+
+
+   **bpftool** [*OPTIONS*] **net** *COMMAND*
+
+   *OPTIONS* := { [{ **-j** | **--json** }] [{ **-p** | **--pretty** }] }
+
+   *COMMANDS* :=
+   { **show** | **list** } [ **dev** name ] | **help**
+
+NET COMMANDS
+
+
+|  **bpftool** **net { show | list } [ dev name ]**
+|  **bpftool** **net help**
+
+DESCRIPTION
+===
+   **bpftool net { show | list } [ dev name ]**
+ List all networking device driver and tc attachment in the system.
+
+  Output will start with all xdp program attachment, followed by
+  all tc class/qdisc bpf program attachments. Both xdp programs and
+  tc programs are ordered based on ifindex number. If multiple bpf
+  programs attached to the same networking device through **tc filter**,
+  the order will be first all bpf programs attached to tc classes, then
+  all bpf programs 

Re: [PATCH] cxgb4: fix abort_req_rss6 struct

2018-09-05 Thread Jason Gunthorpe
On Fri, Aug 31, 2018 at 11:52:00AM -0700, Steve Wise wrote:
> Remove the incorrect WR_HDR field which can cause a misinterpretation
> of this CPL by ULDs.

What does that mean?

Is this an -rc patch?

Jason


Re: [PATCH net-next v3 4/5] net: dsa: b53: Add PHYLINK support

2018-09-05 Thread Andrew Lunn
On Wed, Sep 05, 2018 at 12:42:14PM -0700, Florian Fainelli wrote:
> Add support for PHYLINK, things are reasonably straight forward since we
> do not yet support SerDes interfaces, that leaves us with just
> MLO_AN_PHY and MLO_AN_FIXED to deal with.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next v3 3/5] net: dsa: b53: Add helper to set link parameters

2018-09-05 Thread Andrew Lunn
On Wed, Sep 05, 2018 at 12:42:13PM -0700, Florian Fainelli wrote:
> Extract the logic from b53_adjust_link() responsible for overriding a
> given port's link, speed, duplex and pause settings and make two helper
> functions to set the port's configuration and the port's link settings.
> We will make use of both, as separate functions while adding PHYLINK
> support next.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next v3 2/5] net: dsa: b53: Make SRAB driver manage port interrupts

2018-09-05 Thread Andrew Lunn
On Wed, Sep 05, 2018 at 12:42:12PM -0700, Florian Fainelli wrote:
> Update the SRAB driver to manage per-port interrupts. Since we cannot
> sleep during b53_io_ops, schedule a workqueue whenever we get a port
> specific interrupt. We will later make use of this to call back into
> PHYLINK when there is e.g: a link state change.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next v3 1/5] net: dsa: b53: Add ability to enable/disable port interrupts

2018-09-05 Thread Andrew Lunn
On Wed, Sep 05, 2018 at 12:42:11PM -0700, Florian Fainelli wrote:
> Some switches expose individual interrupt line(s) for port specific
> event(s), allow configuring these interrupts at an appropriate time
> during port_enable/disable callbacks where all port specific resources
> are known to be set-up and ready for use.
> 
> Signed-off-by: Florian Fainelli 

Reviewed-by: Andrew Lunn 

Andrew


Re: Poptrie in Linux kernel

2018-09-05 Thread Md. Islam
Hi Yasuhiro

I would like to point out that the lookup process, the trie construction
and even the trie itself are not the same as in the patent [1]. The idea of
splitting a prefix into multiple levels is not new; SAIL [2] does it as
well. In fact, I started by implementing SAIL in the Linux kernel. Then I
came across the idea of representing chunk IDs with a bitmap in Poptrie
[3] and in this book [4]. Although my work was inspired by your paper
[3], SAIL [2] and this book [4] (I was not aware of the patent [1]), the
actual trie, trie lookup and trie construction process are very
different from the paper/patent. The only thing that's the same is the
name, POPTRIE :-) As that was an ACM SIGCOMM paper, I thought using the
same name would be helpful for others to understand and review my work.
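
For readers not familiar with the bitmap idea mentioned above, here is a
rough userspace sketch of popcount-based indexing. It is not taken from my
patch, the paper or the patent; the names demo_popcount64 and
demo_child_index and the 64-entry (6-bit chunk) fan-out are assumptions
made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits with a portable loop instead of a compiler builtin. */
static unsigned int demo_popcount64(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Bit i of `vector` is set when chunk value i has a child. Children are
 * assumed to be stored packed, in bit order, so the child's position is
 * the number of set bits below bit `chunk`.
 * Returns the packed index, or -1 when the chunk has no child.
 */
static int demo_child_index(uint64_t vector, unsigned int chunk)
{
	uint64_t bit;

	assert(chunk < 64);
	bit = (uint64_t)1 << chunk;
	if (!(vector & bit))
		return -1;
	return (int)demo_popcount64(vector & (bit - 1));
}
```

The point of the packed layout is that a node only stores as many children
as it has set bits, while the lookup stays a mask plus a population count.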

Looking forward to your suggestions.

1. http://www.conceptsengine.com/patent/grant/0005960863
2. Yang, Tong, et al. "Guarantee IP lookup performance with FIB
explosion." ACM SIGCOMM, 2014.
3. Asai, Hirochika, and Yasuhiro Ohara. "Poptrie: A compressed trie
with population count for fast and scalable software IP routing table
lookup." ACM SIGCOMM, 2015.
4. Warren, Henry S. Hacker's delight. Pearson Education, 2013.

Many thanks
Tamim

On Wed, Sep 5, 2018 at 7:48 AM, Yasuhiro Ohara  wrote:
>
> Dear Islam,
>
> Thank you for being interested in and implementing the Poptrie.
>
> Yes, NTT Communications (my company) has filed the patent for the Poptrie.
>
> To my best knowledge, the patent issue is totally different
> from the copyright issue (of the source code), and so even
> the copyright issue were cleared/solved, it does not mean
> the patent issue was gone.
>
> I am requesting to my company what is our company's action
> to your Poptrie implementation.
>
> Would you please wait for a while until NTT Communications
> decide its response. We will inform you as soon as it is decided.
>
> Best regards,
> Yasu
>
>> -Original Message-
>> From: Md. Islam [mailto:misl...@kent.edu]
>> Sent: Wednesday, September 05, 2018 2:55 PM
>> To: pa...@hongo.wide.ad.jp
>> Cc: Hayakawa Yutaro; Yasuhiro Ohara(小原泰弘)
>> Subject: Poptrie in Linux kernel
>>
>> Hi Hirochika
>>
>> I am a PhD student of Kent State University, Ohio, USA. I came across your
>> Poptrie paper and found it fascinating. I implemented it in Linux kernel.
>> However the trie construction and lookup process is somewhat different from
>> the original paper. For instance, in my case, the node looks like as
>> following.
>>
>> struct poptrie_node {
>> u64 vector;
>> u64 leafvec;
>> //Not needed in SIGCOMM paper
>> u64 nodevec;
>> struct poptrie_node *child_nodes;
>> u8 *leaves;
>> //Not needed in SIGCOMM paper
>> u8 *prefixes;
>> //Not needed in SIGCOMM paper
>> struct rcu_head rcu;
>> };
>>
>> Note that here I used some extra variables compared to the SIGCOMM paper.
>> This is because the trie construction/lookup process is different from the
>> original paper. This will be more evident if you look more at my patch.
>>
>> My implementation is also not based on the original implementation
>> (https://github.com/pixos/poptrie). However as the implementation is
>> copyrighted by you, I would like to ask your permission if I can submit
>> my implementation to upstream kernel.
>>
>> Hayakawa also pointed that NTT communications have patented this. I was
>> not aware of this. Could you please advise if I can submit the patch as
>> Poptrie to upstream Linux kernel? I have cited your paper in the patch.
>> I don't have any problem acknowledgeing your and NTT's contribution. You
>> guys know you deserve it :-) I have attached the patch and my draft paper
>> regarding my implementation. Could you please take a look at the paper and
>> patch to see if there is any legal problem in getting this patch into
>> upstream Linux kernel.
>>
>> Please let me know any question.
>>
>> Many thanks
>> Tamim
>> PhD Candidate,
>> Kent State University
>> http://web.cs.kent.edu/~mislam4/


Re: [PATCH mlx5-next v1 05/15] net/mlx5: Break encap/decap into two separated flow table creation flags

2018-09-05 Thread Or Gerlitz
On Wed, Sep 5, 2018 at 9:11 PM, Leon Romanovsky  wrote:
> On Wed, Sep 05, 2018 at 10:38:00AM -0600, Jason Gunthorpe wrote:
>> On Wed, Sep 05, 2018 at 08:10:25AM +0300, Leon Romanovsky wrote:
>> > > > -   int en_encap_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN);
>> > > > +   int en_encap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP);
>> > > > +   int en_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
>> > >
>> > > Yuk, please don't use !!.
>> > >
>> > >   bool en_decap = flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
>> >
>> > We need to provide en_encap and en_decap as an input to MLX5_SET(...)
>> > which is passed to FW as 0 or 1.
>> >
>> > Boolean type is declared in C as int and treated as zero for false
>> > and any other value for true,
>>
>> No, that isn't right, the kernel uses C99's _Bool intrinsic type, which
>> is guaranteed to only hold 0 or 1 by the compiler.
>>
>> See types.h:
>>
>> typedef _Bool   bool;
>
> Exciting, it took me a while to find C99 standard and relevant 6.3.1.2.
> Anyway, this patch didn't change previous functionality, which used "!!"
> convention.

So? If we didn't do things properly prior to the patch, why not fix it along
with the patch? Let's fix it.
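
To make the point about _Bool concrete, here is a small userspace sketch.
The DEMO_FLAG_* values and the demo_* helpers are hypothetical and are not
the mlx5 definitions; they only show that a conversion to bool already
yields 0 or 1 (C99 6.3.1.2), while a plain int keeps the masked value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, not the real MLX5_FLOW_TABLE_* values. */
#define DEMO_FLAG_ENCAP 0x1u
#define DEMO_FLAG_DECAP 0x2u

/* The implicit conversion to bool normalizes any non-zero value to 1. */
static bool demo_flag_as_bool(unsigned int flags, unsigned int mask)
{
	assert(mask != 0);
	return flags & mask;
}

/* With int, the raw masked value survives, hence the "!!" idiom. */
static int demo_flag_as_int(unsigned int flags, unsigned int mask)
{
	assert(mask != 0);
	return (int)(flags & mask);
}
```

With both flags set, demo_flag_as_bool(flags, DEMO_FLAG_DECAP) is 1, while
demo_flag_as_int() returns 2 for the same arguments.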


Re: [PATCH rdma-next v1 00/15] Flow actions to mutate packets

2018-09-05 Thread Jason Gunthorpe
On Wed, Sep 05, 2018 at 08:14:58AM +0300, Leon Romanovsky wrote:

> > This looks OK to me, can you make the shared commit please?
> 
> Thanks, I pushed to mlx5-next
> 
> 50acec06f392 net/mlx5: Export packet reformat alloc/dealloc functions
> 31ca3648f01b net/mlx5: Pass a namespace for packet reformat ID allocation
> bea4e1f6c6c5 net/mlx5: Expose new packet reformat capabilities
> 60786f0987c0 {net, RDMA}/mlx5: Rename encap to reformat packet
> e0e7a3861b6c net/mlx5: Move header encap type to IFC header file
> 61444b458b01 net/mlx5: Break encap/decap into two separated flow table creation flags
> c3c062f80665 net/mlx5: Add support for more namespaces when allocating modify header
> 90c1d1b8da67 net/mlx5: Export modify header alloc/dealloc functions
> 8ce78257965e net/mlx5: Add proper NIC TX steering flow tables support
> 2226dcb424bf net/mlx5: Cleanup flow namespace getter switch logic

Okay, done thanks

Jason


Re: [PATCH net-next] net: sched: change tcf_del_walker() to use concurrent-safe delete

2018-09-05 Thread Cong Wang
On Wed, Sep 5, 2018 at 12:05 AM Vlad Buslov  wrote:
>
>
> On Tue 04 Sep 2018 at 22:41, Cong Wang  wrote:
> > On Mon, Sep 3, 2018 at 1:33 PM Vlad Buslov  wrote:
> >>
> >>
> >> On Mon 03 Sep 2018 at 18:50, Cong Wang  wrote:
> >> > On Mon, Sep 3, 2018 at 12:06 AM Vlad Buslov  wrote:
> >> >>
> >> >> Action API was changed to work with actions and action_idr in concurrency
> >> >> safe manner, however tcf_del_walker() still uses actions without taking
> >> >> reference to them first and deletes them directly, disregarding possible
> >> >> concurrent delete.
> >> >>
> >> >> Change tcf_del_walker() to use tcf_idr_delete_index() that doesn't require
> >> >> caller to hold reference to action and accepts action id as argument,
> >> >> instead of direct action pointer.
> >> >
> >> > Hmm, why doesn't tcf_del_walker() just take idrinfo->lock? At least
> >> > tcf_dump_walker() already does.
> >>
> >> Because tcf_del_walker() calls __tcf_idr_release(), which take
> >> idrinfo->lock itself (deadlock). It also calls sleeping functions like
> >
> > Deadlock can be easily resolved by moving the lock out.
> >
> >
> >> tcf_action_goto_chain_fini(), so just implementing function that
> >> releases action without taking idrinfo->lock is not enough.
> >
> > Sleeping can be resolved either by making it atomic or
> > deferring it to a work queue.
> >
> > None of your arguments here is a blocker to locking
> > idrinfo->lock. You really should focus on if it is really
> > necessary to lock idrinfo->lock in tcf_del_walker(), rather
> > than these details.
> >
> > For me, if you need idrinfo->lock for dump walker, you must
> > need it for delete walker too, because deletion is a writer
> > which should require stronger protection than the dumper,
> > which merely a reader.
>
> I don't get how it is necessary. Dump walker uses pointers to actions
> directly, and in order to be concurrency-safe it must either hold the

It uses the pointer in a read-only way, what you said doesn't change
the fact that it is a reader. And, like other readers, it may not need
to lock at all, which is a different topic.


> lock or obtain reference to action. Note that del walker doesn't use the
> action pointer, it only passed action id to tcf_idr_delete_index()
> function, which does all the necessary locking and can deal with
> potential concurrency issues (concurrent delete, etc.). This approach
> also benefits from code reuse from other code paths that delete actions,
> instead of implementing its own.

Look at the difference below.

With your change:

idr_for_each_entry_ul{
   spin_lock(&idrinfo->lock);
   idr_remove();
   spin_unlock(&idrinfo->lock);
}

With what I suggest:

spin_lock(&idrinfo->lock);
idr_for_each_entry_ul{
   idr_remove();
}
spin_unlock(&idrinfo->lock);

Isn't a concurrent tcf_idr_check_alloc() able to livelock here with
your change?

idr_for_each_entry_ul{
   spin_lock(&idrinfo->lock);
   idr_remove();
   spin_unlock(&idrinfo->lock);
   // tcf_idr_check_alloc() jumps in,
   // allocates next ID which can be found
   // by idr_get_next_ul()
} // the whole loop goes _literally_ infinite...


Also, idr_for_each_entry_ul() is supposed to be protected either
by RCU or idrinfo->lock, no? With your change or without any change,
it doesn't even have any lock after removing RTNL?
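
The interleaving I am worried about can be replayed in a single-threaded
toy model. This is only a sketch under loud assumptions: struct demo_table,
the `locked` flag standing in for idrinfo->lock and the simulated allocator
are all made up for illustration, and nothing here is the kernel IDR API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SLOTS 8

/* Toy stand-in for an ID table; `locked` models idrinfo->lock being held. */
struct demo_table {
	bool used[DEMO_SLOTS];
	bool locked;
};

/*
 * Simulated concurrent allocator: it may only run while the lock is free,
 * and it takes the first free ID above `after`, i.e. one the walker has
 * not reached yet.
 */
static void demo_concurrent_alloc(struct demo_table *t, size_t after)
{
	if (t->locked)
		return;
	for (size_t i = after + 1; i < DEMO_SLOTS; i++) {
		if (!t->used[i]) {
			t->used[i] = true;
			return;
		}
	}
}

/* Lock taken and dropped per entry: the allocator gets a window each time. */
static int demo_del_walker_lock_per_entry(struct demo_table *t)
{
	int removed = 0;

	assert(!t->locked);
	for (size_t i = 0; i < DEMO_SLOTS; i++) {
		if (!t->used[i])
			continue;
		t->locked = true;
		t->used[i] = false;
		removed++;
		t->locked = false;
		demo_concurrent_alloc(t, i); /* runs in the unlocked window */
	}
	return removed;
}

/* Lock held across the whole walk: the allocator is kept out. */
static int demo_del_walker_lock_whole_loop(struct demo_table *t)
{
	int removed = 0;

	assert(!t->locked);
	t->locked = true;
	for (size_t i = 0; i < DEMO_SLOTS; i++) {
		if (!t->used[i])
			continue;
		t->used[i] = false;
		removed++;
		demo_concurrent_alloc(t, i); /* refused: lock is held */
	}
	t->locked = false;
	return removed;
}
```

Starting from three entries, the per-entry variant ends up removing eight
in this toy table, because the allocator keeps feeding it new IDs; the
whole-loop variant removes exactly three. The toy table is bounded, so the
walk still ends here, which a large ID space would not guarantee.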


[PATCH iproute2] tc/mqprio: Print extra info on invalid args.

2018-09-05 Thread Caleb Raitto
From: Caleb Raitto 

Print the name of the argument that wasn't understood, and also print
the usage string.
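
As a rough illustration of the intended behaviour (not the q_mqprio.c
code itself), a parser can report the offending token like this. The
function demo_parse_opt, its error buffer and the two accepted keywords
are assumptions made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Walk the argument list and accept a couple of known keywords.
 * On the first unknown token, store a message naming it in `err`
 * and return -1; return 0 when every token is recognized.
 */
static int demo_parse_opt(int argc, char **argv, char *err, size_t errlen)
{
	assert(err != NULL && errlen > 0);
	err[0] = '\0';

	for (; argc > 0; argc--, argv++) {
		if (strcmp(*argv, "num_tc") == 0 || strcmp(*argv, "map") == 0) {
			/* A real parser would consume the keyword's value here. */
			continue;
		}
		snprintf(err, errlen, "Unknown argument: %s", *argv);
		return -1;
	}
	return 0;
}
```

A caller would print the stored message to stderr and then show the usage
text, so a typo such as "numtc" is named directly in the error.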

Signed-off-by: Caleb Raitto 
---
 tc/q_mqprio.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tc/q_mqprio.c b/tc/q_mqprio.c
index 89b46002..cf2eceb4 100644
--- a/tc/q_mqprio.c
+++ b/tc/q_mqprio.c
@@ -167,7 +167,8 @@ static int mqprio_parse_opt(struct qdisc_util *qu, int argc,
explain();
return -1;
} else {
-   fprintf(stderr, "Unknown argument\n");
+   fprintf(stderr, "Unknown argument: %s\n", *argv);
+   explain();
return -1;
}
argc--; argv++;
-- 
2.19.0.rc1.350.ge57e33dbd1-goog



[PATCH iproute2] man: Change numtc to num_tc

2018-09-05 Thread Caleb Raitto
From: Caleb Raitto 

The argument parser only accepts num_tc:

https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/tc/q_mqprio.c#n55

Signed-off-by: Caleb Raitto 
---
 man/man8/tc-mqprio.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/tc-mqprio.8 b/man/man8/tc-mqprio.8
index a1bedd35..4b9e942e 100644
--- a/man/man8/tc-mqprio.8
+++ b/man/man8/tc-mqprio.8
@@ -8,7 +8,7 @@ dev
 classid
 .B | root) [ handle
 major:
-.B ] mqprio [ numtc
+.B ] mqprio [ num_tc
 tcs
 .B ] [ map
 P0 P1 P2...
-- 
2.19.0.rc1.350.ge57e33dbd1-goog



[PATCH net-next v3 3/5] net: dsa: b53: Add helper to set link parameters

2018-09-05 Thread Florian Fainelli
Extract the logic from b53_adjust_link() responsible for overriding a
given port's link, speed, duplex and pause settings and make two helper
functions to set the port's configuration and the port's link settings.
We will make use of both, as separate functions while adding PHYLINK
support next.
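
The link helper boils down to a read-modify-write of the override
register. A minimal userspace sketch of that pattern follows; the
DEMO_OVERRIDE_* bit values and demo_force_link are hypothetical and do not
match the real b53 register layout or the driver function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define DEMO_OVERRIDE_EN   0x40u
#define DEMO_OVERRIDE_LINK 0x01u

/*
 * Update the register value in place: always set the override enable
 * bit, then set or clear the link bit, leaving all other bits alone.
 */
static void demo_force_link(uint8_t *reg, int link)
{
	assert(reg != NULL);

	*reg |= DEMO_OVERRIDE_EN;
	if (link)
		*reg |= DEMO_OVERRIDE_LINK;
	else
		*reg &= (uint8_t)~DEMO_OVERRIDE_LINK;
}
```

Because the old value is read first, unrelated bits such as speed or
duplex are preserved, which is what lets link and port configuration be
set from two separate helpers.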

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 89 +---
 1 file changed, 60 insertions(+), 29 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 85ed264bc163..78aeaccf19a1 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -947,33 +948,50 @@ static int b53_setup(struct dsa_switch *ds)
return ret;
 }
 
-static void b53_adjust_link(struct dsa_switch *ds, int port,
-   struct phy_device *phydev)
+static void b53_force_link(struct b53_device *dev, int port, int link)
 {
-   struct b53_device *dev = ds->priv;
-   struct ethtool_eee *p = &dev->ports[port].eee;
-   u8 rgmii_ctrl = 0, reg = 0, off;
-
-   if (!phy_is_pseudo_fixed_link(phydev))
-   return;
+   u8 reg, val, off;
 
/* Override the port settings */
if (port == dev->cpu_port) {
off = B53_PORT_OVERRIDE_CTRL;
-   reg = PORT_OVERRIDE_EN;
+   val = PORT_OVERRIDE_EN;
} else {
off = B53_GMII_PORT_OVERRIDE_CTRL(port);
-   reg = GMII_PO_EN;
+   val = GMII_PO_EN;
}
 
-   /* Set the link UP */
-   if (phydev->link)
+   b53_read8(dev, B53_CTRL_PAGE, off, &reg);
+   reg |= val;
+   if (link)
reg |= PORT_OVERRIDE_LINK;
+   else
+   reg &= ~PORT_OVERRIDE_LINK;
+   b53_write8(dev, B53_CTRL_PAGE, off, reg);
+}
+
+static void b53_force_port_config(struct b53_device *dev, int port,
+ int speed, int duplex, int pause)
+{
+   u8 reg, val, off;
+
+   /* Override the port settings */
+   if (port == dev->cpu_port) {
+   off = B53_PORT_OVERRIDE_CTRL;
+   val = PORT_OVERRIDE_EN;
+   } else {
+   off = B53_GMII_PORT_OVERRIDE_CTRL(port);
+   val = GMII_PO_EN;
+   }
 
-   if (phydev->duplex == DUPLEX_FULL)
+   b53_read8(dev, B53_CTRL_PAGE, off, &reg);
+   reg |= val;
+   if (duplex == DUPLEX_FULL)
reg |= PORT_OVERRIDE_FULL_DUPLEX;
+   else
+   reg &= ~PORT_OVERRIDE_FULL_DUPLEX;
 
-   switch (phydev->speed) {
+   switch (speed) {
case 2000:
reg |= PORT_OVERRIDE_SPEED_2000M;
/* fallthrough */
@@ -987,21 +1005,41 @@ static void b53_adjust_link(struct dsa_switch *ds, int port,
reg |= PORT_OVERRIDE_SPEED_10M;
break;
default:
-   dev_err(ds->dev, "unknown speed: %d\n", phydev->speed);
+   dev_err(dev->dev, "unknown speed: %d\n", speed);
return;
}
 
+   if (pause & MLO_PAUSE_RX)
+   reg |= PORT_OVERRIDE_RX_FLOW;
+   if (pause & MLO_PAUSE_TX)
+   reg |= PORT_OVERRIDE_TX_FLOW;
+
+   b53_write8(dev, B53_CTRL_PAGE, off, reg);
+}
+
+static void b53_adjust_link(struct dsa_switch *ds, int port,
+   struct phy_device *phydev)
+{
+   struct b53_device *dev = ds->priv;
+   struct ethtool_eee *p = &dev->ports[port].eee;
+   u8 rgmii_ctrl = 0, reg = 0, off;
+   int pause;
+
+   if (!phy_is_pseudo_fixed_link(phydev))
+   return;
+
/* Enable flow control on BCM5301x's CPU port */
if (is5301x(dev) && port == dev->cpu_port)
-   reg |= PORT_OVERRIDE_RX_FLOW | PORT_OVERRIDE_TX_FLOW;
+   pause = MLO_PAUSE_TXRX_MASK;
 
if (phydev->pause) {
if (phydev->asym_pause)
-   reg |= PORT_OVERRIDE_TX_FLOW;
-   reg |= PORT_OVERRIDE_RX_FLOW;
+   pause |= MLO_PAUSE_TX;
+   pause |= MLO_PAUSE_RX;
}
 
-   b53_write8(dev, B53_CTRL_PAGE, off, reg);
+   b53_force_port_config(dev, port, phydev->speed, phydev->duplex, pause);
+   b53_force_link(dev, port, phydev->link);
 
if (is531x5(dev) && phy_interface_is_rgmii(phydev)) {
if (port == 8)
@@ -1061,16 +1099,9 @@ static void b53_adjust_link(struct dsa_switch *ds, int port,
}
} else if (is5301x(dev)) {
if (port != dev->cpu_port) {
-   u8 po_reg = B53_GMII_PORT_OVERRIDE_CTRL(dev->cpu_port);
-   u8 gmii_po;
-
-   b53_read8(dev, B53_CTRL_PAGE, po_reg, &gmii_po);
-   gmii_po |= GMII_PO_LINK |
-  GMII_PO_RX_FLOW |
-  GMII_PO_TX_FLOW |

[PATCH net-next v3 4/5] net: dsa: b53: Add PHYLINK support

2018-09-05 Thread Florian Fainelli
Add support for PHYLINK, things are reasonably straight forward since we
do not yet support SerDes interfaces, that leaves us with just
MLO_AN_PHY and MLO_AN_FIXED to deal with.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 122 +++
 drivers/net/dsa/b53/b53_priv.h   |  17 +
 2 files changed, 139 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 78aeaccf19a1..3d5e822bb17c 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1109,6 +1109,122 @@ static void b53_adjust_link(struct dsa_switch *ds, int port,
p->eee_enabled = b53_eee_init(ds, port, phydev);
 }
 
+void b53_port_event(struct dsa_switch *ds, int port)
+{
+   struct b53_device *dev = ds->priv;
+   bool link;
+   u16 sts;
+
+   b53_read16(dev, B53_STAT_PAGE, B53_LINK_STAT, &sts);
+   link = !!(sts & BIT(port));
+   dsa_port_phylink_mac_change(ds, port, link);
+}
+EXPORT_SYMBOL(b53_port_event);
+
+void b53_phylink_validate(struct dsa_switch *ds, int port,
+ unsigned long *supported,
+ struct phylink_link_state *state)
+{
+   struct b53_device *dev = ds->priv;
+   __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
+
+   /* Allow all the expected bits */
+   phylink_set(mask, Autoneg);
+   phylink_set_port_modes(mask);
+   phylink_set(mask, Pause);
+   phylink_set(mask, Asym_Pause);
+
+   /* With the exclusion of 5325/5365, MII, Reverse MII and 802.3z, we
+* support Gigabit, including Half duplex.
+*/
+   if (state->interface != PHY_INTERFACE_MODE_MII &&
+   state->interface != PHY_INTERFACE_MODE_REVMII &&
+   !phy_interface_mode_is_8023z(state->interface) &&
+   !(is5325(dev) || is5365(dev))) {
+   phylink_set(mask, 1000baseT_Full);
+   phylink_set(mask, 1000baseT_Half);
+   }
+
+   if (!phy_interface_mode_is_8023z(state->interface)) {
+   phylink_set(mask, 10baseT_Half);
+   phylink_set(mask, 10baseT_Full);
+   phylink_set(mask, 100baseT_Half);
+   phylink_set(mask, 100baseT_Full);
+   }
+
+   bitmap_and(supported, supported, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+   bitmap_and(state->advertising, state->advertising, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+
+   phylink_helper_basex_speed(state);
+}
+EXPORT_SYMBOL(b53_phylink_validate);
+
+int b53_phylink_mac_link_state(struct dsa_switch *ds, int port,
+  struct phylink_link_state *state)
+{
+   int ret = -EOPNOTSUPP;
+
+   return ret;
+}
+EXPORT_SYMBOL(b53_phylink_mac_link_state);
+
+void b53_phylink_mac_config(struct dsa_switch *ds, int port,
+   unsigned int mode,
+   const struct phylink_link_state *state)
+{
+   struct b53_device *dev = ds->priv;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode == MLO_AN_FIXED) {
+   b53_force_port_config(dev, port, state->speed,
+ state->duplex, state->pause);
+   return;
+   }
+}
+EXPORT_SYMBOL(b53_phylink_mac_config);
+
+void b53_phylink_mac_an_restart(struct dsa_switch *ds, int port)
+{
+}
+EXPORT_SYMBOL(b53_phylink_mac_an_restart);
+
+void b53_phylink_mac_link_down(struct dsa_switch *ds, int port,
+  unsigned int mode,
+  phy_interface_t interface)
+{
+   struct b53_device *dev = ds->priv;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode == MLO_AN_FIXED) {
+   b53_force_link(dev, port, false);
+   return;
+   }
+}
+EXPORT_SYMBOL(b53_phylink_mac_link_down);
+
+void b53_phylink_mac_link_up(struct dsa_switch *ds, int port,
+unsigned int mode,
+phy_interface_t interface,
+struct phy_device *phydev)
+{
+   struct b53_device *dev = ds->priv;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode == MLO_AN_FIXED) {
+   b53_force_link(dev, port, true);
+   return;
+   }
+}
+EXPORT_SYMBOL(b53_phylink_mac_link_up);
+
 int b53_vlan_filtering(struct dsa_switch *ds, int port, bool vlan_filtering)
 {
return 0;
@@ -1750,6 +1866,12 @@ static const struct dsa_switch_ops b53_switch_ops = {
.phy_read   = b53_phy_read16,
.phy_write  = b53_phy_write16,
.adjust_link= b53_adjust_link,
+   .phylink_validate   = b53_phylink_validate,
+   .phylink_mac_link_state = b53_phylink_mac_link_state,
+   .phylink_mac_config = b53_phylink_mac_config,
+   .phylink_mac_an_restart = b53_phylink_mac_an_restart,
+   .phylink_mac_link_down  = 

[PATCH net-next v3 5/5] net: dsa: b53: Add SerDes support

2018-09-05 Thread Florian Fainelli
Add support for the Northstar Plus SerDes which is accessed through a
special page of the switch. Since this is something that most people
probably will not want to use, make it a configurable option with a
default on ARCH_BCM_NSP where it is the most useful currently.

The SerDes supports both SGMII and 1000baseX modes for both lanes, and
2500baseX for one of the lanes, and is internally looking like a
seemingly standard MII PHY, except for the few bits that got repurposed.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/Kconfig  |   7 +
 drivers/net/dsa/b53/Makefile |   1 +
 drivers/net/dsa/b53/b53_common.c |  26 
 drivers/net/dsa/b53/b53_priv.h   |  17 +++
 drivers/net/dsa/b53/b53_serdes.c | 217 +++
 drivers/net/dsa/b53/b53_serdes.h | 121 +
 drivers/net/dsa/b53/b53_srab.c   | 101 ++
 7 files changed, 490 insertions(+)
 create mode 100644 drivers/net/dsa/b53/b53_serdes.c
 create mode 100644 drivers/net/dsa/b53/b53_serdes.h

diff --git a/drivers/net/dsa/b53/Kconfig b/drivers/net/dsa/b53/Kconfig
index 37745f4bf4f6..e83ebfafd881 100644
--- a/drivers/net/dsa/b53/Kconfig
+++ b/drivers/net/dsa/b53/Kconfig
@@ -35,3 +35,10 @@ config B53_SRAB_DRIVER
help
  Select to enable support for memory-mapped Switch Register Access
  Bridge Registers (SRAB) like it is found on the BCM53010
+
+config B53_SERDES
+   tristate "B53 SerDes support"
+   depends on B53
+   default ARCH_BCM_NSP
+   help
+ Select to enable support for SerDes on e.g: Northstar Plus SoCs.
diff --git a/drivers/net/dsa/b53/Makefile b/drivers/net/dsa/b53/Makefile
index 4256fb42a4dd..b1be13023ae4 100644
--- a/drivers/net/dsa/b53/Makefile
+++ b/drivers/net/dsa/b53/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_B53_SPI_DRIVER)+= b53_spi.o
 obj-$(CONFIG_B53_MDIO_DRIVER)  += b53_mdio.o
 obj-$(CONFIG_B53_MMAP_DRIVER)  += b53_mmap.o
 obj-$(CONFIG_B53_SRAB_DRIVER)  += b53_srab.o
+obj-$(CONFIG_B53_SERDES)   += b53_serdes.o
diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 3d5e822bb17c..ea4256cd628b 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -765,6 +765,8 @@ static int b53_reset_switch(struct b53_device *priv)
memset(priv->vlans, 0, sizeof(*priv->vlans) * priv->num_vlans);
memset(priv->ports, 0, sizeof(*priv->ports) * priv->num_ports);
 
+   priv->serdes_lane = B53_INVALID_LANE;
+
return b53_switch_reset(priv);
 }
 
@@ -1128,6 +1130,9 @@ void b53_phylink_validate(struct dsa_switch *ds, int port,
struct b53_device *dev = ds->priv;
__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
+   if (dev->ops->serdes_phylink_validate)
+   dev->ops->serdes_phylink_validate(dev, port, mask, state);
+
/* Allow all the expected bits */
phylink_set(mask, Autoneg);
phylink_set_port_modes(mask);
@@ -1164,8 +1169,13 @@ EXPORT_SYMBOL(b53_phylink_validate);
 int b53_phylink_mac_link_state(struct dsa_switch *ds, int port,
   struct phylink_link_state *state)
 {
+   struct b53_device *dev = ds->priv;
int ret = -EOPNOTSUPP;
 
+   if (phy_interface_mode_is_8023z(state->interface) &&
+   dev->ops->serdes_link_state)
+   ret = dev->ops->serdes_link_state(dev, port, state);
+
return ret;
 }
 EXPORT_SYMBOL(b53_phylink_mac_link_state);
@@ -1184,11 +1194,19 @@ void b53_phylink_mac_config(struct dsa_switch *ds, int port,
  state->duplex, state->pause);
return;
}
+
+   if (phy_interface_mode_is_8023z(state->interface) &&
+   dev->ops->serdes_config)
+   dev->ops->serdes_config(dev, port, mode, state);
 }
 EXPORT_SYMBOL(b53_phylink_mac_config);
 
 void b53_phylink_mac_an_restart(struct dsa_switch *ds, int port)
 {
+   struct b53_device *dev = ds->priv;
+
+   if (dev->ops->serdes_an_restart)
+   dev->ops->serdes_an_restart(dev, port);
 }
 EXPORT_SYMBOL(b53_phylink_mac_an_restart);
 
@@ -1205,6 +1223,10 @@ void b53_phylink_mac_link_down(struct dsa_switch *ds, int port,
b53_force_link(dev, port, false);
return;
}
+
+   if (phy_interface_mode_is_8023z(interface) &&
+   dev->ops->serdes_link_set)
+   dev->ops->serdes_link_set(dev, port, mode, interface, false);
 }
 EXPORT_SYMBOL(b53_phylink_mac_link_down);
 
@@ -1222,6 +1244,10 @@ void b53_phylink_mac_link_up(struct dsa_switch *ds, int port,
b53_force_link(dev, port, true);
return;
}
+
+   if (phy_interface_mode_is_8023z(interface) &&
+   dev->ops->serdes_link_set)
+   dev->ops->serdes_link_set(dev, port, mode, interface, true);
 }
 EXPORT_SYMBOL(b53_phylink_mac_link_up);
 
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 

[PATCH net-next v3 2/5] net: dsa: b53: Make SRAB driver manage port interrupts

2018-09-05 Thread Florian Fainelli
Update the SRAB driver to manage per-port interrupts. Since we cannot
sleep during b53_io_ops, schedule a workqueue whenever we get a port
specific interrupt. We will later make use of this to call back into
PHYLINK when there is e.g: a link state change.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_srab.c | 105 +
 1 file changed, 105 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_srab.c b/drivers/net/dsa/b53/b53_srab.c
index 91de2ba99ad1..645dde0d317d 100644
--- a/drivers/net/dsa/b53/b53_srab.c
+++ b/drivers/net/dsa/b53/b53_srab.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -47,6 +48,7 @@
 
 /* command and status register of the SRAB */
 #define B53_SRAB_CTRLS 0x40
+#define  B53_SRAB_CTRLS_HOST_INTR  BIT(1)
 #define  B53_SRAB_CTRLS_RCAREQ BIT(3)
 #define  B53_SRAB_CTRLS_RCAGNT BIT(4)
 #define  B53_SRAB_CTRLS_SW_INIT_DONE   BIT(6)
@@ -60,8 +62,16 @@
 #define  B53_SRAB_P7_SLEEP_TIMER   BIT(11)
 #define  B53_SRAB_IMP0_SLEEP_TIMER BIT(12)
 
+struct b53_srab_port_priv {
+   int irq;
+   bool irq_enabled;
+   struct b53_device *dev;
+   unsigned int num;
+};
+
 struct b53_srab_priv {
void __iomem *regs;
+   struct b53_srab_port_priv port_intrs[B53_N_PORTS];
 };
 
 static int b53_srab_request_grant(struct b53_device *dev)
@@ -344,6 +354,49 @@ static int b53_srab_write64(struct b53_device *dev, u8 page, u8 reg,
return ret;
 }
 
+static irqreturn_t b53_srab_port_thread(int irq, void *dev_id)
+{
+   return IRQ_HANDLED;
+}
+
+static irqreturn_t b53_srab_port_isr(int irq, void *dev_id)
+{
+   struct b53_srab_port_priv *port = dev_id;
+   struct b53_device *dev = port->dev;
+   struct b53_srab_priv *priv = dev->priv;
+
+   /* Acknowledge the interrupt */
+   writel(BIT(port->num), priv->regs + B53_SRAB_INTR);
+
+   return IRQ_WAKE_THREAD;
+}
+
+static int b53_srab_irq_enable(struct b53_device *dev, int port)
+{
+   struct b53_srab_priv *priv = dev->priv;
+   struct b53_srab_port_priv *p = &priv->port_intrs[port];
+   int ret;
+
+   ret = request_threaded_irq(p->irq, b53_srab_port_isr,
+  b53_srab_port_thread, 0,
+  dev_name(dev->dev), p);
+   if (!ret)
+   p->irq_enabled = true;
+
+   return ret;
+}
+
+static void b53_srab_irq_disable(struct b53_device *dev, int port)
+{
+   struct b53_srab_priv *priv = dev->priv;
+   struct b53_srab_port_priv *p = &priv->port_intrs[port];
+
+   if (p->irq_enabled) {
+   free_irq(p->irq, p);
+   p->irq_enabled = false;
+   }
+}
+
 static const struct b53_io_ops b53_srab_ops = {
.read8 = b53_srab_read8,
.read16 = b53_srab_read16,
@@ -355,6 +408,8 @@ static const struct b53_io_ops b53_srab_ops = {
.write32 = b53_srab_write32,
.write48 = b53_srab_write48,
.write64 = b53_srab_write64,
+   .irq_enable = b53_srab_irq_enable,
+   .irq_disable = b53_srab_irq_disable,
 };
 
 static const struct of_device_id b53_srab_of_match[] = {
@@ -379,6 +434,52 @@ static const struct of_device_id b53_srab_of_match[] = {
 };
 MODULE_DEVICE_TABLE(of, b53_srab_of_match);
 
+static void b53_srab_intr_set(struct b53_srab_priv *priv, bool set)
+{
+   u32 reg;
+
+   reg = readl(priv->regs + B53_SRAB_CTRLS);
+   if (set)
+   reg |= B53_SRAB_CTRLS_HOST_INTR;
+   else
+   reg &= ~B53_SRAB_CTRLS_HOST_INTR;
+   writel(reg, priv->regs + B53_SRAB_CTRLS);
+}
+
+static void b53_srab_prepare_irq(struct platform_device *pdev)
+{
+   struct b53_device *dev = platform_get_drvdata(pdev);
+   struct b53_srab_priv *priv = dev->priv;
+   struct b53_srab_port_priv *port;
+   unsigned int i;
+   char *name;
+
+   /* Clear all pending interrupts */
+   writel(0x, priv->regs + B53_SRAB_INTR);
+
+   if (dev->pdata && dev->pdata->chip_id != BCM58XX_DEVICE_ID)
+   return;
+
+   for (i = 0; i < B53_N_PORTS; i++) {
+   port = &priv->port_intrs[i];
+
+   /* There is no port 6 */
+   if (i == 6)
+   continue;
+
+   name = kasprintf(GFP_KERNEL, "link_state_p%d", i);
+   if (!name)
+   return;
+
+   port->num = i;
+   port->dev = dev;
+   port->irq = platform_get_irq_byname(pdev, name);
+   kfree(name);
+   }
+
+   b53_srab_intr_set(priv, true);
+}
+
 static int b53_srab_probe(struct platform_device *pdev)
 {
struct b53_platform_data *pdata = pdev->dev.platform_data;
@@ -417,13 +518,17 @@ static int b53_srab_probe(struct platform_device *pdev)
 
platform_set_drvdata(pdev, dev);
 
+   b53_srab_prepare_irq(pdev);
+
return b53_switch_register(dev);
 }
 
 static int 

[PATCH net-next v3 1/5] net: dsa: b53: Add ability to enable/disable port interrupts

2018-09-05 Thread Florian Fainelli
Some switches expose individual interrupt line(s) for port specific
event(s), allow configuring these interrupts at an appropriate time
during port_enable/disable callbacks where all port specific resources
are known to be set-up and ready for use.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 9 +
 drivers/net/dsa/b53/b53_priv.h   | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index d93c790bfbe8..85ed264bc163 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -502,8 +502,14 @@ int b53_enable_port(struct dsa_switch *ds, int port, struct phy_device *phy)
 {
struct b53_device *dev = ds->priv;
unsigned int cpu_port = ds->ports[port].cpu_dp->index;
+   int ret = 0;
u16 pvlan;
 
+   if (dev->ops->irq_enable)
+   ret = dev->ops->irq_enable(dev, port);
+   if (ret)
+   return ret;
+
/* Clear the Rx and Tx disable bits and set to no spanning tree */
b53_write8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), 0);
 
@@ -536,6 +542,9 @@ void b53_disable_port(struct dsa_switch *ds, int port, struct phy_device *phy)
b53_read8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), &reg);
reg |= PORT_CTRL_RX_DISABLE | PORT_CTRL_TX_DISABLE;
b53_write8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), reg);
+
+   if (dev->ops->irq_disable)
+   dev->ops->irq_disable(dev, port);
 }
 EXPORT_SYMBOL(b53_disable_port);
 
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index df149756c282..2980a5838f58 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -43,6 +43,8 @@ struct b53_io_ops {
int (*write64)(struct b53_device *dev, u8 page, u8 reg, u64 value);
int (*phy_read16)(struct b53_device *dev, int addr, int reg, u16 *value);
int (*phy_write16)(struct b53_device *dev, int addr, int reg, u16 value);
+   int (*irq_enable)(struct b53_device *dev, int port);
+   void (*irq_disable)(struct b53_device *dev, int port);
 };
 
 enum {
-- 
2.17.1
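
For readers following along, the optional-hook pattern this patch adds
to b53_enable_port()/b53_disable_port() can be sketched in plain
userspace C. This is only an illustration, not the driver code: the
sketch_* types and functions are hypothetical stand-ins for struct
b53_device and struct b53_io_ops, and the register writes are reduced
to a flag.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dev;

/* Hypothetical stand-in for struct b53_io_ops: both hooks are optional. */
struct sketch_ops {
	int (*irq_enable)(struct sketch_dev *dev, int port);
	void (*irq_disable)(struct sketch_dev *dev, int port);
};

/* Hypothetical stand-in for struct b53_device. */
struct sketch_dev {
	const struct sketch_ops *ops;
	int irq_users;     /* ports currently holding an interrupt */
	int port_enabled;  /* placeholder for the port control register state */
};

/* Same order as the patch: call the hook first and bail out on error. */
int sketch_enable_port(struct sketch_dev *dev, int port)
{
	int ret = 0;

	assert(dev && dev->ops);
	if (dev->ops->irq_enable)
		ret = dev->ops->irq_enable(dev, port);
	if (ret)
		return ret;

	dev->port_enabled = 1;
	return 0;
}

/* Same order as the patch: disable the port, then release the interrupt. */
void sketch_disable_port(struct sketch_dev *dev, int port)
{
	assert(dev && dev->ops);
	dev->port_enabled = 0;
	if (dev->ops->irq_disable)
		dev->ops->irq_disable(dev, port);
}

/* Example hooks used to exercise the three cases. */
int sketch_irq_ok(struct sketch_dev *dev, int port)
{
	(void)port;
	dev->irq_users++;
	return 0;
}

int sketch_irq_fail(struct sketch_dev *dev, int port)
{
	(void)dev;
	(void)port;
	return -1;
}

void sketch_irq_off(struct sketch_dev *dev, int port)
{
	(void)port;
	dev->irq_users--;
}
```

A bus driver that leaves both pointers NULL sees no behaviour change,
which is presumably why the hooks are checked before every call.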



[PATCH net-next v3 0/5] net: dsa: b53: SerDes support

2018-09-05 Thread Florian Fainelli
Hi all,

This patch series adds support for the SerDes found on NorthStar Plus
(NSP) which allows us to use the SFP port on the BCM958625HR board (and
other similar designs).

Changes in v3:

- properly hunk the request_threaded_irq() bits into patch #2

Changes in v2:

- migrate to threaded interrupt (Andrew)
- fixed a case where MLO_AN_FIXED's mac_config would still call into
  the serdes_config callback
- added an additional check on the phylink interface in mac_config
- default to ARCH_BCM_NSP instead of ARCH_BCM_IPROC which is really
  the NSP Kconfig bit we want


Florian Fainelli (5):
  net: dsa: b53: Add ability to enable/disable port interrupts
  net: dsa: b53: Make SRAB driver manage port interrupts
  net: dsa: b53: Add helper to set link parameters
  net: dsa: b53: Add PHYLINK support
  net: dsa: b53: Add SerDes support

 drivers/net/dsa/b53/Kconfig  |   7 +
 drivers/net/dsa/b53/Makefile |   1 +
 drivers/net/dsa/b53/b53_common.c | 246 +++
 drivers/net/dsa/b53/b53_priv.h   |  36 +
 drivers/net/dsa/b53/b53_serdes.c | 217 +++
 drivers/net/dsa/b53/b53_serdes.h | 121 +++
 drivers/net/dsa/b53/b53_srab.c   | 206 ++
 7 files changed, 805 insertions(+), 29 deletions(-)
 create mode 100644 drivers/net/dsa/b53/b53_serdes.c
 create mode 100644 drivers/net/dsa/b53/b53_serdes.h

-- 
2.17.1



Re: [PATCH net-next v2 0/5] net: dsa: b53: SerDes support

2018-09-05 Thread Florian Fainelli
On 09/05/2018 12:23 PM, Florian Fainelli wrote:
> Hi all,
> 
> This patch series adds support for the SerDes found on NorthStar Plus
> (NSP) which allows us to use the SFP port on the BCM958625HR board (and
> other similar designs).

David, please disregard this version, for some reason I managed to send
the incorrect one that does not have the threaded interrupt support.

> 
> Changes in v2:
> 
> - migrate to threaded interrupt (Andrew)
> - fixed a case where MLO_AN_FIXED's mac_config would still call into
>   the serdes_config callback
> - added an additional check on the phylink interface in mac_config
> - default to ARCH_BCM_NSP instead of ARCH_BCM_IPROC which is really
>   the NSP Kconfig bit we want
> 
> Florian Fainelli (5):
>   net: dsa: b53: Add ability to enable/disable port interrupts
>   net: dsa: b53: Make SRAB driver manage port interrupts
>   net: dsa: b53: Add helper to set link parameters
>   net: dsa: b53: Add PHYLINK support
>   net: dsa: b53: Add SerDes support
> 
>  drivers/net/dsa/b53/Kconfig  |   7 +
>  drivers/net/dsa/b53/Makefile |   1 +
>  drivers/net/dsa/b53/b53_common.c | 246 +++
>  drivers/net/dsa/b53/b53_priv.h   |  36 +
>  drivers/net/dsa/b53/b53_serdes.c | 217 +++
>  drivers/net/dsa/b53/b53_serdes.h | 121 +++
>  drivers/net/dsa/b53/b53_srab.c   | 210 ++
>  7 files changed, 809 insertions(+), 29 deletions(-)
>  create mode 100644 drivers/net/dsa/b53/b53_serdes.c
>  create mode 100644 drivers/net/dsa/b53/b53_serdes.h
> 


-- 
Florian


[PATCH net-next v2 0/5] net: dsa: b53: SerDes support

2018-09-05 Thread Florian Fainelli
Hi all,

This patch series adds support for the SerDes found on NorthStar Plus
(NSP) which allows us to use the SFP port on the BCM958625HR board (and
other similar designs).

Changes in v2:

- migrate to threaded interrupt (Andrew)
- fixed a case where MLO_AN_FIXED's mac_config would still call into
  the serdes_config callback
- added an additional check on the phylink interface in mac_config
- default to ARCH_BCM_NSP instead of ARCH_BCM_IPROC which is really
  the NSP Kconfig bit we want

Florian Fainelli (5):
  net: dsa: b53: Add ability to enable/disable port interrupts
  net: dsa: b53: Make SRAB driver manage port interrupts
  net: dsa: b53: Add helper to set link parameters
  net: dsa: b53: Add PHYLINK support
  net: dsa: b53: Add SerDes support

 drivers/net/dsa/b53/Kconfig  |   7 +
 drivers/net/dsa/b53/Makefile |   1 +
 drivers/net/dsa/b53/b53_common.c | 246 +++
 drivers/net/dsa/b53/b53_priv.h   |  36 +
 drivers/net/dsa/b53/b53_serdes.c | 217 +++
 drivers/net/dsa/b53/b53_serdes.h | 121 +++
 drivers/net/dsa/b53/b53_srab.c   | 210 ++
 7 files changed, 809 insertions(+), 29 deletions(-)
 create mode 100644 drivers/net/dsa/b53/b53_serdes.c
 create mode 100644 drivers/net/dsa/b53/b53_serdes.h

-- 
2.17.1



[PATCH net-next v2 2/5] net: dsa: b53: Make SRAB driver manage port interrupts

2018-09-05 Thread Florian Fainelli
Update the SRAB driver to manage per-port interrupts. Since we cannot
sleep during b53_io_ops, schedule a workqueue whenever we get a port
specific interrupt. We will later make use of this to call back into
PHYLINK when there is e.g: a link state change.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_srab.c | 108 +
 1 file changed, 108 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_srab.c b/drivers/net/dsa/b53/b53_srab.c
index 91de2ba99ad1..411b84f61903 100644
--- a/drivers/net/dsa/b53/b53_srab.c
+++ b/drivers/net/dsa/b53/b53_srab.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "b53_priv.h"
 
@@ -47,6 +48,7 @@
 
 /* command and status register of the SRAB */
 #define B53_SRAB_CTRLS 0x40
+#define  B53_SRAB_CTRLS_HOST_INTR  BIT(1)
 #define  B53_SRAB_CTRLS_RCAREQ BIT(3)
 #define  B53_SRAB_CTRLS_RCAGNT BIT(4)
 #define  B53_SRAB_CTRLS_SW_INIT_DONE   BIT(6)
@@ -60,8 +62,17 @@
 #define  B53_SRAB_P7_SLEEP_TIMER   BIT(11)
 #define  B53_SRAB_IMP0_SLEEP_TIMER BIT(12)
 
+struct b53_srab_port_priv {
+   struct work_struct irq_work;
+   int irq;
+   bool irq_enabled;
+   struct b53_device *dev;
+   unsigned int num;
+};
+
 struct b53_srab_priv {
void __iomem *regs;
+   struct b53_srab_port_priv port_intrs[B53_N_PORTS];
 };
 
 static int b53_srab_request_grant(struct b53_device *dev)
@@ -344,6 +355,50 @@ static int b53_srab_write64(struct b53_device *dev, u8 page, u8 reg,
return ret;
 }
 
+static void b53_srab_port_defer(struct work_struct *work)
+{
+}
+
+static irqreturn_t b53_srab_port_isr(int irq, void *dev_id)
+{
+   struct b53_srab_port_priv *port = dev_id;
+   struct b53_device *dev = port->dev;
+   struct b53_srab_priv *priv = dev->priv;
+
+   /* Acknowledge the interrupt */
+   writel(BIT(port->num), priv->regs + B53_SRAB_INTR);
+
+   schedule_work(&port->irq_work);
+
+   return IRQ_HANDLED;
+}
+
+static int b53_srab_irq_enable(struct b53_device *dev, int port)
+{
+   struct b53_srab_priv *priv = dev->priv;
+   struct b53_srab_port_priv *p = &priv->port_intrs[port];
+   int ret;
+
+   ret = request_irq(p->irq, b53_srab_port_isr, 0,
+ dev_name(dev->dev), p);
+   if (!ret)
+   p->irq_enabled = true;
+
+   return ret;
+}
+
+static void b53_srab_irq_disable(struct b53_device *dev, int port)
+{
+   struct b53_srab_priv *priv = dev->priv;
+   struct b53_srab_port_priv *p = &priv->port_intrs[port];
+
+   if (p->irq_enabled) {
+   free_irq(p->irq, p);
+   cancel_work_sync(&p->irq_work);
+   p->irq_enabled = false;
+   }
+}
+
 static const struct b53_io_ops b53_srab_ops = {
.read8 = b53_srab_read8,
.read16 = b53_srab_read16,
@@ -355,6 +410,8 @@ static const struct b53_io_ops b53_srab_ops = {
.write32 = b53_srab_write32,
.write48 = b53_srab_write48,
.write64 = b53_srab_write64,
+   .irq_enable = b53_srab_irq_enable,
+   .irq_disable = b53_srab_irq_disable,
 };
 
 static const struct of_device_id b53_srab_of_match[] = {
@@ -379,6 +436,53 @@ static const struct of_device_id b53_srab_of_match[] = {
 };
 MODULE_DEVICE_TABLE(of, b53_srab_of_match);
 
+static void b53_srab_intr_set(struct b53_srab_priv *priv, bool set)
+{
+   u32 reg;
+
+   reg = readl(priv->regs + B53_SRAB_CTRLS);
+   if (set)
+   reg |= B53_SRAB_CTRLS_HOST_INTR;
+   else
+   reg &= ~B53_SRAB_CTRLS_HOST_INTR;
+   writel(reg, priv->regs + B53_SRAB_CTRLS);
+}
+
+static void b53_srab_prepare_irq(struct platform_device *pdev)
+{
+   struct b53_device *dev = platform_get_drvdata(pdev);
+   struct b53_srab_priv *priv = dev->priv;
+   struct b53_srab_port_priv *port;
+   unsigned int i;
+   char *name;
+
+   /* Clear all pending interrupts */
+   writel(0xffffffff, priv->regs + B53_SRAB_INTR);
+
+   if (dev->pdata && dev->pdata->chip_id != BCM58XX_DEVICE_ID)
+   return;
+
+   for (i = 0; i < B53_N_PORTS; i++) {
+   port = &priv->port_intrs[i];
+
+   /* There is no port 6 */
+   if (i == 6)
+   continue;
+
+   name = kasprintf(GFP_KERNEL, "link_state_p%d", i);
+   if (!name)
+   return;
+
+   port->num = i;
+   port->dev = dev;
+   INIT_WORK(&port->irq_work, b53_srab_port_defer);
+   port->irq = platform_get_irq_byname(pdev, name);
+   kfree(name);
+   }
+
+   b53_srab_intr_set(priv, true);
+}
+
 static int b53_srab_probe(struct platform_device *pdev)
 {
struct b53_platform_data *pdata = pdev->dev.platform_data;
@@ -417,13 +521,17 @@ static int b53_srab_probe(struct platform_device *pdev)
 
platform_set_drvdata(pdev, dev);
 
+   b53_srab_prepare_irq(pdev);
+
return 
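
The register handling in this patch is a read-modify-write of one
control bit (b53_srab_intr_set()) plus a per-port acknowledge mask in
the ISR. A minimal userspace sketch of that bit arithmetic, assuming a
plain integer in place of the memory-mapped register; the sketch_*
names are hypothetical and only the BIT(1) position comes from the
patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as B53_SRAB_CTRLS_HOST_INTR, i.e. BIT(1). */
#define SKETCH_HOST_INTR (UINT32_C(1) << 1)

/*
 * Return the new control value with the host interrupt bit set or
 * cleared; every other bit is preserved, as in b53_srab_intr_set().
 */
uint32_t sketch_intr_set(uint32_t ctrls, bool set)
{
	if (set)
		ctrls |= SKETCH_HOST_INTR;
	else
		ctrls &= ~SKETCH_HOST_INTR;
	return ctrls;
}

/* Value the ISR writes to acknowledge one port: BIT(port->num). */
uint32_t sketch_ack_mask(unsigned int port)
{
	assert(port < 32);
	return UINT32_C(1) << port;
}
```

In the driver the same values go through readl()/writel() on
priv->regs; the sketch only shows which bits change.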

[PATCH net-next v2 4/5] net: dsa: b53: Add PHYLINK support

2018-09-05 Thread Florian Fainelli
Add support for PHYLINK. Things are reasonably straightforward since we
do not yet support SerDes interfaces, which leaves us with just
MLO_AN_PHY and MLO_AN_FIXED to deal with.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 122 +++
 drivers/net/dsa/b53/b53_priv.h   |  17 +
 2 files changed, 139 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 78aeaccf19a1..3d5e822bb17c 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1109,6 +1109,122 @@ static void b53_adjust_link(struct dsa_switch *ds, int port,
p->eee_enabled = b53_eee_init(ds, port, phydev);
 }
 
+void b53_port_event(struct dsa_switch *ds, int port)
+{
+   struct b53_device *dev = ds->priv;
+   bool link;
+   u16 sts;
+
+   b53_read16(dev, B53_STAT_PAGE, B53_LINK_STAT, &sts);
+   link = !!(sts & BIT(port));
+   dsa_port_phylink_mac_change(ds, port, link);
+}
+EXPORT_SYMBOL(b53_port_event);
+
+void b53_phylink_validate(struct dsa_switch *ds, int port,
+ unsigned long *supported,
+ struct phylink_link_state *state)
+{
+   struct b53_device *dev = ds->priv;
+   __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
+
+   /* Allow all the expected bits */
+   phylink_set(mask, Autoneg);
+   phylink_set_port_modes(mask);
+   phylink_set(mask, Pause);
+   phylink_set(mask, Asym_Pause);
+
+   /* With the exclusion of 5325/5365, MII, Reverse MII and 802.3z, we
+* support Gigabit, including Half duplex.
+*/
+   if (state->interface != PHY_INTERFACE_MODE_MII &&
+   state->interface != PHY_INTERFACE_MODE_REVMII &&
+   !phy_interface_mode_is_8023z(state->interface) &&
+   !(is5325(dev) || is5365(dev))) {
+   phylink_set(mask, 1000baseT_Full);
+   phylink_set(mask, 1000baseT_Half);
+   }
+
+   if (!phy_interface_mode_is_8023z(state->interface)) {
+   phylink_set(mask, 10baseT_Half);
+   phylink_set(mask, 10baseT_Full);
+   phylink_set(mask, 100baseT_Half);
+   phylink_set(mask, 100baseT_Full);
+   }
+
+   bitmap_and(supported, supported, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+   bitmap_and(state->advertising, state->advertising, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+
+   phylink_helper_basex_speed(state);
+}
+EXPORT_SYMBOL(b53_phylink_validate);
+
+int b53_phylink_mac_link_state(struct dsa_switch *ds, int port,
+  struct phylink_link_state *state)
+{
+   int ret = -EOPNOTSUPP;
+
+   return ret;
+}
+EXPORT_SYMBOL(b53_phylink_mac_link_state);
+
+void b53_phylink_mac_config(struct dsa_switch *ds, int port,
+   unsigned int mode,
+   const struct phylink_link_state *state)
+{
+   struct b53_device *dev = ds->priv;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode == MLO_AN_FIXED) {
+   b53_force_port_config(dev, port, state->speed,
+ state->duplex, state->pause);
+   return;
+   }
+}
+EXPORT_SYMBOL(b53_phylink_mac_config);
+
+void b53_phylink_mac_an_restart(struct dsa_switch *ds, int port)
+{
+}
+EXPORT_SYMBOL(b53_phylink_mac_an_restart);
+
+void b53_phylink_mac_link_down(struct dsa_switch *ds, int port,
+  unsigned int mode,
+  phy_interface_t interface)
+{
+   struct b53_device *dev = ds->priv;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode == MLO_AN_FIXED) {
+   b53_force_link(dev, port, false);
+   return;
+   }
+}
+EXPORT_SYMBOL(b53_phylink_mac_link_down);
+
+void b53_phylink_mac_link_up(struct dsa_switch *ds, int port,
+unsigned int mode,
+phy_interface_t interface,
+struct phy_device *phydev)
+{
+   struct b53_device *dev = ds->priv;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode == MLO_AN_FIXED) {
+   b53_force_link(dev, port, true);
+   return;
+   }
+}
+EXPORT_SYMBOL(b53_phylink_mac_link_up);
+
 int b53_vlan_filtering(struct dsa_switch *ds, int port, bool vlan_filtering)
 {
return 0;
@@ -1750,6 +1866,12 @@ static const struct dsa_switch_ops b53_switch_ops = {
.phy_read   = b53_phy_read16,
.phy_write  = b53_phy_write16,
.adjust_link= b53_adjust_link,
+   .phylink_validate   = b53_phylink_validate,
+   .phylink_mac_link_state = b53_phylink_mac_link_state,
+   .phylink_mac_config = b53_phylink_mac_config,
+   .phylink_mac_an_restart = b53_phylink_mac_an_restart,
+   .phylink_mac_link_down  = 
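
Two pieces of logic in this patch are easy to check in isolation: the
link bit lookup in b53_port_event() and the condition under which
b53_phylink_validate() advertises Gigabit modes. The sketch below is a
userspace approximation; the enum and the sketch_* functions are
hypothetical, and the legacy_chip flag stands in for the
is5325()/is5365() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of interface modes, for illustration only. */
enum sketch_iface {
	SKETCH_IF_MII,
	SKETCH_IF_REVMII,
	SKETCH_IF_RGMII,
	SKETCH_IF_1000BASEX,
	SKETCH_IF_2500BASEX,
};

/* Link state of one port out of a 16-bit status word: !!(sts & BIT(port)). */
bool sketch_port_link_up(uint16_t link_stat, unsigned int port)
{
	assert(port < 16);
	return !!(link_stat & (1u << port));
}

/* 802.3z covers the 1000BASE-X and 2500BASE-X interface modes. */
bool sketch_is_8023z(enum sketch_iface iface)
{
	return iface == SKETCH_IF_1000BASEX || iface == SKETCH_IF_2500BASEX;
}

/*
 * Gigabit is allowed unless the interface is MII, Reverse MII or
 * 802.3z, or the chip is one of the legacy 5325/5365 parts.
 */
bool sketch_gigabit_allowed(enum sketch_iface iface, bool legacy_chip)
{
	return iface != SKETCH_IF_MII &&
	       iface != SKETCH_IF_REVMII &&
	       !sketch_is_8023z(iface) &&
	       !legacy_chip;
}
```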

[PATCH net-next v2 5/5] net: dsa: b53: Add SerDes support

2018-09-05 Thread Florian Fainelli
Add support for the Northstar Plus SerDes which is accessed through a
special page of the switch. Since this is something that most people
probably will not want to use, make it a configurable option with a
default on ARCH_BCM_NSP where it is the most useful currently.

The SerDes supports both SGMII and 1000baseX modes for both lanes, and
2500baseX for one of the lanes, and is internally looking like a
seemingly standard MII PHY, except for the few bits that got repurposed.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/Kconfig  |   7 +
 drivers/net/dsa/b53/Makefile |   1 +
 drivers/net/dsa/b53/b53_common.c |  26 
 drivers/net/dsa/b53/b53_priv.h   |  17 +++
 drivers/net/dsa/b53/b53_serdes.c | 217 +++
 drivers/net/dsa/b53/b53_serdes.h | 121 +
 drivers/net/dsa/b53/b53_srab.c   | 120 +++--
 7 files changed, 500 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/dsa/b53/b53_serdes.c
 create mode 100644 drivers/net/dsa/b53/b53_serdes.h

diff --git a/drivers/net/dsa/b53/Kconfig b/drivers/net/dsa/b53/Kconfig
index 37745f4bf4f6..e83ebfafd881 100644
--- a/drivers/net/dsa/b53/Kconfig
+++ b/drivers/net/dsa/b53/Kconfig
@@ -35,3 +35,10 @@ config B53_SRAB_DRIVER
help
  Select to enable support for memory-mapped Switch Register Access
  Bridge Registers (SRAB) like it is found on the BCM53010
+
+config B53_SERDES
+   tristate "B53 SerDes support"
+   depends on B53
+   default ARCH_BCM_NSP
+   help
+ Select to enable support for SerDes on e.g: Northstar Plus SoCs.
diff --git a/drivers/net/dsa/b53/Makefile b/drivers/net/dsa/b53/Makefile
index 4256fb42a4dd..b1be13023ae4 100644
--- a/drivers/net/dsa/b53/Makefile
+++ b/drivers/net/dsa/b53/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_B53_SPI_DRIVER)+= b53_spi.o
 obj-$(CONFIG_B53_MDIO_DRIVER)  += b53_mdio.o
 obj-$(CONFIG_B53_MMAP_DRIVER)  += b53_mmap.o
 obj-$(CONFIG_B53_SRAB_DRIVER)  += b53_srab.o
+obj-$(CONFIG_B53_SERDES)   += b53_serdes.o
diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 3d5e822bb17c..ea4256cd628b 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -765,6 +765,8 @@ static int b53_reset_switch(struct b53_device *priv)
memset(priv->vlans, 0, sizeof(*priv->vlans) * priv->num_vlans);
memset(priv->ports, 0, sizeof(*priv->ports) * priv->num_ports);
 
+   priv->serdes_lane = B53_INVALID_LANE;
+
return b53_switch_reset(priv);
 }
 
@@ -1128,6 +1130,9 @@ void b53_phylink_validate(struct dsa_switch *ds, int port,
struct b53_device *dev = ds->priv;
__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
+   if (dev->ops->serdes_phylink_validate)
+   dev->ops->serdes_phylink_validate(dev, port, mask, state);
+
/* Allow all the expected bits */
phylink_set(mask, Autoneg);
phylink_set_port_modes(mask);
@@ -1164,8 +1169,13 @@ EXPORT_SYMBOL(b53_phylink_validate);
 int b53_phylink_mac_link_state(struct dsa_switch *ds, int port,
   struct phylink_link_state *state)
 {
+   struct b53_device *dev = ds->priv;
int ret = -EOPNOTSUPP;
 
+   if (phy_interface_mode_is_8023z(state->interface) &&
+   dev->ops->serdes_link_state)
+   ret = dev->ops->serdes_link_state(dev, port, state);
+
return ret;
 }
 EXPORT_SYMBOL(b53_phylink_mac_link_state);
@@ -1184,11 +1194,19 @@ void b53_phylink_mac_config(struct dsa_switch *ds, int port,
  state->duplex, state->pause);
return;
}
+
+   if (phy_interface_mode_is_8023z(state->interface) &&
+   dev->ops->serdes_config)
+   dev->ops->serdes_config(dev, port, mode, state);
 }
 EXPORT_SYMBOL(b53_phylink_mac_config);
 
 void b53_phylink_mac_an_restart(struct dsa_switch *ds, int port)
 {
+   struct b53_device *dev = ds->priv;
+
+   if (dev->ops->serdes_an_restart)
+   dev->ops->serdes_an_restart(dev, port);
 }
 EXPORT_SYMBOL(b53_phylink_mac_an_restart);
 
@@ -1205,6 +1223,10 @@ void b53_phylink_mac_link_down(struct dsa_switch *ds, int port,
b53_force_link(dev, port, false);
return;
}
+
+   if (phy_interface_mode_is_8023z(interface) &&
+   dev->ops->serdes_link_set)
+   dev->ops->serdes_link_set(dev, port, mode, interface, false);
 }
 EXPORT_SYMBOL(b53_phylink_mac_link_down);
 
@@ -1222,6 +1244,10 @@ void b53_phylink_mac_link_up(struct dsa_switch *ds, int port,
b53_force_link(dev, port, true);
return;
}
+
+   if (phy_interface_mode_is_8023z(interface) &&
+   dev->ops->serdes_link_set)
+   dev->ops->serdes_link_set(dev, port, mode, interface, true);
 }
 EXPORT_SYMBOL(b53_phylink_mac_link_up);
 
diff --git a/drivers/net/dsa/b53/b53_priv.h 

[PATCH net-next v2 1/5] net: dsa: b53: Add ability to enable/disable port interrupts

2018-09-05 Thread Florian Fainelli
Some switches expose individual interrupt line(s) for port specific
event(s), allow configuring these interrupts at an appropriate time
during port_enable/disable callbacks where all port specific resources
are known to be set-up and ready for use.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 9 +
 drivers/net/dsa/b53/b53_priv.h   | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index d93c790bfbe8..85ed264bc163 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -502,8 +502,14 @@ int b53_enable_port(struct dsa_switch *ds, int port, struct phy_device *phy)
 {
struct b53_device *dev = ds->priv;
unsigned int cpu_port = ds->ports[port].cpu_dp->index;
+   int ret = 0;
u16 pvlan;
 
+   if (dev->ops->irq_enable)
+   ret = dev->ops->irq_enable(dev, port);
+   if (ret)
+   return ret;
+
/* Clear the Rx and Tx disable bits and set to no spanning tree */
b53_write8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), 0);
 
@@ -536,6 +542,9 @@ void b53_disable_port(struct dsa_switch *ds, int port, struct phy_device *phy)
b53_read8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), &reg);
reg |= PORT_CTRL_RX_DISABLE | PORT_CTRL_TX_DISABLE;
b53_write8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), reg);
+
+   if (dev->ops->irq_disable)
+   dev->ops->irq_disable(dev, port);
 }
 EXPORT_SYMBOL(b53_disable_port);
 
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index df149756c282..2980a5838f58 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -43,6 +43,8 @@ struct b53_io_ops {
int (*write64)(struct b53_device *dev, u8 page, u8 reg, u64 value);
int (*phy_read16)(struct b53_device *dev, int addr, int reg, u16 *value);
int (*phy_write16)(struct b53_device *dev, int addr, int reg, u16 value);
+   int (*irq_enable)(struct b53_device *dev, int port);
+   void (*irq_disable)(struct b53_device *dev, int port);
 };
 
 enum {
-- 
2.17.1



[PATCH net-next v2 3/5] net: dsa: b53: Add helper to set link parameters

2018-09-05 Thread Florian Fainelli
Extract the logic from b53_adjust_link() responsible for overriding a
given port's link, speed, duplex and pause settings and make two helper
functions to set the port's configuration and the port's link settings.
We will make use of both, as separate functions while adding PHYLINK
support next.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 89 +---
 1 file changed, 60 insertions(+), 29 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 85ed264bc163..78aeaccf19a1 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -947,33 +948,50 @@ static int b53_setup(struct dsa_switch *ds)
return ret;
 }
 
-static void b53_adjust_link(struct dsa_switch *ds, int port,
-   struct phy_device *phydev)
+static void b53_force_link(struct b53_device *dev, int port, int link)
 {
-   struct b53_device *dev = ds->priv;
-   struct ethtool_eee *p = &dev->ports[port].eee;
-   u8 rgmii_ctrl = 0, reg = 0, off;
-
-   if (!phy_is_pseudo_fixed_link(phydev))
-   return;
+   u8 reg, val, off;
 
/* Override the port settings */
if (port == dev->cpu_port) {
off = B53_PORT_OVERRIDE_CTRL;
-   reg = PORT_OVERRIDE_EN;
+   val = PORT_OVERRIDE_EN;
} else {
off = B53_GMII_PORT_OVERRIDE_CTRL(port);
-   reg = GMII_PO_EN;
+   val = GMII_PO_EN;
}
 
-   /* Set the link UP */
-   if (phydev->link)
+   b53_read8(dev, B53_CTRL_PAGE, off, &reg);
+   reg |= val;
+   if (link)
reg |= PORT_OVERRIDE_LINK;
+   else
+   reg &= ~PORT_OVERRIDE_LINK;
+   b53_write8(dev, B53_CTRL_PAGE, off, reg);
+}
+
+static void b53_force_port_config(struct b53_device *dev, int port,
+ int speed, int duplex, int pause)
+{
+   u8 reg, val, off;
+
+   /* Override the port settings */
+   if (port == dev->cpu_port) {
+   off = B53_PORT_OVERRIDE_CTRL;
+   val = PORT_OVERRIDE_EN;
+   } else {
+   off = B53_GMII_PORT_OVERRIDE_CTRL(port);
+   val = GMII_PO_EN;
+   }
 
-   if (phydev->duplex == DUPLEX_FULL)
+   b53_read8(dev, B53_CTRL_PAGE, off, &reg);
+   reg |= val;
+   if (duplex == DUPLEX_FULL)
reg |= PORT_OVERRIDE_FULL_DUPLEX;
+   else
+   reg &= ~PORT_OVERRIDE_FULL_DUPLEX;
 
-   switch (phydev->speed) {
+   switch (speed) {
case 2000:
reg |= PORT_OVERRIDE_SPEED_2000M;
/* fallthrough */
@@ -987,21 +1005,41 @@ static void b53_adjust_link(struct dsa_switch *ds, int port,
reg |= PORT_OVERRIDE_SPEED_10M;
break;
default:
-   dev_err(ds->dev, "unknown speed: %d\n", phydev->speed);
+   dev_err(dev->dev, "unknown speed: %d\n", speed);
return;
}
 
+   if (pause & MLO_PAUSE_RX)
+   reg |= PORT_OVERRIDE_RX_FLOW;
+   if (pause & MLO_PAUSE_TX)
+   reg |= PORT_OVERRIDE_TX_FLOW;
+
+   b53_write8(dev, B53_CTRL_PAGE, off, reg);
+}
+
+static void b53_adjust_link(struct dsa_switch *ds, int port,
+   struct phy_device *phydev)
+{
+   struct b53_device *dev = ds->priv;
+   struct ethtool_eee *p = &dev->ports[port].eee;
+   u8 rgmii_ctrl = 0, reg = 0, off;
+   int pause = 0;
+
+   if (!phy_is_pseudo_fixed_link(phydev))
+   return;
+
/* Enable flow control on BCM5301x's CPU port */
if (is5301x(dev) && port == dev->cpu_port)
-   reg |= PORT_OVERRIDE_RX_FLOW | PORT_OVERRIDE_TX_FLOW;
+   pause = MLO_PAUSE_TXRX_MASK;
 
if (phydev->pause) {
if (phydev->asym_pause)
-   reg |= PORT_OVERRIDE_TX_FLOW;
-   reg |= PORT_OVERRIDE_RX_FLOW;
+   pause |= MLO_PAUSE_TX;
+   pause |= MLO_PAUSE_RX;
}
 
-   b53_write8(dev, B53_CTRL_PAGE, off, reg);
+   b53_force_port_config(dev, port, phydev->speed, phydev->duplex, pause);
+   b53_force_link(dev, port, phydev->link);
 
if (is531x5(dev) && phy_interface_is_rgmii(phydev)) {
if (port == 8)
@@ -1061,16 +1099,9 @@ static void b53_adjust_link(struct dsa_switch *ds, int port,
}
} else if (is5301x(dev)) {
if (port != dev->cpu_port) {
-   u8 po_reg = B53_GMII_PORT_OVERRIDE_CTRL(dev->cpu_port);
-   u8 gmii_po;
-
-   b53_read8(dev, B53_CTRL_PAGE, po_reg, &gmii_po);
-   gmii_po |= GMII_PO_LINK |
-  GMII_PO_RX_FLOW |
-  GMII_PO_TX_FLOW |

Re: [RFC PATCH bpf-next 0/4] tools/bpf: bpftool: add net support

2018-09-05 Thread Yonghong Song




On 9/5/18 10:51 AM, Jakub Kicinski wrote:
> On Mon, 3 Sep 2018 11:26:43 -0700, Yonghong Song wrote:
> > The functionality to dump network driver and tc related bpf programs
> > is added. Currently, users can already use "ip link show "
> > and "tc filter show dev  ..." to dump bpf program attachment
> > information for xdp programs and tc bpf programs.
> > The implementation here allows bpftool as a central place for
> > bpf introspection and users do not need to resort to other tools.
> > Also, we can make the command simpler to dump bpf related information,
> > e.g., "bpftool net" is able to dump all xdp and tc bpf programs.
>
> Why not implement this best-effort, unreliable (name spaces) additional
> output the same way we added bpffs support, make it a flag to existing
> list commands?

Do you mean to implement something like "bpftool -n prog" to show the
attachments for net-related bpf programs? I feel a separate command
"net" is better since it intends to show the context of the bpf program,
as the same program may be installed in different places. This is
similar to other commands like "cgroup" and "perf".

> My knee jerk reaction is that this is duplication of work.  iproute2 can
> show us the filters and xdp programs very easily.  Will we add programs
> attached to sockets as well?  And lwtunnels?  bpfilter?

This has been discussed in the iovisor meeting, but let me summarize here.
I understand that iproute2 ip/tc can do exactly what I implemented here.
The implementation here is mostly from a user experience point of view.

People worry about bpf performance/memory cost in the data center, so
they often ask which bpf programs are running on a host and what the
context (attachment point) of each bpf program is. Most of these engineers
are not networking/bpf/kernel engineers. So yes, we will need
to add programs attached to sockets, lwtunnels, bpfilter etc. later.
This may be a lower priority for me now since FB does not use them yet.

Currently, we already use bpftool to do prog/map and perf/cgroup dumps.
Extending bpftool is easier for users than using a different command,
as not everybody is very familiar with tc in particular.

> Would you be able to give us a convincing user scenario?  What kind of
> information is the user looking for?  Are there going to be other
> sub-commands to the 'net' object?

As of now, it just dumps bpf related information with prog_id (plus
other attachment specific information) so users can correlate back to
the program itself. No plan to add other sub-commands (except "dev
") at this point.

> > For example,
> >
> >    $ bpftool net
> >    xdp [
> >    ]
> >    netdev_filters [
> >    ifindex 2 name handle_icmp flags direct-action flags_gen [not_in_hw ]
>
> How do you handle shared blocks here?  Does the user really care about
> the flags?  What about ordering of filters?

Shared blocks are not handled here. This can be added later.
Yes, I can remove flags_gen. As for "flags", I may or may not remove it.
I will go through the dumped info and remove what is not really needed.

The order of filters will be based on ifindex first and, inside an ifindex,
attached to class first, attached to qdisc second, attached to
root/clsact last.
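
As an illustration only, the ordering described above can be written
as a comparator for qsort(). The struct and enum below are
hypothetical and are not taken from the bpftool patches; they just
encode "ifindex first, then class, qdisc, root/clsact".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical attachment kinds, numbered in the intended dump order. */
enum sketch_attach {
	SKETCH_ATTACH_CLASS = 0,
	SKETCH_ATTACH_QDISC = 1,
	SKETCH_ATTACH_ROOT  = 2,  /* root/clsact */
};

/* Hypothetical record for one dumped tc filter. */
struct sketch_filter {
	int ifindex;
	enum sketch_attach where;
	unsigned int prog_id;
};

/* Order by ifindex, then by attachment kind within the same ifindex. */
int sketch_filter_cmp(const void *a, const void *b)
{
	const struct sketch_filter *x = a;
	const struct sketch_filter *y = b;

	if (x->ifindex != y->ifindex)
		return (x->ifindex > y->ifindex) - (x->ifindex < y->ifindex);
	return (x->where > y->where) - (x->where < y->where);
}

void sketch_sort_filters(struct sketch_filter *v, size_t n)
{
	assert(v || n == 0);
	qsort(v, n, sizeof(*v), sketch_filter_cmp);
}
```

Note that qsort() is not stable, so filters with the same ifindex and
attachment kind would need an extra key if their relative order matters.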





> >  prog_id 3194 tag 846d29c14d0d7d26 act []
> >    ifindex 2 name handle_egress flags direct-action flags_gen [not_in_hw ]
> >  prog_id 3193 tag 387d281be9fe77aa
> >    ]


Re: [PATCH bpf-next 0/4] i40e AF_XDP zero-copy buffer leak fixes

2018-09-05 Thread Björn Töpel
On Wed, 5 Sep 2018 at 19:14, Jakub Kicinski wrote:
>
> On Tue,  4 Sep 2018 20:11:01 +0200, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > This series addresses an AF_XDP zero-copy issue where buffers passed
> > from userspace to the kernel were leaked when the hardware descriptor
> > ring was torn down.
> >
> > The patches fix the i40e AF_XDP zero-copy implementation.
> >
> > Thanks to Jakub Kicinski for pointing this out!
> >
> > Some background for folks that don't know the details: A zero-copy
> > capable driver picks buffers off the fill ring and places them on the
> > hardware Rx ring to be completed at a later point when DMA is
> > complete. Similar on the Tx side; The driver picks buffers off the Tx
> > ring and places them on the Tx hardware ring.
> >
> > In the typical flow, the Rx buffer will be placed onto an Rx ring
> > (completed to the user), and the Tx buffer will be placed on the
> > completion ring to notify the user that the transfer is done.
> >
> > However, if the driver needs to tear down the hardware rings for some
> > reason (interface goes down, reconfiguration and such), the userspace
> > buffers cannot be leaked. They have to be reused or completed back to
> > userspace.
> >
> > The implementation does the following:
> >
> > * Outstanding Tx descriptors will be passed to the completion
> >   ring. The Tx code has back-pressure mechanism in place, so that
> >   enough empty space in the completion ring is guaranteed.
> >
> > * Outstanding Rx descriptors are temporarily stored on a stash/reuse
> >   queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
> >   comes up again, entries from the stash are used to re-populate the
> >   ring.
> >
> > * When AF_XDP ZC is enabled, disallow changing the number of hardware
> >   descriptors via ethtool. Otherwise, the size of the stash/reuse
> >   queue can grow unbounded.
> >
> > Going forward, introducing a "zero-copy allocator" analogous to Jesper
> > Brouer's page pool would be a more robust and reuseable solution.
> >
> > > Jakub: I've made a minor checkpatch fix to your RFC, prior to adding it
> > > to this series.
>
> Thanks for the fix! :)
>
> Out of curiosity, did checking the reuse queue have a noticeable impact
> in your test (i.e. always using the _rq() helpers)?  You seem to be
> adding an indirect call, would that not be way worse on a retpoline
> kernel?

Do you mean the indirection in __i40e_alloc_rx_buffers_zc (patch #3)?
The indirect call is elided by the __always_inline -- without that
retpoline took 2.5Mpps worth of Rx. :-(

I'm only using the _rq helpers in the configuration/slow path, so the
fast-path is unchanged.


Björn
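
The effect described above can be sketched in plain C. This is a minimal, hypothetical illustration and not the i40e code: the names fill_ring, alloc_fast and alloc_from_stash are invented here, and __attribute__((always_inline)) is the GCC/Clang attribute that the kernel's __always_inline macro is assumed to expand to. Because the helper is forced inline and each wrapper passes a compile-time-constant function pointer, the compiler can typically resolve the call directly, so the hot path should not need an indirect (retpoline) call.

```c
#include <stdbool.h>

struct ring {
	unsigned int filled;  /* buffers currently placed on the ring */
	unsigned int stash;   /* buffers parked on the reuse queue */
};

typedef bool (*alloc_fn)(struct ring *r);

/* Fast path: take a fresh buffer (modelled as a counter bump). */
static bool alloc_fast(struct ring *r)
{
	r->filled++;
	return true;
}

/* Slow path: prefer stashed buffers, fall back to fresh ones. */
static bool alloc_from_stash(struct ring *r)
{
	if (r->stash == 0)
		return alloc_fast(r);
	r->stash--;
	r->filled++;
	return true;
}

/* Shared loop; forced inline so "alloc" is a known constant per caller. */
static inline __attribute__((always_inline))
unsigned int fill_ring(struct ring *r, unsigned int count, alloc_fn alloc)
{
	unsigned int done = 0;

	while (done < count && alloc(r))
		done++;
	return done;
}

/* Hot path wrapper: never looks at the stash. */
unsigned int fill_ring_fast(struct ring *r, unsigned int count)
{
	return fill_ring(r, count, alloc_fast);
}

/* Configuration/slow path wrapper: drains the stash first. */
unsigned int fill_ring_slow(struct ring *r, unsigned int count)
{
	return fill_ring(r, count, alloc_from_stash);
}
```

Only the slow wrapper pays for checking the stash, which matches the statement that the fast path is unchanged.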


[PATCH 1/7] fix hnode refcounting

2018-09-05 Thread Al Viro
From: Al Viro 

cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via
->hlist and via ->tp_root together.  u32_destroy() drops the former and, in
case when there had been links, leaves the sucker on the list.  As the result,
there's nothing to protect it from getting freed once links are dropped.
That also makes the "is it busy" check incapable of catching the root hnode -
it *is* busy (there's a reference from tp), but we don't see it as something
separate.  "Is it our root?" check partially covers that, but the problem
exists for others' roots as well.

AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
* count tp->root and tp_c->hlist as separate references.  I.e.
have u32_init() set refcount to 2, not 1.
* in u32_destroy() we always drop the former; in u32_destroy_hnode() -
the latter.

That way we have *all* references contributing to refcount.  List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
and in u32_destroy() in case of tc_u_common going away, along with everything
reachable from it.  IOW, that way we know that u32_destroy_key() won't
free something still on the list (or pointed to by someone's ->root).

Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index f218ccf1e2d9..3f985f29ef30 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -398,6 +398,7 @@ static int u32_init(struct tcf_proto *tp)
rcu_assign_pointer(tp_c->hlist, root_ht);
root_ht->tp_c = tp_c;
 
+   root_ht->refcnt++;
rcu_assign_pointer(tp->root, root_ht);
tp->data = tp_c;
return 0;
@@ -610,7 +611,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct 
tc_u_hnode *ht,
struct tc_u_hnode __rcu **hn;
struct tc_u_hnode *phn;
 
-   WARN_ON(ht->refcnt);
+   WARN_ON(--ht->refcnt);
 
u32_clear_hnode(tp, ht, extack);
 
@@ -649,7 +650,7 @@ static void u32_destroy(struct tcf_proto *tp, struct 
netlink_ext_ack *extack)
 
WARN_ON(root_ht == NULL);
 
-   if (root_ht && --root_ht->refcnt == 0)
+   if (root_ht && --root_ht->refcnt == 1)
u32_destroy_hnode(tp, root_ht, extack);
 
if (--tp_c->refcnt == 0) {
@@ -698,7 +699,6 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool 
*last,
}
 
if (ht->refcnt == 1) {
-   ht->refcnt--;
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
-- 
2.11.0
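
The counting scheme from the commit message can be modelled in userspace. This is a simplified sketch under stated assumptions, not the cls_u32 code: struct hnode and all function names below are hypothetical, locking and RCU are ignored, and "freed" is a flag rather than a real kfree(). It only shows that, with the list reference and the tp->root reference counted separately, dropping the last link cannot free a node that is still on the list.

```c
#include <stdbool.h>

struct hnode {
	int refcnt;
	bool on_list;  /* still reachable via the hnode list */
	bool freed;    /* stands in for kfree() */
};

/* Init: one reference for the list, one for tp->root. */
void hnode_init_root(struct hnode *ht)
{
	ht->refcnt = 2;
	ht->on_list = true;
	ht->freed = false;
}

/* A key links to this hnode. */
void hnode_link(struct hnode *ht)
{
	ht->refcnt++;
}

/* A linking key is destroyed; free only if nothing else refers to ht. */
void hnode_unlink(struct hnode *ht)
{
	if (--ht->refcnt == 0)
		ht->freed = true;
}

/* Drop the list reference: remove from the list and free. */
void hnode_destroy(struct hnode *ht)
{
	--ht->refcnt;
	ht->on_list = false;
	ht->freed = true;
}

/* tp goes away: drop the tp->root reference; if only the list
 * reference is left, the node can be destroyed right away. */
void tp_destroy(struct hnode *root)
{
	if (--root->refcnt == 1)
		hnode_destroy(root);
}
```

In this model a root that still has a link survives tp_destroy() and the later hnode_unlink() with its list reference intact, so it stays on the list until whatever owns the list tears it down.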



[PATCH 4/7] get rid of unused argument of u32_destroy_key()

2018-09-05 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 5816288810cc..3311aacad6c3 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -406,8 +406,7 @@ static int u32_init(struct tcf_proto *tp)
return 0;
 }
 
-static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
-  bool free_pf)
+static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
 {
struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
 
@@ -441,7 +440,7 @@ static void u32_delete_key_work(struct work_struct *work)
  struct tc_u_knode,
  rwork);
rtnl_lock();
-   u32_destroy_key(key->tp, key, false);
+   u32_destroy_key(key, false);
rtnl_unlock();
 }
 
@@ -458,7 +457,7 @@ static void u32_delete_key_freepf_work(struct work_struct 
*work)
  struct tc_u_knode,
  rwork);
rtnl_lock();
-   u32_destroy_key(key->tp, key, true);
+   u32_destroy_key(key, true);
rtnl_unlock();
 }
 
@@ -602,7 +601,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct 
tc_u_hnode *ht,
if (tcf_exts_get_net(&n->exts))
tcf_queue_work(&n->rwork, u32_delete_key_freepf_work);
else
-   u32_destroy_key(n->tp, n, true);
+   u32_destroy_key(n, true);
}
}
 }
@@ -972,13 +971,13 @@ static int u32_change(struct net *net, struct sk_buff 
*in_skb,
tca[TCA_RATE], ovr, extack);
 
if (err) {
-   u32_destroy_key(tp, new, false);
+   u32_destroy_key(new, false);
return err;
}
 
err = u32_replace_hw_knode(tp, new, flags, extack);
if (err) {
-   u32_destroy_key(tp, new, false);
+   u32_destroy_key(new, false);
return err;
}
 
-- 
2.11.0



[PATCH 3/7] make sure that divisor is a power of 2

2018-09-05 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 9ea5f2be907b..5816288810cc 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -995,7 +995,11 @@ static int u32_change(struct net *net, struct sk_buff 
*in_skb,
if (tb[TCA_U32_DIVISOR]) {
unsigned int divisor = nla_get_u32(tb[TCA_U32_DIVISOR]);
 
-   if (--divisor > 0x100) {
+   if (!is_power_of_2(divisor)) {
+   NL_SET_ERR_MSG_MOD(extack, "Divisor is not a power of 
2");
+   return -EINVAL;
+   }
+   if (divisor-- > 0x100) {
NL_SET_ERR_MSG_MOD(extack, "Exceeded maximum 256 hash 
buckets");
return -EINVAL;
}
-- 
2.11.0
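
A standalone sketch of the check this patch adds, for readers who want to try values by hand. The function names are hypothetical, and the assumption that the decremented divisor is later used as a bucket mask (which is why a power of two matters) is mine, not stated in the patch. The power-of-two test is the usual bit trick: a non-zero n is a power of two exactly when n & (n - 1) is zero.

```c
#include <stdbool.h>

/* Same idea as the kernel helper: non-zero and a single bit set. */
static bool is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Returns 0 and stores divisor - 1 in *mask on success, -1 otherwise.
 * Mirrors the order of the checks in the patch: power of two first,
 * then the 256-bucket limit. */
int check_divisor(unsigned int divisor, unsigned int *mask)
{
	if (!is_pow2(divisor))
		return -1;
	if (divisor > 0x100)
		return -1;
	*mask = divisor - 1;
	return 0;
}
```

Note that 0 is rejected by the power-of-two test, so the decrement can no longer wrap around.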



[PATCHES] cls_u32 cleanups and fixes

2018-09-05 Thread Al Viro
Several cls_u32 patches: fixing refcounting, preventing
links to and deletion of root hnodes, validating divisor.  Plus
cleanups - removal of some useless fields and saner handling of
tc_u_common hashtable.  The first 3 in series are fixes (and
-stable fodder), the rest - cleanups.  Branch can be found in
vfs.git#misc.cls_u32; individual patches will go in followups.


[PATCH 5/7] get rid of tc_u_knode ->tp

2018-09-05 Thread Al Viro
From: Al Viro 

not used anymore

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 3311aacad6c3..8a1a573487bd 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -68,7 +68,6 @@ struct tc_u_knode {
u32 mask;
u32 __percpu*pcpu_success;
 #endif
-   struct tcf_proto*tp;
struct rcu_work rwork;
/* The 'sel' field MUST be the last field in structure to allow for
 * tc_u32_keys allocated at end of structure.
@@ -897,7 +896,6 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto 
*tp,
/* Similarly success statistics must be moved as pointers */
new->pcpu_success = n->pcpu_success;
 #endif
-   new->tp = tp;
memcpy(&new->sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key));
 
if (tcf_exts_init(&new->exts, TCA_U32_ACT, TCA_U32_POLICE)) {
@@ -1113,7 +,6 @@ static int u32_change(struct net *net, struct sk_buff 
*in_skb,
n->handle = handle;
n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0;
n->flags = flags;
-   n->tp = tp;
 
err = tcf_exts_init(&n->exts, TCA_U32_ACT, TCA_U32_POLICE);
if (err < 0)
-- 
2.11.0



[PATCH 2/7] mark root hnode explicitly

2018-09-05 Thread Al Viro
From: Al Viro 

... and disallow deleting or linking to such

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 3f985f29ef30..9ea5f2be907b 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -91,6 +91,7 @@ struct tc_u_hnode {
 */
struct tc_u_knode __rcu *ht[1];
 };
+#define TCA_CLS_FLAGS_U32_ROOT (1<<8)
 
 struct tc_u_common {
struct tc_u_hnode __rcu *hlist;
@@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
root_ht->refcnt++;
root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x8000;
root_ht->prio = tp->prio;
+   root_ht->flags = TCA_CLS_FLAGS_U32_ROOT;
idr_init(&root_ht->handle_idr);
 
if (tp_c == NULL) {
@@ -491,7 +493,8 @@ static void u32_clear_hw_hnode(struct tcf_proto *tp, struct 
tc_u_hnode *h,
struct tcf_block *block = tp->chain->block;
struct tc_cls_u32_offload cls_u32 = {};
 
-   tc_cls_common_offload_init(&cls_u32.common, tp, h->flags, extack);
+   tc_cls_common_offload_init(&cls_u32.common, tp,
+   h->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack);
cls_u32.command = TC_CLSU32_DELETE_HNODE;
cls_u32.hnode.divisor = h->divisor;
cls_u32.hnode.handle = h->handle;
@@ -693,7 +696,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool 
*last,
goto out;
}
 
-   if (root_ht == ht) {
+   if (ht->flags & TCA_CLS_FLAGS_U32_ROOT) {
NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
return -EINVAL;
}
@@ -795,6 +798,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto 
*tp,
NL_SET_ERR_MSG_MOD(extack, "Link hash table not 
found");
return -EINVAL;
}
+   if (ht_down->flags & TCA_CLS_FLAGS_U32_ROOT) {
+   NL_SET_ERR_MSG_MOD(extack, "Not linke to root 
node");
+   return -EINVAL;
+   }
ht_down->refcnt++;
}
 
@@ -1214,7 +1221,8 @@ static int u32_reoffload_hnode(struct tcf_proto *tp, 
struct tc_u_hnode *ht,
struct tc_cls_u32_offload cls_u32 = {};
int err;
 
-   tc_cls_common_offload_init(&cls_u32.common, tp, ht->flags, extack);
+   tc_cls_common_offload_init(&cls_u32.common, tp,
+   ht->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack);
cls_u32.command = add ? TC_CLSU32_NEW_HNODE : TC_CLSU32_DELETE_HNODE;
cls_u32.hnode.divisor = ht->divisor;
cls_u32.hnode.handle = ht->handle;
-- 
2.11.0



[PATCH 7/7] clean tc_u_common hashtable

2018-09-05 Thread Al Viro
From: Al Viro 

* calculate key *once*, not for each hash chain element
* let tc_u_hash() return the pointer to chain head rather than index -
callers are cleaner that way.

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index be9240ae1417..6d45ec4c218c 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -343,19 +343,16 @@ static void *tc_u_common_ptr(const struct tcf_proto *tp)
return block->q;
 }
 
-static unsigned int tc_u_hash(const struct tcf_proto *tp)
+static struct hlist_head *tc_u_hash(void *key)
 {
-   return hash_ptr(tc_u_common_ptr(tp), U32_HASH_SHIFT);
+   return tc_u_common_hash + hash_ptr(key, U32_HASH_SHIFT);
 }
 
-static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
+static struct tc_u_common *tc_u_common_find(void *key)
 {
struct tc_u_common *tc;
-   unsigned int h;
-
-   h = tc_u_hash(tp);
-   hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) {
-   if (tc->ptr == tc_u_common_ptr(tp))
+   hlist_for_each_entry(tc, tc_u_hash(key), hnode) {
+   if (tc->ptr == key)
return tc;
}
return NULL;
@@ -364,10 +361,8 @@ static struct tc_u_common *tc_u_common_find(const struct 
tcf_proto *tp)
 static int u32_init(struct tcf_proto *tp)
 {
struct tc_u_hnode *root_ht;
-   struct tc_u_common *tp_c;
-   unsigned int h;
-
-   tp_c = tc_u_common_find(tp);
+   void *key = tc_u_common_ptr(tp);
+   struct tc_u_common *tp_c = tc_u_common_find(key);
 
root_ht = kzalloc(sizeof(*root_ht), GFP_KERNEL);
if (root_ht == NULL)
@@ -389,8 +384,7 @@ static int u32_init(struct tcf_proto *tp)
INIT_HLIST_NODE(&tp_c->hnode);
idr_init(&tp_c->handle_idr);
 
-   h = tc_u_hash(tp);
-   hlist_add_head(&tp_c->hnode, &tc_u_common_hash[h]);
+   hlist_add_head(&tp_c->hnode, tc_u_hash(key));
}
 
tp_c->refcnt++;
-- 
2.11.0
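
The shape of this cleanup can be shown with a tiny userspace hash table. Everything here is a hypothetical sketch: struct common, bucket_for() and the shift-and-mask hash are stand-ins (the kernel uses hash_ptr() and hlist), chosen only to illustrate the two points from the commit message. The key is computed once by the caller, and the hash helper returns a pointer to the chain head instead of an index.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_SHIFT 4
#define HASH_SIZE  (1u << HASH_SHIFT)

struct common {
	const void *ptr;      /* the key this entry was registered under */
	struct common *next;  /* singly linked hash chain */
};

static struct common *table[HASH_SIZE];

/* Return the chain head for a key, so callers never handle an index. */
static struct common **bucket_for(const void *key)
{
	uintptr_t v = (uintptr_t)key;

	return &table[(v >> 4) & (HASH_SIZE - 1)];
}

/* Walk one chain; the key is compared as-is, not recomputed per entry. */
struct common *common_find(const void *key)
{
	for (struct common *c = *bucket_for(key); c; c = c->next) {
		if (c->ptr == key)
			return c;
	}
	return NULL;
}

/* Insert at the head of the chain selected by the same key. */
void common_add(struct common *c, const void *key)
{
	struct common **head = bucket_for(key);

	c->ptr = key;
	c->next = *head;
	*head = c;
}
```

A caller computes the key once and passes it to both common_find() and common_add(), which is the pattern u32_init() follows after this patch.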



[PATCH 6/7] get rid of tc_u_common ->rcu

2018-09-05 Thread Al Viro
From: Al Viro 

unused

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 8a1a573487bd..be9240ae1417 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -98,7 +98,6 @@ struct tc_u_common {
int refcnt;
struct idr  handle_idr;
struct hlist_node   hnode;
-   struct rcu_head rcu;
 };
 
 static inline unsigned int u32_hash_fold(__be32 key,
-- 
2.11.0



Re: 4.18.6 dl_seq_start [xt_hashlimit] unable to handle kernel NULL pointer dereference at 0000000000000050

2018-09-05 Thread Cong Wang
On Wed, Sep 5, 2018 at 4:06 AM Sami Farin  wrote:
>
> 4.17 worked ok, this with 32 GB Ryzen system.
>
> BUG: unable to handle kernel NULL pointer dereference at 0050
> PGD 0 P4D 0
> Oops:  [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 6303 Comm: grep Tainted: GT 4.18.6+ #16
> Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Taichi, 
> BIOS P4.60 03/02/2018
> RIP: 0010:dl_seq_start+0x11/0x60 [xt_hashlimit]


Looks like we need to do a s/s->private/s->file/.


> Code: ff 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 
> 00 0f 1f 44 00 00 55 48 89 f5 53 48 8b 87 d8 00 00 00 <48> 8b 78 50 e8 36 3b 
> 6f de 48 89 c3 48 8d 78 48 e8 ca e6 0a df 8b
> RSP: 0018:a79e88befde0 EFLAGS: 00010246
> RAX:  RBX:  RCX: 
> RDX:  RSI: a79e88befe18 RDI: 9a64733417a0
> RBP: a79e88befe18 R08:  R09: 657547bf
> R10: 9f2bf98a R11: 9a6470f5a6c0 R12: a79e88befeb0
> R13: 9a6471879200 R14: 9a6471879200 R15: 9a64733417a0
> FS:  7f6798784740() GS:9a649ea0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0050 CR3: 0007f335c000 CR4: 003406f0
> Call Trace:
>  seq_read+0xc0/0x470
>  proc_reg_read+0x49/0x70
>  vfs_read+0x8a/0x140
>  ksys_read+0x52/0xc0
>  do_syscall_64+0x6f/0x353
>  ? trace_hardirqs_off_thunk+0x1a/0x1c
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f67980eee21
> Code: fe ff ff 50 48 8d 3d 46 b6 09 00 e8 f9 04 02 00 66 0f 1f 84 00 00 00 00 
> 00 48 8d 05 c1 3b 2d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 
> 57 c3 66 0f 1f 44 00 00 41 54 49 89 d4 55 48
> RSP: 002b:7ffc314f7a68 EFLAGS: 0246 ORIG_RAX: 
> RAX: ffda RBX: 8000 RCX: 7f67980eee21
> RDX: 8000 RSI: 55f0d317f000 RDI: 0003
> RBP: 8000 R08:  R09: 9008
> R10:  R11: 0246 R12: 55f0d317f000
> R13: 0003 R14: 55f0d317e830 R15: 0003
> Modules linked in: arptable_filter arp_tables nfnetlink_acct ip6table_mangle 
> nf_log_ipv6 xt_hl ip6t_REJECT nf_reject_ipv6 ip6t_rt ip6table_filter 
> ip6_tables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat iptable_raw xt_mark 
> xt_connmark iptable_mangle nf_log_ipv4 nf_log_common xt_LOG xt_length 
> xt_limit ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 
> xt_connlimit nf_conncount xt_multiport xt_hashlimit xt_owner xt_set 
> xt_conntrack iptable_filter ip_set_bitmap_port ip_set_hash_mac 
> ip_set_hash_net ip_set nf_conntrack_netlink nfnetlink bnep hwmon_vid iwlmvm 
> snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi mac80211 iwlwifi btusb 
> btrtl kvm_amd btbcm btintel bluetooth kvm cfg80211 ecdh_generic irqbypass 
> sp5100_tco wmi_bmof k10temp i2c_piix4 snd_hda_codec_realtek rfkill 
> snd_hda_codec_generic
>  snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core rtc_cmos 
> acpi_cpufreq binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device 
> snd_pcm tcp_cubic tcp_westwood br_netfilter bridge stp llc ip_tables 
> scsi_transport_iscsi algif_skcipher af_alg uas usb_storage usbhid mxm_wmi ccp 
> igb xhci_pci xhci_hcd usbcore usb_common wmi button 8021q mrp sunrpc 
> snd_timer snd soundcore fuse tun xt_tcpudp x_tables tcp_bbr nf_conntrack_ipv6 
> nf_defrag_ipv6 nf_conntrack sch_fq_codel sch_htb sch_pie analog gameport 
> joydev i2c_dev ecryptfs autofs4 amdkfd amd_iommu_v2 [last unloaded: pcspkr]
> CR2: 0050
> ---[ end trace 0e097a943554aa36 ]---
>
>
> --
> Do what you love because life is too short for anything else.
> https://samifar.in/
>


Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge

2018-09-05 Thread D'Souza, Nelson
Hi David,

Just following up: would you be able to confirm that this is a Linux VRF 
issue?

Also, how do I log a VRF-related defect to ensure this gets resolved in a 
subsequent release?

Thanks,
Nelson

On 8/2/18, 4:12 PM, "D'Souza, Nelson"  wrote:

Hi David,

Turns out the VRF bridge Rx issue is triggered by a docker install.

Docker makes the following sysctl changes:
  net.bridge.bridge-nf-call-arptables = 1
  net.bridge.bridge-nf-call-ip6tables = 1
  net.bridge.bridge-nf-call-iptables = 1 <<< exposes the ipv4 VRF Rx 
issue when a bridge is enslaved to a VRF

which causes packets flowing through all bridges to be subjected to 
netfilter rules. This is required for bridge net filtering when ip forwarding 
is enabled.

Please refer to 
https://github.com/docker/libnetwork/blob/master/drivers/bridge/setup_bridgenetfiltering.go#L53

Setting net.bridge.bridge-nf-call-iptables = 0 resolves the issue, but is 
not really a viable option given that bridge net filtering is a basic 
requirement in existing docker deployments.

It's not clear to me why this conf setting breaks local Rx delivery for a 
bridge enslaved to a VRF, because these packets would always be sent up by the 
bridge for IP netfilter processing.

This issue is easily reproducible on an Ubuntu 18.04.1 VM. Simply 
installing docker will cause pings running on test-vrf to fail. Clearing the 
sysctl conf restores Rx local delivery.

Thanks,
Nelson

On 7/27/18, 4:29 PM, "D'Souza, Nelson"  wrote:

David,

With Ubuntu 18.04.1 (kernel 4.15.0-29) pings sent out on test-vrf and 
br0 are successful.

# uname -rv
4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018

# ping -c 1 -I test-vrf 172.16.2.2
ping: Warning: source address might be selected on device other than 
test-vrf.
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 test-vrf: 56(84) bytes of 
data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.050 ms

--- 172.16.2.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.050/0.050/0.050/0.000 ms

# ping -c 1 -I br0 172.16.2.2
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 br0: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.026 ms

--- 172.16.2.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.026/0.026/0.026/0.000 ms

However, with Ubuntu 17.10.1 (kernel  4.13.0-21) pings on only test-vrf 
are successful. Pings on br0 are not successful.
So it seems like there may be a change in versions after 4.13.0-21 that 
causes pings on br0 to pass.

Nelson

On 7/25/18, 5:35 PM, "D'Souza, Nelson"  wrote:

David, 

I tried out the commands on an Ubuntu 17.10.1 VM.
The pings on test-vrf are successful, but the pings on br0 are not 
successful.

# uname -rv  
4.13.0-21-generic #24-Ubuntu SMP Mon Dec 18 17:29:16 UTC 2017

 # lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 17.10
Release:17.10
Codename:   artful

# ip rule  --> Note: it's missing the l3mdev rule
0:  from all lookup local 
32766:  from all lookup main 
32767:  from all lookup default

Ran the configs from a bash script vrf.sh

 # ./vrf.sh 
+ ip netns add foo
+ ip li add veth1 type veth peer name veth2
+ ip li set veth2 netns foo
+ ip -netns foo li set lo up
+ ip -netns foo li set veth2 up
+ ip -netns foo addr add 172.16.1.2/24 dev veth2
+ ip li add test-vrf type vrf table 123
+ ip li set test-vrf up
+ ip ro add vrf test-vrf unreachable default
+ ip li add br0 type bridge
+ ip li set veth1 master br0
+ ip li set veth1 up
+ ip li set br0 up
+ ip addr add dev br0 172.16.1.1/24
+ ip li set br0 master test-vrf
+ ip -netns foo addr add 172.16.2.2/32 dev lo
+ ip ro add vrf test-vrf 172.16.2.2/32 via 172.16.1.2

# ping -I test-vrf 172.16.2.2 -c 2  <<< successful on test-vrf
ping: Warning: source address might be selected on device other 
than test-vrf.
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 test-vrf: 56(84) bytes 
of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from 172.16.2.2: 

Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1

2018-09-05 Thread Song Liu



> On Sep 5, 2018, at 10:32 AM, David Ahern  wrote:
> 
> On 9/5/18 12:11 AM, Song Liu wrote:
>> We are debugging an issue with fib6_node_lookup_1(). 
>> 
>> We use a 4.16 based kernel, and we have back ported most upstream
>> patches in ip6_fib.{c,h}. The only major differences I can spot are
>> 
> 
> Did you backport all patches in each set that included a change to those
> files, or just the patches to ip6_fib.* and any dependencies?

I believe we always try to backport all patches in each set. But we have
backported hundreds of patches to our 4.16 tree, so it is also likely we missed
some useful patches. 

Thanks,
Song



Re: [PATCH mlx5-next v1 05/15] net/mlx5: Break encap/decap into two separated flow table creation flags

2018-09-05 Thread Leon Romanovsky
On Wed, Sep 05, 2018 at 10:38:00AM -0600, Jason Gunthorpe wrote:
> On Wed, Sep 05, 2018 at 08:10:25AM +0300, Leon Romanovsky wrote:
> > > > -   int en_encap_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN);
> > > > +   int en_encap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP);
> > > > +   int en_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
> > >
> > > Yuk, please don't use !!.
> > >
> > >   bool en_decap = flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
> >
> > We need to provide en_encap and en_decap as an input to MLX5_SET(...)
> > which is passed to FW as 0 or 1.
> >
> > Boolean type is declared in C as int and treated as zero for false
> > and any other value for true,
>
> No, that isn't right, the kernel uses C99's _Bool intrinsic type, which
> is guaranteed to only hold 0 or 1 by the compiler.
>
> See types.h:
>
> typedef _Bool   bool;

Exciting, it took me a while to find the C99 standard and the relevant section, 6.3.1.2.
Anyway, this patch didn't change previous functionality, which used "!!"
convention.

Thanks

>
> Jason




Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1

2018-09-05 Thread Song Liu



> On Sep 5, 2018, at 10:09 AM, Wei Wang  wrote:
> 
> On Tue, Sep 4, 2018 at 11:11 PM Song Liu  wrote:
>> 
>> We are debugging an issue with fib6_node_lookup_1().
>> 
>> We use a 4.16 based kernel, and we have back ported most upstream
>> patches in ip6_fib.{c,h}. The only major differences I can spot are
>> 
>> 8b7f2731bd68d83940714ce92381d1a72596407c
>> c350637229fccbffee2475400fcd689d5738
>> 
>> I guess the issue is not related to these two fixes.
>> 
>> After staring at the call trace and disassembly code (attached below)
>> I guess this is a use-after-free issue in (or right after) the lookup
>> loop:
>> 
>>for (;;) {
>>struct fib6_node *next;
>> 
>>dir = addr_bit_set(args->addr, fn->fn_bit);
>> 
>>next = dir ? rcu_dereference(fn->right) :
>> rcu_dereference(fn->left);
>> 
>>if (next) {
>>fn = next;
>>continue;
>>}
>>break;
>>}
>> 
>> I guess this probably also happens to latest upstream. I haven't
>> tested this with upstream kernel (or net tree) yet, because we
>> can only trigger this about once a week on 100 servers.
>> 
>> Does this look familiar? Any comments and/or suggestions are highly
>> appreciated.
>> 
> By glancing at the commit logs, I don't think any changes were made
> regarding the core logic of fib6_node handling recently.
> (There were a couple of fixes regarding fib6_info but I don't think it
> is the cause here... But it is still good to check if you have commit
> 9b0a8da8c4c6, e873e4b9cc7e, e70a3aad44cc in your build.)

Looks like we don't have e70a3aad44cc. I think it fixes a memory leak 
(instead of a use-after-free)? Let me add it and run some tests anyway. 
Thanks a lot for this information. 

> 
> I also went through the call path and did not find anything obviously wrong...
> I think it's the best for you to reproduce it and we can debug further.
> One question is, do you have "CONFIG_IPV6_SUBTREE" enabled and specify
> src IP in the routing table?

We do have CONFIG_IPV6_SUBTREE enabled. But we usually do not specify
src IP in the routing table. 

Let me try to reproduce it. 

Thanks again,
Song




Re: [PATCH rdma-next v1 12/15] RDMA/mlx5: Add a new flow action verb - modify header

2018-09-05 Thread Leon Romanovsky
On Wed, Sep 05, 2018 at 10:38:42AM -0600, Jason Gunthorpe wrote:
> On Wed, Sep 05, 2018 at 08:20:44AM +0300, Leon Romanovsky wrote:
> > On Tue, Sep 04, 2018 at 03:58:23PM -0600, Jason Gunthorpe wrote:
> > > On Tue, Aug 28, 2018 at 02:18:51PM +0300, Leon Romanovsky wrote:
> > >
> > > > +static int 
> > > > UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER)(
> > > > +   struct ib_uverbs_file *file,
> > > > +   struct uverbs_attr_bundle *attrs)
> > > > +{
> > > > +   struct ib_uobject *uobj = uverbs_attr_get_uobject(
> > > > +   attrs, MLX5_IB_ATTR_CREATE_MODIFY_HEADER_HANDLE);
> > > > +   struct mlx5_ib_dev *mdev = to_mdev(uobj->context->device);
> > > > +   enum mlx5_ib_uapi_flow_table_type ft_type;
> > > > +   struct ib_flow_action *action;
> > > > +   size_t num_actions;
> > > > +   void *in;
> > > > +   int len;
> > > > +   int ret;
> > > > +
> > > > +   if (!mlx5_ib_modify_header_supported(mdev))
> > > > +   return -EOPNOTSUPP;
> > > > +
> > > > +   in = uverbs_attr_get_alloced_ptr(attrs,
> > > > +   MLX5_IB_ATTR_CREATE_MODIFY_HEADER_ACTIONS_PRM);
> > > > +   len = uverbs_attr_get_len(attrs,
> > > > +   MLX5_IB_ATTR_CREATE_MODIFY_HEADER_ACTIONS_PRM);
> > > > +
> > > > +   if (len % MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto))
> > > > +   return -EINVAL;
> > > > +
> > > > +   ret = uverbs_get_const(&ft_type, attrs,
> > > > +  
> > > > MLX5_IB_ATTR_CREATE_MODIFY_HEADER_FT_TYPE);
> > > > +   if (ret)
> > > > +   return -EINVAL;
> > >
> > > This should be
> > >
> > >   if (ret)
> > >   return ret;
> > >
> > > Every call to uverbs_get_const is wrong in this same way..
> >
> > Right, from a technical point of view uverbs_get_const can return only
> > EINVAL, so it is correct for now, but it needs to be changed to a proper
> > "return ret".
>
> No, it can return ENOENT as well.

Ahh, right, the "|| !def_val" part.

Thanks

>
> Jason




[PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-05 Thread Edward Cree
Of the three drivers that currently support FEC configuration, two (sfc
 and cxgb4[vf]) accept configurations with more than one bit set in the
 feccmd.fec bitmask.  (The precise semantics of these combinations vary.)
Thus, this patch adds the ability to specify such combinations through a
 comma-separated list of FEC modes in the 'encoding' argument on the
 command line.

Also adds --set-fec tests to test-cmdline.c, and corrects the man page
 (the encoding argument is not optional) while updating it.

Signed-off-by: Edward Cree 
---
I've CCed the maintainers of the other drivers (cxgb4, nfp) that support
 --set-fec, in case they have opinions on this.
I'm not totally happy with the man page changebar; it might be clearer
 just to leave the comma-less version in the syntax synopsis and only
 mention the commas in the running-text.

 ethtool.8.in   | 11 ---
 ethtool.c  | 50 +++---
 test-cmdline.c |  9 +
 3 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/ethtool.8.in b/ethtool.8.in
index c8a902a..414eaa1 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -389,7 +389,8 @@ ethtool \- query or control network driver and hardware 
settings
 .HP
 .B ethtool \-\-set\-fec
 .I devname
-.B4 encoding auto off rs baser
+.B encoding
+.BR auto | off | rs | baser [ , ...]
 .
 .\" Adjust lines (i.e. full justification) and hyphenate.
 .ad
@@ -1119,8 +1120,12 @@ current FEC mode, the driver or firmware must take the 
link down
 administratively and report the problem in the system logs for users to 
correct.
 .RS 4
 .TP
-.A4 encoding auto off rs baser
-Sets the FEC encoding for the device.
+.BR encoding\ auto | off | rs | baser [ , ...]
+
+Sets the FEC encoding for the device.  Combinations of options are specified as
+e.g.
+.B auto,rs
+; the semantics of such combinations vary between drivers.
 .TS
 nokeep;
 lB l.
diff --git a/ethtool.c b/ethtool.c
index e8b7703..9997930 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4967,20 +4967,48 @@ static int do_set_phy_tunable(struct cmd_context *ctx)
 
 static int fecmode_str_to_type(const char *str)
 {
+   if (!strcasecmp(str, "auto"))
+   return ETHTOOL_FEC_AUTO;
+   if (!strcasecmp(str, "off"))
+   return ETHTOOL_FEC_OFF;
+   if (!strcasecmp(str, "rs"))
+   return ETHTOOL_FEC_RS;
+   if (!strcasecmp(str, "baser"))
+   return ETHTOOL_FEC_BASER;
+
+   return 0;
+}
+
+/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their
+ * corresponding ETHTOOL_FEC_* constants.
+ * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
+ */
+static int parse_fecmode(const char *str)
+{
int fecmode = 0;
+   char buf[6];
 
if (!str)
-   return fecmode;
-
-   if (!strcasecmp(str, "auto"))
-   fecmode |= ETHTOOL_FEC_AUTO;
-   else if (!strcasecmp(str, "off"))
-   fecmode |= ETHTOOL_FEC_OFF;
-   else if (!strcasecmp(str, "rs"))
-   fecmode |= ETHTOOL_FEC_RS;
-   else if (!strcasecmp(str, "baser"))
-   fecmode |= ETHTOOL_FEC_BASER;
+   return 0;
+   while (*str) {
+   size_t next;
+   int mode;
 
+   next = strcspn(str, ",");
+   if (next >= 6) /* Bad mode, longest name is 5 chars */
+   return 0;
+   /* Copy into temp buffer and nul-terminate */
+   memcpy(buf, str, next);
+   buf[next] = 0;
+   mode = fecmode_str_to_type(buf);
+   if (!mode) /* Bad mode encountered */
+   return 0;
+   fecmode |= mode;
+   str += next;
+   /* Skip over ',' (but not nul) */
+   if (*str)
+   str++;
+   }
return fecmode;
 }
 
@@ -5028,7 +5056,7 @@ static int do_sfec(struct cmd_context *ctx)
if (!fecmode_str)
exit_bad_args();
 
-   fecmode = fecmode_str_to_type(fecmode_str);
+   fecmode = parse_fecmode(fecmode_str);
if (!fecmode)
exit_bad_args();
 
diff --git a/test-cmdline.c b/test-cmdline.c
index a94edea..9c51cca 100644
--- a/test-cmdline.c
+++ b/test-cmdline.c
@@ -266,6 +266,15 @@ static struct test_case {
{ 0, "--set-eee devname tx-timer 42 advertise 0x4321" },
{ 1, "--set-eee devname tx-timer foo" },
{ 1, "--set-eee devname advertise foo" },
+   { 1, "--set-fec devname" },
+   { 0, "--set-fec devname encoding auto" },
+   { 0, "--set-fec devname encoding off," },
+   { 0, "--set-fec devname encoding baser,rs" },
+   { 0, "--set-fec devname encoding auto,auto," },
+   { 1, "--set-fec devname encoding foo" },
+   { 1, "--set-fec devname encoding auto,foo" },
+   { 1, "--set-fec devname encoding auto,," },
+   { 1, "--set-fec devname auto" },
/* can't test --set-priv-flags yet */
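
For experimenting with the parsing rules outside of ethtool, the list parser from this patch can be restated as a standalone sketch. The FEC_* bit values and the function names below are local stand-ins chosen for this example (the real ETHTOOL_FEC_* constants come from the ethtool UAPI header), and strcasecmp() is assumed to be available from the POSIX <strings.h>. The behaviour is meant to follow the patch: repeated names and a single trailing comma are accepted, while an unknown name or an empty element makes the whole list invalid (return value 0).

```c
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Stand-in bit values for this sketch only. */
#define FEC_AUTO  (1 << 1)
#define FEC_OFF   (1 << 2)
#define FEC_RS    (1 << 3)
#define FEC_BASER (1 << 4)

/* Map one mode name to its bit, or 0 if the name is unknown. */
static int fec_name_to_bit(const char *name)
{
	if (!strcasecmp(name, "auto"))
		return FEC_AUTO;
	if (!strcasecmp(name, "off"))
		return FEC_OFF;
	if (!strcasecmp(name, "rs"))
		return FEC_RS;
	if (!strcasecmp(name, "baser"))
		return FEC_BASER;
	return 0;
}

/* Parse a comma-separated list; returns the OR of all bits, 0 on error. */
int parse_fec_list(const char *str)
{
	int modes = 0;
	char buf[6];  /* longest name is 5 chars plus the terminator */

	if (!str)
		return 0;
	while (*str) {
		size_t len = strcspn(str, ",");
		int bit;

		if (len >= sizeof(buf))
			return 0;
		memcpy(buf, str, len);
		buf[len] = '\0';
		bit = fec_name_to_bit(buf);
		if (!bit)  /* unknown or empty element */
			return 0;
		modes |= bit;
		str += len;
		if (*str)  /* skip the comma, but never the terminator */
			str++;
	}
	return modes;
}
```

The cases mirror the test-cmdline.c additions: "off," and "auto,auto," parse, while "auto,," fails because the second element is empty.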
 

Re: [RFC PATCH bpf-next 0/4] tools/bpf: bpftool: add net support

2018-09-05 Thread Jakub Kicinski
On Mon, 3 Sep 2018 11:26:43 -0700, Yonghong Song wrote:
> The functionality to dump network driver and tc related bpf programs
> are added. Currently, users can already use "ip link show "
> and "tc filter show dev  ..." to dump bpf program attachment
> information for xdp programs and tc bpf programs.
> The implementation here allows bpftool as a central place for
> bpf introspection and users do not need to revert to other tools.
> Also, we can make command simpler to dump bpf related information,
> e.g., "bpftool net" is able to dump all xdp and tc bpf programs.

Why not implement this best-effort, unreliable (name spaces) additional
output the same way we added bpffs support, make it a flag to existing
list commands?

My knee jerk reaction is that this is duplication of work.  iproute2 can
show us the filters and xdp programs very easily.  Will we add programs
attached to sockets as well?  And lwtunnels?  bpfilter?

Would you be able to give us a convincing user scenario?  What kind of
information is the user looking for?  Are there going to be other
sub-commands to the 'net' object?

> For example,
> 
>   $ bpftool net
>   xdp [
>   ]
>   netdev_filters [
>   ifindex 2 name handle_icmp flags direct-action flags_gen [not_in_hw ]

How do you handle shared blocks here?  Does the user really care about
the flags?  What about ordering of filters?

> prog_id 3194 tag 846d29c14d0d7d26 act []
>   ifindex 2 name handle_egress flags direct-action flags_gen [not_in_hw ]
> prog_id 3193 tag 387d281be9fe77aa
>   ] 


smsc95xx: Invalid max MTU

2018-09-05 Thread Stefan Wahren
Hi,

recently a user reported that his Raspberry Pi 3B didn't work as
expected [1].

The problem is that the smsc95xx driver accepts too-high MTU values (> 9000)
from userspace, e.g. from a DHCP client, but according to the LAN9500 databook
the chip seems to be capable of handling only MTU sizes <= 1500.

It looks like smsc95xx slipped through the cracks during the creation of
commit f77f0aee4da4 ("net: use core MTU range checking in USB NIC drivers").
Unfortunately I don't have all the chips listed in this driver.

So my question is: is the following patch correct?

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index 06b4d29..420a0e4 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -1318,6 +1318,7 @@ static int smsc95xx_bind(struct usbnet *dev, struct usb_interface *intf)
dev->net->ethtool_ops = &smsc95xx_ethtool_ops;
dev->net->flags |= IFF_MULTICAST;
dev->net->hard_header_len += SMSC95XX_TX_OVERHEAD_CSUM;
+   dev->net->max_mtu = ETH_DATA_LEN;
dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
 
pdata->dev = dev;

[1] - https://github.com/raspberrypi/linux/issues/2660
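
For readers unfamiliar with the range check the proposed one-liner relies on,
here is a minimal userspace sketch. It assumes the core rejects any MTU outside
[min_mtu, max_mtu] and treats max_mtu == 0 as "no upper bound". The names
struct fake_netdev and fake_set_mtu() are hypothetical and only illustrate the
idea; they are not kernel APIs.

```c
#include <errno.h>

#define ETH_MIN_MTU   68    /* smallest MTU accepted in this sketch */
#define ETH_DATA_LEN  1500  /* max payload of a standard Ethernet frame */

/* Hypothetical stand-in for the MTU-related fields of a net device. */
struct fake_netdev {
	unsigned int mtu;
	unsigned int min_mtu;
	unsigned int max_mtu;   /* 0 means: no upper bound configured */
};

/* Range check in the spirit of the core MTU validation (assumption). */
static int fake_set_mtu(struct fake_netdev *dev, int new_mtu)
{
	if (new_mtu < 0 || (unsigned int)new_mtu < dev->min_mtu)
		return -EINVAL;
	if (dev->max_mtu > 0 && (unsigned int)new_mtu > dev->max_mtu)
		return -EINVAL;
	dev->mtu = (unsigned int)new_mtu;
	return 0;
}
```

With max_mtu set to ETH_DATA_LEN, a request for 9000 is refused and the old MTU
is kept; with max_mtu left at 0 the same request would be accepted.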


Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1

2018-09-05 Thread David Ahern
On 9/5/18 12:11 AM, Song Liu wrote:
> We are debugging an issue with fib6_node_lookup_1(). 
> 
> We use a 4.16 based kernel, and we have back ported most upstream
> patches in ip6_fib.{c.h}. The only major differences I can spot are
> 

Did you backport all patches in each set that included a change to those
files, or just the patches to ip6_fib.* and any dependencies?


Re: [PATCH bpf-next 0/4] i40e AF_XDP zero-copy buffer leak fixes

2018-09-05 Thread Jakub Kicinski
On Tue,  4 Sep 2018 20:11:01 +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> This series addresses an AF_XDP zero-copy issue where buffers passed
> from userspace to the kernel were leaked when the hardware descriptor
> ring was torn down.
> 
> The patches fix the i40e AF_XDP zero-copy implementation.
> 
> Thanks to Jakub Kicinski for pointing this out!
> 
> Some background for folks that don't know the details: A zero-copy
> capable driver picks buffers off the fill ring and places them on the
> hardware Rx ring to be completed at a later point when DMA is
> complete. Similarly on the Tx side: the driver picks buffers off the Tx
> ring and places them on the Tx hardware ring.
> 
> In the typical flow, the Rx buffer will be placed onto an Rx ring
> (completed to the user), and the Tx buffer will be placed on the
> completion ring to notify the user that the transfer is done.
> 
> However, if the driver needs to tear down the hardware rings for some
> reason (interface goes down, reconfiguration and such), the userspace
> buffers cannot be leaked. They have to be reused or completed back to
> userspace.
> 
> The implementation does the following:
> 
> * Outstanding Tx descriptors will be passed to the completion
>   ring. The Tx code has a back-pressure mechanism in place, so that
>   enough empty space in the completion ring is guaranteed.
> 
> * Outstanding Rx descriptors are temporarily stored on a stash/reuse
>   queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
>   come up again, entries from the stash are used to re-populate the
>   ring.
> 
> * When AF_XDP ZC is enabled, disallow changing the number of hardware
>   descriptors via ethtool. Otherwise, the size of the stash/reuse
>   queue can grow unbounded.
> 
> Going forward, introducing a "zero-copy allocator" analogous to Jesper
> Brouer's page pool would be a more robust and reusable solution.
> 
> Jakub: I've made a minor checkpatch-fix to your RFC, prior to adding it
> into this series.

Thanks for the fix! :)

Out of curiosity, did checking the reuse queue have a noticeable impact
in your test (i.e. always using the _rq() helpers)?  You seem to be
adding an indirect call, would that not be way worse on a retpoline
kernel?
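
To make the stash/reuse idea from the quoted cover letter concrete, here is a
minimal userspace sketch, assuming a fixed-capacity stack of buffer handles.
The names struct reuse_queue, rq_stash() and rq_take() are hypothetical and are
not the helpers from the series.

```c
#include <stddef.h>
#include <stdint.h>

#define RQ_CAPACITY 8   /* assumption: sized to the number of HW descriptors */

struct reuse_queue {
	uint64_t handles[RQ_CAPACITY];
	size_t len;
};

/* Stash a buffer handle at ring teardown; returns 0, or -1 if full. */
static int rq_stash(struct reuse_queue *rq, uint64_t handle)
{
	if (rq->len == RQ_CAPACITY)
		return -1;
	rq->handles[rq->len++] = handle;
	return 0;
}

/* Take a stashed handle when refilling; returns 1 if one was available. */
static int rq_take(struct reuse_queue *rq, uint64_t *handle)
{
	if (rq->len == 0)
		return 0;   /* caller would fall back to the fill ring */
	*handle = rq->handles[--rq->len];
	return 1;
}
```

The fixed capacity mirrors the third bullet of the cover letter: if the number
of hardware descriptors could change while zero-copy is active, a stash sized
this way could no longer be guaranteed to hold every outstanding buffer.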


Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1

2018-09-05 Thread Wei Wang
On Tue, Sep 4, 2018 at 11:11 PM Song Liu  wrote:
>
> We are debugging an issue with fib6_node_lookup_1().
>
> We use a 4.16 based kernel, and we have back ported most upstream
> patches in ip6_fib.{c.h}. The only major differences I can spot are
>
> 8b7f2731bd68d83940714ce92381d1a72596407c
> c350637229fccbffee2475400fcd689d5738
>
> I guess the issue is not related to these two fixes.
>
> After staring at the call trace and disassembly code (attached below)
> I guess this is a use-after-free issue in (or right after) the lookup
> loop:
>
> for (;;) {
> struct fib6_node *next;
>
> dir = addr_bit_set(args->addr, fn->fn_bit);
>
> next = dir ? rcu_dereference(fn->right) :
>  rcu_dereference(fn->left);
>
> if (next) {
> fn = next;
> continue;
> }
> break;
> }
>
> I guess this probably also happens to latest upstream. I haven't
> tested this with upstream kernel (or net tree) yet, because we
> can only trigger this about once a week on 100 servers.
>
> Does this look familiar? Any comments and/or suggestions are highly
> appreciated.
>
By glancing at the commit logs, I don't think any changes were made
regarding the core logic of fib6_node handling recently.
(There were a couple of fixes regarding fib6_info but I don't think it
is the cause here... But it is still good to check if you have commit
9b0a8da8c4c6, e873e4b9cc7e, e70a3aad44cc in your build.)

I also went through the call path and did not find anything obviously wrong...
I think it's best for you to reproduce it so that we can debug further.
One question is, do you have "CONFIG_IPV6_SUBTREE" enabled and specify
src IP in the routing table?

Thanks.
Wei
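
For reference, the lookup loop quoted in this thread walks a binary trie,
picking the right or left child by testing one address bit per node. A minimal
userspace sketch follows; struct node and bit_set() are hypothetical stand-ins
for struct fib6_node and addr_bit_set(), and the sketch has none of the RCU
protection the kernel code relies on.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *left;
	struct node *right;
	unsigned int bit;   /* index of the address bit tested at this node */
};

/* Test bit number `bit` of a big-endian address, counting from the MSB. */
static int bit_set(const uint8_t *addr, unsigned int bit)
{
	return (addr[bit >> 3] >> (7 - (bit & 7))) & 1;
}

/* Descend until there is no child in the chosen direction. */
static struct node *descend(struct node *fn, const uint8_t *addr)
{
	for (;;) {
		struct node *next = bit_set(addr, fn->bit) ? fn->right
							   : fn->left;

		if (!next)
			break;
		fn = next;
	}
	return fn;
}
```

Every iteration loads through fn, so if a node were freed while a reader still
held a pointer to it, the next load would touch stale memory. That is the
use-after-free scenario suspected in the report.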

> Thanks,
> Song
>
>
> Bug stack trace:
>
> [354764.457916] BUG: unable to handle kernel
> [354764.466125] paging request
> [354764.471720]  at f60fc318
> [354764.478360] IP: fib6_node_lookup_1+0x29/0x130
> [354764.487249] PGD 80010f725067
> [354764.494062] P4D 80010f725067
> [354764.500878] PUD 0
> [354764.505087] Oops:  [#1] SMP PTI
> [354764.512245] Modules linked in:
> [354764.518536]  udp_diag
> [354764.523266]  act_gact
> [354764.527997]  cls_bpf
> [354764.532557]  tcp_diag
> [354764.537291]  inet_diag
> [354764.542200]  nfsv3
> [354764.546409]  nfs
> [354764.550273]  fscache
> [354764.554834]  ip6table_raw
> [354764.560260]  ip6table_filter
> [354764.566208]  xt_DSCP
> [354764.570765]  iptable_raw
> [354764.576020]  iptable_filter
> [354764.581790]  ip6table_mangle
> [354764.587738]  iptable_mangle
> [354764.593505]  sb_edac
> [354764.598058]  x86_pkg_temp_thermal
> [354764.604872]  intel_powerclamp
> [354764.610992]  coretemp
> [354764.615723]  kvm_intel
> [354764.620628]  kvm
> [354764.624494]  irqbypass
> [354764.629399]  iTCO_wdt
> [354764.634132]  iTCO_vendor_support
> [354764.640772]  i2c_i801
> [354764.645507]  lpc_ich
> [354764.650064]  efivars
> [354764.654619]  mfd_core
> [354764.659353]  ipmi_si
> [354764.663911]  ipmi_devintf
> [354764.669341]  ipmi_msghandler
> [354764.675281]  acpi_cpufreq
> [354764.680711]  button
> [354764.685096]  sch_fq_codel
> [354764.690520]  nfsd
> [354764.694557]  nfs_acl
> [354764.699118]  lockd
> [354764.703330]  auth_rpcgss
> [354764.708588]  oid_registry
> [354764.714006]  grace
> [354764.718213]  sunrpc
> [354764.722590]  fuse
> [354764.726626]  loop
> [354764.730661]  efivarfs
> [354764.735395]  autofs4
> [354764.739957] CPU: 5 PID: 3460038 Comm: java Not tainted 
> 4.16.0-14_fbk2_1455_g6bcb99c57db6 #14
> [354764.756996] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM03 
>   06/02/2016
> [354764.773001] RIP: 0010:fib6_node_lookup_1+0x29/0x130
> [354764.782929] RSP: 0018:c9003f0bb730 EFLAGS: 00010206
> [354764.793557] RAX: 883fc131a000 RBX: f60fc300 RCX: 
> ffe4
> [354764.807999] RDX: 0010 RSI: 0001 RDI: 
> c9003f0bb8f0
> [354764.822436] RBP: c9003f0bb750 R08: 0002 R09: 
> 0004
> [354764.836877] R10: c9003f0bb7a8 R11: 883ff7795780 R12: 
> 82305080
> [354764.851317] R13: 0002 R14:  R15: 
> 
> [354764.865765] FS:  7f8defcfc700() GS:881fff94() 
> knlGS:
> [354764.882119] CS:  0010 DS:  ES:  CR0: 80050033
> [354764.893800] CR2: f60fc318 CR3: 000f68cae006 CR4: 
> 003606e0
> [354764.908235] DR0:  DR1:  DR2: 
> 
> [354764.922671] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [354764.937109] Call Trace:
> [354764.942195]  fib6_node_lookup+0x67/0x90
> [354764.950042]  ? fib6_table_lookup+0x43/0x2f0
> [354764.958587]  fib6_table_lookup+0x43/0x2f0
> [354764.966794]  ip6_pol_route+0x43/0x360
> [354764.974294]  ? ip6_pol_route_input+0x20/0x20
> [354764.983016]  

Re: bpfilter causes a leftover kernel process

2018-09-05 Thread Olivier Brunel
Hi,

Quick follow-up on this:

- first off, Arch devs have updated their kernel config so the next
  kernel will not have bpfilter enabled anymore, thus avoiding any
  issue.

- having said that, I've found an easy way to reproduce it in an Arch
  VM, in case you're interested:

Boot the latest Arch ISO[1] which now contains a kernel 4.18.5 and do a
very basic installation, pretty much just:

# pacstrap /mnt base
# genfstab -U /mnt >> /mnt/etc/fstab

And of course install your boot loader of choice.

Then boot the brand new system, log in and make sure the helper is
actually started, i.e. `modprobe bpfilter` -- Now halt.

You'll see in the end that systemd complains that it can't
unmount /oldroot (EBUSY), aka the root fs; and that's because of the
bpfilter helper, which wasn't killed because it's seen as a kernel
thread due to its empty command line and therefore not signaled.


Cheers,



[1] https://www.archlinux.org/download/


Re: phys_port_id in switchdev mode?

2018-09-05 Thread Jakub Kicinski
On Wed, 5 Sep 2018 09:20:43 -0700, Samudrala, Sridhar wrote:
> > With this libvirt on Host0 should easily find the actual PF0 netdev to
> > run the NDO on, if it wants to use VFs:
> >   - libvirt finds act VF0/0 to plug into the VM;
> >   - reads its phys_port_id -> "PF0 SN";
> >   - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
> >   - runs NDOs on "act PF0" for PF0's VF correctly.  
> 
> I think Host0 corresponds to embedded OS on the NIC. Is this correct?
> I guess in this setup, only PF0's PCI interface on Host0 is in switchdev mode 
> and
> the representors for PF0 and its VFs are created on Host0 when they come up
> on Host1. I would think PF0 on Host0 acts as a Control PF for PF1 on Host1.
> 
> Isn't hypervisor running only on Host1?

The main hypervisor is, but Host0 can very easily want to run some DPI
or some such on flows before it allows them through, and people like
running DPI-like apps in VMs/containers..

> > The other problem remains unsolved - Host0 can't be sure without
> > vendor-specific knowledge whether it's connected to PF0 or PF1.
> > That's why I was thinking maybe we should provide phys_port_id
> > on PF representors as well.  That means we'd have to provide the
> > legacy NDOs on PF reprs too because libvirt may now find PF repr...
> > Would it be cleaner to add a new attribute?
> >
> > Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?  
> 
> Do you mean the legacy VF ndo ops on the PF?  I think it is possible to 
> configure
> the VFs on Host1 via the port representors except for the MAC address.

Yes, the MAC address would be the only one.  Could Host0 care about
which MACs Host1 assigned to its VFs?  I'm not sure..


Re: phys_port_id in switchdev mode?

2018-09-05 Thread Jakub Kicinski
On Tue, 4 Sep 2018 23:37:29 +0300, Or Gerlitz wrote:
> On Tue, Sep 4, 2018 at 1:20 PM, Jakub Kicinski wrote:
> > On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:  
> >> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:  
> >> > Hi!  
> >>
> >> Hi Jakub and sorry for the late reply, this crazily hot summer refuses to 
> >> die,
> >>
> >> Note I replied a couple of minutes ago but it didn't get to the list, so
> >> let's take it from this one:
> >>  
> >> > I wonder if we can use phys_port_id in switchdev to group together
> >> > interfaces of a single PCI PF?  Here is the problem:
> >> >
> >> > With a mix of PF and VF interfaces it gets increasingly difficult to
> >> > figure out which one corresponds to which PF.  We can identify which
> >> > *representor* is which, by means of phys_port_name and devlink
> >> > flavours.  But if the actual VF/PF interfaces are also present on the
> >> > same host, it gets confusing when one tries to identify the PF they
> >> > came from.  Generally one has to resort to matching between PCI DBDF of
> >> > the PF and VFs or read relevant info out of ethtool -i.
> >> >
> >> > In multi host scenario this is particularly painful, as there seems to
> >> > be no immediately obvious way to match PCI interface ID of a card (0,
> >> > 1, 2, 3, 4...) to the DBDF we have connected.
> >> >
> >> > Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
> >> > from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in a somewhat
> >> > random manner, which means we have to provide those for all devices with
> >> > link to the PF (all reprs).  And we have to link them (a) because it's
> >> > right (tm) and (b) to get correct naming.  
> >>
> >> wait, as you commented later on, not only the mellanox vf reprs but also
> >> the nfp vf reprs are not linked to the PF, because ip link output
> >> grows quadratically.  
> >
> > Right, correct.  If we set phys_port_id libvirt will reliably pick the
> > correct netdev to run NDOs on (PF/PF repr) so we can remove them from
> > the other netdevs and therefore limit the size of ip link show output.  
> 
> just to make sure, this is suggested/future not existing flow of libvirt?

Mm..  admittedly I haven't investigated in depth, but my colleague did
and indicated this is the current flow.  It matches phys_port_id right
here:

https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L2793

Are we wrong?

> > Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> > phys_port_id on the actual VF and then *a* netdev linked to physfn in
> > sysfs which will have the legacy NDOs.
> >
> > We can't set the phys_port_id on the VF reprs because then we're back
> > to the problem of ip link output growing.  Perhaps we shouldn't set it
> > on PF repr either?
> >
> > Let's make a table (assuming bare metal cloud scenario where Host0 is
> > controlling the network, while Host1 is the actual server):  
> 
> yeah, this would be a super-set the non-smartnic case where
> we have only one host.
> 
> [...]
> 
> > With this libvirt on Host0 should easily find the actual PF0 netdev to
> > run the NDO on, if it wants to use VFs:
> >  - libvirt finds act VF0/0 to plug into the VM;
> >  - reads its phys_port_id -> "PF0 SN";
> >  - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
> >  - runs NDOs on "act PF0" for PF0's VF correctly.  
> 
> What you describe here doesn't seem to be networking
> configuration, as it deals only with VFs and PF but not with reprs,
> and hence AFAIK runs on host host1

No, hm, depends on your definition of SmartNIC.  ARM64 control CPU is
capable of running VMs.  Why would you not run VMs on your controller?
Or one day we will need reprs for containers, people are definitely
going to run containers on the controller...  I wouldn't design this
assuming there is no advanced switching a'la service chains on the
control CPU...

> > Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?  
> 
> I need to think on that

Okay :)
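
The four lookup steps quoted above (find the VF, read its phys_port_id, find
the netdev with the same ID linked to the PF, run the NDOs there) can be
sketched as a simple table match. This is an illustration only: struct
netdev_info and find_ndo_netdev() are hypothetical names, and the table is
filled in by hand rather than read from sysfs.

```c
#include <stddef.h>
#include <string.h>

struct netdev_info {
	const char *name;
	const char *phys_port_id;  /* NULL if the netdev reports none */
	const char *sysfs_link;    /* PCI function the netdev is linked to */
};

/*
 * Among the netdevs linked to PCI function `pf`, return the one whose
 * phys_port_id matches that of the VF, or NULL if there is none.
 */
static const struct netdev_info *
find_ndo_netdev(const struct netdev_info *devs, size_t n,
		const struct netdev_info *vf, const char *pf)
{
	size_t i;

	if (!vf->phys_port_id)
		return NULL;
	for (i = 0; i < n; i++) {
		const struct netdev_info *d = &devs[i];

		if (d == vf || !d->phys_port_id || !d->sysfs_link)
			continue;
		if (strcmp(d->sysfs_link, pf) == 0 &&
		    strcmp(d->phys_port_id, vf->phys_port_id) == 0)
			return d;
	}
	return NULL;
}
```

Fed with the Host0 rows of the "Proposed" table, the match for "act VF0/0" on
PF0 is "act PF0": the uplink and the representors are linked to PF0 too, but
they carry no phys_port_id and are skipped.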


Re: [PATCH rdma-next v1 12/15] RDMA/mlx5: Add a new flow action verb - modify header

2018-09-05 Thread Jason Gunthorpe
On Wed, Sep 05, 2018 at 08:20:44AM +0300, Leon Romanovsky wrote:
> On Tue, Sep 04, 2018 at 03:58:23PM -0600, Jason Gunthorpe wrote:
> > On Tue, Aug 28, 2018 at 02:18:51PM +0300, Leon Romanovsky wrote:
> >
> > > +static int 
> > > UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER)(
> > > + struct ib_uverbs_file *file,
> > > + struct uverbs_attr_bundle *attrs)
> > > +{
> > > + struct ib_uobject *uobj = uverbs_attr_get_uobject(
> > > + attrs, MLX5_IB_ATTR_CREATE_MODIFY_HEADER_HANDLE);
> > > + struct mlx5_ib_dev *mdev = to_mdev(uobj->context->device);
> > > + enum mlx5_ib_uapi_flow_table_type ft_type;
> > > + struct ib_flow_action *action;
> > > + size_t num_actions;
> > > + void *in;
> > > + int len;
> > > + int ret;
> > > +
> > > + if (!mlx5_ib_modify_header_supported(mdev))
> > > + return -EOPNOTSUPP;
> > > +
> > > + in = uverbs_attr_get_alloced_ptr(attrs,
> > > + MLX5_IB_ATTR_CREATE_MODIFY_HEADER_ACTIONS_PRM);
> > > + len = uverbs_attr_get_len(attrs,
> > > + MLX5_IB_ATTR_CREATE_MODIFY_HEADER_ACTIONS_PRM);
> > > +
> > > + if (len % MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto))
> > > + return -EINVAL;
> > > +
> > > + ret = uverbs_get_const(&ft_type, attrs,
> > > +MLX5_IB_ATTR_CREATE_MODIFY_HEADER_FT_TYPE);
> > > + if (ret)
> > > + return -EINVAL;
> >
> > This should be
> >
> > if (ret)
> > return ret;
> >
> > Every call to uverbs_get_const is wrong in this same way..
> 
> Right, from a technical point of view uverbs_get_const can only return
> EINVAL, so this is correct for now, but it needs to be changed to a proper
> "return ret".

No, it can return ENOENT as well.

Jason
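
A small userspace sketch of why "return ret" matters: collapsing every failure
to -EINVAL hides a second error code from the caller. get_const() and handler()
are hypothetical, and the meanings attached to each code here (-ENOENT for a
missing attribute, -EINVAL for an out-of-range value) are assumptions made for
the illustration, not a description of uverbs_get_const().

```c
#include <errno.h>

/* Hypothetical lookup with two distinct failure modes. */
static int get_const(int *out, int present, int value, int max)
{
	if (!present)
		return -ENOENT;   /* assumption: attribute not supplied */
	if (value < 0 || value > max)
		return -EINVAL;   /* assumption: value out of range */
	*out = value;
	return 0;
}

/* Propagate the callee's error code instead of rewriting it. */
static int handler(int present, int value)
{
	int ft_type;
	int ret;

	ret = get_const(&ft_type, present, value, 3);
	if (ret)
		return ret;
	return ft_type;
}
```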


Re: [PATCH mlx5-next v1 05/15] net/mlx5: Break encap/decap into two separated flow table creation flags

2018-09-05 Thread Jason Gunthorpe
On Wed, Sep 05, 2018 at 08:10:25AM +0300, Leon Romanovsky wrote:
> > > - int en_encap_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN);
> > > + int en_encap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP);
> > > + int en_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
> >
> > Yuk, please don't use !!.
> >
> > bool en_decap = flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
> 
> We need to provide en_encap and en_decap as an input to MLX5_SET(...)
> which is passed to FW as 0 or 1. 
>
> Boolean type is declared in C as int and treated as zero for false
> and any other value for true,

No, that isn't right, the kernel uses C99's _Bool intrinsic type, which
is guaranteed to only hold 0 or 1 by the compiler.

See types.h:

typedef _Bool   bool;

Jason
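
A minimal standalone sketch of the point about _Bool: assigning any nonzero
value to a bool yields exactly 1, so no "!!" is needed before handing the value
on as a 0/1 field. The flag values and the helper name flag_to_fw_bit() are
hypothetical.

```c
#include <stdbool.h>

#define TUNNEL_EN_ENCAP 0x4u   /* hypothetical flag values */
#define TUNNEL_EN_DECAP 0x8u

/* Converting a nonzero value to bool (_Bool) always produces 1. */
static int flag_to_fw_bit(unsigned int flags, unsigned int mask)
{
	bool en = flags & mask;

	return en;   /* 0 or 1, never the raw mask value */
}
```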


Re: phys_port_id in switchdev mode?

2018-09-05 Thread Samudrala, Sridhar

On 9/4/2018 3:20 AM, Jakub Kicinski wrote:
> On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
>> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
>>> Hi!
>>
>> Hi Jakub and sorry for the late reply, this crazily hot summer refuses to die,
>>
>> Note I replied a couple of minutes ago but it didn't get to the list, so
>> let's take it from this one:
>>
>>> I wonder if we can use phys_port_id in switchdev to group together
>>> interfaces of a single PCI PF?  Here is the problem:
>>>
>>> With a mix of PF and VF interfaces it gets increasingly difficult to
>>> figure out which one corresponds to which PF.  We can identify which
>>> *representor* is which, by means of phys_port_name and devlink
>>> flavours.  But if the actual VF/PF interfaces are also present on the
>>> same host, it gets confusing when one tries to identify the PF they
>>> came from.  Generally one has to resort to matching between PCI DBDF of
>>> the PF and VFs or read relevant info out of ethtool -i.
>>>
>>> In multi host scenario this is particularly painful, as there seems to
>>> be no immediately obvious way to match PCI interface ID of a card (0,
>>> 1, 2, 3, 4...) to the DBDF we have connected.
>>>
>>> Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
>>> from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in a somewhat
>>> random manner, which means we have to provide those for all devices with
>>> link to the PF (all reprs).  And we have to link them (a) because it's
>>> right (tm) and (b) to get correct naming.
>>
>> wait, as you commented later on, not only the mellanox vf reprs but also
>> the nfp vf reprs are not linked to the PF, because ip link output
>> grows quadratically.
>
> Right, correct.  If we set phys_port_id libvirt will reliably pick the
> correct netdev to run NDOs on (PF/PF repr) so we can remove them from
> the other netdevs and therefore limit the size of ip link show output.
>
>>> The only reliable way to make
>>> user space (libvirt) choose the repr it should run the NDOs on (which is
>>> IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
>>> VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
>>> not the external/Ethernet port when in switchdev mode.  User space
>>> should understand phys_port_id in this context, given it was originally
>>> introduced for matching VFs to ports.
>>
>> Using phys_port_id to match/group VFs to PFs makes sense to me.
>>
>> So what would be the libvirt use case you envision that needs
>> the VF and PF reprs to support that as well? or maybe you were
>> not referring to libvirt but to some other provisioning element? I need
>> to refresh my memory on that area.
>
> Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> phys_port_id on the actual VF and then *a* netdev linked to physfn in
> sysfs which will have the legacy NDOs.
>
> We can't set the phys_port_id on the VF reprs because then we're back
> to the problem of ip link output growing.  Perhaps we shouldn't set it
> on PF repr either?
>
> Let's make a table (assuming bare metal cloud scenario where Host0 is
> controlling the network, while Host1 is the actual server):
>
> [act - actual; rpr - representor; SN - serial number]
>
> Today:
>
> dev       | host | sysfs | phys_-  | switch- | phys_-    | NDOs
>           |      | link  | port_id | dev_id  | port_name |
> ----------------------------------------------------------------
> uplink    |  0   | PF0   | -       | ASIC SN | p0        | PF0
> act PF0   |  0   | PF0   | -       | -       | -         | -
> act VF0/0 |  0   | VF0/0 | -       | -       | -         | -
> rpr PF0   |  0   | -     | -       | ASIC SN | pf0       | -
> rpr VF0/0 |  0   | -     | -       | ASIC SN | pf0vf0    | -
> act PF1   |  1   | PF1   | -       | -       | -         | PF1
> act VF1/0 |  1   | VF1/0 | -       | -       | -         | -
> rpr PF1   |  0   | -     | -       | ASIC SN | pf1       | -
> rpr VF1/0 |  0   | -     | -       | ASIC SN | pf1vf0    | -
>
> Proposed:
>
> dev       | host | sysfs | phys_-  | switch- | phys_-    | NDOs
>           |      | link  | port_id | dev_id  | port_name |
> ----------------------------------------------------------------
> uplink    |  0   | PF0   | -       | ASIC SN | p0        | -
> act PF0   |  0   | PF0   | PF0 SN  | -       | -         | PF0
> act VF0/0 |  0   | VF0/0 | PF0 SN  | -       | -         | -
> rpr PF0   |  0   | PF0   | -       | ASIC SN | pf0       | -
> rpr VF0/0 |  0   | PF0   | -       | ASIC SN | pf0vf0    | -
> act PF1   |  1   | PF1   | PF1 SN  | -       | -         | PF1
> act VF1/0 |  1   | VF1/0 | PF1 SN  | -       | -         | -
> rpr PF1   |  0   | PF0   | -       | ASIC SN | pf1       | -
> rpr VF1/0 |  0   | PF0   | -       | ASIC SN | pf1vf0    | -
>
> With this libvirt on Host0 should easily find the actual PF0 netdev to
> run the NDO on, if it wants to use VFs:
>   - libvirt finds act VF0/0 to plug into the VM;
>   - reads its phys_port_id -> "PF0 SN";
>   - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
>   - runs NDOs on "act PF0" for PF0's VF correctly.


I think Host0 corresponds to embedded OS on the NIC. Is this correct?
I guess in this setup, 

Re: [PATCH v3 3/3] IB/ipoib: Log sysfs 'dev_id' accesses from userspace

2018-09-05 Thread Stephen Hemminger
On Mon,  3 Sep 2018 19:13:16 +0300
Arseny Maslennikov  wrote:

> + if (ndev->dev_id == ndev->dev_port) {
> + netdev_info_once(ndev,
> + "\"%s\" wants to know my dev_id. "
> + "Should it look at dev_port instead?\n",
> + current->comm);
> + netdev_info_once(ndev,
> + "See Documentation/ABI/testing/sysfs-class-net for more 
> info.\n");
> + }

Single line message is sufficient.
Also don't break strings in messages.

> + }
> +
> + ret = sprintf(buf, "%#x\n", ndev->dev_id);
> +
> + return ret;

Why not?
return sprintf...


[PATCH net-next] qed*: Utilize FW 8.37.7.0

2018-09-05 Thread Denis Bolotin
This patch adds a new qed firmware with fixes and support for new features.

Fixes:
- Fix a rare case of device crash with iWARP, iSCSI or FCoE offload.
- Fix GRE tunneled traffic when iWARP offload is enabled.
- Fix RoCE failure in ib_send_bw when using inline data.
- Fix latency optimization flow for inline WQEs.
- BigBear 100G fix

RDMA:
- Reduce task context size.
- Application page sizes above 2GB support.
- Performance improvements.

ETH:
- Tenant DCB support.
- Replace RSS indirection table update interface.

Misc:
- Debug Tools changes.

Signed-off-by: Denis Bolotin 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h   |   1 +
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 248 +--
 drivers/net/ethernet/qlogic/qed/qed_dev.c   |  11 ++
 drivers/net/ethernet/qlogic/qed/qed_hsi.h   | 297 +++-
 include/linux/qed/common_hsi.h  |  10 +-
 include/linux/qed/iscsi_common.h|   2 +-
 6 files changed, 367 insertions(+), 202 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index a60e1c8..5f0962d 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -623,6 +623,7 @@ struct qed_hwfn {
void*unzip_buf;
 
struct dbg_tools_data   dbg_info;
+   void*dbg_user_info;
 
/* PWM region specific data */
u16 wid_count;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index 1aa9fc1..78a638e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -3454,7 +3454,8 @@ static u32 qed_grc_dump_iors(struct qed_hwfn *p_hwfn,
addr = BYTES_TO_DWORDS(storm->sem_fast_mem_addr +
   SEM_FAST_REG_STORM_REG_FILE) +
   IOR_SET_OFFSET(set_id);
-   buf[strlen(buf) - 1] = '0' + set_id;
+   if (strlen(buf) > 0)
+   buf[strlen(buf) - 1] = '0' + set_id;
offset += qed_grc_dump_mem(p_hwfn,
   p_ptt,
   dump_buf + offset,
@@ -5563,35 +5564,6 @@ struct block_info {
enum block_id id;
 };
 
-struct mcp_trace_format {
-   u32 data;
-#define MCP_TRACE_FORMAT_MODULE_MASK   0x0000ffff
-#define MCP_TRACE_FORMAT_MODULE_SHIFT  0
-#define MCP_TRACE_FORMAT_LEVEL_MASK    0x00030000
-#define MCP_TRACE_FORMAT_LEVEL_SHIFT   16
-#define MCP_TRACE_FORMAT_P1_SIZE_MASK  0x000c0000
-#define MCP_TRACE_FORMAT_P1_SIZE_SHIFT 18
-#define MCP_TRACE_FORMAT_P2_SIZE_MASK  0x00300000
-#define MCP_TRACE_FORMAT_P2_SIZE_SHIFT 20
-#define MCP_TRACE_FORMAT_P3_SIZE_MASK  0x00c00000
-#define MCP_TRACE_FORMAT_P3_SIZE_SHIFT 22
-#define MCP_TRACE_FORMAT_LEN_MASK      0xff000000
-#define MCP_TRACE_FORMAT_LEN_SHIFT 24
-
-   char *format_str;
-};
-
-/* Meta data structure, generated by a perl script during MFW build. therefore,
- * the structs mcp_trace_meta and mcp_trace_format are duplicated in the perl
- * script.
- */
-struct mcp_trace_meta {
-   u32 modules_num;
-   char **modules;
-   u32 formats_num;
-   struct mcp_trace_format *formats;
-};
-
 /* REG fifo element */
 struct reg_fifo_element {
u64 data;
@@ -5714,6 +5686,20 @@ struct igu_fifo_addr_data {
enum igu_fifo_addr_types type;
 };
 
+struct mcp_trace_meta {
+   u32 modules_num;
+   char **modules;
+   u32 formats_num;
+   struct mcp_trace_format *formats;
+   bool is_allocated;
+};
+
+/* Debug Tools user data */
+struct dbg_tools_user_data {
+   struct mcp_trace_meta mcp_trace_meta;
+   const u32 *mcp_trace_user_meta_buf;
+};
+
 / Constants **/
 
 #define MAX_MSG_LEN1024
@@ -6137,15 +6123,6 @@ struct user_dbg_array {
 
 / Variables **/
 
-/* MCP Trace meta data array - used in case the dump doesn't contain the
- * meta data (e.g. due to no NVRAM access).
- */
-static struct user_dbg_array s_mcp_trace_meta_arr = { NULL, 0 };
-
-/* Parsed MCP Trace meta data info, based on MCP trace meta array */
-static struct mcp_trace_meta s_mcp_trace_meta;
-static bool s_mcp_trace_meta_valid;
-
 /* Temporary buffer, used for print size calculations */
 static char s_temp_buf[MAX_MSG_LEN];
 
@@ -6311,6 +6288,12 @@ static u32 qed_print_section_params(u32 *dump_buf,
return dump_offset;
 }
 
+static struct dbg_tools_user_data *
+qed_dbg_get_user_data(struct qed_hwfn *p_hwfn)
+{
+   return (struct dbg_tools_user_data *)p_hwfn->dbg_user_info;
+}
+
 /* Parses the idle check rules and returns the number of 

Re: [PATCH net-next] net/mlx5e: Make function mlx5i_grp_sw_update_stats() static

2018-09-05 Thread David Miller
From: Wei Yongjun 
Date: Wed, 5 Sep 2018 11:16:02 +

> Fixes the following sparse warning:
> 
> drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c:119:6: warning:
>  symbol 'mlx5i_grp_sw_update_stats' was not declared. Should it be static?
> 
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH net] mlxsw: spectrum_buffers: Set up a dedicated pool for BUM traffic

2018-09-05 Thread David Miller
From: Petr Machata 
Date: Wed, 05 Sep 2018 12:16:00 +0200

> MC-aware mode was recently enabled by mlxsw on Spectrum switches in
> commit 7b8195306694 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw
> ports"). Unfortunately, testing has shown that the fix is incomplete and
> in the presented form actually makes the problem even worse, because any
> amount of MC traffic causes UC disruption.
> 
> The reason for this is that currently, mlxsw configures the MC-specific
> TCs (8..15) to map to pool 0. It also configures a maximum buffer size
> of 0, but for MC traffic that maximum is disregarded and not part of the
> quota. Therefore MC traffic is always admitted to the egress buffer.
> 
> Fix the configuration by directing the MC TCs into pool 15, which is
> dedicated to MC traffic and recognized as such by the silicon.
> 
> Fixes: 7b8195306694 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw 
> ports")
> Signed-off-by: Petr Machata 
> Acked-by: Jiri Pirko 

Applied, thank you.


[PATCH net 3/3] net/iucv: declare iucv_path_table_empty() as static

2018-09-05 Thread Julian Wiedmann
Fixes a compile warning.

Signed-off-by: Julian Wiedmann 
---
 net/iucv/iucv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/iucv/iucv.c b/net/iucv/iucv.c
index 8f7ef167c45a..eb502c6290c2 100644
--- a/net/iucv/iucv.c
+++ b/net/iucv/iucv.c
@@ -1874,7 +1874,7 @@ static void iucv_pm_complete(struct device *dev)
  * Returns 0 if there are still iucv pathes defined
  *1 if there are no iucv pathes defined
  */
-int iucv_path_table_empty(void)
+static int iucv_path_table_empty(void)
 {
int i;
 
-- 
2.16.4



[PATCH net 2/3] net/af_iucv: fix skb handling on HiperTransport xmit error

2018-09-05 Thread Julian Wiedmann
When sending an skb, afiucv_hs_send() bails out on various error
conditions. But currently the caller has no way of telling whether the
skb was freed or not - resulting in potentially either
a) leaked skbs from iucv_send_ctrl(), or
b) double-free's from iucv_sock_sendmsg().

As dev_queue_xmit() will always consume the skb (even on error), be
consistent and also free the skb from all other error paths. This way
callers no longer need to care about managing the skb.

Signed-off-by: Julian Wiedmann 
Reviewed-by: Ursula Braun 
---
 net/iucv/af_iucv.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 01000c14417f..e2f16a0173a9 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -351,20 +351,28 @@ static int afiucv_hs_send(struct iucv_message *imsg, struct sock *sock,
memcpy(&phs_hdr->iucv_hdr, imsg, sizeof(struct iucv_message));
 
skb->dev = iucv->hs_dev;
-   if (!skb->dev)
-   return -ENODEV;
-   if (!(skb->dev->flags & IFF_UP) || !netif_carrier_ok(skb->dev))
-   return -ENETDOWN;
+   if (!skb->dev) {
+   err = -ENODEV;
+   goto err_free;
+   }
+   if (!(skb->dev->flags & IFF_UP) || !netif_carrier_ok(skb->dev)) {
+   err = -ENETDOWN;
+   goto err_free;
+   }
if (skb->len > skb->dev->mtu) {
-   if (sock->sk_type == SOCK_SEQPACKET)
-   return -EMSGSIZE;
-   else
-   skb_trim(skb, skb->dev->mtu);
+   if (sock->sk_type == SOCK_SEQPACKET) {
+   err = -EMSGSIZE;
+   goto err_free;
+   }
+   skb_trim(skb, skb->dev->mtu);
}
skb->protocol = cpu_to_be16(ETH_P_AF_IUCV);
nskb = skb_clone(skb, GFP_ATOMIC);
-   if (!nskb)
-   return -ENOMEM;
+   if (!nskb) {
+   err = -ENOMEM;
+   goto err_free;
+   }
+
skb_queue_tail(&iucv->send_skb_q, nskb);
err = dev_queue_xmit(skb);
if (net_xmit_eval(err)) {
@@ -375,6 +383,10 @@ static int afiucv_hs_send(struct iucv_message *imsg, struct sock *sock,
WARN_ON(atomic_read(&iucv->msg_recv) < 0);
}
return net_xmit_eval(err);
+
+err_free:
+   kfree_skb(skb);
+   return err;
 }
 
 static struct sock *__iucv_get_sock_by_name(char *nm)
@@ -1167,7 +1179,7 @@ static int iucv_sock_sendmsg(struct socket *sock, struct msghdr *msg,
err = afiucv_hs_send(, sk, skb, 0);
if (err) {
atomic_dec(>msg_sent);
-   goto fail;
+   goto out;
}
} else { /* Classic VM IUCV transport */
skb_queue_tail(>send_skb_q, skb);
-- 
2.16.4
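
The ownership rule described in the commit message above (the callee consumes the buffer on every path, so callers never free it) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct buf`, `buf_free()`, `lower_xmit()` and `send_buf()` are hypothetical stand-ins, not kernel types or functions, and `lower_xmit()` merely imitates the "always consumes, even on error" behaviour that the commit message attributes to dev_queue_xmit().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff; not a kernel type. */
struct buf {
	size_t len;
	int *freed;	/* optional flag, set to 1 when the buffer is released */
};

static void buf_free(struct buf *b)
{
	if (b->freed)
		*b->freed = 1;
	free(b);
}

/*
 * Hypothetical lower layer. It releases the buffer whether or not the
 * transmit succeeds, imitating a callee that always consumes its argument.
 */
static int lower_xmit(struct buf *b, int fail)
{
	buf_free(b);
	return fail ? -EIO : 0;
}

/*
 * Takes ownership of b on every path: early errors jump to a single
 * cleanup label, and the success path hands b to lower_xmit().
 */
static int send_buf(struct buf *b, int link_up, size_t mtu, int xmit_fail)
{
	int err;

	if (!link_up) {
		err = -ENETDOWN;
		goto err_free;
	}
	if (b->len > mtu) {
		err = -EMSGSIZE;
		goto err_free;
	}
	return lower_xmit(b, xmit_fail);

err_free:
	buf_free(b);
	return err;
}
```

With this shape a caller only inspects the return value and never touches the buffer after the call, which removes both the leak and the double-free cases listed in the commit message.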



[PATCH net 1/3] net/af_iucv: drop inbound packets with invalid flags

2018-09-05 Thread Julian Wiedmann
Inbound packets may have any combination of flag bits set in their iucv
header. If we don't know how to handle a specific combination, drop the
skb instead of leaking it.

To clarify what error is returned in this case, replace the hard-coded
0 with the corresponding macro.

Signed-off-by: Julian Wiedmann 
---
 net/iucv/af_iucv.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index a21d8ed0a325..01000c14417f 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -2155,8 +2155,8 @@ static int afiucv_hs_rcv(struct sk_buff *skb, struct net_device *dev,
struct sock *sk;
struct iucv_sock *iucv;
struct af_iucv_trans_hdr *trans_hdr;
+   int err = NET_RX_SUCCESS;
char nullstring[8];
-   int err = 0;
 
if (skb->len < (ETH_HLEN + sizeof(struct af_iucv_trans_hdr))) {
WARN_ONCE(1, "AF_IUCV too short skb, len=%d, min=%d",
@@ -2254,7 +2254,7 @@ static int afiucv_hs_rcv(struct sk_buff *skb, struct net_device *dev,
err = afiucv_hs_callback_rx(sk, skb);
break;
default:
-   ;
+   kfree_skb(skb);
}
 
return err;
-- 
2.16.4
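
The receive-side fix above amounts to making the `default:` branch of a dispatch release the packet instead of falling through with it still allocated. A minimal userspace sketch, assuming hypothetical names throughout (`struct pkt`, the `FLAG_*` values, `pkt_free()`, `handle_syn()`, `handle_fin()` and `pkt_rcv()` are illustrative, and `RX_SUCCESS` only stands in for NET_RX_SUCCESS):

```c
#include <stdlib.h>

#define RX_SUCCESS 0	/* stands in for NET_RX_SUCCESS */

/* Hypothetical flag bits; the real iucv header flags are not modelled. */
enum { FLAG_SYN = 0x1, FLAG_FIN = 0x2 };

/* Hypothetical stand-in for an inbound packet. */
struct pkt {
	unsigned int flags;
};

static int pkts_freed;	/* counts releases, for illustration only */

static void pkt_free(struct pkt *p)
{
	free(p);
	pkts_freed++;
}

/* Each handler consumes the packet it is given. */
static int handle_syn(struct pkt *p)
{
	pkt_free(p);
	return RX_SUCCESS;
}

static int handle_fin(struct pkt *p)
{
	pkt_free(p);
	return RX_SUCCESS;
}

static int pkt_rcv(struct pkt *p)
{
	int err = RX_SUCCESS;

	switch (p->flags) {
	case FLAG_SYN:
		err = handle_syn(p);
		break;
	case FLAG_FIN:
		err = handle_fin(p);
		break;
	default:
		/* Unknown flag combination: drop the packet, do not leak it. */
		pkt_free(p);
	}

	return err;
}
```

Every branch now ends with the packet released exactly once, so an unexpected flag combination no longer leaves an allocation behind.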



[PATCH net 0/3] net/iucv: fixes 2018-09-05

2018-09-05 Thread Julian Wiedmann
Hi Dave,

please apply three straightforward fixes for iucv. One that prevents
leaking the skb on malformed inbound packets, one to fix the error
handling on transmit error, and one to get rid of a compile warning.

Thanks,
Julian


Julian Wiedmann (3):
  net/af_iucv: drop inbound packets with invalid flags
  net/af_iucv: fix skb handling on HiperTransport xmit error
  net/iucv: declare iucv_path_table_empty() as static

 net/iucv/af_iucv.c | 38 +-
 net/iucv/iucv.c|  2 +-
 2 files changed, 26 insertions(+), 14 deletions(-)

-- 
2.16.4



Re: [PATCH net 2/3] tls: clear key material from kernel memory when do_tls_setsockopt_conf fails

2018-09-05 Thread Boris Pismenny

Hi Sabrina,

On 9/5/2018 4:21 PM, Sabrina Dubroca wrote:

> Fixes: 3c4d7559159b ("tls: kernel TLS support")
> Signed-off-by: Sabrina Dubroca
> ---
>  net/tls/tls_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index 180b6640e531..0d432d025471 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -499,7 +499,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
>  	goto out;
>
>  err_crypto_info:
> -	memset(crypto_info, 0, sizeof(*crypto_info));
> +	memzero_explicit(crypto_info, sizeof(struct tls12_crypto_info_aes_gcm_128));


Besides the key, there is other (non-secret) information in
tls12_crypto_info_aes_gcm_128. I'd prefer that you not delete it, so that
users can obtain it (using getsockopt) in case we decide to implement a
fallback to userspace in the future. Such a fallback must obtain the
kernel's IV and record sequence number.


Thanks,
Boris.
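
The suggestion above (clear the secret key but keep the IV and record sequence number readable) could look roughly like the sketch below. It is a userspace illustration under stated assumptions: `struct crypto_info_sketch` is a simplified, hypothetical layout and not the real tls12_crypto_info_aes_gcm_128 definition, the field sizes are chosen for the example, and `zero_explicit()` is a portable stand-in for the kernel's memzero_explicit() that writes through a volatile pointer so the stores are not optimised away.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified layout; field sizes are assumptions. */
struct crypto_info_sketch {
	unsigned char iv[8];
	unsigned char key[16];
	unsigned char rec_seq[8];
};

/*
 * Userspace stand-in for memzero_explicit(): the volatile qualifier
 * forces every store to be performed.
 */
static void zero_explicit(void *p, size_t n)
{
	volatile unsigned char *vp = p;

	while (n--)
		*vp++ = 0;
}

/* Wipe only the secret key; leave iv and rec_seq intact. */
static void wipe_key_only(struct crypto_info_sketch *ci)
{
	zero_explicit(ci->key, sizeof(ci->key));
}
```

After `wipe_key_only()` the key bytes are all zero while `iv` and `rec_seq` still hold their previous values, which is the property a later userspace fallback would depend on.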


Re: [PATCH v3 3/3] IB/ipoib: Log sysfs 'dev_id' accesses from userspace

2018-09-05 Thread Leon Romanovsky
On Mon, Sep 03, 2018 at 07:13:16PM +0300, Arseny Maslennikov wrote:
> Signed-off-by: Arseny Maslennikov 
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 +++
>  1 file changed, 38 insertions(+)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 30f840f874b3..7386e5bde3d3 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -2386,6 +2386,42 @@ int ipoib_add_pkey_attr(struct net_device *dev)
>   return device_create_file(>dev, _attr_pkey);
>  }
>
> +/*
> + * We erroneously exposed the iface's port number in the dev_id
> + * sysfs field long after dev_port was introduced for that purpose[1],
> + * and we need to stop everyone from relying on that.
> + * Let's overload the shower routine for the dev_id file here
> + * to gently bring the issue up.
> + *
> + * [1] https://www.spinics.net/lists/netdev/msg272123.html
> + */
> +static ssize_t dev_id_show(struct device *dev,
> +struct device_attribute *attr, char *buf)
> +{
> + struct net_device *ndev = to_net_dev(dev);
> + ssize_t ret = -EINVAL;
> +
> + if (ndev->dev_id == ndev->dev_port) {
> + netdev_info_once(ndev,
> + "\"%s\" wants to know my dev_id. "
> + "Should it look at dev_port instead?\n",
> + current->comm);
> + netdev_info_once(ndev,
> + "See Documentation/ABI/testing/sysfs-class-net for more info.\n");
> + }
> +
> + ret = sprintf(buf, "%#x\n", ndev->dev_id);
> +
> + return ret;
> +}
> +static DEVICE_ATTR_RO(dev_id);
> +

I don't see this field among those exposed by IPoIB. Why should we expose it now?

> +int ipoib_intercept_dev_id_attr(struct net_device *dev)
> +{
> + device_remove_file(>dev, _attr_dev_id);
> + return device_create_file(>dev, _attr_dev_id);

Why isn't it enough to rely on the netdev code?

> +}
> +
>  static struct net_device *ipoib_add_port(const char *format,
>struct ib_device *hca, u8 port)
>  {
> @@ -2427,6 +2463,8 @@ static struct net_device *ipoib_add_port(const char *format,
>*/
>   ndev->priv_destructor = ipoib_intf_free;
>
> + if (ipoib_intercept_dev_id_attr(ndev))
> + goto sysfs_failed;
>   if (ipoib_cm_add_mode_attr(ndev))
>   goto sysfs_failed;
>   if (ipoib_add_pkey_attr(ndev))
> --
> 2.19.0.rc1
>

