date:20170114

Re: [pull request][for-next] Mellanox mlx5 Reorganize core driver directory layout

2017-01-14 Thread Leon Romanovsky

On Sat, Jan 14, 2017 at 09:37:20AM -0800, Tom Herbert wrote:
> On Fri, Jan 13, 2017 at 2:45 PM, Saeed Mahameed
>  wrote:
> > On Sat, Jan 14, 2017 at 12:06 AM, Tom Herbert  wrote:
> >> On Fri, Jan 13, 2017 at 12:29 PM, Leon Romanovsky  wrote:
> >>> On Fri, Jan 13, 2017 at 12:14:07PM -0500, David Miller wrote:
>  From: Saeed Mahameed 
>  Date: Thu, 12 Jan 2017 19:22:34 +0200
> 
>  > This pull request includes one patch from Leon, this patch as described
>  > below will change the driver directory structure and layout for better,
>  > logical and modular driver files separation.
>  >
>  > This change is important to both rdma and net maintainers in order to
>  > have smoother management of driver patches for different mlx5 sub 
>  > modules
>  > and smoother rdma-next vs. net-next features submissions.
>  >
>  > Please find more info below -in the tag commit message-,
>  > review and let us know if there's any problem.
>  >
>  > This change doesn't introduce any conflicts with the current mlx5
>  > fixes and cleanups posted on 2017-01-10 to net branch, and merge tests
>  > worked flawlessly with no issues.
>  >
>  > This is the last pull request meant for both rdma-next and net-next.
>  > Once pulled, this will be the base shared code for both trees.
> 
>  This is pretty crazy, it will make all bug fix backporting to -stable
>  a complete nightmare for myself, Doug, various distribution maintainers
>  and many other people who quietly have to maintain their own trees and
>  do backporting.
> >>>
> >>> Hi Dave,
> >>>
> >>> I understand your worries, but our case is similar to various other
> >>> drivers, for example hfi1 which was in staging for years while
> >>> supported in RedHat and moved from there to IB. The Chelsio drivers did
> >>> similar reorg in 2016 (drivers/net/ethernet/chelsio/libcxgb) while their
> >>> drivers were in the tree for long time before.
> >>>
> >>> Additionally, Doug doesn't need to maintain -stable queue and it is done
> >>> by relevant submaintainers who are adding stable tags by themselves. In
> >>> the IB case, the burden will continue to be on me and not on Doug.
> >>>
> >> Recently I had to backport the mlx5 driver from 4.9 to 4.6 in order to
> >> get support for XDP. The biggest issue I faced was the lack of
> >> modularity in the many driver features that are now supported. The
> >> problem with backporting these new features is the spider web of
> >> dependencies that they bring in from the rest of the kernel. I ended
> >> up taking out en_rep, vxlan, en_tc, eswitch, and dcbnl. The result was
> >> ~340 patches which is still a lot but at least this was constrained to
> >> patches in the mlx5 directories and are relevant to what we want to
> >> do.

We had that complexity in mind. This pull request is our initial step
to simplify backporters' work (e.g. rename of en_XXX.c to be XXX.c is
the same as en/XXX.c for the backports).

We are aware of IB backports which required EN part of driver which
had all net dependencies mentioned above and IMHO it is wrong.

Our plan is:
1. Reorg/rename files.
2. Separate en and IB from shared code, to allow parallel work.
3. Reduce en and IB dependencies, to simplify backporting.

It is really awkward to move to item 2 and 3 before we cleaned the desk.
That rename patch provided to all backporters an easy way to bring new
features into the driver.

> >>
> >> In lieu of restructuring the directories, I would much rather see more
> >> config options so that we can build drivers that don't unnecessarily
> >> complicate our lives with features we don't use. This is not just true
> >> for Mellanox, but I would say it would be true of any driver that
> >> someone is trying to deploy and maintain at large scale.

Different maintainers have a different view on the subject and not everyone
is ready to accept new drivers config options.


> >>
> >
> > I think we should have both, if the restructuring made right,
> > new whole features (e.g eswitch and eswitch offlaods or any independent 
> > module),
> > can sit in their own directory and keep their own logic concentrated
> > in one place, and only touch the
> > main driver code with simple entry points in the main flow,  this way
> > you can simply compile their whole directories
> > out with a config flag directly from the Makefile.
> >
> That would be nice, but that is not what your new layout gives us yet.
> The starting point for this structure should be to discern what the
> minimum set of files is to build a functional and good performance
> driver with the basic functionality including only the most common
> offloads. These I would think constitute the core files for the driver
> and hence make sense to be in the "core" directory. Personally I would
> drop the en_ file prefix and not make an en directory,

[PATCH v2 net-next] IPsec: do not ignore crypto err in ah input

2017-01-14 Thread Gilad Ben-Yossef

ah input processing uses the asynchronous hash crypto API which supplies 
an error code as part of the operation completion but the error code was 
being ignored.

Treat a crypto API error indication as a verification failure.

While a crypto API reported error would almost certainly result in a 
memcmp of the digest failing anyway and thus the security risk seems 
minor, performing a memory compare on what might be uninitialized memory 
is wrong.

Signed-off-by: Gilad Ben-Yossef 
CC: Alexander Alemayhu 
---

The change was boot tested on Arm64 but I did not exercise
the specific error code path in question.

Changes from v1:
- Fixed typo in patch description pointed out by Alexander

 net/ipv4/ah4.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index f2a7102..22377c8 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -270,6 +270,9 @@ static void ah_input_done(struct crypto_async_request 
*base, int err)
int ihl = ip_hdrlen(skb);
int ah_hlen = (ah->hdrlen + 2) << 2;
 
+   if (err)
+   goto out;
+
work_iph = AH_SKB_CB(skb)->tmp;
auth_data = ah_tmp_auth(work_iph, ihl);
icv = ah_tmp_icv(ahp->ahash, auth_data, ahp->icv_trunc_len);
-- 
2.1.4

Re: [PATCH net-next] IPsec: do not ignore crypto err in ah input

2017-01-14 Thread Gilad Ben-Yossef

On Fri, Jan 13, 2017 at 9:45 AM, Alexander Alemayhu
 wrote:
> On Thu, Jan 12, 2017 at 03:33:22PM +0200, Gilad Ben-Yossef wrote:
>> ah input processing uses the asynchrnous hash crypto API which
>> supplies an error code as part of the operation completion but
>> the error code was being ignored.
>>
> s/asynchrnous/asynchronous

Yes, I have missed that.

Thank you for your review. I will send a fixed version.

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

Re: [PATCH] net: add regs attribute to phy device for user diagnose

2017-01-14 Thread Florian Fainelli

Le 01/14/17 à 17:51, yuan linyu a écrit :
> On 六, 2017-01-14 at 10:35 -0800, Florian Fainelli wrote:
>> On 01/14/2017 08:24 AM, Andrew Lunn wrote:
>>>
>>> On Sat, Jan 14, 2017 at 10:46:31AM +0800, yuan linyu wrote:

 From: yuan linyu 

 if phy device have register(s) configuration problem,
 user can use this attribute to diagnose.
 this feature need phy driver maintainer implement.
>>> what is wrong with mii-tool -vv ?
>> Agreed, and without an actual user of this API (ethtool?), nor a PHY
>> driver implementing it, we cannot quite see how you want to make use of
>> this.
> I hope user/developer can read this attribute file "regs" to do
> a full check of all registers value, and they can write any register
> inside PHY through this file.

You need to submit a PHY driver that implements the API you are
proposing here. Right now no PHY driver implements these {set,get}_regs
function pointers, so this is essentially dead code.

> 
> I think mii-tool or ethtool can't do it currently.

Maybe they cannot right now but they can certainly be patched to support
that. sysfs is not an appropriate interface for what you are proposing
here. We already have a set/get register via ethtool (-d/-D) it would
seem natural to use this.

Besides that, are not the current ioctl() good enough for that?
-- 
Florian

Re: [PATCH] net: add regs attribute to phy device for user diagnose

2017-01-14 Thread yuan linyu

On 六, 2017-01-14 at 10:35 -0800, Florian Fainelli wrote:
> On 01/14/2017 08:24 AM, Andrew Lunn wrote:
> > 
> > On Sat, Jan 14, 2017 at 10:46:31AM +0800, yuan linyu wrote:
> > > 
> > > From: yuan linyu 
> > > 
> > > if phy device have register(s) configuration problem,
> > > user can use this attribute to diagnose.
> > > this feature need phy driver maintainer implement.
> > what is wrong with mii-tool -vv ?
> Agreed, and without an actual user of this API (ethtool?), nor a PHY
> driver implementing it, we cannot quite see how you want to make use of
> this.
I hope user/developer can read this attribute file "regs" to do
a full check of all registers value, and they can write any register
inside PHY through this file.

I think mii-tool or ethtool can't do it currently.
> 
> Thank you

Re: [PATCH v5 05/13] net: ethernet: aquantia: Support for NIC-specific code

2017-01-14 Thread Andrew Lunn

On Sun, Jan 15, 2017 at 12:55:02AM +0200, Rami Rosen wrote:
> Hi, Florian,
> 
> > +}
> > +
> > +static int aq_ndev_change_mtu(struct net_device *ndev, int new_mtu)
> > +{
> > + struct aq_nic_s *aq_nic = (struct aq_nic_s *)netdev_priv(ndev);
> > + int err = 0;
> > +
> > + if (new_mtu == ndev->mtu) {
> > + err = 0;
> > + goto err_exit;
> > + }
> > + if (new_mtu < 68) {
> > + err = -EINVAL;
> > + goto err_exit;
> > + }

Actually, the core will already impose min/max of ETH_MIN_MTU and
ETH_DATA_LEN. See 

commit a52ad514fdf3b8a57ca4322c92d2d8d5c6182485
Author: Jarod Wilson 
Date:   Fri Oct 7 22:04:34 2016 -0400

net: deprecate eth_change_mtu, remove usage

With centralized MTU checking, there's nothing productive done by
eth_change_mtu that isn't already done in dev_set_mtu, so mark it as
deprecated and remove all usage of it in the kernel. All callers have been
audited for calls to alloc_etherdev* or ether_setup directly, which means
they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu
prints out a netdev_warn about being deprecated, for the benefit of
out-of-tree drivers that might be utilizing it.

Of note, dvb_net.c actually had dev->mtu = 4096, while using
eth_change_mtu, meaning that if you ever tried changing it's mtu, you
couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set
to 4096 to remedy that.

Andrew

Re: 4.9 conntrack performance issues

2017-01-14 Thread Denys Fedoryshchenko

On 2017-01-15 02:29, Florian Westphal wrote:

Denys Fedoryshchenko  wrote:

On 2017-01-15 01:53, Florian Westphal wrote:
>Denys Fedoryshchenko  wrote:
>
>I suspect you might also have to change
>
>1011 } else if (expired_count) {
>1012 gc_work->next_gc_run /= 2U;
>1013 next_run = msecs_to_jiffies(1);
>1014 } else {
>
>line 2013 to
>next_run = msecs_to_jiffies(HZ / 2);

I think its wrong to rely on "expired_count", with these
kinds of numbers (up to 10k entries are scanned per round
in Denys setup, its basically always going to be > 0.

I think we should only decide to scan more frequently if
eviction ratio is large, say, we found more than 1/4 of
entries to be stale.

I sent a small patch offlist that does just that.

>How many total connections is the machine handling on average?
>And how many new/delete events happen per second?
1-2 million connections, at current moment 988k
I dont know if it is correct method to measure events rate:

NAT ~ # timeout -t 5 conntrack -E -e NEW | wc -l
conntrack v1.4.2 (conntrack-tools): 40027 flow events have been shown.
40027
NAT ~ # timeout -t 5 conntrack -E -e DESTROY | wc -l
conntrack v1.4.2 (conntrack-tools): 40951 flow events have been shown.
40951

Thanks, thats exactly what I was looking for.
So I am not at all surprised that gc_worker eats cpu cycles...

It is not peak time, so values can be 2-3 higher at peak time, but 
even

right now, it is hogging one core, leaving only 20% idle left,
while others are 80-83% idle.

I agree its a bug.

>>   |--54.65%--gc_worker
>>   |  |
>>   |   --3.58%--nf_ct_gc_expired
>>   | |
>>   | |--1.90%--nf_ct_delete
>
>I'd be interested to see how often that shows up on other cores
>(from packet path).
Other CPU's totally different:
This is top entry
99.60% 0.00%  swapper  [kernel.kallsyms][k] 
start_secondary

|
---start_secondary
   |
--99.42%--cpu_startup_entry
  |

[..]

|--36.02%--process_backlog
 | |  
|

|  |
 | |  
|

|   --35.64%--__netif_receive_skb

gc_worker didnt appeared on other core at all.
Or i am checking something wrong?

Look for "nf_ct_gc_expired" and "nf_ct_delete".
Its going to be deep down in the call graph.
I tried my best to record as much data as possible, but it doesnt show 
it in callgraph, just a little bit in statistics:

 0.01% 0.00%  swapper  [nf_conntrack][k] 
nf_ct_delete
 0.01% 0.00%  swapper  [nf_conntrack][k] 
nf_ct_gc_expired

And thats it.

[PATCH net-next] bpf, trace: make ctx access checks more robust

2017-01-14 Thread Daniel Borkmann

Make sure that ctx cannot potentially be accessed oob by asserting
explicitly that ctx access size into pt_regs for BPF_PROG_TYPE_KPROBE
programs must be within limits. In case some 32bit archs have pt_regs
not being a multiple of 8, then BPF_DW access could cause such access.

BPF_PROG_TYPE_KPROBE progs don't have a ctx conversion function since
there's no extra mapping needed. kprobe_prog_is_valid_access() didn't
enforce sizeof(long) as the only allowed access size, since LLVM can
generate non BPF_W/BPF_DW access to regs from time to time.

For BPF_PROG_TYPE_TRACEPOINT we don't have a ctx conversion either, so
add a BUILD_BUG_ON() check to make sure that BPF_DW access will not be
a similar issue in future (ctx works on event buffer as opposed to
pt_regs there).

Fixes: 2541517c32be ("tracing, perf: Implement BPF programs attached to 
kprobes")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 ( Applies to both, but net-next should be just okay. For the comment
   I used kernel comment style as done throughout whole bpf_trace.c. )

 kernel/trace/bpf_trace.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 1860e7f..81fbc86 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -459,6 +459,13 @@ static bool kprobe_prog_is_valid_access(int off, int size, 
enum bpf_access_type
return false;
if (off % size != 0)
return false;
+   /*
+* Assertion for 32 bit to make sure last 8 byte access
+* (BPF_DW) to the last 4 byte member is disallowed.
+*/
+   if (off + size > sizeof(struct pt_regs))
+   return false;
+
return true;
 }
 
@@ -540,6 +547,8 @@ static bool tp_prog_is_valid_access(int off, int size, enum 
bpf_access_type type
return false;
if (off % size != 0)
return false;
+
+   BUILD_BUG_ON(PERF_MAX_TRACE_SIZE % sizeof(__u64));
return true;
 }
 
-- 
2.5.5

Re: 4.9 conntrack performance issues

2017-01-14 Thread Florian Westphal

Denys Fedoryshchenko  wrote:
> On 2017-01-15 01:53, Florian Westphal wrote:
> >Denys Fedoryshchenko  wrote:
> >
> >I suspect you might also have to change
> >
> >1011 } else if (expired_count) {
> >1012 gc_work->next_gc_run /= 2U;
> >1013 next_run = msecs_to_jiffies(1);
> >1014 } else {
> >
> >line 2013 to
> > next_run = msecs_to_jiffies(HZ / 2);

I think its wrong to rely on "expired_count", with these
kinds of numbers (up to 10k entries are scanned per round
in Denys setup, its basically always going to be > 0.

I think we should only decide to scan more frequently if
eviction ratio is large, say, we found more than 1/4 of
entries to be stale.

I sent a small patch offlist that does just that.

> >How many total connections is the machine handling on average?
> >And how many new/delete events happen per second?
> 1-2 million connections, at current moment 988k
> I dont know if it is correct method to measure events rate:
> 
> NAT ~ # timeout -t 5 conntrack -E -e NEW | wc -l
> conntrack v1.4.2 (conntrack-tools): 40027 flow events have been shown.
> 40027
> NAT ~ # timeout -t 5 conntrack -E -e DESTROY | wc -l
> conntrack v1.4.2 (conntrack-tools): 40951 flow events have been shown.
> 40951

Thanks, thats exactly what I was looking for.
So I am not at all surprised that gc_worker eats cpu cycles...

> It is not peak time, so values can be 2-3 higher at peak time, but even
> right now, it is hogging one core, leaving only 20% idle left,
> while others are 80-83% idle.

I agree its a bug.

> >>   |--54.65%--gc_worker
> >>   |  |
> >>   |   --3.58%--nf_ct_gc_expired
> >>   | |
> >>   | |--1.90%--nf_ct_delete
> >
> >I'd be interested to see how often that shows up on other cores
> >(from packet path).
> Other CPU's totally different:
> This is top entry
> 99.60% 0.00%  swapper  [kernel.kallsyms][k] start_secondary
> |
> ---start_secondary
>|
> --99.42%--cpu_startup_entry
>   |
[..]

> |--36.02%--process_backlog
>  | |  |
> |  |
>  | |  |
> |   --35.64%--__netif_receive_skb
> 
> gc_worker didnt appeared on other core at all.
> Or i am checking something wrong?

Look for "nf_ct_gc_expired" and "nf_ct_delete".
Its going to be deep down in the call graph.

Re: 4.9 conntrack performance issues

2017-01-14 Thread Denys Fedoryshchenko


On 2017-01-15 01:53, Florian Westphal wrote:

Denys Fedoryshchenko  wrote:

[ CC Nicolas since he also played with gc heuristics in the past ]

Sorry if i added someone wrongly to CC, please let me know, if i 
should

remove.
I just run successfully 4.9 on my nat several days ago, and seems 
panic
issue disappeared. But i started to face another issue, it seems 
garbage

collector is hogging one of CPU's.

It was handling load very well at 4.8 and below, it might be still 
fine, but

i suspect queues that belong to hogged cpu might experience issues.


The worker doesn't grab locks for long and calls scheduler for every
bucket to give a chance for other threads to run.

It also doesn't block softinterrupts.

Is there anything can be done to improve cpu load distribution or 
reduce

single core load?


No, I am afraid we don't export any of the heuristics as tuneables so
far.

You could try changing defaults in net/netfilter/nf_conntrack_core.c:

#define GC_MAX_BUCKETS_DIV  64u
/* upper bound of scan intervals */
#define GC_INTERVAL_MAX (2 * HZ)
/* maximum conntracks to evict per gc run */
#define GC_MAX_EVICTS   256u

(the first two result in ~2 minute worst case timeout detection
 on a fully idle system).

For instance you could use

GC_MAX_BUCKETS_DIV -> 128
GC_INTERVAL_MAX-> 30 * HZ

(This means that it takes one hour for a dead connection to be picked
 up on an idle system, but thats only relevant in case you use
 conntrack events to log when connection went down and need more 
precise

 accounting).

Not a big deal in my case.



I suspect you might also have to change

1011 } else if (expired_count) {
1012 gc_work->next_gc_run /= 2U;
1013 next_run = msecs_to_jiffies(1);
1014 } else {

line 2013 to
next_run = msecs_to_jiffies(HZ / 2);

or something like this to not have frequent rescans.

OK


The gc is also done from the packet path (i.e. accounted
towards (k)softirq).

How many total connections is the machine handling on average?
And how many new/delete events happen per second?

1-2 million connections, at current moment 988k
I dont know if it is correct method to measure events rate:

NAT ~ # timeout -t 5 conntrack -E -e NEW | wc -l
conntrack v1.4.2 (conntrack-tools): 40027 flow events have been shown.
40027
NAT ~ # timeout -t 5 conntrack -E -e DESTROY | wc -l
conntrack v1.4.2 (conntrack-tools): 40951 flow events have been shown.
40951

It is not peak time, so values can be 2-3 higher at peak time, but even 
right now, it is hogging one core, leaving only 20% idle left,

while others are 80-83% idle.




88.98% 0.00%  kworker/24:1  [kernel.kallsyms]   [k]
process_one_work
|
---process_one_work
   |
   |--54.65%--gc_worker
   |  |
   |   --3.58%--nf_ct_gc_expired
   | |
   | |--1.90%--nf_ct_delete


I'd be interested to see how often that shows up on other cores
(from packet path).

Other CPU's totally different:
This is top entry
99.60% 0.00%  swapper  [kernel.kallsyms][k] start_secondary
|
---start_secondary
   |
--99.42%--cpu_startup_entry
  |
   --98.04%--default_idle_call
 arch_cpu_idle
 |
 
|--48.58%--call_function_single_interrupt

 |  |
 |   
--46.36%--smp_call_function_single_interrupt
 | 
smp_trace_call_function_single_interrupt

 | |
 | 
|--44.18%--irq_exit

 | |  |
 | |  
|--43.37%--__do_softirq
 | |  |  
|
 | |  |  
 --43.18%--net_rx_action
 | |  |  
   |
 | |  |  
   |--36.02%--process_backlog
 | |  |  
   |  |
 | |  |  
   |   --35.64%--__netif_receive_skb



gc_worker didnt appeared on other core at all.
Or i am checking something wrong?

Re: 4.9 conntrack performance issues

2017-01-14 Thread Florian Westphal

Denys Fedoryshchenko  wrote:

[ CC Nicolas since he also played with gc heuristics in the past ]

> Sorry if i added someone wrongly to CC, please let me know, if i should
> remove.
> I just run successfully 4.9 on my nat several days ago, and seems panic
> issue disappeared. But i started to face another issue, it seems garbage
> collector is hogging one of CPU's.
>
> It was handling load very well at 4.8 and below, it might be still fine, but
> i suspect queues that belong to hogged cpu might experience issues.

The worker doesn't grab locks for long and calls scheduler for every
bucket to give a chance for other threads to run.

It also doesn't block softinterrupts.

> Is there anything can be done to improve cpu load distribution or reduce
> single core load?

No, I am afraid we don't export any of the heuristics as tuneables so
far.

You could try changing defaults in net/netfilter/nf_conntrack_core.c:

#define GC_MAX_BUCKETS_DIV  64u
/* upper bound of scan intervals */
#define GC_INTERVAL_MAX (2 * HZ)
/* maximum conntracks to evict per gc run */
#define GC_MAX_EVICTS   256u

(the first two result in ~2 minute worst case timeout detection
 on a fully idle system).

For instance you could use

GC_MAX_BUCKETS_DIV -> 128
GC_INTERVAL_MAX-> 30 * HZ

(This means that it takes one hour for a dead connection to be picked
 up on an idle system, but thats only relevant in case you use
 conntrack events to log when connection went down and need more precise
 accounting).

I suspect you might also have to change

1011 } else if (expired_count) {
1012 gc_work->next_gc_run /= 2U;
1013 next_run = msecs_to_jiffies(1);
1014 } else {

line 2013 to
next_run = msecs_to_jiffies(HZ / 2);

or something like this to not have frequent rescans.

The gc is also done from the packet path (i.e. accounted
towards (k)softirq).

How many total connections is the machine handling on average?
And how many new/delete events happen per second?

> 88.98% 0.00%  kworker/24:1  [kernel.kallsyms]   [k]
> process_one_work
> |
> ---process_one_work
>|
>|--54.65%--gc_worker
>|  |
>|   --3.58%--nf_ct_gc_expired
>| |
>| |--1.90%--nf_ct_delete

I'd be interested to see how often that shows up on other cores
(from packet path).

4.9 conntrack performance issues

2017-01-14 Thread Denys Fedoryshchenko


Hi!

Sorry if i added someone wrongly to CC, please let me know, if i should 
remove.
I just run successfully 4.9 on my nat several days ago, and seems panic 
issue disappeared. But i started to face another issue, it seems garbage 
collector is hogging one of CPU's.


Here is my data:
2xE5-2640 v3
396G ram
2x10G (bonding) with approx 14-15G load at peak time
It was handling load very well at 4.8 and below, it might be still fine, 
but i suspect queues that belong to hogged cpu might experience issues.
Is there anything can be done to improve cpu load distribution or reduce 
single core load?


net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_checksum = 1
net.netfilter.nf_conntrack_count = 1236021
net.netfilter.nf_conntrack_events = 1
net.netfilter.nf_conntrack_expect_max = 1024
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_helper = 0
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_log_invalid = 0
net.netfilter.nf_conntrack_max = 6553600
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_loose = 0
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 20
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 20
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 10
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 20
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 20
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 30
net.netfilter.nf_conntrack_timestamp = 0
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.nf_conntrack_max = 6553600


it is non-peak values, as adjustments i have shorter than default 
timeouts. Changing net.netfilter.nf_conntrack_buckets to higher value 
doesn't fix issue.

I noticed that one of CPU's hogged (N24 in this case):

Linux 4.9.2-build-0127 (NAT)01/14/17_x86_64_(32 CPU)

23:01:54 CPU%usr   %nice%sys %iowait%irq   %soft  %steal 
 %guest   %idle
23:02:04 all0.090.001.600.010.00   28.280.00 
   0.00   70.01
23:02:04   00.110.000.000.000.00   32.380.00 
   0.00   67.51
23:02:04   10.120.000.120.000.00   29.910.00 
   0.00   69.86
23:02:04   20.230.000.110.000.00   29.570.00 
   0.00   70.09
23:02:04   30.110.000.110.110.00   28.800.00 
   0.00   70.86
23:02:04   40.230.000.110.110.00   31.410.00 
   0.00   68.14
23:02:04   50.110.000.000.000.00   29.280.00 
   0.00   70.61
23:02:04   60.110.000.110.000.00   31.810.00 
   0.00   67.96
23:02:04   70.110.000.110.000.00   32.690.00 
   0.00   67.08
23:02:04   80.000.000.230.000.00   42.120.00 
   0.00   57.64
23:02:04   90.110.000.000.000.00   30.860.00 
   0.00   69.02
23:02:04  100.110.000.110.000.00   30.930.00 
   0.00   68.84
23:02:04  110.000.000.110.000.00   32.730.00 
   0.00   67.16
23:02:04  120.110.000.110.000.00   29.850.00 
   0.00   69.92
23:02:04  130.000.000.000.000.00   30.960.00 
   0.00   69.04
23:02:04  140.000.000.000.000.00   30.090.00 
   0.00   69.91
23:02:04  150.000.000.110.000.00   30.630.00 
   0.00   69.26
23:02:04  160.110.000.000.000.00   25.880.00 
   0.00   74.01
23:02:04  170.110.000.000.000.00   22.820.00 
   0.00   77.07
23:02:04  180.110.000.000.000.00   23.750.00 
   0.00   76.14
23:02:04  190.110.000.110.000.00   24.860.00 
   0.00   74.92
23:02:04  200.110.000.110.110.00   24.480.00 
   0.00   75.19
23:02:04  210.220.000.110.000.00   23.430.00 
   0.00   76.24
23:02:04  220.110.000.110.000.00   25.460.00 
   0.00   74.32
23:02:04  230.000.000.110.000.00   25.470.00 
   0.00   74.41
23:02:04  240.000.00   45.060.000.00   42.180.00 
   0.00   12.76
23:02:04  250.110.000.110.110.00   25.220.00 
   0.00   74.46
23:02:04  260.110.000.000.110.00   23.390.00 
   0.00   76.39
23:02:04  270.220.000.110.000.00   23.830.00 
   0.00   75.85
23:02:04  280.110.000.110.000.00   24.100.00 
   0.00   75.68
23:02:04  290.110.000.11

Re: [PATCH v5 05/13] net: ethernet: aquantia: Support for NIC-specific code

2017-01-14 Thread Rami Rosen

Hi, Florian,

> +}
> +
> +static int aq_ndev_change_mtu(struct net_device *ndev, int new_mtu)
> +{
> + struct aq_nic_s *aq_nic = (struct aq_nic_s *)netdev_priv(ndev);
> + int err = 0;
> +
> + if (new_mtu == ndev->mtu) {
> + err = 0;
> + goto err_exit;
> + }
> + if (new_mtu < 68) {
> + err = -EINVAL;
> + goto err_exit;
> + }

> What's so special about 68 here?

I think that the check that the passed MTU is at least 68 bytes is
justified and correct.
This convention is followed by a large number of Ethernet drivers; for example,
http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/i40e/i40e_main.c#L2246
http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L6049

The size of 68 bytes is originated from RFC 791:
https://tools.ietf.org/html/rfc791
...
Every internet module must be able to forward a datagram of 68 octets
without further fragmentation.
This is because an internet header may be up to 60 octets, and the
minimum fragment is 8 octets.
...

Regards,
Rami Rosen

[PATCH] net: marvell: sky2: use new api ethtool_{get|set}_link_ksettings

2017-01-14 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/marvell/sky2.c |   68 --
 1 files changed, 40 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/marvell/sky2.c 
b/drivers/net/ethernet/marvell/sky2.c
index 18d6336..be003c5 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -3589,47 +3589,59 @@ static u32 sky2_supported_modes(const struct sky2_hw 
*hw)
| SUPPORTED_1000baseT_Full;
 }
 
-static int sky2_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
+static int sky2_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
struct sky2_port *sky2 = netdev_priv(dev);
struct sky2_hw *hw = sky2->hw;
+   u32 supported, advertising;
 
-   ecmd->transceiver = XCVR_INTERNAL;
-   ecmd->supported = sky2_supported_modes(hw);
-   ecmd->phy_address = PHY_ADDR_MARV;
+   supported = sky2_supported_modes(hw);
+   cmd->base.phy_address = PHY_ADDR_MARV;
if (sky2_is_copper(hw)) {
-   ecmd->port = PORT_TP;
-   ethtool_cmd_speed_set(ecmd, sky2->speed);
-   ecmd->supported |=  SUPPORTED_Autoneg | SUPPORTED_TP;
+   cmd->base.port = PORT_TP;
+   cmd->base.speed = sky2->speed;
+   supported |=  SUPPORTED_Autoneg | SUPPORTED_TP;
} else {
-   ethtool_cmd_speed_set(ecmd, SPEED_1000);
-   ecmd->port = PORT_FIBRE;
-   ecmd->supported |=  SUPPORTED_Autoneg | SUPPORTED_FIBRE;
+   cmd->base.speed = SPEED_1000;
+   cmd->base.port = PORT_FIBRE;
+   supported |=  SUPPORTED_Autoneg | SUPPORTED_FIBRE;
}
 
-   ecmd->advertising = sky2->advertising;
-   ecmd->autoneg = (sky2->flags & SKY2_FLAG_AUTO_SPEED)
+   advertising = sky2->advertising;
+   cmd->base.autoneg = (sky2->flags & SKY2_FLAG_AUTO_SPEED)
? AUTONEG_ENABLE : AUTONEG_DISABLE;
-   ecmd->duplex = sky2->duplex;
+   cmd->base.duplex = sky2->duplex;
+
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+   supported);
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+   advertising);
+
return 0;
 }
 
-static int sky2_set_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
+static int sky2_set_link_ksettings(struct net_device *dev,
+  const struct ethtool_link_ksettings *cmd)
 {
struct sky2_port *sky2 = netdev_priv(dev);
const struct sky2_hw *hw = sky2->hw;
u32 supported = sky2_supported_modes(hw);
+   u32 new_advertising;
+
+   ethtool_convert_link_mode_to_legacy_u32(_advertising,
+   cmd->link_modes.advertising);
 
-   if (ecmd->autoneg == AUTONEG_ENABLE) {
-   if (ecmd->advertising & ~supported)
+   if (cmd->base.autoneg == AUTONEG_ENABLE) {
+   if (new_advertising & ~supported)
return -EINVAL;
 
if (sky2_is_copper(hw))
-   sky2->advertising = ecmd->advertising |
+   sky2->advertising = new_advertising |
ADVERTISED_TP |
ADVERTISED_Autoneg;
else
-   sky2->advertising = ecmd->advertising |
+   sky2->advertising = new_advertising |
ADVERTISED_FIBRE |
ADVERTISED_Autoneg;
 
@@ -3638,30 +3650,30 @@ static int sky2_set_settings(struct net_device *dev, 
struct ethtool_cmd *ecmd)
sky2->speed = -1;
} else {
u32 setting;
-   u32 speed = ethtool_cmd_speed(ecmd);
+   u32 speed = cmd->base.speed;
 
switch (speed) {
case SPEED_1000:
-   if (ecmd->duplex == DUPLEX_FULL)
+   if (cmd->base.duplex == DUPLEX_FULL)
setting = SUPPORTED_1000baseT_Full;
-   else if (ecmd->duplex == DUPLEX_HALF)
+   else if (cmd->base.duplex == DUPLEX_HALF)
setting = SUPPORTED_1000baseT_Half;
else
return -EINVAL;
break;
case SPEED_100:
-   if (ecmd->duplex == DUPLEX_FULL)
+   if (cmd->base.duplex == DUPLEX_FULL)
setting = SUPPORTED_100baseT_Full;
-   else

[PATCH resend iproute2 1/1] utils: make hex2mem available to all users

2017-01-14 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

hex2mem() api is useful for parsing hexstrings which are then packed in
a stream of chars.

Signed-off-by: Jamal Hadi Salim 
---
 include/utils.h |  1 +
 ip/ipl2tp.c | 25 -
 lib/utils.c | 25 +
 3 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index dc1d6b9..22369e0 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -118,6 +118,7 @@ int get_be32(__be32 *val, const char *arg, int base);
 int get_be16(__be16 *val, const char *arg, int base);
 int get_addr64(__u64 *ap, const char *cp);
 
+int hex2mem(const char *buf, uint8_t *mem, int count);
 char *hexstring_n2a(const __u8 *str, int len, char *buf, int blen);
 __u8 *hexstring_a2n(const char *str, __u8 *buf, int blen, unsigned int *len);
 #define ADDR64_BUF_SIZE sizeof(":::")
diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 0f91aeb..88664c9 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -485,31 +485,6 @@ static int get_tunnel(struct l2tp_data *p)
  * Command parser
  */
 
-static int hex2mem(const char *buf, uint8_t *mem, int count)
-{
-   int i, j;
-   int c;
-
-   for (i = 0, j = 0; i < count; i++, j += 2) {
-   c = get_hex(buf[j]);
-   if (c < 0)
-   goto err;
-
-   mem[i] = c << 4;
-
-   c = get_hex(buf[j + 1]);
-   if (c < 0)
-   goto err;
-
-   mem[i] |= c;
-   }
-
-   return 0;
-
-err:
-   return -1;
-}
-
 static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
diff --git a/lib/utils.c b/lib/utils.c
index 83c9d09..870c4f1 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -962,6 +962,31 @@ __u8 *hexstring_a2n(const char *str, __u8 *buf, int blen, 
unsigned int *len)
return buf;
 }
 
+int hex2mem(const char *buf, uint8_t *mem, int count)
+{
+   int i, j;
+   int c;
+
+   for (i = 0, j = 0; i < count; i++, j += 2) {
+   c = get_hex(buf[j]);
+   if (c < 0)
+   goto err;
+
+   mem[i] = c << 4;
+
+   c = get_hex(buf[j + 1]);
+   if (c < 0)
+   goto err;
+
+   mem[i] |= c;
+   }
+
+   return 0;
+
+err:
+   return -1;
+}
+
 int addr64_n2a(__u64 addr, char *buff, size_t len)
 {
__u16 *words = (__u16 *)
-- 
1.9.1

Re: [PATCH iproute2 1/1] utils: make hex2mem available to all users

2017-01-14 Thread Jamal Hadi Salim


Sorry, messed up Stephen's address. Resending..

cheers,
jamal

On 17-01-14 05:04 PM, Jamal Hadi Salim wrote:

From: Jamal Hadi Salim 

hex2mem() api is useful for parsing hexstrings which are then packed in
a stream of chars.

Signed-off-by: Jamal Hadi Salim 
---
 include/utils.h |  1 +
 ip/ipl2tp.c | 25 -
 lib/utils.c | 25 +
 3 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index dc1d6b9..22369e0 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -118,6 +118,7 @@ int get_be32(__be32 *val, const char *arg, int base);
 int get_be16(__be16 *val, const char *arg, int base);
 int get_addr64(__u64 *ap, const char *cp);

+int hex2mem(const char *buf, uint8_t *mem, int count);
 char *hexstring_n2a(const __u8 *str, int len, char *buf, int blen);
 __u8 *hexstring_a2n(const char *str, __u8 *buf, int blen, unsigned int *len);
 #define ADDR64_BUF_SIZE sizeof(":::")
diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 0f91aeb..88664c9 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -485,31 +485,6 @@ static int get_tunnel(struct l2tp_data *p)
  * Command parser
  */

-static int hex2mem(const char *buf, uint8_t *mem, int count)
-{
-   int i, j;
-   int c;
-
-   for (i = 0, j = 0; i < count; i++, j += 2) {
-   c = get_hex(buf[j]);
-   if (c < 0)
-   goto err;
-
-   mem[i] = c << 4;
-
-   c = get_hex(buf[j + 1]);
-   if (c < 0)
-   goto err;
-
-   mem[i] |= c;
-   }
-
-   return 0;
-
-err:
-   return -1;
-}
-
 static void usage(void) __attribute__((noreturn));

 static void usage(void)
diff --git a/lib/utils.c b/lib/utils.c
index 83c9d09..870c4f1 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -962,6 +962,31 @@ __u8 *hexstring_a2n(const char *str, __u8 *buf, int blen, 
unsigned int *len)
return buf;
 }

+int hex2mem(const char *buf, uint8_t *mem, int count)
+{
+   int i, j;
+   int c;
+
+   for (i = 0, j = 0; i < count; i++, j += 2) {
+   c = get_hex(buf[j]);
+   if (c < 0)
+   goto err;
+
+   mem[i] = c << 4;
+
+   c = get_hex(buf[j + 1]);
+   if (c < 0)
+   goto err;
+
+   mem[i] |= c;
+   }
+
+   return 0;
+
+err:
+   return -1;
+}
+
 int addr64_n2a(__u64 addr, char *buff, size_t len)
 {
__u16 *words = (__u16 *)

[PATCH iproute2 1/1] utils: make hex2mem available to all users

2017-01-14 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

hex2mem() api is useful for parsing hexstrings which are then packed in
a stream of chars.

Signed-off-by: Jamal Hadi Salim 
---
 include/utils.h |  1 +
 ip/ipl2tp.c | 25 -
 lib/utils.c | 25 +
 3 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index dc1d6b9..22369e0 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -118,6 +118,7 @@ int get_be32(__be32 *val, const char *arg, int base);
 int get_be16(__be16 *val, const char *arg, int base);
 int get_addr64(__u64 *ap, const char *cp);
 
+int hex2mem(const char *buf, uint8_t *mem, int count);
 char *hexstring_n2a(const __u8 *str, int len, char *buf, int blen);
 __u8 *hexstring_a2n(const char *str, __u8 *buf, int blen, unsigned int *len);
 #define ADDR64_BUF_SIZE sizeof(":::")
diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 0f91aeb..88664c9 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -485,31 +485,6 @@ static int get_tunnel(struct l2tp_data *p)
  * Command parser
  */
 
-static int hex2mem(const char *buf, uint8_t *mem, int count)
-{
-   int i, j;
-   int c;
-
-   for (i = 0, j = 0; i < count; i++, j += 2) {
-   c = get_hex(buf[j]);
-   if (c < 0)
-   goto err;
-
-   mem[i] = c << 4;
-
-   c = get_hex(buf[j + 1]);
-   if (c < 0)
-   goto err;
-
-   mem[i] |= c;
-   }
-
-   return 0;
-
-err:
-   return -1;
-}
-
 static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
diff --git a/lib/utils.c b/lib/utils.c
index 83c9d09..870c4f1 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -962,6 +962,31 @@ __u8 *hexstring_a2n(const char *str, __u8 *buf, int blen, 
unsigned int *len)
return buf;
 }
 
+int hex2mem(const char *buf, uint8_t *mem, int count)
+{
+   int i, j;
+   int c;
+
+   for (i = 0, j = 0; i < count; i++, j += 2) {
+   c = get_hex(buf[j]);
+   if (c < 0)
+   goto err;
+
+   mem[i] = c << 4;
+
+   c = get_hex(buf[j + 1]);
+   if (c < 0)
+   goto err;
+
+   mem[i] |= c;
+   }
+
+   return 0;
+
+err:
+   return -1;
+}
+
 int addr64_n2a(__u64 addr, char *buff, size_t len)
 {
__u16 *words = (__u16 *)
-- 
1.9.1

[PATCH net-next 1/1] net sched actions: Add support for user cookies

2017-01-14 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (eg in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ intepret it.  The user can store whatever they wish in the
128 bits.

Sample exercise(using two 64bit values to represent the 128 bits):

.. create an accept action with cookie a1b2c3d4
sudo $TC actions add action ok index 1 cookie a1b2c3d4

.. dump all gact actions..
sudo $TC -s actions ls action gact

action order 0: gact action pass
 random type none pass val 0
 index 1 ref 1 bind 0 installed 5 sec used 5 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
cookie a1b2c3d4

.. bind the accept action to a filter..
sudo $TC filter add dev lo parent : protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1

... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms

--- 127.0.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2109ms
rtt min/avg/max/mdev = 0.020/0.028/0.038/0.008 ms 1

... show some stats
$ sudo $TC -s actions get action gact index 1

action order 1: gact action pass
 random type none pass val 0
 index 1 ref 2 bind 1 installed 204 sec used 5 sec
Action statistics:
Sent 12168 bytes 164 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
cookie a1b2c3d4

.. try longer cookie...
$ sudo $TC actions replace action ok index 1 cookie 1234567890abcdef
.. dump..
$ sudo $TC -s actions ls action gact

action order 1: gact action pass
 random type none pass val 0
 index 1 ref 2 bind 1 installed 204 sec used 5 sec
Action statistics:
Sent 12168 bytes 164 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
cookie 1234567890abcdef

Signed-off-by: Jamal Hadi Salim 
---
 include/net/act_api.h|  1 +
 include/uapi/linux/pkt_cls.h | 11 +++
 net/sched/act_api.c  | 28 ++--
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 1d71644..0692458 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -41,6 +41,7 @@ struct tc_action {
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
+   struct tc_cookie*act_ck;
 };
 #define tcf_head   common.tcfa_head
 #define tcf_index  common.tcfa_index
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 1e5e1dd..063bc89 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -4,6 +4,16 @@
 #include 
 #include 
 
+#define MAX_TC_COOKIE_SZ 16
+
+/* This structure holds cookie structure that is passed from user
+ * to the kernel for actions and classifiers
+ */
+struct tc_cookie {
+   unsigned char ck[MAX_TC_COOKIE_SZ];
+   unsigned char ck_len;
+};
+
 /* Action attributes */
 enum {
TCA_ACT_UNSPEC,
@@ -12,6 +22,7 @@ enum {
TCA_ACT_INDEX,
TCA_ACT_STATS,
TCA_ACT_PAD,
+   TCA_ACT_COOKIE,
__TCA_ACT_MAX
 };
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f04715a..b82908a 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -33,6 +33,7 @@ static void free_tcf(struct rcu_head *head)
 
free_percpu(p->cpu_bstats);
free_percpu(p->cpu_qstats);
+   kfree(p->act_ck);
kfree(p);
 }
 
@@ -464,8 +465,8 @@ int tcf_action_destroy(struct list_head *actions, int bind)
return a->ops->dump(skb, a, bind, ref);
 }
 
-int
-tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
+int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind,
+ int ref)
 {
int err = -EINVAL;
unsigned char *b = skb_tail_pointer(skb);
@@ -475,6 +476,12 @@ int tcf_action_destroy(struct list_head *actions, int bind)
goto nla_put_failure;
if (tcf_action_copy_stats(skb, a, 0))
goto nla_put_failure;
+   if (a->act_ck) {
+   if (nla_put(skb, TCA_ACT_COOKIE, a->act_ck->ck_len,
+   a->act_ck))
+   goto nla_put_failure;
+   }
+
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
goto nla_put_failure;
@@ -575,6 +582,23 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct nlattr *nla,

[PATCH net-next v3 04/10] net: dsa: Move ports assignment closer to error checking

2017-01-14 Thread Florian Fainelli

Move the assignment of ports in _dsa_register_switch() closer to where
it is checked, no functional change. Re-order declarations to be
preserve the inverted christmas tree style.

Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa2.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 04ab62251fe3..cd91070b5467 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -587,8 +587,8 @@ static struct device_node *dsa_get_ports(struct dsa_switch 
*ds,
 static int _dsa_register_switch(struct dsa_switch *ds, struct device *dev)
 {
struct device_node *np = dev->of_node;
-   struct device_node *ports = dsa_get_ports(ds, np);
struct dsa_switch_tree *dst;
+   struct device_node *ports;
u32 tree, index;
int i, err;
 
@@ -596,6 +596,7 @@ static int _dsa_register_switch(struct dsa_switch *ds, 
struct device *dev)
if (err)
return err;
 
+   ports = dsa_get_ports(ds, np);
if (IS_ERR(ports))
return PTR_ERR(ports);
 
-- 
2.9.3

[PATCH net-next v3 01/10] net: dsa: Pass device pointer to dsa_register_switch

2017-01-14 Thread Florian Fainelli

In preparation for allowing dsa_register_switch() to be supplied with
device/platform data, pass down a struct device pointer instead of a
struct device_node.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c |  2 +-
 drivers/net/dsa/mv88e6xxx/chip.c | 11 ++-
 drivers/net/dsa/qca8k.c  |  2 +-
 include/net/dsa.h|  2 +-
 net/dsa/dsa2.c   |  7 ---
 5 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 5102a3701a1a..7179eed9ee6d 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1882,7 +1882,7 @@ int b53_switch_register(struct b53_device *dev)
 
pr_info("found switch: %s, rev %i\n", dev->name, dev->core_rev);
 
-   return dsa_register_switch(dev->ds, dev->ds->dev->of_node);
+   return dsa_register_switch(dev->ds, dev->ds->dev);
 }
 EXPORT_SYMBOL(b53_switch_register);
 
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 987b2dbbd35a..3238a4752b98 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -4421,8 +4421,7 @@ static struct dsa_switch_driver mv88e6xxx_switch_drv = {
.ops= _switch_ops,
 };
 
-static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip,
-struct device_node *np)
+static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip)
 {
struct device *dev = chip->dev;
struct dsa_switch *ds;
@@ -4437,7 +4436,7 @@ static int mv88e6xxx_register_switch(struct 
mv88e6xxx_chip *chip,
 
dev_set_drvdata(dev, ds);
 
-   return dsa_register_switch(ds, np);
+   return dsa_register_switch(ds, dev);
 }
 
 static void mv88e6xxx_unregister_switch(struct mv88e6xxx_chip *chip)
@@ -4521,9 +4520,11 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev)
if (err)
goto out_g2_irq;
 
-   err = mv88e6xxx_register_switch(chip, np);
-   if (err)
+   err = mv88e6xxx_register_switch(chip);
+   if (err) {
+   mv88e6xxx_mdio_unregister(chip);
goto out_mdio;
+   }
 
return 0;
 
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 54d270d59eb0..c084aa484d2b 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -964,7 +964,7 @@ qca8k_sw_probe(struct mdio_device *mdiodev)
mutex_init(>reg_mutex);
dev_set_drvdata(>dev, priv);
 
-   return dsa_register_switch(priv->ds, priv->ds->dev->of_node);
+   return dsa_register_switch(priv->ds, >dev);
 }
 
 static void
diff --git a/include/net/dsa.h b/include/net/dsa.h
index b94d1f2ef912..16a502a6c26a 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -403,7 +403,7 @@ static inline bool dsa_uses_tagged_protocol(struct 
dsa_switch_tree *dst)
 }
 
 void dsa_unregister_switch(struct dsa_switch *ds);
-int dsa_register_switch(struct dsa_switch *ds, struct device_node *np);
+int dsa_register_switch(struct dsa_switch *ds, struct device *dev);
 #ifdef CONFIG_PM_SLEEP
 int dsa_switch_suspend(struct dsa_switch *ds);
 int dsa_switch_resume(struct dsa_switch *ds);
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 42a41d84053c..4170f7ea8e28 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -579,8 +579,9 @@ static struct device_node *dsa_get_ports(struct dsa_switch 
*ds,
return ports;
 }
 
-static int _dsa_register_switch(struct dsa_switch *ds, struct device_node *np)
+static int _dsa_register_switch(struct dsa_switch *ds, struct device *dev)
 {
+   struct device_node *np = dev->of_node;
struct device_node *ports = dsa_get_ports(ds, np);
struct dsa_switch_tree *dst;
u32 tree, index;
@@ -660,12 +661,12 @@ static int _dsa_register_switch(struct dsa_switch *ds, 
struct device_node *np)
return err;
 }
 
-int dsa_register_switch(struct dsa_switch *ds, struct device_node *np)
+int dsa_register_switch(struct dsa_switch *ds, struct device *dev)
 {
int err;
 
mutex_lock(_mutex);
-   err = _dsa_register_switch(ds, np);
+   err = _dsa_register_switch(ds, dev);
mutex_unlock(_mutex);
 
return err;
-- 
2.9.3

[PATCH net-next v3 02/10] net: dsa: Make most functions take a dsa_port argument

2017-01-14 Thread Florian Fainelli

In preparation for allowing platform data, and therefore no valid
device_node pointer, make most DSA functions takes a pointer to a
dsa_port structure whenever possible. While at it, introduce a
dsa_port_is_valid() helper function which checks whether port->dn is
NULL or not at the moment.

Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa.c  | 15 --
 net/dsa/dsa2.c | 61 +-
 net/dsa/dsa_priv.h |  4 ++--
 3 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index fd532487dfdf..2306d1b87c83 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -110,8 +110,9 @@ dsa_switch_probe(struct device *parent, struct device 
*host_dev, int sw_addr,
 
 /* basic switch operations **/
 int dsa_cpu_dsa_setup(struct dsa_switch *ds, struct device *dev,
- struct device_node *port_dn, int port)
+ struct dsa_port *dport, int port)
 {
+   struct device_node *port_dn = dport->dn;
struct phy_device *phydev;
int ret, mode;
 
@@ -141,15 +142,15 @@ int dsa_cpu_dsa_setup(struct dsa_switch *ds, struct 
device *dev,
 
 static int dsa_cpu_dsa_setups(struct dsa_switch *ds, struct device *dev)
 {
-   struct device_node *port_dn;
+   struct dsa_port *dport;
int ret, port;
 
for (port = 0; port < DSA_MAX_PORTS; port++) {
if (!(dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)))
continue;
 
-   port_dn = ds->ports[port].dn;
-   ret = dsa_cpu_dsa_setup(ds, dev, port_dn, port);
+   dport = >ports[port];
+   ret = dsa_cpu_dsa_setup(ds, dev, dport, port);
if (ret)
return ret;
}
@@ -366,8 +367,10 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
return ds;
 }
 
-void dsa_cpu_dsa_destroy(struct device_node *port_dn)
+void dsa_cpu_dsa_destroy(struct dsa_port *port)
 {
+   struct device_node *port_dn = port->dn;
+
if (of_phy_is_fixed_link(port_dn))
of_phy_deregister_fixed_link(port_dn);
 }
@@ -393,7 +396,7 @@ static void dsa_switch_destroy(struct dsa_switch *ds)
for (port = 0; port < DSA_MAX_PORTS; port++) {
if (!(dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)))
continue;
-   dsa_cpu_dsa_destroy(ds->ports[port].dn);
+   dsa_cpu_dsa_destroy(>ports[port]);
 
/* Clearing a bit which is not set does no harm */
ds->cpu_port_mask |= ~(1 << port);
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 4170f7ea8e28..6e3675220fef 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -79,14 +79,19 @@ static void dsa_dst_del_ds(struct dsa_switch_tree *dst,
kref_put(>refcount, dsa_free_dst);
 }
 
-static bool dsa_port_is_dsa(struct device_node *port)
+static bool dsa_port_is_valid(struct dsa_port *port)
 {
-   return !!of_parse_phandle(port, "link", 0);
+   return !!port->dn;
 }
 
-static bool dsa_port_is_cpu(struct device_node *port)
+static bool dsa_port_is_dsa(struct dsa_port *port)
 {
-   return !!of_parse_phandle(port, "ethernet", 0);
+   return !!of_parse_phandle(port->dn, "link", 0);
+}
+
+static bool dsa_port_is_cpu(struct dsa_port *port)
+{
+   return !!of_parse_phandle(port->dn, "ethernet", 0);
 }
 
 static bool dsa_ds_find_port(struct dsa_switch *ds,
@@ -120,7 +125,7 @@ static struct dsa_switch *dsa_dst_find_port(struct 
dsa_switch_tree *dst,
 
 static int dsa_port_complete(struct dsa_switch_tree *dst,
 struct dsa_switch *src_ds,
-struct device_node *port,
+struct dsa_port *port,
 u32 src_port)
 {
struct device_node *link;
@@ -128,7 +133,7 @@ static int dsa_port_complete(struct dsa_switch_tree *dst,
struct dsa_switch *dst_ds;
 
for (index = 0;; index++) {
-   link = of_parse_phandle(port, "link", index);
+   link = of_parse_phandle(port->dn, "link", index);
if (!link)
break;
 
@@ -151,13 +156,13 @@ static int dsa_port_complete(struct dsa_switch_tree *dst,
  */
 static int dsa_ds_complete(struct dsa_switch_tree *dst, struct dsa_switch *ds)
 {
-   struct device_node *port;
+   struct dsa_port *port;
u32 index;
int err;
 
for (index = 0; index < DSA_MAX_PORTS; index++) {
-   port = ds->ports[index].dn;
-   if (!port)
+   port = >ports[index];
+   if (!dsa_port_is_valid(port))
continue;
 
if (!dsa_port_is_dsa(port))
@@ -197,7 +202,7 @@ static int dsa_dst_complete(struct dsa_switch_tree *dst)
return 0;
 }
 
-static int dsa_dsa_port_apply(struct device_node

[PATCH net-next v3 00/10] net: dsa: Support for pdata in dsa2

2017-01-14 Thread Florian Fainelli

Hi all,

This is not exactly new, and was sent before, although back then, I did not
have an user of the pre-declared MDIO board information, but now we do. Note
that I have additional changes queued up to have b53 register platform data for
MIPS bcm47xx and bcm63xx.

Yes I know that we should have the Orion platforms eventually be converted to
Device Tree, but until that happens, I don't want any remaining users of the
old "dsa" platform device (hence the previous DTS submissions for ARM/mvebu)
and, there will be platforms out there that most likely won't never see DT
coming their way (BCM47xx is almost 100% sure, BCM63xx maybe not in a distant
future).

We would probably want the whole series to be merged via David Miller's tree
to simplify things.

Greg, can you Ack/Nack patch 5 since it touched the core LDD?

Vivien, since some patches did change, I did not apply your Tested-by tag
to all patches.

Thanks!

Changes in v3:

- Tested EPROBE_DEFER from a mockup MDIO/DSA switch driver and everything
  is fine, once the driver finally probes we have access to platform data
  as expected

- added comment above dsa_port_is_valid() that port->name is mandatory
  for platform data cases

- added an extra check in dsa_parse_member() for a NULL pdata pointer

- fixed a bunch of checkpatch errors and warnings

Changes in v2:

- Rebased against latest net-next/master

- Moved dev_find_class() to device_find_class() into drivers/base/core.c

- Moved dev_to_net_device into net/core/dev.c

- Utilize dsa_chip_data directly instead of dsa_platform_data

- Augmented dsa_chip_data to be multi-CPU port ready

Changes from last submission (few months back):

- rebased against latest net-next

- do not introduce dsa2_platform_data which was overkill and was meant to
  allow us to do exaclty the same things with platform data and Device Tree
  we use the existing dsa_platform_data instead

- properly register MDIO devices when the MDIO bus is registered and associate
  platform_data with them

- add a change to the Orion platform code to demonstrate how this can be used

Thank you

Florian Fainelli (10):
  net: dsa: Pass device pointer to dsa_register_switch
  net: dsa: Make most functions take a dsa_port argument
  net: dsa: Suffix function manipulating device_node with _dn
  net: dsa: Move ports assignment closer to error checking
  drivers: base: Add device_find_class()
  net: dsa: Migrate to device_find_class()
  net: Relocate dev_to_net_device() into core
  net: dsa: Add support for platform data
  net: phy: Allow pre-declaration of MDIO devices
  ARM: orion: Register DSA switch as a MDIO device

 arch/arm/mach-orion5x/common.c   |   2 +-
 arch/arm/mach-orion5x/common.h   |   4 +-
 arch/arm/mach-orion5x/rd88f5181l-fxo-setup.c |   7 +-
 arch/arm/mach-orion5x/rd88f5181l-ge-setup.c  |   7 +-
 arch/arm/mach-orion5x/rd88f6183ap-ge-setup.c |   7 +-
 arch/arm/mach-orion5x/wnr854t-setup.c|   2 +-
 arch/arm/mach-orion5x/wrt350n-v2-setup.c |   7 +-
 arch/arm/plat-orion/common.c |  25 +++-
 arch/arm/plat-orion/include/plat/common.h|   4 +-
 drivers/base/core.c  |  19 +++
 drivers/net/dsa/b53/b53_common.c |   2 +-
 drivers/net/dsa/mv88e6xxx/chip.c |  11 +-
 drivers/net/dsa/qca8k.c  |   2 +-
 drivers/net/phy/Makefile |   3 +-
 drivers/net/phy/mdio-boardinfo.c |  86 +
 drivers/net/phy/mdio-boardinfo.h |  19 +++
 drivers/net/phy/mdio_bus.c   |   4 +
 drivers/net/phy/mdio_device.c|  11 ++
 include/linux/device.h   |   1 +
 include/linux/mdio.h |   3 +
 include/linux/mod_devicetable.h  |   1 +
 include/linux/netdevice.h|   2 +
 include/linux/phy.h  |  19 +++
 include/net/dsa.h|   8 +-
 net/core/dev.c   |  19 +++
 net/dsa/dsa.c|  53 ++--
 net/dsa/dsa2.c   | 174 +++
 net/dsa/dsa_priv.h   |   4 +-
 28 files changed, 365 insertions(+), 141 deletions(-)
 create mode 100644 drivers/net/phy/mdio-boardinfo.c
 create mode 100644 drivers/net/phy/mdio-boardinfo.h

-- 
2.9.3

[PATCH net-next v3 07/10] net: Relocate dev_to_net_device() into core

2017-01-14 Thread Florian Fainelli

dev_to_net_device() is moved from net/dsa/dsa.c to net/core/dev.c since
it going to be used by net/dsa/dsa2.c and the namespace of the function
justifies making it available to other users potentially.

Signed-off-by: Florian Fainelli 
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c| 19 +++
 net/dsa/dsa.c | 18 --
 3 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97ae0ac513ee..6d021c37b774 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4390,4 +4390,6 @@ do {  
\
 #define PTYPE_HASH_SIZE(16)
 #define PTYPE_HASH_MASK(PTYPE_HASH_SIZE - 1)
 
+struct net_device *dev_to_net_device(struct device *dev);
+
 #endif /* _LINUX_NETDEVICE_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index ad5959e56116..7547e2ccc06b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8128,6 +8128,25 @@ const char *netdev_drivername(const struct net_device 
*dev)
return empty;
 }
 
+struct net_device *dev_to_net_device(struct device *dev)
+{
+   struct device *d;
+
+   d = device_find_class(dev, "net");
+   if (d) {
+   struct net_device *nd;
+
+   nd = to_net_dev(d);
+   dev_hold(nd);
+   put_device(d);
+
+   return nd;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(dev_to_net_device);
+
 static void __netdev_printk(const char *level, const struct net_device *dev,
struct va_format *vaf)
 {
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 77fa4c4f5828..6c264f92fec5 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -473,24 +473,6 @@ struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dsa_host_dev_to_mii_bus);
 
-static struct net_device *dev_to_net_device(struct device *dev)
-{
-   struct device *d;
-
-   d = device_find_class(dev, "net");
-   if (d != NULL) {
-   struct net_device *nd;
-
-   nd = to_net_dev(d);
-   dev_hold(nd);
-   put_device(d);
-
-   return nd;
-   }
-
-   return NULL;
-}
-
 #ifdef CONFIG_OF
 static int dsa_of_setup_routing_table(struct dsa_platform_data *pd,
struct dsa_chip_data *cd,
-- 
2.9.3

[PATCH net-next v3 03/10] net: dsa: Suffix function manipulating device_node with _dn

2017-01-14 Thread Florian Fainelli

Make it clear that these functions take a device_node structure pointer

Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa2.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 6e3675220fef..04ab62251fe3 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -94,8 +94,8 @@ static bool dsa_port_is_cpu(struct dsa_port *port)
return !!of_parse_phandle(port->dn, "ethernet", 0);
 }
 
-static bool dsa_ds_find_port(struct dsa_switch *ds,
-struct device_node *port)
+static bool dsa_ds_find_port_dn(struct dsa_switch *ds,
+   struct device_node *port)
 {
u32 index;
 
@@ -105,8 +105,8 @@ static bool dsa_ds_find_port(struct dsa_switch *ds,
return false;
 }
 
-static struct dsa_switch *dsa_dst_find_port(struct dsa_switch_tree *dst,
-   struct device_node *port)
+static struct dsa_switch *dsa_dst_find_port_dn(struct dsa_switch_tree *dst,
+  struct device_node *port)
 {
struct dsa_switch *ds;
u32 index;
@@ -116,7 +116,7 @@ static struct dsa_switch *dsa_dst_find_port(struct 
dsa_switch_tree *dst,
if (!ds)
continue;
 
-   if (dsa_ds_find_port(ds, port))
+   if (dsa_ds_find_port_dn(ds, port))
return ds;
}
 
@@ -137,7 +137,7 @@ static int dsa_port_complete(struct dsa_switch_tree *dst,
if (!link)
break;
 
-   dst_ds = dsa_dst_find_port(dst, link);
+   dst_ds = dsa_dst_find_port_dn(dst, link);
of_node_put(link);
 
if (!dst_ds)
@@ -546,7 +546,7 @@ static int dsa_parse_ports_dn(struct device_node *ports, 
struct dsa_switch *ds)
return 0;
 }
 
-static int dsa_parse_member(struct device_node *np, u32 *tree, u32 *index)
+static int dsa_parse_member_dn(struct device_node *np, u32 *tree, u32 *index)
 {
int err;
 
@@ -592,7 +592,7 @@ static int _dsa_register_switch(struct dsa_switch *ds, 
struct device *dev)
u32 tree, index;
int i, err;
 
-   err = dsa_parse_member(np, , );
+   err = dsa_parse_member_dn(np, , );
if (err)
return err;
 
-- 
2.9.3

[PATCH net-next v3 06/10] net: dsa: Migrate to device_find_class()

2017-01-14 Thread Florian Fainelli

Now that the base device driver code provides an identical
implementation of dev_find_class() utilize device_find_class() instead
of our own version of it.

Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa.c | 22 ++
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 2306d1b87c83..77fa4c4f5828 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -455,29 +455,11 @@ EXPORT_SYMBOL_GPL(dsa_switch_resume);
 #endif
 
 /* platform driver init and cleanup */
-static int dev_is_class(struct device *dev, void *class)
-{
-   if (dev->class != NULL && !strcmp(dev->class->name, class))
-   return 1;
-
-   return 0;
-}
-
-static struct device *dev_find_class(struct device *parent, char *class)
-{
-   if (dev_is_class(parent, class)) {
-   get_device(parent);
-   return parent;
-   }
-
-   return device_find_child(parent, class, dev_is_class);
-}
-
 struct mii_bus *dsa_host_dev_to_mii_bus(struct device *dev)
 {
struct device *d;
 
-   d = dev_find_class(dev, "mdio_bus");
+   d = device_find_class(dev, "mdio_bus");
if (d != NULL) {
struct mii_bus *bus;
 
@@ -495,7 +477,7 @@ static struct net_device *dev_to_net_device(struct device 
*dev)
 {
struct device *d;
 
-   d = dev_find_class(dev, "net");
+   d = device_find_class(dev, "net");
if (d != NULL) {
struct net_device *nd;
 
-- 
2.9.3

[PATCH net-next v3 08/10] net: dsa: Add support for platform data

2017-01-14 Thread Florian Fainelli

Allow drivers to use the new DSA API with platform data. Most of the
code in net/dsa/dsa2.c does not rely so much on device_nodes and can get
the same information from platform_data instead.

We purposely do not support distributed configurations with platform
data, so drivers should be providing a pointer to a 'struct
dsa_chip_data' structure if they wish to communicate per-port layout.

Multiple CPUs port could potentially be supported and dsa_chip_data is
extended to receive up to one reference to an upstream network device
per port described by a dsa_chip_data structure.

Signed-off-by: Florian Fainelli 
---
 include/net/dsa.h |   6 
 net/dsa/dsa2.c| 101 --
 2 files changed, 89 insertions(+), 18 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 16a502a6c26a..491008792e4d 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -42,6 +42,11 @@ struct dsa_chip_data {
struct device   *host_dev;
int sw_addr;
 
+   /*
+* Reference to network devices
+*/
+   struct device   *netdev[DSA_MAX_PORTS];
+
/* set to size of eeprom if supported by the switch */
int eeprom_len;
 
@@ -140,6 +145,7 @@ struct dsa_switch_tree {
 };
 
 struct dsa_port {
+   const char  *name;
struct net_device   *netdev;
struct device_node  *dn;
unsigned intageing_time;
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index cd91070b5467..598229b02fd3 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -79,19 +79,28 @@ static void dsa_dst_del_ds(struct dsa_switch_tree *dst,
kref_put(>refcount, dsa_free_dst);
 }
 
+/* For platform data configurations, we need to have a valid name argument to
+ * differentiate a disabled port from an enabled one
+ */
 static bool dsa_port_is_valid(struct dsa_port *port)
 {
-   return !!port->dn;
+   return !!(port->dn || port->name);
 }
 
 static bool dsa_port_is_dsa(struct dsa_port *port)
 {
-   return !!of_parse_phandle(port->dn, "link", 0);
+   if (port->name && !strcmp(port->name, "dsa"))
+   return true;
+   else
+   return !!of_parse_phandle(port->dn, "link", 0);
 }
 
 static bool dsa_port_is_cpu(struct dsa_port *port)
 {
-   return !!of_parse_phandle(port->dn, "ethernet", 0);
+   if (port->name && !strcmp(port->name, "cpu"))
+   return true;
+   else
+   return !!of_parse_phandle(port->dn, "ethernet", 0);
 }
 
 static bool dsa_ds_find_port_dn(struct dsa_switch *ds,
@@ -251,10 +260,11 @@ static void dsa_cpu_port_unapply(struct dsa_port *port, 
u32 index,
 static int dsa_user_port_apply(struct dsa_port *port, u32 index,
   struct dsa_switch *ds)
 {
-   const char *name;
+   const char *name = port->name;
int err;
 
-   name = of_get_property(port->dn, "label", NULL);
+   if (port->dn)
+   name = of_get_property(port->dn, "label", NULL);
if (!name)
name = "eth%d";
 
@@ -439,11 +449,15 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
struct net_device *ethernet_dev;
struct device_node *ethernet;
 
-   ethernet = of_parse_phandle(port->dn, "ethernet", 0);
-   if (!ethernet)
-   return -EINVAL;
+   if (port->dn) {
+   ethernet = of_parse_phandle(port->dn, "ethernet", 0);
+   if (!ethernet)
+   return -EINVAL;
+   ethernet_dev = of_find_net_device_by_node(ethernet);
+   } else {
+   ethernet_dev = dev_to_net_device(ds->cd->netdev[index]);
+   }
 
-   ethernet_dev = of_find_net_device_by_node(ethernet);
if (!ethernet_dev)
return -EPROBE_DEFER;
 
@@ -546,6 +560,33 @@ static int dsa_parse_ports_dn(struct device_node *ports, 
struct dsa_switch *ds)
return 0;
 }
 
+static int dsa_parse_ports(struct dsa_chip_data *cd, struct dsa_switch *ds)
+{
+   bool valid_name_found = false;
+   unsigned int i;
+
+   for (i = 0; i < DSA_MAX_PORTS; i++) {
+   if (!cd->port_names[i])
+   continue;
+
+   ds->ports[i].name = cd->port_names[i];
+
+   /* Initialize enabled_port_mask now for drv->setup()
+* to have access to a correct value, just like what
+* net/dsa/dsa.c::dsa_switch_setup_one does.
+*/
+   if (!dsa_port_is_cpu(>ports[i]))
+   ds->enabled_port_mask |= 1 << i;
+
+   valid_name_found = true;
+   }
+
+   if (!valid_name_found && i == DSA_MAX_PORTS)
+   return -EINVAL;
+
+   return 0;
+}
+
 static int dsa_parse_member_dn(struct device_node *np, u32 *tree, u32 *index)
 {
int err;
@@ -570,6 +611,18 @@ static int dsa_parse_member_dn(struct device_node *np, u32

[PATCH net-next v3 09/10] net: phy: Allow pre-declaration of MDIO devices

2017-01-14 Thread Florian Fainelli

Allow board support code to collect pre-declarations for MDIO devices by
registering them with mdiobus_register_board_info(). SPI and I2C buses
have a similar feature, we were missing this for MDIO devices, but this
is particularly useful for e.g: MDIO-connected switches which need to
provide their port layout (often board-specific) to a MDIO Ethernet
switch driver.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/Makefile |  3 +-
 drivers/net/phy/mdio-boardinfo.c | 86 
 drivers/net/phy/mdio-boardinfo.h | 19 +
 drivers/net/phy/mdio_bus.c   |  4 ++
 drivers/net/phy/mdio_device.c| 11 +
 include/linux/mdio.h |  3 ++
 include/linux/mod_devicetable.h  |  1 +
 include/linux/phy.h  | 19 +
 8 files changed, 145 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/phy/mdio-boardinfo.c
 create mode 100644 drivers/net/phy/mdio-boardinfo.h

diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 356859ac7c18..407b0b601ea8 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -1,6 +1,7 @@
 # Makefile for Linux PHY drivers and MDIO bus drivers
 
-libphy-y   := phy.o phy_device.o mdio_bus.o mdio_device.o
+libphy-y   := phy.o phy_device.o mdio_bus.o mdio_device.o \
+  mdio-boardinfo.o
 libphy-$(CONFIG_SWPHY) += swphy.o
 libphy-$(CONFIG_LED_TRIGGER_PHY)   += phy_led_triggers.o
 
diff --git a/drivers/net/phy/mdio-boardinfo.c b/drivers/net/phy/mdio-boardinfo.c
new file mode 100644
index ..6b988f77da08
--- /dev/null
+++ b/drivers/net/phy/mdio-boardinfo.c
@@ -0,0 +1,86 @@
+/*
+ * mdio-boardinfo - Collect pre-declarations for MDIO devices
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdio-boardinfo.h"
+
+static LIST_HEAD(mdio_board_list);
+static DEFINE_MUTEX(mdio_board_lock);
+
+/**
+ * mdiobus_setup_mdiodev_from_board_info - create and setup MDIO devices
+ * from pre-collected board specific MDIO information
+ * @mdiodev: MDIO device pointer
+ * Context: can sleep
+ */
+void mdiobus_setup_mdiodev_from_board_info(struct mii_bus *bus)
+{
+   struct mdio_board_entry *be;
+   struct mdio_device *mdiodev;
+   struct mdio_board_info *bi;
+   int ret;
+
+   mutex_lock(_board_lock);
+   list_for_each_entry(be, _board_list, list) {
+   bi = >board_info;
+
+   if (strcmp(bus->id, bi->bus_id))
+   continue;
+
+   mdiodev = mdio_device_create(bus, bi->mdio_addr);
+   if (IS_ERR(mdiodev))
+   continue;
+
+   strncpy(mdiodev->modalias, bi->modalias,
+   sizeof(mdiodev->modalias));
+   mdiodev->bus_match = mdio_device_bus_match;
+   mdiodev->dev.platform_data = (void *)bi->platform_data;
+
+   ret = mdio_device_register(mdiodev);
+   if (ret) {
+   mdio_device_free(mdiodev);
+   continue;
+   }
+   }
+   mutex_unlock(_board_lock);
+}
+
+/**
+ * mdio_register_board_info - register MDIO devices for a given board
+ * @info: array of devices descriptors
+ * @n: number of descriptors provided
+ * Context: can sleep
+ *
+ * The board info passed can be marked with __initdata but be pointers
+ * such as platform_data etc. are copied as-is
+ */
+int mdiobus_register_board_info(const struct mdio_board_info *info,
+   unsigned int n)
+{
+   struct mdio_board_entry *be;
+   unsigned int i;
+
+   be = kcalloc(n, sizeof(*be), GFP_KERNEL);
+   if (!be)
+   return -ENOMEM;
+
+   for (i = 0; i < n; i++, be++, info++) {
+   memcpy(>board_info, info, sizeof(*info));
+   mutex_lock(_board_lock);
+   list_add_tail(>list, _board_list);
+   mutex_unlock(_board_lock);
+   }
+
+   return 0;
+}
diff --git a/drivers/net/phy/mdio-boardinfo.h b/drivers/net/phy/mdio-boardinfo.h
new file mode 100644
index ..00f98163e90e
--- /dev/null
+++ b/drivers/net/phy/mdio-boardinfo.h
@@ -0,0 +1,19 @@
+/*
+ * mdio-boardinfo.h - board info interface internal to the mdio_bus
+ * component
+ */
+
+#ifndef __MDIO_BOARD_INFO_H
+#define __MDIO_BOARD_INFO_H
+
+#include 
+#include 
+
+struct mdio_board_entry {
+   struct list_headlist;
+   struct mdio_board_info  board_info;
+};
+
+void mdiobus_setup_mdiodev_from_board_info(struct mii_bus *bus);
+
+#endif /* __MDIO_BOARD_INFO_H */
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c

[PATCH net-next v3 10/10] ARM: orion: Register DSA switch as a MDIO device

2017-01-14 Thread Florian Fainelli

Utilize the ability to pass board specific MDIO bus information towards a
particular MDIO device thus allowing us to provide the per-port switch layout
to the Marvell 88E6XXX switch driver.

Since we would end-up with conflicting registration paths, do not register the
"dsa" platform device anymore.

Note that the MDIO devices registered by code in net/dsa/dsa2.c does not
parse a dsa_platform_data, but directly take a dsa_chip_data (specific
to a single switch chip), so we update the different call sites to pass
this structure down to orion_ge00_switch_init().

Signed-off-by: Florian Fainelli 
---
 arch/arm/mach-orion5x/common.c   |  2 +-
 arch/arm/mach-orion5x/common.h   |  4 ++--
 arch/arm/mach-orion5x/rd88f5181l-fxo-setup.c |  7 +--
 arch/arm/mach-orion5x/rd88f5181l-ge-setup.c  |  7 +--
 arch/arm/mach-orion5x/rd88f6183ap-ge-setup.c |  7 +--
 arch/arm/mach-orion5x/wnr854t-setup.c|  2 +-
 arch/arm/mach-orion5x/wrt350n-v2-setup.c |  7 +--
 arch/arm/plat-orion/common.c | 25 +++--
 arch/arm/plat-orion/include/plat/common.h|  4 ++--
 9 files changed, 29 insertions(+), 36 deletions(-)

diff --git a/arch/arm/mach-orion5x/common.c b/arch/arm/mach-orion5x/common.c
index 04910764c385..83a7ec4c16d0 100644
--- a/arch/arm/mach-orion5x/common.c
+++ b/arch/arm/mach-orion5x/common.c
@@ -105,7 +105,7 @@ void __init orion5x_eth_init(struct 
mv643xx_eth_platform_data *eth_data)
 /*
  * Ethernet switch
  /
-void __init orion5x_eth_switch_init(struct dsa_platform_data *d)
+void __init orion5x_eth_switch_init(struct dsa_chip_data *d)
 {
orion_ge00_switch_init(d);
 }
diff --git a/arch/arm/mach-orion5x/common.h b/arch/arm/mach-orion5x/common.h
index 8a4115bd441d..efeffc6b4ebb 100644
--- a/arch/arm/mach-orion5x/common.h
+++ b/arch/arm/mach-orion5x/common.h
@@ -3,7 +3,7 @@
 
 #include 
 
-struct dsa_platform_data;
+struct dsa_chip_data;
 struct mv643xx_eth_platform_data;
 struct mv_sata_platform_data;
 
@@ -41,7 +41,7 @@ void orion5x_setup_wins(void);
 void orion5x_ehci0_init(void);
 void orion5x_ehci1_init(void);
 void orion5x_eth_init(struct mv643xx_eth_platform_data *eth_data);
-void orion5x_eth_switch_init(struct dsa_platform_data *d);
+void orion5x_eth_switch_init(struct dsa_chip_data *d);
 void orion5x_i2c_init(void);
 void orion5x_sata_init(struct mv_sata_platform_data *sata_data);
 void orion5x_spi_init(void);
diff --git a/arch/arm/mach-orion5x/rd88f5181l-fxo-setup.c 
b/arch/arm/mach-orion5x/rd88f5181l-fxo-setup.c
index dccadf68ea2b..a3c1336d30c9 100644
--- a/arch/arm/mach-orion5x/rd88f5181l-fxo-setup.c
+++ b/arch/arm/mach-orion5x/rd88f5181l-fxo-setup.c
@@ -101,11 +101,6 @@ static struct dsa_chip_data 
rd88f5181l_fxo_switch_chip_data = {
.port_names[7]  = "lan3",
 };
 
-static struct dsa_platform_data __initdata rd88f5181l_fxo_switch_plat_data = {
-   .nr_chips   = 1,
-   .chip   = _fxo_switch_chip_data,
-};
-
 static void __init rd88f5181l_fxo_init(void)
 {
/*
@@ -120,7 +115,7 @@ static void __init rd88f5181l_fxo_init(void)
 */
orion5x_ehci0_init();
orion5x_eth_init(_fxo_eth_data);
-   orion5x_eth_switch_init(_fxo_switch_plat_data);
+   orion5x_eth_switch_init(_fxo_switch_chip_data);
orion5x_uart0_init();
 
mvebu_mbus_add_window_by_id(ORION_MBUS_DEVBUS_BOOT_TARGET,
diff --git a/arch/arm/mach-orion5x/rd88f5181l-ge-setup.c 
b/arch/arm/mach-orion5x/rd88f5181l-ge-setup.c
index affe5ec825de..252efe29bd1a 100644
--- a/arch/arm/mach-orion5x/rd88f5181l-ge-setup.c
+++ b/arch/arm/mach-orion5x/rd88f5181l-ge-setup.c
@@ -102,11 +102,6 @@ static struct dsa_chip_data rd88f5181l_ge_switch_chip_data 
= {
.port_names[7]  = "lan3",
 };
 
-static struct dsa_platform_data __initdata rd88f5181l_ge_switch_plat_data = {
-   .nr_chips   = 1,
-   .chip   = _ge_switch_chip_data,
-};
-
 static struct i2c_board_info __initdata rd88f5181l_ge_i2c_rtc = {
I2C_BOARD_INFO("ds1338", 0x68),
 };
@@ -125,7 +120,7 @@ static void __init rd88f5181l_ge_init(void)
 */
orion5x_ehci0_init();
orion5x_eth_init(_ge_eth_data);
-   orion5x_eth_switch_init(_ge_switch_plat_data);
+   orion5x_eth_switch_init(_ge_switch_chip_data);
orion5x_i2c_init();
orion5x_uart0_init();
 
diff --git a/arch/arm/mach-orion5x/rd88f6183ap-ge-setup.c 
b/arch/arm/mach-orion5x/rd88f6183ap-ge-setup.c
index 67ee8571b03c..f4f1dbe1d91d 100644
--- a/arch/arm/mach-orion5x/rd88f6183ap-ge-setup.c
+++ b/arch/arm/mach-orion5x/rd88f6183ap-ge-setup.c
@@ -40,11 +40,6 @@ static struct dsa_chip_data rd88f6183ap_ge_switch_chip_data 
= {
.port_names[5]  = "cpu",
 };
 
-static struct dsa_platform_data __initdata rd88f6183ap_ge_switch_plat_data = {
-   .nr_chips

[PATCH net-next v3 05/10] drivers: base: Add device_find_class()

2017-01-14 Thread Florian Fainelli

Add a helper function to lookup a device reference given a class name.
This is a preliminary patch to remove adhoc code from net/dsa/dsa.c and
make it more generic.

Signed-off-by: Florian Fainelli 
---
 drivers/base/core.c| 19 +++
 include/linux/device.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 020ea7f05520..3dd6047c10d8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2065,6 +2065,25 @@ struct device *device_find_child(struct device *parent, 
void *data,
 }
 EXPORT_SYMBOL_GPL(device_find_child);
 
+static int dev_is_class(struct device *dev, void *class)
+{
+   if (dev->class != NULL && !strcmp(dev->class->name, class))
+   return 1;
+
+   return 0;
+}
+
+struct device *device_find_class(struct device *parent, char *class)
+{
+   if (dev_is_class(parent, class)) {
+   get_device(parent);
+   return parent;
+   }
+
+   return device_find_child(parent, class, dev_is_class);
+}
+EXPORT_SYMBOL_GPL(device_find_class);
+
 int __init devices_init(void)
 {
devices_kset = kset_create_and_add("devices", _uevent_ops, NULL);
diff --git a/include/linux/device.h b/include/linux/device.h
index 491b4c0ca633..8d37f5ecb972 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1120,6 +1120,7 @@ extern int device_for_each_child_reverse(struct device 
*dev, void *data,
 int (*fn)(struct device *dev, void *data));
 extern struct device *device_find_child(struct device *dev, void *data,
int (*match)(struct device *dev, void *data));
+extern struct device *device_find_class(struct device *parent, char *class);
 extern int device_rename(struct device *dev, const char *new_name);
 extern int device_move(struct device *dev, struct device *new_parent,
   enum dpm_order dpm_order);
-- 
2.9.3

[PATCH 1/1] ax25: Fix segfault after sock connection timeout

2017-01-14 Thread Basil Gunn

The ax.25 socket connection timed out & the sock struct has been
previously taken down ie. sock struct is now a NULL pointer. Checking
the sock_flag causes the segfault.  Check if the socket struct pointer
is NULL before checking sock_flag. This segfault is seen in
timed out netrom connections.

Please submit to -stable.

Signed-off-by: Basil Gunn 
---

diff --git a/net/ax25/ax25_subr.c b/net/ax25/ax25_subr.c
index 4855d18..038b109 100644
--- a/net/ax25/ax25_subr.c
+++ b/net/ax25/ax25_subr.c
@@ -264,7 +264,7 @@ void ax25_disconnect(ax25_cb *ax25, int reason)
 {
ax25_clear_queues(ax25);

-   if (!sock_flag(ax25->sk, SOCK_DESTROY))
+   if (!ax25->sk || !sock_flag(ax25->sk, SOCK_DESTROY))
ax25_stop_heartbeat(ax25);
ax25_stop_t1timer(ax25);
ax25_stop_t2timer(ax25);

Re: [PATCH net-next 1/1] net sched actions: Add support for user cookies

2017-01-14 Thread John Fastabend

On 17-01-14 07:49 AM, Jiri Pirko wrote:
> Sat, Jan 14, 2017 at 04:39:24PM CET, j...@mojatatu.com wrote:
>> On 17-01-14 10:22 AM, Jiri Pirko wrote:
>>
 .. create an accept action with cookie 0xA:0xa0a0a0a0a0a0a0
 sudo $TC actions add action ok index 1 cookie 0xA 0xa0a0a0a0a0a0a0
>>>
>>> 2x 64bit values? Why can't this have variable length, according to what
>>> user needs:
>>
>>
>> You can intepret it however you wish. It is 128 bits. You can make it
>> 2x64, 4x32, 8x16, 16x8
>>
>>>
>>> sudo $TC actions add action ok index 1 cookie a0
>>> sudo $TC actions add action ok index 1 cookie a01122
>>> sudo $TC actions add action ok index 1 cookie a01122334455
>>> sudo $TC actions add action ok index 1 cookie a01122334455aabbccddeeff
>>>
>>
>> Sure you can do that too..
>> I will add add 16 8b fields to the union.
>>
>>

 .. dump all gact actions..
 sudo $TC -s actions ls action gact

action order 0: gact action pass
 random type none pass val 0
 index 1 ref 2 bind 1 installed 1221 sec used 27 sec
Action statistics:
Sent 373248 bytes 5056 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
 cookie(000a::a0a0a0a0:00a0a0a0)
>>>
>>> Input is 2x64 and dump is 4x32? That is confusing. With my suggested
>>> example, this would be:
>>>
>>>  cookie a0
>>>  cookie a01122
>>>  cookie a01122334455
>>>  cookie a01122334455aabbccddeeff
>>>
>>
>> Your suggestion is more sensible for a user space cli tool like tc.
>> I will add a uchar cku8[16] field and make changes to iproute2.
>>
 struct tc_action_ops;

 +union act_cookie {
 +  u16 ck16[8];
 +  u32 ck32[4];
 +  u64 ck64[2];
>>>
>>> Since this should be never interpreted by kernel, I don't understand why
>>> this union is needed. Why just don't pass a char array?
>>>
>>
>> programmatic usability.
> 
> I don't see why. In userspace you can map whatever struct you need to the
> mem with chararray. It's totally up to you as an app developer. There is
> no need to make that part of kernel api. Really.
> 
> 
>>
>>> Also, whatever format this is, could we make is shared with cls cookie?
>>>
>>>
>>
>> The structure could be shared (and because it is in pkt_cls.h
>> that makes it easier). But the TLVs are domain specific. We need another
>> one for classifiers.
>>
 +};
 +
 struct tc_action {
const struct tc_action_ops  *ops;
__u32   type; /* for backward 
 compat(TCA_OLD_COMPAT) */
 @@ -41,6 +47,7 @@ struct tc_action {
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
 +  union act_cookie*ck;
 };
 #define tcf_head   common.tcfa_head
 #define tcf_index  common.tcfa_index
 diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
 index 1e5e1dd..6379af3 100644
 --- a/include/uapi/linux/pkt_cls.h
 +++ b/include/uapi/linux/pkt_cls.h
 @@ -4,6 +4,12 @@
 #include 
 #include 

 +union u_act_cookie {
 +  __u16 ck16[8];
 +  __u32 ck32[4];
 +  __u64 ck64[2];
 +};
>>>
>>> Again, the same struct? I don't understand why twice.
>>
>> Just old habits.
>> user vs kernel api? Standard action approach one says
>> __u32 other says u32; hanging off the user variant to kernel
>> didnt feel right.
> 
> Just have it in uapi and use it from within kernel. But did you see what
> I suggested in the other thread (regarding IFLA_PHYS_PORT_ID and
> IFLA_PHYS_SWITCH_ID)? If you do it what way, you don't need no struct.
> 

+1 on using something like MAX_PHYS_ITEM_ID_LEN seems much nicer to me
and easy to extend later as Jiri notes.

Re: [PATCH net-next v2 08/10] net: dsa: Add support for platform data

2017-01-14 Thread Florian Fainelli

On 01/13/2017 02:37 PM, Florian Fainelli wrote:
> On 01/13/2017 06:04 AM, Andrew Lunn wrote:
>>> index cd91070b5467..d326fc4afad7 100644
>>> --- a/net/dsa/dsa2.c
>>> +++ b/net/dsa/dsa2.c
>>> @@ -81,17 +81,23 @@ static void dsa_dst_del_ds(struct dsa_switch_tree *dst,
>>>  
>>>  static bool dsa_port_is_valid(struct dsa_port *port)
>>>  {
>>> -   return !!port->dn;
>>> +   return !!(port->dn || port->name);
>>>  }
>>   
>> Does this clash with Viviens recent change to make names optional and
>> have the kernel assign it?
> 
> So there were two ways to look at this, one was that could check here
> that ds->pd is assigned and port->name is assigned, which means that
> platform data has to provide valid port name. We can also eliminate this
> check entirely because we now support NULL names just fines.

Considering that the comment above struct dsa_chip_data::port_names in
net/dsa/dsa.h is pretty clear about the port_names usage, I am tempted
to keep the code as-is since without a name, for platform data, we would
not have a way to tell if a port is disabled or not.

Does that work for you?
-- 
Florian

Re: [PATCH v5 01/13] net: ethernet: aquantia: Make and configuration files.

2017-01-14 Thread David VomLehn


On 01/14/2017 10:48 AM, Florian Fainelli wrote:

On 01/14/2017 10:42 AM, David VomLehn wrote:

Yes, we did have it that way at one point. But...there is also the
kernel philosophy of not putting in something for future expansion; you
can always do it later... Honestly, I've vacillated on this particular one.

(please don't top post). There are several threads at the moment talking
about renaming driver directory (synopsys/stmmac, mlx5), doing it later
is certainly a possibility but is really frowned upon, since it will
later on make the life of people backporting -stable changes a lot harder.
About that top-posting...I agree, but Microsoft has the entire planet 
doing it the other way. They find bottom posting confusing and often 
can't even figure out that you replied. One more reason I've hated 
Microsoft since 1976. So, I suspect I'm not the only one who 
occasionally slips up. My apologies.


Even if there is just one driver at the moment, I would go with a
dedicated directory for it, there are enough object files that justify
this choice IMHO.

Good argument.

--
David VL

Re: [PATCH net-next v2 05/10] drivers: base: Add device_find_class()

2017-01-14 Thread Florian Fainelli



On 01/13/2017 02:55 AM, David Laight wrote:
> From: Florian Fainelli
>> Sent: 12 January 2017 22:51
>> On 01/12/2017 01:21 PM, David Miller wrote:
>>> From: Florian Fainelli 
>>> Date: Wed, 11 Jan 2017 19:41:16 -0800
>>>
 Add a helper function to lookup a device reference given a class name.
 This is a preliminary patch to remove adhoc code from net/dsa/dsa.c and
 make it more generic.
> ...
 +static int dev_is_class(struct device *dev, void *class)
>>>
>>> I know you are just moving code, but this class argumnet is a string
>>> and thus should be "char *" or even "const char *".
>>
>> Well, this is really so that we don't need to cast the arguments passed
>> to device_find_child(), which takes a void *data as well. If we made
>> that a const char *class, we'd get warnings that look like these:
>>
>> drivers/base/core.c: In function 'device_find_class':
>> drivers/base/core.c:2083:2: warning: passing argument 2 of
>> 'device_find_child' discards 'const' qualifier from pointer target type
>> [enabled by default]
>>   return device_find_child(parent, class, dev_is_class);
>>   ^
>> drivers/base/core.c:2050:16: note: expected 'void *' but argument is of
>> type 'const char *'
>>  struct device *device_find_child(struct device *parent, void *data,
>> ^
> ...
> 
> Maybe device_find_child() needs changing to take 'const void *' ?

As a separate patch set, sure, I will add that to my TODO. Thanks!
-- 
Florian

Re: [PATCH v5 01/13] net: ethernet: aquantia: Make and configuration files.

2017-01-14 Thread Florian Fainelli

On 01/14/2017 10:42 AM, David VomLehn wrote:
> Yes, we did have it that way at one point. But...there is also the
> kernel philosophy of not putting in something for future expansion; you
> can always do it later... Honestly, I've vacillated on this particular one.

(please don't top post). There are several threads at the moment talking
about renaming driver directory (synopsys/stmmac, mlx5), doing it later
is certainly a possibility but is really frowned upon, since it will
later on make the life of people backporting -stable changes a lot harder.

Even if there is just one driver at the moment, I would go with a
dedicated directory for it, there are enough object files that justify
this choice IMHO.
-- 
Florian

Re: [PATCH net] openvswitch: maintain correct checksum state in conntrack actions

2017-01-14 Thread Pravin Shelar

On Thu, Jan 12, 2017 at 4:33 PM, Lance Richardson  wrote:
> When executing conntrack actions on skbuffs with checksum mode
> CHECKSUM_COMPLETE, the checksum must be updated to account for
> header pushes and pulls. Otherwise we get "hw csum failure"
> logs similar to this (ICMP packet received on geneve tunnel
> via ixgbe NIC):
>
> [  405.740065] genev_sys_6081: hw csum failure
> [  405.740106] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G  I 
> 4.10.0-rc3+ #1
> [  405.740108] Call Trace:
> [  405.740110]  
> [  405.740113]  dump_stack+0x63/0x87
> [  405.740116]  netdev_rx_csum_fault+0x3a/0x40
> [  405.740118]  __skb_checksum_complete+0xcf/0xe0
> [  405.740120]  nf_ip_checksum+0xc8/0xf0
> [  405.740124]  icmp_error+0x1de/0x351 [nf_conntrack_ipv4]
> [  405.740132]  nf_conntrack_in+0xe1/0x550 [nf_conntrack]
> [  405.740137]  ? find_bucket.isra.2+0x62/0x70 [openvswitch]
> [  405.740143]  __ovs_ct_lookup+0x95/0x980 [openvswitch]
> [  405.740145]  ? netif_rx_internal+0x44/0x110
> [  405.740149]  ovs_ct_execute+0x147/0x4b0 [openvswitch]
> [  405.740153]  do_execute_actions+0x22e/0xa70 [openvswitch]
> [  405.740157]  ovs_execute_actions+0x40/0x120 [openvswitch]
> [  405.740161]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
> [  405.740166]  ovs_vport_receive+0x73/0xd0 [openvswitch]
> [  405.740168]  ? udp_rcv+0x1a/0x20
> [  405.740170]  ? ip_local_deliver_finish+0x93/0x1e0
> [  405.740172]  ? ip_local_deliver+0x6f/0xe0
> [  405.740174]  ? ip_rcv_finish+0x3a0/0x3a0
> [  405.740176]  ? ip_rcv_finish+0xdb/0x3a0
> [  405.740177]  ? ip_rcv+0x2a7/0x400
> [  405.740180]  ? __netif_receive_skb_core+0x970/0xa00
> [  405.740185]  netdev_frame_hook+0xd3/0x160 [openvswitch]
> [  405.740187]  __netif_receive_skb_core+0x1dc/0xa00
> [  405.740194]  ? ixgbe_clean_rx_irq+0x46d/0xa20 [ixgbe]
> [  405.740197]  __netif_receive_skb+0x18/0x60
> [  405.740199]  netif_receive_skb_internal+0x40/0xb0
> [  405.740201]  napi_gro_receive+0xcd/0x120
> [  405.740204]  gro_cell_poll+0x57/0x80 [geneve]
> [  405.740206]  net_rx_action+0x260/0x3c0
> [  405.740209]  __do_softirq+0xc9/0x28c
> [  405.740211]  irq_exit+0xd9/0xf0
> [  405.740213]  do_IRQ+0x51/0xd0
> [  405.740215]  common_interrupt+0x93/0x93
>
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Signed-off-by: Lance Richardson 

Looks good.
Acked-by: Pravin B Shelar 

Thanks.

Re: [PATCH v5 01/13] net: ethernet: aquantia: Make and configuration files.

2017-01-14 Thread David VomLehn

Yes, we did have it that way at one point. But...there is also the 
kernel philosophy of not putting in something for future expansion; you 
can always do it later... Honestly, I've vacillated on this particular one.


On 01/14/2017 10:39 AM, Florian Fainelli wrote:


On 01/12/2017 09:02 PM, Alexander Loktionov wrote:

From: David VomLehn 

Patches to create the make and configuration files.

Signed-off-by: Alexander Loktionov 
Signed-off-by: Dmitrii Tarakanov 
Signed-off-by: Pavel Belous 
Signed-off-by: Dmitry Bezrukov 
Signed-off-by: David M. VomLehn 
---
  drivers/net/ethernet/aquantia/Kconfig  | 24 +++
  drivers/net/ethernet/aquantia/Makefile | 42 ++
  drivers/net/ethernet/aquantia/ver.h| 18 +++
  3 files changed, 84 insertions(+)
  create mode 100644 drivers/net/ethernet/aquantia/Kconfig
  create mode 100644 drivers/net/ethernet/aquantia/Makefile
  create mode 100644 drivers/net/ethernet/aquantia/ver.h

diff --git a/drivers/net/ethernet/aquantia/Kconfig 
b/drivers/net/ethernet/aquantia/Kconfig
new file mode 100644
index 000..a74a4c0
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/Kconfig
@@ -0,0 +1,24 @@
+#
+# aQuantia device configuration
+#
+
+config NET_VENDOR_AQUANTIA
+   bool "aQuantia devices"
+   default y
+   ---help---
+ Set this to y if you have an Ethernet network cards that uses the 
aQuantia
+ chipset.
+
+ This option does not build any drivers; it casues the aQuantia
+ drivers that can be built to appear in the list of Ethernet drivers.
+
+
+if NET_VENDOR_AQUANTIA
+
+config AQTION
+   tristate "aQuantia AQtion Support"
+   depends on PCI
+   ---help---
+ This enables the support for the aQuantia AQtion Ethernet card.
+
+endif # NET_VENDOR_AQUANTIA
diff --git a/drivers/net/ethernet/aquantia/Makefile 
b/drivers/net/ethernet/aquantia/Makefile
new file mode 100644
index 000..e4ae696
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/Makefile
@@ -0,0 +1,42 @@
+
+#
+# aQuantia Ethernet Controller AQtion Linux Driver
+# Copyright(c) 2014-2017 aQuantia Corporation.
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms and conditions of the GNU General Public License,
+# version 2, as published by the Free Software Foundation.
+#
+# This program is distributed in the hope it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+# more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program. If not, see .
+#
+# The full GNU General Public License is included in this distribution in
+# the file called "COPYING".
+#
+# Contact Information: 
+# aQuantia Corporation, 105 E. Tasman Dr. San Jose, CA 95134, USA
+#
+
+
+#
+# Makefile for the AQtion(tm) Ethernet driver
+#
+
+obj-$(CONFIG_AQTION) += atlantic.o
+
+atlantic-objs := aq_main.o \
+   aq_nic.o \
+   aq_pci_func.o \
+   aq_vec.o \
+   aq_ring.o \
+   aq_hw_utils.o \
+   aq_ethtool.o \
+   hw_atl/hw_atl_a0.o \
+   hw_atl/hw_atl_b0.o \
+   hw_atl/hw_atl_utils.o \
+   hw_atl/hw_atl_llh.o

You might want to create an aqtion or atlantic folder just in case you
later on submit a driver for another aquantia NIC. That would keep the
directory structure clean.



--
David VL

Re: [PATCH v5 01/13] net: ethernet: aquantia: Make and configuration files.

2017-01-14 Thread Florian Fainelli



On 01/12/2017 09:02 PM, Alexander Loktionov wrote:
> From: David VomLehn 
> 
> Patches to create the make and configuration files.
> 
> Signed-off-by: Alexander Loktionov 
> Signed-off-by: Dmitrii Tarakanov 
> Signed-off-by: Pavel Belous 
> Signed-off-by: Dmitry Bezrukov 
> Signed-off-by: David M. VomLehn 
> ---
>  drivers/net/ethernet/aquantia/Kconfig  | 24 +++
>  drivers/net/ethernet/aquantia/Makefile | 42 
> ++
>  drivers/net/ethernet/aquantia/ver.h| 18 +++
>  3 files changed, 84 insertions(+)
>  create mode 100644 drivers/net/ethernet/aquantia/Kconfig
>  create mode 100644 drivers/net/ethernet/aquantia/Makefile
>  create mode 100644 drivers/net/ethernet/aquantia/ver.h
> 
> diff --git a/drivers/net/ethernet/aquantia/Kconfig 
> b/drivers/net/ethernet/aquantia/Kconfig
> new file mode 100644
> index 000..a74a4c0
> --- /dev/null
> +++ b/drivers/net/ethernet/aquantia/Kconfig
> @@ -0,0 +1,24 @@
> +#
> +# aQuantia device configuration
> +#
> +
> +config NET_VENDOR_AQUANTIA
> + bool "aQuantia devices"
> + default y
> + ---help---
> +   Set this to y if you have an Ethernet network cards that uses the 
> aQuantia
> +   chipset.
> +
> +   This option does not build any drivers; it casues the aQuantia
> +   drivers that can be built to appear in the list of Ethernet drivers.
> +
> +
> +if NET_VENDOR_AQUANTIA
> +
> +config AQTION
> + tristate "aQuantia AQtion Support"
> + depends on PCI
> + ---help---
> +   This enables the support for the aQuantia AQtion Ethernet card.
> +
> +endif # NET_VENDOR_AQUANTIA
> diff --git a/drivers/net/ethernet/aquantia/Makefile 
> b/drivers/net/ethernet/aquantia/Makefile
> new file mode 100644
> index 000..e4ae696
> --- /dev/null
> +++ b/drivers/net/ethernet/aquantia/Makefile
> @@ -0,0 +1,42 @@
> +
> +#
> +# aQuantia Ethernet Controller AQtion Linux Driver
> +# Copyright(c) 2014-2017 aQuantia Corporation.
> +#
> +# This program is free software; you can redistribute it and/or modify it
> +# under the terms and conditions of the GNU General Public License,
> +# version 2, as published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope it will be useful, but WITHOUT
> +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> +# more details.
> +#
> +# You should have received a copy of the GNU General Public License along
> +# with this program. If not, see .
> +#
> +# The full GNU General Public License is included in this distribution in
> +# the file called "COPYING".
> +#
> +# Contact Information: 
> +# aQuantia Corporation, 105 E. Tasman Dr. San Jose, CA 95134, USA
> +#
> +
> +
> +#
> +# Makefile for the AQtion(tm) Ethernet driver
> +#
> +
> +obj-$(CONFIG_AQTION) += atlantic.o
> +
> +atlantic-objs := aq_main.o \
> + aq_nic.o \
> + aq_pci_func.o \
> + aq_vec.o \
> + aq_ring.o \
> + aq_hw_utils.o \
> + aq_ethtool.o \
> + hw_atl/hw_atl_a0.o \
> + hw_atl/hw_atl_b0.o \
> + hw_atl/hw_atl_utils.o \
> + hw_atl/hw_atl_llh.o

You might want to create an aqtion or atlantic folder just in case you
later on submit a driver for another aquantia NIC. That would keep the
directory structure clean.
-- 
Florian

Re: [PATCH] net: add regs attribute to phy device for user diagnose

2017-01-14 Thread Florian Fainelli

On 01/14/2017 08:24 AM, Andrew Lunn wrote:
> On Sat, Jan 14, 2017 at 10:46:31AM +0800, yuan linyu wrote:
>> From: yuan linyu 
>>
>> if phy device have register(s) configuration problem,
>> user can use this attribute to diagnose.
>> this feature need phy driver maintainer implement.
> 
> what is wrong with mii-tool -vv ?

Agreed, and without an actual user of this API (ethtool?), nor a PHY
driver implementing it, we cannot quite see how you want to make use of
this.

Thank you
-- 
Florian

Re: [PATCH] net: korina: use new api ethtool_{get|set}_link_ksettings

2017-01-14 Thread Florian Fainelli

On 01/14/2017 03:33 AM, Philippe Reynes wrote:
> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-14 Thread Jamal Hadi Salim


On 17-01-14 10:29 AM, Jiri Pirko wrote:

Sat, Jan 14, 2017 at 04:03:17PM CET, j...@mojatatu.com wrote:



Fair. So could this be done like IFLA_PHYS_SWITCH_ID and
IFLA_PHYS_PORT_ID. They can have variable size, max is MAX_PHYS_ITEM_ID_LEN

We can let user to pass arbitrary len up to 16 bytes. This has benefit in
fact that if in future this needs to be extended to say 32 bytes, it is
backward compatible. We just change the check in kernel.



Sure. Ive run out of time for now; but will work on a new version later.

cheers,
jamal

Re: [pull request][for-next] Mellanox mlx5 Reorganize core driver directory layout

2017-01-14 Thread Tom Herbert

On Fri, Jan 13, 2017 at 2:45 PM, Saeed Mahameed
 wrote:
> On Sat, Jan 14, 2017 at 12:06 AM, Tom Herbert  wrote:
>> On Fri, Jan 13, 2017 at 12:29 PM, Leon Romanovsky  wrote:
>>> On Fri, Jan 13, 2017 at 12:14:07PM -0500, David Miller wrote:
 From: Saeed Mahameed 
 Date: Thu, 12 Jan 2017 19:22:34 +0200

 > This pull request includes one patch from Leon, this patch as described
 > below will change the driver directory structure and layout for better,
 > logical and modular driver files separation.
 >
 > This change is important to both rdma and net maintainers in order to
 > have smoother management of driver patches for different mlx5 sub modules
 > and smoother rdma-next vs. net-next features submissions.
 >
 > Please find more info below -in the tag commit message-,
 > review and let us know if there's any problem.
 >
 > This change doesn't introduce any conflicts with the current mlx5
 > fixes and cleanups posted on 2017-01-10 to net branch, and merge tests
 > worked flawlessly with no issues.
 >
 > This is the last pull request meant for both rdma-next and net-next.
 > Once pulled, this will be the base shared code for both trees.

 This is pretty crazy, it will make all bug fix backporting to -stable
 a complete nightmare for myself, Doug, various distribution maintainers
 and many other people who quietly have to maintain their own trees and
 do backporting.
>>>
>>> Hi Dave,
>>>
>>> I understand your worries, but our case is similar to various other
>>> drivers, for example hfi1 which was in staging for years while
>>> supported in RedHat and moved from there to IB. The Chelsio drivers did
>>> similar reorg in 2016 (drivers/net/ethernet/chelsio/libcxgb) while their
>>> drivers were in the tree for long time before.
>>>
>>> Additionally, Doug doesn't need to maintain -stable queue and it is done
>>> by relevant submaintainers who are adding stable tags by themselves. In
>>> the IB case, the burden will continue to be on me and not on Doug.
>>>
>> Recently I had to backport the mlx5 driver from 4.9 to 4.6 in order to
>> get support for XDP. The biggest issue I faced was the lack of
>> modularity in the many driver features that are now supported. The
>> problem with backporting these new features is the spider web of
>> dependencies that they bring in from the rest of the kernel. I ended
>> up taking out en_rep, vxlan, en_tc, eswitch, and dcbnl. The result was
>> ~340 patches which is still a lot but at least this was constrained to
>> patches in the mlx5 directories and are relevant to what we want to
>> do.
>>
>> In lieu of restructuring the directories, I would much rather see more
>> config options so that we can build drivers that don't unnecessarily
>> complicate our lives with features we don't use. This is not just true
>> for Mellanox, but I would say it would be true of any driver that
>> someone is trying to deploy and maintain at large scale.
>>
>
> I think we should have both, if the restructuring made right,
> new whole features (e.g eswitch and eswitch offlaods or any independent 
> module),
> can sit in their own directory and keep their own logic concentrated
> in one place, and only touch the
> main driver code with simple entry points in the main flow,  this way
> you can simply compile their whole directories
> out with a config flag directly from the Makefile.
>
That would be nice, but that is not what your new layout gives us yet.
The starting point for this structure should be to discern what the
minimum set of files is to build a functional and good performance
driver with the basic functionality including only the most common
offloads. These I would think constitute the core files for the driver
and hence make sense to be in the "core" directory. Personally I would
drop the en_ file prefix and not make an en directory, we're already
in the ethernet driver part of the tree so I don't know what this name
is supposed to convey. For reference I gave you this list of
components I disabled to do the backport, some we're not split out in
your directory layout. A good example is the VXLAN offload support,
the changes in that area we're mostly due to Alexander's patch set to
redo the interface for these accelerations-- so to back port driver
from 4.9 to 4.6 I would need to pull those in and whatever else those
depend on. It was way easier to just ifdef out the relevant calls and
not compile the vxlan files to do the backport.

Tom

>> Btw, we did hit one issue in the backport. We started to get rx csum
>> faults (checksum complete value indicates TCP checksum is bad, but
>> host computation says checksum is good). I ran against 4.9 upstream
>> kernel and do see these, however don't see them in 4.10. I haven't
>> bisected yet. Is this a known issue?
>>
>
> Not to me, I don't recall any csum related fixes

Re: [PATCH] can: Fix kernel panic at security_sock_rcv_skb

2017-01-14 Thread Eric Dumazet

On Sat, 2017-01-14 at 14:53 +0100, Oliver Hartkopp wrote:
> Hello Eric,
> 
> On 01/14/2017 04:43 AM, Liu Shuo wrote:
> > On Thu 12.Jan'17 at 17:33:38 +0100, Oliver Hartkopp wrote:
> >> On 01/12/2017 02:01 PM, Eric Dumazet wrote:
> 
> >>> The main problem seems that the sockets themselves are not RCU
> >>> protected.
> >>>
> >>> If CAN uses RCU for delivery, then sockets should be freed only after
> >>> one RCU grace period.
> >>>
> >>> On recent kernels, following patch could help :
> >>>
> >>
> >> Thanks Eric!
> >>
> >> @Liu ShuoX: Can you check if Eric's suggestion fixes the issue in your
> >> setup?
> > Sorry for late reply. I was OOO yesterday.
> > With Eric's hint, i just found his patch that "net: add SOCK_RCU_FREE
> > socket flag" in the latest kernel. With backporting this one plus Eric's
> > following patch, it fixs my failure.
> 
> what would be the best approach to fix this issue - even in stable kernels?
> 
> E.g. would this change be ok for a stable as a quick fix?
> 
> diff --git a/net/can/af_can.c b/net/can/af_can.c
> index 1108079d934f..6b974c2b66ef 100644
> --- a/net/can/af_can.c
> +++ b/net/can/af_can.c
> @@ -112,6 +112,7 @@ EXPORT_SYMBOL(can_ioctl);
> 
>   static void can_sock_destruct(struct sock *sk)
>   {
> +   synchronize_rcu();
>  skb_queue_purge(>sk_receive_queue);
>   }

Adding a synchronize_rcu() at socket close time might have side effects,
if say an application had 1000 such sockets and dies.

This might add 20 seconds of exit time and have serious implications.

I will submit the second patch : It is working for all linux versions.

> 
> And once this arrived in the mainline tree your suggested patch could be 
> applied?
> 
> In any case we should not forget to give Reported-by credits to Liu.

Sure

Re: pull-request: mac80211-next 2017-01-13

2017-01-14 Thread David Miller

From: Johannes Berg 
Date: Fri, 13 Jan 2017 11:29:32 +0100

> This is my first pull request for net-next, and it seems a bit
> bigger than the past few releases. Detailed information below,
> as usual.
> 
> Please pull and let me know if there's any problem.

Pulled, thanks Johannes.

Re: [PATCH] cxgb4: Remove redundant memset before memcpy

2017-01-14 Thread David Miller

From: Shyam Saini 
Date: Sat, 14 Jan 2017 06:47:40 +0530

> The region set by the call to memset, immediately overwritten by
> the subsequent call to memcpy and thus makes the  memset redundant.
> 
> Also remove the memset((, 0, sizeof(info)) on line 398 because
> info is memcpy()'ed to before being used in the loop and it isn't
> used outside of the loop.
> 
> Signed-off-by: Shyam Saini 

Applied to net-next, thanks.

Re: [PATCH v3 1/2] bpf: add a longest prefix match trie map implementation

2017-01-14 Thread Alexei Starovoitov


On 1/14/17 4:17 AM, Daniel Mack wrote:

+static struct bpf_map *trie_alloc(union bpf_attr *attr)
+{
+   size_t cost, cost_per_node;
+   struct lpm_trie *trie;
+   int ret;
+
+   /* check sanity of attributes */
+   if (attr->max_entries == 0 ||
+   attr->map_flags != BPF_F_NO_PREALLOC ||
+   attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
+   attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
+   attr->value_size == 0)
+   return ERR_PTR(-EINVAL);


could you also make it root only for now ?
Like we did for lru map:
if (lru && !capable(CAP_SYS_ADMIN))
/* LRU implementation is much complicated than other
 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
 */
return ERR_PTR(-EPERM);

trie is not that complicated, but I think socket_filters
(the only unpriv prog type today) can live a release or
two without ability to use it.

In patch 2/2 there are some comments not in networking style:
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is,

As far as performance it has to be measured from datapath.
Like create tc+cls_bpf prog that does a dozen of trie lookups,
attach it to veth or tap and use
samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh
to send traffic into it.
Like: samples/bpf/test_cls_bpf.sh does.

Another alternative is to extend samples/bpf/map_perf_test
It has perf tests for most map types today (including lru)
and trie would be natural addition there.
I would prefer this latter option.

Thanks!

Re: [PATCH] net: add regs attribute to phy device for user diagnose

2017-01-14 Thread Andrew Lunn

On Sat, Jan 14, 2017 at 10:46:31AM +0800, yuan linyu wrote:
> From: yuan linyu 
> 
> if phy device have register(s) configuration problem,
> user can use this attribute to diagnose.
> this feature need phy driver maintainer implement.

what is wrong with mii-tool -vv ?

 Andrew

Re: [PATCH net-next 1/1] net sched actions: Add support for user cookies

2017-01-14 Thread Jiri Pirko

Sat, Jan 14, 2017 at 04:39:24PM CET, j...@mojatatu.com wrote:
>On 17-01-14 10:22 AM, Jiri Pirko wrote:
>
>> > .. create an accept action with cookie 0xA:0xa0a0a0a0a0a0a0
>> > sudo $TC actions add action ok index 1 cookie 0xA 0xa0a0a0a0a0a0a0
>> 
>> 2x 64bit values? Why can't this have variable length, according to what
>> user needs:
>
>
>You can intepret it however you wish. It is 128 bits. You can make it
>2x64, 4x32, 8x16, 16x8
>
>> 
>> sudo $TC actions add action ok index 1 cookie a0
>> sudo $TC actions add action ok index 1 cookie a01122
>> sudo $TC actions add action ok index 1 cookie a01122334455
>> sudo $TC actions add action ok index 1 cookie a01122334455aabbccddeeff
>> 
>
>Sure you can do that too..
>I will add add 16 8b fields to the union.
>
>
>> > 
>> > .. dump all gact actions..
>> > sudo $TC -s actions ls action gact
>> > 
>> >action order 0: gact action pass
>> > random type none pass val 0
>> > index 1 ref 2 bind 1 installed 1221 sec used 27 sec
>> >Action statistics:
>> >Sent 373248 bytes 5056 pkt (dropped 0, overlimits 0 requeues 0)
>> >backlog 0b 0p requeues 0
>> > cookie(000a::a0a0a0a0:00a0a0a0)
>> 
>> Input is 2x64 and dump is 4x32? That is confusing. With my suggested
>> example, this would be:
>> 
>>   cookie a0
>>   cookie a01122
>>   cookie a01122334455
>>   cookie a01122334455aabbccddeeff
>> 
>
>Your suggestion is more sensible for a user space cli tool like tc.
>I will add a uchar cku8[16] field and make changes to iproute2.
>
>> > struct tc_action_ops;
>> > 
>> > +union act_cookie {
>> > +  u16 ck16[8];
>> > +  u32 ck32[4];
>> > +  u64 ck64[2];
>> 
>> Since this should be never interpreted by kernel, I don't understand why
>> this union is needed. Why just don't pass a char array?
>> 
>
>programmatic usability.

I don't see why. In userspace you can map whatever struct you need to the
mem with chararray. It's totally up to you as an app developer. There is
no need to make that part of kernel api. Really.


>
>> Also, whatever format this is, could we make is shared with cls cookie?
>> 
>> 
>
>The structure could be shared (and because it is in pkt_cls.h
>that makes it easier). But the TLVs are domain specific. We need another
>one for classifiers.
>
>> > +};
>> > +
>> > struct tc_action {
>> >const struct tc_action_ops  *ops;
>> >__u32   type; /* for backward 
>> > compat(TCA_OLD_COMPAT) */
>> > @@ -41,6 +47,7 @@ struct tc_action {
>> >struct rcu_head tcfa_rcu;
>> >struct gnet_stats_basic_cpu __percpu *cpu_bstats;
>> >struct gnet_stats_queue __percpu *cpu_qstats;
>> > +  union act_cookie*ck;
>> > };
>> > #define tcf_head   common.tcfa_head
>> > #define tcf_index  common.tcfa_index
>> > diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> > index 1e5e1dd..6379af3 100644
>> > --- a/include/uapi/linux/pkt_cls.h
>> > +++ b/include/uapi/linux/pkt_cls.h
>> > @@ -4,6 +4,12 @@
>> > #include 
>> > #include 
>> > 
>> > +union u_act_cookie {
>> > +  __u16 ck16[8];
>> > +  __u32 ck32[4];
>> > +  __u64 ck64[2];
>> > +};
>> 
>> Again, the same struct? I don't understand why twice.
>
>Just old habits.
>user vs kernel api? Standard action approach one says
>__u32 other says u32; hanging off the user variant to kernel
>didnt feel right.

Just have it in uapi and use it from within kernel. But did you see what
I suggested in the other thread (regarding IFLA_PHYS_PORT_ID and
IFLA_PHYS_SWITCH_ID)? If you do it what way, you don't need no struct.

Re: [PATCH net-next 1/1] net sched actions: Add support for user cookies

2017-01-14 Thread Jamal Hadi Salim


On 17-01-14 10:22 AM, Jiri Pirko wrote:


.. create an accept action with cookie 0xA:0xa0a0a0a0a0a0a0
sudo $TC actions add action ok index 1 cookie 0xA 0xa0a0a0a0a0a0a0


2x 64bit values? Why can't this have variable length, according to what
user needs:



You can intepret it however you wish. It is 128 bits. You can make it
2x64, 4x32, 8x16, 16x8



sudo $TC actions add action ok index 1 cookie a0
sudo $TC actions add action ok index 1 cookie a01122
sudo $TC actions add action ok index 1 cookie a01122334455
sudo $TC actions add action ok index 1 cookie a01122334455aabbccddeeff



Sure you can do that too..
I will add add 16 8b fields to the union.




.. dump all gact actions..
sudo $TC -s actions ls action gact

action order 0: gact action pass
 random type none pass val 0
 index 1 ref 2 bind 1 installed 1221 sec used 27 sec
Action statistics:
Sent 373248 bytes 5056 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
 cookie(000a::a0a0a0a0:00a0a0a0)


Input is 2x64 and dump is 4x32? That is confusing. With my suggested
example, this would be:

 cookie a0
 cookie a01122
 cookie a01122334455
 cookie a01122334455aabbccddeeff



Your suggestion is more sensible for a user space cli tool like tc.
I will add a uchar cku8[16] field and make changes to iproute2.


struct tc_action_ops;

+union act_cookie {
+   u16 ck16[8];
+   u32 ck32[4];
+   u64 ck64[2];


Since this should be never interpreted by kernel, I don't understand why
this union is needed. Why just don't pass a char array?



programmatic usability.


Also, whatever format this is, could we make is shared with cls cookie?




The structure could be shared (and because it is in pkt_cls.h
that makes it easier). But the TLVs are domain specific. We need another
one for classifiers.


+};
+
struct tc_action {
const struct tc_action_ops  *ops;
__u32   type; /* for backward 
compat(TCA_OLD_COMPAT) */
@@ -41,6 +47,7 @@ struct tc_action {
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
+   union act_cookie*ck;
};
#define tcf_headcommon.tcfa_head
#define tcf_index   common.tcfa_index
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 1e5e1dd..6379af3 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -4,6 +4,12 @@
#include 
#include 

+union u_act_cookie {
+   __u16 ck16[8];
+   __u32 ck32[4];
+   __u64 ck64[2];
+};


Again, the same struct? I don't understand why twice.


Just old habits.
user vs kernel api? Standard action approach one says
__u32 other says u32; hanging off the user variant to kernel
didnt feel right.

cheers,
jamal

Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-14 Thread Jiri Pirko

Sat, Jan 14, 2017 at 04:03:17PM CET, j...@mojatatu.com wrote:
>On 17-01-14 09:48 AM, Jiri Pirko wrote:
>> Sat, Jan 14, 2017 at 01:56:35PM CET, j...@mojatatu.com wrote:
>
>
>> > I think the feature makes a lot of sense (as is the action variant).
>> > But can we make it:
>> > a) fixed size
>> 
>> Can you elaborate on why is this needed?
>> 
>
>My experience with the action bits its easier to debug
>and enforces some discipline to not abuse the amount of RAM used.
>If you have 1M rules, one extra 128M is easier on the system than
>a few Gigs.

Fair. So could this be done like IFLA_PHYS_SWITCH_ID and
IFLA_PHYS_PORT_ID. They can have variable size, max is MAX_PHYS_ITEM_ID_LEN

We can let user to pass arbitrary len up to 16 bytes. This has benefit in
fact that if in future this needs to be extended to say 32 bytes, it is
backward compatible. We just change the check in kernel.


>
>> 
>> > b) apply to all classifiers
>> > c) please post a usage example via iproute2/tc
>> > 
>> > I am going to post the action variant in the next while - will do some more
>> > testing first.
>> 
>> I believe we have to make the cls and ats cookies exactly the same.
>> 
>
>Probably - they are both needed. See the patch I just posted.
>
>cheers,
>jamal
>

Re: [PATCH net-next 1/1] net sched actions: Add support for user cookies

2017-01-14 Thread Jiri Pirko

Sat, Jan 14, 2017 at 03:59:18PM CET, j...@mojatatu.com wrote:
>From: Jamal Hadi Salim 
>
>Introduce optional 128-bit action cookie.
>Like all other cookie schemes in the networking world (eg in protocols
>like http or existing kernel fib protocol field, etc) the idea is to save
>user state that when retrieved serves as a correlator. The kernel
>_should not_ intepret it.  The user can store whatever they wish in the
>128 bits.
>
>Sample exercise(using two 64bit values to represent the 128 bits):
>
>.. create an accept action with cookie 0xA:0xa0a0a0a0a0a0a0
>sudo $TC actions add action ok index 1 cookie 0xA 0xa0a0a0a0a0a0a0

2x 64bit values? Why can't this have variable length, according to what
user needs:

sudo $TC actions add action ok index 1 cookie a0
sudo $TC actions add action ok index 1 cookie a01122
sudo $TC actions add action ok index 1 cookie a01122334455
sudo $TC actions add action ok index 1 cookie a01122334455aabbccddeeff


>
>.. dump all gact actions..
>sudo $TC -s actions ls action gact
>
>   action order 0: gact action pass
>random type none pass val 0
>index 1 ref 2 bind 1 installed 1221 sec used 27 sec
>   Action statistics:
>   Sent 373248 bytes 5056 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
>cookie(000a::a0a0a0a0:00a0a0a0)

Input is 2x64 and dump is 4x32? That is confusing. With my suggested
example, this would be:

 cookie a0
 cookie a01122
 cookie a01122334455
 cookie a01122334455aabbccddeeff


>
>.. bind the accept action to a filter..
>sudo $TC filter add dev lo parent : protocol ip prio 1 \
>u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1
>
>... send some traffic..
>$ ping 127.0.0.1 -c 3
>PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
>64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
>64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
>
>--- 127.0.0.1 ping statistics ---
>3 packets transmitted, 3 received, 0% packet loss, time 2109ms
>rtt min/avg/max/mdev = 0.020/0.028/0.038/0.008 ms 1
>
>... show some stats
>$ sudo $TC -s actions get  action gact index 1
>
>   action order 1: gact action pass
>random type none pass val 0
>index 1 ref 3 bind 1 installed 2182 sec used 1 sec
>   Action statistics:
>   Sent 700344 bytes 9486 pkt (dropped 0, overlimits 0 requeues 0)
>   backlog 0b 0p requeues 0
>
>Signed-off-by: Jamal Hadi Salim 
>---
> include/net/act_api.h|  7 +++
> include/uapi/linux/pkt_cls.h |  7 +++
> net/sched/act_api.c  | 27 +--
> 3 files changed, 39 insertions(+), 2 deletions(-)
>
>diff --git a/include/net/act_api.h b/include/net/act_api.h
>index 1d71644..b948db9 100644
>--- a/include/net/act_api.h
>+++ b/include/net/act_api.h
>@@ -20,6 +20,12 @@ struct tcf_hashinfo {
> 
> struct tc_action_ops;
> 
>+union act_cookie {
>+  u16 ck16[8];
>+  u32 ck32[4];
>+  u64 ck64[2];

Since this should be never interpreted by kernel, I don't understand why
this union is needed. Why just don't pass a char array?

Also, whatever format this is, could we make is shared with cls cookie?


>+};
>+
> struct tc_action {
>   const struct tc_action_ops  *ops;
>   __u32   type; /* for backward 
> compat(TCA_OLD_COMPAT) */
>@@ -41,6 +47,7 @@ struct tc_action {
>   struct rcu_head tcfa_rcu;
>   struct gnet_stats_basic_cpu __percpu *cpu_bstats;
>   struct gnet_stats_queue __percpu *cpu_qstats;
>+  union act_cookie*ck;
> };
> #define tcf_head  common.tcfa_head
> #define tcf_index common.tcfa_index
>diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>index 1e5e1dd..6379af3 100644
>--- a/include/uapi/linux/pkt_cls.h
>+++ b/include/uapi/linux/pkt_cls.h
>@@ -4,6 +4,12 @@
> #include 
> #include 
> 
>+union u_act_cookie {
>+  __u16 ck16[8];
>+  __u32 ck32[4];
>+  __u64 ck64[2];
>+};

Again, the same struct? I don't understand why twice.


>+
> /* Action attributes */
> enum {
>   TCA_ACT_UNSPEC,
>@@ -12,6 +18,7 @@ enum {
>   TCA_ACT_INDEX,
>   TCA_ACT_STATS,
>   TCA_ACT_PAD,
>+  TCA_ACT_COOKIE,
>   __TCA_ACT_MAX
> };
> 
>diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>index f04715a..85e77181 100644
>--- a/net/sched/act_api.c
>+++ b/net/sched/act_api.c
>@@ -33,6 +33,7 @@ static void free_tcf(struct rcu_head *head)
> 
>   free_percpu(p->cpu_bstats);
>   free_percpu(p->cpu_qstats);
>+  kfree(p->ck);
>   kfree(p);
> }
> 
>@@ -464,8 +465,8 @@ int tcf_action_destroy(struct list_head *actions, int bind)
>   return a->ops->dump(skb, a, bind, ref);
> }
> 
>-int
>-tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
>+int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind,
>+

Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-14 Thread Jamal Hadi Salim


On 17-01-14 09:48 AM, Jiri Pirko wrote:

Sat, Jan 14, 2017 at 01:56:35PM CET, j...@mojatatu.com wrote:




I think the feature makes a lot of sense (as is the action variant).
But can we make it:
a) fixed size


Can you elaborate on why is this needed?



My experience with the action bits its easier to debug
and enforces some discipline to not abuse the amount of RAM used.
If you have 1M rules, one extra 128M is easier on the system than
a few Gigs.




b) apply to all classifiers
c) please post a usage example via iproute2/tc

I am going to post the action variant in the next while - will do some more
testing first.


I believe we have to make the cls and ats cookies exactly the same.



Probably - they are both needed. See the patch I just posted.

cheers,
jamal

[PATCH net-next 1/1] net sched actions: Add support for user cookies

2017-01-14 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (eg in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ intepret it.  The user can store whatever they wish in the
128 bits.

Sample exercise(using two 64bit values to represent the 128 bits):

.. create an accept action with cookie 0xA:0xa0a0a0a0a0a0a0
sudo $TC actions add action ok index 1 cookie 0xA 0xa0a0a0a0a0a0a0

.. dump all gact actions..
sudo $TC -s actions ls action gact

action order 0: gact action pass
 random type none pass val 0
 index 1 ref 2 bind 1 installed 1221 sec used 27 sec
Action statistics:
Sent 373248 bytes 5056 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
 cookie(000a::a0a0a0a0:00a0a0a0)

.. bind the accept action to a filter..
sudo $TC filter add dev lo parent : protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1

... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms

--- 127.0.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2109ms
rtt min/avg/max/mdev = 0.020/0.028/0.038/0.008 ms 1

... show some stats
$ sudo $TC -s actions get  action gact index 1

action order 1: gact action pass
 random type none pass val 0
 index 1 ref 3 bind 1 installed 2182 sec used 1 sec
Action statistics:
Sent 700344 bytes 9486 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

Signed-off-by: Jamal Hadi Salim 
---
 include/net/act_api.h|  7 +++
 include/uapi/linux/pkt_cls.h |  7 +++
 net/sched/act_api.c  | 27 +--
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 1d71644..b948db9 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -20,6 +20,12 @@ struct tcf_hashinfo {
 
 struct tc_action_ops;
 
+union act_cookie {
+   u16 ck16[8];
+   u32 ck32[4];
+   u64 ck64[2];
+};
+
 struct tc_action {
const struct tc_action_ops  *ops;
__u32   type; /* for backward 
compat(TCA_OLD_COMPAT) */
@@ -41,6 +47,7 @@ struct tc_action {
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
+   union act_cookie*ck;
 };
 #define tcf_head   common.tcfa_head
 #define tcf_index  common.tcfa_index
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 1e5e1dd..6379af3 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -4,6 +4,12 @@
 #include 
 #include 
 
+union u_act_cookie {
+   __u16 ck16[8];
+   __u32 ck32[4];
+   __u64 ck64[2];
+};
+
 /* Action attributes */
 enum {
TCA_ACT_UNSPEC,
@@ -12,6 +18,7 @@ enum {
TCA_ACT_INDEX,
TCA_ACT_STATS,
TCA_ACT_PAD,
+   TCA_ACT_COOKIE,
__TCA_ACT_MAX
 };
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f04715a..85e77181 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -33,6 +33,7 @@ static void free_tcf(struct rcu_head *head)
 
free_percpu(p->cpu_bstats);
free_percpu(p->cpu_qstats);
+   kfree(p->ck);
kfree(p);
 }
 
@@ -464,8 +465,8 @@ int tcf_action_destroy(struct list_head *actions, int bind)
return a->ops->dump(skb, a, bind, ref);
 }
 
-int
-tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
+int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind,
+ int ref)
 {
int err = -EINVAL;
unsigned char *b = skb_tail_pointer(skb);
@@ -475,6 +476,12 @@ int tcf_action_destroy(struct list_head *actions, int bind)
goto nla_put_failure;
if (tcf_action_copy_stats(skb, a, 0))
goto nla_put_failure;
+   if (a->ck) {
+   if (nla_put(skb, TCA_ACT_COOKIE, sizeof(union act_cookie),
+   a->ck))
+   goto nla_put_failure;
+   }
+
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
goto nla_put_failure;
@@ -575,6 +582,22 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct nlattr *nla,
if (err < 0)
goto err_mod;
 
+   if (tb[TCA_ACT_COOKIE]) {
+   if (nla_len(tb[TCA_ACT_COOKIE]) != sizeof(union act_cookie)) {
+   err = -EINVAL;
+   goto err_mod;

Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-14 Thread Jiri Pirko

Sat, Jan 14, 2017 at 01:56:35PM CET, j...@mojatatu.com wrote:
>On 17-01-09 01:23 PM, John Fastabend wrote:
>> On 17-01-08 09:19 AM, Jiri Pirko wrote:
>
>[..]
>> > This should never be interpreted by kernel. I think this would be good
>> > to make clear in the comment in the code.
>> > 
>> 
>> Ah OK I had assumed you would be pushing this via tc_cls_flower_offload into
>> the driver in a follow up patch. But if it lives in kernel space as opaque
>> cookie guess its no different then other id fields order/prio/cookie.
>> 
>> Thanks for clarifying.
>
>
>I think the feature makes a lot of sense (as is the action variant).
>But can we make it:
>a) fixed size

Can you elaborate on why is this needed?


>b) apply to all classifiers
>c) please post a usage example via iproute2/tc
>
>I am going to post the action variant in the next while - will do some more
>testing first.

I believe we have to make the cls and ats cookies exactly the same.

>
>cheers,
>jamal

Re: [PATCH] can: Fix kernel panic at security_sock_rcv_skb

2017-01-14 Thread Oliver Hartkopp


Hello Eric,

On 01/14/2017 04:43 AM, Liu Shuo wrote:

On Thu 12.Jan'17 at 17:33:38 +0100, Oliver Hartkopp wrote:

On 01/12/2017 02:01 PM, Eric Dumazet wrote:



The main problem seems that the sockets themselves are not RCU
protected.

If CAN uses RCU for delivery, then sockets should be freed only after
one RCU grace period.

On recent kernels, following patch could help :



Thanks Eric!

@Liu ShuoX: Can you check if Eric's suggestion fixes the issue in your
setup?

Sorry for late reply. I was OOO yesterday.
With Eric's hint, i just found his patch that "net: add SOCK_RCU_FREE
socket flag" in the latest kernel. With backporting this one plus Eric's
following patch, it fixs my failure.


what would be the best approach to fix this issue - even in stable kernels?

E.g. would this change be ok for a stable as a quick fix?

diff --git a/net/can/af_can.c b/net/can/af_can.c
index 1108079d934f..6b974c2b66ef 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -112,6 +112,7 @@ EXPORT_SYMBOL(can_ioctl);

 static void can_sock_destruct(struct sock *sk)
 {
+   synchronize_rcu();
skb_queue_purge(>sk_receive_queue);
 }

And once this arrived in the mainline tree your suggested patch could be 
applied?


In any case we should not forget to give Reported-by credits to Liu.

Best regards,
Oliver

Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-14 Thread Jamal Hadi Salim


On 17-01-03 07:22 AM, Paul Blakey wrote:


I don't mind having it on TC level but I didn't want to intervene with
all filters/TC.



Please do make it for all classifiers. I described a use case for u32
where the cookie could be used to pretty print the output on a
dump/get.

cheers,
jamal

Re: [PATCH net-next] net/sched: cls_flower: Add user specified data

2017-01-14 Thread Jamal Hadi Salim


On 17-01-09 01:23 PM, John Fastabend wrote:

On 17-01-08 09:19 AM, Jiri Pirko wrote:


[..]

This should never be interpreted by kernel. I think this would be good
to make clear in the comment in the code.



Ah OK I had assumed you would be pushing this via tc_cls_flower_offload into
the driver in a follow up patch. But if it lives in kernel space as opaque
cookie guess its no different then other id fields order/prio/cookie.

Thanks for clarifying.



I think the feature makes a lot of sense (as is the action variant).
But can we make it:
a) fixed size
b) apply to all classifiers
c) please post a usage example via iproute2/tc

I am going to post the action variant in the next while - will do some 
more testing first.


cheers,
jamal

Re: [PATCH] cxgb4: Remove redundant memset before memcpy

2017-01-14 Thread Tobias Klauser

On 2017-01-14 at 02:17:40 +0100, Shyam Saini  wrote:
> The region set by the call to memset, immediately overwritten by
> the subsequent call to memcpy and thus makes the  memset redundant.
> 
> Also remove the memset((, 0, sizeof(info)) on line 398 because
> info is memcpy()'ed to before being used in the loop and it isn't
> used outside of the loop.
> 
> Signed-off-by: Shyam Saini 

Reviewed-by: Tobias Klauser

[PATCH v3 2/2] bpf: Add tests for the lpm trie map

2017-01-14 Thread Daniel Mack

From: David Herrmann 

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, single linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann 
Signed-off-by: Daniel Mack 
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 3 files changed, 361 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..dd83f0b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,358 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0x00 },

[PATCH v3 0/2] bpf: add longest prefix match map

2017-01-14 Thread Daniel Mack

This patch set adds a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges. It is exposed as a
bpf map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, and 2/2 an extensive test suite.

Feedback is much appreciated.
 
 
Thanks,
Daniel

Changelog:

v2 -> v3:
* Store both the key match data and the caller provided
  value in the same byte array attached to a node. This
  avoids double allocations
* Bring back node->flags to distinguish between 'real'
  and intermediate nodes
* Fix comment style and some typos

v1 -> v2:
* Turn spin lock into raw spinlock
* Lock with irqsave options during trie_update_elem()
* Return -ENOMEM properly from trie_alloc()
* Force attr->flags == BPF_F_NO_PREALLOC during creation
* Set trie->map.pages after creation to account for map memory
* Allow arbitrary value sizes
* Removed node->flags and denode intermediate nodes through
  node->value == NULL instead

rfc -> v1:
* Add __rcu pointer annotations to make sparse happy
* Fold _lpm_trie_find_target_node() into its only caller
* Fix some minor documentation issues


Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 493 +
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 6 files changed, 862 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3

[PATCH v3 1/2] bpf: add a longest prefix match trie map implementation

2017-01-14 Thread Daniel Mack

This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack 
Reviewed-by: David Herrmann 
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 493 +++
 3 files changed, 501 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..1c1ad27
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,493 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016,2017 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   raw_spinlock_t  lock;
+};
+
+/* This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be places as root
+ * node, denoted as (R) in the example below. As there are no other node, both
+ * child pointers are %NULL.
+ *
+ *  ++
+ *  |   (1)  (R) |
+ *  | 192.168.0.0/16 |
+ *  |value: 1|
+ *  |   [0][1]   |
+ *  ++
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (1). In child

Re: [PATCH net-next] net/mlx5e: Support bpf_xdp_adjust_head()

2017-01-14 Thread Saeed Mahameed

On Sat, Jan 14, 2017 at 12:31 AM, Martin KaFai Lau  wrote:
> On Fri, Jan 13, 2017 at 03:58:46PM +0200, Saeed Mahameed wrote:
>> >> > @@ -680,7 +687,7 @@ static inline void mlx5e_xmit_xdp_frame(struct 
>> >> > mlx5e_rq *rq,
>> >> > memset(wqe, 0, sizeof(*wqe));
>> >> >
>> >> > /* copy the inline part */
>> >> > -   memcpy(eseg->inline_hdr_start, data, MLX5E_XDP_MIN_INLINE);
>> >> > +   memcpy(eseg->inline_hdr_start, xdp->data, MLX5E_XDP_MIN_INLINE);
>> >> > eseg->inline_hdr_sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
>> >> >
>> >> > dseg = (struct mlx5_wqe_data_seg *)cseg + 
>> >> > (MLX5E_XDP_TX_DS_COUNT - 1);
>> >> > @@ -706,22 +713,16 @@ static inline void mlx5e_xmit_xdp_frame(struct 
>> >> > mlx5e_rq *rq,
>> >> >  static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>> >> > const struct bpf_prog *prog,
>> >> > struct mlx5e_dma_info *di,
>> >> > -   void *data, u16 len)
>> >> > +   struct xdp_buff *xdp)
>> >> >  {
>> >> > -   struct xdp_buff xdp;
>> >> > u32 act;
>> >> >
>> >> > -   if (!prog)
>> >> > -   return false;
>> >> > -
>> >> > -   xdp.data = data;
>> >> > -   xdp.data_end = xdp.data + len;
>> >> > -   act = bpf_prog_run_xdp(prog, );
>> >> > +   act = bpf_prog_run_xdp(prog, xdp);
>> >> > switch (act) {
>> >> > case XDP_PASS:
>> >> > return false;
>> >> > case XDP_TX:
>> >> > -   mlx5e_xmit_xdp_frame(rq, di, MLX5_RX_HEADROOM, len);
>> >> > +   mlx5e_xmit_xdp_frame(rq, di, xdp);
>> >> > return true;
>> >> > default:
>> >> > bpf_warn_invalid_xdp_action(act);
>> >> > @@ -737,18 +738,19 @@ static inline
>> >> >  struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 
>> >> > *cqe,
>> >> >  u16 wqe_counter, u32 cqe_bcnt)
>> >> >  {
>> >> > +   const struct bpf_prog *xdp_prog;
>> >> > struct mlx5e_dma_info *di;
>> >> > struct sk_buff *skb;
>> >> > void *va, *data;
>> >> > -   bool consumed;
>> >> > +   u16 rx_headroom = rq->rx_headroom;
>> >> >
>> >> > di = >dma_info[wqe_counter];
>> >> > va = page_address(di->page);
>> >> > -   data   = va + MLX5_RX_HEADROOM;
>> >> > +   data   = va + rx_headroom;
>> >> >
>> >> > dma_sync_single_range_for_cpu(rq->pdev,
>> >> >   di->addr,
>> >> > - MLX5_RX_HEADROOM,
>> >> > + rx_headroom,
>> >> >   rq->buff.wqe_sz,
>> >> >   DMA_FROM_DEVICE);
>> >> > prefetch(data);
>> >> > @@ -760,11 +762,26 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, 
>> >> > struct mlx5_cqe64 *cqe,
>> >> > }
>> >> >
>> >> > rcu_read_lock();
>> >> > -   consumed = mlx5e_xdp_handle(rq, READ_ONCE(rq->xdp_prog), di, 
>> >> > data,
>> >> > -   cqe_bcnt);
>> >> > +   xdp_prog = READ_ONCE(rq->xdp_prog);
>> >> > +   if (xdp_prog) {
>> >> > +   struct xdp_buff xdp;
>> >> > +   bool consumed;
>> >> > +
>> >> > +   xdp.data = data;
>> >> > +   xdp.data_end = xdp.data + cqe_bcnt;
>> >> > +   xdp.data_hard_start = va;
>> >> > +
>> >> > +   consumed = mlx5e_xdp_handle(rq, xdp_prog, di, );
>> >> > +
>> >> > +   if (consumed) {
>> >> > +   rcu_read_unlock();
>> >> > +   return NULL; /* page/packet was consumed by XDP 
>> >> > */
>> >> > +   }
>> >> > +
>> >> > +   rx_headroom = xdp.data - xdp.data_hard_start;
>> >> > +   cqe_bcnt = xdp.data_end - xdp.data;
>> >> > +   }
>> >>
>> >> This whole new logic belongs to mlx5e_xdp_handle, I would like to keep
>> >> xdp related code in one place.
>> >>
>> >> move the xdp_buff initialization back to there and keep the xdp_prog
>> >> check in mlx5e_xdp_handle;
>> >> +  xdp_prog = READ_ONCE(rq->xdp_prog);
>> >> +   if (!xdp_prog)
>> >> +return false
>> >>
>> >> you can remove "const struct bpf_prog *prog" parameter from
>> >> mlx5e_xdp_handle and take it directly from rq.
>> >>
>> >> if you need va for xdp_buff you can pass it as a paramter to
>> >> mlx5e_xdp_handle  as well:
>> >> mlx5e_xdp_handle(rq, di, va, data, cqe_bcnt);
>> >> Make sense ?
>> > I moved them because xdp.data could be adjusted which then
>> > rx_headroom and cqe_bcnt have to be adjusted accordingly
>> > in skb_from_cqe() also.
>> >
>> > I understand your point.  After another quick thought,
>> > the adjusted xdp.data is the only one that we want in skb_from_cqe().
>> > I will try to

[PATCH] net: marvell: skge: use new api ethtool_{get|set}_link_ksettings

2017-01-14 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

The callback set_link_ksettings no longer update the value
of advertising, as the struct ethtool_link_ksettings is
defined as const.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/marvell/skge.c |   63 --
 1 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index 9146a51..81106b7 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -300,65 +300,76 @@ static u32 skge_supported_modes(const struct skge_hw *hw)
return supported;
 }
 
-static int skge_get_settings(struct net_device *dev,
-struct ethtool_cmd *ecmd)
+static int skge_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
struct skge_port *skge = netdev_priv(dev);
struct skge_hw *hw = skge->hw;
+   u32 supported, advertising;
 
-   ecmd->transceiver = XCVR_INTERNAL;
-   ecmd->supported = skge_supported_modes(hw);
+   supported = skge_supported_modes(hw);
 
if (hw->copper) {
-   ecmd->port = PORT_TP;
-   ecmd->phy_address = hw->phy_addr;
+   cmd->base.port = PORT_TP;
+   cmd->base.phy_address = hw->phy_addr;
} else
-   ecmd->port = PORT_FIBRE;
+   cmd->base.port = PORT_FIBRE;
+
+   advertising = skge->advertising;
+   cmd->base.autoneg = skge->autoneg;
+   cmd->base.speed = skge->speed;
+   cmd->base.duplex = skge->duplex;
+
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+   supported);
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+   advertising);
 
-   ecmd->advertising = skge->advertising;
-   ecmd->autoneg = skge->autoneg;
-   ethtool_cmd_speed_set(ecmd, skge->speed);
-   ecmd->duplex = skge->duplex;
return 0;
 }
 
-static int skge_set_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
+static int skge_set_link_ksettings(struct net_device *dev,
+  const struct ethtool_link_ksettings *cmd)
 {
struct skge_port *skge = netdev_priv(dev);
const struct skge_hw *hw = skge->hw;
u32 supported = skge_supported_modes(hw);
int err = 0;
+   u32 advertising;
+
+   ethtool_convert_link_mode_to_legacy_u32(,
+   cmd->link_modes.advertising);
 
-   if (ecmd->autoneg == AUTONEG_ENABLE) {
-   ecmd->advertising = supported;
+   if (cmd->base.autoneg == AUTONEG_ENABLE) {
+   advertising = supported;
skge->duplex = -1;
skge->speed = -1;
} else {
u32 setting;
-   u32 speed = ethtool_cmd_speed(ecmd);
+   u32 speed = cmd->base.speed;
 
switch (speed) {
case SPEED_1000:
-   if (ecmd->duplex == DUPLEX_FULL)
+   if (cmd->base.duplex == DUPLEX_FULL)
setting = SUPPORTED_1000baseT_Full;
-   else if (ecmd->duplex == DUPLEX_HALF)
+   else if (cmd->base.duplex == DUPLEX_HALF)
setting = SUPPORTED_1000baseT_Half;
else
return -EINVAL;
break;
case SPEED_100:
-   if (ecmd->duplex == DUPLEX_FULL)
+   if (cmd->base.duplex == DUPLEX_FULL)
setting = SUPPORTED_100baseT_Full;
-   else if (ecmd->duplex == DUPLEX_HALF)
+   else if (cmd->base.duplex == DUPLEX_HALF)
setting = SUPPORTED_100baseT_Half;
else
return -EINVAL;
break;
 
case SPEED_10:
-   if (ecmd->duplex == DUPLEX_FULL)
+   if (cmd->base.duplex == DUPLEX_FULL)
setting = SUPPORTED_10baseT_Full;
-   else if (ecmd->duplex == DUPLEX_HALF)
+   else if (cmd->base.duplex == DUPLEX_HALF)
setting = SUPPORTED_10baseT_Half;
else
return -EINVAL;
@@ -371,11 +382,11 @@ static int skge_set_settings(struct net_device *dev, 
struct ethtool_cmd *ecmd)
return -EINVAL;
 
skge->speed = speed;
-   skge->duplex = ecmd->duplex;
+   skge->duplex = cmd->base.duplex;

Re: [Patch net] atm: remove an unnecessary loop

2017-01-14 Thread Chas Williams

On Fri, 2017-01-13 at 16:30 -0800, Cong Wang wrote:
> On Fri, Jan 13, 2017 at 3:54 PM, Chas Williams <3ch...@gmail.com> wrote:
> > On Fri, 2017-01-13 at 10:20 -0800, Cong Wang wrote:
> >> On Fri, Jan 13, 2017 at 9:10 AM, David Miller  wrote:
> >> > From: Francois Romieu 
> >> > Date: Fri, 13 Jan 2017 01:07:00 +0100
> >> >
> >> >> Were alloc_skb moved one level up in the call stack, there would be
> >> >> no need to use the new wait api in the subsequent page, thus easing
> >> >> pre 3.19 longterm kernel maintenance (at least those on korg page).
> >> >>
> >> >> But it tastes a tad bit too masochistic.
> >> >
> >> > Lack of error handling of allocation failure is always a huge red
> >> > flag.  We even long ago tried to do something like this for TCP FIN
> >> > handling.
> >> >
> >> > It's dumb, it doesn't work.
> >> >
> >> > Therefore I agree that the correct fix is to move the SKB allocation
> >> > up one level to vcc_sendmsg() and make it handle errors properly.
> >>
> >> If you can justify API is not broken by doing that, I am more than happy
> >> to do it, as I already stated in the latter patch:
> >
> > The man page for sendmsg() allows for ENOMEM.  See below.
> >
> 
> Errno is just one part, you miss the behavior behind the logic.
> 
> >>
> >> "Of course, the logic itself is suspicious, other sendmsg()
> >> could handle skb allocation failure very well, not sure
> >> why ATM has to wait for a successful one here. But probably
> >> it is too late to change since the errno and behavior is
> >> visible to user-space. So just leave the logic as it is."
> >>
> >> For some reason, no one reads that patch. :-/
> >
> > I read it and I agree.  I think it should be moved up/conflated with
> > vcc_sendmsg().  vcc_sendmsg() can already return an errno for other
> > conditions so if so has written something where they are explicitly
> > not expecting a ENOMEM, we really can't help them.
> 
> Nope, the reason is never ENOMEM is expected or not. The current
> _behavior_ behind this logic might be relied on by user-space.
> The behavior here is, when allocation fails, kernel will retry under
> certain circumstances, for example, if any fatal signal pending,
> returns ERESTARTSYS, etc.. This is what I worry, not just ENOMEM
> or not, which is too obvious.

Yes, and that behavior is certainly wrong.  Proving that nothing relies
on it would be very difficult since this is a negative supposition.

It's not clear to me that it is a good idea to ignore the pending signal
and just send the data.  At best, this seems like the signal is getting
ignored when the program might actaully want to do something about it.
The way the vcc sockets work, you are almost always waiting for "space
available to send".  Since vcc_sendmsg() has always been able to return
ERESTARTSYS for this condition, this isn't exactly new behavior, it
could just happen (very) slightly more often.

The loop in alloc_tx() also ignores the MSG_DONTWAIT flag.  The user
might end up waiting after all.  So that seems broken as well.  If
someone is expecting to wait with MSG_DONTWAIT when memory pressure
is present, I can't help them.  They are insane.

> Of course, I could be too conservative, I'd rather not to break things
> for -stable at least.
> 
> Thanks.

[PATCH] net: korina: use new api ethtool_{get|set}_link_ksettings

2017-01-14 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/korina.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/korina.c b/drivers/net/ethernet/korina.c
index 8037426..3e415b8 100644
--- a/drivers/net/ethernet/korina.c
+++ b/drivers/net/ethernet/korina.c
@@ -695,25 +695,27 @@ static void netdev_get_drvinfo(struct net_device *dev,
strlcpy(info->bus_info, lp->dev->name, sizeof(info->bus_info));
 }
 
-static int netdev_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct korina_private *lp = netdev_priv(dev);
int rc;
 
spin_lock_irq(>lock);
-   rc = mii_ethtool_gset(>mii_if, cmd);
+   rc = mii_ethtool_get_link_ksettings(>mii_if, cmd);
spin_unlock_irq(>lock);
 
return rc;
 }
 
-static int netdev_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct korina_private *lp = netdev_priv(dev);
int rc;
 
spin_lock_irq(>lock);
-   rc = mii_ethtool_sset(>mii_if, cmd);
+   rc = mii_ethtool_set_link_ksettings(>mii_if, cmd);
spin_unlock_irq(>lock);
korina_set_carrier(>mii_if);
 
@@ -729,9 +731,9 @@ static u32 netdev_get_link(struct net_device *dev)
 
 static const struct ethtool_ops netdev_ethtool_ops = {
.get_drvinfo= netdev_get_drvinfo,
-   .get_settings   = netdev_get_settings,
-   .set_settings   = netdev_set_settings,
.get_link   = netdev_get_link,
+   .get_link_ksettings = netdev_get_link_ksettings,
+   .set_link_ksettings = netdev_set_link_ksettings,
 };
 
 static int korina_alloc_ring(struct net_device *dev)
-- 
1.7.4.4

Re: [PATCH 5/6] treewide: use kv[mz]alloc* rather than opencoded variants

2017-01-14 Thread Leon Romanovsky

On Thu, Jan 12, 2017 at 04:37:16PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests < 64kB are basically never failing and invoke OOM killer to
> satisfy the allocation. This sounds too disruptive for something that
> has a reasonable fallback - the vmalloc. On the other hand those
> requests might fallback to vmalloc even when the memory allocator would
> succeed after several more reclaim/compaction attempts previously. There
> is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Herbert Xu 
> Cc: Anton Vorontsov 
> Cc: Colin Cross 
> Cc: Kees Cook 
> Cc: Tony Luck 
> Cc: "Rafael J. Wysocki" 
> Cc: Ben Skeggs 
> Cc: Kent Overstreet 
> Cc: Santosh Raspatur 
> Cc: Hariprasad S 
> Cc: Tariq Toukan 
> Cc: Yishai Hadas 
> Cc: Dan Williams 
> Cc: Oleg Drokin 
> Cc: Andreas Dilger 
> Cc: Boris Ostrovsky 
> Cc: David Sterba 
> Cc: "Yan, Zheng" 
> Cc: Ilya Dryomov 
> Cc: Alexander Viro 
> Cc: Alexei Starovoitov 
> Cc: Eric Dumazet 
> Cc: netdev@vger.kernel.org
> Signed-off-by: Michal Hocko 
> ---
>  arch/s390/kvm/kvm-s390.c   | 10 ++-
>  crypto/lzo.c   |  4 +--
>  drivers/acpi/apei/erst.c   |  8 ++---
>  drivers/char/agp/generic.c |  8 +
>  drivers/gpu/drm/nouveau/nouveau_gem.c  |  4 +--
>  drivers/md/bcache/util.h   | 12 ++--
>  drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h|  3 --
>  drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 25 ++--
>  drivers/net/ethernet/chelsio/cxgb3/l2t.c   |  2 +-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 31 
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c |  9 ++
>  drivers/net/ethernet/mellanox/mlx4/mr.c|  9 ++
>  drivers/nvdimm/dimm_devs.c |  5 +---
>  .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +--
>  drivers/xen/evtchn.c   | 14 +
>  fs/btrfs/ctree.c   |  9 ++
>  fs/btrfs/ioctl.c   |  9 ++
>  fs/btrfs/send.c| 27 ++---
>  fs/ceph/file.c |  9 ++
>  fs/select.c|  5 +---
>  fs/xattr.c | 27 ++---
>  kernel/bpf/hashtab.c   | 11 ++-
>  lib/iov_iter.c |  5 +---
>  mm/frame_vector.c  |  5 +---
>  net/ipv4/inet_hashtables.c |  6 +---
>  net/ipv4/tcp_metrics.c |  5 +---
>  net/mpls/af_mpls.c |  5 +---
>  net/netfilter/x_tables.c   | 34 
> ++
>  net/netfilter/xt_recent.c  |  5 +---
>  net/sched/sch_choke.c  |  5 +---
>  net/sched/sch_fq_codel.c   | 26 -
>  net/sched/sch_hhf.c| 33 ++---
>  net/sched/sch_netem.c  |  6 +---
>  net/sched/sch_sfq.c|  6 +---
>  security/keys/keyctl.c | 22 --
>  35 files changed, 96 insertions(+), 319 deletions(-)

Hi Michal,

I don't see mlx5_vzalloc in the changed list. Any reason why did you skip it?

 881 static inline void *mlx5_vzalloc(unsigned long size)
 882 {
 883 void *rtn;
 884
 885 rtn = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 886 if (!rtn)
 887 rtn = vzalloc(size);
 888 return rtn;
 889 }

Thanks


signature.asc
Description: PGP signature

Re: [PATCH v2 1/2] bpf: add a longest prefix match trie map implementation

2017-01-14 Thread Daniel Mack

On 01/13/2017 07:01 PM, Alexei Starovoitov wrote:
> On Thu, Jan 12, 2017 at 06:29:21PM +0100, Daniel Mack wrote:
>> This trie implements a longest prefix match algorithm that can be used
>> to match IP addresses to a stored set of ranges.
>>
>> Internally, data is stored in an unbalanced trie of nodes that has a
>> maximum height of n, where n is the prefixlen the trie was created
>> with.
>>
>> Tries may be created with prefix lengths that are multiples of 8, in
>> the range from 8 to 2048. The key used for lookup and update operations
>> is a struct bpf_lpm_trie_key, and the value is a uint64_t.
>>
>> The code carries more information about the internal implementation.
>>
>> Signed-off-by: Daniel Mack 
>> Reviewed-by: David Herrmann 
>> ---
>>  include/uapi/linux/bpf.h |   7 +
>>  kernel/bpf/Makefile  |   2 +-
>>  kernel/bpf/lpm_trie.c| 499 
>> +++
>>  3 files changed, 507 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/bpf/lpm_trie.c

...

Thanks for spotting my typos! :)

>> +static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie 
>> *trie,
>> + const void *value)
>> +{
>> +struct lpm_trie_node *node;
>> +gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
>> +
>> +node = kmalloc(sizeof(struct lpm_trie_node) + trie->data_size, gfp);
>> +if (!node)
>> +return ERR_PTR(-ENOMEM);
>> +
>> +if (value) {
>> +node->value = kmemdup(value, trie->map.value_size, gfp);
> 
> can you make value to be part of the node? similar to how hash map is done?
> that will help avoid 2nd allocation, will speedup insertion and will
> help converting this code to user pre-allocated elements.
> I suspect the concern was that for many inner nodes that value is null ?
> But in your use case the value_size will be == 0 eventually,
> so by embedding it when can save memory too, since 'value' pointer will
> be replaced with boolean present flag ?
> So potentially less memory and less cache misses?

Yes, that's a good idea. Implemented that now.

> Overall algorithm is indeed straightforward and simple which is great,
> but I would still like to see some performance numbers.

I'm not sure yet how to implement such a test in a meaningful way, tbh.
Given that the lookups have to be done one by one, I expect the syscall
overhead to be quite significant.

> Looks like
> the best case for single 32-bit element it needs 4 xors and compares
> which is fine. For mostly populate trie it's 4xors * 32 depth
> which is pretty good too, but cache misses on pointer walks may
> kill performance unless we're hitting the same path all the time.
> I think it's all acceptable due to simplicity of the implementation
> which we may improve later if it turns out to be a bottle neck for
> some use cases. We just need a baseline to have realistic expectations.

Yes, the maximum height of the trie is the number of bits in the prefix,
so for n bits, the iteration would at most take n steps to finish. For
each step, an xor and compare for n/8 bytes are needed.

As you say, the implementation could be improved under the hood if
someone spots a bottleneck somewhere.

I'll post a v3 with your comments addressed for further discussion.


Thanks,
Daniel

[net-next 3/3] tipc: reduce risk of user starvation during link congestion

2017-01-14 Thread Jon Maloy

The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.

This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.

In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.

The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.

Signed-off-by: Parthasarathy Bhuvaragan 
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/bcast.c  |   6 +-
 net/tipc/link.c   |  75 +---
 net/tipc/msg.h|   2 -
 net/tipc/node.c   |  15 +--
 net/tipc/socket.c | 347 --
 5 files changed, 194 insertions(+), 251 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index aa1babb..c35fad3 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -174,7 +174,7 @@ static void tipc_bcbase_xmit(struct net *net, struct 
sk_buff_head *xmitq)
  *and to identified node local sockets
  * @net: the applicable net namespace
  * @list: chain of buffers containing message
- * Consumes the buffer chain, except when returning -ELINKCONG
+ * Consumes the buffer chain.
  * Returns 0 if success, otherwise errno: -ELINKCONG,-EHOSTUNREACH,-EMSGSIZE
  */
 int tipc_bcast_xmit(struct net *net, struct sk_buff_head *list)
@@ -197,7 +197,7 @@ int tipc_bcast_xmit(struct net *net, struct sk_buff_head 
*list)
tipc_bcast_unlock(net);
 
/* Don't send to local node if adding to link failed */
-   if (unlikely(rc)) {
+   if (unlikely(rc && (rc != -ELINKCONG))) {
__skb_queue_purge();
return rc;
}
@@ -206,7 +206,7 @@ int tipc_bcast_xmit(struct net *net, struct sk_buff_head 
*list)
tipc_bcbase_xmit(net, );
tipc_sk_mcast_rcv(net, , );
__skb_queue_purge(list);
-   return 0;
+   return rc;
 }
 
 /* tipc_bcast_rcv - receive a broadcast packet, and deliver to rcv link
diff --git a/net/tipc/link.c b/net/tipc/link.c
index bda89bf..b758ca8 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -776,60 +776,47 @@ int tipc_link_timeout(struct tipc_link *l, struct 
sk_buff_head *xmitq)
 
 /**
  * link_schedule_user - schedule a message sender for wakeup after congestion
- * @link: congested link
- * @list: message that was attempted sent
+ * @l: congested link
+ * @hdr: header of message that is being sent
  * Create pseudo msg to send back to user when congestion abates
- * Does not consume buffer list
  */
-static int link_schedule_user(struct tipc_link *link, struct sk_buff_head 
*list)
+static int link_schedule_user(struct tipc_link *l, struct tipc_msg *hdr)
 {
-   struct tipc_msg *msg = buf_msg(skb_peek(list));
-   int imp = msg_importance(msg);
-   u32 oport = msg_origport(msg);
-   u32 addr = tipc_own_addr(link->net);
+   u32 dnode = tipc_own_addr(l->net);
+   u32 dport = msg_origport(hdr);
struct sk_buff *skb;
 
-   /* This really cannot happen...  */
-   if (unlikely(imp > TIPC_CRITICAL_IMPORTANCE)) {
-   pr_warn("%s<%s>, send queue full", link_rst_msg, link->name);
-   return -ENOBUFS;
-   }
-   /* Non-blocking sender: */
-   if (TIPC_SKB_CB(skb_peek(list))->wakeup_pending)
-   return -ELINKCONG;
-
/* Create and schedule wakeup pseudo message */
skb = tipc_msg_create(SOCK_WAKEUP, 0, INT_H_SIZE, 0,
- addr, addr, oport, 0, 0);
+ dnode, l->addr, dport, 0, 0);
if (!skb)
return -ENOBUFS;
-   TIPC_SKB_CB(skb)->chain_sz = skb_queue_len(list);
-   TIPC_SKB_CB(skb)->chain_imp =

[net-next 2/3] tipc: modify struct tipc_plist to be more versatile

2017-01-14 Thread Jon Maloy

During multicast reception we currently use a simple linked list with
push/pop semantics to store port numbers.

We now see a need for a more generic list for storing values of type
u32. We therefore make some modifications to this list, while replacing
the prefix 'tipc_plist_' with 'u32_'. We also add a couple of new
functions which will come to use in the next commits.

Acked-by: Parthasarathy Bhuvaragan 
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 100 --
 net/tipc/name_table.h |  21 ---
 net/tipc/socket.c |   8 ++--
 3 files changed, 83 insertions(+), 46 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index e190460..5a86df1 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -608,7 +608,7 @@ u32 tipc_nametbl_translate(struct net *net, u32 type, u32 
instance,
  * Returns non-zero if any off-node ports overlap
  */
 int tipc_nametbl_mc_translate(struct net *net, u32 type, u32 lower, u32 upper,
- u32 limit, struct tipc_plist *dports)
+ u32 limit, struct list_head *dports)
 {
struct name_seq *seq;
struct sub_seq *sseq;
@@ -633,7 +633,7 @@ int tipc_nametbl_mc_translate(struct net *net, u32 type, 
u32 lower, u32 upper,
info = sseq->info;
list_for_each_entry(publ, >node_list, node_list) {
if (publ->scope <= limit)
-   tipc_plist_push(dports, publ->ref);
+   u32_push(dports, publ->ref);
}
 
if (info->cluster_list_size != info->node_list_size)
@@ -1022,40 +1022,84 @@ int tipc_nl_name_table_dump(struct sk_buff *skb, struct 
netlink_callback *cb)
return skb->len;
 }
 
-void tipc_plist_push(struct tipc_plist *pl, u32 port)
+struct u32_item {
+   struct list_head list;
+   u32 value;
+};
+
+bool u32_find(struct list_head *l, u32 value)
 {
-   struct tipc_plist *nl;
+   struct u32_item *item;
 
-   if (likely(!pl->port)) {
-   pl->port = port;
-   return;
+   list_for_each_entry(item, l, list) {
+   if (item->value == value)
+   return true;
}
-   if (pl->port == port)
-   return;
-   list_for_each_entry(nl, >list, list) {
-   if (nl->port == port)
-   return;
+   return false;
+}
+
+bool u32_push(struct list_head *l, u32 value)
+{
+   struct u32_item *item;
+
+   list_for_each_entry(item, l, list) {
+   if (item->value == value)
+   return false;
+   }
+   item = kmalloc(sizeof(*item), GFP_ATOMIC);
+   if (unlikely(!item))
+   return false;
+
+   item->value = value;
+   list_add(>list, l);
+   return true;
+}
+
+u32 u32_pop(struct list_head *l)
+{
+   struct u32_item *item;
+   u32 value = 0;
+
+   if (list_empty(l))
+   return 0;
+   item = list_first_entry(l, typeof(*item), list);
+   value = item->value;
+   list_del(>list);
+   kfree(item);
+   return value;
+}
+
+bool u32_del(struct list_head *l, u32 value)
+{
+   struct u32_item *item, *tmp;
+
+   list_for_each_entry_safe(item, tmp, l, list) {
+   if (item->value != value)
+   continue;
+   list_del(>list);
+   kfree(item);
+   return true;
}
-   nl = kmalloc(sizeof(*nl), GFP_ATOMIC);
-   if (nl) {
-   nl->port = port;
-   list_add(>list, >list);
+   return false;
+}
+
+void u32_list_purge(struct list_head *l)
+{
+   struct u32_item *item, *tmp;
+
+   list_for_each_entry_safe(item, tmp, l, list) {
+   list_del(>list);
+   kfree(item);
}
 }
 
-u32 tipc_plist_pop(struct tipc_plist *pl)
+int u32_list_len(struct list_head *l)
 {
-   struct tipc_plist *nl;
-   u32 port = 0;
+   struct u32_item *item;
+   int i = 0;
 
-   if (likely(list_empty(>list))) {
-   port = pl->port;
-   pl->port = 0;
-   return port;
+   list_for_each_entry(item, l, list) {
+   i++;
}
-   nl = list_first_entry(>list, typeof(*nl), list);
-   port = nl->port;
-   list_del(>list);
-   kfree(nl);
-   return port;
+   return i;
 }
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 1524a73..c89bb3f 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -99,7 +99,7 @@ int tipc_nl_name_table_dump(struct sk_buff *skb, struct 
netlink_callback *cb);
 
 u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance, u32 *node);
 int tipc_nametbl_mc_translate(struct net *net, u32 type, u32 lower, u32 upper,
-

[net-next 1/3] tipc: unify tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() functions

2017-01-14 Thread Jon Maloy

The functions tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() are very
similar. The latter function is also called from two locations, and
there will be more in the coming commits, which will all need to test on
different conditions.

Instead of making yet another duplicates of the function, we now
introduce a new macro tipc_wait_for_cond() where the wakeup condition
can be stated as an argument to the call. This macro replaces all
current and future uses of the two functions, which can now be
eliminated.

Acked-by: Parthasarathy Bhuvaragan 
Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 108 +-
 1 file changed, 49 insertions(+), 59 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 800caaa..f27462e 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -110,7 +110,6 @@ static void tipc_write_space(struct sock *sk);
 static void tipc_sock_destruct(struct sock *sk);
 static int tipc_release(struct socket *sock);
 static int tipc_accept(struct socket *sock, struct socket *new_sock, int 
flags);
-static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p);
 static void tipc_sk_timeout(unsigned long data);
 static int tipc_sk_publish(struct tipc_sock *tsk, uint scope,
   struct tipc_name_seq const *seq);
@@ -334,6 +333,49 @@ static int tipc_set_sk_state(struct sock *sk, int state)
return res;
 }
 
+static int tipc_sk_sock_err(struct socket *sock, long *timeout)
+{
+   struct sock *sk = sock->sk;
+   int err = sock_error(sk);
+   int typ = sock->type;
+
+   if (err)
+   return err;
+   if (typ == SOCK_STREAM || typ == SOCK_SEQPACKET) {
+   if (sk->sk_state == TIPC_DISCONNECTING)
+   return -EPIPE;
+   else if (!tipc_sk_connected(sk))
+   return -ENOTCONN;
+   }
+   if (!*timeout)
+   return -EAGAIN;
+   if (signal_pending(current))
+   return sock_intr_errno(*timeout);
+
+   return 0;
+}
+
+#define tipc_wait_for_cond(sock_, timeout_, condition_)
\
+({ \
+   int rc_ = 0;\
+   int done_ = 0;  \
+   \
+   while (!(condition_) && !done_) {   \
+   struct sock *sk_ = sock->sk;\
+   DEFINE_WAIT_FUNC(wait_, woken_wake_function);   \
+   \
+   rc_ = tipc_sk_sock_err(sock_, timeout_);\
+   if (rc_)\
+   break;  \
+   prepare_to_wait(sk_sleep(sk_), _,  \
+   TASK_INTERRUPTIBLE);\
+   done_ = sk_wait_event(sk_, timeout_,\
+ (condition_), _);\
+   remove_wait_queue(sk_sleep(sk_), _);   \
+   }   \
+   rc_;\
+})
+
 /**
  * tipc_sk_create - create a TIPC socket
  * @net: network namespace (must be default network)
@@ -721,7 +763,7 @@ static int tipc_sendmcast(struct  socket *sock, struct 
tipc_name_seq *seq,
 
if (rc == -ELINKCONG) {
tsk->link_cong = 1;
-   rc = tipc_wait_for_sndmsg(sock, );
+   rc = tipc_wait_for_cond(sock, , !tsk->link_cong);
if (!rc)
continue;
}
@@ -830,31 +872,6 @@ static void tipc_sk_proto_rcv(struct tipc_sock *tsk, 
struct sk_buff *skb,
kfree_skb(skb);
 }
 
-static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p)
-{
-   DEFINE_WAIT_FUNC(wait, woken_wake_function);
-   struct sock *sk = sock->sk;
-   struct tipc_sock *tsk = tipc_sk(sk);
-   int done;
-
-   do {
-   int err = sock_error(sk);
-   if (err)
-   return err;
-   if (sk->sk_shutdown & SEND_SHUTDOWN)
-   return -EPIPE;
-   if (!*timeo_p)
-   return -EAGAIN;
-   if (signal_pending(current))
-   return sock_intr_errno(*timeo_p);
-
-   add_wait_queue(sk_sleep(sk), );
-   done = sk_wait_event(sk, timeo_p, !tsk->link_cong, );
-   remove_wait_queue(sk_sleep(sk), );

[net-next 0/3] tipc: improve interaction socket-link

2017-01-14 Thread Jon Maloy

We fix a very real starvation problem that may occur when a link
encounters send buffer congestion. At the same time we make the 
interaction between the socket and link layer simpler and more 
consistent.

Jon Maloy (3):
  tipc: unify tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg()
functions
  tipc: modify struct tipc_plist to be more versatile
  tipc: reduce risk of user starvation during link congestion

 net/tipc/bcast.c  |   6 +-
 net/tipc/link.c   |  75 -
 net/tipc/msg.h|   2 -
 net/tipc/name_table.c | 100 +++
 net/tipc/name_table.h |  21 +--
 net/tipc/node.c   |  15 +-
 net/tipc/socket.c | 449 ++
 7 files changed, 319 insertions(+), 349 deletions(-)

-- 
2.7.4

71 matches

Mail list logo