

Re: [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu

2016-03-31 Thread Jason Wang


On 04/01/2016 10:55 AM, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 10:13 +0800, Jason Wang wrote:
>
>
>> The problem is we want to support busy polling for tun. This needs
>> napi_id to be passed to the tun socket by sk_mark_napi_id() during
>> tun_net_xmit(). But before reaching this, XPS will set sender_cpu,
>> which keeps us from seeing the correct napi_id.
>>
> Looks like napi_id should have precedence then ?

But then when busy polling is enabled, we may still hit the issue before
commit 2bd82484bb4c5db1d5dc983ac7c409b2782e0154? So it looks like sometimes
(e.g. for tun), we need both fields.

>
> Only forwarding should allow the field to be cleared to allow XPS to do
> its job.
>
> Maybe skb_sender_cpu_clear() was removed too early (commit
> 64d4e3431e686dc37ce388ba531c4c4e866fb141)

Not sure I get you, but this will clear napi_id too.

> Look, it is 8pm here, I am pretty sure a solution can be found,
> but I also need to take a break, I started at 3am today...
>
>
>



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

On 01.04.2016 06:26, Alexei Starovoitov wrote:

On Fri, Apr 01, 2016 at 06:12:49AM +0200, Hannes Frederic Sowa wrote:

On 01.04.2016 06:04, Alexei Starovoitov wrote:

On Thu, Mar 31, 2016 at 08:03:38PM -0700, Eric Dumazet wrote:

On Thu, 2016-03-31 at 18:45 -0700, Alexei Starovoitov wrote:


Eric, what's your take on Hannes's patch 2 ?
Is it more accurate to ask lockdep to check for actual lock
or lockdep can rely on owned flag?
Potentially there could be races between setting the flag and
actual lock... but that code is contained, so unlikely.
Will we find the real issues with this 'stronger' check or
just spend a ton of time adapting to new model like your other
patch for release_sock and whatever may need to come next...


More precise lockdep checks are certainly good, I only objected to 4/4
trying to work around another bug.

But why do we rush for 'net' tree ?

This looks net-next material to me.

Locking changes are often subtle, lets take the time to do them
properly.


completely agree. I think only first patch belongs in net.
Everything else is net-next material.


Problem with the first patch is that it uses lock_sock_fast, thus the current
sock_owned_by_user check doesn't get rid of the lockdep warning. :/

Thus we would need to go with the first two patches. Do you think that is
acceptable? I actually didn't see a problem, and testing showed no problems
so far.


I see. Right. Patch 1 only makes sense when coupled with 2.
But now I'm not so sure that lockdep_is_held(&sk->sk_lock.slock)
is a valid check, since the current sock_owned_by_user() is equivalent
to lockdep_is_held(&sk->sk_lock) only.
I would go with Daniel's approach. Much simpler to reason about.


IMHO we should treat sk_lock and sk_lock.slock the same as they are 
encapsulated by socket lock api.


I was rather afraid that we call those changed functions from within 
release_sock and thus would have the same problem again, where we get 
splats because of the window where we actually have user ownership but not 
the mark in the lockdep data structures. But this seems not to be the 
case, as the functions are only called directly on behalf of user space.


Daniel, what do you think? I would be fine with your patch for net and 
we clean this up a bit in net-next then.


Bye,
Hannes



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Alexei Starovoitov
On Fri, Apr 01, 2016 at 06:12:49AM +0200, Hannes Frederic Sowa wrote:
> On 01.04.2016 06:04, Alexei Starovoitov wrote:
> >On Thu, Mar 31, 2016 at 08:03:38PM -0700, Eric Dumazet wrote:
> >>On Thu, 2016-03-31 at 18:45 -0700, Alexei Starovoitov wrote:
> >>
> >>>Eric, what's your take on Hannes's patch 2 ?
> >>>Is it more accurate to ask lockdep to check for actual lock
> >>>or lockdep can rely on owned flag?
> >>>Potentially there could be races between setting the flag and
> >>>actual lock... but that code is contained, so unlikely.
> >>>Will we find the real issues with this 'stronger' check or
> >>>just spend a ton of time adapting to new model like your other
> >>>patch for release_sock and whatever may need to come next...
> >>
> >>More precise lockdep checks are certainly good, I only objected to 4/4
> >>trying to work around another bug.
> >>
> >>But why do we rush for 'net' tree ?
> >>
> >>This looks net-next material to me.
> >>
> >>Locking changes are often subtle, lets take the time to do them
> >>properly.
> >
> >completely agree. I think only first patch belongs in net.
> >Everything else is net-next material.
> 
> Problem with the first patch is that it uses lock_sock_fast, thus the current
> sock_owned_by_user check doesn't get rid of the lockdep warning. :/
> 
> Thus we would need to go with the first two patches. Do you think that is
> acceptable? I actually didn't see a problem, and testing showed no problems
> so far.

I see. Right. Patch 1 only makes sense when coupled with 2.
But now I'm not so sure that lockdep_is_held(&sk->sk_lock.slock)
is a valid check, since the current sock_owned_by_user() is equivalent
to lockdep_is_held(&sk->sk_lock) only.
I would go with Daniel's approach. Much simpler to reason about.



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

On 01.04.2016 06:04, Alexei Starovoitov wrote:

On Thu, Mar 31, 2016 at 08:03:38PM -0700, Eric Dumazet wrote:

On Thu, 2016-03-31 at 18:45 -0700, Alexei Starovoitov wrote:


Eric, what's your take on Hannes's patch 2 ?
Is it more accurate to ask lockdep to check for actual lock
or lockdep can rely on owned flag?
Potentially there could be races between setting the flag and
actual lock... but that code is contained, so unlikely.
Will we find the real issues with this 'stronger' check or
just spend a ton of time adapting to new model like your other
patch for release_sock and whatever may need to come next...


More precise lockdep checks are certainly good, I only objected to 4/4
trying to work around another bug.

But why do we rush for 'net' tree ?

This looks net-next material to me.

Locking changes are often subtle, lets take the time to do them
properly.


completely agree. I think only first patch belongs in net.
Everything else is net-next material.


Problem with the first patch is that it uses lock_sock_fast, thus the 
current sock_owned_by_user check doesn't get rid of the lockdep warning. :/


Thus we would need to go with the first two patches. Do you think that is 
acceptable? I actually didn't see a problem, and testing showed no 
problems so far.


Bye,
Hannes



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Alexei Starovoitov
On Thu, Mar 31, 2016 at 08:03:38PM -0700, Eric Dumazet wrote:
> On Thu, 2016-03-31 at 18:45 -0700, Alexei Starovoitov wrote:
> 
> > Eric, what's your take on Hannes's patch 2 ?
> > Is it more accurate to ask lockdep to check for actual lock
> > or lockdep can rely on owned flag?
> > Potentially there could be races between setting the flag and
> > actual lock... but that code is contained, so unlikely.
> > Will we find the real issues with this 'stronger' check or
> > just spend a ton of time adapting to new model like your other
> > patch for release_sock and whatever may need to come next...
> 
> More precise lockdep checks are certainly good, I only objected to 4/4
> trying to work around another bug.
> 
> But why do we rush for 'net' tree ?
> 
> This looks net-next material to me.
> 
> Locking changes are often subtle, lets take the time to do them
> properly.

completely agree. I think only first patch belongs in net.
Everything else is net-next material.



Re: [PATCH] rds: rds-stress show all zeros after few minutes

2016-03-31 Thread santosh.shilim...@oracle.com

Hi Shamir,

Nice to see this one soon on the list.
Just to make $subject more relevant, how about the below?

RDS: fix congestion map corruption for PAGE_SIZE > 8k

On 3/30/16 5:50 PM, shamir rabinovitch wrote:

The issue can be seen on platforms that use a page size of 8K or above
while the rds fragment size is 4K. On those platforms a single page is
shared between 2 or more rds fragments. Each fragment has its own
offset, and the rds cong map code needs to take this offset into account.
Not taking this offset into account leads to reading a data fragment
as a congestion map fragment and a hang of the rds transmit due to far
cong map corruption.

Reviewed-by: Wengang Wang 
Reviewed-by: Ajaykumar Hotchandani 
Acked-by: Santosh Shilimkar 
Tested-by: Anand Bibhuti 

Signed-off-by: shamir rabinovitch 
---
  net/rds/ib_recv.c |2 +-
  net/rds/iw_recv.c |2 +-
  net/rds/page.c|5 +++--
  3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 977fb86..abc8cc8 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -796,7 +796,7 @@ static void rds_ib_cong_recv(struct rds_connection *conn,

addr = kmap_atomic(sg_page(&frag->f_sg));

-   src = addr + frag_off;
+   src = addr + frag->f_sg.offset + frag_off;
dst = (void *)map->m_page_addrs[map_page] + map_off;
for (k = 0; k < to_copy; k += 8) {
/* Record ports that became uncongested, ie
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c

If you refresh the patch against 4.6-rc1, you won't need to
patch iw_recv.c :-)



diff --git a/net/rds/page.c b/net/rds/page.c
index 5a14e6d..715cbaa 100644
--- a/net/rds/page.c
+++ b/net/rds/page.c
@@ -135,8 +135,9 @@ int rds_page_remainder_alloc(struct scatterlist *scat, 
unsigned long bytes,
if (rem->r_offset != 0)
rds_stats_inc(s_page_remainder_hit);

-   rem->r_offset += bytes;
-   if (rem->r_offset == PAGE_SIZE) {
+   /* some hw (e.g. sparc) require aligned memory */
+   rem->r_offset += ALIGN(bytes, 8);
+   if (rem->r_offset >= PAGE_SIZE) {
__free_page(rem->r_page);
rem->r_page = NULL;
}


Looks like I missed this hunk. It doesn't belong to the
$subject patch. Could you please add it in a separate patch? I
will need more justification than just "some hw (e.g. sparc) require aligned memory".

Once you fix these, please repost the updated version, and I will add
them to the 4.7 queue. Thanks !!

Regards,
Santosh


Re: [PATCH RFC net-next] net: core: Pass XPS select queue decision to skb_tx_hash

2016-03-31 Thread John Fastabend
On 16-03-30 11:30 AM, Saeed Mahameed wrote:
> On Wed, Mar 30, 2016 at 8:04 PM, John Fastabend
>  wrote:
>>
>> OK, so let me see if I get this right now. This was the precedence
>> before the patch in the normal no select queue case,
>>
>> (1) socket mapping sk_tx_queue_mapping iff !ooo_okay
>> (2) xps
>> (3) skb->queue_mapping
>> (4) qoffset/qcount (hash over tc queues)
>> (5) hash over num_tx_queues
>>
>> With this patch the precedence is a bit changed because
>> skb_tx_hash is always called.
>>
>> (1) socket mapping sk_tx_queue_mapping iff !ooo_okay
>> (2) skb->queue_mapping
>> (3) qoffset/qcount
>>(hash over tc queues if xps choice is > qcount)
>> (4) xps
>> (5) hash over num_tx_queues
>>
>> Sound right? Nice thing about this with correct configuration
>> of tc with qcount = xps_queues it sort of works as at least
> 
> Yes!
> For qcount = xps_queues, which is how almost all drivers' default
> configurations go, it works like a charm: xps selects the exact TC TX
> queue at the correct offset without any need for further SKB hashing.
> And even if by mistake XPS was also configured on a TC TX queue, this
> patch will detect that the xps hash is out of this TC offset/qcount
> range and will re-hash. But I don't see why a user or driver would do
> such a strange configuration.
> 
>> I expect it to. I think the question is are people OK with
>> letting skb->queue_mapping take precedence. I am at least
>> because it makes the skb edit queue_mapping action from tc
>> easier to use.
>>
> 
> skb->queue_mapping took precedence before this patch too; the only
> thing this patch changes is how to compute the txq when
> skb->queue_mapping is not present, so we don't need to worry about
> this.
> 

I don't believe that is correct in the general case. Perhaps
in the ndo_select_queue path though. See this line,

if (queue_index < 0 || skb->ooo_okay ||
queue_index >= dev->real_num_tx_queues) {
int new_index = get_xps_queue(dev, skb);
if (new_index < 0)
new_index = skb_tx_hash(dev, skb);

The skb_tx_hash() routine is never called if xps is enabled.
And so we never get into the call to do this,

if (skb_rx_queue_recorded(skb)) {
hash = skb_get_rx_queue(skb);
while (unlikely(hash >= num_tx_queues))
hash -= num_tx_queues;
return hash;
}

Right? FWIW I think that using queue_mapping before xps is better
because we can use tc to set queue_mapping programmatically for
these special cases if wanted.

>> And just a comment on the code: why not just move get_xps_queue
>> into skb_tx_hash at this point, if it's always being called as the
>> "hint"? Then we avoid calling it in the case queue_mapping is
>> set.
>>
> 
> Very good point. The only place that calls skb_tx_hash(dev, skb) other
> than __netdev_pick_tx is the mlx4 driver, and they did it there just
> because they wanted to bypass XPS configuration when TC QoS is
> configured. With this fix we don't have to bypass XPS at all when
> TC is configured.
> 
> I will change it.
> 

Great thanks.


Re: qdisc spin lock

2016-03-31 Thread John Fastabend
On 16-03-31 04:48 PM, Michael Ma wrote:
> I didn't really know that multiple qdiscs can be isolated using MQ so
> that each txq can be associated with a particular qdisc. Also we don't
> really have multiple interfaces...

MQ will assign a default qdisc to each txq and the default qdisc can
be changed to htb or any other qdisc of your choice.

> 
> With this MQ solution we'll still need to assign transmit queues to
> different classes by doing some math on the bandwidth limit if I
> understand correctly, which seems to be less convenient compared with
> a solution purely within HTB.
> 

Agreed.

> I assume that with this solution I can still share qdisc among
> multiple transmit queues - please let me know if this is not the case.

Nope, sorry, it doesn't work that way unless you employ some sort of stacked
netdevice strategy, which does start to get a bit complex. The basic hint
would be to stack some type of virtual netdev on top of a device, run the
htb qdisc there, and push traffic onto the netdev depending on the
class it belongs to. It's ugly, yes.

Noting all that, I posted an RFC patch some time back to allow writing
qdiscs that do not require taking the lock. I'll try to respin these
and submit them when net-next opens again. The next logical step is to
write a "better" HTB, probably using a shared counter and dropping the
requirement that it be exact.

Sorry, I didn't get a chance to look at the paper in your post, so I'm not
sure if they suggest something similar or not.

Thanks,
John

> 
> 2016-03-31 15:16 GMT-07:00 Cong Wang :
>> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma  wrote:
>>> As far as I understand the design of TC is to simplify locking schema
>>> and minimize the work in __qdisc_run so that throughput won’t be
>>> affected, especially with large packets. However if the scenario is
>>> that multiple classes in the queueing discipline only have the shaping
>>> limit, there isn’t really a necessary correlation between different
>>> classes. The only synchronization point should be when the packet is
>>> dequeued from the qdisc queue and enqueued to the transmit queue of
>>> the device. My question is – is it worth investing on avoiding the
>>> locking contention by partitioning the queue/lock so that this
>>> scenario is addressed with relatively smaller latency?
>>
>> If your HTB classes don't share bandwidth, why do you still make them
>> under the same hierarchy? IOW, you can just isolate them either with some
>> other qdisc or just separate interfaces.
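A sketch of the MQ-plus-HTB setup discussed above. This is a configuration fragment under stated assumptions — the device name, handles, and rates are illustrative, and real qcount/bandwidth math must come from the deployment:

```shell
# Replace the root qdisc with mq, so each hardware txq gets its own
# child qdisc (and its own qdisc lock).
tc qdisc add dev eth0 root handle 100: mq

# mq exposes the txqs as classes 100:1, 100:2, ...; attach one htb
# hierarchy per queue instead of one lock-contended htb for the device.
tc qdisc add dev eth0 parent 100:1 handle 101: htb default 10
tc class add dev eth0 parent 101: classid 101:10 htb rate 1gbit

tc qdisc add dev eth0 parent 100:2 handle 102: htb default 10
tc class add dev eth0 parent 102: classid 102:10 htb rate 1gbit
```

The shaping budget is partitioned per queue here, which is exactly the "do some math on the bandwidth limit" inconvenience noted in the thread: classes on different txqs cannot borrow from each other the way one device-wide HTB hierarchy allows.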



[v7, 1/5] ARM64: dts: ls2080a: add device configuration node

2016-03-31 Thread Yangbo Lu
Add the dts node for device configuration unit that provides
general purpose configuration and status for the device.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None
Changes for v3:
- None
Changes for v4:
- None
Changes for v5:
- Added this patch
Changes for v6:
- None
Changes for v7:
- None
---
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi 
b/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
index 9d746c6..8724cf1 100644
--- a/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
@@ -191,6 +191,12 @@
clocks = <>;
};
 
+   dcfg: dcfg@1e0 {
+   compatible = "fsl,ls2080a-dcfg", "syscon";
+   reg = <0x0 0x1e0 0x0 0x1>;
+   little-endian;
+   };
+
serial0: serial@21c0500 {
compatible = "fsl,ns16550", "ns16550a";
reg = <0x0 0x21c0500 0x0 0x100>;
-- 
2.1.0.27.g96db324



[v7, 4/5] powerpc/fsl: move mpc85xx.h to include/linux/fsl

2016-03-31 Thread Yangbo Lu
Move mpc85xx.h to include/linux/fsl and rename it to svr.h as
a common header file. It has been used for mpc85xx and it will
be used for ARM-based SoC as well.

Signed-off-by: Yangbo Lu 
Acked-by: Wolfram Sang 
---
Changes for v2:
- None
Changes for v3:
- None
Changes for v4:
- None
Changes for v5:
- Changed to Move mpc85xx.h to include/linux/fsl/
- Adjusted '#include ' position in file
Changes for v6:
- None
Changes for v7:
- Added 'Acked-by: Wolfram Sang' for I2C part
- Also applied to arch/powerpc/kernel/cpu_setup_fsl_booke.S
---
 arch/powerpc/kernel/cpu_setup_fsl_booke.S | 2 +-
 drivers/clk/clk-qoriq.c   | 3 +--
 drivers/i2c/busses/i2c-mpc.c  | 2 +-
 drivers/iommu/fsl_pamu.c  | 3 +--
 drivers/net/ethernet/freescale/gianfar.c  | 2 +-
 arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h | 4 ++--
 6 files changed, 7 insertions(+), 9 deletions(-)
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S 
b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 462aed9..2b0284e 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -13,13 +13,13 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 
 _GLOBAL(__e500_icache_setup)
mfspr   r0, SPRN_L1CSR1
diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c
index 7bc1c45..fc7f722 100644
--- a/drivers/clk/clk-qoriq.c
+++ b/drivers/clk/clk-qoriq.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1148,8 +1149,6 @@ bad_args:
 }
 
 #ifdef CONFIG_PPC
-#include 
-
 static const u32 a4510_svrs[] __initconst = {
(SVR_P2040 << 8) | 0x10,/* P2040 1.0 */
(SVR_P2040 << 8) | 0x11,/* P2040 1.1 */
diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index 48ecffe..600704c 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -27,9 +27,9 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
-#include 
 #include 
 
 #define DRV_NAME "mpc-i2c"
diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
index a34355f..af8fb27 100644
--- a/drivers/iommu/fsl_pamu.c
+++ b/drivers/iommu/fsl_pamu.c
@@ -21,11 +21,10 @@
 #include "fsl_pamu.h"
 
 #include 
+#include 
 #include 
 #include 
 
-#include 
-
 /* define indexes for each operation mapping scenario */
 #define OMI_QMAN0x00
 #define OMI_FMAN0x01
diff --git a/drivers/net/ethernet/freescale/gianfar.c 
b/drivers/net/ethernet/freescale/gianfar.c
index d2f917a..2224b10 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -86,11 +86,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #ifdef CONFIG_PPC
 #include 
-#include 
 #endif
 #include 
 #include 
diff --git a/arch/powerpc/include/asm/mpc85xx.h b/include/linux/fsl/svr.h
similarity index 97%
rename from arch/powerpc/include/asm/mpc85xx.h
rename to include/linux/fsl/svr.h
index 213f3a8..8d13836 100644
--- a/arch/powerpc/include/asm/mpc85xx.h
+++ b/include/linux/fsl/svr.h
@@ -9,8 +9,8 @@
  * (at your option) any later version.
  */
 
-#ifndef __ASM_PPC_MPC85XX_H
-#define __ASM_PPC_MPC85XX_H
+#ifndef FSL_SVR_H
+#define FSL_SVR_H
 
 #define SVR_REV(svr)   ((svr) & 0xFF)  /* SOC design revision */
 #define SVR_MAJ(svr)   (((svr) >>  4) & 0xF)   /* Major revision field*/
-- 
2.1.0.27.g96db324



[v7, 2/5] soc: fsl: add GUTS driver for QorIQ platforms

2016-03-31 Thread Yangbo Lu
The global utilities block controls power management, I/O device
enabling, power-onreset(POR) configuration monitoring, alternate
function selection for multiplexed signals,and clock control.

This patch adds GUTS driver to manage and access global utilities
block.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None
Changes for v3:
- None
Changes for v4:
- Added this patch
Changes for v5:
- Modified copyright info
- Changed MODULE_LICENSE to GPL
- Changed EXPORT_SYMBOL_GPL to EXPORT_SYMBOL
- Made FSL_GUTS user-invisible
- Added a complete compatible list for GUTS
- Stored guts info in file-scope variable
- Added mfspr() getting SVR
- Redefined GUTS APIs
- Called fsl_guts_init rather than using platform driver
- Removed useless parentheses
- Removed useless 'extern' key words
Changes for v6:
- Made guts thread safe in fsl_guts_init
Changes for v7:
- Removed 'ifdef' for function declaration in guts.h
---
 drivers/soc/Kconfig  |   2 +-
 drivers/soc/fsl/Kconfig  |   8 
 drivers/soc/fsl/Makefile |   1 +
 drivers/soc/fsl/guts.c   | 119 +++
 include/linux/fsl/guts.h |  98 +++---
 5 files changed, 179 insertions(+), 49 deletions(-)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c

diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig
index cb58ef0..7106463 100644
--- a/drivers/soc/Kconfig
+++ b/drivers/soc/Kconfig
@@ -2,7 +2,7 @@ menu "SOC (System On Chip) specific Drivers"
 
 source "drivers/soc/bcm/Kconfig"
 source "drivers/soc/brcmstb/Kconfig"
-source "drivers/soc/fsl/qe/Kconfig"
+source "drivers/soc/fsl/Kconfig"
 source "drivers/soc/mediatek/Kconfig"
 source "drivers/soc/qcom/Kconfig"
 source "drivers/soc/rockchip/Kconfig"
diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
new file mode 100644
index 000..b313759
--- /dev/null
+++ b/drivers/soc/fsl/Kconfig
@@ -0,0 +1,8 @@
+#
+# Freescale SOC drivers
+#
+
+source "drivers/soc/fsl/qe/Kconfig"
+
+config FSL_GUTS
+   bool
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 203307f..02afb7f 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -4,3 +4,4 @@
 
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
+obj-$(CONFIG_FSL_GUTS) += guts.o
diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
new file mode 100644
index 000..fa155e6
--- /dev/null
+++ b/drivers/soc/fsl/guts.c
@@ -0,0 +1,119 @@
+/*
+ * Freescale QorIQ Platforms GUTS Driver
+ *
+ * Copyright (C) 2016 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct guts {
+   struct ccsr_guts __iomem *regs;
+   bool little_endian;
+};
+
+static struct guts *guts;
+static DEFINE_MUTEX(guts_lock);
+
+u32 fsl_guts_get_svr(void)
+{
+   u32 svr = 0;
+
+   if (!guts || !guts->regs) {
+#ifdef CONFIG_PPC
+   svr =  mfspr(SPRN_SVR);
+#endif
+   return svr;
+   }
+
+   if (guts->little_endian)
+   svr = ioread32(&guts->regs->svr);
+   else
+   svr = ioread32be(&guts->regs->svr);
+
+   return svr;
+}
+EXPORT_SYMBOL(fsl_guts_get_svr);
+
+/*
+ * Table for matching compatible strings, for device tree
+ * guts node, for Freescale QorIQ SOCs.
+ */
+static const struct of_device_id guts_of_match[] = {
+   /* For T4 & B4 Series SOCs */
+   { .compatible = "fsl,qoriq-device-config-1.0", },
+   /* For P Series SOCs */
+   { .compatible = "fsl,qoriq-device-config-2.0", },
+   { .compatible = "fsl,p1010-guts", },
+   { .compatible = "fsl,p1020-guts", },
+   { .compatible = "fsl,p1021-guts", },
+   { .compatible = "fsl,p1022-guts", },
+   { .compatible = "fsl,p1023-guts", },
+   { .compatible = "fsl,p2020-guts", },
+   /* For BSC Series SOCs */
+   { .compatible = "fsl,bsc9131-guts", },
+   { .compatible = "fsl,bsc9132-guts", },
+   /* For MPC85xx Series SOCs */
+   { .compatible = "fsl,mpc8536-guts", },
+   { .compatible = "fsl,mpc8544-guts", },
+   { .compatible = "fsl,mpc8548-guts", },
+   { .compatible = "fsl,mpc8568-guts", },
+   { .compatible = "fsl,mpc8569-guts", },
+   { .compatible = "fsl,mpc8572-guts", },
+   /* For Layerscape Series SOCs */
+   { .compatible = "fsl,ls1021a-dcfg", },
+   { .compatible = "fsl,ls1043a-dcfg", },
+   { .compatible = "fsl,ls2080a-dcfg", },
+   {}
+};
+
+int fsl_guts_init(void)
+{
+   struct device_node *np;
+ 

Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa


On Fri, Apr 1, 2016, at 05:13, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 04:01 +0200, Hannes Frederic Sowa wrote:
> 
> > I thought so first, as well. But given the double check for the 
> > spin_lock and the "mutex" we end up with the same result for the 
> > lockdep_sock_is_held check.
> > 
> > Do you see other consequences?
> 
> Well, we release the spinlock in __release_sock()
> 
> So another thread could come and acquire the socket, then call
> mutex_acquire() while the first thread did not call yet mutex_release()
> 
> So maybe lockdep will complain (but I do not know lockdep enough to
> tell)
> 
> So maybe the following would be better :
> 
> (Absolutely untested, really I need to take a break)

I quickly tested the patch and my scripts didn't show any splats so far.
This patch seems more consistent albeit I don't think it is relevant for
lockdep_sock_is_held as we only flip owned while holding slock. But this
definitely needs more review.

Thanks a lot!


[v7, 3/5] dt: move guts devicetree doc out of powerpc directory

2016-03-31 Thread Yangbo Lu
Move guts devicetree doc to Documentation/devicetree/bindings/soc/fsl/
since it's used by not only PowerPC but also ARM. And add a specification
for 'little-endian' property.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None
Changes for v3:
- None
Changes for v4:
- Added this patch
Changes for v5:
- Modified the description for little-endian property
Changes for v6:
- None
Changes for v7:
- None
---
 Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt | 3 +++
 1 file changed, 3 insertions(+)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)

diff --git a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt 
b/Documentation/devicetree/bindings/soc/fsl/guts.txt
similarity index 91%
rename from Documentation/devicetree/bindings/powerpc/fsl/guts.txt
rename to Documentation/devicetree/bindings/soc/fsl/guts.txt
index b71b203..07adca9 100644
--- a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt
+++ b/Documentation/devicetree/bindings/soc/fsl/guts.txt
@@ -25,6 +25,9 @@ Recommended properties:
  - fsl,liodn-bits : Indicates the number of defined bits in the LIODN
registers, for those SOCs that have a PAMU device.
 
+ - little-endian : Indicates that the global utilities block is little
+   endian. The default is big endian.
+
 Examples:
global-utilities@e {/* global utilities block */
compatible = "fsl,mpc8548-guts";
-- 
2.1.0.27.g96db324



[v7, 5/5] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

2016-03-31 Thread Yangbo Lu
The eSDHC of T4240-R1.0-R2.0 has an incorrect vendor version and spec version.
Actually the right version numbers should be VVN=0x13 and SVN=0x1.
This patch adds GUTS driver support to the eSDHC driver to get the SVR
(System Version Register), and fixes the host version to avoid the incorrect
version numbers breaking the ADMA data transfer.

Signed-off-by: Yangbo Lu 
Acked-by: Ulf Hansson 
---
Changes for v2:
- Got SVR through iomap instead of dts
Changes for v3:
- Managed GUTS through syscon instead of iomap in eSDHC driver
Changes for v4:
- Got SVR by GUTS driver instead of SYSCON
Changes for v5:
- Changed to get SVR through API fsl_guts_get_svr()
- Combined patch 4, patch 5 and patch 6 into one
Changes for v6:
- Added 'Acked-by: Ulf Hansson'
Changes for v7:
- None
---
 drivers/mmc/host/Kconfig  |  1 +
 drivers/mmc/host/sdhci-of-esdhc.c | 23 +++
 2 files changed, 24 insertions(+)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 04feea8..5743b05 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -142,6 +142,7 @@ config MMC_SDHCI_OF_ESDHC
depends on MMC_SDHCI_PLTFM
depends on PPC || ARCH_MXC || ARCH_LAYERSCAPE
select MMC_SDHCI_IO_ACCESSORS
+   select FSL_GUTS
help
  This selects the Freescale eSDHC controller support.
 
diff --git a/drivers/mmc/host/sdhci-of-esdhc.c 
b/drivers/mmc/host/sdhci-of-esdhc.c
index 3f34d35..68cc020 100644
--- a/drivers/mmc/host/sdhci-of-esdhc.c
+++ b/drivers/mmc/host/sdhci-of-esdhc.c
@@ -18,6 +18,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include "sdhci-pltfm.h"
 #include "sdhci-esdhc.h"
@@ -28,6 +30,8 @@
 struct sdhci_esdhc {
u8 vendor_ver;
u8 spec_ver;
+   u32 soc_ver;
+   u8 soc_rev;
 };
 
 /**
@@ -73,6 +77,8 @@ static u32 esdhc_readl_fixup(struct sdhci_host *host,
 static u16 esdhc_readw_fixup(struct sdhci_host *host,
 int spec_reg, u32 value)
 {
+   struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
+   struct sdhci_esdhc *esdhc = sdhci_pltfm_priv(pltfm_host);
u16 ret;
int shift = (spec_reg & 0x2) * 8;
 
@@ -80,6 +86,13 @@ static u16 esdhc_readw_fixup(struct sdhci_host *host,
ret = value & 0xffff;
else
ret = (value >> shift) & 0xffff;
+
+   /* Workaround for T4240-R1.0-R2.0 eSDHC which has incorrect
+* vendor version and spec version information.
+*/
+   if ((spec_reg == SDHCI_HOST_VERSION) &&
+   (esdhc->soc_ver == SVR_T4240) && (esdhc->soc_rev <= 0x20))
+   ret = (VENDOR_V_23 << SDHCI_VENDOR_VER_SHIFT) | SDHCI_SPEC_200;
return ret;
 }
 
@@ -567,10 +580,20 @@ static void esdhc_init(struct platform_device *pdev, 
struct sdhci_host *host)
struct sdhci_pltfm_host *pltfm_host;
struct sdhci_esdhc *esdhc;
u16 host_ver;
+   u32 svr;
 
pltfm_host = sdhci_priv(host);
esdhc = sdhci_pltfm_priv(pltfm_host);
 
+   fsl_guts_init();
+   svr = fsl_guts_get_svr();
+   if (svr) {
+   esdhc->soc_ver = SVR_SOC_VER(svr);
+   esdhc->soc_rev = SVR_REV(svr);
+   } else {
+   dev_err(&pdev->dev, "Failed to get SVR value!\n");
+   }
+
host_ver = sdhci_readw(host, SDHCI_HOST_VERSION);
esdhc->vendor_ver = (host_ver & SDHCI_VENDOR_VER_MASK) >>
 SDHCI_VENDOR_VER_SHIFT;
-- 
2.1.0.27.g96db324



[v7, 0/5] Fix eSDHC host version register bug

2016-03-31 Thread Yangbo Lu
This patchset fixes a host version register bug in the T4240-R1.0-R2.0
eSDHC controller. To get the SoC version and revision, the GUTS driver is
added to access the global utilities registers.

So, the first three patches are to add the GUTS driver.
The following two patches are to enable GUTS driver support to get SVR in eSDHC
driver and fix host version for T4240.

Yangbo Lu (5):
  ARM64: dts: ls2080a: add device configuration node
  soc: fsl: add GUTS driver for QorIQ platforms
  dt: move guts devicetree doc out of powerpc directory
  powerpc/fsl: move mpc85xx.h to include/linux/fsl
  mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

 .../bindings/{powerpc => soc}/fsl/guts.txt |   3 +
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi |   6 ++
 arch/powerpc/kernel/cpu_setup_fsl_booke.S  |   2 +-
 drivers/clk/clk-qoriq.c|   3 +-
 drivers/i2c/busses/i2c-mpc.c   |   2 +-
 drivers/iommu/fsl_pamu.c   |   3 +-
 drivers/mmc/host/Kconfig   |   1 +
 drivers/mmc/host/sdhci-of-esdhc.c  |  23 
 drivers/net/ethernet/freescale/gianfar.c   |   2 +-
 drivers/soc/Kconfig|   2 +-
 drivers/soc/fsl/Kconfig|   8 ++
 drivers/soc/fsl/Makefile   |   1 +
 drivers/soc/fsl/guts.c | 119 +
 include/linux/fsl/guts.h   |  98 -
 .../asm/mpc85xx.h => include/linux/fsl/svr.h   |   4 +-
 15 files changed, 219 insertions(+), 58 deletions(-)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

-- 
2.1.0.27.g96db324
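The SVR_SOC_VER()/SVR_REV() split this series relies on reduces to plain bit
operations on the 32-bit SVR value; the mask below mirrors the macros moved
into include/linux/fsl/svr.h, though the exact mask constant is an assumption
here:

```python
def svr_soc_ver(svr):
    # Upper bits identify the SoC family; the mask drops the bit that
    # distinguishes security ("E") variants so that e.g. a part and
    # its "E" sibling compare equal (mask value assumed from svr.h).
    return (svr >> 8) & 0xFFF7FF

def svr_rev(svr):
    # Low byte is the silicon revision, e.g. 0x20 for rev 2.0, which
    # is what the T4240 <= R2.0 check in patch 5/5 compares against.
    return svr & 0xFF
```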



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 04:01 +0200, Hannes Frederic Sowa wrote:

> I thought so first, as well. But given the double check for the 
> spin_lock and the "mutex" we end up with the same result for the 
> lockdep_sock_is_held check.
> 
> Do you see other consequences?

Well, we release the spinlock in __release_sock()

So another thread could come and acquire the socket, then call
mutex_acquire() while the first thread has not yet called mutex_release()

So maybe lockdep will complain (but I do not know lockdep enough to
tell)

So maybe the following would be better :

(Absolutely untested, really I need to take a break)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b..7d5dfa7e1918 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1327,7 +1327,13 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 
 static inline void sock_release_ownership(struct sock *sk)
 {
-   sk->sk_lock.owned = 0;
+   if (sk->sk_lock.owned) {
+   /*
+* The sk_lock has mutex_unlock() semantics:
+*/
+   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
+   sk->sk_lock.owned = 0;
+   }
 }
 
 /*
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230..c7ab98e72346 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2429,10 +2429,6 @@ EXPORT_SYMBOL(lock_sock_nested);
 
 void release_sock(struct sock *sk)
 {
-   /*
-* The sk_lock has mutex_unlock() semantics:
-*/
-   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);

	spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)




Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa
On Fri, Apr 1, 2016, at 05:03, Eric Dumazet wrote:
> On Thu, 2016-03-31 at 18:45 -0700, Alexei Starovoitov wrote:
> 
> > Eric, what's your take on Hannes's patch 2 ?
> > Is it more accurate to ask lockdep to check for actual lock
> > or lockdep can rely on owned flag?
> > Potentially there could be races between setting the flag and
> > actual lock... but that code is contained, so unlikely.
> > Will we find the real issues with this 'stronger' check or
> > just spend a ton of time adapting to new model like your other
> > patch for release_sock and whatever may need to come next...
> 
> More precise lockdep checks are certainly good, I only objected to 4/4
> trying to work around another bug.
> 
> But why do we rush for 'net' tree ?
> 
> This looks net-next material to me.
> 
> Locking changes are often subtle, lets take the time to do them
> properly.

I certainly can see my mistake now trying to paper over the splats. Do
you object if I send the first patches to fix up the reported lockdep?


Re: [PULL REQUEST] Please pull rdma.git

2016-03-31 Thread Leon Romanovsky
On Thu, Mar 31, 2016 at 10:18:47PM -0400, David Miller wrote:
> From: Leon Romanovsky 
> Date: Fri, 1 Apr 2016 02:38:28 +0300
> 
> > On Wed, Mar 23, 2016 at 09:37:54AM -0400, Doug Ledford wrote:
> >> On 03/23/2016 06:57 AM, Leon Romanovsky wrote:
> >> > On Sat, Mar 19, 2016 at 02:37:08PM -0700, Linus Torvalds wrote:
> >> >> So the *best* situation would be:
> >> >>
> >> >>  - your two groups talk it over, and figure out what the common commits 
> >> >> are
> >> >>
> >> >>  - you put those common commits as a "base" branch in git
> >> >>
> >> >>  - you ask the two upper-level maintainers to both pull that base branch
> >> >>
> >> >>  - you then make sure that you send the later patches (whether as
> >> >> emailed patches or as pull requests) based on top of that base branch.
> >> > 
> >> > Hi David and Doug,
> >> > 
> >> > Are you OK with the approach suggested by Linus?
> >> > We are eager to know it, so we will adopt it as soon
> >> > as possible in our development flow.
> >> > 
> >> > The original thread [1].
> >> > 
> >> > [1] http://thread.gmane.org/gmane.linux.drivers.rdma/34907
> >> > 
> >> > Thanks.
> >> > 
> >> 
> >> I'm fine with it.  Since I happen to use topic branches to build my
> >> for-next from anyway, I might need to be the one that Dave pulls from
> >> versus the other way around.
> > 
> > Resending to linux-netdev.
> > 
> > David,
> > Can you please express your opinion about Linus's suggestion to
> > eliminate merge conflicts in Mellanox related products?
> 
> Sure, sounds fine.

Thank you, I appreciate a lot Doug's and your openness and
willingness to help us eliminate the future merge obstacles.


Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Thu, 2016-03-31 at 18:45 -0700, Alexei Starovoitov wrote:

> Eric, what's your take on Hannes's patch 2 ?
> Is it more accurate to ask lockdep to check for actual lock
> or lockdep can rely on owned flag?
> Potentially there could be races between setting the flag and
> actual lock... but that code is contained, so unlikely.
> Will we find the real issues with this 'stronger' check or
> just spend a ton of time adapting to new model like your other
> patch for release_sock and whatever may need to come next...

More precise lockdep checks are certainly good, I only objected to 4/4
trying to work around another bug.

But why do we rush for 'net' tree ?

This looks net-next material to me.

Locking changes are often subtle, lets take the time to do them
properly.




Re: [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 10:13 +0800, Jason Wang wrote:


> 
> The problem is we want to support busy polling for tun. This needs
> napi_id to be passed to tun socket by sk_mark_napi_id() during
> tun_net_xmit(). But before reaching this, XPS will have set sender_cpu,
> which prevents us from seeing the correct napi_id.
> 

Looks like napi_id should have precedence then ?

Only forwarding should allow the field to be cleared to allow XPS to do
its job.

Maybe skb_sender_cpu_clear() was removed too early (commit
64d4e3431e686dc37ce388ba531c4c4e866fb141)

Look, it is 8pm here, I am pretty sure a solution can be found,
but I also need to take a break, I started at 3am today...





Re: [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu

2016-03-31 Thread Jason Wang


On 04/01/2016 04:01 AM, David Miller wrote:
> From: Eric Dumazet 
> Date: Thu, 31 Mar 2016 03:32:21 -0700
>
>> On Thu, 2016-03-31 at 13:50 +0800, Jason Wang wrote:
>>> We use a union for napi_id and send_cpu, this is ok for most of the
>>> cases except when we want to support busy polling for tun which needs
>>> napi_id to be stored and passed to socket during tun_net_xmit(). In
>>> this case, napi_id was overridden with sender_cpu before tun_net_xmit()
>>> was called if XPS was enabled. Fixing by not using union for napi_id
>>> and sender_cpu.
>>>
>>> Signed-off-by: Jason Wang 
>>> ---
>>>  include/linux/skbuff.h | 10 +-
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 15d0df9..8aee891 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -743,11 +743,11 @@ struct sk_buff {
>>> __u32   hash;
>>> __be16  vlan_proto;
>>> __u16   vlan_tci;
>>> -#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
>>> -   union {
>>> -   unsigned intnapi_id;
>>> -   unsigned intsender_cpu;
>>> -   };
>>> +#if defined(CONFIG_NET_RX_BUSY_POLL)
>>> +   unsigned intnapi_id;
>>> +#endif
>>> +#if defined(CONFIG_XPS)
>>> +   unsigned intsender_cpu;
>>>  #endif
>>> union {
>>>  #ifdef CONFIG_NETWORK_SECMARK
>> Hmmm...
>>
>> This is a serious problem.
>>
>> Making skb bigger (8 bytes because of alignment) was not considered
>> valid for sender_cpu introduction. We worked quite hard to avoid this,
>> if you take a look at git history :(
>>
>> Can you describe more precisely the problem and code path ?
> From what I can see they are doing busy poll loops in the TX code paths,
> as well as the RX code paths, of vhost.
>
> Doing this in the TX side makes little sense to me.  The busy poll
> implementations in the drivers only process their RX queues when
> ->ndo_busy_poll() is invoked.  So I wonder what this is accomplishing
> for the vhost TX case?

In the vhost TX case, it's possible that new packets arrived at the rx
queue during tx polling. Considering that tx and rx are processed in one
thread, polling rx looks feasible to me.


Re: Reboot Failed was occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test

2016-03-31 Thread Yuki Machida

On 2016-03-31 16:09, Yuki Machida wrote:
> Hi all,
> 
> A reboot failure occurred on Linux kernels after v4.4-rc1 during the IPv6 Ready 
> Logo Conformance Test.
> The bug is not fixed in v4.5-rc7 yet.
> I will confirm whether it is fixed in v4.6-rc1.
I can confirm that it did not occur in v4.6-rc1.

I will find the patch for this bug, and confirm it was applied to the -stable kernel.
If it was not applied to the -stable kernel, I will report again.

> Currently, it is under investigation.
> 
> IPv6 Ready Logo
> https://www.ipv6ready.org/
> TAHI Project
> http://www.tahi.org/
> 
> I ran the IPv6 Ready Logo Core Conformance Test on Intel D510MO (Atom D510).
> It is using userland build with yocto project.
> 
> Test Environment
> Test Specification  : 4.0.6
> Tool Version: REL_3_3_2
> Test Program Version: V6LC_5_0_0
> Target Device   : Intel D510MO (Atom D510)
> 
> I confirm that random test cases fail.
> (e.g. No.5, No.130, No.131, No.134, No.167 and No.168)
> 
> Regards,
> Yuki Machida
> 


Re: qdisc spin lock

2016-03-31 Thread David Miller
From: Michael Ma 
Date: Thu, 31 Mar 2016 16:48:43 -0700

> I didn't really know that multiple qdiscs can be isolated using MQ so
 ...

Please stop top-posting.


Re: [PULL REQUEST] Please pull rdma.git

2016-03-31 Thread David Miller
From: Leon Romanovsky 
Date: Fri, 1 Apr 2016 02:38:28 +0300

> On Wed, Mar 23, 2016 at 09:37:54AM -0400, Doug Ledford wrote:
>> On 03/23/2016 06:57 AM, Leon Romanovsky wrote:
>> > On Sat, Mar 19, 2016 at 02:37:08PM -0700, Linus Torvalds wrote:
>> >> So the *best* situation would be:
>> >>
>> >>  - your two groups talk it over, and figure out what the common commits 
>> >> are
>> >>
>> >>  - you put those common commits as a "base" branch in git
>> >>
>> >>  - you ask the two upper-level maintainers to both pull that base branch
>> >>
>> >>  - you then make sure that you send the later patches (whether as
>> >> emailed patches or as pull requests) based on top of that base branch.
>> > 
>> > Hi David and Doug,
>> > 
>> > Are you OK with the approach suggested by Linus?
>> > We are eager to know it, so we will adopt it as soon
>> > as possible in our development flow.
>> > 
>> > The original thread [1].
>> > 
>> > [1] http://thread.gmane.org/gmane.linux.drivers.rdma/34907
>> > 
>> > Thanks.
>> > 
>> 
>> I'm fine with it.  Since I happen to use topic branches to build my
>> for-next from anyway, I might need to be the one that Dave pulls from
>> versus the other way around.
> 
> Resending to linux-netdev.
> 
> David,
> Can you please express your opinion about Linus's suggestion to
> eliminate merge conflicts in Mellanox related products?

Sure, sounds fine.


Re: [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu

2016-03-31 Thread Jason Wang


On 03/31/2016 06:32 PM, Eric Dumazet wrote:
> On Thu, 2016-03-31 at 13:50 +0800, Jason Wang wrote:
>> We use a union for napi_id and send_cpu, this is ok for most of the
>> cases except when we want to support busy polling for tun which needs
>> napi_id to be stored and passed to socket during tun_net_xmit(). In
>> this case, napi_id was overridden with sender_cpu before tun_net_xmit()
>> was called if XPS was enabled. Fixing by not using union for napi_id
>> and sender_cpu.
>>
>> Signed-off-by: Jason Wang 
>> ---
>>  include/linux/skbuff.h | 10 +-
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 15d0df9..8aee891 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -743,11 +743,11 @@ struct sk_buff {
>>  __u32   hash;
>>  __be16  vlan_proto;
>>  __u16   vlan_tci;
>> -#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
>> -union {
>> -unsigned intnapi_id;
>> -unsigned intsender_cpu;
>> -};
>> +#if defined(CONFIG_NET_RX_BUSY_POLL)
>> +unsigned intnapi_id;
>> +#endif
>> +#if defined(CONFIG_XPS)
>> +unsigned intsender_cpu;
>>  #endif
>>  union {
>>  #ifdef CONFIG_NETWORK_SECMARK
> Hmmm...
>
> This is a serious problem.
>
> Making skb bigger (8 bytes because of alignment) was not considered
> valid for sender_cpu introduction. We worked quite hard to avoid this,
> if you take a look at git history :(
>
> Can you describe more precisely the problem and code path ?
>

The problem is we want to support busy polling for tun. This needs
napi_id to be passed to the tun socket by sk_mark_napi_id() during
tun_net_xmit(). But before reaching this, XPS will have set sender_cpu,
which prevents us from seeing the correct napi_id.
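Eric's size concern can be demonstrated with a toy layout: splitting a union of
two u32s into separate fields grows the structure, and the pointer alignment of
the next member rounds the growth up to 8 bytes on 64-bit. This is a simplified
model with stand-in field names, not the real sk_buff layout:

```python
import ctypes

class Ids(ctypes.Union):
    # models the current union { napi_id; sender_cpu; }
    _fields_ = [("napi_id", ctypes.c_uint),
                ("sender_cpu", ctypes.c_uint)]

class SkbUnion(ctypes.Structure):
    # current layout: one 4-byte slot shared by both ids
    _fields_ = [("hash", ctypes.c_uint),     # stand-in neighbour field
                ("ids", Ids),
                ("sk", ctypes.c_void_p)]     # pointer forces alignment

class SkbSplit(ctypes.Structure):
    # layout after the patch: two separate 4-byte fields
    _fields_ = [("hash", ctypes.c_uint),
                ("napi_id", ctypes.c_uint),
                ("sender_cpu", ctypes.c_uint),
                ("sk", ctypes.c_void_p)]
```

On a 64-bit build the split structure is 8 bytes larger: 4 bytes for the extra
field plus 4 bytes of padding inserted before the pointer, which is exactly the
growth the sk_buff layout was tuned to avoid.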



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

On 01.04.2016 03:45, Eric Dumazet wrote:

On Thu, 2016-03-31 at 18:39 -0700, Eric Dumazet wrote:

On Fri, 2016-04-01 at 03:36 +0200, Hannes Frederic Sowa wrote:

On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:

Thanks.

As you can see, release_sock() messes badly lockdep (once your other
patches are in )

Once we properly fix release_sock() and/or __release_sock(), all these
false positives disappear.


This was a loopback connection. I need to study release_sock and
__release_sock more as I cannot currently see an issue with the lockdep
handling.


Okay, please try :

diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230..570dcd91d64e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2429,10 +2429,6 @@ EXPORT_SYMBOL(lock_sock_nested);

  void release_sock(struct sock *sk)
  {
-   /*
-* The sk_lock has mutex_unlock() semantics:
-*/
-   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);

	spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)
@@ -2445,6 +2441,10 @@ void release_sock(struct sock *sk)
sk->sk_prot->release_cb(sk);

sock_release_ownership(sk);
+   /*
+* The sk_lock has mutex_unlock() semantics:
+*/
+   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);
	spin_unlock_bh(&sk->sk_lock.slock);


Also take a look at commit c3f9b01849ef3bc69024990092b9f42e20df7797

We might need to include the mutex_release() in sock_release_ownership()


I thought so first, as well. But given the double check for the 
spin_lock and the "mutex" we end up with the same result for the 
lockdep_sock_is_held check.


Do you see other consequences?

Thanks,
Hannes



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

On 01.04.2016 03:39, Eric Dumazet wrote:

On Fri, 2016-04-01 at 03:36 +0200, Hannes Frederic Sowa wrote:

On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:

Thanks.

As you can see, release_sock() messes badly lockdep (once your other
patches are in )

Once we properly fix release_sock() and/or __release_sock(), all these
false positives disappear.


This was a loopback connection. I need to study release_sock and
__release_sock more as I cannot currently see an issue with the lockdep
handling.


Okay, please try :

diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230..570dcd91d64e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2429,10 +2429,6 @@ EXPORT_SYMBOL(lock_sock_nested);

  void release_sock(struct sock *sk)
  {
-   /*
-* The sk_lock has mutex_unlock() semantics:
-*/
-   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);

	spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)
@@ -2445,6 +2441,10 @@ void release_sock(struct sock *sk)
sk->sk_prot->release_cb(sk);

sock_release_ownership(sk);
+   /*
+* The sk_lock has mutex_unlock() semantics:
+*/
+   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);
	spin_unlock_bh(&sk->sk_lock.slock);



Looks much better with your patch already. I slowly begin to understand, 
this is really tricky... :)


Bye,
Hannes






RE: [PATCH] fec: Do not access unexisting register in Coldfire

2016-03-31 Thread Fugang Duan
From: Fabio Estevam  Sent: Thursday, March 31, 2016 11:05 PM
> To: da...@davemloft.net
> Cc: Fugang Duan ; troy.ki...@boundarydevices.com;
> g...@uclinux.org; netdev@vger.kernel.org; Fabio Estevam
> 
> Subject: [PATCH] fec: Do not access unexisting register in Coldfire
> 
> From: Fabio Estevam 
> 
> Commit 55cd48c821de ("net: fec: stop the "rcv is not +last, " error
> messages") introduces a write to a register that does not exist in Coldfire.
> 
> Move the FEC_FTRL register access inside the FEC_QUIRK_HAS_RACC 'if' block,
> so that we guarantee it will not be used on Coldfire CPUs.
> 
> Reported-by: Greg Ungerer 
> Signed-off-by: Fabio Estevam 
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 37c0815..08243c2 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -943,8 +943,8 @@ fec_restart(struct net_device *ndev)
>   else
>   val &= ~FEC_RACC_OPTIONS;
>   writel(val, fep->hwp + FEC_RACC);
> + writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>   }
> - writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>  #endif
> 
>   /*
> --
> 1.9.1

If you stick with doing it like this, you must add comments on the quirk flag 
FEC_QUIRK_HAS_RACC. 


Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Alexei Starovoitov
On Thu, Mar 31, 2016 at 06:19:52PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 02:21 +0200, Hannes Frederic Sowa wrote:
> 
> > 
> > [   31.064029] ===
> > [   31.064030] [ INFO: suspicious RCU usage. ]
> > [   31.064032] 4.5.0+ #13 Not tainted
> > [   31.064033] ---
> > [   31.064034] include/net/sock.h:1594 suspicious 
> > rcu_dereference_check() usage!
> > [   31.064035]
> > other info that might help us debug this:
> > 
> > [   31.064041]
> > rcu_scheduler_active = 1, debug_locks = 1
> > [   31.064042] no locks held by ssh/817.
> > [   31.064043]
> > stack backtrace:
> > [   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
> > [   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > BIOS 1.8.2-20150714_191134- 04/01/2014
> > [   31.064047]  0286 6730b46b 8800badf7bd0 
> > 81442b33
> > [   31.064050]  8800b8c78000 0001 8800badf7c00 
> > 8110ae75
> > [   31.064052]  880035ea2f00 8800b8e28000 0003 
> > 04c4
> > [   31.064054] Call Trace:
> > [   31.064058]  [] dump_stack+0x85/0xc2
> > [   31.064062]  [] lockdep_rcu_suspicious+0xc5/0x100
> > [   31.064064]  [] __sk_dst_check+0x77/0xb0
> > [   31.064066]  [] inet6_sk_rebuild_header+0x52/0x300
> > [   31.064068]  [] ? selinux_skb_peerlbl_sid+0x5e/0xa0
> > [   31.064070]  [] ? 
> > selinux_inet_conn_established+0x3e/0x40
> > [   31.064072]  [] tcp_finish_connect+0x4d/0x270
> > [   31.064074]  [] tcp_rcv_state_process+0x627/0xe40
> > [   31.064076]  [] tcp_v6_do_rcv+0xd4/0x410
> > [   31.064078]  [] release_sock+0x85/0x1c0
> > [   31.064079]  [] __inet_stream_connect+0x1c3/0x340
> > [   31.064081]  [] ? lock_sock_nested+0x49/0xb0
> > [   31.064083]  [] ? abort_exclusive_wait+0xb0/0xb0
> > [   31.064084]  [] inet_stream_connect+0x38/0x50
> > [   31.064086]  [] SYSC_connect+0xcf/0xf0
> > [   31.064088]  [] ? trace_hardirqs_on_caller+0x129/0x1b0
> > [   31.064090]  [] ? trace_hardirqs_on_thunk+0x1b/0x1d
> > [   31.064091]  [] SyS_connect+0xe/0x10
> > [   31.064094]  [] entry_SYSCALL_64_fastpath+0x1f/0xbd
> > 
> > Bye,
> > Hannes
> 
> Thanks.
> 
> As you can see, release_sock() messes badly lockdep (once your other
> patches are in )
> 
> Once we properly fix release_sock() and/or __release_sock(), all these
> false positives disappear.

+1. Nice catch.

Eric, what's your take on Hannes's patch 2 ?
Is it more accurate to ask lockdep to check for actual lock
or lockdep can rely on owned flag?
Potentially there could be races between setting the flag and
actual lock... but that code is contained, so unlikely.
Will we find the real issues with this 'stronger' check or
just spend a ton of time adapting to new model like your other
patch for release_sock and whatever may need to come next...



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Thu, 2016-03-31 at 18:39 -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 03:36 +0200, Hannes Frederic Sowa wrote:
> > On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:
> > > Thanks.
> > > 
> > > As you can see, release_sock() messes badly lockdep (once your other
> > > patches are in )
> > > 
> > > Once we properly fix release_sock() and/or __release_sock(), all these
> > > false positives disappear.
> > 
> > This was a loopback connection. I need to study release_sock and
> > __release_sock more as I cannot currently see an issue with the lockdep
> > handling.
> 
> Okay, please try :
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b67b9aedb230..570dcd91d64e 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2429,10 +2429,6 @@ EXPORT_SYMBOL(lock_sock_nested);
>  
>  void release_sock(struct sock *sk)
>  {
> - /*
> -  * The sk_lock has mutex_unlock() semantics:
> -  */
> - mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
> 
>   spin_lock_bh(&sk->sk_lock.slock);
>   if (sk->sk_backlog.tail)
> @@ -2445,6 +2441,10 @@ void release_sock(struct sock *sk)
>   sk->sk_prot->release_cb(sk);
>  
>   sock_release_ownership(sk);
> + /*
> +  * The sk_lock has mutex_unlock() semantics:
> +  */
> + mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
>   if (waitqueue_active(&sk->sk_lock.wq))
>   wake_up(&sk->sk_lock.wq);
>   spin_unlock_bh(&sk->sk_lock.slock);

Also take a look at commit c3f9b01849ef3bc69024990092b9f42e20df7797

We might need to include the mutex_release() in sock_release_ownership()

Thanks !






Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 03:36 +0200, Hannes Frederic Sowa wrote:
> On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:
> > Thanks.
> > 
> > As you can see, release_sock() messes badly lockdep (once your other
> > patches are in )
> > 
> > Once we properly fix release_sock() and/or __release_sock(), all these
> > false positives disappear.
> 
> This was a loopback connection. I need to study release_sock and
> __release_sock more as I cannot currently see an issue with the lockdep
> handling.

Okay, please try :

diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230..570dcd91d64e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2429,10 +2429,6 @@ EXPORT_SYMBOL(lock_sock_nested);
 
 void release_sock(struct sock *sk)
 {
-   /*
-* The sk_lock has mutex_unlock() semantics:
-*/
-   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);

	spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)
@@ -2445,6 +2441,10 @@ void release_sock(struct sock *sk)
sk->sk_prot->release_cb(sk);
 
sock_release_ownership(sk);
+   /*
+* The sk_lock has mutex_unlock() semantics:
+*/
+   mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);
	spin_unlock_bh(&sk->sk_lock.slock);
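The construction being patched here is a mutex built from a spinlock plus an
'owned' flag, and the fix moves the lockdep mutex_release() annotation so that
it happens after ownership is actually dropped (while the spinlock is still
held), not before the backlog is processed. A toy userspace model of that
construction, under deliberately simplified assumptions:

```python
import threading

class SockLock:
    """Toy model of the kernel socket lock: a spinlock (slock)
    protecting an 'owned' flag, plus a wait queue for sleepers.
    Together they give mutex_lock()/mutex_unlock() semantics."""
    def __init__(self):
        self.slock = threading.Lock()              # plays sk_lock.slock
        self.owned = False                         # plays sk_lock.owned
        self.wq = threading.Condition(self.slock)  # plays sk_lock.wq

    def lock_sock(self):
        with self.slock:
            while self.owned:        # someone else owns the socket
                self.wq.wait()       # sleep until released
            self.owned = True        # take ownership

    def release_sock(self):
        with self.slock:
            # (backlog processing would run here, still as the owner)
            self.owned = False       # sock_release_ownership()
            # the moved mutex_release() annotation belongs at this
            # point: ownership is gone, waiters not yet woken
            self.wq.notify()
```

With the annotation issued first (the old code), lockdep considers the "mutex"
released while this thread is still the owner processing the backlog, which is
what produced the false-positive splats earlier in the thread.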




Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa


On Fri, Apr 1, 2016, at 03:23, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 02:30 +0200, Hannes Frederic Sowa wrote:
> 
> > I fixed this one, I wait with review a bit and collapse some of the 
> > newer fixes into one and check and repost again tomorrow.
> 
> No change should be needed in TCP, once net/core/sock.c is fixed.
> 
> When someone fixes a lockdep issue, it should be mandatory to include in
> the changelog the lockdep splat, so that we can double check.

Yup, will do that. Every change will be a single patch with lockdep
report.

Thank you,
Hannes


RE: [PATCH] net: fec: stop the "rcv is not +last, " error messages

2016-03-31 Thread Fugang Duan
From: Fabio Estevam  Sent: Thursday, March 31, 2016 6:57 PM
> To: Fugang Duan 
> Cc: Greg Ungerer ; Troy Kisky
> ; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: fec: stop the "rcv is not +last, " error messages
> 
> Hi Andy,
> 
> On Wed, Mar 30, 2016 at 10:41 PM, Fugang Duan 
> wrote:
> >
> > Fabio, we cannot do it like this that may cause confused for the quirk flag
> "FEC_QUIRK_HAS_RACC".
> 
> We can treat FEC_QUIRK_HAS_RACC flag as "this is a non-Coldfire SoC".
> 

FEC_QUIRK_HAS_RACC means the HW supports "Receive Accelerator Function 
Configuration". Overloading its meaning will really confuse somebody.

To save trouble, if you treat the FEC_QUIRK_HAS_RACC flag as "this is a non-Coldfire 
SoC", you must add a comment on the flag define.

> >
> >
> > Hi, Greg,
> >
> > The header file fec.h defines FEC_FTRL as below. If the ColdFire SoC does
> > not have this register, we may remove the define here and define the
> > register according to SoC type. For example, if it is a ColdFire SoC,
> > define it as 0xFFF. Is that feasible?
> >
> 
> This is even worse IMHO. We should not write to a 'fake' register offset of 
> 0xFFF.

We can do it like this:

#if defined(CONFIG_ARM)
writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
#endif


Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa
On Fri, Apr 1, 2016, at 03:19, Eric Dumazet wrote:
> Thanks.
> 
> As you can see, release_sock() messes badly lockdep (once your other
> patches are in )
> 
> Once we properly fix release_sock() and/or __release_sock(), all these
> false positives disappear.

This was a loopback connection. I need to study release_sock and
__release_sock more as I cannot currently see an issue with the lockdep
handling.


Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 02:30 +0200, Hannes Frederic Sowa wrote:

> I fixed this one; I'll wait a bit with the review, collapse some of the
> newer fixes into one, recheck, and repost tomorrow.

No change should be needed in TCP, once net/core/sock.c is fixed.

When someone fixes a lockdep issue, it should be mandatory to include in
the changelog the lockdep splat, so that we can double check.

Thanks.




Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 02:21 +0200, Hannes Frederic Sowa wrote:

> 
> [   31.064029] ===
> [   31.064030] [ INFO: suspicious RCU usage. ]
> [   31.064032] 4.5.0+ #13 Not tainted
> [   31.064033] ---
> [   31.064034] include/net/sock.h:1594 suspicious 
> rcu_dereference_check() usage!
> [   31.064035]
> other info that might help us debug this:
> 
> [   31.064041]
> rcu_scheduler_active = 1, debug_locks = 1
> [   31.064042] no locks held by ssh/817.
> [   31.064043]
> stack backtrace:
> [   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
> [   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.8.2-20150714_191134- 04/01/2014
> [   31.064047]  0286 6730b46b 8800badf7bd0 
> 81442b33
> [   31.064050]  8800b8c78000 0001 8800badf7c00 
> 8110ae75
> [   31.064052]  880035ea2f00 8800b8e28000 0003 
> 04c4
> [   31.064054] Call Trace:
> [   31.064058]  [] dump_stack+0x85/0xc2
> [   31.064062]  [] lockdep_rcu_suspicious+0xc5/0x100
> [   31.064064]  [] __sk_dst_check+0x77/0xb0
> [   31.064066]  [] inet6_sk_rebuild_header+0x52/0x300
> [   31.064068]  [] ? selinux_skb_peerlbl_sid+0x5e/0xa0
> [   31.064070]  [] ? 
> selinux_inet_conn_established+0x3e/0x40
> [   31.064072]  [] tcp_finish_connect+0x4d/0x270
> [   31.064074]  [] tcp_rcv_state_process+0x627/0xe40
> [   31.064076]  [] tcp_v6_do_rcv+0xd4/0x410
> [   31.064078]  [] release_sock+0x85/0x1c0
> [   31.064079]  [] __inet_stream_connect+0x1c3/0x340
> [   31.064081]  [] ? lock_sock_nested+0x49/0xb0
> [   31.064083]  [] ? abort_exclusive_wait+0xb0/0xb0
> [   31.064084]  [] inet_stream_connect+0x38/0x50
> [   31.064086]  [] SYSC_connect+0xcf/0xf0
> [   31.064088]  [] ? trace_hardirqs_on_caller+0x129/0x1b0
> [   31.064090]  [] ? trace_hardirqs_on_thunk+0x1b/0x1d
> [   31.064091]  [] SyS_connect+0xe/0x10
> [   31.064094]  [] entry_SYSCALL_64_fastpath+0x1f/0xbd
> 
> Bye,
> Hannes

Thanks.

As you can see, release_sock() messes badly lockdep (once your other
patches are in )

Once we properly fix release_sock() and/or __release_sock(), all these
false positives disappear.





Re: Question on rhashtable in worst-case scenario.

2016-03-31 Thread Herbert Xu
On Thu, Mar 31, 2016 at 05:29:59PM +0200, Johannes Berg wrote:
> 
> Does removing this completely disable the "-EEXIST" error? I can't say
> I fully understand the elasticity stuff in __rhashtable_insert_fast().

What EEXIST error are you talking about? The only one that can be
returned on insertion is if you're explicitly checking for dups
which clearly can't be the case for you.

If you're talking about the EEXIST error due to a rehash then it is
completely hidden from you by rhashtable_insert_rehash.

If you actually meant EBUSY then yes this should prevent it from
occurring, unless your chain-length exceeds 2^32.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
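For context on the elasticity mechanism being discussed: rhashtable bounds
bucket chain length and forces a grow/rehash when a chain gets too long,
rather than failing the insert. A toy model of that policy (the constants are
illustrative; the real implementation also has to cope with the adversarial
worst case where every key hashes identically):

```python
class ChainedTable:
    ELASTICITY = 16          # illustrative chain-length bound

    def __init__(self, nbuckets=4):
        self.buckets = [[] for _ in range(nbuckets)]

    def _idx(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        chain = self.buckets[self._idx(key)]
        chain.append((key, value))
        if len(chain) > self.ELASTICITY:
            self._rehash()   # grow instead of rejecting the insert

    def _rehash(self):
        # double the bucket count and redistribute every entry
        old = [kv for chain in self.buckets for kv in chain]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in old:
            self.buckets[self._idx(k)].append((k, v))

    def lookup(self, key):
        for k, v in self.buckets[self._idx(key)]:
            if k == key:
                return v
        return None
```

If all keys collide on one hash value (the worst-case scenario in the subject
line), doubling the table never shortens the chain, so a real implementation
needs some further limit; that is the situation the EBUSY discussion above is
about.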


Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

On 01.04.2016 02:12, Eric Dumazet wrote:

Since this function does not take a reference on the dst, how could it
be safe ?

As soon as you exit the function, dst would possibly point to something
obsolete/freed.

This works because the caller owns the socket.

If you believe one path needs to be fixed, tell me which one it is.

Please ?


I fixed this one; I'll wait a bit with the review, collapse some of the 
newer fixes into one, recheck, and repost tomorrow.


Thanks for reviewing!

Bye,
Hannes



Re: [PATCH] sctp: avoid refreshing heartbeat timer too often

2016-03-31 Thread Marcelo Ricardo Leitner
On Thu, Mar 31, 2016 at 06:25:12PM -0300, 'Marcelo Ricardo Leitner' wrote:
> On Thu, Mar 31, 2016 at 11:16:52AM +, David Laight wrote:
> > From: Marcelo Ricardo Leitner
> > > Sent: 30 March 2016 13:13
> > > Em 30-03-2016 06:37, David Laight escreveu:
> > > > From: Marcelo Ricardo Leitner
> > > >> Sent: 29 March 2016 14:42
> > > >>
> > > >> Currently on high rate SCTP streams the heartbeat timer refresh can
> > > >> consume quite a lot of resources as timer updates are costly and it
> > > >> contains a random factor, which a) is also costly and b) invalidates
> > > >> mod_timer() optimization for not editing a timer to the same value.
> > > >> It may even cause the timer to be slightly advanced, for no good 
> > > >> reason.
> > > >
> > > > Interesting thoughts:
> > > > 1) Is it necessary to use a different 'random factor' until the timer 
> > > > actually
> > > > expires?
> > > 
> > > I don't understand you fully here, but we have to have a random factor
> > > on timer expire. As noted by Daniel Borkmann on his commit 8f61059a96c2
> > > ("net: sctp: improve timer slack calculation for transport HBs"):
> > 
> > When a HEARTBEAT chunk is sent determine the new interval, use that
> > interval until the timer actually expires when a new interval is
> > calculated. So the random number is only generated once per heartbeat.
> > 
> > >  RFC4960, section 8.3 says:
> > > 
> > >On an idle destination address that is allowed to heartbeat,
> > >it is recommended that a HEARTBEAT chunk is sent once per RTO
> > >of that destination address plus the protocol parameter
> > >'HB.interval', with jittering of +/- 50% of the RTO value,
> > >and exponential backoff of the RTO if the previous HEARTBEAT
> > >is unanswered.
> > > 
> > > Previous to his commit, it was using a random factor based on jiffies.
> > > 
> > > This patch then assumes that random_A+2 is just as random as random_B as
> > > long as it is within the allowed range, avoiding the unnecessary updates.
> > > 
> > > > 2) It might be better to allow the heartbeat timer to expire, on expiry 
> > > > work
> > > > out the new interval based on when the last 'refresh' was done.
> > > 
> > > Cool, I thought about this too. It would introduce some extra complexity
> > > that is not really worth I think, specially because now we may be doing
> > > more timer updates even with this patch but it's not triggering any wake
> > > ups and we would need at least 2 wake ups then: one for the first
> > > timeout event, and then re-schedule the timer for the next updated one,
> > > and maybe again, and again.. less timer updates but more wake ups, one
> > > at every heartbeat interval even on a busy transport. Seems it's cheaper
> > > to just update the timer then.
> > 
> > One wakeup per heartbeat interval on a busy connection is probably noise.
> > Probably much less than the 1000s of timer updates that would otherwise 
> > happen.
> 
> I was thinking more on near-idle systems, as the overhead for this
> refresh looked rather small now even for busy transports if compared to
> other points, a worth trade-off for reducing wake ups, imho.
> 
> But then I checked tcp, and it does what you're suggesting.
> I'll rework the patch. Thanks

This is what I'm getting with the new approach. I split
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, which is why sctp_transport_reset_t3_rtx
shows up in there; it never updates the timer, it only starts one if it's
not already running (as before). I ran netperf for 60 seconds this time, to
be sure that the timer would expire twice (first for initial path
validation and second for a pure heartbeat).

Samples: 230K of event 'cpu-clock', Event count (approx.): 5770725
  Overhead  Command  Shared Object  Symbol
+5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
+5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
-5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
  + 49,89% __wake_up_sync_key
  + 45,68% sctp_ulpq_tail_event
  - 2,85% mod_timer
 + 76,51% sctp_transport_reset_t3_rtx
 + 23,49% sctp_do_sm
  + 1,55% del_timer
+2,50%  netperf  [sctp] [k] sctp_datamsg_from_user
+2,26%  netperf  [sctp] [k] sctp_sendmsg

Doesn't seem much different from v1, but ok.

Also I could do some more cleanups on heartbeat/timer code.

  Marcelo



Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

On 01.04.2016 02:12, Eric Dumazet wrote:

On Fri, 2016-04-01 at 02:01 +0200, Hannes Frederic Sowa wrote:

Hi Eric,

On 01.04.2016 01:39, Eric Dumazet wrote:

On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:

Various fixes which were discovered by this changeset. More probably
to come...



Really ?

Again, TCP stack locks the socket most of the time.

The fact that lockdep does not understand this is not a reason to add
this overhead.


I don't see how lockdep does not understand this? I think I added the
appropriate helper to exactly verify if we have the socket lock with
lockdep, please have a look at lockdep_sock_is_held in patch #2.

Some of the patches also just reorder the rcu_read_lock, which is anyway
mostly free.


I strongly disagree. Adding rcu_read_lock() everywhere as you did gives
a sense of 'security' by merely shutting down maybe good signals.


I did those patches by adding the first patches and running tcp tests, 
so I have splats for every single one of those. I just didn't bother 
them into the changelog. I certainly can do so.



Your changelog explains nothing, and your patch makes absolutely no
sense to me.

Lets take following example :

  struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
  {
-   struct dst_entry *dst = __sk_dst_get(sk);
+   struct dst_entry *dst;

+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
 if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
 sk_tx_queue_clear(sk);
 RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
 dst_release(dst);
-   return NULL;
+   dst = NULL;
 }
+   rcu_read_unlock();

 return dst;
  }


Since this function does not take a reference on the dst, how could it
be safe ?


This was stupid of me, I am sorry, and thanks for pointing it out. I will
fix it. I have just been looking too long at all those lockdep reports.

I will find another fix for this.


As soon as you exit the function, dst would possibly point to something
obsolete/freed.

This works because the caller owns the socket.

If you believe one path needs to be fixed, tell me which one it is.

Please ?


[   31.064029] ===
[   31.064030] [ INFO: suspicious RCU usage. ]
[   31.064032] 4.5.0+ #13 Not tainted
[   31.064033] ---
[   31.064034] include/net/sock.h:1594 suspicious 
rcu_dereference_check() usage!

[   31.064035]
   other info that might help us debug this:

[   31.064041]
   rcu_scheduler_active = 1, debug_locks = 1
[   31.064042] no locks held by ssh/817.
[   31.064043]
   stack backtrace:
[   31.064045] CPU: 0 PID: 817 Comm: ssh Not tainted 4.5.0+ #13
[   31.064046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.8.2-20150714_191134- 04/01/2014
[   31.064047]  0286 6730b46b 8800badf7bd0 
81442b33
[   31.064050]  8800b8c78000 0001 8800badf7c00 
8110ae75
[   31.064052]  880035ea2f00 8800b8e28000 0003 
04c4

[   31.064054] Call Trace:
[   31.064058]  [] dump_stack+0x85/0xc2
[   31.064062]  [] lockdep_rcu_suspicious+0xc5/0x100
[   31.064064]  [] __sk_dst_check+0x77/0xb0
[   31.064066]  [] inet6_sk_rebuild_header+0x52/0x300
[   31.064068]  [] ? selinux_skb_peerlbl_sid+0x5e/0xa0
[   31.064070]  [] ? 
selinux_inet_conn_established+0x3e/0x40

[   31.064072]  [] tcp_finish_connect+0x4d/0x270
[   31.064074]  [] tcp_rcv_state_process+0x627/0xe40
[   31.064076]  [] tcp_v6_do_rcv+0xd4/0x410
[   31.064078]  [] release_sock+0x85/0x1c0
[   31.064079]  [] __inet_stream_connect+0x1c3/0x340
[   31.064081]  [] ? lock_sock_nested+0x49/0xb0
[   31.064083]  [] ? abort_exclusive_wait+0xb0/0xb0
[   31.064084]  [] inet_stream_connect+0x38/0x50
[   31.064086]  [] SYSC_connect+0xcf/0xf0
[   31.064088]  [] ? trace_hardirqs_on_caller+0x129/0x1b0
[   31.064090]  [] ? trace_hardirqs_on_thunk+0x1b/0x1d
[   31.064091]  [] SyS_connect+0xe/0x10
[   31.064094]  [] entry_SYSCALL_64_fastpath+0x1f/0xbd

Bye,
Hannes


Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 02:01 +0200, Hannes Frederic Sowa wrote:
> Hi Eric,
> 
> On 01.04.2016 01:39, Eric Dumazet wrote:
> > On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
> >> Various fixes which were discovered by this changeset. More probably
> >> to come...
> >
> >
> > Really ?
> >
> > Again, TCP stack locks the socket most of the time.
> >
> > The fact that lockdep does not understand this is not a reason to add
> > this overhead.
> 
> I don't see how lockdep does not understand this? I think I added the 
> appropriate helper to exactly verify if we have the socket lock with 
> lockdep, please have a look at lockdep_sock_is_held in patch #2.
> 
> Some of the patches also just reorder the rcu_read_lock, which is anyway 
> mostly free.

I strongly disagree. Adding rcu_read_lock() everywhere as you did gives
a sense of 'security' by merely shutting down maybe good signals.

Your changelog explains nothing, and your patch makes absolutely no
sense to me.

Lets take following example :

 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-   struct dst_entry *dst = __sk_dst_get(sk);
+   struct dst_entry *dst;
 
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
sk_tx_queue_clear(sk);
RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
dst_release(dst);
-   return NULL;
+   dst = NULL;
}
+   rcu_read_unlock();
 
return dst;
 }


Since this function does not take a reference on the dst, how could it
be safe ?

As soon as you exit the function, dst would possibly point to something
obsolete/freed.

This works because the caller owns the socket.

If you believe one path needs to be fixed, tell me which one it is.

Please ?




Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa

Hi Eric,

On 01.04.2016 01:39, Eric Dumazet wrote:

On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:

Various fixes which were discovered by this changeset. More probably
to come...



Really ?

Again, TCP stack locks the socket most of the time.

The fact that lockdep does not understand this is not a reason to add
this overhead.


I don't see how lockdep does not understand this? I think I added the 
appropriate helper to exactly verify if we have the socket lock with 
lockdep, please have a look at lockdep_sock_is_held in patch #2.


Some of the patches also just reorder the rcu_read_lock, which is anyway 
mostly free.


Bye,
Hannes



[PATCH net 6/4] tcp: fix __sk_dst_get usage in tcp_current_mss

2016-03-31 Thread Hannes Frederic Sowa
Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv4/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ba3621834e7bfa..3f70582578ada0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1426,21 +1426,21 @@ EXPORT_SYMBOL(tcp_sync_mss);
 unsigned int tcp_current_mss(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
-   const struct dst_entry *dst = __sk_dst_get(sk);
+   const struct dst_entry *dst;
u32 mss_now;
unsigned int header_len;
struct tcp_out_options opts;
struct tcp_md5sig_key *md5;
 
+   rcu_read_lock();
mss_now = tp->mss_cache;
-
+   dst = __sk_dst_get(sk);
if (dst) {
u32 mtu = dst_mtu(dst);
if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
mss_now = tcp_sync_mss(sk, mtu);
}
 
-   rcu_read_lock();
	header_len = tcp_established_options(sk, NULL, &opts, &md5) +
 sizeof(struct tcphdr);
rcu_read_unlock();
-- 
2.5.5



[PATCH net 5/4] tcp: fix rcu usage around __sk_dst_get in tcp_update_metrics

2016-03-31 Thread Hannes Frederic Sowa
Yet another fix that I overlooked.

Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv4/tcp_metrics.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 33a36648423e8b..196de79902819a 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -367,7 +367,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct 
sock *sk,
 void tcp_update_metrics(struct sock *sk)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
-   struct dst_entry *dst = __sk_dst_get(sk);
+   struct dst_entry *dst;
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
struct tcp_metrics_block *tm;
@@ -375,13 +375,14 @@ void tcp_update_metrics(struct sock *sk)
u32 val;
int m;
 
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
if (sysctl_tcp_nometrics_save || !dst)
-   return;
+   goto out_unlock;
 
if (dst->flags & DST_HOST)
dst_confirm(dst);
 
-   rcu_read_lock();
if (icsk->icsk_backoff || !tp->srtt_us) {
/* This session failed to estimate rtt. Why?
 * Probably, no packets returned in time.  Reset our
-- 
2.5.5



Re: qdisc spin lock

2016-03-31 Thread Michael Ma
I didn't really know that multiple qdiscs can be isolated using MQ so
that each txq can be associated with a particular qdisc. Also we don't
really have multiple interfaces...

With this MQ solution we'll still need to assign transmit queues to
different classes by doing some math on the bandwidth limit if I
understand correctly, which seems to be less convenient compared with
a solution purely within HTB.

I assume that with this solution I can still share qdisc among
multiple transmit queues - please let me know if this is not the case.

2016-03-31 15:16 GMT-07:00 Cong Wang :
> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma  wrote:
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> If your HTB classes don't share bandwidth, why do you still make them
> under the same hierarchy? IOW, you can just isolate them either with some
> other qdisc or just separated interfaces.


Re: [PATCH net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO

2016-03-31 Thread Yuchung Cheng
On Wed, Mar 30, 2016 at 8:35 PM, Willem de Bruijn
 wrote:
> On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
>  wrote:
>> From: Soheil Hassas Yeganeh 
>>
>> SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
>> to associate timestamps with send calls. For TCP connections,
>> tp->snd_una is used as the starting point to calculate
>> relative IDs.
>>
>> This socket option will fail if set before the handshake on a
>> passive TCP fast open connection with data in SYN or SYN/ACK,
>> since setsockopt requires the connection to be in the
>> ESTABLISHED state.
>>
>> To address these, instead of limiting the option to the
>> ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
>> long as the connection is not in LISTEN or CLOSE states.
>>
>> Signed-off-by: Soheil Hassas Yeganeh 
>
> Acked-by: Willem de Bruijn 
Acked-by: Yuchung Cheng 


Re: [PATCH] fec: Do not access unexisting register in Coldfire

2016-03-31 Thread Greg Ungerer
Thanks for taking care of that Fabio.

Regards
Greg


On 01/04/16 01:05, Fabio Estevam wrote:
> From: Fabio Estevam 
> 
> Commit 55cd48c821de ("net: fec: stop the "rcv is not +last, " error
> messages") introduces a write to a register that does not exist in
> Coldfire.
> 
> Move the FEC_FTRL register access inside the FEC_QUIRK_HAS_RACC 'if' block,
> so that we guarantee it will not be used on Coldfire CPUs.
> 
> Reported-by: Greg Ungerer 
> Signed-off-by: Fabio Estevam 
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c 
> b/drivers/net/ethernet/freescale/fec_main.c
> index 37c0815..08243c2 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -943,8 +943,8 @@ fec_restart(struct net_device *ndev)
>   else
>   val &= ~FEC_RACC_OPTIONS;
>   writel(val, fep->hwp + FEC_RACC);
> + writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>   }
> - writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
>  #endif
>  
>   /*
> 



Re: [PULL REQUEST] Please pull rdma.git

2016-03-31 Thread Leon Romanovsky
On Wed, Mar 23, 2016 at 09:37:54AM -0400, Doug Ledford wrote:
> On 03/23/2016 06:57 AM, Leon Romanovsky wrote:
> > On Sat, Mar 19, 2016 at 02:37:08PM -0700, Linus Torvalds wrote:
> >> So the *best* situation would be:
> >>
> >>  - your two groups talk it over, and figure out what the common commits are
> >>
> >>  - you put those common commits as a "base" branch in git
> >>
> >>  - you ask the two upper-level maintainers to both pull that base branch
> >>
> >>  - you then make sure that you send the later patches (whether as
> >> emailed patches or as pull requests) based on top of that base branch.
> > 
> > Hi David and Doug,
> > 
> > Are you OK with the approach suggested by Linus?
> > We are eager to know it, so we will adopt it as soon
> > as possible in our development flow.
> > 
> > The original thread [1].
> > 
> > [1] http://thread.gmane.org/gmane.linux.drivers.rdma/34907
> > 
> > Thanks.
> > 
> 
> I'm fine with it.  Since I happen to use topic branches to build my
> for-next from anyway, I might need to be the one that Dave pulls from
> versus the other way around.

Resending to linux-netdev.

David,
Can you please express your opinion about Linus's suggestion to
eliminate merge conflicts in Mellanox related products?

Thanks

> 
> -- 
> Doug Ledford 
>   GPG KeyID: 0E572FDD
> 
> 




Re: [PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Eric Dumazet
On Fri, 2016-04-01 at 01:29 +0200, Hannes Frederic Sowa wrote:
> Various fixes which were discovered by this changeset. More probably
> to come...


Really ?

Again, TCP stack locks the socket most of the time.

The fact that lockdep does not understand this is not a reason to add
this overhead.




Re: qdisc spin lock

2016-03-31 Thread Michael Ma
Thanks for the suggestion - I'll try the MQ solution out. It seems to
be able to solve the problem well with the assumption that bandwidth
can be statically partitioned.

2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer :
>
> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma  wrote:
>
>> I know this might be an old topic so bear with me – what we are facing
>> is that applications are sending small packets using hundreds of
>> threads so the contention on spin lock in __dev_xmit_skb increases the
>> latency of dev_queue_xmit significantly. We’re building a network QoS
>> solution to avoid interference of different applications using HTB.
>
> Yes, as you have noticed with HTB there is a single qdisc lock, and
> congestion obviously happens :-)
>
> It is possible with different tricks to make it scale.  I believe
> Google is using a variant of HTB, and it scales for them.  They have
> not open-sourced their modifications to HTB (which likely also involve
> a great deal of setup tricks).
>
> If your purpose is to limit traffic/bandwidth per "cloud" instance,
> then you can just use another TC setup structure.  Like using MQ and
> assigning a HTB per MQ queue (where the MQ queues are bound to each
> CPU/HW queue)... But you have to figure out this setup yourself...
>
>
>> But in this case when some applications send massive small packets in
>> parallel, the application to be protected will get its throughput
>> affected (because it’s doing synchronous network communication using
>> multiple threads and throughput is sensitive to the increased latency)
>>
>> Here is the profiling from perf:
>>
>> -  67.57%   iperf  [kernel.kallsyms] [k] _spin_lock
>>   - 99.94% dev_queue_xmit
>>   -  96.91% _spin_lock
>>  - 2.62%  __qdisc_run
>>   - 98.98% sch_direct_xmit
>> - 99.98% _spin_lock
>>
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> Yes, there is a lot to gain, but it is not easy ;-)
>
>> I must have oversimplified a lot of details since I’m not familiar
>> with the TC implementation at this point – just want to get your input
>> in terms of whether this is a worthwhile effort or there is something
>> fundamental that I’m not aware of. If this is just a matter of quite
>> some additional work, would also appreciate helping to outline the
>> required work here.
>>
>> Also would appreciate if there is any information about the latest
>> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>
> This article seems to be very low quality... spelling errors, only 5
> pages, no real code, etc.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
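Jesper's MQ-plus-HTB structure could look roughly like the following tc commands (device name, queue count, and rates are made-up assumptions, not a tested configuration — the real trick, as he notes, is binding the MQ queues to CPU/HW queues sensibly):

```shell
# Attach MQ as root; it exposes one class per hardware TX queue.
tc qdisc add dev eth0 root handle 100: mq

# Hang an independent HTB (with its own qdisc lock) under each MQ class,
# so contention is per-queue instead of on a single root HTB.
for i in 1 2 3 4; do
    tc qdisc add dev eth0 parent 100:$i handle $((i + 100)): htb default 10
    tc class add dev eth0 parent $((i + 100)): classid $((i + 100)):10 \
        htb rate 250mbit ceil 250mbit   # per-queue share of a 1gbit cap
done
```

The limitation discussed in the thread is visible here: the aggregate limit has to be statically partitioned across queues, since the per-queue HTB instances do not share tokens.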


Re: Best way to reduce system call overhead for tun device I/O?

2016-03-31 Thread Stephen Hemminger
On Fri, 1 Apr 2016 00:28:57 +0200
Guus Sliepen  wrote:

> On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:
> 
> > >> I'm trying to reduce system call overhead when reading/writing to/from a
> > >> tun device in userspace. [...] What would be the right way to do this?
> > >>
> > > Personally I think tun could benefit greatly if it were implemented as
> > > a socket instead of a character interface. One thing that could be much
> > > better is sending/receiving of metadata attached to the skb. For
> > > instance GSO data could be in ancillary data in a socket instead of
> > > inline with packet data as tun seems to be doing now.
> > 
> > Agreed.
> 
> Ok. So how should the userspace API work? Creating an AF_PACKET socket
> and then using a tun ioctl to create a tun interface and bind it to the
> socket?
> 
> int fd = socket(AF_PACKET, ...)
> struct ifreq ifr = {...};
> ioctl(fd, TUNSETIFF, );
> 

Rather than bodge AF_PACKET onto TUN, why not just create a new device type
and control it from something modern like netlink.



Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread Hannes Frederic Sowa

Hi Daniel,

On 31.03.2016 23:52, Daniel Borkmann wrote:

On 03/31/2016 09:48 PM, Hannes Frederic Sowa wrote:
[...]

Tightest solution would probably be to combine both patches.

bool called_by_tuntap;

old_fp = rcu_dereference_protected(sk->sk_filter, called_by_tuntap ?
lockdep_rtnl_is_held() : lockdep_sock_is_held(sk));


If I understand you correctly with combining them, you mean you'd still
need the API change to pass the bool 'called_by_tuntap' down, right?


I actually decided to simply lock the sockets, so that there will always
be a clear lock owner during the updates. I think this is cleaner. What
do you think?



If so, your main difference is, after all, to replace the
sock_owned_by_user()
with the lockdep_sock_is_held() construction instead, correct?


I just didn't do that part because we hold socket lock now.


But then, isn't it already sufficient when you pass the bool itself down
that 'demuxes' in this case between the sock_owned_by_user() vs
lockdep_rtnl_is_held() check?


I just sent out the patches, please have a look.

Thanks,
Hannes




[PATCH net 4/4] tcp: various missing rcu_read_lock around __sk_dst_get

2016-03-31 Thread Hannes Frederic Sowa
Various fixes which were discovered by this changeset. More probably
to come...

Signed-off-by: Hannes Frederic Sowa 
---
 include/net/tcp.h  |  5 -
 net/core/sock.c|  7 +--
 net/ipv4/tcp_input.c   | 18 ++
 net/ipv4/tcp_metrics.c | 12 +---
 net/ipv4/tcp_output.c  | 22 --
 5 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f61be64a..541c99bf633d4b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -647,11 +647,14 @@ static inline void tcp_fast_path_check(struct sock *sk)
 /* Compute the actual rto_min value */
 static inline u32 tcp_rto_min(struct sock *sk)
 {
-   const struct dst_entry *dst = __sk_dst_get(sk);
+   const struct dst_entry *dst;
u32 rto_min = TCP_RTO_MIN;
 
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
+   rcu_read_unlock();
return rto_min;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9aedb230f9..963d0ba7aa4232 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -486,14 +486,17 @@ EXPORT_SYMBOL(sk_receive_skb);
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-   struct dst_entry *dst = __sk_dst_get(sk);
+   struct dst_entry *dst;
 
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
sk_tx_queue_clear(sk);
RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
dst_release(dst);
-   return NULL;
+   dst = NULL;
}
+   rcu_read_unlock();
 
return dst;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f79ade821..8489e9e45f906c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -203,10 +203,16 @@ static void tcp_enter_quickack_mode(struct sock *sk)
 static bool tcp_in_quickack_mode(struct sock *sk)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
-   const struct dst_entry *dst = __sk_dst_get(sk);
+   const struct dst_entry *dst;
+   bool ret;
 
-   return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
-   (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
+   ret = (dst && dst_metric(dst, RTAX_QUICKACK)) ||
+ (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
+   rcu_read_unlock();
+
+   return ret;
 }
 
 static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
@@ -3674,9 +3680,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
tcp_process_tlp_ack(sk, ack, flag);
 
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
-   struct dst_entry *dst = __sk_dst_get(sk);
+   struct dst_entry *dst;
+
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
if (dst)
dst_confirm(dst);
+   rcu_read_unlock();
}
 
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 7b7eec43990692..33a36648423e8b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -488,22 +488,20 @@ out_unlock:
 
 void tcp_init_metrics(struct sock *sk)
 {
-   struct dst_entry *dst = __sk_dst_get(sk);
+   struct dst_entry *dst;
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_metrics_block *tm;
u32 val, crtt = 0; /* cached RTT scaled by 8 */
 
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
if (!dst)
goto reset;
-
dst_confirm(dst);
 
-   rcu_read_lock();
tm = tcp_get_metrics(sk, dst, true);
-   if (!tm) {
-   rcu_read_unlock();
+   if (!tm)
goto reset;
-   }
 
if (tcp_metric_locked(tm, TCP_METRIC_CWND))
tp->snd_cwnd_clamp = tcp_metric_get(tm, TCP_METRIC_CWND);
@@ -527,7 +525,6 @@ void tcp_init_metrics(struct sock *sk)
}
 
crtt = tcp_metric_get(tm, TCP_METRIC_RTT);
-   rcu_read_unlock();
 reset:
/* The initial RTT measurement from the SYN/SYN-ACK is not ideal
 * to seed the RTO for later data packets because SYN packets are
@@ -575,6 +572,7 @@ reset:
else
tp->snd_cwnd = tcp_init_cwnd(tp, dst);
tp->snd_cwnd_stamp = tcp_time_stamp;
+   rcu_read_unlock();
 }
 
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2dc015cd19a6..ba3621834e7bfa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -548,6 +548,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct 
sk_buff *skb,
struct tcp_fastopen_request *fastopen 

[PATCH net 3/4] sock: use lockdep_sock_is_held were appropriate

2016-03-31 Thread Hannes Frederic Sowa
Also make lockdep_sock_is_held accept const arguments, so we don't need to
modify too many callers.

Reported-by: Sasha Levin 
Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: Michal Kubecek 
Signed-off-by: Hannes Frederic Sowa 
---
 include/net/sock.h   | 7 ---
 net/dccp/ipv4.c  | 2 +-
 net/dccp/ipv6.c  | 2 +-
 net/ipv4/af_inet.c   | 2 +-
 net/ipv4/cipso_ipv4.c| 3 ++-
 net/ipv4/ip_sockglue.c   | 4 ++--
 net/ipv4/tcp_ipv4.c  | 8 +++-
 net/ipv6/ipv6_sockglue.c | 6 --
 net/ipv6/tcp_ipv6.c  | 2 +-
 net/socket.c | 2 +-
 10 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 30f9b5ad0a82ef..bbea02fdaaa0fd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1382,8 +1382,9 @@ static inline void unlock_sock_fast(struct sock *sk, bool 
slow)
	spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static bool lockdep_sock_is_held(struct sock *sk)
+static bool lockdep_sock_is_held(const struct sock *csk)
 {
+   struct sock *sk = (struct sock *)csk;
	return lockdep_is_held(&sk->sk_lock) ||
	       lockdep_is_held(&sk->sk_lock.slock);
 }
@@ -1589,8 +1590,8 @@ static inline void sk_rethink_txhash(struct sock *sk)
 static inline struct dst_entry *
 __sk_dst_get(struct sock *sk)
 {
-	return rcu_dereference_check(sk->sk_dst_cache, sock_owned_by_user(sk) ||
-				     lockdep_is_held(&sk->sk_lock.slock));
+   return rcu_dereference_check(sk->sk_dst_cache,
+lockdep_sock_is_held(sk));
 }
 
 static inline struct dst_entry *
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c67a961ba5382..0ea298d849383f 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -62,7 +62,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, 
int addr_len)
nexthop = daddr = usin->sin_addr.s_addr;
 
inet_opt = rcu_dereference_protected(inet->inet_opt,
-sock_owned_by_user(sk));
+lockdep_sock_is_held(sk));
if (inet_opt != NULL && inet_opt->opt.srr) {
if (daddr == 0)
return -EINVAL;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 4663a01d503991..6ea214fa5499c5 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -865,7 +865,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr 
*uaddr,
fl6.fl6_sport = inet->inet_sport;
	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
 
-   opt = rcu_dereference_protected(np->opt, sock_owned_by_user(sk));
+   opt = rcu_dereference_protected(np->opt, lockdep_sock_is_held(sk));
	final_p = fl6_update_dst(&fl6, opt, &final);
 
	dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbaef2..7e37ebb5af396e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1106,7 +1106,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
struct ip_options_rcu *inet_opt;
 
inet_opt = rcu_dereference_protected(inet->inet_opt,
-sock_owned_by_user(sk));
+lockdep_sock_is_held(sk));
if (inet_opt && inet_opt->opt.srr)
daddr = inet_opt->opt.faddr;
 
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index bdb2a07ec363b7..40d6b87713a132 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1933,7 +1933,8 @@ int cipso_v4_sock_setattr(struct sock *sk,
 
sk_inet = inet_sk(sk);
 
-   old = rcu_dereference_protected(sk_inet->inet_opt, 
sock_owned_by_user(sk));
+   old = rcu_dereference_protected(sk_inet->inet_opt,
+   lockdep_sock_is_held(sk));
if (sk_inet->is_icsk) {
sk_conn = inet_csk(sk);
if (old)
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 035ad645a8d9d8..cf073059192d99 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -635,7 +635,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
if (err)
break;
old = rcu_dereference_protected(inet->inet_opt,
-   sock_owned_by_user(sk));
+   lockdep_sock_is_held(sk));
if (inet->is_icsk) {
struct inet_connection_sock *icsk = inet_csk(sk);
 #if IS_ENABLED(CONFIG_IPV6)
@@ -1295,7 +1295,7 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
struct ip_options_rcu *inet_opt;
 
inet_opt = rcu_dereference_protected(inet->inet_opt,
-

[PATCH net 1/4] tun: add socket locking around sk_{attach,detach}_filter

2016-03-31 Thread Hannes Frederic Sowa
[Changelog stolen from Daniel Borkmann:]

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [] dump_stack+0x85/0xc8
  [   52.765784]  [] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [] sk_detach_filter+0x82/0x90
  [   52.765801]  [] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [] SyS_ioctl+0x79/0x90
  [   52.765858]  [] do_syscall_64+0x62/0x140
  [   52.765866]  [] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd,
TUN{ATTACH,DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in commit f91ff5b9ff52 ("net: sk_{detach|attach}_filter()
rcu fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in commit 994051625981 ("tun: socket
filter support"), tap filters are managed under RTNL lock from
__tun_chr_ioctl(). Thus the sock_owned_by_user(sk) doesn't apply in this
specific case and therefore triggers the false positive.

Simply holding the locks during the filter updates will fix this problem.

Fixes: 994051625981 ("tun: socket filter support")
Reported-by: Sasha Levin 
Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: Michal Kubecek 
Signed-off-by: Hannes Frederic Sowa 
---
 drivers/net/tun.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950617c36e..dccbbacbbc7f02 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -622,7 +622,11 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte
 
/* Re-attach the filter to persist device */
if (!skip_filter && (tun->filter_attached == true)) {
+   bool slow;
+
+   slow = lock_sock_fast(tfile->socket.sk);
err = sk_attach_filter(>fprog, tfile->socket.sk);
+   unlock_sock_fast(tfile->socket.sk, slow);
if (!err)
goto out;
}
@@ -1821,8 +1825,12 @@ static void tun_detach_filter(struct tun_struct *tun, int n)
struct tun_file *tfile;
 
for (i = 0; i < n; i++) {
+   bool slow;
+
tfile = rtnl_dereference(tun->tfiles[i]);
+   slow = lock_sock_fast(tfile->socket.sk);
sk_detach_filter(tfile->socket.sk);
+   unlock_sock_fast(tfile->socket.sk, slow);
}
 
tun->filter_attached = false;
@@ -1834,8 +1842,12 @@ static int tun_attach_filter(struct tun_struct *tun)
struct tun_file *tfile;
 
for (i = 0; i < tun->numqueues; i++) {
+   bool slow;
+
tfile = rtnl_dereference(tun->tfiles[i]);
+   slow = lock_sock_fast(tfile->socket.sk);
ret = sk_attach_filter(>fprog, tfile->socket.sk);
+   unlock_sock_fast(tfile->socket.sk, slow);
if (ret) {
tun_detach_filter(tun, i);
return ret;
-- 
2.5.5



[PATCH net 0/4] net: fix and tighten rcu dereference checks

2016-03-31 Thread Hannes Frederic Sowa
Only the first patch is really applicable for stable. It adds appropriate
socket locks so lockdep doesn't complain if tuntap's ioctls modify the
filters on the socket.

Rest of the patches tighten the rcu dereference socket lock checks.

Last patch fixes missing rcu_read_locks which were discovered by this
change. Certainly there are more to come.

Hannes Frederic Sowa (4):
  tun: add socket locking around sk_{attach,detach}_filter
  net: proper check if we hold the socket lock during dereference
   sock: use lockdep_sock_is_held where appropriate
  tcp: various missing rcu_read_lock around __sk_dst_get

 drivers/net/tun.c| 12 
 include/net/sock.h   | 10 --
 include/net/tcp.h|  5 -
 net/core/filter.c|  6 +++---
 net/core/sock.c  |  7 +--
 net/dccp/ipv4.c  |  2 +-
 net/dccp/ipv6.c  |  2 +-
 net/ipv4/af_inet.c   |  2 +-
 net/ipv4/cipso_ipv4.c|  3 ++-
 net/ipv4/ip_sockglue.c   |  4 ++--
 net/ipv4/tcp_input.c | 18 ++
 net/ipv4/tcp_ipv4.c  |  8 +++-
 net/ipv4/tcp_metrics.c   | 12 +---
 net/ipv4/tcp_output.c| 22 --
 net/ipv6/ipv6_sockglue.c |  6 --
 net/ipv6/tcp_ipv6.c  |  2 +-
 net/socket.c |  2 +-
 17 files changed, 87 insertions(+), 36 deletions(-)

-- 
2.5.5



[PATCH net 2/4] net: proper check if we hold the socket lock during dereference

2016-03-31 Thread Hannes Frederic Sowa
lockdep_sock_is_held makes sure that we currently own the
lock. sock_owned_by_user simply checks if a user holds the socket. This
could lead to non-deterministic lock checks.

Reported-by: Sasha Levin 
Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: Michal Kubecek 
Signed-off-by: Hannes Frederic Sowa 
---
 include/net/sock.h | 5 +
 net/core/filter.c  | 6 +++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b73..30f9b5ad0a82ef 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1382,6 +1382,11 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
spin_unlock_bh(>sk_lock.slock);
 }
 
+static bool lockdep_sock_is_held(struct sock *sk)
+{
+   return lockdep_is_held(>sk_lock) ||
+  lockdep_is_held(>sk_lock.slock);
+}
 
 struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
  struct proto *prot, int kern);
diff --git a/net/core/filter.c b/net/core/filter.c
index 4b81b71171b4ce..e8486ba601eae7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1166,7 +1166,7 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
}
 
old_fp = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+  lockdep_sock_is_held(sk));
rcu_assign_pointer(sk->sk_filter, fp);
 
if (old_fp)
@@ -2259,7 +2259,7 @@ int sk_detach_filter(struct sock *sk)
return -EPERM;
 
filter = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+  lockdep_sock_is_held(sk));
if (filter) {
RCU_INIT_POINTER(sk->sk_filter, NULL);
sk_filter_uncharge(sk, filter);
@@ -2279,7 +2279,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 
lock_sock(sk);
filter = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+  lockdep_sock_is_held(sk));
if (!filter)
goto out;
 
-- 
2.5.5






Re: possible bug in latest network tree

2016-03-31 Thread Cong Wang
On Wed, Mar 30, 2016 at 2:05 PM, Light, John J  wrote:
>
> Maybe the problem is that the default shouldn't be to a TCP value, but should 
> be a 'transport' value.
>

The code is fine because net->ipv4 is always there, it doesn't
depend on any CONFIG, so it is safe for dccp to use too, although
I agree it is a bit ugly.


Re: Best way to reduce system call overhead for tun device I/O?

2016-03-31 Thread Guus Sliepen
On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:

> >> I'm trying to reduce system call overhead when reading/writing to/from a
> >> tun device in userspace. [...] What would be the right way to do this?
> >>
> > Personally I think tun could benefit greatly if it were implemented as
> > a socket instead of character interface. One thing that could be much
> > better is sending/receiving of meta data attached to skbuf. For
> > instance GSO data could be in ancillary data in a socket instead of
> > inline with packet data as tun seems to be doing now.
> 
> Agreed.

Ok. So how should the userspace API work? Creating an AF_PACKET socket
and then using a tun ioctl to create a tun interface and bind it to the
socket?

int fd = socket(AF_PACKET, ...)
struct ifreq ifr = {...};
ioctl(fd, TUNSETIFF, &ifr);

-- 
Met vriendelijke groet / with kind regards,
 Guus Sliepen 


[PATCH v2 net-next] net: ipv4: Consider unreachable nexthops in multipath routes

2016-03-31 Thread David Ahern
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to have failed.

Example:

 [h2]   [h3]   15.0.0.5
  |  |
 3| 3|
[SP1]  [SP2]--+
 1  2   1 2
 |  | /-+ |
 |   \   /|
 | X  |
 |/ \ |
 |   /   \---\|
 1  2 1   2
 12.0.0.2  [TOR1] 3-3 [TOR2] 12.0.0.3
 4 4
  \   /
\/
 \  /
  ---|   |-/
 1   2
[TOR3]
  3|
   |
  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2  dev swp1 weight 1
nexthop via 12.0.0.3  dev swp1 weight 1
...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 is down, then tor1 is effectively cut off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1  FAILED
...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern 
---
v2
- use rcu locking to avoid refcnts per Eric's suggestion
- only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
  comment
- drop the 'state == NUD_REACHABLE' from the state check since it is
  part of NUD_VALID (comment from Julian)
- wrapped the use of the neigh in a sysctl

 Documentation/networking/ip-sysctl.txt | 10 ++
 include/net/netns/ipv4.h   |  3 +++
 net/ipv4/fib_semantics.c   | 32 ++--
 net/ipv4/sysctl_net_ipv4.c | 11 +++
 4 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b183e2b606c8..5b316d33a23f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN
fwmark of the packet they are replying to.
Default: 0
 
+fib_multipath_use_neigh - BOOLEAN
+   Use status of existing neighbor entry when determining nexthop for
+   multipath routes. If disabled, neighbor information is not used and
+   packets could be directed to a dead nexthop. Only valid for kernels
+   built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+   Default: 0 (disabled)
+   Possible values:
+   0 - disabled
+   1 - enabled
+
 route/max_size - INTEGER
Maximum number of routes allowed in the kernel.  Increase
this when using large numbers of interfaces and/or routes.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a69cde3ce460..d061ffeb1e71 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -133,6 +133,9 @@ struct netns_ipv4 {
struct fib_rules_ops*mr_rules_ops;
 #endif
 #endif
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+   int sysctl_fib_multipath_use_neigh;
+#endif
atomic_trt_genid;
 };
 #endif
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e8ff10..6d423faff0ce 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1559,17 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
+static bool fib_good_nh(const struct fib_nh *nh, struct net_device *dev)
+{
+   struct neighbour *n = NULL;
+   int state = NUD_NONE;
+
+   if (nh->nh_scope == RT_SCOPE_LINK) {
+   rcu_read_lock_bh();
+
+   n = __neigh_lookup_noref(_tbl, >nh_gw, dev);
+   if (n)
+   state = n->nud_state;
+
+ 

Re: qdisc spin lock

2016-03-31 Thread Cong Wang
On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma  wrote:
> As far as I understand the design of TC is to simplify locking schema
> and minimize the work in __qdisc_run so that throughput won’t be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn’t really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing in avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?

If your HTB classes don't share bandwidth, why do you still make them
under the same hierarchy? IOW, you can just isolate them either with some
other qdisc or just separated interfaces.


Re: [net PATCH v2] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet

2016-03-31 Thread Alexander Duyck
On Thu, Mar 31, 2016 at 3:04 PM, Jesse Brandeburg
 wrote:
> On Wed, 30 Mar 2016 16:15:37 -0700
> Alexander Duyck  wrote:
>
>> This patch addresses a bug introduced based on my interpretation of the
>> XL710 datasheet.  Specifically section 8.4.1 states that "A single transmit
>> packet may span up to 8 buffers (up to 8 data descriptors per packet
>> including both the header and payload buffers)."  It then later goes on to
>> say that each segment for a TSO obeys the previous rule, however it then
>> refers to TSO header and the segment payload buffers.
>>
>> I believe the actual limit for fragments with TSO and a skbuff that has
>> payload data in the header portion of the buffer is actually only 7
>> fragments as the skb->data portion counts as 2 buffers, one for the TSO
>> header, and one for a segment payload buffer.
>>
>> Fixes: 2d37490b82af ("i40e/i40evf: Rewrite logic for 8 descriptor per packet check")
>> Signed-off-by: Alexander Duyck 
>> ---
>>
>> v2: I realized that I overlooked the check in the inline function and as a
>> result we were still allowing for cases where 8 descriptors were being
>> used per packet and this would result in 9 DMA buffers.  I updated the
>> code so that we only allow 8 in the case of a single send, otherwise we
>> go into the function that walks the frags to verify each block.
>>
>> I have tested this using rds-stress and it seems to run traffic without
>> throwing any errors.
>
> Looking like it is working for me too with at least the PF.

I was testing PF <-> VF in my environment so I think I ended up
covering both in my test at least.

- Alex


Re: [net PATCH v2] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet

2016-03-31 Thread Jesse Brandeburg
On Wed, 30 Mar 2016 16:15:37 -0700
Alexander Duyck  wrote:

> This patch addresses a bug introduced based on my interpretation of the
> XL710 datasheet.  Specifically section 8.4.1 states that "A single transmit
> packet may span up to 8 buffers (up to 8 data descriptors per packet
> including both the header and payload buffers)."  It then later goes on to
> say that each segment for a TSO obeys the previous rule, however it then
> refers to TSO header and the segment payload buffers.
> 
> I believe the actual limit for fragments with TSO and a skbuff that has
> payload data in the header portion of the buffer is actually only 7
> fragments as the skb->data portion counts as 2 buffers, one for the TSO
> header, and one for a segment payload buffer.
> 
> Fixes: 2d37490b82af ("i40e/i40evf: Rewrite logic for 8 descriptor per packet check")
> Signed-off-by: Alexander Duyck 
> ---
> 
> v2: I realized that I overlooked the check in the inline function and as a
> result we were still allowing for cases where 8 descriptors were being
> used per packet and this would result in 9 DMA buffers.  I updated the
> code so that we only allow 8 in the case of a single send, otherwise we
> go into the function that walks the frags to verify each block.
> 
> I have tested this using rds-stress and it seems to run traffic without
> throwing any errors.

Looking like it is working for me too with at least the PF.

Acked-by: Jesse Brandeburg 

Should also add:
Reported-by: Sowmini Varadhan 


Re: [PATCH net-next 0/8] add TX timestamping via cmsg

2016-03-31 Thread Martin KaFai Lau
On Wed, Mar 30, 2016 at 11:50:31PM -0400, Willem de Bruijn wrote:
> >> Nice patches!
>
> This does not yet solve the append issue that your MSG_EOR patch
> addresses, of course.
Yes.  I have been thinking about both approaches.

>
> The straightforward jump to new_segment that I proposed as
> simplification is more properly a "start-of-record" than
> "end-of-record" signal. It is probably preferable to indeed be able to
> pass EOR as signal that the last skb must not be appended to in
> subsequent calls.
I suspect we could do better than only checking MSG_EOR and jump.
Before entering the loop, we may be able to check if the
last-not-yet-written out skb has tskey set and the current
tcp_sendmsg may need to overwrite it before jumping.

Also, the 2nd sendmsg may not be called with MSG_EOR set but
the per-write-knob is turned on.  It could overwrite the
1st sendmsg with both per-write-knob on and MSG_EOR set.

Note that there is another collapse-case during tcp retrans
where the MSG_EOR bit is already lost.

Hence, having EOR passed as signal (as you mentioned) and stored
is needed.

I think another bit in the TCP_SKB_CB may help here.
The semantic of this bit could be 'no skb merge under some rare conditions'.
For now, it is limited to tskey.

[Another side note is, the split/fragment case should be fine as it is.
 When splitting a skb into two smaller skbs, the tskey should fall
 into one of them, with no information loss.]

> I think that the record bounds issue is best solved independently from
> the interface for intermittent timestamps because
I understand that users may want to selectively do timestamping on a
particular sendmsg (per-write-knob), while not caring if the tskey
will be overwritten (no-tskey-overwritten) by the future
sendmsg/retrans.  Separating them gives the end-user a choice.

On the other hand, if the caller has specifically asked to do tstamp in
a particular tcp_sendmsg, it is a strong intention that the caller wants to
specifically track this message alone and not expecting this tstmap to
include anything else sent after it.  Beside, TLS user needs to make more
effort to pass the per-write-knob to tcp_sendmsg.  Hence, when per-write-knob
is used, I think the no-tskey-overwritten should be at least allowed (if
not the default).

> (a) changing the tcp
> bytestream packetization for timestamping introduces subtle
> differences between tracked and untracked data that are not always
> acceptable and (b) EOR can also be useful outside timestamps. A
> zerocopy sendmsg patchset that I sent for RFC last year encountered a
> similar requirement, to give one example: each skb with user data must
> point to a completion notification structure (ubuf_info), and can only
> point to one at a time. Appends that cause a conflict in skb->uarg
> pointers had to be blocked, at the cost of possibly different
> packetization compared to regular sends.
With no-tskey-overwritten turned on in production, we haven't found the
end-user's performance to be impacted in a meaningful way. I believe
it is the proper tradeoff as an opt-in feature to get measurement
data for existing tcp-socket users without adding a lot of burden
to the TCP stack which is a byte-oriented socket to begin with.


Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread Daniel Borkmann

On 03/31/2016 09:48 PM, Hannes Frederic Sowa wrote:
[...]

Tightest solution would probably be to combine both patches.

bool called_by_tuntap;

old_fp = rcu_dereference_protected(sk->sk_filter, called_by_tuntap ? 
lockdep_rtnl_is_held() : lockdep_sock_is_held());


If I understand you correctly with combining them, you mean you'd still
need the API change to pass the bool 'called_by_tuntap' down, right?

If so, your main difference is, after all, to replace the sock_owned_by_user()
with the lockdep_sock_is_held() construction instead, correct?

But then, isn't it already sufficient when you pass the bool itself down
that 'demuxes' in this case between the sock_owned_by_user() vs
lockdep_rtnl_is_held() check?

Thanks,
Daniel


Re: [PATCH] sctp: avoid refreshing heartbeat timer too often

2016-03-31 Thread 'Marcelo Ricardo Leitner'
On Thu, Mar 31, 2016 at 11:16:52AM +, David Laight wrote:
> From: Marcelo Ricardo Leitner
> > Sent: 30 March 2016 13:13
> > Em 30-03-2016 06:37, David Laight escreveu:
> > > From: Marcelo Ricardo Leitner
> > >> Sent: 29 March 2016 14:42
> > >>
> > >> Currently on high rate SCTP streams the heartbeat timer refresh can
> > >> consume quite a lot of resources as timer updates are costly and it
> > >> contains a random factor, which a) is also costly and b) invalidates
> > >> mod_timer() optimization for not editing a timer to the same value.
> > >> It may even cause the timer to be slightly advanced, for no good reason.
> > >
> > > Interesting thoughts:
> > > 1) Is it necessary to use a different 'random factor' until the timer 
> > > actually
> > > expires?
> > 
> > I don't understand you fully here, but we have to have a random factor
> > on timer expire. As noted by Daniel Borkmann on his commit 8f61059a96c2
> > ("net: sctp: improve timer slack calculation for transport HBs"):
> 
> When a HEARTBEAT chunk is sent determine the new interval, use that
> interval until the timer actually expires when a new interval is
> calculated. So the random number is only generated once per heartbeat.
> 
> >  RFC4960, section 8.3 says:
> > 
> >On an idle destination address that is allowed to heartbeat,
> >it is recommended that a HEARTBEAT chunk is sent once per RTO
> >of that destination address plus the protocol parameter
> >'HB.interval', with jittering of +/- 50% of the RTO value,
> >and exponential backoff of the RTO if the previous HEARTBEAT
> >is unanswered.
> > 
> > Previous to his commit, it was using a random factor based on jiffies.
> > 
> > This patch then assumes that random_A+2 is just as random as random_B as
> > long as it is within the allowed range, avoiding the unnecessary updates.
> > 
> > > 2) It might be better to allow the heartbeat timer to expire, on expiry 
> > > work
> > > out the new interval based on when the last 'refresh' was done.
> > 
> > Cool, I thought about this too. It would introduce some extra complexity
> > that is not really worth I think, specially because now we may be doing
> > more timer updates even with this patch but it's not triggering any wake
> > ups and we would need at least 2 wake ups then: one for the first
> > timeout event, and then re-schedule the timer for the next updated one,
> > and maybe again, and again.. less timer updates but more wake ups, one
> > at every heartbeat interval even on a busy transport. Seems it's cheaper
> > to just update the timer then.
> 
> One wakeup per heartbeat interval on a busy connection is probably noise.
> Probably much less than the 1000s of timer updates that would otherwise 
> happen.

I was thinking more on near-idle systems, as the overhead for this
refresh looked rather small now even for busy transports if compared to
other points, a worthwhile trade-off for reducing wake ups, imho.

But then I checked tcp, and it does what you're suggesting.
I'll rework the patch. Thanks

> A further optimisation would be to restart the timer if more than (say) 80%
> of the way through the timeout period.
> 
> Similarly the HEARTBEAT could be sent if the 2nd wakeup would be almost 
> immediate.





Re: Best way to reduce system call overhead for tun device I/O?

2016-03-31 Thread David Miller
From: Tom Herbert 
Date: Thu, 31 Mar 2016 17:18:48 -0400

> On Tue, Mar 29, 2016 at 6:40 PM, Guus Sliepen  wrote:
>> I'm trying to reduce system call overhead when reading/writing to/from a
>> tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
>> but a tun fd is not a socket fd, so this doesn't work. I see several
>> options to allow userspace to read/write multiple packets with one
>> syscall:
>>
>> - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
>>   sockets.
>>
>> - Implement a ioctl() to emulate sendmmsg()/recvmmsg().
>>
>> - Add a flag that can be set using TUNSETIFF that makes regular
>>   read()/write() calls handle multiple packets in one go.
>>
>> - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
>>   be used. There is tun_get_socket() which is used internally in the
>>   kernel, but this is not exposed to userspace, and doesn't look trivial
>>   to do either.
>>
>> What would be the right way to do this?
>>
> Personally I think tun could benefit greatly if it were implemented as
> a socket instead of character interface. One thing that could be much
> better is sending/receiving of meta data attached to skbuf. For
> instance GSO data could be in ancillary data in a socket instead of
> inline with packet data as tun seems to be doing now.

Agreed.


Re: [PATCH 1/1] net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card

2016-03-31 Thread David Miller
From: Bjørn Mork 
Date: Thu, 31 Mar 2016 23:07:30 +0200

> David Miller  writes:
>> From: Daniele Palmas 
>> Date: Thu, 31 Mar 2016 15:16:47 +0200
>>
>>> Telit LE910 V2 is a mobile broadband card with no ARP capabilities:
>>> the patch makes this device use the wwan_noarp_info struct
>>> 
>>> Signed-off-by: Daniele Palmas 
>>
>> Bjorn, can you take a quick look at this?
> 
> Looks fine to me.
> 
> Reviewed-by: Bjørn Mork 

Thanks, applied.


Re: Best way to reduce system call overhead for tun device I/O?

2016-03-31 Thread Tom Herbert
On Tue, Mar 29, 2016 at 6:40 PM, Guus Sliepen  wrote:
> I'm trying to reduce system call overhead when reading/writing to/from a
> tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
> but a tun fd is not a socket fd, so this doesn't work. I see several
> options to allow userspace to read/write multiple packets with one
> syscall:
>
> - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
>   sockets.
>
> - Implement a ioctl() to emulate sendmmsg()/recvmmsg().
>
> - Add a flag that can be set using TUNSETIFF that makes regular
>   read()/write() calls handle multiple packets in one go.
>
> - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
>   be used. There is tun_get_socket() which is used internally in the
>   kernel, but this is not exposed to userspace, and doesn't look trivial
>   to do either.
>
> What would be the right way to do this?
>
Personally I think tun could benefit greatly if it were implemented as
a socket instead of character interface. One thing that could be much
better is sending/receiving of meta data attached to skbuf. For
instance GSO data could be in ancillary data in a socket instead of
inline with packet data as tun seems to be doing now.

Tom

> --
> Met vriendelijke groet / with kind regards,
>  Guus Sliepen 


Re: [PATCH 1/1] net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card

2016-03-31 Thread Bjørn Mork
David Miller  writes:
> From: Daniele Palmas 
> Date: Thu, 31 Mar 2016 15:16:47 +0200
>
>> Telit LE910 V2 is a mobile broadband card with no ARP capabilities:
>> the patch makes this device use the wwan_noarp_info struct
>> 
>> Signed-off-by: Daniele Palmas 
>
> Bjorn, can you take a quick look at this?

Looks fine to me.

Reviewed-by: Bjørn Mork 


Bjørn


[PATCH net-next v2 2/6] net: dsa: mv88e6xxx: protect FID registers access

2016-03-31 Thread Vivien Didelot
Only switch families with 4096 address databases have dedicated FID
registers for ATU and VTU operations.

Factorize the access to the GLOBAL_ATU_FID register and introduce a
mv88e6xxx_has_fid_reg() helper function to protect the access to
GLOBAL_ATU_FID and GLOBAL_VTU_FID.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 19a8208..cc066dc 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -482,6 +482,16 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds)
return false;
 }
 
+static bool mv88e6xxx_has_fid_reg(struct dsa_switch *ds)
+{
+   /* Does the device have dedicated FID registers for ATU and VTU ops? */
+   if (mv88e6xxx_6097_family(ds) || mv88e6xxx_6165_family(ds) ||
+   mv88e6xxx_6351_family(ds) || mv88e6xxx_6352_family(ds))
+   return true;
+
+   return false;
+}
+
 static bool mv88e6xxx_has_stu(struct dsa_switch *ds)
 {
/* Does the device have STU and dedicated SID registers for VTU ops? */
@@ -961,10 +971,16 @@ out:
return ret;
 }
 
-static int _mv88e6xxx_atu_cmd(struct dsa_switch *ds, u16 cmd)
+static int _mv88e6xxx_atu_cmd(struct dsa_switch *ds, u16 fid, u16 cmd)
 {
int ret;
 
+   if (mv88e6xxx_has_fid_reg(ds)) {
+   ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_ATU_FID, fid);
+   if (ret < 0)
+   return ret;
+   }
+
ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_ATU_OP, cmd);
if (ret < 0)
return ret;
@@ -1011,11 +1027,6 @@ static int _mv88e6xxx_atu_flush_move(struct dsa_switch *ds,
return err;
 
if (entry->fid) {
-   err = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_ATU_FID,
-  entry->fid);
-   if (err)
-   return err;
-
op = static_too ? GLOBAL_ATU_OP_FLUSH_MOVE_ALL_DB :
GLOBAL_ATU_OP_FLUSH_MOVE_NON_STATIC_DB;
} else {
@@ -1023,7 +1034,7 @@ static int _mv88e6xxx_atu_flush_move(struct dsa_switch *ds,
GLOBAL_ATU_OP_FLUSH_MOVE_NON_STATIC;
}
 
-   return _mv88e6xxx_atu_cmd(ds, op);
+   return _mv88e6xxx_atu_cmd(ds, entry->fid, op);
 }
 
 static int _mv88e6xxx_atu_flush(struct dsa_switch *ds, u16 fid, bool static_too)
@@ -1331,8 +1342,7 @@ static int _mv88e6xxx_vtu_getnext(struct dsa_switch *ds,
if (ret < 0)
return ret;
 
-   if (mv88e6xxx_6097_family(ds) || mv88e6xxx_6165_family(ds) ||
-   mv88e6xxx_6351_family(ds) || mv88e6xxx_6352_family(ds)) {
+   if (mv88e6xxx_has_fid_reg(ds)) {
ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL,
  GLOBAL_VTU_FID);
if (ret < 0)
@@ -1429,7 +1439,9 @@ static int _mv88e6xxx_vtu_loadpurge(struct dsa_switch *ds,
ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_VTU_SID, reg);
if (ret < 0)
return ret;
+   }
 
+   if (mv88e6xxx_has_fid_reg(ds)) {
reg = entry->fid & GLOBAL_VTU_FID_MASK;
ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_VTU_FID, reg);
if (ret < 0)
@@ -1976,11 +1988,7 @@ static int _mv88e6xxx_atu_load(struct dsa_switch *ds,
if (ret < 0)
return ret;
 
-   ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_ATU_FID, entry->fid);
-   if (ret < 0)
-   return ret;
-
-   return _mv88e6xxx_atu_cmd(ds, GLOBAL_ATU_OP_LOAD_DB);
+   return _mv88e6xxx_atu_cmd(ds, entry->fid, GLOBAL_ATU_OP_LOAD_DB);
 }
 
 static int _mv88e6xxx_port_fdb_load(struct dsa_switch *ds, int port,
@@ -2063,11 +2071,7 @@ static int _mv88e6xxx_atu_getnext(struct dsa_switch *ds, u16 fid,
if (ret < 0)
return ret;
 
-   ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_ATU_FID, fid);
-   if (ret < 0)
-   return ret;
-
-   ret = _mv88e6xxx_atu_cmd(ds, GLOBAL_ATU_OP_GET_NEXT_DB);
+   ret = _mv88e6xxx_atu_cmd(ds, fid, GLOBAL_ATU_OP_GET_NEXT_DB);
if (ret < 0)
return ret;
 
-- 
2.7.4



[PATCH net-next v2 1/6] net: dsa: mv88e6xxx: protect SID register access

2016-03-31 Thread Vivien Didelot
Introduce a mv88e6xxx_has_stu() helper to protect the access to the
GLOBAL_VTU_SID register, instead of checking switch families.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fa086e0..19a8208 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -482,6 +482,16 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds)
return false;
 }
 
+static bool mv88e6xxx_has_stu(struct dsa_switch *ds)
+{
+   /* Does the device have STU and dedicated SID registers for VTU ops? */
+   if (mv88e6xxx_6097_family(ds) || mv88e6xxx_6165_family(ds) ||
+   mv88e6xxx_6351_family(ds) || mv88e6xxx_6352_family(ds))
+   return true;
+
+   return false;
+}
+
 /* We expect the switch to perform auto negotiation if there is a real
  * phy. However, in the case of a fixed link phy, we force the port
  * settings from the fixed link settings.
@@ -1329,7 +1339,9 @@ static int _mv88e6xxx_vtu_getnext(struct dsa_switch *ds,
return ret;
 
next.fid = ret & GLOBAL_VTU_FID_MASK;
+   }
 
+   if (mv88e6xxx_has_stu(ds)) {
ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL,
  GLOBAL_VTU_SID);
if (ret < 0)
@@ -1412,8 +1424,7 @@ static int _mv88e6xxx_vtu_loadpurge(struct dsa_switch *ds,
if (ret < 0)
return ret;
 
-   if (mv88e6xxx_6097_family(ds) || mv88e6xxx_6165_family(ds) ||
-   mv88e6xxx_6351_family(ds) || mv88e6xxx_6352_family(ds)) {
+   if (mv88e6xxx_has_stu(ds)) {
reg = entry->sid & GLOBAL_VTU_SID_MASK;
ret = _mv88e6xxx_reg_write(ds, REG_GLOBAL, GLOBAL_VTU_SID, reg);
if (ret < 0)
-- 
2.7.4



[PATCH net-next v2 3/6] net: dsa: mv88e6xxx: variable number of databases

2016-03-31 Thread Vivien Didelot
Marvell switch chips have different number of address databases.

The code currently only supports models with 4096 databases. Such switch
has dedicated FID registers for ATU and VTU operations. Models with
fewer databases have their FID split in several registers.

List them all but only support models with 4096 databases at the moment.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 38 ++
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index cc066dc..888c66b 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -482,6 +482,30 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds)
return false;
 }
 
+static unsigned int mv88e6xxx_num_databases(struct dsa_switch *ds)
+{
+   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+
+   /* The following devices have 4-bit identifiers for 16 databases */
+   if (ps->id == PORT_SWITCH_ID_6061)
+   return 16;
+
+   /* The following devices have 6-bit identifiers for 64 databases */
+   if (ps->id == PORT_SWITCH_ID_6065)
+   return 64;
+
+   /* The following devices have 8-bit identifiers for 256 databases */
+   if (mv88e6xxx_6095_family(ds) || mv88e6xxx_6185_family(ds))
+   return 256;
+
+   /* The following devices have 12-bit identifiers for 4096 databases */
+   if (mv88e6xxx_6097_family(ds) || mv88e6xxx_6165_family(ds) ||
+   mv88e6xxx_6351_family(ds) || mv88e6xxx_6352_family(ds))
+   return 4096;
+
+   return 0;
+}
+
 static bool mv88e6xxx_has_fid_reg(struct dsa_switch *ds)
 {
/* Does the device have dedicated FID registers for ATU and VTU ops? */
@@ -1534,9 +1558,15 @@ loadpurge:
 static int _mv88e6xxx_port_fid(struct dsa_switch *ds, int port, u16 *new,
   u16 *old)
 {
+   u16 upper_mask;
u16 fid;
int ret;
 
+   if (mv88e6xxx_num_databases(ds) == 4096)
+   upper_mask = 0xff;
+   else
+   return -EOPNOTSUPP;
+
/* Port's default FID bits 3:0 are located in reg 0x06, offset 12 */
ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_BASE_VLAN);
if (ret < 0)
@@ -1559,11 +1589,11 @@ static int _mv88e6xxx_port_fid(struct dsa_switch *ds, int port, u16 *new,
if (ret < 0)
return ret;
 
-   fid |= (ret & PORT_CONTROL_1_FID_11_4_MASK) << 4;
+   fid |= (ret & upper_mask) << 4;
 
if (new) {
-   ret &= ~PORT_CONTROL_1_FID_11_4_MASK;
-   ret |= (*new >> 4) & PORT_CONTROL_1_FID_11_4_MASK;
+   ret &= ~upper_mask;
+   ret |= (*new >> 4) & upper_mask;
 
ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_CONTROL_1,
   ret);
@@ -1627,7 +1657,7 @@ static int _mv88e6xxx_fid_new(struct dsa_switch *ds, u16 *fid)
 * databases are not needed. Return the next positive available.
 */
*fid = find_next_zero_bit(fid_bitmap, MV88E6XXX_N_FID, 1);
-   if (unlikely(*fid == MV88E6XXX_N_FID))
+   if (unlikely(*fid >= mv88e6xxx_num_databases(ds)))
return -ENOSPC;
 
/* Clear the database */
-- 
2.7.4
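
The FID-splitting logic described in this patch can be illustrated with a
small standalone sketch. The helper names and the test values below are
made up for illustration; only the bit layout (FID bits 3:0 at offset 12
of the Port Based VLAN Map register, upper bits in Port Control 1, with
an 8-bit upper mask for the 4096-database case) follows the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not driver code: pack a 12-bit FID into the two
 * register fields the patch describes, and recover it again. */
static uint16_t fid_to_base_vlan_bits(uint16_t fid)
{
	return (fid & 0x000f) << 12;	/* bits 3:0 -> reg 0x06, bits 15:12 */
}

static uint16_t fid_to_control1_bits(uint16_t fid, uint16_t upper_mask)
{
	return (fid >> 4) & upper_mask;	/* bits 11:4 -> Port Control 1 */
}

static uint16_t fid_from_regs(uint16_t base_vlan, uint16_t control1,
			      uint16_t upper_mask)
{
	return ((base_vlan >> 12) & 0x000f) | ((control1 & upper_mask) << 4);
}
```

A round trip with upper_mask = 0xff (the 4096-database case from the
patch) reproduces the original FID.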



[PATCH net-next v2 0/6] net: dsa: mv88e6131: HW bridging support for 6185

2016-03-31 Thread Vivien Didelot
All packets passing through a switch of the 6185 family are currently
directed to the CPU port. This means that port bridging is software driven.

To enable hardware bridging for this switch family, we need to implement the
port mapping operations, the FDB operations, and optionally the VLAN operations
(for 802.1Q and VLAN filtering aware systems).

However this family only has 256 FDBs indexed by 8-bit identifiers, opposed to
4096 FDBs with 12-bit identifiers for other families such as 6352. It also
doesn't have dedicated FID registers for ATU and VTU operations.

This patchset fixes these differences, and enable hardware bridging for 6185.

Changes v1 -> v2:
 - Describe the different numbers of databases and prefer a feature-based logic
   over the current ID/family-based logic.

Vivien Didelot (6):
  net: dsa: mv88e6xxx: protect SID register access
  net: dsa: mv88e6xxx: protect FID registers access
  net: dsa: mv88e6xxx: variable number of databases
  net: dsa: mv88e6xxx: support 256 databases
  net: dsa: mv88e6xxx: map destination addresses for 6185
  net: dsa: mv88e6131: enable hardware bridging

 drivers/net/dsa/mv88e6131.c |  11 
 drivers/net/dsa/mv88e6xxx.c | 134 +++-
 2 files changed, 118 insertions(+), 27 deletions(-)

-- 
2.7.4



[PATCH net-next v2 6/6] net: dsa: mv88e6131: enable hardware bridging

2016-03-31 Thread Vivien Didelot
By adding support for bridge operations, FDB operations, and optionally
VLAN operations (for 802.1Q and VLAN filtering aware systems), the
switch bridges ports correctly, the CPU is able to populate the hardware
address databases, and thus hardware bridging becomes functional within
the 88E6185 family of switches.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6131.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c
index a92ca65..2407028 100644
--- a/drivers/net/dsa/mv88e6131.c
+++ b/drivers/net/dsa/mv88e6131.c
@@ -169,6 +169,17 @@ struct dsa_switch_driver mv88e6131_switch_driver = {
.get_ethtool_stats  = mv88e6xxx_get_ethtool_stats,
.get_sset_count = mv88e6xxx_get_sset_count,
.adjust_link= mv88e6xxx_adjust_link,
+   .port_bridge_join   = mv88e6xxx_port_bridge_join,
+   .port_bridge_leave  = mv88e6xxx_port_bridge_leave,
+   .port_vlan_filtering= mv88e6xxx_port_vlan_filtering,
+   .port_vlan_prepare  = mv88e6xxx_port_vlan_prepare,
+   .port_vlan_add  = mv88e6xxx_port_vlan_add,
+   .port_vlan_del  = mv88e6xxx_port_vlan_del,
+   .port_vlan_dump = mv88e6xxx_port_vlan_dump,
+   .port_fdb_prepare   = mv88e6xxx_port_fdb_prepare,
+   .port_fdb_add   = mv88e6xxx_port_fdb_add,
+   .port_fdb_del   = mv88e6xxx_port_fdb_del,
+   .port_fdb_dump  = mv88e6xxx_port_fdb_dump,
 };
 
 MODULE_ALIAS("platform:mv88e6085");
-- 
2.7.4



[PATCH net-next v2 5/6] net: dsa: mv88e6xxx: map destination addresses for 6185

2016-03-31 Thread Vivien Didelot
The 88E6185 switch also has a MapDA bit in its Port Control 2 register.
When this bit is cleared, all frames are sent out to the CPU port.

Set this bit to rely on address databases (ATU) hits and direct frames
out of the correct ports, and thus allow hardware bridging.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 52c3e17..56f5657 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2455,7 +2455,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
reg = 0;
if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
mv88e6xxx_6165_family(ds) || mv88e6xxx_6097_family(ds) ||
-   mv88e6xxx_6095_family(ds) || mv88e6xxx_6320_family(ds))
+   mv88e6xxx_6095_family(ds) || mv88e6xxx_6320_family(ds) ||
+   mv88e6xxx_6185_family(ds))
reg = PORT_CONTROL_2_MAP_DA;
 
if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
-- 
2.7.4



Re: [PATCH 1/1] net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card

2016-03-31 Thread David Miller
From: Daniele Palmas 
Date: Thu, 31 Mar 2016 15:16:47 +0200

> Telit LE910 V2 is a mobile broadband card with no ARP capabilities:
> the patch makes this device use the wwan_noarp_info struct
> 
> Signed-off-by: Daniele Palmas 

Bjorn, can you take a quick look at this?

Thank you.


Re: [PATCH net] rtnl: fix msg size calculation in if_nlmsg_size()

2016-03-31 Thread David Miller
From: Nicolas Dichtel 
Date: Thu, 31 Mar 2016 18:10:31 +0200

> Size of the attribute IFLA_PHYS_PORT_NAME was missing.
> 
> Fixes: db24a9044ee1 ("net: add support for phys_port_name")
> CC: David Ahern 
> Signed-off-by: Nicolas Dichtel 

Applied, and queued up for -stable, thanks Nicolas.

Man... this thing is getting really huge and unwieldy.


Re: [PATCH] net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES

2016-03-31 Thread David Miller
From: Thomas Petazzoni 
Date: Thu, 31 Mar 2016 22:37:35 +0200

> Hello,
> 
> On Thu, 31 Mar 2016 15:15:47 -0400 (EDT), David Miller wrote:
>> From: Jisheng Zhang 
>> Date: Wed, 30 Mar 2016 19:55:21 +0800
>> 
> > The mvneta is also used in some Marvell berlin family SoCs which may
> > have a 64-byte cacheline size. Replace the MVNETA_CPU_D_CACHE_LINE_SIZE
> > usage with L1_CACHE_BYTES.
> > 
> > And since dma_alloc_coherent() is always cacheline-size aligned,
> > remove the align checks.
>> > 
>> > Signed-off-by: Jisheng Zhang 
>> 
>> Applied.
> 
> A new version of the patch was sent, which more rightfully uses
> cache_line_size(), see:
> 
>  "[PATCH v2] net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with 
> cache_line_size"

Sorry about that.

Send me a relative fixup patch if you like.

Thanks.


Re: [PATCH] fec: Do not access unexisting register in Coldfire

2016-03-31 Thread David Miller
From: Fabio Estevam 
Date: Thu, 31 Mar 2016 12:05:17 -0300

> From: Fabio Estevam 
> 
> Commit 55cd48c821de ("net: fec: stop the "rcv is not +last, " error
> messages") introduces a write to a register that does not exist in
> Coldfire.
> 
> Move the FEC_FTRL register access inside the FEC_QUIRK_HAS_RACC 'if' block,
> so that we guarantee it will not be used on Coldfire CPUs.
> 
> Reported-by: Greg Ungerer 
> Signed-off-by: Fabio Estevam 

Applied, thanks.


Re: [PATCH] net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES

2016-03-31 Thread Thomas Petazzoni
Hello,

On Thu, 31 Mar 2016 15:15:47 -0400 (EDT), David Miller wrote:
> From: Jisheng Zhang 
> Date: Wed, 30 Mar 2016 19:55:21 +0800
> 
> > > The mvneta is also used in some Marvell berlin family SoCs which may
> > > have a 64-byte cacheline size. Replace the MVNETA_CPU_D_CACHE_LINE_SIZE
> > > usage with L1_CACHE_BYTES.
> > > 
> > > And since dma_alloc_coherent() is always cacheline-size aligned,
> > > remove the align checks.
> > 
> > Signed-off-by: Jisheng Zhang 
> 
> Applied.

A new version of the patch was sent, which more rightfully uses
cache_line_size(), see:

 "[PATCH v2] net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with 
cache_line_size"

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com


Re: [PATCH v2] rds: rds-stress show all zeros after few minutes

2016-03-31 Thread David Miller
From: shamir rabinovitch 
Date: Thu, 31 Mar 2016 02:29:22 -0400

> The issue can be seen on platforms that use a page size of 8K and above
> while the rds fragment size is 4K. On those platforms a single page is
> shared between 2 or more rds fragments. Each fragment has its own
> offset, and the rds congestion map code needs to take this offset into
> account. Not taking this offset into account leads to reading a data
> fragment as a congestion map fragment and a hang of the rds transmit
> path due to far congestion map corruption.
> 
> Signed-off-by: shamir rabinovitch 
> 
> Reviewed-by: Wengang Wang 
> Reviewed-by: Ajaykumar Hotchandani 
> Acked-by: Santosh Shilimkar 
> Tested-by: Anand Bibhuti 

This doesn't apply cleanly to my current tree, please respin.


Re: [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu

2016-03-31 Thread David Miller
From: Eric Dumazet 
Date: Thu, 31 Mar 2016 03:32:21 -0700

> On Thu, 2016-03-31 at 13:50 +0800, Jason Wang wrote:
>> We use a union for napi_id and sender_cpu; this is ok for most of the
>> cases except when we want to support busy polling for tun, which needs
>> napi_id to be stored and passed to the socket during tun_net_xmit(). In
>> this case, napi_id was overridden with sender_cpu before tun_net_xmit()
>> was called if XPS was enabled. Fix this by not using a union for napi_id
>> and sender_cpu.
>> 
>> Signed-off-by: Jason Wang 
>> ---
>>  include/linux/skbuff.h | 10 +-
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 15d0df9..8aee891 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -743,11 +743,11 @@ struct sk_buff {
>>  __u32   hash;
>>  __be16  vlan_proto;
>>  __u16   vlan_tci;
>> -#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
>> -union {
>> -unsigned intnapi_id;
>> -unsigned intsender_cpu;
>> -};
>> +#if defined(CONFIG_NET_RX_BUSY_POLL)
>> +unsigned intnapi_id;
>> +#endif
>> +#if defined(CONFIG_XPS)
>> +unsigned intsender_cpu;
>>  #endif
>>  union {
>>  #ifdef CONFIG_NETWORK_SECMARK
> 
> Hmmm...
> 
> This is a serious problem.
> 
> Making skb bigger (8 bytes because of alignment) was not considered
> valid for sender_cpu introduction. We worked quite hard to avoid this,
> if you take a look at git history :(
> 
> Can you describe more precisely the problem and code path ?

From what I can see they are doing busy poll loops in the TX code paths,
as well as the RX code paths, of vhost.

Doing this in the TX side makes little sense to me.  The busy poll
implementations in the drivers only process their RX queues when
->ndo_busy_poll() is invoked.  So I wonder what this is accomplishing
for the vhost TX case?
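
As a toy illustration of the union clash debated in this thread (the
structs below are stand-ins, not the real sk_buff; the field names mirror
the patch):

```c
#include <assert.h>

/* With a union, whichever writer runs last wins: XPS stamping
 * sender_cpu on transmit clobbers the napi_id that busy polling
 * wanted to read in the tun socket later. */
struct skb_shared_field {
	union {
		unsigned int napi_id;
		unsigned int sender_cpu;
	};
};

/* Separate fields survive both writers, at the cost of growing the
 * struct (8 bytes with alignment), which is Eric's objection. */
struct skb_split_fields {
	unsigned int napi_id;
	unsigned int sender_cpu;
};
```

Writing sender_cpu after napi_id in the union variant leaves no trace of
the napi_id; the split variant keeps both values.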


Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread David Miller
From: Hannes Frederic Sowa 
Date: Thu, 31 Mar 2016 21:48:27 +0200

> Tightest solution would probably be to combine both patches.
> 
> bool called_by_tuntap;
> 
> old_fp = rcu_dereference_protected(sk->sk_filter, called_by_tuntap ?
> lockdep_rtnl_is_held() : lockdep_sock_is_held());

Ok, I see what you're saying.

I misunderstood how the RTNL lockdep checks work and thought we could
get false positives from other entities taking RTNL.

Can you cook up the combined patch?

Thanks.


Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread David Miller
From: Alexei Starovoitov 
Date: Thu, 31 Mar 2016 12:31:56 -0700

> On Thu, Mar 31, 2016 at 09:24:12PM +0200, Hannes Frederic Sowa wrote:
>> Hello,
>> 
>> On 31.03.2016 21:21, David Miller wrote:
>> >From: Daniel Borkmann 
>> >Date: Thu, 31 Mar 2016 14:16:18 +0200
>> >
> >Alexei, do you really mind if I apply Daniel's patch?
> 
> I don't have strong opinion either way.
> Though Hannes's patch below looks simpler and easier to backport.
> Yeah, I do care about backports quite a bit more nowadays :)

You know, I care a lot about backports too :)

But Hannes's patch has the same fundamental issue, I think.

If we accept both synchornization styles, false positives are more
likely.

And in the long term, we can fix the false positive possibilities with
the RTNL checks.


Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread Hannes Frederic Sowa

On 31.03.2016 21:36, David Miller wrote:

From: Hannes Frederic Sowa 
Date: Thu, 31 Mar 2016 21:24:12 +0200


diff --git a/net/core/filter.c b/net/core/filter.c
index 4b81b71171b4ce..8ab270d5ce5507 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1166,7 +1166,8 @@ static int __sk_attach_prog(struct bpf_prog
*prog, struct sock *sk)
}

old_fp = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+  lockdep_rtnl_is_held() ||
+  lockdep_sock_is_held(sk));
rcu_assign_pointer(sk->sk_filter, fp);

if (old_fp)


I have the same objections Daniel did.

Not all socket filter clients use RTNL as the synchronization
mechanism.  The caller, or some descriptive element, should tell us
what that synchronizing element is.

Yes, I understand how these RTNL checks can pass "accidentally" but
the opposite is true too.  A socket locking synchronizing user,
who didn't lock the socket, might now pass because RTNL happens
to be held elsewhere.


Actually lockdep_rtnl_is_held checks if this specific code/thread holds 
the lock and no other cpu/thread. So it will not pass here in case 
another cpu has the lock.


lockdep stores the currently held locks in current->held_locks; if we 
preempt we switch the current pointer, and if we take a spin_lock we 
can't sleep and thus can't be preempted. Thus we always know that this 
specific code has the lock.


Using sock_owned_by_user actually has this problem, and thus I am 
replacing it. We don't know who has the socket locked.


Tightest solution would probably be to combine both patches.

bool called_by_tuntap;

old_fp = rcu_dereference_protected(sk->sk_filter, called_by_tuntap ? 
lockdep_rtnl_is_held() : lockdep_sock_is_held());


Bye,
Hannes



Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread David Miller
From: Hannes Frederic Sowa 
Date: Thu, 31 Mar 2016 21:24:12 +0200

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4b81b71171b4ce..8ab270d5ce5507 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1166,7 +1166,8 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
>   }
> 
>   old_fp = rcu_dereference_protected(sk->sk_filter,
> -sock_owned_by_user(sk));
> +lockdep_rtnl_is_held() ||
> +lockdep_sock_is_held(sk));
>   rcu_assign_pointer(sk->sk_filter, fp);
> 
>   if (old_fp)

I have the same objections Daniel did.

Not all socket filter clients use RTNL as the synchronization
mechanism.  The caller, or some descriptive element, should tell us
what that synchronizing element is.

Yes, I understand how these RTNL checks can pass "accidentally" but
the opposite is true too.  A socket locking synchronizing user,
who didn't lock the socket, might now pass because RTNL happens
to be held elsewhere.

Constraining the test properly, based upon the user, makes this less
likely to happen.  And that's desirable.


Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread Alexei Starovoitov
On Thu, Mar 31, 2016 at 09:24:12PM +0200, Hannes Frederic Sowa wrote:
> Hello,
> 
> On 31.03.2016 21:21, David Miller wrote:
> >From: Daniel Borkmann 
> >Date: Thu, 31 Mar 2016 14:16:18 +0200
> >
> >>On 03/31/2016 01:59 PM, Eric Dumazet wrote:
> >>>On Thu, 2016-03-31 at 13:35 +0200, Daniel Borkmann wrote:
> >>>
> +static inline bool sock_owned_externally(const struct sock *sk)
> +{
> + return sk->sk_flags & (1UL << SOCK_EXTERNAL_OWNER);
> +}
> +
> >>>
> >>>Have you reinvented sock_flag(sl, SOCK_EXTERNAL_OWNER) ? ;)
> >>>
> >>>Anyway, using a flag for this purpose sounds overkill to me.
> >>
> >>Right.
> >>
> >>>Setting it is a way to 'fool' lockdep anyway...
> >>
> >>Yep, correct, we'd be fooling the tun case, so this diff doesn't
> >>really make it any better there.
> >
> >I like the currently proposed patch where TUN says that RTNL is what
> >the synchronizing element is.
> >
> >Maybe we could make a helper of some sort, but since we only have one
> >case like this, it is just overkill.
> >
> >Alexei, do you really mind if I apply Daniel's patch?

I don't have strong opinion either way.
Though Hannes's patch below looks simpler and easier to backport.
Yeah, I do care about backports quite a bit more nowadays :)

> I proposed the following patch to Daniel and he seemed to like it. I
> was just waiting for his feedback and tags and wanted to send it out
> then.
> 
> What do you think?
> 
> lockdep_sock_is_held does also have some other applications in other
> parts of the source.
> 
> Bye,
> Hannes
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 255d3e03727b73..651b84a38cfb9b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1330,6 +1330,12 @@ static inline void sock_release_ownership(struct sock *sk)
>   sk->sk_lock.owned = 0;
>  }
> 
> +static inline bool lockdep_sock_is_held(struct sock *sk)
> +{
> + return lockdep_is_held(&sk->sk_lock) ||
> +lockdep_is_held(&sk->sk_lock.slock);
> +}
> +
>  /*
>   * Macro so as to not evaluate some arguments when
>   * lockdep is not enabled.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4b81b71171b4ce..8ab270d5ce5507 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1166,7 +1166,8 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
>   }
> 
>   old_fp = rcu_dereference_protected(sk->sk_filter,
> -sock_owned_by_user(sk));
> +lockdep_rtnl_is_held() ||
> +lockdep_sock_is_held(sk));
>   rcu_assign_pointer(sk->sk_filter, fp);
> 
>   if (old_fp)
> @@ -2259,7 +2260,9 @@ int sk_detach_filter(struct sock *sk)
>   return -EPERM;
> 
>   filter = rcu_dereference_protected(sk->sk_filter,
> -sock_owned_by_user(sk));
> +lockdep_rtnl_is_held() ||
> +lockdep_sock_is_held(sk));
> +
>   if (filter) {
>   RCU_INIT_POINTER(sk->sk_filter, NULL);
>   sk_filter_uncharge(sk, filter);
> 


Re: [PATCH 3/4] net: w5100: enable to support sleepable register access interface

2016-03-31 Thread David Miller
From: Akinobu Mita 
Date: Thu, 31 Mar 2016 01:38:39 +0900

> + struct sk_buff_head tx_queue;

The way the queueing works in this driver is that it is only possible
to have one SKB being transmitted at one time.

This is evident by how the driver immediately stops the TX queue when
it is given a new packet to transmit, and this is woken up by the TX
completion IRQ.

So don't use a queue here, just use a plain single pointer.

The SKB queue you're using here is going to also do locking which is
even more unnecessary overhead.

Thanks.
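
The single-outstanding-packet scheme suggested above can be modeled in a
few lines. Everything below is a stand-in sketch (stubbed sk_buff,
hypothetical names), not the actual w5100 driver or the netdev API:

```c
#include <assert.h>
#include <stddef.h>

/* Stub for struct sk_buff; real drivers use the kernel's type. */
struct sk_buff { int len; };

struct w5100_like {
	struct sk_buff *tx_skb;	/* single in-flight packet, no queue */
	int queue_stopped;
};

/* start_xmit stows the one skb and stops the queue (netif_stop_queue),
 * so no locking SKB queue is needed. */
static int model_start_xmit(struct w5100_like *p, struct sk_buff *skb)
{
	if (p->tx_skb)
		return -1;	/* should not happen: queue is stopped */
	p->tx_skb = skb;
	p->queue_stopped = 1;
	return 0;
}

/* TX-complete interrupt frees the slot and wakes the queue
 * (consume_skb + netif_wake_queue). */
static void model_tx_complete_irq(struct w5100_like *p)
{
	p->tx_skb = NULL;
	p->queue_stopped = 0;
}
```

The point of the design: with the queue stopped between start_xmit and
the completion IRQ, a single pointer is always sufficient, and the
sk_buff_head's internal spinlock is pure overhead.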


Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread Hannes Frederic Sowa

Hello,

On 31.03.2016 21:21, David Miller wrote:

From: Daniel Borkmann 
Date: Thu, 31 Mar 2016 14:16:18 +0200


On 03/31/2016 01:59 PM, Eric Dumazet wrote:

On Thu, 2016-03-31 at 13:35 +0200, Daniel Borkmann wrote:


+static inline bool sock_owned_externally(const struct sock *sk)
+{
+   return sk->sk_flags & (1UL << SOCK_EXTERNAL_OWNER);
+}
+


Have you reinvented sock_flag(sl, SOCK_EXTERNAL_OWNER) ? ;)

Anyway, using a flag for this purpose sounds overkill to me.


Right.


Setting it is a way to 'fool' lockdep anyway...


Yep, correct, we'd be fooling the tun case, so this diff doesn't
really make it any better there.


I like the currently proposed patch where TUN says that RTNL is what
the synchronizing element is.

Maybe we could make a helper of some sort, but since we only have one
case like this, it is just overkill.

Alexei, do you really mind if I apply Daniel's patch?


I proposed the following patch to Daniel and he seemed to like it. I
was just waiting for his feedback and tags and wanted to send it out
then.

What do you think?

lockdep_sock_is_held does also have some other applications in other
parts of the source.

Bye,
Hannes

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b73..651b84a38cfb9b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1330,6 +1330,12 @@ static inline void sock_release_ownership(struct sock *sk)

sk->sk_lock.owned = 0;
 }

+static inline bool lockdep_sock_is_held(struct sock *sk)
+{
+   return lockdep_is_held(&sk->sk_lock) ||
+  lockdep_is_held(&sk->sk_lock.slock);
+}
+
 /*
  * Macro so as to not evaluate some arguments when
  * lockdep is not enabled.
diff --git a/net/core/filter.c b/net/core/filter.c
index 4b81b71171b4ce..8ab270d5ce5507 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1166,7 +1166,8 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)

}

old_fp = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+  lockdep_rtnl_is_held() ||
+  lockdep_sock_is_held(sk));
rcu_assign_pointer(sk->sk_filter, fp);

if (old_fp)
@@ -2259,7 +2260,9 @@ int sk_detach_filter(struct sock *sk)
return -EPERM;

filter = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+  lockdep_rtnl_is_held() ||
+  lockdep_sock_is_held(sk));
+
if (filter) {
RCU_INIT_POINTER(sk->sk_filter, NULL);
sk_filter_uncharge(sk, filter);



Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter

2016-03-31 Thread David Miller
From: Daniel Borkmann 
Date: Thu, 31 Mar 2016 14:16:18 +0200

> On 03/31/2016 01:59 PM, Eric Dumazet wrote:
>> On Thu, 2016-03-31 at 13:35 +0200, Daniel Borkmann wrote:
>>
>>> +static inline bool sock_owned_externally(const struct sock *sk)
>>> +{
>>> +   return sk->sk_flags & (1UL << SOCK_EXTERNAL_OWNER);
>>> +}
>>> +
>>
>> Have you reinvented sock_flag(sl, SOCK_EXTERNAL_OWNER) ? ;)
>>
>> Anyway, using a flag for this purpose sounds overkill to me.
> 
> Right.
> 
>> Setting it is a way to 'fool' lockdep anyway...
> 
> Yep, correct, we'd be fooling the tun case, so this diff doesn't
> really make it any better there.

I like the currently proposed patch where TUN says that RTNL is what
the synchronizing element is.

Maybe we could make a helper of some sort, but since we only have one
case like this, it is just overkill.

Alexei, do you really mind if I apply Daniel's patch?

Thanks.


Re: [PATCH 3/4] samples/bpf: Simplify building BPF samples

2016-03-31 Thread Alexei Starovoitov

On 3/31/16 11:51 AM, Naveen N. Rao wrote:

On 2016/03/31 10:49AM, Alexei Starovoitov wrote:

On 3/31/16 4:25 AM, Naveen N. Rao wrote:

Make BPF samples build depend on CONFIG_SAMPLE_BPF. We still don't add a
Kconfig option since that will add a dependency on llvm for allyesconfig
builds which may not be desirable.

Those who need to build the BPF samples can now just do:

make CONFIG_SAMPLE_BPF=y

or:

export CONFIG_SAMPLE_BPF=y
make


I don't like this 'simplification'.
make samples/bpf/
is much easier to type than capital letters.


This started out as a patch to have the BPF samples built with a Kconfig
option. As stated in the commit description, I realised that it won't
work for allyesconfig builds. However, the reason I retained this patch
is since it gets us one step closer to building the samples as part of
the kernel build.

The 'simplification' is since I can now have the export in my .bashrc
and the kernel build will now build the BPF samples too without
requiring an additional 'make samples/bpf/' step.

I agree this is subjective, so I am ok if this isn't taken in.


If you can change it so that 'make samples/bpf/' still works then it
would be fine. As it is, it breaks our testing setup.



Re: qdisc spin lock

2016-03-31 Thread Jesper Dangaard Brouer

On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma  wrote:

> I know this might be an old topic so bear with me – what we are facing
> is that applications are sending small packets using hundreds of
> threads so the contention on spin lock in __dev_xmit_skb increases the
> latency of dev_queue_xmit significantly. We’re building a network QoS
> solution to avoid interference of different applications using HTB.

Yes, as you have noticed with HTB there is a single qdisc lock, and
congestion obviously happens :-)

It is possible with different tricks to make it scale.  I believe
Google is using a variant of HTB, and it scales for them.  They have
not open-sourced their modifications to HTB (which likely also involve
a great deal of setup tricks).

If your purpose is to limit traffic/bandwidth per "cloud" instance,
then you can just use another TC setup structure.  Like using MQ and
assigning a HTB per MQ queue (where the MQ queues are bound to each
CPU/HW queue)... But you have to figure out this setup yourself...


> But in this case when some applications send massive small packets in
> parallel, the application to be protected will get its throughput
> affected (because it’s doing synchronous network communication using
> multiple threads and throughput is sensitive to the increased latency)
> 
> Here is the profiling from perf:
> 
> -  67.57%  iperf  [kernel.kallsyms]  [k] _spin_lock
>    - 99.94% dev_queue_xmit
>       - 96.91% _spin_lock
>       - 2.62% __qdisc_run
>          - 98.98% sch_direct_xmit
>             - 99.98% _spin_lock
> 
> As far as I understand the design of TC is to simplify locking schema
> and minimize the work in __qdisc_run so that throughput won’t be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn’t really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing on avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?

Yes, there is a lot to gain, but it is not easy ;-)

> I must have oversimplified a lot of details since I’m not familiar
> with the TC implementation at this point – just want to get your input
> in terms of whether this is a worthwhile effort or there is something
> fundamental that I’m not aware of. If this is just a matter of quite
> some additional work, I would also appreciate help in outlining the
> required work here.
> 
> Also would appreciate if there is any information about the latest
> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

This article seems to be very low quality... spelling errors, only 5
pages, no real code, etc. 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES

2016-03-31 Thread David Miller
From: Jisheng Zhang 
Date: Wed, 30 Mar 2016 19:53:41 +0800

> The mvpp2 IP may be used in SoCs which have a 64-byte cacheline
> size. Replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES.
> 
> And since dma_alloc_coherent() is always cacheline-size aligned,
> remove the align checks.
> 
> Signed-off-by: Jisheng Zhang 

Applied.


Re: [PATCH] net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES

2016-03-31 Thread David Miller
From: Jisheng Zhang 
Date: Wed, 30 Mar 2016 19:55:21 +0800

> The mvneta is also used in some Marvell Berlin family SoCs which may
> have a 64-byte cacheline size. Replace the MVNETA_CPU_D_CACHE_LINE_SIZE
> usage with L1_CACHE_BYTES.
> 
> And since dma_alloc_coherent() is always cacheline-size aligned,
> remove the align checks.
> 
> Signed-off-by: Jisheng Zhang 

Applied.


Re: [PATCH net-next v3 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup

2016-03-31 Thread David Miller
From: Patrick Uiterwijk 
Date: Wed, 30 Mar 2016 01:39:41 +

> Some of the vendor-specific bootloaders set up this part
> of the initialization for us, so this was never added.
> However, since upstream bootloaders don't initialize the
> chip specifically, they leave the fiber MII's PDOWN flag
> set, which means that the CPU port doesn't connect.
> 
> This patch checks whether this flag has already been cleared
> by something else, and if not, clears it.
> 
> Reviewed-by: Andrew Lunn 
> Signed-off-by: Patrick Uiterwijk 

Applied, thank you.


Re: [PATCH net-next v3 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}

2016-03-31 Thread David Miller
From: Patrick Uiterwijk 
Date: Wed, 30 Mar 2016 01:39:40 +

> Add versions of the phy_page_read and _write functions to
> be used in a context where the SMI mutex is held.
> 
> Tested-by: Vivien Didelot 
> Reviewed-by: Vivien Didelot 
> Signed-off-by: Patrick Uiterwijk 

Applied.


Re: [PATCH 3/4] samples/bpf: Simplify building BPF samples

2016-03-31 Thread Naveen N. Rao
On 2016/03/31 10:49AM, Alexei Starovoitov wrote:
> On 3/31/16 4:25 AM, Naveen N. Rao wrote:
> >Make BPF samples build depend on CONFIG_SAMPLE_BPF. We still don't add a
> >Kconfig option since that will add a dependency on llvm for allyesconfig
> >builds which may not be desirable.
> >
> >Those who need to build the BPF samples can now just do:
> >
> >make CONFIG_SAMPLE_BPF=y
> >
> >or:
> >
> >export CONFIG_SAMPLE_BPF=y
> >make
> 
> I don't like this 'simplification'.
> make samples/bpf/
> is much easier to type than capital letters.

This started out as a patch to have the BPF samples built with a Kconfig 
option. As stated in the commit description, I realised that it won't 
work for allyesconfig builds. However, the reason I retained this patch 
is that it gets us one step closer to building the samples as part of 
the kernel build.

The 'simplification' is that I can now have the export in my .bashrc, 
and the kernel build will then build the BPF samples too without 
requiring an additional 'make samples/bpf/' step.
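For illustration, that workflow amounts to the following (keeping the export in ~/.bashrc is just where I put it):

```shell
# Persist the option so every kernel build picks it up:
export CONFIG_SAMPLE_BPF=y     # e.g. sourced from ~/.bashrc
echo "CONFIG_SAMPLE_BPF=$CONFIG_SAMPLE_BPF"
# make                         # samples/bpf/ now builds along with the kernel
```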

I agree this is subjective, so I am ok if this isn't taken in.


- Naveen


