date:20151109

Re: [PATCH] man: Syntax and warning fixes

2015-11-09 Thread Ville Skyttä

On Mon, Nov 9, 2015 at 11:40 AM, Albino B Neto  wrote:
> 2015-11-07 7:55 GMT-02:00 Ville Skyttä :
>>> Signed-off-by and description ?
>>
>> Superseding patches sent separately. I suggest documenting the
>> requirement for Signed-off-by somewhere (README.devel?);
>
> Please read the Documentation/SubmittingPatches.

No such file in iproute2 git.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

PING: [PATCH] net: smsc911x: Reset PHY during initialization

2015-11-09 Thread Pavel Fedin

 Hello! So, what should we do with this?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
> Behalf Of Pavel
> Fedin
> Sent: Monday, November 02, 2015 10:19 AM
> To: 'David Miller'
> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; 
> steve.glendinn...@shawell.net
> Subject: RE: [PATCH] net: smsc911x: Reset PHY during initialization
> 
>  Hello!
> 
> > > On certain hardware after software reboot the chip may get stuck and fail
> > > to reinitialize during reset. This can be fixed by ensuring that PHY is
> > > reset too.
> > >
> > > Old PHY resetting method required operational MDIO interface, therefore
> > > the chip should have been already set up. In order to be able to function
> > > during probe, it is changed to use PMT_CTRL register.
> > >
> > > The problem could be observed on SMDK5410 board.
> > >
> > > Signed-off-by: Pavel Fedin 
> >
> > I'm pretty sure this is going to break the PHY loopback test.
> 
>  It's not (at least in normal situation), because first we do the test, and 
> only then, if it
> fails, we reset the PHY. So, during
> normal operation of the driver, when loopback test succeeds at first attempt, 
> we never attempt
> to reset the PHY. And, by the way, at
> least on my board this PHY reset did nothing useful, because after it 
> loopback test still
> failed, all 100 times.
>  And, we don't use PHY reset anywhere outside of loopback test.
> 
> > This is such a tricky and fragile area to get right, therefore I
> > want you to specifically guard any change in how PHY reset is
> > done with tests against the specific chips that have the problem.
> 
>  Well, i could do one (or some combination) of the following, if you really 
> want to:
>  a) Leave PHY reset inside loopback test as it is, but add a second routine 
> and call it only
> before smsc911x_soft_reset().
>  b) Reset PHY only conditionally, using the following algorithm:
> if smsc911x_soft_reset() {
> /* NIC reset failed, kick the PHY and retry */
> smsc911x_phy_reset()
> if (smsc_911x_soft_reset())
> return -ENODEV;
> }
>  c) Do extra PHY reset only if some hint in device tree is specified (or the 
> machine is known
> to have the problem)
> 
>  But, i dislike approach (a) for code duplication, because datasheet says 
> that both reset
> methods are equivalent; i dislike approach
> (b) for actually being a hack; and i dislike (c) for adding extra stuff which 
> seems to be not
> necessary. After all, what is wrong
> with just extra PHY reset before attempting to program anything? By the way, 
> i test my patch
> with both software reboot and hardware
> reboot, and even powercycle. Everything is quite stable.
>  Well, it's up to you to decide. But i'd like to get the fix upstreamed 
> somehow because from
> time to time we use these boards for
> tests, and it's quite annoiying to be unable to reboot them properly.
> 
> > Furthermore, I want you to test whether this has any negative
> > effects on the PHY loopback test.
> 
>  I did. I never disabled loopback test, so i actually discovered a problem 
> (even two) with it.
> First, the problem was chip reset
> timeout. By increasing it to 300ms this problem seemed to be fixed, but 
> loopback test started
> to fail. This was how i found and
> fixed crash with missing phy_disconnect(). I examined the code and discovered 
> that upon
> loopback test failure we reset the PHY and
> retry. But in my case this PHY reset never did the right thing, and all 
> loopback attempts
> (10x10 IIRC) were failing.
>  Some comments in the code gave me a clue that the main NIC can misbehave on 
> reset if e.g. PHY
> is in powerdown state. I printed
> value of its control register and discovered that it was not the case. But 
> then i discovered
> that we actually never try to reset the
> PHY before doing anything. Also, while studying the datasheed i discovered 
> that there is a
> shorthand for resetting PHY without MDIO
> bus set up, but our driver doesn't use it, preferring MDIO method instead, 
> which already
> requires operational NIC.
>  So, i suggested that PHY is just not being reset when board is rebooted by 
> software. I wrote
> the patch and it worked. I verified it
> by reverting my previous NIC reset timeout increase, and it continued to 
> work. After this
> loopback test on my board passes at first
> attempt.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.

Re: [PATCH RFC 5/6] net/faraday: Enable NCSI interface

2015-11-09 Thread Gavin Shan

On Tue, Nov 10, 2015 at 07:28:24AM +1100, Benjamin Herrenschmidt wrote:
>On Mon, 2015-11-09 at 18:30 +1100, Gavin Shan wrote:
>> 
>> Yeah, It's something that I hilighed in the cover letter. I was
>> thinking we might need a better way to enable Tx/Rx before the
>> interrupt is up, but couldn't figure out one way. So I need some
>> advice here.
>
>No, that's not right. For Tx/Rx to work the interface must be opened,
>there is simply no way around that and that's perfectly fine. The
>situation with an MDIO PHY is the same, most drivers can't talk to
>their PHY until the interface has been opened because the chip is
>basically powered down otherwise.
>
>So we require the interface to be opened to talk, so far so good,
>the NC-SI stack doesn't even need to open it itself, it's acceptable
>to require userspace to do it. IE. Userspace will chose what interface
>to use, open it (for DHCP etc... or whatever other reason) and *that*
>will then trigger the NC-SI negociation.
>

Yes, NCSI is smiliar to PHY to some extent. However, PHY's negotiation
is purly electrical procedure, no packets received from MAC for it. We
have the same situation when the NCSI/PHY is going to be brought down.

At the beginning, the NCSI packets can be received and transmitted after
the interface is opened. Before NCSI negotiation is done, no other packets
can be received and transmitted. For the Rx path (for other packets), the
NCSI link isn't enabled when NCSI negotiation isn't finished. There might
have lots of egress packets whose IP addresses can't be resolved to MAC
address as ARP resolution doesn't work before NCSI negotiation is done.
So there is a weird window: interface is up, but no packets (except NCSI
packets) can be received or transmitted.

When the interface is brought down, for example by "ifconfig eth0 down",
The NCSI interface needs to be teared down by transmitting and receiving
NCSI commands and responses. Similiarly, it introduces another weird window:
interface is down, but NCSI packets still can be transmitted and received.

>Cheers,
>Ben.
>

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] drivers: net: cpsw: add support for fixed-links.

2015-11-09 Thread David Miller

From: Daniel Trautmann 
Date: Mon, 9 Nov 2015 20:24:14 +0100

> Add support for fixed-links in configurations without PHY.
> (e.g. connection to a switch, SGMII point to point, SFPs)
> 
> Check: Documentation/devicetree/bindings/net/fixed-link.txt.
> 
> Signed-off-by: Daniel Trautmann 

This patch doesn't apply cleanly at all.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] drivers: net: cpsw: add support for fixed-links.

2015-11-09 Thread David Miller

From: Daniel Trautmann 
Date: Mon, 9 Nov 2015 20:24:14 +0100

> @@ -2039,19 +2039,35 @@ static int cpsw_probe_dt(struct cpsw_priv *priv,
>   priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0);
>   parp = of_get_property(slave_node, "phy_id", &lenp);
>   if ((parp == NULL) || (lenp != (sizeof(void *) * 2))) {
> - dev_err(&pdev->dev, "Missing slave[%d] phy_id 
> property\n", i);

I know this has nothing to do with your patch, but that length check
is completely bogus.

It should be "sizeof(__be32) * 2" not "sizeof(void *) * 2".
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Deadlock between bind and splice

2015-11-09 Thread Al Viro

On Tue, Nov 10, 2015 at 02:38:54AM +, Al Viro wrote:
> On Fri, Nov 06, 2015 at 07:42:15AM -0800, Eric Dumazet wrote:
> 
> > Thank you for this report.
> > 
> > pipe is part of fs, not net ;)
> 
> AF_UNIX bind() vs. socketpair() interplay, OTOH...

FWIW, BSD folks unlock the socket for the duration of mknod - mark it as
"somebody's trying to bind it" to avoid the fun with racing double bind(),
but that's about it.  Tempting, to be honest...

BTW, why does unix_autobind() do allocation under ->readlock?  The allocation
will be normally used - that if (u->addr) return; part is just dealing with
an unlikely race, as far as I can see...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net] net: fix a race in dst_release()

2015-11-09 Thread David Miller

From: Eric Dumazet 
Date: Mon, 09 Nov 2015 17:51:23 -0800

> From: Eric Dumazet 
> 
> Only cpu seeing dst refcount going to 0 can safely
> dereference dst->flags. 
> 
> Otherwise an other cpu might already have freed the dst.
> 
> Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst")
> Reported-by: Greg Thelen 
> Signed-off-by: Eric Dumazet 

Ouch...

Applied and queued up for -stable, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net: mvneta: Fix memory use after free.

2015-11-09 Thread David Miller

From: Justin Maggard 
Date: Mon,  9 Nov 2015 17:21:05 -0800

> After changing an interface's MTU, then bringing the interface down and
> back up again, I immediately saw tons of kernel messages like below.
> The reason for this bad behavior is mvneta_rxq_drop_pkts(), which calls
> dma_unmap_single() on already-freed memory.  So we need to switch the
> order of those two operations.
 ...
> Signed-off-by: Justin Maggard 

Applied, thanks Justin.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Deadlock between bind and splice

2015-11-09 Thread Al Viro

On Fri, Nov 06, 2015 at 07:42:15AM -0800, Eric Dumazet wrote:

> Thank you for this report.
> 
> pipe is part of fs, not net ;)

AF_UNIX bind() vs. socketpair() interplay, OTOH...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 net-next 4/4] net: af_unix: implement splice for stream af_unix sockets

2015-11-09 Thread Hannes Frederic Sowa

Hallo,

On Tue, Nov 10, 2015, at 02:11, Eric Dumazet wrote:
> On Thu, 2015-05-21 at 17:00 +0200, Hannes Frederic Sowa wrote:
> 
> > +
> > +static ssize_t skb_unix_socket_splice(struct sock *sk,
> > + struct pipe_inode_info *pipe,
> > + struct splice_pipe_desc *spd)
> > +{
> > +   int ret;
> > +   struct unix_sock *u = unix_sk(sk);
> > +
> > +   mutex_unlock(&u->readlock);
> > +   ret = splice_to_pipe(pipe, spd);
> > +   mutex_lock(&u->readlock);
> > +
> > +   return ret;
> > +}
> > +
> 
> Hi Hannes
> 
> Since we release u->readlock, what prevents another thread to read() the
> same af_unix socket and consume the skb while we splice it ?
> 
> TCP stack has special code to take care of this possibility.

Hm, I do see the problem, yes. I left a window open for races happening.

Without having a closer look what spontaneously comes to my mind:

increment skb->users before splice_to_pipe, re-peek the socket queue
after the splice, recheck if we have the same skb pointer in the front,
and then consume_skb the skb properly and indicating a short read.

Let me try tomorrow if that works, but I fear we can read data multiple
times with this fix, so I don't know if this is a proper fix. I have to
check the TCP code closer. Thanks for the hint!

Thanks for the report,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Deadlock between bind and splice

2015-11-09 Thread Al Viro

On Fri, Nov 06, 2015 at 01:58:27PM +0100, Dmitry Vyukov wrote:
> Hello,
> 
> I am on revision d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5) and
> seeing the following lockdep reports. I don't have exact reproducer
> program as it is caused by several independent programs (state
> accumulated in kernel across invocations); if the report is not enough
> I can try to cook a reproducer.
> 
> Thanks.
> 
> [ INFO: possible circular locking dependency detected ]
> 4.3.0+ #30 Not tainted
> ---
> a.out/9972 is trying to acquire lock:
>  (&pipe->mutex/1){+.+.+.}, at: [< inline >] pipe_lock_nested
> fs/pipe.c:59
>  (&pipe->mutex/1){+.+.+.}, at: []
> pipe_lock+0x56/0x70 fs/pipe.c:67
> 
> but task is already holding lock:
>  (sb_writers#5){.+.+.+}, at: []
> __sb_start_write+0xec/0x130 fs/super.c:1198
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2
[AF_UNIX bind() does sb_start_write() while holding unix_sock locked]

> -> #1
[splice() to AF_UNIX socket is trying to lock unix_sock while holding
the pipe locked]

> -> #0 (&pipe->mutex/1){+.+.+.}:
[splice() to regular file is locking the pipe under sb_start_write()]

Cute...  The first impression is that in #1 you need the socket to
be connected, or it won't even reach that attempt to lock unix_sock,
while bind() on the same sucker ought to bugger off before getting
around to touching the filesystem, so it looks like a false positive,
but... socketpair() yields a connected socket and AFAICS there's
nothing in unix_bind() to bugger off on such.

So the scenario ought to be:
(a while ago) A: socketpair()
B: splice() from a pipe to /mnt/regular_file
does sb_start_write() on /mnt
C: try to freeze /mnt
wait for B to finish with /mnt
A: bind() try to bind our socket to /mnt/new_socket_name
lock our socket, see it not bound yet
decide that it needs to create something in /mnt
try to do sb_start_write() on /mnt, block (it's
waiting for C).
D: splice() from the same pipe to our socket
lock the pipe, see that socket is connected
try to lock the socket, block waiting for A
B:  get around to actually feeding a chunk from
pipe to file, try to lock the pipe.  Deadlock.

Locking the socket is interruptible, though, so killing D will
untangle that mess - it's not quite a hopeless deadlock.

Deadlock or not, should bind() actually work on connected sockets?
AFAICS, socketpair() is the only way for it to happen...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net] net: fix a race in dst_release()

2015-11-09 Thread Eric Dumazet

From: Eric Dumazet 

Only cpu seeing dst refcount going to 0 can safely
dereference dst->flags. 

Otherwise an other cpu might already have freed the dst.

Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst")
Reported-by: Greg Thelen 
Signed-off-by: Eric Dumazet 
---
 net/core/dst.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index 2a1818065e12..e6dc77252fe9 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -306,7 +306,7 @@ void dst_release(struct dst_entry *dst)
if (unlikely(newrefcnt < 0))
net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
 __func__, dst, newrefcnt);
-   if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
+   if (!newrefcnt && unlikely(dst->flags & DST_NOCACHE))
call_rcu(&dst->rcu_head, dst_destroy_rcu);
}
 }


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] net: mvneta: Fix memory use after free.

2015-11-09 Thread Justin Maggard

After changing an interface's MTU, then bringing the interface down and
back up again, I immediately saw tons of kernel messages like below.
The reason for this bad behavior is mvneta_rxq_drop_pkts(), which calls
dma_unmap_single() on already-freed memory.  So we need to switch the
order of those two operations.

[  152.388518] BUG: Bad page state in process ifconfig  pfn:1b518
[  152.388526] page:dff3dbc0 count:0 mapcount:0 mapping:  (null) index:0x0
[  152.395178] flags: 0x200(arch_1)
[  152.398441] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  152.398446] bad because of flags:
[  152.398450] flags: 0x200(arch_1)
[  152.401716] Modules linked in:
[  152.401728] CPU: 0 PID: 1453 Comm: ifconfig Tainted: PB  O
4.1.12.armada.1 #1
[  152.401733] Hardware name: Marvell Armada 370/XP (Device Tree)
[  152.401749] [] (unwind_backtrace) from [] 
(show_stack+0x10/0x14)
[  152.401762] [] (show_stack) from [] 
(dump_stack+0x74/0x90)
[  152.401772] [] (dump_stack) from [] (bad_page+0xc4/0x124)
[  152.401783] [] (bad_page) from [] 
(get_page_from_freelist+0x4e4/0x644)
[  152.401794] [] (get_page_from_freelist) from [] 
(__alloc_pages_nodemask+0x148/0x784)
[  152.401805] [] (__alloc_pages_nodemask) from [] 
(kmalloc_order+0x10/0x20)
[  152.401818] [] (kmalloc_order) from [] 
(mvneta_rx_refill+0xc4/0xe8)
[  152.401830] [] (mvneta_rx_refill) from [] 
(mvneta_setup_rxqs+0x298/0x39c)
[  152.401842] [] (mvneta_setup_rxqs) from [] 
(mvneta_open+0x3c/0x150)
[  152.401853] [] (mvneta_open) from [] 
(__dev_open+0xac/0x124)
[  152.401864] [] (__dev_open) from [] 
(__dev_change_flags+0x8c/0x148)
[  152.401875] [] (__dev_change_flags) from [] 
(dev_change_flags+0x18/0x48)
[  152.401886] [] (dev_change_flags) from [] 
(devinet_ioctl+0x620/0x6d0)
[  152.401897] [] (devinet_ioctl) from [] 
(sock_ioctl+0x64/0x288)
[  152.401908] [] (sock_ioctl) from [] 
(do_vfs_ioctl+0x78/0x608)
[  152.401918] [] (do_vfs_ioctl) from [] 
(SyS_ioctl+0x64/0x74)
[  152.401930] [] (SyS_ioctl) from [] 
(ret_fast_syscall+0x0/0x3c)

Signed-off-by: Justin Maggard 
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index a47496a..e84c7f2 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1493,9 +1493,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
struct mvneta_rx_desc *rx_desc = rxq->descs + i;
void *data = (void *)rx_desc->buf_cookie;
 
-   mvneta_frag_free(pp, data);
dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
 MVNETA_RX_BUF_SIZE(pp->pkt_size), 
DMA_FROM_DEVICE);
+   mvneta_frag_free(pp, data);
}
 
if (rx_done)
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 net-next 4/4] net: af_unix: implement splice for stream af_unix sockets

2015-11-09 Thread Eric Dumazet

On Thu, 2015-05-21 at 17:00 +0200, Hannes Frederic Sowa wrote:

> +
> +static ssize_t skb_unix_socket_splice(struct sock *sk,
> +   struct pipe_inode_info *pipe,
> +   struct splice_pipe_desc *spd)
> +{
> + int ret;
> + struct unix_sock *u = unix_sk(sk);
> +
> + mutex_unlock(&u->readlock);
> + ret = splice_to_pipe(pipe, spd);
> + mutex_lock(&u->readlock);
> +
> + return ret;
> +}
> +

Hi Hannes

Since we release u->readlock, what prevents another thread to read() the
same af_unix socket and consume the skb while we splice it ?

TCP stack has special code to take care of this possibility.

tcp_read_sock() :

used = recv_actor(desc, skb, offset, len);
if (used <= 0) {
if (!copied)
copied = used;
break;
} else if (used <= len) {
seq += used;
copied += used;
offset += used;
}
/* If recv_actor drops the lock (e.g. TCP splice
 * receive) the skb pointer might be invalid when
 * getting here: tcp_collapse might have deleted it
 * while aggregating skbs from the socket queue.
 */
skb = tcp_recv_skb(sk, seq - 1, &offset);
if (!skb)
break;
/* TCP coalescing might have appended data to the skb.
 * Try to splice more frags
 */
if (offset + 1 != skb->len)
continue;

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/4] e1000e: Remove unreachable code

2015-11-09 Thread Benjamin Poirier

msi-x interrupts are not shared so there's no need to check if the
interrupt was really from this adapter.

Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..a228167 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1907,12 +1907,6 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = er32(ICR);
 
-   if (!(icr & E1000_ICR_INT_ASSERTED)) {
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_OTHER);
-   return IRQ_NONE;
-   }
-
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
-- 
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/4] e1000e msi-x fixes

2015-11-09 Thread Benjamin Poirier

Hi,

For this series:


Benjamin Poirier (4):
  e1000e: Remove unreachable code
  e1000e: Do not read icr in Other interrupt
  e1000e: Do not write lsc to ics in msi-x mode
  e1000e: Fix msi-x interrupt automask

 drivers/net/ethernet/intel/e1000e/defines.h |  3 +-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 66 -
 2 files changed, 30 insertions(+), 39 deletions(-)

Changes in v3:
Preserve LSC in IMS, LSC events are not delivered otherwise.
Disable CTRL_EXT.IAME to prevent IMC write on ICR read from external
program.

Changes in v2:
Address review comments from Alexander Duyck: extend cleanup of Other
interrupt handler and use tx_ring->ims_val.


The first three patches cleanup handling of Other interrupts and the
last patch fixes tx and rx interrupts. Please consider reading the
description for that patch before proceeding. I believe that the
following simple tracing statements are helpful in detecting the problem
fixed by the last patch.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a09d1e4..29b8c6e 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1942,6 +1942,9 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_ring *rx_ring = adapter->rx_ring;
+   struct e1000_hw *hw = &adapter->hw;
+
+   trace_printk("%s: rxq0 irq ims 0x%08x\n", netdev->name, er32(IMS));
 
/* Write the ITR value calculated at the end of the
 * previous interrupt.
@@ -1956,6 +1959,7 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused 
irq, void *data)
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
__napi_schedule(&adapter->napi);
+   trace_printk("%s: scheduling napi\n", netdev->name);
}
return IRQ_HANDLED;
 }
@@ -2663,6 +2667,8 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
struct net_device *poll_dev = adapter->netdev;
int tx_cleaned = 1, work_done = 0;
 
+   trace_printk("%s: poll starting ims 0x%08x\n", poll_dev->name,
+er32(IMS));
adapter = netdev_priv(poll_dev);
 
if (!adapter->msix_entries ||
@@ -2680,6 +2686,8 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
e1000_set_itr(adapter);
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   trace_printk("%s: will enable rxq0 irq\n",
+poll_dev->name);
if (adapter->msix_entries)
ew32(IMS, adapter->rx_ring->ims_val);
else

 8< 

With that patch but without the patches in this series we can see that rx irqs
occur at unexpected times:

  -0 [000] .Ns.  1986.887517: e1000e_poll: eth1: will enable 
rxq0 irq
  -0 [000] d.h.  1986.896654: e1000_intr_msix_rx: eth1: rxq0 
irq ims 0x0154
  -0 [000] d.h.  1986.896657: e1000_intr_msix_rx: eth1: 
scheduling napi
  -0 [000] d.H.  1986.896662: e1000_intr_msix_rx: eth1: rxq0 
irq ims 0x0154
  -0 [000] ..s.  1986.896667: e1000e_poll: eth1: poll 
starting ims 0x0154
Warning: many interrupts (2) before napi
  -0 [000] ..s.  1986.896685: e1000e_poll: eth1: will enable 
rxq0 irq

  -0 [000] d.h.  1990.688870: e1000_intr_msix_rx: eth1: 
scheduling napi
  -0 [000] ..s.  1990.688875: e1000e_poll: eth1: poll 
starting ims 0x0154
  -0 [000] dNH.  1990.688913: e1000_intr_msix_rx: eth1: rxq0 
irq ims 0x0154
Warning: interrupt inside napi
  -0 [000] .Ns.  1990.688916: e1000e_poll: eth1: will enable 
rxq0 irq
  -0 [000] d.h.  1990.729688: e1000_intr_msix_rx: eth1: rxq0 
irq ims 0x0154

Here's a typical sequence after applying the patches in this series. Notice
that ims is changed. Another printk at the end of e1000e_poll would show it to
be 0x0154.

  -0 [000] d.h. 23547.977917: e1000_intr_msix_rx: eth1: rxq0 
irq ims 0x0144
  -0 [000] d.h. 23547.977922: e1000_intr_msix_rx: eth1: 
scheduling napi
  -0 [000] ..s. 23547.977928: e1000e_poll: eth1: poll 
starting ims 0x0144
  -0 [000] ..s. 23547.977961: e1000e_poll: eth1: will enable 
rxq0 irq

Finally, here's the script I used to generate the warnings above:

#!/usr/bin/python3

import sys
import re
import pprint


class NaE(Exception):
"Not an Event"
pass

class Event:
def __init__(self, line):
# sample events:
#  -0 [000] d.h.  2025.256536: e1000_intr_msix_rx: eth1: rxq0 
irq ims 0x0154
#  -0 [000] d.h.  2025.25

[PATCH v3 3/4] e1000e: Do not write lsc to ics in msi-x mode

2015-11-09 Thread Benjamin Poirier

In msi-x mode, there is no handler for the lsc interrupt so there is no
point in writing that to ics now that we always assume Other interrupts
are caused by lsc.

Reviewed-by: Jasna Hodzic 
Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/defines.h |  3 ++-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 27 ---
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
b/drivers/net/ethernet/intel/e1000e/defines.h
index 133d407..f7c7804 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -441,12 +441,13 @@
 #define E1000_IMS_RXQ1  E1000_ICR_RXQ1  /* Rx Queue 1 Interrupt */
 #define E1000_IMS_TXQ0  E1000_ICR_TXQ0  /* Tx Queue 0 Interrupt */
 #define E1000_IMS_TXQ1  E1000_ICR_TXQ1  /* Tx Queue 1 Interrupt */
-#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupts */
+#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupt */
 
 /* Interrupt Cause Set */
 #define E1000_ICS_LSC   E1000_ICR_LSC   /* Link Status Change */
 #define E1000_ICS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */
 #define E1000_ICS_RXDMT0E1000_ICR_RXDMT0/* Rx desc min. threshold */
+#define E1000_ICS_OTHER E1000_ICR_OTHER /* Other Interrupt */
 
 /* Transmit Descriptor Control */
 #define E1000_TXDCTL_PTHRESH 0x003F /* TXDCTL Prefetch Threshold */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a73e323..ed7cc8e 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4130,10 +4130,23 @@ void e1000e_reset(struct e1000_adapter *adapter)
 
 }
 
-int e1000e_up(struct e1000_adapter *adapter)
+/**
+ * e1000e_trigger_lsc - trigger an lsc interrupt
+ *
+ * Fire a link status change interrupt to start the watchdog.
+ **/
+static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
 {
struct e1000_hw *hw = &adapter->hw;
 
+   if (adapter->msix_entries)
+   ew32(ICS, E1000_ICS_OTHER);
+   else
+   ew32(ICS, E1000_ICS_LSC);
+}
+
+int e1000e_up(struct e1000_adapter *adapter)
+{
/* hardware has been reset, we need to reload some things */
e1000_configure(adapter);
 
@@ -4145,11 +4158,7 @@ int e1000e_up(struct e1000_adapter *adapter)
 
netif_start_queue(adapter->netdev);
 
-   /* fire a link change interrupt to start the watchdog */
-   if (adapter->msix_entries)
-   ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER);
-   else
-   ew32(ICS, E1000_ICS_LSC);
+   e1000e_trigger_lsc(adapter);
 
return 0;
 }
@@ -4576,11 +4585,7 @@ static int e1000_open(struct net_device *netdev)
hw->mac.get_link_status = true;
pm_runtime_put(&pdev->dev);
 
-   /* fire a link status change interrupt to start the watchdog */
-   if (adapter->msix_entries)
-   ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER);
-   else
-   ew32(ICS, E1000_ICS_LSC);
+   e1000e_trigger_lsc(adapter);
 
return 0;
 
-- 
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 2/4] e1000e: Do not read icr in Other interrupt

2015-11-09 Thread Benjamin Poirier

removes the icr read in the other interrupt handler, uses eiac to
autoclear the Other bit from icr and ims. This allows us to avoid
interference with rx and tx interrupts in the Other interrupt handler.

The information read from icr is not needed. IMS is configured such that
the only interrupt cause that can trigger the Other interrupt is Link
Status Change.

Signed-off-by: Benjamin Poirier 

---
I noticed a 8-16% improvement in netperf rr tests after applying this
patch. This is a little surprising since this patch touches the handling
of Other interrupts, which do not occur during such a test. Some
profiling was not very insightful but the improvement seems related to
writing Other to EIAC.
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a228167..a73e323 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1905,24 +1905,15 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
-   u32 icr = er32(ICR);
 
-   if (icr & adapter->eiac_mask)
-   ew32(ICS, (icr & adapter->eiac_mask));
+   hw->mac.get_link_status = true;
 
-   if (icr & E1000_ICR_OTHER) {
-   if (!(icr & E1000_ICR_LSC))
-   goto no_link_interrupt;
-   hw->mac.get_link_status = true;
-   /* guard against interrupt when we're going down */
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   /* guard against interrupt when we're going down */
+   if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   ew32(IMS, E1000_IMS_OTHER);
}
 
-no_link_interrupt:
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
-
return IRQ_HANDLED;
 }
 
@@ -2019,6 +2010,7 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
   hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
+   adapter->eiac_mask |= E1000_IMS_OTHER;
 
/* Cause Tx interrupts on every write back */
ivar |= (1 << 31);
@@ -2247,7 +2239,7 @@ static void e1000_irq_enable(struct e1000_adapter 
*adapter)
 
if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
} else if ((hw->mac.type == e1000_pch_lpt) ||
   (hw->mac.type == e1000_pch_spt)) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
-- 
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 4/4] e1000e: Fix msi-x interrupt automask

2015-11-09 Thread Benjamin Poirier

Since the introduction of 82574 support in e1000e, the driver has worked
on the assumption that msi-x interrupt generation is automatically
disabled after each irq. As it turns out, this is not the case.
Currently, rx interrupts can fire multiple times before and during napi
processing. This can be a problem for users because frames that arrive
in a certain window (after adapter->clean_rx() but before
napi_complete_done() has cleared NAPI_STATE_SCHED) generate an interrupt
which does not lead to napi_schedule(). These frames sit in the rx queue
until another frame arrives (a tcp retransmit for example).

While the EIAC and CTRL_EXT registers are properly configured for irq
automask, the modification of IAM in e1000_configure_msix() is what
prevents automask from working as intended.

This patch removes that erroneous write and fixes interrupt rearming for
tx interrupts. It also clears IAME from CTRL_EXT. This is not strictly
necessary for operation of the driver but it is to avoid disruption from
potential programs that access the registers directly, like `ethregs -c`.

Reported-by: Frank Steiner 
Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index ed7cc8e..2a22ed7 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1931,6 +1931,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused 
irq, void *data)
/* Ring was not completely cleaned, so fire another interrupt */
ew32(ICS, tx_ring->ims_val);
 
+   if (!test_bit(__E1000_DOWN, &adapter->state))
+   ew32(IMS, adapter->tx_ring->ims_val);
+
return IRQ_HANDLED;
 }
 
@@ -2018,12 +2021,8 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
ew32(IVAR, ivar);
 
/* enable MSI-X PBA support */
-   ctrl_ext = er32(CTRL_EXT);
-   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR;
-
-   /* Auto-Mask Other interrupts upon ICR read */
-   ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER);
-   ctrl_ext |= E1000_CTRL_EXT_EIAME;
+   ctrl_ext = er32(CTRL_EXT) & ~E1000_CTRL_EXT_IAME;
+   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME;
ew32(CTRL_EXT, ctrl_ext);
e1e_flush();
 }
-- 
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next v2 0/8] openvswitch: NAT support.

2015-11-09 Thread Jarno Rajahalme

[Resending as plain text]

> On Nov 9, 2015, at 5:31 AM, Patrick McHardy  wrote:
> 
> On 06.11, Jarno Rajahalme wrote:
>> This series adds NAT support to openvswitch kernel module.  A few
>> changes are needed to the netfilter code to facilitate this (patches
>> 1-3/8).  Patches 4-7 make the openvswitch kernel module ready for the
>> patch 8 that adds the NAT support for calling into netfilter NAT code
>> from the openvswitch conntrack action.
> 
> I'm missing some high level description, especially how it is invoked, how
> it makes sure expectations of the NAT code about its invocation are met
> (it is my understanding that OVS simply invokes this based on actions
> specified by the user) and how it interacts with the remaining netfilter
> features.
> 

The corresponding OVS userspace patches contain the new test cases for the NAT 
features (http://openvswitch.org/pipermail/dev/2015-November/061920.html) in 
tests/system-traffic.at. I’ll walk through two of them below.

Test case: conntrack - simple SNAT

In these tests ports 1 and 2 are in different namespaces. The flow table below 
allows all IPv4 traffic between port 1 to port 2, but IP connections from port 
1 to port 2 are source NATted:

in_port=1,ip,action=ct(commit,zone=1,nat(src=10.1.1.240-10.1.1.255)),2
in_port=2,ct_state=-trk,ip,action=ct(table=0,zone=1,nat)
in_port=2,ct_state=+trk,ct_zone=1,ip,action=1

This flow table matches all IPv4 traffic from port 1, runs them through 
conntrack in zone 1 and NATs them. The NAT is initialized to do source IP 
mapping to the given range for the first packet of each connection, after which 
the new connection is committed. For further packets of already tracked 
connections NAT is done according to the connection state and the commit is a 
no-op. Each packet that is not flagged as a drop by the CT action is forwarded 
to port 2. The CT action does an implicit fragmentation reassembly, so that 
only complete packets are run through conntrack. Reassembled packets are 
re-fragmented on output.

The IPv4 traffic coming from port 2 is first matched for the non-tracked state 
(-trk), which means that the packet has not been through a CT action yet. Such 
traffic is run trough the conntrack in zone 1 and all packets associated with a 
NATted connection are NATted also in the return direction. After the packet has 
been through conntrack it is recirculated back to table 0 (which is the default 
table, so all the rules above are in table 0). The CT action changes the ‘trk’ 
flag to being set, so the packets after recirculation match the third rule 
(+trk), and the packet is output on port 1. Since also the ct_zone is matched, 
only packets that were actually tracked by conntrack can match the 3rd rule and 
output to 1.

I’m skipped the rules for ARP handling in this walkthrough, but the test case 
has rules to match on the ARP request on the NATted address and reply to it.

The above is verified with a HTTP request and a subsequent conntrack entry 
listing in the test case:

dnl HTTP requests from p0->p1 should work fine.
NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 5 -T 1 --retry-connrefused -v -o 
wget0.log])

AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2) | sed -e 
's/dst=10.1.1.2[[45]][[0-9]]/dst=10.1.1.2XX/'], [0], [dnl
TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport= dport= 
src=10.1.1.2 dst=10.1.1.2XX sport= dport= [[ASSURED]] mark=0 
zone=1 use=1
])

As a second example, I’ll walk through a test case of NAT with FTP on IPv6:

As before, ports 1 (p0) and 2 (p1) reside in different namespaces (‘at_ns0’ and 
‘at_ns1’, respectively). A static neighbor cache entry for the NATted address 
is created in at_ns1. In a more realistic scenario a controller would need to 
implement a ND proxy instead:

ADD_VETH(p0, at_ns0, br0, "fc00::1/96")
NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address 80:88:88:88:88:88])
ADD_VETH(p1, at_ns1, br0, "fc00::2/96")
dnl Would be nice if NAT could translate neighbor discovery messages, too.
NS_CHECK_EXEC([at_ns1], [ip -6 neigh add fc00::240 lladdr 80:88:88:88:88:88 
dev p1])

The OpenFlow rules are split into two tables (table 0 and table 1).

Table 0 tracks all IPv6 traffic and drops all other traffic. Packets are 
recirculated to table 1 after conntrack. Existing connections are NATted before 
recirculation:

table=0 priority=10 ip6, action=ct(nat,table=1)
table=0 priority=0 action=drop

Table 1:

FTP control connections from the “private” address to port 21 are output to 
port 1 after being tracked, NATted, and committed (== confirmed).

table=1 in_port=1 ct_state=+new tcp6 ipv6_src=fc00::1 tp_dst=21  
action=ct(alg=ftp,commit,nat(src=fc00::240)),2

The matches on the source/dest IP address below are not really needed, but in 
the test case they are to make sure the NATting actually happens.

Related TCP connections to the reverse direction are NATt

Re: Fw: [Bug 106711] VXLAN: RTNL assertion failed at net/core/net_namespace.c:187

2015-11-09 Thread Cong Wang

On Mon, Nov 9, 2015 at 10:39 AM, Stephen Hemminger
 wrote:
>
>
> Begin forwarded message:
>
> Date: Mon, 9 Nov 2015 09:42:52 +
> From: "bugzilla-dae...@bugzilla.kernel.org" 
> 
> To: "shemmin...@linux-foundation.org" 
> Subject: [Bug 106711] VXLAN: RTNL assertion failed at 
> net/core/net_namespace.c:187
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=106711
>
> Kari Hautio  changed:
>
>What|Removed |Added
> 
>  CC||khau...@gmail.com
>
> --- Comment #1 from Kari Hautio  ---
> Also affects 4.1.12
>
> [   34.412104] RTNL: assertion failed at
> /home/kernel/COD/linux/net/core/net_namespace.c (187)
> [   34.412120] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.12-040112-generic
> #201510262131
> [   34.412123] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.7.5-20140709_153950- 04/01/2014
> [   34.412126]  880035dd2380 88013fd838f8 817e367c
> 
> [   34.412131]   88013fd83928 816cfbe5
> 880035526200
> [   34.412135]  8800b9b3fe00 88013a3dc9c0 88013a3dccc0
> 88013fd83938
> [   34.412140] Call Trace:
> [   34.412142][] dump_stack+0x45/0x57
> [   34.412159]  [] __peernet2id+0xa5/0xb0
> [   34.412163]  [] peernet2id+0x18/0x30

As I replied in BZ, this is probably fixed by:

commit 95f38411df055a0ecefe3a3d119d98241087d5ca
Author: Nicolas Dichtel 
Date:   Thu May 7 11:02:51 2015 +0200

netns: use a spin_lock to protect nsid management


Thanks.

> [   34.412168]  [] vxlan_fdb_info+0xfa/0x360 [vxlan]
> [   34.412173]  [] ? __alloc_skb+0x8c/0x1f0
> [   34.412178]  [] vxlan_fdb_notify+0x72/0x100 [vxlan]
> [   34.412182]  [] vxlan_fdb_create+0x14a/0x3a0 [vxlan]
> [   34.412187]  [] vxlan_snoop+0x1cb/0x1d0 [vxlan]
> [   34.412192]  [] vxlan_rcv+0x3a2/0x620 [vxlan]
> [   34.412196]  [] vxlan_udp_encap_recv+0x156/0x420 [vxlan]
> [   34.412201]  [] ? vxlan_open+0x1e0/0x1e0 [vxlan]
> [   34.412207]  [] udp_queue_rcv_skb+0x1f3/0x4f0
> [   34.412211]  [] __udp4_lib_rcv+0x127/0x930
> [   34.412215]  [] udp_rcv+0x1a/0x20
> [   34.412227]  [] ip_local_deliver_finish+0xae/0x230
> [   34.412231]  [] ip_local_deliver+0x46/0xa0
> [   34.412235]  [] ip_rcv_finish+0x81/0x370
> [   34.412239]  [] ip_rcv+0x2a2/0x3c0
> [   34.412244]  [] __netif_receive_skb_core+0x7e3/0xa10
> [   34.412248]  [] __netif_receive_skb+0x18/0x60
> [   34.412252]  [] process_backlog+0xb2/0x150
> [   34.412256]  [] net_rx_action+0x1fa/0x320
> [   34.412265]  [] __do_softirq+0x103/0x280
> [   34.412269]  [] irq_exit+0xad/0xb0
> [   34.412274]  [] smp_apic_timer_interrupt+0x46/0x60
> [   34.412279]  [] apic_timer_interrupt+0x6e/0x80
> [   34.412280][] ? native_safe_halt+0x6/0x10
> [   34.412290]  [] ? rcu_eqs_enter+0xa3/0xb0
> [   34.412297]  [] default_idle+0x1e/0xc0
> [   34.412315]  [] arch_cpu_idle+0xf/0x20
> [   34.412325]  [] cpu_startup_entry+0x367/0x3e0
> [   34.412332]  [] ? clockevents_register_device+0x112/0x1e0
> [   34.412338]  [] start_secondary+0x17e/0x1a0
>
> --
> You are receiving this mail because:
> You are the assignee for the bug.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Luck, Tony

> ACK for the ACPI changes (and CCing Tony and Boris for the heads-up as they
> are way more famailiar with the APEI code than I am).

Sure. If kvfree() really is smart enough to figure it out then there
it no point in the if (blah) kfree() else vfree().

The drivers/acpi/apei/erst.c code isn't doing anything subtle or magic here.

-Tony

Re: linux-next network throughput performance regression

2015-11-09 Thread Eric Dumazet

On Mon, 2015-11-09 at 20:23 +, Simon Xiao wrote:
> Thanks Eric to provide the data. I am looping Tom (as I am looking into his 
> recent patches) and Olaf (from Suse).
> 
> So, if I understand it correctly, you are running netperf with single
> TCP connection, and you got ~26Gbps initially and got ~30Gbps after
> turning the tx-usecs and tx-frames.
> 
> Do you have a baseline on your environment for the best/max/or peak
> throughput?

The peak on my lab pair is about 34Gbits, usually I get this if I pin
the receiving thread on a cpu, otherwise process scheduler can really
hurt too much.

lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H lpaa24 -l 20 -Cc -T ,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
lpaa24.prod.google.com () port 0 AF_INET : cpu bind
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
tcpi_rtt 101 tcpi_rttvar 15 tcpi_snd_ssthresh 289 tpci_snd_cwnd 289
tcpi_reordering 3 tcpi_total_retrans 453
Recv   SendSend  Utilization   Service Demand
Socket Socket  Message  Elapsed  Send Recv SendRecv
Size   SizeSize Time Throughput  localremote   local   remote
bytes  bytes   bytessecs.10^6bits/s  % S  % S  us/KB   us/KB

 87380  16384  1638420.00  33975.99   1.27 3.36 0.147   0.389  

Not too bad, I don't recall reaching more than that ever.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-09 Thread Jason Baron

On 11/09/2015 09:40 AM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
> 
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
> 
> Signed-off-by: Rainer Weikusuat 

[...]

> @@ -1590,10 +1723,14 @@ restart:
>   goto out_unlock;
>   }
>  
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> + if (!unix_dgram_peer_recv_ready(sk, other)) {
>   if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> + if (unix_dgram_peer_wake_me(sk, other)) {
> + err = -EAGAIN;
> + goto out_unlock;
> + }
> +
> + goto restart;
>   }

So this will cause 'unix_state_lock(other) to be called twice in a
row if we 'goto restart' (and hence will softlock the box). It just
needs a 'unix_state_unlock(other);' before the 'goto restart'.

I also tested this patch with a single unix server and 200 client
threads doing roughly epoll() followed by write() until -EAGAIN in a
loop. The throughput for the test was roughly the same as current
upstream, but the cpu usage was a lot higher. I think its b/c this patch
takes the server wait queue lock in the _poll() routine. This causes a
lot of contention. The previous patch you posted for this where you did
not clear the wait queue on every wakeup and thus didn't need the queue
lock in poll() (unless we were adding to it), performed much better.

However, the previous patch which tested better didn't add to the remote
queue when it was full on sendmsg() - so it wouldn't be correct for
epoll ET. Adding to the remote queue for every sendmsg() that fails does
seem undesirable, if we aren't even doing poll(). So I'm not sure if
just going back to the previous patch is a great option eitherI'm
also not sure how realistic the test case I have is. It would be great
if we had some other workloads to test against.

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] inet: delay address promotion check until last request in message

2015-11-09 Thread Neil Horman

On Mon, Nov 09, 2015 at 10:35:42PM +0200, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Sun, 8 Nov 2015, Neil Horman wrote:
> 
> > On Sat, Nov 07, 2015 at 01:49:25AM +0200, Julian Anastasov wrote:
> > > 
> > >   flush can provide many parameters. As there is no
> > > any kind of indication in the netlink message that all addresses
> > > are removed, we can not avoid the promotion.
> > > 
> > This is true, but seems irrellevant to me.  A flush operation is a sequence 
> > of
> > RTM_DELADDR operations in a one or more netlink packets.  The way my patch 
> > is
> > written, if a set of DELADDR requests is interspersed with other non DELADDR
> > requests, then we do a promotion check between each consecutive set of 
> > DELADDR
> > requests.  As such, all that happens is that the promotion check happens
> > possibly more often than needed.  Its not optimal, but not harmful either.
> 
>   It is harmful, you miss promotion for some of the
> subnets, see below...
> 
> > > > +* Only check for address promotion when this is the last 
> > > > request
> > > > +* in this netlink transaction.  It allows this operation to 
> > > > complete
> > > > +* in O(n) time rather than O(n^2)
> > > 
> > >   It is not correct to assume that one promotion per
> > > transaction is enough. The promotion happens in every subnet,
> > > it was not once per device.
> > > 
> > 
> > I'm not sure I understand the relevance here.  All I'm doing is, in effect
> > masking the promote_secondaries sysctl for an interface doing a flush 
> > operation.
> > Its equivalent to doing this in user space:
> > 
> > echo 0 > /proc/sys/net/ipv4/conf//promote_secondaries
> > A=`some arbitrary address in `
> > ip addr del  except A>
> > echo 1 > /proc/sys/net/ipv4/conf//promote_secondaries
> > ip addr del A
> > 
> > Can you please explain to me the use case in which delaying a promotion
> > operation until we think we're done ('done' being defined by the above 
> > transition
> > from a DELADDR operation to a non-DELADDR operation in a netlink packet)
> > produces an outcome that differs from the expectation with this patch in 
> > place?
> 
>   Here is how we can miss promotion...
> 
> dev=eth1
> ifconfig $dev up
> ip addr add 1.2.3.4/24 dev $dev
> ip addr add 1.2.3.4/16 dev $dev
> ip addr add 1.2.3.44/24 dev $dev
> ip addr add 1.2.3.44/16 dev $dev
> echo 1 > /proc/sys/net/ipv4/conf/$dev/promote_secondaries
> ip addr flush dev $dev to 1.2.3.4/30
> ip -V
> ip utility, iproute2-ss010824
> 
> What happens is a request to delete just primary addresses:
> 1.2.3.4/24 and 1.2.3.4/16. The /30 is chosen in such a way,
> so that any repeating attempts to flush the secondary
> addresses are avoided. As result, with your patch, only
> 1.2.3.44/16 is promoted (the last secondary), 1.2.3.44/24
> which is first in the list of the secondaries, is not
> promoted, it is removed. You can even use 1.2.3.4/24,
> 1.2.3.5/16, 1.2.3.44/24 and 1.2.3.55/16 in case the
> equal IPs are not a good example.
> 
> The problem will be more visible if one builds netlink message
> by hand containing DEL for different primaries.
> 
Ah, crap, Ok.  I didn't consider the use case in which user space would build a
filter of addresses that were on a mix of subnets.  That makes sense now, thanks
for the explination.

I suppose then the only optimization to be had here is to detect the case of a
complete flush of all addresses in user space and either manually or
automatically disable promotion during the flush operation

Neil

> Regards
> 
> --
> Julian Anastasov 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Rafael J. Wysocki

On Monday, November 09, 2015 08:56:10 PM Tetsuo Handa wrote:
> There are many locations that do
> 
>   if (memory_was_allocated_by_vmalloc)
> vfree(ptr);
>   else
> kfree(ptr);
> 
> but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
> using is_vmalloc_addr(). Unless callers have special reasons, we can
> replace this branch with kvfree(). Please check and reply if you found
> problems.

ACK for the ACPI changes (and CCing Tony and Boris for the heads-up as they
are way more famailiar with the APEI code than I am).

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 1/4] Produce system time from correlated clocksource

2015-11-09 Thread John Stultz

On Tue, Nov 3, 2015 at 11:18 AM, Stanton, Kevin B
 wrote:
> On Wed, 21 Oct 2015, Thomas Gleixner wrote:
>> On Tue, 20 Oct 2015, John Stultz wrote:
>>> Being able to have various hardware sharing a time base is quite
>>> useful, and methods for correlating timestamps together are useful.
>>> But I don't yet really understand why its important that we can
>>> translate a hardware timestamp from some time in the past to the
>>> correct system time in the past without error.
>
>>If your device can only provide timestamps from the past, then
>>having access to the history is important if you want to have
>>precise correlation.
>
> I hope this can be solved in timekeeping. But first, a quick
> recap...
>
> The timestamp pair (including the ART snapshot, as described
> previously) is captured simultaneously by the hardware
> resulting, effectively, in a (PTP,TSC) pair, or
> (AudioPosition,TSC) pair. The in-the-past-TSC value needs to be
> converted to system time so that it can be used by applications,
> without exposing the underlying ART or TSC.
>
> Note: ART is architectural, defined as part of Invariant
> Timekeeping in the current SDM, so this isn't a one-off
> capability.
>
> To convert a past TSC timestamp to system time 'correctly' (in a
> mathematical sense), a history of monotonic rate adjustments
> since that time in the past must be maintained.

But again, my main problem is that I'm not totally understanding the rational.
Here you're providing the *what*, not really the *why*.

As I mentioned earlier, there are possibly simpler (at least for the
kernel) ways to generate similar data using CLOCK_MONOTONIC_RAW, which
has potentially a 500ppm error. *Why* is it important to add more
complexity to the timekeeping core in order to avoid that error?

> Regarding the amount of history, as Chris mentioned (and
> in the context of new Intel hardware) LAN timestamp pairs are
> a few microseconds in the past (no history would be required),
> but for timestamps captured by the audio DSP, unfortunately,
> they can be a small number of *milliseconds* in the past by the
> time they're available to the audio driver (some history
> required to convert accurately). I'm told that 4ms of adjustment
> history accommodates known hardware.

For a timestamp recorded 4ms ago, 500ppm of error is 2us. Why is 2us
problematic for audio? That seems quite below the human threshold to
notice.

> Getting this 'correct' in one place (timekeeping) seems a lot
> better than unnecessarily introducing inaccuracy via software
> sampling (and extrapolation) or leaving it to each driver to do
> it themselves, and to do it differently (and/or do it wrongly).

Having a common infrastructure for extrapolating the data isn't
something I'm objecting to.  Its just that the shadow-timekeeper has
been problematic enough in recent times bug wise, so I'm just hesitant
to expand the complexity there. That said, I'm open to ideas, but
would really like a better understanding of why other solutions would
be insufficient, and why this one is the best solution.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Dilger, Andreas

On 2015/11/09, 04:56, "Tetsuo Handa" 
wrote:

>There are many locations that do
>
>  if (memory_was_allocated_by_vmalloc)
>vfree(ptr);
>  else
>kfree(ptr);
>
>but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
>using is_vmalloc_addr(). Unless callers have special reasons, we can
>replace this branch with kvfree(). Please check and reply if you found
>problems.
>
>Signed-off-by: Tetsuo Handa 
>Acked-by: Michal Hocko 

For Lustre part:
Reviewed-by: Andreas Dilger 

Cheers, Andreas
-- 
Andreas Dilger

Lustre Principal Engineer
Intel High Performance Data Division


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] inet: delay address promotion check until last request in message

2015-11-09 Thread Julian Anastasov


Hello,

On Sun, 8 Nov 2015, Neil Horman wrote:

> On Sat, Nov 07, 2015 at 01:49:25AM +0200, Julian Anastasov wrote:
> > 
> > flush can provide many parameters. As there is no
> > any kind of indication in the netlink message that all addresses
> > are removed, we can not avoid the promotion.
> > 
> This is true, but seems irrellevant to me.  A flush operation is a sequence of
> RTM_DELADDR operations in a one or more netlink packets.  The way my patch is
> written, if a set of DELADDR requests is interspersed with other non DELADDR
> requests, then we do a promotion check between each consecutive set of DELADDR
> requests.  As such, all that happens is that the promotion check happens
> possibly more often than needed.  Its not optimal, but not harmful either.

It is harmful, you miss promotion for some of the
subnets, see below...

> > > +  * Only check for address promotion when this is the last request
> > > +  * in this netlink transaction.  It allows this operation to complete
> > > +  * in O(n) time rather than O(n^2)
> > 
> > It is not correct to assume that one promotion per
> > transaction is enough. The promotion happens in every subnet,
> > it was not once per device.
> > 
> 
> I'm not sure I understand the relevance here.  All I'm doing is, in effect
> masking the promote_secondaries sysctl for an interface doing a flush 
> operation.
> Its equivalent to doing this in user space:
> 
> echo 0 > /proc/sys/net/ipv4/conf//promote_secondaries
> A=`some arbitrary address in `
> ip addr del  except A>
> echo 1 > /proc/sys/net/ipv4/conf//promote_secondaries
> ip addr del A
> 
> Can you please explain to me the use case in which delaying a promotion
> operation until we think we're done ('done' being defined by the above 
> transition
> from a DELADDR operation to a non-DELADDR operation in a netlink packet)
> produces an outcome that differs from the expectation with this patch in 
> place?

Here is how we can miss promotion...

dev=eth1
ifconfig $dev up
ip addr add 1.2.3.4/24 dev $dev
ip addr add 1.2.3.4/16 dev $dev
ip addr add 1.2.3.44/24 dev $dev
ip addr add 1.2.3.44/16 dev $dev
echo 1 > /proc/sys/net/ipv4/conf/$dev/promote_secondaries
ip addr flush dev $dev to 1.2.3.4/30
ip -V
ip utility, iproute2-ss010824

What happens is a request to delete just primary addresses:
1.2.3.4/24 and 1.2.3.4/16. The /30 is chosen in such a way,
so that any repeating attempts to flush the secondary
addresses are avoided. As result, with your patch, only
1.2.3.44/16 is promoted (the last secondary), 1.2.3.44/24
which is first in the list of the secondaries, is not
promoted, it is removed. You can even use 1.2.3.4/24,
1.2.3.5/16, 1.2.3.44/24 and 1.2.3.55/16 in case the
equal IPs are not a good example.

The problem will be more visible if one builds netlink message
by hand containing DEL for different primaries.

Regards

--
Julian Anastasov 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 5/6] net/faraday: Enable NCSI interface

2015-11-09 Thread Benjamin Herrenschmidt

On Mon, 2015-11-09 at 18:30 +1100, Gavin Shan wrote:
> 
> Yeah, It's something that I hilighed in the cover letter. I was
> thinking we might need a better way to enable Tx/Rx before the
> interrupt is up, but couldn't figure out one way. So I need some
> advice here.

No, that's not right. For Tx/Rx to work the interface must be opened,
there is simply no way around that and that's perfectly fine. The
situation with an MDIO PHY is the same, most drivers can't talk to
their PHY until the interface has been opened because the chip is
basically powered down otherwise.

So we require the interface to be opened to talk, so far so good,
the NC-SI stack doesn't even need to open it itself, it's acceptable
to require userspace to do it. IE. Userspace will chose what interface
to use, open it (for DHCP etc... or whatever other reason) and *that*
will then trigger the NC-SI negociation.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: linux-next network throughput performance regression

2015-11-09 Thread Simon Xiao

Thanks Eric to provide the data. I am looping Tom (as I am looking into his 
recent patches) and Olaf (from Suse).

So, if I understand it correctly, you are running netperf with single TCP 
connection, and you got ~26Gbps initially and got ~30Gbps after turning the 
tx-usecs and tx-frames.

Do you have a baseline on your environment for the best/max/or peak throughput?
Again, in my environment (SLES bare metal), if use SLES12 default kernel as a 
baseline, we can see significant performance drop (10% ~ 50%) on latest 
linux-next kernel. 
Absolutely I will try the same test on net-next soon and update the results to 
here later.

Thanks,
Simon


> -Original Message-
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: Saturday, November 7, 2015 11:50 AM
> To: David Ahern 
> Cc: Simon Xiao ; de...@linuxdriverproject.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org; David Miller
> ; KY Srinivasan ; Haiyang
> Zhang 
> Subject: Re: linux-next network throughput performance regression
> 
> On Sat, 2015-11-07 at 11:35 -0800, Eric Dumazet wrote:
> > On Fri, 2015-11-06 at 14:30 -0700, David Ahern wrote:
> > > On 11/6/15 2:18 PM, Simon Xiao wrote:
> > > > The .config file used to build linux-next kernel is attached to this 
> > > > mail.
> > >
> > > Thanks.
> > >
> > > Failed to notice this on the first response; my brain filled in. Why
> > > linux-next tree? Can you try net-next which is more relevant for
> > > this mailing list, post the top commit id and config file used?
> >
> > Throughput on a single TCP flow for a 40G NIC can be tricky to tune.
> >
> > Make sure IRQ are properly setup/balanced, as I know that IRQ names
> > were changed recently and your scripts might have not noticed...
> >
> > Also "ethtool -c eth0" might show very different interrupt coalescing
> > params ?
> >
> > I too have a Mellanox 40Gb in my lab and saw no difference in
> > performance with recent kernels.
> >
> > Of course, a simple "perf record -a -g sleep 4 ; perf report" might
> > point to some obvious issue. Like unexpected segmentation in case of
> > forwarding...
> >
> >
> 
> I did a test with current net tree on both sender and receiver
> 
> lpaa23:~# ./netperf -H 10.246.7.152
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 10.246.7.152 () port 0 AF_INET
> Recv   SendSend
> Socket Socket  Message  Elapsed
> Size   SizeSize Time Throughput
> bytes  bytes   bytessecs.10^6bits/sec
> 
>  87380  16384  1638410.0026864.98
> lpaa23:~# ethtool -c eth1
> Coalesce parameters for eth1:
> Adaptive RX: on  TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 40
> pkt-rate-high: 45
> 
> rx-usecs: 16
> rx-frames: 44
> rx-usecs-irq: 0
> rx-frames-irq: 0
> 
> tx-usecs: 16
> tx-frames: 16
> tx-usecs-irq: 0
> tx-frames-irq: 256
> 
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
> 
> rx-usecs-high: 128
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
> 
> lpaa23:~# ethtool -C eth1 tx-usecs 4 tx-frames 4


> lpaa23:~# ./netperf -H
> 10.246.7.152 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0
> AF_INET to
> 10.246.7.152 () port 0 AF_INET
> Recv   SendSend
> Socket Socket  Message  Elapsed
> Size   SizeSize Time Throughput
> bytes  bytes   bytessecs.10^6bits/sec
> 
>  87380  16384  1638410.0030206.27
>

Re: [PATCH] arm64: bpf: fix JIT stack setup

2015-11-09 Thread Z Lim

On Mon, Nov 9, 2015 at 10:08 AM, Shi, Yang  wrote:
> I added it to stay align with ARMv8 AAPCS to maintain the correct FP during
> function call. It makes us get correct stack backtrace.
>
> I think we'd better to keep compliant with ARMv8 AAPCS in BPF JIT prologue
> too.
>
> If nobody thinks it is necessary, we definitely could remove that change.

Oh no, I don't think anyone will say it's unnecessary!
I agree the A64_FP-related change is a good idea, so stack unwinding works.

How about splitting this into two patches? One for the BPF-related
bug, and another for A64 FP-handling.

Thanks again for tracking this down and improving things overall for arm64 :)

>
> Thanks,
> Yang
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)

2015-11-09 Thread Bendik Rønning Opstad

On 24/10/15 14:57, Eric Dumazet wrote:
> Thank you for this very high quality patch submission.
>
> Please give us a few days for proper evaluation.
>
> Thanks !

Guys, thank you very much for taking the time to evaluate this.

Since there haven't been any more feedback or comments I'll submit an
RFCv2 with a few changes which includes removing the Nagle
modification.

After discussing the Nagle change on setsockopt we realize that it
should be evaluated more thoroughly, and is better left for a later
patch submission.

Bendik

P.S.
Trimming the CC list to only those who have responded as gmail says
I'm spamming :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] drivers: net: cpsw: add support for fixed-links.

2015-11-09 Thread Daniel Trautmann

Add support for fixed-links in configurations without PHY.
(e.g. connection to a switch, SGMII point to point, SFPs)

Check: Documentation/devicetree/bindings/net/fixed-link.txt.

Signed-off-by: Daniel Trautmann 
---
 drivers/net/ethernet/ti/cpsw.c | 40 
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 040fbc1..3e32365 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2039,19 +2039,35 @@ static int cpsw_probe_dt(struct cpsw_priv *priv,
priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0);
parp = of_get_property(slave_node, "phy_id", &lenp);
if ((parp == NULL) || (lenp != (sizeof(void *) * 2))) {
-   dev_err(&pdev->dev, "Missing slave[%d] phy_id 
property\n", i);
-   goto no_phy_slave;
-   }
-   mdio_node = of_find_node_by_phandle(be32_to_cpup(parp));
-   phyid = be32_to_cpup(parp+1);
-   mdio = of_find_device_by_node(mdio_node);
-   of_node_put(mdio_node);
-   if (!mdio) {
-   dev_err(&pdev->dev, "Missing mdio platform device\n");
-   return -EINVAL;
+   if (!of_phy_is_fixed_link(slave_node)) {
+   dev_err(&pdev->dev,
+   "Missing slave[%d] phy_id property\n",
+   i);
+   goto no_phy_slave;
+   }
+
+   ret = of_phy_register_fixed_link(slave_node);
+   if (ret) {
+   dev_err(&pdev->dev, "cannot register fixed 
PHY\n");
+   return ret;
+   }
+
+   /* In the case of a fixed PHY, the DT node associated
+* to the PHY is the Ethernet MAC DT node.
+*/
+   priv->phy_node = of_node_get(slave_node);
+   } else {
+   mdio_node = of_find_node_by_phandle(be32_to_cpup(parp));
+   phyid = be32_to_cpup(parp + 1);
+   mdio = of_find_device_by_node(mdio_node);
+   of_node_put(mdio_node);
+   if (!mdio) {
+   dev_err(&pdev->dev, "Missing mdio platform 
device\n");
+   return -EINVAL;
+   }
+   snprintf(slave_data->phy_id, sizeof(slave_data->phy_id),
+PHY_ID_FMT, mdio->name, phyid);
}
-   snprintf(slave_data->phy_id, sizeof(slave_data->phy_id),
-PHY_ID_FMT, mdio->name, phyid);
slave_data->phy_if = of_get_phy_mode(slave_node);
if (slave_data->phy_if < 0) {
dev_err(&pdev->dev, "Missing or malformed slave[%d] 
phy-mode property\n",
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] packet: Allow packets with only a header (but no payload)

2015-11-09 Thread Felix Fietkau

On 2015-11-09 18:53, Willem de Bruijn wrote:
> On Sat, Nov 7, 2015 at 8:11 AM, Felix Fietkau  wrote:
>> On 2015-07-31 00:15, Martin Blumenstingl wrote:
>>> On Wed, Jul 29, 2015 at 8:05 AM, Willem de Bruijn  
>>> wrote:
 Martin, to return to your initial statement that PPPoE PADI packets can
 have a zero payload: the PPPoE RFC states that PADI packets "MUST
 contain exactly one TAG of TAG_TYPE Service-Name, indicating the
 service the Host is requesting, and any number of other TAG types."
 (RFC 2516, 5.1). Is the observed behavior (no payload) perhaps
 incorrect?
>>> As far as I can see you are right, but the real world seems to be different.
>>> My ISP for example lists the PPPoE connection settings, but they are
>>> nowhere mentioning the "service name".
>>>
>>> I have also re-read pppd's source code again and that seems to confirm
>>> what you are reading in the RFC: Leaving the service name away makes
>>> seems to violate the RFC, but pppd still accepts those configurations.
>>>
 Even if it is, if this is breaking established userspace expectations,
 we should look into it. Ethernet specifies a minimum payload size of
 46 on the wire, but perhaps that is handled with padding, so that
 0 length should be valid within the stack. Also, there may be other
 valid uses of 0 length payload on top of link layers that are not Ethernet.
>>> Good catch. I would also like to note that the documentation for
>>> "hard_header_len" describes it as "Hardware header length". When the
>>> purpose of this field we should check whether the documentation should
>>> be updated to "Minimum hardware header length" -> that would mean the
>>> condition has to be a "len < hard_header_len" instead of a "len <=
>>> hard_header_len" (as it is now).
>>>
>>> PS: I have also added the pppd maintainer (Paul Mackerras) to this
>>> thread because I think he should know about this issue (and he can
>>> probably provide more details if required).
>>> As a quick summary for him: linux  >= 3.19 rejects PADI packets when
>>> no service name is configured.
>> Any news on this? Users are complaining about this regression:
>> https://dev.openwrt.org/ticket/20707
> 
> I took another look. This hinges on the question what the contract with
> device drivers is on skb network data and length. Is passing an skb with
> skb->len == 0 to ndo_start_xmit allowed?
> 
> From what I gather from the ethernet spec [1], sending frames with an
> empty head is allowed on that medium, at least.
> 
> A quick scan of a few drivers and the loopback path also does not show
> anything that would break. In some cases, skb_network_header points
> beyond the end of the buffer (ETH_HLEN), but the length is correctly
> reported as 0.
> 
> The tap device can also generate packets consisting of only a link layer
> header: compares len < ETH_HLEN in tun_get_user.
> 
> So, I think that this change should be correct:
> 
>  static bool ll_header_truncated(const struct net_device *dev, int len)
>  {
> -   /* net device doesn't like empty head */
> -   if (unlikely(len <= dev->hard_header_len)) {
> +   if (unlikely(len < dev->hard_header_len)) {
> 
> but a definitive answer would require an audit of all device drivers
> (including bonding, ..) or at least the certainty that it has always
> been correct to send a packet of only link layer header to
> ndo_start_xmit.
> 
> [1] IEEE 802.3™-2012 – Section One, {3.2.8, 4.2.3.3}
Yeah, I agree that such an audit is required. However, I think it's
*much* more important to add this change as soon as possible to fix the
regression. The old code may have had theoretical driver issues, but the
current code breaks real-world user setups.

- Felix
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [iproute PATCH v2 4/6] ipaddress: fix ipaddr_flush for Linux >= 3.1

2015-11-09 Thread Sergei Shtylyov


On 11/09/2015 09:51 PM, Sergei Shtylyov wrote:


Linux version 3.1 introduced a consistency check for netlink dumps in
commit 670dc28 ("netlink: advertise incomplete dumps"). This bites


The scripts/checkpatch.pl now enforces 12-digit commit ID...


   Sorry, didn't realize it wasn't a kernel patch. :-)


iproute2 when flushing more addresses than can fit into a single
RTM_GETADDR response. To silence the spurious error message "Dump was
interrupted and may be inconsistent.", advise rtnl_dump_filter_l() to
not care about NLM_F_DUMP_INTR.

Signed-off-by: Phil Sutter 


[...]

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [iproute PATCH v2 4/6] ipaddress: fix ipaddr_flush for Linux >= 3.1

2015-11-09 Thread Sergei Shtylyov


Hello.

On 11/08/2015 11:22 PM, Phil Sutter wrote:


Linux version 3.1 introduced a consistency check for netlink dumps in
commit 670dc28 ("netlink: advertise incomplete dumps"). This bites


   The scripts/checkpatch.pl now enforces 12-digit commit ID...


iproute2 when flushing more addresses than can fit into a single
RTM_GETADDR response. To silence the spurious error message "Dump was
interrupted and may be inconsistent.", advise rtnl_dump_filter_l() to
not care about NLM_F_DUMP_INTR.

Signed-off-by: Phil Sutter 


MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Fw: [Bug 107551] New: IPV4 Reassembling Problem

2015-11-09 Thread Stephen Hemminger

Begin forwarded message:

Date: Mon, 9 Nov 2015 11:04:29 +
From: "bugzilla-dae...@bugzilla.kernel.org" 

To: "shemmin...@linux-foundation.org" 
Subject: [Bug 107551] New: IPV4 Reassembling Problem

https://bugzilla.kernel.org/show_bug.cgi?id=107551

Bug ID: 107551
   Summary: IPV4 Reassembling Problem
   Product: Networking
   Version: 2.5
Kernel Version: 4.2.3
  Hardware: Other
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: IPV4
  Assignee: shemmin...@linux-foundation.org
  Reporter: bernd.weib...@siemens.com
Regression: No

I have an ipv4 reassembling issue using the latest Kernel (4.2.3) in an nios2
environment.
I first noticed that issue when sending UDP-Pakets from an Host PC to my
device. Each packet has a payload of 4096 Byte.
Due to the MTU of 1500 Byte each packet will be split into three ip-packets.
Approximately 1 of 200 Packets were not correctly
reassembled by the Kernel, so they cannot be received within user-space. The
command nstat says, that there are reassembling errors
due to timeout, but at least in wireshark all packets are correctly displayed.
Sometime the kernel issues a icmp-messages, saying that there
was a timeout during ip reassembling.
I fir9st thought that it might be a performance problem, so I increased the
time between the packets to about one packet per 500 ms but without success.
Then I decided to reduce the payload to 1024 Byte, because it might be a buffer
(or TSE FIFO) problem. Now all packets are send into one ipv4 packet and there
is nothing lost on the receiver side, but when I reduced the MTU of the host
from 1500 to 800 byte (each datagram will be split into two packets now) the
problem accurse again!
So this might be a reassembling issue.

-- 
You are receiving this mail because:
You are the assignee for the bug.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Fw: [Bug 106711] VXLAN: RTNL assertion failed at net/core/net_namespace.c:187

2015-11-09 Thread Stephen Hemminger



Begin forwarded message:

Date: Mon, 9 Nov 2015 09:42:52 +
From: "bugzilla-dae...@bugzilla.kernel.org" 

To: "shemmin...@linux-foundation.org" 
Subject: [Bug 106711] VXLAN: RTNL assertion failed at 
net/core/net_namespace.c:187


https://bugzilla.kernel.org/show_bug.cgi?id=106711

Kari Hautio  changed:

   What|Removed |Added

 CC||khau...@gmail.com

--- Comment #1 from Kari Hautio  ---
Also affects 4.1.12

[   34.412104] RTNL: assertion failed at
/home/kernel/COD/linux/net/core/net_namespace.c (187)
[   34.412120] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.12-040112-generic
#201510262131
[   34.412123] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.7.5-20140709_153950- 04/01/2014
[   34.412126]  880035dd2380 88013fd838f8 817e367c

[   34.412131]   88013fd83928 816cfbe5
880035526200
[   34.412135]  8800b9b3fe00 88013a3dc9c0 88013a3dccc0
88013fd83938
[   34.412140] Call Trace:
[   34.412142][] dump_stack+0x45/0x57
[   34.412159]  [] __peernet2id+0xa5/0xb0
[   34.412163]  [] peernet2id+0x18/0x30
[   34.412168]  [] vxlan_fdb_info+0xfa/0x360 [vxlan]
[   34.412173]  [] ? __alloc_skb+0x8c/0x1f0
[   34.412178]  [] vxlan_fdb_notify+0x72/0x100 [vxlan]
[   34.412182]  [] vxlan_fdb_create+0x14a/0x3a0 [vxlan]
[   34.412187]  [] vxlan_snoop+0x1cb/0x1d0 [vxlan]
[   34.412192]  [] vxlan_rcv+0x3a2/0x620 [vxlan]
[   34.412196]  [] vxlan_udp_encap_recv+0x156/0x420 [vxlan]
[   34.412201]  [] ? vxlan_open+0x1e0/0x1e0 [vxlan]
[   34.412207]  [] udp_queue_rcv_skb+0x1f3/0x4f0
[   34.412211]  [] __udp4_lib_rcv+0x127/0x930
[   34.412215]  [] udp_rcv+0x1a/0x20
[   34.412227]  [] ip_local_deliver_finish+0xae/0x230
[   34.412231]  [] ip_local_deliver+0x46/0xa0
[   34.412235]  [] ip_rcv_finish+0x81/0x370
[   34.412239]  [] ip_rcv+0x2a2/0x3c0
[   34.412244]  [] __netif_receive_skb_core+0x7e3/0xa10
[   34.412248]  [] __netif_receive_skb+0x18/0x60
[   34.412252]  [] process_backlog+0xb2/0x150
[   34.412256]  [] net_rx_action+0x1fa/0x320
[   34.412265]  [] __do_softirq+0x103/0x280
[   34.412269]  [] irq_exit+0xad/0xb0
[   34.412274]  [] smp_apic_timer_interrupt+0x46/0x60
[   34.412279]  [] apic_timer_interrupt+0x6e/0x80
[   34.412280][] ? native_safe_halt+0x6/0x10
[   34.412290]  [] ? rcu_eqs_enter+0xa3/0xb0
[   34.412297]  [] default_idle+0x1e/0xc0
[   34.412315]  [] arch_cpu_idle+0xf/0x20
[   34.412325]  [] cpu_startup_entry+0x367/0x3e0
[   34.412332]  [] ? clockevents_register_device+0x112/0x1e0
[   34.412338]  [] start_secondary+0x17e/0x1a0

-- 
You are receiving this mail because:
You are the assignee for the bug.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-09 Thread David Miller

From: Rainer Weikusat 
Date: Mon, 09 Nov 2015 14:40:48 +

> + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
> + &u->peer_wake);

This is more simply:

__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);

> +static inline int unix_dgram_peer_recv_ready(struct sock *sk,
> +  struct sock *other)

Please do not us the inline keyword in foo.c files, let the compiler
decide.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/2] net: mvneta: Introduce RSS support

2015-11-09 Thread Gregory CLEMENT

Hi Marcin,

 On ven., nov. 06 2015, Marcin Wojtas  wrote:

> Hi Gregory,
>
>
>> I also choose to associate all the TX queues on the same CPU that the
>> one associated to the RX queue. It allows to contain all the
>> interrupts on the same CPU. I think that an improvement on this side
>> would be the support of the XPS.
>>
>
> Did you make some tries? E.g. after mapping certain txqs to CPU1, when
> using them was mvneta_tx() executing only on this CPU? Or it rather
> means that the irq will hit according to the mapping? As far as I know
> the HW, the latter should be true, which would mean the real XPS with
> this controller is impossible and the maximum we can control is the
> irq.

I hoped that with XPS we can associate CPUs to all the tx queues of a
ethernet port. For example choosing for eth2, that all the tx queues will
be handled by CPU0 and CPU3. And for this it would involve allowing
receiving the tx interrupts only on CPU0 and CPU3. This kind of feature
seems to be doable with the hardware.

But after reading more carefully what XPS is about and how it is used in
the kernel, it seems we can't configure the hardware from the sysfs
interface. From my understanding we can only indicate to the kernel
which tx queue use for a given CPU.

>
> I think it may be worth to unmask TX irqs on all cpus, so all percpu
> napi's would be able to read tx_cause and process sent packets. I'm
> looking forward to your opinion.

By doing, this my understanding is that when an TX interrupt occurs then
all the CPUs enter in the interrupt handler but only one can manage
it. So, for me it looks like a big waste of CPU load. I could understand
that it would be interesting in some situation but it doesn't seem to be
the good default behavior. That's why I was looking for a way to let the
use configure it according to his needs.

Please correct me if I am wrong somewhere, because currently I don't
find a good solution for it.

Thanks,

Gregory
-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] arm64: bpf: fix JIT stack setup

2015-11-09 Thread Shi, Yang


On 11/8/2015 2:29 PM, Z Lim wrote:

On Sat, Nov 7, 2015 at 6:27 PM, Alexei Starovoitov
 wrote:

On Fri, Nov 06, 2015 at 09:36:17PM -0800, Yang Shi wrote:

ARM64 JIT used FP (x29) as eBPF fp register, but FP is subjected to
change during function call so it may cause the BPF prog stack base address
change too. Whenever, it pointed to the bottom of BPF prog stack instead of
the top.

So, when copying data via bpf_probe_read, it will be copied to (SP - offset),
then it may overwrite the saved FP/LR.

Use x25 to replace FP as BPF stack base register (fp). Since x25 is callee
saved register, so it will keep intact during function call.
It is initialized in BPF prog prologue when BPF prog is started to run
everytime. When BPF prog exits, it could be just tossed.

Other than this the BPf prog stack base need to be setup before function
call stack.

So, the BPF stack layout looks like:

  high
  original A64_SP =>   0:+-+ BPF prologue
 | | FP/LR and callee saved registers
  BPF fp register => +64:+-+
 | |
 | ... | BPF prog stack
 | |
 | |
  current A64_SP =>  +-+
 | |
 | ... | Function call stack
 | |
 +-+
   low

Signed-off-by: Yang Shi 
CC: Zi Shen Lim 
CC: Xi Wang 


Thanks for tracking it down.
That looks like fundamental bug in arm64 jit. I'm surprised function calls 
worked at all.
Zi please review.



For function calls (BPF_JMP | BPF_CALL), we are compliant with AAPCS64
[1]. That part is okay.


bpf_probe_read accesses the BPF program stack, which is based on BPF_REG_FP.

This exposes an issue with how BPF_REG_FP was setup, as Yang pointed out.
Instead of having BPF_REG_FP point to top of stack, we erroneously
point it to the bottom of stack. When there are function calls, we run
the risk of clobbering of BPF stack. Bad idea.


Yes, exactly.



Otherwise, since BPF_REG_FP is read-only, and is setup exactly once in
prologue, it remains consistent throughout lifetime of the BPF
program.


Yang, can you please try the following?


It should work without the below change:

+   emit(A64_MOV(1, A64_FP, A64_SP), ctx);

I added it to stay align with ARMv8 AAPCS to maintain the correct FP 
during function call. It makes us get correct stack backtrace.


I think we'd better to keep compliant with ARMv8 AAPCS in BPF JIT 
prologue too.


If nobody thinks it is necessary, we definitely could remove that change.

Thanks,
Yang



8<-
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -161,12 +161,12 @@ static void build_prologue(struct jit_ctx *ctx)
 if (ctx->tmp_used)
 emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx);

-   /* Set up BPF stack */
-   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
-
 /* Set up frame pointer */
 emit(A64_MOV(1, fp, A64_SP), ctx);

+   /* Set up BPF stack */
+   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+
 /* Clear registers A and X */
 emit_a64_mov_i64(ra, 0, ctx);
 emit_a64_mov_i64(rx, 0, ctx);
->8

[1] 
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] packet: Allow packets with only a header (but no payload)

2015-11-09 Thread Willem de Bruijn

On Sat, Nov 7, 2015 at 8:11 AM, Felix Fietkau  wrote:
> On 2015-07-31 00:15, Martin Blumenstingl wrote:
>> On Wed, Jul 29, 2015 at 8:05 AM, Willem de Bruijn  wrote:
>>> Martin, to return to your initial statement that PPPoE PADI packets can
>>> have a zero payload: the PPPoE RFC states that PADI packets "MUST
>>> contain exactly one TAG of TAG_TYPE Service-Name, indicating the
>>> service the Host is requesting, and any number of other TAG types."
>>> (RFC 2516, 5.1). Is the observed behavior (no payload) perhaps
>>> incorrect?
>> As far as I can see you are right, but the real world seems to be different.
>> My ISP for example lists the PPPoE connection settings, but they are
>> nowhere mentioning the "service name".
>>
>> I have also re-read pppd's source code again and that seems to confirm
>> what you are reading in the RFC: Leaving the service name away makes
>> seems to violate the RFC, but pppd still accepts those configurations.
>>
>>> Even if it is, if this is breaking established userspace expectations,
>>> we should look into it. Ethernet specifies a minimum payload size of
>>> 46 on the wire, but perhaps that is handled with padding, so that
>>> 0 length should be valid within the stack. Also, there may be other
>>> valid uses of 0 length payload on top of link layers that are not Ethernet.
>> Good catch. I would also like to note that the documentation for
>> "hard_header_len" describes it as "Hardware header length". When the
>> purpose of this field we should check whether the documentation should
>> be updated to "Minimum hardware header length" -> that would mean the
>> condition has to be a "len < hard_header_len" instead of a "len <=
>> hard_header_len" (as it is now).
>>
>> PS: I have also added the pppd maintainer (Paul Mackerras) to this
>> thread because I think he should know about this issue (and he can
>> probably provide more details if required).
>> As a quick summary for him: linux  >= 3.19 rejects PADI packets when
>> no service name is configured.
> Any news on this? Users are complaining about this regression:
> https://dev.openwrt.org/ticket/20707

I took another look. This hinges on the question what the contract with
device drivers is on skb network data and length. Is passing an skb with
skb->len == 0 to ndo_start_xmit allowed?

>From what I gather from the ethernet spec [1], sending frames with an
empty head is allowed on that medium, at least.

A quick scan of a few drivers and the loopback path also does not show
anything that would break. In some cases, skb_network_header points
beyond the end of the buffer (ETH_HLEN), but the length is correctly
reported as 0.

The tap device can also generate packets consisting of only a link layer
header: compares len < ETH_HLEN in tun_get_user.

So, I think that this change should be correct:

 static bool ll_header_truncated(const struct net_device *dev, int len)
 {
-   /* net device doesn't like empty head */
-   if (unlikely(len <= dev->hard_header_len)) {
+   if (unlikely(len < dev->hard_header_len)) {

but a definitive answer would require an audit of all device drivers
(including bonding, ..) or at least the certainty that it has always
been correct to send a packet of only link layer header to
ndo_start_xmit.

[1] IEEE 802.3™-2012 – Section One, {3.2.8, 4.2.3.3}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Russell King - ARM Linux

On Mon, Nov 09, 2015 at 06:08:49PM +0100, Andrew Lunn wrote:
> You use fixed-phy when the MAC is connected to a switch, not a phy. Or
> when the MAC is connected to an SFP module.

... hopefully not for much longer.  Once -rc1 is out, I'll sort out
posting my phylink and SFP module hotplug support as a RFC.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Russell King - ARM Linux

On Mon, Nov 09, 2015 at 06:12:02PM +0100, Arnd Bergmann wrote:
> On Monday 09 November 2015 17:08:34 Russell King - ARM Linux wrote:
> > They are "optional" because when you're using a DSA switch, you don't
> > specify a PHY (because, there isn't one).  For example, this is what
> > I'm using with an Armada 388 board with a Marvell DSA switch.  The
> > DSA does not appear as a PHY, and no node in the DSA stanza can be
> > referenced for a phy entry in the ethernet device's stanza.
> > 
> > eth1: ethernet@3 {
> > compatible = "marvell,armada-370-neta";
> > reg = <0x3 0x4000>;
> > interrupts-extended = <&mpic 10>;
> > clocks = <&gateclk 3>;
> > managed = "in-band-status";
> > phy-mode = "sgmii";
> > status = "okay";
> > };
> > 
> > 
> 
> Ok, then it would be nice to change the binding to reflect that,
> and also document the "managed" property there.

"managed" is already documented.  See
Documentation/devicetree/bindings/net/ethernet.txt:

- managed: string, specifies the PHY management type. Supported values are:
  "auto", "in-band-status". "auto" is the default, it usess MDIO for
  management if fixed-link is not specified.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network stream fairness

2015-11-09 Thread Eric Dumazet

On Mon, 2015-11-09 at 17:50 +0100, Niklas Cassel wrote:

> 
> for i in `seq 1 20`; do ss -temoi dst 192.168.0.141; sleep 1; done
...
> ESTAB  0  0192.168.0.1:54578  
>  192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
> ino:7901838 sk:880143735040
> State  Recv-Q Send-Q Local Address:Port   
>   Peer Address:Port   
> ESTAB  0  0192.168.0.1:54576  
>  192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
> ino:7901837 sk:880143735800
> ESTAB  0  0192.168.0.1:54574  
>  192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 
> uid:20283 ino:7902591 sk:880223311800
> ESTAB  1448   0192.168.0.1:54578  
>  192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
> ino:7901838 sk:880143735040

This is the receiver, I would like to see the sender side, since the
changes we talk about are at his side, of course.

Thanks.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RESEND] net: Documentation: Fix default value tcp_limit_output_bytes

2015-11-09 Thread David Miller

From: Niklas Cassel 
Date: Mon, 9 Nov 2015 15:59:00 +0100

> Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit")
> updated default value for tcp_limit_output_bytes
> 
> Signed-off-by: Niklas Cassel 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Arnd Bergmann

On Monday 09 November 2015 18:08:49 Andrew Lunn wrote:
> > > I suppose it comes down to, are we allowed to optionally implement
> > > part of the DT binding?
> > 
> > I'm not sure what you are asking. A lot of DT bindings have both
> > optional and mandatory properties. For mvneta, the "phy" and "phy-mode"
> > properties are listed as mandatory, so the driver can safely assume
> > that they are always present. If there are reasons to leave them out,
> > and for the driver to handle that case correctly, the binding
> > should be updated to mark them as optional.
> 
> Hi Arnd
> 
> You are looking at it from the perspective of the driver. I was
> meaning from the perspective of the DT blob. Can be blob assume the
> driver implements all of the binding, all of the time?

That question is not really relevant: the DT describes the hardware,
it doesn't matter whether there are drivers for all the bits or
whether all properties are read.

> You use fixed-phy when the MAC is connected to a switch, not a phy. Or
> when the MAC is connected to an SFP module. The driver can currently
> be built to not implement the fixed-phy party of the binding. Is that
> O.K. from the perspective of the DT blob? Or should the driver always
> implement all of the binding, in which these NOP stubs should be
> removed and fixed phy always be enabled for the drivers that use it.

Sure, that is ok.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Arnd Bergmann

On Monday 09 November 2015 17:08:34 Russell King - ARM Linux wrote:
> They are "optional" because when you're using a DSA switch, you don't
> specify a PHY (because, there isn't one).  For example, this is what
> I'm using with an Armada 388 board with a Marvell DSA switch.  The
> DSA does not appear as a PHY, and no node in the DSA stanza can be
> referenced for a phy entry in the ethernet device's stanza.
> 
> eth1: ethernet@3 {
> compatible = "marvell,armada-370-neta";
> reg = <0x3 0x4000>;
> interrupts-extended = <&mpic 10>;
> clocks = <&gateclk 3>;
> managed = "in-band-status";
> phy-mode = "sgmii";
> status = "okay";
> };
> 
> 

Ok, then it would be nice to change the binding to reflect that,
and also document the "managed" property there.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Russell King - ARM Linux

On Mon, Nov 09, 2015 at 05:57:43PM +0100, Arnd Bergmann wrote:
> I'm not sure what you are asking. A lot of DT bindings have both
> optional and mandatory properties. For mvneta, the "phy" and "phy-mode"
> properties are listed as mandatory, so the driver can safely assume
> that they are always present. If there are reasons to leave them out,
> and for the driver to handle that case correctly, the binding
> should be updated to mark them as optional.

They are "optional" because when you're using a DSA switch, you don't
specify a PHY (because, there isn't one).  For example, this is what
I'm using with an Armada 388 board with a Marvell DSA switch.  The
DSA does not appear as a PHY, and no node in the DSA stanza can be
referenced for a phy entry in the ethernet device's stanza.

eth1: ethernet@3 {
compatible = "marvell,armada-370-neta";
reg = <0x3 0x4000>;
interrupts-extended = <&mpic 10>;
clocks = <&gateclk 3>;
managed = "in-band-status";
phy-mode = "sgmii";
status = "okay";
};

dsa@0 {
compatible = "marvell,dsa";
dsa,ethernet = <ð1>;
dsa,mii-bus = <&mdio>;
pinctrl-0 = <&clearfog_dsa0_clk_pins &clearfog_dsa0_pins>;
pinctrl-names = "default";
#address-cells = <2>;
#size-cells = <0>;

switch@0 {
...
};
};

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Andrew Lunn

> > I suppose it comes down to, are we allowed to optionally implement
> > part of the DT binding?
> 
> I'm not sure what you are asking. A lot of DT bindings have both
> optional and mandatory properties. For mvneta, the "phy" and "phy-mode"
> properties are listed as mandatory, so the driver can safely assume
> that they are always present. If there are reasons to leave them out,
> and for the driver to handle that case correctly, the binding
> should be updated to mark them as optional.

Hi Arnd

You are looking at it from the perspective of the driver. I was
meaning from the perspective of the DT blob. Can be blob assume the
driver implements all of the binding, all of the time?

You use fixed-phy when the MAC is connected to a switch, not a phy. Or
when the MAC is connected to an SFP module. The driver can currently
be built to not implement the fixed-phy party of the binding. Is that
O.K. from the perspective of the DT blob? Or should the driver always
implement all of the binding, in which these NOP stubs should be
removed and fixed phy always be enabled for the drivers that use it.

Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] macvtap: Resolve possible __might_sleep warning in macvtap_do_read()

2015-11-09 Thread David Miller

From: Vladislav Yasevich 
Date: Mon,  9 Nov 2015 09:14:17 -0500

> macvtap_do_read code calls macvtap_put_user while it might be set up
> to wait for the user.  This results in the following warning:
 ...
> Make sure thet we call finish_wait() if we have the skb to process
> before trying to actually process it.
> 
> Signed-off-by: Vladislav Yasevich 

Oh I see, this is that new check in might_sleep() that makes sure
that the task state is TASK_RUNNING.

Looks good, applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Arnd Bergmann

On Monday 09 November 2015 17:42:32 Andrew Lunn wrote:
> On Mon, Nov 09, 2015 at 03:08:57PM +0100, Arnd Bergmann wrote:
> > The fixed_phy infrastructure is done in a way that is optional,
> > by providing 'static inline' helper functions doing nothing in
> > include/linux/phy_fixed.h for all its APIs. However, three out
> > of the four users (DSA, BCMGENET, and SYSTEMPORT) always
> > 'select FIXED_PHY', presumably because they need that.
> 
> Hi Arnd
> 
> Need is probably too strong, it could be considered an optional
> feature. If you don't have a fixed_phy property in your DT blob, you
> don't need fixed phy support in your image.

Ok, I see.

> > MVNETA is the fourth one, and if that is built-in but FIXED_PHY
> > is configured as a loadable module, we get a link error:
> > 
> > drivers/built-in.o: In function `mvneta_fixed_link_update':
> > fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state'
> > 
> > Presumably this driver has the same dependency as the others,
> > so this patch also uses 'select' to ensure that the fixed-phy
> > support is built-in.
> 
> This will work, and is uniform with the other instances. But maybe a
> more correct fix is to ensure fixed-phy is never a module when there
> is a builtin user.

That is hard to express with Kconfig. The alternative I listed instead
guarantees that CONFIG_MVNETA cannot be set to 'y' whenever FIXED_PHY=m.
For all practical purposes that has the same effect.

The fixed-phy support isn't very big (around 2KB), so I wonder how
relevant that optimization is.

> > Should we perhaps make 'FIXED_PHY' a silent option and remove the
> > inline helpers, based on the assumption that a driver that wants these
> > will not work without them?
> 
> I suppose it comes down to, are we allowed to optionally implement
> part of the DT binding?

I'm not sure what you are asking. A lot of DT bindings have both
optional and mandatory properties. For mvneta, the "phy" and "phy-mode"
properties are listed as mandatory, so the driver can safely assume
that they are always present. If there are reasons to leave them out,
and for the driver to handle that case correctly, the binding
should be updated to mark them as optional.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network stream fairness

2015-11-09 Thread Niklas Cassel

On 11/09/2015 05:24 PM, Eric Dumazet wrote:
> On Mon, 2015-11-09 at 08:07 -0800, Eric Dumazet wrote:
> 
>> Your numbers suggest a cwnd growth then, which might show a CC bug.
>>
>> Please run the following when your iper3 runs on regular 4.3 kernel
>>
>> for i in `seq 1 10`
>> do
>> ss -temoi dst 192.168.0.141
>> sleep 1
>> done
>>

See output at the end of the mail.

iperf3 output sample:
[  4]  24.00-24.53  sec  1.92 MBytes  30.2 Mbits/sec  
[  6]  24.00-24.53  sec  3.84 MBytes  60.5 Mbits/sec  
[SUM]  24.00-24.53  sec  5.76 MBytes  90.7 Mbits/sec 


> 
> Another thing to try is to change tcp_limit_output_bytes back to 131072

I tried both

sysctl -w net.ipv4.tcp_limit_output_bytes=131072
sysctl -w net.ipv4.tcp_limit_output_bytes=13107

Neither appears to solve the problem.

I don't know why but
echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
solves it. Feels weird that setting tcp_limit_output_bytes doesn't fix it.

> 
> Of course, your driver be the problem as well, but I am guessing it is
> not upstream ?

It's not upstreamed, but it is GPL, so I could probably upload it somewhere
if that helps.






for i in `seq 1 20`; do ss -temoi dst 192.168.0.141; sleep 1; done
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
ESTAB  0  106  192.168.0.1:54574
   192.168.0.141:5201 timer:(on,190ms,0) rto:0.2 ato:0.04 
cwnd:10 qack:14 bidir uid:20283 ino:7902591 sk:880223311800
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
ESTAB  0  0192.168.0.1:54576
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901837 sk:880143735800
ESTAB  0  0192.168.0.1:54574
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 
uid:20283 ino:7902591 sk:880223311800
ESTAB  0  0192.168.0.1:54578
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901838 sk:880143735040
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
ESTAB  0  0192.168.0.1:54576
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901837 sk:880143735800
ESTAB  0  0192.168.0.1:54574
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 
uid:20283 ino:7902591 sk:880223311800
ESTAB  0  0192.168.0.1:54578
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901838 sk:880143735040
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
ESTAB  0  0192.168.0.1:54576
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901837 sk:880143735800
ESTAB  0  0192.168.0.1:54574
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 
uid:20283 ino:7902591 sk:880223311800
ESTAB  0  0192.168.0.1:54578
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901838 sk:880143735040
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
ESTAB  0  0192.168.0.1:54576
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901837 sk:880143735800
ESTAB  0  0192.168.0.1:54574
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 
uid:20283 ino:7902591 sk:880223311800
ESTAB  0  0192.168.0.1:54578
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901838 sk:880143735040
State  Recv-Q Send-Q Local Address:Port 
Peer Address:Port   
ESTAB  0  0192.168.0.1:54576
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 
ino:7901837 sk:880143735800
ESTAB  0  0192.168.0.1:54574
   192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 
uid:20283 ino:7902591 sk:880223311800
ESTAB  0  0

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Andrew Lunn

On Mon, Nov 09, 2015 at 03:08:57PM +0100, Arnd Bergmann wrote:
> The fixed_phy infrastructure is done in a way that is optional,
> by providing 'static inline' helper functions doing nothing in
> include/linux/phy_fixed.h for all its APIs. However, three out
> of the four users (DSA, BCMGENET, and SYSTEMPORT) always
> 'select FIXED_PHY', presumably because they need that.

Hi Arnd

Need is probably too strong, it could be considered an optional
feature. If you don't have a fixed_phy property in your DT blob, you
don't need fixed phy support in your image.

> MVNETA is the fourth one, and if that is built-in but FIXED_PHY
> is configured as a loadable module, we get a link error:
> 
> drivers/built-in.o: In function `mvneta_fixed_link_update':
> fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state'
> 
> Presumably this driver has the same dependency as the others,
> so this patch also uses 'select' to ensure that the fixed-phy
> support is built-in.

This will work, and is uniform with the other instances. But maybe a
more correct fix is to ensure fixed-phy is never a module when there
is a builtin user.

> Should we perhaps make 'FIXED_PHY' a silent option and remove the
> inline helpers, based on the assumption that a driver that wants these
> will not work without them?

I suppose it comes down to, are we allowed to optionally implement
part of the DT binding?

 Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread David Miller

From: Arnd Bergmann 
Date: Mon, 09 Nov 2015 15:08:57 +0100

> The fixed_phy infrastructure is done in a way that is optional,
> by providing 'static inline' helper functions doing nothing in
> include/linux/phy_fixed.h for all its APIs. However, three out
> of the four users (DSA, BCMGENET, and SYSTEMPORT) always
> 'select FIXED_PHY', presumably because they need that.
> MVNETA is the fourth one, and if that is built-in but FIXED_PHY
> is configured as a loadable module, we get a link error:
> 
> drivers/built-in.o: In function `mvneta_fixed_link_update':
> fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state'
> 
> Presumably this driver has the same dependency as the others,
> so this patch also uses 'select' to ensure that the fixed-phy
> support is built-in.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state 
> signaling")
> ---
> Found using ARM randconfig tests. An alternative here would be
> to use 'depends on FIXED_PHY || FIXED_PHY=n', I picked the 'select'
> approach for consistency.
> 
> Should we perhaps make 'FIXED_PHY' a silent option and remove the
> inline helpers, based on the assumption that a driver that wants these
> will not work without them?

This seems reasonable to fix the problem for the time being, applied.

Thanks Arnd.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net: caif: check return value of alloc_netdev

2015-11-09 Thread David Miller

From: Rasmus Villemoes 
Date: Mon,  9 Nov 2015 13:19:10 +0100

> I don't know if dev can actually be NULL here, but the test should be
> above alloc_netdev(), to avoid leaking the struct net_device in case
> dev is actually NULL. And of course the return value from alloc_netdev
> should be tested.
> 
> Signed-off-by: Rasmus Villemoes 
> ---
> Maybe the existing code was supposed to be "if (!ndev)", and dev
> cannot be NULL, but then -ENODEV is a slightly odd return
> value. Doing both tests seems to be the safe choice.

This is made painfully impossible to figure out because there seems
to be no in-tree code instantiating platform_device objects for
this caif_spi framework, nor are there even in-tree implementations
of the init_xfer et al. required methods.

So your fix is as good as any, applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread David Miller

From: Jan Kara 
Date: Mon, 9 Nov 2015 13:11:26 +0100

> You can add
> 
> Acked-by: Jan Kara 
> 
> for the UDF and fs/xattr.c parts.

Please do not quote and entire large patch just to give an ACK.

Just quote the minimum necessary context, which is usually just
the commit message.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network stream fairness

2015-11-09 Thread Eric Dumazet

On Mon, 2015-11-09 at 08:07 -0800, Eric Dumazet wrote:

> Your numbers suggest a cwnd growth then, which might show a CC bug.
> 
> Please run the following when your iper3 runs on regular 4.3 kernel
> 
> for i in `seq 1 10`
> do
> ss -temoi dst 192.168.0.141
> sleep 1
> done
> 

Another thing to try is to change tcp_limit_output_bytes back to 131072

Of course, your driver be the problem as well, but I am guessing it is
not upstream ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA

2015-11-09 Thread David Miller

From: Geert Uytterhoeven 
Date: Mon,  9 Nov 2015 10:34:30 +0100

> If NO_DMA=y:
 ...
> As this affects all of HNS_ENET, HNS_DSAF, HNS, HIX5HD2_GMAC, and
> HIP04_ETH, add a dependency on HAS_DMA to the main NET_VENDOR_HISILICON
> symbol to fix this.
> 
> Signed-off-by: Geert Uytterhoeven 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network stream fairness

2015-11-09 Thread Eric Dumazet

On Mon, 2015-11-09 at 16:53 +0100, Niklas Cassel wrote:
> On 11/09/2015 04:50 PM, Eric Dumazet wrote:
> > On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote:
> >> I have a ethernet driver for a 100 Mbps NIC.
> >> The NIC has dedicated hardware for offloading.
> >> The driver has implemented TSO, GSO and BQL.
> >> Since the CPU on the SoC is rather weak, I'd rather
> >> not increase the CPU load by turning off offloading.
> >>
> >> Since commit
> >> 605ad7f184b6 ("tcp: refine TSO autosizing")
> >>
> >> the bandwidth is no longer fair between streams.
> >> see output at the end of the mail, where I'm testing with 2 streams.
> >>
> >>
> >> If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream.
> >>
> >> I can also use vanilla 4.3 and do:
> >> echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
> >> to also get a stable 45 Mbps per stream.
> >>
> >> My question is, am I supposed to set the BQL limit explicitly?
> >> It is possible that I have missed something in my driver,
> >> but my understanding is that the TCP stack sets and adjusts
> >> the BQL limit automatically.
> >>
> >>
> >> Perhaps the following info might help:
> >>
> >> After running iperf3 on vanilla 4.3:
> >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
> >> limit 89908
> >> limit_max 1879048192
> >>
> >> After running iperf3 on vanilla 4.3 + BQL explicitly set:
> >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
> >> limit 3000
> >> limit_max 3000
> >>
> >> After running iperf3 on 4.3 + 605ad7f184b6 reverted:
> >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
> >> limit 8886
> >> limit_max 1879048192
> >>
> > 
> > There is absolutely nothing ensuring fairness among multiple TCP flows.
> > 
> > One TCP flow can very easily grab whole bandwidth for itself, there are
> > numerous descriptions of this phenomena in various TCP studies. 
> > 
> > This is why we have packet schedulers ;)
> 
> Oh.. How stupid of me, I forgot to mention.. all of the measurements were
> done with fq_codel.

Your numbers suggest a cwnd growth then, which might show a CC bug.

Please run the following when your iper3 runs on regular 4.3 kernel

for i in `seq 1 10`
do
ss -temoi dst 192.168.0.141
sleep 1
done


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Russell King - ARM Linux

On Mon, Nov 09, 2015 at 08:56:10PM +0900, Tetsuo Handa wrote:
> There are many locations that do
> 
>   if (memory_was_allocated_by_vmalloc)
> vfree(ptr);
>   else
> kfree(ptr);
> 
> but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
> using is_vmalloc_addr(). Unless callers have special reasons, we can
> replace this branch with kvfree(). Please check and reply if you found
> problems.
> 
> Signed-off-by: Tetsuo Handa 
> Acked-by: Michal Hocko 
> Cc: Russell King  # arm

Acked-by: Russell King 

In so far as this ARM specific change looks reasonable.  Thanks.

> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index e62400e..492bf3e 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1200,10 +1200,7 @@ error:
>   while (i--)
>   if (pages[i])
>   __free_pages(pages[i], 0);
> - if (array_size <= PAGE_SIZE)
> - kfree(pages);
> - else
> - vfree(pages);
> + kvfree(pages);
>   return NULL;
>  }
>  
> @@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, 
> struct page **pages,
>  size_t size, struct dma_attrs *attrs)
>  {
>   int count = size >> PAGE_SHIFT;
> - int array_size = count * sizeof(struct page *);
>   int i;
>  
>   if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) {
> @@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, 
> struct page **pages,
>   __free_pages(pages[i], 0);
>   }
>  
> - if (array_size <= PAGE_SIZE)
> - kfree(pages);
> - else
> - vfree(pages);
> + kvfree(pages);
>   return 0;
>  }
>  

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network stream fairness

2015-11-09 Thread Niklas Cassel

On 11/09/2015 04:50 PM, Eric Dumazet wrote:
> On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote:
>> I have a ethernet driver for a 100 Mbps NIC.
>> The NIC has dedicated hardware for offloading.
>> The driver has implemented TSO, GSO and BQL.
>> Since the CPU on the SoC is rather weak, I'd rather
>> not increase the CPU load by turning off offloading.
>>
>> Since commit
>> 605ad7f184b6 ("tcp: refine TSO autosizing")
>>
>> the bandwidth is no longer fair between streams.
>> see output at the end of the mail, where I'm testing with 2 streams.
>>
>>
>> If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream.
>>
>> I can also use vanilla 4.3 and do:
>> echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
>> to also get a stable 45 Mbps per stream.
>>
>> My question is, am I supposed to set the BQL limit explicitly?
>> It is possible that I have missed something in my driver,
>> but my understanding is that the TCP stack sets and adjusts
>> the BQL limit automatically.
>>
>>
>> Perhaps the following info might help:
>>
>> After running iperf3 on vanilla 4.3:
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
>> limit 89908
>> limit_max 1879048192
>>
>> After running iperf3 on vanilla 4.3 + BQL explicitly set:
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
>> limit 3000
>> limit_max 3000
>>
>> After running iperf3 on 4.3 + 605ad7f184b6 reverted:
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
>> limit 8886
>> limit_max 1879048192
>>
> 
> There is absolutely nothing ensuring fairness among multiple TCP flows.
> 
> One TCP flow can very easily grab whole bandwidth for itself, there are
> numerous descriptions of this phenomena in various TCP studies. 
> 
> This is why we have packet schedulers ;)

Oh.. How stupid of me, I forgot to mention.. all of the measurements were
done with fq_codel.

> 
> tc qdisc replace dev eth0 root fq
> 
> This will probably help : No need to change BQL settings, so that you
> keep minimal latencies for the interactive traffic (like ping)
> 
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: network stream fairness

2015-11-09 Thread Eric Dumazet

On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote:
> I have a ethernet driver for a 100 Mbps NIC.
> The NIC has dedicated hardware for offloading.
> The driver has implemented TSO, GSO and BQL.
> Since the CPU on the SoC is rather weak, I'd rather
> not increase the CPU load by turning off offloading.
> 
> Since commit
> 605ad7f184b6 ("tcp: refine TSO autosizing")
> 
> the bandwidth is no longer fair between streams.
> see output at the end of the mail, where I'm testing with 2 streams.
> 
> 
> If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream.
> 
> I can also use vanilla 4.3 and do:
> echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
> to also get a stable 45 Mbps per stream.
> 
> My question is, am I supposed to set the BQL limit explicitly?
> It is possible that I have missed something in my driver,
> but my understanding is that the TCP stack sets and adjusts
> the BQL limit automatically.
> 
> 
> Perhaps the following info might help:
> 
> After running iperf3 on vanilla 4.3:
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
> limit 89908
> limit_max 1879048192
> 
> After running iperf3 on vanilla 4.3 + BQL explicitly set:
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
> limit 3000
> limit_max 3000
> 
> After running iperf3 on 4.3 + 605ad7f184b6 reverted:
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
> limit 8886
> limit_max 1879048192
> 

There is absolutely nothing ensuring fairness among multiple TCP flows.

One TCP flow can very easily grab whole bandwidth for itself, there are
numerous descriptions of this phenomena in various TCP studies. 

This is why we have packet schedulers ;)

tc qdisc replace dev eth0 root fq

This will probably help : No need to change BQL settings, so that you
keep minimal latencies for the interactive traffic (like ping)




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

network stream fairness

2015-11-09 Thread Niklas Cassel

I have a ethernet driver for a 100 Mbps NIC.
The NIC has dedicated hardware for offloading.
The driver has implemented TSO, GSO and BQL.
Since the CPU on the SoC is rather weak, I'd rather
not increase the CPU load by turning off offloading.

Since commit
605ad7f184b6 ("tcp: refine TSO autosizing")

the bandwidth is no longer fair between streams.
see output at the end of the mail, where I'm testing with 2 streams.


If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream.

I can also use vanilla 4.3 and do:
echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
to also get a stable 45 Mbps per stream.

My question is, am I supposed to set the BQL limit explicitly?
It is possible that I have missed something in my driver,
but my understanding is that the TCP stack sets and adjusts
the BQL limit automatically.


Perhaps the following info might help:

After running iperf3 on vanilla 4.3:
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/
limit 89908
limit_max 1879048192

After running iperf3 on vanilla 4.3 + BQL explicitly set:
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/
limit 3000
limit_max 3000

After running iperf3 on 4.3 + 605ad7f184b6 reverted:
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/
limit 8886
limit_max 1879048192




Vanilla 4.3, no BQL limit explicitly set.

iperf3 -R -c 192.168.0.141 -t 240 -P 2
Connecting to host 192.168.0.141, port 5201
Reverse mode, remote host 192.168.0.141 is sending
[  4] local 192.168.0.1 port 50902 connected to 192.168.0.141 port 5201
[  6] local 192.168.0.1 port 50904 connected to 192.168.0.141 port 5201
[ ID] Interval   Transfer Bandwidth
[  4]   0.00-1.00   sec  7.24 MBytes  60.7 Mbits/sec
[  6]   0.00-1.00   sec  3.56 MBytes  29.9 Mbits/sec
[SUM]   0.00-1.00   sec  10.8 MBytes  90.6 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  7.40 MBytes  62.1 Mbits/sec
[  6]   1.00-2.00   sec  3.40 MBytes  28.5 Mbits/sec
[SUM]   1.00-2.00   sec  10.8 MBytes  90.6 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  7.45 MBytes  62.5 Mbits/sec
[  6]   2.00-3.00   sec  3.36 MBytes  28.2 Mbits/sec
[SUM]   2.00-3.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  7.72 MBytes  64.8 Mbits/sec
[  6]   3.00-4.00   sec  3.08 MBytes  25.9 Mbits/sec
[SUM]   3.00-4.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  8.01 MBytes  67.2 Mbits/sec
[  6]   4.00-5.00   sec  2.80 MBytes  23.5 Mbits/sec
[SUM]   4.00-5.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  7.91 MBytes  66.3 Mbits/sec
[  6]   5.00-6.00   sec  2.90 MBytes  24.3 Mbits/sec
[SUM]   5.00-6.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  7.77 MBytes  65.2 Mbits/sec
[  6]   6.00-7.00   sec  3.04 MBytes  25.5 Mbits/sec
[SUM]   6.00-7.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  6.99 MBytes  58.6 Mbits/sec
[  6]   7.00-8.00   sec  3.82 MBytes  32.0 Mbits/sec
[SUM]   7.00-8.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  7.08 MBytes  59.4 Mbits/sec
[  6]   8.00-9.00   sec  3.73 MBytes  31.3 Mbits/sec
[SUM]   8.00-9.00   sec  10.8 MBytes  90.7 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  6.93 MBytes  58.1 Mbits/sec
[  6]   9.00-10.00  sec  3.88 MBytes  32.6 Mbits/sec
[SUM]   9.00-10.00  sec  10.8 MBytes  90.7 Mbits/sec
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RESEND] net: Documentation: Fix default value tcp_limit_output_bytes

2015-11-09 Thread Niklas Cassel

Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit")
updated default value for tcp_limit_output_bytes

Signed-off-by: Niklas Cassel 
---
Already sent the patch to linux-...@vger.kernel.org,
but sending it to netdev@vger.kernel.org too.

 Documentation/networking/ip-sysctl.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 05915be..2ea4c45 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -709,7 +709,7 @@ tcp_limit_output_bytes - INTEGER
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
-   Default: 131072
+   Default: 262144
 
 tcp_challenge_ack_limit - INTEGER
Limits number of Challenge ACK sent per second, as recommended
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-09 Thread Rainer Weikusat

An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusuat 

---
"Why do things always end up messy and complicated"? Patch is against
4.3.0.


diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4f263e3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int 
flags,
+ void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   &u->peer_wake);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_other = unix_sk(other);
+   rc = 0;
+
+   spin_lock(&u_other->peer_wait.lock);
+
+   if (!u->peer_wake.private) {
+   u->peer_wake.private = other;
+   __add_wait_queue(&u_other

Re: [PATCH] inet: delay address promotion check until last request in message

2015-11-09 Thread Neil Horman

On Mon, Nov 09, 2015 at 07:31:38AM +0900, David Miller wrote:
> From: Neil Horman 
> Date: Sun, 8 Nov 2015 16:56:41 -0500
> 
> > All I'm doing is, in effect masking the promote_secondaries sysctl
> > for an interface doing a flush operation.
> 
> Packets and other entities not controlled by RTNL can see an inconsistent
> state: no primary address on the interface.
> 
> You can argue that this happened already but usually the window is tiny.
> 
> For the case you are optimizing for, the window is now several seconds
> long.  That's not really something we can do.
> 

I agree with what you're saying in regards to the fact that an inconsistent
state exists during this operation, but the time frame during which it exists is
variable dependent on the number of addresses being removed, and is no different
than the inconsistent state we allow by setting the promote_secondaries sysctl
to 0.

In my view we have a range of conditions bounded by the following two extreemes:

a) User space is removing a single addresses from an arbitrarily sized set on an
interface.

b) User space is flushing every address from an arbitrarily sized set on an
interface

In extreeme condition (a), which is also the most common condition, my patch is
effectively a no-op, because the netlink message see no next rtm_deladdr
request, and so performs promotion as it currently does as part of that one
transaction

In extreeme condition (b), I would argue that the fact that our interface state
is inconsistent is irrelevant.  While its true that other applications may see
that an interface has no primary address, they will in the very near future see
that the interface has no address at all.  Given that the error handling path
for both conditions is often the same or very simmilar (tell the user to pick a
different interface), I would say this isn't something we need to worry about.

The only condition that we do need to worry about I think is the condition in
which an administrator is removing M addresses from a large set of N addresses
on an interface, where M < N but M and N are both sufficiently large to make the
inconsistent state window significant.  In that case we may encounter
applications that see transient errors when searching for an interfaces primary
address.  But my question would then be, so what?  Lots of administrative
operations result in transient errors on other applications while they are in
progress (an arbitrary sized set of iptables rules getting flushed and restored
can result in the same transient error behavior, as an example).

The bottom line in my mind is, given the condition described above, which is the
lesser of evils? Having a window of time where applications may see an
inconsistent state, and not be able to use an interface? Or having a much larger
window of time (recall this patch reduces run time from O(n^2) to O(n)), during
which no other rtnl aware operations can progress, leaving the interface and
several other subsystems unusable.

As a matter of philosophy, This can certainly be implemented in user space.
We could have user space disable promotion only during a flush (either by coding
it into iproute2 and simmilar applications, or by simply documenting the fact
that for large flush operations promotion should be disabled.  That opens us up
to other races however (if multiple users are toggling the value of promote
secondaries), and still gives rise to the issue you describe above, in which
interfaces may have an inconsistent state during the flush operation.

Regards
Neil

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GIT] Networking

2015-11-09 Thread Rasmus Villemoes

On Mon, Nov 09 2015, Hannes Frederic Sowa  wrote:

> Hi,
>
> On Wed, Oct 28, 2015, at 15:27, Rasmus Villemoes wrote:
>> 
>> I agree - proper overflow checking can be really hard. Quick, assuming a
>> and b have the same unsigned integer type, is 'a+b> check overflow? Of course not (hint: promotion rules). And as you say,
>> it gets even more complicated for signed types.
>> 
>> A few months ago I tried posting a complete set of fallbacks for older
>> compilers (https://lkml.org/lkml/2015/7/19/358), but nothing really
>> happened. Now I know where Linus stands, so I guess I can just delete
>> that branch.
>
> I actually like your approach of being type agnostic a bit more (in
> comparison to static inline functions), mostly because of one specific
> reason:
>
> The type agnostic __builtin_*_overflow function even do the correct
> things if you deal with types smaller than int. Imagine e.g. you want to
> add to unsigned chars a and b,

If you read my mail again you'll see that I mentioned exactly this :-)
so obviously I agree that this is a nice part of it.

> unsigned char a, b;
> if (a + b < a)
>   goto overflow;
> else
>   a += b;
>
> The overflow condition will never trigger, as the comparisons will
> always be done in the integer domain and a + b < a is never true. I
> actually think that this is easy to overlook and the functions should
> handle that.

Yes. While people very rarely use local u8 or u16 variables for
computations, I think one could imagine a and b being struct members,
which for one reason or another happens to be of a type narrower than
int (which would also make the issue much harder to spot since the
struct definition is far away). Something like

combine_packets(struct foo *a, const struct foo *b)
{
  if (a->len + b->len < a->len)
return -EOVERFLOW;
  /* ensure a->payload is big enough...*/
  memcpy(a->payload + a->len, b->payload, b->len);
  a->len += b->len;
  ...
}

which, depending on details, would either lead to memory corruption or
loss of parts of the packets.

I haven't actually found any instance of this in the kernel, but that
doesn't mean it couldn't get introduced (or that it doesn't exist).

Aside: It turns out clang is smart enough to optimize away the broken
overflow check, but gcc isn't. Neither issue a warning, despite the
intention being rather clear.

Rasmus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] macvtap: Resolve possible __might_sleep warning in macvtap_do_read()

2015-11-09 Thread Vladislav Yasevich

macvtap_do_read code calls macvtap_put_user while it might be set up
to wait for the user.  This results in the following warning:

Jun 23 16:25:26 galen kernel: [ cut here ]
Jun 23 16:25:26 galen kernel: WARNING: CPU: 0 PID: 30433 at kernel/sched/core.c:
7286 __might_sleep+0x7f/0x90()
Jun 23 16:25:26 galen kernel: do not call blocking ops when !TASK_RUNNING; state
=1 set at [] prepare_to_wait+0x2f/0x90
Jun 23 16:25:26 galen kernel: CPU: 0 PID: 30433 Comm: cat Not tainted 4.1.0-rc6+
 #11
Jun 23 16:25:26 galen kernel: Call Trace:
Jun 23 16:25:26 galen kernel: [] dump_stack+0x4c/0x65
Jun 23 16:25:26 galen kernel: [] warn_slowpath_common+0x8a/0xc
0
Jun 23 16:25:26 galen kernel: [] warn_slowpath_fmt+0x46/0x50
Jun 23 16:25:26 galen kernel: [] ?  prepare_to_wait+0x2f/0x90
Jun 23 16:25:26 galen kernel: [] ?  prepare_to_wait+0x2f/0x90
Jun 23 16:25:26 galen kernel: [] __might_sleep+0x7f/0x90
Jun 23 16:25:26 galen kernel: [] might_fault+0x55/0xb0
Jun 23 16:25:26 galen kernel: [] ?  
trace_hardirqs_on_caller+0x fd/0x1c0
Jun 23 16:25:26 galen kernel: [] copy_to_iter+0x7c/0x360
Jun 23 16:25:26 galen kernel: [] macvtap_do_read+0x256/0x3d0 
[macvtap]
Jun 23 16:25:26 galen kernel: [] ?  
prepare_to_wait_event+0x110/0x110
Jun 23 16:25:26 galen kernel: [] macvtap_read_iter+0x2b/0x50 
[macvtap]
Jun 23 16:25:26 galen kernel: [] __vfs_read+0xae/0xe0
Jun 23 16:25:26 galen kernel: [] vfs_read+0x86/0x140
Jun 23 16:25:26 galen kernel: [] SyS_read+0x49/0xb0
Jun 23 16:25:26 galen kernel: [] 
system_call_fastpath+0x12/0x76
Jun 23 16:25:26 galen kernel: ---[ end trace 22e33f67e70c0c2a ]---

Make sure thet we call finish_wait() if we have the skb to process
before trying to actually process it.

Signed-off-by: Vladislav Yasevich 
---
 drivers/net/macvtap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 248478c..844a442 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -935,6 +935,9 @@ static ssize_t macvtap_do_read(struct macvtap_queue *q,
/* Nothing to read, let's sleep */
schedule();
}
+   if (!noblock)
+   finish_wait(sk_sleep(&q->sk), &wait);
+
if (skb) {
ret = macvtap_put_user(q, skb, to);
if (unlikely(ret < 0))
@@ -942,8 +945,6 @@ static ssize_t macvtap_do_read(struct macvtap_queue *q,
else
consume_skb(skb);
}
-   if (!noblock)
-   finish_wait(sk_sleep(&q->sk), &wait);
return ret;
 }
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] mvneta: add FIXED_PHY dependency

2015-11-09 Thread Arnd Bergmann

The fixed_phy infrastructure is done in a way that is optional,
by providing 'static inline' helper functions doing nothing in
include/linux/phy_fixed.h for all its APIs. However, three out
of the four users (DSA, BCMGENET, and SYSTEMPORT) always
'select FIXED_PHY', presumably because they need that.
MVNETA is the fourth one, and if that is built-in but FIXED_PHY
is configured as a loadable module, we get a link error:

drivers/built-in.o: In function `mvneta_fixed_link_update':
fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state'

Presumably this driver has the same dependency as the others,
so this patch also uses 'select' to ensure that the fixed-phy
support is built-in.

Signed-off-by: Arnd Bergmann 
Fixes: 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state 
signaling")
---
Found using ARM randconfig tests. An alternative here would be
to use 'depends on FIXED_PHY || FIXED_PHY=n', I picked the 'select'
approach for consistency.

Should we perhaps make 'FIXED_PHY' a silent option and remove the
inline helpers, based on the assumption that a driver that wants these
will not work without them?

diff --git a/drivers/net/ethernet/marvell/Kconfig 
b/drivers/net/ethernet/marvell/Kconfig
index 80af9ffce5ea..a1c862b4664d 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -44,6 +44,7 @@ config MVNETA
tristate "Marvell Armada 370/38x/XP network interface support"
depends on PLAT_ORION
select MVMDIO
+   select FIXED_PHY
---help---
  This driver supports the network interface units in the
  Marvell ARMADA XP, ARMADA 370 and ARMADA 38x SoC family.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next v2 7/8] openvswitch: Delay conntrack helper call for new connections.

2015-11-09 Thread Patrick McHardy

On 06.11, Jarno Rajahalme wrote:
> There is no need to help connections that are not confirmed, so we can
> delay helping new connections to the time when they are confirmed.
> This change is needed for NAT support, and having this as a separate
> patch will make the following NAT patch a bit easier to review.

For the first packet a helper receives the connection is always unconfirmed.
It makes no sense to confirm it if the helper drops the packet.

> Signed-off-by: Jarno Rajahalme 
> ---
>  net/openvswitch/conntrack.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 7aa38fa..ba44287 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -458,6 +458,7 @@ static bool skb_nfct_cached(struct net *net,
>  /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if
>   * not done already.  Update key with new CT state after passing the packet
>   * through conntrack.
> + * Note that invalid packets are accepted while the skb->nfct remains unset!
>   */
>  static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
>  const struct ovs_conntrack_info *info,
> @@ -468,7 +469,11 @@ static int __ovs_ct_lookup(struct net *net, struct 
> sw_flow_key *key,
>* actually run the packet through conntrack twice unless it's for a
>* different zone.
>*/
> - if (!skb_nfct_cached(net, key, info, skb)) {
> + bool cached = skb_nfct_cached(net, key, info, skb);
> + enum ip_conntrack_info ctinfo;
> + struct nf_conn *ct;
> +
> + if (!cached) {
>   struct nf_conn *tmpl = info->ct;
>   int err;
>  
> @@ -491,11 +496,16 @@ static int __ovs_ct_lookup(struct net *net, struct 
> sw_flow_key *key,
>   return -ENOENT;
>  
>   ovs_ct_update_key(skb, key, true);
> + }
>  
> - if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
> - WARN_ONCE(1, "helper rejected packet");
> - return -EINVAL;
> - }
> + /* Call the helper right after nf_conntrack_in() for confirmed
> +  * connections, but only when commiting for unconfirmed connections.
> +  */
> + ct = nf_ct_get(skb, &ctinfo);
> + if (ct && (nf_ct_is_confirmed(ct) ? !cached : info->commit)
> + && ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
> + WARN_ONCE(1, "helper rejected packet");
> + return -EINVAL;
>   }
>  
>   return 0;
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH net-next v2 0/8] openvswitch: NAT support.

2015-11-09 Thread Patrick McHardy

On 06.11, Jarno Rajahalme wrote:
> This series adds NAT support to openvswitch kernel module.  A few
> changes are needed to the netfilter code to facilitate this (patches
> 1-3/8).  Patches 4-7 make the openvswitch kernel module ready for the
> patch 8 that adds the NAT support for calling into netfilter NAT code
> from the openvswitch conntrack action.

I'm missing some high level description, especially how it is invoked, how
it makes sure expectations of the NAT code about its invocation are met
(it is my understanding that OVS simply invokes this based on actions
specified by the user) and how it interacts with the remaining netfilter
features.

> Jarno Rajahalme (8):
>   netfilter: Remove IP_CT_NEW_REPLY definition.
>   netfilter: Factor out nf_ct_get_info().
>   netfilter: Allow calling into nat helper without skb_dst.
>   openvswitch: Update the CT state key only after nf_conntrack_in().
>   openvswitch: Find existing conntrack entry after upcall.
>   openvswitch: Handle NF_REPEAT in conntrack action.
>   openvswitch: Delay conntrack helper call for new connections.
>   openvswitch: Interface with NAT.
> 
>  include/net/netfilter/nf_conntrack.h   |  15 +
>  include/uapi/linux/netfilter/nf_conntrack_common.h |  12 +-
>  include/uapi/linux/openvswitch.h   |  47 ++
>  net/ipv4/netfilter/nf_nat_l3proto_ipv4.c   |  29 +-
>  net/ipv6/netfilter/nf_nat_l3proto_ipv6.c   |  29 +-
>  net/netfilter/nf_conntrack_core.c  |  22 +-
>  net/openvswitch/conntrack.c| 632 
> +++--
>  net/openvswitch/conntrack.h|   3 +-
>  8 files changed, 686 insertions(+), 103 deletions(-)
> 
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Linux 4.2.4

2015-11-09 Thread Pablo Neira Ayuso

On Mon, Nov 09, 2015 at 01:35:11PM +0100, Gerhard Wiesinger wrote:
> On 08.11.2015 18:20, Greg KH wrote:
> >That's great, can you let me know the git commits that end up in Linus's
> >tree?  That's what we need for the stable kernel.
> 
> Find the commits here:
> https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/
> https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/commit/?id=e75cb467df29a428612c162e6f1451c5c0717091
> 
> Don't know exactly the merging processes, so feel free to merge or contact
> Pablo.

I'll take care of that, this is already following its path to -stable,
will take a little while though as usual.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Linux 4.2.4

2015-11-09 Thread Gerhard Wiesinger


On 08.11.2015 18:20, Greg KH wrote:

On Sun, Nov 08, 2015 at 02:51:01PM +0100, Gerhard Wiesinger wrote:

On 25.10.2015 17:29, Greg KH wrote:

On Sun, Oct 25, 2015 at 11:48:54AM +0100, Gerhard Wiesinger wrote:

On 25.10.2015 10:46, Willy Tarreau wrote:

ipset *triggered* the problem. The whole stack dump would tell more.

OK, find the stack traces in the bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1272645

Kernel 4.1.10 triggered also a kernel dump when playing with ipset commands
and IPv6, details in the bug report  


Kernel 4.2 seems to me not well tested in the netfilter parts at all
(Bug with already known bugfix
https://lists.debian.org/debian-kernel/2015/10/msg00034.html was
triggered on 2 of 3 of my machines, the new bug on 1 of 1 tested machine).

There's a reason why Greg maintains stable and LTS kernels :-)

Stable kernels don't crash but definiton. :-)

At least triggered 2 kernel panics in 5min, even with 4.1.10 and ipset
commands ...

Does this happen also with Linus's tree?  I suggest you ask the
networking developers about this on netdev@vger.kernel.org, there's
nothing that I can do on my own about this, sorry.

Patch is now available, see:
[PATCH 0/3] ipset patches for nf
https://marc.info/?l=netfilter-devel&m=144690007708041&w=2
https://marc.info/?l=netfilter-devel&m=144690007808042&w=2
https://marc.info/?l=netfilter-devel&m=144690008608043&w=2
https://marc.info/?l=netfilter-devel&m=144690007708039&w=2
[ANNOUNCE] ipset 6.27 released
https://marc.info/?l=netfilter-devel&m=144690048308099&w=2

Requires also new userland ipset version.

Please integrate it upstream.

Thanx to Jozsef Kadlecsik for fixing it.

That's great, can you let me know the git commits that end up in Linus's
tree?  That's what we need for the stable kernel.


Find the commits here:
https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/
https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/commit/?id=e75cb467df29a428612c162e6f1451c5c0717091

Don't know exactly the merging processes, so feel free to merge or 
contact Pablo.


Ciao,
Gerhard
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] net: caif: check return value of alloc_netdev

2015-11-09 Thread Rasmus Villemoes

I don't know if dev can actually be NULL here, but the test should be
above alloc_netdev(), to avoid leaking the struct net_device in case
dev is actually NULL. And of course the return value from alloc_netdev
should be tested.

Signed-off-by: Rasmus Villemoes 
---
Maybe the existing code was supposed to be "if (!ndev)", and dev
cannot be NULL, but then -ENODEV is a slightly odd return
value. Doing both tests seems to be the safe choice.

 drivers/net/caif/caif_spi.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/caif/caif_spi.c b/drivers/net/caif/caif_spi.c
index de3962014af7..4721948a92f6 100644
--- a/drivers/net/caif/caif_spi.c
+++ b/drivers/net/caif/caif_spi.c
@@ -730,11 +730,14 @@ int cfspi_spi_probe(struct platform_device *pdev)
int res;
dev = (struct cfspi_dev *)pdev->dev.platform_data;
 
-   ndev = alloc_netdev(sizeof(struct cfspi), "cfspi%d",
-   NET_NAME_UNKNOWN, cfspi_setup);
if (!dev)
return -ENODEV;
 
+   ndev = alloc_netdev(sizeof(struct cfspi), "cfspi%d",
+   NET_NAME_UNKNOWN, cfspi_setup);
+   if (!ndev)
+   return -ENOMEM;
+
cfspi = netdev_priv(ndev);
netif_stop_queue(ndev);
cfspi->ndev = ndev;
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 4/4] tty: Remove drivers' extra tty_ldisc_flush()

2015-11-09 Thread Peter Hurley

The tty_port_close_start() helper already flushes the tty and ldisc
buffers on final close; tty drivers which use this helper need not
repeat tty_ldisc_flush().

Signed-off-by: Peter Hurley 
---
 drivers/char/pcmcia/synclink_cs.c | 3 ---
 drivers/tty/amiserial.c   | 2 --
 drivers/tty/rocket.c  | 2 --
 drivers/tty/serial/serial_core.c  | 2 --
 drivers/tty/synclink.c| 1 -
 drivers/tty/synclink_gt.c | 1 -
 drivers/tty/synclinkmp.c  | 1 -
 drivers/tty/tty_port.c| 2 --
 8 files changed, 14 deletions(-)

diff --git a/drivers/char/pcmcia/synclink_cs.c 
b/drivers/char/pcmcia/synclink_cs.c
index 45df4bf..3f74677 100644
--- a/drivers/char/pcmcia/synclink_cs.c
+++ b/drivers/char/pcmcia/synclink_cs.c
@@ -2354,10 +2354,7 @@ static void mgslpc_close(struct tty_struct *tty, struct 
file * filp)
mgslpc_wait_until_sent(tty, info->timeout);
 
mgslpc_flush_buffer(tty);
-
-   tty_ldisc_flush(tty);
shutdown(info, tty);
-   
tty_port_close_end(port, tty);
tty_port_tty_set(port, NULL);
 cleanup:
diff --git a/drivers/tty/amiserial.c b/drivers/tty/amiserial.c
index 2caaf5a..bffc0a4 100644
--- a/drivers/tty/amiserial.c
+++ b/drivers/tty/amiserial.c
@@ -1420,8 +1420,6 @@ static void rs_close(struct tty_struct *tty, struct file 
* filp)
}
shutdown(tty, state);
rs_flush_buffer(tty);
-   
-   tty_ldisc_flush(tty);
port->tty = NULL;
 
tty_port_close_end(port, tty);
diff --git a/drivers/tty/rocket.c b/drivers/tty/rocket.c
index a6b5ce0..5905200 100644
--- a/drivers/tty/rocket.c
+++ b/drivers/tty/rocket.c
@@ -1022,8 +1022,6 @@ static void rp_close(struct tty_struct *tty, struct file 
*filp)
sClrDTR(cp);
 
rp_flush_buffer(tty);
-   
-   tty_ldisc_flush(tty);
 
clear_bit((info->aiop * 8) + info->chan, (void *) 
&xmit_flags[info->board]);
 
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index db27a40..418587f 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -1439,8 +1439,6 @@ static void uart_close(struct tty_struct *tty, struct 
file *filp)
wake_up_interruptible(&port->open_wait);
 
mutex_unlock(&port->mutex);
-
-   tty_ldisc_flush(tty);
 }
 
 static void uart_wait_until_sent(struct tty_struct *tty, int timeout)
diff --git a/drivers/tty/synclink.c b/drivers/tty/synclink.c
index 6188059..1334498 100644
--- a/drivers/tty/synclink.c
+++ b/drivers/tty/synclink.c
@@ -3099,7 +3099,6 @@ static void mgsl_close(struct tty_struct *tty, struct 
file * filp)
if (info->port.flags & ASYNC_INITIALIZED)
mgsl_wait_until_sent(tty, info->timeout);
mgsl_flush_buffer(tty);
-   tty_ldisc_flush(tty);
shutdown(info);
mutex_unlock(&info->port.mutex);
 
diff --git a/drivers/tty/synclink_gt.c b/drivers/tty/synclink_gt.c
index 5505ea8..1987fb4 100644
--- a/drivers/tty/synclink_gt.c
+++ b/drivers/tty/synclink_gt.c
@@ -729,7 +729,6 @@ static void close(struct tty_struct *tty, struct file *filp)
if (info->port.flags & ASYNC_INITIALIZED)
wait_until_sent(tty, info->timeout);
flush_buffer(tty);
-   tty_ldisc_flush(tty);
 
shutdown(info);
mutex_unlock(&info->port.mutex);
diff --git a/drivers/tty/synclinkmp.c b/drivers/tty/synclinkmp.c
index fb00a06..fb17dac 100644
--- a/drivers/tty/synclinkmp.c
+++ b/drivers/tty/synclinkmp.c
@@ -816,7 +816,6 @@ static void close(struct tty_struct *tty, struct file *filp)
wait_until_sent(tty, info->timeout);
 
flush_buffer(tty);
-   tty_ldisc_flush(tty);
shutdown(info);
mutex_unlock(&info->port.mutex);
 
diff --git a/drivers/tty/tty_port.c b/drivers/tty/tty_port.c
index ecb6435..c30525a 100644
--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -503,8 +503,6 @@ void tty_port_close_end(struct tty_port *port, struct 
tty_struct *tty)
 {
unsigned long flags;
 
-   tty_ldisc_flush(tty);
-
spin_lock_irqsave(&port->lock, flags);
 
if (port->blocked_open) {
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 2/4] n_tty: Ignore all read data when closing

2015-11-09 Thread Peter Hurley

On final port close (and thus final tty close), only output flow
control requests in the input data should be processed. Ignore all
other input data, including parity errors, overruns and breaks.

Signed-off-by: Peter Hurley 
---
 drivers/tty/n_tty.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bc613b8..2de0283 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1564,8 +1564,6 @@ n_tty_receive_buf_closing(struct tty_struct *tty, const 
unsigned char *cp,
flag = *fp++;
if (likely(flag == TTY_NORMAL))
n_tty_receive_char_closing(tty, *cp++);
-   else
-   n_tty_receive_char_flagged(tty, *cp++, flag);
}
 }
 
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 3/4] tty: Abstract and encapsulate tty->closing behavior

2015-11-09 Thread Peter Hurley

tty->closing is exclusively used to cause the N_TTY line discipline
to drop further input on final tty close (except XON/XOFF output flow
control changes). In turn, this prevents the line discipline from
generating new tty driver i/o requests (eg., when echoing) after the tty
driver has performed h/w shutdown.

Abstract this notification with new ldisc api function,
tty_ldisc_closing(), which invokes the new line discipline method,
ops->closing(). Define this method for N_TTY line discipline and
localize closing state to n_tty private data. Remove tty->closing.

Note that resetting tty->closing to 0 (and thus allowing the
line discipline to resume normal input processing) is unnecessary
and undesirable: since the tty is in final close, both the line
discipline instance and the tty are being destroyed, so resuming normal
input processing after h/w shutdown is counter-productive.

NB: ipwireless_tty_free() is completely bogus; freeing the tty (?!) with
open, in-use file descriptors is laughable.

Signed-off-by: Peter Hurley 
---

v2: Fixed tty_ldisc_closing() ld use found by Johannes Stezenbach 


 drivers/isdn/i4l/isdn_tty.c  |  2 +-
 drivers/s390/char/con3215.c  |  3 +--
 drivers/staging/dgap/dgap.c  |  4 +---
 drivers/staging/dgnc/dgnc_tty.c  |  4 +---
 drivers/tty/hvc/hvsi.c   |  2 +-
 drivers/tty/ipwireless/tty.c |  1 -
 drivers/tty/n_tty.c  | 13 +++--
 drivers/tty/rocket.c |  1 -
 drivers/tty/serial/68328serial.c |  3 +--
 drivers/tty/serial/crisv10.c |  3 +--
 drivers/tty/serial/serial_core.c |  1 -
 drivers/tty/tty_ldisc.c  | 20 
 drivers/tty/tty_port.c   |  3 +--
 include/linux/tty.h  |  2 +-
 include/linux/tty_ldisc.h|  9 +
 15 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/drivers/isdn/i4l/isdn_tty.c b/drivers/isdn/i4l/isdn_tty.c
index 2175225..cddba25 100644
--- a/drivers/isdn/i4l/isdn_tty.c
+++ b/drivers/isdn/i4l/isdn_tty.c
@@ -1574,7 +1574,7 @@ isdn_tty_close(struct tty_struct *tty, struct file *filp)
}
port->flags |= ASYNC_CLOSING;
 
-   tty->closing = 1;
+   tty_ldisc_closing(tty);
/*
 * At this point we stop accepting input.  To do this, we
 * disable the receive line status interrupts, and tell the
diff --git a/drivers/s390/char/con3215.c b/drivers/s390/char/con3215.c
index 0fc3fe5..715251d 100644
--- a/drivers/s390/char/con3215.c
+++ b/drivers/s390/char/con3215.c
@@ -1006,11 +1006,10 @@ static void tty3215_close(struct tty_struct *tty, 
struct file * filp)
raw = (struct raw3215_info *) tty->driver_data;
if (raw == NULL || tty->count > 1)
return;
-   tty->closing = 1;
+   tty_ldisc_closing(tty);
/* Shutdown the terminal */
raw3215_shutdown(raw);
tasklet_kill(&raw->tlet);
-   tty->closing = 0;
tty_port_tty_set(&raw->port, NULL);
 }
 
diff --git a/drivers/staging/dgap/dgap.c b/drivers/staging/dgap/dgap.c
index 9112dd2..0456e28 100644
--- a/drivers/staging/dgap/dgap.c
+++ b/drivers/staging/dgap/dgap.c
@@ -4622,7 +4622,7 @@ static void dgap_tty_close(struct tty_struct *tty, struct 
file *file)
 
un->un_flags |= UN_CLOSING;
 
-   tty->closing = 1;
+   tty_ldisc_closing(tty);
 
/*
 * Only officially close channel if count is 0 and
@@ -4645,8 +4645,6 @@ static void dgap_tty_close(struct tty_struct *tty, struct 
file *file)
 
spin_lock_irqsave(&ch->ch_lock, lock_flags);
 
-   tty->closing = 0;
-
/*
 * If we have HUPCL set, lower DTR and RTS
 */
diff --git a/drivers/staging/dgnc/dgnc_tty.c b/drivers/staging/dgnc/dgnc_tty.c
index fbfe79a..96960d8 100644
--- a/drivers/staging/dgnc/dgnc_tty.c
+++ b/drivers/staging/dgnc/dgnc_tty.c
@@ -1410,7 +1410,7 @@ static void dgnc_tty_close(struct tty_struct *tty, struct 
file *file)
/* OK, its the last close on the unit */
un->un_flags |= UN_CLOSING;
 
-   tty->closing = 1;
+   tty_ldisc_closing(tty);
 
 
/*
@@ -1441,8 +1441,6 @@ static void dgnc_tty_close(struct tty_struct *tty, struct 
file *file)
 
spin_lock_irqsave(&ch->ch_lock, flags);
 
-   tty->closing = 0;
-
/*
 * If we have HUPCL set, lower DTR and RTS
 */
diff --git a/drivers/tty/hvc/hvsi.c b/drivers/tty/hvc/hvsi.c
index a75146f..fbaa6ab 100644
--- a/drivers/tty/hvc/hvsi.c
+++ b/drivers/tty/hvc/hvsi.c
@@ -796,7 +796,7 @@ static void hvsi_close(struct tty_struct *tty, struct file 
*filp)
 * any data delivered to the tty layer after this will 
be
 * discarded (except for XON/XOFF)
 */
-   tty->closing = 1;
+   tty_ldisc_closing(tty);
 
spin_unlock_irqrestore(&hp->lock, flags);
 
diff --git a/drivers/tt

[PATCH v2 1/4] tty: rocket: Remove private close_wait

2015-11-09 Thread Peter Hurley

This driver's private completion variable, close_wait, is no longer
used for wait since "tty: Remove ASYNC_CLOSING checks in open()/hangup";
remove.

Signed-off-by: Peter Hurley 
---
 drivers/tty/rocket.c | 2 --
 drivers/tty/rocket_int.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/drivers/tty/rocket.c b/drivers/tty/rocket.c
index 802eac7..c5179f8 100644
--- a/drivers/tty/rocket.c
+++ b/drivers/tty/rocket.c
@@ -643,7 +643,6 @@ static void init_r_port(int board, int aiop, int chan, 
struct pci_dev *pci_dev)
info->chan = chan;
tty_port_init(&info->port);
info->port.ops = &rocket_port_ops;
-   init_completion(&info->close_wait);
info->flags &= ~ROCKET_MODE_MASK;
switch (pc104[board][line]) {
case 422:
@@ -1049,7 +1048,6 @@ static void rp_close(struct tty_struct *tty, struct file 
*filp)
mutex_unlock(&port->mutex);
tty_port_tty_set(port, NULL);
 
-   complete_all(&info->close_wait);
atomic_dec(&rp_num_ports_open);
 
 #ifdef ROCKET_DEBUG_OPEN
diff --git a/drivers/tty/rocket_int.h b/drivers/tty/rocket_int.h
index 67e0f1e..ef1e1be 100644
--- a/drivers/tty/rocket_int.h
+++ b/drivers/tty/rocket_int.h
@@ -1144,7 +1144,6 @@ struct r_port {
int read_status_mask;
int cps;
 
-   struct completion close_wait;   /* Not yet matching the core */
spinlock_t slock;
struct mutex write_mtx;
 };
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 0/4] Replace tty->closing

2015-11-09 Thread Peter Hurley

Hi Greg,

This series cleans up a messy and poorly documented mechanism required
at tty final close to prevent drivers from crashing after h/w shutdown.

Without special handling, N_TTY echoing will cause driver i/o requests
_after_ h/w shutdown, which typically crashes the driver. Currently,
the tty_struct::closing flag triggers this special handling. However,
this mechanism is error-prone and subject to driver misuse.

This series replaces tty->closing with a ldisc-specific interface,
tty_ldisc_closing(), and implements this interface for N_TTY.
For tty drivers which use tty_port_close_start(), this change eliminates
the last vestige of direct driver<->ldisc interaction. The few tty drivers
which open-code the close() method [1] still use tty_ldisc_closing() directly.

The tty driver is aware final close for the tty has commenced because the
tty->count == 1 in the close() method. On final close, the following is
also true:
1. port->count == 1. tty drivers which ref count the port, use the
   --port->count == 0 as a substitute condition for final close.
2. final close is occurring as a result of the last in-use file descriptor
   release. Consequently, there will be no read/poll/ioctl in-progress.
3. the line discipline instance will be stopped and destroyed immediately
   after the tty driver completes the close() method
4. the tty itself will be unrefed immediately after the line discipline
   instance is destroyed.

Thus, the ldisc and tty buffers need only be flushed once, as any data
received by the tty driver after the flush but before h/w shutdown will
be deleted when the line discipline instance is destroyed.
No new echoes will occur after the ldisc flush because the echo buffer
is also flushed and new input (which otherwise might generate echoes) is
ignored while closing. This series removes the extra tty_ldisc_flush()
being performed by most drivers after h/w shutdown.

Additionally, the ldisc closing state need not be reset since the line
discipline instance is being destroyed, so no interface is provided to reset
closing.


[1] tty drivers which open-code the close() method
drivers/staging/dgnc/dgnc_tty.c
drivers/staging/dgap/dgap.c
drivers/tty/hvc/hvsi.c
drivers/tty/serial/68328serial.c
drivers/tty/serial/crisv10.c
drivers/isdn/i4l/isdn_tty.c
drivers/s390/char/con3215.c

Changes to v2:
* Fixed tty_ldisc_closing() ld use found by Johannes Stezenbach 


Regards,

Peter Hurley (4):
  tty: rocket: Remove private close_wait
  n_tty: Ignore all read data when closing
  tty: Abstract and encapsulate tty->closing behavior
  tty: Remove drivers' extra tty_ldisc_flush()

 drivers/char/pcmcia/synclink_cs.c |  3 ---
 drivers/isdn/i4l/isdn_tty.c   |  2 +-
 drivers/s390/char/con3215.c   |  3 +--
 drivers/staging/dgap/dgap.c   |  4 +---
 drivers/staging/dgnc/dgnc_tty.c   |  4 +---
 drivers/tty/amiserial.c   |  2 --
 drivers/tty/hvc/hvsi.c|  2 +-
 drivers/tty/ipwireless/tty.c  |  1 -
 drivers/tty/n_tty.c   | 15 +++
 drivers/tty/rocket.c  |  5 -
 drivers/tty/rocket_int.h  |  1 -
 drivers/tty/serial/68328serial.c  |  3 +--
 drivers/tty/serial/crisv10.c  |  3 +--
 drivers/tty/serial/serial_core.c  |  3 ---
 drivers/tty/synclink.c|  1 -
 drivers/tty/synclink_gt.c |  1 -
 drivers/tty/synclinkmp.c  |  1 -
 drivers/tty/tty_ldisc.c   | 20 
 drivers/tty/tty_port.c|  5 +
 include/linux/tty.h   |  2 +-
 include/linux/tty_ldisc.h |  9 +
 21 files changed, 49 insertions(+), 41 deletions(-)

-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Jan Kara

On Mon 09-11-15 20:56:10, Tetsuo Handa wrote:
> There are many locations that do
> 
>   if (memory_was_allocated_by_vmalloc)
> vfree(ptr);
>   else
> kfree(ptr);
> 
> but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
> using is_vmalloc_addr(). Unless callers have special reasons, we can
> replace this branch with kvfree(). Please check and reply if you found
> problems.
> 
> Signed-off-by: Tetsuo Handa 
> Acked-by: Michal Hocko 

You can add

Acked-by: Jan Kara 

for the UDF and fs/xattr.c parts.

Honza

> Cc: Russell King  # arm
> Cc:  # apei
> Cc:  # drbd
> Cc:  # mspec
> Cc:  # drm
> Cc: Oleg Drokin  # lustre
> Cc: Andreas Dilger  # lustre
> Cc:  # coda
> Cc:  # jffs2
> Cc: Jan Kara  # udf
> Cc:  # xattr
> Cc:  # ipc + mm
> Cc:  # ipv4
> ---
>  arch/arm/mm/dma-mapping.c  | 11 ++--
>  drivers/acpi/apei/erst.c   |  6 ++--
>  drivers/block/drbd/drbd_bitmap.c   | 26 +
>  drivers/block/drbd/drbd_int.h  |  3 --
>  drivers/char/mspec.c   | 15 ++
>  drivers/gpu/drm/drm_hashtab.c  |  5 +---
>  .../lustre/include/linux/libcfs/libcfs_private.h   |  8 ++
>  fs/coda/coda_linux.h   |  3 +-
>  fs/jffs2/build.c   |  8 ++
>  fs/jffs2/fs.c  |  5 +---
>  fs/jffs2/super.c   |  5 +---
>  fs/udf/super.c |  7 +
>  fs/xattr.c | 33 
> ++
>  ipc/sem.c  |  2 +-
>  ipc/util.c | 11 ++--
>  ipc/util.h |  2 +-
>  mm/percpu.c| 18 +---
>  mm/vmalloc.c   |  5 +---
>  net/ipv4/fib_trie.c|  4 +--
>  19 files changed, 46 insertions(+), 131 deletions(-)
> 
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index e62400e..492bf3e 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1200,10 +1200,7 @@ error:
>   while (i--)
>   if (pages[i])
>   __free_pages(pages[i], 0);
> - if (array_size <= PAGE_SIZE)
> - kfree(pages);
> - else
> - vfree(pages);
> + kvfree(pages);
>   return NULL;
>  }
>  
> @@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, 
> struct page **pages,
>  size_t size, struct dma_attrs *attrs)
>  {
>   int count = size >> PAGE_SHIFT;
> - int array_size = count * sizeof(struct page *);
>   int i;
>  
>   if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) {
> @@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, 
> struct page **pages,
>   __free_pages(pages[i], 0);
>   }
>  
> - if (array_size <= PAGE_SIZE)
> - kfree(pages);
> - else
> - vfree(pages);
> + kvfree(pages);
>   return 0;
>  }
>  
> diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
> index 6682c5d..6e6bc10 100644
> --- a/drivers/acpi/apei/erst.c
> +++ b/drivers/acpi/apei/erst.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include  /* kvfree() */
>  #include 
>  
>  #include "apei-internal.h"
> @@ -532,10 +533,7 @@ retry:
>   return -ENOMEM;
>   memcpy(new_entries, entries,
>  erst_record_id_cache.len * sizeof(entries[0]));
> - if (erst_record_id_cache.size < PAGE_SIZE)
> - kfree(entries);
> - else
> - vfree(entries);
> + kvfree(entries);
>   erst_record_id_cache.entries = entries = new_entries;
>   erst_record_id_cache.size = new_size;
>   }
> diff --git a/drivers/block/drbd/drbd_bitmap.c 
> b/drivers/block/drbd/drbd_bitmap.c
> index 9462d27..2daaafb 100644
> --- a/drivers/block/drbd/drbd_bitmap.c
> +++ b/drivers/block/drbd/drbd_bitmap.c
> @@ -364,12 +364,9 @@ static void bm_free_pages(struct page **pages, unsigned 
> long number)
>   }
>  }
>  
> -static void bm_vk_free(void *ptr, int v)
> +static inline void bm_vk_free(void *ptr)
>  {
> - if (v)
> - vfree(ptr);
> - else
> - kfree(ptr);
> + kvfree(ptr);
>  }
>  
>  /*
> @@ -379,7 +376,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap 
> *b, unsigned long want)
>  {
>   struct page **old_pages = b->bm_pages;
>   struct page **new_pages, *page;
> - unsigned int i, bytes, vmalloced = 0;
> + unsigned int i, bytes;
>   unsigned long have = b->bm_number_of_pages;
>  
>

Re: [GIT] Networking

2015-11-09 Thread Hannes Frederic Sowa

Hi,

On Wed, Oct 28, 2015, at 15:27, Rasmus Villemoes wrote:
> On Wed, Oct 28 2015, Hannes Frederic Sowa 
> wrote:
> 
> > Hi Linus,
> >
> > On Wed, Oct 28, 2015, at 10:39, Linus Torvalds wrote:
> >> Get rid of it. And I don't *ever* want to see that shit again.
> >
> > I don't want to give up on that this easily:
> >
> > In future I would like to see an interface like this. It is often hard
> > to do correct overflow/wrap-around tests and it would be great if there
> > are helper functions which could easily and without a lot of thinking be
> > used by people to remove those problems from the kernel.
> 
> I agree - proper overflow checking can be really hard. Quick, assuming a
> and b have the same unsigned integer type, is 'a+b check overflow? Of course not (hint: promotion rules). And as you say,
> it gets even more complicated for signed types.
> 
> A few months ago I tried posting a complete set of fallbacks for older
> compilers (https://lkml.org/lkml/2015/7/19/358), but nothing really
> happened. Now I know where Linus stands, so I guess I can just delete
> that branch.

I actually like your approach of being type agnostic a bit more (in
comparison to static inline functions), mostly because of one specific
reason:

The type agnostic __builtin_*_overflow function even do the correct
things if you deal with types smaller than int. Imagine e.g. you want to
add to unsigned chars a and b,

unsigned char a, b;
if (a + b < a)
  goto overflow;
else
  a += b;

The overflow condition will never trigger, as the comparisons will
always be done in the integer domain and a + b < a is never true. I
actually think that this is easy to overlook and the functions should
handle that. The macro version does this quite nicely.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] tty: Abstract and encapsulate tty->closing behavior

2015-11-09 Thread Peter Hurley

On 11/09/2015 04:12 AM, Johannes Stezenbach wrote:
> On Sun, Nov 08, 2015 at 05:02:52PM -0500, Peter Hurley wrote:
>> +void tty_ldisc_closing(struct tty_struct *tty)
>> +{
>> +struct tty_ldisc *ld = tty_ldisc_ref(tty);
>> +
>> +if (ld->ops->closing)
>> +ld->ops->closing(tty);
>> +if (ld)
>> +tty_ldisc_deref(ld);
>> +}
> 
> That looks strange.  Do you need to check ld _before_ dereferencing?

Yes. Thanks for noticing.

Regards,
Peter Hurley
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Tetsuo Handa

There are many locations that do

  if (memory_was_allocated_by_vmalloc)
vfree(ptr);
  else
kfree(ptr);

but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.

Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Cc: Russell King  # arm
Cc:  # apei
Cc:  # drbd
Cc:  # mspec
Cc:  # drm
Cc: Oleg Drokin  # lustre
Cc: Andreas Dilger  # lustre
Cc:  # coda
Cc:  # jffs2
Cc: Jan Kara  # udf
Cc:  # xattr
Cc:  # ipc + mm
Cc:  # ipv4
---
 arch/arm/mm/dma-mapping.c  | 11 ++--
 drivers/acpi/apei/erst.c   |  6 ++--
 drivers/block/drbd/drbd_bitmap.c   | 26 +
 drivers/block/drbd/drbd_int.h  |  3 --
 drivers/char/mspec.c   | 15 ++
 drivers/gpu/drm/drm_hashtab.c  |  5 +---
 .../lustre/include/linux/libcfs/libcfs_private.h   |  8 ++
 fs/coda/coda_linux.h   |  3 +-
 fs/jffs2/build.c   |  8 ++
 fs/jffs2/fs.c  |  5 +---
 fs/jffs2/super.c   |  5 +---
 fs/udf/super.c |  7 +
 fs/xattr.c | 33 ++
 ipc/sem.c  |  2 +-
 ipc/util.c | 11 ++--
 ipc/util.h |  2 +-
 mm/percpu.c| 18 +---
 mm/vmalloc.c   |  5 +---
 net/ipv4/fib_trie.c|  4 +--
 19 files changed, 46 insertions(+), 131 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index e62400e..492bf3e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1200,10 +1200,7 @@ error:
while (i--)
if (pages[i])
__free_pages(pages[i], 0);
-   if (array_size <= PAGE_SIZE)
-   kfree(pages);
-   else
-   vfree(pages);
+   kvfree(pages);
return NULL;
 }
 
@@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, struct 
page **pages,
   size_t size, struct dma_attrs *attrs)
 {
int count = size >> PAGE_SHIFT;
-   int array_size = count * sizeof(struct page *);
int i;
 
if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) {
@@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, 
struct page **pages,
__free_pages(pages[i], 0);
}
 
-   if (array_size <= PAGE_SIZE)
-   kfree(pages);
-   else
-   vfree(pages);
+   kvfree(pages);
return 0;
 }
 
diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index 6682c5d..6e6bc10 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include  /* kvfree() */
 #include 
 
 #include "apei-internal.h"
@@ -532,10 +533,7 @@ retry:
return -ENOMEM;
memcpy(new_entries, entries,
   erst_record_id_cache.len * sizeof(entries[0]));
-   if (erst_record_id_cache.size < PAGE_SIZE)
-   kfree(entries);
-   else
-   vfree(entries);
+   kvfree(entries);
erst_record_id_cache.entries = entries = new_entries;
erst_record_id_cache.size = new_size;
}
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 9462d27..2daaafb 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -364,12 +364,9 @@ static void bm_free_pages(struct page **pages, unsigned 
long number)
}
 }
 
-static void bm_vk_free(void *ptr, int v)
+static inline void bm_vk_free(void *ptr)
 {
-   if (v)
-   vfree(ptr);
-   else
-   kfree(ptr);
+   kvfree(ptr);
 }
 
 /*
@@ -379,7 +376,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap 
*b, unsigned long want)
 {
struct page **old_pages = b->bm_pages;
struct page **new_pages, *page;
-   unsigned int i, bytes, vmalloced = 0;
+   unsigned int i, bytes;
unsigned long have = b->bm_number_of_pages;
 
BUG_ON(have == 0 && old_pages != NULL);
@@ -401,7 +398,6 @@ static struct page **bm_realloc_pages(struct drbd_bitmap 
*b, unsigned long want)
PAGE_KERNEL);
if (!new_pages)
return NULL;
-   vmalloced = 1;
}
 
if (want >= have) {
@@ -411,7 +407,7 @@ static struct page **bm_realloc_pag

Re: [PATCH net 1/2] packet: do skb_probe_transport_header when we actually have data

2015-11-09 Thread Daniel Borkmann


On 11/07/2015 11:29 PM, David Miller wrote:

From: Eric Dumazet 
Date: Sat, 07 Nov 2015 10:53:50 -0800


Well, imagine following scenario (a real one, as I use it all of
time, thus how I discovered all trafgen traffic ends up on one slave
only)

Even if qdisc is bypassed on the bond0, the current handling does
not prevent going to the slave qdiscs.


Ok, depending upon the semantics Daniel intended, we may have to add
a qdisc bypass boolean bit to SKBs.


It was resembling pktgen to some extend, only to be more flexible in
terms of defining packet payload (trafgen, I mean). Yes, pktgen has
the same issue on that regard, hmm I'm not yet sure, though, if it's
worth burning an extra skb bit for both cases.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: AF_PACKET mmap() v4...

2015-11-09 Thread Daniel Borkmann


On 11/08/2015 05:27 AM, John Fastabend wrote:

On 15-11-07 06:19 PM, Alexei Starovoitov wrote:

On Thu, Nov 05, 2015 at 10:39:15AM +0100, Daniel Borkmann wrote:

On 11/05/2015 10:07 AM, Arnd Bergmann wrote:

On Thursday 05 November 2015 00:04:14 David Miller wrote:

As part of fixing y2038 problems, Arnd is going to have to make a new
version fo the AF_PACKET mmap() tpacker descriptors in order to extend
the time values to 64-bit.


would also be quite useful to add ability to attach metadata to packet
from bpf program.
Right now we can only trim the length. Would be great if program could
compute something and pass it along with packet as metadata.


Also most modern NICs can generate metadata using packet filters
it would be nice to allow these to populate any metadata fields as well.
Ethtool already has a flow classifier feature that could be easily
extended once the stack has support.


If I understand this correctly, that would be something independent from
packet sockets, right? Attaching metadata to the skb could be currently done
via mark, tc_index, tc_classid, priority, but I presume you mean something
else. ;)

Or, do you mean to push meta data into skb->data f.e. in front of the frame.
As in having some sort of a 'dynamic-sized', reserved scratch space or stack
at the head or tail of the skb (not visible to the network itself, but only
to the local NIC)?

It would be interesting if it could be used to interact with the NIC, as
John says, e.g. from incoming side to place a tag or additional meta data
there as a result of the NIC's flow classifier, which might then be read out
from an eBPF program to perform further actions on the skb.

If you even want to take this one step further for data center environments,
there was the idea floating around [1], where you encapsulate a restricted
set of instructions together with some scratch space between Ethernet header
and payload.

These "tiny packet programs" could then query switch meta data on the fly
that is being stored into the scratch space. Of course, this requires vendor
support, but this seems really powerful.

  [1] http://jvimal.github.io/tpp/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net v2] bpf: fix trivial comment typo

2015-11-09 Thread Matthew Fernandez

On 05/11/15 16:32, David Miller wrote:

From: Matthew Fernandez 
Date: Thu, 5 Nov 2015 11:09:52 +1100

bpf: fix trivial comment typo

Signed-off-by: Matthew Fernandez 

This does not apply.  It looks like your email client has
corrupted the patch.

Well I managed to foul that up nicely, didn't I? I don't know which tree you'd prefer this to go into, so I'll wait 
until net-next reopens to resubmit a corrected version. I'll try to be more thorough and avoid embarrassing myself and 
wasting your time for the third time running. Thanks for your patience.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GIT] Networking

2015-11-09 Thread Hannes Frederic Sowa

Hello,

Ingo Molnar  writes:

> * Linus Torvalds  wrote:
>
>> Does anybody have any particular other "uhhuh, overflow in multiplication" 
>> issues in mind? Because the interface for a saturating multiplication (or 
>> addition, for that matter) would actually be much easier. And would be 
>> trivial 
>> to have as an inline asm for compatibility with older versions of gcc too.
>> 
>> Then you could just do that jiffies conversion - or allocation, for that 
>> matter 
>> - without any special overflow handling at all. Doing
>> 
>> buf = kmalloc(sat_mul(sizeof(x), nr), GFP_KERNEL);
>> 
>> would just magically work.
>
> Exactly: saturation is the default behavior for many GPU vector/pixel 
> attributes 
> as well, to simplify and speed up the code and the hardware. I always wanted 
> our 
> ABIs to saturate instead of duplicating complexity with overflow failure 
> logic.

I don't think saturation arithmetic is useful at all in the kernel as a
replacement for overflow/wrap-around checks. Linus' example has a
discrepancy between what the caller expects and the actual number of
bytes allocated. Imagine sat_mul does the operation in signed char and
kmalloc takes only signed chars as an argument, it could actually be a
huge discrepancy that could lead to security vulnerabilities. The call
should definitely error out here and not try to allocate memory of some
different size and return it to the caller.

> In the kernel the first point of failure is missing overflow checks. The 
> second 
> point of failure are buggy overflow checks. We can eliminate both if we just 
> use 
> safe operations that produce output that never exit the valid range. This 
> also 
> happens to result in the simplest code. We should start thinking of overflow 
> checks as rootkit enablers.

Sorry, I don't understand that at all. sat_mul is a rootkit enabler, I
fear. If you allocate a smalelr portion of memory as the caller actually
asked for because of saturation logic, this definitely could lead to
memory corruption and hard to diagnose bugs.

> And note how much this simplifies review and static analysis: if this is the 
> dominant model used in new kernel code then the analysis (human or machine) 
> would 
> only have to ensure that no untrusted input values get multiplied (or added) 
> in an 
> unsafe way. It would not have to be able to understand and track any 
> 'overflow 
> logic' through a maze of return paths, and validate whether the 'overflow 
> logic' 
> is correct for all input parameter ranges...

Sorry, I don't really understand that proposal. :/

> The flip side is marginally less ABI robustness: random input parameters due 
> to 
> memory corruption will just saturate and produce nonsensical results. I don't 
> think it's a big issue, and I also think the simplicity of input parameter 
> validation is _way_ more important than our behavior to random input - but 
> I've 
> been overruled in the past when trying to introduce saturating ABIs, so 
> saturation 
> is something people sometimes find inelegant.

If those nonsensical results are memory corruptions I also don't
agree. I think we need to be very much accurate when dealing with
overflows.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GIT] Networking

2015-11-09 Thread Hannes Frederic Sowa

Hello,

Ingo Molnar  writes:

> * Linus Torvalds  wrote:
>
>> Does anybody have any particular other "uhhuh, overflow in multiplication" 
>> issues in mind? Because the interface for a saturating multiplication (or 
>> addition, for that matter) would actually be much easier. And would be 
>> trivial 
>> to have as an inline asm for compatibility with older versions of gcc too.
>> 
>> Then you could just do that jiffies conversion - or allocation, for that 
>> matter 
>> - without any special overflow handling at all. Doing
>> 
>> buf = kmalloc(sat_mul(sizeof(x), nr), GFP_KERNEL);
>> 
>> would just magically work.
>
> Exactly: saturation is the default behavior for many GPU vector/pixel 
> attributes 
> as well, to simplify and speed up the code and the hardware. I always wanted 
> our 
> ABIs to saturate instead of duplicating complexity with overflow failure 
> logic.

I don't think saturation arithmetic is useful at all in the kernel as a
replacement for overflow/wrap-around checks. Linus' example has a
discrepancy between what the caller expects and the actual number of
bytes allocated. Imagine sat_mul does the operation in signed char and
kmalloc takes only signed chars as an argument, it could actually be a
huge discrepancy that could lead to security vulnerabilities. The call
should definitely error out here and not try to allocate memory of some
different size and return it to the caller.

> In the kernel the first point of failure is missing overflow checks. The 
> second 
> point of failure are buggy overflow checks. We can eliminate both if we just 
> use 
> safe operations that produce output that never exit the valid range. This 
> also 
> happens to result in the simplest code. We should start thinking of overflow 
> checks as rootkit enablers.

Sorry, I don't understand that at all. sat_mul is a rootkit enabler, I
fear. If you allocate a smalelr portion of memory as the caller actually
asked for because of saturation logic, this definitely could lead to
memory corruption and hard to diagnose bugs.

> And note how much this simplifies review and static analysis: if this is the 
> dominant model used in new kernel code then the analysis (human or machine) 
> would 
> only have to ensure that no untrusted input values get multiplied (or added) 
> in an 
> unsafe way. It would not have to be able to understand and track any 
> 'overflow 
> logic' through a maze of return paths, and validate whether the 'overflow 
> logic' 
> is correct for all input parameter ranges...

Sorry, I don't really understand that proposal. :/

> The flip side is marginally less ABI robustness: random input parameters due 
> to 
> memory corruption will just saturate and produce nonsensical results. I don't 
> think it's a big issue, and I also think the simplicity of input parameter 
> validation is _way_ more important than our behavior to random input - but 
> I've 
> been overruled in the past when trying to introduce saturating ABIs, so 
> saturation 
> is something people sometimes find inelegant.

If those nonsensical results are memory corruptions I also don't
agree. I think we need to be very much accurate when dealing with
overflows.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: deadlock between setsockopt/getsockopt

2015-11-09 Thread Dmitry Vyukov

On Sun, Nov 8, 2015 at 6:16 PM, Eric Dumazet  wrote:
> On Sun, 2015-11-08 at 11:15 +0100, Dmitry Vyukov wrote:
>> Hello,
>>
>> I've got the following deadlock report on commit
>> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
>>
>>
>> [ INFO: possible circular locking dependency detected ]
>> 4.3.0+ #39 Not tainted
>> ---
>> syzkaller_execu/18311 is trying to acquire lock:
>>  (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
>> net/core/rtnetlink.c:70
>>
>> but task is already holding lock:
>>  (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock
>> include/net/sock.h:1477
>>  (sk_lock-AF_INET){+.+.+.}, at: []
>> do_ip_getsockopt.part.9+0x111/0x1510 net/ipv4/ip_sockglue.c:1272
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (sk_lock-AF_INET){+.+.+.}:
>>[] lock_acquire+0x16d/0x2f0
>> kernel/locking/lockdep.c:3585
>>[] lock_sock_nested+0xb8/0x110 net/core/sock.c:2443
>>[< inline >] lock_sock include/net/sock.h:1477
>>[] do_ip_setsockopt.isra.12+0x193/0x2af0
>> net/ipv4/ip_sockglue.c:621
>>[] ip_setsockopt+0x3a/0xb0 
>> net/ipv4/ip_sockglue.c:1202
>>[] tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2616
>>[] sock_common_setsockopt+0x95/0xd0
>> net/core/sock.c:2643
>>[< inline >] SYSC_setsockopt net/socket.c:1757
>>[] SyS_setsockopt+0x158/0x240 net/socket.c:1736
>>[] entry_SYSCALL_64_fastpath+0x31/0x9a
>> arch/x86/entry/entry_64.S:187
>>
>> -> #0 (rtnl_mutex){+.+.+.}:
>>[< inline >] check_prev_add kernel/locking/lockdep.c:1853
>>[< inline >] check_prevs_add kernel/locking/lockdep.c:1958
>>[< inline >] validate_chain kernel/locking/lockdep.c:2144
>>[] __lock_acquire+0x36d9/0x40e0
>> kernel/locking/lockdep.c:3206
>>[] lock_acquire+0x16d/0x2f0
>> kernel/locking/lockdep.c:3585
>>[< inline >] __mutex_lock_common kernel/locking/mutex.c:518
>>[] mutex_lock_nested+0x9c/0x8f0
>> kernel/locking/mutex.c:618
>>[] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:70
>>[] ip_mc_msfget+0xe0/0x620 net/ipv4/igmp.c:2398
>>[] do_ip_getsockopt.part.9+0x405/0x1510
>> net/ipv4/ip_sockglue.c:1399
>>[< inline >] do_ip_getsockopt net/ipv4/ip_sockglue.c:1264
>>[] ip_getsockopt+0xa8/0x1c0 
>> net/ipv4/ip_sockglue.c:1495
>>[] tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:2916
>>[] sock_common_getsockopt+0x95/0xd0
>> net/core/sock.c:2602
>>[< inline >] SYSC_getsockopt net/socket.c:1788
>>[] SyS_getsockopt+0x142/0x230 net/socket.c:1770
>>
>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>CPU0CPU1
>>
>>   lock(sk_lock-AF_INET);
>>lock(rtnl_mutex);
>>lock(sk_lock-AF_INET);
>>   lock(rtnl_mutex);
>>
>>  *** DEADLOCK ***
>>
>> 1 lock held by syzkaller_execu/18311:
>>  #0:  (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock
>> include/net/sock.h:1477
>>  #0:  (sk_lock-AF_INET){+.+.+.}, at: []
>> do_ip_getsockopt.part.9+0x111/0x1510 net/ipv4/ip_sockglue.c:1272
>>
>> stack backtrace:
>> CPU: 1 PID: 18311 Comm: syzkaller_execu Not tainted 4.3.0+ #39
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>   88005b647598 81aad406 845cb400
>>  84612200 845cb400 88005b6475e0 811ec511
>>  88005b6476e0 6c7d5800 88006c7d5fb0 88006c7d5fd2
>> Call Trace:
>>  [< inline >] __dump_stack lib/dump_stack.c:15
>>  [] dump_stack+0x68/0x92 lib/dump_stack.c:50
>>  [] print_circular_bug+0x2d1/0x390
>> kernel/locking/lockdep.c:1226
>>  [< inline >] check_prev_add kernel/locking/lockdep.c:1853
>>  [< inline >] check_prevs_add kernel/locking/lockdep.c:1958
>>  [< inline >] validate_chain kernel/locking/lockdep.c:2144
>>  [] __lock_acquire+0x36d9/0x40e0 
>> kernel/locking/lockdep.c:3206
>>  [] lock_acquire+0x16d/0x2f0 kernel/locking/lockdep.c:3585
>>  [< inline >] __mutex_lock_common kernel/locking/mutex.c:518
>>  [< inline >] __mutex_lock_common kernel/locking/mutex.c:518
>>  [] mutex_lock_nested+0x9c/0x8f0 kernel/locking/mutex.c:618
>>  [] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:70
>>  [] ip_mc_msfget+0xe0/0x620 net/ipv4/igmp.c:2398
>>  [] do_ip_getsockopt.part.9+0x405/0x1510
>> net/ipv4/ip_sockglue.c:1399
>>  [< inline >] do_ip_getsockopt net/ipv4/ip_sockglue.c:1264
>>  [] ip_getsockopt+0xa8/0x1c0 net/ipv4/ip_sockglue.c:1495
>>  [] tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:2916
>>  [] sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2602
>>  [< inline >] SYSC_getsockopt net/socket.c:1788
>>  [] SyS_getsockopt+0x142/0x230 net/socket.c:1770
>>

Re: [PATCH net-next ] net: ipv4: memset addr before calling copy_to_user()

2015-11-09 Thread Hannes Frederic Sowa

Hello,

On Mon, Nov 9, 2015, at 07:52, Loganaden Velvindron wrote:
> zero addr before calling copy_to_user()
>  
> Signed-off-by: Loganaden Velvindron 
> ---
>  net/ipv4/ip_sockglue.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index c3c359a..d7a5a8b 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -1373,6 +1373,7 @@ static int do_ip_getsockopt(struct sock *sk, int
> level, int optname,
>   case IP_MULTICAST_IF:
>   {
>   struct in_addr addr;
> +   memset(&addr, 0, sizeof(addr));
>   len = min_t(unsigned int, len, sizeof(struct in_addr));
>   addr.s_addr = inet->mc_addr;
>   release_sock(sk);

There is no possibility we leak any unwanted data to user space here.

If you are not sure if sizeof(addr) > sizeof(addr.s_addr) use a
designated initializer:

addr = { .s_addr = inet->mc_addr };

which clears all other non-initialized elements. But here we are very
certain.

We do not do defensive programming, we try to do logical things, and
only logical things.
— Eric Dumazet (Thanks to Dan Carpenter.)

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] man: Syntax and warning fixes

2015-11-09 Thread Albino B Neto

2015-11-07 7:55 GMT-02:00 Ville Skyttä :
>> Signed-off-by and description ?
>
> Superseding patches sent separately. I suggest documenting the
> requirement for Signed-off-by somewhere (README.devel?);

Please read the Documentation/SubmittingPatches.

Albino
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] tty: Abstract and encapsulate tty->closing behavior

2015-11-09 Thread Johannes Stezenbach

On Sun, Nov 08, 2015 at 05:02:52PM -0500, Peter Hurley wrote:
> +void tty_ldisc_closing(struct tty_struct *tty)
> +{
> + struct tty_ldisc *ld = tty_ldisc_ref(tty);
> +
> + if (ld->ops->closing)
> + ld->ops->closing(tty);
> + if (ld)
> + tty_ldisc_deref(ld);
> +}

That looks strange.  Do you need to check ld _before_ dereferencing?

Johannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA

2015-11-09 Thread Arnd Bergmann

On Monday 09 November 2015 10:34:30 Geert Uytterhoeven wrote:
> If NO_DMA=y:
> 
> ERROR: "dma_set_mask" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_unmap_single" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_unmap_page" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_mapping_error" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_map_page" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_supported" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_map_single" 
> [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
> ERROR: "dma_set_mask" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] 
> undefined!
> ERROR: "dma_supported" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] 
> undefined!
> ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
> undefined!
> ERROR: "dma_unmap_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
> undefined!
> ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
> undefined!
> ERROR: "dma_map_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
> undefined!
> ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
> undefined!
> ERROR: "dma_alloc_coherent" 
> [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined!
> ERROR: "dma_mapping_error" 
> [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined!
> ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] 
> undefined!
> ERROR: "dma_unmap_single" 
> [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined!
> ERROR: "dma_free_coherent" 
> [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined!
> ERROR: "dma_alloc_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
> undefined!
> ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
> undefined!
> ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
> undefined!
> ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
> undefined!
> ERROR: "dma_free_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
> undefined!
> 
> As this affects all of HNS_ENET, HNS_DSAF, HNS, HIX5HD2_GMAC, and
> HIP04_ETH, add a dependency on HAS_DMA to the main NET_VENDOR_HISILICON
> symbol to fix this.
> 
> Signed-off-by: Geert Uytterhoeven 
> 

Acked-by: Arnd Bergmann 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA

2015-11-09 Thread Geert Uytterhoeven

If NO_DMA=y:

ERROR: "dma_set_mask" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] 
undefined!
ERROR: "dma_unmap_single" 
[drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
ERROR: "dma_unmap_page" 
[drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
ERROR: "dma_mapping_error" 
[drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
ERROR: "dma_map_page" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] 
undefined!
ERROR: "dma_supported" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] 
undefined!
ERROR: "dma_map_single" 
[drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined!
ERROR: "dma_set_mask" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] 
undefined!
ERROR: "dma_supported" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] 
undefined!
ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
undefined!
ERROR: "dma_unmap_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
undefined!
ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
undefined!
ERROR: "dma_map_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
undefined!
ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] 
undefined!
ERROR: "dma_alloc_coherent" 
[drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined!
ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] 
undefined!
ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] 
undefined!
ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] 
undefined!
ERROR: "dma_free_coherent" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] 
undefined!
ERROR: "dma_alloc_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
undefined!
ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
undefined!
ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
undefined!
ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
undefined!
ERROR: "dma_free_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] 
undefined!

As this affects all of HNS_ENET, HNS_DSAF, HNS, HIX5HD2_GMAC, and
HIP04_ETH, add a dependency on HAS_DMA to the main NET_VENDOR_HISILICON
symbol to fix this.

Signed-off-by: Geert Uytterhoeven 
---
 drivers/net/ethernet/hisilicon/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/Kconfig 
b/drivers/net/ethernet/hisilicon/Kconfig
index f250dec488fd2a2b..74beb1867230c9cf 100644
--- a/drivers/net/ethernet/hisilicon/Kconfig
+++ b/drivers/net/ethernet/hisilicon/Kconfig
@@ -5,7 +5,8 @@
 config NET_VENDOR_HISILICON
bool "Hisilicon devices"
default y
-   depends on OF && (ARM || ARM64 || COMPILE_TEST)
+   depends on OF && HAS_DMA
+   depends on ARM || ARM64 || COMPILE_TEST
---help---
  If you have a network (Ethernet) card belonging to this class, say Y.
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] man: Syntax and warning fixes

2015-11-09 Thread Daniel Borkmann


On 11/07/2015 05:40 PM, David Ahern wrote:

On 11/7/15 2:55 AM, Ville Skyttä wrote:

On Sat, Nov 7, 2015 at 11:47 AM, Albino B Neto  wrote:

2015-11-07 7:44 GMT-02:00 Ville Skyttä :

---
  man/man8/tc-bpf.8 | 2 +-
  man/man8/tipc-bearer.8| 4 ++--
  man/man8/tipc-link.8  | 6 +++---
  man/man8/tipc-media.8 | 4 ++--
  man/man8/tipc-nametable.8 | 4 ++--
  man/man8/tipc-node.8  | 4 ++--
  man/man8/tipc-socket.8| 4 ++--
  7 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8
index 2c02ab2..f5201d3 100644
--- a/man/man8/tc-bpf.8
+++ b/man/man8/tc-bpf.8
@@ -844,7 +844,7 @@ result in the default classid:
  Basically, such a minimal generator is equivalent to:


Signed-off-by and description ?


Superseding patches sent separately. I suggest documenting the
requirement for Signed-off-by somewhere (README.devel?); git log shows
lots of commits without it, especially a bunch of man page related
ones about a week ago.


I believe man page changes should cc linux-...@vger.kernel.org


Not really needed, this is for iproute2, not man-pages project.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net 1/2] packet: do skb_probe_transport_header when we actually have data

2015-11-09 Thread Daniel Borkmann

On 11/09/2015 04:11 AM, David Miller wrote:

From: Daniel Borkmann 
Date: Sun, 08 Nov 2015 01:33:56 +0100

Hmm, yeah, on a (only quick) look, it seems this is mostly needed for
the
virtio_net related code in packet_snd() / packet_recvmsg(), not
handled in
RX/TX ring paths actually.

$ git grep -n gso_size net/packet/
net/packet/af_packet.c:2748: if (vnet_hdr.gso_size == 0)
net/packet/af_packet.c:2825:skb_shinfo(skb)->gso_size =
net/packet/af_packet.c:2826: __virtio16_to_cpu(vio_le(),
vnet_hdr.gso_size);
net/packet/af_packet.c:3219:vnet_hdr.gso_size =
net/packet/af_packet.c:3220: __cpu_to_virtio16(vio_le(),
sinfo->gso_size);

Need to take a closer look on Monday.

I think for complete safety, we need the transport header set for
all SKBs once they hit the device.

I know this is separate from the bugs you are trying to fix, but
let's take care of this in this series ok?

Ok, sure. I can add an extra one into this series removing the
conditional.

Thanks,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GIT] Networking

2015-11-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> Does anybody have any particular other "uhhuh, overflow in multiplication" 
> issues in mind? Because the interface for a saturating multiplication (or 
> addition, for that matter) would actually be much easier. And would be 
> trivial 
> to have as an inline asm for compatibility with older versions of gcc too.
> 
> Then you could just do that jiffies conversion - or allocation, for that 
> matter 
> - without any special overflow handling at all. Doing
> 
> buf = kmalloc(sat_mul(sizeof(x), nr), GFP_KERNEL);
> 
> would just magically work.

Exactly: saturation is the default behavior for many GPU vector/pixel 
attributes 
as well, to simplify and speed up the code and the hardware. I always wanted 
our 
ABIs to saturate instead of duplicating complexity with overflow failure logic.

In the kernel the first point of failure is missing overflow checks. The second 
point of failure are buggy overflow checks. We can eliminate both if we just 
use 
safe operations that produce output that never exit the valid range. This also 
happens to result in the simplest code. We should start thinking of overflow 
checks as rootkit enablers.

And note how much this simplifies review and static analysis: if this is the 
dominant model used in new kernel code then the analysis (human or machine) 
would 
only have to ensure that no untrusted input values get multiplied (or added) in 
an 
unsafe way. It would not have to be able to understand and track any 'overflow 
logic' through a maze of return paths, and validate whether the 'overflow 
logic' 
is correct for all input parameter ranges...

The flip side is marginally less ABI robustness: random input parameters due to 
memory corruption will just saturate and produce nonsensical results. I don't 
think it's a big issue, and I also think the simplicity of input parameter 
validation is _way_ more important than our behavior to random input - but I've 
been overruled in the past when trying to introduce saturating ABIs, so 
saturation 
is something people sometimes find inelegant.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

99 matches

Mail list logo