Re: [PATCH] man: Syntax and warning fixes
On Mon, Nov 9, 2015 at 11:40 AM, Albino B Neto wrote: > 2015-11-07 7:55 GMT-02:00 Ville Skyttä : >>> Signed-off-by and description ? >> >> Superseding patches sent separately. I suggest documenting the >> requirement for Signed-off-by somewhere (README.devel?); > > Please read the Documentation/SubmittingPatches. No such file in iproute2 git. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
PING: [PATCH] net: smsc911x: Reset PHY during initialization
Hello! So, what should we do with this? Kind regards, Pavel Fedin Expert Engineer Samsung Electronics Research center Russia > -Original Message- > From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On > Behalf Of Pavel > Fedin > Sent: Monday, November 02, 2015 10:19 AM > To: 'David Miller' > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > steve.glendinn...@shawell.net > Subject: RE: [PATCH] net: smsc911x: Reset PHY during initialization > > Hello! > > > > On certain hardware after software reboot the chip may get stuck and fail > > > to reinitialize during reset. This can be fixed by ensuring that PHY is > > > reset too. > > > > > > Old PHY resetting method required operational MDIO interface, therefore > > > the chip should have been already set up. In order to be able to function > > > during probe, it is changed to use PMT_CTRL register. > > > > > > The problem could be observed on SMDK5410 board. > > > > > > Signed-off-by: Pavel Fedin > > > > I'm pretty sure this is going to break the PHY loopback test. > > It's not (at least in normal situation), because first we do the test, and > only then, if it > fails, we reset the PHY. So, during > normal operation of the driver, when loopback test succeeds at first attempt, > we never attempt > to reset the PHY. And, by the way, at > least on my board this PHY reset did nothing useful, because after it > loopback test still > failed, all 100 times. > And, we don't use PHY reset anywhere outside of loopback test. > > > This is such a tricky and fragile area to get right, therefore I > > want you to specifically guard any change in how PHY reset is > > done with tests against the specific chips that have the problem. > > Well, i could do one (or some combination) of the following, if you really > want to: > a) Leave PHY reset inside loopback test as it is, but add a second routine > and call it only > before smsc911x_soft_reset(). > b) Reset PHY only conditionally, using the following algorithm: > if smsc911x_soft_reset() { > /* NIC reset failed, kick the PHY and retry */ > smsc911x_phy_reset() > if (smsc_911x_soft_reset()) > return -ENODEV; > } > c) Do extra PHY reset only if some hint in device tree is specified (or the > machine is known > to have the problem) > > But, i dislike approach (a) for code duplication, because datasheet says > that both reset > methods are equivalent; i dislike approach > (b) for actually being a hack; and i dislike (c) for adding extra stuff which > seems to be not > necessary. After all, what is wrong > with just extra PHY reset before attempting to program anything? By the way, > i test my patch > with both software reboot and hardware > reboot, and even powercycle. Everything is quite stable. > Well, it's up to you to decide. But i'd like to get the fix upstreamed > somehow because from > time to time we use these boards for > tests, and it's quite annoiying to be unable to reboot them properly. > > > Furthermore, I want you to test whether this has any negative > > effects on the PHY loopback test. > > I did. I never disabled loopback test, so i actually discovered a problem > (even two) with it. > First, the problem was chip reset > timeout. By increasing it to 300ms this problem seemed to be fixed, but > loopback test started > to fail. This was how i found and > fixed crash with missing phy_disconnect(). I examined the code and discovered > that upon > loopback test failure we reset the PHY and > retry. But in my case this PHY reset never did the right thing, and all > loopback attempts > (10x10 IIRC) were failing. > Some comments in the code gave me a clue that the main NIC can misbehave on > reset if e.g. PHY > is in powerdown state. I printed > value of its control register and discovered that it was not the case. But > then i discovered > that we actually never try to reset the > PHY before doing anything. Also, while studying the datasheed i discovered > that there is a > shorthand for resetting PHY without MDIO > bus set up, but our driver doesn't use it, preferring MDIO method instead, > which already > requires operational NIC. > So, i suggested that PHY is just not being reset when board is rebooted by > software. I wrote > the patch and it worked. I verified it > by reverting my previous NIC reset timeout increase, and it continued to > work. After this > loopback test on my board passes at first > attempt. > > Kind regards, > Pavel Fedin > Expert Engineer > Samsung Electronics Research center Russia > > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.
Re: [PATCH RFC 5/6] net/faraday: Enable NCSI interface
On Tue, Nov 10, 2015 at 07:28:24AM +1100, Benjamin Herrenschmidt wrote: >On Mon, 2015-11-09 at 18:30 +1100, Gavin Shan wrote: >> >> Yeah, It's something that I hilighed in the cover letter. I was >> thinking we might need a better way to enable Tx/Rx before the >> interrupt is up, but couldn't figure out one way. So I need some >> advice here. > >No, that's not right. For Tx/Rx to work the interface must be opened, >there is simply no way around that and that's perfectly fine. The >situation with an MDIO PHY is the same, most drivers can't talk to >their PHY until the interface has been opened because the chip is >basically powered down otherwise. > >So we require the interface to be opened to talk, so far so good, >the NC-SI stack doesn't even need to open it itself, it's acceptable >to require userspace to do it. IE. Userspace will chose what interface >to use, open it (for DHCP etc... or whatever other reason) and *that* >will then trigger the NC-SI negociation. > Yes, NCSI is smiliar to PHY to some extent. However, PHY's negotiation is purly electrical procedure, no packets received from MAC for it. We have the same situation when the NCSI/PHY is going to be brought down. At the beginning, the NCSI packets can be received and transmitted after the interface is opened. Before NCSI negotiation is done, no other packets can be received and transmitted. For the Rx path (for other packets), the NCSI link isn't enabled when NCSI negotiation isn't finished. There might have lots of egress packets whose IP addresses can't be resolved to MAC address as ARP resolution doesn't work before NCSI negotiation is done. So there is a weird window: interface is up, but no packets (except NCSI packets) can be received or transmitted. When the interface is brought down, for example by "ifconfig eth0 down", The NCSI interface needs to be teared down by transmitting and receiving NCSI commands and responses. Similiarly, it introduces another weird window: interface is down, but NCSI packets still can be transmitted and received. >Cheers, >Ben. > Thanks, Gavin -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] drivers: net: cpsw: add support for fixed-links.
From: Daniel Trautmann Date: Mon, 9 Nov 2015 20:24:14 +0100 > Add support for fixed-links in configurations without PHY. > (e.g. connection to a switch, SGMII point to point, SFPs) > > Check: Documentation/devicetree/bindings/net/fixed-link.txt. > > Signed-off-by: Daniel Trautmann This patch doesn't apply cleanly at all. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] drivers: net: cpsw: add support for fixed-links.
From: Daniel Trautmann Date: Mon, 9 Nov 2015 20:24:14 +0100 > @@ -2039,19 +2039,35 @@ static int cpsw_probe_dt(struct cpsw_priv *priv, > priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0); > parp = of_get_property(slave_node, "phy_id", &lenp); > if ((parp == NULL) || (lenp != (sizeof(void *) * 2))) { > - dev_err(&pdev->dev, "Missing slave[%d] phy_id > property\n", i); I know this has nothing to do with your patch, but that length check is completely bogus. It should be "sizeof(__be32) * 2" not "sizeof(void *) * 2". -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Deadlock between bind and splice
On Tue, Nov 10, 2015 at 02:38:54AM +, Al Viro wrote: > On Fri, Nov 06, 2015 at 07:42:15AM -0800, Eric Dumazet wrote: > > > Thank you for this report. > > > > pipe is part of fs, not net ;) > > AF_UNIX bind() vs. socketpair() interplay, OTOH... FWIW, BSD folks unlock the socket for the duration of mknod - mark it as "somebody's trying to bind it" to avoid the fun with racing double bind(), but that's about it. Tempting, to be honest... BTW, why does unix_autobind() do allocation under ->readlock? The allocation will be normally used - that if (u->addr) return; part is just dealing with an unlikely race, as far as I can see... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] net: fix a race in dst_release()
From: Eric Dumazet Date: Mon, 09 Nov 2015 17:51:23 -0800 > From: Eric Dumazet > > Only cpu seeing dst refcount going to 0 can safely > dereference dst->flags. > > Otherwise an other cpu might already have freed the dst. > > Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst") > Reported-by: Greg Thelen > Signed-off-by: Eric Dumazet Ouch... Applied and queued up for -stable, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: mvneta: Fix memory use after free.
From: Justin Maggard Date: Mon, 9 Nov 2015 17:21:05 -0800 > After changing an interface's MTU, then bringing the interface down and > back up again, I immediately saw tons of kernel messages like below. > The reason for this bad behavior is mvneta_rxq_drop_pkts(), which calls > dma_unmap_single() on already-freed memory. So we need to switch the > order of those two operations. ... > Signed-off-by: Justin Maggard Applied, thanks Justin. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Deadlock between bind and splice
On Fri, Nov 06, 2015 at 07:42:15AM -0800, Eric Dumazet wrote: > Thank you for this report. > > pipe is part of fs, not net ;) AF_UNIX bind() vs. socketpair() interplay, OTOH... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
Hallo, On Tue, Nov 10, 2015, at 02:11, Eric Dumazet wrote: > On Thu, 2015-05-21 at 17:00 +0200, Hannes Frederic Sowa wrote: > > > + > > +static ssize_t skb_unix_socket_splice(struct sock *sk, > > + struct pipe_inode_info *pipe, > > + struct splice_pipe_desc *spd) > > +{ > > + int ret; > > + struct unix_sock *u = unix_sk(sk); > > + > > + mutex_unlock(&u->readlock); > > + ret = splice_to_pipe(pipe, spd); > > + mutex_lock(&u->readlock); > > + > > + return ret; > > +} > > + > > Hi Hannes > > Since we release u->readlock, what prevents another thread to read() the > same af_unix socket and consume the skb while we splice it ? > > TCP stack has special code to take care of this possibility. Hm, I do see the problem, yes. I left a window open for races happening. Without having a closer look what spontaneously comes to my mind: increment skb->users before splice_to_pipe, re-peek the socket queue after the splice, recheck if we have the same skb pointer in the front, and then consume_skb the skb properly and indicating a short read. Let me try tomorrow if that works, but I fear we can read data multiple times with this fix, so I don't know if this is a proper fix. I have to check the TCP code closer. Thanks for the hint! Thanks for the report, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Deadlock between bind and splice
On Fri, Nov 06, 2015 at 01:58:27PM +0100, Dmitry Vyukov wrote: > Hello, > > I am on revision d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5) and > seeing the following lockdep reports. I don't have exact reproducer > program as it is caused by several independent programs (state > accumulated in kernel across invocations); if the report is not enough > I can try to cook a reproducer. > > Thanks. > > [ INFO: possible circular locking dependency detected ] > 4.3.0+ #30 Not tainted > --- > a.out/9972 is trying to acquire lock: > (&pipe->mutex/1){+.+.+.}, at: [< inline >] pipe_lock_nested > fs/pipe.c:59 > (&pipe->mutex/1){+.+.+.}, at: [] > pipe_lock+0x56/0x70 fs/pipe.c:67 > > but task is already holding lock: > (sb_writers#5){.+.+.+}, at: [] > __sb_start_write+0xec/0x130 fs/super.c:1198 > > which lock already depends on the new lock. > > the existing dependency chain (in reverse order) is: > > -> #2 [AF_UNIX bind() does sb_start_write() while holding unix_sock locked] > -> #1 [splice() to AF_UNIX socket is trying to lock unix_sock while holding the pipe locked] > -> #0 (&pipe->mutex/1){+.+.+.}: [splice() to regular file is locking the pipe under sb_start_write()] Cute... The first impression is that in #1 you need the socket to be connected, or it won't even reach that attempt to lock unix_sock, while bind() on the same sucker ought to bugger off before getting around to touching the filesystem, so it looks like a false positive, but... socketpair() yields a connected socket and AFAICS there's nothing in unix_bind() to bugger off on such. So the scenario ought to be: (a while ago) A: socketpair() B: splice() from a pipe to /mnt/regular_file does sb_start_write() on /mnt C: try to freeze /mnt wait for B to finish with /mnt A: bind() try to bind our socket to /mnt/new_socket_name lock our socket, see it not bound yet decide that it needs to create something in /mnt try to do sb_start_write() on /mnt, block (it's waiting for C). D: splice() from the same pipe to our socket lock the pipe, see that socket is connected try to lock the socket, block waiting for A B: get around to actually feeding a chunk from pipe to file, try to lock the pipe. Deadlock. Locking the socket is interruptible, though, so killing D will untangle that mess - it's not quite a hopeless deadlock. Deadlock or not, should bind() actually work on connected sockets? AFAICS, socketpair() is the only way for it to happen... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] net: fix a race in dst_release()
From: Eric Dumazet Only cpu seeing dst refcount going to 0 can safely dereference dst->flags. Otherwise an other cpu might already have freed the dst. Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst") Reported-by: Greg Thelen Signed-off-by: Eric Dumazet --- net/core/dst.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/dst.c b/net/core/dst.c index 2a1818065e12..e6dc77252fe9 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -306,7 +306,7 @@ void dst_release(struct dst_entry *dst) if (unlikely(newrefcnt < 0)) net_warn_ratelimited("%s: dst:%p refcnt:%d\n", __func__, dst, newrefcnt); - if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) + if (!newrefcnt && unlikely(dst->flags & DST_NOCACHE)) call_rcu(&dst->rcu_head, dst_destroy_rcu); } } -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: mvneta: Fix memory use after free.
After changing an interface's MTU, then bringing the interface down and back up again, I immediately saw tons of kernel messages like below. The reason for this bad behavior is mvneta_rxq_drop_pkts(), which calls dma_unmap_single() on already-freed memory. So we need to switch the order of those two operations. [ 152.388518] BUG: Bad page state in process ifconfig pfn:1b518 [ 152.388526] page:dff3dbc0 count:0 mapcount:0 mapping: (null) index:0x0 [ 152.395178] flags: 0x200(arch_1) [ 152.398441] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set [ 152.398446] bad because of flags: [ 152.398450] flags: 0x200(arch_1) [ 152.401716] Modules linked in: [ 152.401728] CPU: 0 PID: 1453 Comm: ifconfig Tainted: PB O 4.1.12.armada.1 #1 [ 152.401733] Hardware name: Marvell Armada 370/XP (Device Tree) [ 152.401749] [] (unwind_backtrace) from [] (show_stack+0x10/0x14) [ 152.401762] [] (show_stack) from [] (dump_stack+0x74/0x90) [ 152.401772] [] (dump_stack) from [] (bad_page+0xc4/0x124) [ 152.401783] [] (bad_page) from [] (get_page_from_freelist+0x4e4/0x644) [ 152.401794] [] (get_page_from_freelist) from [] (__alloc_pages_nodemask+0x148/0x784) [ 152.401805] [] (__alloc_pages_nodemask) from [] (kmalloc_order+0x10/0x20) [ 152.401818] [] (kmalloc_order) from [] (mvneta_rx_refill+0xc4/0xe8) [ 152.401830] [] (mvneta_rx_refill) from [] (mvneta_setup_rxqs+0x298/0x39c) [ 152.401842] [] (mvneta_setup_rxqs) from [] (mvneta_open+0x3c/0x150) [ 152.401853] [] (mvneta_open) from [] (__dev_open+0xac/0x124) [ 152.401864] [] (__dev_open) from [] (__dev_change_flags+0x8c/0x148) [ 152.401875] [] (__dev_change_flags) from [] (dev_change_flags+0x18/0x48) [ 152.401886] [] (dev_change_flags) from [] (devinet_ioctl+0x620/0x6d0) [ 152.401897] [] (devinet_ioctl) from [] (sock_ioctl+0x64/0x288) [ 152.401908] [] (sock_ioctl) from [] (do_vfs_ioctl+0x78/0x608) [ 152.401918] [] (do_vfs_ioctl) from [] (SyS_ioctl+0x64/0x74) [ 152.401930] [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x3c) Signed-off-by: Justin Maggard --- drivers/net/ethernet/marvell/mvneta.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index a47496a..e84c7f2 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -1493,9 +1493,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp, struct mvneta_rx_desc *rx_desc = rxq->descs + i; void *data = (void *)rx_desc->buf_cookie; - mvneta_frag_free(pp, data); dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr, MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE); + mvneta_frag_free(pp, data); } if (rx_done) -- 2.6.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
On Thu, 2015-05-21 at 17:00 +0200, Hannes Frederic Sowa wrote: > + > +static ssize_t skb_unix_socket_splice(struct sock *sk, > + struct pipe_inode_info *pipe, > + struct splice_pipe_desc *spd) > +{ > + int ret; > + struct unix_sock *u = unix_sk(sk); > + > + mutex_unlock(&u->readlock); > + ret = splice_to_pipe(pipe, spd); > + mutex_lock(&u->readlock); > + > + return ret; > +} > + Hi Hannes Since we release u->readlock, what prevents another thread to read() the same af_unix socket and consume the skb while we splice it ? TCP stack has special code to take care of this possibility. tcp_read_sock() : used = recv_actor(desc, skb, offset, len); if (used <= 0) { if (!copied) copied = used; break; } else if (used <= len) { seq += used; copied += used; offset += used; } /* If recv_actor drops the lock (e.g. TCP splice * receive) the skb pointer might be invalid when * getting here: tcp_collapse might have deleted it * while aggregating skbs from the socket queue. */ skb = tcp_recv_skb(sk, seq - 1, &offset); if (!skb) break; /* TCP coalescing might have appended data to the skb. * Try to splice more frags */ if (offset + 1 != skb->len) continue; -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/4] e1000e: Remove unreachable code
msi-x interrupts are not shared so there's no need to check if the interrupt was really from this adapter. Signed-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/netdev.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 0a854a4..a228167 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1907,12 +1907,6 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) struct e1000_hw *hw = &adapter->hw; u32 icr = er32(ICR); - if (!(icr & E1000_ICR_INT_ASSERTED)) { - if (!test_bit(__E1000_DOWN, &adapter->state)) - ew32(IMS, E1000_IMS_OTHER); - return IRQ_NONE; - } - if (icr & adapter->eiac_mask) ew32(ICS, (icr & adapter->eiac_mask)); -- 2.6.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/4] e1000e msi-x fixes
Hi, For this series: Benjamin Poirier (4): e1000e: Remove unreachable code e1000e: Do not read icr in Other interrupt e1000e: Do not write lsc to ics in msi-x mode e1000e: Fix msi-x interrupt automask drivers/net/ethernet/intel/e1000e/defines.h | 3 +- drivers/net/ethernet/intel/e1000e/netdev.c | 66 - 2 files changed, 30 insertions(+), 39 deletions(-) Changes in v3: Preserve LSC in IMS, LSC events are not delivered otherwise. Disable CTRL_EXT.IAME to prevent IMC write on ICR read from external program. Changes in v2: Address review comments from Alexander Duyck: extend cleanup of Other interrupt handler and use tx_ring->ims_val. The first three patches cleanup handling of Other interrupts and the last patch fixes tx and rx interrupts. Please consider reading the description for that patch before proceeding. I believe that the following simple tracing statements are helpful in detecting the problem fixed by the last patch. diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index a09d1e4..29b8c6e 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1942,6 +1942,9 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused irq, void *data) struct net_device *netdev = data; struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_ring *rx_ring = adapter->rx_ring; + struct e1000_hw *hw = &adapter->hw; + + trace_printk("%s: rxq0 irq ims 0x%08x\n", netdev->name, er32(IMS)); /* Write the ITR value calculated at the end of the * previous interrupt. @@ -1956,6 +1959,7 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused irq, void *data) adapter->total_rx_bytes = 0; adapter->total_rx_packets = 0; __napi_schedule(&adapter->napi); + trace_printk("%s: scheduling napi\n", netdev->name); } return IRQ_HANDLED; } @@ -2663,6 +2667,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight) struct net_device *poll_dev = adapter->netdev; int tx_cleaned = 1, work_done = 0; + trace_printk("%s: poll starting ims 0x%08x\n", poll_dev->name, +er32(IMS)); adapter = netdev_priv(poll_dev); if (!adapter->msix_entries || @@ -2680,6 +2686,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight) e1000_set_itr(adapter); napi_complete_done(napi, work_done); if (!test_bit(__E1000_DOWN, &adapter->state)) { + trace_printk("%s: will enable rxq0 irq\n", +poll_dev->name); if (adapter->msix_entries) ew32(IMS, adapter->rx_ring->ims_val); else 8< With that patch but without the patches in this series we can see that rx irqs occur at unexpected times: -0 [000] .Ns. 1986.887517: e1000e_poll: eth1: will enable rxq0 irq -0 [000] d.h. 1986.896654: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154 -0 [000] d.h. 1986.896657: e1000_intr_msix_rx: eth1: scheduling napi -0 [000] d.H. 1986.896662: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154 -0 [000] ..s. 1986.896667: e1000e_poll: eth1: poll starting ims 0x0154 Warning: many interrupts (2) before napi -0 [000] ..s. 1986.896685: e1000e_poll: eth1: will enable rxq0 irq -0 [000] d.h. 1990.688870: e1000_intr_msix_rx: eth1: scheduling napi -0 [000] ..s. 1990.688875: e1000e_poll: eth1: poll starting ims 0x0154 -0 [000] dNH. 1990.688913: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154 Warning: interrupt inside napi -0 [000] .Ns. 1990.688916: e1000e_poll: eth1: will enable rxq0 irq -0 [000] d.h. 1990.729688: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154 Here's a typical sequence after applying the patches in this series. Notice that ims is changed. Another printk at the end of e1000e_poll would show it to be 0x0154. -0 [000] d.h. 23547.977917: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0144 -0 [000] d.h. 23547.977922: e1000_intr_msix_rx: eth1: scheduling napi -0 [000] ..s. 23547.977928: e1000e_poll: eth1: poll starting ims 0x0144 -0 [000] ..s. 23547.977961: e1000e_poll: eth1: will enable rxq0 irq Finally, here's the script I used to generate the warnings above: #!/usr/bin/python3 import sys import re import pprint class NaE(Exception): "Not an Event" pass class Event: def __init__(self, line): # sample events: # -0 [000] d.h. 2025.256536: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154 # -0 [000] d.h. 2025.25
[PATCH v3 3/4] e1000e: Do not write lsc to ics in msi-x mode
In msi-x mode, there is no handler for the lsc interrupt so there is no point in writing that to ics now that we always assume Other interrupts are caused by lsc. Reviewed-by: Jasna Hodzic Signed-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/defines.h | 3 ++- drivers/net/ethernet/intel/e1000e/netdev.c | 27 --- 2 files changed, 18 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h index 133d407..f7c7804 100644 --- a/drivers/net/ethernet/intel/e1000e/defines.h +++ b/drivers/net/ethernet/intel/e1000e/defines.h @@ -441,12 +441,13 @@ #define E1000_IMS_RXQ1 E1000_ICR_RXQ1 /* Rx Queue 1 Interrupt */ #define E1000_IMS_TXQ0 E1000_ICR_TXQ0 /* Tx Queue 0 Interrupt */ #define E1000_IMS_TXQ1 E1000_ICR_TXQ1 /* Tx Queue 1 Interrupt */ -#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupts */ +#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupt */ /* Interrupt Cause Set */ #define E1000_ICS_LSC E1000_ICR_LSC /* Link Status Change */ #define E1000_ICS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */ #define E1000_ICS_RXDMT0E1000_ICR_RXDMT0/* Rx desc min. threshold */ +#define E1000_ICS_OTHER E1000_ICR_OTHER /* Other Interrupt */ /* Transmit Descriptor Control */ #define E1000_TXDCTL_PTHRESH 0x003F /* TXDCTL Prefetch Threshold */ diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index a73e323..ed7cc8e 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -4130,10 +4130,23 @@ void e1000e_reset(struct e1000_adapter *adapter) } -int e1000e_up(struct e1000_adapter *adapter) +/** + * e1000e_trigger_lsc - trigger an lsc interrupt + * + * Fire a link status change interrupt to start the watchdog. + **/ +static void e1000e_trigger_lsc(struct e1000_adapter *adapter) { struct e1000_hw *hw = &adapter->hw; + if (adapter->msix_entries) + ew32(ICS, E1000_ICS_OTHER); + else + ew32(ICS, E1000_ICS_LSC); +} + +int e1000e_up(struct e1000_adapter *adapter) +{ /* hardware has been reset, we need to reload some things */ e1000_configure(adapter); @@ -4145,11 +4158,7 @@ int e1000e_up(struct e1000_adapter *adapter) netif_start_queue(adapter->netdev); - /* fire a link change interrupt to start the watchdog */ - if (adapter->msix_entries) - ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER); - else - ew32(ICS, E1000_ICS_LSC); + e1000e_trigger_lsc(adapter); return 0; } @@ -4576,11 +4585,7 @@ static int e1000_open(struct net_device *netdev) hw->mac.get_link_status = true; pm_runtime_put(&pdev->dev); - /* fire a link status change interrupt to start the watchdog */ - if (adapter->msix_entries) - ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER); - else - ew32(ICS, E1000_ICS_LSC); + e1000e_trigger_lsc(adapter); return 0; -- 2.6.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/4] e1000e: Do not read icr in Other interrupt
removes the icr read in the other interrupt handler, uses eiac to autoclear the Other bit from icr and ims. This allows us to avoid interference with rx and tx interrupts in the Other interrupt handler. The information read from icr is not needed. IMS is configured such that the only interrupt cause that can trigger the Other interrupt is Link Status Change. Signed-off-by: Benjamin Poirier --- I noticed a 8-16% improvement in netperf rr tests after applying this patch. This is a little surprising since this patch touches the handling of Other interrupts, which do not occur during such a test. Some profiling was not very insightful but the improvement seems related to writing Other to EIAC. --- drivers/net/ethernet/intel/e1000e/netdev.c | 22 +++--- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index a228167..a73e323 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1905,24 +1905,15 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) struct net_device *netdev = data; struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_hw *hw = &adapter->hw; - u32 icr = er32(ICR); - if (icr & adapter->eiac_mask) - ew32(ICS, (icr & adapter->eiac_mask)); + hw->mac.get_link_status = true; - if (icr & E1000_ICR_OTHER) { - if (!(icr & E1000_ICR_LSC)) - goto no_link_interrupt; - hw->mac.get_link_status = true; - /* guard against interrupt when we're going down */ - if (!test_bit(__E1000_DOWN, &adapter->state)) - mod_timer(&adapter->watchdog_timer, jiffies + 1); + /* guard against interrupt when we're going down */ + if (!test_bit(__E1000_DOWN, &adapter->state)) { + mod_timer(&adapter->watchdog_timer, jiffies + 1); + ew32(IMS, E1000_IMS_OTHER); } -no_link_interrupt: - if (!test_bit(__E1000_DOWN, &adapter->state)) - ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER); - return IRQ_HANDLED; } @@ -2019,6 +2010,7 @@ static void e1000_configure_msix(struct e1000_adapter *adapter) hw->hw_addr + E1000_EITR_82574(vector)); else writel(1, hw->hw_addr + E1000_EITR_82574(vector)); + adapter->eiac_mask |= E1000_IMS_OTHER; /* Cause Tx interrupts on every write back */ ivar |= (1 << 31); @@ -2247,7 +2239,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter) if (adapter->msix_entries) { ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574); - ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC); + ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC); } else if ((hw->mac.type == e1000_pch_lpt) || (hw->mac.type == e1000_pch_spt)) { ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER); -- 2.6.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/4] e1000e: Fix msi-x interrupt automask
Since the introduction of 82574 support in e1000e, the driver has worked on the assumption that msi-x interrupt generation is automatically disabled after each irq. As it turns out, this is not the case. Currently, rx interrupts can fire multiple times before and during napi processing. This can be a problem for users because frames that arrive in a certain window (after adapter->clean_rx() but before napi_complete_done() has cleared NAPI_STATE_SCHED) generate an interrupt which does not lead to napi_schedule(). These frames sit in the rx queue until another frame arrives (a tcp retransmit for example). While the EIAC and CTRL_EXT registers are properly configured for irq automask, the modification of IAM in e1000_configure_msix() is what prevents automask from working as intended. This patch removes that erroneous write and fixes interrupt rearming for tx interrupts. It also clears IAME from CTRL_EXT. This is not strictly necessary for operation of the driver but it is to avoid disruption from potential programs that access the registers directly, like `ethregs -c`. Reported-by: Frank Steiner Signed-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/netdev.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index ed7cc8e..2a22ed7 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1931,6 +1931,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused irq, void *data) /* Ring was not completely cleaned, so fire another interrupt */ ew32(ICS, tx_ring->ims_val); + if (!test_bit(__E1000_DOWN, &adapter->state)) + ew32(IMS, adapter->tx_ring->ims_val); + return IRQ_HANDLED; } @@ -2018,12 +2021,8 @@ static void e1000_configure_msix(struct e1000_adapter *adapter) ew32(IVAR, ivar); /* enable MSI-X PBA support */ - ctrl_ext = er32(CTRL_EXT); - ctrl_ext |= E1000_CTRL_EXT_PBA_CLR; - - /* Auto-Mask Other interrupts upon ICR read */ - ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER); - ctrl_ext |= E1000_CTRL_EXT_EIAME; + ctrl_ext = er32(CTRL_EXT) & ~E1000_CTRL_EXT_IAME; + ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME; ew32(CTRL_EXT, ctrl_ext); e1e_flush(); } -- 2.6.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next v2 0/8] openvswitch: NAT support.
[Resending as plain text] > On Nov 9, 2015, at 5:31 AM, Patrick McHardy wrote: > > On 06.11, Jarno Rajahalme wrote: >> This series adds NAT support to openvswitch kernel module. A few >> changes are needed to the netfilter code to facilitate this (patches >> 1-3/8). Patches 4-7 make the openvswitch kernel module ready for the >> patch 8 that adds the NAT support for calling into netfilter NAT code >> from the openvswitch conntrack action. > > I'm missing some high level description, especially how it is invoked, how > it makes sure expectations of the NAT code about its invocation are met > (it is my understanding that OVS simply invokes this based on actions > specified by the user) and how it interacts with the remaining netfilter > features. > The corresponding OVS userspace patches contain the new test cases for the NAT features (http://openvswitch.org/pipermail/dev/2015-November/061920.html) in tests/system-traffic.at. I’ll walk through two of them below. Test case: conntrack - simple SNAT In these tests ports 1 and 2 are in different namespaces. The flow table below allows all IPv4 traffic between port 1 to port 2, but IP connections from port 1 to port 2 are source NATted: in_port=1,ip,action=ct(commit,zone=1,nat(src=10.1.1.240-10.1.1.255)),2 in_port=2,ct_state=-trk,ip,action=ct(table=0,zone=1,nat) in_port=2,ct_state=+trk,ct_zone=1,ip,action=1 This flow table matches all IPv4 traffic from port 1, runs them through conntrack in zone 1 and NATs them. The NAT is initialized to do source IP mapping to the given range for the first packet of each connection, after which the new connection is committed. For further packets of already tracked connections NAT is done according to the connection state and the commit is a no-op. Each packet that is not flagged as a drop by the CT action is forwarded to port 2. The CT action does an implicit fragmentation reassembly, so that only complete packets are run through conntrack. Reassembled packets are re-fragmented on output. The IPv4 traffic coming from port 2 is first matched for the non-tracked state (-trk), which means that the packet has not been through a CT action yet. Such traffic is run trough the conntrack in zone 1 and all packets associated with a NATted connection are NATted also in the return direction. After the packet has been through conntrack it is recirculated back to table 0 (which is the default table, so all the rules above are in table 0). The CT action changes the ‘trk’ flag to being set, so the packets after recirculation match the third rule (+trk), and the packet is output on port 1. Since also the ct_zone is matched, only packets that were actually tracked by conntrack can match the 3rd rule and output to 1. I’m skipped the rules for ARP handling in this walkthrough, but the test case has rules to match on the ARP request on the NATted address and reply to it. The above is verified with a HTTP request and a subsequent conntrack entry listing in the test case: dnl HTTP requests from p0->p1 should work fine. NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid]) NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 5 -T 1 --retry-connrefused -v -o wget0.log]) AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2) | sed -e 's/dst=10.1.1.2[[45]][[0-9]]/dst=10.1.1.2XX/'], [0], [dnl TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport= dport= src=10.1.1.2 dst=10.1.1.2XX sport= dport= [[ASSURED]] mark=0 zone=1 use=1 ]) As a second example, I’ll walk through a test case of NAT with FTP on IPv6: As before, ports 1 (p0) and 2 (p1) reside in different namespaces (‘at_ns0’ and ‘at_ns1’, respectively). A static neighbor cache entry for the NATted address is created in at_ns1. In a more realistic scenario a controller would need to implement a ND proxy instead: ADD_VETH(p0, at_ns0, br0, "fc00::1/96") NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address 80:88:88:88:88:88]) ADD_VETH(p1, at_ns1, br0, "fc00::2/96") dnl Would be nice if NAT could translate neighbor discovery messages, too. NS_CHECK_EXEC([at_ns1], [ip -6 neigh add fc00::240 lladdr 80:88:88:88:88:88 dev p1]) The OpenFlow rules are split into two tables (table 0 and table 1). Table 0 tracks all IPv6 traffic and drops all other traffic. Packets are recirculated to table 1 after conntrack. Existing connections are NATted before recirculation: table=0 priority=10 ip6, action=ct(nat,table=1) table=0 priority=0 action=drop Table 1: FTP control connections from the “private” address to port 21 are output to port 1 after being tracked, NATted, and committed (== confirmed). table=1 in_port=1 ct_state=+new tcp6 ipv6_src=fc00::1 tp_dst=21 action=ct(alg=ftp,commit,nat(src=fc00::240)),2 The matches on the source/dest IP address below are not really needed, but in the test case they are to make sure the NATting actually happens. Related TCP connections to the reverse direction are NATt
Re: Fw: [Bug 106711] VXLAN: RTNL assertion failed at net/core/net_namespace.c:187
On Mon, Nov 9, 2015 at 10:39 AM, Stephen Hemminger wrote: > > > Begin forwarded message: > > Date: Mon, 9 Nov 2015 09:42:52 + > From: "bugzilla-dae...@bugzilla.kernel.org" > > To: "shemmin...@linux-foundation.org" > Subject: [Bug 106711] VXLAN: RTNL assertion failed at > net/core/net_namespace.c:187 > > > https://bugzilla.kernel.org/show_bug.cgi?id=106711 > > Kari Hautio changed: > >What|Removed |Added > > CC||khau...@gmail.com > > --- Comment #1 from Kari Hautio --- > Also affects 4.1.12 > > [ 34.412104] RTNL: assertion failed at > /home/kernel/COD/linux/net/core/net_namespace.c (187) > [ 34.412120] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.12-040112-generic > #201510262131 > [ 34.412123] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.7.5-20140709_153950- 04/01/2014 > [ 34.412126] 880035dd2380 88013fd838f8 817e367c > > [ 34.412131] 88013fd83928 816cfbe5 > 880035526200 > [ 34.412135] 8800b9b3fe00 88013a3dc9c0 88013a3dccc0 > 88013fd83938 > [ 34.412140] Call Trace: > [ 34.412142][] dump_stack+0x45/0x57 > [ 34.412159] [] __peernet2id+0xa5/0xb0 > [ 34.412163] [] peernet2id+0x18/0x30 As I replied in BZ, this is probably fixed by: commit 95f38411df055a0ecefe3a3d119d98241087d5ca Author: Nicolas Dichtel Date: Thu May 7 11:02:51 2015 +0200 netns: use a spin_lock to protect nsid management Thanks. > [ 34.412168] [] vxlan_fdb_info+0xfa/0x360 [vxlan] > [ 34.412173] [] ? __alloc_skb+0x8c/0x1f0 > [ 34.412178] [] vxlan_fdb_notify+0x72/0x100 [vxlan] > [ 34.412182] [] vxlan_fdb_create+0x14a/0x3a0 [vxlan] > [ 34.412187] [] vxlan_snoop+0x1cb/0x1d0 [vxlan] > [ 34.412192] [] vxlan_rcv+0x3a2/0x620 [vxlan] > [ 34.412196] [] vxlan_udp_encap_recv+0x156/0x420 [vxlan] > [ 34.412201] [] ? vxlan_open+0x1e0/0x1e0 [vxlan] > [ 34.412207] [] udp_queue_rcv_skb+0x1f3/0x4f0 > [ 34.412211] [] __udp4_lib_rcv+0x127/0x930 > [ 34.412215] [] udp_rcv+0x1a/0x20 > [ 34.412227] [] ip_local_deliver_finish+0xae/0x230 > [ 34.412231] [] ip_local_deliver+0x46/0xa0 > [ 34.412235] [] ip_rcv_finish+0x81/0x370 > [ 34.412239] [] ip_rcv+0x2a2/0x3c0 > [ 34.412244] [] __netif_receive_skb_core+0x7e3/0xa10 > [ 34.412248] [] __netif_receive_skb+0x18/0x60 > [ 34.412252] [] process_backlog+0xb2/0x150 > [ 34.412256] [] net_rx_action+0x1fa/0x320 > [ 34.412265] [] __do_softirq+0x103/0x280 > [ 34.412269] [] irq_exit+0xad/0xb0 > [ 34.412274] [] smp_apic_timer_interrupt+0x46/0x60 > [ 34.412279] [] apic_timer_interrupt+0x6e/0x80 > [ 34.412280][] ? native_safe_halt+0x6/0x10 > [ 34.412290] [] ? rcu_eqs_enter+0xa3/0xb0 > [ 34.412297] [] default_idle+0x1e/0xc0 > [ 34.412315] [] arch_cpu_idle+0xf/0x20 > [ 34.412325] [] cpu_startup_entry+0x367/0x3e0 > [ 34.412332] [] ? clockevents_register_device+0x112/0x1e0 > [ 34.412338] [] start_secondary+0x17e/0x1a0 > > -- > You are receiving this mail because: > You are the assignee for the bug. > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
> ACK for the ACPI changes (and CCing Tony and Boris for the heads-up as they > are way more famailiar with the APEI code than I am). Sure. If kvfree() really is smart enough to figure it out then there it no point in the if (blah) kfree() else vfree(). The drivers/acpi/apei/erst.c code isn't doing anything subtle or magic here. -Tony
Re: linux-next network throughput performance regression
On Mon, 2015-11-09 at 20:23 +, Simon Xiao wrote: > Thanks Eric to provide the data. I am looping Tom (as I am looking into his > recent patches) and Olaf (from Suse). > > So, if I understand it correctly, you are running netperf with single > TCP connection, and you got ~26Gbps initially and got ~30Gbps after > turning the tx-usecs and tx-frames. > > Do you have a baseline on your environment for the best/max/or peak > throughput? The peak on my lab pair is about 34Gbits, usually I get this if I pin the receiving thread on a cpu, otherwise process scheduler can really hurt too much. lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H lpaa24 -l 20 -Cc -T ,1 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : cpu bind tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200 tcpi_rtt 101 tcpi_rttvar 15 tcpi_snd_ssthresh 289 tpci_snd_cwnd 289 tcpi_reordering 3 tcpi_total_retrans 453 Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Send Recv SendRecv Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.10^6bits/s % S % S us/KB us/KB 87380 16384 1638420.00 33975.99 1.27 3.36 0.147 0.389 Not too bad, I don't recall reaching more than that ever. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/09/2015 09:40 AM, Rainer Weikusat wrote: > An AF_UNIX datagram socket being the client in an n:1 association with > some server socket is only allowed to send messages to the server if the > receive queue of this socket contains at most sk_max_ack_backlog > datagrams. This implies that prospective writers might be forced to go > to sleep despite none of the message presently enqueued on the server > receive queue were sent by them. In order to ensure that these will be > woken up once space becomes again available, the present unix_dgram_poll > routine does a second sock_poll_wait call with the peer_wait wait queue > of the server socket as queue argument (unix_dgram_recvmsg does a wake > up on this queue after a datagram was received). This is inherently > problematic because the server socket is only guaranteed to remain alive > for as long as the client still holds a reference to it. In case the > connection is dissolved via connect or by the dead peer detection logic > in unix_dgram_sendmsg, the server socket may be freed despite "the > polling mechanism" (in particular, epoll) still has a pointer to the > corresponding peer_wait queue. There's no way to forcibly deregister a > wait queue with epoll. > > Based on an idea by Jason Baron, the patch below changes the code such > that a wait_queue_t belonging to the client socket is enqueued on the > peer_wait queue of the server whenever the peer receive queue full > condition is detected by either a sendmsg or a poll. A wake up on the > peer queue is then relayed to the ordinary wait queue of the client > socket via wake function. The connection to the peer wait queue is again > dissolved if either a wake up is about to be relayed or the client > socket reconnects or a dead peer is detected or the client socket is > itself closed. This enables removing the second sock_poll_wait from > unix_dgram_poll, thus avoiding the use-after-free, while still ensuring > that no blocked writer sleeps forever. > > Signed-off-by: Rainer Weikusuat [...] > @@ -1590,10 +1723,14 @@ restart: > goto out_unlock; > } > > - if (unix_peer(other) != sk && unix_recvq_full(other)) { > + if (!unix_dgram_peer_recv_ready(sk, other)) { > if (!timeo) { > - err = -EAGAIN; > - goto out_unlock; > + if (unix_dgram_peer_wake_me(sk, other)) { > + err = -EAGAIN; > + goto out_unlock; > + } > + > + goto restart; > } So this will cause 'unix_state_lock(other) to be called twice in a row if we 'goto restart' (and hence will softlock the box). It just needs a 'unix_state_unlock(other);' before the 'goto restart'. I also tested this patch with a single unix server and 200 client threads doing roughly epoll() followed by write() until -EAGAIN in a loop. The throughput for the test was roughly the same as current upstream, but the cpu usage was a lot higher. I think its b/c this patch takes the server wait queue lock in the _poll() routine. This causes a lot of contention. The previous patch you posted for this where you did not clear the wait queue on every wakeup and thus didn't need the queue lock in poll() (unless we were adding to it), performed much better. However, the previous patch which tested better didn't add to the remote queue when it was full on sendmsg() - so it wouldn't be correct for epoll ET. Adding to the remote queue for every sendmsg() that fails does seem undesirable, if we aren't even doing poll(). So I'm not sure if just going back to the previous patch is a great option eitherI'm also not sure how realistic the test case I have is. It would be great if we had some other workloads to test against. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] inet: delay address promotion check until last request in message
On Mon, Nov 09, 2015 at 10:35:42PM +0200, Julian Anastasov wrote: > > Hello, > > On Sun, 8 Nov 2015, Neil Horman wrote: > > > On Sat, Nov 07, 2015 at 01:49:25AM +0200, Julian Anastasov wrote: > > > > > > flush can provide many parameters. As there is no > > > any kind of indication in the netlink message that all addresses > > > are removed, we can not avoid the promotion. > > > > > This is true, but seems irrellevant to me. A flush operation is a sequence > > of > > RTM_DELADDR operations in a one or more netlink packets. The way my patch > > is > > written, if a set of DELADDR requests is interspersed with other non DELADDR > > requests, then we do a promotion check between each consecutive set of > > DELADDR > > requests. As such, all that happens is that the promotion check happens > > possibly more often than needed. Its not optimal, but not harmful either. > > It is harmful, you miss promotion for some of the > subnets, see below... > > > > > +* Only check for address promotion when this is the last > > > > request > > > > +* in this netlink transaction. It allows this operation to > > > > complete > > > > +* in O(n) time rather than O(n^2) > > > > > > It is not correct to assume that one promotion per > > > transaction is enough. The promotion happens in every subnet, > > > it was not once per device. > > > > > > > I'm not sure I understand the relevance here. All I'm doing is, in effect > > masking the promote_secondaries sysctl for an interface doing a flush > > operation. > > Its equivalent to doing this in user space: > > > > echo 0 > /proc/sys/net/ipv4/conf//promote_secondaries > > A=`some arbitrary address in ` > > ip addr del except A> > > echo 1 > /proc/sys/net/ipv4/conf//promote_secondaries > > ip addr del A > > > > Can you please explain to me the use case in which delaying a promotion > > operation until we think we're done ('done' being defined by the above > > transition > > from a DELADDR operation to a non-DELADDR operation in a netlink packet) > > produces an outcome that differs from the expectation with this patch in > > place? > > Here is how we can miss promotion... > > dev=eth1 > ifconfig $dev up > ip addr add 1.2.3.4/24 dev $dev > ip addr add 1.2.3.4/16 dev $dev > ip addr add 1.2.3.44/24 dev $dev > ip addr add 1.2.3.44/16 dev $dev > echo 1 > /proc/sys/net/ipv4/conf/$dev/promote_secondaries > ip addr flush dev $dev to 1.2.3.4/30 > ip -V > ip utility, iproute2-ss010824 > > What happens is a request to delete just primary addresses: > 1.2.3.4/24 and 1.2.3.4/16. The /30 is chosen in such a way, > so that any repeating attempts to flush the secondary > addresses are avoided. As result, with your patch, only > 1.2.3.44/16 is promoted (the last secondary), 1.2.3.44/24 > which is first in the list of the secondaries, is not > promoted, it is removed. You can even use 1.2.3.4/24, > 1.2.3.5/16, 1.2.3.44/24 and 1.2.3.55/16 in case the > equal IPs are not a good example. > > The problem will be more visible if one builds netlink message > by hand containing DEL for different primaries. > Ah, crap, Ok. I didn't consider the use case in which user space would build a filter of addresses that were on a mix of subnets. That makes sense now, thanks for the explination. I suppose then the only optimization to be had here is to detect the case of a complete flush of all addresses in user space and either manually or automatically disable promotion during the flush operation Neil > Regards > > -- > Julian Anastasov > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
On Monday, November 09, 2015 08:56:10 PM Tetsuo Handa wrote: > There are many locations that do > > if (memory_was_allocated_by_vmalloc) > vfree(ptr); > else > kfree(ptr); > > but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory > using is_vmalloc_addr(). Unless callers have special reasons, we can > replace this branch with kvfree(). Please check and reply if you found > problems. ACK for the ACPI changes (and CCing Tony and Boris for the heads-up as they are way more famailiar with the APEI code than I am). Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/4] Produce system time from correlated clocksource
On Tue, Nov 3, 2015 at 11:18 AM, Stanton, Kevin B wrote: > On Wed, 21 Oct 2015, Thomas Gleixner wrote: >> On Tue, 20 Oct 2015, John Stultz wrote: >>> Being able to have various hardware sharing a time base is quite >>> useful, and methods for correlating timestamps together are useful. >>> But I don't yet really understand why its important that we can >>> translate a hardware timestamp from some time in the past to the >>> correct system time in the past without error. > >>If your device can only provide timestamps from the past, then >>having access to the history is important if you want to have >>precise correlation. > > I hope this can be solved in timekeeping. But first, a quick > recap... > > The timestamp pair (including the ART snapshot, as described > previously) is captured simultaneously by the hardware > resulting, effectively, in a (PTP,TSC) pair, or > (AudioPosition,TSC) pair. The in-the-past-TSC value needs to be > converted to system time so that it can be used by applications, > without exposing the underlying ART or TSC. > > Note: ART is architectural, defined as part of Invariant > Timekeeping in the current SDM, so this isn't a one-off > capability. > > To convert a past TSC timestamp to system time 'correctly' (in a > mathematical sense), a history of monotonic rate adjustments > since that time in the past must be maintained. But again, my main problem is that I'm not totally understanding the rational. Here you're providing the *what*, not really the *why*. As I mentioned earlier, there are possibly simpler (at least for the kernel) ways to generate similar data using CLOCK_MONOTONIC_RAW, which has potentially a 500ppm error. *Why* is it important to add more complexity to the timekeeping core in order to avoid that error? > Regarding the amount of history, as Chris mentioned (and > in the context of new Intel hardware) LAN timestamp pairs are > a few microseconds in the past (no history would be required), > but for timestamps captured by the audio DSP, unfortunately, > they can be a small number of *milliseconds* in the past by the > time they're available to the audio driver (some history > required to convert accurately). I'm told that 4ms of adjustment > history accommodates known hardware. For a timestamp recorded 4ms ago, 500ppm of error is 2us. Why is 2us problematic for audio? That seems quite below the human threshold to notice. > Getting this 'correct' in one place (timekeeping) seems a lot > better than unnecessarily introducing inaccuracy via software > sampling (and extrapolation) or leaving it to each driver to do > it themselves, and to do it differently (and/or do it wrongly). Having a common infrastructure for extrapolating the data isn't something I'm objecting to. Its just that the shadow-timekeeper has been problematic enough in recent times bug wise, so I'm just hesitant to expand the complexity there. That said, I'm open to ideas, but would really like a better understanding of why other solutions would be insufficient, and why this one is the best solution. thanks -john -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
On 2015/11/09, 04:56, "Tetsuo Handa" wrote: >There are many locations that do > > if (memory_was_allocated_by_vmalloc) >vfree(ptr); > else >kfree(ptr); > >but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory >using is_vmalloc_addr(). Unless callers have special reasons, we can >replace this branch with kvfree(). Please check and reply if you found >problems. > >Signed-off-by: Tetsuo Handa >Acked-by: Michal Hocko For Lustre part: Reviewed-by: Andreas Dilger Cheers, Andreas -- Andreas Dilger Lustre Principal Engineer Intel High Performance Data Division -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] inet: delay address promotion check until last request in message
Hello, On Sun, 8 Nov 2015, Neil Horman wrote: > On Sat, Nov 07, 2015 at 01:49:25AM +0200, Julian Anastasov wrote: > > > > flush can provide many parameters. As there is no > > any kind of indication in the netlink message that all addresses > > are removed, we can not avoid the promotion. > > > This is true, but seems irrellevant to me. A flush operation is a sequence of > RTM_DELADDR operations in a one or more netlink packets. The way my patch is > written, if a set of DELADDR requests is interspersed with other non DELADDR > requests, then we do a promotion check between each consecutive set of DELADDR > requests. As such, all that happens is that the promotion check happens > possibly more often than needed. Its not optimal, but not harmful either. It is harmful, you miss promotion for some of the subnets, see below... > > > + * Only check for address promotion when this is the last request > > > + * in this netlink transaction. It allows this operation to complete > > > + * in O(n) time rather than O(n^2) > > > > It is not correct to assume that one promotion per > > transaction is enough. The promotion happens in every subnet, > > it was not once per device. > > > > I'm not sure I understand the relevance here. All I'm doing is, in effect > masking the promote_secondaries sysctl for an interface doing a flush > operation. > Its equivalent to doing this in user space: > > echo 0 > /proc/sys/net/ipv4/conf//promote_secondaries > A=`some arbitrary address in ` > ip addr del except A> > echo 1 > /proc/sys/net/ipv4/conf//promote_secondaries > ip addr del A > > Can you please explain to me the use case in which delaying a promotion > operation until we think we're done ('done' being defined by the above > transition > from a DELADDR operation to a non-DELADDR operation in a netlink packet) > produces an outcome that differs from the expectation with this patch in > place? Here is how we can miss promotion... dev=eth1 ifconfig $dev up ip addr add 1.2.3.4/24 dev $dev ip addr add 1.2.3.4/16 dev $dev ip addr add 1.2.3.44/24 dev $dev ip addr add 1.2.3.44/16 dev $dev echo 1 > /proc/sys/net/ipv4/conf/$dev/promote_secondaries ip addr flush dev $dev to 1.2.3.4/30 ip -V ip utility, iproute2-ss010824 What happens is a request to delete just primary addresses: 1.2.3.4/24 and 1.2.3.4/16. The /30 is chosen in such a way, so that any repeating attempts to flush the secondary addresses are avoided. As result, with your patch, only 1.2.3.44/16 is promoted (the last secondary), 1.2.3.44/24 which is first in the list of the secondaries, is not promoted, it is removed. You can even use 1.2.3.4/24, 1.2.3.5/16, 1.2.3.44/24 and 1.2.3.55/16 in case the equal IPs are not a good example. The problem will be more visible if one builds netlink message by hand containing DEL for different primaries. Regards -- Julian Anastasov -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 5/6] net/faraday: Enable NCSI interface
On Mon, 2015-11-09 at 18:30 +1100, Gavin Shan wrote: > > Yeah, It's something that I hilighed in the cover letter. I was > thinking we might need a better way to enable Tx/Rx before the > interrupt is up, but couldn't figure out one way. So I need some > advice here. No, that's not right. For Tx/Rx to work the interface must be opened, there is simply no way around that and that's perfectly fine. The situation with an MDIO PHY is the same, most drivers can't talk to their PHY until the interface has been opened because the chip is basically powered down otherwise. So we require the interface to be opened to talk, so far so good, the NC-SI stack doesn't even need to open it itself, it's acceptable to require userspace to do it. IE. Userspace will chose what interface to use, open it (for DHCP etc... or whatever other reason) and *that* will then trigger the NC-SI negociation. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: linux-next network throughput performance regression
Thanks Eric to provide the data. I am looping Tom (as I am looking into his recent patches) and Olaf (from Suse). So, if I understand it correctly, you are running netperf with single TCP connection, and you got ~26Gbps initially and got ~30Gbps after turning the tx-usecs and tx-frames. Do you have a baseline on your environment for the best/max/or peak throughput? Again, in my environment (SLES bare metal), if use SLES12 default kernel as a baseline, we can see significant performance drop (10% ~ 50%) on latest linux-next kernel. Absolutely I will try the same test on net-next soon and update the results to here later. Thanks, Simon > -Original Message- > From: Eric Dumazet [mailto:eric.duma...@gmail.com] > Sent: Saturday, November 7, 2015 11:50 AM > To: David Ahern > Cc: Simon Xiao ; de...@linuxdriverproject.org; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org; David Miller > ; KY Srinivasan ; Haiyang > Zhang > Subject: Re: linux-next network throughput performance regression > > On Sat, 2015-11-07 at 11:35 -0800, Eric Dumazet wrote: > > On Fri, 2015-11-06 at 14:30 -0700, David Ahern wrote: > > > On 11/6/15 2:18 PM, Simon Xiao wrote: > > > > The .config file used to build linux-next kernel is attached to this > > > > mail. > > > > > > Thanks. > > > > > > Failed to notice this on the first response; my brain filled in. Why > > > linux-next tree? Can you try net-next which is more relevant for > > > this mailing list, post the top commit id and config file used? > > > > Throughput on a single TCP flow for a 40G NIC can be tricky to tune. > > > > Make sure IRQ are properly setup/balanced, as I know that IRQ names > > were changed recently and your scripts might have not noticed... > > > > Also "ethtool -c eth0" might show very different interrupt coalescing > > params ? > > > > I too have a Mellanox 40Gb in my lab and saw no difference in > > performance with recent kernels. > > > > Of course, a simple "perf record -a -g sleep 4 ; perf report" might > > point to some obvious issue. Like unexpected segmentation in case of > > forwarding... > > > > > > I did a test with current net tree on both sender and receiver > > lpaa23:~# ./netperf -H 10.246.7.152 > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 10.246.7.152 () port 0 AF_INET > Recv SendSend > Socket Socket Message Elapsed > Size SizeSize Time Throughput > bytes bytes bytessecs.10^6bits/sec > > 87380 16384 1638410.0026864.98 > lpaa23:~# ethtool -c eth1 > Coalesce parameters for eth1: > Adaptive RX: on TX: off > stats-block-usecs: 0 > sample-interval: 0 > pkt-rate-low: 40 > pkt-rate-high: 45 > > rx-usecs: 16 > rx-frames: 44 > rx-usecs-irq: 0 > rx-frames-irq: 0 > > tx-usecs: 16 > tx-frames: 16 > tx-usecs-irq: 0 > tx-frames-irq: 256 > > rx-usecs-low: 0 > rx-frame-low: 0 > tx-usecs-low: 0 > tx-frame-low: 0 > > rx-usecs-high: 128 > rx-frame-high: 0 > tx-usecs-high: 0 > tx-frame-high: 0 > > lpaa23:~# ethtool -C eth1 tx-usecs 4 tx-frames 4 > lpaa23:~# ./netperf -H > 10.246.7.152 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 > AF_INET to > 10.246.7.152 () port 0 AF_INET > Recv SendSend > Socket Socket Message Elapsed > Size SizeSize Time Throughput > bytes bytes bytessecs.10^6bits/sec > > 87380 16384 1638410.0030206.27 >
Re: [PATCH] arm64: bpf: fix JIT stack setup
On Mon, Nov 9, 2015 at 10:08 AM, Shi, Yang wrote: > I added it to stay align with ARMv8 AAPCS to maintain the correct FP during > function call. It makes us get correct stack backtrace. > > I think we'd better to keep compliant with ARMv8 AAPCS in BPF JIT prologue > too. > > If nobody thinks it is necessary, we definitely could remove that change. Oh no, I don't think anyone will say it's unnecessary! I agree the A64_FP-related change is a good idea, so stack unwinding works. How about splitting this into two patches? One for the BPF-related bug, and another for A64 FP-handling. Thanks again for tracking this down and improving things overall for arm64 :) > > Thanks, > Yang > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC net-next 0/2] tcp: Redundant Data Bundling (RDB)
On 24/10/15 14:57, Eric Dumazet wrote: > Thank you for this very high quality patch submission. > > Please give us a few days for proper evaluation. > > Thanks ! Guys, thank you very much for taking the time to evaluate this. Since there haven't been any more feedback or comments I'll submit an RFCv2 with a few changes which includes removing the Nagle modification. After discussing the Nagle change on setsockopt we realize that it should be evaluated more thoroughly, and is better left for a later patch submission. Bendik P.S. Trimming the CC list to only those who have responded as gmail says I'm spamming :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] drivers: net: cpsw: add support for fixed-links.
Add support for fixed-links in configurations without PHY. (e.g. connection to a switch, SGMII point to point, SFPs) Check: Documentation/devicetree/bindings/net/fixed-link.txt. Signed-off-by: Daniel Trautmann --- drivers/net/ethernet/ti/cpsw.c | 40 1 file changed, 28 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 040fbc1..3e32365 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -2039,19 +2039,35 @@ static int cpsw_probe_dt(struct cpsw_priv *priv, priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0); parp = of_get_property(slave_node, "phy_id", &lenp); if ((parp == NULL) || (lenp != (sizeof(void *) * 2))) { - dev_err(&pdev->dev, "Missing slave[%d] phy_id property\n", i); - goto no_phy_slave; - } - mdio_node = of_find_node_by_phandle(be32_to_cpup(parp)); - phyid = be32_to_cpup(parp+1); - mdio = of_find_device_by_node(mdio_node); - of_node_put(mdio_node); - if (!mdio) { - dev_err(&pdev->dev, "Missing mdio platform device\n"); - return -EINVAL; + if (!of_phy_is_fixed_link(slave_node)) { + dev_err(&pdev->dev, + "Missing slave[%d] phy_id property\n", + i); + goto no_phy_slave; + } + + ret = of_phy_register_fixed_link(slave_node); + if (ret) { + dev_err(&pdev->dev, "cannot register fixed PHY\n"); + return ret; + } + + /* In the case of a fixed PHY, the DT node associated +* to the PHY is the Ethernet MAC DT node. +*/ + priv->phy_node = of_node_get(slave_node); + } else { + mdio_node = of_find_node_by_phandle(be32_to_cpup(parp)); + phyid = be32_to_cpup(parp + 1); + mdio = of_find_device_by_node(mdio_node); + of_node_put(mdio_node); + if (!mdio) { + dev_err(&pdev->dev, "Missing mdio platform device\n"); + return -EINVAL; + } + snprintf(slave_data->phy_id, sizeof(slave_data->phy_id), +PHY_ID_FMT, mdio->name, phyid); } - snprintf(slave_data->phy_id, sizeof(slave_data->phy_id), -PHY_ID_FMT, mdio->name, phyid); slave_data->phy_if = of_get_phy_mode(slave_node); if (slave_data->phy_if < 0) { dev_err(&pdev->dev, "Missing or malformed slave[%d] phy-mode property\n", -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] packet: Allow packets with only a header (but no payload)
On 2015-11-09 18:53, Willem de Bruijn wrote: > On Sat, Nov 7, 2015 at 8:11 AM, Felix Fietkau wrote: >> On 2015-07-31 00:15, Martin Blumenstingl wrote: >>> On Wed, Jul 29, 2015 at 8:05 AM, Willem de Bruijn >>> wrote: Martin, to return to your initial statement that PPPoE PADI packets can have a zero payload: the PPPoE RFC states that PADI packets "MUST contain exactly one TAG of TAG_TYPE Service-Name, indicating the service the Host is requesting, and any number of other TAG types." (RFC 2516, 5.1). Is the observed behavior (no payload) perhaps incorrect? >>> As far as I can see you are right, but the real world seems to be different. >>> My ISP for example lists the PPPoE connection settings, but they are >>> nowhere mentioning the "service name". >>> >>> I have also re-read pppd's source code again and that seems to confirm >>> what you are reading in the RFC: Leaving the service name away makes >>> seems to violate the RFC, but pppd still accepts those configurations. >>> Even if it is, if this is breaking established userspace expectations, we should look into it. Ethernet specifies a minimum payload size of 46 on the wire, but perhaps that is handled with padding, so that 0 length should be valid within the stack. Also, there may be other valid uses of 0 length payload on top of link layers that are not Ethernet. >>> Good catch. I would also like to note that the documentation for >>> "hard_header_len" describes it as "Hardware header length". When the >>> purpose of this field we should check whether the documentation should >>> be updated to "Minimum hardware header length" -> that would mean the >>> condition has to be a "len < hard_header_len" instead of a "len <= >>> hard_header_len" (as it is now). >>> >>> PS: I have also added the pppd maintainer (Paul Mackerras) to this >>> thread because I think he should know about this issue (and he can >>> probably provide more details if required). >>> As a quick summary for him: linux >= 3.19 rejects PADI packets when >>> no service name is configured. >> Any news on this? Users are complaining about this regression: >> https://dev.openwrt.org/ticket/20707 > > I took another look. This hinges on the question what the contract with > device drivers is on skb network data and length. Is passing an skb with > skb->len == 0 to ndo_start_xmit allowed? > > From what I gather from the ethernet spec [1], sending frames with an > empty head is allowed on that medium, at least. > > A quick scan of a few drivers and the loopback path also does not show > anything that would break. In some cases, skb_network_header points > beyond the end of the buffer (ETH_HLEN), but the length is correctly > reported as 0. > > The tap device can also generate packets consisting of only a link layer > header: compares len < ETH_HLEN in tun_get_user. > > So, I think that this change should be correct: > > static bool ll_header_truncated(const struct net_device *dev, int len) > { > - /* net device doesn't like empty head */ > - if (unlikely(len <= dev->hard_header_len)) { > + if (unlikely(len < dev->hard_header_len)) { > > but a definitive answer would require an audit of all device drivers > (including bonding, ..) or at least the certainty that it has always > been correct to send a packet of only link layer header to > ndo_start_xmit. > > [1] IEEE 802.3™-2012 – Section One, {3.2.8, 4.2.3.3} Yeah, I agree that such an audit is required. However, I think it's *much* more important to add this change as soon as possible to fix the regression. The old code may have had theoretical driver issues, but the current code breaks real-world user setups. - Felix -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [iproute PATCH v2 4/6] ipaddress: fix ipaddr_flush for Linux >= 3.1
On 11/09/2015 09:51 PM, Sergei Shtylyov wrote: Linux version 3.1 introduced a consistency check for netlink dumps in commit 670dc28 ("netlink: advertise incomplete dumps"). This bites The scripts/checkpatch.pl now enforces 12-digit commit ID... Sorry, didn't realize it wasn't a kernel patch. :-) iproute2 when flushing more addresses than can fit into a single RTM_GETADDR response. To silence the spurious error message "Dump was interrupted and may be inconsistent.", advise rtnl_dump_filter_l() to not care about NLM_F_DUMP_INTR. Signed-off-by: Phil Sutter [...] MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [iproute PATCH v2 4/6] ipaddress: fix ipaddr_flush for Linux >= 3.1
Hello. On 11/08/2015 11:22 PM, Phil Sutter wrote: Linux version 3.1 introduced a consistency check for netlink dumps in commit 670dc28 ("netlink: advertise incomplete dumps"). This bites The scripts/checkpatch.pl now enforces 12-digit commit ID... iproute2 when flushing more addresses than can fit into a single RTM_GETADDR response. To silence the spurious error message "Dump was interrupted and may be inconsistent.", advise rtnl_dump_filter_l() to not care about NLM_F_DUMP_INTR. Signed-off-by: Phil Sutter MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: [Bug 107551] New: IPV4 Reassembling Problem
Begin forwarded message: Date: Mon, 9 Nov 2015 11:04:29 + From: "bugzilla-dae...@bugzilla.kernel.org" To: "shemmin...@linux-foundation.org" Subject: [Bug 107551] New: IPV4 Reassembling Problem https://bugzilla.kernel.org/show_bug.cgi?id=107551 Bug ID: 107551 Summary: IPV4 Reassembling Problem Product: Networking Version: 2.5 Kernel Version: 4.2.3 Hardware: Other OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: IPV4 Assignee: shemmin...@linux-foundation.org Reporter: bernd.weib...@siemens.com Regression: No I have an ipv4 reassembling issue using the latest Kernel (4.2.3) in an nios2 environment. I first noticed that issue when sending UDP-Pakets from an Host PC to my device. Each packet has a payload of 4096 Byte. Due to the MTU of 1500 Byte each packet will be split into three ip-packets. Approximately 1 of 200 Packets were not correctly reassembled by the Kernel, so they cannot be received within user-space. The command nstat says, that there are reassembling errors due to timeout, but at least in wireshark all packets are correctly displayed. Sometime the kernel issues a icmp-messages, saying that there was a timeout during ip reassembling. I fir9st thought that it might be a performance problem, so I increased the time between the packets to about one packet per 500 ms but without success. Then I decided to reduce the payload to 1024 Byte, because it might be a buffer (or TSE FIFO) problem. Now all packets are send into one ipv4 packet and there is nothing lost on the receiver side, but when I reduced the MTU of the host from 1500 to 800 byte (each datagram will be split into two packets now) the problem accurse again! So this might be a reassembling issue. -- You are receiving this mail because: You are the assignee for the bug. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: [Bug 106711] VXLAN: RTNL assertion failed at net/core/net_namespace.c:187
Begin forwarded message: Date: Mon, 9 Nov 2015 09:42:52 + From: "bugzilla-dae...@bugzilla.kernel.org" To: "shemmin...@linux-foundation.org" Subject: [Bug 106711] VXLAN: RTNL assertion failed at net/core/net_namespace.c:187 https://bugzilla.kernel.org/show_bug.cgi?id=106711 Kari Hautio changed: What|Removed |Added CC||khau...@gmail.com --- Comment #1 from Kari Hautio --- Also affects 4.1.12 [ 34.412104] RTNL: assertion failed at /home/kernel/COD/linux/net/core/net_namespace.c (187) [ 34.412120] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.12-040112-generic #201510262131 [ 34.412123] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 [ 34.412126] 880035dd2380 88013fd838f8 817e367c [ 34.412131] 88013fd83928 816cfbe5 880035526200 [ 34.412135] 8800b9b3fe00 88013a3dc9c0 88013a3dccc0 88013fd83938 [ 34.412140] Call Trace: [ 34.412142][] dump_stack+0x45/0x57 [ 34.412159] [] __peernet2id+0xa5/0xb0 [ 34.412163] [] peernet2id+0x18/0x30 [ 34.412168] [] vxlan_fdb_info+0xfa/0x360 [vxlan] [ 34.412173] [] ? __alloc_skb+0x8c/0x1f0 [ 34.412178] [] vxlan_fdb_notify+0x72/0x100 [vxlan] [ 34.412182] [] vxlan_fdb_create+0x14a/0x3a0 [vxlan] [ 34.412187] [] vxlan_snoop+0x1cb/0x1d0 [vxlan] [ 34.412192] [] vxlan_rcv+0x3a2/0x620 [vxlan] [ 34.412196] [] vxlan_udp_encap_recv+0x156/0x420 [vxlan] [ 34.412201] [] ? vxlan_open+0x1e0/0x1e0 [vxlan] [ 34.412207] [] udp_queue_rcv_skb+0x1f3/0x4f0 [ 34.412211] [] __udp4_lib_rcv+0x127/0x930 [ 34.412215] [] udp_rcv+0x1a/0x20 [ 34.412227] [] ip_local_deliver_finish+0xae/0x230 [ 34.412231] [] ip_local_deliver+0x46/0xa0 [ 34.412235] [] ip_rcv_finish+0x81/0x370 [ 34.412239] [] ip_rcv+0x2a2/0x3c0 [ 34.412244] [] __netif_receive_skb_core+0x7e3/0xa10 [ 34.412248] [] __netif_receive_skb+0x18/0x60 [ 34.412252] [] process_backlog+0xb2/0x150 [ 34.412256] [] net_rx_action+0x1fa/0x320 [ 34.412265] [] __do_softirq+0x103/0x280 [ 34.412269] [] irq_exit+0xad/0xb0 [ 34.412274] [] smp_apic_timer_interrupt+0x46/0x60 [ 34.412279] [] apic_timer_interrupt+0x6e/0x80 [ 34.412280][] ? native_safe_halt+0x6/0x10 [ 34.412290] [] ? rcu_eqs_enter+0xa3/0xb0 [ 34.412297] [] default_idle+0x1e/0xc0 [ 34.412315] [] arch_cpu_idle+0xf/0x20 [ 34.412325] [] cpu_startup_entry+0x367/0x3e0 [ 34.412332] [] ? clockevents_register_device+0x112/0x1e0 [ 34.412338] [] start_secondary+0x17e/0x1a0 -- You are receiving this mail because: You are the assignee for the bug. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
From: Rainer Weikusat Date: Mon, 09 Nov 2015 14:40:48 + > + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, > + &u->peer_wake); This is more simply: __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q); > +static inline int unix_dgram_peer_recv_ready(struct sock *sk, > + struct sock *other) Please do not us the inline keyword in foo.c files, let the compiler decide. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] net: mvneta: Introduce RSS support
Hi Marcin, On ven., nov. 06 2015, Marcin Wojtas wrote: > Hi Gregory, > > >> I also choose to associate all the TX queues on the same CPU that the >> one associated to the RX queue. It allows to contain all the >> interrupts on the same CPU. I think that an improvement on this side >> would be the support of the XPS. >> > > Did you make some tries? E.g. after mapping certain txqs to CPU1, when > using them was mvneta_tx() executing only on this CPU? Or it rather > means that the irq will hit according to the mapping? As far as I know > the HW, the latter should be true, which would mean the real XPS with > this controller is impossible and the maximum we can control is the > irq. I hoped that with XPS we can associate CPUs to all the tx queues of a ethernet port. For example choosing for eth2, that all the tx queues will be handled by CPU0 and CPU3. And for this it would involve allowing receiving the tx interrupts only on CPU0 and CPU3. This kind of feature seems to be doable with the hardware. But after reading more carefully what XPS is about and how it is used in the kernel, it seems we can't configure the hardware from the sysfs interface. From my understanding we can only indicate to the kernel which tx queue use for a given CPU. > > I think it may be worth to unmask TX irqs on all cpus, so all percpu > napi's would be able to read tx_cause and process sent packets. I'm > looking forward to your opinion. By doing, this my understanding is that when an TX interrupt occurs then all the CPUs enter in the interrupt handler but only one can manage it. So, for me it looks like a big waste of CPU load. I could understand that it would be interesting in some situation but it doesn't seem to be the good default behavior. That's why I was looking for a way to let the use configure it according to his needs. Please correct me if I am wrong somewhere, because currently I don't find a good solution for it. Thanks, Gregory -- Gregory Clement, Free Electrons Kernel, drivers, real-time and embedded Linux development, consulting, training and support. http://free-electrons.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] arm64: bpf: fix JIT stack setup
On 11/8/2015 2:29 PM, Z Lim wrote: On Sat, Nov 7, 2015 at 6:27 PM, Alexei Starovoitov wrote: On Fri, Nov 06, 2015 at 09:36:17PM -0800, Yang Shi wrote: ARM64 JIT used FP (x29) as eBPF fp register, but FP is subjected to change during function call so it may cause the BPF prog stack base address change too. Whenever, it pointed to the bottom of BPF prog stack instead of the top. So, when copying data via bpf_probe_read, it will be copied to (SP - offset), then it may overwrite the saved FP/LR. Use x25 to replace FP as BPF stack base register (fp). Since x25 is callee saved register, so it will keep intact during function call. It is initialized in BPF prog prologue when BPF prog is started to run everytime. When BPF prog exits, it could be just tossed. Other than this the BPf prog stack base need to be setup before function call stack. So, the BPF stack layout looks like: high original A64_SP => 0:+-+ BPF prologue | | FP/LR and callee saved registers BPF fp register => +64:+-+ | | | ... | BPF prog stack | | | | current A64_SP => +-+ | | | ... | Function call stack | | +-+ low Signed-off-by: Yang Shi CC: Zi Shen Lim CC: Xi Wang Thanks for tracking it down. That looks like fundamental bug in arm64 jit. I'm surprised function calls worked at all. Zi please review. For function calls (BPF_JMP | BPF_CALL), we are compliant with AAPCS64 [1]. That part is okay. bpf_probe_read accesses the BPF program stack, which is based on BPF_REG_FP. This exposes an issue with how BPF_REG_FP was setup, as Yang pointed out. Instead of having BPF_REG_FP point to top of stack, we erroneously point it to the bottom of stack. When there are function calls, we run the risk of clobbering of BPF stack. Bad idea. Yes, exactly. Otherwise, since BPF_REG_FP is read-only, and is setup exactly once in prologue, it remains consistent throughout lifetime of the BPF program. Yang, can you please try the following? It should work without the below change: + emit(A64_MOV(1, A64_FP, A64_SP), ctx); I added it to stay align with ARMv8 AAPCS to maintain the correct FP during function call. It makes us get correct stack backtrace. I think we'd better to keep compliant with ARMv8 AAPCS in BPF JIT prologue too. If nobody thinks it is necessary, we definitely could remove that change. Thanks, Yang 8<- --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -161,12 +161,12 @@ static void build_prologue(struct jit_ctx *ctx) if (ctx->tmp_used) emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx); - /* Set up BPF stack */ - emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx); - /* Set up frame pointer */ emit(A64_MOV(1, fp, A64_SP), ctx); + /* Set up BPF stack */ + emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx); + /* Clear registers A and X */ emit_a64_mov_i64(ra, 0, ctx); emit_a64_mov_i64(rx, 0, ctx); ->8 [1] http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] packet: Allow packets with only a header (but no payload)
On Sat, Nov 7, 2015 at 8:11 AM, Felix Fietkau wrote: > On 2015-07-31 00:15, Martin Blumenstingl wrote: >> On Wed, Jul 29, 2015 at 8:05 AM, Willem de Bruijn wrote: >>> Martin, to return to your initial statement that PPPoE PADI packets can >>> have a zero payload: the PPPoE RFC states that PADI packets "MUST >>> contain exactly one TAG of TAG_TYPE Service-Name, indicating the >>> service the Host is requesting, and any number of other TAG types." >>> (RFC 2516, 5.1). Is the observed behavior (no payload) perhaps >>> incorrect? >> As far as I can see you are right, but the real world seems to be different. >> My ISP for example lists the PPPoE connection settings, but they are >> nowhere mentioning the "service name". >> >> I have also re-read pppd's source code again and that seems to confirm >> what you are reading in the RFC: Leaving the service name away makes >> seems to violate the RFC, but pppd still accepts those configurations. >> >>> Even if it is, if this is breaking established userspace expectations, >>> we should look into it. Ethernet specifies a minimum payload size of >>> 46 on the wire, but perhaps that is handled with padding, so that >>> 0 length should be valid within the stack. Also, there may be other >>> valid uses of 0 length payload on top of link layers that are not Ethernet. >> Good catch. I would also like to note that the documentation for >> "hard_header_len" describes it as "Hardware header length". When the >> purpose of this field we should check whether the documentation should >> be updated to "Minimum hardware header length" -> that would mean the >> condition has to be a "len < hard_header_len" instead of a "len <= >> hard_header_len" (as it is now). >> >> PS: I have also added the pppd maintainer (Paul Mackerras) to this >> thread because I think he should know about this issue (and he can >> probably provide more details if required). >> As a quick summary for him: linux >= 3.19 rejects PADI packets when >> no service name is configured. > Any news on this? Users are complaining about this regression: > https://dev.openwrt.org/ticket/20707 I took another look. This hinges on the question what the contract with device drivers is on skb network data and length. Is passing an skb with skb->len == 0 to ndo_start_xmit allowed? >From what I gather from the ethernet spec [1], sending frames with an empty head is allowed on that medium, at least. A quick scan of a few drivers and the loopback path also does not show anything that would break. In some cases, skb_network_header points beyond the end of the buffer (ETH_HLEN), but the length is correctly reported as 0. The tap device can also generate packets consisting of only a link layer header: compares len < ETH_HLEN in tun_get_user. So, I think that this change should be correct: static bool ll_header_truncated(const struct net_device *dev, int len) { - /* net device doesn't like empty head */ - if (unlikely(len <= dev->hard_header_len)) { + if (unlikely(len < dev->hard_header_len)) { but a definitive answer would require an audit of all device drivers (including bonding, ..) or at least the certainty that it has always been correct to send a packet of only link layer header to ndo_start_xmit. [1] IEEE 802.3™-2012 – Section One, {3.2.8, 4.2.3.3} -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Mon, Nov 09, 2015 at 06:08:49PM +0100, Andrew Lunn wrote: > You use fixed-phy when the MAC is connected to a switch, not a phy. Or > when the MAC is connected to an SFP module. ... hopefully not for much longer. Once -rc1 is out, I'll sort out posting my phylink and SFP module hotplug support as a RFC. -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Mon, Nov 09, 2015 at 06:12:02PM +0100, Arnd Bergmann wrote: > On Monday 09 November 2015 17:08:34 Russell King - ARM Linux wrote: > > They are "optional" because when you're using a DSA switch, you don't > > specify a PHY (because, there isn't one). For example, this is what > > I'm using with an Armada 388 board with a Marvell DSA switch. The > > DSA does not appear as a PHY, and no node in the DSA stanza can be > > referenced for a phy entry in the ethernet device's stanza. > > > > eth1: ethernet@3 { > > compatible = "marvell,armada-370-neta"; > > reg = <0x3 0x4000>; > > interrupts-extended = <&mpic 10>; > > clocks = <&gateclk 3>; > > managed = "in-band-status"; > > phy-mode = "sgmii"; > > status = "okay"; > > }; > > > > > > Ok, then it would be nice to change the binding to reflect that, > and also document the "managed" property there. "managed" is already documented. See Documentation/devicetree/bindings/net/ethernet.txt: - managed: string, specifies the PHY management type. Supported values are: "auto", "in-band-status". "auto" is the default, it usess MDIO for management if fixed-link is not specified. -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network stream fairness
On Mon, 2015-11-09 at 17:50 +0100, Niklas Cassel wrote: > > for i in `seq 1 20`; do ss -temoi dst 192.168.0.141; sleep 1; done ... > ESTAB 0 0192.168.0.1:54578 > 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 > ino:7901838 sk:880143735040 > State Recv-Q Send-Q Local Address:Port > Peer Address:Port > ESTAB 0 0192.168.0.1:54576 > 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 > ino:7901837 sk:880143735800 > ESTAB 0 0192.168.0.1:54574 > 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 > uid:20283 ino:7902591 sk:880223311800 > ESTAB 1448 0192.168.0.1:54578 > 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 > ino:7901838 sk:880143735040 This is the receiver, I would like to see the sender side, since the changes we talk about are at his side, of course. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND] net: Documentation: Fix default value tcp_limit_output_bytes
From: Niklas Cassel Date: Mon, 9 Nov 2015 15:59:00 +0100 > Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit") > updated default value for tcp_limit_output_bytes > > Signed-off-by: Niklas Cassel Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Monday 09 November 2015 18:08:49 Andrew Lunn wrote: > > > I suppose it comes down to, are we allowed to optionally implement > > > part of the DT binding? > > > > I'm not sure what you are asking. A lot of DT bindings have both > > optional and mandatory properties. For mvneta, the "phy" and "phy-mode" > > properties are listed as mandatory, so the driver can safely assume > > that they are always present. If there are reasons to leave them out, > > and for the driver to handle that case correctly, the binding > > should be updated to mark them as optional. > > Hi Arnd > > You are looking at it from the perspective of the driver. I was > meaning from the perspective of the DT blob. Can be blob assume the > driver implements all of the binding, all of the time? That question is not really relevant: the DT describes the hardware, it doesn't matter whether there are drivers for all the bits or whether all properties are read. > You use fixed-phy when the MAC is connected to a switch, not a phy. Or > when the MAC is connected to an SFP module. The driver can currently > be built to not implement the fixed-phy party of the binding. Is that > O.K. from the perspective of the DT blob? Or should the driver always > implement all of the binding, in which these NOP stubs should be > removed and fixed phy always be enabled for the drivers that use it. Sure, that is ok. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Monday 09 November 2015 17:08:34 Russell King - ARM Linux wrote: > They are "optional" because when you're using a DSA switch, you don't > specify a PHY (because, there isn't one). For example, this is what > I'm using with an Armada 388 board with a Marvell DSA switch. The > DSA does not appear as a PHY, and no node in the DSA stanza can be > referenced for a phy entry in the ethernet device's stanza. > > eth1: ethernet@3 { > compatible = "marvell,armada-370-neta"; > reg = <0x3 0x4000>; > interrupts-extended = <&mpic 10>; > clocks = <&gateclk 3>; > managed = "in-band-status"; > phy-mode = "sgmii"; > status = "okay"; > }; > > Ok, then it would be nice to change the binding to reflect that, and also document the "managed" property there. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Mon, Nov 09, 2015 at 05:57:43PM +0100, Arnd Bergmann wrote: > I'm not sure what you are asking. A lot of DT bindings have both > optional and mandatory properties. For mvneta, the "phy" and "phy-mode" > properties are listed as mandatory, so the driver can safely assume > that they are always present. If there are reasons to leave them out, > and for the driver to handle that case correctly, the binding > should be updated to mark them as optional. They are "optional" because when you're using a DSA switch, you don't specify a PHY (because, there isn't one). For example, this is what I'm using with an Armada 388 board with a Marvell DSA switch. The DSA does not appear as a PHY, and no node in the DSA stanza can be referenced for a phy entry in the ethernet device's stanza. eth1: ethernet@3 { compatible = "marvell,armada-370-neta"; reg = <0x3 0x4000>; interrupts-extended = <&mpic 10>; clocks = <&gateclk 3>; managed = "in-band-status"; phy-mode = "sgmii"; status = "okay"; }; dsa@0 { compatible = "marvell,dsa"; dsa,ethernet = <ð1>; dsa,mii-bus = <&mdio>; pinctrl-0 = <&clearfog_dsa0_clk_pins &clearfog_dsa0_pins>; pinctrl-names = "default"; #address-cells = <2>; #size-cells = <0>; switch@0 { ... }; }; -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
> > I suppose it comes down to, are we allowed to optionally implement > > part of the DT binding? > > I'm not sure what you are asking. A lot of DT bindings have both > optional and mandatory properties. For mvneta, the "phy" and "phy-mode" > properties are listed as mandatory, so the driver can safely assume > that they are always present. If there are reasons to leave them out, > and for the driver to handle that case correctly, the binding > should be updated to mark them as optional. Hi Arnd You are looking at it from the perspective of the driver. I was meaning from the perspective of the DT blob. Can be blob assume the driver implements all of the binding, all of the time? You use fixed-phy when the MAC is connected to a switch, not a phy. Or when the MAC is connected to an SFP module. The driver can currently be built to not implement the fixed-phy party of the binding. Is that O.K. from the perspective of the DT blob? Or should the driver always implement all of the binding, in which these NOP stubs should be removed and fixed phy always be enabled for the drivers that use it. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
From: Vladislav Yasevich Date: Mon, 9 Nov 2015 09:14:17 -0500 > macvtap_do_read code calls macvtap_put_user while it might be set up > to wait for the user. This results in the following warning: ... > Make sure thet we call finish_wait() if we have the skb to process > before trying to actually process it. > > Signed-off-by: Vladislav Yasevich Oh I see, this is that new check in might_sleep() that makes sure that the task state is TASK_RUNNING. Looks good, applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Monday 09 November 2015 17:42:32 Andrew Lunn wrote: > On Mon, Nov 09, 2015 at 03:08:57PM +0100, Arnd Bergmann wrote: > > The fixed_phy infrastructure is done in a way that is optional, > > by providing 'static inline' helper functions doing nothing in > > include/linux/phy_fixed.h for all its APIs. However, three out > > of the four users (DSA, BCMGENET, and SYSTEMPORT) always > > 'select FIXED_PHY', presumably because they need that. > > Hi Arnd > > Need is probably too strong, it could be considered an optional > feature. If you don't have a fixed_phy property in your DT blob, you > don't need fixed phy support in your image. Ok, I see. > > MVNETA is the fourth one, and if that is built-in but FIXED_PHY > > is configured as a loadable module, we get a link error: > > > > drivers/built-in.o: In function `mvneta_fixed_link_update': > > fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state' > > > > Presumably this driver has the same dependency as the others, > > so this patch also uses 'select' to ensure that the fixed-phy > > support is built-in. > > This will work, and is uniform with the other instances. But maybe a > more correct fix is to ensure fixed-phy is never a module when there > is a builtin user. That is hard to express with Kconfig. The alternative I listed instead guarantees that CONFIG_MVNETA cannot be set to 'y' whenever FIXED_PHY=m. For all practical purposes that has the same effect. The fixed-phy support isn't very big (around 2KB), so I wonder how relevant that optimization is. > > Should we perhaps make 'FIXED_PHY' a silent option and remove the > > inline helpers, based on the assumption that a driver that wants these > > will not work without them? > > I suppose it comes down to, are we allowed to optionally implement > part of the DT binding? I'm not sure what you are asking. A lot of DT bindings have both optional and mandatory properties. For mvneta, the "phy" and "phy-mode" properties are listed as mandatory, so the driver can safely assume that they are always present. If there are reasons to leave them out, and for the driver to handle that case correctly, the binding should be updated to mark them as optional. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network stream fairness
On 11/09/2015 05:24 PM, Eric Dumazet wrote: > On Mon, 2015-11-09 at 08:07 -0800, Eric Dumazet wrote: > >> Your numbers suggest a cwnd growth then, which might show a CC bug. >> >> Please run the following when your iper3 runs on regular 4.3 kernel >> >> for i in `seq 1 10` >> do >> ss -temoi dst 192.168.0.141 >> sleep 1 >> done >> See output at the end of the mail. iperf3 output sample: [ 4] 24.00-24.53 sec 1.92 MBytes 30.2 Mbits/sec [ 6] 24.00-24.53 sec 3.84 MBytes 60.5 Mbits/sec [SUM] 24.00-24.53 sec 5.76 MBytes 90.7 Mbits/sec > > Another thing to try is to change tcp_limit_output_bytes back to 131072 I tried both sysctl -w net.ipv4.tcp_limit_output_bytes=131072 sysctl -w net.ipv4.tcp_limit_output_bytes=13107 Neither appears to solve the problem. I don't know why but echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max solves it. Feels weird that setting tcp_limit_output_bytes doesn't fix it. > > Of course, your driver be the problem as well, but I am guessing it is > not upstream ? It's not upstreamed, but it is GPL, so I could probably upload it somewhere if that helps. for i in `seq 1 20`; do ss -temoi dst 192.168.0.141; sleep 1; done State Recv-Q Send-Q Local Address:Port Peer Address:Port State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 106 192.168.0.1:54574 192.168.0.141:5201 timer:(on,190ms,0) rto:0.2 ato:0.04 cwnd:10 qack:14 bidir uid:20283 ino:7902591 sk:880223311800 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0192.168.0.1:54576 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901837 sk:880143735800 ESTAB 0 0192.168.0.1:54574 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 uid:20283 ino:7902591 sk:880223311800 ESTAB 0 0192.168.0.1:54578 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901838 sk:880143735040 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0192.168.0.1:54576 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901837 sk:880143735800 ESTAB 0 0192.168.0.1:54574 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 uid:20283 ino:7902591 sk:880223311800 ESTAB 0 0192.168.0.1:54578 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901838 sk:880143735040 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0192.168.0.1:54576 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901837 sk:880143735800 ESTAB 0 0192.168.0.1:54574 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 uid:20283 ino:7902591 sk:880223311800 ESTAB 0 0192.168.0.1:54578 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901838 sk:880143735040 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0192.168.0.1:54576 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901837 sk:880143735800 ESTAB 0 0192.168.0.1:54574 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 uid:20283 ino:7902591 sk:880223311800 ESTAB 0 0192.168.0.1:54578 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901838 sk:880143735040 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0192.168.0.1:54576 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 uid:20283 ino:7901837 sk:880143735800 ESTAB 0 0192.168.0.1:54574 192.168.0.141:5201 rto:0.2 ato:0.04 cwnd:10 qack:10 uid:20283 ino:7902591 sk:880223311800 ESTAB 0 0
Re: [PATCH] mvneta: add FIXED_PHY dependency
On Mon, Nov 09, 2015 at 03:08:57PM +0100, Arnd Bergmann wrote: > The fixed_phy infrastructure is done in a way that is optional, > by providing 'static inline' helper functions doing nothing in > include/linux/phy_fixed.h for all its APIs. However, three out > of the four users (DSA, BCMGENET, and SYSTEMPORT) always > 'select FIXED_PHY', presumably because they need that. Hi Arnd Need is probably too strong, it could be considered an optional feature. If you don't have a fixed_phy property in your DT blob, you don't need fixed phy support in your image. > MVNETA is the fourth one, and if that is built-in but FIXED_PHY > is configured as a loadable module, we get a link error: > > drivers/built-in.o: In function `mvneta_fixed_link_update': > fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state' > > Presumably this driver has the same dependency as the others, > so this patch also uses 'select' to ensure that the fixed-phy > support is built-in. This will work, and is uniform with the other instances. But maybe a more correct fix is to ensure fixed-phy is never a module when there is a builtin user. > Should we perhaps make 'FIXED_PHY' a silent option and remove the > inline helpers, based on the assumption that a driver that wants these > will not work without them? I suppose it comes down to, are we allowed to optionally implement part of the DT binding? Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mvneta: add FIXED_PHY dependency
From: Arnd Bergmann Date: Mon, 09 Nov 2015 15:08:57 +0100 > The fixed_phy infrastructure is done in a way that is optional, > by providing 'static inline' helper functions doing nothing in > include/linux/phy_fixed.h for all its APIs. However, three out > of the four users (DSA, BCMGENET, and SYSTEMPORT) always > 'select FIXED_PHY', presumably because they need that. > MVNETA is the fourth one, and if that is built-in but FIXED_PHY > is configured as a loadable module, we get a link error: > > drivers/built-in.o: In function `mvneta_fixed_link_update': > fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state' > > Presumably this driver has the same dependency as the others, > so this patch also uses 'select' to ensure that the fixed-phy > support is built-in. > > Signed-off-by: Arnd Bergmann > Fixes: 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state > signaling") > --- > Found using ARM randconfig tests. An alternative here would be > to use 'depends on FIXED_PHY || FIXED_PHY=n', I picked the 'select' > approach for consistency. > > Should we perhaps make 'FIXED_PHY' a silent option and remove the > inline helpers, based on the assumption that a driver that wants these > will not work without them? This seems reasonable to fix the problem for the time being, applied. Thanks Arnd. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: caif: check return value of alloc_netdev
From: Rasmus Villemoes Date: Mon, 9 Nov 2015 13:19:10 +0100 > I don't know if dev can actually be NULL here, but the test should be > above alloc_netdev(), to avoid leaking the struct net_device in case > dev is actually NULL. And of course the return value from alloc_netdev > should be tested. > > Signed-off-by: Rasmus Villemoes > --- > Maybe the existing code was supposed to be "if (!ndev)", and dev > cannot be NULL, but then -ENODEV is a slightly odd return > value. Doing both tests seems to be the safe choice. This is made painfully impossible to figure out because there seems to be no in-tree code instantiating platform_device objects for this caif_spi framework, nor are there even in-tree implementations of the init_xfer et al. required methods. So your fix is as good as any, applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
From: Jan Kara Date: Mon, 9 Nov 2015 13:11:26 +0100 > You can add > > Acked-by: Jan Kara > > for the UDF and fs/xattr.c parts. Please do not quote and entire large patch just to give an ACK. Just quote the minimum necessary context, which is usually just the commit message. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network stream fairness
On Mon, 2015-11-09 at 08:07 -0800, Eric Dumazet wrote: > Your numbers suggest a cwnd growth then, which might show a CC bug. > > Please run the following when your iper3 runs on regular 4.3 kernel > > for i in `seq 1 10` > do > ss -temoi dst 192.168.0.141 > sleep 1 > done > Another thing to try is to change tcp_limit_output_bytes back to 131072 Of course, your driver be the problem as well, but I am guessing it is not upstream ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
From: Geert Uytterhoeven Date: Mon, 9 Nov 2015 10:34:30 +0100 > If NO_DMA=y: ... > As this affects all of HNS_ENET, HNS_DSAF, HNS, HIX5HD2_GMAC, and > HIP04_ETH, add a dependency on HAS_DMA to the main NET_VENDOR_HISILICON > symbol to fix this. > > Signed-off-by: Geert Uytterhoeven Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network stream fairness
On Mon, 2015-11-09 at 16:53 +0100, Niklas Cassel wrote: > On 11/09/2015 04:50 PM, Eric Dumazet wrote: > > On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote: > >> I have a ethernet driver for a 100 Mbps NIC. > >> The NIC has dedicated hardware for offloading. > >> The driver has implemented TSO, GSO and BQL. > >> Since the CPU on the SoC is rather weak, I'd rather > >> not increase the CPU load by turning off offloading. > >> > >> Since commit > >> 605ad7f184b6 ("tcp: refine TSO autosizing") > >> > >> the bandwidth is no longer fair between streams. > >> see output at the end of the mail, where I'm testing with 2 streams. > >> > >> > >> If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream. > >> > >> I can also use vanilla 4.3 and do: > >> echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max > >> to also get a stable 45 Mbps per stream. > >> > >> My question is, am I supposed to set the BQL limit explicitly? > >> It is possible that I have missed something in my driver, > >> but my understanding is that the TCP stack sets and adjusts > >> the BQL limit automatically. > >> > >> > >> Perhaps the following info might help: > >> > >> After running iperf3 on vanilla 4.3: > >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ > >> limit 89908 > >> limit_max 1879048192 > >> > >> After running iperf3 on vanilla 4.3 + BQL explicitly set: > >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ > >> limit 3000 > >> limit_max 3000 > >> > >> After running iperf3 on 4.3 + 605ad7f184b6 reverted: > >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ > >> limit 8886 > >> limit_max 1879048192 > >> > > > > There is absolutely nothing ensuring fairness among multiple TCP flows. > > > > One TCP flow can very easily grab whole bandwidth for itself, there are > > numerous descriptions of this phenomena in various TCP studies. > > > > This is why we have packet schedulers ;) > > Oh.. How stupid of me, I forgot to mention.. all of the measurements were > done with fq_codel. Your numbers suggest a cwnd growth then, which might show a CC bug. Please run the following when your iper3 runs on regular 4.3 kernel for i in `seq 1 10` do ss -temoi dst 192.168.0.141 sleep 1 done -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
On Mon, Nov 09, 2015 at 08:56:10PM +0900, Tetsuo Handa wrote: > There are many locations that do > > if (memory_was_allocated_by_vmalloc) > vfree(ptr); > else > kfree(ptr); > > but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory > using is_vmalloc_addr(). Unless callers have special reasons, we can > replace this branch with kvfree(). Please check and reply if you found > problems. > > Signed-off-by: Tetsuo Handa > Acked-by: Michal Hocko > Cc: Russell King # arm Acked-by: Russell King In so far as this ARM specific change looks reasonable. Thanks. > diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c > index e62400e..492bf3e 100644 > --- a/arch/arm/mm/dma-mapping.c > +++ b/arch/arm/mm/dma-mapping.c > @@ -1200,10 +1200,7 @@ error: > while (i--) > if (pages[i]) > __free_pages(pages[i], 0); > - if (array_size <= PAGE_SIZE) > - kfree(pages); > - else > - vfree(pages); > + kvfree(pages); > return NULL; > } > > @@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, > struct page **pages, > size_t size, struct dma_attrs *attrs) > { > int count = size >> PAGE_SHIFT; > - int array_size = count * sizeof(struct page *); > int i; > > if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) { > @@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, > struct page **pages, > __free_pages(pages[i], 0); > } > > - if (array_size <= PAGE_SIZE) > - kfree(pages); > - else > - vfree(pages); > + kvfree(pages); > return 0; > } > -- FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network stream fairness
On 11/09/2015 04:50 PM, Eric Dumazet wrote: > On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote: >> I have a ethernet driver for a 100 Mbps NIC. >> The NIC has dedicated hardware for offloading. >> The driver has implemented TSO, GSO and BQL. >> Since the CPU on the SoC is rather weak, I'd rather >> not increase the CPU load by turning off offloading. >> >> Since commit >> 605ad7f184b6 ("tcp: refine TSO autosizing") >> >> the bandwidth is no longer fair between streams. >> see output at the end of the mail, where I'm testing with 2 streams. >> >> >> If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream. >> >> I can also use vanilla 4.3 and do: >> echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max >> to also get a stable 45 Mbps per stream. >> >> My question is, am I supposed to set the BQL limit explicitly? >> It is possible that I have missed something in my driver, >> but my understanding is that the TCP stack sets and adjusts >> the BQL limit automatically. >> >> >> Perhaps the following info might help: >> >> After running iperf3 on vanilla 4.3: >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ >> limit 89908 >> limit_max 1879048192 >> >> After running iperf3 on vanilla 4.3 + BQL explicitly set: >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ >> limit 3000 >> limit_max 3000 >> >> After running iperf3 on 4.3 + 605ad7f184b6 reverted: >> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ >> limit 8886 >> limit_max 1879048192 >> > > There is absolutely nothing ensuring fairness among multiple TCP flows. > > One TCP flow can very easily grab whole bandwidth for itself, there are > numerous descriptions of this phenomena in various TCP studies. > > This is why we have packet schedulers ;) Oh.. How stupid of me, I forgot to mention.. all of the measurements were done with fq_codel. > > tc qdisc replace dev eth0 root fq > > This will probably help : No need to change BQL settings, so that you > keep minimal latencies for the interactive traffic (like ping) > > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: network stream fairness
On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote: > I have a ethernet driver for a 100 Mbps NIC. > The NIC has dedicated hardware for offloading. > The driver has implemented TSO, GSO and BQL. > Since the CPU on the SoC is rather weak, I'd rather > not increase the CPU load by turning off offloading. > > Since commit > 605ad7f184b6 ("tcp: refine TSO autosizing") > > the bandwidth is no longer fair between streams. > see output at the end of the mail, where I'm testing with 2 streams. > > > If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream. > > I can also use vanilla 4.3 and do: > echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max > to also get a stable 45 Mbps per stream. > > My question is, am I supposed to set the BQL limit explicitly? > It is possible that I have missed something in my driver, > but my understanding is that the TCP stack sets and adjusts > the BQL limit automatically. > > > Perhaps the following info might help: > > After running iperf3 on vanilla 4.3: > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ > limit 89908 > limit_max 1879048192 > > After running iperf3 on vanilla 4.3 + BQL explicitly set: > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ > limit 3000 > limit_max 3000 > > After running iperf3 on 4.3 + 605ad7f184b6 reverted: > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ > limit 8886 > limit_max 1879048192 > There is absolutely nothing ensuring fairness among multiple TCP flows. One TCP flow can very easily grab whole bandwidth for itself, there are numerous descriptions of this phenomena in various TCP studies. This is why we have packet schedulers ;) tc qdisc replace dev eth0 root fq This will probably help : No need to change BQL settings, so that you keep minimal latencies for the interactive traffic (like ping) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
network stream fairness
I have a ethernet driver for a 100 Mbps NIC. The NIC has dedicated hardware for offloading. The driver has implemented TSO, GSO and BQL. Since the CPU on the SoC is rather weak, I'd rather not increase the CPU load by turning off offloading. Since commit 605ad7f184b6 ("tcp: refine TSO autosizing") the bandwidth is no longer fair between streams. see output at the end of the mail, where I'm testing with 2 streams. If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream. I can also use vanilla 4.3 and do: echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max to also get a stable 45 Mbps per stream. My question is, am I supposed to set the BQL limit explicitly? It is possible that I have missed something in my driver, but my understanding is that the TCP stack sets and adjusts the BQL limit automatically. Perhaps the following info might help: After running iperf3 on vanilla 4.3: /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ limit 89908 limit_max 1879048192 After running iperf3 on vanilla 4.3 + BQL explicitly set: /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ limit 3000 limit_max 3000 After running iperf3 on 4.3 + 605ad7f184b6 reverted: /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ limit 8886 limit_max 1879048192 Vanilla 4.3, no BQL limit explicitly set. iperf3 -R -c 192.168.0.141 -t 240 -P 2 Connecting to host 192.168.0.141, port 5201 Reverse mode, remote host 192.168.0.141 is sending [ 4] local 192.168.0.1 port 50902 connected to 192.168.0.141 port 5201 [ 6] local 192.168.0.1 port 50904 connected to 192.168.0.141 port 5201 [ ID] Interval Transfer Bandwidth [ 4] 0.00-1.00 sec 7.24 MBytes 60.7 Mbits/sec [ 6] 0.00-1.00 sec 3.56 MBytes 29.9 Mbits/sec [SUM] 0.00-1.00 sec 10.8 MBytes 90.6 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 1.00-2.00 sec 7.40 MBytes 62.1 Mbits/sec [ 6] 1.00-2.00 sec 3.40 MBytes 28.5 Mbits/sec [SUM] 1.00-2.00 sec 10.8 MBytes 90.6 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 2.00-3.00 sec 7.45 MBytes 62.5 Mbits/sec [ 6] 2.00-3.00 sec 3.36 MBytes 28.2 Mbits/sec [SUM] 2.00-3.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 3.00-4.00 sec 7.72 MBytes 64.8 Mbits/sec [ 6] 3.00-4.00 sec 3.08 MBytes 25.9 Mbits/sec [SUM] 3.00-4.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 4.00-5.00 sec 8.01 MBytes 67.2 Mbits/sec [ 6] 4.00-5.00 sec 2.80 MBytes 23.5 Mbits/sec [SUM] 4.00-5.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 5.00-6.00 sec 7.91 MBytes 66.3 Mbits/sec [ 6] 5.00-6.00 sec 2.90 MBytes 24.3 Mbits/sec [SUM] 5.00-6.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 6.00-7.00 sec 7.77 MBytes 65.2 Mbits/sec [ 6] 6.00-7.00 sec 3.04 MBytes 25.5 Mbits/sec [SUM] 6.00-7.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 7.00-8.00 sec 6.99 MBytes 58.6 Mbits/sec [ 6] 7.00-8.00 sec 3.82 MBytes 32.0 Mbits/sec [SUM] 7.00-8.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 8.00-9.00 sec 7.08 MBytes 59.4 Mbits/sec [ 6] 8.00-9.00 sec 3.73 MBytes 31.3 Mbits/sec [SUM] 8.00-9.00 sec 10.8 MBytes 90.7 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 9.00-10.00 sec 6.93 MBytes 58.1 Mbits/sec [ 6] 9.00-10.00 sec 3.88 MBytes 32.6 Mbits/sec [SUM] 9.00-10.00 sec 10.8 MBytes 90.7 Mbits/sec -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND] net: Documentation: Fix default value tcp_limit_output_bytes
Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit") updated default value for tcp_limit_output_bytes Signed-off-by: Niklas Cassel --- Already sent the patch to linux-...@vger.kernel.org, but sending it to netdev@vger.kernel.org too. Documentation/networking/ip-sysctl.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 05915be..2ea4c45 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -709,7 +709,7 @@ tcp_limit_output_bytes - INTEGER typical pfifo_fast qdiscs. tcp_limit_output_bytes limits the number of bytes on qdisc or device to reduce artificial RTT/cwnd and reduce bufferbloat. - Default: 131072 + Default: 262144 tcp_challenge_ack_limit - INTEGER Limits number of Challenge ACK sent per second, as recommended -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] unix: avoid use-after-free in ep_remove_wait_queue
An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. Signed-off-by: Rainer Weikusuat --- "Why do things always end up messy and complicated"? Patch is against 4.3.0. diff --git a/include/net/af_unix.h b/include/net/af_unix.h index b36d837..2a91a05 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -62,6 +62,7 @@ struct unix_sock { #define UNIX_GC_CANDIDATE 0 #define UNIX_GC_MAYBE_CYCLE1 struct socket_wqpeer_wq; + wait_queue_tpeer_wake; }; static inline struct unix_sock *unix_sk(const struct sock *sk) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 94f6582..4f263e3 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -326,6 +326,122 @@ found: return s; } +/* Support code for asymmetrically connected dgram sockets + * + * If a datagram socket is connected to a socket not itself connected + * to the first socket (eg, /dev/log), clients may only enqueue more + * messages if the present receive queue of the server socket is not + * "too large". This means there's a second writeability condition + * poll and sendmsg need to test. The dgram recv code will do a wake + * up on the peer_wait wait queue of a socket upon reception of a + * datagram which needs to be propagated to sleeping would-be writers + * since these might not have sent anything so far. This can't be + * accomplished via poll_wait because the lifetime of the server + * socket might be less than that of its clients if these break their + * association with it or if the server socket is closed while clients + * are still connected to it and there's no way to inform "a polling + * implementation" that it should let go of a certain wait queue + * + * In order to propagate a wake up, a wait_queue_t of the client + * socket is enqueued on the peer_wait queue of the server socket + * whose wake function does a wake_up on the ordinary client socket + * wait queue. This connection is established whenever a write (or + * poll for write) hit the flow control condition and broken when the + * association to the server socket is dissolved or after a wake up + * was relayed. + */ + +static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags, + void *key) +{ + struct unix_sock *u; + wait_queue_head_t *u_sleep; + + u = container_of(q, struct unix_sock, peer_wake); + + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, + &u->peer_wake); + u->peer_wake.private = NULL; + + /* relaying can only happen while the wq still exists */ + u_sleep = sk_sleep(&u->sk); + if (u_sleep) + wake_up_interruptible_poll(u_sleep, key); + + return 0; +} + +static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other) +{ + struct unix_sock *u, *u_other; + int rc; + + u = unix_sk(sk); + u_other = unix_sk(other); + rc = 0; + + spin_lock(&u_other->peer_wait.lock); + + if (!u->peer_wake.private) { + u->peer_wake.private = other; + __add_wait_queue(&u_other
Re: [PATCH] inet: delay address promotion check until last request in message
On Mon, Nov 09, 2015 at 07:31:38AM +0900, David Miller wrote: > From: Neil Horman > Date: Sun, 8 Nov 2015 16:56:41 -0500 > > > All I'm doing is, in effect masking the promote_secondaries sysctl > > for an interface doing a flush operation. > > Packets and other entities not controlled by RTNL can see an inconsistent > state: no primary address on the interface. > > You can argue that this happened already but usually the window is tiny. > > For the case you are optimizing for, the window is now several seconds > long. That's not really something we can do. > I agree with what you're saying in regards to the fact that an inconsistent state exists during this operation, but the time frame during which it exists is variable dependent on the number of addresses being removed, and is no different than the inconsistent state we allow by setting the promote_secondaries sysctl to 0. In my view we have a range of conditions bounded by the following two extreemes: a) User space is removing a single addresses from an arbitrarily sized set on an interface. b) User space is flushing every address from an arbitrarily sized set on an interface In extreeme condition (a), which is also the most common condition, my patch is effectively a no-op, because the netlink message see no next rtm_deladdr request, and so performs promotion as it currently does as part of that one transaction In extreeme condition (b), I would argue that the fact that our interface state is inconsistent is irrelevant. While its true that other applications may see that an interface has no primary address, they will in the very near future see that the interface has no address at all. Given that the error handling path for both conditions is often the same or very simmilar (tell the user to pick a different interface), I would say this isn't something we need to worry about. The only condition that we do need to worry about I think is the condition in which an administrator is removing M addresses from a large set of N addresses on an interface, where M < N but M and N are both sufficiently large to make the inconsistent state window significant. In that case we may encounter applications that see transient errors when searching for an interfaces primary address. But my question would then be, so what? Lots of administrative operations result in transient errors on other applications while they are in progress (an arbitrary sized set of iptables rules getting flushed and restored can result in the same transient error behavior, as an example). The bottom line in my mind is, given the condition described above, which is the lesser of evils? Having a window of time where applications may see an inconsistent state, and not be able to use an interface? Or having a much larger window of time (recall this patch reduces run time from O(n^2) to O(n)), during which no other rtnl aware operations can progress, leaving the interface and several other subsystems unusable. As a matter of philosophy, This can certainly be implemented in user space. We could have user space disable promotion only during a flush (either by coding it into iproute2 and simmilar applications, or by simply documenting the fact that for large flush operations promotion should be disabled. That opens us up to other races however (if multiple users are toggling the value of promote secondaries), and still gives rise to the issue you describe above, in which interfaces may have an inconsistent state during the flush operation. Regards Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT] Networking
On Mon, Nov 09 2015, Hannes Frederic Sowa wrote: > Hi, > > On Wed, Oct 28, 2015, at 15:27, Rasmus Villemoes wrote: >> >> I agree - proper overflow checking can be really hard. Quick, assuming a >> and b have the same unsigned integer type, is 'a+b> check overflow? Of course not (hint: promotion rules). And as you say, >> it gets even more complicated for signed types. >> >> A few months ago I tried posting a complete set of fallbacks for older >> compilers (https://lkml.org/lkml/2015/7/19/358), but nothing really >> happened. Now I know where Linus stands, so I guess I can just delete >> that branch. > > I actually like your approach of being type agnostic a bit more (in > comparison to static inline functions), mostly because of one specific > reason: > > The type agnostic __builtin_*_overflow function even do the correct > things if you deal with types smaller than int. Imagine e.g. you want to > add to unsigned chars a and b, If you read my mail again you'll see that I mentioned exactly this :-) so obviously I agree that this is a nice part of it. > unsigned char a, b; > if (a + b < a) > goto overflow; > else > a += b; > > The overflow condition will never trigger, as the comparisons will > always be done in the integer domain and a + b < a is never true. I > actually think that this is easy to overlook and the functions should > handle that. Yes. While people very rarely use local u8 or u16 variables for computations, I think one could imagine a and b being struct members, which for one reason or another happens to be of a type narrower than int (which would also make the issue much harder to spot since the struct definition is far away). Something like combine_packets(struct foo *a, const struct foo *b) { if (a->len + b->len < a->len) return -EOVERFLOW; /* ensure a->payload is big enough...*/ memcpy(a->payload + a->len, b->payload, b->len); a->len += b->len; ... } which, depending on details, would either lead to memory corruption or loss of parts of the packets. I haven't actually found any instance of this in the kernel, but that doesn't mean it couldn't get introduced (or that it doesn't exist). Aside: It turns out clang is smart enough to optimize away the broken overflow check, but gcc isn't. Neither issue a warning, despite the intention being rather clear. Rasmus -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
macvtap_do_read code calls macvtap_put_user while it might be set up to wait for the user. This results in the following warning: Jun 23 16:25:26 galen kernel: [ cut here ] Jun 23 16:25:26 galen kernel: WARNING: CPU: 0 PID: 30433 at kernel/sched/core.c: 7286 __might_sleep+0x7f/0x90() Jun 23 16:25:26 galen kernel: do not call blocking ops when !TASK_RUNNING; state =1 set at [] prepare_to_wait+0x2f/0x90 Jun 23 16:25:26 galen kernel: CPU: 0 PID: 30433 Comm: cat Not tainted 4.1.0-rc6+ #11 Jun 23 16:25:26 galen kernel: Call Trace: Jun 23 16:25:26 galen kernel: [] dump_stack+0x4c/0x65 Jun 23 16:25:26 galen kernel: [] warn_slowpath_common+0x8a/0xc 0 Jun 23 16:25:26 galen kernel: [] warn_slowpath_fmt+0x46/0x50 Jun 23 16:25:26 galen kernel: [] ? prepare_to_wait+0x2f/0x90 Jun 23 16:25:26 galen kernel: [] ? prepare_to_wait+0x2f/0x90 Jun 23 16:25:26 galen kernel: [] __might_sleep+0x7f/0x90 Jun 23 16:25:26 galen kernel: [] might_fault+0x55/0xb0 Jun 23 16:25:26 galen kernel: [] ? trace_hardirqs_on_caller+0x fd/0x1c0 Jun 23 16:25:26 galen kernel: [] copy_to_iter+0x7c/0x360 Jun 23 16:25:26 galen kernel: [] macvtap_do_read+0x256/0x3d0 [macvtap] Jun 23 16:25:26 galen kernel: [] ? prepare_to_wait_event+0x110/0x110 Jun 23 16:25:26 galen kernel: [] macvtap_read_iter+0x2b/0x50 [macvtap] Jun 23 16:25:26 galen kernel: [] __vfs_read+0xae/0xe0 Jun 23 16:25:26 galen kernel: [] vfs_read+0x86/0x140 Jun 23 16:25:26 galen kernel: [] SyS_read+0x49/0xb0 Jun 23 16:25:26 galen kernel: [] system_call_fastpath+0x12/0x76 Jun 23 16:25:26 galen kernel: ---[ end trace 22e33f67e70c0c2a ]--- Make sure thet we call finish_wait() if we have the skb to process before trying to actually process it. Signed-off-by: Vladislav Yasevich --- drivers/net/macvtap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 248478c..844a442 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -935,6 +935,9 @@ static ssize_t macvtap_do_read(struct macvtap_queue *q, /* Nothing to read, let's sleep */ schedule(); } + if (!noblock) + finish_wait(sk_sleep(&q->sk), &wait); + if (skb) { ret = macvtap_put_user(q, skb, to); if (unlikely(ret < 0)) @@ -942,8 +945,6 @@ static ssize_t macvtap_do_read(struct macvtap_queue *q, else consume_skb(skb); } - if (!noblock) - finish_wait(sk_sleep(&q->sk), &wait); return ret; } -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] mvneta: add FIXED_PHY dependency
The fixed_phy infrastructure is done in a way that is optional, by providing 'static inline' helper functions doing nothing in include/linux/phy_fixed.h for all its APIs. However, three out of the four users (DSA, BCMGENET, and SYSTEMPORT) always 'select FIXED_PHY', presumably because they need that. MVNETA is the fourth one, and if that is built-in but FIXED_PHY is configured as a loadable module, we get a link error: drivers/built-in.o: In function `mvneta_fixed_link_update': fpga-mgr.c:(.text+0x33ed80): undefined reference to `fixed_phy_update_state' Presumably this driver has the same dependency as the others, so this patch also uses 'select' to ensure that the fixed-phy support is built-in. Signed-off-by: Arnd Bergmann Fixes: 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state signaling") --- Found using ARM randconfig tests. An alternative here would be to use 'depends on FIXED_PHY || FIXED_PHY=n', I picked the 'select' approach for consistency. Should we perhaps make 'FIXED_PHY' a silent option and remove the inline helpers, based on the assumption that a driver that wants these will not work without them? diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig index 80af9ffce5ea..a1c862b4664d 100644 --- a/drivers/net/ethernet/marvell/Kconfig +++ b/drivers/net/ethernet/marvell/Kconfig @@ -44,6 +44,7 @@ config MVNETA tristate "Marvell Armada 370/38x/XP network interface support" depends on PLAT_ORION select MVMDIO + select FIXED_PHY ---help--- This driver supports the network interface units in the Marvell ARMADA XP, ARMADA 370 and ARMADA 38x SoC family. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next v2 7/8] openvswitch: Delay conntrack helper call for new connections.
On 06.11, Jarno Rajahalme wrote: > There is no need to help connections that are not confirmed, so we can > delay helping new connections to the time when they are confirmed. > This change is needed for NAT support, and having this as a separate > patch will make the following NAT patch a bit easier to review. For the first packet a helper receives the connection is always unconfirmed. It makes no sense to confirm it if the helper drops the packet. > Signed-off-by: Jarno Rajahalme > --- > net/openvswitch/conntrack.c | 20 +++- > 1 file changed, 15 insertions(+), 5 deletions(-) > > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c > index 7aa38fa..ba44287 100644 > --- a/net/openvswitch/conntrack.c > +++ b/net/openvswitch/conntrack.c > @@ -458,6 +458,7 @@ static bool skb_nfct_cached(struct net *net, > /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if > * not done already. Update key with new CT state after passing the packet > * through conntrack. > + * Note that invalid packets are accepted while the skb->nfct remains unset! > */ > static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key, > const struct ovs_conntrack_info *info, > @@ -468,7 +469,11 @@ static int __ovs_ct_lookup(struct net *net, struct > sw_flow_key *key, >* actually run the packet through conntrack twice unless it's for a >* different zone. >*/ > - if (!skb_nfct_cached(net, key, info, skb)) { > + bool cached = skb_nfct_cached(net, key, info, skb); > + enum ip_conntrack_info ctinfo; > + struct nf_conn *ct; > + > + if (!cached) { > struct nf_conn *tmpl = info->ct; > int err; > > @@ -491,11 +496,16 @@ static int __ovs_ct_lookup(struct net *net, struct > sw_flow_key *key, > return -ENOENT; > > ovs_ct_update_key(skb, key, true); > + } > > - if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) { > - WARN_ONCE(1, "helper rejected packet"); > - return -EINVAL; > - } > + /* Call the helper right after nf_conntrack_in() for confirmed > + * connections, but only when commiting for unconfirmed connections. > + */ > + ct = nf_ct_get(skb, &ctinfo); > + if (ct && (nf_ct_is_confirmed(ct) ? !cached : info->commit) > + && ovs_ct_helper(skb, info->family) != NF_ACCEPT) { > + WARN_ONCE(1, "helper rejected packet"); > + return -EINVAL; > } > > return 0; > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next v2 0/8] openvswitch: NAT support.
On 06.11, Jarno Rajahalme wrote: > This series adds NAT support to openvswitch kernel module. A few > changes are needed to the netfilter code to facilitate this (patches > 1-3/8). Patches 4-7 make the openvswitch kernel module ready for the > patch 8 that adds the NAT support for calling into netfilter NAT code > from the openvswitch conntrack action. I'm missing some high level description, especially how it is invoked, how it makes sure expectations of the NAT code about its invocation are met (it is my understanding that OVS simply invokes this based on actions specified by the user) and how it interacts with the remaining netfilter features. > Jarno Rajahalme (8): > netfilter: Remove IP_CT_NEW_REPLY definition. > netfilter: Factor out nf_ct_get_info(). > netfilter: Allow calling into nat helper without skb_dst. > openvswitch: Update the CT state key only after nf_conntrack_in(). > openvswitch: Find existing conntrack entry after upcall. > openvswitch: Handle NF_REPEAT in conntrack action. > openvswitch: Delay conntrack helper call for new connections. > openvswitch: Interface with NAT. > > include/net/netfilter/nf_conntrack.h | 15 + > include/uapi/linux/netfilter/nf_conntrack_common.h | 12 +- > include/uapi/linux/openvswitch.h | 47 ++ > net/ipv4/netfilter/nf_nat_l3proto_ipv4.c | 29 +- > net/ipv6/netfilter/nf_nat_l3proto_ipv6.c | 29 +- > net/netfilter/nf_conntrack_core.c | 22 +- > net/openvswitch/conntrack.c| 632 > +++-- > net/openvswitch/conntrack.h| 3 +- > 8 files changed, 686 insertions(+), 103 deletions(-) > > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux 4.2.4
On Mon, Nov 09, 2015 at 01:35:11PM +0100, Gerhard Wiesinger wrote: > On 08.11.2015 18:20, Greg KH wrote: > >That's great, can you let me know the git commits that end up in Linus's > >tree? That's what we need for the stable kernel. > > Find the commits here: > https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/ > https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/commit/?id=e75cb467df29a428612c162e6f1451c5c0717091 > > Don't know exactly the merging processes, so feel free to merge or contact > Pablo. I'll take care of that, this is already following its path to -stable, will take a little while though as usual. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux 4.2.4
On 08.11.2015 18:20, Greg KH wrote: On Sun, Nov 08, 2015 at 02:51:01PM +0100, Gerhard Wiesinger wrote: On 25.10.2015 17:29, Greg KH wrote: On Sun, Oct 25, 2015 at 11:48:54AM +0100, Gerhard Wiesinger wrote: On 25.10.2015 10:46, Willy Tarreau wrote: ipset *triggered* the problem. The whole stack dump would tell more. OK, find the stack traces in the bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1272645 Kernel 4.1.10 triggered also a kernel dump when playing with ipset commands and IPv6, details in the bug report Kernel 4.2 seems to me not well tested in the netfilter parts at all (Bug with already known bugfix https://lists.debian.org/debian-kernel/2015/10/msg00034.html was triggered on 2 of 3 of my machines, the new bug on 1 of 1 tested machine). There's a reason why Greg maintains stable and LTS kernels :-) Stable kernels don't crash but definiton. :-) At least triggered 2 kernel panics in 5min, even with 4.1.10 and ipset commands ... Does this happen also with Linus's tree? I suggest you ask the networking developers about this on netdev@vger.kernel.org, there's nothing that I can do on my own about this, sorry. Patch is now available, see: [PATCH 0/3] ipset patches for nf https://marc.info/?l=netfilter-devel&m=144690007708041&w=2 https://marc.info/?l=netfilter-devel&m=144690007808042&w=2 https://marc.info/?l=netfilter-devel&m=144690008608043&w=2 https://marc.info/?l=netfilter-devel&m=144690007708039&w=2 [ANNOUNCE] ipset 6.27 released https://marc.info/?l=netfilter-devel&m=144690048308099&w=2 Requires also new userland ipset version. Please integrate it upstream. Thanx to Jozsef Kadlecsik for fixing it. That's great, can you let me know the git commits that end up in Linus's tree? That's what we need for the stable kernel. Find the commits here: https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/ https://git.kernel.org/cgit/linux/kernel/git/pablo/nf.git/commit/?id=e75cb467df29a428612c162e6f1451c5c0717091 Don't know exactly the merging processes, so feel free to merge or contact Pablo. Ciao, Gerhard -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: caif: check return value of alloc_netdev
I don't know if dev can actually be NULL here, but the test should be above alloc_netdev(), to avoid leaking the struct net_device in case dev is actually NULL. And of course the return value from alloc_netdev should be tested. Signed-off-by: Rasmus Villemoes --- Maybe the existing code was supposed to be "if (!ndev)", and dev cannot be NULL, but then -ENODEV is a slightly odd return value. Doing both tests seems to be the safe choice. drivers/net/caif/caif_spi.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/caif/caif_spi.c b/drivers/net/caif/caif_spi.c index de3962014af7..4721948a92f6 100644 --- a/drivers/net/caif/caif_spi.c +++ b/drivers/net/caif/caif_spi.c @@ -730,11 +730,14 @@ int cfspi_spi_probe(struct platform_device *pdev) int res; dev = (struct cfspi_dev *)pdev->dev.platform_data; - ndev = alloc_netdev(sizeof(struct cfspi), "cfspi%d", - NET_NAME_UNKNOWN, cfspi_setup); if (!dev) return -ENODEV; + ndev = alloc_netdev(sizeof(struct cfspi), "cfspi%d", + NET_NAME_UNKNOWN, cfspi_setup); + if (!ndev) + return -ENOMEM; + cfspi = netdev_priv(ndev); netif_stop_queue(ndev); cfspi->ndev = ndev; -- 2.6.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/4] tty: Remove drivers' extra tty_ldisc_flush()
The tty_port_close_start() helper already flushes the tty and ldisc buffers on final close; tty drivers which use this helper need not repeat tty_ldisc_flush(). Signed-off-by: Peter Hurley --- drivers/char/pcmcia/synclink_cs.c | 3 --- drivers/tty/amiserial.c | 2 -- drivers/tty/rocket.c | 2 -- drivers/tty/serial/serial_core.c | 2 -- drivers/tty/synclink.c| 1 - drivers/tty/synclink_gt.c | 1 - drivers/tty/synclinkmp.c | 1 - drivers/tty/tty_port.c| 2 -- 8 files changed, 14 deletions(-) diff --git a/drivers/char/pcmcia/synclink_cs.c b/drivers/char/pcmcia/synclink_cs.c index 45df4bf..3f74677 100644 --- a/drivers/char/pcmcia/synclink_cs.c +++ b/drivers/char/pcmcia/synclink_cs.c @@ -2354,10 +2354,7 @@ static void mgslpc_close(struct tty_struct *tty, struct file * filp) mgslpc_wait_until_sent(tty, info->timeout); mgslpc_flush_buffer(tty); - - tty_ldisc_flush(tty); shutdown(info, tty); - tty_port_close_end(port, tty); tty_port_tty_set(port, NULL); cleanup: diff --git a/drivers/tty/amiserial.c b/drivers/tty/amiserial.c index 2caaf5a..bffc0a4 100644 --- a/drivers/tty/amiserial.c +++ b/drivers/tty/amiserial.c @@ -1420,8 +1420,6 @@ static void rs_close(struct tty_struct *tty, struct file * filp) } shutdown(tty, state); rs_flush_buffer(tty); - - tty_ldisc_flush(tty); port->tty = NULL; tty_port_close_end(port, tty); diff --git a/drivers/tty/rocket.c b/drivers/tty/rocket.c index a6b5ce0..5905200 100644 --- a/drivers/tty/rocket.c +++ b/drivers/tty/rocket.c @@ -1022,8 +1022,6 @@ static void rp_close(struct tty_struct *tty, struct file *filp) sClrDTR(cp); rp_flush_buffer(tty); - - tty_ldisc_flush(tty); clear_bit((info->aiop * 8) + info->chan, (void *) &xmit_flags[info->board]); diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c index db27a40..418587f 100644 --- a/drivers/tty/serial/serial_core.c +++ b/drivers/tty/serial/serial_core.c @@ -1439,8 +1439,6 @@ static void uart_close(struct tty_struct *tty, struct file *filp) wake_up_interruptible(&port->open_wait); mutex_unlock(&port->mutex); - - tty_ldisc_flush(tty); } static void uart_wait_until_sent(struct tty_struct *tty, int timeout) diff --git a/drivers/tty/synclink.c b/drivers/tty/synclink.c index 6188059..1334498 100644 --- a/drivers/tty/synclink.c +++ b/drivers/tty/synclink.c @@ -3099,7 +3099,6 @@ static void mgsl_close(struct tty_struct *tty, struct file * filp) if (info->port.flags & ASYNC_INITIALIZED) mgsl_wait_until_sent(tty, info->timeout); mgsl_flush_buffer(tty); - tty_ldisc_flush(tty); shutdown(info); mutex_unlock(&info->port.mutex); diff --git a/drivers/tty/synclink_gt.c b/drivers/tty/synclink_gt.c index 5505ea8..1987fb4 100644 --- a/drivers/tty/synclink_gt.c +++ b/drivers/tty/synclink_gt.c @@ -729,7 +729,6 @@ static void close(struct tty_struct *tty, struct file *filp) if (info->port.flags & ASYNC_INITIALIZED) wait_until_sent(tty, info->timeout); flush_buffer(tty); - tty_ldisc_flush(tty); shutdown(info); mutex_unlock(&info->port.mutex); diff --git a/drivers/tty/synclinkmp.c b/drivers/tty/synclinkmp.c index fb00a06..fb17dac 100644 --- a/drivers/tty/synclinkmp.c +++ b/drivers/tty/synclinkmp.c @@ -816,7 +816,6 @@ static void close(struct tty_struct *tty, struct file *filp) wait_until_sent(tty, info->timeout); flush_buffer(tty); - tty_ldisc_flush(tty); shutdown(info); mutex_unlock(&info->port.mutex); diff --git a/drivers/tty/tty_port.c b/drivers/tty/tty_port.c index ecb6435..c30525a 100644 --- a/drivers/tty/tty_port.c +++ b/drivers/tty/tty_port.c @@ -503,8 +503,6 @@ void tty_port_close_end(struct tty_port *port, struct tty_struct *tty) { unsigned long flags; - tty_ldisc_flush(tty); - spin_lock_irqsave(&port->lock, flags); if (port->blocked_open) { -- 2.6.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/4] n_tty: Ignore all read data when closing
On final port close (and thus final tty close), only output flow control requests in the input data should be processed. Ignore all other input data, including parity errors, overruns and breaks. Signed-off-by: Peter Hurley --- drivers/tty/n_tty.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c index bc613b8..2de0283 100644 --- a/drivers/tty/n_tty.c +++ b/drivers/tty/n_tty.c @@ -1564,8 +1564,6 @@ n_tty_receive_buf_closing(struct tty_struct *tty, const unsigned char *cp, flag = *fp++; if (likely(flag == TTY_NORMAL)) n_tty_receive_char_closing(tty, *cp++); - else - n_tty_receive_char_flagged(tty, *cp++, flag); } } -- 2.6.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/4] tty: Abstract and encapsulate tty->closing behavior
tty->closing is exclusively used to cause the N_TTY line discipline to drop further input on final tty close (except XON/XOFF output flow control changes). In turn, this prevents the line discipline from generating new tty driver i/o requests (eg., when echoing) after the tty driver has performed h/w shutdown. Abstract this notification with new ldisc api function, tty_ldisc_closing(), which invokes the new line discipline method, ops->closing(). Define this method for N_TTY line discipline and localize closing state to n_tty private data. Remove tty->closing. Note that resetting tty->closing to 0 (and thus allowing the line discipline to resume normal input processing) is unnecessary and undesirable: since the tty is in final close, both the line discipline instance and the tty are being destroyed, so resuming normal input processing after h/w shutdown is counter-productive. NB: ipwireless_tty_free() is completely bogus; freeing the tty (?!) with open, in-use file descriptors is laughable. Signed-off-by: Peter Hurley --- v2: Fixed tty_ldisc_closing() ld use found by Johannes Stezenbach drivers/isdn/i4l/isdn_tty.c | 2 +- drivers/s390/char/con3215.c | 3 +-- drivers/staging/dgap/dgap.c | 4 +--- drivers/staging/dgnc/dgnc_tty.c | 4 +--- drivers/tty/hvc/hvsi.c | 2 +- drivers/tty/ipwireless/tty.c | 1 - drivers/tty/n_tty.c | 13 +++-- drivers/tty/rocket.c | 1 - drivers/tty/serial/68328serial.c | 3 +-- drivers/tty/serial/crisv10.c | 3 +-- drivers/tty/serial/serial_core.c | 1 - drivers/tty/tty_ldisc.c | 20 drivers/tty/tty_port.c | 3 +-- include/linux/tty.h | 2 +- include/linux/tty_ldisc.h| 9 + 15 files changed, 49 insertions(+), 22 deletions(-) diff --git a/drivers/isdn/i4l/isdn_tty.c b/drivers/isdn/i4l/isdn_tty.c index 2175225..cddba25 100644 --- a/drivers/isdn/i4l/isdn_tty.c +++ b/drivers/isdn/i4l/isdn_tty.c @@ -1574,7 +1574,7 @@ isdn_tty_close(struct tty_struct *tty, struct file *filp) } port->flags |= ASYNC_CLOSING; - tty->closing = 1; + tty_ldisc_closing(tty); /* * At this point we stop accepting input. To do this, we * disable the receive line status interrupts, and tell the diff --git a/drivers/s390/char/con3215.c b/drivers/s390/char/con3215.c index 0fc3fe5..715251d 100644 --- a/drivers/s390/char/con3215.c +++ b/drivers/s390/char/con3215.c @@ -1006,11 +1006,10 @@ static void tty3215_close(struct tty_struct *tty, struct file * filp) raw = (struct raw3215_info *) tty->driver_data; if (raw == NULL || tty->count > 1) return; - tty->closing = 1; + tty_ldisc_closing(tty); /* Shutdown the terminal */ raw3215_shutdown(raw); tasklet_kill(&raw->tlet); - tty->closing = 0; tty_port_tty_set(&raw->port, NULL); } diff --git a/drivers/staging/dgap/dgap.c b/drivers/staging/dgap/dgap.c index 9112dd2..0456e28 100644 --- a/drivers/staging/dgap/dgap.c +++ b/drivers/staging/dgap/dgap.c @@ -4622,7 +4622,7 @@ static void dgap_tty_close(struct tty_struct *tty, struct file *file) un->un_flags |= UN_CLOSING; - tty->closing = 1; + tty_ldisc_closing(tty); /* * Only officially close channel if count is 0 and @@ -4645,8 +4645,6 @@ static void dgap_tty_close(struct tty_struct *tty, struct file *file) spin_lock_irqsave(&ch->ch_lock, lock_flags); - tty->closing = 0; - /* * If we have HUPCL set, lower DTR and RTS */ diff --git a/drivers/staging/dgnc/dgnc_tty.c b/drivers/staging/dgnc/dgnc_tty.c index fbfe79a..96960d8 100644 --- a/drivers/staging/dgnc/dgnc_tty.c +++ b/drivers/staging/dgnc/dgnc_tty.c @@ -1410,7 +1410,7 @@ static void dgnc_tty_close(struct tty_struct *tty, struct file *file) /* OK, its the last close on the unit */ un->un_flags |= UN_CLOSING; - tty->closing = 1; + tty_ldisc_closing(tty); /* @@ -1441,8 +1441,6 @@ static void dgnc_tty_close(struct tty_struct *tty, struct file *file) spin_lock_irqsave(&ch->ch_lock, flags); - tty->closing = 0; - /* * If we have HUPCL set, lower DTR and RTS */ diff --git a/drivers/tty/hvc/hvsi.c b/drivers/tty/hvc/hvsi.c index a75146f..fbaa6ab 100644 --- a/drivers/tty/hvc/hvsi.c +++ b/drivers/tty/hvc/hvsi.c @@ -796,7 +796,7 @@ static void hvsi_close(struct tty_struct *tty, struct file *filp) * any data delivered to the tty layer after this will be * discarded (except for XON/XOFF) */ - tty->closing = 1; + tty_ldisc_closing(tty); spin_unlock_irqrestore(&hp->lock, flags); diff --git a/drivers/tt
[PATCH v2 1/4] tty: rocket: Remove private close_wait
This driver's private completion variable, close_wait, is no longer used for wait since "tty: Remove ASYNC_CLOSING checks in open()/hangup"; remove. Signed-off-by: Peter Hurley --- drivers/tty/rocket.c | 2 -- drivers/tty/rocket_int.h | 1 - 2 files changed, 3 deletions(-) diff --git a/drivers/tty/rocket.c b/drivers/tty/rocket.c index 802eac7..c5179f8 100644 --- a/drivers/tty/rocket.c +++ b/drivers/tty/rocket.c @@ -643,7 +643,6 @@ static void init_r_port(int board, int aiop, int chan, struct pci_dev *pci_dev) info->chan = chan; tty_port_init(&info->port); info->port.ops = &rocket_port_ops; - init_completion(&info->close_wait); info->flags &= ~ROCKET_MODE_MASK; switch (pc104[board][line]) { case 422: @@ -1049,7 +1048,6 @@ static void rp_close(struct tty_struct *tty, struct file *filp) mutex_unlock(&port->mutex); tty_port_tty_set(port, NULL); - complete_all(&info->close_wait); atomic_dec(&rp_num_ports_open); #ifdef ROCKET_DEBUG_OPEN diff --git a/drivers/tty/rocket_int.h b/drivers/tty/rocket_int.h index 67e0f1e..ef1e1be 100644 --- a/drivers/tty/rocket_int.h +++ b/drivers/tty/rocket_int.h @@ -1144,7 +1144,6 @@ struct r_port { int read_status_mask; int cps; - struct completion close_wait; /* Not yet matching the core */ spinlock_t slock; struct mutex write_mtx; }; -- 2.6.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/4] Replace tty->closing
Hi Greg, This series cleans up a messy and poorly documented mechanism required at tty final close to prevent drivers from crashing after h/w shutdown. Without special handling, N_TTY echoing will cause driver i/o requests _after_ h/w shutdown, which typically crashes the driver. Currently, the tty_struct::closing flag triggers this special handling. However, this mechanism is error-prone and subject to driver misuse. This series replaces tty->closing with a ldisc-specific interface, tty_ldisc_closing(), and implements this interface for N_TTY. For tty drivers which use tty_port_close_start(), this change eliminates the last vestige of direct driver<->ldisc interaction. The few tty drivers which open-code the close() method [1] still use tty_ldisc_closing() directly. The tty driver is aware final close for the tty has commenced because the tty->count == 1 in the close() method. On final close, the following is also true: 1. port->count == 1. tty drivers which ref count the port, use the --port->count == 0 as a substitute condition for final close. 2. final close is occurring as a result of the last in-use file descriptor release. Consequently, there will be no read/poll/ioctl in-progress. 3. the line discipline instance will be stopped and destroyed immediately after the tty driver completes the close() method 4. the tty itself will be unrefed immediately after the line discipline instance is destroyed. Thus, the ldisc and tty buffers need only be flushed once, as any data received by the tty driver after the flush but before h/w shutdown will be deleted when the line discipline instance is destroyed. No new echoes will occur after the ldisc flush because the echo buffer is also flushed and new input (which otherwise might generate echoes) is ignored while closing. This series removes the extra tty_ldisc_flush() being performed by most drivers after h/w shutdown. Additionally, the ldisc closing state need not be reset since the line discipline instance is being destroyed, so no interface is provided to reset closing. [1] tty drivers which open-code the close() method drivers/staging/dgnc/dgnc_tty.c drivers/staging/dgap/dgap.c drivers/tty/hvc/hvsi.c drivers/tty/serial/68328serial.c drivers/tty/serial/crisv10.c drivers/isdn/i4l/isdn_tty.c drivers/s390/char/con3215.c Changes to v2: * Fixed tty_ldisc_closing() ld use found by Johannes Stezenbach Regards, Peter Hurley (4): tty: rocket: Remove private close_wait n_tty: Ignore all read data when closing tty: Abstract and encapsulate tty->closing behavior tty: Remove drivers' extra tty_ldisc_flush() drivers/char/pcmcia/synclink_cs.c | 3 --- drivers/isdn/i4l/isdn_tty.c | 2 +- drivers/s390/char/con3215.c | 3 +-- drivers/staging/dgap/dgap.c | 4 +--- drivers/staging/dgnc/dgnc_tty.c | 4 +--- drivers/tty/amiserial.c | 2 -- drivers/tty/hvc/hvsi.c| 2 +- drivers/tty/ipwireless/tty.c | 1 - drivers/tty/n_tty.c | 15 +++ drivers/tty/rocket.c | 5 - drivers/tty/rocket_int.h | 1 - drivers/tty/serial/68328serial.c | 3 +-- drivers/tty/serial/crisv10.c | 3 +-- drivers/tty/serial/serial_core.c | 3 --- drivers/tty/synclink.c| 1 - drivers/tty/synclink_gt.c | 1 - drivers/tty/synclinkmp.c | 1 - drivers/tty/tty_ldisc.c | 20 drivers/tty/tty_port.c| 5 + include/linux/tty.h | 2 +- include/linux/tty_ldisc.h | 9 + 21 files changed, 49 insertions(+), 41 deletions(-) -- 2.6.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
On Mon 09-11-15 20:56:10, Tetsuo Handa wrote: > There are many locations that do > > if (memory_was_allocated_by_vmalloc) > vfree(ptr); > else > kfree(ptr); > > but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory > using is_vmalloc_addr(). Unless callers have special reasons, we can > replace this branch with kvfree(). Please check and reply if you found > problems. > > Signed-off-by: Tetsuo Handa > Acked-by: Michal Hocko You can add Acked-by: Jan Kara for the UDF and fs/xattr.c parts. Honza > Cc: Russell King # arm > Cc: # apei > Cc: # drbd > Cc: # mspec > Cc: # drm > Cc: Oleg Drokin # lustre > Cc: Andreas Dilger # lustre > Cc: # coda > Cc: # jffs2 > Cc: Jan Kara # udf > Cc: # xattr > Cc: # ipc + mm > Cc: # ipv4 > --- > arch/arm/mm/dma-mapping.c | 11 ++-- > drivers/acpi/apei/erst.c | 6 ++-- > drivers/block/drbd/drbd_bitmap.c | 26 + > drivers/block/drbd/drbd_int.h | 3 -- > drivers/char/mspec.c | 15 ++ > drivers/gpu/drm/drm_hashtab.c | 5 +--- > .../lustre/include/linux/libcfs/libcfs_private.h | 8 ++ > fs/coda/coda_linux.h | 3 +- > fs/jffs2/build.c | 8 ++ > fs/jffs2/fs.c | 5 +--- > fs/jffs2/super.c | 5 +--- > fs/udf/super.c | 7 + > fs/xattr.c | 33 > ++ > ipc/sem.c | 2 +- > ipc/util.c | 11 ++-- > ipc/util.h | 2 +- > mm/percpu.c| 18 +--- > mm/vmalloc.c | 5 +--- > net/ipv4/fib_trie.c| 4 +-- > 19 files changed, 46 insertions(+), 131 deletions(-) > > diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c > index e62400e..492bf3e 100644 > --- a/arch/arm/mm/dma-mapping.c > +++ b/arch/arm/mm/dma-mapping.c > @@ -1200,10 +1200,7 @@ error: > while (i--) > if (pages[i]) > __free_pages(pages[i], 0); > - if (array_size <= PAGE_SIZE) > - kfree(pages); > - else > - vfree(pages); > + kvfree(pages); > return NULL; > } > > @@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, > struct page **pages, > size_t size, struct dma_attrs *attrs) > { > int count = size >> PAGE_SHIFT; > - int array_size = count * sizeof(struct page *); > int i; > > if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) { > @@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, > struct page **pages, > __free_pages(pages[i], 0); > } > > - if (array_size <= PAGE_SIZE) > - kfree(pages); > - else > - vfree(pages); > + kvfree(pages); > return 0; > } > > diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c > index 6682c5d..6e6bc10 100644 > --- a/drivers/acpi/apei/erst.c > +++ b/drivers/acpi/apei/erst.c > @@ -32,6 +32,7 @@ > #include > #include > #include > +#include /* kvfree() */ > #include > > #include "apei-internal.h" > @@ -532,10 +533,7 @@ retry: > return -ENOMEM; > memcpy(new_entries, entries, > erst_record_id_cache.len * sizeof(entries[0])); > - if (erst_record_id_cache.size < PAGE_SIZE) > - kfree(entries); > - else > - vfree(entries); > + kvfree(entries); > erst_record_id_cache.entries = entries = new_entries; > erst_record_id_cache.size = new_size; > } > diff --git a/drivers/block/drbd/drbd_bitmap.c > b/drivers/block/drbd/drbd_bitmap.c > index 9462d27..2daaafb 100644 > --- a/drivers/block/drbd/drbd_bitmap.c > +++ b/drivers/block/drbd/drbd_bitmap.c > @@ -364,12 +364,9 @@ static void bm_free_pages(struct page **pages, unsigned > long number) > } > } > > -static void bm_vk_free(void *ptr, int v) > +static inline void bm_vk_free(void *ptr) > { > - if (v) > - vfree(ptr); > - else > - kfree(ptr); > + kvfree(ptr); > } > > /* > @@ -379,7 +376,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap > *b, unsigned long want) > { > struct page **old_pages = b->bm_pages; > struct page **new_pages, *page; > - unsigned int i, bytes, vmalloced = 0; > + unsigned int i, bytes; > unsigned long have = b->bm_number_of_pages; > >
Re: [GIT] Networking
Hi, On Wed, Oct 28, 2015, at 15:27, Rasmus Villemoes wrote: > On Wed, Oct 28 2015, Hannes Frederic Sowa > wrote: > > > Hi Linus, > > > > On Wed, Oct 28, 2015, at 10:39, Linus Torvalds wrote: > >> Get rid of it. And I don't *ever* want to see that shit again. > > > > I don't want to give up on that this easily: > > > > In future I would like to see an interface like this. It is often hard > > to do correct overflow/wrap-around tests and it would be great if there > > are helper functions which could easily and without a lot of thinking be > > used by people to remove those problems from the kernel. > > I agree - proper overflow checking can be really hard. Quick, assuming a > and b have the same unsigned integer type, is 'a+b check overflow? Of course not (hint: promotion rules). And as you say, > it gets even more complicated for signed types. > > A few months ago I tried posting a complete set of fallbacks for older > compilers (https://lkml.org/lkml/2015/7/19/358), but nothing really > happened. Now I know where Linus stands, so I guess I can just delete > that branch. I actually like your approach of being type agnostic a bit more (in comparison to static inline functions), mostly because of one specific reason: The type agnostic __builtin_*_overflow function even do the correct things if you deal with types smaller than int. Imagine e.g. you want to add to unsigned chars a and b, unsigned char a, b; if (a + b < a) goto overflow; else a += b; The overflow condition will never trigger, as the comparisons will always be done in the integer domain and a + b < a is never true. I actually think that this is easy to overlook and the functions should handle that. The macro version does this quite nicely. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] tty: Abstract and encapsulate tty->closing behavior
On 11/09/2015 04:12 AM, Johannes Stezenbach wrote: > On Sun, Nov 08, 2015 at 05:02:52PM -0500, Peter Hurley wrote: >> +void tty_ldisc_closing(struct tty_struct *tty) >> +{ >> +struct tty_ldisc *ld = tty_ldisc_ref(tty); >> + >> +if (ld->ops->closing) >> +ld->ops->closing(tty); >> +if (ld) >> +tty_ldisc_deref(ld); >> +} > > That looks strange. Do you need to check ld _before_ dereferencing? Yes. Thanks for noticing. Regards, Peter Hurley -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: Russell King # arm Cc: # apei Cc: # drbd Cc: # mspec Cc: # drm Cc: Oleg Drokin # lustre Cc: Andreas Dilger # lustre Cc: # coda Cc: # jffs2 Cc: Jan Kara # udf Cc: # xattr Cc: # ipc + mm Cc: # ipv4 --- arch/arm/mm/dma-mapping.c | 11 ++-- drivers/acpi/apei/erst.c | 6 ++-- drivers/block/drbd/drbd_bitmap.c | 26 + drivers/block/drbd/drbd_int.h | 3 -- drivers/char/mspec.c | 15 ++ drivers/gpu/drm/drm_hashtab.c | 5 +--- .../lustre/include/linux/libcfs/libcfs_private.h | 8 ++ fs/coda/coda_linux.h | 3 +- fs/jffs2/build.c | 8 ++ fs/jffs2/fs.c | 5 +--- fs/jffs2/super.c | 5 +--- fs/udf/super.c | 7 + fs/xattr.c | 33 ++ ipc/sem.c | 2 +- ipc/util.c | 11 ++-- ipc/util.h | 2 +- mm/percpu.c| 18 +--- mm/vmalloc.c | 5 +--- net/ipv4/fib_trie.c| 4 +-- 19 files changed, 46 insertions(+), 131 deletions(-) diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index e62400e..492bf3e 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -1200,10 +1200,7 @@ error: while (i--) if (pages[i]) __free_pages(pages[i], 0); - if (array_size <= PAGE_SIZE) - kfree(pages); - else - vfree(pages); + kvfree(pages); return NULL; } @@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, struct page **pages, size_t size, struct dma_attrs *attrs) { int count = size >> PAGE_SHIFT; - int array_size = count * sizeof(struct page *); int i; if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) { @@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, struct page **pages, __free_pages(pages[i], 0); } - if (array_size <= PAGE_SIZE) - kfree(pages); - else - vfree(pages); + kvfree(pages); return 0; } diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c index 6682c5d..6e6bc10 100644 --- a/drivers/acpi/apei/erst.c +++ b/drivers/acpi/apei/erst.c @@ -32,6 +32,7 @@ #include #include #include +#include /* kvfree() */ #include #include "apei-internal.h" @@ -532,10 +533,7 @@ retry: return -ENOMEM; memcpy(new_entries, entries, erst_record_id_cache.len * sizeof(entries[0])); - if (erst_record_id_cache.size < PAGE_SIZE) - kfree(entries); - else - vfree(entries); + kvfree(entries); erst_record_id_cache.entries = entries = new_entries; erst_record_id_cache.size = new_size; } diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c index 9462d27..2daaafb 100644 --- a/drivers/block/drbd/drbd_bitmap.c +++ b/drivers/block/drbd/drbd_bitmap.c @@ -364,12 +364,9 @@ static void bm_free_pages(struct page **pages, unsigned long number) } } -static void bm_vk_free(void *ptr, int v) +static inline void bm_vk_free(void *ptr) { - if (v) - vfree(ptr); - else - kfree(ptr); + kvfree(ptr); } /* @@ -379,7 +376,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap *b, unsigned long want) { struct page **old_pages = b->bm_pages; struct page **new_pages, *page; - unsigned int i, bytes, vmalloced = 0; + unsigned int i, bytes; unsigned long have = b->bm_number_of_pages; BUG_ON(have == 0 && old_pages != NULL); @@ -401,7 +398,6 @@ static struct page **bm_realloc_pages(struct drbd_bitmap *b, unsigned long want) PAGE_KERNEL); if (!new_pages) return NULL; - vmalloced = 1; } if (want >= have) { @@ -411,7 +407,7 @@ static struct page **bm_realloc_pag
Re: [PATCH net 1/2] packet: do skb_probe_transport_header when we actually have data
On 11/07/2015 11:29 PM, David Miller wrote: From: Eric Dumazet Date: Sat, 07 Nov 2015 10:53:50 -0800 Well, imagine following scenario (a real one, as I use it all of time, thus how I discovered all trafgen traffic ends up on one slave only) Even if qdisc is bypassed on the bond0, the current handling does not prevent going to the slave qdiscs. Ok, depending upon the semantics Daniel intended, we may have to add a qdisc bypass boolean bit to SKBs. It was resembling pktgen to some extend, only to be more flexible in terms of defining packet payload (trafgen, I mean). Yes, pktgen has the same issue on that regard, hmm I'm not yet sure, though, if it's worth burning an extra skb bit for both cases. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AF_PACKET mmap() v4...
On 11/08/2015 05:27 AM, John Fastabend wrote: On 15-11-07 06:19 PM, Alexei Starovoitov wrote: On Thu, Nov 05, 2015 at 10:39:15AM +0100, Daniel Borkmann wrote: On 11/05/2015 10:07 AM, Arnd Bergmann wrote: On Thursday 05 November 2015 00:04:14 David Miller wrote: As part of fixing y2038 problems, Arnd is going to have to make a new version fo the AF_PACKET mmap() tpacker descriptors in order to extend the time values to 64-bit. would also be quite useful to add ability to attach metadata to packet from bpf program. Right now we can only trim the length. Would be great if program could compute something and pass it along with packet as metadata. Also most modern NICs can generate metadata using packet filters it would be nice to allow these to populate any metadata fields as well. Ethtool already has a flow classifier feature that could be easily extended once the stack has support. If I understand this correctly, that would be something independent from packet sockets, right? Attaching metadata to the skb could be currently done via mark, tc_index, tc_classid, priority, but I presume you mean something else. ;) Or, do you mean to push meta data into skb->data f.e. in front of the frame. As in having some sort of a 'dynamic-sized', reserved scratch space or stack at the head or tail of the skb (not visible to the network itself, but only to the local NIC)? It would be interesting if it could be used to interact with the NIC, as John says, e.g. from incoming side to place a tag or additional meta data there as a result of the NIC's flow classifier, which might then be read out from an eBPF program to perform further actions on the skb. If you even want to take this one step further for data center environments, there was the idea floating around [1], where you encapsulate a restricted set of instructions together with some scratch space between Ethernet header and payload. These "tiny packet programs" could then query switch meta data on the fly that is being stored into the scratch space. Of course, this requires vendor support, but this seems really powerful. [1] http://jvimal.github.io/tpp/ -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] bpf: fix trivial comment typo
On 05/11/15 16:32, David Miller wrote: From: Matthew Fernandez Date: Thu, 5 Nov 2015 11:09:52 +1100 bpf: fix trivial comment typo Signed-off-by: Matthew Fernandez This does not apply. It looks like your email client has corrupted the patch. Well I managed to foul that up nicely, didn't I? I don't know which tree you'd prefer this to go into, so I'll wait until net-next reopens to resubmit a corrected version. I'll try to be more thorough and avoid embarrassing myself and wasting your time for the third time running. Thanks for your patience. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT] Networking
Hello, Ingo Molnar writes: > * Linus Torvalds wrote: > >> Does anybody have any particular other "uhhuh, overflow in multiplication" >> issues in mind? Because the interface for a saturating multiplication (or >> addition, for that matter) would actually be much easier. And would be >> trivial >> to have as an inline asm for compatibility with older versions of gcc too. >> >> Then you could just do that jiffies conversion - or allocation, for that >> matter >> - without any special overflow handling at all. Doing >> >> buf = kmalloc(sat_mul(sizeof(x), nr), GFP_KERNEL); >> >> would just magically work. > > Exactly: saturation is the default behavior for many GPU vector/pixel > attributes > as well, to simplify and speed up the code and the hardware. I always wanted > our > ABIs to saturate instead of duplicating complexity with overflow failure > logic. I don't think saturation arithmetic is useful at all in the kernel as a replacement for overflow/wrap-around checks. Linus' example has a discrepancy between what the caller expects and the actual number of bytes allocated. Imagine sat_mul does the operation in signed char and kmalloc takes only signed chars as an argument, it could actually be a huge discrepancy that could lead to security vulnerabilities. The call should definitely error out here and not try to allocate memory of some different size and return it to the caller. > In the kernel the first point of failure is missing overflow checks. The > second > point of failure are buggy overflow checks. We can eliminate both if we just > use > safe operations that produce output that never exit the valid range. This > also > happens to result in the simplest code. We should start thinking of overflow > checks as rootkit enablers. Sorry, I don't understand that at all. sat_mul is a rootkit enabler, I fear. If you allocate a smalelr portion of memory as the caller actually asked for because of saturation logic, this definitely could lead to memory corruption and hard to diagnose bugs. > And note how much this simplifies review and static analysis: if this is the > dominant model used in new kernel code then the analysis (human or machine) > would > only have to ensure that no untrusted input values get multiplied (or added) > in an > unsafe way. It would not have to be able to understand and track any > 'overflow > logic' through a maze of return paths, and validate whether the 'overflow > logic' > is correct for all input parameter ranges... Sorry, I don't really understand that proposal. :/ > The flip side is marginally less ABI robustness: random input parameters due > to > memory corruption will just saturate and produce nonsensical results. I don't > think it's a big issue, and I also think the simplicity of input parameter > validation is _way_ more important than our behavior to random input - but > I've > been overruled in the past when trying to introduce saturating ABIs, so > saturation > is something people sometimes find inelegant. If those nonsensical results are memory corruptions I also don't agree. I think we need to be very much accurate when dealing with overflows. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT] Networking
Hello, Ingo Molnar writes: > * Linus Torvalds wrote: > >> Does anybody have any particular other "uhhuh, overflow in multiplication" >> issues in mind? Because the interface for a saturating multiplication (or >> addition, for that matter) would actually be much easier. And would be >> trivial >> to have as an inline asm for compatibility with older versions of gcc too. >> >> Then you could just do that jiffies conversion - or allocation, for that >> matter >> - without any special overflow handling at all. Doing >> >> buf = kmalloc(sat_mul(sizeof(x), nr), GFP_KERNEL); >> >> would just magically work. > > Exactly: saturation is the default behavior for many GPU vector/pixel > attributes > as well, to simplify and speed up the code and the hardware. I always wanted > our > ABIs to saturate instead of duplicating complexity with overflow failure > logic. I don't think saturation arithmetic is useful at all in the kernel as a replacement for overflow/wrap-around checks. Linus' example has a discrepancy between what the caller expects and the actual number of bytes allocated. Imagine sat_mul does the operation in signed char and kmalloc takes only signed chars as an argument, it could actually be a huge discrepancy that could lead to security vulnerabilities. The call should definitely error out here and not try to allocate memory of some different size and return it to the caller. > In the kernel the first point of failure is missing overflow checks. The > second > point of failure are buggy overflow checks. We can eliminate both if we just > use > safe operations that produce output that never exit the valid range. This > also > happens to result in the simplest code. We should start thinking of overflow > checks as rootkit enablers. Sorry, I don't understand that at all. sat_mul is a rootkit enabler, I fear. If you allocate a smalelr portion of memory as the caller actually asked for because of saturation logic, this definitely could lead to memory corruption and hard to diagnose bugs. > And note how much this simplifies review and static analysis: if this is the > dominant model used in new kernel code then the analysis (human or machine) > would > only have to ensure that no untrusted input values get multiplied (or added) > in an > unsafe way. It would not have to be able to understand and track any > 'overflow > logic' through a maze of return paths, and validate whether the 'overflow > logic' > is correct for all input parameter ranges... Sorry, I don't really understand that proposal. :/ > The flip side is marginally less ABI robustness: random input parameters due > to > memory corruption will just saturate and produce nonsensical results. I don't > think it's a big issue, and I also think the simplicity of input parameter > validation is _way_ more important than our behavior to random input - but > I've > been overruled in the past when trying to introduce saturating ABIs, so > saturation > is something people sometimes find inelegant. If those nonsensical results are memory corruptions I also don't agree. I think we need to be very much accurate when dealing with overflows. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: deadlock between setsockopt/getsockopt
On Sun, Nov 8, 2015 at 6:16 PM, Eric Dumazet wrote: > On Sun, 2015-11-08 at 11:15 +0100, Dmitry Vyukov wrote: >> Hello, >> >> I've got the following deadlock report on commit >> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5). >> >> >> [ INFO: possible circular locking dependency detected ] >> 4.3.0+ #39 Not tainted >> --- >> syzkaller_execu/18311 is trying to acquire lock: >> (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20 >> net/core/rtnetlink.c:70 >> >> but task is already holding lock: >> (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock >> include/net/sock.h:1477 >> (sk_lock-AF_INET){+.+.+.}, at: [] >> do_ip_getsockopt.part.9+0x111/0x1510 net/ipv4/ip_sockglue.c:1272 >> >> which lock already depends on the new lock. >> >> the existing dependency chain (in reverse order) is: >> >> -> #1 (sk_lock-AF_INET){+.+.+.}: >>[] lock_acquire+0x16d/0x2f0 >> kernel/locking/lockdep.c:3585 >>[] lock_sock_nested+0xb8/0x110 net/core/sock.c:2443 >>[< inline >] lock_sock include/net/sock.h:1477 >>[] do_ip_setsockopt.isra.12+0x193/0x2af0 >> net/ipv4/ip_sockglue.c:621 >>[] ip_setsockopt+0x3a/0xb0 >> net/ipv4/ip_sockglue.c:1202 >>[] tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2616 >>[] sock_common_setsockopt+0x95/0xd0 >> net/core/sock.c:2643 >>[< inline >] SYSC_setsockopt net/socket.c:1757 >>[] SyS_setsockopt+0x158/0x240 net/socket.c:1736 >>[] entry_SYSCALL_64_fastpath+0x31/0x9a >> arch/x86/entry/entry_64.S:187 >> >> -> #0 (rtnl_mutex){+.+.+.}: >>[< inline >] check_prev_add kernel/locking/lockdep.c:1853 >>[< inline >] check_prevs_add kernel/locking/lockdep.c:1958 >>[< inline >] validate_chain kernel/locking/lockdep.c:2144 >>[] __lock_acquire+0x36d9/0x40e0 >> kernel/locking/lockdep.c:3206 >>[] lock_acquire+0x16d/0x2f0 >> kernel/locking/lockdep.c:3585 >>[< inline >] __mutex_lock_common kernel/locking/mutex.c:518 >>[] mutex_lock_nested+0x9c/0x8f0 >> kernel/locking/mutex.c:618 >>[] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:70 >>[] ip_mc_msfget+0xe0/0x620 net/ipv4/igmp.c:2398 >>[] do_ip_getsockopt.part.9+0x405/0x1510 >> net/ipv4/ip_sockglue.c:1399 >>[< inline >] do_ip_getsockopt net/ipv4/ip_sockglue.c:1264 >>[] ip_getsockopt+0xa8/0x1c0 >> net/ipv4/ip_sockglue.c:1495 >>[] tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:2916 >>[] sock_common_getsockopt+0x95/0xd0 >> net/core/sock.c:2602 >>[< inline >] SYSC_getsockopt net/socket.c:1788 >>[] SyS_getsockopt+0x142/0x230 net/socket.c:1770 >> >> other info that might help us debug this: >> >> Possible unsafe locking scenario: >> >>CPU0CPU1 >> >> lock(sk_lock-AF_INET); >>lock(rtnl_mutex); >>lock(sk_lock-AF_INET); >> lock(rtnl_mutex); >> >> *** DEADLOCK *** >> >> 1 lock held by syzkaller_execu/18311: >> #0: (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock >> include/net/sock.h:1477 >> #0: (sk_lock-AF_INET){+.+.+.}, at: [] >> do_ip_getsockopt.part.9+0x111/0x1510 net/ipv4/ip_sockglue.c:1272 >> >> stack backtrace: >> CPU: 1 PID: 18311 Comm: syzkaller_execu Not tainted 4.3.0+ #39 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> 88005b647598 81aad406 845cb400 >> 84612200 845cb400 88005b6475e0 811ec511 >> 88005b6476e0 6c7d5800 88006c7d5fb0 88006c7d5fd2 >> Call Trace: >> [< inline >] __dump_stack lib/dump_stack.c:15 >> [] dump_stack+0x68/0x92 lib/dump_stack.c:50 >> [] print_circular_bug+0x2d1/0x390 >> kernel/locking/lockdep.c:1226 >> [< inline >] check_prev_add kernel/locking/lockdep.c:1853 >> [< inline >] check_prevs_add kernel/locking/lockdep.c:1958 >> [< inline >] validate_chain kernel/locking/lockdep.c:2144 >> [] __lock_acquire+0x36d9/0x40e0 >> kernel/locking/lockdep.c:3206 >> [] lock_acquire+0x16d/0x2f0 kernel/locking/lockdep.c:3585 >> [< inline >] __mutex_lock_common kernel/locking/mutex.c:518 >> [< inline >] __mutex_lock_common kernel/locking/mutex.c:518 >> [] mutex_lock_nested+0x9c/0x8f0 kernel/locking/mutex.c:618 >> [] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:70 >> [] ip_mc_msfget+0xe0/0x620 net/ipv4/igmp.c:2398 >> [] do_ip_getsockopt.part.9+0x405/0x1510 >> net/ipv4/ip_sockglue.c:1399 >> [< inline >] do_ip_getsockopt net/ipv4/ip_sockglue.c:1264 >> [] ip_getsockopt+0xa8/0x1c0 net/ipv4/ip_sockglue.c:1495 >> [] tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:2916 >> [] sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2602 >> [< inline >] SYSC_getsockopt net/socket.c:1788 >> [] SyS_getsockopt+0x142/0x230 net/socket.c:1770 >>
Re: [PATCH net-next ] net: ipv4: memset addr before calling copy_to_user()
Hello, On Mon, Nov 9, 2015, at 07:52, Loganaden Velvindron wrote: > zero addr before calling copy_to_user() > > Signed-off-by: Loganaden Velvindron > --- > net/ipv4/ip_sockglue.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c > index c3c359a..d7a5a8b 100644 > --- a/net/ipv4/ip_sockglue.c > +++ b/net/ipv4/ip_sockglue.c > @@ -1373,6 +1373,7 @@ static int do_ip_getsockopt(struct sock *sk, int > level, int optname, > case IP_MULTICAST_IF: > { > struct in_addr addr; > + memset(&addr, 0, sizeof(addr)); > len = min_t(unsigned int, len, sizeof(struct in_addr)); > addr.s_addr = inet->mc_addr; > release_sock(sk); There is no possibility we leak any unwanted data to user space here. If you are not sure if sizeof(addr) > sizeof(addr.s_addr) use a designated initializer: addr = { .s_addr = inet->mc_addr }; which clears all other non-initialized elements. But here we are very certain. We do not do defensive programming, we try to do logical things, and only logical things. — Eric Dumazet (Thanks to Dan Carpenter.) Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] man: Syntax and warning fixes
2015-11-07 7:55 GMT-02:00 Ville Skyttä : >> Signed-off-by and description ? > > Superseding patches sent separately. I suggest documenting the > requirement for Signed-off-by somewhere (README.devel?); Please read the Documentation/SubmittingPatches. Albino -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] tty: Abstract and encapsulate tty->closing behavior
On Sun, Nov 08, 2015 at 05:02:52PM -0500, Peter Hurley wrote: > +void tty_ldisc_closing(struct tty_struct *tty) > +{ > + struct tty_ldisc *ld = tty_ldisc_ref(tty); > + > + if (ld->ops->closing) > + ld->ops->closing(tty); > + if (ld) > + tty_ldisc_deref(ld); > +} That looks strange. Do you need to check ld _before_ dereferencing? Johannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
On Monday 09 November 2015 10:34:30 Geert Uytterhoeven wrote: > If NO_DMA=y: > > ERROR: "dma_set_mask" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_unmap_single" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_unmap_page" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_mapping_error" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_map_page" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_supported" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_map_single" > [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! > ERROR: "dma_set_mask" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] > undefined! > ERROR: "dma_supported" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] > undefined! > ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] > undefined! > ERROR: "dma_unmap_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] > undefined! > ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hns/hnae.ko] > undefined! > ERROR: "dma_map_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] > undefined! > ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] > undefined! > ERROR: "dma_alloc_coherent" > [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! > ERROR: "dma_mapping_error" > [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! > ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] > undefined! > ERROR: "dma_unmap_single" > [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! > ERROR: "dma_free_coherent" > [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! > ERROR: "dma_alloc_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] > undefined! > ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hip04_eth.ko] > undefined! > ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] > undefined! > ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] > undefined! > ERROR: "dma_free_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] > undefined! > > As this affects all of HNS_ENET, HNS_DSAF, HNS, HIX5HD2_GMAC, and > HIP04_ETH, add a dependency on HAS_DMA to the main NET_VENDOR_HISILICON > symbol to fix this. > > Signed-off-by: Geert Uytterhoeven > Acked-by: Arnd Bergmann -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
If NO_DMA=y: ERROR: "dma_set_mask" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_unmap_page" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_map_page" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_supported" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hns/hns_enet_drv.ko] undefined! ERROR: "dma_set_mask" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] undefined! ERROR: "dma_supported" [drivers/net/ethernet/hisilicon/hns/hns_dsaf.ko] undefined! ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] undefined! ERROR: "dma_unmap_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] undefined! ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hns/hnae.ko] undefined! ERROR: "dma_map_page" [drivers/net/ethernet/hisilicon/hns/hnae.ko] undefined! ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hns/hnae.ko] undefined! ERROR: "dma_alloc_coherent" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! ERROR: "dma_free_coherent" [drivers/net/ethernet/hisilicon/hix5hd2_gmac.ko] undefined! ERROR: "dma_alloc_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] undefined! ERROR: "dma_mapping_error" [drivers/net/ethernet/hisilicon/hip04_eth.ko] undefined! ERROR: "dma_map_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] undefined! ERROR: "dma_unmap_single" [drivers/net/ethernet/hisilicon/hip04_eth.ko] undefined! ERROR: "dma_free_coherent" [drivers/net/ethernet/hisilicon/hip04_eth.ko] undefined! As this affects all of HNS_ENET, HNS_DSAF, HNS, HIX5HD2_GMAC, and HIP04_ETH, add a dependency on HAS_DMA to the main NET_VENDOR_HISILICON symbol to fix this. Signed-off-by: Geert Uytterhoeven --- drivers/net/ethernet/hisilicon/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/Kconfig b/drivers/net/ethernet/hisilicon/Kconfig index f250dec488fd2a2b..74beb1867230c9cf 100644 --- a/drivers/net/ethernet/hisilicon/Kconfig +++ b/drivers/net/ethernet/hisilicon/Kconfig @@ -5,7 +5,8 @@ config NET_VENDOR_HISILICON bool "Hisilicon devices" default y - depends on OF && (ARM || ARM64 || COMPILE_TEST) + depends on OF && HAS_DMA + depends on ARM || ARM64 || COMPILE_TEST ---help--- If you have a network (Ethernet) card belonging to this class, say Y. -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] man: Syntax and warning fixes
On 11/07/2015 05:40 PM, David Ahern wrote: On 11/7/15 2:55 AM, Ville Skyttä wrote: On Sat, Nov 7, 2015 at 11:47 AM, Albino B Neto wrote: 2015-11-07 7:44 GMT-02:00 Ville Skyttä : --- man/man8/tc-bpf.8 | 2 +- man/man8/tipc-bearer.8| 4 ++-- man/man8/tipc-link.8 | 6 +++--- man/man8/tipc-media.8 | 4 ++-- man/man8/tipc-nametable.8 | 4 ++-- man/man8/tipc-node.8 | 4 ++-- man/man8/tipc-socket.8| 4 ++-- 7 files changed, 14 insertions(+), 14 deletions(-) diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8 index 2c02ab2..f5201d3 100644 --- a/man/man8/tc-bpf.8 +++ b/man/man8/tc-bpf.8 @@ -844,7 +844,7 @@ result in the default classid: Basically, such a minimal generator is equivalent to: Signed-off-by and description ? Superseding patches sent separately. I suggest documenting the requirement for Signed-off-by somewhere (README.devel?); git log shows lots of commits without it, especially a bunch of man page related ones about a week ago. I believe man page changes should cc linux-...@vger.kernel.org Not really needed, this is for iproute2, not man-pages project. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net 1/2] packet: do skb_probe_transport_header when we actually have data
On 11/09/2015 04:11 AM, David Miller wrote: From: Daniel Borkmann Date: Sun, 08 Nov 2015 01:33:56 +0100 Hmm, yeah, on a (only quick) look, it seems this is mostly needed for the virtio_net related code in packet_snd() / packet_recvmsg(), not handled in RX/TX ring paths actually. $ git grep -n gso_size net/packet/ net/packet/af_packet.c:2748: if (vnet_hdr.gso_size == 0) net/packet/af_packet.c:2825:skb_shinfo(skb)->gso_size = net/packet/af_packet.c:2826: __virtio16_to_cpu(vio_le(), vnet_hdr.gso_size); net/packet/af_packet.c:3219:vnet_hdr.gso_size = net/packet/af_packet.c:3220: __cpu_to_virtio16(vio_le(), sinfo->gso_size); Need to take a closer look on Monday. I think for complete safety, we need the transport header set for all SKBs once they hit the device. I know this is separate from the bugs you are trying to fix, but let's take care of this in this series ok? Ok, sure. I can add an extra one into this series removing the conditional. Thanks, Daniel -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT] Networking
* Linus Torvalds wrote: > Does anybody have any particular other "uhhuh, overflow in multiplication" > issues in mind? Because the interface for a saturating multiplication (or > addition, for that matter) would actually be much easier. And would be > trivial > to have as an inline asm for compatibility with older versions of gcc too. > > Then you could just do that jiffies conversion - or allocation, for that > matter > - without any special overflow handling at all. Doing > > buf = kmalloc(sat_mul(sizeof(x), nr), GFP_KERNEL); > > would just magically work. Exactly: saturation is the default behavior for many GPU vector/pixel attributes as well, to simplify and speed up the code and the hardware. I always wanted our ABIs to saturate instead of duplicating complexity with overflow failure logic. In the kernel the first point of failure is missing overflow checks. The second point of failure are buggy overflow checks. We can eliminate both if we just use safe operations that produce output that never exit the valid range. This also happens to result in the simplest code. We should start thinking of overflow checks as rootkit enablers. And note how much this simplifies review and static analysis: if this is the dominant model used in new kernel code then the analysis (human or machine) would only have to ensure that no untrusted input values get multiplied (or added) in an unsafe way. It would not have to be able to understand and track any 'overflow logic' through a maze of return paths, and validate whether the 'overflow logic' is correct for all input parameter ranges... The flip side is marginally less ABI robustness: random input parameters due to memory corruption will just saturate and produce nonsensical results. I don't think it's a big issue, and I also think the simplicity of input parameter validation is _way_ more important than our behavior to random input - but I've been overruled in the past when trying to introduce saturating ABIs, so saturation is something people sometimes find inelegant. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html