Re: Kill rtalloc()
On 21/08/14(Thu) 15:36, Martin Pieuchot wrote:

As you might have seen, I started looking at routing entries pointing to stale ifa's. This is a necessary step if we want to rely on our routing entries to do address lookups. This audit leads me to believe that the actual check used to determine if a cached route entry is still valid might not be enough. Plus, some lookups are completely missing such a check; see the recent cloning case.

Now, before digging into cached route validity, I'd like to simplify the interface used to do route lookups. In other words, I'd like to kill rtalloc() and rtalloc_noclone(). The diff below is part #1.

Apart from the fact that this change unifies all our route lookups, it also gives us the possibility to stop using a struct route (or similar) when it is not needed. This structure, used to cache a route, is not protocol agnostic and is responsible for a lot of switch() statements and code duplication. Search for struct route in net/pf.c if you want to see some examples yourself.

So I'm looking for careful reviewers, since the diff below removes the following check from every route lookup:

	if (ro->ro_rt && ro->ro_rt->rt_ifp && (ro->ro_rt->rt_flags & RTF_UP))
		return;

But apparently all the calls to rtalloc() and rtalloc_mpath() seem to be done after checking for a similar condition.

Ok? Nobody?
Index: net/pf.c
===================================================================
RCS file: /home/ncvs/src/sys/net/pf.c,v
retrieving revision 1.886
diff -u -p -r1.886 pf.c
--- net/pf.c	12 Aug 2014 15:29:33 -0000	1.886
+++ net/pf.c	21 Aug 2014 12:08:12 -0000
@@ -5567,7 +5567,7 @@ pf_route(struct mbuf **m, struct pf_rule
 	ro->ro_tableid = m0->m_pkthdr.ph_rtableid;
 
 	if (!r->rt) {
-		rtalloc(ro);
+		ro->ro_rt = rtalloc1(&ro->ro_dst, RT_REPORT, ro->ro_tableid);
 		if (ro->ro_rt == 0) {
 			ipstat.ips_noroute++;
 			goto bad;
Index: net/pfkeyv2.c
===================================================================
RCS file: /home/ncvs/src/sys/net/pfkeyv2.c,v
retrieving revision 1.133
diff -u -p -r1.133 pfkeyv2.c
--- net/pfkeyv2.c	12 Jul 2014 18:44:22 -0000	1.133
+++ net/pfkeyv2.c	21 Aug 2014 12:08:12 -0000
@@ -1569,7 +1569,8 @@ pfkeyv2_send(struct socket *socket, void
 		/* Set the rdomain that was obtained from the socket */
 		re.re_tableid = rdomain;
-		rtalloc((struct route *) &re);
+		re.re_rt = rtalloc1((struct sockaddr *)&re.re_dst, RT_REPORT,
+		    re.re_tableid);
 		if (re.re_rt != NULL) {
 			ipo = ((struct sockaddr_encap *) re.re_rt->rt_gateway)->sen_ipsp;
 			RTFREE(re.re_rt);
Index: net/radix_mpath.c
===================================================================
RCS file: /home/ncvs/src/sys/net/radix_mpath.c,v
retrieving revision 1.23
diff -u -p -r1.23 radix_mpath.c
--- net/radix_mpath.c	27 May 2014 19:38:15 -0000	1.23
+++ net/radix_mpath.c	21 Aug 2014 12:08:12 -0000
@@ -53,7 +53,7 @@
 #include <netinet6/ip6_var.h>
 #endif
 
-u_int32_t rn_mpath_hash(struct route *, u_int32_t *);
+u_int32_t rn_mpath_hash(struct sockaddr *, u_int32_t *);
 
 /*
  * give some jitter to hash, to avoid synchronization between routers
@@ -383,42 +383,37 @@ rn_mpath_reprio(struct radix_node *rn, i
 /*
  * allocate a route, potentially using multipath to select the peer.
  */
-void
-rtalloc_mpath(struct route *ro, u_int32_t *srcaddrp)
+struct rtentry *
+rtalloc_mpath(struct sockaddr *dst, u_int32_t *srcaddrp, u_int rtableid)
 {
+	struct rtentry *rt;
 #if defined(INET) || defined(INET6)
 	struct radix_node *rn;
 	int hash, npaths, threshold;
 #endif
 
-	/*
-	 * return a cached entry if it is still valid, otherwise we increase
-	 * the risk of disrupting local flows.
-	 */
-	if (ro->ro_rt && ro->ro_rt->rt_ifp && (ro->ro_rt->rt_flags & RTF_UP))
-		return;
-	ro->ro_rt = rtalloc1(ro->ro_dst, RT_REPORT, ro->ro_tableid);
+	rt = rtalloc1(dst, RT_REPORT, rtableid);
 
 	/* if the route does not exist or it is not multipath, don't care */
-	if (!ro->ro_rt || !(ro->ro_rt->rt_flags & RTF_MPATH))
-		return;
+	if (rt == NULL || !ISSET(rt->rt_flags, RTF_MPATH))
+		return (rt);
 
 	/* check if multipath routing is enabled for the specified protocol */
 	if (!(0
 #ifdef INET
-	    || (ipmultipath && ro->ro_dst.sa_family == AF_INET)
+	    || (ipmultipath && dst->sa_family == AF_INET)
 #endif
 #ifdef INET6
-	    || (ip6_multipath && ro->ro_dst.sa_family == AF_INET6)
+	    || (ip6_multipath && dst->sa_family == AF_INET6)
 #endif
 	    ))
-		return;
+		return (rt);
 
 #if defined(INET) || defined(INET6)
 	/* gw selection by Hash-Threshold (RFC 2992) */
-	rn = (struct radix_node
Re: minphys woes
On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:

On Fri, 29 Aug 2014, Mike Belopuhov wrote: correct me if i'm wrong, but what happens is that bread(), being a block read, reads up to MAXBSIZE, which is conveniently set to 64k, and you can't create a filesystem with a larger block size. physio (raw device I/O), however, doesn't go through bread and needs to know how to split the provided buffer into separate transactions, hence minphys.

Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks.

i believe that if you start digging you'll realise that (at least at some point) the least common denominator is (or was) 64k, meaning that even the shittiest controller on a vax can do 64k. which means that we don't have code for a filesystem or buffer cache to probe the controller for a supported transfer size.

I think it makes more sense to have that logic in one place to be used by all drivers.

perhaps, but unless the filesystem will issue breads of larger blocks, the only benefit would be physio itself, which doesn't really justify the change.
Re: reduce the number of missed PCB cache with tcpbench -su
On 29 August 2014 18:01, Damien Miller d...@mindrot.org wrote:

What's the benefit of this?

This creates a UDP PCB per connection. Otherwise we always rely on matching the wildcard PCB.

I've never seen an application do this;

I doubt that. However, things like NTP or DNS servers usually expect requests from everyone, so they don't do it. Now I have actually identified one use case that this diff breaks: multiple senders won't work. Which is kind of annoying.

wouldn't it be better to improve the PCB cache so it caught this case (which seems to be the usual way UDP applications behave) instead?

Definitely! Using something like a prefix tree (mickey advocates hash array mapped tries) would eliminate this second lookup and also solve the problems of attacking the PCB hash with collisions, and of resizing the hash table.

-d
Re: minphys woes
On Mon, 1 Sep 2014, Mike Belopuhov wrote:

On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote: Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks.

i believe that if you start digging you'll realise that (at least at some point) the least common denominator is (or was) 64k, meaning that even the shittiest controller on a vax can do 64k. which means that we don't have code for a filesystem or buffer cache to probe the controller for a supported transfer size.

That's possible. But what is really strange is: why does OpenBSD then have an infrastructure to set different max transfer sizes for physio on a per-adapter basis? This makes no sense. Either the drivers have to support 64k transfers, in which case most of the minphys infrastructure is useless, or they don't have to. In the latter case the minphys infrastructure would need to be used in all code paths.

I think it makes more sense to have that logic in one place to be used by all drivers.

perhaps, but unless the filesystem will issue breads of larger blocks, the only benefit would be physio itself, which doesn't really justify the change.

You lost me there. Why should this depend on how many requests some filesystem makes with larger blocks? If there is the possibility that filesystems do such requests, someone needs to do the splitting of the requests. The question is who: the driver or the block layer.

BTW, for my use case I don't actually want to limit the block size, but rather the number of DMA segments. But the most reasonable way to do that seemed to be to set minphys to max_segments * pagesize. If we change how these things work, we could take the number of DMA segments into account.
Re: minphys woes
On Mon, 1 Sep 2014, David Gwynne wrote:

Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks. I think it makes more sense to have that logic in one place to be used by all drivers.

this has blown my mind. how do the stupid ata cf card things that only support single sector IO work?

It seems _wdc_ata_bio_start in ata_wdc.c does the splitting in this case.
Re: minphys woes
On 1 September 2014 13:06, Stefan Fritsch s...@sfritsch.de wrote:

On Mon, 1 Sep 2014, Mike Belopuhov wrote: On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:

Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks.

i believe that if you start digging you'll realise that (at least at some point) the least common denominator is (or was) 64k, meaning that even the shittiest controller on a vax can do 64k. which means that we don't have code for a filesystem or buffer cache to probe the controller for a supported transfer size.

That's possible. But what is really strange is: why does OpenBSD then have an infrastructure to set different max transfer sizes for physio on a per-adapter basis? This makes no sense. Either the drivers have to support 64k transfers, in which case most of the minphys infrastructure is useless, or they don't have to. In the latter case the minphys infrastructure would need to be used in all code paths.

i haven't found a controller that does less than MAXPHYS. perhaps they meant to improve the situation but stopped short.

I think it makes more sense to have that logic in one place to be used by all drivers.

perhaps, but unless the filesystem will issue breads of larger blocks, the only benefit would be physio itself, which doesn't really justify the change.

You lost me there. Why should this depend on how many requests some filesystem makes with larger blocks? If there is the possibility that filesystems do such requests, someone needs to do the splitting of the requests. The question is who: the driver or the block layer.

well, the filesystem doesn't issue requests for more than 64k at a time. newfs won't allow you to build a filesystem with 128k blocks.

BTW, for my use case I don't actually want to limit the block size, but rather the number of DMA segments. But the most reasonable way to do that seemed to be to set minphys to max_segments * pagesize. If we change how these things work, we could take the number of DMA segments into account.

can't help you with this one.
Re: minphys woes
On 1 Sep 2014, at 9:22 pm, Mike Belopuhov m...@belopuhov.com wrote:

On 1 September 2014 13:06, Stefan Fritsch s...@sfritsch.de wrote: On Mon, 1 Sep 2014, Mike Belopuhov wrote: On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:

Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks.

i believe that if you start digging you'll realise that (at least at some point) the least common denominator is (or was) 64k, meaning that even the shittiest controller on a vax can do 64k. which means that we don't have code for a filesystem or buffer cache to probe the controller for a supported transfer size.

That's possible. But what is really strange is: why does OpenBSD then have an infrastructure to set different max transfer sizes for physio on a per-adapter basis? This makes no sense. Either the drivers have to support 64k transfers, in which case most of the minphys infrastructure is useless, or they don't have to. In the latter case the minphys infrastructure would need to be used in all code paths.

i haven't found a controller that does less than MAXPHYS. perhaps they meant to improve the situation but stopped short.

if we wanted to raise MAXPHYS, we'd have to support older controllers that can't do greater than 64k with some mechanism.

I think it makes more sense to have that logic in one place to be used by all drivers.

perhaps, but unless the filesystem will issue breads of larger blocks, the only benefit would be physio itself, which doesn't really justify the change.

You lost me there. Why should this depend on how many requests some filesystem makes with larger blocks? If there is the possibility that filesystems do such requests, someone needs to do the splitting of the requests. The question is who: the driver or the block layer.

well, the filesystem doesn't issue requests for more than 64k at a time. newfs won't allow you to build a filesystem with 128k blocks.

BTW, for my use case I don't actually want to limit the block size, but rather the number of DMA segments. But the most reasonable way to do that seemed to be to set minphys to max_segments * pagesize. If we change how these things work, we could take the number of DMA segments into account.

can't help you with this one.
Re: UPDATE: xkeyboard-config 2.12
On Mon, 1 Sep 2014 20:52:49 +0600, Alexandr Shadchin alexandr.shadc...@gmail.com wrote: Hi, This diff updates xkeyboard-config to the latest release 2.12. Also includes diff from http://marc.info/?l=openbsd-techm=140750210214198w=2 Tested on amd64 and i386. Comments ? OK ? Compiled on amd64 on my x201, it works fine and I can confirm that the bug is fixed, thanks. -- Daniel
Re: bge(4) Jumbo support for newer chipsets
if no one else is going to try this then i think it should go in.

On 28 Aug 2014, at 20:32, David Gwynne da...@gwynne.id.au wrote:

On 28 Aug 2014, at 3:02 am, Mike Belopuhov m...@belopuhov.com wrote: On 27 August 2014 08:25, Brad Smith b...@comstyle.com wrote: Looking for some testing of the following diff to add Jumbo support for the BCM5714 / BCM5780 and BCM5717 / BCM5719 / BCM5720 / BCM57765 / BCM57766 chipsets.

i have tested this on "Broadcom BCM5719" rev 0x01, unknown BCM5719 (0x5719001), APE firmware NCSI 1.1.15.0 and "Broadcom BCM5714" rev 0xa3, BCM5715 A3 (0x9003). it works, however i'm not strictly a fan of switching the cluster pool to a larger one for the 5714. wasting another 8k page (on sparc, for example) for every rx cluster in 90% of cases sounds kinda wrong to me. but ymmv.

this is what MCLGETI was invented to solve though. comparing pre-MCLGETI to what this does: a 5714 right now without jumbos would have 512 ring entries with 2048 bytes each. 2048 * 512 is 1024k of ram. if we bumped the std ring up to jumbos by default, 9216 * 512 would eat 4608k of ram. my boxes with bge and MCLGETI generally sit around 40 clusters, but sometimes end up around 80. 80 * 9216 is 720k. we can have jumbos and still be ahead. if you compare the nics with split rings: 512 * 2048 + 256 * 9216 is ~3.3M. the same chip with MCLGETI and only doing a 1500-byte workload would be 80 * 2048 + 17 * 9216, or ~300k.

apart from that there's a deficiency in the diff itself. you probably want to change MCLBYTES in bge_rxrinfo to bge_rx_std_len, otherwise the statistics look wrong.

yeah.

i have tested both 1500 and 9000 mtus on a 5714 and it is working well. as you say, the 5719 seems to be fine too, but i've only tested it with mtu 1500. i'll test 9k tomorrow. it needs tests on older chips too though.
Re: bge(4) Jumbo support for newer chipsets
On Wed, Aug 27, 2014 at 02:25:27AM -0400, Brad Smith wrote: Looking for some testing of the following diff to add Jumbo support for the BCM5714 / BCM5780 and BCM5717 / BCM5719 / BCM5720 / BCM57765 / BCM57766 chipsets.

Here is an updated diff with bge_rxrinfo() being fixed.

Index: if_bge.c
===================================================================
RCS file: /home/cvs/src/sys/dev/pci/if_bge.c,v
retrieving revision 1.360
diff -u -p -u -p -r1.360 if_bge.c
--- if_bge.c	26 Aug 2014 11:01:21 -0000	1.360
+++ if_bge.c	2 Sep 2014 01:50:30 -0000
@@ -1117,10 +1117,10 @@ bge_newbuf(struct bge_softc *sc, int i)
 	struct mbuf *m;
 	int error;
 
-	m = MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES);
+	m = MCLGETI(NULL, M_DONTWAIT, NULL, sc->bge_rx_std_len);
 	if (!m)
 		return (ENOBUFS);
-	m->m_len = m->m_pkthdr.len = MCLBYTES;
+	m->m_len = m->m_pkthdr.len = sc->bge_rx_std_len;
 	if (!(sc->bge_flags & BGE_RX_ALIGNBUG))
 		m_adj(m, ETHER_ALIGN);
 
@@ -1241,8 +1241,8 @@ bge_init_rx_ring_std(struct bge_softc *s
 		return (0);
 
 	for (i = 0; i < BGE_STD_RX_RING_CNT; i++) {
-		if (bus_dmamap_create(sc->bge_dmatag, MCLBYTES, 1, MCLBYTES, 0,
-		    BUS_DMA_NOWAIT | BUS_DMA_ALLOCNOW,
+		if (bus_dmamap_create(sc->bge_dmatag, sc->bge_rx_std_len, 1,
+		    sc->bge_rx_std_len, 0, BUS_DMA_NOWAIT | BUS_DMA_ALLOCNOW,
 		    &sc->bge_cdata.bge_rx_std_map[i]) != 0) {
 			printf("%s: unable to create dmamap for slot %d\n",
 			    sc->bge_dev.dv_xname, i);
@@ -1485,6 +1485,7 @@ bge_init_tx_ring(struct bge_softc *sc)
 {
 	int i;
 	bus_dmamap_t dmamap;
+	bus_size_t txsegsz, txmaxsegsz;
 	struct txdmamap_pool_entry *dma;
 
 	if (sc->bge_flags & BGE_TXRING_VALID)
@@ -1504,11 +1505,18 @@ bge_init_tx_ring(struct bge_softc *sc)
 	if (BGE_CHIPREV(sc->bge_chipid) == BGE_CHIPREV_5700_BX)
 		bge_writembx(sc, BGE_MBX_TX_NIC_PROD0_LO, 0);
 
+	if (BGE_IS_JUMBO_CAPABLE(sc)) {
+		txsegsz = 4096;
+		txmaxsegsz = BGE_JLEN;
+	} else {
+		txsegsz = MCLBYTES;
+		txmaxsegsz = MCLBYTES;
+	}
+
 	SLIST_INIT(&sc->txdma_list);
 	for (i = 0; i < BGE_TX_RING_CNT; i++) {
-		if (bus_dmamap_create(sc->bge_dmatag, BGE_JLEN,
-		    BGE_NTXSEG, BGE_JLEN, 0, BUS_DMA_NOWAIT,
-		    &dmamap))
+		if (bus_dmamap_create(sc->bge_dmatag, txmaxsegsz,
+		    BGE_NTXSEG, txsegsz, 0, BUS_DMA_NOWAIT, &dmamap))
 			return (ENOBUFS);
 		if (dmamap == NULL)
 			panic("dmamap NULL in bge_init_tx_ring");
@@ -2001,7 +2009,7 @@ bge_blockinit(struct bge_softc *sc)
 	 * using this ring (i.e. once we set the MTU
 	 * high enough to require it).
 	 */
-	if (BGE_IS_JUMBO_CAPABLE(sc)) {
+	if (sc->bge_flags & BGE_JUMBO_RING) {
 		rcb = &sc->bge_rdata->bge_info.bge_jumbo_rx_rcb;
 		BGE_HOSTADDR(rcb->bge_hostaddr,
 		    BGE_RING_DMA_ADDR(sc, bge_rx_jumbo_ring));
@@ -2065,7 +2073,7 @@ bge_blockinit(struct bge_softc *sc)
 	 * to work around HW bugs.
 	 */
 	CSR_WRITE_4(sc, BGE_RBDI_STD_REPL_THRESH, 8);
-	if (BGE_IS_JUMBO_CAPABLE(sc))
+	if (sc->bge_flags & BGE_JUMBO_RING)
 		CSR_WRITE_4(sc, BGE_RBDI_JUMBO_REPL_THRESH, 8);
 
 	if (BGE_IS_5717_PLUS(sc)) {
@@ -2699,7 +2707,8 @@ bge_attach(struct device *parent, struct
 	case BGE_ASICREV_BCM5719:
 	case BGE_ASICREV_BCM5720:
 		sc->bge_flags |= BGE_5717_PLUS | BGE_5755_PLUS | BGE_575X_PLUS |
-		    BGE_5705_PLUS;
+		    BGE_5705_PLUS | BGE_JUMBO_CAPABLE | BGE_JUMBO_RING |
+		    BGE_JUMBO_FRAME;
 		if (BGE_ASICREV(sc->bge_chipid) == BGE_ASICREV_BCM5719 ||
 		    BGE_ASICREV(sc->bge_chipid) == BGE_ASICREV_BCM5720) {
 			/*
@@ -2707,6 +2716,13 @@ bge_attach(struct device *parent, struct
 			 * of TXMBUF available space.
 			 */
 			sc->bge_flags |= BGE_RDMA_BUG;
+
+			if (BGE_ASICREV(sc->bge_chipid) == BGE_ASICREV_BCM5719 &&
+			    sc->bge_chipid == BGE_CHIPID_BCM5719_A0) {
+				/* Jumbo frame on BCM5719 A0 does not work. */
+				sc->bge_flags &= ~(BGE_JUMBO_CAPABLE |
+				    BGE_JUMBO_RING | BGE_JUMBO_FRAME);
+			}
 		}
 		break;
 	case BGE_ASICREV_BCM5755:
@@ -2721,12 +2737,12 @@ bge_attach(struct device *parent, struct
 	case BGE_ASICREV_BCM5701:
 	case BGE_ASICREV_BCM5703:
 	case BGE_ASICREV_BCM5704:
-		sc->bge_flags |=