Re: Kill rtalloc()

2014-09-01 Thread Martin Pieuchot
On 21/08/14(Thu) 15:36, Martin Pieuchot wrote:
 As you might have seen, I started looking at routing entries pointing to
 stale ifa's.  This is a necessary step if we want to rely on our routing
 entries to do address lookups.
 
 This audit leads me to believe that the current check used to determine
 whether a cached route entry is still valid might not be enough.  Plus some
 lookups are completely missing such a check; see the recent cloning case.
 
 Now, before digging into cached route validity, I'd like to simplify
 the interface to do route lookups.  In other words, I'd like to kill
 rtalloc() and rtalloc_noclone().  The diff below is the part #1.
 
 Apart from the fact that this change unifies all our route lookups, it
 also gives us the possibility to stop using a struct route or similar
 when it is not needed.  This structure, used to cache a route, is not
 protocol agnostic and is responsible for a lot of switch() and code
 duplication.  Search for struct route in net/pf.c if you want to see
 some examples for yourself.
 
 So I'm looking for careful reviewers, since the diff below removes the
 following check for every route lookup:
 
   if (ro->ro_rt && ro->ro_rt->rt_ifp && (ro->ro_rt->rt_flags & RTF_UP))
   	return;
 
 But apparently all the calls to rtalloc() and rtalloc_mpath() seem to
 be done after checking for a similar condition.
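
 For reference, the deleted pattern can be exercised in isolation.  This is a
 self-contained sketch with stub stand-in types (not the kernel's real
 struct route / struct rtentry), showing the three conditions under which a
 cached entry was reused:

```c
#include <assert.h>
#include <stddef.h>

#define RTF_UP	0x1		/* stand-in for the kernel flag */

/* Stub stand-ins for the kernel structures, for illustration only. */
struct rtentry {
	void	*rt_ifp;	/* back pointer to the interface; NULL once stale */
	int	 rt_flags;
};

struct route {
	struct rtentry	*ro_rt;	/* cached route entry, possibly stale */
};

/*
 * The validity test the diff removes: a cached entry is reused only if
 * it exists, still references an interface, and is marked RTF_UP.
 */
static int
cached_route_is_valid(const struct route *ro)
{
	return (ro->ro_rt != NULL && ro->ro_rt->rt_ifp != NULL &&
	    (ro->ro_rt->rt_flags & RTF_UP));
}
```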
 
 Ok?

Nobody?

 
 Index: net/pf.c
 ===
 RCS file: /home/ncvs/src/sys/net/pf.c,v
 retrieving revision 1.886
 diff -u -p -r1.886 pf.c
 --- net/pf.c  12 Aug 2014 15:29:33 -  1.886
 +++ net/pf.c  21 Aug 2014 12:08:12 -
 @@ -5567,7 +5567,7 @@ pf_route(struct mbuf **m, struct pf_rule
  	ro->ro_tableid = m0->m_pkthdr.ph_rtableid;
 
  	if (!r->rt) {
 - rtalloc(ro);
 + ro->ro_rt = rtalloc1(&ro->ro_dst, RT_REPORT, ro->ro_tableid);
  	if (ro->ro_rt == 0) {
   ipstat.ips_noroute++;
   goto bad;
 Index: net/pfkeyv2.c
 ===
 RCS file: /home/ncvs/src/sys/net/pfkeyv2.c,v
 retrieving revision 1.133
 diff -u -p -r1.133 pfkeyv2.c
 --- net/pfkeyv2.c 12 Jul 2014 18:44:22 -  1.133
 +++ net/pfkeyv2.c 21 Aug 2014 12:08:12 -
 @@ -1569,7 +1569,8 @@ pfkeyv2_send(struct socket *socket, void
   /* Set the rdomain that was obtained from the socket */
   re.re_tableid = rdomain;
  
 - rtalloc((struct route *) &re);
 + re.re_rt = rtalloc1((struct sockaddr *)&re.re_dst, RT_REPORT,
 + re.re_tableid);
   if (re.re_rt != NULL) {
  	ipo = ((struct sockaddr_encap *)
  	    re.re_rt->rt_gateway)->sen_ipsp;
   RTFREE(re.re_rt);
 Index: net/radix_mpath.c
 ===
 RCS file: /home/ncvs/src/sys/net/radix_mpath.c,v
 retrieving revision 1.23
 diff -u -p -r1.23 radix_mpath.c
 --- net/radix_mpath.c 27 May 2014 19:38:15 -  1.23
 +++ net/radix_mpath.c 21 Aug 2014 12:08:12 -
 @@ -53,7 +53,7 @@
  #include <netinet6/ip6_var.h>
  #endif
  
 -u_int32_t rn_mpath_hash(struct route *, u_int32_t *);
 +u_int32_t rn_mpath_hash(struct sockaddr *, u_int32_t *);
  
  /*
   * give some jitter to hash, to avoid synchronization between routers
 @@ -383,42 +383,37 @@ rn_mpath_reprio(struct radix_node *rn, i
  /*
   * allocate a route, potentially using multipath to select the peer.
   */
 -void
 -rtalloc_mpath(struct route *ro, u_int32_t *srcaddrp)
 +struct rtentry *
 +rtalloc_mpath(struct sockaddr *dst, u_int32_t *srcaddrp, u_int rtableid)
  {
 + struct rtentry *rt;
  #if defined(INET) || defined(INET6)
   struct radix_node *rn;
   int hash, npaths, threshold;
  #endif
  
 - /*
 -  * return a cached entry if it is still valid, otherwise we increase
 -  * the risk of disrupting local flows.
 -  */
 - if (ro->ro_rt && ro->ro_rt->rt_ifp && (ro->ro_rt->rt_flags & RTF_UP))
 - return;
 - ro->ro_rt = rtalloc1(&ro->ro_dst, RT_REPORT, ro->ro_tableid);
 + rt = rtalloc1(dst, RT_REPORT, rtableid);
  
   /* if the route does not exist or it is not multipath, don't care */
 - if (!ro->ro_rt || !(ro->ro_rt->rt_flags & RTF_MPATH))
 - return;
 + if (rt == NULL || !ISSET(rt->rt_flags, RTF_MPATH))
 + return (rt);
  
   /* check if multipath routing is enabled for the specified protocol */
   if (!(0
  #ifdef INET
 - || (ipmultipath && ro->ro_dst.sa_family == AF_INET)
 + || (ipmultipath && dst->sa_family == AF_INET)
  #endif
  #ifdef INET6
 - || (ip6_multipath && ro->ro_dst.sa_family == AF_INET6)
 + || (ip6_multipath && dst->sa_family == AF_INET6)
  #endif
   ))
 - return;
 + return (rt);
  
  #if defined(INET) || defined(INET6)
   /* gw selection by Hash-Threshold (RFC 2992) */
 - rn = (struct radix_node 
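
 The gw selection mentioned in the diff follows Hash-Threshold (RFC 2992).  As
 a rough illustration of the idea only (the real code derives the hash from the
 flow and walks the radix multipath list; the helper name here is made up):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hash-Threshold (RFC 2992) region selection, illustrative sketch:
 * split the 16-bit hash space into npaths equal regions and pick the
 * region the flow's hash falls into.
 */
static int
hash_threshold_select(uint16_t hash, int npaths)
{
	int threshold = 0x10000 / npaths;
	int region = hash / threshold;

	/* integer division can push the top of the hash space one region
	 * past the end when npaths does not divide 0x10000; clamp it */
	return (region < npaths ? region : npaths - 1);
}
```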

Re: minphys woes

2014-09-01 Thread Mike Belopuhov
On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:
 On Fri, 29 Aug 2014, Mike Belopuhov wrote:
 correct me if i'm wrong, but what happens is that bread being a block
 read reads up to MAXBSIZE which is conveniently set to 64k and you can't
 create a filesystem with a larger block size.

 physio (raw device I/O) however doesn't go through bread and needs to know
 how to split the provided buffer into separate transactions, hence minphys.

 Yes, that seems to be what happens. But if every adapter needs to support
 transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the
 adapter to be able to override the default minphys function with its own.
 And adapters that only support smaller transfers would need to have logic
 in their driver to be able to split the transfer into smaller chunks.


i believe that if you start digging you realise that (at least at some point)
the least common denominator is (or was) 64k meaning that even the shittiest
controller on vax can do 64k.  which means that we don't have code for a
filesystem or buffercache to probe the controller for a supported transfer
size.

 I think it makes more sense to have that logic in one place to be used by
 all drivers.

perhaps, but unless the filesystem issues breads of larger blocks, the
only benefit would be physio itself, which doesn't really justify the change.
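
the splitting physio does can be sketched like this (stand-alone sketch; the
`maxphys` parameter plays the role of the per-adapter minphys limit, and the
comment marks where a real implementation would issue the I/O):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the physio splitting loop: a raw transfer of `resid` bytes
 * is issued as successive chunks of at most `maxphys` bytes.  Returns
 * the number of chunks a real implementation would hand to the driver.
 */
static int
split_transfer(size_t resid, size_t maxphys)
{
	int chunks = 0;

	while (resid > 0) {
		size_t len = resid < maxphys ? resid : maxphys;

		/* ... here the real code would issue one I/O of `len`
		 * bytes and wait for it to complete ... */
		resid -= len;
		chunks++;
	}
	return (chunks);
}
```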



Re: reduce the number of missed PCB cache with tcpbench -su

2014-09-01 Thread Mike Belopuhov
On 29 August 2014 18:01, Damien Miller d...@mindrot.org wrote:
 What's the benefit of this?

This creates a UDP PCB per connection.  Otherwise we always rely on
matching the wildcard PCB.

 I've never seen an application do this;

I doubt that.  However, things like NTP or DNS servers usually expect
requests from everyone so they don't do it.  Now I have actually
identified one use case that this diff breaks: multiple senders won't
work.  Which is kind of annoying.
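
The per-connection PCB comes from connect(2) on the datagram socket.  A
minimal userland sketch of the idea (hypothetical helper name; port 9/discard
on loopback is just an example peer):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Sketch: connect(2) on a UDP socket sends nothing, but it fixes the
 * foreign address/port in the socket's PCB, so later datagrams from
 * that peer match the fully specified PCB directly instead of falling
 * through to the wildcard (unconnected) one.
 */
static int
udp_connect_peer(in_port_t port)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);

	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}
```

as noted above, the trade-off is that a socket connected this way no longer
receives datagrams from other senders.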

 wouldn't it be better to improve the PCB cache so it caught this case
 (which seems the usual way UDP applications behave) instead?


Definitely!  Using something like a prefix tree (mickey advocates hash
array mapped tries) would eliminate this second lookup and also solve
problems with attacking PCB hash with collisions and resizing the hash
table.

 -d




Re: minphys woes

2014-09-01 Thread Stefan Fritsch
On Mon, 1 Sep 2014, Mike Belopuhov wrote:

 On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:
  Yes, that seems to be what happens. But if every adapter needs to support
  transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the
  adapter to be able to override the default minphys function with its own.
  And adapters that only support smaller transfers would need to have logic
  in their driver to be able to split the transfer into smaller chunks.
 
 
 i believe that if you start digging you realise that (at least at some point)
 the least common denominator is (or was) 64k meaning that even the shittiest
 controller on vax can do 64k.  which means that we don't have code for a
 filesystem or buffercache to probe the controller for a supported transfer
 size.

That's possible. But what is really strange is, why does openbsd then have 
an infrastructure to set different max transfer sizes for physio on a 
per-adapter basis? This makes no sense. Either the drivers have to support 
64k transfers, in which case most of the minphys infrastructure is 
useless, or they don't have to. In the latter case the minphys 
infrastructure would need to be used in all code paths.

  I think it makes more sense to have that logic in one place to be used by
  all drivers.
 
 perhaps, but unless the filesystem will issue breads of larger blocks the
 only benefit would be physio itself which doesn't really justify the change.

You lost me there. Why should this depend on how many requests some 
filesystem makes with larger blocks? If there is the possibility that 
filesystems do such requests, someone needs to do the splitting of the 
requests. The question is who. The driver or the block layer.


BTW, for my use case I don't actually want to limit the block size, but 
rather the number of DMA segments. But the most reasonable way to do that 
seemed to set minphys to max_segments * pagesize. If we change how these 
things work, we could take the number of DMA segments into account.
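
That workaround amounts to the following (hypothetical helper name; it assumes
the worst case, in which every page of the transfer is discontiguous and so
costs one DMA segment):

```c
#include <assert.h>
#include <stddef.h>

/*
 * A fully discontiguous buffer needs roughly one DMA segment per page,
 * so capping the transfer byte count at max_segments * pagesize
 * approximates a segment-count limit.  (An unaligned buffer can still
 * straddle one extra page, so this is an approximation, not a bound.)
 */
static size_t
segment_limited_minphys(size_t max_segments, size_t pagesize)
{
	return (max_segments * pagesize);
}
```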



Re: minphys woes

2014-09-01 Thread Stefan Fritsch
On Mon, 1 Sep 2014, David Gwynne wrote:
  Yes, that seems to be what happens. But if every adapter needs to support 
  transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the 
  adapter to be able to override the default minphys function with its own. 
  And adapters that only support smaller transfers would need to have logic 
  in their driver to be able to split the transfer into smaller chunks.
  
  I think it makes more sense to have that logic in one place to be used by 
  all drivers.
 
 this has blown my mind. how do the stupid ata cf card things that only 
 support single sector IO work?

It seems _wdc_ata_bio_start in ata_wdc.c does the splitting in this case.



Re: minphys woes

2014-09-01 Thread Mike Belopuhov
On 1 September 2014 13:06, Stefan Fritsch s...@sfritsch.de wrote:
 On Mon, 1 Sep 2014, Mike Belopuhov wrote:

 On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:
  Yes, that seems to be what happens. But if every adapter needs to support
  transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the
  adapter to be able to override the default minphys function with its own.
  And adapters that only support smaller transfers would need to have logic
  in their driver to be able to split the transfer into smaller chunks.
 

 i believe that if you start digging you realise that (at least at some point)
 the least common denominator is (or was) 64k meaning that even the shittiest
 controller on vax can do 64k.  which means that we don't have code for a
 filesystem or buffercache to probe the controller for a supported transfer
 size.

 That's possible. But what is really strange is, why does openbsd then have
 an infrastructure to set different max transfer sizes for physio on a
 per-adapter basis? This makes no sense. Either the drivers have to support
 64k transfers, in which case most of the minphys infrastructure is
 useless, or they don't have to. In the latter case the minphys
 infrastructure would need to be used in all code paths.


i haven't found a controller that does less than MAXPHYS.
perhaps they meant to improve the situation but stopped short.

  I think it makes more sense to have that logic in one place to be used by
  all drivers.

 perhaps, but unless the filesystem will issue breads of larger blocks the
 only benefit would be physio itself which doesn't really justify the change.

 You lost me there. Why should this depend on how many requests some
 filesystem makes with larger blocks? If there is the possibility that
 filesystems do such requests, someone needs to do the splitting of the
 requests. The question is who. The driver or the block layer.



well, the filesystem doesn't issue requests for more than 64k at a time.
newfs won't allow you to build a filesystem with a 128k block size.

 BTW, for my use case I don't actually want to limit the block size, but
 rather the number of DMA segments. But the most reasonable way to do that
 seemed to set minphys to max_segments * pagesize. If we change how these
 things work, we could take the number of DMA segments into account.

can't help you with this one.



Re: minphys woes

2014-09-01 Thread David Gwynne

On 1 Sep 2014, at 9:22 pm, Mike Belopuhov m...@belopuhov.com wrote:

 On 1 September 2014 13:06, Stefan Fritsch s...@sfritsch.de wrote:
 On Mon, 1 Sep 2014, Mike Belopuhov wrote:
 
 On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:
 Yes, that seems to be what happens. But if every adapter needs to support
 transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the
 adapter to be able to override the default minphys function with its own.
 And adapters that only support smaller transfers would need to have logic
 in their driver to be able to split the transfer into smaller chunks.
 
 
 i believe that if you start digging you realise that (at least at some 
 point)
 the least common denominator is (or was) 64k meaning that even the shittiest
 controller on vax can do 64k.  which means that we don't have code for a
 filesystem or buffercache to probe the controller for a supported transfer
 size.
 
 That's possible. But what is really strange is, why does openbsd then have
 an infrastructure to set different max transfer sizes for physio on a
 per-adapter basis? This makes no sense. Either the drivers have to support
 64k transfers, in which case most of the minphys infrastructure is
 useless, or they don't have to. In the latter case the minphys
 infrastructure would need to be used in all code paths.
 
 
 i haven't found a controller that does less than MAXPHYS.
 perhaps they meant to improve the situation but stopped short.

if we wanted to raise MAXPHYS, we'd need some mechanism to support older
controllers that can't do transfers greater than 64k.

 
 I think it makes more sense to have that logic in one place to be used by
 all drivers.
 
 perhaps, but unless the filesystem will issue breads of larger blocks the
 only benefit would be physio itself which doesn't really justify the change.
 
 You lost me there. Why should this depend on how many requests some
 filesystem makes with larger blocks? If there is the possibility that
 filesystems do such requests, someone needs to do the splitting of the
 requests. The question is who. The driver or the block layer.
 
 
 
 well, the filesystem doesn't issue requests for more than 64k at a time.
 newfs won't allow you to build a 128k block file filesystem.
 
 BTW, for my use case I don't actually want to limit the block size, but
 rather the number of DMA segments. But the most reasonable way to do that
 seemed to set minphys to max_segments * pagesize. If we change how these
 things work, we could take the number of DMA segments into account.
 
 can't help you with this one.




Re: UPDATE: xkeyboard-config 2.12

2014-09-01 Thread Daniel Jakots
On Mon, 1 Sep 2014 20:52:49 +0600, Alexandr Shadchin
alexandr.shadc...@gmail.com wrote:

 Hi,
 
 This diff updates xkeyboard-config to the latest release 2.12.
 Also includes diff from
 http://marc.info/?l=openbsd-tech&m=140750210214198&w=2 Tested on
 amd64 and i386.
 
 Comments ? OK ?
 

Compiled on amd64 on my x201, it works fine and I can confirm
that the bug is fixed, thanks.

-- 
Daniel



Re: bge(4) Jumbo support for newer chipsets

2014-09-01 Thread David Gwynne
if no one else is going to try this then i think it should go in.

On 28 Aug 2014, at 20:32, David Gwynne da...@gwynne.id.au wrote:

 
 On 28 Aug 2014, at 3:02 am, Mike Belopuhov m...@belopuhov.com wrote:
 
 On 27 August 2014 08:25, Brad Smith b...@comstyle.com wrote:
 Looking for some testing of the following diff to add Jumbo support for the
 BCM5714 / BCM5780 and BCM5717 / BCM5719 / BCM5720 / BCM57765 / BCM57766
 chipsets.
 
 
 
 i have tested this on Broadcom BCM5719 rev 0x01, unknown BCM5719 
 (0x5719001),
 APE firmware NCSI 1.1.15.0  and Broadcom BCM5714 rev 0xa3, BCM5715
 A3 (0x9003).
 
 it works, however i'm not strictly a fan of switching the cluster pool to a
 larger one for 5714.  wasting another 8k page (on sparc for example) for
 every rx cluster in 90% of cases sounds kinda wrong to me.  but ymmv.
 
 this is what MCLGETI was invented to solve though. comparing pre mclgeti to 
 what this does:
 
 a 5714 right now without jumbos would have 512 ring entries with 2048 bytes 
 each. 2048 * 512 is 1024k of ram. if we bumped the std ring up to jumbos 
 by default, 9216 * 512 would eat 4608k of ram.
 
 my boxes with bge with mclgeti generally sit around 40 clusters, but 
 sometimes end up around 80. 80 * 9216 is 720k. we can have jumbos and still 
 be ahead.
 
 if you compare the nics with split rings: 512 * 2048 + 256 * 9216 is ~3.3M. 
 the same chip with mclgeti and only doing a 1500 byte workload would be 80 * 
 2048 + 17 * 9216, or 300k.
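
 for the record, the figures above can be checked mechanically (sizes in
 bytes, KB = 1024; the entry counts for the MCLGETI cases are the observed
 ones quoted above):

```c
#include <assert.h>

/* Memory held by an rx ring: cluster count times cluster size, in KB. */
static unsigned int
ring_kb(unsigned int clusters, unsigned int cluster_bytes)
{
	return (clusters * cluster_bytes / 1024);
}
```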
 
 
 apart from that there's a deficiency in the diff itself.  you probably want
 to change MCLBYTES in bge_rxrinfo to bge_rx_std_len otherwise statistics
 look wrong.
 
 yeah.
 
 i have tested both 1500 and 9000 mtus on a 5714 and it is working well. as 
 you say, 5719 seems to be fine too, but i've only tested it with mtu 1500. i'll 
 test 9k tomorrow.
 
 it needs tests on older chips too though.




Re: bge(4) Jumbo support for newer chipsets

2014-09-01 Thread Brad Smith
On Wed, Aug 27, 2014 at 02:25:27AM -0400, Brad Smith wrote:
 Looking for some testing of the following diff to add Jumbo support for the
 BCM5714 / BCM5780 and BCM5717 / BCM5719 / BCM5720 / BCM57765 / BCM57766
 chipsets.

Here is an updated diff with bge_rxrinfo() being fixed.


Index: if_bge.c
===
RCS file: /home/cvs/src/sys/dev/pci/if_bge.c,v
retrieving revision 1.360
diff -u -p -u -p -r1.360 if_bge.c
--- if_bge.c26 Aug 2014 11:01:21 -  1.360
+++ if_bge.c2 Sep 2014 01:50:30 -
@@ -1117,10 +1117,10 @@ bge_newbuf(struct bge_softc *sc, int i)
struct mbuf *m;
int error;
 
-   m = MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES);
+   m = MCLGETI(NULL, M_DONTWAIT, NULL, sc->bge_rx_std_len);
if (!m)
return (ENOBUFS);
-   m->m_len = m->m_pkthdr.len = MCLBYTES;
+   m->m_len = m->m_pkthdr.len = sc->bge_rx_std_len;
	if (!(sc->bge_flags & BGE_RX_ALIGNBUG))
m_adj(m, ETHER_ALIGN);
 
@@ -1241,8 +1241,8 @@ bge_init_rx_ring_std(struct bge_softc *s
return (0);
 
	for (i = 0; i < BGE_STD_RX_RING_CNT; i++) {
-   if (bus_dmamap_create(sc->bge_dmatag, MCLBYTES, 1, MCLBYTES, 0,
-   BUS_DMA_NOWAIT | BUS_DMA_ALLOCNOW,
+   if (bus_dmamap_create(sc->bge_dmatag, sc->bge_rx_std_len, 1,
+   sc->bge_rx_std_len, 0, BUS_DMA_NOWAIT | BUS_DMA_ALLOCNOW,
	    &sc->bge_cdata.bge_rx_std_map[i]) != 0) {
	    printf("%s: unable to create dmamap for slot %d\n",
	    sc->bge_dev.dv_xname, i);
@@ -1485,6 +1485,7 @@ bge_init_tx_ring(struct bge_softc *sc)
 {
int i;
bus_dmamap_t dmamap;
+   bus_size_t txsegsz, txmaxsegsz;
struct txdmamap_pool_entry *dma;
 
	if (sc->bge_flags & BGE_TXRING_VALID)
@@ -1504,11 +1505,18 @@ bge_init_tx_ring(struct bge_softc *sc)
	if (BGE_CHIPREV(sc->bge_chipid) == BGE_CHIPREV_5700_BX)
bge_writembx(sc, BGE_MBX_TX_NIC_PROD0_LO, 0);
 
+   if (BGE_IS_JUMBO_CAPABLE(sc)) {
+   txsegsz = 4096;
+   txmaxsegsz = BGE_JLEN;
+   } else {
+   txsegsz = MCLBYTES;
+   txmaxsegsz = MCLBYTES;
+   }
+
	SLIST_INIT(&sc->txdma_list);
	for (i = 0; i < BGE_TX_RING_CNT; i++) {
-   if (bus_dmamap_create(sc->bge_dmatag, BGE_JLEN,
-   BGE_NTXSEG, BGE_JLEN, 0, BUS_DMA_NOWAIT,
-   &dmamap))
+   if (bus_dmamap_create(sc->bge_dmatag, txmaxsegsz,
+   BGE_NTXSEG, txsegsz, 0, BUS_DMA_NOWAIT, &dmamap))
	    return (ENOBUFS);
	    if (dmamap == NULL)
	    panic("dmamap NULL in bge_init_tx_ring");
@@ -2001,7 +2009,7 @@ bge_blockinit(struct bge_softc *sc)
 * using this ring (i.e. once we set the MTU
 * high enough to require it).
 */
-   if (BGE_IS_JUMBO_CAPABLE(sc)) {
+   if (sc-bge_flags  BGE_JUMBO_RING) {
rcb = sc-bge_rdata-bge_info.bge_jumbo_rx_rcb;
BGE_HOSTADDR(rcb-bge_hostaddr,
BGE_RING_DMA_ADDR(sc, bge_rx_jumbo_ring));
@@ -2065,7 +2073,7 @@ bge_blockinit(struct bge_softc *sc)
 * to work around HW bugs.
 */
CSR_WRITE_4(sc, BGE_RBDI_STD_REPL_THRESH, 8);
-   if (BGE_IS_JUMBO_CAPABLE(sc))
+   if (sc-bge_flags  BGE_JUMBO_RING)
CSR_WRITE_4(sc, BGE_RBDI_JUMBO_REPL_THRESH, 8);
 
if (BGE_IS_5717_PLUS(sc)) {
@@ -2699,7 +2707,8 @@ bge_attach(struct device *parent, struct
case BGE_ASICREV_BCM5719:
case BGE_ASICREV_BCM5720:
	sc->bge_flags |= BGE_5717_PLUS | BGE_5755_PLUS | BGE_575X_PLUS |
-   BGE_5705_PLUS;
+   BGE_5705_PLUS | BGE_JUMBO_CAPABLE | BGE_JUMBO_RING |
+   BGE_JUMBO_FRAME;
	if (BGE_ASICREV(sc->bge_chipid) == BGE_ASICREV_BCM5719 ||
	    BGE_ASICREV(sc->bge_chipid) == BGE_ASICREV_BCM5720) {
/*
@@ -2707,6 +2716,13 @@ bge_attach(struct device *parent, struct
 * of TXMBUF available space.
 */
	sc->bge_flags |= BGE_RDMA_BUG;
+
+   if (BGE_ASICREV(sc->bge_chipid) == BGE_ASICREV_BCM5719 &&
+   sc->bge_chipid == BGE_CHIPID_BCM5719_A0) {
+   /* Jumbo frame on BCM5719 A0 does not work. */
+   sc->bge_flags &= ~(BGE_JUMBO_CAPABLE |
+   BGE_JUMBO_RING | BGE_JUMBO_FRAME);
+   }
}
break;
case BGE_ASICREV_BCM5755:
@@ -2721,12 +2737,12 @@ bge_attach(struct device *parent, struct
case BGE_ASICREV_BCM5701:
case BGE_ASICREV_BCM5703:
case BGE_ASICREV_BCM5704:
-   sc->bge_flags |=