Re: OFW/FDT regulator API

2016-08-12 Thread Tinker

On 2016-08-13 11:34, Theo de Raadt wrote:
I have to admit, it's a bit shocking that VOLTAGE REGULATORS have to 
be

exposed to the software in the first place.
Just imagine a bug in some OS or firmware causing the voltages to 
spike

up and fry the hell out of a device.

I guess that's modern-day hardware design for you.


No, that is the arm ecosystem.

There are three choices:

1. A vendor-locked stable platform that has only one, ok maybe two,
   ok maybe a couple of slow-moving shifts, over a 20 year sequence
   of time, because the market is gigantic:
   a. i386
   b. macppc, for a while
   c. vax, but then time moved on.
   c. sparc64 but then time moved on, and uhm oracle and $$$.

2. A stable hardware platform where operating systems don't need to
   change because the investment is big but market is narrow:
   a. sparc
   b. sgi
   c. loongson

3. An unstable platform with a description / machine-dependent
   handling
   a. non-mainstream arm platforms
   b. non-mainstrain mips platform
   c. vendor-de-jour die-next-year

Pick a place you spend your money, and understand the consequences.

It really is that simple.  If you don't like the shock, don't spend
the money.

Or spend the money, and understand the shock.

Below the covers it is all the same hardware.

The long-term vendor-locked platforms do a better job of investing in
abstracting it away (ACPI or OpenFirmware), because exposing something
too strictly causes them harm because they have to expose it forever.

Which might cause them to lose their position, and they didn't become
tier 1 by being sloppy.

You -- by buying an arm platform de-jour -- on the other hand are
doing a great job being a tier 3 customer.  Not that it will ever
become a tier 2 or tier 3 platform, because you will eagerly buy the
next board of theirs when it hits the market... which is probably
before you finish reading this mail...


IBM Power architecture is in category 1?



Re: OFW/FDT regulator API

2016-08-12 Thread Theo de Raadt
> I have to admit, it's a bit shocking that VOLTAGE REGULATORS have to be 
> exposed to the software in the first place.
> Just imagine a bug in some OS or firmware causing the voltages to spike 
> up and fry the hell out of a device.
> 
> I guess that's modern-day hardware design for you.

No, that is the arm ecosystem.

There are three choices:

1. A vendor-locked stable platform that has only one, ok maybe two,
   ok maybe a couple of slow-moving shifts, over a 20 year sequence
   of time, because the market is gigantic:
   a. i386
   b. macppc, for a while
   c. vax, but then time moved on.
   c. sparc64 but then time moved on, and uhm oracle and $$$.

2. A stable hardware platform where operating systems don't need to
   change because the investment is big but market is narrow:
   a. sparc
   b. sgi
   c. loongson

3. An unstable platform with a description / machine-dependent
   handling
   a. non-mainstream arm platforms
   b. non-mainstrain mips platform
   c. vendor-de-jour die-next-year

Pick a place you spend your money, and understand the consequences.

It really is that simple.  If you don't like the shock, don't spend
the money.

Or spend the money, and understand the shock.

Below the covers it is all the same hardware.

The long-term vendor-locked platforms do a better job of investing in
abstracting it away (ACPI or OpenFirmware), because exposing something
too strictly causes them harm because they have to expose it forever.

Which might cause them to lose their position, and they didn't become
tier 1 by being sloppy.

You -- by buying an arm platform de-jour -- on the other hand are
doing a great job being a tier 3 customer.  Not that it will ever
become a tier 2 or tier 3 platform, because you will eagerly buy the
next board of theirs when it hits the market... which is probably
before you finish reading this mail...



Re: OFW/FDT regulator API

2016-08-12 Thread bytevolcano
I have to admit, it's a bit shocking that VOLTAGE REGULATORS have to be 
exposed to the software in the first place.
Just imagine a bug in some OS or firmware causing the voltages to spike 
up and fry the hell out of a device.


I guess that's modern-day hardware design for you.

Mark Kettenis wrote:

The diff below adds a simple "regulator" API, Regulators are devices
that apply voltage and/or current to subsystems to power them on.
Examples are applying voltage to an SD card, powering a USB bus,
turning on an Ethernet PHY, scaling the CPU voltage.  Regulators come
in many flavours.  The simples ones are simple gpio pins on which you
drive a voltage.  More complex ones are typically special power
control chips.

This diff only adds support for the "regulator-fixed" type.  This is a
simple gpio-based regulator that can only apply a fixed voltage.  It
doesn't need a driver of its own as it interacts with the hardware
through the gpio and pinctrl APIs.

In the future, when we need to add support for more complex types that
need their own driver, those drivers will register themselves in a
similar way to gpio controllers and pinctrl devices.  Ten we'll also
have to implement functions to set specific voltages and such.

ok?


Index: dev/ofw/ofw_regulator.c
===
RCS file: dev/ofw/ofw_regulator.c
diff -N dev/ofw/ofw_regulator.c
--- /dev/null   1 Jan 1970 00:00:00 -
+++ dev/ofw/ofw_regulator.c 12 Aug 2016 20:50:19 -
@@ -0,0 +1,75 @@
+/* $OpenBSD$   */
+/*
+ * Copyright (c) 2016 Mark Kettenis
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+int
+regulator_set(uint32_t phandle, int enable)
+{
+   uint32_t gpio[3];
+   uint32_t startup_delay;
+   int active;
+   int node;
+
+   node = OF_getnodebyphandle(phandle);
+   if (node == 0)
+   return -1;
+
+   if (!OF_is_compatible(node, "regulator-fixed"))
+   return -1;
+
+   pinctrl_byname(node, "default");
+
+   if (OF_getproplen(node, "enable-active-high") == 0)
+   active = 1;
+   else
+   active = 0;
+
+   if (OF_getpropintarray(node, "gpio", gpio,
+   sizeof(gpio)) != sizeof(gpio))
+   return -1;
+
+   gpio_controller_config_pin(gpio, GPIO_CONFIG_OUTPUT);
+   if (enable)
+   gpio_controller_set_pin(gpio, active);
+   else
+   gpio_controller_set_pin(gpio, !active);
+
+   startup_delay = OF_getpropint(node, "startup-delay-us", 0);
+   if (enable && startup_delay > 0)
+   delay(startup_delay);
+
+   return 0;
+}
+
+int
+regulator_enable(uint32_t phandle)
+{
+   return regulator_set(phandle, 1);
+}
+
+int
+regulator_disable(uint32_t phandle)
+{
+   return regulator_set(phandle, 0);
+}
Index: dev/ofw/ofw_regulator.h
===
RCS file: dev/ofw/ofw_regulator.h
diff -N dev/ofw/ofw_regulator.h
--- /dev/null   1 Jan 1970 00:00:00 -
+++ dev/ofw/ofw_regulator.h 12 Aug 2016 20:50:19 -
@@ -0,0 +1,24 @@
+/* $OpenBSD$   */
+/*
+ * Copyright (c) 2016 Mark Kettenis
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef _DEV_OFW_REGULATOR_H_
+#define _DEV_OFW_REGULATOR_H_
+
+intregulator_enable(uint32_t);
+intregulator_disable(uint32_t);
+
+#endif /* _DEV_OFW_GPIO_H_ */





Re: Fix relro coredumps

2016-08-12 Thread Jeremie Courreges-Anglas
Mark Kettenis  writes:

> With relro, certain bits of a process that were mapped into memory as
> writable are revreted back to read-only after making some initial
> changes.  Since the kernel coredump code only writes out writable
> pieces of memory, these relro bits are not written out.  Unfortunately
> these bits contain essential pieces of information that the debugger
> needs to analyze the coredump.
>
> The diff below fixes that bu also dumping out all the bits that have
> an amap allocated.  That's a sign that the pages are no longer
> pristine and have been written to.
>
> ok?

I'm not an uvm hacker but your explanation and the diff make sense to
me.  Successfuly tested on i386.  We should really get coredumps fixed
before g2k16. :)

>
> Index: uvm_unix.c
> ===
> RCS file: /home/cvs/src/sys/uvm/uvm_unix.c,v
> retrieving revision 1.58
> diff -u -p -r1.58 uvm_unix.c
> --- uvm_unix.c4 Apr 2016 16:34:16 -   1.58
> +++ uvm_unix.c12 Aug 2016 21:45:22 -
> @@ -161,6 +161,7 @@ uvm_coredump_walkmap(struct proc *p, voi
>   }
>  
>   if (!(entry->protection & PROT_WRITE) &&
> + entry->aref.ar_amap == NULL &&
>   entry->start != p->p_p->ps_sigcode)
>   continue;
>  
>

-- 
jca | PGP: 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: jot: use strchr(3) and strspn(3) in getprec()

2016-08-12 Thread Jeremie Courreges-Anglas
Theo Buehler  writes:

> Here's one final diff I have for jot for now:
>
> Use strchr(3) and strspn(3) instead of handrolled functions.
>
> Part of dsl@NetBSD's r1.20.

ok jca@

-- 
jca | PGP: 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: Fix relro coredumps

2016-08-12 Thread Philip Guenther
On Fri, Aug 12, 2016 at 2:56 PM, Mark Kettenis  wrote:
> The diff below fixes that bu also dumping out all the bits that have
> an amap allocated.  That's a sign that the pages are no longer
> pristine and have been written to.
>
> ok?

_There's_ the check, nice!

This would have affected dumping of other useful stuff too, like the
__atexit list and the malloc options...

ok guenther@



Fix relro coredumps

2016-08-12 Thread Mark Kettenis
With relro, certain bits of a process that were mapped into memory as
writable are revreted back to read-only after making some initial
changes.  Since the kernel coredump code only writes out writable
pieces of memory, these relro bits are not written out.  Unfortunately
these bits contain essential pieces of information that the debugger
needs to analyze the coredump.

The diff below fixes that bu also dumping out all the bits that have
an amap allocated.  That's a sign that the pages are no longer
pristine and have been written to.

ok?


Index: uvm_unix.c
===
RCS file: /home/cvs/src/sys/uvm/uvm_unix.c,v
retrieving revision 1.58
diff -u -p -r1.58 uvm_unix.c
--- uvm_unix.c  4 Apr 2016 16:34:16 -   1.58
+++ uvm_unix.c  12 Aug 2016 21:45:22 -
@@ -161,6 +161,7 @@ uvm_coredump_walkmap(struct proc *p, voi
}
 
if (!(entry->protection & PROT_WRITE) &&
+   entry->aref.ar_amap == NULL &&
entry->start != p->p_p->ps_sigcode)
continue;
 



Re: OpenIKED Keepalive Broken

2016-08-12 Thread William Ahern
On Fri, Aug 12, 2016 at 09:56:41PM +0200, fRANz wrote:
> On Sat, Aug 6, 2016 at 2:18 AM, William Ahern
>  wrote:

> > isakmpd unconditionally sends NAT-T keepalive messages every 30 seconds,
> > whereas iked's ikev2_ike_sa_alive only sends a keepalive message iff
> > `!foundin && foundout`. But that presumes that the SA initiator is also the
> > initiator of traffic, which definitely isn't the case in my situation, and
> > seems dubious and unreliable even for real road warriors.
> 
> ...
> 
> > I'd be happy to create a proper patch if someone could explain the purpose
> > of the conditional logic. I wouldn't want to accidentally break something.
> >
> > I also wouldn't mind making the keepalive interval configurable--rather than
> > a compile-time constant--so users could deal with NAT gateways which
> > aggressively flush state.
> 
> Hello William,
> I did the same switch (from isakmpd to iked) with a lot of problems,
> maybe the same that you're reported.
> Did you receive any feedback from OpenBSD staff, catching the occasion
> of the 6.0 release ready to go?
> Regards,
> -f

No feedback, yet, but soon after posting I realized a few things:

1) My hack makes the tunnel much more stable, but not nearly as stable as
for isakmpd. I think it's because with isakmpd both peers are sending a
keepalive every 30 seconds, whereas I only applied the hack I posted to the
active, behind-the-NAT peer. See point #3, below.

2) The logic of ikev2_ike_sa_alive is intended, I think, to preserve the
limited lifetime of SAs. Otherwise by unconditionally sending a keepalive
and not distinguishing keepalive traffic, the SA might never expire. I'm not
sure why iked isn't using the standard NAT-T keepalive message format and
protocol like isakmpd does. AFAICT it's still defined by IKEv2. Maybe iked
is using a hybrid keepalive/dead peer detetion solution, but the author
forgot to account for different effective behavior in some scenarios; or
maybe it was just more expedient than implementing NAT-T keepalive messages.
Figuring that out will probably help me answer what a proper solution looks
like.

3) I'm fairly certain the keepalive interval should be configurable. The
default UDP NAT state expiration on OpenBSD, for example, is 30 seconds. The
compile-time constant interval for keepalives in isakmpd and iked is also 30
seconds--the recommended period in the RFCs. Over time it's inevitable for a
peer's keepalive packet to miss the window for preserving NAT state,
especially considering that the peer's and router's timers are going to be
synchronized. That would explain why even with isakmpd I still need a
cronjob to ping the passive host. And it explains why isakmpd is more stable
than my hacked iked--isakmpd sends keepalives independently from both sides,
so you have two shots at making the NAT expiration window. iked's keepalive
is a round-trip message; also two packets, but the timing of the first
packet is all that matters.

Of course, the NAT state could expire before an IKE keepalive for many
reasons, but a keepalive interval at least a few seconds less than the
router's NAT expiration should keep the connection stable for longer periods
of time. And rather than having a cron job run every minute, the SA child
lifetime could be set to something smallish. If and when NAT state does
expire, the active peer behind NAT will rekey the SA within a tolerable
period, reestablishing NAT state and restoring reverse traffic. Lowering
both keepalive and child SA lifetime should make these types of tunnels much
more stable and reliable, without recourse to external hacks.



jot: use strchr(3) and strspn(3) in getprec()

2016-08-12 Thread Theo Buehler
Here's one final diff I have for jot for now:

Use strchr(3) and strspn(3) instead of handrolled functions.

Part of dsl@NetBSD's r1.20.

Index: jot.c
===
RCS file: /cvs/src/usr.bin/jot/jot.c,v
retrieving revision 1.33
diff -u -p -r1.33 jot.c
--- jot.c   12 Aug 2016 21:31:11 -  1.33
+++ jot.c   12 Aug 2016 21:35:07 -
@@ -348,18 +348,9 @@ usage(void)
 static int
 getprec(char *s)
 {
-   char*p;
-   char*q;
-
-   for (p = s; *p != '\0'; p++)
-   if (*p == '.')
-   break;
-   if (*p == '\0')
+   if ((s = strchr(s, '.')) == NULL)
return (0);
-   for (q = ++p; *p != '\0'; p++)
-   if (!isdigit((unsigned char)*p))
-   break;
-   return (p - q);
+   return (strspn(s + 1, "0123456789"));
 }
 
 static void



OFW/FDT regulator API

2016-08-12 Thread Mark Kettenis
The diff below adds a simple "regulator" API, Regulators are devices
that apply voltage and/or current to subsystems to power them on.
Examples are applying voltage to an SD card, powering a USB bus,
turning on an Ethernet PHY, scaling the CPU voltage.  Regulators come
in many flavours.  The simples ones are simple gpio pins on which you
drive a voltage.  More complex ones are typically special power
control chips.

This diff only adds support for the "regulator-fixed" type.  This is a
simple gpio-based regulator that can only apply a fixed voltage.  It
doesn't need a driver of its own as it interacts with the hardware
through the gpio and pinctrl APIs.

In the future, when we need to add support for more complex types that
need their own driver, those drivers will register themselves in a
similar way to gpio controllers and pinctrl devices.  Ten we'll also
have to implement functions to set specific voltages and such.

ok?


Index: dev/ofw/ofw_regulator.c
===
RCS file: dev/ofw/ofw_regulator.c
diff -N dev/ofw/ofw_regulator.c
--- /dev/null   1 Jan 1970 00:00:00 -
+++ dev/ofw/ofw_regulator.c 12 Aug 2016 20:50:19 -
@@ -0,0 +1,75 @@
+/* $OpenBSD$   */
+/*
+ * Copyright (c) 2016 Mark Kettenis
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+int
+regulator_set(uint32_t phandle, int enable)
+{
+   uint32_t gpio[3];
+   uint32_t startup_delay;
+   int active;
+   int node;
+
+   node = OF_getnodebyphandle(phandle);
+   if (node == 0)
+   return -1;
+
+   if (!OF_is_compatible(node, "regulator-fixed"))
+   return -1;
+
+   pinctrl_byname(node, "default");
+
+   if (OF_getproplen(node, "enable-active-high") == 0)
+   active = 1;
+   else
+   active = 0;
+
+   if (OF_getpropintarray(node, "gpio", gpio,
+   sizeof(gpio)) != sizeof(gpio))
+   return -1;
+
+   gpio_controller_config_pin(gpio, GPIO_CONFIG_OUTPUT);
+   if (enable)
+   gpio_controller_set_pin(gpio, active);
+   else
+   gpio_controller_set_pin(gpio, !active);
+
+   startup_delay = OF_getpropint(node, "startup-delay-us", 0);
+   if (enable && startup_delay > 0)
+   delay(startup_delay);
+
+   return 0;
+}
+
+int
+regulator_enable(uint32_t phandle)
+{
+   return regulator_set(phandle, 1);
+}
+
+int
+regulator_disable(uint32_t phandle)
+{
+   return regulator_set(phandle, 0);
+}
Index: dev/ofw/ofw_regulator.h
===
RCS file: dev/ofw/ofw_regulator.h
diff -N dev/ofw/ofw_regulator.h
--- /dev/null   1 Jan 1970 00:00:00 -
+++ dev/ofw/ofw_regulator.h 12 Aug 2016 20:50:19 -
@@ -0,0 +1,24 @@
+/* $OpenBSD$   */
+/*
+ * Copyright (c) 2016 Mark Kettenis
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef _DEV_OFW_REGULATOR_H_
+#define _DEV_OFW_REGULATOR_H_
+
+intregulator_enable(uint32_t);
+intregulator_disable(uint32_t);
+
+#endif /* _DEV_OFW_GPIO_H_ */



Re: Fwd: jot: remove crazy switch inside while loop

2016-08-12 Thread Theo de Raadt
> Theo Buehler wrote:
> > Any comment, support or objection on this one?
> > 
> > I'm aware that this diff is hard to review, but doing it in several
> > steps won't simplify the main step all that much, I think.
> > 
> > The regression tests cover this part of the code quite well, I think.
> 
> move fast and break stuff. it's the right direction and the sooner we get
> where we're going, the better.

that's right!  best way to get a diff tested by a large group of
people, is to commit it in snapshots, and then listen for regression
reports.

hopefully regressions are found before 6.1!



Re: Fwd: jot: remove crazy switch inside while loop

2016-08-12 Thread Ted Unangst
Theo Buehler wrote:
> Any comment, support or objection on this one?
> 
> I'm aware that this diff is hard to review, but doing it in several
> steps won't simplify the main step all that much, I think.
> 
> The regression tests cover this part of the code quite well, I think.

move fast and break stuff. it's the right direction and the sooner we get
where we're going, the better.



Fwd: jot: remove crazy switch inside while loop

2016-08-12 Thread Theo Buehler
Any comment, support or objection on this one?

I'm aware that this diff is hard to review, but doing it in several
steps won't simplify the main step all that much, I think.

The regression tests cover this part of the code quite well, I think.

- Forwarded message from Theo Buehler  -

Date: Sat, 6 Aug 2016 17:51:10 +0200
From: Theo Buehler 
To: tech@openbsd.org
Subject: jot: remove crazy switch inside while loop

jot.c has a loop containing a 16-way switch that fiddles with hardcoded
values:

while (mask)/* 4 bit mask has 1's where last 4 args were given */
switch (mask) { /* fill in the 0's by default or computation */
case 001:
reps = REPS_DEF;
mask = 011;

It is very hard to follow what happens in there and most cases can't be
touched without the risk of affecting many others.

The following patch is inspired by David Laight's (dsl@NetBSD) r1.21.

Instead of hardcoded values, use names for the bit patterns that
indicate which values were specified for reps, begin, ender and step on
the command line. Then use a simple switch statement to decide what to
do and add two detailed comments that explain why we do what we do.

Note that it makes absolutely no sense to compute anything if random
output is requested, so just skip the whole switch in that case.

Also try to be a bit more helpful with the error messages: The
nonsensical "Infinite sequences cannot be bounded" does not really help
with figuring out what went wrong.

Index: jot.c
===
RCS file: /var/cvs/src/usr.bin/jot/jot.c,v
retrieving revision 1.31
diff -u -p -r1.31 jot.c
--- jot.c   5 Aug 2016 13:43:38 -   1.31
+++ jot.c   6 Aug 2016 13:48:18 -
@@ -52,11 +52,16 @@
 #defineENDER_DEF   100
 #defineSTEP_DEF1
 
+#defineSTEP1
+#defineENDER   2
+#defineBEGIN   4
+#defineREPS8
+
 #defineis_default(s)   (strcmp((s), "-") == 0)
 
 static double  begin = BEGIN_DEF;
 static double  ender = ENDER_DEF;
-static double  s = STEP_DEF;
+static double  step = STEP_DEF;
 static longreps = REPS_DEF;
 static boolrandomize;
 static boolinfinity;
@@ -131,9 +136,9 @@ main(int argc, char *argv[])
switch (argc) { /* examine args right to left, falling thru cases */
case 4:
if (!is_default(argv[3])) {
-   if (!sscanf(argv[3], "%lf", ))
+   if (!sscanf(argv[3], "%lf", ))
errx(1, "Bad s value:  %s", argv[3]);
-   mask |= 01;
+   mask |= STEP;
if (randomize)
warnx("random seeding not supported");
}
@@ -141,7 +146,7 @@ main(int argc, char *argv[])
if (!is_default(argv[2])) {
if (!sscanf(argv[2], "%lf", ))
ender = argv[2][strlen(argv[2])-1];
-   mask |= 02;
+   mask |= ENDER;
if (prec == -1)
n = getprec(argv[2]);
}
@@ -149,7 +154,7 @@ main(int argc, char *argv[])
if (!is_default(argv[1])) {
if (!sscanf(argv[1], "%lf", ))
begin = argv[1][strlen(argv[1])-1];
-   mask |= 04;
+   mask |= BEGIN;
if (prec == -1)
prec = getprec(argv[1]);
if (n > prec)   /* maximum precision */
@@ -159,7 +164,7 @@ main(int argc, char *argv[])
if (!is_default(argv[0])) {
if (!sscanf(argv[0], "%ld", ))
errx(1, "Bad reps value:  %s", argv[0]);
-   mask |= 010;
+   mask |= REPS;
if (prec == -1)
prec = 0;
}
@@ -172,101 +177,74 @@ main(int argc, char *argv[])
argv[4]);
}
getformat();
-   while (mask)/* 4 bit mask has 1's where last 4 args were given */
-   switch (mask) { /* fill in the 0's by default or computation */
-   case 001:
-   reps = REPS_DEF;
-   mask = 011;
-   break;
-   case 002:
-   reps = REPS_DEF;
-   mask = 012;
-   break;
-   case 003:
-   reps = REPS_DEF;
-   mask = 013;
-   break;
-   case 004:
-   reps = REPS_DEF;
-   mask = 014;
-   break;
-  

Re: per cpu memory in the kernel

2016-08-12 Thread Mark Kettenis
> From: Martin Pieuchot 
> Date: Fri, 12 Aug 2016 20:44:04 +0200
> 
> On 08/11/16 06:43, David Gwynne wrote:
> > ive been tinkering with per cpu memory in the kernel
> > [...]
> 
> I'd like to have more people comment on this because we need an MP-safe
> way to handle counters in the network stack.

I had a first look.  Need to have a bit more time to digest this.

> > im still debating whether the API should do protection against
> > interrupts on the local cpu by handling spls for you. at the moment
> > it is up to the caller to manually splfoo() and splx(), but im half
> > convinced that cpumem_enter and _leave should do that on behalf of
> > the caller.
> 
> Let's see how it looks like when we have more usages of per cpu memory.

The per-CPU counters API is likely to need it though.  I'm less sure
about the per-CPU memory API itself.

> > this diff also includes two uses of the percpu code. one is to
> > provide per cpu caches of pool items, and the other is per cpu
> > counters for mbuf statistics.
> 
> I'd like to discuss pool cache later.

The pool cache sits on top of the per-CPU memory API isn't it?  So it
doesn't need to go in simultaniously.

> > ive added a wrapper around percpu memory for counters. basically
> > you ask it for N counters, where each counter is a uint64_t.
> > counters_enter will give you a per cpu reference to these N counters
> > which you can increment and decrement as you wish. internally the
> > api will version the per cpu counters so a reader can know when
> > theyre consistent, which is important on 32bit archs (where 64bit
> > ops arent necessarily atomic), or where you want several counters
> > to be consistent with each other (like packet and byte counters).
> > counters_read is provided to use the above machinery for a consistent
> > read.
> 
> Comment below.
> 
> > secondly, there is a boot strapping problem with per cpu data
> > structures, which is very apparent with the mbuf layer. the problem
> > is we dont know how many cpus are in the system until we're halfway
> > through attaching device drivers in the system. however, if we want
> > to use percpu data structures during attach we need to know how
> > many cpus we have. mbufs are allocated during attach, so we need
> > to know how many cpus we have before attach.
> 
> What about using ncpusfound?

On sparc64 that gets called very early, before pmap_bootstrap(),
because we need to allocate some per-CPU memory ;).

On other systems we typically calculate it on the fly as we're
attaching CPUs.  That may not be early enough.

> > +uint64_t *
> > +counters_enter(struct counters_ref *ref, struct cpumem *cm)
> > +{
> > +   ref->gen = cpumem_enter(cm);
> > +   (*ref->gen)++; /* make the generation number odd */
> > +   return (ref->gen + 1);
> > +}
> > +
> > +void
> > +counters_leave(struct counters_ref *ref, struct cpumem *cm)
> > +{
> > +   membar_producer();
> > +   (*ref->gen)++; /* make the generation number even again */
> > +   cpumem_leave(cm, ref->gen);
> > +}
> 
> So every counter++ will now look like a critical section?  Do we really
> need a memory barrier for every +1?  What can happen if a CPU is reading
> a value that is currently being updated?  Can't we live with that?

Not for 64-bit counters on 32-bit machines.  You could do something
clever and read the upper 32 bits, then read the lower 32 bits and
then read the upper 32-bits again and start over of you get different
value the 2nd time around.  But you still need to guarantee that
whoever updates the counter writes the upper 32 bits and lower 32 bits
in a particular order, and that they become visible to other CPUs in
that order.  And for that you need memory bars.



Re: per cpu memory in the kernel

2016-08-12 Thread Martin Pieuchot
On 08/11/16 06:43, David Gwynne wrote:
> ive been tinkering with per cpu memory in the kernel
> [...]

I'd like to have more people comment on this because we need an MP-safe
way to handle counters in the network stack.

> im still debating whether the API should do protection against
> interrupts on the local cpu by handling spls for you. at the moment
> it is up to the caller to manually splfoo() and splx(), but im half
> convinced that cpumem_enter and _leave should do that on behalf of
> the caller.

Let's see how it looks like when we have more usages of per cpu memory.

> this diff also includes two uses of the percpu code. one is to
> provide per cpu caches of pool items, and the other is per cpu
> counters for mbuf statistics.

I'd like to discuss pool cache later.

> ive added a wrapper around percpu memory for counters. basically
> you ask it for N counters, where each counter is a uint64_t.
> counters_enter will give you a per cpu reference to these N counters
> which you can increment and decrement as you wish. internally the
> api will version the per cpu counters so a reader can know when
> theyre consistent, which is important on 32bit archs (where 64bit
> ops arent necessarily atomic), or where you want several counters
> to be consistent with each other (like packet and byte counters).
> counters_read is provided to use the above machinery for a consistent
> read.

Comment below.

> secondly, there is a boot strapping problem with per cpu data
> structures, which is very apparent with the mbuf layer. the problem
> is we dont know how many cpus are in the system until we're halfway
> through attaching device drivers in the system. however, if we want
> to use percpu data structures during attach we need to know how
> many cpus we have. mbufs are allocated during attach, so we need
> to know how many cpus we have before attach.

What about using ncpusfound?

> +uint64_t *
> +counters_enter(struct counters_ref *ref, struct cpumem *cm)
> +{
> + ref->gen = cpumem_enter(cm);
> + (*ref->gen)++; /* make the generation number odd */
> + return (ref->gen + 1);
> +}
> +
> +void
> +counters_leave(struct counters_ref *ref, struct cpumem *cm)
> +{
> + membar_producer();
> + (*ref->gen)++; /* make the generation number even again */
> + cpumem_leave(cm, ref->gen);
> +}

So every counter++ will now look like a critical section?  Do we really
need a memory barrier for every +1?  What can happen if a CPU is reading
a value that is currently being updated?  Can't we live with that?



www/60.html missing "syncachelimit"

2016-08-12 Thread Marcus MERIGHI
someone once said tech@ is where patches go and www@ is no longer
there...

Bye, Marcus

Index: 60.html
===
RCS file: /cvs/www/60.html,v
retrieving revision 1.47
diff -u -p -u -r1.47 60.html
--- 60.html 10 Aug 2016 13:49:33 -  1.47
+++ 60.html 12 Aug 2016 16:44:51 -
@@ -244,7 +244,7 @@ to 6.0.
 -s -p tcp shows the relevant information to tune
 the SYN cache with
 http://man.openbsd.org/sysctl.8;>sysctl(8)
-net.inet.tcp.
+net.inet.tcp.syncachelimit.
 The administrator can require root privileges for binding to some TCP
and UDP ports with
http://man.openbsd.org/sysctl.8;>sysctl(8)



httpd: be stricter with TLS configuration

2016-08-12 Thread Joel Sing
The following diff makes httpd stricter with respect to TLS configuration:

- Do not allow TLS and non-TLS to be configured on the same port.
- Do not allow TLS options to be specified without a TLS listener.
- Ensure that TLS options are the same when a server is specified on the
  same address/port.

Currently, these configurations are permitted but do not work as intended.

This also factors out (and reuses) the server matching code that was already
duplicated.

ok?

Index: parse.y
===
RCS file: /cvs/src/usr.sbin/httpd/parse.y,v
retrieving revision 1.78
diff -u -p -r1.78 parse.y
--- parse.y 21 Jun 2016 21:35:24 -  1.78
+++ parse.y 12 Aug 2016 16:48:28 -
@@ -107,8 +107,10 @@ int host_if(const char *, struct addre
 int host(const char *, struct addresslist *,
int, struct portrange *, const char *, int);
 voidhost_free(struct addresslist *);
+struct server  *server_match(struct server *, int);
 struct server  *server_inherit(struct server *, struct server_config *,
struct server_config *);
+int server_tls_cmp(struct server *, struct server *);
 int getservice(char *);
 int is_if_in_group(const char *, const char *);
 
@@ -283,24 +285,13 @@ server: SERVER optmatch STRING{
 
TAILQ_INSERT_TAIL(>srv_hosts, srv_conf, entry);
} '{' optnl serveropts_l '}'{
-   struct server   *s = NULL, *sn;
+   struct server   *s, *sn;
struct server_config*a, *b;
 
srv_conf = >srv_conf;
 
-   TAILQ_FOREACH(s, conf->sc_servers, srv_entry) {
-   if ((s->srv_conf.flags &
-   SRVFLAG_LOCATION) == 0 &&
-   strcmp(s->srv_conf.name,
-   srv->srv_conf.name) == 0 &&
-   s->srv_conf.port == srv->srv_conf.port &&
-   sockaddr_cmp(
-   (struct sockaddr *)>srv_conf.ss,
-   (struct sockaddr *)>srv_conf.ss,
-   s->srv_conf.prefixlen) == 0)
-   break;
-   }
-   if (s != NULL) {
+   /* Check if the new server already exists. */
+   if (server_match(srv, 1) != NULL) {
yyerror("server \"%s\" defined twice",
srv->srv_conf.name);
serverconfig_free(srv_conf);
@@ -315,16 +306,39 @@ server: SERVER optmatch STRING{
YYERROR;
}
 
+   if ((s = server_match(srv, 0)) != NULL) {
+   if ((s->srv_conf.flags & SRVFLAG_TLS) != 
+   (srv->srv_conf.flags & SRVFLAG_TLS)) {
+   yyerror("server \"%s\": TLS and "
+   "non-TLS on same address/port",
+   srv->srv_conf.name);
+   serverconfig_free(srv_conf);
+   free(srv);
+   YYERROR;
+   }
+   if (server_tls_cmp(s, srv) != 0) {
+   yyerror("server \"%s\": TLS "
+   "configuration mismatch on same "
+   "address/port",
+   srv->srv_conf.name);
+   serverconfig_free(srv_conf);
+   free(srv);
+   YYERROR;
+   }
+   }
+
if ((srv->srv_conf.flags & SRVFLAG_TLS) &&
srv->srv_conf.tls_protocols == 0) {
-   yyerror("no TLS protocols");
+   yyerror("server \"%s\": no TLS protocols",
+   srv->srv_conf.name);
+   serverconfig_free(srv_conf);
free(srv);
YYERROR;
}
 
if (server_tls_load_keypair(srv) == -1) {
-   yyerror("failed to load public/private keys "
-   "for server %s", srv->srv_conf.name);
+   yyerror("server \"%s\": failed to load "
+  

fuse: dedup vnode type

2016-08-12 Thread Martin Natano
Fuse stores the vnode type in two places: vtype in struct fusefs_node
and v_type in struct vnode. Given the fact, that fusefs_node structs are
never allocated without an associated vnode and those two fields are
always in sync, one of those locations is superfluous.

Ok?

natano


Index: miscfs/fuse/fuse_lookup.c
===
RCS file: /cvs/src/sys/miscfs/fuse/fuse_lookup.c,v
retrieving revision 1.11
diff -u -p -r1.11 fuse_lookup.c
--- miscfs/fuse/fuse_lookup.c   19 Mar 2016 12:04:15 -  1.11
+++ miscfs/fuse/fuse_lookup.c   11 Aug 2016 10:26:51 -
@@ -147,7 +147,6 @@ fusefs_lookup(void *v)
goto out;
 
tdp->v_type = IFTOVT(fbuf->fb_vattr.va_mode);
-   VTOI(tdp)->vtype = tdp->v_type;
*vpp = tdp;
cnp->cn_flags |= SAVENAME;
 
@@ -183,10 +182,8 @@ fusefs_lookup(void *v)
} else {
error = VFS_VGET(fmp->mp, nid, );
 
-   if (!error) {
+   if (!error)
tdp->v_type = IFTOVT(fbuf->fb_vattr.va_mode);
-   VTOI(tdp)->vtype = tdp->v_type;
-   }
 
update_vattr(fmp->mp, >fb_vattr);
 
Index: miscfs/fuse/fuse_vfsops.c
===
RCS file: /cvs/src/sys/miscfs/fuse/fuse_vfsops.c,v
retrieving revision 1.23
diff -u -p -r1.23 fuse_vfsops.c
--- miscfs/fuse/fuse_vfsops.c   19 Jun 2016 11:54:33 -  1.23
+++ miscfs/fuse/fuse_vfsops.c   11 Aug 2016 10:26:51 -
@@ -170,15 +170,12 @@ int
 fusefs_root(struct mount *mp, struct vnode **vpp)
 {
struct vnode *nvp;
-   struct fusefs_node *ip;
int error;
 
if ((error = VFS_VGET(mp, (ino_t)FUSE_ROOTINO, )) != 0)
return (error);
 
-   ip = VTOI(nvp);
nvp->v_type = VDIR;
-   ip->vtype = VDIR;
 
*vpp = nvp;
return (0);
Index: miscfs/fuse/fuse_vnops.c
===
RCS file: /cvs/src/sys/miscfs/fuse/fuse_vnops.c,v
retrieving revision 1.28
diff -u -p -r1.28 fuse_vnops.c
--- miscfs/fuse/fuse_vnops.c19 Jun 2016 11:54:33 -  1.28
+++ miscfs/fuse/fuse_vnops.c11 Aug 2016 10:26:51 -
@@ -235,7 +235,7 @@ fusefs_open(void *v)
return (ENXIO);
 
isdir = 0;
-   if (ip->vtype == VDIR)
+   if (ap->a_vp->v_type == VDIR)
isdir = 1;
else {
if ((ap->a_mode & FREAD) && (ap->a_mode & FWRITE)) {
@@ -274,7 +274,7 @@ fusefs_close(void *v)
if (!fmp->sess_init)
return (0);
 
-   if (ip->vtype == VDIR) {
+   if (ap->a_vp->v_type == VDIR) {
isdir = 1;
 
if (ip->fufh[fufh_type].fh_type != FUFH_INVALID)
@@ -665,7 +665,6 @@ fusefs_symlink(void *v)
}
 
tdp->v_type = VLNK;
-   VTOI(tdp)->vtype = tdp->v_type;
VTOI(tdp)->parent = dp->ufs_ino.i_number;
VN_KNOTE(ap->a_dvp, NOTE_WRITE);
 
@@ -762,7 +761,7 @@ fusefs_inactive(void *v)
fufh = &(ip->fufh[type]);
if (fufh->fh_type != FUFH_INVALID)
fusefs_file_close(fmp, ip, fufh->fh_type, type,
-   (ip->vtype == VDIR), ap->a_p);
+   (vp->v_type == VDIR), ap->a_p);
}
 
error = VOP_GETATTR(vp, , cred, p);
@@ -835,7 +834,7 @@ fusefs_reclaim(void *v)
if (fufh->fh_type != FUFH_INVALID) {
printf("fusefs: vnode being reclaimed is valid\n");
fusefs_file_close(fmp, ip, fufh->fh_type, type,
-   (ip->vtype == VDIR), ap->a_p);
+   (vp->v_type == VDIR), ap->a_p);
}
}
/*
@@ -932,8 +931,6 @@ fusefs_create(void *v)
}
 
tdp->v_type = IFTOVT(fbuf->fb_io_mode);
-   VTOI(tdp)->vtype = tdp->v_type;
-
if (dvp != NULL && dvp->v_type == VDIR)
VTOI(tdp)->parent = ip->ufs_ino.i_number;
 
@@ -998,8 +995,6 @@ fusefs_mknod(void *v)
}
 
tdp->v_type = IFTOVT(fbuf->fb_io_mode);
-   VTOI(tdp)->vtype = tdp->v_type;
-
if (dvp != NULL && dvp->v_type == VDIR)
VTOI(tdp)->parent = ip->ufs_ino.i_number;
 
@@ -1211,7 +1206,7 @@ abortit:
 * "ls" or "pwd" with the "." directory entry missing, and "cd .."
 * doesn't work if the ".." entry is missing.
 */
-   if (ip->vtype == VDIR) {
+   if (fvp->v_type == VDIR) {
/*
 * Avoid ".", "..", and aliases of "." for obvious reasons.
 */
@@ -1325,8 +1320,6 @@ fusefs_mkdir(void *v)
}
 
tdp->v_type = IFTOVT(fbuf->fb_io_mode);
-   VTOI(tdp)->vtype = tdp->v_type;
-
if (dvp != NULL && dvp->v_type == VDIR)
VTOI(tdp)->parent = ip->ufs_ino.i_number;
 
Index: miscfs/fuse/fusefs_node.h

Re: fuse: dedup vnode type

2016-08-12 Thread Martin Natano
> Index: miscfs/fuse/fusefs_node.h
> ===
> RCS file: /cvs/src/sys/miscfs/fuse/fusefs_node.h,v
> retrieving revision 1.2
> diff -u -p -r1.2 fusefs_node.h
> --- miscfs/fuse/fusefs_node.h 1 Feb 2014 09:30:38 -   1.2
> +++ miscfs/fuse/fusefs_node.h 11 Aug 2016 10:26:51 -
> @@ -46,8 +46,6 @@ struct fusefs_node {
>  
>   /** meta **/
>   off_t filesize;
> - uint64_t  nlookup;
> - enum vtypevtype;
>  };

While there I also removed the unused 'nlookup' field,



Re: turn rb tree code into functions in the kernel

2016-08-12 Thread Ted Unangst
David Gwynne wrote:
> i recently proposed replacing a hash with an rb tree somewhere in
> the network stack, but it was pointed out that rb trees are big.
> 
> in hindsight i think the other person was talking about the size
> of an RB_ENTRY inside each thing you're tracking, but it made me
> look at the code size of rb trees again. it turns out on amd64 its
> about 2.5k of code per type of rb tree. a type being each RB_ENTRY
> inside a particular struct. ie, if a struct has two RB_ENTRYs in
> it, then it generates two chunks of code, one for each of them.

I love everything about this, but didn't actually look much at the diff or try
it out.



Don't mention 6bone in sshd_config(5)

2016-08-12 Thread Jeremie Courreges-Anglas

IPv6 provides a dedicated range for documentation, so let's use it.
This is consistent with the IPv4 example.

ok?

Index: sshd_config.5
===
RCS file: /cvs/src/usr.bin/ssh/sshd_config.5,v
retrieving revision 1.227
diff -u -p -p -u -r1.227 sshd_config.5
--- sshd_config.5   19 Jul 2016 12:59:16 -  1.227
+++ sshd_config.5   12 Aug 2016 13:51:00 -
@@ -1075,7 +1075,7 @@ criteria may additionally contain addres
 address/masklen format, e.g.\&
 .Dq 192.0.2.0/24
 or
-.Dq 3ffe:::/32 .
+.Dq 2001:db8::/32 .
 Note that the mask length provided must be consistent with the address -
 it is an error to specify a mask length that is too long for the address
 or one with bits set in this host portion of the address.

-- 
jca | PGP: 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: acpiec: handle burst mode failure

2016-08-12 Thread joshua stein
On Fri, 08 Jul 2016 at 18:51:17 -0500, joshua stein wrote:
> If the EC fails to go into burst mode for whatever reason, the Burst
> Acknowledge byte will not be there to read, which means the status
> won't have EC_STAT_OBF, which means acpiec_wait will spin forever,
> hanging the machine.
> 
> This at least gets us moving again, ignoring the failure to enter
> burst mode.

That patch was put into the latest snapshot (but not yet in CVS) and
some people are seeing fallout from it like strange thermal and
battery status readings.

This is a different patch which just doesn't bother with burst mode
unless the transfer is big.  This should match how Linux and FreeBSD
behave.  I have at least one confirmation that this fixes things on
the ThinkPad X260.

Anyone else that is seeing strangeness on the latest snapshot, I ask
that you try this patch against -current and see if your issues go
away.


Index: sys/dev/acpi/acpiec.c
===
RCS file: /cvs/src/sys/dev/acpi/acpiec.c,v
retrieving revision 1.53
diff -u -p -u -r1.53 acpiec.c
--- sys/dev/acpi/acpiec.c   7 May 2016 18:03:36 -   1.53
+++ sys/dev/acpi/acpiec.c   11 Aug 2016 20:06:54 -
@@ -218,10 +218,12 @@ acpiec_read(struct acpiec_softc *sc, u_i
 */
dnprintf(20, "%s: read %d, %d\n", DEVNAME(sc), (int)addr, len);
sc->sc_ecbusy = 1;
-   acpiec_burst_enable(sc);
+   if (len > 1)
+   acpiec_burst_enable(sc);
for (reg = 0; reg < len; reg++)
buffer[reg] = acpiec_read_1(sc, addr + reg);
-   acpiec_burst_disable(sc);
+   if (len > 1)
+   acpiec_burst_disable(sc);
sc->sc_ecbusy = 0;
 }
 
@@ -237,10 +239,12 @@ acpiec_write(struct acpiec_softc *sc, u_
 */
dnprintf(20, "%s: write %d, %d\n", DEVNAME(sc), (int)addr, len);
sc->sc_ecbusy = 1;
-   acpiec_burst_enable(sc);
+   if (len > 1)
+   acpiec_burst_enable(sc);
for (reg = 0; reg < len; reg++)
acpiec_write_1(sc, addr + reg, buffer[reg]);
-   acpiec_burst_disable(sc);
+   if (len > 1)
+   acpiec_burst_disable(sc);
sc->sc_ecbusy = 0;
 }
 



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Mark Kettenis
> Date: Fri, 12 Aug 2016 14:26:34 +0200
> From: Claudio Jeker 
> 
> On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> > 
> > > On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> > > 
> > > I sent this message to dlg@ directly to discuss my modification of his
> > > diff to make the
> > > bigger mbuf clusters work. i got no response so far, thats why i
> > > decided to post it on tech@
> > > directly. Maybe this way i get faster some feedback :)
> > 
> > hey simon,
> > 
> > i was travelling when you sent your mail to me and then it fell out of my 
> > head. sorry about that.
> > 
> > if this is working correctly then i would like to put it in the tree. from 
> > the light testing i have done, it is working correctly. would anyone object?
> > 
> > some performance measurement would also be interesting :)
> > 
> 
> I would prefer we take the diff I started at n2k16. I need to dig it out
> though.

I think the subject of the thread has become misleading.  At least the
diff I think David and Simon are talking about is about using the
larger mbuf pools for socket buffers and no longer about using memory
>4G for them.

David, Simon, best to start all over again, and repost the diff with a
proper subject and explanation.  You shouldn't be forcing other
developers to read through several pages of private conversations.



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Claudio Jeker
On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> 
> > On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> > 
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the
> > bigger mbuf clusters work. i got no response so far, thats why i
> > decided to post it on tech@
> > directly. Maybe this way i get faster some feedback :)
> 
> hey simon,
> 
> i was travelling when you sent your mail to me and then it fell out of my 
> head. sorry about that.
> 
> if this is working correctly then i would like to put it in the tree. from 
> the light testing i have done, it is working correctly. would anyone object?
> 
> some performance measurement would also be interesting :)
> 

I would prefer we take the diff I started at n2k16. I need to dig it out
though.

-- 
:wq Claudio



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Mark Kettenis
> From: David Gwynne 
> Date: Fri, 12 Aug 2016 16:38:45 +1000
> 
> > On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> > 
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the
> > bigger mbuf clusters work. i got no response so far, thats why i
> > decided to post it on tech@
> > directly. Maybe this way i get faster some feedback :)
> 
> hey simon,
> 
> i was travelling when you sent your mail to me and then it fell out
> of my head. sorry about that.
> 
> if this is working correctly then i would like to put it in the tree. from 
> the light testing i have done, it is working correctly. would anyone object?
> 
> some performance measurement would also be interesting :)

Hmm, during debugging I've relied on the fact that only drivers
allocate the larger mbuf clusters for their rx rings.

Anyway, shouldn't the diff be using ulmin()?


> dlg
> 
> > 
> > BR
> > Simon
> > 
> > ### Original Mail:
> > 
> > -- Forwarded message --
> > From: Simon Mages 
> > Date: Fri, 22 Jul 2016 13:24:24 +0200
> > Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> > To: David Gwynne 
> > 
> > Hi,
> > 
> > I think i found the problem with your diff regarding the bigger mbuf 
> > clusters.
> > 
> > You choose a buffer size based on space and resid, but what happens when 
> > resid
> > is larger then space and space is for example 2050? The cluster choosen has
> > then the size 4096. But this size is to large for the socket buffer. In the
> > past this was never a problem because you only allocated external clusters
> > of size MCLBYTES and this was only done when space was larger then MCLBYTES.
> > 
> > diff:
> > Index: kern/uipc_socket.c
> > ===
> > RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> > retrieving revision 1.152
> > diff -u -p -u -p -r1.152 uipc_socket.c
> > --- kern/uipc_socket.c  13 Jun 2016 21:24:43 -  1.152
> > +++ kern/uipc_socket.c  22 Jul 2016 10:56:02 -
> > @@ -496,15 +496,18 @@ restart:
> > mlen = MLEN;
> > }
> > if (resid >= MINCLSIZE && space >= MCLBYTES) {
> > -   MCLGET(m, M_NOWAIT);
> > +   MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> > +   lmin(space, MAXMCLBYTES)));
> > if ((m->m_flags & M_EXT) == 0)
> > goto nopages;
> > if (atomic && top == 0) {
> > -   len = ulmin(MCLBYTES - max_hdr,
> > -   resid);
> > +   len = lmin(lmin(resid, space),
> > +   m->m_ext.ext_size -
> > +   max_hdr);
> > m->m_data += max_hdr;
> > } else
> > -   len = ulmin(MCLBYTES, resid);
> > +   len = lmin(lmin(resid, space),
> > +   m->m_ext.ext_size);
> > space -= len;
> > } else {
> > nopages:
> > 
> > Im using this diff no for a while on my notebook and everything works as
> > expected. But i had no time to realy test it or test the performance. This 
> > will
> > be my next step.
> > 
> > I reproduced the unix socket problem you mentioned with the following little
> > programm:
> > 
> > #include 
> > #include 
> > #include 
> > #include 
> > #include 
> > #include 
> > #include 
> > 
> > #include 
> > #include 
> > #include 
> > 
> > #define FILE "/tmp/afile"
> > 
> > int senddesc(int fd, int so);
> > int recvdesc(int so);
> > 
> > int
> > main(void)
> > {
> > struct stat sb;
> > int sockpair[2];
> > pid_t pid = 0;
> > int status;
> > int newfile;
> > 
> > if (unlink(FILE) < 0)
> > warn("unlink: %s", FILE);
> > 
> > int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC);
> > 
> > if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
> > err(1, "socketpair");
> > 
> > if ((pid =fork())) {
> > senddesc(file, sockpair[0]);
> > if (waitpid(pid, , 0) < 0)
> > err(1, "waitpid");
> > } else {
> > newfile = recvdesc(sockpair[1]);
> > if (fstat(newfile, ) < 0)
> > err(1, "fstat");
> > }
> > 
> > return 0;
> > }
> > 
> > int
> > senddesc(int fd, int so)
> > {
> > struct msghdr msg;
> > struct cmsghdr *cmsg;
> > union {
> > struct 

Re: fuse needs ufs ihash

2016-08-12 Thread Martin Natano
On Thu, Aug 11, 2016 at 10:48:59AM -0400, Ted Unangst wrote:
> Martin Natano wrote:
> > I'm already working on a diff to decouple fuse from ufs ihash. In the
> > meantime: Make sure the necessary code is compiled in when fuse is
> > enabled in the config.
> 
> Are people building kernels with FFS? This is like the old INET option, which
> I deleted because it was insane to disable it.

I don't think a kernel without ffs would be particulary useful. If you
want to remove the FFS option and noone comes up with a good reason to
keep it consider having my ok. Although as long as the option exists I
think it would be nice to have it work correctly.

Some background information why I want to untangle fuse from ufs ihash:
I believe the usage of ufs_ihash in fuse produces some ugly code in
fuse. e.g. (from fusefs_node.h):

...
#include 
...
#ifdef ITOV
# undef ITOV
#endif
#define ITOV(ip) ((ip)->ufs_ino.i_vnode)

#ifdef VTOI
# undef VTOI
#endif
#define VTOI(vp) ((struct fusefs_node *)(vp)->v_data)
...


and (from fuse_vfsops.c):

...
ip->ufs_ino.i_ump = (struct ufsmount *)fmp;
...

and (everywhere):

...
fmp = (struct fusefs_mnt *)ip->ufs_ino.i_ump;
...


Furthermove, struct fusefs_node contains a struct inode, which contains
much more fields than are required for just the ufs ihash stuff which is
actually used by fuse, resulting in a bloated fuse node structure with
most of the stuff in it unused.

Also, the dichotomy between ino_t and ufsino_t in fuse would not be
necessary when no ufs code is involved, eliminating the need to have
overflow checks in place, that should be there, but are not.

So, my idea to untangle fuse from ufs has nothing to do with trying to
build a kernel without ffs. ;) IMHO, kill the option.

natano



Re: turn rb tree code into functions in the kernel

2016-08-12 Thread David Gwynne
this updates the diff after the SLIST changes in uvm.

On Thu, Aug 11, 2016 at 10:19:20AM +1000, David Gwynne wrote:
> i recently proposed replacing a hash with an rb tree somewhere in
> the network stack, but it was pointed out that rb trees are big.
> 
> in hindsight i think the other person was talking about the size
> of an RB_ENTRY inside each thing you're tracking, but it made me
> look at the code size of rb trees again. it turns out on amd64 its
> about 2.5k of code per type of rb tree. a type being each RB_ENTRY
> inside a particular struct. ie, if a struct has two RB_ENTRYs in
> it, then it generates two chunks of code, one for each of them.
> 
> this refactors it so the "type" info is represented as separate
> info, rather than encoded in generated functions for each RB_ENTRY.
> the result of this is a bunch of rb functions that take this type
> as their first argument, and then a bunch of void *s because they
> can now work on anything.
> 
> all the void *s were terrifying when i tried using this in uvm, cos
> there's a lot of types in there and there was no help from the
> compiler if you passed the wrong thing with the wrong rb type.
> because of this i have RB_PROTOTYPE generate a bunch of static
> inline functions that take arguments of specific node types which
> in turn call the generic rb functions with the right instance of
> the rb type struct.
> 
> a particular challenge of the above was making the compare function
> take arguments of the particular types rather than void *s. internally
> the rb functions pass void *s to the compare function, but to get
> type safety in the callback i made RBT_GENERATE provide a wrapper
> function around the supplied comparison function which ensures that
> its arguments are the same as the specified type. it also makes the
> code more obvious and lets the compiler help.
> 
> because the rb code is all functions, it means the comparison
> function cannot be inlined into insert, find and nfind anymore. the
> user supplied one can be inlined into the wrapper function though,
> so i try to do this as much as possible.
> 
> while here i also made the compare functions take const arguments.
> 
> the tree.h RB code also supports augmented red black trees. and of
> course UVM uses it, cos it uses all the rb features. the function
> version also does it, but it tries to batch the augment calls so
> it can do less conditionals.
> 
> the function version is also aggressively inlined. the only function
> calls the interface does internally are to the comparison and augment
> functions.
> 
> i have tried to benchmark some of this. in the places it is slower
> (eg, insert) it is only very slightly slower, and only if you're
> doing millions of RB operations and nothing else. in some situations
> it is significantly faster (delete), which i think is because of
> the inlining.
> 
> the diff below also includes switches from the tree.h code to the
> function code in most subsystems except for ipsec and net80211. a
> bunch of pmaps in various archs likely need to be touched too.
> 
> overall we should get ~50k of code saving on a GENERIC kernel, with
> the bonus that adding more RB trees would be free. there's also an
> argument that code that uses lots of different rb trees (eg, uvm)
> will benefit by not having to pull multiple versions of the rb code
> into the cache.
> 
> thoughts? tests?

Index: sys/sys/rbtree.h
===
RCS file: sys/sys/rbtree.h
diff -N sys/sys/rbtree.h
--- /dev/null   1 Jan 1970 00:00:00 -
+++ sys/sys/rbtree.h12 Aug 2016 07:06:57 -
@@ -0,0 +1,224 @@
+/* $OpenBSD */
+
+/*
+ * Copyright (c) 2016 David Gwynne 
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef _SYS_RBTREE_H_
+#define _SYS_RBTREE_H_
+
+#include 
+
+struct rb_type {
+   int (*t_compare)(const void *, const void *);
+   void(*t_augment)(void *);
+   size_tt_offset; /* offset of rb_entry in type */
+};
+
+struct rb_entry {
+   struct rb_entry  *rbe_parent;
+   struct rb_entry  *rbe_left;
+   struct rb_entry  *rbe_right;
+   unsigned int  rbe_color;
+};
+
+struct rb_tree {
+   struct rb_entry *rbt_root;
+};
+
+static inline 

Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Tinker

On 2016-06-23 05:42, Theo de Raadt wrote:

secondly, allocating more than 4g at a time to socket buffers is
generally a waste of memory.


and there is one further problem.

Eventually, this subsystem will starve the system.  Other subsystems
which also need large amounts of memory, then have to scramble.  There
have to be backpressure mechanisms in each subsystem to force out
memory.

There is no such mechanism in socket buffers.

The mechanisms in the remaining parts of the kernel have always proven
to be weak, as in, they don't interact as nicely as we want, to create
space.  There has been much work to make them work better.

However in socket buffers, there is no such mechanism.  What are
you going to do.  Throw data away?  You can't do that.  Therefore,
you are holding the remaining system components hostage, and your
diff creates deadlock.

You probably tested your diff under ideal conditions with gobs of
memory...


The backpressure mechanism to free up [disk IO] buffer cache content is 
really effective though, so 90 is a mostly suitable bufcachepercent 
sysctl setting, right?




Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread David Gwynne

> On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> 
> I sent this message to dlg@ directly to discuss my modification of his
> diff to make the
> bigger mbuf clusters work. i got no response so far, thats why i
> decided to post it on tech@
> directly. Maybe this way i get faster some feedback :)

hey simon,

i was travelling when you sent your mail to me and then it fell out of my head. 
sorry about that.

if this is working correctly then i would like to put it in the tree. from the 
light testing i have done, it is working correctly. would anyone object?

some performance measurement would also be interesting :)

dlg

> 
> BR
> Simon
> 
> ### Original Mail:
> 
> -- Forwarded message --
> From: Simon Mages 
> Date: Fri, 22 Jul 2016 13:24:24 +0200
> Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> To: David Gwynne 
> 
> Hi,
> 
> I think i found the problem with your diff regarding the bigger mbuf clusters.
> 
> You choose a buffer size based on space and resid, but what happens when resid
> is larger then space and space is for example 2050? The cluster choosen has
> then the size 4096. But this size is to large for the socket buffer. In the
> past this was never a problem because you only allocated external clusters
> of size MCLBYTES and this was only done when space was larger then MCLBYTES.
> 
> diff:
> Index: kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.152
> diff -u -p -u -p -r1.152 uipc_socket.c
> --- kern/uipc_socket.c13 Jun 2016 21:24:43 -  1.152
> +++ kern/uipc_socket.c22 Jul 2016 10:56:02 -
> @@ -496,15 +496,18 @@ restart:
>   mlen = MLEN;
>   }
>   if (resid >= MINCLSIZE && space >= MCLBYTES) {
> - MCLGET(m, M_NOWAIT);
> + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> + lmin(space, MAXMCLBYTES)));
>   if ((m->m_flags & M_EXT) == 0)
>   goto nopages;
>   if (atomic && top == 0) {
> - len = ulmin(MCLBYTES - max_hdr,
> - resid);
> + len = lmin(lmin(resid, space),
> + m->m_ext.ext_size -
> + max_hdr);
>   m->m_data += max_hdr;
>   } else
> - len = ulmin(MCLBYTES, resid);
> + len = lmin(lmin(resid, space),
> + m->m_ext.ext_size);
>   space -= len;
>   } else {
> nopages:
> 
> Im using this diff no for a while on my notebook and everything works as
> expected. But i had no time to realy test it or test the performance. This 
> will
> be my next step.
> 
> I reproduced the unix socket problem you mentioned with the following little
> programm:
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include 
> #include 
> #include 
> 
> #define FILE "/tmp/afile"
> 
> int senddesc(int fd, int so);
> int recvdesc(int so);
> 
> int
> main(void)
> {
>   struct stat sb;
>   int sockpair[2];
>   pid_t pid = 0;
>   int status;
>   int newfile;
> 
>   if (unlink(FILE) < 0)
>   warn("unlink: %s", FILE);
> 
>   int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC);
> 
>   if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
>   err(1, "socketpair");
> 
>   if ((pid =fork())) {
>   senddesc(file, sockpair[0]);
>   if (waitpid(pid, , 0) < 0)
>   err(1, "waitpid");
>   } else {
>   newfile = recvdesc(sockpair[1]);
>   if (fstat(newfile, ) < 0)
>   err(1, "fstat");
>   }
> 
>   return 0;
> }
> 
> int
> senddesc(int fd, int so)
> {
>   struct msghdr msg;
>   struct cmsghdr *cmsg;
>   union {
>   struct cmsghdr  hdr;
>   unsigned char   buf[CMSG_SPACE(sizeof(int))];
>   } cmsgbuf;
> 
>   char *cbuf = calloc(6392, sizeof(char));
>   memset(cbuf, 'K', 6392);
>   struct iovec iov = {
>   .iov_base = cbuf,
>   .iov_len = 6392,
>   };
> 
>   memset(, 0, sizeof(struct msghdr));
>   msg.msg_iov = 
>   msg.msg_iovlen = 1;
>   msg.msg_control = 
>