Re: New pseudo-driver dt(4): dynamic tracer

2020-01-19 Thread Mark Kettenis
> Date: Thu, 16 Jan 2020 19:37:54 +0100
> From: Martin Pieuchot 
> 
> I'd like to import this new MI pseudo-driver and the framework it
> provides to instrument and inspect kernel internals.
> 
> It is still under development and all the code is guarded under NDT, so
> it shouldn't impact GENERIC.  However, at this stage of development I'm
> interested in getting code reviews and feedback.
> 
> The design is fairly simple: events, in the form of descriptors on a
> ring, are being produced in any kernel context and being consumed by
> a userland process reading /dev/dt.
> 
> struct dt_evt {
>   unsigned int            dtev_pbn;    /* Probe number */
>   unsigned int            dtev_cpu;    /* CPU id */
>   pid_t                   dtev_pid;    /* ID of current process */
>   pid_t                   dtev_tid;    /* ID of current thread */
>   struct timespec         dtev_tsp;    /* timestamp (nsecs) */
>   struct dt_stack_trace   dtev_kstack; /* kernel stack frame */
>   ...
> };
> 
> To decide when an event needs to be recorded, tracepoints, also known as
> probes, are placed in the code and enabled on demand.  For the moment the
> framework allows defining static tracepoints via the TRACEPOINT() macro
> and also exposes per-syscall and hardclock probes that can be used to
> generate flamegraphs.
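
For illustration, a minimal sketch of what a consumer of /dev/dt could
look like, based only on the struct quoted above.  The dt_evt and
dt_stack_trace definitions below are placeholders (the real ones live in
the dt(4) headers), and the ioctl(2) calls that would normally enable
probes are omitted since that interface is still changing:

#include <sys/types.h>
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder definitions mirroring the struct quoted above. */
struct dt_stack_trace {
	unsigned int		st_nframes;
};

struct dt_evt {
	unsigned int		dtev_pbn;	/* Probe number */
	unsigned int		dtev_cpu;	/* CPU id */
	pid_t			dtev_pid;	/* ID of current process */
	pid_t			dtev_tid;	/* ID of current thread */
	struct timespec		dtev_tsp;	/* timestamp (nsecs) */
	struct dt_stack_trace	dtev_kstack;	/* kernel stack frame */
};

int
main(void)
{
	struct dt_evt ev;
	int fd;

	if ((fd = open("/dev/dt", O_RDONLY)) == -1)
		err(1, "open /dev/dt");
	/* ... enable the desired probes via ioctl(2) here ... */
	while (read(fd, &ev, sizeof(ev)) == (ssize_t)sizeof(ev))
		printf("probe %u cpu%u pid %d tid %d\n",
		    ev.dtev_pbn, ev.dtev_cpu, (int)ev.dtev_pid,
		    (int)ev.dtev_tid);
	close(fd);
	return 0;
}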
> 
> I'd appreciate particular review of the following items:
> 
>  * Event producer/consumer code which currently needs a mutex.  The
>    current implementation doesn't always use a PCB per-CPU.  Moving
>    to a lockless implementation would be beneficial for recording in
>    deeper places of the kernel.

I think that should be possible, but I'd have to think about that for
a bit longer.  Shouldn't prevent this from going in.  And it might
mean we can't support hppa MP kernels (but that's not a big
issue).

>  * Barriers for enabling/disabling probes.  dt(4) can be opened multiple
>    times, allowing multiple processes to gather events in parallel.  I'd
>    like to be sure a close(2) doesn't stop another thread from capturing
>    events.

Is it absolutely necessary that we support multiple processes gathering?

>  * Read wakeup.  Currently a mutex is used to not lose a wakeup event
>    in dtread(), however this mutex creates lock ordering problems with
>    the SCHED_LOCK() when probes are executed to profile the scheduler.

Hmm, isn't it possible to avoid this by using sleep_setup() and
sleep_finish_all() manually?  I might give that a go once this is in.

> I included changes to MAKEDEV on all archs where the dev number 30 is
> still free.  I'll pick the closest number for other archs if that's ok. 
> 
> The man page doesn't include a list of ioctl(2) as I believe they will
> change soon.  I'll send another mail for the userland interface.
> 
> Ok to continue in-tree?

Yes!

Is there a particular reason you're changing db_expr_t into long in some
of the places?  I'm not necessarily against removing that useless
abstraction, but this seems to be only a partial change.

> Index: etc/MAKEDEV.common
> ===
> RCS file: /cvs/src/etc/MAKEDEV.common,v
> retrieving revision 1.107
> diff -u -p -r1.107 MAKEDEV.common
> --- etc/MAKEDEV.common17 Dec 2019 13:15:17 -  1.107
> +++ etc/MAKEDEV.common16 Jan 2020 18:31:56 -
> @@ -167,6 +167,7 @@ target(all, vmm)dnl
>  target(all, pvbus, 0, 1)dnl
>  target(all, bpf)dnl
>  target(all, kcov)dnl
> +target(all, dt)dnl
>  dnl
>  _mkdev(all, {-all-}, {-dnl
>  show_target(all)dnl
> @@ -528,3 +529,5 @@ _mkdev(pvbus, {-pvbus*-}, {-M pvbus$U c 
>  _mkdev(local, local, {-test -s $T.local && sh $T.local-})dnl
>  __devitem(kcov, kcov, Kernel code coverage tracing)dnl
>  _mkdev(kcov, kcov, {-M kcov c major_kcov_c 0 600-})dnl
> +__devitem(dt, dt, Dynamic Tracer)dnl
> +_mkdev(dt, dt, {-M dt c major_dt_c 0 600-})dnl
> Index: etc/etc.amd64/MAKEDEV.md
> ===
> RCS file: /cvs/src/etc/etc.amd64/MAKEDEV.md,v
> retrieving revision 1.72
> diff -u -p -r1.72 MAKEDEV.md
> --- etc/etc.amd64/MAKEDEV.md  17 Dec 2019 13:08:54 -  1.72
> +++ etc/etc.amd64/MAKEDEV.md  16 Jan 2020 18:31:56 -
> @@ -68,6 +68,7 @@ _DEV(au, 42)
>  _DEV(bio, 79)
>  _DEV(bktr, 49)
>  _DEV(bpf, 23)
> +_DEV(dt, 30)
>  _DEV(diskmap, 90)
>  _DEV(drm, 87)
>  _DEV(fdesc, 22)
> Index: etc/etc.arm64/MAKEDEV.md
> ===
> RCS file: /cvs/src/etc/etc.arm64/MAKEDEV.md,v
> retrieving revision 1.5
> diff -u -p -r1.5 MAKEDEV.md
> --- etc/etc.arm64/MAKEDEV.md  22 Dec 2019 18:18:02 -  1.5
> +++ etc/etc.arm64/MAKEDEV.md  16 Jan 2020 18:31:56 -
> @@ -60,6 +60,7 @@ _DEV(au, 42)
>  _DEV(bio, 79)
>  _DEV(bktr, 49)
>  _DEV(bpf, 23)
> +_DEV(dt, 30)
>  _DEV(diskmap, 90)
>  _DEV(drm, 87)
>  _DEV(fdesc, 22)
> Index: etc/etc.armv7/MAKEDEV.md
> 

Re: ntpd and 2036

2020-01-19 Thread Otto Moerbeek
On Fri, Jan 10, 2020 at 03:14:42PM +0100, Otto Moerbeek wrote:

> Hi,
> 
> The NTP protocol uses 32-bit unsigned timestamps counting seconds
> since 1900. That means that in 2036 the timestamp field will wrap.
> This diff makes sure ntpd handles that correctly by assuming we are
> in era 0 unless we see "small" timestamps.
> 
> tested in the future (including wrapping from era 0 to 1) on a couple
> of machines including one running xntpd for interoperability.
> 
> ok?

ping...

-Otto
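
For reference, the era arithmetic described above can be illustrated
with a small standalone program; NTP_ERA, SECS_IN_ERA and JAN_1970
mirror the ntp.h values from the diff below, the fraction part is left
out, and the sample timestamps are just examples:

#include <stdint.h>
#include <stdio.h>

#define NTP_ERA		0
#define SECS_IN_ERA	(UINT32_MAX + 1ULL)
#define JAN_1970	2208988800UL	/* 1970 - 1900 in seconds */

/* Integer part of an NTP timestamp -> seconds since 1900, era-aware. */
static double
era_seconds(uint32_t int_partl)
{
	double base = NTP_ERA;

	/* "Small" timestamps are assumed to be past the 2036 wrap. */
	if (int_partl <= INT32_MAX)
		base++;
	return base * SECS_IN_ERA + (double)int_partl;
}

int
main(void)
{
	/* 2020-01-01: a "large" era-0 timestamp. */
	printf("%.0f\n", era_seconds(3786825600U) - JAN_1970);
	/* 2040-01-01: wrapped around, interpreted as era 1. */
	printf("%.0f\n", era_seconds(123010304U) - JAN_1970);
	return 0;
}

This prints 1577836800 and 2208988800, i.e. the Unix times for those
two dates.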

> 
> Index: client.c
> ===
> RCS file: /cvs/src/usr.sbin/ntpd/client.c,v
> retrieving revision 1.112
> diff -u -p -r1.112 client.c
> --- client.c  10 Nov 2019 19:24:47 -  1.112
> +++ client.c  10 Jan 2020 14:06:14 -
> @@ -324,12 +324,6 @@ client_dispatch(struct ntp_peer *p, u_in
>   }
>   }
>  
> - if (T4 < JAN_1970) {
> - client_log_error(p, "recvmsg control format", EBADF);
> - set_next(p, error_interval());
> - return (0);
> - }
> -
>   ntp_getmsg((struct sockaddr *)&p->addr->ss, buf, size, &msg);
>  
>   if (msg.orgtime.int_partl != p->query->msg.xmttime.int_partl ||
> @@ -374,16 +368,6 @@ client_dispatch(struct ntp_peer *p, u_in
>   T1 = p->query->xmttime;
>   T2 = lfp_to_d(msg.rectime);
>   T3 = lfp_to_d(msg.xmttime);
> -
> - /*
> -  * XXX workaround: time_t / tv_sec must never wrap.
> -  * around 2020 we will need a solution (64bit time_t / tv_sec).
> -  * consider every answer with a timestamp beyond january 2030 bogus.
> -  */
> - if (T2 > JAN_2030 || T3 > JAN_2030) {
> - set_next(p, error_interval());
> - return (0);
> - }
>  
>   /* Detect liars */
>   if (!p->trusted && conf->constraint_median != 0 &&
> Index: ntp.h
> ===
> RCS file: /cvs/src/usr.sbin/ntpd/ntp.h,v
> retrieving revision 1.13
> diff -u -p -r1.13 ntp.h
> --- ntp.h 22 Apr 2009 07:42:17 -  1.13
> +++ ntp.h 10 Jan 2020 14:06:14 -
> @@ -141,7 +141,19 @@ struct ntp_query {
>  #define  MODE_RES2   7   /* reserved for private use */
>  
>  #define  JAN_1970    2208988800UL /* 1970 - 1900 in seconds */
> -#define  JAN_2030    1893456000UL + JAN_1970 /* 1. 1. 2030 00:00:00 */
> +
> +/*
> + * The era we're in if we have no reason to assume otherwise.
> + * If lfp_to_d() sees an offset <= INT32_MAX the era is assumed to be
> + * NTP_ERA + 1.
> + * Once the actual year is well into era 1 (after 2036), define NTP_ERA to 1
> + * and adapt (remove) the test in lfp_to_d().
> + * Once more than half of era 1 has elapsed (after 2104), re-introduce the test
> + * to move to era 2 if offset <= INT32_MAX, repeat for each half era.
> + */
> +#define NTP_ERA  0
> +
> +#define SECS_IN_ERA  (UINT32_MAX + 1ULL)
>  
>  #define  NTP_VERSION 4
>  #define  NTP_MAXSTRATUM  15
> Index: util.c
> ===
> RCS file: /cvs/src/usr.sbin/ntpd/util.c,v
> retrieving revision 1.24
> diff -u -p -r1.24 util.c
> --- util.c1 Mar 2017 00:56:30 -   1.24
> +++ util.c10 Jan 2020 14:06:14 -
> @@ -86,12 +86,17 @@ d_to_tv(double d, struct timeval *tv)
>  double
>  lfp_to_d(struct l_fixedpt lfp)
>  {
> - double  ret;
> + double  base, ret;
>  
>   lfp.int_partl = ntohl(lfp.int_partl);
>   lfp.fractionl = ntohl(lfp.fractionl);
>  
> - ret = (double)(lfp.int_partl) + ((double)lfp.fractionl / UINT_MAX);
> + /* see comment in ntp.h */
> + base = NTP_ERA;
> + if (lfp.int_partl <= INT32_MAX)
> + base++; 
> + ret = base * SECS_IN_ERA;
> + ret += (double)(lfp.int_partl) + ((double)lfp.fractionl / UINT_MAX);
>  
>   return (ret);
>  }
> @@ -101,6 +106,8 @@ d_to_lfp(double d)
>  {
>   struct l_fixedptlfp;
>  
> + while (d > SECS_IN_ERA)
> + d -= SECS_IN_ERA;
>   lfp.int_partl = htonl((u_int32_t)d);
>   lfp.fractionl = htonl((u_int32_t)((d - (u_int32_t)d) * UINT_MAX));
>  
> 
> 



ciss(4): de-indent polling logic

2020-01-19 Thread Scott Cheloha
I'm hunting the hz(9) usage in ciss_cmd().  To remove it I first need
to untangle the polling code.

To start, I think it's sensible to remove a level of indentation.

I've attached the -w diff to simplify review.

ok?

Index: ic/ciss.c
===
RCS file: /cvs/src/sys/dev/ic/ciss.c,v
retrieving revision 1.75
diff -u -p -w -r1.75 ciss.c
--- ic/ciss.c   14 Aug 2016 04:08:03 -  1.75
+++ ic/ciss.c   20 Jan 2020 02:48:38 -
@@ -518,7 +518,10 @@ ciss_cmd(struct ciss_ccb *ccb, int flags
} else
bus_space_write_4(sc->iot, sc->ioh, CISS_INQ, ccb->ccb_cmdpa);
 
-   if (wait & SCSI_POLL) {
+   /* If we're not waiting for completion we're done. */
+   if (!(wait & SCSI_POLL))
+   return (error);
+
struct timeval tv;
int etick;
CISS_DPRINTF(CISS_D_CMD, ("waiting "));
@@ -598,7 +601,6 @@ ciss_cmd(struct ciss_ccb *ccb, int flags
 
CISS_DPRINTF(CISS_D_CMD, ("done %d:%d",
ccb->ccb_err.cmd_stat, ccb->ccb_err.scsi_stat));
-   }
 
return (error);
 }



nfs: tsleep(9) -> tsleep_nsec(9)

2020-01-19 Thread Scott Cheloha
nfs_request() has an exponential backoff.  The timeout is already
in seconds so we just stop converting to ticks.

nfs_sndlock() and nfs_rcvlock() both have no timeout initially and
then switch to a timeout of two seconds under certain circumstances.
Ticks to seconds.

The kqueue polling thread's timeout also goes from ticks to seconds, but
we're dividing by two so I convert to milliseconds first.  It's ugly,
but I don't want to disturb NFS_MINATTRTIMO if I can avoid it.
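
To make the milliseconds step concrete, here is a rough userland
illustration; the macro is a stand-in for the kernel's conversion
helper, and the NFS_MINATTRTIMO and hz values are only assumptions:

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel constants/macros, for illustration only. */
#define NFS_MINATTRTIMO		5	/* seconds (assumed) */
#define MSEC_TO_NSEC(ms)	((uint64_t)(ms) * 1000000ULL)

int
main(void)
{
	int hz = 100;	/* typical value, assumed */

	/* Old timeout, in ticks: half of NFS_MINATTRTIMO seconds. */
	printf("old: %d ticks\n", NFS_MINATTRTIMO * hz / 2);

	/*
	 * New timeout: converting through whole seconds would truncate
	 * the half second, so go through milliseconds instead.
	 */
	printf("new: %llu ns\n",
	    (unsigned long long)MSEC_TO_NSEC(NFS_MINATTRTIMO * 1000 / 2));
	return 0;
}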

ok?

Index: nfs/nfs_socket.c
===
RCS file: /cvs/src/sys/nfs/nfs_socket.c,v
retrieving revision 1.135
diff -u -p -r1.135 nfs_socket.c
--- nfs/nfs_socket.c15 Jan 2020 13:17:35 -  1.135
+++ nfs/nfs_socket.c20 Jan 2020 01:37:14 -
@@ -851,10 +851,9 @@ nfs_request(struct vnode *vp, int procnu
struct mbuf *m;
u_int32_t *tl;
struct nfsmount *nmp;
-   struct timeval tv;
caddr_t cp2;
int t1, i, error = 0;
-   int trylater_delay;
+   int addr, trylater_delay;
struct nfsreq *rep;
struct nfsm_info info;
 
@@ -999,9 +998,8 @@ tryagain:
error == NFSERR_TRYLATER) {
m_freem(info.nmi_mrep);
error = 0;
-   tv.tv_sec = trylater_delay;
-   tv.tv_usec = 0;
> -   tsleep(&tv, PSOCK, "nfsretry", tvtohz(&tv));
> +   tsleep_nsec(&addr, PSOCK, "nfsretry",
+   SEC_TO_NSEC(trylater_delay));
trylater_delay *= NFS_TIMEOUTMUL;
if (trylater_delay > NFS_MAXTIMEO)
trylater_delay = NFS_MAXTIMEO;
@@ -1241,8 +1239,9 @@ nfs_sigintr(struct nfsmount *nmp, struct
 int
 nfs_sndlock(int *flagp, struct nfsreq *rep)
 {
+   uint64_t slptimeo = INFSLP;
struct proc *p;
-   int slpflag = 0, slptimeo = 0;
+   int slpflag = 0;
 
if (rep) {
p = rep->r_procp;
@@ -1254,11 +1253,10 @@ nfs_sndlock(int *flagp, struct nfsreq *r
if (rep && nfs_sigintr(rep->r_nmp, rep, p))
return (EINTR);
*flagp |= NFSMNT_WANTSND;
-   (void)tsleep((caddr_t)flagp, slpflag | (PZERO - 1), "nfsndlck",
-   slptimeo);
+   tsleep_nsec(flagp, slpflag | (PZERO - 1), "nfsndlck", slptimeo);
if (slpflag == PCATCH) {
slpflag = 0;
-   slptimeo = 2 * hz;
+   slptimeo = SEC_TO_NSEC(2);
}
}
*flagp |= NFSMNT_SNDLOCK;
@@ -1284,8 +1282,9 @@ nfs_sndunlock(int *flagp)
 int
 nfs_rcvlock(struct nfsreq *rep)
 {
+   uint64_t slptimeo = INFSLP;
int *flagp = &rep->r_nmp->nm_flag;
-   int slpflag, slptimeo = 0;
+   int slpflag;
 
if (*flagp & NFSMNT_INT)
slpflag = PCATCH;
@@ -1296,8 +1295,7 @@ nfs_rcvlock(struct nfsreq *rep)
if (nfs_sigintr(rep->r_nmp, rep, rep->r_procp))
return (EINTR);
*flagp |= NFSMNT_WANTRCV;
-   (void)tsleep((caddr_t)flagp, slpflag | (PZERO - 1), "nfsrcvlk",
-   slptimeo);
+   tsleep_nsec(flagp, slpflag | (PZERO - 1), "nfsrcvlk", slptimeo);
if (rep->r_mrep != NULL) {
/*
 * Don't take the lock if our reply has been received
@@ -1307,7 +1305,7 @@ nfs_rcvlock(struct nfsreq *rep)
}
if (slpflag == PCATCH) {
slpflag = 0;
-   slptimeo = 2 * hz;
+   slptimeo = SEC_TO_NSEC(2);
}
}
*flagp |= NFSMNT_RCVLOCK;
Index: nfs/nfs_kq.c
===
RCS file: /cvs/src/sys/nfs/nfs_kq.c,v
retrieving revision 1.27
diff -u -p -r1.27 nfs_kq.c
--- nfs/nfs_kq.c31 Dec 2019 13:48:32 -  1.27
+++ nfs/nfs_kq.c20 Jan 2020 01:37:14 -
@@ -173,8 +173,8 @@ next:
rw_exit_write(_lock);
 
/* wait a while before checking for changes again */
-   tsleep(pnfskq, PSOCK, "nfskqpw", NFS_MINATTRTIMO * hz / 2);
-
+   tsleep_nsec(pnfskq, PSOCK, "nfskqpw",
+   MSEC_TO_NSEC(NFS_MINATTRTIMO * 1000 / 2));
}
 }
 



ospf(6)d: allow "type p2p" globally or per area

2020-01-19 Thread Remi Locherer
This makes the interface setting "type p2p" configurable globally or
per area. ospf(6)d allows this for almost all interface-related settings.

As a side-effect of this diff ospf(6)d -nv prints "type p2p" also for
point-to-point interfaces like gif or gre. I think this is an advantage
but I can also change that by re-introducing the iface->p2p variable.
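
With the diff applied, an ospf6d.conf along these lines should be
accepted (illustrative fragment only; interface names are made up and
the exact placement follows the parse.y grammar change below):

# "type p2p" as a global default applies to the interfaces below;
# it could equally be set inside an individual area block.
router-id 192.0.2.1

type p2p

area 0.0.0.0 {
	interface em0
	interface gif0
}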

OK?

Remi



Index: ospf6d/ospf6d.h
===
RCS file: /cvs/src/usr.sbin/ospf6d/ospf6d.h,v
retrieving revision 1.44
diff -u -p -r1.44 ospf6d.h
--- ospf6d/ospf6d.h 3 Jan 2020 17:45:02 -   1.44
+++ ospf6d/ospf6d.h 12 Jan 2020 21:44:41 -
@@ -329,7 +329,6 @@ struct iface {
u_int8_t if_type;
u_int8_t linkstate;
u_int8_t priority;
-   u_int8_t p2p;
u_int8_t cflags;
 #define F_IFACE_PASSIVE0x01
 #define F_IFACE_CONFIGURED 0x02
Index: ospf6d/parse.y
===
RCS file: /cvs/src/usr.sbin/ospf6d/parse.y,v
retrieving revision 1.48
diff -u -p -r1.48 parse.y
--- ospf6d/parse.y  26 Dec 2019 10:24:18 -  1.48
+++ ospf6d/parse.y  19 Jan 2020 21:51:56 -
@@ -102,6 +102,7 @@ struct config_defaults {
u_int16_t   rxmt_interval;
u_int16_t   metric;
u_int8_tpriority;
+   u_int8_tp2p;
 };
 
 struct config_defaults  globaldefs;
@@ -449,6 +450,9 @@ defaults: METRIC NUMBER {
}
defs->rxmt_interval = $2;
}
+   | TYPE P2P  {
+   defs->p2p = 1;
+   }
;
 
 optnl  : '\n' optnl
@@ -550,6 +554,8 @@ interface   : INTERFACE STRING  {
iface->metric = defs->metric;
iface->priority = defs->priority;
iface->cflags |= F_IFACE_CONFIGURED;
+   if (defs->p2p == 1)
+   iface->type = IF_TYPE_POINTOPOINT;
iface = NULL;
/* interface is always part of an area */
defs = &areadefs;
@@ -566,10 +572,6 @@ interfaceopts_l: interfaceopts_l interf
;
 
 interfaceoptsl : PASSIVE   { iface->cflags |= F_IFACE_PASSIVE; }
-   | TYPE P2P  {
-   iface->p2p = 1;
-   iface->type = IF_TYPE_POINTOPOINT;
-   }
| DEMOTE STRING {
if (strlcpy(iface->demote_group, $2,
sizeof(iface->demote_group)) >=
@@ -1034,6 +1036,7 @@ parse_config(char *filename, int opts)
defs->rxmt_interval = DEFAULT_RXMT_INTERVAL;
defs->metric = DEFAULT_METRIC;
defs->priority = DEFAULT_PRIORITY;
+   defs->p2p = 0;
 
conf->spf_delay = DEFAULT_SPF_DELAY;
conf->spf_hold_time = DEFAULT_SPF_HOLDTIME;
Index: ospf6d/printconf.c
===
RCS file: /cvs/src/usr.sbin/ospf6d/printconf.c,v
retrieving revision 1.9
diff -u -p -r1.9 printconf.c
--- ospf6d/printconf.c  26 Dec 2019 10:24:18 -  1.9
+++ ospf6d/printconf.c  12 Jan 2020 21:43:06 -
@@ -1,4 +1,5 @@
-/* $OpenBSD: printconf.c,v 1.9 2019/12/26 10:24:18 remi Exp $ */
+/*  $OpenBSD: printconf.c,v 1.9 2019/12/26 10:24:18 remi Exp $
+*/
 
 /*
  * Copyright (c) 2004, 2005 Esben Norby 
@@ -135,7 +136,7 @@ print_iface(struct iface *iface)
printf("\t\trouter-priority %d\n", iface->priority);
printf("\t\ttransmit-delay %d\n", iface->transmit_delay);
 
-   if (iface->p2p)
+   if (iface->type == IF_TYPE_POINTOPOINT)
printf("\t\ttype p2p\n");
 
printf("\t}\n");
Index: ospfd/ospfd.c
===
RCS file: /cvs/src/usr.sbin/ospfd/ospfd.c,v
retrieving revision 1.110
diff -u -p -r1.110 ospfd.c
--- ospfd/ospfd.c   23 Nov 2019 15:05:21 -  1.110
+++ ospfd/ospfd.c   18 Jan 2020 14:02:04 -
@@ -893,7 +893,6 @@ merge_interfaces(struct area *a, struct 
if (i->self)
i->self->priority = i->priority;
i->flags = xi->flags; /* needed? */
-   i->type = xi->type; /* needed? */
i->if_type = xi->if_type; /* needed? */
i->linkstate = xi->linkstate; /* needed? */
 
@@ -915,11 +914,11 @@ merge_interfaces(struct area *a, struct 
if_fsm(i, IF_EVT_UP);
}
 
-   if (i->p2p != xi->p2p) {
+   if (i->type != xi->type) {
/* restart interface to enable or disable DR election */
if (ospfd_process == PROC_OSPF_ENGINE)
if_fsm(i, 

Re: securelevel.7: Clarify mem(4) semantics

2020-01-19 Thread Klemens Nanni
On Sun, Jan 19, 2020 at 11:50:01AM -0700, Theo de Raadt wrote:
> Your text seems somewhat backwards, because if you can't open it, it
> doesn't matter if it is read-only, it is read-not.
It starts with what holds true unconditionally: the fact that you cannot
write;  then it describes an additional case that has another condition
to it.

Is this version better?


Index: share/man/man7/securelevel.7
===
RCS file: /cvs/src/share/man/man7/securelevel.7,v
retrieving revision 1.31
diff -u -p -r1.31 securelevel.7
--- share/man/man7/securelevel.721 Aug 2019 20:44:09 -  1.31
+++ share/man/man7/securelevel.719 Jan 2020 20:40:30 -
@@ -66,7 +66,10 @@ securelevel may no longer be lowered exc
 .Pa /dev/mem
 and
 .Pa /dev/kmem
-cannot be opened
+can only be opened if the
+.Va kern.allowkmem
+.Xr sysctl 8
+variable is set, both are read-only
 .It
 raw disk devices of mounted file systems are read-only
 .It



Re: securelevel.7: Clarify mem(4) semantics

2020-01-19 Thread Theo de Raadt
Your text seems somewhat backwards, because if you can't open it, it
doesn't matter if it is read-only, it is read-not.

>Neither mem(4) nor securelevel(7) tell exactly which operations are
>allowed on these devices at which security level.
>
>The utility cbmem from the new sysutils/coreboot-utils port opens
>/dev/mem read-only and reading is possible at all security levels, as
>long as the sysctl kern.allowkmem is non-zero.
>
>I verified this with cbmem.
>
>Writing however is only possible at an insecure security level.
>
>intelmetool from coreboot-utils (not packaged yet) does write to
>/dev/mem and I verified that writing is only possible at securelevel -1
>and 0.  kern.allowkmem does not matter.
>
>Diff below explicitly states this in securelevel(7);  note how I
>intentionally document those semantics in this MI manual rather than the
>various copies of MD mem(4).
>
>Feedback? OK?
>
>
>Index: share/man/man7/securelevel.7
>===
>RCS file: /cvs/src/share/man/man7/securelevel.7,v
>retrieving revision 1.31
>diff -u -p -r1.31 securelevel.7
>--- share/man/man7/securelevel.7   21 Aug 2019 20:44:09 -  1.31
>+++ share/man/man7/securelevel.7   19 Jan 2020 16:14:52 -
>@@ -66,7 +66,10 @@ securelevel may no longer be lowered exc
> .Pa /dev/mem
> and
> .Pa /dev/kmem
>-cannot be opened
>+are read-only and can only be opened if the
>+.Va kern.allowkmem
>+.Xr sysctl 8
>+variable is set
> .It
> raw disk devices of mounted file systems are read-only
> .It
>
>



securelevel.7: Clarify mem(4) semantics

2020-01-19 Thread Klemens Nanni
Neither mem(4) nor securelevel(7) tell exactly which operations are
allowed on these devices at which security level.

The utility cbmem from the new sysutils/coreboot-utils port opens
/dev/mem read-only and reading is possible at all security levels, as
long as the sysctl kern.allowkmem is non-zero.

I verified this with cbmem.

Writing however is only possible at an insecure security level.

intelmetool from coreboot-utils (not packaged yet) does write to
/dev/mem and I verified that writing is only possible at securelevel -1
and 0.  kern.allowkmem does not matter.
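
The behaviour is easy to spot-check without those tools; a minimal
sketch (needs root, and the error strings will vary):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try opening a device with the given flags and report the result. */
static void
try_open(const char *path, int flags, const char *what)
{
	int fd;

	if ((fd = open(path, flags)) == -1)
		printf("%s open of %s failed: %s\n", what, path,
		    strerror(errno));
	else {
		printf("%s open of %s succeeded\n", what, path);
		close(fd);
	}
}

int
main(void)
{
	try_open("/dev/mem", O_RDONLY, "read-only");
	try_open("/dev/mem", O_RDWR, "read-write");
	return 0;
}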

Diff below explicitly states this in securelevel(7);  note how I
intentionally document those semantics in this MI manual rather than the
various copies of MD mem(4).

Feedback? OK?


Index: share/man/man7/securelevel.7
===
RCS file: /cvs/src/share/man/man7/securelevel.7,v
retrieving revision 1.31
diff -u -p -r1.31 securelevel.7
--- share/man/man7/securelevel.721 Aug 2019 20:44:09 -  1.31
+++ share/man/man7/securelevel.719 Jan 2020 16:14:52 -
@@ -66,7 +66,10 @@ securelevel may no longer be lowered exc
 .Pa /dev/mem
 and
 .Pa /dev/kmem
-cannot be opened
+are read-only and can only be opened if the
+.Va kern.allowkmem
+.Xr sysctl 8
+variable is set
 .It
 raw disk devices of mounted file systems are read-only
 .It



Re: Please test: kill `p_priority'

2020-01-19 Thread Martin Pieuchot
On 14/01/20(Tue) 17:16, Martin Pieuchot wrote:
> On 13/01/20(Mon) 21:31, Martin Pieuchot wrote:
> > I'd like hardclock() to be free of SCHED_LOCK(), the diff below brings
> > us one step away from that goal.  Please test with your favorite
> > benchmark and report any regression :o)
> > 
> > The reason for moving the SCHED_LOCK() away from hardclock() is because
> > it will allow us to re-commit the diff moving accounting out of the
> > SCHED_LOCK().  In the end we shrink the SCHED_LOCK() which is generally
> > a good thing(tm).
> > 
> > `p_priority' is not very well named.  It's currently a placeholder for
> > three values:
> > 
> >   - the sleeping priority, corresponding to the value passed to tsleep(9)
> > which will be later used by setrunnable() to place the thread on the
> > appropriate runqueue.
> > 
> >   - the per-CPU runqueue priority which is used to add/remove a thread
> > to/from a runqueue.
> > 
> >   - the scheduling priority, also named `p_usrpri', calculated from the
> > estimated amount of CPU time used.
> > 
> > 
> > The diff below splits the current `p_priority' into two fields `p_runpri'
> > and `p_slppri'.  The important part is the schedclock() chunk:
> > 
> > @@ -519,8 +515,6 @@ schedclock(struct proc *p)
> > SCHED_LOCK(s);
> > newcpu = ESTCPULIM(p->p_estcpu + 1);
> > setpriority(p, newcpu, p->p_p->ps_nice);
> > -   if (p->p_priority >= PUSER)
> > -   p->p_priority = p->p_usrpri;
> > SCHED_UNLOCK(s);
> >   
> > `p_priority' had to be updated because it was overwritten during each
> > tsleep()/setrunnable() cycle.  The same happened in schedcpu():
> > 
> > schedcpu: reseting priority for zerothread(44701) 127 -> 90
> > schedclock: reseting priority for zerothread(44701) 127 -> 91
> > 
> > Getting rid of this assignment means we can then protect `p_estcpu' and
> > `p_usrpri' with a custom per-thread lock.
> > 
> > With this division, `p_runpri' becomes part of the scheduler fields while
> > `p_slppri' is obviously part of the global sleepqueue.
> > 
> > The rest of the diff is quite straightforward, including the userland
> > compatibility bits.
> > 
> > Tests, comments and oks welcome :o)
> 
> Updated diff that changes getpriority() into a macro allowing libkvm to
> build as reported by anton@.

New diff that initializes `p_runpri' and fixes a parenthesis inconsistency
in the macro, pointed out by visa@.

This has been tested by various people without regression so far, so I'd
like to move forward, ok?

Index: arch/sparc64/sparc64/db_interface.c
===
RCS file: /cvs/src/sys/arch/sparc64/sparc64/db_interface.c,v
retrieving revision 1.54
diff -u -p -r1.54 db_interface.c
--- arch/sparc64/sparc64/db_interface.c 7 Nov 2019 14:44:53 -   1.54
+++ arch/sparc64/sparc64/db_interface.c 19 Jan 2020 13:45:36 -
@@ -958,10 +958,10 @@ db_proc_cmd(addr, have_addr, count, modi
return;
}
db_printf("process %p:", p);
-   db_printf("pid:%d vmspace:%p pmap:%p ctx:%x wchan:%p pri:%d upri:%d\n",
+   db_printf("pid:%d vmspace:%p pmap:%p ctx:%x wchan:%p rpri:%d upri:%d\n",
p->p_p->ps_pid, p->p_vmspace, p->p_vmspace->vm_map.pmap,
p->p_vmspace->vm_map.pmap->pm_ctx,
-   p->p_wchan, p->p_priority, p->p_usrpri);
+   p->p_wchan, p->p_runpri, p->p_usrpri);
db_printf("maxsaddr:%p ssiz:%dpg or %llxB\n",
p->p_vmspace->vm_maxsaddr, p->p_vmspace->vm_ssize,
(unsigned long long)ptoa(p->p_vmspace->vm_ssize));
Index: dev/pci/drm/i915/intel_breadcrumbs.c
===
RCS file: /cvs/src/sys/dev/pci/drm/i915/intel_breadcrumbs.c,v
retrieving revision 1.4
diff -u -p -r1.4 intel_breadcrumbs.c
--- dev/pci/drm/i915/intel_breadcrumbs.c10 Jul 2019 07:56:30 -  1.4
+++ dev/pci/drm/i915/intel_breadcrumbs.c19 Jan 2020 13:45:36 -
@@ -451,7 +451,7 @@ static bool __intel_engine_add_wait(stru
 #ifdef __linux__
if (wait->tsk->prio > to_wait(parent)->tsk->prio) {
 #else
-   if (wait->tsk->p_priority > to_wait(parent)->tsk->p_priority) {
+   if (wait->tsk->p_usrpri > to_wait(parent)->tsk->p_usrpri) {
 #endif
p = >rb_right;
first = false;
@@ -538,7 +538,7 @@ static inline bool chain_wakeup(struct r
 #else
 static inline bool chain_wakeup(struct rb_node *rb, int priority)
 {
-   return rb && to_wait(rb)->tsk->p_priority <= priority;
+   return rb && to_wait(rb)->tsk->p_usrpri <= priority;
 }
 #endif
 
@@ -558,7 +558,7 @@ static inline int wakeup_priority(struct
if (p == b->signaler)
return INT_MIN;
else
-   return p->p_priority;
+   return p->p_usrpri;
 

Re: pthread_cond, futex(2) & ECANCELED

2020-01-19 Thread Martin Pieuchot
On 18/01/20(Sat) 14:16, Martin Pieuchot wrote:
> When futex(2) got imported it didn't return ECANCELED.  This was changed
> later with futex-based semaphores.
> 
> This modification introduced a behavior change in pthread_cond_*wait(3).
> The diff below restores the previous behavior by treating ECANCELED like
> EINTR.
> 
> Note that the __thrsleep(2) version also doesn't completely check for
> ECANCELED, this diff also changes that.

Updated version below includes a check missed previously in the
__thrsleep(2) based implementation, pointed out by visa@

Index: thread/rthread_cond.c
===
RCS file: /cvs/src/lib/libc/thread/rthread_cond.c,v
retrieving revision 1.5
diff -u -p -r1.5 rthread_cond.c
--- thread/rthread_cond.c   29 Jan 2019 17:40:26 -  1.5
+++ thread/rthread_cond.c   18 Jan 2020 13:10:56 -
@@ -109,13 +109,13 @@ _rthread_cond_timedwait(pthread_cond_t c
* we should just go back to sleep without changing state
* (timeouts, etc).
*/
-   } while ((error == EINTR) &&
+   } while ((error == EINTR || error == ECANCELED) &&
   (tib->tib_canceled == 0 || (tib->tib_cantcancel & CANCEL_DISABLED)));
 
/* if timeout or canceled, make note of that */
if (error == ETIMEDOUT)
rv = ETIMEDOUT;
-   else if (error == EINTR)
+   else if (error == EINTR || error == ECANCELED)
canceled = 1;
 
pthread_mutex_lock(mutexp);
Index: thread/rthread_sync.c
===
RCS file: /cvs/src/lib/libc/thread/rthread_sync.c,v
retrieving revision 1.5
diff -u -p -r1.5 rthread_sync.c
--- thread/rthread_sync.c   24 Apr 2018 16:28:42 -  1.5
+++ thread/rthread_sync.c   19 Jan 2020 09:50:06 -
@@ -407,7 +407,7 @@ pthread_cond_timedwait(pthread_cond_t *c
/* if timeout or canceled, make note of that */
if (error == EWOULDBLOCK)
rv = ETIMEDOUT;
-   else if (error == EINTR)
+   else if (error == EINTR || error == ECANCELED)
canceled = 1;
 
/* transfer between the queues */
@@ -544,7 +544,7 @@ pthread_cond_wait(pthread_cond_t *condp,
assert(self->blocking_cond == cond);
 
/* if canceled, make note of that */
-   if (error == EINTR)
+   if (error == EINTR || error == ECANCELED)
canceled = 1;
 
/* transfer between the queues */



Re: carp: send only IPv4 carp packets on dual stack interface

2020-01-19 Thread Stuart Henderson
On 2020/01/19 00:11, Sebastian Benoit wrote:
> chr...@openbsd.org(chr...@openbsd.org) on 2020.01.18 06:18:21 +0100:
> > On Wed, Jan 15, 2020 at 12:47:28PM +0100, Sebastian Benoit wrote:
> > >Christopher Zimmermann(chr...@openbsd.org) on 2020.01.15 11:55:43 +0100:
> > >>Hi,
> > >>
> > >>as far as I can see a dual stack carp interface does not care whether it
> > >>receives advertisements addressed to IPv4 or IPv6. Any one will do.
> > >>So I propose to send IPv6 advertisements only when IPv4 is not possible.
> > >>
> > >>Why?
> > >>
> > >>- Noise can be reduced by using unicast advertisements.
> > >>  This is only possible for IPv4 by ``ifconfig carppeer``.
> > >>  I don't like flooding the whole network with carp advertisements when
> > >>  I may also unicast them.
> > >
> > >Maybe I'm getting confused, but in the problem description you were talking
> > >about v6 vs v4, and here you argue about unicast (vs multicast?) being
> > >better. That's orthogonal, isn't it?
> > 
> > Yes, kind of. The point is we support ``carppeer`` for IPv4, but not for 
> > IPv6.
> > 
> > >>- breaking IPv6 connectivity (for example by running iked without -6)
> > >>  will start a preempt-war, because failing ip6_output will cause the
> > >>  demote counter to be increased. That's what hit me.
> > >
> > >But the whole point of carp is to notice broken connectivity. If you run v6
> > >on an interface, you want to know if its working, no?
> > 
> > I grant you that much. But what kind of failures do you hope to detect 
> > on the _sending_ carp master, that would not also affect the backup?
> 
> sure: misconfigured pf. Missing routes. Buggy switch. 

misconfigured mac address filter on switch.

> > >At the very least, this needs some more thought and testing in all the ways
> > >carp can be configured.
> > 
> > Anyway, my main concern indeed is the broadcast noise generated by carp 
> > and I would be equally happy if we had a ``carppeer6`` option. Would 
> > that be considered?
> 
> of course carppeer should work with v6, and as claudio says without an extra
> keyword in ifconfig, but that's a trivial detail.
> 

Currently carp only handles one address per af; setting carppeer twice
changes the current peer address rather than adding another. A trivial
implementation that sets the v4 peer address if a v4 address is passed
in, and the v6 peer address if a v6 address is passed in, would mean
things work differently with

ifconfig carp1 carppeer $foo
ifconfig carp1 carppeer $bar

depending on whether foo/bar are v4 or v6. Also removing a configured
carppeer address to reset to multicast is just done with -carppeer
with no way to indicate the af.

It would work pretty nicely if you could set multiple carppeer addresses
(of whatever af) and remove them individually. That's a more complex
change (carp would need to keep a list of peers per af rather than a
single address) but without something like that they can't really be
equals and it feels like shoehorning both afs into the same keyword
will just be confusing.
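
For the sake of discussion, a rough sketch of what per-af peer lists
could look like; the type and field names are invented for illustration
and do not correspond to the actual carp(4) structures:

#include <sys/types.h>
#include <sys/queue.h>
#include <sys/socket.h>

struct carp_peer {
	SLIST_ENTRY(carp_peer)	 peer_entry;
	struct sockaddr_storage	 peer_addr;	/* one configured peer */
};

SLIST_HEAD(carp_peer_list, carp_peer);

/* Instead of a single peer address, the softc would carry one list per af: */
struct carp_peers {
	struct carp_peer_list	 peers_inet;	/* zero or more IPv4 peers */
	struct carp_peer_list	 peers_inet6;	/* zero or more IPv6 peers */
};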