Re: New pseudo-driver dt(4): dynamic tracer
> Date: Thu, 16 Jan 2020 19:37:54 +0100 > From: Martin Pieuchot > > I'd like to import this new MI pseudo-driver and the framework it > provides to instrument and inspect kernel internals. > > It is still under development and all the code is guarded under NDT, so > it shouldn't impact GENERIC. However at this stage of development I'm > interested to get code reviews and feedbacks. > > The design is fairly simple: events, in the form of descriptors on a > ring, are being produced in any kernel context and being consumed by > a userland process reading /dev/dt. > > struct dt_evt { > unsigned intdtev_pbn; /* Probe number */ > unsigned intdtev_cpu; /* CPU id */ > pid_t dtev_pid; /* ID of current process */ > pid_t dtev_tid; /* ID of current thread */ > struct timespec dtev_tsp; /* timestamp (nsecs) */ > struct dt_stack_trace dtev_kstack;/* kernel stack frame */ > ... > }; > > To decide when an event needs to be recorded, tracepoints, also known as > probes, are placed in the code and enabled on demand. For the moment the > framework allows to define static tracepoints via the TRACEPOINT() macro > and also exposes per-syscall and hardclock probes allowing to generate > flamegraphs. > > I'd appreciate particular review of the following items: > > * Event producer/consumer code which currently needs a mutex. The >current implementation doesn't always use a PCB per-CPU. Moving >to a lockless implementation would be beneficial for recording in >deeper places of the kernel. I think that should be possible, but I'd have to think about that for a bit longer. Shouldn't prevent this from going in. And it might mean we can't support hppa MP support kernels (but that's not a big issue). > * Barriers for enabling/disabling probes. dt(4) can be opened multiple >times allowing multiple processes to gather event in parallel. I'd >like to be sure a close(2) doesn't stop another thread from capturing >events. Is it absolutely neccessary that we support multiple processes gathering? > * Read wakeup. Currently a mutex is used to not loose a wakeup event >in dtread(), however this mutex creates lock ordering problems with >the SCHED_LOCK() when probes are executed to profile the scheduler. Hmm, isn't it possible to avoid this by using sleep_setup() and sleep_finish_all() manually? I might give that a go once this is in. > I included changes to MAKEDEV on all archs where the dev number 30 is > still free. I'll pick the closest number for others archs if that's ok. > > The man page doesn't include a list of ioctl(2) as I believe they will > change soon. I'll send another mail for the userland interface. > > Ok to continue in-tree? Yes! Is there a particular reason your changing db_expr_t into long in some of the places? I'm not necessarily against removing that useless abstraction, but this seems to be only a partial change. > Index: etc/MAKEDEV.common > === > RCS file: /cvs/src/etc/MAKEDEV.common,v > retrieving revision 1.107 > diff -u -p -r1.107 MAKEDEV.common > --- etc/MAKEDEV.common17 Dec 2019 13:15:17 - 1.107 > +++ etc/MAKEDEV.common16 Jan 2020 18:31:56 - > @@ -167,6 +167,7 @@ target(all, vmm)dnl > target(all, pvbus, 0, 1)dnl > target(all, bpf)dnl > target(all, kcov)dnl > +target(all, dt)dnl > dnl > _mkdev(all, {-all-}, {-dnl > show_target(all)dnl > @@ -528,3 +529,5 @@ _mkdev(pvbus, {-pvbus*-}, {-M pvbus$U c > _mkdev(local, local, {-test -s $T.local && sh $T.local-})dnl > __devitem(kcov, kcov, Kernel code coverage tracing)dnl > _mkdev(kcov, kcov, {-M kcov c major_kcov_c 0 600-})dnl > +__devitem(dt, dt, Dynamic Tracer)dnl > +_mkdev(dt, dt, {-M dt c major_dt_c 0 600-})dnl > Index: etc/etc.amd64/MAKEDEV.md > === > RCS file: /cvs/src/etc/etc.amd64/MAKEDEV.md,v > retrieving revision 1.72 > diff -u -p -r1.72 MAKEDEV.md > --- etc/etc.amd64/MAKEDEV.md 17 Dec 2019 13:08:54 - 1.72 > +++ etc/etc.amd64/MAKEDEV.md 16 Jan 2020 18:31:56 - > @@ -68,6 +68,7 @@ _DEV(au, 42) > _DEV(bio, 79) > _DEV(bktr, 49) > _DEV(bpf, 23) > +_DEV(dt, 30) > _DEV(diskmap, 90) > _DEV(drm, 87) > _DEV(fdesc, 22) > Index: etc/etc.arm64/MAKEDEV.md > === > RCS file: /cvs/src/etc/etc.arm64/MAKEDEV.md,v > retrieving revision 1.5 > diff -u -p -r1.5 MAKEDEV.md > --- etc/etc.arm64/MAKEDEV.md 22 Dec 2019 18:18:02 - 1.5 > +++ etc/etc.arm64/MAKEDEV.md 16 Jan 2020 18:31:56 - > @@ -60,6 +60,7 @@ _DEV(au, 42) > _DEV(bio, 79) > _DEV(bktr, 49) > _DEV(bpf, 23) > +_DEV(dt, 30) > _DEV(diskmap, 90) > _DEV(drm, 87) > _DEV(fdesc, 22) > Index: etc/etc.armv7/MAKEDEV.md >
Re: ntpd and 2036
On Fri, Jan 10, 2020 at 03:14:42PM +0100, Otto Moerbeek wrote: > Hi, > > THe ntp protocol uses 32-bit unsigned timestamps counting seconds > since 1900. That means that in 2036 the timestamp field will wrap. > This difff makes sure ntpd handles that correctly by assuming we are > in era 0 unless we see "small" timestamps. > > tested in the future (incuding wrapping form era 0 to 1) on a couple > of machines including one running xntpd for interoperability. > > ok? ping... -Otto > > Index: client.c > === > RCS file: /cvs/src/usr.sbin/ntpd/client.c,v > retrieving revision 1.112 > diff -u -p -r1.112 client.c > --- client.c 10 Nov 2019 19:24:47 - 1.112 > +++ client.c 10 Jan 2020 14:06:14 - > @@ -324,12 +324,6 @@ client_dispatch(struct ntp_peer *p, u_in > } > } > > - if (T4 < JAN_1970) { > - client_log_error(p, "recvmsg control format", EBADF); > - set_next(p, error_interval()); > - return (0); > - } > - > ntp_getmsg((struct sockaddr *)>addr->ss, buf, size, ); > > if (msg.orgtime.int_partl != p->query->msg.xmttime.int_partl || > @@ -374,16 +368,6 @@ client_dispatch(struct ntp_peer *p, u_in > T1 = p->query->xmttime; > T2 = lfp_to_d(msg.rectime); > T3 = lfp_to_d(msg.xmttime); > - > - /* > - * XXX workaround: time_t / tv_sec must never wrap. > - * around 2020 we will need a solution (64bit time_t / tv_sec). > - * consider every answer with a timestamp beyond january 2030 bogus. > - */ > - if (T2 > JAN_2030 || T3 > JAN_2030) { > - set_next(p, error_interval()); > - return (0); > - } > > /* Detect liars */ > if (!p->trusted && conf->constraint_median != 0 && > Index: ntp.h > === > RCS file: /cvs/src/usr.sbin/ntpd/ntp.h,v > retrieving revision 1.13 > diff -u -p -r1.13 ntp.h > --- ntp.h 22 Apr 2009 07:42:17 - 1.13 > +++ ntp.h 10 Jan 2020 14:06:14 - > @@ -141,7 +141,19 @@ struct ntp_query { > #define MODE_RES2 7 /* reserved for private use */ > > #define JAN_19702208988800UL/* 1970 - 1900 in seconds */ > -#define JAN_20301893456000UL + JAN_1970 /* 1. 1. 2030 00:00:00 > */ > + > +/* > + * The era we're in if we have no reason to assume otherwise. > + * If lfp_to_d() sees an offset <= INT32_MAX the era is is assumed to be > + * NTP_ERA + 1. > + * Once the actual year is well into era 1, (after 2036) define NTP_ERA to 1 > + * and adapt (remove) the test in lfp_to_d(). > + * Once more than half of era 1 has elapsed (after 2104), re-inroduce the > test > + * to move to era 2 if offset <= INT32_MAX, repeat for each half era. > + */ > +#define NTP_ERA 0 > + > +#define SECS_IN_ERA (UINT32_MAX + 1ULL) > > #define NTP_VERSION 4 > #define NTP_MAXSTRATUM 15 > Index: util.c > === > RCS file: /cvs/src/usr.sbin/ntpd/util.c,v > retrieving revision 1.24 > diff -u -p -r1.24 util.c > --- util.c1 Mar 2017 00:56:30 - 1.24 > +++ util.c10 Jan 2020 14:06:14 - > @@ -86,12 +86,17 @@ d_to_tv(double d, struct timeval *tv) > double > lfp_to_d(struct l_fixedpt lfp) > { > - double ret; > + double base, ret; > > lfp.int_partl = ntohl(lfp.int_partl); > lfp.fractionl = ntohl(lfp.fractionl); > > - ret = (double)(lfp.int_partl) + ((double)lfp.fractionl / UINT_MAX); > + /* see comment in ntp.h */ > + base = NTP_ERA; > + if (lfp.int_partl <= INT32_MAX) > + base++; > + ret = base * SECS_IN_ERA; > + ret += (double)(lfp.int_partl) + ((double)lfp.fractionl / UINT_MAX); > > return (ret); > } > @@ -101,6 +106,8 @@ d_to_lfp(double d) > { > struct l_fixedptlfp; > > + while (d > SECS_IN_ERA) > + d -= SECS_IN_ERA; > lfp.int_partl = htonl((u_int32_t)d); > lfp.fractionl = htonl((u_int32_t)((d - (u_int32_t)d) * UINT_MAX)); > > >
ciss(4): de-indent polling logic
I'm hunting the hz(9) in ciss_cmd(). To remove it I need to first untangle the polling code. To start, I think it's sensible to remove a level of indentation. I've attached the -w diff to simplify review. ok? Index: ic/ciss.c === RCS file: /cvs/src/sys/dev/ic/ciss.c,v retrieving revision 1.75 diff -u -p -w -r1.75 ciss.c --- ic/ciss.c 14 Aug 2016 04:08:03 - 1.75 +++ ic/ciss.c 20 Jan 2020 02:48:38 - @@ -518,7 +518,10 @@ ciss_cmd(struct ciss_ccb *ccb, int flags } else bus_space_write_4(sc->iot, sc->ioh, CISS_INQ, ccb->ccb_cmdpa); - if (wait & SCSI_POLL) { + /* If we're not waiting for completion we're done. */ + if (!(wait & SCSI_POLL)) + return (error); + struct timeval tv; int etick; CISS_DPRINTF(CISS_D_CMD, ("waiting ")); @@ -598,7 +601,6 @@ ciss_cmd(struct ciss_ccb *ccb, int flags CISS_DPRINTF(CISS_D_CMD, ("done %d:%d", ccb->ccb_err.cmd_stat, ccb->ccb_err.scsi_stat)); - } return (error); }
nfs: tsleep(9) -> tsleep_nsec(9)
nfs_request() has an exponenential backoff. The timeout is already in seconds so we just stop converting to ticks. nfs_sndlock() and nfs_rcvlock() both have no timeout initially and then switch to a timeout of two seconds under certain circumstances. Ticks to seconds. The kqueue polling thread is ticks to seconds, but we're dividing by two so I convert to milliseconds first. It's ugly, but I don't want to disturb NFS_MINATTRTIMO if I can avoid it. ok? Index: nfs/nfs_socket.c === RCS file: /cvs/src/sys/nfs/nfs_socket.c,v retrieving revision 1.135 diff -u -p -r1.135 nfs_socket.c --- nfs/nfs_socket.c15 Jan 2020 13:17:35 - 1.135 +++ nfs/nfs_socket.c20 Jan 2020 01:37:14 - @@ -851,10 +851,9 @@ nfs_request(struct vnode *vp, int procnu struct mbuf *m; u_int32_t *tl; struct nfsmount *nmp; - struct timeval tv; caddr_t cp2; int t1, i, error = 0; - int trylater_delay; + int addr, trylater_delay; struct nfsreq *rep; struct nfsm_info info; @@ -999,9 +998,8 @@ tryagain: error == NFSERR_TRYLATER) { m_freem(info.nmi_mrep); error = 0; - tv.tv_sec = trylater_delay; - tv.tv_usec = 0; - tsleep(, PSOCK, "nfsretry", tvtohz()); + tsleep_nsec(, PSOCK, "nfsretry", + SEC_TO_NSEC(trylater_delay)); trylater_delay *= NFS_TIMEOUTMUL; if (trylater_delay > NFS_MAXTIMEO) trylater_delay = NFS_MAXTIMEO; @@ -1241,8 +1239,9 @@ nfs_sigintr(struct nfsmount *nmp, struct int nfs_sndlock(int *flagp, struct nfsreq *rep) { + uint64_t slptimeo = INFSLP; struct proc *p; - int slpflag = 0, slptimeo = 0; + int slpflag = 0; if (rep) { p = rep->r_procp; @@ -1254,11 +1253,10 @@ nfs_sndlock(int *flagp, struct nfsreq *r if (rep && nfs_sigintr(rep->r_nmp, rep, p)) return (EINTR); *flagp |= NFSMNT_WANTSND; - (void)tsleep((caddr_t)flagp, slpflag | (PZERO - 1), "nfsndlck", - slptimeo); + tsleep_nsec(flagp, slpflag | (PZERO - 1), "nfsndlck", slptimeo); if (slpflag == PCATCH) { slpflag = 0; - slptimeo = 2 * hz; + slptimeo = SEC_TO_NSEC(2); } } *flagp |= NFSMNT_SNDLOCK; @@ -1284,8 +1282,9 @@ nfs_sndunlock(int *flagp) int nfs_rcvlock(struct nfsreq *rep) { + uint64_t slptimeo = INFSLP; int *flagp = >r_nmp->nm_flag; - int slpflag, slptimeo = 0; + int slpflag; if (*flagp & NFSMNT_INT) slpflag = PCATCH; @@ -1296,8 +1295,7 @@ nfs_rcvlock(struct nfsreq *rep) if (nfs_sigintr(rep->r_nmp, rep, rep->r_procp)) return (EINTR); *flagp |= NFSMNT_WANTRCV; - (void)tsleep((caddr_t)flagp, slpflag | (PZERO - 1), "nfsrcvlk", - slptimeo); + tsleep_nsec(flagp, slpflag | (PZERO - 1), "nfsrcvlk", slptimeo); if (rep->r_mrep != NULL) { /* * Don't take the lock if our reply has been received @@ -1307,7 +1305,7 @@ nfs_rcvlock(struct nfsreq *rep) } if (slpflag == PCATCH) { slpflag = 0; - slptimeo = 2 * hz; + slptimeo = SEC_TO_NSEC(2); } } *flagp |= NFSMNT_RCVLOCK; Index: nfs/nfs_kq.c === RCS file: /cvs/src/sys/nfs/nfs_kq.c,v retrieving revision 1.27 diff -u -p -r1.27 nfs_kq.c --- nfs/nfs_kq.c31 Dec 2019 13:48:32 - 1.27 +++ nfs/nfs_kq.c20 Jan 2020 01:37:14 - @@ -173,8 +173,8 @@ next: rw_exit_write(_lock); /* wait a while before checking for changes again */ - tsleep(pnfskq, PSOCK, "nfskqpw", NFS_MINATTRTIMO * hz / 2); - + tsleep_nsec(pnfskq, PSOCK, "nfskqpw", + MSEC_TO_NSEC(NFS_MINATTRTIMO * 1000 / 2)); } }
ospf(6)d: allow "type p2p" globally or per area
This makes the interface setting "type p2p" configurable globally or per area. ospf(6)d allows this for almost all interface related settings. As a side-effect of this diff ospf(6)d -nv prints "type p2p" also for point-to-point interfaces like gif or gre. I think this is an advantage but I can also change that by re-introducing the iface->p2p variable. OK? Remi Index: ospf6d/ospf6d.h === RCS file: /cvs/src/usr.sbin/ospf6d/ospf6d.h,v retrieving revision 1.44 diff -u -p -r1.44 ospf6d.h --- ospf6d/ospf6d.h 3 Jan 2020 17:45:02 - 1.44 +++ ospf6d/ospf6d.h 12 Jan 2020 21:44:41 - @@ -329,7 +329,6 @@ struct iface { u_int8_t if_type; u_int8_t linkstate; u_int8_t priority; - u_int8_t p2p; u_int8_t cflags; #define F_IFACE_PASSIVE0x01 #define F_IFACE_CONFIGURED 0x02 Index: ospf6d/parse.y === RCS file: /cvs/src/usr.sbin/ospf6d/parse.y,v retrieving revision 1.48 diff -u -p -r1.48 parse.y --- ospf6d/parse.y 26 Dec 2019 10:24:18 - 1.48 +++ ospf6d/parse.y 19 Jan 2020 21:51:56 - @@ -102,6 +102,7 @@ struct config_defaults { u_int16_t rxmt_interval; u_int16_t metric; u_int8_tpriority; + u_int8_tp2p; }; struct config_defaults globaldefs; @@ -449,6 +450,9 @@ defaults: METRIC NUMBER { } defs->rxmt_interval = $2; } + | TYPE P2P { + defs->p2p = 1; + } ; optnl : '\n' optnl @@ -550,6 +554,8 @@ interface : INTERFACE STRING { iface->metric = defs->metric; iface->priority = defs->priority; iface->cflags |= F_IFACE_CONFIGURED; + if (defs->p2p == 1) + iface->type = IF_TYPE_POINTOPOINT; iface = NULL; /* interface is always part of an area */ defs = @@ -566,10 +572,6 @@ interfaceopts_l: interfaceopts_l interf ; interfaceoptsl : PASSIVE { iface->cflags |= F_IFACE_PASSIVE; } - | TYPE P2P { - iface->p2p = 1; - iface->type = IF_TYPE_POINTOPOINT; - } | DEMOTE STRING { if (strlcpy(iface->demote_group, $2, sizeof(iface->demote_group)) >= @@ -1034,6 +1036,7 @@ parse_config(char *filename, int opts) defs->rxmt_interval = DEFAULT_RXMT_INTERVAL; defs->metric = DEFAULT_METRIC; defs->priority = DEFAULT_PRIORITY; + defs->p2p = 0; conf->spf_delay = DEFAULT_SPF_DELAY; conf->spf_hold_time = DEFAULT_SPF_HOLDTIME; Index: ospf6d/printconf.c === RCS file: /cvs/src/usr.sbin/ospf6d/printconf.c,v retrieving revision 1.9 diff -u -p -r1.9 printconf.c --- ospf6d/printconf.c 26 Dec 2019 10:24:18 - 1.9 +++ ospf6d/printconf.c 12 Jan 2020 21:43:06 - @@ -1,4 +1,5 @@ -/* $OpenBSD: printconf.c,v 1.9 2019/12/26 10:24:18 remi Exp $ */ +/* $OpenBSD: printconf.c,v 1.9 2019/12/26 10:24:18 remi Exp $ +*/ /* * Copyright (c) 2004, 2005 Esben Norby @@ -135,7 +136,7 @@ print_iface(struct iface *iface) printf("\t\trouter-priority %d\n", iface->priority); printf("\t\ttransmit-delay %d\n", iface->transmit_delay); - if (iface->p2p) + if (iface->type == IF_TYPE_POINTOPOINT) printf("\t\ttype p2p\n"); printf("\t}\n"); Index: ospfd/ospfd.c === RCS file: /cvs/src/usr.sbin/ospfd/ospfd.c,v retrieving revision 1.110 diff -u -p -r1.110 ospfd.c --- ospfd/ospfd.c 23 Nov 2019 15:05:21 - 1.110 +++ ospfd/ospfd.c 18 Jan 2020 14:02:04 - @@ -893,7 +893,6 @@ merge_interfaces(struct area *a, struct if (i->self) i->self->priority = i->priority; i->flags = xi->flags; /* needed? */ - i->type = xi->type; /* needed? */ i->if_type = xi->if_type; /* needed? */ i->linkstate = xi->linkstate; /* needed? */ @@ -915,11 +914,11 @@ merge_interfaces(struct area *a, struct if_fsm(i, IF_EVT_UP); } - if (i->p2p != xi->p2p) { + if (i->type != xi->type) { /* restart interface to enable or disable DR election */ if (ospfd_process == PROC_OSPF_ENGINE) if_fsm(i,
Re: securelevel.7: Clarify mem(4) semantics
On Sun, Jan 19, 2020 at 11:50:01AM -0700, Theo de Raadt wrote: > Your text seems somewhat backwards, because if you can't open it, it > doesn't matter if it is read-only, it is read-not. It starts with what holds true unconditionally: the fact that you cannot write; then it describes an additional case that has another condition to it. Is this version better? Index: share/man/man7/securelevel.7 === RCS file: /cvs/src/share/man/man7/securelevel.7,v retrieving revision 1.31 diff -u -p -r1.31 securelevel.7 --- share/man/man7/securelevel.721 Aug 2019 20:44:09 - 1.31 +++ share/man/man7/securelevel.719 Jan 2020 20:40:30 - @@ -66,7 +66,10 @@ securelevel may no longer be lowered exc .Pa /dev/mem and .Pa /dev/kmem -cannot be opened +can only be opened if the +.Va kern.allowkmem +.Xr sysctl 8 +variable is set, both are read-only .It raw disk devices of mounted file systems are read-only .It
Re: securelevel.7: Clarify mem(4) semantics
Your text seems somewhat backwards, because if you can't open it, it doesn't matter if it is read-only, it is read-not. >Neither mem(4) nor securelevel(7) tell exactly which operations are >allowed on these devices at which security level. > >The utility cbmem from the new sysutils/coreboot-utils port opens >/dev/mem read-only and reading is possible at all security levels, as >long as the sysctl kern.allowkmem is non-zero. > >I verified this with cbmem. > >Writing however is only possible at an insecure security level. > >intelmetool from coreboot-utils (not packaged yet) does write to >/dev/mem and I verified that writing is only possible at securelevel -1 >and 0. kern.allowkmem does not matter. > >Diff below explicity states this in securelevel(7); note how I >intentionally document those semantics in this MI manual rather then the >various copies of MD mem(4). > >Feedback? OK? > > >Index: share/man/man7/securelevel.7 >=== >RCS file: /cvs/src/share/man/man7/securelevel.7,v >retrieving revision 1.31 >diff -u -p -r1.31 securelevel.7 >--- share/man/man7/securelevel.7 21 Aug 2019 20:44:09 - 1.31 >+++ share/man/man7/securelevel.7 19 Jan 2020 16:14:52 - >@@ -66,7 +66,10 @@ securelevel may no longer be lowered exc > .Pa /dev/mem > and > .Pa /dev/kmem >-cannot be opened >+are read-only and can only be opened if the >+.Va kern.allowkmem >+.Xr sysctl 8 >+variable is set > .It > raw disk devices of mounted file systems are read-only > .It > >
securelevel.7: Clarify mem(4) semantics
Neither mem(4) nor securelevel(7) tell exactly which operations are allowed on these devices at which security level. The utility cbmem from the new sysutils/coreboot-utils port opens /dev/mem read-only and reading is possible at all security levels, as long as the sysctl kern.allowkmem is non-zero. I verified this with cbmem. Writing however is only possible at an insecure security level. intelmetool from coreboot-utils (not packaged yet) does write to /dev/mem and I verified that writing is only possible at securelevel -1 and 0. kern.allowkmem does not matter. Diff below explicity states this in securelevel(7); note how I intentionally document those semantics in this MI manual rather then the various copies of MD mem(4). Feedback? OK? Index: share/man/man7/securelevel.7 === RCS file: /cvs/src/share/man/man7/securelevel.7,v retrieving revision 1.31 diff -u -p -r1.31 securelevel.7 --- share/man/man7/securelevel.721 Aug 2019 20:44:09 - 1.31 +++ share/man/man7/securelevel.719 Jan 2020 16:14:52 - @@ -66,7 +66,10 @@ securelevel may no longer be lowered exc .Pa /dev/mem and .Pa /dev/kmem -cannot be opened +are read-only and can only be opened if the +.Va kern.allowkmem +.Xr sysctl 8 +variable is set .It raw disk devices of mounted file systems are read-only .It
Re: Please test: kill `p_priority'
On 14/01/20(Tue) 17:16, Martin Pieuchot wrote: > On 13/01/20(Mon) 21:31, Martin Pieuchot wrote: > > I'd like hardclock() to be free of SCHED_LOCK(), the diff below brings > > us one step away from that goal. Please test with your favorite > > benchmark and report any regression :o) > > > > The reason for moving the SCHED_LOCK() away from hardclock() is because > > it will allow us to re-commit the diff moving accounting out of the > > SCHED_LOCK(). In the end we shrink the SCHED_LOCK() which is generally > > a good thing(tm). > > > > `p_priority' is not very well named. It's currently a placeholder for > > three values: > > > > - the sleeping priority, corresponding to the value passed to tsleep(9) > > which will be later used by setrunnable() to place the thread on the > > appropriate runqueue. > > > > - the per-CPU runqueue priority which is used to add/remove a thread > > to/from a runqueue. > > > > - the scheduling priority, also named `p_usrpri', calculated from the > > estimated amount of CPU time used. > > > > > > The diff below splits the current `p_priority' into two fields `p_runpri' > > and `p_slppri'. The important part is the schedclock() chunk: > > > > @@ -519,8 +515,6 @@ schedclock(struct proc *p) > > SCHED_LOCK(s); > > newcpu = ESTCPULIM(p->p_estcpu + 1); > > setpriority(p, newcpu, p->p_p->ps_nice); > > - if (p->p_priority >= PUSER) > > - p->p_priority = p->p_usrpri; > > SCHED_UNLOCK(s); > > > > `p_priority' had to be updated because it was overwritten during each > > tsleep()/setrunnable() cycle. The same happened in schedcpu(): > > > > schedcpu: reseting priority for zerothread(44701) 127 -> 90 > > schedclock: reseting priority for zerothread(44701) 127 -> 91 > > > > Getting rid of this assignment means we can then protect `p_estcpu' and > > `p_usrpri' with a custom rer-thread lock. > > > > With this division, `p_runpri' becomes part of the scheduler fields while > > `p_slppri' is obviously part of the global sleepqueue. > > > > The rest of the diff is quite straightforward, including the userland > > compatibility bits. > > > > Tests, comments and oks welcome :o) > > Updated diff that changes getpriority() into a macro allowing libkvm to > build as reported by anton@. New diff that initializes `p_runpri' and fix a parenthesis incoherency in macro, pointed by visa@. This has been tested by various people without regression so far, so I'd like to move forward, ok? Index: arch/sparc64/sparc64/db_interface.c === RCS file: /cvs/src/sys/arch/sparc64/sparc64/db_interface.c,v retrieving revision 1.54 diff -u -p -r1.54 db_interface.c --- arch/sparc64/sparc64/db_interface.c 7 Nov 2019 14:44:53 - 1.54 +++ arch/sparc64/sparc64/db_interface.c 19 Jan 2020 13:45:36 - @@ -958,10 +958,10 @@ db_proc_cmd(addr, have_addr, count, modi return; } db_printf("process %p:", p); - db_printf("pid:%d vmspace:%p pmap:%p ctx:%x wchan:%p pri:%d upri:%d\n", + db_printf("pid:%d vmspace:%p pmap:%p ctx:%x wchan:%p rpri:%d upri:%d\n", p->p_p->ps_pid, p->p_vmspace, p->p_vmspace->vm_map.pmap, p->p_vmspace->vm_map.pmap->pm_ctx, - p->p_wchan, p->p_priority, p->p_usrpri); + p->p_wchan, p->p_runpri, p->p_usrpri); db_printf("maxsaddr:%p ssiz:%dpg or %llxB\n", p->p_vmspace->vm_maxsaddr, p->p_vmspace->vm_ssize, (unsigned long long)ptoa(p->p_vmspace->vm_ssize)); Index: dev/pci/drm/i915/intel_breadcrumbs.c === RCS file: /cvs/src/sys/dev/pci/drm/i915/intel_breadcrumbs.c,v retrieving revision 1.4 diff -u -p -r1.4 intel_breadcrumbs.c --- dev/pci/drm/i915/intel_breadcrumbs.c10 Jul 2019 07:56:30 - 1.4 +++ dev/pci/drm/i915/intel_breadcrumbs.c19 Jan 2020 13:45:36 - @@ -451,7 +451,7 @@ static bool __intel_engine_add_wait(stru #ifdef __linux__ if (wait->tsk->prio > to_wait(parent)->tsk->prio) { #else - if (wait->tsk->p_priority > to_wait(parent)->tsk->p_priority) { + if (wait->tsk->p_usrpri > to_wait(parent)->tsk->p_usrpri) { #endif p = >rb_right; first = false; @@ -538,7 +538,7 @@ static inline bool chain_wakeup(struct r #else static inline bool chain_wakeup(struct rb_node *rb, int priority) { - return rb && to_wait(rb)->tsk->p_priority <= priority; + return rb && to_wait(rb)->tsk->p_usrpri <= priority; } #endif @@ -558,7 +558,7 @@ static inline int wakeup_priority(struct if (p == b->signaler) return INT_MIN; else - return p->p_priority; + return p->p_usrpri;
Re: pthread_cond, futex(2) & ECANCELED
On 18/01/20(Sat) 14:16, Martin Pieuchot wrote: > When futex(2) got imported it didn't return ECANCELED. This was changed > later with futex-based semaphores. > > This modification introduced a behavior change in pthread_cond_*wait(3). > The diff below restores the previous behavior by treating ECANCELED like > EINTR. > > Note that the __thrsleep(2) version also doesn't completely check for > ECANCELED, this diff also changes that. Updated version below includes a check missed previously in the __thrsleep(2) based implementation, pointed out by visa@ Index: thread/rthread_cond.c === RCS file: /cvs/src/lib/libc/thread/rthread_cond.c,v retrieving revision 1.5 diff -u -p -r1.5 rthread_cond.c --- thread/rthread_cond.c 29 Jan 2019 17:40:26 - 1.5 +++ thread/rthread_cond.c 18 Jan 2020 13:10:56 - @@ -109,13 +109,13 @@ _rthread_cond_timedwait(pthread_cond_t c * we should just go back to sleep without changing state * (timeouts, etc). */ - } while ((error == EINTR) && + } while ((error == EINTR || error == ECANCELED) && (tib->tib_canceled == 0 || (tib->tib_cantcancel & CANCEL_DISABLED))); /* if timeout or canceled, make note of that */ if (error == ETIMEDOUT) rv = ETIMEDOUT; - else if (error == EINTR) + else if (error == EINTR || error == ECANCELED) canceled = 1; pthread_mutex_lock(mutexp); Index: thread/rthread_sync.c === RCS file: /cvs/src/lib/libc/thread/rthread_sync.c,v retrieving revision 1.5 diff -u -p -r1.5 rthread_sync.c --- thread/rthread_sync.c 24 Apr 2018 16:28:42 - 1.5 +++ thread/rthread_sync.c 19 Jan 2020 09:50:06 - @@ -407,7 +407,7 @@ pthread_cond_timedwait(pthread_cond_t *c /* if timeout or canceled, make note of that */ if (error == EWOULDBLOCK) rv = ETIMEDOUT; - else if (error == EINTR) + else if (error == EINTR || error = ECANCELED) canceled = 1; /* transfer between the queues */ @@ -544,7 +544,7 @@ pthread_cond_wait(pthread_cond_t *condp, assert(self->blocking_cond == cond); /* if canceled, make note of that */ - if (error == EINTR) + if (error == EINTR || error == ECANCELED) canceled = 1; /* transfer between the queues */
Re: carp: send only IPv4 carp packets on dual stack interface
On 2020/01/19 00:11, Sebastian Benoit wrote: > chr...@openbsd.org(chr...@openbsd.org) on 2020.01.18 06:18:21 +0100: > > On Wed, Jan 15, 2020 at 12:47:28PM +0100, Sebastian Benoit wrote: > > >Christopher Zimmermann(chr...@openbsd.org) on 2020.01.15 11:55:43 +0100: > > >>Hi, > > >> > > >>as far as I can see a dual stack carp interface does not care whether it > > >>receives advertisements addressed to IPv4 or IPv6. Any one will do. > > >>So I propose to send IPv6 advertisements only when IPv4 is not possible. > > >> > > >>Why? > > >> > > >>- Noise can be reduced by using unicast advertisements. > > >> This is only possible for IPv4 by ``ifconfig carppeer``. > > >> I don't like flooding the whole network with carp advertisements when > > >> I may also unicast them. > > > > > >Maybe i'm getting confused, but in the problem description you were talking > > >about v6 vs v4, and here you argue about unicast (vs multicast?) being > > >better. Thats orthogonal, isnt it? > > > > Yes, kind of. The point is we support ``carppeer`` for IPv4, but not for > > IPv6. > > > > >>- breaking IPv6 connectivity (for example by running iked without -6) > > >> will start a preempt-war, because failing ip6_output will cause the > > >> demote counter to be increased. That's what hit me. > > > > > >But the whole point of carp is to notice broken connectivity. If you run v6 > > >on an interface, you want to know if its working, no? > > > > I grant you that much. But what kind of failures do you hope to detect > > on the _sending_ carp master, that would not also affect the backup? > > sure: misconfigured pf. Missing routes. Buggy switch. misconfigured mac address filter on switch. > > >At the very least, this needs some more thought and testing in all the ways > > >carp can be configured. > > > > Anyway, my main concern indeed is the broadcast noise generated by carp > > and I would be equally happy if we had a ``carppeer6`` option. Would > > that be considered? > > of course carppeer should work with v6, and as claudio says without an extra > keyword in ifconfig, but thats a trivial detail. > Currently carp only handles one address per af, setting carppeer twice changes the current peer address rather than adding another. A trivial implementation that sets the v4 peer address if a v4 address is passed in, and sets the v6 peer address if a v6 address is passed in, that would mean things work differently with ifconfig carp1 carppeer $foo ifconfig carp1 carppeer $bar depending on whether foo/bar are v4 or v6. Also removing a configured carppeer address to reset to multicast is just done with -carppeer with no way to indicate the af. It would work pretty nicely if you could set multiple carppeer addresses (of whatever af) and remove them individually. That's a more complex change (carp would need to keep a list of peers per af rather than a single address) but without something like that they can't really be equals and it feels like shoehorning both afs into the same keyword will just be confusing.