Re: sysctl ddb.trigger

2023-05-28 Thread Sebastien Marie
On Mon, May 29, 2023 at 02:41:00PM +1000, Aaron Mason wrote:
> On Mon, May 29, 2023 at 4:08 AM Paul de Weerd  wrote:
> >
> > Hi folks,
> >
> > I'm trying to debug an issue where my machine partially locks up after
> > some hours (somewhere between 12 and 48, is my current window).  The
> > extent of the locking is still unclear, that's part of what I'm trying
> > to figure out.
> >
> > While debugging, I thought I'd try to enter ddb, so I set ddb.console
> > to 1 in /etc/sysctl.conf and tried to write to ddb.trigger:
> >
> > pom# sysctl ddb.{console,panic}
> > ddb.console=1
> > ddb.panic=1
> > pom# sysctl ddb.trigger=1
> > sysctl: ddb.trigger: Operation not supported by device
> >
> > Am I holding this thing wrong?  According to ddb(4), the above should
> > be sufficient, no?
> >
> > One thing to note is that I'm running this from a chroot into a mfs
> > system (as part of the debugging of the locking up), could that affect
> > things?  Even if it's from a chroot, I can still change sysctl MIBs -
> > is ddb.trigger special?
> >
> > I'm doing all this through the serial console (glass console and
> > network both are unresponsive in the locked up state), could that be
> > related?  (for the record, BREAK doesn't work either to enter ddb, I
> > guessed it was due to the USB-to-serial dongle I'm using (uplcom(4))
> > lacking support for sending a proper BREAK .. but this may be the same
> > issue?)
> >
> > Paul
> >
> 
> Just spitballing... could it be something blocked by kern.securelevel?
> 

It might help.

The "Operation not supported by device" is ENODEV, and seems to be the one here:

sys/ddb/db_usrreq.c

	case DBCTL_TRIGGER:
		if (newp && db_console) {
			struct process *pr = curproc->p_p;

			if (securelevel < 1 ||
			    (pr->ps_flags & PS_CONTROLT && cn_tab &&
			    cn_tab->cn_dev == pr->ps_session->s_ttyp->t_dev)) {
				db_enter();
				newp = NULL;
			} else
				return (ENODEV);
		}
		return (sysctl_rdint(oldp, oldlenp, newp, 0));

From the code, to use ddb.trigger (aka DBCTL_TRIGGER), you need:

- kern.securelevel < 1 (on a running multi-user system, that means
  kern.securelevel = -1)
OR
- something related to the console (I suppose "the tty of the current
  process being the same as the console")

If you are connected over serial but your console is on VGA, that might be
the cause.

So you might need to set kern.securelevel to a lower value (with "sysctl
kern.securelevel=-1" in /etc/rc.securelevel), or put your console on serial
(with "set tty com0" in the boot loader).
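
To make the two conditions concrete, here is a small userland sketch in C of
the decision quoted above. It is not the kernel code: struct trigger_ctx and
trigger_write_result() are made-up names, device numbers are plain ints, and
it only models a write to ddb.trigger with ddb.console already set to 1.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel state that DBCTL_TRIGGER looks at. */
struct trigger_ctx {
	int  securelevel;   /* value of kern.securelevel */
	bool has_ctty;      /* process has a controlling tty (PS_CONTROLT) */
	bool have_console;  /* a console device is registered (cn_tab != NULL) */
	int  console_dev;   /* device number of the console */
	int  tty_dev;       /* device number of the controlling tty */
};

/*
 * Model of a write to ddb.trigger with ddb.console=1.
 * Returns 0 where the kernel would call db_enter(), ENODEV otherwise.
 */
static int
trigger_write_result(const struct trigger_ctx *c)
{
	assert(c != NULL);

	/* Condition 1: a low enough securelevel is sufficient on its own. */
	if (c->securelevel < 1)
		return 0;

	/* Condition 2: the caller's controlling tty is the console itself. */
	if (c->has_ctty && c->have_console && c->console_dev == c->tty_dev)
		return 0;

	return ENODEV;
}
```

With securelevel 1 and a login on a tty that is not the console (for example
a serial line while the console is on VGA), this returns ENODEV, which is the
error sysctl printed.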

Thanks.
-- 
Sebastien Marie



Re: sysctl ddb.trigger

2023-05-28 Thread Aaron Mason
On Mon, May 29, 2023 at 4:08 AM Paul de Weerd  wrote:
>
> Hi folks,
>
> I'm trying to debug an issue where my machine partially locks up after
> some hours (somewhere between 12 and 48, is my current window).  The
> extent of the locking is still unclear, that's part of what I'm trying
> to figure out.
>
> While debugging, I thought I'd try to enter ddb, so I set ddb.console
> to 1 in /etc/sysctl.conf and tried to write to ddb.trigger:
>
> pom# sysctl ddb.{console,panic}
> ddb.console=1
> ddb.panic=1
> pom# sysctl ddb.trigger=1
> sysctl: ddb.trigger: Operation not supported by device
>
> Am I holding this thing wrong?  According to ddb(4), the above should
> be sufficient, no?
>
> One thing to note is that I'm running this from a chroot into a mfs
> system (as part of the debugging of the locking up), could that affect
> things?  Even if it's from a chroot, I can still change sysctl MIBs -
> is ddb.trigger special?
>
> I'm doing all this through the serial console (glass console and
> network both are unresponsive in the locked up state), could that be
> related?  (for the record, BREAK doesn't work either to enter ddb, I
> guessed it was due to the USB-to-serial dongle I'm using (uplcom(4))
> lacking support for sending a proper BREAK .. but this may be the same
> issue?)
>
> Paul
>
> --
> >[<++>-]<+++.>+++[<-->-]<.>+++[<+
> +++>-]<.>++[<>-]<+.--.[-]
>  http://www.weirdnet.nl/
>

Just spitballing... could it be something blocked by kern.securelevel?

-- 
Aaron Mason - Programmer, open source addict
I've taken my software vows - for beta or for worse



sysctl ddb.trigger

2023-05-28 Thread Paul de Weerd
Hi folks,

I'm trying to debug an issue where my machine partially locks up after
some hours (somewhere between 12 and 48, is my current window).  The
extent of the locking is still unclear, that's part of what I'm trying
to figure out.

While debugging, I thought I'd try to enter ddb, so I set ddb.console
to 1 in /etc/sysctl.conf and tried to write to ddb.trigger:

pom# sysctl ddb.{console,panic}
ddb.console=1
ddb.panic=1
pom# sysctl ddb.trigger=1
sysctl: ddb.trigger: Operation not supported by device

Am I holding this thing wrong?  According to ddb(4), the above should
be sufficient, no?

One thing to note is that I'm running this from a chroot into a mfs
system (as part of the debugging of the locking up), could that affect
things?  Even if it's from a chroot, I can still change sysctl MIBs -
is ddb.trigger special?

I'm doing all this through the serial console (glass console and
network both are unresponsive in the locked up state), could that be
related?  (for the record, BREAK doesn't work either to enter ddb, I
guessed it was due to the USB-to-serial dongle I'm using (uplcom(4))
lacking support for sending a proper BREAK .. but this may be the same
issue?)

Paul

-- 
>[<++>-]<+++.>+++[<-->-]<.>+++[<+
+++>-]<.>++[<>-]<+.--.[-]
 http://www.weirdnet.nl/ 



Re: carp flapping

2023-05-28 Thread Nick Holland

Followup...

On 5/12/23 08:17, Stuart Henderson wrote:
> On 2023-05-12, Nick Holland  wrote:
> > ...
> > I had several other people suggest network problems.  I'm not going to
> > say "impossible" or even "unlikely", but my understanding is that the
> > two machines are both plugged into the same switch, in the same rack.

I've since had someone more familiar with the physical environment say
my blind trust in their switch hw may be slightly misplaced. :)


> You can also look at
>
> netstat -ni -I ixl0
> netstat -ni -I ixl0 -e
> kstat ixl0:::

These looked REALLY clean.  No drops, fails or collisions.

> which may give some other clues
>
> even pfctl -si might have something relevant


> > Several people pointed out I was using the default advskew of 1 second,
> > which means a small network glitch (or system load?  maybe I'm all wrong
> > about this system never breaking a sweat, at least when it comes to
> > network traffic) would flip it, so I've increased it to 10 on both
> > machines (and apparently just induced a flip of my own. oops).  By the
> > nature of this system, some people will be annoyed by any flip, so it
> > really doesn't matter if it was a 1 second outage or a 30 second outage,
> > I just want the system available again after an unhappy event (or
> > routine maintenance).
>
> the coarse adjustment in seconds is advbase, advskew is a much smaller
> delay meant for a config with primary/backup where the backup advertises
> just slightly less frequently.


Um, yeah.  I set advbase, and typed advskew in the e-mail.  My bad.
After setting it to 10, I have gone over two weeks without any flips, so
that looks like a pretty good fix.
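
As a rough sketch of how the two knobs combine, assuming the usual carp rule
that the advertisement interval is advbase seconds plus advskew/256 seconds,
the arithmetic looks like this in C. The helper name is made up and this is
not taken from the carp source.

```c
#include <assert.h>

/*
 * Advertisement interval in microseconds, assuming
 * interval = advbase seconds + advskew/256 seconds.
 * carp_adv_interval_usec() is a hypothetical helper name.
 */
static long long
carp_adv_interval_usec(int advbase, int advskew)
{
	/* Negative values make no sense for either setting. */
	assert(advbase >= 0);
	assert(advskew >= 0);

	/* Whole seconds from advbase, 1/256-second steps from advskew. */
	return (long long)advbase * 1000000LL +
	    (long long)advskew * 1000000LL / 256;
}
```

With advbase 10 and advskew 0 this gives 10000000 (10 seconds); with advbase 1
and advskew 100 it gives 1390625 (about 1.39 seconds). Under that assumption,
advskew only nudges the interval by fractions of a second, so advbase is the
setting to raise when the goal is to ride out short glitches.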
 
Thanks for the guidance!


Nick.