Re: OpenBSD 6.4-stable + current "freezes" after 4h [not]

2019-01-15 Thread Marco Prause
Re,

On 14.01.19 18:40, Theo de Raadt wrote:
> We accept reasonable bug reports from systems with a few changes.  You do NOT have
> a few changes, you have a huge pile of them, and therefore you are 'responsible
> for all the pieces'.
...
> Almost assuredly you are being burned by your own changes.

First of all, there will be no irony in the following lines.

Theo, I really appreciate your intention of protecting the devs from
unnecessary work. You were so damn right to stop the assumption I was
following.

Stuart and Hrvoje, thanks for helping with the information about ddb;
that pushed me in the right direction.


Just for the record and in terms of sharing knowledge (the bad kind too):

the problem was caused by a really bad doas call that I wasn't aware
of, which must have crept into my configs around the time I updated the
integration stage to 6.4.

(a zabbix_agent was periodically calling

'...cmd ksh args -c "/usr/sbin/ospfctl args show neighbor"'

instead of

'...cmd /usr/sbin/ospfctl args show neighbor')


Fixing this doas line made the server run stably again.
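For the record, in doas.conf(5) terms the difference looks roughly like this (a sketch; the user name _zabbix is made up, only the cmd/args portions come from the report above):

```
# Hypothetical doas.conf sketch -- the user name is an assumption.

# BAD: routes the command through a shell
permit nopass _zabbix cmd ksh args -c "/usr/sbin/ospfctl args show neighbor"

# GOOD: permits the binary itself with fixed arguments
permit nopass _zabbix cmd /usr/sbin/ospfctl args show neighbor
```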


So again thanks, and last but not least: sorry for the noise, guys!


Cheers,

Marco




Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Hrvoje Popovski
On 14.1.2019. 16:25, Hrvoje Popovski wrote:
> On 14.1.2019. 10:02, Marco Prause wrote:
>> splassert: bstp_notify_rtage: want 2 have 0
>> splassert: bstp_notify_rtage: want 2 have 0
>> splassert: bstp_notify_rtage: want 2 have 0
>> splassert: bstp_notify_rtage: want 2 have 0
>> splassert: bstp_notify_rtage: want 2 have 0
>> splassert: bstp_notify_rtage: want 2 have 0
> 
> could you try adding this sysctls
> sysctl kern.splassert=2
> sysctl kern.pool_debug=1
> 
> 
> are you getting similar traces ?
> 
> splassert: bstp_notify_rtage: want 2 have 0
> Starting stack trace...
> bstp_set_port_tc(668bdd1357c8fcb,4) at bstp_set_port_tc+0x1a0
> bstp_update_tc(fa46532a51d755d) at bstp_update_tc+0xfd
> bstp_tick(809f7c00) at bstp_tick+0x357
> softclock(3c3f171cb53a98a3) at softclock+0x117
> softintr_dispatch(120392a2955eaa7c) at softintr_dispatch+0xfc
> Xsoftclock(0,0,1388,0,800267e0,81ccd6b0) at Xsoftclock+0x1f
> acpicpu_idle() at acpicpu_idle+0x281
> sched_idle(0) at sched_idle+0x245
> end trace frame: 0x0, count: 249
> End of stack trace.
> 
> 
> splassert: bstp_notify_rtage: want 2 have 256
> Starting stack trace...
> bstp_set_port_tc(668bdd1357c8fcb,4) at bstp_set_port_tc+0x1a0
> bstp_update_tc(fa46532a51d755d) at bstp_update_tc+0xfd
> bstp_tick(809f7c00) at bstp_tick+0x357
> softclock(3c3f171cb53a98a3) at softclock+0x117
> softintr_dispatch(120392a2955eaa7c) at softintr_dispatch+0xfc
> Xsoftclock(0,0,1388,0,800267e0,81ccd6b0) at Xsoftclock+0x1f
> acpicpu_idle() at acpicpu_idle+0x281
> sched_idle(0) at sched_idle+0x245
> end trace frame: 0x0, count: 249
> End of stack trace.
> 

I'm getting these traces even with

OpenBSD 6.4-current (GENERIC.MP) #499: Mon Dec 10 11:33:10 MST 2018
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

which is before mpi@'s commit:
Changes by: m...@cvs.openbsd.org 2018/12/12 07:19:15

Modified files:
sys/net: if_bridge.c bridgectl.c


splassert: bstp_notify_rtage: want 2 have 0
Starting stack trace...
bstp_set_port_tc(233f0d46a06cbcc7,4) at bstp_set_port_tc+0x1a0
bstp_update_tc(cc45a761c76fe6c6) at bstp_update_tc+0xfd
bstp_tick(80663400) at bstp_tick+0x357
softclock(82030e4bce69f3d2) at softclock+0x117
softintr_dispatch(df881ff53c0f4dab) at softintr_dispatch+0xfc
Xsoftclock(0,0,1388,0,800267e0,81ca66b0) at Xsoftclock+0x1f
acpicpu_idle() at acpicpu_idle+0x281
sched_idle(0) at sched_idle+0x245
end trace frame: 0x0, count: 249
End of stack trace.


so maybe all these traces are noise regarding this problem, or it's been
in the tree for some time.







Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Dumitru Moldovan
On Mon, 14 Jan 2019 18:27:56 +0100, Marco Prause  
wrote:

> 
> 
> Am 14. Januar 2019 16:40:48 MEZ schrieb Theo de Raadt :
> >It sure looks like you have a pile of your own changes which are highly
> >unconventional,
> >and you are very far away from a stock OpenBSD configuration.
> 
> Well, that's right so far, because I have decided to use the tool resflash to 
> create images (https://stable.rcesoftware.com/resflash/). 
> 
> That's the "only" changes, that made the system away from a stock OpenBSD 
> configuration. 
> 
> But sure, to get this also out of the way of possible causes, I could install 
> current to the server on the hard disc. I just thought resflash just did some 
> changes to the boot process and I assume the issue more at the bridge-part. 

From https://stable.rcesoftware.com/resflash/ sources:

Resflash is not a supported OpenBSD configuration. Please do not email bugs@ or
misc@ asking for help. If you have a question or a bug to report, please post to
our mailing list (https://www.freelists.org/list/resflash), submit an issue on
GitLab (https://gitlab.com/bconway/resflash/issues), or email me directly
(bconway-at-rcesoftware-dot-com).



Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Theo de Raadt
Marco Prause  wrote:

> Am 14. Januar 2019 16:40:48 MEZ schrieb Theo de Raadt :
> >It sure looks like you have a pile of your own changes which are highly
> >unconventional,
> >and you are very far away from a stock OpenBSD configuration.
> 
> Well, that's right so far, because I have decided to use the tool resflash to 
> create images (https://stable.rcesoftware.com/resflash/). 

So basically.. you have changed everything.

We accept reasonable bug reports from systems with a few changes.  You do NOT have
a few changes, you have a huge pile of them, and therefore you are 'responsible
for all the pieces'.

> That's the "only" changes, that made the system away from a stock OpenBSD 
> configuration.

You have 5 MFS filesystems.  "only changes"?

Almost assuredly you are being burned by your own changes.



Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Marco Prause



Am 14. Januar 2019 16:40:48 MEZ schrieb Theo de Raadt :
>It sure looks like you have a pile of your own changes which are highly
>unconventional,
>and you are very far away from a stock OpenBSD configuration.

Well, that's right so far, because I have decided to use the tool resflash to
create images (https://stable.rcesoftware.com/resflash/).

Those are the "only" changes that moved the system away from a stock OpenBSD
configuration.

But sure, to get this out of the way of possible causes as well, I could install
current on the server's hard disk. I just thought resflash only made some changes
to the boot process, and I suspect the issue lies more in the bridge part.


>Having made those decisions, you are responsible for your own issues.
>
>Sorry.

That seems fair enough to me.
Let me have a look at the ddb stuff Stuart mentioned and the splassert stuff
Hrvoje mentioned before I reinstall the server with a stock current OpenBSD.

Cheers, 
Marco

>> Hi Stuart,
>> 
>> thanks for having a look at this.
...
Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Theo de Raadt
It sure looks like you have a pile of your own changes which are highly
unconventional, and you are very far away from a stock OpenBSD configuration.

Having made those decisions, you are responsible for your own issues.

Sorry.

> Hi Stuart,
> 
> thanks for having a look at this.
> 
> 
> > Is it the same or different hardware type and BIOS version for the
> > working and hanging machines? (maybe diff the two dmesgs)
> >
> > Same or different filesystem mount options?  (Are you using softdep?)
> 
> it's (nearly) the same hardware.
> 
> But thanks to your hint of diffing the dmesg outputs I found a small
> difference :
> 
> 
> * server1:
> 
> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec200 (78 entries)
> bios0: vendor American Megatrends Inc. version "4.6.5" date 03/02/2015
> bios0: INTEL Corporation DENLOW_WS   
> 
> * server2:
> 
> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec200 (77 entries)
> bios0: vendor American Megatrends Inc. version "4.6.5" date 03/02/2015   
> bios0: INTEL Corporation DENLOW_WS   
> 
> 
> * server2 has an additional entry, I do not see on server1
> 
> acpipci0 at acpi0 PCI0: 0x0010 0x0011 0x
> 
> 
> * server2 also seems to have a slightly different memory setup :
> 
> spdmem0 at iic0 addr 0x50: 8GB DDR3 SDRAM PC3-12800
> 
> * whereas server1 has :
> 
> spdmem0 at iic0 addr 0x50: 4GB DDR3 SDRAM PC3-12800
> spdmem1 at iic0 addr 0x52: 4GB DDR3 SDRAM PC3-12800
> 
> 
> 
> On the filesystem I can't see any differences :
> 
> * server1:
> $
> mount 
>    
> 
> /dev/sd0d on / type ffs (local, noatime, nodev,
> read-only) 
> mfs:14405 on /tmp type mfs (asynchronous, local, noatime, nodev, nosuid,
> size=65536 512-blocks)
> mfs:35803 on /dev type mfs (asynchronous, local, noatime, noexec,
> size=12288 512-blocks)   
> mfs:30894 on /etc type mfs (asynchronous, local, noatime, nodev, nosuid,
> size=65536 512-blocks)
> mfs:75826 on /var type mfs (asynchronous, local, noatime, nodev, noexec,
> size=131072 512-blocks)   
> mfs:23894 on /usr/lib type mfs (asynchronous, local, noatime, nodev,
> nosuid, size=262144 512-blocks)
> mfs:21714 on /usr/libexec type mfs (asynchronous, local, noatime, nodev,
> size=262144 512-blocks)   
> $ cat
> /etc/fstab
>    
> 
> dd6727251088320b.a /mbr ffs rw,noatime,nodev,noexec,noauto 1
> 2 
> dd6727251088320b.d / ffs ro,noatime,nodev 1
> 1  
> dd6727251088320b.f /cfg ffs rw,noatime,nodev,noexec,noauto 1
> 2 
> dd6727251088320b.i /efi msdos rw,noatime,nodev,noexec,noauto 0
> 0   
> swap /tmp mfs rw,async,noatime,nodev,nosuid,-s32M 0
> 0  
> $ 
>  
> 
> 
> 
> * server2:
> 
> $ mount
> /dev/sd0e on / type ffs (local, noatime, nodev, read-only)
> mfs:19530 on /tmp type mfs (asynchronous, local, noatime, nodev, nosuid,
> size=65536 512-blocks)
> mfs:65784 on /dev type mfs (asynchronous, local, noatime, noexec,
> size=12288 512-blocks)   
> mfs:41465 on /etc type mfs (asynchronous, local, noatime, nodev, nosuid,
> size=65536 512-blocks)
> mfs:86708 on /var type mfs (asynchronous, local, noatime, nodev, noexec,
> size=262144 512-blocks)   
> mfs:90223 on /usr/lib type mfs (asynchronous, local, noatime, nodev,
> nosuid, size=262144 512-blocks)
> mfs:22430 on /usr/libexec type mfs (asynchronous, local, noatime, nodev,
> size=262144 512-blocks)   
> $ cat
> /etc/fstab
>    
> 
> 9f97b8d42ceedbf4.a /mbr ffs rw,noatime,nodev,noexec,noauto 1
> 2 
> 9f97b8d42ceedbf4.e / ffs ro,noatime,nodev 1 1
> 9f97b8d42ceedbf4.f /cfg ffs rw,noatime,nodev,noexec,noauto 1
> 2 
> 9f97b8d42ceedbf4.i /efi msdos rw,noatime,nodev,noexec,noauto 0
> 0   
> swap /tmp mfs rw,async,noatime,nodev,nosuid,-s32M 0 0
> $
> 
> 
> 
> For the other suggestions, let me run the system with "
> 
> sysctl ddb.console=1" and wait until the problem will occur to answer your 
> questions as soon I have the additional information.
> 
> 
> Cheers,
> Marco
> 
> 



Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Hrvoje Popovski
On 14.1.2019. 10:02, Marco Prause wrote:
> splassert: bstp_notify_rtage: want 2 have 0
> splassert: bstp_notify_rtage: want 2 have 0
> splassert: bstp_notify_rtage: want 2 have 0
> splassert: bstp_notify_rtage: want 2 have 0
> splassert: bstp_notify_rtage: want 2 have 0
> splassert: bstp_notify_rtage: want 2 have 0

could you try adding these sysctls?
sysctl kern.splassert=2
sysctl kern.pool_debug=1


are you getting similar traces?

splassert: bstp_notify_rtage: want 2 have 0
Starting stack trace...
bstp_set_port_tc(668bdd1357c8fcb,4) at bstp_set_port_tc+0x1a0
bstp_update_tc(fa46532a51d755d) at bstp_update_tc+0xfd
bstp_tick(809f7c00) at bstp_tick+0x357
softclock(3c3f171cb53a98a3) at softclock+0x117
softintr_dispatch(120392a2955eaa7c) at softintr_dispatch+0xfc
Xsoftclock(0,0,1388,0,800267e0,81ccd6b0) at Xsoftclock+0x1f
acpicpu_idle() at acpicpu_idle+0x281
sched_idle(0) at sched_idle+0x245
end trace frame: 0x0, count: 249
End of stack trace.


splassert: bstp_notify_rtage: want 2 have 256
Starting stack trace...
bstp_set_port_tc(668bdd1357c8fcb,4) at bstp_set_port_tc+0x1a0
bstp_update_tc(fa46532a51d755d) at bstp_update_tc+0xfd
bstp_tick(809f7c00) at bstp_tick+0x357
softclock(3c3f171cb53a98a3) at softclock+0x117
softintr_dispatch(120392a2955eaa7c) at softintr_dispatch+0xfc
Xsoftclock(0,0,1388,0,800267e0,81ccd6b0) at Xsoftclock+0x1f
acpicpu_idle() at acpicpu_idle+0x281
sched_idle(0) at sched_idle+0x245
end trace frame: 0x0, count: 249
End of stack trace.
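For reference, to have these settings survive a reboot they can go into /etc/sysctl.conf (a sketch; note that kern.pool_debug in particular carries a noticeable performance cost, so it should be reverted after the bug hunt):

```
# /etc/sysctl.conf -- temporary debug knobs, remove after debugging
kern.splassert=2     # log splassert failures with a stack trace
kern.pool_debug=1    # enable kernel pool consistency checks
```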



Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Marco Prause
Just a small follow-up to my previous email:

I've just had a look at the hardware that caused the problem before I
exchanged it with the new one, which now also produces the problem.

This server seems to have the same hardware setup as server1, mentioned in
the previous email, which is not freezing.


Here I see the same memory-setup :

spdmem0 at iic0 addr 0x50: 4GB DDR3 SDRAM PC3-12800
spdmem1 at iic0 addr 0x52: 4GB DDR3 SDRAM PC3-12800

and no

acpipci0 at acpi0 PCI0: 0x0010 0x0011 0x00

which may be produced by the -current kernel.





Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Marco Prause
Hi Stuart,

thanks for having a look at this.


> Is it the same or different hardware type and BIOS version for the
> working and hanging machines? (maybe diff the two dmesgs)
>
> Same or different filesystem mount options?  (Are you using softdep?)

it's (nearly) the same hardware.

But thanks to your hint of diffing the dmesg outputs I found a small
difference :


* server1:

bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec200 (78 entries)
bios0: vendor American Megatrends Inc. version "4.6.5" date 03/02/2015
bios0: INTEL Corporation DENLOW_WS   

* server2:

bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec200 (77 entries)
bios0: vendor American Megatrends Inc. version "4.6.5" date 03/02/2015   
bios0: INTEL Corporation DENLOW_WS   


* server2 has an additional entry, I do not see on server1

acpipci0 at acpi0 PCI0: 0x0010 0x0011 0x


* server2 also seems to have a slightly different memory setup :

spdmem0 at iic0 addr 0x50: 8GB DDR3 SDRAM PC3-12800

* whereas server1 has :

spdmem0 at iic0 addr 0x50: 4GB DDR3 SDRAM PC3-12800
spdmem1 at iic0 addr 0x52: 4GB DDR3 SDRAM PC3-12800



On the filesystem I can't see any differences :

* server1:
$ mount

/dev/sd0d on / type ffs (local, noatime, nodev, read-only)
mfs:14405 on /tmp type mfs (asynchronous, local, noatime, nodev, nosuid, size=65536 512-blocks)
mfs:35803 on /dev type mfs (asynchronous, local, noatime, noexec, size=12288 512-blocks)
mfs:30894 on /etc type mfs (asynchronous, local, noatime, nodev, nosuid, size=65536 512-blocks)
mfs:75826 on /var type mfs (asynchronous, local, noatime, nodev, noexec, size=131072 512-blocks)
mfs:23894 on /usr/lib type mfs (asynchronous, local, noatime, nodev, nosuid, size=262144 512-blocks)
mfs:21714 on /usr/libexec type mfs (asynchronous, local, noatime, nodev, size=262144 512-blocks)
$ cat /etc/fstab

dd6727251088320b.a /mbr ffs rw,noatime,nodev,noexec,noauto 1 2
dd6727251088320b.d / ffs ro,noatime,nodev 1 1
dd6727251088320b.f /cfg ffs rw,noatime,nodev,noexec,noauto 1 2
dd6727251088320b.i /efi msdos rw,noatime,nodev,noexec,noauto 0 0
swap /tmp mfs rw,async,noatime,nodev,nosuid,-s32M 0 0
$



* server2:

$ mount
/dev/sd0e on / type ffs (local, noatime, nodev, read-only)
mfs:19530 on /tmp type mfs (asynchronous, local, noatime, nodev, nosuid, size=65536 512-blocks)
mfs:65784 on /dev type mfs (asynchronous, local, noatime, noexec, size=12288 512-blocks)
mfs:41465 on /etc type mfs (asynchronous, local, noatime, nodev, nosuid, size=65536 512-blocks)
mfs:86708 on /var type mfs (asynchronous, local, noatime, nodev, noexec, size=262144 512-blocks)
mfs:90223 on /usr/lib type mfs (asynchronous, local, noatime, nodev, nosuid, size=262144 512-blocks)
mfs:22430 on /usr/libexec type mfs (asynchronous, local, noatime, nodev, size=262144 512-blocks)
$ cat /etc/fstab

9f97b8d42ceedbf4.a /mbr ffs rw,noatime,nodev,noexec,noauto 1 2
9f97b8d42ceedbf4.e / ffs ro,noatime,nodev 1 1
9f97b8d42ceedbf4.f /cfg ffs rw,noatime,nodev,noexec,noauto 1 2
9f97b8d42ceedbf4.i /efi msdos rw,noatime,nodev,noexec,noauto 0 0
swap /tmp mfs rw,async,noatime,nodev,nosuid,-s32M 0 0
$



For the other suggestions, let me run the system with "sysctl ddb.console=1"
and wait until the problem occurs; I'll answer your questions as soon as I
have the additional information.


Cheers,
Marco




Re: OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Stuart Henderson
On 2019-01-14, Marco Prause  wrote:
> after an initial boot, everything is working fine for round about 4 hours.
>
> After 4 hours, it is not possible to login into the backup/secondary
> openbsd-server via ssh or even via serial console, but it seems to still
> forward traffic correctly. Also the ospf adjacencies are up as
> well as ipsec security associations and so on.
>
> Monitoring metrics doesn't show any meassured increase of any data.
>
> I've already exchanged the hardware, because it was my first guess, as
> the first server/gateway is running without any problems with the same
> 6.4-stable and config version - but this unfortunately didn't help.

Is it the same or different hardware type and BIOS version for the
working and hanging machines? (maybe diff the two dmesgs)

Same or different filesystem mount options?  (Are you using softdep?)
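The dmesg comparison suggested above can be sketched like this (the file names and excerpt contents are made up for illustration; the real files would be saved copies of dmesg from each machine):

```shell
# Two toy dmesg excerpts standing in for the real saved outputs.
cat > /tmp/dmesg.server1 <<'EOF'
bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec200 (78 entries)
spdmem0 at iic0 addr 0x50: 4GB DDR3 SDRAM PC3-12800
spdmem1 at iic0 addr 0x52: 4GB DDR3 SDRAM PC3-12800
EOF
cat > /tmp/dmesg.server2 <<'EOF'
bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec200 (77 entries)
spdmem0 at iic0 addr 0x50: 8GB DDR3 SDRAM PC3-12800
EOF

# -u gives unified context; lines unique to one machine start with - or +.
# diff exits 1 when the files differ, so mask the status for scripts.
diff -u /tmp/dmesg.server1 /tmp/dmesg.server2 || :
```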

> When I left an serial console login opened, I was able to execute some
> commands and also a top, I've invoked before, was still running at the
> failure-state. But when entering e.g. ifconfig, or trying a
> tab-completion also the serial console freezes.

The "WAIT" column of a running top(1) may include useful information.

If possible, run with "sysctl ddb.console=1" (needs setting
pre-securelevel, add it to sysctl.conf if it's not already there),
which should allow you to enter ddb by sending a BREAK signal over
the serial line (~# in cu(1)). You can try that under normal
operation (will interrupt service; be ready to type "c" and
enter to continue to resume) to check it works.

Then during a hang attempt to enter ddb, if you are successful then
capture at least the following:

ps
trace

Ideally also switch to all other cpus (the number in the ddb
prompt shows the current one; you can do "mach ddbcpu 3" etc
to switch to another) and re-run trace (which is completely
per-cpu) and ps (the line marked "*" indicates the currently
active process on the currently selected CPU - for a report
there's no need to repeat the entire list N times, but it could
be useful to indicate the running processes on all CPUs).
When you are done with these, also fetch:

sh malloc
sh all pools
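Put together, a session over the serial line might look roughly like this (a sketch; the device name cua00 is an assumption and output is omitted):

```
$ cu -l cua00 -s 115200        # attach to the serial console
~#                             # send BREAK to enter ddb
ddb{0}> ps                     # '*' marks the proc running on this CPU
ddb{0}> trace                  # stack trace for the current CPU
ddb{0}> mach ddbcpu 1          # switch to CPU 1, then trace/ps again
ddb{1}> trace
ddb{1}> sh malloc
ddb{1}> sh all pools
ddb{1}> c                      # continue, resuming the system
```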

For the benefit of other readers who don't have serial console,
ctrl+alt+esc on the keyboard will do the same if the
keyboard/monitor are the selected console device, obviously
it will be harder to capture the output in an easily readable
format!




OpenBSD 6.4-stable + current "freezes" after 4h

2019-01-14 Thread Marco Prause
Hi all @misc,

First things first: sorry for my long description, but:

after upgrading from 6.3-stable to 6.4-stable (and later also current)
in our integration stage, I've run into a strange problem.

I run OpenBSD in a hub-and-spoke vpn architecture in round about 14
distributed datacenters.

6.3-stable is running fine and stable as expected.

(all versions 6.3-stable, 6.4-stable and current are running as
resflash-image)



All locations - including the mentioned integration stage - are running
with the same setup.

Each location has two OpenBSD servers/gateways that run:


- ospf over gre over ipsec

-- local to each other and to our two main datacenters (hub)


- two bridge-interfaces inside one server

-- one for tagged frames, one for untagged

-- both bridge-interfaces are connected with a pair-interface

-- first server is configured as primary within ospf,stp and carp


- layer-2 redundancy is done by stp on the openbsd-side and mstp
(instance 0) on the network-gear-side


- layer-3 redundancy is done by ospf and carp


- pf is enabled



The problem can be described as follows :

after an initial boot, everything is working fine for round about 4 hours.

After 4 hours, it is not possible to log in to the backup/secondary
openbsd-server via ssh or even via the serial console, but it seems to still
forward traffic correctly. The ospf adjacencies are also up, as well as
ipsec security associations and so on.

Monitoring metrics don't show any measurable increase in any data.

I've already exchanged the hardware, because it was my first guess, as
the first server/gateway is running without any problems with the same
6.4-stable and config version - but this unfortunately didn't help.

When I left a serial console login open, I was able to execute some
commands, and a top I had invoked before was still running in the
failure state. But when entering e.g. ifconfig, or trying tab-completion,
the serial console freezes as well.


The problem does not occur if I:


- shutdown bridge0 (for tagged frames)

or

- shutdown bridge1 (for untagged frames)

or

- shutdown pair0 or pair1 (interconnection between the bridges)
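In command terms, the workarounds above are simply (standard ifconfig usage; interface names as described earlier):

```
# any one of these avoids the freeze, at the cost of redundancy
ifconfig bridge0 down   # bridge for tagged frames
ifconfig bridge1 down   # bridge for untagged frames
ifconfig pair0 down     # one side of the bridge interconnection
```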



Please find below the commands I was still able to execute before
tab-completion or ifconfig (in this case) froze the console:

---cut---

# df -i
Filesystem  512-blocks      Used     Avail Capacity   iused   ifree  %iused  Mounted on
/dev/sd0e      3473724   1127852   2172188    34%     14494  219360     6%   /
mfs:64049        63326        12     60148     0%         7    8183     0%   /tmp
mfs:51486        11391        63     10759     1%      1231    1839    40%   /dev
mfs:86629        63326      8552     51608    14%       365    7825     4%   /etc
mfs:35143       253790     11512    229590     5%       236   32530     1%   /var
mfs:6765        253790     76506    164596    32%        45   32721     0%   /usr/lib
mfs:9627        253790      6132    234970     3%        66   32700     0%   /usr/libexec

#

# vmstat 1 10
 procs    memory       page                    disks    traps          cpu
 r   s   avm     fre  flt  re  pi  po  fr  sr sd0 sd1  int   sys   cs us sy id
 0  64  104M   7474M   19   0   0   0   0   0   1   0   73    68  168  0  0 100
 0  64  104M   7474M   20   0   0   0   0   0   0   0   66    60  128  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   48    45   92  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   73    44  146  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   65    47  132  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   37    49   82  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   52    44  107  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   51    44  106  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   52    44  104  0  0 100
 0  64  104M   7474M   12   0   0   0   0   0   0   0   53    47  118  0  0 100
#
# iostat 1 10
  tty  sd0   sd1    cpu
 tin tout  KB/t  t/s  MB/s   KB/t  t/s  MB/s  us ni sy sp in id
   0    2 28.82    0  0.01   0.50    0  0.00   0  0  0  0  0100
   0  193  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
   0   64  0.00    0  0.00   0.00    0  0.00   0  0  0  0  0100
#

# df -h
Filesystem Size    Used   Avail Capacity  Mounted on
/dev/sd0e  1.7G    551M    1.0G    34%    /
mfs:69819 30.9M    9.0K   29.4M 0%    /tmp