Re: panic: biodone2 already

2018-08-07 Thread Emmanuel Dreyfus
Robert Elz  wrote:

> This suggests to me that something is getting totally scrambled in
> the buf headers when things get busy.

I tried dumping the buf_t before the panic, to check whether it could be
completely corrupted, but that does not seem to be the case. b_blkno is
4904744, and the filesystem has 131891200 blocks.

bp = 0xa5c1e000
bp->b_error = 0
bp->b_resid = 0
bp->b_flags = 0x10
bp->b_prio = 1
bp->b_bufsize = 2048
bp->b_bcount = 2048
bp->b_dev = 0x8e00
bp->b_blkno = 4904744
bp->b_proc = 0x0
bp->b_saveaddr = 0x0
bp->b_private = 0x0
bp->b_dcookie = 0
bp->b_refcnt = 1
bp->b_lblkno = 0
bp->b_cflags = 0x10
bp->b_vp = 0xa5c2b2a8
bp->b_oflags = 0x200
panic: biodone2 already
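
For context, here is a minimal userland sketch of the kind of check that
produces this panic. It assumes the kernel marks a completed buffer by
setting a BO_DONE bit in b_oflags (taken here to be 0x200, which would match
the b_oflags value in the dump above) and treats a second completion as
fatal. The names fake_buf and fake_biodone are hypothetical; this is an
illustration, not the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define BO_DONE 0x200u  /* assumed value, believed to match NetBSD's <sys/buf.h> */

/* Hypothetical stand-in for the one buf_t field this sketch needs. */
struct fake_buf {
    uint32_t b_oflags;
};

/*
 * Mark the buffer as completed.  Returns false if it had already been
 * completed, which is the situation where the real kernel panics.
 */
bool fake_biodone(struct fake_buf *bp)
{
    if (bp->b_oflags & BO_DONE)
        return false;          /* second completion of the same buffer */
    bp->b_oflags |= BO_DONE;   /* first completion: record it */
    return true;
}
```

Under that assumption, a dump showing b_oflags = 0x200 just before the panic
is what one would expect if the buffer had already been completed once.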

db{0}> show vnode 0xa5c2b2a8
OBJECT 0xa5c2b2a8: locked=1, pgops=0x8058ac00, npages=0,
refs=2

vnode 0xa5c2b2a8 flags 0x30
tag VT_UFS(1) type VDIR(2) mount 0xa448c000 typedata 0x0
usecount 2 writecount 0 holdcount 1
size 200 writesize 200 numoutput 0
data 0xa5c2cf00 lock 0xa5c2b3d8
state LOADED key(0xa448c000 8) 3a 9f 04 00 00 00 00 00
lrulisthd 0x8067bb80


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: panic: biodone2 already

2018-08-07 Thread Robert Elz
For what it is worth, and in this case it might not be much, I did a similar
test on my test amd64 DomU last night.

Running dump and /etc/daily in parallel did nothing, but running lots of
them in parallel eventually did cause a crash.

But I saw a different crash -- rather than a biodone2, I got a KASSERT
from one I added as part of attempting to diagnose the babylon5
"install failures" - that is, if my test kernel ever gets an I/O error,
the KASSERT (which is just KASSERT(0)) causes a crash.   This
was intended for generic kernels that run in qemu - but I use the
same sources for my generic testing, and simply left that there.
My DomU test system *never* gets an I/O error, so it simply did
not matter (its filesystem is on a raid on the Dom0, and neither the
Dom0 nor the raid report anything even smelling like I/O errors,
what's more, the Dom0 is more likely to crash than ever allow a
real I/O error through to the DomU).

This is the I/O error that occurred...

[ 485570.8105971] xbd0a: error writing fsbn 49691936 of 49691936-49691951 (xbd0 bn 49691936; cn 24263 tn 0 sn 1312)
panic: kernel diagnostic assertion "0" failed: file "/readonly/release/testing/src/sys/kern/subr_disk.c", line 163


What's kind of interesting about that, is that the DomU filesystem is ...

format  FFSv1
endian  little-endian
magic   11954   time    Wed Aug  8 03:57:00 2018
superblock location 8192    id  [ 57248bd0 6db5a772 ]
cylgrp  dynamic inodes  4.4BSD  sblock  FFSv2   fslevel 4
nbfree  4037762 ndir    2334    nifree  2009341 nffree  3116
ncg 624 size    33554432    blocks  33289830

(no idea how I managed to make it FFSv1, but that doesn't matter).

What is interesting is "blocks  33289830" when compared with the
I/O error "error writing fsbn 49691936", which is 16402106 blocks
beyond the end of the filesystem.
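
The arithmetic can be checked with a few lines of C. This is only a sketch:
blocks_past_end is a hypothetical helper, and it assumes, as the comparison
above does, that the block number and the filesystem size are expressed in
the same units.

```c
#include <stdint.h>

/*
 * Return how far blkno lies past the end of a filesystem of fs_blocks
 * blocks, or 0 if blkno is within range.  Both arguments are assumed to
 * be in the same units.
 */
int64_t blocks_past_end(int64_t blkno, int64_t fs_blocks)
{
    return blkno >= fs_blocks ? blkno - fs_blocks : 0;
}
```

With the numbers above, blocks_past_end(49691936, 33289830) gives 16402106.
The block number from the i386 buf dump earlier in the thread (4904744 on a
filesystem of 131891200 blocks) gives 0, i.e. it is in range.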

This suggests to me that something is getting totally scrambled in
the buf headers when things get busy.

kre



Re: panic: biodone2 already

2018-08-07 Thread Emmanuel Dreyfus
Jaromír Doleček  wrote:

> Thanks. Could  you please try a -current kernel for DOMU and see if it
> crashes the same? If possible a DOMU kernel from daily builds, to rule
> out local compiler issue.

It crashes the same way with a kernel from 201808050730. Here is the
uname -a output:
NetBSD bacasable 8.99.23 NetBSD 8.99.23 (XEN3PAE_DOMU) #0: Sun Aug  5 06:48:50 UTC 2018 mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/xen/compile/XEN3PAE_DOMU i386

> xbd is not mpsafe, so it shouldn't even be a race due to parallel
> processing on different CPUs. Maybe it would be useful to check if the
> problem still happens when you assign just a single CPU to the DOMU.

I get the crash with vcpu = 1 for the domU. I also tried to pin a single
cpu for the test domU, I still get it to crash:

xl vcpu-pin bacasable 0 0
xl vcpu-pin $other_domU all 1-3

An interesting point: Adding the -X flag to dump seems to let it work
without a panic. It may be luck, but that did not crash so far.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: panic: biodone2 already

2018-08-07 Thread Jaromír Doleček
2018-08-07 18:42 GMT+02:00 Emmanuel Dreyfus :
>  kern/53506

Thanks. Could  you please try a -current kernel for DOMU and see if it
crashes the same? If possible a DOMU kernel from daily builds, to rule
out local compiler issue.

There are not really many differences in the xbd/evtchn code itself
between 8.0 and -current, however. There was some interrupt code
reorganization which might have affected this, but this happened after
netbsd-8 was created.

xbd is not mpsafe, so it shouldn't even be a race due to parallel
processing on different CPUs. Maybe it would be useful to check if the
problem still happens when you assign just a single CPU to the DOMU.

Jaromir


Re: mutex_oncpu() called on destroyed mutex?

2018-08-07 Thread Edgar Fuß
> Disabling preemption only affects the CPU that disabled it.
But preemption is *en*abled in the code segment I quoted!

> What's the stack trace of the panic?
mutex_vector_enter() at netbsd:mutex_vector_enter+0x32c
unp_thread() at netbsd:unp_thread+0x2eb


Re: mutex_oncpu() called on destroyed mutex? (was: repeated panics in mutex_vector_enter (from unp_thread))

2018-08-07 Thread Jason Thorpe



> On Aug 7, 2018, at 9:44 AM, Edgar Fuß  wrote:
> 
> I observe this on 6.1, but I can't see the relevant code changed in current.
> 
> mutex_vector_enter() does (-current uses KPREEMPT_* macros)
> 
>   do {
>           kpreempt_enable();
>           SPINLOCK_BACKOFF(count);
>           kpreempt_disable();
>           owner = mtx->mtx_owner;
>   } while (mutex_oncpu(owner));
> 
> and my problem seems to be owner == MUTEX_THREAD (i.e. the mutex has been
> destroyed) at the time mutex_oncpu(owner) is called.
> 
> My understanding of locking is limited (close to zero) but why shouldn't 
> the mutex in question be destroyed during the preemption-enabled period?
> 
> I must be missing something.

It could be destroyed by another thread on a different CPU.  Disabling 
preemption only affects the CPU that disabled it.

Sounds like this is just a classic use-after-free problem.  What's the stack 
trace of the panic?  Is the mutex embedded in some ephemeral data structure?
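
One common way to avoid this kind of use-after-free is to reference-count
the structure that embeds the mutex, so that the mutex is destroyed only
when the last user drops its reference. Below is a userland sketch using
pthreads and C11 atomics. The names struct obj, obj_create, obj_hold and
obj_rele are hypothetical, and nothing here is taken from the NetBSD
sources.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical object with an embedded lock and a reference count. */
struct obj {
    pthread_mutex_t lock;
    atomic_uint     refcnt;
    int             value;
};

/* Allocate an object holding one reference for the caller. */
struct obj *obj_create(void)
{
    struct obj *o = malloc(sizeof(*o));
    if (o == NULL)
        return NULL;
    pthread_mutex_init(&o->lock, NULL);
    atomic_init(&o->refcnt, 1);
    o->value = 0;
    return o;
}

/* Take an extra reference before handing the object to another thread. */
void obj_hold(struct obj *o)
{
    atomic_fetch_add(&o->refcnt, 1);
}

/*
 * Drop a reference.  Only the caller that drops the last one destroys
 * the mutex and frees the memory; returns true in that case.
 */
bool obj_rele(struct obj *o)
{
    if (atomic_fetch_sub(&o->refcnt, 1) != 1)
        return false;          /* other users remain */
    pthread_mutex_destroy(&o->lock);
    free(o);
    return true;
}
```

A thread that may still spin on or sleep for o->lock keeps its own reference
for as long as it can touch the lock, so no other thread can destroy the
mutex underneath it.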

-- thorpej



Re: ddb input via IPMI virtual console

2018-08-07 Thread Brian Buhrow
Hello.  Sorry, my description wasn't clear.  
Since you have an IPMI-capable server, you should be able to turn serial
port redirection on in the BIOS such that com1 (from NetBSD's point of
view) becomes a virtual port which is accessible using the ipmitool
program.  You would do something like:

ipmitool -H 10.10.1.3 -U ADMIN -I lanplus sol activate

After you enter the password, you should be connected to the virtual
serial port where you can see output or type input.  Since this is a serial
port as far as NetBSD is concerned, DDB should work.
This is a separate session from your virtual console, so you can run
it in a separate window.

Change the username and IP address shown above to match
your setup.


To get NetBSD to use that serial port as a console, you'd do something
like:

cd /usr/mdec
installboot -v -o speed=115200 -o console=com1 /dev/boot bootxx_ffsv


-Brian

On Aug 7, 11:14am, Edgar Fuß wrote:
} Subject: Re: ddb input via IPMI virtual console
} > how about using a serial console in the kernel and then using ipmitool 
} > to talk to DDB when/if the machine goes down?
} I don't have a serial wire through the firewall.
} 
} > but kernel messages won't go there.
} It would be an awful drawback not to see the kernel messages on a physical 
} console.
>-- End of excerpt from Edgar Fuß




mutex_oncpu() called on destroyed mutex? (was: repeated panics in mutex_vector_enter (from unp_thread))

2018-08-07 Thread Edgar Fuß
I observe this on 6.1, but I can't see the relevant code changed in current.

mutex_vector_enter() does (-current uses KPREEMPT_* macros)

        do {
                kpreempt_enable();
                SPINLOCK_BACKOFF(count);
                kpreempt_disable();
                owner = mtx->mtx_owner;
        } while (mutex_oncpu(owner));

and my problem seems to be owner == MUTEX_THREAD (i.e. the mutex has been
destroyed) at the time mutex_oncpu(owner) is called.

My understanding of locking is limited (close to zero) but why shouldn't 
the mutex in question be destroyed during the preemption-enabled period?

I must be missing something.


Re: panic: biodone2 already

2018-08-07 Thread Emmanuel Dreyfus
Martin Husemann  wrote:

> Please file a PR.

 kern/53506

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: panic: biodone2 already

2018-08-07 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> /sbin/dump -a0f /dev/null /
> sh /etc/daily

The second command can be replaced by a simple
grep -r something /etc

But so far I have not managed to trigger the crash without running dump.
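
To script the reproduction, something like the following C sketch could
start both commands at the same time and wait for them. It uses only
standard POSIX calls (fork, execvp, waitpid). The helper names spawn and
run_in_parallel are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start one command in a child process; returns its pid, or -1 on error. */
static pid_t spawn(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);            /* only reached if exec failed */
    }
    return pid;
}

/*
 * Run two commands concurrently and wait for both.
 * Returns how many of them did not exit with status 0.
 */
int run_in_parallel(char *const a[], char *const b[])
{
    pid_t pids[2] = { spawn(a), spawn(b) };
    int failures = 0;

    for (int i = 0; i < 2; i++) {
        int status = 0;
        if (pids[i] < 0 || waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}
```

For this test the two argument vectors would be
{ "/sbin/dump", "-a0f", "/dev/null", "/", NULL } and
{ "grep", "-r", "something", "/etc", NULL }. On an affected domU the
function would presumably never return, since the kernel panics first.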

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: panic: biodone2 already

2018-08-07 Thread Martin Husemann
On Tue, Aug 07, 2018 at 05:30:27PM +0200, Emmanuel Dreyfus wrote:
> I can reproduce the crash at will: running the two following commands
> at the same time reliably triggers "panic biodone2 already"
> 
> /sbin/dump -a0f /dev/null /
> sh /etc/daily

Please file a PR.

Martin


Re: panic: biodone2 already

2018-08-07 Thread Emmanuel Dreyfus
Jaromír Doleček  wrote:

> This is always a bug, driver processes same buf twice. It can do harm.
> If the buf is reused for some other I/O, system can fail to store
> data, or claim to read data when it didn't.

I can reproduce the crash at will: running the two following commands
at the same time reliably triggers "panic biodone2 already"

/sbin/dump -a0f /dev/null /
sh /etc/daily

This crashes NetBSD-8.0 domUs, both i386 and amd64. Here, the only
machines that the bug has spared so far are the ones where the dump
has finished by the time /etc/daily starts. All the others keep
rebooting at least once a day.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: ddb input via IPMI virtual console

2018-08-07 Thread Mouse
>> how about using a serial console in the kernel and then using
>> ipmitool to talk to DDB when/if the machine goes down?
> I don't have a serial wire through the firewall.

You have a multiconductor cable; while it's intended for PS/2, if it
has at least three conductors (which I believe PS/2 does), it can be
used perfectly well as a serial line.  You'll need adaptors on the ends
of the cable, but they need be only passive adaptors.  RS-232 is
ridiculously tolerant of layer-1 issues.

Unless the insulation is rated for only 6V or something, which strikes
me as unlikely enough that I wouldn't even bother checking if it were
me in that situation.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: repeated panics in mutex_vector_enter (from unp_thread)

2018-08-07 Thread Edgar Fuß
Could someone please aid me in how to debug this?
The server repeatedly panics the same way after only hours or even minutes 
of uptime!


Re: NetBSD 8.0 dead socket

2018-08-07 Thread BERTRAND Joël
Emmanuel Dreyfus wrote:
> Hello

Hello,

> Another bug I observed with NetBSD 8.0. So far it has happened only to 
> named, but perhaps it is more generic.

Very strange. I use a NetBSD server as main router/NFS server/name
server with bind9 and NetBSD 8.0 (built from source) and I haven't seen
this kind of trouble.

> named is bound to multiple addresses. After some time, it ceases to
> answer on one of them, while the others keep working. tcpdump shows
> the packets going to port 53 without reply. ktrace shows no data
> coming from the kernel. And netstat shows named is still bound to the
> offending address.
> 
> Has anyone else experienced it? Any idea where to look?

I have seen a strange bug when a socket is opened in IPv4 _and_ IPv6 in
8-RC1 or 8-BETA; I don't remember which.

Best regards,

JKB




Re: ddb input via IPMI virtual console

2018-08-07 Thread Martin Husemann
On Tue, Aug 07, 2018 at 11:19:28AM +0200, Edgar Fuß wrote:
> > Put it in the machine room, use the existing PS/2 keyboards, and... isn't
> > the problem solved?
> ... as I've been told, DDB isn't able to talk to USB keyboards (or did I 
> get that wrong?). So I would end with neither IPMI nor real console working.

ddb should be able to talk to the console keyboard (via polling), but not
additional keyboards.

Martin


Re: 8.0 performance issue when running build.sh?

2018-08-07 Thread J. Hannken-Illjes


> On 6. Aug 2018, at 23:18, Mindaugas Rasiukevicius  wrote:
> 
> Martin Husemann  wrote:
>> So here is a more detailed analysis using flamegraphs:
>> 
>>  https://netbsd.org/~martin/bc-perf/
>> 
>> 
>> All operations happen on tmpfs, and the locking there is a significant
>> part of the game - however, lots of mostly idle time is wasted and it is
>> not clear to me what is going on.
> 
> Just from a very quick look, it seems like a regression introduced with
> the vcache changes: the MP-safe flag is set too late and not inherited
> by the root vnode.
> 
> http://www.netbsd.org/~rmind/tmpfs_mount_fix.diff

Very good catch, @martin could you try this diff on an autobuilder?

Looks like it speeds up tmpfs lookups by a factor of ~40 on -8.0.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: ddb input via IPMI virtual console

2018-08-07 Thread Edgar Fuß
> Since the problem is that the real keyboards are PS/2, the adapters sound
> perfect.
Ah, yes, that sounds like a perfect solution, but ...

> Put it in the machine room, use the existing PS/2 keyboards, and... isn't
> the problem solved?
... as I've been told, DDB isn't able to talk to USB keyboards (or did I 
get that wrong?). So I would end with neither IPMI nor real console working.


Re: ddb input via IPMI virtual console

2018-08-07 Thread Martin Husemann
On Tue, Aug 07, 2018 at 11:14:12AM +0200, Edgar Fuß wrote:
> > how about using a serial console in the kernel and then using ipmitool 
> > to talk to DDB when/if the machine goes down?
> I don't have a serial wire through the firewall.

You configure the kernel for serial console output and use IPMI to
talk to it.

Martin


Re: ddb input via IPMI virtual console

2018-08-07 Thread Edgar Fuß
> how about using a serial console in the kernel and then using ipmitool 
> to talk to DDB when/if the machine goes down?
I don't have a serial wire through the firewall.

> but kernel messages won't go there.
It would be an awful drawback not to see the kernel messages on a physical 
console.


RE: ddb input via IPMI virtual console

2018-08-07 Thread Terry Moore
>> The real keyboards are PS/2 and I can't change that because it runs
>> on a wire physically passing a /real/ firewall, [...]
>
>(a) Is it possible to run USB over the same conductors used by the PS/2
>cable?  (This is a real question; I don't know enough about layer 1 of
>either to answer it.)

Great question. The answer is "yes, if you're desperate". Keyboards are
low-speed or full-speed USB devices, and the signaling is relatively
non-critical -- especially if you find a low-speed keyboard.  However:

>(b) There exist devices that adapt PS/2 to USB in the
>PS/2-keyboard-to-USB-host direction.  

Since the problem is that the real keyboards are PS/2, the adapters sound
perfect.

https://www.newegg.com/Mouse-Keyboard-PS2-Adapters/SubCategory/ID-3024
US$6 is pretty reasonable.

Put it in the machine room, use the existing PS/2 keyboards, and... isn't
the problem solved? If you need USB as well from another source, use a hub
(in the machine room). But I confess I'm not clear about that part of what
you're trying to do.

--Terry