from:"Manuel Bouyer"

Re: dwiic errors

2024-03-14 Thread Manuel Bouyer

On Thu, Mar 14, 2024 at 08:42:47AM -0700, Paul Goyette wrote:
> On Thu, 14 Mar 2024, Michael van Elst wrote:
> 
> > p...@whooppee.com (Paul Goyette) writes:
> > 
> > > as soon as you proceed past this point (including normal non-single-
> > > user boot), the dwiic starts spewing time-out messages.  These
> > > messages come every 0.5 second or so, and there's usually a hundred
> > > or more messages before they stop;  in some cases the messages have
> > > continued to stream by for several minutes (at which point I pressed
> > > the reset button).  The value for %d is always 0 or 1.
> > 
> > Probably result of
> > 
> > GENERIC:ihidev* at iic?
> > 
> > that is probing for a modern laptop touchpad.
> > 
> > Can you disable ihidev instead of dwiic and see what happens then ?
> 
> No change.  It attaches dwiic0 and then starts with the messages.

It could also be some sensors I guess. Any chance to see what attaches
at dwiic0 ? Maybe entering ddb before the console gets spammed ?

FWIW I have a laptop with the touchpad as ihidev@dwiic and it works
fine with RC6

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ATF tests panic assertion "uvmexp.swpgonly > 0" failed

2023-12-11 Thread Manuel Bouyer

On Mon, Dec 11, 2023 at 11:27:17AM -0800, Chuck Silvers wrote:
> On Fri, Dec 08, 2023 at 06:13:42PM +0100, Manuel Bouyer wrote:
> > Hello again
> > I see a second rare panic running ATF tests on Xen:
> > lib/libc/regex/t_exhaust (236/949): 1 test cases
> > regcomp_too_big: [ 1254.5816543] panic: kernel diagnostic assertion 
> > "uvmexp.swpgonly > 0" failed: file "/usr/src/sys/uvm/uvm_anon.c", line 175 
> > [ 1254.6116351] cpu1: Begin traceback...
> > [ 1254.6216378] 
> > vpanic(c12d3bf8,d855adcc,d855ade8,c0d03f72,c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0)
> >  at netbsd:vpanic+0x184
> > [ 1254.6516393] 
> > kern_assert(c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0,c2d9f8d0,0,d855ae0c,c0d041c4)
> >  at netbsd:kern_assert+0x23
> > [ 1254.6716402] 
> > uvm_anfree(c2d9f8d0,c2342000,3,c543b0c0,0,c3b00ac0,1,d855ae58,c0d20d1d,c2d9f8d0)
> >  at netbsd:uvm_anfree+0x2b8
> > [ 1254.7016358] 
> > uvm_anon_release(c2d9f8d0,1,d72f2000,c543b0c0,d72f2000,d72f1000,1,0,0,d855ae84)
> >  at netbsd:uvm_anon_release+0x85
> > [ 1254.7216389] 
> > uvm_aio_aiodone_pages(d855ae84,1,1,0,c243c400,1ebf140,0,0,d855ae84,c1e6cc24)
> >  at netbsd:uvm_aio_aiodone_pages+0x2fc
> > [ 1254.7716201] 
> > uvm_aio_aiodone(c5560180,8e016,3,0,72bec,c26ff204,c26ff204,c5560180,d855af20,c0e54a31)
> >  at netbsd:uvm_aio_aiodone+0x97
> > [ 1254.7916387] 
> > biodone2(c5560180,1000,0,c25e4cec,c0db8d85,c5561024,c0e54996,d82f8000,d855af48,c0e0a32d)
> >  at netbsd:biodone2+0x95
> > [ 1254.8216356] 
> > dkiodone(c5561024,10,10,d85502ac,c010293f,d855af70,c5561024,3,d855af70,c0e0a49e)
> >  at netbsd:dkiodone+0x9b
> > [ 1254.8416368] 
> > biodone2(3,0,c010293f,8a260008,2,d8550004,c243c980,d85502ac,d855afe0,c0d7f5c5)
> >  at netbsd:biodone2+0x95
> > [ 1254.8815946] biointr(0,0,0,0,0,0,0,0,0,0) at netbsd:biointr+0x4c
> > [ 1254.8916211] 
> > softint_dispatch(c243c400,3,c2c2c2c2,c2c2c2c2,c2c2c2c2,c2c2c2c2,d855dff0,d855df14,c268b000,80050033)
> >  at netbsd:softint_dispatch+0xe0
> > [ 1254.9216366] Bad frame pointer: 0xd82fcf20
> > [ 1254.9316199] cpu1: End traceback...
> > 
> > The first time seems to be
> > https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202310061820Z_anita.txt
> > 
> > and the second time was
> > https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202311250810Z_anita.txt
> > 
> > Has anyone else seen this ?
> 
> 
> yes, various people have been seeing this assertion (or some other related 
> ones)
> occasionally for years now.  I've looked into it a few times but I have been
> unable to spot the problem.

thanks
is there a PR open about this ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

ATF tests panic assertion "uvmexp.swpgonly > 0" failed

2023-12-08 Thread Manuel Bouyer

Hello again
I see a second rare panic running ATF tests on Xen:
lib/libc/regex/t_exhaust (236/949): 1 test cases
regcomp_too_big: [ 1254.5816543] panic: kernel diagnostic assertion "uvmexp.swpgonly > 0" failed: file "/usr/src/sys/uvm/uvm_anon.c", line 175
[ 1254.6116351] cpu1: Begin traceback...
[ 1254.6216378] vpanic(c12d3bf8,d855adcc,d855ade8,c0d03f72,c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0) at netbsd:vpanic+0x184
[ 1254.6516393] kern_assert(c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0,c2d9f8d0,0,d855ae0c,c0d041c4) at netbsd:kern_assert+0x23
[ 1254.6716402] uvm_anfree(c2d9f8d0,c2342000,3,c543b0c0,0,c3b00ac0,1,d855ae58,c0d20d1d,c2d9f8d0) at netbsd:uvm_anfree+0x2b8
[ 1254.7016358] uvm_anon_release(c2d9f8d0,1,d72f2000,c543b0c0,d72f2000,d72f1000,1,0,0,d855ae84) at netbsd:uvm_anon_release+0x85
[ 1254.7216389] uvm_aio_aiodone_pages(d855ae84,1,1,0,c243c400,1ebf140,0,0,d855ae84,c1e6cc24) at netbsd:uvm_aio_aiodone_pages+0x2fc
[ 1254.7716201] uvm_aio_aiodone(c5560180,8e016,3,0,72bec,c26ff204,c26ff204,c5560180,d855af20,c0e54a31) at netbsd:uvm_aio_aiodone+0x97
[ 1254.7916387] biodone2(c5560180,1000,0,c25e4cec,c0db8d85,c5561024,c0e54996,d82f8000,d855af48,c0e0a32d) at netbsd:biodone2+0x95
[ 1254.8216356] dkiodone(c5561024,10,10,d85502ac,c010293f,d855af70,c5561024,3,d855af70,c0e0a49e) at netbsd:dkiodone+0x9b
[ 1254.8416368] biodone2(3,0,c010293f,8a260008,2,d8550004,c243c980,d85502ac,d855afe0,c0d7f5c5) at netbsd:biodone2+0x95
[ 1254.8815946] biointr(0,0,0,0,0,0,0,0,0,0) at netbsd:biointr+0x4c
[ 1254.8916211] softint_dispatch(c243c400,3,c2c2c2c2,c2c2c2c2,c2c2c2c2,c2c2c2c2,d855dff0,d855df14,c268b000,80050033) at netbsd:softint_dispatch+0xe0
[ 1254.9216366] Bad frame pointer: 0xd82fcf20
[ 1254.9316199] cpu1: End traceback...
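
To compare tracebacks from different runs, it can help to reduce each one to its list of frames. The following is a minimal sketch, assuming the `netbsd:function+0xoffset` notation used in the traceback above; the helper name `frames` is hypothetical and not part of any NetBSD tool.

```python
import re

# Matches the "netbsd:<function>+0x<offset>" part of a traceback line.
FRAME_RE = re.compile(r"netbsd:(\w+)\+0x([0-9a-fA-F]+)")

def frames(traceback_text):
    """Return (function, offset) pairs in the order they appear in the text."""
    return [(name, int(off, 16)) for name, off in FRAME_RE.findall(traceback_text)]

sample = (
    "[ 1254.6716402] uvm_anfree(c2d9f8d0,...) at netbsd:uvm_anfree+0x2b8\n"
    "[ 1254.7016358] uvm_anon_release(c2d9f8d0,...) at netbsd:uvm_anon_release+0x85\n"
)
for name, offset in frames(sample):
    print(f"{name}+{offset:#x}")
```

Two panics with the same frame list are likely the same failure, even when the argument values differ.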

The first time seems to be
https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202310061820Z_anita.txt

and the second time was
https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202311250810Z_anita.txt

Has anyone else seen this ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

ATF tests panic

2023-12-08 Thread Manuel Bouyer

Hello
in my daily ATF runs on Xen VMs, I see occasional panics like:
kernel/kqueue/t_proc4 (81/956): 1 test cases
proc4: [ 727.4761311] uvm_fault(0xeb8004cff900, 0x0, 2) -> e
[ 727.4899584] fatal page fault in supervisor mode
[ 727.4981872] trap type 6 code 0x2 rip 0x80dcf31d cs 0x8 rflags 0x10246 cr2 0xb0 ilevel 0 rsp 0xc280510e4e08
[ 727.5261058] curlwp 0xeb80055a6000 pid 20525.20525 lowest kstack 0xc280510e02c0
[ 727.5410680] panic: trap
[ 727.5410680] cpu0: Begin traceback...
[ 727.5611378] vpanic() at netbsd:vpanic+0x173
[ 727.5679804] panic() at netbsd:panic+0x3c
[ 727.5787912] trap() at netbsd:trap+0xb0a
[ 727.5896397] --- trap (number 6) ---
[ 727.6161322] _mutex_init() at netbsd:_mutex_init+0x33
[ 727.6274274] knote_proc_fork() at netbsd:knote_proc_fork+0xa2
[ 727.6366632] fork1() at netbsd:fork1+0x6ba
[ 727.6468860] sys_fork() at netbsd:sys_fork+0x29
[ 727.6563376] syscall() at netbsd:syscall+0x17a
[ 727.6661484] --- syscall (number 2) ---
[ 727.6775934] netbsd:syscall+0x17a:
[ 727.6775934] cpu0: End traceback...

[ 727.6928214] dumping to dev 168,1 (offset=8, size=128926):
[ 727.6928214] dump device bad

The first occurrence seems to be this:
https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64-pv/202309261750Z_anita.txt

I see it for PV, PVH and HVM runs, but it's quite rare.

Am I the only one seeing this ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-14 Thread Manuel Bouyer

On Mon, Nov 13, 2023 at 08:34:04PM +0100, Manuel Bouyer wrote:
> Hello
> I'm facing an issue with postfix+openssl3 which may be critical (depending
> on how it can be fixed).
> 
> Now my postfix setup fails to send mails with
> Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: 
> error:0A00018E:SSL routines::ca md too 
> weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
> 
> >From what I understood, this is the remote certificate which is not accepted:
> openssl 3 deprecated some signature algorithm, which are no longer accepted
> with @SECLEVEL=1 (which is the default).

I had misunderstood. The message is not about the server certificate but about
the client certificate (which, indeed, is quite old and uses a private CA).
Even though no client certificate is requested by this server, it seems
that postfix loads it and errors out if it's too weak. This is quite
confusing ...

The good news is, as it's a private CA I can rebuild it :)
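
To find which certificate in a chain would trip this check, one can scan a chain dump for the signature algorithm of each entry. The following is a minimal sketch, assuming the `sigalg:` field format of the `openssl s_client` chain output quoted elsewhere in this thread. The helper name and the list of weak digests are illustrative assumptions, not OpenSSL's actual policy table.

```python
import re

# Digest names treated as weak in this sketch (an assumption for illustration).
WEAK_DIGESTS = ("MD5", "SHA1")

def weak_sigalgs(chain_text):
    """Return every 'sigalg:' value in chain_text that names a weak digest."""
    algs = re.findall(r"sigalg:\s*([\w-]+)", chain_text)
    return [a for a in algs if any(d in a.upper() for d in WEAK_DIGESTS)]

sample = (
    "   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA384\n"
    "   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1\n"
)
print(weak_sigalgs(sample))
```

The same scan applies to a client certificate chain, which is where the problem turned out to be here.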

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-14 Thread Manuel Bouyer

On Mon, Nov 13, 2023 at 07:16:14PM -0800, Brian Buhrow wrote:
>   Hello Taylor.  Just as a point of reference, smtp clients that connect 
> to domains hosted by
> Microsoft, i.e. outlook.com and any other domains that use their 
> infrastructure for e-mail, will
> have to present a valid SSL certificate in order to submit mail to their smtp 
> servers.  But
> that is a different issue than Manuel is describing, as I understand it.  I 
> think he is saying
> that the server is presenting an SSL certificate that his client doesn't like 
> when he tries to
> send mail to an external smtp server.  In that case, I agree with you, his 
> client shouldn't be
> overly concerned about whether the server presented SSL certificate can be 
> verified all the way
> down the verification chain.  I guess it's fine if it does the verification 
> and puts a note in
> the headers, but it shouldn't stop mail from going out.

Actually, the client is using SMTP AUTH, so making sure it's sending the
auth credentials to the right SMTP server is critical.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-14 Thread Manuel Bouyer

On Tue, Nov 14, 2023 at 02:39:53AM +, Taylor R Campbell wrote:
> [trimming tech-crypto from cc because this is a policy and
> configuration issue, not a cryptography issue]
> 
> > Date: Mon, 13 Nov 2023 20:34:04 +0100
> > From: Manuel Bouyer 
> > 
> > I'm facing an issue with postfix+openssl3 which may be critical (depending
> > on how it can be fixed).
> > 
> > Now my postfix setup fails to send mails with
> > Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: 
> > error:0A00018E:SSL routines::ca md too 
> > weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
> 
> 1. This says `warning'; does the mail actually fail to go through, or
>are you just alarmed by the warning?

it fails:
Nov 13 20:21:48 comore postfix/smtp[4182]: warning: TLS library problem: error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
Nov 13 20:21:48 comore postfix/smtp[4182]: D2EF31805C: to=, relay=mail.soc.lip6.fr[132.227.86.2]:465, delay=1441, delays=1441/0.05/0.02/0, dsn=4.7.5, status=deferred (Cannot start TLS: handshake failure)
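
The delivery status can be pulled out of such a log line mechanically. The following is a minimal sketch, assuming the `dsn=..., status=... (...)` layout of the postfix/smtp line above; the function name `delivery_status` is hypothetical.

```python
import re

# Captures the DSN code, the status keyword and the parenthesised reason.
STATUS_RE = re.compile(r"dsn=(?P<dsn>[\d.]+), status=(?P<status>\w+) \((?P<reason>.*)\)")

def delivery_status(line):
    """Return a dict with dsn, status and reason, or None if the line has none."""
    m = STATUS_RE.search(line)
    return m.groupdict() if m else None

line = ("Nov 13 20:21:48 comore postfix/smtp[4182]: D2EF31805C: to=, "
        "relay=mail.soc.lip6.fr[132.227.86.2]:465, delay=1441, "
        "delays=1441/0.05/0.02/0, dsn=4.7.5, status=deferred "
        "(Cannot start TLS: handshake failure)")
info = delivery_status(line)
# A DSN class of 4 is a transient failure (RFC 3463): the message stays queued.
print(info["status"], info["dsn"].startswith("4."), info["reason"])
```

So the mail is not bounced, but it stays in the queue and is retried for as long as the handshake keeps failing.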


> 
> 2. Can you describe your mail topology?

This is a simple mail client (my laptop); outgoing emails go through one of
2 mail servers (depending on the From address, via a relay map). Both mail
servers require SMTP AUTH (which is why I enforce
smtp_tls_security_level = verify), configured as:
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/home/bouyer/.postfix/sasl_passwd
smtp_sasl_security_options = noanonymous

> 
> 3. Can you describe the postfix configuration on every node involved
>in the topology?

the mail servers this client talks to are both running sendmail,
on netbsd-9

> 4. Can you share master.cf on every node involved if it's not the
>default?

on the client master.cf is the default, with this additional entry:
relay-smtps unix -  -   n   -   -   smtp
    # Client-side SMTPS requires "encrypt" or stronger.
    -o smtp_tls_security_level=verify
    -o smtp_tls_wrappermode=yes
    -o smtp_starttls_timeout=60
    -o smtp_helo_timeout=60
> 
> 5. If you connect to the server with `openssl s_client', what happens?

It works:
openssl s_client -connect mail.soc.lip6.fr:465 -verify_return_error
[...]
Start Time: 1699948718
Timeout   : 7200 (sec)
Verify return code: 0 (ok)
Extended master secret: no
Max Early Data: 0
---
read R BLOCK
220 asim.lip6.fr ESMTP Sendmail 8.15.2/8.15.2; Tue, 14 Nov 2023 08:58:37 +0100 (MET)

Also, tnftp talking to a web server with the exact same certificate and
certificate chain has no problem either.

This is one of the things I have a hard time understanding: why can't I
reproduce this error with other TLS clients ?

> 
> > So, as far as I understand, we end up with a postfix installation which
> > can't talk to servers with valid certificates.
> 
> Unless anything has changed in the past couple years, I don't think
> there is any widespread deployment of SMTP TLS server authentication
> that means anything for general MTAs -- at best, TLS in SMTP serves as
> opportunistic encryption to defend against passive eavesdroppers.

There is actually, for SMTP AUTH
And I don't think using an MTA for SMTP AUTH is that unusual

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-13 Thread Manuel Bouyer

On Tue, Nov 14, 2023 at 11:10:16AM +1300, Lloyd Parkes wrote:
> 
> 
> On 14/11/23 10:56, Joerg Sonnenberger wrote:
> > 
> > NIST has been sunsetting SHA1 for a long time, 2016 in fact. In many cases, 
> > there is a better trust chain
> > for Comodo intermediary certificates and admins should be installing those.
> 
> I'm not sure that's what Comodo has, even though it is the normal way of
> doing things.
> 
> I found a Comodo web page that said SHA1 will be fine, so don't worry, and
> if you are worried, you can buy a different certificate. That same web
> page's link to their intermediate certificates is a dead link. Comodo does
> not fill me with confidence.

Unfortunately I don't have a choice for this one.

> 
> I'm going to guess that the default @SECLEVEL of openssl needs to be
> adjusted if there is no Postfix specific way to adjust it. Apparently you
> can set the environment variable OPENSSL_CONF to run with a custom openssl
> configuration which can avoid reducing the security level of the rest of
> your system. Searching for "openssl @SECLEVEL" gave me the usual levels of
> StackExchange clarity, so ymmv.

I tried this, but nothing that I've tried in /etc/openssl/openssl.cnf
seems to have any effect. I wonder if postfix is doing some specific
openssl setup that overrides the openssl.cnf settings.

But also note that I could not reproduce the problem with openssl s_client

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-13 Thread Manuel Bouyer

On Mon, Nov 13, 2023 at 10:56:00PM +0100, Joerg Sonnenberger wrote:
> On Monday, November 13, 2023 8:34:04 PM CET Manuel Bouyer wrote:
> > Hello
> > I'm facing an issue with postfix+openssl3 which may be critical (depending
> > on how it can be fixed).
> > 
> > Now my postfix setup fails to send mails with
> > Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: 
> > error:0A00018E:SSL routines::ca md too 
> > weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
> > 
> > From what I understood, this is the remote certificate which is not 
> > accepted:
> > openssl 3 deprecated some signature algorithm, which are no longer accepted
> > with @SECLEVEL=1 (which is the default).
> > In server's certificate chain all but the last one are signed with
> > sha384WithRSAEncryption (which should be OK). The last one (the root
> > certificate) is signed with RSA-SHA1 and I don't think this will change
> > soon:
> >  3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, 
> > CN = A
> >  AA Certificate Services
> >i:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, 
> > CN = A
> >  AA Certificate Services
> >a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
> >v:NotBefore: Jan  1 00:00:00 2004 GMT; NotAfter: Dec 31 23:59:59 2028 GMT
> > 
> > So, as far as I understand, we end up with a postfix installation which
> > can't talk to servers with valid certificates.
> 
> NIST has been sunsetting SHA1 for a long time, 2016 in fact. In many cases, 
> there is a better trust chain
> for Comodo intermediary certificates and admins should be installing those.

My chain is from October, not that old.
Maybe our CA is not completely up to date; I will have to check that.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-13 Thread Manuel Bouyer

On Mon, Nov 13, 2023 at 10:58:38PM +0100, Steffen Nurpmeso wrote:
> Manuel Bouyer wrote in
>  :
>  |On Mon, Nov 13, 2023 at 10:24:56PM +0100, Steffen Nurpmeso wrote:
>  |> Manuel Bouyer wrote in
>  |>  :
>  |>|Hello
>  |>|I'm facing an issue with postfix+openssl3 which may be critical (dependi\
>  |>|ng
>  |>|on how it can be fixed).
>  |>|
>  |>|Now my postfix setup fails to send mails with
>  ...
>  |>|>From what I understood, this is the remote certificate which is not \
>  |>|>accepted:
>  |>|openssl 3 deprecated some signature algorithm, which are no longer \
>  |>|accepted
>  ...
>  |> Isn't that just postfix config.
>  |
>  |It's possible; but I didn't find anything relevant in the postfix docs
>  |
>  |> Btw *i* have no problem with
>  |> 
>  |>   smtpd_tls_ask_ccert = no
>  |>   smtpd_tls_auth_only = yes
>  |>   smtpd_tls_loglevel = 1
>  |>   #SMART The next is usually nice but when using client certificates
>  |>   smtpd_tls_received_header = no
>  |>   smtpd_tls_fingerprint_digest = sha256
>  |>   smtpd_tls_mandatory_protocols = >=TLSv1.2
>  |>   smtpd_tls_protocols = $smtpd_tls_mandatory_protocols
>  |>   # super modern, forward secrecy TLSv1.2 / TLSv1.3 selection..
>  |>   tls_high_cipherlist = EECDH+AESGCM:EECDH+AES256:EDH+AESGCM:CHACHA20
>  |>   smtpd_tls_mandatory_ciphers = high
>  |>   smtpd_tls_mandatory_exclude_ciphers = TLSv1
>  |> 
>  |> ^ This works in practice without any noticeable trouble.
>  |> (But then i again i do not have to make money from that or my
>  |> customers who must talk to ten year old refrigerators.)
>  |
>  |this is only server-side configuration; my problem is with client-side
>  |rejecting the server's certificate
> 
> Well i have
> 
>   #SMART comment out next
>   smtp_tls_security_level = may

I have
smtp_tls_security_level = verify

and this is what I need because a username/passwd is sent as part of
the smtp transaction

>   # To always go directly SMTPS/SUBMISSIONS
>   #smtp_tls_wrappermode = yes
>   smtp_tls_fingerprint_digest = $smtpd_tls_fingerprint_digest
>   smtp_tls_mandatory_protocols = $smtpd_tls_mandatory_protocols
>   smtp_tls_protocols = $smtpd_tls_protocols
>   #SMART When only relaying to smarthost, the next should be =high 
> _or_better_!
>   smtp_tls_mandatory_ciphers = $smtpd_tls_mandatory_ciphers
>   smtp_tls_mandatory_exclude_ciphers = $smtpd_tls_mandatory_exclude_ciphers
>   smtp_tls_ciphers = $smtpd_tls_ciphers
>   smtp_tls_exclude_ciphers = $smtpd_tls_exclude_ciphers
>   smtp_tls_connection_reuse = yes
> 
> But if you have a problem with only one permanent remote partner

In my config I have 2 possible relays (depending on the From address of the
email) and both show the same problem (yet with different certificates signed
by different CAs).

> you surely want a dedicated map for that one.

No, I need a strong encrypted connection

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: openssl3+postfix issue (ca md too weak)

2023-11-13 Thread Manuel Bouyer

On Mon, Nov 13, 2023 at 10:24:56PM +0100, Steffen Nurpmeso wrote:
> Manuel Bouyer wrote in
>  :
>  |Hello
>  |I'm facing an issue with postfix+openssl3 which may be critical (depending
>  |on how it can be fixed).
>  |
>  |Now my postfix setup fails to send mails with
>  |Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: \
>  |error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd\
>  |/openssl/dist/ssl/statem/statem_lib.c:984:
>  |
>  |>From what I understood, this is the remote certificate which is not \
>  |>accepted:
>  |openssl 3 deprecated some signature algorithm, which are no longer accepted
>  |with @SECLEVEL=1 (which is the default).
>  |In server's certificate chain all but the last one are signed with
>  |sha384WithRSAEncryption (which should be OK). The last one (the root
>  |certificate) is signed with RSA-SHA1 and I don't think this will change
>  |soon:
>  | 3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, \
>  | CN = A
>  | AA Certificate Services
>  |   i:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, \
>  |   CN = A
>  | AA Certificate Services
>  |   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
>  |   v:NotBefore: Jan  1 00:00:00 2004 GMT; NotAfter: Dec 31 23:59:59 \
>  |   2028 GMT
>  |
>  |So, as far as I understand, we end up with a postfix installation which
>  |can't talk to servers with valid certificates.
>  |
>  |The solution (from google) would be to force @SECLEVEL=0 but I didn't find
>  |a way to do this for postfix. The solutions I've seen were for openvpn or
>  |curl, but nothing about postfix :(
> 
> Isn't that just postfix config.

It's possible; but I didn't find anything relevant in the postfix docs

> Btw *i* have no problem with
> 
>   smtpd_tls_ask_ccert = no
>   smtpd_tls_auth_only = yes
>   smtpd_tls_loglevel = 1
>   #SMART The next is usually nice but when using client certificates
>   smtpd_tls_received_header = no
>   smtpd_tls_fingerprint_digest = sha256
>   smtpd_tls_mandatory_protocols = >=TLSv1.2
>   smtpd_tls_protocols = $smtpd_tls_mandatory_protocols
>   # super modern, forward secrecy TLSv1.2 / TLSv1.3 selection..
>   tls_high_cipherlist = EECDH+AESGCM:EECDH+AES256:EDH+AESGCM:CHACHA20
>   smtpd_tls_mandatory_ciphers = high
>   smtpd_tls_mandatory_exclude_ciphers = TLSv1
> 
> ^ This works in practice without any noticeable trouble.
> (But then i again i do not have to make money from that or my
> customers who must talk to ten year old refrigerators.)

this is only server-side configuration; my problem is with client-side
rejecting the server's certificate

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

openssl3+postfix issue (ca md too weak)

2023-11-13 Thread Manuel Bouyer

Hello
I'm facing an issue with postfix+openssl3 which may be critical (depending
on how it can be fixed).

Now my postfix setup fails to send mails with
Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:

From what I understood, this is the remote certificate which is not accepted:
openssl 3 deprecated some signature algorithms, which are no longer accepted
with @SECLEVEL=1 (which is the default).
In the server's certificate chain all but the last one are signed with
sha384WithRSAEncryption (which should be OK). The last one (the root
certificate) is signed with RSA-SHA1 and I don't think this will change
soon:
 3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, CN = AAA Certificate Services
   i:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, CN = AAA Certificate Services
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
   v:NotBefore: Jan  1 00:00:00 2004 GMT; NotAfter: Dec 31 23:59:59 2028 GMT

So, as far as I understand, we end up with a postfix installation which
can't talk to servers with valid certificates.

The solution (from google) would be to force @SECLEVEL=0 but I didn't find
a way to do this for postfix. The solutions I've seen were for openvpn or
curl, but nothing about postfix :(

Any idea ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ACPI changes in -current, -10 vs. kernels w/o "genfb"

2023-10-21 Thread Manuel Bouyer

On Sat, Oct 21, 2023 at 11:46:48AM +0200, Manuel Bouyer wrote:
> On Fri, Oct 20, 2023 at 06:47:54PM -0500, John D. Baker wrote:
> > On Thu, 19 Oct 2023, Manuel Bouyer wrote:
> > 
> > > On Thu, Oct 19, 2023 at 08:46:27AM -0500, John D. Baker wrote:
> > >
> > > > [...]
> > > > #  link  VERTHANDI/netbsd
> > > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map 
> > > > --cref -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd 
> > > > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> > > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: acpi_wakeup.o: 
> > > > in function `acpi_md_sleep_patch':
> > > > /x/netbsd-10/src/sys/arch/x86/acpi/acpi_wakeup.c:145: undefined 
> > > > reference to `acpi_md_vesa_modenum'
> > > > [...]
> > > 
> > > Hello,
> > > should be fixed on HEAD, will request a pullup to netbsd-10
> > 
> > Thanks!  HEAD built just fine, but with -10/i386, my custom kernels build
> > OK, but the stock XEN3PAE_DOM0 build fails with:
> > 
> > [...]
> > #  link  XEN3PAE_DOM0/netbsd
> > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map 
> > --cref -T netbsd.ldscript -Ttext 0xc010 -e start -X -o netbsd 
> > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: genfb_machdep.o: in 
> > function `x86_genfb_init':
> > /x/netbsd-10/src/sys/arch/x86/x86/genfb_machdep.c:141: undefined reference 
> > to `acpi_md_vesa_modenum'
> > [...]
> > 
> > I'm guessing the issue is that XEN3PAE_DOM0 has "genfb", but no ACPI
> > support, so is missing the symbol.
> 
> Actually it has ACPI (which is why genfb tries to use the symbol) but
> not acpi_wakeup
> 
> What's strange is that I did a full build on HEAD and didn't notice the issue.
> 
> Will look at it

pullup-10 #433

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ACPI changes in -current, -10 vs. kernels w/o "genfb"

2023-10-21 Thread Manuel Bouyer

On Fri, Oct 20, 2023 at 06:47:54PM -0500, John D. Baker wrote:
> On Thu, 19 Oct 2023, Manuel Bouyer wrote:
> 
> > On Thu, Oct 19, 2023 at 08:46:27AM -0500, John D. Baker wrote:
> >
> > > [...]
> > > #  link  VERTHANDI/netbsd
> > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map 
> > > --cref -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd 
> > > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: acpi_wakeup.o: in 
> > > function `acpi_md_sleep_patch':
> > > /x/netbsd-10/src/sys/arch/x86/acpi/acpi_wakeup.c:145: undefined reference 
> > > to `acpi_md_vesa_modenum'
> > > [...]
> > 
> > Hello,
> > should be fixed on HEAD, will request a pullup to netbsd-10
> 
> Thanks!  HEAD built just fine, but with -10/i386, my custom kernels build
> OK, but the stock XEN3PAE_DOM0 build fails with:
> 
> [...]
> #  link  XEN3PAE_DOM0/netbsd
> /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map --cref 
> -T netbsd.ldscript -Ttext 0xc010 -e start -X -o netbsd 
> ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: genfb_machdep.o: in 
> function `x86_genfb_init':
> /x/netbsd-10/src/sys/arch/x86/x86/genfb_machdep.c:141: undefined reference to 
> `acpi_md_vesa_modenum'
> [...]
> 
> I'm guessing the issue is that XEN3PAE_DOM0 has "genfb", but no ACPI
> support, so is missing the symbol.

Actually it has ACPI (which is why genfb tries to use the symbol) but
not acpi_wakeup

What's strange is that I did a full build on HEAD and didn't notice the issue.

Will look at it

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ACPI changes in -current, -10 vs. kernels w/o "genfb"

2023-10-19 Thread Manuel Bouyer

On Thu, Oct 19, 2023 at 08:46:27AM -0500, John D. Baker wrote:
> The following change in -current:
> 
>   https://mail-index.netbsd.org/source-changes/2023/10/16/msg148163.html
> 
> and its subsequent pull-up to netbsd-10:
> 
>   https://mail-index.netbsd.org/source-changes/2023/10/18/msg148226.html
> 
> breaks building kernels which exclude "genfb".  The failure is as follows:
> 
> [...]
> #  link  VERTHANDI/netbsd
> /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map --cref 
> -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd 
> ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: acpi_wakeup.o: in 
> function `acpi_md_sleep_patch':
> /x/netbsd-10/src/sys/arch/x86/acpi/acpi_wakeup.c:145: undefined reference to 
> `acpi_md_vesa_modenum'
> [...]
> 
> I have machines with ACPI for which "genfb" (or any DRMKMS framebuffer)
> is superfluous and therefore are omitted from the configuration.

Hello,
should be fixed on HEAD, will request a pullup to netbsd-10

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ftp TLS fails

2023-10-10 Thread Manuel Bouyer

On Tue, Oct 10, 2023 at 03:56:56PM +0200, Manuel Bouyer wrote:
> Hello
> with netbsd-10 from oct, 2 ftp fails to connect to https sites:
> tchatcha:/chroot/usr/pkgsrc-2023Q3/pkgsrc/sysutils/xenkernel418>ftp -o /tmp/o 
> https://ftp.netbsd.org/
> Trying [2001:470:a085:999::21]:443 ...
> ftp: Can't connect to `2001:470:a085:999::21:443': No route to host
> Trying 199.233.217.201:443 ...
> :error:0A86:SSL 
> routines:tls_post_process_server_certificate:certificate verify 
> failed:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_clnt.c:1889:
> ftp: Can't connect to `ftp.netbsd.org:https'
> 
> 
> I have a ca-certificates.crt in /etc/openssl/certs/, I tried to re-run
> certctl but it didn't help.
> I see the same issue with downloads.xen.org
> 
> It seems that not all roots are installed ?

With some help from Thomas I found the problem:
I had a /etc/openssl/openssl.cnf lying around and this caused trouble.
After an rm -r /etc/openssl/* and running postinstall again, I have the certs
in /etc/openssl (I guess I only did rm -rf /etc/openssl/certs* before) and
this fixed things. /etc/openssl/certs.conf has more things now. Before it had
only
netbsd-certctl 20230816

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

ftp TLS fails

2023-10-10 Thread Manuel Bouyer

Hello
with netbsd-10 from Oct 2, ftp fails to connect to https sites:
tchatcha:/chroot/usr/pkgsrc-2023Q3/pkgsrc/sysutils/xenkernel418>ftp -o /tmp/o https://ftp.netbsd.org/
Trying [2001:470:a085:999::21]:443 ...
ftp: Can't connect to `2001:470:a085:999::21:443': No route to host
Trying 199.233.217.201:443 ...
:error:0A86:SSL routines:tls_post_process_server_certificate:certificate verify failed:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_clnt.c:1889:
ftp: Can't connect to `ftp.netbsd.org:https'


I have a ca-certificates.crt in /etc/openssl/certs/, I tried to re-run
certctl but it didn't help.
I see the same issue with downloads.xen.org

It seems that not all roots are installed ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: heartbeat panic by heavy traffic

2023-09-15 Thread Manuel Bouyer

On Fri, Sep 15, 2023 at 02:00:31PM -, Michael van Elst wrote:
> bou...@antioche.eu.org (Manuel Bouyer) writes:
> 
> >But the clock softint shouldn't be locked out for 16s, ever.
> 
> Then the clock softint must have a higher priority than
> everything else including hard interrupts.
> 
> Obviously that's not how the system is designed, there
> are no limits on how long specific events may take and
> thus no guarantee for lower priority tasks to actually
> execute with a certain time. That would be some kind
> of real-time system.

But obviously such events are not expected to take a long time, or
they would have been deferred to lower priority, preemptible tasks.
Letting such events run for a long time wedges the system.

I still maintain that the bug here is the network soft interrupt running
for such a long time, without giving a chance to other tasks.

> 
> Such systems also rarely panic if they detect a violation
> of their rules.
> 
> In any case, locking out lower priority tasks by an
> overwhelmed network layer probably isn't the bug that
> we look for.

I disagree. And the heartbeat panic is here to help locate such bugs.
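
To illustrate what such a check catches, here is a minimal user-space model
in Python. It is not the kernel's heartbeat(9) code: the class name, the
per-CPU counter dictionaries and the explicit `now` argument are assumptions
made up for this sketch. Only the idea comes from the thread: every online CPU
must make progress within max_period seconds, and 0 disables the check.

```python
class HeartbeatMonitor:
    """Toy model: flag any CPU whose progress counter stalls too long."""

    def __init__(self, cpus, max_period):
        self.max_period = max_period                 # seconds; 0 disables the check
        self.count = {c: 0 for c in cpus}            # progress counter per CPU
        self.last_seen = {c: (0, 0) for c in cpus}   # (counter, time observed)

    def beat(self, cpu):
        # Called from the (simulated) low-priority clock handler on `cpu`.
        self.count[cpu] += 1

    def check(self, now):
        """Return the CPUs that made no progress for more than max_period."""
        stuck = []
        for cpu, n in self.count.items():
            seen_n, seen_t = self.last_seen[cpu]
            if n != seen_n:
                self.last_seen[cpu] = (n, now)       # progress: restart the timer
            elif self.max_period and now - seen_t > self.max_period:
                stuck.append(cpu)                    # the real system would panic here
        return stuck
```

A CPU whose clock handler is locked out by higher-priority work never calls
beat(), so it ends up in the list returned by check() once max_period has
elapsed.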

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: heartbeat panic by heavy traffic

2023-09-15 Thread Manuel Bouyer

On Fri, Sep 15, 2023 at 09:19:04AM -, Michael van Elst wrote:
> mar...@duskware.de (Martin Husemann) writes:
> 
> >On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
> >> I think it would be good to change the default behavior from
> >> panic to something others because GENERIC kernel enables HEARTBEAT.
> >> by default. One of idea is to print warning message at sufficient 
> >> intervals.
> 
> >I disagree. It is very important that we fix the underlying problem
> >instead. Without hearbeat, this behaviour is still visible (but 
> >undiagnosable).
> 
> The crash here comes from how the network stack operates. Running at
> a higher priority, it locks out the lower priority clock softint
> and heartbeat detects that and crashes the system intentionally.

But the clock softint shouldn't be locked out for 16s, ever.
It means that userland processes are stuck too, as well as kernel threads.

This is a real bug, the network stack should be fixed to relax at
periodic intervals.
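
One way to picture "relaxing at periodic intervals" is a work budget: the
handler processes a bounded number of items, then returns so that lower
priority work can run before it is rescheduled. The Python sketch below only
models that idea. The function names, the deque used as a packet queue and
the default budget of 64 are assumptions for illustration, not how the NetBSD
network stack is written.

```python
from collections import deque


def run_softint(queue, handle, budget=64):
    """Process at most `budget` items, then return so others can run.

    Returns True if work remains and the handler should be rescheduled.
    """
    done = 0
    while queue and done < budget:
        handle(queue.popleft())
        done += 1
    return bool(queue)


def drain(queue, handle, yield_cpu, budget=64):
    """Keep rescheduling the handler, yielding between passes."""
    while run_softint(queue, handle, budget):
        # Stand-in for letting lower-priority work (e.g. the clock) run.
        yield_cpu()
```

With a budget, the longest stretch during which other tasks are locked out is
bounded by the cost of `budget` items, however large the backlog is.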

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Manuel Bouyer

On Fri, Jul 07, 2023 at 05:10:33PM +, Taylor R Campbell wrote:
> > Date: Fri, 7 Jul 2023 17:56:42 +0200
> > From: Manuel Bouyer 
> > 
> > On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote:
> > > - The magic numbers for debug.crashme.spl_spinout are for evbarm.
> > >   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
> 
> Correction: IPL_SOFTCLOCK=2.
> 
> > > 1.cpuctl offline 0
> > >   sleep 20
> > >   cpuctl online 0
> > 
> > With this I get a panic on Xen:
> > [ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" 
> > failed: file 
> > "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
> > [...]
> > [  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" 
> > failed: file 
> > "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
> 
> This was a mistake that arose because I was testing on aarch64 where
> kpreempt_disabled() is always true.  Update and try again, please!
> 
> sys/kern/kern_heartbeat.c 1.2
> sys/kern/subr_xcall.c 1.36

Yes, with these (and using 2 for IPL_SOFTCLOCK) every test passes now.
Thanks! This already allowed me to fix a small bug in Xen's clock
initialisation :)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Manuel Bouyer

On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote:
> FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
> heartbeat(9) that will make the system crash rather than hang when
> CPUs are stuck in certain ways that hardware watchdog timers can't
> detect (or on systems without hardware watchdog timers).
> 
> It's optional for now, but it's small and I'd like to make it
> mandatory in the future.  If you'd like to try it out, add the
> following two lines to your kernel config:
> 
> options   HEARTBEAT
> options   HEARTBEAT_MAX_PERIOD_DEFAULT=15
> 
> You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
> runtime, or use that knob to change the maximum period before the
> system will crash if not all (online) CPUs have made progress.
> 
> 
> Here are some manual tests that you can use to exercise it -- these
> are manual tests, not automatic tests, because some will deliberately
> crash the kernel to make sure the diagnostic works, and the others, if
> broken, will also crash the kernel.
> 
> Notes:
> - The magic numbers for debug.crashme.spl_spinout are for evbarm.
>   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
>   For other architectures, consult the source for the numbers to use.
> - If you're on a single-CPU system, skip the cpuctl offline/online
>   tests and just do (4) and (5).
> - If you're on a >2-CPU system, then for the cpuctl offline/online
>   tests, try offlining all CPUs but one at a time.
> 
> 1.cpuctl offline 0
>   sleep 20
>   cpuctl online 0

With this I get a panic on Xen:
[ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: 
file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[ 225.4605386] cpu0: Begin traceback...
[ 225.4605386] vpanic() at netbsd:vpanic+0x163
[ 225.4605386] kern_assert() at netbsd:kern_assert+0x4b
[ 225.4705333] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[ 225.4705333] cpu_xc_online() at netbsd:cpu_xc_online+0x11
[ 225.4705333] xc_thread() at netbsd:xc_thread+0xc8
[ 225.4705333] cpu0: End traceback...
[ 225.4705333] fatal breakpoint trap in supervisor mode
[ 225.4705333] trap type 1 code 0 rip 0x8022e96d cs 0xe030 rflags 0x202 
cr2 0x9b8030d32000 ilevel 0 rsp 0x9b8030985dd0
[ 225.4705333] curlwp 0x9b80007c6900 pid 0.7 lowest kstack 
0x9b80309812c0
Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x163
kern_assert() at netbsd:kern_assert+0x4b
heartbeat_resume() at netbsd:heartbeat_resume+0x82
cpu_xc_online() at netbsd:cpu_xc_online+0x11
xc_thread() at netbsd:xc_thread+0xc8

Is it expected ? Nothing looks Xen-specific here


> 
> 2.cpuctl offline 1
>   sleep 20
>   cpuctl online 1

same panic

> 
> 3.cpuctl offline 0
>   sysctl -w kern.heartbeat.max_period=5
>   sleep 10
>   sysctl -w kern.heartbeat.max_period=0
>   sleep 10
>   sysctl -w kern.heartbeat.max_period=15
>   sleep 20
>   cpuctl online 0

Here we have:
#sysctl -w kern.heartbeat.max_period=15
[  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: 
file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[  53.5704682] cpu0: Begin traceback...
[  53.5704682] vpanic() at netbsd:vpanic+0x163
[  53.5704682] kern_assert() at netbsd:kern_assert+0x4b
[  53.5704682] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[  53.5704682] xc_thread() at netbsd:xc_thread+0xc8
[  53.5704682] cpu0: End traceback...


> 
> 4.sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
>   # verify system panics after 15sec

my sysctl command did hang, but the system didn't panic

> 
> 5.sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
>   # verify system panics after 15sec

This one did panic
> 
> 6.cpuctl offline 0
>   sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
>   # verify system panics after 15sec

my sysctl command did hang, but the system didn't panic

> 
> 7.cpuctl offline 0
>   sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
>   # verify system panics after 15sec

and this one did panic

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Manuel Bouyer

On Fri, Jun 23, 2023 at 11:37:23PM +, RVP wrote:
> On Fri, 23 Jun 2023, Brian Buhrow wrote:
> 
> > hello.  My understanding is that the arp caching mechanism works 
> > regardless of whether
> > you use static MAC addresses or dynamically generated ones.
> > [...]
> > If you then run brconfig on the bridge containing the domu, you'll see the 
> > MAC  address you
> > assigned, or which was assigned dynamically, alive and well.
> > 
> 
> Right, but, cacheing implies a timeout, and is there a timeout for the MAC
> addresses on Xen IFs? Does an `arp -an' indicate this (I can't test this--
> no Xen set up.)

Xen IFs are no different from regular Ethernet interfaces.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread Manuel Bouyer

On Fri, Jun 23, 2023 at 03:52:21PM +0200, Matthias Petermann wrote:
> Hi,
> 
> On 23.06.23 02:45, RVP wrote:
> > So, the server tries to write data into the socket; write() fails with
> > errno = EHOSTDOWN which sshd(8) treats as a fatal error and it exits.
> > The client tries to read/write to a closed connection, and it too quits.
> > 
> > The part which doesn't make sense is the EHOSTDOWN error. Clearly the
> > other end isn't down. Can't say I understand what's happening here. You
> > need a Xen guru now, Matthias :)
> 
> I will still try the tips from yesterday (long time ping test) and collect
> some more data. And yes - I think only someone with a strong Xen background
> can really help me :-) I will followup as soon I completed my recent tests.

I'm not sure it's Xen-specific, there have been changes in the network stack
between -9 and -10 affecting the way ARP and duplicate addresses are managed.

> 
> > 
> > On Thu, 22 Jun 2023, Brian Buhrow wrote:
> > 
> > >   hello.  Actually, on the server side, where you get the "host
> > > is down" message, that is a
> > > system error from the network stack itself.  I've seen it when the
> > > arp cache times out and
> > > can't be refreshed in a timely manner.
> > > 
> > 
> > But, does ARP make any sense for Xen IFs? I thought MAC addresses were
> > ginned up for Xen IFs...
> 
> At the moment, I manually set the MAC adresses for all DomUs in the Domain
> configuration file (at the network interface specification), example:


> 
> ```
> name="srv-net"
> type="pv"
> kernel="/netbsd-XEN3_DOMU.gz"
> memory=512
> vcpus=2
> vif = ['mac=00:16:3E:00:00:01,bridge=bridge0,ip=192.168.2.51' ]

The ip= part is not used by NetBSD.
A fixed MAC address shouldn't make a difference: it's the xl tool which
generates one if needed, and the domU doesn't know if it's fixed or
auto-generated.
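
For illustration only, an address in the Xen OUI (00:16:3e, the prefix used
in the configuration quoted above) could be produced as below. This Python
sketch is not the algorithm xl uses: the function name and the choice to
randomise the last three octets are assumptions. The point is that the result
looks the same to the domU as a hand-assigned address.

```python
import random

XEN_OUI = (0x00, 0x16, 0x3E)   # prefix seen in the vif= line above


def random_xen_mac(rng=random):
    """Return a MAC string with the Xen OUI and three random octets."""
    tail = [rng.randrange(256) for _ in range(3)]
    return ":".join(f"{b:02x}" for b in (*XEN_OUI, *tail))
```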

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

panic in knote

2022-12-21 Thread Manuel Bouyer

Hello
in my daily tests of HEAD:
https://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/
I've seen twice this panic:

kernel/kqueue/t_proc4 (80/939): 1 test cases
proc4: [ 538.1439841] uvm_fault(0xc287759c, 0, 2) -> 0xe
[ 538.1642020] fatal page fault in supervisor mode
[ 538.1755046] trap type 6 code 0x2 eip 0xc0d657ac cs 0x8 eflags 0x10286 cr2 
0x54 ilevel 0 esp 0xda7b8e78
[ 538.1985618] curlwp 0xc4d70900 pid 23901 lid 23901 lowest kstack 0xda7b62c0
[ 538.2145388] panic: trap
[ 538.2145388] cpu0: Begin traceback...
[ 538.2294995] 
vpanic(c12d339c,da7b8d18,da7b8dd4,c01306b2,c12d339c,da7b8de0,da7b8de0,5d5d,da7b62c0,10286)
 at netbsd:vpanic+0x196
[ 538.2537154] 
panic(c12d339c,da7b8de0,da7b8de0,5d5d,da7b62c0,10286,54,0,da7b8e78,c287759c) at 
netbsd:panic+0x18
[ 538.2847793] trap() at netbsd:trap+0xd7c
[ 538.2949956] --- trap (number 6) ---
[ 538.3051892] 
mutex_init(54,2,0,da7b8e60,c04bdd42,c4c391c0,c4c391c0,c4c391c0,54,c4c391c0) at 
netbsd:mutex_init+0x9
[ 538.3336643] 
knote_proc_fork_track(c4c23c48,c4a6b040,0,da7b8ea4,c0d0f8d8,c36b4440,c36b4440,c4a6b040,c4d70900,da7b8f10)
 at netbsd:knote_proc_fork_track+0xce
[ 538.3636624] 
knote_proc_fork(c4a6b040,c36b4440,da711000,0,0,0,c0d549e0,c4c391c0,da7b8ef4,0) 
at netbsd:knote_proc_fork+0x97
[ 538.3945229] fork1(c4d70900,0,14,0,0,c0d549e0,0,da7b8f60,da7b8f9c,c04bd5ab) 
at netbsd:fork1+0x667
[ 538.4238665] 
sys_fork(c4d70900,da7b8f68,da7b8f60,c23454c8,1,2,da7b8f60,da7b8f68,0,0) at 
netbsd:sys_fork+0x48
[ 538.4556545] syscall() at netbsd:syscall+0x17c
[ 538.4648600] --- syscall (number 2) ---
[ 538.4822324] bb3b7027:
[ 538.4876585] cpu0: End traceback...

any idea ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: kernel deadlock on fstchg with vnd

2022-05-29 Thread Manuel Bouyer

On Sun, May 29, 2022 at 01:18:16PM +0200, J. Hannken-Illjes wrote:
> > On 29. May 2022, at 08:30, Michael van Elst  wrote:
> > 
> > bou...@antioche.eu.org (Manuel Bouyer) writes:
> > 
> >> Hello,
> >> do you have an idea on the problem in this thread:
> >> http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html
> > [...]
> >> I can't reproduce this when using vnd from userland.
> > 
> > You can replicate it by addressing the block device with vnconfig.
> > 
> > A workaround would be to modify the Xen block script to select the
> > raw device:
> > 
> > vnconfig /dev/r${disk}d $xparams >/dev/null; then
> > 
> > or just the disk name:
> > 
> > vnconfig ${disk} $xparams >/dev/null; then
> 
> Good catch, sys/dev/vnd.c has this:
> 
>   1751  static void
>   1752  vndclear(struct vnd_softc *vnd, int myminor)
>   1753  {
>   1754  struct vnode *vp = vnd->sc_vp;
>   1755  int fflags = FREAD;
>   1756  int bmaj, cmaj, i, mn;
>   1757  int s;
>   1758
>   1759  #ifdef DEBUG
>   1760  if (vnddebug & VDB_FOLLOW)
>   1761  printf("vndclear(%p): vp %p\n", vnd, vp);
>   1762  #endif
>   1763  /* locate the major number */
>   1764  bmaj = bdevsw_lookup_major(_bdevsw);
>   1765  cmaj = cdevsw_lookup_major(_cdevsw);
>   1766
>   1767  /* Nuke the vnodes for any open instances */
>   1768  for (i = 0; i < MAXPARTITIONS; i++) {
>   1769  mn = DISKMINOR(device_unit(vnd->sc_dev), i);
>   1770  vdevgone(bmaj, mn, mn, VBLK);
>   1771  if (mn != myminor) /* XXX avoid to kill own vnode */
>   1772  vdevgone(cmaj, mn, mn, VCHR);
>   1773  }
> 
> The "skip myself" on lines 1771/1772 is responsible for this behaviour.

Yes, and doing the same for block devices avoids the issue.
But Taylor is reluctant to commit this hack.
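
To make the difference concrete, here is a small Python model of that loop.
It is not kernel code: revoke() is a hypothetical stand-in for vdevgone(),
minors are plain integers, and the skip_own_block flag is an assumption that
models the hack of applying the same "skip myself" test to block devices.

```python
def vndclear_minors(minors, myminor, revoke, skip_own_block=False):
    """Model of the vndclear() loop over one unit's partition minors.

    revoke(kind, minor) stands in for vdevgone(); kind is "blk" or "chr".
    """
    for mn in minors:
        # Stock code revokes every block node, including the caller's own.
        if not (skip_own_block and mn == myminor):
            revoke("blk", mn)
        # Character nodes already skip the caller's own minor.
        if mn != myminor:
            revoke("chr", mn)
```

With skip_own_block left at False, a caller that opened the block device has
the node it is issuing the ioctl through revoked, which matches the case where
vnconfig is given the block device.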

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread Manuel Bouyer

On Fri, May 27, 2022 at 02:06:59PM +0200, Matthias Petermann wrote:
> Anyway, Once I try to "xl console" I did only get a fragment:
> 
> ```
> ganymed$ doas xl console net
> [   1.000] cpu_rng: rdrand
> [   1.000] entropy: ready
> [   1.000] Copyright (c) 1996, 1997, 1998, 1999,
> ```
> 
> At the "1999," the Dom0 became frozen, again.

A recent change caused xenconsoled to hang, and possibly xenstore to
miss events too. Should be fixed with
src/sys/arch/xen/xen/xenevt.c 1.65

But the hang on the filesystem remains for me.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

kernel deadlock on fstchg with vnd

2022-05-27 Thread Manuel Bouyer

Hello,
do you have an idea on the problem in this thread:
http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html

When stopping a Xen guest with a virtual disk backed by a file,
the vnconfig -u process won't exit: it hangs on specio, and other processes
hang on fstchg.
From kernel messages, the xbd backend has closed the vnd device which is
being unconfigured, although I can't say if it did so before or after
the vnconfig -u process was started (but likely before).

I can't reproduce this when using vnd from userland.

This happens with the file on /, or on a different partition,
with or without -o log. It happens even if the dom0 has a single CPU.

Any idea how to debug this further ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread Manuel Bouyer

On Fri, May 27, 2022 at 04:41:29PM +0200, J. Hannken-Illjes wrote:
> > On 27. May 2022, at 16:24, Manuel Bouyer  wrote:
> > 
> > On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
> >>> On 27. May 2022, at 14:41, Matthias Petermann  
> >>> wrote:
> >>> 
> >>> Hello Jürgen,
> >>> 
> >>> Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
> >>>> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
> >>>> should give even more details.
> >>> 
> >>> here is the stacktrace from the vnconfig process (the PID has changed 
> >>> since I restarted):
> >>> 
> >>> https://www.petermann-it.de/tmp/p7.jpg
> >> 
> >> This is the thread currently suspending the root fs (vrevoke suspends it).
> >> 
> >> Looks like it is waiting for I/O to drain on the vnd device ...
> >> 
> >>> You can find the output of fstrans_dump here:
> >>> 
> >>> https://www.petermann-it.de/tmp/p8.jpg
> >> 
> >> The owner is irritating, it should be vnconfig from above.
> > 
> > I can reproduce it:
> 
> What is the recipe?

xl create -c 
shutdown -p now in the guest
notice that the guest doesn't shut down and run xl destroy 
(I think xl destroy is what causes the deadlock, by running a second
vnconfig -u)

But my dom0 has 32 vcpus, and this seems to cause other troubles (at the
xenstore level, among others). Trying again with only 1 vcpu.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread Manuel Bouyer

On Fri, May 27, 2022 at 04:24:30PM +0200, Manuel Bouyer wrote:
> On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
> > > On 27. May 2022, at 14:41, Matthias Petermann  
> > > wrote:
> > > 
> > > Hello Jürgen,
> > > 
> > > Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
> > >> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
> > >> should give even more details.
> > > 
> > > here is the stacktrace from the vnconfig process (the PID has changed 
> > > since I restarted):
> > > 
> > > https://www.petermann-it.de/tmp/p7.jpg
> > 
> > This is the thread currently suspending the root fs (vrevoke suspends it).
> > 
> > Looks like it is waiting for I/O to drain on the vnd device ...
> > 
> > > You can find the output of fstrans_dump here:
> > > 
> > > https://www.petermann-it.de/tmp/p8.jpg
> > 
> > The owner is irritating, it should be vnconfig from above.
> 
> I can reproduce it:
> db{0}> ps
> PIDLID S CPU FLAGS   STRUCT LWP *   NAME WAIT
> 2419  2419 3   8 0   9000210b9280   tcsh fstchg
> 2415  2415 3  11 0   90001f66f540   vnconfig fstchg
> 2416  2416 3  18 0   900020ea3200dirname fstchg
> 2417  2417 3  24 0   900020e6c700 sh fstchg
> 2414  2414 3  12 0   90001f6d7a00   vnconfig specio
> [...]
> db{0}> tr/t 0t2415
> trace: pid 2415 lid 2415 at 0x90008ed3e980
> sleepq_block() at netbsd:sleepq_block+0x12c
> cv_wait() at netbsd:cv_wait+0x42
> fstrans_start() at netbsd:fstrans_start+0x193
> VOP_LOCK() at netbsd:VOP_LOCK+0x79
> vn_lock() at netbsd:vn_lock+0xae
> namei_tryemulroot() at netbsd:namei_tryemulroot+0x1024
> namei() at netbsd:namei+0x29
> vn_open() at netbsd:vn_open+0x133
> do_open() at netbsd:do_open+0xc3
> do_sys_openat() at netbsd:do_sys_openat+0x74
> sys_open() at netbsd:sys_open+0x24
> syscall() at netbsd:syscall+0x18c
> --- syscall (number 5) ---
> netbsd:syscall+0x18c:
> db{0}> tr/t 0t2414
> trace: pid 2414 lid 2414 at 0x90008c57e6c0
> sleepq_block() at netbsd:sleepq_block+0x12c
> cv_wait() at netbsd:cv_wait+0x42
> spec_io_drain() at netbsd:spec_io_drain+0x84
> spec_close() at netbsd:spec_close+0x1c6
> VOP_CLOSE() at netbsd:VOP_CLOSE+0x38
> spec_node_revoke() at netbsd:spec_node_revoke+0x14d
> vcache_reclaim() at netbsd:vcache_reclaim+0x4e7
> vgone() at netbsd:vgone+0xcd
> vrevoke() at netbsd:vrevoke+0xfa
> genfs_revoke() at netbsd:genfs_revoke+0x13
> VOP_REVOKE() at netbsd:VOP_REVOKE+0x35
> vdevgone() at netbsd:vdevgone+0x64
> vnddoclear.part.0() at netbsd:vnddoclear.part.0+0xaa
> vndioctl() at netbsd:vndioctl+0x78c
> bdev_ioctl() at netbsd:bdev_ioctl+0x91
> spec_ioctl() at netbsd:spec_ioctl+0xa5
> VOP_IOCTL() at netbsd:VOP_IOCTL+0x41
> vn_ioctl() at netbsd:vn_ioctl+0xb3
> sys_ioctl() at netbsd:sys_ioctl+0x555
> syscall() at netbsd:syscall+0x18c
> --- syscall (number 54) ---
> netbsd:syscall+0x18c: 
> db{0}> call fstrans_dump 
> Fstrans locks by lwp:
> [ 5691.3454404] 2414.241 (/) shared 2 cow 0 alias 0
> [ 5691.3454404] Fstrans state by mount:
> [ 5691.3454404] /owner 0x90001f6d7a00 state suspended
> 
> In the ps output there is also:
> 0 2324 3   3   200   90001fe43340       vnd0 vndbp
> db{0}> tr/a 90001fe43340 
> trace: pid 0 lid 2324 at 0x90008c806df0
> sleepq_block() at netbsd:sleepq_block+0x12c
> vndthread() at netbsd:vndthread+0x78c
> 
> So it looks like vnconfig waits for the vnd I/O to drain, but the vnd thread
> is idle.

could this happen if the vnd is still open ?
I suspect the xbd backend did not close the vnd.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread Manuel Bouyer

On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
> > On 27. May 2022, at 14:41, Matthias Petermann  wrote:
> > 
> > Hello Jürgen,
> > 
> > Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
> >> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
> >> should give even more details.
> > 
> > here is the stacktrace from the vnconfig process (the PID has changed since 
> > I restarted):
> > 
> > https://www.petermann-it.de/tmp/p7.jpg
> 
> This is the thread currently suspending the root fs (vrevoke suspends it).
> 
> Looks like it is waiting for I/O to drain on the vnd device ...
> 
> > You can find the output of fstrans_dump here:
> > 
> > https://www.petermann-it.de/tmp/p8.jpg
> 
> The owner is irritating, it should be vnconfig from above.

I can reproduce it:
db{0}> ps
PID   LID  S CPU FLAGS   STRUCT LWP *   NAME     WAIT
2419  2419 3   8     0   9000210b9280   tcsh     fstchg
2415  2415 3  11     0   90001f66f540   vnconfig fstchg
2416  2416 3  18     0   900020ea3200   dirname  fstchg
2417  2417 3  24     0   900020e6c700   sh       fstchg
2414  2414 3  12     0   90001f6d7a00   vnconfig specio
[...]
db{0}> tr/t 0t2415
trace: pid 2415 lid 2415 at 0x90008ed3e980
sleepq_block() at netbsd:sleepq_block+0x12c
cv_wait() at netbsd:cv_wait+0x42
fstrans_start() at netbsd:fstrans_start+0x193
VOP_LOCK() at netbsd:VOP_LOCK+0x79
vn_lock() at netbsd:vn_lock+0xae
namei_tryemulroot() at netbsd:namei_tryemulroot+0x1024
namei() at netbsd:namei+0x29
vn_open() at netbsd:vn_open+0x133
do_open() at netbsd:do_open+0xc3
do_sys_openat() at netbsd:do_sys_openat+0x74
sys_open() at netbsd:sys_open+0x24
syscall() at netbsd:syscall+0x18c
--- syscall (number 5) ---
netbsd:syscall+0x18c:
db{0}> tr/t 0t2414
trace: pid 2414 lid 2414 at 0x90008c57e6c0
sleepq_block() at netbsd:sleepq_block+0x12c
cv_wait() at netbsd:cv_wait+0x42
spec_io_drain() at netbsd:spec_io_drain+0x84
spec_close() at netbsd:spec_close+0x1c6
VOP_CLOSE() at netbsd:VOP_CLOSE+0x38
spec_node_revoke() at netbsd:spec_node_revoke+0x14d
vcache_reclaim() at netbsd:vcache_reclaim+0x4e7
vgone() at netbsd:vgone+0xcd
vrevoke() at netbsd:vrevoke+0xfa
genfs_revoke() at netbsd:genfs_revoke+0x13
VOP_REVOKE() at netbsd:VOP_REVOKE+0x35
vdevgone() at netbsd:vdevgone+0x64
vnddoclear.part.0() at netbsd:vnddoclear.part.0+0xaa
vndioctl() at netbsd:vndioctl+0x78c
bdev_ioctl() at netbsd:bdev_ioctl+0x91
spec_ioctl() at netbsd:spec_ioctl+0xa5
VOP_IOCTL() at netbsd:VOP_IOCTL+0x41
vn_ioctl() at netbsd:vn_ioctl+0xb3
sys_ioctl() at netbsd:sys_ioctl+0x555
syscall() at netbsd:syscall+0x18c
--- syscall (number 54) ---
netbsd:syscall+0x18c: 
db{0}> call fstrans_dump 
Fstrans locks by lwp:
[ 5691.3454404] 2414.241 (/) shared 2 cow 0 alias 0
[ 5691.3454404] Fstrans state by mount:
[ 5691.3454404] /owner 0x90001f6d7a00 state suspended

In the ps output there is also:
0 2324 3   3   200   90001fe43340   vnd0 vndbp
db{0}> tr/a 90001fe43340 
trace: pid 0 lid 2324 at 0x90008c806df0
sleepq_block() at netbsd:sleepq_block+0x12c
vndthread() at netbsd:vndthread+0x78c

So it looks like vnconfig waits for the vnd I/O to drain, but the vnd thread
is idle.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread Manuel Bouyer

On Fri, May 27, 2022 at 11:42:08AM +0200, Matthias Petermann wrote:
> I took some "screenshots" of the vga console. This was unfortunately the
> only way because the device has no serial console.
> 
> 
> Paginated processes list:
> 
> https://www.petermann-it.de/tmp/p1.jpg
> https://www.petermann-it.de/tmp/p2.jpg
> https://www.petermann-it.de/tmp/p3.jpg

Several processes are in fstchg wait; a stack trace of these processes
(tr/t 0t or tr/a 0x would show these) would help.

So it looks like a deadlock in the filesystem. What is your storage
configuration ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread Manuel Bouyer

On Fri, May 27, 2022 at 10:12:44AM +0200, Matthias Petermann wrote:
> 
> Hello all,
> 
> currently I am not able to instantiate a NetBSD Xen guest on NetBSD 9.99
> (side fact: I also have problems with a Windows guest, but it is not that
> important at the moment).
> 
> The problem occurs in the following environment:
> 
>  - Xen Kernel 4.15.2 and matching Xen Tools from pkgsrc 2022Q1 (built
> 29.04.2022)
>  - NetBSD/Xen 9.99.97 (build 25.05.2022)
> 
> The host is booted with this boot.cfg (if this matters):
> 
> ```
> menu=Boot Xen:load /netbsd-XEN3_DOM0.gz console=pc;multiboot /xen.gz
> dom0_mem=512M vga=keep console=vga
> ```
> 
> The guest config looks like this:
> 
> ```
> name = "net"
> type="pv"
> kernel = "/netbsd-INSTALL_XEN3_DOMU.gz"
> #kernel = "/netbsd-XEN3_DOMU.gz"
> memory = 2048
> vcpus = 2
> vif = [ 'mac=00:16:3E:01:00:01,bridge=bridge0' ]
> disk = [
>'file:/data/vhd/net.img,hda,rw',
>'file:/data/vhd/net-export.img,hdb,rw'
> ]
> ```
> 
> When I try to instantiate the guest, I get the following output on the
> controlling terminal:
> 
> ```
> ganymed$ doas xl create net
> Parsing config from net
> libxl: error: libxl_device.c:1109:device_backend_callback: Domain 1:unable
> to add device with path /local/domain/0/backend/vif/1/0
> libxl: error: libxl_create.c:1862:domcreate_attach_devices: Domain 1:unable
> to add vif devices
> ```

did you create the bridge0 ?

> 
> At the same time the following message appears on the system console:
> 
> ```
> [   184.680057] xbd backend: attach device vnd0d (size 1048576000) for
> domain 1
> [   184.910057] xbd backend: attach device vnd1d (size 33554432) for domain
> 1
> [   195.260077] xvif1i0: Ethernet address 00:16:3e:02:00:01
> [   195.320059] xbd backend: detach device vnd1d for domain 1
> [   195.350051] xbd backend: detach device vnd0d for domain 1
> [   195.450054] xvif1i0: disconnecting
> ```
> 
> After the messages appear on the system console, the system does not respond
> to any input either via SSH or on the local console. It seems to be frozen.
> I can still activate the kernel debugger with Control+Alt+Escape.

Can you get a stack trace, and processes list ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: cmake hang solution?

2022-05-02 Thread Manuel Bouyer

On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> it looks like the diff won't apply as-is, but I think the concept still 
> applies.
> 
> note that there have been a LOT of changes in libpthread since netbsd-9,
> and some of those changes also claim to fix problems where threads hang
> waiting on locks and/or condvars.  it would be more useful to test
> with a HEAD libpthread (which I'll guess requires a HEAD libc too).

the goal is to build the official netbsd-9 packages, so that's not an option

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: cmake hang solution?

2022-05-02 Thread Manuel Bouyer

On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> > w...@netbsd.org (Thomas Klausner) writes:
> > >I never saw the cmake hang myself. I still see hangs in guile.
> > 
> > 
> > I see both in almost every pbulk run.
> 
> 
> please try this patch for the cmake variation of this hang:
> 
> http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

would this apply to netbsd-9 too ? The hang I'm seeing is on a system
with a HEAD kernel and a netbsd-9 userland 

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: reproducible kernel crash with quota

2022-04-12 Thread Manuel Bouyer

On Tue, Apr 12, 2022 at 08:52:28AM +0200, 6b...@6bone.informatik.uni-leipzig.de 
wrote:
> Hello,
> 
> since I already have some open bugs with reproducible kernel crashes, I'm
> only writing this to the mailing list.
> 
> how to reproduce the crash: /etc/rc.d/quota restart
> 
> dmesg:
> 
> [   412.047595] panic: kernel diagnostic assertion
> "dq->dq_ump->um_quotas[dq->dq _type] != vp" failed: file
> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978
> [   412.047595] cpu8: Begin traceback...
> [   412.047595] vpanic() at netbsd:vpanic+0x156
> [   412.057595] kern_assert() at netbsd:kern_assert+0x4b
> [   412.057595] dqflush() at netbsd:dqflush+0x92
> [   412.057595] quota1_handle_cmd_quotaoff() at
> netbsd:quota1_handle_cmd_quotaof f+0x120

I wonder if quota1_handle_cmd_quotaoff(), when it can't get an exclusive lock
for a vnode, could fail to free the associated quota structure.
Shouldn't it wait for the exclusive vnlock or retry in this case ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Bug or no Bug?

2022-02-10 Thread Manuel Bouyer

On Wed, Feb 09, 2022 at 09:22:34PM +0100, 6b...@6bone.informatik.uni-leipzig.de 
wrote:
> Hello,
> 
> I have installed the 9.99.xx kernel on several systems. On most systems
> there are no problems. On a Dell 2800, the kernel crashes during boot. The
> problem only occurs if the option LOCKDEBUG is set.
> 
> options LOCKDEBUG   # expensive locking checks/support
> 
> Should a bug report be made in this case? Or should problems that only occur
> when LOCKDEBUG is enabled be ignored?

Crashes with LOCKDEBUG are not expected, so please report.


-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: execute statically-linked linux files

2022-01-07 Thread Manuel Bouyer

On Thu, Jan 06, 2022 at 11:38:58PM +, RVP wrote:
> On Thu, 6 Jan 2022, Manuel Bouyer wrote:
> 
> > the second issue is that it expects /emul/linux/proc/self/fd/4 to be a 
> > working
> > symlink, and on NetBSD it's not. Note that with /bin/ls I get something
> > similar:
> > armandeche:/local/armandeche1/tmp#ktrace -i ls -l /proc/self/fd/
> > total 2
> > crw--w  1 bouyer  tty5, 0 Jan  6 17:54 0
> > crw--w  1 bouyer  tty5, 0 Jan  6 17:54 1
> > crw--w  1 bouyer  tty5, 0 Jan  6 17:54 2
> > lr-xr-xr-x  1 rootwheel  2048 Jan  6 17:54 3 -> /local/armandeche1/tmp
> > 
> > ls: /proc/self/fd//4: Invalid argument
> > lr-xr-xr-x  1 rootwheel 0 Jan  6 17:54 4
> > 
> > 22875  1 ls   CALL  readlink(0x7f7fffb98200,0x7f7fffb98610,0x400)
> > 22875  1 ls   NAMI  "/proc/self/fd//4"
> > 22875  1 ls   RET   readlink -1 errno 22 Invalid argument
> > 
> > If I can trust the ktrace output, fd/4 should point to /etc/spwd.db
> > 
> > On linux, strace shows it reading the link from /proc/self/exec, getting 
> > back
> > 
> 
> This 2nd issue I think I can explain: the fd existed at the start of a
> readdir(), but, then is closed sometime when the listing is still in
> progress as in the code below:

That could be it: when the directory is read, fd 4 is the directory itself.
But at the time of the readlink, fd 4 is definitely open, but points to
another file (I can't see a close(4) between the open("/etc/spwd.db") and the
readlink()).

Anyway, the issue with the linux binary is likely different.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: execute statically-linked linux files

2022-01-06 Thread Manuel Bouyer

On Thu, Jan 06, 2022 at 05:02:13PM +0100, Anders Magnusson wrote:
> Have you looked at brandelf?
> 
> https://www.freebsd.org/cgi/man.cgi?query=brandelf=1

Looks like what I need, thanks.
For the record, attached is my port of it to NetBSD.

Interestingly, it seems to recognise all binaries as SVR4 (for NetBSD or
linux binaries) so it seems that the ELF type is recorded at some other place.
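
For reference, the "SVR4"/"(SYSV)" label corresponds to the EI_OSABI byte of
the ELF identification header, which is the byte a brandelf-style tool
rewrites. A minimal sketch of reading it, assuming only the standard e_ident
layout; the function name elf_osabi and the two macros are made up for this
illustration and are not taken from the attached port:

```c
#include <stdio.h>
#include <string.h>

/* Layout of e_ident[] as given by the ELF specification. */
#define EI_NIDENT_BYTES	16	/* size of the identification array */
#define EI_OSABI_INDEX	7	/* index of the OS/ABI byte */

/*
 * Return the EI_OSABI byte of the ELF file at `path`
 * (0 = System V, 2 = NetBSD, 3 = Linux/GNU, 9 = FreeBSD),
 * or -1 if the file cannot be read or lacks the ELF magic.
 */
int
elf_osabi(const char *path)
{
	unsigned char ident[EI_NIDENT_BYTES];
	FILE *fp = fopen(path, "rb");
	size_t n;

	if (fp == NULL)
		return -1;
	n = fread(ident, 1, sizeof(ident), fp);
	fclose(fp);
	if (n != sizeof(ident) || memcmp(ident, "\177ELF", 4) != 0)
		return -1;
	return ident[EI_OSABI_INDEX];
}
```

A value of 0 for both NetBSD and linux binaries would be consistent with the
observation above that the OS type must be recorded somewhere other than this
byte.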

Anyway with a binary rebranded to linux I now hit another issue:
it quickly core dumps, with an issue that seems related to procfs:

with procfs only mounted on /emul/linux/proc, I get:
  6369   6369 xc8  CALL  open(0x43d6da,0x280800,0x66d208)
  6369   6369 xc8  NAMI  "/emul/linux/proc/self/exe"
  6369   6369 xc8  NAMI  "/proc/self/exe"
  6369   6369 xc8  RET   open -1 errno -2 No such file or directory
  6369   6369 xc8  PSIG  SIGSEGV SIG_DFL: code=SEGV_MAPERR, addr=0x0, 
trap=14)
  6369   6369 xc8  NAMI  "xc8.core"

But /emul/linux/proc/self/exe should exist:
armandeche:/>ls -l /emul/linux/proc/self/exe
lr-xr-xr-x  1 root  wheel  7 Jan  6 17:46 /emul/linux/proc/self/exe -> /bin/ls
armandeche:/>/emul/linux/bin/ls /emul/linux/proc/self/exe
/emul/linux/proc/self/exe

If I also mount procfs on /proc things go a bit further:
 25735  25735 xc8  CALL  open(0x43d6da,0x280800,0x66d208)
 25735  25735 xc8  NAMI  "/emul/linux/proc/self/exe"
 25735  25735 xc8  NAMI  "/proc/self/exe"
 25735  25735 xc8  RET   open 4
 25735  25735 xc8  CALL  readlink(0x7f7fd6f5,0x7f7fd830,0xfff)
 25735  25735 xc8  NAMI  "/emul/linux/proc/self/fd/4"
 25735  25735 xc8  RET   readlink -1 errno -22 Invalid argument
 25735  25735 xc8  CALL  close(4)
 25735  25735 xc8  RET   close 0
 25735  25735 xc8  PSIG  SIGSEGV SIG_DFL: code=SEGV_MAPERR, addr=0x0, 
trap=14)
 25735  25735 xc8  NAMI  "xc8.core"

What's strange here is that /emul/linux/proc/self/exe should work as well
as /proc/self/exe

the second issue is that it expects /emul/linux/proc/self/fd/4 to be a working
symlink, and on NetBSD it's not. Note that with /bin/ls I get something
similar:
armandeche:/local/armandeche1/tmp#ktrace -i ls -l /proc/self/fd/
total 2
crw--w  1 bouyer  tty5, 0 Jan  6 17:54 0
crw--w  1 bouyer  tty5, 0 Jan  6 17:54 1
crw--w  1 bouyer  tty5, 0 Jan  6 17:54 2
lr-xr-xr-x  1 rootwheel  2048 Jan  6 17:54 3 -> /local/armandeche1/tmp

ls: /proc/self/fd//4: Invalid argument
lr-xr-xr-x  1 rootwheel 0 Jan  6 17:54 4

 22875  1 ls   CALL  readlink(0x7f7fffb98200,0x7f7fffb98610,0x400)
 22875  1 ls   NAMI  "/proc/self/fd//4"
 22875  1 ls   RET   readlink -1 errno 22 Invalid argument

If I can trust the ktrace output, fd/4 should point to /etc/spwd.db

On linux, strace shows it reading the link from /proc/self/exec, getting back
the executable path and doing a stat on it.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--
/*-
 * SPDX-License-Identifier: BSD-3-Clause
 *
 * Copyright (c) 2000, 2001 David O'Brien
 * Copyright (c) 1996 Søren Schmidt
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer
 *in this position and unchanged.
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in the
 *documentation and/or other materials provided with the distribution.
 * 3. The name of the author may not be used to endorse or promote products
 *derived from this software without specific prior written permission
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#include 

#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static int elftype(const char *);
static const char *iselftype(int);
static void printelftypes(void);
static void usage(void);

struct ELFtypes {
	const char *str;
	int value;
};
/* XXX - any more types? */

execute statically-linked linux files

2022-01-06 Thread Manuel Bouyer

Hello,
I have linux binaries I'd like to run on NetBSD (this is a commercial
product). Some files are dynamically-linked files and run properly.
They show up as:
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, 
interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.24, 
BuildID[sha1]=38afca809a07f7e934012f7dac9094e3bcd2585d, stripped

But there are also some statically-linked files, which show up as
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
(note the missing "for GNU/Linux" here) and NetBSD doesn't want to run them
(Exec format error. Binary file not executable.).

Is there a way to convert the ELF header so that NetBSD can run them ?

thanks

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: XEN devices included in kernel even if not XEN

2021-12-21 Thread Manuel Bouyer

On Tue, Dec 21, 2021 at 07:44:59AM -0800, Paul Goyette wrote:
> I've noticed that device drivers listed in arch/xen/conf/files.xen
> (or, at least, most of those devices) seem to be included in kernel
> even if not using XEN.  Shouldn't all those devices be conditional?
> 
> # sysctl -a | grep driver | tr ',' '\n' | grep 'x[be]*'
> ...
>  [141 -1 xenevt]
>  [142 142 xbd]
>  [143 -1 xencons]

I think this lists all the known major numbers for the $MACHINE, I don't think
it means that the driver is actually loaded.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Serious bugs in NetBSD-current, have they been fixed?

2021-10-25 Thread Manuel Bouyer

On Mon, Oct 25, 2021 at 09:33:26AM +0100, Chavdar Ivanov wrote:
> [...]
> The only (minor) problem I am still having occasionally is with cmake,
> which hangs for me in two well-defined and repeating spots when I am
> doing pkg_rolling-replace (the build completes when I attach to the
> cmake process with gdb and just quit it). This has been discussed
> before, I still am not clear if this is entropy related (unlikely as
> it occurs always during the build of two particular packages only), a
> problem with threads or an internal cmake bug.

It's probably kern/56414 (probably wrong category as it seems to be a userland
bug). It's not related to entropy.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: linux clone issue

2021-10-05 Thread Manuel Bouyer

On Tue, Oct 05, 2021 at 03:57:14PM +0100, Robert Swindells wrote:
> 
> Manuel Bouyer  wrote:
> >I'm trying to run a binary-only linux program under NetBSD 9.2.
> >From what I found, the binary was built on Ubuntu 16.04
> >
> >The program dies at a specific point and it seems to be a bug in our
> >emulation:
> 
>   8992   8992 mylinuxprog CALL  set_robust_list(0x7f7ff7ef5a20,0x18)
>   8992   8992 mylinuxprog RET   set_robust_list 0
> 
> This is doing futex stuff which isn't in -9, it doesn't work in -current
> either but thorpej@ has an improved version on a branch.

Hum, so after the ptrace issue this is going to be the next challenge :)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

linux ptrace issue [Re: linux clone issue]

2021-10-05 Thread Manuel Bouyer

On Tue, Oct 05, 2021 at 01:08:52PM +0200, Manuel Bouyer wrote:
> On Tue, Oct 05, 2021 at 12:42:33AM -0400, Eric Hawicz wrote:
> > 
> > On 10/4/2021 10:33 AM, Manuel Bouyer wrote:
> > > Hello
> > > I'm trying to run a binary-only linux program under NetBSD 9.2.
> > >  From what I found, the binary was built on Ubuntu 16.04
> > > [...]
> > > 
> > > As you can see above (ktrace -si output), the read on fd 3 in 26751 
> > > returns
> > > with an error as soon as the child does its execve(), just as if CLOSEEXEC
> > > was set in the child. But the dup2(4,1) should keep the write side open
> > > without CLOSEEXEC. The program does a similar sequence just before
> > > (also forking a shell to execute some command) and it works.
> > > Later when sh tries to write to stdout it gets a SIGPIPE.
> > > 
> > > I couldn't reproduce this with a simple program.
> > > But it seems that I can't reproduce this clone call. It seems that we are
> > > called with flags 0x1200011, which would translate to
> > > CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
> > > and a NULL stack pointer.
> > > But when run on linux, this clone syscall straces to
> > > CLONE_VM|CLONE_VFORK|SIGCHLD
> > 
> > I think that combination of flags is actually a "fork()" call, which glibc
> > implements using clone.  I found that through 
> > https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/,
> > which mentions that glibc has a ARCH_FORK macro, though it seems that the
> > more recent code uses an arch_fork inline function: 
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/arch-fork.h;h=b846da08f98839aef336868de24850626428509c;hb=HEAD
> 
> Yes, I think it's a form of fork() or vfork(). But when I compile a
> test program on linux (RHEL7 or Ubuntu 20), fork() and vfork() appear
> as fork and vfork in NetBSD's ktrace, not clone.

I missed a point in the trace output, the parent is killed, and
read() returns not because the other end is closed but because of the
signal.

This seems to come from a ptrace difference between linux and our emulation.
Actually this binary linux program does a fork() and the child does the
work, the parent just waits. But what happens is:
the parent:
p = fork()
wait()
ptrace(PTRACE_CONT, p, NULL, SIG_0)
exit(0)

the child does:
ptrace(PTRACE_TRACEME, 0, NULL, NULL)

exit(0)

On linux, ptrace(PTRACE_TRACEME) returns EPERM, the wait in the parent
waits until the child exits, and ptrace(PTRACE_CONT) gets ESRCH.

On NetBSD, ptrace(PTRACE_TRACEME) succeeds, wait() returns at some point
before the child exits, the parent ptrace(PTRACE_CONT) the child, the
child gets killed (not by the parent, I can't see a kill() in the trace).

On linux, ptrace(PTRACE_TRACEME) receiving EPERM may be because the process
is running under strace. Running strace without -f (so that only the parent
gets traced), I see the wait() returning, the parent getting a SIGCHLD, and
ptrace(PTRACE_CONT) succeeding. But on linux, it doesn't seem that an
orphaned child process gets killed.
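
The parent/child sequence described above can be sketched as a small
reproducer. This is a guess at the shape of the binary's logic, not its
actual code: the struct and function names are invented for the example, the
child's "work" is left out, and the #ifdef picks the request names and
argument conventions used by linux and by the BSD ptrace(2) respectively.

```c
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

struct trace_result {
	pid_t child;		/* pid returned by fork(), -1 on failure */
	pid_t waited;		/* return value of the first waitpid() */
	int status;		/* status reported by that waitpid() */
	int cont_errno;		/* errno of the continue request, 0 if it worked */
};

/* Child side: ask to be traced by the parent. */
static long
trace_me(void)
{
#ifdef __linux__
	return ptrace(PTRACE_TRACEME, 0, NULL, NULL);
#else
	return ptrace(PT_TRACE_ME, 0, NULL, 0);
#endif
}

/* Parent side: resume the child without delivering a signal. */
static long
trace_continue(pid_t pid)
{
#ifdef __linux__
	return ptrace(PTRACE_CONT, pid, NULL, NULL);
#else
	return ptrace(PT_CONTINUE, pid, (void *)1, 0);
#endif
}

struct trace_result
run_traceme_sequence(void)
{
	struct trace_result r = { -1, -1, 0, 0 };

	r.child = fork();
	if (r.child == -1)
		return r;
	if (r.child == 0) {
		/* Child: request tracing, (real work would go here), exit. */
		_exit(trace_me() == -1 ? 1 : 0);
	}

	/* Parent: wait once, then try to continue the child. */
	r.waited = waitpid(r.child, &r.status, 0);
	errno = 0;
	if (trace_continue(r.child) == -1)
		r.cont_errno = errno;

	/* If the first wait only saw a stop, make sure the child is reaped. */
	if (r.waited == r.child && WIFSTOPPED(r.status)) {
		if (r.cont_errno != 0)
			(void)kill(r.child, SIGKILL);
		(void)waitpid(r.child, NULL, 0);
	}
	return r;
}
```

Comparing status and cont_errno between a native linux run and a run under
the emulation should show where the two diverge (for instance ESRCH from the
continue request when the child has already exited).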

Could our linux ptrace emulation be fixed in any way ?
especially avoid the
pid XXX was killed: orphaned traced process

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: linux clone issue

2021-10-05 Thread Manuel Bouyer

On Tue, Oct 05, 2021 at 12:42:33AM -0400, Eric Hawicz wrote:
> 
> On 10/4/2021 10:33 AM, Manuel Bouyer wrote:
> > Hello
> > I'm trying to run a binary-only linux program under NetBSD 9.2.
> >  From what I found, the binary was built on Ubuntu 16.04
> > [...]
> > 
> > As you can see above (ktrace -si output), the read on fd 3 in 26751 returns
> > with an error as soon as the child does its execve(), just as if CLOSEEXEC
> > was set in the child. But the dup2(4,1) should keep the write side open
> > without CLOSEEXEC. The program does a similar sequence just before
> > (also forking a shell to execute some command) and it works.
> > Later when sh tries to write to stdout it gets a SIGPIPE.
> > 
> > I couldn't reproduce this with a simple program.
> > But it seems that I can't reproduce this clone call. It seems that we are
> > called with flags 0x1200011, which would translate to
> > CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
> > and a NULL stack pointer.
> > But when run on linux, this clone syscall straces to
> > CLONE_VM|CLONE_VFORK|SIGCHLD
> 
> I think that combination of flags is actually a "fork()" call, which glibc
> implements using clone.  I found that through 
> https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/,
> which mentions that glibc has a ARCH_FORK macro, though it seems that the
> more recent code uses an arch_fork inline function: 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/arch-fork.h;h=b846da08f98839aef336868de24850626428509c;hb=HEAD

Yes, I think it's a form of fork() or vfork(). But when I compile a
test program on linux (RHEL7 or Ubuntu 20), fork() and vfork() appear
as fork and vfork in NetBSD's ktrace, not clone.

> 
> 
> > I tried writing a program using fork(), vfork() or clone() but
> > none of them would use the clone() syscall as my linux binary does.
> > Any idea what could cause clone() to be used this way ?
> 
> Is your binary statically linked?  Maybe it has a different glibc
> implementation from the .so that's on your system.

Yes, the linux emulation on NetBSD uses suse's glibc, while my linux test
systems are RHEL7 and Ubuntu 20

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

linux clone issue

2021-10-04 Thread Manuel Bouyer

Hello
I'm trying to run a binary-only linux program under NetBSD 9.2.
From what I found, the binary was built on Ubuntu 16.04

The program dies at a specific point and it seems to be a bug in our
emulation:
 26751  26751 mylinuxprog CALL  close(3)
 26751  26751 mylinuxprog RET   close 0
 26751  26751 mylinuxprog CALL  wait4(0x558d,0x7f7fde10,0,0)
 26751  26751 mylinuxprog RET   wait4 21901/0x558d
 26751  26751 mylinuxprog CALL  munmap(0x7f7ff7efb000,0x4000)
 26751  26751 mylinuxprog RET   munmap 0
 26751  26751 mylinuxprog CALL  pipe2(0x7f7fddf0,0x8)
 26751  26751 mylinuxprog RET   pipe2 0
 26751  26751 mylinuxprog CALL  clone(0x1200011,0,0,0x7f7ff7ef5a10,0x687f)
 26751  26751 mylinuxprog RET   clone 8992/0x2320
  8992   8992 mylinuxprog EMUL  "linux"
  8992   8992 mylinuxprog RET   fork 0
 26751  26751 mylinuxprog CALL  close(4)
 26751  26751 mylinuxprog RET   close 0
 26751  26751 mylinuxprog CALL  fcntl(3,F_SETFD,0)
 26751  26751 mylinuxprog RET   fcntl 0
 26751  26751 mylinuxprog CALL  fstat64(3,0x7f7fdd10)
 26751  26751 mylinuxprog RET   fstat64 0
 26751  26751 mylinuxprog CALL  
mmap(0,0x4000,PROT_READ|PROT_WRITE,0x22,0x,0)
 26751  26751 mylinuxprog RET   mmap 140187597254656/0x7f7ff7efb000
 26751  26751 mylinuxprog CALL  read(3,0x7f7ff7efb000,0x4000)
  8992   8992 mylinuxprog CALL  set_robust_list(0x7f7ff7ef5a20,0x18)
  8992   8992 mylinuxprog RET   set_robust_list 0
 22927  22927 mylinuxprog CALL  exit_group(0)
  8992   8992 mylinuxprog CALL  dup2(4,1)
  8992   8992 mylinuxprog RET   dup2 1
  8992   8992 mylinuxprog CALL  
execve(0x7f7ff718d873,0x7f7fbd70,0x7f7fea38)
  8992   8992 mylinuxprog NAMI  "/emul/linux/bin/sh"
  8992   8992 mylinuxprog NAMI  "/emul/linux"
  8992   8992 mylinuxprog NAMI  "/emul/linux/lib64/ld-linux-x86-64.so.2"
 26751  26751 mylinuxprog RET   read -1 errno -3 No such process
 26751  26751 mylinuxprog PSIG  SIGKILL SIG_DFL: code=SI_NOINFO
  8992   8992 sh   EMUL  "linux"
[...]


As you can see above (ktrace -si output), the read on fd 3 in 26751 returns
with an error as soon as the child does its execve(), just as if CLOSEEXEC
was set in the child. But the dup2(4,1) should keep the write side open
without CLOSEEXEC. The program does a similar sequence just before
(also forking a shell to execute some command) and it works.
Later when sh tries to write to stdout it gets a SIGPIPE.
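
The expected descriptor behaviour can be sketched as follows: a descriptor
created by dup2() does not carry the close-on-exec flag of the original, so
the write side should survive the execve() as stdout even when the pipe
itself was created close-on-exec. This is an illustration only, not the
program's real code; read_from_shell is a name chosen for the example, and it
assumes /bin/sh exists and that stdout is not already closed in the caller.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Run `cmd` through /bin/sh and collect its stdout into buf
 * (NUL-terminated). Returns the number of bytes read, or -1 on error.
 */
ssize_t
read_from_shell(const char *cmd, char *buf, size_t len)
{
	int fds[2];
	pid_t pid;
	size_t total = 0;
	ssize_t n;

	if (len == 0 || pipe(fds) == -1)
		return -1;
	/* Mark both ends close-on-exec, as a pipe2() with that flag would. */
	(void)fcntl(fds[0], F_SETFD, FD_CLOEXEC);
	(void)fcntl(fds[1], F_SETFD, FD_CLOEXEC);

	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* The duplicate on fd 1 is created without FD_CLOEXEC. */
		dup2(fds[1], STDOUT_FILENO);
		execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);
	}

	close(fds[1]);		/* parent keeps only the read side */
	while (total < len - 1) {
		n = read(fds[0], buf + total, len - 1 - total);
		if (n <= 0)
			break;	/* EOF once every write side is closed */
		total += (size_t)n;
	}
	buf[total] = '\0';
	close(fds[0]);
	(void)waitpid(pid, NULL, 0);
	return (ssize_t)total;
}
```

If a simple program of this shape works under the emulation while the real
binary does not, that points away from the close-on-exec handling itself.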

I couldn't reproduce this with a simple program.
But it seems that I can't reproduce this clone call. It seems that we are
called with flags 0x1200011, which would translate to
CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
and a NULL stack pointer.
But when run on linux, this clone syscall straces to
CLONE_VM|CLONE_VFORK|SIGCHLD
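
The decoding of the flags word can be checked with a short sketch. The
numeric values are the linux clone(2) flag bits, copied here as local
constants so the example builds anywhere; the low byte is the exit signal
(17 is SIGCHLD on linux/x86). The table and function names are made up for
the illustration, and bits missing from the table are silently ignored.

```c
#include <stdio.h>

struct flag_name {
	unsigned long bit;
	const char *name;
};

/* Subset of the linux clone(2) flag bits. */
static const struct flag_name lx_clone_flags[] = {
	{ 0x00000100UL, "CLONE_VM" },
	{ 0x00000200UL, "CLONE_FS" },
	{ 0x00000400UL, "CLONE_FILES" },
	{ 0x00000800UL, "CLONE_SIGHAND" },
	{ 0x00004000UL, "CLONE_VFORK" },
	{ 0x00010000UL, "CLONE_THREAD" },
	{ 0x00200000UL, "CLONE_CHILD_CLEARTID" },
	{ 0x01000000UL, "CLONE_CHILD_SETTID" },
};

/*
 * Write a "NAME|NAME|signal=N" description of `flags` into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int
decode_clone_flags(unsigned long flags, char *buf, size_t len)
{
	size_t i, used = 0;
	int n;

	if (len == 0)
		return -1;
	buf[0] = '\0';
	for (i = 0; i < sizeof(lx_clone_flags) / sizeof(lx_clone_flags[0]); i++) {
		if ((flags & lx_clone_flags[i].bit) == 0)
			continue;
		n = snprintf(buf + used, len - used, "%s|", lx_clone_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	/* The low byte of the flags word is the signal sent on child exit. */
	n = snprintf(buf + used, len - used, "signal=%lu", flags & 0xffUL);
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	return 0;
}
```

With this table, 0x1200011 decodes to the CHILD_CLEARTID/CHILD_SETTID pair
plus signal 17, while CLONE_VM|CLONE_VFORK|SIGCHLD would be 0x4111.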

I tried writing a program using fork(), vfork() or clone() but
none of them would use the clone() syscall as my linux binary does.
Any idea what could cause clone() to be used this way ?

Also, any idea about this file descriptor issue ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Anyone still using PCI "isp" SCSI / FC controllers?

2021-07-20 Thread Manuel Bouyer

On Sun, Jul 18, 2021 at 02:44:46PM -0700, Jason Thorpe wrote:
> The Qlogic ISP SCSI / FC driver PCI front-end appears to universally support 
> using 64-bit PCI DMA addresses, based on my reading of this code block in 
> isp_pci_dmasetup():
> 
> if (sizeof (bus_addr_t) > 4) { 
> if (rq->req_header.rqs_entry_type == RQSTYPE_T2RQS) {
> rq->req_header.rqs_entry_type = RQSTYPE_T3RQS;
> } else if (rq->req_header.rqs_entry_type == 
> RQSTYPE_REQUEST) {  
> rq->req_header.rqs_entry_type = RQSTYPE_A64;
> }
> }
> 
> There's just one problem, though!  It does not use the 64-bit PCI DMA tag, 
> and so it is always getting DMA addresses that fit in 32-bits.  On x86-64 
> machines, this results in having to bounce DMA transfers (ick).  On Alpha 
> machines, this results in having to use SGMAP (IOMMU) DMA; this is not a 
> problem unto itself, and I recently made some improvements to this on systems 
> where Qlogic ISP controllers were more likely to be present (e.g. AlphaServer 
> 1000 / 1000A).
> 
> But there are some Alpha systems we support (notably the EV6+ 
> Tsunami/Typhoon/Titan systems e.g. DS10/DS20/DS25/...) that natively support 
> 64-bit PCI DMA addressing without having to use SGMAPs ... this is generally 
> preferred because, among other things, it's faster.
> 
> I'm pretty sure it's safe, based on the code block quoted above, to change 
> PCI DMA tag selection in the driver to something like this:
> 
> /*
>  * See conditional in isp_pci_dmasetup(); if
>  * sizeof (bus_addr_t) > 4, then we'll program 
>  * the device using 64-bit DMA addresses.  
>  * So, if we're going to do that, we should do
>  * our best to get 64-bit addresses in the first
>  * place.
>  */
> if (sizeof (bus_addr_t) > 4 && pci_dma64_available(pa)) {
> isp->isp_dmatag = pa->pa_dmat64;
> } else {
> isp->isp_dmatag = pa->pa_dmat;
> }
> 
> Anyway, if someone with more knowledge of these controllers could chime in, 
> I'd really appreciate it.  (Hopefully Matt is still lurking on these mailing 
> lists??)

I have:
isp0 at pci10 dev 0 function 0: QLogic FC-AL and 4Gbps Fabric PCI-E HBA
isp1 at pci10 dev 0 function 1: QLogic FC-AL and 4Gbps Fabric PCI-E HBA

connecting to an Overland LTO changer.
I don't have specific knowledge on these controllers, but I could certainly
test-boot a -current kernel and see if I can still read tapes (the server is
running netbsd-8 at this time)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: linux emul and newer glibc

2021-06-29 Thread Manuel Bouyer

On Mon, Jun 28, 2021 at 08:15:29AM +0900, Rin Okuyama wrote:
> Hi,
> 
> On 2021/06/28 2:40, Manuel Bouyer wrote:
> > Hello,
> > I'm trying to run a binary which wants GLIBCXX_3.4.21, while with the suse
> > packages we have GLIBCXX_3.4.19. Before I try grabbing newer libraries,
> > has anyone tried to run linux binaries with more recent libraries ?
> 
> For my amd64 machine, GLIBCXX_3.4.28 (from glibc 2.32) works just fine,
> which is extracted manually from Fedora 33 by pkgsrc/pkgtools/rpm2pkg.

indeed it works for me too. Now I need to make it not choke on udev errors ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

linux emul and newer glibc

2021-06-27 Thread Manuel Bouyer

Hello,
I'm trying to run a binary which wants GLIBCXX_3.4.21, while with the suse
packages we have GLIBCXX_3.4.19. Before I try grabbing newer libraries,
has anyone tried to run linux binaries with more recent libraries ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Xen panics in autoconf

2021-06-15 Thread Manuel Bouyer

Hello,
some recent changes broke Xen:
[   2.919] panic: kernel diagnostic assertion "KERNEL_LOCKED_P()" failed: 
file "/usr/src/sys/kern/subr_autoconf.c", line 1039 
[   2.919] cpu0: Begin traceback...
[   2.919] 
vpanic(c05be3c0,cd02ddb8,cd02ddd4,c0415210,c05be3c0,c05be322,c05c6cc1,c05f4284,40f,cd02de18)
 at netbsd:vpanic+0x139
[   2.919] 
kern_assert(c05be3c0,c05be322,c05c6cc1,c05f4284,40f,cd02de18,c0b84b60,0,cd02ddf4,c0415270)
 at netbsd:kern_assert+0x23
[   2.919] 
config_match(c12e6c00,c0b84b60,cd02ded8,c0b84b60,c0b84b60,c12e6c00,cd02de3c,c04155d2,cd02de18,c0428627)
 at netbsd:config_match+0x90
[   2.0100679] 
mapply(cd02de18,c0428627,0,c12e5dc0,,c0cb6524,cd02de18,0,c12e6c00,0) at 
netbsd:mapply+0x50
[   2.0100679] 
config_vsearch(c12e6c00,cd02ded8,,cd02dea0,cd02de88,c01120af,0,c1460278,cd02dee6,c05c06cd)
 at netbsd:config_vsearch+0x212
[   2.0100679] 
config_vfound(c12e6c00,cd02ded8,c0111f20,,cd02dea0,cd02df00,c0112638,c12e6c00,cd02ded8,c0111f20)
 at netbsd:config_vfound+0x2f
[   2.0100679] 
config_found(c12e6c00,cd02ded8,c0111f20,,a,8,0,c12ddf54,0,c12c21f0) at 
netbsd:config_found+0x2d
[   2.0100679] 
xenbus_probe_device_type(cd02df2e,1e,c05c0734,c12ddf54,cd02df24,4,c12ddf44,c12ddf44,2,6564)
 at netbsd:xenbus_probe_device_type+0x498
[   2.0100679] 
xenbus_probe_frontends.isra.0(60,2,0,c0113430,0,c05bec62,0,0,cd02df9c,c0112e65) 
at netbsd:xenbus_probe_frontends.isra.0+0xbb
[   2.0100679] 
xenbus_probe(0,c05c076d,6,c1453c00,0,c0102031,c1453c00,d99000,c0b92200,0) at 
netbsd:xenbus_probe+0x2d
[   2.0100679] xenbus_probe_init(c1453c00,d99000,c0b92200,0,c0100084,0,0,0,0,0) 
at netbsd:xenbus_probe_init+0x85
[   2.0100679] cpu0: End traceback...

Any idea what changed recently ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: booting xen [was Re: serial console puzzle]

2021-04-30 Thread Manuel Bouyer

On Fri, Apr 30, 2021 at 07:28:57PM +0100, Patrick Welche wrote:
> On Fri, Apr 30, 2021 at 07:00:38PM +0200, Manuel Bouyer wrote:
> > On Fri, Apr 30, 2021 at 05:55:37PM +0100, Patrick Welche wrote:
> > > no luck. I see loading /netbsd-XEN3_DOM0, and then it just reboots.
> > > Nothing more appears on the console. (-current XEN, xen.gz from 
> > > xenkernel415)
> > 
> > Try xen-debug.gz ?
> > Do you get the Xen boot messages ?
> 
> I don't get the Xen boot messages. Just tried xen-debug.gz and again I just
> see loading, and then a reboot. I don't think it gets as far xen*.gz.
> 
> boot.cfg contains:
> 
> menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load 
> /netbsd-XEN3_
> DOM0 console=com1 com1=57600,8n1,0x3f8;multiboot /xen-debug.gz dom0_mem=1024M

should probably be:
menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load /netbsd-XEN3_DOM0 console=com0;multiboot /xen-debug.gz dom0_mem=1024M console=com1 com1=57600,8n1,0x3f8

(it should really be console=com0 for NetBSD: it doesn't access the hardware
and uses the I/O services from the hypervisor)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: serial console puzzle

2021-04-30 Thread Manuel Bouyer

On Fri, Apr 30, 2021 at 05:55:37PM +0100, Patrick Welche wrote:
> On Fri, Apr 30, 2021 at 04:52:41PM +0100, Patrick Welche wrote:
> > On Fri, Apr 30, 2021 at 05:23:54PM +0200, Manuel Bouyer wrote:
> > > On Fri, Apr 30, 2021 at 04:18:49PM +0100, Patrick Welche wrote:
> > > > On Fri, Apr 30, 2021 at 05:04:34PM +0200, Manuel Bouyer wrote:
> > > > > On Fri, Apr 30, 2021 at 03:44:46PM +0100, Patrick Welche wrote:
> > > > > > In /boot.cfg:
> > > > > > 
> > > > > > menu=Boot normally:rndseed /var/db/entropy-file;consdev 
> > > > > > com0,57600;boot
> > > > > > 
> > > > > > # installboot -ve /dev/rsd0a
> > > > > > File system: /dev/rsd0a
> > > > > > Boot options:timeout 5, flags 0, speed 57600, ioaddr 0, 
> > > > > > console com0
> > > > > > 
> > > > > > Yet in dmesg:
> > > > > > 
> > > > > > com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO
> > > > > > com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO
> > > > > > com1: console
> > > > > > 
> > > > > > (so I don't actually see anything)
> > > > > > 
> > > > > > (Wednesday's -current/amd64)
> > > > > > 
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > one possibility is that the bios has com0 and com1 swapped.
> > > > > In some cases I had to explicitly set ioaddr with installboot to have
> > > > > the serial console working.
> > > > 
> > > > I should have said: according to the BIOS "COM A" is 0x3f8, and "COM B"
> > > > is 0x2f8, so they are the right way around.
> > > 
> > > I've seen BIOSes report it the right way around in setup, but the wrong way
> > > to the boot loader.
> > > In such cases an explicit ioaddr did help.
> > 
> > Indeed - it did!
> > 
> > # installboot -ve /dev/rsd0a
> > File system: /dev/rsd0a
> > Boot options:timeout 5, flags 0, speed 57600, ioaddr 3f8, console 
> > com0
> > 
> > com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO
> > com0: console
> > com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO
> > 
> > now for xen...
> 
> no luck. I see loading /netbsd-XEN3_DOM0, and then it just reboots.
> Nothing more appears on the console. (-current XEN, xen.gz from xenkernel415)

Try xen-debug.gz ?
Do you get the Xen boot messages ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: serial console puzzle

2021-04-30 Thread Manuel Bouyer

On Fri, Apr 30, 2021 at 04:18:49PM +0100, Patrick Welche wrote:
> On Fri, Apr 30, 2021 at 05:04:34PM +0200, Manuel Bouyer wrote:
> > On Fri, Apr 30, 2021 at 03:44:46PM +0100, Patrick Welche wrote:
> > > In /boot.cfg:
> > > 
> > > menu=Boot normally:rndseed /var/db/entropy-file;consdev com0,57600;boot
> > > 
> > > # installboot -ve /dev/rsd0a
> > > File system: /dev/rsd0a
> > > Boot options:timeout 5, flags 0, speed 57600, ioaddr 0, console 
> > > com0
> > > 
> > > Yet in dmesg:
> > > 
> > > com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO
> > > com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO
> > > com1: console
> > > 
> > > (so I don't actually see anything)
> > > 
> > > (Wednesday's -current/amd64)
> > > 
> > > 
> > > Thoughts?
> > 
> > one possibility is that the bios has com0 and com1 swapped.
> > In some cases I had to explicitly set ioaddr with installboot to have
> > the serial console working.
> 
> I should have said: according to the BIOS "COM A" is 0x3f8, and "COM B"
> is 0x2f8, so they are the right way around.

I've seen BIOSes report it the right way around in setup, but the wrong way
to the boot loader.
In such cases an explicit ioaddr did help.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: serial console puzzle

2021-04-30 Thread Manuel Bouyer

On Fri, Apr 30, 2021 at 03:44:46PM +0100, Patrick Welche wrote:
> In /boot.cfg:
> 
> menu=Boot normally:rndseed /var/db/entropy-file;consdev com0,57600;boot
> 
> # installboot -ve /dev/rsd0a
> File system: /dev/rsd0a
> Boot options:timeout 5, flags 0, speed 57600, ioaddr 0, console com0
> 
> Yet in dmesg:
> 
> com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO
> com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO
> com1: console
> 
> (so I don't actually see anything)
> 
> (Wednesday's -current/amd64)
> 
> 
> Thoughts?

one possibility is that the bios has com0 and com1 swapped.
In some cases I had to explicitly set ioaddr with installboot to have
the serial console working.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: make fails to build on linux

2021-04-17 Thread Manuel Bouyer

On Sat, Apr 17, 2021 at 10:44:36PM +0200, Jaromír Doleček wrote:
> On Sat, Apr 17, 2021 at 19:49, Manuel Bouyer wrote:
> >
> > On Sat, Apr 17, 2021 at 07:25:58PM +0200, Manuel Bouyer wrote:
> > > Hello
> > > trying a build.sh tools on linux I got:
> > > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:
> > > In function '__regex_wctype':
> > > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2:
> > >  error: 'for' loop initial declarations are only allowed in C99 mode
> > >   for (size_t i = 0; i < __arraycount(wctypes); i++) {
> > > ^
> > > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:2
> > > 54:2: note: use option -std=c99 or -std=gnu99 to compile your code
> > >
> > > What is the right fix for this ?
> > >
> > > For now I just moved the declaration outside of the loop
> >
> > Well, the build fails later with the same error.
> > Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe
> > this should be the default ?
> 
> I think it would be sensible to use -std=c99 by default, yes. It's

it has to be gnu99; I tried c99 and it failed with some types not defined.

> strange that the Linux toolchain refuses it by default, do we force
> some other -std flag by default now by chance?

AFAIK no. But the toolchain on RHEL7 is quite old: 
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: make fails to build on linux

2021-04-17 Thread Manuel Bouyer

On Sat, Apr 17, 2021 at 10:54:48AM -0700, Jason Thorpe wrote:
> 
> > On Apr 17, 2021, at 10:48 AM, Manuel Bouyer  wrote:
> > 
> > Well, the build fails later with the same error.
> > Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe
> > this should be the default ?
> 
> Just fix the code to not use that style of declaration?

Some of them are in imported code (gnu toolchain); this is why I didn't
try to fix it

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: make fails to build on linux

2021-04-17 Thread Manuel Bouyer

On Sat, Apr 17, 2021 at 07:25:58PM +0200, Manuel Bouyer wrote:
> Hello
> trying a build.sh tools on linux I got:
> /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:
>  
> In function '__regex_wctype':
> /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2:
>  error: 'for' loop initial declarations are only allowed in C99 mode
>   for (size_t i = 0; i < __arraycount(wctypes); i++) {
> ^
> /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:2
> 54:2: note: use option -std=c99 or -std=gnu99 to compile your code
> 
> What is the right fix for this ?
> 
> For now I just moved the declaration outside of the loop

Well, the build fails later with the same error.
Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe
this should be the default ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

make fails to build on linux

2021-04-17 Thread Manuel Bouyer

Hello
trying a build.sh tools on linux I got:
/dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c: 
In function '__regex_wctype':
/dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2:
 error: 'for' loop initial declarations are only allowed in C99 mode
  for (size_t i = 0; i < __arraycount(wctypes); i++) {
^
/dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:2
54:2: note: use option -std=c99 or -std=gnu99 to compile your code

What is the right fix for this ?

For now I just moved the declaration outside of the loop
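
The workaround amounts to the pattern below: the loop counter is declared at
the top of the block instead of inside the for statement, which older
compilers in their default C89/gnu89 mode accept (GCC releases before 5
default to gnu89-style rules). The table and names are placeholders for the
example, not the actual regcomp.c code.

```c
#include <stddef.h>
#include <string.h>

#define ARRAY_COUNT(a)	(sizeof(a) / sizeof((a)[0]))

/* Placeholder table standing in for the real list of class names. */
static const char *const class_names[] = {
	"alnum", "alpha", "digit", "space"
};

/* Return the index of `name` in class_names, or -1 if it is not listed. */
int
class_index(const char *name)
{
	size_t i;	/* declared before the loop: valid in C89 and C99 */

	for (i = 0; i < ARRAY_COUNT(class_names); i++) {
		if (strcmp(class_names[i], name) == 0)
			return (int)i;
	}
	return -1;
}
```

Writing `for (size_t i = 0; ...)` instead is what triggers the "only allowed
in C99 mode" error unless -std=c99 or -std=gnu99 is passed.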

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: running xen on current

2021-04-15 Thread Manuel Bouyer

On Thu, Apr 15, 2021 at 01:39:37PM +0100, Patrick Welche wrote:
> On Thu, Apr 15, 2021 at 07:28:32AM -0400, Brad Spencer wrote:
> > Manuel Bouyer  writes:
> > 
> > > On Thu, Apr 15, 2021 at 09:53:50AM +0100, Patrick Welche wrote:
> > >> I have tried and failed to run xen on 3 -current/amd64 systems with
> > >> 3 different failure modes:
> > >> 
> > >> 1) laptop:  xen.gz Building a PV Dom0 / ELF: not an ELF binary -> 
> > >> panic/reboot
> > >> 2) desktop: XEN3_DOM0 panics including PR port-xen/55978
> > >> 3) server:  Trampoline space cannot be allocated; will try fallback -> 
> > >> reboot
> > >> 
> > >> They are all working NetBSD-current/amd64 systems.
> > >> 
> > >> My conclusion was that xen is hopelessly broken, so was quite surprised
> > >> by Greg Wood's thread about the finer points of running a guest OS, given
> > >> that those systems won't even start the host OS.
> > >> 
> > >> I dug out an old desktop, and to my pleasant surprise it booted 
> > >> XEN3_DOM0,
> > >> and I have managed to run some XEN3_DOMUs.
> > >> 
> > >> The difference between the working/broken setups seems to be that the
> > >> working one is "BIOS" booting rather than EFI booting.
> > >> 
> > >> Among all your xen success stories, are any of you EFI booting?
> > >
> > > AFAIK EFI is not yet supported by Xen (maybe this is supported by 4.15,
> > > I've not had a chance to try yet). I have it running on fairly recent
> > > Dell servers (in BIOS mode)
> > 
> > 
> > There has been fiddling with Xen and EFI for quite some time.  See:
> > 
> > https://wiki.xenproject.org/wiki/Xen_EFI
> > 
> > for example (might be old)... this indicates that Xen 4.3 or later could
> > be built as a EFI binary and probably booted from the EFI firmware
> > directly or with grub2 when grub2 is a EFI binary itself.  Of course
> > those instructions are all Linux-centric and I don't know if you created
> > a Xen kernel like this if it would boot a NetBSD DOM0 kernel.  I am in
> > no position to try any tests with this right now personally, but it is
> > tempting as I have a EFI only laptop that I could probably replace the
> > hard drive temporarily.
> 
> Looking at
> 
>   https://xenproject.org/2021/04/08/xen-project-hypervisor-4-15/
> 
> (so 4.15 only just came out!) I see
> 
>   Unified boot images: It is now possible to create an image bundling
>   together files needed for Xen to boot into a single EFI binary;
>   making it now possible to boot a functional Xen system directly
>   from the EFI boot manager, rather than having to go through grub
>   multiboot.  Files that can be bundled include a hypervisor, dom0
>   kernel, dom0 initrd, Xen KConfig, XSM configuration, and a device
>   tree.
> 
> I thought that "go through grub multiboot" was the equivalent of our
> boot.cfg "multiboot /xen.gz dom0_mem=1024M", but apparently not?

It should be; but there are probably differences between BIOS and EFI, even
when using multiboot (the way to access the console, or find the ACPI
tables, may be different, for example)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: running xen on current

2021-04-15 Thread Manuel Bouyer

On Thu, Apr 15, 2021 at 09:53:50AM +0100, Patrick Welche wrote:
> I have tried and failed to run xen on 3 -current/amd64 systems with
> 3 different failure modes:
> 
> 1) laptop:  xen.gz Building a PV Dom0 / ELF: not an ELF binary -> panic/reboot
> 2) desktop: XEN3_DOM0 panics including PR port-xen/55978
> 3) server:  Trampoline space cannot be allocated; will try fallback -> reboot
> 
> They are all working NetBSD-current/amd64 systems.
> 
> My conclusion was that xen is hopelessly broken, so was quite surprised
> by Greg Wood's thread about the finer points of running a guest OS, given
> that those systems won't even start the host OS.
> 
> I dug out an old desktop, and to my pleasant surprise it booted XEN3_DOM0,
> and I have managed to run some XEN3_DOMUs.
> 
> The difference between the working/broken setups seems to be that the
> working one is "BIOS" booting rather than EFI booting.
> 
> Among all your xen success stories, are any of you EFI booting?

AFAIK EFI is not yet supported by Xen (maybe this is supported by 4.15,
I've not had a chance to try yet). I have it running on fairly recent
Dell servers (in BIOS mode)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Manuel Bouyer

On Sat, Apr 10, 2021 at 03:17:35PM -0700, Greg A. Woods wrote:
> [...]
> # fdisk -F /images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
> Disk: /images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
> NetBSD disklabel disk geometry:
> cylinders: 49, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
> total sectors: 791121, bytes/sector: 512
> 
> BIOS disk geometry:
> cylinders: 49, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
> total sectors: 791121
> 
> Partitions aligned to 16065 sector boundaries, offset 63
> 
> Partition table:
> 0: EFI system partition (sysid 239)
> start 1, size 1600 (1 MB, Cyls 0/0/2-0/25/26)
> 1: FreeBSD or 386BSD or old NetBSD (sysid 165)
> start 1601, size 789520 (386 MB, Cyls 0/25/27-49/62/30), Active
> 2: 
> 3: 
> First active partition: 1
> Drive serial number: 2425393296 (0x90909090)
> 
> # fdisk vnd0
> fdisk: primary partition table invalid, no magic in sector 0
> fdisk: Cannot determine the number of heads
> Disk: /dev/rvnd0d
> NetBSD disklabel disk geometry:
> cylinders: 4096, heads: 64, sectors/track: 32 (2048 sectors/cylinder)
> total sectors: 8388608, bytes/sector: 512
> 
> BIOS disk geometry:
> cylinders: 522, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
> total sectors: 8388608
> 
> Partitions aligned to 16065 sector boundaries, offset 63
> 
> Partition table:
> 0: 
> 1: 
> 2: 
> 3: 
> Bootselector disabled.
> No active partition.
> Drive serial number: 0 (0x)

I can't reproduce this fdisk/disklabel on netbsd-9 nor -current.
fdisk on vnd0 gives me the same partition table as on the file.
FreeBSD fails to boot with the same error message.
The size of the disk is indeed 790528 in the xenstore (and the dom0's
kernel message) but I don't know where this comes from.
xbdback uses getdiskinfo() to get the device's size.
In vnd, the size comes from a VOP_GETATTR() on the file, so it looks
like VOP_GETATTR() returns the wrong size.
The file is definitely 791121 sectors long:
#dd if=FreeBSD-12.2-RELEASE-amd64-mini-memstick.img.orig of=FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
791121+0 records in
791121+0 records out
#ls -l FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
-rw-r--r--  1 root  wheel  405053952 Apr 11 11:56 FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
#expr 405053952 / 512
791121
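
The same arithmetic as a small C sketch (function names are made up for
the example). As a side observation only: 790528 happens to be 791121
rounded down to a multiple of 2048 sectors (1 MiB). Whether that rounding
is what actually happens somewhere in the vnd/xbdback path is not
established here.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Number of whole 512-byte sectors in a file of the given size. */
uint64_t
bytes_to_sectors(uint64_t bytes)
{
	return bytes / SECTOR_SIZE;
}

/* Round a sector count down to a multiple of 'align' sectors. */
uint64_t
round_down_sectors(uint64_t sectors, uint64_t align)
{
	assert(align != 0);
	return sectors - (sectors % align);
}
```

bytes_to_sectors(405053952) gives 791121, and
round_down_sectors(791121, 2048) gives 790528.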

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: mail/sendmail not relaying on netbsd-9/sparc, problem with OpenSSL update?

2021-04-07 Thread Manuel Bouyer

On Wed, Apr 07, 2021 at 11:47:10AM -0500, John D. Baker wrote:
> Dropping pkgsrc-users@ as it appears not to be a pkgsrc problem.
> 
> On Wed, 7 Apr 2021, Martin Husemann wrote:
> 
> > On Wed, Apr 07, 2021 at 11:26:05AM -0500, John D. Baker wrote:
> > > 
> > > (gdb) run -odi -v -q
> > > Starting program: /usr/sbin/sendmail -odi -v -q
> > > process 867 is executing new program: /usr/pkg/libexec/sendmail/sendmail
> > > 
> > > Program received signal SIGILL, Illegal instruction.
> > > 0xedd6d40c in _sparcv9_vis1_probe () from /usr/lib/libcrypto.so.14
> > > (gdb) bt
> > 
> > This is normal, you should be able to "continue" from it.
> > The library catches the SIGILL and avoids the instruction.
> 
> ISTR that I tried that and simply got the SIGILL again.  Maybe that
> was from a later sparcV9 instruction...
> 
> In any case, while one may be able to do that in 'gdb', when running
> normally, it is fatal and there is no recourse.  Odd that it doesn't
> dump core.

It should not be fatal. The library traps SIGILL specially to test for
instructions.

Does the program really exit if you hit 'continue' in gdb ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: cmake hang ... again

2021-04-06 Thread Manuel Bouyer

On Tue, Apr 06, 2021 at 02:24:50PM +0100, Chavdar Ivanov wrote:
> Hi,
> 
> It may or may not be linked to the recent rather enthralling
> discussion about the entropy; I don't know. I've asked for ideas in
> the past, but couldn't figure out what to do if it hits me again.
> 
> Usually I run -current on amd64, updating the systems on average 2-3
> times a week; I also use pkgsrc-head and again, 2-3 times a month I
> cvs update my pkgsrc tree, together with a ' git pull' in wip, and I
> run 'pkg_rolling-replace'.
> 
> Each and every run of pkg_rolling-replace gets me to a seemingly
> identical hang in cmake in a single package - misc/kdepim4 , in
> apparently the same spot. with similar trace. Attaching to the process

I see the same thing in bulk builds, with various kde packages.
When I asked, I was told that this is a known issue, but without a fix ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Manuel Bouyer

On Mon, Apr 05, 2021 at 09:30:16AM -0700, Greg A. Woods wrote:
> At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer  
> wrote:
> Subject: Re: regarding the changes to kernel entropy gathering
> >
> > If I understood it properly, there's no need for such a knob.
> > echo 0123456789abcdef0123456789abcdef > /dev/random
> >
> > will get you back to the state we had in netbsd-9, with (pseudo-)randomness
> > collected from devices.
> 
> Well, no, not quite so much randomness.  Definitely pseudo though!
> 
> My patch on the other hand can at least inject some real randomness into
> the entropy pool, even if it is observable or influenceable by nefarious
> dudes who might be hiding out in my garage.

As I understand it, once /dev/random has been seeded, randomness
from other devices will be taken into account (with or without your patch).

In your case, /dev/random reads did block because it didn't get
an initial seed.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: how do I mount a read-only filesystem from the "root device" prompt?

2021-04-05 Thread Manuel Bouyer

On Sun, Apr 04, 2021 at 03:13:35PM -0700, Greg A. Woods wrote:
> I would think it's not just CDs and hypervisor-provided virtual devices
> that can have multiple partitions, use wedges, and yet be read-only.
> 
> Are not a wide variety of removable storage devices also capable of
> being made "read-only" at the hardware level?

At least some SCSI devices had a pin to make them read-only. I used this
to build ssh gateways in the past ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Manuel Bouyer

On Sun, Apr 04, 2021 at 06:47:23PM -0700, Brian Buhrow wrote:
> Hello.  As I understand it, Greg ran into this problem on a xen domu.  In 
> checking my NetBSD-9
> system running as a domu under xen-4.14.1, there is no rdrand or rdseed 
> feature exposed to
> domu's by xen.  This observation is confirmed by looking at the xen command 
> line reference
> page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

Actually, if the CPU supports rdrand or rdseed, they are available
to domUs:
cpu0: Running on hypervisor: Xen
cpu0: "Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz"
cpu0: Intel Xeon Scalable (Skylake, Cascade Lake, Copper Lake) (686-class)
cpu0: family 0x6 model 0x55 stepping 0x7 (id 0x50657)
[...]
cpu0: features1 0xf6f81203
cpu0: features2 0x810
cpu0: features5 0xd18f2369

Source                 Bits Type      Flags
xbd0                4010273 disk estimate, collect, v, t, dt
xennet0                   0 net  v, t, dt
cpu0                  88774 vm   estimate, collect, v, t, dv
system-power              0 power estimate, collect, v, t, dt
autoconf                  1 ???  estimate, collect, t, dt
printf                    0 ???  collect
callout                 108 skew estimate, collect, v, dv
cpurng                 4096 rng  estimate, collect, v
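
To read the two hex words above: assuming features1 is CPUID leaf 1 %ecx
and features5 is CPUID leaf 7 %ebx (my understanding of how NetBSD labels
them, not something stated in this thread), RDRAND is bit 30 of the former
and RDSEED is bit 18 of the latter. A small sketch with made-up names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPUID_1_ECX_RDRAND (UINT32_C(1) << 30) /* CPUID leaf 1, %ecx bit 30 */
#define CPUID_7_EBX_RDSEED (UINT32_C(1) << 18) /* CPUID leaf 7, %ebx bit 18 */

/* Hypothetical result type for this example. */
struct rng_features {
	int rdrand;
	int rdseed;
};

/* Decode the two feature words into RDRAND/RDSEED presence flags. */
void
decode_rng_features(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
    struct rng_features *out)
{
	assert(out != NULL);
	out->rdrand = (leaf1_ecx & CPUID_1_ECX_RDRAND) != 0;
	out->rdseed = (leaf7_ebx & CPUID_7_EBX_RDSEED) != 0;
}
```

With 0xf6f81203 and 0xd18f2369 both flags come out set, which is
consistent with the cpurng source in the listing.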


-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Manuel Bouyer

On Mon, Apr 05, 2021 at 01:16:56AM +, RVP wrote:
> [...]
> Hmm. I have to say, that now I find myself not disagreeing with Greg's
> point of view: Maybe NetBSD's default is too strict and a knob like
> kern.entropy.use_pooh_poohed_sources=1 would not be a bad thing for
> some users--with all appropriate sysinst warnings of course.

If I understood it properly, there's no need for such a knob.
echo 0123456789abcdef0123456789abcdef > /dev/random

will get you back to the state we had in netbsd-9, with (pseudo-)randomness
collected from devices.
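
For completeness, the same thing from C rather than the shell. This is
only a sketch with a made-up function name: it does a single write() of
the given bytes to the given path, which is all the echo does. The fixed
string is of course not secret, so this gives no real randomness by
itself.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write 'len' bytes from 'buf' to 'path' with a single write().
 * Returns 0 on success, -1 if the open or the write fails.
 */
int
write_seed(const char *path, const void *buf, size_t len)
{
	ssize_t n;
	int fd;

	assert(path != NULL && buf != NULL);
	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -1;
	n = write(fd, buf, len);
	close(fd);
	return (n == (ssize_t)len) ? 0 : -1;
}
```

The equivalent of the echo would be
write_seed("/dev/random", "0123456789abcdef0123456789abcdef\n", 33).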

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-04-01 Thread Manuel Bouyer

On Thu, Apr 01, 2021 at 04:13:59AM +, RVP wrote:
> > [...]
> 
> Does this /etc/entropy-file match what's there in your /boot.cfg?

irrelevant for Xen, as Xen uses the multiboot protocol.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-04-01 Thread Manuel Bouyer

On Wed, Mar 31, 2021 at 09:58:48PM -0400, Thor Lancelot Simon wrote:
> On Wed, Mar 31, 2021 at 11:24:07AM +0200, Manuel Bouyer wrote:
> > On Tue, Mar 30, 2021 at 10:42:53PM +, Taylor R Campbell wrote:
> > > 
> > > There are no virtual RNG devices on the system in question, according
> > > to the quoted `rndctl -l' output.  Perhaps the VM host needs to be
> > > taught to expose a virtio-rng device to the guest?
> > 
> > There is no such thing in Xen.
> 
> Is the CPU so old that it doesn't have RDRAND / RDSEED, or is Xen perhaps
> masking these CPU features from the guest?

Is there an easy way to test, on a netbsd-9 system, if the instruction is
present and working ?
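
One possible user-space check, as a sketch: it assumes GCC or Clang and
their <cpuid.h> helpers, and the function names are made up. It only
tells whether CPUID advertises the instructions to this (guest)
environment, not whether they actually return good data. I believe
"cpuctl identify 0" prints the same feature bits.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Extract bit 'n' (0..31) of a 32-bit register value. */
int
reg_bit(unsigned int reg, unsigned int n)
{
	assert(n < 32);
	return (int)((reg >> n) & 1u);
}

/* 1 if RDRAND is advertised, 0 if not, -1 if it cannot be determined. */
int
cpu_has_rdrand(void)
{
#ifdef HAVE_X86_CPUID
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return reg_bit(ecx, 30);	/* CPUID leaf 1, %ecx bit 30 */
#else
	return -1;
#endif
}

/* 1 if RDSEED is advertised, 0 if not, -1 if it cannot be determined. */
int
cpu_has_rdseed(void)
{
#ifdef HAVE_X86_CPUID
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return -1;
	return reg_bit(ebx, 18);	/* CPUID leaf 7, %ebx bit 18 */
#else
	return -1;
#endif
}
```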

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-31 Thread Manuel Bouyer

On Tue, Mar 30, 2021 at 10:42:53PM +, Taylor R Campbell wrote:
> > Date: Tue, 30 Mar 2021 23:53:43 +0200
> > From: Manuel Bouyer 
> > 
> > On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > > [...]
> > > 
> > > Perhaps the answer is that nothing seems to be contributing anything to
> > > the entropy pool.  No matter what device I exercise, none of the numbers
> > > in the following changes:
> > 
> > yes, it's been this way since the rnd rototill. Virtual devices are
> > not trusted.
> > 
> > The only way is to manually seed the pool.
> 
> This is false.  The virtual RNG drivers (viornd(4) [1], rump
> hyperentropy [2], maybe others) all assume the VM host provides
> samples with full entropy.  This has always been the case, and this
> didn't change at all in the rototill last year.
> 
> There are no virtual RNG devices on the system in question, according
> to the quoted `rndctl -l' output.  Perhaps the VM host needs to be
> taught to expose a virtio-rng device to the guest?

There is no such thing in Xen.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Manuel Bouyer

On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> [...]
> 
> Perhaps the answer is that nothing seems to be contributing anything to
> the entropy pool.  No matter what device I exercise, none of the numbers
> in the following changes:

yes, it's been this way since the rnd rototill. Virtual devices are
not trusted.

The only way is to manually seed the pool.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: xen-tools 4.13.1 build failure

2020-12-21 Thread Manuel Bouyer

On Mon, Oct 12, 2020 at 04:53:14PM +0100, Chavdar Ivanov wrote:
> Hi,
> Another xentools413 build failure. It has been failing for me the last
> two weeks or so, failing to build seabios, as follows:
> 
> gmake[5]: Entering directory
> '/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/firmware'
> /usr/pkg/bin/gmake -C seabios-dir CC=gcc LD=ld PYTHON=python3.7
> EXTRAVERSION="-Xen" all;
> gmake[6]: Entering directory
> '/usr/pkgsrc/sysutils/xentools413/work/seabios-rel-1.12.1'
>   Linking out/rom.o
> ld -N -T out/romlayout32flat.lds out/rom16.strip.o
> out/rom32seg.strip.o out/code32flat.o -o out/rom.o
> ld: out/code32flat.o: in function `memmove':
> /usr/pkgsrc/sysutils/xentools413/work/seabios-rel-1.12.1/./src/string.c:206:
> undefined reference to `memcpy'

strange, I don't get this on my test machine (on netbsd-9) ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: install from CD fails on Xen

2020-10-27 Thread Manuel Bouyer

On Tue, Oct 27, 2020 at 10:14:45AM +0100, Martin Husemann wrote:
> On Tue, Oct 27, 2020 at 09:42:41AM +0100, Manuel Bouyer wrote:
> > Hello,
> > in tests from 2020-10-25:
> > http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/
> > anita fails with
> > Could not locate a CD medium in any drive with the distribution sets
> > (for both amd64 and i386)
> > 
> > martin, could you please have a look ?
> 
> Sure, will look at it - this is with the stock ISO provided as read-only
> xbd(4)?

Yes.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

install from CD fails on Xen

2020-10-27 Thread Manuel Bouyer

Hello,
in tests from 2020-10-25:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/
anita fails with
Could not locate a CD medium in any drive with the distribution sets
(for both amd64 and i386)

martin, could you please have a look ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: xen-tools 4.13.1 build failure

2020-08-31 Thread Manuel Bouyer

On Sun, Aug 30, 2020 at 04:54:18PM +0100, Chavdar Ivanov wrote:
> Hi,
> 
> Trying to build xentools-4.13.1 under -current:
> 
> gcc -I/usr/pkg/include -I/usr/include -I/usr/pkg/include/python3.7
> -I/usr/pkg/include/glib-2.0 -I/usr/pkg/include/gio-unix-2.0
> -I/usr/pkg/lib/glib-2.0/include -I/usr/X11R7/include
> -D_XOPEN_SOURCE_EXTENDED=1 -I/usr/pkg/include/ncurses -DPIC -O2
> -I/usr/pkg/include -I/usr/include -I/usr/pkg/include/python3.7
> -I/usr/pkg/include/glib-2.0 -I/usr/pkg/include/gio-unix-2.0
> -I/usr/pkg/lib/glib-2.0/include -I/usr/X11R7/include
> -D_XOPEN_SOURCE_EXTENDED=1 -I/usr/pkg/include/ncurses -m64 -DBUILD_ID
> -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes
> -Wdeclaration-after-statement -Wno-unused-but-set-variable
> -Wno-unused-local-typedefs   -m64 -DBUILD_ID -fno-strict-aliasing
> -std=gnu99 -Wall -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdirs-all.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
> -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdir-all-libs.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99
> -Wall -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdirs-all.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
> -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdir-all-evtchn.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99
> -Wall -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .build.d   -Werror -Wmissing-prototypes -I./include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toollog/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toolcore/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include
> -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
> -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .netbsd.opic.d   -Werror -Wmissing-prototypes -I./include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toollog/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toolcore/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include
>  -fPIC -c -o netbsd.opic netbsd.c
> netbsd.c:30:10: fatal error: xen/xenio3.h: No such file or directory
>  #include <xen/xenio3.h>
>   ^~
> compilation terminated.
> netbsd.c:30:10: fatal error: xen/xenio3.h: No such file or directory

This header is in src/sys/arch/xen/include; it should be installed along with
xenio.h.
I just committed a fix for this.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

*.fr.netbsd.org downtime

2020-06-30 Thread Manuel Bouyer

Hello,
I will be upgrading the storage on {ftp,www,rsync,anoncvs}.fr.netbsd.org
in the next 2 days. This will require several reboots and service
interruptions while data is being moved around.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: regression: xen domU no longer supports "type=cdrom" vbd disk

2020-06-08 Thread Manuel Bouyer

On Mon, Jun 08, 2020 at 02:26:39PM -0700, Greg A. Woods wrote:
> I use xl.cfg "disk" entries like the following to mount a virtual CDROM
> in a Xen domU:
> 
> 'format=raw, vdev=0x5, access=ro, devtype=cdrom, 
> target=/images/NetBSD-9.0-amd64.iso'
> 
> However since upgrading my -current source tree I've been seeing:
> 
>   xenbus0: ignoring device/vbd/4 type cdrom
> 
> As shown in this patch I had to comment out the core of the mentioned
> change to be able to use an ISO image again as a virtual CDROM again:

Actually this change matches what other OSes do with 'devtype=cdrom';
we were the outlier here.

For PV or PVH domUs you can omit the devtype keyword, it's only
needed for HVM guests (if you want to boot from the cdrom image).

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

xen panic

2020-05-30 Thread Manuel Bouyer

Hello,
build from 202005272200Z panics on Xen, on both i386 and amd64 but for
different reasons: http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/

i386 fails early at boot with:
[   1.000] panic: kernel diagnostic assertion "!*zpte" failed: file "/home/source/ab/HEAD/src/sys/arch/x86/x86/pmap.c", line 3832 pmap_zero_page: lock botch
[   1.000] cpu0: Begin traceback...
[   1.000] vpanic(c059333c,c0da7e00,c0da7e2c,c0127fb3,c059333c,c058e222,c0593bf7,c0592904,ef8,1) at netbsd:vpanic+0x134
[   1.000] kern_assert(c059333c,c058e222,c0593bf7,c0592904,ef8,1,bfe07090,8000,0,c17a8830) at netbsd:kern_assert+0x23
[   1.000] pmap_zero_page(1ffef000,0,c03d53c5,c0c875aa,c0c862b0,1,8,c0c875aa,2,1) at netbsd:pmap_zero_page+0x1e3
[   1.000] uvm_pagealloc_strat(0,0,0,0,3,0,0,15554000,1ffee000,0) at netbsd:uvm_pagealloc_strat+0x2d6
[   1.000] pmap_get_physpage(8,1,3abee003,1,10002,8,8,8,28,c) at netbsd:pmap_get_physpage+0x203
[   1.000] pmap_growkernel(d6cfd000,c05b90ea,c17a9000,15554000,1000,0,0,0,0,10002) at netbsd:pmap_growkernel+0xce
[   1.000] uvm_km_bootstrap(c17a9000,f560,0,c17a9000,f560,c0da7fb0,c055f14a,e,3,9) at netbsd:uvm_km_bootstrap+0x2c8
[   1.000] uvm_init(e,3,9,2,0,0,c0da5000,7ff,c0e1b000,756e6547) at netbsd:uvm_init+0x63

amd64 can boot and run tests, but panics with:
kernel/t_trapsignal (97/860): 20 test cases
bus_handle: [0.193910s] Passed.
bus_handle_recurse: [0.201020s] Passed.
bus_ignore: [0.200598s] Passed.
bus_mask: [0.199164s] Passed.
bus_simple: [0.199066s] Passed.
fpe_handle: [0.210561s] Passed.
fpe_handle_recurse: [ 872.0704774] panic: kernel diagnostic assertion "curlwp->l_md.md_flags & MDL_FPU_IN_CPU" failed: file "/home/source/ab/HEAD/src/sys/arch/x86/x86/fpu.c", line 487
[ 872.0704774] cpu0: Begin traceback...
[ 872.0704774] vpanic() at netbsd:vpanic+0x146
[ 872.0704774] kern_assert() at netbsd:kern_assert+0x48
[ 872.0704774] fputrap() at netbsd:fputrap+0x171
[ 872.0704774] cpu0: End traceback...

[ 872.0704774] dumping to dev 168,1 (offset=524254, size=0): not possible
[ 872.0704774] rebooting...

Any idea what could have changed to cause this ?
2020-05-26 08:40 UTC builds did complete tests.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: -current build failure

2020-05-27 Thread Manuel Bouyer

On Wed, May 27, 2020 at 06:54:47PM +0100, Chavdar Ivanov wrote:
> Hi,
> 
> With sources updated about an hour ago I get:
> .
> --- kern-XEN3_DOM0 ---
> /home/sysbuild/amd64/tools/bin/x86_64--netbsd-ld: pintr.o: in function
> `xen_pic_to_gsi':
> pintr.c:(.text+0x78): undefined reference to `msipic_get_pci_info'
> /home/sysbuild/amd64/tools/bin/x86_64--netbsd-ld: pci_intr_machdep.o:
> in function `pci_intr_release':
> pci_intr_machdep.c:(.text+0x775): undefined reference to 
> `x86_pci_msix_release'

Did you clean the build directory ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: qemu emulated machine crashes due to disk timeouts

2020-05-14 Thread Manuel Bouyer

On Thu, May 14, 2020 at 03:32:51PM +0200, Jaromír Dole?ek wrote:
> [...]
> Seriously though I think that it wouldn't hurt to just bump ATA_DELAY
> to 30 seconds by default.

I don't remember if it's used only for I/O or also for probe.
If the latter, it could take 3x more time to boot ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: i386 Xen integration breaks linking NET4501 kernel

2020-05-10 Thread Manuel Bouyer

On Sun, May 10, 2020 at 02:36:15PM +0200, Rhialto wrote:
> Probably similarly, linking fails when building an amd64 MODULAR kernel,
> with some Xen-related undefined symbol errors:

Yes, I posted a question to tech-kern asking how to resolve this; I've had
no reply so far.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: modload & xen and -current 9.99.60

2020-05-08 Thread Manuel Bouyer

On Fri, May 08, 2020 at 02:55:10PM +0200, Frank Kardel wrote:
> I checked to same kernel in an instance with memory=2048 and it just works.
> 
> Using todays kernel also works woth memory=2048.
> 
> Using memory=65536 for the xen instance gives a surprising familiar
> 
> TEST-A# modload bpfjit
> [  97.4727034] kobj_load, 444: [%M/bpfjit/bpfjit.kmod]: linker error: out of
> memory
> modload: bpfjit: Cannot allocate memory
> TEST-A#
> 
> So it seems to be linked to available memory.
> 
> The more you have the less you get for modload.

It could be a variable overflow somewhere but I can't see how it relates to
64GB. Does it work with 16GB ?
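
To show the kind of overflow meant here (purely an illustration with
made-up function names, not a diagnosis of the modload failure): a byte
count kept in a 32-bit variable still holds 2048 MiB, but wraps to 0 for
both 16 GiB and 64 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a size in MiB to bytes using full 64-bit arithmetic. */
uint64_t
mib_to_bytes64(uint64_t mib)
{
	assert(mib <= (UINT64_MAX >> 20));
	return mib << 20;
}

/*
 * Same conversion, but with the result stored in 32 bits:
 * anything of 4 GiB or more is truncated modulo 2^32.
 */
uint32_t
mib_to_bytes32(uint64_t mib)
{
	return (uint32_t)mib_to_bytes64(mib);
}
```

mib_to_bytes32(2048) is 2147483648, while mib_to_bytes32(16384) and
mib_to_bytes32(65536) are both 0.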

Also could you try with a PVH or HVM guest ? These ones would use modules
from /stand/amd64/ and not /stand/amd64-xen/ and should be close to native.

I don't have a box with that much RAM to test ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: modload & xen and -current 9.99.60

2020-05-07 Thread Manuel Bouyer

On Thu, May 07, 2020 at 09:50:18PM +0200, Frank Kardel wrote:
> see here:
> 
> Alpine: 21:45 ~ [8] sysctl  kern.module.path
> kern.module.path = /stand/amd64-xen/9.99.60/modules

looks good

> Alpine: 21:46 ~ [9] ll /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod
> -r--r--r--  1 root  wheel  34328 May  5 16:58 /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod
> Alpine: 21:46 ~ [10] size /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod
>    text    data     bss     dec     hex filename
>   10399       0       0   10399    289f /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod
> Alpine: 21:46 ~ [11] ll /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod
> -r--r--r--  1 root  wheel  140600 May  5 16:55 /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod
> Alpine: 21:47 ~ [12] size /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod
>    text    data     bss     dec     hex filename
>  132575      16       0  132591   205ef /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod

no problem for me, with sources from today:
xen1:/#modload bpfjit
xen1:/#modstat | grep !$
modstat | grep bpfjit
bpfjit         misc     filesys  -        0     9174 sljit
xen1:/#modload pciverbose
xen1:/#modstat | grep !$
modstat | grep pciverbose
pciverbose     misc     filesys  -        0      218 pci

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: i386 Xen integration breaks GENERIC_PAE kernel build

2020-05-07 Thread Manuel Bouyer

On Thu, May 07, 2020 at 06:46:06AM -0500, John D. Baker wrote:
> Building the GENERIC_PAE kernel from recent -current/i386 fails with:
> 
> [...]
> --- hypervisor.o ---
> /x/current/src/sys/arch/xen/xen/hypervisor.c: In function 'init_xen_early':
> /x/current/src/sys/arch/xen/xen/hypervisor.c:247:27: error: cast to pointer 
> from integer of different size [-Werror=int-to-pointer-cast]
>   HYPERVISOR_shared_info = (void *)(HYPERVISOR_shared_info_pa + KERNBASE);

Should be fixed with hypervisor.c 1.82
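
For the record, the usual way to avoid this class of warning looks like
the sketch below. I'm not claiming this is what rev 1.82 does; the
function name is made up. The warning appears when the integer being
cast is wider than a pointer, which is presumably the case with PAE,
where physical addresses are 64 bits and pointers stay 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Turn a 64-bit physical address plus a base offset into a pointer.
 * Casting the 64-bit sum straight to (void *) is what triggers
 * -Wint-to-pointer-cast on a 32-bit target; going through uintptr_t
 * makes the narrowing explicit.
 */
void *
pa_to_va(uint64_t pa, uintptr_t base)
{
	uint64_t sum = pa + base;

	assert(sum <= UINTPTR_MAX);	/* the result must fit in a pointer */
	return (void *)(uintptr_t)sum;
}
```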

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: modload & xen and -current 9.99.60

2020-05-07 Thread Manuel Bouyer

On Thu, May 07, 2020 at 07:45:48AM +0200, Frank Kardel wrote:
> Hi,
> 
> Running 9.99.60 XEN3_DOMU shows
> 
> [ 67264.313173] kobj_load, 444: [%M/bpfjit/bpfjit.kmod]: linker error: out
> of memory
> [ 67292.894143] kobj_load, 428: [%M/scsiverbose/scsiverbose.kmod]: linker
> error: out of memory
> 
> and modload fails with the OOM error.
> 
> Is this an expected behavior or a bug? (kern.securelevel is -1).

What does kern.module.path show for you ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: i386 Xen integration breaks linking NET4501 kernel

2020-05-05 Thread Manuel Bouyer

On Mon, May 04, 2020 at 06:42:11PM -0500, John D. Baker wrote:
> A recent build of -current/i386 fails when trying to link a kernel built
> from the NET4501 config:
> 
> [...]
> #  link  NET4501/netbsd
> /r0/build/current/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map --cref 
> -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd 
> ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> /r0/build/current/tools/amd64/bin/i486--netbsdelf-ld: locore.o: in function 
> `start_xenpvh':
> (.text+0x410): undefined reference to `hvm_start_paddr'
> /r0/build/current/tools/amd64/bin/i486--netbsdelf-ld: (.text+0x436): 
> undefined reference to `HYPERVISOR_shared_info_pa'

Should be fixed now. Sorry for this

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: sysctl machdep.hypervisor

2020-04-11 Thread Manuel Bouyer

On Wed, Apr 08, 2020 at 08:26:33PM -0000, Michael van Elst wrote:
> bou...@antioche.eu.org (Manuel Bouyer) writes:
> 
> >Hello,
> >we have a machdep.hypervisor sysctl which returns a specific string when
> >a hypervisor is detected. I'd like to change the string returned for
> >Xen (rename from Xen to "Xen PV" and add other Xen subtypes).
> 
> >I didn't find any use in our source tree, does anyone know if this would
> >cause problems ?
> 
> I'd avoid whitespace in such values.

Sure, committed. Thanks !

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--
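
To illustrate the point about whitespace: a consumer that splits output on blanks sees one token for "Xen" but two for "Xen PV". Here is a minimal sketch in C, assuming a hypothetical helper count_tokens() that is not part of any NetBSD API; the thread does not say which string was finally committed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count whitespace-separated tokens in a string. */
static size_t
count_tokens(const char *s)
{
	size_t n = 0;
	int in_token = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		if (isspace((unsigned char)*s)) {
			/* A blank ends the current token, if any. */
			in_token = 0;
		} else if (!in_token) {
			/* First non-blank character starts a new token. */
			in_token = 1;
			n++;
		}
	}
	return n;
}
```

A value is safe for naive field splitting when count_tokens() returns 1 for it.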

sysctl machdep.hypervisor

2020-04-08 Thread Manuel Bouyer

Hello,
we have a machdep.hypervisor sysctl which returns a specific string when
a hypervisor is detected. I'd like to change the string returned for
Xen (rename from Xen to "Xen PV" and add other Xen subtypes).

I didn't find any use in our source tree, does anyone know if this would
cause problems ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: XEN 4.11 and 9.99.48 DOMU performance

2020-03-14 Thread Manuel Bouyer

There have been scheduler-related fixes in the last few days; did you
try with an up to date kernel ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: XEN 4.11 and 9.99.48 DOMU performance

2020-03-10 Thread Manuel Bouyer

On Tue, Mar 10, 2020 at 08:11:46PM +0100, Frank Kardel wrote:
> interrupt                            total  rate type
> vmcmd kills                          17984     0 misc
> vmcmd extends                        17980     0 misc
> vmcmd calls                         185033     0 misc
> pserialize exclusive access            145     0 misc
> vmem static_bt_inuse                   200     0 misc
> vmem static_bt_count                   200     0 misc
> rndpseudo open soft                     39     0 misc
> TLB shootdown                     50603277   139 intr
> softint net/0                     33488890    92 misc
> softint bio/0                            1     0 misc
> softint clk/0                      4239404    11 misc
> softint ser/0                         2656     0 misc
> callout late/0                         199     0 misc
> crosscall unicast                       60     0 misc
> namecache entries collected         940528     2 misc
> namecache under scan target         362220     0 misc
> vcpu0 xenev0 channel 4            18112410    50 intr
> softint net/1                       564465     1 misc
> softint bio/1                            1     0 misc
> ...
> 
> softint clk/11                      385635     1 misc
> softint ser/11                         149     0 misc
> callout late/11                          1     0 misc
> vcpu0 xenev0 channel 2                 297     0 intr
> vcpu0 raw systime went backwards       158     0 intr
> vcpu0 xenev0 channel 5            36222558    99 intr
> vcpu1 xenev0 channel 6              155497     0 intr
> vcpu1 missed hardclock                  83     0 intr
> vcpu1 xenev0 channel 7            36222475    99 intr
> vcpu2 xenev0 channel 8            15438790    42 intr
> ...
> 
> xbd0 map unaligned                   96051     0 misc
> xbd1 map unaligned                 1406926     3 misc
> 
> TLB shootdown is there as some crosscall unicast. I don't see any other IPIs
> though.

Indeed it seems that in netbsd-9 IPIs don't show up as such.
But there should be some crosscall broadcast. On a netbsd-9 pbulk host
I see more broadcast than unicast.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: XEN 4.11 and 9.99.48 DOMU performance

2020-03-10 Thread Manuel Bouyer

On Tue, Mar 10, 2020 at 07:30:33PM +0100, Frank Kardel wrote:
> No information about IPI in vmstat -i in DOM0 and DOMU.

The dom0 is not MP so I don't expect to see IPIs here.
But the domU is, so there should be IPIs here.

Hum, it looks like IPIs are in vmstat -e, not -i ...
sorry

> 
> Otherwise it is usually responsive. Sometimes things get stuck but switching
> a screen in screen seems to unstick things.
> 
> It seems like "wakeups" get sometimes lost.

I guess it could be related to IPIs.
But I'm running daily tests on domUs and I didn't notice anything strange

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: XEN 4.11 and 9.99.48 DOMU performance

2020-03-10 Thread Manuel Bouyer

On Tue, Mar 10, 2020 at 06:48:14PM +0100, Frank Kardel wrote:
> [...]
> 
> To me it looks more like locking issues or xen scheduling features.

Yes, that could be. Does vmstat -i show anything about IPIs ?

Is the domU otherwise responsive ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: XEN 4.11 and 9.99.48 DOMU performance

2020-03-10 Thread Manuel Bouyer

On Tue, Mar 10, 2020 at 04:20:22PM +0100, Frank Kardel wrote:
> This is my first XEN setup so I may have misconfigured something:
> 
> I have a 4G DOM0 on a 512G System with a EPYC 7302P 16-Core Processor.
> 
> On that I configured a 400G DOMU with 12 vcpus. like this:
> 
> name = "system"
> kernel = "/netbsd-XEN3_DOMU.gz"
> memory = 40
> cpus="all"
> vcpus=4
> maxvcpus=12
> vif = [ 'mac=aa:00:00:d1:00:01,bridge=bridge0',
> 'mac=aa:00:00:d1:00:02,bridge=bridge1' ]
> disk = [ 'file:/data0/xen-roots/root-Alpine-system.img,0x0,w',
>  'phy:/dev/wedges/data1,0x1,w' ]
> 
> On that I run postgresql 11 attempting to load a 1TB database.
> 
> Usually this workload keeps a machine continually busy cpu/io-wise.
> 
> I was expecting that I/O via the xen backend would be the bottleneck.
> 
> Instead DOM0 is only seldom busy for IO. DOMU is crawling along sleeping
> 
> at all sorts of places:


What does
iostat 5

show about the disks, in the dom0 and domU ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 09:00:30PM +0000, Andrew Doran wrote:
> I reproduced it on native x86.  It's a bug in the CPU topology code.  Now
> fixed with revision 1.11 src/sys/kern/subr_cpu.c - sorry about that.

I confirm, I now see user activity on all CPUs. Thanks !

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 07:11:21PM +0000, Andrew Doran wrote:
> On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote:
> 
> > > On Mon, Jan 13, 2020 at 06:33:08PM +0000, Andrew Doran wrote:
> > > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> > > 
> > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > > > > It also sets rsp and rbp. I think rbp is not set by anything else, at
> > > > > least in the Xen case.
> > > > > The different rbp value would explain why in one case we hit a
> > > > > KASSERT() in lwp_startup later.
> > > > > But I don't know what pcb_rbp contains; I couldn't find where the pcb
> > > > > for idlelwp is initialized.
> > > > 
> > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> > > > does. It doesn't cause the lwp_startup() KASSERT as calling
> > > > cpu_switchto() does; it also doesn't change the scheduler behavior.
> > > 
> > > Wait - do you mean that everything works now?  Or that everything still
> > > runs on CPU0?
> > 
> > No, everything still runs on CPU0
> 
> Hmm, I don't understand why.  I'll set up Xen and try it out.  It might take
> me a day or two.

OK thanks. 

> [...]
> 
> The assertion in lwp_startup() is because I made MI changes so that prevlwp
> is never NULL when calling cpu_switchto(), when fixing some bugs with MP
> support on !x86 and making things more correct.  lwp_startup()/mi_switch() now
> need to unlock prevlwp after it is finished in cpu_switchto().  I never
> expected anybody but mi_switch() to call cpu_switchto().

OK, so I removed the call to cpu_switchto() before idle_loop(),
and added a few KASSERTS.
I guess you can back out the prev == NULL case from cpu_switchto().

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--
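
The contract Andrew describes above can be pictured with a small user-space model. This is only a sketch under stated assumptions: struct model_lwp, struct model_cpu, model_lwp_startup() and model_cpu_hatch_ok() are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel structures; field names are illustrative only. */
struct model_lwp {
	int locked;			/* 1 while the LWP is still locked */
};

struct model_cpu {
	struct model_lwp *curlwp;	/* LWP currently running on this CPU */
	struct model_lwp *idlelwp;	/* this CPU's idle LWP */
};

/*
 * Models the switch-path contract: the new LWP always receives a non-NULL
 * previous LWP and must unlock it.  Returns 0 on success, or -1 where the
 * real kernel would trip KASSERT(prev != NULL).
 */
static int
model_lwp_startup(struct model_lwp *prev)
{
	if (prev == NULL)
		return -1;
	prev->locked = 0;
	return 0;
}

/*
 * Models a secondary-CPU bring-up that enters the idle loop directly:
 * no switch is performed, so the only precondition checked is that
 * curlwp already points at the idle LWP.  Returns 1 if it holds.
 */
static int
model_cpu_hatch_ok(const struct model_cpu *ci)
{
	assert(ci != NULL);
	return ci->curlwp != NULL && ci->curlwp == ci->idlelwp;
}
```

In this model, passing NULL to model_lwp_startup() corresponds to the "prev != NULL" panic seen earlier in the thread, while a CPU whose curlwp is already its idle LWP can skip the switch entirely.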

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 06:33:08PM +0000, Andrew Doran wrote:
> On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> 
> > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > > It also sets rsp and rbp. I think rbp is not set by anything else, at least
> > > in the Xen case.
> > > The different rbp value would explain why in one case we hit a KASSERT()
> > > in lwp_startup later.
> > > But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> > > idlelwp is initialized.
> > 
> > I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> > does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> > does; it also doesn't change the scheduler behavior.
> 
> Wait - do you mean that everything works now?  Or that everything still runs
> on CPU0?

No, everything still runs on CPU0

> 
> The very first thing that idle_loop() does on amd64/i386 is set up the frame
> pointer - ebp/rbp.
> 
>  :
>0:   55  push   %rbp
>1:   48 89 e5mov%rsp,%rbp
>4:   41 56   push   %r14
>6:   41 55   push   %r13

OK, so it's OK that my patch doesn't change anything.
And so I still don't understand the KASSERT when cpu_switchto() is called
before idle_loop().

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > It also sets rsp and rbp. I think rbp is not set by anything else, at least
> > in the Xen case.
> > The different rbp value would explain why in one case we hit a KASSERT()
> > in lwp_startup later.
> > But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> > idlelwp is initialized.
> 
> I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> does; it also doesn't change the scheduler behavior.

With the patch this time

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--
Index: sys/arch/xen/x86/cpu.c
===================================================================
RCS file: /cvsroot/src/sys/arch/xen/x86/cpu.c,v
retrieving revision 1.131
diff -u -p -u -r1.131 cpu.c
--- sys/arch/xen/x86/cpu.c  23 Nov 2019 19:40:38 -0000  1.131
+++ sys/arch/xen/x86/cpu.c  13 Jan 2020 16:40:50 -0000
@@ -739,7 +739,16 @@ cpu_hatch(void *v)
 
    aprint_debug_dev(ci->ci_dev, "running\n");
 
-   cpu_switchto(NULL, ci->ci_data.cpu_idlelwp, true);
+#ifdef __x86_64__
+   asm("movq %0, %%rsp" : : "r" (pcb->pcb_rsp));
+   asm("movq %0, %%rbp" : : "r" (pcb->pcb_rbp));
+#else
+   asm("movl %0, %%esp" : : "r" (pcb->pcb_esp));
+   asm("movl %0, %%ebp" : : "r" (pcb->pcb_ebp));
+#endif
+   KASSERT(ci->ci_curlwp == ci->ci_data.cpu_idlelwp);
+
+   //cpu_switchto(NULL, ci->ci_data.cpu_idlelwp, true);
 
    idle_loop(NULL);
    KASSERT(false);

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> It also sets rsp and rbp. I think rbp is not set by anything else, at least
> in the Xen case.
> The different rbp value would explain why in one case we hit a KASSERT()
> in lwp_startup later.
> But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> idlelwp is initialized.

I tried the attached patch, which should set rsp/rbp as cpu_switchto()
does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
does; it also doesn't change the scheduler behavior.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 02:49:52PM +0000, Andrew Doran wrote:
> > Now I get a different panic:
> > [   1.000] vcpu0 at hypervisor0
> > [   1.000] vcpu0: 64 page colors
> > [   1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 
> > 0x6fb
> > [   1.000] vcpu0: node 0, package 0, core 1, smt 0
> > [   1.000] vcpu1 at hypervisor0
> > [   1.000] vcpu1: 2 page colors
> > [   1.000] vcpu1: starting
> > [   1.000] vcpu1: is started.
> > [   1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 
> > 0x6fb
> > [   1.000] vcpu1: node 0, package 0, core 0, smt 0
> > [...]
> > [   1.030] UVM: using package allocation scheme, 1 package(s) per bucket
> > [   1.030] Xen vcpu1 clock: using event channel 7
> > [   1.8809493] vcpu1: running
> > [   1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: 
> > file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
> > [   1.8809493] cpu1: Begin traceback...
> > [   1.8809493] 
> > vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0)
> >  at netbsd:vpanic+0x134
> > [   1.8809493] 
> > kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0)
> >  at netbsd:kern_assert+0x23
> > [   1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) 
> > at netbsd:lwp_startup+0x155
> > [   1.8809493] cpu1: End traceback...
> > 
> > If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
> > that all user processes are running on cpu0 only ...
> 
> I looked and the only thing cpu_switchto() is doing there is setting curlwp,
> but that's already set in cpu_start_secondary(), so it's not needed.

It also sets rsp and rbp. I think rbp is not set by anything else, at least
in the Xen case.
The different rbp value would explain why in one case we hit a KASSERT()
in lwp_startup later.
But I don't know what pcb_rbp contains; I couldn't find where the pcb for
idlelwp is initialized.


> 
> > I can't see what extra work the cpu_switchto() could be doing that would
> > matter, except maybe the %ebp/%rbp init. Any idea ?
> 
> Right I don't think cpu_switchto() matters there.  The strategy for
> assigning LWPs to CPUs in the scheduler has changed.  If the machine is not
> busy everything is likely to stay on CPU0.  Are you putting much load on it?

I just tried a build.sh -j4
CPU0 is 100% busy, the others are 100% idle:

load averages:  3.02,  2.14,  1.26;   up 0+00:51:59   16:59:03
61 processes: 5 runnable, 54 sleeping, 2 on CPU
CPU0 states: 39.3% user,  0.0% nice, 60.7% system,  0.0% interrupt,  0.0% idle
CPU1 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU2 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU3 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 1402M Act, 168K Inact, 16K Wired, 14M Exec, 1352M File, 1932M Free
Swap: 

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
21392 bouyer    33    0    29M 5964K RUN/0      0:00  2.00%  0.10% as
    0 root       0    0     0K   11M CPU/3      0:30  0.00%  0.00% [system]
   81 bouyer    85    0    20M 3596K kqueue/0   0:19  0.00%  0.00% tmux
  226 bouyer    43    0    16M 1900K CPU/0      0:00  0.00%  0.00% top
16883 bouyer    33    0  8992K 2212K RUN/0      0:00  0.00%  0.00% nbmake
21137 bouyer    33    0  7844K 1220K RUN/0      0:00  0.00%  0.00% sed
12098 bouyer    33    0  4288K  164K RUN/0      0:00  0.00%  0.00% sh
22411 bouyer    33    0  4288K  164K RUN/0      0:00  0.00%  0.00% cc
   42 root      85    0    80M 5768K poll/0     0:00  0.00%  0.00% sshd

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 12:02:13PM +0000, Andrew Doran wrote:
> Ah yes it does, I saw something that made me think it affected x86_64 only. 
> I'll make the change on i386 too.

thanks.

Now I get a different panic:
[   1.000] vcpu0 at hypervisor0
[   1.000] vcpu0: 64 page colors
[   1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
[   1.000] vcpu0: node 0, package 0, core 1, smt 0
[   1.000] vcpu1 at hypervisor0
[   1.000] vcpu1: 2 page colors
[   1.000] vcpu1: starting
[   1.000] vcpu1: is started.
[   1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
[   1.000] vcpu1: node 0, package 0, core 0, smt 0
[...]
[   1.030] UVM: using package allocation scheme, 1 package(s) per bucket
[   1.030] Xen vcpu1 clock: using event channel 7
[   1.8809493] vcpu1: running
[   1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file 
"/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
[   1.8809493] cpu1: Begin traceback...
[   1.8809493] 
vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0)
 at netbsd:vpanic+0x134
[   1.8809493] 
kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at 
netbsd:kern_assert+0x23
[   1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at 
netbsd:lwp_startup+0x155
[   1.8809493] cpu1: End traceback...

If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
that all user processes are running on cpu0 only ...
I can't see what extra work the cpu_switchto() could be doing that would
matter, except maybe the %ebp/%rbp init. Any idea ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--

Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer

On Mon, Jan 13, 2020 at 11:42:17AM +0000, Andrew Doran wrote:
> Hi Manuel,
> 
> On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote:
> > Hello,
> > A current Xen domU kernel fails to boot with:
> > [   1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1
> > [   1.000] vcpu0 at hypervisor0
> > [   1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > [   1.000] vcpu0: node 0, package 0, core 1, smt 1
> > [   1.000] vcpu1 at hypervisor0
> > [   1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > [   1.000] vcpu1: node 0, package 1, core 0, smt 0
> > [   1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface
> > [   1.000] xencons0 at hypervisor0: Xen Virtual Console Driver
> > [   1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e
> > [   1.9901295] fatal page fault in supervisor mode
> > [   1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags 
> > 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88
> > [   1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack 
> > 0xb7802b1992c0
> > kernel: page fault trap, code=0
> > Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf:movq
> > 28(%r13),%rax
> > cpu_switchto() at netbsd:cpu_switchto+0xf
> > 
> > both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is
> > missing in initialisations of secondary CPUs.
> > This happens with the 202001101800Z but the problem is probably older than
> > that (the testbed used vcpus=1 until today)
> > 
> > Any idea ?
> 
> It should work now with revision 1.199 of src/sys/arch/amd64/amd64/locore.S. 

The same problem happens with i386.

> Nothing else in tree calls cpu_switchto() with prevlwp=NULL any more.  Can
> Xen's cpu_hatch() call idle_loop() directly?

Maybe it could, but cpu_switchto() does some extra work (switch the stack,
set curlwp at least). Maybe this is already done but I'll have to double check.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--
