Re: dwiic errors
On Thu, Mar 14, 2024 at 08:42:47AM -0700, Paul Goyette wrote: > On Thu, 14 Mar 2024, Michael van Elst wrote: > > > p...@whooppee.com (Paul Goyette) writes: > > > > > as soon as you proceed past this point (including normal non-single- > > > user boot), the dwiic starts spewing time-out messages. These > > > messages come every 0.5 second or so, and there's usually a hundred > > > or more messages before they stop; in some cases the messages have > > > continued to stream by for several minutes (at which point I pressed > > > the reset button). The value for %d is always 0 or 1. > > > > Probably result of > > > > GENERIC:ihidev* at iic? > > > > that is probing for a modern laptop touchpad. > > > > Can you disable ihidev instead of dwiic and see what happens then ? > > No change. It attaches dwiic0 and then starts with the messages. It could also be some sensors I guess. Any chance to see what attaches at dwiic0 ? Maybe entering ddb before the console gets spammed ? FWIW I have a laptop with the touchpad as ihidev@dwiic and it works fine with RC6 -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: ATF tests panic assertion "uvmexp.swpgonly > 0" failed
On Mon, Dec 11, 2023 at 11:27:17AM -0800, Chuck Silvers wrote: > On Fri, Dec 08, 2023 at 06:13:42PM +0100, Manuel Bouyer wrote: > > Hello again > > I see a second rare panic running ATF tests on Xen: > > lib/libc/regex/t_exhaust (236/949): 1 test cases > > regcomp_too_big: [ 1254.5816543] panic: kernel diagnostic assertion > > "uvmexp.swpgonly > 0" failed: file "/usr/src/sys/uvm/uvm_anon.c", line 175 > > [ 1254.6116351] cpu1: Begin traceback... > > [ 1254.6216378] > > vpanic(c12d3bf8,d855adcc,d855ade8,c0d03f72,c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0) > > at netbsd:vpanic+0x184 > > [ 1254.6516393] > > kern_assert(c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0,c2d9f8d0,0,d855ae0c,c0d041c4) > > at netbsd:kern_assert+0x23 > > [ 1254.6716402] > > uvm_anfree(c2d9f8d0,c2342000,3,c543b0c0,0,c3b00ac0,1,d855ae58,c0d20d1d,c2d9f8d0) > > at netbsd:uvm_anfree+0x2b8 > > [ 1254.7016358] > > uvm_anon_release(c2d9f8d0,1,d72f2000,c543b0c0,d72f2000,d72f1000,1,0,0,d855ae84) > > at netbsd:uvm_anon_release+0x85 > > [ 1254.7216389] > > uvm_aio_aiodone_pages(d855ae84,1,1,0,c243c400,1ebf140,0,0,d855ae84,c1e6cc24) > > at netbsd:uvm_aio_aiodone_pages+0x2fc > > [ 1254.7716201] > > uvm_aio_aiodone(c5560180,8e016,3,0,72bec,c26ff204,c26ff204,c5560180,d855af20,c0e54a31) > > at netbsd:uvm_aio_aiodone+0x97 > > [ 1254.7916387] > > biodone2(c5560180,1000,0,c25e4cec,c0db8d85,c5561024,c0e54996,d82f8000,d855af48,c0e0a32d) > > at netbsd:biodone2+0x95 > > [ 1254.8216356] > > dkiodone(c5561024,10,10,d85502ac,c010293f,d855af70,c5561024,3,d855af70,c0e0a49e) > > at netbsd:dkiodone+0x9b > > [ 1254.8416368] > > biodone2(3,0,c010293f,8a260008,2,d8550004,c243c980,d85502ac,d855afe0,c0d7f5c5) > > at netbsd:biodone2+0x95 > > [ 1254.8815946] biointr(0,0,0,0,0,0,0,0,0,0) at netbsd:biointr+0x4c > > [ 1254.8916211] > > softint_dispatch(c243c400,3,c2c2c2c2,c2c2c2c2,c2c2c2c2,c2c2c2c2,d855dff0,d855df14,c268b000,80050033) > > at netbsd:softint_dispatch+0xe0 > > [ 1254.9216366] Bad frame pointer: 0xd82fcf20 > > [ 
1254.9316199] cpu1: End traceback... > > > > The first time seems to be > > https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202310061820Z_anita.txt > > > > and the second time was > > https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202311250810Z_anita.txt > > > > Has anyone else seen this ? > > > yes, various people have been seeing this assertion (or some other related > ones) > occasionally for years now. I've looked into it a few times but I have been > unable to spot the problem. thanks is there a PR open about this ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
ATF tests panic assertion "uvmexp.swpgonly > 0" failed
Hello again I see a second rare panic running ATF tests on Xen: lib/libc/regex/t_exhaust (236/949): 1 test cases regcomp_too_big: [ 1254.5816543] panic: kernel diagnostic assertion "uvmexp.swpgonly > 0" failed: file "/usr/src/sys/uvm/uvm_anon.c", line 175 [ 1254.6116351] cpu1: Begin traceback... [ 1254.6216378] vpanic(c12d3bf8,d855adcc,d855ade8,c0d03f72,c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0) at netbsd:vpanic+0x184 [ 1254.6516393] kern_assert(c12d3bf8,c12d3b5f,c13da23e,c13da189,af,c3b00ac0,c2d9f8d0,0,d855ae0c,c0d041c4) at netbsd:kern_assert+0x23 [ 1254.6716402] uvm_anfree(c2d9f8d0,c2342000,3,c543b0c0,0,c3b00ac0,1,d855ae58,c0d20d1d,c2d9f8d0) at netbsd:uvm_anfree+0x2b8 [ 1254.7016358] uvm_anon_release(c2d9f8d0,1,d72f2000,c543b0c0,d72f2000,d72f1000,1,0,0,d855ae84) at netbsd:uvm_anon_release+0x85 [ 1254.7216389] uvm_aio_aiodone_pages(d855ae84,1,1,0,c243c400,1ebf140,0,0,d855ae84,c1e6cc24) at netbsd:uvm_aio_aiodone_pages+0x2fc [ 1254.7716201] uvm_aio_aiodone(c5560180,8e016,3,0,72bec,c26ff204,c26ff204,c5560180,d855af20,c0e54a31) at netbsd:uvm_aio_aiodone+0x97 [ 1254.7916387] biodone2(c5560180,1000,0,c25e4cec,c0db8d85,c5561024,c0e54996,d82f8000,d855af48,c0e0a32d) at netbsd:biodone2+0x95 [ 1254.8216356] dkiodone(c5561024,10,10,d85502ac,c010293f,d855af70,c5561024,3,d855af70,c0e0a49e) at netbsd:dkiodone+0x9b [ 1254.8416368] biodone2(3,0,c010293f,8a260008,2,d8550004,c243c980,d85502ac,d855afe0,c0d7f5c5) at netbsd:biodone2+0x95 [ 1254.8815946] biointr(0,0,0,0,0,0,0,0,0,0) at netbsd:biointr+0x4c [ 1254.8916211] softint_dispatch(c243c400,3,c2c2c2c2,c2c2c2c2,c2c2c2c2,c2c2c2c2,d855dff0,d855df14,c268b000,80050033) at netbsd:softint_dispatch+0xe0 [ 1254.9216366] Bad frame pointer: 0xd82fcf20 [ 1254.9316199] cpu1: End traceback... The first time seems to be https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202310061820Z_anita.txt and the second time was https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/i386-hvm/202311250810Z_anita.txt Has anyone else seen this ? 
-- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
ATF tests panic
Hello
in my daily ATF runs on Xen VMs, I see occasional panics like:

kernel/kqueue/t_proc4 (81/956): 1 test cases
    proc4: [ 727.4761311] uvm_fault(0xeb8004cff900, 0x0, 2) -> e
[ 727.4899584] fatal page fault in supervisor mode
[ 727.4981872] trap type 6 code 0x2 rip 0x80dcf31d cs 0x8 rflags 0x10246 cr2 0xb0 ilevel 0 rsp 0xc280510e4e08
[ 727.5261058] curlwp 0xeb80055a6000 pid 20525.20525 lowest kstack 0xc280510e02c0
[ 727.5410680] panic: trap
[ 727.5410680] cpu0: Begin traceback...
[ 727.5611378] vpanic() at netbsd:vpanic+0x173
[ 727.5679804] panic() at netbsd:panic+0x3c
[ 727.5787912] trap() at netbsd:trap+0xb0a
[ 727.5896397] --- trap (number 6) ---
[ 727.6161322] _mutex_init() at netbsd:_mutex_init+0x33
[ 727.6274274] knote_proc_fork() at netbsd:knote_proc_fork+0xa2
[ 727.6366632] fork1() at netbsd:fork1+0x6ba
[ 727.6468860] sys_fork() at netbsd:sys_fork+0x29
[ 727.6563376] syscall() at netbsd:syscall+0x17a
[ 727.6661484] --- syscall (number 2) ---
[ 727.6775934] netbsd:syscall+0x17a:
[ 727.6775934] cpu0: End traceback...
[ 727.6928214] dumping to dev 168,1 (offset=8, size=128926):
[ 727.6928214] dump device bad

The first occurrence seems to be this:
https://largo.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64-pv/202309261750Z_anita.txt

I see it for PV, PVH and HVM runs, but it's quite rare.
Am I the only one seeing this?

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: openssl3+postfix issue (ca md too weak)
On Mon, Nov 13, 2023 at 08:34:04PM +0100, Manuel Bouyer wrote:
> Hello
> I'm facing an issue with postfix+openssl3 which may be critical (depending
> on how it can be fixed).
>
> Now my postfix setup fails to send mails with
> Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
>
> From what I understood, this is the remote certificate which is not accepted:
> openssl 3 deprecated some signature algorithm, which are no longer accepted
> with @SECLEVEL=1 (which is the default).

I had misunderstood. The message is not about the server certificate but
about the client certificate (which, indeed, is quite old and uses a
private CA).
Even though no client certificate is requested for this server, it seems that
postfix loads it and errors out if it's too weak.
This is quite confusing ...

The good news is, as it's a private CA I can rebuild it :)

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: openssl3+postfix issue (ca md too weak)
On Mon, Nov 13, 2023 at 07:16:14PM -0800, Brian Buhrow wrote: > Hello Taylor. Just as a point of reference, smtp clients that connect > to domains hosted by > Microsoft, i.e. outlook.com and any other domains that use their > infrastructure for e-mail, will > have to present a valid SSL certificate in order to submit mail to their smtp > servers. But > that is a different issue than Manuel is describing, as I understand it. I > think he is saying > that the server is presenting an SSL certificate that his client doesn't like > when he tries to > send mail to an external smtp server. In that case, I agree with you, his > client shouldn't be > overly concerned about whether the server presented SSL certificate can be > verified all the way > down the verification chain. I guess it's fine if it does the verification > and puts a note in > the headers, but it shouldn't stop mail from going out. Actually, the client is using SMTP AUTH, so making sure he's sending the auth credentials to the right SMTP server is critical. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: openssl3+postfix issue (ca md too weak)
On Tue, Nov 14, 2023 at 02:39:53AM +, Taylor R Campbell wrote:
> [trimming tech-crypto from cc because this is a policy and
> configuration issue, not a cryptography issue]
>
> > Date: Mon, 13 Nov 2023 20:34:04 +0100
> > From: Manuel Bouyer
> >
> > I'm facing an issue with postfix+openssl3 which may be critical (depending
> > on how it can be fixed).
> >
> > Now my postfix setup fails to send mails with
> > Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
>
> 1. This says `warning'; does the mail actually fail to go through, or
>    are you just alarmed by the warning?

It fails:
Nov 13 20:21:48 comore postfix/smtp[4182]: warning: TLS library problem: error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:
Nov 13 20:21:48 comore postfix/smtp[4182]: D2EF31805C: to=, relay=mail.soc.lip6.fr[132.227.86.2]:465, delay=1441, delays=1441/0.05/0.02/0, dsn=4.7.5, status=deferred (Cannot start TLS: handshake failure)

>
> 2. Can you describe your mail topology?

This is a simple mail client (my laptop); outgoing emails go through 2 mail
servers (depending on the From address, and a relay map).
Both mail servers require SMTP AUTH (which is why I enforce
smtp_tls_security_level = verify), configured as:
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/home/bouyer/.postfix/sasl_passwd
smtp_sasl_security_options = noanonymous

>
> 3. Can you describe the postfix configuration on every node involved
>    in the topology?

The mail servers this client talks to are both running sendmail, on netbsd-9.

> 4. Can you share master.cf on every node involved if it's not the
>    default?

On the client, master.cf is the default, with this additional entry:
relay-smtps unix - - n - - smtp
        # Client-side SMTPS requires "encrypt" or stronger.
        -o smtp_tls_security_level=verify
        -o smtp_tls_wrappermode=yes
        -o smtp_starttls_timeout=60
        -o smtp_helo_timeout=60

>
> 5. If you connect to the server with `openssl s_client', what happens?

It works:
openssl s_client -connect mail.soc.lip6.fr:465 -verify_return_error
[...]
    Start Time: 1699948718
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
220 asim.lip6.fr ESMTP Sendmail 8.15.2/8.15.2; Tue, 14 Nov 2023 08:58:37 +0100 (MET)

Also, tnftp talking to a web server with the exact same certificate and
certificate chain has no problem either.

This is one of the things I have a hard time understanding: why can't I
reproduce this error with another TLS client?

>
> > So, as far as I understand, we end up with a postfix installation which
> > can't talk to servers with valid certificates.
>
> Unless anything has changed in the past couple years, I don't think
> there is any widespread deployment of SMTP TLS server authentication
> that means anything for general MTAs -- at best, TLS in SMTP serves as
> opportunistic encryption to defend against passive eavesdroppers.

There is actually, for SMTP AUTH.
And I don't think using an MTA for SMTP AUTH is that unusual.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: openssl3+postfix issue (ca md too weak)
On Tue, Nov 14, 2023 at 11:10:16AM +1300, Lloyd Parkes wrote:
>
>
> On 14/11/23 10:56, Joerg Sonnenberger wrote:
> >
> > NIST has been sunsetting SHA1 for a long time, 2016 in fact. In many cases,
> > there is a better trust chain
> > for Comodo intermediary certificates and admins should be installing those.
>
> I'm not sure that's what Comodo has, even though it is the normal way of
> doing things.
>
> I found a Comodo web page that said SHA1 will be fine, so don't worry, and
> if you are worried, you can buy a different certificate. That same web
> page's link to their intermediate certificates is a dead link. Comodo does
> not fill me with confidence.

Unfortunately I don't have a choice for this one.

>
> I'm going to guess that the default @SECLEVEL of openssl needs to be
> adjusted if there is no Postfix specific way to adjust it. Apparently you
> can set the environment variable OPENSSL_CONF to run with a custom openssl
> configuration which can avoid reducing the security level of the rest of
> your system. Searching for "openssl @SECLEVEL" gave me the usual levels of
> StackExchange clarity, so ymmv.

I tried this, but nothing that I tried in /etc/openssl/openssl.cnf
seems to have any effect. I wonder if postfix is doing some specific
openssl setup that overrides the openssl.cnf settings.

But also note that I could not reproduce the problem with openssl s_client.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: openssl3+postfix issue (ca md too weak)
On Mon, Nov 13, 2023 at 10:56:00PM +0100, Joerg Sonnenberger wrote: > On Monday, November 13, 2023 8:34:04 PM CET Manuel Bouyer wrote: > > Hello > > I'm facing an issue with postfix+openssl3 which may be critical (depending > > on how it can be fixed). > > > > Now my postfix setup fails to send mails with > > Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: > > error:0A00018E:SSL routines::ca md too > > weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984: > > > > From what I understood, this is the remote certificate which is not > > accepted: > > openssl 3 deprecated some signature algorithm, which are no longer accepted > > with @SECLEVEL=1 (which is the default). > > In server's certificate chain all but the last one are signed with > > sha384WithRSAEncryption (which should be OK). The last one (the root > > certificate) is signed with RSA-SHA1 and I don't think this will change > > soon: > > 3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, > > CN = A > > AA Certificate Services > >i:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, > > CN = A > > AA Certificate Services > >a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1 > >v:NotBefore: Jan 1 00:00:00 2004 GMT; NotAfter: Dec 31 23:59:59 2028 GMT > > > > So, as far as I understand, we end up with a postfix installation which > > can't talk to servers with valid certificates. > > NIST has been sunsetting SHA1 for a long time, 2016 in fact. In many cases, > there is a better trust chain > for Comodo intermediary certificates and admins should be installing those. My chain is from October, not that old. Maybe our CA is not completely up to date; I will have to check that. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: openssl3+postfix issue (ca md too weak)
On Mon, Nov 13, 2023 at 10:58:38PM +0100, Steffen Nurpmeso wrote: > Manuel Bouyer wrote in > : > |On Mon, Nov 13, 2023 at 10:24:56PM +0100, Steffen Nurpmeso wrote: > |> Manuel Bouyer wrote in > |> : > |>|Hello > |>|I'm facing an issue with postfix+openssl3 which may be critical (dependi\ > |>|ng > |>|on how it can be fixed). > |>| > |>|Now my postfix setup fails to send mails with > ... > |>|>From what I understood, this is the remote certificate which is not \ > |>|>accepted: > |>|openssl 3 deprecated some signature algorithm, which are no longer \ > |>|accepted > ... > |> Isn't that just postfix config. > | > |It's possible; but I didn't find anything relevant in the postfix docs > | > |> Btw *i* have no problem with > |> > |> smtpd_tls_ask_ccert = no > |> smtpd_tls_auth_only = yes > |> smtpd_tls_loglevel = 1 > |> #SMART The next is usually nice but when using client certificates > |> smtpd_tls_received_header = no > |> smtpd_tls_fingerprint_digest = sha256 > |> smtpd_tls_mandatory_protocols = >=TLSv1.2 > |> smtpd_tls_protocols = $smtpd_tls_mandatory_protocols > |> # super modern, forward secrecy TLSv1.2 / TLSv1.3 selection.. > |> tls_high_cipherlist = EECDH+AESGCM:EECDH+AES256:EDH+AESGCM:CHACHA20 > |> smtpd_tls_mandatory_ciphers = high > |> smtpd_tls_mandatory_exclude_ciphers = TLSv1 > |> > |> ^ This works in practice without any noticeable trouble. > |> (But then i again i do not have to make money from that or my > |> customers who must talk to ten year old refrigerators.) 
> | > |this is only server-side configuration; my problem is with client-side > |rejecting the server's certificate > > Well i have > > #SMART comment out next > smtp_tls_security_level = may I have smtp_tls_security_level = verify and this is what I need because a username/passwd is sent as part of the smtp transaction > # To always go directly SMTPS/SUBMISSIONS > #smtp_tls_wrappermode = yes > smtp_tls_fingerprint_digest = $smtpd_tls_fingerprint_digest > smtp_tls_mandatory_protocols = $smtpd_tls_mandatory_protocols > smtp_tls_protocols = $smtpd_tls_protocols > #SMART When only relaying to smarthost, the next should be =high > _or_better_! > smtp_tls_mandatory_ciphers = $smtpd_tls_mandatory_ciphers > smtp_tls_mandatory_exclude_ciphers = $smtpd_tls_mandatory_exclude_ciphers > smtp_tls_ciphers = $smtpd_tls_ciphers > smtp_tls_exclude_ciphers = $smtpd_tls_exclude_ciphers > smtp_tls_connection_reuse = yes > > But if you have a problem with only one permanent remote partner In my config I have 2 possible relays (depending on the from of the email) and both shows the same problem (yet with different certificates signed by different CAs). > you surely want a dedicated map for that one. No, I need a strong encrypted connection -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: openssl3+postfix issue (ca md too weak)
On Mon, Nov 13, 2023 at 10:24:56PM +0100, Steffen Nurpmeso wrote: > Manuel Bouyer wrote in > : > |Hello > |I'm facing an issue with postfix+openssl3 which may be critical (depending > |on how it can be fixed). > | > |Now my postfix setup fails to send mails with > |Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: \ > |error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd\ > |/openssl/dist/ssl/statem/statem_lib.c:984: > | > |>From what I understood, this is the remote certificate which is not \ > |>accepted: > |openssl 3 deprecated some signature algorithm, which are no longer accepted > |with @SECLEVEL=1 (which is the default). > |In server's certificate chain all but the last one are signed with > |sha384WithRSAEncryption (which should be OK). The last one (the root > |certificate) is signed with RSA-SHA1 and I don't think this will change > |soon: > | 3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, \ > | CN = A > | AA Certificate Services > | i:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, \ > | CN = A > | AA Certificate Services > | a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1 > | v:NotBefore: Jan 1 00:00:00 2004 GMT; NotAfter: Dec 31 23:59:59 \ > | 2028 GMT > | > |So, as far as I understand, we end up with a postfix installation which > |can't talk to servers with valid certificates. > | > |The solution (from google) would be to force @SECLEVEL=0 but I didn't find > |a way to do this for postfix. The solutions I've seen were for openvpn or > |curl, but nothing about postfix :( > > Isn't that just postfix config. 
It's possible; but I didn't find anything relevant in the postfix docs > Btw *i* have no problem with > > smtpd_tls_ask_ccert = no > smtpd_tls_auth_only = yes > smtpd_tls_loglevel = 1 > #SMART The next is usually nice but when using client certificates > smtpd_tls_received_header = no > smtpd_tls_fingerprint_digest = sha256 > smtpd_tls_mandatory_protocols = >=TLSv1.2 > smtpd_tls_protocols = $smtpd_tls_mandatory_protocols > # super modern, forward secrecy TLSv1.2 / TLSv1.3 selection.. > tls_high_cipherlist = EECDH+AESGCM:EECDH+AES256:EDH+AESGCM:CHACHA20 > smtpd_tls_mandatory_ciphers = high > smtpd_tls_mandatory_exclude_ciphers = TLSv1 > > ^ This works in practice without any noticeable trouble. > (But then i again i do not have to make money from that or my > customers who must talk to ten year old refrigerators.) this is only server-side configuration; my problem is with client-side rejecting the server's certificate -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
openssl3+postfix issue (ca md too weak)
Hello
I'm facing an issue with postfix+openssl3 which may be critical (depending
on how it can be fixed).

Now my postfix setup fails to send mails with
Nov 13 20:20:53 comore postfix/smtp[6449]: warning: TLS library problem: error:0A00018E:SSL routines::ca md too weak:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_lib.c:984:

From what I understood, this is the remote certificate which is not accepted:
openssl 3 deprecated some signature algorithms, which are no longer accepted
with @SECLEVEL=1 (which is the default).
In the server's certificate chain all but the last one are signed with
sha384WithRSAEncryption (which should be OK). The last one (the root
certificate) is signed with RSA-SHA1 and I don't think this will change
soon:
 3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, CN = AAA Certificate Services
   i:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, CN = AAA Certificate Services
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
   v:NotBefore: Jan 1 00:00:00 2004 GMT; NotAfter: Dec 31 23:59:59 2028 GMT

So, as far as I understand, we end up with a postfix installation which
can't talk to servers with valid certificates.

The solution (from google) would be to force @SECLEVEL=0 but I didn't find
a way to do this for postfix. The solutions I've seen were for openvpn or
curl, but nothing about postfix :(

Any idea?

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
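[Editorial sketch, not part of the original message.] The reasoning above rests on reading the "sigalg:" field of each chain entry printed by `openssl s_client`. A minimal Python sketch of that check follows, assuming the output format quoted in the message. The function name `weak_chain_entries` and the list of digests treated as weak are assumptions made for illustration, not OpenSSL policy. Note also that, as far as I know, OpenSSL's documentation says the signature-algorithm level is not enforced on the trust anchor itself, so a flagged self-signed root is not necessarily what triggers "ca md too weak".

```python
import re

# Digests treated as "weak" for this sketch (an assumption; the real policy
# depends on the OpenSSL security level and build).
WEAK_DIGESTS = ("MD5", "SHA1")


def weak_chain_entries(s_client_output):
    """Return (depth, subject, sigalg) for chain entries with a weak-looking sigalg."""
    entries = []
    depth = subject = None
    for line in s_client_output.splitlines():
        # Chain entries start with "<depth> s:<subject>".
        m = re.match(r"\s*(\d+)\s+s:(.*)", line)
        if m:
            depth, subject = int(m.group(1)), m.group(2).strip()
            continue
        # The "a:" line carries the signature algorithm of the current entry.
        m = re.search(r"sigalg:\s*(\S+)", line)
        if m and depth is not None:
            sigalg = m.group(1)
            normalized = sigalg.upper().replace("-", "")
            if any(d in normalized for d in WEAK_DIGESTS):
                entries.append((depth, subject, sigalg))
    return entries


# Hypothetical input: the first entry is made up, the second is the root quoted above.
SAMPLE = """\
 2 s:C = XX, O = Example, CN = Example Intermediate
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: sha384WithRSAEncryption
 3 s:C = GB, ST = Greater Manchester, L = Salford, O = Comodo CA Limited, CN = AAA Certificate Services
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
"""

if __name__ == "__main__":
    for depth, subject, sigalg in weak_chain_entries(SAMPLE):
        print(f"depth {depth}: {sigalg} ({subject})")
```

Running the same check over any locally configured client certificate chain, and not only the server's chain, may be worthwhile, since both are loaded by the TLS client.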
Re: ACPI changes in -current, -10 vs. kernels w/o "genfb"
On Sat, Oct 21, 2023 at 11:46:48AM +0200, Manuel Bouyer wrote: > On Fri, Oct 20, 2023 at 06:47:54PM -0500, John D. Baker wrote: > > On Thu, 19 Oct 2023, Manuel Bouyer wrote: > > > > > On Thu, Oct 19, 2023 at 08:46:27AM -0500, John D. Baker wrote: > > > > > > > [...] > > > > # link VERTHANDI/netbsd > > > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map > > > > --cref -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd > > > > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > > > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: acpi_wakeup.o: > > > > in function `acpi_md_sleep_patch': > > > > /x/netbsd-10/src/sys/arch/x86/acpi/acpi_wakeup.c:145: undefined > > > > reference to `acpi_md_vesa_modenum' > > > > [...] > > > > > > Hello, > > > should be fixed on HEAD, will request a pullup to netbsd-10 > > > > Thanks! HEAD built just fine, but with -10/i386, my custom kernels build > > OK, but the stock XEN3PAE_DOM0 build fails with: > > > > [...] > > # link XEN3PAE_DOM0/netbsd > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map > > --cref -T netbsd.ldscript -Ttext 0xc010 -e start -X -o netbsd > > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: genfb_machdep.o: in > > function `x86_genfb_init': > > /x/netbsd-10/src/sys/arch/x86/x86/genfb_machdep.c:141: undefined reference > > to `acpi_md_vesa_modenum' > > [...] > > > > I'm guessing the issue is that XEN3PAE_DOM0 has "genfb", but no ACPI > > support, so is missing the symbol. > > Actually it has ACPI (which is why genfb tries to use the symbol) but > not acpi_wakeup > > What's strange is that I did a full build on HEAD and didn't notice the issue. > > Will look at it pullup-10 #433 -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: ACPI changes in -current, -10 vs. kernels w/o "genfb"
On Fri, Oct 20, 2023 at 06:47:54PM -0500, John D. Baker wrote: > On Thu, 19 Oct 2023, Manuel Bouyer wrote: > > > On Thu, Oct 19, 2023 at 08:46:27AM -0500, John D. Baker wrote: > > > > > [...] > > > # link VERTHANDI/netbsd > > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map > > > --cref -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd > > > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > > > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: acpi_wakeup.o: in > > > function `acpi_md_sleep_patch': > > > /x/netbsd-10/src/sys/arch/x86/acpi/acpi_wakeup.c:145: undefined reference > > > to `acpi_md_vesa_modenum' > > > [...] > > > > Hello, > > should be fixed on HEAD, will request a pullup to netbsd-10 > > Thanks! HEAD built just fine, but with -10/i386, my custom kernels build > OK, but the stock XEN3PAE_DOM0 build fails with: > > [...] > # link XEN3PAE_DOM0/netbsd > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map --cref > -T netbsd.ldscript -Ttext 0xc010 -e start -X -o netbsd > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: genfb_machdep.o: in > function `x86_genfb_init': > /x/netbsd-10/src/sys/arch/x86/x86/genfb_machdep.c:141: undefined reference to > `acpi_md_vesa_modenum' > [...] > > I'm guessing the issue is that XEN3PAE_DOM0 has "genfb", but no ACPI > support, so is missing the symbol. Actually it has ACPI (which is why genfb tries to use the symbol) but not acpi_wakeup What's strange is that I did a full build on HEAD and didn't notice the issue. Will look at it -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: ACPI changes in -current, -10 vs. kernels w/o "genfb"
On Thu, Oct 19, 2023 at 08:46:27AM -0500, John D. Baker wrote: > The following change in -current: > > https://mail-index.netbsd.org/source-changes/2023/10/16/msg148163.html > > and its subsequent pull-up to netbsd-10: > > https://mail-index.netbsd.org/source-changes/2023/10/18/msg148226.html > > breaks building kernels which exclude "genfb". The failure is as follows: > > [...] > # link VERTHANDI/netbsd > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map --cref > -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > /r0/build/netbsd-10/tools/amd64/bin/i486--netbsdelf-ld: acpi_wakeup.o: in > function `acpi_md_sleep_patch': > /x/netbsd-10/src/sys/arch/x86/acpi/acpi_wakeup.c:145: undefined reference to > `acpi_md_vesa_modenum' > [...] > > I have machines with ACPI for which "genfb" (or any DRMKMS framebuffer) > is superfluous and therefore are omitted from the configuration. Hello, should be fixed on HEAD, will request a pullup to netbsd-10 -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: ftp TLS fails
On Tue, Oct 10, 2023 at 03:56:56PM +0200, Manuel Bouyer wrote: > Hello > with netbsd-10 from oct, 2 ftp fails to connect to https sites: > tchatcha:/chroot/usr/pkgsrc-2023Q3/pkgsrc/sysutils/xenkernel418>ftp -o /tmp/o > https://ftp.netbsd.org/ > Trying [2001:470:a085:999::21]:443 ... > ftp: Can't connect to `2001:470:a085:999::21:443': No route to host > Trying 199.233.217.201:443 ... > :error:0A86:SSL > routines:tls_post_process_server_certificate:certificate verify > failed:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_clnt.c:1889: > ftp: Can't connect to `ftp.netbsd.org:https' > > > I have a ca-certificates.crt in /etc/openssl/certs/, I tried to re-run > certctl but it didn't help. > I see the same issue with downloads.xen.org > > It seems that not all roots are installed ? With some help from Thomas I found the problem: I had a /etc/openssl/openssl.cnf lying around and this caused trouble. After a rm -r /etc/openssl/* and postinstall again, _ have the certs. /etc/openssl (I guess I only did rm -rf /etc/openssl/certs* before) and this fixed things. /etc/openssl/certs.conf has more things now. Before it had only netbsd-certctl 20230816 -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
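[Editorial sketch, not part of the original message.] When certificate verification fails because of a stale configuration or an incomplete trust store, it can help to see which CA file and directory the TLS library actually consults. The Python sketch below uses the standard `ssl.get_default_verify_paths()` call; the helper name `describe_trust_store` is hypothetical. It reports on the OpenSSL that Python is linked against, which is an assumption here: it may not be the same library or configuration that ftp(1) uses on a given system.

```python
import os
import ssl


def describe_trust_store():
    """Summarise the default CA locations of the OpenSSL linked into Python."""
    paths = ssl.get_default_verify_paths()
    # paths.capath is None when the directory does not exist;
    # paths.openssl_capath is the compiled-in default location.
    capath = paths.capath or paths.openssl_capath
    entries = 0
    if capath and os.path.isdir(capath):
        entries = len(os.listdir(capath))
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        "entries_in_capath": entries,
        # A custom OPENSSL_CONF can change library behaviour for every client.
        "OPENSSL_CONF": os.environ.get("OPENSSL_CONF"),
    }


if __name__ == "__main__":
    for key, value in describe_trust_store().items():
        print(f"{key}: {value}")
```

An empty or nearly empty CA directory in this output would be consistent with the "certificate verify failed" symptom described above.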
ftp TLS fails
Hello with netbsd-10 from oct, 2 ftp fails to connect to https sites: tchatcha:/chroot/usr/pkgsrc-2023Q3/pkgsrc/sysutils/xenkernel418>ftp -o /tmp/o https://ftp.netbsd.org/ Trying [2001:470:a085:999::21]:443 ... ftp: Can't connect to `2001:470:a085:999::21:443': No route to host Trying 199.233.217.201:443 ... :error:0A86:SSL routines:tls_post_process_server_certificate:certificate verify failed:/usr/src/crypto/external/bsd/openssl/dist/ssl/statem/statem_clnt.c:1889: ftp: Can't connect to `ftp.netbsd.org:https' I have a ca-certificates.crt in /etc/openssl/certs/, I tried to re-run certctl but it didn't help. I see the same issue with downloads.xen.org It seems that not all roots are installed ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: heartbeat panic by heavy traffic
On Fri, Sep 15, 2023 at 02:00:31PM -, Michael van Elst wrote:
> bou...@antioche.eu.org (Manuel Bouyer) writes:
>
> >But the clock softint shouldn't be locked out for 16s, ever.
>
> Then the clock softint must have a higher priority than
> everything else including hard interrupts.
>
> Obviously that's not how the system is designed, there
> are no limits on how long specific events may take and
> thus no guarantee for lower priority tasks to actually
> execute with a certain time. That would be some kind
> of real-time system.

But obviously such events are not expected to take a long time, or they
would have been deferred to lower priority, preemptible tasks.
Letting such events run for a long time wedges the system.
I still maintain that the bug here is the network soft interrupt running
for such a long time, without giving a chance to other tasks.

>
> Such systems also rarely panic if they detect a violation
> of their rules.
>
> In any case, locking out lower priority tasks by an
> overwhelmed network layer probably isn't the bug that
> we look for.

I disagree. And the heartbeat panic is here to help locate such bugs.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
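[Editorial sketch, not part of the original message.] The argument above is that a high-priority handler should do a bounded amount of work per invocation and then let other tasks run. The toy Python model below illustrates that idea only; it is not how the NetBSD network stack is implemented, and `run_softint`, `simulate` and the `budget` parameter are hypothetical names.

```python
from collections import deque


def run_softint(queue, budget):
    """Handle at most `budget` queued items, then return.

    Returns True when work remains, meaning the handler should be
    rescheduled instead of looping until the queue is empty.
    """
    handled = 0
    while queue and handled < budget:
        queue.popleft()  # stand-in for processing one packet
        handled += 1
    return bool(queue)


def simulate(n_packets, budget):
    """Count handler passes and the turns given to lower-priority work."""
    queue = deque(range(n_packets))
    passes = 0
    other_work_turns = 0
    while True:
        passes += 1
        more = run_softint(queue, budget)
        other_work_turns += 1  # e.g. the clock softint runs between passes
        if not more:
            break
    return passes, other_work_turns


if __name__ == "__main__":
    print(simulate(10, 4))
```

With an unbounded budget the handler would finish in a single pass, and lower-priority work would get only one turn no matter how large the backlog was.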
Re: heartbeat panic by heavy traffic
On Fri, Sep 15, 2023 at 09:19:04AM -, Michael van Elst wrote: > mar...@duskware.de (Martin Husemann) writes: > > >On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote: > >> I think it would be good to change the default behavior from > >> panic to something others because GENERIC kernel enables HEARTBEAT. > >> by default. One of idea is to print warning message at sufficient > >> intervals. > > >I disagree. It is very important that we fix the underlying problem > >instead. Without hearbeat, this behaviour is still visible (but > >undiagnosable). > > The crash here comes from how the network stack operates. Running at > a higher priority, it locks out the lower priority clock softint > and heartbeat detects that and crashes the system intentionally. But the clock softint shouldn't be locked out for 16s, ever. It means that userland processes are stuck too, as well as kernel threads. This is a real bug, the network stack should be fixed to relax at periodic intervals. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
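[Editorial sketch, not part of the original message.] The detection mechanism discussed here can be pictured as each CPU periodically recording progress, with a checker flagging any CPU that has been silent for longer than a maximum period. The Python model below is a simplified illustration under that assumption and is not the kernel's heartbeat(9) code; the class and method names are hypothetical, and treating a period of 0 as "checks disabled" is likewise an assumption of this sketch.

```python
class Heartbeat:
    """Toy model: flag CPUs that have not made progress recently."""

    def __init__(self, ncpu, max_period):
        self.max_period = max_period  # seconds; 0 disables the check
        self.last_beat = [0] * ncpu   # time of the last recorded progress per CPU

    def beat(self, cpu, now):
        """Record that `cpu` made progress at time `now`."""
        self.last_beat[cpu] = now

    def stuck_cpus(self, now):
        """Return the CPUs whose last beat is older than max_period."""
        if self.max_period == 0:
            return []
        return [cpu for cpu, t in enumerate(self.last_beat)
                if now - t > self.max_period]


if __name__ == "__main__":
    hb = Heartbeat(ncpu=2, max_period=15)
    hb.beat(0, now=16)              # CPU 0 keeps making progress
    print(hb.stuck_cpus(now=16))    # CPU 1 has been silent for 16 seconds
```

In this model a CPU whose clock softint is locked out for 16 seconds against a 15-second limit is reported as stuck, which corresponds to the situation the panic is meant to expose.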
Re: Call for testing: New kernel heartbeat(9) checks
On Fri, Jul 07, 2023 at 05:10:33PM +, Taylor R Campbell wrote:
> > Date: Fri, 7 Jul 2023 17:56:42 +0200
> > From: Manuel Bouyer
> >
> > On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote:
> > > - The magic numbers for debug.crashme.spl_spinout are for evbarm.
> > >   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
>
> Correction: IPL_SOFTCLOCK=2.
>
> > > 1. cpuctl offline 0
> > >    sleep 20
> > >    cpuctl online 0
> >
> > With this I get a panic on Xen:
> > [ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()"
> > failed: file
> > "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
> > [...]
> > [ 53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()"
> > failed: file
> > "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
>
> This was a mistake that arose because I was testing on aarch64 where
> kpreempt_disabled() is always true. Update and try again, please!
>
> sys/kern/kern_heartbeat.c 1.2
> sys/kern/subr_xcall.c 1.36

Yes, with these (and using 2 for IPL_SOFTCLOCK) every test passes now.
Thanks!
This allowed me to fix a small bug in Xen's clock initialisation already :)

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: Call for testing: New kernel heartbeat(9) checks
On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote: > FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called > heartbeat(9) that will make the system crash rather than hang when > CPUs are stuck in certain ways that hardware watchdog timers can't > detect (or on systems without hardware watchdog timers). > > It's optional for now, but it's small and I'd like to make it > mandatory in the future. If you'd like to try it out, add the > following two lines to your kernel config: > > options HEARTBEAT > options HEARTBEAT_MAX_PERIOD_DEFAULT=15 > > You can disable it with `sysctl -w kern.heartbeat.max_period=0' at > runtime, or use that knob to change the maximum period before the > system will crash if not all (online) CPUs have made progress. > > > Here are some manual tests that you can use to exercise it -- these > are manual tests, not automatic tests, because some will deliberately > crash the kernel to make sure the diagnostic works, and the others, if > broken, will also crash the kernel. > > Notes: > - The magic numbers for debug.crashme.spl_spinout are for evbarm. > On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1. > For other architectures, consult the source for the numbers to use. > - If you're on a single-CPU system, skip the cpuctl offline/online > tests and just do (4) and (5). > - If you're on a >2-CPU system, then for the cpuctl offline/online > tests, try offlining all CPUs but one at a time. > > 1.cpuctl offline 0 > sleep 20 > cpuctl online 0 With this I get a panic on Xen: [ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158 [ 225.4605386] cpu0: Begin traceback... 
[ 225.4605386] vpanic() at netbsd:vpanic+0x163 [ 225.4605386] kern_assert() at netbsd:kern_assert+0x4b [ 225.4705333] heartbeat_resume() at netbsd:heartbeat_resume+0x82 [ 225.4705333] cpu_xc_online() at netbsd:cpu_xc_online+0x11 [ 225.4705333] xc_thread() at netbsd:xc_thread+0xc8 [ 225.4705333] cpu0: End traceback... [ 225.4705333] fatal breakpoint trap in supervisor mode [ 225.4705333] trap type 1 code 0 rip 0x8022e96d cs 0xe030 rflags 0x202 cr2 0x9b8030d32000 ilevel 0 rsp 0x9b8030985dd0 [ 225.4705333] curlwp 0x9b80007c6900 pid 0.7 lowest kstack 0x9b80309812c0 Stopped in pid 0.7 (system) at netbsd:breakpoint+0x5: leave breakpoint() at netbsd:breakpoint+0x5 vpanic() at netbsd:vpanic+0x163 kern_assert() at netbsd:kern_assert+0x4b heartbeat_resume() at netbsd:heartbeat_resume+0x82 cpu_xc_online() at netbsd:cpu_xc_online+0x11 xc_thread() at netbsd:xc_thread+0xc8 Is it expected ? Nothing looks Xen-specific here > > 2.cpuctl offline 1 > sleep 20 > cpuctl online 1 same panic > > 3.cpuctl offline 0 > sysctl -w kern.heartbeat.max_period=5 > sleep 10 > sysctl -w kern.heartbeat.max_period=0 > sleep 10 > sysctl -w kern.heartbeat.max_period=15 > sleep 20 > cpuctl online 0 Here we have: #sysctl -w kern.heartbeat.max_period=15 [ 53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158 [ 53.5704682] cpu0: Begin traceback... [ 53.5704682] vpanic() at netbsd:vpanic+0x163 [ 53.5704682] kern_assert() at netbsd:kern_assert+0x4b [ 53.5704682] heartbeat_resume() at netbsd:heartbeat_resume+0x82 [ 53.5704682] xc_thread() at netbsd:xc_thread+0xc8 [ 53.5704682] cpu0: End traceback... 
> > 4.sysctl -w debug.crashme_enable=1 > sysctl -w debug.crashme.spl_spinout=1 # IPL_SOFTCLOCK > # verify system panics after 15sec my sysctl command did hang, but the system didn't panic > > 5.sysctl -w debug.crashme_enable=1 > sysctl -w debug.crashme.spl_spinout=6 # IPL_SCHED > # verify system panics after 15sec This one did panic > > 6.cpuctl offline 0 > sysctl -w debug.crashme_enable=1 > sysctl -w debug.crashme.spl_spinout=1 # IPL_SOFTCLOCK > # verify system panics after 15sec my sysctl command did hang, but the system didn't panic > > 7.cpuctl offline 0 > sysctl -w debug.crashme_enable=1 > sysctl -w debug.crashme.spl_spinout=5 # IPL_VM > # verify system panics after 15sec and this one did panic -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
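For readers unfamiliar with the idea behind the checks exercised in this message, here is a minimal user-space sketch of a heartbeat check. It assumes that each CPU bumps a counter whenever it makes progress and that a checker, run once per second, compares the counter with the value it saw last time. This is not the code in kern_heartbeat.c, and every name in it (`struct cpu_hb`, `hb_check()`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state for the sketch. */
struct cpu_hb {
	unsigned count;		/* bumped by the CPU itself when it makes progress */
	unsigned last_seen;	/* value observed by the checker last time */
	unsigned stalled;	/* consecutive checks with no progress */
	bool online;		/* offline CPUs are not expected to beat */
};

/*
 * Run once per second by a checker.  Returns the index of the first
 * online CPU that made no progress for more than max_period checks,
 * or -1 if every online CPU is fine.  max_period == 0 disables the check.
 */
static int
hb_check(struct cpu_hb *cpus, size_t ncpu, unsigned max_period)
{
	assert(cpus != NULL);
	if (max_period == 0)
		return -1;
	for (size_t i = 0; i < ncpu; i++) {
		struct cpu_hb *c = &cpus[i];

		if (!c->online) {
			c->stalled = 0;
			continue;
		}
		if (c->count != c->last_seen) {
			c->last_seen = c->count;
			c->stalled = 0;
		} else if (++c->stalled > max_period) {
			return (int)i;	/* a real kernel would panic here */
		}
	}
	return -1;
}
```

In this model, taking a CPU offline or setting the period to 0 silences the check, which mirrors what the cpuctl and kern.heartbeat.max_period steps in the test list are probing.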
Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)
On Fri, Jun 23, 2023 at 11:37:23PM +, RVP wrote:
> On Fri, 23 Jun 2023, Brian Buhrow wrote:
>
> > hello. My understanding is that the arp caching mechanism works
> > regardless of whether
> > you use static MAC addresses or dynamically generated ones.
> > [...]
> > If you then run brconfig on the bridge containing the domu, you'll see the
> > MAC address you
> > assigned, or which was assigned dynamically, alive and well.
> >
>
> Right, but, cacheing implies a timeout, and is there a timeout for the MAC
> addresses on Xen IFs? Does an `arp -an' indicate this (I can't test this--
> no Xen set up.)

Xen IFs are no different from regular Ethernet interfaces.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)
On Fri, Jun 23, 2023 at 03:52:21PM +0200, Matthias Petermann wrote:
> Hi,
>
> On 23.06.23 02:45, RVP wrote:
> > So, the server tries to write data into the socket; write() fails with
> > errno = EHOSTDOWN which sshd(8) treats as a fatal error and it exits.
> > The client tries to read/write to a closed connection, and it too quits.
> >
> > The part which doesn't make sense is the EHOSTDOWN error. Clearly the
> > other end isn't down. Can't say I understand what's happening here. You
> > need a Xen guru now, Matthias :)
>
> I will still try the tips from yesterday (long time ping test) and collect
> some more data. And yes - I think only someone with a strong Xen background
> can really help me :-) I will followup as soon I completed my recent tests.

I'm not sure it's Xen-specific; there have been changes in the network
stack between -9 and -10 affecting the way ARP and duplicate addresses
are managed.

>
> >
> > On Thu, 22 Jun 2023, Brian Buhrow wrote:
> >
> > > hello. Actually, on the server side, where you get the "host
> > > is down" message, that is a
> > > system error from the network stack itself. I've seen it when the
> > > arp cache times out and
> > > can't be refreshed in a timely manner.
> > >
> > But, does ARP make any sense for Xen IFs? I thought MAC addresses were
> > ginned up for Xen IFs...
>
> At the moment, I manually set the MAC adresses for all DomUs in the Domain
> configuration file (at the network interface specification), example:
>
> ```
> name="srv-net"
> type="pv"
> kernel="/netbsd-XEN3_DOMU.gz"
> memory=512
> vcpus=2
> vif = ['mac=00:16:3E:00:00:01,bridge=bridge0,ip=192.168.2.51' ]

The ip= part is not used by NetBSD.
A fixed MAC address shouldn't make a difference: it's the xl tool which
generates one if needed, and the domU doesn't know if it's fixed or
auto-generated.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
panic in knote
Hello
in my daily tests of HEAD:
https://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/

I've seen this panic twice:
kernel/kqueue/t_proc4 (80/939): 1 test cases
    proc4: [ 538.1439841] uvm_fault(0xc287759c, 0, 2) -> 0xe
[ 538.1642020] fatal page fault in supervisor mode
[ 538.1755046] trap type 6 code 0x2 eip 0xc0d657ac cs 0x8 eflags 0x10286 cr2 0x54 ilevel 0 esp 0xda7b8e78
[ 538.1985618] curlwp 0xc4d70900 pid 23901 lid 23901 lowest kstack 0xda7b62c0
[ 538.2145388] panic: trap
[ 538.2145388] cpu0: Begin traceback...
[ 538.2294995] vpanic(c12d339c,da7b8d18,da7b8dd4,c01306b2,c12d339c,da7b8de0,da7b8de0,5d5d,da7b62c0,10286) at netbsd:vpanic+0x196
[ 538.2537154] panic(c12d339c,da7b8de0,da7b8de0,5d5d,da7b62c0,10286,54,0,da7b8e78,c287759c) at netbsd:panic+0x18
[ 538.2847793] trap() at netbsd:trap+0xd7c
[ 538.2949956] --- trap (number 6) ---
[ 538.3051892] mutex_init(54,2,0,da7b8e60,c04bdd42,c4c391c0,c4c391c0,c4c391c0,54,c4c391c0) at netbsd:mutex_init+0x9
[ 538.3336643] knote_proc_fork_track(c4c23c48,c4a6b040,0,da7b8ea4,c0d0f8d8,c36b4440,c36b4440,c4a6b040,c4d70900,da7b8f10) at netbsd:knote_proc_fork_track+0xce
[ 538.3636624] knote_proc_fork(c4a6b040,c36b4440,da711000,0,0,0,c0d549e0,c4c391c0,da7b8ef4,0) at netbsd:knote_proc_fork+0x97
[ 538.3945229] fork1(c4d70900,0,14,0,0,c0d549e0,0,da7b8f60,da7b8f9c,c04bd5ab) at netbsd:fork1+0x667
[ 538.4238665] sys_fork(c4d70900,da7b8f68,da7b8f60,c23454c8,1,2,da7b8f60,da7b8f68,0,0) at netbsd:sys_fork+0x48
[ 538.4556545] syscall() at netbsd:syscall+0x17c
[ 538.4648600] --- syscall (number 2) ---
[ 538.4822324] bb3b7027:
[ 538.4876585] cpu0: End traceback...

Any idea?

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: kernel deadlock on fstchg with vnd
On Sun, May 29, 2022 at 01:18:16PM +0200, J. Hannken-Illjes wrote:
> > On 29. May 2022, at 08:30, Michael van Elst wrote:
> >
> > bou...@antioche.eu.org (Manuel Bouyer) writes:
> >
> >> Hello,
> >> do you have an idea on the problem in this thread:
> >> http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html
> > [...]
> >> I can't reproduce this when using vnd from userland.
> >
> > You can replicate it by addressing the block device with vnconfig.
> >
> > A workaround would be to modify the Xen block script to select the
> > raw device:
> >
> > vnconfig /dev/r${disk}d $xparams >/dev/null; then
> >
> > or just the disk name:
> >
> > vnconfig ${disk} $xparams >/dev/null; then
>
> Good catch, sys/dev/vnd.c has this:
>
> 1751 static void
> 1752 vndclear(struct vnd_softc *vnd, int myminor)
> 1753 {
> 1754     struct vnode *vp = vnd->sc_vp;
> 1755     int fflags = FREAD;
> 1756     int bmaj, cmaj, i, mn;
> 1757     int s;
> 1758
> 1759 #ifdef DEBUG
> 1760     if (vnddebug & VDB_FOLLOW)
> 1761         printf("vndclear(%p): vp %p\n", vnd, vp);
> 1762 #endif
> 1763     /* locate the major number */
> 1764     bmaj = bdevsw_lookup_major(&vnd_bdevsw);
> 1765     cmaj = cdevsw_lookup_major(&vnd_cdevsw);
> 1766
> 1767     /* Nuke the vnodes for any open instances */
> 1768     for (i = 0; i < MAXPARTITIONS; i++) {
> 1769         mn = DISKMINOR(device_unit(vnd->sc_dev), i);
> 1770         vdevgone(bmaj, mn, mn, VBLK);
> 1771         if (mn != myminor) /* XXX avoid to kill own vnode */
> 1772             vdevgone(cmaj, mn, mn, VCHR);
> 1773     }
>
> The "skip myself" on lines 1771/1772 is responsible for this behaviour.

Yes and doing the same for block devices avoids the issue.
But Taylor is reluctant to commit this hack.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
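The difference between the quoted loop and "doing the same for block devices" can be reduced to one predicate: which device nodes get revoked when the unit is cleared through the node with minor `myminor`. The sketch below models only that decision in user space. It is not a proposed patch, and `enum devtype`, `should_revoke()` and the `skip_own_blk` switch are hypothetical names for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum devtype { DEV_BLK, DEV_CHR };	/* stand-ins for VBLK / VCHR */

/*
 * Decide whether the device node (type, minor) should be revoked when
 * the unit is cleared through the node with minor `myminor`.
 * skip_own_blk == false models the quoted code, where only the caller's
 * character node is spared; true models sparing its block node as well.
 */
static bool
should_revoke(enum devtype type, int minor, int myminor, bool skip_own_blk)
{
	assert(minor >= 0 && myminor >= 0);
	if (minor != myminor)
		return true;		/* other partitions are always revoked */
	if (type == DEV_CHR)
		return false;		/* the existing "skip myself" case */
	return !skip_own_blk;		/* own block node: depends on the variant */
}
```

With `skip_own_blk` false, a caller that came in through the block node revokes the very node it is using, which is the situation the thread associates with the hang.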
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
On Fri, May 27, 2022 at 02:06:59PM +0200, Matthias Petermann wrote:
> Anyway, Once I try to "xl console" I did only get a fragment:
>
> ```
> ganymed$ doas xl console net
> [ 1.000] cpu_rng: rdrand
> [ 1.000] entropy: ready
> [ 1.000] Copyright (c) 1996, 1997, 1998, 1999,
> ```
>
> At the "1999," the Dom0 became frozen, again.

A recent change caused xenconsoled to hang, and possibly xenstore to miss
events too. Should be fixed with
src/sys/arch/xen/xen/xenevt.c 1.65

But the hang on the filesystem remains for me.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
kernel deadlock on fstchg with vnd
Hello,
do you have an idea on the problem in this thread:
http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html

When stopping a Xen guest with a virtual disk backed by a file, the
vnconfig -u process won't exit: it hangs on specio, and other processes
hang on fstchg.

From kernel messages, the xbd backend has closed the vnd device which is
being unconfigured, although I can't say if it did so before or after the
vnconfig -u process was started (but likely before).

I can't reproduce this when using vnd from userland.

This happens with the file on /, or on a different partition, with or
without -o log. It happens even if the dom0 has a single CPU.

Any idea how to debug this further?

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
On Fri, May 27, 2022 at 04:41:29PM +0200, J. Hannken-Illjes wrote:
> > On 27. May 2022, at 16:24, Manuel Bouyer wrote:
> >
> > On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
> >>> On 27. May 2022, at 14:41, Matthias Petermann
> >>> wrote:
> >>>
> >>> Hello Jürgen,
> >>>
> >>> Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
> >>>> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
> >>>> should give even more details.
> >>>
> >>> here is the stacktrace from the vnconfig process (the PID has changed
> >>> since I restarted):
> >>>
> >>> https://www.petermann-it.de/tmp/p7.jpg
> >>
> >> This is the thread currently suspending the root fs (vrevoke suspends it).
> >>
> >> Looks like it is waiting for I/O to drain on the vnd device ...
> >>
> >>> You can find the output of fstrans_dump here:
> >>>
> >>> https://www.petermann-it.de/tmp/p8.jpg
> >>
> >> The owner is irritating, it should be vnconfig from above.
> >
> > I can reproduce it:
>
> What is the recipe?

xl create -c
shutdown -p now in the guest
notice that the guest doesn't shut down and run xl destroy
(I think xl destroy is what causes the deadlock, by running a second
vnconfig -u)

But my dom0 has 32 vcpus, and this seems to cause other troubles (at the
xenstore level, among others). Trying again with only 1 vcpu.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
On Fri, May 27, 2022 at 04:24:30PM +0200, Manuel Bouyer wrote: > On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote: > > > On 27. May 2022, at 14:41, Matthias Petermann > > > wrote: > > > > > > Hello Jürgen, > > > > > > Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes: > > >> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump" > > >> should give even more details. > > > > > > here is the stacktrace from the vnconfig process (the PID has changed > > > since I restarted): > > > > > > https://www.petermann-it.de/tmp/p7.jpg > > > > This is the thread currently suspending the root fs (vrevoke suspends it). > > > > Looks like it is waiting for I/O to drain on the vnd device ... > > > > > You can find the output of fstrans_dump here: > > > > > > https://www.petermann-it.de/tmp/p8.jpg > > > > The owner is irritating, it should be vnconfig from above. > > I can reproduce it: > db{0}> ps > PIDLID S CPU FLAGS STRUCT LWP * NAME WAIT > 2419 2419 3 8 0 9000210b9280 tcsh fstchg > 2415 2415 3 11 0 90001f66f540 vnconfig fstchg > 2416 2416 3 18 0 900020ea3200dirname fstchg > 2417 2417 3 24 0 900020e6c700 sh fstchg > 2414 2414 3 12 0 90001f6d7a00 vnconfig specio > [...] 
> db{0}> tr/t 0t2415 > trace: pid 2415 lid 2415 at 0x90008ed3e980 > sleepq_block() at netbsd:sleepq_block+0x12c > cv_wait() at netbsd:cv_wait+0x42 > fstrans_start() at netbsd:fstrans_start+0x193 > VOP_LOCK() at netbsd:VOP_LOCK+0x79 > vn_lock() at netbsd:vn_lock+0xae > namei_tryemulroot() at netbsd:namei_tryemulroot+0x1024 > namei() at netbsd:namei+0x29 > vn_open() at netbsd:vn_open+0x133 > do_open() at netbsd:do_open+0xc3 > do_sys_openat() at netbsd:do_sys_openat+0x74 > sys_open() at netbsd:sys_open+0x24 > syscall() at netbsd:syscall+0x18c > --- syscall (number 5) --- > netbsd:syscall+0x18c: > db{0}> tr/t 0t2414 > trace: pid 2414 lid 2414 at 0x90008c57e6c0 > sleepq_block() at netbsd:sleepq_block+0x12c > cv_wait() at netbsd:cv_wait+0x42 > spec_io_drain() at netbsd:spec_io_drain+0x84 > spec_close() at netbsd:spec_close+0x1c6 > VOP_CLOSE() at netbsd:VOP_CLOSE+0x38 > spec_node_revoke() at netbsd:spec_node_revoke+0x14d > vcache_reclaim() at netbsd:vcache_reclaim+0x4e7 > vgone() at netbsd:vgone+0xcd > vrevoke() at netbsd:vrevoke+0xfa > genfs_revoke() at netbsd:genfs_revoke+0x13 > VOP_REVOKE() at netbsd:VOP_REVOKE+0x35 > vdevgone() at netbsd:vdevgone+0x64 > vnddoclear.part.0() at netbsd:vnddoclear.part.0+0xaa > vndioctl() at netbsd:vndioctl+0x78c > bdev_ioctl() at netbsd:bdev_ioctl+0x91 > spec_ioctl() at netbsd:spec_ioctl+0xa5 > VOP_IOCTL() at netbsd:VOP_IOCTL+0x41 > vn_ioctl() at netbsd:vn_ioctl+0xb3 > sys_ioctl() at netbsd:sys_ioctl+0x555 > syscall() at netbsd:syscall+0x18c > --- syscall (number 54) --- > netbsd:syscall+0x18c: > db{0}> call fstrans_dump > Fstrans locks by lwp: > [ 5691.3454404] 2414.241 (/) shared 2 cow 0 alias 0 > [ 5691.3454404] Fstrans state by mount: > [ 5691.3454404] /owner 0x90001f6d7a00 state suspended > > In the ps output there is also: > 0 2324 3 3 200 90001fe43340 vnd0 vndbp > db{0}> tr/a 90001fe43340 > trace: pid 0 lid 2324 at 0x90008c806df0 > sleepq_block() at netbsd:sleepq_block+0x12c > vndthread() at netbsd:vndthread+0x78c > > So it looks 
like vnconfig waits for the vnd I/O to drain, but the vnd thread > is idle. could this happen if the vnd is still open ? I suspect the xbd backend did not close the vnd. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
> > On 27. May 2022, at 14:41, Matthias Petermann wrote:
> >
> > Hello Jürgen,
> >
> > Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
> >> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
> >> should give even more details.
> >
> > here is the stacktrace from the vnconfig process (the PID has changed since
> > I restarted):
> >
> > https://www.petermann-it.de/tmp/p7.jpg
>
> This is the thread currently suspending the root fs (vrevoke suspends it).
>
> Looks like it is waiting for I/O to drain on the vnd device ...
>
> > You can find the output of fstrans_dump here:
> >
> > https://www.petermann-it.de/tmp/p8.jpg
>
> The owner is irritating, it should be vnconfig from above.

I can reproduce it:
db{0}> ps
PID    LID S CPU FLAGS STRUCT LWP *  NAME     WAIT
2419  2419 3   8     0 9000210b9280  tcsh     fstchg
2415  2415 3  11     0 90001f66f540  vnconfig fstchg
2416  2416 3  18     0 900020ea3200  dirname  fstchg
2417  2417 3  24     0 900020e6c700  sh       fstchg
2414  2414 3  12     0 90001f6d7a00  vnconfig specio
[...]
db{0}> tr/t 0t2415
trace: pid 2415 lid 2415 at 0x90008ed3e980
sleepq_block() at netbsd:sleepq_block+0x12c
cv_wait() at netbsd:cv_wait+0x42
fstrans_start() at netbsd:fstrans_start+0x193
VOP_LOCK() at netbsd:VOP_LOCK+0x79
vn_lock() at netbsd:vn_lock+0xae
namei_tryemulroot() at netbsd:namei_tryemulroot+0x1024
namei() at netbsd:namei+0x29
vn_open() at netbsd:vn_open+0x133
do_open() at netbsd:do_open+0xc3
do_sys_openat() at netbsd:do_sys_openat+0x74
sys_open() at netbsd:sys_open+0x24
syscall() at netbsd:syscall+0x18c
--- syscall (number 5) ---
netbsd:syscall+0x18c:
db{0}> tr/t 0t2414
trace: pid 2414 lid 2414 at 0x90008c57e6c0
sleepq_block() at netbsd:sleepq_block+0x12c
cv_wait() at netbsd:cv_wait+0x42
spec_io_drain() at netbsd:spec_io_drain+0x84
spec_close() at netbsd:spec_close+0x1c6
VOP_CLOSE() at netbsd:VOP_CLOSE+0x38
spec_node_revoke() at netbsd:spec_node_revoke+0x14d
vcache_reclaim() at netbsd:vcache_reclaim+0x4e7
vgone() at netbsd:vgone+0xcd
vrevoke() at netbsd:vrevoke+0xfa
genfs_revoke() at netbsd:genfs_revoke+0x13
VOP_REVOKE() at netbsd:VOP_REVOKE+0x35
vdevgone() at netbsd:vdevgone+0x64
vnddoclear.part.0() at netbsd:vnddoclear.part.0+0xaa
vndioctl() at netbsd:vndioctl+0x78c
bdev_ioctl() at netbsd:bdev_ioctl+0x91
spec_ioctl() at netbsd:spec_ioctl+0xa5
VOP_IOCTL() at netbsd:VOP_IOCTL+0x41
vn_ioctl() at netbsd:vn_ioctl+0xb3
sys_ioctl() at netbsd:sys_ioctl+0x555
syscall() at netbsd:syscall+0x18c
--- syscall (number 54) ---
netbsd:syscall+0x18c:
db{0}> call fstrans_dump
Fstrans locks by lwp:
[ 5691.3454404] 2414.241 (/) shared 2 cow 0 alias 0
[ 5691.3454404] Fstrans state by mount:
[ 5691.3454404] / owner 0x90001f6d7a00 state suspended

In the ps output there is also:
0     2324 3   3   200 90001fe43340  vnd0     vndbp
db{0}> tr/a 90001fe43340
trace: pid 0 lid 2324 at 0x90008c806df0
sleepq_block() at netbsd:sleepq_block+0x12c
vndthread() at netbsd:vndthread+0x78c

So it looks like vnconfig waits for the vnd I/O to drain, but the vnd thread
is idle.

-- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
On Fri, May 27, 2022 at 11:42:08AM +0200, Matthias Petermann wrote:
> I took some "screenshots" of the vga console. This was unfortunately the
> only way because the device has no serial console.
>
>
> Paginated processes list:
>
> https://www.petermann-it.de/tmp/p1.jpg
> https://www.petermann-it.de/tmp/p2.jpg
> https://www.petermann-it.de/tmp/p3.jpg

Several processes are in fstchg wait; a stack trace of these processes
(tr/t 0t<pid> or tr/a 0x<address> would show these) would help.

So it looks like a deadlock in the filesystem. What is your storage
configuration?

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
On Fri, May 27, 2022 at 10:12:44AM +0200, Matthias Petermann wrote: > > Hello all, > > currently I am not able to instantiate a NetBSD Xen guest on NetBSD 9.99 > (side fact: I also have problems with a Windows guest, but it is not that > important at the moment). > > The problem occurs in the following environment: > > - Xen Kernel 4.15.2 and matching Xen Tools from pkgsrc 2022Q1 (built > 29.04.2022) > - NetBSD/Xen 9.99.97 (build 25.05.2022) > > The host is booted with this boot.cfg (if this matters): > > ``` > menu=Boot Xen:load /netbsd-XEN3_DOM0.gz console=pc;multiboot /xen.gz > dom0_mem=512M vga=keep console=vga > ``` > > The guest config looks like this: > > ``` > name = "net" > type="pv" > kernel = "/netbsd-INSTALL_XEN3_DOMU.gz" > #kernel = "/netbsd-XEN3_DOMU.gz" > memory = 2048 > vcpus = 2 > vif = [ 'mac=00:16:3E:01:00:01,bridge=bridge0' ] > disk = [ >'file:/data/vhd/net.img,hda,rw', >'file:/data/vhd/net-export.img,hdb,rw' > ] > ``` > > When I try to instantiate the guest, I get the following output on the > controlling terminal: > > ``` > ganymed$ doas xl create net > Parsing config from net > libxl: error: libxl_device.c:1109:device_backend_callback: Domain 1:unable > to add device with path /local/domain/0/backend/vif/1/0 > libxl: error: libxl_create.c:1862:domcreate_attach_devices: Domain 1:unable > to add vif devices > ``` did you create the bridge0 ? > > At the same time the following message appears on the system console: > > ``` > [ 184.680057] xbd backend: attach device vnd0d (size 1048576000) for > domain 1 > [ 184.910057] xbd backend: attach device vnd1d (size 33554432) for domain > 1 > [ 195.260077] xvif1i0: Ethernet address 00:16:3e:02:00:01 > [ 195.320059] xbd backend: detach device vnd1d for domain 1 > [ 195.350051] xbd backend: detach device vnd0d for domain 1 > [ 195.450054] xvif1i0: disconnecting > ``` > > After the messages appear on the system console, the system does not respond > to any input either via SSH or on the local console. 
It seems to be frozen. > I can still activate the kernel debugger with Control+Alt+Escape. Can you get a stack trace, and processes list ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: cmake hang solution?
On Mon, May 02, 2022 at 11:13:45AM -0700, Chuck Silvers wrote:
> it looks like the diff won't apply as-is, but I think the concept still
> applies.
>
> note that there have been a LOT of changes in libpthread since netbsd-9,
> and some of those changes also claim to fix problems where threads hang
> waiting on locks and/or condvars. it would be more useful to test
> with a HEAD libpthread (which I'll guess requires a HEAD libc too).

the goal is to build the official netbsd-9 packages, so that's not an option

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: cmake hang solution?
On Sun, May 01, 2022 at 01:24:01PM -0700, Chuck Silvers wrote:
> On Tue, Apr 05, 2022 at 02:10:36PM -, Michael van Elst wrote:
> > w...@netbsd.org (Thomas Klausner) writes:
> > >I never saw the cmake hang myself. I still see hangs in guile.
> >
> >
> > I see both in almost every pbulk run.
>
>
> please try this patch for the cmake variation of this hang:
>
> http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

would this apply to netbsd-9 too ?
The hang I'm seeing is on a system with a HEAD kernel and a netbsd-9 userland

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: reproducible kernel crash with quota
On Tue, Apr 12, 2022 at 08:52:28AM +0200, 6b...@6bone.informatik.uni-leipzig.de wrote:
> Hello,
>
> since I already have some open bugs with reproducible kernel crashes, I'm
> only writing this to the mailing list.
>
> how to reproduce the crash: /etc/rc.d/quota restart
>
> dmesg:
>
> [ 412.047595] panic: kernel diagnostic assertion
> "dq->dq_ump->um_quotas[dq->dq_type] != vp" failed: file
> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978
> [ 412.047595] cpu8: Begin traceback...
> [ 412.047595] vpanic() at netbsd:vpanic+0x156
> [ 412.057595] kern_assert() at netbsd:kern_assert+0x4b
> [ 412.057595] dqflush() at netbsd:dqflush+0x92
> [ 412.057595] quota1_handle_cmd_quotaoff() at
> netbsd:quota1_handle_cmd_quotaoff+0x120

I wonder if quota1_handle_cmd_quotaoff(), when it can't get an exclusive
lock for a vnode, could fail to free the associated quota structure.
Shouldn't it wait for the exclusive vnode lock or retry in this case?

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: Bug or no Bug?
On Wed, Feb 09, 2022 at 09:22:34PM +0100, 6b...@6bone.informatik.uni-leipzig.de wrote:
> Hello,
>
> I have installed the 9.99.xx kernel on several systems. On most systems
> there are no problems. On a Dell 2800, the kernel crashes during boot. The
> problem only occurs if the option LOCKDEBUG is set.
>
> options LOCKDEBUG # expensive locking checks/support
>
> Should a bug report be made in this case? Or should problems that only occur
> when LOCKDEBUG is enabled be ignored?

Crashes with LOCKDEBUG are not expected, so please report this one.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: execute statically-linked linux files
On Thu, Jan 06, 2022 at 11:38:58PM +, RVP wrote:
> On Thu, 6 Jan 2022, Manuel Bouyer wrote:
>
> > the second issue is that it expects /emul/linux/proc/self/fd/4 to be a
> > working
> > symlink, and on NetBSD it's not. Note that with /bin/ls I get something
> > similar:
> > armandeche:/local/armandeche1/tmp#ktrace -i ls -l /proc/self/fd/
> > total 2
> > crw--w 1 bouyer tty5, 0 Jan 6 17:54 0
> > crw--w 1 bouyer tty5, 0 Jan 6 17:54 1
> > crw--w 1 bouyer tty5, 0 Jan 6 17:54 2
> > lr-xr-xr-x 1 rootwheel 2048 Jan 6 17:54 3 -> /local/armandeche1/tmp
> >
> > ls: /proc/self/fd//4: Invalid argument
> > lr-xr-xr-x 1 rootwheel 0 Jan 6 17:54 4
> >
> > 22875 1 ls CALL readlink(0x7f7fffb98200,0x7f7fffb98610,0x400)
> > 22875 1 ls NAMI "/proc/self/fd//4"
> > 22875 1 ls RET readlink -1 errno 22 Invalid argument
> >
> > If I can trust the ktrace output, fd/4 should point to /etc/spwd.db
> >
> > On linux, strace shows it reading the link from /proc/self/exec, getting
> > back
> >
>
> This 2nd issue I think I can explain: the fd existed at the start of a
> readdir(), but, then is closed sometime when the listing is still in
> progress as in the code below:

It could be it, as when the directory is read, fd 4 is the directory itself.
But at the time of the readlink, fd 4 is definitely open, but points to
another file (I can't see a close(4) between the open("/etc/spwd.db") and
the readlink()).

Anyway, the issue with the linux binary is likely different.

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
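The failing step in the ktrace output is a readlink(2) on a procfs fd entry. A small helper along the following lines can be used to poke at a single descriptor without going through ls(1). It is only a sketch: it assumes procfs is mounted on /proc, and `fd_link_path()` and `fd_target()` are hypothetical helper names, not existing library functions.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build the procfs path for descriptor fd.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
fd_link_path(char *buf, size_t len, int fd)
{
	int n;

	assert(buf != NULL && fd >= 0);
	n = snprintf(buf, len, "/proc/self/fd/%d", fd);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Resolve what descriptor fd points to, as seen through procfs.
 * Returns the target length, or -1 with errno set by readlink(2)
 * (errno 22, EINVAL, is what the ktrace output reports).
 */
static ssize_t
fd_target(int fd, char *target, size_t len)
{
	char path[64];
	ssize_t n;

	if (len == 0 || fd_link_path(path, sizeof(path), fd) == -1)
		return -1;
	n = readlink(path, target, len - 1);
	if (n >= 0)
		target[n] = '\0';	/* readlink(2) does not NUL-terminate */
	return n;
}
```

Calling `fd_target()` right after an open(2) and checking errno would separate "the descriptor went away during the listing" from "procfs cannot produce a link for this kind of vnode".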
Re: execute statically-linked linux files
On Thu, Jan 06, 2022 at 05:02:13PM +0100, Anders Magnusson wrote: > Kave you looked at brandelf? > > https://www.freebsd.org/cgi/man.cgi?query=brandelf=1 Looks like what I need, thanks. For the record, attached is my port to NetBSD of this Interestingly, it seems to recognise all binaries as SVR4 (for NetBSD or linux binaries) so it seems that the ELF type is recorded at some other place. Anyway with a binary rebranded to linux I now hit another issue: it quickly core dumps, with an issue that seems related to procfs: with procfs only mounted on /emul/linux/proc, I get: 6369 6369 xc8 CALL open(0x43d6da,0x280800,0x66d208) 6369 6369 xc8 NAMI "/emul/linux/proc/self/exe" 6369 6369 xc8 NAMI "/proc/self/exe" 6369 6369 xc8 RET open -1 errno -2 No such file or directory 6369 6369 xc8 PSIG SIGSEGV SIG_DFL: code=SEGV_MAPERR, addr=0x0, trap=14) 6369 6369 xc8 NAMI "xc8.core" But /emul/linux/proc/self/exe should exists: armandeche:/>ls -l /emul/linux/proc/self/exe lr-xr-xr-x 1 root wheel 7 Jan 6 17:46 /emul/linux/proc/self/exe -> /bin/ls armandeche:/>/emul/linux/bin/ls /emul/linux/proc/self/exe /emul/linux/proc/self/exe If I also mount procfs on /proc things go a bit further: 25735 25735 xc8 CALL open(0x43d6da,0x280800,0x66d208) 25735 25735 xc8 NAMI "/emul/linux/proc/self/exe" 25735 25735 xc8 NAMI "/proc/self/exe" 25735 25735 xc8 RET open 4 25735 25735 xc8 CALL readlink(0x7f7fd6f5,0x7f7fd830,0xfff) 25735 25735 xc8 NAMI "/emul/linux/proc/self/fd/4" 25735 25735 xc8 RET readlink -1 errno -22 Invalid argument 25735 25735 xc8 CALL close(4) 25735 25735 xc8 RET close 0 25735 25735 xc8 PSIG SIGSEGV SIG_DFL: code=SEGV_MAPERR, addr=0x0, trap=14) 25735 25735 xc8 NAMI "xc8.core" What's strange here is that /emul/linux/proc/self/exe should work as well as /proc/self/exe the second issue is that it expects /emul/linux/proc/self/fd/4 to be a working symlink, and on NetBSD it's not. 
Note that with /bin/ls I get something similar: armandeche:/local/armandeche1/tmp#ktrace -i ls -l /proc/self/fd/ total 2 crw--w 1 bouyer tty5, 0 Jan 6 17:54 0 crw--w 1 bouyer tty5, 0 Jan 6 17:54 1 crw--w 1 bouyer tty5, 0 Jan 6 17:54 2 lr-xr-xr-x 1 rootwheel 2048 Jan 6 17:54 3 -> /local/armandeche1/tmp ls: /proc/self/fd//4: Invalid argument lr-xr-xr-x 1 rootwheel 0 Jan 6 17:54 4 22875 1 ls CALL readlink(0x7f7fffb98200,0x7f7fffb98610,0x400) 22875 1 ls NAMI "/proc/self/fd//4" 22875 1 ls RET readlink -1 errno 22 Invalid argument If I can trust the ktrace output, fd/4 should point to /etc/spwd.db On linux, strace shows it reading the link from /proc/self/exec, getting back the executable path and doing a stat on it. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference -- /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2000, 2001 David O'Brien * Copyright (c) 1996 Søren Schmidt * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright *notice, this list of conditions and the following disclaimer *in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright *notice, this list of conditions and the following disclaimer in the *documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products *derived from this software without specific prior written permission * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include #include #include #include #include #include #include #include #include #include #include #include static int elftype(const char *); static const char *iselftype(int); static void printelftypes(void); static void usage(void); struct ELFtypes { const char *str; int value; }; /* XXX - any more types? */
execute statically-linked linux files
Hello,
I have linux binaries I'd like to run on NetBSD (this is a commercial product). Some files are dynamically linked and run properly. They show up as:

ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.24, BuildID[sha1]=38afca809a07f7e934012f7dac9094e3bcd2585d, stripped

But there are also some statically-linked files, which show up as:

ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

(note the missing "for GNU/Linux" here), and NetBSD refuses to run them (Exec format error. Binary file not executable.).

Is there a way to convert the ELF header so that NetBSD can run them ?

thanks

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
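As a rough illustration of what "converting the ELF header" could mean: the "(SYSV)" in the file(1) output reflects the OS/ABI byte of the ELF identification, which the ELF specification places at index 7 of e_ident (0 for System V, 3 for Linux/GNU). The sketch below only inspects and rewrites that byte in an in-memory copy of the header. The function names are hypothetical, and whether changing this byte alone is enough for the NetBSD kernel to accept the binary is an assumption that the sketch does not settle.

```c
#include <stddef.h>
#include <string.h>

/* Layout constants from the ELF specification. */
enum { IDENT_SIZE = 16, OSABI_INDEX = 7 };
enum { OSABI_SYSV = 0, OSABI_LINUX = 3 };

/* Return 1 if 'buf' holds at least a full e_ident with the ELF magic. */
int is_elf(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    return len >= IDENT_SIZE && memcmp(buf, magic, sizeof(magic)) == 0;
}

/* Return the OS/ABI byte, or -1 if 'buf' is not an ELF header. */
int get_osabi(const unsigned char *buf, size_t len)
{
    return is_elf(buf, len) ? buf[OSABI_INDEX] : -1;
}

/* Overwrite the OS/ABI byte; returns 0 on success, -1 if not ELF. */
int set_osabi(unsigned char *buf, size_t len, unsigned char abi)
{
    if (!is_elf(buf, len))
        return -1;
    buf[OSABI_INDEX] = abi;
    return 0;
}
```

A real tool would read the first 16 bytes of the file, call set_osabi(), and write them back in place; that file handling is left out here to keep the sketch short.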
Re: XEN devices included in kernel even if not XEN
On Tue, Dec 21, 2021 at 07:44:59AM -0800, Paul Goyette wrote: > I've noticed that device drivers listed in arch/xen/conf/files.xen > (or, at least, most of those devices) seem to be included in kernel > even if not using XEN. Shouldn't all those devices be conditional? > > # sysctl -a | grep driver | tr ',' '\n' | grep 'x[be]*' > ... > [141 -1 xenevt] > [142 142 xbd] > [143 -1 xencons] I think this lists all the known major numbers for the $MACHINE, I don't think it means that the driver is actually loaded. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Serious bugs in NetBSD-current, have they been fixed?
On Mon, Oct 25, 2021 at 09:33:26AM +0100, Chavdar Ivanov wrote: > [...] > The only (minor) problem I am still having occasionally is with cmake, > which hangs for me in two well-defined and repeating spots when I am > doing pkg_rolling-replace (the build completes when I attach to the > cmake process with gdb and just quit it). This has been discussed > before, I still am not clear if this is entropy related (unlikely as > it occurs always during the build of two particular packages only), a > problem with threads or an internal cmake bug. It's probably kern/56414 (probably wrong category as it seems to be a userland bug). It's not related to entropy. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: linux clone issue
On Tue, Oct 05, 2021 at 03:57:14PM +0100, Robert Swindells wrote: > > Manuel Bouyer wrote: > >I'm trying to run a binary-only linux program under NetBSD 9.2. > >From what I found, the binary was built on Ubuntu 16.04 > > > >The program dies at at specific point and it seems to be a bug in our > >emulation: > > 8992 8992 mylinuxprog CALL set_robust_list(0x7f7ff7ef5a20,0x18) > 8992 8992 mylinuxprog RET set_robust_list 0 > > This is doing futex stuff which isn't in -9, it doesn't work in -current > either but thorpej@ has an improved version on a branch. Hum, so after the ptrace issue this is going to be the next challenge :) -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
linux ptrace issue [Re: linux clone issue]
On Tue, Oct 05, 2021 at 01:08:52PM +0200, Manuel Bouyer wrote: > On Tue, Oct 05, 2021 at 12:42:33AM -0400, Eric Hawicz wrote: > > > > On 10/4/2021 10:33 AM, Manuel Bouyer wrote: > > > Hello > > > I'm trying to run a binary-only linux program under NetBSD 9.2. > > > From what I found, the binary was built on Ubuntu 16.04 > > > [...] > > > > > > As you can see above (ktrace -si output), the read on fd 3 in 26751 > > > returns > > > with an error as soon as the child does its execve(), just as if CLOSEEXEC > > > was set in the child. But the dup2(4,1) should keep the write side open > > > without CLOSEEXEC. The program does a similar sequence just before > > > (also forking a shell to execute some command) and it works. > > > Later when sh tries to write to stdout it gets a SIGPIPE. > > > > > > I couldn't reproduce this with a simple program. > > > But it seems that I can't reproduce this clone call. It seems that we are > > > called with flags 0x1200011, which would translate to > > > CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, > > > and a NULL stack pointer. > > > But when run on linux, this clone syscall straces to > > > CLONE_VM|CLONE_VFORK|SIGCHLD > > > > I think that combination of flags is actually a "fork()" call, which glibc > > implements using clone. I found that through > > https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/, > > which mentions that glibc has a ARCH_FORK macro, though it seems that the > > more recent code uses an arch_fork inline function: > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/arch-fork.h;h=b846da08f98839aef336868de24850626428509c;hb=HEAD > > Yes, I think it's a form of fork() or vfork(). But when I compile a > test program on linux (RHEL7 or Ubuntu 20), fork() and vfork() appears > as fork and vfork in NetBSD's ktrace, not clone. 
I missed a point in the trace output: the parent is killed, and read() returns not because the other end is closed but because of the signal. This seems to come from a ptrace difference between linux and our emulation.

Actually this binary linux program does a fork() and the child does the work; the parent just waits. But what happens is:

the parent:
    p = fork()
    wait()
    ptrace(PTRACE_CONT, p, NULL, SIG_0)
    exit(0)

the child does:
    ptrace(PTRACE_TRACEME, 0, NULL, NULL)
    exit(0)

On linux, ptrace(PTRACE_TRACEME) returns EPERM, the wait in the parent waits until the child exits, and ptrace(PTRACE_CONT) gets ESRCH.

On NetBSD, ptrace(PTRACE_TRACEME) succeeds, wait() returns at some point before the child exits, the parent does ptrace(PTRACE_CONT) on the child, and the child gets killed (not by the parent, I can't see a kill() in the trace).

On linux, ptrace(PTRACE_TRACEME) receiving EPERM may be because the process is running under strace. Running strace without -f (so that only the parent gets traced), I see the wait() returning, the parent getting a SIGCHLD, and ptrace(PTRACE_CONT) succeeding. But on linux, it doesn't seem that an orphaned child process gets killed.

Could our linux ptrace emulation be fixed in any way ? In particular, to avoid the
pid XXX was killed: orphaned traced process

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
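The parent/child sequence described above can be written as a small Linux test program, which may help when comparing native behaviour with the emulation. This is only a sketch under assumptions: it uses the Linux ptrace request names from <sys/ptrace.h>, run_sequence and struct trace_result are hypothetical names, and it passes 0 as the signal argument where the message writes SIG_0. It records what wait and PTRACE_CONT report rather than asserting a particular outcome.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* What the parent observed; both fields are for later inspection. */
struct trace_result {
    int wait_status;   /* raw status from waitpid() */
    int cont_errno;    /* errno from PTRACE_CONT, 0 if it succeeded */
};

/* Returns 0 if the sequence ran, -1 if fork() or waitpid() failed. */
int run_sequence(struct trace_result *res)
{
    pid_t p = fork();

    if (p < 0)
        return -1;

    if (p == 0) {
        /* Child: ask to be traced, then exit immediately. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        _exit(0);
    }

    /* Parent: wait for a status change, then try to continue the child. */
    if (waitpid(p, &res->wait_status, 0) < 0)
        return -1;

    errno = 0;
    res->cont_errno = 0;
    if (ptrace(PTRACE_CONT, p, NULL, NULL) < 0)
        res->cont_errno = errno;

    return 0;
}
```

Checking WIFEXITED() versus WIFSTOPPED() on wait_status, together with cont_errno, shows which of the two behaviours described in the message a given system follows.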
Re: linux clone issue
On Tue, Oct 05, 2021 at 12:42:33AM -0400, Eric Hawicz wrote: > > On 10/4/2021 10:33 AM, Manuel Bouyer wrote: > > Hello > > I'm trying to run a binary-only linux program under NetBSD 9.2. > > From what I found, the binary was built on Ubuntu 16.04 > > [...] > > > > As you can see above (ktrace -si output), the read on fd 3 in 26751 returns > > with an error as soon as the child does its execve(), just as if CLOSEEXEC > > was set in the child. But the dup2(4,1) should keep the write side open > > without CLOSEEXEC. The program does a similar sequence just before > > (also forking a shell to execute some command) and it works. > > Later when sh tries to write to stdout it gets a SIGPIPE. > > > > I couldn't reproduce this with a simple program. > > But it seems that I can't reproduce this clone call. It seems that we are > > called with flags 0x1200011, which would translate to > > CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, > > and a NULL stack pointer. > > But when run on linux, this clone syscall straces to > > CLONE_VM|CLONE_VFORK|SIGCHLD > > I think that combination of flags is actually a "fork()" call, which glibc > implements using clone. I found that through > https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/, > which mentions that glibc has a ARCH_FORK macro, though it seems that the > more recent code uses an arch_fork inline function: > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/arch-fork.h;h=b846da08f98839aef336868de24850626428509c;hb=HEAD Yes, I think it's a form of fork() or vfork(). But when I compile a test program on linux (RHEL7 or Ubuntu 20), fork() and vfork() appears as fork and vfork in NetBSD's ktrace, not clone. > > > > I tried writing a program using fork(), vfork() or clone() but > > none of them would use the clone() syscall as do my linux binary. > > Any idea what could cause clone() to be used this way ? > > Is your binary statically linked? 
> Maybe it has a different glibc
> implementation from the .so that's on your system.

Yes, the linux emulation on NetBSD uses suse's glibc, while my linux test systems are RHEL7 and Ubuntu 20.

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
linux clone issue
Hello
I'm trying to run a binary-only linux program under NetBSD 9.2. From what I found, the binary was built on Ubuntu 16.04.

The program dies at a specific point and it seems to be a bug in our emulation:

26751 26751 mylinuxprog CALL close(3)
26751 26751 mylinuxprog RET close 0
26751 26751 mylinuxprog CALL wait4(0x558d,0x7f7fde10,0,0)
26751 26751 mylinuxprog RET wait4 21901/0x558d
26751 26751 mylinuxprog CALL munmap(0x7f7ff7efb000,0x4000)
26751 26751 mylinuxprog RET munmap 0
26751 26751 mylinuxprog CALL pipe2(0x7f7fddf0,0x8)
26751 26751 mylinuxprog RET pipe2 0
26751 26751 mylinuxprog CALL clone(0x1200011,0,0,0x7f7ff7ef5a10,0x687f)
26751 26751 mylinuxprog RET clone 8992/0x2320
8992 8992 mylinuxprog EMUL "linux"
8992 8992 mylinuxprog RET fork 0
26751 26751 mylinuxprog CALL close(4)
26751 26751 mylinuxprog RET close 0
26751 26751 mylinuxprog CALL fcntl(3,F_SETFD,0)
26751 26751 mylinuxprog RET fcntl 0
26751 26751 mylinuxprog CALL fstat64(3,0x7f7fdd10)
26751 26751 mylinuxprog RET fstat64 0
26751 26751 mylinuxprog CALL mmap(0,0x4000,PROT_READ|PROT_WRITE,0x22,0x,0)
26751 26751 mylinuxprog RET mmap 140187597254656/0x7f7ff7efb000
26751 26751 mylinuxprog CALL read(3,0x7f7ff7efb000,0x4000)
8992 8992 mylinuxprog CALL set_robust_list(0x7f7ff7ef5a20,0x18)
8992 8992 mylinuxprog RET set_robust_list 0
22927 22927 mylinuxprog CALL exit_group(0)
8992 8992 mylinuxprog CALL dup2(4,1)
8992 8992 mylinuxprog RET dup2 1
8992 8992 mylinuxprog CALL execve(0x7f7ff718d873,0x7f7fbd70,0x7f7fea38)
8992 8992 mylinuxprog NAMI "/emul/linux/bin/sh"
8992 8992 mylinuxprog NAMI "/emul/linux"
8992 8992 mylinuxprog NAMI "/emul/linux/lib64/ld-linux-x86-64.so.2"
26751 26751 mylinuxprog RET read -1 errno -3 No such process
26751 26751 mylinuxprog PSIG SIGKILL SIG_DFL: code=SI_NOINFO
8992 8992 sh EMUL "linux"
[...]

As you can see above (ktrace -si output), the read on fd 3 in 26751 returns with an error as soon as the child does its execve(), just as if CLOSEEXEC was set in the child. But the dup2(4,1) should keep the write side open without CLOSEEXEC. The program does a similar sequence just before (also forking a shell to execute some command) and it works. Later when sh tries to write to stdout it gets a SIGPIPE.

I couldn't reproduce this with a simple program, because it seems that I can't reproduce this clone call. It seems that we are called with flags 0x1200011, which would translate to
CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
and a NULL stack pointer. But when run on linux, this clone syscall straces to
CLONE_VM|CLONE_VFORK|SIGCHLD

I tried writing a program using fork(), vfork() or clone() but none of them would use the clone() syscall the way my linux binary does.

Any idea what could cause clone() to be used this way ?
Also, any idea about this file descriptor issue ?

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
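The flag translation in the message can be checked mechanically. The sketch below decodes a Linux clone() flags word using the constant values from the Linux headers (only a subset is listed); the low byte is the exit signal, and 17 is SIGCHLD on Linux x86. The function name decode_clone_flags and the output format are hypothetical choices for illustration.

```c
#include <stdio.h>
#include <string.h>

#define CSIGNAL_MASK 0x000000ffUL  /* low byte carries the exit signal */

struct flag_name {
    unsigned long bit;
    const char *name;
};

/* Subset of the Linux CLONE_* flag values. */
static const struct flag_name clone_flags[] = {
    { 0x00000100UL, "CLONE_VM" },
    { 0x00000200UL, "CLONE_FS" },
    { 0x00000400UL, "CLONE_FILES" },
    { 0x00000800UL, "CLONE_SIGHAND" },
    { 0x00004000UL, "CLONE_VFORK" },
    { 0x00010000UL, "CLONE_THREAD" },
    { 0x00100000UL, "CLONE_PARENT_SETTID" },
    { 0x00200000UL, "CLONE_CHILD_CLEARTID" },
    { 0x01000000UL, "CLONE_CHILD_SETTID" },
};

/* Append 's' to 'out' without overflowing a buffer of 'size' bytes. */
static void append(char *out, size_t size, const char *s)
{
    size_t len = strlen(out);

    if (len + 1 < size)
        strncat(out, s, size - len - 1);
}

/* Write e.g. "CLONE_VM|CLONE_VFORK|sig=17" into 'out'. */
void decode_clone_flags(unsigned long flags, char *out, size_t size)
{
    char tmp[32];
    unsigned long rest = flags & ~CSIGNAL_MASK;
    size_t i;

    if (size == 0)
        return;
    out[0] = '\0';

    for (i = 0; i < sizeof(clone_flags) / sizeof(clone_flags[0]); i++) {
        if (rest & clone_flags[i].bit) {
            append(out, size, clone_flags[i].name);
            append(out, size, "|");
            rest &= ~clone_flags[i].bit;
        }
    }
    if (rest != 0) {
        /* Bits not in the table above are shown as raw hex. */
        snprintf(tmp, sizeof(tmp), "0x%lx|", rest);
        append(out, size, tmp);
    }
    snprintf(tmp, sizeof(tmp), "sig=%lu", flags & CSIGNAL_MASK);
    append(out, size, tmp);
}
```

Decoding 0x1200011 this way yields the two CHILD_*TID flags plus signal 17, while CLONE_VM|CLONE_VFORK|SIGCHLD corresponds to 0x4111, so the two calls really do differ in more than presentation.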
Re: Anyone still using PCI "isp" SCSI / FC controllers?
On Sun, Jul 18, 2021 at 02:44:46PM -0700, Jason Thorpe wrote: > The Qlogic ISP SCSI / FC driver PCI front-end appears to universally support > using 64-bit PCI DMA addresses, based on my reading of this code block in > isp_pci_dmasetup(): > > if (sizeof (bus_addr_t) > 4) { > if (rq->req_header.rqs_entry_type == RQSTYPE_T2RQS) { > rq->req_header.rqs_entry_type = RQSTYPE_T3RQS; > } else if (rq->req_header.rqs_entry_type == > RQSTYPE_REQUEST) { > rq->req_header.rqs_entry_type = RQSTYPE_A64; > } > } > > There's just one problem, though! It does not use the 64-bit PCI DMA tag, > and so it is always getting DMA addresses that fit in 32-bits. On x86-64 > machines, this results in having to bounce DMA transfers (ick). On Alpha > machines, this results in having to use SGMAP (IOMMU) DMA; this is not a > problem unto itself, and I recently made some improvements to this on systems > where Qlogic ISP controllers were more likely to be present (e.g. AlphaServer > 1000 / 1000A). > > But there are some Alpha systems we support (notably the EV6+ > Tsunami/Typhoon/Titan systems e.g. DS10/DS20/DS25/...) that natively support > 64-bit PCI DMA addressing without having to use SGMAPs ... this is generally > preferred because, among other things, it's faster. > > I'm pretty sure it's safe, based on the code block quoted above, to change > PCI DMA tag selection in the driver to something like this: > > /* > * See conditional in isp_pci_dmasetup(); if > * sizeof (bus_addr_t) > 4, then we'll program > * the device using 64-bit DMA addresses. > * So, if we're going to do that, we should do > * our best to get 64-bit addresses in the first > * place. > */ > if (sizeof (bus_addr_t) > 4 && pci_dma64_available(pa)) { > isp->isp_dmatag = pa->pa_dmat64; > } else { > isp->isp_dmatag = pa->pa_dmat; > } > > Anyway, if someone with more knowledge of these controllers could chime in, > I'd really appreciate it. (Hopefully Matt is still lurking on these mailing > lists??) 
I have:

isp0 at pci10 dev 0 function 0: QLogic FC-AL and 4Gbps Fabric PCI-E HBA
isp1 at pci10 dev 0 function 1: QLogic FC-AL and 4Gbps Fabric PCI-E HBA

connecting to an Overland LTO changer.

I don't have specific knowledge of these controllers, but I could certainly test-boot a -current kernel and see if I can still read tapes (the server is running netbsd-8 at this time).

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: linux emul and newer glibc
On Mon, Jun 28, 2021 at 08:15:29AM +0900, Rin Okuyama wrote: > Hi, > > On 2021/06/28 2:40, Manuel Bouyer wrote: > > Hello, > > I'm trying to run a binary which wants GLIBCXX_3.4.21, while with the suse > > packages we have GLIBCXX_3.4.19. Before I try grabbing newer libraries, > > has anyone tried to run linux binaries with more recent libraries ? > > For my amd64 machine, GLIBCXX_3.4.28 (from glibc 2.32) works just fine, > which is extracted manually from Fedora 33 by pkgsrc/pkgtools/rpm2pkg. indeed it works for me too. Now I need to make it not choke on udev errors ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
linux emul and newer glibc
Hello, I'm trying to run a binary which wants GLIBCXX_3.4.21, while with the suse packages we have GLIBCXX_3.4.19. Before I try grabbing newer libraries, has anyone tried to run linux binaries with more recent libraries ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Xen panics in autoconf
Hello, some recent changes broke Xen: [ 2.919] panic: kernel diagnostic assertion "KERNEL_LOCKED_P()" failed: file "/usr/src/sys/kern/subr_autoconf.c", line 1039 [ 2.919] cpu0: Begin traceback... [ 2.919] vpanic(c05be3c0,cd02ddb8,cd02ddd4,c0415210,c05be3c0,c05be322,c05c6cc1,c05f4284,40f,cd02de18) at netbsd:vpanic+0x139 [ 2.919] kern_assert(c05be3c0,c05be322,c05c6cc1,c05f4284,40f,cd02de18,c0b84b60,0,cd02ddf4,c0415270) at netbsd:kern_assert+0x23 [ 2.919] config_match(c12e6c00,c0b84b60,cd02ded8,c0b84b60,c0b84b60,c12e6c00,cd02de3c,c04155d2,cd02de18,c0428627) at netbsd:config_match+0x90 [ 2.0100679] mapply(cd02de18,c0428627,0,c12e5dc0,,c0cb6524,cd02de18,0,c12e6c00,0) at netbsd:mapply+0x50 [ 2.0100679] config_vsearch(c12e6c00,cd02ded8,,cd02dea0,cd02de88,c01120af,0,c1460278,cd02dee6,c05c06cd) at netbsd:config_vsearch+0x212 [ 2.0100679] config_vfound(c12e6c00,cd02ded8,c0111f20,,cd02dea0,cd02df00,c0112638,c12e6c00,cd02ded8,c0111f20) at netbsd:config_vfound+0x2f [ 2.0100679] config_found(c12e6c00,cd02ded8,c0111f20,,a,8,0,c12ddf54,0,c12c21f0) at netbsd:config_found+0x2d [ 2.0100679] xenbus_probe_device_type(cd02df2e,1e,c05c0734,c12ddf54,cd02df24,4,c12ddf44,c12ddf44,2,6564) at netbsd:xenbus_probe_device_type+0x498 [ 2.0100679] xenbus_probe_frontends.isra.0(60,2,0,c0113430,0,c05bec62,0,0,cd02df9c,c0112e65) at netbsd:xenbus_probe_frontends.isra.0+0xbb [ 2.0100679] xenbus_probe(0,c05c076d,6,c1453c00,0,c0102031,c1453c00,d99000,c0b92200,0) at netbsd:xenbus_probe+0x2d [ 2.0100679] xenbus_probe_init(c1453c00,d99000,c0b92200,0,c0100084,0,0,0,0,0) at netbsd:xenbus_probe_init+0x85 [ 2.0100679] cpu0: End traceback... Any idea what changed recently ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: booting xen [was Re: serial console puzzle]
On Fri, Apr 30, 2021 at 07:28:57PM +0100, Patrick Welche wrote:
> On Fri, Apr 30, 2021 at 07:00:38PM +0200, Manuel Bouyer wrote:
> > On Fri, Apr 30, 2021 at 05:55:37PM +0100, Patrick Welche wrote:
> > > no luck. I see loading /netbsd-XEN3_DOM0, and then it just reboots.
> > > Nothing more appears on the console. (-current XEN, xen.gz from
> > > xenkernel415)
> >
> > Try xen-debug.gz ?
> > Do you get the Xen boot messages ?
>
> I don't get the Xen boot messages. Just tried xen-debug.gz and again I just
> see loading, and then a reboot. I don't think it gets as far xen*.gz.
>
> boot.cfg contains:
>
> menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load /netbsd-XEN3_DOM0 console=com1 com1=57600,8n1,0x3f8;multiboot /xen-debug.gz dom0_mem=1024M

should probably be:

menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load /netbsd-XEN3_DOM0 console=com0;multiboot /xen-debug.gz dom0_mem=1024M console=com1 com1=57600,8n1,0x3f8

(it should really be console=com0 for NetBSD: it doesn't access the hardware and uses the I/O services from the hypervisor)

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: serial console puzzle
On Fri, Apr 30, 2021 at 05:55:37PM +0100, Patrick Welche wrote: > On Fri, Apr 30, 2021 at 04:52:41PM +0100, Patrick Welche wrote: > > On Fri, Apr 30, 2021 at 05:23:54PM +0200, Manuel Bouyer wrote: > > > On Fri, Apr 30, 2021 at 04:18:49PM +0100, Patrick Welche wrote: > > > > On Fri, Apr 30, 2021 at 05:04:34PM +0200, Manuel Bouyer wrote: > > > > > On Fri, Apr 30, 2021 at 03:44:46PM +0100, Patrick Welche wrote: > > > > > > In /boot.cfg: > > > > > > > > > > > > menu=Boot normally:rndseed /var/db/entropy-file;consdev > > > > > > com0,57600;boot > > > > > > > > > > > > # installboot -ve /dev/rsd0a > > > > > > File system: /dev/rsd0a > > > > > > Boot options:timeout 5, flags 0, speed 57600, ioaddr 0, > > > > > > console com0 > > > > > > > > > > > > Yet in dmesg: > > > > > > > > > > > > com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO > > > > > > com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO > > > > > > com1: console > > > > > > > > > > > > (so I don't actually see anything) > > > > > > > > > > > > (Wednesday's -current/amd64) > > > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > one possibility is that the bios has com0 and com1 swapped. > > > > > In some case I had to explicitely set ioaddr with installboot to have > > > > > the serial console working. > > > > > > > > I should have said: according to the BIOS "COM A" is 0x3f8, and "COM B" > > > > is 0x2f8, so they are the right way around. > > > > > > I've seen BIOSes report it the right way on in setup, but the wrong way > > > to the boot loader. > > > In such cases and explicit ioaddr did help. > > > > Indeed - it did! > > > > # installboot -ve /dev/rsd0a > > File system: /dev/rsd0a > > Boot options:timeout 5, flags 0, speed 57600, ioaddr 3f8, console > > com0 > > > > com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO > > com0: console > > com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO > > > > now for xen... > > no luck. 
I see loading /netbsd-XEN3_DOM0, and then it just reboots. > Nothing more appears on the console. (-current XEN, xen.gz from xenkernel415) Try xen-debug.gz ? Do you get the Xen boot messages ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: serial console puzzle
On Fri, Apr 30, 2021 at 04:18:49PM +0100, Patrick Welche wrote:
> On Fri, Apr 30, 2021 at 05:04:34PM +0200, Manuel Bouyer wrote:
> > On Fri, Apr 30, 2021 at 03:44:46PM +0100, Patrick Welche wrote:
> > > In /boot.cfg:
> > >
> > > menu=Boot normally:rndseed /var/db/entropy-file;consdev com0,57600;boot
> > >
> > > # installboot -ve /dev/rsd0a
> > > File system: /dev/rsd0a
> > > Boot options:timeout 5, flags 0, speed 57600, ioaddr 0, console com0
> > >
> > > Yet in dmesg:
> > >
> > > com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO
> > > com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO
> > > com1: console
> > >
> > > (so I don't actually see anything)
> > >
> > > (Wednesday's -current/amd64)
> > >
> > >
> > > Thoughts?
> >
> > one possibility is that the bios has com0 and com1 swapped.
> > In some case I had to explicitly set ioaddr with installboot to have
> > the serial console working.
>
> I should have said: according to the BIOS "COM A" is 0x3f8, and "COM B"
> is 0x2f8, so they are the right way around.

I've seen BIOSes report it the right way round in setup, but the wrong way to the boot loader. In such cases an explicit ioaddr did help.

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: serial console puzzle
On Fri, Apr 30, 2021 at 03:44:46PM +0100, Patrick Welche wrote:
> In /boot.cfg:
>
> menu=Boot normally:rndseed /var/db/entropy-file;consdev com0,57600;boot
>
> # installboot -ve /dev/rsd0a
> File system: /dev/rsd0a
> Boot options:timeout 5, flags 0, speed 57600, ioaddr 0, console com0
>
> Yet in dmesg:
>
> com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 1-byte FIFO
> com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, 1-byte FIFO
> com1: console
>
> (so I don't actually see anything)
>
> (Wednesday's -current/amd64)
>
>
> Thoughts?

One possibility is that the BIOS has com0 and com1 swapped. In some cases I had to explicitly set ioaddr with installboot to get the serial console working.

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: make fails to build on linux
On Sat, Apr 17, 2021 at 10:44:36PM +0200, Jaromír Doleček wrote:
> On Sat, 17 Apr 2021 at 19:49, Manuel Bouyer wrote:
> >
> > On Sat, Apr 17, 2021 at 07:25:58PM +0200, Manuel Bouyer wrote:
> > > Hello
> > > trying a build.sh tools on linux I got:
> > > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c: In function '__regex_wctype':
> > > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: error: 'for' loop initial declarations are only allowed in C99 mode
> > > for (size_t i = 0; i < __arraycount(wctypes); i++) {
> > > ^
> > > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: note: use option -std=c99 or -std=gnu99 to compile your code
> > >
> > > What is the right fix for this ?
> > >
> > > For now I just moved the declaration outside of the loop
> >
> > Well, the build fails later with the same error.
> > Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe
> > this should be the default ?
>
> I think it would be sensible to use -std=c99 by default, yes. It's

It has to be gnu99; I tried c99 and it failed with some types not defined.

> strange that the Linux toolchain refuses it by default, do we force
> some other -std flag by default now by chance?

AFAIK no. But the toolchain on RHEL7 is quite old:
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: make fails to build on linux
On Sat, Apr 17, 2021 at 10:54:48AM -0700, Jason Thorpe wrote: > > > On Apr 17, 2021, at 10:48 AM, Manuel Bouyer wrote: > > > > Well, the build fails later with the same error. > > Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe > > this should be the default ? > > Just fix the code to not use that style of declaration? Some of them are in imported code (gnu toolchain); this is why I didn't try to fix it -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: make fails to build on linux
On Sat, Apr 17, 2021 at 07:25:58PM +0200, Manuel Bouyer wrote:
> Hello
> trying a build.sh tools on linux I got:
> /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c: In function '__regex_wctype':
> /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: error: 'for' loop initial declarations are only allowed in C99 mode
> for (size_t i = 0; i < __arraycount(wctypes); i++) {
> ^
> /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: note: use option -std=c99 or -std=gnu99 to compile your code
>
> What is the right fix for this ?
>
> For now I just moved the declaration outside of the loop

Well, the build fails later with the same error.
Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe this should be the default ?

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
make fails to build on linux
Hello
trying a build.sh tools on linux I got:

/dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c: In function '__regex_wctype':
/dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: error: 'for' loop initial declarations are only allowed in C99 mode
for (size_t i = 0; i < __arraycount(wctypes); i++) {
^
/dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: note: use option -std=c99 or -std=gnu99 to compile your code

What is the right fix for this ?

For now I just moved the declaration outside of the loop.

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
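For reference, the workaround mentioned above (moving the declaration out of the for statement) looks roughly like the sketch below, which compiles in C89/gnu89 mode as well as C99. The table class_names and the function find_class are hypothetical stand-ins, not the real wctypes table or code from regcomp.c, and ARRAY_COUNT is a local replacement for NetBSD's __arraycount().

```c
#include <stddef.h>
#include <string.h>

/* Local equivalent of __arraycount(); defined here to stay portable. */
#define ARRAY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical lookup table standing in for the real one. */
static const char *const class_names[] = {
    "alnum", "alpha", "digit", "space"
};

/* Return the index of 'name' in class_names, or -1 if absent. */
int find_class(const char *name)
{
    size_t i;   /* declared at block scope, not inside the for() */

    for (i = 0; i < ARRAY_COUNT(class_names); i++) {
        if (strcmp(class_names[i], name) == 0)
            return (int)i;
    }
    return -1;
}
```

This only helps for code that can be edited locally; for imported sources, passing a -std option to the host compiler avoids touching them.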
Re: running xen on current
On Thu, Apr 15, 2021 at 01:39:37PM +0100, Patrick Welche wrote: > On Thu, Apr 15, 2021 at 07:28:32AM -0400, Brad Spencer wrote: > > Manuel Bouyer writes: > > > > > On Thu, Apr 15, 2021 at 09:53:50AM +0100, Patrick Welche wrote: > > >> I have tried and failed to run xen on 3 -current/amd64 systems with > > >> 3 different failure modes: > > >> > > >> 1) laptop: xen.gz Building a PV Dom0 / ELF: not an ELF binary -> > > >> panic/reboot > > >> 2) desktop: XEN3_DOM0 panics including PR port-xen/55978 > > >> 3) server: Trampoline space cannot be allocated; will try fallback -> > > >> reboot > > >> > > >> They are all working NetBSD-current/amd64 systems. > > >> > > >> My conclusion was that xen is hopelessly broken, so was quite surprised > > >> by Greg Wood's thread about the finer points of running a guest OS, given > > >> that those systems won't even start the host OS. > > >> > > >> I dug out an old desktop, and to my pleasant surprise it booted > > >> XEN3_DOM0, > > >> and I have managed to run some XEN3_DOMUs. > > >> > > >> The difference between the working/broken setups seems to be that the > > >> working one is "BIOS" booting rather than EFI booting. > > >> > > >> Among all your xen success stories, are any of you EFI booting? > > > > > > AFAIK EFI is not yet supported by Xen (maybe this is supported by 4.15, > > > I've not had a chance to try yet). I have it running on fairly recent > > > Dell servers (in BIOS mode) > > > > > > There has been fiddling with Xen and EFI for quite some time. See: > > > > https://wiki.xenproject.org/wiki/Xen_EFI > > > > for example (might be old)... this indicates that Xen 4.3 or later could > > be built as a EFI binary and probably booted from the EFI firmware > > directly or with grub2 when grub2 is a EFI binary itself. Of course > > those instructions are all Linux-centric and I don't know if you created > > a Xen kernel like this if it would boot a NetBSD DOM0 kernel. 
I am in > > no position to try any tests with this right now personally, but it is > > tempting as I have a EFI only laptop that I could probably replace the > > hard drive temporarily. > > Looking at > > https://xenproject.org/2021/04/08/xen-project-hypervisor-4-15/ > > (so 4.15 only just came out!) I see > > Unified boot images: It is now possible to create an image bundling > together files needed for Xen to boot into a single EFI binary; > making it now possible to boot a functional Xen system directly > from the EFI boot manager, rather than having to go through grub > multiboot. Files that can be bundled include a hypervisor, dom0 > kernel, dom0 initrd, Xen KConfig, XSM configuration, and a device > tree. > > I thought that "go through grub multiboot" was the equivalent of our > boot.cfg "multiboot /xen.gz dom0_mem=1024M", but apparently not? It should be; but there are probably differences between BIOS and EFI, even when using multiboot (the way to access the console, or find the ACPI tables, may be different, for example) -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: running xen on current
On Thu, Apr 15, 2021 at 09:53:50AM +0100, Patrick Welche wrote: > I have tried and failed to run xen on 3 -current/amd64 systems with > 3 different failure modes: > > 1) laptop: xen.gz Building a PV Dom0 / ELF: not an ELF binary -> panic/reboot > 2) desktop: XEN3_DOM0 panics including PR port-xen/55978 > 3) server: Trampoline space cannot be allocated; will try fallback -> reboot > > They are all working NetBSD-current/amd64 systems. > > My conclusion was that xen is hopelessly broken, so was quite surprised > by Greg Wood's thread about the finer points of running a guest OS, given > that those systems won't even start the host OS. > > I dug out an old desktop, and to my pleasant surprise it booted XEN3_DOM0, > and I have managed to run some XEN3_DOMUs. > > The difference between the working/broken setups seems to be that the > working one is "BIOS" booting rather than EFI booting. > > Among all your xen success stories, are any of you EFI booting? AFAIK EFI is not yet supported by Xen (maybe this is supported by 4.15, I've not had a chance to try yet). I have it running on fairly recent Dell servers (in BIOS mode) -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
On Sat, Apr 10, 2021 at 03:17:35PM -0700, Greg A. Woods wrote: > [...] > # fdisk -F /images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img > Disk: /images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img > NetBSD disklabel disk geometry: > cylinders: 49, heads: 255, sectors/track: 63 (16065 sectors/cylinder) > total sectors: 791121, bytes/sector: 512 > > BIOS disk geometry: > cylinders: 49, heads: 255, sectors/track: 63 (16065 sectors/cylinder) > total sectors: 791121 > > Partitions aligned to 16065 sector boundaries, offset 63 > > Partition table: > 0: EFI system partition (sysid 239) > start 1, size 1600 (1 MB, Cyls 0/0/2-0/25/26) > 1: FreeBSD or 386BSD or old NetBSD (sysid 165) > start 1601, size 789520 (386 MB, Cyls 0/25/27-49/62/30), Active > 2: > 3: > First active partition: 1 > Drive serial number: 2425393296 (0x90909090) > > # fdisk vnd0 > fdisk: primary partition table invalid, no magic in sector 0 > fdisk: Cannot determine the number of heads > Disk: /dev/rvnd0d > NetBSD disklabel disk geometry: > cylinders: 4096, heads: 64, sectors/track: 32 (2048 sectors/cylinder) > total sectors: 8388608, bytes/sector: 512 > > BIOS disk geometry: > cylinders: 522, heads: 255, sectors/track: 63 (16065 sectors/cylinder) > total sectors: 8388608 > > Partitions aligned to 16065 sector boundaries, offset 63 > > Partition table: > 0: > 1: > 2: > 3: > Bootselector disabled. > No active partition. > Drive serial number: 0 (0x) I can't reproduce this fdisk/disklabel on netbsd-9 nor -current. fdisk on vnd0 gives me the same partition table as on the file. FreeBSD fails to boot with the same error message. The size of the disk is indeed 790528 in the xenstore (and the dom0's kernel message) but I don't know where this comes from. xbdback uses getdiskinfo() to get the device's size. In vnd, the size comes from a VOP_GETATTR() on the file, so it looks like VOP_GETATTR() returns the wrong size. 
The file is definitely 791121 sectors long:
#dd if=FreeBSD-12.2-RELEASE-amd64-mini-memstick.img.orig of=FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
791121+0 records in
791121+0 records out
#ls -l FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
-rw-r--r-- 1 root wheel 405053952 Apr 11 11:56 FreeBSD-12.2-RELEASE-amd64-mini-memstick.img
#expr 405053952 / 512
791121
-- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
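The size arithmetic in this message can be written down as a small C sketch. The byte count (405053952), the sector size (512) and the size seen in the xenstore (790528 sectors) are taken from the message; the function names are made up for illustration. The only thing the sketch adds is the difference between the two sizes, plus the purely arithmetic observation that 790528 is an exact multiple of 2048 sectors (1 MiB). That is an observation, not a diagnosis of where the wrong size comes from.

```c
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE 512u	/* bytes per sector, as used in the message */

/* Number of whole sectors in a file of the given size in bytes. */
uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes / SECTOR_SIZE;
}

/* How many sectors the exported device is short of the backing file. */
uint64_t sector_shortfall(uint64_t file_bytes, uint64_t exported_sectors)
{
	uint64_t file_sectors = bytes_to_sectors(file_bytes);

	return file_sectors > exported_sectors ?
	    file_sectors - exported_sectors : 0;
}

/* Print the comparison for one image (hypothetical helper). */
void report_sizes(uint64_t file_bytes, uint64_t exported_sectors)
{
	printf("file:     %llu sectors\n",
	    (unsigned long long)bytes_to_sectors(file_bytes));
	printf("exported: %llu sectors\n",
	    (unsigned long long)exported_sectors);
	printf("missing:  %llu sectors\n",
	    (unsigned long long)sector_shortfall(file_bytes, exported_sectors));
	/* Arithmetic observation only: is the exported size 1 MiB aligned? */
	printf("exported %% 2048 = %llu\n",
	    (unsigned long long)(exported_sectors % 2048));
}
```

Calling report_sizes(405053952ULL, 790528ULL) from a main() prints the two sizes and the number of sectors the domU does not get to see.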
Re: mail/sendmail not relaying on netbsd-9/sparc, problem with OpenSSL update?
On Wed, Apr 07, 2021 at 11:47:10AM -0500, John D. Baker wrote: > Dropping pkgsrc-users@ as it appears not to be a pkgsrc problem. > > On Wed, 7 Apr 2021, Martin Husemann wrote: > > > On Wed, Apr 07, 2021 at 11:26:05AM -0500, John D. Baker wrote: > > > > > > (gdb) run -odi -v -q > > > Starting program: /usr/sbin/sendmail -odi -v -q > > > process 867 is executing new program: /usr/pkg/libexec/sendmail/sendmail > > > > > > Program received signal SIGILL, Illegal instruction. > > > 0xedd6d40c in _sparcv9_vis1_probe () from /usr/lib/libcrypto.so.14 > > > (gdb) bt > > > > This is normal, you should be able to "continue" from it. > > The library catches the SIGILL and avoids the instruction. > > ISTR that I tried that and simply got the SIGILL again. Maybe that > was from a later sparcV9 instruction... > > In any case, while one may be able to do that in 'gdb', when running > normally, it is fatal and there is no recourse. Odd that it doesn't > dump core. It should not be fatal. The library traps SIGILL specifically to test whether instructions are available. Does the program really exit if you hit 'continue' in gdb? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: cmake hang ... again
On Tue, Apr 06, 2021 at 02:24:50PM +0100, Chavdar Ivanov wrote: > Hi, > > It may or may not be linked to the recent rather enthralling > discussion about the entropy; I don't know. I've asked for ideas in > the past, but couldn't figure out what to do if it hits me again. > > Usually I run -current on amd64, updating the systems on average 2-3 > times a week; I also use pkgsrc-head and again, 2-3 times a month I > cvs update my pkgsrc tree, together with a ' git pull' in wip, and I > run 'pkg_rolling-replace'. > > Each and every run of pkg_rolling-replace gets me to a seemingly > identical hang in cmake in a single package - misc/kdepim4 , in > apparently the same spot. with similar trace. Attaching to the process I see the same thing in bulk builds, with various kde packages. When I asked I've been told that this was a known issue, but without fix ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: regarding the changes to kernel entropy gathering
On Mon, Apr 05, 2021 at 09:30:16AM -0700, Greg A. Woods wrote: > At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer > wrote: > Subject: Re: regarding the changes to kernel entropy gathering > > > > If I understood it properly, there's no need for such a knob. > > echo 0123456789abcdef0123456789abcdef > /dev/random > > > > will get you back to the state we had in netbsd-9, with (pseudo-)randomness > > collected from devices. > > Well, no, not quite so much randomness. Definitely pseudo though! > > My patch on the other hand can at least inject some real randomness into > the entropy pool, even if it is observable or influenceable by nefarious > dudes who might be hiding out in my garage. As I understand it, once /dev/random has been seeded, randomness from other devices will be taken into account (with or without your patch). In your case, /dev/random reads did block because it didn't get an initial seed. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: how do I mount a read-only filesystem from the "root device" prompt?
On Sun, Apr 04, 2021 at 03:13:35PM -0700, Greg A. Woods wrote: > I would think it's not just CDs and hypervisor-provided virtual devices > that can have multiple partitions, use wedges, and yet be read-only. > > Are not a wide variety of removable storage devices also capable of > being made "read-only" at the hardware level? At least some SCSI devices had a pin to make them read-only. I used this to build ssh gateways in the past ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: regarding the changes to kernel entropy gathering
On Sun, Apr 04, 2021 at 06:47:23PM -0700, Brian Buhrow wrote: > Hello. As I understand it, Greg ran into this problem on a xen domu. In > checking my NetBSD-9 > system running as a domu under xen-4.14.1, there is no rdrand or rdseed > feature exposed to > domu's by xen. This observation is confirmed by looking at the xen command > line reference > page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html Actually, if the CPU supports rdrand or rdseed, they are available to domUs:
cpu0: Running on hypervisor: Xen
cpu0: "Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz"
cpu0: Intel Xeon Scalable (Skylake, Cascade Lake, Copper Lake) (686-class)
cpu0: family 0x6 model 0x55 stepping 0x7 (id 0x50657)
[...]
cpu0: features1 0xf6f81203
cpu0: features2 0x810
cpu0: features5 0xd18f2369
Source                 Bits Type      Flags
xbd0                4010273 disk      estimate, collect, v, t, dt
xennet0                   0 net       v, t, dt
cpu0                  88774 vm        estimate, collect, v, t, dv
system-power              0 power     estimate, collect, v, t, dt
autoconf                  1 ???       estimate, collect, t, dt
printf                    0 ???       collect
callout                 108 skew      estimate, collect, v, dv
cpurng                 4096 rng       estimate, collect, v
-- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: regarding the changes to kernel entropy gathering
On Mon, Apr 05, 2021 at 01:16:56AM +, RVP wrote: > [...] > Hmm. I have to say, that now I find myself not disagreeing with Greg's > point of view: Maybe NetBSD's default is too strict and a knob like > kern.entropy.use_pooh_poohed_sources=1 would not be a bad thing for > some users--with all appropriate sysinst warnings of course. If I understood it properly, there's no need for such a knob. echo 0123456789abcdef0123456789abcdef > /dev/random will get you back to the state we had in netbsd-9, with (pseudo-)randomness collected from devices. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: nothing contributing entropy in Xen domUs? or dom0!!!
On Thu, Apr 01, 2021 at 04:13:59AM +, RVP wrote: > > [...] > > Does this /etc/entropy-file match what's there in your /boot.cfg? irrelevant for Xen, as Xen uses the multiboot protocol. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
On Wed, Mar 31, 2021 at 09:58:48PM -0400, Thor Lancelot Simon wrote: > On Wed, Mar 31, 2021 at 11:24:07AM +0200, Manuel Bouyer wrote: > > On Tue, Mar 30, 2021 at 10:42:53PM +, Taylor R Campbell wrote: > > > > > > There are no virtual RNG devices on the system in question, according > > > to the quoted `rndctl -l' output. Perhaps the VM host needs to be > > > taught to expose a virtio-rng device to the guest? > > > > There is no such thing in Xen. > > Is the CPU so old that it doesn't have RDRAND / RDSEED, or is Xen perhaps > masking these CPU features from the guest? Is there an easy way to test, on a netbsd-9 system, if the instruction is present and working ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
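One possible way to check this from userland is a small C program that asks CPUID whether RDRAND is advertised and, only if it is, executes the instruction a few times. The sketch below is not from the thread. It assumes an x86 CPU, GCC or clang with their <cpuid.h> header, and an assembler that knows the rdrand mnemonic; the function names are invented for illustration. The bit positions used (CPUID leaf 1 ECX bit 30 for RDRAND, leaf 7 EBX bit 18 for RDSEED) are the ones documented by Intel.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define CPUID_1_ECX_RDRAND	(1u << 30)	/* CPUID leaf 1, ECX bit 30 */
#define CPUID_7_EBX_RDSEED	(1u << 18)	/* CPUID leaf 7, EBX bit 18 */

/* Pure helpers: decode the feature bits from raw register values. */
int ecx_has_rdrand(uint32_t ecx)
{
	return (ecx & CPUID_1_ECX_RDRAND) != 0;
}

int ebx_has_rdseed(uint32_t ebx)
{
	return (ebx & CPUID_7_EBX_RDSEED) != 0;
}

/*
 * Returns 1 if RDRAND is advertised and produced a value,
 * 0 if it is advertised but failed ten times in a row,
 * -1 if it is not advertised (or this is not an x86 build).
 */
int probe_rdrand(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;
	int i;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !ecx_has_rdrand(ecx))
		return -1;

	for (i = 0; i < 10; i++) {
		unsigned int val;
		unsigned char ok;

		/* The carry flag is set when RDRAND returned a valid value. */
		__asm__ volatile("rdrand %0; setc %1"
		    : "=r" (val), "=qm" (ok) : : "cc");
		if (ok)
			return 1;
	}
	return 0;
#else
	return -1;
#endif
}
```

Printing the return value of probe_rdrand() from a main() inside the domU should show whether the guest sees a usable instruction. A result of -1 means CPUID does not advertise it to the guest; it does not say whether the physical CPU lacks it or the hypervisor masks it.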
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
On Tue, Mar 30, 2021 at 10:42:53PM +, Taylor R Campbell wrote: > > Date: Tue, 30 Mar 2021 23:53:43 +0200 > > From: Manuel Bouyer > > > > On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote: > > > [...] > > > > > > Perhaps the answer is that nothing seems to be contributing anything to > > > the entropy pool. No matter what device I exercise, none of the numbers > > > in the following changes: > > > > yes, it's been this way since the rnd rototill. Virtual devices are > > not trusted. > > > > The only way is to manually seed the pool. > > This is false. The virtual RNG drivers (viornd(4) [1], rump > hyperentropy [2], maybe others) all assume the VM host provides > samples with full entropy. This has always been the case, and this > didn't change at all in the rototill last year. > > There are no virtual RNG devices on the system in question, according > to the quoted `rndctl -l' output. Perhaps the VM host needs to be > taught to expose a virtio-rng device to the guest? There is no such thing in Xen. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote: > [...] > > Perhaps the answer is that nothing seems to be contributing anything to > the entropy pool. No matter what device I exercise, none of the numbers > in the following changes: yes, it's been this way since the rnd rototill. Virtual devices are not trusted. The only way is to manually seed the pool. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: xen-tools 4.13.1 build failure
On Mon, Oct 12, 2020 at 04:53:14PM +0100, Chavdar Ivanov wrote: > Hi, > Another xentools413 build failure. It has been failing for me the last > two weeks or so, failing to build seabios, as follows: > > gmake[5]: Entering directory > '/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/firmware' > /usr/pkg/bin/gmake -C seabios-dir CC=gcc LD=ld PYTHON=python3.7 > EXTRAVERSION="-Xen" all; > gmake[6]: Entering directory > '/usr/pkgsrc/sysutils/xentools413/work/seabios-rel-1.12.1' > Linking out/rom.o > ld -N -T out/romlayout32flat.lds out/rom16.strip.o > out/rom32seg.strip.o out/code32flat.o -o out/rom.o > ld: out/code32flat.o: in function `memmove': > /usr/pkgsrc/sysutils/xentools413/work/seabios-rel-1.12.1/./src/string.c:206: > undefined reference to `memcpy' strange, I don't get this on my test machine (on netbsd-9) ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: install from CD fails on Xen
On Tue, Oct 27, 2020 at 10:14:45AM +0100, Martin Husemann wrote: > On Tue, Oct 27, 2020 at 09:42:41AM +0100, Manuel Bouyer wrote: > > Hello, > > in tests from 2020-10-25: > > http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/ > > anita fails with > > Could not locate a CD medium in any drive with the distribution sets > > (for both amd64 and i386) > > > > martin, could you please have a look ? > > Sure, will look at it - this is with the stock ISO provided as read-only > xbd(4)? Yes. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
install from CD fails on Xen
Hello, in tests from 2020-10-25: http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/ anita fails with Could not locate a CD medium in any drive with the distribution sets (for both amd64 and i386) martin, could you please have a look ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: xen-tools 4.13.1 build failure
On Sun, Aug 30, 2020 at 04:54:18PM +0100, Chavdar Ivanov wrote: > Hi, > > Trying to build xentools-4.13.1 under -current: > > gcc -I/usr/pkg/include -I/usr/include -I/usr/pkg/include/python3.7 > -I/usr/pkg/include/glib-2.0 -I/usr/pkg/include/gio-unix-2.0 > -I/usr/pkg/lib/glib-2.0/include -I/usr/X11R7/include > -D_XOPEN_SOURCE_EXTENDED=1 -I/usr/pkg/include/ncurses -DPIC -O2 > -I/usr/pkg/include -I/usr/include -I/usr/pkg/include/python3.7 > -I/usr/pkg/include/glib-2.0 -I/usr/pkg/include/gio-unix-2.0 > -I/usr/pkg/lib/glib-2.0/include -I/usr/X11R7/include > -D_XOPEN_SOURCE_EXTENDED=1 -I/usr/pkg/include/ncurses -m64 -DBUILD_ID > -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes > -Wdeclaration-after-statement -Wno-unused-but-set-variable > -Wno-unused-local-typedefs -m64 -DBUILD_ID -fno-strict-aliasing > -std=gnu99 -Wall -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -Wno-unused-local-typedefs -O2 > -fomit-frame-pointer > -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF > .subdirs-all.d -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall > -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -Wno-unused-local-typedefs -O2 > -fomit-frame-pointer > -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF > .subdir-all-libs.d -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 > -Wall -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -Wno-unused-local-typedefs -O2 > -fomit-frame-pointer > -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF > .subdirs-all.d -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall > -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -Wno-unused-local-typedefs -O2 > -fomit-frame-pointer > -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF > .subdir-all-evtchn.d -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 > -Wall -Wstrict-prototypes 
-Wdeclaration-after-statement > -Wno-unused-but-set-variable -Wno-unused-local-typedefs -O2 > -fomit-frame-pointer > -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF > .build.d -Werror -Wmissing-prototypes -I./include > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include > > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toollog/include > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include > > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toolcore/include > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include > -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall > -Wstrict-prototypes -Wdeclaration-after-statement > -Wno-unused-but-set-variable -Wno-unused-local-typedefs -O2 > -fomit-frame-pointer > -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF > .netbsd.opic.d -Werror -Wmissing-prototypes -I./include > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include > > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toollog/include > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include > > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/libs/toolcore/include > -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/evtchn/../../../tools/include > -fPIC -c -o netbsd.opic netbsd.c > netbsd.c:30:10: fatal error: xen/xenio3.h: No such file or directory > #include > ^~ > compilation terminated. > netbsd.c:30:10: fatal error: xen/xenio3.h: No such file or directory This header is in src/sys/arch/xen/include, it should be installed along with xenio.h I just commited a fix for this. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
*.fr.netbsd.org downtime
Hello, I will be upgrading the storage on {ftp,www,rsync,anoncvs}.fr.netbsd.org in the next 2 days. This will require several reboots and service interruptions while data is being moved around. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: regression: xen domU no longer supports "type=cdrom" vbd disk
On Mon, Jun 08, 2020 at 02:26:39PM -0700, Greg A. Woods wrote: > I use xl.cfg "disk" entries like the following to mount a virtual CDROM > in a Xen domU: > > 'format=raw, vdev=0x5, access=ro, devtype=cdrom, > target=/images/NetBSD-9.0-amd64.iso' > > However since upgrading my -current source tree I've been seeing: > > xenbus0: ignoring device/vbd/4 type cdrom > > As shown in this patch I had to comment out the core of the mentioned > change to be able to use an ISO image again as a virtual CDROM again: Actually this change matches what other OSes do with 'devtype=cdrom'; we were the odd one out here. For PV or PVH domUs you can omit the devtype keyword; it's only needed for HVM guests (if you want to boot from the cdrom image). -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
xen panic
Hello, build from 202005272200Z panics on Xen, on both i386 and amd64 but for different reasons: http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/ i386 fails early at boot with: [ 1.000] panic: kernel diagnostic assertion "!*zpte" failed: file "/home/source/ab/HEAD/src/sys/arch/x86/x86/pmap.c", line 3832 pmap_zero_page: lock botch [ 1.000] cpu0: Begin traceback... [ 1.000] vpanic(c059333c,c0da7e00,c0da7e2c,c0127fb3,c059333c,c058e222,c0593bf7,c0592904,ef8,1) at netbsd:vpanic+0x134 [ 1.000] kern_assert(c059333c,c058e222,c0593bf7,c0592904,ef8,1,bfe07090,8000,0,c17a8830) at netbsd:kern_assert+0x23 [ 1.000] pmap_zero_page(1ffef000,0,c03d53c5,c0c875aa,c0c862b0,1,8,c0c875aa,2,1) at netbsd:pmap_zero_page+0x1e3 [ 1.000] uvm_pagealloc_strat(0,0,0,0,3,0,0,15554000,1ffee000,0) at netbsd:uvm_pagealloc_strat+0x2d6 [ 1.000] pmap_get_physpage(8,1,3abee003,1,10002,8,8,8,28,c) at netbsd:pmap_get_physpage+0x203 [ 1.000] pmap_growkernel(d6cfd000,c05b90ea,c17a9000,15554000,1000,0,0,0,0,10002) at netbsd:pmap_growkernel+0xce [ 1.000] uvm_km_bootstrap(c17a9000,f560,0,c17a9000,f560,c0da7fb0,c055f14a,e,3,9) at netbsd:uvm_km_bootstrap+0x2c8 [ 1.000] uvm_init(e,3,9,2,0,0,c0da5000,7ff,c0e1b000,756e6547) at netbsd:uvm_init+0x63 amd64 can boot and run tests, but panics with: kernel/t_trapsignal (97/860): 20 test cases bus_handle: [0.193910s] Passed. bus_handle_recurse: [0.201020s] Passed. bus_ignore: [0.200598s] Passed. bus_mask: [0.199164s] Passed. bus_simple: [0.199066s] Passed. fpe_handle: [0.210561s] Passed. fpe_handle_recurse: [ 872.0704774] panic: kernel diagnostic assertion "curlwp->l_md.md_flags & MDL_FPU_IN_CPU" failed: file "/home/source/ab/HEAD/src/sys/arch/x86/x86/fpu.c", line 487 [ 872.0704774] cpu0: Begin traceback... [ 872.0704774] vpanic() at netbsd:vpanic+0x146 [ 872.0704774] kern_assert() at netbsd:kern_assert+0x48 [ 872.0704774] fputrap() at netbsd:fputrap+0x171 [ 872.0704774] cpu0: End traceback... 
[ 872.0704774] dumping to dev 168,1 (offset=524254, size=0): not possible [ 872.0704774] rebooting... Any idea what could have changed to cause this ? 2020-05-26 08:40 UTC builds did complete tests. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: -current build failure
On Wed, May 27, 2020 at 06:54:47PM +0100, Chavdar Ivanov wrote: > Hi, > > With sources updated about an hour ago I get: > . > --- kern-XEN3_DOM0 --- > /home/sysbuild/amd64/tools/bin/x86_64--netbsd-ld: pintr.o: in function > `xen_pic_to_gsi': > pintr.c:(.text+0x78): undefined reference to `msipic_get_pci_info' > /home/sysbuild/amd64/tools/bin/x86_64--netbsd-ld: pci_intr_machdep.o: > in function `pci_intr_release': > pci_intr_machdep.c:(.text+0x775): undefined reference to > `x86_pci_msix_release' Did you clean the build directory ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: qemu emulated machine crashes due to disk timeouts
On Thu, May 14, 2020 at 03:32:51PM +0200, Jaromír Doleček wrote: > [...] > Seriously though I think that it wouldn't hurt to just bump ATA_DELAY > to 30 seconds by default. I don't remember if it's used only for I/O or also for probe. If the latter, it could take 3x more time to boot ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: i386 Xen integration breaks linking NET4501 kernel
On Sun, May 10, 2020 at 02:36:15PM +0200, Rhialto wrote: > Probably similarly, linking fails when building an amd64 MODULAR kernel, > with some Xen-related undefined symbol errors: Yes, I posted a question to tech-kern asking how to resolve this; I've had no reply so far. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: modload & xen and -current 9.99.60
On Fri, May 08, 2020 at 02:55:10PM +0200, Frank Kardel wrote: > I checked to same kernel in an instance with memory=2048 and it just works. > > Using todays kernel also works woth memory=2048. > > Using memory=65536 for the xen instance gives a surprising familiar > > TEST-A# modload bpfjit > [ 97.4727034] kobj_load, 444: [%M/bpfjit/bpfjit.kmod]: linker error: out of > memory > modload: bpfjit: Cannot allocate memory > TEST-A# > > So it seems to be linked to available memory. > > The more you have the less you get for modload. It could be a variable overflow somewhere but I can't see how it relates to 64Gb. Does it work with 16Gb ? Also could you try with a PVH or HVM guest ? These ones would use modules from /stand/amd64/ and not /stand/amd64-xen/ and should be close to native. I don't have a box with that much RAM to test ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
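The reasoning behind asking for a 16 GB test can be illustrated with a little arithmetic in C. This is only a sketch of the generic overflow argument, not a claim about where the actual bug is: it assumes 4 KiB pages and a hypothetical 32-bit counter, and the function names are invented. A counter holding bytes in 32 bits would already wrap at 4 GiB, so it would fail at 16 GiB as well as at 64 GiB, while a 32-bit page count still has plenty of room at 64 GiB. That is why the 16 GiB result helps tell the cases apart.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT_4K	12	/* assumption: 4 KiB pages */

/* 1 if a memory size given in MiB fits in an unsigned 32-bit byte count. */
int bytes_fit_u32(uint64_t mib)
{
	return (mib << 20) <= UINT32_MAX;
}

/* 1 if the same size fits in a signed 32-bit byte count. */
int bytes_fit_i32(uint64_t mib)
{
	return (mib << 20) <= (uint64_t)INT32_MAX;
}

/* 1 if the same size, counted in 4 KiB pages, fits in 32 bits. */
int pages_fit_u32(uint64_t mib)
{
	return ((mib << 20) >> PAGE_SHIFT_4K) <= UINT32_MAX;
}

/* Print one row per memory= value mentioned in the thread. */
void print_overflow_table(void)
{
	static const uint64_t sizes_mib[] = { 2048, 16384, 65536 };
	size_t i;

	printf("%8s %10s %10s %10s\n", "MiB", "bytes/u32", "bytes/i32",
	    "pages/u32");
	for (i = 0; i < sizeof(sizes_mib) / sizeof(sizes_mib[0]); i++)
		printf("%8llu %10d %10d %10d\n",
		    (unsigned long long)sizes_mib[i],
		    bytes_fit_u32(sizes_mib[i]),
		    bytes_fit_i32(sizes_mib[i]),
		    pages_fit_u32(sizes_mib[i]));
}
```

None of these simple cases singles out 64 GiB specifically, which matches the remark that the link to that particular size is not obvious.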
Re: modload & xen and -current 9.99.60
On Thu, May 07, 2020 at 09:50:18PM +0200, Frank Kardel wrote: > see here: > > Alpine: 21:45 ~ [8] sysctl kern.module.path > kern.module.path = /stand/amd64-xen/9.99.60/modules looks good > Alpine: 21:46 ~ [9] ll /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod > -r--r--r-- 1 root wheel 34328 May 5 16:58 > /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod > Alpine: 21:46 ~ [10] size > /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod >textdata bss dec hex filename > 10399 0 0 10399289f > /stand/amd64-xen/9.99.60/modules/bpfjit/bpfjit.kmod > Alpine: 21:46 ~ [11] ll > /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod > -r--r--r-- 1 root wheel 140600 May 5 16:55 > /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod > Alpine: 21:47 ~ [12] size > /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod >textdata bss dec hex filename > 132575 16 0 132591 205ef > /stand/amd64-xen/9.99.60/modules/pciverbose/pciverbose.kmod no problem for me, with sources from today: xen1:/#modload bpfjit xen1:/#modstat | grep !$ modstat | grep bpfjit bpfjit misc filesys -09174 sljit xen1:/#modload pciverbose xen1:/#modstat | grep !$ modstat | grep pciverbose pciverbose misc filesys -0 218 pci -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: i386 Xen integration breaks GENERIC_PAE kernel build
On Thu, May 07, 2020 at 06:46:06AM -0500, John D. Baker wrote: > Building the GENERIC_PAE kernel from recent -current/i386 fails with: > > [...] > --- hypervisor.o --- > /x/current/src/sys/arch/xen/xen/hypervisor.c: In function 'init_xen_early': > /x/current/src/sys/arch/xen/xen/hypervisor.c:247:27: error: cast to pointer > from integer of different size [-Werror=int-to-pointer-cast] > HYPERVISOR_shared_info = (void *)(HYPERVISOR_shared_info_pa + KERNBASE); Should be fixed with hypervisor.c 1.82 -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: modload & xen and -current 9.99.60
On Thu, May 07, 2020 at 07:45:48AM +0200, Frank Kardel wrote: > Hi, > > Running 9.99.60 XEN3_DOMU shows > > [ 67264.313173] kobj_load, 444: [%M/bpfjit/bpfjit.kmod]: linker error: out > of memory > [ 67292.894143] kobj_load, 428: [%M/scsiverbose/scsiverbose.kmod]: linker > error: out of memory > > and modload fails with the OOM error. > > Is this an expected behavior or a bug? (kern.securelevel is -1). What does kern.module.path show for you ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: i386 Xen integration breaks linking NET4501 kernel
On Mon, May 04, 2020 at 06:42:11PM -0500, John D. Baker wrote: > A recent build of -current/i386 fails when trying to link a kernel built > from the NET4501 config: > > [...] > # link NET4501/netbsd > /r0/build/current/tools/amd64/bin/i486--netbsdelf-ld -Map netbsd.map --cref > -T netbsd.ldscript -Ttext c010 -e start -X -o netbsd > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > /r0/build/current/tools/amd64/bin/i486--netbsdelf-ld: locore.o: in function > `start_xenpvh': > (.text+0x410): undefined reference to `hvm_start_paddr' > /r0/build/current/tools/amd64/bin/i486--netbsdelf-ld: (.text+0x436): > undefined reference to `HYPERVISOR_shared_info_pa' Should be fixed now. Sorry for this -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: sysctl machdep.hypervisor
On Wed, Apr 08, 2020 at 08:26:33PM -, Michael van Elst wrote: > bou...@antioche.eu.org (Manuel Bouyer) writes: > > >Hello, > >we have a machdep.hypervisor sysctl which returns a specific string when > >an hypervisor is detected. I'd like to change the string returned for > >Xen (rename from Xen to "Xen PV" and add others Xen subtypes). > > >I didn't find any use in our source tree, does anyone know if this would > >cause problems ? > > I'd avoid whitespace in such values. Sure, committed. Thanks ! -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
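The advice about whitespace can be illustrated with a short C sketch of what a naive consumer of such a value might do. This is an assumption about how scripts or programs tend to parse sysctl output, not something taken from the NetBSD tree: a consumer that keeps only the first whitespace-delimited token would read the proposed "Xen PV" as plain "Xen". The helper name first_token and the whitespace-free spelling "XenPV" are hypothetical and used only for illustration; the thread does not say which strings were committed.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the first whitespace-delimited token of value into tok,
 * truncating to toklen - 1 bytes. Returns 1 if a token was found,
 * 0 otherwise. (Hypothetical helper, for illustration only.)
 */
int first_token(const char *value, char *tok, size_t toklen)
{
	size_t start = strspn(value, " \t\n");		/* skip leading blanks */
	size_t len = strcspn(value + start, " \t\n");	/* token length */

	if (len == 0 || toklen == 0)
		return 0;
	if (len >= toklen)
		len = toklen - 1;
	memcpy(tok, value + start, len);
	tok[len] = '\0';
	return 1;
}
```

With this helper, "Xen PV" yields the token "Xen", which is indistinguishable from the old value, while a spelling without a space survives intact.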
sysctl machdep.hypervisor
Hello, we have a machdep.hypervisor sysctl which returns a specific string when an hypervisor is detected. I'd like to change the string returned for Xen (rename from Xen to "Xen PV" and add others Xen subtypes). I didn't find any use in our source tree, does anyone know if this would cause problems ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: XEN 4.11 and 9.99.48 DOMU performance
There have been scheduler-related fixes in the last few days; did you try with an up to date kernel ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: XEN 4.11 and 9.99.48 DOMU performance
On Tue, Mar 10, 2020 at 08:11:46PM +0100, Frank Kardel wrote: > interrupt total rate type > vmcmd kills 179840 misc > vmcmd extends 179800 misc > vmcmd calls 1850330 misc > pserialize exclusive access 1450 misc > vmem static_bt_inuse2000 misc > vmem static_bt_count2000 misc > rndpseudo open soft 390 misc > TLB shootdown 50603277 139 intr > softint net/0 33488890 92 misc > softint bio/0 10 misc > softint clk/0 4239404 11 misc > softint ser/0 26560 misc > callout late/0 1990 misc > crosscall unicast600 misc > namecache entries collected 9405282 misc > namecache under scan target 3622200 misc > vcpu0 xenev0 channel 4 18112410 50 intr > softint net/15644651 misc > softint bio/1 10 misc > ... > > softint clk/11 3856351 misc > softint ser/11 1490 misc > callout late/11 10 misc > vcpu0 xenev0 channel 2 2970 intr > vcpu0 raw systime went backwards1580 intr > vcpu0 xenev0 channel 5 36222558 99 intr > vcpu1 xenev0 channel 6 1554970 intr > vcpu1 missed hardclock 830 intr > vcpu1 xenev0 channel 7 36222475 99 intr > vcpu2 xenev0 channel 8 15438790 42 intr > ... > > xbd0 map unaligned960510 misc > xbd1 map unaligned 14069263 misc > > TLB shootdown is there as some crosscall unicast. I don't see any other IPIs > though. Indeed it seems that in netbsd9 IPIs don't show up as such. But there should be some crosscall broadcast. On a netbsd-9 pbulk host I see more broadcast than unicast. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: XEN 4.11 and 9.99.48 DOMU performance
On Tue, Mar 10, 2020 at 07:30:33PM +0100, Frank Kardel wrote: > No information about IPI in vmstat -i in DOM0 and DOMU. the dom0 is not MP so I don't expect to see IPIs here. But the domU is, so there should be IPIs here. Hum, it looks like IPIs are in vmstat -e, not -i ... sorry > > Otherwise it is usually responsive. Sometimes things get stuck but switching > a screen in screen seems to unstick things. > > It seems like "wakeups" get sometimes lost. I guess it could be related to IPIs. But I'm running daily tests on domUs and I didn't notice anything strange -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: XEN 4.11 and 9.99.48 DOMU performance
On Tue, Mar 10, 2020 at 06:48:14PM +0100, Frank Kardel wrote: > [...] > > To me it looks more like locking issues or xen scheduling features. yes, that could be. does vmstat -i show anything about IPIs ? Is the domU otherwise responsive ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: XEN 4.11 and 9.99.48 DOMU performance
On Tue, Mar 10, 2020 at 04:20:22PM +0100, Frank Kardel wrote: > This is my first XEN setup so I may have misconfigured something: > > I have a 4G DOM0 on a 512G System with a EPYC 7302P 16-Core Processor. > > On that I configured a 400G DOMU with 12 vcpus. like this: > > name = "system" > kernel = "/netbsd-XEN3_DOMU.gz" > memory = 40 > cpus="all" > vcpus=4 > maxvcpus=12 > vif = [ 'mac=aa:00:00:d1:00:01,bridge=bridge0', > 'mac=aa:00:00:d1:00:02,bridge=bridge1' ] > disk = [ 'file:/data0/xen-roots/root-Alpine-system.img,0x0,w', > 'phy:/dev/wedges/data1,0x1,w' ] > > On that I run postgresql 11 attempting to load a 1TB database. > > Usually this workload keeps a machine continually busy cpu/io-wise. > > I was expecting that I/O via the xen backend would be the bottleneck. > > Instead DOM0 is only seldom busy for IO. DOMU is crawling along sleeping > > at all sorts of places: What does iostat 5 show about the disks, in the dom0 and domU ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 09:00:30PM +, Andrew Doran wrote: > I reproduced it on native x86. It's a bug in the CPU topology code. Now > fixed with revision 1.11 src/sys/kern/subr_cpu.c - sorry about that. I confirm, I now see user activity on all CPUs. Thanks ! -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 07:11:21PM +, Andrew Doran wrote: > On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote: > > > On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote: > > > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote: > > > > > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote: > > > > > It also sets rsp and rbp. I think rbp is not set by anything else, at > > > > > last > > > > > in the Xen case. > > > > > The different rbp value would explain why in one case we hit a > > > > > KASSERT() > > > > > in lwp_startup later. > > > > > But I don't know what pcb_rbp contains; I couldn't find where the pcb > > > > > for > > > > > idlelwp is initialized. > > > > > > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto() > > > > does. It doens't cause the lwp_startup() KASSERT as calling > > > > cpu_switchto() > > > > does; it also doesn't change the scheduler behavior. > > > > > > Wait - do you mean that everything works now? Or that everything still > > > runs > > > on CPU0? > > > > No, everything still runs on CPU0 > > Hmm, I don't understand why. I'll set up Xen and try it out. It might take > me a day or two. OK thanks. > [...] > > The assertion in lwp_startup() is because I made MI changes so that prevlwp > is never NULL when calling cpu_switchto(), when fixing some bugs problems MP > support on !x86 and make things more correct. lwp_startup()/mi_switch() now > need to unlock prevlwp after it is finished in cpu_switchto(). I never > expected anybody but mi_switch() to call cpu_switchto(). OK, so I removed the call to cpu_switchto() before idle_loop(), and added a few KASSERTS. I guess you can back out the prev == NULL case from cpu_switchto(). -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote: > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote: > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote: > > > It also sets rsp and rbp. I think rbp is not set by anything else, at last > > > in the Xen case. > > > The different rbp value would explain why in one case we hit a KASSERT() > > > in lwp_startup later. > > > But I don't know what pcb_rbp contains; I couldn't find where the pcb for > > > idlelwp is initialized. > > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto() > > does. It doens't cause the lwp_startup() KASSERT as calling cpu_switchto() > > does; it also doesn't change the scheduler behavior. > > Wait - do you mean that everything works now? Or that everything still runs > on CPU0? No, everything still runs on CPU0 > > The very first thing that idle_loop() does on amd64/i386 is set up the frame > pointer - ebp/rbp. > > : >0: 55 push %rbp >1: 48 89 e5mov%rsp,%rbp >4: 41 56 push %r14 >6: 41 55 push %r13 OK, so it's OK that my patch doesn't changes anything. And so I still don't understand the KASSERT when cpu_switchto() is called before idle_loop(). -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > It also sets rsp and rbp. I think rbp is not set by anything else, at least
> > in the Xen case.
> > The different rbp value would explain why in one case we hit a KASSERT()
> > in lwp_startup later.
> > But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> > idlelwp is initialized.
>
> I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> does; it also doesn't change the scheduler behavior.

With the patch this time

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Index: sys/arch/xen/x86/cpu.c
===
RCS file: /cvsroot/src/sys/arch/xen/x86/cpu.c,v
retrieving revision 1.131
diff -u -p -u -r1.131 cpu.c
--- sys/arch/xen/x86/cpu.c	23 Nov 2019 19:40:38 -	1.131
+++ sys/arch/xen/x86/cpu.c	13 Jan 2020 16:40:50 -
@@ -739,7 +739,16 @@ cpu_hatch(void *v)
 
 	aprint_debug_dev(ci->ci_dev, "running\n");
 
-	cpu_switchto(NULL, ci->ci_data.cpu_idlelwp, true);
+#ifdef __x86_64__
+	asm("movq %0, %%rsp" : : "r" (pcb->pcb_rsp));
+	asm("movq %0, %%rbp" : : "r" (pcb->pcb_rbp));
+#else
+	asm("movl %0, %%esp" : : "r" (pcb->pcb_esp));
+	asm("movl %0, %%ebp" : : "r" (pcb->pcb_ebp));
+#endif
+	KASSERT(ci->ci_curlwp == ci->ci_data.cpu_idlelwp);
+
+	//cpu_switchto(NULL, ci->ci_data.cpu_idlelwp, true);
 	idle_loop(NULL);
 	KASSERT(false);
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> It also sets rsp and rbp. I think rbp is not set by anything else, at least
> in the Xen case.
> The different rbp value would explain why in one case we hit a KASSERT()
> in lwp_startup later.
> But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> idlelwp is initialized.

I tried the attached patch, which should set rsp/rbp as cpu_switchto()
does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
does; it also doesn't change the scheduler behavior.

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 02:49:52PM +, Andrew Doran wrote:
> > Now I get a different panic:
> > [ 1.000] vcpu0 at hypervisor0
> > [ 1.000] vcpu0: 64 page colors
> > [ 1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz, id 0x6fb
> > [ 1.000] vcpu0: node 0, package 0, core 1, smt 0
> > [ 1.000] vcpu1 at hypervisor0
> > [ 1.000] vcpu1: 2 page colors
> > [ 1.000] vcpu1: starting
> > [ 1.000] vcpu1: is started.
> > [ 1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz, id 0x6fb
> > [ 1.000] vcpu1: node 0, package 0, core 0, smt 0
> > [...]
> > [ 1.030] UVM: using package allocation scheme, 1 package(s) per bucket
> > [ 1.030] Xen vcpu1 clock: using event channel 7
> > [ 1.8809493] vcpu1: running
> > [ 1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
> > [ 1.8809493] cpu1: Begin traceback...
> > [ 1.8809493] vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0) at netbsd:vpanic+0x134
> > [ 1.8809493] kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at netbsd:kern_assert+0x23
> > [ 1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at netbsd:lwp_startup+0x155
> > [ 1.8809493] cpu1: End traceback...
> >
> > If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
> > that all user processes are running on cpu0 only ...
>
> I looked and the only thing cpu_switchto() is doing there is setting curlwp,
> but that's already set in cpu_start_secondary(), so it's not needed.

It also sets rsp and rbp. I think rbp is not set by anything else, at least
in the Xen case.
The different rbp value would explain why in one case we hit a KASSERT()
in lwp_startup later.
But I don't know what pcb_rbp contains; I couldn't find where the pcb for
idlelwp is initialized.

>
> > I can't see what extra work the cpu_switchto() could be doing that would
> > matter, except maybe the %ebp/rbp init. Any idea ?
>
> Right I don't think cpu_switchto() matters there. The strategy for
> assigning LWPs to CPUs in the scheduler has changed. If the machine is not
> busy everything is likely to stay on CPU0. Are you putting much load on it?

I just tried a build.sh -j4
CPU0 is 100% busy, the others are 100% idle:

load averages:  3.02,  2.14,  1.26;               up 0+00:51:59        16:59:03
61 processes: 5 runnable, 54 sleeping, 2 on CPU
CPU0 states: 39.3% user,  0.0% nice, 60.7% system,  0.0% interrupt,  0.0% idle
CPU1 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU2 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU3 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 1402M Act, 168K Inact, 16K Wired, 14M Exec, 1352M File, 1932M Free
Swap:

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
21392 bouyer    33    0    29M 5964K RUN/0      0:00  2.00%  0.10% as
    0 root       0    0     0K   11M CPU/3      0:30  0.00%  0.00% [system]
   81 bouyer    85    0    20M 3596K kqueue/0   0:19  0.00%  0.00% tmux
  226 bouyer    43    0    16M 1900K CPU/0      0:00  0.00%  0.00% top
16883 bouyer    33    0  8992K 2212K RUN/0      0:00  0.00%  0.00% nbmake
21137 bouyer    33    0  7844K 1220K RUN/0      0:00  0.00%  0.00% sed
12098 bouyer    33    0  4288K  164K RUN/0      0:00  0.00%  0.00% sh
22411 bouyer    33    0  4288K  164K RUN/0      0:00  0.00%  0.00% cc
   42 root      85    0    80M 5768K poll/0     0:00  0.00%  0.00% sshd

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 12:02:13PM +, Andrew Doran wrote:
> Ah yes it does, I saw something that made me think it affected x86_64 only.
> I'll make the change on i386 too.

thanks. Now I get a different panic:
[ 1.000] vcpu0 at hypervisor0
[ 1.000] vcpu0: 64 page colors
[ 1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz, id 0x6fb
[ 1.000] vcpu0: node 0, package 0, core 1, smt 0
[ 1.000] vcpu1 at hypervisor0
[ 1.000] vcpu1: 2 page colors
[ 1.000] vcpu1: starting
[ 1.000] vcpu1: is started.
[ 1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz, id 0x6fb
[ 1.000] vcpu1: node 0, package 0, core 0, smt 0
[...]
[ 1.030] UVM: using package allocation scheme, 1 package(s) per bucket
[ 1.030] Xen vcpu1 clock: using event channel 7
[ 1.8809493] vcpu1: running
[ 1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
[ 1.8809493] cpu1: Begin traceback...
[ 1.8809493] vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0) at netbsd:vpanic+0x134
[ 1.8809493] kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at netbsd:kern_assert+0x23
[ 1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at netbsd:lwp_startup+0x155
[ 1.8809493] cpu1: End traceback...

If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
that all user processes are running on cpu0 only ...

I can't see what extra work the cpu_switchto() could be doing that would
matter, except maybe the %ebp/rbp init. Any idea ?

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 11:42:17AM +, Andrew Doran wrote:
> Hi Manuel,
>
> On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote:
> > Hello,
> > A current Xen domU kernel fails to boot with:
> > [ 1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1
> > [ 1.000] vcpu0 at hypervisor0
> > [ 1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > [ 1.000] vcpu0: node 0, package 0, core 1, smt 1
> > [ 1.000] vcpu1 at hypervisor0
> > [ 1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > [ 1.000] vcpu1: node 0, package 1, core 0, smt 0
> > [ 1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface
> > [ 1.000] xencons0 at hypervisor0: Xen Virtual Console Driver
> > [ 1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e
> > [ 1.9901295] fatal page fault in supervisor mode
> > [ 1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88
> > [ 1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack 0xb7802b1992c0
> > kernel: page fault trap, code=0
> > Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf: movq 28(%r13),%rax
> > cpu_switchto() at netbsd:cpu_switchto+0xf
> >
> > both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is
> > missing in initialisations of secondary CPUs.
> > This happens with the 202001101800Z but the problem is probably older than
> > that (the testbed used vcpus=1 until today)
> >
> > Any idea ?
>
> It should work now with revision 1.199 of src/sys/arch/amd64/amd64/locore.S.

The same problem happens with i386.

> Nothing else in tree calls cpu_switchto() with prevlwp=NULL any more. Can
> Xen's cpu_hatch() call idle_loop() directly?

Maybe it could, but cpu_switchto() does some extra work (switch the stack,
set curlwp at least). Maybe this is already done but I'll have to double check.

--
Manuel Bouyer
NetBSD: 26 ans d'experience feront toujours la difference
--
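A side note on the first trace: the faulting instruction is "movq 28(%r13),%rax" and the reported fault address is cr2 0x28. That is what one would expect if %r13 held a NULL prevlwp and the code loaded a field 0x28 bytes into the structure. The sketch below just reproduces that arithmetic. It assumes the ddb displacement is hexadecimal (which matches cr2) and that %r13 carries the prevlwp argument; struct fake_lwp and its members are hypothetical stand-ins, not the real struct lwp layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct lwp: only the layout matters here. */
struct fake_lwp {
	unsigned char pad[0x28];	/* assumed filler up to offset 0x28 */
	void *field_at_0x28;		/* hypothetical field the movq reads */
};

/*
 * Address the CPU would touch for "movq 0x28(%r13),%rax" when the
 * register holds `base`.
 */
static uintptr_t
fault_address(uintptr_t base)
{
	return base + offsetof(struct fake_lwp, field_at_0x28);
}
```

With base 0 (a NULL prevlwp) fault_address() yields 0x28, the cr2 value in the trace, which is consistent with the crash coming from the cpu_switchto(NULL, ...) call in cpu_hatch().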