Re: GitHub mirror stopped mirroring
At Sun, 28 Jan 2024 08:22:49 +, Chris Pinnock wrote:
Subject: Re: GitHub mirror stopped mirroring
>
> > The Mercurial mirror also hasn't been updated for a week.
> >
> > Ngā mihi,
> > Lloyd
>
> Hi - someone was looking at this yesterday. Mercurial syncing
> again.
>
> KRgds, C

It doesn't seem to have made it to anonhg.NetBSD.org yet. The src repo
there is still 11 days old, as of course is the GitHub version.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes
At Tue, 20 Apr 2021 16:53:58 -0700, "Greg A. Woods" wrote:
Subject: Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes
>
> With the gracious help of RVP I have been able to identify
> better what is actually going wrong with FreeBSD's access to NetBSD dom0
> xbdback(4) storage.
>
> It seems that in certain circumstances (e.g. in newfs and the test
> program) whenever FreeBSD issues a read of more than 1024 bytes only the
> first 1024 bytes are correct -- the rest of the bytes returned come from
> somewhere else on the disk, which appears to be starting at six(6)
> sectors after where they were supposed to have come from. Note that
> this corresponds to exactly 4096 bytes offset from the beginning of the
> read.

Reviving this old thread with some new info.

It seems ZFS either doesn't issue large read requests or works around
the problem in some other way (or both).

With the help of a custom FreeBSD kernel with ZFS compiled in, booted as
a PVH domU kernel, and with the new(ish) FreeBSD (14.0) way of installing
with a ZFS root, I have a couple of domUs running just fine now. One even
recovered old zpools on the machine where I first experienced this
problem!

As soon as possible, especially if I can dredge up another test server,
I'll test plain UFS again with a NetBSD 10.0_RC2 kernel as dom0.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
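The symptom quoted above (a single read of more than 1024 bytes returning wrong data past the first 1024 bytes) suggests a simple consistency check: read a range once in a single call, read it again in sector-sized pieces, and compare. The following is only a sketch of that idea, not the test program mentioned in the thread; the function name `first_mismatch` and the 512-byte `SECTOR` size are assumptions, and on a raw device the offset and length would need to be sector-aligned.

```python
import os

SECTOR = 512  # assumed sector size; adjust for the device under test


def first_mismatch(path, offset, length, chunk=SECTOR):
    """Compare one large read against the same range read in small chunks.

    Returns the position (relative to `offset`) of the first byte that
    differs, or None if both reads returned the same data.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # One request covering the whole range.
        big = os.pread(fd, length, offset)
        # The same range again, one small request at a time.
        small = b"".join(
            os.pread(fd, min(chunk, length - pos), offset + pos)
            for pos in range(0, length, chunk)
        )
    finally:
        os.close(fd)
    if big == small:
        return None
    for i, (a, b) in enumerate(zip(big, small)):
        if a != b:
            return i
    # Same prefix but different lengths (a short read on one side).
    return min(len(big), len(small))
```

On a healthy disk or an ordinary file this returns None for every range; on the affected setup described above one would expect a result of 1024 for reads larger than that, though that expectation is an inference from the description rather than something verified here.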
Re: Status of NetBSD virtualization roadmap - support jails like features?
At Fri, 15 Apr 2022 07:36:15 +0200, Matthias Petermann wrote:
Subject: Status of NetBSD virtualization roadmap - support jails like features?
>
> My motivation: I am looking for a particularly high performance
> virtualization solution on NetBSD. Especially disk and network IO
> plays a role for me.

In my experience nothing beats I/O performance of Xen with LVM in the
dom0 and the best/fastest storage available for the dom0, especially now
there's SMP support for dom0.

That's anecdotal though -- I haven't done any real comparisons. I just
know that NFS in domUs is a lot slower than using LVMs via xbd(4), no
matter where/how-fast the NFS server is!

If I'm not too far out of touch I think there's still a wee bit more SMP
support needed in the networking code to make it possible for dom0 to
also give the best network throughput, but it's really not horrible
as-is.

In theory NVMM with QEMU and virtio(4) should be about the same I would
guess, with potential for some improvement in some micro-benchmarks, but
for production use the maturity and completeness of the provisioning
support offered by Xen still seems far superior to me.

> Regardless, I still think it wouldn't hurt
> if NetBSD could implement some sort of
> jail.

I'm not convinced "jails" (at least in the FreeBSD form I'm most
familiar with) actually buy much without also increasing complexity
and/or introducing limitations on both the provisioning and the
"virtual" side.

With a full virtualisation as in Xen the added complexity is very well
partitioned between the provisioning side and the VMs, and there are
almost no limitations inside the VMs (assuming you are virtualising
something that fits well into a virtualised environment, i.e. with no
special direct hardware access needs) -- everything looks and feels and
is managed almost as if it is running on bare hardware and so the
management of the VM is exactly as if it were running on separate
hardware; except of course some aspects are actually easier to manage,
such as provisioning direct console access and control. There's really
nothing new to learn other than how to spin up a new domU (and possibly
how to use LVM effectively).

However FreeBSD-style jails do offer their own form of flexibility that
seems to be worth having available, and it would be nice for jails to be
available on NetBSD as well. The impact inside the OS (kernel and
userland) is quite high though, and is itself a form of complexity
nightmare all its own, though perhaps not so horrible as Linux "cgroups"
and some other related Linux kernel namespaces are.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: Are NetBSD users interested extending options for patch?
At Mon, 11 Apr 2022 21:03:02 +0200, Hans Petter Selasky wrote:
Subject: Are NetBSD users interested extending options for patch?
>
> https://reviews.freebsd.org/D30160

As a user with some extensive background in making and using patch
files, I can't imagine that feature ever being useful; rather I would
find it dangerous, if not just useless.

Patch already has '-p N', and in my experience that has covered most of
the cases where a similar problem actually occurs. In all other cases
(which are very few) I've found that it is trivial to edit the patch,
often in a pipeline with a simple 'sed' command (e.g. in cases where the
pathnames in the patch need a prefix applied or changed, instead of
simply stripping it with '-p').

I would expect any heuristic that automatically searches for files by
simply matching their basename to be very unreliable, and just as likely
to find the wrong file as the right one -- at least in the general case.

Say for example a patch contains a lot of changes to "Makefile" files in
many different directories? How is this hack supposed to help find the
right one (e.g. if a directory containing a "Makefile" was renamed)?

Perhaps, as mentioned in a comment on that post, it may be useful in some
very specific cases where files aren't likely to move around too much
and where all files are guaranteed to be uniquely named and never
renamed despite being moved about between directories.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
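The "Makefile" objection above can be made concrete with a small sketch. This is not how the proposed patch(1) option is implemented (that code is not shown in this thread); it only models a search that ignores directories and matches on basename alone. The tree layout and the function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def candidates_by_basename(tree_paths):
    """Map each basename to every path in the tree that ends with it."""
    index = defaultdict(list)
    for p in tree_paths:
        index[PurePosixPath(p).name].append(p)
    return index


def resolve(patch_path, tree_paths):
    """Return every path a basename-only search could pick for patch_path."""
    name = PurePosixPath(patch_path).name
    return candidates_by_basename(tree_paths)[name]


# Hypothetical tree in which "lib/foo" has been renamed to "lib/libfoo".
tree = ["Makefile", "lib/Makefile", "lib/libfoo/Makefile", "lib/libfoo/foo.c"]

print(resolve("lib/foo/Makefile", tree))  # several candidates, nothing to choose by
print(resolve("lib/foo/foo.c", tree))     # a single candidate
```

A uniquely named file resolves cleanly, which matches the narrow case conceded in the last paragraph of the message; a common name such as "Makefile" leaves the search with several equally plausible targets.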
Re: xterm-color256: Different behavior between NetBSD 9.2 and 9.99.93?
I've recently been trying to debug this same problem, and I had been
gathering info until I got side-tracked onto X11 hi-res monitor issues.

At Thu, 3 Feb 2022 16:28:00 +0300, Valery Ushakov wrote:
Subject: Re: xterm-color256: Different behavior between NetBSD 9.2 and 9.99.93?
>
> On Thu, Feb 03, 2022 at 14:15:45 +0100, Martin Husemann wrote:
>
> > Bug in the terminfo compiler?
>
> http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/tic/tic.c#rev1.39
>
> sounds like it might be related.

As I understand the code the 1.30 fix to promote older compiled entries
being included with "use=" shouldn't affect anything if the source
database is already in the newest format, no?

As I understand things, the problem is that tic(1) isn't incorporating
"use=" entries using the correct algorithm. The value of the "colors"
capability is just a part of the symptom.

The "infocmp -1 xterm-256color" output from NetBSD and from a system
using ncurses should be identical, but at the moment there are several
differences, and examining the terminfo source file suggests, to me at
least, that the order of processing of the "use=" entries is wrong.

The proper algorithm, as I understand it, is to scan right-to-left for
"use=" capabilities, and to rescan after each new entry has been
inserted to replace the "use=" capability.

This algorithm is fairly clearly described in the ncurses terminfo(5)
manual page, and in order to handle the ncurses terminfo source file
more-or-less as-is, one must presumably implement the ncurses "use="
merging algorithm faithfully.

These comments are as far as I've got in diagnosing things in the NetBSD
sources:

--- tic.c.~1.40.~	2020-05-30 17:44:04.0 -0700
+++ tic.c	2022-01-06 17:53:47.893092115 -0800
@@ -424,6 +424,7 @@
 		rtic = term->tic;
 		basename = _ti_getname(TERMINFO_RTYPE_O1, rtic->name);
 		promoted = false;
+		/* XXX this does the use= merging the wrong way!?!?!? */
 		while ((cap = _ti_find_extra(rtic, &rtic->extras, "use"))
 		    != NULL) {
 			if (*cap++ != 's') {
@@ -684,6 +685,7 @@
 	free(tbuf.buf);
 
 	/* Merge use entries until we have merged all we can */
+	/* XXX this doesn't do properly nested merging!!! */
 	while (merge_use(flags) != 0)
 		;
 

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
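The comparison of "infocmp -1" output described above can be done mechanically once both outputs are saved to text. The following is a rough sketch under a few assumptions: that each capability sits on its own indented line ending in a comma (which is what the -1 option is for), that comment lines and the name line are not indented, and that the two sample strings below are invented input for illustration, not real output from either system. The function names are hypothetical.

```python
import re


def parse_infocmp(text):
    """Parse `infocmp -1` style text into {capability name: rest of field}."""
    caps = {}
    for line in text.splitlines():
        # Comment lines and the terminal name line are not indented.
        if not line[:1].isspace():
            continue
        field = line.strip()
        if field.endswith(","):
            field = field[:-1]
        m = re.match(r"[A-Za-z0-9_]+", field)
        if m:
            # Booleans have an empty remainder; others keep "#..." or "=...".
            caps[m.group(0)] = field[m.end():]
    return caps


def diff_caps(a, b):
    """Return {name: (value_in_a, value_in_b)} for capabilities that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Invented sample input, NOT real output from NetBSD or ncurses.
sample_a = "# comment\nxterm-256color|sample,\n\tam,\n\tcolors#8,\n\tbel=^G,\n"
sample_b = "xterm-256color|sample,\n\tam,\n\tcolors#0x100,\n\tbel=^G,\n\tccc,\n"

for name, (va, vb) in diff_caps(parse_infocmp(sample_a),
                                parse_infocmp(sample_b)).items():
    print(name, va, vb)
```

A value of None on one side means the capability is missing there entirely, which is the kind of difference a wrong "use=" merge order could plausibly produce.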
Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
At Thu, 27 Jan 2022 10:40:20 +0100, Martin Husemann wrote:
Subject: Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
>
> On Wed, Jan 26, 2022 at 10:56:53PM -0800, Greg A. Woods wrote:
> > Well, if you have a hardware RNG, or my patches, then that'll do
> > something, but otherwise it's just useless noise and misdirection.
>
> This is not true. Once there is enough entropy gathered (or the system
> has been told the administrator considers it good enough), everything is
> fine and basically the same state as before the changes you want to back
> out (at least from a userland perspective).

That's not my experience, though I am not quite at -current.

One thing that I found I had to change was the way feeding a random
number as entropy through /dev/random wasn't working unless I re-enabled
so-called "estimation" for that device (via rndctl), and I don't think
that was due to any of my changes.

Note that the "seed" device is trusted in the code as if it were a
hardware random number generator so it has "collection" and "estimation"
enabled by default.

Keep in mind also that not all ways of booting NetBSD allow for
"rndseed", including Xen domUs.

What my patches do is re-enable the ability of rndctl to (re-)enable
"collection" and/or "estimation" for other devices that have calls to
submit values and/or timestamps to entropy collection.

This means the following can be added to /etc/rc.conf on, for example,
Xen domU systems and they can come up to full entropy in the good old
fashioned way without suffering from lack of any way to insert entropy
with "rndseed":

	rndctl=YES
	rndctl_flags="-t disk; -t vm"	# optional: "-t net"

I have some tentative patches to make this all actually work for domUs
in sysinst too in the way your message discussed, but I've had an
extremely difficult time getting that to work in any kind of
user-friendly way. It took hours of code walk-through just to figure out
what was really expected of the user. Also for other reasons (e.g.
cloning domUs), I think the "rndctl" way is both easier and more secure
(assuming all the regular things about security between domUs on the
same server).

My patches also mean systems without hardware RNG devices can do the
same, and indeed my Dell servers do just fine accumulating their own
entropy after boot without rndseed and without even any "rndctl" setup
as they have fan speed and voltage monitors in their environmental
sensors, and my patches re-enable default collection and estimation for
such trusted devices.

Personally I find the way the current kernel handles entropy, i.e.
without my patches, to be obnoxious, condescending, and ignorant.
Perhaps that view might cause some to consider me to be the same way,
but I can easily live with that.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
At Wed, 26 Jan 2022 16:47:15 +1300, Lloyd Parkes wrote:
Subject: Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
>
> The change was more subtle than that I
> think. Untrusted hardware was used as an
> entropy source, but it didn't count
> towards the "enough" that was needed to
> bootstrap the rnd system from nothing.

No, not quite -- there was a whole bunch of code removed that is needed
to actually make the hardware events "count" if and when you configure
them to do so.

> On 7 May 2020 a change was committed to
> /etc/rc.d/random_seed so that a seed file
> is created at boot time if you don't
> already have one. I haven't checked
> because I really can't be bothered right
> now, but I'm pretty sure that's all that's
> required.

Well, if you have a hardware RNG, or my patches, then that'll do
something, but otherwise it's just useless noise and misdirection.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
At Mon, 24 Jan 2022 08:46:36 +, "Thomas Mueller" wrote:
Subject: Re: Note: two files changed and hashes/signatures updated for NetBSD 8.1
>
> Does there look to be a fix in the entropy bug?
>
> This bug relates to entropy and how it impedes building many packages
> in pkgsrc.
>
> I seemed to get around this bug on one computer but not the other.

I have fixes that restore the previous option to use "untrusted"
hardware as an entropy source. They may need some updating to be truly
complete in the most recent -current, as I'm still back at 9.99.81.

However I've little hope that my patches will be accepted back into the
main source tree, since there seems to be some crazy un-bendable
insistence on perfect security of all randomness, even for private
machines, embedded systems, and so on.

> Other bug is longer-standing and plagued me in NetBSD 8.99.51 and
> again in 9.99.82.
>
> Do there look to be improvements in how NetBSD handles hard drives
> that would be affected by that bug?
>
> That bug causes device timeouts on some types of hard drive but not
> all.

I can't imagine how the entropy issues could be related in any way to
disk device driver timeouts.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: backward compatibility: how far can it reasonably go?
At Wed, 8 Dec 2021 11:36:17 -0800, Jason Thorpe wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> > On Dec 8, 2021, at 10:52 AM, Greg A. Woods wrote:
> >
> > That's one bullet I've dodged entirely already since my oldest
> > systems are running netbsd-5 stable. (Though in theory isn't
> > there supposed to be COMPAT support for SA?)
>
> int
> compat_60_sys_sa_register(lwp_t *l, const struct
> compat_60_sys_sa_register_args *uap, register_t *retval)
> {
> 	return sys_nosys(l, uap, retval);
> }
>
> SA is one of those things that's REALLY hard to provide
> compatibility for. :-)

I see! Yes, I can appreciate that SA isn't easily maintained in any way.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: backward compatibility: how far can it reasonably go?
At Wed, 08 Dec 2021 11:08:09 -0500, Brad Spencer wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> When I took a system from 4.0 to 7.x some time ago, the only thing that
> I had problems with was anything that used scheduler activations since
> that had been removed. For me this only effected stuff from pkgsrc, as
> I also rolled in new userland at the same time.

That's one bullet I've dodged entirely already since my oldest systems
are running netbsd-5 stable. (Though in theory isn't there supposed to
be COMPAT support for SA?)

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: backward compatibility: how far can it reasonably go?
At Wed, 8 Dec 2021 15:32:24 -, ya...@sdf.org wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> > "Greg A. Woods" writes:

no, Greg Troxel wrote:

> > I am unclear if ipf has been removed by default from current.
> Even in NetBSD 9, ipf is not in the GENERIC kernel config.

Well I'm running in Xen domUs, so not GENERIC but XEN3_DOMU, and indeed
I'm running all custom kernel builds.

> Was the kernel compiled to use ipf?

Clearly IPF is in the 9.99.81 kernel I booted, since its functions are
visible in the backtrace of the crash :-)

If it were not compiled in, I think/hope it would not crash -- just the
ipf tool would return an error and complain about something like ENXIO
or maybe ENODEV.

So if IPF were the only problem I would try taking it out temporarily,
but with ifconfig also useless, I'll probably try the upgrade from the
dom0.

> e.g. add to kernel config:
> options IPFILTER_LOG      # ipmon(8) log support
> options IPFILTER_LOOKUP   # ippool(8) support
> options IPFILTER_COMPAT   # Compat for IP-Filter
> pseudo-device ipfilter    # IP filter (firewall) and NAT

Yes, all there (and BRIDGE_IPF as well, though I haven't used that
feature yet, and it would likely only be needed in the dom0).

Indeed an identical kernel is already running IPF in another domU
instance, but of course with the corresponding 9.99.81 userland. It
works as well as ever -- I use it with blocklistd, as well as for basic
firewalling (most of my systems are mostly on a private network with
only one or two ports forwarded to them from the main firewall and so
otherwise using the main FW's NAT for outgoing connections only).

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: backward compatibility: how far can it reasonably go?
At Tue, 7 Dec 2021 20:37:26 -0800 (PST), Paul Goyette wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> Without looking at the details of your backtrace, the issue with
> ifconfig(8) could be related to PRs kern/54150 and/or kern/54151.

Aw, damn, my memory is too short! Thanks for reminding me of those!

The kernel crash was IPF-related, and in my test back then I was testing
on an i386 machine, which at the time, IIRC (and we know what that might
mean), was not running IPF.

Anyway, the two machines I'm upgrading do need to run IPF, at least
until they are running a new OS with new pkgs.

I'm beginning to think the only way to avoid that rabbit hole in order
to get these upgrades done in the next week will be to shut them down
and do the upgrades by mounting their filesystems in their dom0s.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
backward compatibility: how far can it reasonably go?
So I've got a couple of old but important machines (Xen amd64 domUs)
running NetBSD-5, and I've finally decided that I'm reasonably well
enough prepared to try upgrading them.

However it seems a "modern" (9.99.81, -current from about 2021-03-10)
kernel with COMPAT_40 isn't able to run some of the userland on those
systems.

Is this something that should work? If it should I think it would make
the upgrade much easier as I could then plop down the new userland and
run etcupdate. (There are of course alternative ways to do the upgrade,
eased by the fact they are domUs (*).)

The most immediate problems I noticed are with networking. ifconfig -a
returns without printing anything, and trying to enable IPF crashes:

Enabling ipfilter.
[ 90.1912601] panic: kmem_free(0xd000108870c0, 697) != allocated size 18374686479671623680; overwrote?
[ 90.1912601] cpu3: Begin traceback...
[ 90.1922525] vpanic() at netbsd:vpanic+0x14a
[ 90.1922525] snprintf() at netbsd:snprintf
[ 90.1922525] kmem_alloc() at netbsd:kmem_alloc
[ 90.1932517] frrequest() at netbsd:frrequest+0x100
[ 90.1932517] ipf_ipf_ioctl() at netbsd:ipf_ipf_ioctl+0x37d
[ 90.1932517] ipfioctl() at netbsd:ipfioctl+0x9a
[ 90.1942516] cdev_ioctl() at netbsd:cdev_ioctl+0x81
[ 90.1942516] VOP_IOCTL() at netbsd:VOP_IOCTL+0x3e
[ 90.1942516] vn_ioctl() at netbsd:vn_ioctl+0xad
[ 90.1952515] sys_ioctl() at netbsd:sys_ioctl+0x555
[ 90.1952515] syscall() at netbsd:syscall+0x9c
[ 90.1952515] --- syscall (number 54) ---
[ 90.1952515] netbsd:syscall+0x9c:
[ 90.1952515] cpu3: End traceback...
[ 90.1952515] fatal breakpoint trap in supervisor mode
[ 90.1952515] trap type 1 code 0 rip 0x8022d93d cs 0xe030 rflags 0x202 cr2 0x7a0d38c36020 ilevel 0 rsp 0xd0018da561b0
[ 90.1952515] curlwp 0xdf5468c0 pid 184.184 lowest kstack 0xd0018da522c0
Stopped in pid 184.184 (ipf) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x14a
snprintf() at netbsd:snprintf
kmem_alloc() at netbsd:kmem_alloc
frrequest() at netbsd:frrequest+0x100
ipf_ipf_ioctl() at netbsd:ipf_ipf_ioctl+0x37d
ipfioctl() at netbsd:ipfioctl+0x9a
cdev_ioctl() at netbsd:cdev_ioctl+0x81
VOP_IOCTL() at netbsd:VOP_IOCTL+0x3e
vn_ioctl() at netbsd:vn_ioctl+0xad
sys_ioctl() at netbsd:sys_ioctl+0x555
syscall() at netbsd:syscall+0x9c
--- syscall (number 54) ---
netbsd:syscall+0x9c:
ds      61c0
es      6170
fs      61b0
gs      10
rdi     0
rsi     d0018da55f5c
rbp     d0018da561b0
rbx     1
rdx     2
rcx     0
rax     0
r8      1
r9      1
r10     0
r11     fffe
r12     104
r13     8063bb30 ostype+0x36eb8
r14     d0018da561f8
r15     3
rip     8022d93d breakpoint+0x5
cs      e030
rflags  202
rsp     d0018da561b0
ss      e02b
netbsd:breakpoint+0x5: leave
db{3}>

(*) alternatives

Now since these are domUs and their dom0 is also NetBSD I could also
upgrade them "in absentia" so to speak, i.e. drop a new userland on
their filesystems from the dom0, though this seems more scary somehow.
I guess it shouldn't be since the dom0 and other test systems are
already running what I want them to run.

Or, given they are relatively cleanly configured filesystem-wise (esp.
with a separate /usr/pkg, /home, etc.) I could also build new prototype
systems, copy over the /etc files and old shared libraries from the old
system to the new prototype, then run etcupdate on the new prototype,
and finally shut down the old system, re-assign the other filesystems
(/var, /usr/pkg, /home, /work, etc.) to the prototype, reboot the
prototype with the old system's name and address, and finally patch up
and/or rebuild whatever is needed in /var.

The key thing is that I want to be able to upgrade pkgs piecemeal since
I'm sure there will be some hiccups and reconfigs required along the
way.

Note that most everything is static-linked on these systems. The base
system is 100% static linked (except for ld.elf_so itself) though of
course there still are a few baroque packages which require
dynamic-loaded code so I will still need to be very careful to preserve
all old shared libraries. That makes the approach of building a fresh
prototype somewhat more difficult, though ultimately perhaps safest as
it can be fully tested before ditching the old system.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
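One detail in the panic message above is easier to see in hexadecimal: the "allocated size" that kmem_free() found. The short check below only does the base conversion; the reading that follows it (that such a value looks like a clobbered or misread size rather than a real allocation) is an interpretation, not a diagnosis of the actual bug.

```python
reported = 18374686479671623680  # "allocated size" from the panic message
requested = 697                  # size that was passed to kmem_free()

print(hex(reported))             # 0xff00000000000000
print(reported == 0xFF << 56)    # True: only the top byte is set
print(hex(requested))            # 0x2b9
```

A size with just the top byte set is nowhere near the 697 bytes being freed, which would be consistent with the "overwrote?" hint printed by the kernel itself.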
panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/build/src/sys/arch/x86/x86/pmap.c", line 2431
I've been busy testing kernels in Xen domUs.

Just after running "xl destroy nbtest" this happened:

[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4256354] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499254.1256770] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499254.1256770] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499256.1658017] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499256.1658017] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499258.1359215] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499258.1359215] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499260.1560424] xvif37i0: disconnecting
[ 2499260.1560424] xbd backend: detach device vg0-nbtest.pkg for domain 37
[ 2499260.1660445] xbd backend: detach device vg0-nbtest.var for domain 37
[ 2499260.1660445] xbd backend: detach device vg0-nbtest.swap for domain 37
[ 2499260.1660445] xbd backend: detach device vg0-nbtest.root for domain 37
[ 2499262.0061528] xbd backend: detach device vnd0d for domain 37
[ 2499264.9263322] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/build/src/sys/arch/x86/x86/pmap.c", line 2431
[ 2499264.9263322] cpu0: Begin traceback...
[ 2499264.9263322] vpanic() at netbsd:vpanic+0x14a
[ 2499264.9263322] kern_assert() at netbsd:kern_assert+0x48
[ 2499264.9263322] pmap_free_ptp() at netbsd:pmap_free_ptp+0x3b1
[ 2499264.9263322] pmap_enter_ma() at netbsd:pmap_enter_ma+0xebe
[ 2499264.9263322] privcmd_ioctl() at netbsd:privcmd_ioctl+0xa8c
[ 2499264.9263322] kernfs_try_fileop() at netbsd:kernfs_try_fileop+0x5c
[ 2499264.9263322] VOP_IOCTL() at netbsd:VOP_IOCTL+0x5d
[ 2499264.9263322] vn_ioctl() at netbsd:vn_ioctl+0xad
[ 2499264.9263322] sys_ioctl() at netbsd:sys_ioctl+0x555
[ 2499264.9263322] syscall() at netbsd:syscall+0x9c
[ 2499264.9263322] --- syscall (number 54) ---
[ 2499264.9263322] netbsd:syscall+0x9c:
[ 2499264.9263322] cpu0: End traceback...
[ 2499264.9263322] fatal breakpoint trap in supervisor mode
[ 2499264.9263322] trap type 1 code 0 rip 0x8023e93d cs 0xe030 rflags 0x202 cr2 0x70153d533000 ilevel 0 rsp 0xc580ef7a4950
[ 2499264.9263322] curlwp 0xc58012291240 pid 12621.12621 lowest kstack 0xc580ef7a02c0
Stopped in pid 12621.12621 (xl) at netbsd:breakpoint+0x5: leave
ds      4960
es      4910
fs      4950
gs      10
rdi     0
rsi     1
rbp     c580ef7a4950
rbx     c58003ad6f40
rdx     2
rcx     0
rax     0
r8      c58003ad6f40
r9      1
r10     0
r11     fffe
r12     104
r13     80c9d620 ostype+0x148
r14     c580ef7a4998
r15     701535f6e000
rip     8023e93d breakpoint+0x5
cs      e030
rflags  202
rsp     c580ef7a4950
ss      e02b
netbsd:breakpoint+0x5: leave
db{0}> (XEN) [2021-12-05 19:27:12.065] Watchdog timer fired for domain 0
(XEN) [2021-12-05 19:27:12.065] Hardware Dom0 shutdown: watchdog rebooting machine

This is an amd64 system running a 9.99.81 kernel and Xen 4.13.2nb2.

Is it worth a PR?

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: sysinst extended partitioning won't set/do the "newfs" flag!
At Mon, 08 Jun 2020 16:42:42 -0700, "Greg A. Woods" wrote:
Subject: sysinst extended partitioning won't set/do the "newfs" flag!
>
> I'm having trouble getting the "new" sysinst, when using extended
> partitioning, to set the "newfs" flag (and the "-o log" flag).
>
> I can set it, but it never sticks and never happens, which means nothing
> gets mounted.

And the "mount" flag doesn't seem to have any effect either (e.g. even
if all the partitions and filesystems are ready made).

I've updated recently to the latest sysinst sources from -current with
no improvement.

I now see from the source that the mysterious "install" flag should only
be set on one partition (though I'm still not quite sure exactly what
it's supposed to mean, except that this is to be the root filesystem,
though why it can't figure that out from the mount point being "/" is
not clear). Having that flag set on only one partition doesn't help
though.

(My original guess was that partitions tagged with the "install" flag
were to be used during the install, i.e. mounted under /targetroot, and
any without it set would only be written to /etc/fstab for use once the
target system boots live.)

In any case my basic expectation of the "extended partitioning" feature
is that I should be able to use it to install on a system with a "bunch"
of disks, making one or more filesystems (or, e.g., swap partitions) on
each disk, and having them be newfs'ed (or not) and mounted (or not) for
the target system (all before extracting sets, i.e. mounted under
/targetroot), and have the resulting configuration all be written to the
new target's /etc/fstab.

So far I've been unable to even get close to making this work. (I can
get it to create partitions, but then it won't do anything with them.)

Trying to read the source to figure out what is and isn't working, or
how it maybe should work, hasn't helped me any yet either. A design
guide, or theory-of-operation doc, etc. might help.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
At Fri, 8 Oct 2021 19:44:02 + (UTC), RVP wrote:
Subject: Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
>
> On Fri, 8 Oct 2021, Greg A. Woods wrote:
>
> > If two identical 'mv' commands run in the same directory (with no other
> > commands running there in between) then the second one is going to
> > report an ENOENT error. Given these 'mv' commands are on the tail of a
> > command list that creates the source file, they have to run very nearly
> > in parallel in order to trigger the observed failure.
> >
>
> GCC comes with a move-if-change script to do just this kind of file-rename
> juggling. Try using that in the rule instead of the home-brew commands...
>
> /usr/src/external/gpl3/gcc/dist/move-if-change

I think that would be very thin wallpaper for such a problem. :-)

There are possibly other lurking problems for such a parallel build
failure, so fixing the root cause really would be the better solution.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
At Thu, 7 Oct 2021 23:17:33 + (UTC), RVP wrote:
Subject: Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
>
> On Thu, 7 Oct 2021, Greg A. Woods wrote:
>
> > It's almost as if the call to rename() in 'mv' succeeds, but returns an
> > ENOENT error sometimes!!!
> >
>
> Or, there's a race to completion happening. Is libstdc++-v3 being built
> twice?

Yes that's quite likely. I realised the same just after I wrote my
message and went out to do some yard work.

If two identical 'mv' commands run in the same directory (with no other
commands running there in between) then the second one is going to
report an ENOENT error. Given these 'mv' commands are on the tail of a
command list that creates the source file, they have to run very nearly
in parallel in order to trigger the observed failure.

I'm not sure yet how or where these built include files get specified
twice, or in whatever way that causes them to be built multiple times in
parallel.

Perhaps it's the trickery here (interfering with similar trickery in )?

	/usr/src/external/gpl3/gcc/lib/libstdc++-v3/include/Makefile.includes

Or given that c++config.h was also in the same boat, maybe all of
external/gpl3/gcc/lib/libstdc++-v3/include/bits is being built twice for
the "includes" target?

> PS. I should ask: your machines are all running NTP, right?

Yes indeed, though the second machine is using all local disk.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
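The double-'mv' explanation above is easy to reproduce outside of a build. The sketch below runs the two renames one after the other rather than truly in parallel, which is enough to show the end state: once the first rename has moved the temporary file, the second has no source left and fails with ENOENT. The file names are borrowed from the build log in this thread; the function name `rename_twice` is made up for the illustration.

```python
import errno
import os
import tempfile


def rename_twice(directory):
    """Create X.tmp, then rename it to X twice, as two racing 'mv's would.

    Returns the errno of the second rename, or None if it succeeded.
    """
    tmp = os.path.join(directory, "gthr-posix.h.tmp")
    dst = os.path.join(directory, "gthr-posix.h")
    with open(tmp, "w") as f:
        f.write("/* generated */\n")
    os.rename(tmp, dst)        # first 'mv': succeeds
    try:
        os.rename(tmp, dst)    # second 'mv': the source is already gone
    except OSError as e:
        return e.errno
    return None


with tempfile.TemporaryDirectory() as d:
    err = rename_twice(d)
    print(errno.errorcode.get(err))
```

In the real build the two command lists each recreate the temporary file before renaming it, so the failure only shows up when both creations happen before either rename, which would explain why it appears so rarely.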
Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
At Thu, 07 Oct 2021 10:25:31 -0700, "Greg A. Woods" wrote: Subject: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits > > I had a parallel build fail as follows yesterday. > > This same source tree has been built in the same way on the same machine > multiple times without these errors ever appearing. An rsync'ed copy of > the source tree has been successfully built on another machine multiple > times without these errors ever appearing. I spoke a little too soon. The second machine encountered the same error just now, but only part of it -- i.e. only one of the 'mv' commands failed: mv: rename gthr-posix.h.tmp to gthr-posix.h: No such file or directory --- gthr-posix.h --- *** [gthr-posix.h] Error code 1 nbmake[7]: stopped in /work/woods/m-NetBSD-current/external/gpl3/gcc/lib/libstdc++-v3/include/bits 1 error I happened to have a couple of other older builds from the same tree on that other machine, and so I looked for similar errors in the logs from those builds, and what do you know, but I found one more (from last March)! mv: rename c++config.h.tmp to c++config.h: No such file or directory includes ===> external/mit/xorg/lib/xkeyboard-config/symbols/nec_vndr install /build/woods/b2/current-amd64-destdir/usr/X11R7/include/xcb/xc_misc.h install /build/woods/b2/current-amd64-destdir/usr/X11R7/include/xcb/xcb.h --- includes-include --- nbmake[5]: stopped in /work/woods/m-NetBSD-current/external/gpl3/gcc/lib/libstdc++-v3 --- includes-libxcb --- nbmake[8]: stopped in /work/woods/m-NetBSD-current/external/mit/xorg/lib/libxcb --- includes-bits --- nbmake[9]: stopped in /work/woods/m-NetBSD-current/external/gpl3/gcc/lib/libstdc++-v3/include That time I just restarted the build and put it down to a parallel build Makefile error. In that case it's from external/gpl3/gcc/lib/libstdc++-v3/include/bits/Makefile and again it's the same style of rule where it is impossible for me to understand how the failure could possibly happen! 
It's almost as if the call to rename() in 'mv' succeeds, but returns an ENOENT error sometimes!!!

BTW, there's a rule in /usr/src/external/gpl3/gcc/lib/libstdc++-v3/Makefile of a similar form but it includes what seems to me to be a nonsensical "&& rm -f ${.TARGET}.tmp" at the end. Shouldn't that be "|| rm -f ..."??? (or just not there at all?)

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
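One way to get exactly this symptom, offered only as an unconfirmed guess and not as a diagnosis of the NetBSD Makefiles, is for two parallel jobs to run the same rule in the same directory and share the one ".tmp" name: the first 'mv' wins and the second finds its source already gone. The Python sketch below reproduces only that error in a scratch directory; the function name is made up for illustration and the file names are borrowed from the log above.

```python
import errno
import os
import tempfile


def second_rename_errno(workdir):
    """Simulate two jobs that both try to 'mv target.tmp target'.

    Returns the errno seen by the second rename (0 if it succeeded).
    """
    tmp = os.path.join(workdir, "gthr-posix.h.tmp")  # name taken from the log
    target = os.path.join(workdir, "gthr-posix.h")

    # Both hypothetical jobs generate the same temporary file; one write is
    # enough to stand in for that here.
    with open(tmp, "w") as f:
        f.write("/* generated */\n")

    os.rename(tmp, target)      # job A: succeeds, the target now exists
    try:
        os.rename(tmp, target)  # job B: the source has already been moved
    except OSError as e:
        return e.errno
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        err = second_rename_errno(d)
        print(errno.errorcode.get(err, err))
```

Seen from the build log this looks like a rename that worked and still failed: the target is present and correct, yet one 'mv' reports "No such file or directory".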
Re: Entropy error blocks lang/python38 installation
At Wed, 16 Jun 2021 11:18:02 +0200, Martin Husemann wrote:
Subject: Re: Entropy error blocks lang/python38 installation
>
> On Wed, Jun 16, 2021 at 11:10:34AM +0200, Joerg Sonnenberger wrote:
> > On Wed, Jun 16, 2021 at 06:13:23AM +0200, Martin Husemann wrote:
> > > On Wed, Jun 16, 2021 at 03:42:35AM +, Thomas Mueller wrote:
> > > > I believe I must apply the fix/workaround every time.
> > >
> > > The entropy state gets stored on shutdown and reloaded on next boot.
> > > Fixing it once is enough.
> >
> > ...assuming that people actually use shutdown and don't just reboot.
>
> Kinda - but the instructions in the man page are quite explicit and
> ask you to save the entropy state at least once manually, which should
> avoid the blocking behaviour in all cases.

That's not an acceptable regression. Previously no manual operations were ever necessary -- the blocking was never permanent.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: building netbsd-9 2 'sync' processes stuck in 'tstile'
At Fri, 7 May 2021 17:39:47 -0500 (CDT), "John D. Baker" wrote:
Subject: Re: building netbsd-9 2 'sync' processes stuck in 'tstile'
>
> So far, the now 6 'sync' processes have been stuck in "tstile" for 4
> days. Other than being unable to build/link any kernels, the system is
> fine and its primary functions as file server (NFS, SaMBa, AppleTalk),
> backup DNS and NTP server are unaffected as are its clients (i.e., every
> other machine on my LAN). CVS updates to the various trees complete
> without problems.
>
> Of course anything that runs 'sync' will get stuck.
>
> This is the first time I've had this kind of problem on this system
> since I placed it in service in 2010.

This really smells more like a kernel deadlock.

I wonder if you could use crash(8) (or ddb(4)) to get kernel stack traces of the stuck processes. (E.g. see the EXAMPLES section in crash(8).) That might help narrow down which locks are causing the problems...

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: building netbsd-9 2 'sync' processes stuck in 'tstile'
At Mon, 3 May 2021 17:52:03 -0500 (CDT), "John D. Baker" wrote: Subject: building netbsd-9 2 'sync' processes stuck in 'tstile' > > While building netbsd-9/amd64 with "-j 2", the build process got stuck > while linking "GENERIC_KASLR" and "GENERIC". 'top' shows two 'sync' > processes stuck in 'tstile'. Although the build could be aborted with > "Ctrl-C", the two 'sync' processes remain and cannot be killed (even > with -9). > > The host is netbsd-9/amd64 as of 30 April. The filesystem on which the > build process operates resides on a local RAIDframe RAID-R of eight 1TB > SATA disks. > > The same filesystem is also NFS exported and clients otherwise continue > to operate on it normally. So, I've had a similar, but less critical, thing happen, though with a somewhat opposite configuration. I.e. I've seen lots of processes get "stuck" and/or very slow (with processes sitting in "tstile" for long periods) on a similar system. However the main problem seemed to be on a -current system that was somewhat heavily accessing an NFS filesystem on another (older) NetBSD system. (i.e. /usr/src and /home are NFS mounts to the other server) I don't know if these "tstile" processes were unkillable (though I've experienced that before where a kernel deadlock caused it(*)). However they eventually completed, and even more mysteriously the whole problem resolved itself and disappeared without any knowing intervention! I just left the machine to struggle along overnight and in the morning it was running fine, and continued to do so for over a week until I rebooted the other day to test some unrelated kernel fixes. I never did find any possible cause for the slowness. The older system that's serving NFS has an uptime of 117 days and didn't seem to be suffering any ill effects during the slowness or since. (*) The "tstile" hangs caused by a deadlock were on a Xen dom0 where there were locking order problems in the xenstore interface and so "xl" commands could deadlock in the kernel. 
That bug has been fixed.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: booting xen [was Re: serial console puzzle]
I've copied this reply to port-xen as it's entirely Xen related.

At Fri, 30 Apr 2021 20:50:10 +0200, Manuel Bouyer wrote:
Subject: Re: booting xen [was Re: serial console puzzle]
>
> On Fri, Apr 30, 2021 at 07:28:57PM +0100, Patrick Welche wrote:
> >
> > boot.cfg contains:
> >
> > menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load
> > /netbsd-XEN3_
> > DOM0 console=com1 com1=57600,8n1,0x3f8;multiboot /xen-debug.gz
> > dom0_mem=1024M
>
> should probably be:
> menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load
> /netbsd-XEN3_DOM0 console=com0;multiboot /xen-debug.gz dom0_mem=1024M
> console=com1 com1=57600,8n1,0x3f8
>
> (should really be console=com0 for NetBSD, it doesn't access the hardware and
> use the I/O services from the hypervisor)

On serial console machines I've been using NetBSD "console=xencons" for ages. This is the way documented by Xen (i.e. the preferred Xen way) for serial consoles:

menu=Boot Xen:load /netbsd-XEN3_DOM0 -v bootdev=dk0 console=xencons;multiboot /xen bootscrub=false dom0_mem=4G console=com1,vga console_timestamps=datems dom0_max_vcpus=4 dom0_vcpus_pin=true pv-l1tf=off,domu=off vpmu=on cpuid=rdrand spec-ctrl=no-xen,l1d-flush=off guest_loglvl=all

From my Xen notes:

- N.B.: The Xen kernel handles serial input (and can pass it to the dom0 kernel) but not keyboards, thus for serial console use the NetBSD console should be "xencons", but when using the VGA console the NetBSD console _must_ be "pc".

- Xen counts serial ports from '1', but of course NetBSD counts them from zero, so instead of "console=com0" as would be used for /netbsd alone, it must be "console=com1,vga" for /xen. Note that Xen can use multiple consoles simultaneously! Note also we could tell Xen to set the port up with something like "com1=115200,8n1", but for now I think the BIOS does this OK on the Dell PE machines.

These notes are based on direct examination of the code and are confirmed by practice on multiple machines.
I believe the main advantage of keeping Xen in firm and sole control of the serial console is that you can still talk to Xen directly with it for debugging, as noted by Xen as it boots:

(XEN) [2021-04-21 20:54:44.504] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)

I've not really made use of this feature though -- just tested it a couple of times.

I don't know if Xen still peeks at serial I/O if you let the dom0 kernel take control of the UART, but it may. I just don't see the point of letting the dom0 use anything but xencons, if it can. Similarly I don't see any point to trying to set or reset the UART parameters if the BIOS already has them set and working -- keep it simple and keep as much of the config in the first place it's needed and nowhere else!

For systems with VGA console only though I finally figured out it has to be "console=pc" explicitly else I didn't see any NetBSD boot messages (this I have not diagnosed yet -- it is on a remote machine I've never seen physically, though I do have Dell iDRAC access to it):

menu=Xen:load /netbsd -v bootdev=dk0 console=pc;multiboot /xen dom0_mem=2G dom0_max_vcpus=1 dom0_vcpus_pin

Of course VGA consoles suck for servers and for debugging, but sometimes that's all you've got.

You'll note in the first example and the notes, Xen can use two different consoles simultaneously, so if I do go out into my machine room (i.e. garage) I can see the Xen message on the screen too. I really wish NetBSD could do that.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: Problem reports for version control systems
At Fri, 30 Apr 2021 15:23:03 +0200, Christian Groessler wrote:
Subject: Re: Problem reports for version control systems
>
> On 4/30/21 7:31 AM, Lloyd Parkes wrote:
>
> > Hi all,
> > The problem reports people have in their
> > emails are completely inadequate for
> > trying to determine what is going wrong
> > for people trying to access the NetBSD
> > source.
> >
> >
>
>
> I'm rsync'ing the CVS tree to my local
> server and then run CVS against my server
> on the LAN. No problems...

Same here, since 2001 or so:

RCS file: RCS/rsync-netbsd-cvs,v
revision 1.1
date: 2001/06/06 17:52:06; author: woods; state: Exp;
Initial revision

This script can now be found here:

https://github.com/robohack/rsync-netbsd-cvs

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes
020202020 * 0001000 21212121212121212121212121212121 * 0002000 * 0003000 23232323232323232323232323232323 * 0004000 24242424242424242424242424242424 * 0005000 25252525252525252525252525252525 * 0006000 26262626262626262626262626262626 * 0007000 27272727272727272727272727272727 * 001 28282828282828282828282828282828 * 0011000 29292929292929292929292929292929 * 0012000 2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a * 0013000 2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b * 0014000 2c2c2c2c2c2c2c2c2c2c2c2c2c2c2c2c * 0015000 2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d * 0016000 2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e * 0017000 2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f * 002 Let's try that again with just the one sample data line and blkchk: # grep 28141568000 /var/tmp/ckfile.txt > /var/tmp/ckfile.1 # /var/tmp/blkchk check /dev/da0 /var/tmp/ckfile.1 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.1[ln#0][1024] \x22 # Every byte after 1024 is different, but I'll cut it off at 10: # /var/tmp/blkchk check -v /dev/da0 /var/tmp/ckfile.1 2>&1 | head blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.1[ln#0][1024] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1025] \x28 != /var/tmp/ckfile.1[ln#0][1025] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1026] \x28 != /var/tmp/ckfile.1[ln#0][1026] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1027] \x28 != /var/tmp/ckfile.1[ln#0][1027] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1028] \x28 != /var/tmp/ckfile.1[ln#0][1028] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1029] \x28 != /var/tmp/ckfile.1[ln#0][1029] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1030] \x28 != /var/tmp/ckfile.1[ln#0][1030] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1031] \x28 != /var/tmp/ckfile.1[ln#0][1031] \x22 blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1032] \x28 != 
/var/tmp/ckfile.1[ln#0][1032] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1033] \x28 != /var/tmp/ckfile.1[ln#0][1033] \x22

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
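The mismatch reported above fits the earlier observation about where the wrong data comes from. In the sample data each 512-byte sector holds one repeated byte, the byte expected at +1024 is \x22, and the byte actually read is \x28, i.e. the contents of a sector six further on. A small Python sketch of that arithmetic follows; it assumes (as the dump suggests, though this is not taken from blkchk's source) that the fill byte goes up by one per sector, and `misplaced_source` is a hypothetical helper, not part of blkchk.

```python
SECTOR = 512  # bytes per sector in the test data


def misplaced_source(read_offset, expected, got, sector=SECTOR):
    """Return the offset within the read that the observed data appears to
    have come from.

    read_offset: offset within the read of the first mismatching byte
    expected:    fill byte that should have been there
    got:         fill byte that was actually read

    Assumes every sector is filled with a single byte value and that the
    value increases by exactly one per sector (an assumption about the
    test data, not a documented property of blkchk).
    """
    sectors_ahead = got - expected
    return read_offset + sectors_ahead * sector


if __name__ == "__main__":
    print(misplaced_source(1024, 0x22, 0x28))
```

For the values in the log this gives 4096: the bytes from +1024 onward look as though they were fetched starting one 4096-byte page into the request.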
CVS (was: GCC 10 available for testing etc. in -current.)
At Mon, 19 Apr 2021 11:56:59 +0200, Reinoud Zandijk wrote:
Subject: Re: GCC 10 available for testing etc. in -current.
>
> Same for me; I've never had trouble with CVS trees and they always just work
> and update fine.
>
> Hg on the other hand I had to delete and recheckout my hg tree *again*; i
> had interrupted hg during a merge and oh boy; it was completely shot and
> thought i had tons of local changes that all conflicted; a whopping 500+ files
> or so, thus resorting to just nuking it and rechecking it out. This never
> happened to my CVS tree.
>
> So, no, hg is not mature enough yet to switch over to and don't get me started
> on git!

I don't think all of those problems can be blamed on Hg (or Git).

A very big part of the problem is what Joerg said: "when someone messes up history, that's a non-linear update."

I.e. the conversion from CVS to Hg and/or Git sometimes has to rewrite history to undo a mess-up and clean-up in the CVS repo, and those are things that really mess up Git and Hg users.

And NetBSD developers seem to have a penchant for messing up/in the repository on a regular basis. There were two such events in the past week or two alone. This very update to GCC 10 was involved in one of them.

These same shenanigans also affect CVS, but usually in less ugly ways.

In both cases it's often a matter of timing. If you do your CVS update in between one of these "messes" being made and being cleaned then you'll encounter some problems, but if not then you're often none the wiser to what happened. For the same reason different people will have different experiences with the Hg and Git clones because they do their updates at different times. If you don't clone or fetch history that then has to be rewritten then you won't know that history was rewritten.

The real solution of course is to stop and _prevent_ history from being rewritten, ever. It doesn't matter if this is in CVS, Git, Hg, Fossil, or something else. It's just easier to prevent in Git, and Hg, etc.
Personally I've been using rsync to fetch the whole CVS repository daily for years now, and then I update local checkouts, some automatically and some by hand. It's very efficient, and it gives me a local copy of all the repository history. It's not quite as nice as a git clone, since I can't reliably and efficiently and easily keep my own local branches and do local commits (e.g. in the way you can do very easily and efficiently with Git), but it is still very much better than any other current alternative, including the current Hg and Git and Fossil conversions.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Xen FreeBSD domU block I/O problem begins somewhere between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10)
So I was just reminded that I do still have a Xen server that's still running the 8.99.32 kernel and Xen-4.11.

I had not been testing on it because it still of course has the vnd(4) CHS size bug (and because it's also hosting my $HOME and /usr/src and I don't want to crash it), and I had not remembered until just now that I can work around that by simply padding out the mini-memstick.img file!

And so it works, A-OK, with all other things remaining the same:

# ls -l /dev/xbd0
crw-r- 1 root operator 0x3a Apr 17 04:31 /dev/xbd0
# newfs /dev/xbd0
/dev/xbd0: 20480.0MB (41943040 sectors) block size 32768, fragment size 4096
using 33 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472, 29491712, 30773952, 32056192, 8432, 34620672, 35902912, 37185152, 38467392, 39749632, 41031872
# fsck /dev/xbd0
** /dev/xbd0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2 files, 2 used, 5076797 free (21 frags, 634597 blocks, 0.0% fragmentation)

* FILE SYSTEM IS CLEAN *
#

So the problem is almost certainly in NetBSD-current itself, and somewhere in the vast gulf between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10).

Unfortunately I don't have enough hardware that's Xen-capable and up and running well enough to allow me to do any brute-force bisecting.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Fri, 16 Apr 2021 11:44:08 +0100, David Brownlee wrote: Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0 > > On Fri, 16 Apr 2021 at 08:41, Greg A. Woods wrote: > > > What else is different? What am I missing? What could be different in > > NetBSD current that could cause a FreeBSD domU to (mis)behave this way? > > Could the fault still be in the FreeBSD drivers -- I don't see how as > > the same root problem caused corruption in both HVM and PVH domUs. > > Random data collection thoughts: > > - Can you reproduce it on tiny partitions (to speed up testing) > - If you newfs, shutdown the DOMU, then copy off the data from the > DOM0 does it pass FreeBSD fsck on a native boot > - Alternatively if you newfs an image on a native FreeBSD box and copy > to the DOM0 does the DOMU fsck fail > - Potentially based on results above - does it still happen with a > reboot between the newfs and fsck > - Can you ktrace whichever of newfs or fsck to see exactly what its > writing (tiny *tiny* filesystem for the win here :) So, the root filesystem is clean (from the factory, and verified by at least NetBSD's fsck as OK), but when '-f' is used it is found to be corrupt. Unfortunately I don't have any real FreeBSD machines available (though I could possibly get it installed on my MacBookPro again, but that's probably a multi-day effort at this point). However I've just found a way to reproduce the problem reliably and with a working comparison with a matching-sized memory disk. 
First off attach a tiny 4mb LVM LV to FreeBSD -- that's the smallest LV possible apparently: dom0 # lvm lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert build scratch -wi-a- 250.00g fbsd-test.0 scratch -wi-a- 30.00g fbsd-test.1 scratch -wi-a- 30.00g nbtest.pkg vg0 -wi-a- 30.00g nbtest.root vg0 -wi-a- 30.00g nbtest.swap vg0 -wi-a- 8.00g nbtest.var vg0 -wi-a- 10.00g tinytestvg0 -wi-a- 4.00m dom0 # xl block-attach fbsd-test format=raw, vdev=sdc, access=rw, target=/dev/mapper/vg0-tinytest Now a run of the test on the FreeBSD domU (first showing the kernel seeing the device attachment): # xbd3: 4MB at device/vbd/2080 on xenbusb_front0 xbd3: attaching as da2 xbd3: features: flush xbd3: synchronize cache commands enabled. GEOM: new disk da2 # dd if=/dev/zero of=tinytest.fs count=8192 8192+0 records in 8192+0 records out 4194304 bytes transferred in 0.081106 secs (51713998 bytes/sec) # mdconfig -a -t vnode -f tinytest.fs md0 # newfs -o space -n md0 /dev/md0: 4.0MB (8192 sectors) block size 32768, fragment size 4096 using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes. super-block backups (for fsck_ffs -b #) at: 192, 2304, 4416, 6528 # newfs -o space -n da2 /dev/da2: 4.0MB (8192 sectors) block size 32768, fragment size 4096 using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes. 
super-block backups (for fsck_ffs -b #) at:
192, 2304, 4416, 6528
# dumpfs da2 >da2.dumpfs
# dumpfs md0 >md0.dumpfs
# diff md0.dumpfs da2.dumpfs
1,2c1,2
< magic 19540119 (UFS2) timeFri Apr 16 18:48:55 2021
< superblock location 65536 id [ 6079dc17 1006b3b4 ]
---
> magic 19540119 (UFS2) timeFri Apr 16 18:49:57 2021
> superblock location 65536 id [ 6079dc55 348e5947 ]
27c27
< magic 90255 tell2 timeFri Apr 16 18:48:55 2021
---
> magic 90255 tell2 timeFri Apr 16 18:49:57 2021
40c40
< magic 90255 tell128000 timeFri Apr 16 18:48:55 2021
---
> magic 90255 tell128000 timeFri Apr 16 18:49:57 2021
53c53
< magic 90255 tell23 timeFri Apr 16 18:48:55 2021
---
> magic 90255 tell23 timeFri Apr 16 18:49:57 2021
66c66
< magic 90255 tell338000 timeFri Apr 16 18:48:55 2021
---
> magic 90255 tell338000 timeFri Apr 16 18:49:57 2021
# fsck md0
** /dev/md0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1 files, 1 used, 870 free (14 frags, 107 blocks, 1.6% fragmentation)

* FILE SYSTEM IS CLEAN *
# fsck da2
** /dev/da2
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
ROOT INODE UNALLOCATED
ALLOCATE? [yn] n

* FILE SYSTEM MARKED DIRTY *

So I ktraced the fsck_ufs run, and though I haven't looked at it with a fine-tooth comb and the source open, the only thing that seems a wee bit different about what fsck does is that it opens the device twice, with O_RDONLY, then shortly before it prints the first "** /dev/da2" line it reopens it O_RDWR a third time, closes the second one, and calls dup() on the third one so that it has the same FD# as the second open had. Otherwise it does a
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
So I wrote a little awk script so that I could write 512-byte blocks with varying values of bytes. (Awk is the only decent programming language on the FreeBSD mini-memstick.img which I could think of that would do something close to what I wanted it to do. I could have combined awk+sh+dd and done things faster, but I had all day to let it run while I worked on some small engine repairs.)

https://github.com/robohack/experiments/blob/master/tblocks.awk

and then I used it to write 30GB to two different LVM LVs, each of identical size, and each exported to the domU, one written on the dom0 and the other written on the domU.

Then I ran a cmp of both drives on each of the dom0 and domU.

On the dom0 side there were no differences. All 30GB of what was written directly in the dom0 to one of the LVs was identical to what was written in the FreeBSD domU to the other LV.

I.e. the FreeBSD domU side seems to be writing reliably through to the disk.

The FreeBSD domU though is _really_ slow at reading with cmp (perhaps not unexpectedly given that it is using stdio to do the read and only managing 4KB requests, at a rate of just under 500 requests per second on each disk). I'm going to send this and go to bed before it finishes, but I'm guessing it's about 2/3 of the way through (it has run for nearly 11,000 seconds), and thus so far there are no differences from the FreeBSD domU's point of view either.

Anyway, what the heck is FreeBSD newfs and/or fsck doing differently!?!?!??

They're both writing and reading the very same raw device(s) that I wrote and read to/from with awk and cmp.

These awk/cmp tests did very sequential operations, and the data are quite uniform and regular; whereas newfs/fsck write/read a much more complex data structure using operations scattered about in the disk.

These tests are also writing then reading enough data to flush through the buffer caches in each dom0 and domU several times over.
The dom0 has only 4GB and the domU has 8GB, but Xen says it's only using under 2GB.

What else is different? What am I missing? What could be different in NetBSD current that could cause a FreeBSD domU to (mis)behave this way?

Could the fault still be in the FreeBSD drivers -- I don't see how as the same root problem caused corruption in both HVM and PVH domUs.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
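The write-then-compare test described in this message can be sketched as follows. This is not the tblocks.awk script linked above, only an illustration of the idea in Python: the fill rule (block number modulo 256) and both helper names are assumptions made for the example, and it is meant to be run against ordinary files rather than raw devices.

```python
import os
import tempfile

BLOCK = 512  # bytes per block, matching the 512-byte blocks used in the test


def write_pattern(path, nblocks, block=BLOCK):
    """Write nblocks blocks, each filled with one byte value.

    The value is the block number modulo 256 (an assumed fill rule).
    """
    with open(path, "wb") as f:
        for n in range(nblocks):
            f.write(bytes([n % 256]) * block)


def first_difference(path_a, path_b, chunk=4096):
    """Return the byte offset of the first difference between two files,
    or None if they are identical (roughly what cmp reports).

    A length mismatch counts as a difference at the end of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            da, db = a.read(chunk), b.read(chunk)
            if da != db:
                for i, (x, y) in enumerate(zip(da, db)):
                    if x != y:
                        return offset + i
                return offset + min(len(da), len(db))
            if not da:
                return None
            offset += len(da)


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    one, two = os.path.join(d, "one.img"), os.path.join(d, "two.img")
    write_pattern(one, 64)
    write_pattern(two, 64)
    print(first_difference(one, two))
```

A test like this is strictly sequential, which is the limitation noted above: it says nothing about the scattered reads and writes that newfs and fsck issue.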
Re: running xen on current
At Thu, 15 Apr 2021 13:02:54 +0200, Manuel Bouyer wrote:
Subject: Re: running xen on current
>
> AFAIK EFI is not yet supported by Xen (maybe this is supported by 4.15,
> I've not had a chance to try yet). I have it running on fairly recent
> Dell servers (in BIOS mode)

My Dell servers, even the newer PE-R510, are much older I think :-)

They run -current (2021-03-10) quite well (except for PR# 54969 -- I have to remember to unmount my larger filesystems manually before any reboot unless I want to risk loss and/or wait a long time for fscks -- I haven't turned on '-o log' for them yet as I wanted to measure its performance impact).

My XEN3_DOM0 kernel is somewhat customized, but not in any way that should affect the hardware support or Xen -- of interest might be iscsi support and VND_COMPRESSION, but I haven't tried testing either yet.

I did read about the unified EFI image support in Xen 4.15 and I was thinking of trying it on my old MacBookPro -- but I would also want X11 to work on it too, and even FreeBSD's Xserver wasn't working on it last summer, so I went back to MacOS in order to be able to use it for web and such as well as remote access.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
atures: flush dev.xbd.0.ring_pages: 1 dev.xbd.0.max_request_size: 65536 dev.xbd.0.max_request_segments: 17 dev.xbd.0.max_requests: 32 dev.xbd.0.%parent: xenbusb_front0 dev.xbd.0.%pnpinfo: dev.xbd.0.%location: dev.xbd.0.%driver: xbd dev.xbd.0.%desc: Virtual Block Device For reference the bug behaviour remains the same (at least for this simplest quick and easy test): # newfs /dev/da0 /dev/da0: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096 using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes. super-block backups (for fsck_ffs -b #) at: 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472, 29491712, 30773952, 32056192, 8432, 34620672, 35902912, 37185152, 38467392, 39749632, 41031872, 42314112, 43596352, 44878592, 46160832, 47443072, 48725312, 50007552, 51289792, 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 60265472, 61547712, 62829952 # fsck /dev/da0 ** /dev/da0 ** Last Mounted on ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups CG 0: BAD CHECK-HASH 0x49168424 vs 0xe610ac1b SUMMARY INFORMATION BAD SALVAGE? [yn] n BLK(S) MISSING IN BIT MAPS SALVAGE? 
[yn] n CG 1: BAD CHECK-HASH 0xfa76fceb vs 0xb9e90a55 CG 2: BAD CHECK-HASH 0x41f444c vs 0x5efb290e CG 3: BAD CHECK-HASH 0xad63fe7e vs 0x7ab3861f CG 4: BAD CHECK-HASH 0xfd2043f3 vs 0xadb781f4 CG 5: BAD CHECK-HASH 0x545cf9c1 vs 0xcec5661e CG 6: BAD CHECK-HASH 0xaa354166 vs 0x7dd269d3 CG 7: BAD CHECK-HASH 0x349fb54 vs 0x3078e065 CG 8: BAD CHECK-HASH 0xab23a7c vs 0xc8aa7e98 CG 9: BAD CHECK-HASH 0xa3ce804e vs 0x205a6b0d CG 10: BAD CHECK-HASH 0x5da738e9 vs 0x604d5ecf CG 11: BAD CHECK-HASH 0xf4db82db vs 0xfef11ffc CG 12: BAD CHECK-HASH 0xa4983f56 vs 0xc7e701c8 CG 13: BAD CHECK-HASH 0xde48564 vs 0x42072fba CG 14: BAD CHECK-HASH 0xf38d3dc3 vs 0xad98cf7b CG 15: BAD CHECK-HASH 0x5af187f1 vs 0xbacadeb1 CG 16: BAD CHECK-HASH 0xe07abf93 vs 0xe4ca225 CG 17: BAD CHECK-HASH 0x490605a1 vs 0xe2917802 CG 18: BAD CHECK-HASH 0xb76fbd06 vs 0xa895abc CG 19: BAD CHECK-HASH 0x1e130734 vs 0x6a8bc135 CG 20: BAD CHECK-HASH 0x4e50bab9 vs 0x44719a4a CG 21: BAD CHECK-HASH 0xe72c008b vs 0xadb0c6e9 CG 22: BAD CHECK-HASH 0x1945b82c vs 0x3aeca102 CG 23: BAD CHECK-HASH 0xb039021e vs 0xb99f957d CG 24: BAD CHECK-HASH 0xb9c2c336 vs 0xd384be85 CG 25: BAD CHECK-HASH 0x10be7904 vs 0x649e2abf CG 26: BAD CHECK-HASH 0xeed7c1a3 vs 0x95f7 CG 27: BAD CHECK-HASH 0x47ab7b91 vs 0x3fb02d8b CG 28: BAD CHECK-HASH 0x17e8c61c vs 0xa2b4ca67 CG 29: BAD CHECK-HASH 0xbe947c2e vs 0x65972e04 CG 30: BAD CHECK-HASH 0x40fdc489 vs 0x4219223f CG 31: BAD CHECK-HASH 0xe9817ebb vs 0x36eb9a37 CG 32: BAD CHECK-HASH 0x3007c2bc vs 0xd1916e1d CG 33: BAD CHECK-HASH 0x997b788e vs 0x5204f64d CG 34: BAD CHECK-HASH 0x6712c029 vs 0xe291bcf0 CG 35: BAD CHECK-HASH 0xce6e7a1b vs 0x136ff032 CG 36: BAD CHECK-HASH 0x9e2dc796 vs 0x78ea85c8 CG 37: BAD CHECK-HASH 0x37517da4 vs 0x40c2cf31 CG 38: BAD CHECK-HASH 0xc938c503 vs 0x9b844ab6 CG 39: BAD CHECK-HASH 0x60447f31 vs 0x23129481 CG 40: BAD CHECK-HASH 0x69bfbe19 vs 0xa81f5e9 CG 41: BAD CHECK-HASH 0xc0c3042b vs 0xbd37ebd1 CG 42: BAD CHECK-HASH 0x3eaabc8c vs 0xfadfd8d1 CG 43: BAD CHECK-HASH 0x97d606be vs 
0xf41513bc
CG 44: BAD CHECK-HASH 0xc795bb33 vs 0xad4e6069
CG 45: BAD CHECK-HASH 0x6ee90101 vs 0xbeab94a9
CG 46: BAD CHECK-HASH 0x9080b9a6 vs 0x2688acd1
CG 47: BAD CHECK-HASH 0x39fc0394 vs 0xb5a37e85
CG 48: BAD CHECK-HASH 0x83773bf6 vs 0xd779cc90
CG 49: BAD CHECK-HASH 0xe0d3fd3c vs 0xb8083ca
2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* FILE SYSTEM MARKED DIRTY *

* PLEASE RERUN FSCK *

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Wed, 14 Apr 2021 19:53:47 +0200, Jaromír Doleček wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> You can test if this is the problem by disabling the feature in
> negotiation in NetBSD xbdback.c - comment out the code which sets
> feature-max-indirect-segments in xbdback_backend_changed(). With the
> feature disabled, FreeBSD DomU should not use indirect segments.

Ah, yes, thanks! I should have thought of that. That's especially useful since on the client side it's a read-only flag:

# sysctl -w hw.xbd.xbd_enable_indirect=0
sysctl: oid 'hw.xbd.xbd_enable_indirect' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf

Apparently in the Linux implementation the number of indirect segments used by a domU can be tuned at boot time, and that appears to be done by setting a driver option on the guest kernel command line.

When I first read that it didn't make so much sense to me to be giving this kind of control to the domU.

Perhaps it would be better to make this a tuneable in xl.cfg(5) such that it can be tuned on a per-guest basis. Then setting it to zero for a given guest would not advertise the feature at all.

I've some other things to do before I can reboot -- I'll report as soon as that's done.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Tue, 13 Apr 2021 18:20:39 -0700, "Greg A. Woods" wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> So "17" seems an odd number, but it is apparently because of "Need to
> alloc one extra page to account for possible mapping offset".

Nope, changing that to 16 didn't make any difference.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Sun, 11 Apr 2021 13:55:36 -0700, "Greg A. Woods" wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
> xbd(4) with a new filesystem created on it, is impossible.

So, having run out of "easy" ideas, and working under the assumption that this must be a problem in NetBSD-current dom0 (i.e. not likely in Xen or Xen tools) I've been scanning through changes and this one, so far, is one that would seem to me to have at least some tiny possibility of being the root cause.

RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/xen/xen/xbdback_xenbus.c,v
revision 1.86
date: 2020-04-21 06:56:18 -0700; author: jdolecek; state: Exp; lines: +175 -47; commitid: 26JkIx2V3sGnZf5C;
add support for indirect segments, which makes it possible to pass up to MAXPHYS (implementation limit, interface allows more) using single request

request using indirect segment requires 1 extra copy hypercall per request, but saves 2 shared memory hypercalls (map_grant/unmap_grant), so should be net performance boost due to less TLB flushing

this also effectively doubles disk queue size for xbd(4)

I don't see anything obviously glaringly wrong, and of course this is working A-OK on my same machines with NetBSD-5 and a NetBSD-current (and originally somewhat older NetBSD-8.99) domUs.

However I'm really not very familiar with this code and the specs for what it should be doing so I'm unlikely to be able to spot anything that's missing.
I did read the following, which mostly reminded me to look in xenstore's db to see what feature-max-indirect-segments is set to by default:

https://xenproject.org/2013/08/07/indirect-descriptors-for-xen-pv-disks/

Here's what is stored for a file-backed device:

backend = ""
 vbd = ""
  3 = ""
   768 = ""
    frontend = "/local/domain/3/device/vbd/768"
    params = "/build/images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img"
    script = "/etc/xen/scripts/block"
    frontend-id = "3"
    online = "1"
    removable = "0"
    bootable = "1"
    state = "4"
    dev = "hda"
    type = "phy"
    mode = "r"
    device-type = "disk"
    discard-enable = "0"
    vnd = "/dev/vnd0d"
    physical-device = "3587"
    hotplug-status = "connected"
    sectors = "792576"
    info = "4"
    sector-size = "512"
    feature-flush-cache = "1"
    feature-max-indirect-segments = "17"

Here's what's stored for an LVM-LV backed vbd:

162 = ""
 2048 = ""
  frontend = "/local/domain/162/device/vbd/2048"
  params = "/dev/mapper/vg1-fbsd--test.0"
  script = "/etc/xen/scripts/block"
  frontend-id = "162"
  online = "1"
  removable = "0"
  bootable = "1"
  state = "4"
  dev = "sda"
  type = "phy"
  mode = "r"
  device-type = "disk"
  discard-enable = "0"
  physical-device = "43285"
  hotplug-status = "connected"
  sectors = "83886080"
  info = "4"
  sector-size = "512"
  feature-flush-cache = "1"
  feature-max-indirect-segments = "17"

So "17" seems an odd number, but it is apparently because of "Need to alloc one extra page to account for possible mapping offset". It is currently the maximum for indirect-segments, and it's hard-coded. (Linux apparently has a max of 256, and the Linux blkfront defaults to only using 32.) Maybe it should be "16", so matching max_request_size?

I did take a quick gander at the related code in FreeBSD (both the domU code that's talking to this code in NetBSD, and the dom0 code that would be used if dom0 was running FreeBSD), and besides seeing that it is quite different, I also don't see anything obviously wrong or incompatible there either.
(I do note that the FreeBSD equivalent to xbdback(4) has a major advantage of being able to directly access files, i.e. without the need for vnd(4). Not quite as exciting as maybe full 9pfs mounts through to domUs would be, but still pretty neat!)

FreeBSD's equivalent to xbdback(4) (i.e. sys/dev/xen/blkback/blkback.c) doesn't seem to mention "feature-max-indirect-segments", so apparently they don't offer it yet, though it does mention "feature-flush-cache". However their front-end code does detect it and seems to make use of it, and has done for some 6 years now according to "git blame" (with no recent fixes beyond fixing a memory leak on their end).

Here we see it live from FreeBSD's sysctl output, thus my concern that this feature may be the source of the problem:

hw.xbd.xbd_enable_indirect: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
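The "17" can be sanity-checked with simple page arithmetic. The sketch below assumes 4096-byte pages (an assumption, though it is consistent with a 65536-byte max_request_size yielding 16 full pages) and uses a hypothetical helper, pages_spanned(), that is not taken from either kernel. It only shows why a maximum-sized buffer that does not start on a page boundary touches one more page than an aligned one.

```python
PAGE_SIZE = 4096          # assumed x86 page size
MAX_REQUEST_SIZE = 65536  # dev.xbd.0.max_request_size from the sysctl output


def pages_spanned(offset_in_page: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Count the pages touched by a buffer that starts offset_in_page bytes into a page."""
    if length <= 0:
        return 0
    first = offset_in_page // page_size
    last = (offset_in_page + length - 1) // page_size
    return last - first + 1


# A page-aligned 64 KiB buffer versus one that starts 512 bytes into a page.
aligned = pages_spanned(0, MAX_REQUEST_SIZE)
unaligned = pages_spanned(512, MAX_REQUEST_SIZE)
print(aligned, unaligned)
```

Under these assumptions the aligned case needs 16 segments and any unaligned case needs 17, which would explain the "one extra page" comment without implying that either value is wrong.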
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
ev/da1 ** /dev/da1 ** Last Mounted on /mnt ** Phase 1 - Check Blocks and Sizes PARTIALLY TRUNCATED INODE I=325128 SALVAGE? [yn] n PARTIALLY TRUNCATED INODE I=877864 SALVAGE? [yn] n PARTIALLY TRUNCATED INODE I=877866 SALVAGE? [yn] n PARTIALLY TRUNCATED INODE I=877879 SALVAGE? [yn] ^C * FILE SYSTEM MARKED DIRTY * Back on the NetBSD side: # xl block-detach fbsd-test 2064 # fsck /dev/mapper/rscratch-fbsd--test.0 ** /dev/mapper/rscratch-fbsd--test.0 ** Last Mounted on /mnt ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups FREE BLK COUNT(S) WRONG IN SUPERBLK SALVAGE? [yn] n SUMMARY INFORMATION BAD SALVAGE? [yn] n BLK(S) MISSING IN BIT MAPS SALVAGE? [yn] n 12076 files, 91642 used, 7647797 free (293 frags, 955938 blocks, 0.0% fragmentation) * UNRESOLVED INCONSISTENCIES REMAIN * -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpIIZo7QXMjA.pgp Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Sun, 11 Apr 2021 13:23:31 -0700, "Greg A. Woods" wrote: Subject: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0 > > In fact it only seems to be fsck that complains, possibly along > with any attempt to write to a filesystem, that causes problems. Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD xbd(4) with a new filesystem created on it, is impossible. I was able to write 500MB of zeros to the LVM LV backed disk, overwriting the copy of the .img file I had put there, and only see 500MB of zeros back on the NetBSD side, so writing directly to the raw /dev/da1 on FreeBSD seems to write data without problem. However then the following happens when I try to use a new FS there: # newfs /dev/da1 /dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096 using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes. super-block backups (for fsck_ffs -b #) at: 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472, 29491712, 30773952, 32056192, 8432, 34620672, 35902912, 37185152, 38467392, 39749632, 41031872, 42314112, 43596352, 44878592, 46160832, 47443072, 48725312, 50007552, 51289792, 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 60265472, 61547712, 62829952 # mount /dev/da1 /mnt # mount /dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only) devfs on /dev (devfs, local, multilabel) tmpfs on /var (tmpfs, local) tmpfs on /tmp (tmpfs, local) /dev/da1 on /mnt (ufs, local) # df Filesystem 512-blocks UsedAvail Capacity Mounted on /dev/ufs/FreeBSD_Install 782968 737016 -16680 102%/ devfs 2 20 100%/dev tmpfs 6553660864928 1%/var tmpfs 40960 840952 0%/tmp /dev/da1 60901560 16 56029424 0%/mnt # cp /COPYRIGHT /mnt UFS /dev/da1 (/mnt) cylinder checksum failed: cg 0, cgp: 0xe66de1a4 != bp: 0xf433acbc UFS /dev/da1 (/mnt) cylinder checksum 
failed: cg 1, cgp: 0x89ba8532 != bp: 0x3491fbd0 UFS /dev/da1 (/mnt) cylinder checksum failed: cg 3, cgp: 0xdeaf87a7 != bp: 0x3a071e86 UFS /dev/da1 (/mnt) cylinder checksum failed: cg 7, cgp: 0x7085828d != bp: 0xaaae0f19 UFS /dev/da1 (/mnt) cylinder checksum failed: cg 15, cgp: 0x293dfe28 != bp: 0xe2f25f8b UFS /dev/da1 (/mnt) cylinder checksum failed: cg 31, cgp: 0x9a4d0762 != bp: 0x4119c6e [[ and on and on ]] UFS /dev/da1 (/mnt) cylinder checksum failed: cg 49, cgp: 0x931f84e5 != bp: 0xb48687df /mnt: create/symlink failed, no inodes free cp: /mnt/COPYRIGHT: No space left on device # Apr 11 20:37:28 syslogd: last message repeated 4 times Apr 11 20:37:59 kernel: pid 713 (cp), uid 0 inumber 2 on /mnt: out of inodes # df -i Filesystem 512-blocks UsedAvail Capacity iused ifree %iused Mounted on /dev/ufs/FreeBSD_Install 782968 737016 -16680 102% 12129 285 98% / devfs 2 20 100% 0 0 100% /dev tmpfs 6553660864928 1% 75 114613 0% /var tmpfs 40960 840952 0% 6 71674 0% /tmp /dev/da1 60901560 16 56029424 0% 2 4012796 0% /mnt NetBSD can actually make some sense of this FreeBSD filesystem though: # fsck -n /dev/mapper/rscratch-fbsd--test.0 ** /dev/mapper/rscratch-fbsd--test.0 (NO WRITE) Invalid quota magic number CONTINUE? yes ** File system is already clean ** Last Mounted on /mnt ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups SUMMARY INFORMATION BAD SALVAGE? no BLK(S) MISSING IN BIT MAPS SALVAGE? no ** Phase 6 - Check Quotas CLEAR SUPERBLOCK QUOTA FLAG? no 2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation) * UNRESOLVED INCONSISTENCIES REMAIN * I'm not sure if those problems are to be expected with a FreeBSD-created filesystem or not. Probably the "Invalid quota magic number" is normal, but I'm not sure about the "BLK(s) MISSING IN BIT MAPS". Have FreeBSD and NetBSD FFS diverged this much? 
I won't try to mount it, especially not from the dom0. Dumpfs shows the following: file system: /dev/mapper/rscratch-fbsd--test.0 format FFSv2 endian little-endian location 65536 (-b 128) magic 19540119timeSun Apr 11 13:46:15 2021 superblock location 65536 id [ 60735d32 358197c4 ] cylgrp dynamic inodes FFSv2 sblock FFSv2 fslevel 5 nbfree 951584 ndir2 nifree 4012796 nffree 21 ncg 50 size7864320 blocks 7612695 bsize 32768 shift 15 m
one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
So, with the vnd(4) issue more or less sorted, there seems to be one major mystery remaining w.r.t. whatever has gone wrong with the ability of NetBSD-current XEN3_DOM0 to host FreeBSD domUs.

I still can't create a clean filesystem on a writeable disk. The "newfs" runs fine, but a subsequent "fsck" finds errors and cannot fix them (though the first run does change one or two things).

I can't even get a clean fsck of the running system's root FS:

(the "ada0: disk error" after I hit ^C is because the underlying disk (vnd0d) is exported read-only to the domU)

# fsck -v /dev/ufs/FreeBSD_Install
start / wait fsck_ufs /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install
SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n
ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n
** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n
PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe
* FILE SYSTEM MARKED DIRTY *
#

Most mysteriously this filesystem is in use as the root FS and all the files in it can be found and read! Presumably they are all intact too -- no programs have failed or behaved mysteriously (except fsck) and all the human readable files I've looked at (e.g. manual pages) seem fine.

In fact it only seems to be fsck that complains, possibly along with any attempt to write to a filesystem, that causes problems. (I believe writing to a filesystem appears to corrupt it but that is only according to fsck. I do believe there was an eventual crash of a system that had been running with active filesystems, but I have not got far enough again since to reproduce this, due to the fsck problem.)
# mount /dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only) devfs on /dev (devfs, local, multilabel) tmpfs on /var (tmpfs, local) tmpfs on /tmp (tmpfs, local) # df Filesystem 512-blocks Used Avail Capacity Mounted on /dev/ufs/FreeBSD_Install 782968 737016 -16680 102%/ devfs 2 2 0 100%/dev tmpfs 65536232 65304 0%/var tmpfs 40960 8 40952 0%/tmp # time -l sh -c 'find / -type f | xargs cat > /dev/null ' 38.58 real 1.36 user18.30 sys 4872 maximum resident set size 13 average shared memory size 5 average unshared data size 215 average unshared stack size 1906 page reclaims 0 page faults 0 swaps 14024 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 12348 voluntary context switches 33 involuntary context switches In fact I can put a copy of the FreeBSD img file into an LVM LV, attach it to the running FreeBSD domU, mount it (without an FSCK, since the FreeBSD_Install filesystem comes clean from the factory), then do "diff -r -X /mnt -X /dev / /mnt" and find only the expected differences. So, what could be different about how fsck reads v.s. the kernel itself? If indeed writing to filesystem corrupts it, how and why? It seems NetBSD can make sense of the BSD label inside the FreeBSD mini-memstick.img file, e.g. when accessed through vnd(4), but it can't seem to make sense of the filesystem(s) inside (which I guess might be expected?): # file -s /dev/rvnd0f /dev/rvnd0f: DOS/MBR boot sector, BSD disklabel # disklabel vnd0 # /dev/rvnd0: type: vnd disk: vnd label: fictitious flags: bytes/sector: 512 sectors/track: 32 tracks/cylinder: 64 sectors/cylinder: 2048 cylinders: 387 total sectors: 791121 rpm: 3600 interleave: 1 trackskew: 0 cylinderskew: 0 headswitch: 0 # microseconds track-to-track seek: 0 # microseconds drivedata: 0 6 partitions: #sizeoffset fstype [fsize bsize cpg/sgs] d:791121 0 unused 0 0# (Cyl. 0 -386*) e: 1600 1unknown # (Cyl. 0*- 0*) f:789520 1601 4.2BSD 0 0 0 # (Cyl. 
0*-386*) disklabel: boot block size 0 disklabel: super block size 0 # fsck -n /dev/vnd0f ** /dev/rvnd0f (NO WRITE) BAD SUPER BLOCK: CAN'T FIND SUPERBLOCK /dev/rvnd0f: CANNOT FIGURE OUT SECTORS PER CYLINDER -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpMSgeO6z7I3.pgp Description: OpenPGP Digital Signature
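One way to probe the question of how fsck's reads might differ from the kernel's is to read the same region of a device with different request sizes and compare the results. The following is only a sketch: the helper names are hypothetical, the device path in the usage note is just an example, and it is not the test program used in this thread. Raw disk devices generally require offsets and sizes that are multiples of the sector size, so the defaults below assume 512-byte sectors.

```python
import os


def read_in_chunks(fd: int, offset: int, length: int, chunk: int) -> bytes:
    """Read `length` bytes starting at `offset` using requests of at most `chunk` bytes."""
    parts = []
    pos = offset
    end = offset + length
    while pos < end:
        data = os.pread(fd, min(chunk, end - pos), pos)
        if not data:  # end of file or device reached early
            break
        parts.append(data)
        pos += len(data)
    return b"".join(parts)


def first_difference(path: str, offset: int, length: int,
                     small: int = 512, large: int = 65536):
    """Return the first byte index (relative to `offset`) at which small-request
    and large-request reads of the same region differ, or None if they match."""
    fd = os.open(path, os.O_RDONLY)
    try:
        a = read_in_chunks(fd, offset, length, small)
        b = read_in_chunks(fd, offset, length, large)
    finally:
        os.close(fd)
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one of the reads came back shorter
```

Inside the domU something like first_difference("/dev/da1", 0, 1 << 20) would show whether large requests return different data than sector-sized ones. On a healthy device the result should be None.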
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
On the other hand NetBSD's own .img files work OK. However interestingly there's a small, but apparently insignificant (because it works OK) difference between how fdisk sees the disk image and the vnd0 device: # fdisk -F images/NetBSD-9.99.81-amd64-live.img Disk: images/NetBSD-9.99.81-amd64-live.img NetBSD disklabel disk geometry: cylinders: 972, heads: 255, sectors/track: 63 (16065 sectors/cylinder) total sectors: 15624192, bytes/sector: 512 BIOS disk geometry: cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder) total sectors: 15624192 Partitions aligned to 2048 sector boundaries, offset 2048 Partition table: 0: NetBSD (sysid 169) start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active 1: 2: 3: Bootselector disabled. First active partition: 0 Drive serial number: 0 (0x) # vndconfig -cv vnd0 images/NetBSD-9.99.81-amd64-live.img /dev/rvnd0: 7999586304 bytes on images/NetBSD-9.99.81-amd64-live.img # fdisk vnd0 Disk: /dev/rvnd0 NetBSD disklabel disk geometry: cylinders: 7629, heads: 64, sectors/track: 32 (2048 sectors/cylinder) total sectors: 15624192, bytes/sector: 512 BIOS disk geometry: cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder) total sectors: 15624192 Partitions aligned to 2048 sector boundaries, offset 2048 Partition table: 0: NetBSD (sysid 169) start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active 1: 2: 3: Bootselector disabled. First active partition: 0 Drive serial number: 0 (0x) 21:10 [1.1496] # disklabel vnd0 # /dev/rvnd0: type: ESDI disk: image label: flags: bytes/sector: 512 sectors/track: 32 tracks/cylinder: 64 sectors/cylinder: 2048 cylinders: 7629 total sectors: 15624192 rpm: 3600 interleave: 1 trackskew: 0 cylinderskew: 0 headswitch: 0 # microseconds track-to-track seek: 0 # microseconds drivedata: 0 8 partitions: #sizeoffset fstype [fsize bsize cpg/sgs] a: 15622144 2048 4.2BSD 1024 819216 # (Cyl. 1 - 7628) c: 15622144 2048 unused 0 0# (Cyl. 1 - 7628) d: 15624192 0 unused 0 0# (Cyl. 
0 - 7628) # disklabel images/NetBSD-9.99.81-amd64-live.img # images/NetBSD-9.99.81-amd64-live.img: type: ESDI disk: image label: flags: bytes/sector: 512 sectors/track: 32 tracks/cylinder: 64 sectors/cylinder: 2048 cylinders: 7629 total sectors: 15624192 rpm: 3600 interleave: 1 trackskew: 0 cylinderskew: 0 headswitch: 0 # microseconds track-to-track seek: 0 # microseconds drivedata: 0 8 partitions: #sizeoffset fstype [fsize bsize cpg/sgs] a: 15622144 2048 4.2BSD 1024 819216 # (Cyl. 1 - 7628) c: 15622144 2048 unused 0 0# (Cyl. 1 - 7628) d: 15624192 0 unused 0 0# (Cyl. 0 - 7628) From inside the NetBSD live image: [ 1.4412586] xbd4 at xenbus0 id 4: Xen Virtual Block Device Interface [ 1.4422594] xbd4: using event channel 20 [ 1.7112647] entropy: xbd4 attached as an entropy source (collecting without estimation) [ 1.7112647] xbd4: 7629 MB, 512 bytes/sect x 15624192 sectors [ 1.7112647] xbd4: backend features 0x9 # df Filesystem 1K-blocks UsedAvail %Cap Mounted on /dev/xbd4a7562414 4699114 2485180 65% / ptyfs 110 100% /dev/pts # fdisk xbd4 Disk: /dev/rxbd4 NetBSD disklabel disk geometry: cylinders: 7629, heads: 1, sectors/track: 2048 (2048 sectors/cylinder) total sectors: 15624192, bytes/sector: 512 BIOS disk geometry: cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder) total sectors: 15624192 Partitions aligned to 2048 sector boundaries, offset 2048 Partition table: 0: NetBSD (sysid 169) start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active 1: 2: 3: Bootselector disabled. First active partition: 0 Drive serial number: 0 (0x) The NetBSD live.img root filesystem seems fine and clean: # fsck -n /dev/rxbd4a ** /dev/rxbd4a (NO WRITE) ** Last Mounted on / ** Root file system ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups 32740 files, 2349557 used, 1431650 free (538 frags, 178889 blocks, 0.0% fragmentation) -- Greg A. 
Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgphQf8ZEW6Z4.pgp Description: OpenPGP Digital Signature
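The differing geometries shown above can be checked with a little arithmetic. The sketch below uses only numbers from the fdisk, vndconfig and disklabel output, and the helper geometry_capacity() is hypothetical. It illustrates why the mismatch is probably harmless: the two fictitious disklabel geometries multiply out to the same sector count as the image size, and the partition is addressed by absolute sector numbers in any case.

```python
SECTOR_SIZE = 512
IMAGE_BYTES = 7999586304  # size reported by "vndconfig -cv"

total_sectors = IMAGE_BYTES // SECTOR_SIZE


def geometry_capacity(cylinders: int, heads: int, sectors_per_track: int) -> int:
    """Sectors addressable with the given cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track


vnd_label = geometry_capacity(7629, 64, 32)    # vnd0 disklabel geometry in the dom0
xbd_label = geometry_capacity(7629, 1, 2048)   # xbd4 disklabel geometry in the domU
bios = geometry_capacity(973, 255, 63)         # BIOS geometry from the MBR

# The MBR partition starts at sector 2048 and is 15622144 sectors long.
partition_end = 2048 + 15622144

print(total_sectors, vnd_label, xbd_label, bios, partition_end)
```

Both disklabel geometries and the end of the partition come to 15624192 sectors, matching the image size exactly. Only the BIOS geometry overshoots, since 973 cylinders of 16065 sectors round up past the end of the disk.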
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
At Sat, 10 Apr 2021 18:44:32 -0700, Brian Buhrow wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
>
> hello. This must be some kind of regression that's been around a
> while. I'm running a xen dom0 with NetBSD-5.2 and xen-3.3.2, very old,
> but vnd(4) does expose the entire file to the domu's including FreeBSD
> 11 and 12 without any corruption or booting issues. Do you know when
> this trouble began?

I don't know -- I think I've only ever successfully used ISO files, and I think I gave up on some IMG file(s) previously (possibly not just from FreeBSD) without trying to understand why they didn't work.

Have you tried specifically with a recent FreeBSD mini-memstick.img file?

I'm thinking (esp. given what I see from "od -c < /dev/rvnd0d") that what's wrong is the vnd(4) driver is (also?) imposing some mis-interpreted idea about the number of cylinders and heads or something like that, especially given that "fdisk vnd0" is so totally confused about what's in there.

There's a definite pattern of corruption anyway -- I just can't explain it well enough yet.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
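To pin down a pattern of corruption like this, the image file and the device node could be compared sector by sector rather than by eyeballing "od -c" output. The following is a minimal sketch under assumptions: mismatched_sectors() is a hypothetical helper, 512-byte sectors are assumed, and the paths in the usage note are examples only.

```python
def mismatched_sectors(path_a: str, path_b: str,
                       sector_size: int = 512, limit: int = 16):
    """Return up to `limit` sector numbers at which the two files differ."""
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        sector = 0
        while len(bad) < limit:
            a = fa.read(sector_size)
            b = fb.read(sector_size)
            if not a and not b:  # both sources exhausted
                break
            if a != b:           # also catches one source ending early
                bad.append(sector)
            sector += 1
    return bad
```

Run in the dom0 as, for example, mismatched_sectors("FreeBSD-12.2-RELEASE-amd64-mini-memstick.img", "/dev/rvnd0d"), the list of sector numbers might reveal a regular stride or offset that hints at a geometry mis-translation.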
I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
0002000 # dd if=/dev/rvnd0d count=17 msgfmt=quiet| od -c 000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 * 002 \0 \0 \0 \0 \0 \0 \0 \0 \b \0 \0 \0 020 \0 \0 \0 0020020 030 \0 \0 \0 230 005 \0 \0 \0 \0 \0 \0 377 377 377 377 0020040 367 360 p ` \0 \0 \0 007 200 037 \0 027 \0 \0 \0 0020060 \0 @ \0 \0 \0 \b \0 \0 \b \0 \0 \0 005 \0 \0 \0 0020100 \0 \0 \0 \0 < \0 \0 \0 \0 300 377 377 \0 370 377 377 0020120 016 \0 \0 \0 013 \0 \0 \0 004 \0 \0 \0 \0 020 \0 \0 0020140 003 \0 \0 \0 002 \0 \0 \0 \0 \b \0 \0 \0 \0 \0 \0 0020160 \0 \0 \0 \0 \0 020 \0 \0 200 \0 \0 \0 004 \0 \0 \0 0020200 \0 \0 \0 \0 300 220 005 \0 001 \0 \0 \0 \0 \0 \0 \0 0020220 367 360 p ` _ ` A q 230 005 \0 \0 \0 \b \0 \0 0020240 \0 @ \0 \0 \0 \0 \0 \0 300 220 005 \0 300 220 005 \0 0020260 027 \0 \0 \0 001 \0 \0 \0 \0 X \0 \0 0 d 001 \0 0020300 001 \0 \0 \0 377 357 003 \0 375 347 007 \0 016 \0 \0 \0 0020320 \0 001 \0 200 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 0020340 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 * 0021000 In fact the vnd0d device seems to give garbage forever -- it seems to have been completely confused by trying to access a real disk image! As a side note unfortunately even though access to this LVM-backed mini-memstick.img file now seems OK enough to get the install booted and a shell running, access to other FreeBSD xbd(4) devices is still not working from FreeBSD (i.e. a fresh newfs'ed FS appears corrupt to an immediate fsck, without mounting, and even fsck of the mounted root in this IMG fails enormously). # df Filesystem 512-blocks Used Avail Capacity Mounted on /dev/ufs/FreeBSD_Install 782968 737016 -16680 102%/ devfs 2 2 0 100%/dev tmpfs 65536232 65304 0%/var tmpfs 40960 8 40952 0%/tmp # fsck /dev/ufs/FreeBSD_Install ** /dev/ufs/FreeBSD_Install SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n ** Last Mounted on ** Root file system ** Phase 1 - Check Blocks and Sizes PARTIALLY TRUNCATED INODE I=28 SALVAGE? 
[yn] n PARTIALLY TRUNCATED INODE I=112 SALVAGE? [yn] ^Cda0: disk error cmd=write 8145-8152 status: fffe # * FILE SYSTEM MARKED DIRTY * # -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgp0TQ7zS9Hhk.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Wed, 7 Apr 2021 22:47:39 +0200, Martin Husemann wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> When you create a custom setup like that, you will have to replace
> etc/rc.d/entropy with a custom solution (e.g. mounting some flash storage).

No storage means "NO storage.".

> Or you ignore the issue and do the dd at each boot - hopefully not generating
> any strong keys on that machine then (but you would have no good storage
> for those anyway).

Or I don't ignore the issue and instead I fix the code so that it's still possible to get entropy estimates from non-hardware-RNG devices and then things keep working the way they used to, and there's still some possibility of _real_ entropy being used to seed the PRNGs.

From what I've seen here so far I'm far from alone in wanting that ability.

What's most confusing is why there's such animosity and stubborn unwillingness to even consider that the old way of getting some entropy from a few less-than-perfect sources was good enough for many, or even most, of us. It's better than no entropy when there are no "perfect" sources, and that's also a situation that includes many of us.

It doesn't have to be the default.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Wed, 7 Apr 2021 09:52:29 +0200, Martin Husemann wrote: Subject: Re: regarding the changes to kernel entropy gathering > > On Tue, Apr 06, 2021 at 03:12:45PM -0700, Greg A. Woods wrote: > > > Isn't it as simple as: > > > > > > dd bs=32 if=/dev/urandom of=/dev/random > > > > No, that still leaves the question of _when_ to run it. (And, at least > > at the moment, where to put it. /etc/rc.local?) > > Of course not! > > You run it once. Manually. And never again. Nope, sorry, that's not a good enough answer. It doesn't solve the problem of dealing with a lack of mutable storage. A system _MUST_ be able to be booted and with no user intervention be able to (eventually) get to the state where /dev/random and getrandom(2) WILL NOT block, and it _MUST_ be able to do so without the help of any hardware RNG, and without the ability to store (and read) a seed from a file or other storage device. I.e. we _MUST_ be _ABLE_ to choose to use other devices as sources for entropy, even if they are not perfect. We had this, it works fine, we still need it. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpuAM5snajCz.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Tue, 6 Apr 2021 20:21:43 +0200, Martin Husemann wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 10:54:51AM -0700, Greg A. Woods wrote:
> >
> > And the stock implementation has no possibility of ever providing an
> > initial seed at all on its own (unlike previous implementations, and of
> > course unlike what my patch _affords_).
>
> Isn't it as simple as:
>
> 	dd bs=32 if=/dev/urandom of=/dev/random

No, that still leaves the question of _when_ to run it. (And, at least at the moment, where to put it. /etc/rc.local?)

Isn't something like the following better (assuming you choose your devices carefully):

	echo 'rndctl_flags="-t env;-t disk;-t tty"' >> /etc/rc.conf

That's what my patches fix and allow, and this way you don't have to guess when you can safely use /dev/urandom as an entropy seed -- the seeding happens in real time, and only as entropy bits are made available from those given devices.

That can also be done by sysinst, assuming a reasonably well worded question can be answered, and that it might only need to be asked if there are no "rng" type devices already.

Doing this also requires no network access (ever).

It can even be done, ahead of time, for use on immutable systems.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Tue, 6 Apr 2021 12:08:54 +, Taylor R Campbell wrote: Subject: Re: regarding the changes to kernel entropy gathering > > The main issue that hits people is that the traditional mechanism by > which the OS reports a potential security problem with entropy is for > it to make applications silently hang -- and the issue is getting > worse now that getrandom() is more widely used, e.g. in Python when > you do `import multiprocessing'. I think adding a uprintf(9) that the user who started the blocked process (i.e. not just the admin) has a better chance of directly seeing would be one step closer, and should be extremely easy. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpOxcINx3I65.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
4, flags 0x70, func=0x8083f151, ver=427 kern.entropy.gather (1.1260.1264): CTLTYPE_INT, size 4, flags 0x70, func=0x8083dd4c, ver=428 kern.entropy.needed (1.1260.1265): CTLTYPE_INT, size 4, flags 0x100, ver=429 kern.entropy.pending (1.1260.1266): CTLTYPE_INT, size 4, flags 0x100, ver=430 kern.entropy.epoch (1.1260.1267): CTLTYPE_INT, size 4, flags 0x100, ver=431 Perhaps function pointer values shouldn't be printed as integers? And there are no text descriptions for some of the kern.entropy values: 17:27 [1.831] # sysctl -d kern.entropy.needed kern.entropy.needed: (no description) 17:27 [1.832] # sysctl -d kern.entropy.pending kern.entropy.pending: (no description) 17:27 [1.833] # sysctl -d kern.entropy.epoch kern.entropy.epoch: (no description) -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgp6vc6Eur6UN.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 15:37:49 -0400, Thor Lancelot Simon wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> >
> > BTW, to me reusing the same entropy on every reboot seems less secure.
>
> Sure. But that's not what the code actually does.
>
> Please, read the code in more depth (or in this case, breadth), then argue
> about it.

Sorry, I was alluding to the idea of sticking the following in /etc/rc.local as the brain-dead way to work around the problem:

	echo -n "" > /dev/random

However I have not yet read and understood enough of the code to know if:

	dd if=/dev/urandom of=/dev/random bs=32 count=1

is any more "secure" -- I'm guessing (hoping?) it depends on exactly when this might be run, and also depends on which, if any, other device sources are enabled for "collecting". If in some rare case none were enabled, or if it were run before any were able to "stir the pool", then I'm guessing it would be no more secure than writing a fixed string.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
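The guess in the last paragraph can be illustrated with a toy model. The sketch below is emphatically not NetBSD's entropy code: ToyPool is a hypothetical stand-in whose state is just a running SHA-256 hash, and boot() mimics running the dd command once after whatever device samples have arrived. It only demonstrates the general point that feeding a deterministic generator its own output adds nothing an observer could not recompute.

```python
import hashlib


class ToyPool:
    """Toy stand-in for an entropy pool; the state is a running SHA-256 hash."""

    def __init__(self) -> None:
        self.state = b"\x00" * 32  # identical at every simulated boot

    def stir(self, sample: bytes) -> None:
        """Mix an input sample into the pool state."""
        self.state = hashlib.sha256(self.state + sample).digest()

    def urandom(self, n: int = 32) -> bytes:
        """Produce output derived from the state, then advance the state."""
        out = hashlib.sha256(b"out" + self.state).digest()[:n]
        self.state = hashlib.sha256(b"next" + self.state).digest()
        return out


def boot(samples):
    """Simulate a boot: stir in device samples, self-seed, then emit a 'key'."""
    pool = ToyPool()
    for s in samples:
        pool.stir(s)
    seed = pool.urandom(32)  # the "if=/dev/urandom" half of the dd command
    pool.stir(seed)          # the "of=/dev/random" half
    return pool.urandom(16)


print(boot([]) == boot([]))
print(boot([b"disk irq 1842"]) == boot([b"disk irq 9917"]))
```

In this model two boots with no device input produce the same "key", exactly as writing a fixed string would, while boots that differ in even one stirred-in sample diverge. Whether the real kernel behaves comparably at a given point in the boot sequence is the open question raised above.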
Re: regarding the changes to kernel entropy gathering
"stir" the pot in the first place, then why not just "count" it as "real" entropy and be done with it -- at least then it is obvious when enough entropy has been gathered and the currently implemented algorithms handle things properly and securely and all inside the kernel. I.e. the admin doesn't have to put a "sleep 30" or whatever in front of it and hope that's enough and that it's still not too predictable. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpRUdV5ZZmgF.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 03:02:42 +0200, Joerg Sonnenberger wrote: Subject: Re: regarding the changes to kernel entropy gathering > > Except that's not what the system is doing. It removes the seed file on > boot and creates a new one on shutdown. That's not exactly what the documentation says it does (from rndctl(8)): -L Load saved entropy from file save-file and overwrite it with a seed derived by hashing it together with output from /dev/urandom so that the new seed has at least as much entropy as either the old seed had or the system already has. If interrupted, either the old seed or the new seed will be in place. The code seems to concur. Also the system re-saves the $random_file via /etc/security (unconditionally, i.e. always, but only if $random_file is set). -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpVwYw0Ir4wO.pgp Description: OpenPGP Digital Signature
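The behaviour documented for -L (load the old seed, then replace it with a hash of the old seed and fresh /dev/urandom output, leaving either the old or the new seed in place if interrupted) can be sketched as follows. This is an illustration of the documented property only: refresh_seed() is a hypothetical helper, SHA-256 and the temporary-file name are assumptions, and rndctl's real on-disk format and hash are not reproduced here.

```python
import hashlib
import os


def refresh_seed(path: str, nbytes: int = 32) -> bytes:
    """Return the old seed and replace the file with hash(old seed || urandom)."""
    with open(path, "rb") as f:
        old = f.read()
    # The new seed depends on both inputs, so it is at least as unpredictable
    # as whichever of the two was harder to guess.
    new = hashlib.sha256(old + os.urandom(nbytes)).digest()
    tmp = path + ".tmp"  # assumed scratch name next to the seed file
    with open(tmp, "wb") as f:
        f.write(new)
        f.flush()
        os.fsync(f.fileno())
    # rename(2) is atomic: a crash leaves the old or the new seed, never a mix.
    os.replace(tmp, path)
    return old
```

The caller would feed the returned old seed to the kernel. The key design point is that the seed file is never removed and re-created in separate steps, which matches the manual page text quoted above rather than the "removes the seed file on boot" description.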
Re: how do I mount a read-only filesystem from the "root device" prompt?
At Mon, 5 Apr 2021 07:04:32 - (UTC), mlel...@serpens.de (Michael van Elst) wrote: Subject: Re: how do I mount a read-only filesystem from the "root device" prompt? > > Someone would need to write code to "upgrade" vnodes. I doubt that's > trivial. Indeed -- I've underestimated the complexity of such low-level changes in the past -- they can snowball out of control! > Fortunately it is not necessary. If the parent device is read-only, > no "upgrade" will help to make it read-write. So you open read-write > or fail back to read-only when necessary. An attempt to open a wedge > read-write on a read-only opened parent device then has to fail. Yes, this makes sense. > I'm testing a patch for that... Excellent! Thank you very much! -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpXg3AxNRSEV.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 16:13:55 +1200, Lloyd Parkes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The current implementation prints out a message whenever it blocks a
> process that wants randomness, which immediately makes this
> implementation superior to all others that I have ever seen. The
> number of times I've logged into systems that have stalled on boot and
> made them finish booting by running "ls -lR /" over the past 20 years
> are too many to count. I don't know if I just needed to wait longer
> for the boot to finish, or if generating entropy was the fix, and I
> will never know. This is nuts.

Indeed!

> We can use the message to point the system administrator to a manual
> page that tells them what to do, and by "tells them what to do", I
> mean in plain simple language, right at the top of the page, without
> scaring them.

Excellent idea! :-)

However I have been wondering if sending the message just to the console, and logging it, say in /var/log/kern, is sufficient.

It still took me a very long time to find the existing new message because I don't hang out on the console -- this is a VM, after all, and it's running in a city almost exactly 4200km driving distance from me too!

As-is I feel I hang out on the console more often than the average admin who doesn't use a physical console, and of course infinitely more often than any user who doesn't admin his own server.

I have added the following comment to the kernel to remind me to think more about this, as a uprintf(9) at the same time would pop right up on the actual user's session too:

--- kern_entropy.c.~1.30.~ 2021-03-07 17:23:05.0 -0800
+++ kern_entropy.c 2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@
 	/* Wait for some entropy to come in and try again. */
 	KASSERT(E->stage >= ENTROPY_WARM);
-	printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+	printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
 	    curproc->p_pid, curproc->p_comm);
 	if (ISSET(flags, ENTROPY_SIG)) {

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If I understood it properly, there's no need for such a knob.
> echo 0123456789abcdef0123456789abcdef > /dev/random
>
> will get you back to the state we had in netbsd-9, with (pseudo-)randomness
> collected from devices.

Well, no, not quite so much randomness. Definitely pseudo though!

My patch on the other hand can at least inject some real randomness into
the entropy pool, even if it is observable or influenceable by nefarious
dudes who might be hiding out in my garage.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 18:47:23 -0700, Brian Buhrow wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Hello. As I understand it, Greg ran into this problem on a xen domu.
> In checking my NetBSD-9 system running as a domu under xen-4.14.1,
> there is no rdrand or rdseed feature exposed to domu's by xen. This
> observation is confirmed by looking at the xen command line reference
> page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

The problem in the domU was really just the very tip of the iceberg.

The dom0 exhibits the exact same problem and for the same reasons.

> and NetBSD doesn't trust the random sources provided by the xennet(4)
> and xbd(4) drivers. Therefore, the only solution to get randomness
> working for the first time on a newly installed domu is to write 32
> bytes to /dev/random.

It's not that the xbd(4) devices, etc. are not trusted as entropy
sources -- the new entropy system doesn't trust anything, real or
virtual, despite the documentation saying that it can be made to do so.

My patch fixes that bug. It was very obvious once I understood the root
of the issue. As a result my patch fixes the bug for Xen dom0 and domU.

Writing randomness to /dev/random is _NOT_ a general solution (though it
could be IFF it can be reliably taken from /dev/urandom AND IFF the rest
of the system and documentation is completely and adequately fixed to
match the new regime).

What perturbs me the most and makes me rather angry is that the rest of
the system, and the system documentation, continued to lie and mislead
me for days (and it didn't help that nobody who knew this was pointing
helpfully and clearly at the root of the problem).

So, my patch ALSO restores the kernel's behaviour to match the
documentation and tools (specifically rndctl).

That the core of it is just a two-line patch makes this fix extremely
satisfying.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 23:09:18 +0000, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If you know this (and this is something I certainly can't confidently
> assert!), you can write 32 bytes to /dev/random, save a seed, and be
> done with it.

I don't have random data easily available at install time.

I don't have random data easily available every time I boot a machine
with non-persistent storage (e.g. a test ISO image).

I _do_ trust well enough the sources of randomness in some device
drivers to provide me with a secure enough amount of entropy, for my
purposes.

And so with my fix(es) I don't need to feed supposedly random data to
every system on every install and/or every reboot.

What's worse? My fixes, or something like this in /etc/rc.local:

	echo -n "" > /dev/random

> But users who don't go messing around with obscure rndctl settings in
> rc.conf will be proverbially shot in the foot by this change -- except
> they won't notice because there is practically guaranteed to be no
> feedback whatsoever for a security disaster until their systems turn
> up in a paper published at Usenix like <https://factorable.net/>.

You're really stretching your argument thinly if you are assuming
everyone _needs_ perfect entropy here.

Also, that's only if the default RND_FLAG_ESTIMATE_* bits are turned
off. AND only if the system doesn't have some true hardware RNG.

> What your change does is equivalent to going around to every device
> driver that previously said `this provides zero entropy, or I don't
> know how much entropy it provides' and replacing that claim by `this
> is a sample of an independent and perfectly uniform random string of
> bits', which is a much stronger (and falser) claim than even the old
> `entropy estimation' confabulation that NetBSD used to do.

No, only if the default RND_FLAG_ESTIMATE_* bits are ***NOT*** turned
off. AND only if the user is like me and stuck with some poor
second-grade ancient hardware that doesn't have some fancy new true
hardware RNG.

In the meantime a more productive approach would be to figure out what's
best for those of us who don't need perfection every time and/or to fix
those device drivers that could feed sufficiently random data to the
entropy pool, and then to recommend a suitable value for rndctl_flags in
/etc/rc.conf.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
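For readers following along, the "write 32 bytes to /dev/random" step quoted above can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions: `sk_write_seed` is a hypothetical helper name, the caller is assumed to already hold the seed bytes (which is exactly the part the message says is hard to come by), and nothing here shows how the kernel credits what is written.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` seed bytes to `path` (for example
 * "/dev/random").  Returns 0 on success and -1 on any error.
 * Where the seed bytes come from is left entirely to the caller.
 */
int
sk_write_seed(const char *path, const unsigned char *seed, size_t len)
{
	int fd = open(path, O_WRONLY);

	if (fd == -1)
		return -1;
	while (len > 0) {
		ssize_t n = write(fd, seed, len);

		if (n <= 0) {		/* treat short-circuit and errors alike */
			close(fd);
			return -1;
		}
		seed += n;
		len -= (size_t)n;
	}
	return close(fd);
}
```

A caller would pass a 32-byte buffer obtained from some trusted source; choosing that source is the open question in this thread, and the sketch does not answer it.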
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 01:05:58 +0200, Joerg Sonnenberger wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Part of the problem here is that most of the non-RNG data sources are
> easily observable either from the local system (e.g. any malicious user)
> or other VMs on the same machine (in case of a hypervisor) or local
> machines on the same network (in case of network interrupts).

It _Just_ _Doesn't_ _Matter_ (i.e. for many of us, most of the time).

Now ideally in the hypervisor scenario we would have a backend device
that read from /dev/random and offered it to the VM guest as a virtual
hardware RNG. Or maybe it's as simple as passing those few bytes through
a custom Xenstore string and having a script in the VM read them and
inject them into /dev/random. But that's not been done yet.

BTW, personally, on at least some machines, I don't have any worry
whatsoever at the moment about one VM guest spying on, or influencing
the PRNG, in another. Zero worry. They're all _me_. I don't need some
theoretically perfect level of protection from myself.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Mon, 05 Apr 2021 00:14:30 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > What about architectures that have nothing like RDRAND/RDSEED? Are
> > they, effectively, totally unsupported now?
>
> Nope, not entirely. But they have to be seeded once. If they
> have storage which survives reboots, and entropy is saved and
> restored on reboot, they will be ~fine.

BTW, to me reusing the same entropy on every reboot seems less secure.

> Systems without persistent storage and also without RDRAND/RDSEED
> will however be ... a more challenging problem.

Leaving things like that would be totally silly.

With my patch the old way of gathering entropy from devices works just
fine as it always did, albeit with the second patch it does require a
tiny bit of extra configuration.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Mon, 05 Apr 2021 00:07:49 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Indeed, that's also compatible with what I wrote. The samples
> from whatever sources you have are still being mixed into the
> pool, but they are not being counted as contributing to the
> entropy estimate, because the quality of the samples is at best
> unknown.

Perhaps we're talking past each other?

Until I made the fix no amount of time or activity or of me telling the
system to make use of the driver inputs was unblocking getrandom(2) or
/dev/random, so it doesn't really matter if anything was being "mixed
into the pool" so to speak as the pool was empty.

> A possible workaround is, once you have some uptime and some bits
> mixed into the pool, you can do:

I don't need a work-around -- I found a fix. I corrected some code that
was purposefully ignoring my orders for how it should behave.

> I am still of the fairly firm belief that the mistrust in the
> hardware vendors' ability to make a reasonable and robust
> implementation is without foundation.

Well there are still millions of systems out there without the fancy
newer hardware RNGs available to make them more secure than Fort Knox.

At least a small handful of them run NetBSD for me, and I want them to
work for my needs and I was, and am, quite happy with using entropy that
can be collected from various devices that my systems (virtual and real)
actually have.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 16:39:11 -0400 (EDT), Mouse wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > No amount of uptime and activity was increasing the entropy in my
> > system before I patched it.
>
> As I understand it, entropy was being contributed. What wasn't
> happening was the random driver code recognizing and acknowledging that
> entropy, because it had no way to tell how much of it there really was.

Clearly there was no entropy being contributed in any way shape or form.

It wasn't the driver code at fault. It was the code I fixed with my
patch that was at fault.

I told the system to "count" the entropy being gathered by the
appropriate driver(s), but it was being ignored entirely.

After my fix the system behaved as I told it to.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: how do I mount a read-only filesystem from the "root device" prompt?
At Sun, 4 Apr 2021 01:19:44 -0700, John Nemeth wrote:
Subject: Re: how do I mount a read-only filesystem from the "root device" prompt?
>
> Given that it is possible to have partitions on CDs, which is
> common on Suns, but not so much elsewhere, and that anywhere there
> is a partition, there is the possibility of using wedges, it would
> seem that this is essential.

I would think it's not just CDs and hypervisor-provided virtual devices
that can have multiple partitions, use wedges, and yet be read-only.

Are not a wide variety of removable storage devices also capable of
being made "read-only" at the hardware level?

On Apr 4, 7:34, Michael van Elst wrote:
>
> I suggested to make it open read-only if it gets EROFS and to validate
> the open mode against what is possible in this state.

Given the layers of devices and code involved, perhaps it might be
possible to just honour the original mode requested by the code opening
the first partition to mount a filesystem, and then to upgrade the vnode
to write mode if/when that mount is upgraded to write mode or another rw
mount is attempted on another partition on the same device?

I realize there's nothing like VOP_REOPEN() to change the open mode
flags, but if I'm not mistaken that wouldn't be too difficult to
implement.

Anyway I did find this is where the actual EROFS is being returned, and
perhaps changing it to EACCES would be less confusing, or maybe not:

--- sys/arch/xen/xen/xbd_xenbus.c.~1.129.~	2021-02-28 15:45:22.000000000 -0800
+++ sys/arch/xen/xen/xbd_xenbus.c	2021-04-04 14:21:01.006355121 -0700
@@ -950,7 +950,7 @@
 	if (sc == NULL)
 		return (ENXIO);
 	if ((flags & FWRITE) && (sc->sc_info & VDISK_READONLY))
-		return EROFS;
+		return EACCES;
 	DPRINTF(("xbdopen(%" PRIx64 ", %d)\n", dev, flags));
 	return dk_open(&sc->sc_dksc, dev, flags, fmt, l);

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
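The suggestion quoted in this message (retry read-only when the open fails with EROFS) might look something like the following in a stand-alone, user-space model. This is a sketch only, not kernel code: `sk_disk`, `sk_open`, `sk_open_fallback` and the `SK_FREAD`/`SK_FWRITE` values are all hypothetical stand-ins for the real xbd(4) structures and flags.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's FREAD/FWRITE open flags. */
#define SK_FREAD	0x1
#define SK_FWRITE	0x2

/* Hypothetical model of a backend disk that may be exported read-only. */
struct sk_disk {
	int readonly;
};

/* Models the check in the diff: refuse a write open on a read-only disk. */
int
sk_open(const struct sk_disk *d, int flags)
{
	if (d == NULL)
		return ENXIO;
	if ((flags & SK_FWRITE) && d->readonly)
		return EROFS;
	return 0;
}

/*
 * Try the requested mode first; if that fails with EROFS and the caller
 * also asked for read access, retry with the write bit dropped.  On
 * success the mode actually granted is stored in *granted so the caller
 * can validate later operations against it.
 */
int
sk_open_fallback(const struct sk_disk *d, int flags, int *granted)
{
	int error = sk_open(d, flags);

	if (error == EROFS && (flags & SK_FREAD)) {
		flags &= ~SK_FWRITE;
		error = sk_open(d, flags);
	}
	if (error == 0)
		*granted = flags;
	return error;
}
```

Recording the granted mode is the design point here: a later attempt to upgrade a mount to read-write could be checked against it rather than against the mode originally requested.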
Re: regarding the changes to kernel entropy gathering
At Sun, 04 Apr 2021 21:14:31 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Do note, the existing randomness sources are still being sampled and
> mixed into the pool, so even if the starting state from the saved
> entropy may be known (by violating the security of the storage),
> it's still not possible to predict the complete stream of randomness
> data once the system has seen a bit of uptime (given that there are
> actual other sources of (unverified) entropy which aren't all of too
> low quality).

No amount of uptime and activity was increasing the entropy in my system
before I patched it.

/dev/random remained blocked after days of busy system activity.

I would argue that most, if not all, of the sources of entropy
identified by rndctl(8) on my systems are high-quality and secure
sources in my circumstances and for my uses.

Perhaps the unpatched implementation isn't doing exactly what you think
it is?

The unpatched implementation completely and entirely prevents the system
from ever using any of those sources, despite showing that they are
enabled for use.

> However, in the new scheme of things, because most of the
> traditional sources have unknown quality, and we have no reliable
> method to estimate how much "actual entropy" those sources
> provide, they no longer count towards the *estimate* of what is
> now a lower bound on the "real" entropy available in the pool.

It really doesn't matter what can be determined in general and from a
distance. What matters is what a given administrator can determine in
particular for a given application in a given circumstance.

Before my patch the system was not behaving as documented and could not
be made to behave as the documentation said it could be made to behave.

With my patch I can choose which to trust from amongst the available
sources. Without that patch my choices are ignored and the system lies
to me about using my choices.

I would argue my patch fixes a critical bug.

> Besides, the implementation has been thoroughly vetted. E.g. the
> reference [7] from the wikipedia article states in the conclusion on
> page 20
>
>    Overall, the Ivy Bridge RNG is a robust design with a large
>    margin of safety that ensures good random data is generated even
>    if the Entropy Source is not operating as well as predicted.

"design" != implementation

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 04 Apr 2021 23:47:10 +0700, Robert Elz wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If we want really good security, I'd submit we need to disable
> the random seed file, and RDRAND (and anything similar) until we
> have proof that they're perfect.

Indeed, I concur.

I trust the randomness and in-observability and isolation of the
behaviour of my system's fans far more than I would trust Intel's RDRAND
or RDSEED instructions.

I even trust the randomness of the timings of the virtual disks in my
Xen domU virtual machines more-so, even with multiple sibling guests,
even if some of those other guests can be influenced by untrusted third
parties at critical times.

> Personally, I'm happy with anything that your average high school
> student is unlikely to be able to crack in an hour. I don't run
> a bank, or a military installation, and I'm not the NSA. If someone
> is prepared to put in the effort required to break into my systems,
> then let them, it isn't worth the cost to prevent that tiny chance.
> That's the same way that my house has ordinary locks - I'm sure they
> can be picked by someone who knows what they're doing, and better security
> is available, at a price, but a nice happy medium is what fits me best.

Indeed again.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 09:49:58 +0000, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > Date: Sat, 03 Apr 2021 12:24:29 -0700
> > From: "Greg A. Woods"
> >
> > Updating a system, even on -current, shouldn't create a long-lived
> > situation where the system documentation and the behaviour and actions
> > of system commands is completely out of sync with the behaviour of the
> > kernel, and in fact lies to the administrator about the abilities of the
> > system.
>
> It would help if you could identify specifically what you are calling
> a lie.
>
> > @@ -1754,21 +1766,21 @@
> >  rnd_add_uint32(struct krndsource *rs, uint32_t value)
> >  {
> >
> > -	rnd_add_data(rs, &value, sizeof value, 0);
> > +	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
> >  }
>
> The rnd_add_uint32 function is used by drivers to feed in data from
> sources _with no known model for their entropy_.

Indeed -- that's the idea.

> It's how drivers
> toss in data that might be helpful but might be totally predictable, and
> the driver has no way to know.

Yeah, so? They don't need to know this. I'm not actually asking random
drivers to decide the amount of physical entropy they can collect. That
is controlled elsewhere.

> Your change _creates_ the lie that every bit of data entered this way
> is drawn from a source with independent uniform distribution.

No, my change _allows_ the administrator to decide which devices can be
used as estimating/counting entropy sources.

For example I know that many of the devices on almost all of my machines
(virtual or otherwise) are equally good sources of entropy for their
uses.

An additional change, one which I would also find totally acceptable,
would be to disable the current default of allowing "estimation" on
devices which are not true hardware RNGs.

I.e. maybe this simple change would suffice (though I haven't checked
beyond a quick grep to see that this flag is the most commonly used one
-- perhaps some real RNG devices could also be changed to use explicit
flags to enable estimation by default):

--- sys/sys/rndio.h.~1.2.~	2016-07-23 14:36:45.000000000 -0700
+++ sys/sys/rndio.h	2021-04-04 12:39:15.609936311 -0700
@@ -91,8 +91,7 @@
 #define	RND_FLAG_ESTIMATE_TIME	0x00004000	/* estimate entropy on time */
 #define	RND_FLAG_ESTIMATE_VALUE	0x00008000	/* estimate entropy on value */
 #define	RND_FLAG_HASENABLE	0x00010000	/* has enable/disable fns */
-#define	RND_FLAG_DEFAULT	(RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-				 RND_FLAG_ESTIMATE_TIME)
+#define	RND_FLAG_DEFAULT	(RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME)

 #define	RND_TYPE_UNKNOWN	0	/* unknown source */
 #define	RND_TYPE_DISK		1	/* source is physical disk */

There are a vast number of ways this re-tooling of entropy collection
could have been done better.

I'm asking for discussion on what amount to some VERY simple changes
which completely and totally solve many real-world uses of this code
while at the same time not just allowing, but defaulting to, the very
strict and secure operation for special situations.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
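To make the effect of that proposed default concrete, here is a small stand-alone sketch of the flag arithmetic. The `SK_*` names are hypothetical; the two ESTIMATE values are the ones shown in the diff, while the two COLLECT values are assumed purely for illustration and should be checked against the real rndio.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the flag bits discussed in the message. */
#define SK_COLLECT_TIME		0x1000u	/* assumed value, illustration only */
#define SK_COLLECT_VALUE	0x2000u	/* assumed value, illustration only */
#define SK_ESTIMATE_TIME	0x4000u	/* as shown in the diff */
#define SK_ESTIMATE_VALUE	0x8000u	/* as shown in the diff */

/* Default before the proposed change: collect, and estimate on time. */
#define SK_DEFAULT_OLD	(SK_COLLECT_VALUE | SK_COLLECT_TIME | SK_ESTIMATE_TIME)
/* Default after the proposed change: collect only. */
#define SK_DEFAULT_NEW	(SK_COLLECT_VALUE | SK_COLLECT_TIME)

/* Nonzero if the source's samples are gathered at all. */
int
sk_collects(uint32_t flags)
{
	return (flags & (SK_COLLECT_TIME | SK_COLLECT_VALUE)) != 0;
}

/* Nonzero if the source's samples may count toward the estimate. */
int
sk_estimates(uint32_t flags)
{
	return (flags & (SK_ESTIMATE_TIME | SK_ESTIMATE_VALUE)) != 0;
}
```

Under this model a source attached with the new default still collects but no longer estimates, and an administrator opting in amounts to OR-ing an ESTIMATE bit back in, which is the per-source choice the message argues for.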
how do I mount a read-only filesystem from the "root device" prompt?
So with Xen one can export a "disk" (disk, file, LVM partition, etc.)
with "access=ro", and that is enforced.

However if one tries to mount such a disk in a domU as root, it fails.

When one first looks at the code which does the initial vfs_mountroot it
would appear to be correct -- i.e. it is trying to open the root
filesystem device for reading: it uses VOP_OPEN() to open the root
device with FREAD (which I think means "only for reading"):

	error = VOP_OPEN(rootvp, FREAD, FSCRED);
	if (error) {
		printf("vfs_mountroot: can't open root device, error = %d\n", error);
		return (error);
	}

However something assumes that if it is like a disk (i.e. not a
CD-ROM/DVD) then it should be opened for write too, as we get:

	root on dk1
	vfs_mountroot: can't open root device, error = 30
	cannot mount root, error = 30

(errno #30 is of course EROFS)

I'm not even sure where this is happening. vfs_rootmountalloc() does
indeed set MNT_RDONLY, but this error is happening before
vfs_mountroot() calls ffs_mountroot (through the vfs_mountroot pointer).

So I'm lost -- any hints?

Is it from bounds_check_with_label()? How?

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
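As an aside for anyone decoding those boot messages: the kernel prints the bare error number, so a tiny user-space helper can map it back to a name. This is a sketch only; `sk_errname` is a hypothetical function, it covers just the three error codes that come up in this thread, and it relies on the platform's <errno.h> agreeing with the message that errno 30 is EROFS.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: map a numeric errno, as printed by
 * "can't open root device, error = %d", to its symbolic name.
 * Only the codes discussed in this thread are handled.
 */
const char *
sk_errname(int error)
{
	switch (error) {
	case ENXIO:
		return "ENXIO";
	case EACCES:
		return "EACCES";
	case EROFS:
		return "EROFS";
	default:
		return "?";
	}
}
```

Printing `sk_errname(30)` alongside `strerror(30)` on the target system is a quick way to confirm what a numeric boot-time error refers to.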
regarding the changes to kernel entropy gathering
So, I'm not sure what to say here. I'm very surprised, quite confused,
more than a little perturbed, and even somewhat angry. It's taken me
quite some time to write this.

Now temper this with knowing that I do know I'm running -current, not a
release, and that I accept the challenges this might cause (thus see the
patch below).

Updating a system, even on -current, shouldn't cause what I can only
describe as _intentional_ breakage, even for matters so important as
system security and integrity, and especially not without clear mention
in UPDATING, and perhaps also with documented and referenced tools to
assist in undoing said breakage.

Updating a system, even on -current, shouldn't create a long-lived
situation where the system documentation and the behaviour and actions
of system commands are completely out of sync with the behaviour of the
kernel, and in fact lie to the administrator about the abilities of the
system.

In any case, the following patch (and in particular the last hunk) fixes
all my problems and complaints in this domain. It is fully tested, and
it works A-OK with Xen in both domU and dom0 kernels. My systems once
again have consistent documentation, and tools that don't lie, and are
able to function as before w.r.t. matters related to /dev/random and
getrandom(2).

Now I'm not proposing this as the final solution -- I think there's some
middle ground to be found, but at least this gets things back to
working.

--- sys/kern/kern_entropy.c.~1.30.~	2021-03-07 17:23:05.000000000 -0800
+++ sys/kern/kern_entropy.c	2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@
 		/* Wait for some entropy to come in and try again. */
 		KASSERT(E->stage >= ENTROPY_WARM);
-		printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+		printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
 		    curproc->p_pid, curproc->p_comm);
 		if (ISSET(flags, ENTROPY_SIG)) {
@@ -1577,6 +1577,16 @@
 	KASSERT(i == __arraycount(extra));
 	entropy_enter(extra, sizeof extra, 0);
 	explicit_memset(extra, 0, sizeof extra);
+
+	aprint_verbose("entropy: %s attached as an entropy source (", rs->name);
+	if (!(flags & RND_FLAG_NO_COLLECT)) {
+		printf("collecting");
+		if (flags & RND_FLAG_NO_ESTIMATE)
+			printf(" without estimation");
+	}
+	else
+		printf("off");
+	printf(")\n");
 }

 /*
@@ -1610,6 +1620,8 @@
 	/* Free the per-CPU data. */
 	percpu_free(rs->state, sizeof(struct rndsource_cpu));
+
+	aprint_verbose("entropy: %s detached as an entropy source\n", rs->name);
 }

 /*
@@ -1754,21 +1766,21 @@
 rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint64(struct krndsource *rs, uint64_t value)
 {

-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 /*

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
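The arithmetic behind the last hunk can be worked through in a few lines of stand-alone C. This is a sketch under assumptions: `SK_NBBY` stands in for the kernel's NBBY (bits per byte), the function names are hypothetical, and the model ignores the per-source estimate flags and whatever capping the kernel applies, so it shows only the claimed credit per sample, not actual pool behaviour. The 256-bit target in the usage note is the kern.entropy.needed figure reported elsewhere in this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's NBBY (number of bits per byte). */
#define SK_NBBY	CHAR_BIT

/* Bits claimed for one sample when the credit is "sizeof value * NBBY". */
unsigned
sk_bits_credited(size_t sample_size)
{
	return (unsigned)(sample_size * SK_NBBY);
}

/* Samples needed to reach `needed` bits at that rate, rounding up. */
unsigned
sk_samples_needed(unsigned needed, size_t sample_size)
{
	unsigned per_sample = sk_bits_credited(sample_size);

	assert(per_sample > 0);
	return (needed + per_sample - 1) / per_sample;
}
```

With these definitions a uint32_t sample claims 32 bits, so a 256-bit requirement would be met after 8 such samples (4 for uint64_t samples), whereas the unpatched credit of 0 bits per sample never gets there from these calls alone.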
Re: nothing contributing entropy in Xen domUs? or dom0!!!
At Thu, 1 Apr 2021 04:13:59 +0000 (UTC), RVP wrote:
Subject: Re: nothing contributing entropy in Xen domUs? or dom0!!!
>
> Does this /etc/entropy-file match what's there in your /boot.cfg?
>
> On my laptop $random_file is left at the default which is:
> /var/db/entropy-file

Yes I did change that as well (as /var isn't part of the root
partition).

However that's not the problem for the dom0. "rndseed" isn't currently
used (at least not by me or any documentation I'm aware of) when loading
(multibooting) a Xen kernel and a NetBSD dom0 kernel.

/etc/rc.d/random_seed will do this (again) later anyway.

However since, as I showed, the hardware doesn't seem to be providing
entropy that can be "counted" ("estimated"), there's nothing to save,
and so nothing to load on the next boot either.

I know how to seed it -- but that's not the problem -- the hardware
should be providing plenty of entropy.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: nothing contributing entropy in Xen domUs? or dom0!!!
Intel"; CPUID level 11

Intel-specific functions:
Version 000206c2:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 12 -
Stepping 2
Reserved 8

Extended brand string: "Intel(R) Xeon(R) CPU E5645 @ 2.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 34
Hyper threading siblings: 32

Feature flags 1fc9cbf5:
FPU    Floating Point Unit
DE     Debugging Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
CLFSH  CFLUSH instruction
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b2: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06C2-0000-0000-0000-0000

I noted today though that entropy doesn't seem to be accumulating even
in the dom0 despite there being many useful sources configured to both
collect and "estimate" _and_ despite the fact there's a valid-looking
$random_file that was saved and reloaded by /etc/rc.d/random_seed (and
saved again every day by /etc/security):

# /etc/rc.d/random_seed rcvar
# random_seed
random_seed=YES
# ls -l /etc/entropy-file
-rw-------  1 root  wheel  536 Mar 31 04:15 /etc/entropy-file
# rndctl -l
Source                 Bits Type      Flags
ipmi0-Temp                0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp1               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp2               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp3               0 env       estimate, collect, v, t, dv, dt
ipmi0-Ambient-T           0 env       estimate, collect, v, t, dv, dt
ipmi0-Planar-Te           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4           0 env       estimate, collect, v, t, dv, dt
ipmi0-Status              0 ???       estimate, collect, t, dt
ipmi0-Voltage             0 power     estimate, collect, v, t, dv, dt
ipmi0-Voltage1            0 power     estimate, collect, v, t, dv, dt
ipmi0-Status1             0 ???       estimate, collect, t, dt
ipmi0-Intrusion           0 ???       estimate, collect, t, dt
ipmi0-Temp4               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp5               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp6               0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5           0 env       estimate, collect, v, t, dv, dt
ipmi0-Ambient-T           0 env       estimate, collect, v, t, dv, dt
ipmi0-Ambient-T           0 env       estimate, collect, v, t, dv, dt
ums0                      0 tty       estimate, collect, v, t, dt
ukbd0                     0 tty       estimate, collect, v, t, dt
/dev/random               0 ???       estimate, collect, v
sd2                       0 disk      estimate, collect, v, t, dt
sd1                       0 disk      estimate, collect, v, t, dt
sd0                       0 disk      estimate, collect, v, t, dt
cpu0                      0 vm        estimate, collect, v, t, dv
hardclock                 0 skew      estimate, collect, t
pckbd0                    0 tty       estimate, collect, v, t, dt
system-power              0 power     estimate, collect, v, t, dt
autoconf                  0 ???       estimate, collect, t
seed                      0 ???       estimate, collect, v
# sysctl kern.entropy
kern.entropy.collection = 1
kern.entropy.depletion = 0
kern.entropy.consolidate = -23552
kern.entropy.gather = -23552
kern.entropy.needed = 256
kern.entropy.pending = 0
kern.entropy.epoch = 19

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
[[ sorry I've not been catching up on mailing list discussions as fast
as I had hoped to, and I'm way behind on following the entropy
rototill. ]]

At Wed, 31 Mar 2021 00:12:31 +0000, Taylor R Campbell wrote:
Subject: Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> This is false. If the VM host provided a viornd(4) device then NetBSD
> would automatically collect, and count, entropy from the host, with no
> manual intervention.

I'll leave that idea to others more up-to-date on Xen PV drivers to
respond to.

Booting a -current GENERIC kernel (which has both Xen PV and virtio(4)
devices configured into it) in a "type='pvh'" domU only attaches the
xenbus PV devices, no virtio devices, so adding virtio might be a much
bigger task that will need further support on at least the backend, and
perhaps on the front-end too, especially to do it without QEMU.

I haven't tried if virtio devices show up in an HVM domU precisely
because I'm trying to avoid having to run and rely on QEMU (never mind
any performance implications of HVM).

> > Finally, if the system isn't actually collecting entropy from a device,
> > then why the heck does it allow me to think it is (i.e. by allowing me
> > to enable it and show it as enabled and collecting via "rndctl -l")?
>
> The system does collect samples from all those devices. However, they
> are not designed to be unpredictable and there is no good reliable
> model for just how unpredictable they are, so the system doesn't
> _count_ anything from them. See https://man.NetBSD.org/entropy.4 for
> a high-level overview.

I'm not sure the word "count" appears in entropy(4) in any context I can
make sense of it in w.r.t. what it means to "collect" but not "count"
entropy from those devices.

Worse the "Flags" shown by "rndctl -l" don't seem to be directly
documented (i.e. they're not described in rndctl(8)), and even on a
kernel running on real hardware I don't see the word "count" showing
there. After looking at the source I'm not sure the descriptions of the
RND_FLAG_* values in rnd(4) help me much either.

Based on my vague understanding of all of this, perhaps you meant to say
"estimate", instead of "count"? That would make more sense in the
context of what I read in rnd(4) and rndctl(8), though "estimate" still
seems a little vague in meaning to me.

In any case, I don't see why an xbd disk, or a xennet interface, can't
be treated exactly as if they were real hardware (i.e. in terms of
extracting entropy from their behaviour). This is exactly what
virtualization is all about to me -- even for paravirtualization. After
all in a threat-free world (i.e. specifically where I also trust other
domUs) their entropy is going to reflect (though maybe not exactly
mirror) the entropy of the underlying hardware and/or network traffic.

So (but maybe not by default) if I as the admin want to trust the
entropy available from an xbd(4) or xennet(4) device, then I should be
able to enable it with rndctl(8) and have it "count".

More importantly though the system shouldn't mislead me into thinking it
is "counting" entropy from a device when it is actually not.

If I had seen that there were no sources estimating/counting/whatever
entropy, and I tried to enable one and was given a nice error message
about this not being possible, then I would have looked elsewhere to
find out how to give the system more bits of entropy.

As is in my Xen domU system the output of "rndctl -l" leads me to
believe all of my devices are collecting both timing and value samples,
and using either one or the other to gather entropy (though with '-v' I
don't see that any bits of entropy have been added from any of those
many millions of collected samples).

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Avoncote Farms pgpcOwz5f2PVj.pgp Description: OpenPGP Digital Signature
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
At Tue, 30 Mar 2021 23:53:43 +0200, Manuel Bouyer wrote: Subject: Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement) > > On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote: > > [...] > > > > Perhaps the answer is that nothing seems to be contributing anything to > > the entropy pool. No matter what device I exercise, none of the numbers > > in the following changes: > > yes, it's been this way since the rnd rototill. Virtual devices are > not trusted. > > The only way is to manually seed the pool. Ah, so that is definitely not what I expected! Previously wasn't it up to the local admin what to trust? I guess throwing bits into /dev/random is one way to play that game, but I have to trust the dom0 implicitly and utterly anyway, so why not trust the devices it presents? This is especially true for xbd block devices. All my blocks are belong to dom0. The network device is in effect no different than if it were real hardware, so if I want to trust network traffic, then I should be able to enable it, just as I could if it were real hardware. The CPUs are also probably the least "virtual" things in Xen, so why not trust them? (Though I'm not sure I understand what entropy they can offer in the first place.) Finally, if the system isn't actually collecting entropy from a device, then why the heck does it allow me to think it is (i.e. by allowing me to enable it and show it as enabled and collecting via "rndctl -l")? -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpE2Nup3Gb9V.pgp Description: OpenPGP Digital Signature
nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
ue to lack of entropy [ 563844.834413] entropy: pid 7903 (python) blocking due to lack of entropy [ 566365.511377] entropy: pid 9001 (python) blocking due to lack of entropy [ 577473.897830] entropy: pid 9350 (python) blocking due to lack of entropy [ 579179.381600] entropy: pid 25728 (od) blocking due to lack of entropy [ 579186.994440] entropy: pid 11107 (cat) blocking due to lack of entropy [ 579202.264290] entropy: pid 7248 (cat) blocking due to lack of entropy [ 579669.831978] entropy: ready -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms At Tue, 30 Mar 2021 10:06:19 -0700, "Greg A. Woods" wrote: Subject: python3.7 rebuild stuck in kernel in "entropy" during an "import" statement > > So I've been running a pkg-rolling_replace and one of the packages being > rebuilt is python3.7, and it has got stuck, apparently on an "entropy" > wait in the kernel, and it's been in this state for over 24hrs as you > can see. > > The only things the process has open appear to be its stdio descriptors, > two of which are are open on the log file I was directing all output to. 
> > This is on a Xen domU of a machine running: > > $ uname -a > NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 > PDT 2021 > woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 > amd64 > > > 09:51 [504] $ ps -lwwp 19875 > UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND > 0 19875 11551 0 85 0 55412 11324 entropy Ipts/0 0:00.27 ./python -E > -Wi > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py > -d /usr/pkg/lib/python3.7 -f -x > bad_coding|badsyntax|site-packages|lib2to3/tests/data > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > 09:51 [505] $ ps -uwwp 19875 > USER PID %CPU %MEM VSZ RSS TTY STAT STARTEDTIME COMMAND > root 19875 0.0 0.1 55412 11324 pts/0 I 9:09PM 0:00.27 ./python -E -Wi > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py > -d /usr/pkg/lib/python3.7 -f -x > bad_coding|badsyntax|site-packages|lib2to3/tests/data > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > 09:51 [506] $ fstat -p 19875 > USER CMD PID FD MOUNT INUM MODE SZ|DV R/W > root python 19875 wd /build10645634 drwxr-xr-x1024 r > root python 198750 /dev/pts 3 crw--- pts/0 rw > root python 198751 /build 3721223 -rw-r--r-- 28287492 w > root python 198752 /build 3721223 -rw-r--r-- 28287492 w > 09:51 [507] $ find /build -inum 3721223 > /build/packages/root/pkg_roll.out > 09:51 [508] $ > > > It was killable -- I sent SIGINT from the tty and it died as expected. 
> > > Running "make replace" gets it stuck in the same place again, an the > SIGINT shows the following stack trace: > > PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1 > ./python -E -Wi > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py > -d /usr/pkg/lib/python3.7 -f -x > 'bad_coding|badsyntax|site-packages|lib2to3/tests/data' > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > ^T > [ 563859.5589422] load: 0.39 cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k > make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1 > make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > ^T > [ 563866.4606073] load: 0.36 cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k > make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1 > make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > ^?Traceback (most recent call last): > File > "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py", > line 20, in > from concurrent.futures import ProcessPoolExecutor > File "", line 1032, in _handle_fromlist > File > "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py", > line 43, in __getattr__ > from .process import ProcessPoolExecutor as pe > File > "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py", > line 53, i
python3.7 rebuild stuck in kernel in "entropy" during an "import" statement
So I've been running a pkg-rolling_replace and one of the packages being
rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
wait in the kernel, and it's been in this state for over 24hrs as you
can see.

The only things the process has open appear to be its stdio descriptors,
two of which are open on the log file I was directing all output to.

This is on a Xen domU of a machine running:

$ uname -a
NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 PDT 2021
woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 amd64

09:51 [504] $ ps -lwwp 19875
UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN   STAT TTY      TIME COMMAND
  0 19875 11551   0  85  0 55412 11324 entropy I    pts/0 0:00.27 ./python -E -Wi
    /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
    -d /usr/pkg/lib/python3.7 -f -x
    bad_coding|badsyntax|site-packages|lib2to3/tests/data
    /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [505] $ ps -uwwp 19875
USER   PID %CPU %MEM   VSZ   RSS TTY   STAT STARTED    TIME COMMAND
root 19875  0.0  0.1 55412 11324 pts/0 I     9:09PM 0:00.27 ./python -E -Wi
    /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
    -d /usr/pkg/lib/python3.7 -f -x
    bad_coding|badsyntax|site-packages|lib2to3/tests/data
    /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [506] $ fstat -p 19875
USER     CMD          PID   FD MOUNT        INUM MODE         SZ|DV R/W
root     python     19875   wd /build   10645634 drwxr-xr-x    1024 r
root     python     19875    0 /dev/pts        3 crw---       pts/0 rw
root     python     19875    1 /build    3721223 -rw-r--r-- 28287492 w
root     python     19875    2 /build    3721223 -rw-r--r-- 28287492 w
09:51 [507] $ find /build -inum 3721223
/build/packages/root/pkg_roll.out
09:51 [508] $

It was killable -- I sent SIGINT from the tty and it died as expected.

Running "make replace" gets it stuck in the same place again, and the
SIGINT shows the following stack trace:

PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
    LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1
    ./python -E -Wi
    /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
    -d /usr/pkg/lib/python3.7 -f -x
    'bad_coding|badsyntax|site-packages|lib2to3/tests/data'
    /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
^T
[ 563859.5589422] load: 0.39 cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^T
[ 563866.4606073] load: 0.36 cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^?Traceback (most recent call last):
  File "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py", line 20, in
    from concurrent.futures import ProcessPoolExecutor
  File "", line 1032, in _handle_fromlist
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py", line 43, in __getattr__
    from .process import ProcessPoolExecutor as pe
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py", line 53, in
    import multiprocessing as mp
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/__init__.py", line 16, in
    from . import context
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/context.py", line 5, in
    from . import process
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 363, in
    _current_process = _MainProcess()
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 347, in __init__
    self._config = {'authkey': AuthenticationString(os.urandom(32)),
KeyboardInterrupt
*** Error code 1 (ignored)
*** Signal 2
*** Signal 2

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
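The traceback in this message ends in os.urandom(32), which is where the
process sleeps on the "entropy" wait channel.  As a rough sketch only
(nothing here comes from pkgsrc or from the Python build itself), a
script could probe whether a request for random bytes would block before
making it.  This assumes os.getrandom() and os.GRND_NONBLOCK are
available, which as far as I know is only true for CPython on Linux; on
other systems the sketch simply reports "unsupported".  The helper name
probe_entropy is made up for illustration.

```python
import os


def probe_entropy(nbytes=32):
    """Report whether a request for random bytes would block.

    Hypothetical helper: returns "ready", "would-block" or "unsupported".
    """
    getrandom = getattr(os, "getrandom", None)
    nonblock = getattr(os, "GRND_NONBLOCK", None)
    if getrandom is None or nonblock is None:
        # This Python build does not expose getrandom(2) at all.
        return "unsupported"
    try:
        getrandom(nbytes, nonblock)
    except BlockingIOError:
        # The kernel pool is not ready yet; a plain os.urandom() call
        # could sleep at this point, as in the traceback above.
        return "would-block"
    except OSError:
        # For example, the system call is missing in the running kernel.
        return "unsupported"
    return "ready"


if __name__ == "__main__":
    print(probe_entropy())
```

A "would-block" answer would at least turn a silent multi-hour hang into
something a build script can report.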
Re: kern/54969 (Disk cache is no longer flushed on shutdown)
RNING: some file systems would not unmount [Wed Mar 24 20:43:02 2021][ 715718.5284461] forcefully unmounting /dev/mapper/scratch-build from /build... [Wed Mar 24 20:43:02 2021][ 715718.5284461] forcefully unmounted /dev/mapper/scratch-build from /build, type ffs [Wed Mar 24 20:43:02 2021][ 715718.5384534] unmount of / (/dev/dk0) failed with error 16 [Wed Mar 24 20:43:02 2021][ 715718.5384534] WARNING: some file systems would not unmount [Wed Mar 24 20:43:02 2021][ 715718.5384534] forcefully unmounting /dev/dk0 from /... [Wed Mar 24 20:43:02 2021][ 715718.5384534] forcefully unmounted /dev/dk0 from /, type ffs [Wed Mar 24 20:43:02 2021][ 715718.5384534] unmounting done [Wed Mar 24 20:43:02 2021][ 715718.5384534] turning off swap... done [Wed Mar 24 20:43:02 2021][ 715718.5384534] dk0 at sd0 (/) deleted [Wed Mar 24 20:43:02 2021][ 715718.5384534] sd0: detached [Wed Mar 24 20:43:02 2021][ 715718.5384534] scsibus0: detached [Wed Mar 24 20:43:02 2021][ 715718.7184994] mfi0: detached [Wed Mar 24 20:43:02 2021][ 715718.7184994] pci8: detached [Wed Mar 24 20:43:02 2021][ 715718.7184994] ppb7: detached [Wed Mar 24 20:43:02 2021][ 715718.7184994] unmounting done [Wed Mar 24 20:43:02 2021][ 715718.7184994] turning off swap... done [Wed Mar 24 20:43:02 2021][ 715718.7184994] rebooting... [[ ... why is "turning off swap" seen twice? .. ]] [[ ... and then the reboot, until rc scripts say ... ]] [Wed Mar 24 20:44:51 2021]Starting root file system check: [Wed Mar 24 20:44:51 2021]/dev/rdk0: file system is clean; not checking [Wed Mar 24 20:44:51 2021]start / wait fsck_ffs -p /dev/rdk0 [Wed Mar 24 20:44:52 2021]Starting file system checks: [Wed Mar 24 20:44:52 2021]/dev/rdk2: file system is clean; not checking [Wed Mar 24 20:44:52 2021]/dev/rdk3: file system is clean; not checking [[ ... here I hit ^T on the console as it was taking too long ... 
]] [Wed Mar 24 20:44:58 2021][ 15.0201108] load: 0.08 cmd: sleep 345 [nanoslp] 0.00u 0.00s 0% 512k [Wed Mar 24 20:44:58 2021]/dev/mapper/rscratch-build: phase 1: cyl group 24 of 345 (6%) [Wed Mar 24 20:46:09 2021]/dev/mapper/rscratch-build: phase 1: cyl group 284 of 345 (82%) [Wed Mar 24 20:49:30 2021]/dev/mapper/rscratch-build: 1400986 files, 36172587 used, 28347707 free (17403 frags, 3541288 blocks, 0.0% fragmentation) [Wed Mar 24 20:49:30 2021]/dev/mapper/rscratch-build: MARKING FILE SYSTEM CLEAN [Wed Mar 24 20:49:30 2021]start /var nowait fsck_ffs -p /dev/rdk2 [Wed Mar 24 20:49:30 2021]start /build nowait fsck_ffs -p /dev/mapper/rscratch-build [Wed Mar 24 20:49:30 2021]done ffs: /dev/rdk2 (/var) = 0x0 [Wed Mar 24 20:49:30 2021]start /usr/pkg nowait fsck_ffs -p /dev/rdk3 [Wed Mar 24 20:49:30 2021]done ffs: /dev/rdk3 (/usr/pkg) = 0x0 [Wed Mar 24 20:49:30 2021]done ffs: /dev/mapper/rscratch-build (/build) = 0x0 [Wed Mar 24 20:49:30 2021]Script /etc/rc.d/fsck running [Wed Mar 24 20:49:30 2021]Currently sourcing /etc/rc.d/fsck [Wed Mar 24 20:49:30 2021]exec: mount_ffs -o rw /dev/dk2 /var [Wed Mar 24 20:49:30 2021]exec: mount_ffs -o rw /dev/dk2 /var [Wed Mar 24 20:49:30 2021]/dev/dk2 on /var type ffs (local, fsid: 0xa802/0x78b, reads: sync 1 async 0, writes: sync 2 async 0) -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpeOBU3AaVgV.pgp Description: OpenPGP Digital Signature
a reminder: upgrade Xen in single-user mode, or with Xen disabled!
So I just upgraded Xen to xenkernel413-4.13.2nb5, but without first
upgrading the Xen tools, as otherwise how would one safely shut down any
running domUs, etc.?  :-)

After upgrading to xentools413-4.13.2nb4 I immediately got stuck:

# xl list
Name                ID   Mem VCPUs      State   Time(s)
[ 578.9865720] load: 0.27 cmd: xl 2027 [tstile] 0.00u 0.01s 0% 3080k

and I mean "really" stuck -- xl is unkillable (and unstoppable) in that
state!

At first I had grave misgivings that the old tstile deadlock was back,
but at the moment only dom0 is running.

So, thinking it over: the old xenstored is started on boot and will
still be running, so I need to restart it from another xterm with
"/etc/rc.d/xencommons restart", and voila, that unstuck xl.

Probably xl shouldn't get stuck like that if it can't connect to
xenstored properly -- as I said it's unkillable in that state!

I then tried "/etc/rc.d/xenwatchdog restart" but it didn't restart (for
some reason I've yet to diagnose -- I had this happen once before -- it
seems to have trouble restarting sometimes, perhaps especially after
restarting xencommons).

That meant that a few moments later the Xen kernel decided dom0 was dead
and promptly (and I mean PROMPTLY) rebooted the machine -- kaboom!

(XEN) [2021-03-25 04:16:26.951] Watchdog timer fired for domain 0
(XEN) [2021-03-25 04:16:26.951] Hardware Dom0 shutdown: watchdog rebooting machine

At least on this next reboot all the right versions of the right bits
started!

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: still a problem with gpt(8) reading from LVM volumes? (was: problems with GPT (and maybe dkctl wedges) on LVM volumes)
At Fri, 19 Mar 2021 00:04:28 -0700, John Nemeth wrote:
Subject: Re: still a problem with gpt(8) reading from LVM volumes? (was: problems with GPT (and maybe dkctl wedges) on LVM volumes)
>
>      One of the projects I have in mind is to replace the data
> structure.  One good thing about the program is that all manipulation
> of the data structure is done through access routines and is
> appropriately contained in map.c and map.h.  One thing that is
> slowing me down is thinking of an appropriate data structure for
> tracking allocated space (the current method gets this pretty much
> for free).  One traditional way to do this would be to use a bitmap,
> but with the size of modern disks, that is completely infeasible.
> Note that whatever method is chosen must be able to handle duplicate
> allocations (i.e. overlapping partitions).

Hi John,

I have done some work in the past couple of years with code that deals
with what I think are sometimes called "extents" or "intervals".  The
code I worked on was primarily merging and diffing and searching sets of
extents.  Another application of extents is in calendar scheduling.

Anyway I have some small example bits of Go code here:

	https://github.com/robohack/experiments/blob/master/t-interval-complement.go
	https://github.com/robohack/experiments/blob/master/t-interval-complement_test.go

I also copied some code from stackoverflow to play with:

	https://github.com/robohack/experiments/blob/master/tintervals-merge.py

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
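To make the "extents" idea in this message a little more concrete, here
is a minimal sketch in Python of one way a list of (start, size) extents
could stand in for a bitmap.  It is not taken from the linked
experiments or from gpt(8)'s map.c; the function names and the sample
numbers are invented for illustration.  Overlapping allocations are
tolerated on input and merged when computing free space.

```python
def merge_extents(extents):
    """Merge overlapping or adjacent (start, size) extents.

    Returns a sorted list of non-overlapping (start, size) tuples.
    """
    merged = []  # list of [start, end) pairs, end exclusive
    for start, size in sorted(extents):
        end = start + size
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous extent: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


def free_extents(extents, total_blocks):
    """Return the gaps, as (start, size), not covered by any extent."""
    free = []
    cursor = 0
    for start, size in merge_extents(extents):
        if start > cursor:
            free.append((cursor, start - cursor))
        cursor = max(cursor, start + size)
    if cursor < total_blocks:
        free.append((cursor, total_blocks - cursor))
    return free


if __name__ == "__main__":
    # Two overlapping "partitions" and one separate one (made-up numbers).
    allocated = [(34, 100), (100, 50), (500, 10)]
    print(merge_extents(allocated))
    print(free_extents(allocated, 1000))
```

Memory use scales with the number of extents rather than with the size
of the disk, which is the property a bitmap lacks.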
Re: odd ATF failure for sh: ulimit_redirection_interaction failed
At Sat, 13 Mar 2021 20:57:20 +0700, Robert Elz wrote:
Subject: Re: odd ATF failure for sh: ulimit_redirection_interaction failed
>
> OK, I see what is going on now.
>
> The difference for you is your initial ulimit -n
> value.  Not that it is big, but that when
> reduced the way that test does it, getting
> smaller and smaller till < 16, it happens to
> land on a value < 10 as the first such limit
> it tries.  Using the default max fd value
> doesn't do that, it reaches 15 or something
> and stops.  sh does not work well with less than
> 10 available fds.

Heh.  I landed on very nearly that clue when I started tracing the
script, but I didn't quite realize the implications!

Thank you very much for figuring it out!

It turns out of course that I had made a typo in my /etc/login.conf and
I had accidentally given my root class a soft limit of a very small
number of open files, just 64.

On the good side, /usr/tests was the only thing that seemed to run into
any problems with this!  (But of course that was just for root shells --
my normal userid had 2000.)

> But first, make sh give a better indication what
> the problem is when this kind of thing does
> happen.

Yes, Please!

> When there are redirections in builtins, the
> existing fd (if any) must be moved elsewhere
> so it can be restored after the builtin exits.
> sh always moves to a fd > 10 for this use.

Ah, that explains to me better what that code is doing and why.

> ps: attempting to follow fd usages inside sh
> is not something for the faint of heart.

Indeed.  As I was staring at it a couple of weeks ago I was
coincidentally reminded of Gosling Emacs -- maybe that sh code could
borrow Gosling's skull and crossed bones comment from his display.c :-)

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
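The interaction described in this message (sh parks a saved descriptor
above fd 10, while the soft limit on open files has dropped below 10)
can be imitated outside of sh.  The following is only a sketch in
Python, not the code path /bin/sh actually takes: it assumes the save
is done with something equivalent to fcntl(F_DUPFD) with a minimum of
10, and the helper name dupfd_errno is invented.  On the systems I'm
aware of the first call fails with EINVAL, which would match the
"Invalid argument" seen in the test log.

```python
import errno
import fcntl
import os
import resource


def dupfd_errno(soft_limit, min_fd=10):
    """Try fcntl(F_DUPFD, min_fd) under a lowered RLIMIT_NOFILE soft limit.

    Returns None on success, or the errno value of the failure.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    rfd, wfd = os.pipe()  # a descriptor of our own to duplicate
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
        try:
            newfd = fcntl.fcntl(wfd, fcntl.F_DUPFD, min_fd)
        except OSError as e:
            return e.errno
        os.close(newfd)
        return None
    finally:
        # Always restore the original soft limit and clean up.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
        os.close(rfd)
        os.close(wfd)


if __name__ == "__main__":
    err = dupfd_errno(9)
    print(errno.errorcode.get(err, err))  # limit 9, target fd >= 10
    print(dupfd_errno(64))                # limit 64 leaves room above 10
```

The soft limit is restored in the "finally" block so the probe does not
leave the calling process crippled.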
Re: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)
At Mon, 22 Mar 2021 21:56:40 - (UTC), chris...@astron.com (Christos Zoulas) wrote:
Subject: Re: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)
>
> Thanks, I fixed the shuttle-rule issue, but let's split the LDSTATIC
> and the OPTIM into separate commits. DBG has side effects too (other
> Makefiles set it) so it should be done very carefully.

Thank you very much!

Yes, the other issues should be kept separate.

At the moment though I have to be a bad workman and blame my tools for
not making it easy for me to produce diffs that separate issues.
Hopefully if/when NetBSD finally makes it into a modern VCS then I'll be
able to use the tools I've become very familiar with more recently in
other endeavours to create topic-specific diffs!

The LDSTATIC and related COMPILE_LINK.* and CPPFLAGS changes are quite
simple and straightforward though, and I've used them for about a decade
or more now.  They are critically necessary for doing full static builds
but of course are only part of that story, though luckily a completely
independent part of it.

I've also used the OPTIM/DBG change for as long or longer, though I have
seen some interaction with other third-party Makefiles (probably none
within NetBSD itself, though of course I'll have to scan my tree just to
be sure I haven't forgotten fixing something somewhere).

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)
At Sun, 21 Mar 2021 16:44:31 -0700, "Greg A. Woods" wrote: Subject: sys.mk broken for single-suffix rules since 1.144 (2021/11/09) Sorry, make that 2020/11/09, of course :-) Also this only applies to a few platforms (i386, x86_64, and aarch64), and when the Makefile used somehow ends up including , but does not use and/or does not set PROG -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgprG5SoWsPdM.pgp Description: OpenPGP Digital Signature
sys.mk broken for single-suffix rules since 1.144 (2021/11/09)
RGET} ${.IMPSRC} ${LDLIBS}
+	${COMPILE_LINK.c} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
 # XXX: disable for now
 # ${CTFCONVERT_RUN}
 .c.o:
@@ -138,7 +151,7 @@

 # C++
 .cc .cpp .cxx .C:
-	${LINK.cc} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
+	${COMPILE_LINK.cc} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
 # XXX: disable for now
 # ${CTFCONVERT_RUN}
 .cc.o .cpp.o .cxx.o .C.o:
@@ -151,8 +164,9 @@

 # Fortran/Ratfor
 .f:
-	${LINK.f} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.f} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 .f.o:
 	${COMPILE.f} ${.IMPSRC} ${OBJECT_TARGET}
 	${CTFCONVERT_RUN}
@@ -162,8 +176,9 @@
 	rm -f ${.PREFIX}.o

 .F:
-	${LINK.F} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.F} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 .F.o:
 	${COMPILE.F} ${.IMPSRC} ${OBJECT_TARGET}
 	${CTFCONVERT_RUN}
@@ -173,8 +188,9 @@
 	rm -f ${.PREFIX}.o

 .r:
-	${LINK.r} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.r} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 .r.o:
 	${COMPILE.r} ${.IMPSRC} ${OBJECT_TARGET}
 	${CTFCONVERT_RUN}
@@ -185,8 +201,9 @@

 # Pascal
 .p:
-	${LINK.p} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.p} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 .p.o:
 	${COMPILE.p} ${.IMPSRC} ${OBJECT_TARGET}
 	${CTFCONVERT_RUN}
@@ -197,8 +214,9 @@

 # Assembly
 .s:
-	${LINK.s} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.s} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 .s.o:
 	${COMPILE.s} ${.IMPSRC} ${OBJECT_TARGET}
 	${CTFCONVERT_RUN}
@@ -207,8 +225,9 @@
 	${AR} ${ARFLAGS} ${.TARGET} ${.PREFIX}.o
 	rm -f ${.PREFIX}.o
 .S:
-	${LINK.S} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.S} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 .S.o:
 	${COMPILE.S} ${.IMPSRC} ${OBJECT_TARGET}
 	${CTFCONVERT_RUN}
@@ -220,8 +239,9 @@
 # Lex
 .l:
 	${LEX.l} ${.IMPSRC}
-	${LINK.c} ${OBJECT_TARGET} lex.yy.c ${LDLIBS} -ll
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.c} -o ${.TARGET} lex.yy.c ${LDLIBS} -ll
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 	rm -f lex.yy.c
 .l.c:
 	${LEX.l} ${.IMPSRC}
@@ -235,8 +255,9 @@
 # Yacc
 .y:
 	${YACC.y} ${.IMPSRC}
-	${LINK.c} ${OBJECT_TARGET} y.tab.c ${LDLIBS}
-	${CTFCONVERT_RUN}
+	${COMPILE_LINK.c} -o ${.TARGET} y.tab.c ${LDLIBS}
+# XXX: disable for now
+# ${CTFCONVERT_RUN}
 	rm -f y.tab.c
 .y.c:
 	${YACC.y} ${.IMPSRC}

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
still a problem with gpt(8) reading from LVM volumes? (was: problems with GPT (and maybe dkctl wedges) on LVM volumes)
At Fri, 12 Mar 2021 14:02:06 -0800, I wrote:
Subject: problems with GPT (and maybe dkctl wedges) on LVM volumes
>
> # gpt -vvv show -a /dev/mapper/rvg0-nbtest.0
> /dev/mapper/rvg0-nbtest.0: mediasize=41943040; sectorsize=512; blocks=81920
> /dev/mapper/rvg0-nbtest.0: PMBR at sector 0
> /dev/mapper/rvg0-nbtest.0: Pri GPT at sector 1
> /dev/mapper/rvg0-nbtest.0: GPT partition: type=ffs, start=64, size=41942943
> gpt: /dev/mapper/rvg0-nbtest.0: map entry doesn't fit media: new start + new
> size < start + size
> (22 + 13fde < 40 + 27fff9f)

I'm still not quite sure why gpt(8) can't show me the full partition
table when reading from a raw LVM volume (dm) device as above in exactly
the same way it does when reading from the raw (xbd emulated) disk in
the domU.

After all, if I map, say, an install.img file, then in the domU I see:

# gpt show -a /dev/rxbd4
  start     size  index  contents
      0        1         PMBR
      1        1         Pri GPT header
      2       32         Pri GPT table
     34     2014         Unused
   2048   262144      1  GPT part - EFI System
                         Type: efi
                         TypeID: c12a7328-f81f-11d2-ba4b-00a0c93ec93b
                         GUID: 97ac9806-df43-4590-ae5b-c88d8861ea0e
                         Size: 128 M
                         Label: EFI system
                         Attributes: None
 264192  7544832      2  GPT part - NetBSD FFSv1/FFSv2
                         Type: ffs
                         TypeID: 49f48d5a-b10e-11dc-b99b-0019d1879648
                         GUID: 2865e4e5-a798-4bed-9dc7-2e2317a3d789
                         Size: 3684 M
                         Label:
                         Attributes: biosboot, bootme
7809024     2015         Unused
7811039       32         Sec GPT table
7811071        1         Sec GPT header

and in the dom0 I see the same from the target file:

# gpt show -a /images/NetBSD-9.99.81-amd64-install.img
  start     size  index  contents
      0        1         PMBR
      1        1         Pri GPT header
      2       32         Pri GPT table
     34     2014         Unused
   2048   262144      1  GPT part - EFI System
                         Type: efi
                         TypeID: c12a7328-f81f-11d2-ba4b-00a0c93ec93b
                         GUID: 97ac9806-df43-4590-ae5b-c88d8861ea0e
                         Size: 128 M
                         Label: EFI system
                         Attributes: None
 264192  7544832      2  GPT part - NetBSD FFSv1/FFSv2
                         Type: ffs
                         TypeID: 49f48d5a-b10e-11dc-b99b-0019d1879648
                         GUID: 2865e4e5-a798-4bed-9dc7-2e2317a3d789
                         Size: 3684 M
                         Label:
                         Attributes: biosboot, bootme
7809024     2015         Unused
7811039       32         Sec GPT table
7811071        1         Sec GPT header

BTW, I've yet to try ccd(4) as an interpolative layer to add
"partitionable disk" semantics -- my first attempt, on the older
(production) Xen system where I was testing this, resulted in a hard
crash as I was running "ccdconfig -u ccd0" to try a different LVM.

I need to run through the exercise of letting sysinst partition up an
xbd0 to try this again on a newer, and less critical, Xen server.

-- 
Greg A. Woods

Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: odd ATF failure for sh: ulimit_redirection_interaction failed
At Fri, 12 Mar 2021 21:46:29 -0500 (EST), Mouse wrote: Subject: Re: odd ATF failure for sh: ulimit_redirection_interaction failed > > > But it still doesn't really make sense from what I see in the source. > > The attempted FD is #18, but the error just says "1", not "18": > > Does > _ever_ work for multi-digit N? I thought it didn't. Maybe > that was just least-common-denominator sh, or maybe the test is broken? Well, the test does appear to execute without error IFF the condition the test is meant to exercise is not enforced (i.e. if the ulimit for open FDs is not kept lower than the number of currently open FDs); though I have not done any other kind of test to be sure the data sent to a multi-digit FD is actually received from the given FD. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpgFrWaVV0So.pgp Description: OpenPGP Digital Signature
Re: odd ATF failure for sh: ulimit_redirection_interaction failed
At Fri, 12 Mar 2021 15:49:37 +0700, Robert Elz wrote: Subject: Re: odd ATF failure for sh: ulimit_redirection_interaction failed > > The -X should allow... > > | tc-se:stderr: > | tc-se:helper.sh: 1: Invalid argument > > to reveal just what is producing that error message (what is invalid). Thanks for the pointers! From a few quick tests it looks like the problem is with the redirection to what should be an already open file descriptor. But it still doesn't really make sense from what I see in the source. The attempted FD is #18, but the error just says "1", not "18": tc-se:+ LIM=9 tc-se:+ ulimit -S -n 9 tc-se:+ '[' 9 -gt 16 ']' tc-se:+ for FD=18 tc-se:+ echo '18 in 18 38 77 155 311 624 1249 2499 4999 ' >&18 tc-se:helper.sh: 1: Invalid argument tc-se:+ exit 1 tc-se: tc-end: 1615574952.421302, ulimit_redirection_interaction, failed, atf-check failed; see the output of the test for details On the other hand for EINVAL fcntl(2) does say: The argument cmd is F_DUPFD and arg is negative or greater than the maximum allowable number (see getdtablesize(3)). So I'm not so sure this test is valid in the first place, is it? (I've been somewhat confused by the logic in this test and the logic in the related code in sh) Indeed if I move the ulimit call to reset the limit back up to the old limit of 2000 then it runs through the whole list without error (well except for the stderr output caused by the "set -x" of course). 
tc-so:Executing command [ /bin/sh helper.sh ] tc-se:Fail: stderr not empty tc-se:--- /dev/null 2021-03-13 02:13:32.737028821 + tc-se:+++ /tmp/check.n8ejtt/stderr 2021-03-13 02:13:32.736999310 + tc-se:@@ -0,0 +1,21 @@ tc-se:++ ulimit -S -n 2000 tc-se:++ for FD=18 tc-se:++ echo 18 in 18 38 77 155 311 624 1249 2499 4999 >&18 tc-se:++ for FD=38 tc-se:++ echo 38 in 18 38 77 155 311 624 1249 2499 4999 >&38 tc-se:++ for FD=77 tc-se:++ echo 77 in 18 38 77 155 311 624 1249 2499 4999 >&77 tc-se:++ for FD=155 tc-se:++ echo 155 in 18 38 77 155 311 624 1249 2499 4999 >&155 tc-se:++ for FD=311 tc-se:++ echo 311 in 18 38 77 155 311 624 1249 2499 4999 >&311 tc-se:++ for FD=624 tc-se:++ echo 624 in 18 38 77 155 311 624 1249 2499 4999 >&624 tc-se:++ for FD=1249 tc-se:++ echo 1249 in 18 38 77 155 311 624 1249 2499 4999 >&1249 tc-se:++ for FD=2499 tc-se:++ echo 2499 in 18 38 77 155 311 624 1249 2499 4999 >&2499 tc-se:++ for FD=4999 tc-se:++ echo 4999 in 18 38 77 155 311 624 1249 2499 4999 >&4999 tc-se:++ for FD= tc-se:++ echo in 18 38 77 155 311 624 1249 2499 4999 >& tc-end: 1615601612.771674, ulimit_redirection_interaction, failed, atf-check failed; see the output of the test for details I want to add some debug printfs to the shell too, but I'm currently stymied by another problem (I can't access the domU filesystem from the dom0, and until I can get a complete rebuild to finish so I can do a full reinstall of the domU, accessing the FS from the dom0 would be the only easy way I have of injecting changes to the test system since it has no networking). -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpbfxMXYf52z.pgp Description: OpenPGP Digital Signature
problems with GPT (and maybe dkctl wedges) on LVM volumes
So with -current if you present a LVM volume to a domU and then use sysinst to install on it (and I think IFF you choose "extended partitioning") you end up with a GPT partitioned LVM volume that the XEN_DOMU kernel sees as follows:

    [ 2.0010567] xbd0 at xenbus0 id 0: Xen Virtual Block Device Interface
    [ 2.0090574] xbd0: 20480 MB, 512 bytes/sect x 41943040 sectors
    [ 2.0090574] xbd0: backend features 0x1
    [ 2.0100607] dk0 at xbd0: "nbtest-root", 41942943 blocks at 64, type: ffs

From the running dom0 the GPT partition table for this device looks like this:

    # gpt show -a /dev/rxbd0
         start      size  index  contents
             0         1         PMBR
             1         1         Pri GPT header
             2        32         Pri GPT table
            34        30         Unused
            64  41942943      1  GPT part - NetBSD FFSv1/FFSv2
                                 Type: ffs
                                 TypeID: 49f48d5a-b10e-11dc-b99b-0019d1879648
                                 GUID: da2147be-1fe7-4bb3-a1fc-e601c92301fe
                                 Size: 20480 M
                                 Label: nbtest-root
                                 Attributes: biosboot
      41943007        32         Sec GPT table
      41943039         1         Sec GPT header

However attempts to access the filesystem from the dom0 fail (after seeming to get most of the way to finding the whole primary table):

    # gpt -vvv show -a /dev/mapper/rvg0-nbtest.0
    /dev/mapper/rvg0-nbtest.0: mediasize=41943040; sectorsize=512; blocks=81920
    /dev/mapper/rvg0-nbtest.0: PMBR at sector 0
    /dev/mapper/rvg0-nbtest.0: Pri GPT at sector 1
    /dev/mapper/rvg0-nbtest.0: GPT partition: type=ffs, start=64, size=41942943
    gpt: /dev/mapper/rvg0-nbtest.0: map entry doesn't fit media: new start + new size < start + size (22 + 13fde < 40 + 27fff9f)

This may or may not be related to PR# 54900. There's also mention of a possibly related issue in this thread:

    http://mail-index.netbsd.org/netbsd-users/2020/07/19/msg025551.html

However in my case it looks like gpt(8) when run in the dom0 is having problems skipping past the "Unused" part. (suggested because 0x22 == 34d)

Also it seems dkctl doesn't work as I had expected it would on LVM partitions, even though it can apparently find a viable partition:

    # dkctl /dev/mapper/rvg0-nbtest.0 getwedgeinfo
    vg0-nbtest.0 at vg0-nbtest.0: vg0-nbtest.0
    vg0-nbtest.0: 41943040 blocks at 0, type: ffs
    # dkctl /dev/mapper/rvg0-nbtest.0 makewedges
    dkctl: /dev/mapper/rvg0-nbtest.0: makewedges: Inappropriate ioctl for device
    # dkctl /dev/mapper/rvg0-nbtest.0 addwedge nbtest-root 64 41942943 ffs
    dkctl: /dev/mapper/rvg0-nbtest.0: addwedge: Inappropriate ioctl for device

So it looks like I'm back to using plain MBR for domUs again, at least for my next round of Xen server rebuilds.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
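The hexadecimal operands in the gpt(8) error quoted in the LVM message above can be decoded with plain shell arithmetic. The sketch below only re-derives numbers already present in that output; the variable names are illustrative, and the reading that the sector count is being treated as a byte count when gpt(8) runs in the dom0 is an assumption drawn from the arithmetic, not something confirmed against the gpt(8) or LVM driver sources.

```shell
#!/bin/sh
# Values copied from the "gpt -vvv show" output and error quoted above.
mediasize=41943040   # as reported by gpt for the LVM device in the dom0
sectorsize=512
blocks=$((mediasize / sectorsize))

# The four hexadecimal operands from the "map entry doesn't fit media" error.
new_start=$((0x22))
new_size=$((0x13fde))
start=$((0x40))
size=$((0x27fff9f))

new_end=$((new_start + new_size))   # end of the region gpt thinks is usable
part_end=$((start + size))          # end of the partition in the GPT entry

echo "blocks=$blocks new_end=$new_end part_end=$part_end"
if [ "$new_end" -lt "$part_end" ]; then
    echo "partition ends $((part_end - new_end)) sectors past the assumed media end"
fi
```

Here 0x22 + 0x13fde comes to 81920, the same as the "blocks" value gpt printed, while 0x40 + 0x27fff9f is 41943007, the start of the secondary GPT table in the domU's view. The domU kernel reports 41943040 sectors for the same volume, which is exactly the figure the dom0 gpt shows as "mediasize" before dividing by the sector size.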
odd ATF failure for sh: ulimit_redirection_interaction failed
My build (for amd64) of very recent -current sources (2021/03/08) exhibits an odd failure in the ATF tests for /bin/sh. (I'm not sure when this first appeared, but it's not there in my older builds, e.g. from 2020/06.)

From glancing through the test script I'm not sure quite what's happening, though I've not tried to dig much deeper yet.

From the log:

    tc-start: 1615494985.794473, ulimit_redirection_interaction
    tc-so:Executing command [ /bin/sh helper.sh ]
    tc-se:Fail: incorrect exit status: 1, expected: 0
    tc-se:stdout:
    tc-se:
    tc-se:stderr:
    tc-se:helper.sh: 1: Invalid argument
    tc-se:
    tc-end: 1615494985.880185, ulimit_redirection_interaction, failed, atf-check failed; see the output of the test for details

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: build of 2021/03/07 -current fails because of pthread_types.h (without MKLLVM!)
So, the clue is in the last two "notes" from the compiler:

    /build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:170:8: note: '__pthread_cond_st' is not literal because:
      170 | struct __pthread_cond_st {
          |        ^
    /build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:175:17: note: non-static data member '__pthread_cond_st::ptc_waiters' has volatile type
      175 | void *volatile ptc_waiters;
          |                 ^~~

These led me to pthread_types.h, and to the change that apparently introduced the fault (revision 1.25 of pthread_types.h). Looking at the full set of changes in 1.25 led me to the definition of __pthread_volatile, and the comment on that definition suggested the following fix, which at least allows the compile to continue.

I hate C++ and I hate debugging C++, but here at least I'm grateful someone had already figured out how to solve the problem and I only had to apply it in one more place.

If all goes well I should be able to test the build under Xen in the next few hours.

Index: lib/libpthread/pthread_types.h
===
RCS file: /cvs/master/m-NetBSD/main/src/lib/libpthread/pthread_types.h,v
retrieving revision 1.25
diff -u -r1.25 pthread_types.h
--- lib/libpthread/pthread_types.h 10 Jun 2020 22:45:15 - 1.25
+++ lib/libpthread/pthread_types.h 9 Mar 2021 22:43:05 -
@@ -172,7 +172,7 @@
 	/* Protects the queue of waiters */
 	__pthread_spin_t ptc_lock;
-	void *volatile ptc_waiters;
+	void *__pthread_volatile ptc_waiters;
 	void *ptc_spare;
 	pthread_mutex_t *ptc_mutex; /* Current mutex */

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
build of 2021/03/07 -current fails because of pthread_types.h (without MKLLVM!)
ded from /build/woods/b2/current-amd64-destdir/usr/include/sys/types.h:359, from /build/woods/b2/current-amd64-destdir/usr/include/sys/endian.h:55, from /work/woods/m-NetBSD-current-new/external/bsd/libc++/dist/libcxx/include/__config:82, from /work/woods/m-NetBSD-current-new/external/bsd/libc++/dist/libcxx/include/algorithm:623, from /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxx/src/algorithm.cpp:10: /build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:170:8: note: '__pthread_cond_st' is not literal because: 170 | struct __pthread_cond_st { |^ /build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:175:17: note: non-static data member '__pthread_cond_st::ptc_waiters' has volatile type 175 | void *volatile ptc_waiters; | ^~~ *** Failed target: algorithm.o *** Failed command: /build/woods/b2/current-amd64-amd64-tools/bin/x86_64--netbsd-c++ -frandom-seed=a0ced134 -O2 -g -Wall -Wpointer-arith -Wno-sign-compare -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare -Wformat=2 -Werror -Wno-error -pipe -fstack-protector -Wstack-protector --param ssp-buffer-size=1 -std=c++11 -Wold-style-cast -Wctor-dtor-privacy -Wnon-virtual-dtor -Wreorder -Wno-deprecated -Woverloaded-virtual -Wsign-promo -Wsynth -Wno-non-template-friend -Wno-pmf-conversions --sysroot=/build/woods/b2/current-amd64-destdir -nostdinc++ -cxx-isystem /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxx/include -I/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxxrt/src -DLIBCXXRT -D_FORTIFY_SOURCE=2 -c /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxx/src/algorithm.cpp -o algorithm.o *** Error code 1 Stop. 
nbmake[1]: stopped in /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib *** Failed target: dependall *** Failed command: cd "/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib"; /build/woods/b2/current-amd64-amd64-tools/bin/nbmake realall *** Error code 1 Stop. nbmake: stopped in /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib 11:32 [102] $ mynbmake -v MKDEBUG yes 11:32 [103] $ mynbmake -v MKDEBUGLIB yes 11:32 [104] $ mynbmake -v MKLLVM no 11:32 [105] $ mynbmake -v MKLLVMRT no -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgptrfbHvuOsL.pgp Description: OpenPGP Digital Signature
netbsd-5 branch cannot be built with a recent 9.99.64 system (or perhaps any recent GCC?)
keinfo LIBGCC= LIBGCC1= LIBGCC1_TEST= LIBGCC2= INSTALL_LIBGCC= EXTRA_PARTS= CPPFLAGS=-DNETBSD_TOOLS AR=ar RANLIB=ranlib BISON=true DESTDIR= INSTALL=/build/woods/b2/netbs d-5-amd64-i386-tools/bin/i386--netbsdelf-install\ -c\ -p\ -r /build/woods/b2/netbsd-5-amd64-i386-tools/bin/nbgmake -e MACHINE= MAKEINFO=/build/woods/b2/netbsd-5-amd64-i386-tools/bin/nbmakeinfo LIBGCC= LIBGCC1= LIBGCC1_TEST= LIBGCC2= INSTALL_LIBGCC= EXTRA_PARTS= CPPFLAGS=-DNETBSD_TOOLS AR=ar RANLIB=ranlib BISON=true DESTDIR= INSTALL=/build/woods/b2/netbsd-5-amd64-i386-tools/bin/i386--netbsdelf-install\ -c\ -p\ -r all-gcc) *** Error code 2 Stop. nbmake: stopped in /work/woods/m-NetBSD-5/tools/gcc Note there are also lots of other new warnings from a newer compiler building the older toolchain! -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgp9gBzANraAN.pgp Description: OpenPGP Digital Signature
one more possible speedup for "make -jN sets" in makesums
I finally found and enabled USE_PIGZGZIP. That's a big help! (especially with the bigger sets I get with all-static builds)

However the "makesums" part of "make sets" still goes one at a time because of an explicit ".ORDER:" request. My added comment asks my question:

--- distrib/sets/Makefile.~1.107.~ 2020-05-30 15:20:31.225318105 -0700
+++ distrib/sets/Makefile 2021-02-18 10:05:39.414690365 -0800
@@ -269,6 +269,8 @@
 	${TOOL_CAT} ${TARDIR}/$$i >> ${TARDIR}/$$i.tmp; \
 	done
 .endfor
+# XXX this .ORDER is here "so the checksums come out in the proper sequence.",
+# but as a result they cannot be done in parallel!!! Sorting after!?!?!?
 .ORDER: ${MAKETARSETS:@.TARS.@do-sum-${.TARS.}@}

I think this would currently also assume/require that nbcksum always do just one write(2) to generate its whole output (I haven't checked that), or that the whole process can be changed such that they each write to unique temporary files that are then collected and coalesced after they've all run.

Perhaps the distrib/sets/makesums script could also run the (currently) two nbcksum processes in parallel (e.g. if ${.MAKE.JOBS} is set and greater than one in the makefile then pass a '-j ${.MAKE.JOBS}' option to the script).

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
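The "unique temporary files, coalesced afterwards" idea from the makesums message above might look roughly like the following. This is only a sketch under stated assumptions: cksum(1) stands in for nbcksum, the function name sum_sets and the set names are made up for the demonstration, and none of it is taken from the real distrib/sets/Makefile or makesums script.

```shell
#!/bin/sh
# Sketch only: checksum every set in parallel, then merge in a fixed order.
# "cksum" stands in for nbcksum; sum_sets and the file names are hypothetical.
sum_sets() {
    dir=$1; shift
    for s in "$@"; do
        # Each job writes to its own temporary file, so the outputs of
        # concurrently running jobs can never interleave.
        ( cd "$dir" && cksum "$s" > ".$s.sum.tmp" ) &
    done
    wait    # block until every background checksum job has finished
    for s in "$@"; do
        # Concatenate in argument order, which fixes the final sequence
        # regardless of which job happened to finish first.
        cat "$dir/.$s.sum.tmp"
        rm -f "$dir/.$s.sum.tmp"
    done
}

# Small demonstration with throw-away files in a scratch directory.
demo=$(mktemp -d)
for s in base.tgz comp.tgz etc.tgz; do
    echo "$s contents" > "$demo/$s"
done
sum_sets "$demo" base.tgz comp.tgz etc.tgz > "$demo/SUMS"
cat "$demo/SUMS"
```

Because the merge step walks the set list in order, the sequence of the final file no longer depends on the jobs running one at a time, which is the only thing the ".ORDER:" line is said to be there for.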
Re: is this crash while coredumping known? (forget the link to NFS)
At Sat, 11 Jul 2020 23:29:05 -0700, "Greg A. Woods" wrote:
Subject: Re: is this crash while coredumping known? (forget the link to NFS)
>
> So it doesn't seem like this crash has anything to do with NFS after all.

This crash is ongoing for me.

I'll be away for a couple of weeks, then possibly too busy for a few more weeks, but I hope to update to the very most recent -current in the near future. Perhaps it's fixed in more recent sources than what I'm running, but I'm now also wondering about the path through mount_null mountpoints.

In the meantime I wonder if anyone might try to reproduce this, particularly with mount_null mountpoints in place. Since various packages have configure tests that dump core it's not long before the crash occurs when building lots of packages.

FYI my sandboxctl.conf file is as follows, with /build being one big filesystem and with the various pkgsrc vars pointing at /var/package* places, and with /more being an NFS server with pkgsrc sources (I was quite surprised I had to add /usr/X11R7 with netbsd-native):

#!/bin/sh

SANDBOX_TYPE=netbsd-native
SANDBOX_ROOT=/build/sandbox/pkgbuild
NETBSD_NATIVE_RELEASEDIR=/build/woods/xentastic/current-amd64-release/amd64

post_mount_hook () {
	mkdir -p ${SANDBOX_ROOT}/usr/X11R7
	sandbox_bindfs -o ro /usr/X11R7 ${SANDBOX_ROOT}/usr/X11R7
	mkdir -p ${SANDBOX_ROOT}/usr/src
	sandbox_bindfs -o ro /build/src-current ${SANDBOX_ROOT}/usr/src
	mkdir -p ${SANDBOX_ROOT}/usr/xsrc
	sandbox_bindfs -o ro /build/xsrc-current ${SANDBOX_ROOT}/usr/xsrc
	mkdir -p ${SANDBOX_ROOT}/usr/pkgsrc
	sandbox_bindfs -o ro /more/work/woods/m-NetBSD-pkgsrc-current ${SANDBOX_ROOT}/usr/pkgsrc
	mkdir -p ${SANDBOX_ROOT}/usr/pkg
	sandbox_bindfs -o rw /build/package-pkgbuild ${SANDBOX_ROOT}/usr/pkg
	mkdir -p ${SANDBOX_ROOT}/var/package-distfiles
	sandbox_bindfs -o rw /build/package-distfiles ${SANDBOX_ROOT}/var/package-distfiles
	mkdir -p ${SANDBOX_ROOT}/var/package-obj
	sandbox_bindfs -o rw /build/package-obj ${SANDBOX_ROOT}/var/package-obj
	mkdir -p ${SANDBOX_ROOT}/var/packages
	sandbox_bindfs -o rw /build/packages ${SANDBOX_ROOT}/var/packages
	# xxx to make it easier to source .kshrc
	mkdir -p ${SANDBOX_ROOT}/home
	sandbox_bindfs -o ro /more/home ${SANDBOX_ROOT}/home
	ln -fs /usr/src/etc/mk.conf ${SANDBOX_ROOT}/etc/mk.conf
}

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()
At Tue, 14 Jul 2020 20:05:57 - (UTC), chris...@astron.com (Christos Zoulas) wrote:
Subject: Re: recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()
>
> It is not only _malloc_prefork(), it is also _malloc_postfork() and
> _malloc_postfork_child(). The easiest way to fix things is to provide
> them as no-op.

Indeed. I guess this will have to be the way.

Perhaps some proper documentation could/should be written about how to do this and exactly what APIs are necessary to override the internal malloc() entirely.

Note that for malloc() et al in particular this is necessary for both statically linked and dynamically linked programs. The difference is that with static linking one gets a linker error and cannot continue, but with dynamic linking one silently invokes "Undefined Behaviour" (i.e. depending on what the internal malloc() uses to obtain heap space).

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()
At Tue, 14 Jul 2020 00:28:46 +0200, Joerg Sonnenberger wrote:
Subject: Re: recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()
>
> On Mon, Jul 13, 2020 at 03:05:17PM -0700, Greg A. Woods wrote:
> > I think it is the following change (and perhaps more similar/related
> > changes) which breaks static linking of applications which wish to
> > supply their own implementation of malloc(), and which call, e.g.,
> > fork():
>
> I consider it a strong WONTFIX. It's no different from not providing
> posix_memalign etc.

Well, _malloc_prefork() is explicitly called with an underscore leading the identifier name, so strictly speaking it's invalid for an application to define it. (and it's not documented, nor in any standard that I can find, with or without the leading underscore).

So, in my opinion it is invalid for unrelated parts of the library to use such an internal function and as a result have conflicts with overriding some functions.

Perhaps splitting all the internal definitions out of jemalloc.c into their own compilation units and making sure they don't then also still cause unnecessary inclusion of related code and definitions when referenced would be a possible work-around, but that will no doubt lead to later maintenance headaches.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()
I think it is the following change (and perhaps more similar/related changes) which breaks static linking of applications which wish to supply their own implementation of malloc(), and which call, e.g., fork().

This is because fork() now calls _malloc_prefork(), and if the application's replacement does not offer this function (as it should not), then the linker is forced to drag in all of jemalloc.o. This of course happens even if the application is not multi-threaded and is not linking against -lpthread.

    revision 1.15
    date: 2020-05-15 07:37:21 -0700; author: joerg; state: Exp; lines: +6 -2; commitid: 85oo6pCrePrJul8C;
    Hook up proper fork lock handling for malloc:
    - lock all relevant mutexes just before fork
    - unlock all mutexes just after fork in the parent
    - full reinit non-spinlocks in the child

    This is not using the normal pthread_atfork interface to ensure order of
    operation, malloc is used as implementation detail too often.

For example, here static linking pkgsrc/shells/heirloom-sh:

    ld: /usr/lib/libc.a(jemalloc.o): in function `malloc':
    /build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2056: multiple definition of `malloc'; mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:195: first defined here
    ld: /usr/lib/libc.a(jemalloc.o): in function `calloc':
    /build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2154: multiple definition of `calloc'; mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:381: first defined here
    ld: /usr/lib/libc.a(jemalloc.o): in function `realloc':
    /build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2326: multiple definition of `realloc'; mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:330: first defined here
    ld: /usr/lib/libc.a(jemalloc.o): in function `free':
    /build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2416: multiple definition of `free'; mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:303: first defined here

Should I send-pr this? Is there any possibility of an "easy" fix?

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: is this crash while coredumping known? (forget the link to NFS)
So it doesn't seem like this crash has anything to do with NFS after all.

I've been doing package builds in a sandboxctl chroot that access NFS sources (read-only) but are otherwise entirely confined to a local filesystem, albeit through sandboxctl's Null mounts.

After many core dumps (mostly from GNU Configure scripts), one eventually caused another similar looking crash.

This one did a core dump, but savecore didn't think there was enough free space left in /var/crash to recover it (even though there is enough space for dozens of the compressed cores if they compress as well as the last one).

(Below are the original crash messages for comparison.)

[ 200974.6716318] fatal double fault in supervisor mode
[ 200974.6716318] trap type 13 code 0 rip 0x80e3c127 cs 0x8 rflags 0x10286 cr2 0x9a02af3e6f88 e6f90
[ 200974.6816277] curlwp 0x90f14a2e2bc0 pid 1591.1591 lowest kstack 0x9a02af3e52c0
kernel: double fault trap, code=0
Stopped in pid 1591.1591 (conftest) at netbsd:radix_tree_gang_lookup_node+0x1a:movq%rdx,)
radix_tree_gang_lookup_node() at netbsd:radix_tree_gang_lookup_node+0x1a
uvm_page_array_fill() at netbsd:uvm_page_array_fill+0x14b
uvm_page_array_fill_and_peek() at netbsd:uvm_page_array_fill_and_peek+0x1e
uvn_findpage() at netbsd:uvn_findpage+0x88
uvn_findpages() at netbsd:uvn_findpages+0xcd
genfs_getpages() at netbsd:genfs_getpages+0x959
VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x58
uvn_get() at netbsd:uvn_get+0x57
ubc_fault() at netbsd:ubc_fault+0x182
uvm_fault_internal() at netbsd:uvm_fault_internal+0x51e
trap() at netbsd:trap+0x4e5
--- trap (number 6) ---
kcopy() at netbsd:kcopy+0x15
uiomove() at netbsd:uiomove+0xb7
ubc_uiomove() at netbsd:ubc_uiomove+0x156
ffs_write() at netbsd:ffs_write+0x251
layer_bypass() at netbsd:layer_bypass+0x102
VOP_WRITE() at netbsd:VOP_WRITE+0x40
vn_rdwr() at netbsd:vn_rdwr+0xcc
coredump_write() at netbsd:coredump_write+0xa0
coredump_elf64() at netbsd:coredump_elf64+0x43a
coredump() at netbsd:coredump+0x650
sigexit() at netbsd:sigexit+0x27c
sendsig_siginfo() at netbsd:sendsig_siginfo+0x323
trapsignal() at netbsd:trapsignal+0x371
trap() at netbsd:trap+0x8e7
--- trap (number 6) ---
400581:
ds 23
es 23
fs 0
gs 0
rdi 90eb408bdd58
rsi 0
rbp 9a02af3e7080
rbx 9a02af3e7190
rdx 9a02af3e71b0
rcx 1
rax 80e3c10d radix_tree_gang_lookup_node
r8 0
r9 1
r10 0
r11 2
r12 90eb408bdd40
r13 1
r14 0
r15 90eb408bdd58
rip 80e3c127 radix_tree_gang_lookup_node+0x1a
cs 8
rflags 10286
rsp 9a02af3e6f90
ss 0
netbsd:radix_tree_gang_lookup_node+0x1a: movq %rdx,ff10(%rbp)
db{3}>

savecore: reboot after panic: reboot forced via kernel debugger
savecore: system went down at Sat Jul 11 19:35:25 2020
savecore: no dump, not enough free space in /var/crash

$ df -h /var/crash/
Filesystem  Size  Used  Avail  %Cap  Mounted on
/dev/dk2    3.9G  1.5G  2.2G   40%   /var

At Thu, 09 Jul 2020 18:03:23 -0700, "Greg A. Woods" wrote:
Subject: is this crash while coredumping to NFS known?
>
> Here's what was on the console:
>
> [ 71887.4479952] fatal double fault in supervisor mode
> [ 71887.4479952] trap type 13 code 0 rip 0x809c5051 cs 0x8 rflags
> 0x10286 cr2 0x8b827c3e4f98 i
> 3e4fa0
> [ 71887.4479952] curlwp 0x8693578524c0 pid 29079.29079 lowest kstack
> 0x8b827c3e32c0
> kernel: double fault trap, code=0
> Stopped in pid 29079.29079 (tpgsqltime) at netbsd:ip_output+0x14: movq
> %rsi,fe68(%rbp
>
> ip_output() at netbsd:ip_output+0x14
> tcp_output() at netbsd:tcp_output+0xc68
> tcp_send_wrapper() at netbsd:tcp_send_wrapper+0x9a
> sosend() at netbsd:sosend+0x7e4
> nfs_send() at netbsd:nfs_send+0x86
> nfs_request() at netbsd:nfs_request+0x3d4
> nfs_readrpc() at netbsd:nfs_readrpc+0x204
> nfs_doio() at netbsd:nfs_doio+0x731
> VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x64
> genfs_getpages() at netbsd:genfs_getpages+0x1400
> nfs_getpages() at netbsd:nfs_getpages+0x5d
> VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x80
> uvm_fault_internal() at netbsd:uvm_fault_internal+0x1895
> trap() at netbsd:trap+0x4e5
> --- trap (number 6) ---
> copyin() at netbsd:copyin+0x2f
> uiomove() at netbsd:uiomove+0xb7
> ubc_uiomove() at netbsd:ubc_uiomove+0x156
> nfs_write() at netbsd:nfs_write+0x129
> VOP_WRITE() at netbsd:VOP_WRITE+0x65
> vn_rdwr() at netbsd:vn_rdwr+0xcc
> coredump_write() at netbsd:coredump_write+0x56
> coredump_elf64() at netbsd:coredump_elf64+0x89c
> coredump() at netbsd:coredump+0x650
> sigexit() at netbsd:sigexit+0x27c
> sendsig() at netbsd:sendsig
> lwp_userret() at netbsd:lwp_userret+0x1c5
> trap() at netbsd:trap+0x
Re: USB console support "was: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1)
At Thu, 09 Jul 2020 18:16:26 -0700, "Greg A. Woods" wrote: Subject: USB console support "was: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1) > > Oh, and I wanted to mention something else that I'd forgotten about but > stumbled across again the other day while debugging servers: > > Xen supports writing console messages to a special kind of USB port: > > "console=dbgp" indicates that Xen should use a USB debug port. > > http://xenbits.xenproject.org/docs/4.11-testing/misc/xen-command-line.html > > There's more about it in this thread: > > https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00436.html > https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00458.html For what it's worth my Dell servers and my MacBook Pro have such USB debug ports. The MacBook Pro even has two of them, and I'm pretty sure one of them is connected to the external ports. On the Dell though this seems to be the port that connects to the DRAC. From "lspci -vvv": 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09) (prog-if 20 [EHCI]) Subsystem: Dell Device 01b2 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpwb_S2z2w5O.pgp Description: OpenPGP Digital Signature
USB console support "was: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1)
At Mon, 06 Jul 2020 13:13:03 -0700, "Greg A. Woods" wrote: Subject: Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1 > > Or indeed any device with any kind of USB port, e.g. a laptop. Oh, and I wanted to mention something else that I'd forgotten about but stumbled across again the other day while debugging servers: Xen supports writing console messages to a special kind of USB port: "console=dbgp" indicates that Xen should use a USB debug port. http://xenbits.xenproject.org/docs/4.11-testing/misc/xen-command-line.html There's more about it in this thread: https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00436.html https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00458.html Further of interest is that Xen also supports writing to both a COM port and the "vga" console simultaneously. Indeed it may support writing to "dbgp" at the same time as well. This is something I looked into for NetBSD/i386 some time ago, but never got it fully working. To me I think it would be super incredibly valuable for the boot code to be able to talk to both a serial port and the "pc" console simultaneously. It is less important for the kernel to do so, but it still would be nice. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpWgRz8XQc4Z.pgp Description: OpenPGP Digital Signature
is this crash while coredumping to NFS known?
I was running a wee test program this morning, which crashed, and it seems the kernel crashed while trying to write the core file out. The current working directory, and the target of the core file, is an NFS mount. The core file was created, but is empty: $ ls -l *.core -rw--- 1 woods ostaff 0 Jul 9 11:17 tpgsqltime.core The system is running my version of 9.99.64, so it's not quite current, and thus I wanted to ask if anyone knows if this particular crash is known of before I send-pr. I think this is the first time I've had a core dump over NFS since updating the kernel from 8.99.32. So I'm not sure yet how easily this is reproduced, but in any case it is a regression. Here's what was on the console: [ 71887.4479952] fatal double fault in supervisor mode [ 71887.4479952] trap type 13 code 0 rip 0x809c5051 cs 0x8 rflags 0x10286 cr2 0x8b827c3e4f98 i 3e4fa0 [ 71887.4479952] curlwp 0x8693578524c0 pid 29079.29079 lowest kstack 0x8b827c3e32c0 kernel: double fault trap, code=0 Stopped in pid 29079.29079 (tpgsqltime) at netbsd:ip_output+0x14: movq %rsi,fe68(%rbp ip_output() at netbsd:ip_output+0x14 tcp_output() at netbsd:tcp_output+0xc68 tcp_send_wrapper() at netbsd:tcp_send_wrapper+0x9a sosend() at netbsd:sosend+0x7e4 nfs_send() at netbsd:nfs_send+0x86 nfs_request() at netbsd:nfs_request+0x3d4 nfs_readrpc() at netbsd:nfs_readrpc+0x204 nfs_doio() at netbsd:nfs_doio+0x731 VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x64 genfs_getpages() at netbsd:genfs_getpages+0x1400 nfs_getpages() at netbsd:nfs_getpages+0x5d VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x80 uvm_fault_internal() at netbsd:uvm_fault_internal+0x1895 trap() at netbsd:trap+0x4e5 --- trap (number 6) --- copyin() at netbsd:copyin+0x2f uiomove() at netbsd:uiomove+0xb7 ubc_uiomove() at netbsd:ubc_uiomove+0x156 nfs_write() at netbsd:nfs_write+0x129 VOP_WRITE() at netbsd:VOP_WRITE+0x65 vn_rdwr() at netbsd:vn_rdwr+0xcc coredump_write() at netbsd:coredump_write+0x56 coredump_elf64() at netbsd:coredump_elf64+0x89c coredump() at 
netbsd:coredump+0x650 sigexit() at netbsd:sigexit+0x27c sendsig() at netbsd:sendsig lwp_userret() at netbsd:lwp_userret+0x1c5 trap() at netbsd:trap+0x9b7 --- trap (number 6) --- 7c5294: ds 23 es 23 fs 0 gs 0 rdi 869202438bc0 rsi 0 rbp 8b827c3e5160 rbx 8693660f4988 rdx 869364f08818 rcx 400 rax 0 r8 0 r9 869364f087b8 r10 869202438bc0 r11 0 r12 869364a93040 r13 a0 r14 869364a930b0 r15 6c rip 809c5051ip_output+0x14 cs 8 rflags 10286 rsp 8b827c3e4fa0 ss 0 netbsd:ip_output+0x14: movq%rsi,fe68(%rbp) db{0}> machine cpu addrdev id flags ipisspl curlwp 0x8163a800 cpu00 30090 8 0x8693578524c0 0x8b825ded cpu14 f0020 0 0x868c2a81e1c0 0x8b825e0ec000 cpu22 f0020 4 0x86934acd26c0 0x8b825e16d000 cpu36 f0020 0 0x868c2ad6c200 0x8b825e19e000 cpu41 f0020 0 0x868c2a9ec340 0x8b825e1cf000 cpu55 f0020 0 0x868c2aa9d080 0x8b825e20 cpu63 f0020 0 0x868c2aa8e100 0x8b825e231000 cpu77 f0020 0 0x868c2ab3f180 db{0}> ps PIDLID S CPU FLAGS STRUCT LWP * NAME WAIT 29079>29079 7 0 100 8693578524c0 tpgsqltime I do have a full kernel core dump, but it's 32GB (345M compressed), and probably contains data I don't want to share. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpcGdanbYs_8.pgp Description: OpenPGP Digital Signature
Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1
At Mon, 6 Jul 2020 23:53:02 +0900, Rin Okuyama wrote:
Subject: Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1
>
> It seems that stride of framebuffer is not correctly set.
>
> Your laptop has an NVIDIA GPU, doesn't it? If so, nouveaufb(4) is used
> instead of genfb(4), which is normally used when booted from UEFI. It
> should be worth trying

Yes, indeed, it has an NVIDIA GeForce 320M.

> userconf disable nouveau*
>
> for UEFI bootloader.

Oh, that sounded so very promising!

However unfortunately it made not one bit of difference.

Thank you for the idea though, and also thank you for pointing out the alternate framebuffer driver that might also be worth looking into.

--
Greg A. Woods
Kelowna, BC +1 250 762-7675 RoboHack
Planix, Inc. Avoncote Farms
Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1
At Sun, 5 Jul 2020 21:09:27 -0700, Brian Buhrow wrote: Subject: Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1 > > Hello. I agree with Mouse, except that I also think it would be very > helpful and useful to have a serial console on USB only devices. Or indeed any device with any kind of USB port, e.g. a laptop. However what would be most generally useful, as opposed to ideal, would be for just the console output to appear on the first found USB serial adapter. So if the kernel can get far enough to probe a USB serial port, then it should dump the message buffer, and continue to copy everything added to the message buffer, to that USB serial device. That's the first and most important step. Make it simple, easy, and obvious how to capture all kernel messages on a modern machine without having to get all the way to the point where one can run "dmesg". Further allowing that port to be attached as the console would be "nice but not quite as necessary". Now ideally the kernel should make the best attempt to identify the first possible USB serial port as early as possible, and attach it as console, so that nothing can be missed, and so that any other bugs in device probing, etc., etc., etc., would not prevent use of DDB on this USB serial console. Even better would be to find out if the platform firmware can do some or all of this, and then to use that code both for the boot loader and the kernel console. E.g. on an EFI system, perhaps through a custom EFI driver? And for uBoot systems too? -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpFUqYZ0ZSmB.pgp Description: OpenPGP Digital Signature
NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1
So, in my ongoing NetBSD on a MacBook saga: NetBSD-7.2 boots fine from USB on the MacBook Pro (MacBook7,1) (with the help of rEFIT on a second USB stick).

NetBSD-8.2 and newer, including the most recent -current, hangs during boot and the kernel messages appear to have torn video:

    http://www.planix.ca/~woods/macbookpro-netbsd-boot-fail.jpg

However today I discovered that NetBSD-8.0 will often boot with the kernel messages properly visible in nice green on black in a full 52(?)-line display, but it hangs or crashes. (It is not reliable at booting though: sometimes the boot loader just hangs without printing anything.)

If the boot loader does work though, and if I boot "normally", it just hangs, with the last message being:

    pci0 at mainbus0 bus0: configuration mode 1

The caps-lock button is dead so I think the machine is well and truly frozen in a CPU loop (the CPU is hot, the fan runs fast).

I'm guessing NetBSD-8.2 and everything more recent is also hanging at this same spot, but with the busted video mode it's hard to tell for sure.

If I boot 8.0 with ACPI turned off (boot option #2 or from the boot prompt "boot -2"), it crashes into ddb after getting a bit further, but there are many errors about not being able to map PCI interrupts.

If I boot 8.0 with "-vx", there are quite a number of "invalid config space" messages after the pci0 attachment:

    pci0 at mainbus0 bus0: configuration mode 1
    acpi0: MCFG: 000:00:0: invalid config space (cfg[0x100]=0x, alias=false)

The second and third numbers change in each following message, and in two of those messages the cfg[0x100] number is 0x.

So it looks like ACPI is necessary, but support for using it in this MacBook7,1 is broken somehow.

I can post a full-res photo of the screen in one or more or all of these states if someone wants to see it.

In any case, what might have been changed after 8.0 that broke the video output? Where do I look? Is amd64 video now the genfb(4) device code? Or is it still vga(4)?
If it's genfb(4), then I do see commits about doing anti-aliasing, and maybe the video junk I see could possibly be explained by such a thing. If I can get 7.2 installed (likely), so that I need only drop a kernel in place instead of building the whole installimage and writing the damn slow USB stick with a whole install image every time, then maybe I'll be able to try bisecting changes to get the video working right again. I really wish modern PC vendors were not still so bloody stupid with their firmware as to make it impossible to talk to them via a serial port of some kind (e.g. a USB serial adapter as console would be awesome!). That said, what would it take to wire the NetBSD console to a USB serial adapter? In lieu of that it would be nice if hitting ^S on the keyboard would at least pause the kernel messages from scrolling by during boot, but I get that such a thing might be a bit hard to arrange for in NetBSD. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpvllmWWoiDK.pgp Description: OpenPGP Digital Signature
USB storage transfers halt when usbdevs is run: hardware bug or software bug?
USB storage device transfers freeze when usbdevs is run: hardware bug
or software bug?

While I was doing a "gzcat < *.gz > /dev/rsd2d", where sd2 was a USB
memory stick, I happened to run "usbdevs -dv" and the writes to the USB
device froze, and indeed the writing process was stuck in the kernel (I
couldn't even stop it with ^Z).

Luckily yanking the stick out seemed to unfreeze and kill the process
and clean everything up nicely and I was able to re-insert it and re-do
the write to it without incident.

This is on an amd64 server running 9.99.64.

Upon removal and subsequent re-insertion the kernel said the following
(but was silent before this when usbdevs ran):

    [ 193334.306434] umass0: BBB reset failed, IOERROR
    [ 193334.306434] umass0: BBB bulk-in clear stall failed, IOERROR
    [ 193334.318288] umass0: BBB bulk-out clear stall failed, IOERROR
    [ 193334.318288] umass0: BBB reset failed, IOERROR
    [ 193334.329223] umass0: BBB bulk-in clear stall failed, IOERROR
    [ 193334.329223] umass0: BBB bulk-out clear stall failed, IOERROR
    [ 193334.341024] umass0: BBB reset failed, IOERROR
    [ 193334.341024] umass0: BBB bulk-in clear stall failed, IOERROR
    [ 193334.351781] umass0: BBB bulk-out clear stall failed, IOERROR
    [ 193334.357775] sd2d: error writing fsbn 4053632 of 4053632-4053759 (sd2 bn 4053632; cn 4021 tn 7 sn 23)
    [ 193334.366963] umass0: BBB reset failed, IOERROR
    [ 193334.366963] umass0: BBB bulk-in clear stall failed, IOERROR
    [ 193334.378283] umass0: BBB bulk-out clear stall failed, IOERROR
    [ 193334.378283] umass0: BBB reset failed, IOERROR
    [ 193334.389225] umass0: BBB bulk-in clear stall failed, IOERROR
    [ 193334.389225] umass0: BBB bulk-out clear stall failed, IOERROR
    [ 193334.401026] umass0: BBB reset failed, IOERROR
    [ 193334.401026] umass0: BBB bulk-in clear stall failed, IOERROR
    [ 193334.411782] umass0: BBB bulk-out clear stall failed, IOERROR
    [ 193334.417780] umass0: BBB reset failed, IOERROR
    [ 193334.417780] sd2(umass0:0:0:0): generic HBA error
    [ 193334.426444] sd2: detached
    [ 193334.426444] scsibus1: detached
    [ 193334.426444] umass0: detached
    [ 193334.436445] umass0: at uhub6 port 2 (addr 5) disconnected

reinsertion:

    [ 193341.516925] umass0 at uhub6 port 2 configuration 1 interface 0
    [ 193341.516925] umass0: SMI Corporation (0x090c) USB DISK (0x1000), rev 2.00/11.00, addr 5
    [ 193341.526926] umass0: using SCSI over Bulk-Only
    [ 193341.526926] scsibus1 at umass0: 2 targets, 1 lun per target
    [ 193342.366983] sd2 at scsibus1 target 0 lun 0: disk removable
    [ 193342.376985] sd2: 7712 MB, 15744 cyl, 16 head, 63 sec, 512 bytes/sect x 15794176 sectors
    [ 193342.386986] sd2: GPT GUID: d1e3490c-b0e6-42e9-9d9e-3ac286a0f7e0
    [ 193342.396989] dk6 at sd2: "EFI system", 262144 blocks at 2048, type: msdos
    [ 193342.396989] dk7 at sd2: "d3aa0396-d911-4aac-baa8-f2478557d31a", 7544832 blocks at 264192, type: ffs

I'm guessing it's a software bug with bad locking order somewhere.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
Why is (nb)ctfmerge failing when linking larger kernels???
At Wed, 01 Jul 2020 17:57:08 -0700, "Greg A. Woods" wrote:
Subject: weird occasional "Resource exhaustion" errors when linking GENERIC_KASLR
>
> I've been using a stock 9.0 amd64 install to build my -current tree and
> found it failing with a "Resource exhaustion" error (also "Out of
> memory") when linking the GENERIC_KASLR kernel.

So even in 9.99.64 ctfmerge fails, especially with the ALL kernel
(though I must admit I haven't tried to build an amd64 ALL kernel for
perhaps a year or so):

    link ALL/netbsd
    NetBSD 9.99.64 (ALL) #0: Thu Jul 2 17:29:25 PDT 2020
        text      data     bss       dec     hex filename
    80120264 174291832 8122368 262534464 fa5f540 netbsd
    ERROR: nbctfmerge: netbsd.ctf: Cannot finalize temp file: Resource exhaustion: Cannot allocate memory
    --- netbsd ---
    *** [netbsd] Error code 1

    nbmake: stopped in /build/woods/xentastic/current-amd64-amd64-obj/build/src-current/sys/arch/amd64/compile/ALL
    1 error

    $ ulimit -a
    time(cpu-seconds)    unlimited
    file(blocks)         unlimited
    coredump(blocks)     unlimited
    data(kbytes)         8388608
    stack(kbytes)        32768
    lockedmem(kbytes)    512000
    memory(kbytes)       4096000
    nofiles(descriptors) 3404
    processes            420
    threads              2048
    vmemory(kbytes)      2097152
    sbsize(bytes)        unlimited

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
weird occasional "Resource exhaustion" errors when linking GENERIC_KASLR
I've been using a stock 9.0 amd64 install to build my -current tree and
found it failing with a "Resource exhaustion" error (also "Out of
memory") when linking the GENERIC_KASLR kernel.

Here I leant on ^T while it built and this is the last message before
it died (with the "nbmake" lines edited out):

    [ 155166.4979147] load: 1.54 cmd: nbctfmerge 7370 [iowait 0x45fc5f/4] 46.23u 1.99s 148% 1628476k
    Out of memory

Again, but with the different error message:

    [ 155250.5722602] load: 1.41 cmd: nbctfmerge 18682 [iowait 0x45fc5f/7] 46.18u 1.43s 152% 1444080k
    ERROR: nbctfmerge: netbsd: Cannot get sect .debug_line.1 data: Resource exhaustion

Then without "warning" it will ramp up to near twice as much memory and
just work A-OK:

    [ 155591.1865138] load: 0.81 cmd: nbctfmerge 15691 [iowait 0x42522a/4] 46.28u 3.71s 142% 2382048k
    [ 155591.2765553] load: 0.81 cmd: nbctfmerge 15691 [iowait 0x42522a/4] 46.28u 3.80s 142% 2382048k
    [ 155591.3665934] load: 0.81 cmd: nbctfmerge 15691 [iowait 0x45ab1a/5] 46.28u 3.89s 142% 2076944k
    [ 155591.4566282] load: 0.81 cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 3.98s 142% 0k
    [ 155591.543] load: 0.81 cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 4.07s 142% 0k
    [ 155591.6467075] load: 0.82 cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 4.16s 140% 0k
    [ 155591.7367458] load: 0.82 cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 4.25s 140% 0k
    mv -f netbsd netbsd.gdb
    /build/woods/xentastic/current-amd64-amd64-tools/bin/x86_64--netbsd-strip -g -o netbsd netbsd.gdb

This did not happen with the exact same source tree when building on
either an 8.99.32 or 9.99.64 system running in a Xen domU on similar
hardware.

For the record, thinking this might be an rlimit issue, I opened things
up to the max, but even with these limits the link often fails:

    $ ulimit -a
    time(cpu-seconds)    unlimited
    file(blocks)         unlimited
    coredump(blocks)     unlimited
    data(kbytes)         8388608
    stack(kbytes)        32768
    lockedmem(kbytes)    524288
    memory(kbytes)       2048000
    nofiles(descriptors) 3404
    processes            420
    threads              2048
    vmemory(kbytes)      2097152
    sbsize(bytes)        unlimited

Also for the record, this is 9.0/amd64 running on a bare machine with 8
cores, 32GB of RAM, and everything is on local filesystems:

    NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020
        mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
    total memory = 32762 MB
    avail memory = 31788 MB
    Dell Inc. PowerEdge 2950
    cpu7: Intel(R) Xeon(R) CPU X5460 @ 3.16GHz, id 0x10676
    mfi0: PERC 6/i Integrated version 6.3.3.0002
    mfi0: logical drives 2, 256MB RAM, BBU type BBU, status good
    scsibus0 at mfi0: 64 targets, 8 luns per target
    sd0 at scsibus0 target 0 lun 0: disk fixed
    sd0: 465 GB, 476416 cyl, 64 head, 32 sec, 512 bytes/sect x 975699968 sectors
    sd1 at scsibus0 target 1 lun 0: disk fixed
    sd1: 544 GB, 557568 cyl, 64 head, 32 sec, 512 bytes/sect x 1141899264 sectors

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
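The ^T status lines quoted in the message above can be tabulated per
process to see how large each nbctfmerge run had grown when it was last
sampled. What follows is only a rough sketch in Python, not part of the
original report: the function name peak_kbytes is made up for
illustration, and it simply takes the trailing "k" field of each status
line at face value (that this field is the resident size in kilobytes
is an assumption).

```python
import re

# Matches the kernel status lines quoted above, for example:
# "[ 155166.4979147] load: 1.54 cmd: nbctfmerge 7370 [iowait 0x45fc5f/4] 46.23u 1.99s 148% 1628476k"
STATUS_RE = re.compile(
    r"cmd: (?P<cmd>\S+) (?P<pid>\d+) \[[^\]]*\] "
    r"(?P<user>[\d.]+)u (?P<sys>[\d.]+)s (?P<cpu>\d+)% (?P<kbytes>\d+)k"
)


def peak_kbytes(log_text: str) -> dict[int, int]:
    """Map each PID to the largest trailing 'k' value seen for it.

    Hypothetical helper: the meaning of the 'k' field (assumed to be
    resident memory in kilobytes) is not confirmed by the message.
    """
    peaks: dict[int, int] = {}
    for match in STATUS_RE.finditer(log_text):
        pid = int(match.group("pid"))
        peaks[pid] = max(peaks.get(pid, 0), int(match.group("kbytes")))
    return peaks


if __name__ == "__main__":
    # A few of the lines from the message, one per run.
    sample = """\
[ 155166.4979147] load: 1.54 cmd: nbctfmerge 7370 [iowait 0x45fc5f/4] 46.23u 1.99s 148% 1628476k
[ 155250.5722602] load: 1.41 cmd: nbctfmerge 18682 [iowait 0x45fc5f/7] 46.18u 1.43s 152% 1444080k
[ 155591.1865138] load: 0.81 cmd: nbctfmerge 15691 [iowait 0x42522a/4] 46.28u 3.71s 142% 2382048k
[ 155591.4566282] load: 0.81 cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 3.98s 142% 0k
"""
    for pid, kbytes in sorted(peak_kbytes(sample).items()):
        print(f"pid {pid}: peak {kbytes}k")
```

The per-PID peaks can then be set beside the "ulimit -a" values quoted
in the message, keeping in mind that the two may not measure the same
thing.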
postinstall removed yet another "obsolete" system library that was still used....
So I just upgraded a system from an old 8.99 -current to a newer 9.99
-current and "postinstall fix obsolete" removed my
/usr/lib/libgomp.so.1*

However this library was still in use by installed packages (due, I
think, to a dependency of libgd on libgomp, thus every gd-using package
is now G.D. broke)!

I propose that the rule documented in src/distrib/lists/base/shl.mi be
far more strictly observed, even for libraries that appear and
disappear between releases (i.e. for -current), at least for the
".major" link and the file it points to.

If they were never there in a release, never mentioning them as
obsolete in releases should be just fine (i.e. they were never there,
so never mentioning them is the correct thing to do).

On the other hand we could first fix postinstall to be more careful by
getting it to fetch all the "REQUIRES" values from package BUILD_INFO
like this:

    pkg_info -a -Q REQUIRES | sort -u

and then have it noisily refuse to remove any obsolete file still in
this "required" list. This would allow us to mention all old/upgraded
shared libraries as obsolete, including those from between releases.

Of course this only protects things installed via pkgsrc, and there's
still the risk of subsequently needing to install a binary package
built for an older release which needs one of these "obsolete" files,
but at least pkg_add can (be made to if it doesn't already) notice this
and abort.

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms
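The check proposed in the message above could look roughly like the
following. This is a minimal sketch in Python for illustration only
(postinstall itself is a shell script and is not written this way). The
function names are made up, and it assumes that each line printed by
"pkg_info -a -Q REQUIRES" is one absolute path; the file name
libold.so.0 in the sample data is hypothetical.

```python
import subprocess


def required_files(pkg_info_output: str) -> set[str]:
    """Collect the unique non-empty lines of `pkg_info -a -Q REQUIRES` output.

    Assumption: each line is a single absolute path.
    """
    return {line.strip() for line in pkg_info_output.splitlines() if line.strip()}


def split_obsolete(obsolete: list[str], required: set[str]) -> tuple[list[str], list[str]]:
    """Return (removable, refused) for a list of obsolete candidates.

    'refused' holds the files that an installed package still requires.
    """
    removable = [path for path in obsolete if path not in required]
    refused = [path for path in obsolete if path in required]
    return removable, refused


def query_pkg_info() -> str:
    """Run the command from the message; needs pkg_info on the PATH."""
    result = subprocess.run(
        ["pkg_info", "-a", "-Q", "REQUIRES"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical sample standing in for real pkg_info output.
    sample = "/usr/lib/libgomp.so.1\n/usr/lib/libc.so.12\n/usr/lib/libgomp.so.1\n"
    removable, refused = split_obsolete(
        ["/usr/lib/libgomp.so.1", "/usr/lib/libold.so.0"],
        required_files(sample),
    )
    for path in refused:
        print(f"refusing to remove {path}: still required by an installed package")
    print("removable:", removable)
```

The sketch only matches exact paths. A real check would also have to
protect the file that a required ".major" link points to, as the
message notes.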
Re: unable to boot NetBSD-9.99.64-amd64-install.img on a MacBook7,1
At Sat, 13 Jun 2020 22:03:39 -0700, "Greg A. Woods" wrote:
Subject: Re: unable to boot NetBSD-9.99.64-amd64-install.img on a MacBook7,1
>
> At Tue, 09 Jun 2020 22:01:41 -0700, "Greg A. Woods" wrote:
> Subject: unable to boot NetBSD-9.99.64-amd64-install.img on a MacBook7,1
> >
> > Most interestingly if I do some playing at the boot prompt first such
> > that there is lots of white text in the small centre area, then try to
> > boot, the lines of green dots overwrite the top about 1/3 of the screen
> > leaving the lower portion of the white boot loader text still visible:
> >
> > http://www.planix.ca/~woods/macbookpro-netbsd-boot-fail.jpg
>
> Same goes for today's snapshot from:
>
> https://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/202006131940Z/images/NetBSD-9.99.66-amd64-install.img.gz

Would knowing anything about how FreeBSD works on this machine help
figure out why NetBSD doesn't?

I have FreeBSD 12.1 installed and (mostly) working (the nvidia driver
crashes it when starting X).

--
Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms