Re: GitHub mirror stopped mirroring

2024-01-29 Thread Greg A. Woods
At Sun, 28 Jan 2024 08:22:49 +, Chris Pinnock  wrote:
Subject: Re: GitHub mirror stopped mirroring
> 
> >  The Mercurial mirror also hasn't been updated for a week.
> >  Ngā mihi, Lloyd
> >
> 
> Hi - someone was looking at this yesterday. Mercurial syncing
> again. KRgds, C

It doesn't seem to have made it to anonhg.NetBSD.org yet.  The src repo
there is still 11 days out of date, as of course is the GitHub version.

-- 
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes

2024-01-03 Thread Greg A. Woods
At Tue, 20 Apr 2021 16:53:58 -0700, "Greg A. Woods"  wrote:
Subject: Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes
>
> With the gracious help of RVP  I have been able to identify
> better what is actually going wrong with FreeBSD's access to NetBSD dom0
> xbdback(4) storage.
>
> It seems that in certain circumstances (e.g. in newfs and the test
> program) whenever FreeBSD issues a read of more than 1024 bytes only the
> first 1024 bytes are correct -- the rest of the bytes returned come from
> somewhere else on the disk, which appears to be starting at six(6)
> sectors after where they were supposed to have come from.  Note that
> this corresponds to exactly 4096 bytes offset from the beginning of the
> read.
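
A quick check of that arithmetic, as a minimal Python sketch.  It assumes
512-byte sectors (not stated above, but the size for which six sectors past
the 1024-byte mark lands exactly on 4096); the function and constant names
are illustrative only:

```python
SECTOR_SIZE = 512   # assumption: 512-byte sectors
GOOD_BYTES = 1024   # bytes at the start of each read that come back correct
SKEW_SECTORS = 6    # observed displacement of the remaining bytes

def observed_source_offset(read_offset):
    """Map an offset within the read to the offset the data appears to come from."""
    if read_offset < GOOD_BYTES:
        return read_offset
    return read_offset + SKEW_SECTORS * SECTOR_SIZE

if __name__ == "__main__":
    for off in (0, 1023, 1024, 2048):
        print(off, "->", observed_source_offset(off))
```

Under that assumption the first wrong byte (offset 1024 within the read)
appears to come from offset 4096, matching the figure quoted above.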

Reviving this old thread with some new info.

It seems ZFS either doesn't issue large read requests or works around
the problem in some other way.

With the help of a custom FreeBSD kernel with ZFS compiled in, and
booting it as a PVH domU kernel, and with the new(ish) FreeBSD (14.0)
way of installing with a ZFS root, I have a couple of domUs running just
fine now, one even recovered old zpools on the machine where I first
experienced this problem!

As soon as possible, especially if I can dredge up another test server,
I'll test plain UFS again with a NetBSD 10.0_RC2 kernel as dom0.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Status of NetBSD virtualization roadmap - support jails like features?

2022-04-15 Thread Greg A. Woods
At Fri, 15 Apr 2022 07:36:15 +0200, Matthias Petermann wrote:
Subject: Status of NetBSD virtualization roadmap - support jails like features?
>
> My motivation: I am looking for a particularly high performance
> virtualization solution on NetBSD. Especially disk and network IO
> plays a role for me.

In my experience nothing beats I/O performance of Xen with LVM in the
dom0 and the best/fastest storage available for the dom0, especially now
there's SMP support for dom0.  That's anecdotal though -- I haven't done
any real comparisons.  I just know that NFS in domUs is a lot slower
than using LVMs via xbd(4), no matter where/how-fast the NFS server is!

If I'm not too far out of touch I think there's still a wee bit more SMP
support needed in the networking code to make it possible for dom0 to
also give the best network throughput, but it's really not horrible as-is.


In theory NVMM with QEMU and virtio(4) should be about the same I would
guess, with potential for some improvement in some micro-benchmarks, but
for production use the maturity and completeness of the provisioning
support offered by Xen still seems far superior to me.


> Regardless, I still think it wouldn't hurt
> if NetBSD could implement some sort of
> jail.

I'm not convinced "jails" (at least in the FreeBSD form I'm most
familiar with) actually buy much without also increasing complexity
and/or introducing limitations on both the provisioning and the
"virtual" side.

With full virtualisation as in Xen the added complexity is very well
partitioned between the provisioning side and the VMs, and there are
almost no limitations inside the VMs (assuming you are virtualising
something that fits well into a virtualised environment, i.e. with no
special direct hardware access needs).  Everything looks and feels and
is managed almost as if it were running on bare hardware, so the
management of the VM is exactly as if it were running on separate
hardware; except of course some aspects are actually easier to manage,
such as provisioning direct console access and control.  There's really
nothing new to learn other than how to spin up a new domU (and possibly
how to use LVM effectively).

However FreeBSD-style jails do offer their own form of flexibility that
seems to be worth having available, and it would be nice for jails to be
available on NetBSD as well.  The impact inside the OS (kernel and
userland) is quite high though, and is a complexity nightmare all its
own, though perhaps not so horrible as Linux "cgroups" and some other
related Linux kernel namespaces are.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Are NetBSD users interested extending options for patch?

2022-04-11 Thread Greg A. Woods
At Mon, 11 Apr 2022 21:03:02 +0200, Hans Petter Selasky wrote:
Subject: Are NetBSD users interested extending options for patch?
>
> https://reviews.freebsd.org/D30160

As a user with some extensive background in making and using patch
files, I can't imagine that feature ever being useful; rather, I would
find it dangerous, if not simply useless.

Patch already has '-p N', and in my experience that has covered most of
the cases where a similar problem actually occurs.

In all other cases (which are very few) I've found that it is trivial to
edit the patch, often in a pipeline with a simple 'sed' command (e.g. in
cases where the pathnames in the patch need a prefix applied or changed,
instead of simply stripping it with '-p').
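
The kind of pathname edit meant here only has to touch the "---" and "+++"
header lines of a unified diff.  A minimal sketch of it, in Python rather
than sed: the function name is made up for illustration, and it naively
assumes that no line inside a hunk happens to start with "--- " or "+++ "
followed by the old prefix.

```python
import re

def reprefix_patch(patch_text, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix on unified-diff '---'/'+++' header lines."""
    header = re.compile(r"^(--- |\+\+\+ )" + re.escape(old_prefix), re.MULTILINE)
    # A function is used as the replacement so that backslashes in
    # new_prefix are never interpreted as regex escapes.
    return header.sub(lambda m: m.group(1) + new_prefix, patch_text)
```

The rewritten text can then be fed to patch(1) as usual; only the header
lines change, so the hunks themselves are applied untouched.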

I would expect any heuristic that automatically searches for files by
simply matching their basename to be very unreliable, and just as likely
to find the wrong file as the right one -- at least in the general case.

Say, for example, a patch contains a lot of changes to "Makefile" files
in many different directories.  How is this hack supposed to help find
the right one (e.g. if a directory containing a "Makefile" was renamed)?

Perhaps as mentioned in a comment on that post it may be useful in some
very specific cases where files aren't likely to move around too much
and where all files are guaranteed to be uniquely named and never
renamed despite being moved about between directories.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: xterm-color256: Different behavior between NetBSD 9.2 and 9.99.93?

2022-02-03 Thread Greg A. Woods
I've recently been trying to debug this same problem, and I had been
gathering info until I got side-tracked onto X11 hi-res monitor issues.

At Thu, 3 Feb 2022 16:28:00 +0300, Valery Ushakov  wrote:
Subject: Re: xterm-color256: Different behavior between NetBSD 9.2 and 9.99.93?
> 
> On Thu, Feb 03, 2022 at 14:15:45 +0100, Martin Husemann wrote:
> 
> > Bug in the terminfo compiler?
> 
> http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/tic/tic.c#rev1.39
> 
> sounds like it might be related.

As I understand the code the 1.39 fix to promote older compiled
entries being included with "use=" shouldn't affect anything if the
source database is already in the newest format, no?

As I understand things, the problem is that tic(1) isn't incorporating
"use=" entries using the correct algorithm.

The value of the "colors" capability is just a part of the symptom.
The "infocmp -1 xterm-256color" output from NetBSD and from a system
using ncurses should be identical, but at the moment a careful
comparison shows several differences, and examining the terminfo source
file suggests, to me at least, that the order of processing of the
"use=" entries is wrong.
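
One way to make that comparison mechanical is to save the "infocmp -1"
output on each system and compare the capability lines as sets.  A rough
Python sketch, assuming the usual "-1" layout (comment lines starting with
"#", one header line of terminal names, then one capability per line ending
in a comma); the function names are illustrative, and differences that are
only a matter of number formatting would still show up as differences:

```python
def parse_infocmp(text):
    """Return the set of capability strings from 'infocmp -1' style output."""
    caps = set()
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if not seen_header:
            seen_header = True      # first real line holds the terminal names
            continue
        if line.endswith(","):
            line = line[:-1]        # drop the field terminator
        caps.add(line)
    return caps

def compare_infocmp(text_a, text_b):
    """Return (capabilities only in a, capabilities only in b), each sorted."""
    a, b = parse_infocmp(text_a), parse_infocmp(text_b)
    return sorted(a - b), sorted(b - a)
```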

The proper algorithm, as I understand it, is to scan right-to-left for
"use=" capabilities, and to rescan after each new entry has been
inserted to replace the "use=" capability.

This algorithm is fairly clearly described in the ncurses terminfo(5)
manual page, and in order to handle the ncurses terminfo source file
more-or-less as-is, one must presumably implement the ncurses "use="
merging algorithm faithfully.
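
As a rough illustration of that merge order (a sketch of the idea only, in
Python rather than C, and neither the tic(1) nor the ncurses code): the
rightmost "use=" is merged first, entries further left then override it,
and capabilities given explicitly in the entry override everything pulled
in through "use=".  Recursion stands in for the rescan after each
replacement.  Treating a cancelled capability ("cap@") as just another
overriding value that is dropped at the very end is a simplifying
assumption, and the tiny entries in the demo are made up:

```python
def split_cap(cap):
    """Split a capability into (name, kind); kind is '=', '#', '@' or ''."""
    for i, ch in enumerate(cap):
        if ch in "=#@":
            return cap[:i], ch
    return cap, ""                  # plain boolean capability

def merge_entry(name, db, _stack=()):
    """Return {cap name: cap} for entry `name`, with nested use= expanded."""
    if name in _stack:
        raise ValueError("use= loop through " + name)
    explicit = [c for c in db[name] if not c.startswith("use=")]
    uses = [c[4:] for c in db[name] if c.startswith("use=")]
    merged = {}
    # Rightmost use= first; anything merged later overrides it.
    for used in reversed(uses):
        merged.update(merge_entry(used, db, _stack + (name,)))
    # Explicit capabilities win over everything brought in by use=.
    for cap in explicit:
        merged[split_cap(cap)[0]] = cap
    return merged

def resolve(name, db):
    """Fully merged entry with cancelled ('cap@') capabilities removed."""
    merged = merge_entry(name, db)
    return {k: v for k, v in merged.items() if split_cap(v)[1] != "@"}

if __name__ == "__main__":
    demo = {                        # hypothetical entries, not real terminfo
        "base": ["am", "cols#80", "colors#8"],
        "plus256": ["colors#256", "pairs#65536"],
        "myterm": ["cols#132", "use=plus256", "use=base"],
    }
    print(resolve("myterm", demo))
```

With these made-up entries "myterm" ends up with the "colors" value from
"plus256" rather than the one from "base", because "use=base" is further
right and so has the lowest priority.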

These comments are as far as I've got in diagnosing things in the NetBSD
sources:

--- tic.c.~1.40.~   2020-05-30 17:44:04.0 -0700
+++ tic.c   2022-01-06 17:53:47.893092115 -0800
@@ -424,6 +424,7 @@
 		rtic = term->tic;
 		basename = _ti_getname(TERMINFO_RTYPE_O1, rtic->name);
 		promoted = false;
+		/* XXX this does the use= merging the wrong way!?!?!? */
 		while ((cap = _ti_find_extra(rtic, &rtic->extras, "use"))
 		    != NULL) {
 			if (*cap++ != 's') {
@@ -684,6 +685,7 @@
 	free(tbuf.buf);
 
 	/* Merge use entries until we have merged all we can */
+	/* XXX this doesn't do properly nested merging!!! */
 	while (merge_use(flags) != 0)
 		;
 

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)

2022-02-01 Thread Greg A. Woods
At Thu, 27 Jan 2022 10:40:20 +0100, Martin Husemann  wrote:
Subject: Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
>
> On Wed, Jan 26, 2022 at 10:56:53PM -0800, Greg A. Woods wrote:
> > Well, if you have a hardware RNG, or my patches, then that'll do
> > something, but otherwise it's just useless noise and misdirection.
>
> This is not true. Once there is enough entropy gathered (or the system
> has been told the administrator considers it good enough), everything is
> fine and basically the same state as before the changes you want to back
> out (at least from a userland perspective).

That's not my experience, though I am not quite at -current.

One thing I found I had to work around was that feeding a random number
as entropy through /dev/random didn't work unless I re-enabled so-called
"estimation" for that device (via rndctl), and I don't think that was
due to any of my changes.

Note that the "seed" device is trusted in the code as if it were a
hardware random number generator so it has "collection" and "estimation"
enabled by default.

Keep in mind also that not all ways of booting NetBSD allow for
"rndseed", including Xen domUs.

What my patches do is re-enable the ability of rndctl to (re-)enable
"collection" and/or "estimation" for other devices that have calls to
submit values and/or timestamps to entropy collection.

This means the following can be added to /etc/rc.conf on, for example,
Xen domU systems and they can come up to full entropy in the good old
fashioned way without suffering from lack of any way to insert entropy
with "rndseed":

	rndctl=YES
	rndctl_flags="-t disk; -t vm"	# optional: "-t net"

I have some tentative patches to make this all actually work for domUs
in sysinst too in the way your message discussed, but I've had an
extremely difficult time getting that to work in any kind of
user-friendly way.  It took hours of code walk-through just to figure
out what was really expected of the user.  Also for other reasons
(e.g. cloning domUs), I think the "rndctl" way is both easier and more
secure (assuming all the regular things about security between domUs on
the same server).

My patches also mean systems without hardware RNG devices can do the
same, and indeed my Dell servers do just fine accumulating their own
entropy after boot without rndseed and without even any "rndctl" setup
as they have fan speed and voltage monitors in their environmental
sensors, and my patches re-enable default collection and estimation for
such trusted devices.

Personally I find the way the current kernel handles entropy,
i.e. without my patches, to be obnoxious, condescending, and ignorant.
Perhaps that view might cause some to consider me to be the same way,
but I can easily live with that.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)

2022-01-26 Thread Greg A. Woods
At Wed, 26 Jan 2022 16:47:15 +1300, Lloyd Parkes wrote:
Subject: Re: the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)
>
> The change was more subtle than that I
> think. Untrusted hardware was used as an
> entropy source, but it didn't count
> towards the "enough" that was needed to
> bootstrap the rnd system from nothing.

No, not quite -- there was a whole bunch of code removed that is needed
to actually make the hardware events "count" if and when you configure
them to do so.

> On 7 May 2020 a change was committed to
> /etc/rc.d/random_seed so that a seed file
> is created at boot time if you don't
> already have one. I haven't checked
> because I really can't be bothered right
> now, but I'm pretty sure that's all that's
> required.

Well, if you have a hardware RNG, or my patches, then that'll do
something, but otherwise it's just useless noise and misdirection.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




the entropy bug, and device timeouts (was: Note: two files changed and hashes/signatures updated for NetBSD 8.1)

2022-01-24 Thread Greg A. Woods
At Mon, 24 Jan 2022 08:46:36 +, "Thomas Mueller" wrote:
Subject: Re: Note: two files changed and hashes/signatures updated for NetBSD 8.1
>
> Does there look to be a fix in the entropy bug?
>
> This bug relates to entropy and how it impedes building many packages
> in pkgsrc.
>
> I seemed to get around this bug on one computer but not the other.

I have fixes that restore the previous option to use "untrusted"
hardware as an entropy source.  They may need some updating to be truly
complete in the most recent -current, as I'm still back at 9.99.81.

However I've little hope that my patches will be accepted back into the
main source tree, since there seems to be some crazy un-bendable
insistence on perfect security of all randomness, even for private
machines, embedded systems, and so on.


> Other bug is longer-standing and plagued me in NetBSD 8.99.51 and
> again in 9.99.82.
>
> Do there look to be improvements in how NetBSD handles hard drives
> that would be affected by that bug?
>
> That bug causes device timeouts on some types of hard drive but not
> all.

I can't imagine how the entropy issues could be related in any way to
disk device driver timeouts.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: backward compatibility: how far can it reasonably go?

2021-12-08 Thread Greg A. Woods
At Wed, 8 Dec 2021 11:36:17 -0800, Jason Thorpe  wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
>
> > On Dec 8, 2021, at 10:52 AM, Greg A. Woods wrote:
> >  That's one bullet I've dodged entirely already since my oldest
> > systems are running netbsd-5 stable.  (Though in theory isn't
> > there supposed to be COMPAT support for SA?)
>
> int
> compat_60_sys_sa_register(lwp_t *l,
>     const struct compat_60_sys_sa_register_args *uap, register_t *retval)
> {
> 	return sys_nosys(l, uap, retval);
> }
>
> SA is one of those things that's REALLY hard to provide
> compatibility for.

:-)  I see!

Yes, I can appreciate that SA isn't easily maintained in any way.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: backward compatibility: how far can it reasonably go?

2021-12-08 Thread Greg A. Woods
At Wed, 08 Dec 2021 11:08:09 -0500, Brad Spencer  wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> When I took a system from 4.0 to 7.x some time ago, the only thing that
> I had problems with was anything that used scheduler activations since
> that had been removed.  For me this only effected stuff from pkgsrc, as
> I also rolled in new userland at the same time.

That's one bullet I've dodged entirely already since my oldest systems
are running netbsd-5 stable.  (Though in theory isn't there supposed to
be COMPAT support for SA?)

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: backward compatibility: how far can it reasonably go?

2021-12-08 Thread Greg A. Woods
At Wed, 8 Dec 2021 15:32:24 -, ya...@sdf.org wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> > "Greg A. Woods"  writes:

no, Greg Troxel wrote:

> > I am unclear if ipf has been removed by default from current.

> Even in NetBSD 9, ipf is not in the GENERIC kernel config.

Well I'm running in Xen domUs, so not GENERIC but XEN3_DOMU, and indeed
I'm running all custom kernel builds.


> Was the kernel compiled to use ipf?

Clearly IPF is in the 9.99.81 kernel I booted, since its functions are
visible in the backtrace of the crash :-)

If it were not compiled in, I think/hope it would not crash -- just the
ipf tool would return an error and complain about something like ENXIO
or maybe ENODEV.  So if IPF were the only problem I would try taking it
out temporarily, but with ifconfig also useless, I'll probably try the
upgrade from the dom0.


> e.g. add to kernel config:
> options IPFILTER_LOG# ipmon(8) log support
> options IPFILTER_LOOKUP # ippool(8) support
> options IPFILTER_COMPAT # Compat for IP-Filter
> pseudo-device   ipfilter# IP filter (firewall) and NAT

Yes, all there (and BRIDGE_IPF as well, though I haven't used that
feature yet, and it would likely only be needed in the dom0)

Indeed an identical kernel is already running IPF in another domU
instance, but of course with the corresponding 9.99.81 userland.  It
works as well as ever -- I use it with blocklistd, as well as for basic
firewalling (most of my systems are mostly on a private network with
only one or two ports forwarded to them from the main firewall and so
otherwise using the main FW's NAT for outgoing connections only).

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: backward compatibility: how far can it reasonably go?

2021-12-07 Thread Greg A. Woods
At Tue, 7 Dec 2021 20:37:26 -0800 (PST), Paul Goyette  wrote:
Subject: Re: backward compatibility: how far can it reasonably go?
>
> Without looking at the details of your backtrace, the issue with
> ifconfig(8) could be related to PRs kern/54150 and/or kern/54151.

Aw, damn, my memory is too short!  Thanks for reminding me of those!

The kernel crash was IPF-related, and in my test back then I was testing
on an i386 machine which at the time, IIRC (and we know what that might
mean), was not running IPF.

Anyway, the two machines I'm upgrading do need to run IPF, at least
until they are running a new OS with new pkgs.

I'm beginning to think the only way to avoid that rabbit hole in order
to get these upgrades done in the next week will be to shut them down
and do the upgrades by mounting their filesystems in their dom0s.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




backward compatibility: how far can it reasonably go?

2021-12-07 Thread Greg A. Woods
So I've got a couple of old but important machines (Xen amd64 domUs)
running NetBSD-5, and I've finally decided that I'm reasonably well
prepared to try upgrading them.

However it seems a "modern" (9.99.81, -current from about 2021-03-10)
kernel with COMPAT_40 isn't able to run some of the userland on those
systems.

Is this something that should work?

If it should, I think it would make the upgrade much easier, as I could
then plop down the new userland and run etcupdate.  (There are of course
alternative ways to do the upgrade, eased by the fact they are domUs (*).)

The most immediate problems I noticed are with networking.  ifconfig -a
returns without printing anything, and trying to enable IPF crashes:

Enabling ipfilter.
[  90.1912601] panic: kmem_free(0xd000108870c0, 697) != allocated size 18374686479671623680; overwrote?
[  90.1912601] cpu3: Begin traceback...
[  90.1922525] vpanic() at netbsd:vpanic+0x14a
[  90.1922525] snprintf() at netbsd:snprintf
[  90.1922525] kmem_alloc() at netbsd:kmem_alloc
[  90.1932517] frrequest() at netbsd:frrequest+0x100
[  90.1932517] ipf_ipf_ioctl() at netbsd:ipf_ipf_ioctl+0x37d
[  90.1932517] ipfioctl() at netbsd:ipfioctl+0x9a
[  90.1942516] cdev_ioctl() at netbsd:cdev_ioctl+0x81
[  90.1942516] VOP_IOCTL() at netbsd:VOP_IOCTL+0x3e
[  90.1942516] vn_ioctl() at netbsd:vn_ioctl+0xad
[  90.1952515] sys_ioctl() at netbsd:sys_ioctl+0x555
[  90.1952515] syscall() at netbsd:syscall+0x9c
[  90.1952515] --- syscall (number 54) ---
[  90.1952515] netbsd:syscall+0x9c:
[  90.1952515] cpu3: End traceback...
[  90.1952515] fatal breakpoint trap in supervisor mode
[  90.1952515] trap type 1 code 0 rip 0x8022d93d cs 0xe030 rflags 0x202 cr2 0x7a0d38c36020 ilevel 0 rsp 0xd0018da561b0
[  90.1952515] curlwp 0xdf5468c0 pid 184.184 lowest kstack 0xd0018da522c0
Stopped in pid 184.184 (ipf) at netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x14a
snprintf() at netbsd:snprintf
kmem_alloc() at netbsd:kmem_alloc
frrequest() at netbsd:frrequest+0x100
ipf_ipf_ioctl() at netbsd:ipf_ipf_ioctl+0x37d
ipfioctl() at netbsd:ipfioctl+0x9a
cdev_ioctl() at netbsd:cdev_ioctl+0x81
VOP_IOCTL() at netbsd:VOP_IOCTL+0x3e
vn_ioctl() at netbsd:vn_ioctl+0xad
sys_ioctl() at netbsd:sys_ioctl+0x555
syscall() at netbsd:syscall+0x9c
--- syscall (number 54) ---
netbsd:syscall+0x9c:
ds  61c0
es  6170
fs  61b0
gs  10
rdi 0
rsi d0018da55f5c
rbp d0018da561b0
rbx 1
rdx 2
rcx 0
rax 0
r8  1
r9  1
r10 0
r11 fffe
r12 104
r13 8063bb30 ostype+0x36eb8
r14 d0018da561f8
r15 3
rip 8022d93d breakpoint+0x5
cs  e030
rflags  202
rsp d0018da561b0
ss  e02b
netbsd:breakpoint+0x5:  leave
db{3}>


(*) alternatives

Now since these are domUs and their dom0 is also NetBSD I could also
upgrade them "in absentia" so to speak, i.e. drop a new userland on
their filesystems from the dom0, though this seems more scary somehow.
I guess it shouldn't be since the dom0 and other test systems are
already running what I want them to run.

Or, given they are relatively cleanly configured filesystem-wise
(esp. with a separate /usr/pkg, /home, etc.), I could also build new
prototype systems, copy over the /etc files and old shared libraries
from the old system to the new prototype, run etcupdate on the new
prototype, shut down the old system, re-assign the other filesystems
(/var, /usr/pkg, /home, /work, etc.) to the prototype, reboot the
prototype with the old system's name and address, and finally patch up
and/or rebuild whatever is needed in /var.

The key thing is that I want to be able to upgrade pkgs piecemeal since
I'm sure there will be some hiccups and reconfigs required along the
way.

Note that most everything is static-linked on these systems.  The base
system is 100% static linked (except for ld.elf_so itself) though of
course there still are a few baroque packages which require
dynamic-loaded code so I will still need to be very careful to preserve
all old shared libraries.  That makes the approach of building a fresh
prototype somewhat more difficult, though ultimately perhaps safest as
it can be fully tested before ditching the old system.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/build/src/sys/arch/x86/x86/pmap.c", line 2431

2021-12-05 Thread Greg A. Woods
I've been busy testing kernels in Xen domUs.

Just after running "xl destroy nbtest" this happened:

[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4056334] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499253.4256354] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499254.1256770] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499254.1256770] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499256.1658017] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499256.1658017] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499258.1359215] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499258.1359215] xvif37i0 GNTTABOP_copy[0] Rx -3
[ 2499260.1560424] xvif37i0: disconnecting
[ 2499260.1560424] xbd backend: detach device vg0-nbtest.pkg for domain 37
[ 2499260.1660445] xbd backend: detach device vg0-nbtest.var for domain 37
[ 2499260.1660445] xbd backend: detach device vg0-nbtest.swap for domain 37
[ 2499260.1660445] xbd backend: detach device vg0-nbtest.root for domain 37
[ 2499262.0061528] xbd backend: detach device vnd0d for domain 37
[ 2499264.9263322] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/build/src/sys/arch/x86/x86/pmap.c", line 2431
[ 2499264.9263322] cpu0: Begin traceback...
[ 2499264.9263322] vpanic() at netbsd:vpanic+0x14a
[ 2499264.9263322] kern_assert() at netbsd:kern_assert+0x48
[ 2499264.9263322] pmap_free_ptp() at netbsd:pmap_free_ptp+0x3b1
[ 2499264.9263322] pmap_enter_ma() at netbsd:pmap_enter_ma+0xebe
[ 2499264.9263322] privcmd_ioctl() at netbsd:privcmd_ioctl+0xa8c
[ 2499264.9263322] kernfs_try_fileop() at netbsd:kernfs_try_fileop+0x5c
[ 2499264.9263322] VOP_IOCTL() at netbsd:VOP_IOCTL+0x5d
[ 2499264.9263322] vn_ioctl() at netbsd:vn_ioctl+0xad
[ 2499264.9263322] sys_ioctl() at netbsd:sys_ioctl+0x555
[ 2499264.9263322] syscall() at netbsd:syscall+0x9c
[ 2499264.9263322] --- syscall (number 54) ---
[ 2499264.9263322] netbsd:syscall+0x9c:
[ 2499264.9263322] cpu0: End traceback...
[ 2499264.9263322] fatal breakpoint trap in supervisor mode
[ 2499264.9263322] trap type 1 code 0 rip 0x8023e93d cs 0xe030 rflags 0x202 cr2 0x70153d533000 ilevel 0 rsp 0xc580ef7a4950
[ 2499264.9263322] curlwp 0xc58012291240 pid 12621.12621 lowest kstack 0xc580ef7a02c0
Stopped in pid 12621.12621 (xl) at  netbsd:breakpoint+0x5:  leave
ds  4960
es  4910
fs  4950
gs  10
rdi 0
rsi 1
rbp c580ef7a4950
rbx c58003ad6f40
rdx 2
rcx 0
rax 0
r8  c58003ad6f40
r9  1
r10 0
r11 fffe
r12 104
r13 80c9d620 ostype+0x148
r14 c580ef7a4998
r15 701535f6e000
rip 8023e93d breakpoint+0x5
cs  e030
rflags  202
rsp c580ef7a4950
ss  e02b
netbsd:breakpoint+0x5:  leave
db{0}> (XEN) [2021-12-05 19:27:12.065] Watchdog timer fired for domain 0
(XEN) [2021-12-05 19:27:12.065] Hardware Dom0 shutdown: watchdog rebooting machine


This is an amd64 system running a 9.99.81 kernel and Xen 4.13.2nb2.

Is it worth a PR?

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: sysinst extended partitioning won't set/do the "newfs" flag!

2021-11-02 Thread Greg A. Woods
At Mon, 08 Jun 2020 16:42:42 -0700, "Greg A. Woods"  wrote:
Subject: sysinst extended partitioning won't set/do the "newfs" flag!
>
> I'm having trouble getting the "new" sysinst, when using extended
> partitioning, to set the "newfs" flag (and the "-o log" flag).
>
> I can set it, but it never sticks and never happens, which means nothing
> gets mounted.

Also, the "mount" flag doesn't seem to have any effect either
(e.g. even if all the partitions and filesystems are ready made).

I've updated recently to the latest sysinst sources from -current with
no improvement.

I now see from the source that the mysterious "install" flag should only
be set on one partition (though I'm still not quite sure exactly what
it's supposed to mean, except that this is to be the root filesystem,
though why it can't figure that out from the mount point being "/" is
not clear).  Having that flag set on only one partition doesn't help
though.  (My original guess was that partitions tagged with the "install"
flag were to be used during the install, i.e. mounted under /targetroot,
and any without it set would only be written to /etc/fstab for use once
the target system boots live.)

In any case my basic expectation for the most basic functionality of the
"extended partitioning" feature is that I should be able to use it to
install on a system with a "bunch" of disks, making
one or more filesystems (or, e.g., swap partitions) on each disk, and
having them be newfs'ed (or not) and mounted (or not) for the target
system (all before extracting sets, i.e. mounted under /targetroot), and
have the resulting configuration all be written to the new target's
/etc/fstab.

So far I've been unable to even get close to making this work.  (I can
get it to create partitions, but then it won't do anything with them.)

Trying to read the source to figure out what is and isn't working, or
how it maybe should work, hasn't helped me any yet either.  A design
guide, or theory-of-operation doc, etc. might help.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits

2021-10-09 Thread Greg A. Woods
At Fri, 8 Oct 2021 19:44:02 +0000 (UTC), RVP wrote:
Subject: Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
>
> On Fri, 8 Oct 2021, Greg A. Woods wrote:
>
> > If two identical 'mv' commands run in the same directory (with no other
> > commands running there in between) then the second one is going to
> > report an ENOENT error.  Given these 'mv' commands are on the tail of a
> > command list that creates the source file, they have to run very nearly
> > in parallel in order to trigger the observed failure.
> >
>
> GCC comes with a move-if-change script to do just this kind of file-rename
> juggling. Try using that in the rule instead of the home-brew commands...
>
> /usr/src/external/gpl3/gcc/dist/move-if-change

I think that would be very thin wallpaper for such a problem.  :-)

There are possibly other problems lurking behind such a parallel build
failure, so fixing the root cause really would be the better solution.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits

2021-10-08 Thread Greg A. Woods
At Thu, 7 Oct 2021 23:17:33 +0000 (UTC), RVP wrote:
Subject: Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits
>
> On Thu, 7 Oct 2021, Greg A. Woods wrote:
>
> > It's almost as if the call to rename() in 'mv' succeeds, but returns an
> > ENOENT error sometimes!!!
> >
>
> Or, there's a race to completion happening. Is libstdc++-v3 being built
> twice?

Yes that's quite likely.  I realised the same just after I wrote my
message and went out to do some yard work.

If two identical 'mv' commands run in the same directory (with no other
commands running there in between) then the second one is going to
report an ENOENT error.  Given these 'mv' commands are on the tail of a
command list that creates the source file, they have to run very nearly
in parallel in order to trigger the observed failure.
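
The failure mode is easy to reproduce in isolation.  A small Python sketch
(the file names are just examples and the function name is made up) that
issues the same rename() twice, one after the other, which is what two
racing 'mv' commands amount to once the first one has won:

```python
import errno
import os
import tempfile

def rename_twice(dirpath):
    """Create one temp file, rename it twice; return the errno of each attempt."""
    tmp = os.path.join(dirpath, "gthr-posix.h.tmp")
    dst = os.path.join(dirpath, "gthr-posix.h")
    with open(tmp, "w") as f:
        f.write("/* generated */\n")
    results = []
    for _ in range(2):
        try:
            os.rename(tmp, dst)     # same call both times, like two 'mv's
            results.append(0)
        except OSError as e:
            results.append(e.errno)
    return results

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        first, second = rename_twice(d)
        print("first:", first, "second:", errno.errorcode.get(second, second))
```

The first rename succeeds and leaves the target in place; the second fails
with ENOENT because the source is already gone, even though the end result
on disk is exactly what both commands wanted.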

I'm not sure yet how or where these built include files get specified
twice, or what else causes them to be built multiple times in parallel.

Perhaps it's the trickery here (interfering with similar trickery in
)?

/usr/src/external/gpl3/gcc/lib/libstdc++-v3/include/Makefile.includes


Or given that c++config.h was also in the same boat, maybe all of
external/gpl3/gcc/lib/libstdc++-v3/include/bits is being built twice for
the "includes" target?


> PS. I should ask: your machines are all running NTP, right?

Yes indeed, though the second machine is using all local disk.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: very strange build failure in external/gpl3/gcc/lib/libstdc++-v3/include/bits

2021-10-07 Thread Greg A. Woods
At Thu, 07 Oct 2021 10:25:31 -0700, "Greg A. Woods"  wrote:
Subject: very strange build failure in 
external/gpl3/gcc/lib/libstdc++-v3/include/bits
>
> I had a parallel build fail as follows yesterday.
>
> This same source tree has been built in the same way on the same machine
> multiple times without these errors ever appearing.  An rsync'ed copy of
> the source tree has been successfully built on another machine multiple
> times without these errors ever appearing.

I spoke a little too soon.

The second machine encountered the same error just now, but only part of
it -- i.e. only one of the 'mv' commands failed:

mv: rename gthr-posix.h.tmp to gthr-posix.h: No such file or directory
--- gthr-posix.h ---
*** [gthr-posix.h] Error code 1

nbmake[7]: stopped in 
/work/woods/m-NetBSD-current/external/gpl3/gcc/lib/libstdc++-v3/include/bits
1 error


I happened to have a couple of other older builds from the same tree on
that other machine, and so I looked for similar errors in the logs from
those builds, and what do you know, but I found one more (from last
March)!

mv: rename c++config.h.tmp to c++config.h: No such file or directory
includes ===> external/mit/xorg/lib/xkeyboard-config/symbols/nec_vndr
install  
/build/woods/b2/current-amd64-destdir/usr/X11R7/include/xcb/xc_misc.h
install  /build/woods/b2/current-amd64-destdir/usr/X11R7/include/xcb/xcb.h
--- includes-include ---

nbmake[5]: stopped in 
/work/woods/m-NetBSD-current/external/gpl3/gcc/lib/libstdc++-v3
--- includes-libxcb ---
nbmake[8]: stopped in /work/woods/m-NetBSD-current/external/mit/xorg/lib/libxcb
--- includes-bits ---

nbmake[9]: stopped in 
/work/woods/m-NetBSD-current/external/gpl3/gcc/lib/libstdc++-v3/include


That time I just restarted the build and put it down to a parallel build
Makefile error.

In that case it's from external/gpl3/gcc/lib/libstdc++-v3/include/bits/Makefile
and again it's the same style of rule where it is impossible for me to
understand how the failure could possibly happen!


It's almost as if the call to rename() in 'mv' succeeds, but returns an
ENOENT error sometimes!!!



BTW, there's a rule in /usr/src/external/gpl3/gcc/lib/libstdc++-v3/Makefile
of a similar form but it includes what seems to me to be a nonsensical
"&& rm -f ${.TARGET}.tmp" at the end.  Shouldn't that be "|| rm -f ..."???
(or just not there at all?)
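
A sketch of the difference, assuming the rule has the general shape "generate > $@.tmp && mv $@.tmp $@" (the exact rule is not quoted here). After a successful mv the .tmp name no longer exists, so a trailing "&& rm -f" has nothing left to remove; cleanup only does anything on the failure path, which is what "||" would express. `install_generated` is a hypothetical helper, written in Python only for illustration.

```python
import os


def install_generated(target, contents):
    """Write contents to target.tmp, then rename it into place.

    The temporary file is removed only when a step fails, which is the
    behaviour '|| rm -f $@.tmp' would give in a make rule.
    """
    tmp = target + ".tmp"
    try:
        with open(tmp, "w") as f:
            f.write(contents)
        os.rename(tmp, target)   # on success the .tmp name is gone
    except OSError:
        # Failure path: do not leave a stale .tmp file behind.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```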

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Entropy error blocks lang/python38 installation

2021-06-16 Thread Greg A. Woods
At Wed, 16 Jun 2021 11:18:02 +0200, Martin Husemann  wrote:
Subject: Re: Entropy error blocks lang/python38 installation
>
> On Wed, Jun 16, 2021 at 11:10:34AM +0200, Joerg Sonnenberger wrote:
> > On Wed, Jun 16, 2021 at 06:13:23AM +0200, Martin Husemann wrote:
> > > On Wed, Jun 16, 2021 at 03:42:35AM +, Thomas Mueller wrote:
> > > > I believe I must apply the fix/workaround every time.
> > >
> > > The entropy state gets stored on shutdown and reloaded on next boot.
> > > Fixing it once is enough.
> >
> > ...assuming that people actually use shutdown and don't just reboot.
>
> Kinda - but the instructions in the man page are quite explicit and
> ask you to save the entropy state at least once manually, which should
> avoid the blocking behaviour in all cases.

That's not an acceptable regression.

Previously no manual operations were ever necessary -- the blocking was
never permanent.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: building netbsd-9 2 'sync' processes stuck in 'tstile'

2021-05-07 Thread Greg A. Woods
At Fri, 7 May 2021 17:39:47 -0500 (CDT), "John D. Baker" 
 wrote:
Subject: Re: building netbsd-9 2 'sync' processes stuck in 'tstile'
>
> So far, the now 6 'sync' processes have been stuck in "tstile" for 4
> days.  Other than being unable to build/link any kernels, the system is
> fine and its primary functions as file server (NFS, SaMBa, AppleTalk),
> backup DNS and NTP server are unaffected as are its clients (i.e., every
> other machine on my LAN).  CVS updates to the various trees complete
> without problems.
>
> Of course anything that runs 'sync' will get stuck.
>
> This is the first time I've had this kind of problem on this system
> since I placed it in service in 2010.

This really smells more like a kernel deadlock.

I wonder if you could use crash(8) (or ddb(4)) to get kernel stack
traces of the stuck processes.  (E.g. see the EXAMPLES section in
crash(8).)

That might help narrow down which locks are causing the problems...

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: building netbsd-9 2 'sync' processes stuck in 'tstile'

2021-05-07 Thread Greg A. Woods
At Mon, 3 May 2021 17:52:03 -0500 (CDT), "John D. Baker" 
 wrote:
Subject: building netbsd-9 2 'sync' processes stuck in 'tstile'
>
> While building netbsd-9/amd64 with "-j 2", the build process got stuck
> while linking "GENERIC_KASLR" and "GENERIC".  'top' shows two 'sync'
> processes stuck in 'tstile'.  Although the build could be aborted with
> "Ctrl-C", the two 'sync' processes remain and cannot be killed (even
> with -9).
>
> The host is netbsd-9/amd64 as of 30 April.  The filesystem on which the
> build process operates resides on a local RAIDframe RAID-R of eight 1TB
> SATA disks.
>
> The same filesystem is also NFS exported and clients otherwise continue
> to operate on it normally.

So, I've had a similar, but less critical, thing happen, though with a
somewhat opposite configuration.

I.e. I've seen lots of processes get "stuck" and/or very slow (with
processes sitting in "tstile" for long periods) on a similar system.

However the main problem seemed to be on a -current system that was
somewhat heavily accessing an NFS filesystem on another (older) NetBSD
system.  (i.e. /usr/src and /home are NFS mounts to the other server)

I don't know if these "tstile" processes were unkillable (though I've
experienced that before where a kernel deadlock caused it(*)).

However they eventually completed, and even more mysteriously the whole
problem resolved itself and disappeared without any knowing intervention!

I just left the machine to struggle along overnight and in the morning
it was running fine, and continued to do so for over a week until I
rebooted the other day to test some unrelated kernel fixes.

I never did find any possible cause for the slowness.

The older system that's serving NFS has an uptime of 117 days and didn't
seem to be suffering any ill effects during the slowness or since.


(*) The "tstile" hangs caused by a deadlock were on a Xen dom0 where
there were locking order problems in the xenstore interface and so "xl"
commands could deadlock in the kernel.  That bug has been fixed.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: booting xen [was Re: serial console puzzle]

2021-05-01 Thread Greg A. Woods
I've copied this reply to port-xen as it's entirely Xen related.

At Fri, 30 Apr 2021 20:50:10 +0200, Manuel Bouyer  
wrote:
Subject: Re: booting xen [was Re: serial console puzzle]
> 
> On Fri, Apr 30, 2021 at 07:28:57PM +0100, Patrick Welche wrote:
> > 
> > boot.cfg contains:
> > 
> > menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load 
> > /netbsd-XEN3_
> > DOM0 console=com1 com1=57600,8n1,0x3f8;multiboot /xen-debug.gz 
> > dom0_mem=1024M
> 
> should probably be:
> menu=Boot Xen:rndseed /var/db/entropy-file;consdev com0,57600;load 
> /netbsd-XEN3_ DOM0 console=com0;multiboot /xen-debug.gz dom0_mem=1024M 
> console=com1 com1=57600,8n1,0x3f8
> 
> (should really be console=com0 for NetBSD, it doesn't access the hardware and
> uses the I/O services from the hypervisor)

On serial console machines I've been using NetBSD "console=xencons" for
ages.

This is the way documented by Xen (i.e. the preferred Xen way) for serial
consoles:

menu=Boot Xen:load /netbsd-XEN3_DOM0 -v bootdev=dk0 
console=xencons;multiboot /xen bootscrub=false dom0_mem=4G console=com1,vga 
console_timestamps=datems dom0_max_vcpus=4 dom0_vcpus_pin=true 
pv-l1tf=off,domu=off vpmu=on cpuid=rdrand spec-ctrl=no-xen,l1d-flush=off 
guest_loglvl=all

From my Xen notes:

  - N.B.:  The Xen kernel handles serial input (and can pass it to
the dom0 kernel) but not keyboards, thus for serial console use
the NetBSD console should be "xencons", but when using the VGA
console the NetBSD console _must_ be "pc".

  - Xen counts serial ports from '1', but of course NetBSD counts
them from zero, so instead of "console=com0" as would be used
for /netbsd alone, it must be "console=com1,vga" for /xen.  Note
that Xen can use multiple consoles simultaneously!  Note
also we could tell Xen to set the port up with something like
"com1=115200,8n1", but for now I think the BIOS does this OK on
the Dell PE machines.
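
The off-by-one between the two naming schemes can be written down as a tiny helper. This is only an illustration of the note above; `xen_console_for` is a hypothetical name, not anything from Xen or NetBSD.

```python
def xen_console_for(netbsd_com):
    """Map a NetBSD serial console name (com0, com1, ...) to Xen's name.

    NetBSD counts serial ports from zero, Xen counts them from one.
    """
    if not netbsd_com.startswith("com") or not netbsd_com[3:].isdigit():
        raise ValueError("expected a name like 'com0', got %r" % netbsd_com)
    return "com%d" % (int(netbsd_com[3:]) + 1)


if __name__ == "__main__":
    print(xen_console_for("com0"))
```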

These notes are based on direct examination of the code and are
confirmed by practice on multiple machines.

I believe the main advantage of keeping Xen in firm and sole control of
the serial console is that you can still talk to Xen directly with it
for debugging, as noted by Xen as it boots:

(XEN) [2021-04-21 20:54:44.504] *** Serial input to DOM0 (type 'CTRL-a' three 
times to switch input)

I've not really made use of this feature though -- just tested it a
couple of times.  I don't know if Xen still peeks at serial I/O if you
let the dom0 kernel take control of the UART, but it may.  I just don't
see the point of letting the dom0 use anything but xencons, if it can.
Similarly I don't see any point to trying to set or reset the UART
parameters if the BIOS already has them set and working -- keep it
simple and keep as much of the config in the first place it's needed and
nowhere else!


For systems with VGA console only though I finally figured out it has to
be "console=pc" explicitly else I didn't see any NetBSD boot messages
(this I have not diagnosed yet -- it is on a remote machine I've never
seen physically, though I do have Dell iDRAC access to it):

menu=Xen:load /netbsd -v bootdev=dk0 console=pc;multiboot /xen 
dom0_mem=2G dom0_max_vcpus=1 dom0_vcpus_pin

Of course VGA consoles suck for servers and for debugging, but sometimes
that's all you've got.


You'll note in the first example and the notes, Xen can use two
different consoles simultaneously, so if I do go out into my machine
room (i.e. garage) I can see the Xen message on the screen too.  I
really wish NetBSD could do that.

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Problem reports for version control systems

2021-05-01 Thread Greg A. Woods
At Fri, 30 Apr 2021 15:23:03 +0200, Christian Groessler  
wrote:
Subject: Re: Problem reports for version control systems
>
> On 4/30/21 7:31 AM, Lloyd Parkes wrote:
>
> > Hi all,
> > The problem reports people have in their
> > emails are completely inadequate for
> > trying to determine what is going wrong
> > for people trying to access the NetBSD
> > source.
> >
> > 
>
>
> I'm rsync'ing the CVS tree to my local
> server and then run CVS against my server
> on the LAN. No problems...

Same here, since 2001 or so:

RCS file: RCS/rsync-netbsd-cvs,v
revision 1.1
date: 2001/06/06 17:52:06;  author: woods;  state: Exp;
Initial revision

This script can now be found here:

 https://github.com/robohack/rsync-netbsd-cvs

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes

2021-04-20 Thread Greg A. Woods
020202020
*
0001000  21212121212121212121212121212121
*
0002000  22222222222222222222222222222222
*
0003000  23232323232323232323232323232323
*
0004000  24242424242424242424242424242424
*
0005000  25252525252525252525252525252525
*
0006000  26262626262626262626262626262626
*
0007000  27272727272727272727272727272727
*
0010000  28282828282828282828282828282828
*
0011000  29292929292929292929292929292929
*
0012000  2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a
*
0013000  2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b
*
0014000  2c2c2c2c2c2c2c2c2c2c2c2c2c2c2c2c
*
0015000  2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
*
0016000  2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e
*
0017000  2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f
*
0020000


Let's try that again with just the one sample data line and blkchk:


# grep 28141568000 /var/tmp/ckfile.txt > /var/tmp/ckfile.1
# /var/tmp/blkchk check /dev/da0 /var/tmp/ckfile.1
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != 
/var/tmp/ckfile.1[ln#0][1024] \x22
#

Every byte after 1024 is different, but I'll cut it off at 10:

# /var/tmp/blkchk check -v /dev/da0 /var/tmp/ckfile.1 2>&1 | head
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != 
/var/tmp/ckfile.1[ln#0][1024] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1025] \x28 != 
/var/tmp/ckfile.1[ln#0][1025] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1026] \x28 != 
/var/tmp/ckfile.1[ln#0][1026] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1027] \x28 != 
/var/tmp/ckfile.1[ln#0][1027] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1028] \x28 != 
/var/tmp/ckfile.1[ln#0][1028] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1029] \x28 != 
/var/tmp/ckfile.1[ln#0][1029] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1030] \x28 != 
/var/tmp/ckfile.1[ln#0][1030] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1031] \x28 != 
/var/tmp/ckfile.1[ln#0][1031] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1032] \x28 != 
/var/tmp/ckfile.1[ln#0][1032] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1033] \x28 != 
/var/tmp/ckfile.1[ln#0][1033] \x22
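
The first mismatch can be decoded with a little arithmetic. This sketch assumes, as the dump suggests, that each 512-byte block of the record is filled with a single byte value that increases by one per block, starting at 0x20; `explain_mismatch` is a hypothetical helper and not part of blkchk.

```python
SECTOR = 512        # bytes per block in the test pattern
FIRST_FILL = 0x20   # fill byte of the record's first block (assumed from the dump)


def explain_mismatch(offset, expected, observed):
    """Return (sectors_skewed, byte_offset_the_data_came_from).

    offset   -- position of the mismatching byte within the read
    expected -- byte value the check file says should be there
    observed -- byte value the device actually returned
    """
    wanted_sector = offset // SECTOR
    if expected - FIRST_FILL != wanted_sector:
        raise ValueError("expected byte does not fit the assumed pattern")
    source_sector = observed - FIRST_FILL
    return source_sector - wanted_sector, source_sector * SECTOR


if __name__ == "__main__":
    print(explain_mismatch(1024, 0x22, 0x28))
```

With the values from the first blkchk line this gives a skew of 6 sectors and a source offset of 4096 bytes from the start of the read.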


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




CVS (was: GCC 10 available for testing etc. in -current.)

2021-04-19 Thread Greg A. Woods
At Mon, 19 Apr 2021 11:56:59 +0200, Reinoud Zandijk  wrote:
Subject: Re: GCC 10 available for testing etc. in -current.
>
> Same for me; I've never had trouble with CVS trees and they always just work
> and update fine.
>
> Hg on the otherhand I had to delete and recheckout my hg tree *again*; i
> had interrupted hg during a merge and oh boy; it was completely shot and
> thought i had tons of local changes that all conflicted; a whopping 500+ files
> or so, thus resorting to just nuking it and rechecking it out. This never
> happened to my CVS tree.
>
> So, no, hg is not mature enough yet to switch over to and don't get me started
> on git!

I don't think all of those problems can be blamed on Hg (or Git).

A very big part of the problem is what Joerg said:  "when someone messes
up history, that's a non-linear update."

I.e. the conversion from CVS to Hg and/or Git sometimes has to rewrite
history to undo a mess-up and clean-up in the CVS repo, and those are
things that really mess up Git and Hg users.

And NetBSD developers seem to have a penchant for messing up/in the
repository on a regular basis.  There were two such events in the past
week or two alone.  This very update to GCC 10 was involved in one of
them.

These same shenanigans also affect CVS, but usually in less ugly ways.

In both cases it's often a matter of timing.

If you do your CVS update in between one of these "messes" being made
and being cleaned then you'll encounter some problems, but if not then
you're often none the wiser to what happened.

For the same reason different people will have different experiences
with the Hg and Git clones because they do their updates at different
times.

If you don't clone or fetch history that then has to be rewritten then
you won't know that history was rewritten.

The real solution of course is to stop and _prevent_ history from being
rewritten, ever.  It doesn't matter if this is in CVS, Git, Hg, Fossil,
or something else.  It's just easier to prevent in Git, and Hg, etc.

Personally I've been using rsync to fetch the whole CVS repository daily
for years now, and then I update local checkouts, some automatically and
some by hand.  It's very efficient, and it gives me a local copy of all
the repository history.  It's not quite as nice as a git clone, since I
can't reliably and efficiently and easily keep my own local branches and
do local commits (e.g. in the way you can do very easily and efficiently
with Git), but it is still very much better than any other current
alternative, including the current Hg and Git and Fossil conversions.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Xen FreeBSD domU block I/O problem begins somewhere between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10)

2021-04-16 Thread Greg A. Woods
So I was just reminded that I do still have a Xen server that's still
running the 8.99.32 kernel and Xen-4.11.  I had not been testing on it
because it still of course has the vnd(4) CHS size bug (and because it's
also hosting my $HOME and /usr/src and I don't want to crash it), and I
had not remembered until just now that I can work around that by simply
padding out the mini-memstick.img file!

And, so

It works, A-OK, with all other things remaining the same:

# ls -l /dev/xbd0
crw-r-----  1 root  operator  0x3a Apr 17 04:31 /dev/xbd0
# newfs /dev/xbd0
/dev/xbd0: 20480.0MB (41943040 sectors) block size 32768, fragment size 4096
using 33 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112,
 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792,
 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472,
 29491712, 30773952, 32056192, 33338432, 34620672, 35902912, 37185152,
 38467392, 39749632, 41031872
# fsck /dev/xbd0
** /dev/xbd0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2 files, 2 used, 5076797 free (21 frags, 634597 blocks, 0.0% fragmentation)

***** FILE SYSTEM IS CLEAN *****
#


So the problem is almost certainly in NetBSD-current itself, and
somewhere in the vast gulf between 8.99.32 (2020-06-09) and 9.99.81
(2021-03-10).

Unfortunately I don't have enough hardware that's Xen-capable and up and
running well enough to allow me to do any brute-force bisecting.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-16 Thread Greg A. Woods
At Fri, 16 Apr 2021 11:44:08 +0100, David Brownlee  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> On Fri, 16 Apr 2021 at 08:41, Greg A. Woods  wrote:
>
> > What else is different?  What am I missing?  What could be different in
> > NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
> > Could the fault still be in the FreeBSD drivers -- I don't see how as
> > the same root problem caused corruption in both HVM and PVH domUs.
>
> Random data collection thoughts:
>
> - Can you reproduce it on tiny partitions (to speed up testing)
> - If you newfs, shutdown the DOMU, then copy off the data from the
> DOM0 does it pass FreeBSD fsck on a native boot
> - Alternatively if you newfs an image on a native FreeBSD box and copy
> to the DOM0 does the DOMU fsck fail
> - Potentially based on results above - does it still happen with a
> reboot between the newfs and fsck
> - Can you ktrace whichever of newfs or fsck to see exactly what its
> writing (tiny *tiny* filesystem for the win here :)

So, the root filesystem is clean (from the factory, and verified by at
least NetBSD's fsck as OK), but when '-f' is used it is found to be
corrupt.

Unfortunately I don't have any real FreeBSD machines available (though I
could possibly get it installed on my MacBookPro again, but that's
probably a multi-day effort at this point).

However I've just found a way to reproduce the problem reliably and with
a working comparison with a matching-sized memory disk.

First off attach a tiny 4mb LVM LV to FreeBSD -- that's the smallest LV
possible apparently:

dom0 # lvm lvs
  LV  VG  Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  build   scratch -wi-a- 250.00g
  fbsd-test.0 scratch -wi-a-  30.00g
  fbsd-test.1 scratch -wi-a-  30.00g
  nbtest.pkg  vg0 -wi-a-  30.00g
  nbtest.root vg0 -wi-a-  30.00g
  nbtest.swap vg0 -wi-a-   8.00g
  nbtest.var  vg0 -wi-a-  10.00g
  tinytestvg0 -wi-a-   4.00m
dom0 # xl block-attach fbsd-test format=raw, vdev=sdc, access=rw, 
target=/dev/mapper/vg0-tinytest


Now a run of the test on the FreeBSD domU (first showing the kernel
seeing the device attachment):


# xbd3: 4MB  at device/vbd/2080 on xenbusb_front0
xbd3: attaching as da2
xbd3: features: flush
xbd3: synchronize cache commands enabled.
GEOM: new disk da2

# dd if=/dev/zero of=tinytest.fs count=8192
8192+0 records in
8192+0 records out
4194304 bytes transferred in 0.081106 secs (51713998 bytes/sec)
# mdconfig -a -t vnode -f tinytest.fs
md0
# newfs -o space -n md0
/dev/md0: 4.0MB (8192 sectors) block size 32768, fragment size 4096
using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# newfs -o space -n da2
/dev/da2: 4.0MB (8192 sectors) block size 32768, fragment size 4096
using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# dumpfs da2 >da2.dumpfs
# dumpfs md0 >md0.dumpfs
# diff md0.dumpfs da2.dumpfs
1,2c1,2
< magic 19540119 (UFS2) time    Fri Apr 16 18:48:55 2021
< superblock location   65536   id  [ 6079dc17 1006b3b4 ]
---
> magic 19540119 (UFS2) time    Fri Apr 16 18:49:57 2021
> superblock location   65536   id  [ 6079dc55 348e5947 ]
27c27
< magic 90255   tell    20000   time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    20000   time    Fri Apr 16 18:49:57 2021
40c40
< magic 90255   tell    128000  time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    128000  time    Fri Apr 16 18:49:57 2021
53c53
< magic 90255   tell    230000  time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    230000  time    Fri Apr 16 18:49:57 2021
66c66
< magic 90255   tell    338000  time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    338000  time    Fri Apr 16 18:49:57 2021
# fsck md0
** /dev/md0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1 files, 1 used, 870 free (14 frags, 107 blocks, 1.6% fragmentation)

***** FILE SYSTEM IS CLEAN *****
# fsck da2
** /dev/da2
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
ROOT INODE UNALLOCATED
ALLOCATE? [yn] n


***** FILE SYSTEM MARKED DIRTY *****


So I ktraced the fsck_ufs run, and though I haven't looked at it with a
fine-tooth comb and the source open, the only thing that seems a wee bit
different about what fsck does is that it opens the device twice, with
O_RDONLY, then shortly before it prints the first "** /dev/da2" line it
reopens it O_RDWR a third time, closes the second one, and then closes
the second one and calls dup() on the third one so that it has the same
FD# as the second open had.
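
A rough sketch of that descriptor shuffle, for illustration only. The exact order of the closes is an assumption (the description above is from a quick read of the ktrace output, not from the fsck source); the point shown is just that dup() returns the lowest unused descriptor, so the read-write handle ends up on the number the second read-only open had. `reopen_read_write` is a hypothetical name.

```python
import os
import tempfile


def reopen_read_write(path):
    """Replay the open/close/dup sequence; return (old_fd, new_fd)."""
    first = os.open(path, os.O_RDONLY)
    second = os.open(path, os.O_RDONLY)
    third = os.open(path, os.O_RDWR)

    os.close(second)
    # dup() hands back the lowest unused descriptor, which is now the
    # number that 'second' had a moment ago.
    new_fd = os.dup(third)

    for fd in (first, third, new_fd):
        os.close(fd)
    return second, new_fd


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tf:
        old, new = reopen_read_write(tf.name)
        print(old == new)
```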

Otherwise it does a

Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-16 Thread Greg A. Woods
So I wrote a little awk script so that I could write 512-byte blocks
with varying values of bytes.  (Awk is the only decent programming
language on the FreeBSD mini-memstick.img which I could think of that
would do something close to what I wanted it to do.  I could have
combined awk+sh+dd and done things faster, but I had all day to let it
run while I worked on some small engine repairs.)

https://github.com/robohack/experiments/blob/master/tblocks.awk

and then I used it to write 30GB to two different LVM LVs, each of
identical size, and each exported to the domU, one written on the dom0
and the other written on the domU.

Then I ran a cmp of both drives on each the dom0 and domU.
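
The idea of the test, sketched in Python rather than awk. This is not the tblocks.awk script linked above; the fill rule (block index modulo 256) and the helper names are assumptions made for illustration.

```python
BLOCK = 512


def pattern_block(index):
    """One 512-byte block filled with a byte value derived from its index."""
    return bytes([index % 256]) * BLOCK


def pattern(nblocks):
    """The concatenation of nblocks consecutive pattern blocks."""
    return b"".join(pattern_block(i) for i in range(nblocks))


def first_difference(a, b):
    """Like cmp(1): offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    written_on_dom0 = pattern(16)
    written_on_domu = pattern(16)
    print(first_difference(written_on_dom0, written_on_domu))
```

Because every block carries its own value, a read that returned data from the wrong place on the disk would show up as a difference at a block boundary.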

On the dom0 side there were no differences.  All 30GB of what was written
directly in the dom0 to one of the LVs was identical to what was written
in the FreeBSD domU to the other LV.  I.e. the FreeBSD domU side seems
to be writing reliably through to the disk.

The FreeBSD domU though is _really_ slow at reading with cmp (perhaps
not unexpectedly given that it is using stdio to do the read and only
managing 4KB requests, at a rate of just under 500 requests per second
on each disk).

I'm going to send this and go to bed before it finishes, but I'm
guessing it's about 2/3's of the way through (it has run for nearly
11,000 seconds), and thus so far there are no differences from the
FreeBSD domU's point of view either.
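
That guess can be checked with back-of-envelope arithmetic. The sketch assumes "30GB" means 30 GiB, "4KB" means 4096 bytes, and rounds "just under 500" up to 500 requests per second, so the estimate is a lower bound on the run time.

```python
def estimated_seconds(total_bytes, request_bytes, requests_per_sec):
    """Time to read total_bytes at a fixed request size and request rate."""
    return total_bytes / (request_bytes * requests_per_sec)


if __name__ == "__main__":
    total = 30 * 2**30                      # one 30 GiB LV
    secs = estimated_seconds(total, 4096, 500)
    print(round(secs))                      # seconds for a full pass
    print(round(11000 / secs, 2))           # fraction done after 11,000 s
```

At those rates a full pass takes a bit under 16,000 seconds, so 11,000 seconds is roughly 70% of the way through, in line with the "about 2/3's" estimate.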

Anyway, what the heck is FreeBSD newfs and/or fsck doing different!?!?!??

They're both writing and reading the very same raw device(s) that I
wrote and read to/from with awk and cmp.

These awk/cmp tests did very sequential operations, and the data are
quite uniform and regular; whereas newfs/fsck write/read a much more
complex data structure using operations scattered about in the disk.

These tests are also writing then reading enough data to flush through
the buffer caches in each dom0 and domU several times over.  The dom0
has only 4GB and the domU has 8GB, but Xen says it's only using under 2GB.

What else is different?  What am I missing?  What could be different in
NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
Could the fault still be in the FreeBSD drivers -- I don't see how as
the same root problem caused corruption in both HVM and PVH domUs.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: running xen on current

2021-04-15 Thread Greg A. Woods
At Thu, 15 Apr 2021 13:02:54 +0200, Manuel Bouyer  
wrote:
Subject: Re: running xen on current
>
> AFAIK EFI is not yet supported by Xen (maybe this is supported by 4.15,
> I've not had a chance to try yet). I have it running on fairly recent
> Dell servers (in BIOS mode)

My Dell servers, even the newer PE-R510, are much older I think  :-)

They run -current (2021-03-10) quite well (except for PR# 54969 -- I
have to remember to unmount my larger filesystems manually before any
reboot unless I want to risk loss and/or wait a long time for fscks -- I
haven't turned on '-o log' for them yet as I wanted to measure its
performance impact).

My XEN3_DOM0 kernel is somewhat customized, but not in any way that
should affect the hardware support or Xen -- of interest might be iscsi
support and VND_COMPRESSION, but I haven't tried testing either yet.

I did read about the unified EFI image support in Xen 4.15 and I was
thinking of trying it on my old MacBookPro -- but I would also want X11
to work on it too, and even FreeBSD's Xserver wasn't working on it last
summer, so I went back to MacOS in order to be able to use it for web
and such as well as remote access.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Greg A. Woods
atures: flush
dev.xbd.0.ring_pages: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32
dev.xbd.0.%parent: xenbusb_front0
dev.xbd.0.%pnpinfo: 
dev.xbd.0.%location: 
dev.xbd.0.%driver: xbd
dev.xbd.0.%desc: Virtual Block Device




For reference the bug behaviour remains the same (at least for this
simplest quick and easy test):

# newfs /dev/da0
/dev/da0: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 
11540352, 12822592, 14104832, 15387072, 16669312,
 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 
26927232, 28209472, 29491712, 30773952, 32056192, 33338432,
 34620672, 35902912, 37185152, 38467392, 39749632, 41031872, 42314112, 
43596352, 44878592, 46160832, 47443072, 48725312, 50007552,
 51289792, 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 
60265472, 61547712, 62829952
# fsck /dev/da0
** /dev/da0
** Last Mounted on 
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
CG 0: BAD CHECK-HASH 0x49168424 vs 0xe610ac1b
SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

CG 1: BAD CHECK-HASH 0xfa76fceb vs 0xb9e90a55
CG 2: BAD CHECK-HASH 0x41f444c vs 0x5efb290e
CG 3: BAD CHECK-HASH 0xad63fe7e vs 0x7ab3861f
CG 4: BAD CHECK-HASH 0xfd2043f3 vs 0xadb781f4
CG 5: BAD CHECK-HASH 0x545cf9c1 vs 0xcec5661e
CG 6: BAD CHECK-HASH 0xaa354166 vs 0x7dd269d3
CG 7: BAD CHECK-HASH 0x349fb54 vs 0x3078e065
CG 8: BAD CHECK-HASH 0xab23a7c vs 0xc8aa7e98
CG 9: BAD CHECK-HASH 0xa3ce804e vs 0x205a6b0d
CG 10: BAD CHECK-HASH 0x5da738e9 vs 0x604d5ecf
CG 11: BAD CHECK-HASH 0xf4db82db vs 0xfef11ffc
CG 12: BAD CHECK-HASH 0xa4983f56 vs 0xc7e701c8
CG 13: BAD CHECK-HASH 0xde48564 vs 0x42072fba
CG 14: BAD CHECK-HASH 0xf38d3dc3 vs 0xad98cf7b
CG 15: BAD CHECK-HASH 0x5af187f1 vs 0xbacadeb1
CG 16: BAD CHECK-HASH 0xe07abf93 vs 0xe4ca225
CG 17: BAD CHECK-HASH 0x490605a1 vs 0xe2917802
CG 18: BAD CHECK-HASH 0xb76fbd06 vs 0xa895abc
CG 19: BAD CHECK-HASH 0x1e130734 vs 0x6a8bc135
CG 20: BAD CHECK-HASH 0x4e50bab9 vs 0x44719a4a
CG 21: BAD CHECK-HASH 0xe72c008b vs 0xadb0c6e9
CG 22: BAD CHECK-HASH 0x1945b82c vs 0x3aeca102
CG 23: BAD CHECK-HASH 0xb039021e vs 0xb99f957d
CG 24: BAD CHECK-HASH 0xb9c2c336 vs 0xd384be85
CG 25: BAD CHECK-HASH 0x10be7904 vs 0x649e2abf
CG 26: BAD CHECK-HASH 0xeed7c1a3 vs 0x95f7
CG 27: BAD CHECK-HASH 0x47ab7b91 vs 0x3fb02d8b
CG 28: BAD CHECK-HASH 0x17e8c61c vs 0xa2b4ca67
CG 29: BAD CHECK-HASH 0xbe947c2e vs 0x65972e04
CG 30: BAD CHECK-HASH 0x40fdc489 vs 0x4219223f
CG 31: BAD CHECK-HASH 0xe9817ebb vs 0x36eb9a37
CG 32: BAD CHECK-HASH 0x3007c2bc vs 0xd1916e1d
CG 33: BAD CHECK-HASH 0x997b788e vs 0x5204f64d
CG 34: BAD CHECK-HASH 0x6712c029 vs 0xe291bcf0
CG 35: BAD CHECK-HASH 0xce6e7a1b vs 0x136ff032
CG 36: BAD CHECK-HASH 0x9e2dc796 vs 0x78ea85c8
CG 37: BAD CHECK-HASH 0x37517da4 vs 0x40c2cf31
CG 38: BAD CHECK-HASH 0xc938c503 vs 0x9b844ab6
CG 39: BAD CHECK-HASH 0x60447f31 vs 0x23129481
CG 40: BAD CHECK-HASH 0x69bfbe19 vs 0xa81f5e9
CG 41: BAD CHECK-HASH 0xc0c3042b vs 0xbd37ebd1
CG 42: BAD CHECK-HASH 0x3eaabc8c vs 0xfadfd8d1
CG 43: BAD CHECK-HASH 0x97d606be vs 0xf41513bc
CG 44: BAD CHECK-HASH 0xc795bb33 vs 0xad4e6069
CG 45: BAD CHECK-HASH 0x6ee90101 vs 0xbeab94a9
CG 46: BAD CHECK-HASH 0x9080b9a6 vs 0x2688acd1
CG 47: BAD CHECK-HASH 0x39fc0394 vs 0xb5a37e85
CG 48: BAD CHECK-HASH 0x83773bf6 vs 0xd779cc90
CG 49: BAD CHECK-HASH 0xe0d3fd3c vs 0xb8083ca
2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* FILE SYSTEM MARKED DIRTY *

* PLEASE RERUN FSCK *
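
For context on what a check-hash mismatch like those above means: each cylinder group carries a stored hash, and the checker recomputes that hash over the block it has just read. The sketch below illustrates the idea with a plain CRC-32C. The 4-byte hash field at HASH_OFFSET and the names seal() and check() are hypothetical, and FreeBSD's actual seed, field layout, and finalisation may well differ.

```python
import struct

CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C with the conventional initial value and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


HASH_OFFSET = 0  # hypothetical: the stored hash occupies the first 4 bytes


def _body(block: bytes) -> bytes:
    """The block with its hash field zeroed, which is what gets hashed."""
    return block[:HASH_OFFSET] + b"\0\0\0\0" + block[HASH_OFFSET + 4:]


def seal(block: bytes) -> bytes:
    """Return the block with its hash field filled in."""
    digest = struct.pack("<I", crc32c(_body(block)))
    return block[:HASH_OFFSET] + digest + block[HASH_OFFSET + 4:]


def check(block: bytes):
    """Return (stored, recomputed); they differ when the data read back has changed."""
    (stored,) = struct.unpack_from("<I", block, HASH_OFFSET)
    return stored, crc32c(_body(block))


good = seal(bytes(range(256)) * 16)   # a 4096-byte stand-in for a cylinder group
damaged = bytearray(good)
damaged[2000] ^= 0xFF                 # simulate one byte coming back wrong
print("intact :", "0x%x vs 0x%x" % check(good))
print("damaged:", "0x%x vs 0x%x" % check(bytes(damaged)))
```

A mismatch only says that the bytes read are not the bytes that were hashed. It does not say whether the write or the read was at fault.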

-- 
Greg A. Woods 



Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Greg A. Woods
At Wed, 14 Apr 2021 19:53:47 +0200, Jaromír Doleček  
wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
> 
> You can test if this is the problem by disabling the feature in
> negotiation in NetBSD xbdback.c - comment out the code which sets
> feature-max-indirect-segments in xbdback_backend_changed(). With the
> feature disabled, FreeBSD DomU should not use indirect segments.

Ah, yes, thanks!  I should have thought of that.  That's especially
useful since on the client side it's a read-only flag:

# sysctl -w hw.xbd.xbd_enable_indirect=0
sysctl: oid 'hw.xbd.xbd_enable_indirect' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf

Apparently in the Linux implementation the number of indirect segments
used by a domU can be tuned at boot time, and that appears to be done by
setting a driver option on the guest kernel command line.  When I first
read that it didn't make so much sense to me to be giving this kind of
control to the domU.  Perhaps it would be better to make this a tuneable
in xl.cfg(5) such that it can be tuned on a per-guest basis.  Then
setting it to zero for a given guest would not advertise the feature at
all.

I've some other things to do before I can reboot -- I'll report as soon
as that's done.

-- 
    Greg A. Woods 



Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-13 Thread Greg A. Woods
At Tue, 13 Apr 2021 18:20:39 -0700, "Greg A. Woods"  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> So "17" seems an odd number, but it is apparently because of "Need to
> alloc one extra page to account for possible mapping offset".

Nope, changing that to 16 didn't make any difference.

--
    Greg A. Woods 



Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-13 Thread Greg A. Woods
At Sun, 11 Apr 2021 13:55:36 -0700, "Greg A. Woods"  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
> xbd(4) with a new filesystem created on it, is impossible.


So, having run out of "easy" ideas, and working under the assumption
that this must be a problem in NetBSD-current dom0 (i.e. not likely in
Xen or Xen tools) I've been scanning through changes and this one, so
far, is one that would seem to me to have at least some tiny possibility
of being the root cause.


RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/xen/xen/xbdback_xenbus.c,v

revision 1.86
date: 2020-04-21 06:56:18 -0700;  author: jdolecek;  state: Exp;  lines: +175 -47;  commitid: 26JkIx2V3sGnZf5C;
add support for indirect segments, which makes it possible to pass
up to MAXPHYS (implementation limit, interface allows more) using
single request

request using indirect segment requires 1 extra copy hypercall per
request, but saves 2 shared memory hypercalls (map_grant/unmap_grant),
so should be net performance boost due to less TLB flushing

this also effectively doubles disk queue size for xbd(4)


I don't see anything obviously glaringly wrong, and of course this is
working A-OK on my same machines with NetBSD-5 and a NetBSD-current (and
originally somewhat older NetBSD-8.99) domUs.

However I'm really not very familiar with this code and the specs for
what it should be doing so I'm unlikely to be able to spot anything
that's missing.  I did read the following, which mostly reminded me to
look in xenstore's db to see what feature-max-indirect-segments is set
to by default:

https://xenproject.org/2013/08/07/indirect-descriptors-for-xen-pv-disks/


Here's what is stored for a file-backed device:

backend = ""
 vbd = ""
  3 = ""
   768 = ""
frontend = "/local/domain/3/device/vbd/768"
params = "/build/images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img"
script = "/etc/xen/scripts/block"
frontend-id = "3"
online = "1"
removable = "0"
bootable = "1"
state = "4"
dev = "hda"
type = "phy"
mode = "r"
device-type = "disk"
discard-enable = "0"
vnd = "/dev/vnd0d"
physical-device = "3587"
hotplug-status = "connected"
sectors = "792576"
info = "4"
sector-size = "512"
feature-flush-cache = "1"
feature-max-indirect-segments = "17"


Here's what's stored for an LVM-LV backed vbd:

162 = ""
 2048 = ""
  frontend = "/local/domain/162/device/vbd/2048"
  params = "/dev/mapper/vg1-fbsd--test.0"
  script = "/etc/xen/scripts/block"
  frontend-id = "162"
  online = "1"
  removable = "0"
  bootable = "1"
  state = "4"
  dev = "sda"
  type = "phy"
  mode = "r"
  device-type = "disk"
  discard-enable = "0"
  physical-device = "43285"
  hotplug-status = "connected"
  sectors = "83886080"
  info = "4"
  sector-size = "512"
  feature-flush-cache = "1"
  feature-max-indirect-segments = "17"
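
Listings like the two above are easy to check mechanically. The following is a minimal sketch, assuming only that each entry has the key = "value" shape shown; the function name is hypothetical and it ignores the nesting implied by indentation.

```python
import re

# Matches lines of the form:   some-key = "some value"
_PAIR = re.compile(r'^\s*([\w-]+)\s*=\s*"(.*)"\s*$')


def parse_xenstore_listing(text):
    """Collect key = "value" lines into a dict; later duplicates overwrite earlier ones."""
    entries = {}
    for line in text.splitlines():
        m = _PAIR.match(line)
        if m:
            entries[m.group(1)] = m.group(2)
    return entries


sample = '''
  type = "phy"
  sector-size = "512"
  feature-flush-cache = "1"
  feature-max-indirect-segments = "17"
'''
info = parse_xenstore_listing(sample)
print(int(info.get("feature-max-indirect-segments", "0")))
```

A backend that does not advertise the feature simply has no such key, which is why the lookup falls back to zero.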


So "17" seems an odd number, but it is apparently because of "Need to
alloc one extra page to account for possible mapping offset".  It is
currently the maximum for indirect-segments, and it's hard-coded.
(Linux apparently has a max of 256, and the Linux blkfront defaults to
only using 32.)  Maybe it should be "16", so matching max_request_size?
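
The "one extra page" remark can be illustrated with a little arithmetic. This is only a sketch, assuming 4096-byte pages and the 65536-byte maximum request size that FreeBSD reports; pages_touched() is a hypothetical helper, not anything from the driver.

```python
PAGE_SIZE = 4096  # assumed page size (amd64)


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Number of pages overlapped by `length` bytes starting at byte `offset`."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


print(pages_touched(0, 65536))    # page-aligned 64 KiB buffer
print(pages_touched(512, 65536))  # same size, but starting mid-page
```

A page-aligned 64 KiB buffer covers 16 pages, while the same buffer starting part-way into a page spills onto a 17th. That would explain why 17 segments are advertised for a 16-page maximum transfer.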



I did take a quick gander at the related code in FreeBSD (both the domU
code that's talking to this code in NetBSD, and the dom0 code that would
be used if dom0 was running FreeBSD), and besides seeing that it is
quite different, I also don't see anything obviously wrong or
incompatible there either.  (I do note that the FreeBSD equivalent to
xbdback(4) has a major advantage of being able to directly access files,
i.e. without the need for vnd(4).  Not quite as exciting as maybe full
9pfs mounts through to domUs would be, but still pretty neat!)

FreeBSD's equivalent to xbdback(4) (i.e. sys/dev/xen/blkback/blkback.c)
doesn't seem to mention "feature-max-indirect-segments", so apparently
they don't offer it yet, though it does mention "feature-flush-cache".

However their front-end code does detect it and seems to make use of it,
and has done for some 6 years now according to "git blame" (with no
recent fixes beyond fixing a memory leak on their end).  Here we see it
live from FreeBSD's sysctl output, thus my concern that this feature may
be the source of the problem:

hw.xbd.xbd_enable_indirect: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32

--
Greg A. Woods 



Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
ev/da1
** /dev/da1
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=325128
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877864
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877866
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877879
SALVAGE? [yn] ^C

* FILE SYSTEM MARKED DIRTY *


Back on the NetBSD side:


 # xl block-detach fbsd-test  2064
 # fsck /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] n

SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

12076 files, 91642 used, 7647797 free (293 frags, 955938 blocks, 0.0% fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *



--
    Greg A. Woods 



Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 13:23:31 -0700, "Greg A. Woods"  wrote:
Subject: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> In fact it only seems to be fsck that complains, possibly along
> with any attempt to write to a filesystem, that causes problems.

Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
xbd(4) with a new filesystem created on it, is impossible.

I was able to write 500MB of zeros to the LVM LV backed disk,
overwriting the copy of the .img file I had put there, and only see
500MB of zeros back on the NetBSD side, so writing directly to the raw
/dev/da1 on FreeBSD seems to write data without problem.
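
A check of this kind can be scripted. Below is a minimal sketch, assuming the device or image can be read as an ordinary file; count_leading_zero_bytes() is a hypothetical helper and not the command actually used for the test described above.

```python
def count_leading_zero_bytes(path, limit, chunk_size=1 << 20):
    """Return how many of the first `limit` bytes of `path` are zero
    before the first non-zero byte (or before end of file)."""
    seen = 0
    with open(path, "rb") as f:
        while seen < limit:
            chunk = f.read(min(chunk_size, limit - seen))
            if not chunk:
                break  # hit the end of the file or device early
            stripped = chunk.lstrip(b"\0")
            if stripped:
                # zeros at the front of this chunk, then something else
                return seen + (len(chunk) - len(stripped))
            seen += len(chunk)
    return seen


# Example (path as used elsewhere in this thread; needs suitable privileges):
# n = count_leading_zero_bytes("/dev/mapper/rscratch-fbsd--test.0", 500 * 1024 * 1024)
# print(n == 500 * 1024 * 1024)
```

If the return value equals the limit, the whole range read back as zeros.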

However then the following happens when I try to use a new FS there:

# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112,
 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792, 20516032,
 21798272, 23080512, 24362752, 25644992, 26927232, 28209472, 29491712, 30773952,
 32056192, 33338432, 34620672, 35902912, 37185152, 38467392, 39749632, 41031872,
 42314112, 43596352, 44878592, 46160832, 47443072, 48725312, 50007552, 51289792,
 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 60265472, 61547712,
 62829952
# mount /dev/da1 /mnt
# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
/dev/da1 on /mnt (ufs, local)
# df
Filesystem               512-blocks   Used    Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016   -16680     102%  /
devfs                             2      2        0     100%  /dev
tmpfs                         65536    608    64928       1%  /var
tmpfs                         40960      8    40952       0%  /tmp
/dev/da1                   60901560     16 56029424       0%  /mnt
# cp /COPYRIGHT /mnt
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 0, cgp: 0xe66de1a4 != bp: 0xf433acbc
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 1, cgp: 0x89ba8532 != bp: 0x3491fbd0
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 3, cgp: 0xdeaf87a7 != bp: 0x3a071e86
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 7, cgp: 0x7085828d != bp: 0xaaae0f19
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 15, cgp: 0x293dfe28 != bp: 0xe2f25f8b
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 31, cgp: 0x9a4d0762 != bp: 0x4119c6e
[[  and on and on  ]]
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 49, cgp: 0x931f84e5 != bp: 0xb48687df

/mnt: create/symlink failed, no inodes free
cp: /mnt/COPYRIGHT: No space left on device
# Apr 11 20:37:28  syslogd: last message repeated 4 times
Apr 11 20:37:59  kernel: pid 713 (cp), uid 0 inumber 2 on /mnt: out of inodes
# df -i
Filesystem               512-blocks   Used    Avail Capacity iused   ifree %iused  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016   -16680     102% 12129     285    98%  /
devfs                             2      2        0     100%     0       0   100%  /dev
tmpfs                         65536    608    64928       1%    75  114613     0%  /var
tmpfs                         40960      8    40952       0%     6   71674     0%  /tmp
/dev/da1                   60901560     16 56029424       0%     2 4012796     0%  /mnt
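
As a cross-check on the newfs output above, the backup superblock locations follow directly from the figures newfs printed. The sketch below assumes the backups are evenly spaced, one per cylinder group, starting at the first location shown; the constant and function names are illustrative only.

```python
BLOCK_SIZE = 32768        # "block size 32768" from the newfs output
BLOCKS_PER_GROUP = 20035  # "20035 blks" per cylinder group
SECTOR_SIZE = 512
FIRST_BACKUP = 192        # first backup location printed by newfs
GROUPS = 50               # "50 cylinder groups"


def backup_superblock_sectors():
    """Sector offset of each cylinder group's backup superblock (even spacing assumed)."""
    sectors_per_group = BLOCKS_PER_GROUP * BLOCK_SIZE // SECTOR_SIZE
    return [FIRST_BACKUP + cg * sectors_per_group for cg in range(GROUPS)]


offsets = backup_superblock_sectors()
print(offsets[1], offsets[-1])
print(round(BLOCKS_PER_GROUP * BLOCK_SIZE / 2**20, 2), "MB per group")
```

The second and last computed values agree with the list newfs printed, and the group size agrees with the 626.09MB it reported. So the layout newfs intended is at least self-consistent.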




NetBSD can actually make some sense of this FreeBSD filesystem though:

# fsck -n /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0 (NO WRITE)
Invalid quota magic number

CONTINUE? yes

** File system is already clean
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

BLK(S) MISSING IN BIT MAPS
SALVAGE? no

** Phase 6 - Check Quotas

CLEAR SUPERBLOCK QUOTA FLAG? no

2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *



I'm not sure if those problems are to be expected with a FreeBSD-created
filesystem or not.  Probably the "Invalid quota magic number" is normal,
but I'm not sure about the "BLK(s) MISSING IN BIT MAPS".  Have FreeBSD
and NetBSD FFS diverged this much?  I won't try to mount it, especially
not from the dom0.

Dumpfs shows the following:

file system: /dev/mapper/rscratch-fbsd--test.0
format  FFSv2
endian  little-endian
location 65536  (-b 128)
magic   19540119        time    Sun Apr 11 13:46:15 2021
superblock location     65536   id      [ 60735d32 358197c4 ]
cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
nbfree  951584  ndir    2       nifree  4012796 nffree  21
ncg     50      size    7864320 blocks  7612695
bsize   32768   shift   15  m

one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
So, with the vnd(4) issue more or less sorted, there seems to be one
major mystery remaining w.r.t. whatever has gone wrong with the ability
of NetBSD-current XEN3_DOM0 to host FreeBSD domUs.

I still can't create a clean filesystem on a writeable disk.  The
"newfs" runs fine, but a subsequent "fsck" finds errors and cannot fix
them (though the first run does change one or two things).

I can't even get a clean fsck of the running system's root FS:
(the "ada0: disk error" after I hit ^C is because the underlying disk
(vnd0d) is exported read-only to the domU)


# fsck -v /dev/ufs/FreeBSD_Install
start / wait fsck_ufs /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install

SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n


ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe

* FILE SYSTEM MARKED DIRTY *

#


Most mysteriously this filesystem is in use as the root FS and all the
files in it can be found and read!  Presumably they are all intact too
-- no programs have failed or behaved mysteriously (except fsck) and all
the human readable files I've looked at (e.g. manual pages) all seem
fine.  In fact it only seems to be fsck that complains, possibly along
with any attempt to write to a filesystem, that causes problems.  (I
believe writing to a filesystem appears to corrupt it but that is only
according to fsck.  I do believe there was an eventual crash of a
system that had been running with active filesystems, but I have not got
far enough again since to reproduce this, due to the fsck problem.)

# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
# df
Filesystem               512-blocks   Used    Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016   -16680     102%  /
devfs                             2      2        0     100%  /dev
tmpfs                         65536    232    65304       0%  /var
tmpfs                         40960      8    40952       0%  /tmp
# time -l sh -c 'find  / -type f | xargs cat > /dev/null '
   38.58 real 1.36 user18.30 sys
  4872  maximum resident set size
13  average shared memory size
 5  average unshared data size
   215  average unshared stack size
  1906  page reclaims
 0  page faults
 0  swaps
 14024  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 12348  voluntary context switches
33  involuntary context switches


In fact I can put a copy of the FreeBSD img file into an LVM LV, attach
it to the running FreeBSD domU, mount it (without an FSCK, since the
FreeBSD_Install filesystem comes clean from the factory), then do
"diff -r -X /mnt -X /dev / /mnt" and find only the expected differences.

So, what could be different about how fsck reads v.s. the kernel itself?

If indeed writing to a filesystem corrupts it, how and why?
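
One way to probe such questions is to read the same byte range using different request sizes and compare the results. The sketch below assumes a Unix-like system where os.pread() is available; the function names and default sizes are hypothetical, and on a raw disk device the offset and sizes would need to be multiples of the sector size.

```python
import os


def read_in_chunks(fd, offset, total, chunk):
    """Read `total` bytes starting at `offset` using requests of at most `chunk` bytes."""
    parts = []
    done = 0
    while done < total:
        data = os.pread(fd, min(chunk, total - done), offset + done)
        if not data:
            break  # end of file or device
        parts.append(data)
        done += len(data)
    return b"".join(parts)


def first_mismatch(path, offset, total, small=512, large=65536):
    """Index of the first byte that differs between small- and large-request
    reads of the same range, or None if both reads agree."""
    fd = os.open(path, os.O_RDONLY)
    try:
        a = read_in_chunks(fd, offset, total, small)
        b = read_in_chunks(fd, offset, total, large)
    finally:
        os.close(fd)
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one read came back shorter than the other
```

On a healthy device the function returns None for any range. Where buffered block devices are involved a cache may hide differences, so a result of None there is weaker evidence.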


It seems NetBSD can make sense of the BSD label inside the FreeBSD
mini-memstick.img file, e.g. when accessed through vnd(4), but it can't
seem to make sense of the filesystem(s) inside (which I guess might be
expected?):

# file -s /dev/rvnd0f
/dev/rvnd0f: DOS/MBR boot sector, BSD disklabel

# disklabel vnd0
# /dev/rvnd0:
type: vnd
disk: vnd
label: fictitious
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 387
total sectors: 791121
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

6 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 d:    791121         0     unused      0     0        # (Cyl.      0 -    386*)
 e:      1600         1    unknown                     # (Cyl.      0*-      0*)
 f:    789520      1601     4.2BSD      0     0     0  # (Cyl.      0*-    386*)
disklabel: boot block size 0
disklabel: super block size 0


# fsck -n /dev/vnd0f
** /dev/rvnd0f (NO WRITE)
BAD SUPER BLOCK: CAN'T FIND SUPERBLOCK
/dev/rvnd0f: CANNOT FIGURE OUT SECTORS PER CYLINDER
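
For context, finding an FFS superblock amounts to checking a handful of fixed offsets for a magic number. The sketch below shows the idea on an in-memory image. The offsets and magic values are as I understand them from the BSD FFS headers and should be treated as assumptions; the function name is hypothetical and only little-endian filesystems are handled.

```python
import struct

# Assumed values, taken from memory of the BSD FFS headers:
SEARCH_OFFSETS = (65536, 8192, 0, 262144)  # byte offsets tried in order
MAGIC_FIELD_OFFSET = 1372                   # position of the magic within the superblock
UFS1_MAGIC = 0x00011954
UFS2_MAGIC = 0x19540119


def find_superblock(image: bytes):
    """Return (offset, "UFS1" or "UFS2") for the first candidate offset that
    holds a known magic number, or None if none is found."""
    for off in SEARCH_OFFSETS:
        pos = off + MAGIC_FIELD_OFFSET
        if pos + 4 > len(image):
            continue  # image too short to hold a superblock here
        (magic,) = struct.unpack_from("<I", image, pos)
        if magic == UFS2_MAGIC:
            return off, "UFS2"
        if magic == UFS1_MAGIC:
            return off, "UFS1"
    return None


# Build a blank image and plant a UFS2 magic where a superblock would put it.
demo = bytearray(70000)
struct.pack_into("<I", demo, 65536 + MAGIC_FIELD_OFFSET, UFS2_MAGIC)
print(find_superblock(bytes(demo)))
```

If the data at those offsets is not what was written to the image, for whatever reason, a search like this comes up empty.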


--
    Greg A. Woods 



Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
On the other hand NetBSD's own .img files work OK.

However interestingly there's a small, but apparently insignificant
(because it works OK) difference between how fdisk sees the disk image
and the vnd0 device:

# fdisk -F images/NetBSD-9.99.81-amd64-live.img
Disk: images/NetBSD-9.99.81-amd64-live.img
NetBSD disklabel disk geometry:
cylinders: 972, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x00000000)
# vndconfig -cv vnd0 images/NetBSD-9.99.81-amd64-live.img
/dev/rvnd0: 7999586304 bytes on images/NetBSD-9.99.81-amd64-live.img
# fdisk vnd0
Disk: /dev/rvnd0
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 64, sectors/track: 32 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x00000000)
21:10 [1.1496] # disklabel vnd0
# /dev/rvnd0:
type: ESDI
disk: image
label: 
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  15622144      2048     4.2BSD   1024  8192    16  # (Cyl.      1 -   7628)
 c:  15622144      2048     unused      0     0        # (Cyl.      1 -   7628)
 d:  15624192         0     unused      0     0        # (Cyl.      0 -   7628)
# disklabel images/NetBSD-9.99.81-amd64-live.img
# images/NetBSD-9.99.81-amd64-live.img:
type: ESDI
disk: image
label: 
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  15622144      2048     4.2BSD   1024  8192    16  # (Cyl.      1 -   7628)
 c:  15622144      2048     unused      0     0        # (Cyl.      1 -   7628)
 d:  15624192         0     unused      0     0        # (Cyl.      0 -   7628)
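
The geometry figures in the listings above all derive from the same total sector count, which a few lines of arithmetic can confirm. This is only a sketch with a hypothetical helper name. It does not reproduce what fdisk or disklabel do internally, and the floor versus round-up reading of the 972 and 973 cylinder counts is a guess.

```python
def geometry_summary(total_sectors, heads, sectors_per_track, bytes_per_sector=512):
    """Derive cylinder counts and size in MB from a sector count and a CHS geometry."""
    per_cyl = heads * sectors_per_track
    whole, remainder = divmod(total_sectors, per_cyl)
    return {
        "sectors_per_cylinder": per_cyl,
        "whole_cylinders": whole,
        "cylinders_rounded_up": whole + (1 if remainder else 0),
        "size_mb": total_sectors * bytes_per_sector // (1024 * 1024),
    }


TOTAL = 7999586304 // 512  # byte size reported by vndconfig, in sectors
print(TOTAL)
print(geometry_summary(TOTAL, 255, 63))  # the BIOS-style geometry
print(geometry_summary(TOTAL, 64, 32))   # the geometry reported for vnd0
```

With 255 heads and 63 sectors per track the disk holds 972 whole cylinders plus a partial one, so 972 and 973 are both defensible. With 64 heads and 32 sectors per track it divides evenly into 7629 cylinders.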



From inside the NetBSD live image:

[   1.4412586] xbd4 at xenbus0 id 4: Xen Virtual Block Device Interface
[   1.4422594] xbd4: using event channel 20
[   1.7112647] entropy: xbd4 attached as an entropy source (collecting without estimation)
[   1.7112647] xbd4: 7629 MB, 512 bytes/sect x 15624192 sectors
[   1.7112647] xbd4: backend features 0x9



# df
Filesystem  1K-blocks     Used    Avail %Cap Mounted on
/dev/xbd4a    7562414  4699114  2485180  65% /
ptyfs               1        1        0 100% /dev/pts
# fdisk xbd4
Disk: /dev/rxbd4
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 1, sectors/track: 2048 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x00000000)



The NetBSD live.img root filesystem seems fine and clean:

# fsck -n /dev/rxbd4a
** /dev/rxbd4a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
32740 files, 2349557 used, 1431650 free (538 frags, 178889 blocks, 0.0% fragmentation)


-- 
Greg A. Woods 



Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
At Sat, 10 Apr 2021 18:44:32 -0700, Brian Buhrow  wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk 
images! (vnd(4) hides labels!)
>
>   hello.  This must be some kind of regression that's ben around a
> while.  I'm runing a xen dom0 with NetBSD-5.2 and xen-3.3.2, very old,
> but vnd(4) does expose the entire file to the domu's including FreeBSD
> 11 and 12 without any corruption or booting issues.  Do you know when
> this trouble began?

I don't know -- I think I've only ever successfully used ISO files, and
I think I gave up on some IMG file(s) previously (possibly not just from
FreeBSD) without trying to understand why they didn't work.

Have you tried specifically with a recent FreeBSD mini-memstick.img file?

I'm thinking (esp. given what I see from "od -c < /dev/rvnd0d") that
what's wrong is the vnd(4) driver is (also?) imposing some
mis-interpreted idea about the number of cylinders and heads or
something like that, especially given that "fdisk vnd0" is so totally
confused about what's in there.

There's a definite pattern of corruption anyway -- I just can't explain
it well enough yet.

--
    Greg A. Woods 



I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
0002000


# dd if=/dev/rvnd0d count=17 msgfmt=quiet| od -c
0000000   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0020000   \0  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0 020  \0  \0  \0
0020020  030  \0  \0  \0 230 005  \0  \0  \0  \0  \0  \0 377 377 377 377
0020040  367 360   p   `  \0  \0  \0 007 200 037  \0 027  \0  \0  \0
0020060   \0   @  \0  \0  \0  \b  \0  \0  \b  \0  \0  \0 005  \0  \0  \0
0020100   \0  \0  \0  \0   <  \0  \0  \0  \0 300 377 377  \0 370 377 377
0020120  016  \0  \0  \0 013  \0  \0  \0 004  \0  \0  \0  \0 020  \0  \0
0020140  003  \0  \0  \0 002  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0
0020160   \0  \0  \0  \0  \0 020  \0  \0 200  \0  \0  \0 004  \0  \0  \0
0020200   \0  \0  \0  \0 300 220 005  \0 001  \0  \0  \0  \0  \0  \0  \0
0020220  367 360   p   `   _   `   A   q 230 005  \0  \0  \0  \b  \0  \0
0020240   \0   @  \0  \0  \0  \0  \0  \0 300 220 005  \0 300 220 005  \0
0020260  027  \0  \0  \0 001  \0  \0  \0  \0   X  \0  \0   0   d 001  \0
0020300  001  \0  \0  \0 377 357 003  \0 375 347 007  \0 016  \0  \0  \0
0020320   \0 001  \0 200  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0020340   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0021000


In fact the vnd0d device seems to give garbage forever -- it seems to
have been completely confused by trying to access a real disk image!


As a side note unfortunately even though access to this LVM-backed
mini-memstick.img file now seems OK enough to get the install booted and
a shell running, access to other FreeBSD xbd(4) devices is still not
working from FreeBSD (i.e. a fresh newfs'ed FS appears corrupt to an
immediate fsck, without mounting, and even fsck of the mounted root in
this IMG fails enormously).

# df
Filesystem               512-blocks   Used    Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016   -16680     102%  /
devfs                             2      2        0     100%  /dev
tmpfs                         65536    232    65304       0%  /var
tmpfs                         40960      8    40952       0%  /tmp
# fsck /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install

SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n


ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cda0: disk error cmd=write 8145-8152 status: fffe

#
* FILE SYSTEM MARKED DIRTY *

#


--
Greg A. Woods 



Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Greg A. Woods
At Wed, 7 Apr 2021 22:47:39 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
> 
> When you create a custom setup like that, you will have to replace
> etc/rc.d/entropy with a custom solution (e.g. mounting some flash storage).

No storage means "NO storage".

> Or you ignore the issue and do the dd at each boot - hopefully not generating
> any strong keys on that machine then (but you would have no good storage
> for those anyway).

Or I don't ignore the issue and instead I fix the code so that it's
still possible to get entropy estimates from non-hardware-RNG devices
and then things keep working the way they used to, and there's still
some possibility of _real_ entropy being used to seed the PRNGs.

From what I've seen here so far I'm far from alone in wanting that
ability.

What's most confusing is why there's such animosity and stubborn
unwillingness to even consider that the old way of getting some entropy
from a few less-than-perfect sources was good enough for many, or even
most, of us.

It's better than no entropy when there are no "perfect" sources, and
that's also a situation that includes many of us.

It doesn't have to be the default.

-- 
    Greg A. Woods 



Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Greg A. Woods
At Wed, 7 Apr 2021 09:52:29 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 03:12:45PM -0700, Greg A. Woods wrote:
> > > Isn't it as simple as:
> > >
> > >   dd bs=32 if=/dev/urandom of=/dev/random
> >
> > No, that still leaves the question of _when_ to run it.  (And, at least
> > at the moment, where to put it.  /etc/rc.local?)
>
> Of course not!
>
> You run it once. Manually. And never again.

Nope, sorry, that's not a good enough answer.  It doesn't solve the
problem of dealing with a lack of mutable storage.

A system _MUST_ be able to be booted and with no user intervention be
able to (eventually) get to the state where /dev/random and getrandom(2)
WILL NOT block, and it _MUST_ be able to do so without the help of any
hardware RNG, and without the ability to store (and read) a seed from a
file or other storage device.

I.e. we _MUST_ be _ABLE_ to choose to use other devices as sources for
entropy, even if they are not perfect.  We had this, it works fine, we
still need it.

--
Greg A. Woods 



Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
At Tue, 6 Apr 2021 20:21:43 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 10:54:51AM -0700, Greg A. Woods wrote:
> >
> > And the stock implementation has no possibility of ever providing an
> > initial seed at all on its own (unlike previous implementations, and of
> > course unlike what my patch _affords_).
>
> Isn't it as simple as:
>
>   dd bs=32 if=/dev/urandom of=/dev/random

No, that still leaves the question of _when_ to run it.  (And, at least
at the moment, where to put it.  /etc/rc.local?)

Isn't something like the following better (assuming you choose your devices
carefully):

echo 'rndctl_flags="-t env;-t disk;-t tty"' >> /etc/rc.conf

That's what my patches fix and allow, and this way you don't have to
guess when you can safely use /dev/urandom as an entropy seed -- the
seeding happens in real time, and only as entropy bits are made
available from those given devices.
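
To make the shape of that setting concrete: as far as I understand the rc.d script, the value is treated as several sets of rndctl options separated by semicolons, each run as its own command. The sketch below shows only that splitting. The function name is hypothetical, and it does not check that the individual flags mean anything to rndctl on a given system.

```python
import shlex


def rndctl_commands(flags):
    """Split a semicolon-separated flags string into one argv list per rndctl run."""
    commands = []
    for group in flags.split(";"):
        args = shlex.split(group)
        if args:  # skip empty groups, e.g. from a trailing semicolon
            commands.append(["rndctl"] + args)
    return commands


for argv in rndctl_commands("-t env;-t disk;-t tty"):
    print(" ".join(argv))
```

Each printed line corresponds to one invocation that would be made at boot, one per device type named in the setting.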

That can also be done by sysinst, assuming a reasonably well worded
question can be answered, and that it might only need to be asked if
there are no "rng" type devices already.

Doing this also requires no network access (ever).

It can even be done, ahead of time, for use on immutable systems.

--
Greg A. Woods 



Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
At Tue, 6 Apr 2021 12:08:54 +, Taylor R Campbell  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The main issue that hits people is that the traditional mechanism by
> which the OS reports a potential security problem with entropy is for
> it to make applications silently hang -- and the issue is getting
> worse now that getrandom() is more widely used, e.g. in Python when
> you do `import multiprocessing'.

I think adding a uprintf(9) call, so that the user who started the
blocked process (i.e. not just the admin) has a better chance of seeing
the warning directly, would be one step closer, and should be extremely
easy.

--
    Greg A. Woods 



Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
 4, flags 0x70, func=0x8083f151, ver=427
kern.entropy.gather (1.1260.1264): CTLTYPE_INT, size 4, flags 0x70, func=0x8083dd4c, ver=428
kern.entropy.needed (1.1260.1265): CTLTYPE_INT, size 4, flags 0x100, ver=429
kern.entropy.pending (1.1260.1266): CTLTYPE_INT, size 4, flags 0x100, ver=430
kern.entropy.epoch (1.1260.1267): CTLTYPE_INT, size 4, flags 0x100, ver=431

Perhaps function pointer values shouldn't be printed as integers?


And there are no text descriptions for some of the kern.entropy values:

17:27 [1.831] # sysctl -d kern.entropy.needed
kern.entropy.needed: (no description)
17:27 [1.832] # sysctl -d kern.entropy.pending
kern.entropy.pending: (no description)
17:27 [1.833] # sysctl -d kern.entropy.epoch
kern.entropy.epoch: (no description)


--
Greg A. Woods 



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 15:37:49 -0400, Thor Lancelot Simon  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> >
> > BTW, to me reusing the same entropy on every reboot seems less secure.
>
> Sure.  But that's not what the code actually does.
>
> Please, read the code in more depth (or in this case, breadth), then argue
> about it.

Sorry, I was alluding to the idea of sticking the following in
/etc/rc.local as the brain-dead way to work around the problem:

echo -n "" > /dev/random

However I have not yet read and understood enough of the code to know
if:

dd if=/dev/urandom of=/dev/random bs=32 count=1

is any more "secure" -- I'm guessing (hoping?) it depends on exactly
when this might be run, and also depends on which, if any, other device
sources are enabled for "collecting".  If in some rare case none were
enabled, or if it were run before any were able to "stir the pool", then
I'm guessing it would be no more secure than writing a fixed string.



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
"stir" the pot in the first place, then why not just "count"
it as "real" entropy and be done with it -- at least then it is obvious
when enough entropy has been gathered and the currently implemented
algorithms handle things properly and securely and all inside the
kernel.  I.e. the admin doesn't have to put a "sleep 30" or whatever in
front of it and hope that's enough and that it's still not too
predictable.



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 03:02:42 +0200, Joerg Sonnenberger  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Except that's not what the system is doing. It removes the seed file on
> boot and creates a new one on shutdown.

That's not exactly what the documentation says it does (from rndctl(8)):

-L  Load saved entropy from file save-file and overwrite it with a
 seed derived by hashing it together with output from /dev/urandom
 so that the new seed has at least as much entropy as either the
 old seed had or the system already has.  If interrupted, either
 the old seed or the new seed will be in place.

The code seems to concur.

Also the system re-saves the $random_file via /etc/security
(unconditionally, i.e. always, but only if $random_file is set).
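
The "either the old seed or the new seed will be in place" property can be
had in userland with the usual write-then-rename idiom.  The sketch below
shows only that idiom; it is not how rndctl(8) is implemented, it does no
hashing, and the helper name replace_seed is made up for illustration.

```sh
#!/bin/sh
# replace_seed FILE: read the new contents on stdin and swap them into FILE.
# The temporary file lives next to FILE so that mv is a rename within one
# filesystem; a reader sees either the old contents or the new, never a mix.
replace_seed() {
    file=$1
    tmp="$file.tmp.$$"
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$file"
}

# Usage with a scratch file (NOT an rndctl-format seed file):
#   dd if=/dev/urandom bs=32 count=1 2>/dev/null | replace_seed /tmp/example-seed
```

If the write to the temporary file fails, the old file is left untouched.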



Re: how do I mount a read-only filesystem from the "root device" prompt?

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 07:04:32 - (UTC), mlel...@serpens.de (Michael van Elst) 
wrote:
Subject: Re: how do I mount a read-only filesystem from the "root device" 
prompt?
>
> Someone would need to write code to "upgrade" vnodes. I doubt that's
> trivial.

Indeed -- I've underestimated the complexity of such low-level changes
in the past -- they can snowball out of control!

> Fortunately it is not necessary. If the parent device is read-only,
> no "upgrade" will help to make it read-write. So you open read-write
> or fail back to read-only when necessary. An attempt to open a wedge
> read-write on a read-only opened parent device then has to fail.

Yes, this makes sense.

> I'm testing a patch for that...

Excellent!  Thank you very much!



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 16:13:55 +1200, Lloyd Parkes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The current implementation prints out a message whenever it blocks a
> process that wants randomness, which immediately makes this
> implementation superior to all others that I have ever seen. The
> number of times I've logged into systems that have stalled on boot and
> made them finish booting by running "ls -lR /" over the past 20 years
> are too many to count. I don't know if I just needed to wait longer
> for the boot to finish, or if generating entropy was the fix, and I
> will never know. This is nuts.

Indeed!

> We can use the message to point the system administrator to a manual
> page that tells them what to do, and by "tells them what to do", I
> mean in plain simple language, right at the top of the page, without
> scaring them.

Excellent idea!  :-)

However I have been wondering if sending the message just to the
console, and logging it, say in /var/log/kern, is sufficient.

It still took me a very long time to find the existing new message
because I don't hang out on the console -- this is a VM, after all, and
it's running in a city almost exactly 4200km driving distance from me
too!  As-is I feel I hang out on the console more often than the average
admin who doesn't use a physical console, and of course infinitely more
often than any user who doesn't admin his own server.

I have added the following comment to the kernel to remind me to think
more about this, as a uprintf(9) at the same time would pop right up on
the actual user's session too:

--- kern_entropy.c.~1.30.~  2021-03-07 17:23:05.0 -0800
+++ kern_entropy.c  2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@

 		/* Wait for some entropy to come in and try again.  */
 		KASSERT(E->stage >= ENTROPY_WARM);
-		printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+		printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
 		       curproc->p_pid, curproc->p_comm);

 		if (ISSET(flags, ENTROPY_SIG)) {




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If I understood it properly, there's no need for such a knob.
> echo 0123456789abcdef0123456789abcdef > /dev/random
>
> will get you back to the state we had in netbsd-9, with (pseudo-)randomness
> collected from devices.

Well, no, not quite so much randomness.  Definitely pseudo though!

My patch on the other hand can at least inject some real randomness into
the entropy pool, even if it is observable or influenceable by nefarious
dudes who might be hiding out in my garage.



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Sun, 4 Apr 2021 18:47:23 -0700, Brian Buhrow  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Hello.  As I understand it, Greg ran into this problem on a xen domu.
> In checking my NetBSD-9 system running as a domu under xen-4.14.1,
> there is no rdrand or rdseed feature exposed to domu's by xen.  This
> observation is confirmed by looking at the xen command line reference
> page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

The problem in the domU was really just the very tip of the iceberg.

The dom0 exhibits the exact same problem and for the same reasons.

> and NetBSD doesn't trust the random sources provided by the xennet(4)
> and xbd(4) drivers.  Therefore, the only solution to get randomness
> working for the first time on a newly installed domu is to write 32
> bytes to /dev/random.

It's not that the xbd(4) devices, etc. are not trusted as entropy
sources -- the new entropy system doesn't trust anything, real or
virtual, despite the documentation saying that it can be made to do so.

My patch fixes that bug.  It was very obvious once I understood the root
of the issue.

As a result my patch fixes the bug for Xen dom0 and domU.

Writing randomness to /dev/random is _NOT_ a general solution (though it
could be IFF it can be reliably taken from /dev/urandom AND IFF the rest
of the system and documentation is completely and adequately fixed to
match the new regime).

What perturbs me the most and makes me rather angry is that the rest of
the system, and the system documentation, continued to lie and mislead
me for days (and it didn't help that nobody who knew this was pointing
helpfully and clearly at the root of the problem).  So, my patch ALSO
restores the kernel's behaviour to match the documentation and tools
(specifically rndctl).  That the core of it is just a two-line patch
makes this fix extremely satisfying.



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 23:09:18 +, Taylor R Campbell  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If you know this (and this is something I certainly can't confidently
> assert!), you can write 32 bytes to /dev/random, save a seed, and be
> done with it.

I don't have random data easily available at install time.

I don't have random data easily available every time I boot a machine
with non-persistent storage (e.g. a test ISO image).

I _do_ trust well enough the sources of randomness in some device
drivers to provide me with a secure enough amount of entropy, for my
purposes.

And so with my fix(es) I don't need to feed supposedly random data to
every system on every install and/or every reboot.

What's worse?  My fixes, or something like this in /etc/rc.local:

   echo -n "" > /dev/random

> But users who don't go messing around with obscure rndctl settings in
> rc.conf will be proverbially shot in the foot by this change -- except
> they won't notice because there is practically guaranteed to be no
> feedback whatsoever for a security disaster until their systems turn
> up in a paper published at Usenix like <https://factorable.net/>.

You're really stretching your argument thinly if you are assuming
everyone _needs_ perfect entropy here.

Also, that's only if the default RND_FLAG_ESTIMATE_* bits are turned off.

AND only if the system doesn't have some true hardware RNG.

> What your change does is equivalent to going around to every device
> driver that previously said `this provides zero entropy, or I don't
> know how much entropy it provides' and replacing that claim by `this
> is a sample of an independent and perfectly uniform random string of
> bits', which is a much stronger (and falser) claim than even the old
> `entropy estimation' confabulation that NetBSD used to do.

No, only if the default RND_FLAG_ESTIMATE_* bits are ***NOT*** turned off.

AND only if the user is like me and stuck with some poor second-grade
ancient hardware that doesn't have some fancy new true hardware RNG.

In the meantime a more productive approach would be to figure out
what's best for those of us who don't need perfection every time and/or
to fix those device drivers that could feed sufficiently random data to
the entropy pool, and then to recommend a suitable value for
rndctl_flags in /etc/rc.conf.



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 5 Apr 2021 01:05:58 +0200, Joerg Sonnenberger  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Part of the problem here is that most of the non-RNG data sources are
> easily observable either from the local system (e.g. any malicious user)
> or other VMs on the same machine (in case of a hypervisor) or local
> machines on the same network (in case of network interrupts).

It _Just_ _Doesn't_ _Matter_  (i.e. for many of us, most of the time).

Now ideally in the hypervisor scenario we would have a backend device
that read from /dev/random and offered it to the VM guest as a virtual
hardware RNG.  Or maybe it's as simple as passing those few bytes
through a custom Xenstore string and having a script in the VM read them
and inject them into /dev/random.  But that's not been done yet.

BTW, personally, on at least some machines, I don't have any worry
whatsoever at the moment about one VM guest spying on, or influencing
the PRNG, in another.  Zero worry.  They're all _me_.  I don't need some
theoretically perfect level of protection from myself.



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 05 Apr 2021 00:14:30 +0200 (CEST), Havard Eidnes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > What about architectures that have nothing like RDRAND/RDSEED?  Are
> > they, effectively, totally unsupported now?
>
> Nope, not entirely.  But they have to be seeded once.  If they
> have storage which survives reboots, and entropy is saved and
> restored on reboot, they will be ~fine.

BTW, to me reusing the same entropy on every reboot seems less secure.

> Systems without persistent storage and also without RDRAND/RDSEED
> will however be ... a more challenging problem.

Leaving things like that would be totally silly.

With my patch the old way of gathering entropy from devices works just
fine as it always did, albeit with the second patch it does require a
tiny bit of extra configuration.



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 05 Apr 2021 00:07:49 +0200 (CEST), Havard Eidnes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Indeed, that's also compatible with what I wrote.  The samples
> from whatever sources you have are still being mixed into the
> pool, but they are not being counted as contributing to the
> entropy estimate, because the quality of the samples is at best
> unknown.

Perhaps we're talking past each other?

Until I made the fix no amount of time or activity or of me telling the
system to make use of the driver inputs was unblocking getrandom(2) or
/dev/random, so it doesn't really matter if anything was being "mixed
into the pool" so to speak as the pool was empty.
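
To make the distinction concrete, here is a toy model of the accounting
alone.  It is only a sketch: the 256-bit target is an assumption (it matches
the 32 bytes people in this thread suggest writing to /dev/random), the
names needed, credit and feed are made up, and no mixing is modelled at all.
Whether 32 bits per sample is a defensible claim is exactly what is in
dispute; the sketch only shows why a credit of 0 never unblocks anything.

```sh
#!/bin/sh
# Toy model of entropy *accounting* only; nothing here is kernel code.
needed=256      # bits still wanted before readers unblock (assumed target)

# credit BITS: count BITS of claimed entropy against the target.
credit() {
    if [ "$1" -ge "$needed" ]; then
        needed=0
    else
        needed=$(($needed - $1))
    fi
}

# feed COUNT BITS: credit COUNT samples at BITS bits each.
feed() {
    n=0
    while [ "$n" -lt "$1" ]; do
        credit "$2"
        n=$(($n + 1))
    done
}

feed 8 0        # eight 32-bit samples, each credited 0 bits
echo "credit 0 bits/sample:  needed=$needed"

feed 8 32       # the same, each credited sizeof(uint32_t) * NBBY bits
echo "credit 32 bits/sample: needed=$needed"
```

With a credit of 0 the counter never moves no matter how many samples
arrive; with 32 bits per sample eight samples are enough.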

> A possible workaround is, once you have some uptime and some bits
> mixed into the pool, you can do:

I don't need a work-around -- I found a fix.  I corrected some code that
was purposefully ignoring my orders for how it should behave.

> I am still of the fairly firm belief that the mistrust in the
> hardware vendors' ability to make a reasonable and robust
> implementation is without foundation.

Well there are still millions of systems out there without the fancy
newer hardware RNGs available to make them more secure than Fort Knox.
At least a small handful of them run NetBSD for me, and I want them to
work for my needs, and I was, and am, quite happy with using entropy that
can be collected from various devices that my systems (virtual and real)
actually have.



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 16:39:11 -0400 (EDT), Mouse  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > No amount of uptime and activity was increasing the entropy in my
> > system before I patched it.
>
> As I understand it, entropy was being contributed.  What wasn't
> happening was the random driver code recognizing and acknowledging that
> entropy, because it had no way to tell how much of it there really was.

Clearly there was no entropy being contributed in any way shape or form.

It wasn't the driver code at fault.

It was the code I fixed with my patch that was at fault.

I told the system to "count" the entropy being gathered by the
appropriate driver(s), but it was being ignored entirely.

After my fix the system behaved as I told it to.



Re: how do I mount a read-only filesystem from the "root device" prompt?

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 01:19:44 -0700, John Nemeth  wrote:
Subject: Re: how do I mount a read-only filesystem from the "root device" 
prompt?
>
>  Given that it is possible to have partitions on CDs, which is
> common on Suns, but not so much elsewhere, and that anywhere there
> is a partition, there is the possibility of using wedges, it would
> seem that this is essential.

I would think it's not just CDs and hypervisor-provided virtual devices
that can have multiple partitions, use wedges, and yet be read-only.

Are not a wide variety of removable storage devices also capable of
being made "read-only" at the hardware level?

On Apr 4,  7:34, Michael van Elst wrote:
>
> I suggested to make it open read-only if it gets EROFS and to validate
> the open mode against what is possible in this state.

Given the layers of devices and code involved, perhaps it might be
possible to just honour the original mode requested by the code opening
the first partition to mount a filesystem, and then to upgrade the vnode
to write mode if/when that mount is upgraded to write mode or another rw
mount is attempted on another partition on the same device?

I realize there's nothing like VOP_REOPEN() to change the open mode
flags, but if I'm not mistaken that wouldn't be too difficult to
implement.

Anyway I did find this is where the actual EROFS is being returned, and
perhaps changing it to EACCES would be less confusing, or maybe not:

--- sys/arch/xen/xen/xbd_xenbus.c.~1.129.~  2021-02-28 15:45:22.0 -0800
+++ sys/arch/xen/xen/xbd_xenbus.c   2021-04-04 14:21:01.006355121 -0700
@@ -950,7 +950,7 @@
 	if (sc == NULL)
 		return (ENXIO);
 	if ((flags & FWRITE) && (sc->sc_info & VDISK_READONLY))
-		return EROFS;
+		return EACCES;

 	DPRINTF(("xbdopen(%" PRIx64 ", %d)\n", dev, flags));
 	return dk_open(&sc->sc_dksc, dev, flags, fmt, l);





Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 04 Apr 2021 21:14:31 +0200 (CEST), Havard Eidnes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Do note, the existing randomness sources are still being sampled and
> mixed into the pool, so even if the starting state from the saved
> entropy may be known (by violating the security of the storage),
> it's still not possible to predict the complete stream of randomness
> data once the system has seen a bit of uptime (given that there are
> actual other sources of (unverified) entropy which aren't all of too
> low quality).

No amount of uptime and activity was increasing the entropy in my system
before I patched it.  /dev/random remained blocked after days of busy
system activity.  I would argue that most, if not all, of the sources of
entropy identified by rndctl(8) on my systems are high-quality and
secure sources in my circumstances and for my uses.

Perhaps the unpatched implementation isn't doing exactly what you think
it is?

The unpatched implementation completely and entirely prevents the system
from ever using any of those sources, despite showing that they are
enabled for use.

> However, in the new scheme of things, because most of the
> traditional sources have unknown quality, and we have no reliable
> method to estimate how much "actual entropy" those sources
> provide, they no longer count towards the *estimate* of what is
> now a lower bound on the "real" entropy available in the pool.

It really doesn't matter what can be determined in general and from a
distance.

What matters is what a given administrator can determine in particular
for a given application in a given circumstance.

Before my patch the system was not behaving as documented and could not
be made to behave as the documentation said it could be made to behave.

With my patch I can choose which to trust from amongst the available
sources.  Without that patch my choices are ignored and the system lies
to me about using my choices.  I would argue my patch fixes a critical
bug.

> Besides, the implementation has been thoroughly vetted.  E.g. the
> reference [7] from the wikipedia article states in the conclusion on
> page 20
>
>    Overall, the Ivy Bridge RNG is a robust design with a large
>    margin of safety that ensures good random data is generated even
>    if the Entropy Source is not operating as well as predicted.

"design" != implementation



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 04 Apr 2021 23:47:10 +0700, Robert Elz  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If we want really good security, I'd submit we need to disable
> the random seed file, and RDRAND (and anything similar) until we
> have proof that they're perfect.

Indeed, I concur.

I trust the randomness and in-observability and isolation of the
behaviour of my system's fans far more than I would trust Intel's RDRAND
or RDSEED instructions.

I even trust the randomness of the timings of the virtual disks in my
Xen domU virtual machines more-so, even with multiple sibling guests,
even if some of those other guests can be influenced by untrusted third
parties at critical times.

> Personally, I'm happy with anything that your average high school
> student is unlikely to be able to crack in an hour.   I don't run
> a bank, or a military installation, and I'm not the NSA.   If someone
> is prepared to put in the effort required to break into my systems,
> then let them, it isn't worth the cost to prevent that tiny chance.
> That's the same way that my house has ordinary locks - I'm sure they
> can be picked by someone who knows what they're doing, and better security
> is available, at a price, but a nice happy medium is what fits me best.

Indeed again.



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 09:49:58 +, Taylor R Campbell  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > Date: Sat, 03 Apr 2021 12:24:29 -0700
> > From: "Greg A. Woods" 
> >
> > Updating a system, even on -current, shouldn't create a long-lived
> > situation where the system documentation and the behaviour and actions
> > of system commands is completely out of sync with the behaviour of the
> > kernel, and in fact lies to the administrator about the abilities of the
> > system.
>
> It would help if you could identify specifically what you are calling
> a lie.
>
> > @@ -1754,21 +1766,21 @@
> >  rnd_add_uint32(struct krndsource *rs, uint32_t value)
> >  {
> >
> > -   rnd_add_data(rs, &value, sizeof value, 0);
> > +   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
> >  }
>
> The rnd_add_uint32 function is used by drivers to feed in data from
> sources _with no known model for their entropy_.

Indeed -- that's the idea.

> It's how drivers
> toss in data that might be helpful but might be totally predictable, and
> the driver has no way to know.

Yeah, so?  They don't need to know this.  I'm not actually asking random
drivers to decide the amount of physical entropy they can collect.
That is controlled elsewhere.

> Your change _creates_ the lie that every bit of data entered this way
> is drawn from a source with independent uniform distribution.

No, my change _allows_ the administrator to decide which devices can be
used as estimating/counting entropy sources.  For example I know that
many of the devices on almost all of my machines (virtual or otherwise)
are equally good sources of entropy for their uses.

An additional change, one which I would also find totally acceptable,
would be to disable the current default of allowing "estimation" on
devices which are not true hardware RNGs.  I.e. maybe this simple change
would suffice (though I haven't checked beyond a quick grep to see that
this flag is the most commonly used one -- perhaps some real RNG
devices could also be changed to use explicit flags to enable estimation
by default):

--- sys/sys/rndio.h.~1.2.~  2016-07-23 14:36:45.0 -0700
+++ sys/sys/rndio.h 2021-04-04 12:39:15.609936311 -0700
@@ -91,8 +91,7 @@
 #define RND_FLAG_ESTIMATE_TIME  0x00004000  /* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE 0x00008000  /* estimate entropy on value */
 #define RND_FLAG_HASENABLE      0x00010000  /* has enable/disable fns */
-#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-                            RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME)

 #define RND_TYPE_UNKNOWN        0   /* unknown source */
 #define RND_TYPE_DISK           1   /* source is physical disk */
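
The effect of that one-line change on the default can be checked with a
little shell arithmetic.  This is only a sketch: the value 0x4000 for
RND_FLAG_ESTIMATE_TIME is the one in the diff above, but the two COLLECT
values are assumptions for illustration and should be checked against
sys/sys/rndio.h; the helper name estimates is made up.

```sh
#!/bin/sh
RND_FLAG_COLLECT_TIME=$((0x1000))     # assumed value
RND_FLAG_COLLECT_VALUE=$((0x2000))    # assumed value
RND_FLAG_ESTIMATE_TIME=$((0x4000))    # value shown in the diff

old_default=$(($RND_FLAG_COLLECT_VALUE | $RND_FLAG_COLLECT_TIME | $RND_FLAG_ESTIMATE_TIME))
new_default=$(($RND_FLAG_COLLECT_VALUE | $RND_FLAG_COLLECT_TIME))

# estimates FLAGS: succeed if FLAGS request time-based entropy estimation.
estimates() {
    [ $(($1 & $RND_FLAG_ESTIMATE_TIME)) -ne 0 ]
}

if estimates "$old_default"; then old_verdict="estimates"; else old_verdict="collects only"; fi
if estimates "$new_default"; then new_verdict="estimates"; else new_verdict="collects only"; fi

printf 'old RND_FLAG_DEFAULT = 0x%x (%s)\n' "$old_default" "$old_verdict"
printf 'new RND_FLAG_DEFAULT = 0x%x (%s)\n' "$new_default" "$new_verdict"
```

Under those assumed values the old default still collects and estimates,
while the new one only collects, so estimation becomes something the
administrator must turn on explicitly per source.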


There are a vast number of ways this re-tooling of entropy collection
could have been done better.

I'm asking for discussion on what amount to some VERY simple changes
which completely and totally solve many real-world uses of this code
while at the same time not just allowing, but defaulting to, the very
strict and secure operation for special situations.



how do I mount a read-only filesystem from the "root device" prompt?

2021-04-03 Thread Greg A. Woods
So with Xen one can export a "disk" (disk, file, LVM partition, etc.)
with "access=ro", and that is enforced.

However if one tries to mount such a disk in a domU as root, it fails.

When one first looks at the code which does the initial vfs_mountroot it
would appear to be correct -- i.e. it is only trying to open the root
filesystem device for reading: it uses VOP_OPEN() to open the root
device with FREAD (which I think means "only for reading"):

	error = VOP_OPEN(rootvp, FREAD, FSCRED);
	if (error) {
		printf("vfs_mountroot: can't open root device, error = %d\n", error);
		return (error);
	}

However something seems to assume that if the device is disk-like (i.e.
not a CD-ROM/DVD) then it should be opened for writing too, as we get:

root on dk1
vfs_mountroot: can't open root device, error = 30
cannot mount root, error = 30

(errno #30 is of course EROFS)

I'm not even sure where this is happening.

vfs_rootmountalloc() does indeed set MNT_RDONLY, but this error is
happening before vfs_mountroot() calls ffs_mountroot (through the
vfs_mountroot pointer).

So I'm lost -- any hints?  Is it from bounds_check_with_label()?  How?



regarding the changes to kernel entropy gathering

2021-04-03 Thread Greg A. Woods
So, I'm not sure what to say here.

I'm very surprised, quite confused, more than a little perturbed, and
even somewhat angry.  It's taken me quite some time to write this.

Now temper this with knowing that I do know I'm running -current, not a
release, and that I accept the challenges this might cause (thus see the
patch below).

Updating a system, even on -current, shouldn't cause what I can only
describe as _intentional_ breakage, even for matters so important as
system security and integrity, and especially not without clear mention
in UPDATING, and perhaps also with documented and referenced tools to
assist in undoing said breakage.

Updating a system, even on -current, shouldn't create a long-lived
situation where the system documentation and the behaviour and actions
of system commands is completely out of sync with the behaviour of the
kernel, and in fact lies to the administrator about the abilities of the
system.

In any case, the following patch (and in particular the last hunk) fixes
all my problems and complaints in this domain.  It is fully tested, and
it works A-OK with Xen in both domU and dom0 kernels.  My systems once
again have consistent documentation, and tools that don't lie, and are
able to function as before w.r.t. matters related to /dev/random and
getrandom(2).

Now I'm not proposing this as the final solution -- I think there's some
middle ground to be found, but at least this gets things back to working.


--- sys/kern/kern_entropy.c.~1.30.~ 2021-03-07 17:23:05.0 -0800
+++ sys/kern/kern_entropy.c 2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@

 		/* Wait for some entropy to come in and try again.  */
 		KASSERT(E->stage >= ENTROPY_WARM);
-		printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+		printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
 		       curproc->p_pid, curproc->p_comm);

 		if (ISSET(flags, ENTROPY_SIG)) {
@@ -1577,6 +1577,16 @@
 	KASSERT(i == __arraycount(extra));
 	entropy_enter(extra, sizeof extra, 0);
 	explicit_memset(extra, 0, sizeof extra);
+
+	aprint_verbose("entropy: %s attached as an entropy source (", rs->name);
+	if (!(flags & RND_FLAG_NO_COLLECT)) {
+		printf("collecting");
+		if (flags & RND_FLAG_NO_ESTIMATE)
+			printf(" without estimation");
+	}
+	else
+		printf("off");
+	printf(")\n");
 }

 /*
@@ -1610,6 +1620,8 @@

 	/* Free the per-CPU data.  */
 	percpu_free(rs->state, sizeof(struct rndsource_cpu));
+
+	aprint_verbose("entropy: %s detached as an entropy source\n", rs->name);
 }

 /*
@@ -1754,21 +1766,21 @@
 rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint64(struct krndsource *rs, uint64_t value)
 {

-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 /*



Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-03-31 Thread Greg A. Woods
At Thu, 1 Apr 2021 04:13:59 + (UTC), RVP  wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  or dom0!!!
>
> Does this /etc/entropy-file match what's there in your /boot.cfg?
>
> On my laptop $random_file is left at the default which is:
> /var/db/entropy-file

Yes I did change that as well (as /var isn't part of the root partition).

However that's not the problem for the dom0.

"rndseed" isn't currently used (at least not by me or any documentation
I'm aware of) when loading (multibooting) a Xen kernel and a NetBSD dom0
kernel.

/etc/rc.d/random_seed will do this (again) later anyway.

However, since (as I showed) the hardware doesn't seem to be providing
entropy that can be "counted" ("estimated"), there's nothing to save,
and so nothing to load on the next boot either.

I know how to seed it -- but that's not the problem -- the hardware
should be providing plenty of entropy.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-03-31 Thread Greg A. Woods
Intel"; CPUID level 11

Intel-specific functions:
Version 000206c2:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 12 -
Stepping 2
Reserved 8

Extended brand string: "Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 34
Hyper threading siblings: 32

Feature flags 1fc9cbf5:
FPU    Floating Point Unit
DE     Debugging Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
CLFSH  CFLUSH instruction
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b2: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06C2----


I noted today though that entropy doesn't seem to be accumulating even
in the dom0 despite there being many useful sources configured to both
collect and "estimate" _and_ despite the fact there's a valid-looking
$random_file that was saved and reloaded by /etc/rc.d/random_seed (and
saved again every day by /etc/security):

# /etc/rc.d/random_seed rcvar
# random_seed
random_seed=YES
# ls -l /etc/entropy-file
-rw---  1 root  wheel  536 Mar 31 04:15 /etc/entropy-file
# rndctl -l
Source            Bits Type  Flags
ipmi0-Temp           0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp1          0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp2          0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp3          0 env   estimate, collect, v, t, dv, dt
ipmi0-Ambient-T      0 env   estimate, collect, v, t, dv, dt
ipmi0-Planar-Te      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4      0 env   estimate, collect, v, t, dv, dt
ipmi0-Status         0 ???   estimate, collect, t, dt
ipmi0-Voltage        0 power estimate, collect, v, t, dv, dt
ipmi0-Voltage1       0 power estimate, collect, v, t, dv, dt
ipmi0-Status1        0 ???   estimate, collect, t, dt
ipmi0-Intrusion      0 ???   estimate, collect, t, dt
ipmi0-Temp4          0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp5          0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp6          0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5      0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5      0 env   estimate, collect, v, t, dv, dt
ipmi0-Ambient-T      0 env   estimate, collect, v, t, dv, dt
ipmi0-Ambient-T      0 env   estimate, collect, v, t, dv, dt
ums0                 0 tty   estimate, collect, v, t, dt
ukbd0                0 tty   estimate, collect, v, t, dt
/dev/random          0 ???   estimate, collect, v
sd2                  0 disk  estimate, collect, v, t, dt
sd1                  0 disk  estimate, collect, v, t, dt
sd0                  0 disk  estimate, collect, v, t, dt
cpu0                 0 vm    estimate, collect, v, t, dv
hardclock            0 skew  estimate, collect, t
pckbd0               0 tty   estimate, collect, v, t, dt
system-power         0 power estimate, collect, v, t, dt
autoconf             0 ???   estimate, collect, t
seed                 0 ???   estimate, collect, v
# sysctl kern.entropy
kern.entropy.collection = 1
kern.entropy.depletion = 0
kern.entropy.consolidate = -23552
kern.entropy.gather = -23552
kern.entropy.needed = 256
kern.entropy.pending = 0
kern.entropy.epoch = 19

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
[[ sorry I've not been catching up on mailing list discussions as fast
as I had hoped to, and I'm way behind on following the entropy rototill. ]]

At Wed, 31 Mar 2021 00:12:31 +, Taylor R Campbell  
wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  (causing python3.7 
rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> This is false.  If the VM host provided a viornd(4) device then NetBSD
> would automatically collect, and count, entropy from the host, with no
> manual intervention.

I'll leave that idea to others more up-to-date on Xen PV drivers to
respond to.  Booting a -current GENERIC kernel (which has both Xen PV
and virtio(4) devices configured into it) in a "type='pvh'" domU only
attaches the xenbus PV devices, no virtio devices, so adding virtio
might be a much bigger task that will need further support on at least
the backend, and perhaps on the front-end too, especially to do it
without QEMU.  I haven't tested whether virtio devices show up in an
HVM domU, precisely because I'm trying to avoid having to run and rely
on QEMU (never mind any performance implications of HVM).

> > Finally, if the system isn't actually collecting entropy from a device,
> > then why the heck does it allow me to think it is (i.e. by allowing me
> > to enable it and show it as enabled and collecting via "rndctl -l")?
>
> The system does collect samples from all those devices.  However, they
> are not designed to be unpredictable and there is no good reliable
> model for just how unpredictable they are, so the system doesn't
> _count_ anything from them.  See https://man.NetBSD.org/entropy.4 for
> a high-level overview.

I'm not sure the word "count" appears in entropy(4) in any context I can
make sense of w.r.t. what it means to "collect" but not "count"
entropy from those devices.

Worse, the "Flags" shown by "rndctl -l" don't seem to be directly
documented (i.e. they're not described in rndctl(8)), and even on a
kernel running on real hardware I don't see the word "count" showing
there.

After looking at the source I'm not sure the descriptions of the
RND_FLAG_* values in rnd(4) help me much either.

Based on my vague understanding of all of this, perhaps you meant to say
"estimate", instead of "count"?  That would make more sense in the
context of what I read in rnd(4) and rndctl(8), though "estimate" still
seems a little vague in meaning to me.

In any case, I don't see why an xbd disk, or a xennet interface, can't
be treated exactly as if they were real hardware (i.e. in terms of
extracting entropy from their behaviour).  This is exactly what
virtualization is all about to me -- even for paravirtualization.  After
all in a threat-free world (i.e. specifically where I also trust other
domUs) their entropy is going to reflect (though maybe not exactly
mirror) the entropy of the underlying hardware and/or network traffic.
So (but maybe not by default) if I as the admin want to trust the
entropy available from an xbd(4) or xennet(4) device, then I should be
able to enable it with rndctl(8) and have it "count".

More importantly though the system shouldn't mislead me into thinking it
is "counting" entropy from a device when it is actually not.  If I had
seen that there were no sources estimating/counting/whatever entropy,
and I tried to enable one and was given a nice error message about this
not being possible, then I would have looked elsewhere to find out how
to give the system more bits of entropy.  As is in my Xen domU system
the output of "rndctl -l" leads me to believe all of my devices are
collecting both timing and value samples, and using either one or the
other to gather entropy (though with '-v' I don't see that any bits of
entropy have been added from any of those many millions of collected
samples).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
At Tue, 30 Mar 2021 23:53:43 +0200, Manuel Bouyer  
wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  (causing python3.7 
rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > [...]
> >
> > Perhaps the answer is that nothing seems to be contributing anything to
> > the entropy pool.  No matter what device I exercise, none of the numbers
> > in the following changes:
>
> yes, it's been this way since the rnd rototill. Virtual devices are
> not trusted.
>
> The only way is to manually seed the pool.

Ah, so that is definitely not what I expected!

Previously wasn't it up to the local admin what to trust?  I guess
throwing bits into /dev/random is one way to play that game, but...

I have to trust the dom0 implicitly and utterly anyway, so why not trust
the devices it presents?

This is especially true for xbd block devices.  All my blocks are belong
to dom0.

The network device is in effect no different than if it were real
hardware, so if I want to trust network traffic, then I should be able
to enable it, just as I could if it were real hardware.

The CPUs are also probably the least "virtual" things in Xen, so why not
trust them?  (Though I'm not sure I understand what entropy they can
offer in the first place.)

Finally, if the system isn't actually collecting entropy from a device,
then why the heck does it allow me to think it is (i.e. by allowing me
to enable it and show it as enabled and collecting via "rndctl -l")?

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
ue to lack of entropy
[ 563844.834413] entropy: pid 7903 (python) blocking due to lack of entropy
[ 566365.511377] entropy: pid 9001 (python) blocking due to lack of entropy
[ 577473.897830] entropy: pid 9350 (python) blocking due to lack of entropy
[ 579179.381600] entropy: pid 25728 (od) blocking due to lack of entropy
[ 579186.994440] entropy: pid 11107 (cat) blocking due to lack of entropy
[ 579202.264290] entropy: pid 7248 (cat) blocking due to lack of entropy
[ 579669.831978] entropy: ready


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


At Tue, 30 Mar 2021 10:06:19 -0700, "Greg A. Woods"  wrote:
Subject: python3.7 rebuild stuck in kernel in "entropy" during an "import" 
statement
>
> So I've been running a pkg-rolling_replace and one of the packages being
> rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
> wait in the kernel, and it's been in this state for over 24hrs as you
> can see.
>
> The only things the process has open appear to be its stdio descriptors,
> two of which are open on the log file I was directing all output to.
>
> This is on a Xen domU of a machine running:
>
> $ uname -a
> NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 
> PDT 2021  
> woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0
>  amd64
>
>
> 09:51 [504] $ ps -lwwp 19875
> UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN   STAT TTY      TIME COMMAND
>   0 19875 11551   0  85  0 55412 11324 entropy I    pts/0 0:00.27 ./python -E 
> -Wi 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
>  -d /usr/pkg/lib/python3.7 -f -x 
> bad_coding|badsyntax|site-packages|lib2to3/tests/data 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
> 09:51 [505] $ ps -uwwp 19875
> USER   PID %CPU %MEM   VSZ   RSS TTY   STAT STARTED    TIME COMMAND
> root 19875  0.0  0.1 55412 11324 pts/0 I     9:09PM 0:00.27 ./python -E -Wi 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
>  -d /usr/pkg/lib/python3.7 -f -x 
> bad_coding|badsyntax|site-packages|lib2to3/tests/data 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
> 09:51 [506] $ fstat -p 19875
> USER CMD      PID FD MOUNT        INUM MODE          SZ|DV R/W
> root python 19875 wd /build   10645634 drwxr-xr-x     1024 r
> root python 19875  0 /dev/pts        3 crw---        pts/0 rw
> root python 19875  1 /build    3721223 -rw-r--r-- 28287492 w
> root python 19875  2 /build    3721223 -rw-r--r-- 28287492 w
> 09:51 [507] $ find /build -inum 3721223
> /build/packages/root/pkg_roll.out
> 09:51 [508] $
>
>
> It was killable -- I sent SIGINT from the tty and it died as expected.
>
>
> Running "make replace" gets it stuck in the same place again, and the
> SIGINT shows the following stack trace:
>
> PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
>   LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1  
> ./python -E -Wi 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
>   -d /usr/pkg/lib/python3.7 -f  -x 
> 'bad_coding|badsyntax|site-packages|lib2to3/tests/data'  
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
> ^T
> [ 563859.5589422] load: 0.39  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
> make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
> make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> ^T
> [ 563866.4606073] load: 0.36  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
> make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
> make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> ^?Traceback (most recent call last):
>   File 
> "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py",
>  line 20, in 
> from concurrent.futures import ProcessPoolExecutor
>   File "", line 1032, in _handle_fromlist
>   File 
> "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py",
>  line 43, in __getattr__
> from .process import ProcessPoolExecutor as pe
>   File 
> "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py",
>  line 53, i

python3.7 rebuild stuck in kernel in "entropy" during an "import" statement

2021-03-30 Thread Greg A. Woods
So I've been running a pkg-rolling_replace and one of the packages being
rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
wait in the kernel, and it's been in this state for over 24hrs as you
can see.

The only things the process has open appear to be its stdio descriptors,
two of which are open on the log file I was directing all output to.

This is on a Xen domU of a machine running:

$ uname -a
NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 PDT 
2021  
woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0
 amd64


09:51 [504] $ ps -lwwp 19875
UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN   STAT TTY      TIME COMMAND
  0 19875 11551   0  85  0 55412 11324 entropy I    pts/0 0:00.27 ./python -E 
-Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
 -d /usr/pkg/lib/python3.7 -f -x 
bad_coding|badsyntax|site-packages|lib2to3/tests/data 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [505] $ ps -uwwp 19875
USER   PID %CPU %MEM   VSZ   RSS TTY   STAT STARTED    TIME COMMAND
root 19875  0.0  0.1 55412 11324 pts/0 I     9:09PM 0:00.27 ./python -E -Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
 -d /usr/pkg/lib/python3.7 -f -x 
bad_coding|badsyntax|site-packages|lib2to3/tests/data 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [506] $ fstat -p 19875
USER CMD      PID FD MOUNT        INUM MODE          SZ|DV R/W
root python 19875 wd /build   10645634 drwxr-xr-x     1024 r
root python 19875  0 /dev/pts        3 crw---        pts/0 rw
root python 19875  1 /build    3721223 -rw-r--r-- 28287492 w
root python 19875  2 /build    3721223 -rw-r--r-- 28287492 w
09:51 [507] $ find /build -inum 3721223
/build/packages/root/pkg_roll.out
09:51 [508] $


It was killable -- I sent SIGINT from the tty and it died as expected.


Running "make replace" gets it stuck in the same place again, and the
SIGINT shows the following stack trace:

PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
  LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1  
./python -E -Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
  -d /usr/pkg/lib/python3.7 -f  -x 
'bad_coding|badsyntax|site-packages|lib2to3/tests/data'  
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
^T
[ 563859.5589422] load: 0.39  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^T
[ 563866.4606073] load: 0.36  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^?Traceback (most recent call last):
  File "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py", line 20, in <module>
    from concurrent.futures import ProcessPoolExecutor
  File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py", line 43, in __getattr__
    from .process import ProcessPoolExecutor as pe
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py", line 53, in <module>
    import multiprocessing as mp
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/__init__.py", line 16, in <module>
    from . import context
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/context.py", line 5, in <module>
    from . import process
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 363, in <module>
    _current_process = _MainProcess()
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 347, in __init__
    self._config = {'authkey': AuthenticationString(os.urandom(32)),
KeyboardInterrupt
*** Error code 1 (ignored)
*** Signal 2
*** Signal 2
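
The last frame in that trace is simply os.urandom(32).  One way to check
whether that call would block on a given system, without hanging the
shell, is to run it in a helper thread with a timeout.  This is only a
minimal sketch; the function name and the 5 second timeout are made up
for illustration:

```python
import os
import threading


def urandom_with_timeout(nbytes, timeout):
    """Return nbytes from os.urandom(), or None if the call is still
    blocked after `timeout` seconds."""
    result = []

    def worker():
        result.append(os.urandom(nbytes))

    # A daemon thread will not keep the interpreter alive if it stays blocked.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result[0] if result else None


key = urandom_with_timeout(32, 5.0)
if key is None:
    print("os.urandom(32) still blocked after 5 seconds")
else:
    print("got", len(key), "bytes")
```

On a system where the kernel considers the pool seeded this should
return right away; on the domU described here it would presumably report
the call as blocked.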



--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: kern/54969 (Disk cache is no longer flushed on shutdown)

2021-03-25 Thread Greg A. Woods
RNING: some file systems would 
not unmount
[Wed Mar 24 20:43:02 2021][ 715718.5284461] forcefully unmounting 
/dev/mapper/scratch-build from /build...
[Wed Mar 24 20:43:02 2021][ 715718.5284461] forcefully unmounted 
/dev/mapper/scratch-build from /build, type ffs
[Wed Mar 24 20:43:02 2021][ 715718.5384534] unmount of / (/dev/dk0) failed with 
error 16
[Wed Mar 24 20:43:02 2021][ 715718.5384534] WARNING: some file systems would 
not unmount
[Wed Mar 24 20:43:02 2021][ 715718.5384534] forcefully unmounting /dev/dk0 from 
/...
[Wed Mar 24 20:43:02 2021][ 715718.5384534] forcefully unmounted /dev/dk0 from 
/, type ffs
[Wed Mar 24 20:43:02 2021][ 715718.5384534] unmounting done
[Wed Mar 24 20:43:02 2021][ 715718.5384534] turning off swap... done
[Wed Mar 24 20:43:02 2021][ 715718.5384534] dk0 at sd0 (/) deleted
[Wed Mar 24 20:43:02 2021][ 715718.5384534] sd0: detached
[Wed Mar 24 20:43:02 2021][ 715718.5384534] scsibus0: detached
[Wed Mar 24 20:43:02 2021][ 715718.7184994] mfi0: detached
[Wed Mar 24 20:43:02 2021][ 715718.7184994] pci8: detached
[Wed Mar 24 20:43:02 2021][ 715718.7184994] ppb7: detached
[Wed Mar 24 20:43:02 2021][ 715718.7184994] unmounting done
[Wed Mar 24 20:43:02 2021][ 715718.7184994] turning off swap... done
[Wed Mar 24 20:43:02 2021][ 715718.7184994] rebooting...

[[ ... why is "turning off swap" seen twice? .. ]]

[[ ... and then the reboot, until rc scripts say ... ]]

[Wed Mar 24 20:44:51 2021]Starting root file system check:
[Wed Mar 24 20:44:51 2021]/dev/rdk0: file system is clean; not checking
[Wed Mar 24 20:44:51 2021]start / wait fsck_ffs -p /dev/rdk0


[Wed Mar 24 20:44:52 2021]Starting file system checks:
[Wed Mar 24 20:44:52 2021]/dev/rdk2: file system is clean; not checking
[Wed Mar 24 20:44:52 2021]/dev/rdk3: file system is clean; not checking

[[ ... here I hit ^T on the console as it was taking too long ... ]]

[Wed Mar 24 20:44:58 2021][  15.0201108] load: 0.08  cmd: sleep 345 [nanoslp] 
0.00u 0.00s 0% 512k
[Wed Mar 24 20:44:58 2021]/dev/mapper/rscratch-build: phase 1: cyl group 24 of 
345 (6%)
[Wed Mar 24 20:46:09 2021]/dev/mapper/rscratch-build: phase 1: cyl group 284 of 
345 (82%)
[Wed Mar 24 20:49:30 2021]/dev/mapper/rscratch-build: 1400986 files, 36172587 
used, 28347707 free (17403 frags, 3541288 blocks, 0.0% fragmentation)
[Wed Mar 24 20:49:30 2021]/dev/mapper/rscratch-build: MARKING FILE SYSTEM CLEAN
[Wed Mar 24 20:49:30 2021]start /var nowait fsck_ffs -p /dev/rdk2
[Wed Mar 24 20:49:30 2021]start /build nowait fsck_ffs -p 
/dev/mapper/rscratch-build
[Wed Mar 24 20:49:30 2021]done ffs: /dev/rdk2 (/var) = 0x0
[Wed Mar 24 20:49:30 2021]start /usr/pkg nowait fsck_ffs -p /dev/rdk3
[Wed Mar 24 20:49:30 2021]done ffs: /dev/rdk3 (/usr/pkg) = 0x0
[Wed Mar 24 20:49:30 2021]done ffs: /dev/mapper/rscratch-build (/build) = 0x0
[Wed Mar 24 20:49:30 2021]Script /etc/rc.d/fsck running
[Wed Mar 24 20:49:30 2021]Currently sourcing /etc/rc.d/fsck
[Wed Mar 24 20:49:30 2021]exec: mount_ffs -o rw /dev/dk2 /var
[Wed Mar 24 20:49:30 2021]exec: mount_ffs -o rw /dev/dk2 /var
[Wed Mar 24 20:49:30 2021]/dev/dk2 on /var type ffs (local, fsid: 0xa802/0x78b, 
reads: sync 1 async 0, writes: sync 2 async 0)


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




a reminder: upgrade Xen in single-user mode, or with Xen disabled!

2021-03-25 Thread Greg A. Woods
So I just upgraded Xen to xenkernel413-4.13.2nb5, but without first
upgrading the Xen tools, as otherwise how would one safely shut down any
running domUs, etc.?  :-)

After upgrading to xentools413-4.13.2nb4 I immediately got stuck:

# xl list
Name        ID   Mem VCPUs  State   Time(s)
[ 578.9865720] load: 0.27  cmd: xl 2027 [tstile] 0.00u 0.01s 0% 3080k

and I mean "really" stuck -- xl is unkillable (and unstoppable) in that
state!

At first I had grave misgivings that the old tstile deadlock was back,
but at the moment only dom0 is running.

So, thinking that the old xenstored is started on boot and will still
be running, I figured I needed to restart it from another xterm with
"/etc/rc.d/xencommons restart", and voila, that unstuck xl.

Probably xl shouldn't get stuck like that if it can't connect to
xenstored properly -- as I said it's unkillable in that state!

I then tried "/etc/rc.d/xenwatchdog restart" but it didn't restart (for
some reason I've yet to diagnose -- I had this happen once before -- it
seems to have trouble restarting sometimes, perhaps especially after
restarting xencommons).

That meant that a few moments later the Xen kernel decided dom0 was dead
and promptly (and I mean PROMPTLY) rebooted the machine -- kaboom!

(XEN) [2021-03-25 04:16:26.951] Watchdog timer fired for domain 0
(XEN) [2021-03-25 04:16:26.951] Hardware Dom0 shutdown: watchdog rebooting 
machine

At least on this next reboot all the right versions of the right bits
started!

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: still a problem with gpt(8) reading from LVM volumes? (was: problems with GPT (and maybe dkctl wedges) on LVM volumes)

2021-03-23 Thread Greg A. Woods
At Fri, 19 Mar 2021 00:04:28 -0700, John Nemeth  wrote:
Subject: Re: still a problem with gpt(8) reading from LVM volumes? (was: 
problems with GPT (and maybe dkctl wedges) on LVM volumes)
>
>  One of the projects I have in mind is to replace the data
> structure.  One good thing about the program is that all manipulation
> of the data structure is done through access routines and is
> appropriately contained in map.c and map.h.  One thing that is
> slowing me down is thinking of an appropriate data structure for
> tracking allocated space (the current method gets this pretty much
> for free).  One traditional way to do this would be to use a bitmap,
> but with the size of modern disks, that is completely infeasible.
> Note that whatever method is chosen must be able to handle duplicate
> allocations (i.e. overlapping partitions).

Hi John,

I have done some work in the past couple of years with code that deals
with what I think are sometimes called "extents" or "intervals".  The
code I worked on was primarily merging and diffing and searching sets of
extents.

Another application of extents is in calendar scheduling.

Anyway I have some small example bits of Go code here:

https://github.com/robohack/experiments/blob/master/t-interval-complement.go
https://github.com/robohack/experiments/blob/master/t-interval-complement_test.go

I also copied some code from stackoverflow to play with:

https://github.com/robohack/experiments/blob/master/tintervals-merge.py
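
The basic idea can be sketched in a few lines of Python.  This is not
the code behind the links above, and all of the names are illustrative.
Allocations are kept as half-open [start, end) extents, overlapping or
duplicate allocations are allowed in the input list, and free space
falls out as the complement of the merged list:

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # half-open [start, end), e.g. in sectors


def merge_extents(extents: List[Extent]) -> List[Extent]:
    """Collapse overlapping or touching extents into a sorted list."""
    merged: List[Extent] = []
    for start, end in sorted(extents):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous extent: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def free_extents(extents: List[Extent], disk_size: int) -> List[Extent]:
    """Return the gaps in [0, disk_size) not covered by any extent."""
    free: List[Extent] = []
    cursor = 0
    for start, end in merge_extents(extents):
        if start > cursor:
            free.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < disk_size:
        free.append((cursor, disk_size))
    return free


# Two overlapping partitions plus a separate one, on a 500-sector disk.
allocated = [(34, 100), (50, 200), (300, 400)]
print(merge_extents(allocated))
print(free_extents(allocated, 500))
```

The original list keeps every allocation, overlaps included, so memory
use scales with the number of partitions rather than with the size of
the disk as a bitmap would.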

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: odd ATF failure for sh: ulimit_redirection_interaction failed

2021-03-23 Thread Greg A. Woods
At Sat, 13 Mar 2021 20:57:20 +0700, Robert Elz  wrote:
Subject: Re: odd ATF failure for sh: ulimit_redirection_interaction failed
>
> OK, I see what is going on now.
>
> The difference for you is your initial ulimit -n
> value.  Not that it is big, but that when
> reduced the way that test does it, getting
> smaller and smaller till < 16, it happens to
> land on a value < 10 as the first such limit
> it tries.  Using the default max fd value
> doesn't do that, it reaches 15 or something
> and stops.   sh does not work well with less than
> 10 available fds.

Heh.  I landed on very nearly that clue when I started tracing the
script, but I didn't quite realize the implications!

Thank you very much for figuring it out!

It turns out of course that I had made a typo in my /etc/login.conf and
I had accidentally given my root class a soft limit of a very small
number of open files, just 64.  On the good side, /usr/tests was the
only thing that seemed to run into any problems with this!  (But of
course that was just for root shells -- my normal userid had 2000.)

> But first, make sh give a better indication what
> the problem is when this kind of thing does
> happen.

Yes, Please!

> When there are redirections in builtins, the
> existing fd (if any) must be moved elsewhere
> so it can be restored after the builtin exits.
> sh always moves to a fd > 10 for this use.

Ah, that explains to me better what that code is doing and why.
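
A rough illustration of why a very low limit breaks this, assuming the
shell saves the descriptor with something like fcntl(fd, F_DUPFD, 10).
That is an assumption about the mechanism, not something taken from the
sh source, and the helper names are hypothetical:

```python
import fcntl
import os
import resource


def save_fd(fd, minimum=10):
    """Duplicate fd onto the lowest free descriptor >= minimum."""
    return fcntl.fcntl(fd, fcntl.F_DUPFD, minimum)


def can_save_with_limit(soft_limit):
    """Try save_fd() with RLIMIT_NOFILE temporarily lowered to soft_limit."""
    fd = os.open(os.devnull, os.O_RDONLY)
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
        saved = save_fd(fd)
    except OSError:
        # No descriptor >= 10 can be allocated under such a small limit.
        return False
    else:
        os.close(saved)
        return True
    finally:
        # Always restore the original limit and release the test descriptor.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        os.close(fd)


print(can_save_with_limit(8))
print(can_save_with_limit(64))
```

With a soft limit below 10 there is simply no descriptor number left in
the range the duplicate is supposed to land in, so the save fails before
the redirection itself is even attempted.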

> ps: attempting to follow fd usages inside sh
> is not something for the faint of heart.

Indeed.  As I was staring at it a couple of weeks ago I was
coincidentally reminded of Gosling Emacs -- maybe that sh code could
borrow Gosling's skull and crossed bones comment from his display.c  :-)

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)

2021-03-23 Thread Greg A. Woods
At Mon, 22 Mar 2021 21:56:40 - (UTC), chris...@astron.com (Christos Zoulas) 
wrote:
Subject: Re: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)
>
> Thanks, I fixed the shuttle-rule issue, but let's split the LDSTATIC
> and the OPTIM into separate commits. DBG has side effects too (other
> Makefiles set it) so it should be done very carefully.

Thank you very much!

Yes, the other issues should be kept separate.  At the moment though I
have to be a bad workman and blame my tools for not making it easy for
me to produce diffs that separate issues.  Hopefully if/when NetBSD
finally makes it into a modern VCS then I'll be able to use the tools
I've become very familiar with more recently in other endeavours to
create topic-specific diffs!

The LDSTATIC and related COMPILE_LINK.* and CPPFLAGS changes are quite
simple and straightforward though, and I've used them for
nearly/more-than a decade now.  They are critically necessary for doing
full static builds but of course are only part of that story, though
luckily a completely independent part of it.

I've also used the OPTIM/DBG change for as long or longer, though I have
seen some interaction with other third-party Makefiles (probably none
within NetBSD itself, though of course I'll have to scan my tree
just to be sure I haven't forgotten fixing something somewhere).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)

2021-03-21 Thread Greg A. Woods
At Sun, 21 Mar 2021 16:44:31 -0700, "Greg A. Woods"  wrote:
Subject: sys.mk broken for single-suffix rules since 1.144 (2021/11/09)

Sorry, make that 2020/11/09, of course  :-)

Also this only applies to a few platforms (i386, x86_64, and aarch64),
and when the Makefile used somehow ends up including , but
does not use  and/or does not set PROG

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




sys.mk broken for single-suffix rules since 1.144 (2021/11/09)

2021-03-21 Thread Greg A. Woods
RGET} ${.IMPSRC} ${LDLIBS}
+   ${COMPILE_LINK.c} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
 # XXX: disable for now
 #  ${CTFCONVERT_RUN}
 .c.o:
@@ -138,7 +151,7 @@

 # C++
 .cc .cpp .cxx .C:
-   ${LINK.cc} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
+   ${COMPILE_LINK.cc} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
 # XXX: disable for now
 #  ${CTFCONVERT_RUN}
 .cc.o .cpp.o .cxx.o .C.o:
@@ -151,8 +164,9 @@

 # Fortran/Ratfor
 .f:
-   ${LINK.f} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.f} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
 .f.o:
${COMPILE.f} ${.IMPSRC} ${OBJECT_TARGET}
${CTFCONVERT_RUN}
@@ -162,8 +176,9 @@
rm -f ${.PREFIX}.o

 .F:
-   ${LINK.F} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.F} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
 .F.o:
${COMPILE.F} ${.IMPSRC} ${OBJECT_TARGET}
${CTFCONVERT_RUN}
@@ -173,8 +188,9 @@
rm -f ${.PREFIX}.o

 .r:
-   ${LINK.r} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.r} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
 .r.o:
${COMPILE.r} ${.IMPSRC} ${OBJECT_TARGET}
${CTFCONVERT_RUN}
@@ -185,8 +201,9 @@

 # Pascal
 .p:
-   ${LINK.p} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.p} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
 .p.o:
${COMPILE.p} ${.IMPSRC} ${OBJECT_TARGET}
${CTFCONVERT_RUN}
@@ -197,8 +214,9 @@

 # Assembly
 .s:
-   ${LINK.s} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.s} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
 .s.o:
${COMPILE.s} ${.IMPSRC} ${OBJECT_TARGET}
${CTFCONVERT_RUN}
@@ -207,8 +225,9 @@
${AR} ${ARFLAGS} ${.TARGET} ${.PREFIX}.o
rm -f ${.PREFIX}.o
 .S:
-   ${LINK.S} ${OBJECT_TARGET} ${.IMPSRC} ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.S} -o ${.TARGET} ${.IMPSRC} ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
 .S.o:
${COMPILE.S} ${.IMPSRC} ${OBJECT_TARGET}
${CTFCONVERT_RUN}
@@ -220,8 +239,9 @@
 # Lex
 .l:
${LEX.l} ${.IMPSRC}
-   ${LINK.c} ${OBJECT_TARGET} lex.yy.c ${LDLIBS} -ll
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.c} -o ${.TARGET} lex.yy.c ${LDLIBS} -ll
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
rm -f lex.yy.c
 .l.c:
${LEX.l} ${.IMPSRC}
@@ -235,8 +255,9 @@
 # Yacc
 .y:
${YACC.y} ${.IMPSRC}
-   ${LINK.c} ${OBJECT_TARGET} y.tab.c ${LDLIBS}
-   ${CTFCONVERT_RUN}
+   ${COMPILE_LINK.c} -o ${.TARGET} y.tab.c ${LDLIBS}
+# XXX: disable for now
+#  ${CTFCONVERT_RUN}
rm -f y.tab.c
 .y.c:
${YACC.y} ${.IMPSRC}

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




still a problem with gpt(8) reading from LVM volumes? (was: problems with GPT (and maybe dkctl wedges) on LVM volumes)

2021-03-18 Thread Greg A. Woods
At Fri, 12 Mar 2021 14:02:06 -0800, I wrote:
Subject: problems with GPT (and maybe dkctl wedges) on LVM volumes
>
> # gpt -vvv show -a /dev/mapper/rvg0-nbtest.0
> /dev/mapper/rvg0-nbtest.0: mediasize=41943040; sectorsize=512; blocks=81920
> /dev/mapper/rvg0-nbtest.0: PMBR at sector 0
> /dev/mapper/rvg0-nbtest.0: Pri GPT at sector 1
> /dev/mapper/rvg0-nbtest.0: GPT partition: type=ffs, start=64, size=41942943
> gpt: /dev/mapper/rvg0-nbtest.0: map entry doesn't fit media: new start + new 
> size < start + size
> (22 + 13fde < 40 + 27fff9f)

I'm still not quite sure why gpt(8) can't show me the full partition
table when reading from a raw LVM volume (dm) device as above in exactly
the same way it does when reading from the raw (xbd emulated) disk in
the domU.


After all if I map, say, an install.img file, then in the domU I see:

# gpt show -a /dev/rxbd4
    start     size  index  contents
        0        1         PMBR
        1        1         Pri GPT header
        2       32         Pri GPT table
       34     2014         Unused
     2048   262144      1  GPT part - EFI System
                           Type: efi
                           TypeID: c12a7328-f81f-11d2-ba4b-00a0c93ec93b
                           GUID: 97ac9806-df43-4590-ae5b-c88d8861ea0e
                           Size: 128 M
                           Label: EFI system
                           Attributes: None
   264192  7544832      2  GPT part - NetBSD FFSv1/FFSv2
                           Type: ffs
                           TypeID: 49f48d5a-b10e-11dc-b99b-0019d1879648
                           GUID: 2865e4e5-a798-4bed-9dc7-2e2317a3d789
                           Size: 3684 M
                           Label:
                           Attributes: biosboot, bootme
  7809024     2015         Unused
  7811039       32         Sec GPT table
  7811071        1         Sec GPT header


and in the dom0 I see the same from the target file:

# gpt show -a /images/NetBSD-9.99.81-amd64-install.img
    start     size  index  contents
        0        1         PMBR
        1        1         Pri GPT header
        2       32         Pri GPT table
       34     2014         Unused
     2048   262144      1  GPT part - EFI System
                           Type: efi
                           TypeID: c12a7328-f81f-11d2-ba4b-00a0c93ec93b
                           GUID: 97ac9806-df43-4590-ae5b-c88d8861ea0e
                           Size: 128 M
                           Label: EFI system
                           Attributes: None
   264192  7544832      2  GPT part - NetBSD FFSv1/FFSv2
                           Type: ffs
                           TypeID: 49f48d5a-b10e-11dc-b99b-0019d1879648
                           GUID: 2865e4e5-a798-4bed-9dc7-2e2317a3d789
                           Size: 3684 M
                           Label:
                           Attributes: biosboot, bootme
  7809024     2015         Unused
  7811039       32         Sec GPT table
  7811071        1         Sec GPT header


BTW, I've yet to try ccd(4) as an intermediate layer to add
"partitionable disk" semantics -- my first attempt, on the older
(production) Xen system where I was testing this, resulted in a hard
crash as I was running "ccdconfig -u ccd0" to try a different LVM.  I
need to run through the exercise of letting sysinst partition up an xbd0
to try this again on a newer, and less critical, Xen server.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: odd ATF failure for sh: ulimit_redirection_interaction failed

2021-03-12 Thread Greg A. Woods
At Fri, 12 Mar 2021 21:46:29 -0500 (EST), Mouse  
wrote:
Subject: Re: odd ATF failure for sh: ulimit_redirection_interaction failed
>
> > But it still doesn't really make sense from what I see in the source.
> > The attempted FD is #18, but the error just says "1", not "18":
>
> Does > _ever_ work for multi-digit N?  I thought it didn't.  Maybe
> that was just least-common-denominator sh, or maybe the test is broken?

Well, the test does appear to execute without error IFF the condition
the test is meant to exercise is not enforced (i.e. if the ulimit for
open FDs is not kept lower than the number of currently open FDs);
though I have not done any other kind of test to be sure the data sent
to a multi-digit FD is actually received from the given FD.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: odd ATF failure for sh: ulimit_redirection_interaction failed

2021-03-12 Thread Greg A. Woods
At Fri, 12 Mar 2021 15:49:37 +0700, Robert Elz  wrote:
Subject: Re: odd ATF failure for sh: ulimit_redirection_interaction failed
> 
> The -X should allow...
> 
>   | tc-se:stderr:
>   | tc-se:helper.sh: 1: Invalid argument
> 
> to reveal just what is producing that error message (what is invalid).

Thanks for the pointers!

From a few quick tests it looks like the problem is with the redirection
to what should be an already open file descriptor.

But it still doesn't really make sense from what I see in the source.
The attempted FD is #18, but the error just says "1", not "18":

tc-se:+ LIM=9
tc-se:+ ulimit -S -n 9
tc-se:+ '[' 9 -gt 16 ']'
tc-se:+ for FD=18
tc-se:+ echo '18 in 18 38 77 155 311 624 1249 2499 4999 ' >&18
tc-se:helper.sh: 1: Invalid argument
tc-se:+ exit 1
tc-se:
tc-end: 1615574952.421302, ulimit_redirection_interaction, failed, atf-check 
failed; see the output of the test for details

On the other hand for EINVAL fcntl(2) does say:

The argument cmd is F_DUPFD and arg is negative or
greater than the maximum allowable number (see
getdtablesize(3)).

So I'm not so sure this test is valid in the first place, is it?
(I've been somewhat confused by the logic in this test and the logic in
the related code in sh)

Indeed if I move the ulimit call to reset the limit back up to the
old limit of 2000 then it runs through the whole list without error
(well except for the stderr output caused by the "set -x" of course).

tc-so:Executing command [ /bin/sh helper.sh ]
tc-se:Fail: stderr not empty
tc-se:--- /dev/null 2021-03-13 02:13:32.737028821 +
tc-se:+++ /tmp/check.n8ejtt/stderr  2021-03-13 02:13:32.736999310 +
tc-se:@@ -0,0 +1,21 @@
tc-se:++ ulimit -S -n 2000
tc-se:++ for FD=18
tc-se:++ echo 18 in 18 38 77 155 311 624 1249 2499 4999  >&18
tc-se:++ for FD=38
tc-se:++ echo 38 in 18 38 77 155 311 624 1249 2499 4999  >&38
tc-se:++ for FD=77
tc-se:++ echo 77 in 18 38 77 155 311 624 1249 2499 4999  >&77
tc-se:++ for FD=155
tc-se:++ echo 155 in 18 38 77 155 311 624 1249 2499 4999  >&155
tc-se:++ for FD=311
tc-se:++ echo 311 in 18 38 77 155 311 624 1249 2499 4999  >&311
tc-se:++ for FD=624
tc-se:++ echo 624 in 18 38 77 155 311 624 1249 2499 4999  >&624
tc-se:++ for FD=1249
tc-se:++ echo 1249 in 18 38 77 155 311 624 1249 2499 4999  >&1249
tc-se:++ for FD=2499
tc-se:++ echo 2499 in 18 38 77 155 311 624 1249 2499 4999  >&2499
tc-se:++ for FD=4999
tc-se:++ echo 4999 in 18 38 77 155 311 624 1249 2499 4999  >&4999
tc-se:++ for FD=
tc-se:++ echo  in 18 38 77 155 311 624 1249 2499 4999  >&
tc-end: 1615601612.771674, ulimit_redirection_interaction, failed, atf-check 
failed; see the output of the test for details


I want to add some debug printfs to the shell too, but I'm currently
stymied by another problem (I can't access the domU filesystem from the
dom0, and until I can get a complete rebuild to finish so I can do a
full reinstall of the domU, accessing the FS from the dom0 would be the
only easy way I have of injecting changes to the test system since it
has no networking).

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




problems with GPT (and maybe dkctl wedges) on LVM volumes

2021-03-12 Thread Greg A. Woods
So with -current if you present an LVM volume to a domU and then use
sysinst to install on it (and I think IFF you choose "extended
partitioning") you end up with a GPT partitioned LVM volume that the
XEN_DOMU kernel sees as follows:

[   2.0010567] xbd0 at xenbus0 id 0: Xen Virtual Block Device Interface
[   2.0090574] xbd0: 20480 MB, 512 bytes/sect x 41943040 sectors
[   2.0090574] xbd0: backend features 0x1
[   2.0100607] dk0 at xbd0: "nbtest-root", 41942943 blocks at 64, type: ffs

From the running domU the GPT partition table for this device looks like
this:

# gpt show -a /dev/rxbd0  
    start     size  index  contents
        0        1         PMBR
        1        1         Pri GPT header
        2       32         Pri GPT table
       34       30         Unused
       64 41942943      1  GPT part - NetBSD FFSv1/FFSv2
                           Type: ffs
                           TypeID: 49f48d5a-b10e-11dc-b99b-0019d1879648
                           GUID: da2147be-1fe7-4bb3-a1fc-e601c92301fe
                           Size: 20480 M
                           Label: nbtest-root
                           Attributes: biosboot
 41943007       32         Sec GPT table
 41943039        1         Sec GPT header



However attempts to access the filesystem from the dom0 fail (after
seeming to get most of the way to finding the whole primary table):

# gpt -vvv show -a /dev/mapper/rvg0-nbtest.0
/dev/mapper/rvg0-nbtest.0: mediasize=41943040; sectorsize=512; blocks=81920
/dev/mapper/rvg0-nbtest.0: PMBR at sector 0
/dev/mapper/rvg0-nbtest.0: Pri GPT at sector 1
/dev/mapper/rvg0-nbtest.0: GPT partition: type=ffs, start=64, size=41942943
gpt: /dev/mapper/rvg0-nbtest.0: map entry doesn't fit media: new start + new 
size < start + size
(22 + 13fde < 40 + 27fff9f)


This may or may not be related to PR# 54900.

There's also mention of a possibly related issue in this thread:

http://mail-index.netbsd.org/netbsd-users/2020/07/19/msg025551.html

However in my case it looks like gpt(8) when run in the dom0 is having
problems skipping past the "Unused" part.  (suggested because 0x22 == 34d)



Also it seems dkctl doesn't work as I had expected it would on LVM
partitions, even though it can apparently find a viable partition:

# dkctl /dev/mapper/rvg0-nbtest.0 getwedgeinfo
vg0-nbtest.0 at vg0-nbtest.0: vg0-nbtest.0
vg0-nbtest.0: 41943040 blocks at 0, type: ffs

# dkctl /dev/mapper/rvg0-nbtest.0 makewedges
dkctl: /dev/mapper/rvg0-nbtest.0: makewedges: Inappropriate ioctl for device

# dkctl /dev/mapper/rvg0-nbtest.0 addwedge nbtest-root 64 41942943 ffs
dkctl: /dev/mapper/rvg0-nbtest.0: addwedge: Inappropriate ioctl for device


So it looks like I'm back to using plain MBR for domUs again, at least
for my next round of Xen server rebuilds.

-- 
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




odd ATF failure for sh: ulimit_redirection_interaction failed

2021-03-11 Thread Greg A. Woods
My build (for amd64) of very recent -current sources (2021/03/08)
exhibits an odd failure in the ATF tests for /bin/sh.  (I'm not sure when
this first appeared, but it's not there in my older builds, e.g. from
2020/06)

From glancing through the test script I'm not sure quite what's
happening, though I've not tried to dig much deeper yet.

From the log:

tc-start: 1615494985.794473, ulimit_redirection_interaction
tc-so:Executing command [ /bin/sh helper.sh ]
tc-se:Fail: incorrect exit status: 1, expected: 0
tc-se:stdout:
tc-se:
tc-se:stderr:
tc-se:helper.sh: 1: Invalid argument
tc-se:
tc-end: 1615494985.880185, ulimit_redirection_interaction, failed, atf-check 
failed; see the output of the test for details

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: build of 2021/03/07 -current fails because of pthread_types.h (without MKLLVM!)

2021-03-09 Thread Greg A. Woods
So, the clue is in the last two "notes" from the compiler:

/build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:170:8: note: 
'__pthread_cond_st' is not literal because:
  170 | struct __pthread_cond_st {
  |^
/build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:175:17: note: 
  non-static data member '__pthread_cond_st::ptc_waiters' has volatile type
  175 |  void *volatile ptc_waiters;
  | ^~~


These led me to pthread_types.h, and to the change that apparently
introduced the fault (revision 1.25 of pthread_types.h).  Looking at the
full set of changes in 1.25 led me to the definition of
__pthread_volatile, and the comment on that definition suggested the
following fix, which at least allows the compile to continue.  I hate
C++ and I hate debugging C++, but here at least I'm grateful someone had
already figured out how to solve the problem and I only had to apply it
in one more place.

If all goes well I should be able to test the build under Xen in the
next few hours.


Index: lib/libpthread/pthread_types.h
===
RCS file: /cvs/master/m-NetBSD/main/src/lib/libpthread/pthread_types.h,v
retrieving revision 1.25
diff -u -r1.25 pthread_types.h
--- lib/libpthread/pthread_types.h  10 Jun 2020 22:45:15 -  1.25
+++ lib/libpthread/pthread_types.h  9 Mar 2021 22:43:05 -
@@ -172,7 +172,7 @@

    /* Protects the queue of waiters */
    __pthread_spin_t ptc_lock;
-   void *volatile ptc_waiters;
+   void *__pthread_volatile ptc_waiters;
    void *ptc_spare;

    pthread_mutex_t *ptc_mutex; /* Current mutex */

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




build of 2021/03/07 -current fails because of pthread_types.h (without MKLLVM!)

2021-03-08 Thread Greg A. Woods
ded from 
/build/woods/b2/current-amd64-destdir/usr/include/sys/types.h:359,
 from 
/build/woods/b2/current-amd64-destdir/usr/include/sys/endian.h:55,
 from 
/work/woods/m-NetBSD-current-new/external/bsd/libc++/dist/libcxx/include/__config:82,
 from 
/work/woods/m-NetBSD-current-new/external/bsd/libc++/dist/libcxx/include/algorithm:623,
 from 
/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxx/src/algorithm.cpp:10:
/build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:170:8: note: 
'__pthread_cond_st' is not literal because:
  170 | struct __pthread_cond_st {
  |^
/build/woods/b2/current-amd64-destdir/usr/include/pthread_types.h:175:17: note: 
  non-static data member '__pthread_cond_st::ptc_waiters' has volatile type
  175 |  void *volatile ptc_waiters;
  | ^~~

*** Failed target:  algorithm.o
*** Failed command: 
/build/woods/b2/current-amd64-amd64-tools/bin/x86_64--netbsd-c++ 
-frandom-seed=a0ced134 -O2 -g -Wall -Wpointer-arith -Wno-sign-compare 
-Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual 
-Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare 
-Wformat=2 -Werror -Wno-error -pipe -fstack-protector -Wstack-protector --param 
ssp-buffer-size=1 -std=c++11 -Wold-style-cast -Wctor-dtor-privacy 
-Wnon-virtual-dtor -Wreorder -Wno-deprecated -Woverloaded-virtual -Wsign-promo 
-Wsynth -Wno-non-template-friend -Wno-pmf-conversions 
--sysroot=/build/woods/b2/current-amd64-destdir -nostdinc++ -cxx-isystem 
/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxx/include 
-I/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxxrt/src 
-DLIBCXXRT -D_FORTIFY_SOURCE=2 -c 
/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib/../dist/libcxx/src/algorithm.cpp
 -o algorithm.o
*** Error code 1

Stop.
nbmake[1]: stopped in /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib

*** Failed target:  dependall
*** Failed command: cd 
"/work/woods/m-NetBSD-current-new/external/bsd/libc++/lib"; 
/build/woods/b2/current-amd64-amd64-tools/bin/nbmake realall
*** Error code 1

Stop.
nbmake: stopped in /work/woods/m-NetBSD-current-new/external/bsd/libc++/lib


11:32 [102] $ mynbmake -v MKDEBUG
yes
11:32 [103] $ mynbmake -v MKDEBUGLIB
yes
11:32 [104] $ mynbmake -v MKLLVM
no
11:32 [105] $ mynbmake -v MKLLVMRT
no

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




netbsd-5 branch cannot be built with a recent 9.99.64 system (or perhaps any recent GCC?)

2021-02-24 Thread Greg A. Woods
keinfo LIBGCC= 
LIBGCC1= LIBGCC1_TEST= LIBGCC2= INSTALL_LIBGCC= EXTRA_PARTS= 
CPPFLAGS=-DNETBSD_TOOLS AR=ar RANLIB=ranlib BISON=true DESTDIR= 
INSTALL=/build/woods/b2/netbs
 d-5-amd64-i386-tools/bin/i386--netbsdelf-install\ -c\ -p\ -r 
/build/woods/b2/netbsd-5-amd64-i386-tools/bin/nbgmake -e MACHINE= 
MAKEINFO=/build/woods/b2/netbsd-5-amd64-i386-tools/bin/nbmakeinfo LIBGCC= 
LIBGCC1= LIBGCC1_TEST= LIBGCC2= INSTALL_LIBGCC= EXTRA_PARTS= 
CPPFLAGS=-DNETBSD_TOOLS AR=ar RANLIB=ranlib BISON=true DESTDIR= 
INSTALL=/build/woods/b2/netbsd-5-amd64-i386-tools/bin/i386--netbsdelf-install\ 
-c\ -p\ -r all-gcc)
*** Error code 2

Stop.
nbmake: stopped in /work/woods/m-NetBSD-5/tools/gcc


Note there are also lots of other new warnings from a newer compiler
building the older toolchain!

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




one more possible speedup for "make -jN sets" in makesums

2021-02-18 Thread Greg A. Woods
I finally found and enabled USE_PIGZGZIP.  That's a big help!
(especially with the bigger sets I get with all-static builds)

However the "makesums" part of "make sets" still goes one at a time
because of an explicit ".ORDER:" request.  My added comment asks my
question:

--- distrib/sets/Makefile.~1.107.~  2020-05-30 15:20:31.225318105 -0700
+++ distrib/sets/Makefile   2021-02-18 10:05:39.414690365 -0800
@@ -269,6 +269,8 @@
${TOOL_CAT} ${TARDIR}/$$i >> ${TARDIR}/$$i.tmp; \
done
 .endfor
+# XXX this .ORDER is here "so the checksums come out in the proper sequence.",
+# but as a result they cannot be done in parallel!!!  Sorting after!?!?!?
 .ORDER: ${MAKETARSETS:@.TARS.@do-sum-${.TARS.}@}



I think this would currently also assume/require that nbcksum always do
just one write(2) to generate its whole output (I haven't checked that),
or that the whole process can be changed such that they each write to
unique temporary files that are then collected and coalesced after
they've all run.

Perhaps the distrib/sets/makesums script could also run the (currently)
two nbcksum processes in parallel (e.g. if ${.MAKE.JOBS} is set and
greater than one in the makefile then pass a '-j ${.MAKE.JOBS}' option
to the script).

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: is this crash while coredumping known? (forget the link to NFS)

2020-07-15 Thread Greg A. Woods
At Sat, 11 Jul 2020 23:29:05 -0700, "Greg A. Woods"  wrote:
Subject: Re: is this crash while coredumping known? (forget the link to NFS)
>
> So it doesn't seem like this crash has anything to do with NFS after all.

This crash is ongoing for me.

I'll be away for a couple of weeks, then possibly too busy for a few
more weeks, but I hope to update to the most recent -current in the
near future.  Perhaps it's fixed in more recent sources than what I'm
running, but I'm now also wondering about the path through mount_null
mountpoints.

In the meantime I wonder if anyone might try to reproduce this,
particularly with mount_null mountpoints in place.  Since various
packages have configure tests that dump core it's not long before the
crash occurs when building lots of packages.

FYI my sandboxctl.conf file is as follows, with /build being one big
filesystem and with the various pkgsrc vars pointing at /var/package*
places, and with /more being an NFS server with pkgsrc sources (I was
quite surprised I had to add /usr/X11R7 with netbsd-native):

#!/bin/sh

SANDBOX_TYPE=netbsd-native
SANDBOX_ROOT=/build/sandbox/pkgbuild
NETBSD_NATIVE_RELEASEDIR=/build/woods/xentastic/current-amd64-release/amd64

post_mount_hook ()
{
    mkdir -p ${SANDBOX_ROOT}/usr/X11R7
    sandbox_bindfs -o ro /usr/X11R7 ${SANDBOX_ROOT}/usr/X11R7

    mkdir -p ${SANDBOX_ROOT}/usr/src
    sandbox_bindfs -o ro /build/src-current ${SANDBOX_ROOT}/usr/src
    mkdir -p ${SANDBOX_ROOT}/usr/xsrc
    sandbox_bindfs -o ro /build/xsrc-current ${SANDBOX_ROOT}/usr/xsrc
    mkdir -p ${SANDBOX_ROOT}/usr/pkgsrc
    sandbox_bindfs -o ro /more/work/woods/m-NetBSD-pkgsrc-current ${SANDBOX_ROOT}/usr/pkgsrc

    mkdir -p ${SANDBOX_ROOT}/usr/pkg
    sandbox_bindfs -o rw /build/package-pkgbuild ${SANDBOX_ROOT}/usr/pkg

    mkdir -p ${SANDBOX_ROOT}/var/package-distfiles
    sandbox_bindfs -o rw /build/package-distfiles ${SANDBOX_ROOT}/var/package-distfiles
    mkdir -p ${SANDBOX_ROOT}/var/package-obj
    sandbox_bindfs -o rw /build/package-obj ${SANDBOX_ROOT}/var/package-obj
    mkdir -p ${SANDBOX_ROOT}/var/packages
    sandbox_bindfs -o rw /build/packages ${SANDBOX_ROOT}/var/packages

    # xxx to make it easier to source .kshrc
    mkdir -p ${SANDBOX_ROOT}/home
    sandbox_bindfs -o ro /more/home ${SANDBOX_ROOT}/home

    ln -fs /usr/src/etc/mk.conf ${SANDBOX_ROOT}/etc/mk.conf
}


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()

2020-07-15 Thread Greg A. Woods
At Tue, 14 Jul 2020 20:05:57 - (UTC), chris...@astron.com (Christos Zoulas) 
wrote:
Subject: Re: recent changes to pthread_fork.c:fork() cause static linking to 
fail if the app provides its own malloc()
>
> It is not only _malloc_prefork(), it is also _malloc_postfork() and
> _malloc_postfork_child(). The easiest way to fix things is to provide
> them as no-op.

Indeed.

I guess this will have to be the way.  Perhaps some proper documentation
could/should be written about how to do this and exactly what APIs are
necessary to override the internal malloc() entirely.

Note that, for malloc() et al in particular, this is necessary for both
statically linked and dynamically linked programs.

The difference is that with static linking one gets a linker error and
cannot continue, but with dynamic linking one silently invokes
"Undefined Behaviour" (i.e. depending on what the internal malloc() uses
to obtain heap space).
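
The no-op approach suggested above might look like the following
fragment.  It is a sketch under assumptions: the three function names
come from this thread, but the void(void) signatures are assumed here
and should be checked against the libc sources.  The fragment is meant
to be added to the source file of a complete replacement allocator
(mapmalloc.c in the heirloom-sh case), which is why it has no main() of
its own.  It should not be added to a program that still uses libc's
malloc(), since that would replace hooks the libc allocator relies on.

```c
#include <assert.h>

/*
 * No-op fork hooks for an application that supplies its own malloc().
 * Assumption: fork() in libc calls these with no arguments and ignores
 * any result.  A replacement allocator that uses no locks has nothing
 * to do before or after fork().
 */
void
_malloc_prefork(void)
{
}

void
_malloc_postfork(void)
{
}

void
_malloc_postfork_child(void)
{
}
```

A replacement allocator that does take locks would presumably need to
acquire them in the prefork hook and release or reinitialize them in the
two postfork hooks instead of leaving the bodies empty.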

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()

2020-07-13 Thread Greg A. Woods
At Tue, 14 Jul 2020 00:28:46 +0200, Joerg Sonnenberger  wrote:
Subject: Re: recent changes to pthread_fork.c:fork() cause static linking to 
fail if the app provides its own malloc()
>
> On Mon, Jul 13, 2020 at 03:05:17PM -0700, Greg A. Woods wrote:
> > I think it is the following change (and perhaps more similar/related
> > changes) which breaks static linking of applications which wish to
> > supply their own implementation of malloc(), and which call, e.g.,
> > fork():
>
> I consider it a strong WONTFIX. It's no different from not poviding
> posix_memalign etc.


Well, _malloc_prefork() is explicitly named with a leading underscore,
so strictly speaking it's invalid for an application to define it (and
it's not documented, nor in any standard that I can find, with or
without the leading underscore).

So, in my opinion it is invalid for unrelated parts of the library to use
such an internal function and as a result have conflicts with overriding
some functions.

Perhaps splitting all the internal definitions out of jemalloc.c into
their own compilation units and making sure they don't then also still
cause unnecessary inclusion of related code and definitions when
referenced would be a possible work-around, but that will no doubt lead
to later maintenance headaches.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




recent changes to pthread_fork.c:fork() cause static linking to fail if the app provides its own malloc()

2020-07-13 Thread Greg A. Woods
I think it is the following change (and perhaps more similar/related
changes) which breaks static linking of applications which wish to
supply their own implementation of malloc(), and which call, e.g.,
fork():

This is because fork() now calls _malloc_prefork(), and if the
application's replacement does not offer this function (as it should
not), then the linker is forced to drag in all of jemalloc.o.  This of
course happens even if the application is not multi-threaded and is not
linking against -lpthread.

revision 1.15
date: 2020-05-15 07:37:21 -0700;  author: joerg;  state: Exp;  lines: +6 -2;  
commitid: 85oo6pCrePrJul8C;
Hook up proper fork lock handling for malloc:
- lock all relevant mutexes just before fork
- unlock all mutexes just after fork in the parent
- full reinit non-spinlocks in the child
This is not using the normal pthread_atfork interface to ensure order of
operation, malloc is used as implementation detail too often.


For example, here is what happens when static linking pkgsrc/shells/heirloom-sh:

ld: /usr/lib/libc.a(jemalloc.o): in function `malloc':
/build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2056: 
multiple definition of `malloc'; 
mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:195:
 first defined here
ld: /usr/lib/libc.a(jemalloc.o): in function `calloc':
/build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2154: 
multiple definition of `calloc'; 
mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:381:
 first defined here
ld: /usr/lib/libc.a(jemalloc.o): in function `realloc':
/build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2326: 
multiple definition of `realloc'; 
mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:330:
 first defined here
ld: /usr/lib/libc.a(jemalloc.o): in function `free':
/build/src-current/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2416: 
multiple definition of `free'; 
mapmalloc.o:/var/package-obj/root/shells/heirloom-sh/work/heirloom-sh-050706/mapmalloc.c:303:
 first defined here


Should I send-pr this?  Is there any possibility of an "easy" fix?

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: is this crash while coredumping known? (forget the link to NFS)

2020-07-12 Thread Greg A. Woods
So it doesn't seem like this crash has anything to do with NFS after all.

I've been doing package builds in a sandboxctl chroot that access NFS
sources (read-only) but are otherwise entirely confined to a local
filesystem, albeit through sandboxctl's Null mounts.  After many core
dumps (mostly from GNU Configure scripts), one eventually caused another
similar looking crash.

This one did a core dump, but savecore didn't think there was enough
free space left in /var/crash to recover it (even though there is enough
space for dozens of the compressed cores if they compress as well as the
last one).

(Below are the original crash messages for comparison)


[ 200974.6716318] fatal double fault in supervisor mode
[ 200974.6716318] trap type 13 code 0 rip 0x80e3c127 cs 0x8 rflags 
0x10286 cr2 0x9a02af3e6f88
e6f90
[ 200974.6816277] curlwp 0x90f14a2e2bc0 pid 1591.1591 lowest kstack 
0x9a02af3e52c0
kernel: double fault trap, code=0

Stopped in pid 1591.1591 (conftest) at  
netbsd:radix_tree_gang_lookup_node+0x1a:movq%rdx,)
radix_tree_gang_lookup_node() at netbsd:radix_tree_gang_lookup_node+0x1a
uvm_page_array_fill() at netbsd:uvm_page_array_fill+0x14b
uvm_page_array_fill_and_peek() at netbsd:uvm_page_array_fill_and_peek+0x1e
uvn_findpage() at netbsd:uvn_findpage+0x88
uvn_findpages() at netbsd:uvn_findpages+0xcd
genfs_getpages() at netbsd:genfs_getpages+0x959
VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x58
uvn_get() at netbsd:uvn_get+0x57
ubc_fault() at netbsd:ubc_fault+0x182
uvm_fault_internal() at netbsd:uvm_fault_internal+0x51e
trap() at netbsd:trap+0x4e5
--- trap (number 6) ---
kcopy() at netbsd:kcopy+0x15
uiomove() at netbsd:uiomove+0xb7
ubc_uiomove() at netbsd:ubc_uiomove+0x156
ffs_write() at netbsd:ffs_write+0x251
layer_bypass() at netbsd:layer_bypass+0x102
VOP_WRITE() at netbsd:VOP_WRITE+0x40
vn_rdwr() at netbsd:vn_rdwr+0xcc
coredump_write() at netbsd:coredump_write+0xa0
coredump_elf64() at netbsd:coredump_elf64+0x43a
coredump() at netbsd:coredump+0x650
sigexit() at netbsd:sigexit+0x27c
sendsig_siginfo() at netbsd:sendsig_siginfo+0x323
trapsignal() at netbsd:trapsignal+0x371
trap() at netbsd:trap+0x8e7
--- trap (number 6) ---
400581:
ds  23
es  23
fs  0
gs  0
rdi 90eb408bdd58
rsi 0
rbp 9a02af3e7080
rbx 9a02af3e7190
rdx 9a02af3e71b0
rcx 1
rax 80e3c10dradix_tree_gang_lookup_node
r8  0
r9  1
r10 0
r11 2
r12 90eb408bdd40
r13 1
r14 0
r15 90eb408bdd58
rip 80e3c127radix_tree_gang_lookup_node+0x1a
cs  8
rflags  10286
rsp 9a02af3e6f90
ss  0
netbsd:radix_tree_gang_lookup_node+0x1a:movq
%rdx,ff10(%rbp)
db{3}>




savecore: reboot after panic: reboot forced via kernel debugger
savecore: system went down at Sat Jul 11 19:35:25 2020
savecore: no dump, not enough free space in /var/crash


$ df -h /var/crash/
Filesystem    Size  Used Avail %Cap Mounted on
/dev/dk2      3.9G  1.5G  2.2G  40% /var



At Thu, 09 Jul 2020 18:03:23 -0700, "Greg A. Woods"  wrote:
Subject: is this crash while coredumping to NFS known?
>
> Here's what was on the console:
>
> [ 71887.4479952] fatal double fault in supervisor mode
> [ 71887.4479952] trap type 13 code 0 rip 0x809c5051 cs 0x8 rflags 
> 0x10286 cr2 0x8b827c3e4f98 i
> 3e4fa0
> [ 71887.4479952] curlwp 0x8693578524c0 pid 29079.29079 lowest kstack 
> 0x8b827c3e32c0
> kernel: double fault trap, code=0
> Stopped in pid 29079.29079 (tpgsqltime) at  netbsd:ip_output+0x14:  movq  
>   %rsi,fe68(%rbp
>
> ip_output() at netbsd:ip_output+0x14
> tcp_output() at netbsd:tcp_output+0xc68
> tcp_send_wrapper() at netbsd:tcp_send_wrapper+0x9a
> sosend() at netbsd:sosend+0x7e4
> nfs_send() at netbsd:nfs_send+0x86
> nfs_request() at netbsd:nfs_request+0x3d4
> nfs_readrpc() at netbsd:nfs_readrpc+0x204
> nfs_doio() at netbsd:nfs_doio+0x731
> VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x64
> genfs_getpages() at netbsd:genfs_getpages+0x1400
> nfs_getpages() at netbsd:nfs_getpages+0x5d
> VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x80
> uvm_fault_internal() at netbsd:uvm_fault_internal+0x1895
> trap() at netbsd:trap+0x4e5
> --- trap (number 6) ---
> copyin() at netbsd:copyin+0x2f
> uiomove() at netbsd:uiomove+0xb7
> ubc_uiomove() at netbsd:ubc_uiomove+0x156
> nfs_write() at netbsd:nfs_write+0x129
> VOP_WRITE() at netbsd:VOP_WRITE+0x65
> vn_rdwr() at netbsd:vn_rdwr+0xcc
> coredump_write() at netbsd:coredump_write+0x56
> coredump_elf64() at netbsd:coredump_elf64+0x89c
> coredump() at netbsd:coredump+0x650
> sigexit() at netbsd:sigexit+0x27c
> sendsig() at netbsd:sendsig
> lwp_userret() at netbsd:lwp_userret+0x1c5
> trap() at netbsd:trap+0x

Re: USB console support (was: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1)

2020-07-10 Thread Greg A. Woods
At Thu, 09 Jul 2020 18:16:26 -0700, "Greg A. Woods"  wrote:
Subject: USB console support (was: NetBSD-7.0 boots OK and NetBSD-8.0 
hangs/crashes during boot on a MacBook7,1)
> 
> Oh, and I wanted to mention something else that I'd forgotten about but
> stumbled across again the other day while debugging servers:
> 
> Xen supports writing console messages to a special kind of USB port:
> 
>   "console=dbgp" indicates that Xen should use a USB debug port.
> 
> http://xenbits.xenproject.org/docs/4.11-testing/misc/xen-command-line.html
> 
> There's more about it in this thread:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00436.html
> https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00458.html

For what it's worth my Dell servers and my MacBook Pro have such USB
debug ports.

The MacBook Pro even has two of them, and I'm pretty sure one of them is
connected to the external ports.

On the Dell though this seems to be the port that connects to the DRAC.

From "lspci -vvv":

00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09) (prog-if 20 [EHCI])
	Subsystem: Dell Device 01b2
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




USB console support (was: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1)

2020-07-09 Thread Greg A. Woods
At Mon, 06 Jul 2020 13:13:03 -0700, "Greg A. Woods"  wrote:
Subject: Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a 
MacBook7,1
>
> Or indeed any device with any kind of USB port, e.g. a laptop.

Oh, and I wanted to mention something else that I'd forgotten about but
stumbled across again the other day while debugging servers:

Xen supports writing console messages to a special kind of USB port:

"console=dbgp" indicates that Xen should use a USB debug port.

http://xenbits.xenproject.org/docs/4.11-testing/misc/xen-command-line.html

There's more about it in this thread:

https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00436.html
https://lists.xenproject.org/archives/html/xen-devel/2009-03/msg00458.html


Also of interest is that Xen supports writing to both a COM port
and the "vga" console simultaneously.  Indeed it may support writing to
"dbgp" at the same time as well.  This is something I looked into for
NetBSD/i386 some time ago, but never got it fully working.  I think it
would be incredibly valuable for the boot code to be able to talk to
both a serial port and the "pc" console simultaneously.  It is less
important for the kernel to do so, but it still would be nice.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




is this crash while coredumping to NFS known?

2020-07-09 Thread Greg A. Woods
I was running a wee test program this morning, which crashed, and it
seems the kernel crashed while trying to write the core file out.

The current working directory, and the target of the core file, is an
NFS mount.

The core file was created, but is empty:

$ ls -l *.core
-rw-------  1 woods  ostaff  0 Jul  9 11:17 tpgsqltime.core

The system is running my version of 9.99.64, so it's not quite current,
and thus I wanted to ask whether this particular crash is already known
before I send-pr.

I think this is the first time I've had a core dump over NFS since
updating the kernel from 8.99.32.  So I'm not sure yet how easily this
is reproduced, but in any case it is a regression.

Here's what was on the console:

[ 71887.4479952] fatal double fault in supervisor mode
[ 71887.4479952] trap type 13 code 0 rip 0x809c5051 cs 0x8 rflags 
0x10286 cr2 0x8b827c3e4f98 i
3e4fa0
[ 71887.4479952] curlwp 0x8693578524c0 pid 29079.29079 lowest kstack 
0x8b827c3e32c0
kernel: double fault trap, code=0
Stopped in pid 29079.29079 (tpgsqltime) at  netbsd:ip_output+0x14:  movq
%rsi,fe68(%rbp

ip_output() at netbsd:ip_output+0x14
tcp_output() at netbsd:tcp_output+0xc68
tcp_send_wrapper() at netbsd:tcp_send_wrapper+0x9a
sosend() at netbsd:sosend+0x7e4
nfs_send() at netbsd:nfs_send+0x86
nfs_request() at netbsd:nfs_request+0x3d4
nfs_readrpc() at netbsd:nfs_readrpc+0x204
nfs_doio() at netbsd:nfs_doio+0x731
VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x64
genfs_getpages() at netbsd:genfs_getpages+0x1400
nfs_getpages() at netbsd:nfs_getpages+0x5d
VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x80
uvm_fault_internal() at netbsd:uvm_fault_internal+0x1895
trap() at netbsd:trap+0x4e5
--- trap (number 6) ---
copyin() at netbsd:copyin+0x2f
uiomove() at netbsd:uiomove+0xb7
ubc_uiomove() at netbsd:ubc_uiomove+0x156
nfs_write() at netbsd:nfs_write+0x129
VOP_WRITE() at netbsd:VOP_WRITE+0x65
vn_rdwr() at netbsd:vn_rdwr+0xcc
coredump_write() at netbsd:coredump_write+0x56
coredump_elf64() at netbsd:coredump_elf64+0x89c
coredump() at netbsd:coredump+0x650
sigexit() at netbsd:sigexit+0x27c
sendsig() at netbsd:sendsig
lwp_userret() at netbsd:lwp_userret+0x1c5
trap() at netbsd:trap+0x9b7
--- trap (number 6) ---
7c5294:
ds  23
es  23
fs  0
gs  0
rdi 869202438bc0
rsi 0
rbp 8b827c3e5160
rbx 8693660f4988
rdx 869364f08818
rcx 400
rax 0
r8  0
r9  869364f087b8
r10 869202438bc0
r11 0
r12 869364a93040
r13 a0
r14 869364a930b0
r15 6c
rip 809c5051  ip_output+0x14
cs  8
rflags  10286
rsp 8b827c3e4fa0
ss  0
netbsd:ip_output+0x14:  movq    %rsi,fe68(%rbp)
db{0}> machine cpu
addr            dev   id  flags  ipis  spl  curlwp
0x8163a800      cpu0  0   3009   0     8    0x8693578524c0
0x8b825ded      cpu1  4   f002   0     0    0x868c2a81e1c0
0x8b825e0ec000  cpu2  2   f002   0     4    0x86934acd26c0
0x8b825e16d000  cpu3  6   f002   0     0    0x868c2ad6c200
0x8b825e19e000  cpu4  1   f002   0     0    0x868c2a9ec340
0x8b825e1cf000  cpu5  5   f002   0     0    0x868c2aa9d080
0x8b825e20      cpu6  3   f002   0     0    0x868c2aa8e100
0x8b825e231000  cpu7  7   f002   0     0    0x868c2ab3f180
db{0}> ps
PID   LID   S CPU FLAGS STRUCT LWP *  NAME       WAIT
29079>29079 7 0   100   8693578524c0  tpgsqltime


I do have a full kernel core dump, but it's 32GB (345M compressed), and
probably contains data I don't want to share.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1

2020-07-07 Thread Greg A. Woods
At Mon, 6 Jul 2020 23:53:02 +0900, Rin Okuyama  wrote:
Subject: Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a 
MacBook7,1
>
> It seems that stride of framebuffer is not correctly set.
>
> Your laptop has an NVIDIA GPU, doesn't it? If so, nouveaufb(4) is used
> instead of genfb(4), which is normally used when booted from UEFI. It
> should be worth trying

Yes, indeed, it has an NVIDIA GeForce 320M.

>   userconf disable nouveau*
>
> for UEFI bootloader.

Oh, that sounded so very promising!

However unfortunately it made not one bit of difference.

Thank you for the idea though, and also thank you for pointing out the
alternate framebuffer driver that might also be worth looking into.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1

2020-07-06 Thread Greg A. Woods
At Sun, 5 Jul 2020 21:09:27 -0700, Brian Buhrow  wrote:
Subject: Re: NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a 
MacBook7,1
>
>   Hello.  I agree with Mouse, except that I also think it would be very
> helpful and useful to have a serial console on USB only devices.

Or indeed any device with any kind of USB port, e.g. a laptop.

However what would be most generally useful, as opposed to ideal, would
be for just the console output to appear on the first found USB serial
adapter.

So if the kernel can get far enough to probe a USB serial port, then it
should dump the message buffer, and continue to copy everything added to
the message buffer, to that USB serial device.

That's the first and most important step.  Make it simple, easy, and
obvious how to capture all kernel messages on a modern machine without
having to get all the way to the point where one can run "dmesg".
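
As a rough illustration of that replay-then-follow behaviour (not kernel
code, and not a description of how NetBSD's message buffer or console
layers are actually structured), here is a minimal Python sketch.  The
MessageBuffer class and its log() and attach() methods are hypothetical
names invented for the example; the "sink" stands in for a USB serial
port that is only found late in autoconfiguration.

```python
from collections import deque


class MessageBuffer:
    """Toy model of a bounded kernel message buffer with late-attached sinks."""

    def __init__(self, capacity=64):
        # Oldest lines fall off the front once capacity is reached.
        self._lines = deque(maxlen=capacity)
        self._sinks = []

    def log(self, line):
        """Record a message and copy it to every sink attached so far."""
        self._lines.append(line)
        for sink in self._sinks:
            sink(line)

    def attach(self, sink):
        """Attach a late-probed output: replay what is buffered, then follow."""
        for line in self._lines:
            sink(line)
        self._sinks.append(sink)
```

With this shape, anything logged before the port is probed still reaches
it (up to the buffer's capacity), and everything logged afterwards is
copied as it arrives.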

Further allowing that port to be attached as the console would be "nice
but not quite as necessary".

Now ideally the kernel should make the best attempt to identify the
first possible USB serial port as early as possible, and attach it as
console, so that nothing can be missed, and so that any other bugs in
device probing, etc., etc., etc., would not prevent use of DDB on this
USB serial console.

Even better would be to find out if the platform firmware can do some or
all of this, and then to use that code both for the boot loader and the
kernel console.  E.g. on an EFI system, perhaps through a custom EFI
driver?  And for uBoot systems too?

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




NetBSD-7.0 boots OK and NetBSD-8.0 hangs/crashes during boot on a MacBook7,1

2020-07-05 Thread Greg A. Woods
So, in my ongoing NetBSD on a MacBook saga...

NetBSD-7.2 boots fine from USB on the MacBook Pro (MacBook7,1) (with the
help of rEFIT on a second USB stick).

NetBSD-8.2 and newer, including the most recent -current, hangs during
boot and the kernel messages appear to have torn video:

 http://www.planix.ca/~woods/macbookpro-netbsd-boot-fail.jpg


However today I discovered that NetBSD-8.0 will often boot with the
kernel messages properly visible in nice green on black in a full
52(?)-line display, but it hangs or crashes.  (It is not reliable at
booting though -- sometimes the boot loader just hangs without printing
anything.)

If the boot loader does work though, and if I boot "normally" it just
hangs, with the last message being:

pci0 at mainbus0 bus0: configuration mode 1

The caps-lock button is dead so I think the machine is well and truly
frozen in a CPU loop (the CPU is hot, the fan runs fast).

I'm guessing NetBSD-8.2 and everything more recent is also hanging at
this same spot, but with the busted video mode it's hard to tell for
sure.

If I boot 8.0 with ACPI turned off (boot option #2 or from the boot
prompt "boot -2"), it crashes into ddb after getting a bit further, but
there are many errors about not being able to map PCI interrupts.

If I boot 8.0 with "-vx", there are quite a number of "invalid config
space" messages after the pci0 attachment:

pci0 at mainbus0 bus0: configuration mode 1
acpi0: MCFG: 000:00:0: invalid config space (cfg[0x100]=0x, 
alias=false)

The second and third numbers change in each following message, and in
two of those messages the cfg[0x100] number is 0x.

So it looks like ACPI is necessary, but support for using it in this
MacBook7,1 is broken somehow.

I can post a full-res photo of the screen in one or more or all of these
states if someone wants to see it.

In any case, what might have been changed after 8.0 that broke the video
output?  Where do I look?  Is amd64 video now the genfb(4) device code?
Or is it still vga(4)?  If it's genfb(4), then I do see commits about
doing anti-aliasing, and maybe the video junk I see could possibly be
explained by such a thing.  If I can get 7.2 installed (likely), so that
I need only drop a kernel in place instead of building the whole
installimage and writing the damn slow USB stick with a whole install
image every time, then maybe I'll be able to try bisecting changes to
get the video working right again.

I really wish modern PC vendors were not still so bloody stupid with
their firmware as to make it impossible to talk to them via a serial
port of some kind (e.g. a USB serial adapter as console would be
awesome!).  That said, what would it take to wire the NetBSD console to
a USB serial adapter?

In lieu of that it would be nice if hitting ^S on the keyboard would at
least pause the kernel messages from scrolling by during boot, but I get
that such a thing might be a bit hard to arrange for in NetBSD.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




USB storage transfers halt when usbdevs is run: hardware bug or software bug?

2020-07-05 Thread Greg A. Woods
USB storage device transfers freeze when usbdevs is run:  hardware bug
or software bug?

While I was doing a "gzcat < *.gz > /dev/rsd2d", where sd2 was a USB
memory stick, I happened to run "usbdevs -dv" and the writes to the USB
device froze, and indeed the writing process was stuck in the kernel (I
couldn't even stop it with ^Z).

Luckily yanking the stick out seemed to unfreeze and kill the process
and clean everything up nicely and I was able to re-insert it and re-do
the write to it without incident.

This is on an amd64 server running 9.99.64.

Upon removal and subsequent re-insertion the kernel said the following
(but was silent before this when usbdevs ran):

[ 193334.306434] umass0: BBB reset failed, IOERROR
[ 193334.306434] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.318288] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.318288] umass0: BBB reset failed, IOERROR
[ 193334.329223] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.329223] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.341024] umass0: BBB reset failed, IOERROR
[ 193334.341024] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.351781] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.357775] sd2d: error writing fsbn 4053632 of 4053632-4053759 (sd2 bn 4053632; cn 4021 tn 7 sn 23)
[ 193334.366963] umass0: BBB reset failed, IOERROR
[ 193334.366963] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.378283] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.378283] umass0: BBB reset failed, IOERROR
[ 193334.389225] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.389225] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.401026] umass0: BBB reset failed, IOERROR
[ 193334.401026] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.411782] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.417780] umass0: BBB reset failed, IOERROR
[ 193334.417780] sd2(umass0:0:0:0): generic HBA error
[ 193334.426444] sd2: detached
[ 193334.426444] scsibus1: detached
[ 193334.426444] umass0: detached
[ 193334.436445] umass0: at uhub6 port 2 (addr 5) disconnected

reinsertion:

[ 193341.516925] umass0 at uhub6 port 2 configuration 1 interface 0
[ 193341.516925] umass0: SMI Corporation (0x090c) USB DISK (0x1000), rev 2.00/11.00, addr 5
[ 193341.526926] umass0: using SCSI over Bulk-Only
[ 193341.526926] scsibus1 at umass0: 2 targets, 1 lun per target
[ 193342.366983] sd2 at scsibus1 target 0 lun 0:  disk 
removable
[ 193342.376985] sd2: 7712 MB, 15744 cyl, 16 head, 63 sec, 512 bytes/sect x 15794176 sectors
[ 193342.386986] sd2: GPT GUID: d1e3490c-b0e6-42e9-9d9e-3ac286a0f7e0
[ 193342.396989] dk6 at sd2: "EFI system", 262144 blocks at 2048, type: msdos
[ 193342.396989] dk7 at sd2: "d3aa0396-d911-4aac-baa8-f2478557d31a", 7544832 blocks at 264192, type: ffs


I'm guessing it's a software bug with bad locking order somewhere.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Why is (nb)ctfmerge failing when linking larger kernels???

2020-07-04 Thread Greg A. Woods
At Wed, 01 Jul 2020 17:57:08 -0700, "Greg A. Woods"  wrote:
Subject: weird occasional "Resource exhaustion" errors when linking 
GENERIC_KASLR
>
> I've been using a stock 9.0 amd64 install to build my -current tree and
> found it failing with a "Resource exhaustion" error (also "Out of
> memory") when linking the GENERIC_KASLR kernel.

So even in 9.99.64 ctfmerge fails, especially with the ALL kernel
(though I must admit I haven't tried to build an amd64 ALL kernel for
perhaps a year or so):

   link  ALL/netbsd
NetBSD 9.99.64 (ALL) #0: Thu Jul  2 17:29:25 PDT 2020
    text       data      bss        dec      hex  filename
80120264  174291832  8122368  262534464  fa5f540  netbsd
ERROR: nbctfmerge: netbsd.ctf: Cannot finalize temp file: Resource exhaustion: Cannot allocate memory
--- netbsd ---
*** [netbsd] Error code 1

nbmake: stopped in /build/woods/xentastic/current-amd64-amd64-obj/build/src-current/sys/arch/amd64/compile/ALL
1 error

$ ulimit -a
time(cpu-seconds)    unlimited
file(blocks)         unlimited
coredump(blocks)     unlimited
data(kbytes)         8388608
stack(kbytes)        32768
lockedmem(kbytes)    512000
memory(kbytes)       4096000
nofiles(descriptors) 3404
processes            420
threads              2048
vmemory(kbytes)      2097152
sbsize(bytes)        unlimited

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




weird occasional "Resource exhaustion" errors when linking GENERIC_KASLR

2020-07-01 Thread Greg A. Woods
I've been using a stock 9.0 amd64 install to build my -current tree and
found it failing with a "Resource exhaustion" error (also "Out of
memory") when linking the GENERIC_KASLR kernel.

Here I leant on ^T while it built and this is the last message before it
died (with the "nbmake" lines edited out):

[ 155166.4979147] load: 1.54  cmd: nbctfmerge 7370 [iowait 0x45fc5f/4] 46.23u 1.99s 148% 1628476k
Out of memory

Again, but with the different error message:

[ 155250.5722602] load: 1.41  cmd: nbctfmerge 18682 [iowait 0x45fc5f/7] 46.18u 1.43s 152% 1444080k
ERROR: nbctfmerge: netbsd: Cannot get sect .debug_line.1 data: Resource exhaustion


Then without "warning" it will ramp up to near twice as much memory and
just work A-OK:

[ 155591.1865138] load: 0.81  cmd: nbctfmerge 15691 [iowait 0x42522a/4] 46.28u 3.71s 142% 2382048k
[ 155591.2765553] load: 0.81  cmd: nbctfmerge 15691 [iowait 0x42522a/4] 46.28u 3.80s 142% 2382048k
[ 155591.3665934] load: 0.81  cmd: nbctfmerge 15691 [iowait 0x45ab1a/5] 46.28u 3.89s 142% 2076944k
[ 155591.4566282] load: 0.81  cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 3.98s 142% 0k
[ 155591.543] load: 0.81  cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 4.07s 142% 0k
[ 155591.6467075] load: 0.82  cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 4.16s 140% 0k
[ 155591.7367458] load: 0.82  cmd: nbctfmerge 15691 [0x45e35a/0] 46.28u 4.25s 140% 0k
mv -f netbsd netbsd.gdb

/build/woods/xentastic/current-amd64-amd64-tools/bin/x86_64--netbsd-strip -g -o netbsd netbsd.gdb


This did not happen with the exact same source tree when building on
either an 8.99.32 or 9.99.64 system running in a Xen domU on similar
hardware.


For the record, thinking this might be an rlimit issue, I opened things
up to the max to no avail, but even with these limits the link often
fails:

$ ulimit -a
time(cpu-seconds)    unlimited
file(blocks)         unlimited
coredump(blocks)     unlimited
data(kbytes)         8388608
stack(kbytes)        32768
lockedmem(kbytes)    524288
memory(kbytes)       2048000
nofiles(descriptors) 3404
processes            420
threads              2048
vmemory(kbytes)      2097152
sbsize(bytes)        unlimited
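
For anyone wanting to repeat the comparison, here is a small Python
sketch that parses "ulimit -a" output in the layout shown above and
lists which memory-related limits are smaller than a given size in
kbytes.  The helper names (parse_ulimit, limits_below) are made up for
the example, and whether the size that ^T reports is what any of these
limits actually constrain is an assumption here, not something
established above; treat it as a sanity check only.

```python
# Limit names as printed by "ulimit -a" in the listing above.
MEMORY_LIMITS = ("data(kbytes)", "memory(kbytes)", "vmemory(kbytes)")


def parse_ulimit(text):
    """Map each 'name value' line to an int, or None for 'unlimited'."""
    limits = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip the prompt line and anything else unexpected
        name, value = parts
        limits[name] = None if value == "unlimited" else int(value)
    return limits


def limits_below(limits, need_kb, names=MEMORY_LIMITS):
    """Return the sorted names of the chosen limits smaller than need_kb."""
    return sorted(
        name for name in names
        if limits.get(name) is not None and limits[name] < need_kb
    )


SAMPLE = """\
data(kbytes)         8388608
memory(kbytes)       2048000
vmemory(kbytes)      2097152
sbsize(bytes)        unlimited
"""

# 2382048 is the largest size ^T reported for nbctfmerge in the successful run.
print(limits_below(parse_ulimit(SAMPLE), 2382048))
```

Restricting the check to a short list of names keeps unrelated limits
such as stack and lockedmem out of the result.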


Also for the record, this is 9.0/amd64 running on a bare machine with 8
cores, 32GB of RAM, and everything is on local filesystems:

NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020
mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
total memory = 32762 MB
avail memory = 31788 MB

Dell Inc. PowerEdge 2950

cpu7: Intel(R) Xeon(R) CPU   X5460  @ 3.16GHz, id 0x10676

mfi0: PERC 6/i Integrated version 6.3.3.0002
mfi0: logical drives 2, 256MB RAM, BBU type BBU, status good
scsibus0 at mfi0: 64 targets, 8 luns per target

sd0 at scsibus0 target 0 lun 0:  disk fixed
sd0: 465 GB, 476416 cyl, 64 head, 32 sec, 512 bytes/sect x 975699968 sectors

sd1 at scsibus0 target 1 lun 0:  disk fixed
sd1: 544 GB, 557568 cyl, 64 head, 32 sec, 512 bytes/sect x 1141899264 sectors



--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




postinstall removed yet another "obsolete" system library that was still used....

2020-06-27 Thread Greg A. Woods
So I just upgraded a system from an old 8.99 -current to a newer 9.99
-current and "postinstall fix obsolete" removed my /usr/lib/libgomp.so.1*

However this library was still in use by installed packages (due, I
think, to a dependency of libgd on libgomp, thus every gd-using package
is now G.D. broke)!

I propose that the rule documented in src/distrib/lists/base/shl.mi be
far more strictly observed, even for libraries that appear and disappear
between releases (i.e. for -current), at least for the ".major" link and
the file it points to.  If they were never there in a release, never
mentioning them as obsolete in releases should be just fine (i.e. they
were never there, so never mentioning them is the correct thing to do).

On the other hand we could first fix postinstall to be more careful by
getting it to fetch all the "REQUIRES" values from package BUILD_INFO
like this:

pkg_info -a -Q REQUIRES  | sort -u

and then have it noisily refuse to remove any obsolete file still in
this "required" list.  This would allow us to mention all old/upgraded
shared libraries as obsolete, including those from between releases.  Of
course this only protects things installed via pkgsrc, and there's still
the risk of subsequently needing to install a binary package built for
an older release which needs one of these "obsolete" files, but at least
pkg_add can (be made to if it doesn't already) notice this and abort.
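
A minimal sketch of that check, in Python purely for illustration
(postinstall itself is a shell script, and none of this is existing
postinstall code).  It runs the same pkg_info query quoted above and
then splits an obsolete list into files that are safe to remove and
files some package still requires.  The function names are hypothetical,
and treating "libfoo.so.1.3" as protected whenever "libfoo.so.1" is
required is an assumption meant to cover the ".major" link and the file
it points to.

```python
import subprocess


def required_files():
    """Collect REQUIRES entries from all installed packages.

    Assumes pkg_info is in PATH and prints one path per line.
    """
    out = subprocess.run(
        ["pkg_info", "-a", "-Q", "REQUIRES"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def is_protected(path, required):
    """True if path is required, or is a longer-versioned name of a required file."""
    return any(path == r or path.startswith(r + ".") for r in required)


def split_obsolete(obsolete, required):
    """Split obsolete paths into (safe_to_remove, still_required), keeping order."""
    safe, kept = [], []
    for path in obsolete:
        (kept if is_protected(path, required) else safe).append(path)
    return safe, kept
```

A caller would remove only the first list and print a loud warning for
every path in the second one.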

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: unable to boot NetBSD-9.99.64-amd64-install.img on a MacBook7,1

2020-06-17 Thread Greg A. Woods
At Sat, 13 Jun 2020 22:03:39 -0700, "Greg A. Woods"  wrote:
Subject: Re: unable to boot NetBSD-9.99.64-amd64-install.img on a MacBook7,1
>
> At Tue, 09 Jun 2020 22:01:41 -0700, "Greg A. Woods"  wrote:
> Subject: unable to boot NetBSD-9.99.64-amd64-install.img on a MacBook7,1
> >
> > Most interestingly if I do some playing at the boot prompt first such
> > that there is lots of white text in the small centre area, then try to
> > boot, the lines of green dots overwrite the top about 1/3 of the screen
> > leaving the lower portion of the white boot loader text still visible:
> >
> > http://www.planix.ca/~woods/macbookpro-netbsd-boot-fail.jpg
>
> Same goes for today's snapshot from:
>
>  
> https://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/202006131940Z/images/NetBSD-9.99.66-amd64-install.img.gz

Would knowing anything about how FreeBSD works on this machine help
figure out why NetBSD doesn't?

I have FreeBSD 12.1 installed and (mostly) working (the nvidia driver
crashes it when starting X).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 



