Re: pcc [was Re: valgrind]

2022-03-23 Thread Greg A. Woods
At Wed, 23 Mar 2022 20:56:27 +0100, Anders Magnusson  
wrote:
Subject: Re: pcc [was Re: valgrind]
>
> Den 2022-03-23 kl. 19:37, skrev Greg A. Woods:
> >
> > Heh.  I would say PCC's generated code doesn't compare to either modern
> > GCC or LLVM/Clang's output.
> >
> > I would say the main reason is PCC doesn't (as far as I know) employ
> > any "Undefined Behaviour" caveat to optimize code, for example.
> >
> > I'll let the reader decide which might have the "higher" quality.
>
> I would really want to know what you base these three statements on?

Well I've read a great deal of PDP-11 assembler as produced by PCC, and
I've fought with LLVM/Clang (and to a lesser extent with GCC) and their
undefined behaviour sanitizers (and valgrind) when trying to port old
code to these new compilers and to understand what they have done to it.

I also have way more experience than I ever really wanted in finding
bugs in a wide variety of compilers that are effectively from the same
era as PCC (e.g. especially Lattice C and early Microsoft C, which as I
recall started life as Lattice C).

Modern optimizers that take advantage of UB to do their thing can cause
very strange bugs (hidden bugs, when the UB sanitizer isn't used),
especially with legacy code, or indeed with modern code written by naive
programmers.

Note of course that I'm explicitly _not_ talking about the quality of
the _input_ code, but of the generated assembler code, and I'm assuming
that's what Paul was asking about.

One thing I don't have a good feel for though is how the code produced
by modern GCC and LLVM/Clang looks when they are told to "leave it
alone" after the first step, i.e. with "-O0", and especially as compared
to PCC with -O0.  I _think_ they should be about the same, but I dunno.
Although older compilers like PCC are very naive and simplistic in how
they generate code, my feeling is that modern compilers are even more
naive in their first step of code generation as they have come to rely
even more on their own optimizers to clean things up.  That's pure
speculation though -- I haven't worked directly with assembler code very
much at all since I left the likes of the 6502 and 8086 behind.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: pcc [was Re: valgrind]

2022-03-23 Thread Greg A. Woods
At Tue, 22 Mar 2022 17:47:55 +, "Koning, Paul"  wrote:
Subject: Re: pcc [was Re: valgrind]
>
>
> Out of curiosity: how does PCC code quality compare with that of
> GCC and (for targets that it supports) Clang?

Heh.  I would say PCC's generated code doesn't compare to either modern
GCC or LLVM/Clang's output.

I would say the main reason is PCC doesn't (as far as I know) employ
any "Undefined Behaviour" caveat to optimize code, for example.

I'll let the reader decide which might have the "higher" quality.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: pcc [was Re: valgrind]

2022-03-22 Thread Greg A. Woods
At Mon, 21 Mar 2022 08:54:43 -0400 (EDT), Mouse  
wrote:
Subject: pcc [was Re: valgrind]
>
> >> I've been making very-spare-time progress on building my own
> >> compiler on and off for some years now; perhaps I'll eventually get
> >> somewhere.  [...]
> > Have you looked at pcc?  http://pcc.ludd.ltu.se/ and in our source
> > tree in src/external/bsd/pcc .
>
> No, I haven't.  I should - it may well end up being quicker to move an
> existing compiler in the directions I want to go than to write my own.

I would like to add my voice too.

I _really_ like valgrind.  It is immensely valuable and infinitely
better than any of the compiler so-called "sanitizers" (except maybe the
Undefined Behaviour sanitizer in Clang, which, sadly, is a necessary
evil if one is to use such a modern language bastardizer as Clang).

It's a little ugly to use, and it's a very tough task-master, but I'm
really sad that I cannot use it easily and regularly on NetBSD.

(I'm just about as sad that it no longer works on modern macOS either.)


I also really like PCC.  (I remember teething pains getting used to it
back when it first replaced Ritchie C on my university's PDP-11/60, but
once I actually used it for real code (i.e. assignments in those days),
and soon on the Vax too, I really liked it.)

I'm really sad that I still cannot build NetBSD entirely with PCC as the
native and only default compiler.


I would like to do whatever I can to help fix these two problems, but
I'm not anywhere near able to even begin to fix them on my own (well I
could hack on NetBSD to make it work with PCC, but I don't want to be a
one-man-band on a project of that scale).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: wsvt25 backspace key should match terminfo definition

2021-11-25 Thread Greg A. Woods
st logical for the keyboard one was using.

In Research Unix this all happened with the Unix 8th edition when the
default for the tty line discipline's erase character was also changed
to be the ASCII BS character, and the new "nttyld.c" implemented live
erase and kill processing though none of this was properly documented.
This is from V10's ttyld(4):

   The erase character (backspace by default) erases the last‐typed  char‐
   acter.   It will not erase beyond the beginning of a line or an end‐of‐
   file character.

   The  interrupt  character  (default DEL) is not passed to a program but
   sends signal to any processes in the process group of the  stream;  see
   signal(2) and stream(4).

(The Research releases never really documented their "nttyld.c" driver
-- i.e. the one that does live erase and kill processing, though it is
hinted to in V10's stty(1) manual page with the "old" and "!old" options.)

All this gets me to the point of showing that it was those upstart
Berkeley dudes who first set the default "stty erase" character to DEL. :-)

Quite risky actually if you think of what will happen when you try to
transfer your new keyboard skills to one of those East Coast systems and
you don't pay attention to the difference in the tty drivers!  Thank
goodness the tset(1) command now prints what's what.

From 4.3BSD's tty(4) describing the new "standard Berkeley terminal
driver":

   During  input,  line editing is normally done, with the erase character
   sg_erase (by default, DELETE)  logically  erasing  the  last  character
   typed  and  the  sg_kill  character  (default, ^U: control‐U) logically
   erasing the entire current input line.  These  characters  never  erase
   beyond  the beginning of the current input line or an eof.  These char‐
   acters may be entered literally by preceding them  with  ‘\’;  the  ‘\’
   will normally be erased when the character is typed.

 [[  ]]

   ^C t_intrc (ETX) generates a SIGINT signal.  This is the normal way
  to stop a process which is no longer interesting, or  to  regain
  control in an interactive program.

Personally, I stuck to the Bell Labs ways and always used the
"Backspace" key on my VT10x terminals as the key for "stty erase", and
the "Del" key as the key for "stty intr"; all the way back to when I ran
Research and AT&T Unixes, right up to when I had a real VT101 on the
console of some of my early (non-i386) NetBSD machines.

Also note the BSD tty(4) driver's default for "stty intr" being ^C
(ETX), as this also has its origins in the DEC world.

BTW, since this is tech-kern, note that the wscons(4) manual page tells
a little white lie when it suggests that when in "vt100" mode it "will
work sufficiently as a VT220 emulator."  This fact is now documented in
/usr/share/misc/terminfo:

  # Testing the emulator and reading the source code (NetBSD 2.0), it appears
  # that "vt220" is inaccurate.  There are a few vt220-features, but most of the
  # vt220 screens in vttest do not work with this emulator.

There's a real mess of confusion related to thinking of wscons(4) (and
pc(4)/pcvt(4) before it) as good enough to be like a vt220.  In my
twisted view of the erase/kill/intr story I like to think it may
actually have started with the insistence on following the BSD
tradition of having the standard PC keyboard's "<- Backspace" key
(i.e. the "big left-arrow" key at the top right of the main group,
which was and is always the "Backspace" key) send a DEL character
(instead of the more normally expected "BS" character, given its name
and label).  Since VT220s are the most common terminals to insist on
sending "DEL" from the logically placed "erase the previous character"
key, choosing "vt220" as one's terminal type meant avoiding confusion
over terminfo's "kbs" value and the default "stty erase" character.
Also it's more progressive sounding to have a vt220 instead of a
vt100.  As such pc(4), then pcvt(4), and finally wscons(4), are
described as "vt220" compatible and have been made to work well enough
for vi(1) to behave properly when TERM=vt220.

Personally I think there is a massive over-use of "use=vt220" in
terminfo -- that should be reserved ONLY for true DEC VT derivatives and
updates of the real VT220.

BTW, my own local NetBSD source trees generally always have hacks to set
the physical console keyboard's backspace key to translate into an ASCII
BS character, though I use the physical console on x86 machines so
little any more that it hardly matters.

In any case, in this modern day I think some people still fall far too
hard on their swords, er, keyboards, trying to stick to and insist on
the idea that DEL _has_ to be the character for "stty erase" (at least
for anything "BSD"-ish).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: wsvt25 backspace key should match terminfo definition

2021-11-24 Thread Greg A. Woods
wing the UNIX tradition
> that this key is DEL,

You mean "the BSD tradition".  See my previous reply.

The Unix (or "UNIX") tradition is for "stty kill" to be 'DEL' (and erase
to be either 'BS' or '#', depending on how far back you go).

In my experience NetBSD already follows the BSD tradition very well.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Devices.

2021-06-01 Thread Greg A. Woods
At Tue, 1 Jun 2021 14:00:50 -0400 (EDT), Mouse  
wrote:
Subject: Re: Devices.
>
> Uh, maybe _you_ do.  _I_ don't.  For most of my chroots, I want the
> chroot to have as minimal a set of devices as still allows it to do its
> job, and in particular I do not want it to ever dynamically acquire new
> devices, nor do I want it to have /dev entries for devices not
> necessary for its task.  I not infrequently want unusual ownerships or
> permissions on its /dev, too.

Indeed.  Very important!

> I also want to be able to have device nodes places other than /dev, and
> that desire is at least mostly orthogonal to chroot.

I'm less convinced of this part though.  This ability has brought
more complexity (e.g. mount options to disable devices per filesystem)
than I've ever seen pay off in benefit.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Devices.

2021-06-01 Thread Greg A. Woods
At Sat, 29 May 2021 16:17:13 -0700, John Nemeth  wrote:
Subject: Re: Devices.
>
> On May 29, 22:52, David Holland wrote:
> } On Sat, May 29, 2021 at 05:41:38PM -0400, Mouse wrote:
> }
> }  > > For disks, which for historical reasons live in both cdevsw and
> }  > > bdevsw, both entries would point at the same disk_dev.
> }  >
> }  > I would suggest getting rid of the bdev/cdev distinction.  It is, as
> }  > you say, a historical artifact, and IMO it is not serving anyone at
> }  > this point.
> }
> } It is deeply baked into the system call API and into POSIX, so it's
> } not going anywhere. It's been proposed that we should stop having
> } block devices, which would have the same net effect; I have no strong
> } opinion on that and it doesn't need to be part of this set of changes.
>
>  I was thinking the same thing about getting rid of block
> devices.  The only place they should ever be used is an argument
> to mount(2) and mount(2) can be adjusted to use a block device
> underneath when it is handed a character device.  FreeBSD got rid
> of block devices a long time ago.  Doing that as a first step is
> likely to simplify things to make other things easier.

I'm uncomfortable with what seems to me to be a rather arbitrary
decision to remove block devices from NetBSD.

My understanding w.r.t. the rationale FreeBSD used in deciding to remove
the block devices was that FreeBSD never really buffered/cached by
device in the first place.  Also, according to PHK in his 2002 BSDCon
paper about FreeBSD's /dev, "In FreeBSD block devices were not even
implemented in a fashion which would be of any use, since any write
errors would never be reported to the writing process."[*]
[*] 
https://www.usenix.org/legacy/events/bsdcon02/full_papers/kamp/kamp_html/index.html

If I'm not mistaken that's all different in NetBSD though (except maybe
for the error handling issue), or am I mistaken???

Of course on Linux they went the other way and there are no raw devices,
and clearly that's turned out to be a bad idea, especially for the needs
of some tools such as 'dd' and for the underlying drivers, which all had
to double down and add new non-standard controls just to re-implement
"raw" access.

>  We should really get with the times and create a devfs.

I dunno.

I think it's a good idea for some classes of devices, but I'm not so
sure it has to be a one-size-fits-all singular solution.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Some changes to autoconfiguration APIs

2021-05-01 Thread Greg A. Woods
At Fri, 30 Apr 2021 23:05:48 -0400 (EDT), Mouse  
wrote:
Subject: Re: Some changes to autoconfiguration APIs
>
>   However, I see little reason to do the
> statement expression rather than
>
>   { static const struct cfargs foo = { ... };
> config_found(..., &foo);
>   }

That's a very good point!

I think statement expressions can be a rather "dangerous" complication
in C -- I've only ever found them to be truly useful within a macro when
I'm trying to avoid, or do something different than, the "usual
promotions".


Kind of related to this, I have the following comment in my notes about C:

- Positional parameters are evil (or at least error prone), especially
  for variable numbers of parameters.

Named parameters can be simulated in modern C with full structure
passing:


#include <stdio.h>

struct fooness {
        int blah;
};

struct somefunc_params {
        char *p1;
        int i1;
        struct fooness foo;
};

int
somefunc(struct somefunc_params p)
{
        if (p.i1)
                printf("%s", p.p1);
        return 0;
}

/* ... and then, from within some calling function: */
res = somefunc((struct somefunc_params)
               {.p1 = "foo",
                .i1 = 1,
                .foo = (struct fooness) {.blah = 4}});

A working example with more rants and ravings about C, and some other
ideas about hiding the struct references within the function
implementation is here:

https://github.com/robohack/experiments/blob/master/tc99namedparams.c

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes

2021-04-20 Thread Greg A. Woods
020202020
*
0001000  21212121212121212121212121212121
*
0002000  22222222222222222222222222222222
*
0003000  23232323232323232323232323232323
*
0004000  24242424242424242424242424242424
*
0005000  25252525252525252525252525252525
*
0006000  26262626262626262626262626262626
*
0007000  27272727272727272727272727272727
*
0010000  28282828282828282828282828282828
*
0011000  29292929292929292929292929292929
*
0012000  2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a
*
0013000  2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b
*
0014000  2c2c2c2c2c2c2c2c2c2c2c2c2c2c2c2c
*
0015000  2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
*
0016000  2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e
*
0017000  2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f2f
*
002


Let's try that again with just the one sample data line and blkchk:


# grep 28141568000 /var/tmp/ckfile.txt > /var/tmp/ckfile.1
# /var/tmp/blkchk check /dev/da0 /var/tmp/ckfile.1
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.1[ln#0][1024] \x22
#

Every byte after 1024 is different, but I'll cut it off at 10:

# /var/tmp/blkchk check -v /dev/da0 /var/tmp/ckfile.1 2>&1 | head
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.1[ln#0][1024] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1025] \x28 != /var/tmp/ckfile.1[ln#0][1025] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1026] \x28 != /var/tmp/ckfile.1[ln#0][1026] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1027] \x28 != /var/tmp/ckfile.1[ln#0][1027] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1028] \x28 != /var/tmp/ckfile.1[ln#0][1028] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1029] \x28 != /var/tmp/ckfile.1[ln#0][1029] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1030] \x28 != /var/tmp/ckfile.1[ln#0][1030] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1031] \x28 != /var/tmp/ckfile.1[ln#0][1031] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1032] \x28 != /var/tmp/ckfile.1[ln#0][1032] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1033] \x28 != /var/tmp/ckfile.1[ln#0][1033] \x22
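
(The core of that kind of check is presumably nothing more than a
pread() at the recorded offset plus a byte-by-byte compare.  The sketch
below is *not* the real blkchk -- it ignores the check-file format and
instead takes a single expected byte value on the command line -- but
it produces the same sort of report as shown above.)

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        unsigned char buf[8192];

        if (argc != 4)
                errx(1, "usage: %s device byte-offset expected-byte", argv[0]);

        int fd = open(argv[1], O_RDONLY);
        if (fd == -1)
                err(1, "open %s", argv[1]);

        off_t off = (off_t)strtoll(argv[2], NULL, 10);
        unsigned char want = (unsigned char)strtoul(argv[3], NULL, 0);

        ssize_t n = pread(fd, buf, sizeof(buf), off);
        if (n != (ssize_t)sizeof(buf))
                err(1, "pread %zu bytes @ %lld", sizeof(buf), (long long)off);

        for (ssize_t i = 0; i < n; i++)
                if (buf[i] != want)
                        printf("mismatch: %s[+%ld] \\x%02x != \\x%02x\n",
                            argv[1], (long)i, buf[i], want);
        return 0;
}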


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Xen FreeBSD domU block I/O problem begins somewhere between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10)

2021-04-16 Thread Greg A. Woods
So I was just reminded that I do still have a Xen server that's still
running the 8.99.32 kernel and Xen-4.11.  I had not been testing on it
because it still of course has the vnd(4) CHS size bug (and because it's
also hosting my $HOME and /usr/src and I don't want to crash it), and I
had not remembered until just now that I can work around that by simply
padding out the mini-memstick.img file!

And, so

It works, A-OK, with all other things remaining the same:

# ls -l /dev/xbd0
crw-r-  1 root  operator  0x3a Apr 17 04:31 /dev/xbd0
# newfs /dev/xbd0
/dev/xbd0: 20480.0MB (41943040 sectors) block size 32768, fragment size 4096
using 33 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112,
 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792,
 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472,
 29491712, 30773952, 32056192, 33338432, 34620672, 35902912, 37185152,
 38467392, 39749632, 41031872
# fsck /dev/xbd0
** /dev/xbd0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2 files, 2 used, 5076797 free (21 frags, 634597 blocks, 0.0% fragmentation)

* FILE SYSTEM IS CLEAN *
#


So the problem is almost certainly in NetBSD-current itself, and
somewhere in the vast gulf between 8.99.32 (2020-06-09) and 9.99.81
(2021-03-10).

Unfortunately I don't have enough hardware that's Xen-capable and up and
running well enough to allow me to do any brute-force bisecting.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-16 Thread Greg A. Woods
At Fri, 16 Apr 2021 11:44:08 +0100, David Brownlee  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> On Fri, 16 Apr 2021 at 08:41, Greg A. Woods  wrote:
>
> > What else is different?  What am I missing?  What could be different in
> > NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
> > Could the fault still be in the FreeBSD drivers -- I don't see how as
> > the same root problem caused corruption in both HVM and PVH domUs.
>
> Random data collection thoughts:
>
> - Can you reproduce it on tiny partitions (to speed up testing)
> - If you newfs, shutdown the DOMU, then copy off the data from the
> DOM0 does it pass FreeBSD fsck on a native boot
> - Alternatively if you newfs an image on a native FreeBSD box and copy
> to the DOM0 does the DOMU fsck fail
> - Potentially based on results above - does it still happen with a
> reboot between the newfs and fsck
> - Can you ktrace whichever of newfs or fsck to see exactly what its
> writing (tiny *tiny* filesystem for the win here :)

So, the root filesystem is clean (from the factory, and verified by at
least NetBSD's fsck as OK), but when '-f' is used it is found to be
corrupt.

Unfortunately I don't have any real FreeBSD machines available (though I
could possibly get it installed on my MacBookPro again, but that's
probably a multi-day effort at this point).

However I've just found a way to reproduce the problem reliably and with
a working comparison with a matching-sized memory disk.

First off attach a tiny 4mb LVM LV to FreeBSD -- that's the smallest LV
possible apparently:

dom0 # lvm lvs
  LV  VG  Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  build   scratch -wi-a- 250.00g
  fbsd-test.0 scratch -wi-a-  30.00g
  fbsd-test.1 scratch -wi-a-  30.00g
  nbtest.pkg  vg0 -wi-a-  30.00g
  nbtest.root vg0 -wi-a-  30.00g
  nbtest.swap vg0 -wi-a-   8.00g
  nbtest.var  vg0 -wi-a-  10.00g
  tinytestvg0 -wi-a-   4.00m
dom0 # xl block-attach fbsd-test format=raw, vdev=sdc, access=rw, target=/dev/mapper/vg0-tinytest


Now a run of the test on the FreeBSD domU (first showing the kernel
seeing the device attachment):


# xbd3: 4MB  at device/vbd/2080 on xenbusb_front0
xbd3: attaching as da2
xbd3: features: flush
xbd3: synchronize cache commands enabled.
GEOM: new disk da2

# dd if=/dev/zero of=tinytest.fs count=8192
8192+0 records in
8192+0 records out
4194304 bytes transferred in 0.081106 secs (51713998 bytes/sec)
# mdconfig -a -t vnode -f tinytest.fs
md0
# newfs -o space -n md0
/dev/md0: 4.0MB (8192 sectors) block size 32768, fragment size 4096
using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# newfs -o space -n da2
/dev/da2: 4.0MB (8192 sectors) block size 32768, fragment size 4096
using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# dumpfs da2 >da2.dumpfs
# dumpfs md0 >md0.dumpfs
# diff md0.dumpfs da2.dumpfs
1,2c1,2
< magic 19540119 (UFS2) timeFri Apr 16 18:48:55 2021
< superblock location   65536   id  [ 6079dc17 1006b3b4 ]
---
> magic 19540119 (UFS2) timeFri Apr 16 18:49:57 2021
> superblock location   65536   id  [ 6079dc55 348e5947 ]
27c27
< magic 90255   tell2   timeFri Apr 16 18:48:55 2021
---
> magic 90255   tell2   timeFri Apr 16 18:49:57 2021
40c40
< magic 90255   tell128000  timeFri Apr 16 18:48:55 2021
---
> magic 90255   tell128000  timeFri Apr 16 18:49:57 2021
53c53
< magic 90255   tell23  timeFri Apr 16 18:48:55 2021
---
> magic 90255   tell23  timeFri Apr 16 18:49:57 2021
66c66
< magic 90255   tell338000  timeFri Apr 16 18:48:55 2021
---
> magic 90255   tell338000  timeFri Apr 16 18:49:57 2021
# fsck md0
** /dev/md0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1 files, 1 used, 870 free (14 frags, 107 blocks, 1.6% fragmentation)

* FILE SYSTEM IS CLEAN *
# fsck da2
** /dev/da2
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
ROOT INODE UNALLOCATED
ALLOCATE? [yn] n


* FILE SYSTEM MARKED DIRTY *


So I ktraced the fsck_ufs run, and though I haven't looked at it with a
fine-tooth comb and the source open, the only thing that seems a wee bit
different about what fsck does is that it opens the device twice with
O_RDONLY, then shortly before it prints the first "** /dev/da2" line it
reopens it O_RDWR a third time, closes the second one, and then calls
dup() on the third one so that it has the same FD# as the second open
had.
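
That descriptor dance looks roughly like this (a minimal sketch of the
sequence as observed in the ktrace, not fsck's actual code; the device
path is just the one from the example above):

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/da2";

        int fd1 = open(dev, O_RDONLY);          /* first read-only open */
        int fd2 = open(dev, O_RDONLY);          /* second read-only open */
        if (fd1 == -1 || fd2 == -1)
                err(1, "open %s", dev);

        int fd3 = open(dev, O_RDWR);            /* third open, read-write */
        if (fd3 == -1)
                err(1, "re-open %s read-write", dev);

        close(fd2);                             /* free the second fd number... */
        int fd4 = dup(fd3);                     /* ...dup() returns the lowest free
                                                 * fd, so fd4 == the old fd2 */
        if (fd4 == -1)
                err(1, "dup");

        /* the subsequent reads/writes then happen on the dup'd descriptor */
        close(fd1);
        close(fd3);
        close(fd4);
        return 0;
}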

Otherwise it does a

Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-16 Thread Greg A. Woods
So I wrote a little awk script so that I could write 512-byte blocks
with varying values of bytes.  (Awk is the only decent programming
language on the FreeBSD mini-memstick.img which I could think of that
would do something close to what I wanted it to do.  I could have
combined awk+sh+dd and done things faster, but I had all day to let it
run while I worked on some small engine repairs.)

https://github.com/robohack/experiments/blob/master/tblocks.awk

and then I used it to write 30GB to two different LVM LVs, each of
identical size, and each exported to the domU, one written on the dom0
and the other written on the domU.
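
(For reference, the idea boils down to something like the following C
sketch -- this is not the awk script itself, and the exact per-block
pattern here, each 512-byte block filled with a single byte value that
increments with the block number, is only an assumption:)

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        if (argc != 3)
                errx(1, "usage: %s device nblocks", argv[0]);

        int fd = open(argv[1], O_WRONLY);
        if (fd == -1)
                err(1, "open %s", argv[1]);

        unsigned long nblocks = strtoul(argv[2], NULL, 10);
        unsigned char buf[512];

        for (unsigned long i = 0; i < nblocks; i++) {
                /* each 512-byte block carries a single byte value derived
                 * from its block number, so a misplaced, stale, or zeroed
                 * block stands out immediately in od(1) or cmp(1) output */
                memset(buf, 0x20 + (int)(i % 0x60), sizeof(buf));
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                        err(1, "write block %lu", i);
        }
        if (close(fd) == -1)
                err(1, "close");
        return 0;
}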

Then I ran a cmp of both drives on each the dom0 and domU.

On the dom0 side were no differences.  All 30GB of what was written
directly in the dom0 to one of the LVs was identical to what was written
in the FreeBSD domU to the other LV.  I.e. the FreeBSD domU side seems
to be writing reliably through to the disk.

The FreeBSD domU though is _really_ slow at reading with cmp (perhaps
not unexpectedly given that it is using stdio to do the read and only
managing 4KB requests, at a rate of just under 500 requests per second
on each disk -- i.e. only about 2MB/s per disk, so reading through the
whole 30GB takes on the order of four hours).

I'm going to send this and go to bed before it finishes, but I'm
guessing it's about 2/3's of the way through (it has run for nearly
11,000 seconds), and thus so far there are no differences from the
FreeBSD domU's point of view either.

Anyway, what the heck is FreeBSD newfs and/or fsck doing different!?!?!??

They're both writing and reading the very same raw device(s) that I
wrote and read to/from with awk and cmp.

These awk/cmp tests did very sequential operations, and the data are
quite uniform and regular; whereas newfs/fsck write/read a much more
complex data structure using operations scattered about in the disk.

These tests are also writing then reading enough data to flush through
the buffer caches in each dom0 and domU several times over.  The dom0
has only 4GB and the domU has 8GB, but Xen says it's only using under 2GB.

What else is different?  What am I missing?  What could be different in
NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
Could the fault still be in the FreeBSD drivers -- I don't see how as
the same root problem caused corruption in both HVM and PVH domUs.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Greg A. Woods
atures: flush
dev.xbd.0.ring_pages: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32
dev.xbd.0.%parent: xenbusb_front0
dev.xbd.0.%pnpinfo: 
dev.xbd.0.%location: 
dev.xbd.0.%driver: xbd
dev.xbd.0.%desc: Virtual Block Device




For reference the bug behaviour remains the same (at least for this
simplest quick and easy test):

# newfs /dev/da0
/dev/da0: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 
11540352, 12822592, 14104832, 15387072, 16669312,
 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 
26927232, 28209472, 29491712, 30773952, 32056192, 33338432,
 34620672, 35902912, 37185152, 38467392, 39749632, 41031872, 42314112, 
43596352, 44878592, 46160832, 47443072, 48725312, 50007552,
 51289792, 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 
60265472, 61547712, 62829952
# fsck /dev/da0
** /dev/da0
** Last Mounted on 
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
CG 0: BAD CHECK-HASH 0x49168424 vs 0xe610ac1b
SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

CG 1: BAD CHECK-HASH 0xfa76fceb vs 0xb9e90a55
CG 2: BAD CHECK-HASH 0x41f444c vs 0x5efb290e
CG 3: BAD CHECK-HASH 0xad63fe7e vs 0x7ab3861f
CG 4: BAD CHECK-HASH 0xfd2043f3 vs 0xadb781f4
CG 5: BAD CHECK-HASH 0x545cf9c1 vs 0xcec5661e
CG 6: BAD CHECK-HASH 0xaa354166 vs 0x7dd269d3
CG 7: BAD CHECK-HASH 0x349fb54 vs 0x3078e065
CG 8: BAD CHECK-HASH 0xab23a7c vs 0xc8aa7e98
CG 9: BAD CHECK-HASH 0xa3ce804e vs 0x205a6b0d
CG 10: BAD CHECK-HASH 0x5da738e9 vs 0x604d5ecf
CG 11: BAD CHECK-HASH 0xf4db82db vs 0xfef11ffc
CG 12: BAD CHECK-HASH 0xa4983f56 vs 0xc7e701c8
CG 13: BAD CHECK-HASH 0xde48564 vs 0x42072fba
CG 14: BAD CHECK-HASH 0xf38d3dc3 vs 0xad98cf7b
CG 15: BAD CHECK-HASH 0x5af187f1 vs 0xbacadeb1
CG 16: BAD CHECK-HASH 0xe07abf93 vs 0xe4ca225
CG 17: BAD CHECK-HASH 0x490605a1 vs 0xe2917802
CG 18: BAD CHECK-HASH 0xb76fbd06 vs 0xa895abc
CG 19: BAD CHECK-HASH 0x1e130734 vs 0x6a8bc135
CG 20: BAD CHECK-HASH 0x4e50bab9 vs 0x44719a4a
CG 21: BAD CHECK-HASH 0xe72c008b vs 0xadb0c6e9
CG 22: BAD CHECK-HASH 0x1945b82c vs 0x3aeca102
CG 23: BAD CHECK-HASH 0xb039021e vs 0xb99f957d
CG 24: BAD CHECK-HASH 0xb9c2c336 vs 0xd384be85
CG 25: BAD CHECK-HASH 0x10be7904 vs 0x649e2abf
CG 26: BAD CHECK-HASH 0xeed7c1a3 vs 0x95f7
CG 27: BAD CHECK-HASH 0x47ab7b91 vs 0x3fb02d8b
CG 28: BAD CHECK-HASH 0x17e8c61c vs 0xa2b4ca67
CG 29: BAD CHECK-HASH 0xbe947c2e vs 0x65972e04
CG 30: BAD CHECK-HASH 0x40fdc489 vs 0x4219223f
CG 31: BAD CHECK-HASH 0xe9817ebb vs 0x36eb9a37
CG 32: BAD CHECK-HASH 0x3007c2bc vs 0xd1916e1d
CG 33: BAD CHECK-HASH 0x997b788e vs 0x5204f64d
CG 34: BAD CHECK-HASH 0x6712c029 vs 0xe291bcf0
CG 35: BAD CHECK-HASH 0xce6e7a1b vs 0x136ff032
CG 36: BAD CHECK-HASH 0x9e2dc796 vs 0x78ea85c8
CG 37: BAD CHECK-HASH 0x37517da4 vs 0x40c2cf31
CG 38: BAD CHECK-HASH 0xc938c503 vs 0x9b844ab6
CG 39: BAD CHECK-HASH 0x60447f31 vs 0x23129481
CG 40: BAD CHECK-HASH 0x69bfbe19 vs 0xa81f5e9
CG 41: BAD CHECK-HASH 0xc0c3042b vs 0xbd37ebd1
CG 42: BAD CHECK-HASH 0x3eaabc8c vs 0xfadfd8d1
CG 43: BAD CHECK-HASH 0x97d606be vs 0xf41513bc
CG 44: BAD CHECK-HASH 0xc795bb33 vs 0xad4e6069
CG 45: BAD CHECK-HASH 0x6ee90101 vs 0xbeab94a9
CG 46: BAD CHECK-HASH 0x9080b9a6 vs 0x2688acd1
CG 47: BAD CHECK-HASH 0x39fc0394 vs 0xb5a37e85
CG 48: BAD CHECK-HASH 0x83773bf6 vs 0xd779cc90
CG 49: BAD CHECK-HASH 0xe0d3fd3c vs 0xb8083ca
2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* FILE SYSTEM MARKED DIRTY *

* PLEASE RERUN FSCK *

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Greg A. Woods
At Wed, 14 Apr 2021 19:53:47 +0200, Jaromír Doleček  
wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
> 
> You can test if this is the problem by disabling the feature in
> negotiation in NetBSD xbdback.c - comment out the code which sets
> feature-max-indirect-segments in xbdback_backend_changed(). With the
> feature disabled, FreeBSD DomU should not use indirect segments.

Ah, yes, thanks!  I should have thought of that.  That's especially
useful since on the client side it's a read-only flag:

# sysctl -w hw.xbd.xbd_enable_indirect=0
sysctl: oid 'hw.xbd.xbd_enable_indirect' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf

Apparently in the Linux implementation the number of indirect segments
used by a domU can be tuned at boot time, and that appears to be done by
setting a driver option on the guest kernel command line.  When I first
read that it didn't make so much sense to me to be giving this kind of
control to the domU.  Perhaps it would be better to make this a tuneable
in xl.cfg(5) such that it can be tuned on a per-guest basis.  Then
setting it to zero for a given guest would not advertise the feature at
all.

I've some other things to do before I can reboot -- I'll report as soon
as that's done

-- 
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-13 Thread Greg A. Woods
At Tue, 13 Apr 2021 18:20:39 -0700, "Greg A. Woods"  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> So "17" seems an odd number, but it is apparently because of "Need to
> alloc one extra page to account for possible mapping offset".

Nope, changing that to 16 didn't make any difference.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-13 Thread Greg A. Woods
At Sun, 11 Apr 2021 13:55:36 -0700, "Greg A. Woods"  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
> xbd(4) with a new filesystem created on it, is impossible.


So, having run out of "easy" ideas, and working under the assumption
that this must be a problem in NetBSD-current dom0 (i.e. not likely in
Xen or Xen tools) I've been scanning through changes and this one, so
far, is one that would seem to me to have at least some tiny possibility
of being the root cause.


RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/xen/xen/xbdback_xenbus.c,v

revision 1.86
date: 2020-04-21 06:56:18 -0700;  author: jdolecek;  state: Exp;  lines: 
+175 -47;  commitid: 26JkIx2V3sGnZf5C;
add support for indirect segments, which makes it possible to pass
up to MAXPHYS (implementation limit, interface allows more) using
single request

request using indirect segment requires 1 extra copy hypercall per
request, but saves 2 shared memory hypercalls (map_grant/unmap_grant),
so should be net performance boost due to less TLB flushing

this also effectively doubles disk queue size for xbd(4)


I don't see anything obviously glaringly wrong, and of course this is
working A-OK on my same machines with NetBSD-5 and NetBSD-current (and
originally somewhat older NetBSD-8.99) domUs.

However I'm really not very familiar with this code and the specs for
what it should be doing so I'm unlikely to be able to spot anything
that's missing.  I did read the following, which mostly reminded me to
look in xenstore's db to see what feature-max-indirect-segments is set
to by default:

https://xenproject.org/2013/08/07/indirect-descriptors-for-xen-pv-disks/


Here's what is stored for a file-backed device:

backend = ""
 vbd = ""
  3 = ""
   768 = ""
frontend = "/local/domain/3/device/vbd/768"
params = "/build/images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img"
script = "/etc/xen/scripts/block"
frontend-id = "3"
online = "1"
removable = "0"
bootable = "1"
state = "4"
dev = "hda"
type = "phy"
mode = "r"
device-type = "disk"
discard-enable = "0"
vnd = "/dev/vnd0d"
physical-device = "3587"
hotplug-status = "connected"
sectors = "792576"
info = "4"
sector-size = "512"
feature-flush-cache = "1"
feature-max-indirect-segments = "17"


Here's what's stored for an LVM-LV backed vbd:

162 = ""
 2048 = ""
  frontend = "/local/domain/162/device/vbd/2048"
  params = "/dev/mapper/vg1-fbsd--test.0"
  script = "/etc/xen/scripts/block"
  frontend-id = "162"
  online = "1"
  removable = "0"
  bootable = "1"
  state = "4"
  dev = "sda"
  type = "phy"
  mode = "r"
  device-type = "disk"
  discard-enable = "0"
  physical-device = "43285"
  hotplug-status = "connected"
  sectors = "83886080"
  info = "4"
  sector-size = "512"
  feature-flush-cache = "1"
  feature-max-indirect-segments = "17"


So "17" seems an odd number, but it is apparently because of "Need to
alloc one extra page to account for possible mapping offset".  It is
currently the maximum for indirect-segments, and it's hard-coded.
(Linux apparently has a max of 256, and the linux blkfront defaults to
only using 32.)  Maybe it should be "16", so matching max_request_size?



I did take a quick gander at the related code in FreeBSD (both the domU
code that's talking to this code in NetBSD, and the dom0 code that would
be used if dom0 was running FreeBSD), and besides seeing that it is
quite different, I also don't see anything obviously wrong or
incompatible there either.  (I do note that the FreeBSD equivalent to
xbdback(4) has a major advantage of being able to directly access files,
i.e. without the need for vnd(4).  Not quite as exciting as maybe full
9pfs mounts through to domUs would be, but still pretty neat!)

FreeBSD's equivalent to xbdback(4) (i.e. sys/dev/xen/blkback/blkback.c)
doesn't seem to mention "feature-max-indirect-segments", so apparently
they don't offer it yet, though it does mention "feature-flush-cache".

However their front-end code does detect it and seems to make use of it,
and has done for some 6 years now according to "git blame" (with no
recent fixes beyond fixing a memory leak on their end).  Here we see it
live from FreeBSD's sysctl output, thus my concern that this feature may
be the source of the problem:

hw.xbd.xbd_enable_indirect: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 23:04:29 - (UTC), mlel...@serpens.de (Michael van Elst) 
wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> wo...@planix.ca ("Greg A. Woods") writes:
>
> >SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe
>
> That seems to be a message from the disk driver:

Yes, exactly, that's from the FreeBSD kernel as fsck was trying to
update the superblock and mark the filesystem as dirty (their fsck_ffs
always opens the device for write, even with '-n'); and the error is of
course because the backend has attached the disk as a read-only device.


> The latter case should log a message on Dom0 about DIOCCACHESYNC
> failing.

I haven't seen anything like that yet.


> But if you have sectors of DEV_BSIZE like here there is no difference
> and no conflict.

Yes as far as I've seen the FreeBSD domU reports a sector size of 512
bytes in every xbd(4) device and for every GEOM partition it creates or
finds on those devices.

FreeBSD newfs seems to concur that sectors are 512 bytes even when
writing to a raw (i.e. un-labeled) /dev/da1 (which has a 30GB LVM LV
backing it):


# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.

$ echo 62914560 \* 512 / 1024 / 1024 | bc -l
30720.


The NetBSD dom0 reported the attachment of this device with a matching
number of (512-byte) sectors:

xbd backend: attach device scratch-fbsd--t (size 62914560) for domain 2



> The FreeBSD-12.2-RELEASE-amd64-mini-memstick.img I just fetched
> has two MBR partitions:
>
> Partition table:
> 0: EFI system partition (sysid 239)
> start 1, size 1600 (1 MB, Cyls 0/0/2-0/50/1)
> 1: FreeBSD or 386BSD or old NetBSD (sysid 165)
> start 1601, size 789520 (386 MB, Cyls 0/50/2-386/18/17), Active
>
> Making our disklabel program read the FreeBSD disklabel was a bit
> tricky, there is a bug that makes it segfault, but:
>
> type: unknown
> disk:
> label:
> flags:
> bytes/sector: 512
> sectors/track: 1
> tracks/cylinder: 1
> sectors/cylinder: 1
> cylinders: 789520
> total sectors: 789520
> rpm: 3600
> interleave: 0
> trackskew: 0
> cylinderskew: 0
> headswitch: 0   # microseconds
> track-to-track seek: 0  # microseconds
> drivedata: 0
>
> 8 partitions:
> #sizeoffset fstype [fsize bsize cpg/sgs]
>  a:78950416 4.2BSD  0 0 0  # (Cyl. 16 - 
> 789519)
>  c:789520 0 unused  0 0# (Cyl.  0 - 
> 789519)
>
>
> Apparently the MBR partition 1 starting at sector 1601 is a disk
> image itself and the disklabel is in sector 1 of that image.

Well I think in FreeBSD parlance it just is an MBR partition that has a
BSD label confined within its limits, and that BSD label further divides
its MBR partition into more disk partitions.  That's just the FreeBSD
way -- if I understand correctly their BSD labels are restricted to the
confines of the MBR partition where they sit.

And yes, FreeBSD's disklabel output matches:

# disklabel da0s2
# /dev/da0s2:
8 partitions:
#  size offsetfstype   [fsize bsize bps/cpg]
  a: 789504 164.2BSD0 0 0
  c: 789520  0unused0 0 # "raw" part, don't edit


So in FreeBSD the filesystem there is at "/dev/da0s2a" -- where "da0" is
the "device", "s2" is the second MBR partition, and "a" is of course the
BSD label's "a" partition.  They use more or less the same naming for
GPT entries as well.


> Adding a wedge to access the partition at offset 16 (+1601) gives:
>
> # dkctl vnd0 addwedge freebsd 1617 789504 ffs
> dk6 created successfully.

I had not thought to try that yet.  It's good to see it works!

Now that I can get vnd0d to export the .img file to FreeBSD I think I've
effectively eliminated worries about vnd(4) causing the bigger problems.



Speaking of which, I think this might be evidence that the FreeBSD
system was suffering the effects of accessing the corrupted filesystem I
was experimenting with.  Note the SIGSEGV's from processes apparently
after the kernel has gone into its halt-spin loop (this is the first
time I've seen this particular misbehaviour):


# halt -pq
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Waiting (max 60 seconds) for system thr

Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
ev/da1
** /dev/da1
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=325128
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877864
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877866
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877879
SALVAGE? [yn] ^C

* FILE SYSTEM MARKED DIRTY *


Back on the NetBSD side:


 # xl block-detach fbsd-test  2064
 # fsck /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] n

SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

12076 files, 91642 used, 7647797 free (293 frags, 955938 blocks, 0.0% 
fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *



--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 13:23:31 -0700, "Greg A. Woods"  wrote:
Subject: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> In fact it only seems to be fsck that complains, possibly along
> with any attempt to write to a filesystem, that causes problems.

Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
xbd(4) with a new filesystem created on it, is impossible.

I was able to write 500MB of zeros to the LVM LV backed disk,
overwriting the copy of the .img file I had put there, and only see
500MB of zeros back on the NetBSD side, so writing directly to the raw
/dev/da1 on FreeBSD seems to write data without problem.

However then the following happens when I try to use a new FS there:

# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 
11540352, 12822592, 14104832, 15387072, 16669312,
 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 
26927232, 28209472, 29491712, 30773952, 32056192, 33338432,
 34620672, 35902912, 37185152, 38467392, 39749632, 41031872, 42314112, 
43596352, 44878592, 46160832, 47443072, 48725312, 50007552,
 51289792, 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 
60265472, 61547712, 62829952
# mount /dev/da1 /mnt
# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
/dev/da1 on /mnt (ufs, local)
# df
Filesystem                512-blocks    Used     Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install      782968  737016    -16680   102%    /
devfs                              2       2         0   100%    /dev
tmpfs                          65536     608     64928     1%    /var
tmpfs                          40960       8     40952     0%    /tmp
/dev/da1                    60901560      16  56029424     0%    /mnt
# cp /COPYRIGHT /mnt
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 0, cgp: 0xe66de1a4 != bp: 0xf433acbc
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 1, cgp: 0x89ba8532 != bp: 0x3491fbd0
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 3, cgp: 0xdeaf87a7 != bp: 0x3a071e86
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 7, cgp: 0x7085828d != bp: 0xaaae0f19
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 15, cgp: 0x293dfe28 != bp: 0xe2f25f8b
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 31, cgp: 0x9a4d0762 != bp: 0x4119c6e
[[  and on and on  ]]
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 49, cgp: 0x931f84e5 != bp: 0xb48687df

/mnt: create/symlink failed, no inodes free
cp: /mnt/COPYRIGHT: No space left on device
# Apr 11 20:37:28  syslogd: last message repeated 4 times
Apr 11 20:37:59  kernel: pid 713 (cp), uid 0 inumber 2 on /mnt: out of inodes
# df -i
Filesystem                512-blocks    Used     Avail Capacity iused   ifree %iused  Mounted on
/dev/ufs/FreeBSD_Install      782968  737016    -16680   102%   12129     285    98%  /
devfs                              2       2         0   100%       0       0   100%  /dev
tmpfs                          65536     608     64928     1%      75  114613     0%  /var
tmpfs                          40960       8     40952     0%       6   71674     0%  /tmp
/dev/da1                    60901560      16  56029424     0%       2 4012796     0%  /mnt




NetBSD can actually make some sense of this FreeBSD filesystem though:

# fsck -n /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0 (NO WRITE)
Invalid quota magic number

CONTINUE? yes

** File system is already clean
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

BLK(S) MISSING IN BIT MAPS
SALVAGE? no

** Phase 6 - Check Quotas

CLEAR SUPERBLOCK QUOTA FLAG? no

2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *



I'm not sure if those problems are to be expected with a FreeBSD-created
filesystem or not.  Probably the "Invalid quota magic number" is normal,
but I'm not sure about the "BLK(s) MISSING IN BIT MAPS".  Have FreeBSD
and NetBSD FFS diverged this much?  I won't try to mount it, especially
not from the dom0.

Dumpfs shows the following:

file system: /dev/mapper/rscratch-fbsd--test.0
format  FFSv2
endian  little-endian
location 65536  (-b 128)
magic   19540119timeSun Apr 11 13:46:15 2021
superblock location 65536   id  [ 60735d32 358197c4 ]
cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
nbfree  951584  ndir2   nifree  4012796 nffree  21
ncg 50  size7864320 blocks  7612695
bsize   32768   shift   15  m

one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
So, with the vnd(4) issue more or less sorted, there seems to be one
major mystery remaining w.r.t. whatever has gone wrong with the ability
of NetBSD-current XEN3_DOM0 to host FreeBSD domUs.

I still can't create a clean filesystem on a writeable disk.  The
"newfs" runs fine, but a subsequent "fsck" finds errors and cannot fix
them (though the first run does change one or two things).

I can't even get a clean fsck of the running system's root FS:
(the "ada0: disk error" after I hit ^C is because the underlying disk
(vnd0d) is exported read-only to the domU)


# fsck -v /dev/ufs/FreeBSD_Install
start / wait fsck_ufs /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install

SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n


ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe

* FILE SYSTEM MARKED DIRTY *

#


Most mysteriously this filesystem is in use as the root FS and all the
files in it can be found and read!  Presumably they are all intact too
-- no programs have failed or behaved mysteriously (except fsck) and all
the human readable files I've looked at (e.g. manual pages) all seem
fine.  In fact it only seems to be fsck that complains, possibly along
with any attempt to write to a filesystem, that causes problems.  (I
believe writing to a filesystem appears to corrupt it but that is only
according to fsck.  I do believe there was an eventual crash of a
system that had been running with active filesystems, but I have not got
far enough again since to reproduce this, due to the fsck problem.)

# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
# df
Filesystem                512-blocks    Used   Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install      782968  737016  -16680   102%    /
devfs                              2       2       0   100%    /dev
tmpfs                          65536     232   65304     0%    /var
tmpfs                          40960       8   40952     0%    /tmp
# time -l sh -c 'find  / -type f | xargs cat > /dev/null '
   38.58 real 1.36 user18.30 sys
  4872  maximum resident set size
13  average shared memory size
 5  average unshared data size
   215  average unshared stack size
  1906  page reclaims
 0  page faults
 0  swaps
 14024  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 12348  voluntary context switches
33  involuntary context switches


In fact I can put a copy of the FreeBSD img file into an LVM LV, attach
it to the running FreeBSD domU, mount it (without an FSCK, since the
FreeBSD_Install filesystem comes clean from the factory), then do
"diff -r -X /mnt -X /dev / /mnt" and find only the expected differences.

So, what could be different about how fsck reads v.s. the kernel itself?

If indeed writing to filesystem corrupts it, how and why?


It seems NetBSD can make sense of the BSD label inside the FreeBSD
mini-memstick.img file, e.g. when accessed through vnd(4), but it can't
seem to make sense of the filesystem(s) inside (which I guess might be
expected?):

# file -s /dev/rvnd0f
/dev/rvnd0f: DOS/MBR boot sector, BSD disklabel

# disklabel vnd0
# /dev/rvnd0:
type: vnd
disk: vnd
label: fictitious
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 387
total sectors: 791121
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

6 partitions:
#sizeoffset fstype [fsize bsize cpg/sgs]
 d:791121 0 unused  0 0# (Cyl.  0 -386*)
 e:  1600 1unknown # (Cyl.  0*-  0*)
 f:789520  1601 4.2BSD  0 0 0  # (Cyl.  0*-386*)
disklabel: boot block size 0
disklabel: super block size 0


# fsck -n /dev/vnd0f
** /dev/rvnd0f (NO WRITE)
BAD SUPER BLOCK: CAN'T FIND SUPERBLOCK
/dev/rvnd0f: CANNOT FIGURE OUT SECTORS PER CYLINDER


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Greg A. Woods
386/18/17), Active
2: 
3: 
First active partition: 1
Drive serial number: 2425393296 (0x90909090)



So, as you can see below I think it's better to round the device out to
a full number of cylinders if we're still going to play the CHS
silliness.

But for vnd(4) in particular I think it does beg the questions I ask
in the new comments below, especially the first one.

--- vnd.c.~1.278.~  2021-03-07 17:18:43.0 -0800
+++ vnd.c   2021-04-11 11:00:52.147530152 -0700
@@ -1480,20 +1480,41 @@
}
} else if (vnd->sc_size >= (32 * 64)) {
/*
-* Size must be at least 2048 DEV_BSIZE blocks
-* (1M) in order to use this geometry.
+* The file's size must be at least 2048 DEV_BSIZE
+* blocks (1M) in order to use this (fake) geometry.
+*
+* XXX why ever use this arbitrary fake setup instead of the next
 */
vnd->sc_geom.vng_secsize = DEV_BSIZE;
vnd->sc_geom.vng_nsectors = 32;
vnd->sc_geom.vng_ntracks = 64;
-   vnd->sc_geom.vng_ncylinders = vnd->sc_size / (64 * 32);
+   vnd->sc_geom.vng_ncylinders = (vnd->sc_size + (64 * 32) - 1) / (64 * 32);
} else {
+   /*
+* XXX is there anything that pretends which is worse:
+* rotational delay, or seeking?  Does it matter for < 1M?
+*/
+#if 1
+   /* else pretend it's just one big platter of single-sector cylinders */
vnd->sc_geom.vng_secsize = DEV_BSIZE;
vnd->sc_geom.vng_nsectors = 1;
vnd->sc_geom.vng_ntracks = 1;
vnd->sc_geom.vng_ncylinders = vnd->sc_size;
+#else
+   /* else pretend it's just one big cylinder */
+   vnd->sc_geom.vng_secsize = DEV_BSIZE;
+   vnd->sc_geom.vng_nsectors = vnd->sc_size;
+   vnd->sc_geom.vng_ntracks = 1;
+   vnd->sc_geom.vng_ncylinders = 1;
+#endif
}

+   /*
+* n.b.:  this will round the disk's size up to an even cylinder
+* amount, but (if it is writeable) writing into the partly
+* empty cylinder, i.e. past current end of the file, will
+* simply extend the file
+*/
vnd_set_geometry(vnd);

if (vio->vnd_flags & VNDIOF_READONLY) {
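
(For a concrete sense of what that rounding change does, here's a
standalone arithmetic check -- not part of the patch itself -- using
the 791121-sector mini-memstick image as the example size:)

#include <stdio.h>

int
main(void)
{
        unsigned long sc_size = 791121;         /* sectors in the example image */
        unsigned long spc = 64 * 32;            /* sectors per fake cylinder */

        unsigned long old_ncyl = sc_size / spc;                 /* truncates */
        unsigned long new_ncyl = (sc_size + spc - 1) / spc;     /* rounds up */

        printf("old: %lu cylinders = %lu sectors (%lu sectors hidden)\n",
            old_ncyl, old_ncyl * spc, sc_size - old_ncyl * spc);
        printf("new: %lu cylinders = %lu sectors\n",
            new_ncyl, new_ncyl * spc);
        return 0;
}

That prints 386 cylinders (790528 sectors, hiding the last 593) for the
old code and 387 cylinders (792576 sectors) for the rounded-up version
-- the latter being the same sector count that shows up for the image
in the xenstore dumps elsewhere in these messages.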



--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 16:06:27 - (UTC), mlel...@serpens.de (Michael van Elst) 
wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk 
images! (vnd(4) hides labels!)
>
> k...@munnari.oz.au (Robert Elz) writes:
>
> >Date:Sun, 11 Apr 2021 14:25:40 - (UTC)
> >From:mlel...@serpens.de (Michael van Elst)
> >Message-ID:  
>
> >  | +   dg->dg_secperunit = vnd->sc_size / DEV_BSIZE;
>
> >While it shouldn't make any difference for any properly created image
> >file, make it be
>
> > (vnd->sc_size + DEV_BSIZE - 1) / DEV_BSIZE;
>
> >so that any trailing partial sector remains in the image.
>
>
> The trailing partial sector is already ignored. Fortunately no disk image
> can even have a partial trailing sector and some magically implicit
> padding would have unexpected side effects.
>
> But the code also needs to be adjusted for different sector sizes.

So since vnd->sc_size is in units of disk blocks

dg->dg_secperunit =
((vnd->sc_size * DEV_BSIZE) + DEV_BSIZE - 1) /
vnd->sc_geom.vng_secsize;

right?

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
On the other hand NetBSD's own .img files work OK.

However interestingly there's a small, but apparently insignificant
(because it works OK) difference between how fdisk sees the disk image
and the vnd0 device:

# fdisk -F images/NetBSD-9.99.81-amd64-live.img
Disk: images/NetBSD-9.99.81-amd64-live.img
NetBSD disklabel disk geometry:
cylinders: 972, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)
# vndconfig -cv vnd0 images/NetBSD-9.99.81-amd64-live.img
/dev/rvnd0: 7999586304 bytes on images/NetBSD-9.99.81-amd64-live.img
# fdisk vnd0
Disk: /dev/rvnd0
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 64, sectors/track: 32 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)
21:10 [1.1496] # disklabel vnd0
# /dev/rvnd0:
type: ESDI
disk: image
label: 
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

8 partitions:
#sizeoffset fstype [fsize bsize cpg/sgs]
 a:  15622144  2048 4.2BSD   1024  819216  # (Cyl.  1 -   7628)
 c:  15622144  2048 unused  0 0# (Cyl.  1 -   7628)
 d:  15624192 0 unused  0 0# (Cyl.  0 -   7628)
# disklabel images/NetBSD-9.99.81-amd64-live.img
# images/NetBSD-9.99.81-amd64-live.img:
type: ESDI
disk: image
label: 
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

8 partitions:
#sizeoffset fstype [fsize bsize cpg/sgs]
 a:  15622144  2048 4.2BSD   1024  819216  # (Cyl.  1 -   7628)
 c:  15622144  2048 unused  0 0# (Cyl.  1 -   7628)
 d:  15624192 0 unused  0 0# (Cyl.  0 -   7628)



From inside the NetBSD live image:

[   1.4412586] xbd4 at xenbus0 id 4: Xen Virtual Block Device Interface
[   1.4422594] xbd4: using event channel 20
[   1.7112647] entropy: xbd4 attached as an entropy source (collecting without 
estimation)
[   1.7112647] xbd4: 7629 MB, 512 bytes/sect x 15624192 sectors
[   1.7112647] xbd4: backend features 0x9



# df
Filesystem  1K-blocks UsedAvail %Cap Mounted on
/dev/xbd4a7562414  4699114  2485180  65% /
ptyfs   110 100% /dev/pts
# fdisk xbd4
Disk: /dev/rxbd4
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 1, sectors/track: 2048 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)



The NetBSD live.img root filesystem seems fine and clean:

# fsck -n /dev/rxbd4a
** /dev/rxbd4a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
32740 files, 2349557 used, 1431650 free (538 frags, 178889 blocks, 0.0% 
fragmentation)


-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpqxRB084Uts.pgp
Description: OpenPGP Digital Signature


Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
At Sat, 10 Apr 2021 18:44:32 -0700, Brian Buhrow  wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk 
images! (vnd(4) hides labels!)
>
>   hello.  This must be some kind of regression that's ben around a
> while.  I'm runing a xen dom0 with NetBSD-5.2 and xen-3.3.2, very old,
> but vnd(4) does expose the entire file to the domu's including FreeBSD
> 11 and 12 without any corruption or booting issues.  Do you know when
> this trouble began?

I don't know -- I think I've only ever successfully used ISO files, and
I think I gave up on some IMG file(s) previously (possibly not just from
FreeBSD) without trying to understand why they didn't work.

Have you tried specifically with a recent FreeBSD mini-memstick.img file?

I'm thinking (esp. given what I see from "od -c < /dev/rvnd0d") that
what's wrong is the vnd(4) driver is (also?) imposing some
mis-interpreted idea about the number of cylinders and heads or
something like that, especially given that "fdisk vnd0" is so totally
confused about what's in there.

There's a definite pattern of corruption anyway -- I just can't explain
it well enough yet.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpbUW36DVCRL.pgp
Description: OpenPGP Digital Signature


a working patch to _allow_ non-hardware-RNG entropy sources

2021-04-10 Thread Greg A. Woods
 #define RND_FLAG_ESTIMATE_TIME 0x4000  /* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE        0x8000  /* estimate entropy on value */
 #define RND_FLAG_HASENABLE     0x0001  /* has enable/disable fns */
-#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_ESTIMATE_VALUE| \
+RND_FLAG_COLLECT_TIME|RND_FLAG_ESTIMATE_TIME)
+/*
+ * N.B.:  It would appear from the above value that by default all devices using
+ * RND_FLAG_DEFAULT will be enabled directly to collect _and_ estimate(count)
+ * entropy based on both deltas in values they submit, and the time delta
+ * between submissions.  HOWEVER this is moderated by a switch in
+ * kern_entropy.c:rnd_attach_source() which will add either the NO_COLLECT
+ * and/or the NO_ESTIMATE flag depending on what type the device is.
+ *
+ * By default only RND_TYPE_SKEW, RND_TYPE_ENV, RND_TYPE_POWER, and RND_TYPE_RNG
+ * will avoid both of these flags being set.
+ *
+ * Network devices will be entirely disabled (from both collection and
+ * estimating) as they can possibly be easily influenced externally.
+ *
+ * All other devices will be given the NO_ESTIMATE flag such that they are not
+ * used to estimate(count) entropy by default.
+ *
+ * In any case either or both of the RND_FLAG_NO_* flags can be turned off at
+ * runtime by the RNDCTL ioctl on rnd(4), i.e. by rndctl(8) such that entropy
+ * collection and estimation can be enabled on a per-device or per-type basis.
+ */

 #define RND_TYPE_UNKNOWN        0   /* unknown source */
 #define RND_TYPE_DISK           1   /* source is physical disk */
Index: sys/rndsource.h
===
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/rndsource.h,v
retrieving revision 1.7
diff -u -r1.7 rndsource.h
--- sys/rndsource.h 30 Apr 2020 03:28:19 -  1.7
+++ sys/rndsource.h 8 Apr 2021 18:15:01 -
@@ -45,8 +45,6 @@

 /*
  * struct rnd_delta_estimator
- *
- * Unused.  Preserved for ABI compatibility.
  */
 typedef struct rnd_delta_estimator {
 uint64_t        x;
@@ -68,8 +66,8 @@
 struct krndsource {
LIST_ENTRY(krndsource) list;/* the linked list */
 char            name[16];       /* device name */
-   rnd_delta_t time_delta; /* unused */
-   rnd_delta_t value_delta;/* unused */
+   rnd_delta_t time_delta; /* */
+   rnd_delta_t value_delta;/* */
 uint32_t        total;  /* number of bits added while cold */
 uint32_t        type;   /* type, RND_TYPE_* */
 uint32_t        flags;  /* flags, RND_FLAG_* */
@@ -89,8 +87,10 @@
uint32_t);
 void   rnd_detach_source(struct krndsource *);

+#if 0
 void   _rnd_add_uint32(struct krndsource *, uint32_t); /* legacy */
 void   _rnd_add_uint64(struct krndsource *, uint64_t); /* legacy */
+#endif

 void   rnd_add_uint32(struct krndsource *, uint32_t);
 void   rnd_add_data(struct krndsource *, const void *, uint32_t, uint32_t);
Index: uvm/uvm_page.c
===
RCS file: /cvs/master/m-NetBSD/main/src/sys/uvm/uvm_page.c,v
retrieving revision 1.250
diff -u -r1.250 uvm_page.c
--- uvm/uvm_page.c  20 Dec 2020 11:11:34 -  1.250
+++ uvm/uvm_page.c  8 Apr 2021 21:41:20 -
@@ -983,8 +983,7 @@
 * Attach RNG source for this CPU's VM events
 */
 rnd_attach_source(>rs, ci->ci_data.cpu_name, RND_TYPE_VM,
-   RND_FLAG_COLLECT_TIME|RND_FLAG_COLLECT_VALUE|
-   RND_FLAG_ESTIMATE_VALUE);
+   RND_FLAG_DEFAULT);
 }

 /*

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpMtaoK9AMBK.pgp
Description: OpenPGP Digital Signature


I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
0002000


# dd if=/dev/rvnd0d count=17 msgfmt=quiet| od -c
000   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
002   \0  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0 020  \0  \0  \0
0020020  030  \0  \0  \0 230 005  \0  \0  \0  \0  \0  \0 377 377 377 377
0020040  367 360   p   `  \0  \0  \0 007 200 037  \0 027  \0  \0  \0
0020060   \0   @  \0  \0  \0  \b  \0  \0  \b  \0  \0  \0 005  \0  \0  \0
0020100   \0  \0  \0  \0   <  \0  \0  \0  \0 300 377 377  \0 370 377 377
0020120  016  \0  \0  \0 013  \0  \0  \0 004  \0  \0  \0  \0 020  \0  \0
0020140  003  \0  \0  \0 002  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0
0020160   \0  \0  \0  \0  \0 020  \0  \0 200  \0  \0  \0 004  \0  \0  \0
0020200   \0  \0  \0  \0 300 220 005  \0 001  \0  \0  \0  \0  \0  \0  \0
0020220  367 360   p   `   _   `   A   q 230 005  \0  \0  \0  \b  \0  \0
0020240   \0   @  \0  \0  \0  \0  \0  \0 300 220 005  \0 300 220 005  \0
0020260  027  \0  \0  \0 001  \0  \0  \0  \0   X  \0  \0   0   d 001  \0
0020300  001  \0  \0  \0 377 357 003  \0 375 347 007  \0 016  \0  \0  \0
0020320   \0 001  \0 200  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0020340   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0021000


In fact the vnd0d device seems to give garbage forever -- it seems to
have been completely confused by trying to access a real disk image!


As a side note, unfortunately, even though access to this LVM-backed
mini-memstick.img file now seems good enough to get the install booted and
a shell running, access to other FreeBSD xbd(4) devices is still not
working from FreeBSD (i.e. a freshly newfs'ed FS appears corrupt to an
immediate fsck without ever being mounted, and even fsck of the mounted
root in this IMG fails enormously).

# df
Filesystem   512-blocks   Used  Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install 782968 737016 -16680   102%/
devfs 2  2  0   100%/dev
tmpfs 65536232  65304 0%/var
tmpfs 40960  8  40952 0%/tmp
# fsck /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install

SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n


ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cda0: disk error cmd=write 8145-8152 status: fffe

#
* FILE SYSTEM MARKED DIRTY *

#


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpELwDHrgUjQ.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Greg A. Woods
At Wed, 7 Apr 2021 22:47:39 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
> 
> When you create a custom setup like that, you will have to replace
> etc/rc.d/entropy with a custom solution (e.g. mounting some flash storage).

No storage means "NO storage".

> Or you ignore the issue and do the dd at each boot - hopefully not generating
> any strong keys on that machine then (but you would have no good storage
> for those anyway).

Or I don't ignore the issue and instead I fix the code so that it's
still possible to get entropy estimates from non-hardware-RNG devices
and then things keep working the way they used to, and there's still
some possibility of _real_ entropy being used to seed the PRNGs.

From what I've seen here so far I'm far from alone in wanting that
ability.

What's most confusing is why there's such animosity and stubborn
unwillingness to even consider that the old way of getting some entropy
from a few less-than-perfect sources was good enough for many, or even
most, of us.

It's better than no entropy when there are no "perfect" sources, and
that's also a situation that includes many of us.

It doesn't have to be the default.

-- 
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpgg9AaQiU92.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Greg A. Woods
At Wed, 7 Apr 2021 09:52:29 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 03:12:45PM -0700, Greg A. Woods wrote:
> > > Isn't it as simple as:
> > >
> > >   dd bs=32 if=/dev/urandom of=/dev/random
> >
> > No, that still leaves the question of _when_ to run it.  (And, at least
> > at the moment, where to put it.  /etc/rc.local?)
>
> Of course not!
>
> You run it once. Manually. And never again.

Nope, sorry, that's not a good enough answer.  It doesn't solve the
problem of dealing with a lack of mutable storage.

A system _MUST_ be able to be booted and with no user intervention be
able to (eventually) get to the state where /dev/random and getrandom(2)
WILL NOT block, and it _MUST_ be able to do so without the help of any
hardware RNG, and without the ability to store (and read) a seed from a
file or other storage device.

I.e. we _MUST_ be _ABLE_ to choose to use other devices as sources for
entropy, even if they are not perfect.  We had this, it works fine, we
still need it.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpeaL6Xd0CAO.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
At Tue, 6 Apr 2021 20:21:43 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 10:54:51AM -0700, Greg A. Woods wrote:
> >
> > And the stock implementation has no possibility of ever providing an
> > initial seed at all on its own (unlike previous implementations, and of
> > course unlike what my patch _affords_).
>
> Isn't it as simple as:
>
>   dd bs=32 if=/dev/urandom of=/dev/random

No, that still leaves the question of _when_ to run it.  (And, at least
at the moment, where to put it.  /etc/rc.local?)

Isn't something like the following better (assuming you choose your devices
carefully):

echo 'rndctl_flags="-t env;-t disk;-t tty"' >> /etc/rc.conf

That's what my patches fix and allow, and this way you don't have to
guess when you can safely use /dev/urandom as an entropy seed -- the
seeding happens in real time, and only as entropy bits are made
available from those given devices.

That can also be done by sysinst, assuming a reasonably well worded
question can be answered, and that it might only need to be asked if
there are no "rng" type devices already.

Doing this also requires no network access (ever).

It can even be done, ahead of time, for use on immutable systems.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpW4B04umieR.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
At Tue, 6 Apr 2021 12:08:54 +, Taylor R Campbell  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The main issue that hits people is that the traditional mechanism by
> which the OS reports a potential security problem with entropy is for
> it to make applications silently hang -- and the issue is getting
> worse now that getrandom() is more widely used, e.g. in Python when
> you do `import multiprocessing'.

I think adding a uprintf(9) call that the user who started the blocked
process (i.e. not just the admin) has a better chance of actually seeing
would be one step closer, and it should be extremely easy to do.
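
To make that concrete, a minimal sketch of the idea (uprintf(9) writes
to the controlling terminal of the current process; whether to keep the
existing console printf as well is an open question):

    /* also tell the owner of the blocked process directly */
    uprintf("entropy: pid %d (%s) blocking due to lack of entropy\n",
        curproc->p_pid, curproc->p_comm);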

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpOvi5MZvUCj.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
 4, flags 0x70, func=0x8083f151, ver=427
kern.entropy.gather (1.1260.1264): CTLTYPE_INT, size 4, flags 0x70, func=0x8083dd4c, ver=428
kern.entropy.needed (1.1260.1265): CTLTYPE_INT, size 4, flags 0x100, ver=429
kern.entropy.pending (1.1260.1266): CTLTYPE_INT, size 4, flags 0x100, ver=430
kern.entropy.epoch (1.1260.1267): CTLTYPE_INT, size 4, flags 0x100, ver=431

Perhaps function pointer values shouldn't be printed as integers?


And there are no text descriptions for some of the kern.entropy values:

17:27 [1.831] # sysctl -d kern.entropy.needed
kern.entropy.needed: (no description)
17:27 [1.832] # sysctl -d kern.entropy.pending
kern.entropy.pending: (no description)
17:27 [1.833] # sysctl -d kern.entropy.epoch
kern.entropy.epoch: (no description)
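
For reference, those descriptions are normally attached when the nodes
are created; a hypothetical sketch of what that looks like with
sysctl_createv(9) (the variable names here are made up, not the ones
actually used in kern_entropy.c):

    sysctl_createv(&log, 0, &entropy_node, NULL,
        CTLFLAG_READONLY, CTLTYPE_INT, "needed",
        SYSCTL_DESCR("Bits of entropy still needed before unblocking"),
        NULL, 0, &entropy_needed, 0, CTL_CREATE, CTL_EOL);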


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpE52Jkajvwh.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 15:37:49 -0400, Thor Lancelot Simon  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> >
> > BTW, to me reusing the same entropy on every reboot seems less secure.
>
> Sure.  But that's not what the code actually does.
>
> Please, read the code in more depth (or in this case, breadth), then argue
> about it.

Sorry, I was alluding to the idea of sticking the following in
/etc/rc.local as the brain-dead way to work around the problem:

echo -n "" > /dev/random

However I have not yet read and understood enough of the code to know
if:

dd if=/dev/urandom of=/dev/random bs=32 count=1

is any more "secure" -- I'm guessing (hoping?) it depends on exactly
when this might be run, and also depends on which, if any, other device
sources are enabled for "collecting".  If in some rare case none were
enabled, or if it were run before any were able to "stir the pool", then
I'm guessing it would be no more secure than writing a fixed string.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpgF42U_yi8i.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
"stir" the pot in the first place, then why not just "count"
it as "real" entropy and be done with it -- at least then it is obvious
when enough entropy has been gathered and the currently implemented
algorithms handle things properly and securely and all inside the
kernel.  I.e. the admin doesn't have to put a "sleep 30" or whatever in
front of it and hope that's enough and that it's still not too
predictable.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpxsHTzqoenJ.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 03:02:42 +0200, Joerg Sonnenberger  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Except that's not what the system is doing. It removes the seed file on
> boot and creates a new one on shutdown.

That's not exactly what the documentation says it does (from rndctl(8)):

-L  Load saved entropy from file save-file and overwrite it with a
 seed derived by hashing it together with output from /dev/urandom
 so that the new seed has at least as much entropy as either the
 old seed had or the system already has.  If interrupted, either
 the old seed or the new seed will be in place.

The code seems to concur.

Also the system re-saves the $random_file via /etc/security
(unconditionally, i.e. on every run, though only if $random_file is set).

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpJ2gB7j21GX.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 16:13:55 +1200, Lloyd Parkes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The current implementation prints out a message whenever it blocks a
> process that wants randomness, which immediately makes this
> implementation superior to all others that I have ever seen. The
> number of times I've logged into systems that have stalled on boot and
> made them finish booting by running "ls -lR /" over the past 20 years
> are too many to count. I don't know if I just needed to wait longer
> for the boot to finish, or if generating entropy was the fix, and I
> will never know. This is nuts.

Indeed!

> We can use the message to point the system administrator to a manual
> page that tells them what to do, and by "tells them what to do", I
> mean in plain simple language, right at the top of the page, without
> scaring them.

Excellent idea!  :-)

However I have been wondering if sending the message just to the
console, and logging it, say in /var/log/kern, is sufficient.

It still took me a very long time to find the existing new message
because I don't hang out on the console -- this is a VM, after all, and
it's running in a city almost exactly 4200km driving distance from me
too!  As it is, I feel I hang out on the console more often than the average
admin who doesn't use a physical console, and of course infinitely more
often than any user who doesn't admin his own server.

I have added the following comment to the kernel to remind me to think
more about this, as a uprintf(9) at the same time would pop right up on
the actual user's session too:

--- kern_entropy.c.~1.30.~  2021-03-07 17:23:05.0 -0800
+++ kern_entropy.c  2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@

/* Wait for some entropy to come in and try again.  */
KASSERT(E->stage >= ENTROPY_WARM);
-   printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+   printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
   curproc->p_pid, curproc->p_comm);

    if (ISSET(flags, ENTROPY_SIG)) {


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpbil_4h9ofy.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If I understood it properly, there's no need for such a knob.
> echo 0123456789abcdef0123456789abcdef > /dev/random
>
> will get you back to the state we had in netbsd-9, with (pseudo-)randomness
> collected from devices.

Well, no, not quite so much randomness.  Definitely pseudo though!

My patch on the other hand can at least inject some real randomness into
the entropy pool, even if it is observable or influenceable by nefarious
dudes who might be hiding out in my garage.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpdkEisDB6Js.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Sun, 4 Apr 2021 18:47:23 -0700, Brian Buhrow  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Hello.  As I understand it, Greg ran into this problem on a xen domu.
> In checking my NetBSD-9 system running as a domu under xen-4.14.1,
> there is no rdrand or rdseed feature exposed to domu's by xen.  This
> observation is confirmed by looking at the xen command line reference
> page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

The problem in the domU was really just the very tip of the iceberg.

The dom0 exhibits the exact same problem and for the same reasons.

> and NetBSD doesn't trust the random sources provided by the xennet(4)
> and xbd(4) drivers.  Therefore, the only solution to get randomness
> working for the first time on a newlyinstalled domu is to write 32
> bytes to /dev/random.

It's not that the xbd(4) devices, etc. are not trusted as entropy
sources -- the new entropy system doesn't trust anything, real or
virtual, despite the documentation saying that it can be made to do so.

My patch fixes that bug.  It was very obvious once I understood the root
of the issue.

As a result my patch fixes the bug for Xen dom0 and domU.

Writing randomness to /dev/random is _NOT_ a general solution (though it
could be IFF it can be reliably taken from /dev/urandom AND IFF the rest
of the system and documentation is completely and adequately fixed to
match the new regime).

What perturbs me the most and makes me rather angry is that the rest of
the system, and the system documentation, continued to lie and mislead
me for days (and it didn't help that nobody who knew this was pointing
helpfully and clearly at the root of the problem).  So, my patch ALSO
restores the kernel's behaviour to match the documentation and tools
(specifically rndctl).  That the core of it is just a two-line patch
makes this fix extremely satisfying.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpWiXqui7McJ.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 23:09:18 +, Taylor R Campbell  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If you know this (and this is something I certainly can't confidently
> assert!), you can write 32 bytes to /dev/random, save a seed, and be
> done with it.

I don't have random data easily available at install time.

I don't have random data easily available every time I boot a machine
with non-persistent storage (e.g. a test ISO image).

I _do_ trust well enough the sources of randomness in some device
drivers to provide me with a secure enough amount of entropy, for my
purposes.

And so with my fix(es) I don't need to feed supposedly random data to
every system on every install and/or every reboot.

What's worse?  My fixes, or something like this in /etc/rc.local:

   echo -n "" > /dev/random

> But users who don't go messing around with obscure rndctl settings in
> rc.conf will be proverbially shot in the foot by this change -- except
> they won't notice because there is practically guaranteed to be no
> feedback whatsoever for a security disaster until their systems turn
> up in a paper published at Usenix like <https://factorable.net/>.

You're really stretching your argument thinly if you are assuming
everyone _needs_ perfect entropy here.

Also, that's only if the default RND_FLAG_ESTIMATE_* bits are turned off.

AND only if the system doesn't have some true hardware RNG.

> What your change does is equivalent to going around to every device
> driver that previously said `this provides zero entropy, or I don't
> know how much entropy it provides' and replacing that claim by `this
> is a sample of an independent and perfectly uniform random string of
> bits', which is a much stronger (and falser) claim than even the old
> `entropy estimation' confabulation that NetBSD used to do.

No, only if the default RND_FLAG_ESTIMATE_* bits are ***NOT*** turned off.

AND only if the user is like me and stuck with some poor second-grade
ancient hardware that doesn't have some fancy new true hardware RNG.

In the meantime a more productive approach would be to figure out
what's best for those of us who don't need perfection every time and/or
to fix those device drivers that could feed sufficiently random data to
the entropy pool, and then to recommend a suitable value for
rndctl_flags in /etc/rc.conf.
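
For instance, any driver we do decide to trust can credit what it feeds
in explicitly through rnd_add_data(), whose final argument is the number
of bits of entropy being claimed for the sample (a sketch only --
sc_rndsource, val, and the 4-bit claim are made-up placeholders, not any
particular driver's values):

    /* credit 4 bits of entropy for the sampled value "val" */
    rnd_add_data(&sc->sc_rndsource, &val, sizeof val, 4);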

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpnOADtmWrjC.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 5 Apr 2021 01:05:58 +0200, Joerg Sonnenberger  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Part of the problem here is that most of the non-RNG data sources are
> easily observable either from the local system (e.g. any malicious user)
> or other VMs on the same machine (in case of a hypervisor) or local
> machines on the same network (in case of network interrupts).

It _Just_ _Doesn't_ _Matter_  (i.e. for many of us, most of the time).

Now ideally in the hypervisor scenario we would have a backend device
that read from /dev/random and offered it to the VM guest as a virtual
hardware RNG.  Or maybe it's as simple as passing those few bytes
through a custom Xenstore string and having a script in the VM read them
and inject them into /dev/random.  But that's not been done yet.

BTW, personally, on at least on some machines, I don't have any worry
whatsoever at the moment about one VM guest spying on, or influencing
the PRNG, in another.  Zero worry.  They're all _me_.  I don't need some
theoretically perfect level of protection from myself.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpFPOplfhwSl.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 05 Apr 2021 00:14:30 +0200 (CEST), Havard Eidnes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > What about architectures that have nothing like RDRAND/RDSEED?  Are
> > they, effectively, totally unsupported now?
>
> Nope, not entirely.  But they have to be seeded once.  If they
> have storage which survives reboots, and entropy is saved and
> restored on reboot, they will be ~fine.

BTW, to me reusing the same entropy on every reboot seems less secure.

> Systems without persistent storage and also without RDRAND/RDSEED
> will however be ... a more challenging problem.

Leaving things like that would be totally silly.

With my patch the old way of gathering entropy from devices works just
fine as it always did, albeit with the second patch it does require a
tiny bit of extra configuration.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpgeBbtqrqWg.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 05 Apr 2021 00:07:49 +0200 (CEST), Havard Eidnes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Indeed, that's also compatible with what I wrote.  The samples
> from whatever sources you have are still being mixed into the
> pool, but they are not being counted as contributing to the
> entropy estimate, because the quality of the samples is at best
> unknown.

Perhaps we're talking past each other?

Until I made the fix, no amount of time or activity, nor of my telling
the system to make use of the driver inputs, would unblock getrandom(2)
or /dev/random, so it doesn't really matter whether anything was being
"mixed into the pool", so to speak, as the pool remained empty.

> A possible workaround is, once you have some uptime and some bits
> mixed into the pool, you can do:

I don't need a work-around -- I found a fix.  I corrected some code that
was purposefully ignoring my orders for how it should behave.

> I am still of the fairly firm beleif that the mistrust in the
> hardware vendors' ability to make a reasonable and robust
> implementation is without foundation.

Well there are still millions of systems out there without the fancy
newer hardware RNGs available to make them more secure than Fort Knox.
At least a small handful of them run NetBSD for me, and I want them to
work for my needs, and I was, and am, quite happy with using entropy that
can be collected from various devices that my systems (virtual and real)
actually have.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpw8NF4N8YCU.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 16:39:11 -0400 (EDT), Mouse  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > No amount of uptime and activity was increasing the entropy in my
> > system before I patched it.
>
> As I understand it, entropy was being contributed.  What wasn't
> happening was the random driver code recognizing and acknowledging that
> entropy, because it had no way to tell how much of it there really was.

Clearly there was no entropy being contributed in any way, shape, or form.

It wasn't the driver code at fault.

It was the code I fixed with my patch that was at fault.

I told the system to "count" the entropy being gathered by the
appropriate driver(s), but it was being ignored entirely.

After my fix the system behaved as I told it to.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpKRv3dDs3Kt.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 04 Apr 2021 21:14:31 +0200 (CEST), Havard Eidnes  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Do note, the existing randomness sources are still being sampled and
> mixed into the pool, so even if the starting state from the saved
> entropy may be known (by violating the security of the storage),
> it's still not possible to predict the complete stream of randomness
> data once the system has seen a bit of uptime (given that there are
> actual other sources of (unverified) entropy which aren't all of too
> low quality).

No amount of uptime and activity was increasing the entropy in my system
before I patched it.  /dev/random remained blocked after days of busy
system activity.  I would argue that most, if not all, of the sources of
entropy identified by rndctl(8) on my systems are high-quality and
secure sources in my circumstances and for my uses.

Perhaps the unpatched implementation isn't doing exactly what you think
it is?

The unpatched implementation completely and entirely prevents the system
from ever using any of those sources, despite showing that they are
enabled for use.

> However, in the new scheme of things, because most of the
> traditional sources have unknown quality, and we have no reliable
> method to estimate how much "actual entropy" those sources
> provide, they no longer count towards the *estimate* of what is
> now a lower bound on the "real" entropy available in the pool.

It really doesn't matter what can be determined in general and from a
distance.

What matters is what a given administrator can determine in particular
for a given application in a given circumstance.

Before my patch the system was not behaving as documented and could not
be made to behave as the documentation said it could be made to behave.

With my patch I can choose which to trust from amongst the available
sources.  Without that patch my choices are ignored and the system lies
to me about using my choices.  I would argue my patch fixes a critical
bug.

> Besides, the implementation has been thoroughly vetted.  E.g. the
> reference [7] from the wikipedia article states in the conclusion on
> page 20
>
>Overall, the Ivy Bridge RNG is a robust design with a large
>margin of safety that ensures good random data is generated even
>if the Entropy Source is not operating as well as predicted.

"design" != implementation

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpjs3QaPXmot.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 04 Apr 2021 23:47:10 +0700, Robert Elz  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If we want really good security, I'd submit we need to disable
> the random seed file, and RDRAND (and anything similar) until we
> have proof that they're perfect.

Indeed, I concur.

I trust the randomness and unobservability and isolation of the
behaviour of my system's fans far more than I would trust Intel's RDRAND
or RDSEED instructions.

I even trust the randomness of the timings of the virtual disks in my
Xen domU virtual machines more, even with multiple sibling guests, and
even if some of those other guests can be influenced by untrusted third
parties at critical times.

> Personally, I'm happy with anything that your average high school
> student is unlikely to be able to crack in an hour.   I don't run
> a bank, or a military installation, and I'm not the NSA.   If someone
> is prepared to put in the effort required to break into my systems,
> then let them, it isn't worth the cost to prevent that tiny chance.
> That's the same way that my house has ordinary locks - I'm sure they
> can be picked by someone who knows what they're doing, and better security
> is available, at a price, but a nice happy medium is what fits me best.

Indeed again.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpvuqMttwSyI.pgp
Description: OpenPGP Digital Signature


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 09:49:58 +, Taylor R Campbell  
wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > Date: Sat, 03 Apr 2021 12:24:29 -0700
> > From: "Greg A. Woods" 
> >
> > Updating a system, even on -current, shouldn't create a long-lived
> > situation where the system documentation and the behaviour and actions
> > of system commands is completely out of sync with the behaviour of the
> > kernel, and in fact lies to the administrator about the abilities of the
> > system.
>
> It would help if you could identify specifically what you are calling
> a lie.
>
> > @@ -1754,21 +1766,21 @@
> >  rnd_add_uint32(struct krndsource *rs, uint32_t value)
> >  {
> >
> > -   rnd_add_data(rs, &value, sizeof value, 0);
> > +   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
> >  }
>
> The rnd_add_uint32 function is used by drivers to feed in data from
> sources _with no known model for their entropy_.

Indeed -- that's the idea.

> It's how drivers
> toss in data that might be helpful but might totally predictable, and
> the driver has no way to know.

Yeah, so?  They don't need to know this.  I'm not actually asking random
drivers to decide the amount of physical entropy they can collect.
That is controlled elsewhere.

> Your change _creates_ the lie that every bit of data entered this way
> is drawn from a source with independent uniform distribution.

No, my change _allows_ the administrator to decide which devices can be
used as estimating/counting entropy sources.  For example I know that
many of the devices on almost all of my machines (virtual or otherwise)
are equally good sources of entropy for their uses.

An additional change, one which I would also find totally acceptable,
would be to disable the current default of allowing "estimation" on
devices which are not true hardware RNGs.  I.e. maybe this simple change
would suffice (though I haven't checked beyond a quick grep to see that
this flag is the most commonly used one -- perhaps some real RNG
devices could also be changed to use explicit flags to enable estimation
by default; see the sketch after the patch):

--- sys/sys/rndio.h.~1.2.~  2016-07-23 14:36:45.0 -0700
+++ sys/sys/rndio.h 2021-04-04 12:39:15.609936311 -0700
@@ -91,8 +91,7 @@
 #define RND_FLAG_ESTIMATE_TIME 0x4000  /* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE        0x8000  /* estimate entropy on value */
 #define RND_FLAG_HASENABLE     0x0001  /* has enable/disable fns */
-#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME)

 #define RND_TYPE_UNKNOWN        0   /* unknown source */
 #define RND_TYPE_DISK           1   /* source is physical disk */


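As a concrete illustration of the parenthetical above about real RNG
devices opting in explicitly, a hypothetical sketch of what such a
driver's attach code might then look like (sc and sc_rndsource are
placeholder names, not any particular driver's):

    /* a true hardware RNG asking for value estimation explicitly */
    rnd_attach_source(&sc->sc_rndsource, device_xname(self),
        RND_TYPE_RNG, RND_FLAG_COLLECT_VALUE|RND_FLAG_ESTIMATE_VALUE);
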
There are a vast number of ways this re-tooling of entropy collection
could have been done better.

I'm asking for discussion on what amount to some VERY simple changes
which completely and totally solve many real-world uses of this code
while at the same time not just allowing, but defaulting to, the very
strict and secure operation for special situations.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpXj_p1tBVqr.pgp
Description: OpenPGP Digital Signature


regarding the changes to kernel entropy gathering

2021-04-03 Thread Greg A. Woods
So, I'm not sure what to say here.

I'm very surprised, quite confused, more than a little perturbed, and
even somewhat angry.  It's taken me quite some time to write this.

Now temper this with knowing that I do know I'm running -current, not a
release, and that I accept the challenges this might cause (thus see the
patch below).

Updating a system, even on -current, shouldn't cause what I can only
describe as _intentional_ breakage, even for matters so important as
system security and integrity, and especially not without clear mention
UPDATING, and perhaps also with documented and referenced tools to
assist in undoing said breakage.

Updating a system, even on -current, shouldn't create a long-lived
situation where the system documentation and the behaviour and actions
of system commands is completely out of sync with the behaviour of the
kernel, and in fact lies to the administrator about the abilities of the
system.

In any case, the following patch (and in particular the last hunk) fixes
all my problems and complaints in this domain.  It is fully tested, and
it works A-OK with Xen in both domU and dom0 kernels.  My systems once
again have consistent documentation, and tools that don't lie, and are
able to function as before w.r.t. matters related to /dev/random and
getrandom(2).

Now I'm not proposing this as the final solution -- I think there's some
middle ground to be found, but at least this gets things back to working.


--- sys/kern/kern_entropy.c.~1.30.~ 2021-03-07 17:23:05.0 -0800
+++ sys/kern/kern_entropy.c 2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@

/* Wait for some entropy to come in and try again.  */
KASSERT(E->stage >= ENTROPY_WARM);
-   printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+   printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
   curproc->p_pid, curproc->p_comm);

if (ISSET(flags, ENTROPY_SIG)) {
@@ -1577,6 +1577,16 @@
KASSERT(i == __arraycount(extra));
entropy_enter(extra, sizeof extra, 0);
explicit_memset(extra, 0, sizeof extra);
+
+   aprint_verbose("entropy: %s attached as an entropy source (", rs->name);
+   if (!(flags & RND_FLAG_NO_COLLECT)) {
+   printf("collecting");
+   if (flags & RND_FLAG_NO_ESTIMATE)
+   printf(" without estimation");
+   }
+   else
+   printf("off");
+   printf(")\n");
 }

 /*
@@ -1610,6 +1620,8 @@

/* Free the per-CPU data.  */
percpu_free(rs->state, sizeof(struct rndsource_cpu));
+
+   aprint_verbose("entropy: %s detached as an entropy source\n", rs->name);
 }

 /*
@@ -1754,21 +1766,21 @@
 rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-   rnd_add_data(rs, &value, sizeof value, 0);
+   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-   rnd_add_data(rs, &value, sizeof value, 0);
+   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint64(struct krndsource *rs, uint64_t value)
 {

-   rnd_add_data(rs, &value, sizeof value, 0);
+   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 /*

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgp9RYQDmBbzG.pgp
Description: OpenPGP Digital Signature


Re: UVM behavior under memory pressure

2021-04-01 Thread Greg A. Woods
At Thu, 1 Apr 2021 23:15:42 +0200, Manuel Bouyer  wrote:
Subject: Re: UVM behavior under memory pressure
>
> Yes, I understand this. But, in an emergency situation like this one (there
> is no free ram, swap is full, openscad eventually gets killed),
> I would expect the pager to reclaim pages where it can;
> like file cache (down to vm.filemin, I agree it shouldn't go down to 0).
>
> In my case, vm.anonmax is at 80%, and I suspect it was not reached
> (I tried to increase it to 90% but this didn't change anything).

As I understand things there's no point in increasing any vm.*max value
unless it is already way too low, you want more memory to be used for
that category, and there's not already more use in other categories
(i.e. where a competing vm.*max value is too high).

It is the vm.*min value for the desired category that isn't high enough
to allow that category to claim more pages from the less desired
categories.

I.e. if vm.anonmin is too low (and I believe the default of 10% is way
too low), then when file I/O gets busy for whatever reason (and with the
default rather high vm.filemax value), large processes _will_ get
partially paged out, as only 10% of their memory will be kept activated.
Simultaneously decreasing vm.filemax and increasing vm.anonmin should
guarantee more memory can be dedicated to processes needing it as
opposed to allowing file caching to take over.

I think in general the vm.*max limits (except maybe vm.filemax) are only
really interesting on very small memory systems and/or on systems with
very specific types of uses which might demand more pages of one
category or the other.  The default vm.filemax value on the other hand
may be too high for systems that don't _constantly_ do a lot of file I/O
_and_ access many of the same files more than once.

So if you regularly run large processes that don't necessarily do a
whole lot of file I/O then you want to reduce vm.filemax, perhaps quite
a lot, maybe even to just being barely above vm.filemin; and of course
you want to increase vm.anonmin.  One early guide suggested (with my
comments):

vm.execmin=2# this is too low if your progs are huge code
vm.execmax=4# but this should probably be as much as 20
vm.filemin=0
vm.filemax=1# too low for compiling, web serving, etc.
vm.anonmin=70
vm.anonmax=95

Note that increasing vm.anonmin won't dedicate memory to anon pages if
they're not currently needed of course, but it will guarantee at least
that much memory will be made available, and kept available, when and if
pressure for anon pages increases.

So all of these limits are not "hard limits", nor are they dedicated
allocations per-se.  A given category can use more pages than its max
limit, at least until some other category experiences pressure,
i.e. until the page daemon is woken.

(Just keep in mind that one cannot currently exceed 95% as the sum of
the lower (vm.*min) limits.  The total of the upper (vm.*max) limits can
be more than 100%, but there are caveats to such a state.)

Also if you have a really large memory machine and you don't have
processes that wander through huge numbers of files, then you might also
want to lower vm.bufcache so that it's not wasted.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgp6P_y72diVe.pgp
Description: OpenPGP Digital Signature


Re: UVM behavior under memory pressure

2021-04-01 Thread Greg A. Woods
At Thu, 1 Apr 2021 21:03:37 +0200, Manuel Bouyer  wrote:
Subject: UVM behavior under memory pressure
>
> Of course the system is very slow
> Shouldn't UVM choose, in this case, to reclaim pages from the file cache
> for the process data ?
> I'm using the default vm.* sysctl values.

I almost never use the default vm.* values.

I would guess the main problem for your system's memory requirements, at
the time you showed it, is that the default for vm.anonmin is way too
low and so raising vm.anonmin might help.  If vm.anonmin isn't high
enough then the pager won't sacrifice other requirements already in play
for anon pages.

Lowering vm.filemax (and maybe also vm.filemin) might also help since
your system, at that time, appeared to be doing far less I/O on large
numbers of files than, say, a web server or a compile server might be
doing.  However with almost 3G dedicated to the file cache it would seem
your system did recently trawl through a lot of file data, and so with a
lower vm.filemax less of it would have been kept as pressure for other
types of memory increased.

Here are the values I use, with comments about why, from my default
/etc/sysctl.conf.  These have worked reasonably well for me for years,
though I did have a virtual machine struggle to do some builds when I
ran too many make jobs in parallel and then a gargantuan compiler job
came along and needed too much memory.  However there was enough swap
and eventually it thrashed its way through, and more importantly I was
still able to run commands, albeit slowly, and my one large interactive
process (emacs) sometimes took quite a while to wake up and respond.

# N.B.:  On a live system make sure to order changes to these values so that you
# always lower any values from their default first, and then raise any that are
# to be raised above their defaults.  This way, the sum of the minimums will
# stay within the 95% limit.

# the minimum percentage of memory always (made) available for the
# file data cache
#
# The default is 10, which is much too high, even for a large-memory
# system...
#
vm.filemin=5

# the maximum percentage of memory that will be reclaimed from other uses for
# file data cache
#
# The default is 50, which may be too high for small-memory systems but may be
# about right for large-memory systems...
#
#vm.filemax=25

# the minimum percentage of memory always (made) available for anonymous pages
#
# The default is 10, which is way too low...
#
vm.anonmin=40

# the maximum percentage of memory that will be reclaimed from other uses for
# anonymous pages
#
# The default is 80, which seems just about right, but then again it's unlikely
# that the majority of inactive anonymous pages will ever be reactivated so
# maybe this should be lowered?
#
#vm.anonmax=80

# the minimum percentage of memory always (made) available for text pages
#
# The default is 5, which may be far too low on small-RAM systems...
#
vm.execmin=20

# the maximum percentage of memory that will be reclaimed from other uses for
# text pages
#
# The default is 30, which may be too low, esp. for big programs on small-memory
# systems...
#
vm.execmax=40

# It may also be useful to set the bufmem high-water limit to a number which may
# actually be less than 5% (vm.bufcache / options BUFCACHE) on large-memory
# systems (as BUFCACHE cannot be set below 5%).
#
# note this value is given in bytes.
#
#vm.bufmem_hiwater=


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpSINIeXL6Sx.pgp
Description: OpenPGP Digital Signature


Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-03-31 Thread Greg A. Woods
At Thu, 1 Apr 2021 04:13:59 + (UTC), RVP  wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  or dom0!!!
>
> Does this /etc/entropy-file match what's there in your /boot.cfg?
>
> On my laptop $random_file is left at the default which is:
> /var/db/entropy-file

Yes I did change that as well (as /var isn't part of the root partition).

However that's not the problem for the dom0.

"rndseed" isn't currently used (at least not by me or any documentation
I'm aware of) when loading (multibooting) a Xen kernel and a NetBSD dom0
kernel.

/etc/rc.d/random_seed will do this (again) later anyway.

However, as I showed, the hardware doesn't seem to be providing
entropy that can be "counted" ("estimated"), so there's nothing to save,
and so nothing to load on the next boot either.

I know how to seed it -- but that's not the problem -- the hardware
should be providing plenty of entropy.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpfPcjeu55q3.pgp
Description: OpenPGP Digital Signature


Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-03-31 Thread Greg A. Woods
Intel"; CPUID level 11

Intel-specific functions:
Version 000206c2:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 12 -
Stepping 2
Reserved 8

Extended brand string: "Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 34
Hyper threading siblings: 32

Feature flags 1fc9cbf5:
FPUFloating Point Unit
DE Debugging Extensions
TSCTime Stamp Counter
MSRModel Specific Registers
PAEPhysical Address Extension
MCEMachine Check Exception
CX8COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEPFast System Call
MCAMachine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
CLFSH  CFLUSH instruction
ACPI   Thermal Monitor and Clock Ctrl
MMXMMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSEStreaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS Self Snoop
HT Hyper Threading

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b2: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06C2----


I noted today though that entropy doesn't seem to be accumulating even
in the dom0 despite there being many useful sources configured to both
collect and "estimate" _and_ despite the fact there's a valid-looking
$random_file that was saved and reloaded by /etc/rc.d/random_seed (and
saved again every day by /etc/security):

# /etc/rc.d/random_seed rcvar
# random_seed
random_seed=YES
# ls -l /etc/entropy-file
-rw---  1 root  wheel  536 Mar 31 04:15 /etc/entropy-file
# rndctl -l
Source Bits Type  Flags
ipmi0-Temp0 env  estimate, collect, v, t, dv, dt
ipmi0-Temp1   0 env  estimate, collect, v, t, dv, dt
ipmi0-Temp2   0 env  estimate, collect, v, t, dv, dt
ipmi0-Temp3   0 env  estimate, collect, v, t, dv, dt
ipmi0-Ambient-T   0 env  estimate, collect, v, t, dv, dt
ipmi0-Planar-Te   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4   0 env  estimate, collect, v, t, dv, dt
ipmi0-Status  0 ???  estimate, collect, t, dt
ipmi0-Voltage 0 power estimate, collect, v, t, dv, dt
ipmi0-Voltage10 power estimate, collect, v, t, dv, dt
ipmi0-Status1 0 ???  estimate, collect, t, dt
ipmi0-Intrusion   0 ???  estimate, collect, t, dt
ipmi0-Temp4   0 env  estimate, collect, v, t, dv, dt
ipmi0-Temp5   0 env  estimate, collect, v, t, dv, dt
ipmi0-Temp6   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5   0 env  estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5   0 env  estimate, collect, v, t, dv, dt
ipmi0-Ambient-T   0 env  estimate, collect, v, t, dv, dt
ipmi0-Ambient-T   0 env  estimate, collect, v, t, dv, dt
ums0  0 tty  estimate, collect, v, t, dt
ukbd0 0 tty  estimate, collect, v, t, dt
/dev/random   0 ???  estimate, collect, v
sd2   0 disk estimate, collect, v, t, dt
sd1   0 disk estimate, collect, v, t, dt
sd0   0 disk estimate, collect, v, t, dt
cpu0  0 vm   estimate, collect, v, t, dv
hardclock 0 skew estimate, collect, t
pckbd00 tty  estimate, collect, v, t, dt
system-power  0 power estimate, collect, v, t, dt
autoconf  0 ???  estimate, collect, t
seed  0 ???  estimate, collect, v
# sysctl kern.entropy
kern.entropy.collection = 1
kern.entropy.depletion = 0
kern.entropy.consolidate = -23552
kern.entropy.gather = -23552
kern.entropy.needed = 256
kern.entropy.pending = 0
kern.entropy.epoch = 19

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpZNU3eXL60M.pgp
Description: OpenPGP Digital Signature


Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
[[ sorry I've not been catching up on mailing list discussions as fast
as I had hoped to, and I'm way behind on following the entropy rototill. ]]

At Wed, 31 Mar 2021 00:12:31 +, Taylor R Campbell  
wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  (causing python3.7 
rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> This is false.  If the VM host provided a viornd(4) device then NetBSD
> would automatically collect, and count, entropy from the host, with no
> manual intervention.

I'll leave that idea to others more up-to-date on Xen PV drivers to
respond to.  Booting a -current GENERIC kernel (which has both Xen PV
and virtio(4) devices configured into it) in a "type='pvh'" domU only
attaches the xenbus PV devices, no virtio devices, so adding virtio
may be a much bigger task that will need further support in at least
the backend, and perhaps in the front-end too, especially to do it
without QEMU.  I haven't checked whether virtio devices show up in an
HVM domU, precisely because I'm trying to avoid having to run and rely
on QEMU (never mind any performance implications of HVM).

> > Finally, if the system isn't actually collecting entropy from a device,
> > then why the heck does it allow me to think it is (i.e. by allowing me
> > to enable it and show it as enabled and collecting via "rndctl -l")?
>
> The system does collect samples from all those devices.  However, they
> are not designed to be unpredictable and there is no good reliable
> model for just how unpredictable they are, so the system doesn't
> _count_ anything from them.  See https://man.NetBSD.org/entropy.4 for
> a high-level overview.

I'm not sure the word "count" appears in entropy(4) in any context where
I can make sense of it w.r.t. what it means to "collect" but not "count"
entropy from those devices.

Worse, the "Flags" shown by "rndctl -l" don't seem to be directly
documented (i.e. they're not described in rndctl(8)), and even on a
kernel running on real hardware I don't see the word "count" showing up
there.

After looking at the source I'm not sure the descriptions of the
RND_FLAG_* values in rnd(4) help me much either.

Based on my vague understanding of all of this, perhaps you meant to say
"estimate", instead of "count"?  That would make more sense in the
context of what I read in rnd(4) and rndctl(8), though "estimate" still
seems a little vague in meaning to me.

In any case, I don't see why an xbd disk, or a xennet interface, can't
be treated exactly as if they were real hardware (i.e. in terms of
extracting entropy from their behaviour).  This is exactly what
virtualization is all about to me -- even for paravirtualization.  After
all in a threat-free world (i.e. specifically where I also trust other
domUs) their entropy is going to reflect (though maybe not exactly
mirror) the entropy of the underlying hardware and/or network traffic.
So (but maybe not by default) if I as the admin want to trust the
entropy available from an xbd(4) or xennet(4) device, then I should be
able to enable it with rndctl(8) and have it "count".
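
For concreteness, this is roughly the shape of what a driver does to
register such a source with rnd(9) -- a minimal sketch written from the
rnd(9) man page, where the softc fragment and the helper names are made
up for illustration; whether samples fed this way actually get counted
is of course exactly the question at hand:

#include <sys/param.h>
#include <sys/device.h>
#include <sys/rndsource.h>

struct xbd_softc_sketch {               /* hypothetical softc fragment */
        krndsource_t    sc_rndsource;
};

static void
xbd_rnd_attach_sketch(struct xbd_softc_sketch *sc, device_t dev)
{
        /* Register the virtual disk as an entropy source; the flags
         * control whether its samples are merely collected or also
         * counted/estimated. */
        rnd_attach_source(&sc->sc_rndsource, device_xname(dev),
            RND_TYPE_DISK, RND_FLAG_DEFAULT);
}

static void
xbd_rnd_sample_sketch(struct xbd_softc_sketch *sc, uint32_t blkno)
{
        /* Feed in a value (the block number); the time of the call is
         * sampled as well. */
        rnd_add_uint32(&sc->sc_rndsource, blkno);
}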

More importantly though the system shouldn't mislead me into thinking it
is "counting" entropy from a device when it is actually not.  If I had
seen that there were no sources estimating/counting/whatever entropy,
and I tried to enable one and was given a nice error message about this
not being possible, then I would have looked elsewhere to find out how
to give the system more bits of entropy.  As it is, in my Xen domU system
the output of "rndctl -l" leads me to believe all of my devices are
collecting both timing and value samples, and using either one or the
other to gather entropy (though with '-v' I don't see that any bits of
entropy have been added from any of those many millions of collected
samples).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpHGwjWgu37A.pgp
Description: OpenPGP Digital Signature


Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
At Tue, 30 Mar 2021 23:53:43 +0200, Manuel Bouyer  
wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  (causing python3.7 
rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > [...]
> >
> > Perhaps the answer is that nothing seems to be contributing anything to
> > the entropy pool.  No matter what device I exercise, none of the numbers
> > in the following changes:
>
> yes, it's been this way since the rnd rototill. Virtual devices are
> not trusted.
>
> The only way is to manually seed the pool.

Ah, so that is definitely not what I expected!

Previously wasn't it up to the local admin to decide what to trust?  I
guess throwing bits into /dev/random is one way to play that game, but I
have to trust the dom0 implicitly and utterly anyway, so why not trust
the devices it presents?

This is especially true for xbd block devices.  All my blocks are belong
to dom0.

The network device is in effect no different than if it were real
hardware, so if I want to trust network traffic, then I should be able
to enable it, just as I could if it were real hardware.

The CPUs are also probably the least "virtual" things in Xen, so why not
trust them?  (Though I'm not sure I understand what entropy they can
offer in the first place.)

Finally, if the system isn't actually collecting entropy from a device,
then why the heck does it allow me to think it is (i.e. by allowing me
to enable it and show it as enabled and collecting via "rndctl -l")?

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpPeLoehMD2G.pgp
Description: OpenPGP Digital Signature


nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
ue to lack of entropy
[ 563844.834413] entropy: pid 7903 (python) blocking due to lack of entropy
[ 566365.511377] entropy: pid 9001 (python) blocking due to lack of entropy
[ 577473.897830] entropy: pid 9350 (python) blocking due to lack of entropy
[ 579179.381600] entropy: pid 25728 (od) blocking due to lack of entropy
[ 579186.994440] entropy: pid 11107 (cat) blocking due to lack of entropy
[ 579202.264290] entropy: pid 7248 (cat) blocking due to lack of entropy
[ 579669.831978] entropy: ready


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


At Tue, 30 Mar 2021 10:06:19 -0700, "Greg A. Woods"  wrote:
Subject: python3.7 rebuild stuck in kernel in "entropy" during an "import" 
statement
>
> So I've been running a pkg-rolling_replace and one of the packages being
> rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
> wait in the kernel, and it's been in this state for over 24hrs as you
> can see.
>
> The only things the process has open appear to be its stdio descriptors,
> two of which are are open on the log file I was directing all output to.
>
> This is on a Xen domU of a machine running:
>
> $ uname -a
> NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 
> PDT 2021  
> woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0
>  amd64
>
>
> 09:51 [504] $ ps -lwwp 19875
> UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN   STAT TTY  TIME COMMAND
>   0 19875 11551   0  85  0 55412 11324 entropy Ipts/0 0:00.27 ./python -E 
> -Wi 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
>  -d /usr/pkg/lib/python3.7 -f -x 
> bad_coding|badsyntax|site-packages|lib2to3/tests/data 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
> 09:51 [505] $ ps -uwwp 19875
> USER   PID %CPU %MEM   VSZ   RSS TTY   STAT STARTEDTIME COMMAND
> root 19875  0.0  0.1 55412 11324 pts/0 I 9:09PM 0:00.27 ./python -E -Wi 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
>  -d /usr/pkg/lib/python3.7 -f -x 
> bad_coding|badsyntax|site-packages|lib2to3/tests/data 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
> 09:51 [506] $ fstat -p 19875
> USER CMD  PID   FD  MOUNT INUM MODE SZ|DV R/W
> root python 19875   wd  /build10645634 drwxr-xr-x1024 r
> root python 198750  /dev/pts 3 crw---   pts/0 rw
> root python 198751  /build 3721223 -rw-r--r--  28287492 w
> root python 198752  /build 3721223 -rw-r--r--  28287492 w
> 09:51 [507] $ find /build -inum 3721223
> /build/packages/root/pkg_roll.out
> 09:51 [508] $
>
>
> It was killable -- I sent SIGINT from the tty and it died as expected.
>
>
> Running "make replace" gets it stuck in the same place again, an the
> SIGINT shows the following stack trace:
>
> PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
>   LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1  
> ./python -E -Wi 
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
>   -d /usr/pkg/lib/python3.7 -f  -x 
> 'bad_coding|badsyntax|site-packages|lib2to3/tests/data'  
> /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
> ^T
> [ 563859.5589422] load: 0.39  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
> make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
> make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> ^T
> [ 563866.4606073] load: 0.36  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
> make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
> make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
> ^?Traceback (most recent call last):
>   File 
> "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py",
>  line 20, in 
> from concurrent.futures import ProcessPoolExecutor
>   File "", line 1032, in _handle_fromlist
>   File 
> "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py",
>  line 43, in __getattr__
> from .process import ProcessPoolExecutor as pe
>   File 
> "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py",
>  line 53, i

python3.7 rebuild stuck in kernel in "entropy" during an "import" statement

2021-03-30 Thread Greg A. Woods
So I've been running a pkg_rolling-replace and one of the packages being
rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
wait in the kernel, and it's been in this state for over 24hrs as you
can see.

The only things the process has open appear to be its stdio descriptors,
two of which are open on the log file I was directing all output to.

This is on a Xen domU of a machine running:

$ uname -a
NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 PDT 
2021  
woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0
 amd64


09:51 [504] $ ps -lwwp 19875
UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN   STAT TTY  TIME COMMAND
  0 19875 11551   0  85  0 55412 11324 entropy Ipts/0 0:00.27 ./python -E 
-Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
 -d /usr/pkg/lib/python3.7 -f -x 
bad_coding|badsyntax|site-packages|lib2to3/tests/data 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [505] $ ps -uwwp 19875
USER   PID %CPU %MEM   VSZ   RSS TTY   STAT STARTEDTIME COMMAND
root 19875  0.0  0.1 55412 11324 pts/0 I 9:09PM 0:00.27 ./python -E -Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
 -d /usr/pkg/lib/python3.7 -f -x 
bad_coding|badsyntax|site-packages|lib2to3/tests/data 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [506] $ fstat -p 19875
USER CMD  PID   FD  MOUNT INUM MODE SZ|DV R/W
root python 19875   wd  /build10645634 drwxr-xr-x1024 r
root python 198750  /dev/pts 3 crw---   pts/0 rw
root python 198751  /build 3721223 -rw-r--r--  28287492 w
root python 198752  /build 3721223 -rw-r--r--  28287492 w
09:51 [507] $ find /build -inum 3721223
/build/packages/root/pkg_roll.out
09:51 [508] $


It was killable -- I sent SIGINT from the tty and it died as expected.


Running "make replace" gets it stuck in the same place again, an the
SIGINT shows the following stack trace:

PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
  LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1  
./python -E -Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
  -d /usr/pkg/lib/python3.7 -f  -x 
'bad_coding|badsyntax|site-packages|lib2to3/tests/data'  
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
^T
[ 563859.5589422] load: 0.39  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^T
[ 563866.4606073] load: 0.36  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^?Traceback (most recent call last):
  File 
"/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py",
 line 20, in 
from concurrent.futures import ProcessPoolExecutor
  File "", line 1032, in _handle_fromlist
  File 
"/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py",
 line 43, in __getattr__
from .process import ProcessPoolExecutor as pe
  File 
"/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py",
 line 53, in 
import multiprocessing as mp
  File 
"/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/__init__.py",
 line 16, in 
from . import context
  File 
"/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/context.py",
 line 5, in 
from . import process
  File 
"/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py",
 line 363, in 
_current_process = _MainProcess()
  File 
"/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py",
 line 347, in __init__
self._config = {'authkey': AuthenticationString(os.urandom(32)),
KeyboardInterrupt
*** Error code 1 (ignored)
*** Signal 2
*** Signal 2



--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpMapUqkjr1L.pgp
Description: OpenPGP Digital Signature


Re: style change: explicitly permit braces for single statements

2020-07-13 Thread Greg A. Woods
At Mon, 13 Jul 2020 09:48:07 -0400 (EDT), Mouse  
wrote:
Subject: Re: style change: explicitly permit braces for single statements
>
> Slavishly always
> adding them makes it difficult to keep code from walking into the right
> margin:

These days one really should consider the right margin to be a virtual
concept -- there's really no valid reason not to have and use horizontal
scrolling (any code editor I'll ever use can do it on any display), and
even most any small-ish laptop can have a nice readable font at 50x132,
or even 50x160.  (i.e. that's another style guide rule that should die)

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpezCK6Ft2EX.pgp
Description: OpenPGP Digital Signature


Re: style change: explicitly permit braces for single statements

2020-07-12 Thread Greg A. Woods
At Sun, 12 Jul 2020 10:01:36 +1000, Luke Mewburn  wrote:
Subject: style change: explicitly permit braces for single statements
>
> I propose that the NetBSD C style guide in to /usr/share/misc/style
> is reworded to more explicitly permit braces around single statements,
> instead of the current discourgement.
>
> IMHO, permitting braces to be consistently used:
> - Adds to clarity of intent.
> - Aids code review.
> - Avoids gotofail: 
> https://en.wikipedia.org/wiki/Unreachable_code#goto_fail_bug

Well, if you s/permit/require/g, I strongly concur (with possibly one
tiny exception allowed in rare cases -- when there's no newline).

Personally I don't think there's any good excuse for not always putting
braces around all single-statement blocks.  The only bad excuse is that
the language doesn't strictly require them.  People are lazy, I get that
(I am too), but in my opinion C is just not really safe without them.
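
To make the point concrete, here is a tiny self-contained toy in the
shape of the famous gotofail bug -- the step functions are made up, and
the duplicated goto stands in for the accidentally pasted line:

#include <stdio.h>

static int step_ok(void)  { return 0; }         /* stand-in: succeeds */
static int step_bad(void) { return -1; }        /* stand-in: fails */

static int
check_unbraced(void)
{
        int err;

        if ((err = step_ok()) != 0)
                goto fail;
                goto fail;      /* pasted line: not part of the if, always runs */
        err = step_bad();       /* now dead code -- the failure is never seen */
fail:
        return err;             /* returns 0: "success" */
}

static int
check_braced(void)
{
        int err;

        if ((err = step_ok()) != 0) {
                goto fail;
                goto fail;      /* same careless paste, but harmless in here */
        }
        err = step_bad();
fail:
        return err;             /* correctly returns -1 */
}

int
main(void)
{
        printf("unbraced: %d  braced: %d\n", check_unbraced(), check_braced());
        return 0;
}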

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgplEU3KR6NQt.pgp
Description: OpenPGP Digital Signature


USB storage transfers halt when usbdevs is run: hardware bug or software bug?

2020-07-05 Thread Greg A. Woods
USB storage device transfers freeze when usbdevs is run:  hardware bug
or software bug?

While I was doing a "gzcat < *.gz > /dev/rsd2d", where sd2 was a USB
memory stick, I happened to run "usbdevs -dv" and the writes to the USB
device froze, and indeed the writing process was stuck in the kernel (I
couldn't even stop it with ^Z).

Luckily yanking the stick out seemed to unfreeze and kill the process
and clean everything up nicely and I was able to re-insert it and re-do
the write to it without incident.

This is on an amd64 server running 9.99.64.

Upon removal and subsequent re-insertion the kernel said the following
(but was silent before this when usbdevs ran):

[ 193334.306434] umass0: BBB reset failed, IOERROR
[ 193334.306434] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.318288] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.318288] umass0: BBB reset failed, IOERROR
[ 193334.329223] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.329223] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.341024] umass0: BBB reset failed, IOERROR
[ 193334.341024] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.351781] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.357775] sd2d: error writing fsbn 4053632 of 4053632-4053759 (sd2 bn 
4053632; cn 4021 tn 7 sn 23)
[ 193334.366963] umass0: BBB reset failed, IOERROR
[ 193334.366963] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.378283] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.378283] umass0: BBB reset failed, IOERROR
[ 193334.389225] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.389225] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.401026] umass0: BBB reset failed, IOERROR
[ 193334.401026] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.411782] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.417780] umass0: BBB reset failed, IOERROR
[ 193334.417780] sd2(umass0:0:0:0): generic HBA error
[ 193334.426444] sd2: detached
[ 193334.426444] scsibus1: detached
[ 193334.426444] umass0: detached
[ 193334.436445] umass0: at uhub6 port 2 (addr 5) disconnected

reinsertion:

[ 193341.516925] umass0 at uhub6 port 2 configuration 1 interface 0
[ 193341.516925] umass0: SMI Corporation (0x090c) USB DISK (0x1000), rev 
2.00/11.00, addr 5
[ 193341.526926] umass0: using SCSI over Bulk-Only
[ 193341.526926] scsibus1 at umass0: 2 targets, 1 lun per target
[ 193342.366983] sd2 at scsibus1 target 0 lun 0:  disk 
removable
[ 193342.376985] sd2: 7712 MB, 15744 cyl, 16 head, 63 sec, 512 bytes/sect x 
15794176 sectors
[ 193342.386986] sd2: GPT GUID: d1e3490c-b0e6-42e9-9d9e-3ac286a0f7e0
[ 193342.396989] dk6 at sd2: "EFI system", 262144 blocks at 2048, type: msdos
[ 193342.396989] dk7 at sd2: "d3aa0396-d911-4aac-baa8-f2478557d31a", 7544832 
blocks at 264192, type: ffs


I'm guessing it's a software bug with bad locking order somewhere.
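
Purely by way of illustration (I have *not* gone and looked at where
umass(4) and the usbdevs(8) ioctl path actually take their locks), the
classic shape of that kind of bug, in mutex(9) terms, is two code paths
taking the same pair of locks in opposite order:

#include <sys/param.h>
#include <sys/mutex.h>

static kmutex_t bus_lock;       /* hypothetical: protects the bus/device list */
static kmutex_t xfer_lock;      /* hypothetical: protects in-flight transfers */

static void
locks_init_sketch(void)
{
        mutex_init(&bus_lock, MUTEX_DEFAULT, IPL_NONE);
        mutex_init(&xfer_lock, MUTEX_DEFAULT, IPL_NONE);
}

/* Path A (the writer): holds the transfer lock, then wants the bus. */
static void
transfer_path_sketch(void)
{
        mutex_enter(&xfer_lock);
        mutex_enter(&bus_lock);
        /* ... move data ... */
        mutex_exit(&bus_lock);
        mutex_exit(&xfer_lock);
}

/* Path B (usbdevs enumerating): holds the bus, then wants the transfer
 * lock -- with A in flight at the wrong moment, both wait forever. */
static void
enumerate_path_sketch(void)
{
        mutex_enter(&bus_lock);
        mutex_enter(&xfer_lock);
        /* ... read descriptors ... */
        mutex_exit(&xfer_lock);
        mutex_exit(&bus_lock);
}

A kernel built with "options LOCKDEBUG" should complain loudly the first
time the reversed acquisition order is taken, which would be the
quickest way to confirm or refute this guess.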

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgppIi4jGdYQ5.pgp
Description: OpenPGP Digital Signature


Re: So it seems "umount -f /nfs/mount" still doesn't work.....

2020-07-01 Thread Greg A. Woods
At Tue, 30 Jun 2020 14:28:38 -0700, "Greg A. Woods"  wrote:
Subject: Re: So it seems "umount -f /nfs/mount" still doesn't work.
>
> At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods"  wrote:
> Subject: So it seems "umount -f /nfs/mount" still doesn't work.
> >
>

So, I should have mentioned that "umount -f nfs.server:/remotefs" does
work (i.e. it does not hang waiting for the server to reconnect, and
provided that there are no processes with cwd or open files on the
remote filesystem, it can unmount the filesystem).

I.e. the problem is in how umount(8) looks up the parameters of the
mount point.  If it looks at the mount point it hangs, but if it looks
through the mount table, it works.
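
In userland terms the difference between the two lookup styles is
roughly the following (just an illustration, not the actual umount(8)
code): statvfs(2) on the path has to go through the dead filesystem,
while getvfsstat(2) with ST_NOWAIT only asks the kernel for what it
already has in its mount table:

#include <sys/types.h>
#include <sys/statvfs.h>
#include <stdlib.h>
#include <string.h>

/* Look the mount point up in the kernel's mount table, without ever
 * touching the (possibly wedged) filesystem itself. */
static int
lookup_via_mount_table(const char *mntpoint, struct statvfs *out)
{
        struct statvfs *mntbuf;
        int i, n;

        n = getvfsstat(NULL, 0, ST_NOWAIT);     /* how many mounts? */
        if (n <= 0)
                return -1;
        mntbuf = calloc((size_t)n, sizeof(*mntbuf));
        if (mntbuf == NULL)
                return -1;
        n = getvfsstat(mntbuf, (size_t)n * sizeof(*mntbuf), ST_NOWAIT);
        for (i = 0; i < n; i++) {
                if (strcmp(mntbuf[i].f_mntonname, mntpoint) == 0) {
                        *out = mntbuf[i];
                        free(mntbuf);
                        return 0;
                }
        }
        free(mntbuf);
        return -1;
}

/* Look it up by stat()ing the mount point: this is the call that ends
 * up stuck in the kernel when the NFS server is unreachable. */
static int
lookup_via_mount_point(const char *mntpoint, struct statvfs *out)
{
        return statvfs(mntpoint, out);
}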

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgp_1vuSFtbtv.pgp
Description: OpenPGP Digital Signature


Re: So it seems "umount -f /nfs/mount" still doesn't work.....

2020-06-30 Thread Greg A. Woods
At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods"  wrote:
Subject: So it seems "umount -f /nfs/mount" still doesn't work.
> 

Curiously the kernel now does something I didn't quite expect when one
tries to reboot a system with a stuck mount.  I was able to see this as
I was running a kernel that verbosely documents all its shutdown
unmounts and detaches.  In prior times I had reached for the power switch.

At first it just hangs:

lilbit# reboot -q
[ 1131744.8297338] syncing disks... 3 3 done
[ 1131744.9797408] unmounting 0xc1f27000 /more/work (more.local:/work)...
[ 1131744.9907053] ok
[ 1131744.9907053] unmounting 0xc1f24000 /more/archive (more.local:/archive)...
[ 1131745.0004431] ok
[ 1131745.0004431] unmounting 0xc1f21000 /more/home (more.local:/home)...
[ 1131745.0097426] ok
[ 1131745.0097426] unmounting 0xc1f1f000 /once/build (once.local:/build)...
[ 1131745.0097426] ok
[ 1131745.0210854] unmounting 0xc1f1b000 /future/build (future.local:/build)...
[ 1131745.0210854] ok
[ 1131745.0304676] unmounting 0xc1f11000 /building/build 
(building.local:/build)...

   this is me hitting ^T to try to see what's going on 

[ 1131753.2800902] load: 0.52  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k
[ 1132107.6651517] load: 0.48  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k
[ 1133247.8436109] load: 0.48  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k

    then I hit ^C and immediately it proceeded 

^C[ 1133249.3636755] unmounting 0xc1f0f000 /proc (procfs)...
[ 1133249.3636755] ok
[ 1133249.3636755] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133249.3788641] unmounting 0xc1ecb000 /kern (kernfs)...
[ 1133249.3843127] ok
[ 1133249.3843127] unmounting 0xc1ec9000 /cache (/dev/wd1a)...
[ 1133249.7636916] ok
[ 1133249.7636916] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133249.7736976] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133250.0737098] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133250.1537121] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.0337515] unmounting 0xc1f11000 /building/build 
(building.local:/build)...
[ 1133251.0469644] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133251.0469644] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133251.0579007] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133251.0637673] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133251.0637673] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.0750403] sd0: detached
[ 1133251.0750403] scsibus0: detached
[ 1133251.0750403] gpio1: detached
[ 1133251.0853614] sysbeep0: detached
[ 1133251.0853614] midi0: detached
[ 1133251.0853614] wd1: detached
[ 1133251.0949369] uhub0: detached
[ 1133251.0949369] com1: detached
[ 1133251.0949369] usb0: detached
[ 1133251.1045456] gpio0: detached
[ 1133251.1045456] ohci0: detached
[ 1133251.1045456] pchb0: detached
[ 1133251.1151702] unmounting 0xc1f11000 /building/build 
(building.local:/build)...
[ 1133251.1151702] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133251.1279509] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133251.1279509] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133251.1393918] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133251.1448739] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.1448739] forcefully unmounting /building/build 
(building.local:/build)...
[ 1133251.1587138] forceful unmount of /building/build failed with error -3
[ 1133251.1653872] rebooting...


So it seems there's some contention between the internal attempt to
unmount the stuck NFS filesystem(s), and the reboot system call itself,
but if the reboot command is interrupted, then the kernel can get on
with its shutdown procedures, and eventually it actually forces the
unmount of the stuck NFS filesystem.

Another interesting thing to note is that /future/build was also stuck
as future.local is offline at this time.  However that's the filesystem
I tried to clear first by hand with "umount -f /future/build", but that
was stuck, apparently in the same call to nfs_reconnect().  It seems it
had done enough that when the reboot() triggered the unmounting, it
could complete the unmount without problems.  (The other mounts on
more.local and once.local were responding so they unmounted normally.)

-- 
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpSyQ4PZfAFq.pgp
Description: PGP signature


So it seems "umount -f /nfs/mount" still doesn't work.....

2020-06-30 Thread Greg A. Woods
 rshd 
-L 
   12 16008   9130  85  0  7860   1140 kqueue  I?  0:00.01 
pickup -l -t unix -u 
0 16131  10040  85  0  2620644 select  I?  0:00.01 rshd 
-L 
 1000 20090 152700  85  0 25056   7028 select  Is   ?  0:00.24 
xterm -class UXTerm 
 1000  1768   6090  85  0 11708   1056 ttyraw  Is+  pts/1  0:00.09 -ksh 
0  2940  75840 117  0 17264236 nfscn2  D+   pts/2  0:00.14 
umount -f /future/build 
 1000  4103  40070  85  0  4120   1064 pause   Is   pts/2  0:00.09 -ksh 
0  7584  41030  85  0  7656   1088 pause   Ipts/2  0:00.35 ksh 
0  6722 276390 127  0 16768440 tstile  D+   pts/3  0:00.00 
fstat /future/build 
 1000 21172 14064  489  85  0  3728   1060 pause   Is   pts/3  0:00.09 -ksh 
0 27639 211720  85  0  9648   1048 pause   Ipts/3  0:00.08 ksh 
 1000 13000 20090  722  85  0  3600   1056 ttyraw  Is+  pts/4  0:00.11 -ksh 
0  3707 19523 1057  85  0 11736   1044 pause   Spts/5  0:00.08 ksh 
0  4176  3707 1057  43  0 12900624 -   O+   pts/5  0:00.00 ps 
-alx 
 1000 19523  1002 3550  85  0  3188   1056 pause   Ss   pts/5  0:00.09 -ksh 
0  1013 1  527  85  0  2660652 ttyraw  Is+  ttyE0  0:00.08 -ksh 
0   822 1  126  85  0  2524412 ttyraw  Is+  ttyE1  0:00.00 
/usr/libexec/getty Ws ttyE1 
0   828 1  126  85  0  2524416 ttyraw  Is+  ttyE2  0:00.00 
/usr/libexec/getty Ws ttyE2 
0   957 1  126  85  0  2528412 ttyraw  Is+  ttyE3  0:00.00 
/usr/libexec/getty Ws ttyE3 
0   862 1  126  85  0  4188408 ttyraw  Is+  ttyE4  0:00.00 
/usr/libexec/getty Ws ttyE4 
0  1023 1  126  85  0  2524424 ttyraw  Is+  ttyE5  0:00.00 
/usr/libexec/getty Ws ttyE5 
0  1050 1  126  85  0  2524428 ttyraw  Is+  ttyE6  0:00.00 
/usr/libexec/getty Ws ttyE6 
0   668 10  85  0  2528416 ttyraw  Is+  xencons0:00.01 
/usr/libexec/getty console constty 
12:00 [1.61] # crash
Crash version 8.99.32, image version 8.99.32.
Output from a running system is unreliable.
crash> bt /t 0t2940
trace: pid 2940 lid 1 at 0xaf808a4748f0
sleepq_block() at sleepq_block+0xfd
kpause() at kpause+0xdf
nfs_reconnect() at nfs_reconnect+0x8b
nfs_request() at nfs_request+0xf3a
nfs_getattr() at nfs_getattr+0x175
VOP_GETATTR() at VOP_GETATTR+0x49
vn_stat() at vn_stat+0x3d
do_sys_statat() at do_sys_statat+0x97
sys___lstat50() at sys___lstat50+0x25
syscall() at syscall+0x9c
--- syscall (number 441) ---
43292a:
crash> bt /t 0t6722
trace: pid 6722 lid 1 at 0xaf808a488920
sleepq_block() at sleepq_block+0x99
turnstile_block() at turnstile_block+0x337
rw_vector_enter() at rw_vector_enter+0x169
genfs_lock() at genfs_lock+0x3c
VOP_LOCK() at VOP_LOCK+0x71
vn_lock() at vn_lock+0x90
nfs_root() at nfs_root+0x2b
lookup_once() at lookup_once+0x38e
namei_tryemulroot() at namei_tryemulroot+0x453
namei() at namei+0x29
fd_nameiat.isra.2() at fd_nameiat.isra.2+0x54
do_sys_statat() at do_sys_statat+0x87
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9c
--- syscall (number 439) ---
43c94a:
crash> 


So it would seem that even though umount is trying to force an unmount
of an NFS mount, the kernel is first trying to reconnect to the server!
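
What one would naively hope for is a bail-out in the reconnect loop
whenever a forced dismount is already under way -- something shaped like
the sketch below.  I have not gone through the real nfs_reconnect() and
nfs_connect() code to see where such a test belongs (or whether it
exists and simply isn't reached here), so the struct, the flag, and the
connect helper are all stand-ins for illustration only:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/proc.h>
#include <sys/errno.h>

struct nfsmount_sketch {                /* stand-in for struct nfsmount */
        int     nm_flags;
};
#define NFSMNT_DISMNT_SKETCH    0x01    /* stand-in: forced dismount under way */

static int nfs_connect_sketch(struct nfsmount_sketch *); /* stand-in for the real connect */

static int
nfs_reconnect_sketch(struct nfsmount_sketch *nmp)
{
        int error;

        for (;;) {
                /* A forced unmount is in progress: stop retrying and
                 * let the requests error out so the unmount can finish. */
                if (nmp->nm_flags & NFSMNT_DISMNT_SKETCH)
                        return EIO;
                error = nfs_connect_sketch(nmp);
                if (error == 0)
                        return 0;
                /* Otherwise sleep a bit and retry (cf. the "nfscn2"
                 * wait channel in the backtrace above). */
                kpause("nfscn", false, hz, NULL);
        }
}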


BTW, I have another system running a quite recent i386 build where
crash(8) is unable to do a backtrace:

# ktrace crash
Crash version 9.99.64, image version 9.99.64.
Kernel compiled without options LOCKDEBUG.
Output from a running system is unreliable.
crash> trace /t 0t4003
crash: kvm_read(0x4, 4): kvm_read: Bad address
trace: pid 4003 lid 4003
crash> 


-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgp65J9WmWX7L.pgp
Description: PGP signature


Re: NULL pointer arithmetic issues

2020-03-10 Thread Greg A. Woods
At Mon, 9 Mar 2020 17:36:24 +0100, Joerg Sonnenberger  wrote:
Subject: Re: NULL pointer arithmetic issues
>
> I consider it as something even worse. Just like the case of passing
> NULL pointers to memcpy and friends with zero as size, this
> interpretation / restriction in the standard is actively harmful to some
> code for the sake of potential optimisation opportunities in other code.
> It seems to be a poor choice at that. I.e. it requires adding
> conditional branches for something that behaves sanely everywhere but
> may the DS9k.

Indeed.

I say the very existence of anything called "Undefined Behaviour" and
its exploitation by optimizers is evil.  (by definition, especially if
we accept as valid the claim that "Premature optimization is the root of
all evil in programming" -- this is of course a little bit of a stretch
since my claim could be twisted to say that any and all automatic
optimization by a compiler or toolchain is evil, but of course that's not
exactly my intent -- normal optimization which does not change the
behaviour and intent of the code is, IMO, OK, but defining "intent" is
obviously the problem)

So in Standard C all "Undefined Behaviour" should be changed to
"Implementation Defined" and it should be required that the
implementation is not allowed to abuse any such things for the evil
purpose of premature optimization.  For this kind of thing, adding an
integer to a pointer (or the equivalent, e.g. taking the address of a
field in a struct pointed to by a nil pointer) should always do just
that, even if the pointer can be proven to be a nil pointer at compile
time.  It is wrong to do anything else, and absolute insanity to remove
any other code just because the compiler assumes SIGSEGV
would/should/could happen before the other code gets a chance to run.
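
The canonical example of the construct I mean is the old hand-rolled
offsetof() idiom: it forms the address of a member of a struct "located"
at address zero, never reads or writes any memory, and worked on every
implementation anyone ever shipped -- yet under the modern reading it is
"undefined".  (Just an illustration, of course; offsetof() from
<stddef.h> is the blessed spelling these days.)

#include <stdio.h>
#include <stddef.h>

struct bar {
        int     flags;
        char   *s;
};

/* The traditional hand-rolled offsetof: pointer arithmetic on a nil
 * pointer, with no dereference of any actual memory. */
#define MY_OFFSETOF(type, member)       ((size_t)&((type *)0)->member)

int
main(void)
{
        printf("hand-rolled: %zu   offsetof(): %zu\n",
            MY_OFFSETOF(struct bar, s), offsetof(struct bar, s));
        return 0;
}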

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpGIvFrvwd81.pgp
Description: OpenPGP Digital Signature


Re: NULL pointer arithmetic issues

2020-02-25 Thread Greg A. Woods
At Wed, 26 Feb 2020 00:12:49 -0500 (EST), Mouse  
wrote:
Subject: Re: NULL pointer arithmetic issues
>
> > This is the main point of my original rant.  "Undefined Behaviour" as
> > it has been interpreted by Optimization Warriors has given us an
> > unusable language.
>
> I'd say that it's given you unusuable implementations of the language.
> The problem is not the language; it's the compiler(s).  (Well, unless
> you consider the language to be the problem because it's possible to
> implement it badly.  I don't.)

I don't think the C language (in all lower-case, un-quoted, plainly) is
the problem -- I think the problem is the wording of the modern
standard, and the unfortunate choice to use the phrase "undefined
behaviour" for certain things.  This has given "license" to optimization
warriors -- and their over-optimization is the root of the evil I see in
current compilers.  It is this unfortunate choice of describing things
as "undefined" within the language that has made modern "Standard C"
unusable (especially for any and all legacy code, which is most of it,
right?).

If we outlawed the use of the phrase "undefined behaviour" and made all
instances of it into "implementation defined behaviour", it would have
to come with a very specific caveat that such instances did not, would
not, and could not ever allow optimizers to even think of violating any
possible conceivable principle of least astonishment.

E.g. in the example I gave, the only thing allowed would be for the
implementation to do as it pleases IFF and when the pointer passed was
actually a nil pointer at runtime (and perhaps in this case with a
strong hint that the best and ideal behaviour would be something akin to
calling abort()).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpEf9I3wNR2o.pgp
Description: OpenPGP Digital Signature


Re: NULL pointer arithmetic issues

2020-02-25 Thread Greg A. Woods
At Mon, 24 Feb 2020 22:15:22 -0500 (EST), Mouse  
wrote:
Subject: Re: NULL pointer arithmetic issues
>
> > Greg A. Woods wrote:
> >
> >   NO MORE "undefined behaviour"!!!  Pick something sane and stick to it!
> >
> >   The problem with modern "Standard" C is that instead of refining
> >   the definition of the abstract machine to match the most common
> >   and/or logical behaviour of existing implementations, the standards
> >   committee chose to throw the baby out with the bath water and make
> >   whole swaths of conditions into so-called "undefined behaviour"
> >   conditions.
>
> Unfortunately for your argument, they did this because there are
> "existing implementations" that disagree severely over the points in
> question.

I don't believe that's quite right.

True "Undefined Behaviour" is not usually the explanation for
differences between implementations.  That's normally what the Language
Lawyers call "Implementation Defined" behaviour.

"Undefined behaviour" is used for things like dereferencing a nil
pointer.  There's little disagreement about that being "undefined by
definition" -- even ignoring the Language Lawyers.  We can hopefully
agree upon that even using the original K&R edition's language:

"C guarantees that no pointer that validly points at data will
contain zero"

The problem though is that C gives more rope than you might ever think
possible in some situations, such as for example, the chances of
dereferencing a nil pointer with poorly written code.

The worse problem though is when compiler writers, what I'll call
"Optimization Warrior Lawyers", start abusing any and every possible
instance of "Undefined Behaviour" to their advantage.

This is worse than ignoring Hoare's advice -- this is the very epitome
of premature optimization -- this is pure evil.

This is breaking otherwise readable and usable code.

I give you again my example:

> >   An excellent example are the data-flow optimizations that are now
> >   commonly abused to elide security/safety-sensitive code:
>
> > int
> > foo(struct bar *p)
> > {
> > char *lp = p->s;
> >
> > if (p == NULL || lp == NULL) {
> > return -1;
> > }
>
> This code is, and always has been, broken; it is accessing p->s before
> it knows that p isn't nil.

How do you know for sure?  How does the compiler know?  Serious questions.

What if all calls to foo() are written as such:

if (p) foo(p);

I agree this might not be "fail-safe" code, or in any other way
advisable, but it was perfectly fine in the world before the UB
Optimization Warriors; however, today's "Standard C" gives compilers
license to replace "foo()" with a trap or a call to "abort()", etc.

I.e. it takes a real "C Language Lawyer(tm)" to know that past certain
optimization levels the sequence points prevent this from happening.

In the past I could equally assume the optimizer would rewrite the first
bit of foo() as:

if (! p || ! p->s) return -1;
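
Spelled out as a whole function the audited/fixed version simply delays
the load until after the nil test (a sketch only, re-using the member
name from the example above):

#include <stddef.h>

struct bar {
        char    *s;
        /* ... */
};

int
foo_fixed(struct bar *p)
{
        char *lp;

        /* Test the pointer first, and only then touch what it points
         * at; now there is no earlier access the optimizer can use to
         * "prove" p is non-nil and silently delete the check. */
        if (p == NULL || (lp = p->s) == NULL)
                return -1;

        return lp[0] != '\0';   /* stand-in for the real work */
}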

In 35 years of C programming I've never before had to pay such close
attention to such minute details.  I need tools now to audit old code
for such things, and my current experience to date suggests UBSan is not
up to this task -- i.e. runtime reports are useless (perhaps even with
high-code-coverage unit tests).

This is the main point of my original rant.  "Undefined Behaviour" as it
has been interpreted by Optimization Warriors has given us an unusable
language.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpytwmS13Epg.pgp
Description: OpenPGP Digital Signature


Re: NULL pointer arithmetic issues

2020-02-24 Thread Greg A. Woods
r has knowingly put a supposed de-reference of a
  pointer on the first line of the function, then any comparisons of
  that pointer with NULL further on are OBVIOUSLY never ever going to be
  true and so it can SILENTLY wipe out the whole damn security check.

  I guess I'm saying that modern compiler maintainers are not sane, and
  at least some of the more recent C Standards Committee are definitely
  NOT sane and/or friendly and considerate.

  C's primitive nature engenders the programmer to think in terms of
  what the target machine is going to do, and as such it is extremely
  sad and disheartening that the standards committee chose to endanger
  users in so many ways.

[[ in modern "Standard C" ]]
  It’s not that evaluating something like (1<<32) might have an
  unpredictable result, but rather that the entire execution of any
  program that evaluates such an expression is ENTIRELY meaningless!
  Indeed according to "Standard C" the execution is not even meaningful
  up to the point where undefined behaviour is encountered.  Undefined
  behaviour trumps ALL other behaviors of the C abstract machine.

  And it is all in the goal of attempting comprehensive maximum possible
  optimization of all code at any expense INCLUDING correct operation of
  the program.

  Not all so-called "undefined behaviours" are quite this bad, yet, but
  in general we would be infinitely better off with a more completely
  defined abstract machine that might force some target architectures to
  jump through hoops instead of forcing EVERY programmer to ALWAYS be
  more careful than EVERY conceivable optimizer.

  As Phil Pennock said:

If I program in C, I need to defend against the compiler maintainers.
[[ and future standards committee members!!! ]]
If I program in Go, the language maintainers defend me from my mistakes.

  And I say:

Modern "Standard C" is actually "Useless C" and "Unusable C"


Indeed I now say if "Standard C" follows C++ then it will be safe to say
that a good optimizing compiler will soon be able to turn all C programs
into "abort()" calls.

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpdFaG78xjZR.pgp
Description: OpenPGP Digital Signature


Re: semaphores options

2019-04-08 Thread Greg A. Woods
At Mon, 08 Apr 2019 20:37:39 -0700, "Greg A. Woods"  wrote:
Subject: Re: semaphores options
>
> RCS file: /cvs/master/m-NetBSD/main/src/sys/conf/param.c,v
> 
> revision 1.65
> date: 2015-05-12 19:06:25 -0700;  author: pgoyette;  state: Exp;  lines: +4 
> -2;  commitid: G8nWAd1qbrsX8ely;
> Create a new sysv_ipc module to contain the SYSVSHM, SYSVSEM, and
> SYSVMSG options.  Move associated variables out of param.c and into
> the module's source file.
> 
>
> This commit adds a great big ugly "#if XXX_PRG" around all the related
> SysV IPC settings in sys/conf/param.c, i.e. it entirely removes all
> support for "options SEMMNI=NNN" and related.

Note also this change only appears in NetBSD-8.0 in terms of releases.

The netbsd-7 branch and all its releases preserve the original behaviour
where these "options" worked -- or at least that's what I understand
from CVS.  I haven't tested this -- my only currently running 7.x kernel
didn't have my custom config edits.

--
Greg A. Woods 

+1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgpPmvsEu7Au3.pgp
Description: OpenPGP Digital Signature


Re: semaphores options

2019-04-08 Thread Greg A. Woods
At Mon, 8 Apr 2019 21:19:32 +0300, Dima Veselov  
wrote:
Subject: semaphores options
>
> Greetings!
> Sorry for posting so many questions recently, but my production
> server failed to start PostgreSQL after system upgrade (8-STABLE).
>
> This was caused by semaphores, which I like to set in kernel options,
> which now are not working. Better say some are working, some are
> not.
>
> I solved the problem setting them via sysctl but I wonder what happened
> with options(4)?
> It seems that SEMMNI, SEMMNS, SEMMNU, NOFILE and CHILD_MAX do not
> work anymore, but SHMSEG and NMBCLUSTERS are good. I beleive they
> were always working because the system worked long time and had
> sysctl.conf empty. Any recent changes?

Indeed, something seems to have changed, and the problem continues with
-current as of late January (8.99.32).

I think the culprit was this change, which somehow didn't have an
accompanying change to any documentation (most notably options(4) still
documents all the removed settings):

RCS file: /cvs/master/m-NetBSD/main/src/sys/conf/param.c,v

revision 1.65
date: 2015-05-12 19:06:25 -0700;  author: pgoyette;  state: Exp;  lines: +4 -2; 
 commitid: G8nWAd1qbrsX8ely;
Create a new sysv_ipc module to contain the SYSVSHM, SYSVSEM, and
SYSVMSG options.  Move associated variables out of param.c and into
the module's source file.


This commit adds a great big ugly "#if XXX_PRG" around all the related
SysV IPC settings in sys/conf/param.c, i.e. it entirely removes all
support for "options SEMMNI=NNN" and related.

Perhaps this only affects kernels which have the SysV IPC code baked in,
though I've no idea how the so-called modular world is supposed to work
for pre-set definitions -- I guess it doesn't, though perhaps there's
still some hook for config(1)?

The real underlying problem may be that none of the SysV IPC options
from options(4) were ever properly set up with "defflag" or "defparam"
in the appropriate "files" file (sys/kern/files.kern probably), or as we
used to say, they were never "defopt'ed" for config.  See config(5).

Having "options FOO=1234" worked without "defparam" if the use was in
sys/conf/param.c, but it doesn't seem to work with the new regime.
Maybe it would work again if "defparam" lines were added to the right
place.

FYI, I have had the following in my kernel configs (in this particular
case edited into XEN3_DOMU) since a very long time ago (before 1.6), and
they continued to work up to and including 5.2_STABLE:

# System V compatible IPC subsystem.  (msgctl(2), semctl(2), and shmctl(2))
#
# Note: SysV IPC parameters could be changed dynamically, see sysctl(8).
#
options SYSVMSG # System V-like message queues
#
options MSGMNI=200  # max number of message queue identifiers 
(default 40)
options MSGMNB=32768# max size of a message queue (default 2048)
options MSGTQL=512  # max number of messages in the system (default 
40)
options MSGSSZ=128  # size of a message segment (must be 2^n, n>4) 
(default 8)
options MSGSEG=16384# max number of message segments in the system
# (must be less than 32767) (default 2048)
#
options SYSVSEM # System V-like semaphores
options SEMMNI=200  # max number of semaphore identifiers in system 
(def=10)
options SEMMNS=600  # max number of semaphores in system (def=60)
options SEMMNU=300  # number of undo structures in system (def=30)
options SEMUME=100  # max number of undo entries per process 
(def=10)
#
options SYSVSHM # System V-like memory sharing
options SHMMAXPGS=16384 # Size of shared memory map (def=2048)



But on my 8.99.32 XEN3_DOMU kernel these only give me:

# sysctl kern.ipc
kern.ipc.sysvmsg = 1
kern.ipc.sysvsem = 1
kern.ipc.sysvshm = 1
kern.ipc.shmmax = 2097152000
kern.ipc.shmmni = 128
kern.ipc.shmseg = 128
kern.ipc.shmmaxpgs = 512000
kern.ipc.shm_use_phys = 0
kern.ipc.msgmni = 200
kern.ipc.msgseg = 16384
kern.ipc.semmni = 10
kern.ipc.semmns = 60
kern.ipc.semmnu = 30


FYI, to show it did/does work on an older system:

23:02 [0.185] # uname -a
NetBSD central 5.2_STABLE NetBSD 5.2_STABLE (XEN3_DOMU) #0: Sun Jun  5 16:33:15 
PDT 2016  
woods@building:/build/woods/building/netbsd-5-amd64-amd64-obj/work/woods/m-NetBSD-5/sys/arch/amd64/compile/XEN3_DOMU
 amd64
23:02 [0.186] # sysctl kern.ipc
kern.ipc.sysvmsg = 1
kern.ipc.sysvsem = 1
kern.ipc.sysvshm = 1
kern.ipc.shmmax = 67108864
kern.ipc.shmmni = 128
kern.ipc.shmseg = 128
kern.ipc.shmmaxpgs = 16384
kern.ipc.shm_use_phys = 0
kern.ipc.msgmni = 200
kern.ipc.msgseg = 16384
kern.ipc.semmni = 200
kern.ipc.semmns = 600
kern.ipc.semmnu = 300


--
Greg A. Woods 

+1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


pgphcHNkIPNT0.pgp
Description: OpenPGP Digital Signature


Not Groff! Heirloom Doctools!

2015-06-04 Thread Greg A. Woods
At Thu, 04 Jun 2015 14:53:56 +0200, Johnny Billquist b...@softjar.se wrote:
Subject: Re: Groff
 
 On 2015-06-04 12:44, Robert Swindells wrote:
  
  Johnny Billquist b...@softjar.se wrote:
  
   What happened to the original roff? I mean, groff is just a gnu
   replacement for roff. Maybe switch back to the original?
 
  The sources to all of DWB are available from ATT:
 
  http://www2.research.att.com/~astopen/download/
 
  It needs a bit of work to get it to build on NetBSD though.
 
 Hmm. What about roff from 2.11BSD? That shouldn't be so hard to get
 building on NetBSD...

Have my posts since 2009 about Heirloom Doctools somehow mostly gone
into a black hole or something!?!?!?!  I get responses of "yes, please!"
on the lists, but nothing happens and people still keep posting truly
lame suggestions as if they've never heard of Heirloom Doctools.  I
posted about it in a response to this very thread just three days ago
(though I redirected to tech-userlevel then too)!

Yes, sorry Johnny, but your suggestion really is poor.  Ancient troff
was a poor fit for modern use even 25 years ago with psroff to
generate PostScript from its C/A/T output -- it's full of bugs and
missing tons of features (beyond being device independent), and still
written in what's basically PDP11 assembler dressed up as C (i.e. it's
missing all of BWK's extensive rework), never mind that it's not
actually in the original 2.11BSD release, which contains just Berkeley's
bits (and the same small bits are in the 4.4BSD release too).

Heirloom Doctools _is_ the original troff, in its very latest form!
(well, there's a fork on github that's got a bunch more bug fixes)

A better place to get the original troff, in modern form, with an
open-source license would be Plan-9.

However Heirloom Doctools is equivalent to the Plan-9 version, but
without Plan-9 dependencies, and with more fixes and features.
I.e. Heirloom Doctools is the very most up-to-date code from the very
people who wrote and maintained it since the beginning (sans Joe
Ossanna, of course).

Back before 2009 it already produced PDFs and handled UTF-8.

Heirloom Doctools already builds and works on NetBSD just fine, and
has done so since before 2009 (advertised as working on 2.0 in 2007).

Heirloom Doctools is essentially the complete set of tools from the
AT&T Documenter's Work Bench suite -- i.e. it contains all the other
_necessary_ pre-processors like eqn, pic, tbl, grap, refer, and vgrind,
and it contains the back-end drivers and font tables for PostScript and
PDF and other printers.  The only things it's really missing are the
papers from /usr/{share/}doc, but those are freely available elsewhere,
including from the DWB release.

As I discussed back in 2009, Heirloom Doctools is essentially better
quality and far more feature-full than the last DWB release, and
arguably has a much better license, and of course DWB since 2009 is
probably never going to see another public maintenance release now that
Glen Fowler has retired.  The only thing DWB has over Heirloom Doctools
is arguably better PostScript support (oh, and 'pm', but it's C++ :-)).

Why do people keep forgetting about it, and WTF are we still waiting for?

(once again re-directing to tech-userlevel where this discussion is
more apropos)

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/


pgpD3ABW7JR_U.pgp
Description: PGP signature


Re: retrocomputing NetBSD style

2015-06-03 Thread Greg A. Woods
At Wed, 3 Jun 2015 09:23:37 -0400 (EDT), Mouse mo...@rodents-montreal.org 
wrote:
Subject: Re: retrocomputing NetBSD style
 
 GAW Wrote:
  I really don't understand anyone who has the desire to try to run
  build.sh on a VAX-750 to build even just a kernel, let alone the
  whole distribution.
 
 I recall a time where NetBSD/vax was broken for a long time because
 everyone was cross-building; as soon as a native build was attemped,
 the brokenness showed up.
 
 I native build on _everything_.  If it can't native build, it isn't
 really part of my stable, so to speak.

Yes, there is that issue!

See, for instance, my recent posts comparing assembler output from
kernel compiles done by the same compiler when run on amd64 vs. i386.

However those are the kinds of bugs one might hope can be caught by
decent enough regression tests of the compiler and its toolchain.

Unfortunately these are tests which we don't have now, in part because
in a sense we treat the whole system as the regression test, thus
forcing users to do native compiles to prove there are no noticeable
regressions.

Of course if we did have a proper cross-compiler regression test suite
then we would only have to build and run such tests on those less
capable machines.

In some sense though since I don't intend to use my Soekris board (or
RPi, or BBB, etc.) as development systems, I only really care that the
cross compiler generates working code for them, and we do have an
increasingly useful whole-system regression test suite that I do intend
to run on those smaller systems to prove they work well when their
binaries have been built on my build server.

However this issue does have me wanting to do builds on my RPi and BBB
and to dig my Alpha and another Sparc server out of storage, and find a
couple of MIPS systems of each type, just so I can try cross-compiling
on them all and prove that any future fixes to the compiler will then
result in identical code no matter what host it runs on, including
self-hosted.

So, I guess until/unless we have a good compiler regression test suite,
another awesome use for older and very different hardware (as opposed to
the current melange of almost-identical i386 derivatives) is to serve as
a test base for the toolchain.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/


pgpv4E6v43hu2.pgp
Description: PGP signature


Re: Removing ARCNET stuffs

2015-06-02 Thread Greg A. Woods
At Tue, 2 Jun 2015 11:47:01 -0700, Dennis Ferguson 
dennis.c.fergu...@gmail.com wrote:
Subject: Re: Removing ARCNET stuffs
 
 It's too long an argument, but I think any approach to a
 multiprocessor network stack that attempts to get there starting
 with the existing network L2/L3/interface code as a base is likely
 not on the table.  I would offer the rather herculean effort spent
 on FreeBSD to attempt to do exactly that, and the fairly mediocre
 result it produced, as evidence.  The resources to match that
 probably don't exist, and if there were a better, easier way to do
 this it would have been done already.  I think the least cost way to
 produce a better result is actually to make a big change, preserving
 the device drivers and the transport protocol code (which needs to run
 single-threaded per-socket in any case) and any non-IP protocol code
 that still works (running single-threaded) but doing a wholesale
 replacement of the code that moves packets between those things with
 something that can operate without locks.  Doing it this way has some
 risks, not the least of which is that it would leave you with networking
 code unlike anyone else's (though if it were well done I'm not sure this
 would last, everyone has trouble with the network stack), but I think
 this makes the problem tractable and has a good chance of producing
 something that scales quite well even without a lot of Linux-style
 micro-optimization effort.

Dennis, if you are able I wonder if you could comment on how well you
think the NetGraph implementation in FreeBSD fares with respect to being
part of a multiprocessor network stack, and if you think it offers any
advantages (and/or has any disadvantages) in an SMP environment.  I
understand that NetGraph gained some finer-grained SMP support as early
as FreeBSD-5.x.  I also read about some NetGraph locking and performance
issues in the 201309DevSummit notes, but I don't know any of the
details.

What if NetGraph was the _only_ network stack in the kernel?

And what about Luigi Rizzo's netmap?  (which claims to be specifically
targeted at multi-core machines)  (I'm going to try to learn a bit more
about netmap at BSDCan this year.)

And finally, what about the possibilities for a more formal STREAMS-like
implementation, or at least something that would be compatible with
existing STREAMS modules at the API (DDI/DKI) level, w.r.t. SMP?  This
would maybe allow independent maintenance and testing of less widely
used protocol modules (and perhaps even drivers) by third parties.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/


pgpjagMEFUHHu.pgp
Description: PGP signature


retrocomputing NetBSD style

2015-06-01 Thread Greg A. Woods
At Fri, 29 May 2015 10:22:35 +, David Holland dholland-t...@netbsd.org 
wrote:
Subject: Re: Removing ARCNET stuffs
 
 There's one other thing I ought to mention here, which is that I have
 never entirely understood the point of running a modern OS on old
 hardware; if you're going to run a modern OS, you can run it on modern
 hardware and you get the exact same things as on old hardware, except
 faster and smoother. It's always seemed to me that running vintage
 OSes (on old hardware or even new) is more interesting, because that
 way you get a complete vintage environment with its own, substantively
 different, set of things. This does require maintaining the vintage
 OSes, but that's part of the fun... nonetheless, because I don't
 understand this point I may be suggesting something that makes no
 sense to people who do, so take all the above with that grain of
 salt.

You're quite right that it is interesting to run classic software on
classic hardware, to the extent that retrocomputing is about preserving
a bit of history, or living in the past, or whatever, and to the extent
that one might enjoy such a thing.

However there were, and are, a lot of us who want(ed) a modern OS to run
on our old hardware because we want(ed) to re-purpose that fine old
hardware to do something new and exciting with it.  I.e. I am/was not
building a museum, but rather trying to get things done and learn new
things.

For example I started running NetBSD on Sun-3 and early sparc systems
because that's the hardware I had, and it was good and capable hardware.
However the original SunOS-4 was broken and decrepit for the uses I
wanted to put it to, and I didn't have source so I couldn't really fix
it.  NetBSD opened the door to doing modern things without paying
high-end prices for the latest commercial hardware and software.  At
that time the older hardware really was built better too, and it was
more operational -- i.e. it had proper serial console support, and
once I got to using Alphas, proper 24x7-Lights-Out support with the
ability to power cycle it and reset it remotely without extra control
hardware.

In many respects I still do the same thing, but because of the things
you were saying about how the pace of hardware change has dropped
significantly in recent years, now the hardware I use is just an older
variant of the same stuff you can buy new -- e.g. my new-to-me servers
are Dell PE2950's -- they're replacing a PE2650, but they're not really
all that much different from a brand new R710 or similar.  The old 2650
is really feeling dated now and its processors are missing a number of
features I want, but with the 2950 I can run the very same binaries
across quite a range of hardware from the latest greatest back to these older
second-hand systems.

Also, w.r.t. supporting older and less-capable systems, I would now
treat them exactly the same as modern embedded systems with similar
limitations.  I don't expect I'll ever do many, if any, full builds on
my RPi or BBB, and hopefully not even build many packages on them
either, but rather I will cross-compile for them on my far more beefy
big build server.  Were I to try to run the latest NetBSD on an old
Micro-VAX, Sun3, etc., I would never expect to actually do self-hosted
builds on such systems.  I really don't understand anyone who has the
desire to try to run build.sh on a VAX-750 to build even just a kernel,
let alone the whole distribution.  I won't even bother trying that on my
Soekris board!
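
(For the record, cross-building is just build.sh's normal mode of
operation.  A rough sketch, where the target port and job count are
only examples:

    cd /usr/src
    ./build.sh -U -m sparc -j 8 tools
    ./build.sh -U -m sparc -j 8 release

i.e. the beefy build server does all of the compiling and the old or
small machine only ever sees the finished sets.)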

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-14 Thread Greg A. Woods
At Wed, 14 Dec 2011 07:50:37 + (UTC), mlel...@serpens.de (Michael van Elst) 
wrote:
Subject: Re: Lost file-system story
 
 wo...@planix.ca (Greg A. Woods) writes:
 
 easy, if not even easier, to do a mount -u -r
 
 Does this work again?

Not that I know of, and PR#30525 concurs, as does the commit mentioned
in that PR to prevent it from falsely appearing to work, a change which
remains in netbsd-5 and -current to date.  See my discussion of this
issue earlier in this thread.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-13 Thread Greg A. Woods
At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn brett.l...@baesystems.com 
wrote:
Subject: Re: Lost file-system story
 
 On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
  
  fsck is supposed to handle *all* corruptions to the file system that can
  occur as part of normal file system operation in the kernel. It is doing
  best effort for others. It's a bug if it doesn't do the former and a
  potential missing feature for the latter.
  
 
 There are a lot of slips twixt cup and lip.  If you are really unlucky
 you can get an outage at just the wrong time that will cause the
 filesystem to be hosed so badly that fsck cannot recover it.  Sure, fsck
 can run to completion but all you have is most of your FS in lost+found
 which you have to be really really desperate to sort through.  I have
 been working with UNIX for over 20years now and I have only seen this
 happen once and it was with a commercial UNIX.

I've seen that happen more than once unfortunately.  SunOS-4 once I think.

I agree 100% with Joerg here though.

I'm pretty sure that at least some of the times I've seen fsck do more
damage than good it was due to a kernel bug, or to something else breaking
assumptions about ordered operations.

There have of course also been some pretty serious bugs in various fsck
implementations across the years and vendors.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-13 Thread Greg A. Woods
At Mon, 12 Dec 2011 18:49:31 -0500 (EST), Matt W. Benjamin 
m...@linuxbox.com wrote:
Subject: Re: Lost file-system story
 
 Why would sync not be effective under MNT_ASYNC?  Use of sync is not
 required to lead to consistency expect with respect to an arbitrary
 point in time, but I don't think anyone ever believed otherwise.
 However, there should be no question of metadata never being written
 out if sync was run?

Well sync(2) _could_ be effective even in the face of MNT_ASYNC, though
I'm not sure it will, or indeed even should be required to, have a
guaranteed ongoing beneficial effect on the on-disk consistency of a
filesystem that was mounted with MNT_ASYNC while activity continues to
proceed on the filesystem.

I.e. I don't expect sync(2) to suddenly enforce order on the writes that
it schedules to a MNT_ASYNC-mounted filesystem.  The ordering _may_ be a
natural result of the implementation, but if it's not then I wouldn't
consider that to be a bug, and I certainly wouldn't write any
documentation that suggested it might be a possible outcome.  MNT_ASYNC
means, to me at least, that even sync(2) can get away with doing writes
to a filesystem mounted with that flag in an order other than one which
would guarantee on-disk consistency to a level where fsck could repair
it.

I.e. sync(2) could possibly make things worse for MNT_ASYNC mounted
filesystems before it makes them better, and I don't see how that could
be considered to be a bug.

I do agree that IFF the filesystem is made quiescent, AND all writes
necessary and scheduled by sync(2) are allowed to come to completion,
THEN the on-disk state of an MNT_ASYNC-mounted filesystem must be
consistent (and all data blocks must be flushed to the disk too).

However if you're going to go to that trouble (i.e. close all files open
on the MNT_ASYNC-mounted filesystem and somehow prevent any other file
operations of any kind on that filesystem until such time that you think
the sync(2) scheduled writes are all done), then it should be just as
easy, if not even easier, to do a mount -u -r (or mount -u -o
noasync, or even umount), in which case you'll not only be sure that
the filesystem is consistent and secure, but you'll know when it reaches
this state (i.e. you won't have to guess about when sync(2)'s scheduled
work completes).
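
To make that concrete, a minimal sketch (the device and mount point
names are just placeholders, and as noted elsewhere in this thread the
update-mount path currently has known bugs, e.g. PR#30525):

    # quiesce the filesystem first (stop everything using it), then either:
    mount -u -o noasync /dev/wd0e /scratch    # drop MNT_ASYNC
    # or, more conservatively:
    umount /scratch                           # known to flush and mark the fs clean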

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Fri, 9 Dec 2011 22:12:25 -0500, Donald Allen donaldcal...@gmail.com wrote:
Subject: Re: Lost file-system story
 
 On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods wo...@planix.ca wrote:
  At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen donaldcal...@gmail.com 
  wrote:
  Subject: Re: Lost file-system story
  
   does not guarantee to keep a consistent file system structure on the
   disk is what I expected from NetBSD. From what I've been told in this
   discussion, NetBSD pretty much guarantees that if you use async and
   the system crashes, you *will* lose the filesystem if there's been any
   writing to it for an arbitrarily long period of time, since apparently
   meta-data for async filesystems doesn't get written as a matter of
   course.
 
  I'm not sure what the difference is.
 
 You would be sure if you'd read my posts carefully. The difference is
 whether the probability of an async-mounted filesystem is near zero or
 near one.

I think perhaps the misunderstanding between you and everyone else is
because you haven't fully appreciated what everyone has been trying to
tell you about the true meaning of async in Unix-based filesystems,
and in particular about NetBSD's current implementation of Unix-based
filesystems, and what that all means to implementing algorithms that can
reliably repair the on-disk image of a filesystem after a crash.

I would have thought the warning given in the description of async in
mount(8) would be sufficient, but apparently you haven't read it that
way.

Perhaps the problem is that the last occurrence of the word "or" in the
last sentence of that warning should be changed to "and".  To me that
would at least make the warning a bit stronger.


  And that's why by default, and by very strong recommendation, filesystem
  metadata for Unix-based filesystems (sans WABPL) should always be
  written synchronously to the disk if you ever hope to even try to use
  fsck(8).
 
 That's simply not true. Have you ever used Linux in all the years that
  ext2 was the predominant filesystem? ext2 filesystems were routinely
 mounted async for many years; everything -- data, meta-data -- was
 written asynchronously with no regard to ordering. 

DO NOT confuse any Linux-based filesystem with any Unix-based
filesystem.  They may have nearly identical semantics from the user
programming perspective (i.e. POSIX), but they're all entirely different
under the hood.

Unix-based filesystems (sans WAPBL, and ignoring the BSD-only LFS) have
never ever Ever EVER given any guarantee about the repairability of the
filesystem after a crash if it has been mounted with MNT_ASYNC.

Indeed it is more or less _impossible_ by design for the system to make
any such guarantee given what MNT_ASYNC actually means for Unix-based
filesystems, and especially what it means in the NetBSD implementation.


  Unix filesystems, including Berkeley Fast File System variant, have
  never made any guarantees about the recoverability of an async-mounted
  filesystem after a crash.
 
 I never thought or asserted otherwise.

Well, from my perspective, especially after carefully reading your
posts, you do indeed seem to think that async-mounted Unix-based
filesystems should be able to be repaired, at least some of the time,
despite the documentation, and all the collected wisdom of those who've
replied to your posts so far, saying otherwise.


  You seem to have inferred some impossible capability based on your
  experience with other non-Unix filesystems that have a completely
  different internal structure and implementation from the Unix-based
  filesystems in NetBSD.
 
 Nonsense -- I have inferred no such thing. Instead of referring you to
 previous posts for a re-read, I'll give you a little summary. I am
 speaking about probabilities. I completely understand that no
 filesystem mounted async (or any other way, for that matter), whether
 Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash.

OK, let's try stating this once more in what I hope are the same terms
you're trying to use:  The probability of any Unix-based filesystem
being repairable after a crash is zero (0) if it has been mounted with
MNT_ASYNC, and if there was _any_ activity that affected its structure
since mount time up to the time of the crash.  It still might survive
after some types of changes, but it _probably_ won't.  There are no
guarantees.  Use newfs and restore to recover.
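
In other words, plan on a recovery procedure along these lines (a sketch
only; the device, mount point, and dump file names are placeholders):

    newfs /dev/rwd0e                    # recreate the filesystem from scratch
    mount /dev/wd0e /scratch
    cd /scratch && restore -rf /backups/scratch-level0.dump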

Linux ext2 is not a Unix-based filesystem and Linux itself is not a
Unix-based kernel.  The meaning of async to ext2 is apparently very
different than it is to any Unix-based filesystem.  NetBSD might be free
of UNIX(tm) code, but it and its progenitors, right back to the 7th
Edition of the original Unix, were all implemented by people firmly
entrenched in the original Unix heritage from the inside out.

For Unix-based filesystems and their repair tools, any probability of
recovery less than one is as good as if it were zero.  Don't ever get
your hopes up.  Use newfs

Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 15:08:40 +, David Holland dholland-t...@netbsd.org 
wrote:
Subject: Re: Lost file-system story
 
 On Sun, Dec 11, 2011 at 06:53:26PM -0800, Greg A. Woods wrote:
 No, as far as I can tell he understands perfectly well; he just
 doesn't consider the behavior acceptable.
 
 It appears that currently a ffs volume mounted -oasync never writes
 back metadata. I don't think this behavior is acceptable either.

I agree there are conditions and operations which _should_ guarantee
that the on-disk state of the filesystem is identical to what the user
perceives and thus that the filesystem is 100% consistent and secure.

It seems umount(2) works to make this guarantee, for example.

The two other most important of these that come to mind are:

mount -u -r /async-mounted-fs

and

mount -u -o noasync /async-mounted-fs

It is my understanding that neither works at the moment, and that this
is a known and reported and accepted bug, as I outlined in an earlier
post to this thread.

I think sync(2) should probably also work, but _only_ if the filesystem
is made entirely quiescent from before the time sync() is called, and
until after the time all the writes it has scheduled have completed, all
the way to the disk media.  (and of course once activity starts on the
filesystem again, all guarantees are lost again)

It might be nice if sync(2) could schedule all the needed writes to
happen in an order which would ensure consistency and repairability of
the on-disk image at any given time, but I'm guessing this might be too
much to ask, at least without some more significant effort.

However without enforcing the synchronous ordering of writes, sync(2)
is effectively useless for the purposes Mr. Allen appears to have,
though perhaps his level of risk tolerance would still make it useful
to him while others of us would still be unable to tolerate its dangers
in any scenarios where we were not prepared to use newfs to recover.

Besides, the only way I know to guarantee a filesystem remains quiescent
is to unmount it, so if you do that first then there's nothing for
sync(2) to do afterwards, so nothing new to implement.  :-)


   DO NOT confuse any Linux-based filesystem with any Unix-based
   filesystem.  They may have nearly identical semantics from the user
   programming perspective (i.e. POSIX), but they're all entirely different
   under the hood.
   
   Unix-based filesystems (sans WABPL, and ignoring the BSD-only LFS) have
   never ever Ever EVER given any guarantee about the repariability of the
   filesystem after a crash if it has been mounted with MNT_ASYNC.
 
 What on earth do you mean by Unix-based filesystems such that this
 statement is true?

I mean exactly what it sounds like -- nothing more.

Having almost no knowledge about ext2 or any other non-Unix-based
filesystems, I'm trying to be careful to avoid making any claims about
those non-Unix-based filesystems.

I included FFS as a Unix-based filesystem because I know for sure that
it shares many of the attributes of the original Unix filesystems with
respect to the issues surrounding MNT_ASYNC.

   Perhaps this sentence from McKusick's memo about fsck will help you to
   understand:  fsck is able to repair corrupted file systems using
   procedures based upon the order in which UNIX honors these file system
   update requests.  This is true for all Unix-based filesystems.
 
 No, it is true for ffs, and possibly for our ext2 implementation
 (which shares a lot of code with ffs) but nothing else.

Well, if you follow what I mean by Unix-based filesystems, and you ignore
LFS and options like WAPBL, as I've said, then I believe it is entirely
true, since within my definition that leaves just FFS.

V7, though it didn't have MNT_ASYNC, would suffer the same as if
MNT_ASYNC were implemented for it -- indeed I'm guessing that NetBSD's
reimplementation of v7fs will have the same problems with MNT_ASYNC.

As I say, I don't know enough about the non-Unix-based filesystems in
NetBSD, such as those compatible with AmigaDOS, Acorn, Windows NT, or even
MS-DOS, to know if they would be adversely affected by MNT_ASYNC.
Indeed I'm not even sure if they all have reasonable filesystem repair
tools (NetBSD has none, except maybe for ext2fs and msdos, though in my
experience NetBSD's MS-DOS filesystem implementation is very fragile and
it does not have a truly useful fsck_msdos, even without trying to use
MNT_ASYNC with it).  SysVbfs may suffer too, but I don't know enough
about it either despite it being by definition Unix-based, and we don't
have an fsck for it in any case.

I'd also be guessing about EFS, and I'm not sure I'd categorize it as
Unix-based any more than I do LFS.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 11:09:44 -0500 (EST), Mouse mo...@rodents-montreal.org 
wrote:
Subject: Re: Lost file-system story

 They _can_ be repaired...some of the time.  When they can, it is
 because, by coincidence, it just so happens that the stuff that got
 written produces a filesystem fsck can repair.

That's totally irrelevant.

Probabilities other than zero or one are not useful in manual pages, and
they are only useful to an end user as a very last resort -- equivalent
to calling out the army to put Humpty Dumpty back together again.

For all useful intents and purposes any probability of irreparable damage
of greater than zero is, for the end user, and for all planning purposes,
as good as a probability of one.  Plan to use newfs and restore after
every crash and you'll be OK.  Plan otherwise and you will eventually be
disappointed.

 That's not how I feel about it when I've lost a filesystem.  I'll take
 a filesystem with a nonzero probability of recovering something useful
 from over one that guarantees to trash everything any day (other things
 being equal, of course).

Heh.  Yup, there are those of us who will find it a challenge to see
just how much we can recover from a damaged file system no matter how
useful the outcome may be.

You don't put that in the manual page though, and you never give the end
user that expectation (unless it's already too late for them and they've
got yolk all over their face).

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Sun, 11 Dec 2011 23:23:33 -0500, Donald Allen donaldcal...@gmail.com wrote:
Subject: Re: Lost file-system story
 
 How can you possibly say such a thing and hope to be taken seriously?
 What you just said means that P(survival) = .999 is the same as
 P(survival) = 0.
 
 There are a LOT of situations (e.g., mine) where P(survival) = .999
 would be very acceptable and P(survival) = 0 would not.

The manual page must not give probabilities or even speak of
possibilities.

So, as-is you have been warned properly by the manual page.

For planning purposes you _must_ expect that your filesystem will be
damaged beyond repair after a crash and that you will have to use
newfs and restore to recover.  Learn these expectations well and you
will be happier in the long run.  Fail to learn them and you have no
recourse but to wallow in your own sorrows.  I.e. you can't come to the
mailing list and say that you expected something better just because you
say you can get something better from something else entirely different.
You have false expectations based on your experiences with entirely
foreign environments.

Maybe Humpty Dumpty can be put back together again, sometimes, but even
if you have all the King's horses and all the King's men on call to
respond to a disaster at a moment's notice, you must not expect that you
can have the egg put back together successfully, even just once, even if
it does look like just a minor crack this time.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 14:23:40 -0600, Eric Haszlakiewicz e...@nimenees.com 
wrote:
Subject: Re: Lost file-system story
 
 Donald, don't listen to Greg.  Just in case it needs to be repeated, you're
 not the only one that thinks it is reasonable to expect a non-0 probability
 that things will be recovereable, even if something goes wrong.

Eric, what part of MNT_ASYNC don't you understand?

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 14:17:35 -0600, Eric Haszlakiewicz e...@nimenees.com 
wrote:
Subject: Re: Lost file-system story
 
 On Mon, Dec 12, 2011 at 11:39:38AM -0800, Greg A. Woods wrote:
  Having almost no knowledge about ext2 or any other non-Unix-based
  filesystems, I'm trying to be careful to avoid making any claims about
  those non-Unix-based filesystems.
 
 hmm.. so then how can you claim that it is entirely different (as you did
 in an earlier email)?  It sounds like you're talking our of your, ahem.. 
 depth.

As I said, I'm trying to be careful to avoid making claims one way or
another about non-Unix-based filesystems.

I'm also trying to keep in mind that MNT_ASYNC can be an attribute of
the OS implementation well above the filesystems and I'm also trying to
avoid making claims about non-Unix filesystem structures which may be
faced with this feature for the first time.

Once upon a time I was quite familiar with the use of the tools that
came before fsck.  I have a great deal of experience with the on-disk
structure of V7fs, SysVfs, and many of the minor variants of these
filesystems.  I'm experienced with many of the things that can go wrong
with these filesystems and I'm moderately experienced with how they can
be repaired as best as is humanly possible with low-level bit
manipulating tools when bugs in either the kernel or fsck cause
unexpected failures (not unlike what can happen when MNT_ASYNC is used).
I'm moderately experienced with more modern filesystems such as with
SysVr4's native FS and Berkeley FFS, though less experienced with
low-level on-disk repair of those filesystems (since on these modern
Unix-based filesystems the standard repair tools, especially fsck, have
been vastly improved; and kernel bugs which destroy the ordered writing
of metadata have effectively been eliminated).

  I included FFS as a Unix-based filesystem because I know for sure that
  it shares many of the attributes of the original Unix filesystems with
  respect to the issues surrouding MNT_ASYNC.
 
 Have you tried actually comparing the current NetBSD ffs sources against
 whatever Unix sources you are talking about?  While I'm sure that there
 are many attributes that are shared, if you even compare the current NetBSD
 sources with those from, say, 1994, you will find a ton of differences.

This has nothing to do with any given pile of source code per se.  The
issues that affect repairability of a Unix-based filesystem are higher
level design considerations that are common to the implementations of
fsck and the filesystems they can repair from the v7 addenda tape all
the way through to the implementation of modern day NetBSD's
fsck_ffs(8).

You might find McKusick and Kowalski's paper about BSD FFS fsck
enlightening.  (I can supply a copy if you can't find it elsewhere.  It
would be nice if it could be included in the NetBSD distribution, even
if not cleaned up to reflect the current implementation.  It was in
4.4BSD-Lite2, after all.)


Like I said earlier:

Perhaps the superblock(s) should also record when a filesystem has been
mounted with MNT_ASYNC so that fsck(8) can print a warning such as:

FS is dirty and was mounted async.  Demons will fly out of your nose


-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-09 Thread Greg A. Woods
At Tue, 6 Dec 2011 12:44:16 -0500, Donald Allen donaldcal...@gmail.com wrote:
Subject: Re: Lost file-system story
 
 much more clear. When I read this before the fun started, I took it to
 mean, perhaps unjustifiably, what I know to be true -- there is some
 non-zero probability that fsck of an async file-system will not be
 able to verify and/or restore the filesystem to correctness  after a
 crash. You are saying that the probability, in the case of NetBSD, is
 1. If that's true, that there's no periodic sync, I would say that's
 *really* a mistake. It should be there with a knob the administrator
 can turn to adjust the sync frequency.

Just to be clear:  There is such a knob, or rather binary switch.  It's
called umount(2).

sync(2) might work too, but I seem to vaguely remember something about
it not working for async-mounted filesystems, and some obscure reason
why it wouldn't/couldn't work for them, though that doesn't seem logical
to me any more.  sync(2) should, IMHO, even go so far as to cause the
dirty flag to be cleared on the disk once all the writes to flush all
necessary updates have completed (and assuming of course that no further
changes of any kind are made to the filesystem after sync(2) scheduled
all the writes, and assuming of course that writes cached in the storage
interface controller or in the drive controller will be written out in
order).
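
(On that last point, dkctl(8) can at least show, and if necessary
disable, a drive's write-back cache; a sketch, with the drive name as a
placeholder:

    dkctl wd0 getcache          # report the current cache settings
    dkctl wd0 setcache r        # read cache only, i.e. no write-back caching

though of course that trades away a great deal of write performance.)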

In theory mount -u -r should work too, but then there's PR#30525.

Steve Bellovin asked a question some time ago on netbsd-users about why
umount(2) works, but mount -u -r doesn't, and to the best of my
understanding it hasn't been answered yet (though mention was made of a
possible fix to be found in FreeBSD, followed by some musings on how
hard it is to find and use such fixes in the diverging code bases of
FreeBSD and NetBSD).

Perhaps sync(2) will fail for async-mounted filesystems, or even without
MNT_ASYNC, for the same reason that mount -u -r fails, though that's
pure speculation based on my vague ideas, and is not based on anything
in the code.  The question was asked in PR#30525 about mount -u -r
vs. filesystems mounted with MNT_SYNC, but nobody knew if that would
make any significant difference or not (and I would naively suspect not).

Perhaps the superblock should also record when a filesystem has been
mounted with MNT_ASYNC so that fsck(8) can print a warning such as:

FS is dirty and was mounted async.  Demons will fly out of your nose

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: Lost file-system story

2011-12-09 Thread Greg A. Woods
At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen donaldcal...@gmail.com wrote:
Subject: Re: Lost file-system story
 
 does not guarantee to keep a consistent file system structure on the
 disk is what I expected from NetBSD. From what I've been told in this
 discussion, NetBSD pretty much guarantees that if you use async and
 the system crashes, you *will* lose the filesystem if there's been any
 writing to it for an arbitrarily long period of time, since apparently
 meta-data for async filesystems doesn't get written as a matter of
 course.

I'm not sure what the difference is.  You seem to be quibbling over
minor differences and perhaps one-off experiences.  Both OpenBSD and
NetBSD also say that you should not use the async flag unless you are
prepared to recreate the file system from scratch if your system
crashes.  That means use newfs(8) [and, by implication, something like
restore(8)], not fsck(8), to recover after a crash.  You got lucky with
your test on OpenBSD.


 And then there's the matter of NetBSD fsck apparently not
 really being designed to cope with the mess left on the disk after
 such a crash. Please correct me if I've misinterpreted what's been
 said here (there have been a few different stories told, so I'm trying
 to compute the mean).

That's been true of Unix (and many unix-like) filesystems and their
fsck(8) commands since the beginning of Unix.

fsck(8) is designed to rely on the possible states of on-disk filesystem
metadata because that's how Unix-based filesystems have been guaranteed
to work (barring use of MNT_ASYNC, obviously).

And that's why by default, and by very strong recommendation, filesystem
metadata for Unix-based filesystems (sans WAPBL) should always be
written synchronously to the disk if you ever hope to even try to use
fsck(8).


 I am not telling the OpenBSD story to rub NetBSD peoples' noses in it.
 I'm simply pointing out that that system appears to be an example of
 ffs doing what I thought it did and what I know ext2 and journal-less
 ext4 do -- do a very good job of putting the world into operating
 order (without offering an impossible guarantee to do so) after a
 crash when async is used, after having been told that ffs and its fsck
 were not designed to do this.

You seem to be very confused about what MNT_ASYNC is and is not.  :-)

Unix filesystems, including Berkeley Fast File System variant, have
never made any guarantees about the recoverability of an async-mounted
filesystem after a crash.

You seem to have inferred some impossible capability based on your
experience with other non-Unix filesystems that have a completely
different internal structure and implementation from the Unix-based
filesystems in NetBSD.

Perhaps the BSD manuals have assumed some knowledge of Unix history, but
even the NetBSD-1.6 mount(8) manual, from 2002, is _extremely_ clear
about the dangers of the async flag, with strong emphasis in the
formatted text on the relevant warning:

 async   All I/O to the file system should be done asyn-
 chronously.  In the event of a crash, _it_is_
 _impossible_for_the_system_to_verify_the_integrity_of_
 _data_on_a_file_system_mounted_with_this_option._  You
 should only use this option if you have an applica-
 tion-specific data recovery mechanism, or are willing
 to recreate the file system from scratch.

According to CVS that wording has not changed since October 1, 2002, and
the emphasised text has been there unchanged since September 16, 1998.

 So I'd love it if my experience encourages someone to improve NetBSD
 ffs and fsck to make use of async practical

As others have already said, this has already been done.  It's called
WAPBL.  See wapbl(4) for more information.  Use mount -o log to enable
it.
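
For example (the mount point is just a placeholder; the commented fstab
line is the persistent equivalent):

    mount -u -o log /home
    # or in /etc/fstab:
    # /dev/wd0e  /home  ffs  rw,log  1 2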

(BTW, I personally don't think you would want to use softdep -- it can
suffer almost as badly as async after a crash, though perhaps without
totally invalidating fsck(8)'s ability to at least recover files and
directories which were static since mount; and it does also offer vastly
improved performance in many use cases, but as the manual says, it
should still be used with care (i.e. recognition of the risks of
less-tested, much more complex code, and vastly changed internal
implementation semantics implying radically different recovery modes).)

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: language bindings (fs-independent quotas)

2011-11-21 Thread Greg A. Woods
At Fri, 18 Nov 2011 16:27:53 +0200, Alan Barrett a...@cequrux.com wrote:
Subject: Re: language bindings (fs-independent quotas)
 
 On Fri, 18 Nov 2011, Manuel Bouyer wrote:
  Assuming that there's no need to handle fields with embedded
  spaces, perl's split() function will DTRT.
 
  No, it does not because there are fields that can be empty.
 
 The common way of dealing with that is to have a placehloder like -
 for empty fields.

I dunno (and don't want to know :-)) about perl, but it's easy enough to
insert proper field separators into fixed-width columnar input with Awk
and then go about using split() or whatever uses FS normally.
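
For example, something along these lines, where the column widths and
the input file name are made up purely for illustration:

    awk '{ printf "%s\t%s\t%s\n",
           substr($0, 1, 10), substr($0, 11, 8), substr($0, 19, 12) }' repquota.out

i.e. slice the fixed-width columns apart with substr() and emit real
(tab) separators, after which split() or the default FS handling works
as expected.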

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: getrusage() problems with user vs. system time reporting

2011-11-02 Thread Greg A. Woods
At Tue, 01 Nov 2011 01:43:23 -0700, Greg A. Woods wo...@planix.ca wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
 
 Indeed with all this doom and gloom about TSC it seems it might be
 better just to use binuptime() -- that probably won't be as fast
 though  Perhaps if I'm inspired tomorrow I'll try to re-do the
 sysctl_stats.h macros to use it instead of cpu_counter32(), and then use
 this real measure of system time in calcru() instead of pretending the
 stathz ticks mean anything.

So, I've done that now.  (including removing dependency on
SYSCALL_COUNTS and SYSCALL_TIMES, etc.; all except for figuring out how
to hook in the ISR hooks...)

Seems binuptime() is indeed way too expensive to run at every
mi_switch(), syscall(), etc. (it more than doubles the time it takes for
gettimeofday() to run), but getbinuptime() seems to be sufficiently
low-cost to use in these situations.

Unfortunately getbinuptime() isn't immediately looking a whole lot
better than the statistical sampling in statclock(), though perhaps,
with enough benchmark runtime, it is, as expected, being _much_ more
fair at splitting between user and system time.

At least this is the case on a VirtualBox VM.  I need to see it on real
hardware next.

With some further analysis, and with the addition of new time values to
struct proc so that statclock() ticks can also be accounted (right now I
re-use the p_*ticks storage for 64-bit nanosecond values), it may be
possible to come up with a simple algorithm so that calcru() can balance
out the difference between the getbinuptime() values and the
true binuptime() stored in {l,p}_rtime.  Storing the raw getbinuptime()
values would also avoid having to do 64-bit adds in mi_switch(),
syscall(), et al.

If anyone's interested in more details I can post some of my results, as
well as the changes I've made.

Any comments about this would be appreciated!


One thing that's confusing me is that though normally for short-running
processes I'm seeing the getbinuptime() values be either zero, or
somewhat less than the binuptime() value from p_rtime, on rare occasions
I also see vastly larger getbinuptime() values.

For example (this from calcru(), as it is called from in kern_exit.c):

exit|tty: atrun[377]: rt=2216 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[361]: rt=2066 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[409]: rt=2207 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[130]: rt=2272 ms, u+s=1694 ms (u=0 ns / s=1694070 ns), it=0 
ticks
exit|tty: atrun[434]: rt=4048 ms, u+s=10151 ms (u=10151849 ns / s=0 ns), it=0 
ticks
exit|tty: atrun[162]: rt=3209 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[458]: rt=2576 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks


rt is the real-time (p_rtime) value calculated by calcru(), in ms
u is the accumulated user time from getbinuptime() calls, in ns
s is the accumulated system time from getbinuptime() calls, in ns
it is the old-style statistically gathered p_iticks value
u+s is of course u + s, converted to ms


(the longest running sample above was when the VM was idle except for
cron, but the VirtualBox host, my desktop, may have been quite busy)


-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




Re: getrusage() problems with user vs. system time reporting

2011-11-01 Thread Greg A. Woods
At Mon, 31 Oct 2011 23:28:49 +, David Laight da...@l8s.co.uk wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
 
 On Mon, Oct 31, 2011 at 04:08:13PM -0700, Greg A. Woods wrote:
  
  So, if I understand the way SYSCALL_TIMES_PROCTIMES is implemented it
  effectively converts the struct proc p_*ticks fields into cycle counts
  instead of stathz tick counts.  (though it seems enabling this does not
  disable the additional accumulation of stathz ticks, nor does it adjust
  the calculations in kern_resource.c to give expected values)
 
 It doesn't matter. With the cycle counter values, the stathz ticks
 are noise. The counts are then a bit like doing the stathz count
 on every tick of the cycle counter!

Ah, yes, of course!  I realized that shortly after posting while I was
adding #ifdef's to turn off the counting in statclock().  :-)

  Because they do use cpu_counter32(), I'm surprised they would be that
  expensive to keep.
 
 If a cpu's TSC rate changes (eg with power saving) they they'll give
 different results. So you'd really need a nanotime function.
 OTOH using the valuse to split the total execution time is probably
 always better than the current code.

Wikipedia's entry on Time Stamp Counter (and Intel's app-note about
using RDTSC for performance monitoring) also mention that on any
processor since the Pentium Pro with out-of-order execution an accurate
cycle count can only be obtained by preceding the RDTSC instruction with
something like CPUID, or on CPUs which support one, the RDTSCP
instruction.

It is also mentioned that some processors run the time-stamp cycle
counter at a constant rate (not the actual current CPU clock rate)
(though apparently this quirk can be identified via a CPUID feature flag).

And of course there's the issue of multiple processors, since as I
understand it the TSCs on different cores are not synchronised.

Finally though I'm still learning more about virtual TSCs on VMware and
VirtualBox, I'm not so sure the TSC will be at all useful in such a
virtual machine environment.
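
(For what it's worth, it is at least easy to see from userland which
timecounter the kernel has actually selected, and what else it had to
choose from:

    sysctl kern.timecounter.hardware
    sysctl kern.timecounter.choice

which should make it obvious whether the TSC is even in the running on
a given machine or VM.)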

Indeed with all this doom and gloom about TSC it seems it might be
better just to use binuptime() -- that probably won't be as fast
though  Perhaps if I'm inspired tomorrow I'll try to re-do the
sysctl_stats.h macros to use it instead of cpu_counter32(), and then use
this real measure of system time in calcru() instead of pretending the
stathz ticks mean anything.


 Getting the TSC is (IIRC) 30-40 clocks on i386 - because it is a
 synchronising instruction. But it might be the delays only really
 affect back to back reads. ad@ knows more - it will be in the
 archives somewhere.

I found some references saying that it could be 150-200 cycles, and
another saying that it was closer to 80 cycles.


BTW, I don't seem to have any luck identifying the CPU in the VirtualBox
VM that I'm running NetBSD-5 in:

# cpuctl identify 0
cpuctl: cpuset_create: Cannot allocate memory

ktrace seems to say the error is coming from sysctl(), not calloc():

   354  1 cpuctl   CALL  __sysctl(0xbfbfe954,2,0x80758e0,0xbfbfe95c,0,0)
   354  1 cpuctl   RET   __sysctl 0
   354  1 cpuctl   CALL  open(0x806b261,2,0x51)
   354  1 cpuctl   NAMI  /dev/cpuctl
   354  1 cpuctl   RET   open 3
   354  1 cpuctl   CALL  __sysctl(0xbfbfe888,2,0xbfbfe8e0,0xbfbfe8e4,0,0)
   354  1 cpuctl   RET   __sysctl 0
   354  1 cpuctl   CALL  __sysctl(0xbfbfe858,2,0xbfbfe8b0,0xbfbfe8b4,0,0)
   354  1 cpuctl   RET   __sysctl 0
   354  1 cpuctl   CALL  __sysctl(0x8072130,2,0xbfbfe8e0,0xbfbfe8e4,0,0)
   354  1 cpuctl   RET   __sysctl -1 errno 12 Cannot allocate memory


I was about to try to copy over a sysctl.debug symbol file from my build
machine, after turning on the network, and I got a crash as I started
the rcp, and it's the first time I've seen such a crash and the only
difference is that I've turned on SYSCALL_TIMES_PROCTIMES et al.

Mutex error: lockdebug_barrier: spin lock held

lock address : 0xc0d4de54 type :   spin
initialized  : 0xc04e7086
shared holds :  0 exclusive:  1
shares wanted:  0 exclusive:  0
current cpu  :  0 last held:  0
current lwp  : 0xd3d02840 last held: 0xd3d02840
last locked  : 0xc04e622c unlocked : 0xc04e624b
owner field  : 0x00010700 wait/spin:0/1

panic: LOCKDEBUG
Begin traceback...
copyright(c0d50643,0,0,c0c36d90,d2a69c40,d2a69bd8,c0d4de54,c0c33a24,d3d02840,c04e622c)
 at 0xc0b8d29d
End traceback...

dumping to dev 0,1 offset 2000263
dump 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 
26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 succeeded


rebooting...

(gdb) where
#0  0xc05d246c in cpu_reboot ()
#1  0xc04fbe30 in panic ()
#2  0xc04f476b in lockdebug_abort1 ()
#3  0xc04d54ca in rw_vector_enter ()
#4  0xc0466531

Re: getrusage() problems with user vs. system time reporting

2011-10-31 Thread Greg A. Woods
At Mon, 31 Oct 2011 21:10:40 +, David Laight da...@l8s.co.uk wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
 
 There is an kernel option in i386 (and maybe amd64) to do some
 per-syscall stats. One of those counts 'time in syscall' and IIRC
 could easily be used to weight the tick counts so that getrusage
 gives more accurate times.

I had no idea there was a SYSCALL_TIMES_PROCTIMES option as well!  (and
I see it's been sitting there un-documented since 2007, and so it is
already in the netbsd-5 sources I'm experimenting with!)  This is
exciting!  This is what I was looking for!
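
(For the record, since it is undocumented: as far as I can tell enabling
it is just the usual kernel config dance, something like

    options         SYSCALL_TIMES
    options         SYSCALL_TIMES_PROCTIMES

followed by a config/build/reboot cycle.)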

I had ignored SYSCALL_TIMES because it seemed from the manual pages to
be lacking per-process hooks, though I was getting to the place where I
might have noticed that this would be an appropriate framework in which
to add per-process support. :-)


So, if I understand the way SYSCALL_TIMES_PROCTIMES is implemented it
effectively converts the struct proc p_*ticks fields into cycle counts
instead of stathz tick counts.  (though it seems enabling this does not
disable the additional accumulation of stathz ticks, nor does it adjust
the calculations in kern_resource.c to give expected values)


It looks like SYSCALL_TIMES is indeed on both i386 and amd64 at this
time, which will do fine for me for now, but given how it seems to work
it looks like it could be made to work on alpha, mips, powerpc, sparc64,
and ia64 with relative ease.


 The problem is that getting an accurate timestamp is relatively
 expensive. It has been almost the dominant part of process switch.

Because they do use cpu_counter32(), I'm surprised they would be that
expensive to keep.

If one were to get rid of the big syscall_counts and syscall_times
tables and just use the bits necessary for SYSCALL_TIMES_PROCTIMES,
would that help reduce the overhead to a more acceptable level?


BTW, have you ever built and tested a kernel with appropriate instances
of SYSCALL_TIME_ISR_ENTRY() and SYSCALL_TIME_ISR_EXIT() put into place?
If so, do you have suggestions as to where I could try putting those
macros, especially in a netbsd-5 kernel?

In my estimation it's useless to try to make getrusage() show more
accurate user time without also firmly accounting for ISR times as well.


Relatively speaking I don't mind at all taking a small, equitable, hit
in context switching if as a result I can get relatively accurate user
and system (and ISR) times per process as a result.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 250 762-7675http://www.planix.com/




kernel network interface incompatibilities between netbsd-4 and netbsd-5?

2010-10-15 Thread Greg A. Woods
Are there known kernel network interface incompatibilities between
netbsd-4 and netbsd-5?

I mention this because in considering upgrading one of my servers from
netbsd-4 to netbsd-5 I noticed that a static-linked arpwatch binary
built on netbsd-4 was complaining about bogons on my network even though
they were not bogons -- they were all in the same subnet:

Oct 15 12:36:39 historically arpwatch: bogon 204.92.254.6 8:0:20:21:99:db
Oct 15 12:36:48 historically last message repeated 10 times
Oct 15 12:36:49 historically arpwatch: bogon 204.92.254.244 0:f:d3:0:5:83
Oct 15 12:36:50 historically arpwatch: bogon 204.92.254.6 8:0:20:21:99:db
Oct 15 12:37:05 historically last message repeated 14 times


I also noticed that unbound wasn't answering DNS queries either, though
it didn't make any complaints.

Unfortunately the old netbsd-4 fstat is useless against a netbsd-5
kernel to see if unbound was actually listening on the right interfaces.

-- 
Greg A. Woods

+1 416 218-0098VE3TCP  RoboHack wo...@robohack.ca
Planix, Inc. wo...@planix.com  Secrets of the Weird wo...@weird.com




Re: Hardware RAID problem with NetBSD 5?

2010-03-29 Thread Greg A. Woods
At Tue, 30 Mar 2010 00:38:05 + (UTC), John Klos j...@ziaspace.com wrote:
Subject: Hardware RAID problem with NetBSD 5?
 
 ataraid0: found 1 RAID volume
 ld0 at ataraid0 vendtype 3 unit 0: nVidia ATA RAID-1 array
 ld0: 931 GB, 121601 cyl, 255 head, 63 sec, 512 bytes/sect x 1953525120  
 sectors

I guess ataraid(4) is broken in NetBSD-5, as it is in NetBSD-4 and
-current.

See PRs #42985 and #38273 for starters.


 Strange... Does anyone have any ideas? Has anyone seen behaviour like 
 this, particularly the reset button getting disabled?

I booted today's kernel and encountered a rather harder lockup than
previously (DDB hung doing a backtrace, sending BREAK to the serial
console had no effect), though the reset button, at least on my machine
downstairs, still worked fine.

I can imagine some machines where the reset button is more of a software
controlled feature -- I've seen that kind of design mistake several
times before -- but I don't know any details of your MSI board (and I
can't find any manuals or other information about it on MSI's site).

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




MIPS SoC systems (was: Dead ports [Re: config(5) break down])

2010-03-20 Thread Greg A. Woods
At Fri, 19 Mar 2010 21:23:35 +, Herb Peyerl hpey...@beer.org wrote:
Subject: Re: Dead ports [Re: config(5) break down]
 
 On Fri, Mar 19, 2010 at 05:19:47PM -0400, Thor Lancelot Simon wrote:
  Have a look at
  http://www.rmicorp.com/assets/docs/2070SG_XLR_XLS_Product_Selection_Guide_2008-12-16.pdf
  specifically at the bottom few rows on the XLS chart.  You're looking at
  parts that have 3 or 4 Gig-E interfaces, tons of useful hardware offload,
  and are, by published reports, way down in the sub-$50 range.  You can
  get very similar stuff from Cavium.
 
 Last time I bought a cavium board it was $5k USD... An Octeon 3850
 was $700 for 1521 piece part... I didn't think they had anything
 reasonable down below $500?  (and as far as I remember, they already
 had FreeBSD running on the Octeons).  Admittedly it's been a few 
 years.

FreeBSD is re-doing all its MIPS support, with quite a bit of work going
into the Atheros and Cavium ports.  Atheros is running, and some Cavium
are running too, but not yet all the most interesting ones.  Check out
Warner Losh's postings:

   http://bsdimp.blogspot.com/search/label/mips

I'm interested in bringing over some of those ports to NetBSD (though
if I try to do it for my day job I'll need to bring over Netgraph first).

Here's one company making Cavium-based systems at a reasonable price:

http://www.portwell.com/products/detail.asp?CUSTCHAR1=CAM-0100

This one doesn't run FreeBSD yet, but someone is working on it and they
are very close (it's not much different from the Cavium eval board
Warner shows booting).

They have a bunch of higher-end systems based on Cavium CPUs (and some
other CPUs) too:

http://www.portwell.com/products/MIPS.asp

This company isn't as low-priced, but has similar devices.  This one is
just under $500, single unit:


http://www.lannerinc.com/Network_Application_Platforms/Network_Processor_Platforms/Desktop_NPU_Platforms/MR-320

and they also have a wide product range:


http://lannerinc.com/Network_Application_Platforms/Network_Processor_Platforms


One of the cheapest Atheros boards is the Ubiquiti RouterStation
series.  You can get one in a case with power supply from various
vendors now for just over $100, single unit pricing (the board is $80).

http://www.ubnt.com/rspro

This is one that FreeBSD runs on already, and I think adapting our
AR53xx port to also work on its AR71xx SoC would be relatively easy.
It's pretty snappy, but it has a poorly supported Ethernet switch chip
that as yet limits it for use in my day job.


When you start looking at what the GNU/Linux OpenWRT project supports,
there are dozens of very interesting little systems available at
relatively low prices.

http://oldwiki.openwrt.org/TableOfHardware.html

Routerboard.com (MikroTik) sell a bunch of interesting boards that, even
including the licensing for their own proprietary GNU/Linux port, are
still quite cost effective.  Most of the more powerful ones are AR71xx based.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




Re: config(5) break down

2010-03-16 Thread Greg A. Woods
At Tue, 16 Mar 2010 10:22:42 +, Andrew Doran a...@netbsd.org wrote:
Subject: Re: config(5) break down
 
 Correctamundo.  95% of downloads in the week following the release of 5.0
 were for x86.  It doesn't say much about embedded but does tell us that
 a very large segment of the user population does commodity hardware.
 
 (What the figures also revealed was that a number of the ports had as close
 to zero downloads as matters.  Which is, to be frank, a red flag for
 those that are not maintained.)

Please do not even think about using downloads as a measure of which
ports are used and how much they are used!

That's a completely invalid measurement of how NetBSD might be used.

Many of us just download the source.  We don't tell you which parts of
it that we use or don't use.

Even port-* mailing list subscriptions aren't a truly valid hint of
which ports are used or how much they are used.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-11 Thread Greg A. Woods
At Thu, 11 Mar 2010 10:22:29 +0900, Masao Uebayashi uebay...@gmail.com wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
 
 On Thu, Mar 11, 2010 at 4:33 AM, Greg A. Woods wo...@planix.ca wrote:
  At Wed, 10 Mar 2010 08:56:36 + (GMT), Iain Hibbert 
  plu...@rya-online.net wrote:
  Subject: Re: (Semi-random) thoughts on device tree structure and devfs
 
  So, you want to be able to mount a disk by the label:
 
    $ mount -t msdosfs -o label foobar /external_disk_foobar
 
  Yes, something like that, using fs_volname of course.  I've wanted this
  kind of feature for decades.
 
 While I understand usefulness of human-readable labels, I don't think
 it should be handled in kernel.  Because labels are arbitrary.  They
 are not ensured to be unique.

The fs_id value is _NOT_ going to be any more unique than the fs_volname
value.

The fs_id value is also not guaranteed to be unique to start with,
especially not across the operational lifetime of a filesystem.

There are a plethora of ways the fs_id can be duplicated, and just about
as many ways for it to get lost (or changed without change control) too.

Sure, labels are arbitrary -- at least to the machine.  They are not,
necessarily, arbitrary to the human who creates them though.

In any case the label doesn't have to be _guaranteed_ to be unique to be
useful to both the human and the machine.

Also, the filesystem identifier doesn't have to be a meaningless lengthy
string of impossible to memorize sequences of digits to be useful to the
system either -- a human created, human meaningful, label can be just as
useful to the machine.


 I think labels should be resolved by some name service.  It's not
 different than /etc/hosts - IP address.

Sorry, but I'm flabbergasted!   What the heck does that mean in this
context of filesystem identification?

Do you really want to add more complexity, goo, and mess, and places for
errors to happen by adding a translation layer?

First off, there's really nowhere to store your magical mappings.

K.I.S.S.  Please!

We do have a place to store a human readable/meaningful filesystem
identifier.

Let the human provide this label.

If the system finds duplicate labels then tell the human which devices
have conflicting labels and where those filesystem were last mounted and
let the human decide which device should be used.  (i.e. the labels do
need to be unique for a successful automatic initialisation of the
system, but there needs to be a manual way to work around them not being
unique regardless of what data they consist of)

In my opinion the fs_id value is truly useless anywhere outside of the
on-disk storage of a single filesystem copy where its sole valid use is
(IIUC) to help to match valid backup superblock copies.  In fact I'm
not even sure it's safe or sane to derive the NFS filesystem filehandle
from it in any way.


-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-11 Thread Greg A. Woods
At Fri, 12 Mar 2010 00:35:24 +0900, Masao Uebayashi uebay...@gmail.com wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
 
 Speaking of tracking state...  I've found that keeping track of state
 in devfsd is very wrong.

Indeed -- I do agree with that much at least!

I've had diskless systems running for a long while now (since 2003)
where /dev is created by init(8) on every boot (by running
/sbin/MAKEDEV, as I've renamed it).

In the extremely rare cases where I've wanted to change permissions or
similar on a device node I can just use the normal commands:

chmod 666 /dev/tty001

and if I want to make such a change persistent across boots I just add
that exact same command to /etc/rc.local.

There's no magic needed.

I think the only key feature necessary is that devfs handle the normal
permissions and ownership changes, but to do so of course with no more
persistence than tmpfs, md, or mfs.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-07 Thread Greg A. Woods
At Sun, 7 Mar 2010 20:50:03 +, Quentin Garnier c...@cubidou.net wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
 
 On Sun, Mar 07, 2010 at 06:43:49PM +0900, Masao Uebayashi wrote:
 [...]
 You're barking up the wrong tree.  What's annoying is not that the
 numbering changes.  It is that the numbering is relevant to the use of
 the device.  I expect dk(4) devices to be given names (be it real names
 or GUIDs), and I expect to be able to use that whenever I currently have
 to use a string of the form dkN.

Indeed.  This needs carving in stone somewhere, since folks seem to
forget it.  I think even I have been known to forget it sometimes.  ;-)

 Wrong.  Device numbers should be irrelevant to anything but operations
 on device_t objects.

Indeed.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




Re: blocksizes

2010-02-01 Thread Greg A. Woods
At Mon, 1 Feb 2010 15:34:39 -0500 (EST), der Mouse mo...@rodents-montreal.org 
wrote:
Subject: Re: blocksizes
 
  This can easily happen if you copy the image between disks with
  different block sizes.
 
 Now _there_ is a valid argument for doing everything in terms of bytes
 (as was discussed briefly upthread).

Indeed.  Or at least using only _one_ logical block size that's
consistent for the system across all hardware that can be used by the
system.

Otherwise one must have a working equivalent NetBSD system that can make
use of both kinds of disks in order to copy an image from one kind of
disk to another.  Instead I think it would be best to be able to use any
kind of host system to make an image copy of a NetBSD disk even across
disks with different sector sizes, i.e. without having to use a system
which can understand both the on-disk filesystem and how it deals with
different hardware sector sizes.
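
I.e. ideally something as dumb as the following (device names are only
placeholders) would be enough, no matter what the native sector sizes of
the source and target disks happen to be:

    dd if=/dev/rwd0d of=netbsd-disk.img bs=64k    # image the source disk as a byte stream
    dd if=netbsd-disk.img of=/dev/rsd9d bs=64k    # write it out to a different kind of disk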

In the pure sense of trying to do what's most optimal for a given system
on a given type of hardware, I think I can understand the desire to use
the hardware sector size, or multiples thereof, in the disk driver and
to map logical sectors to match.  However for a portable system I think
the on-disk filesystem representation should try to use a single logical
sector size across all hardware.

I hesitate to say even this much, never mind any more, because I still
feel like I'm sitting firmly and safely on the fence.  :-)

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099http://www.planix.com/




Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-31 Thread Greg A. Woods
At Sat, 30 Jan 2010 19:35:47 -0500, Thor Lancelot Simon t...@panix.com wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD   netgraph
 
 As far as I know, the standard *is* MP.  MLPPP -- in my years-ago
 experience anyway -- was Livingston's proprietary predecessor of the
 standard protocol; they don't interoperate.

Well long ago there was RFC 1717, which was written by authors from
Newbridge, UCB, and Lloyd Internetworking, and indeed the title of that
RFC appears to abbreviate PPP Multilink Protocol to MP (though
perhaps it should be called PPP-MP).  There was also a protocol from
Ascend called Multichannel Protocol Plus (MP+) and I don't know if/how
it was related to PPP-MP.  Livingston did support RFC 1717 and they also
called it MP, or sometimes multi-line load balancing.  If I remember
correctly Lucent bought Livingston, then Ascend.

Initially I need to inter-operate with a concentrator running MPD on
FreeBSD using Netgraph, thus ng_ppp(4), which implements RFC 1990 PPP
Multilink Protocol, probably using UDP encapsulation. (RFC1990 obsoletes
RFC1717)

Porting Netgraph still seems to be the most optimal solution all round,
though perhaps not with the fastest result, unless I can get help on the
FreeBSD side at making the code more portable.

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099   http://www.planix.com/


pgpZiiprzulJ0.pgp
Description: PGP signature


Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-30 Thread Greg A. Woods
At Sat, 30 Jan 2010 11:37:41 +0900, Masao Uebayashi uebay...@tombi.co.jp 
wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> What you need is something like npppd/pipex which OpenBSD has just imported?


Not as it is, as far as I can tell.  (I don't see any new documentation
imported for it -- just a couple of kernel files and the usr.sbin/npppd
stuff, also without manual pages it seems, sigh.)

Does it actually do MLPPP?  I only find mention of Multilink PPP (which
they abbreviate as MP, for some silly reason) in usr.sbin/npppd/npppd/ppp.h.

usr.sbin/npppd seems to be server-only.  I need client code first, then
eventually server support.

The kernel code (if indeed it has any client code -- not sure yet)
doesn't seem to allow forwarding through UDP or TCP.  It does mention
PPTP, and PPPoE in places but those don't really help me directly.

The document I eventually found here:

http://www.seil.jp/download/eng/doc/npppd_pipex.pdf

confirms that this seems to be server/concentrator only.  (that link
sure would have helped me figure this out faster!)

The more I think about it, the more I appreciate the simple way
Netgraph modules can be composed into any graph that meets one's
current requirements -- all without recompiling anything.
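
As an illustration of that composability (FreeBSD code, of course --
this uses the netgraph(3) userland library, and the node, hook, and
type names are purely illustrative): hanging a new ng_ppp node off a
freshly made socket node takes nothing more than a generic mkpeer
message.

/* Sketch using FreeBSD's netgraph(3) userland library (link with
 * -lnetgraph): make a socket node, then attach an ng_ppp node to it
 * with a single generic mkpeer message. */
#include <sys/types.h>

#include <netgraph.h>
#include <netgraph/ng_message.h>

#include <err.h>
#include <string.h>

int
main(void)
{
	struct ngm_mkpeer mkp;
	int csock, dsock;

	/* our own endpoint in the graph; dsock would carry data frames */
	if (NgMkSockNode("mlppp_ctl", &csock, &dsock) == -1)
		err(1, "NgMkSockNode");

	/* attach a new ng_ppp node: our hook "bypass" to its "bypass" */
	memset(&mkp, 0, sizeof(mkp));
	strlcpy(mkp.type, "ppp", sizeof(mkp.type));
	strlcpy(mkp.ourhook, "bypass", sizeof(mkp.ourhook));
	strlcpy(mkp.peerhook, "bypass", sizeof(mkp.peerhook));
	if (NgSendMsg(csock, ".", NGM_GENERIC_COOKIE, NGM_MKPEER,
	    &mkp, sizeof(mkp)) == -1)
		err(1, "mkpeer ng_ppp");

	/* further mkpeer/connect messages would add ng_ksocket links, etc. */
	return 0;
}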

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099   http://www.planix.com/


pgpQ5Byu9pjMk.pgp
Description: PGP signature


Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-30 Thread Greg A. Woods
At Sat, 30 Jan 2010 15:11:03 -0600, David Young dyo...@pobox.com wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> On Sat, Jan 30, 2010 at 03:59:29PM -0500, Greg A. Woods wrote:
> > The kernel code (if indeed it has any client code -- not sure yet)
> > doesn't seem to allow forwarding through UDP or TCP.  It does mention
> > PPTP, and PPPoE in places but those don't really help me directly.
>
> You can operate gre(4) over UDP without involving userland, does that
> help any?

Well, if/when whatever does client-side MLPPP can be configured to use
GRE tunnels as members of a bundle, and assuming I can convince MPD on
the server side to stick an ng_gre node in before the ng_ppp node on each
incoming bundle, then yes, it would help.

Ideally, though, I just want to encapsulate the PPP frames in UDP to be
directly compatible with MPD on the server side.
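
As I understand MPD's udp link type, each member link is just an
ng_ksocket node, so on the wire each UDP datagram payload is one PPP
frame for that link; a very rough userland approximation of one such
link is nothing more than a connected UDP socket (the address and port
below are placeholders):

/* Rough approximation of one MPD "udp" member link (an ng_ksocket node
 * in the real thing): a connected UDP socket where each datagram
 * payload carries one PPP frame. */
#include <sys/types.h>
#include <sys/socket.h>

#include <netinet/in.h>
#include <arpa/inet.h>

#include <err.h>
#include <stdint.h>
#include <string.h>

static int
open_link(const char *peer_ip, uint16_t peer_port)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_port = htons(peer_port);
	if (inet_pton(AF_INET, peer_ip, &sin.sin_addr) != 1)
		errx(1, "bad address %s", peer_ip);
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		err(1, "connect");
	return s;	/* write() one complete PPP frame per datagram */
}

int
main(void)
{
	int link0 = open_link("192.0.2.1", 6000);	/* placeholder peer */

	(void)link0;	/* a real client would feed MP fragments here */
	return 0;
}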

> Is MLPPD necessary/desirable for some reason?

I'm not sure what MLPPD is -- did you mean MLPPP?  If so, then yes,
MLPPP is currently a core feature of the project I'm working on.

(MLPP is something else entirely, I think -- the closest thing to a
network protocol I can find is MLPP-over-IP.)

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099   http://www.planix.com/


pgppY6nawJxxO.pgp
Description: PGP signature


kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-29 Thread Greg A. Woods
 read this far, but who doesn't yet know so much
about Netgraph, to have a look at Archie Cobbs' DaemonNews article and
Julian's slides describing what's been worked on in Netgraph more
recently:

http://people.freebsd.org/~julian/netgraph.html
http://people.freebsd.org/~julian/BAFUG/talks/Netgraph/Netgraph.pdf

(BTW, Kohler's Click Modular Router is another interesting project!)

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099   http://www.planix.com/


pgpfCHbBPXKwo.pgp
Description: PGP signature


Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-29 Thread Greg A. Woods
At Fri, 29 Jan 2010 14:43:38 -0600, David Young dyo...@pobox.com wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> On Fri, Jan 29, 2010 at 02:56:31PM -0500, Greg A. Woods wrote:
> > I need advanced kernel-level multilink PPP (MLPPP) support, including
> > the ability to create bundle links via UDP (and maybe TCP) over IP.
>
> Why do you need kernel-level multilink PPP support?  Do you need to
> interoperate with existing multilink PPP systems?

Partly, but the biggest concern is performance.

I.e.:

1. We absolutely do need to use MLPPP.  We do control both ends of the
connection, and we may someday look at other protocols, but our current
production head-end concentrators are using MLPPP.

2. We also need to do it over multiple connections that are up to many
tens of megabits/sec each, perhaps sometimes even 100mbps each.  Home
cable connections are now 10-50mbps down or more in many places, and
truly high-speed ADSL2 is also growing in availability.  We aggregate
such connections for both speed and reliability reasons.

Our current low-end FreeBSD-based CPE device, which has a board with a
500 MHz AMD Geode LX800 on it, when connected to a 50mbps+2mbps cable
connection that has been split into two tunnels, can achieve 8 mbps max
(download) with userland MLPPP, period; but as much as 34mbps with MPD
using Netgraph MLPPP via UDP, and that was just a quick & dirty test
without tuning anything or using truly independent connections.

As I'm sure you know, it's just not feasible to move data fast enough in
and out of userland to split and reassemble packets on commodity
CPE devices.  We also need to do ipsec (with hardware crypto), ipfilter,
ethernet bridging and vlans, etc., all on the same little processors.
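
Some back-of-envelope numbers (assuming ~1500-byte frames and the
two-tunnel split above) show how small the per-fragment budget really
is, and why two extra trips across the user/kernel boundary per
fragment hurt so much:

/* Back-of-envelope per-fragment CPU budget; all numbers illustrative. */
#include <stdio.h>

int
main(void)
{
	const double cpu_hz = 500e6;		/* Geode LX800 */
	const double link_bps = 50e6;		/* 50 mbps download */
	const double frame_bits = 1500 * 8.0;	/* full-size frames */
	const int nlinks = 2;			/* split into two tunnels */

	double frames_per_sec = link_bps / frame_bits;	 /* ~4167 */
	double frags_per_sec = frames_per_sec * nlinks;	 /* ~8333 */
	double cycles_per_frag = cpu_hz / frags_per_sec; /* ~60000 */

	printf("%.0f frames/s -> %.0f fragments/s -> ~%.0f cycles each\n",
	    frames_per_sec, frags_per_sec, cycles_per_frag);
	return 0;
}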

-- 
Greg A. Woods
Planix, Inc.

wo...@planix.com   +1 416 218 0099   http://www.planix.com/


pgpezIhV4d9wy.pgp
Description: PGP signature