Re: Upgrade of current pkgsrc fails due to gtk3 on 10.99

2024-04-16 Thread Greg Troxel
Riccardo Mottola  writes:

> should I unkeep all stuff I don't know, including dependencies?
> sometimes packages are important, but unknown to me since they are
> dependencies.
> Python is an extreme example: I don't want it (except 2.7 core I need
> to build certain things) but it is pulled in as dependency and all its
> modules too.. from various software, where maybe it is not even a core
> thing.

'keep' is about a package being actively desired.  "pkgin ar" will
remove packages if a) they are not actively desired and b) they are not
a dependency of some package that is being kept.

So if you don't actually want it, uk it.  If it's needed because of
something you do want, it will stay.
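The keep/autoremove rule can be sketched as a toy simulation.  This is
plain shell illustrating the rule, not pkgin itself; the package names
and the one-level dependency graph are invented for the example:

```shell
# Toy illustration of the "pkgin ar" rule: a package survives if it is
# keep-marked, or if a kept package depends on it.  NOT pkgin itself;
# names and deps here are made up.

deps="py311-paho-mqtt python311
firefox gtk3"                       # "package dependency" pairs
keep="py311-paho-mqtt firefox"      # keep-marked packages

wanted=$keep
for k in $keep; do                  # pull in dependencies of kept packages
    wanted="$wanted $(echo "$deps" | awk -v p="$k" '$1 == p {print $2}')"
done

for pkg in python311 gtk3 py27-expat; do
    case " $wanted " in
        *" $pkg "*) echo "$pkg: stays (kept or needed)" ;;
        *)          echo "$pkg: autoremoved" ;;
    esac
done
```

Here python311 and gtk3 stay because kept packages need them, and the
orphan py27-expat goes.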

I probably should have python311 marked keep on a few systems because I
run scripts outside of pkgsrc, but I have e.g. py311-paho-mqtt marked
keep (because the script uses that) so I don't mark python.

> I installed pkgin, I don't have it since I am using current, there is
> no repository. I understand that for the operation above no repo is
> needed though.

Correct; it doesn't need one.  However, you can make one by creating a
summary file from the binary packages you have created (since of course
you keep them :)

  # regenerate for pkgin
  pkgsum () { ls *gz | xargs pkg_info -X | bzip2 > pkg_summary.bz2; }

> # pkgin sk
> processing remote summary
> (https://cdn.netbsd.org/pub/pkgsrc/packages/NetBSD/x86_64/10.99.10/All)...
> pkgin: Could not fetch
> https://cdn.netbsd.org/pub/pkgsrc/packages/NetBSD/x86_64/10.99.10/All/pkg_summary.gz:
> Not Found
>
> Do you suggest just pointing it to 10.0 repo to silence it? I don't
> want to mix current with binaries.

No, I suggest making a summary file from your own packages and pointing
pkgin at that.  Either that, or ignoring the warning, or configuring no
repo at all and seeing how that goes.
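For example, after generating pkg_summary next to your packages, a
pkgin repository file can point at them with a file:// URL.  The paths
below are assumptions; adjust to wherever your packages actually live:

```
# /usr/pkg/etc/pkgin/repositories.conf -- local packages only
file:///usr/pkgsrc/packages/All
```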

> I went for another route.  I tried pkg_delete.  I hope this one
> correctly keeps track of dependencies with versions!  It seems to.
>
> pkg_delete py39-tomli-2.0.1nb1
> Package `py39-tomli-2.0.1nb1' is still required by other packages:
>     py39-pyproject_hooks-1.0.0nb1
>     py39-build-1.2.1
>     py39-setuptools_scm-8.0.4
>     py39-hatchling-1.22.5
>
> Then I went up and tried to remove these py39 packages.. and up to the
> tree... I was able to remove them all! No non-python package depended
> on them up to python39 removal. I did the same for python 3.10 and 3.8
> stuff, so I end up with py2.7 and py3.11 which is fine!

If you had done uk and then ar, this would probably have happened for
you automatically.

> I hope I really didn't break anything.
> I wonder if this is automated by your steps above? I thought
> pkg_rolling-replace -uv would take care of it.

No, pkg_rr does not have any code to deal with multi-version.

> yay :) However lintpkgsrc is less happy:
>
> Scan Makefiles: 19598 packages
> Version mismatch: 'py-cairo' 1.18.2nb6 vs 1.26.0nb1
> Unknown package: 'py-gobject' version 2.28.7nb5
> Unknown package: 'py-gtk2' version 2.24.0nb48
> Version mismatch: 'py-setuptools' 44.1.1nb1 vs 69.5.1
>
> again here I think the multi-version is what bites me:
>
> # pkg_info | grep setuptools
> py27-setuptools-44.1.1nb1 New Python packaging system (python 2.x version)
> py311-setuptools-69.5.1 New Python packaging system

Yes, more hacking is needed to really fix all this.

I am just pointing out that uk/ar hygiene avoids a lot of grief.



Re: Upgrade of current pkgsrc fails due to gtk3 on 10.99

2024-04-15 Thread Greg Troxel
Riccardo Mottola  writes:

> *** pkg_chk reports the following packages need replacing, but they
> are not installed: py311-tomli
> *** Please read the errors listed above, fix the problem,
> *** then re-run pkg_rolling-replace to continue.

> # pkg_info | grep tomli
> py310-tomli-2.0.1nb1 Lil' TOML parser
> py39-tomli-2.0.1nb1 Lil' TOML parser

This is pkg_rr inheriting pkg_chk's inability to cope with
multi-version packages.

> Does pkg_rr has a "cache" or is it fresh calculated information?

There is no pkg_rr cache but rebuild, unsafe_depends and mismatch are
stored as package variables.

> Again python modules pain :)

Yes, all flowing from python's lack of API compat.

This may not help, but I recommend:

  pkgin sk
  pkgin uk foo # for any kept foo that you don't  actually want
  pkgin ar

which gets rid of a lot of stuff that you don't need -- and don't need
to rebuild.

After that, if you have  py310-foo, then

  make PYTHON_VERSION_REQD=310 replace

in foo (or something very close to that as I haven't done this in
months).
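A hedged sketch of that loop (untested, and the helper name and the
devel/py-tomli example path are mine, not from the original message);
it only prints the commands so they can be reviewed before running:

```shell
# Build the "make replace" command for one py310 package, given its
# pkgsrc PKGPATH (e.g. devel/py-tomli -- an assumed example).
mkcmd() {
    echo "cd /usr/pkgsrc/$1 && make PYTHON_VERSION_REQD=310 replace"
}

# On a real system (guarded so this is a no-op elsewhere), feed it the
# recorded PKGPATH of every installed py310 package:
if command -v pkg_info >/dev/null 2>&1; then
    for pkg in $(pkg_info -e 'py310-*'); do
        mkcmd "$(pkg_info -Q PKGPATH "$pkg")"
    done
fi

mkcmd devel/py-tomli    # example invocation
```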


Re: Upgrade of current pkgsrc fails due to gtk3 on 10.99

2024-04-14 Thread Greg Troxel
Riccardo Mottola  writes:

> I am running 10.99.10 and updated pkgsrc and want to upgrade with
> pkg_rolling-replace
>
> gtk3 fails with the error below.

This is about 'make replace', not pkg_rr.

> I see the issue on freetype, which is a little scary.
> The blocking problem though is that libgdk-3 has undefined references
> to X symbols!

gtk3 is buggy.  It tries to link against installed libs during the
build, instead of only the libs being built.

I did
  pkg_delete -f gtk3+
and then it builds fine.

Actually fixing this is harder; you'll have to find where gtk3+'s build
manages to have /usr/pkg/lib in the search path first.  This sort of
thing is gradually getting better, but it's hard.  And, people who only
build with pbulk don't hit it, so the set of motivated people is smaller.  However,
it will probably only take one person who is both motivated and has a
round tuit to fix this.



Re: make replace failing with python module conflicts

2024-03-18 Thread Greg Troxel
Riccardo Mottola  writes:

> I have the issue below, python upgrade conflicts with its modules.
>
> Usually I just remove these modules and retry, but here the dependency
> tree is broader.
> What's the best way to handle this?  Beyond hating the python hell.

> ===> Updating using binary package of python311-3.11.8
> /usr/bin/env  /usr/pkg/sbin/pkg_add -K /usr/pkg/pkgdb -U -D
> /usr/pkgsrc/lang/python311/work/.packages/python311-3.11.8.tgz
> pkg_add: Package `python311-3.11.8' conflicts with
> `py311-cElementTree-[0-9]*', and `py311-cElementTree-3.11.5' is installed.
> pkg_add: Package `python311-3.11.8' conflicts with `py311-expat-[0-9]*',
> and `py311-expat-3.11.5nb1' is installed.

Two options:

  A) pkg_delete -f the py311-foo that are now included in python base
  package

  B) using pbulk, a separate machine, etc., build a binary package set of
  everything you need and update with pkgin

  C) use someone else's binary packages, but this is like B
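Option A, sketched with the two module names from the error messages
above.  The commands are printed rather than executed so they can be
inspected first; -f is needed because other packages still record
dependencies on the modules being deleted:

```shell
# Sketch of option A: force-delete the py311 modules that are now part
# of the python311 base package, then replace python311 itself.
for pkg in py311-cElementTree py311-expat; do
    echo "pkg_delete -f $pkg"
done
echo "cd /usr/pkgsrc/lang/python311 && make replace"
```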



Re: Upgrading a 90s laptop from 5.1 to 10 -- no FD or CDROM

2023-12-21 Thread Greg Troxel
"Jeremy C. Reed"  writes:

> On Wed, 20 Dec 2023, Greg Troxel wrote:
>
>>   - beware that 5->10 is almost certainly not going to work.   I have
>> generally been doing N->N+1 on most machines, but had occasion to do
>> 5->9.  I found that the 9 kernel would not boot.   I then tried 7,
>
> Does this mean the /netbsd 9 kernel itself wouldn't boot (never got to 
> attempting /sbin/init)? 
> Or that that /netbsd 9 kernel couldn't run NetBSD 7 /sbin/init and 
> /etc/rc, etc?
> Or that the /netbsd 9 kernel couldn't run the NetBSD 9 /sbin/init etc, 
> maybe due to some left over NetBSD 5 configs or executables? If so, how 
> did you recover?

It means exactly that I tested doing a 5->9 upgrade on duplicate
hardware, leaving it in the basement and pretending I couldn't walk
down, meaning both not being able to touch the keyboard, power button,
reset button and not being able to see the monitor.  I got into a wedged
state where I couldn't reach the machine enough that I concluded it
would not work.  I then did 5->7 and 7->9 successfully locally, and
remotely.

Fuzzy memory that came back to me is that netbsd-5's ifconfig fails to
run properly with a netbsd-9 kernel.  If that fuzzy memory is correct, then one
should be able to unpack the user sets and then be ok.  But I don't like
these situations of being broken and hoping step 2 will fix it.   I
prefer to see the system usable with the new kernel.

Hence my recommendation to upgrade to 7 and then 9 or 10.  If 10 boots
to multiuser with remote ssh with 7 userland, then sure, install the 10
userland sets.


Re: Upgrading a 90s laptop from 5.1 to 10 -- no FD or CDROM

2023-12-20 Thread Greg Troxel
jo...@sdf.org writes:

> The days are short and work has slowed down. 'Tis the season to get the
> old hardware out!

I know what you mean.

> I have a '98 Toshiba Satellite Pro laptop running a 2010 build of NetBSD
> 5.1. It runs great including X11. No tmux though :(
>
> So how do I get NetBSD 10 on this? There are some challenges.
>
> I have a pcmcia 3Com NIC and so internet access (IPv6 only!). But the FD
> adapter and the CDROM bay both died years ago.
>
> I can plug in a pcmcia CF card and mount it. The old BIOS can't boot from
> that though. There's also a single USB revision 1.0 port.
>
> Take the old HD out and try to put a boot image on it?

I have been doing inplace upgrades on systems for years, almost entirely
successfully.  My scripts are in pkgsrc/sysutils/etcmanage, which in
addition to etcmanage has BUILD-NetBSD and INSTALL-NetBSD.  This will
seem like a lot, but I find once I'm set up, updating is very easy and
reliable.

  - set up etcmanage.  It seems to take people a while to get their head
around it but the point is that on update, if there is a file in
/etc which is unmodified relative to what some previous
install/update did, then it will be made to match the new install.
And if not, it will be left alone.  Then the human can
review/merge/fix.  You need it on the machine to be updated and, to
some extent, on the release build machine.

  - Read BUILD-NetBSD and INSTALL-NetBSD.  They are not mysterious.

  - Run BUILD-NetBSD someplace to build.  It's basically a vanilla
build.sh, but it prepares etcmanage checksums suitable for
bootstrapping.  etcmanage lacks support for xz sets, simply due to
round tuits and the "USE_XZ_SETS=no" workaround working very well.

  - Bring the sets to the computer to be updated.  Bring them all; do
not bring a kernel and try that.  Instead, get all the bits safely
there so when you do install the kernel and reboot the rest will be
there already.  Back up /netbsd to /netbsd.ok.  Run "INSTALL-NetBSD
installkernel" and reboot.  If that is ok, run "INSTALL-NetBSD
installuser", and then "postinstall -s /usr/netbsd-etc fix" after
the installuser step.  Note that installuser will unpack the etc and
xetc sets to /usr/netbsd-etc, and the rest to /.

  - When things are ok clean out /stand (old modules).

  - beware that 5->10 is almost certainly not going to work.   I have
generally been doing N->N+1 on most machines, but had occasion to do
5->9.  I found that the 9 kernel would not boot.   I then tried 7,
and 5->7 worked, and 7->9 worked.   The downside is just boot
netbsd.ok, so it's not that bad, but I would go to 7 first, and then
to 9 or 10.

I advise learning all this, especially etcmanage, on a faster machine
on which it is easier to deal with problems.  I don't really expect
much trouble, but still.

I did the 5->7->9 upgrade on a remote machine where I lacked console
access.  That was of course scary, and it only succeeded because I
tested the steps on duplicate hardware.  Had there not been a 5->9
hiccup it would have been fine.

You are doing local, so beware that a 7 kernel might not work with 5
userland to configure networking, perhaps firewall.  It's coming to me
that my issue was that 5 ifconfig could not set up networking with a 9
kernel.  So it's possible you could do 5->9 since you can type on it.

Of course, you can do this all by hand.  I find it more reliable to have
debugged the scripts and then use them.

BTW 5 was amazing.  It was a release that ran forever absent power/hw
issues. I know that's the netbsd way but it seemed extra solid.

> total memory = 81660 KB
> avail memory = 75672 KB

That's tight.  I do wonder how the newer systems will go.  You may want
to build a slimmed-down kernel.  You might even want MODULAR but I'm not
sure that's really baked enough for production.  It may well be; I just
haven't tried it.

> apm0 at mainbus0: Advanced Power Management BIOS: Power Management spec V1.2

I dimly think apm might have been deleted.  Probably doesn't matter.


Re: random lockups (now suspecting zfs)

2023-11-08 Thread Greg Troxel
Stephen Borrill  writes:

> On Sat, 4 Nov 2023, Simon Burge wrote:
>> Greg Troxel wrote:
>>
>>> So to me this feels like a locking botch in a rare path in zfs.
>>
>> This appears to be the case.  Chuck Silvers has some understanding of
>> the problem and I'm helping test, but at this stage there isn't a fix
>> available. :/
>
> It's interesting that you see the lockups during pkgsrc builds, i.e. a
> period where there is lots of file creation. We use zfs on backup
> systems that pull in data with rsync. During the initial runs (where
> every file is new) we usually get a couple of lockups, but during day
> to day operation (few changes) it is reliable. These are on physical
> and virtual machines running NetBSD 9 with the rule of thumb of 1GB
> RAM per TB of storage obeyed, but no patches besides setting MAXPHYS
> in the module to 32k for Xen.

I just had another problem, on the non-xen 32GB machine (which has 3.5T
in the pool, only half full).

The machine wasn't really doing much; X running with xfce, a few xterms,
ssh client, pidgin, and idling firefox with I think 24 tabs.

I found it mostly normal and was using an ssh session, and then switched
to the firefox virtual desktop which failed to redraw.  I tried to kill
firefox (because firefox hanging is not so strange :-() and found that a
few of the tabs appeared to be stuck in flt_noram and zio_buf.  I think
there might have been a different wchan earlier that was zfs but not
zio_buf.

I think it got in this state due to firefox leaking memory (in SIZE but
not RES?).

(So it might be a missing wakeup on flt_noram, but lock not released
seems plausible also.  Totally guessing here.)

(As I was composing this message (in tmux on another machine), the
firefox lockup deteriorated to more things hanging and then a total
lockup.  I was unable to ctrl-alt-f1 to get back to the text console.
It is responding to mdns queries and pings, and sshd answers, but I see
the "local version string" line and the "remote protocol" line does not
appear.)

I should try LOCKDEBUG on the package building box (where if it doesn't
work right that's much more ok!).



10853   129  9817  0  85  0  2762264 180704 uvnfp2   DEl  ttyp5    7:12.20 (firefox)
10853   994  9817  0   0  0  0      0 -        Z    ttyp5    0:00.00 (firefox)
10853  1867  9817  0  85  0  3423848 723944 uvnfp2   DEl  ttyp5  146:52.32 (firefox)
10853  7407  9817  0  85  0 20169184 355160 flt_nora DEl  ttyp5   63:48.76 (firefox)
10853  7630  9817  0   0  0  0      0 -        Z    ttyp5    0:00.00 (firefox)
10853  8451  9817  0  85  0  2712376 126588 uvnfp2   DEl  ttyp5    7:09.93 (firefox)
10853  8504  9817  0  85  0  2744608 143008 uvnfp2   DEl  ttyp5    9:56.45 (firefox)
10853  9817     1 21 117  0 12939188 948252 zio_buf_ DEl  ttyp5  303:41.53 (firefox)
10853 11066  9817  0  85  0  2821832 225664 >db_     DEl  ttyp5    1:34.01 (firefox)
10853 11769  9817  0  85  0  2849780 232172 uvnfp2   DEl  ttyp5    9:19.27 (firefox)
10853 12055  9817  0  85  0  2832852 144304 uvnfp2   DEl  ttyp5    8:49.22 (firefox)
10853 13075  9817  0  85  0  2782516 193652 plpg     DEl  ttyp5    9:00.21 (firefox)
10853 15399  9817  0  85  0  2822236 249496 uvnfp2   DEl  ttyp5   10:12.41 (firefox)
10853 15991  9817  0  85  0  2775316 187104 uvnfp2   DEl  ttyp5    7:13.63 (firefox)
10853 16033  9817  0   0  0  0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 16877  9817  0  85  0  2731156 148896 uvnfp2   DEl  ttyp5    1:59.22 (firefox)
10853 17275  9817  0   0  0  0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 19768  9817  0  85  0  2760188 152880 uvnfp2   DEl  ttyp5    7:11.17 (firefox)
10853 21618  9817  0   0  0  0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 24342  9817  0  85  0  2737588 148452 uvnfp2   DEl  ttyp5   11:51.61 (firefox)
10853 24956  9817  0  85  0  2981764 336852 uvnfp2   DEl  ttyp5   20:20.13 (firefox)
10853 26368  9817  0  85  0  3164560 240992 uvnfp2   DEl  ttyp5   19:28.72 (firefox)
10853 26981  9817   1123  85  0  3659088 770432 flt_nora DEl  ttyp5   84:09.22 (firefox)
10853 27139  9817  0   0  0  0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 29076  9817   2270  85  0  2975552 261064 flt_nora DEl  ttyp5   88:44.15 (firefox)

top says

Memory: 14G Act, 6989M Inact, 88M Wired, 549M Exec, 13G File, 228M Free
Swap: 40G Total, 348M Used, 40G Free / Pools: 11G Used

so it did get into paging

vmstat -s:
 4096 bytes per page
   16 page colors
  8079588 pages managed
58419 pages free
  3733123 pages active
  1789074 pages inactive
1 pages paging
22427 pages wired
1 reserve pagedaemon pages
   40 reserve kernel pages
   252503 boot kernel pages
  2817259 kernel pool pages
  2027769 anonymous pages
  3376469 cached fil

Re: Static IPv6, dhcpcd, and defaultroute6 issue

2023-11-05 Thread Greg Troxel
jo...@sdf.org writes:

> I want to configure a static IPv6 along with a DHCP IPv4 on my Rock64
> running NetBSD10_beta. However, I'm not able to get the default route for
> IPv6 set on boot.
>
> Here are my interface specific dhcpcd.conf entries:
> interface awge0
> noipv6rs
> static ip6_address=2001:x:x:x/64
>
> Entries in rc.conf:
> dhcpcd=YES
> ip6mode=host
> defaultroute6="fe80::x:x:x%awge0"
>
> After reboot no default is listed under Internet6 output for route show.
> Explicitly doing route add -inet6 default fe80::x:x:x%awge0 fixes this,
> and my static IPv6 works perfectly.
>
> What makes me dig deeper is I can instead do a service network restart,
> and the defaultroute6 in rc.conf is applied correctly.
>
> Is there a potential race condition in the network rc script with dhcpcd?

This feels like a slightly odd way to do things.  If you want a static
ip6, why don't you create ifconfig.awge0 and put in

  inet6 2001:x:x:x/64

Also, why don't you want RS?  but that's your call.

I wonder if defaultroute6 doesn't work because the interface is not yet
up.   You are, I think, relying on dhcpcd enumerating interfaces,
bringing them up, and trying to configure them.   And that, I bet, runs
after the network script:

  /etc/rc.d $ rcorder  *|egrep 'network|dhcp'
  network
  dhcpcd
  dhcpd
  dhcpd6

You should also be able to put in an ifconfig.awge0 that has (not
tested!):

  !/sbin/route add -inet6 default fe80::foo

after "up" and "inet6".
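Putting those pieces together, such an /etc/ifconfig.awge0 might look
like this (untested; the addresses are placeholders from the
documentation prefix, so substitute your real prefix and gateway):

```
# /etc/ifconfig.awge0 -- untested sketch with placeholder addresses
up
inet6 2001:db8:1:1::2/64
!/sbin/route add -inet6 default fe80::1%awge0
```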


Re: updating kernel AND modules

2023-11-04 Thread Greg Troxel
Thomas Klausner  writes:

> The NetBSD guide does not talk about kernel modules at all in the
> updating section
> (https://www.netbsd.org/docs/guide/en/chap-kernel.html,
> http://netbsd.org/docs/guide/en/chap-updating.html)
>
> What is the current best-practice method for that?

My update method is INSTALL-NetBSD from etcmanage, which:

  [first you run build.sh release]
  installs a new kernel

  unpacks all the sets which are not etc and xetc, which includes the
  module set

  puts etc and xetc in /usr/netbsd-etc and merges that to etc, not
  relevant to your question


Basically, the kernel is the combination of /netbsd and the modules,
and you need to update both at once.


Re: random lockups (now suspecting zfs)

2023-11-04 Thread Greg Troxel
Simon Burge  writes:

> Greg Troxel wrote:
>
>>  Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
>>> A different machine has locked up, running recent netbsd-10.  I was
>>> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
>>> physical.  It has a private patch to reduce the amount of memory used
>>> for ARC, which has been working well.
>
> Are you still seeing the problem below even with limiting the amount of
> memory ARC can use?

Yes.  I have been running with limited ARC for a long time, since when I
posted my patch.  I find that just doing lots of zfs activity, enough
that I would have over-used RAM for ARC, is ok.  On my 32G system, my
boot messages are

  ARCI 002 arc_abs_min 16777216
  ARCI 002 arc_c_min 1067485440
  ARCI 005 arc_c_max 4269941760
  ARCI 010 arc_c_min 1067485440
  ARCI 010 arc_p 2134970880
  ARCI 010 arc_c 4269941760
  ARCI 010 arc_c_max 4269941760
  ARCI 011 arc_meta_limit 1067485440
  ZFS filesystem version: 5

or about 4G for ARC.  On my  8G physical/4G dom0 system:

  ARCI 002 arc_abs_min 16777216
  ARCI 002 arc_c_min 131072000
  ARCI 005 arc_c_max 524288000
  ARCI 010 arc_c_min 131072000
  ARCI 010 arc_p 262144000
  ARCI 010 arc_c 524288000
  ARCI 010 arc_c_max 524288000
  ARCI 011 arc_meta_limit 131072000
  ZFS filesystem version: 5

it's 524 MB.   I think it would be good to commit something like my
patch, but people have said large-memory systems shouldn't have a
change.  I think that's wrong; as I see it NetBSD's code oversizes ARC
compared to upstream, for no good reason.  But the fix to that is to
make it settable and then the default isn't so important.
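The numbers in those boot messages are internally consistent; a quick
arithmetic check (constants copied from the 32G system's messages
above; the ratios reflect my private patch, not a stock NetBSD
default):

```shell
# Ratio check of the ARC sizes printed at boot on the 32G system.
arc_c_max=4269941760
arc_c_min=1067485440
arc_p=2134970880

echo "max/min ratio: $((arc_c_max / arc_c_min))"                # 4
echo "arc_p is half of arc_c_max: $((arc_p * 2 == arc_c_max))"  # 1 (true)
```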

>> >> All 3 tmux windows show something like
>> >> 
>> >>   [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 
>> >> 0% 6396k
>> >> 
>> >> and I can switch among them and ^T, but trying to run top is stuck (in
>> >> flt_noram5).  I'll give it an hour or so, and have a look at the
>> >> console.
>
> I've seen cc1plus processes wedged in either flt_noram or tstile after
> doing multiple builds, and a reboot is the only way out.  I'm using ZFS
> for everything except swap and some mostly-unused media files that live
> on an FFS.

Perhaps I failed to say that the box sometimes fails to respond to ping
when it gets like this.

>> So to me this feels like a locking botch in a rare path in zfs.
>
> This appears to be the case.  Chuck Silvers has some understanding of
> the problem and I'm helping test, but at this stage there isn't a fix
> available. :/

That's great to hear that someone has an idea.

So far I can't reproduce this on demand, but it does seem that running a
pkg_rr in the dom0 and in a domU at the same time tends to provoke it.
The domU has two virtual disks, one for files and one for swap.  Both
disks' backing files are zvols.

The files one is UFS2 (no zfs in domU).  pkgsrc, distfiles, packages,
and tmp for piggy programs are all nfs from dom0, and the dom0 is UFS2
for / and /usr, with pkgsrc,packages,distfiles,tmp-for-piggy all on zfs.

I suppose it might help for me to build with LOCKDEBUG and then try
builds.  Surely that will be slow, but is it likely enough illuminating
that it makes sense to try?


Re: file-backed cgd backup question

2023-11-01 Thread Greg Troxel
Paul Ripke  writes:

>> #!/bin/sh
>> 
>> dd if=/dev/zero of=VND bs=1m count=1
>> cat VND > VND.000
>> vnconfig vnd0 VND
>> cat VND > VND.001
>> newfs /dev/rvnd0a
>> cat VND > VND.002
>> vnconfig -u vnd0
>> cat VND > VND.003
>
> At least this DTRT:
>
> dd if=VND of=VND.004 iflag=direct

That (and thanks for the AIX note) is interesting, but I don't see that
it makes backup programs work right.

It seems that the current man page caution about consistency is correct,
and that it would be great if someone added a cache invalidate on close.
(It's a little scary to touch this code in terms of how much it could
mess people up.)


Re: random lockups (now suspecting zfs)

2023-11-01 Thread Greg Troxel
Paul Ripke  writes:

>  Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
>> A different machine has locked up, running recent netbsd-10.  I was
>> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
>> physical.  It has a private patch to reduce the amount of memory used
>> for ARC, which has been working well.

I have had an additional lockup each on my main machine and my xen/pkg
machine.

>> All 3 tmux windows show something like
>> 
>>   [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 
>> 6396k
>> 
>> and I can switch among them and ^T, but trying to run top is stuck (in
>> flt_noram5).  I'll give it an hour or so, and have a look at the
>> console.
>
> Curious - do you have swap configured? On what kind of device?
> I'm wondering if a pageout is wedged waiting for memory...

I do have swap configured.

  wd0 at atabus0 drive 0
  wd0: 
  wd0: drive supports 1-sector PIO transfers, LBA48 addressing
  wd0: 3726 GB, 7752021 cyl, 16 head, 63 sec, 512 bytes/sect x 7814037168 
sectors
  wd0: GPT GUID: 7f026840-bd44-4063-be7c-647727ac10d6
  dk2 at wd0: "GDT-3276-4/swap", 83886080 blocks at 4458496, type: swap
  root on dk1 dumps on dk2
  Device      1024-blocks     Used    Avail Capacity  Priority
  /dev/dk2       41943040        0 41943040       0%        0

  wd0 at atabus0 drive 0
  wd0: 
  wd0: drive supports 1-sector PIO transfers, LBA48 addressing
  wd0: 953 GB, 1984533 cyl, 16 head, 63 sec, 512 bytes/sect x 2000409264 sectors
  Device      1024-blocks     Used    Avail Capacity  Priority
  /dev/wd0b      16777656    49384 16728272       0%        0

The first is a GPT partition mounted by NAME, and the second is a
disklabel partition.  The first machine I don't expect to really swap,
and the second definitely has memory pressure.   Interestingly none of
the xen domUs have locked up, meaning I've never found them wedged and
the dom0 ok.

So to me this feels like a locking botch in a rare path in zfs.




Re: file-backed cgd backup question

2023-10-22 Thread Greg Troxel
mlel...@serpens.de (Michael van Elst) writes:

> g...@lexort.com (Greg Troxel) writes:
>
>>> vnd opens the backing file when the unit is created and closes
>>> the backing file when the unit is destroyed. Then you can access
>>> the file again.
>
>>Is there a guarantee of cache consistency for writes before and reads
>>after?
>
> Before the unit is created you can access the file and after the
> unit is destroyed you can access the file. That's always safe.

Sorry if I'm failing to understand something obvious, but with a caching
layer that has file contents, how are the cache contents invalidated?

Specifically (but loosely in commands)

  let's assume the vnd is small and there is a lot of RAM available

  process opens the file and reads it

  vnconfig

  mount vnd0 /mnt

  date > /mnt/somefile

  umount /mnt

  vnconfig -u

  process opens the file and reads it

Without fs cache invalidation, stale data can be returned.

If there is explicit invalidation, it would be nice to say that
precisely but I am not understanding that it is there.  Reading vnd.c, I
don't see any cache invalidation on detach.   The only explicit
invalidation I find is in setcred from VNDIOCSET.

I guess that prevents the above, but doesn't prevent

  vnconfig

  mount

  read backing file

  write to mount

  unmount

  detach

  read backing file

so maybe we need a vinvalbuf on detach?

> I also think that when the unit is configured but not opened
> (by device access or mounts) it is safe to access the file.

As I read the code, reads are ok but will leave possibly stale data in
the cache for post-close.

>>> The data is written directly to the allocated blocks of the file.
>>> So exclusively opening  the backing file _or_ the vnd unit should
>>> also be safe. But that's not much different from accessing any file
>>> concurrently, which also leads to "corrupt", inconsistent backups.
>
>>That's a different kind of corrupt.
>
> Yes, but in the end it's the same, the "backup" isn't usable.

I am expecting that after deconfiguring, a read of the entire file is
guaranteed consistent, but I think we're missing invalidate on close.

> You cannot access the backing file to get a consistent state of the
> data while a unit is in use. And that's independent of how vnd accesses
> the bits.

Agreed; that's more or less like using a backup program on database
files while the database is running.

> N.B. if you want to talk about dangers, think about fdiscard(). I
> doubt that it is safe in the context of the vnd optimization.

It seems clear that pretty much any file operations are unsafe while the
vnd is configured.  That seems like an entirely reasonable situation and
if that's the rule, easy to document.

I wrote a test script and it shows that stale reads happen.  When I run
this on UFS2 (netbsd-10), I find that all 4 files are all zero.  When I
run it on zfs (also netbsd-10), I find that 000 and 001 are all zero and
002 and 003 are the same.  (I am guessing that zfs doesn't use the
direct operations, or caches differently; here I haven't the slightest
idea what is happening.)

10 minutes later, reading VND is still all zeros.  With a new vnconfig,
it still reads as all zeros.


#!/bin/sh

dd if=/dev/zero of=VND bs=1m count=1
cat VND > VND.000
vnconfig vnd0 VND
cat VND > VND.001
newfs /dev/rvnd0a
cat VND > VND.002
vnconfig -u vnd0
cat VND > VND.003


Re: file-backed cgd backup question

2023-10-21 Thread Greg Troxel
mlel...@serpens.de (Michael van Elst) writes:

> g...@lexort.com (Greg Troxel) writes:
>
>>I dimly knew this, but keep forgetting.  Reading vndconfig(8), it does
>>not explain that the normal path leads to incorrect behavior (stale
>>reads from file cache even after closing the vnd, mtime).
>
> vnd opens the backing file when the unit is created and closes
> the backing file when the unit is destroyed. Then you can access
> the file again.

Is there a guarantee of cache consistency for writes before and reads
after?

> The data is written directly to the allocated blocks of the file.
> So exclusively opening  the backing file _or_ the vnd unit should
> also be safe. But that's not much different from accessing any file
> concurrently, which also leads to "corrupt", inconsistent backups.

That's a different kind of corrupt, which is that one is reading the
blocks that were written, but perhaps they are not consistent with each
other.  Here, it seems someone can read bits from the fs cache that were
a state that existed before vnconfig/write/vnconfig-u.

> Updating the backing file mtime on close sounds useful. I'm not sure
> what effect updating atime/mtime on every access would have.

Agreed.

>> This
>>optimization is sufficiently dangerous and not expected that it needs to
>>be documented clearly and loudly.  I just added a note to the man page.
>
> I think the reference to "ciphertext" should be adjusted and the
> text should be toned more neutral when describing the functionality.

I dropped ciphertext; this isn't cgd.

I'm not sure what you mean by neutral.  The statement "this makes a file
available as a block device" implies that at least after vnconfig -u,
reading from the file will return the last write while it was
configured.  That's a minus, and the speed is a plus.

Maybe you mean that it should tout the speed advantage more?

> Pointing to the -i option to disable the optimization unconditionally
> might also be helpful.

Good point; done.


Re: random lockups

2023-10-20 Thread Greg Troxel
A different machine has locked up, running recent netbsd-10.  I was
doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
physical.  It has a private patch to reduce the amount of memory used
for ARC, which has been working well.

All 3 tmux windows show something like

  [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 
6396k

and I can switch among them and ^T, but trying to run top is stuck (in
flt_noram5).  I'll give it an hour or so, and have a look at the
console.



Re: file-backed cgd backup question

2023-10-20 Thread Greg Troxel
mlel...@serpens.de (Michael van Elst) writes:

> vnd has an optimization where the backing file isn't touched, but
> the underlying device is accessed directly. Then file cache and
> device aren't in sync and a backup program reading the file might
> read stale data. vnd should probably update the file when
> unconfiguring, but so far it does not.
>
> The optimization is disabled under some conditions and explicitly
> if you use 'vnconfig -i'. Then all operations are done by file
> I/O and the timestamps of the backing file are maintained.
> The extra caching of course affects performance.

I dimly knew this, but keep forgetting.  Reading vndconfig(8), it does
not explain that the normal path leads to incorrect behavior (stale
reads from the file cache even after closing the vnd, and a stale
mtime).  This
optimization is sufficiently dangerous and not expected that it needs to
be documented clearly and loudly.  I just added a note to the man page.
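As a hedged sketch of the safe pattern (file and device names here are
made up): configure with -i whenever the backing file will later be read
as a file, e.g. by a backup program:

```shell
# -i forces all I/O through the backing file, so the file cache stays
# coherent and the file's timestamps are maintained (slower, but safe).
vndconfig -i vnd0 /var/images/backing.img
mount /dev/vnd0a /mnt        # ... use the block device ...
umount /mnt
vndconfig -u vnd0
# only now is it safe for a backup program to read backing.img
```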



random lockups

2023-10-18 Thread Greg Troxel
I have a 2019 Dell SFF computer, which I think has a 9th gen i7, with
32G RAM and a Samsung SSD:

  total memory = 32577 MB
  cpu0: "Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz"
  wd0: 
  i915drmkms0 at pci0 dev 2 function 0: Intel UHD Graphics 630 (rev. 0x02)

It's running netbsd-10 with modesetting, with most data on ZFS.  Except
for video artifacts, it is basically running very well.

It seemed to have no issues for a while, and then I upgraded it from an
august build:

  -rw-r--r--  1 root  wheel  24840504 Aug  6 05:51 /netbsd.ok
  -rw-r--r--  1 root  wheel  29577080 Oct  8 09:04 /netbsd

Since then, perhaps under load (building packages, rsyncing about 0.5T
to an external UFS2 disk), it has locked up twice.  Monitor in power
save mode (which it would have gone to), no ping, need to press/hold to
power off.


I realize this could be a vast number of things: flaky power, bad power
supply, bad RAM.  But it feels correlated with updating.  I think this
update included a zfs actually-return-memory fix (which is very welcome
but epsilon scary).

Is anyone else seeing problems, especially new problems with netbsd-10?


Re: Call for testing: certctl, postinstall, TLS trust anchors

2023-10-09 Thread Greg Troxel
Taylor R Campbell  writes:

>> Date: Sun, 08 Oct 2023 10:54:13 -0400
>> From: Greg Troxel 

> See above: if you know of applications that rely on /etc/openssl/certs
> for S/MIME, and it's not just a joke (which most open-ended
> interorganizational use of S/MIME that I'm aware of is --
> intraorganizational uses managed by a corporate IT department purely
> for internal or partner use aside), I'm curious to see.

I can't point to use of /etc/openssl/certs, but I get mail from random
people (mostly in .eu) on public lists with s/mime signatures, from
public CAs.  And, I have experience with actually sending S/MIME mail
back and forth with people working at a different government
contractor.  I believe this was with public CAs; the guvvies have certs
from the DoD CA.

>>   4) This is all not obvious, and
>>  a) It's not the least bit clear that the right thing is happening.
>>  b) I expect ~nobody to understand this.
>
> Is any of the text in the certctl(8) or hier(7) man pages unclear
> about this?  I tried to clarify the purpose of /etc/openssl/certs for
> TLS trust anchors specifically in that text.

What's not obvious: if you know that pkix is not just about TLS, but
are a bit unclear on how mozilla deals with things, on whether openssl
can do pkix validation in general, and on how one would do that for
non-TLS (while knowing that obviously you can), then the man page text
doesn't resolve the confusion.

What's missing is a remark that mozilla publishes two lists, one for
"servers" (which today means TLS) and one for email, and that certctl by
default configures the server list into openssl.  And that it will
therefore be used for all validation operations.  Which is pointing out
that the purpose restriction on mozilla's set isn't aligned with openssl
behavior (after all it's for nss), or that I don't understand something
else.

>> Looking in /usr/share/certs/mozilla, it continues to be non-obvious.
>> The idea that 'all' has "untrusted CAs" seems crazy; if they aren't
>> trusted, why are they in the root set, which is by definition the set of
>> CAs which meet the rules and are therefore trustworthy?
>
> `all' has everything in certdata.txt.
>
> `server' has only what Mozilla has chosen to trust for TLS.
>
> `email' has only what Mozilla has chosen to trust for S/MIME.

So yes, the "mozilla root set" contains things which are not eligible to
be in the set of configured trust anchors, because that is only things
in the set with a certain flag.  This is really surprising and I had no
idea.  I suspect many others have the same sorts of feelings.

> Yes.  The certdata.txt format has a way to say that trust anchors are
> fit for code-signing, so for completeness I exposed that via the
> directory /usr/share/certs/mozilla/code, but (a) there are no trust
> anchors in certdata.txt that use it, and (b) there is nothing in
> NetBSD that would use such trust anchors anyway.

ok, but as more than a nit, the customer set for openssl configuration
is any program anybody would build on NetBSD, not just the base system.

> These exclusions also match my knowledge of the history:

That all makes sense.

> As far as I'm aware, S/MIME is only ever seriously deployed within a
> single organization at a time (or a closed set of partnering
> organizations).  So I don't expect anything about it to seriously work
> out of the box and I have no idea what public CAs do about it.

Public CAs issue email certs to people just like they issue web server
certs.  It all works just about as well as TLS certs, in that with a
hundred CAs, it's hard to really believe anything with high assurance,
but it mostly works.  I am unclear on today's practice but in 2016 it
was normal in the big company world, the kinds of places that just could
not cope with OpenPGP.

> I'm prioritizing effort on TLS, but I _installed_ the email
> certificates as pem files under /usr/share so that they're available
> in case someone wants to do something with them like declare
> /etc/smime/certs as the place to find the trust anchors and configure
> them with certctl(8) using a different config file.

This is the part I don't get, but I need to look at pkix, to see if
there is standards support for this "different set of trust anchors
depending on flavor".

> The situation with security/mozilla-rootcerts is actually worse
> because it doesn't interpret the DISTRUST_AFTER annotations, so CAs
> that _were_ trusted for TLS but have _now_ been sunset are still
> included.  That was news to me too...

Grepping and sort/uniq:

  25 CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_MUST_VERIFY_TRUST
   1 CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_NOT_TRUSTED
 144 CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_TRUSTED_DELEGATOR

  53 CKA_TRUST_EMAIL_PROTECTION CK_TRUST CKT_NSS_MUST_VER
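That tally came from a pipeline along these lines (a hedged sketch,
shown here against a tiny fabricated excerpt so it is self-contained;
the real certdata.txt lives with the nss/certctl sources):

```shell
# Fabricate a small certdata.txt excerpt with trust-flag lines.
cat > certdata.sample <<'EOF'
CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_TRUSTED_DELEGATOR
CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_TRUSTED_DELEGATOR
CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_MUST_VERIFY_TRUST
CKA_TRUST_EMAIL_PROTECTION CK_TRUST CKT_NSS_TRUSTED_DELEGATOR
EOF
# Count each distinct trust assignment, as in the tally above.
grep CKA_TRUST_ certdata.sample | sort | uniq -c
```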

Re: Call for testing: certctl, postinstall, TLS trust anchors

2023-10-08 Thread Greg Troxel
(I've been putting off thinking about and dealing with this due to
juggling too many other things.)

Taylor R Campbell  writes:

> The new certctl(8) tool is provided to manage the TLS trust anchors
> configured in /etc/openssl/certs with a simple way to change the
> source of trust anchors or distrust individual ones -- and with a
> manual override, if you would rather use another mechanism to do it,
> like the commands available in the security/mozilla-rootcerts or
> security/ca-certificates packages, or the special-purpose
> security/mozilla-rootcerts-openssl package.

This says "TLS trust anchors", but I wonder if that's accurate.  Isn't
this "pkix trust anchors, for which the most common case is TLS"?  I
have not dug in to the openssl library calls, but my impression is that
openssl the installed software does pkix validation in general, and the
installed trust anchors will be used for invocations to validate pkix certs
separately from TLS.

After reading, I think what's going on is

  1) mozilla rootcert situation is a bit of a mess semantically
  2) certctl is installing the subset that is intended for TLS
  3) the installed certs will be used for all uses, not just TLS
  (e.g. SMIME), and because certs intended for SMIME but not "server"
  are missing, the wrong thing will happen sometimes, but because many
  CAs do both (?) it will often be ok.
  4) This is all not obvious, and
 a) It's not the least bit clear that the right thing is happening.
 b) I expect ~nobody to understand this.

Looking in /usr/share/certs/mozilla, it continues to be non-obvious.
The idea that 'all' has "untrusted CAs" seems crazy; if they aren't
trusted, why are they in the root set, which is by definition the set of
CAs which meet the rules and are therefore trustworthy?

I see code is empty.  I'm going to ignore this.

With a bit of ls/sort/uniq/comm, I see that there are certs in all that
do not appear in email or server:

  Explicitly_Distrust_DigiNotar_Root_CA.pem
  Symantec_Class_1_Public_Primary_Certification_Authority_-_G6.pem
  Symantec_Class_2_Public_Primary_Certification_Authority_-_G6.pem
  TUBITAK_Kamu_SM_SSL_Kok_Sertifikasi_-_Surum_1.pem
  TrustCor_ECA-1.pem
  TrustCor_RootCert_CA-1.pem
  TrustCor_RootCert_CA-2.pem
  Verisign_Class_1_Public_Primary_Certification_Authority_-_G3.pem
  Verisign_Class_2_Public_Primary_Certification_Authority_-_G3.pem

Looking at email vs server, I see 88 in common, 21 email only, 52 server
only.
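The set comparison can be sketched like this (hedged: on a real system
the directories would be /usr/share/certs/mozilla/{all,server,email};
tiny stand-in directories are fabricated here so the pipeline is
self-contained):

```shell
# Stand-in cert directories: all = {a,b,c}, server = {a}, email = {a,b}.
mkdir -p all server email
touch all/a.pem all/b.pem all/c.pem server/a.pem email/a.pem email/b.pem

# Sorted basename lists, one per directory.
ls all | sort > all.txt
ls server | sort > server.txt
ls email | sort > email.txt

comm -23 all.txt server.txt      # in all but not in server
comm -12 email.txt server.txt    # trusted for both email and server
```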

How is SMIME supposed to work?  Are SMIME validators, which presumably
use openssl as an engine, supposed to maintain a different trust anchor
set?  Where? 

I see that mozilla-rootcerts-openssl has 169 certificates, so that must
be all, which appears to be a really serious bug.

Maybe we don't want to deal with this, but I think it needs to be
clearer, especially as this upgrade to certctl from
mozilla-rootcerts-openssl does:

  resolves a security issue where "untrusted" certs were trust anchors (yay)

  removes trust anchors for email, likely breaking some SMIME uses (but
  not sure in practice how much, given tbird uses internal and gpg2 uses
  gnutls).  (theoretical boo)


I'll also observe that, mostly because I have avoided digging in until
today, this larger situation of the subsets, what was and what is, is
news to me.

> So it would be helpful if you could test updating NetBSD in whatever
> way you do it (sysinst, untar/etcupdate/postinstall, etcmanage,
> something even more bespoke), and let me know if anything goes wrong
> with the TLS trust anchors:

I did an update via

  Existing system is netbsd-10 from august, upgraded since 2003 from
  netbsd-wicked-old, both software and hardware.

  Existing system has mozilla-rootcerts-openssl.

  Built release from netbsd-10 (normally via build.sh).

  Upgraded via INSTALL-NetBSD from etcmanage, which unpacks kernel and
  non-etc sets, unpacks etc/xetc to /usr/netbsd-etc and then runs
  etcmanage to merge those to /etc, never touching a user-modified
  file.

> 1. Does postinstall work smoothly for you?

Ran "postinstall -s /usr/netbsd-etc check", got warning about certctl,
which seems right.  Ran with fix, got:
  certctl: existing certificates; set manual or move them
which also seems right.

> 2. Does it blow away any configuration you had?  (I don't think it
>should, but if you back up /etc you should be able to see.)

The mozilla-rootcerts-openssl certs and one cert I had there manually
remain.  So correct behavior.

> 3. Do you end up with the trust anchors you expected?

I expect things I have changed in /etc not to be modified in an upgrade,
so yes.

Given the man page, I expected to have an empty /usr/openssl/untrusted
directory but that is not in any of the sets.  Perhaps the idea is that
it is created on demand, but I didn't figure that out from the man page.

> 4. Are the answers obvious or do you have to go digging?

I'm not a really good test case for this, but it seems pretty clear that

Re: cgd questions

2023-10-03 Thread Greg Troxel
Thomas Klausner  writes:

> IIUC the cgdconfig man page correctly, this is how you do that:
>
>  To create a new parameters file that will generate the same key as an old
>  parameters file:
>
>  # cgdconfig -G -o newparamsfile oldparamsfile
>  old file's passphrase:
>  new file's passphrase:

I think what that does is encrypt the old key using the new passphrase,
and store that encrypted key in the config file.  Thus you haven't
changed the key, but you have a config file that allows decryption with
a new passphrase.  That's good to give a second person access, but it
doesn't revoke the first passphrase's access, if I understood correctly.


Re: cgd questions

2023-10-01 Thread Greg Troxel
Thomas Klausner  writes:

> When I pick up a cgd disk and want to use it on a NetBSD system to
> which it was not connected before, what do I need?
>
> - the passphrase
> - the /etc/cgd/foo file?
>
> If you need the /etc/cgd/foo file too, how do people handle those for
> cgds used as backup disks?

Yes, you need the /etc/cgd/foo file because the passphrase is salted,
and you might need an iv depending on iv method.  IMHO this is a design
bug in cgd.  At least as a normal path, one should be able to access
with just the passphrase.

My setup is

  (this is for a 512-sector disk)
  GPT partition on disk
  index 2: 16384 sectors starting at 64, ffs
  index 1: rest of disk, cgd

  in index 2, newfs and then rsync all my cgd init files.
  in index 1, cgdconfig

Thus, any backup disk has the params for all of them.
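A hedged sketch of creating that layout (the device name sd1 and the
exact gpt(8) invocations are assumptions; check the man page before
running this against a real disk):

```shell
# GPT with a small ffs partition for cgd parameter files, rest cgd.
gpt create sd1
gpt add -i 2 -b 64 -s 16384 -t ffs sd1   # 8 MB at sector 64 for params
gpt add -i 1 -t cgd sd1                  # remainder of the disk
# newfs the dkN wedge that index 2 became, rsync the cgd parameter
# files into it, then cgdconfig against the index-1 wedge.
```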

> The other question is that the cgd man page says that some ciphers are
> obsolete. How can I switch from an obsolete cipher to a new one - is
> the only method to make a new cgd with the new cipher and copy the
> data manually?

I believe that's the only way.  I can't even figure out how to change
the passphrase without doing that.


Re: security/mozilla-rootcerts-openssl post certificate inclusion in base

2023-09-26 Thread Greg Troxel
Chavdar Ivanov  writes:

> lack cause anything? On top of this, I seem not to be able to remove
> mozilla-rootcerts-openssl, as it is required by hs-x509-system, itself
> required eventually by converters/pandoc. (I sorted this out by

That's a bug.  It is against policy for a package to require
mozilla-rootcerts-openssl.

> replacing the latter package after cvs updating - the NetBSD
> conditional in the Makefile has been removed so after that nothing
> stopped me from removing mozilla-rootcerts-openssl; leaving the
> comments in the mail as someone else may find himself in the same
> situation).

And it's fixed.

> The query is then about the 198 certificates present in the package
> but missing in base - are they likely to cause any problems?

I would uninstall mozilla-rootcerts-openssl and then make sure your cert
dir is ok.

Are you saying that mozilla-rootcerts-openssl has CAs that base does
not, separately from the history of how your system got to be how it is?


Re: possible NFS trouble

2023-09-20 Thread Greg Troxel
David Brownlee  writes:

> On Thu, 24 Aug 2023 at 15:27, Greg Troxel  wrote:
>>
>> I did not try this build with the new computer under 9.
>>
>> It occurred to me that I need  to find a parallel filesystem exerciser
>> and try that, as simpler than the thunderbird build process.
>
> Might also be worth a single pass at building thunderbird on the local
> SSD to see if any issues trigger?
>
> As a random datapoint I'm building pkgsrc with MAKE_JOBS=9 on an old
> CPU E5-2450L (16 "cpus" as 8+8 I've ), obj on local tmpfs with enough
> memory to fit, and I've not had any issues with thunderbird recently
> ("2023")

Totally fair to suggest local building.

I just ran a build with work on a USB3 ssd, and with MAKE_JOBS=8, got a
failure, but re-running make was ok.

So I suspect some issue with thunderbird/-j.  Maybe I'll figure it out,
maybe not.

I will also try a filesystem tester from xfs that I heard about,
apparently not in pkgsrc.  Other benchmark/tester suggestions welcome.



Re: modesetting vs intel in 10.0

2023-09-04 Thread Greg Troxel
nia  writes:

> On Sat, Aug 26, 2023 at 06:47:01AM -0400, Greg Troxel wrote:
>> I am new to modern intel graphics.  I have a UHD 630 with a 9th
>> generation (coffee lake?) CPU.  It is using intel, and it works for
>> xterm :-) But I see artifacts while typing into github comment boxes.
>> Is this something I should try modesetting on?
>> 
>> Is there a wiki page that collects "this option works or doesn't work,
>> and is or isn't preferred" with various graphics, along with
>> instructions for choosing?  I know I could figure this out but it would
>> probably speed testing to have someone who already knows write it down
>> and post a link.
>
> Yes, the artifacts are a symptom.
>
> In general the intel driver has been discontinued since 2015 or so.
>
> In /etc/xorg.conf, try:
>
> Section "Device"
> Identifier  "Card0"
> Driver  "modesetting"
> EndSection
>
> (This is okay for the whole file if it doesn't exist already.)

Thanks, that worked fine.

Finally a data point (I don't like to exit and restart X because I have
too much state).

Earlier the screen would go black for a few seconds pretty often (like 2
every 30), with modesetting and also intel.  I am now pretty sure that
this was because I was running zpool scrub.  Without a scrub, I don't
get this.

My graphics is:

  i915drmkms0 at pci0 dev 2 function 0: Intel UHD Graphics 630 (rev. 0x02)
  i915drmkms0: interrupting at msi4 vec 0 (i915drmkms0)
  i915drmkms0: notice: Failed to load DMC firmware i915/kbl_dmc_ver1_04.bin. 
Disabling runtime power management.
  [drm] Initialized i915 1.6.0 20200114 for i915drmkms0 on minor 0
  intelfb0 at i915drmkms0
  intelfb0: framebuffer at 0xc004, size 2560x1440, depth 32, stride 10240

which is on a 2019 Dell with 9th gen i7:

  cpu0: "Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz"
  cpu0: Intel 7th or 8th gen Core (Kaby Lake, Coffee Lake) or Xeon E (Coffee 
Lake) (686-class), 3000.00 MHz


With the intel driver, it is mostly great except typing text in firefox has
artifacts which resolve.  xterm is fine.  display blanking into hw
powersave is fine.

With the modesetting driver, it is also mostly great, except:

  same github comment box artifacts

  cursor is often not quite right and it's a little confusing to resize
  windows as the change-icon clues are a bit off.  It's like the "update
  cursor" write calls get garbled output and then resolve.

  scrolling in firefox has text with some scan lines messed up and then
  over a second or two they resolve.  I see this changing virtual
  desktops to firefox also.  switching to xterms is fine.
  
  
But really both are quite usable.


Re: ERROR: No valid Python version

2023-09-04 Thread Greg Troxel
Chavdar Ivanov  writes:

> Yes, it was replacing cmake at the time, which eventually needed libxslt.

In that case your choice is to fix the real issue first, work around
with a large number of -X, or use -k and see what gets done.

>> I don't know what's going on, and would suggest turning  on set -x and
>> tracing the logs to see.
>
> My assumption on duty in these occasions is always that the problem is
> with my local tree and with the occasional messing I do with it; I try
> to avoid bothering the list with these and to find a solution myself;
> I mentioned the issue as I have had it a few times before. One reason
> might be that I started using  'pkg_rr -suv'  recently, before I used
> to use ' -uvk', which, albeit not safe, at least goes through the
> whole process without intervention and usually does the job (but I
> understand why it shouldn't be used). I'll try to trace it.

-s is strict, which means unsafe_depends_strict is used instead of
unsafe_depends.  In theory, the only time unsafe_depends_strict gets set
when unsafe_depends doesn't is when a package is updated from foo-1.2nb3
to foo-1.2nb4, differing only in PKGREVISION.  This is sort of by
definition fixing packaging issues, and should not change the ABI.
Except deciding to install something, changing options, etc. does.
However there are a ton of revbumps.  So is it safe?  Usually.  I really
doubt your pain is related to failure to rebuild an
unsafe_depends_strict that has an ABI change.

If you haven't done it in a while, -s will rebuild a ton of packages,
and this will expose problems that are about rebuilding those.  So it's
not really that -s causes trouble as much as if you invite 1000 people
to a party one of them is going to fall down and hurt themselves.

In this case, if libxslt is out of date, and you gave -u, then libxslt
should get sorted before cmake, and if not there is likely either a
broken pkgdb on your system or a pkg_rr bug.

When this happens to me, usually when doing a manual make replace, I
just go to the eg. libxslt dir after it fails and type 'make replace
clean', which is fast because it was just built and still there.  You
can do this if you don't want to dig into the bug and if this strategy
works, that tends to argue for a pkg_rr bug.



Re: ERROR: No valid Python version

2023-08-31 Thread Greg Troxel
Chavdar Ivanov  writes:

>
> ---
> ===> Building binary package for libxslt-1.1.38nb1
> => Creating binary package /usr/pkgsrc/packages/All/libxslt-1.1.38nb1.tgz
> ===> Installing binary package of libxslt-1.1.38nb1
> pkg_add: A different version of libxslt-1.1.38nb1 is already
> installed: libxslt-1.1.38
> pkg_add: 1 package addition failed
> *** Error code 1
> 
>
> The test run of 'pkg_chk -uq' has identified libxslt as a target for
> upgrade, but it is for some reason missing from the final list of
> installed packages as found by it, so it tries to install it instead
> of replace.
>
> This is done after 'pkgin ar', 'pkg_admin rebuild-tree', 'pkg_admin
> rebuild' and the above. I was running 'pkg_rolling-replace -suv' at
> the time.

You snipped too much log; presumably it was replacing something else.

I don't know what's going on, and would suggest turning  on set -x and
tracing the logs to see.


Re: ERROR: No valid Python version

2023-08-30 Thread Greg Troxel
Riccardo Mottola  writes:

> Hi,
>
> Greg Troxel wrote:
>> did you cd to devel/scons (or really the PKGPATH of the installed pkg)
>> and type "make replace".  The pkg_rr man page says, or should say, to do
>> that, and then to deal with that error as if it were not from pkg_rr.
>
> no, I didn't... I thought to have encountered some weird setup
> error. I did now. It fails with a cryptic message:

the man page for pkg_rr asks that people not report "some make replace
that pkg_rr did failed" as a pkg_rr problem, because it's not really
true, but mostly because the subset of people that don't like pkg_rr then
ignore your report, whereas some of them could perhaps be helpful if it
is properly reported as a simpler failure.

>> perhaps, scons 3 is now limited to py27.
>
> Well it is trying to build the py27 version, also it is more "sane
> now" since it says:
> py27-scons-3.1.2nb7, while with pkg_rr it was confused between the py310
> and the "none" version:
>
> RR> Replacing py310-scons-3.1.2nb4
> ===> Cleaning for none-scons-3.1.2nb7
>
> In fact at the moment I have this one installed, according to pkg_info
>
> py310-scons-3.1.2nb4 Python-based, open-source build system
>
> so I suppose pkg_rr was right trying to substitute the py3 version

Yes, the basic issue is that new scons3 dropped support for py3, and the
basic solution is to upgrade to 4, but we don't really have support for
that, and 99% of people only have scons as a build tool (because somebody
else wrongly thought it was better than autoconf 5 years ago :-).

>> I recommend "pkgin ar" before a rebuild run.
>
> Now it is a little late :-P

You can still do it and continue.   Unless you actually *want* scons3,
removing it is best.  The tools that are needed for other things will
get built.



Re: ERROR: No valid Python version

2023-08-29 Thread Greg Troxel
Riccardo Mottola  writes:

> Hi,
>
> pkg_rolling-replace of current pksgrc on 10.99.7 stops with this error:
>
> RR> Replacing py310-scons-3.1.2nb4
> ===> Cleaning for none-scons-3.1.2nb7
> ERROR: This package has set PKG_FAIL_REASON:
> ERROR: No valid Python version
> *** Error code 1

did you cd to devel/scons (or really the PKGPATH of the installed pkg)
and type "make replace".  The pkg_rr man page says, or should say, to do
that, and then to deal with that error as if it were not from pkg_rr.

perhaps, scons 3 is now limited to py27.

I recommend "pkgin ar" before a rebuild run.


Re: modesetting vs intel in 10.0

2023-08-27 Thread Greg Troxel
David Brownlee  writes:

> That would be correct, though current and -10 support is better than
> -9.

On a 9th gen cpu, UHD 630 graphics, netbsd-9 is wsfb, but intel driver
and modesetting function at least mostly with netbsd-10.  I'm typing
this from intel driver working totally ok except for font funniness
while typing in github comments.


It seems that zfs scrub and heavy load leads to periods of all-black
screen, with both intel and modesetting drivers.  Or it's a coincidence
and it's something else.


Re: modesetting vs intel in 10.0

2023-08-26 Thread Greg Troxel
nia  writes:

> On Fri, Aug 25, 2023 at 08:55:13PM +0100, David Brownlee wrote:
>> Picking up on this, particularly with netbsd-10 looming, I think we
>> should at least whitelist some known good-with-modesetting Intel GPUs,
>> with a plan to swapping over to whitelisting keep-on-intel Intel over
>> time.
>
> I've pretty much only got haswell or newer, all of which are
> better with modesetting. Maybe you have some older GPUs which
> need intel?

I am new to modern intel graphics.  I have a UHD 630 with a 9th
generation (coffee lake?) CPU.  It is using intel, and it works for
xterm :-) But I see artifacts while typing into github comment boxes.
Is this something I should try modesetting on?

Is there a wiki page that collects "this option works or doesn't work,
and is or isn't preferred" with various graphics, along with
instructions for choosing?  I know I could figure this out but it would
probably speed testing to have someone who already knows write it down
and post a link.


Re: possible NFS trouble

2023-08-24 Thread Greg Troxel
Martin Husemann  writes:

> On Thu, Aug 24, 2023 at 09:22:13AM -0400, Greg Troxel wrote:
>> I ran a build, and it was erroring out with IO errors, and restarting
>> kept having errors.  I suspected NFS concurrency, and reran it with
>> MAKE_JOBS=1 and it seems to have gone much better.
>
> What network adapters are involved? Does ifconfig show a serious
> amount of errors on either client or server? In my experience network
> driver bugs (or crappy hardware) are the most likely cause for NFS errors
> under high load.

On the new computer, re0, and I see zero errors and 70K packets in/out.
On the server, re0, no errors and about 65K packets in/out.

Neither computer shows tcp drops due to checksums (literally 0).


I did not try this build with the new computer under 9.

It occurred to me that I need to find a parallel filesystem exerciser
and try that, as simpler than the thunderbird build process.



possible NFS trouble

2023-08-24 Thread Greg Troxel
My situation is a little complicated; hence "possible".

I had a setup where things were ok:

  2010 computer with 4 cores, 24G RAM, 67% tmpfs (so 16G), SSD with UFS2.
  netbsd-9 amd64

  lower-end 2010 computer 'xen' with 2 cores, 8G RAM, SSD with / and
  /usr UFS2 and most of it zfs.  Sometimes runs xen dom0, sometimes bare
  metal.
  netbsd-10 amd64

On the first computer, I build packages, and generally this is fine.
thunderbird has a crazy WRKDIR size requirement (22G), so I have

  .if "${PKGPATH}" == "mail/thunderbird"
  WRKOBJDIR=  /nfs/ztmp/work
  .endif
  WRKOBJDIR?= /tmp/work
  CREATE_WRKDIR_SYMLINK=  yes

where ztmp is a zfs fs on the second computer, mounted via nfs, and thus
NFSv3 TCP.  I am doing this so that I abuse the non-critical ssd on the
second box, and not the ssd that I care about on the main box.  This is
slow but works, and it's only thunderbird.

The network between the computers is wired GbE with two switches and has
been reliable.

This setup worked fine, but I tended to use MAKE_JOBS 2 or 3.


The first computer had a hardware failure, and I replaced it with a 2019
8-core 32G box with a new SSD with everything except / sw /var /usr on
ZFS.  It seems to work very well.

However, my tmpfs is now 19G, which is still less than 22G, and I don't
really want to page out everything (after tmpfs, zfs ARC, other pools,
etc.), so I'm still using NFS ztmp.


I ran a build, and it was erroring out with IO errors, and restarting
kept having errors.  I suspected NFS concurrency, and reran it with
MAKE_JOBS=1 and it seems to have gone much better.

So it seems that perhaps

  8 jobs all doing NFS doesn't really work, on 10 vs 9, or 8 jobs vs 3
  jobs, or maybe only if server is a xen dom0, or ??

  Thunderbird is buggy and building thunderbird with high MAKE_JOBS
  doesn't really work.

  As always, something else.


I wonder if anyone has had a similar experience or insight.


I am likely to get some low-end SSD/flash to use for WRKOBJDIR for
thunderbird, to avoid NFS, anyway.



Re: Growth in pool usage between netbsd-9 and -10?

2023-08-19 Thread Greg Troxel
Paul Ripke  writes:

>> That seems a little excessive, to my eye? Is this "normal"? Having >1/4 of 
>> RAM
>> apparently used mostly for file metadata, if I'm reading this correctly? I
>> don't recall the usage under -9, unfortunately, but I also never had a reason
>> to look.

In general, kernel memory tends to fill with cache entries, and as long
as it is freed if needed for something else, that's fine.  It's only
memory without effective draining under pressure (hi zfs!) that is
problematic.

I think that vm.filemax is supposed to define the level above which file
cache is supposed to be freed with higher priority, and vm.filemin is
the level below which file cache is protected from freeing.   However I
am not sure if that is the block cache or also pool usage for metadata.

I use the program below "touchmem.c" to force memory to be used for
program data.  This causes a great deal of freeing usually.  It
obviously can lead to paging if too big, but you can also run it with
varying sizes and figure out when it starts being slow.

Actually I should enhance this to be a loop with timing to measure the
delay from the freeing activity by doing each amount twice, and to see
in the second runs the runtime vs size which should depart from linear
when it crosses into paging the program itself.
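That enhancement might look like the following sketch (it assumes
./touchmem is built from the C listing at the end of this message; if
the binary is absent, a no-op stub is substituted so the loop itself can
be dry-run):

```shell
# Run each size twice, timing each pass, so the second pass shows the
# steady-state cost after the first pass's freeing activity.
[ -x ./touchmem ] || { printf '#!/bin/sh\nexit 0\n' > touchmem; chmod +x touchmem; }
for mb in 500 1000 1500 2000; do
  for pass in 1 2; do
    start=$(date +%s)
    ./touchmem "$mb"
    end=$(date +%s)
    echo "size=${mb}MB pass=${pass} elapsed=$((end - start))s"
  done
done | tee touchmem.log
```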

Running this in a xen domU with 2000 MB (not 2048), I pushed "file" as
shown by top to 383M.  This is with default (9) settings:

  vm.anonmin = 10
  vm.filemin = 10
  vm.execmin = 5
  vm.anonmax = 80
  vm.filemax = 50
  vm.execmax = 30
  vm.inactivepct = 33
  vm.bufcache = 15
  vm.bufmem = 170868736
  vm.bufmem_lowater = 58982400
  vm.bufmem_hiwater = 471859200

After running this with a high value:

$ time touchmem 500; time touchmem 1000; time touchmem 1500; time touchmem 2000

real0m1.843s
user0m0.169s
sys 0m1.676s

real0m3.862s
user0m0.580s
sys 0m3.281s

real0m6.095s
user0m0.648s
sys 0m5.446s

real0m26.580s
user0m1.332s
sys 0m8.701s

Which basically shows that a 1500 MB process is not really impaired and
a larger one (with data equal to the machine's entire RAM) is.  This
seems ok.  But it's 9.


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
char *foo;
size_t i, size;
int pgsize;

if (argc != 2) {
printf("usage: touchmem size (in MB)\n");
exit(EXIT_FAILURE);
}
  
size = atoi(argv[1]) * 1024L * 1024L;
#if 0
printf("size is %zd\n", size);
#endif

pgsize = getpagesize();
foo = malloc(size);
if (foo == NULL) {
printf("Can't malloc.\n");
exit(EXIT_FAILURE);
}

for (i = 0; i < size; i += pgsize)
foo[i] = 1;
exit(EXIT_SUCCESS);
}


Re: Growth in pool usage between netbsd-9 and -10?

2023-08-19 Thread Greg Troxel
Paul Ripke  writes:

> Following up, I see buf2k allocated when I run 'git status' under 
> netbsd/pkgsrc,
> netbsd/src, etc repos. I also see growth during /etc/daily. The filesystem is
> ffsv2 with log, on raidframe raid1 of wd[01]a.

You might try setting kern.maxvnodes to 64 and then back and see what is
freed.  I am 98% sure I did this on some machine with no ill effects,
not sure if 9 or 10.
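That experiment, sketched (run as root; the saved value is restored
afterwards, and the pool names grepped for are a guess):

```shell
old=$(sysctl -n kern.maxvnodes)
sysctl -w kern.maxvnodes=64            # force vnode cache reclamation
vmstat -m | grep -i -e vcache -e buf   # see which pools shrank
sysctl -w kern.maxvnodes="$old"
```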



Re: 10.99.7 panic: defibrillate

2023-08-13 Thread Greg Troxel
Would it be useful for heartbeat to have a just-log-don't-panic option?

It feels like we are in a state where we know there is a problem somewhere,
and we don't know if it is in heartbeat, the kernel, or hardware.

I would not want to run a watchdog that reboots the system unless the
false-positive rate is well under once per year, and really under
0.2/year.  Having this logged instead of panicking would make it more
comfortable to turn on.  It should probably default to not panic, if
this turns into enough reports that it seems to have significantly
non-zero probability.

(Presumably atf runs on real hw survive HEARTBEAT though, so whatever is
happening seems low probability to start with.)


Re: Strange behavior for route(8)

2023-07-28 Thread Greg Troxel
Paul Goyette  writes:

> Can anyone explain what I'm doing wrong?

> {184} route show -inet net 192.168.0/24
> route: botched keyword: 192.168.0/24
> Usage: route [-dfLnqSsTtv] cmd [[-] args]

route show is documented to print the table.  It does not take narrowing
specifiers.

Try

  route get -inet 192.168.0.0/24

and it's arguably a bug that

  route get -inet 192.168.0/24

errors, given that route show prints addresses in that form.


Re: ssh client_loop send disconnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Greg Troxel
Brian Buhrow  writes:

>   Hello.  The ARP cache timeout used to be 1200 seconds or 20 minutes,
> hard coded.  Now, it looks like it's either 1200 seconds or 300
> seconds; I'm not sure after a quick romp through the kernel source.  In
> any case, the fact that you're getting regular delays on your pings
> suggests there is a delay between the time when the arp cache times out
> and when it gets refreshed.  As

However, a missing ARP cache entry should result in at most one RTT of
delay over the local net, and it really should not be a big deal.  In
any case, tcpdump and analysis are always a good idea, to turn theories
into observations.


Re: error installing libiconv-1.17

2023-03-27 Thread Greg Troxel
> I think this is a bug, since it prevents a proper upgrade.

Certainly it is.

These bugs are not that rare.  There is typically a link line to make a
binary that needs libs that are part of the package and libs that are in
pkgsrc (in dependencies).  So it ends up being something roughly like
this:

  cc -o prog prog.o foo.o bar.o -L../libs -L/usr/pkg/lib \
     -Wl,-R/usr/pkg/lib -lproglib -lotherlib

and that is probably ok, but when the real link line is 10 times more
complicated than that, with multiple stages of substitution, you can see
how the -L options could get tangled up in the wrong order.  My guess is
that these bugs are 90% upstream bugs.

So if you can figure it out, and how to fix it, and report upstream if
that's where the bug is, that would be helpful.  But it's hard.



Re: clang-built NetBSD and rust

2023-03-27 Thread Greg Troxel
Havard Eidnes  writes:

> I've looked at
>
>   ftp://nyftp.netbsd.org/pub/NetBSD-daily/HEAD-llvm/latest/amd64/binary/sets/
>
> and specifically the base.tar.xz file, and it doesn't look like it has
> libgcc_s at all, but apparently MKGCC=yes will build it, but that
> doesn't appear to be the default (which is probably intentional).
>
> So ... what should I/we do about this?  Do we need a separate rust
> bootstrap kit built with clang and built with a clang-built "target
> root"?  It looks like there's no netbsd-9 nor netbsd-10 built with
> clang, only HEAD?

Basically binary packages are built for some environment, and you then
can't change the environment and expect them to work.

So yes, for NetBSD with LLVM and not GCC, I think we need different
bootstraps, and that's work to generate and awkward to implement.  I
don't think you should feel any duty to accommodate this, which I say
even though I'm a clang fan.

The big question to me is if a system with HAVE_LLVM can also build with
MKGCC.  The wiki says no:
  https://wiki.netbsd.org/tutorials/clang/
but that was likely adopted by me from list traffic and may be wrong.

I am unclear on whether NetBSD which 1) builds clang and 2) uses clang
to build the system can sanely also have GCC available, and whether we
should turn that on in public builds.  That's a bigger question than
rust, surely.

The real issue is that it's a bug that rust needs binary bootstraps, and
that there's no path from source with a base system.  This is
exacerbated by the rust.org implementation's practice of requiring the
previous rust version.  But other than the eventual gcc, and maybe
mrustc, I don't see that getting fixed, since they seem to view the
current situation as ok.


Re: lang/guile30 build issue: lto support missing in ar/ranlib

2023-01-08 Thread Greg Troxel
Thomas Klausner  writes:

> On 10.99.2 after the load sections 2->4 change I see the following
> when building lang/guile30:
>
> ar: libguile_3.0_la-alist.o: plugin needed to handle lto object
> ranlib: .libs/libguile-3.0.a(libguile_3.0_la-alist.o): plugin needed to 
> handle lto object
>   CCLD guile
>
> and the resulting binary segfaults when run (which also happens during
> the build), backtrace below.
>
> Is there a flag to turn off lto, or can we please get ar/ranlib
> support for lto?
>
> To reproduce, just try building 'lang/guile30'.

It fails to even build on i386.  I have a local patch, pending figuring
it out, to just disable lto.  I was unsure if that belonged on only some
arches, but it seems best to disable it everywhere and then figure it
out.


Index: Makefile
===
RCS file: /cvsroot/pkgsrc/lang/guile30/Makefile,v
retrieving revision 1.5
diff -u -p -r1.5 Makefile
--- Makefile26 Oct 2022 10:31:04 -  1.5
+++ Makefile8 Jan 2023 17:37:16 -
@@ -23,6 +23,8 @@ LDFLAGS.SunOS+=   -lsocket -lnsl
 MAKE_ENV+= PAXCTL=echo
 MAKE_ENV.NetBSD+=  PAXCTL=/usr/sbin/paxctl
 
+CONFIGURE_ARGS+=   --disable-lto
+
 .if !empty(GUILE_SUBDIR)
 # Installation prefix is non-default.
 GUILE_PREFIX=  ${PREFIX}/${GUILE_SUBDIR}


Re: 10_BETA: Nice QOL improvements to the installer

2022-12-21 Thread Greg Troxel

Salil Wadnerkar  writes:

> Mayuresh wrote:
>> On Wed, Dec 21, 2022 at 09:25:30AM +1300, Lloyd Parkes wrote:
>>> I was installing on amd64 and the installer let me choose either MBR or GPT,
>> I notice a separate image marked "bios"
>>
>>  NetBSD-10.0_BETA-amd64-bios-install.img.gz
>>  NetBSD-10.0_BETA-amd64-install.img.gz
>>
>> So wonder, which installer you tried that gave you these two options. I
>> got an impression that these two are just different installers - 1 UEFI
>> and 1 BIOS.
> UEFI installer is a hybrid installer - works for both UEFI and BIOS:
> http://blog.netbsd.org/tnf/entry/netbsd_10_0_beta_available

Someone who actually knows should comment, but I would expect that both
installers can do the same things once booted.  I would expect that the
one with bios in the name has an MBR partition and MBR boot in the img,
to be bootable on machines that support that, and that the other is GPT
with EFI boot, to be bootable on machines that support that.  I would
expect most machines to be able to boot either, except that old machines
are MBR only and maybe some new ones are EFI only.  And on some
machines, it wouldn't surprise me if some are buggier with one method
than the other.


signature.asc
Description: PGP signature


Re: /etc/protocols generation

2022-12-16 Thread Greg Troxel

Jan Schaumann  writes:

> 1) Since this is solving the same problem using the
> same input and producing the same output, the awk
> scripts there resemble those from pkgsrc/net/iana-etc/
> necessarily.  Those, however, are released under the
> Open Software License ("OSL") v. 3.0.  I don't know
> whether my scripts are sufficiently original to not be
> considered derived work; if not, then we'd have to
> import those scripts using the OSL.

OSL is problematic for TNF because it is AGPL-like and board@ decided
that was not ok for DEFAULT_ACCEPTABLE in pkgsrc, so it is obviously not
ok in base -- even though fears of having to distribute modified
services build scripts, if someone does so and runs a web service, are a
bit overblown :-)

If you copied enough, it's a derivative work, and if you didn't copy, it
isn't.  The fact that it ends up similar is not technically relevant,
but I can see that it is a concern when assessing risk.  If there's only
one way to do it sanely and you wrote it separately, I don't see an
issue, but IANAL, IANYL, TINLA as always.  I looked quickly and the
programs don't look particularly similar in details.

You could write to the author and ask if they consider what you did to
be a derived work, or permission to distribute what you did under a
2-clause BSD also crediting them.  I suspect there is no actual problem.

> This, then would speak in favor of leaving the tools
> in pkgsrc.  The Makefile could then perform the task
> of reaching over into pkgsrc?  Seems like a messy set
> up and not much of a win. :-/

I don't think the build can depend on pkgsrc.  If you mean a step in a
process to update what is checked in under src, I don't see that as a
big problem.

> 2) I wanted to add the execution of the ATF tests, but
> those literally test the outcome of the files
> installed under /etc.  For a full validation, one
> would need to copy the generated file into /etc, which
> seems heavy handed to me.  Having a partial test that
> may generate a diff when the file is updated seemed
> reasonable to me.

It could make sense to encapsulate in ATF building our file from sources
and then comparing it to what is checked in.  It seems obviously not ok
for an ATF test to modify the running system or even DESTDIR.


signature.asc
Description: PGP signature


Re: Branching for netbsd-10 next week

2022-12-09 Thread Greg Troxel

Robert Elz  writes:

>   | And it's not just NetBSD
>
> The relevant issue is, in that NetBSD10 might have EA support, but
> perhaps without them being enabled by default on anything, for which
> the solution, and its ramifications are a peculiarly NetBSD issue
> (the same thing does not apply to FreeBSD for example, there you
> just need a new enough system, and not to be using FFSv1).

And you need not to be using FAT32, and probably a bunch of other
filesystems, on various operating systems.  NetBSD already has the
property that EA might be present in a filesystem (FFSv1 maybe, ZFS) and
it might not.  That basic "maybe" is not going to change; there's just
one more option now.


>   | --- it's a larger issue that you need to use a FS with
>   | certain properties if you want certain features.  So it really belongs
>   | in the upstream documentation in general.
>
> That one needs ACLs yes, certainly - that to get those working on
> NetBSD requires NetBSD >= 10, and one needs to then either newfs
> with some specific option, or fsck with some specific option, not so much.

In that case, sure.  But there is a tendency in pkgsrc to put in fixes
that apply to all use on NetBSD, not just in pkgsrc context (which is
great) and then not push them upstream.


signature.asc
Description: PGP signature


Re: Branching for netbsd-10 next week

2022-12-09 Thread Greg Troxel

Robert Elz  writes:

>   | - packets from pkgsrc (like samba) will continue to have the
>   | corresponding options disabled by default
>
> Those packages could have warnings in DESCR and MESSAGE (or whatever it
> is called) advising of the need for FFSv2ea for full functionality.
> How does samba (and anything else like it) deal with this - if it is
> a compile time option, then since NetBSD supports EAs (and hence the
> sys call interface exists) EA support in samba should be enabled.
> More likely it is to be a run time problem, as whether a filesys has
> EA support or not will vary, some might, others not, so whether samba
> can provide EA functionality will tend to vary file by file (or at least
> directory by directory) - given that, a solid warning that FFSv2ea support
> is needed in the samba man page (or other doc) for NetBSD should allow
> users to know what to do.

Not MESSAGE; this is not a "your hair is on fire" thing.  And it's not
just NetBSD --- it's a larger issue that you need to use a FS with
certain properties if you want certain features.  So it really belongs
in the upstream documentation in general.

As a general point, these sorts of issues are not properly pkgsrc
accommodations, but belong upstream, as anyone building samba from
sources following the upstream build instructions should get the same
hints.

Indeed, a warning in the code (pushed to the upstream project) about
using ACLs in a fs that doesn't have ACLs seems good.


signature.asc
Description: PGP signature


Re: ghc and current aarch64

2022-11-20 Thread Greg Troxel

Clay Daniels  writes:

> I'm impressed with your patience, Greg. I always do a fresh install,
> mostly just writing over the one before, but sometimes I use Gparted
> on a usb stick to get it really clean.  It gives me a chance to
> install fresh copies, or even versions, of my basic apps, like
> foxfire, hexchat/irssi, thunderbird, etc.

It's not that I'm more patient, just with different things.  I use
etcmanage to deal with /etc, mostly automatically, and I use
pkg_rolling-replace to replace packages.  So my system is actually very
up to date.  What I don't have the patience for is redoing everything
after a fresh install and having to figure out what's broken and fix it
one at a time.  I believe that the cost of such re-install work is huge,
more than people estimate, and the cost of careful upgrades is not so
bad.


signature.asc
Description: PGP signature


Re: ghc and current aarch64

2022-11-20 Thread Greg Troxel

Clay Daniels  writes:

> Thanks for the clue. compat90 has other libs like libterminfo that I
> have been missing in 9.99.106 for the last week or so. I had given up
> and loaded the 9.3 release, which I will say is really good. But I
> can't seem to give up current, it's some kind of odd addiction...

[trimming to current-users only]

Note that when you upgrade, libs remain.   So if you install 9, and then
upgrade to current, things should be smoother in terms of being able to
run 9 binaries.

Many of my systems are quite old and I have upgraded over many NetBSD
versions.  As an example:

  -r--r--r--  1 root  wheel  237892 Jan  6  2004 /usr/lib/libkrb5.so.18.0
  -r--r--r--  1 root  wheel  252301 Oct 20  2004 /usr/lib/libkrb5.so.19.1
  -r--r--r--  1 root  wheel  257801 Jan 28  2009 /usr/lib/libkrb5.so.20.1
  -r--r--r--  1 root  wheel  399610 Nov 17  2014 /usr/lib/libkrb5.so.22.0
  -r--r--r--  1 root  wheel  570927 Apr  8  2017 /usr/lib/libkrb5.so.26.0
  -r--r--r--  1 root  wheel  651112 Oct  2 21:53 /usr/lib/libkrb5.so.27.0



signature.asc
Description: PGP signature


Re: namespace pollution? clone()

2022-07-26 Thread Greg Troxel

Thomas Klausner  writes:

> When compiling inkscape I found a weird compilation error that I
> traced down to clone() being in the visible namespace.
>
> https://gitlab.com/inkscape/inbox/-/issues/7378

It's too bad they are expressing 'not supported' to avoid a reasonable
change.  Normally 'not supported' just means "we won't fix it, send a
patch".

> I wonder why it's visible though, since in sched.h it's protected by
> _NETBSD_SOURCE.

Is there some kind of visibility define?  AIUI, _NETBSD_SOURCE things
are in the namespace unless there is some kind of visibility define.
Almost always when there is a define, programs break because they don't
really mean "exactly stick to this standard".

> The command line is
>
> cd
> /scratch/graphics/inkscape/work/inkscape-1.2.1_2022-07-14_9c6d41e410/src
> && c++ -DHAVE_CONFIG_H -DHAVE_X11 -DWITH_CSSBLEND -DWITH_MESH
> -DWITH_SVG2 -D_REENTRANT -Dinkscape_base_EXPORTS

So where is the visibility restriction?



signature.asc
Description: PGP signature


Re: Script to create bootable arm images?

2022-06-24 Thread Greg Troxel

Brook Milligan  writes:

>> On Jun 24, 2022, at 12:45 PM, Jason Thorpe  wrote:
>> 
>>> On Jun 24, 2022, at 11:02 AM, Brook Milligan  wrote:
>>> 
>>> build.sh works great to create, for example, binary/gzimg/armv7.img.gz.  
>>> 
>>> However, that is not necessarily a bootable image, at least on some
>>> systems.  In addition, various u-boot magic files must be added to
>>> the FAT partition at the beginning of armv7.img to make it
>>> bootable.
>> 
>> installboot(8) (and the tool-ified version) has support for plopping the 
>> correct uboot into the image.  You need to install the uboot packages first.
>
> Thanks.  So all that is needed is the following (for cross compiling):
>
> gzip -c -d < /path/to/release/evbarm-earmv7hf/binary/gzimg/armv7.img.gz > 
> /tmp/armv7.img
> env INSTALLBOOT_UBOOT_PATHS=/path/to/pkg/share/u-boot  
> /path/to/tools/bin/nbinstallboot  -m evbarm -o board=ti,am335x-bone-black  
> /tmp/armv7.img
>
> Does it make sense to consider a command line option for installboot that 
> sets the u-boot path so the environment variable is not needed?
>
> In answer to my original question, I think this means that all the
> images on armbsd.org can be created with a loop over the various
> boards with each iteration running the two commands above.  Is that
> correct?  Is that what is happening?

As someone on the outside, I think it would be good to have checked-in
scripts with comments about what they depend on.   I can see that those
that do this think that's unnecessary, but also that it is documentation
of the process for almost everybody else.


signature.asc
Description: PGP signature


Re: CVS commit: src

2022-05-25 Thread Greg Troxel

Alistair Crooks  writes:

> On Wed, May 25, 2022 at 15:13 Greg Troxel  wrote:
>
>>
>> Slightly, but not really.  Back then, there was multicast routing, and
>> then there was the mbone project for wide-area multicast because the
>> internet didn't yet support it like it eventually would (reality ended
>> up different).  Things like vic and vac came out of the research effort
>> that was associated with mbone deployment, and that's why I mentioned
>> them as "mbone".  It's true they work on multicast rather than mbone,
>> and thus arguably we should rename the category, but in pkgsrc we don'd
>> do that.  I'd say the pkgsrc name is wrong though.
>
> I haven't looked at cvs history, but my recollection (it was 25 years ago)
> is that the pkgsrc mbone category name came from FreeBSD at the initial
> import - over time, we got rid of some categories that didn't match the
> functionality requirement - plan9, japanese, etc - and we and they added
> others without coordination - inputmethods, ftp, dns etc.
>
> My bad for not being aware of these things at the time, sorry

kre@ disagrees with me about it being wrong; the mbone was a particular
overlay network, a social phenomenon, and a group of tools built for
that network all at the same time, and most of the mbone/ category is
what people of that day would have thought of as mbone tools.  My point
was just that the basic system routing support was not "mbone", but
simply "multicast routing".   You wouldn't really use video chat over
local because you'd just walk down the hall instead.  It was talking to
people at other sites that was a big deal.


signature.asc
Description: PGP signature


Re: CVS commit: src

2022-05-25 Thread Greg Troxel
On May 25, 2022 4:55:58 PM UTC, nia  wrote:
>On Wed, May 25, 2022 at 09:50:29PM +0700, Robert Elz wrote:
>> What is in pkgsrc/mbone is mostly the ancient mbone tools
>> (I don't recognise everything) and the name fits for that.
>> We have nothing mbone in base that I know if, nomkmbone
>> (or whatever) doesn't make a lot of sense (as a name).
>> 
>> kre
>> 
>> ps: I build kernels with MROUTING turned on.
>
>I will rename the option to MKMROUTING.
>

Thanks -- I think that's a good name and better than what I suggested.


Re: CVS commit: src

2022-05-25 Thread Greg Troxel

nia  writes:

> On Wed, May 25, 2022 at 08:42:20AM -0400, Greg Troxel wrote:
>> I was really surprisd that we had mbone applications in base; to me,
>> that would mean things like vic and vat.
>> 
>> This is not about MBONE; it's about multicast routing.  The mbone
>> was an overlay network to connect local multicast islands, and operated
>> in the 90s.
>
> Interesting, thanks.
>
>> Separately from the mbone, I have used multicast routing support in
>> NetBSD across connected local networks.
>> 
>> (Arguably map-mbone is misnamed; it really isn't about the mbone per se
>> but about whatever multicast network is available.  But that's just a
>> historical note.)
>
> Is the name situation same for the category in pkgsrc?

Slightly, but not really.  Back then, there was multicast routing, and
then there was the mbone project for wide-area multicast because the
internet didn't yet support it like it eventually would (reality ended
up different).  Things like vic and vac came out of the research effort
that was associated with mbone deployment, and that's why I mentioned
them as "mbone".  It's true they work on multicast rather than mbone,
and thus arguably we should rename the category, but in pkgsrc we don't
do that.  I'd say the pkgsrc name is wrong though.

>> I don't object to a default-on MK knob; having knobs to make base
>> smaller seems entirely reasonable.
>> 
>> I would suggest "multicast" as a word rather than mbone, as what this
>> knob does is remove user-space support for IP multicast routing.
>> Someone who understands the history would not expect mrouted to vanish
>> by disabling mbone.
>
> All of these applications depends on the "MROUTING" kernel option,
> it seems, which is mostly default-off, except for a few (tending
> on the more obscure side) kernel configs. I wonder if anyone
> knows the history there.

I'm not really sure why MROUTING is default off, but I think BSD
tradition is that the user-space tools are present even for kernel
options that are off, and that was ok because ~everybody built their own
kernel anyway, but almost nobody rebuilt userland.  The disk space for
these programs was small, and kernel RAM was a big deal, if you set the
wayback machine to a VAX with 1MB of RAM and a 100 MB disk, or 8MB and
250 MB.  I'm fuzzy on the exact numbers, but I'm sure kernel RAM usage
was a much bigger deal than disk.


signature.asc
Description: PGP signature


Re: CVS commit: src

2022-05-25 Thread Greg Troxel

"Nia Alarie"  writes:

> Module Name:  src
> Committed By: nia
> Date: Wed May 25 10:18:30 UTC 2022
>
> Modified Files:
>   src/distrib/sets/lists/base: mi
>   src/distrib/sets/lists/etc: mi
>   src/distrib/sets/lists/man: mi
>   src/etc: Makefile
>   src/etc/mtree: special
>   src/etc/rc.d: Makefile
>   src/share/man/man5: mk.conf.5
>   src/share/mk: bsd.README bsd.own.mk
>   src/usr.sbin: Makefile
>
> Log Message:
> mk: Allow building base without the MBONE applications by setting
> MKMBONE=no in mk.conf

I was really surprised that we had mbone applications in base; to me,
that would mean things like vic and vat.

This is not about MBONE; it's about multicast routing.  The mbone
was an overlay network to connect local multicast islands, and operated
in the 90s.

Separately from the mbone, I have used multicast routing support in
NetBSD across connected local networks.

(Arguably map-mbone is misnamed; it really isn't about the mbone per se
but about whatever multicast network is available.  But that's just a
historical note.)

I don't object to a default-on MK knob; having knobs to make base
smaller seems entirely reasonable.

I would suggest "multicast" as a word rather than mbone, as what this
knob does is remove user-space support for IP multicast routing.
Someone who understands the history would not expect mrouted to vanish
by disabling mbone.


signature.asc
Description: PGP signature


Re: File system corruption due to UFS2 extended attributes

2022-05-24 Thread Greg Troxel

Chuck Silvers  writes:

> The introduction in NetBSD's implementation of UFS2 of the extended
> attribute code from FreeBSD has introduced a compatibility problem
> with previous releases of NetBSD.  The explanation of this problem is
> a bit involved and requires knowing some history, so please bear with me
> as I explain.

Your analysis and approach make sense to me, even though it's
regrettable that it is necessary.  I guess UFS needs zfs-style feature
flags

What about compatibility with FreeBSD?

  - What happens if someone takes a FreeBSD UFS2 filesystem and mounts
it under NetBSD 9?

  - What happens if someone tries to mount a NetBSD <=9 UFS2 filesystem
on FreeBSD?   A 10 UFS2 filesystem w/o ea?  with?

Or is it already the case that FreeBSD and NetBSD do not interoperate
with UFS2?

And same questions for the other active BSD variants, which I think is
mostly OpenBSD and Dragonfly these days but I have lost track.


signature.asc
Description: PGP signature


Re: error upgrading packages on current / pkg database

2022-05-01 Thread Greg Troxel

Riccardo Mottola  writes:

> *** pkg_chk reports the following packages need replacing, but they
> are not installed: py37-gobject
> *** Please read the errors listed above, fix the problem,
> *** then re-run pkg_rolling-replace to continue.
>
> Indeed it is not installed, I only have the py27 version which got
> pulled in by gimp!
> I actually removed the previous version, reinstalled gimp and ended
> again with the py27 version installed.
> why does prr complain?

1. pkg_chk reports a mismatch with the wrong pyNN prefix

2. pkg_rr does not have logic to set PYTHON_VERSION_REQD based on the
desired pkgname to replace.  This is the easy part.

So if you would like this fixed, I would suggest digging into pkg_chk.


signature.asc
Description: PGP signature


Re: error upgrading packages on current / pkg database

2022-04-15 Thread Greg Troxel

Mike Pumford  writes:

> On 15/04/2022 17:19, Riccardo Mottola wrote:
>> Hello,
>>
>> I did a full system upgrade (running now ) and then got current pkgsrc.
>>
>> Now I try to run pkg_rolling-replace -uv ; it compiled for days, then stops.
>>
>> disc# pkg_admin check
>> ...pkg_admin: can't open
>> /usr/pkg/pkgdb/glib2-2.70.2nb1/+CONTENTS: No such file or directory
>>
> Hmm. Thats interesting. I've seen that happen on one of my 9.2-STABLE
> systems when doing a pkgin upgrade. Although in my case it was librsvg
> that got clobbered.
>
> Instead of the correct pkgdb data the folder existed but contained a
> random core file. The core file wasn't from any pkg tool.
> I ended up with:
> /usr/pkg/pkgdb/librsvg-2.52.6/gdk-pixbuf-query.core
>
> Which isn't even a tool I use its just something pulled in as a
> dependendency.

Interesting data point.

I should have said as 2A: when you find a directory with broken
contents, just rm -rf it, and that will get rid of the record of
installation without removing the files.  Then "pkg_add glib2" and
perhaps "pkg_admin set automatic=yes glib2" if you want it that way, and
then rerun check and rebuild-tree

> you have it around so you can extract them into the folder and then
> use pkg_admin to put things back together. That's what I did to sort
> myself out :)
>
> Something like:
>
> tar zxf glib-2.70.2nb1.tgz +*

That can work too, but either way all the cross-package data linkage may
not be quite right.


signature.asc
Description: PGP signature


Re: error upgrading packages on current / pkg database

2022-04-15 Thread Greg Troxel

Riccardo Mottola  writes:

> disc# pkg_admin check
> ...pkg_admin: can't open
> /usr/pkg/pkgdb/glib2-2.70.2nb1/+CONTENTS: No such file or directory
>
> disc# pkg_admin rebuild
> pkg_admin: glib2-2.70.2nb1: can't open `+CONTENTS'
>
> disc# pkg_admin rebuild-tree
> pkg_admin: Cannot read +CONTENTS of package glib2-2.70.2nb1

1) Understand how pkgdb works

  cd /usr/pkg/pkgdb

  look around at the contents
  
2) look at glib2 specifically

  ls -ld glib2-*
  for d in glib2-*; do
 echo DIR $d
 ls -l  $d
  done

3) Figure out how to write 'pkg_admin fsck' which will detect/fix
automatically
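The "pkg_admin fsck" idea in step 3 could start as small as this shell
sketch; check_pkgdb is a hypothetical helper name, not an existing
pkg_install command, and it only detects (never fixes) entries whose
+CONTENTS file is missing:

```shell
#!/bin/sh
# Sketch: report pkgdb entries that lack a +CONTENTS file.
# check_pkgdb is hypothetical, not an existing pkg_install command.
check_pkgdb() {
    db=${1:-/usr/pkg/pkgdb}
    for d in "$db"/*; do
        [ -d "$d" ] || continue              # skip pkgdb.byfile.db etc.
        [ -f "$d/+CONTENTS" ] ||
            echo "missing +CONTENTS: ${d##*/}"
    done
}
check_pkgdb "$@"
```

Each entry it reports is a candidate for manual repair.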



signature.asc
Description: PGP signature


Re: Status of NetBSD virtualization roadmap - support jails like features?

2022-04-15 Thread Greg Troxel

  However, this week I read a post on Reddit[2] that was a bit
  disturbing to me. Meaningfully, it proclaims that the main development
  platform for nvmm is now DragonflyBSD rather than NetBSD. It also
  claims that the implementation in NetBSD is now "stale and
  broken". Comparing the timestamps of the last commits in the
  repositories [3] and [4], the last activities are only three months
  apart. The nature and extent of the respective changes is difficult
  for me to evaluate. Is anyone here deeper into this and can say what
  the general state of nvmm in NetBSD is?

1) nvmm seems to work well in netbsd (I haven't run it yet) and there has
been bug fixing.

2) code flows between BSDs a lot, in many directions.

3) You could run diff to see what's different and why.

4) The language in the reddit post does not sound particularly
constructive.  Someone with knowledge of improved code in DragonFly (I
don't know if that's true or not) could send a message here or
tech-kern pointing it out and suggesting we update, rather than being
dismissive on reddit.  Or file PRs and list them; technical criticism is
fair.


Probably after your message (which I view as helpful) someone(tm) will
look at the diff.  But if you are inclined to do that and post some
comments, that's probably useful.



signature.asc
Description: PGP signature


Re: odd setlist failure

2022-02-25 Thread Greg Troxel

matthew green  writes:

> this should be fixed now.  sorry for the fallout.

Thanks.  I can confirm that current as of the 25th 1802Z builds and
boots as a XEN3_DOM0 (but I'm not really running X11 on it).


signature.asc
Description: PGP signature


odd setlist failure

2022-02-25 Thread Greg Troxel

current fails to build for me, complaining about ati_drv.so.19 in
destdir but not in the setlist.  I see that .6 is in the setlists now.
In my destdir I have:

-r--r--r--  1 gdt  wheel  7420 Jan 26 10:48 
/usr/obj/gdt-current/destdir/i386/usr/X11R7/lib/modules/drivers/ati_drv.so.6
-r--r--r--  1 gdt  wheel  7420 Feb 25 08:06 
/usr/obj/gdt-current/destdir/i386/usr/X11R7/lib/modules/drivers/ati_drv.so.19
lrwxr-xr-x  1 gdt  wheel13 Feb 25 08:06 
/usr/obj/gdt-current/destdir/i386/usr/X11R7/lib/modules/drivers/ati_drv.so -> 
ati_drv.so.19

Build host is netbsd-9 amd64.

Is anyone else seeing this?



signature.asc
Description: PGP signature


Re: pkg_rolling-replace reports mismatch

2022-02-16 Thread Greg Troxel

Riccardo Mottola  writes:

> pkg_rolling-replace reports me this:

Yes, but

> *** pkg_chk reports the following packages need replacing, but they are
> not installed: py37-gobject
> *** Please read the errors listed above, fix the problem,
> *** then re-run pkg_rolling-replace to continue.

it comes from pkg_chk.

The directions are clear: figure out why pkg_chk is getting it wrong,
fix that, perhaps accommodate that change in pkg_rr, and then rerun, and
all will be good.

But seriously, for pyNN-gobject, the PKGPATH is devel/py-gobject, and
there are multiple PKGNAMEs, and all of this stuff needs to grasp N
flavors of versioned packages.

> but I don't have py37-gobject installed! all python packages installed
> seem correctly either py27 or py39

Usually py27-foo gets reported as the current default version, so that's
odd.

> lintpkgsrc reports me:

> Unknown package: 'py-gobject' version 2.28.7nb4
> Unknown package: 'py-gtk2' version 2.24.0nb38

Probably the same issue: not understanding versioned packages.

> disc$ pkg_info | grep gobject
> glib2-tools-2.70.2  GLib2/gobject python-dependent tools
> py27-gobject-2.28.7nb4 Python bindings for glib2 gobject
> cairo-gobject-1.16.0nb6 Vector graphics library with cross-device output
> support
> gobject-introspection-1.68.0nb1 GObject Introspection
> py-gobject-shared-2.28.7nb6 Python bindings for glib2 gobject
>
> what's wrong?
> I cannot remove py27-gtk2-2.24.0nb38 and py27-gobject-2.28.7nb4 because
> the former depends on the latter and gimp depends on them.

Probably

cd devel/gobject-introspection && make PYTHON_VERSION_REQD=27 replace clean

That's a big hint as to what the combination of pkg_chk/pkg_rr need to do.
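As a sketch of that hint, the mapping itself is mechanical;
py_version_reqd below is a hypothetical helper, not part of pkg_chk or
pkg_rolling-replace today:

```shell
#!/bin/sh
# Sketch: derive the PYTHON_VERSION_REQD make(1) setting from a
# versioned package name like py37-gobject or py311-paho-mqtt.
# py_version_reqd is a hypothetical helper name.
py_version_reqd() {
    case $1 in
        py[0-9][0-9]-*|py[0-9][0-9][0-9]-*)
            v=${1#py}                      # strip the "py" prefix
            echo "PYTHON_VERSION_REQD=${v%%-*}"
            ;;
        *)
            echo ""                        # not a pyNN- versioned package
            ;;
    esac
}
py_version_reqd py37-gobject   # prints PYTHON_VERSION_REQD=37
```

The harder part, which this does not address, is mapping the PKGNAME
back to its PKGPATH (e.g. py37-gobject to devel/py-gobject) before
running make replace there.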




Re: Bug or no Bug?

2022-02-09 Thread Greg Troxel

6b...@6bone.informatik.uni-leipzig.de writes:

> Hello,
>
> I have installed the 9.99.xx kernel on several systems. On most
> systems there are no problems. On a Dell 2800, the kernel crashes
> during boot. The problem only occurs if the option LOCKDEBUG is set.
>
> options LOCKDEBUG   # expensive locking checks/support
>
> Should a bug report be made in this case? Or should problems that only
> occur when LOCKDEBUG is enabled be ignored?

There are three possibilities, at least

  LOCKDEBUG perturbs timing and provokes a latent bug
  LOCKDEBUG catches bad code and faults
  LOCKDEBUG is itself buggy

In all three -- and you don't know yet -- a bug report is in order.

It would be good to set up kernel gdb, probably, or at least a serial
console, or disable all graphics and see if you can get a backtrace.





Re: Heads up: objdir is now rm -rf resistant

2021-12-15 Thread Greg Troxel

Valery Ushakov  writes:

> On Wed, Dec 15, 2021 at 07:53:55 -0500, Greg Troxel wrote:
>
>> I wonder if "rm -rf" should actually succeed with these modes, by doing
>> a chmod when necessary.  It has always seemed to me that -f is supposed to
>> really mean -f.  But maybe POSIX says otherwise.
>
> That would be a security hole, wouldn't it?  There will be a window
> where the directory that is not supposed to be readable becomes
> readable.

That's a good point.  I guess this is just
intractable in the Unix filesystem model.




Re: Heads up: objdir is now rm -rf resistant

2021-12-15 Thread Greg Troxel

Andreas Gustafsson  writes:

> m...@netbsd.org wrote:
>> I hope fixing this is enough to fix all the cryptic issues.
>
> The build is now fixed, but I still need to give the testbeds the
> ability to automatically remove objdirs containing non-writable
> directories, because otherwise they will get stuck whenever they
> decide to build a historic version from the affected time range.
>
> This is also going to be an ongoing pitfall for anyone building
> historic versions, for example when bisecting.

I wonder if "rm -rf" should actually succeed with these modes, by doing
a chmod when necessary.  It has always seemed to me that -f is supposed
to really mean -f.  But maybe POSIX says otherwise.





Re: backward compatibility: how far can it reasonably go?

2021-12-09 Thread Greg Troxel

ya...@sdf.org writes:

>> "Greg A. Woods"  writes:
>> I am unclear if ipf has been removed by default from current.
>
> Even in NetBSD 9, ipf is not in the GENERIC kernel config.
>
> Was the kernel compiled to use ipf?
>
> e.g. add to kernel config:
> options IPFILTER_LOG# ipmon(8) log support
> options IPFILTER_LOOKUP # ippool(8) support
> options IPFILTER_COMPAT # Compat for IP-Filter
> pseudo-device   ipfilter# IP filter (firewall) and NAT

You are correct, but /sbin/ipf loads the ipl module, and it does this
well enough that I had no idea it wasn't in GENERIC.





Re: backward compatibility: how far can it reasonably go?

2021-12-08 Thread Greg Troxel

"Greg A. Woods"  writes:

> So I've got a couple of old but important machines (Xen amd64 domUs)
> running NetBSD-5, and I've finally decided that I'm reasonably well
> enough prepared to try upgrading them.
>
> However it seems a "modern" (9.99.81, -current from about 2021-03-10)
> kernel with COMPAT_40 isn't able to run some of the userland on those
> systems.
>
> Is this something that should work?

Yes, except ipf, is my memory.

> If it should I think it would make the upgrade much easier as I could
> then plop down the new userland and run etcupdate.  (there are of course
> alternative ways to do the upgrade, eased by the fact they are domUs (*))

Yes, if you can just
  new kernel, reboot
  unpack user sets, merge etc, reboot

then your packages should be ok.

I am unclear if ipf has been removed by default from current.

(Others commented on your backtrace and I have nothing useful to say
about that.)

> Now since these are domUs and their dom0 is also NetBSD I could also
> upgrade them "in absentia" so to speak, i.e. drop a new userland on
> their filesystems from the dom0, though this seems more scary somehow.
> I guess it shouldn't be since the dom0 and other test systems are
> already running what I want them to run.

I think mounting their filesystems into dom0 and doing the unpack/merge
there, perhaps chrooted, is sensible.

> Or, given they are relatively cleanly configured filesystem-wise
> (esp. with a separate /usr/pkg, /home, etc.) I could also build new
> prototype systems, copy over the /etc files and old shared libraries
> from the old system to the new prototype, then run etcupdate on the new
> prototype, and finally shut down the old system, re-assign the other
> filesystems (/var, /usr/pkg, /home, /work, etc.) to the prototype,
> reboot the prototype with the old system's name and address, and finally
> patching up and/or rebuilding whatever is needed in /var.

You could, but that sounds like way more work, and you will almost
certainly miss things.  I have found upgrading to be better than fresh,
although I know one person who puts configs in a VCS with salt and
debugs them until they work and uses that for new/recovery.  But he's a
bit extreme.

> The key thing is that I want to be able to upgraded pkgs piecemeal since
> I'm sure there will be some hiccups and reconfigs required along the
> way.

That is the iffy part.  packages are linked to base system libs and each
such package needs to see a consistent set via dynamic linking.

I would build a full set of packages on another machine so you can
'pkgin up'.

> Note that most everything is static-linked on these systems.  The base
> system is 100% static linked (except for ld.elf_so itself) though of
> course there still are a few baroque packages which require
> dynamic-loaded code so I will still need to be very careful to preserve
> all old shared libraries.  That makes the approach of building a fresh
> prototype somewhat more difficult, though ultimately perhaps safest as
> it can be fully tested before ditching the old system.

If you really did that, then things should be relatively simple.




the openssl 3 question

2021-09-30 Thread Greg Troxel

This discussion belongs, I think, on tech-userlevel, but as traffic
there may be sparse I wanted to point out once that with openssl 3's
release, a number of questions arise.  Please follow up on
tech-userlevel unless somebody like martin@ wants to have the discussion
some place else (fine with me, but just one list please):

  http://mail-index.netbsd.org/tech-userlevel/2021/09/30/msg013070.html





Re: Problem reports for version control systems

2021-05-02 Thread Greg Troxel

Robert Swindells  writes:

> Lloyd Parkes  wrote:
>>The network is a 1Gb/s LAN through to a smaller NetBSD router running 
>>NPF with MSS clamping enabled so that I can get Netflix. My ISP does not 
>>use CGN for my IPv4 connection. My IPv6 connection is tunnelled through 
>>to Hurricane Electric in Sydney, Australia.
>
> Have you tried disabling IPv6 or explicitly connecting using IPv4 ?
>
> I don't see any problems using IPv6 through NPF to update cvs but I have
> native IPv6 and can use a 1500 byte MTU. I'm also using cvs.n.o instead
> of anoncvs.n.o but they have adjacent IPv4 addresses.
>
> I'm guessing that your IPv6 tunnel has a lower MTU than your IPv4
> connection to your ISP.

I update over HE tunnels all the time with no issues (cvs, not anoncvs).
IPv6 tunnels tend to a 1280-byte MTU; my gif for the tunnel is set to
that, and that has all apparently worked fine.  So it is always good to
look with tcpdump, but I suspect tunnel MTU is not really the issue.

There are sometimes problems with reachability via v6, and sometimes the
speeds are lower.  More recently, I had one case of v4 reachability
failure while v6 worked, and have sometimes seen lower ping times on v6.




Re: posix_spawn issue?

2021-05-01 Thread Greg Troxel

The actual POSIX spec may be useful:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html




Re: math/cgal and gcc10

2021-04-25 Thread Greg Troxel

Chavdar Ivanov  writes:

> Hi,
>
> Update to gcc10 requires
>
>  cvs diff -u
> cvs diff: Diffing .
> Index: Makefile
> ===
> RCS file: /cvsroot/pkgsrc/math/cgal/Makefile,v
> retrieving revision 1.62
> diff -u -r1.62 Makefile
> --- Makefile21 Apr 2021 13:24:12 -  1.62
> +++ Makefile25 Apr 2021 09:29:10 -
> @@ -15,7 +15,7 @@
>  LICENSE=   gnu-lgpl-v3
>
>  USE_CMAKE= yes
> -USE_LANGUAGES= c c++03
> +USE_LANGUAGES= c c++11
>  CMAKE_ARGS+=   -DCGAL_INSTALL_MAN_DIR=${PREFIX}/${PKGMANDIR}/man1
>
>  REPLACE_SH+=   scripts/cgal_create_CMakeLists
>
> for math/cgal.
>
> I don't know if one should send PRs for this sort of occasional problems.

Here (or pkgsrc-users?) seems ok.  But my question would be if cgal
documents that it needs a C++11 compiler, in which case this change is
right regardless, or if it's supposed to be ok with C++03, in which case
maybe something else is wrong.




Re: IPv6 default route flapping

2021-04-20 Thread Greg Troxel

Joerg Sonnenberger  writes:

> On Wed, Apr 21, 2021 at 12:54:36AM +0700, Robert Elz wrote:
>> It seems as if what is happening, is that the router is sending RA's with
>> the source-link addr option, which isn't being added to the neighbour
>> cache.
>> 
>> Then NetBSD is doing a NS to discover the address it ignored from the RA,
>> but instead of replying with a ND as would perhaps be expected, the router
>> is simply sending another RA (containing the relevant addr info, which would
>> answer the NS if processed).
>
> I'm not entirely sure that the behavior of sending a RA as "answer" to a
> NS is valid under RFC 4861

Agreed.   Not sending a ND reply to an ND query is obviously highly
irregular, even if it turns out to be allowed.




Re: IPv6 default route flapping

2021-04-20 Thread Greg Troxel

Martin Husemann  writes:

> Well, adding -v or -vv would help here, especially to show the RA lifetime.
>
> But more interesting is what happens on the routing socket, as that is
> where dhcpcd gets the idea of reachability from.

A great suggestion; I'll add turning on dhcpcd debugging and look at all
three.

Besides -v, tcpdump prints timestamps and those were omitted.  Also the
lines were miswrapped, making them very hard to read.

It looks like there are neighbor solicitations and I am not seeing the
replies.

Are you really sure you aren't blocking ND replies in a firewall?   Also
add to your time history the packets that were blocked, with timestamps.




Re: IPv6 default route flapping

2021-04-20 Thread Greg Troxel

Jan Schaumann  writes:

> Apr 20 01:32:32 netbsd dhcpcd[17397]: xennet0: soliciting an IPv6 router
> Apr 20 01:32:34 netbsd dhcpcd[17397]: xennet0: Router Advertisement from fe80::caa:49ff:feaf:1815
> Apr 20 01:32:35 netbsd dhcpcd[17397]: xennet0: fe80::caa:49ff:feaf:1815 is unreachable
> Apr 20 01:32:35 netbsd dhcpcd[17397]: xennet0: soliciting an IPv6 router
> Apr 20 01:32:44 netbsd dhcpcd[17397]: xennet0: fe80::caa:49ff:feaf:1815 is reachable again
> Apr 20 01:32:52 netbsd dhcpcd[17397]: xennet0: fe80::caa:49ff:feaf:1815 is unreachable
> Apr 20 01:32:52 netbsd dhcpcd[17397]: xennet0: soliciting an IPv6 router
> Apr 20 01:32:54 netbsd dhcpcd[17397]: xennet0: Router Advertisement from fe80::caa:49ff:feaf:1815
> Apr 20 01:32:55 netbsd dhcpcd[17397]: xennet0: fe80::caa:49ff:feaf:1815 is unreachable
> Apr 20 01:32:56 netbsd dhcpcd[17397]: xennet0: soliciting an IPv6 router

I don't know what's wrong, but obviously you should 'tcpdump -w IPV6
ip6' and look at it.




Re: extra files in DESTDIR

2021-02-27 Thread Greg Troxel

bch  writes:

questions I still wonder about:

>>>   what arch are you building for
>>>   what was the build host
>>>   what was your build.sh line


> So I nuked the /usr/obj directory completely (which contained the DESTDIR),
> used the -A switch to confirm cvs co is properly tracking -current, and ran
> a single job (e.g. no “-j4”) and tools failed to build. I ended up deleting
> the tools source completely and re-pulling and it then built properly. I

This is a clue.  But perhaps you didn't delete the objdir for the
tools.  After cvs up -A, removing and re-checking out should not have
changed anything.

Do

  cvs up -I \!

and save it.  There should be no output, but you might have built files
in the tree.

Also from top level

  cvs diff -r HEAD -I \!

and again output should be empty.

> went on to build a distribution, and ended up at the same place - failed
> the final checkflist. The offending file is a regular file, while it’s
> analog in my running system is a symlink, which is another curiousity and
> perhaps a clue? I’m still at a loss. I captured the build w script(1) so I
> may be able to provide more information on request.

See src/libexec/ld.elf_so and read the Makefile.  I just removed
ld.elf_so from my destdir and re-ran make install (via using the
arch-specific nbmake in the tooldir) and it worked fine to recreate the
symlink.

Check that directory specifically in your checkout for not having
spurious files.




Re: extra files in DESTDIR

2021-02-26 Thread Greg Troxel

Bch  writes:

> I've been getting this (and worse) for a while, and just living with it:
>
> ===  1 extra files in DESTDIR  =
> Files in DESTDIR but missing from flist.
> File is obsolete or flist is out of date ?
> --
> ./usr/libexec/ld.elf_so
> =  end of 1 extra files  ===

You didn't specify:

  do you really mean -current
  what date did you update
  what arch are you building for
  what was the build host
  what was your build.sh line
  
> It *was* worse when I had errant cvs tags in effect, and so part
> of the tree was updated differently than the rest, but thats sorted
> out now.

Do you mean that you did "cvs up -dP" and got no output?  (Or at least
only P/U and not M?)

It might be good to rm -rf your objdir and destdir and see if this
presents on a fresh build.

> This particular extra file is a troublesome long-running issue for
> me though -- I've ignored it until now, but I'm trying to build a
> release to update a remote machine and this is sticking point.
> Deleting the file doesn't work - it will re-appear. What do I need
> to look at or do to solve this?

I have a netbsd-9 system on which I built current for amd64.  My destdir
has

> ls -l /usr/obj/gdt-current/destdir/amd64/usr/libexec/ld.elf_so
lrwxr-xr-x  1 gdt  wheel  18 Oct 18 00:01 /usr/obj/gdt-current/destdir/amd64/usr/libexec/ld.elf_so -> /libexec/ld.elf_so

starting from src/distrib, I find

./lists/base/md.amd64:./libexec/ld.elf_so-i386  base-sys-shlib  compat,pic
./lists/base/md.amd64:./usr/libexec/ld.elf_so-i386  base-sys-shlib  compat,pic

so things look fine to me.




how to get nodev for zfs?

2021-02-23 Thread Greg Troxel

I'm contemplating using zfs over NFS for domU package builders, and I'm
basically allergic to NFS for security reasons but it should be
confined.

So I'm trying to reduce exposure, and have set setuid=off on
zfs filesystems.  That successfully prevented a suid binary from working.

The other usual thing is "nodev", and zfs has a devices property on or
off.  So I went to set it to off and got an error that FreeBSD doesn't
support that.

I made a device node (just with mknod) for wd0d and I was able to dd
from it.

Is there any good approach to avoiding this?  Why doesn't devices=off
just lead to the nodev mount option and work, similar to how setuid=off
leads to nosuid?







Re: zpool import lossage

2021-02-17 Thread Greg Troxel

Lloyd Parkes  writes:

> You should be able to create the symlink in any directory and tell zfs
> import which directory to use.

Thanks for the great hint; it works, reduces ick, and limits scope of
ick.  In a directory searched via -d, all files are searched, not just
whole disks.

> I think that /etc/zfs is used for maintaining certain system state
> information about imported pools across reboots and so I'm not overly
> surprised to see that it is empty after you exported the pool. It
> might just optimise the boot time import of the pool.

/etc/zfs/zpool.cache has a record for each pool of where the devices
are.  It is deleted on export; that's a feature :-)

I updated the HOWTO; see "pool importing problems".

https://wiki.netbsd.org/zfs/




zpool import lossage

2021-02-16 Thread Greg Troxel

(I'm testing on 9, but am guessing this is similar on current and will
if anywhere be fixed there and not necessarily pulled up to 9.)

I'm starting to try out zfs.   So far I don't have any data that
matters.

On a 1T SSD I have wd0[abe] as root/swap/usr as an unremarkable netbsd-9
system, on an unremarkable amd64 desktop with 8G of RAM.

I created pool1 with wd0f, which is the rest of the 1T disk, about 850G,
not raid of any kind.  I created a few filesystems, changed their mount
points, changed their options, and mounted one over NFS from another
machine, and all seemed ok.  (Yes, I realize the doctrine that "use the
whole disk as a zfs component" is the preferred approach.)

I wanted to rename my pool from pool1 to tank0, for no good reason,
mostly trying to do all the scary things while the only data I had was a
pkgsrc checkout, but partly having seen Stephen Borrill's report of
import trouble.

So I did

  zpool export pool1

and sure enough all my zfs stuff was gone.

Then I did, per the man page:

  zpool import

and nothing was found.  After a bunch of reading and ktracing, I
realized that there is no record of the pool in /etc/zfs or anywhere
else I could find, and the notion is that zpool import will somehow find
all the disks that have zfs data on them, apparently by opening all
disks and looking for some kind of ZFSMAGIC.  But it looked at wd0 and
not the slices.  There was no apparent way to ask it to look at wd0f
specifically.  So I did

  cd /dev; rm wd0; ln -s wd0f wd0

which is icky, but then zpool import found wd0f and I could

  zpool import pool1 tank0

So this feels like a significant bug, and matches Stephen Borrill's
report.  I think we're heading to documenting this in the wiki, or at
least I am.

Does anything think I have this wrong?
Is anyone inclined to do anything more serious?




zfs howto

2021-02-12 Thread Greg Troxel

Long ago I rototilled to zfs howto adding far more questions than
answers.  I just did another rototill pass.

  https://wiki.netbsd.org/zfs/

While many \todos remain, the biggest questions I have are about NFS:

  If I want to export a zfs filesystem over NFS, what specifically do I
  need to do.

  Does the crash bug referenced in the NFS section still exist in
  current current?  (It's still open)
  http://gnats.netbsd.org/55042

and:

  FreeBSD has tunables to control memory usage, or did.  NetBSD seems
  not to.  If I got this wrong, please explain.
  




zfs: 9 vs current, and ZIL/L2ARC on ssd?

2021-02-11 Thread Greg Troxel

I am about to try to use zfs for the first time and have a few
questions.

I have a machine that is running NetBSD-9/amd64 with 2 cores, 8G of RAM,
a single 1T SSD, with a smallish root/swap/usr, and about 870 GiB free
intended for zfs.  I am heading for one pool that is not raid at all.

I'm not all that worried about transitions or stability; this is a build
machine for packages, not particularly precious, and it being down for a
week while I fix it is no big deal.

I will likely pivot the machine to be xen dom0; I hope that doesn't
matter much (other than 1 core only in the dom0).  Or I might use nvmm,
or both.

I might add a spinning disk later, either internal or USB.  (I realize
that there, I probably want both ZIL and L2ARC on SSD.  I would rather
move bits later than do things now to ease that, since I do not have an
actual plan.)

My questions are:

  Is 9/current close enough to the same zfs code that it doesn't matter
  which I run?  If I'm inclined to run current for other reasons, is
  that a bad idea zfs-wise?

  I understand that zfs has an intent log always, and that can be within
  the pool, or one can add a ZIL device.  With the pool having one
  device which is an SSD, I see no point in partitioning off part of
  that SSD to be the ZIL.

  I understand that zfs has ARC in RAM, and can have L2ARC on disk.
  Given that the pool is on SSD, it seems pointless to split off some
  for L2ARC.

My expected answers are:

  The code is basically the same and it doesn't really matter, but
  probably current has some bugfixes 9 doesn't.  There's no reason
  current is scary because of zfs.

  There is no point in a ZIL on the same SSD as the pool.

  There is really no point in L2ARC on the same SSD as the pool.


Corrections/clues appreciated.




Re: cannot allocate memory running mame

2020-10-01 Thread Greg Troxel

Thomas Klausner  writes:

>> > /*
>> >  * Virtual memory related constants, all in bytes
>> >  */
>> > #define MAXTSIZ (256*1024*1024) /* max text size */
>> 
>> AFAICT for amd64 the limit is arbitrary, if you only need it temporarily for
>> debugging purposes, just bump it to 512 MB (or bigger) and rebuild your 
>> kernel.
>
> Thanks for the analysis.
>
> Actually, that is the size of mame stripped, so yes, I need this bigger :)
>
> Should we bump it in general?

While 256M of text seems big, I find it really surprising to have a
compiled limit that low.

For me, the biggest installed program is mongod at 32M, in terms of text
as reported by size(1).

So I would support bumping that to 512M for sure, maybe 1G.

It also seems like this belongs in ulimit, as something a sysadmin would
reasonably want to set along with data size limits.








Re: ctype and gcc9

2020-09-21 Thread Greg Troxel

Patrick Welche  writes:

> Since gcc9, essentially every ctype using piece of software fails with
>
>error: array subscript has type 'char' [-Werror=char-subscripts]
>
> which prompts a style question: cast every argument of every call to
> a ctype function in every piece of software to (unsigned char), or
> -Wno-error=char-subscripts], or something else?

There were earlier warnings that were similar.  This is tricky business!

POSIX says (quoting C99, which is harder to give a URL to):

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/isalpha.html

which says that these are functions.  However in NetBSD they are macros.
I have seen arguments that the macro implementation is legitimate.  And,
I think a true function would have the same problem, just showing up
differently in terms of warnings (promoting char to int in a function
call, when the function has UB for many of those values -- a warning
less likely to be issued).

The caller must provide an int which is representable as unsigned char
(or EOF), says the definition, if you read it straightforwardly without
trying to be a language lawyer thinking about macros.

If you pass a char to a function that takes int, and char is a signed
type, and the value is negative, there is sign extension, and thus
undefined behavior when using a ctype function.

https://wiki.sei.cmu.edu/confluence/display/c/STR34-C.+Cast+characters+to+unsigned+char+before+converting+to+larger+integer+sizes
https://wiki.sei.cmu.edu/confluence/display/c/STR37-C.+Arguments+to+character-handling+functions+must+be+representable+as+an+unsigned+char

So, it's more than a style question; code that passes char to ctype
functions is actually wrong, and yes, it should all be fixed.




Re: arp: ioctl(SIOCGNBRINFO): Inappropriate ioctl for device

2020-09-16 Thread Greg Troxel

Thomas Klausner  writes:

> On Wed, Sep 16, 2020 at 11:10:55AM +0200, Martin Husemann wrote:
>> On Wed, Sep 16, 2020 at 11:05:49AM +0200, Thomas Klausner wrote:
>> > The one with 192.168.0.x configured is wm0. (I only have an lo0 except for 
>> > that.)
>> 
>> Strange, your kernel is newer or same age as your userland?
>
> My kernel is from September 4. Since there was no version bump I
> assumed that I could install a newer userland (with gcc9) without
> problems.

I don't think that's a good assumption.   We bump the kernel for an ABI
change within the kernel, and this was an addition to the user/kernel
ABI.

My impression is that you always need a new kernel before a new
userland, and I never had any reason to think any other plan was
reliable.




Re: hang (not about pkg_rolling-replace) building libvdpau

2020-09-03 Thread Greg Troxel

Chavdar Ivanov  writes:

> In this case I don't think it is anything to do with
> pkg_rolling-replace; I've reported a few of these hangs, which happen
> to happen during pkg_rolling-replace, but involve most often cmake,
> but other programs as well. Apparently there are similarities in the
> traces, perhaps pointing to the thread model and execution. In all
> these occasions the process in question continued to a successful
> conclusion after attaching and then detaching with gdb.

Thanks; that is helpful information.  It causes unnecessary confusion
to associate problems with things that don't cause them, and causes
people not to pay attention.




Re: hang while updating pkg_rolling-replace libvdpau

2020-09-03 Thread Greg Troxel

Riccardo Mottola  writes:

> I finished updating all my core system to current on i386-64, kernel,
> userland, etc.
>
> Now I launched pkg_rolling replace, it crunches through several
> packages, but then hangs.
>
>
> I tried running it several times, rebooting in between... but
> nothing. What is "hang" ? The CPU stays idle too. Hangs exactly there.

pkg_rr is just a shell script.  As always, I ask that you see in what
order it does "make replace package clean", then run the problematic
command by itself, without pkg_rr, and report that instead.

(If you do find a bug in pkg_rr, of which there are many, please do
report it.  But it is confusing to people to conflate what broke with
the program that merely chose the sequence.)




Re: RPI3 serlial clock confusion?

2020-07-11 Thread Greg Troxel
Michael van Elst  writes:

> We (and others) use the mini uart for console on RPI3 to keep the
> real UART for bluetooth. The mini uart has a limited FIFO, which
> means you cannot run bluetooth at best speed. It would also be
> subject to the changes in the core frequencies like the console is
> now.
>
> You can chose what you want to get:
>
> - have problems with serial console, that's the default
> - have problems with bluetooth
> - run on fixed frequency
>
> You can use settings in config.txt and/or load a DTB overlay to
> select this.  So there are only two questions, what should be the
> default and how easy is it for a user to do this configuration.

The third question is if a user starting to read at

  http://wiki.netbsd.org/ports/evbarm/raspberry_pi/

will come to understand this.

I am realizing that this notion of mini/real UART is confusing.  I am
guessing there is some internal plumbing to connect the console (serial
port) IO pins to one or the other, and similarly for bluetooth, and
that this is controlled by the dtb choice, but I have no real clue
about how to do this.


Re: USB cardreader under netbsd-9 "failed to create xfers"?

2020-06-27 Thread Greg Troxel
Usually this is about lack of available memory, due to fragmentation.

Different parts of the USB stack do allocation differently.  could be
xhci vs ehci or something.

does this happen if you plug it in when you have just rebooted?


Re: PATCH: Relax fdatasync checks to IEEE Std 1003.1-2008

2020-05-24 Thread Greg Troxel
Yorick Hardy  writes:

(I realize you later say this isn't it.)

>> @@ -4141,10 +4140,6 @@ sys_fdatasync(struct lwp *l, const struct 
>> sys_fdatasync_args *uap, register_t *r
>>  /* fd_getvnode() will use the descriptor for us */
>>  if ((error = fd_getvnode(SCARG(uap, fd), )) != 0)
>>  return (error);
>> -if ((fp->f_flag & FWRITE) == 0) {
>> -fd_putfile(SCARG(uap, fd));
>> -return (EBADF);
>> -}
>>  vp = fp->f_vnode;
>>  vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>>  error = VOP_FSYNC(vp, fp->f_cred, FSYNC_WAIT|FSYNC_DATAONLY, 0, 0);

If you look at the function beyond what's in the diff, you will see (I
think, but I really mean I see) that there is always a single
fd_putfile.  This was just doing the put before returning, rather than
setting error and the usual "goto out" where the end-of-routine cleanup
happens.  See also sys_fsync_range() in the same file.

I could be reading this wrong.


Re: How long to build from source?

2020-05-08 Thread Greg Troxel
nottobay  writes:

> I have a 5 year old a8 laptop. How can I figure out how long compiling the
> current source will take?

Actually compile it and report back.

Why do you need to know?  It will almost certainly be less than a day.
Is that going to cause you to try NetBSD, or not to try NetBSD?


Re: sdmmc_mem_enable failed with error 60

2020-04-08 Thread Greg Troxel
Patrick Welche  writes:

> Having seen TSC related commits, I tried TSC again, and noticed that
> I could read/write SD cards again, and irritating pckb command timeouts
> stopped happening. This laptop seems stable apart from the incorrect
> time.

Does the time work ok with TSC, with the new commits?  Basically there
was a bug in estimating TSC rate, which I think was a multiprocessing
hazard.

I have a system where TSC should be 2.8G, and it was being measured at
boot at 3.9G sometimes (and sometimes 2.8G).   I haven't tried to
backport the patches to 8 and test, but my guess is that I'll have the
right TSC value and timekeeping with TSC will be ok.


Re: PATCH: Relax fdatasync checks to IEEE Std 1003.1-2008

2020-03-25 Thread Greg Troxel
Paul Ripke  writes:

> On Mon, Mar 16, 2020 at 08:47:27AM -0400, Greg Troxel wrote:
>>   [lots of test reports about fdatasync patch]
>> 
>> Thanks -- that's enough for me to be comfortable.
>> and it's been proposed for more than long enough, with no adverse
>> comments, so I'll commit it soonish.

> fwiw, I missed a comment at the top of the function... fixed in
> attached patch.

I have committed your patch, exactly as you just sent it.  My full
release build worked and I have an anita test run in process, just in
case.

Thanks for persevering on this.  It takes many people to fix all the
loose ends in an operating system!


Re: PATCH: Relax fdatasync checks to IEEE Std 1003.1-2008

2020-03-16 Thread Greg Troxel
  [lots of test reports about fdatasync patch]

Thanks -- that's enough for me to be comfortable.
and it's been proposed for more than long enough, with no adverse
comments, so I'll commit it soonish.





Re: PATCH: Relax fdatasync checks to IEEE Std 1003.1-2008

2020-03-13 Thread Greg Troxel
Paul Ripke  writes:

>> Running atf on a GCP VM. Never run atf before, we'll see how it goes.
>
> Failed test cases:
> dev/fss/t_fss:basic, include/t_paths:paths, 
> lib/libarchive/t_libarchive:libarchive,
> lib/libc/sys/t_ptrace_sigchld:traceme_raise1, 
> lib/libc/sys/t_ptrace_wait:resume,
> lib/libc/sys/t_ptrace_wait3:resume, lib/libc/sys/t_ptrace_wait4:resume, 
> lib/libc/sys/t_ptrace_wait6:resume,
> lib/libc/sys/t_ptrace_waitid:syscall_signal_on_sce, 
> lib/libnvmm/t_io_assist:io_assist,
> lib/libnvmm/t_mem_assist:mem_assist, lib/librumphijack/t_tcpip:http, 
> lib/librumphijack/t_tcpip:nfs,
> net/arp/t_arp:arp_rtm, net/if_pppoe/t_pppoe:pppoe6_chap, 
> net/if_pppoe/t_pppoe:pppoe6_pap,
> net/if_pppoe/t_pppoe:pppoe_chap, net/if_pppoe/t_pppoe:pppoe_pap, 
> net/mpls/t_mpls_fw6:mplsfw6,
> net/mpls/t_mpls_fw64:mplsfw64_expl, net/ndp/t_ndp:ndp_rtm, 
> rump/rumpkern/t_vm:uvmwait,
> crypto/opencrypto/t_opencrypto:ioctl
>
> Summary for 854 test programs:
> 8159 passed test cases.
> 23 failed test cases.
> 48 expected failed test cases.
> 1134 skipped test cases.
>
> I don't have a baseline - so I don't know if anything has changed.
> Glancing thru the above, nothing looks related, but I could run again
> against a pristine kernel?

That would help, and there is

http://releng.netbsd.org/b5reports/amd64/

but the problem is that there is a constant stream of commits that cause
new failures.  (We don't require proving that each commit doesn't cause
atf failures, but I personally like to be careful.)

There is also the issue that actually testing fsync* sort of requires
controlled crashes, which is doable but expensive.

We do have fsync tests but no fdatasync tests.  I wonder whether a
naive copy of that file with s/fsync/fdatasync/ would make sense.
See src/tests/lib/libc/sys/t_fsync.c

(I am not requesting that you do add atf tests as a precondition for
committing, although I would like to hear that unifi/mongodb3 (without
the patch) works.)

Thanks for working on this.


Re: PATCH: Relax fdatasync checks to IEEE Std 1003.1-2008

2020-03-12 Thread Greg Troxel
Paul Ripke  writes:

> Currently, fdatasync requires a file descriptor open for writing, as
> per IEEE Std 1003.1, 2004:
> [EBADF]
> The fildes argument is not a valid file descriptor open for writing.
> https://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html
>
> While, IEEE Std 1003.1-2008 (and later revisions):
> [EBADF]
> The fildes argument is not a valid file descriptor.
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fdatasync.html

I didn't realize that there was a change in POSIX.  That makes things
quite different.

It would help if someone ran the entire atf test suite; you might
install py-anita to do so.

I am inclined to commit this change if there aren't objections.


Re: XEN 4.11 and 9.99.48 DOMU performance

2020-03-10 Thread Greg Troxel
I wonder if the issue is that the dom0 is single threaded, serializing
all the IO.

Have you done dd bs=1m from the dom0 raw disk, the dom0 backing file,
and the domU ('raw disk')?   I used to see X, 0.9X, and (0.9)^2 X, ish,
on somewhat old Xen and older NetBSD.

I would run "systat vmstat" in the domU, to see the disk busy fraction
and transfer rate.

I also wonder if there is some sort of fsync in the domU leading to
virtual FUA leading to fsync in the dom0 going on.






Re: benchmark results on ryzen 3950x with netbsd-9, -current, and -current (no DIAGNOSTIC)

2020-03-03 Thread Greg Troxel
matthew green  writes:

Thanks for the very interesting data.

> below has a full summary, but the highlights:
>
>  - DIAGNOSTIC costs 3-8%

This seems higher than it ought to be.  I don't doubt your measurements;
I mean that things are probably being done under DIAGNOSTIC that aren't
really appropriate and are better put under DEBUG.   I have always
believed, since using DIAGNOSTIC under 2.8BSD, that DIAGNOSTIC should
basically only be adding assertions and not doing anything expensive.

Part of that is the basic intent, and part of it is that I don't think
anyone should want to turn off DIAGNOSTIC for performance reasons.

I don't know how to find the things that cost more (perhaps one could
compile some files with and without DIAGNOSTIC?), but if we could remove
any expensive things from DIAGNOSTIC that would be good.

I'll suggest 1% slowdown as a goal, without really knowing how realistic
that is.

It's very cool to see that the gains in current overwhelm the DIAGNOSTIC
slowdown.  It makes me want a new motherboard with more cores :-)



Re: USB umass hard drive "failed to create xfers" when attaching

2020-02-17 Thread Greg Troxel
mlel...@serpens.de (Michael van Elst) writes:

> g...@lexort.com (Greg Troxel) writes:
>
>>My impression is that something, perhaps more umass than *hci, needs a
>>very large chunk of memory.
>
> umass allocates two 64k (MAXPHYS sized) DMA buffers and a few smaller ones.
> For all drivers but ehci each buffer must use contiguous physical pages.
>
> ehci learned to use multiple DMA segments some time ago, so the
> error is now rare. xhci still has to learn it, other drivers may
> not be able to support it due to hardware limitations.
>
>>My fuzzy impression is that the standard wisdom is that drivers should
>>not demand really large contiguous chunks.
>
> 64kbyte is not really large.

Thanks for the clarity, and agreed that 64 kB is nowhere near large.


Re: USB umass hard drive "failed to create xfers" when attaching

2020-02-17 Thread Greg Troxel
Paul Goyette  writes:

> First, this is on an amd64 system, with 8 cores/16 threads and 128GB of RAM.
>
> On IRC it was suggested (thanks, maya!) that the error message might be
> related to memory fragmentation.  I didn't believe it (given how much
> RAM I have), but a quick check with top(1) showed that I had more than
> 100GB of 'file cache' active.  So, I unmounted all my development trees
> (to force the cache to get flushed).  Sure enough, I am now able to
> successfully mount the USB drive!
>
> So, sounds like "something somewhere isn't quite right (tm)".  I would
> have expected a memory allocation failure to automatically trigger some
> mechanism to reclaim some of the file cache...

I have seen this too, on ehci.

My impression is that something, perhaps more umass than *hci, needs a
very large chunk of memory.  The system can end up with lots of memory
available, but not one large enough.  If the available memory in bytes
is enough, that may not trigger reclaiming.

My fuzzy impression is that the standard wisdom is that drivers should
not demand really large contiguous chunks.


Re: File corruption?

2020-01-19 Thread Greg Troxel
Robert Nestor  writes:

> Sorry for not being specific.  When I do the shutdown on a subsequent
> reboot all the filesystems are dirty forcing fsck to run.  Sometimes
> it finds some minor errors and repairs them.

ok - I am trying to separate "corruption", which means that files that
were not in the process of being written were damaged, from an unclean
shutdown with the usual non-frightening fixups.

> I’m running xfce4, so when I do the “shutdown -r now” I see xfce4 and
> X exit bringing me back to the console display that was active when I
> booted the system.  As it goes thru the normal shutdown process it
> reaches a point where I get the assertion error (something like
> “uvm_page locked against owner”) followed by a stack trace and then
> quickly followed by the system rebooting.  There is no crash file
> generated.

(Definitely follow ad@'s advice here.)

You can of course exit xfce4 back to the console before starting this.

> I haven’t changed any crash parameters from the stock setup.  I seem
> to recall there used to be one for kernel crashes, but can’t find it
> now.  I guess next step is to boot up with the “-d” flag and see if I
> can get something useful.  Is that correct?

See swapctl(8) and fstab(5).  Basically you need to configure a dump
device (almost always the swap device).  swapctl -l is useful.
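A minimal sketch of the fstab side, assuming a hypothetical wd0b swap
partition ("dp" in the options field marks the dump device; see
fstab(5)):

```
# /etc/fstab -- hypothetical device name; "dp" marks the dump device
/dev/wd0b   none   swap   sw,dp   0   0
```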

But, it is likely that after sending ad@ a picture, you won't have to
debug this any more...


Re: File corruption?

2020-01-19 Thread Greg Troxel
Robert Nestor  writes:

> I’ve downloaded and installed 9.99.38 (Jan 17 build) and the original
> problem I was seeing with “git” is gone.  However, I’m now seeing a
> new problem with file corruption, but it only seems to happen when I
> do a normal shutdown.  If I do a “shutdown -r now” to shutdown and

You say "corruption", but then you say "filesystems are dirty".  Are you
actually finding files with bad contents?

> reboot the system I see a crash during the shutdown phase and on
> subsequent reboot the filesystems are all dirty. There is an assertion
> about uvm_page I think, but it quickly disappears on the reboot.

Are you saying that if you "shutdown now", that the system shuts down
with the crash?  And that there are then files with bad contents?


> Is there a log file someplace that is written on the shutdown or is
> there an easy way for me to capture the traceback before it disappears
> off my screen?  There’s no crash file produced.

crash dumps may not be enabled.  You could also enable ddb and do the
shutdown not using X, and then get a trace there.


Re: net/net-snmp build failure on 9.99.37

2020-01-14 Thread Greg Troxel
This appears to be a broken tar issue surrounding hardlinks and Christos
has backed it out.  So perhaps update and rebuild and try again.

I can see why you refer to 9.99.37 as a version of NetBSD, but really it
is not a name for a specific version.  That last number is increased
when there is an internal ABI change, but many other things happen
within the operating system that do not change that version number.   So
it is probably best to give the time at which you updated.




