Re: All (?) network tests failing

2020-04-05 Thread Andreas Gustafsson
Christos Zoulas wrote:
> It could be due to tcsh doing its file descriptor dance differently...
> What shell are you using?

The default shell of the root user.  The notion of changing the shell
according to a personal preference doesn't really apply when running
automated tests on a fresh automated install.

The shell command used to run the tests is:

  mkdir /tmp/tests && \
  cd /usr/tests && \
  { atf-run; echo $? >/tmp/tests/test.status; } | tee /tmp/tests/test.tps |
  atf-report -o ticker:- -o xml:/tmp/tests/test.xml

which may also matter as file descriptors are allocated for log files
and pipes.
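
To make the point about descriptor allocation concrete, here is a minimal C sketch. It is not part of the test harness, and the function names are made up for illustration. POSIX open() returns the lowest-numbered descriptor that is not currently open, so every extra pipe or log file a process holds shifts the numbers that later calls receive:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Return the descriptor number that open() hands out right now,
 * or -1 on error.  The descriptor is closed again before returning.
 */
int
next_free_fd(void)
{
	int fd = open("/dev/null", O_RDONLY);

	if (fd != -1)
		close(fd);
	return fd;
}

/*
 * Hold one extra descriptor open, the way an inherited pipe or log
 * file would, and report which number the next open() receives.
 * Returns -1 on error.
 */
int
next_free_fd_with_extra_open(void)
{
	int extra = open("/dev/null", O_RDONLY);
	int next;

	if (extra == -1)
		return -1;
	next = next_free_fd();
	/* "extra" occupies the lowest free slot, so "next" must be higher. */
	assert(next == -1 || next > extra);
	close(extra);
	return next;
}
```

Comparing the two return values in the same process shows the shift: the second is always larger than the first.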

But rather than manually doing a fresh install, logging in as root,
and running that command, it may be easier to just run the anita
command starting on line 3 of the log from the first failed test run
on b5:

  http://releng.netbsd.org/b5reports/i386/2020/2020.03.22.00.56.45/test.log

-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-04-05 Thread Robert Elz
Date:Sun, 5 Apr 2020 01:26:15 - (UTC)
From:chris...@astron.com (Christos Zoulas)
Message-ID:  

  | It could be due to tcsh doing its file descriptor dance differently...
  | What shell are you using?

When I run tests against HEAD, I use /bin/sh - the only other
possibilities are csh (which I gave up using decades ago, before
there was a tcsh) or /bin/ksh (of which our version has too many
"issues" to bother with).   I have nothing from pkgsrc installed
in test setups.

The b5 tests are the same I believe, simply build HEAD, install it,
and atf-run

kre



Re: All (?) network tests failing

2020-04-04 Thread Christos Zoulas
In article <24200.56933.470930.730...@guava.gson.org>,
Andreas Gustafsson   wrote:
>Robert Elz wrote:
>> Not an idea, but a possibility - the change to route.c (1.167) was
>> unimportant - it doesn't really matter (to the tests) if it does
>> anything useful or not - it is possible that it just happened that the
>> fd that the setsockopt() was being performed on was a socket (a suitable
>> socket) prior to the openssl update, but after that, the rump fd's
>> shifted around, and what the setsockopt() was operating upon was no
>> longer a socket.
>> 
>> No idea if that is really what happened or not, but something like that
>> is at least plausible (even though it would seem that the chances of the
>> sys call having worked by accident seem to be not very high).
>
>I agree that this sounds plausible.  Also, the tests never failing for
>Christos might then be explained by him running them in an environment
>that has a different number of file descriptors already in use.

It could be due to tcsh doing its file descriptor dance differently...
What shell are you using?

christos



Re: All (?) network tests failing

2020-04-04 Thread Andreas Gustafsson
Robert Elz wrote:
> Not an idea, but a possibility - the change to route.c (1.167) was
> unimportant - it doesn't really matter (to the tests) if it does
> anything useful or not - it is possible that it just happened that the
> fd that the setsockopt() was being performed on was a socket (a suitable
> socket) prior to the openssl update, but after that, the rump fd's
> shifted around, and what the setsockopt() was operating upon was no
> longer a socket.
> 
> No idea if that is really what happened or not, but something like that
> is at least plausible (even though it would seem that the chances of the
> sys call having worked by accident seem to be not very high).

I agree that this sounds plausible.  Also, the tests never failing for
Christos might then be explained by him running them in an environment
that has a different number of file descriptors already in use.
-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-04-04 Thread Robert Elz
Date:Sat, 4 Apr 2020 16:37:08 +0300
From:Andreas Gustafsson 
Message-ID:  <24200.36228.881611.989...@guava.gson.org>

  | Does anyone have an idea why the tests didn't start failing
  | immediately when route.c 1.167 was committed, but only after the
  | seemingly unrelated openssl update?

Not an idea, but a possibility - the change to route.c (1.167) was
unimportant - it doesn't really matter (to the tests) if it does
anything useful or not - it is possible that it just happened that the
fd that the setsockopt() was being performed on was a socket (a suitable
socket) prior to the openssl update, but after that, the rump fd's
shifted around, and what the setsockopt() was operating upon was no
longer a socket.

No idea if that is really what happened or not, but something like that
is at least plausible (even though it would seem that the chances of the
sys call having worked by accident seem to be not very high).
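
The failure mode described above can be reproduced outside rump with a small C sketch. The helper names are invented for illustration, and SO_REUSEADDR stands in for SO_RERROR only because it is available on every system. The same setsockopt() call succeeds or fails depending purely on what kind of object the descriptor number happens to refer to; "Socket operation on non-socket" is the message text for ENOTSOCK:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Try to set a socket option on fd.  Returns 0 on success, or the
 * errno value from setsockopt() on failure.
 */
int
try_sockopt(int fd)
{
	int on = 1;

	assert(fd >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
		return errno;
	return 0;
}

/* The descriptor is a socket: the call is expected to succeed. */
int
sockopt_on_socket(void)
{
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	int error;

	if (fd == -1)
		return -1;
	error = try_sockopt(fd);
	close(fd);
	return error;
}

/* The descriptor is a plain file: the call fails with ENOTSOCK. */
int
sockopt_on_file(void)
{
	int fd = open("/dev/null", O_RDONLY);
	int error;

	if (fd == -1)
		return -1;
	error = try_sockopt(fd);
	close(fd);
	return error;
}
```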

kre



Re: All (?) network tests failing

2020-04-04 Thread Martin Husemann
On Sat, Apr 04, 2020 at 09:38:19AM -0400, Christos Zoulas wrote:
> I am still puzzled by this as the tests never failed on my machine!

I still see test failures on macppc and sparc64; some of them might
be related to libpcap being miscompiled (see my other PR).

Martin


Re: All (?) network tests failing

2020-04-04 Thread Christos Zoulas


> On Apr 4, 2020, at 9:37 AM, Andreas Gustafsson  wrote:
> 
> Martin Husemann wrote:
>> I analyzed this particular one (202 steps back because rump.netstat dumps
>> core) - will fix it soon.
> 
> With martin's changes, the number of unexpected test failures
> went down from 413 to 6 on my bare metal testbed:
> 
>  
> http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.04.html#2020.04.03.16.20.52
> 
> The remaining 6 or so failures are unrelated.  Thanks to everyone who
> helped get this fixed.
> 
> Does anyone have an idea why the tests didn't start failing
> immediately when route.c 1.167 was committed, but only after the
> seemingly unrelated openssl update?
> 
I am still puzzled by this as the tests never failed on my machine!

christos





Re: All (?) network tests failing

2020-04-04 Thread Andreas Gustafsson
Martin Husemann wrote:
> I analyzed this particular one (202 steps back because rump.netstat dumps
> core) - will fix it soon.

With martin's changes, the number of unexpected test failures
went down from 413 to 6 on my bare metal testbed:

  
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.04.html#2020.04.03.16.20.52

The remaining 6 or so failures are unrelated.  Thanks to everyone who
helped get this fixed.

Does anyone have an idea why the tests didn't start failing
immediately when route.c 1.167 was committed, but only after the
seemingly unrelated openssl update?
-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-04-03 Thread Martin Husemann
On Fri, Apr 03, 2020 at 06:04:44PM +0300, Andreas Gustafsson wrote:
> Christos Zoulas wrote:
> > >That should take care of the failing network related tests that contain
> > >rump.route commands, but that's not all of the failing tests.
> > 
> > Thanks! I fixed that now. Let's see how many break after this...
> 
> 413 on my bare metal testbed (20 steps forward, 202 steps back):

I analyzed this particular one (202 steps back because rump.netstat dumps
core) - will fix it soon.

Martin


Re: All (?) network tests failing

2020-04-03 Thread Andreas Gustafsson
Christos Zoulas wrote:
> >That should take care of the failing network related tests that contain
> >rump.route commands, but that's not all of the failing tests.
> 
> Thanks! I fixed that now. Let's see how many break after this...

413 on my bare metal testbed (20 steps forward, 202 steps back):

  
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/commits-2020.04.html#2020.04.02.21.36.03

There have also been other commits since the previous run, so these
changes in test outcomes may not all be due to your sbin/route commits.
-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-04-02 Thread Christos Zoulas
In article <19747.1585851...@jinx.noi.kre.to>,
Robert Elz   wrote:
>Date:Mon, 30 Mar 2020 14:25:01 -0400
>From:Christos Zoulas 
>Message-ID:  <3d3ac2b9-5e6e-400c-9a4b-10742c90c...@zoulas.com>
>
>  | All the tests are failing for you the same way:
>  | rump.route: SO_RERROR: Socket operation on non-socket
>
>Not all, but quite a few are.
>
>This one I think is due to src/sbin/route/route.c 1.167
>
>    sock = prog_socket(PF_ROUTE, SOCK_RAW, 0);
>    if (setsockopt(sock, SOL_SOCKET, SO_RERROR,
>        &on, sizeof(on)) == -1)
>            warn("SO_RERROR");
>
>where that setsockopt() was added.   I think that needs to be a prog_*
>type call, so rump can do the right thing.   That will mean adding it
>to prog_opts, and right now I don't have time to work out what the correct
>magic is, but if no-one else does in the next day or so, I will take
>another look.
>
>That should take care of the failing network related tests that contain
>rump.route commands, but that's not all of the failing tests.

Thanks! I fixed that now. Let's see how many break after this...

christos



Re: All (?) network tests failing

2020-04-02 Thread Robert Elz
Date:Mon, 30 Mar 2020 14:25:01 -0400
From:Christos Zoulas 
Message-ID:  <3d3ac2b9-5e6e-400c-9a4b-10742c90c...@zoulas.com>

  | All the tests are failing for you the same way:
  | rump.route: SO_RERROR: Socket operation on non-socket

Not all, but quite a few are.

This one I think is due to src/sbin/route/route.c 1.167

    sock = prog_socket(PF_ROUTE, SOCK_RAW, 0);
    if (setsockopt(sock, SOL_SOCKET, SO_RERROR,
        &on, sizeof(on)) == -1)
            warn("SO_RERROR");

where that setsockopt() was added.   I think that needs to be a prog_*
type call, so rump can do the right thing.   That will mean adding it
to prog_opts, and right now I don't have time to work out what the correct
magic is, but if no-one else does in the next day or so, I will take
another look.
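
A rough C sketch of the kind of indirection being described, assuming a table of function pointers through which every descriptor operation is routed. This is not the contents of NetBSD's prog_ops: struct demo_ops, host_ops, fake_ops and the fake backend are all invented for illustration, and SO_REUSEADDR stands in for SO_RERROR. The point is only that the descriptor and the option call must come from the same backend:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical ops table; the field names are illustrative only. */
struct demo_ops {
	int (*op_socket)(int, int, int);
	int (*op_setsockopt)(int, int, int, const void *, socklen_t);
};

/* Host backend: plain system calls. */
const struct demo_ops host_ops = {
	.op_socket = socket,
	.op_setsockopt = setsockopt,
};

/* Stand-in for a rump-like backend with its own descriptor space. */
static int fake_is_open[8];

static int
fake_socket(int domain, int type, int proto)
{
	(void)domain; (void)type; (void)proto;
	for (int i = 0; i < 8; i++) {
		if (!fake_is_open[i]) {
			fake_is_open[i] = 1;
			return i;
		}
	}
	errno = EMFILE;
	return -1;
}

static int
fake_setsockopt(int fd, int level, int name, const void *val, socklen_t len)
{
	(void)level; (void)name; (void)val; (void)len;
	if (fd < 0 || fd >= 8 || !fake_is_open[fd]) {
		errno = EBADF;
		return -1;
	}
	return 0;
}

const struct demo_ops fake_ops = {
	.op_socket = fake_socket,
	.op_setsockopt = fake_setsockopt,
};

/* Create a socket and set an option, both through the same table. */
int
open_with_option(const struct demo_ops *ops)
{
	int on = 1;
	int s;

	assert(ops != NULL);
	s = ops->op_socket(AF_UNIX, SOCK_DGRAM, 0);
	if (s == -1)
		return -1;
	if (ops->op_setsockopt(s, SOL_SOCKET, SO_REUSEADDR,
	    &on, sizeof(on)) == -1)
		return -1;
	return s;
}
```

Calling the host setsockopt() directly on a descriptor returned by fake_ops.op_socket would apply the option to whatever the host process has open under that number, which is the mismatch suspected here.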

That should take care of the failing network related tests that contain
rump.route commands, but that's not all of the failing tests.

kre



Re: All (?) network tests failing

2020-03-31 Thread Robert Elz
Date:Mon, 30 Mar 2020 14:25:01 -0400
From:Christos Zoulas 
Message-ID:  <3d3ac2b9-5e6e-400c-9a4b-10742c90c...@zoulas.com>

  | All the tests are failing for you the same way:
  | rump.route: SO_RERROR: Socket operation on non-socket
  | I doubt that my gif change affected that. This smells to me like the
  | rump fd hijack is not working either because we have some new system
  | call involved or something is messing up the file descriptors.

If something has decided to move an fd out of the low number space
(not all that high necessarily) then rumphijack will confuse the fd
from user space with one of its own (it isn't very smart about that,
and bases the decision entirely upon the value of the fd it sees).

I wonder if something changed to try and be "nice" to other programs
by moving a "background" fd out of the 0..50 type space usually used by
user fd's and somewhere up > 100 ? (the fd space really runs up to the
thousands, but nothing we run ATF/Rump tests against ever needs more than
a small number of fd's, so they never naturally get out of the low number
area).
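
A sketch in C of what "bases the decision entirely upon the value of the fd" can look like. The boundary DEMO_FD_OFFSET and all function names are hypothetical, chosen for illustration, and are not taken from the rumphijack source:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical boundary between host and rump descriptor numbers. */
#define DEMO_FD_OFFSET 128

enum fd_owner { FD_HOST, FD_RUMP };

/* Classify a descriptor purely by its number, nothing else. */
enum fd_owner
classify_fd(int fd)
{
	return fd >= DEMO_FD_OFFSET ? FD_RUMP : FD_HOST;
}

/* Map a rump-side descriptor into the number space the program sees. */
int
rump_to_user(int rumpfd)
{
	assert(rumpfd >= 0);
	return rumpfd + DEMO_FD_OFFSET;
}

/*
 * A host descriptor that something has moved above the boundary is
 * indistinguishable from a rump one under this scheme.
 */
bool
host_fd_is_misclassified(int hostfd)
{
	return classify_fd(hostfd) == FD_RUMP;
}
```

Under these assumptions a host descriptor such as 3 is classified correctly, while one that has been moved to 200 would be sent to the rump kernel.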

kre



Re: All (?) network tests failing

2020-03-31 Thread Robert Elz
Date:Mon, 30 Mar 2020 20:47:12 -0400
From:Christos Zoulas 
Message-ID:  

  | Unfortunately they still work for me after a clean build. I am going to
  | try to download a standard build...

Does your tree have any uncommitted changes?

(I see the same 200+ tests failing as everyone else seems to see, on
amd64 (I do my tests in a XEN DomU), but I note b5 is seeing the same
on i386 (at least) as well).

kre



Re: All (?) network tests failing

2020-03-30 Thread Christos Zoulas
Unfortunately they still work for me after a clean build. I am going to try to
download a standard build...

christos

> On Mar 30, 2020, at 3:35 PM, Christos Zoulas  wrote:
> 
> Ok, let me start a clean build.
> 
> christos
> 
>> On Mar 30, 2020, at 2:36 PM, Andreas Gustafsson  wrote:
>> 
>> Christos Zoulas wrote:
>>> All the tests are failing for you the same way:
>>> 
>>> rump.route: SO_RERROR: Socket operation on non-socket
>>> 
>>> I doubt that my gif change affected that. This smells to me like the rump fd
>>> hijack is not working either because we have some new system call involved
>>> or something is messing up the file descriptors.  What is your build host?
>> 
>> The tests are failing on both the TNF testbed and my own.  The
>> respective OS versions are:
>> 
>> NetBSD babylon5.netbsd.org 8.1_STABLE NetBSD 8.1_STABLE (BABYLON5) #1: Fri Jan 24 21:50:18 UTC 2020  s...@franklin.netbsd.org:/home/netbsd/8/amd64/obj/sys/arch/amd64/compile/BABYLON5 amd64
>> NetBSD guido.araneus.fi 9.0 NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020  mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>> 
>> --
>> Andreas Gustafsson, g...@gson.org
> 
> 
> 





Re: All (?) network tests failing

2020-03-30 Thread Christos Zoulas
Ok, let me start a clean build.

christos

> On Mar 30, 2020, at 2:36 PM, Andreas Gustafsson  wrote:
> 
> Christos Zoulas wrote:
>> All the tests are failing for you the same way:
>> 
>> rump.route: SO_RERROR: Socket operation on non-socket
>> 
>> I doubt that my gif change affected that. This smells to me like the rump fd
>> hijack is not working either because we have some new system call involved
>> or something is messing up the file descriptors.  What is your build host?
> 
> The tests are failing on both the TNF testbed and my own.  The
> respective OS versions are:
> 
>  NetBSD babylon5.netbsd.org 8.1_STABLE NetBSD 8.1_STABLE (BABYLON5) #1: Fri Jan 24 21:50:18 UTC 2020  s...@franklin.netbsd.org:/home/netbsd/8/amd64/obj/sys/arch/amd64/compile/BABYLON5 amd64
>  NetBSD guido.araneus.fi 9.0 NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020  mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
> 
> --
> Andreas Gustafsson, g...@gson.org





Re: All (?) network tests failing

2020-03-30 Thread Andreas Gustafsson
Christos Zoulas wrote:
> All the tests are failing for you the same way:
> 
> rump.route: SO_RERROR: Socket operation on non-socket
> 
> I doubt that my gif change affected that. This smells to me like the rump fd
> hijack is not working either because we have some new system call involved
> or something is messing up the file descriptors.  What is your build host?

The tests are failing on both the TNF testbed and my own.  The
respective OS versions are:

  NetBSD babylon5.netbsd.org 8.1_STABLE NetBSD 8.1_STABLE (BABYLON5) #1: Fri Jan 24 21:50:18 UTC 2020  s...@franklin.netbsd.org:/home/netbsd/8/amd64/obj/sys/arch/amd64/compile/BABYLON5 amd64
  NetBSD guido.araneus.fi 9.0 NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020  mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64

-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-03-30 Thread Manuel Bouyer
On Mon, Mar 30, 2020 at 08:28:10PM +0200, Martin Husemann wrote:
> On Mon, Mar 30, 2020 at 02:25:01PM -0400, Christos Zoulas wrote:
> > What is your build host?
> > I am running the latest build I installed built from NetBSD/current to 
> > NetBSD/current.
> 
> I see the same fallout on a NetBSD-current build on a NetBSD-current
> (but it crept in delayed, probably because something did not get immediately
> rebuilt by build.sh -u).

I also see the same with builds from releng.netbsd.org:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/202003270130Z_atf.html#failed-tcs-summary

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: All (?) network tests failing

2020-03-30 Thread Martin Husemann
On Mon, Mar 30, 2020 at 02:25:01PM -0400, Christos Zoulas wrote:
> What is your build host?
> I am running the latest build I installed built from NetBSD/current to 
> NetBSD/current.

I see the same fallout on a NetBSD-current build on a NetBSD-current
(but it crept in delayed, probably because something did not get immediately
rebuilt by build.sh -u).

Martin


Re: All (?) network tests failing

2020-03-30 Thread Christos Zoulas

> 
>> 2. The gif related tests are failing because of a recent change to record
>>    mac addresses
>>    I committed a fix for that.
> 
> Your fix didn't work; the gif tests are still failing with
> src/tests/net/net_common.sh 1.40:
> 
>  
> http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.03.30.13.01.39/test.html#net_if_gif_t_gif_gif_basic_ipv4overipv4
> 
>> 3. The rest of the tests (I've sampled 5 of them) don't fail for me.
> 
> If you do a full release build from scratch, install the release, and
> run the tests in the installed release like the testbeds do, I bet
> they will fail for you, too.

All the tests are failing for you the same way:
rump.route: SO_RERROR: Socket operation on non-socket
I doubt that my gif change affected that. This smells to me like the rump fd
hijack is not working either because we have some new system call involved or
something is messing up the file descriptors.

What is your build host?
I am running the latest build I installed built from NetBSD/current to 
NetBSD/current.

christos





Re: All (?) network tests failing

2020-03-30 Thread Andreas Gustafsson
Christos Zoulas wrote:
> I've been looking into this:
> 1. The libcrypto/bn test just needs more time

That may be.  That one never failed on real hardware for me (it just
went from taking 3 seconds to 14), but 200+ other test cases did fail,
and still do.

> 2. The gif related tests are failing because of a recent change to record
>    mac addresses
>    I committed a fix for that.

Your fix didn't work; the gif tests are still failing with
src/tests/net/net_common.sh 1.40:

  
http://www.gson.org/netbsd/bugs/build/amd64-baremetal/2020/2020.03.30.13.01.39/test.html#net_if_gif_t_gif_gif_basic_ipv4overipv4

> 3. The rest of the tests (I've sampled 5 of them) don't fail for me.

If you do a full release build from scratch, install the release, and
run the tests in the installed release like the testbeds do, I bet
they will fail for you, too.
-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-03-30 Thread Christos Zoulas
I've been looking into this:
1. The libcrypto/bn test just needs more time
2. The gif related tests are failing because of a recent change to record
   mac addresses
   I committed a fix for that.
3. The rest of the tests (I've sampled 5 of them) don't fail for me.

christos

> On Mar 30, 2020, at 8:50 AM, Martin Husemann  wrote:
> 
> On Mon, Mar 30, 2020 at 03:44:49PM +0300, Andreas Gustafsson wrote:
>> Martin Husemann wrote:
>>> -current just had a serious regression in test results, it seems like
>>> ~all networking tests are failing now:
>> 
>> Many (most?) of these have been failing for more than a week now, as
>> reported on current-users in
>> 
>>  http://mail-index.netbsd.org/current-users/2020/03/23/msg038127.html
>> 
>> which identified both the commits and the developer responsible for
>> the breakage.
> 
> Well, they did not fail for me, and I can't see how the openssl upgrade
> is related - there certainly is something strange ongoing (maybe a bug
> in the build system not triggering all rebuilds for update builds).
> 
> Martin





Re: All (?) network tests failing

2020-03-30 Thread Martin Husemann
On Mon, Mar 30, 2020 at 03:44:49PM +0300, Andreas Gustafsson wrote:
> Martin Husemann wrote:
> > -current just had a serious regression in test results, it seems like
> > ~all networking tests are failing now:
> 
> Many (most?) of these have been failing for more than a week now, as
> reported on current-users in
> 
>   http://mail-index.netbsd.org/current-users/2020/03/23/msg038127.html
> 
> which identified both the commits and the developer responsible for
> the breakage.

Well, they did not fail for me, and I can't see how the openssl upgrade
is related - there certainly is something strange ongoing (maybe a bug
in the build system not triggering all rebuilds for update builds).

Martin


Re: All (?) network tests failing

2020-03-30 Thread Andreas Gustafsson
Martin Husemann wrote:
> -current just had a serious regression in test results, it seems like
> ~all networking tests are failing now:

Many (most?) of these have been failing for more than a week now, as
reported on current-users in

  http://mail-index.netbsd.org/current-users/2020/03/23/msg038127.html

which identified both the commits and the developer responsible for
the breakage.
-- 
Andreas Gustafsson, g...@gson.org


Re: All (?) network tests failing

2020-03-30 Thread Martin Husemann
rump.route: SO_RERROR: Socket operation on non-socket

Many of the ones not failing "silently" show that.

Martin