Re: parallel test failures

2021-02-27 Thread David Bremner
Tomi Ollila  writes:

> So, AFAIU, you got 124 since timeout(1) exited with that status (and
> killed all parallel(1) executions, after 2 minutes in that case?)...
> ... and when you set NOTMUCH_TEST_TIMEOUT=0 then timeout(1) was not
> executed and a test hung (probably T355-smime).

That sounds right.

> If you can get it into the hung state again (w/o timeout(1) getting in
> the way), you can probably inspect things with ps, /proc, strace,
> gdb, or with some other (potentially more sophisticated ;) tools.

In fact it looks like I already reported this issue (or a different
issue causing T355 to hang, which seems less likely) at

   id:87h7pxiek3@tethera.net

Past me seems to have thought it was some kind of gpgsm failure. I would
welcome input from people who use or understand gpgsm.

d
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org


Re: parallel test failures

2021-02-27 Thread Tomi Ollila
On Fri, Feb 26 2021, David Bremner wrote:

> David Bremner  writes:
>
>>
>> Thanks to both of you for your feedback / suggestions. I did read today
>> that timeout exits with 124 when the time limit is reached. I haven't
>> investigated further (nor do I know how the time limit could be reached,
>> since the whole build+test cycle takes about 10s on this machine).
>
> Maybe a timeout is not so crazy. I ran a couple of trials with
> NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
> repetitions) in T355-smime, as far as I can tell on the first test.
> I'm currently running some trials to see if I can duplicate that without
> parallel execution, but that of course takes longer.

So, AFAIU, you got 124 since timeout(1) exited with that status (and
killed all parallel(1) executions, after 2 minutes in that case?)...
... and when you set NOTMUCH_TEST_TIMEOUT=0 then timeout(1) was not
executed and a test hung (probably T355-smime).
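
For illustration, the behaviour described above can be sketched in shell.
This is a minimal sketch, not the actual notmuch-test code: run_with_limit
is a hypothetical helper, and the only fact relied on is that GNU
timeout(1) exits with status 124 when the time limit is reached.

```shell
# Hypothetical helper: run a command under timeout(1), unless the limit is 0,
# in which case the command runs unguarded and may hang indefinitely.
run_with_limit() {
    limit=$1
    shift
    if [ "$limit" = 0 ]; then
        "$@"
    else
        timeout "$limit" "$@"
    fi
}

# A 5 second sleep stands in for a hung test; the limit is 1 second.
run_with_limit 1 sleep 5
status=$?
echo "exit status: $status"
```

A status of 124 here means the limit was hit, not that the command itself
failed.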

If you can get it into the hung state again (w/o timeout(1) getting in
the way), you can probably inspect things with ps, /proc, strace,
gdb, or with some other (potentially more sophisticated ;) tools.
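
A rough sketch of that kind of inspection on Linux follows. The background
sleep is only a stand-in for the hung process; in practice pid would be the
PID of the stuck test or of the gpgsm it is waiting on. The strace and gdb
lines are left commented out because they need ptrace permission on the
target.

```shell
# Stand-in for a hung process (assumption: replace with the real PID).
sleep 30 &
pid=$!

# State, parent, elapsed time and command line of the process.
ps -o pid,ppid,stat,etime,args -p "$pid"

# Open file descriptors: shows pipes, sockets or files it may be blocked on.
ls -l "/proc/$pid/fd"

# One-letter process state from /proc (S = sleeping, D = uninterruptible, ...).
state=$(awk '/^State:/ {print $2}' "/proc/$pid/status")
echo "state: $state"

# Attach to see the system call it is stuck in, or to get a backtrace:
# strace -f -p "$pid"
# gdb -p "$pid" -batch -ex 'thread apply all bt'

# Clean up the stand-in.
kill "$pid"
wait "$pid" 2>/dev/null
```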

>
> d

Tomi


Re: parallel test failures

2021-02-26 Thread David Bremner
David Bremner  writes:

> Tomi Ollila  writes:
>
>>
>> Anyway, the log.gz did not show any tests failing, only parallel exiting
>> nonzero, possibly for some other reason. I cannot say. Stracing (even
>> with --seccomp-bpf) would probably make the failure even less likely :/
>>
>
> Thanks to both of you for your feedback / suggestions. I did read today
> that timeout exits with 124 when the time limit is reached. I haven't
> investigated further (nor do I know how the time limit could be reached,
> since the whole build+test cycle takes about 10s on this machine).

Maybe a timeout is not so crazy. I ran a couple of trials with
NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
repetitions) in T355-smime, as far as I can tell on the first test.
I'm currently running some trials to see if I can duplicate that without
parallel execution, but that of course takes longer.
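
A repeat-until-failure loop of the kind used for such trials could look
like the sketch below. repeat_until_failure and flaky are hypothetical
names; flaky is a stand-in that fails on its third call so the example
terminates quickly.

```shell
# Hypothetical helper: run a command up to $1 times, stopping at the first
# nonzero exit status. The run number is printed before each run, so if a
# run hangs, the last line shows which repetition it was.
repeat_until_failure() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "run $i"
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i failed with status $status"
            return "$status"
        fi
        i=$((i + 1))
    done
    return 0
}

# Stand-in for an intermittently failing test suite: fails on the third call.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky() {
    n=$(( $(cat "$counter_file") + 1 ))
    echo "$n" > "$counter_file"
    [ "$n" -lt 3 ]
}

repeat_until_failure 10 flaky
result=$?
runs=$(cat "$counter_file")
rm -f "$counter_file"
```

For the real suite the command would be whatever starts the tests, with
NOTMUCH_TEST_TIMEOUT=0 (and, for the serial trials,
NOTMUCH_TEST_SERIALIZE=t) set in the environment.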

d


Re: parallel test failures

2021-02-26 Thread David Bremner
Tomi Ollila  writes:

>
> Anyway, the log.gz did not show any tests failing, only parallel exiting
> nonzero, possibly for some other reason. I cannot say. Stracing (even
> with --seccomp-bpf) would probably make the failure even less likely :/
>

Thanks to both of you for your feedback / suggestions. I did read today
that timeout exits with 124 when the time limit is reached. I haven't
investigated further (nor do I know how the time limit could be reached,
since the whole build+test cycle takes about 10s on this machine).

d


Re: parallel test failures

2021-02-25 Thread Tomi Ollila
On Fri, Feb 19 2021, David Bremner wrote:

> I have intermittent failures when running the test suite on sufficiently
> parallel machines.  I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205

I made the following changes to see file write accesses:


diff --git a/test/notmuch-test b/test/notmuch-test
index b58fd3b3..903a5dff 100755
--- a/test/notmuch-test
+++ b/test/notmuch-test
@@ -62,13 +62,16 @@ if test -z "$NOTMUCH_TEST_SERIALIZE" && command -v parallel >/dev/null ; then
 META_FAILURE="parallel test suite returned error code $RES"
 fi
 else
+rm -rf inw; mkdir inw
 for test in $TESTS; do
+testname=$(basename $test .sh)
+inotifywait -d --outfile $PWD/inw/inw-$testname -r -e close_write,delete $PWD/test /tmp
 $TEST_TIMEOUT_CMD $test "$@" &
 wait $!
+pkill inotifywa
 # If the test failed without producing results, then it aborted,
 # so we should abort, too.
 RES=$?
-testname=$(basename $test .sh)
 if [[ $RES != 0 && ! -e "$NOTMUCH_BUILDDIR/test/test-results/$testname" ]]; then
 META_FAILURE="Aborting on $testname (returned $RES)"
 break


Then ran tests w/ NOTMUCH_TEST_SERIALIZE=t

and then ran

for f in inw/*; do echo $f; sed -e 's,.*notmuch/test/,  ,' -e '/tmp.T/ s,/.*,,' $f | sort -u; echo; done | less

to examine the "fallout".

Based on that (skimming the listing at random) I did not see any potentially
overlapping writes, but I did see unrelated inconsistencies in the test directories.

Anyway, the log.gz did not show any tests failing, only parallel exiting
nonzero, possibly for some other reason. I cannot say. Stracing (even
with --seccomp-bpf) would probably make the failure even less likely :/

Tomi


Re: parallel test failures

2021-02-21 Thread Xu Wang
I did not look at the logs, but I have had similar problems in other
scenarios. The way I debugged them was to use strace to get a list of all
files the tests accessed. From that list I could see that some files that
should have been in separate temp directories were not thread-specific, and
the solution was to put the temp files in a separate dir for each test. Not
sure if this is helpful, but I wanted to share.
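
As a rough sketch of that approach (paths_from_trace and shared_paths are
hypothetical helpers, and the trace file names are made up): record one
strace log per test, reduce each log to the set of quoted strings it
contains, which are mostly paths, and list what two tests have in common.

```shell
# Hypothetical helper: distinct quoted strings (mostly paths) in an strace log.
paths_from_trace() {
    grep -o '"[^"]*"' "$1" | tr -d '"' | sort -u
}

# Hypothetical helper: paths that appear in both of two strace logs.
shared_paths() {
    a=$(mktemp)
    b=$(mktemp)
    paths_from_trace "$1" > "$a"
    paths_from_trace "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Recording the logs needs strace and ptrace permission, so it is only shown
# as a comment; the test script names are placeholders.
# strace -f -e trace=file -o first.trace  ./first-test.sh
# strace -f -e trace=file -o second.trace ./second-test.sh
# shared_paths first.trace second.trace | grep '^/tmp'
```

Paths under /tmp that show up for more than one test are candidates for
moving into a per-test directory, for example one created with mktemp -d.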

Kind regards and best of luck,

Xu

On Fri, Feb 19, 2021 at 7:24 AM David Bremner  wrote:
>
>
> I have intermittent failures when running the test suite on sufficiently
> parallel machines.  I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205