Re: parallel tests broken on Debian stable

2021-10-30 Thread David Bremner
David Bremner  writes:

> In a debian stretch neither gnu parallel nor moreutils parallel
> successfully completes the tests.  Logs are attached. One thing worth
> investigating is that gnu parallel inside timeout doesn't really work
> right, since it says something like

I can't figure out if this bug is reproducible in current Debian, so I'm
going to untag it. If someone finds a reproducer, feel free to tag that
as a bug.

d
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org


Re: parallel tests broken on Debian stable

2019-05-20 Thread Daniel Kahn Gillmor
On Mon 2019-05-20 19:49:02 -0400, Daniel Kahn Gillmor wrote:
> However, i tried selectively upgrading all the versions of all of these
> packages *except for gdb* to the version in buster (or to the version
> from backports, in the case of the kernel).  and i'm *still* seeing the
> problem on the stretch system.
>
> So perhaps it's some interaction between timeout and gdb?  I haven't
> managed to test that particular combination yet.

Ok, i just went ahead and tested gdb 8.2.1-2 on a stretch system (after
rolling back my other package upgrades.

Getting gdb installed cleanly required upgrading libc6, libssl1.1, and
libpython (3.5 to 3.7) and associated packages to their versions from
buster, as well as pulling in a few libraries whose sonames weren't
already in stretch at all (libipt2, libncursesw6, libtinfo6).

once gdb was upgraded, the tests passed with moreutils parallel with
timeout enabled.

So the combination that tickles the problem is:

 * GNU timeout (either version)
 * moreutils parallel (either version)
 * gdb 7.12-6 (stretch)

and the problem goes away when gdb is upgraded to 8.2.1-2.

I wasn't able to whittle it down to a smaller reliable test,
unfortunately, but hopefully this is enough for someone else to pick it
up and run with it.


Digging around a bit further, i see back in 2011
9ade8160a6a3f6f55996d7a58dc2cd81a6df8395 from amdragon (cc'ed, though i
haven't heard from him since 2016) that introduced this comment (now in
tests/T380-atomicity.sh):

# -tty /dev/null works around a conflict between the 'timeout' wrapper
# and gdb's attempt to control the TTY.
 […]
gdb -tty /dev/null -batch -x $TEST_DIRECTORY/atomicity.gdb notmuch 
>/dev/null 2>/dev/null


I tried adding "-tty /dev/null" to every place where ${TEST_GDB} was
invoked (in T050, T060, and T070), and i found that it produced test
failures in T050 and T060.  However, just directing stdin from /dev/null
on those runs of ${TEST_GDB} solved the problem for me. (see patch below)

Leaving any of those three gdb invocations without anything changed
meant that "make -j4 check" would hang pretty reliably with processes
stopped by SIGTTOU again with gdb from stretch.

Again, i don't understand any of this mechanism in detail, but the
following patch appears to avoid hangs of the test suite for me on a
stock debian stretch installation, and is (i think) a narrower change
than the earlier proposal that forced stdin from /dev/null on all the
tests uniformly.

If no one objects, i can propose this as an actual patch, but i'd feel
much happier if someone who understands this particular situation can
propose it with a clearer justifiation.

Regards,

--dkg


diff --git a/test/T050-new.sh b/test/T050-new.sh
index dfc8508f..9b3d41df 100755
--- a/test/T050-new.sh
+++ b/test/T050-new.sh
@@ -366,7 +366,7 @@ run
 EOF
 
 ${TEST_GDB} --batch-silent --return-child-result -x notmuch-new-vanish.gdb \
---args notmuch new 2>OUTPUT 1>/dev/null
+--args notmuch new 2>OUTPUT 1>/dev/null > OUTPUT
 
 # Clean up the file in case gdb isn't available.
diff --git a/test/T060-count.sh b/test/T060-count.sh
index 0c0bf473..a393edb6 100755
--- a/test/T060-count.sh
+++ b/test/T060-count.sh
@@ -116,7 +116,7 @@ EOF
 backup_database
 test_begin_subtest "error message from query_search_messages"
 ${TEST_GDB} --batch-silent --return-child-result -x count-files.gdb \
---args notmuch count --output=files '*' 2>OUTPUT 1>/dev/null
+--args notmuch count --output=files '*' 2>OUTPUT 1>/dev/null < /dev/null
 cat < EXPECTED
 notmuch count: A Xapian exception occurred
 A Xapian exception occurred performing query
diff --git a/test/T070-insert.sh b/test/T070-insert.sh
index 05be473a..2d281fe6 100755
--- a/test/T070-insert.sh
+++ b/test/T070-insert.sh
@@ -267,12 +267,14 @@ for code in  FILE_NOT_EMAIL READ_ONLY_DATABASE 
UPGRADE_REQUIRED PATH_ERROR; do
 test_begin_subtest "EXIT_FAILURE when index_file returns $code"
 test_expect_code 1 \
  "${TEST_GDB} --batch-silent --return-child-result \
+-tty /dev/null \
 -ex 'set args insert < $gen_msg_filename' \
 -x index-file-$code.gdb notmuch"
 
 test_begin_subtest "success exit with --keep when index_file returns $code"
 test_expect_code 0 \
  "${TEST_GDB} --batch-silent --return-child-result \
+-tty /dev/null \
 -ex 'set args insert --keep < $gen_msg_filename' \
 -x index-file-$code.gdb notmuch"
 done
@@ -281,12 +283,14 @@ for code in OUT_OF_MEMORY XAPIAN_EXCEPTION ; do
 test_begin_subtest "EX_TEMPFAIL when index_file returns $code"
 test_expect_code 75 \
  "${TEST_GDB} --batch-silent --return-child-result \
+-tty /dev/null \
 -ex 'set args insert < $gen_msg_filename' \
 -x index-file-$code.gdb notmuch"
 
 test_begin_subtest "success exit with --keep when index_file returns $code"
 test_expect_code 0 \
  "${TEST_GDB} --batch-silent --retu

Re: parallel tests broken on Debian stable

2019-05-20 Thread Daniel Kahn Gillmor
On Mon 2019-05-20 13:27:03 -0400, Daniel Kahn Gillmor wrote:
>  c) we should avoid the timeout hanging :)

I dug into this today, and i'm reporting back my findings.

I have what appears to be a fix (see below), but i don't understand it,
so i'm not advocating for it.

To be clear: my two test cases are two KVM instances, one running
stretch (debian stable) and one running sid (debian unstable).  both
systems have 4 virtual CPUs (on a hardware platform that has 4 cores).

The two VMs are otherwise similarly configured.  Both have moreutils
installed, and GNU parallel is not installed.

on the stretch system, i can achieve this hang/failure with a simple
"make -j4 check".  on the "sid" system, i do not see the failure.

When i disable the use of timeout entirely (with NOTMUCH_TEST_TIMEOUT=0,
see id:20190520232535.4904-1-...@fifthhorseman.net), the problem goes
away on the stretch system.

When i inspect the state of the debian stretch system when the tests are
hanging, i see this (from "ps auwx"):


USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
dkg   7980  1.8  0.5  10228  4348 pts/1S+   18:49   0:00 make -j4 check
dkg   8001  0.0  0.3  11228  2884 pts/1S+   18:49   0:00 bash 
/home/dkg/src/notmuch/notmuch/test/notmuch-test
dkg   8011  0.0  0.1  10092   804 pts/1S18:49   0:00 timeout 2m 
parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/
dkg   8012  0.0  0.0   4168   744 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8013  0.0  0.0   416896 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8014  0.0  0.0   416896 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8267  0.0  0.0   416896 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8268  0.0  0.0   4276   732 pts/1T18:49   0:00 sh -c 
/home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg   8270  2.7  0.4  11772  3744 pts/1T18:49   0:00 bash 
/home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg   8320  0.0  0.0   416896 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8321  0.0  0.0   4276   748 pts/1T18:49   0:00 sh -c 
/home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg   8322  0.7  0.4  11752  3556 pts/1T18:49   0:00 bash 
/home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg   8345  0.0  0.0   416896 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8346  0.0  0.0   4276   744 pts/1T18:49   0:00 sh -c 
/home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg   8347  1.7  0.4  11772  3764 pts/1T18:49   0:00 bash 
/home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg   8425  0.0  0.0   416896 pts/1T18:49   0:00 parallel -- 
/home/dkg/src/notmuch/notmuch/test/T000-basic.sh 
/home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg   8426  0.0  0.0   4276   740 pts/1T18:49   0:00 sh -c 
/home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg   8427  1.5  0.4  11752  3664 pts/1T18:49   0:00 bash 
/home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg   8763  4.7  2.9  73960 22708 pts/1T18:49   0:00 gdb 
--batch-silent --return-child-result -x count-files.gdb --args notmuch count 
--output=files *
dkg   8914  0.0  0.8  68508  6228 pts/1T18:49   0:00 notmuch search 
--format=text0 --output=files --offset=1 --limit=1 *
dkg   8915  0.0  0.1   4484  1164 pts/1T18:49   0:00 xargs -0 -I {} 
mv {} /home/dkg/src/notmuch/notmuch/test/tmp.T050-new/mail/moved_messages
dkg   8916  0.0  0.5  68244  3824 pts/1T18:49   0:00 notmuch insert 
--folder=Drafts +draft -unread
dkg   8919  0.0  0.0  13012   704 pts/1T18:49   0:00 notmuch new
dkg   8920  0.0  0.0   1412 4 pts/1t18:49   0:00 /bin/bash -c 
exec /home/dkg/src/notmuch/notmuch/notmuch count --output=files \*



As you can see in the "STAT" column, nearly all of the hanging processes
are marked with T ("stopped by job control signal" according to ps(1)).

I also note that "t" means "stopped by debugger during the tracing" --
maybe that final line (with "t") is the special one that triggers th

Re: parallel tests broken on Debian stable

2019-05-20 Thread Daniel Kahn Gillmor
On Mon 2019-05-20 21:55:05 +0300, Tomi Ollila wrote:
> On Mon, May 20 2019, Daniel Kahn Gillmor wrote:
>>  b) we should have a clearer sense of which tests completed and which
>> did not
>> […]
>> if someone else wants to
>> improve the test suite to handle (b) that would be great.
>
> Simplest would be that after ./notmuch-test --debug exits, test-results/
> would still exist :D

Please send patches :)

   --dkg


signature.asc
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: parallel tests broken on Debian stable

2019-05-20 Thread Tomi Ollila
On Mon, May 20 2019, Daniel Kahn Gillmor wrote:

> Thanks for catching this, Bremner.
>
> On Fri 2019-05-10 07:45:18 -0300, David Bremner wrote:
>> In a debian stretch neither gnu parallel nor moreutils parallel
>> successfully completes the tests.
>
> I have confirmed this misbehavior on debian stretch with moreutils
> parallel.  on stretch (with gmime 3.0 from stretch-backports), the tests
> complete successfully in serial, but with moreutils parallel installed
> they consume the full 2 minutes and then time out, but they time out
> without failing explicitly.
>
>> But the real question is why the jobs don't complete within a 2 minute
>> timeout.
>
> i agree, this is a serious issue, though we should also ensure that
> timed-out tests count as failures.
>
> So there are three things to fix:
>
>  a) when a timeout hits, the test suite should explicitly fail, even if
> no individual tests have failed.
>
>  b) we should have a clearer sense of which tests completed and which
> did not
>
>  c) we should avoid the timeout hanging :)
>
>
> Looking at the output of "ps" during the hanging test suite, it appears
> that parallel is getting blocked by running: T050, T060, T070, T080 (i'm
> using a 4-core machine).
>
> I've just sent two patches to the list that aim for part (a), and i'm
> going to work on part (c) on stable now.  if someone else wants to
> improve the test suite to handle (b) that would be great.

Simplest would be that after ./notmuch-test --debug exits, test-results/
would still exist :D

>
>  --dkg

Tomi
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: parallel tests broken on Debian stable

2019-05-20 Thread Daniel Kahn Gillmor
Thanks for catching this, Bremner.

On Fri 2019-05-10 07:45:18 -0300, David Bremner wrote:
> In a debian stretch neither gnu parallel nor moreutils parallel
> successfully completes the tests.

I have confirmed this misbehavior on debian stretch with moreutils
parallel.  on stretch (with gmime 3.0 from stretch-backports), the tests
complete successfully in serial, but with moreutils parallel installed
they consume the full 2 minutes and then time out, but they time out
without failing explicitly.

> But the real question is why the jobs don't complete within a 2 minute
> timeout.

i agree, this is a serious issue, though we should also ensure that
timed-out tests count as failures.

So there are three things to fix:

 a) when a timeout hits, the test suite should explicitly fail, even if
no individual tests have failed.

 b) we should have a clearer sense of which tests completed and which
did not

 c) we should avoid the timeout hanging :)


Looking at the output of "ps" during the hanging test suite, it appears
that parallel is getting blocked by running: T050, T060, T070, T080 (i'm
using a 4-core machine).

I've just sent two patches to the list that aim for part (a), and i'm
going to work on part (c) on stable now.  if someone else wants to
improve the test suite to handle (b) that would be great.

 --dkg


signature.asc
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


parallel tests broken on Debian stable

2019-05-10 Thread David Bremner

In a debian stretch neither gnu parallel nor moreutils parallel
successfully completes the tests.  Logs are attached. One thing worth
investigating is that gnu parallel inside timeout doesn't really work
right, since it says something like

parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: /home/bremner/notmuch/test/T357-index-decryption.sh
parallel: /home/bremner/notmuch/test/T330-emacs-subject-to-filename.sh
parallel: /home/bremner/notmuch/test/T310-emacs.sh
parallel: /home/bremner/notmuch/test/T350-crypto.sh
parallel: /home/bremner/notmuch/test/T355-smime.sh
parallel: /home/bremner/notmuch/test/T160-json.sh
parallel: /home/bremner/notmuch/test/T320-emacs-large-search-buffer.sh
parallel: /home/bremner/notmuch/test/T170-sexp.sh

and then doesn't stop. But the real question is why the jobs don't
complete within a 2 minute timeout.

The attached logs are from a stretch chroot running under buster. I also
saw something similar on a stretch system running on real
hardware. Unsurprisingly, the number of tests completed is
non-determinstic, and varies from machine to machine and run to run.



moreutils.log
Description: Binary data


gnuparallel.log
Description: Binary data
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch