Re: wait skips signals but first one

2024-02-05 Thread Chet Ramey

On 2/5/24 12:22 PM, Mykyta Dorokhin wrote:


Note 1: forgot to mention that I'm cross-compiling.
Note 2: it probably makes sense to add a warning or something that states 
that HAVE_POSIX_SIGSETJMP is disabled due to cross-compiling.


The autoconf macro that tests for this (BASH_FUNC_POSIX_SETJMP) prints a
warning if cross-compiling and defaults to the same setting as whether or
not it thinks it has POSIX signals available (bash_cv_posix_signals).

You probably didn't notice it the first time and used the cached value
from then on.
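
One way to see which value configure actually settled on is to look at the generated config.h in the build tree. The sketch below assumes the usual autoconf output format (a `#define` line when the feature is enabled, a commented-out `#undef` otherwise). The helper name check_sigsetjmp is made up for illustration and is not part of the bash sources.

```shell
# Hypothetical helper: report whether a configured build tree enabled
# HAVE_POSIX_SIGSETJMP. Assumes standard autoconf config.h output.
check_sigsetjmp() {
    config_h=${1:-config.h}
    if [ ! -r "$config_h" ]; then
        echo "cannot read $config_h" >&2
        return 2
    fi
    if grep -q '^#define HAVE_POSIX_SIGSETJMP' "$config_h"; then
        echo "HAVE_POSIX_SIGSETJMP is defined"
    else
        echo "HAVE_POSIX_SIGSETJMP is not defined"
    fi
}
```

Called as `check_sigsetjmp path/to/build/config.h`, it shows which way the cross-compiling default (or a stale cached value) went. If the answer is wrong for the target, the cached result has to be cleared or overridden before re-running configure.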


Thank you for your time! You are doing a great job!


Thanks for your kind words.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/





Re: wait skips signals but first one

2024-02-05 Thread Chet Ramey

On 2/3/24 7:01 PM, Mykyta Dorokhin wrote:

There is a line in trap.c with your change. If I revert it then everything 
works again:


- if (interrupt_immediately && wait_intr_flag)
+ if (/* interrupt_immediately && */wait_intr_flag)

So if I put interrupt_immediately back and rebuild the code with this as the only 
fix, it starts working properly: signals are received as expected.


OK. Let's look at that. By this time, interrupt_immediately was no longer
set anywhere, so the code before this change did nothing but inhibit the
siglongjmp/longjmp call from trap_handler, which means the sighandler
returned and (possibly) did not interrupt the wait builtin.

That is what this means (replace SIGINT with SIGUSR1 here):


The one change that might make a difference is a bug fix: if the wait
builtin is waiting for a process and receives a trapped signal, it's
supposed to cause wait to return immediately and then run the trap. Bash
didn't do that consistently for SIGINT, and would run the trap when it
shouldn't, or before it should, and sometimes not return from the wait
at all. So maybe the longjmp back to the wait builtin is what changed
things, even though longjmp is one of the functions that POSIX says is
safe to call from a signal handler context, and it restores the signal
mask if you're running on a system that has sigsetjmp/siglongjmp.


So the effect of this change is to longjmp/siglongjmp back to the wait
builtin, so it can return from there before running the trap. If you use
siglongjmp, it restores the original signal mask (look at the wait
builtin's call to setjmp_sigs, a macro that calls sigsetjmp with 1 as the
second argument), which means the trapped signal is no longer blocked.

Since this works as intended on all other systems, I would check to see
if your system has sigsetjmp/siglongjmp and whether or not they are
behaving correctly.



Re: wait skips signals but first one

2024-02-03 Thread Chet Ramey

On 2/3/24 10:28 AM, Mykyta Dorokhin wrote:


Analysis with strace.

After receiving SIGUSR1, Debian only blocks SIGCHLD, then clears the block:

205295 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=205327, si_uid=1040} ---

205295 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
205295 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f063bdb5fd0}, {sa_handler=0x5637247940b0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f063bdb5fd0}, 8) = 0

205295 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
205295 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0  # unblocks all signals


The above is the correct action.

On our device, it blocks SIGUSR1 as well as SIGCHLD and keeps doing it over 
and over again:


One explanation for this is SIGUSR1 being blocked when the shell is
invoked. Another is that sigsetjmp/siglongjmp are either not available
(or configure doesn't think they are) or don't properly save and restore
the signal mask.
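
The first explanation can be checked from inside the script on Linux. The SigBlk field of /proc/<pid>/status holds the blocked-signal mask as a hexadecimal bitmap, with signal number n stored at bit n-1. A minimal sketch follows; the helper names sig_in_mask and sig_blocked are hypothetical, and the approach assumes a Linux /proc filesystem.

```shell
# Succeeds if signal number $1 is set in the hexadecimal mask $2
# (the format of the SigBlk field in /proc/<pid>/status on Linux).
sig_in_mask() {
    [ $(( (0x$2 >> ($1 - 1)) & 1 )) -eq 1 ]
}

# Succeeds if signal number $1 is currently blocked in this shell.
# Returns 2 if no SigBlk line was found. Linux only.
# A shell 'read' loop is used so that no extra process is involved.
sig_blocked() {
    while read -r key value; do
        if [ "$key" = "SigBlk:" ]; then
            sig_in_mask "$1" "$value"
            return
        fi
    done < "/proc/$$/status"
    return 2
}
```

In bash, `kill -l USR1` prints the signal number, so `sig_blocked "$(kill -l USR1)" && echo blocked` at the top of the script and again after the trap has run would show whether SIGUSR1 was already blocked at startup or only became blocked later. On architectures where SIGUSR1 is 10 and SIGCHLD is 17, the `[USR1 CHLD]` mask in the device trace corresponds to a SigBlk value of 0000000000010200.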



6707  --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=6724, si_uid=0} ---

6707  rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
6707  rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
6707  rt_sigprocmask(SIG_BLOCK, NULL, [USR1 CHLD], 8) = 0
6707  write(1, ">>> TRAPPED USR1 <<<\n", 21) = 21
6707  rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
6707  rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
6707  rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
6707  rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
6707  write(1, "Iteration\n", 10)       = 10


On modern systems, the OS blocks a caught signal while its handler runs and 
unblocks it when the handler returns, so that signal handlers are not called 
recursively. The exception to this is if SA_NODEFER is set. On some very old UNIX 
systems you had to block the signal yourself, and there was a small window 
where things could go wrong. I suspect BASH probably has a build option to 
allow blocking signals in handlers for compatibility with other systems, 
and is not being built correctly for Linux.


Bash does have an autoconf test for this, but it didn't change as part
of this push. You can check what MUST_REINSTALL_SIGHANDLERS is set to
in config.h, but I suspect it won't be different.

And `not being built correctly for Linux' would mean your Debian and my
Red Hat tests would fail.


I suspect that on those very old 
systems the signal was automatically unblocked on return, but that is not done 
here because the POSIX sigprocmask is called, which requires calling it 
again to unblock the signal on Linux. And since wait is restarted, the signal 
is never unblocked.


If you mean wait(2), it doesn't get restarted. waitpid(2) will return
-1/EINTR since it received a caught signal.



According to strace no additional user flags are set when the BASH signal 
handler is put in place for SIGUSR1.


Correct, the trap signal handler doesn't assume that system calls are
restarted.



Re: wait skips signals but first one

2024-02-03 Thread Chet Ramey

On 2/3/24 10:00 AM, Mykyta Dorokhin wrote:

I have found the commit on devel branch which breaks things for me (and 
probably other Yocto-based builds):


This one still works
==

commit 89d788fb0152724a93e0fdab8c15116e5c76572b
Author: Chet Ramey 
Date:   Mon Feb 17 11:41:35 2020 -0500

    commit bash-20200214 snapshot

This one does not
==


commit 0df4ddca3f371bc258fe4185cdec36fce3e7be7b
Author: Chet Ramey 
Date:   Mon Feb 24 10:41:37 2020 -0500

    commit bash-20200221 snapshot



Please take a look. Maybe you'll notice something suspicious there. I don't 
know... uninitialized variables, endian-dependent code, etc.


There are changes there, of course, but it's hard to see how they make a
difference. The wait builtin was changed not to interrupt the wait for a
trapped SIGCHLD, but to delay running any SIGCHLD trap until the wait
exited. Since your example doesn't trap SIGCHLD, it doesn't seem
significant. Any other trapped signal still interrupts the wait. Subshells
clear the process substitution FIFO list, but you're not using process
substitution.

The one change that might make a difference is a bug fix: if the wait
builtin is waiting for a process and receives a trapped signal, it's
supposed to cause wait to return immediately and then run the trap. Bash
didn't do that consistently for SIGINT, and would run the trap when it
shouldn't, or before it should, and sometimes not return from the wait
at all. So maybe the longjmp back to the wait builtin is what changed
things, even though longjmp is one of the functions that POSIX says is
safe to call from a signal handler context, and it restores the signal
mask if you're running on a system that has sigsetjmp/siglongjmp.



Re: wait skips signals but first one

2024-01-08 Thread Chet Ramey

On 1/5/24 2:46 PM, Mykyta Dorokhin wrote:


Bash Version: 5.1
Patch Level: 16
Release Status: release

Description:


I'm working on a custom project within the Yocto framework. After a recent 
build system update, the bash version was updated to 5.1.16. Since then, I've 
noticed peculiar side effects related to using 'wait' and signals.
Below is a script demonstrating the issue.

The problem is that when the script is blocked in 'wait', only the first signal 
interrupts it; subsequent signals of the same type do not interrupt 'wait', 
and it remains blocked.

I manually switched bash versions and compiled a table (bash version - Yocto 
version):

5.0.18 (dunfell): No issues
5.1.4 (hardknott): Has issues
5.1.8 (honister): Has issues
5.1.16 (kirkstone): Has issues
5.2.21 (master): Has issues

Meanwhile, in my home desktop distribution, Ubuntu 22.04, I tested the same 
scenario, and everything works correctly; signals are processed as expected.

My assumption is that the problem may be related to my Yocto build being 
intended for a 32-bit device, and perhaps the bug only manifests in this case.


The only way I can see a potential problem is if you're using a 32-bit
build on a 64-bit device.



I consider this a significant issue. Could you confirm whether any testing has 
been conducted on 32-bit platforms,
as it seems that everything works correctly on 64-bit desktops?


I don't have any 32-bit platforms available for testing, but I have a hard
time believing that it would make a difference.

Like you, I can't reproduce it on the desktop platforms I have available
right now.

The bash devel git branch has fairly fine granularity. If you can automate
the signal sending somewhat, maybe by having a child process send signals
to $$, you could use your script and `git bisect' to find the commit where
the behavior changed. bash-5.0 was frozen 12/31/2018, and bash-5.1 was
frozen 12/14/2020, so that should get you started with the devel branch
commits you want to inspect.

http://git.savannah.gnu.org/cgit/bash.git/log/?h=devel
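
The automation suggested above could look roughly like the following. This is only a sketch, not the reporter's script: the function name count_usr1_wakeups, the one-second spacing between signals and the timeout are all assumptions chosen for illustration. A background child sends SIGUSR1 to the shell while it sits in `wait`, and the function succeeds only if every signal ran the trap.

```shell
# Hypothetical reproducer: a child sends $1 SIGUSR1 signals to this shell
# while it sits in 'wait'. Afterwards $received holds how many traps ran,
# and the function succeeds only if all of them did.
count_usr1_wakeups() {
    expected=$1
    received=0
    self=${BASHPID:-$$}        # PID of the shell that must receive the signals
    trap 'received=$((received + 1))' USR1

    # Sender: one signal per second so that pending signals are not merged.
    (
        i=0
        while [ "$i" -lt "$expected" ]; do
            sleep 1
            kill -USR1 "$self"
            i=$((i + 1))
        done
    ) &
    sender=$!

    # Timer: bounds the run if 'wait' stops being interrupted.
    sleep $((expected + 3)) &
    timer=$!

    while [ "$received" -lt "$expected" ]; do
        status=0
        wait "$timer" || status=$?
        # A status above 128 means a trapped signal interrupted wait;
        # anything else means the timer itself finished, so give up.
        [ "$status" -gt 128 ] || break
    done

    trap - USR1
    kill "$timer" "$sender" 2>/dev/null || true
    wait "$timer" "$sender" 2>/dev/null || true
    [ "$received" -eq "$expected" ]
}
```

Saved as, say, repro.sh with a final line `count_usr1_wakeups 3`, it can be run by the freshly built shell (`./bash repro.sh`) and exits 0 or 1, which is the kind of status `git bisect run` expects. A wrapper that rebuilds bash at each step and exits 125 when the build fails lets bisect skip commits that cannot be tested.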

