Re: exit status issue
On Wed, Nov 23, 2011 at 02:09:57PM -0600, Dallas Clement wrote: > trap CleanupExit KILL For the record, you cannot trap SIGKILL.
Re: exit status issue
Hi All, Just wanted to let you know that I have found a solution to my problem, though I can't entirely explain it yet. As mentioned earlier, this script was designed to run as a daemon, i.e. run in the background. This is what I was doing to daemonize the script: Daemon() { while : do checkStatus sleep 1 done } trap CleanupExit INT trap CleanupExit TERM trap CleanupExit ILL trap CleanupExit QUIT trap CleanupExit ABRT trap CleanupExit KILL if [ x"$1" = x"start" ]; then Daemon 0<&- 1>/dev/null 2>&1 & echo $! > $PIDFILE elif [ x"$1" = x"stop" ]; then TMP=`cat $PIDFILE` kill $TMP CleanupExit else : fi When running like this, I was seeing 'touch' command failures inside function fsaccesstest() shown previously. I did an experiment last night and ran this script in the foreground (removed the &) from the call to Daemon. To my surprise, I did not see any errors with the 'touch' command. I then went back and changed how I was turning this script into a daemon, by using the start-stop-daemon utility instead like this: start-stop-daemon --start --background --quiet --exec $DAEMON Once I did this, my shell script is now running in the background without any errors. Life is good. Perhaps when I was calling Daemon & as shown above, the parent process was not exiting properly and was somehow interfering in the child process execution. I'm not really sure. Anyhow, I'm past this problem now. Thank you all for the good ideas and generous help.
Re: exit status issue
On Wednesday 23 November 2011, Andreas Schwab wrote: > "Steven W. Orr" writes: > > > I think we're beating this one to death, but I have point out that > > telling perl to run a print command whose output is redirected by bash > > is not the same as telling bash to run a builtin echo command that is > > redirected as an integral operation of the same interpreter. > > They are really the same, but even if you change it bash still wins: > > $ bash -c 'echo "hello"' >/dev/full > bash: line 0: echo: write error: No space left on device > IMO this is not really perl's fault, but rather the programmer's fault, since he has forgotten to check for possible errors. If did, he would get: $ perl -e 'print "hello"; close(STDOUT) or die "$!\n";' >/dev/full No space left on device $ echo $? 28 Something similar happens with C, of course, if you forget to check error conditions: $ cat > foo.c <<'END' #include int main (void) { printf("hello\n"); return 0; } END $ gcc foo.c $ ./a.out >/dev/null # Oops, no error reported. $ cat > foo.c <<'END' #include int main (void) { printf("hello\n"); if (fclose(stdout) != 0) perror("Write error"); return 0; } END $ gcc foo.c $ ./a.out >/dev/null # Error will be reported now. Write error: No space left on device More "modern" languages, with built-in exception handling, are somewhat better in this regard: $ python3 -c 'print("foo")' >/dev/full Exception IOError: (28, 'No space left on device') in ... ignored Just my 2 cents. Regards, Stefano
Re: exit status issue
"Steven W. Orr" writes: > I think we're beating this one to death, but I have point out that > telling perl to run a print command whose output is redirected by bash > is not the same as telling bash to run a builtin echo command that is > redirected as an integral operation of the same interpreter. They are really the same, but even if you change it bash still wins: $ bash -c 'echo "hello"' >/dev/full bash: line 0: echo: write error: No space left on device Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
Re: exit status issue
On 11/22/2011 6:49 PM, Bob Proulx wrote: Dallas Clement wrote: This doesn't (yet) look like a problem with bash. Admittedly bash seems to do the right thing if you go by the set -x execution trace. If you go by that, it would indeed seem that the call to touch is failing. I have a higher level of trust in -x output since I haven't found any problems with it. But I have found a lot of programs that do not handle file errors correctly. Those are way too common. Therefore I am much more suspicous of them. For example: $ perl -le 'print "hello";'>/dev/full $ echo $? 0 Perl ate that error and never said anything about it. One of a very many programs with this problem. BTW... Bash gets it right. $ echo "hello">/dev/full bash: echo: write error: No space left on device $ echo $? 1 I think we're beating this one to death, but I have point out that telling perl to run a print command whose output is redirected by bash is not the same as telling bash to run a builtin echo command that is redirected as an integral operation of the same interpreter. IOW, the perl command did succeed. If you want to be fair, you need to write a perl command that writes to /dev/null -- Time flies like the wind. Fruit flies like a banana. Stranger things have .0. happened but none stranger than this. Does your driver's license say Organ ..0 Donor?Black holes are where God divided by zero. Listen to me! We are all- 000 individuals! What if this weren't a hypothetical question? steveo at syslang.net
Re: exit status issue
Bob, Yes I agree with everything you've said. I am grasping at straws, which I really don't like. It's very unlikely that the problems are with any of these things I've mentioned bash, Linux kernel, compiler, libc etc. This script is fairly large and it does lots of checking of block devices, raid arrays, and volumes. All of it is executed synchronously which should make it simple to debug. The script itself is executed in the background as a daemon process. This script is executed on locally mounted RAID devices. No network involved in the test. I'll continue my search for root cause with a more disciplined approach. Thanks for the help.
Re: exit status issue
Dallas Clement wrote: > > This doesn't (yet) look like a problem with bash. > > Admittedly bash seems to do the right thing if you go by the set -x > execution trace. If you go by that, it would indeed seem that the > call to touch is failing. I have a higher level of trust in -x output since I haven't found any problems with it. But I have found a lot of programs that do not handle file errors correctly. Those are way too common. Therefore I am much more suspicous of them. For example: $ perl -le 'print "hello";' >/dev/full $ echo $? 0 Perl ate that error and never said anything about it. One of a very many programs with this problem. BTW... Bash gets it right. $ echo "hello" >/dev/full bash: echo: write error: No space left on device $ echo $? 1 > But the strace output tells a different story. According to it, the > system calls that touch makes are succeeding. Your test case is still too complex to be passed on to others. > In fact, I created my own 'touch' and 'stat' programs, both of which > log any errors. That's excellent! > I never see any errors logged, so I'm pretty sure they are getting > executed which jives with what I'm seeing in the strace output. Since it is your own code you would want to see that they logged the successful result of the operation and also logged that they were calling exit(0) and at that time show that bash logged them as a non-zero exit. For that test you could use 'false' to simulate the unsuccessful program. > Is this a lower level scheduling problem? It certainly could be. I'm > testing on Linux 2.6.39.4. I'll rollback to an older kernel, libc, > and libpthread and see if that makes any difference. Scheduling? Now you are scaring me. I think you are going off into the weeds with that idea. I can't imagine how this can be anything other than simple inline flow of control through your script. Your script as you have shown to us does not have any background asynchronous processing. Therefore it will be executing all of the commands in order. There isn't any out-of-order execution magically happening. I see by this that you are grasping at straws. You are "just trying things". I have empathy for you and your troubles. But that isn't the scientific method. http://en.wikipedia.org/wiki/Scientific_method#Elements_of_scientific_method > Not doing anything with NFS mounting thankfully. It's a RAID array on > a NAS device which is typically used for SMB / CIFS network mount. Your system is mounting it using an SMB mount? Then perhaps your network protocol layer is causing trouble. I think a bug in the SMB layer is much more likely. Bob
Re: exit status issue
>> + touch /mnt/array1/.accesstest >> + CHK_RESULT=1 > > It looks to me that touch is failing and reporting the failure and > that bash is handling it correctly. > >> + touch /mnt/array1/.accesstest >> + CHK_RESULT=0 > > And then on a subsequent pass touch is reporting success. > > This doesn't (yet) look like a problem with bash. Admittedly bash seems to do the right thing if you go by the set -x execution trace. If you go by that, it would indeed seem that the call to touch is failing. But the strace output tells a different story. According to it, the system calls that touch makes are succeeding. In fact, I created my own 'touch' and 'stat' programs, both of which log any errors. I never see any errors logged, so I'm pretty sure they are getting executed which jives with what I'm seeing in the strace output. Is this a lower level scheduling problem? It certainly could be. I'm testing on Linux 2.6.39.4. I'll rollback to an older kernel, libc, and libpthread and see if that makes any difference. > >> The purpose of this function is simply to try and create or modify a >> test file on a RAID share. It's just a periodic integrity check. > > If I ever had even one single failure on a raid filesystem I would > start replacing hardware. I wouldn't be trying to retry. I wouldn't > be trying to umount and mount it again. ANY failures would mean > something so severely broken that I would consider the system unusable > until repaired. Yep, I agree - these sort of failures should not happen unless there is something terribly wrong. > > On the other hand, do you really mean an NFS mounted filesystem? > Perhaps using the nfs automounter? If so then say so. Don't say > "share" which implies a MS Windows SMB / CIFS network mount. The NFS > client subsystem is notoriously bad. The nfs automounter is worse. I > am guessing you probably really mean that you are using an nfs mounted > filesystem using the automounter. Pretty much everyone in that > situation has tried to recover from problems with it at one time or > another. Not doing anything with NFS mounting thankfully. It's a RAID array on a NAS device which is typically used for SMB / CIFS network mount.
Re: exit status issue
Dallas Clement wrote: > Okay, I simplified the script slightly, fixed the quoting and re-ran > with set -x. Here is the script and the output for the failure > scenario. Much more useful! Thanks. > + touch /mnt/array1/.accesstest > + CHK_RESULT=1 It looks to me that touch is failing and reporting the failure and that bash is handling it correctly. > + touch /mnt/array1/.accesstest > + CHK_RESULT=0 And then on a subsequent pass touch is reporting success. This doesn't (yet) look like a problem with bash. > The purpose of this function is simply to try and create or modify a > test file on a RAID share. It's just a periodic integrity check. If I ever had even one single failure on a raid filesystem I would start replacing hardware. I wouldn't be trying to retry. I wouldn't be trying to umount and mount it again. ANY failures would mean something so severely broken that I would consider the system unusable until repaired. On the other hand, do you really mean an NFS mounted filesystem? Perhaps using the nfs automounter? If so then say so. Don't say "share" which implies a MS Windows SMB / CIFS network mount. The NFS client subsystem is notoriously bad. The nfs automounter is worse. I am guessing you probably really mean that you are using an nfs mounted filesystem using the automounter. Pretty much everyone in that situation has tried to recover from problems with it at one time or another. > The set -x output shows that the 'touch' command failed, but it > doesn't show why. That sounds like a bug in touch. If there is an error then it should emit an error message. Perhaps you should try an alternate command to create the file. You could use bash itself. : >> "$1"/.accesstest Or you could use a helper command. Using a different program other than 'touch' may cause it to emit a better error message. Bob
Re: exit status issue
On Tue, Nov 22, 2011 at 02:55:28PM -0600, Dallas Clement wrote: > fsaccesstest() > { > set -x > > local RETRY_MAX=3 > local RETRY_INTERVAL=3 > local CHK_RESULT=1 > local RETVAL=0 > > TMP=`grep "$1" /proc/mounts|awk '{print $1}'` > if [ "$TMP" != "" ]; then > for i in `seq 1 ${RETRY_MAX}` > do > touch "$1"/.accesstest > CHK_RESULT=$? > if [ ${CHK_RESULT} -eq 0 ] ; then > break > fi > logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** > fsaccesstest test $1 > failed CHK_RESULT=$CHK_RESULT. retrying... (${i}) ***" > sleep ${RETRY_INTERVAL} > done > if [ $CHK_RESULT -ne 0 ]; then > RETVAL=$CHK_RESULT > fi > fi > > set +x > > return $RETVAL > } That does look better. > + touch /mnt/array1/.accesstest > + CHK_RESULT=1 > + '[' 1 -eq 0 ']' > + logger -s -t diskmon -p local0.info '*** fsaccesstest test > /mnt/array1 failed CHK_RESULT=1. retrying... (1) ***' > diskmon: *** fsaccesstest test /mnt/array1 failed CHK_RESULT=1. > retrying... (1) *** > The purpose of this function is simply to try and create or modify a > test file on a RAID share. It's just a periodic integrity check. OK. > The set -x output shows that the 'touch' command failed, but it > doesn't show why. Presumably the touch command itself will tell you why, on stderr. It should say something like "No such file or directory" or "Permission denied" or "Device or resource busy", etc. > The three system calls that touch makes are all successful. > So my question is why is $? getting set to 1? If you suspect a bug in touch, then you need to report it to your OS vendor. touch isn't part of bash. > It looks to me like the > bash interpreter is returning early from this call to 'touch'. I > can't prove this yet though. There's no way bash would do that, given the script as shown. There is no background execution there at all. > Here's another example of an unsuccessful touch execution. Notice how > the execution keeps getting interrupted and then resumed. This is getting into kernel scheduling, etc. Way off topic for a bash mailing list. You might need to take this up on the Linux Kernel Mailing List, or something similar if you are not doing this in Linux.
Re: exit status issue
> That makes no sense. Fix your quoting (it's atrocious) and then run > the function with "set -x". Don't throw strace at it, at this point. > It's really not the right tool at this level of debugging. Okay, I simplified the script slightly, fixed the quoting and re-ran with set -x. Here is the script and the output for the failure scenario. # $1 mntpoint fsaccesstest() { set -x local RETRY_MAX=3 local RETRY_INTERVAL=3 local CHK_RESULT=1 local RETVAL=0 TMP=`grep "$1" /proc/mounts|awk '{print $1}'` if [ "$TMP" != "" ]; then for i in `seq 1 ${RETRY_MAX}` do touch "$1"/.accesstest CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ] ; then break fi logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest test $1 failed CHK_RESULT=$CHK_RESULT. retrying... (${i}) ***" sleep ${RETRY_INTERVAL} done if [ $CHK_RESULT -ne 0 ]; then RETVAL=$CHK_RESULT fi fi set +x return $RETVAL } + local RETRY_MAX=3 + local RETRY_INTERVAL=3 + local CHK_RESULT=1 + local RETVAL=0 ++ grep /mnt/array1 /proc/mounts ++ awk '{print $1}' + TMP=/dev/md10 + '[' /dev/md10 '!=' '' ']' ++ seq 1 3 + for i in '`seq 1 ${RETRY_MAX}`' + touch /mnt/array1/.accesstest + CHK_RESULT=1 + '[' 1 -eq 0 ']' + logger -s -t diskmon -p local0.info '*** fsaccesstest test /mnt/array1 failed CHK_RESULT=1. retrying... (1) ***' diskmon: *** fsaccesstest test /mnt/array1 failed CHK_RESULT=1. retrying... (1) *** + sleep 3 + for i in '`seq 1 ${RETRY_MAX}`' + touch /mnt/array1/.accesstest + CHK_RESULT=0 + '[' 0 -eq 0 ']' + break + '[' 0 -ne 0 ']' + set +x > I don't even know what your actual *symptom* is. I can't deduce it from > a spew of strace output. You haven't described the reason for the touch > command, so I can only presume this is some sort of autofs environment, > hence my attempt to solve the issue, above. The purpose of this function is simply to try and create or modify a test file on a RAID share. It's just a periodic integrity check. The set -x output shows that the 'touch' command failed, but it doesn't show why. I certainly acknowledge the strace output is a lot of stuff to untangle, but it does show what is happening at a process level and the return status for every system call. What I'm seeing is that there are lots of concurrent processes running at the time this particular shell script function is executed. The other processes seem to have an adverse effect on the process that is executing 'touch'. The three system calls that touch makes are all successful. So my question is why is $? getting set to 1? It looks to me like the bash interpreter is returning early from this call to 'touch'. I can't prove this yet though. Here's another example of an unsuccessful touch execution. Notice how the execution keeps getting interrupted and then resumed. [pid 31400] execve("/tmp/test-touch", ["/tmp/test-touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:/usr/sb"..., "MAIL=/var/spool/mail/root", "C_INCLUDE_PATH=/usr/include:/usr"..., "PWD=/usr/local/sbin", "HOME=/root", "SHLVL=2", "LOGNAME=root", "_=/tmp/test-touch"]Process 31401 attached (waiting for parent) [pid 31400] <... execve resumed> ) = 0 [pid 31400] brk(0 [pid 31400] <... brk resumed> ) = 0x2369000 [pid 31400] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666 [pid 31400] <... open resumed> )= 3 [pid 31400] close(3 [pid 31400] <... close resumed> ) = 0 [pid 31400] utimes("/mnt/array1/.accesstest", {4196394, 0} [pid 31400] <... utimes resumed> ) = 0 [pid 31400] exit_group(0) = ? Process 23756 resumed Process 31400 detached [pid 23756] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 31400 [pid 23756] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 23756] wait4(-1, Process 23756 suspended
Re: exit status issue
On Tue, Nov 22, 2011 at 11:31:22AM -0600, Dallas Clement wrote: > It turns out that my 'touch' command is actually getting executed, but > not when I expect it. Here is a recap of my script function which is > occasionally misbehaving: > > # $1 mntpoint > fsaccesstest() > { > local RETRY_MAX=3 > local RETRY_INTERVAL=3 > local CHK_RESULT=1 > > TMP=`grep $1 /proc/mounts|awk '{print $1}'` > if [ "$TMP" != "" ]; then > for i in `seq 1 ${RETRY_MAX}` > do > touch $1/.accesstest > CHK_RESULT=$? > if [ ${CHK_RESULT} -eq 0 ] ; then > break > fi > logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** > fsaccesstest test $1 > failed. retrying... (${i}) ***" > sleep ${RETRY_INTERVAL} > done > if [ $CHK_RESULT -ne 0 ]; then > logger -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest > test $1 failed. ***" > echo "*** fsaccess test $1 failed. ***" >/dev/console > ## try remount it > umount $1 > mount $TMP $1 > return $? > fi > return 0 > fi > return 1 > } > In the failure scenario, I had wrongly assumed that the touch command > was not being executed, when in fact it is. It's just that the > execution is deferred somehow and occurs after the call to 'logger', > which would not have been executed had the call to 'touch' somehow > failed or returned too soon. That makes no sense. Fix your quoting (it's atrocious) and then run the function with "set -x". Don't throw strace at it, at this point. It's really not the right tool at this level of debugging. Also note that there are *two* calls to logger in this function. You might be mistaking which logger call you are seeing, particularly since you're trying to debug with the wrong tool. If I understand what this function is attempting to do: fstest() { local i=0 while ((++i < 3)); do # If it's mounted, yay! df "$1" && return 0 # It's not mounted, so try to trigger autofs. touch "$1"/.accesstest sleep 3 done # We failed 3 times, so give up. return 1 } I don't even know what your actual *symptom* is. I can't deduce it from a spew of strace output. You haven't described the reason for the touch command, so I can only presume this is some sort of autofs environment, hence my attempt to solve the issue, above.
Re: exit status issue
It turns out that my 'touch' command is actually getting executed, but not when I expect it. Here is a recap of my script function which is occasionally misbehaving: # $1 mntpoint fsaccesstest() { local RETRY_MAX=3 local RETRY_INTERVAL=3 local CHK_RESULT=1 TMP=`grep $1 /proc/mounts|awk '{print $1}'` if [ "$TMP" != "" ]; then for i in `seq 1 ${RETRY_MAX}` do touch $1/.accesstest CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ] ; then break fi logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest test $1 failed. retrying... (${i}) ***" sleep ${RETRY_INTERVAL} done if [ $CHK_RESULT -ne 0 ]; then logger -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest test $1 failed. ***" echo "*** fsaccess test $1 failed. ***" >/dev/console ## try remount it umount $1 mount $TMP $1 return $? fi return 0 fi return 1 } Most of the time this function executes the 'touch' command without errors as shown in the following strace output. The touch command make just three system calls to open(), close(), and utimes() which you can clearly see. Success Scenario [pid 23930] execve("/tmp/test-touch", ["/tmp/test-touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin: [pid 23930] brk(0) = 0x18d2000 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9676000 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9675000 [pid 23930] access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) [pid 23930] open("/etc/ld.so.cache", O_RDONLY) = 3 [pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=90088, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40, st_size=19548, st_atime=2011/11/19-08:39:02, st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/ [pid 23930] mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4ab967 [pid 23930] close(3)= 0 [pid 23930] open("/lib64/libc.so.6", O_RDONLY) = 3 [pid 23930] read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\333\1\0"..., 832) = 832 [pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=285057, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=2616, st_size=1331216, st_atime=2011/11/19-08:10:31, st_mtime=2010/06/29-12:51:27, st_ctime=201 [pid 23930] mmap(NULL, 2343048, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4ab932 [pid 23930] mprotect(0x7f4ab9454000, 1044480, PROT_NONE) = 0 [pid 23930] mmap(0x7f4ab9553000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x133000) = 0x7f4ab9553000 [pid 23930] mmap(0x7f4ab9558000, 16520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9558000 [pid 23930] close(3)= 0 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966f000 [pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966e000 [pid 23930] arch_prctl(ARCH_SET_FS, 0x7f4ab966e6f0) = 0 [pid 23930] mprotect(0x7f4ab9553000, 12288, PROT_READ) = 0 [pid 23930] munmap(0x7f4ab967, 19548) = 0 [pid 23930] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666) = 3 [pid 23930] close(3)= 0 [pid 23930] utimes("/mnt/array1/.accesstest", {4196394, 0}) = 0 [pid 23930] exit_group(0) = ? Process 23756 resumed Process 23930 detached <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23930 --- SIGCHLD (Child exited) @ 0 (0) --- In the failure scenario, I had wrongly assumed that the touch command was not being executed, when in fact it is. It's just that the execution is deferred somehow and occurs after the call to 'logger', which would not have been executed had the call to 'touch' somehow failed or returned too soon. So it looks I am being afflicted with out of order execution problems. I am expecting that if my script makes a call to 'touch' it should not return until the command has been completely executed, i.e. the child process reaped, and exit status determined. This doesn't seem to be happening. Please forgive for the volume of data. Failure Scenario [pid 26415] execve("/tmp/test-touch", ["/tmp/test-touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=r
Re: exit status issue
On Fri, Nov 18, 2011 at 2:35 PM, Greg Wooledge wrote: > On Fri, Nov 18, 2011 at 02:23:54PM -0600, Dallas Clement wrote: >> [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], >> [/* 14 vars */]) = 0 >> >> The = 0 at the end is the exit status. > > That is the return value of the execve() call. All it means is that > the /bin/touch program was successfully launched. > > $ strace -f bash -c false > ... > execve("/bin/bash", ["bash", "-c", "false"], [/* 32 vars */]) = 0 > ... > > arc3:~$ bash -c false; echo $? > 1 > Yes, you are sure right on all counts. The (0) in SIGCHLD (Child exited) @ 0 (0) is clearly not the exit status of the child either. Proven by a simple test script. #!/tmp/bash3 touch /bogus/file echo $? - root@TS-2RVED8:/tmp# strace -f -e execve ./test.sh execve("./test.sh", ["./test.sh"], [/* 15 vars */]) = 0 Process 7269 attached Process 7268 suspended [pid 7269] execve("/bin/touch", ["touch", "/bogus/file"], [/* 14 vars */]) = 0 touch: /bogus/file: No such file or directory Process 7268 resumed Process 7269 detached --- SIGCHLD (Child exited) @ 0 (0) --- 1 Process 7268 detached Thanks for setting me straight guys. I was misled. It looks like the 'touch' command really could be failing in my script.
Re: exit status issue
Dallas Clement writes: > [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], > [/* 14 vars */]) = 0 > > The = 0 at the end is the exit status. No, it isn't. It only says that execve was successful. The exit status is set by the exit syscall. Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
Re: exit status issue
On Fri, Nov 18, 2011 at 02:23:54PM -0600, Dallas Clement wrote: > [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], > [/* 14 vars */]) = 0 > > The = 0 at the end is the exit status. That is the return value of the execve() call. All it means is that the /bin/touch program was successfully launched. $ strace -f bash -c false ... execve("/bin/bash", ["bash", "-c", "false"], [/* 32 vars */]) = 0 ... arc3:~$ bash -c false; echo $? 1
Re: exit status issue
DJ Mills wrote: > Bob Proulx wrote: > > sh -x ./scripttodebug > > I'm guessing you mean bash -x, not sh -x. Two different shells. It is a bash list so I probably should have said bash just to be politically correct. But the script doesn't use any bash specific constructs so sh should be fine. Saying sh is habit for me since I always use sh when possible. Bob
Re: exit status issue
On Fri, Nov 18, 2011 at 2:07 PM, Greg Wooledge wrote: >> Notice that the child process pid=6747 exits with status 0 > > Where does it say that? How are you even invoking strace here? I don't > know how to read this output, nor do I know how to reproduce it. [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], [/* 14 vars */]) = 0 The = 0 at the end is the exit status. I know it's complicated but this script is run as a daemon. Was using strace with -f option to follow through forks, i.e. strace -f -e execve ./diskmon_exec.sh start > /mnt/array1/share/strace.txt 2>&1 >> I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing >> this problem. > > Simplify the script, and reproduce the problem with something we can > actually read and understand. For example, > > #!/bin/bash > f() { > touch /root/nopermission > echo "\$? is $?" > } > f > If you can't reproduce the problem in the simpler script, then you know > the problem is being triggered by something in the more complex script. > That will help you narrow down the cause. > Yes, I agree this is the right way to approach it. However, I have not been able to reproduce the problem with a simple function as you have shown. It is almost definitely interaction with the rest of what is going on in the complicated script. In fact, as I comment out portions of it, the problem goes away. But I can't put my finger on any one thing that is the root cause. It feels like something is getting clobbered in the execution context. This failure does not happen every time this function is executed. It is very hit and miss. Currently I am running bash -x to see if I can get it to happen that way.
Re: exit status issue
On Fri, Nov 18, 2011 at 2:59 PM, Bob Proulx wrote: > Dallas Clement wrote: >> Geir Hauge wrote: >> > Add ''set -x'' at the start of the function and examine the output >> > to see if it actually runs touch from PATH. >> >> The strace output is showing that the correct 'touch' is being executed. > > It would be a lot easier to use the 'sh -x' trace than using strace > for seeing what is going on in your script. Try it first and see > commands are being run. Because you are setting CHK_RESULT=1 as a > default all that it would take for your script to exit 1 would be to > not be setting that value to zero at a time that you think it should. > > sh -x ./scripttodebug > > Your script is complicated enough that it isn't immediately obvious by > a casual inspection whether it is correct or not. Almost certainly in > these cases it is an error in the programming of the script. I would > much sooner suspect it than bash at this point. > >> TMP=`grep $1 /proc/mounts|awk '{print $1}'` > > You do not show how this is called but $1 option argument to grep may > be insufficiently quoted. > >> echo "*** fsaccesstest failed to unmount $1. ***" >/dev/console > > Writing directly to /dev/console is a little harsh. You might > consider using the 'logger' program and writing to the syslog instead. > > Your script makes me think that you might be using an NFS automounter > and trying to correct nfs client problems. (shrug) > > Bob > > I'm guessing you mean bash -x, not sh -x. Two different shells.
Re: exit status issue
On Fri, Nov 18, 2011 at 01:49:38PM -0600, Dallas Clement wrote: > [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], > [/* 14 vars */]) = 0 > Process 6748 attached > Process 13686 suspended > [pid 6748] execve("/usr/bin/logger", ["logger", "-s", "-t", > "diskmon", "-p", "local0.info", "*** fsaccesstest test > /mnt/array"...], [/* 14 vars */]) = 0 > Process 13686 resumed > Process 6747 detached > [pid 13686] --- SIGCHLD (Child exited) @ 0 (0) --- > Process 13686 suspended > diskmon: *** fsaccesstest test /mnt/array1 failed: status=1. retrying... (1) > *** > Process 13686 resumed > Process 6748 detached > --- SIGCHLD (Child exited) @ 0 (0) --- > > Notice that the child process pid=6747 exits with status 0 Where does it say that? How are you even invoking strace here? I don't know how to read this output, nor do I know how to reproduce it. > I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing > this problem. Simplify the script, and reproduce the problem with something we can actually read and understand. For example, #!/bin/bash f() { touch /root/nopermission echo "\$? is $?" } f If you can't reproduce the problem in the simpler script, then you know the problem is being triggered by something in the more complex script. That will help you narrow down the cause.
Re: exit status issue
Dallas Clement wrote: > Geir Hauge wrote: > > Add ''set -x'' at the start of the function and examine the output > > to see if it actually runs touch from PATH. > > The strace output is showing that the correct 'touch' is being executed. It would be a lot easier to use the 'sh -x' trace than using strace for seeing what is going on in your script. Try it first and see commands are being run. Because you are setting CHK_RESULT=1 as a default all that it would take for your script to exit 1 would be to not be setting that value to zero at a time that you think it should. sh -x ./scripttodebug Your script is complicated enough that it isn't immediately obvious by a casual inspection whether it is correct or not. Almost certainly in these cases it is an error in the programming of the script. I would much sooner suspect it than bash at this point. >TMP=`grep $1 /proc/mounts|awk '{print $1}'` You do not show how this is called but $1 option argument to grep may be insufficiently quoted. >echo "*** fsaccesstest failed to unmount $1. ***" >/dev/console Writing directly to /dev/console is a little harsh. You might consider using the 'logger' program and writing to the syslog instead. Your script makes me think that you might be using an NFS automounter and trying to correct nfs client problems. (shrug) Bob
Re: exit status issue
On Fri, Nov 18, 2011 at 6:16 AM, Geir Hauge wrote: > Add ''set -x'' at the start of the function and examine the output to see if > it actually runs touch from PATH. The strace output is showing that the correct 'touch' is being executed. [pid 6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], [/* 14 vars */]) = 0 Process 6748 attached Process 13686 suspended [pid 6748] execve("/usr/bin/logger", ["logger", "-s", "-t", "diskmon", "-p", "local0.info", "*** fsaccesstest test /mnt/array"...], [/* 14 vars */]) = 0 Process 13686 resumed Process 6747 detached [pid 13686] --- SIGCHLD (Child exited) @ 0 (0) --- Process 13686 suspended diskmon: *** fsaccesstest test /mnt/array1 failed: status=1. retrying... (1) *** Process 13686 resumed Process 6748 detached --- SIGCHLD (Child exited) @ 0 (0) --- Notice that the child process pid=6747 exits with status 0, but $?=1 after returning from executing touch. This triggers the failure. I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing this problem. I am seeing this failure whether I execute the 'touch' command or the 'stat' command. Both of these are incidentally busybox implementations. ls -al /bin/touch lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/touch -> busybox ls -al /bin/stat lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/stat -> busybox Are there any bash debugging tools available that I can use to halt on a data breakpoint and examine state? Thanks, Dallas
Re: exit status issue
2011/11/17 Dallas Clement > > Would appreciate any insight you could offer. Here is my script and > the strace output for when touch returns 1. > Add ''set -x'' at the start of the function and examine the output to see if it actually runs touch from PATH. -- Geir Hauge
exit status issue
Hello, I've got a script that is periodically performing a disk / raid array integrity check by touching a file. The touch is failing occasionally, with an exit status of 1. If I do 'strace touch' instead of just 'touch' the strace indicates an exit status of 0. So I'm not sure why this is getting propagated back as a 1 to the script. Would appreciate any insight you could offer. Here is my script and the strace output for when touch returns 1. fsaccesstest() { local RETRY_MAX=3 local RETRY_INTERVAL=3 local CHK_RESULT=1 TMP=`grep $1 /proc/mounts|awk '{print $1}'` if [ "$TMP" != "" ]; then for i in `seq 1 ${RETRY_MAX}` do touch $1/.accesstest CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ] ; then break fi echo "*** fsaccesstest test $1 failed: status=${CHK_RESULT}. retrying... (${i}) ***" sleep ${RETRY_INTERVAL} done if [ $CHK_RESULT -ne 0 ]; then echo "*** fsaccess test $1 failed. ***" >/dev/console ## try to remount it. It may be busy, so retry if necessary. for j in `seq 1 ${RETRY_MAX}` do umount -f -l $1 # Force a lazy unmount CHK_RESULT=$? if [ ${CHK_RESULT} -eq 0 ]; then break fi sleep ${RETRY_INTERVAL} done if [ ${CHK_RESULT} -eq 0 ]; then mount $TMP $1 CHK_RESULT=$? if [ ${CHK_RESULT} -ne 0 ]; then echo "*** fsaccesstest failed to mount $1. ***" >/dev/console return ${CHK_RESULT} fi else echo "*** fsaccesstest failed to unmount $1. ***" >/dev/console return ${CHK_RESULT} fi fi return 0 fi return 1 } execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:/usr/sb"..., "MAIL=/var/spool/mail/root", "C_INCLUDE_PATH=/usr/include:/usr"..., "PWD=/root", "HOME=/root", "SHLVL=2", "LOGNAME=root", "_=/usr/local/bin/strace"]) = 0 brk(0) = 0x1702000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f459915 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f459914f000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_dev=makedev(8, 162), st_ino=90088, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40, st_size=19548, st_atime=2011/11/15-08:39:02, st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/15-08:39:01}) = 0 mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f459914a000 close(3)= 0 open("/lib64/libcrypt.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\n\0"..., 832) = 832 fstat(3, {st_dev=makedev(8, 162), st_ino=285064, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=48, st_size=23096, st_atime=2011/11/15-08:10:31, st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0 mmap(NULL, 1257888, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4598f03000 mprotect(0x7f4598f08000, 1044480, PROT_NONE) = 0 mmap(0x7f4599007000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x7f4599007000 mmap(0x7f4599009000, 184736, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4599009000 close(3)= 0 open("/lib64/libm.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360?\0\0"..., 832) = 832 fstat(3, {st_dev=makedev(8, 162), st_ino=285045, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=1064, st_size=538200, st_atime=2011/11/15-08:10:31, st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0 mmap(NULL, 1569000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4598d83000 mprotect(0x7f4598e01000, 1048576, PROT_NONE) = 0 mmap(0x7f4598f01000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7e000) = 0x7f4598f01000 close(3)= 0 open("/lib64/libc.so.6", O_RDONLY)