Re: exit status issue

2011-11-23 Thread Greg Wooledge
On Wed, Nov 23, 2011 at 02:09:57PM -0600, Dallas Clement wrote:
> trap CleanupExit KILL

For the record, you cannot trap SIGKILL.



Re: exit status issue

2011-11-23 Thread Dallas Clement
Hi All,

Just wanted to let you know that I have found a solution to my
problem, though I can't entirely explain it yet.  As mentioned
earlier, this script was designed to run as a daemon, i.e. run in the
background.  This is what I was doing to daemonize the script:


Daemon()
{
while :
do
checkStatus
sleep 1
done
}

trap CleanupExit INT
trap CleanupExit TERM
trap CleanupExit ILL
trap CleanupExit QUIT
trap CleanupExit ABRT
trap CleanupExit KILL

if [ x"$1" = x"start" ]; then
Daemon 0<&- 1>/dev/null 2>&1 &
echo $! > $PIDFILE
elif [ x"$1" = x"stop" ]; then
TMP=`cat $PIDFILE`
kill $TMP
CleanupExit
else
:
fi

When running like this, I was seeing 'touch' command failures inside
function fsaccesstest() shown previously.

I did an experiment last night and ran this script in the foreground
(removed the &) from the call to Daemon.  To my surprise, I did not
see any errors with the 'touch' command.

I then went back and changed how I was turning this script into a
daemon, by using the start-stop-daemon utility instead like this:

start-stop-daemon --start --background --quiet --exec $DAEMON

Once I did this, my shell script is now running in the background
without any errors.  Life is good.

Perhaps when I was calling Daemon & as shown above, the parent process
was not exiting properly and was somehow interfering in the child
process execution.  I'm not really sure.

Anyhow, I'm past this problem now.  Thank you all for the good ideas
and generous help.



Re: exit status issue

2011-11-23 Thread Stefano Lattarini
On Wednesday 23 November 2011, Andreas Schwab wrote:
> "Steven W. Orr"  writes:
> 
> > I think we're beating this one to death, but I have point out that
> > telling perl to run a print command whose output is redirected by bash
> > is not the same as telling bash to run a builtin echo command that is
> > redirected as an integral operation of the same interpreter.
> 
> They are really the same, but even if you change it bash still wins:
> 
> $ bash -c 'echo "hello"' >/dev/full
> bash: line 0: echo: write error: No space left on device
>
IMO this is not really perl's fault, but rather the programmer's
fault, since he has forgotten to check for possible errors.  If
did, he would get:

  $ perl -e 'print "hello"; close(STDOUT) or die "$!\n";' >/dev/full
  No space left on device
  $ echo $?
  28

Something similar happens with C, of course, if you forget to check
error conditions:

  $ cat > foo.c <<'END'
  #include 
  int main (void)
  {
printf("hello\n");
return 0;
  }
  END
  $ gcc foo.c
  $ ./a.out >/dev/null # Oops, no error reported.
  $ cat > foo.c <<'END'
  #include 
  int main (void)
  {
printf("hello\n");
if (fclose(stdout) != 0)
  perror("Write error");
return 0;
  }
  END
  $ gcc foo.c
  $ ./a.out >/dev/null # Error will be reported now.
  Write error: No space left on device

More "modern" languages, with built-in exception handling, are somewhat
better in this regard:

  $ python3 -c 'print("foo")' >/dev/full
  Exception IOError: (28, 'No space left on device') in ... ignored

Just my 2 cents.

Regards,
  Stefano



Re: exit status issue

2011-11-23 Thread Andreas Schwab
"Steven W. Orr"  writes:

> I think we're beating this one to death, but I have point out that
> telling perl to run a print command whose output is redirected by bash
> is not the same as telling bash to run a builtin echo command that is
> redirected as an integral operation of the same interpreter.

They are really the same, but even if you change it bash still wins:

$ bash -c 'echo "hello"' >/dev/full
bash: line 0: echo: write error: No space left on device

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



Re: exit status issue

2011-11-22 Thread Steven W. Orr

On 11/22/2011 6:49 PM, Bob Proulx wrote:

Dallas Clement wrote:

This doesn't (yet) look like a problem with bash.


Admittedly bash seems to do the right thing if you go by the set -x
execution trace.  If you go by that, it would indeed seem that the
call to touch is failing.


I have a higher level of trust in -x output since I haven't found any
problems with it.  But I have found a lot of programs that do not
handle file errors correctly.  Those are way too common.  Therefore I
am much more suspicous of them.  For example:

   $ perl -le 'print "hello";'>/dev/full
   $ echo $?
   0

Perl ate that error and never said anything about it.  One of a very
many programs with this problem.  BTW...  Bash gets it right.

   $ echo "hello">/dev/full
   bash: echo: write error: No space left on device
   $ echo $?
   1



I think we're beating this one to death, but I have point out that telling perl 
to run a print command whose output is redirected by bash is not the same as 
telling bash to run a builtin echo command that is redirected as an integral 
operation of the same interpreter. IOW, the perl command did succeed. If you 
want to be fair, you need to write a perl command that writes to /dev/null



--
Time flies like the wind. Fruit flies like a banana. Stranger things have  .0.
happened but none stranger than this. Does your driver's license say Organ ..0
Donor?Black holes are where God divided by zero. Listen to me! We are all- 000
individuals! What if this weren't a hypothetical question?
steveo at syslang.net



Re: exit status issue

2011-11-22 Thread Dallas Clement
Bob, Yes I agree with everything you've said.  I am grasping at
straws, which I really don't like.  It's very unlikely that the
problems are with any of these things I've mentioned bash, Linux
kernel, compiler, libc etc.   This script is fairly large and it does
lots of checking of block devices, raid arrays, and volumes.  All of
it is executed synchronously which should make it simple to debug.
The script itself is executed in the background as a daemon process.
This script is executed on locally mounted RAID devices.  No network
involved in the test.

I'll continue my search for root cause with a more disciplined
approach.   Thanks for the help.



Re: exit status issue

2011-11-22 Thread Bob Proulx
Dallas Clement wrote:
> > This doesn't (yet) look like a problem with bash.
> 
> Admittedly bash seems to do the right thing if you go by the set -x
> execution trace.  If you go by that, it would indeed seem that the
> call to touch is failing.

I have a higher level of trust in -x output since I haven't found any
problems with it.  But I have found a lot of programs that do not
handle file errors correctly.  Those are way too common.  Therefore I
am much more suspicous of them.  For example:

  $ perl -le 'print "hello";' >/dev/full
  $ echo $?
  0

Perl ate that error and never said anything about it.  One of a very
many programs with this problem.  BTW...  Bash gets it right.

  $ echo "hello" >/dev/full
  bash: echo: write error: No space left on device
  $ echo $?
  1

> But the strace output tells a different story.  According to it, the
> system calls that touch makes are succeeding.

Your test case is still too complex to be passed on to others.

> In fact, I created my own 'touch' and 'stat' programs, both of which
> log any errors.

That's excellent!

> I never see any errors logged, so I'm pretty sure they are getting
> executed which jives with what I'm seeing in the strace output.

Since it is your own code you would want to see that they logged the
successful result of the operation and also logged that they were
calling exit(0) and at that time show that bash logged them as a
non-zero exit.

For that test you could use 'false' to simulate the unsuccessful program.

> Is this a lower level scheduling problem?  It certainly could be.  I'm
> testing on Linux 2.6.39.4.   I'll rollback to an older kernel, libc,
> and libpthread and see if that makes any difference.

Scheduling?  Now you are scaring me.  I think you are going off into
the weeds with that idea.  I can't imagine how this can be anything
other than simple inline flow of control through your script.  Your
script as you have shown to us does not have any background
asynchronous processing.  Therefore it will be executing all of the
commands in order.  There isn't any out-of-order execution magically
happening.

I see by this that you are grasping at straws.  You are "just trying
things".  I have empathy for you and your troubles.  But that isn't
the scientific method.

  http://en.wikipedia.org/wiki/Scientific_method#Elements_of_scientific_method

> Not doing anything with NFS mounting thankfully.  It's a RAID array on
> a NAS device which is typically used for SMB / CIFS network mount.

Your system is mounting it using an SMB mount?  Then perhaps your
network protocol layer is causing trouble.  I think a bug in the SMB
layer is much more likely.

Bob



Re: exit status issue

2011-11-22 Thread Dallas Clement
>> + touch /mnt/array1/.accesstest
>> + CHK_RESULT=1
>
> It looks to me that touch is failing and reporting the failure and
> that bash is handling it correctly.
>
>> + touch /mnt/array1/.accesstest
>> + CHK_RESULT=0
>
> And then on a subsequent pass touch is reporting success.
>
> This doesn't (yet) look like a problem with bash.

Admittedly bash seems to do the right thing if you go by the set -x
execution trace.  If you go by that, it would indeed seem that the
call to touch is failing.  But the strace output tells a different
story.  According to it, the system calls that touch makes are
succeeding.  In fact, I created my own 'touch' and 'stat' programs,
both of which log any errors.  I never see any errors logged, so I'm
pretty sure they are getting executed which jives with what I'm seeing
in the strace output.

Is this a lower level scheduling problem?  It certainly could be.  I'm
testing on Linux 2.6.39.4.   I'll rollback to an older kernel, libc,
and libpthread and see if that makes any difference.

>
>> The purpose of this function is simply to try and create or modify a
>> test file on a RAID share.  It's just a periodic integrity check.
>
> If I ever had even one single failure on a raid filesystem I would
> start replacing hardware.  I wouldn't be trying to retry.  I wouldn't
> be trying to umount and mount it again.  ANY failures would mean
> something so severely broken that I would consider the system unusable
> until repaired.

Yep, I agree - these sort of failures should not happen unless there
is something terribly wrong.

>
> On the other hand, do you really mean an NFS mounted filesystem?
> Perhaps using the nfs automounter?  If so then say so.  Don't say
> "share" which implies a MS Windows SMB / CIFS network mount.  The NFS
> client subsystem is notoriously bad.  The nfs automounter is worse.  I
> am guessing you probably really mean that you are using an nfs mounted
> filesystem using the automounter.  Pretty much everyone in that
> situation has tried to recover from problems with it at one time or
> another.

Not doing anything with NFS mounting thankfully.  It's a RAID array on
a NAS device which is typically used for SMB / CIFS network mount.



Re: exit status issue

2011-11-22 Thread Bob Proulx
Dallas Clement wrote:
> Okay, I simplified the script slightly, fixed the quoting and re-ran
> with set -x.  Here is the script and the output for the failure
> scenario.

Much more useful!  Thanks.

> + touch /mnt/array1/.accesstest
> + CHK_RESULT=1

It looks to me that touch is failing and reporting the failure and
that bash is handling it correctly.

> + touch /mnt/array1/.accesstest
> + CHK_RESULT=0

And then on a subsequent pass touch is reporting success.

This doesn't (yet) look like a problem with bash.

> The purpose of this function is simply to try and create or modify a
> test file on a RAID share.  It's just a periodic integrity check.

If I ever had even one single failure on a raid filesystem I would
start replacing hardware.  I wouldn't be trying to retry.  I wouldn't
be trying to umount and mount it again.  ANY failures would mean
something so severely broken that I would consider the system unusable
until repaired.

On the other hand, do you really mean an NFS mounted filesystem?
Perhaps using the nfs automounter?  If so then say so.  Don't say
"share" which implies a MS Windows SMB / CIFS network mount.  The NFS
client subsystem is notoriously bad.  The nfs automounter is worse.  I
am guessing you probably really mean that you are using an nfs mounted
filesystem using the automounter.  Pretty much everyone in that
situation has tried to recover from problems with it at one time or
another.

> The set -x output shows that the 'touch' command failed, but it
> doesn't show why.

That sounds like a bug in touch.  If there is an error then it should
emit an error message.

Perhaps you should try an alternate command to create the file.  You
could use bash itself.

  : >> "$1"/.accesstest

Or you could use a helper command.  Using a different program other
than 'touch' may cause it to emit a better error message.

Bob



Re: exit status issue

2011-11-22 Thread Greg Wooledge
On Tue, Nov 22, 2011 at 02:55:28PM -0600, Dallas Clement wrote:
> fsaccesstest()
> {
>   set -x
> 
>   local RETRY_MAX=3
>   local RETRY_INTERVAL=3
>   local CHK_RESULT=1
>   local RETVAL=0
> 
>   TMP=`grep "$1" /proc/mounts|awk '{print $1}'`
>   if [ "$TMP" != "" ]; then
>   for i in `seq 1 ${RETRY_MAX}`
>   do
>   touch "$1"/.accesstest
>   CHK_RESULT=$?
>   if [ ${CHK_RESULT} -eq 0 ] ; then
>   break
>   fi
>   logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** 
> fsaccesstest test $1
> failed CHK_RESULT=$CHK_RESULT. retrying... (${i}) ***"
>   sleep ${RETRY_INTERVAL}
>   done
>   if [ $CHK_RESULT -ne 0 ]; then
>   RETVAL=$CHK_RESULT
>   fi
>   fi
> 
>   set +x
> 
>   return $RETVAL
> }

That does look better.

> + touch /mnt/array1/.accesstest
> + CHK_RESULT=1
> + '[' 1 -eq 0 ']'
> + logger -s -t diskmon -p local0.info '*** fsaccesstest test
> /mnt/array1 failed CHK_RESULT=1. retrying... (1) ***'
> diskmon: *** fsaccesstest test /mnt/array1 failed CHK_RESULT=1.
> retrying... (1) ***

> The purpose of this function is simply to try and create or modify a
> test file on a RAID share.  It's just a periodic integrity check.

OK.

> The set -x output shows that the 'touch' command failed, but it
> doesn't show why.

Presumably the touch command itself will tell you why, on stderr.  It
should say something like "No such file or directory" or "Permission
denied" or "Device or resource busy", etc.

> The three system calls that touch makes are all successful.
> So my question is why is $? getting set to 1?

If you suspect a bug in touch, then you need to report it to your OS
vendor.  touch isn't part of bash.

> It looks to me like the
> bash interpreter is returning early from this call to 'touch'.  I
> can't prove this yet though.

There's no way bash would do that, given the script as shown.  There is
no background execution there at all.

> Here's another example of an unsuccessful touch execution.  Notice how
> the execution keeps getting interrupted and then resumed.

This is getting into kernel scheduling, etc.  Way off topic for a bash
mailing list.  You might need to take this up on the Linux Kernel
Mailing List, or something similar if you are not doing this in Linux.



Re: exit status issue

2011-11-22 Thread Dallas Clement
> That makes no sense.  Fix your quoting (it's atrocious) and then run
> the function with "set -x".  Don't throw strace at it, at this point.
> It's really not the right tool at this level of debugging.

Okay, I simplified the script slightly, fixed the quoting and re-ran
with set -x.  Here is the script and the output for the failure
scenario.

# $1 mntpoint
fsaccesstest()
{
set -x

local RETRY_MAX=3
local RETRY_INTERVAL=3
local CHK_RESULT=1
local RETVAL=0

TMP=`grep "$1" /proc/mounts|awk '{print $1}'`
if [ "$TMP" != "" ]; then
for i in `seq 1 ${RETRY_MAX}`
do
touch "$1"/.accesstest
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ] ; then
break
fi
logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** 
fsaccesstest test $1
failed CHK_RESULT=$CHK_RESULT. retrying... (${i}) ***"
sleep ${RETRY_INTERVAL}
done
if [ $CHK_RESULT -ne 0 ]; then
RETVAL=$CHK_RESULT
fi
fi

set +x

return $RETVAL
}

+ local RETRY_MAX=3
+ local RETRY_INTERVAL=3
+ local CHK_RESULT=1
+ local RETVAL=0
++ grep /mnt/array1 /proc/mounts
++ awk '{print $1}'
+ TMP=/dev/md10
+ '[' /dev/md10 '!=' '' ']'
++ seq 1 3
+ for i in '`seq 1 ${RETRY_MAX}`'
+ touch /mnt/array1/.accesstest
+ CHK_RESULT=1
+ '[' 1 -eq 0 ']'
+ logger -s -t diskmon -p local0.info '*** fsaccesstest test
/mnt/array1 failed CHK_RESULT=1. retrying... (1) ***'
diskmon: *** fsaccesstest test /mnt/array1 failed CHK_RESULT=1.
retrying... (1) ***
+ sleep 3
+ for i in '`seq 1 ${RETRY_MAX}`'
+ touch /mnt/array1/.accesstest
+ CHK_RESULT=0
+ '[' 0 -eq 0 ']'
+ break
+ '[' 0 -ne 0 ']'
+ set +x

> I don't even know what your actual *symptom* is.  I can't deduce it from
> a spew of strace output.  You haven't described the reason for the touch
> command, so I can only presume this is some sort of autofs environment,
> hence my attempt to solve the issue, above.

The purpose of this function is simply to try and create or modify a
test file on a RAID share.  It's just a periodic integrity check.

The set -x output shows that the 'touch' command failed, but it
doesn't show why.  I certainly acknowledge the strace output is a lot
of stuff to untangle, but it does show what is happening at a process
level and the return status for every system call.  What I'm seeing is
that there are lots of concurrent processes running at the time this
particular shell script function is executed.  The other processes
seem to have an adverse effect on the process that is executing
'touch'.  The three system calls that touch makes are all successful.
So my question is why is $? getting set to 1?  It looks to me like the
bash interpreter is returning early from this call to 'touch'.  I
can't prove this yet though.

Here's another example of an unsuccessful touch execution.  Notice how
the execution keeps getting interrupted and then resumed.

[pid 31400] execve("/tmp/test-touch", ["/tmp/test-touch",
"/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux",
"SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"...,
"HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:/usr/sb"...,
"MAIL=/var/spool/mail/root", "C_INCLUDE_PATH=/usr/include:/usr"...,
"PWD=/usr/local/sbin", "HOME=/root", "SHLVL=2", "LOGNAME=root",
"_=/tmp/test-touch"]Process 31401 attached (waiting for parent)

[pid 31400] <... execve resumed> )  = 0
[pid 31400] brk(0 
[pid 31400] <... brk resumed> ) = 0x2369000

[pid 31400] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666

[pid 31400] <... open resumed> )= 3
[pid 31400] close(3 
[pid 31400] <... close resumed> )   = 0
[pid 31400] utimes("/mnt/array1/.accesstest", {4196394, 0} 
[pid 31400] <... utimes resumed> )  = 0
[pid 31400] exit_group(0)   = ?
Process 23756 resumed
Process 31400 detached
[pid 23756] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) ==
0}], 0, NULL) = 31400
[pid 23756] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 23756] wait4(-1, Process 23756 suspended



Re: exit status issue

2011-11-22 Thread Greg Wooledge
On Tue, Nov 22, 2011 at 11:31:22AM -0600, Dallas Clement wrote:
> It turns out that my 'touch' command is actually getting executed, but
> not when I expect it.  Here is a recap of my script function which is
> occasionally misbehaving:
> 
> # $1 mntpoint
> fsaccesstest()
> {
>   local RETRY_MAX=3
>   local RETRY_INTERVAL=3
>   local CHK_RESULT=1
> 
>   TMP=`grep $1 /proc/mounts|awk '{print $1}'`
>   if [ "$TMP" != "" ]; then
>   for i in `seq 1 ${RETRY_MAX}`
>   do
>   touch $1/.accesstest
>   CHK_RESULT=$?
>   if [ ${CHK_RESULT} -eq 0 ] ; then
>   break
>   fi
>   logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** 
> fsaccesstest test $1
> failed. retrying... (${i}) ***"
>   sleep ${RETRY_INTERVAL}
>   done
>   if [ $CHK_RESULT -ne 0 ]; then
>   logger -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest 
> test $1 failed. ***"
>   echo "*** fsaccess test $1 failed. ***" >/dev/console
>   ## try remount it
>   umount $1
>   mount $TMP $1
>   return $?
>   fi
>   return 0
>   fi
>   return 1
> }

> In the failure scenario, I had wrongly assumed that the touch command
> was not being executed, when in fact it is.  It's just that the
> execution is deferred somehow and occurs after the call to 'logger',
> which would not have been executed had the call to 'touch' somehow
> failed or returned too soon.

That makes no sense.  Fix your quoting (it's atrocious) and then run
the function with "set -x".  Don't throw strace at it, at this point.
It's really not the right tool at this level of debugging.

Also note that there are *two* calls to logger in this function.  You
might be mistaking which logger call you are seeing, particularly since
you're trying to debug with the wrong tool.

If I understand what this function is attempting to do:

fstest() {
local i=0
while ((++i < 3)); do
# If it's mounted, yay!
df "$1" && return 0

# It's not mounted, so try to trigger autofs.
touch "$1"/.accesstest
sleep 3
done

# We failed 3 times, so give up.
return 1
}

I don't even know what your actual *symptom* is.  I can't deduce it from
a spew of strace output.  You haven't described the reason for the touch
command, so I can only presume this is some sort of autofs environment,
hence my attempt to solve the issue, above.



Re: exit status issue

2011-11-22 Thread Dallas Clement
It turns out that my 'touch' command is actually getting executed, but
not when I expect it.  Here is a recap of my script function which is
occasionally misbehaving:

# $1 mntpoint
fsaccesstest()
{
local RETRY_MAX=3
local RETRY_INTERVAL=3
local CHK_RESULT=1

TMP=`grep $1 /proc/mounts|awk '{print $1}'`
if [ "$TMP" != "" ]; then
for i in `seq 1 ${RETRY_MAX}`
do
touch $1/.accesstest
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ] ; then
break
fi
logger -s -t ${LOGTAG} -p ${LOGFACILITY} "*** 
fsaccesstest test $1
failed. retrying... (${i}) ***"
sleep ${RETRY_INTERVAL}
done
if [ $CHK_RESULT -ne 0 ]; then
logger -t ${LOGTAG} -p ${LOGFACILITY} "*** fsaccesstest 
test $1 failed. ***"
echo "*** fsaccess test $1 failed. ***" >/dev/console
## try remount it
umount $1
mount $TMP $1
return $?
fi
return 0
fi
return 1
}

Most of the time this function executes the 'touch' command without
errors as shown in the following strace output.  The touch command
make just three system calls to open(), close(), and utimes() which
you can clearly see.

Success Scenario

[pid 23930] execve("/tmp/test-touch", ["/tmp/test-touch",
"/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux",
"SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"...,
"HUSHLOGIN=FALSE", "USER=root", "PATH=/bin:/sbin:/usr/bin:
[pid 23930] brk(0)  = 0x18d2000
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9676000
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9675000
[pid 23930] access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such
file or directory)
[pid 23930] open("/etc/ld.so.cache", O_RDONLY) = 3
[pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=90088,
st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096,
st_blocks=40, st_size=19548, st_atime=2011/11/19-08:39:02,
st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/
[pid 23930] mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) =
0x7f4ab967
[pid 23930] close(3)= 0
[pid 23930] open("/lib64/libc.so.6", O_RDONLY) = 3
[pid 23930] read(3,
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\333\1\0"..., 832) =
832
[pid 23930] fstat(3, {st_dev=makedev(8, 162), st_ino=285057,
st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096,
st_blocks=2616, st_size=1331216, st_atime=2011/11/19-08:10:31,
st_mtime=2010/06/29-12:51:27, st_ctime=201
[pid 23930] mmap(NULL, 2343048, PROT_READ|PROT_EXEC,
MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4ab932
[pid 23930] mprotect(0x7f4ab9454000, 1044480, PROT_NONE) = 0
[pid 23930] mmap(0x7f4ab9553000, 20480, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x133000) = 0x7f4ab9553000
[pid 23930] mmap(0x7f4ab9558000, 16520, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4ab9558000
[pid 23930] close(3)= 0
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966f000
[pid 23930] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4ab966e000
[pid 23930] arch_prctl(ARCH_SET_FS, 0x7f4ab966e6f0) = 0
[pid 23930] mprotect(0x7f4ab9553000, 12288, PROT_READ) = 0
[pid 23930] munmap(0x7f4ab967, 19548) = 0
[pid 23930] open("/mnt/array1/.accesstest", O_RDWR|O_CREAT, 0666) = 3
[pid 23930] close(3)= 0
[pid 23930] utimes("/mnt/array1/.accesstest", {4196394, 0}) = 0
[pid 23930] exit_group(0)   = ?
Process 23756 resumed
Process 23930 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23930
--- SIGCHLD (Child exited) @ 0 (0) ---

In the failure scenario, I had wrongly assumed that the touch command
was not being executed, when in fact it is.  It's just that the
execution is deferred somehow and occurs after the call to 'logger',
which would not have been executed had the call to 'touch' somehow
failed or returned too soon.  So it looks I am being afflicted with
out of order execution problems.  I am expecting that if my script
makes a call to 'touch' it should not return until the command has
been completely executed, i.e. the child process reaped, and exit
status determined.  This doesn't seem to be happening.

Please forgive for the volume of data.

Failure Scenario

[pid 26415] execve("/tmp/test-touch", ["/tmp/test-touch",
"/mnt/array1/.accesstest"], ["HOSTNAME=TS-2RVED8", "TERM=linux",
"SHELL=/bin/sh", "LIBRARY_PATH=/usr/lib:/usr/local"...,
"HUSHLOGIN=FALSE", "USER=r

Re: exit status issue

2011-11-18 Thread Dallas Clement
On Fri, Nov 18, 2011 at 2:35 PM, Greg Wooledge  wrote:
> On Fri, Nov 18, 2011 at 02:23:54PM -0600, Dallas Clement wrote:
>> [pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
>> [/* 14 vars */]) = 0
>>
>> The = 0 at the end is the exit status.
>
> That is the return value of the execve() call.  All it means is that
> the /bin/touch program was successfully launched.
>
> $ strace -f bash -c false
> ...
> execve("/bin/bash", ["bash", "-c", "false"], [/* 32 vars */]) = 0
> ...
>
> arc3:~$ bash -c false; echo $?
> 1
>

Yes, you are sure right on all counts.  The (0) in SIGCHLD (Child
exited) @ 0 (0) is clearly not the exit status of the child either.
Proven by a simple test script.

#!/tmp/bash3

touch /bogus/file
echo $?

-

root@TS-2RVED8:/tmp# strace -f -e execve ./test.sh
execve("./test.sh", ["./test.sh"], [/* 15 vars */]) = 0
Process 7269 attached
Process 7268 suspended
[pid  7269] execve("/bin/touch", ["touch", "/bogus/file"], [/* 14 vars */]) = 0
touch: /bogus/file: No such file or directory
Process 7268 resumed
Process 7269 detached
--- SIGCHLD (Child exited) @ 0 (0) ---
1
Process 7268 detached

Thanks for setting me straight guys.  I was misled.  It looks like the
'touch' command really could be failing in my script.



Re: exit status issue

2011-11-18 Thread Andreas Schwab
Dallas Clement  writes:

> [pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
> [/* 14 vars */]) = 0
>
> The = 0 at the end is the exit status.

No, it isn't.  It only says that execve was successful.  The exit status
is set by the exit syscall.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



Re: exit status issue

2011-11-18 Thread Greg Wooledge
On Fri, Nov 18, 2011 at 02:23:54PM -0600, Dallas Clement wrote:
> [pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
> [/* 14 vars */]) = 0
> 
> The = 0 at the end is the exit status.

That is the return value of the execve() call.  All it means is that
the /bin/touch program was successfully launched.

$ strace -f bash -c false
...
execve("/bin/bash", ["bash", "-c", "false"], [/* 32 vars */]) = 0
...

arc3:~$ bash -c false; echo $?
1



Re: exit status issue

2011-11-18 Thread Bob Proulx
DJ Mills wrote:
> Bob Proulx wrote:
> >  sh -x ./scripttodebug
> 
> I'm guessing you mean bash -x, not sh -x.  Two different shells.

It is a bash list so I probably should have said bash just to be
politically correct.  But the script doesn't use any bash specific
constructs so sh should be fine.  Saying sh is habit for me since I
always use sh when possible.

Bob




Re: exit status issue

2011-11-18 Thread Dallas Clement
On Fri, Nov 18, 2011 at 2:07 PM, Greg Wooledge  wrote:
>> Notice that the child process pid=6747 exits with status 0
>
> Where does it say that?  How are you even invoking strace here?  I don't
> know how to read this output, nor do I know how to reproduce it.

[pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
[/* 14 vars */]) = 0

The = 0 at the end is the exit status.

I know it's complicated but this script is run as a daemon.  Was using
strace with -f option to follow through forks, i.e.

strace -f -e execve ./diskmon_exec.sh  start > /mnt/array1/share/strace.txt 2>&1

>> I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing
>> this problem.
>
> Simplify the script, and reproduce the problem with something we can
> actually read and understand.  For example,
>
> #!/bin/bash
> f() {
>  touch /root/nopermission
>  echo "\$? is $?"
> }
> f

> If you can't reproduce the problem in the simpler script, then you know
> the problem is being triggered by something in the more complex script.
> That will help you narrow down the cause.
>

Yes, I agree this is the right way to approach it.  However, I have
not been able to reproduce the problem with a simple function as you
have shown.  It is almost definitely interaction with the rest of what
is going on in the complicated script.  In fact, as I comment out
portions of it, the problem goes away.  But I can't put my finger on
any one thing that is the root cause.  It feels like something is
getting clobbered in the execution context.  This failure does not
happen every time this function is executed.  It is very hit and miss.

Currently I am running bash -x to see if I can get it to happen that way.



Re: exit status issue

2011-11-18 Thread DJ Mills
On Fri, Nov 18, 2011 at 2:59 PM, Bob Proulx  wrote:
> Dallas Clement wrote:
>> Geir Hauge wrote:
>> > Add ''set -x'' at the start of the function and examine the output
>> > to see if it actually runs touch from PATH.
>>
>> The strace output is showing that the correct 'touch' is being executed.
>
> It would be a lot easier to use the 'sh -x' trace than using strace
> for seeing what is going on in your script.  Try it first and see
> commands are being run.  Because you are setting CHK_RESULT=1 as a
> default all that it would take for your script to exit 1 would be to
> not be setting that value to zero at a time that you think it should.
>
>  sh -x ./scripttodebug
>
> Your script is complicated enough that it isn't immediately obvious by
> a casual inspection whether it is correct or not.  Almost certainly in
> these cases it is an error in the programming of the script.  I would
> much sooner suspect it than bash at this point.
>
>>        TMP=`grep $1 /proc/mounts|awk '{print $1}'`
>
> You do not show how this is called but $1 option argument to grep may
> be insufficiently quoted.
>
>>    echo "*** fsaccesstest failed to unmount $1. ***" >/dev/console
>
> Writing directly to /dev/console is a little harsh.  You might
> consider using the 'logger' program and writing to the syslog instead.
>
> Your script makes me think that you might be using an NFS automounter
> and trying to correct nfs client problems.  (shrug)
>
> Bob
>
>

I'm guessing you mean bash -x, not sh -x.  Two different shells.



Re: exit status issue

2011-11-18 Thread Greg Wooledge
On Fri, Nov 18, 2011 at 01:49:38PM -0600, Dallas Clement wrote:
> [pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
> [/* 14 vars */]) = 0
> Process 6748 attached
> Process 13686 suspended
> [pid  6748] execve("/usr/bin/logger", ["logger", "-s", "-t",
> "diskmon", "-p", "local0.info", "*** fsaccesstest test
> /mnt/array"...], [/* 14 vars */]) = 0
> Process 13686 resumed
> Process 6747 detached
> [pid 13686] --- SIGCHLD (Child exited) @ 0 (0) ---
> Process 13686 suspended
> diskmon: *** fsaccesstest test /mnt/array1 failed: status=1. retrying... (1) 
> ***
> Process 13686 resumed
> Process 6748 detached
> --- SIGCHLD (Child exited) @ 0 (0) ---
> 
> Notice that the child process pid=6747 exits with status 0

Where does it say that?  How are you even invoking strace here?  I don't
know how to read this output, nor do I know how to reproduce it.

> I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing
> this problem.

Simplify the script, and reproduce the problem with something we can
actually read and understand.  For example,

#!/bin/bash
f() {
  touch /root/nopermission
  echo "\$? is $?"
}
f

If you can't reproduce the problem in the simpler script, then you know
the problem is being triggered by something in the more complex script.
That will help you narrow down the cause.



Re: exit status issue

2011-11-18 Thread Bob Proulx
Dallas Clement wrote:
> Geir Hauge wrote:
> > Add ''set -x'' at the start of the function and examine the output
> > to see if it actually runs touch from PATH.
> 
> The strace output is showing that the correct 'touch' is being executed.

It would be a lot easier to use the 'sh -x' trace than using strace
for seeing what is going on in your script.  Try it first and see
commands are being run.  Because you are setting CHK_RESULT=1 as a
default all that it would take for your script to exit 1 would be to
not be setting that value to zero at a time that you think it should.

  sh -x ./scripttodebug

Your script is complicated enough that it isn't immediately obvious by
a casual inspection whether it is correct or not.  Almost certainly in
these cases it is an error in the programming of the script.  I would
much sooner suspect it than bash at this point.

>TMP=`grep $1 /proc/mounts|awk '{print $1}'`

You do not show how this is called but $1 option argument to grep may
be insufficiently quoted.

>echo "*** fsaccesstest failed to unmount $1. ***" >/dev/console

Writing directly to /dev/console is a little harsh.  You might
consider using the 'logger' program and writing to the syslog instead.

Your script makes me think that you might be using an NFS automounter
and trying to correct nfs client problems.  (shrug)

Bob



Re: exit status issue

2011-11-18 Thread Dallas Clement
On Fri, Nov 18, 2011 at 6:16 AM, Geir Hauge  wrote:
> Add ''set -x'' at the start of the function and examine the output to see if
> it actually runs touch from PATH.

The strace output is showing that the correct 'touch' is being executed.

[pid  6747] execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
[/* 14 vars */]) = 0
Process 6748 attached
Process 13686 suspended
[pid  6748] execve("/usr/bin/logger", ["logger", "-s", "-t",
"diskmon", "-p", "local0.info", "*** fsaccesstest test
/mnt/array"...], [/* 14 vars */]) = 0
Process 13686 resumed
Process 6747 detached
[pid 13686] --- SIGCHLD (Child exited) @ 0 (0) ---
Process 13686 suspended
diskmon: *** fsaccesstest test /mnt/array1 failed: status=1. retrying... (1) ***
Process 13686 resumed
Process 6748 detached
--- SIGCHLD (Child exited) @ 0 (0) ---

Notice that the child process pid=6747 exits with status 0, but $?=1
after returning from executing touch.  This triggers the failure.

I just upgrade my bash from version 3.21 to 4.2 and I'm still seeing
this problem.

I am seeing this failure whether I execute the 'touch' command or the
'stat' command.  Both of these are incidentally busybox
implementations.

 ls -al /bin/touch
lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/touch -> busybox

ls -al /bin/stat
lrwxrwxrwx1 root root7 Nov 15 08:07 /bin/stat -> busybox

Are there any bash debugging tools available that I can use to halt on
a data breakpoint and examine state?

Thanks,
Dallas



Re: exit status issue

2011-11-18 Thread Geir Hauge
2011/11/17 Dallas Clement 

>
>  Would appreciate any insight you could offer.  Here is my script and
> the strace output for when touch returns 1.
>

Add ''set -x'' at the start of the function and examine the output to see
if it actually runs touch from PATH.

-- 
Geir Hauge


exit status issue

2011-11-17 Thread Dallas Clement
Hello,

I've got a script that is periodically performing a disk / raid array
integrity check by touching a file.  The touch is failing
occasionally, with an exit status of 1.  If I do 'strace touch'
instead of just 'touch' the strace indicates an exit status of 0.  So
I'm not sure why this is getting propagated back as a 1 to the script.
  Would appreciate any insight you could offer.  Here is my script and
the strace output for when touch returns 1.

fsaccesstest()
{
local RETRY_MAX=3
local RETRY_INTERVAL=3
local CHK_RESULT=1

TMP=`grep $1 /proc/mounts|awk '{print $1}'`
if [ "$TMP" != "" ]; then
for i in `seq 1 ${RETRY_MAX}`
do
touch $1/.accesstest
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ] ; then
break
fi
echo "*** fsaccesstest test $1 failed: 
status=${CHK_RESULT}.
retrying... (${i}) ***"
sleep ${RETRY_INTERVAL}
done
if [ $CHK_RESULT -ne 0 ]; then
echo "*** fsaccess test $1 failed. ***" >/dev/console
## try to remount it.  It may be busy, so retry if 
necessary.
for j in `seq 1 ${RETRY_MAX}`
do
umount -f -l $1 # Force a lazy unmount
CHK_RESULT=$?
if [ ${CHK_RESULT} -eq 0 ]; then
break
fi
sleep ${RETRY_INTERVAL}
done
if [ ${CHK_RESULT} -eq 0 ]; then
mount $TMP $1
CHK_RESULT=$?
if [ ${CHK_RESULT} -ne 0 ]; then
echo "*** fsaccesstest failed to mount 
$1. ***" >/dev/console
return ${CHK_RESULT}
fi
else
echo "*** fsaccesstest failed to unmount $1. 
***" >/dev/console
return ${CHK_RESULT}
fi
fi
return 0
fi
return 1
}

execve("/bin/touch", ["touch", "/mnt/array1/.accesstest"],
["HOSTNAME=TS-2RVED8", "TERM=linux", "SHELL=/bin/sh",
"LIBRARY_PATH=/usr/lib:/usr/local"..., "HUSHLOGIN=FALSE", "USER=root",
"PATH=/bin:/sbin:/usr/bin:/usr/sb"..., "MAIL=/var/spool/mail/root",
"C_INCLUDE_PATH=/usr/include:/usr"..., "PWD=/root", "HOME=/root",
"SHLVL=2", "LOGNAME=root", "_=/usr/local/bin/strace"]) = 0
brk(0)  = 0x1702000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f459915
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f459914f000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)  = 3
fstat(3, {st_dev=makedev(8, 162), st_ino=90088, st_mode=S_IFREG|0644,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=40,
st_size=19548, st_atime=2011/11/15-08:39:02,
st_mtime=2011/11/15-08:39:00, st_ctime=2011/11/15-08:39:01}) = 0
mmap(NULL, 19548, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f459914a000
close(3)= 0
open("/lib64/libcrypt.so.1", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\n\0"..., 832) = 832
fstat(3, {st_dev=makedev(8, 162), st_ino=285064, st_mode=S_IFREG|0755,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=48,
st_size=23096, st_atime=2011/11/15-08:10:31,
st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0
mmap(NULL, 1257888, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f4598f03000
mprotect(0x7f4598f08000, 1044480, PROT_NONE) = 0
mmap(0x7f4599007000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x7f4599007000
mmap(0x7f4599009000, 184736, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4599009000
close(3)= 0
open("/lib64/libm.so.6", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360?\0\0"..., 832) = 832
fstat(3, {st_dev=makedev(8, 162), st_ino=285045, st_mode=S_IFREG|0755,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=1064,
st_size=538200, st_atime=2011/11/15-08:10:31,
st_mtime=2010/06/29-12:51:27, st_ctime=2011/11/15-08:07:41}) = 0
mmap(NULL, 1569000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f4598d83000
mprotect(0x7f4598e01000, 1048576, PROT_NONE) = 0
mmap(0x7f4598f01000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7e000) = 0x7f4598f01000
close(3)= 0
open("/lib64/libc.so.6", O_RDONLY)