We upgraded our system from a 16-processor AV20000 to an 8-processor
AV25000. Because each block of 4 CPUs can support only 4 GB of memory, the
upgrade was going to cut our memory in half. So the main purpose of the
changes was to cut the memory requirements of the UniVerse lock tables, and
we did in fact cut the per-user memory usage in half. The other
changes were made as part of an overall review of both uv.config and the
kernel parameters prior to the upgrade.

Because several people suggested that the semaphore and/or shared memory
changes could be the cause of the problem, I had them changed back to their
original values yesterday. So far we have not had a failure, but it's a
little too soon to celebrate.

Thanks for the advice on using the trace. I was not able to pick out the
system calls, and before I make another attempt I think I will wait and see
whether the weekend's changes fixed the problem. If the problem is solved I
will post in case someone else runs into the same problem in the future.
Thanks,
Vance Dailey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Ken Wallis
Sent: Sunday, February 08, 2004 7:52 PM
To: 'U2 Users Discussion List'
Subject: RE: UV command failing mystery


>From: Vance Dailey

>It was suggested that I try to run dg_strace. I ran it on one of the
>failing uv processes. It generated a 1 MB file. I can see where it
>executes "uvsh", and it fails just after the 7th occurrence of
>"RUN APP.PROGS PACKAGE.INS". (The 7th run is just after the string
>"SPECIAL.EDITOR.SELECT.DATA\OLONG".) I have no idea how to read this
>file, but I thought it might help identify where the error occurred. I
>have included the very end of the output of dg_strace below:
>
...
>close(3)                                = 0
>sigaction_svr3(SIGQUIT, {...}, {...})   = 1253
>sigaction_svr3(SIGNULL, {...}, {0xc0a0d, [XCPU XFSZ], SA_RESTART|SA_SIGINFO}) = 2130681856
...

This tool seems to be showing you the system calls that uvsh is making and
the values returned from them (the bit after the "=").  The section you have
shown is simply the program trying to tidy up and exit after detecting
something it didn't like.  You'll need to look higher up in the output for a
system call which seems to return an error code.  Unfortunately, you need to
know what sort of system calls should return 0 all the time and which ones
regularly return other values.  I think I'd be looking at calls to sem...()
or shm...() functions that return non-zero, then using errmsg (if DG/UX
has that, or vi-ing /usr/include/sys/errno.h if it doesn't) to see what the
returned error numbers mean, and the man pages to work out from that where
the problem lies.
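
As a rough illustration of that search, here is a minimal Python sketch that
lists sem...() and shm...() calls with a non-zero return value. It assumes
every call in the trace fits on one line shaped like "name(args) = value", as
in the excerpt above; the real dg_strace output may differ, and the function
name suspicious_calls is made up for this example. Calls such as semget,
shmget and shmat return an ID or an address on success, so a hit is only a
candidate to check against the man page, not proof of an error.

```python
import re

# Assumed line shape, taken from the excerpt quoted above: name(args) = value
# An optional leading ">" is tolerated so quoted mail text can be scanned too.
CALL_RE = re.compile(r"^>?\s*(?P<name>\w+)\((?P<args>.*)\)\s*=\s*(?P<ret>-?\d+)")


def suspicious_calls(lines, prefixes=("sem", "shm")):
    """Return (line_number, call_name, return_value) for each call whose
    name starts with one of the prefixes and whose return value is non-zero."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = CALL_RE.match(line)
        if match is None:
            continue  # wrapped continuation lines, "..." markers, blank lines
        name = match.group("name")
        ret = int(match.group("ret"))
        if name.startswith(prefixes) and ret != 0:
            hits.append((number, name, ret))
    return hits
```

To use it, open the trace file and pass the file object to suspicious_calls;
the line numbers it reports can then be looked up in the trace to see what
the process was doing just before each call.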

I can't remember the exact numbers you quoted earlier, but certainly with
your user counts I'd be very suspicious of the reductions you made to the
semaphore kernel parameters.  Just as a matter of interest, why were these
reductions made?

HTH,

Ken


--
u2-users mailing list
[EMAIL PROTECTED]
http://www.oliver.com/mailman/listinfo/u2-users
