Hi. So we tried hard-setting the wall-time limit for the specific user who is experiencing the exit 137 issue. We noticed the jobs are still failing, however.
One job that was killed included the wall-time setting. Obviously the job did not run for 24h anyway; input and outputs are shown below.

--------
qsub b5_set11_2.sh

b5_set11_2.sh:

#$ -cwd
#$ -l h_rt=24:00:00
#$ -l vf=20G
#$ -N b5_set11_2
#$ -m eas
#$ -M someguy@somewhere
/blah/blah/blah/bayesRsim <b5_set11_2.par

cat b5_set11_2.e1325823:

/opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:  8117 Killed    /blah/blah/blag/bayesRsim < b5_set11_2.par

qacct -j 1325823
==============================================================
qname        medium.q
hostname     compute-0-4.local
group        users
owner        someguy
project      NONE
department   defaultdepartment
jobname      b5_set11_2
jobnumber    1325823
taskid       undefined
account      sge
priority     0
qsub_time    Mon Jan 14 15:36:49 2013
start_time   Mon Jan 14 15:36:55 2013
end_time     Mon Jan 14 18:11:56 2013
granted_pe   NONE
slots        1
failed       0
exit_status  137
ru_wallclock 9301
ru_utime     9262.906
ru_stime     7.916
ru_maxrss    13820636
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    46056
ru_majflt    26
ru_nswap     0
ru_inblock   392840
ru_oublock   32
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     536
ru_nivcsw    30791
cpu          9270.822
mem          61688.906
io           0.430
iow          0.000
maxvmem      13.302G
arid         undefined
--------

So, you mentioned the "default time limit of your shell". My googling suggested setting a wall-time limit, or having the user specify the wall time explicitly, but that did not help. A few searches turn up a global time limit for jobs in general, but make no reference to a default time limit of the shell.

Am I supposed to be looking at limits such as s_rt and h_rt? If so, how do I manipulate these for the specific user? The queue_conf man page makes some reference to them, but it doesn't explain explicitly how to set them globally or on a per-user basis, or how they relate to defaults or the "shell". Sorry - just stumbling through this and not finding it terribly intuitive.
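For what it's worth, one thing we did confirm while digging: exit status 137 is just the shell's encoding of 128 + 9, i.e. the process was terminated by SIGKILL and never got a chance to print a runtime error or traceback - which would explain why the ifort traceback builds stayed silent. A quick sketch of what we checked (the queue name and the ~/.sge_request idea are our reading of the sge_request(5) man page, not something we've verified fixes this):

```shell
# Exit status 137 = 128 + 9 (SIGKILL). Demonstrate the encoding by
# having a subshell SIGKILL itself, then printing the parent's $?.
sh -c 'kill -KILL $$' 2>/dev/null
echo "exit status: $?"   # prints: exit status: 137

# On a compute node, check whether the shell itself imposes a CPU-time
# limit (prints a number of seconds, or "unlimited"):
ulimit -t

# The queue's own run-time limits can be inspected with, e.g.:
#   qconf -sq medium.q | grep _rt
# and, per sge_request(5), a per-user default request can apparently be
# placed in that user's ~/.sge_request file, e.g. the single line:
#   -l h_rt=24:00:00
```

The 137 = 128 + SIGKILL part at least rules out an ordinary crash in the code; something external (a limit enforcer or the kernel) is delivering the kill.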
--JC

On 14/01/13 10:34 AM, "Ron Chen" <[email protected]> wrote:

>Exit code 137 = process was killed because it exceeded the time limit,
>and Google is your best friend if you have similar issues - and the
>solution is to check the default time limit of your shell.
>
> -Ron
>
>************************************************************************
>
>Open Grid Scheduler - the official open source Grid Engine:
>http://gridscheduler.sourceforge.net/
>
>________________________________
>From: Jake Carroll <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Sunday, January 13, 2013 6:56 PM
>Subject: [gridengine users] Error 137 - trying to figure out what it
>means.
>
>Hi all.
>
>We're trying to figure out the answer to a problem that is escaping us.
>We can usually solve most of these issues ourselves, but this one we're
>having trouble trapping, and we can't find any solid answers after a lot
>of looking around at online resources.
>
>One of our quite capable users [read: he rarely needs our help with grid
>engine] has an unusual issue with certain jobs (seemingly, randomly?)
>crashing out with error 137. The code is predominantly C++ based, running
>atop SGE 6.2u5 on the ROCKS cluster platform. What is making it hard for
>us is that these array-based jobs (no PEs/parallel environments and no
>mpi/mpich explicitly in use) are only crashing sometimes. Some, and not
>others. It seems almost quasi-random.
>
>The code is written in Fortran, compiled with Intel's ifort using standard
>code optimisation (compile flag -O2). However, the code was also compiled
>with optimisation turned off and traceback and error reporting turned on,
>and in both cases the programs failed and no run-time error was printed.
>The same code was also compiled with gfortran and also produced error
>'137'.
>
>The code has run successfully numerous times, but does something slightly
>different each time due to random sampling and different model
>specifications.
>There are 20 jobs because the analyses are run across 20
>replicates of a simulation. Previously our user had
>no problems running these 20 replicates across 11 different models
>(20x11=220 runs).
>
>Some specifics:
>
>Array job memory allocation is 20GB, and the job uses less than 14GB.
>
>Submitted through a shell script (qsub test.sh), where test.sh looks like:
>
>-------------------------------------------------------
>#$ -cwd
>#$ -l vf=20G
>#$ -N b1_set12_1
>#$ -m eas
>#$ -M [email protected]
>/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
>-------------------------------------------------------
>
>Intel's default is 'static compiling' from what we understand; in any
>case no external libraries are used (although Intel uses its own MKL
>library).
>
>We can't see any obvious memory starvation issues or resource contention
>problems. Do you have any suggestions for things we could look at to
>trap this? The error 137 material online, after looking around a little,
>seems sparse at best.
>
>Any help would be appreciated.
>
>--JC
>_______________________________________________
>users mailing list
>[email protected]
>https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
