Hi. So we tried hard-setting the wall-time limit for the specific user who is experiencing the exit 137 issue. We noticed the jobs are still failing, however.
One job that was killed included the wall-time setting. Obviously the job did not run for 24h anyway; input and outputs are shown below.

--------
qsub b5_set11_2.sh

b5_set11_2.sh:

#$ -cwd
#$ -l h_rt=24:00:00
#$ -l vf=20G
#$ -N b5_set11_2
#$ -m eas
#$ -M someguy@somewhere
/blah/blah/blah/bayesRsim <b5_set11_2.par

cat b5_set11_2.e1325823:

/opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:  8117 Killed    /blah/blah/blag/bayesRsim < b5_set11_2.par

qacct -j 1325823
==============================================================
qname        medium.q
hostname     compute-0-4.local
group        users
owner        someguy
project      NONE
department   defaultdepartment
jobname      b5_set11_2
jobnumber    1325823
taskid       undefined
account      sge
priority     0
qsub_time    Mon Jan 14 15:36:49 2013
start_time   Mon Jan 14 15:36:55 2013
end_time     Mon Jan 14 18:11:56 2013
granted_pe   NONE
slots        1
failed       0
exit_status  137
ru_wallclock 9301
ru_utime     9262.906
ru_stime     7.916
ru_maxrss    13820636
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    46056
ru_majflt    26
ru_nswap     0
ru_inblock   392840
ru_oublock   32
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     536
ru_nivcsw    30791
cpu          9270.822
mem          61688.906
io           0.430
iow          0.000
maxvmem      13.302G
arid         undefined
--------

So, you mentioned the "default time limit of your shell". My googling suggested setting a wall-time limit, or having the user specify the wall time explicitly, but that did not help. A few searches turn up a global time limit for jobs in general, but make no reference to a default time limit of the shell.

Am I supposed to be looking at limits such as s_rt and h_rt? If so, how do I manipulate these for the specific user? The queue_conf man page makes some reference to them, but it doesn't explain explicitly how to set them globally or on a per-user basis, or how they relate to defaults or the "shell". Sorry - just stumbling through this and not finding it terribly intuitive.
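For what it's worth, one thing we did confirm while digging: exit status 137 is just the shell's encoding of 128 + 9, i.e. the process was terminated by SIGKILL and never got a chance to print a runtime error or traceback - which would explain why the ifort traceback builds stayed silent. A quick sketch of what we checked (the queue name and the ~/.sge_request idea are our reading of the sge_request(5) man page, not something we've verified fixes this):

```shell
# Exit status 137 = 128 + 9 (SIGKILL). Demonstrate the encoding by
# having a subshell SIGKILL itself, then printing the parent's $?.
sh -c 'kill -KILL $$' 2>/dev/null
echo "exit status: $?"   # prints: exit status: 137

# On a compute node, check whether the shell itself imposes a CPU-time
# limit (prints a number of seconds, or "unlimited"):
ulimit -t

# The queue's own run-time limits can be inspected with, e.g.:
#   qconf -sq medium.q | grep _rt
# and, per sge_request(5), a per-user default request can apparently be
# placed in that user's ~/.sge_request file, e.g. the single line:
#   -l h_rt=24:00:00
```

The 137 = 128 + SIGKILL part at least rules out an ordinary crash in the code; something external (a limit enforcer or the kernel) is delivering the kill.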
--JC

On 14/01/13 10:34 AM, "Ron Chen" <[email protected]> wrote:

>Exit code 137 = process was killed because it exceeded the time limit,
>and Google is your best friend if you have similar issues - and the
>solution is to check the default time limit of your shell.
>
> -Ron
>
>************************************************************************
>
>Open Grid Scheduler - the official open source Grid Engine:
>http://gridscheduler.sourceforge.net/
>
>________________________________
>From: Jake Carroll <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Sunday, January 13, 2013 6:56 PM
>Subject: [gridengine users] Error 137 - trying to figure out what it
>means.
>
>Hi all.
>
>We're trying to figure out the answer to a problem that is escaping us.
>We can usually solve most of these issues ourselves, but this one we're
>having trouble trapping, and we can't find any solid answers after a lot
>of looking around at online resources.
>
>One of our quite capable users [read: he rarely needs our help with grid
>engine] has an unusual issue with certain jobs (seemingly, randomly?)
>crashing out with error 137. The code is predominantly C++ based, running
>atop SGE 6.2u5 on the ROCKS cluster platform. What is making it hard for
>us is that these array-based jobs (no PEs/parallel environments and no
>mpi/mpich explicitly in use) are only crashing sometimes. Some, and not
>others. It seems almost quasi-random.
>
>The code is written in Fortran, compiled with Intel's ifort using standard
>code optimisation (compile flag -O2). However, the code was also compiled
>with optimisation turned off and traceback and error reporting turned on,
>and in both cases the programs failed and no run-time error was printed.
>The same code was also compiled with gfortran and also produced error
>'137'.
>
>The code has run successfully numerous times, but does something slightly
>different each time due to random sampling and different model
>specifications.
>There are 20 jobs because the analyses are run across 20
>replicates of a simulation. Previously our user had
>no problems running these 20 replicates across 11 different models
>(20x11=220 runs).
>
>Some specifics:
>
>Array job memory allocation is 20GB, and the job uses less than 14GB.
>
>Submitted through a shell script (qsub test.sh), where test.sh looks like:
>
>-------------------------------------------------------
>#$ -cwd
>#$ -l vf=20G
>#$ -N b1_set12_1
>#$ -m eas
>#$ -M [email protected]
>/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
>-------------------------------------------------------
>
>Intel's default is 'static compiling' from what we understand; in any
>case no external libraries are used (although Intel uses its own MKL
>library).
>
>We can't see any obvious memory starvation issues or resource contention
>problems. Do you have any suggestions for things we could look at to
>trap this? The error 137 material online, after looking around a little,
>seems sparse at best.
>
>Any help would be appreciated.
>
>--JC
>_______________________________________________
>users mailing list
>[email protected]
>https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
