Hi,

On 14.01.2013 at 23:08, Jake Carroll wrote:
> So we tested out trying to hard-set wall-time differently for the specific
> user who's experiencing the Exit 137 issue. We noticed the jobs are still
> failing, however.

Is there any message about the kill signal in the spooling directory's
messages file of the node, i.e.:

/opt/gridengine/default/spool/compute-0-4/messages

(search for the job id)

-- Reuti

> One job that was killed, which included the wall-time setting. Obviously the
> job did not run for 24h; anyway, input and outputs are shown below.
>
> --------
> - qsub b5_set112.sh
>
> - b5_set11_2.sh:
>
> #$ -cwd
> #$ -l h_rt=24:00:00
> #$ -l vf=20G
> #$ -N b5_set11_2
> #$ -m eas
> #$ -M someguy@somewhere
> /blah/blah/blah/bayesRsim <b5_set11_2.par
>
> - cat b5_set11_2.e1325823:
> /opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:
> 8117 Killed    /blah/blah/blag/bayesRsim < b5_set11_2.par
>
> - qacct -j 1325823
> ==============================================================
> qname         medium.q
> hostname      compute-0-4.local
> group         users
> owner         someguy
> project       NONE
> department    defaultdepartment
> jobname       b5_set11_2
> jobnumber     1325823
> taskid        undefined
> account       sge
> priority      0
> qsub_time     Mon Jan 14 15:36:49 2013
> start_time    Mon Jan 14 15:36:55 2013
> end_time      Mon Jan 14 18:11:56 2013
> granted_pe    NONE
> slots         1
> failed        0
> exit_status   137
> ru_wallclock  9301
> ru_utime      9262.906
> ru_stime      7.916
> ru_maxrss     13820636
> ru_ixrss      0
> ru_ismrss     0
> ru_idrss      0
> ru_isrss      0
> ru_minflt     46056
> ru_majflt     26
> ru_nswap      0
> ru_inblock    392840
> ru_oublock    32
> ru_msgsnd     0
> ru_msgrcv     0
> ru_nsignals   0
> ru_nvcsw      536
> ru_nivcsw     30791
> cpu           9270.822
> mem           61688.906
> io            0.430
> iow           0.000
> maxvmem       13.302G
> arid          undefined
>
> So, you mentioned "default time limit of your shell". My googling
> suggested trying to set a wall-time limit, or having the user specify the
> wall time, but that did not help.
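As a side note on reading the qacct output above: an exit_status above 128 means the process died on a signal, namely (status - 128). A minimal sketch of decoding it, with the spool path and job id taken from this thread:

```shell
# An exit status above 128 means "killed by signal (status - 128)".
status=137
sig=$((status - 128))   # 9
kill -l "$sig"          # prints the signal name: KILL

# Per Reuti's suggestion, the execd spool messages file on the node may
# record why the kill happened (path and job id are the ones from this
# thread; run this on the compute node itself):
# grep 1325823 /opt/gridengine/default/spool/compute-0-4/messages
```

So exit 137 is SIGKILL, which is consistent with an enforced hard limit (or the kernel's OOM killer) rather than a crash inside the program.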
> A few google searches show the use of a global time limit for jobs in
> general, but make no reference to a default time limit of the shell. Am I
> supposed to be looking at limits such as s_rt and h_rt? If so, how do I
> manipulate these for the specific user? The queue_conf man page makes some
> reference to this, but it doesn't explain explicitly how to manipulate it
> globally or on a per-user basis, making reference to defaults or "shell".
>
> Sorry - just stumbling through this and not finding it too intuitive.
>
> --JC
>
> On 14/01/13 10:34 AM, "Ron Chen" <[email protected]> wrote:
>
>> Exit code 137 = process was killed because it exceeded the time limit,
>> and Google is your best friend if you have similar issues - and the
>> solution is to check the default time limit of your shell.
>>
>> -Ron
>>
>> ************************************************************************
>>
>> Open Grid Scheduler - the official open source Grid Engine:
>> http://gridscheduler.sourceforge.net/
>>
>> ________________________________
>> From: Jake Carroll <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Sunday, January 13, 2013 6:56 PM
>> Subject: [gridengine users] Error 137 - trying to figure out what it
>> means.
>>
>> Hi all.
>>
>> We're trying to figure out the answer to a problem that is escaping us.
>> We can usually self-solve most of these issues, but this one we're
>> having problems trapping, and we can't find any solid answers after a lot
>> of looking around at online resources.
>>
>> One of our quite capable users [read: he rarely needs our help with grid
>> engine] has an unusual issue with certain jobs (seemingly, randomly?)
>> crashing out on error 137. The code is predominantly C++ based, running
>> atop SGE 6.2u5 on the ROCKS cluster platform.
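On the s_rt/h_rt question above: these limits live in the queue configuration (qconf -sq), and default request options can be supplied cluster-wide via $SGE_ROOT/$SGE_CELL/common/sge_request or per user via a ~/.sge_request file (see sge_request(5)). A hedged sketch, with the queue name taken from the qacct output in this thread; the /tmp path is used only so the example is runnable outside a real cluster:

```shell
# Inspect the queue's current run-time limits (read-only, requires SGE):
# qconf -sq medium.q | grep -E '^(s_rt|h_rt)'

# Default request options: cluster-wide in
# $SGE_ROOT/$SGE_CELL/common/sge_request, per-user in ~/.sge_request.
# Each line holds qsub options applied to every job the user submits.
# Written to /tmp here instead of a real home directory:
cat > /tmp/sge_request.example <<'EOF'
-l s_rt=23:55:00
-l h_rt=24:00:00
EOF
cat /tmp/sge_request.example
```

With s_rt the job first gets SIGUSR1 (a warning it can trap), while exceeding h_rt gets SIGKILL, which would match the exit 137 seen here.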
>> What is making it hard for us is that these array-based jobs
>> (no PEs/parallel environments and no mpi/mpich explicitly in use) only
>> crash sometimes. Some, and not others. It seems almost quasi-random.
>>
>> The code is written in Fortran compiled with Intel's ifort, using standard
>> code optimisation (compile flag -O2). However, the code was also compiled
>> with optimisation turned off and traceback and error reporting turned on,
>> and in both cases the programs failed and no run-time error was printed.
>> The same code was also compiled with gfortran and also produced error
>> '137'.
>>
>> The code has run successfully numerous times, but is doing something
>> slightly different each time due to random sampling and different model
>> specifications. There are 20 jobs because analyses are run across 20
>> replicates of a simulation. Previously our user had no problems running
>> these 20 replicates across 11 different models (20x11=220 runs).
>>
>> Some specifics:
>>
>> Array job: memory allocation is 20GB, and the job uses less than 14GB.
>>
>> Submitted through a shell script, qsub test.sh, where test.sh looks like:
>>
>> -------------------------------------------------------
>> #$ -cwd
>> #$ -l vf=20G
>> #$ -N b1_set12_1
>> #$ -m eas
>> #$ -M [email protected]
>> /path/to/some/stuff/here/bayesRsim <b1_set12_1.par
>> -------------------------------------------------------
>>
>> Intel's default is 'static compiling' from what we understand; in any case
>> no external libraries are used (although Intel uses its own MKL library).
>>
>> We can't see any obvious memory starvation issues or resource contention
>> problems. Do you have any suggestions on things we could look at to trap
>> this? The error 137 material online, after looking around a little, seems
>> sparse at best.
>>
>> Any help would be appreciated.
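One more angle on the memory side of the report above: vf (virtual_free) is by default a consumable used for scheduling, not an enforced hard limit, so a SIGKILL can also come from an h_vmem limit or the kernel OOM killer rather than a time limit. A quick, hedged diagnostic worth adding to the failing job script is to print the limits the job's shell actually sees:

```shell
# Print resource limits as the job's shell sees them. An h_vmem request
# shows up as a finite "virtual memory" value; h_rt/s_rt do NOT appear
# here, since they are enforced by the execd rather than via ulimit.
ulimit -v     # virtual memory limit in KB, or "unlimited"
ulimit -t     # CPU time limit in seconds, or "unlimited"
ulimit -a     # the full list, for the job's stdout log
```

If `ulimit -v` comes back finite and below the job's ~14GB working set, that alone would explain a Killed message with exit 137.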
>>
>> --JC
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
