Hi all,

A number of my jobs keep dying, and I'm having trouble tracking down
what's going on. Any tips or help would be greatly appreciated.

The job is a Perl script that launches a binary (called moses) using
Perl's system() call. The end of the log file is below. I know that
the Perl script prints the last two lines (starting with "Exit code:
137"), but I can't figure out what prints the first line (starting
with "sh: line 1: 29188 Killed"). I know that it's not the Perl
script, and I'm reasonably sure that it's not the moses binary.

I suspect that the grid engine may be killing the job, but I don't
know how to test that hypothesis. Here's the log:

sh: line 1: 29188 Killed
/free/lane/slm-merging-trunk/moses-cmd/src/moses -config
/scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini
-inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286 -d 0.053571
0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714
0.035714 0.035714 0.035714 0.035714 -n-best-list run1.best100.out 100
-input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
> run1.out
Exit code: 137
The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
-d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
0.035714 0.035714 0.035714 0.035714 0.035714


My understanding is that exit code 137 means the process was killed
by signal 9 (SIGKILL), since 137 = 128 + 9.
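
The reading of 137 as signal 9 rests on the usual shell convention that
a status above 128 means "terminated by signal (status - 128)". A
minimal Python sketch of that decoding follows; the function name
decode_shell_status is made up for illustration, and it assumes the
status was reported by a POSIX-style shell.

```python
import signal


def decode_shell_status(status):
    """Split a shell-style exit status into (exit_code, signal_name).

    Assumes the POSIX shell convention: a status above 128 means the
    command was terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known to this platform's enum.
            name = "signal %d" % signum
        return None, name
    return status, None


print(decode_shell_status(137))
```

On Linux this should report SIGKILL for 137, which a process cannot
catch or ignore, so the decoder itself would not get a chance to log
anything.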


For what it's worth, the results of running qacct -j on the job after
it died are listed below.

==============================================================
qname        all.q
hostname     quad19.scream.lab
group        scream
owner        lane
project      NONE
department   defaultdepartment
jobname      de-en.mert
jobnumber    20337
taskid       undefined
account      sge
priority     0
qsub_time    Mon Feb 13 14:08:54 2012
start_time   Mon Feb 13 14:09:05 2012
end_time     Wed Feb 15 14:54:52 2012
granted_pe   NONE
slots        1
failed       0
exit_status  2
ru_wallclock 175547
ru_utime     175460.360
ru_stime     21.147
ru_maxrss    23910412
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    6545996
ru_majflt    7568
ru_nswap     0
ru_inblock   3067192
ru_oublock   22064
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     9545
ru_nivcsw    256918
cpu          175481.507
mem          2516411.448
io           4.733
iow          0.000
maxvmem      25.026G
arid         undefined


I'm running under OGS GE2011.11. A colleague suggested that there may
be some sort of configuration where the grid engine is killing the
jobs after 48 hours or so. I know that I've successfully run jobs
longer than that under my old SGE setup, but not yet under the new OGS
setup.
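
To check how close the job actually came to 48 hours, the start_time
and end_time from the qacct output above can be subtracted. This is
only a small sketch; the helper names are made up, and the time format
string is an assumption based on how qacct printed the timestamps here
(it also assumes an English/C locale for the day and month names).

```python
from datetime import datetime

# Assumed layout of the qacct timestamps, e.g. "Mon Feb 13 14:09:05 2012".
QACCT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def elapsed_seconds(start_time, end_time):
    """Return the whole seconds between two qacct-style timestamps."""
    start = datetime.strptime(start_time, QACCT_TIME_FORMAT)
    end = datetime.strptime(end_time, QACCT_TIME_FORMAT)
    return int((end - start).total_seconds())


def as_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)


wallclock = elapsed_seconds("Mon Feb 13 14:09:05 2012",
                            "Wed Feb 15 14:54:52 2012")
print(wallclock, as_hms(wallclock))  # 175547 48:45:47
```

The result matches ru_wallclock (175547 seconds), so the job ran for
48 hours and about 46 minutes. That is near a 48-hour mark but not
exactly on it.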

As far as I can tell, all of my hard and soft limits are set to INFINITY.
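
One way to double-check the limits is to scan the text printed by
"qconf -sq all.q" for any time or memory limit that is not INFINITY.
The sketch below is only illustrative: the function name is made up,
the list of attributes is a subset of the queue limits, and
SAMPLE_CONFIG is a hypothetical fragment, not output from this
cluster.

```python
# Queue limit attributes to inspect (a subset chosen for this sketch).
LIMIT_KEYS = ("s_rt", "h_rt", "s_cpu", "h_cpu", "s_vmem", "h_vmem")

# Hypothetical example text, NOT taken from the real cluster.
SAMPLE_CONFIG = """\
qname                 all.q
s_rt                  INFINITY
h_rt                  48:00:00
s_cpu                 INFINITY
h_cpu                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
"""


def finite_limits(queue_config_text):
    """Return {attribute: value} for listed limits that are not INFINITY."""
    found = {}
    for line in queue_config_text.splitlines():
        parts = line.split(None, 1)  # attribute name, then its value
        if len(parts) == 2 and parts[0] in LIMIT_KEYS:
            value = parts[1].strip()
            if value.upper() != "INFINITY":
                found[parts[0]] = value
    return found


print(finite_limits(SAMPLE_CONFIG))
```

An empty result for the real queue configuration would support the
claim that the queue limits are all INFINITY. It would not rule out
limits requested at submission time or set elsewhere.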

Thanks,
Lane