Re: [gridengine users] jobs randomly die

2019-05-17 Thread Hay, William
On Tue, 2019-05-14 at 10:03 -0400, Feng Zhang wrote:
> looks like your job used a lot of ram:
> 
> mem  7.463TBs
> io   70.435GB
> iow  0.000s
> maxvmem  532.004MB

Not really; 532MB isn't a lot of memory these days.  The mem figure is
in terabyte-seconds, which accumulate fairly quickly.  At 512MB you get
a TBs every 2000 seconds or so.  However, the fact that it is reporting
these numbers indicates that some sort of built-in memory limit was
enabled.  Grid Engine won't measure memory usage unless it has some sort
of limit to enforce.
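
The arithmetic above can be checked with a short Python sketch. It assumes binary units (1 TB = 1024 * 1024 MB) and assumes that the mem integral is taken over the reported cpu seconds; the function names are illustrative only and are not part of Grid Engine.

```python
# Sketch of the arithmetic behind the "mem" accounting figure.
# Assumption: binary units, 1 TB = 1024 * 1024 MB.
MB_PER_TB = 1024 * 1024


def seconds_per_tb_second(resident_mb):
    """Seconds a job holding `resident_mb` MB needs to accrue one TB-second."""
    return MB_PER_TB / resident_mb


def average_mb(mem_tb_seconds, cpu_seconds):
    """Average memory in MB implied by a mem integral over `cpu_seconds`."""
    return mem_tb_seconds * MB_PER_TB / cpu_seconds


if __name__ == "__main__":
    # 512 MB held constantly: one TB-second roughly every 2000 seconds.
    print(seconds_per_tb_second(512))
    # Values taken from the qacct output in this thread.
    print(average_mb(7.463, 19305.760))
```

With the figures from this job the implied average is a few hundred MB, which is in line with the reported maxvmem of 532.004MB rather than with heavy memory use.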

William
> 
> Do you use cgroups to limit the resources of jobs?
> 
> Best,
> 
> Feng
> 
> On Tue, May 14, 2019 at 9:53 AM hiller  wrote:
> > 
> > ~> qconf -srqs
> > No resource quota set found
> > 
> > 'dmesg -T' does not give an oom or other weird messages.
> > 
> > 'free -h' looks good and also looked good at 'kill time':
> > 
> > ~> free -h
> >               total        used        free      shared  buff/cache   available
> > Mem:           188G        1.0G        185G        2.6M        2.0G        186G
> > Swap:           49G          0B         49G
> > 
> > Full output of qacct:
> > ~>  qacct -j 635659
> > ==
> > qname        all.q
> > hostname     karun10
> > group        users
> > owner        calj
> > project      NONE
> > department   defaultdepartment
> > jobname      dsc_gdr2
> > jobnumber    635659
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Mon May 13 13:06:58 2019
> > start_time   Mon May 13 13:06:56 2019
> > end_time     Mon May 13 18:31:42 2019
> > granted_pe   make
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137                  (Killed)
> > ru_wallclock 19486s
> > ru_utime     0.048s
> > ru_stime     0.006s
> > ru_maxrss    11.566KB
> > ru_ixrss     0.000B
> > ru_ismrss    0.000B
> > ru_idrss     0.000B
> > ru_isrss     0.000B
> > ru_minflt    7885
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     142
> > ru_nivcsw    3
> > cpu          19305.760s
> > mem          7.463TBs
> > io           70.435GB
> > iow          0.000s
> > maxvmem      532.004MB
> > arid         undefined
> > ar_sub_time  undefined
> > category     -l hostname=karun10 -pe make 1
> > 
> > 
> > Thanks, ulrich
> > 
> > 
> > On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> > > It's a limit being reached, of some sort. Do you have a RQS of
> > > any kind (qconf -srqs)? We see this for job-requested, or system
> > > set RAM exhaustion (OOM killer, as mentioned 'dmesg -T' on
> > > compute nodes often useful), as well as time limits reached. What
> > > is the whole output from 'qacct -j JOBID'?
> > > 
> > > Cheers,
> > > -Hugh
> > > 
> > > -Original Message-
> > > From: users-boun...@gridengine.org 
> > > On Behalf Of hiller
> > > Sent: Tuesday, May 14, 2019 9:02 AM
> > > To: users@gridengine.org
> > > Subject: Re: [gridengine users] jobs randomly die
> > > 
> > > Hi,
> > > nope, there are no oom messages in the journal.
> > > Regards, ulrich
> > > 
> > > 
> > > On 5/14/19 12:49 PM, Arnau wrote:
> > > > Hi,
> > > > 
> > > > _maybe_ the OOM killer killed the job? A look at messages will
> > > > give you an answer (I've seen this in my cluster).
> > > > 
> > > > HTH,
> > > > Arnau
> > > > 
> > > > On Tue, 14 May 2019 at 12:37, hiller (<mailto:hil...@mpia-hd.mpg.de>) wrote:
> > > > 
> > > > Dear all,
> > > > i have a problem that jobs sent to gridengine randomly die.
> > > > The gridengine version is 8.1.9
> > > > The OS is opensuse 15.0
> > > > The gridengine messages file says:
> > > > 05/13/2019 18:31:45|worker|karun|E|master task of job
> > > > 635659.1 failed - killing job
> > > > 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on
> > > > host karun10 assumedly after job because: job 635659.1 died
> > > > through signal KILL (9)
> > > > 
> > > > 

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Daniel Povey
I have observed apparently random failures when users had GIDs inside the
range given by `gid_range` (see below; gid_range should lie outside the
range in which users have GIDs).
But usually this kind of thing would be due to OOM.

qconf -sconf | grep gid_range
gid_range                    5-51000
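
A quick way to test for such an overlap is sketched below in Python. It assumes gid_range is a comma-separated list of m-n ranges; the range and the GIDs in the example are hypothetical, and the function names are illustrative only.

```python
# Sketch: find user GIDs that fall inside Grid Engine's gid_range.
# Assumption: gid_range is a comma-separated list of "m-n" ranges.


def parse_gid_range(spec):
    """Turn a spec such as '20000-20100,30000-30010' into (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        low, high = part.strip().split("-")
        ranges.append((int(low), int(high)))
    return ranges


def conflicting_gids(user_gids, spec):
    """Return the sorted user GIDs that lie inside any of the ranges."""
    ranges = parse_gid_range(spec)
    return sorted(
        gid for gid in set(user_gids)
        if any(low <= gid <= high for low, high in ranges)
    )


if __name__ == "__main__":
    # Hypothetical values, not taken from the cluster in this thread.
    print(conflicting_gids([100, 20050, 30000], "20000-20100"))
```

An empty result means no user GID collides with the range. The real GIDs would have to come from the site's user database (for example the output of `getent group`).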


On Tue, May 14, 2019 at 10:42 AM Reuti  wrote:

> AFAICS the kill sent by SGE happens after a task has already returned with
> an error. In this case SGE would use the kill signal to be sure to kill all
> child processes. Hence the question would be: what was the initial command
> in the job script, and what output/error did it generate?
>
> -- Reuti
>
> > On 14.05.2019 at 11:36, hiller wrote:
> >
> > Dear all,
> > i have a problem that jobs sent to gridengine randomly die.
> > The gridengine version is 8.1.9
> > The OS is opensuse 15.0
> > The gridengine messages file says:
> > 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed -
> killing job
> > 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10
> assumedly after job because: job 635659.1 died through signal KILL (9)
> >
> > qacct -j 635659 says:
> > failed   100 : assumedly after job
> > exit_status  137  (Killed)
> >
> >
> > There was no kill triggered by the user. Also there are no other
> limitations, neither ulimit nor in the gridengine queue
> > The 'qconf -sq all.q' command gives:
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                INFINITY
> > h_vmem                INFINITY
> >
> > Years ago there were some threads about the same issue, but i did not
> find a solution.
> >
> > Does somebody have a hint what i can do or check/debug?
> >
> > With kind regards and many thanks for any help, ulrich
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] jobs randomly die

2019-05-14 Thread Reuti
AFAICS the kill sent by SGE happens after a task has already returned with an 
error. In this case SGE would use the kill signal to be sure to kill all child 
processes. Hence the question would be: what was the initial command in the 
job script, and what output/error did it generate?
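
For reference, the exit_status of 137 in the qacct output is consistent with the "signal KILL (9)" in the messages file under the usual shell convention of 128 plus the signal number. A minimal Python sketch of that decoding follows; the function name is illustrative only.

```python
import signal

# Sketch: decode an exit status under the shell convention
# "128 + signal number" for processes terminated by a signal.


def describe_exit_status(status):
    """Return a short description of a shell-style exit status."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform.
            name = "signal"
        return f"killed by {name} ({signum})"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

This only tells you which signal ended the process, not who sent it.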

-- Reuti

> On 14.05.2019 at 11:36, hiller wrote:
> 
> Dear all,
> i have a problem that jobs sent to gridengine randomly die.
> The gridengine version is 8.1.9
> The OS is opensuse 15.0
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - 
> killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 
> assumedly after job because: job 635659.1 died through signal KILL (9)
> 
> qacct -j 635659 says:
> failed   100 : assumedly after job
> exit_status  137  (Killed)
> 
> 
> There was no kill triggered by the user. Also there are no other limitations, 
> neither ulimit nor in the gridengine queue
> The 'qconf -sq all.q' command gives:
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> Years ago there were some threads about the same issue, but i did not find a 
> solution.
> 
> Does somebody have a hint what i can do or check/debug?
> 
> With kind regards and many thanks for any help, ulrich


Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
Huh. Yeah, nothing particularly useful there (was hoping for the submit_cmd ... 
but maybe that's just UGE?). What's in the job script (options), and how 
exactly was it submitted (command)? And do you have any default limits in the 
$SGE_ROOT/$SGE_CELL/common/sge_request file?
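
One way to look for default limits in that file is sketched below in Python. It assumes the file holds qsub-style options with `#` comments and that limits appear as `-l name=value[,name=value]` requests; the sample contents are hypothetical and the function name is illustrative only.

```python
import shlex

# Sketch: list the resources requested via "-l" in sge_request-style text.
# Assumption: the text contains qsub-style options and "#" comments.


def requested_resources(sge_request_text):
    """Return every name=value item that follows a -l option."""
    found = []
    for line in sge_request_text.splitlines():
        tokens = shlex.split(line, comments=True)
        for i, tok in enumerate(tokens[:-1]):
            if tok == "-l":
                found.extend(tokens[i + 1].split(","))
    return found


if __name__ == "__main__":
    # Hypothetical file contents, not taken from the cluster in this thread.
    sample = "# site defaults\n-l h_vmem=4G,h_rt=24:00:00\n-cwd\n"
    print(requested_resources(sample))
```

Any h_vmem, h_rt or similar entry found this way would apply to every job that does not override it.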

-Hugh

-Original Message-
From: users-boun...@gridengine.org  On Behalf Of 
hiller
Sent: Tuesday, May 14, 2019 9:52 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die

~> qconf -srqs
No resource quota set found

'dmesg -T' does not give an oom or other weird messages. 

'free -h' looks good and also looked good at 'kill time':

~> free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        1.0G        185G        2.6M        2.0G        186G
Swap:           49G          0B         49G

Full output of qacct:
~>  qacct -j 635659
==
qname        all.q
hostname     karun10
group        users
owner        calj
project      NONE
department   defaultdepartment
jobname      dsc_gdr2
jobnumber    635659
taskid       undefined
account      sge
priority     0
qsub_time    Mon May 13 13:06:58 2019
start_time   Mon May 13 13:06:56 2019
end_time     Mon May 13 18:31:42 2019
granted_pe   make
slots        1
failed       100 : assumedly after job
exit_status  137                  (Killed)
ru_wallclock 19486s
ru_utime     0.048s
ru_stime     0.006s
ru_maxrss    11.566KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    7885
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   8
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     142
ru_nivcsw    3
cpu          19305.760s
mem          7.463TBs
io           70.435GB
iow          0.000s
maxvmem      532.004MB
arid         undefined
ar_sub_time  undefined
category     -l hostname=karun10 -pe make 1


Thanks, ulrich


On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> It's a limit being reached, of some sort. Do you have a RQS of any kind 
> (qconf -srqs)? We see this for job-requested, or system set RAM exhaustion 
> (OOM killer, as mentioned 'dmesg -T' on compute nodes often useful), as well 
> as time limits reached. What is the whole output from 'qacct -j JOBID'?
> 
> Cheers,
> -Hugh
> 
> -Original Message-
> From: users-boun...@gridengine.org  On Behalf 
> Of hiller
> Sent: Tuesday, May 14, 2019 9:02 AM
> To: users@gridengine.org
> Subject: Re: [gridengine users] jobs randomly die
> 
> Hi,
> nope, there are no oom messages in the journal.
> Regards, ulrich
> 
> 
> On 5/14/19 12:49 PM, Arnau wrote:
>> Hi,
>>
>> _maybe_ the OOM killer killed the job? A look at messages will give you an 
>> answer (I've seen this in my cluster).
>>
>> HTH,
>> Arnau
>>
>> On Tue, 14 May 2019 at 12:37, hiller (<mailto:hil...@mpia-hd.mpg.de>) wrote:
>>
>> Dear all,
>> i have a problem that jobs sent to gridengine randomly die.
>> The gridengine version is 8.1.9
>> The OS is opensuse 15.0
>> The gridengine messages file says:
>> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - 
>> killing job
>> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 
>> assumedly after job because: job 635659.1 died through signal KILL (9)
>>
>> qacct -j 635659 says:
>> failed       100 : assumedly after job
>> exit_status  137                  (Killed)
>>
>>
>> There was no kill triggered by the user. Also there are no other 
>> limitations, neither ulimit nor in the gridengine queue
>> The 'qconf -sq all.q' command gives:
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>>

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Feng Zhang
looks like your job used a lot of ram:

mem  7.463TBs
io   70.435GB
iow  0.000s
maxvmem  532.004MB

Do you use cgroups to limit the resources of jobs?

Best,

Feng

On Tue, May 14, 2019 at 9:53 AM hiller  wrote:
>
> ~> qconf -srqs
> No resource quota set found
>
> 'dmesg -T' does not give an oom or other weird messages.
>
> 'free -h' looks good and also looked good at 'kill time':
>
> ~> free -h
>               total        used        free      shared  buff/cache   available
> Mem:           188G        1.0G        185G        2.6M        2.0G        186G
> Swap:           49G          0B         49G
>
> Full output of qacct:
> ~>  qacct -j 635659
> ==
> qname        all.q
> hostname     karun10
> group        users
> owner        calj
> project      NONE
> department   defaultdepartment
> jobname      dsc_gdr2
> jobnumber    635659
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon May 13 13:06:58 2019
> start_time   Mon May 13 13:06:56 2019
> end_time     Mon May 13 18:31:42 2019
> granted_pe   make
> slots        1
> failed       100 : assumedly after job
> exit_status  137                  (Killed)
> ru_wallclock 19486s
> ru_utime     0.048s
> ru_stime     0.006s
> ru_maxrss    11.566KB
> ru_ixrss     0.000B
> ru_ismrss    0.000B
> ru_idrss     0.000B
> ru_isrss     0.000B
> ru_minflt    7885
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     142
> ru_nivcsw    3
> cpu          19305.760s
> mem          7.463TBs
> io           70.435GB
> iow          0.000s
> maxvmem      532.004MB
> arid         undefined
> ar_sub_time  undefined
> category     -l hostname=karun10 -pe make 1
>
>
> Thanks, ulrich
>
>
> On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> > It's a limit being reached, of some sort. Do you have a RQS of any kind 
> > (qconf -srqs)? We see this for job-requested, or system set RAM exhaustion 
> > (OOM killer, as mentioned 'dmesg -T' on compute nodes often useful), as 
> > well as time limits reached. What is the whole output from 'qacct -j JOBID'?
> >
> > Cheers,
> > -Hugh
> >
> > -Original Message-
> > From: users-boun...@gridengine.org  On Behalf 
> > Of hiller
> > Sent: Tuesday, May 14, 2019 9:02 AM
> > To: users@gridengine.org
> > Subject: Re: [gridengine users] jobs randomly die
> >
> > Hi,
> > nope, there are no oom messages in the journal.
> > Regards, ulrich
> >
> >
> > On 5/14/19 12:49 PM, Arnau wrote:
> >> Hi,
> >>
> > >> _maybe_ the OOM killer killed the job? A look at messages will give you 
> > >> an answer (I've seen this in my cluster).
> > >>
> > >> HTH,
> > >> Arnau
> > >>
> > >> On Tue, 14 May 2019 at 12:37, hiller (<mailto:hil...@mpia-hd.mpg.de>) wrote:
> >>
> >> Dear all,
> >> i have a problem that jobs sent to gridengine randomly die.
> >> The gridengine version is 8.1.9
> >> The OS is opensuse 15.0
> >> The gridengine messages file says:
> >> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed 
> >> - killing job
> >> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 
> >> assumedly after job because: job 635659.1 died through signal KILL (9)
> >>
> >> qacct -j 635659 says:
> >> failed   100 : assumedly after job
> >> exit_status  137  (Killed)
> >>
> >>
> > >> There was no kill triggered by the user. Also there are no other 
> > >> limitations, neither ulimit nor in the gridengine queue
> > >> The 'qconf -sq all.q' command gives:
> > >> s_rt                  INFINITY
> > >> h_rt                  INFINITY
> > >> s_cpu                 INFINITY
> > >> h_cpu                 INFINITY
> > >> s_fsize               INFINITY
> > >> h_fsize               INFINITY
> > >> s_data                INFINITY
> > >> h_data                INFINITY
> > >> s_stack               INFINITY
> > >> h_stack               INFINITY
> > >> s_core                INFINITY
> > >> h_core                INFINITY
> > >> s_rss                 INFINITY
> > >> h_rss                 INFINITY
> > >> s_vmem                INFINITY
> >> h_vm

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
It's a limit being reached, of some sort. Do you have a RQS of any kind (qconf 
-srqs)? We see this for job-requested, or system set RAM exhaustion (OOM 
killer, as mentioned 'dmesg -T' on compute nodes often useful), as well as time 
limits reached. What is the whole output from 'qacct -j JOBID'?
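
When reading that output, the two fields that matter most here are failed and exit_status. A minimal Python sketch for pulling them out of the text follows. It assumes the "name value" layout of `qacct -j`, and the function names are illustrative only.

```python
# Sketch: parse "name value" lines from `qacct -j` output and flag jobs
# that failed or were ended by a signal (exit status above 128).


def parse_qacct(text):
    """Map each field name to the rest of its line."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip separator lines and lines without a value.
        if len(parts) == 2 and not parts[0].startswith("="):
            record[parts[0]] = parts[1].strip()
    return record


def looks_killed(record):
    """True if 'failed' is non-zero or the exit status implies a signal."""
    failed = record.get("failed", "0").split()[0]
    status = record.get("exit_status", "0").split()[0]
    return failed != "0" or int(status) > 128


if __name__ == "__main__":
    sample = (
        "failed       100 : assumedly after job\n"
        "exit_status  137                  (Killed)\n"
        "maxvmem      532.004MB\n"
    )
    print(looks_killed(parse_qacct(sample)))
```

The sample lines are the ones reported for job 635659 in this thread.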

Cheers,
-Hugh

-Original Message-
From: users-boun...@gridengine.org  On Behalf Of 
hiller
Sent: Tuesday, May 14, 2019 9:02 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die

Hi,
nope, there are no oom messages in the journal.
Regards, ulrich


On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
> 
> _maybe_ the OOM killer killed the job? A look at messages will give you an 
> answer (I've seen this in my cluster).
> 
> HTH,
> Arnau
> 
> On Tue, 14 May 2019 at 12:37, hiller (<mailto:hil...@mpia-hd.mpg.de>) wrote:
> 
> Dear all,
> i have a problem that jobs sent to gridengine randomly die.
> The gridengine version is 8.1.9
> The OS is opensuse 15.0
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - 
> killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 
> assumedly after job because: job 635659.1 died through signal KILL (9)
> 
> qacct -j 635659 says:
> failed       100 : assumedly after job
> exit_status  137                  (Killed)
> 
> 
> There was no kill triggered by the user. Also there are no other 
> limitations, neither ulimit nor in the gridengine queue
> The 'qconf -sq all.q' command gives:
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> Years ago there were some threads about the same issue, but i did not 
> find a solution.
> 
> Does somebody have a hint what i can do or check/debug?
> 
> With kind regards and many thanks for any help, ulrich


Re: [gridengine users] jobs randomly die

2019-05-14 Thread hiller
Hi,
nope, there are no oom messages in the journal.
Regards, ulrich


On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
> 
> _maybe_ the OOM killer killed the job? A look at messages will give you an 
> answer (I've seen this in my cluster).
> 
> HTH,
> Arnau
> 
> On Tue, 14 May 2019 at 12:37, hiller wrote:
> 
> Dear all,
> i have a problem that jobs sent to gridengine randomly die.
> The gridengine version is 8.1.9
> The OS is opensuse 15.0
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - 
> killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 
> assumedly after job because: job 635659.1 died through signal KILL (9)
> 
> qacct -j 635659 says:
> failed       100 : assumedly after job
> exit_status  137                  (Killed)
> 
> 
> There was no kill triggered by the user. Also there are no other 
> limitations, neither ulimit nor in the gridengine queue
> The 'qconf -sq all.q' command gives:
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> Years ago there were some threads about the same issue, but i did not 
> find a solution.
> 
> Does somebody have a hint what i can do or check/debug?
> 
> With kind regards and many thanks for any help, ulrich