Thanks Sathish. All other jobs are running fine across the cluster, so I don't think it is related to any pam module issue. I am investigating the issue further and will come back to you with more details.
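As a first step I am checking what Slurm actually recorded for the failed job, e.g.:

    sacct -j 1357498 --format=JobID,State,ExitCode,DerivedExitCode,Elapsed,NodeList

and wrapping the Gaussian call in the batch script so the batch environment can be diffed against an interactive shell on the same node (oled4). A minimal sketch only; the g16 call and the file names are placeholders for the real job, and the partition/CPU count are just taken from the job shown in the log below:

    #!/bin/bash
    #SBATCH --partition=normal
    #SBATCH --cpus-per-task=4

    # dump limits and environment so the batch context can be compared
    # with a login shell on the same node
    ulimit -a  > limits.${SLURM_JOB_ID}
    env | sort > env.${SLURM_JOB_ID}

    # placeholder for the real Gaussian invocation
    g16 < job1.com > job1.log
    echo "gaussian exited with $?" >> job1.log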
Regards
Navin

On Mon, Jun 8, 2020, 19:24 sathish <sathish.sathishku...@gmail.com> wrote:

> Hi Navin,
>
> Was this working earlier, or is this the first time you are trying it?
> Are you using a pam module? If yes, try disabling the pam module and see
> if it works.
>
> Thanks
> Sathish
>
> On Thu, Jun 4, 2020 at 10:47 PM navin srivastava <navin.alt...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I am seeing a weird issue in my environment.
>> One of the Gaussian jobs is failing under Slurm within a minute after it
>> goes for execution, without writing anything, and I am unable to figure
>> out the reason.
>> The same job works fine without Slurm on the same node.
>>
>> slurmctld.log
>>
>> [2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normal)(21052)
>> [2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normalGPUsmall)(21052)
>> [2020-06-03T19:14:33.170] debug: Job 1357498 has more than one partition (normalGPUbig)(21052)
>> [2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:15:12.497] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:16:12.626] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:17:12.753] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:18:12.882] debug: sched: JobId=1357498. State=PENDING. Reason=Priority, Priority=21052. Partition=normal,normalGPUsmall,normalGPUbig.
>> [2020-06-03T19:19:13.633] sched: Allocate JobID=1357498 NodeList=oled4 #CPUs=4 Partition=normal
>> [2020-06-04T12:25:36.961] _job_complete: JobID=1357498 State=0x1 NodeCnt=1 WEXITSTATUS 2
>> [2020-06-04T12:25:36.961] SLURM Job_id=1357498 Name=job1 Ended, Run time 17:06:23, FAILED, ExitCode 2
>> [2020-06-04T12:25:36.962] _job_complete: JobID=1357498 State=0x8005 NodeCnt=1 done
>>
>> slurmd.log
>>
>> [2020-06-04T12:22:43.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 164642.840000(164537+105)
>> [2020-06-04T12:23:13.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 164762.820000(164657+105)
>> [2020-06-04T12:23:43.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 164882.810000(164777+105)
>> [2020-06-04T12:24:13.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 165002.790000(164897+105)
>> [2020-06-04T12:24:43.712] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 165122.770000(165016+105)
>> [2020-06-04T12:25:13.713] [1357498.batch] debug: jag_common_poll_data: Task average frequency = 2769 pid 64084 mem size 4625724 23696420 time 165242.750000(165136+105)
>> [2020-06-04T12:25:36.955] [1357498.batch] task 0 (64084) exited with exit code 2.
>> [2020-06-04T12:25:36.955] [1357498.batch] debug: task_p_post_term: affinity 1357498.4294967294, task 0
>> [2020-06-04T12:25:36.960] [1357498.batch] debug: step_terminate_monitor_stop signaling condition
>> [2020-06-04T12:25:36.960] [1357498.batch] job 1357498 completed with slurm_rc = 0, job_rc = 512
>> [2020-06-04T12:25:36.960] [1357498.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 512
>> [2020-06-04T12:25:36.961] [1357498.batch] debug: Message thread exited
>> [2020-06-04T12:25:36.962] [1357498.batch] done with job
>> [2020-06-04T12:25:36.962] debug: task_p_slurmd_release_resources: affinity jobid 1357498
>> [2020-06-04T12:25:36.962] debug: credential for job 1357498 revoked
>> [2020-06-04T12:25:36.963] debug: Waiting for job 1357498's prolog to complete
>> [2020-06-04T12:25:36.963] debug: Finished wait for job 1357498's prolog to complete
>> [2020-06-04T12:25:36.963] debug: [job 1357498] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
>> [2020-06-04T12:25:37.254] debug: completed epilog for jobid 1357498
>> [2020-06-04T12:25:37.254] debug: Job 1357498: sent epilog complete msg: rc = 0
>>
>> Any suggestion will be welcome to troubleshoot this issue further.
>>
>> Regards
>> Navin.
>
> --
> Regards.....
> Sathish
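On the pam side, before changing anything I am just confirming whether slurmd uses pam at all on that node; the pam_slurm_adopt name below is only an assumption about how the node might be set up, and the grep is there to find whatever module is actually configured:

    # is pam enabled for slurmd at all? (UsePAM defaults to 0)
    scontrol show config | grep -i usepam

    # which pam stacks on the compute node reference a slurm module
    # (e.g. pam_slurm or pam_slurm_adopt)?
    grep -ri pam_slurm /etc/pam.d/

If a Slurm pam module does show up there, commenting that line out (or setting UsePAM=0 in slurm.conf and restarting slurmd) would be the disable-and-retest step Sathish suggested.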