[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?
FYI, sending SIGHUP to slurmctld is sufficient for rotating the slurmctld.log file; there is no need to actually restart it all the way. It is good to know the cause behind the deleted jobs.

Doug

On Oct 11, 2016 7:36 AM, "Ryan Novosielski" wrote:
> Thanks for clearing that up. I was pretty sure there was no problem at all
> in using logrotate, and I know that restarting slurmctld does not
> ordinarily lose jobs.
>
> ... SNIP ...
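Doug's point can be sanity-checked without touching a live scheduler. Below is a minimal sketch using a stand-in "daemon" that reopens its log on SIGHUP, which is the same contract slurmctld honors; the `/tmp` log and pidfile paths are demo-only assumptions, not Slurm's.

```shell
# Stand-in daemon: a background loop that appends to its "log" when it
# receives SIGHUP. Paths are illustrative for the demo only.
log=/tmp/hupdemo.log
pidfile=/tmp/hupdemo.pid
: > "$log"
( trap 'echo "log reopened" >> /tmp/hupdemo.log' HUP
  while :; do sleep 0.1; done ) &
echo $! > "$pidfile"
sleep 0.3
# This is all a logrotate postrotate script would need to send:
kill -HUP "$(cat "$pidfile")"
sleep 0.5
cat "$log"
kill "$(cat "$pidfile")" 2>/dev/null
```

Against a real slurmctld the equivalent would be `kill -HUP "$(cat /var/run/slurmctld.pid)"`, using whatever pidfile path your slurm.conf's SlurmctldPidFile specifies.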
Thanks for clearing that up. I was pretty sure there was no problem at all in using logrotate, and I know that restarting slurmctld does not ordinarily lose jobs.

--
|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ      | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Oct 11, 2016, at 06:19, Philippe wrote:
>
> Hello all,
> sorry for this long delay since my first post.
> Thanks for all the answers, it helped me to make some tests, and after
> not so long, I realized I use a personal script to launch the daemons, and
> I was still using my "debug" start line, which contains the startclean
> argument ...
> So it's all my fault, Slurm did its job to start clean when logrotate
> triggered it.
>
> Sorry for that !
>
> ... SNIP ...
Hello all,

Sorry for this long delay since my first post. Thanks for all the answers; they helped me run some tests, and before long I realized I use a personal script to launch the daemons and was still using my "debug" start line, which contains the startclean argument ...

So it's all my fault: Slurm did its job and started clean when logrotate triggered it.

Sorry for that !

On Thu, Sep 29, 2016 at 2:05 PM, Janne Blomqvist wrote:
> On 2016-09-27 10:39, Philippe wrote:
> > If I can't use logrotate, what must I use ?
>
> You can log via syslog, and let your syslog daemon handle the rotation
> (and rate limiting, disk full, logging to a central log host and all the
> other nice things that syslog can do for you).
>
> --
> Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> Aalto University School of Science, PHYS & NBE
> +358503841576 || janne.blomqv...@aalto.fi
On 2016-09-27 10:39, Philippe wrote:
> If I can't use logrotate, what must I use ?

You can log via syslog, and let your syslog daemon handle the rotation (and rate limiting, disk full, logging to a central log host, and all the other nice things that syslog can do for you).

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
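For reference, switching slurmctld to syslog is a slurm.conf change. This is a sketch, assuming the behavior described in the slurm.conf man page that an unset SlurmctldLogFile sends output to syslog; verify on your own version (2.6.x in this thread):

```
# slurm.conf -- leave SlurmctldLogFile unset so slurmctld logs via syslog
#SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=info
```

With this in place, the syslog daemon owns rotation, and no postrotate signal or restart of slurmctld is needed at all.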
Christopher,

Yes, it does restart - but that's how we've configured logrotate.

John DeSantis

On 09/28/2016 07:55 PM, Christopher Samuel wrote:
> On 29/09/16 01:16, John DeSantis wrote:
>> We get the same snippet when our logrotate takes action against
>> the ctld log:
>
> Does your slurmctld restart then too?
On 29/09/16 01:16, John DeSantis wrote:
> We get the same snippet when our logrotate takes action against the
> ctld log:

Does your slurmctld restart then too?

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Christopher,

>> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM)
>> received
>
> So that's some external process sending one of those two signals
> to slurmctld, it's not something it's choosing to do at all. We've
> never seen this.

We get the same snippet when our logrotate takes action against the ctld log:

# cron
Sep 22 03:16:01 ctld run-parts(/etc/cron.daily)[29373]: starting logrotate
Sep 22 03:16:55 ctld run-parts(/etc/cron.daily)[29428]: finished logrotate

# ctld log
[2016-09-22T03:16:01.217] Terminate signal (SIGINT or SIGTERM) received

HTH,
John DeSantis

On 09/27/2016 07:38 PM, Christopher Samuel wrote:
> On 26/09/16 17:48, Philippe wrote:
>> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM)
>> received
>
> So that's some external process sending one of those two signals
> to slurmctld, it's not something it's choosing to do at all. We've
> never seen this.
>
> One other question - you've got the shutdown log from slurmctld and
> the start log of a slurmd - what happens when slurmctld starts up?
>
> That might be your clue about why your jobs are getting killed.
On 26/09/16 17:48, Philippe wrote:
> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received

So that's some external process sending one of those two signals to slurmctld; it's not something it's choosing to do at all. We've never seen this.

One other question - you've got the shutdown log from slurmctld and the start log of a slurmd - what happens when slurmctld starts up?

That might be your clue about why your jobs are getting killed.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
On 26/09/16 17:48, Philippe wrote:
> [2016-09-26T08:01:44.792] debug: slurmdbd: Issue with call
> DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to
> accounting yet)

Not related - but it looks like, whilst it's been told to talk to slurmdbd, you haven't added the cluster to slurmdbd with "sacctmgr" yet, so I suspect all your accounting info is getting lost.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
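Registering the cluster with slurmdbd is a one-time sacctmgr step, run on the controller host against a live slurmdbd. A sketch, using the cluster name "graph" that appears in the slurmctld log elsewhere in this thread (substitute your own ClusterName from slurm.conf):

```
sacctmgr add cluster graph
sacctmgr list cluster        # confirm the cluster now appears
```

Until this is done, the DBD_CLUSTER_CPUS errors above will keep appearing and accounting records for the cluster will not be stored.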
On 27/09/16 17:40, Philippe wrote:
> /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null

I think you want to check whether that's really restarting it or just doing an "scontrol reconfigure", which won't (shouldn't) restart it.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
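One quick way to tell the two apart: a genuine `scontrol reconfigure` leaves the daemon's PID alone, while a restart changes it. A sketch, assuming slurmctld is running locally and `pidof` is available:

```
before=$(pidof slurmctld)
/usr/sbin/invoke-rc.d slurm-llnl reconfig
sleep 2
after=$(pidof slurmctld)
if [ "$before" = "$after" ]; then
    echo "same PID ($before): it really was just a reconfigure"
else
    echo "PID changed from $before to $after: that was a restart"
fi
```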
Philippe,

> If I can't use logrotate, what must I use ? I just disabled it, I'm
> gonna see if the problem still persists.

You can use logrotate. I'd suggest using a much larger "size", though. For example, we don't rotate the logs until at least 2GB.

> * My SlurmctldTimeout is set to 300, in the doc it seems like it's
> in seconds, so 300 seconds for a timeout is way enough I hope !

You're correct, it's in seconds; 5 minutes should be more than enough for slurmctld to restart.

> * If there's typos, slurm doesn't warn for them at start or
> somewhere later in the logs ?

If there are typos and/or invalid syntax in slurm.conf, slurmctld will not start - at least that's what I've seen.

> Thanks for reply, if you have any other idea, don't hesitate to
> share !

What I would do is perform a restart of Slurm using the "postrotate" command below, but remove the "--quiet" and ">/dev/null", and prefix "time" to it, e.g.:

time /usr/sbin/invoke-rc.d slurm-llnl reconfig

This way you'll be able to verify how long a restart takes, and output will be dumped to the screen.

HTH,
John DeSantis

On 09/27/2016 03:38 AM, Philippe wrote:
> Hi John, thanks for the reply :)
>
> Yes, I've got logrotate enabled for my Slurm logs:
>
> /var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log
> /var/log/slurm/slurmdbd.log {
>     compress
>     missingok
>     nocopytruncate
>     nocreate
>     nodelaycompress
>     nomail
>     notifempty
>     noolddir
>     rotate 12
>     sharedscripts
>     size=5M
>     postrotate
>         /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null
>     endscript
> }
>
> If I can't use logrotate, what must I use ? I just disabled it; I'm
> going to see if the problem still persists.
>
> ... SNIP ...
Hi John, thanks for the reply :)

Yes, I've got logrotate enabled for my Slurm logs:

/var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log
/var/log/slurm/slurmdbd.log {
    compress
    missingok
    nocopytruncate
    nocreate
    nodelaycompress
    nomail
    notifempty
    noolddir
    rotate 12
    sharedscripts
    size=5M
    postrotate
        /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null
    endscript
}

If I can't use logrotate, what must I use? I just disabled it; I'm going to see if the problem still persists.

* My SlurmctldTimeout is set to 300; in the doc it seems like it's in seconds, so 300 seconds for a timeout is way enough, I hope!
* If there are typos, doesn't Slurm warn about them at start or somewhere later in the logs?
* No upgrade since the compilation.

Thanks for the reply; if you have any other idea, don't hesitate to share!

Regards,

On Mon, Sep 26, 2016 at 9:39 PM, John DeSantis wrote:
> Philippe,
>
> > But every 3 days, the slurmctld process restarts by itself, as you
> > can see in slurmctld.log :
>
> ... SNIP ...
>
> Do you have log rotation enabled that is stopping and restarting the
> ctld?
>
> As far as the lost jobs go, check your 'SlurmctldTimeout' and see if
> it's set too low. We've never lost any jobs due to:
>
> * ctld restarts
> * typos in slurm.conf (!!)
> * upgrades
>
> I've been especially guilty of typos, and FWIW SLURM has been
> extremely forgiving.
>
> HTH,
> John DeSantis
>
> ... SNIP ...
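Combining the advice in this thread (rotate at a much larger size; signal the daemon instead of restarting it), a revised stanza for the controller log might look like this. A sketch only: it assumes the pidfile path from the original configuration and that this slurmctld reopens its log on SIGHUP, as stated elsewhere in the thread; slurmd and slurmdbd would need analogous stanzas signaling their own pidfiles.

```
/var/log/slurm/slurmctld.log {
    size 2G
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # Reopen the log file in place; no daemon restart, so no job loss.
        /bin/kill -HUP "$(cat /var/run/slurmctld.pid)" 2>/dev/null || true
    endscript
}
```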
Philippe,

> But every 3 days, the slurmctld process restarts by itself, as you
> can see in slurmctld.log :

... SNIP ...

> No crontab set, anything, it just restarts itself. And the thing
> is, when I got jobs running for more than 3 days, they are killed
> by this restart (even if, normally, slurm is capable to resume
> jobs)

Do you have log rotation enabled that is stopping and restarting the ctld?

As far as the lost jobs go, check your 'SlurmctldTimeout' and see if it's set too low. We've never lost any jobs due to:

* ctld restarts
* typos in slurm.conf (!!)
* upgrades

I've been especially guilty of typos, and FWIW SLURM has been extremely forgiving.

HTH,
John DeSantis

On 09/26/2016 03:46 AM, Philippe wrote:
> Hello everybody,
>
> I'm trying to understand an issue with 2 SLURM installations on Ubuntu
> 14.04 64b, with Slurm 2.6.9 compiled. Only one computer is running
> slurmctld/slurmd/slurmdbd. It works very well for any job that doesn't
> last more than 2 days.
>
> But every 3 days, the slurmctld process restarts by itself, as you can
> see in slurmctld.log :
>
> [2016-09-26T08:01:42.682] debug2: Performing purge of old job records
> [2016-09-26T08:01:42.683] debug:  sched: Running job scheduler
> [2016-09-26T08:01:42.683] debug2: Performing full system state save
> [2016-09-26T08:01:44.003] debug:  slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2016-09-26T08:01:44.751] debug2: Sending cpu count of 22 for cluster
> [2016-09-26T08:01:44.792] debug:  slurmdbd: Issue with call DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to accounting yet)
> [2016-09-26T08:01:46.792] debug2: Testing job time limits and checkpoints
> [2016-09-26T08:01:47.005] debug3: Writing job id 840 to header record of job_state file
> [2016-09-26T08:02:03.003] debug:  slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received
> [2016-09-26T08:02:16.582] debug:  sched: slurmctld terminating
> [2016-09-26T08:02:16.583] debug3: _slurmctld_rpc_mgr shutting down
> [2016-09-26T08:02:16.798] Saving all slurm state
> [2016-09-26T08:02:16.801] debug3: Writing job id 840 to header record of job_state file
> [2016-09-26T08:02:16.881] debug3: _slurmctld_background shutting down
> [2016-09-26T08:02:16.952] slurmdbd: saved 1846 pending RPCs
> [2016-09-26T08:02:17.024] Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied
> [2016-09-26T08:02:16.582] killing old slurmctld[20343]
> [2016-09-26T08:02:17.028] Job accounting information stored, but details not gathered
> [2016-09-26T08:02:17.028] slurmctld version 2.6.9 started on cluster graph
> [2016-09-26T08:02:17.029] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
> [2016-09-26T08:02:17.052] Munge cryptographic signature plugin loaded
> [2016-09-26T08:02:17.052] debug3: Success.
> [2016-09-26T08:02:17.052] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
> [2016-09-26T08:02:17.060] debug:  init: Gres GPU plugin loaded
> [2016-09-26T08:02:17.060] debug3: Success.
>
> No crontab set, anything, it just restarts itself. And the thing is,
> when I got jobs running for more than 3 days, they are killed by this
> restart (even if, normally, Slurm is capable of resuming jobs) :
>
> [2016-09-26T08:02:01.282] [830] profile signalling type Task
> [2016-09-26T08:02:20.632] debug3: in the service_connection
> [2016-09-26T08:02:20.634] debug2: got this type of message 1001
> [2016-09-26T08:02:20.634] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2016-09-26T08:02:20.635] debug3: CPUs=22 Boards=1 Sockets=22 Cores=1 Threads=1 Memory=145060 TmpDisk=78723 Uptime=4210729
> [2016-09-26T08:02:20.636] debug4: found jobid = 830, stepid = 4294967294
> [2016-09-26T08:02:20.637] [830] Called _msg_socket_accept
> [2016-09-26T08:02:20.637] [830] Leaving _msg_socket_accept
> [2016-09-26T08:02:20.637] [830] eio: handling events for 1 objects
> [2016-09-26T08:02:20.637] [830] Called _msg_socket_readable
> [2016-09-26T08:02:20.637] [830] Entering _handle_accept (new thread)
> [2016-09-26T08:02:20.638] [830] Identity: uid=0, gid=0
> [2016-09-26T08:02:20.638] [830] Entering _handle_request
> [2016-09-26T08:02:20.639] [830] Got request 5
> [2016-09-26T08:02:20.639] [830] Handling REQUEST_STATE
> [2016-09-26T08:02:20.639] [830] Leaving _handle_request: SLURM_SUCCESS
> [2016-09-26T08:02:20.639] [830] Entering _handle_request
> [2016-09-26T08:02:20.639] debug:  found apparently running job 830
> [2016-09-26T08:02:20.640] [830] Leaving _handle_accept
> [2016-09-26T08:02:20.692] debug3: in the service_connection
> [2016-09-26T08:02:20.693] debug2: got this type of message 6013
> [2016-09-26T08:02:20.693] debug2: Processing RPC: REQUEST_ABORT_JOB
> [2016-09-26T08:02:20.694] debug:  _rpc_abort_job, uid = 64030
> [2016-09-26T08:02:20.694] debug:  task_slurmd_release_resources: 830
> [2016-09-2