[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?
FYI, sending SIGHUP to slurmctld is sufficient for rotating the slurmctld.log file; there is no need to actually restart it all the way. It is good to know the cause behind the deleted jobs.

Doug

On Oct 11, 2016 7:36 AM, "Ryan Novosielski" wrote:
> Thanks for clearing that up. I was pretty sure there was no problem at all
> in using logrotate, and I know that restarting slurmctld does not
> ordinarily lose jobs.
>
> ... SNIP ...
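Doug's point can be sanity-checked without touching a live scheduler. Below is a minimal sketch using a stand-in "daemon" that reopens its log on SIGHUP, which is the same contract slurmctld honors; the `/tmp` log and pidfile paths are demo-only assumptions, not Slurm's.

```shell
# Stand-in daemon: a background loop that appends to its "log" when it
# receives SIGHUP. Paths are illustrative for the demo only.
log=/tmp/hupdemo.log
pidfile=/tmp/hupdemo.pid
: > "$log"
( trap 'echo "log reopened" >> /tmp/hupdemo.log' HUP
  while :; do sleep 0.1; done ) &
echo $! > "$pidfile"
sleep 0.3
# This is all a logrotate postrotate script would need to send:
kill -HUP "$(cat "$pidfile")"
sleep 0.5
cat "$log"
kill "$(cat "$pidfile")" 2>/dev/null
```

Against a real slurmctld the equivalent would be `kill -HUP "$(cat /var/run/slurmctld.pid)"`, using whatever pidfile path your slurm.conf's SlurmctldPidFile specifies.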
Thanks for clearing that up. I was pretty sure there was no problem at all in using logrotate, and I know that restarting slurmctld does not ordinarily lose jobs.

--
|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ      | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Oct 11, 2016, at 06:19, Philippe wrote:
>
> Hello all,
> sorry for this long delay since my first post.
> Thanks for all the answers, it helped me to make some tests, and after
> not so long, I realized I use a personal script to launch the daemons, and
> I was still using my "debug" start line, which contains the startclean
> argument ...
> So it's all my fault, Slurm did its job to start clean when logrotate
> triggered it.
>
> Sorry for that !
>
> ... SNIP ...
Hello all,

Sorry for this long delay since my first post. Thanks for all the answers; they helped me run some tests, and before long I realized I use a personal script to launch the daemons and was still using my "debug" start line, which contains the startclean argument ...

So it's all my fault: Slurm did its job and started clean when logrotate triggered it.

Sorry for that !

On Thu, Sep 29, 2016 at 2:05 PM, Janne Blomqvist wrote:
> On 2016-09-27 10:39, Philippe wrote:
> > If I can't use logrotate, what must I use ?
>
> You can log via syslog, and let your syslog daemon handle the rotation
> (and rate limiting, disk full, logging to a central log host and all the
> other nice things that syslog can do for you).
>
> --
> Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> Aalto University School of Science, PHYS & NBE
> +358503841576 || janne.blomqv...@aalto.fi
On 2016-09-27 10:39, Philippe wrote:
> If I can't use logrotate, what must I use ?

You can log via syslog, and let your syslog daemon handle the rotation (and rate limiting, disk full, logging to a central log host, and all the other nice things that syslog can do for you).

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
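For reference, switching slurmctld to syslog is a slurm.conf change. This is a sketch, assuming the behavior described in the slurm.conf man page that an unset SlurmctldLogFile sends output to syslog; verify on your own version (2.6.x in this thread):

```
# slurm.conf -- leave SlurmctldLogFile unset so slurmctld logs via syslog
#SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=info
```

With this in place, the syslog daemon owns rotation, and no postrotate signal or restart of slurmctld is needed at all.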
Christopher,

Yes, it does restart - but that's how we've configured logrotate.

John DeSantis

On 09/28/2016 07:55 PM, Christopher Samuel wrote:
> On 29/09/16 01:16, John DeSantis wrote:
>> We get the same snippet when our logrotate takes action against
>> the ctld log:
>
> Does your slurmctld restart then too?
On 29/09/16 01:16, John DeSantis wrote:
> We get the same snippet when our logrotate takes action against the
> ctld log:

Does your slurmctld restart then too?

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
Christopher,

>> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM)
>> received
>
> So that's some external process sending one of those two signals
> to slurmctld, it's not something it's choosing to do at all. We've
> never seen this.

We get the same snippet when our logrotate takes action against the ctld log:

# cron
Sep 22 03:16:01 ctld run-parts(/etc/cron.daily)[29373]: starting logrotate
Sep 22 03:16:55 ctld run-parts(/etc/cron.daily)[29428]: finished logrotate

# ctld log
[2016-09-22T03:16:01.217] Terminate signal (SIGINT or SIGTERM) received

HTH,
John DeSantis

On 09/27/2016 07:38 PM, Christopher Samuel wrote:
> On 26/09/16 17:48, Philippe wrote:
>> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM)
>> received
>
> So that's some external process sending one of those two signals
> to slurmctld, it's not something it's choosing to do at all. We've
> never seen this.
>
> One other question - you've got the shutdown log from slurmctld and
> the start log of a slurmd - what happens when slurmctld starts up?
>
> That might be your clue about why your jobs are getting killed.
On 26/09/16 17:48, Philippe wrote:
> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received

So that's some external process sending one of those two signals to slurmctld; it's not something it's choosing to do at all. We've never seen this.

One other question - you've got the shutdown log from slurmctld and the start log of a slurmd - what happens when slurmctld starts up?

That might be your clue about why your jobs are getting killed.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
On 26/09/16 17:48, Philippe wrote:
> [2016-09-26T08:01:44.792] debug: slurmdbd: Issue with call
> DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to
> accounting yet)

Not related - but it looks like, whilst it's been told to talk to slurmdbd, you haven't added the cluster to slurmdbd with "sacctmgr" yet, so I suspect all your accounting info is getting lost.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
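Registering the cluster with slurmdbd is a one-time sacctmgr step, run on the controller host against a live slurmdbd. A sketch, using the cluster name "graph" that appears in the slurmctld log elsewhere in this thread (substitute your own ClusterName from slurm.conf):

```
sacctmgr add cluster graph
sacctmgr list cluster        # confirm the cluster now appears
```

Until this is done, the DBD_CLUSTER_CPUS errors above will keep appearing and accounting records for the cluster will not be stored.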
On 27/09/16 17:40, Philippe wrote:
> /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null

I think you want to check whether that's really restarting it or just doing an "scontrol reconfigure", which won't (shouldn't) restart it.

--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
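One quick way to tell the two apart: a genuine `scontrol reconfigure` leaves the daemon's PID alone, while a restart changes it. A sketch, assuming slurmctld is running locally and `pidof` is available:

```
before=$(pidof slurmctld)
/usr/sbin/invoke-rc.d slurm-llnl reconfig
sleep 2
after=$(pidof slurmctld)
if [ "$before" = "$after" ]; then
    echo "same PID ($before): it really was just a reconfigure"
else
    echo "PID changed from $before to $after: that was a restart"
fi
```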
Philippe,

> If I can't use logrotate, what must I use ? I just disabled it, I'm
> gonna see if the problem still persists.

You can use logrotate. I'd suggest using a much larger "size", though. For example, we don't rotate the logs until at least 2GB.

> * My SlurmctldTimeout is set to 300, in the doc it seems like it's
> in seconds, so 300 seconds for a timeout is way enough I hope !

You're correct, it's in seconds; 5 minutes should be more than enough for slurmctld to restart.

> * If there's typos, slurm doesn't warn for them at start or
> somewhere later in the logs ?

If there are typos and/or invalid syntax in slurm.conf, slurmctld will not start - at least that's what I've seen.

> Thanks for reply, if you have any other idea, don't hesitate to
> share !

What I would do is perform a restart of Slurm using the "postrotate" command below, but remove the "--quiet" and ">/dev/null", and prefix "time" to it, e.g.:

time /usr/sbin/invoke-rc.d slurm-llnl reconfig

This way you'll be able to verify how long a restart takes, and output will be dumped to the screen.

HTH,
John DeSantis

On 09/27/2016 03:38 AM, Philippe wrote:
> Hi John, thanks for the reply :)
>
> Yes, I've got logrotate enabled for my Slurm logs:
>
> /var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log
> /var/log/slurm/slurmdbd.log {
>     compress
>     missingok
>     nocopytruncate
>     nocreate
>     nodelaycompress
>     nomail
>     notifempty
>     noolddir
>     rotate 12
>     sharedscripts
>     size=5M
>     postrotate
>         /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null
>     endscript
> }
>
> If I can't use logrotate, what must I use ? I just disabled it; I'm
> going to see if the problem still persists.
>
> ... SNIP ...
Hi John, thanks for the reply :)

Yes, I've got logrotate enabled for my Slurm logs:

/var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log
/var/log/slurm/slurmdbd.log {
    compress
    missingok
    nocopytruncate
    nocreate
    nodelaycompress
    nomail
    notifempty
    noolddir
    rotate 12
    sharedscripts
    size=5M
    postrotate
        /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null
    endscript
}

If I can't use logrotate, what must I use? I just disabled it; I'm going to see if the problem still persists.

* My SlurmctldTimeout is set to 300; in the doc it seems like it's in seconds, so 300 seconds for a timeout is way enough, I hope!
* If there are typos, doesn't Slurm warn about them at start or somewhere later in the logs?
* No upgrade since the compilation.

Thanks for the reply; if you have any other idea, don't hesitate to share!

Regards,

On Mon, Sep 26, 2016 at 9:39 PM, John DeSantis wrote:
> Philippe,
>
> > But every 3 days, the slurmctld process restarts by itself, as you
> > can see in slurmctld.log :
>
> ... SNIP ...
>
> Do you have log rotation enabled that is stopping and restarting the
> ctld?
>
> As far as the lost jobs go, check your 'SlurmctldTimeout' and see if
> it's set too low. We've never lost any jobs due to:
>
> * ctld restarts
> * typos in slurm.conf (!!)
> * upgrades
>
> I've been especially guilty of typos, and FWIW SLURM has been
> extremely forgiving.
>
> HTH,
> John DeSantis
>
> ... SNIP ...
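Combining the advice in this thread (rotate at a much larger size; signal the daemon instead of restarting it), a revised stanza for the controller log might look like this. A sketch only: it assumes the pidfile path from the original configuration and that this slurmctld reopens its log on SIGHUP, as stated elsewhere in the thread; slurmd and slurmdbd would need analogous stanzas signaling their own pidfiles.

```
/var/log/slurm/slurmctld.log {
    size 2G
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # Reopen the log file in place; no daemon restart, so no job loss.
        /bin/kill -HUP "$(cat /var/run/slurmctld.pid)" 2>/dev/null || true
    endscript
}
```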
Philippe,

> But every 3 days, the slurmctld process restarts by itself, as you
> can see in slurmctld.log :

... SNIP ...

> No crontab set, anything, it just restarts itself. And the thing
> is, when I got jobs running for more than 3 days, they are killed
> by this restart (even if, normally, slurm is capable to resume
> jobs)

Do you have log rotation enabled that is stopping and restarting the ctld?

As far as the lost jobs go, check your 'SlurmctldTimeout' and see if it's set too low. We've never lost any jobs due to:

* ctld restarts
* typos in slurm.conf (!!)
* upgrades

I've been especially guilty of typos, and FWIW SLURM has been extremely forgiving.

HTH,
John DeSantis

On 09/26/2016 03:46 AM, Philippe wrote:
> Hello everybody,
>
> I'm trying to understand an issue with 2 SLURM installations on Ubuntu
> 14.04 64b, with Slurm 2.6.9 compiled. Only one computer is running
> slurmctld/slurmd/slurmdbd. It works very well for any job that doesn't
> last more than 2 days.
>
> But every 3 days, the slurmctld process restarts by itself, as you can
> see in slurmctld.log :
>
> [2016-09-26T08:01:42.682] debug2: Performing purge of old job records
> [2016-09-26T08:01:42.683] debug:  sched: Running job scheduler
> [2016-09-26T08:01:42.683] debug2: Performing full system state save
> [2016-09-26T08:01:44.003] debug:  slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2016-09-26T08:01:44.751] debug2: Sending cpu count of 22 for cluster
> [2016-09-26T08:01:44.792] debug:  slurmdbd: Issue with call DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to accounting yet)
> [2016-09-26T08:01:46.792] debug2: Testing job time limits and checkpoints
> [2016-09-26T08:01:47.005] debug3: Writing job id 840 to header record of job_state file
> [2016-09-26T08:02:03.003] debug:  slurmdbd: DBD_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received
> [2016-09-26T08:02:16.582] debug:  sched: slurmctld terminating
> [2016-09-26T08:02:16.583] debug3: _slurmctld_rpc_mgr shutting down
> [2016-09-26T08:02:16.798] Saving all slurm state
> [2016-09-26T08:02:16.801] debug3: Writing job id 840 to header record of job_state file
> [2016-09-26T08:02:16.881] debug3: _slurmctld_background shutting down
> [2016-09-26T08:02:16.952] slurmdbd: saved 1846 pending RPCs
> [2016-09-26T08:02:17.024] Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied
> [2016-09-26T08:02:16.582] killing old slurmctld[20343]
> [2016-09-26T08:02:17.028] Job accounting information stored, but details not gathered
> [2016-09-26T08:02:17.028] slurmctld version 2.6.9 started on cluster graph
> [2016-09-26T08:02:17.029] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
> [2016-09-26T08:02:17.052] Munge cryptographic signature plugin loaded
> [2016-09-26T08:02:17.052] debug3: Success.
> [2016-09-26T08:02:17.052] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
> [2016-09-26T08:02:17.060] debug:  init: Gres GPU plugin loaded
> [2016-09-26T08:02:17.060] debug3: Success.
>
> No crontab set, anything, it just restarts itself. And the thing is,
> when I got jobs running for more than 3 days, they are killed by this
> restart (even if, normally, Slurm is capable of resuming jobs) :
>
> [2016-09-26T08:02:01.282] [830] profile signalling type Task
> [2016-09-26T08:02:20.632] debug3: in the service_connection
> [2016-09-26T08:02:20.634] debug2: got this type of message 1001
> [2016-09-26T08:02:20.634] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
> [2016-09-26T08:02:20.635] debug3: CPUs=22 Boards=1 Sockets=22 Cores=1 Threads=1 Memory=145060 TmpDisk=78723 Uptime=4210729
> [2016-09-26T08:02:20.636] debug4: found jobid = 830, stepid = 4294967294
> [2016-09-26T08:02:20.637] [830] Called _msg_socket_accept
> [2016-09-26T08:02:20.637] [830] Leaving _msg_socket_accept
> [2016-09-26T08:02:20.637] [830] eio: handling events for 1 objects
> [2016-09-26T08:02:20.637] [830] Called _msg_socket_readable
> [2016-09-26T08:02:20.637] [830] Entering _handle_accept (new thread)
> [2016-09-26T08:02:20.638] [830] Identity: uid=0, gid=0
> [2016-09-26T08:02:20.638] [830] Entering _handle_request
> [2016-09-26T08:02:20.639] [830] Got request 5
> [2016-09-26T08:02:20.639] [830] Handling REQUEST_STATE
> [2016-09-26T08:02:20.639] [830] Leaving _handle_request: SLURM_SUCCESS
> [2016-09-26T08:02:20.639] [830] Entering _handle_request
> [2016-09-26T08:02:20.639] debug:  found apparently running job 830
> [2016-09-26T08:02:20.640] [830] Leaving _handle_accept
> [2016-09-26T08:02:20.692] debug3: in the service_connection
> [2016-09-26T08:02:20.693] debug2: got this type of message 6013
> [2016-09-26T08:02:20.693] debug2: Processing RPC: REQUEST_ABORT_JOB
> [2016-09-26T08:02:20.694] debug:  _rpc_abort_job, uid = 64030
> [2016-09-26T08:02:20.694] debug:  task_slurmd_release_resources: 830
> [2016-09-2