Re: [slurm-users] [External] Hibernating a whole cluster
Howdy,

On Tue, 7 Feb 2023 at 20:18, Sean Mc Grath wrote:
> Yes, unfortunately for your needs, I expect a time-limited reservation along the lines of my suggestion would not accept jobs that would be scheduled to end outside of the reservation's availability times. I'd suggest looking at checkpointing in this case, e.g. with DMTCP: Distributed MultiThreaded Checkpointing, http://dmtcp.sourceforge.net/. That could allow jobs to have their state saved and then re-loaded when they are started again.

Checkpointing sounds intriguing. Many thanks for the suggestion. A bit of googling turned up this cluster page <https://docs.nersc.gov/development/checkpoint-restart/dmtcp/> where they've set it up to work with Slurm. However, I also noticed this presentation <https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf> hosted on the Slurm website which indicates that DMTCP doesn't work with containers, and that the checkpointing tools that do support containers don't support MPI. I also took a gander at CRIU <https://criu.org/Main_Page>, but this paper <https://www.ijecs.in/index.php/ijecs/article/download/4122/3855/8058> indicates that it, too, has similar limitations, and BLCR seems to have died <https://hpc.rz.rptu.de/documentation/checkpoint_blcr.html>.

Unless some or all of this information is dated or obsolete, these drawbacks would be deal-breakers, since most of us have been spoiled by containerization, and MPI is, of course, bread and butter for us all. I'd be mighty grateful for any other insights regarding my predicament. In the meantime, I'm going to give the ugly hack of launching scontrol suspend/resume scripts a whirl.

AR
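For what it's worth, the "ugly hack" can stay fairly small. Below is an untested sketch of a pre-hibernate / post-resume hook pair; the function names are made up, and it assumes squeue and scontrol are on root's PATH:

```shell
#!/usr/bin/env bash
# Sketch of the scontrol suspend/resume hack discussed above (untested).
# suspend_running_jobs is meant to run just before "systemctl hibernate",
# resume_suspended_jobs just after the node powers back on.

# Suspend every currently RUNNING job, one scontrol call per job ID.
suspend_running_jobs() {
    squeue --noheader --states=RUNNING --format=%A |
    while read -r jobid; do
        scontrol suspend "$jobid"
    done
}

# Resume every SUSPENDED job once the node is back up.
resume_suspended_jobs() {
    squeue --noheader --states=SUSPENDED --format=%A |
    while read -r jobid; do
        scontrol resume "$jobid"
    done
}
```

These could be called from the existing cron job around systemctl hibernate, or wired into a systemd unit ordered Before=hibernate.target.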
Re: [slurm-users] [External] Hibernating a whole cluster
Hi Analabha,

Yes, unfortunately for your needs, I expect a time-limited reservation along the lines of my suggestion would not accept jobs that would be scheduled to end outside of the reservation's availability times. I'd suggest looking at checkpointing in this case, e.g. with DMTCP: Distributed MultiThreaded Checkpointing, http://dmtcp.sourceforge.net/. That could allow jobs to have their state saved and then re-loaded when they are started again.

Best

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services
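For reference, a DMTCP-checkpointed job is typically wrapped in dmtcp_launch on the first run and resumed via the dmtcp_restart_script.sh that the checkpointer writes. Here is a hypothetical batch script along those lines; the paths, interval, and payload name are invented, and the restart-script location may differ by DMTCP version, so treat it as scaffolding rather than a known-good recipe:

```shell
# Write an untested sketch of a DMTCP-checkpointed batch script to
# ./dmtcp_job.sbatch. Paths, interval and the payload name are invented.
cat > dmtcp_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=ckpt-demo
#SBATCH --time=08:00:00

# Keep checkpoint images in a fixed directory so a resubmission can
# find them again across job IDs.
CKPT_DIR=$HOME/ckpt/ckpt-demo

if [ -f "$CKPT_DIR/dmtcp_restart_script.sh" ]; then
    # A previous run left a checkpoint behind: resume from it.
    bash "$CKPT_DIR/dmtcp_restart_script.sh"
else
    mkdir -p "$CKPT_DIR"
    # -i 3600: write a checkpoint every hour (argument is in seconds).
    dmtcp_launch --ckptdir "$CKPT_DIR" -i 3600 ./my_long_computation
fi
EOF
```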
Re: [slurm-users] [External] Hibernating a whole cluster
That's probably not optimal, but it could work. I'd go with brutal preemption: swapping 90+ GB can be quite time-consuming.

Diego

On 07/02/2023 14:18, Analabha Roy wrote:
> On Tue, 7 Feb 2023, 18:12 Diego Zuccato wrote:
>> RAM used by a suspended job is not released. At most it can be swapped out (if enough swap is available).
>
> There should be enough swap available. I have 93 gigs of RAM and as big a swap partition. I can top it off with swap files if needed.
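In Slurm terms, Diego's "brutal preemption" corresponds to the preemption knobs in slurm.conf, with preempted jobs requeued (or cancelled) rather than suspended, so no multi-gigabyte RAM image has to be swapped out. An untested sketch follows; the partition and node names are invented, and the Slurm preemption guide covers the full option set:

```conf
# slurm.conf fragment (sketch): requeue preempted jobs instead of
# suspending them.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Jobs in the higher-PriorityTier partition may preempt the lower one.
PartitionName=urgent Nodes=node01 PriorityTier=2 Default=NO
PartitionName=normal Nodes=node01 PriorityTier=1 Default=YES
```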
Re: [slurm-users] [External] Hibernating a whole cluster
On Tue, 7 Feb 2023, 18:12 Diego Zuccato wrote:
> RAM used by a suspended job is not released. At most it can be swapped out (if enough swap is available).

There should be enough swap available. I have 93 gigs of RAM and as big a swap partition. I can top it off with swap files if needed.
Re: [slurm-users] [External] Hibernating a whole cluster
RAM used by a suspended job is not released. At most it can be swapped out (if enough swap is available).

On 07/02/2023 13:14, Analabha Roy wrote:
> Thanks for your awesome suggestion! I'm going through the reservation docs now. [...]
Re: [slurm-users] [External] Hibernating a whole cluster
Hi Sean,

Thanks for your awesome suggestion! I'm going through the reservation docs now. At first glance, it seems like a daily reservation would turn down jobs that are too big for the reservation. It'd be nice if Slurm could suspend (in the manner of 'scontrol suspend') jobs during reserved downtime and resume them after. That way, folks can submit large jobs without having to worry about the downtimes. Perhaps the FLEX option in reservations can accomplish this somehow?

I suppose that I can do it using a shell script iterator and a cron job, but that seems like an ugly hack. I was hoping there is a way to config this in Slurm itself?

AR

On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath wrote:
> Could you do something like create a daily reservation for 8 hours that starts at 9am [...]
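The "too big for the reservation" behaviour is easy to reason about: with a daily 09:00-17:00 window, a job fits only if its requested walltime is no longer than the time left until the window closes. A toy helper to illustrate the arithmetic (function names invented; a real version would pull the numbers out of scontrol/squeue):

```shell
# Minutes from a clock time "now" (HH:MM) until the reservation end
# (HH:MM) on the same day.
minutes_until() {
    local now_h=${1%%:*} now_m=${1##*:} end_h=${2%%:*} end_m=${2##*:}
    echo $(( (10#$end_h * 60 + 10#$end_m) - (10#$now_h * 60 + 10#$now_m) ))
}

# Does a job requesting $1 minutes of walltime still fit before the
# window closes in $2 minutes?
job_fits_reservation() {
    [ "$1" -le "$2" ]
}

# With 3 hours left in the window, a 2-hour job fits:
job_fits_reservation 120 "$(minutes_until 14:00 17:00)" && echo "fits"
# but a 9-hour job can never fit an 8-hour daily window at all,
# which is exactly the case that motivates suspend or checkpointing.
```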
Re: [slurm-users] [External] Hibernating a whole cluster
Hi Analabha,

Could you do something like create a daily reservation for 8 hours that starts at 9am, or whatever times work for you, like the following untested command:

scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1 flags=daily ReservationName=daily

Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY

Some more possibly helpful documentation at https://slurm.schedmd.com/reservations.html; search for "daily".

My idea being that jobs can only run in that reservation (that would have to be configured separately, not sure how off the top of my head), which is only active during the times you want the node to be working. So the cron job that hibernates/shuts it down will do so when there are no jobs running. At least in theory.

Hope that helps.

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services
Re: [slurm-users] [External] Hibernating a whole cluster
Hi, Thanks. I had read the Slurm Power Saving Guide before. I believe the configs enable slurmctld to check other nodes for idleness and suspend/resume them. Slurmctld must run on a separate, always-on server for this to work, right? My issue might be a little different. I literally have only one node that runs everything: slurmctld, slurmd, slurmdbd, everything. This node must be set to "sudo systemctl hibernate"after business hours, regardless of whether jobs are queued or running. The next business day, it can be switched on manually. systemctl hibernate is supposed to save the entire run state of the sole node to swap and poweroff. When powered on again, it should restore everything to its previous running state. When the job queue is empty, this works well. I'm not sure how well this hibernate/resume will work with running jobs and would appreciate any suggestions or insights. AR On Tue, 7 Feb 2023 at 01:39, Florian Zillner wrote: > Hi, > > follow this guide: https://slurm.schedmd.com/power_save.html > > Create poweroff / poweron scripts and configure slurm to do the poweroff > after X minutes. Works well for us. Make sure to set an appropriate time > (ResumeTimeout) to allow the node to come back to service. > Note that we did not achieve good power saving with suspending the nodes, > powering them off and on saves way more power. The downside is it takes ~ 5 > mins to resume (= power on) the nodes when needed. > > Cheers, > Florian > -- > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Monday, 6 February 2023 18:21 > *To:* slurm-users@lists.schedmd.com > *Subject:* [External] [slurm-users] Hibernating a whole cluster > > Hi, > > I've just finished setup of a single node "cluster" with slurm on ubuntu > 20.04. Infrastructural limitations prevent me from running it 24/7, and > it's only powered on during business hours. > > > Currently, I have a cron job running that hibernates that sole node before > closing time. 
> The hibernation is done with standard systemd, and hibernates to the swap
> partition.
>
> I have not run any lengthy Slurm jobs on it yet. Before I do, can I get
> some thoughts on a couple of things?
>
> If it hibernated while Slurm still had jobs running/queued, would they
> resume properly when the machine powers back on? Note that my swap space
> is bigger than my RAM.
>
> Is it necessary, perhaps, to set up a pre-hibernate script for systemd to
> iterate scontrol over the jobs, suspending them all before hibernating and
> resuming them post-resume?
>
> What about the wall times? I'm guessing that Slurm will count the downtime
> as elapsed for each job. Is there a way to configure this, or is the only
> alternative a post-hibernate script that iteratively updates the wall times
> of the running jobs using scontrol again?
>
> Thanks for your attention.
> Regards
> AR

--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104, West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/
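The pre-hibernate idea raised above (suspending every job before the node sleeps, resuming them afterwards) maps naturally onto a systemd sleep hook: systemd runs executables in /usr/lib/systemd/system-sleep/ with $1 set to "pre" or "post" and $2 naming the sleep mode. The following is an untested sketch of such a hook; the hook path, job-state handling, and the decision to resume everything that is SUSPENDED are all assumptions, not a vetted setup.

```shell
#!/bin/sh
# Hypothetical sleep hook, e.g. /usr/lib/systemd/system-sleep/slurm-jobs.
# systemd calls it with $1 = pre|post and $2 = suspend|hibernate|hybrid-sleep.

# Map the systemd phase to the scontrol verb we want.
action_for_phase() {
    case "$1" in
        pre)  echo suspend ;;   # park running jobs before hibernating
        post) echo resume  ;;   # wake them once the node is back up
    esac
}

# Which job state to look for in each phase.
state_for_action() {
    case "$1" in
        suspend) echo RUNNING ;;
        resume)  echo SUSPENDED ;;
    esac
}

main() {
    action=$(action_for_phase "$1")
    [ -n "$action" ] || return 0
    # squeue: -h drops the header, %A prints only the job ID.
    for job in $(squeue -h -t "$(state_for_action "$action")" -o '%A'); do
        scontrol "$action" "$job"
    done
}

if [ "$#" -ge 1 ]; then main "$@"; fi
```

Note that scontrol suspend only stops the job's processes (SIGSTOP); whether those processes survive a full hibernate/thaw cycle intact is exactly the open question in this thread.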
Re: [slurm-users] [External] Hibernating a whole cluster
I would agree with Florian about using the Slurm power_save method. The wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving has additional details and scripts for performing node suspend and resume. You would need the server to have a BMC so that you can power it down and up with IPMI commands from your Slurm management server.

/Ole

On 06-02-2023 21:07, Florian Zillner wrote:

follow this guide: https://slurm.schedmd.com/power_save.html

Create poweroff/poweron scripts and configure Slurm to do the poweroff after X minutes. Works well for us. Make sure to set an appropriate time (ResumeTimeout) to allow the node to come back to service. Note that we did not achieve good power saving with suspending the nodes; powering them off and on saves far more power. The downside is that it takes ~5 minutes to resume (= power on) the nodes when needed.

Cheers,
Florian

*From:* slurm-users on behalf of Analabha Roy
*Sent:* Monday, 6 February 2023 18:21
*To:* slurm-users@lists.schedmd.com
*Subject:* [External] [slurm-users] Hibernating a whole cluster

Hi,

I've just finished setting up a single-node "cluster" with Slurm on Ubuntu 20.04. Infrastructural limitations prevent me from running it 24/7, and it's only powered on during business hours.

Currently, I have a cron job running that hibernates that sole node before closing time. The hibernation is done with standard systemd, and hibernates to the swap partition.

I have not run any lengthy Slurm jobs on it yet. Before I do, can I get some thoughts on a couple of things?

If it hibernated while Slurm still had jobs running/queued, would they resume properly when the machine powers back on? Note that my swap space is bigger than my RAM.

Is it necessary, perhaps, to set up a pre-hibernate script for systemd to iterate scontrol over the jobs, suspending them all before hibernating and resuming them post-resume?

What about the wall times? I'm guessing that Slurm will count the downtime as elapsed for each job. Is there a way to configure this, or is the only alternative a post-hibernate script that iteratively updates the wall times of the running jobs using scontrol again?
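The slurm.conf side of the IPMI-based power_save setup Ole describes could look roughly like the fragment below. The parameter names come from the power_save guide; the script paths, node name, and timings here are placeholders, not a tested configuration.

```
# slurm.conf power-saving fragment (sketch; paths and timings are placeholders)
SuspendProgram=/usr/local/sbin/node_poweroff   # e.g. wraps "ipmitool ... chassis power soft"
ResumeProgram=/usr/local/sbin/node_poweron     # e.g. wraps "ipmitool ... chassis power on"
SuspendTime=600          # seconds of idleness before a node is powered down
ResumeTimeout=600        # allow several minutes for the node to boot back
SuspendExcNodes=head     # never power off the management node itself
```

SuspendExcNodes matters here precisely because slurmctld must keep running somewhere in order to power the compute nodes back on.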
Re: [slurm-users] [External] Hibernating a whole cluster
Hi,

follow this guide: https://slurm.schedmd.com/power_save.html

Create poweroff/poweron scripts and configure Slurm to do the poweroff after X minutes. Works well for us. Make sure to set an appropriate time (ResumeTimeout) to allow the node to come back to service. Note that we did not achieve good power saving with suspending the nodes; powering them off and on saves far more power. The downside is that it takes ~5 minutes to resume (= power on) the nodes when needed.

Cheers,
Florian

From: slurm-users on behalf of Analabha Roy
Sent: Monday, 6 February 2023 18:21
To: slurm-users@lists.schedmd.com
Subject: [External] [slurm-users] Hibernating a whole cluster

Hi,

I've just finished setting up a single-node "cluster" with Slurm on Ubuntu 20.04. Infrastructural limitations prevent me from running it 24/7, and it's only powered on during business hours.

Currently, I have a cron job running that hibernates that sole node before closing time. The hibernation is done with standard systemd, and hibernates to the swap partition.

I have not run any lengthy Slurm jobs on it yet. Before I do, can I get some thoughts on a couple of things?

If it hibernated while Slurm still had jobs running/queued, would they resume properly when the machine powers back on? Note that my swap space is bigger than my RAM.

Is it necessary, perhaps, to set up a pre-hibernate script for systemd to iterate scontrol over the jobs, suspending them all before hibernating and resuming them post-resume?

What about the wall times? I'm guessing that Slurm will count the downtime as elapsed for each job. Is there a way to configure this, or is the only alternative a post-hibernate script that iteratively updates the wall times of the running jobs using scontrol again?

Thanks for your attention.
Regards
AR
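The post-hibernate wall-time fix-up asked about above could be sketched as follows. This is untested: the downtime in minutes must be supplied by whatever triggers the script, the time parsing only covers squeue's common limit formats, and it relies on scontrol accepting a plain minute count for TimeLimit (which the scontrol man page documents).

```shell
#!/bin/sh
# Hypothetical post-resume helper: extend each running job's limit by the
# minutes lost while the node was hibernated. Usage: extend_limits <minutes>

# Strip one leading zero so "08" doesn't get parsed as bad octal in $((...)).
dec() { n=${1#0}; echo "${n:-0}"; }

# Convert a squeue time limit (MM, MM:SS, HH:MM:SS or D-HH:MM:SS) to minutes.
limit_to_minutes() {
    case "$1" in
        *-*)   # D-HH:MM:SS
            d=$(dec "${1%%-*}"); rest=${1#*-}
            h=$(dec "${rest%%:*}"); m=${rest#*:}; m=$(dec "${m%%:*}")
            echo $(( d * 1440 + h * 60 + m )) ;;
        *:*:*) # HH:MM:SS
            h=$(dec "${1%%:*}"); m=${1#*:}; m=$(dec "${m%%:*}")
            echo $(( h * 60 + m )) ;;
        *:*)   # MM:SS - round down to whole minutes
            dec "${1%%:*}" ;;
        *)     # already a bare minute count
            echo "$1" ;;
    esac
}

extend_all() {
    down="$1"
    for job in $(squeue -h -t RUNNING -o '%A'); do
        old=$(squeue -h -j "$job" -o '%l')   # %l = current time limit
        scontrol update JobId="$job" \
            TimeLimit=$(( $(limit_to_minutes "$old") + down ))
    done
}

if [ "$#" -ge 1 ]; then extend_all "$1"; fi
```

Jobs with TimeLimit=UNLIMITED would need to be skipped; that case is left out of the sketch.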