Re: [gridengine users] job restart - cannot run on host until clean up of an previous run has finished

Reuti Thu, 23 Aug 2012 11:08:48 -0700

Hi,

On 22.08.2012 at 23:42, Henrichs, Juryk wrote:


> I tried the safety kill. Unfortunately that does not do the trick.
> 
> No idea what to make of it, but the job is restarted as expected as long as it
> spans no more than 15 nodes (or 120 slots). If it spans more than that, it is
> not restarted, and the above message is shown.

Well, the safety kill is only executed on the master node of the parallel job. 
Is there indeed still something running on any of the other slave nodes?

-- Reuti
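
A minimal sketch of how one might check the slave nodes by hand. It assumes the job's host list is available in a file in the usual PE hostfile layout (host name in the first column) and that passwordless ssh to the nodes works. The function names, the hostfile path and the user name are hypothetical and not part of SGE:

```shell
#!/bin/sh
# Hypothetical helper, not part of SGE: list what a user still has running
# on every host named in a PE-style hostfile.

# list_pe_hosts FILE: print the unique host names (first column) of FILE.
list_pe_hosts() {
    awk 'NF > 0 { print $1 }' "$1" | sort -u
}

# check_hosts FILE USER: show the remaining processes of USER on each host.
check_hosts() {
    for host in $(list_pe_hosts "$1"); do
        echo "== $host"
        # -n keeps ssh from reading stdin; BatchMode avoids password prompts.
        ssh -n -o BatchMode=yes "$host" ps -u "$2" -o pid,pgid,etime,args \
            || echo "(no processes found or host unreachable)"
    done
}

# Run only when a hostfile and a user name are given on the command line.
if [ "$#" -ge 2 ]; then
    check_hosts "$1" "$2"
fi
```

Saved under a name of your choice and called with the hostfile and the job owner as arguments, this prints one section per node. Anything still listed for the job owner on a slave node after the migration may be what keeps the cleanup from finishing.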


> Any ideas?
> 
> Juryk
> 
> 
> > Reuti   Wednesday, 22 August 2012 19:50
> > Hi,
> >
> >
> > Nothing has to be changed. I posted your link because it is corrected there,
> > in contrast to the version 6.2u5 mentioned by the OP.
> >
> > -- Reuti
> >
> > Dave Love   Wednesday, 22 August 2012 16:00
> >
> > What needs documenting now? (I checked the lists of expanded variables
> > in the various instances against the code, but...)
> >
> > Reuti   Wednesday, 22 August 2012 08:31
> > Hi,
> >
> > On 22.08.2012 at 00:37, Henrichs, Juryk wrote:
> >
> >> Hello Reuti,
> >>
> >> checkpointing type is application_level. The migr_command script basically 
> >> writes one value into one file to tell the application to stop. All the 
> >> rest is taken care of by the application itself.
> >
> > So it's not certain that the application has really left the machine when
> > the "migr_command" finishes, right? I would suggest putting a sleep into
> > the procedure and checking whether the job script is gone, and/or performing
> > a safety kill:
> >
> >     kill -9 -- -$1
> >
> > There are some undocumented variables, so $job_pid can be passed as $1 to
> > the "migr_command":
> >
> > http://arc.liv.ac.uk/SGE/htmlman/htmlman5/checkpoint.html
> >
> > -- Reuti
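
The suggestion above could be sketched roughly as follows. This is only an illustration, assuming $job_pid is passed as the first argument and is also the ID of the job's process group. The function names and the 60-second default are hypothetical. `kill -s 0` and `kill -s KILL` are used as the portable spellings of `kill -0` and `kill -9`:

```shell
#!/bin/sh
# Hypothetical tail end of a migr_command script: give the application time
# to leave on its own, then fall back to the safety kill.

# group_alive PGID: succeeds if a signal could be sent to at least one
# process in process group PGID (signal 0 only checks, it delivers nothing).
group_alive() {
    kill -s 0 -- "-$1" 2>/dev/null
}

# wait_then_kill PGID [SECONDS]: poll once per second for up to SECONDS
# (default 60, an arbitrary choice), then SIGKILL whatever is left.
wait_then_kill() {
    pgid=$1
    timeout=${2:-60}
    waited=0
    while group_alive "$pgid" && [ "$waited" -lt "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if group_alive "$pgid"; then
        # Same effect as: kill -9 -- -$1
        kill -s KILL -- "-$pgid" 2>/dev/null
    fi
    return 0
}

# Run only when a process group ID is given, e.g. $job_pid from SGE.
if [ -n "${1:-}" ]; then
    wait_then_kill "$1" "${2:-60}"
fi
```

The step that tells the application to stop (writing the value into the file) would come before the call to wait_then_kill. Like the one-line kill, this only acts on the host where the "migr_command" itself runs.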
> >
> >
> >> Having said that, the qsub command is started in a script which starts
> >> some other processes in parallel to keep track of the computation. Those
> >> may not have finished by then. However, this should not be a problem, since
> >> the qsub command has not yet returned (as long as the job is suspended and
> >> rescheduled but not finished).
> >>
> >> Juryk
> >>
> >>> Reuti   Tuesday, 21 August 2012 23:47
> >>> Hi,
> >>>
> >>> On 21.08.2012 at 22:44, Henrichs, Juryk wrote:
> >>>
> >>>
> >>>> We are running SGE 6.2u5. I am trying to restart jobs via checkpointing.
> >>>> On one of our clusters that works fine: a job is suspended via the
> >>>> suspend command, is stopped, rescheduled in the queue and restarted if
> >>>> resources are available.
> >>>>
> >>>> With apparently the same SGE setup on a second cluster, my jobs
> >>>> are rescheduled but do not get started. qstat -sj shows
> >>>> "cannot run on host XXX until clean up of an previous run has finished"
> >>>>
> >>>> If the job is deleted from the queue and restarted manually, it works
> >>>> perfectly.
> >>>>
> >>>> Is there a way to get a more elaborate error message and to find out
> >>>> what exactly goes wrong with the cleanup?
> >>>>
> >>> Depending on the checkpointing setup it might be necessary to remove all
> >>> processes of a job in the script defined as "migr_command". Which
> >>> checkpointing type do you use and how do you remove the processes therein?
> >>>
> >>> -- Reuti
> >>>
> >>>
> >>>
> >>>> Juryk
> >>>>
> >>>>
> >>>> This e-mail and any attachment thereto may contain confidential 
> >>>> information and/or information protected by intellectual property rights 
> >>>> for the exclusive attention of the intended addressees named above. Any 
> >>>> access of third parties to this e-mail is unauthorised. Any use of this 
> >>>> e-mail by unintended recipients such as total or partial copying, 
> >>>> distribution, disclosure etc. is prohibited and may be unlawful. When 
> >>>> addressed to our clients the content of this e-mail is subject to the 
> >>>> General Terms and Conditions of GL's Group of Companies applicable at 
> >>>> the date of this e-mail.
> >>>> If you have received this e-mail in error, please notify the sender 
> >>>> either by telephone or by e-mail and delete the material from any 
> >>>> computer.
> >>>> GL's Group of Companies does not warrant and/or guarantee that this 
> >>>> message at the moment of receipt is authentic, correct and its 
> >>>> communication free of errors, interruption etc.
> >>>> FutureShip GmbH, HRB 106781 AG HH, VAT Reg. No. DE263937825
> >>>> Geschäftsführer (CEO): Volker Höppner, Henning Kinkhorst, Stefan
> >>>> Deucker
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> [email protected]
> >>>> https://gridengine.org/mailman/listinfo/users
> >>>>
> >> --
> >> Juryk Henrichs,
> >>
> >> Senior Project Engineer
> >> Fluid Engineering
> >> FutureShip GmbH -- A GL company
> >>
> >> Office Potsdam
> >> Behlertstr. 3a, Haus G
> >> D-14467 Potsdam
> >>
> >> Tel.: +49 331 9799 179-16
> >> Fax.: +49 331 9799 179-9
> >>
> >> http://www.futureship.net
> >> http://www.gl-group.com


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
