Hello Reuti,
the checkpointing type is application_level. The migr_command script basically
writes one value into one file to tell the application to stop. The application
itself takes care of everything else.
That said, the qsub command is started from a script that also starts some
other processes in parallel to keep track of the computation. Those may not
have finished by then. However, this should not be a problem, since the qsub
command has not yet returned (as long as the job is suspended and rescheduled
but not finished).
Juryk
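
For illustration, a migr_command script of the kind described above (one value
written into one file so that the application knows it should stop) might look
roughly like the following Python sketch. The function name request_stop, the
value "1", and passing the flag-file path as the first command-line argument
are all assumptions made for this example; the actual script and the file the
application polls are not shown in this thread.

```python
#!/usr/bin/env python3
"""Hypothetical migr_command sketch: ask the application to stop via a flag file."""
import os
import sys
import tempfile


def request_stop(flag_path, value="1"):
    """Write `value` to `flag_path` atomically.

    The value is written to a temporary file in the same directory and then
    renamed over the target, so a polling application never reads a
    half-written flag.
    """
    directory = os.path.dirname(os.path.abspath(flag_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(value + "\n")
        os.replace(tmp_path, flag_path)
    except BaseException:
        # Do not leave the temporary file behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Assumption: the flag-file path is given as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    request_stop(sys.argv[1])
```

Note that this sketch only signals the application. It does not wait for the
application or any helper processes to exit, which is the point the question
about removing processes in the migr_command is aimed at.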
Reuti <mailto:[email protected]>
Tuesday, 21 August 2012 23:47
Hi,
On 21.08.2012 at 22:44, Henrichs, Juryk wrote:
we are running SGE 6.2u5. I am trying to restart jobs via checkpointing.
On one of our clusters that works fine: a job is suspended via the suspend
command, stopped, rescheduled in the queue, and restarted once resources are
available.
With apparently the same SGE setup on a second cluster, my jobs are
rescheduled but do not get started. qstat -sj shows
"cannot run on host XXX until clean up of an previous run has finished"
If the job is deleted from the queue and restarted manually, it works
perfectly.
Is there a way to get a more detailed error message and to find out what
exactly goes wrong with the cleanup?
Depending on the checkpointing setup, it might be necessary to remove all
processes of a job in the script defined as "migr_command". Which
checkpointing type do you use, and how do you remove the processes therein?
-- Reuti
Juryk
This e-mail and any attachment thereto may contain confidential
information and/or information protected by intellectual property rights for
the exclusive attention of the intended addressees named above. Any access of
third parties to this e-mail is unauthorised. Any use of this e-mail by
unintended recipients such as total or partial copying, distribution,
disclosure etc. is prohibited and may be unlawful. When addressed to our
clients the content of this e-mail is subject to the General Terms and
Conditions of GL's Group of Companies applicable at the date of this e-mail.
If you have received this e-mail in error, please notify the
sender either by telephone or by e-mail and delete the material from any
computer.
GL's Group of Companies does not warrant and/or guarantee that
this message at the moment of receipt is authentic, correct and its
communication free of errors, interruption etc.
FutureShip GmbH, HRB 106781 AG HH, VAT Reg. No. DE263937825
Geschäftsführer (CEO): Volker Höppner, Henning Kinkhorst,
Stefan Deucker
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
Juryk Henrichs <mailto:[email protected]>
Tuesday, 21 August 2012 13:07
Hi,
we are running SGE 6.2u5. I am trying to restart jobs via checkpointing.
On one of our clusters that works fine: a job is suspended via the suspend
command, stopped, rescheduled in the queue, and restarted once resources are
available.
With apparently the same SGE setup on a second cluster, my jobs are
rescheduled but do not get started. qstat -sj shows
"cannot run on host XXX until clean up of an previous run has finished"
If the job is deleted from the queue and restarted manually, it works
perfectly.
Is there a way to get a more detailed error message and to find out what
exactly goes wrong with the cleanup?
Juryk
--
Juryk Henrichs,
Senior Project Engineer
Fluid Engineering
FutureShip GmbH -- A GL company
Office Potsdam
Behlertstr. 3a, Haus G
D-14467 Potsdam
Tel.: +49 331 9799 179-16
Fax.: +49 331 9799 179-9
http://www.futureship.net
http://www.gl-group.com
