Re: [Dmtcp-forum] DL_Plugin, restart problems with hbict and general questions.

2016-10-18 Thread Rohan Garg
Hi Moritz,

Please find my answers inline.

On Mon, Oct 17, 2016 at 03:28:39PM +, Eilfort, Moritz Emanuel Christoph 
wrote:
> Hi,
> 
> Currently I am trying to run ParastationMPI together with dmtcp-2.4.5. 
> In order for this to work I have to disable the DL Plugin.
> What kind of problems do I have to expect or in which scenario is the
> DL Plugin crucial?

The DL plugin adds wrappers for the libdl (dlopen/dlclose/dlsym)
functions. The intent is to prevent checkpointing in the middle of a
call to one of these functions, which can lead to certain races on
restart.

The plugin usually works but we have noticed problems in certain
cases. That said, typically, applications dlopen libraries during
their initialization, and so, as long as one checkpoints after the
initialization has completed, it's safe to disable the plugin.
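
As a rough illustration of what such a wrapper does, here is a conceptual
Python sketch. It is not DMTCP's actual code: `guarded_call`,
`try_checkpoint`, and the lock are made-up names for this example. The
wrapped call holds a lock that a checkpoint would also need, so a
checkpoint cannot begin halfway through the call:

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for whatever the checkpointer synchronizes on;
# this is not a DMTCP data structure.
_ckpt_lock = threading.Lock()


@contextmanager
def checkpoint_disabled():
    """Hold off checkpoints for the duration of the with-block."""
    with _ckpt_lock:
        yield


def guarded_call(func, *args):
    """Run func (standing in for dlopen/dlsym/dlclose) with checkpoints held off."""
    with checkpoint_disabled():
        return func(*args)


def try_checkpoint():
    """Return True if a checkpoint could start now, False if a guarded call is running."""
    if _ckpt_lock.acquire(blocking=False):
        _ckpt_lock.release()
        return True
    return False
```

In a real checkpointer the request would presumably wait for the wrapper
to return instead of reporting False; the sketch only shows the mutual
exclusion between the wrapped call and the checkpoint.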

> 
> Additionally I am trying to use hbict to achieve a form of pre-copy
> live migration. In this scenario I am using MPICH 3.2 and dmtcp-2.4.5
> configured with delta-compression enabled.
> 
> I have tested the following scenario with and without gzip. The result
> is the same.
> 
> I am able to checkpoint my application without any problems.
> Upon restart I get the following error message and a few processes
> terminate which causes the computation to wait endlessly for the
> missing processes. There does not seem to be a difference if I use
> multiple nodes or a single node. I save the images to
> /var/tmp/ckptfiles. 
> 
> Restart fails with the following message:
> 
> MAX checkpoint index is 5, blknum = 198563
> Create result checkpoint: .size = 1
> [19532] WARNING at dmtcp_restart.cpp:300 in createProcess;
> REASON='JWARNING(setsid() != -1) failed'
>  getsid(0) = 19504
>  (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> .
> COMPLETE
> ...[19564] mtcp_util.ic:306 mtcp_skipfile:
>   mtcp_sys_mmap() failed with error: 12...
> COMPLETE
> .[19567] mtcp_util.ic:306 mtcp_skipfile:
>   mtcp_sys_mmap() failed with error: 12...
> COMPLETE
> .
> COMPLETE
> [19571] mtcp_restart.c:963 read_one_memory_area:
>   Assertion failed: 0
> 
> 
> The Coordinator informs me that a couple of processes have terminated:
> 
> [11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect;
> REASON='client disconnected'
>  client->identity() = 599ebe687dfd668d-47000-5804dd42
>  client->progname() = lu.D.16
> [11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect;
> REASON='client disconnected'
>  client->identity() = 599ebe687dfd668d-5-5804dd42
>  client->progname() = lu.D.16
> 
> 
> 
> How can I resolve this problem? Am I doing something wrong?

I haven't worked closely with the HBICT option. Artem (cc-ed) would
be the best person to answer this question. I think it would be easier
to debug if we could reproduce this locally, or if we could get a guest
account on your system.

> 
> Two more, general questions:
> 
> Is it possible to specify different checkpoint directories for
> processes on different nodes during launch?

The current runtime options and the API do not support this, but it's
not difficult to add such an option. A DMTCP plugin could set a local
checkpoint directory depending on the MPI rank or hostname.
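
To sketch the idea (Python used only for illustration; `ckpt_dir_for` is
a hypothetical helper, and how a plugin would hand the resulting path to
DMTCP is left out here), the directory could be derived from the
hostname and, optionally, the MPI rank:

```python
import os
import socket


def ckpt_dir_for(base, hostname=None, rank=None):
    """Build a per-process checkpoint directory from hostname and optional MPI rank.

    base     -- common parent directory, e.g. "/var/tmp/ckptfiles"
    hostname -- defaults to the local hostname when not given
    rank     -- optional MPI rank; how it is obtained is up to the caller
    """
    host = hostname if hostname is not None else socket.gethostname()
    parts = [base, host]
    if rank is not None:
        parts.append("rank%d" % rank)
    return os.path.join(*parts)
```

For example, `ckpt_dir_for("/var/tmp/ckptfiles", "node01", rank=3)` gives
a path ending in `node01/rank3`, so images from different nodes never
land in the same directory.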

> 
> Is it possible to increase the number of hosts upon restart?
> If I launch a computation on two nodes, I naturally only get one
> dmtcp_sshd, and one hydra image for the remote node. Could I somehow 
> launch a second dmtcp_sshd on a third node?

Generally, MPI creates shared-memory regions for inter-process
communication for processes on the same node. Assuming there's a way to
tear down these connections at checkpoint time, it should be possible
to place the co-located processes on separate hosts on restart. Does that
answer your question?

Note that this problem doesn't occur when consolidating processes onto
fewer nodes on restart. The only thing one needs to ensure is that
processes that were co-located on a host prior to checkpointing remain
co-located on restart.
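
The co-location rule above can be stated as a small check. This is a
Python sketch with made-up names (`keeps_colocation`, and the
process-to-host dictionaries), not something DMTCP provides: a restart
placement is acceptable if every group of processes that shared a host
before the checkpoint still shares a single host afterwards.

```python
def keeps_colocation(before, after):
    """Return True if processes sharing a host in `before` also share one in `after`.

    before, after -- dicts mapping a process id to the host it runs on.
    Several old hosts may map to the same new host (consolidation),
    but one old host may not be split across several new hosts.
    """
    new_host_of = {}
    for proc, old_host in before.items():
        new_host = after[proc]
        # The first process seen from an old host fixes the expected new host.
        if new_host_of.setdefault(old_host, new_host) != new_host:
            return False
    return True
```

With `before = {0: "a", 1: "a", 2: "b", 3: "b"}`, moving everything to a
single host passes the check, while moving only process 1 to a third
host fails it.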

> 
> Kind regards,
> Moritz 
> 
> 
> --
> Check out the vibrant tech community on one of the world's most 
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> ___
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum



[Dmtcp-forum] DL_Plugin, restart problems with hbict and general questions.

2016-10-17 Thread Eilfort, Moritz Emanuel Christoph
Hi,

Currently I am trying to run ParastationMPI together with dmtcp-2.4.5. 
In order for this to work I have to disable the DL Plugin.
What kind of problems do I have to expect or in which scenario is the
DL Plugin crucial?

Additionally I am trying to use hbict to achieve a form of pre-copy
live migration. In this scenario I am using MPICH 3.2 and dmtcp-2.4.5
configured with delta-compression enabled.

I have tested the following scenario with and without gzip. The result
is the same.

I am able to checkpoint my application without any problems.
Upon restart I get the following error message and a few processes
terminate which causes the computation to wait endlessly for the
missing processes. There does not seem to be a difference if I use
multiple nodes or a single node. I save the images to
/var/tmp/ckptfiles. 

Restart fails with the following message:

MAX checkpoint index is 5, blknum = 198563
Create result checkpoint: .size = 1
[19532] WARNING at dmtcp_restart.cpp:300 in createProcess;
REASON='JWARNING(setsid() != -1) failed'
 getsid(0) = 19504
 (strerror((*__errno_location ()))) = Operation not permitted
Message: Failed to restore this process as session leader.
.
COMPLETE
...[19564] mtcp_util.ic:306 mtcp_skipfile:
  mtcp_sys_mmap() failed with error: 12...
COMPLETE
.[19567] mtcp_util.ic:306 mtcp_skipfile:
  mtcp_sys_mmap() failed with error: 12...
COMPLETE
.
COMPLETE
[19571] mtcp_restart.c:963 read_one_memory_area:
  Assertion failed: 0


The Coordinator informs me that a couple of processes have terminated:

[11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect;
REASON='client disconnected'
 client->identity() = 599ebe687dfd668d-47000-5804dd42
 client->progname() = lu.D.16
[11548] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect;
REASON='client disconnected'
 client->identity() = 599ebe687dfd668d-5-5804dd42
 client->progname() = lu.D.16



How can I resolve this problem? Am I doing something wrong?

Two more, general questions:

Is it possible to specify different checkpoint directories for
processes on different nodes during launch?

Is it possible to increase the number of hosts upon restart?
If I launch a computation on two nodes, I naturally only get one
dmtcp_sshd, and one hydra image for the remote node. Could I somehow 
launch a second dmtcp_sshd on a third node?

Kind regards,
Moritz 

