Hi Ankit,
Currently DMTCP doesn't support rsh. If it's possible, could you try ssh
instead?
Best,
Jiajun
On Mon, Oct 27, 2014 at 8:37 PM, Kapil Arya wrote:
> Hi Jiajun,
>
> Can you take a look at this error and suggest a solution to Ankit?
>
> Kapil
>
> On Mon, Oct 27, 2014 at 3:25 AM, Ankit Ga
UNPRIVILEGED
> dmtcp::Util::patchArgvIfSetuid(path, argv, newArgv); // BUG:
> dmtcp::Util::patchArgvIfSetuid() DOES NOT SET newArgv WHEN COPYING //
> BINARY IN CODE RE-FACTORING FROM REVISION 911. *filename =
> (*newArgv)[0]; } else */*
> {
>
> *filename = (char*
Hi Marina,
What network are you using? Is it Ethernet or InfiniBand?
On Tue, Oct 28, 2014 at 12:26 PM, Kapil Arya
wrote:
> Hi Jiajun,
>
> Can you take a look at this one?
>
> Kapil
>
> On Tue, Oct 28, 2014 at 8:50 AM, Marina Moran <
> esperandoelmila...@gmail.com> wrote:
>
>> Hi,
>>
>>
and I am *not* using torque nor any batch
> queue system.
>
> Regards
> Marina
>
> On 10/28/14, Jiajun Cao wrote:
> > Hi Marina,
> >
> > What network are you using? Is it Ethernet or InfiniBand?
> >
> > On Tue, Oct 28, 2014 at 12:26 PM, Kapil
Hi Manuel,
What kind of network is used in the cluster? Ethernet or InfiniBand?
On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman wrote:
> Jiajun,
> Could you respond to this, since you've been extending our support
> for MPI?
>
> Thanks,
> - Gene
>
> On Mon, Nov 17, 2014 at 06:02:34PM +010
n master both as user slurm and root (same results)
>
> DMTCP has been installed both in master and computing nodes, same version.
> I am compiling it with no flags, or just the debug ones.
>
>
>
>
>
> 2014-11-17 23:01 GMT+01:00 Jiajun Cao :
>
>> Hi Manuel,
ssion with srun*
> ---
> ---
> [root@slurm-master slurm]# srun -n 3 ./mpiLoop 2
> Process 2 of 3 is on slurm-compute3
> iteration 0 on process 2
> Process 1 of 3 is on slurm-compute2
> iteration 0 on process 1
> Process 0 of 3 is on slurm-compute1
> iteration 0 on pr
urm-master slurm]# srun -n 3 ./mpiLoop 2
> Process 2 of 3 is on slurm-compute3
> iteration 0 on process 2
> Process 1 of 3 is on slurm-compute2
> iteration 0 on process 1
> Process 0 of 3 is on slurm-compute1
> iteration 0 on process 0
> iteration 1 on process 2
> iteration
I'll do the fix accordingly.
On Thu, Mar 5, 2015 at 2:45 PM, Gene Cooperman wrote:
> Hi Eliot,
> Good to hear from you again. Sorry there was a delay before
> we answered your bug report.
>
> Hi Rohan and Jiajun,
> I see what the bug is. Could one of you implement the bug fix
> (see be
Hi Manuel,
If I understand it correctly, what you want is to run your application
with DMTCP support under Slurm. Is that right?
If so, we already have a plugin that supports Slurm. The source code
is in the plugin/batch-queue directory, and it is
compiled by default. To enable it, try dm
Also, could you specify what kind of network you were using for
communication, i.e., Ethernet, InfiniBand, or something else?
Best,
Jiajun
On Mon, Aug 17, 2015 at 11:09 AM, Rohan Garg wrote:
> Hi Ramy,
>
> In the past we have tested with up to 2K cores. The results were
> published in HPDC-2014
Hi,
I have several questions:
1. What kind of application are you running? Is there an integration of
matlab and mpi? I'm asking because I haven't run any mpi-based matlab
applications before.
2. What kind of environment are you using? Specifically, I'd like to know
the MPI version, interconnect
:
> Hello
> Thanks for the response.
>
>
> On 10/06/2015 02:18 PM, Jiajun Cao wrote:
>
> Hi,
>
>
> 1. What kind of application are you running? Is there an integration of
> matlab and mpi? I'm asking because I haven't run any mpi-based matlab
> applica
Best,
Jiajun
On Thu, Oct 8, 2015 at 9:00 AM, abderrahmane wrote:
> Hello
>
> I did it and still got Restart error : cannot map initial resources into
> the restart allocation.
>
> Also, I used openmpi 1.8.8 and got the same error msg.
>
>
>
> On 10/06/2015 07:06 PM,
t;>>
>>> 2015-10-08 16:00 GMT+03:00 abderrahmane :
>>>
>>> Hello
>>>
>>> I did it and still got Restart error : cannot map initial resources into
>>> the restart allocation.
>>>
>>> Also i used openmpi 1.8.8 and got the s
Hi Manuel,
The infiniband plugin shouldn't affect application launching. Could you try
removing the "--ib" flag and see if the application still crashes? This can
help diagnose whether the issue is in the ib plugin or other dmtcp modules.
Best,
Jiajun
On Sun, Oct 18, 2015 at 10:57
Also, if possible, could you offer us a guest account of your cluster?
Debugging that way is more efficient than going back and forth over email.
Best,
Jiajun
On Mon, Oct 19, 2015 at 11:26 AM, Jiajun Cao wrote:
> Hi Manuel,
>
> The infiniband plugin shouldn't affect application launch
the system, no root.
> Correct?
>
> (feel free to continue this conversation outside dmtcp-forum mailing list
> if you consider it irrelevant for the community)
>
> 2015-10-19 10:52 GMT-07:00 Jiajun Cao :
>
>> Also, if possible, could you offer us a guest account of your
Hi Manuel,
It's not entirely clear what you're asking, but let me review some
concepts from DMTCP. I apologize if you already know this.
1. The dmtcp_restart_script.sh can be executed from any machine. It is
independent of the location of the coordinator and independent of the
location of t
Hi Edson,
The error is what's expected. DMTCP considers the computation as a whole,
i.e., all processes involved in a computation must run under DMTCP.
Technically, this is because DMTCP must handle the network
communication. At the time of a checkpoint, DMTCP needs to drain the data
in
> Message: Still draining socket... perhaps remote host is not running under
> DMTCP?
>
> Is there a way to capture the event before that warning message?
>
> Thanks a lot!!!
>
> Edson
> On Oct 26, 2015 5:25 PM, "Jiajun Cao" wrote:
>
>> Hi Edson,
>>
>
Hi Kevin,
Your point is very good. Sorry it took so long to get back to you. I
created a github issue page to track this down:
https://github.com/dmtcp/dmtcp/issues/219
You're very welcome to join the discussion.
Best,
Jiajun
On Mon, Oct 19, 2015 at 7:54 PM, Kevin Buckley <
kevin.buckley.ecs.
t
> before the checkpoint?
>
> Thanks!
>
> Edson
>
>
>
> 2015-10-27 20:45 GMT+01:00 Jiajun Cao :
>
>> Hi Edson,
>>
>> DMTCP_EVENT_WRITE_CKPT corresponds to the event right at the time of
>> writing the checkpoint images into storage. At this point,
dmtcp_launch. For example:
>
> - dmtcp_launch mpirun ...
>
> It doesn't work. DMTCP doesn't manage to drain the buffers.
>
> - dmtcp_launch --with-plugin plugin.so mpirun ...
>
> It works fine!
>
> Could you explain to me why it works with the plugin and doe
Hi Marina,
Where are checkpoint images stored? Are they stored in a shared file
system, or to local storage? From what I can tell from the log, there're 12
processes before checkpoint, and hence 12 checkpoint images. On restart,
only 6 of them connect to the coordinator. It may be the fact that th
1310c955fc5-7-564e0755.dmtcp
> drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31
> ckpt_orted_1310c955fc5-7-564e0755_files
>
>
> I can't figure out what it can be... Thanks for your help,
> regards
> Marina
>
> On 11/18/15, Jiajun Cao wrote:
> > Hi Marina,
>
> >
> >
> > I tried without the -h and -p options as well (cause I set the
> > variable DMTCP_HOST) and tried with all the .dmtcp files (in both
> > nodes) and all gives the same error.
> >
> > Am I missing something perhaps?
> >
> > Thanks again f
Hi Husen,
Depending on your use cases, there're two ways to integrate DMTCP with
Slurm:
1. Submitting Slurm job scripts using DMTCP: we already have the DMTCP
plugin for Slurm, and if you download the source code of DMTCP, some
example scripts can be found at:
plugin/batch-queue/job_examples
Hi Husen,
There can be multiple reasons a client disconnects. Is it possible to give
us access to your cluster? This should be the fastest way to diagnose the
problem. Also, to have some initial guess, could you please provide the
following info:
1. MPI version;
2. What resource management softwa
Hi Ashutosh,
Regarding your first email, did you try the configuration option
--enable-m32, which compiles in 32 bit mode? It should solve your problem.
In your last email, what kind of application are you trying to run under
DMTCP? From the info you provided, there's only one process. However, t
Hi Kyle,
To confirm it's libltdl that causes the problem, you can write a simple
program that calls lt_dlsym (lt_dlopen if necessary), and returns. Run the
program under DMTCP to see if it crashes.
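A minimal sketch of such a test program follows. It is written in Python with ctypes purely for brevity (a plain C program linked against libltdl would be a closer match to the suggestion), and the module name `libm.so.6`, the symbol `cos`, and the function name `probe_ltdl` are illustrative assumptions, not anything taken from Kyle's application.

```python
import ctypes
import ctypes.util


def probe_ltdl(module="libm.so.6", symbol="cos"):
    """Call lt_dlinit, lt_dlopen and lt_dlsym once and return a status string.

    The module and symbol defaults are placeholders chosen for illustration.
    """
    path = ctypes.util.find_library("ltdl")
    if path is None:
        return "libltdl not found"
    ltdl = ctypes.CDLL(path)

    # Declare the C prototypes so that pointers are not truncated to int.
    ltdl.lt_dlinit.restype = ctypes.c_int
    ltdl.lt_dlopen.restype = ctypes.c_void_p
    ltdl.lt_dlopen.argtypes = [ctypes.c_char_p]
    ltdl.lt_dlsym.restype = ctypes.c_void_p
    ltdl.lt_dlsym.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
    ltdl.lt_dlclose.restype = ctypes.c_int
    ltdl.lt_dlclose.argtypes = [ctypes.c_void_p]
    ltdl.lt_dlexit.restype = ctypes.c_int

    # lt_dlinit returns the number of errors, so 0 means success.
    if ltdl.lt_dlinit() != 0:
        return "lt_dlinit failed"
    try:
        handle = ltdl.lt_dlopen(module.encode())
        if not handle:
            return "lt_dlopen failed"
        address = ltdl.lt_dlsym(handle, symbol.encode())
        ltdl.lt_dlclose(handle)
        return "symbol found" if address else "symbol not found"
    finally:
        ltdl.lt_dlexit()


if __name__ == "__main__":
    print(probe_ltdl())
```

Running the script once directly and once as an argument to dmtcp_launch, then comparing the two results, would show whether the libltdl calls behave differently under DMTCP.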
DMTCP doesn't have special handling for libltdl. It deals with libdl
instead. I wonder if that is th
Hi Husen,
The scripts look okay. Just out of curiosity, could you try to switch the
order of dmtcp_launch and mpirun/mpiexec? It may produce something
different, if it's a Slurm-related issue.
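To make the two orderings concrete, here is a small sketch that builds both command lines. The helper name `launch_orders`, the `-n 4` argument, and `./my_app` are placeholders for illustration only; they are not taken from Husen's scripts.

```python
import shlex


def launch_orders(mpi_cmd, app_cmd):
    """Return both ways of combining dmtcp_launch with an MPI launcher."""
    # dmtcp_launch wraps the MPI launcher, which then starts the application.
    dmtcp_outside = ["dmtcp_launch"] + list(mpi_cmd) + list(app_cmd)
    # The MPI launcher starts dmtcp_launch, which wraps each application process.
    dmtcp_inside = list(mpi_cmd) + ["dmtcp_launch"] + list(app_cmd)
    return dmtcp_outside, dmtcp_inside


if __name__ == "__main__":
    for argv in launch_orders(["mpirun", "-n", "4"], ["./my_app"]):
        print(shlex.join(argv))
```

With these placeholder arguments the script prints `dmtcp_launch mpirun -n 4 ./my_app` followed by `mpirun -n 4 dmtcp_launch ./my_app`.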
Best,
Jiajun
On Sat, May 21, 2016 at 1:44 AM, Husen R wrote:
> by the way,
>
> If I use MPICH, no che
Hi William,
It turns out to be a typo in our code. Could you make the following one-line
change and try again?
diff --git a/src/util_exec.cpp b/src/util_exec.cpp
index e54f014..768d6e6 100644
--- a/src/util_exec.cpp
+++ b/src/util_exec.cpp
@@ -678,7 +678,7 @@ void Util::getDmtcpArgs(vector &dmtcp_args)
Hi Jonathan,
Thanks for writing to us. We're definitely glad to help you with the
problem. Can you provide us the following info:
What's the interconnect of the cluster: InfiniBand or TCP?
What versions of Slurm and MPI do you use?
Aside from the failed jobs, are the remaining jobs successful? C
ROR: ld.so:'
> > from libdl.so. If this happens only under DMTCP, then consider setting
> > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> > If the problem persists, please write to the DMTCP developers.
> >
> > [
Hi John,
This is interesting. Looks like the application fails when doing a poll().
There is some info I'd like to collect:
1. Are you running the application on a single node or on several nodes? If
it's a distributed application, what's the communication fabric: TCP or
InfiniBand?
2. What's the o
Hi Wentao,
I just briefly browsed PETSc and SLEPc, and I think DMTCP should support
them well (although I haven't tried), since essentially what they do is
numerical computation, with nothing fancy to the OS.
I've tried DMTCP on HPCG (High Performance Conjugate Gradients) and NAMD
(Scalable Molecu
Hi Maksym,
Thanks for writing to us. Can you provide the following info:
DMTCP version, Slurm version, Mvapich2 version, and is Mvapich2 configured
with srun as the process launcher?
Also, how did you run the jobs? Did you do it by submitting scripts or by
running interactive jobs?
Best,
Jiaju
ble-fast-install --disable-rdma-cm
> --with-pm=mpirun:hydra --with-rdma=gen2 --with-device=ch3:mrail
> --enable-alloca --enable-hwloc --disable-fast --enable-g=dbg
> --enable-error-messages=all --enable-error-checking=all --prefix=
>
>
> On 12/05/2016 11:39 PM, Jiajun Cao wrote:
What application were you running? If it is difficult to share the
information of the binary, can you send us the backtrace of the
segfault?
Best,
Jiajun
On Mon, Jul 24, 2017 at 12:31:57AM -0400, Stas Vernon wrote:
> Hi guys,
> I'm getting 'Segmentation fault (core dumped)' error when running (i
Hi Anirban,
Thanks for writing to us. I have a few questions to ask in order
to further diagnose the problem:
1. What is the network type of the cluster? Is it based on Ethernet
or InfiniBand?
2. Were you running interactive jobs, or batch jobs?
3. Most likely the error indicates that some sock
brary and some other libraries (e.g.
> trillinos). It requires installation of several libraries for run.
> Could you please advise how I can get a backtrace of the segfault?
> Thank you.
>
> On Mon, Jul 24, 2017 at 1:12 PM, Jiajun Cao wrote:
>
> > What application were yo
Hi Rob,
I know what's going on here after browsing some blas
source code. As Rohan mentioned, thread 42 is behaving normally.
What looks more interesting is the calling stack of thread 1, the
main thread, especially the bottom 5 frames. It seems that the blas
library registers some handler at fork
> > > > ability to change the environment and code is quite limited and python 3.4
> > > > is the only version that I was able to get working in this environment.
> > > >
> > > > Jiajun's explanation sounds like the issue to me. What would be the
>
Hi Rick,
The support for allowing gdb attach on restart was not added until
the 2.4 release.
Is there any possibility that you could upgrade the installation to a
newer version? Note you don't need to have root privilege to do
that. If you want to test it locally, just compile the source code,
and add
here is no hang after applying this patch.
> > Thanks,
> > Rob
> >
> > On Mon, Aug 7, 2017 at 12:04 PM, Jiajun Cao wrote:
> >
> >> Hi Rob,
> >>
> >> Could you please try the attached patch? Basically it removes
> >> the lock/unlock
Hi Daniel,
Thanks for writing to us. We had an early proof-of-concept
implementation for OmniPath last year. But that is still on a private
fork, and severely lacks testing. Unfortunately the student (read: I)
who wrote the code graduated last year, and we haven't had enough
people working on this