Re: [Dmtcp-forum] Issue involving rsh done inside a program and ran via dmtcp_launch

2014-10-27 Thread Jiajun Cao
Hi Ankit, Currently DMTCP doesn't support rsh. If it's possible, could you try ssh instead? Best, Jiajun On Mon, Oct 27, 2014 at 8:37 PM, Kapil Arya wrote: > Hi Jiajun, > > Can you take a look at this error and suggest Ankit a solution? > > Kapil > > On Mon, Oct 27, 2014 at 3:25 AM, Ankit Ga

Re: [Dmtcp-forum] Issue involving rsh done inside a program and ran via dmtcp_launch

2014-10-27 Thread Jiajun Cao
UNPRIVILEGED > dmtcp::Util::patchArgvIfSetuid(path, argv, newArgv); // BUG: > dmtcp::Util::patchArgvIfSetuid() DOES NOT SET newArgv WHEN COPYING // > BINARY IN CODE RE-FACTORING FROM REVISION 911. *filename = > (*newArgv)[0]; } else */* > { > > *filename = (char*

Re: [Dmtcp-forum] when using 2 nodes, dmtcp_launch hangs out

2014-10-28 Thread Jiajun Cao
Hi Marina, What network are you using? Is it Ethernet or InfiniBand? On Tue, Oct 28, 2014 at 12:26 PM, Kapil Arya wrote: > Hi Jiajun, > > Can you take a look at this one? > > Kapil > > On Tue, Oct 28, 2014 at 8:50 AM, Marina Moran < > esperandoelmila...@gmail.com> wrote: > >> Hi, >> >>

Re: [Dmtcp-forum] when using 2 nodes, dmtcp_launch hangs out

2014-10-28 Thread Jiajun Cao
and I am *not* using torque nor any batch > queue system. > > Regards > Marina > > On 10/28/14, Jiajun Cao wrote: > > Hi Marina, > > > > What's the network are you using? Is it Ethernet or InfiniBand? > > > > On Tue, Oct 28, 2014 at 12:26 PM, Kapil

Re: [Dmtcp-forum] dmtcp not working with MPICH: perhaps remote host is not running under DMTCP?

2014-11-17 Thread Jiajun Cao
Hi Manuel, What kind of network is used in the cluster? Ethernet or InfiniBand? On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman wrote: > Jiajun, > Could you respond to this, since you've been extending our support > for MPI? > > Thanks, > - Gene > > On Mon, Nov 17, 2014 at 06:02:34PM +010

Re: [Dmtcp-forum] dmtcp not working with MPICH: perhaps remote host is not running under DMTCP?

2014-11-18 Thread Jiajun Cao
n master both as user slurm and root (same results) > > DMTCP has been installed both in master and computing nodes, same version. > I am compiling it with no flags, or just the debug ones. > > > > > > 2014-11-17 23:01 GMT+01:00 Jiajun Cao : > >> Hi Manuel,

Re: [Dmtcp-forum] dmtcp not working with MPICH: perhaps remote host is not running under DMTCP?

2014-11-20 Thread Jiajun Cao
ssion with srun* > --- > --- > [root@slurm-master slurm]# srun -n 3 ./mpiLoop 2 > Process 2 of 3 is on slurm-compute3 > iteration 0 on process 2 > Process 1 of 3 is on slurm-compute2 > iteration 0 on process 1 > Process 0 of 3 is on slurm-compute1 > iteration 0 on pr

Re: [Dmtcp-forum] dmtcp not working with MPICH: perhaps remote host is not running under DMTCP?

2014-11-20 Thread Jiajun Cao
urm-master slurm]# srun -n 3 ./mpiLoop 2 > Process 2 of 3 is on slurm-compute3 > iteration 0 on process 2 > Process 1 of 3 is on slurm-compute2 > iteration 0 on process 1 > Process 0 of 3 is on slurm-compute1 > iteration 0 on process 0 > iteration 1 on process 2 > iteration

Re: [Dmtcp-forum] Two restart issues with Java jobs: more details

2015-03-05 Thread Jiajun Cao
I'll do the fix accordingly. On Thu, Mar 5, 2015 at 2:45 PM, Gene Cooperman wrote: > Hi Eliot, > Good to hear from you again. Sorry there was a delay before > we answered your bug report. > > Hi Rohan and Jiajun, > I see what the bug is. Could one of you implement the bug fix > (see be

Re: [Dmtcp-forum] integration with Slurm as plugin

2015-05-21 Thread Jiajun Cao
Hi Manuel, If I understand it correctly, what you want is to run your application with DMTCP support under Slurm. Is that right? If so, we already have the plugin for the support of Slurm. The source code is in the plugin/batch-queue directory, and it is compiled by default. To enable it, try dm

Re: [Dmtcp-forum] DMTCP scaling potential

2015-08-17 Thread Jiajun Cao
Also, could you specify what kind of network you were using for communication, i.e., Ethernet, InfiniBand, or something else? Best, Jiajun On Mon, Aug 17, 2015 at 11:09 AM, Rohan Garg wrote: > Hi Ramy, > > In the past we have tested with up to 2K cores. The results were > published in HPDC-2014

Re: [Dmtcp-forum] error using --rm mpirun

2015-10-06 Thread Jiajun Cao
Hi, I have several questions: 1. What kind of application are you running? Is there an integration of matlab and mpi? I'm asking because I haven't run any mpi-based matlab applications before. 2. What kind of environment are you using? Specifically, I'd like to know the MPI version, interconnect

Re: [Dmtcp-forum] error using --rm mpirun

2015-10-06 Thread Jiajun Cao
: > Hello > Thanks for the response. > > > On 10/06/2015 02:18 PM, Jiajun Cao wrote: > > Hi, > > > 1. What kind of application are you running? Is there an integration of > matlab and mpi? I'm asking because I haven't run any mpi-based matlab > applica

Re: [Dmtcp-forum] error using --rm mpirun

2015-10-08 Thread Jiajun Cao
Best, Jiajun On Thu, Oct 8, 2015 at 9:00 AM, abderrahmane wrote: > Hello > > I did it and still got Restart error : cannot map initial resources into > the restart allocation. > > Also i used openmpi 1.8.8 and got the same error msg. > > > > On 10/06/2015 07:06 PM,

Re: [Dmtcp-forum] error using --rm mpirun

2015-10-09 Thread Jiajun Cao
>>> >>> 2015-10-08 16:00 GMT+03:00 abderrahmane : >>> >>> Hello >>> >>> I did it and still got Restart error : cannot map initial resources into >>> the restart allocation. >>> >>> Also i used openmpi 1.8.8 and got the s

Re: [Dmtcp-forum] problem running DMTCP with MVAPICH over infiniband

2015-10-19 Thread Jiajun Cao
Hi Manuel, The infiniband plugin shouldn't affect application launching. Could you try removing the "--ib" flag and see if the application still crashes? This can help diagnose whether the issue is in the ib plugin or other dmtcp modules. Best, Jiajun On Sun, Oct 18, 2015 at 10:57

Re: [Dmtcp-forum] problem running DMTCP with MVAPICH over infiniband

2015-10-19 Thread Jiajun Cao
Also, if possible, could you offer us a guest account on your cluster? Debugging directly on the cluster is more efficient than debugging over email. Best, Jiajun On Mon, Oct 19, 2015 at 11:26 AM, Jiajun Cao wrote: > Hi Manuel, > > The infiniband plugin shouldn't affect application launch

Re: [Dmtcp-forum] problem running DMTCP with MVAPICH over infiniband

2015-10-19 Thread Jiajun Cao
the system, no root. > Correct? > > (feel free to continue this conversation outside dmtcp-forum mailing list > if you consider it irrelevant for the community) > > 2015-10-19 10:52 GMT-07:00 Jiajun Cao : > >> Also, if possible, could you offer us a guest account of your

Re: [Dmtcp-forum] restart MPI job with local files?

2015-10-19 Thread Jiajun Cao
Hi Manuel, It's not entirely clear what you're asking, but let me review some concepts from DMTCP. I apologize if you already know this. 1. The dmtcp_restart_script.sh can be executed from any machine. It is independent of the location of the coordinator and independent of the location of t

Re: [Dmtcp-forum] checkpoint a MPI application that exchanges message with other application

2015-10-26 Thread Jiajun Cao
Hi Edson, The error is expected. DMTCP considers the computation as a whole, i.e., all processes involved in a computation must run under DMTCP. Technically, this is because DMTCP must handle the network communication. At the time of a checkpoint, DMTCP needs to drain the data in
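The draining idea described above can be illustrated with a minimal sketch. It is not DMTCP's implementation: the marker value and the function names are hypothetical, chosen only for this example. The sketch assumes that both ends of a connection cooperate, so one side can mark the end of its in-flight data and the other side can read until it sees that mark.

```python
import socket

# Hypothetical marker for this sketch only; it is not DMTCP's actual token.
DRAIN_MARKER = b"<<DRAIN-END>>"


def drain(sock: socket.socket) -> bytes:
    """Read from sock until DRAIN_MARKER arrives; return the data before it."""
    buf = b""
    while not buf.endswith(DRAIN_MARKER):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before sending the marker")
        buf += chunk
    return buf[: -len(DRAIN_MARKER)]


def demo() -> bytes:
    """Simulate one connection whose peer cooperates with the drain."""
    a, b = socket.socketpair()
    try:
        a.sendall(b"in-flight application data")  # sent but not yet read
        a.sendall(DRAIN_MARKER)                   # cooperating peer marks the end
        return drain(b)
    finally:
        a.close()
        b.close()
```

In this sketch, a peer that never sends the marker would leave `drain` waiting, which is one way to picture why every process in the computation has to participate.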

Re: [Dmtcp-forum] checkpoint a MPI application that exchanges message with other application

2015-10-27 Thread Jiajun Cao
> Message: Still draining socket... perhaps remote host is not running under > DMTCP? > > There is a way to capture the event before that message warning? > > Thanks a lot!!! > > Edson > On Oct 26, 2015 5:25 PM, "Jiajun Cao" wrote: > >> Hi Edson, >> >

Re: [Dmtcp-forum] Specifying --interval: enough to initiate checkpointing

2015-10-27 Thread Jiajun Cao
Hi Kevin, Your point is very good. Sorry it took so long to get back to you. I created a github issue page to track this down: https://github.com/dmtcp/dmtcp/issues/219 You're very welcome to join the discussion. Best, Jiajun On Mon, Oct 19, 2015 at 7:54 PM, Kevin Buckley < kevin.buckley.ecs.

Re: [Dmtcp-forum] checkpoint a MPI application that exchanges message with other application

2015-10-28 Thread Jiajun Cao
t > before the checkpoint? > > Thanks! > > Edson > > > > 2015-10-27 20:45 GMT+01:00 Jiajun Cao : > >> Hi Edson, >> >> DMTCP_EVENT_WRITE_CKPT corresponds to the event right at the time of >> writing the checkpoint images into storage. At this point,

Re: [Dmtcp-forum] checkpoint a MPI application that exchanges message with other application

2015-10-31 Thread Jiajun Cao
dmtcp_launch. For example: > > - dmtcp_launch mpirun ... > > It doesn't work. DMTCP doesn't manage to drain the buffers. > > - dmtcp_launch --with-plugin plugin.so mpirun ... > > It works fine! > > Could you explain to me why it works with the plugin and doe

Re: [Dmtcp-forum] problem with restart w/DMTCP 2.4.2

2015-11-18 Thread Jiajun Cao
Hi Marina, Where are checkpoint images stored? Are they stored in a shared file system, or to local storage? From what I can tell from the log, there're 12 processes before checkpoint, and hence 12 checkpoint images. On restart, only 6 of them connect to the coordinator. It may be the fact that th
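The check suggested above (are all checkpoint images visible from where the restart runs?) can be sketched as a small script to run on each node. This is only an illustration: the `ckpt_*.dmtcp` file pattern is an assumption based on the image names that appear elsewhere in this thread, and `missing_images` is a hypothetical helper, not part of DMTCP.

```python
from pathlib import Path


def missing_images(directory: str, expected: int) -> int:
    """Return how many of the expected checkpoint images are not in directory.

    Assumes images are regular files matching ckpt_*.dmtcp; the companion
    ckpt_*_files directories are ignored.
    """
    found = [p for p in Path(directory).glob("ckpt_*.dmtcp") if p.is_file()]
    return max(expected - len(found), 0)
```

For example, `missing_images("/path/to/ckpt/dir", 12)` returning a non-zero value on a node would suggest that some images were written to another node's local storage rather than to a shared file system.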

Re: [Dmtcp-forum] problem with restart w/DMTCP 2.4.2

2015-11-19 Thread Jiajun Cao
1310c955fc5-7-564e0755.dmtcp > drwxr-xr-x 2 hpcpro hpcpro 4096 Nov 19 12:31 > ckpt_orted_1310c955fc5-7-564e0755_files > > > I can't figure out what it can be... Thanks for your help, > regards > Marina > > On 11/18/15, Jiajun Cao wrote: > > Hi Marina, >

Re: [Dmtcp-forum] problem with restart w/DMTCP 2.4.2

2015-12-01 Thread Jiajun Cao
> > > > > > I tried without the -h and -p options as well (cause I set the > > variable DMTCP_HOST) and tried with all the .dmtcp files (in both > > nodes) and all gives the same error. > > > > Am I missing something perhaps? > > > > Thanks again f

Re: [Dmtcp-forum] DMTCP support in SLURM

2016-04-18 Thread Jiajun Cao
Hi Husen, Depending on your use cases, there're two ways to integrate DMTCP with Slurm: 1. Submitting Slurm job scripts using DMTCP: we already have the DMTCP plugin for Slurm, and if you download the source code of DMTCP, some example scripts can be found at: plugin/batch-queue/job_examples

Re: [Dmtcp-forum] Problem running dmtcp_launch in different node

2016-04-28 Thread Jiajun Cao
Hi Husen, There can be multiple reasons a client disconnects. Is it possible to give us access to your cluster? This should be the fastest way to diagnose the problem. Also, to have some initial guess, could you please provide the following info: 1. MPI version; 2. What resource management softwa

Re: [Dmtcp-forum] 32-bit executable and LD_PRELOAD errors

2016-04-29 Thread Jiajun Cao
Hi Ashutosh, Regarding your first email, did you try the configuration option --enable-m32, which compiles in 32-bit mode? It should solve your problem. In your last email, what kind of application are you trying to run under DMTCP? From the info you provided, there's only one process. However, t

Re: [Dmtcp-forum] Segfault when calling ImageMagick convert through popen

2016-05-22 Thread Jiajun Cao
Hi Kyle, To confirm it's libltdl that causes the problem, you can write a simple program that calls lt_dlsym (lt_dlopen if necessary), and returns. Run the program under DMTCP to see if it crashes. DMTCP doesn't have special handling for libltdl. It deals with libdl instead. I wonder if that is th

Re: [Dmtcp-forum] dmtcp checkpoint failed in slurm

2016-05-22 Thread Jiajun Cao
Hi Husen, The scripts look okay. Just out of curiosity, could you try to switch the order of dmtcp_launch and mpirun/mpiexec? It may produce something different, if it's a Slurm-related issue. Best, Jiajun On Sat, May 21, 2016 at 1:44 AM, Husen R wrote: > by the way, > > If I use MPICH, no che

Re: [Dmtcp-forum] symbol lookup error: libdmtcp_modify-env.so: undefined symbol: warning

2016-06-04 Thread Jiajun Cao
Hi William, It turns out there is a typo in our code. Could you make the following one-line change and try again? diff --git a/src/util_exec.cpp b/src/util_exec.cpp index e54f014..768d6e6 100644 --- a/src/util_exec.cpp +++ b/src/util_exec.cpp @@ -678,7 +678,7 @@ void Util::getDmtcpArgs(vector &dmtcp_args)

Re: [Dmtcp-forum] Did DMTCP coordinator die uncleanly?

2016-06-28 Thread Jiajun Cao
Hi Jonathan, Thanks for writing to us. We're definitely glad to help you with the problem. Can you provide us the following info: What's the interconnect of the cluster: InfiniBand or TCP? What versions of Slurm and MPI do you use? Aside from the failed jobs, are the remaining jobs successful? C

Re: [Dmtcp-forum] Did DMTCP coordinator die uncleanly?

2016-07-01 Thread Jiajun Cao
ROR: ld.so:' > > from libdl.so. If this happens only under DMTCP, then consider setting > > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'. > > If the problem persists, please write to the DMTCP developers. > > > > [

Re: [Dmtcp-forum] Segmentation Fault with OpenMPI

2016-07-17 Thread Jiajun Cao
Hi John, This is interesting. Looks like the application fails when doing a poll(). There is some info I'd like to collect: 1. Are you running the application on a single node or on several nodes? If it's a distributed application, what's the communication fabric: TCP or InfiniBand? 2. What's the o

Re: [Dmtcp-forum] Applying DMTCP to a (Fortran 90 + MPI + SLEPc) solver on TACC Stampede

2016-09-29 Thread Jiajun Cao
Hi Wentao, I just briefly browsed PETSc and SLEPc, and I think DMTCP should support them well (although I haven't tried), since essentially what they do is numerical computation, with nothing fancy to the OS. I've tried DMTCP on HPCG (High Performance Conjugate Gradients) and NAMD (Scalable Molecu

Re: [Dmtcp-forum] C/R with MVAPICH

2016-12-05 Thread Jiajun Cao
Hi Maksym, Thanks for writing to us. Can you provide the following info: DMTCP version, Slurm version, Mvapich2 version, and is Mvapich2 configured with srun as the process launcher? Also, how did you run the jobs? Did you do it by submitting scripts or by running interactive jobs? Best, Jiaju

Re: [Dmtcp-forum] C/R with MVAPICH

2016-12-07 Thread Jiajun Cao
ble-fast-install --disable-rdma-cm > --with-pm=mpirun:hydra --with-rdma=gen2 --with-device=ch3:mrail > --enable-alloca --enable-hwloc --disable-fast --enable-g=dbg > --enable-error-messages=all --enable-error-checking=all --prefix= > > > On 12/05/2016 11:39 PM, Jiajun Cao wrote: >

Re: [Dmtcp-forum] PBS scheduler support

2017-07-24 Thread Jiajun Cao
What application were you running? If it is difficult to share the information of the binary, can you send us the backtrace of the segfault? Best, Jiajun On Mon, Jul 24, 2017 at 12:31:57AM -0400, Stas Vernon wrote: > Hi guys, > I'm getting 'Segmentation fault (core dumped)' error when running (i

Re: [Dmtcp-forum] KernelBufferDrainer

2017-07-24 Thread Jiajun Cao
Hi Anirban, Thanks for writing to us. I have a few questions to ask in order to further diagnose the problem: 1. What is the network type of the cluster? Is it based on Ethernet or InfiniBand? 2. Were you running interactive jobs, or batch jobs? 3. Most likely the error indicates that some sock

Re: [Dmtcp-forum] PBS scheduler support

2017-07-25 Thread Jiajun Cao
brary and some other libraries (e.g. > trillinos). It requires installation of several libraries for run. > Could you please advise how can I get backtrace of the segfault? > Thank you. > > On Mon, Jul 24, 2017 at 1:12 PM, Jiajun Cao wrote: > > > What application were yo

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-01 Thread Jiajun Cao
Hi Rob, I know what's going on here after browsing some blas source code. As Rohan mentioned, thread 42 is behaving normally. What looks more interesting is the calling stack of thread 1, the main thread, especially the bottom 5 frames. It seems that the blas library registers some handler at fork
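The mechanism mentioned above, handlers that a library registers to run around fork, can be illustrated with a minimal sketch. It uses Python's `os.register_at_fork` as a stand-in for fork handlers registered from C; it is not the BLAS library's code or DMTCP's, and `fork_with_handlers` is a hypothetical name for this example.

```python
import os


def fork_with_handlers() -> list:
    """Fork once and return the handler events observed in the parent."""
    events = []
    # Handlers registered this way stay registered for the life of the
    # process; there is no way to unregister them.
    os.register_at_fork(
        before=lambda: events.append("before"),
        after_in_parent=lambda: events.append("after_in_parent"),
        after_in_child=lambda: events.append("after_in_child"),
    )
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup
    os.waitpid(pid, 0)  # parent: reap the child
    return events
```

The point of the sketch is only that such handlers run implicitly on every fork, so whatever they do (for instance, taking and releasing a lock) also happens when a checkpointing tool intercepts or issues the fork.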

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-07 Thread Jiajun Cao
> > > ability to change the environment and code is quite limited and python > > 3.4 > > > > is the only version that I was able to get working in this environment. > > > > > > > > Jiajun's explanation sounds like the issue to me. What would be the >

Re: [Dmtcp-forum] How to get dmtcp_restart to pause for gdb attach

2017-08-08 Thread Jiajun Cao
Hi Rick, The support for allowing gdb attach on restart was not added until the 2.4 release. Is it possible for you to upgrade the installation to a newer version? Note you don't need to have root privilege to do that. If you want to test it locally, just compile the source code, and add

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-11 Thread Jiajun Cao
here is no hang after applying this patch. > > Thanks, > > Rob > > > > On Mon, Aug 7, 2017 at 12:04 PM, Jiajun Cao wrote: > > > >> Hi Rob, > >> > >> Could you please try the attached patch? Basically it removes > >> the lock/unlock

Re: [Dmtcp-forum] Omnipath Support?

2018-11-01 Thread Jiajun Cao
Hi Daniel, Thanks for writing to us. We had an early proof-of-concept implementation for OmniPath last year. But that is still on a private fork, and severely lacks testing. Unfortunately the student (read: I) who wrote the code graduated last year, and we haven't had enough people working on this