Re: [OMPI users] Does OpenMPI 1.4.1 support the MPI_IN_PLACE designation ...

2010-08-16 Thread Jeff Squyres
MPI_IN_PLACE is defined in both mpif.h and the "mpi" Fortran module.

Does the subroutine in question have "include 'mpif.h'" or "use mpi"?
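For illustration, here is a minimal Fortran sketch (routine and variable names
are hypothetical) of a subroutine in which MPI_IN_PLACE resolves because of
"use mpi"; without either "use mpi" or "include 'mpif.h'" in that scope,
implicit typing kicks in and the compiler reports exactly the #6404 error
quoted below:

  subroutine reduce_in_place(rho, np, comm)
    use mpi                   ! brings MPI_IN_PLACE, MPI_DOUBLE_PRECISION, MPI_SUM into scope
    implicit none
    integer, intent(in)    :: np, comm
    real(8), intent(inout) :: rho(np)
    integer :: mpi_err

    ! reduce into rho in place across the communicator
    call MPI_Allreduce(MPI_IN_PLACE, rho, np, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, comm, mpi_err)
  end subroutine reduce_in_place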


On Aug 16, 2010, at 3:55 PM, Richard Walsh wrote:

> 
> All,
> 
> I have a fortran code (Octopus 3.2) that is bombing during a build in a 
> routine that uses:
> 
> call MPI_Allreduce(MPI_IN_PLACE, rho(1, ispin), np, MPI_DOUBLE_PRECISION, 
> MPI_SUM, st%mpi_grp%comm, mpi_err)
> 
> with the error message:
> 
> states.F90(1240): error #6404: This name does not have a type, and must have 
> an explicit type.   [MPI_IN_PLACE]
>call MPI_Allreduce(MPI_IN_PLACE, rho(1, ispin), np, 
> MPI_DOUBLE_PRECISION, MPI_SUM, st%mpi_grp%comm, mpi_err)
> ---^
> compilation aborted for states_oct.f90 (code 1)
> 
> This suggests that MPI_IN_PLACE is missing from the mpif.h header.
> 
> Any thoughts?
> 
> rbw
> 
> Richard Walsh
> Parallel Applications and Systems Manager
> CUNY HPC Center, Staten Island, NY
> 718-982-3319
> 612-382-4620
> 
> Reason does give the heart pause;
> As the heart gives reason fits.
> 
> Yet, to live where reason always rules;
> Is to kill one's heart with wits.
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
> Gokhan Kir [g...@iastate.edu]
> Sent: Monday, August 16, 2010 5:43 PM
> To: us...@open-mpi.org
> Subject: [OMPI users] A Problem with RAxML
> 
> Hi,
> I am currently using RAxML 7.0, and recently I got a problem. Even though I 
> Googled  it, I couldn't find a satisfying answer.
> I got this message from BATCH_ERRORs file " raxmlHPC-MPI: topologies.c:179: 
> restoreTL: Assertion `n >= 0 && n < rl->max' failed. "
> 
> Any help is appreciated,
> 
> Thanks,
> 
> --
> Gokhan
> 
> Think green before you print this email.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] A Problem with RAxML

2010-08-16 Thread Gokhan Kir
Thanks Richard,
Actually, I am not sure how to try what you suggested with RAxML. I don't have
much experience with these programs.

Thanks again.

On Mon, Aug 16, 2010 at 5:40 PM, Richard Walsh
wrote:

>
> Hey Gokhan,
>
> The following worked for me with OpenMPI 1.4.1 and the latest Intel compiler
> (May release), although there have been reports that with full vectorization
> there are some unexplained in-flight failures:
>
> #
> # Parallel Version
> #
> service0:/share/apps/raxml/7.0.4/build # make -f Makefile.MPI
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o axml.o
> axml.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> raxmlParsimony.o raxmlParsimony.c
> mpicc -c -o rev_functions.o rev_functions.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> optimizeModel.o optimizeModel.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o multiple.o
> multiple.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> searchAlgo.o searchAlgo.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> topologies.o topologies.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> parsePartitions.o parsePartitions.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o treeIO.o
> treeIO.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o models.o
> models.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> bipartitionList.o bipartitionList.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> rapidBootstrap.o rapidBootstrap.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> evaluatePartialGeneric.o evaluatePartialGeneric.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> evaluateGeneric.o evaluateGeneric.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> newviewGeneric.o newviewGeneric.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> makenewzGeneric.o makenewzGeneric.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> evaluateGenericVector.o evaluateGenericVector.c
> mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o
> categorizeGeneric.o categorizeGeneric.c
> mpicc -o raxmlHPC-MPI axml.o raxmlParsimony.o rev_functions.o
> optimizeModel.o multiple.o searchAlgo.o topologies.o parsePartitions.o
> treeIO.o models.o bipartitionList.o rapidBootstrap.o
> evaluatePartialGeneric.o evaluateGeneric.o newviewGeneric.o
> makenewzGeneric.o  evaluateGenericVector.o categorizeGeneric.o  -lm
>
> The latest PGI-built OpenMPI 1.4.1 release is said to behave correctly
> with the following flags regardless of the level of optimization.  I have
> both versions installed.  Both compile and link without error for me.
> This is with an IB-built OpenMPI.
>
> CC = /share/apps/openmpi-pgi/default/bin/mpicc
> CFLAGS =  -O3 -DPARALLEL -Mnoframe -Munroll
>
> Hope this is useful ...
>
> rbw
>
> Richard Walsh
> Parallel Applications and Systems Manager
> CUNY HPC Center, Staten Island, NY
> 718-982-3319
> 612-382-4620
>
> Reason does give the heart pause;
> As the heart gives reason fits.
>
> Yet, to live where reason always rules;
> Is to kill one's heart with wits.
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of
> Gokhan Kir [g...@iastate.edu]
> Sent: Monday, August 16, 2010 5:43 PM
> To: us...@open-mpi.org
> Subject: [OMPI users] A Problem with RAxML
>
> Hi,
> I am currently using RAxML 7.0, and recently I got a problem. Even though I
> Googled  it, I couldn't find a satisfying answer.
> I got this message from BATCH_ERRORs file " raxmlHPC-MPI: topologies.c:179:
> restoreTL: Assertion `n >= 0 && n < rl->max' failed. "
>
> Any help is appreciated,
>
> Thanks,
>
> --
> Gokhan
>
> Think green before you print this email.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Gokhan Kir
Graduate Student
Program of Interdepartmental Genetics
Department of Genetics,Development and Cell Biology
2188 Molecular Biology Building


[OMPI users] Does OpenMPI 1.4.1 support the MPI_IN_PLACE designation ...

2010-08-16 Thread Richard Walsh

All,

I have a fortran code (Octopus 3.2) that is bombing during a build in a routine 
that uses:

call MPI_Allreduce(MPI_IN_PLACE, rho(1, ispin), np, MPI_DOUBLE_PRECISION, 
MPI_SUM, st%mpi_grp%comm, mpi_err)

with the error message:

states.F90(1240): error #6404: This name does not have a type, and must have an 
explicit type.   [MPI_IN_PLACE]
call MPI_Allreduce(MPI_IN_PLACE, rho(1, ispin), np, 
MPI_DOUBLE_PRECISION, MPI_SUM, st%mpi_grp%comm, mpi_err)
---^
compilation aborted for states_oct.f90 (code 1)

This suggests that MPI_IN_PLACE is missing from the mpif.h header.

Any thoughts?

rbw

Richard Walsh
Parallel Applications and Systems Manager
CUNY HPC Center, Staten Island, NY
718-982-3319
612-382-4620

Reason does give the heart pause;
As the heart gives reason fits.

Yet, to live where reason always rules;
Is to kill one's heart with wits.

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
Gokhan Kir [g...@iastate.edu]
Sent: Monday, August 16, 2010 5:43 PM
To: us...@open-mpi.org
Subject: [OMPI users] A Problem with RAxML

Hi,
I am currently using RAxML 7.0, and recently I got a problem. Even though I 
Googled  it, I couldn't find a satisfying answer.
I got this message from BATCH_ERRORs file " raxmlHPC-MPI: topologies.c:179: 
restoreTL: Assertion `n >= 0 && n < rl->max' failed. "

Any help is appreciated,

Thanks,

--
Gokhan

Think green before you print this email.



Re: [OMPI users] A Problem with RAxML

2010-08-16 Thread Richard Walsh

Hey Gokhan,

The following worked for me with OpenMPI 1.4.1 and the latest Intel compiler
(May release), although there have been reports that with full vectorization
there are some unexplained in-flight failures:

#
# Parallel Version
#
service0:/share/apps/raxml/7.0.4/build # make -f Makefile.MPI
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o axml.o axml.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
raxmlParsimony.o raxmlParsimony.c
mpicc -c -o rev_functions.o rev_functions.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
optimizeModel.o optimizeModel.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o multiple.o 
multiple.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o searchAlgo.o 
searchAlgo.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o topologies.o 
topologies.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
parsePartitions.o parsePartitions.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o treeIO.o 
treeIO.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o models.o 
models.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
bipartitionList.o bipartitionList.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
rapidBootstrap.o rapidBootstrap.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
evaluatePartialGeneric.o evaluatePartialGeneric.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
evaluateGeneric.o evaluateGeneric.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
newviewGeneric.o newviewGeneric.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
makenewzGeneric.o makenewzGeneric.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
evaluateGenericVector.o evaluateGenericVector.c
mpicc -O3 -DPARALLEL -fomit-frame-pointer -funroll-loops   -c -o 
categorizeGeneric.o categorizeGeneric.c
mpicc -o raxmlHPC-MPI axml.o raxmlParsimony.o rev_functions.o optimizeModel.o 
multiple.o searchAlgo.o topologies.o parsePartitions.o treeIO.o models.o 
bipartitionList.o rapidBootstrap.o evaluatePartialGeneric.o evaluateGeneric.o 
newviewGeneric.o makenewzGeneric.o  evaluateGenericVector.o categorizeGeneric.o 
 -lm

The latest PGI-built OpenMPI 1.4.1 release is said to behave correctly with
the following flags regardless of the level of optimization.  I have both
versions installed.  Both compile and link without error for me.  This is
with an IB-built OpenMPI.

CC = /share/apps/openmpi-pgi/default/bin/mpicc
CFLAGS =  -O3 -DPARALLEL -Mnoframe -Munroll

Hope this is useful ...

rbw

Richard Walsh
Parallel Applications and Systems Manager
CUNY HPC Center, Staten Island, NY
718-982-3319
612-382-4620

Reason does give the heart pause;
As the heart gives reason fits.

Yet, to live where reason always rules;
Is to kill one's heart with wits.

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
Gokhan Kir [g...@iastate.edu]
Sent: Monday, August 16, 2010 5:43 PM
To: us...@open-mpi.org
Subject: [OMPI users] A Problem with RAxML

Hi,
I am currently using RAxML 7.0, and recently I got a problem. Even though I 
Googled  it, I couldn't find a satisfying answer.
I got this message from BATCH_ERRORs file " raxmlHPC-MPI: topologies.c:179: 
restoreTL: Assertion `n >= 0 && n < rl->max' failed. "

Any help is appreciated,

Thanks,

--
Gokhan

Think green before you print this email.



Re: [OMPI users] A Problem with RAxML

2010-08-16 Thread Ralph Castain
You might want to start by contacting someone from that software package - this 
is the Open MPI mailing list.


On Aug 16, 2010, at 3:43 PM, Gokhan Kir wrote:

> Hi,
> I am currently using RAxML 7.0, and recently I got a problem. Even though I 
> Googled  it, I couldn't find a satisfying answer. 
> I got this message from BATCH_ERRORs file " raxmlHPC-MPI: topologies.c:179: 
> restoreTL: Assertion `n >= 0 && n < rl->max' failed. "
> 
> Any help is appreciated,
> 
> Thanks,
> 
> -- 
> Gokhan  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] A Problem with RAxML

2010-08-16 Thread Gokhan Kir
Hi,
I am currently using RAxML 7.0, and recently I got a problem. Even though I
Googled  it, I couldn't find a satisfying answer.
I got this message from BATCH_ERRORs file " raxmlHPC-MPI: topologies.c:179:
restoreTL: Assertion `n >= 0 && n < rl->max' failed. "

Any help is appreciated,

Thanks,

-- 
Gokhan


Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Nysal Jan
The value of hdr->tag seems wrong.

In ompi/mca/pml/ob1/pml_ob1_hdr.h
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
#define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)

and in ompi/mca/btl/btl.h
#define MCA_BTL_TAG_PML 0x40

So hdr->tag should be a value >= 65.
Since the tag is incorrect, you are not getting the proper callback function
pointer, and hence the SEGV.
I'm not sure at this point why you are getting an invalid/corrupt message
header.
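For illustration only, here is a small standalone C sketch (not the actual
Open MPI source) that mirrors the defines above; any tag outside the 65..73
range, such as the 0 seen in the core file, has no receive callback registered
for it, so dispatching on it ends up calling a NULL function pointer, i.e.
jumping to address 0x0:

#include <stdio.h>

#define MCA_BTL_TAG_PML 0x40                               /* ompi/mca/btl/btl.h */
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)   /* 65, first valid PML tag */
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)   /* 73, last valid PML tag */

int main(void)
{
    unsigned char tag = 0;   /* value observed in the core file (hdr->tag == 0) */

    if (tag < MCA_PML_OB1_HDR_TYPE_MATCH || tag > MCA_PML_OB1_HDR_TYPE_FIN) {
        printf("tag %u lies outside 65..73, so no valid callback is associated "
               "with it\n", (unsigned) tag);
    }
    return 0;
}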

--Nysal

On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:

> Hi,
>
> sorry, i just forgot to add the values of the function parameters:
> (gdb) print reg->cbdata
> $1 = (void *) 0x0
> (gdb) print openib_btl->super
> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> btl_rdma_pipeline_send_length = 1048576,
>  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> btl_flags = 310,
>  btl_add_procs = 0x2b341eb8ee47 , btl_del_procs =
> 0x2b341eb90156 , btl_register = 0, btl_finalize =
> 0x2b341eb93186 ,
>  btl_alloc = 0x2b341eb90a3e , btl_free =
> 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813
> ,
>  btl_prepare_dst = 0x2b341eb91f2e , btl_send =
> 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> ,
>  btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e
> , btl_dump = 0x2b341acd45cb ,
> btl_mpool = 0xf3f4110,
>  btl_register_error = 0x2b341eb90565 ,
> btl_ft_event = 0x2b341eb952e7 }
> (gdb) print hdr->tag
> $3 = 0 '\0'
> (gdb) print des
> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> (gdb) print reg->cbfunc
> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>
> Eloi
>
> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > Hi,
> >
> > Here is the output of a core file generated during a segmentation fault
> > observed during a collective call (using openib):
> >
> > #0  0x in ?? ()
> > (gdb) where
> > #0  0x in ?? ()
> > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
> > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
> > 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8
>  0x2aedb859fa31
> > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
> > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
> > op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
> > coll_tuned_decision_fixed.c:63
> > #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
> > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
> > FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
> > count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
> > stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
> >
> {
> > = {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
> > {pi_ = 0x6}}}, }) at interface.cpp:371
> > #14 0x040cb858 in
> FEMTown::Field::detail::align_itfs_and_neighbhors
> > (dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
> > check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
> > FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
> > 0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
> > 0x039acdd4 in PyField_align_elements (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
> > 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* 

Re: [OMPI users] Abort

2010-08-16 Thread David Ronis
Hi Jeff,

I've reproduced your test here, with the same results.  Moreover, if I
put the nodes with rank>0 into a blocking MPI call (MPI_Bcast or
MPI_Barrier) I still get the same behavior; namely, rank 0's calling
abort() generates a core file and leads to termination, which is the
behavior I want.  I'll look at my code a bit more, but the only
difference I see now is that in my code a floating point exception
triggers a signal-handler that calls abort().   I don't see why that
should be different from your test.
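Schematically, the setup is something like the following (an illustrative
standalone sketch only, not the actual code; raise() stands in for the real
floating-point fault):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* SIGFPE handler that calls abort(), as described above. */
static void fpe_handler(int sig)
{
    fprintf(stderr, "caught signal %d, calling abort()\n", sig);
    abort();                      /* expected to terminate and dump core */
}

int main(void)
{
    signal(SIGFPE, fpe_handler);  /* install the handler */
    raise(SIGFPE);                /* stand-in for a real floating-point exception */
    return 0;
}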

Thanks for your help.

David

On Mon, 2010-08-16 at 09:54 -0700, Jeff Squyres wrote:
> FWIW, I'm unable to replicate your behavior.  This is with Open MPI 1.4.2 on 
> RHEL5:
> 
> 
> [9:52] svbu-mpi:~/mpi % cat abort.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
> int rank;
> 
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> if (0 == rank) {
> abort();
> }
> printf("Rank %d sleeping...\n", rank);
> sleep(600);
> printf("Rank %d finalizing...\n", rank);
> MPI_Finalize();
> return 0;
> }
> [9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
> [9:52] svbu-mpi:~/mpi % ls -l core*
> ls: No match.
> [9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 
> ./abort
> Rank 1 sleeping...
> [svbu-mpi055:03991] *** Process received signal ***
> [svbu-mpi055:03991] Signal: Aborted (6)
> [svbu-mpi055:03991] Signal code:  (-6)
> [svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
> [svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
> [svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
> [svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
> [svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x2b45cacf2994]
> [svbu-mpi055:03991] [ 5] ./abort [0x400809]
> [svbu-mpi055:03991] *** End of error message ***
> Rank 3 sleeping...
> Rank 2 sleeping...
> --
> mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited 
> on signal 6 (Aborted).
> --
> [9:52] svbu-mpi:~/mpi % ls -l core*
> -rw--- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
> [9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991 
> core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 
> (SYSV), SVR4-style, from 'abort'
> [9:52] svbu-mpi:~/mpi % 
> -
> 
> You can see that all processes die immediately, and I get a corefile from the 
> process that called abort().
> 
> 
> On Aug 16, 2010, at 9:25 AM, David Ronis wrote:
> 
> > I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> > does kill off the entire MPI job.   abort() drops core when I'm running
> > on 1 processor, but not in a multiprocessor run.  In addition, a node
> > calling abort() doesn't lead to the entire run being killed off.
> > 
> > David
> > On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> >> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> >> 
> >>> I'm using mpirun and the nodes are all on the same machine (an 8-cpu box
> >>> with an Intel i7).  coresize is unlimited:
> >>> 
> >>> ulimit -a
> >>> core file size  (blocks, -c) unlimited
> >> 
> >> That looks good.
> >> 
> >> In reviewing the email thread, it's not entirely clear: are you calling 
> >> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() 
> >> should.
> >> 
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 



Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Eloi Gaudry

 Hi Jeff,

Thanks for your reply.

I did run our application through valgrind but it couldn't find any 
"Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2 
with the suppression file), "Use of uninitialized bytes" and 
"Conditional jump depending on uninitialized bytes" in different ompi 
routines. Some of them are located in btl_openib_component.c. I'll send 
you an output of valgrind shortly.


Another question: you said that the callback function pointer should 
never be 0. But can the tag (hdr->tag) be null?


Thanks for your help,
Eloi



On 16/08/2010 18:22, Jeff Squyres wrote:

Sorry for the delay in replying.

Odd; the values of the callback function pointer should never be 0.  This seems 
to suggest some kind of memory corruption is occurring.

I don't know if it's possible, because the stack trace looks like you're 
calling through python, but can you run this application through valgrind, or 
some other memory-checking debugger?


On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:


Hi,

sorry, i just forgot to add the values of the function parameters:
(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, 
btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, 
btl_rdma_pipeline_send_length = 1048576,
   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864, 
btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
   btl_add_procs = 0x2b341eb8ee47, btl_del_procs = 
0x2b341eb90156, btl_register = 0, btl_finalize = 
0x2b341eb93186,
   btl_alloc = 0x2b341eb90a3e, btl_free = 
0x2b341eb91400, btl_prepare_src = 
0x2b341eb91813,
   btl_prepare_dst = 0x2b341eb91f2e, btl_send = 
0x2b341eb94517, btl_sendi = 0x2b341eb9340d,
   btl_put = 0x2b341eb94660, btl_get = 
0x2b341eb94c4e, btl_dump = 0x2b341acd45cb, 
btl_mpool = 0xf3f4110,
   btl_register_error = 0x2b341eb90565, 
btl_ft_event = 0x2b341eb952e7}
(gdb) print hdr->tag
$3 = 0 '\0'
(gdb) print des
$4 = (mca_btl_base_descriptor_t *) 0xf4a6700
(gdb) print reg->cbfunc
$5 = (mca_btl_base_module_recv_cb_fn_t) 0

Eloi

On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:

Hi,

Here is the output of a core file generated during a segmentation fault
observed during a collective call (using openib):

#0  0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x2aedbc4e05f4 in btl_openib_handle_incoming
(openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
(device=0x19024ac0, cq=0, wc=0x7279ce90) at
btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
(device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
btl_openib_component_progress () at btl_openib_component.c:3451 #6
0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8  0x2aedb859fa31
in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
#10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
(sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
coll_tuned_decision_fixed.c:63
#11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
 {
= {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
{pi_ = 0x6}}},}) at interface.cpp:371
#14 0x040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors
(dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
0x039acdd4 in PyField_align_elements (self=0x0,
args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*,
_object*, _object*)>::exec<_object>  (this=0x7279dc20, s=0x0,
po1=0x2aaab0765050, po2=0x19d2e950) at
/home/qa/svntop/femtown/modules/main/py/exception.hpp:463
#18 0x039acc82 in 

Re: [OMPI users] Abort

2010-08-16 Thread Jeff Squyres
FWIW, I'm unable to replicate your behavior.  This is with Open MPI 1.4.2 on 
RHEL5:


[9:52] svbu-mpi:~/mpi % cat abort.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
int rank;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
abort();
}
printf("Rank %d sleeping...\n", rank);
sleep(600);
printf("Rank %d finalizing...\n", rank);
MPI_Finalize();
return 0;
}
[9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
[9:52] svbu-mpi:~/mpi % ls -l core*
ls: No match.
[9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 
./abort
Rank 1 sleeping...
[svbu-mpi055:03991] *** Process received signal ***
[svbu-mpi055:03991] Signal: Aborted (6)
[svbu-mpi055:03991] Signal code:  (-6)
[svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
[svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
[svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
[svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
[svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x2b45cacf2994]
[svbu-mpi055:03991] [ 5] ./abort [0x400809]
[svbu-mpi055:03991] *** End of error message ***
Rank 3 sleeping...
Rank 2 sleeping...
--
mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited on 
signal 6 (Aborted).
--
[9:52] svbu-mpi:~/mpi % ls -l core*
-rw--- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
[9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991 
core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 
(SYSV), SVR4-style, from 'abort'
[9:52] svbu-mpi:~/mpi % 
-

You can see that all processes die immediately, and I get a corefile from the 
process that called abort().


On Aug 16, 2010, at 9:25 AM, David Ronis wrote:

> I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> does kill off the entire MPI job.   abort() drops core when I'm running
> on 1 processor, but not in a multiprocessor run.  In addition, a node
> calling abort() doesn't lead to the entire run being killed off.
> 
> David
> On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
>> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
>> 
>>> I'm using mpirun and the nodes are all on the same machine (an 8-cpu box
>>> with an Intel i7).  coresize is unlimited:
>>> 
>>> ulimit -a
>>> core file size  (blocks, -c) unlimited
>> 
>> That looks good.
>> 
>> In reviewing the email thread, it's not entirely clear: are you calling 
>> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() should.
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Checkpointing mpi4py program

2010-08-16 Thread ananda.mudar
Josh



I have one more update on my observation while analyzing this issue.



Just to refresh, I am using openmpi-trunk release 23596 with
mpi4py-1.2.1 and BLCR 0.8.2. When I checkpoint the python script written
using mpi4py, the program doesn't progress after the checkpoint is taken
successfully. I tried it with openmpi 1.4.2 and then tried it with the
latest trunk version as suggested. I see similar behavior in both
releases.



I have one more interesting observation which I thought may be useful. I
tried the "-stop" option of ompi-checkpoint (trunk version) and the
mpirun prints the following error messages when I run the command
"ompi-checkpoint -stop -v ":



 Error messages in the window where mpirun command was running START
==

[hpdcnln001:15148] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]

[hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line
253

[hpdcnln001:15149] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]

[hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line
253

[hpdcnln001:15146] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]

[hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line
253

[hpdcnln001:15147] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]

[hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line
253

 Error messages in the window where mpirun command was running END
==



Please note that the checkpoint image was created at the end of it.
However, when I run the command "kill -CONT ", it fails to
move forward, which is the same as the original problem I have reported.
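For clarity, the sequence being described is roughly the following (the mpirun
PID is left as a placeholder here, as above):

ompi-checkpoint -stop -v <PID of mpirun>   # checkpoint completes; the job is left stopped
kill -CONT <PID of mpirun>                 # resume; this is where the job fails to move forward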



Let me know if you need any additional information.



Thanks for your time in advance



-  Ananda



Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mu...@wipro.com



From: Ananda Babu Mudar (WT01 - Energy and Utilities)
Sent: Sunday, August 15, 2010 11:25 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Checkpointing mpi4py program
Importance: High



Josh

I tried running the mpi4py program with the latest trunk version of
openmpi. I have compiled openmpi-1.7a1r23596 from trunk and recompiled
mpi4py to use this library. Unfortunately I see the same behavior as I
have seen with openmpi 1.4.2 ie; checkpoint will be successful but the
program doesn't proceed after that.

I have attached the stack traces of all the MPI processes that are part
of the mpirun. I would really appreciate it if you could take a look at the
stack trace and let me know the potential problem. I am kind of stuck at this
point and need your assistance to move forward. Please let me know if
you need any additional information.

Thanks for your time in advance

Thanks

Ananda

-Original Message-
Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-13 12:28:31

Nope. I probably won't get to it for a while. I'll let you know if I do.


On Aug 13, 2010, at 12:17 PM, 
 wrote:

> OK, I will do that.
>
> But did you try this program on a system where the latest trunk is
> installed? Were you successful in checkpointing?
>
> - Ananda
> -Original Message-
> Message: 9
> Date: Fri, 13 Aug 2010 10:21:29 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> To: Open MPI Users 
> Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> I probably won't have an opportunity to work on reproducing this on
the
> 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> backported to the 1.4 series (things have changed too much since that
> branch). So I would suggest trying the 1.5 series.
>
> -- Josh
>
> On Aug 13, 2010, at 10:12 AM, 
>  wrote:
>
>> Josh
>>
>> I am having problems compiling the sources from the latest trunk. It
>> complains of libgomp.spec missing even though that file exists on my
>> system. I will see if I have to change any other environment
variables
>> to have a successful compilation. I will keep you posted.
>>
>> BTW, were you successful in reproducing the problem on a system with
>> OpenMPI 1.4.2?
>>
>> Thanks
>> Ananda
>> -Original Message-
>> Date: Thu, 12 Aug 2010 09:12:26 -0400
>> From: Joshua Hursey 

Re: [OMPI users] Abort

2010-08-16 Thread David Ronis
I've tried both--as you said, MPI_Abort doesn't drop a core file, but
does kill off the entire MPI job.   abort() drops core when I'm running
on 1 processor, but not in a multiprocessor run.  In addition, a node
calling abort() doesn't lead to the entire run being killed off.

David
On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> 
> > I'm using mpirun and the nodes are all on the same machine (an 8-cpu box
> > with an Intel i7).  coresize is unlimited:
> > 
> > ulimit -a
> > core file size  (blocks, -c) unlimited
> 
> That looks good.
> 
> In reviewing the email thread, it's not entirely clear: are you calling 
> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() should.
> 



Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Jeff Squyres
Sorry for the delay in replying.

Odd; the values of the callback function pointer should never be 0.  This seems 
to suggest some kind of memory corruption is occurring.

I don't know if it's possible, because the stack trace looks like you're 
calling through python, but can you run this application through valgrind, or 
some other memory-checking debugger?


On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:

> Hi,
> 
> sorry, i just forgot to add the values of the function parameters:
> (gdb) print reg->cbdata
> $1 = (void *) 0x0
> (gdb) print openib_btl->super
> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, 
> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, 
> btl_rdma_pipeline_send_length = 1048576,
>   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 
> 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, 
> btl_flags = 310,
>   btl_add_procs = 0x2b341eb8ee47 , btl_del_procs = 
> 0x2b341eb90156 , btl_register = 0, btl_finalize = 
> 0x2b341eb93186 ,
>   btl_alloc = 0x2b341eb90a3e , btl_free = 
> 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813 
> ,
>   btl_prepare_dst = 0x2b341eb91f2e , btl_send = 
> 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d 
> ,
>   btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e 
> , btl_dump = 0x2b341acd45cb , 
> btl_mpool = 0xf3f4110,
>   btl_register_error = 0x2b341eb90565 , 
> btl_ft_event = 0x2b341eb952e7 }
> (gdb) print hdr->tag
> $3 = 0 '\0'
> (gdb) print des
> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> (gdb) print reg->cbfunc
> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> 
> Eloi
> 
> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > Hi,
> >
> > Here is the output of a core file generated during a segmentation fault
> > observed during a collective call (using openib):
> >
> > #0  0x in ?? ()
> > (gdb) where
> > #0  0x in ?? ()
> > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
> > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
> > 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8  0x2aedb859fa31
> > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
> > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
> > op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
> > coll_tuned_decision_fixed.c:63
> > #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
> > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
> > FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
> > count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
> > stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
> > {
> > = {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
> > {pi_ = 0x6}}}, }) at interface.cpp:371
> > #14 0x040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors
> > (dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
> > check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
> > FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
> > 0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
> > 0x039acdd4 in PyField_align_elements (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
> > 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*,
> > _object*, _object*)>::exec<_object> (this=0x7279dc20, s=0x0,
> > po1=0x2aaab0765050, po2=0x19d2e950) at
> > /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> > #18 0x039acc82 in PyField_align_elements_ewrap (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39 #19
> > 0x044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag= > optimized out>) at Python/ceval.c:3921 #20 0x0440aae9 in
> > PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=,
> > locals=, args=0x3, argcount=1, 

Re: [OMPI users] Abort

2010-08-16 Thread Jeff Squyres
On Aug 13, 2010, at 12:53 PM, David Ronis wrote:

> I'm using mpirun and the nodes are all on the same machine (an 8-cpu box
> with an Intel i7).  coresize is unlimited:
> 
> ulimit -a
> core file size  (blocks, -c) unlimited

That looks good.

In reviewing the email thread, it's not entirely clear: are you calling abort() 
or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() should.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread Eugene Loh




sun...@chem.iitb.ac.in wrote:
>> sun...@chem.iitb.ac.in wrote:
>>> Dear Open-mpi users,
>>>
>>> I installed openmpi-1.4.1 in my user area and then set the path for
>>> openmpi in the .bashrc file as follow. However, am still getting
>>> following
>>> error message whenever am starting the parallel molecular dynamics
>>> simulation using GROMACS. So every time am starting the MD job, I need
>>> to
>>> source the .bashrc file again.
>>
>> Have you set OPAL_PREFIX to /home/sunitap/soft/openmpi?
>
> How to set OPAL_PREFIX?
> During the installation of openmpi, I ran configure with
> --prefix=/home/sunitap/soft/openmpi
> Did you mean this?

No.  The "OPAL_PREFIX" step occurs after you configure, build, and
install OMPI.  At the time that you run MPI programs, set the
"OPAL_PREFIX" environment variable to /home/sunitap/soft/openmpi.  The
syntax depends on your shell.  E.g., for csh:

setenv OPAL_PREFIX /home/sunitap/soft/openmpi

The sequence might be something like this:

./configure --prefix=/home/sunitap/soft/openmpi
make
make install
cd /home/sunitap/soft/openmpi/examples
mpicc connectivity_c.c
setenv OPAL_PREFIX /home/sunitap/soft/openmpi
mpirun -n 2 ./connectivity_c

though I didn't check all those commands out.
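Since your job appears to use bash (the settings are in .bashrc), the bash
equivalent of the setenv line would presumably be:

export OPAL_PREFIX=/home/sunitap/soft/openmpi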




Re: [OMPI users] How to program with openmpi under MFC?

2010-08-16 Thread Shiqing Fan

 Hi,

Sorry for the late answer.

I've checked your source code, and I didn't find anything wrong; 
everything works just fine with the Open MPI trunk version. Could you tell 
me which version you used, so that I can debug with your generated 
MPI libs?


By the way, I noticed that you put MPI_Init, MPI_Comm_rank, 
MPI_Comm_size, and MPI_Finalize in the function that will be triggered 
by the button press. This is only fine with the first button click, but 
not for the second click and so on. It's better to place them in the 
class initialization and finalization, so that they will only be 
executed once.
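A minimal sketch of that arrangement is below; the application class name 
CSchedulerApp is only an assumption, so substitute your project's 
CWinApp-derived class:

// Sketch only: initialize MPI once at application start-up and finalize it
// once at exit, so button handlers may call MPI_Comm_rank/size in between.
#include <afxwin.h>   // MFC core
#include <mpi.h>

class CSchedulerApp : public CWinApp
{
public:
    virtual BOOL InitInstance()
    {
        MPI_Init(NULL, NULL);              // MPI-2 allows NULL for argc/argv
        return CWinApp::InitInstance();
    }
    virtual int ExitInstance()
    {
        MPI_Finalize();
        return CWinApp::ExitInstance();
    }
};

CSchedulerApp theApp;                      // the one global application object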



Regards,
Shiqing



On 2010-8-12 1:20 PM, lyb wrote:

Hi,
Some other information to supply: the function breaks at the 3rd ASSERT. 
I'll send you the picture. Thanks.



Hello,
the message is:
Unhandled exception at 0x7835b701 (mfc80ud.dll) : 0xC005: conflict 
while reading 0xf78e9e00.


thanks.


 Hi,

I personally haven't tried to program MPI with MFC, but in principle 
it should work. What kind of error did you get? Was there any error 
message? Thanks.


Shiqing



On 2010-8-12 9:13 AM, lyb wrote:

Hi,

I have an MFC project and need to add MPI functions to it, and I 
chose Open MPI.

But I searched all of the mailing list and could not find the answer.

And I tried to call MPI functions under MFC, as follows:

int ompi_test(int *argc, char **argv)
{
int rank, size;

MPI_Init(argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello, world, I am %d of %d\n", rank, size);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();

return 0;
}
void CSchedulerDlg::OnBnClickedButton1()
{
ompi_test(NULL, NULL);
}

but it breaks at MPI_Init(argc, &argv);.

So what should I do?
Can anybody help me?

Thanks in advance.

Best Regards.



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users











--
Shiqing Fan  http://www.hlrs.de/people/fan
High Performance Computing   Tel.: +49 711 685 87234
  Center Stuttgart (HLRS)    Fax.: +49 711 685 65832
Address: Allmandring 30      email: f...@hlrs.de
70569 Stuttgart



Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread Addepalli, Srirangam V
Try 

env | grep LD_LIBRARY_PATH

Does it show /home/sunitap/soft/openmpi/lib in your library path?

I have a similar installation. This is how my LD_LIBRARY_PATH looks.

LD_LIBRARY_PATH=/lustre/work/apps/gromacs-testgar/lib:/lustre/work/apps/gromacs-mkl/lib:/lustre/work/apps/openmpi-testgar/lib:/opt/intel/Compiler/11.1/064/lib/intel64:/opt/intel/Compiler/11.1/064/mkl/lib/em64t:/opt/intel/Compiler/11.1/064/lib/intel64:/opt/intel/Compiler/11.1/064/mkl/lib/em64t:/opt/gridengine/lib/lx26-amd64

Rangam

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
sun...@chem.iitb.ac.in [sun...@chem.iitb.ac.in]
Sent: Monday, August 16, 2010 1:24 AM
To: Open MPI Users
Subject: Re: [OMPI users] problem with .bashrc setting of openmpi

> Hello Sunitha,
> If you have admin privileges on this system add library path to
>  /etc/ld.so.conf
I don't have admin privileges.
>
> eg: echo "/home/sunitap/soft/openmpi/lib" >> /etc/ld.so.conf
>
> ldconfig
>
> Rangam
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of
> sun...@chem.iitb.ac.in [sun...@chem.iitb.ac.in]
> Sent: Monday, August 16, 2010 12:28 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] problem with .bashrc setting of openmpi
>
> Hi,
>
>> hello Sunita,
>>
>> what linux distribution is this?
> The linux distribution is Red Hat Enterprise Linux Server release 5.5
> (Tikanga)
>>
>> On Fri, Aug 13, 2010 at 1:57 AM,  wrote:
>>
> Thanks,
> Sunita
>
>>> Dear Open-mpi users,
>>>
>>> I installed openmpi-1.4.1 in my user area and then set the path for
>>> openmpi in the .bashrc file as follow. However, am still getting
>>> following
>>> error message whenever am starting the parallel molecular dynamics
>>> simulation using GROMACS. So every time am starting the MD job, I need
>>> to
>>> source the .bashrc file again.
>>>
>>> Earlier in some other machine I did the same thing and was not getting
>>> any
>>> problem.
>>>
>>> Could you guys suggest what would be the problem?
>>>
>>> .bashrc
>>> #path for openmpi
>>> export PATH=$PATH:/home/sunitap/soft/openmpi/bin
>>> export CFLAGS="-I/home/sunitap/soft/openmpi/include"
>>> export LDFLAGS="-L/home/sunitap/soft/openmpi/lib"
>>> export LD_LIBRARY_PATH=/home/sunitap/soft/openmpi/lib:$LD_LIBRARY_PATH
>>>
>>> == error message ==
>>> mdrun_mpi: error while loading shared libraries: libmpi.so.0: cannot
>>> open
>>> shared object file: No such file or directory
>>>
>>> 
>>>
>>> Thanks for any help.
>>> Best regards,
>>> Sunita
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread sunita
> Hello Sunitha,
> If you have admin privileges on this system add library path to
>  /etc/ld.so.conf
I don't have admin privileges.
>
> eg: echo "/home/sunitap/soft/openmpi/lib" >> /etc/ld.so.conf
>
> ldconfig
>
> Rangam
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of
> sun...@chem.iitb.ac.in [sun...@chem.iitb.ac.in]
> Sent: Monday, August 16, 2010 12:28 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] problem with .bashrc setting of openmpi
>
> Hi,
>
>> hello Sunita,
>>
>> what linux distribution is this?
> The linux distribution is Red Hat Enterprise Linux Server release 5.5
> (Tikanga)
>>
>> On Fri, Aug 13, 2010 at 1:57 AM,  wrote:
>>
> Thanks,
> Sunita
>
>>> Dear Open-mpi users,
>>>
>>> I installed openmpi-1.4.1 in my user area and then set the path for
>>> openmpi in the .bashrc file as follow. However, am still getting
>>> following
>>> error message whenever am starting the parallel molecular dynamics
>>> simulation using GROMACS. So every time am starting the MD job, I need
>>> to
>>> source the .bashrc file again.
>>>
>>> Earlier in some other machine I did the same thing and was not getting
>>> any
>>> problem.
>>>
>>> Could you guys suggest what would be the problem?
>>>
>>> .bashrc
>>> #path for openmpi
>>> export PATH=$PATH:/home/sunitap/soft/openmpi/bin
>>> export CFLAGS="-I/home/sunitap/soft/openmpi/include"
>>> export LDFLAGS="-L/home/sunitap/soft/openmpi/lib"
>>> export LD_LIBRARY_PATH=/home/sunitap/soft/openmpi/lib:$LD_LIBRARY_PATH
>>>
>>> == error message ==
>>> mdrun_mpi: error while loading shared libraries: libmpi.so.0: cannot
>>> open
>>> shared object file: No such file or directory
>>>
>>> 
>>>
>>> Thanks for any help.
>>> Best regards,
>>> Sunita
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>




Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread sunita
Hi,


> sun...@chem.iitb.ac.in wrote:
>> Dear Open-mpi users,
>>
>> I installed openmpi-1.4.1 in my user area and then set the path for
>> openmpi in the .bashrc file as follow. However, am still getting
>> following
>> error message whenever am starting the parallel molecular dynamics
>> simulation using GROMACS. So every time am starting the MD job, I need
>> to
>> source the .bashrc file again.
>>
>> Earlier in some other machine I did the same thing and was not getting
>> any
>> problem.
>>
>> Could you guys suggest what would be the problem?
>>
>>
> Have you set OPAL_PREFIX to /home/sunitap/soft/openmpi?
How to set OPAL_PREFIX?
During the installation of openmpi, I ran configure with
--prefix=/home/sunitap/soft/openmpi
Did you mean this?
>
> If you do a ldd on mdrun_mpi does libmpi.so.0 come up not found?

I got libmpi.so.0 not found. The output I got is
=
ldd `which mdrun_mpi`
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0039e180)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0039dd60)
libm.so.6 => /lib64/libm.so.6 (0x0039d5e0)
libSM.so.6 => /usr/lib64/libSM.so.6 (0x0039d960)
libICE.so.6 => /usr/lib64/libICE.so.6 (0x0039da60)
libX11.so.6 => /usr/lib64/libX11.so.6 (0x0039d720)
libmpi.so.0 => not found
libopen-rte.so.0 => not found
libopen-pal.so.0 => not found
libdl.so.2 => /lib64/libdl.so.2 (0x0039d620)
libutil.so.1 => /lib64/libutil.so.1 (0x0039e4a0)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0039d660)
libc.so.6 => /lib64/libc.so.6 (0x0039d5a0)
libz.so.1 => /usr/lib64/libz.so.1 (0x0039d6a0)
libXau.so.6 => /usr/lib64/libXau.so.6 (0x0039d6e0)
libXdmcp.so.6 => /usr/lib64/libXdmcp.so.6 (0x0039d760)
/lib64/ld-linux-x86-64.so.2 (0x0039d560)
=
> If so and there truly is a libmpi.so.0 in /home/sunitap/soft/openmpi/lib
> you may want to make sure the bitness of libmpi.so.0 and mdrun_mpi are
> the same by
> doing a file command on both.
>

The file command on both gives following output.
file ~/soft/gromacs/bin/mdrun_mpi
/home/sunitap/soft/gromacs/bin/mdrun_mpi: ELF 64-bit LSB executable, AMD
x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses
shared libs), for GNU/Linux 2.6.9, not stripped

file /home/sunitap/soft/openmpi/lib/libmpi.so.0
/home/sunitap/soft/openmpi/lib/libmpi.so.0: symbolic link to
`libmpi.so.0.0.1'


Thanks.
Sunita
> --td
>> .bashrc
>> #path for openmpi
>> export PATH=$PATH:/home/sunitap/soft/openmpi/bin
>> export CFLAGS="-I/home/sunitap/soft/openmpi/include"
>> export LDFLAGS="-L/home/sunitap/soft/openmpi/lib"
>> export LD_LIBRARY_PATH=/home/sunitap/soft/openmpi/lib:$LD_LIBRARY_PATH
>>
>> == error message ==
>> mdrun_mpi: error while loading shared libraries: libmpi.so.0: cannot
>> open
>> shared object file: No such file or directory
>>
>> 
>>
>> Thanks for any help.
>> Best regards,
>> Sunita
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Oracle
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.650.633.7054
> Oracle * - Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com 
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread Addepalli, Srirangam V
Hello Sunitha,
If you have admin privileges on this system add library path to
 /etc/ld.so.conf

eg: echo "/home/sunitap/soft/openmpi/lib" >> /etc/ld.so.conf

ldconfig

Rangam

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
sun...@chem.iitb.ac.in [sun...@chem.iitb.ac.in]
Sent: Monday, August 16, 2010 12:28 AM
To: Open MPI Users
Subject: Re: [OMPI users] problem with .bashrc setting of openmpi

Hi,

> hello Sunita,
>
> what linux distribution is this?
The linux distribution is Red Hat Enterprise Linux Server release 5.5
(Tikanga)
>
> On Fri, Aug 13, 2010 at 1:57 AM,  wrote:
>
Thanks,
Sunita

>> Dear Open-mpi users,
>>
>> I installed openmpi-1.4.1 in my user area and then set the path for
>> openmpi in the .bashrc file as follow. However, am still getting
>> following
>> error message whenever am starting the parallel molecular dynamics
>> simulation using GROMACS. So every time am starting the MD job, I need
>> to
>> source the .bashrc file again.
>>
>> Earlier in some other machine I did the same thing and was not getting
>> any
>> problem.
>>
>> Could you guys suggest what would be the problem?
>>
>> .bashrc
>> #path for openmpi
>> export PATH=$PATH:/home/sunitap/soft/openmpi/bin
>> export CFLAGS="-I/home/sunitap/soft/openmpi/include"
>> export LDFLAGS="-L/home/sunitap/soft/openmpi/lib"
>> export LD_LIBRARY_PATH=/home/sunitap/soft/openmpi/lib:$LD_LIBRARY_PATH
>>
>> == error message ==
>> mdrun_mpi: error while loading shared libraries: libmpi.so.0: cannot
>> open
>> shared object file: No such file or directory
>>
>> 
>>
>> Thanks for any help.
>> Best regards,
>> Sunita
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread Manik Mayur
Hi Sunita,

have you tried running "ldconfig"?

Manik Mayur




2010/8/16  :
> Hi,
>
>> hello Sunita,
>>
>> what linux distribution is this?
> The linux distribution is Red Hat Enterprise Linux Server release 5.5
> (Tikanga)
>>
>> On Fri, Aug 13, 2010 at 1:57 AM,  wrote:
>>
> Thanks,
> Sunita
>
>>> Dear Open-mpi users,
>>>
>>> I installed openmpi-1.4.1 in my user area and then set the path for
>>> openmpi in the .bashrc file as follow. However, am still getting
>>> following
>>> error message whenever am starting the parallel molecular dynamics
>>> simulation using GROMACS. So every time am starting the MD job, I need
>>> to
>>> source the .bashrc file again.
>>>
>>> Earlier in some other machine I did the same thing and was not getting
>>> any
>>> problem.
>>>
>>> Could you guys suggest what would be the problem?
>>>
>>> .bashrc
>>> #path for openmpi
>>> export PATH=$PATH:/home/sunitap/soft/openmpi/bin
>>> export CFLAGS="-I/home/sunitap/soft/openmpi/include"
>>> export LDFLAGS="-L/home/sunitap/soft/openmpi/lib"
>>> export LD_LIBRARY_PATH=/home/sunitap/soft/openmpi/lib:$LD_LIBRARY_PATH
>>>
>>> == error message ==
>>> mdrun_mpi: error while loading shared libraries: libmpi.so.0: cannot
>>> open
>>> shared object file: No such file or directory
>>>
>>> 
>>>
>>> Thanks for any help.
>>> Best regards,
>>> Sunita
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] problem with .bashrc setting of openmpi

2010-08-16 Thread sunita
Hi,

> hello Sunita,
>
> what linux distribution is this?
The linux distribution is Red Hat Enterprise Linux Server release 5.5
(Tikanga)
>
> On Fri, Aug 13, 2010 at 1:57 AM,  wrote:
>
Thanks,
Sunita

>> Dear Open-mpi users,
>>
>> I installed openmpi-1.4.1 in my user area and then set the path for
>> openmpi in the .bashrc file as follow. However, am still getting
>> following
>> error message whenever am starting the parallel molecular dynamics
>> simulation using GROMACS. So every time am starting the MD job, I need
>> to
>> source the .bashrc file again.
>>
>> Earlier in some other machine I did the same thing and was not getting
>> any
>> problem.
>>
>> Could you guys suggest what would be the problem?
>>
>> .bashrc
>> #path for openmpi
>> export PATH=$PATH:/home/sunitap/soft/openmpi/bin
>> export CFLAGS="-I/home/sunitap/soft/openmpi/include"
>> export LDFLAGS="-L/home/sunitap/soft/openmpi/lib"
>> export LD_LIBRARY_PATH=/home/sunitap/soft/openmpi/lib:$LD_LIBRARY_PATH
>>
>> == error message ==
>> mdrun_mpi: error while loading shared libraries: libmpi.so.0: cannot
>> open
>> shared object file: No such file or directory
>>
>> 
>>
>> Thanks for any help.
>> Best regards,
>> Sunita
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Checkpointing mpi4py program

2010-08-16 Thread ananda.mudar
Josh

I tried running the mpi4py program with the latest trunk version of
openmpi. I have compiled openmpi-1.7a1r23596 from trunk and recompiled
mpi4py to use this library. Unfortunately I see the same behavior as I
have seen with openmpi 1.4.2 ie; checkpoint will be successful but the
program doesn't proceed after that.

I have attached the stack traces of all the MPI processes that are part
of the mpirun. I would really appreciate it if you could take a look at the
stack trace and let me know the potential problem. I am kind of stuck at this
point and need your assistance to move forward. Please let me know if
you need any additional information.

Thanks for your time in advance

Thanks

Ananda

-Original Message-
Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-13 12:28:31

Nope. I probably won't get to it for a while. I'll let you know if I do.


On Aug 13, 2010, at 12:17 PM, 
 wrote:

> OK, I will do that.
>
> But did you try this program on a system where the latest trunk is
> installed? Were you successful in checkpointing?
>
> - Ananda
> -Original Message-
> Message: 9
> Date: Fri, 13 Aug 2010 10:21:29 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> To: Open MPI Users 
> Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> I probably won't have an opportunity to work on reproducing this on
the
> 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> backported to the 1.4 series (things have changed too much since that
> branch). So I would suggest trying the 1.5 series.
>
> -- Josh
>
> On Aug 13, 2010, at 10:12 AM, 
>  wrote:
>
>> Josh
>>
>> I am having problems compiling the sources from the latest trunk. It
>> complains of libgomp.spec missing even though that file exists on my
>> system. I will see if I have to change any other environment
variables
>> to have a successful compilation. I will keep you posted.
>>
>> BTW, were you successful in reproducing the problem on a system with
>> OpenMPI 1.4.2?
>>
>> Thanks
>> Ananda
>> -Original Message-
>> Date: Thu, 12 Aug 2010 09:12:26 -0400
>> From: Joshua Hursey 
>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>> To: Open MPI Users 
>> Message-ID: <1F1445AB-9208-4EF0-AF25-5926BD53C7E1_at_[hidden]>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Can you try this with the current trunk (r23587 or later)?
>>
>> I just added a number of new features and bug fixes, and I would be
>> interested to see if it fixes the problem. In particular I suspect
> that
>> this might be related to the Init/Finalize bounding of the checkpoint

>> region.
>>
>> -- Josh
>>
>> On Aug 10, 2010, at 2:18 PM, 
>>  wrote:
>>
>>> Josh
>>>
>>> Please find attached is the python program that reproduces the hang
>> that
>>> I described. Initial part of this file describes the prerequisite
>>> modules and the steps to reproduce the problem. Please let me know
if
>>> you have any questions in reproducing the hang.
>>>
>>> Please note that, if I add the following lines at the end of the
>> program
>>> (in case sleep_time is True), the problem disappears ie; program
>> resumes
>>> successfully after successful completion of checkpoint.
>>> # Add following lines at the end for sleep_time is True
>>> else:
>>> time.sleep(0.1)
>>> # End of added lines
>>>
>>>
>>> Thanks a lot for your time in looking into this issue.
>>>
>>> Regards
>>> Ananda
>>>
>>> Ananda B Mudar, PMP
>>> Senior Technical Architect
>>> Wipro Technologies
>>> Ph: 972 765 8093
>>> ananda.mudar_at_[hidden]
>>>
>>>
>>> -Original Message-
>>> Date: Mon, 9 Aug 2010 16:37:58 -0400
>>> From: Joshua Hursey 
>>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>>> To: Open MPI Users 
>>> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
>>> Content-Type: text/plain; charset=windows-1252
>>>
>>> I have not tried to checkpoint an mpi4py application, so I cannot
say
>>> for sure if it works or not. You might be hitting something with the

>>> Python runtime interacting in an odd way with either Open MPI or
> BLCR.
>>>
>>> Can you attach a debugger and get a backtrace on a stuck checkpoint?

>>> That might show us where things are held up.
>>>
>>> -- Josh
>>>
>>>
>>> On Aug 9, 2010, at 4:04 PM, 
>>>  wrote:
>>>
 Hi

 I have integrated mpi4py with openmpi 1.4.2 that was built with
BLCR
>>> 0.8.2. When I run ompi-checkpoint on the program