I think I have this figured out - will fix on Monday. I'm not sure why
Jeff's conditions are all required, especially the second one.
However, the fundamental problem is that we pull information from two
sources regarding the number of procs in the job when unpacking a
buffer, and the two sources appear to be out-of-sync with each other
in certain scenarios.
The details are beyond the scope of the user list. I'll respond here again once I get it fixed.
Ralph
On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
Unfortunately, I think I have reproduced the problem as well -- with
SVN trunk HEAD (r20655):
[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Notice that I'm not trying to run an MPI app -- it's just "uptime".
The following things seem to be necessary to make this error occur
for me:
1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of my SLURM allocation
If I remove any of those, it seems to run fine. A condensed reproduction sketch is below.
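For reference, the whole sequence is roughly this (the allocation size is just illustrative):

  salloc -N 4                                    # SLURM allocation larger than -np below
  mpirun --mca bogus foo --bynode -np 2 uptime   # any MCA param + --bynode + small -np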
On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
With further investigation, I have reproduced this problem. I think I was originally testing against a version that was not recent enough. I do not see it with r20594, which is from February 19, so something must have changed over the last 8 days. I will try to narrow down the issue.
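If it helps, something along these lines should narrow it down (the midpoint revision is just an example):

  svn log -r 20594:20655 .              # list the commits in that window
  svn up -r 20620 && make all install   # rebuild at a midpoint revision
  # re-run the failing case and repeat, halving the range each time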
Rolf
On 02/27/09 09:34, Rolf Vandevaart wrote:
I just tried trunk-1.4a1r20458 and I did not see this error, although my configuration was rather different: I ran across 100 2-CPU SPARC nodes, np=256, connected with TCP.
Hopefully George's comment helps out with this issue.
One other thought: to see whether SGE has anything to do with this, create a hostfile and run outside of SGE.
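For example, something like this (hostnames and slot counts are made up):

  cat > myhosts <<EOF
  node01 slots=8
  node02 slots=8
  EOF
  mpirun --hostfile myhosts -np 16 namd2 stmv.namd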
Rolf
On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh
launchers, and everything checks out fine. However, this is
running under SGE and thus using qrsh, so it is possible the SGE
support is having a problem.
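One quick way to take qrsh out of the picture might be to force the rsh launcher instead, keeping the rest of the command line the same (just a sketch):

  mpirun --mca plm rsh -np 256 --mca btl sm,openib,self ... namd2 stmv.namd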
Perhaps one of the Sun OMPI developers can help here?
Ralph
On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
It looks like the system doesn't know what nodes the procs are
to be placed upon. Can you run this with --display-devel-map?
That will tell us where the system thinks it is placing things.
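For example, just add the flag to your existing command line (abbreviated here; everything else stays the same):

  mpirun --display-devel-map -np 256 --mca btl sm,openib,self ... namd2 stmv.namd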
Thanks
Ralph
On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
Maybe it's my pine mailer.
This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD Shanghai nodes, running a standard benchmark called stmv.
The basic error message, which occurs 31 times, looks like:
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
The mpirun command has long paths in it, sorry. It invokes a special binding script which in turn launches the NAMD run. This works with an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128, and 512 procs) but not for this 256-proc run, where the older SVN hangs indefinitely polling some completion (sm or openib). So I was trying later SVNs with this 256-proc run, hoping the error would go away.
Here's some of the invocation again. Hope you can read it:
EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
and, unexpanded:
mpirun --prefix $PREFIX -np %PE% $MCA \
  -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
  -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
  -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
and, expanded:
mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
  -np 256 --mca btl sm,openib,self \
  -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
  -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
  -machinefile /tmp/48292.1.all.q/newhosts \
  /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
  /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 \
  stmv.namd
This is all via Sun Grid Engine.
The OS as indicated above is SuSE SLES 10 SP2.
DM
On Thu, 26 Feb 2009, Ralph Castain wrote:
I'm sorry, but I can't make any sense of this message. Could you provide a little explanation of what you are doing, what the system looks like, what is supposed to happen, etc.? I can barely parse your cmd line...
Thanks
Ralph
On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
Today's and yesterday's.
1.4a1r20643_svn
+ mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
    -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma
    -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
    -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts
    /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
    /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
    stmv.namd
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
Built with Intel compilers 10.1.015.
Regards,
Mostyn
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
--
Jeff Squyres
Cisco Systems