Finally it worked, thanks!
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ ompi_info --param btl openib
--level 5 | grep openib_flags
MCA btl openib: parameter "btl_openib_flags" (current value:
"65847", data source: default, level: 5 tuner/detail, type: unsigned_int)
[mishima@manage OMB-3.1.1
The latest patch also causes a segfault...
By the way, I found a typo as below. &ca_pml_ob1.use_all_rdma in the last
line should be &mca_pml_ob1.use_all_rdma:
+mca_pml_ob1.use_all_rdma = false;
+(void) mca_base_component_var_register
(&mca_pml_ob1_component.pmlm_version, "use_all_rdma",
+
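For reference, a minimal sketch of how the corrected registration could look, assuming the usual mca_base_component_var_register() signature; the description string and info level below are placeholders, not the actual patch text:

  /* Sketch only: the point is the final argument, &mca_pml_ob1.use_all_rdma
   * (not &ca_pml_ob1.use_all_rdma); description and level are illustrative. */
  mca_pml_ob1.use_all_rdma = false;
  (void) mca_base_component_var_register (&mca_pml_ob1_component.pmlm_version,
          "use_all_rdma",
          "Use all RDMA-capable BTLs for the RDMA protocol, not only the eager ones",
          MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
          OPAL_INFO_LVL_5, MCA_BASE_VAR_SCOPE_READONLY,
          &mca_pml_ob1.use_all_rdma);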
I understood. Thanks.
Tetsuya Mishima
On 2016/08/09 11:33:15, "devel" wrote in "Re: [OMPI devel] sm BTL performace of
the openmpi-2.0.0":
> I will add a control to have the new behavior or using all available RDMA
btls or just the eager ones for the RDMA protocol. The flags will remain as
they are. And
Then, my understanding is that you will restore the default value of
btl_openib_flags to the previous one (= 310) and add a new MCA parameter to
control HCA inclusion for such a situation. The workaround so far for
openmpi-2.0.0 is setting those flags manually. Right?
Tetsuya Mishima
2016/08/09 9:5
Hi, unfortunately it doesn't work well. The previous one was much
better ...
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -report-bindings
osu_bw
[manage.cluster:25107] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]],
Hi, here is the gdb output for additional information:
(It might be inexact, because I built openmpi-2.0.0 without the debug option.)
Core was generated by `osu_bw'.
Program terminated with signal 11, Segmentation fault.
#0 0x0031d9008806 in ?? () from /lib64/libgcc_s.so.1
(gdb) where
#0 0x0
Hi, it caused a segfault as below:
[manage.cluster:25436] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]],
socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]:
[B/B/B/B/B/B][./././././.]
[manage.cluster:25436] MCW rank 1 bound to s
Hi,
I applied the patch to the file "pml_ob1_rdma.c" and ran osu_bw again.
Then, I still see the bad performance for larger sizes (>= 2097152).
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -report-bindings
osu_bw
[manage.cluster:27444] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
0[
Hi Christoph,
I applied the commits - pull/#1250 as Nathan told me and added "-mca
btl_openib_flags 311" to the mpirun command line option, then it worked for
me. I don't know the reason, but it looks like ATOMIC_FOP in the
btl_openib_flags degrades the sm/vader performance.
Regards,
Tetsuya Mishima
Hi Nathan,
You gave me a hint, thanks!
I applied your patches and added "-mca btl_openib_flags 311" to the mpirun
option, then it worked for me.
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca btl_openib_flags
311 -bind-to core -report-bindings osu_bw
[manage.cluster:21733] MCW rank 0
Hi Gilles,
I confirmed that vader is used when I don't specify any BTL, as you pointed
out!
Regards,
Tetsuya Mishima
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 --mca
btl_base_verbose 10 -bind-to core -report-bindings osu_bw
[manage.cluster:20006] MCW rank 0 bound to socket 0[core 0[hwt
Hi,
Thanks. I will try it and report later.
Tetsuya Mishima
On 2016/07/27 9:20:28, "devel" wrote in "Re: [OMPI devel] sm BTL performace of
the openmpi-2.0.0":
> sm is deprecated in 2.0.0 and will likely be removed in favor of vader in
2.1.0.
>
> This issue is probably this known issue:
https://github
Hi Gilles,
Thanks. I ran again with --mca pml ob1, but I got the same results as
below:
[mishima@manage OMB-3.1.1-openmpi2.0.0]$ mpirun -np 2 -mca pml ob1 -bind-to
core -report-bindings osu_bw
[manage.cluster:18142] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././.][./././././.]
[manage
Hi folks,
I saw a performance degradation of openmpi-2.0.0 when I ran our application
on a node (12 cores). So I did 4 tests using osu_bw as below:
1: mpirun -np 2 osu_bw                    bad (30% of test 2)
2: mpirun -np 2 -mca btl self,sm osu_bw   good (same as openmpi-1.10.3)
3: mpir
Hi Gilles san, thank you for your quick comment. I fully understand the
meaning of the warning. Regarding the question you raised, I'm afraid that
I'm not sure which solution is better ...
Regards,
Tetsuya Mishima
On 2016/07/07 14:13:02, "devel" wrote in "Re: [OMPI devel] v2.0.0rc4 is
released":
> This
Hi Jeff, sorry for a very short report. I saw the warning below
at the end of installation of openmpi-2.0.0rc4. Is this okay?
$ make install
...
make install-exec-hook
make[3]: Entering directory
`/home/mishima/mis/openmpi/openmpi-pgi16.5/openmpi-2.0.0rc4'
WARNING! Common symbols found:
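For context, a "common" symbol here is a C global defined without an initializer (a tentative definition), which the install hook scans the built objects for. A tiny, hypothetical illustration with made-up variable names:

  /* hypothetical example, not from the OMPI sources */
  int pending_requests;      /* tentative definition -> emitted as a common symbol */
  int max_requests = 0;      /* explicit initializer -> ordinary definition, no warning */
  static int local_count;    /* internal linkage -> not a common symbol either */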
Hi Devendar,
As far as I know, the report-bindings option shows the logical
cpu order. On the other hand, you are talking about the physical one,
I guess.
Regards,
Tetsuya Mishima
On 2015/04/21 9:04:37, "devel" wrote in "Re: [OMPI devel] binding output
error":
> HT is not enabled. All node are same topo
Gilles,
Your patch looks good to me and I think this issue should be fixed
in the upcoming openmpi-1.8.3. Could you commit it to the trunk and
create a CMR for it?
Tetsuya
> Mishima-san,
>
> the root cause is macro expansion does not always occur as one would
> have expected ...
>
> could you pl
Gilles,
Thank you for your fix. I successfully compiled it with PGI, although
I could not check it by running an actual test.
Tetsuya
> Mishima-san,
>
> the root cause is macro expansion does not always occur as one would
> have expected ...
>
> could you please give a try to the attached patch
Hi folks,
I tried to build openmpi-1.8.2 with PGI Fortran and the -i8 (64-bit Fortran integer)
option
as shown below:
./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.8.2-pgi14.7_int64 \
--enable-abi-breaking-fortran-status-i8-fix \
--with-tm \
--with-verbs \
--disable-ipv6 \
CC=pgcc CFLAGS="-tp k8-
Hi Ralph,
I confirmed that the openib issue was really fixed by r32395
and hope you'll be able to release the final version soon.
Tetsuya
> Kewl - the openib issue has been fixed in the nightly tarball. I'm
waiting for review of a couple of pending CMRs, then we'll release a quick
rc4 and move
I confirmed openmpi-1.8.2rc3 with PGI-14.7 worked fine for me
except for the openib issue reported by Mike Dubman.
Tetsuya Mishima
> Sorry, finally got through all this ompi email and see this problem was
fixed.
>
> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On
Hi Paul,
Thank you for your investigation. I'm sure it's very
close to fixing the problem, although I myself can't do
that. So I owe you something...
Please try Awamori, which is Okinawa's sake and very
good on such a hot day.
Tetsuya
> On Wed, Jul 30, 2014 at 8:53 PM, Paul Hargrove wrote:
>
Paul and Jeff,
I additionally installed PGI14.4 and checked the behavior.
Then, I confirmed that both versions create the same results.
PGI14.7:
[mishima@manage work]$ mpif90 test.f -o test.ex --showme
pgfortran test.f -o test.ex
-I/home/mishima/opt/mpi/openmpi-1.8.2rc2-pgi14.7/include
-I/home/mishim
Hi Paul, thank you for your comment.
I don't think my mpi_f08.mod is an older one, because the time stamp is
equal to the time when I rebuilt them today.
[mishima@manage openmpi-1.8.2rc2-pgi14.7]$ ll lib/mpi*
-rwxr-xr-x 1 mishima mishima 315 Jul 30 12:27 lib/mpi_ext.mod
-rwxr-xr-x 1 mishima mis
This is another one.
(See attached file: openmpi-1.8.2rc2-pgi14.7.tar.gz)
Tetsuya
> Tetsuya --
>
> I am unable to test with the PGI compiler -- I don't have a license. I
was hoping that LANL would be able to test today, but I don't think they
got to it.
>
> Can you send more details?
>
> E.g.
Hi Jeff,
Sorry for the poor information and the late reply. Today, I attended a very,
very long meeting ...
Anyway, I attached the compile-output and configure-log.
(Due to the file size limitation, I am sending them in two parts.)
I hope you could find the problem.
(See attached file: openmpi-1.8-pgi14.7.tar.gz)
Regar
Sorry for the poor information. I attached the compile-output and configure-log.
I hope you could find the problem.
(See attached file: openmpi-pgi14.7.tar.gz)
Regards,
Tetsuya Mishima
> Tetsuya --
>
> I am unable to test with the PGI compiler -- I don't have a license. I
was hoping that LANL would b
Hi folks,
I tried to build openmpi-1.8.2rc2 with PGI-14.7 and execute a sample
program. Then, it causes a linking error:
[mishima@manage work]$ cat test.f
program hello_world
use mpi_f08
implicit none
type(MPI_Comm) :: comm
integer :: myid, npes, ierror
integer
Thanks Ralph. I'll check it next Monday.
Tetsuya
> Should be fixed with r32058
>
>
> On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph,
> >
> > By the way, something is wrong with your latest rmaps_rank_file.c.
> > I've got the error below. I'm tring to fi
Hi Ralph,
By the way, something is wrong with your latest rmaps_rank_file.c.
I've got the error below. I'm trying to find the problem, but you
could find it more quickly...
[mishima@manage trial]$ cat rankfile
rank 0=node05 slot=0-1
rank 1=node05 slot=3-4
rank 2=node05 slot=6-7
[mishima@manage
I'm not sure, but I guess it's related to Gilles's ticket.
It's quite a bad binding pattern, as Ralph pointed out, so
checking for that condition and disqualifying coll/ml could
be a practical solution as well.
Tetsuya
> It is related, but it means that coll/ml has a higher degree of
sensitivity
Hi folks,
Recently I have been seeing a hang with trunk when I specify a
particular binding by use of a rankfile or "-map-by slot".
This can be reproduced by a rankfile that allocates a process
beyond a socket boundary. For example, on node05, which has 2 sockets
with 4 cores each, rank 1 is allo
Thanks Ralph.
Tetsuya
> I tracked it down - not Torque specific, but impacts all managed
environments. Will fix
>
>
> On Apr 1, 2014, at 2:23 AM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi Ralph,
> >
> > I saw another hangup with openmpi-1.8 when I used more than 4 nodes
> > (having 8 cores
Hi Ralph,
I saw another hangup with openmpi-1.8 when I used more than 4 nodes
(having 8 cores each) under managed state by Torque. Although I'm not
sure you can reproduce it with SLURM, at least with Torque it can be
reproduced in this way:
[mishima@manage ~]$ qsub -I -l nodes=4:ppn=8
qsub: wai
Hi Jeff,
it worked for me with openmpi-1.8rc1.
Tetsuya
> Ralph applied a bunch of CMRs to the v1.8 branch after the nightly
tarball was made last night.
>
> I just created a new nightly tarball that includes all of those CMRs:
1.8a1r31269. It should have the fix for this error included in it.
Thanks Jeff. But I'm already offline today ...
I cannot confirm it until Monday morning, sorry.
Tetsuya
> Ralph applied a bunch of CMRs to the v1.8 branch after the nightly
tarball was made last night.
>
> I just created a new nightly tarball that includes all of those CMRs:
1.8a1r31269. It s
Thanks Jeff. It seems to be really the latest one - ticket #4474.
> On Mar 28, 2014, at 5:45 AM, wrote:
>
> >
--
> > A system call failed during shared memory initialization that should
> > not have. It is likely that your
Hi all,
I saw this error as shown below with openmpi-1.8a1r31254.
I've never seen it before with openmpi-1.7.5.
The message implies it's related to vader, and I can stop
it by excluding vader from the btl list with -mca btl ^vader.
Could someone fix this problem?
Tetsuya
[mishima@manage openmpi]$ mpirun -n
I added two improvements. Please replace the previous patch file
with the attached one, and take a look this weekend.
1. Add a pre-check for ORTE_ERR_NOT_FOUND so that the retry with byslot
works correctly afterward. Otherwise, the retry could fail, because
some fields such as node->procs, node->slots_
no problem - it's a minor cleanup.
Tetsuya
> Hi Tetsuya
>
> Let me take a look when I get home this weekend - I'm giving an ORTE
tutorial to a group of new developers this week and my time is very
limited.
>
> Thanks
> Ralph
>
>
>
> On Tue, Mar 25, 2014 at 5:37 PM, wrote:
>
> Hi Ralph, I moved
Hi Ralph, I moved on to the development list.
I'm not sure why the add_one flag is used in rr_byobj.
Here, if oversubscribed, procs are mapped to each object
one by one. So, I think add_one is not necessary.
Instead, when the user doesn't permit oversubscription,
the second pass should be skip