Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes
Disable the memory manager / don't use leave pinned. Then you can fork/exec without fear, because only MPI itself will have registered memory -- it will never leave user buffers registered after MPI communications finish.

> On Apr 23, 2015, at 9:25 PM, Howard Pritchard wrote:
>
> Jeff,
>
> this is kind of a LANL thing. Jack and I are working offline. Any
> suggestions about openib and fork/exec may be useful, however... and
> don't say no to fork/exec, not at least if you dream of MPI in the
> data center.
>
> [remaining quoted text trimmed]
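Jeff's suggestion can be expressed as runtime settings; a minimal sketch, assuming the 1.8-series parameter names (`mpi_leave_pinned`, plus the Linux memory-hook switch, which by its nature must be set in the environment rather than on the command line -- verify both against `ompi_info` for your build):

```shell
# Don't leave user buffers registered after MPI communication finishes,
# so fork/exec'd children never touch registered memory:
mpirun --mca mpi_leave_pinned 0 -np 500 ./mpi_finalize_break

# Or disable the Linux memory-manager hooks entirely; this is consulted
# before MCA parameter files are parsed, so export it in the environment:
export OMPI_MCA_memory_linux_disable=1
mpirun -np 500 ./mpi_finalize_break
```

Note that disabling leave-pinned trades fork() safety for some large-message InfiniBand bandwidth, since buffers are re-registered on every transfer.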
Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes
Jeff,

this is kind of a LANL thing. Jack and I are working offline. Any suggestions about openib and fork/exec may be useful, however... and don't say no to fork/exec, not at least if you dream of MPI in the data center.

On Apr 23, 2015 10:49 AM, "Galloway, Jack D" wrote:
> I am using a "homecooked" cluster at LANL, ~500 cores. There are a
> whole bunch of Fortran system calls doing the copying and pasting. The
> full code is attached here, a bunch of if-then statements for user
> options. Thanks for the help.
>
> --Jack Galloway
>
> [remaining quoted text trimmed]
Re: [OMPI users] help in execution mpi
Use "orte_rsh_agent = rsh" instead.

> On Apr 23, 2015, at 10:48 AM, rebona...@upf.br wrote:
>
> [quoted text trimmed]
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26773.php
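A minimal sketch of the suggested change, assuming the system-wide parameter file under the install prefix (the same file the question below describes editing):

```shell
# $prefix/etc/openmpi-mca-params.conf
# 1.6-series name for the remote-startup agent. It replaces the
# deprecated plm_rsh_agent spelling and so silences the deprecation
# warning printed for each launched process.
orte_rsh_agent = rsh
```

The same setting can be passed per run with `mpirun --mca orte_rsh_agent rsh ...` if editing the file is not desired.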
[OMPI users] help in execution mpi
Hi all,

I installed Open MPI (version 1.6.5) on Ubuntu 14.04. I teach parallel programming in an undergraduate course, and I want to use rsh instead of ssh (the default). I changed the file "openmpi-mca-params.conf" and put "plm_rsh_agent = rsh" there. The MPI application works, but a message appears for each process created:

/* begin message */
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter
file. Deprecated MCA parameters should be avoided; they may disappear
in future releases.

  Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
/* end message */

This is bad for explanations with students. Is there any way to suppress these warning messages?

Thanks a lot.

Marcelo Trindade Rebonatto
Passo Fundo University - Brazil
Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes
I am using a "homecooked" cluster at LANL, ~500 cores. There are a whole bunch of Fortran system calls doing the copying and pasting. The full code is attached here, a bunch of if-then statements for user options. Thanks for the help.

--Jack Galloway

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Thursday, April 23, 2015 8:15 AM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes

Hi Jack,

Are you using a system at LANL? Maybe I could try to reproduce the problem on the system you are using. The system call stuff adds a certain bit of zest to the problem. Does the app make Fortran system calls to do the copying and pasting?

Howard

On Apr 22, 2015 4:24 PM, "Galloway, Jack D" wrote:

I have an MPI program that is fairly straightforward, essentially "initialize, 2 sends from master to slaves, 2 receives on slaves, do a bunch of system calls for copying/pasting then running a serial code on each MPI task, tidy up and MPI finalize".

This seems straightforward, but I'm not getting MPI_FINALIZE to work correctly. Below is a snapshot of the program, without all the system copy/paste/call-external code, which I've rolled up in "do codish stuff" type statements.

program mpi_finalize_break

  !
  call MPI_INIT(ierr)
  icomm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(icomm,nproc,ierr)
  call MPI_COMM_RANK(icomm,rank,ierr)

  !
  if (rank == 0) then
     !
     call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
     call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
  else
     call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
     call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
     !
  endif

  print*, "got here4", rank
  call MPI_BARRIER(icomm,ierr)
  print*, "got here5", rank, ierr
  call MPI_FINALIZE(ierr)

  print*, "got here6"

end program mpi_finalize_break

Now the problem I am seeing occurs around the "got here4", "got here5" and "got here6" statements. I get the appropriate number of print statements with corresponding ranks for "got here4" as well as "got here5". Meaning, the master and all the slaves (rank 0, and all other ranks) got to the barrier call, through the barrier call, and to MPI_FINALIZE, reporting 0 for ierr on all of them. However, when it gets to "got here6", after the MPI_FINALIZE, I'll get all kinds of weird behavior. Sometimes I'll get one less "got here6" than I expect, or sometimes I'll get eight less (it varies); however, the program hangs forever, never closing, and leaves an orphaned process on one (or more) of the compute nodes.

I am running this on an InfiniBand backbone machine, with the NFS server shared over InfiniBand (nfs-rdma). I'm trying to determine how the MPI_BARRIER call works fine, yet MPI_FINALIZE ends up with random orphaned runs (not the same node, nor the same number of orphans, every time). I'm guessing it is related to the various system calls to cp, mv, ./run_some_code, cp, mv, but wasn't sure if it may be related to the speed of InfiniBand too, as all this happens fairly quickly. I could have the wrong intuition as well. Anybody have thoughts? I could post the whole code if helpful, but I believe this condensed version captures it. I'm running Open MPI 1.8.4 compiled against ifort 15.0.2, with Mellanox adapters running firmware 2.9.1000. This is the Mellanox firmware available through yum with CentOS 6.5, kernel 2.6.32-504.8.1.el6.x86_64.

ib0   Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
      inet addr:192.168.6.254  Bcast:192.168.6.255  Mask:255.255.255.0
      inet6 addr: fe80::202:c903:57:e7fd/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
      RX packets:10952 errors:0 dropped:0 overruns:0 frame:0
      TX packets:9805 errors:0 dropped:625413 overruns:0 carrier:0
      collisions:0 txqueuelen:256
      RX bytes:830040 (810.5 KiB)  TX bytes:643212 (628.1 KiB)

hca_id: mlx4_0
        transport:        InfiniBand (0)
        fw_ver:           2.9.1000
        node_guid:        0002:c903:0057:e7fc
        sys_image_guid:   0002:c903:0057:e7ff
        vendor_id:        0x02c9
        vendor_part_id:   26428
        hw_ver:           0xB0
        board_id:         MT_0D90110009
        phys_port_cnt:    1
        port: 1
                state:       PORT_ACTIVE (4)
                max_mtu:     4096 (5)
                active_mtu:  4096 (5)
                sm_lid:      1
                port_lid:    2
                port_lmc:    0x00
                link_layer:  InfiniBand

This problem only occurs in this simple implementation, thus my thinking it is tied to the system calls. I run several other, much larger, much more robust MPI codes without issue on the machine. Thanks for the help.

--Jack
Re: [OMPI users] Questions regarding MPI_T performance variables and Collective tuning
> On Apr 22, 2015, at 1:57 PM, Jerome Vienne wrote:
>
> While looking at performance and control variables provided by the MPI_T
> interface, I was surprised by the impressive number of control variables
> (1,087 if I am right (with 1.8.4)), but I was also disappointed to see
> that I was able to get only 2 performance variables.

Yeah, those were mostly added as "make sure we have the MPI_T pvar stuff implemented correctly."

> I would like to know if you are planning to add more performance
> variables, like the number of times an algorithm from a collective was
> called, or the number of buffers allocated/freed, etc...

We're open to adding lots of them. We're actually waiting for specific requests, to be honest. Can you provide a specific list of pvars that you'd like to see?

We're also contemplating rolling our existing PERUSE implementation (i.e., a prior-generation performance variable system) into the official MPI_T system. It's not immediately clear how to do this, though -- PERUSE and MPI_T aren't 100% compatible. So we've been loosely discussing how to do that (because we *do* have a bunch of MPI_T-performance-variable-like entities under our PERUSE).

> Regarding collective tuning, I was wondering if you can recommend a
> paper/presentation that will provide most of the details. I found some
> interesting posts (like this one:
> http://www.open-mpi.org/community/lists/users/2014/11/25847.php) but I am
> looking for a paper/doc explaining the different modules (basic, tuned,
> self, hierarchical...) and how to set dynamic rules.

George: can you provide insight here?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
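The control variables being counted above can also be browsed from the shell without writing MPI_T code; a sketch using `ompi_info` (flag spelling as of the 1.8 series; `--level 9` exposes all nine MPI_T verbosity levels, including developer-level parameters):

```shell
# Dump every registered MCA control variable -- roughly the ~1,087
# variables that MPI_T exposes as cvars in 1.8.4:
ompi_info --param all all --level 9

# Narrow the listing to the tuned collective component, whose
# algorithm-selection knobs the collective-tuning question asks about:
ompi_info --param coll tuned --level 9
```

Each entry prints its current value, source (default, file, environment), and help string, which is often the quickest way to pick candidate cvars before instrumenting with the MPI_T API itself.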
Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes
Can you send your full Fortran test program?

> On Apr 22, 2015, at 6:24 PM, Galloway, Jack D wrote:
>
> [quoted text trimmed]
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26765.php
Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes
Hi Jack,

Are you using a system at LANL? Maybe I could try to reproduce the problem on the system you are using. The system call stuff adds a certain bit of zest to the problem. Does the app make Fortran system calls to do the copying and pasting?

Howard

On Apr 22, 2015 4:24 PM, "Galloway, Jack D" wrote:
> [quoted text trimmed]
Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl
/usr/bin/ofed_info

So, the OFED on your system is not Mellanox OFED 2.4.x but something else. Try:

  # rpm -qi libibverbs

On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar wrote:
> Hi,
>
> where is the command ofed_info located? I searched from / but didn't
> find it.
>
> Subhra.
>
> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman wrote:
>> cool, progress!
>>
>>   [1429676565.124664] sys.c:719 MXM WARN Conflicting CPU
>>   frequencies detected, using: 2601.00
>>
>> means that the cpu governor on your machine is not in "performance" mode.
>>
>>   MXM ERROR ibv_query_device() returned 38: Function not implemented
>>
>> indicates that the OFED installed on your nodes is not in fact
>> 2.4-1.0.0, or there is a mismatch between the OFED kernel driver
>> version and the OFED userspace library version, or you have multiple
>> OFED libraries installed on your node and are using the incorrect one.
>> Could you please check that "ofed_info -s" indeed prints MOFED 2.4-1.0.0?
>>
>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar wrote:
>>> Hi,
>>>
>>> I compiled the openmpi that comes inside the Mellanox HPCX package
>>> with MXM support instead of a separately downloaded openmpi. I also
>>> used the environment as in the README so that no LD_PRELOAD (except
>>> our own library, which is unrelated) is needed. Now it runs fine (no
>>> segfault) but we get the same errors as before (saying initialization
>>> of the MXM library failed). Is it using MXM successfully?
>>>
>>> [root@JARVICE hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]#
>>> mpirun --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost :
>>> -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> --------------------------------------------------------------------------
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>>
>>>   Node: JARVICE
>>>
>>> This usually is due to not having the required NUMA support installed
>>> on the node. In some Linux distributions, the required support is
>>> contained in the libnumactl and libnumactl-devel packages.
>>> This is a warning only; your job will continue, though performance may
>>> be degraded.
>>> --------------------------------------------------------------------------
>>> i am backend
>>> [1429676565.121218] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
>>> [1429676565.122937] [JARVICE:14767:0] ib_dev.c:445 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.122950] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>>> [1429676565.123535] [JARVICE:14767:0] ib_dev.c:445 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.123543] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>>> [1429676565.124664] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
>>> [1429676565.126264] [JARVICE:14768:0] ib_dev.c:445 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.126276] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>>> [1429676565.126812] [JARVICE:14768:0] ib_dev.c:445 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.126821] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>>>
>>> --------------------------------------------------------------------------
>>> Initialization of MXM library failed.
>>>
>>>   Error: Input/output error
>>> --------------------------------------------------------------------------
>>>
>>> Thanks,
>>> Subhra.
>>>
>>> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman wrote:
>>>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0?
>>>>
>>>> why is LD_PRELOAD needed in your command line? Can you try:
>>>>
>>>>   module load hpcx
>>>>   mpirun -np $np test.exe
>>>>
>>>> On Sat, Apr 18, 2015 at 8:39 AM, Subhra Mazumdar wrote:
>>>>> I followed the instructions as in the README, now getting a
>>>>> different error:
>>>>>
>>>>> [remaining quoted text trimmed]
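The checks suggested across this thread, collected in one place as a sketch (paths and package names vary by distribution; the expected strings are assumptions based on the versions discussed above):

```shell
# 1. Which OFED stack is installed? Mellanox OFED prints a string like
#    MLNX_OFED_LINUX-2.4-1.0.0; anything else suggests a community/inbox
#    OFED, which would explain the ibv_exp_* "Function not implemented" errors.
ofed_info -s

# 2. If ofed_info is absent, inspect the userspace verbs library directly;
#    the Vendor/Packager fields reveal where libibverbs came from:
rpm -qi libibverbs

# 3. The "Conflicting CPU frequencies" MXM warning points at the CPU
#    frequency governor; "performance" is the recommended setting:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

A mismatch between step 1 and step 2 (e.g., Mellanox kernel drivers with distro userspace libraries, or vice versa) matches the failure mode Mike describes.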
Re: [OMPI users] problem with Java in openmpi-dev-1567-g11e8c20
Hi Howard, > Could you double check that on the linux box you are using an ompi install > which has java support? Yes, I have a script file that I call with the Open MPI version that I want to build so that I can't forget to use an empty directory, to remove the last installation before installing the new one, and so on. The strange thing is that I cannot reproduce the error today. I've no idea why it didn't work two days ago. Nevertheless I'm happy that it works now. Thank you very much for your help which forced me to try again. linpc1 java 110 ls -l /usr/local/openmpi-1.9.0_64_gcc/bin/mpijavac lrwxrwxrwx 1 root root 11 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/bin/mpijavac -> mpijavac.pl linpc1 java 111 ls -l /usr/local/openmpi-1.9.0_64_gcc/lib64/*java* -rwxr-xr-x 1 root root 1170 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/libmpi_java.la lrwxrwxrwx 1 root root 20 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/libmpi_java.so -> libmpi_java.so.0.0.0 lrwxrwxrwx 1 root root 20 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/libmpi_java.so.0 -> libmpi_java.so.0.0.0 -rwxr-xr-x 1 root root 538243 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/libmpi_java.so.0.0.0 -rwxr-xr-x 1 root root 1239 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/liboshmem_java.la lrwxrwxrwx 1 root root 23 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/liboshmem_java.so -> liboshmem_java.so.0.0.0 lrwxrwxrwx 1 root root 23 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/liboshmem_java.so.0 -> liboshmem_java.so.0.0.0 -rwxr-xr-x 1 root root 169198 Apr 21 07:52 /usr/local/openmpi-1.9.0_64_gcc/lib64/liboshmem_java.so.0.0.0 linpc1 java 112 tyr fd1026 104 mpiexec -np 6 -host tyr,linpc1,sunpc1 java MatMultWithAnyProc2DarrayIn1DarrayMain You have started 6 processes but I need at most 4 processes. I build a new worker group with 4 processes. The processes with the following ranks in the basic group belong to the new group: 2 3 4 5 ... 
Kind regards Siegmar > Howard > On Apr 21, 2015 10:11 AM, "Siegmar Gross" < > siegmar.gr...@informatik.hs-fulda.de> wrote: > > > Hi, > > > > today I installed openmpi-dev-1567-g11e8c20 on my machines > > (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 > > x86_64) with gcc-4.9.2. I used the following configure command > > for all platforms. > > > > ../openmpi-dev-1567-g11e8c20/configure \ > > --prefix=/usr/local/openmpi-1.9.0_64_gcc \ > > --libdir=/usr/local/openmpi-1.9.0_64_gcc/lib64 \ > > --with-jdk-bindir=/usr/local/jdk1.8.0/bin \ > > --with-jdk-headers=/usr/local/jdk1.8.0/include \ > > JAVA_HOME=/usr/local/jdk1.8.0 \ > > LDFLAGS="-m64" CC="gcc" CXX="g++" FC="gfortran" \ > > CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \ > > CPP="cpp" CXXCPP="cpp" \ > > CPPFLAGS="" CXXCPPFLAGS="" \ > > --enable-mpi-cxx \ > > --enable-cxx-exceptions \ > > --enable-mpi-java \ > > --enable-heterogeneous \ > > --enable-mpi-thread-multiple \ > > --with-hwloc=internal \ > > --without-verbs \ > > --with-wrapper-cflags="-std=c11 -m64" \ > > --with-wrapper-cxxflags="-m64" \ > > --with-wrapper-fcflags="-m64" \ > > --enable-debug \ > > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc > > > > I can run a small program on both Solaris machines without problems, > > but get an error on Linux. > > > > tyr java 123 mpiexec -np 6 --host sunpc1 java > > MatMultWithAnyProc2DarrayIn1DarrayMain > > You have started 6 processes but I need at most 4 processes. > > I build a new worker group with 4 processes. The processes with > > the following ranks in the basic group belong to the new group: > > 2 3 4 5 > > > > Group "groupOther" contains 2 processes which have > > nothing to do. > > > > Worker process 0 of 4 running on sunpc1. > > Worker process 1 of 4 running on sunpc1. > > Worker process 2 of 4 running on sunpc1. > > Worker process 3 of 4 running on sunpc1. 
> >
> > (4,6)-matrix a:
> >
> >   1.00   2.00   3.00   4.00   5.00   6.00
> >   7.00   8.00   9.00  10.00  11.00  12.00
> >  13.00  14.00  15.00  16.00  17.00  18.00
> >  19.00  20.00  21.00  22.00  23.00  24.00
> > ...
> >
> > I get the following error on my Linux machine.
> >
> > tyr java 127 mpiexec -np 6 --host linpc1 java MatMultWithAnyProc2DarrayIn1DarrayMain
> > Exception in thread "main" java.lang.NoClassDefFoundError: mpi/MPIException
> >         at java.lang.Class.getDeclaredMethods0(Native Method)
> >         at java.lang.Class.privateGetDeclaredMethods(Class.java:2688)
> >         at java.lang.Class.getMethod0(Class.java:2937)
> >         at java.lang.Class.getMethod(Class.java:1771)
> >         at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> >         at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> > Caused by: java.lang.ClassNotFoundException: mpi.MPIException
> >         at
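A `ClassNotFoundException: mpi.MPIException` at launch usually means the JVM could not find Open MPI's `mpi.jar` on its classpath at run time. A hedged helper for building the `-cp` argument; the assumption (not confirmed in the thread) is that the Java bindings install `mpi.jar` under the prefix's `lib` or `lib64` directory:

```shell
#!/bin/sh
# ompi_java_cp: print a Java classpath containing Open MPI's mpi.jar (plus
# the current directory), searching lib64 then lib under the given prefix.
# Returns 1 if no mpi.jar is found under the prefix.
ompi_java_cp() {
  prefix="$1"
  for d in "$prefix/lib64" "$prefix/lib"; do
    if [ -f "$d/mpi.jar" ]; then
      printf '%s\n' "$d/mpi.jar:."
      return 0
    fi
  done
  return 1
}
```

One would then run, for example: `mpiexec -np 6 --host linpc1 java -cp "$(ompi_java_cp /usr/local/openmpi-1.9.0_64_gcc)" MatMultWithAnyProc2DarrayIn1DarrayMain` (prefix taken from the listings above).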
Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl
Hi,

where is the command ofed_info located? I searched from / but didn't find it.

Subhra.

On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman wrote:
> cool, progress!
>
> >> 1429676565.124664] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
>
> means that the cpu governor on your machine is not in "performance" mode.
>
> >> MXM ERROR ibv_query_device() returned 38: Function not implemented
>
> indicates that the ofed installed on your nodes is not in fact 2.4-1.0.0, or
> there is a mismatch between the ofed kernel driver version and the ofed
> userspace library version, or you have multiple ofed libraries installed on
> your node and are using the incorrect one.
> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0?
>
> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar <subhramazumd...@gmail.com> wrote:
>> Hi,
>>
>> I compiled the openmpi that comes inside the mellanox hpcx package with
>> mxm support instead of the separately downloaded openmpi. I also used the
>> environment as in the README so that no LD_PRELOAD (except our own library,
>> which is unrelated) is needed. Now it runs fine (no segfault) but we get the
>> same errors as before (saying initialization of the MXM library failed). Is it
>> using MXM successfully?
>>
>> [root@JARVICE hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun --allow-run-as-root --mca mtl mxm -n 1 /root/backend localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>> --
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>>
>> Node: JARVICE
>>
>> This usually is due to not having the required NUMA support installed
>> on the node. In some Linux distributions, the required support is
>> contained in the libnumactl and libnumactl-devel packages.
>> This is a warning only; your job will continue, though performance may be degraded.
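Since `ofed_info` could not be found by searching from `/`, a small lookup helper can rule out a PATH problem before concluding the tool is absent. The candidate locations in the usage comment are guesses for common MLNX_OFED layouts, not taken from the thread:

```shell
#!/bin/sh
# find_tool: print the full path of a command, checking PATH first and then
# any extra candidate locations passed as arguments; returns 1 if not found.
find_tool() {
  name="$1"; shift
  command -v "$name" 2>/dev/null && return 0
  for p in "$@"; do
    if [ -x "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  return 1
}
# For ofed_info one might try (paths are assumptions):
#   find_tool ofed_info /usr/bin/ofed_info /usr/sbin/ofed_info
```

If this finds nothing, the OFED userspace package that provides `ofed_info` is likely simply not installed on the node, which would be consistent with Mike's version-mismatch theory.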
>> --
>> i am backend
>> [1429676565.121218] sys.c:719 MXM WARN  Conflicting CPU frequencies detected, using: 2601.00
>> [1429676565.122937] [JARVICE:14767:0] ib_dev.c:445 MXM WARN  failed call to ibv_exp_use_priv_env(): Function not implemented
>> [1429676565.122950] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>> [1429676565.123535] [JARVICE:14767:0] ib_dev.c:445 MXM WARN  failed call to ibv_exp_use_priv_env(): Function not implemented
>> [1429676565.123543] [JARVICE:14767:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>> [1429676565.124664] sys.c:719 MXM WARN  Conflicting CPU frequencies detected, using: 2601.00
>> [1429676565.126264] [JARVICE:14768:0] ib_dev.c:445 MXM WARN  failed call to ibv_exp_use_priv_env(): Function not implemented
>> [1429676565.126276] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>> [1429676565.126812] [JARVICE:14768:0] ib_dev.c:445 MXM WARN  failed call to ibv_exp_use_priv_env(): Function not implemented
>> [1429676565.126821] [JARVICE:14768:0] ib_dev.c:456 MXM ERROR ibv_query_device() returned 38: Function not implemented
>> --
>> Initialization of MXM library failed.
>>
>> Error: Input/output error
>>
>> --
>>
>> Thanks,
>> Subhra.
>>
>> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman wrote:
>>
>>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0?
>>> why LD_PRELOAD needed in your command line? Can you try
>>>
>>> module load hpcx
>>> mpirun -np $np test.exe
>>> ?
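The "Conflicting CPU frequencies" warning above is the symptom Mike ties to the cpufreq governor not being in "performance" mode. A sketch for inspecting every core's governor via the standard Linux sysfs layout; the base-directory parameter exists only so the function is easy to exercise outside a real `/sys`:

```shell
#!/bin/sh
# check_governors: print the cpufreq scaling governor for every CPU, so any
# value other than "performance" stands out at a glance.
check_governors() {
  base="${1:-/sys/devices/system/cpu}"
  for f in "$base"/cpu*/cpufreq/scaling_governor; do
    [ -f "$f" ] || continue                 # skip CPUs without cpufreq
    printf '%s: %s\n' "$f" "$(cat "$f")"
  done
}
```

On the machine in the thread, every line should read "performance"; switching is typically done with `cpupower frequency-set -g performance` (distribution-dependent, so treat that command as an assumption).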
>>>
>>> On Sat, Apr 18, 2015 at 8:39 AM, Subhra Mazumdar <subhramazumd...@gmail.com> wrote:

I followed the instructions as in the README; now I'm getting a different error:

[root@JARVICE hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# ../openmpi-1.8.4/openmpinstall/bin/mpirun --allow-run-as-root --mca mtl mxm -x LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./mxm/lib/libmxm.so.2" -n 1 ../backend localhost : -x LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./mxm/lib/libmxm.so.2 ../libci.so" -n 1 ../app2
--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

Node: JARVICE

This usually is due to not