Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
You need to run the ulimit command before mpirun, and on the same node.
If it still does not work, you can use a wrapper. Instead of

    mpirun a.out

you would do

    mpirun a.sh

where a.sh is a script:

    ulimit -c unlimited
    exec a.out

The core is created in the current directory.

Cheers,

Gilles

On Saturday, September 3, 2016, Mahmood Naderan wrote:
> > Did you run
> > ulimit -c unlimited
> > before invoking mpirun?
>
> Yes. On the node which reports that error. Is that file created in the
> current working directory? Or is it somewhere in the system folders?
>
> As another question, I am trying to use OpenMPI-2.0.0 as a new one.
> The problem is that the application uses libmpi_f90.a from old versions
> but I don't see that in OpenMPI-2.0.0. There are some other libraries
> there.
>
> --
> Regards,
> Mahmood

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
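The wrapper idea above can be written a little more generally. What follows is only a sketch: the file name core_wrapper.sh and the function name with_core_limit are made up for illustration, and it assumes a shell whose ulimit builtin accepts -c (bash and dash do).

```shell
#!/bin/sh
# core_wrapper.sh (hypothetical name): raise the core-file size limit,
# then run the real program with its original arguments.
# Possible use:  mpirun -np 4 ./core_wrapper.sh ./a.out

with_core_limit() {
    (
        # Raising the limit can fail if the hard limit is lower; warn and continue.
        ulimit -c unlimited 2>/dev/null || echo "warning: core limit not raised" >&2
        # exec replaces the subshell, so the program's exit status is returned as-is.
        exec "$@"
    )
}

# When the script is given a command line, run it under the raised limit.
if [ "$#" -gt 0 ]; then
    with_core_limit "$@"
fi
```

Because the limit is set inside the wrapper, it applies to every rank on every node, which is the point of using a wrapper rather than typing ulimit once in the login shell.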
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Note that Open MPI v2.0.0 is not ABI compatible with prior releases of Open MPI. If you are trying to run an MPI executable created by a prior version of Open MPI, you will need to recompile your application with Open MPI v2.0.0.

Sent from my phone. No type good.

> On Sep 2, 2016, at 12:48 PM, Mahmood Naderan wrote:
>
> Thanks for your help. Please see below
>
> mahmood@compute-0-1:~$ ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta
>     linux-vdso.so.1 => (0x7fffba9a8000)
>     libmpi_f90.so.1 => /opt/openmpi/lib/libmpi_f90.so.1 (0x2b472b64)
>     libmpi_f77.so.1 => /opt/openmpi/lib/libmpi_f77.so.1 (0x2b472b848000)
>     libmpi.so.1 => /opt/openmpi/lib/libmpi.so.1 (0x2b472ba8)
>     libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x003d17e0)
>     librt.so.1 => /lib64/librt.so.1 (0x003d1860)
>     libnsl.so.1 => /lib64/libnsl.so.1 (0x003d1ae0)
>     libutil.so.1 => /lib64/libutil.so.1 (0x003d18a0)
>     libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x2b472c028000)
>     libm.so.6 => /lib64/libm.so.6 (0x2b472c32)
>     libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x2b472c5a8000)
>     libdl.so.2 => /lib64/libdl.so.2 (0x003d1760)
>     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003d1920)
>     libpthread.so.0 => /lib64/libpthread.so.0 (0x003d17a0)
>     libc.so.6 => /lib64/libc.so.6 (0x003d1720)
>     libdat.so.1 => /usr/lib64/libdat.so.1 (0x2b472c8b)
>     /lib64/ld-linux-x86-64.so.2 (0x003d16e0)
>
> --
> Regards,
> Mahmood
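To see which Open MPI installation an existing executable actually resolves at run time, the ldd output can be filtered for the MPI libraries. This is a small sketch; the function name linked_mpi_libs is invented here, and it only assumes the usual "name => path (address)" layout of ldd output.

```shell
#!/bin/sh
# Read ldd output on standard input and print "soname path" for each
# Open MPI library the dynamic linker resolved.
linked_mpi_libs() {
    awk '/libmpi/ && $2 == "=>" { print $1, $3 }'
}

# Possible use on a compute node:
#   ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta | linked_mpi_libs
```

In the ldd output quoted in this thread the MPI libraries resolve under /opt/openmpi/lib, while the build reportedly used /share/apps/computer/openmpi-1.6.5/bin/mpif90. The thread does not say whether those are the same installation, so it may be worth checking.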
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Thanks for your help. Please see below

mahmood@compute-0-1:~$ ldd /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta
    linux-vdso.so.1 => (0x7fffba9a8000)
    libmpi_f90.so.1 => /opt/openmpi/lib/libmpi_f90.so.1 (0x2b472b64)
    libmpi_f77.so.1 => /opt/openmpi/lib/libmpi_f77.so.1 (0x2b472b848000)
    libmpi.so.1 => /opt/openmpi/lib/libmpi.so.1 (0x2b472ba8)
    libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x003d17e0)
    librt.so.1 => /lib64/librt.so.1 (0x003d1860)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x003d1ae0)
    libutil.so.1 => /lib64/libutil.so.1 (0x003d18a0)
    libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x2b472c028000)
    libm.so.6 => /lib64/libm.so.6 (0x2b472c32)
    libtorque.so.2 => /opt/torque/lib/libtorque.so.2 (0x2b472c5a8000)
    libdl.so.2 => /lib64/libdl.so.2 (0x003d1760)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003d1920)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x003d17a0)
    libc.so.6 => /lib64/libc.so.6 (0x003d1720)
    libdat.so.1 => /usr/lib64/libdat.so.1 (0x2b472c8b)
    /lib64/ld-linux-x86-64.so.2 (0x003d16e0)

--
Regards,
Mahmood
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Thank you. That is helpful.
Could you run 'ldd' on your executable, on one of the compute nodes if possible?
I will not be able to solve your problem, but at least we now know what the application is, and can look at the libraries it is using.

On 2 September 2016 at 17:19, Mahmood Naderan wrote:
> The application is Siesta-3.2 and the command I use is
>
> /share/apps/computer/openmpi-1.6.5/bin/mpirun -hostfile hosts.txt -np 15 /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta < trans-cc-bt-cc-163-20.fdf
>
> There is one node in the hosts.txt file. I built the transiesta
> binary from source using /share/apps/computer/openmpi-1.6.5/bin/mpif90
>
> --
> Regards,
> Mahmood
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
The application is Siesta-3.2 and the command I use is

    /share/apps/computer/openmpi-1.6.5/bin/mpirun -hostfile hosts.txt -np 15 /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta < trans-cc-bt-cc-163-20.fdf

There is one node in the hosts.txt file. I have built transiesta binary from the source which uses /share/apps/computer/openmpi-1.6.5/bin/mpif90

--
Regards,
Mahmood
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Mahmood,
are you compiling and linking this application yourself? Or are you using an executable which someone else has prepared?
It would be very useful if we could know which application it is.

On 2 September 2016 at 16:35, Mahmood Naderan wrote:
> > Did you run
> > ulimit -c unlimited
> > before invoking mpirun?
>
> Yes. On the node which reports that error. Is that file created in the
> current working directory? Or is it somewhere in the system folders?
>
> As another question, I am trying to use OpenMPI-2.0.0 as a new one.
> The problem is that the application uses libmpi_f90.a from old versions
> but I don't see that in OpenMPI-2.0.0. There are some other libraries
> there.
>
> --
> Regards,
> Mahmood
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
> Did you run
> ulimit -c unlimited
> before invoking mpirun?

Yes. On the node which reports that error. Is that file created in the current working directory? Or is it somewhere in the system folders?

As another question, I am trying to use OpenMPI-2.0.0 as a new one. The problem is that the application uses libmpi_f90.a from old versions but I don't see that in OpenMPI-2.0.0. There are some other libraries there.

--
Regards,
Mahmood
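On Linux, where a core file ends up depends on the kernel's core_pattern setting as well as on the ulimit. The sketch below classifies that setting; the function name core_destination is made up for illustration, and the default path /proc/sys/kernel/core_pattern is Linux-specific.

```shell
#!/bin/sh
# Describe where the kernel will put core files, based on core_pattern.
# The optional argument is the file to read (defaults to the Linux proc entry).
core_destination() {
    file=${1:-/proc/sys/kernel/core_pattern}
    pattern=$(cat "$file") || return 1
    case $pattern in
        '|'*) echo "piped to a helper program: $pattern" ;;
        /*)   echo "written to an absolute path: $pattern" ;;
        *)    echo "written relative to the working directory of the crashed process: $pattern" ;;
    esac
}

# Possible use, together with the current limit:
#   ulimit -c
#   core_destination
```

If the pattern is a plain name such as "core", the file appears in the working directory of the process that crashed, which for an MPI rank may differ from the directory where mpirun was started.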
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Also, the error message suggests that TCP is not the issue here -- the TCP hangups are likely because some other process exited unexpectedly. Indeed:

    mpirun noticed that process rank 0 with PID 4989 on node compute-0-1 exited on signal 4 (Illegal instruction).

This might be the real issue. Getting a corefile, as was already suggested, might be the best way to go forward.

> On Sep 2, 2016, at 5:50 AM, John Hearns via users wrote:
>
> Mahmood, as Gilles says, start by looking at how that application is compiled and linked.
> Run 'ldd' on the executable and look closely at the libraries. Do this on a compute node if you can.
>
> There was a discussion on another mailing list recently about how to fingerprint executables and see which architecture they were compiled for.
> My mind is a blank at the moment as to what that discussion concluded. Sorry.
> And if this was on OpenMPI I am doubly sorry!
>
> On 2 September 2016 at 10:37, Gilles Gouaillardet wrote:
> > Did you run
> > ulimit -c unlimited
> > before invoking mpirun?
> >
> > If your application can be run with only one task, you can try to run it under gdb.
> > You will hopefully be able to see where the illegal instruction occurs.
> >
> > Since you are running on AMD processors, you have to make sure you are not using any third-party library that was optimized for Intel processors (e.g. one that uses AVX (SSE ?) instructions).
> >
> > Cheers,
> >
> > Gilles
> >
> > On Friday, September 2, 2016, Mahmood Naderan wrote:
> > > > Are you running under a batch manager?
> > > > On which architecture?
> > > Currently I am not using the job manager (which is actually PBS). I am running from the terminal.
> > >
> > > The machines are AMD Opteron 64 bit
> > >
> > > > Hopefully you will get a core file that points you to the illegal instruction
> > > Where is that core file? I cannot find it.
> > >
> > > BTW, the openmpi is 1.6.5
> > >
> > > --
> > > Regards,
> > > Mahmood

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Mahmood, as Gilles says, start by looking at how that application is compiled and linked.
Run 'ldd' on the executable and look closely at the libraries. Do this on a compute node if you can.

There was a discussion on another mailing list recently about how to fingerprint executables and see which architecture they were compiled for.
My mind is a blank at the moment as to what that discussion concluded. Sorry.
And if this was on OpenMPI I am doubly sorry!

On 2 September 2016 at 10:37, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Did you run
> ulimit -c unlimited
> before invoking mpirun?
>
> If your application can be run with only one task, you can try to run it under gdb.
> You will hopefully be able to see where the illegal instruction occurs.
>
> Since you are running on AMD processors, you have to make sure you are not using any third-party library that was optimized for Intel processors (e.g. one that uses AVX (SSE ?) instructions).
>
> Cheers,
>
> Gilles
>
> On Friday, September 2, 2016, Mahmood Naderan wrote:
> > > Are you running under a batch manager?
> > > On which architecture?
> > Currently I am not using the job manager (which is actually PBS). I am running from the terminal.
> >
> > The machines are AMD Opteron 64 bit
> >
> > > Hopefully you will get a core file that points you to the illegal instruction
> > Where is that core file? I cannot find it.
> >
> > BTW, the openmpi is 1.6.5
> >
> > --
> > Regards,
> > Mahmood
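One rough way to fingerprint an executable or library, not necessarily what the discussion mentioned above concluded, is to disassemble it and count instructions that touch the 256-bit ymm registers, which only AVX-family instructions use. The sketch assumes GNU objdump with its default AT&T syntax; the function name count_ymm_insns is invented for illustration.

```shell
#!/bin/sh
# Count disassembly lines on standard input that reference a ymm register.
# A non-zero count means the code contains AVX-family instructions.
count_ymm_insns() {
    # grep -c prints the count but exits non-zero when it is 0; mask that status.
    grep -c '%ymm[0-9]' || true
}

# Possible use on the executable and on each third-party library from ldd:
#   objdump -d /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta | count_ymm_insns
#   file /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta   # basic ELF architecture only
```

This is a heuristic. A zero count does not rule out other instruction-set extensions the CPU may lack, and a non-zero count does not prove those instructions are ever executed, since some libraries select a code path at run time.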
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
Did you run
ulimit -c unlimited
before invoking mpirun?

If your application can be run with only one task, you can try to run it under gdb.
You will hopefully be able to see where the illegal instruction occurs.

Since you are running on AMD processors, you have to make sure you are not using any third-party library that was optimized for Intel processors (e.g. one that uses AVX (SSE ?) instructions).

Cheers,

Gilles

On Friday, September 2, 2016, Mahmood Naderan wrote:
> > Are you running under a batch manager?
> > On which architecture?
> Currently I am not using the job manager (which is actually PBS). I am running from the terminal.
>
> The machines are AMD Opteron 64 bit
>
> > Hopefully you will get a core file that points you to the illegal instruction
> Where is that core file? I cannot find it.
>
> BTW, the openmpi is 1.6.5
>
> --
> Regards,
> Mahmood
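Once a core file exists, gdb can print the backtrace and the instruction at the faulting address without an interactive session. This is a sketch only: the function name core_report is made up, and it assumes gdb is installed on the node.

```shell
#!/bin/sh
# Print a backtrace and the faulting instruction from a core file.
# Arguments: path to the executable, path to the core file.
core_report() {
    exe=$1
    core=$2
    if [ ! -f "$exe" ] || [ ! -f "$core" ]; then
        echo "usage: core_report EXECUTABLE COREFILE" >&2
        return 2
    fi
    # -batch runs the given commands and exits;
    # 'x/i $pc' disassembles the instruction at the program counter.
    gdb -batch -ex 'bt' -ex 'x/i $pc' "$exe" "$core"
}

# Possible use (the core file name depends on the system's core_pattern):
#   core_report /share/apps/chemistry/siesta-3.2-pl-5/tpar/transiesta core
```

The mnemonic printed by 'x/i $pc' usually indicates which instruction-set extension is involved, and the backtrace shows whether the instruction sits in the application or in a library.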
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
> Are you running under a batch manager?
> On which architecture?

Currently I am not using the job manager (which is actually PBS). I am running from the terminal.

The machines are AMD Opteron 64 bit

> Hopefully you will get a core file that points you to the illegal instruction

Where is that core file? I cannot find it.

BTW, the openmpi is 1.6.5

--
Regards,
Mahmood
Re: [OMPI users] job aborts "readv failed: Connection reset by peer"
In the absence of a clear error message, the btl_tcp_frag related error messages can suggest a process was killed by the oom-killer.
This is not your case, since rank 0 died because of an illegal instruction.

Are you running under a batch manager?
On which architecture?
Do your compute nodes have the very same architecture as the node used to compile your libs and apps?

That kind of error can occur if your app was built with AVX2 instructions (e.g. for the latest Intel Xeon) but runs on a previous-generation processor that is not AVX2 capable.
I guess the same thing can occur if different ARM versions are involved.

Can you run
ulimit -c unlimited
and then mpirun again?
Hopefully you will get a core file that points you to the illegal instruction.

Cheers,

Gilles

On Tuesday, August 30, 2016, Mahmood Naderan wrote:
> Hi,
> An MPI job is running on two nodes and everything seems to be fine.
> However, in the middle of the run, the program aborts with the following error
>
> [compute-0-1.local][[47664,1],14][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [compute-0-3.local][[47664,1],11][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [compute-0-3.local][[47664,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --
> mpirun noticed that process rank 0 with PID 4989 on node compute-0-1 exited on signal 4 (Illegal instruction).
> --
>
> There are 8 processes on that node and each consumes about 150MB of memory. The total memory usage is about 1% of the memory.
>
> There are some discussions on the web about memory error but there is no clear answer for that. What does that illegal instruction mean?
>
> Regards,
> Mahmood
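To compare the build node with the compute nodes, one option on Linux is to check the CPU feature flags listed in /proc/cpuinfo. The helper below is a sketch; the name cpu_has_flag is invented for illustration.

```shell
#!/bin/sh
# Succeed if a CPU feature flag (e.g. avx, avx2, sse4_2) is listed in a cpuinfo file.
# The second argument defaults to /proc/cpuinfo (Linux only).
cpu_has_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # Look at the first "flags" line only; -w matches whole words,
    # so asking for "avx" does not match "avx2".
    grep -m1 '^flags' "$file" | grep -qw -- "$flag"
}

# Possible use, run on both the build node and each compute node:
#   for f in sse4_2 avx avx2; do
#       if cpu_has_flag "$f"; then echo "$f: yes"; else echo "$f: no"; fi
#   done
```

A flag present on the build node but missing on a compute node is a likely source of an illegal-instruction abort if the compiler or a library was allowed to target the build machine's CPU.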
[OMPI users] job aborts "readv failed: Connection reset by peer"
Hi,
An MPI job is running on two nodes and everything seems to be fine. However, in the middle of the run, the program aborts with the following error

[compute-0-1.local][[47664,1],14][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-3.local][[47664,1],11][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-3.local][[47664,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--
mpirun noticed that process rank 0 with PID 4989 on node compute-0-1 exited on signal 4 (Illegal instruction).
--

There are 8 processes on that node and each consumes about 150MB of memory. The total memory usage is about 1% of the memory.

There are some discussions on the web about memory error but there is no clear answer for that. What does that illegal instruction mean?

Regards,
Mahmood