[OMPI users] Double free or corruption problem updated result

2017-06-17 Thread ashwin .D
Hello Gilles,
   I am enclosing all the information you requested.

1) As an attachment I enclose the log file.
2) I rebuilt OpenMPI 2.1.1 with the --enable-debug feature and reinstalled it
under /usr/local.
I ran all the examples in the examples directory. All passed except
oshmem_strided_puts, where I got this message:

[[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1 is not in
valid range
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 13409, host=a-Vostro-3800) with
errorcode -1.
--------------------------------------------------------------------------


3) I deleted all old OpenMPI versions under /usr/local/lib.
4) I am using the COSMO weather model - http://www.cosmo-model.org/ - to run
simulations.
The support staff claim they have seen no errors with a similar setup. They
use:

1) gfortran 4.8.5
2) OpenMPI 1.10.1

The only difference is that I use OpenMPI 2.1.1.

5) I also tried the option mpirun --mca btl tcp,self -np 4 cosmo, and I got
the same error as in the mpi_logs file.

6) Regarding compiler and linking options on Ubuntu 16.04:

mpif90 --showme:compile and mpif90 --showme:link give me the options for
compiling and linking (see the check sketched after this list).

Here are the options from my makefile:

-pthread -lmpi_usempi -lmpi_mpifh -lmpi for linking

7) I have a 64-bit OS.
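
Here is the quick check I plan to run to confirm which wrapper and link flags
are actually picked up (just a sketch; it assumes the OpenMPI 2.1.1 wrappers
are the first ones on my PATH):

  which mpif90 mpirun
  mpif90 --showme:compile
  mpif90 --showme:link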

I think I have responded to all of your questions. If I have missed anything,
please let me know and I will respond ASAP. The only thing I have not done
is look at /usr/local/include. I saw some old OpenMPI files there. If those
need to be deleted I will do so after I hear from you.
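
For reference, this is the kind of check I would run there (a sketch; the exact
file and directory names left over from the old install are an assumption):

  ls -d /usr/local/include/mpi*.h /usr/local/include/openmpi 2>/dev/null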

Best regards,
Ashwin.


mpi_logs
Description: Binary data
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-17 Thread gilles
Ted,

I do not observe the behavior you describe with Open MPI 2.1.1:

# mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh

abort.sh 31361 launching abort
abort.sh 31362 launching abort
I am rank 0 with pid 31363
I am rank 1 with pid 31364

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

--------------------------------------------------------------------------
[linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],0]
[linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],0]
[linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31361 SUCCESS
[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],1]
[linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],1]
[linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31362 SUCCESS
[linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],0]
[linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31361 SUCCESS
[linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],1]
[linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31362 SUCCESS
[linux:31356] [[18199,0],0] SENDING SIGKILL TO [[18199,1],0]
[linux:31356] [[18199,0],0] odls:default:SENT KILL 9 TO PID 31361 SUCCESS
[linux:31356] [[18199,0],0] SENDING SIGKILL TO [[18199,1],1]
[linux:31356] [[18199,0],0] odls:default:SENT KILL 9 TO PID 31362 SUCCESS
[linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],0]
[linux:31356] [[18199,0],0] odls:kill_local_proc child [[18199,1],0] is not alive
[linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],1]
[linux:31356] [[18199,0],0] odls:kill_local_proc child [[18199,1],1] is not alive


Open MPI did kill both shells, and they were indeed killed as evidenced by ps:

# ps -fu gilles --forest
UID        PID  PPID  C STIME TTY          TIME CMD
gilles    1564  1561  0 15:39 ?        00:00:01 sshd: gilles@pts/1
gilles    1565  1564  0 15:39 pts/1    00:00:00  \_ -bash
gilles   31356  1565  3 15:57 pts/1    00:00:00      \_ /home/gilles/local/ompi-v2.x/bin/mpirun -np 2 -mca btl tcp,self --mca odls_base
gilles   31364     1  1 15:57 pts/1    00:00:00 ./abort


So trapping SIGTERM in your shell script and manually killing the MPI task
should work (as Jeff explained, as long as the shell script is fast enough
to do that between SIGTERM and SIGKILL).
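
For example, a minimal sketch of such a wrapper (the executable name ./abort
is an assumption, and your actual script will differ):

  #!/bin/sh
  # start the MPI binary in the background so this shell can still catch signals
  ./abort &
  child=$!
  # on SIGTERM from mpirun, forward the signal to the MPI task
  trap 'kill -TERM "$child" 2>/dev/null' TERM
  # the first wait returns when the signal arrives; the second collects the child
  wait "$child"
  wait "$child"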


If you observe a different behavior, please double check your Open MPI
version and post the output of the same commands.
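
For example (a sketch, re-using the commands from above; ompi_info ships with
Open MPI):

  ompi_info | grep "Open MPI:"
  mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh
  ps -fu $USER --forest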

By the way, are you running under a batch manager? If yes, which one?

Cheers,

Gilles

- Original Message -
> Ted,
> 
> if you
> 
> mpirun --mca odls_base_verbose 10 ...
> 
> you will see which processes get killed and how
> 
> Best regards,
> 
> 
> Gilles
> 
> - Original Message -
> > Hello Jeff,
> > 
> > Thanks for your comments.
> > 
> > I am not seeing behavior #4, on the two computers that I have tested on,
> > using Open MPI 2.1.1.
> > 
> > I wonder if you can duplicate my results with the files that I have
> > uploaded.
> > 
> > Regarding what is the "correct" behavior, I am willing to modify my
> > application to correspond to Open MPI's behavior (whatever behavior the
> > Open MPI developers decide is best) -- provided that Open MPI does in
> > fact kill off both shells.
> > 
> > So my highest priority now is to find out why Open MPI 2.1.1 does not
> > kill off both shells on my computer.
> > 
> > Sincerely,
> > 
> > Ted Sussman
> > 
> > On 16 Jun 2017 at 16:35, Jeff Squyres (jsquyres) wrote:
> > 
> > > Ted --
> > > 
> > > Sorry for jumping in late.  Here's my $0.02...
> > > 
> > > In the runtime, we can do 4 things:
> > > 
> > > 1. Kill just the process that we forked.
> > > 2. Kill just the process(es) that call back and identify themselves as
> > >    MPI processes (we don't track this right now, but we could add that
> > >    functionality).
> > > 3. Union of #1 and #2.
> > > 4. Kill all processes (to include any intermediate processes that are
> > >    not included in #1 and #2).
> > > 
> > > In Open MPI 2.x, #4 is the intended behavior.  There may be a bug or
> > > two that needs to get fixed (e.g., in your last mail, I don't see
> > > offhand why it waits until the MPI process finishes sleeping), but we
> > > should be killing the process group, which -- unless any of the
> > > descendant processes have explicitly left the process group -- should
> > > hit the entire process tree.
> > > 
> > > Sidenote: there's actually a way to be a bit more aggressive and do a
> > > better job of ensuring that we kill *all* processes (via
Re: [OMPI users] Double free or corruption problem updated result

2017-06-17 Thread Gilles Gouaillardet
Ashwin,

Did you try to run your app with an MPICH-based library (MVAPICH,
Intel MPI or even stock MPICH)?
Or did you try with Open MPI v1.10?
The stack trace does not indicate that the double free occurs in MPI...

It seems you ran valgrind against a shell and not against your binary.
Assuming your mpirun command is
mpirun lmparbin_all
I suggest you try again with
mpirun --tag-output valgrind lmparbin_all
That will generate one valgrind log per task; these logs are prefixed,
so it should be easier to figure out what is going wrong.
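
If the tagged output is still hard to read, here is a variant (just a sketch;
the -np 4 and the binary name are assumptions) that writes one valgrind log
file per process:

  # %p in --log-file expands to the pid of each valgrind instance
  mpirun -np 4 valgrind --log-file=vg.%p.log lmparbin_all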

Cheers,

Gilles


On Sun, Jun 18, 2017 at 11:41 AM, ashwin .D  wrote:
> There is a sequential version of the same program COSMO (no reference to
> MPI) that I can run without any problems. Of course it takes a lot longer to
> complete. Now I also ran valgrind (not sure whether that is useful or not)
> and I have enclosed the logs.


Re: [OMPI users] Double free or corruption problem updated result

2017-06-17 Thread ashwin .D
Hello Gilles,

  First of all, I am extremely grateful for this
communication from you on a weekend, and that too only a few hours after I
posted my email. Well, I am not sure I can go on posting log files, as
you rightly point out that MPI is not the source of the
problem. Still, I have enclosed the valgrind log files as you
requested. I have downloaded the MPICH packages as you suggested
and I am going to install them shortly. But before I do that, I think I
have a clue about the source of my problem (double free or corruption) and
I would really appreciate your advice.


As I mentioned before, COSMO has been compiled with mpif90 for shared
memory usage and with gfortran for sequential access.
But it depends on a lot of external third-party software such as
zlib, libcurl, HDF5, netCDF and netCDF-Fortran. When I
looked at the config.log of those packages, all of them had been
compiled with gfortran and gcc (and in some cases g++) with the
--enable-shared option. So my question then is: could that be a source of
the "mismatch"?

In other words, I would have to recompile all those packages with
mpif90 and mpicc and then try another test. At the very
least there should be no mixing of gcc/gfortran-compiled code with
mpif90-compiled code. Comments?
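
Before recompiling everything, here is the kind of check I plan to run (just a
sketch; the binary name lmparbin_all and the /usr/local library path are
assumptions) to see which MPI and runtime libraries each piece actually pulls in:

  # mpif90 is a wrapper, so this shows the underlying compiler and flags
  mpif90 --showme:compile
  # which shared libraries the MPI executable resolves to
  ldd ./lmparbin_all | grep -Ei 'mpi|gfortran|hdf5|netcdf'
  # same check for the netCDF-Fortran library the model links against
  ldd /usr/local/lib/libnetcdff.so | grep -Ei 'mpi|hdf5|gfortran'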


Best regards,
Ashwin.

>Ashwin,

>did you try to run your app with a MPICH-based library (mvapich,
>IntelMPI or even stock mpich) ?
>or did you try with Open MPI v1.10 ?
>the stacktrace does not indicate the double free occurs in MPI...

>it seems you ran valgrind vs a shell and not your binary.
>assuming your mpirun command is
>mpirun lmparbin_all
>i suggest you try again with
>mpirun --tag-output valgrind lmparbin_all
>that will generate one valgrind log per task, but these are prefixed
>so it should be easier to figure out what is going wrong

>Cheers,

>Gilles




logs
Description: Binary data

Re: [OMPI users] Double free or corruption problem updated result

2017-06-17 Thread ashwin .D
There is a sequential version of the same program COSMO (no reference to
MPI) that I can run without any problems. Of course it takes a lot longer
to complete. Now I also ran valgrind (not sure whether that is useful or
not) and I have enclosed the logs.



logs
Description: Binary data