Ashwin,

The valgrind logs clearly indicate you are trying to access memory that was already freed. For example:
[1,0]<stderr>:==4683== Invalid read of size 4
[1,0]<stderr>:==4683==    at 0x795DC2: __src_input_MOD_organize_input (src_input.f90:2318)
[1,0]<stderr>:==4683==  Address 0xb4001d0 is 0 bytes inside a block of size 24 free'd
[1,0]<stderr>:==4683==    by 0x63F3690: free_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683==    by 0x63BB431: nc_close (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683==    by 0x435A9F: __io_utilities_MOD_close_file (io_utilities.f90:995)
[1,0]<stderr>:==4683==  Block was alloc'd at
[1,0]<stderr>:==4683==    by 0x63F378C: new_x_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683==    by 0x63BAF85: nc_open (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683==    by 0x547E6F6: nf_open_ (nf_control.F90:189)
So the double-free error could be a side effect of this.
At this stage, I suggest you fix your application and see whether that resolves your issue
(e.g. there is no need to try another MPI library and/or version for now).
Cheers,
Gilles
On 6/18/2017 2:41 PM, ashwin .D wrote:
Hello Gilles,
First of all, I am extremely grateful for this communication from you on a weekend, and that too only a few hours after I posted my email. Well, I am not sure I can go on posting log files, as you rightly point out that MPI is not the source of the problem. Still, I have enclosed the valgrind log files as you requested. I have downloaded the MPICH packages as you suggested and am going to install them shortly. But before I do that, I think I have a clue about the source of my problem (double free or corruption), and I would really appreciate your advice.
As I mentioned before, COSMO has been compiled with mpif90 for shared-memory usage and with gfortran for sequential access. But it depends on a lot of external third-party software, such as zlib, libcurl, hdf5, netcdf and netcdf-fortran. When I looked at the config.log of those packages, all of them had been compiled with gfortran and gcc (and in some cases g++) with the --enable-shared option. So my question is: could that be a source of the "mismatch"?
In other words, would I have to recompile all those packages with mpif90 and mpicc and then run another test? At the very least, there should be no mixing of gcc/gfortran-compiled code with mpif90-compiled code. Comments?
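One way to sanity-check this (a sketch; the paths and log contents below are invented for illustration) is to compare the backend compiler behind mpif90 with the compiler each dependency recorded in its config.log. mpif90 is only a thin wrapper around a backend compiler, so gfortran-built libraries normally mix fine with mpif90-built code as long as the same gfortran sits behind both:

```shell
# 1) On the real system, ask the Open MPI wrapper which backend it drives:
#        mpif90 --showme:command     # typically prints: gfortran
# 2) Check which compiler each dependency's configure run recorded.
#    A simulated config.log excerpt keeps this snippet self-contained:
cat > /tmp/config.log.example <<'EOF'
configure:3041: checking for Fortran compiler version
configure:3051: result: gfortran
EOF
grep -o 'gfortran' /tmp/config.log.example | head -n 1
```

If the wrapper and the config.log entries name the same compiler (and the same version), a compiler mismatch is unlikely to be the culprit.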
Best regards,
Ashwin.
>Ashwin,
>did you try to run your app with an MPICH-based library (MVAPICH,
>Intel MPI, or even stock MPICH)?
>or did you try with Open MPI v1.10?
>the stack trace does not indicate the double free occurs in MPI...
>it seems you ran valgrind against a shell and not your binary.
>assuming your mpirun command is
>    mpirun lmparbin_all
>i suggest you try again with
>    mpirun --tag-output valgrind lmparbin_all
>that will generate one valgrind log per task, but the logs are prefixed,
>so it should be easier to figure out what is going wrong.
>Cheers,
>Gilles
On Sun, Jun 18, 2017 at 11:41 AM, ashwin .D <winas...@gmail.com> wrote:
> There is a sequential version of the same program COSMO (no reference to
> MPI) that I can run without any problems. Of course it takes a lot longer to
> complete. Now I also ran valgrind (not sure whether that is useful or not)
> and I have enclosed the logs.
On Sat, Jun 17, 2017 at 7:20 PM, ashwin .D <winas...@gmail.com> wrote:
Hello Gilles,
I am enclosing all the information you requested.
1) As an attachment, I enclose the log file.
2) I did rebuild OpenMPI 2.1.1 with the --enable-debug feature, and I reinstalled it in /usr/local/lib.
I ran all the examples in the examples directory. All passed except oshmem_strided_puts, where I got this message:
[[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1 is not in valid range
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 13409, host=a-Vostro-3800) with errorcode -1.
--------------------------------------------------------------------------
3) I deleted all old OpenMPI versions under /usr/local/lib.
4) I am using the COSMO weather model - http://www.cosmo-model.org/ - to run simulations.
The support staff claim they have seen no errors with a similar setup. They use
1) gfortran 4.8.5
2) OpenMPI 1.10.1
The only difference is that I use OpenMPI 2.1.1.
5) I did try this option as well: mpirun --mca btl tcp,self -np 4 cosmo, and I got the same error as in the mpi_logs file.
6) Regarding compiler and linking options on Ubuntu 16.04: mpif90 --showme:compile and mpif90 --showme:link give me the options for compiling and linking.
Here are the options from my makefile: -pthread -lmpi_usempi -lmpi_mpifh -lmpi for linking.
7) I have a 64-bit OS.
Well, I think I have responded to all of your questions. In case I have not, please let me know and I will respond ASAP.
The only thing I have not done is look at /usr/local/include. I saw some old OpenMPI files there; if those need to be deleted, I will do so after I hear from you.
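On that last point: leftover headers or libraries from an older Open MPI are a classic cause of exactly this kind of crash, because code can compile against a 1.10-era mpi.h yet link the 2.1.1 libraries at run time. A hedged sketch of a duplicate hunt follows, run here against a throwaway directory instead of the real /usr/local tree:

```shell
# Simulate an install tree that contains two generations of mpi.h;
# on the real system the search roots would be /usr/local/include
# and /usr/local/lib.
root=$(mktemp -d)
mkdir -p "$root/include" "$root/openmpi-old/include"
touch "$root/include/mpi.h" "$root/openmpi-old/include/mpi.h"
# More than one hit means stale copies that could be picked up by mistake:
find "$root" -name 'mpi.h' | wc -l
```

Comparing `grep OMPI_MAJOR_VERSION` on the header actually found by the compiler against `mpirun --version` would then confirm whether header and runtime match.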
Best regards,
Ashwin.
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users