To follow up:
After fixing an issue in the mpich source file simple2pmi.c (a snprintf
call that overruns its buffer), the spawn interface started to work.
However, other things started to break (e.g. singleton mode, when the
program is started without srun).
Pavan Balaji directed me to the steps below, which work great and are
far less hassle than going through srun and PMI2.
- Don't add any flags to mpich; build the default configuration
( ./configure --prefix ... ). Do NOT pass the --with-slurm, --with-pmi
or --enable-pmiport options.
- Don't add -lpmi to your application (it forces slurm's PMI 1
interface, which doesn't support PMI_Spawn_multiple).
Then launch the application via:
    salloc -N 2 mpiexec myapplication
All MPI_Comm_spawn calls now work fine, going through hydra's PMI 1.1
interface. This works as expected and also seems to be faster than the
srun option via PMI2.
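
The steps above can be sketched as the following shell session. The
install prefix and the source file name myapplication.cpp are
placeholders, and compiling through MPICH's mpicxx wrapper is an
assumption (the hello-world test quoted below was linked by hand with
-lpthread -lmpich), so adjust both to your setup.

```shell
# Hypothetical install prefix; any writable location will do.
PREFIX="$HOME/mpich-install"

# Default MPICH build, which uses the Hydra process manager.
# Deliberately no --with-slurm, --with-pmi or --enable-pmiport.
./configure --prefix="$PREFIX"
make
make install

# Build the application with MPICH's compiler wrapper.
# Note there is no -lpmi on the link line.
"$PREFIX/bin/mpicxx" -o myapplication myapplication.cpp

# Request a 2-node Slurm allocation and let Hydra's mpiexec
# start the job inside it (no srun involved).
salloc -N 2 "$PREFIX/bin/mpiexec" ./myapplication
```

Calling mpiexec by its full path simply makes sure the Hydra launcher
from this build is used rather than another mpiexec found on PATH.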
Thanks for all the help.
Kind Regards,
Christoph
On 01/04/13 13:00, Christoph Sprenger wrote:
> Sorry... here is the complete trace:
>
> *** buffer overflow detected ***: /vol/bob/check/csprenger/linux64/opt/bin/mpi_hello_world terminated
> ======= Backtrace: =========
> /lib/libc.so.6(__fortify_fail+0x37)[0x7f194dc19217]
> /lib/libc.so.6(+0xfe0d0)[0x7f194dc180d0]
> /lib/libc.so.6(+0xfd7cb)[0x7f194dc177cb]
> /lib/libc.so.6(__snprintf_chk+0x7a)[0x7f194dc1769a]
> /tech/home/csprenger/mpich-3.0.3_SLURM//lib/libmpich.so.10(+0xe89d7)[0x7f194e9399d7]
> /tech/home/csprenger/mpich-3.0.3_SLURM//lib/libmpich.so.10(PMI2_Init+0x9c1)[0x7f194e93e2e1]
> /tech/home/csprenger/mpich-3.0.3_SLURM//lib/libmpich.so.10(MPID_Init+0xac)[0x7f194e8fd1cc]
> /tech/home/csprenger/mpich-3.0.3_SLURM//lib/libmpich.so.10(MPIR_Init_thread+0x240)[0x7f194e9bf950]
> /tech/home/csprenger/mpich-3.0.3_SLURM//lib/libmpich.so.10(MPI_Init+0xb1)[0x7f194e9bf2c1]
> /vol/bob/check/csprenger/linux64/opt/bin/mpi_hello_world[0x401215]
> /lib/libc.so.6(__libc_start_main+0xfd)[0x7f194db38c4d]
> /vol/bob/check/csprenger/linux64/opt/bin/mpi_hello_world[0x401149]
>
> So I'm still not sure whether I've built this correctly to begin with,
> since the crash occurs in mpich-master-3.0.3/src/pmi/pmi2/simple2pmi.c.
> I'm not sure whether that should be using slurm's pmi2 plugin, or
> whether this is an issue I should enquire about on the mpich forum.
>
>
> this is my code:
>
> #include "mpi.h"
> #include <iostream>
>
> int main(int argc, char *argv[])
> {
>     MPI_Init(&argc, &argv);
>
>     // Query the size of MPI_COMM_WORLD and this process's rank in it.
>     int rank, c_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &c_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     std::cerr << "Hello World from rank " << rank << " / " << c_size
>               << std::endl;
>
>     MPI_Finalize();
>     return 0;
> }
>
> with LF -lpthread -lmpich
>
>
>
> Kind Regards,
> Christoph
>
>
>
> On 29/03/13 22:49, Hongjia Cao wrote:
>> could you please paste the complete output/error messages?
>>
>> On Thu, 2013-03-28 at 14:59 -0600, Christoph Sprenger wrote:
>>> pich.so.10(PMI2_Init+0x7ff)[0x7f5daff7806f]
>>> /tech/home/csprenger/mpich-3.0.2_SLURM//lib/libmpich.so.10(MPID_Init+0xac)[0x7f5daff371ac]
>>> /tech/home/csprenger/mpich-3.0.2_SLURM//lib/libmpich.so.10(MPIR_Init_thread+0x240)[0x7f5dafff90d0]
>>> /tech/home/csprenger/mpich-3.0.2_SLURM//lib/libmpich.so.10(MPI_Init+0xb1)[0x7f5dafff8a41]
>>>
>>>