Hello over there. 

We are seeing a very strange issue when the program sends a non-blocking
message of packed data with MPI_Isend(): if some otherwise unnecessary code
runs before the send (see details below), it works; without that code, it
crashes with a segmentation fault.

This program uses dynamic spawning to launch its processes. Below are some
extracts of the code with comments, the environment specifications, and the
error output.

Thanks in advance,

Martín


—



char * xmul_coord_transbuf = NULL , * transpt , * transend ;
char * mpi_buffer ;
int mpi_buffer_size ; 

void init_xmul_coord_buff ( int siz ) {
  unsigned long int i = ( ( ( unsigned long ) ( siz ) + 7 ) & ~ 7 ) ;
  if ( xmul_coord_transbuf == NULL ) {
      transpt = xmul_coord_transbuf = ( char * ) malloc ( 512 ) ;
      transend = transpt + 508 ; }
  mpi_buffer = transpt ;
  transpt += i ;
  if ( transpt >= transend ) transpt = xmul_coord_transbuf ; 
  mpi_buf_position = 0 ;
  mpi_buffer_size = siz ;
}

#define my_pack(x, mpi_type) { MPI_Pack_size(1,mpi_type,comm,&mpi_pack_size); \
  MPI_Pack(&x, 1, mpi_type, mpi_buffer,mpi_buffer_size,&mpi_buf_position, comm); }

void inform_my_completion ( double val , Fint imstopped ) {
  int a , i = imstopped ; 
  MPI_Comm comm;
  MPI_Status status;
  MPI_Request request;
  if ( !myslavenum ) return ;  // Note: myslavenum equals rank; there are 6 slaves in our test...
  init_xmul_coord_buff ( sizeof ( double ) + sizeof ( int ) ) ; 
  my_pack ( val , MPI_DOUBLE ) ;
  my_pack ( i , MPI_INT ) ;
  
#ifdef FUNNY_CODE
  // compiling with -DFUNNY_CODE, it works; otherwise it crashes with message below ...
  if ( FALSE ) { fprintf ( stderr , "\r/////SLAVE %i - report to COORD... %.0f\n" , myslavenum , val ) ; fflush ( stderr ) ; }
#endif

  // this is done only ONCE, no reception even attempted in our test code
  MPI_Isend( mpi_buffer , mpi_buffer_size , MPI_PACKED , 0 , XMUL_DONE ,
             MPI_COMM_WORLD , &request ) ;
}


-----------------------------
File compiled without optimization, linked with -O3

-----------------------------
Windows Version:
    Windows 10 Pro
Single machine, 4 CPUs (2 threads each)

-----------------------------
Cygwin Version:

$ uname -r
3.3.4(0.341/5/3)

-----------------------------
MPI version: 

mpirun (Open MPI) 4.1.2

All processes started with MPI_Comm_spawn()

-----------------------------
Crash message at runtime:

[DESKTOP-N9KKTKD:00286] *** Process received signal ***
[DESKTOP-N9KKTKD:00286] Signal: Segmentation fault (11)
[DESKTOP-N9KKTKD:00286] Signal code: Address not mapped (23)
[DESKTOP-N9KKTKD:00286] Failing at address: 0xc9
Unable to print stack trace!
[DESKTOP-N9KKTKD:00286] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[DESKTOP-N9KKTKD:00282] *** Process received signal ***
[DESKTOP-N9KKTKD:00282] Signal: Segmentation fault (11)
[DESKTOP-N9KKTKD:00282] Signal code: Address not mapped (23)
[DESKTOP-N9KKTKD:00282] Failing at address: 0xcb
Unable to print stack trace!
[DESKTOP-N9KKTKD:00282] *** End of error message ***

-----------------------------
Message when exiting master:

[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
--------------------------------------------------------------------------
(null) noticed that process rank 5 with PID 0 on node DESKTOP-N9KKTKD exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------