[OMPI users] Cygwin. Strange issue with MPI_Isend() and packed data

2022-09-13 Thread Martín Morales via users
Hello over there. 

We have a very strange issue when the program sends a non-blocking
message with MPI_Isend() and packed data: if the send runs after some otherwise
unnecessary code (see details below), it works; without that code, it does not.

This program uses dynamic spawning to launch processes. Below are some extracts
of the code with comments, environment specifications, and the error output.

Thanks in advance,

Martín


—



char * xmul_coord_transbuf = NULL , * transpt , * transend ;
char * mpi_buffer ;
int mpi_buffer_size ; 

void init_xmul_coord_buff ( int siz ) {
  unsigned long int i = ( ( ( unsigned long ) ( siz ) + 7 ) & ~ 7 ) ;
  if ( xmul_coord_transbuf == NULL ) {
  transpt = xmul_coord_transbuf = ( char * ) malloc ( 512 ) ;
  transend = transpt + 508 ; }
  mpi_buffer = transpt ;
  transpt += i ;
  if ( transpt >= transend ) transpt = xmul_coord_transbuf ; 
  mpi_buf_position = 0 ;
  mpi_buffer_size = siz ;
}

#define my_pack(x, mpi_type) { MPI_Pack_size(1, mpi_type, comm, &mpi_pack_size); \
  MPI_Pack(&x, 1, mpi_type, mpi_buffer, mpi_buffer_size, &mpi_buf_position, comm); }

void inform_my_completion ( double val , Fint imstopped ) {
  int a , i = imstopped ; 
  MPI_Comm comm;
  MPI_Status status;
  MPI_Request request;
  if ( !myslavenum ) return ;  // Note: myslavenum equals rank; there are 6 slaves in our test...
  init_xmul_coord_buff ( sizeof ( double ) + sizeof ( int ) ) ; 
  my_pack ( val , MPI_DOUBLE ) ;
  my_pack ( i , MPI_INT ) ;
  
#ifdef FUNNY_CODE
  // compiling with -DFUNNY_CODE, it works; otherwise it crashes with the message below ...
  if ( FALSE ) { fprintf ( stderr , "\r/SLAVE %i - report to COORD... %.0f\n" , myslavenum , val ) ; fflush ( stderr ) ; }
#endif

   // this is done only ONCE, no reception even attempted in our test code
  MPI_Isend( mpi_buffer , mpi_buffer_size , MPI_PACKED , 0 , XMUL_DONE , MPI_COMM_WORLD , &request ) ;
}
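
For reference, here is a self-contained sketch of the same pattern (pack a double and
an int, then ship the packed buffer with MPI_Isend while keeping the request until it
completes). It is only an illustration with made-up names and tags, not the application
code above:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 1) {
            double val = 3.0;
            int flag = 7;
            char buffer[64];
            int position = 0;
            MPI_Request request;

            /* Pack one double and one int into a contiguous buffer. */
            MPI_Pack(&val, 1, MPI_DOUBLE, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
            MPI_Pack(&flag, 1, MPI_INT, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);

            /* Non-blocking send of the packed bytes; the buffer and the request
               must stay valid until the send is known to have completed. */
            MPI_Isend(buffer, position, MPI_PACKED, 0, 99, MPI_COMM_WORLD, &request);
            MPI_Wait(&request, MPI_STATUS_IGNORE);
        } else if (rank == 0) {
            char rbuf[64];
            double rval;
            int rflag, pos = 0;
            MPI_Recv(rbuf, sizeof(rbuf), MPI_PACKED, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Unpack(rbuf, sizeof(rbuf), &pos, &rval, 1, MPI_DOUBLE, MPI_COMM_WORLD);
            MPI_Unpack(rbuf, sizeof(rbuf), &pos, &rflag, 1, MPI_INT, MPI_COMM_WORLD);
            printf("got %.1f and %d\n", rval, rflag);
        }
    }

    MPI_Finalize();
    return 0;
}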


-
File compiled without optimization, linked with -O3

-
Windows Version:
Windows 10 Pro
Single machine, 4 CPUs (2 threads each)

-
Cygwin Version:

$ uname -r
3.3.4(0.341/5/3)

-
MPI version: 

mpirun (Open MPI) 4.1.2

All processes started with MPI_Comm_Spawn()

-
Crash message at runtime:

[DESKTOP-N9KKTKD:00286] *** Process received signal ***
[DESKTOP-N9KKTKD:00286] Signal: Segmentation fault (11)
[DESKTOP-N9KKTKD:00286] Signal code: Address not mapped (23)
[DESKTOP-N9KKTKD:00286] Failing at address: 0xc9
Unable to print stack trace!
[DESKTOP-N9KKTKD:00286] *** End of error message ***
--
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
[DESKTOP-N9KKTKD:00282] *** Process received signal ***
[DESKTOP-N9KKTKD:00282] Signal: Segmentation fault (11)
[DESKTOP-N9KKTKD:00282] Signal code: Address not mapped (23)
[DESKTOP-N9KKTKD:00282] Failing at address: 0xcb
Unable to print stack trace!
[DESKTOP-N9KKTKD:00282] *** End of error message ***

-
Message when exiting master:

[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
[DESKTOP-N9KKTKD][[47566,1],0][/pub/devel/openmpi/v4.1/openmpi-4.1.2-1.x86_64/src/openmpi-4.1.2/opal/mca/btl/tcp/btl_tcp_frag.c:242:mca_btl_tcp_frag_recv]
 mca_btl_tcp_frag_recv: readv failed: Software caused connection abort (113)
--
(null) noticed that process rank 5 with PID 0 on node DESKTOP-N9KKTKD exited on 
signal 11 (Segmentation fault).
--











  



[OMPI users] Cygwin. Problem with MPI_Comm_disconnect() in multiple spawns

2021-11-12 Thread Martín Morales via users
Hello,

We've been using OMPI 4.1.0 as a singleton on Linux and Cygwin. The
application is interactive and the user can launch several jobs at a time.
The jobs are launched with the Spawn function.

On Cygwin, when MPI_Comm_disconnect() is called in one job (let's say A) while
another (B) is still running, the spawned processes of B become unable to
respond later when they are needed; A finishes normally. This problem does not occur on Linux.

We've also noted that the spawned processes of B send their messages to
the parent, but those messages are never received.

Is there any known issue with this, or does anyone have an idea?

Thanks in advance. Best regards

Martín


Re: [OMPI users] Issue with MPI_Get_processor_name() in Cygwin

2021-02-10 Thread Martín Morales via users
Hello Joseph,

Yes, it was just that. However, for some reason it was working on Linux…
Thank you very much for your help.
Regards,

Martín

From: Joseph Schuchart via users<mailto:users@lists.open-mpi.org>
Sent: martes, 9 de febrero de 2021 17:45
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: Joseph Schuchart<mailto:schuch...@hlrs.de>
Subject: Re: [OMPI users] Issue with MPI_Get_processor_name() in Cygwin

Martin,

The name argument to MPI_Get_processor_name is a character string of
length at least MPI_MAX_PROCESSOR_NAME, which in OMPI is 256. You are
providing a character string of length 200, so OMPI is free to write
past the end of your string and into some of your stack variables, hence
you are "losing" the values of rank and size. The issue should be gone
if you write `char hostName[MPI_MAX_PROCESSOR_NAME];`
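
For illustration, a minimal conforming declaration and call could look like this
(a sketch, not Martín's program):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    /* The buffer must hold at least MPI_MAX_PROCESSOR_NAME characters. */
    char hostName[MPI_MAX_PROCESSOR_NAME];
    int hostName_len;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(hostName, &hostName_len);
    printf("running on %s (%d chars)\n", hostName, hostName_len);
    MPI_Finalize();
    return 0;
}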

Cheers
Joseph

On 2/9/21 9:14 PM, Martín Morales via users wrote:
> Hello,
>
> I have what could be a memory corruption with
> MPI_Get_processor_name() in Cygwin.
>
> I'm using OMPI 4.1.0; I tried also on Linux (same OMPI version) but
> there is no issue there.
>
> Below is an example of a trivial spawn operation. It has 2 scripts:
> spawned and spawner.
>
> In the spawned script, if I move the MPI_Get_processor_name() line
> below MPI_Comm_size() I lose the values of rank and size.
>
> In fact, I declared some other variables in the int hostName_len, rank,
> size; line and I lost them too.
>
> Regards,
>
> Martín
>
> ---
>
> *Spawned:*
>
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char ** argv){
>     int hostName_len, rank, size;
>     MPI_Comm parentcomm;
>     char hostName[200];
>
>     MPI_Init( NULL, NULL );
>     MPI_Comm_get_parent( &parentcomm );
>     MPI_Get_processor_name(hostName, &hostName_len);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (parentcomm != MPI_COMM_NULL) {
>         printf("I'm the spawned h: %s  r/s: %i/%i\n", hostName, rank, size);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> *Spawner:*
>
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char ** argv){
>     int processesToRun;
>     MPI_Comm intercomm;
>
>     if(argc < 2 ){
>         printf("Processes number needed!\n");
>         return 0;
>     }
>     processesToRun = atoi(argv[1]);
>
>     MPI_Init( NULL, NULL );
>     printf("Spawning from parent:...\n");
>     MPI_Comm_spawn( "./spawned", MPI_ARGV_NULL, processesToRun,
>         MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>
>     MPI_Finalize();
>     return 0;
> }
>



[OMPI users] Issue with MPI_Get_processor_name() in Cygwin

2021-02-09 Thread Martín Morales via users

Hello,

I have what could be a memory corruption with MPI_Get_processor_name() in
Cygwin.
I'm using OMPI 4.1.0; I tried also on Linux (same OMPI version) but there is no
issue there.
Below is an example of a trivial spawn operation. It has 2 scripts: spawned and
spawner.

In the spawned script, if I move the MPI_Get_processor_name() line below
MPI_Comm_size() I lose the values of rank and size.
In fact, I declared some other variables in the int hostName_len, rank, size;
line and I lost them too.

Regards,

Martín

---

Spawned:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int hostName_len,rank, size;
MPI_Comm parentcomm;
char hostName[200];

MPI_Init( NULL, NULL );
MPI_Comm_get_parent( &parentcomm );
MPI_Get_processor_name(hostName, &hostName_len);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (parentcomm != MPI_COMM_NULL) {
  printf("I'm the spawned h: %s  r/s: %i/%i\n", hostName, rank, size);
}

MPI_Finalize();
return 0;
}

Spawner:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int processesToRun;
MPI_Comm intercomm;

  if(argc < 2 ){
printf("Processes number needed!\n");
return 0;
  }
  processesToRun = atoi(argv[1]);
  MPI_Init( NULL, NULL );
  printf("Spawning from parent:...\n");
MPI_Comm_spawn( "./spawned", MPI_ARGV_NULL, processesToRun,
MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

MPI_Finalize();
return 0;
}





Re: [OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-08 Thread Martín Morales via users
Hi Marco,

Apologies for my delay. I tried 4.1.0 and it worked!!
Thank you very much for your assistance. Kind regards,

Martín

From: Marco Atzeri
Sent: sábado, 6 de febrero de 2021 08:54
To: Martín Morales; Open MPI 
Users
Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?

Martin,

what is the IP address of the machine you can not connect ?

All those VMware interfaces look suspicious, anyway.


In the mean time I uploaded 4.1.0-1 for X86_64,
you can try to see if solve the issue.

the i686 version in still in build phase


On 05.02.2021 20:46, Martín Morales wrote:
> Hi Marcos,
>
> Pasted below the output.
>
> Thank you. Regards,
>
> Martín

>
> /internal_name:  {A6301D34-A586-4439-B7A7-69FA905CA167}/
>
> /flags: AF_INET6 up running multicast/
>
> /address:   fe80::e5c6:c83:8653:3cd8%14/
>
> /friendly_name: VMware Network Adapter VMnet1/
>
> //
>
> /internal_name:  {A6301D34-A586-4439-B7A7-69FA905CA167}/
>
> /flags: AF_INET  up broadcast running multicast/
>
> /address:   192.168.148.1/
>
> /friendly_name: VMware Network Adapter VMnet1/
>



Re: [OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-05 Thread Martín Morales via users
Hi Marcos,

Pasted below the output.
Thank you. Regards,

Martín


internal_name:  {C3C1A65B-A775-4604-A187-C1FDC48EC211}
flags: AF_INET6 up running multicast
address:   fe80::c4a3:827f:bd3c:141%16
friendly_name: Ethernet

internal_name:  {C3C1A65B-A775-4604-A187-C1FDC48EC211}
flags: AF_INET  up broadcast running multicast
address:   192.168.56.1
friendly_name: Ethernet

internal_name:  {8F915CBF-9C68-4DFA-9AE1-FF3207DA0CC9}
flags: AF_INET6 up multicast
address:   fe80::f84e:e8e9:b2e7:7f23%9
friendly_name: Local Area Connection* 1

internal_name:  {8F915CBF-9C68-4DFA-9AE1-FF3207DA0CC9}
flags: AF_INET  up broadcast multicast
address:   169.254.127.35
friendly_name: Local Area Connection* 1

internal_name:  {9C9C1DB9-1E8B-4893-A83F-C881060ED6DF}
flags: AF_INET6 up multicast
address:   fe80::3c7c:e0b3:af76:3a0%11
friendly_name: Local Area Connection* 2

internal_name:  {9C9C1DB9-1E8B-4893-A83F-C881060ED6DF}
flags: AF_INET  up broadcast multicast
address:   169.254.3.160
friendly_name: Local Area Connection* 2

internal_name:  {A6301D34-A586-4439-B7A7-69FA905CA167}
flags: AF_INET6 up running multicast
address:   fe80::e5c6:c83:8653:3cd8%14
friendly_name: VMware Network Adapter VMnet1

internal_name:  {A6301D34-A586-4439-B7A7-69FA905CA167}
flags: AF_INET  up broadcast running multicast
address:   192.168.148.1
friendly_name: VMware Network Adapter VMnet1

internal_name:  {B259A286-0A90-429D-97A1-D4CEAA97EA42}
flags: AF_INET6 up running multicast
address:   fe80::555d:6f18:486e:376e%15
friendly_name: VMware Network Adapter VMnet8

internal_name:  {B259A286-0A90-429D-97A1-D4CEAA97EA42}
flags: AF_INET  up broadcast running multicast
address:   192.168.200.1
friendly_name: VMware Network Adapter VMnet8

internal_name:  {1BA832E1-BEA7-4E7B-8FFA-9BBDCBA170A6}
flags: AF_INET6 up running multicast
address:   fe80::8038:114f:63e0:81a3%5
friendly_name: Wi-Fi

internal_name:  {1BA832E1-BEA7-4E7B-8FFA-9BBDCBA170A6}
flags: AF_INET  up broadcast running multicast
address:   192.168.100.45
friendly_name: Wi-Fi

internal_name:  {144D1ED1-0EE5-47E1-82A7-7E4ABB8DB2D8}
flags: AF_INET6 up multicast
address:   fe80::c4f8:6145:cc59:4c4b%3
friendly_name: Bluetooth Network Connection

internal_name:  {144D1ED1-0EE5-47E1-82A7-7E4ABB8DB2D8}
flags: AF_INET  up broadcast multicast
address:   169.254.76.75
friendly_name: Bluetooth Network Connection

internal_name:  {EB43FC56-BCC2-11EA-A07A-806E6F6E6963}
flags: AF_INET6 up loopback running multicast
address:   ::1
friendly_name: Loopback Pseudo-Interface 1

internal_name:  {EB43FC56-BCC2-11EA-A07A-806E6F6E6963}
flags: AF_INET  up loopback running multicast
address:   127.0.0.1
friendly_name: Loopback Pseudo-Interface 1

internal_name:  {A04CB0D8-879C-418C-8BB7-209EEADBDCD0}
flags: AF_INET  up broadcast multicast
address:   0.0.0.0
friendly_name: Ethernet (Kernel Debugger)

internal_name:  {D9635C22-48DE-4359-99BA-057A3850FA03}
flags: AF_INET  up broadcast multicast
address:   192.168.56.1
friendly_name: VirtualBox Host-Only Network


From: Marco Atzeri via users<mailto:users@lists.open-mpi.org>
Sent: viernes, 5 de febrero de 2021 13:37
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: Marco Atzeri<mailto:marco.atz...@gmail.com>
Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?

On 05.02.2021 16:18, Martín Morales via users wrote:
> Hi Gilles,
>
> I tried but it hangs indefinitely and without any output.
>
> Regards,
>
> Martín
>

Hi Martin,

can you run get-interface available on

http://matzeri.altervista.org/works/interface/

so we can see how Cygwin identify all your network interface ?

Regards
Marco



Re: [OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-05 Thread Martín Morales via users
Hi Gilles,

I tried but it hangs indefinitely and without any output.
Regards,

Martín

From: Gilles Gouaillardet via users<mailto:users@lists.open-mpi.org>
Sent: jueves, 4 de febrero de 2021 23:48
To: Open MPI Users<mailto:users@lists.open-mpi.org>
Cc: Gilles Gouaillardet<mailto:gilles.gouaillar...@gmail.com>
Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?

Martin,

at first glance, I could not spot the root cause.

That being said, the second node is sometimes referred to as
"WinDev2021Eval" in the logs, but it is also referred to as "worker".

What if you use the real names in your hostfile: DESKTOP-C0G4680 and
WinDev2021Eval instead of master and worker?

Cheers,

Gilles

On Fri, Feb 5, 2021 at 5:59 AM Martín Morales via users
 wrote:
>
> Hello all,
>
>
>
> Gilles, unfortunately, the result is the same. Attached the log you ask me.
>
>
>
> Jeff, some time ago I tried with OMPI 4.1.0 (Linux) and it worked.
>
>
>
> Thank you both. Regards,
>
>
>
> Martín
>
>
>
> From: Jeff Squyres (jsquyres) via users
> Sent: jueves, 4 de febrero de 2021 16:10
> To: Open MPI User's List
> Cc: Jeff Squyres (jsquyres)
> Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?
>
>
>
> Do we know if this was definitely fixed in v4.1.x?
>
>
> > On Feb 4, 2021, at 7:46 AM, Gilles Gouaillardet via users 
> >  wrote:
> >
> > Martin,
> >
> > this is a connectivity issue reported by the btl/tcp component.
> >
> > You can try restricting the IP interface to a subnet known to work
> > (and with no firewall) between both hosts
> >
> > mpirun --mca btl_tcp_if_include 192.168.0.0/24 ...
> >
> > If the error persists, you can
> >
> > mpirun --mca btl_tcp_base_verbose 20 ...
> >
> > and then compress and post the logs so we can have a look
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Thu, Feb 4, 2021 at 9:33 PM Martín Morales via users
> >  wrote:
> >>
> >> Hi Marcos,
> >>
> >>
> >>
> >> Yes, I have a problem with spawning to a “worker” host (on localhost, 
> >> works). There are just two machines: “master” and “worker”.  I’m using 
> >> Windows 10 in both with same Cygwin and packages. Pasted below some 
> >> details.
> >>
> >> Thanks for your help. Regards,
> >>
> >>
> >>
> >> Martín
> >>
> >>
> >>
> >> 
> >>
> >>
> >>
> >> Running:
> >>
> >>
> >>
> >> mpirun -np 1 -hostfile ./hostfile ./spawner.exe 8
> >>
> >>
> >>
> >> hostfile:
> >>
> >>
> >>
> >> master slots=5
> >>
> >> worker slots=5
> >>
> >>
> >>
> >> Error:
> >>
> >>
> >>
> >> At least one pair of MPI processes are unable to reach each other for
> >>
> >> MPI communications.  This means that no Open MPI device has indicated
> >>
> >> that it can be used to communicate between these processes.  This is
> >>
> >> an error; Open MPI requires that all MPI processes be able to reach
> >>
> >> each other.  This error can sometimes be the result of forgetting to
> >>
> >> specify the "self" BTL.
> >>
> >>
> >>
> >> Process 1 ([[31598,1],0]) is on host: DESKTOP-C0G4680
> >>
> >> Process 2 ([[31598,2],2]) is on host: worker
> >>
> >> BTLs attempted: self tcp
> >>
> >>
> >>
> >> Your MPI job is now going to abort; sorry.
> >>
> >> --
> >>
> >> [DESKTOP-C0G4680:02828] [[31598,1],0] ORTE_ERROR_LOG: Unreachable in file 
> >> /pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c
> >>  at line 493
> >>
> >> [DESKTOP-C0G4680:02828] *** An error occurred in MPI_Comm_spawn
> >>
> >> [DESKTOP-C0G4680:02828] *** reported by process [2070806529,0]
> >>
> >> [DESKTOP-C0G4680:02828] *** on communicator MPI_COMM_SELF
> >>
> >> [DESKTOP-C0G4680:02828] *** MPI_ERR_INTERN: internal error
> >>
> >> [DESKTOP-C0G4680:02828] *** MPI_ERRORS_ARE_FATAL (processes in this 
> >> communicator will now abort,
> >>
> >> [DESKTOP-C0G4680:02828] ***and poten

Re: [OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-04 Thread Martín Morales via users
Hello all,

Gilles, unfortunately, the result is the same. Attached the log you ask me.

Jeff, some time ago I tried with OMPI 4.1.0 (Linux) and it worked.

Thank you both. Regards,

Martín

From: Jeff Squyres (jsquyres) via users<mailto:users@lists.open-mpi.org>
Sent: jueves, 4 de febrero de 2021 16:10
To: Open MPI User's List<mailto:users@lists.open-mpi.org>
Cc: Jeff Squyres (jsquyres)<mailto:jsquy...@cisco.com>
Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?

Do we know if this was definitely fixed in v4.1.x?


> On Feb 4, 2021, at 7:46 AM, Gilles Gouaillardet via users 
>  wrote:
>
> Martin,
>
> this is a connectivity issue reported by the btl/tcp component.
>
> You can try restricting the IP interface to a subnet known to work
> (and with no firewall) between both hosts
>
> mpirun --mca btl_tcp_if_include 192.168.0.0/24 ...
>
> If the error persists, you can
>
> mpirun --mca btl_tcp_base_verbose 20 ...
>
> and then compress and post the logs so we can have a look
>
>
> Cheers,
>
> Gilles
>
> On Thu, Feb 4, 2021 at 9:33 PM Martín Morales via users
>  wrote:
>>
>> Hi Marcos,
>>
>>
>>
>> Yes, I have a problem with spawning to a “worker” host (on localhost, 
>> works). There are just two machines: “master” and “worker”.  I’m using 
>> Windows 10 in both with same Cygwin and packages. Pasted below some details.
>>
>> Thanks for your help. Regards,
>>
>>
>>
>> Martín
>>
>>
>>
>> 
>>
>>
>>
>> Running:
>>
>>
>>
>> mpirun -np 1 -hostfile ./hostfile ./spawner.exe 8
>>
>>
>>
>> hostfile:
>>
>>
>>
>> master slots=5
>>
>> worker slots=5
>>
>>
>>
>> Error:
>>
>>
>>
>> At least one pair of MPI processes are unable to reach each other for
>>
>> MPI communications.  This means that no Open MPI device has indicated
>>
>> that it can be used to communicate between these processes.  This is
>>
>> an error; Open MPI requires that all MPI processes be able to reach
>>
>> each other.  This error can sometimes be the result of forgetting to
>>
>> specify the "self" BTL.
>>
>>
>>
>> Process 1 ([[31598,1],0]) is on host: DESKTOP-C0G4680
>>
>> Process 2 ([[31598,2],2]) is on host: worker
>>
>> BTLs attempted: self tcp
>>
>>
>>
>> Your MPI job is now going to abort; sorry.
>>
>> --
>>
>> [DESKTOP-C0G4680:02828] [[31598,1],0] ORTE_ERROR_LOG: Unreachable in file 
>> /pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c
>>  at line 493
>>
>> [DESKTOP-C0G4680:02828] *** An error occurred in MPI_Comm_spawn
>>
>> [DESKTOP-C0G4680:02828] *** reported by process [2070806529,0]
>>
>> [DESKTOP-C0G4680:02828] *** on communicator MPI_COMM_SELF
>>
>> [DESKTOP-C0G4680:02828] *** MPI_ERR_INTERN: internal error
>>
>> [DESKTOP-C0G4680:02828] *** MPI_ERRORS_ARE_FATAL (processes in this 
>> communicator will now abort,
>>
>> [DESKTOP-C0G4680:02828] ***and potentially your MPI job)
>>
>>
>>
>> USER_SSH@DESKTOP-C0G4680 ~
>>
>> $ [WinDev2012Eval:00120] [[31598,2],2] ORTE_ERROR_LOG: Unreachable in file 
>> /pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c
>>  at line 493
>>
>> [WinDev2012Eval:00121] [[31598,2],3] ORTE_ERROR_LOG: Unreachable in file 
>> /pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c
>>  at line 493
>>
>> --
>>
>> It looks like MPI_INIT failed for some reason; your parallel process is
>>
>> likely to abort.  There are many reasons that a parallel process can
>>
>> fail during MPI_INIT; some of which are due to configuration or environment
>>
>> problems.  This failure appears to be an internal failure; here's some
>>
>> additional information (which may only be relevant to an Open MPI
>>
>> developer):
>>
>>
>>
>> ompi_dpm_dyn_init() failed
>>
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>
>> --
>>
>> [WinDev2012Eval:00121] *** An error occurred in MPI_Init
>>
>> [WinDev2012Eval:00121] *** reported by process 
>> [15289389101093879810,12884901

Re: [OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-04 Thread Martín Morales via users
Hi Marcos,

Yes, I have a problem with spawning to a “worker” host (on localhost, works). 
There are just two machines: “master” and “worker”.  I’m using Windows 10 in 
both with same Cygwin and packages. Pasted below some details.
Thanks for your help. Regards,

Martín



Running:

mpirun -np 1 -hostfile ./hostfile ./spawner.exe 8

hostfile:

master slots=5
worker slots=5

Error:

At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[31598,1],0]) is on host: DESKTOP-C0G4680
Process 2 ([[31598,2],2]) is on host: worker
BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--
[DESKTOP-C0G4680:02828] [[31598,1],0] ORTE_ERROR_LOG: Unreachable in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c 
at line 493
[DESKTOP-C0G4680:02828] *** An error occurred in MPI_Comm_spawn
[DESKTOP-C0G4680:02828] *** reported by process [2070806529,0]
[DESKTOP-C0G4680:02828] *** on communicator MPI_COMM_SELF
[DESKTOP-C0G4680:02828] *** MPI_ERR_INTERN: internal error
[DESKTOP-C0G4680:02828] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[DESKTOP-C0G4680:02828] ***and potentially your MPI job)

USER_SSH@DESKTOP-C0G4680 ~
$ [WinDev2012Eval:00120] [[31598,2],2] ORTE_ERROR_LOG: Unreachable in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c 
at line 493
[WinDev2012Eval:00121] [[31598,2],3] ORTE_ERROR_LOG: Unreachable in file 
/pub/devel/openmpi/v4.0/openmpi-4.0.5-1.x86_64/src/openmpi-4.0.5/ompi/dpm/dpm.c 
at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_dpm_dyn_init() failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
[WinDev2012Eval:00121] *** An error occurred in MPI_Init
[WinDev2012Eval:00121] *** reported by process 
[15289389101093879810,12884901891]
[WinDev2012Eval:00121] *** on a NULL communicator
[WinDev2012Eval:00121] *** Unknown error
[WinDev2012Eval:00121] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[WinDev2012Eval:00121] ***and potentially your MPI job)
[DESKTOP-C0G4680:02831] 2 more processes have sent help message 
help-mca-bml-r2.txt / unreachable proc
[DESKTOP-C0G4680:02831] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages
[DESKTOP-C0G4680:02831] 1 more process has sent help message 
help-mpi-runtime.txt / mpi_init:startup:internal-failure
[DESKTOP-C0G4680:02831] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Script spawner:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int processesToRun;
MPI_Comm intercomm;
MPI_Info info;

   if(argc < 2 ){
  printf("Processes number needed!\n");
  return 0;
   }
   processesToRun = atoi(argv[1]);
MPI_Init( NULL, NULL );
   printf("Spawning from parent:...\n");
   MPI_Comm_spawn( "./spawned.exe", MPI_ARGV_NULL, processesToRun,
MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

MPI_Finalize();
return 0;
}

Script spawned:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int hostName_len,rank, size;
MPI_Comm parentcomm;
char hostName[200];

MPI_Init( NULL, NULL );
MPI_Comm_get_parent( &parentcomm );
MPI_Get_processor_name(hostName, &hostName_len);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (parentcomm != MPI_COMM_NULL) {
printf("I'm the spawned h: %s  r/s: %i/%i\n", hostName, rank, size );
}

MPI_Finalize();
return 0;
}




From: Marco Atzeri via users<mailto:users@lists.open-mpi.org>
Sent: miércoles, 3 de febrero de 2021 17:58
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Cc: Marco Atzeri<mailto:marco.atz...@gmail.com>
Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?

On 03.02.2021 21:35, Martín Morales via users wrote:
> Hello,
>
> I would like to know if any OMPI 4.1.* is going to 

[OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-03 Thread Martín Morales via users

Hello,

I would like to know if any OMPI 4.1.* is going to be available in the Cygwin 
packages.
Thanks and regards,

Martín


Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-15 Thread Martín Morales via users
Hi Howard, that’s right. This happens after some time I run

./simple_spawn 

and the hostfile without “master” host (and just “worker” in it).
Regards,

Martín




From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: sábado, 15 de agosto de 2020 15:09
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

HI Martin,

Thanks this is helpful.  Are you getting this timeout when you're running the 
spawner process as a singleton?

Howard

Am Fr., 14. Aug. 2020 um 17:44 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Howard,

I pasted below, the error message after a while of the hang I referred.
Regards,

Martín

-

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--
[nos-GF7050VT-M:03767] *** An error occurred in MPI_Init
[nos-GF7050VT-M:03767] *** reported by process [2337734658,0]
[nos-GF7050VT-M:03767] *** on a NULL communicator
[nos-GF7050VT-M:03767] *** Unknown error
[nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[nos-GF7050VT-M:03767] ***and potentially your MPI job)
[osboxes:02457] *** An error occurred in MPI_Comm_spawn
[osboxes:02457] *** reported by process [2337734657,0]
[osboxes:02457] *** on communicator MPI_COMM_WORLD
[osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error
[osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[osboxes:02457] ***and potentially your MPI job)
[osboxes:02458] 1 more process has sent help message help-orted.txt / timedout
[osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages




From: Martín Morales via users<mailto:users@lists.open-mpi.org>
Sent: viernes, 14 de agosto de 2020 19:40
To: Howard Pritchard<mailto:hpprit...@gmail.com>
Cc: Martín Morales<mailto:martineduardomora...@hotmail.com>; Open MPI 
Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Howard.

Thanks for tracking this on GitHub. I have run with mpirun, without "master" in the
hostfile, and it runs OK. The hang occurs when I run as a singleton (no
mpirun), which is the way I need to run. If I run top on both machines the
processes are correctly mapped but hung. It seems the MPI_Init() function
doesn't return. Thanks for your help.
Best regards,

Martín






From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: viernes, 14 de agosto de 2020 15:18
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I opened an issue on Open MPI's github to track this 
https://github.com/open-mpi/ompi/issues/8005

You may be seeing another problem if you removed master from the host file.
Could you add the --debug-daemons option to the mpirun and post the output?

Howard


Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Hi Howard.

Great! That works for the crashing problem with OMPI 4.0.4. However, it still
hangs if I remove "master" (the host which launches the spawning processes) from my
hostfile.
I need to spawn only on "worker". Is there a way or workaround to do this
without mpirun?
Thanks a lot for your assistance.

Martín




From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 19:13
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-14 Thread Martín Morales via users
Howard,

I pasted below, the error message after a while of the hang I referred.
Regards,

Martín

-

A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--
[nos-GF7050VT-M:03767] *** An error occurred in MPI_Init
[nos-GF7050VT-M:03767] *** reported by process [2337734658,0]
[nos-GF7050VT-M:03767] *** on a NULL communicator
[nos-GF7050VT-M:03767] *** Unknown error
[nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[nos-GF7050VT-M:03767] ***and potentially your MPI job)
[osboxes:02457] *** An error occurred in MPI_Comm_spawn
[osboxes:02457] *** reported by process [2337734657,0]
[osboxes:02457] *** on communicator MPI_COMM_WORLD
[osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error
[osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[osboxes:02457] ***and potentially your MPI job)
[osboxes:02458] 1 more process has sent help message help-orted.txt / timedout
[osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages




From: Martín Morales via users<mailto:users@lists.open-mpi.org>
Sent: viernes, 14 de agosto de 2020 19:40
To: Howard Pritchard<mailto:hpprit...@gmail.com>
Cc: Martín Morales<mailto:martineduardomora...@hotmail.com>; Open MPI 
Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Howard.

Thanks for tracking this on GitHub. I have run with mpirun, without "master" in the
hostfile, and it runs OK. The hang occurs when I run as a singleton (no
mpirun), which is the way I need to run. If I run top on both machines the
processes are correctly mapped but hung. It seems the MPI_Init() function
doesn't return. Thanks for your help.
Best regards,

Martín






From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: viernes, 14 de agosto de 2020 15:18
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I opened an issue on Open MPI's github to track this 
https://github.com/open-mpi/ompi/issues/8005

You may be seeing another problem if you removed master from the host file.
Could you add the --debug-daemons option to the mpirun and post the output?

Howard


Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Hi Howard.

Great! That works for the crashing problem with OMPI 4.0.4. However, it still
hangs if I remove "master" (the host which launches the spawning processes) from my
hostfile.
I need to spawn only on "worker". Is there a way or workaround to do this
without mpirun?
Thanks a lot for your assistance.

Martín




From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 19:13
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I was able to reproduce this with 4.0.x branch.  I'll open an issue.

If you really want to use 4.0.4, then what you'll need to do is build an 
external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then 
build Open MPI using --with-pmix=<where your PMIx is installed>.
You will also need to build both Open MPI and PMIx against the same libevent.   
There's a configure option with both packages to use an external libevent 
installation.

Howard


Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have to post 
this on the bug section? Thanks and regards.

Martín

From: Howard Pritchard<ma

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-14 Thread Martín Morales via users
Hi Howard.

Thanks for tracking this on GitHub. I have run with mpirun, without "master" in the
hostfile, and it runs OK. The hang occurs when I run as a singleton (no
mpirun), which is the way I need to run. If I run top on both machines the
processes are correctly mapped but hung. It seems the MPI_Init() function
doesn't return. Thanks for your help.
Best regards,

Martín





From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: viernes, 14 de agosto de 2020 15:18
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I opened an issue on Open MPI's github to track this 
https://github.com/open-mpi/ompi/issues/8005

You may be seeing another problem if you removed master from the host file.
Could you add the --debug-daemons option to the mpirun and post the output?

Howard


Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Hi Howard.

Great! That works for the crashing problem with OMPI 4.0.4. However, it still
hangs if I remove "master" (the host which launches the spawning processes) from my
hostfile.
I need to spawn only on "worker". Is there a way or workaround to do this
without mpirun?
Thanks a lot for your assistance.

Martín




From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 19:13
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I was able to reproduce this with 4.0.x branch.  I'll open an issue.

If you really want to use 4.0.4, then what you'll need to do is build an 
external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then 
build Open MPI using --with-pmix=<where your PMIx is installed>.
You will also need to build both Open MPI and PMIx against the same libevent.   
There's a configure option with both packages to use an external libevent 
installation.

Howard


Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have to post 
this on the bug section? Thanks and regards.

Martín

From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 14:44
To: Open MPI Users<mailto:users@lists.open-mpi.org>
Cc: Martín Morales<mailto:martineduardomora...@hotmail.com>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hello Martin,

Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version 
that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
This is supposed to be fixed in the 4.0.5 release.  Could you try the 4.0.5rc1 
tarball and see if that addresses the problem you're seeing?

https://www.open-mpi.org/software/ompi/v4.0/

Howard



Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users 
mailto:users@lists.open-mpi.org>>:

Hello people!
I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one "master" and
one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built OMPI just like
this:

./configure --prefix=/usr/local/openmpi-4.0.4/bin/

My hostfile is this:

master slots=2
worker slots=2

I'm trying to dynamically allocate the processes with MPI_Comm_spawn().
If I launch the processes only on the "master" machine it's OK. But if I use
the hostfile, it crashes with this:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file 
dpm/dpm.c at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-11 Thread Martín Morales via users
Hi Howard.

Great! That works for the crashing problem with OMPI 4.0.4. However, it still
hangs if I remove "master" (the host which launches the spawning processes) from my
hostfile.
I need to spawn only on "worker". Is there a way or workaround to do this
without mpirun?
Thanks a lot for your assistance.

Martín




From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 19:13
To: Martín Morales<mailto:martineduardomora...@hotmail.com>
Cc: Open MPI Users<mailto:users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hi Martin,

I was able to reproduce this with 4.0.x branch.  I'll open an issue.

If you really want to use 4.0.4, then what you'll need to do is build an 
external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then 
build Open MPI using --with-pmix=<where your PMIx is installed>.
You will also need to build both Open MPI and PMIx against the same libevent.   
There's a configure option with both packages to use an external libevent 
installation.

Howard


Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales 
mailto:martineduardomora...@hotmail.com>>:
Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have to post 
this on the bug section? Thanks and regards.

Martín

From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 14:44
To: Open MPI Users<mailto:users@lists.open-mpi.org>
Cc: Martín Morales<mailto:martineduardomora...@hotmail.com>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hello Martin,

Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version 
that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
This is supposed to be fixed in the 4.0.5 release.  Could you try the 4.0.5rc1 
tarball and see if that addresses the problem you're seeing?

https://www.open-mpi.org/software/ompi/v4.0/

Howard



Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users 
mailto:users@lists.open-mpi.org>>:

Hello people!
I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one "master" and
one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built OMPI just like
this:

./configure --prefix=/usr/local/openmpi-4.0.4/bin/

My hostfile is this:

master slots=2
worker slots=2

I'm trying to dynamically allocate the processes with MPI_Comm_spawn().
If I launch the processes only on the "master" machine it's OK. But if I use
the hostfile, it crashes with this:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file 
dpm/dpm.c at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
[nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
[nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
[nos-GF7050VT-M:22526] *** on a NULL communicator
[nos-GF7050VT-M:22526] *** Unknown error
[nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[nos-GF7050VT-M:22526] ***and potentially your MPI job)

Note: host "nos-GF7050VT-M" is "worker"

But if I run without "master" in the hostfile, the processes are launched but it
hangs: MPI_Init() doesn't return.
I launched the script (pasted below) in these 2 ways, with the same result:

$ ./simple_spawn 2
$ mpirun -np 1 ./simple_spawn 2

The "simple_spawn" script:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int processesToRun;
MPI_Comm parentcomm, intercomm;
MPI_Info info;
 

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-10 Thread Martín Morales via users
Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have to post 
this on the bug section? Thanks and regards.

Martín

From: Howard Pritchard<mailto:hpprit...@gmail.com>
Sent: lunes, 10 de agosto de 2020 14:44
To: Open MPI Users<mailto:users@lists.open-mpi.org>
Cc: Martín Morales<mailto:martineduardomora...@hotmail.com>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.

Hello Martin,

Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version 
that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
This is supposed to be fixed in the 4.0.5 release.  Could you try the 4.0.5rc1 
tarball and see if that addresses the problem you're seeing?

https://www.open-mpi.org/software/ompi/v4.0/

Howard



Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users 
mailto:users@lists.open-mpi.org>>:

Hello people!
I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one "master" and
one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built OMPI just like
this:

./configure --prefix=/usr/local/openmpi-4.0.4/bin/

My hostfile is this:

master slots=2
worker slots=2

I'm trying to dynamically allocate the processes with MPI_Comm_spawn().
If I launch the processes only on the "master" machine it's OK. But if I use
the hostfile, it crashes with this:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file 
dpm/dpm.c at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
[nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
[nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
[nos-GF7050VT-M:22526] *** on a NULL communicator
[nos-GF7050VT-M:22526] *** Unknown error
[nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[nos-GF7050VT-M:22526] ***and potentially your MPI job)

Note: host "nos-GF7050VT-M" is "worker"

But if I run without "master" in the hostfile, the processes are launched but it
hangs: MPI_Init() doesn't return.
I launched the script (pasted below) in these 2 ways, with the same result:

$ ./simple_spawn 2
$ mpirun -np 1 ./simple_spawn 2

The "simple_spawn" script:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int processesToRun;
MPI_Comm parentcomm, intercomm;
MPI_Info info;
int rank, size, hostName_len;
char hostName[200];

MPI_Init( &argc, &argv );
MPI_Comm_get_parent( &parentcomm );
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(hostName, &hostName_len);

if (parentcomm == MPI_COMM_NULL) {

if(argc < 2 ){
printf("Processes number needed!");
return 0;
}
processesToRun = atoi(argv[1]);
MPI_Info_create( &info );
MPI_Info_set( info, "hostfile", "./hostfile" );
MPI_Info_set( info, "map_by", "node" );

MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
printf("I'm the parent.\n");
} else {
printf("I'm the spawned h: %s  r/s: %i/%i.\n", hostName, rank, size );
}
fflush(stdout);
MPI_Finalize();
return 0;
}

I came from OMPI 4.0.1. In that version it works... with some
inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
I tried several versions with no luck. Is there maybe an intrinsic problem with
the OMPI dynamic allocation functionality?
Any help will be very appreciated. Best regards.

Martín




[OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-06 Thread Martín Morales via users

Hello people!
I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one "master" and
one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built OMPI just like
this:

./configure --prefix=/usr/local/openmpi-4.0.4/bin/

My hostfile is this:

master slots=2
worker slots=2

I'm trying to dynamically allocate the processes with MPI_Comm_spawn().
If I launch the processes only on the "master" machine it's OK. But if I use
the hostfile, it crashes with this:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file 
dpm/dpm.c at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
[nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
[nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
[nos-GF7050VT-M:22526] *** on a NULL communicator
[nos-GF7050VT-M:22526] *** Unknown error
[nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[nos-GF7050VT-M:22526] ***and potentially your MPI job)

Note: host "nos-GF7050VT-M" is "worker"

But if I run without "master" in the hostfile, the processes are launched but it
hangs: MPI_Init() doesn't return.
I launched the script (pasted below) in these 2 ways, with the same result:

$ ./simple_spawn 2
$ mpirun -np 1 ./simple_spawn 2

The "simple_spawn" script:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
int processesToRun;
MPI_Comm parentcomm, intercomm;
MPI_Info info;
int rank, size, hostName_len;
char hostName[200];

MPI_Init( &argc, &argv );
MPI_Comm_get_parent( &parentcomm );
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(hostName, &hostName_len);

if (parentcomm == MPI_COMM_NULL) {

if(argc < 2 ){
printf("Processes number needed!");
return 0;
}
processesToRun = atoi(argv[1]);
MPI_Info_create( &info );
MPI_Info_set( info, "hostfile", "./hostfile" );
MPI_Info_set( info, "map_by", "node" );

MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
printf("I'm the parent.\n");
} else {
printf("I'm the spawned h: %s  r/s: %i/%i.\n", hostName, rank, size );
}
fflush(stdout);
MPI_Finalize();
return 0;
}

I came from OMPI 4.0.1. In that version it works... with some
inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
I tried several versions with no luck. Is there maybe an intrinsic problem with
the OMPI dynamic allocation functionality?
Any help will be very appreciated. Best regards.

Martín



[OMPI users] Problem with MPI_Spawn

2020-04-20 Thread Martín Morales via users
Hello All.
I'm using OMPI 4.0.1. I run MPI_Spawn() as a singleton. I need to run different
spawn configurations in the same instance of my app. In this case I first spawn
using my hostfile (i.e., setting an MPI_Info object with MPI_Info_create() and
then setting the attributes with MPI_Info_set()); that allocates the processes OK.
Then I spawn again, but with MPI_INFO_NULL in the spawn call, because I want to
allocate the processes only locally; however, the allocation ignores this and
places the processes according to the previous MPI_Info setting. It's as if the
spawn configuration were "cached".
Any idea?
Thanks in advance and regards.
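
For context, here is a minimal sketch of the call sequence being described; the
spawned program name and process counts are made up, and the info key follows the
"hostfile" key used elsewhere in this archive:

#include "mpi.h"

int main(int argc, char **argv) {
    MPI_Comm inter1, inter2;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* First spawn: placement driven by a hostfile passed through MPI_Info. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", "./hostfile");
    MPI_Comm_spawn("./spawned", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &inter1, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);

    /* Second spawn: MPI_INFO_NULL, expecting purely local placement, but the
       processes are reportedly placed as if the previous info still applied. */
    MPI_Comm_spawn("./spawned", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &inter2, MPI_ERRCODES_IGNORE);

    MPI_Finalize();
    return 0;
}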


Re: [OMPI users] Non-blocking send issue

2019-12-31 Thread Martín Morales via users
Hi George, thank you very much for your answer. Can you please explain a
little more about "If you need to guarantee progress you might either have your
own thread calling MPI functions (such as MPI_Test)"? Regards

Martín


De: George Bosilca 
Enviado: martes, 31 de diciembre de 2019 13:47
Para: Open MPI Users 
Cc: Martín Morales 
Asunto: Re: [OMPI users] Non-blocking send issue

Martin,

The MPI standard does not mandate progress outside MPI calls, thus 
implementations are free to provide, or not, asynchronous progress. Calling 
MPI_Test provides the MPI implementation with an opportunity to progress it's 
internal communication queues. However, an implementation could try a best 
effort to limit the time it spent in MPI_Test* and to provide the application 
with more time for computation, even when this might limit its own internal 
progress. Thus, as a non-blocking collective is composed of a potentially large 
number of point-to-point communications, it might require a significant number 
of MPI_Test to reach completion.

If you need to guarantee progress you might either have your own thread calling 
MPI functions (such as MPI_Test) or you can use the asynchronous progress some 
MPI libraries provide. For this last option read the documentation of your MPI 
implementation to see how to enable asynchronous progress.
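
For the first option, a minimal sketch of a helper thread that keeps calling
MPI_Test on an outstanding request (illustrative only; it assumes the library
provides MPI_THREAD_MULTIPLE, and every name in it is made up):

#include "mpi.h"
#include <pthread.h>
#include <stdio.h>

static MPI_Request pending_request;

/* Helper thread: repeatedly calls MPI_Test so the MPI library gets a chance
   to progress the outstanding operation while the main thread does other work. */
static void *progress_thread(void *arg) {
    int complete = 0;
    (void)arg;
    while (!complete) {
        MPI_Test(&pending_request, &complete, MPI_STATUS_IGNORE);
    }
    return NULL;
}

int main(int argc, char **argv) {
    int provided, value = 42, received = 0;
    pthread_t tid;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Post a non-blocking send to self so there is something to progress. */
    MPI_Isend(&value, 1, MPI_INT, 0, 0, MPI_COMM_SELF, &pending_request);
    pthread_create(&tid, NULL, progress_thread, NULL);

    /* ... the application would do its computation here ... */

    MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_SELF, MPI_STATUS_IGNORE);
    pthread_join(tid, NULL);
    printf("received %d\n", received);

    MPI_Finalize();
    return 0;
}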

  George.


On Mon, Dec 30, 2019 at 2:31 PM Martín Morales via users 
mailto:users@lists.open-mpi.org>> wrote:
Hello all!
I'm on OMPI 4.0.1 and I see a strange (or at least unexpected) behaviour
with some non-blocking send calls: MPI_Isend and MPI_Ibcast. I really need
asynchronous sending, so I don't call MPI_Wait after the send call (MPI_Isend or
MPI_Ibcast); instead, I check "on demand" with MPI_Test to verify whether the
send is complete. The test I'm doing sends just an int value. Here is
some code (with MPI_Ibcast):

***SENDER***

// Note that it uses an intercommunicator
MPI_Ibcast(&some_int_data, 1, MPI_INT, MPI_ROOT, mpi_intercomm, &request_sender);
//MPI_Wait(&request_sender, MPI_STATUS_IGNORE); // <-- I don't want this


***RECEIVER***

MPI_Ibcast(&some_int_data, 1, MPI_INT, 0, parentcomm, &request_receiver);
MPI_Wait(&request_receiver, MPI_STATUS_IGNORE);

***TEST RECEPTION (same sender instance program)***

void test_reception() {

    int request_complete;

    MPI_Test(&request_sender, &request_complete, MPI_STATUS_IGNORE);

    if (request_complete) {
        ...
    } else {
        ...
    }
}

But when I invoke this test function after some time has elapsed since the send, 
the request isn't complete, and I have to invoke the test function again and 
again... a variable number of times, until it finally completes. It was only an 
int that was sent, nothing more (all on a local machine); such a delay makes no 
sense. The request should be complete on the first test invocation.

If, instead of this, I uncomment the unwanted MPI_Wait (i.e. treat it like a 
synchronous request), it completes immediately, as expected.
If I send with MPI_Isend I get the same behaviour.

I don't understand what is going on. Any help would be much appreciated.

Regards.

Martín


[OMPI users] Spawns no local

2019-10-02 Thread Martín Morales via users
Hello all. I would like to request a practical example of how to set, with 
MPI_Info_set(info, …), the "info" passed to MPI_Comm_spawn() so that it does not 
spawn any process locally (say, on the "master" host) but only on a slave 
("slave" host), without using mpirun (just "./o.out"). I'm using Open MPI 4.0.1.
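A minimal sketch of the kind of call in question (the "host"/"add-host" keys are 
info keys documented for Open MPI's MPI_Comm_spawn; "slave" and "./o.out" below 
are placeholders):

MPI_Comm intercomm;
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "slave");      /* or "add-host" / "add-hostfile" */
MPI_Comm_spawn("./o.out", MPI_ARGV_NULL, 4, info, 0,
               MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
MPI_Info_free(&info);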
Thanks!


Re: [OMPI users] Singleton and Spawn

2019-09-26 Thread Martín Morales via users
Ralph, I haven't set any default hostfile; nevertheless, how can I check this?


I have 2 machines: a "master" and a "slave". Master has the Open MPI build. 
Both machines share files (Open MPI bins and libs, etc.) via NFS. The path is 
/cluster/openmpi. My example is in /cluster/examples/martin and my hostfile 
is in /cluster/examples/martin/resources (named "hostsfile"). I attach both 
files.

So, when I run:

$ mpirun -np 1 ./spawn7

I get:

I'm papi 0/1
I'm the spawned 1/7
I'm the spawned 2/7
I'm the spawned 0/7. Received: 99
I'm the spawned 5/7
I'm the spawned 6/7
I'm the spawned 4/7
I'm the spawned 3/7

But when I run:

$ ./spawn7

I get:

I'm papi 0/1
--
There are not enough slots available in the system to satisfy the 7
slots that were requested by the application:

/cluster/examples/martin/spawn7

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:

1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--
[master:09093] *** An error occurred in MPI_Comm_spawn
[master:09093] *** reported by process [2032730113,0]
[master:09093] *** on communicator MPI_COMM_WORLD
[master:09093] *** MPI_ERR_SPAWN: could not spawn processes
[master:09093] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[master:09093] *** and potentially your MPI job)

I have:

Open MPI version: 4.0.1
OS: Ubuntu 18.04 (on both machines)
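For reference, a hostfile of the kind option 1 in the message above refers to 
("slots=N" clauses) would look something like the following; the host names and 
slot counts are placeholders, not the attached file:

master slots=4
slave slots=4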


From: Ralph Castain 
Sent: Wednesday, September 25, 2019 16:50
To: Martín Morales 
Cc: Open MPI Users 
Subject: Re: [OMPI users] Singleton and Spawn

It's a different code path, that's all - just a question of what path gets 
traversed.

Would you mind posting a little more info on your two use-cases? For example, 
do you have a default hostfile telling mpirun what machines to use?


On Sep 25, 2019, at 12:41 PM, Martín Morales 
<martineduardomora...@hotmail.com> wrote:

Thanks Ralph, but if I have a wrong hostfile path in my MPI_Comm_spawn 
call, why does it work if I run with mpirun (e.g. mpirun -np 1 ./spawnExample)?

From: Ralph Castain <r...@open-mpi.org>
Sent: Wednesday, September 25, 2019 15:42
To: Open MPI Users <users@lists.open-mpi.org>
Cc: steven.va...@gmail.com; Martín Morales <martineduardomora...@hotmail.com>
Subject: Re: [OMPI users] Singleton and Spawn

Yes, of course it can - however, I believe there is a bug in the add-hostfile 
code path. We can address that problem far easier than moving to a different 
interconnect.


On Sep 25, 2019, at 11:39 AM, Martín Morales via users 
<users@lists.open-mpi.org> wrote:

Thanks Steven. So it actually can't spawn from a singleton?


From: users <users-boun...@lists.open-mpi.org> on behalf of 
Steven Varga via users <users@lists.open-mpi.org>
Sent: Wednesday, September 25, 2019 14:50
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Steven Varga <steven.va...@gmail.com>
Subject: Re: [OMPI users] Singleton and Spawn

As far as I know you have to wire up the connections among MPI clients, 
allocate resources, etc. PMIx is a library for setting up all the processes, and 
it ships with Open MPI.

The standard HPC method to launch tasks is through job schedulers such as SLURM 
or Grid Engine. SLURM's srun is very similar to mpirun: it does the resource 
allocation, then launches the jobs on the allocated nodes and cores, etc. It does 
this through the PMIx library, or mpiexec.

When running mpiexec without an integrated job manager, you are responsible 
for allocating resources. See mpirun for details on passing host lists, 
oversubscription, etc.

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or 
other Remote Procedure Calls -- it won't be simpler though.
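For illustration, the host-list and oversubscription options mentioned above 
correspond to command lines along these lines (host names, counts and the 
executable are placeholders taken from this thread, untested here):

mpirun -np 8 --host master:4,slave:4 ./spawnExample
mpirun -np 8 --hostfile ./hostsfile --oversubscribe ./spawnExample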

Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Martín Morales via users
Thanks Ralph, but if I have a wrong hostfile path in my MPI_Comm_spawn 
call, why does it work if I run with mpirun (e.g. mpirun -np 1 ./spawnExample)?

From: Ralph Castain 
Sent: Wednesday, September 25, 2019 15:42
To: Open MPI Users 
Cc: steven.va...@gmail.com ; Martín Morales 

Subject: Re: [OMPI users] Singleton and Spawn

Yes, of course it can - however, I believe there is a bug in the add-hostfile 
code path. We can address that problem far easier than moving to a different 
interconnect.


On Sep 25, 2019, at 11:39 AM, Martín Morales via users 
<users@lists.open-mpi.org> wrote:

Thanks Steven. So it actually can't spawn from a singleton?


From: users <users-boun...@lists.open-mpi.org> on behalf of 
Steven Varga via users <users@lists.open-mpi.org>
Sent: Wednesday, September 25, 2019 14:50
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Steven Varga <steven.va...@gmail.com>
Subject: Re: [OMPI users] Singleton and Spawn

As far as I know you have to wire up the connections among MPI clients, 
allocate resources, etc. PMIx is a library for setting up all the processes, and 
it ships with Open MPI.

The standard HPC method to launch tasks is through job schedulers such as SLURM 
or Grid Engine. SLURM's srun is very similar to mpirun: it does the resource 
allocation, then launches the jobs on the allocated nodes and cores, etc. It does 
this through the PMIx library, or mpiexec.

When running mpiexec without an integrated job manager, you are responsible 
for allocating resources. See mpirun for details on passing host lists, 
oversubscription, etc.

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or 
other Remote Procedure Calls -- it won't be simpler though.

Hope it helps:
Steve

On Wed, Sep 25, 2019, 13:15 Martín Morales via users 
<users@lists.open-mpi.org> wrote:
Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI likewise!). I 
recently built the current version of this fabulous software (v4.0.1) on two 
Ubuntu 18 machines (a small part of our Beowulf cluster). I have already read (a 
lot of) the FAQ and posts on the users mailing list, but I can't figure out how to 
do this (if it can be done): I need to run my parallel programs without the 
mpirun/mpiexec commands; I need just one process (on my "master" machine) that 
spawns processes dynamically (on the "slave" machines). I already made some dummy 
test scripts and they work fine with the mpirun/mpiexec commands. In the 
MPI_Info_set call I set the key "add-hostfile" with the file containing those 2 
machines mentioned before, with 4 slots each. Nevertheless, it doesn't work when 
I just run it as a singleton program (e.g. ./spawnExample): it throws an error 
like this: "There are not enough slots available in the system to satisfy the 7 
slots that were requested by the application:...". Here I try to start 8 
processes on the 2 machines. It seems that one process executes fine on 
"master", and when it tries to spawn the other 7 it crashes.
We need this execution scheme because we already have our own software (used for 
scientific research) and we need to "incorporate" or "embed" Open MPI into it.
Thanks in advance guys!
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users



