I can’t speak to the mxm problem, but the no-tree-spawn issue (your problem 2) indicates that you don’t have password-less ssh authorized between the compute nodes.
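A minimal sketch of setting up key-based, password-less ssh for the nodes named below (node5, node14, node28, node29 come from the quoted message; the shared-$HOME assumption and `SSH_DIR` override are illustrative, not part of the original post):

```shell
# Sketch: enable password-less ssh between compute nodes.
# SSH_DIR defaults to ~/.ssh; it is overridable only so the sketch can be
# tried without touching a real configuration.
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"

# 1. A passphrase-less key (-N "") is what makes the login non-interactive.
[ -f "$SSH_DIR/id_rsa" ] || ssh-keygen -t rsa -b 2048 -N "" -f "$SSH_DIR/id_rsa" -q

# 2. With a shared $HOME (common on clusters), appending the public key to
#    authorized_keys once covers every node; otherwise run ssh-copy-id
#    against each node instead.
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"

# 3. Verify each hop: BatchMode=yes forbids password prompts, so the probe
#    succeeds only when key-based login really works.
for node in node5 node14 node28 node29; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
        && echo "$node: ok" || echo "$node: password-less login NOT working"
done
```

Tree spawn has each orted launch the next hop, so every compute node needs this, not just the head node.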
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> infiniband 4x FDR
>
> I have two problems:
>
> 1. I cannot use mxm:
>
> 1.a) $ mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node14
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> *** An error occurred in MPI_Init
> [node28:102377] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus
> causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9372,1],2]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08295] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [login:08295] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>
> 1.b) $ mpirun --mca pml yalla -host node5,node14,node28,node29 -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      node5
> Framework: pml
> Component: yalla
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node5:102449] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node14:85325] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus
> causing the job to be terminated. The first process to do so was:
>
>   Process name: [[9619,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [login:08552] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
> [login:08552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> 2. I cannot remove -mca plm_rsh_no_tree_spawn 1 from the mpirun cmd line:
>
> $ mpirun -host node5,node14,node28,node29 -np 4 ./hello
> sh: -c: line 0: syntax error near unexpected token `--tree-spawn'
> sh: -c: line 0: `( test ! -r ./.profile || . ./.profile; OPAL_PREFIX=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8 ; export OPAL_PREFIX; PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "625606656" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "625606656.1;tcp://10.65.0.105,10.64.0.105,10.67.0.105:56862" -mca orte_hnp_uri "625606656.0;tcp://10.65.0.2,10.67.0.2,83.149.214.101,10.64.0.2:54893" --mca pml "yalla" -mca plm_rsh_no_tree_spawn "0" -mca plm "rsh" ) --tree-spawn'
>
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> Thank you for your comments.
>
> Best regards,
> Timur.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26919.php
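P.S. On the mxm/yalla failure quoted above: the "component was not found" help text usually means the plugin's shared object is missing or its libraries don't resolve on that node. A sketch of a per-node check, assuming only the HPC-X install prefix that appears in the quoted orted command line (adjust for your system):

```shell
# Sketch: check why "pml yalla" fails to open. Run on each node that
# reported the error (node5, node14, node28, node29 in the transcript).
# The default prefix below is the HPC-X path from the quoted output.
prefix="${OPAL_PREFIX:-/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8}"
plugin="$prefix/lib/openmpi/mca_pml_yalla.so"

if [ -f "$plugin" ]; then
    # Component is installed; the usual remaining cause is an MXM library
    # that fails to resolve at load time on this particular node.
    ldd "$plugin" | grep -E 'libmxm|not found' || echo "dependencies resolve"
else
    echo "missing: $plugin (component not installed under this prefix)"
fi
```

If the plugin is present but `ldd` reports "not found" for libmxm, the MXM library path needs to be in LD_LIBRARY_PATH on the compute nodes, not just on the login node.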