Appreciate your input! None of the developers have access to an LSF machine any 
more, so we can't test it :-/

What version of OMPI does this patch apply to? I can go ahead and add it - just 
want to know if it should just go to the trunk and 1.5 series, or also the 1.4 
series.

Thanks again!
Ralph

On Apr 26, 2010, at 12:06 PM, Teng Lin wrote:

> Hi,
> 
> We recently identified a bug on our LSF cluster.
> The job always hangs if all LSF-related components are present. One 
> observation we have is that the job works fine after removing all LSF-related 
> components. 
> 
> Below is the output from stdout:
> [xxxx:24930] mca: base: components_open: Looking for ess components
> [xxxx:24930] mca: base: components_open: opening ess components
> [xxxx:24930] mca: base: components_open: found loaded component env
> [xxxx:24930] mca: base: components_open: component env has no register 
> function
> [xxxx:24930] mca: base: components_open: component env open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component hnp
> [xxxx:24930] mca: base: components_open: component hnp has no register 
> function
> [xxxx:24930] mca: base: components_open: component hnp open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register 
> function
> [xxxx:24930] mca: base: components_open: component lsf open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component singleton
> [xxxx:24930] mca: base: components_open: component singleton has no register 
> function
> [xxxx:24930] mca: base: components_open: component singleton open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register 
> function
> [xxxx:24930] mca: base: components_open: component slurm open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component tool
> [xxxx:24930] mca: base: components_open: component tool has no register 
> function
> [xxxx:24930] mca: base: components_open: component tool open function 
> successful
> [xxxx:24930] mca: base: components_open: Looking for plm components
> [xxxx:24930] mca: base: components_open: opening plm components
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register 
> function
> [xxxx:24930] mca: base: components_open: component lsf open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component rsh
> [xxxx:24930] mca: base: components_open: component rsh has no register 
> function
> [xxxx:24930] mca: base: components_open: component rsh open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register 
> function
> [xxxx:24930] mca: base: components_open: component slurm open function 
> successful
> [xxxx:24930] mca:base:select: Auto-selecting plm components
> [xxxx:24930] mca:base:select:(  plm) Querying component [lsf]
> [xxxx:24930] mca:base:select:(  plm) Query of component [lsf] set priority to 
> 75
> [xxxx:24930] mca:base:select:(  plm) Querying component [rsh]
> [xxxx:24930] mca:base:select:(  plm) Query of component [rsh] set priority to 
> 10
> [xxxx:24930] mca:base:select:(  plm) Querying component [slurm]
> [xxxx:24930] mca:base:select:(  plm) Skipping component [slurm]. Query failed 
> to return a module
> [xxxx:24930] mca:base:select:(  plm) Selected component [lsf]
> [xxxx:24930] mca: base: close: component rsh closed
> [xxxx:24930] mca: base: close: unloading component rsh
> [xxxx:24930] mca: base: close: component slurm closed
> [xxxx:24930] mca: base: close: unloading component slurm
> [xxxx:24930] mca: base: components_open: Looking for rml components
> [xxxx:24930] mca: base: components_open: opening rml components
> [xxxx:24930] mca: base: components_open: found loaded component oob
> [xxxx:24930] mca: base: components_open: component oob has no register 
> function
> [xxxx:24930] mca: base: components_open: Looking for oob components
> [xxxx:24930] mca: base: components_open: opening oob components
> [xxxx:24930] mca: base: components_open: found loaded component tcp
> [xxxx:24930] mca: base: components_open: component tcp has no register 
> function
> [xxxx:24930] mca: base: components_open: component tcp open function 
> successful
> [xxxx:24930] mca: base: components_open: component oob open function 
> successful
> [xxxx:24930] orte_rml_base_select: initializing rml component oob
> [xxxx:24930] mca: base: components_open: Looking for ras components
> [xxxx:24930] mca: base: components_open: opening ras components
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register 
> function
> [xxxx:24930] mca: base: components_open: component lsf open function 
> successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register 
> function
> [xxxx:24930] mca: base: components_open: component slurm open function 
> successful
> [xxxx:24930] mca:base:select: Auto-selecting ras components
> [xxxx:24930] mca:base:select:(  ras) Querying component [lsf]
> [xxxx:24930] mca:base:select:(  ras) Query of component [lsf] set priority to 
> 75
> [xxxx:24930] mca:base:select:(  ras) Querying component [slurm]
> [xxxx:24930] mca:base:select:(  ras) Skipping component [slurm]. Query failed 
> to return a module
> [xxxx:24930] mca:base:select:(  ras) Selected component [lsf]
> [xxxx:24930] mca: base: close: unloading component slurm
> [xxxx:24930] plm:lsf: final top-level argv:
> [xxxx:24930] plm:lsf:     orted -mca ess lsf -mca orte_ess_jobid 2605449216 
> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 
> "2605449216.0;tcp://xxx.xxx.xxx.xxx:57649"
> 
> 
> Below is the message from the res daemon's log file:
> Apr 22 15:52:01 2010 6540 3 7.06 execAtask_: lsfExecvp() failed.
> Apr 22 15:52:01 2010 6540 3 7.06 rexecChild: execAtask_() failed, No such 
> file or directory.
> 
> The messages above suggest that orted is not in the PATH.
> 
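
One way to check this hypothesis on a compute node is to search the PATH the way an execvp-style lookup would. The following is only a minimal sketch: `find_in_path` is a hypothetical helper written for illustration, not part of Open MPI or LSF, and it ignores details such as PATH entries longer than the buffer.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return 1 and store the full path in `out` if `name`
 * is an executable regular file in one of the colon-separated directories
 * of `path`; return 0 otherwise. */
static int find_in_path(const char *name, const char *path,
                        char *out, size_t outlen)
{
    const char *p = path;

    if (name == NULL || path == NULL || out == NULL || outlen == 0) {
        return 0;
    }
    for (;;) {
        const char *end = strchr(p, ':');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        struct stat st;
        int n;

        /* An empty PATH entry conventionally means the current directory. */
        if (len == 0) {
            n = snprintf(out, outlen, "./%s", name);
        } else {
            n = snprintf(out, outlen, "%.*s/%s", (int)len, p, name);
        }
        /* Accept only regular files that the caller may execute. */
        if (n > 0 && (size_t)n < outlen &&
            stat(out, &st) == 0 && S_ISREG(st.st_mode) &&
            access(out, X_OK) == 0) {
            return 1;
        }
        if (end == NULL) {
            break;
        }
        p = end + 1;
    }
    return 0;
}
```

Calling `find_in_path("orted", getenv("PATH"), buf, sizeof(buf))` (with `<stdlib.h>` included for `getenv`) from a small program started under the same LSF environment would show whether the daemon binary is visible there.
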
> Applying the patch below seems to fix the problem.
> 
> --- plm_lsf_module.c.orig       2010-04-26 13:27:59.699974000 -0400
> +++ plm_lsf_module.c    2010-04-26 10:58:24.719737000 -0400
> @@ -304,7 +304,7 @@
>      * orterun can do the rest of its stuff. Instead, we'll catch any
>      * failures and deal with them elsewhere
>      */
> -    if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
> +    if (lsb_launch(nodelist_argv, argv, LSF_DJOB_REPLACE_ENV | LSF_DJOB_NOWAIT, env) < 0) {
>         ORTE_ERROR_LOG(ORTE_ERR_FAILED_TO_START);
>         opal_output(0, "lsb_launch failed: %d", rc);
>         rc = ORTE_ERR_FAILED_TO_START;
> 
> If the LSF_DJOB_REPLACE_ENV option is specified, envp entries will overwrite 
> all existing environment values except those needed by LSF. 
> If the function fails, lsberrno is set to indicate the error. It would be 
> useful if we can 
> One thing we cannot guarantee is that orted is in the PATH on the remote node. 
> LSF_DJOB_REPLACE_ENV can certainly be used to overcome this, but it may also 
> have some side effects.
> 
> There are a few things that are still not quite clear to us. lsb_launch is 
> supposed to return a negative number on failure; we are not sure why it did 
> not in our case.
> 
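
For illustration, here is a minimal sketch of combining the two launch flags and reporting both the return code and the LSF error number on failure. In the hunk as quoted, `rc` is printed without being assigned from the call, so the sketch stores the return value first. Everything LSF-specific here is a stand-in so the example compiles without LSF installed: `stub_lsb_launch` and `stub_lsberrno` are hypothetical replacements for `lsb_launch()` and `lsberrno`, and the flag values are placeholders (the real ones come from `<lsf/lsbatch.h>`).

```c
#include <stdio.h>

/* Placeholder values for illustration only; a real build takes these
 * constants from <lsf/lsbatch.h>. */
#define LSF_DJOB_REPLACE_ENV 0x2
#define LSF_DJOB_NOWAIT      0x4

/* Hypothetical stand-in for the global lsberrno. */
static int stub_lsberrno = 0;

/* Hypothetical stand-in for lsb_launch(): fails when there is no
 * command to run, succeeds otherwise. */
static int stub_lsb_launch(char **where, char **argv, int options, char **env)
{
    (void)where;
    (void)options;
    (void)env;
    if (argv == NULL || argv[0] == NULL) {
        stub_lsberrno = 1;
        return -1;
    }
    return 0;
}

/* Launch with both flags set, keeping the return code so that the
 * error message reports what the call actually returned. */
static int launch_daemons(char **nodes, char **argv, char **env)
{
    int options = LSF_DJOB_REPLACE_ENV | LSF_DJOB_NOWAIT;
    int rc = stub_lsb_launch(nodes, argv, options, env);

    if (rc < 0) {
        fprintf(stderr, "lsb_launch failed: rc=%d lsberrno=%d\n",
                rc, stub_lsberrno);
        return -1;
    }
    return 0;
}
```

With LSF_DJOB_NOWAIT set, a zero return only says the launch request was accepted, which may be one reason a missing orted on the remote node is not reported as a negative return value; that is an assumption, not something the output above confirms.
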
> 
> Not sure whether it is related to changeset 19033 
> (https://svn.open-mpi.org/trac/ompi/changeset/19033) in some way.
> 
> 
> Teng 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

