Appreciate your input! None of the developers have access to an LSF machine any more, so we can't test it :-/
What version of OMPI does this patch apply to? I can go ahead and add it - just want to know if it should go only to the trunk and 1.5 series, or also to the 1.4 series.

Thanks again!
Ralph

On Apr 26, 2010, at 12:06 PM, Teng Lin wrote:

> Hi,
>
> We recently identified a bug on our LSF cluster.
> The job always hangs if all LSF-related components are present. One
> observation is that the job works fine after removing all LSF-related
> components.
>
> Below is the output from stdout:
> [xxxx:24930] mca: base: components_open: Looking for ess components
> [xxxx:24930] mca: base: components_open: opening ess components
> [xxxx:24930] mca: base: components_open: found loaded component env
> [xxxx:24930] mca: base: components_open: component env has no register function
> [xxxx:24930] mca: base: components_open: component env open function successful
> [xxxx:24930] mca: base: components_open: found loaded component hnp
> [xxxx:24930] mca: base: components_open: component hnp has no register function
> [xxxx:24930] mca: base: components_open: component hnp open function successful
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register function
> [xxxx:24930] mca: base: components_open: component lsf open function successful
> [xxxx:24930] mca: base: components_open: found loaded component singleton
> [xxxx:24930] mca: base: components_open: component singleton has no register function
> [xxxx:24930] mca: base: components_open: component singleton open function successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register function
> [xxxx:24930] mca: base: components_open: component slurm open function successful
> [xxxx:24930] mca: base: components_open: found loaded component tool
> [xxxx:24930] mca: base: components_open: component tool has no register function
> [xxxx:24930] mca: base: components_open: component tool open function successful
> [xxxx:24930] mca: base: components_open: Looking for plm components
> [xxxx:24930] mca: base: components_open: opening plm components
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register function
> [xxxx:24930] mca: base: components_open: component lsf open function successful
> [xxxx:24930] mca: base: components_open: found loaded component rsh
> [xxxx:24930] mca: base: components_open: component rsh has no register function
> [xxxx:24930] mca: base: components_open: component rsh open function successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register function
> [xxxx:24930] mca: base: components_open: component slurm open function successful
> [xxxx:24930] mca:base:select: Auto-selecting plm components
> [xxxx:24930] mca:base:select:( plm) Querying component [lsf]
> [xxxx:24930] mca:base:select:( plm) Query of component [lsf] set priority to 75
> [xxxx:24930] mca:base:select:( plm) Querying component [rsh]
> [xxxx:24930] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [xxxx:24930] mca:base:select:( plm) Querying component [slurm]
> [xxxx:24930] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [xxxx:24930] mca:base:select:( plm) Selected component [lsf]
> [xxxx:24930] mca: base: close: component rsh closed
> [xxxx:24930] mca: base: close: unloading component rsh
> [xxxx:24930] mca: base: close: component slurm closed
> [xxxx:24930] mca: base: close: unloading component slurm
> [xxxx:24930] mca: base: components_open: Looking for rml components
> [xxxx:24930] mca: base: components_open: opening rml components
> [xxxx:24930] mca: base: components_open: found loaded component oob
> [xxxx:24930] mca: base: components_open: component oob has no register function
> [xxxx:24930] mca: base: components_open: Looking for oob components
> [xxxx:24930] mca: base: components_open: opening oob components
> [xxxx:24930] mca: base: components_open: found loaded component tcp
> [xxxx:24930] mca: base: components_open: component tcp has no register function
> [xxxx:24930] mca: base: components_open: component tcp open function successful
> [xxxx:24930] mca: base: components_open: component oob open function successful
> [xxxx:24930] orte_rml_base_select: initializing rml component oob
> [xxxx:24930] mca: base: components_open: Looking for ras components
> [xxxx:24930] mca: base: components_open: opening ras components
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register function
> [xxxx:24930] mca: base: components_open: component lsf open function successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register function
> [xxxx:24930] mca: base: components_open: component slurm open function successful
> [xxxx:24930] mca:base:select: Auto-selecting ras components
> [xxxx:24930] mca:base:select:( ras) Querying component [lsf]
> [xxxx:24930] mca:base:select:( ras) Query of component [lsf] set priority to 75
> [xxxx:24930] mca:base:select:( ras) Querying component [slurm]
> [xxxx:24930] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [xxxx:24930] mca:base:select:( ras) Selected component [lsf]
> [xxxx:24930] mca: base: close: unloading component slurm
> [xxxx:24930] plm:lsf: final top-level argv:
> [xxxx:24930] plm:lsf: orted -mca ess lsf -mca orte_ess_jobid 2605449216 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2605449216.0;tcp://xxx.xxx.xxx.xxx:57649"
>
> Below is the message from the res daemon log file:
> Apr 22 15:52:01 2010 6540 3 7.06 execAtask_: lsfExecvp() failed.
> Apr 22 15:52:01 2010 6540 3 7.06 rexecChild: execAtask_() failed, No such file or directory.
>
> The above messages suggest that orted is not in the path.
>
> Applying the patch below seems to fix the problem.
>
> --- plm_lsf_module.c.orig 2010-04-26 13:27:59.699974000 -0400
> +++ plm_lsf_module.c 2010-04-26 10:58:24.719737000 -0400
> @@ -304,7 +304,7 @@
>       * orterun can do the rest of its stuff. Instead, we'll catch any
>       * failures and deal with them elsewhere
>       */
> -    if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
> +    if (lsb_launch(nodelist_argv, argv, LSF_DJOB_REPLACE_ENV | LSF_DJOB_NOWAIT, env) < 0) {
>          ORTE_ERROR_LOG(ORTE_ERR_FAILED_TO_START);
>          opal_output(0, "lsb_launch failed: %d", rc);
>          rc = ORTE_ERR_FAILED_TO_START;
>
> If the LSF_DJOB_REPLACE_ENV option is specified, envp entries will overwrite
> all existing environment values except those needed by LSF.
> If the function fails, lsberrno is set to indicate the error. It would be
> useful if we can
>
> One thing we cannot guarantee is that orted is in the path on the remote
> node. LSF_DJOB_REPLACE_ENV can certainly be used to overcome this, but it may
> also have some side effects.
>
> There are a few things that are still not quite clear to us. lsb_launch is
> supposed to return a negative number, but we are not sure why it did not in
> our case.
>
> Not sure if it is related in some way to changeset 19033
> (https://svn.open-mpi.org/trac/ompi/changeset/19033).
>
> Teng
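
The res daemon log quoted above ("No such file or directory") suggests that orted could not be resolved on the remote node. As a rough way to check that kind of condition by hand, here is a minimal sketch in C that searches a colon-separated PATH string for a regular, executable file. The helper name find_in_path is hypothetical; it is not part of Open MPI or LSF, and it only approximates how exec-style lookups walk PATH.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/*
 * Hypothetical helper (not an Open MPI or LSF function).
 * Returns 1 if a regular, executable file called `name` exists in one of
 * the colon-separated directories listed in `path`, and 0 otherwise.
 * Empty entries are skipped here, although POSIX shells treat them as the
 * current directory.
 */
int find_in_path(const char *name, const char *path)
{
    char candidate[PATH_MAX];
    struct stat st;

    if (name == NULL || path == NULL || *name == '\0') {
        return 0;
    }

    while (*path != '\0') {
        /* Length of the current directory entry, up to the next ':' */
        size_t len = strcspn(path, ":");

        if (len > 0) {
            int n = snprintf(candidate, sizeof candidate, "%.*s/%s",
                             (int)len, path, name);
            /* Ignore entries that would not fit in the buffer */
            if (n > 0 && (size_t)n < sizeof candidate &&
                stat(candidate, &st) == 0 && S_ISREG(st.st_mode) &&
                access(candidate, X_OK) == 0) {
                return 1;
            }
        }

        path += len;
        if (*path == ':') {
            path++;
        }
    }
    return 0;
}
```

Calling find_in_path("orted", getenv("PATH")) from a small test program started through LSF on the remote node would show whether the environment seen there can locate orted, which is the situation the LSF_DJOB_REPLACE_ENV change is meant to address.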