Re: [OMPI devel] Removing ORTE code

2018-10-02 Thread Ralph H Castain
Based on silence plus today’s telecon, the stale code has been removed: 
https://github.com/open-mpi/ompi/pull/5827


> On Sep 26, 2018, at 7:00 AM, Ralph H Castain  wrote:
> 
> We are considering a “purge” of stale ORTE code and want to know if anyone is 
> using it before proceeding. With the advent of PMIx, several ORTE features 
> are no longer required by OMPI itself. However, we acknowledge that it is 
> possible that someone out there (e.g., a researcher) is using them. The 
> specific features include:
> 
> * OOB use from within an application process. We need to retain the OOB 
> itself for daemon-to-daemon communication. However, the application processes 
> no longer open a connection to their ORTE daemon, instead relying on the PMIx 
> connection to communicate their needs.
> 
> * the DFS framework - allowed an application process to access a remote file 
> via ORTE. It essentially provided a function-shipping service used by the 
> map-reduce applications we no longer support.
> 
> * the notifier framework - supported output of messages to syslog and email. 
> PMIx now provides such services if someone wants to use them
> 
> * iof/tool component - we are moving to PMIx for tool support, so there are 
> no ORTE tools using this any more
> 
> We may discover additional candidates for removal as we go forward - we’ll 
> update the list as we do. First, however, we’d really like to hear back from 
> anyone who might have a need for any of the above.
> 
> Please respond by Oct 5th
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] btl/vader: race condition in finalize on OS X

2018-10-02 Thread Ralph H Castain
We already have the register_cleanup option in master - are you using an older 
version of PMIx that doesn’t support it?


> On Oct 2, 2018, at 4:05 AM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> FYI: https://github.com/open-mpi/ompi/issues/5798 brought up what may be the 
> same issue.
> 
> 
>> On Oct 2, 2018, at 3:16 AM, Gilles Gouaillardet  wrote:
>> 
>> Folks,
>> 
>> 
>> When running a simple helloworld program on OS X, we can end up with the 
>> following error message
>> 
>> 
>> A system call failed during shared memory initialization that should
>> not have.  It is likely that your MPI job will now either abort or
>> experience performance degradation.
>> 
>>  Local host:  c7.kmc.kobe.rist.or.jp
>>  System call: unlink(2) 
>> /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
>>  Error:   No such file or directory (errno 2)
>> 
>> 
>> The error does not occur on Linux by default since the vader segment is in 
>> /dev/shm by default.
>> 
>> The patch below can be used to reproduce the issue on Linux:
>> 
>> 
>> diff --git a/opal/mca/btl/vader/btl_vader_component.c 
>> b/opal/mca/btl/vader/btl_vader_component.c
>> index 115bceb..80fec05 100644
>> --- a/opal/mca/btl/vader/btl_vader_component.c
>> +++ b/opal/mca/btl/vader/btl_vader_component.c
>> @@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
>>OPAL_INFO_LVL_3, 
>> MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
>> OBJ_RELEASE(new_enum);
>> 
>> -if (0 == access ("/dev/shm", W_OK)) {
>> +if (0 && 0 == access ("/dev/shm", W_OK)) {
>> mca_btl_vader_component.backing_directory = "/dev/shm";
>> } else {
>> mca_btl_vader_component.backing_directory = 
>> opal_process_info.job_session_dir;
>> 
>> 
>> From my analysis, here is what happens:
>> 
>> - each rank is supposed to have its own vader_segment unlinked by btl/vader 
>> in vader_finalize().
>> 
>> - but this file might have already been destroyed by another task in 
>> orte_ess_base_app_finalize()
>> 
>>  if (NULL == opal_pmix.register_cleanup) {
>>orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>>}
>> 
>>  *all* the tasks end up removing 
>> opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")
>> 
>> 
>> I am not really sure about the best way to fix this.
>> 
>> - one option is to perform an intra-node barrier in vader_finalize()
>> 
>> - another option would be to implement an opal_pmix.register_cleanup
>> 
>> 
>> Any thoughts ?
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 


Re: [OMPI devel] btl/vader: race condition in finalize on OS X

2018-10-02 Thread Jeff Squyres (jsquyres) via devel
FYI: https://github.com/open-mpi/ompi/issues/5798 brought up what may be the 
same issue.


> On Oct 2, 2018, at 3:16 AM, Gilles Gouaillardet  wrote:
> 
> Folks,
> 
> 
> When running a simple helloworld program on OS X, we can end up with the 
> following error message
> 
> 
> A system call failed during shared memory initialization that should
> not have.  It is likely that your MPI job will now either abort or
> experience performance degradation.
> 
>   Local host:  c7.kmc.kobe.rist.or.jp
>   System call: unlink(2) 
> /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
>   Error:   No such file or directory (errno 2)
> 
> 
> The error does not occur on Linux by default since the vader segment is in 
> /dev/shm by default.
> 
> The patch below can be used to reproduce the issue on Linux:
> 
> 
> diff --git a/opal/mca/btl/vader/btl_vader_component.c 
> b/opal/mca/btl/vader/btl_vader_component.c
> index 115bceb..80fec05 100644
> --- a/opal/mca/btl/vader/btl_vader_component.c
> +++ b/opal/mca/btl/vader/btl_vader_component.c
> @@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
> OPAL_INFO_LVL_3, 
> MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
>  OBJ_RELEASE(new_enum);
> 
> -if (0 == access ("/dev/shm", W_OK)) {
> +if (0 && 0 == access ("/dev/shm", W_OK)) {
>  mca_btl_vader_component.backing_directory = "/dev/shm";
>  } else {
>  mca_btl_vader_component.backing_directory = 
> opal_process_info.job_session_dir;
> 
> 
> From my analysis, here is what happens:
> 
>  - each rank is supposed to have its own vader_segment unlinked by btl/vader 
> in vader_finalize().
> 
> - but this file might have already been destroyed by another task in 
> orte_ess_base_app_finalize()
> 
>   if (NULL == opal_pmix.register_cleanup) {
> orte_session_dir_finalize(ORTE_PROC_MY_NAME);
> }
> 
>   *all* the tasks end up removing 
> opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")
> 
> 
> I am not really sure about the best way to fix this.
> 
> - one option is to perform an intra-node barrier in vader_finalize()
> 
> - another option would be to implement an opal_pmix.register_cleanup
> 
> 
> Any thoughts ?
> 
> 
> Cheers,
> 
> 
> Gilles
> 


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-02 Thread Brice Goglin
OK

I pushed your #ifdef fixes and I fixed the printf warning.

I opened 3 issues related to x86 cpuid and OpenProcess failing in lstopo --ps.
Hopefully we'll find a way to play with cygwin here for real in the near
future, and then add that config to our CI.

Brice



Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-02 Thread Marco Atzeri

On 02.10.2018 07:47, Brice Goglin wrote:

On 02/10/2018 00:28, Marco Atzeri wrote:

On 01.10.2018 19:57, Brice Goglin wrote:

On 01/10/2018 19:22, Marco Atzeri wrote:




Your own machine doesn't matter. None of these tests looks at your CPU or
topology. *All* of them should behave the same on all x86 machines.
CPUIDs are emulated by reading files; nothing is read from your local
machine topology. There's just something wrong here that prevents these
emulated CPUID files from being read. "lstopo -i ..." will tell you.


$ /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/build/utils/lstopo/lstopo-no-graphics.exe \
    -i /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/tests/hwloc/x86/AMD-15h-Bulldozer-4xOpteron-6272/ \
    --if cpuid --of xml -
Ignoring dumped cpuid directory.



It works instead with "--if xml"

IMHO, it would be better to produce an error if the input is not
understandable, instead of emitting the local machine output with
just a warning.


The input is understandable here, but there's a cygwin-related bug 
somewhere when we actually try to use it.


--if xml makes no sense here since you're not giving any XML as input.


Of course, with "--if xml" I used the XML file *.output as the input file.

The error message comes from hwloc_x86_check_cpuiddump_input() failing 
in hwloc/topology-x86.c.
That function always prints an error message before returning an error, 
except when opendir() fails on the given directory.
The directory was passed by lstopo to the core using the environment 
variable HWLOC_CPUID_PATH.


Anyway, I have no way to debug this for now so you're stuck with not 
running make check in that directory :/


Brice





[OMPI devel] btl/vader: race condition in finalize on OS X

2018-10-02 Thread Gilles Gouaillardet

Folks,


When running a simple helloworld program on OS X, we can end up with the 
following error message



A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  c7.kmc.kobe.rist.or.jp
  System call: unlink(2) /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
  Error:       No such file or directory (errno 2)


The error does not occur on Linux by default since the vader segment is 
in /dev/shm by default.


The patch below can be used to reproduce the issue on Linux:


diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
index 115bceb..80fec05 100644
--- a/opal/mca/btl/vader/btl_vader_component.c
+++ b/opal/mca/btl/vader/btl_vader_component.c
@@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
                                            OPAL_INFO_LVL_3, MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
     OBJ_RELEASE(new_enum);
 
-    if (0 == access ("/dev/shm", W_OK)) {
+    if (0 && 0 == access ("/dev/shm", W_OK)) {
         mca_btl_vader_component.backing_directory = "/dev/shm";
     } else {
         mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;



From my analysis, here is what happens:

- each rank is supposed to have its own vader_segment unlinked by
btl/vader in vader_finalize().

- but this file might have already been destroyed by another task in
orte_ess_base_app_finalize():

      if (NULL == opal_pmix.register_cleanup) {
          orte_session_dir_finalize(ORTE_PROC_MY_NAME);
      }

  *all* the tasks end up calling
opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")



I am not really sure about the best way to fix this.

- one option is to perform an intra-node barrier in vader_finalize()

- another option would be to implement an opal_pmix.register_cleanup


Any thoughts ?


Cheers,


Gilles
