Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Brice Goglin
On 02/10/2018 at 00:28, Marco Atzeri wrote:
> On 01.10.2018 at 19:57, Brice Goglin wrote:
>> On 01/10/2018 at 19:22, Marco Atzeri wrote:
>>>
>>
>> Your own machine doesn't matter. None of these tests look at your CPU or
>> topology. *All* of them should behave the same on all x86 machines.
>> CPUID is emulated by reading files; nothing is read from your local
>> machine topology. There's just something wrong here that prevents these
>> CPUID emulation files from being read. "lstopo -i ..." will tell you.
>
> $ /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/build/utils/lstopo/lstopo-no-graphics.exe \
>     -i /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/tests/hwloc/x86/AMD-15h-Bulldozer-4xOpteron-6272/ \
>     --if cpuid --of xml -
> Ignoring dumped cpuid directory.
> 
>
>
> It works instead with "--if xml"
>
> IMHO it would be better to produce an error,
> instead of the local machine's output with a warning,
> when the input cannot be understood.

The input is understandable here, but there's a cygwin-related bug
somewhere when we actually try to use it.

--if xml makes no sense here since you're not giving any XML as input.

The error message comes from hwloc_x86_check_cpuiddump_input() failing
in hwloc/topology-x86.c.
That function always prints an error message before returning an error,
except when opendir() fails on the given directory.
lstopo passes that directory to the core through the HWLOC_CPUID_PATH
environment variable.
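
Since opendir() is the one failure path described above as silent, one way to narrow this down is to call it directly on the same directory and print errno. The following is a minimal sketch, not hwloc code: check_cpuid_dir() and check_cpuid_env() are hypothetical helper names, and only the HWLOC_CPUID_PATH variable name comes from this thread.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of hwloc: try to open the directory and
 * report why it fails. Returns 0 on success, or the errno value that
 * opendir() set on failure. */
int check_cpuid_dir(const char *path)
{
    DIR *dir = opendir(path);
    if (!dir) {
        int err = errno;   /* save before any other libc call */
        fprintf(stderr, "opendir(%s) failed: %s (%d)\n",
                path, strerror(err), err);
        return err;
    }
    closedir(dir);
    return 0;
}

/* Check whatever HWLOC_CPUID_PATH currently points to.
 * Returns -1 if the variable is not set. */
int check_cpuid_env(void)
{
    const char *path = getenv("HWLOC_CPUID_PATH");
    if (!path)
        return -1;
    return check_cpuid_dir(path);
}
```

Calling check_cpuid_dir() on the unpacked test directory from a small Cygwin-built program would show whether the directory itself can be opened there.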

Anyway, I have no way to debug this for now so you're stuck with not
running make check in that directory :/

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [OMPI devel] Error in TCP BTL??

2018-10-01 Thread George Bosilca
https://github.com/open-mpi/ompi/pull/5819 will ease the pain. I couldn't
figure out what exactly triggers this, but apparently recent versions of OSX
refuse to bind to port 0.
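
To check the bind() behavior in isolation, a small standalone test can help. This is a sketch under assumptions, not the Open MPI code path: bind_port_zero() is a hypothetical name, the socket is bound to the loopback address only, and the call may well succeed on an affected system if the trigger lies elsewhere.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical reproducer, not Open MPI code: bind a TCP socket to
 * 127.0.0.1 with port 0 and let the kernel pick a port.
 * Returns the chosen port (> 0) on success, or -errno on failure. */
int bind_port_zero(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -errno;

    memset(&addr, 0, sizeof(addr));   /* clear padding and any extra fields */
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);         /* 0 asks for any free port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }
    port = ntohs(addr.sin_port);
    close(fd);
    return port;
}
```

A return value of -22 (EINVAL) would match the "Invalid argument (22)" reported in this thread.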

  George.



On Mon, Oct 1, 2018 at 4:12 PM Jeff Squyres (jsquyres) via devel <
devel@lists.open-mpi.org> wrote:

> I get that 100% of the time in runs on macOS, too (with today's HEAD):
>
> --
> $ mpirun -np 4 --mca btl tcp,self ring_c
> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
> [JSQUYRES-M-26UT][[5535,1],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect] bind() failed: Invalid argument (22)
> [JSQUYRES-M-26UT:85104] *** An error occurred in MPI_Send
> [JSQUYRES-M-26UT:85104] *** reported by process [362741761,0]
> [JSQUYRES-M-26UT:85104] *** on communicator MPI_COMM_WORLD
> [JSQUYRES-M-26UT:85104] *** MPI_ERR_OTHER: known error not in list
> [JSQUYRES-M-26UT:85104] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [JSQUYRES-M-26UT:85104] ***    and potentially your MPI job)
> --
>
>
> > On Oct 1, 2018, at 2:12 PM, Ralph H Castain wrote:
> >
> > I’m getting this error when trying to run a simple ring program on my Mac:
> >
> > [Ralphs-iMac-2.local][[21423,14],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect] bind() failed: Invalid argument (22)
> >
> > Anyone recognize the problem? It causes the job to immediately abort.
> > This is with current head of master this morning - it was working when I
> > last used it, but it has been an unknown period of time.
> > Ralph
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Marco Atzeri

On 01.10.2018 at 19:57, Brice Goglin wrote:
> On 01/10/2018 at 19:22, Marco Atzeri wrote:
>
> Your own machine doesn't matter. None of these tests look at your CPU or
> topology. *All* of them should behave the same on all x86 machines.
> CPUID is emulated by reading files; nothing is read from your local
> machine topology. There's just something wrong here that prevents these
> CPUID emulation files from being read. "lstopo -i ..." will tell you.

$ /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/build/utils/lstopo/lstopo-no-graphics.exe \
    -i /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/tests/hwloc/x86/AMD-15h-Bulldozer-4xOpteron-6272/ \
    --if cpuid --of xml -
Ignoring dumped cpuid directory.

It works instead with "--if xml"

IMHO it would be better to produce an error,
instead of the local machine's output with a warning,
when the input cannot be understood.

> Brice

Thanks
Marco


Re: [OMPI devel] Error in TCP BTL??

2018-10-01 Thread Jeff Squyres (jsquyres) via devel
I get that 100% of the time in runs on macOS, too (with today's HEAD):

--
$ mpirun -np 4 --mca btl tcp,self ring_c
Process 0 sending 10 to 1, tag 201 (4 processes in ring)
[JSQUYRES-M-26UT][[5535,1],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect] bind() failed: Invalid argument (22)
[JSQUYRES-M-26UT:85104] *** An error occurred in MPI_Send
[JSQUYRES-M-26UT:85104] *** reported by process [362741761,0]
[JSQUYRES-M-26UT:85104] *** on communicator MPI_COMM_WORLD
[JSQUYRES-M-26UT:85104] *** MPI_ERR_OTHER: known error not in list
[JSQUYRES-M-26UT:85104] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[JSQUYRES-M-26UT:85104] ***    and potentially your MPI job)
--


> On Oct 1, 2018, at 2:12 PM, Ralph H Castain wrote:
>
> I’m getting this error when trying to run a simple ring program on my Mac:
>
> [Ralphs-iMac-2.local][[21423,14],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect] bind() failed: Invalid argument (22)
>
> Anyone recognize the problem? It causes the job to immediately abort. This is
> with current head of master this morning - it was working when I last used
> it, but it has been an unknown period of time.
> Ralph


-- 
Jeff Squyres
jsquy...@cisco.com


[OMPI devel] Error in TCP BTL??

2018-10-01 Thread Ralph H Castain
I’m getting this error when trying to run a simple ring program on my Mac:

[Ralphs-iMac-2.local][[21423,14],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect] bind() failed: Invalid argument (22)

Anyone recognize the problem? It causes the job to immediately abort. This is 
with current head of master this morning - it was working when I last used it, 
but it has been an unknown period of time.
Ralph


Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Brice Goglin
On 01/10/2018 at 19:22, Marco Atzeri wrote:
>
>> Unfortunately that test script isn't easy to debug in the v2.x branch.
>> If that OpenProcess is where things fail, I assume that the line that
>> fails is "lstopo --ps". On MinGW, that code is ignored because /proc
>> doesn't exist. Does /proc exist on Cygwin? If so, we should just disable
>> that test line on Windows.
>
> /proc exists but Windows calls of course are not aware of it.
> The failing message in German is coming from the Windows layer,
> as my Cygwin environment is in English.

Actually the error message comes from lstopo itself. We list PIDs from
/proc and then pass them to OpenProcess, which likely fails for
administrator processes. And we abort() instead of returning an error.
I guess things could work, but we'd need to set up a Cygwin installation
here for testing, so it'll take some time.
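
The "skip instead of abort()" behavior could look roughly like the sketch below. It is not lstopo code: visit_pids(), open_pid_fn and demo_try_open() are hypothetical names, and the callback merely stands in for the OpenProcess() call so that the example builds anywhere.

```c
#include <stddef.h>

/* Hypothetical callback type: returns nonzero if the process could be opened. */
typedef int (*open_pid_fn)(long pid);

/* Visit every PID, count the ones that could be opened, and skip the rest
 * instead of calling abort(). The number of failures goes to *nfailed. */
size_t visit_pids(const long *pids, size_t n, open_pid_fn try_open,
                  size_t *nfailed)
{
    size_t opened = 0, failed = 0;

    for (size_t i = 0; i < n; i++) {
        if (try_open(pids[i]))
            opened++;
        else
            failed++;   /* e.g. access denied: not fatal, just skip */
    }
    if (nfailed)
        *nfailed = failed;
    return opened;
}

/* Stand-in for OpenProcess(): pretend PID 13220 is access-protected. */
int demo_try_open(long pid)
{
    return pid != 13220;
}
```

With this shape, one protected process reduces the list that gets displayed but no longer terminates the whole run.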

>
>>>
>>> 
>>>
>>> And all the :
>>>
>>> FAIL: Intel-Skylake-2xXeon6140.output
>>> FAIL: Intel-Broadwell-2xXeon-E5-2650Lv4.output
>>> FAIL: Intel-Haswell-2xXeon-E5-2680v3.output
>>> FAIL: Intel-IvyBridge-12xXeon-E5-4620v2.output
>>> FAIL: Intel-SandyBridge-2xXeon-E5-2650.output
>>> FAIL: Intel-Westmere-2xXeon-X5650.output
>>> FAIL: Intel-Nehalem-2xXeon-X5550.output
>>> FAIL: Intel-Penryn-4xXeon-X7460.output
>>> FAIL: Intel-Core-2xXeon-E5345.output
>>> FAIL: Intel-KnightsLanding-XeonPhi-7210.output
>>> FAIL: Intel-KnightsCorner-XeonPhi-SE10P.output
>>> FAIL: AMD-17h-Zen-2xEpyc-7451.output
>>> FAIL: AMD-15h-Piledriver-4xOpteron-6348.output
>>> FAIL: AMD-15h-Bulldozer-4xOpteron-6272.output
>>> FAIL: AMD-K10-MagnyCours-2xOpteron-6164HE.output
>>> FAIL: AMD-K10-Istanbul-8xOpteron-8439SE.output
>>> FAIL: AMD-K8-SantaRosa-2xOpteron-2218.output
>>> FAIL: AMD-K8-SledgeHammer-2xOpteron-250.output
>>> FAIL: Zhaoxin-CentaurHauls-ZXD-4600.output
>>> FAIL: Zhaoxin-Shanghai-KaiSheng-ZXC+-FC1081.output
>>> ###
>>>
>>> But it is not clear to me how these tests should pass.
>>>
>>> The Laptop has a Quad Core I5
>>
>> These tests use a tarball of the output of the cpuid instruction to
>> emulate calling cpuid on those platforms.
>> Go to tests/hwloc/x86, unpack one of the tarballs, and run
>> "/path/to/utils/lstopo/lstopo -i ", you
>> should get more information about what's failing when reading these
>> dumped cpuid outputs.
>> If it doesn't work, tests/hwloc/x86/Intel-Skylake-2xXeon6140.output.log
>> will show the difference between the expected and obtained topology when
>> exported to XML.
>
> I saw the difference, and as my machine is different from
> everyone on the list none of the tests can pass.

Your own machine doesn't matter. None of these tests look at your CPU or
topology. *All* of them should behave the same on all x86 machines.
CPUID is emulated by reading files; nothing is read from your local
machine topology. There's just something wrong here that prevents these
CPUID emulation files from being read. "lstopo -i ..." will tell you.
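
To illustrate the idea of answering cpuid from recorded data rather than from the local CPU, here is a minimal sketch. It is an assumption-laden illustration only: struct cpuid_record and emulated_cpuid() are hypothetical, and the layout is not hwloc's actual dump format.

```c
#include <stddef.h>
#include <stdint.h>

/* One recorded CPUID result. This layout is hypothetical and is not
 * hwloc's on-disk dump format. */
struct cpuid_record {
    uint32_t leaf, subleaf;         /* inputs: eax, ecx */
    uint32_t eax, ebx, ecx, edx;    /* recorded outputs */
};

/* Emulated cpuid: answer from the recorded table, never from the local CPU.
 * Returns 0 and fills out[4] on a hit, -1 if the leaf was not recorded. */
int emulated_cpuid(const struct cpuid_record *tab, size_t n,
                   uint32_t leaf, uint32_t subleaf, uint32_t out[4])
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].leaf == leaf && tab[i].subleaf == subleaf) {
            out[0] = tab[i].eax;
            out[1] = tab[i].ebx;
            out[2] = tab[i].ecx;
            out[3] = tab[i].edx;
            return 0;
        }
    }
    return -1;
}
```

Because every answer comes from the table, the result is the same on any host, which is why the local laptop's CPU model should not influence these tests.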

Brice


Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Brice Goglin
On 01/10/2018 at 17:27, Marco Atzeri wrote:
> On 30.09.2018 at 20:11, Samuel Thibault wrote:
>> Marco Atzeri, on Sun, 30 Sep 2018 20:02:59 +0200, wrote:
>>> Also, adding a HWLOC_DECLSPEC on the first case, distances.c:347,
>>> does not solve the issue as the two declarations are not the same.
>>>
>>> Any suggestions?
>>
>> Perhaps use hwloc_uint64_t instead of uint64_t in hwloc/distances.c?
>>
>> Samuel
>
> Thanks Samuel,
> it was that, in more than one place.
>
> The attached patch allows compilation on 64-bit Cygwin.

hwloc_uint64_t is currently defined to DWORDLONG (worked fine on MinGW
and MSVC so far). I'd like to see if there's an easier way to solve this
issue by just making that definition compatible for cygwin.
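
For context on why the declarations conflict: two integer types can have the same width and still be distinct types to the compiler. The sketch below assumes C11 (_Generic); demo_dwordlong is a hypothetical stand-in for the Windows DWORDLONG typedef, taken here to be unsigned long long. On an LP64 target such as 64-bit Cygwin, uint64_t is typically unsigned long, so a prototype using one and a definition using the other do not match.

```c
#include <stdint.h>

/* Hypothetical stand-in for the Windows DWORDLONG typedef
 * (assumed to be an unsigned long long). */
typedef unsigned long long demo_dwordlong;

/* Map an expression's type to a small code (C11 _Generic):
 * 1 = unsigned long, 2 = unsigned long long, 0 = anything else. */
#define UTYPE_CODE(x) _Generic((x), \
    unsigned long: 1,               \
    unsigned long long: 2,          \
    default: 0)

int uint64_type_code(void)    { uint64_t v = 0;       return UTYPE_CODE(v); }
int dwordlong_type_code(void) { demo_dwordlong v = 0; return UTYPE_CODE(v); }

/* Same width does not imply same type. */
int same_width(void)
{
    return sizeof(uint64_t) == sizeof(demo_dwordlong);
}
```

If uint64_type_code() and dwordlong_type_code() differ on a given platform while same_width() holds, that platform will reject mixed declarations of the same function.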

> FAIL: test-lstopo.sh
>
> that seems due to a mix between Cygwin and Windows
>
>  utils/lstopo/test-lstopo.sh.log #
>
> Machine (3665MB total) + Package L#0
>   NUMANode L#0 (P#0 3665MB)
>   L3 L#0 (6144KB)
>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#1)
>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>       PU L#2 (P#2)
>       PU L#3 (P#3)
>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>       PU L#4 (P#4)
>       PU L#5 (P#5)
>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>       PU L#6 (P#6)
>       PU L#7 (P#7)
> OpenProcess 13220 failed 5: Zugriff verweigert

Unfortunately that test script isn't easy to debug in the v2.x branch.
If that OpenProcess is where things fail, I assume that the line that
fails is "lstopo --ps". On MinGW, that code is ignored because /proc
doesn't exist. Does /proc exist on Cygwin? If so, we should just disable
that test line on Windows.


>
> 
>
> And all the :
>
> FAIL: Intel-Skylake-2xXeon6140.output
> FAIL: Intel-Broadwell-2xXeon-E5-2650Lv4.output
> FAIL: Intel-Haswell-2xXeon-E5-2680v3.output
> FAIL: Intel-IvyBridge-12xXeon-E5-4620v2.output
> FAIL: Intel-SandyBridge-2xXeon-E5-2650.output
> FAIL: Intel-Westmere-2xXeon-X5650.output
> FAIL: Intel-Nehalem-2xXeon-X5550.output
> FAIL: Intel-Penryn-4xXeon-X7460.output
> FAIL: Intel-Core-2xXeon-E5345.output
> FAIL: Intel-KnightsLanding-XeonPhi-7210.output
> FAIL: Intel-KnightsCorner-XeonPhi-SE10P.output
> FAIL: AMD-17h-Zen-2xEpyc-7451.output
> FAIL: AMD-15h-Piledriver-4xOpteron-6348.output
> FAIL: AMD-15h-Bulldozer-4xOpteron-6272.output
> FAIL: AMD-K10-MagnyCours-2xOpteron-6164HE.output
> FAIL: AMD-K10-Istanbul-8xOpteron-8439SE.output
> FAIL: AMD-K8-SantaRosa-2xOpteron-2218.output
> FAIL: AMD-K8-SledgeHammer-2xOpteron-250.output
> FAIL: Zhaoxin-CentaurHauls-ZXD-4600.output
> FAIL: Zhaoxin-Shanghai-KaiSheng-ZXC+-FC1081.output
> ###
>
> But it is not clear to me how these tests should pass.
>
> The Laptop has a Quad Core I5

These tests use a tarball of the output of the cpuid instruction to
emulate calling cpuid on those platforms.
Go to tests/hwloc/x86, unpack one of the tarballs, and run
"/path/to/utils/lstopo/lstopo -i ", you
should get more information about what's failing when reading these
dumped cpuid outputs.
If it doesn't work, tests/hwloc/x86/Intel-Skylake-2xXeon6140.output.log
will show the difference between the expected and obtained topology when
exported to XML.

Brice


Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Marco Atzeri

On 30.09.2018 at 20:11, Samuel Thibault wrote:
> Marco Atzeri, on Sun, 30 Sep 2018 20:02:59 +0200, wrote:
>> Also, adding a HWLOC_DECLSPEC on the first case, distances.c:347,
>> does not solve the issue as the two declarations are not the same.
>>
>> Any suggestions?
>
> Perhaps use hwloc_uint64_t instead of uint64_t in hwloc/distances.c?
>
> Samuel

Thanks Samuel,
it was that, in more than one place.

The attached patch allows compilation on 64-bit Cygwin.

The only remaining warnings are minor ones:

##
/cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/hwloc/topology-windows.c: In function ‘hwloc_look_windows’:
/cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/hwloc/topology-windows.c:814:28: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘ULONG_PTR {aka long long unsigned int}’ [-Wformat=]
  hwloc_debug("%s#%u mask %lx\n", hwloc_obj_type_string(type), id, procInfo[i].ProcessorMask);
                          ~~^                                      ~~~~~~~~~~~~~~~~~~~~~~~~~
                          %llx
###
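
A warning of this kind can usually be silenced in one of two portable ways. The sketch below is not a patch against topology-windows.c: the function names are hypothetical, and uintptr_t stands in for the Windows ULONG_PTR type so that the example builds on any platform.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Option 1: cast to a type with a fixed printf length modifier. */
int format_mask_cast(char *buf, size_t len, uintptr_t mask)
{
    return snprintf(buf, len, "mask %llx", (unsigned long long) mask);
}

/* Option 2: use the <inttypes.h> macro that matches uintptr_t. */
int format_mask_macro(char *buf, size_t len, uintptr_t mask)
{
    return snprintf(buf, len, "mask %" PRIxPTR, mask);
}
```

The cast variant works for any integer type up to 64 bits, whereas the macro variant only applies when the argument really is a uintptr_t.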


The only tests failing are:

FAIL: test-lstopo.sh

that seems due to a mix between Cygwin and Windows

 utils/lstopo/test-lstopo.sh.log #

Machine (3665MB total) + Package L#0
  NUMANode L#0 (P#0 3665MB)
  L3 L#0 (6144KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
OpenProcess 13220 failed 5: Zugriff verweigert



And all the :

FAIL: Intel-Skylake-2xXeon6140.output
FAIL: Intel-Broadwell-2xXeon-E5-2650Lv4.output
FAIL: Intel-Haswell-2xXeon-E5-2680v3.output
FAIL: Intel-IvyBridge-12xXeon-E5-4620v2.output
FAIL: Intel-SandyBridge-2xXeon-E5-2650.output
FAIL: Intel-Westmere-2xXeon-X5650.output
FAIL: Intel-Nehalem-2xXeon-X5550.output
FAIL: Intel-Penryn-4xXeon-X7460.output
FAIL: Intel-Core-2xXeon-E5345.output
FAIL: Intel-KnightsLanding-XeonPhi-7210.output
FAIL: Intel-KnightsCorner-XeonPhi-SE10P.output
FAIL: AMD-17h-Zen-2xEpyc-7451.output
FAIL: AMD-15h-Piledriver-4xOpteron-6348.output
FAIL: AMD-15h-Bulldozer-4xOpteron-6272.output
FAIL: AMD-K10-MagnyCours-2xOpteron-6164HE.output
FAIL: AMD-K10-Istanbul-8xOpteron-8439SE.output
FAIL: AMD-K8-SantaRosa-2xOpteron-2218.output
FAIL: AMD-K8-SledgeHammer-2xOpteron-250.output
FAIL: Zhaoxin-CentaurHauls-ZXD-4600.output
FAIL: Zhaoxin-Shanghai-KaiSheng-ZXC+-FC1081.output
###

But it is not clear to me how these tests should pass.


The Laptop has a Quad Core I5

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model   : 142
model name  : Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
stepping: 10
cpu MHz : 1800.000
cache size  : 6144 KB

Regards
Marco










--- origsrc/hwloc-2.0.2/hwloc/distances.c   2018-01-26 09:56:03.0 +0100
+++ src/hwloc-2.0.2/hwloc/distances.c   2018-10-01 13:39:46.501610500 +0200
@@ -345,7 +345,7 @@ int hwloc_internal_distances_add(hwloc_t
 /* The actual function exported to the user
  */
 int hwloc_distances_add(hwloc_topology_t topology,
-   unsigned nbobjs, hwloc_obj_t *objs, uint64_t *values,
+   unsigned nbobjs, hwloc_obj_t *objs, hwloc_uint64_t *values,
unsigned long kind, unsigned long flags)
 {
   hwloc_obj_type_t type;
--- origsrc/hwloc-2.0.2/hwloc/pci-common.c  2018-01-26 09:56:03.0 +0100
+++ src/hwloc-2.0.2/hwloc/pci-common.c  2018-10-01 14:00:38.336045900 +0200
@@ -16,7 +16,7 @@
 #endif
 #include 
 
-#ifdef HWLOC_WIN_SYS
+#if defined(HWLOC_WIN_SYS) && !defined(__CYGWIN__)
 #include 
 #define open _open
 #define read _read
--- origsrc/hwloc-2.0.2/hwloc/shmem.c   2018-01-26 09:56:03.0 +0100
+++ src/hwloc-2.0.2/hwloc/shmem.c   2018-10-01 13:46:55.542444700 +0200
@@ -76,7 +76,7 @@ hwloc_shmem_topology_get_length(hwloc_to
 
 int
 hwloc_shmem_topology_write(hwloc_topology_t topology,
-  int fd, uint64_t fileoffset,
+  int fd, hwloc_uint64_t fileoffset,
   void *mmap_address, size_t length,
   unsigned long flags)
 {
@@ -259,7 +259,7 @@ hwloc_shmem_topology_get_length(hwloc_to
 
 int
 hwloc_shmem_topology_write(hwloc_topology_t topology __hwloc_attribute_unused,
-  int fd __hwloc_attribute_unused, uint64_t fileoffset