Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Samuel Thibault
Jirka Hladky, on Thu 11 Nov 2010 20:03:20 +0100, wrote:
> On Thursday, November 11, 2010 07:19:41 pm Samuel Thibault wrote:
> > Jirka Hladky, on Thu 11 Nov 2010 14:50:46 +0100, wrote:
> > > "On this system function XYZ is not supported by GLIBC/KERNEL)"
> > > 
> > > I'm missing the information:
> > > 
> > > -which function is not implemented
> > 
> > Well, you have it: hwloc_proc_getmembind()
> > How it'd be called by the OS in the future is unknown of course.
> > 
> > > -where this function belong - is it system call, glibc or hwloc's
> > > function?
> > 
> > It's always system call or glibc function, it depends on the system and
> > we can't know where it'd be implemented in the future. Or our lack of
> > knowledge of which system call can provide the functionality.
> 
> Well, I think I have not expressed myself correctly. At the moment we have:
> 
> hwloc_get_membind failed (errno 38 Function not implemented)
> 
> I would like to see which glibc/system call has failed.
> Example:
> 
>   err = get_mempolicy(&linuxpolicy, linuxmask, max_os_index, 0, 0);
>   if (err < 0) {
>     perror("get_mempolicy"); <== ADD THIS LINE
>     goto out_with_mask;
>   }

My point is that the fix here is _not_ about get_mempolicy. Hwloc didn't
even call it. Hwloc just knows that Linux doesn't provide any function
to get the mempolicy of another process. The get_mempolicy function
doesn't take a pid and never will, so the Linux developers would have to
define another OS function in the future, and it would have a different
name. So printing "get_mempolicy" would not actually help.
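
As an illustration of the point above, here is a minimal sketch of how a library can report "Function not implemented" without ever calling an OS function. The hook table, the `linux_like_hooks` instance and the `fake_get_thisproc_membind` stand-in are hypothetical names made up for this example; they are not hwloc's actual internals.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-OS hook table: a NULL entry means the OS offers no way
 * to implement that operation. */
struct membind_hooks {
    int (*get_thisproc_membind)(unsigned long *nodemask);
    int (*get_proc_membind)(pid_t pid, unsigned long *nodemask);
};

/* Stand-in for a backend built on get_mempolicy(), which has no pid
 * argument. It simply reports node 0 for illustration. */
static int fake_get_thisproc_membind(unsigned long *nodemask)
{
    *nodemask = 1UL;
    return 0;
}

/* Linux-like table: nothing can query another process, so no hook is set. */
static const struct membind_hooks linux_like_hooks = {
    fake_get_thisproc_membind,
    NULL
};

/* Fails with ENOSYS, without calling any OS function, when the hook is
 * missing. */
static int get_proc_membind(const struct membind_hooks *hooks, pid_t pid,
                            unsigned long *nodemask)
{
    if (hooks->get_proc_membind == NULL) {
        errno = ENOSYS;
        return -1;
    }
    return hooks->get_proc_membind(pid, nodemask);
}
```

With such a layout, calling `get_proc_membind(&linux_like_hooks, 344, &mask)` returns -1 with errno set to ENOSYS, and there is no underlying system call name that could be printed.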

> My first impression when I saw the error message above was that function 
> "hwloc_get_membind" is not implemented. 

hwloc-bind should indeed probably print "hwloc_proc_get_membind" instead
when the --pid flag is given.  I don't think much more can be printed.
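
A rough sketch of that change, assuming the tool knows whether --pid was given: the message names the per-process call in that case. The helpers `membind_call_name` and `format_failure` are hypothetical and only mirror the message format shown in this thread; the two function names are the ones quoted above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: picks the label of the failing hwloc-level call.
 * Both names are taken from this thread. */
static const char *membind_call_name(int pid_given)
{
    return pid_given ? "hwloc_proc_get_membind" : "hwloc_get_membind";
}

/* Formats "<call> failed (errno N message)" into buf and returns the
 * length reported by snprintf. */
static int format_failure(char *buf, size_t len, int pid_given, int err)
{
    return snprintf(buf, len, "%s failed (errno %d %s)",
                    membind_call_name(pid_given), err, strerror(err));
}
```

The errno text itself comes from `strerror`, so it still reads as in the reports above; only the leading function name changes.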

Samuel


Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Jirka Hladky
On Thursday, November 11, 2010 07:19:41 pm Samuel Thibault wrote:
> Jirka Hladky, on Thu 11 Nov 2010 14:50:46 +0100, wrote:
> > "On this system function XYZ is not supported by GLIBC/KERNEL)"
> > 
> > I'm missing the information:
> > 
> > -which function is not implemented
> 
> Well, you have it: hwloc_proc_getmembind()
> How it'd be called by the OS in the future is unknown of course.
> 
> > -where this function belong - is it system call, glibc or hwloc's
> > function?
> 
> It's always system call or glibc function, it depends on the system and
> we can't know where it'd be implemented in the future. Or our lack of
> knowledge of which system call can provide the functionality.

Well, I think I have not expressed myself correctly. At the moment we have:

hwloc_get_membind failed (errno 38 Function not implemented)

I would like to see which glibc/system call has failed.
Example:

  err = get_mempolicy(&linuxpolicy, linuxmask, max_os_index, 0, 0);
  if (err < 0) {
    perror("get_mempolicy"); <== ADD THIS LINE
    goto out_with_mask;
  }


Right now, you just know that an error has occurred somewhere in
hwloc_get_membind.

My first impression when I saw the error message above was that function 
"hwloc_get_membind" is not implemented. 


> 
> > Or perhaps something more user friendly like
> > "On this system --get does not work together with --membind"
> 
> We'd have to handle a big list of combinations of parameters in that
> case.  I'd rather add a paragraph to the documentation that just
> explains that not everything is available on all OSes, or hwloc just
> doesn't know that it got implemented.
I completely agree on that. Please add a paragraph to the documentation
explaining that some functionality is not available on all OSes.

Thanks!
Jirka



Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Samuel Thibault
Jirka Hladky, on Thu 11 Nov 2010 14:50:46 +0100, wrote:
> "On this system function XYZ is not supported by GLIBC/KERNEL)"
> 
> I'm missing the information:
> 
> -which function is not implemented

Well, you have it: hwloc_proc_getmembind()
How it'd be called by the OS in the future is unknown of course.

> -where this function belong - is it system call, glibc or hwloc's function?

It's always a system call or a glibc function; it depends on the system, and
we can't know where it would be implemented in the future. Or it may be our
lack of knowledge of which system call can provide the functionality.

> Or perhaps something more user friendly like
> "On this system --get does not work together with --membind"

We'd have to handle a big list of combinations of parameters in that
case.  I'd rather add a paragraph to the documentation that just
explains that not everything is available on all OSes, or hwloc just
doesn't know that it got implemented.

Samuel


Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Jirka Hladky
On Thursday, November 11, 2010 01:43:38 pm Samuel Thibault wrote:
> Jirka Hladky, on Thu 11 Nov 2010 13:36:46 +0100, wrote:
> > > hwloc_get_membind failed (errno 38 Function not implemented)
> > 
> > Yes, you are right!
> > --get --pid
> > works on Linux.
> > 
> > --get --membind --pid
> > will give "Function not implemented"
> > 
> > $ /tmp/hwloc-1.1rc2/utils/hwloc-bind --get --membind --pid 344
> > hwloc_get_membind failed (errno 38 Function not implemented)
> > 
> > > It actually depends on the OS. I'll see what I can.
> > 
> > I see. It's getting difficult then. I believe that in this case more
> > explanatory error message would be enough.
> 
> Mmm, to me
> 
> hwloc_get_membind failed (errno 38 Function not implemented)
> 
> is already self-explanatory actually. Do you see how could it be improved?
> 
> Samuel

Hi Samuel,

you could say something like

"On this system, function XYZ is not supported by GLIBC/KERNEL"

I'm missing the following information:

- which function is not implemented
- where this function belongs: is it a system call, a glibc function or an
  hwloc function?

Or perhaps something more user-friendly like
"On this system --get does not work together with --membind"

It's just my personal opinion.

Thanks
Jirka




Re: [hwloc-devel] hwloc-1.2a1r2694 and hwloc-1.2a1r2751

2010-11-11 Thread Jirka Hladky
Hi Brice,

this one is tricky. I don't see this crash when compiling by hand (./configure 
&& make && make check). I see the crash only when building with rpmbuild. It 
happens both with 2694 and 2751.

rpmbuild automatically applies its own CFLAGS. In the end, I reduced it to

cd hwloc-1.2a1r2751/
export CFLAGS='-O2'
./configure && make && make check

It works fine with -O1

Please check whether you can reproduce the problem with
===
$ make clean && export CFLAGS='-g -O2' && ./configure && make && make check
===


This is gdb output:
=
gdb /tmp/J/hwloc-1.2a1r2751/tests/.libs/lt-linux-libnuma
(gdb) run
Starting program: /tmp/J/hwloc-1.2a1r2751/tests/.libs/lt-linux-libnuma 

Program received signal SIGSEGV, Segmentation fault.
0x77deb632 in hwloc_get_type_depth (topology=0x0, type=HWLOC_OBJ_NODE) at traversal.c:17
=

I have the feeling it's a gcc bug. Any feedback?

Thanks
Jirka


On Wednesday, November 10, 2010 07:33:19 pm Brice Goglin wrote:
> I don't see any change in this test between 2694 and 2751. Do you get a
> better backtrace if you compile in debug mode (and/or with CFLAGS="-g
> -O0") or with gdb?
> 
> Brice
> 
> On 10/11/2010 15:56, Jirka Hladky wrote:
> > Hi Brice,
> > 
> > just a quick check.
> > 
> > I see the following when running make check for hwloc-1.2a1r2694
> > 
> > ==
> > PASS: hwloc_insert_misc
> > *** buffer overflow detected ***: /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma terminated
> > === Backtrace: =
> > /lib64/libc.so.6(__fortify_fail+0x37)[0x30cfcf7707]
> > /lib64/libc.so.6[0x30cfcf5720]
> > /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma[0x401ae9]
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x30cfc1eb1d]
> > /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma[0x401059]
> > === Memory map: 
> > 0040-00404000 r-xp  fd:00 1230911 /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma
> > 00603000-00604000 rw-p 3000 fd:00 1230911 /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma
> > 019a6000-019c7000 rw-p  00:00 0 [heap]
> > 30cf80-30cf81e000 r-xp  08:02 48991 /lib64/ld-2.11.2.so
> > ===
> > 
> > It's running just fine when using hwloc-1.2a1r2751
> > 
> > Have you fixed this test in hwloc-1.2a1r2751?
> > 
> > Thanks!
> > Jirka



Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Samuel Thibault
Jirka Hladky, on Thu 11 Nov 2010 13:36:46 +0100, wrote:
> > hwloc_get_membind failed (errno 38 Function not implemented)
> 
> Yes, you are right!
> --get --pid 
> works on Linux.
> 
> --get --membind --pid
> will give "Function not implemented"
> 
> $ /tmp/hwloc-1.1rc2/utils/hwloc-bind --get --membind --pid 344
> hwloc_get_membind failed (errno 38 Function not implemented)
> 
> > It actually depends on the OS. I'll see what I can do.
> I see. It's getting difficult then. I believe that in this case a more
> explanatory error message would be enough.

Mmm, to me

hwloc_get_membind failed (errno 38 Function not implemented)

is already self-explanatory actually. Do you see how it could be improved?

Samuel


Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Jirka Hladky
Hi Brice,
hi Samuel,

see my comments below.

> > 1) Does the option --get work together with --pid? Like finding out
> > mempolicy for any pid? I don't think that get_mempolicy supports this.
> 
> hwloc indeed gives:
> 
> hwloc_get_membind failed (errno 38 Function not implemented)

Yes, you are right!
--get --pid 
works on Linux.

--get --membind --pid
will give "Function not implemented"

$ /tmp/hwloc-1.1rc2/utils/hwloc-bind --get --membind --pid 344
hwloc_get_membind failed (errno 38 Function not implemented)

> It actually depends on the OS. I'll see what I can do.
I see. It's getting difficult then. I believe that in this case a more
explanatory error message would be enough.


Thanks!
Jirka


Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Brice Goglin
On 11/11/2010 13:08, Jirka Hladky wrote:
> On Thursday, November 11, 2010 11:11:31 am Brice Goglin wrote:
>> On 11/11/2010 02:31, Samuel Thibault wrote:
>>>> get_mempolicy: Invalid argument
>>>> hwloc_get_membind failed (errno 22 Invalid argument)
>>>
>>> Could you try to increase the value of max_os_index?
>>>
>>> I can see in the kernel source code the following in sys_get_mempolicy:
>>>
>>>   if (nmask != NULL && maxnode < MAX_NUMNODES)
>>>           return -EINVAL;
>>>
>>> and MAX_NUMNODES depends on .config ...
>>
>> And indeed MAX_NUMNODES is (1<<CONFIG_NODES_SHIFT) and
>> CONFIG_NODES_SHIFT=9 on rhel6 kernels. We pass a single ulong to the
>> kernel, so it's not large enough to store 1<<9 bits. We couldn't
>> reproduce on Debian and RHEL5 since NODE_SHIFT=6 there.
>>
>> We had to loop until we found the kernel NR_CPUS for sched_getaffinity,
>> we can do the same to find the kernel MAX_NUMNODES for get_mempolicy.
>> The attached patch may help. Only slightly tested obviously since I
>> don't have any kernel causing the problem.
>>
>> Brice
>
> Hi Brice,
>
> thanks for the quick patch. I have tested it and it works! :-)
>
> $ utils/hwloc-bind --membind node:1 --mempolicy interleave -- utils/hwloc-bind --get --membind
> 0x (interleave)
>
>
> I have a couple of questions:
> 1) Does the option --get work together with --pid? Like finding out the
> mempolicy for any pid? I don't think that get_mempolicy supports this.

Right, it's not supported on Linux.

> We can perhaps enhance the parsing to raise an error when --pid and --get
> are both specified.

It actually depends on the OS. I'll see what I can do.

> 2) This might be a dumb question - I have tried --get on my laptop, which is
> running Fedora-12. It's a one-socket system with NUMA enabled - there is,
> however, only node#0. I know that it's nonsense. But still, you can use this
> to run some tests.
>
> I'm quite puzzled by the following output:
>
> $ utils/hwloc-bind --membind node:0 --mempolicy interleave -- utils/hwloc-bind --get --membind
> 0xf...f (interleave)
>
> What does "0xf...f" mean?
>   

0xf...f is a full set (all bits from 0 to infinity are set). It means
that the memory binding is set to "near all the memory of the machine".
Finding a behavior that works for both NUMA and non-NUMA cases was not
easy...
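
To make the "full set" notation concrete, here is a toy sketch of a bitmap that stores one explicit word plus an "infinitely set" flag and prints itself with the "0xf...f" prefix. The `toy_bitmap` type, its two helpers and the exact text format for partially set words are assumptions made for illustration; they are not hwloc's real data layout or output code.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bitmap: one explicit word plus a flag meaning "every bit
 * beyond the stored word is also set". Not hwloc's actual layout. */
struct toy_bitmap {
    unsigned long word;
    int infinite;
};

/* Returns 1 if the given bit is set, looking at the flag for bits that
 * lie beyond the stored word. */
static int toy_bitmap_isset(const struct toy_bitmap *set, unsigned long bit)
{
    if (bit >= sizeof(unsigned long) * CHAR_BIT)
        return set->infinite;
    return (int)((set->word >> bit) & 1UL);
}

/* Writes the set as text, using "0xf...f" for the infinitely set part. */
static int toy_bitmap_snprintf(char *buf, size_t len,
                               const struct toy_bitmap *set)
{
    if (set->infinite && set->word == ~0UL)
        return snprintf(buf, len, "0xf...f");
    if (set->infinite)
        return snprintf(buf, len, "0xf...f,0x%lx", set->word);
    return snprintf(buf, len, "0x%lx", set->word);
}
```

In this sketch a set with every stored bit and the flag set prints as "0xf...f", and `toy_bitmap_isset` answers 1 for any bit index, however large.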


> 3) Just a small hint. Fedora 12 is using almost the same kernel as RHEL-6.
>   

Ah, good to know, thanks. I am deploying an F12 machine right now to check
things.

Brice



Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Jirka Hladky
On Thursday, November 11, 2010 11:11:31 am Brice Goglin wrote:
> On 11/11/2010 02:31, Samuel Thibault wrote:
> >> get_mempolicy: Invalid argument
> >> hwloc_get_membind failed (errno 22 Invalid argument)
> > 
> > Could you try to increase the value of max_os_index?
> > 
> > I can see in the kernel source code the following in sys_get_mempolicy:
> > 
> >   if (nmask != NULL && maxnode < MAX_NUMNODES)
> >           return -EINVAL;
> > 
> > and MAX_NUMNODES depends on .config ...
> 
> And indeed MAX_NUMNODES is (1<<CONFIG_NODES_SHIFT) and
> CONFIG_NODES_SHIFT=9 on rhel6 kernels. We pass a single ulong to the
> kernel, so it's not large enough to store 1<<9 bits. We couldn't
> reproduce on Debian and RHEL5 since NODE_SHIFT=6 there.
> 
> We had to loop until we found the kernel NR_CPUS for sched_getaffinity,
> we can do the same to find the kernel MAX_NUMNODES for get_mempolicy.
> The attached patch may help. Only slightly tested obviously since I
> don't have any kernel causing the problem.
> 
> Brice


Hi Brice,

thanks for the quick patch. I have tested it and it works! :-)

$ utils/hwloc-bind --membind node:1 --mempolicy interleave -- utils/hwloc-bind --get --membind
0x (interleave)
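
For reference, the approach described in the quoted message (grow the node mask until the kernel stops answering EINVAL) can be sketched roughly as below. This is not the attached patch: `probe_fn`, `fake_probe` and `find_max_numnodes` are hypothetical names, and the fake probe only imitates the kernel check quoted above with an assumed limit of 1<<9 = 512 nodes. A real probe would wrap get_mempolicy().

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* A probe behaves like get_mempolicy(): it returns -1 with errno set to
 * EINVAL when maxnode is smaller than the kernel's limit. */
typedef int (*probe_fn)(unsigned long *mask, unsigned long maxnode);

/* Fake kernel for the demo: pretends MAX_NUMNODES is 1<<9 = 512. */
static int fake_probe(unsigned long *mask, unsigned long maxnode)
{
    if (maxnode < 512) {
        errno = EINVAL;
        return -1;
    }
    if (mask != NULL)
        mask[0] = 1UL; /* report node 0 */
    return 0;
}

/* Doubles the mask size until the probe stops failing with EINVAL.
 * Returns the accepted size in bits, or 0 on any other failure. */
static unsigned long find_max_numnodes(probe_fn probe)
{
    unsigned long maxnode = BITS_PER_LONG;

    for (;;) {
        unsigned long *mask = calloc(maxnode / BITS_PER_LONG, sizeof(*mask));
        int err, saved_errno;

        if (mask == NULL)
            return 0;
        err = probe(mask, maxnode);
        saved_errno = errno;
        free(mask);
        if (err == 0)
            return maxnode;
        if (saved_errno != EINVAL || maxnode > ULONG_MAX / 2)
            return 0;
        maxnode *= 2;
    }
}
```

With the fake probe, the loop settles on 512 bits, that is 64 bytes of mask, which is why a single unsigned long was too small on kernels built with CONFIG_NODES_SHIFT=9.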


I have a couple of questions:
1) Does the option --get work together with --pid? Like finding out the
mempolicy for any pid? I don't think that get_mempolicy supports this. We can
perhaps enhance the parsing to raise an error when --pid and --get are both
specified.

2) This might be a dumb question - I have tried --get on my laptop, which is
running Fedora-12. It's a one-socket system with NUMA enabled - there is,
however, only node#0. I know that it's nonsense. But still, you can use this
to run some tests.

I'm quite puzzled by the following output:

$ utils/hwloc-bind --membind node:0 --mempolicy interleave -- utils/hwloc-bind --get --membind
0xf...f (interleave)

What does "0xf...f" mean?

3) Just a small hint. Fedora 12 is using almost the same kernel as RHEL-6.

Thanks for looking into this!!!

Cheers
Jirka








Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released

2010-11-11 Thread Brice Goglin
On 11/11/2010 02:31, Samuel Thibault wrote:
>> get_mempolicy: Invalid argument
>> hwloc_get_membind failed (errno 22 Invalid argument)
>> 
>
> Could you try to increase the value of max_os_index?
>
> I can see in the kernel source code the following in sys_get_mempolicy:
>
>   if (nmask != NULL && maxnode < MAX_NUMNODES)
>   return -EINVAL;
>
> and MAX_NUMNODES depends on .config ...
>   

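And indeed MAX_NUMNODES is (1<<CONFIG_NODES_SHIFT) and
CONFIG_NODES_SHIFT=9 on rhel6 kernels. We pass a single ulong to the
kernel, so it's not large enough to store 1<<9 bits. We couldn't
reproduce on Debian and RHEL5 since NODE_SHIFT=6 there.

We had to loop until we found the kernel NR_CPUS for sched_getaffinity,
we can do the same to find the kernel MAX_NUMNODES for get_mempolicy.
The attached patch may help. Only slightly tested obviously since I
don't have any kernel causing the problem.

Brice

[...] mempolicy;... */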
 hwloc_linux_get_thisthread_membind(hwloc_topology_t topology, hwloc_nodeset_t nodeset, hwloc_membind_policy_t *policy, int flags __hwloc_attribute_unused)
 {
   hwloc_const_bitmap_t complete_nodeset;
-  unsigned max_os_index; /* highest os_index + 1 */
+  unsigned max_os_index;
   unsigned long *linuxmask;
   int linuxpolicy;
   int err;

-  /* compute max_os_index */
-  complete_nodeset = hwloc_topology_get_complete_nodeset(topology);
-  if (complete_nodeset) {
-    max_os_index = hwloc_bitmap_last(complete_nodeset);
-    if (max_os_index == (unsigned) -1)
-      max_os_index = 0;
-  } else {
-    max_os_index = 0;
-  }
-  /* round up to the nearest multiple of BITS_PER_LONG */
-  max_os_index = (max_os_index + HWLOC_BITS_PER_LONG) & ~(HWLOC_BITS_PER_LONG - 1);
+  max_os_index = hwloc_linux_find_kernel_max_numnodes(topology);

   linuxmask = malloc(max_os_index/HWLOC_BITS_PER_LONG * sizeof(long));
   if (!linuxmask) {