Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
Jirka Hladky wrote on Thu 11 Nov 2010 20:03:20 +0100:
> On Thursday, November 11, 2010 07:19:41 pm Samuel Thibault wrote:
> > Jirka Hladky wrote on Thu 11 Nov 2010 14:50:46 +0100:
> > > "On this system function XYZ is not supported by GLIBC/KERNEL)"
> > >
> > > I'm missing the information:
> > >
> > > -which function is not implemented
> >
> > Well, you have it: hwloc_proc_getmembind()
> > How it'd be called by the OS in the future is unknown of course.
> >
> > > -where this function belong - is it system call, glibc or hwloc's
> > > function?
> >
> > It's always system call or glibc function, it depends on the system and
> > we can't know where it'd be implemented in the future. Or our lack of
> > knowledge of which system call can provide the functionality.
>
> Well, I think I have not expressed myself correctly. At the moment we have:
>
> hwloc_get_membind failed (errno 38 Function not implemented)
>
> I would like to see which glibc/system call has failed.
> Example:
>
> err = get_mempolicy(, linuxmask, max_os_index, 0, 0);
> if (err < 0) {
>   perror("get_mempolicy");  <== ADD THIS LINE
>   goto out_with_mask;
> }

My point is that the fix here is _not_ about get_mempolicy. Hwloc didn't
even call it. Hwloc just knows that Linux doesn't provide any function to
get the mempolicy of another process. The get_mempolicy function doesn't
take a pid, and thus will never take one, so another OS function will have
to be defined in the future by Linux people, which will bear another name.
So printing "get_mempolicy" will not actually help.

> My first impression when I saw the error message above was that function
> "hwloc_get_membind" is not implemented.

hwloc_bind should probably print "hwloc_proc_get_membind" instead when it
gives the flag, indeed. I don't think much more can be printed.

Samuel
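To make the point above concrete, here is a minimal C sketch, not hwloc code. It assumes a backend that has no OS primitive to call and therefore fails with ENOSYS without ever reaching get_mempolicy(), so a perror("get_mempolicy") would never run. The helpers `fake_get_membind_of_pid`, `reported_name` and `format_failure` are hypothetical names invented for this illustration. The function names in the strings are spelled as they appear in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a backend with no OS primitive to call (assumption, not
 * hwloc code): it reports ENOSYS without invoking get_mempolicy() at all. */
static int fake_get_membind_of_pid(void)
{
    errno = ENOSYS;
    return -1;
}

/* Hypothetical helper: choose which function name to report, depending on
 * whether a pid was given. Names are spelled as in the thread. */
static const char *reported_name(int have_pid)
{
    return have_pid ? "hwloc_proc_get_membind" : "hwloc_get_membind";
}

/* Format the report in the same shape as the message quoted in the thread:
 * "<name> failed (errno <number> <text>)". */
static int format_failure(char *buf, size_t len, const char *name, int err)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, len, "%s failed (errno %d %s)", name, err, strerror(err));
}
```

A caller would test the return value of the (fake) backend, then pass `reported_name(pid_was_given)` and the saved `errno` to `format_failure`. The only change from the current message is the function name, which is what the reply above proposes.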
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
On Thursday, November 11, 2010 07:19:41 pm Samuel Thibault wrote:
> Jirka Hladky wrote on Thu 11 Nov 2010 14:50:46 +0100:
> > "On this system function XYZ is not supported by GLIBC/KERNEL)"
> >
> > I'm missing the information:
> >
> > -which function is not implemented
>
> Well, you have it: hwloc_proc_getmembind()
> How it'd be called by the OS in the future is unknown of course.
>
> > -where this function belong - is it system call, glibc or hwloc's
> > function?
>
> It's always system call or glibc function, it depends on the system and
> we can't know where it'd be implemented in the future. Or our lack of
> knowledge of which system call can provide the functionality.

Well, I think I have not expressed myself correctly. At the moment we have:

hwloc_get_membind failed (errno 38 Function not implemented)

I would like to see which glibc/system call has failed.
Example:

err = get_mempolicy(, linuxmask, max_os_index, 0, 0);
if (err < 0) {
  perror("get_mempolicy");  <== ADD THIS LINE
  goto out_with_mask;
}

Right now, you just know that an error has occurred somewhere in
hwloc_get_membind.

My first impression when I saw the error message above was that the function
"hwloc_get_membind" is not implemented.

> > Or perhaps something more user friendly like
> > "On this system --get does not work together with --membind"
>
> We'd have to handle a big list of combinations of parameters in that
> case. I'd rather add a paragraph to the documentation that just
> explains that not everything is available on all OSes, or hwloc just
> doesn't know that it got implemented.

I completely agree on that. Please add a paragraph to the documentation
explaining that some functionality is not available on all OSes.

Thanks!
Jirka
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
Jirka Hladky wrote on Thu 11 Nov 2010 14:50:46 +0100:
> "On this system function XYZ is not supported by GLIBC/KERNEL)"
>
> I'm missing the information:
>
> -which function is not implemented

Well, you have it: hwloc_proc_getmembind()
How it'd be called by the OS in the future is unknown of course.

> -where this function belong - is it system call, glibc or hwloc's function?

It's always a system call or a glibc function; it depends on the system, and
we can't know where it'd be implemented in the future. Or it may come from
our lack of knowledge of which system call can provide the functionality.

> Or perhaps something more user friendly like
> "On this system --get does not work together with --membind"

We'd have to handle a big list of combinations of parameters in that
case. I'd rather add a paragraph to the documentation that just
explains that not everything is available on all OSes, or hwloc just
doesn't know that it got implemented.

Samuel
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
On Thursday, November 11, 2010 01:43:38 pm Samuel Thibault wrote:
> Jirka Hladky wrote on Thu 11 Nov 2010 13:36:46 +0100:
> > > hwloc_get_membind failed (errno 38 Function not implemented)
> >
> > Yes, you are right!
> > --get --pid
> > works on Linux.
> >
> > --get --membind --pid
> > will give "Function not implemented"
> >
> > $ /tmp/hwloc-1.1rc2/utils/hwloc-bind --get --membind --pid 344
> > hwloc_get_membind failed (errno 38 Function not implemented)
> >
> > > It actually depends on the OS. I'll see what I can.
> >
> > I see. It's getting difficult then. I believe that in this case more
> > explanatory error message would be enough.
>
> Mmm, to me
>
> hwloc_get_membind failed (errno 38 Function not implemented)
>
> is already self-explanatory actually. Do you see how could it be improved?
>
> Samuel

Hi Samuel,

you can say that
"On this system function XYZ is not supported by GLIBC/KERNEL)"

I'm missing the information:

-which function is not implemented
-where this function belongs - is it a system call, a glibc function or an
 hwloc function?

Or perhaps something more user friendly like
"On this system --get does not work together with --membind"

It's just my personal opinion.

Thanks
Jirka
Re: [hwloc-devel] hwloc-1.2a1r2694 and hwloc-1.2a1r2751
Hi Brice,

this one is tricky. I don't see this crash when compiling by hand
(./configure && make && make check).

I see the crash only when building with rpmbuild. It happens with both 2694
and 2751. rpmbuild applies CFLAGS automatically. Finally, I have reduced it to

cd hwloc-1.2a1r2751/
export CFLAGS='-O2'
./configure && make && make check

It works fine with -O1.

Please check whether you can reproduce the problem with
===
$make clean && export CFLAGS='-g -O2' && ./configure && make && make check
===

This is the gdb output:
=
gdb /tmp/J/hwloc-1.2a1r2751/tests/.libs/lt-linux-libnuma
(gdb) run
Starting program: /tmp/J/hwloc-1.2a1r2751/tests/.libs/lt-linux-libnuma

Program received signal SIGSEGV, Segmentation fault.
0x77deb632 in hwloc_get_type_depth (topology=0x0, type=HWLOC_OBJ_NODE) at traversal.c:17
=

I have the feeling it's a gcc bug. Any feedback?

Thanks
Jirka

On Wednesday, November 10, 2010 07:33:19 pm Brice Goglin wrote:
> I don't see any change in this test between 2694 and 2751. Do you get a
> better backtrace if you compile in debug mode (and/or with CFLAGS="-g
> -O0") or with gdb?
>
> Brice
>
> On 10/11/2010 15:56, Jirka Hladky wrote:
> > Hi Brice,
> >
> > just a quick check.
> >
> > I see following when running make check for hwloc-1.2a1r2694
> >
> > ==
> > PASS: hwloc_insert_misc
> > *** buffer overflow detected ***:
> > /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma terminated
> > === Backtrace: =
> > /lib64/libc.so.6(__fortify_fail+0x37)[0x30cfcf7707]
> > /lib64/libc.so.6[0x30cfcf5720]
> > /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma[0x401ae9]
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x30cfc1eb1d]
> > /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma[0x401059]
> > === Memory map:
> > 0040-00404000 r-xp fd:00 1230911 /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma
> > 00603000-00604000 rw-p 3000 fd:00 1230911 /home/jhladky/rpmbuild/BUILD/hwloc-1.2a1r2694/tests/.libs/lt-linux-libnuma
> > 019a6000-019c7000 rw-p 00:00 0 [heap]
> > 30cf80-30cf81e000 r-xp 08:02 48991 /lib64/ld-2.11.2.so
> > ===
> >
> > It's running just fine when using hwloc-1.2a1r2751
> >
> > Have you fixed this test in hwloc-1.2a1r2751?
> >
> > Thanks!
> > Jirka
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
Jirka Hladky wrote on Thu 11 Nov 2010 13:36:46 +0100:
> > hwloc_get_membind failed (errno 38 Function not implemented)
>
> Yes, you are right!
> --get --pid
> works on Linux.
>
> --get --membind --pid
> will give "Function not implemented"
>
> $ /tmp/hwloc-1.1rc2/utils/hwloc-bind --get --membind --pid 344
> hwloc_get_membind failed (errno 38 Function not implemented)
>
> > It actually depends on the OS. I'll see what I can.
>
> I see. It's getting difficult then. I believe that in this case more
> explanatory error message would be enough.

Mmm, to me

hwloc_get_membind failed (errno 38 Function not implemented)

is already self-explanatory actually. Do you see how it could be improved?

Samuel
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
Hi Brice, hi Samuel,

see my comments below.

> > 1) Does the option --get works together with --pid ? Like finding out
> > mempolicy for any pid? I don't think that get_mempolicy supports this.
>
> hwloc indeed gives:
>
> hwloc_get_membind failed (errno 38 Function not implemented)

Yes, you are right!
--get --pid
works on Linux.

--get --membind --pid
will give "Function not implemented"

$ /tmp/hwloc-1.1rc2/utils/hwloc-bind --get --membind --pid 344
hwloc_get_membind failed (errno 38 Function not implemented)

> It actually depends on the OS. I'll see what I can.

I see. It's getting difficult then. I believe that in this case a more
explanatory error message would be enough.

Thanks!
Jirka
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
On 11/11/2010 13:08, Jirka Hladky wrote:
> On Thursday, November 11, 2010 11:11:31 am Brice Goglin wrote:
>> On 11/11/2010 02:31, Samuel Thibault wrote:
>>>> get_mempolicy: Invalid argument
>>>> hwloc_get_membind failed (errno 22 Invalid argument)
>>>
>>> Could you try to increase the value of max_os_index?
>>>
>>> I can see in the kernel source code the following in sys_get_mempolicy:
>>>     if (nmask != NULL && maxnode < MAX_NUMNODES)
>>>         return -EINVAL;
>>>
>>> and MAX_NUMNODES depends on .config ...
>>
>> And indeed MAX_NUMNODES is (1<<CONFIG_NODES_SHIFT) and
>> CONFIG_NODES_SHIFT=9 on rhel6 kernels. We pass a single ulong to the
>> kernel, so it's not large enough to store 1<<9 bits. We couldn't
>> reproduce on Debian and RHEL5 since NODE_SHIFT=6 there.
>>
>> We had to loop until we found the kernel NR_CPUS for sched_getaffinity,
>> we can do the same to find the kernel MAX_NUMNODES for get_mempolicy.
>> The attached patch may help. Only slightly tested obviously since I
>> don't have any kernel causing the problem.
>>
>> Brice
>
> Hi Brice,
>
> thanks for the quick patch. I have tested it and it works! :-)
>
> $ utils/hwloc-bind --membind node:1 --mempolicy interleave -- utils/hwloc-bind --get --membind
> 0x (interleave)
>
> I have couple of questions:
> 1) Does the option --get works together with --pid ? Like finding out
> mempolicy for any pid? I don't think that get_mempolicy supports this.

Right, it's not supported on Linux.

> We can perhaps
> enhance the parsing to raise an error when --pid and --get are both specified.

It actually depends on the OS. I'll see what I can do.

> 2) This might be a dumb question - I have tried --get on my laptop which is
> running Fedora-12. It's one socket system with NUMA enabled - there is
> however only node#0. I know that it's nonsense. But still, you can use this
> to run some tests
>
> I'm quite puzzled by the following output:
>
> $utils/hwloc-bind --membind node:0 --mempolicy interleave -- utils/hwloc-bind --get --membind
> 0xf...f (interleave)
>
> What does "0xf...f" mean?

0xf...f is a full set (all bits from 0 to infinity are set). It means that
the memory binding is set to "near all the memory of the machine". Finding a
behavior that works for both NUMA and non-NUMA cases was not easy...

> 3) Just a small hint. Fedora 12 is using almost the same kernel as RHEL-6.

Ah good to know, thanks. I am deploying a F12 machine right now to check
things.

Brice
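As a rough illustration of what "a full set (all bits from 0 to infinity are set)" can look like in code, here is a toy C model. It is not hwloc's real bitmap type or its real string format. It assumes a representation made of one explicit word plus a flag saying that every higher bit is set too. The `toy_bitmap` type and the `toy_bitmap_*` helpers are hypothetical names used only here.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Toy model, NOT hwloc's bitmap: one explicit word plus a flag meaning
 * "every bit beyond this word is set as well". */
struct toy_bitmap {
    unsigned long low;  /* explicitly stored bits */
    int infinite;       /* nonzero: all higher bits are set too */
};

/* Test one bit; indexes past the stored word are answered by the flag. */
static int toy_bitmap_isset(const struct toy_bitmap *b, unsigned idx)
{
    unsigned bits = (unsigned) (sizeof(unsigned long) * CHAR_BIT);
    assert(b != NULL);
    if (idx >= bits)
        return b->infinite != 0;
    return (int) ((b->low >> idx) & 1UL);
}

/* Print the set. A completely full, infinite set collapses to "0xf...f". */
static int toy_bitmap_snprintf(char *buf, size_t len, const struct toy_bitmap *b)
{
    assert(buf != NULL && b != NULL);
    if (b->infinite && b->low == ~0UL)
        return snprintf(buf, len, "0xf...f");
    if (b->infinite)
        return snprintf(buf, len, "0xf...f,0x%08lx", b->low);
    return snprintf(buf, len, "0x%08lx", b->low);
}
```

With this model, a set `{ ~0UL, 1 }` prints as 0xf...f and reports every index as set, which matches the "bound to everything" reading given above for the single-node laptop.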
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
On Thursday, November 11, 2010 11:11:31 am Brice Goglin wrote:
> On 11/11/2010 02:31, Samuel Thibault wrote:
> >> get_mempolicy: Invalid argument
> >> hwloc_get_membind failed (errno 22 Invalid argument)
> >
> > Could you try to increase the value of max_os_index?
> >
> > I can see in the kernel source code the following in sys_get_mempolicy:
> >     if (nmask != NULL && maxnode < MAX_NUMNODES)
> >         return -EINVAL;
> >
> > and MAX_NUMNODES depends on .config ...
>
> And indeed MAX_NUMNODES is (1<<CONFIG_NODES_SHIFT) and
> CONFIG_NODES_SHIFT=9 on rhel6 kernels. We pass a single ulong to the
> kernel, so it's not large enough to store 1<<9 bits. We couldn't
> reproduce on Debian and RHEL5 since NODE_SHIFT=6 there.
>
> We had to loop until we found the kernel NR_CPUS for sched_getaffinity,
> we can do the same to find the kernel MAX_NUMNODES for get_mempolicy.
> The attached patch may help. Only slightly tested obviously since I
> don't have any kernel causing the problem.
>
> Brice

Hi Brice,

thanks for the quick patch. I have tested it and it works! :-)

$ utils/hwloc-bind --membind node:1 --mempolicy interleave -- utils/hwloc-bind --get --membind
0x (interleave)

I have a couple of questions:

1) Does the option --get work together with --pid? Like finding out the
mempolicy for any pid? I don't think that get_mempolicy supports this. We can
perhaps enhance the parsing to raise an error when --pid and --get are both
specified.

2) This might be a dumb question - I have tried --get on my laptop, which is
running Fedora-12. It's a one-socket system with NUMA enabled - there is
however only node#0. I know that it's nonsense. But still, you can use this
to run some tests.

I'm quite puzzled by the following output:

$utils/hwloc-bind --membind node:0 --mempolicy interleave -- utils/hwloc-bind --get --membind
0xf...f (interleave)

What does "0xf...f" mean?

3) Just a small hint. Fedora 12 is using almost the same kernel as RHEL-6.

Thanks for looking into this!!!
Cheers
Jirka
Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v1.1rc1 released
On 11/11/2010 02:31, Samuel Thibault wrote:
>> get_mempolicy: Invalid argument
>> hwloc_get_membind failed (errno 22 Invalid argument)
>
> Could you try to increase the value of max_os_index?
>
> I can see in the kernel source code the following in sys_get_mempolicy:
>
>     if (nmask != NULL && maxnode < MAX_NUMNODES)
>         return -EINVAL;
>
> and MAX_NUMNODES depends on .config ...

And indeed MAX_NUMNODES is (1<<CONFIG_NODES_SHIFT) and
CONFIG_NODES_SHIFT=9 on rhel6 kernels. We pass a single ulong to the
kernel, so it's not large enough to store 1<<9 bits. We couldn't
reproduce on Debian and RHEL5 since NODE_SHIFT=6 there.

We had to loop until we found the kernel NR_CPUS for sched_getaffinity,
we can do the same to find the kernel MAX_NUMNODES for get_mempolicy.
The attached patch may help. Only slightly tested obviously since I
don't have any kernel causing the problem.

Brice

[...] mempolicy;... */
 hwloc_linux_get_thisthread_membind(hwloc_topology_t topology, hwloc_nodeset_t nodeset, hwloc_membind_policy_t *policy, int flags __hwloc_attribute_unused)
 {
   hwloc_const_bitmap_t complete_nodeset;
-  unsigned max_os_index; /* highest os_index + 1 */
+  unsigned max_os_index;
   unsigned long *linuxmask;
   int linuxpolicy;
   int err;

-  /* compute max_os_index */
-  complete_nodeset = hwloc_topology_get_complete_nodeset(topology);
-  if (complete_nodeset) {
-    max_os_index = hwloc_bitmap_last(complete_nodeset);
-    if (max_os_index == (unsigned) -1)
-      max_os_index = 0;
-  } else {
-    max_os_index = 0;
-  }
-  /* round up to the nearest multiple of BITS_PER_LONG */
-  max_os_index = (max_os_index + HWLOC_BITS_PER_LONG) & ~(HWLOC_BITS_PER_LONG - 1);
+  max_os_index = hwloc_linux_find_kernel_max_numnodes(topology);

   linuxmask = malloc(max_os_index/HWLOC_BITS_PER_LONG * sizeof(long));
   if (!linuxmask) {
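The body of hwloc_linux_find_kernel_max_numnodes is not visible in the fragment above, so the following is only a hedged sketch of the idea described in the message: start from a mask size rounded up to a whole number of longs and grow it until the kernel stops answering EINVAL. The real get_mempolicy() is replaced here by a simulated check modelled on the sys_get_mempolicy test quoted in this thread. Doubling the size on each retry is an assumption, and `round_up_to_long`, `fake_get_mempolicy` and `find_max_numnodes` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define BITS_PER_LONG ((unsigned) (sizeof(unsigned long) * CHAR_BIT))

/* Same arithmetic as the removed lines of the patch: given the highest
 * os_index, return the number of bits rounded up to whole longs. */
static unsigned round_up_to_long(unsigned last_index)
{
    return (last_index + BITS_PER_LONG) & ~(BITS_PER_LONG - 1);
}

/* Simulated kernel check (assumption): mirrors the quoted test
 * "maxnode < MAX_NUMNODES => -EINVAL" without issuing a system call. */
static int fake_get_mempolicy(unsigned maxnode, unsigned kernel_max_numnodes)
{
    if (maxnode < kernel_max_numnodes) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

/* Grow the mask size until the (simulated) kernel accepts it.
 * Returns the accepted size in bits, or 0 if it cannot grow further. */
static unsigned find_max_numnodes(unsigned start, unsigned kernel_max_numnodes)
{
    unsigned maxnode = start;
    assert(start > 0);
    while (fake_get_mempolicy(maxnode, kernel_max_numnodes) < 0 && errno == EINVAL) {
        if (maxnode > UINT_MAX / 2)
            return 0;       /* doubling would overflow; give up */
        maxnode *= 2;       /* doubling is an illustrative choice */
    }
    return maxnode;
}
```

For the RHEL6 case discussed above (1<<9 = 512 nodes), starting from one long the loop ends at 512 bits. With a shift of 6 the search stops at 64 bits, which fits the report that a single ulong was enough on the systems where the problem did not reproduce.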