I saw similar issues in my former life, when we encountered a Linux "glitch" in the way it handled proximity for shared memory, which caused lockups under certain conditions. It turned out the problem was fixed in a later kernel version.
I'm afraid I can't remember the versions involved any more, though. Unless speed is a critical issue, I'd fall back to using TCP for now, and maybe have someone over there look at a different kernel rev later.
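Something along these lines should keep everything on TCP in the meantime. It is just the standard BTL selection syntax, either excluding "sm" or listing the transports you want explicitly (the process count below is only an example):

   mpiexec -mca btl ^sm -np 16 ./a.out
   mpiexec -mca btl tcp,self -np 16 ./a.out

   # To make it the default for every run, the same setting can go in a
   # per-user MCA parameter file, e.g. ~/.openmpi/mca-params.conf:
   #   btl = ^sm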
On May 5, 2010, at 11:30 AM, Gus Correa wrote:

> Hi Jeff, Ralph, list.
>
> Sorry for the long email, and for the delay in answering.
> I had to test MPI and reboot the machine several times
> to address the questions.
> Hopefully the answers to all your questions are inline below.
>
> Jeff Squyres wrote:
>> I'd actually be a little surprised if HT was the problem.  I run with HT
>> enabled on my nehalem boxen all the time.  It's pretty surprising that Open
>> MPI is causing a hard lockup of your system; user-level processes shouldn't
>> be able to do that.
>
> I hope I can do the same here!  :)
>
>> Notes:
>> 1. With HT enabled, as you noted, Linux will just see 2x as many cores as
>> you actually have.  Depending on your desired workload, this may or may not
>> help you.  But that shouldn't affect the correctness of running your MPI
>> application.
>
> I agree, and that is what I seek.
> Correctness first, performance later.
> I want Open MPI to work correctly, with or without hyperthreading,
> and preferably using the "sm" BTL.
> In that order: let's see what is possible, what works, and what performs better.
>
> ***
>
> Reporting the most recent experiments with v1.4.2:
> 1) hyperthreading turned ON,
> 2) then HT turned OFF, in the BIOS.
>
> In both cases I tried
> A) "-mca btl ^sm" and
> B) without it.
>
> (Just in case, I checked, and /proc/cpuinfo reports a number of cores
> consistent with the BIOS setting for HT.)
>
> Details below, but first off,
> my conclusion is that HT OFF or ON makes *NO difference*.
> The problem seems to be with the "sm" btl.
> When "sm" is on (the default), Open MPI breaks (at least on this computer).
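For what it's worth, a quick way to double-check the HT state from Linux (assuming the usual /proc/cpuinfo layout) is to compare the "siblings" and "cpu cores" fields: if "siblings" is twice "cpu cores", hyperthreading is active.

   grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u

   # With HT on, a two-socket quad-core Nehalem should show something like
   #   cpu cores : 4
   #   siblings  : 8
   # per package; with HT off, both should read 4.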
> ################################
> 1) With hyperthreading turned ON:
> ################################
>
> A) with -mca btl ^sm (i.e. "sm" OFF):
> Ran fine with 4, 8, ..., 128 processes, and fails with 256,
> due to the system limit on the number of open TCP connections,
> as reported before with 1.4.1.
>
> B) withOUT any -mca parameters (i.e. "sm" ON):
> Ran fine with 4, ..., 32, but failed with 64 processes,
> with the same segfault and syslog error messages I reported
> before for both 1.4.1 and 1.4.2.
> (See below.)
>
> Of course np=64 is oversubscribing, but this is just a "hello world"
> lightweight test.
> Moreover, in the previous experiments with both 1.4.1 and 1.4.2
> the failures happened even earlier, with np=16, which is exactly
> the number of (virtual) processors with hyperthreading turned on,
> i.e., with no oversubscription.
>
> The machine returns the prompt, but hangs right after.
>
> Could the failures be traced to some funny glitch in the
> Fedora Core 12 (2.6.32.11-99.fc12.x86_64) SMP kernel?
>
> [gus@spinoza ~]$ uname -a
> Linux spinoza.ldeo.columbia.edu 2.6.32.11-99.fc12.x86_64 #1 SMP Mon Apr 5 19:59:38 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> ********
> ERROR messages:
>
> /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:Stack:
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 63 with PID 6587 on node
> spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:Call Trace:
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01
>
> ************
>
> ################################
> 2) Now with hyperthreading OFF:
> ################################
>
> A) with -mca btl ^sm (i.e. "sm" OFF):
> Ran fine with 4, 8, ..., 128 processes, and fails with 256,
> due to the system limit on the number of open TCP connections,
> as reported before with 1.4.1.
> This is exactly the same result as with HT ON.
>
> B) withOUT any -mca parameters (i.e. "sm" ON):
> Ran fine with 4, ..., 32, but failed with 64 processes,
> with the same syslog messages, but hung before showing
> the Open MPI segfault message (see below).
> So, again, very similar behavior to HT ON.
>
> -------------------------------------------------------
> My conclusion is that HT OFF or ON makes NO difference.
> The problem seems to be with the "sm" btl.
> -------------------------------------------------------
>
> ***********
> ERROR MESSAGES
>
> [root@spinoza examples]# /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:last sysfs file: /sys/devices/system/cpu/cpu7/topology/physical_package_id
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:Stack:
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:Call Trace:
>
> ***********
>
>> 2. To confirm: yes, TCP will be quite a bit slower than sm (but again, that
>> depends on how much MPI traffic you're sending).
>
> Thank you, the clarification is really important.
> I suppose, then, that "sm" is preferred, if I can get it to work right.
>
> The main goal is to run yet another atmospheric model on this machine.
> It is a typical domain-decomposition problem,
> with a bunch of 2D arrays being exchanged
> across domain boundaries at each time step.
> This is the MPI traffic.
> There are probably some collectives too,
> but I haven't checked the code.
>
>> 3. Yes, you can disable the 2nd thread on each core via Linux, but you need
>> root-level access to do it.
>
> I have root-level access.
> However, so far I have only learned the BIOS way, which requires a reboot.
>
> Doing it in Linux would be more convenient, I suppose,
> since it avoids reboots.
> How do I do it in Linux?
> Should I overwrite something in /proc?
> Or something else?
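About disabling the second thread per core from Linux: it is /sys rather than /proc. Each logical CPU has an "online" flag that root can flip without a reboot. A minimal sketch (the CPU numbers are only an example; check thread_siblings_list on your box to see which pairs share a core):

   # Which logical CPUs share a physical core with cpu0?
   cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,8"
   # Take the HT sibling offline (as root); echo 1 brings it back later:
   echo 0 > /sys/devices/system/cpu/cpu8/online
   # Repeat for the second sibling of each core.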
>> Some questions:
>> - is the /tmp directory on your local disk?
>
> Yes.
> And there is plenty of room in the / filesystem and the /tmp directory:
>
> [root@spinoza ~]# ll -d /tmp
> drwxrwxrwt 22 root root 4096 2010-05-05 12:36 /tmp
>
> [root@spinoza ~]# df -h
> Filesystem                      Size  Used Avail Use% Mounted on
> /dev/mapper/vg_spinoza-lv_root  1.8T  504G  1.2T  30% /
> tmpfs                            24G     0   24G   0% /dev/shm
> /dev/sda1                       194M   40M  144M  22% /boot
>
> FYI, this is a standalone workstation.
> MPI is not being used over any network, private or local.
> It is all "inside the box".
>
>> - are there any revealing messages in /var/log/messages (or equivalent)
>> about failures when the machine hangs?
>
> Parsing kernel messages is not my favorite hobby, nor my league.
> In any case, as far as my search could go, there are just standard
> kernel messages in /var/log/messages (e.g. ntpd synchronization, etc.)
> until the system hangs when the hello_c program fails.
> Then the log starts again with the boot process.
> This behavior was repeated time and again over my several
> attempts to run Open MPI programs with the "sm" btl on.
>
> ***
>
> However, I am suspicious of these kernel messages during boot.
> Are they telling me of a memory misconfiguration, perhaps?
> What do the "*BAD*gran_size: ..." lines mean?
>
> Does anybody out there with a sane, functional Nehalem system
> get these funny "*BAD*gran_size: ..." lines
> in "dmesg | more" output, or in /var/log/messages during boot?
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> total RAM covered: 49144M
> gran_size: 64K   chunk_size: 64K   num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 128K  num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 256K  num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 512K  num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 1M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 2M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 4M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 8M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 16M   num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 32M   num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 64M   num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 128M  num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 256M  num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 512M  num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 1G    num_reg: 8  lose cover RAM: 0G
> *BAD*gran_size: 64K  chunk_size: 2G  num_reg: 8  lose cover RAM: -1G
> gran_size: 128K  chunk_size: 128K  num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 256K  num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 512K  num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 1M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 2M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 4M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 8M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 16M   num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 32M   num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 64M   num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 128M  num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 256M  num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 512M  num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 1G    num_reg: 8  lose cover RAM: 0G
> *BAD*gran_size: 128K  chunk_size: 2G  num_reg: 8  lose cover RAM: -1G
>
> ... and it goes on and on ... then stops with
>
> *BAD*gran_size: 512M  chunk_size: 2G  num_reg: 8  lose cover RAM: -520M
> gran_size: 1G  chunk_size: 1G  num_reg: 6  lose cover RAM: 1016M
> gran_size: 1G  chunk_size: 2G  num_reg: 7  lose cover RAM: 1016M
> gran_size: 2G  chunk_size: 2G  num_reg: 5  lose cover RAM: 2040M
> mtrr_cleanup: can not find optimal value
> please specify mtrr_gran_size/mtrr_chunk_size
>
> ...
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> I know about the finicky memory configuration details
> required by Nehalem, but I didn't put this system together,
> nor have I opened the box yet to see what is inside.
>
> Kernel experts and Nehalem pros:
>
> If something sounds suspicious, please tell me, and I will
> check whether the memory modules are the right ones and correctly
> distributed on the slots.
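About the MTRR lines: they come from the kernel's MTRR cleanup pass and, as far as I know, are usually harmless by themselves. If you want to quiet the "please specify mtrr_gran_size/mtrr_chunk_size" complaint, the values it asks for can be appended to the kernel line in /boot/grub/grub.conf. A sketch only; pick a combination that your dmesg table reports as losing 0G of coverage (e.g. gran_size 64K, chunk_size 16M), or skip the cleanup entirely:

   # Appended to the existing kernel line (keep your current root= and options):
   kernel /vmlinuz-2.6.32.11-99.fc12.x86_64 ro root=... mtrr_gran_size=64K mtrr_chunk_size=16M
   # or, to turn the cleanup pass off:
   kernel /vmlinuz-2.6.32.11-99.fc12.x86_64 ro root=... disable_mtrr_cleanup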
> **
>
> Thank you very much,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>> On May 4, 2010, at 8:35 PM, Gus Correa wrote:
>>> Hi Douglas,
>>>
>>> Yes, very helpful indeed!
>>>
>>> The machine here is a two-way quad-core, and /proc/cpuinfo shows 16
>>> processors, twice as many as the physical cores,
>>> just like you see on yours.
>>> So, HT is turned on for sure.
>>>
>>> The security guard opened the office door for me,
>>> and I could reboot that machine.
>>> It's called Spinoza.  Maybe that's why it is locked.
>>> Now the door is locked again, so I will have to wait until tomorrow
>>> to play around with the BIOS settings.
>>>
>>> I will remember the BIOS double negative that you pointed out:
>>> "When Disabled only one thread per core is enabled"
>>> Ain't that English funny?
>>> So far, I can't get no satisfaction.
>>> Hence, let's see if Ralph's suggestion works.
>>> Never get no hyperthreading turned on,
>>> and you ain't have no problems with Open MPI.  :)
>>>
>>> Many thanks!
>>> Have a great Halifax springtime!
>>>
>>> Cheers,
>>> Gus
>>>
>>> Douglas Guptill wrote:
>>>> On Tue, May 04, 2010 at 05:34:40PM -0600, Ralph Castain wrote:
>>>>> On May 4, 2010, at 4:51 PM, Gus Correa wrote:
>>>>>
>>>>>> Hi Ralph
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> One possibility is that the sm btl might not like that you have
>>>>>>> hyperthreading enabled.
>>>>>> I remember that hyperthreading was discussed months ago,
>>>>>> in the previous incarnation of this problem/thread/discussion on
>>>>>> "Nehalem vs. Open MPI".
>>>>>> (It sounds like one of those Supreme Court cases ...)
>>>>>>
>>>>>> I don't really administer that machine,
>>>>>> or any machine with hyperthreading,
>>>>>> so I am not very familiar with the HT nitty-gritty.
>>>>>> How do I turn off hyperthreading?
>>>>>> Is it a BIOS or a Linux thing?
>>>>>> I may try that.
>>>>> I believe it can be turned off via an admin-level cmd, but I'm not
>>>>> certain about it.
>>>> The challenge was too great to resist, so I yielded and rebooted my
>>>> Nehalem (Core i7 920 @ 2.67 GHz) to confirm my thoughts on the issue.
>>>>
>>>> Entering the BIOS setup by pressing "DEL", and "right-arrowing" over
>>>> to "Advanced", then "down arrow" to "CPU configuration", I found a
>>>> setting called "Intel (R) HT Technology".  The help dialogue says
>>>> "When Disabled only one thread per core is enabled".
>>>>
>>>> Mine is "Enabled", and I see 8 cpus.  The Core i7, to my
>>>> understanding, is a 4-core chip.
>>>>
>>>> Hope that helps,
>>>> Douglas.
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users