I saw similar issues in my former life, when we encountered a Linux "glitch" in the way it handled proximity for shared memory, which caused lockups under certain conditions. It turned out the problem was fixed in a later kernel version.
I'm afraid I can't remember the versions involved any more, though. Unless speed is a critical issue, I'd fall back to using TCP for now, and maybe have someone over there look at a different kernel rev later.
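Something along these lines should keep everything on TCP in the meantime. It is just the standard BTL selection syntax, either excluding "sm" or listing the transports you want explicitly (the process count below is only an example):

   mpiexec -mca btl ^sm -np 16 ./a.out
   mpiexec -mca btl tcp,self -np 16 ./a.out

   # To make it the default for every run, the same setting can go in a
   # per-user MCA parameter file, e.g. ~/.openmpi/mca-params.conf:
   #   btl = ^sm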
On May 5, 2010, at 11:30 AM, Gus Correa wrote:

> Hi Jeff, Ralph, list.
>
> Sorry for the long email, and for the delay in answering.
> I had to test MPI and reboot the machine several times
> to address the questions.
> Hopefully the answers to all your questions are inline below.
>
> Jeff Squyres wrote:
>> I'd actually be a little surprised if HT was the problem.  I run with HT
>> enabled on my nehalem boxen all the time.  It's pretty surprising that Open
>> MPI is causing a hard lockup of your system; user-level processes shouldn't
>> be able to do that.
>
> I hope I can do the same here!  :)
>
>> Notes:
>> 1. With HT enabled, as you noted, Linux will just see 2x as many cores as
>> you actually have.  Depending on your desired workload, this may or may not
>> help you.  But that shouldn't affect the correctness of running your MPI
>> application.
>
> I agree, and that is what I seek.
> Correctness first, performance later.
> I want Open MPI to work correctly, with or without hyperthreading,
> and preferably using the "sm" BTL.
> In that order: let's see what is possible, what works, and what performs better.
>
> ***
>
> Reporting the most recent experiments with v1.4.2:
> 1) hyperthreading turned ON,
> 2) then HT turned OFF, in the BIOS.
>
> In both cases I tried
> A) "-mca btl ^sm" and
> B) without it.
>
> (Just in case, I checked, and /proc/cpuinfo reports a number of cores
> consistent with the BIOS setting for HT.)
>
> Details below, but first off,
> my conclusion is that HT OFF or ON makes *NO difference*.
> The problem seems to be with the "sm" btl.
> When "sm" is on (the default), Open MPI breaks (at least on this computer).
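For what it's worth, a quick way to double-check the HT state from Linux (assuming the usual /proc/cpuinfo layout) is to compare the "siblings" and "cpu cores" fields: if "siblings" is twice "cpu cores", hyperthreading is active.

   grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u

   # With HT on, a two-socket quad-core Nehalem should show something like
   #   cpu cores : 4
   #   siblings  : 8
   # per package; with HT off, both should read 4.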
> ################################
> 1) With hyperthreading turned ON:
> ################################
>
> A) with -mca btl ^sm (i.e. "sm" OFF):
> Ran fine with 4, 8, ..., 128 processes, and fails with 256,
> due to the system limit on the number of open TCP connections,
> as reported before with 1.4.1.
>
> B) withOUT any -mca parameters (i.e. "sm" ON):
> Ran fine with 4, ..., 32, but failed with 64 processes,
> with the same segfault and syslog error messages I reported
> before for both 1.4.1 and 1.4.2.
> (See below.)
>
> Of course np=64 is oversubscribing, but this is just a "hello world"
> lightweight test.
> Moreover, in the previous experiments with both 1.4.1 and 1.4.2
> the failures happened even earlier, with np=16, which is exactly
> the number of (virtual) processors with hyperthreading turned on,
> i.e., with no oversubscription.
>
> The machine returns the prompt, but hangs right after.
>
> Could the failures be traced to some funny glitch in the
> Fedora Core 12 (2.6.32.11-99.fc12.x86_64) SMP kernel?
>
> [gus@spinoza ~]$ uname -a
> Linux spinoza.ldeo.columbia.edu 2.6.32.11-99.fc12.x86_64 #1 SMP Mon Apr 5 19:59:38 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> ********
> ERROR messages:
>
> /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:Stack:
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 63 with PID 6587 on node
> spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:Call Trace:
>
> Message from syslogd@spinoza at May 4 22:28:15 ...
> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01
>
> ************
>
> ################################
> 2) Now with hyperthreading OFF:
> ################################
>
> A) with -mca btl ^sm (i.e. "sm" OFF):
> Ran fine with 4, 8, ..., 128 processes, and fails with 256,
> due to the system limit on the number of open TCP connections,
> as reported before with 1.4.1.
> This is exactly the same result as with HT ON.
>
> B) withOUT any -mca parameters (i.e. "sm" ON):
> Ran fine with 4, ..., 32, but failed with 64 processes,
> with the same syslog messages, but hung before showing
> the Open MPI segfault message (see below).
> So, again, very similar behavior to HT ON.
>
> -------------------------------------------------------
> My conclusion is that HT OFF or ON makes NO difference.
> The problem seems to be with the "sm" btl.
> -------------------------------------------------------
>
> ***********
> ERROR MESSAGES
>
> [root@spinoza examples]# /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:last sysfs file: /sys/devices/system/cpu/cpu7/topology/physical_package_id
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:Stack:
>
> Message from syslogd@spinoza at May 5 12:04:05 ...
> kernel:Call Trace:
>
> ***********
>
>> 2. To confirm: yes, TCP will be quite a bit slower than sm (but again, that
>> depends on how much MPI traffic you're sending).
>
> Thank you, the clarification is really important.
> I suppose, then, that "sm" is preferred, if I can get it to work right.
>
> The main goal is to run yet another atmospheric model on this machine.
> It is a typical domain-decomposition problem,
> with a bunch of 2D arrays being exchanged
> across domain boundaries at each time step.
> This is the MPI traffic.
> There are probably some collectives too,
> but I haven't checked the code.
>
>> 3. Yes, you can disable the 2nd thread on each core via Linux, but you need
>> root-level access to do it.
>
> I have root-level access.
> However, so far I have only learned the BIOS way, which requires a reboot.
>
> Doing it in Linux would be more convenient, I suppose,
> since it avoids reboots.
> How do I do it in Linux?
> Should I overwrite something in /proc?
> Or something else?
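About disabling the second thread per core from Linux: it is /sys rather than /proc. Each logical CPU has an "online" flag that root can flip without a reboot. A minimal sketch (the CPU numbers are only an example; check thread_siblings_list on your box to see which pairs share a core):

   # Which logical CPUs share a physical core with cpu0?
   cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,8"
   # Take the HT sibling offline (as root); echo 1 brings it back later:
   echo 0 > /sys/devices/system/cpu/cpu8/online
   # Repeat for the second sibling of each core.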
>> Some questions:
>> - is the /tmp directory on your local disk?
>
> Yes.
> And there is plenty of room in the / filesystem and the /tmp directory:
>
> [root@spinoza ~]# ll -d /tmp
> drwxrwxrwt 22 root root 4096 2010-05-05 12:36 /tmp
>
> [root@spinoza ~]# df -h
> Filesystem                      Size  Used Avail Use% Mounted on
> /dev/mapper/vg_spinoza-lv_root  1.8T  504G  1.2T  30% /
> tmpfs                            24G     0   24G   0% /dev/shm
> /dev/sda1                       194M   40M  144M  22% /boot
>
> FYI, this is a standalone workstation.
> MPI is not being used over any network, private or local.
> It is all "inside the box".
>
>> - are there any revealing messages in /var/log/messages (or equivalent)
>> about failures when the machine hangs?
>
> Parsing kernel messages is not my favorite hobby, nor my league.
> In any case, as far as my search could go, there are just standard
> kernel messages in /var/log/messages (e.g. ntpd synchronization, etc.)
> until the system hangs when the hello_c program fails.
> Then the log starts again with the boot process.
> This behavior was repeated time and again over my several
> attempts to run Open MPI programs with the "sm" btl on.
>
> ***
>
> However, I am suspicious of these kernel messages during boot.
> Are they telling me of a memory misconfiguration, perhaps?
> What do the "*BAD*gran_size: ..." lines mean?
>
> Does anybody out there with a sane, functional Nehalem system
> get these funny "*BAD*gran_size: ..." lines
> in "dmesg | more" output, or in /var/log/messages during boot?
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> total RAM covered: 49144M
> gran_size: 64K   chunk_size: 64K   num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 128K  num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 256K  num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 512K  num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 1M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 2M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 4M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 8M    num_reg: 8  lose cover RAM: 45G
> gran_size: 64K   chunk_size: 16M   num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 32M   num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 64M   num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 128M  num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 256M  num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 512M  num_reg: 8  lose cover RAM: 0G
> gran_size: 64K   chunk_size: 1G    num_reg: 8  lose cover RAM: 0G
> *BAD*gran_size: 64K  chunk_size: 2G  num_reg: 8  lose cover RAM: -1G
> gran_size: 128K  chunk_size: 128K  num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 256K  num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 512K  num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 1M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 2M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 4M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 8M    num_reg: 8  lose cover RAM: 45G
> gran_size: 128K  chunk_size: 16M   num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 32M   num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 64M   num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 128M  num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 256M  num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 512M  num_reg: 8  lose cover RAM: 0G
> gran_size: 128K  chunk_size: 1G    num_reg: 8  lose cover RAM: 0G
> *BAD*gran_size: 128K  chunk_size: 2G  num_reg: 8  lose cover RAM: -1G
>
> ... and it goes on and on ... then stops with
>
> *BAD*gran_size: 512M  chunk_size: 2G  num_reg: 8  lose cover RAM: -520M
> gran_size: 1G  chunk_size: 1G  num_reg: 6  lose cover RAM: 1016M
> gran_size: 1G  chunk_size: 2G  num_reg: 7  lose cover RAM: 1016M
> gran_size: 2G  chunk_size: 2G  num_reg: 5  lose cover RAM: 2040M
> mtrr_cleanup: can not find optimal value
> please specify mtrr_gran_size/mtrr_chunk_size
>
> ...
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> I know about the finicky memory configuration details
> required by Nehalem, but I didn't put this system together,
> nor have I opened the box yet to see what is inside.
>
> Kernel experts and Nehalem pros:
>
> If something sounds suspicious, please tell me, and I will
> check whether the memory modules are the right ones and correctly
> distributed on the slots.
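About the MTRR lines: they come from the kernel's MTRR cleanup pass and, as far as I know, are usually harmless by themselves. If you want to quiet the "please specify mtrr_gran_size/mtrr_chunk_size" complaint, the values it asks for can be appended to the kernel line in /boot/grub/grub.conf. A sketch only; pick a combination that your dmesg table reports as losing 0G of coverage (e.g. gran_size 64K, chunk_size 16M), or skip the cleanup entirely:

   # Appended to the existing kernel line (keep your current root= and options):
   kernel /vmlinuz-2.6.32.11-99.fc12.x86_64 ro root=... mtrr_gran_size=64K mtrr_chunk_size=16M
   # or, to turn the cleanup pass off:
   kernel /vmlinuz-2.6.32.11-99.fc12.x86_64 ro root=... disable_mtrr_cleanup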
> **
>
> Thank you very much,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>> On May 4, 2010, at 8:35 PM, Gus Correa wrote:
>>> Hi Douglas,
>>>
>>> Yes, very helpful indeed!
>>>
>>> The machine here is a two-way quad-core, and /proc/cpuinfo shows 16
>>> processors, twice as many as the physical cores,
>>> just like you see on yours.
>>> So, HT is turned on for sure.
>>>
>>> The security guard opened the office door for me,
>>> and I could reboot that machine.
>>> It's called Spinoza.  Maybe that's why it is locked.
>>> Now the door is locked again, so I will have to wait until tomorrow
>>> to play around with the BIOS settings.
>>>
>>> I will remember the BIOS double negative that you pointed out:
>>> "When Disabled only one thread per core is enabled"
>>> Ain't that English funny?
>>> So far, I can't get no satisfaction.
>>> Hence, let's see if Ralph's suggestion works.
>>> Never get no hyperthreading turned on,
>>> and you ain't have no problems with Open MPI.  :)
>>>
>>> Many thanks!
>>> Have a great Halifax springtime!
>>>
>>> Cheers,
>>> Gus
>>>
>>> Douglas Guptill wrote:
>>>> On Tue, May 04, 2010 at 05:34:40PM -0600, Ralph Castain wrote:
>>>>> On May 4, 2010, at 4:51 PM, Gus Correa wrote:
>>>>>
>>>>>> Hi Ralph
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> One possibility is that the sm btl might not like that you have
>>>>>>> hyperthreading enabled.
>>>>>> I remember that hyperthreading was discussed months ago,
>>>>>> in the previous incarnation of this problem/thread/discussion on
>>>>>> "Nehalem vs. Open MPI".
>>>>>> (It sounds like one of those Supreme Court cases ...)
>>>>>>
>>>>>> I don't really administer that machine,
>>>>>> or any machine with hyperthreading,
>>>>>> so I am not very familiar with the HT nitty-gritty.
>>>>>> How do I turn off hyperthreading?
>>>>>> Is it a BIOS or a Linux thing?
>>>>>> I may try that.
>>>>> I believe it can be turned off via an admin-level cmd, but I'm not
>>>>> certain about it.
>>>> The challenge was too great to resist, so I yielded and rebooted my
>>>> Nehalem (Core i7 920 @ 2.67 GHz) to confirm my thoughts on the issue.
>>>>
>>>> Entering the BIOS setup by pressing "DEL", and "right-arrowing" over
>>>> to "Advanced", then "down arrow" to "CPU configuration", I found a
>>>> setting called "Intel (R) HT Technology".  The help dialogue says
>>>> "When Disabled only one thread per core is enabled".
>>>>
>>>> Mine is "Enabled", and I see 8 cpus.  The Core i7, to my
>>>> understanding, is a 4-core chip.
>>>>
>>>> Hope that helps,
>>>> Douglas.
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users