see below for answers,

regards,
Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax: +44 (0)114 257 0022
Email: a...@ocf.co.uk
Web: http://www.ocf.co.uk
Skype: arif_ali80
MSN: a...@ocf.co.uk

Jeff Squyres wrote:
Beware: this is a lengthy, detailed message.

On Jan 18, 2007, at 3:53 PM, Arif Ali wrote:

1. We have

HW
* 2x BladeCenter H
* 2x Cisco InfiniBand Switch Modules
* 1x Cisco InfiniBand Switch
* 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA

Can you provide the details of your Cisco HCA?
*PRODUCT TYPE*: Cisco 4x InfiniBand Host Channel Adapter Expansion Card
*DEVICE TYPE*: Network adapter
*PORTS*: 2 InfiniBand ports
*DATA TRANSFER RATE*: 10 Gbps
*COMPAT*: IBM BladeCenter

• The Cisco 4x InfiniBand Host Channel Adapter Expansion Card for IBM BladeCenter provides InfiniBand I/O capability to processor blades in an IBM BladeCenter unit
• The host channel adapter adds 2 InfiniBand ports to the CPU blade cards to create an IB-capable high-density cluster
• PCI-Express interface to dual 4x InfiniBand bridge
• Line rate of the interfaces is 10 Gbps per link, theoretical maximum
• 128 MB table memory (133 MHz DDR SDRAM)
• I2C serial EEPROM holding system Vital Product Data (VPD)
• IBM proprietary blade daughter card form factor
• Forced air cooling compatible for highly reliable operation

The lspci -vvv for the card gives me the following information:

0c:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0)
        Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size 20
        Interrupt: pin A routed to IRQ 36
        Region 0: Memory at 100b8900000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at 100b8000000 (64-bit, prefetchable) [size=8M]
        Region 4: Memory at 100b0000000 (64-bit, prefetchable) [size=128M]
        Expansion ROM at 100b8800000 [disabled] [size=1M]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
        Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
                Vector table: BAR=0 offset=00082000
                PBA: BAR=0 offset=00082200
        Capabilities: [60] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s <64ns, L1 unlimited
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
                Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
                Link: Latency L0s unlimited, L1 unlimited
                Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
                Link: Speed 2.5Gb/s, Width x8
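(As an aside: if the OFED userspace utilities are installed, ibv_devinfo from libibverbs is another way to confirm what the HCA reports about itself, including the firmware version that comes up below. This is only a suggestion; the exact output fields depend on the OFED release.)

  # Query the HCA directly; "fw_ver" is the firmware the adapter is running
  ibv_devinfo -v | egrep 'hca_id|fw_ver|port:|state'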
The firmware for the blade is the latest, as the IB cards would not be recognised otherwise. It was released on 06/09/2006 and is the only/latest one on the IBM web page.

SW
* SLES 10
* OFED 1.1 w. OpenMPI 1.1.1

I am running the Intel MPI Benchmark (IMB) on the cluster as part of the validation process for the customer.

I have tried the OpenMPI that comes with OFED 1.1, which gave spurious "Not Enough Memory" error messages. After looking through the FAQs (with the help of Cisco) I was able to find the problems and fixes. I used the FAQs to add unlimited soft and hard limits for memlock, and turned RDMA off by using "--mca btl_openib_flags 1". This still did not work, and I still got the memory problems.

As a clarification: I suggested setting btl_openib_flags to 1 as one means of [potentially] reducing the amount of registered memory, to verify that the amount of registered memory available in the system is the problem (especially because it was dying with large messages in the all-to-all pattern). With that setting, we got through the alltoall test (which we previously couldn't). So it seemed to indicate that on that platform there isn't much registered memory available (even though there's 8GB available on each blade).

Are you saying that a full run of the IMB still failed with the same "cannot register any more memory" kind of error?

I checked with Brad Benton -- an OMPI developer from IBM -- and he confirms that on the JS21s, depending on the version of your firmware, you will be limited to 256M or 512M of registerable memory (256M = older firmware, 512M = newer firmware). This could very definitely be a factor in what is happening here.

Can you let us know what version of the firmware you have?
*Version 2.00, 01MB245_300_002*
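(For readers following the memlock discussion above, here is a sketch of what that workaround typically looks like, based on the standard Open MPI FAQ advice. The limits.conf entries assume the MPI processes are launched under a login that actually picks those limits up; the host file, process count, and binary name below are placeholders.)

  # /etc/security/limits.conf on every node: allow unlimited locked (registerable) memory
  *  soft  memlock  unlimited
  *  hard  memlock  unlimited

  # verify from a shell on a compute node
  ulimit -l

  # run IMB with RDMA disabled (send/receive only)
  mpirun -np 64 --hostfile hosts --mca btl_openib_flags 1 ./IMB-MPI1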
I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed miserably.

Can you describe what happened there? Is it failing in a different way?
Here's the output

#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date    : Fri Jan 19 17:33:52 2007
# Machine : ppc64
# System  : Linux
# Release : 2.6.16.21-0.8-ppc64
# Version : #1 SMP Mon Jul 3 18:25:39 UTC 2006
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 58 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.76         0.00
            1         1000         1.88         0.51
            2         1000         1.89         1.01
            4         1000         1.91         2.00
            8         1000         1.88         4.05
           16         1000         2.02         7.55
           32         1000         2.05        14.88
[0,1,4][btl_openib_component.c:1153:btl_openib_component_progress] from node03 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 268969528 opcode 128
[0,1,28][btl_openib_component.c:1153:btl_openib_component_progress] from node09 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 268906808 opcode 128
[0,1,58][btl_openib_component.c:1153:btl_openib_component_progress] from node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 268919352 opcode 256614836
[0,1,0][btl_openib_component.c:1153:btl_openib_component_progress] from node02 to: node03 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 276070200 opcode 0
[0,1,59][btl_openib_component.c:1153:btl_openib_component_progress] from node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 268919352 opcode 256614836
mpirun noticed that job rank 0 with PID 0 on node node02 exited on signal 15 (Terminated).
55 additional processes aborted (not shown)
I then tried the released version of OpenMPI-1.2b3, which got me further than before. Now the benchmark goes through all the tests until Allgatherv finishes, and then it seems to be waiting to start Alltoall; I have waited about 12 hours to see if it continues. I have since then managed to run Alltoall, and the rest of the benchmark, separately.

If it does not continue within a few minutes, it's not going to go anywhere. IMB does do "warmup" sends that may take a few minutes, but if you've gone 5-10 minutes with no activity, it's likely to be hung.

FWIW: I can run IMB on 64 processes (16 hosts, 4ppn -- but not a blade center) with no problem. I.e., it doesn't hang/crash. Hanging instead of crashing may still be a side-effect of running out of DMA-able memory -- I don't know enough about the IBM hardware to say. I doubt that we have explored the error scenarios in OMPI too much; it's pretty safe to say that if limits are not used and the system runs out of DMA-able memory, Bad / Undefined things may happen (a "good" scenario would be that the process/MPI job aborts; a "bad" scenario would be some kind of deadlock situation).

I have tried a few tunable parameters that were suggested by Cisco, which improved the results, but the run still hung. The parameters that I used to try and diagnose the problem are below. I used the debug/verbose variables to see if I could get error messages while running the benchmark.

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1

Note that in that list, only the btl_openib_flags parameter will [potentially] decrease the amount of registered memory used. Also, note that mpi_leave_pinned is only useful when utilizing RDMA operations, so it's effectively a no-op when btl_openib_flags is set to 1.

--> For those jumping into the conversation late, the value of btl_openib_flags is a bit mask with the following bits: SEND=1, PUT=2, GET=4.

With all that was said above, let me provide a few options for decreasing the amount of registered memory that OMPI uses, and also describe a way to put a strict limit on how much registered memory OMPI will use.

I'll create some FAQ entries about these exact topics in the Near Future that will go into more detail, but it might take a few days because FAQ wording is tricky; the algorithms that OMPI uses and the tunable parameters that it exports are quite complicated -- I'll want to be sure it's precisely correct for those who land there via Google. Here's the quick version (Galen/Gleb/Pasha: please correct me if I get these details incorrect -- thanks!):

- All internal-to-OMPI registered buffers -- whether they are used for sending or receiving -- are cached on freelists. So if OMPI registers an internal buffer, sends from it, and then is done with it, the buffer is not de-registered -- it is put back on the free list for use in the future.

- OMPI makes IB connections to peer MPI processes lazily. That is, the first time you MPI_SEND or MPI_RECV to a peer, OMPI makes the connection.

- OMPI creates an initial set of pre-posted buffers when each IB port is initialized.
The amount registered for each IB endpoint (i.e., ports and LIDs) in use on the host by the MPI process upon MPI_INIT is:

  2 * btl_openib_free_list_inc * (btl_openib_max_send_size + btl_openib_eager_limit)

=> NOTE: There are some pretty pictures of the exact meanings of the max send size and eager limit and how they are used in this paper: http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/

The "2" is because there are actually 2 free lists -- one for sending buffers and one for receiving buffers. Default values for these three MCA parameters are 32 (free_list_inc), 64k (max_send_size), and 12k (eager_limit), respectively. So each MPI process will pre-register about 4.75MB of memory per endpoint in use on the host. Since these are all MCA parameters, they are all adjustable at run-time.

- OMPI then pre-registers and pre-posts receive buffers when each lazy IB connection is made. The buffers are drawn from the freelists mentioned above, so the first few connections may not actually register any *new* memory. The freelists register more memory and dole it out as necessary when requests are made that cannot be satisfied by what is already on the freelist.

- The number of pre-posted receive buffers is controlled via the btl_openib_rd_num and btl_openib_rd_win MCA parameters. OMPI pre-posts btl_openib_rd_num plus a few more (for control messages) -- resulting in 11 buffers by default per queue pair (OMPI uses 2 QPs: one high priority for eager fragments and one low priority for send fragments) per endpoint. So there is

  11 * (12k + 64k) = 836k

of buffer space pre-posted for each IB connection endpoint.

=> What I'm guessing is happening in your network is that IMB is hitting some communication-intensive portions and network traffic either backs up, starts getting congested, or otherwise becomes "slow", meaning that OMPI is queueing up traffic faster than the network can process it. Hence, OMPI keeps registering more and more memory because there's no more memory available on the freelist to recycle.

- The sending buffering behavior is regulated by the btl_openib_free_list_max MCA parameter, which defaults to -1 (meaning that the free list can grow to infinite size). You can set a cap on this, telling OMPI how many entries it is allowed to have on the freelist, but that does not correlate directly with how much memory will actually be registered at any one time when btl_openib_flags > 1 (because OMPI will also be registering and caching user buffers). Also keep in mind that this MCA parameter governs the size of both the sending and receiving buffer freelists.

That being said, if you use btl_openib_flags=1, you can use btl_openib_free_list_max as a direct method (because OMPI will *not* be registering and caching user buffers), but you need to choose a value that will be acceptable for both the send and receive freelists.

What should happen if OMPI hits the btl_openib_free_list_max limit is that the upper layer (called the "PML") will internally buffer messages until more IB registered buffers become available. It's not entirely accurate, but you can think of it as effectively multiple levels of queueing going on here: MPI requests, PML buffers, IB registered buffers, network. Fun stuff! :-)

- A future OMPI feature is an MCA parameter called mpool_rdma_rcache_size_limit. It defaults to an "unlimited" value, which means that OMPI will try to register memory forever. But if you set it to a nonzero positive value (in bytes), OMPI will limit itself to that much registered memory for each MPI process.
This MCA parameter unfortunately didn't make it into the 1.2 release, but will be included in some future release. The code is currently on the OMPI trunk (and nightly snapshots), but not available in the 1.2 branch (and its nightly snapshots/releases).

=====

With all those explanations, here are some recommendations for you (see the example settings at the end of this message):

- Try simply setting the eager limit and max send size to smaller values, perhaps 4k for the eager limit and 12k for the max send size. This will decrease the amount of registered memory that OMPI uses for each connection.

- Try setting btl_openib_free_list_max, perhaps in conjunction with btl_openib_flags=1, to control indirectly (or, with btl_openib_flags=1, exactly) how much registered memory is used per endpoint.

- If you want to explore the OMPI trunk (with all the normal disclaimers about development code), try setting mpool_rdma_rcache_size_limit to a fixed value.

Keep in mind that the intermixing of all of these values is quite complicated. It's a very, very thin line to walk to balance resource constraints and application performance. Tweaking one parameter may give you good resource limits but hose your overall performance. Another dimension here is that different applications will likely use different communication patterns, so different sets of values may be suitable for different applications. It's a complicated parameter space problem. :-\

2. On another side note, I am having similar problems on another customer's cluster, where the benchmark hangs, but at a different place each time.

HW specs
* 12x IBM 3455 machines, each with 2 dual-core CPUs and InfiniPath/PathScale HCAs
* 1x Voltaire switch

SW
* master: RHEL 4 AS U3
* compute: RHEL 4 WS U3
* OFED 1.1.1 w. OpenMPI-1.1.2

For InfiniPath HCAs, you should probably be using the psm MTL instead of the openib BTL.

The short version of the difference between the two is that MTL plugins are designed for networks that export MPI-like interfaces (e.g., Portals, tports, MX, InfiniPath). BTL plugins are more geared towards networks that export RDMA interfaces. You can force using the psm MTL with (see the fuller sketch below):

  mpirun --mca pml cm ...

This tells OMPI to use the "cm" PML plugin (the PML is the back end to MPI point-to-point), which, if you've built the "psm" MTL plugin (psm is the InfiniPath library glue), will use the InfiniPath native back-end library, which will do nice things. Beyond that, someone else will have to answer -- I have no experience with the psm MTL...

Hope this helps!
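(To make the tuning recommendations above concrete, here is one way they could be written down. The fragment sizes simply restate the numbers suggested above -- 4k eager limit, 12k max send size -- while the free-list cap of 4096 entries is purely illustrative and not validated on this system. Per-user MCA defaults can go in $HOME/.openmpi/mca-params.conf, or everything can be passed on the mpirun command line.)

  # $HOME/.openmpi/mca-params.conf (per-user MCA defaults)
  # send/receive only -- no RDMA, so OMPI does not register/cache user buffers
  btl_openib_flags = 1
  # shrink the per-connection fragments: 4k eager, 12k max send (defaults: 12k / 64k)
  btl_openib_eager_limit = 4096
  btl_openib_max_send_size = 12288
  # cap free-list growth; 4096 entries is an illustrative value (default -1 = unlimited)
  btl_openib_free_list_max = 4096

Or, equivalently, on the command line (process count, hostfile, and binary are placeholders):

  mpirun -np 64 --hostfile hosts \
      --mca btl_openib_flags 1 \
      --mca btl_openib_eager_limit 4096 \
      --mca btl_openib_max_send_size 12288 \
      --mca btl_openib_free_list_max 4096 \
      ./IMB-MPI1

With these fragment sizes, the per-endpoint pre-registration from the formula above drops from roughly 2 * 32 * (64k + 12k) = 4.75MB to 2 * 32 * (12k + 4k) = 1MB.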
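(And for the InfiniPath cluster, a sketch of forcing the cm PML / psm MTL as described above, assuming the psm MTL was actually built -- ompi_info should list it. Process count, hostfile, and binary are again placeholders.)

  # confirm the cm PML and psm MTL components are present
  ompi_info | egrep ' pml| mtl'

  # force the cm PML, which will use the psm MTL on InfiniPath hardware
  mpirun -np 48 --hostfile hosts --mca pml cm ./IMB-MPI1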