Re: [OMPI users] problem when mpi_paffinity_alone is set to 1

Camille Coti Fri, 22 Aug 2008 10:04:37 -0400

Actually, I have tried several versions, since you were working on the affinity code. I tried revision 19103 a couple of weeks ago, and the problem was already there.


Part of /proc/cpuinfo is below:
processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.000000
itc MHz    : 900.000000
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so this information appears 60 times in /proc/cpuinfo (yes, 60, not 64).
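
For readers who want to check a listing like the one above for gaps, here is a minimal sketch in C. It counts the "processor : N" entries in cpuinfo-style text and reports whether the ids run 0, 1, 2, ... without a gap. The function name count_processors is hypothetical, and whether a disabled processor shows up as a gap in the numbering depends on the kernel, so treat the result as a hint only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Counts lines of the form "processor : N" in cpuinfo-style text.
 * If contiguous is not NULL, *contiguous is set to 1 when the ids
 * appear as 0, 1, 2, ... in order, and to 0 otherwise.
 */
size_t count_processors(const char *text, int *contiguous)
{
    size_t count = 0;
    int ok = 1;
    const char *line = text;

    assert(text != NULL);
    while (line != NULL && *line != '\0') {
        if (strncmp(line, "processor", 9) == 0) {
            const char *p = line + 9;
            /* Skip the padding between the key and the colon. */
            while (*p == ' ' || *p == '\t') {
                ++p;
            }
            if (*p == ':') {
                long id = strtol(p + 1, NULL, 10);
                if (id != (long)count) {
                    ok = 0; /* id skipped or out of order */
                }
                ++count;
            }
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL) {
            ++line;
        }
    }
    if (contiguous != NULL) {
        *contiguous = ok;
    }
    return count;
}
```

On the machine described here, one would expect the count to be 60 when the function is given the full contents of /proc/cpuinfo.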


Camille



Ralph Castain wrote:

I believe I have found the problem, and it may indeed relate to the change in paffinity. By any chance, do you have unfilled sockets on that machine? Could you provide the output from something like "cat /proc/cpuinfo" (or the equivalent for your system) so we can see what physical processors and sockets are present?
If I'm correct as to the problem, here is the issue. OMPI has (until now) always assumed that the number of logical processors (or sockets, or cores) was the same as the number of physical processors (or sockets, or cores). As a result, several key subsystems were written without making any distinction as to which (logical vs physical) they were referring to. This was no problem until we recently encountered systems with "holes" in them - a processor turned "off", a socket unpopulated, etc.
In this case, the logical processor id no longer matches the physical processor id (ditto for sockets and cores). We adjusted the paffinity subsystem to deal with it - it took much more effort than we would have liked, and exposed lots of inconsistencies in how the base operating systems handle such situations.
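
To make the logical vs physical mismatch concrete, here is a minimal sketch in C. It is an illustration only, not Open MPI code: the function build_logical_map and the present[] array are hypothetical names, and the real paffinity implementation may work quite differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustration only; all names are hypothetical.
 * present[p] is nonzero when physical processor p is usable.
 * Fills logical_to_physical[] in increasing physical order and returns
 * the number of logical processors, writing at most max_logical entries.
 */
size_t build_logical_map(const int *present, size_t n_physical,
                         int *logical_to_physical, size_t max_logical)
{
    size_t n_logical = 0;

    assert(present != NULL && logical_to_physical != NULL);
    for (size_t p = 0; p < n_physical && n_logical < max_logical; ++p) {
        if (present[p]) {
            /* Logical ids are dense; physical ids may have gaps. */
            logical_to_physical[n_logical++] = (int)p;
        }
    }
    return n_logical;
}
```

With present = {1, 1, 0, 1}, the map has three entries and logical processor 2 corresponds to physical processor 3. Code that passes the logical id 2 where a physical id is expected would then address the processor that is turned off.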
Unfortunately, having gotten that straightened out, it is possible that you have uncovered a similar inconsistency in logical vs physical in another subsystem. I have asked better eyes than mine to take a look at that now to confirm - if so, it could take us a little while to fix.
My request for info was aimed at helping us to determine why your system is seeing this problem, but our tests didn't. We have tested the revised paffinity on both completely filled systems and on at least one system with "holes", but differences in OS levels, processor types, etc. could have caused our tests to pass while your system fails. I'm particularly suspicious of the old kernel you are running and how our revised code will handle it.
For now, I would suggest you work with revisions lower than r19391 - could you please confirm that r19390 or earlier works?
Thanks
Ralph

On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:
OK, thank you!

Camille

Ralph Castain wrote:
Okay, I'll look into it. I suspect the problem is due to the redefinition of the paffinity API to clarify physical vs logical processors - more than likely, the maffinity interface suffers from the same problem we had to correct over there. We'll report back later with an estimate of how quickly this can be fixed.
Thanks
Ralph
On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:
Ralph,
I compiled a clean checkout from the trunk (r19392); the problem is still the same.
Camille


Ralph Castain wrote:
Hi Camille
What OMPI version are you using? We just changed the paffinity module last night, but did nothing to maffinity. However, it is possible that the maffinity framework makes some calls into paffinity that need to be adjusted.
So the version number would help a great deal in this case.
Thanks
Ralph
On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:
Hello,
I am trying to run applications on a shared-memory machine. For the moment I am just trying to run tests on point-to-point communications (a trivial token ring) and collective operations (from the SkaMPI test suite).
It runs smoothly if mpi_paffinity_alone is set to 0. For a number of processes larger than about 10, global communications just don't seem possible. Point-to-point communications seem to be OK.
But when I specify --mca mpi_paffinity_alone 1 on my command line, I get the following error:
mbind: Invalid argument
I looked into the code of maffinity/libnuma, and found out that the error comes from
     numa_setlocal_memory(segments[i].mbs_start_addr,
                          segments[i].mbs_len);

in maffinity_libnuma_module.c.
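
As a diagnostic aid, the mbind(2) manual page lists several conditions that produce EINVAL, among them a start address that is not a multiple of the system page size and a node id larger than the maximum supported one. The thread does not establish which condition applies here. The sketch below, in C, covers only the alignment check; is_page_aligned and page_floor are hypothetical helper names, not part of libnuma or Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.
 * page_size must be a nonzero power of two.
 */

/* Returns 1 if addr is a multiple of page_size, 0 otherwise. */
int is_page_aligned(const void *addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ((uintptr_t)addr & (uintptr_t)(page_size - 1)) == 0;
}

/* Rounds addr down to the start of the page that contains it. */
uintptr_t page_floor(uintptr_t addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(uintptr_t)(page_size - 1);
}
```

One way to use it would be to call is_page_aligned(segments[i].mbs_start_addr, (size_t)sysconf(_SC_PAGESIZE)) just before the numa_setlocal_memory call and log any segment that fails the check (sysconf is declared in unistd.h). If every segment passes, alignment can be ruled out and the other EINVAL conditions remain.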

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users