Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Hi Jeff, Ralph,

first of all: thanks for your work on this!

On 3 July 2013 21:09, Jeff Squyres (jsquyres) wrote:
> 1. The root cause of the issue is that you are assigning a
> non-existent IP address to a name. I.e., maps to 127.0.1.1,
> but that IP address does not exist anywhere. Hence, OMPI will never
> conclude that that is "local". If you had assigned to
> the 127.0.0.1 address, things should have worked fine.

Ok, I see. Would that have worked also if I had added the 127.0.1.1
address to the "lo" interface (in addition to 127.0.0.1)?

> Just curious: why are you doing this?

It's commonplace in Ubuntu/Debian installations; see, e.g.,
http://serverfault.com/questions/363095/what-does-127-0-1-1-represent-in-etc-hosts

In our case, it was rolled out as a fix for some cron job running on
Apache servers (apparently Debian's Apache looks up 127.0.1.1 and uses
that as the ServerName, unless a server name is explicitly
configured), and was later extended to all hosts because "what harm
can it do?".  (Needless to say, we have rolled back the change.)

> 2. That being said, OMPI is not currently looking at all the
> responses from gethostbyname() -- we're only looking at the first
> one. In the spirit of how clients are supposed to behave when
> multiple IP addresses are returned from a single name lookup, OMPI
> should examine all of those addresses and see if it finds one that
> it "likes", and then use that. So we should extend OMPI to examine
> all the IP addresses from gethostbyname().

Just out of curiosity: would it have worked, had I compiled OMPI with
IPv6 support?  (As far as I understand IPv6, an application is
required to examine all the addresses returned for a host name, and
not just pick the first one.)

> Ralph is going to work on this, but it'll likely take him a little
> time to get it done. We'll get it into the trunk and probably ask
> you to verify that it works for you. And if so, we'll back-port to
> the v1.6 and v1.7 series.
I'm glad to help and verify, but I guess we do not need the backport or an urgent fix. The easy workaround for us was to remove the 127.0.1.1 line from the compute nodes (we keep it only on Apache servers where it originated). Thanks, Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Hi,

sorry for the delay in replying -- pretty busy week :-(

On 28 June 2013 21:54, Jeff Squyres (jsquyres) wrote:
> Here's what we think we know (I'm using the name "foo" instead of
> your actual hostname because it's easier to type):
>
> 1. When you run "hostname", you get foo.local back

Yes.

> 2. In your /etc/hosts file, foo.local is listed on two lines:
>    127.0.1.1
>    10.1.255.201

Yes:

    [rmurri@nh64-5-9 ~]$ fgrep nh64-5-9 /etc/hosts
    127.0.1.1       nh64-5-9.local nh64-5-9
    10.1.255.194    nh64-5-9.local nh64-5-9

> 3. When you login to the "foo" server and execute mpirun with a
> hostfile that contains "foo", Open MPI incorrectly thinks that the
> local machine is not foo, and therefore tries to ssh to it (and
> things go downhill from there).

Yes.

> 4. When you login to the "foo" server and execute mpirun with a
> hostfile that contains "foo.local" (you said "FQDN", but never said
> exactly what you meant by that -- I'm assuming "foo.local", not
> "foo.yourdomain.com"), then Open MPI behaves properly.

Yes.  FQDN = foo.local.  (This is a compute node in a cluster that
does not have any public IP address nor a DNS entry -- it only has an
interface to the cluster-private network.  I presume this is not
relevant to OpenMPI as long as all names are correctly resolved via
`/etc/hosts`.)

> Is that all correct?

Yes, all correct.

> We have some followup questions for you:
>
> 1. What happens when you try to resolve "foo"? (e.g., via the "dig"
> program -- "dig foo")

Here's what happens with `dig`:

    [rmurri@nh64-5-9 ~]$ dig nh64-5-9

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4373
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;nh64-5-9.                      IN      A

    ;; AUTHORITY SECTION:
    .                       3600    IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2013070200 1800 900 604800 86400

    ;; Query time: 17 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul  2 15:47:57 2013
    ;; MSG SIZE  rcvd: 101

However, `getent hosts` has a different reply:

    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
    127.0.1.1       nh64-5-9.local nh64-5-9

> 2. What happens when you try to resolve "foo.local"? (e.g., "dig
> foo.local")

Here's what happens with `dig`:

    [rmurri@nh64-5-9 ~]$ dig nh64-5-9.local

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.local
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62092
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

    ;; QUESTION SECTION:
    ;nh64-5-9.local.                IN      A

    ;; ANSWER SECTION:
    nh64-5-9.local.         259200  IN      A       10.1.255.194

    ;; AUTHORITY SECTION:
    local.                  259200  IN      NS      ns.local.

    ;; ADDITIONAL SECTION:
    ns.local.               259200  IN      A       127.0.0.1

    ;; Query time: 0 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul  2 15:48:50 2013
    ;; MSG SIZE  rcvd: 81

Same query resolved via `getent hosts`:

    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
    127.0.1.1       nh64-5-9.local nh64-5-9

> 3. What happens when you try to resolve "foo.yourdomain.com"? (e.g.,
> "dig foo.yourdomain.com")

This yields an empty response from both `dig` and `getent hosts`, as
the node is only attached to a private network and not registered in
DNS:

    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9.uzh.ch

    [rmurri@nh64-5-9 ~]$ dig nh64-5-9.uzh.ch

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.uzh.ch
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61801
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;nh64-5-9.uzh.ch.               IN      A

    ;; AUTHORITY SECTION:
    uzh.ch.                 8921    IN      SOA     ns1.uzh.ch. hostmaster.uzh.ch. 384627811 3600 1800 360 10800

    ;; Query time: 0 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul  2 15:50:54 2013
    ;; MSG SIZE  rcvd: 84

> 4. Please apply the attached patch to your Open MPI 1.6.5 build
> (please note that it adds diagnostic output; do *not* put this patch
> into production) and:
>    4a. Run with one of your "bad" cases and send us the output
>    4b. Run with one of your "good" cases and send us the output

Please find the outputs attached.  The exact `mpiexec` invocation and
the machines file are at the beginning of each file.  Note that I
allocated 8 slots (on 4 nodes), but only use 2 slots (on 1 node).

Thanks,
Riccardo

exam01.out.BAD
Description: Binary data

exam01.out.GOOD
Description: Binary data
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Hello,

On 26 June 2013 03:11, Ralph Castain wrote:
> I've been reviewing the code, and I think I'm getting a handle on
> the issue.
>
> Just to be clear - your hostname resolves to the 127 address? And
> you are on a Linux (not one of the BSD flavors out there)?

Yes (but it resolves to 127.0.1.1 -- not the usual 127.0.0.1), and yes
(Rocks 5.3 ~= CentOS 5.3).

> If the answer to both is "yes", then the problem is that we ignore
> loopback devices if anything else is present. When we check to see
> if the hostname we were given is the local node, we resolve the name
> to the address and then check our list of interfaces. The loopback
> device is ignored and therefore not on the list. So if you resolve
> to the 127 address, we will decide this is a different node than the
> one we are on.
>
> I can modify that logic, but want to ensure this accurately captures
> the problem. I'll also have to discuss the change with the other
> developers to ensure we don't shoot ourselves in the foot if we make
> it.

Ok, thanks -- I'll keep an eye on your replies.

Thanks,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 20 June 2013 11:29, Riccardo Murri wrote:
> However, I cannot reproduce the issue now

Just to be clear: the "issue" in that mail refers to the OpenMPI SGE
ras plugin not working with our version of SGE.  The issue with
127.0.1.1 addresses is reproducible at will.

Thanks,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 19 June 2013 23:52, Reuti wrote:
> Am 19.06.2013 um 22:14 schrieb Riccardo Murri:
>
>> On 19 June 2013 20:42, Reuti wrote:
>>> Am 19.06.2013 um 19:43 schrieb Riccardo Murri:
>>>
>>>> On 19 June 2013 16:01, Ralph Castain wrote:
>>>>> How is OMPI picking up this hostfile? It isn't being specified on
>>>>> the cmd line - are you running under some resource manager?
>>>>
>>>> Via the environment variable `OMPI_MCA_orte_default_hostfile`.
>>>>
>>>> We're running under SGE, but disable the OMPI/SGE integration (rather
>
> BTW: Which version of SGE?

SGE 6.2u4 running under Rocks 5.3:

    $ qstat -h
    GE 6.2u4
    $ cat /etc/rocks-release
    Rocks release 5.3 (Rolled Tacos)

>> It's enabled but (IIRC) the problem is that OpenMPI detects the
>> presence of SGE from some environment variable
>
> Correct.
>
>> , which, in our version of SGE, simply isn't there.
>
> Do you use a custom "starter_method" in the queue definition?

No custom starter_method.

> Does a submitted script with:
>
> #!/bin/sh
> env
>
> list at least some of the SGE* environment variables - or none at all?

Quite a few SGE_* variables are in the environment:

    $ cat env.sh
    env | sort
    $ qsub -pe mpi 2 env.sh
    Your job 29590 ("env.sh") has been submitted
    $ egrep ^SGE_ env.sh.o29590
    SGE_ACCOUNT=sge
    SGE_ARCH=lx26-amd64
    ...

However, I cannot reproduce the issue now -- it's quite possible that
it originated on an older cluster (now decommissioned) and we just
kept the submission script on newer hardware without checking.

Thanks for the help,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 20 June 2013 06:33, Ralph Castain wrote:
> Been trying to decipher this problem, and think maybe I'm beginning
> to understand it. Just to clarify:
>
> * when you execute "hostname", you get the .local response?

Yes:

    [rmurri@nh64-2-11 ~]$ hostname
    nh64-2-11.local
    [rmurri@nh64-2-11 ~]$ uname -n
    nh64-2-11.local
    [rmurri@nh64-2-11 ~]$ hostname -s
    nh64-2-11
    [rmurri@nh64-2-11 ~]$ hostname -f
    nh64-2-11.local

> * you somewhere have it setup so that 10.x.x.x resolves to , with no
> ".local" extension?

No.  Host name resolution is correct, but the hostname resolves to the
127.0.1.1 address:

    [rmurri@nh64-2-11 ~]$ getent hosts `hostname`
    127.0.1.1       nh64-2-11.local nh64-2-11

Note that `/etc/hosts` also lists a 10.x.x.x address, which is the one
actually assigned to the ethernet interface:

    [rmurri@nh64-2-11 ~]$ fgrep `hostname -s` /etc/hosts
    127.0.1.1       nh64-2-11.local nh64-2-11
    10.1.255.201    nh64-2-11.local nh64-2-11
    192.168.255.206 nh64-2-11-myri0

If we remove the `127.0.1.1` line from `/etc/hosts`, then everything
works again.  Also, everything works if we use only FQDNs in the
hostfile.  So it seems that the 127.0.1.1 address is treated
specially.

Thanks,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 19 June 2013 20:42, Ralph Castain wrote:
> I'm assuming that the offending host has some other address besides
> just 127.0.1.1 as otherwise it couldn't connect to anything.

Yes, it has an IP on some 10.x.x.x network.

> I'm heading out the door for a couple of weeks, but can try to look
> at it when I return.

We have a workaround (just create the hostfile using FQDNs --
actually, FQDNs or UQDNs depending on what `uname -n` returns), so
it's definitely not urgent for us.  But if you think it's a bug worth
fixing, I can provide details and/or test code.

Thanks,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 19 June 2013 20:42, Reuti wrote:
> Am 19.06.2013 um 19:43 schrieb Riccardo Murri:
>
>> On 19 June 2013 16:01, Ralph Castain wrote:
>>> How is OMPI picking up this hostfile? It isn't being specified on
>>> the cmd line - are you running under some resource manager?
>>
>> Via the environment variable `OMPI_MCA_orte_default_hostfile`.
>>
>> We're running under SGE, but disable the OMPI/SGE integration (rather
>
> It's disabled by default, you would have to activate it during
> `configure` of Open MPI.

It's enabled, but (IIRC) the problem is that OpenMPI detects the
presence of SGE from some environment variable, which, in our version
of SGE, simply isn't there.  I can dig up the details if you're
interested.

Regards,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
On 19 June 2013 16:01, Ralph Castain wrote:
> How is OMPI picking up this hostfile? It isn't being specified on the
> cmd line - are you running under some resource manager?

Via the environment variable `OMPI_MCA_orte_default_hostfile`.

We're running under SGE, but disable the OMPI/SGE integration (rather
old version of SGE, does not coordinate well with OpenMPI); here's the
relevant snippet from our startup script:

    # the OMPI/SGE integration does not seem to work with
    # our SGE version; so use the `mpi` PE and direct OMPI
    # to look for a "plain old" machine file
    unset PE_HOSTFILE
    if [ -r "${TMPDIR}/machines" ]; then
        OMPI_MCA_orte_default_hostfile="${TMPDIR}/machines"
        export OMPI_MCA_orte_default_hostfile
    fi

    GMSCOMMAND="$openmpi_root/bin/mpiexec -n $NCPUS --nooversubscribe $gamess $INPUT -scr $(pwd)"

The `$TMPDIR/machines` hostfile is created from SGE's $PE_HOSTFILE by
extracting the host names, and repeating each one for the given number
of slots (unmodified code that comes with SGE):

    PeHostfile2MachineFile()
    {
        cat $1 | while read line; do
            # echo $line
            host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
            nslots=`echo $line|cut -f2 -d" "`
            i=1
            while [ $i -le $nslots ]; do
                echo $host
                i=`expr $i + 1`
            done
        done
    }

Thanks,
Riccardo
Re: [OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1
Hi,

(colleague of OP here)

On 19 June 2013 15:09, Ralph Castain wrote:
> I don't see a hostfile on your command line - so I assume you are
> using a default hostfile? What is in it?

The hostfile comes from the batch system; it just contains the
unqualified host names:

    $ cat $TMPDIR/machines
    nh64-1-17
    nh64-1-17

No problem if we modify the setup script to create the hostfile using
FQDNs instead.  (`uname -n` returns the FQDN, not the unqualified host
name.)

Thanks,
Riccardo

--
Riccardo Murri
http://www.gc3.uzh.ch/people/rm

Grid Computing Competence Centre
University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4222
Fax: +41 44 635 6888
Re: [OMPI users] Why? MPI_Scatter problem
On Mon, Dec 13, 2010 at 4:57 PM, Kechagias Apostolos wrote:
> I have the code that is in the attachment.
> Can anybody explain how to use scatter function?

MPI_Scatter receives the data in the initial segment of the given
buffer.  (The receiving buffer needs to be 1/Nth of the send buffer.)
So, in your code, it's always start=0 and end=(N1-1) independently of
the rank.

Best regards,
Riccardo
Re: [OMPI users] Help on Mpi derived datatype for class with static members
Hi,

On Fri, Dec 10, 2010 at 2:51 AM, Santosh Ansumali wrote:
>> - the "static" data member is shared between all instances of the
>> class, so it cannot be part of the MPI datatype (it will likely be
>> at a fixed memory location);
>
> Yes! I agree that i is global as far as different instances of class
> is concern. I don't even want it to be part of MPI datatype.
> However, I am concern that as the given class has a static member,
> is it ok to just ignore its existence while creating MPI datatype?

It *should* be.  However, an authoritative answer requires good
knowledge of the C++ standard and extensive experience with the
compiler (neither of which I personally have), so my suggestion would
be to post the question on comp.lang.c++ or StackOverflow.

>> - in addition, the "i" member is "static const" of a POD type,
>> meaning the compiler is allowed to optimize it out and not allocate
>> any actual memory location for it;
>>
>> This boils down to: the only data you need to send around in a
>> "class test" instance is the "double data[5]" array.
>
> True! on what computers there is no memory allocation for static
> const int member.

As far as I understand it, the "const" is a hint to the compiler that
the value will never change, so the storage *could* be optimized out.
Whether this happens or not depends on the compiler and the
optimization level (e.g., GCC will never optimize a value out at -O0,
but can do it at -O2), and on the actual code as well: if your code
references "&i" at some point, then the compiler has to create actual
storage for "i".

> True! I just want to show the essential part of the class. The real
> class is inheriting from other class which has no data member.

Beware: if you are using virtual functions in any class of the
hierarchy, then the vtable pointer will be a hidden field in the
class' storage, and you definitely do not want to overwrite it -- this
can influence the start address and/or the displacement of the data.
In a simple case like "class test { double data[5]; }" you can just
use "&data" as the address of your MPI data, but things may be
different in the general case.  Again, my advice would be to post a
question in a dedicated C++ forum for a comprehensive answer.

If you are going to send C++ classes via MPI, you might want to have a
look at Boost.MPI, which provides an easier interface for sending C++
classes around, possibly at some performance and/or memory cost.

Best regards,
Riccardo
Re: [OMPI users] Help on Mpi derived datatype for class with static members
On Wed, Dec 8, 2010 at 10:04 PM, Santosh Ansumali wrote:
> I am confused with the use of MPI derived datatype for classes with
> static member. How to create derived datatype for something like
>
> class test {
>     static const int i=5;
>     double data[5];
> }

This looks like C++ code, and I think there can be a couple of
problems with sending this as an MPI derived datatype:

- the "static" data member is shared between all instances of the
  class, so it cannot be part of the MPI datatype (it will likely be
  at a fixed memory location);

- in addition, the "i" member is "static const" of a POD type, meaning
  the compiler is allowed to optimize it out and not allocate any
  actual memory location for it.

This boils down to: the only data you need to send around in a "class
test" instance is the "double data[5]" array.  If the static member
were not "const", you could send it in a separate message.

Best regards,
Riccardo

P.S. Besides, all members in a "class" are private by default and
"class test" does not have a constructor, so there's no way you can
put any useful values into this "test" class.  (But I guess this is
just an oversight from stripping down the code for the example...)
Re: [OMPI users] possible mismatch between MPI_Iprobe and MPI_Recv?
Hi Jeff, thanks for the explanation - I should have read the MPI standard more carefully. In the end, I traced the bug down to using standard send instead of synchronous send, so it had nothing to do with the receiving side at all. Best regards, Riccardo
[OMPI users] possible mismatch between MPI_Iprobe and MPI_Recv?
Hello,

I'm trying to debug a segfaulting application; the segfault does not
happen consistently, however, so my guess is that it is due to some
memory corruption problem which I'm trying to find.

I'm using code like this:

    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    if (flag) {
        int size;
        MPI_Get_count(&status, MPI_BYTE, &size);
        void* row = xmalloc(size);
        /* ... */
        MPI_Recv(row, size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, &status);
        /* ... */
    }

Question: is it possible that, in the time my program progresses from
MPI_Iprobe() to MPI_Recv(), another message has arrived that matches
the MPI_Recv(), but is not the one originally matched by MPI_Iprobe()?
(e.g., a shorter one)  In particular, could it be that the size of the
message actually received by MPI_Recv() does not match `size` (the
variable)?

In case a shorter message (different from the one initially matched)
was received, can I get the actual message size via a new call to
MPI_Get_count(&mpi_recv_status ...)?

(My application is sending variable-length messages from one rank to
the other at quite a high rate, so such a mismatch could potentially
be deadly.)

Best regards,
Riccardo
Re: [OMPI users] MPI-2.2: do you care?
On Wed, Oct 27, 2010 at 2:29 AM, Jeremiah Willcock wrote:
> On Tue, 26 Oct 2010, Jeff Squyres wrote:
>
>> Open MPI users --
>>
>> I took a little heat at the last MPI Forum for not having Open MPI
>> be fully compliant with MPI-2.2 yet (OMPI is compliant with
>> MPI-2.1). Specifically, there are still 4 open issues in Open MPI
>> that are necessary for full MPI-2.2 compliance:
>>
>> https://svn.open-mpi.org/trac/ompi/query?status=accepted&status=assigned&status=new&status=reopened&summary=~MPI-2.2&col=id&col=summary&col=status&col=type&col=priority&col=milestone&col=version&order=priority
>>
>> We haven't made these items a priority because -- to be blunt -- no
>> one really has been asking for them. No one has come forward and
>> said "I *must* have these features!" (to be fair, they're somewhat
>> obscure features).
>>
>> Other than not having the obvious "OMPI is MPI-2.2 compliant"
>> checkmark for marketing reasons, is there anyone who *needs* the
>> functionality represented by those still-open tickets?
>
> I have been writing some code that would have benefited greatly from
> the fix to #2219 (MPI datatypes for C99 types and MPI integer
> typedefs).

+1

--
Riccardo Murri, Hadlaubstr. 150, 8006 Zürich (CH)
Re: [OMPI users] MPI Template Datatype?
On Tue, Aug 10, 2010 at 9:49 PM, Alexandru Blidaru wrote:
> Are the Boost.MPI send and recv functions as fast as the standard
> ones when using Open-MPI?

Boost.MPI is layered on top of plain MPI; it basically provides a
mapping from complex and user-defined C++ data types to MPI datatypes.
The added overhead depends on how complex the C++ data structures are;
there are some tweaks and hints that can reduce the overhead -- it's
all explained in the manual.

There are also some performance comparisons available in the Boost.MPI
manual:

    http://www.boost.org/doc/libs/1_43_0/doc/html/mpi/performance.html

Best regards,
Riccardo

P.S. I think discussion of Boost.MPI is off-topic on the OMPI mailing
list; feel free to email me privately or move the discussion to the
Boost.MPI mailing list.
Re: [OMPI users] MPI Template Datatype?
Hi Alexandru,

you can read all about Boost.MPI at:

    http://www.boost.org/doc/libs/1_43_0/doc/html/mpi.html

On Mon, Aug 9, 2010 at 10:27 PM, Alexandru Blidaru wrote:
> I basically have to implement a 4D vector. An additional goal of my
> project is to support char, int, float and double datatypes in the
> vector.

If your "vector" is fixed-size (i.e., all vectors are comprised of 4
elements), then you can likely dispose of std::vector and use C-style
arrays with templated send/receive calls (that would be just
interfaces to MPI_Send/MPI_Recv):

    // BEWARE: untested code!!!
    template<typename T>
    int send(T* vector, int dest, int tag, MPI_Comm comm)
    {
        throw std::logic_error("called generic MyVector::send");
    }

    template<typename T>
    int recv(T* vector, int source, int tag, MPI_Comm comm)
    {
        throw std::logic_error("called generic MyVector::recv");
    }

and then you specialize the template for the types you actually use:

    template<>
    int send(int* vector, int dest, int tag, MPI_Comm comm)
    {
        return MPI_Send(vector, 4, MPI_INT, dest, tag, comm);
    }

    template<>
    int recv(int* vector, int source, int tag, MPI_Comm comm)
    {
        MPI_Status status;
        return MPI_Recv(vector, 4, MPI_INT, source, tag, comm, &status);
    }

    // etc.

However, let me warn you that it would likely take more time and
effort to write all the template specializations and get them working
than to just use Boost.MPI.

Best regards,
Riccardo
Re: [OMPI users] MPI Template Datatype?
Hello Alexandru,

On Mon, Aug 9, 2010 at 6:05 PM, Alexandru Blidaru wrote:
> I have to send some vectors from node to node, and the vectors are
> built using a template. The datatypes used in the template will be
> long, int, double, and char. How may I send those vectors since I
> wouldn't know what MPI datatype I have to specify in MPI_Send and
> MPI_Recv. Is there any way to do this?

I'm not sure I understand what your question is about: are you asking
what MPI datatypes you should use to send C types "long", "int", etc.,
or are you trying to send a more complex C type ("vector")?  Can you
send some code demonstrating the problem you are trying to solve?

Besides, your wording suggests that you are trying to send a C++
std::vector over MPI: have you already had a look at Boost.MPI?  It
has out-of-the-box support for STL containers.

Cheers,
Riccardo
Re: [OMPI users] Open MPI C++ class datatype
Hi Jack,

On Wed, Aug 4, 2010 at 6:25 AM, Jack Bryan wrote:
> I need to transfer some data, which is a C++ class with some vector
> member data.
> I want to use MPI_Bcast(buffer, count, datatype, root, comm);
> May I use MPI_Datatype to define a customized data structure that
> contains a C++ class?

No, unless you have access to the implementation details of the
std::vector class (which would render your code dependent on one
particular implementation of the STL, and thus non-portable).

Boost.MPI provides support for standard C++ datatypes; if you want to
stick to "plain MPI" calls, then your only choice is to use C-style
arrays.

Regards,
Riccardo
[OMPI users] is OpenMPI 1.4 thread-safe?
Hello,

The FAQ states:

    "Support for MPI_THREAD_MULTIPLE [...] has been designed into Open
    MPI from its first planning meetings.  Support for
    MPI_THREAD_MULTIPLE is included in the first version of Open MPI,
    but it is only lightly tested and likely still has some bugs."

The man page of "mpirun" from v1.4.3a1r23323 in addition says "Open
MPI is, currently, neither thread-safe nor async-signal-safe" (section
"Process Termination / Signal Handling").

Are these statements up-to-date?  What is the status of
MPI_THREAD_MULTIPLE in OMPI 1.4?

Thanks in advance for any info!

Cheers,
Riccardo
[OMPI users] what is "thread support: progress" ?
Hello,

I just re-compiled OMPI, and noticed this in the "ompi_info --all"
output:

    Open MPI: 1.4.3a1r23323
    ...
    Thread support: posix (mpi: yes, progress: no)
    ...

What is this "progress thread support"?  Is it the "asynchronous
progress ... in the TCP point-to-point device" that the FAQ mentions?
I could not find any ./configure option to enable or disable it.

Cheers,
Riccardo
Re: [OMPI users] Cannot start (WAS: Segmentation fault / Address not mapped (1) with 2-node job on Rocks 5.2)
Sorry, I just found out about the "--debug-daemons" option, which
allowed me to google a meaningful error message and find the solution
in the archives of this list.

For the record, the problem was that the "orted" being launched on the
remote node was the one from the system-wide MPI install, not the one
in my home directory.  It seems that "-x PATH" does not affect the
search for "orted"; would it make sense for "-x FOO" to also add a
"-o SendEnv=FOO" to the "ssh remote-node orted" invocation?

Best regards,
Riccardo
Re: [OMPI users] Cannot start (WAS: Segmentation fault / Address not mapped (1) with 2-node job on Rocks 5.2)
Hello,

On Tue, Jun 22, 2010 at 8:05 AM, Ralph Castain wrote:
> Sorry for the problem - the issue is a bug in the handling of the
> pernode option in 1.4.2. This has been fixed and awaits release in
> 1.4.3.

Thank you for pointing this out.  Unfortunately, I still am not able
to start remote processes:

    $ mpirun --host compute-0-11 -np 1 ./hello_mpi
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the
    process that caused that situation.
    --------------------------------------------------------------------------

The same program runs fine if I use "--host localhost".

Doing a "strace -v" on the "mpirun" invocation shows a strange
invocation of "orted":

    execve("//usr/bin/ssh", ["/usr/bin/ssh", "-x", "compute-0-11",
    " orted", "--daemonize", "-mca", "ess", "env", "-mca",
    "orte_ess_jobid", "2322006016", "-mca", "orte_ess_vpid", "1",
    "-mca", "orte_ess_num_procs", "2", "--hnp-uri",
    "\"2322006016.0;tcp://192.168.122.1"],
    ["MKLROOT=/opt/intel/mkl/10.0.3.02", ...])

Indeed, the 192.168.122.1 address is connected to an internal Xen
bridge "virbr0", so it should not appear as a "call-back" address.  Is
there a command-line option to force mpirun to use a certain IP
address?  I have tried starting "mpirun" with "--mca
btl_tcp_if_exclude lo,virbr0" to no avail.

Also, the " orted" argument to ssh starts with a space; is this OK?

I'm using OMPI 1.4.2, self-compiled on a Rocks 5.2 (i.e., CentOS 5.2)
cluster.

Regards,
Riccardo
[OMPI users] Segmentation fault / Address not mapped (1) with 2-node job on Rocks 5.2
Hello,

I'm using OpenMPI 1.4.2 on a Rocks 5.2 cluster.  I compiled it on my
own to have a thread-enabled MPI (the OMPI coming with Rocks 5.2
apparently only supports MPI_THREAD_SINGLE), and installed it into
~/sw.

To test the newly installed library I compiled a simple "hello world"
that comes with Rocks:

    [murri@idgc3grid01 hello_mpi.d]$ cat hello_mpi.c
    #include <stdio.h>
    #include <sys/utsname.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int myrank;
        struct utsname unam;

        MPI_Init(&argc, &argv);
        uname(&unam);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("Hello from rank %d on host %s\n", myrank, unam.nodename);
        MPI_Finalize();
    }

The program runs fine as long as it only uses ranks on localhost:

    [murri@idgc3grid01 hello_mpi.d]$ mpirun --host localhost -np 2 hello_mpi
    Hello from rank 1 on host idgc3grid01.uzh.ch
    Hello from rank 0 on host idgc3grid01.uzh.ch

However, as soon as I try to run on more than one host, I get a
segfault:

    [murri@idgc3grid01 hello_mpi.d]$ mpirun --host idgc3grid01,compute-0-11 --pernode hello_mpi
    [idgc3grid01:13006] *** Process received signal ***
    [idgc3grid01:13006] Signal: Segmentation fault (11)
    [idgc3grid01:13006] Signal code: Address not mapped (1)
    [idgc3grid01:13006] Failing at address: 0x50
    [idgc3grid01:13006] [ 0] /lib64/libpthread.so.0 [0x359420e4c0]
    [idgc3grid01:13006] [ 1] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b352d00265b]
    [idgc3grid01:13006] [ 2] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x676) [0x2b352d00e0e6]
    [idgc3grid01:13006] [ 3] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xb8) [0x2b352d015358]
    [idgc3grid01:13006] [ 4] /home/oci/murri/sw/lib/openmpi/mca_plm_rsh.so [0x2b352dcb9a80]
    [idgc3grid01:13006] [ 5] mpirun [0x40345a]
    [idgc3grid01:13006] [ 6] mpirun [0x402af3]
    [idgc3grid01:13006] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x359361d974]
    [idgc3grid01:13006] [ 8] mpirun [0x402a29]
    [idgc3grid01:13006] *** End of error message ***
    Segmentation fault
I've already tried the suggestions posted in reply to similar messages
on the list: "ldd" reports that the executable is linked with the
libraries in my home directory, not the system-wide OMPI:

    [murri@idgc3grid01 hello_mpi.d]$ ldd hello_mpi
    libmpi.so.0 => /home/oci/murri/sw/lib/libmpi.so.0 (0x2ad2bd6f2000)
    libopen-rte.so.0 => /home/oci/murri/sw/lib/libopen-rte.so.0 (0x2ad2bd997000)
    libopen-pal.so.0 => /home/oci/murri/sw/lib/libopen-pal.so.0 (0x2ad2bdbe3000)
    libdl.so.2 => /lib64/libdl.so.2 (0x003593e0)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x003596a0)
    libutil.so.1 => /lib64/libutil.so.1 (0x0035a100)
    libm.so.6 => /lib64/libm.so.6 (0x003593a0)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00359420)
    libc.so.6 => /lib64/libc.so.6 (0x00359360)
    /lib64/ld-linux-x86-64.so.2 (0x00359320)

I've also checked with "strace" that the "mpi.h" file used during
compilation is the one in ~/sw/include and that all ".so" files being
loaded from OMPI are the ones in ~/sw/lib.

I can ssh without password to the target compute node.

The "mpirun" and "mpicc" are the correct ones:

    [murri@idgc3grid01 hello_mpi.d]$ which mpirun
    ~/sw/bin/mpirun
    [murri@idgc3grid01 hello_mpi.d]$ which mpicc
    ~/sw/bin/mpicc

I'm pretty stuck now; can anybody give me a hint?

Thanks a lot for any help!

Best regards,
Riccardo