Re: [OMPI devel] new CRS component added (criu)

2014-02-07 Thread Jeff Squyres (jsquyres)
On Feb 7, 2014, at 5:08 PM, Jeff Squyres (jsquyres) wrote: > AS_IF([test $crs_criu_happy -eq 1], > [$2], > [AS_IF([test "$with_criu" != "x" && "x$with_criu" != "xno"], >[AC_MSG_WARN([You asked for CRIU support, but I can't find > it.]) >

Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Jeff Squyres (jsquyres)
+1 On Feb 7, 2014, at 5:23 PM, Joshua Ladd wrote: > What: Add an internal random number generator to OPAL. > > Why: OMPI uses rand and srand all over the place. Because the middleware is > mucking with the RNG’s global state, applications that use these library >

Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Nathan Hjelm
+1. On Fri, Feb 07, 2014 at 10:23:41PM +, Joshua Ladd wrote: >What: Add an internal random number generator to OPAL. > > > >Why: OMPI uses rand and srand all over the place. Because the middleware >is mucking with the RNG's global state, applications that use these >

Re: [OMPI devel] new CRS component added (criu)

2014-02-07 Thread Josh Hursey
That is fantastic! Thanks for the hard work so far getting the C/R infrastructure back in place. On Fri, Feb 7, 2014 at 3:46 PM, Adrian Reber wrote: > I have created a new CRS component using criu (criu.org) to support > checkpoint/restart in Open MPI. My current patch only

Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Joshua Ladd
Yes. After batting this around a bit with Jeff and Mike, we came to the consensus that the interface should be more "rand_r", so that state is locally managed by the consumer. The ALFG offers a powerful yet simple way to do it. We may even expose it to users since it offers a very scalable and
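
A rough sketch of what a "rand_r"-style additive lagged Fibonacci generator (ALFG) could look like, with all state held by the caller rather than in a library global. This is only an illustration and not the OPAL code: the names alfg_state_t, alfg_seed and alfg_next are hypothetical, the lag pair (24, 55) is the classic textbook choice, and seeding through a small linear congruential generator is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed lags for the sketch: x[n] = x[n-24] + x[n-55] (mod 2^32). */
#define ALFG_LONG_LAG  55
#define ALFG_SHORT_LAG 24

/* Hypothetical caller-owned state: a ring of the last 55 outputs. */
typedef struct {
    uint32_t ring[ALFG_LONG_LAG];
    size_t pos;                 /* index of the oldest value, x[n-55] */
} alfg_state_t;

/* Fill the ring from a simple LCG (an assumption for this sketch). */
void alfg_seed(alfg_state_t *st, uint32_t seed)
{
    assert(st != NULL);
    uint32_t x = seed;
    for (size_t i = 0; i < ALFG_LONG_LAG; ++i) {
        x = x * 1664525u + 1013904223u;
        st->ring[i] = x;
    }
    st->ring[0] |= 1u;          /* an ALFG needs at least one odd word */
    st->pos = 0;
}

/* Produce the next value; only the caller's state is touched. */
uint32_t alfg_next(alfg_state_t *st)
{
    assert(st != NULL);
    size_t oldest = st->pos;    /* holds x[n-55] */
    size_t tap = (st->pos + (ALFG_LONG_LAG - ALFG_SHORT_LAG)) % ALFG_LONG_LAG; /* x[n-24] */
    uint32_t value = st->ring[oldest] + st->ring[tap];  /* wraps mod 2^32 */
    st->ring[oldest] = value;   /* the new value replaces the oldest one */
    st->pos = (st->pos + 1) % ALFG_LONG_LAG;
    return value;
}
```

Because each consumer owns its alfg_state_t, two consumers seeded identically produce identical streams, and neither one disturbs the application's own rand/srand sequence.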

Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Paul Hargrove
Joshua, This is for ticket #2928, right? -Paul On Fri, Feb 7, 2014 at 2:23 PM, Joshua Ladd wrote: > What: Add an internal random number generator to OPAL. > > > > Why: OMPI uses rand and srand all over the place. Because the middleware > is mucking with the RNG's

Re: [OMPI devel] Update on 1.7.5

2014-02-07 Thread Paul Hargrove
Ralph, I'll try to test tonight's v1.7 tarball for: + ia64 atomics (#4174) + bad getpwuid (#4164) + opal_path_nfs/EPERM (#4125) + torque smp (#4227) All but torque are fully-automated tests and I need only check my email for the results. The torque one will require manual job submission. -Paul

Re: [OMPI devel] new CRS component added (criu)

2014-02-07 Thread Jeff Squyres (jsquyres)
Sweet -- +1 for CRIU support! FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd actually go with making it a bit simpler. For example, I typically structure my configure.m4's like this (typed in mail client -- forgive mistakes...): - AS_IF([...some test],

[OMPI devel] RFC: optimize probe in ob1

2014-02-07 Thread Nathan Hjelm
What: The current probe algorithm in ob1 is linear with respect to the number of processes in the job. I wish to change the algorithm to be linear in the number of processes with unexpected messages. To do this I added an additional opal_list_t to the ob1 communicator and made the ob1 process a
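
The idea can be sketched roughly as follows, assuming a plain singly linked list in place of opal_list_t. Every name here (msg_t, peer_t, comm_t, comm_note_unexpected, comm_probe) is hypothetical and none of it is the ob1 code; matching is reduced to a tag comparison, and removing a peer from the list once its queue drains is left out to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical unexpected message: only the tag matters in this sketch. */
typedef struct msg {
    int tag;
    struct msg *next;
} msg_t;

/* Hypothetical per-process record kept by the communicator. */
typedef struct peer {
    int rank;
    msg_t *unexpected_head;     /* FIFO of unexpected messages */
    msg_t *unexpected_tail;
    struct peer *next_active;   /* link in the communicator's active list */
    bool on_active_list;
} peer_t;

/* Hypothetical communicator: tracks only peers with unexpected messages. */
typedef struct {
    peer_t *active_head;
} comm_t;

/* Queue an unexpected message and make sure its peer is on the active list. */
void comm_note_unexpected(comm_t *comm, peer_t *peer, msg_t *m)
{
    assert(comm != NULL && peer != NULL && m != NULL);
    m->next = NULL;
    if (peer->unexpected_tail != NULL) {
        peer->unexpected_tail->next = m;
    } else {
        peer->unexpected_head = m;
    }
    peer->unexpected_tail = m;
    if (!peer->on_active_list) {
        peer->next_active = comm->active_head;
        comm->active_head = peer;
        peer->on_active_list = true;
    }
}

/* Probe for a tag: visits only peers that have unexpected messages.
 * Returns the matching peer's rank, or -1 if nothing matches. */
int comm_probe(const comm_t *comm, int tag)
{
    assert(comm != NULL);
    for (const peer_t *p = comm->active_head; p != NULL; p = p->next_active) {
        for (const msg_t *m = p->unexpected_head; m != NULL; m = m->next) {
            if (m->tag == tag) {
                return p->rank;
            }
        }
    }
    return -1;
}
```

With this layout a probe on a job with many idle peers walks only the few entries on the active list instead of every process in the communicator.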

[OMPI devel] Update on 1.7.5

2014-02-07 Thread Ralph Castain
Hi folks As you may have noticed, I've been working my way thru the CMR backlog on 1.7.5. A large percentage of them were minor fixes (valgrind warning suppressions, error message typos, etc.), so those went in the first round. Today's round contains more "meaty" things, but I still consider

[OMPI devel] new CRS component added (criu)

2014-02-07 Thread Adrian Reber
I have created a new CRS component using criu (criu.org) to support checkpoint/restart in Open MPI. My current patch only provides the framework and necessary configure scripts to detect and link against criu. With this patch orte-checkpoint can request a checkpoint and the new CRIU CRS component

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Shamis, Pavel
Exchange is evil…. Attached. Best, P p4.patch.gz Description: p4.patch.gz On Feb 7, 2014, at 12:41 PM, Nathan Hjelm wrote: Can you gzip the patch. The local exchange server has a habit of converting LF to CRLF. -Nathan On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
Can you gzip the patch. The local exchange server has a habit of converting LF to CRLF. -Nathan On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel wrote: > Can you please give a try to the attached hot-fix. > It unrolls most of the spaghetti, except the iboffload component (which is >

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
Hah. You beat me to it. More or less identical to what I was doing. I will give this a try. If this works we should push it and add it to the coll/ml cmr. -Nathan On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel wrote: > Can you please give a try to the attached hot-fix. > It unrolls most

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Shamis, Pavel
Can you please give a try to the attached hot-fix. It unrolls most of the spaghetti, except the iboffload component (which is anyway disabled). Sorry for the mess. Best, Pasha On Feb 7, 2014, at 10:52 AM, Nathan Hjelm wrote: On Fri, Feb 07, 2014 at

Re: [OMPI devel] openmpi installation

2014-02-07 Thread Ralph Castain
If the directories are there and populated, then the problem is likely with your path. Do this: 1. "which mpirun" - if you don't see your /bin, then you know your path is wrong 2. "printenv PATH" - is it what you expected? We generally suggest that you put your /bin and /lib at the beginning

Re: [OMPI devel] openmpi installation

2014-02-07 Thread Talla
Thank you for considering my case seriously. Yes sir, both directories along with other directories are there with files in them. But still I feel I am missing something, not sure what it is. How can I check Open MPI? mpirun is not responding, not even mpicc. Any instruction how to run parallel jobs

Re: [OMPI devel] openmpi installation

2014-02-07 Thread Ralph Castain
Well, it certainly looks okay - try doing "ls" in your prefix directory. Do you see the bin and lib directories there? Anything in them? On Feb 7, 2014, at 8:37 AM, Talla wrote: > Hello sir > I downloaded openmpi 1.7 and followed the installation instructions: > cd openmpi >

[OMPI devel] openmpi installation

2014-02-07 Thread Talla
Hello sir I downloaded openmpi 1.7 and followed the installation instructions: cd openmpi ./configure --prefix="/home/$USER/.openmpi" make make install export PATH="$PATH:/home/$USER/.openmpi/bin" export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/" echo export

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Jeff Squyres (jsquyres)
On Feb 7, 2014, at 10:52 AM, Nathan Hjelm wrote: > Should be ready today. The use of that coll/ml structure is unnecessary > at this time. I am removing it in bcol right now. In the future we will > put in a better fix but this should work for 1.7.x/1.8.x. Sweet. -- Jeff

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Ralph Castain
On Feb 7, 2014, at 7:52 AM, Nathan Hjelm wrote: > On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote: >> The issue in 1.7 is all the cross-integration, which means we violate our >> normal behavior when it comes to no-building and user-directed component >>

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote: > The issue in 1.7 is all the cross-integration, which means we violate our > normal behavior when it comes to no-building and user-directed component > selection. Jeff and I just discussed how this could be resolved using the >

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Ralph Castain
The issue in 1.7 is all the cross-integration, which means we violate our normal behavior when it comes to no-building and user-directed component selection. Jeff and I just discussed how this could be resolved using the PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
How is this a problem in 1.7? We don't have a .ompi_ignore in 1.7.4. That is there to prevent mtt failures while I fix some outstanding bcol issues. I will clean this up on trunk and add it to the cmr. -Nathan On Thu, Feb 06, 2014 at 08:42:27PM -0800, Ralph Castain wrote: > As many of you will

Re: [OMPI devel] C/R and orte_oob

2014-02-07 Thread Josh Hursey
In the original implementation, the OOB ft_event did not do much of anything on checkpoint preparation and continue. We did not even close the sockets. However, during restart the OOB will need to renegotiate the socket connections - usually by calling the finalization function (close stale

Re: [OMPI devel] singleton appears to be broken

2014-02-07 Thread George Bosilca
It is difficult to see it from the stack trace, as it happens in the ORTE threads. But I do have all the output I expect, and as the application I was running is hello_world I’m almost certain it happens during MPI_Finalize. George. On Feb 7, 2014, at 03:38 , Ralph Castain

Re: [OMPI devel] singleton appears to be broken

2014-02-07 Thread Ralph Castain
Think I see the code path that causes this - I'll have to play with it a little as the race condition is biased heavily towards success, so (as you noted) it won't happen very often. On Feb 6, 2014, at 6:38 PM, Ralph Castain wrote: > Interesting - does it happen in