Re: [OMPI devel] v1.5 r26132 broken on multiple nodes?

2012-03-15 Thread Ralph Castain
Let me know what you find - I took a look at the code and it looks correct. All required changes were included in the patch that was applied to the branch. On Mar 14, 2012, at 11:27 PM, Eugene Loh wrote: > I'm quitting for the day, but happened to notice that all our v1.5 MTT runs > are failin

Re: [OMPI devel] RFC: ob1: fallback on put/send on rget failure

2012-03-15 Thread Shamis, Pavel
Nathan, I did not get any patch. Regards, Pavel (Pasha) Shamis --- Application Performance Tools Group Computer Science and Math Division Oak Ridge National Laboratory On Mar 15, 2012, at 5:07 PM, Nathan Hjelm wrote: > > > What: Update ob1 to do the following: >- fallback on sen

[OMPI devel] RFC: ob1: fallback on put/send on rget failure

2012-03-15 Thread Nathan Hjelm
What: Update ob1 to do the following:
 - fall back to send after rdma_put_retries_limit failures of prepare_dst
 - fall back to put (single, non-pipelined) if the btl returns OMPI_ERR_NOT_AVAILABLE on a get transaction.
When: Timeout in about one week (Mar 22)
Why: Two reasons:
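
The two fallbacks proposed above can be sketched as a small decision function. This is only an illustration of the control flow described in the RFC, not the ob1 code: the enum names, choose_protocol(), and its parameters are hypothetical stand-ins, and the handling of any other get error is an assumption the RFC excerpt does not cover.

```c
/* Hypothetical stand-ins for illustration; not the real Open MPI definitions. */
enum rc { RC_SUCCESS, RC_NOT_AVAILABLE, RC_OTHER_ERROR };
enum protocol { PROTO_RGET, PROTO_PUT, PROTO_SEND, PROTO_NONE };

/*
 * Pick the transfer protocol after an rget attempt.
 *   get_rc               result of the get transaction on the btl
 *   prepare_dst_failures number of failed prepare_dst attempts so far
 *   put_retries_limit    plays the role of rdma_put_retries_limit
 */
enum protocol choose_protocol(enum rc get_rc,
                              int prepare_dst_failures,
                              int put_retries_limit)
{
    if (get_rc == RC_SUCCESS) {
        return PROTO_RGET;              /* rget worked, nothing to do */
    }
    if (get_rc == RC_NOT_AVAILABLE) {
        /* Too many prepare_dst failures: give up on put, use send. */
        if (prepare_dst_failures >= put_retries_limit) {
            return PROTO_SEND;
        }
        /* Otherwise try a single, non-pipelined put. */
        return PROTO_PUT;
    }
    /* Assumption: any other error is reported to the caller unchanged. */
    return PROTO_NONE;
}
```

With a limit of 3, a get that returns "not available" would select put while fewer than 3 prepare_dst attempts have failed, and send from the third failure onward.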

Re: [OMPI devel] poor btl sm latency

2012-03-15 Thread Jeffrey Squyres
On Mar 15, 2012, at 8:06 AM, Matthias Jurenz wrote: > We made a big step forward today! > > The kernel we used has a bug related to the shared L1 instruction cache in AMD > Bulldozer processors: > See > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03

Re: [OMPI devel] poor btl sm latency

2012-03-15 Thread Matthias Jurenz
We made a big step forward today! The kernel we used has a bug related to the shared L1 instruction cache in AMD Bulldozer processors: See http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dfb09f9b7ab03fd367740e541a5caf830ed56726 and http://developer.amd.com/Assets

[OMPI devel] v1.5 r26132 broken on multiple nodes?

2012-03-15 Thread Eugene Loh
I'm quitting for the day, but happened to notice that all our v1.5 MTT runs are failing with r26133, though tests ran fine as of r26129. Things run fine on-node, but if you run even just "hostname" on a remote node, the job fails with orted: Command not found I get this problem whether I inc