[OMPI users] v1.5.3-x64 does not work on Windows 7 workgroup

2011-05-18 Thread Jason Mackay

Hi all,
 
My thanks to all those involved for putting together this Windows binary 
release of OpenMPI!  I am hoping to use it in a small Windows based OpenMPI 
cluster at home.
 
Unfortunately, my experience so far has not exactly been trouble free. Because 
this release uses WMI, there are a number of settings that must be configured 
on the machines in order to get it to work, and these settings are not 
documented in the distribution at all. I have been experimenting with it on 
and off for over a week, and as soon as I solve one problem, another one 
arises.
 
Currently, after much searching, reading, and tinkering with DCOM settings 
etc..., I can remotely start processes on all my machines using mpirun but 
those processes cannot access network shares (e.g. for binary distribution) and 
HPL (which works on any one node) does not seem to work if I run it across 
multiple nodes, also indicating a network issue (CPU sits at 100% in all 
processes with no network traffic and never terminates). To eliminate 
permission issues that may be caused by UAC, I tried the same setup on two 
domain machines using an administrative account to launch and the behavior was 
the same. I have read that WMI processes cannot access network resources and I 
am at a loss for a solution to this newest of problems. If anyone knows how to 
make this work I would appreciate the help. I assume that someone has gotten 
this working and has the answers.
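 
For anyone trying to reproduce or diagnose this, a quick way to test remote WMI 
process creation outside of Open MPI is the stock wmic tool (the hostname and 
account below are placeholders, substitute your own):
 
wmic /node:"REMOTE_HOST" /user:"REMOTE_HOST\mpiuser" process call create "cmd.exe /c hostname"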
 
I have searched the mailing list archives and I found other users with similar 
problems but no clear guidance on the threads. Some threads make references to 
Microsoft KB articles but do not explicitly tell the user what needs to be 
done, leaving each new user to rediscover the tricks on their own. One thread 
made it appear that testing had only been done on Windows XP. Needless to say, 
security has changed dramatically in Windows since XP!
 
I would like to see OpenMPI for Windows be usable by a newcomer without all of 
this pain.
 
What would be fantastic would be:
1) a step-by-step procedure for how to get OpenMPI 1.5 working on Windows
  a) preferably in a bare Windows 7 workgroup environment with nothing else 
(i.e. no Microsoft Cluster Compute Pack, no domain etc...)
2) inclusion of these steps in the binary distribution
3) bonus points for a script which accomplishes these things automatically
 
If someone can help with (1), I would happily volunteer my time to work on (3).
 
Regards,
Jason 

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-18 Thread Ralph Castain
Okay, I finally had time to parse this and fix it. Thanks!

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

> Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
> beforehand, and then calling putenv() with the string duplicated from env[j]. 
>  Of course, if the strdup fails, then we bail out. 
> As for why it's suddenly a problem, I'm not quite as certain.   The problem 
> we do show is a double free, so someone has already freed that memory used by 
> putenv(), and I do know that while that used to be just flagged as an event 
> before, now we seem to be unable to continue past it.   Not sure if that is 
> our change or a library/system change. 
> PeterT
> 
> 
> Ralph Castain wrote:
>> On May 16, 2011, at 12:45 PM, Peter Thompson wrote:
>> 
>>  
>>> Hi Ralph,
>>> 
>>> We've had a number of user complaints about this.   Since it seems on the 
>>> face of it that it is a debugger issue, it may not have made its way back 
>>> here.  Is your objection that the patch basically aborts if it gets a bad 
>>> value?   I could understand that being a concern.   Of course, it aborts on 
>>> TotalView now if we attempt to move forward without this patch.
>>> 
>>>
>> 
>> No - my concern is that you appear to be removing the "putenv" calls. OMPI 
>> places some values into the local environment so the user can control 
>> behavior. Removing those causes problems.
>> 
>> What I need to know is why, after it has worked with TV for years, these 
>> putenv's are suddenly a problem. Is the problem occurring during shutdown? 
>> Or is this something that causes TV to break?
>> 
>> 
>>  
>>> I've passed your comment back to the engineer, with a suspicion about the 
>>> concerns about the abort, but if you have other objections, let me know.
>>> 
>>> Cheers,
>>> PeterT
>>> 
>>> 
>>> Ralph Castain wrote:
>>>
 That would be a problem, I fear. We need to push those envars into the 
 environment.
 
 Is there some particular problem causing what you see? We have no other 
 reports of this issue, and orterun has had that code forever.
 
 
 
 Sent from my iPad
 
 On May 11, 2011, at 2:05 PM, Peter Thompson  
 wrote:
 
   
> We've gotten a few reports of problems with memory debugging when using 
> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
> started after an MPI_Init.  However in the case where memory debugging is 
> enabled, things seemed to run away or fail.   My analysis showed that we 
> had a number of core files left over from the attempt, and all were 
> mpirun (or orterun) cores.   It seemed to be a regression on our part, 
> since testing seemed to indicate this worked okay before TotalView 
> 8.9.0-0, so I filed an internal bug and passed it to engineering.   After 
> giving our engineer a brief tutorial on how to build a debug version of 
> OpenMPI, he found what appears to be a problem in the code for orterun.c. 
>   He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
> 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.
> He doesn't subscribe to this list that I know of, so I offered to pass 
> this by the group.   Of course, I'm not sure if this is exactly the right 
> place to submit patches, but I'm sure you'd tell me where to put it if 
> I'm in the wrong here.   It's a short patch, so I'll cut and paste it, 
> and attach as well, since cut and paste can do weird things to formatting.
> 
> Credit goes to Ariel Burton for this patch.  Of course he used TotalView 
> to find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 
> 'totalview mpirun -a -np 4 ./foo'
> 
> Cheers,
> PeterT
> 
> 
> more ~/patches/anbs-patch
> *** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588 
>   }
>   if (NULL != env) {
>   size1 = opal_argv_count(env);
>   for (j = 0; j < size1; ++j) {
> ! putenv(env[j]);
>   }
>   }
>   /* All done */
> --- 1578,1600 
>   }
>   if (NULL != env) {
>   size1 = opal_argv_count(env);
>   for (j = 0; j < size1; ++j) {
> ! /* Use-after-Free error possible here.  putenv does not copy
> !    the string passed to it, and instead stores only the pointer.
> !    env[j] may be freed later, in which case the pointer
> !    in environ will now be left dangling into a deallocated
> !    region.
> !    So we make a copy of the variable.
> ! 
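
Since the quoted hunk gets cut off above, here is a small stand-alone sketch of 
the pattern the patch comment describes (an illustration with made-up variables, 
not the actual OMPI code): duplicate each string before handing it to putenv(), 
because putenv() keeps the pointer itself rather than copying the contents.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *env[] = { "FOO=bar", "BAZ=qux" };   /* stands in for orterun's env[] */
    int size1 = 2;                            /* stands in for opal_argv_count(env) */

    for (int j = 0; j < size1; ++j) {
        char *copy = strdup(env[j]);          /* heap copy that environ will point at */
        if (NULL == copy) {
            perror("strdup");                 /* bail out on failure, as the patch does */
            return 1;
        }
        putenv(copy);                         /* safe even if env[j] is freed later */
    }

    printf("FOO=%s\n", getenv("FOO"));
    return 0;
}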

Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt

2011-05-18 Thread Ralph Castain
I'm no Windozer, and our developer in that area is away for a while. However, 
looking over the code, I can see where this might be failing.

The Win allocator appears to be trying to connect to some cluster server - 
failing that, it aborts.

If you just want to launch locally, I would suggest adding "-mca ras ^ccp" to 
your mpirun cmd line. This will allow the system to pick up the local host and 
use it.

BTW: your orte_headnode_name value is supposed to be whatever is returned by 
the "hostname" command on your head node (probably the node where you are 
executing mpirun), not the literal "HEADNODE_NAME" string. You might try it 
with that correction as well.
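
For example, something along these lines (note that on the Windows command line 
the caret is cmd.exe's escape character, so it is safest to quote that argument; 
substitute whatever "hostname" prints on your head node, which from your error 
output looks like nbld-w08):

mpirun -mca ras "^ccp" mar_f_i_op.exe
mpirun -mca orte_headnode_name nbld-w08 mar_f_i_op.exe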


On May 18, 2011, at 4:26 AM, hi wrote:

> Any comment / suggestion on how to resolve this?
> 
> Thank you.
> -Hiral
> 
> On 5/12/11, hi  wrote:
>> Hi,
>> 
>> Clarifications:
>> - I have downloaded the pre-built OpenMPI_v1.5.3-x64 from open-mpi.org
>> - installed it on Windows 7
>> - and then copied the OpenMPI_v1.5.3-x64 directory from Windows 7 to
>> Windows Server 2008, into a different directory and also into the same
>> directory
>> 
>> Now on Windows Server 2008, I am observing these errors...
>> 
>> c:\ompi_tests\win64>mpirun mar_f_i_op.exe
>> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
>> ..\..\..\openmpi-1.5.3\orte\mca\ras\base\ras_base_allocate.c at line
>> 147
>> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
>> ..\..\..\openmpi-1.5.3\orte\mca\plm\base\plm_base_launch_support.c at
>> line 99
>> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
>> ..\..\..\openmpi-1.5.3\orte\mca\plm\ccp\plm_ccp_module.c at line 186
>> =
>> 
>> As suggested, I tried the following but nothing worked...
>> - copied to the same directory as it was in previous machine
>> - executed "mpirun -mca orte_headnode_name HEADNODE_NAME"  and "mpirun
>> -mca orte_headnode_name MYHOSTNAME"
>> - set OPENMPI_HOME and other OPAL_* env variables as follows...
>> 
>> set OPENMPI_HOME=C:\MPIs\OpenMPI_v1.5.3-x64
>> set OPAL_PREFIX=C:\MPIs\OpenMPI_v1.5.3-x64
>> set OPAL_EXEC_PREFIX=C:\MPIs\OpenMPI_v1.5.3-x64
>> set OPAL_BINDIR=C:\MPIs\OpenMPI_v1.5.3-x64\bin
>> set OPAL_SBINDIR=C:\MPIs\OpenMPI_v1.5.3-x64\sbin
>> set OPAL_LIBEXECDIR=C:\MPIs\OpenMPI_v1.5.3-x64\libexec
>> set OPAL_DATAROOTDIR=C:\MPIs\OpenMPI_v1.5.3-x64\share
>> set OPAL_DATADIR=C:\MPIs\OpenMPI_v1.5.3-x64\share
>> set OPAL_SYSCONFDIR=C:\MPIs\OpenMPI_v1.5.3-x64\etc
>> set OPAL_LOCALSTATEDIR=C:\MPIs\OpenMPI_v1.5.3-x64\etc
>> set OPAL_LIBDIR=C:\MPIs\OpenMPI_v1.5.3-x64\lib
>> set OPAL_INCLUDEDIR=C:\MPIs\OpenMPI_v1.5.3-x64\include
>> set OPAL_INFODIR=C:\MPIs\OpenMPI_v1.5.3-x64\share\info
>> set OPAL_MANDIR=C:\MPIs\OpenMPI_v1.5.3-x64\share\man
>> set OPAL_PKGDATADIR=C:\MPIs\OpenMPI_v1.5.3-x64\share\openmpi
>> set OPAL_PKGLIBDIR=C:\MPIs\OpenMPI_v1.5.3-x64\lib\openmpi
>> set OPAL_PKGINCLUDEDIR=C:\MPIs\OpenMPI_v1.5.3-x64\include\openmpi
>> 
>> Please correct me if I missed any other env variable.
>> 
>> Thank you.
>> -Hiral
>> 
>> 
>> On Wed, May 11, 2011 at 8:56 PM, Shiqing Fan  wrote:
>>> Hi,
>>> 
>>> The error message means that Open MPI couldn't allocate any compute node.
>>> It might be because the headnode wasn't discovered. You could try the option
>>> "-mca orte_headnode_name HEADNODE_NAME" in the mpirun command line (mpirun
>>> --help will show how to use it) .
>>> 
>>> And Jeff is also right, special care should be taken with the executable
>>> paths, and it's better to use a UNC path.
>>> 
>>> To clarify the path issue, if you just copy the OMPI dir to another
>>> computer, there might also be another problem that OMPI couldn't load the
>>> registry entries, as the registry entries were set during the installation
>>> phase on the specific computer. In 1.5.3, a overall env "OPENMPI_HOME"
>>> will do the work.
>>> 
>>> Regards,
>>> Shiqing
>>> - Original Message -
>>> From: Jeff Squyres 
>>> To: Open MPI Users 
>>> Sent: Wed, 11 May 2011 15:21:26 +0200 (CEST)
>>> Subject: Re: [OMPI users] Sorry! You were supposed to get help about: But
>>> couldn't open help-orterun.txt
>>> 
>>> On May 11, 2011, at 5:50 AM, Ralph Castain wrote:
>>> 
> Clarification: I installed pre-built OpenMPI_v1.5.3-x64 on Windows 7
> and copied this directory into Windows Server 2008.
>>> 
>>> Did you copy OMPI to the same directory tree that you built it?
>>> 
>>> OMPI hard-codes some directory names when it builds, and it expects to
>>> find that directory structure when it runs.  If you build OMPI with a
>>> --prefix of /foo, but then move it to /bar, various things may not work
>>> (like finding help messages, etc.) unless you set the OMPI/OPAL
>>> environment variables that tell OMPI where the files are actually
>>> located.
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-18 Thread Brock Palen
Well, I have a new wrench to throw into this situation.
We had a power failure at our datacenter that took down our entire system: 
nodes, switch, SM.
Now I am unable to reproduce the error with oob, the default ib flags, etc.

Does this shed any light on the issue?  It also makes it hard to debug the 
issue now, without being able to reproduce it.

Any thoughts?  Am I overlooking something? 

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 17, 2011, at 2:18 PM, Brock Palen wrote:

> Sorry typo 314 not 313, 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 17, 2011, at 2:02 PM, Brock Palen wrote:
> 
>> Thanks, I thought of looking at ompi_info after I sent that note, sigh.
>> 
>> SEND_INPLACE appears to improve performance for larger messages in my synthetic 
>> benchmarks, compared to regular SEND.  Also, it appears that SEND_INPLACE still 
>> allows our code to run.
>> 
>> We are working on getting devs access to our system and code. 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On May 16, 2011, at 11:49 AM, George Bosilca wrote:
>> 
>>> Here is the output of the "ompi_info --param btl openib":
>>> 
>>>   MCA btl: parameter "btl_openib_flags" (current value: <306>, data source: default value)
>>>            BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8,
>>>            RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr"
>>>            PML (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128;
>>>            flags only used by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512)
>>> 
>>> So the 305 flag value means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of 
>>> these flags are totally useless in the current version of Open MPI (DR is 
>>> not supported), so the only value that really matters is SEND | 
>>> HETEROGENEOUS_RDMA.
>>> 
>>> If you want to enable the send protocol, try first with SEND | SEND_INPLACE 
>>> (9); if that doesn't help, downgrade to SEND (1).
>>> 
>>> george.
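
For what it's worth, here is a tiny stand-alone decode of those numbers, using 
the flag values quoted from the ompi_info output above (the names simply mirror 
that output; they are not OMPI header definitions):

#include <stdio.h>

enum {
    SEND = 1, PUT = 2, GET = 4, SEND_INPLACE = 8,
    ACK = 16, CHECKSUM = 32, RDMA_MATCHED = 64, RDMA_COMPLETION = 128,
    HETEROGENEOUS_RDMA = 256, FAILOVER_SUPPORT = 512
};

int main(void)
{
    /* 256 + 32 + 16 + 1 = 305: send-only, no RDMA PUT/GET. */
    int suggested = HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND;

    /* 1 + 8 = 9: the SEND | SEND_INPLACE combination to try first. */
    int inplace = SEND | SEND_INPLACE;

    printf("305 decoded: %d\n", suggested);  /* prints 305 */
    printf("  9 decoded: %d\n", inplace);    /* prints 9   */
    return 0;
}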
>>> 
>>> On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
>>> 
 
 On May 16, 2011, at 8:53 AM, Brock Palen wrote:
 
> 
> 
> 
> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
> 
>> Hi,
>> 
>> Just out of curiosity - what happens when you add the following MCA 
>> option to your openib runs?
>> 
>> -mca btl_openib_flags 305
> 
> You Sir found the magic combination.
 
 :-)  - cool.
 
 Developers - does this smell like a registered memory availability hang?
 
> I verified this lets IMB and CRASH progress past their lockup points,
> I will have a user test this, 
 
 Please let us know what you find.
 
> Is this an ok option to put in our environment?  What does 305 mean?
 
 There may be a performance hit associated with this configuration, but if 
 it lets your users run, then I don't see a problem with adding it to your 
 environment.
 
 If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on 
 SEND.
 
 OpenFabrics gurus - please correct me if I'm wrong :-).
 
 Samuel Gutierrez
 Los Alamos National Laboratory
 
 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> Thanks,
>> 
>> Samuel Gutierrez
>> Los Alamos National Laboratory
>> 
>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>> 
>>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>>> 
 Jeff Squyres  writes:
 
> On May 11, 2011, at 3:21 PM, Dave Love wrote:
> 
>> We can reproduce it with IMB.  We could provide access, but we'd 
>> have to
>> negotiate with the owners of the relevant nodes to give you 
>> interactive
>> access to them.  Maybe Brock's would be more accessible?  (If you
>> contact me, I may not be able to respond for a few days.)
> 
> Brock has replied off-list that he, too, is able to reliably 
> reproduce the issue with IMB, and is working to get access for us.  
> Many thanks for your offer; let's see where Brock's access takes us.
 
 Good.  Let me know if we could be useful
 
>>> -- we have not closed this issue,
>> 
>> Which issue?   I couldn't find a relevant-looking one.
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2714
 
 Thanks.  In case it's useful info: it hangs for me with 1.5.3 & np=32 on
 ConnectX with more than one collective (I can't recall which).
>>> 

Re: [OMPI users] Segfault after malloc()?

2011-05-18 Thread Paul van der Walt
Okay cool, mine already breaks with P=2, so I'll try this soon. Thanks
for the impatient-idiot's-guide :)

On 18 May 2011 14:15, Jeff Squyres  wrote:
> If you're only running with a few MPI processes, you might be able to get 
> away with:
>
> mpirun -np 4 valgrind ./my_mpi_application
>
> If you run any more than that, the output gets too jumbled and you should 
> output each process' valgrind stdout to a different file with the --log-file 
> option (IIRC).
>
> I personally like these valgrind options:
>
> valgrind --num-callers=50 --db-attach=yes --tool=memcheck --leak-check=yes 
> --show-reachable=yes
>
>
>
> On May 18, 2011, at 8:49 AM, Paul van der Walt wrote:
>
>> Hi Jeff,
>>
>> Thanks for the response.
>>
>> On 18 May 2011 13:30, Jeff Squyres  wrote:
>>> *Usually* when we see segv's in calls to alloc, it means that there was 
>>> previously some kind of memory bug, such as an array overflow or something 
>>> like that (i.e., something that stomped on the memory allocation tables, 
>>> causing the next alloc to fail).
>>>
>>> Have you tried running your code through a memory-checking debugger?
>>
>> I sort-of tried with valgrind, but I'm not really sure how to
>> interpret the output (I'm not such a C-wizard). I'll have another look
>> a little later then and report back. I suppose I should RTFM on how to
>> properly invoke valgrind so it makes sense with an MPI program?
>>
>> Paul
>>
>> --
>> O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org


Re: [OMPI users] Segfault after malloc()?

2011-05-18 Thread Jeff Squyres
If you're only running with a few MPI processes, you might be able to get away 
with:

mpirun -np 4 valgrind ./my_mpi_application

If you run any more than that, the output gets too jumbled and you should 
output each process' valgrind stdout to a different file with the --log-file 
option (IIRC).

I personally like these valgrind options:

valgrind --num-callers=50 --db-attach=yes --tool=memcheck --leak-check=yes 
--show-reachable=yes
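
Putting those together, something like the following should work; valgrind's 
--log-file option expands %p to each process' PID, so every rank gets its own 
file (I've left out --db-attach here since it's interactive and doesn't mix 
well with many ranks writing to log files):

mpirun -np 4 valgrind --log-file=vg.out.%p --num-callers=50 --tool=memcheck \
    --leak-check=yes --show-reachable=yes ./my_mpi_application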



On May 18, 2011, at 8:49 AM, Paul van der Walt wrote:

> Hi Jeff,
> 
> Thanks for the response.
> 
> On 18 May 2011 13:30, Jeff Squyres  wrote:
>> *Usually* when we see segv's in calls to alloc, it means that there was 
>> previously some kind of memory bug, such as an array overflow or something 
>> like that (i.e., something that stomped on the memory allocation tables, 
>> causing the next alloc to fail).
>> 
>> Have you tried running your code through a memory-checking debugger?
> 
> I sort-of tried with valgrind, but I'm not really sure how to
> interpret the output (I'm not such a C-wizard). I'll have another look
> a little later then and report back. I suppose I should RTFM on how to
> properly invoke valgrind so it makes sense with an MPI program?
> 
> Paul
> 
> -- 
> O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Segfault after malloc()?

2011-05-18 Thread Paul van der Walt
Hi Jeff,

Thanks for the response.

On 18 May 2011 13:30, Jeff Squyres  wrote:
> *Usually* when we see segv's in calls to alloc, it means that there was 
> previously some kind of memory bug, such as an array overflow or something 
> like that (i.e., something that stomped on the memory allocation tables, 
> causing the next alloc to fail).
>
> Have you tried running your code through a memory-checking debugger?

I sort-of tried with valgrind, but I'm not really sure how to
interpret the output (I'm not such a C-wizard). I'll have another look
a little later then and report back. I suppose I should RTFM on how to
properly invoke valgrind so it makes sense with an MPI program?

Paul

-- 
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org


Re: [OMPI users] Segfault after malloc()?

2011-05-18 Thread Jeff Squyres
*Usually* when we see segv's in calls to alloc, it means that there was 
previously some kind of memory bug, such as an array overflow or something like 
that (i.e., something that stomped on the memory allocation tables, causing the 
next alloc to fail).

Have you tried running your code through a memory-checking debugger?
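
As a contrived illustration (not your code) of why the crash can surface far 
from the real bug:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(8);
    /* Bug: writes 16 bytes into an 8-byte buffer, stomping the heap
     * allocator's bookkeeping just past the block. */
    memset(a, 'x', 16);

    /* This later, innocent-looking allocation is where the program may
     * crash, because malloc walks the corrupted metadata. */
    char *b = malloc(32);
    (void)b;
    return 0;
}

A memory checker flags the bad memset at the point where it happens, which is 
much more useful than the eventual backtrace through malloc.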


On May 16, 2011, at 12:35 PM, Paul van der Walt wrote:

> Hi all,
> 
> I hope to provide enough information to make my problem clear. I
> have been debugging a lot after continually getting a segfault
> in my program, but then I decided to try and run it on another
> node, and it didn't segfault! The program which causes this
> strange behaviour can be downloaded with
> 
> $ git clone https://toothbr...@github.com/toothbrush/bsp-cg.git
> 
> It depends on bsponmpi (can be found at:
> http://bsponmpi.sourceforge.net/ ).
> 
> The machine on which I get a segfault is 
> Linux scarlatti 2.6.38-2-amd64 #1 SMP Thu Apr 7 04:28:07 UTC 2011 x86_64 
> GNU/Linux
> OpenMPI --version: mpirun (Open MPI) 1.4.3
> 
> And the error message is:
> [scarlatti:22100] *** Process received signal ***
> [scarlatti:22100] Signal: Segmentation fault (11)
> [scarlatti:22100] Signal code:  (128)
> [scarlatti:22100] Failing at address: (nil)
> [scarlatti:22100] [ 0] /lib/libpthread.so.0(+0xef60) [0x7f33ca69ef60]
> [scarlatti:22100] [ 1] /lib/libc.so.6(+0x74121) [0x7f33ca3a3121]
> [scarlatti:22100] [ 2] /lib/libc.so.6(__libc_malloc+0x70) [0x7f33ca3a5930]
> [scarlatti:22100] [ 3] src/cg(vecalloci+0x2c) [0x401789]
> [scarlatti:22100] [ 4] src/cg(bspmv_init+0x60) [0x40286a]
> [scarlatti:22100] [ 5] src/cg(bspcg+0x63b) [0x401f8b]
> [scarlatti:22100] [ 6] src/cg(main+0xd3) [0x402517]
> [scarlatti:22100] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f33ca34dc4d]
> [scarlatti:22100] [ 8] src/cg() [0x401609]
> [scarlatti:22100] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 22100 on node scarlatti exited on 
> signal 11 (Segmentation fault).
> --
> 
> The program can be invoked (after downloading the source,
> running make, and cd'ing into the project's root directory)
> like:
> 
> $ mpirun -np 2 src/cg examples/test.mtx-P2 examples/test.mtx-v2 
> examples/test.mtx-u2
> 
> The program seems to fail at src/bspedupack.c:vecalloci(), but
> printf'ing the pointer that's returned by malloc() looks okay.
> 
> The node on which the program DOES run without segfault is as
> follows: (OS X laptop)
> 
> Darwin purcell 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 
> 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386
> OpenMPI --version: mpirun (Open MPI) 1.2.8
> 
> Please inform if this is a real bug in OpenMPI, or if I'm coding
> something incorrectly. Note that I'm not asking anyone to debug
> my code for me, it's purely in case people want to try and
> reproduce my error locally. 
> 
> If I can provide more info, please advise. I'm not an MPI
> expert, unfortunately. 
> 
> Kind regards,
> 
> Paul van der Walt
> 
> -- 
> O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt

2011-05-18 Thread hi
Any comment / suggestion on how to resolve this?

Thank you.
-Hiral

On 5/12/11, hi  wrote:
> Hi,
>
> Clarifications:
> - I have downloaded the pre-built OpenMPI_v1.5.3-x64 from open-mpi.org
> - installed it on Windows 7
> - and then copied the OpenMPI_v1.5.3-x64 directory from Windows 7 to
> Windows Server 2008, into a different directory and also into the same
> directory
>
> Now on Windows Server 2008, I am observing these errors...
>
> c:\ompi_tests\win64>mpirun mar_f_i_op.exe
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\ras\base\ras_base_allocate.c at line
> 147
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\plm\base\plm_base_launch_support.c at
> line 99
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\plm\ccp\plm_ccp_module.c at line 186
> =
>
> As suggested, I tried the following but nothing worked...
> - copied to the same directory as it was in previous machine
> - executed "mpirun -mca orte_headnode_name HEADNODE_NAME"  and "mpirun
> -mca orte_headnode_name MYHOSTNAME"
> - set OPENMPI_HOME and other OPAL_* env variables as follows...
>
> set OPENMPI_HOME=C:\MPIs\OpenMPI_v1.5.3-x64
> set OPAL_PREFIX=C:\MPIs\OpenMPI_v1.5.3-x64
> set OPAL_EXEC_PREFIX=C:\MPIs\OpenMPI_v1.5.3-x64
> set OPAL_BINDIR=C:\MPIs\OpenMPI_v1.5.3-x64\bin
> set OPAL_SBINDIR=C:\MPIs\OpenMPI_v1.5.3-x64\sbin
> set OPAL_LIBEXECDIR=C:\MPIs\OpenMPI_v1.5.3-x64\libexec
> set OPAL_DATAROOTDIR=C:\MPIs\OpenMPI_v1.5.3-x64\share
> set OPAL_DATADIR=C:\MPIs\OpenMPI_v1.5.3-x64\share
> set OPAL_SYSCONFDIR=C:\MPIs\OpenMPI_v1.5.3-x64\etc
> set OPAL_LOCALSTATEDIR=C:\MPIs\OpenMPI_v1.5.3-x64\etc
> set OPAL_LIBDIR=C:\MPIs\OpenMPI_v1.5.3-x64\lib
> set OPAL_INCLUDEDIR=C:\MPIs\OpenMPI_v1.5.3-x64\include
> set OPAL_INFODIR=C:\MPIs\OpenMPI_v1.5.3-x64\share\info
> set OPAL_MANDIR=C:\MPIs\OpenMPI_v1.5.3-x64\share\man
> set OPAL_PKGDATADIR=C:\MPIs\OpenMPI_v1.5.3-x64\share\openmpi
> set OPAL_PKGLIBDIR=C:\MPIs\OpenMPI_v1.5.3-x64\lib\openmpi
> set OPAL_PKGINCLUDEDIR=C:\MPIs\OpenMPI_v1.5.3-x64\include\openmpi
>
> Please correct me if I missed any other env variable.
>
> Thank you.
> -Hiral
>
>
> On Wed, May 11, 2011 at 8:56 PM, Shiqing Fan  wrote:
>> Hi,
>>
>> The error message means that Open MPI couldn't allocate any compute node.
>> It might be because the headnode wasn't discovered. You could try the option
>> "-mca orte_headnode_name HEADNODE_NAME" in the mpirun command line (mpirun
>> --help will show how to use it) .
>>
>> And Jeff is also right, special care should be taken with the executable
>> paths, and it's better to use a UNC path.
>>
>> To clarify the path issue, if you just copy the OMPI dir to another
>> computer, there might also be another problem that OMPI couldn't load the
>> registry entries, as the registry entries were set during the installation
>> phase on the specific computer. In 1.5.3, an overall env "OPENMPI_HOME"
>> will do the work.
>>
>> Regards,
>> Shiqing
>> - Original Message -
>> From: Jeff Squyres 
>> To: Open MPI Users 
>> Sent: Wed, 11 May 2011 15:21:26 +0200 (CEST)
>> Subject: Re: [OMPI users] Sorry! You were supposed to get help about: But
>> couldn't open help-orterun.txt
>>
>> On May 11, 2011, at 5:50 AM, Ralph Castain wrote:
>>
 Clarification: I installed pre-built OpenMPI_v1.5.3-x64 on Windows 7
 and copied this directory into Windows Server 2008.
>>
>> Did you copy OMPI to the same directory tree that you built it?
>>
>> OMPI hard-codes some directory names when it builds, and it expects to
>> find that directory structure when it runs.  If you build OMPI with a
>> --prefix of /foo, but then move it to /bar, various things may not work
>> (like finding help messages, etc.) unless you set the OMPI/OPAL
>> environment variables that tell OMPI where the files are actually
>> located.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>