Re: [OMPI users] Checkpoint/Restart error

2010-02-01 Thread Josh Hursey
Thanks for the bug report. There are a couple of places in the code
that, in a sense, hard code '/tmp' as the temporary directory. It
shouldn't be too hard to fix since there is a common function used in
the code to discover the 'true' temporary directory (which defaults
to /tmp). Of course there might be other complexities once I dig into
the problem.
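
As a rough illustration of the idea (a minimal shell sketch, not Open MPI's
actual code): resolve the temporary directory in one place, honoring $TMPDIR
and defaulting to /tmp, and build paths from that result instead of from a
literal '/tmp'. The name resolve_tmpdir is made up for this example, and the
real function may consult more than $TMPDIR.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI function): print $TMPDIR when it
# is set and non-empty, otherwise fall back to /tmp.
resolve_tmpdir() {
    printf '%s\n' "${TMPDIR:-/tmp}"
}

# Hard-coded lookup: always /tmp, regardless of $TMPDIR.
hardcoded_base="/tmp"

# Lookup through the helper: follows $TMPDIR when it is set.
resolved_base=$(resolve_tmpdir)

echo "hard-coded base: $hardcoded_base"
echo "resolved base:   $resolved_base"
```

When the two lines differ, a tool using the hard-coded base would look in a
different place than one using the resolved base, which matches the symptom
reported in this thread.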


I don't know when I will get to this, but I filed a ticket about this  
if you want to track it:

  https://svn.open-mpi.org/trac/ompi/ticket/2208

Thanks again,
Josh

On Jan 29, 2010, at 4:41 PM, Jazcek Braden wrote:


Josh,

I was following this thread as I had similar symptoms and discovered a
peculiar error.  When I launch the program, Open MPI follows the
$TMPDIR environment variable and puts the session information in the
$TMPDIR directory.  However, ompi-checkpoint seems to require the
sessions file to be in /tmp, ignoring $TMPDIR.  Is this by design,
and what would I have to do to get it to obey the $TMPDIR environment
variable?

--
Jazcek Braden
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Checkpoint/Restart error

2010-01-29 Thread Jazcek Braden
Josh,

I was following this thread as I had similar symptoms and discovered a
peculiar error.  When I launch the program, Open MPI follows the
$TMPDIR environment variable and puts the session information in the
$TMPDIR directory.  However, ompi-checkpoint seems to require the
sessions file to be in /tmp, ignoring $TMPDIR.  Is this by design,
and what would I have to do to get it to obey the $TMPDIR environment
variable?

-- 
Jazcek Braden


Re: [OMPI users] Checkpoint/Restart error

2010-01-25 Thread Josh Hursey
I tested the 1.4.1 release, and everything worked fine for me (tested  
a few different configurations of nodes/environments).


The ompi-checkpoint error you cited is usually caused by one of two
things:
 - The PID specified is wrong (which I don't think is the case here).
 - The session directory cannot be found in /tmp.

So I think the problem is the latter. The session directory looks  
something like:

  /tmp/openmpi-sessions-USERNAME@LOCALHOST_0

Within this directory the mpirun process places its contact
information. ompi-checkpoint uses this contact information to connect  
to the job. If it cannot find it, then it errors out. (We definitely  
need a better error message here. I filed a ticket [1]).


We usually do not recommend running Open MPI as a root user. So I  
would strongly recommend that you do not run as a root user.


With a regular user, check the location of the session directory. Make
sure that it is in /tmp on the node where 'mpirun' and
'ompi-checkpoint' are run.
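
The check described above can be sketched in shell as follows. This is only
an illustration under stated assumptions: the helper name session_dir is
hypothetical, the path follows the USERNAME@LOCALHOST_0 form shown earlier,
and it assumes the host part matches the output of 'uname -n', which may not
hold on every system.

```shell
#!/bin/sh
# Hypothetical helper: build a session directory path of the form
#   <base>/openmpi-sessions-USERNAME@LOCALHOST_0
session_dir() {
    # $1 = base directory, $2 = user name, $3 = host name
    printf '%s/openmpi-sessions-%s@%s_0\n' "$1" "$2" "$3"
}

# Running as root is discouraged, so warn about it (but keep going).
if [ "$(id -u)" -eq 0 ]; then
    echo "warning: running as root; use a regular user instead" >&2
fi

# Assumption: the host part of the name matches 'uname -n'.
dir=$(session_dir /tmp "$(id -un)" "$(uname -n)")
if [ -d "$dir" ]; then
    echo "session directory found: $dir"
else
    echo "session directory missing: $dir"
fi
```

Run it on the node where mpirun was started, while the job is still running;
a "missing" result there would be consistent with the second cause listed
above.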


-- Josh

[1] https://svn.open-mpi.org/trac/ompi/ticket/2189

On Jan 25, 2010, at 5:48 AM, Andreea Costea wrote:


So? Anyone? Any clue?

To summarize:
- installed OpenMPI 1.4.1 on a fresh CentOS 5
- mpirun works but ompi-checkpoint throws this error:
ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
- on another VM I have OpenMPI 1.3.3 installed. Checkpointing works
fine as guest but gives the previously mentioned error as root. Both
root and guest show the same output from "ompi_info --param all all"
except for $HOME (which only matters for mca_component_path,
mca_param_files, snapc_base_global_snapshot_dir)



Thanks,
Andreea


On Tue, Jan 19, 2010 at 9:01 PM, Andreea Costea  wrote:
I noticed one more thing. As I still have some VMs that have OpenMPI
version 1.3.3 installed, I started to use those machines until I fix
the problem with 1.4.1. While checkpointing on one of these VMs I
realized that checkpointing as a guest works fine and checkpointing
as root outputs the same error as in 1.4.1: ORTE_ERROR_LOG:
Not found in file orte-checkpoint.c at line 405


I logged the output of "ompi_info --param all all", which I ran for
root and for another user, and the only differences were in these
parameters:


mca_component_path
mca_param_files
snapc_base_global_snapshot_dir

All 3 params differ because of the $HOME.
One more thing: I don't have the directory $HOME/.openmpi

Ideas?

Thanks,
Andreea





On Tue, Jan 19, 2010 at 12:51 PM, Andreea Costea  wrote:
Well... I decided to install a fresh OS to be sure that there is no  
OpenMPI version conflict. So I formatted one of my VMs, did a fresh  
CentOS install, installed BLCR 0.8.2 and OpenMPI 1.4.1 and the  
result: the same. mpirun works but ompi-checkpoint has that error at  
line 405:


[[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at  
line 405


As for the files remaining after uninstalling: Jeff, you were right.
There is no file left, just some empty directories.


What might be causing that ORTE_ERROR_LOG error?

Thanks,
Andreea

On Fri, Jan 15, 2010 at 11:47 PM, Andreea Costea  wrote:

It's almost midnight here, so I left home, but I will try it tomorrow.
There were some directories left after "make uninstall". I will give  
more details tomorrow.


Thanks Jeff,
Andreea


On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres   
wrote:

On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:

> - I wanted to update to version 1.4.1 and I uninstalled the previous
> version like this: make uninstall, and then manually deleted all the
> left over files. The directory where I installed was /usr/local


I'll let Josh answer your CR questions, but I did want to ask about  
this point.  AFAIK, "make uninstall" removes *all* Open MPI files.   
For example:


-
[7:25] $ cd /path/to/my/OMPI/tree
[7:25] $ make install > /dev/null
[7:26] $ find /tmp/bogus/ -type f | wc
   646 646   28082
[7:26] $ make uninstall > /dev/null
[7:27] $ find /tmp/bogus/ -type f | wc
 0   0   0
[7:27] $
-

I realize that some *directories* are left in $prefix, but there  
should be no *files* left.  Are you seeing something different?


--
Jeff Squyres
jsquy...@cisco.com






Re: [OMPI users] Checkpoint/Restart error

2010-01-25 Thread Andreea Costea
So? Anyone? Any clue?

To summarize:
- installed OpenMPI 1.4.1 on a fresh CentOS 5
- mpirun works but ompi-checkpoint throws this error:
ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
- on another VM I have OpenMPI 1.3.3 installed. Checkpointing works fine as
guest but gives the previously mentioned error as root. Both root and guest
show the same output from "ompi_info --param all all" except for $HOME
(which only matters for mca_component_path, mca_param_files,
snapc_base_global_snapshot_dir)


Thanks,
Andreea


On Tue, Jan 19, 2010 at 9:01 PM, Andreea Costea wrote:

> I noticed one more thing. As I still have some VMs that have OpenMPI
> version 1.3.3 installed, I started to use those machines until I fix the
> problem with 1.4.1. While checkpointing on one of these VMs I realized
> that checkpointing as a guest works fine and checkpointing as root outputs
> the same error as in 1.4.1: ORTE_ERROR_LOG: Not found in file
> orte-checkpoint.c at line 405
>
> I logged the output of "ompi_info --param all all", which I ran for root
> and for another user, and the only differences were in these parameters:
>
> mca_component_path
> mca_param_files
> snapc_base_global_snapshot_dir
>
> All 3 params differ because of the $HOME.
> One more thing: I don't have the directory $HOME/.openmpi
>
> Ideas?
>
> Thanks,
> Andreea
>
>
>
>
>
> On Tue, Jan 19, 2010 at 12:51 PM, Andreea Costea 
> wrote:
>
>> Well... I decided to install a fresh OS to be sure that there is no
>> OpenMPI version conflict. So I formatted one of my VMs, did a fresh CentOS
>> install, installed BLCR 0.8.2 and OpenMPI 1.4.1 and the result: the same.
>> mpirun works but ompi-checkpoint has that error at line 405:
>>
>> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
>> 405
>>
>> As for the files remaining after uninstalling: Jeff, you were right. There
>> is no file left, just some empty directories.
>>
>> What might be causing that ORTE_ERROR_LOG error?
>>
>> Thanks,
>> Andreea
>>
>> On Fri, Jan 15, 2010 at 11:47 PM, Andreea Costea 
>> wrote:
>>
>>> It's almost midnight here, so I left home, but I will try it tomorrow.
>>> There were some directories left after "make uninstall". I will give more
>>> details tomorrow.
>>>
>>> Thanks Jeff,
>>> Andreea
>>>
>>>
>>> On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres wrote:
>>>
>>>> On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:
>>>>
>>>> > - I wanted to update to version 1.4.1 and I uninstalled the previous
>>>> > version like this: make uninstall, and then manually deleted all the left
>>>> > over files. The directory where I installed was /usr/local
>>>>
>>>> I'll let Josh answer your CR questions, but I did want to ask about this
>>>> point.  AFAIK, "make uninstall" removes *all* Open MPI files.  For example:
>>>>
>>>> -
>>>> [7:25] $ cd /path/to/my/OMPI/tree
>>>> [7:25] $ make install > /dev/null
>>>> [7:26] $ find /tmp/bogus/ -type f | wc
>>>> 646 646   28082
>>>> [7:26] $ make uninstall > /dev/null
>>>> [7:27] $ find /tmp/bogus/ -type f | wc
>>>>   0   0   0
>>>> [7:27] $
>>>> -
>>>>
>>>> I realize that some *directories* are left in $prefix, but there should
>>>> be no *files* left.  Are you seeing something different?
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>>
>>>
>>>
>>
>


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Jeff Squyres
On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:

> - I wanted to update to version 1.4.1 and I uninstalled the previous version like
> this: make uninstall, and then manually deleted all the left over files. The
> directory where I installed was /usr/local

I'll let Josh answer your CR questions, but I did want to ask about this point. 
 AFAIK, "make uninstall" removes *all* Open MPI files.  For example:

-
[7:25] $ cd /path/to/my/OMPI/tree
[7:25] $ make install > /dev/null
[7:26] $ find /tmp/bogus/ -type f | wc
646 646   28082
[7:26] $ make uninstall > /dev/null
[7:27] $ find /tmp/bogus/ -type f | wc
  0   0   0
[7:27] $ 
-

I realize that some *directories* are left in $prefix, but there should be no 
*files* left.  Are you seeing something different?
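
To make the files-versus-directories distinction easy to check, here is a
small shell sketch. The helper names count_files and count_dirs and the
CHECK_PREFIX variable are made up for this example; /tmp/bogus is simply the
prefix used in the transcript above.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting an install prefix after "make uninstall".

# Number of regular files anywhere under $1.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Number of directories under $1, counting $1 itself.
count_dirs() {
    find "$1" -type d | wc -l | tr -d ' '
}

# CHECK_PREFIX is an illustrative variable, not something Open MPI reads.
prefix=${CHECK_PREFIX:-/tmp/bogus}
if [ -d "$prefix" ]; then
    echo "files left:       $(count_files "$prefix")"
    echo "directories left: $(count_dirs "$prefix")"
else
    echo "prefix $prefix does not exist"
fi
```

A file count of zero with a non-zero directory count is the outcome described
above: only empty directories remain.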

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
I don't know what else I should try, because it worked on 1.3.3 doing
exactly the same steps. I tried to install it both with an active eth
interface and an inactive one. I am running on a virtual machine that has
CentOS as its OS.

Any suggestions?

Thanks,
Andreea

On Fri, Jan 15, 2010 at 9:07 PM, Andreea Costea wrote:

> I tried the new version that was uploaded today. I still have that error,
> except that now it is at line 405 instead of 399.
>
> Maybe if I give more details:
> - I first had OpenMPI version 1.3.3 with BLCR installed: mpirun,
> ompi-checkpoint and ompi-restart worked with that version.
> - I wanted to update to version 1.4.1 and I uninstalled the previous version
> like this: make uninstall, and then manually deleted all the left over
> files. The directory where I installed was /usr/local
> - I installed 1.4.1 in the same directory: /usr/local. Paths set
> correctly to /usr/local/bin and /usr/local/lib
> - mpirun works, ompi-checkpoint gives the following error:
> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
> 405
> HNP with PID 7899 Not found!
>
> I would appreciate any help,
> Andreea
>
>
>
> On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:
>
>> Hi...
>> still not working. Though I uninstalled OpenMPI with make uninstall and I
>> manually deleted all other files, I still have the same error when
>> checkpointing.
>>
>> Any idea?
>>
>> Thanks,
>> Andreea
>>
>>
>>
>> On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
>>
>>> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>>>
>>> > Hi,
>>> >
>>> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
>>> downloaded today. When I want to checkpoint I am having the following error
>>> message:
>>> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at
>>> line 399
>>> > HNP with PID 2337 Not found!
>>>
>>> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
>>> no error at line 399, in 1.3.3 there is. Check your installation of Open
>>> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
>>> problems.
>>>
>>> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
>>> your path/lib_path any longer.
>>>
>>> -- Josh
>>>
>>> >
>>> > I tried the same thing with version 1.3.3 and it works perfectly.
>>> >
>>> > Any idea why?
>>> >
>>> > thanks,
>>> > Andreea
>>>
>>
>>
>


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
I tried the new version that was uploaded today. I still have that error,
except that now it is at line 405 instead of 399.

Maybe if I give more details:
- I first had OpenMPI version 1.3.3 with BLCR installed: mpirun,
ompi-checkpoint and ompi-restart worked with that version.
- I wanted to update to version 1.4.1 and I uninstalled the previous version
like this: make uninstall, and then manually deleted all the left over
files. The directory where I installed was /usr/local
- I installed 1.4.1 in the same directory: /usr/local. Paths set correctly
to /usr/local/bin and /usr/local/lib
- mpirun works, ompi-checkpoint gives the following error:
[[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
405
HNP with PID 7899 Not found!

I would appreciate any help,
Andreea


On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:

> Hi...
> still not working. Though I uninstalled OpenMPI with make uninstall and I
> manually deleted all other files, I still have the same error when
> checkpointing.
>
> Any idea?
>
> Thanks,
> Andreea
>
>
>
> On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
>
>> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>>
>> > Hi,
>> >
>> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
>> downloaded today. When I want to checkpoint I am having the following error
>> message:
>> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at
>> line 399
>> > HNP with PID 2337 Not found!
>>
>> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
>> no error at line 399, in 1.3.3 there is. Check your installation of Open
>> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
>> problems.
>>
>> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
>> your path/lib_path any longer.
>>
>> -- Josh
>>
>> >
>> > I tried the same thing with version 1.3.3 and it works perfectly.
>> >
>> > Any idea why?
>> >
>> > thanks,
>> > Andreea
>>
>
>


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
Hi...
still not working. Though I uninstalled OpenMPI with make uninstall and I
manually deleted all other files, I still have the same error when
checkpointing.

Any idea?

Thanks,
Andreea


On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:

> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>
> > Hi,
> >
> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
> downloaded today. When I want to checkpoint I am having the following error
> message:
> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
> 399
> > HNP with PID 2337 Not found!
>
> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
> no error at line 399, in 1.3.3 there is. Check your installation of Open
> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
> problems.
>
> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
> your path/lib_path any longer.
>
> -- Josh
>
> >
> > I tried the same thing with version 1.3.3 and it works perfectly.
> >
> > Any idea why?
> >
> > thanks,
> > Andreea
>


Re: [OMPI users] Checkpoint/Restart error

2010-01-14 Thread Joshua Hursey
On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:

> Hi,
> 
> I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have 
> downloaded today. When I want to checkpoint I am having the following error 
> message:
> [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
> HNP with PID 2337 Not found! 

This looks like an error coming from the 1.3.3 install. In 1.4.1 there is no 
error at line 399, in 1.3.3 there is. Check your installation of Open MPI, I 
bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected problems.

Try a clean installation of 1.4.1 and double check that 1.3.3 is not in your 
path/lib_path any longer.
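
One way to double check the path is to list every copy of each tool that the
shell could find, in search order. The sketch below is only an illustration;
find_all_on_path is a made-up helper name, and it looks at $PATH only (the
library path would need a separate check).

```shell
#!/bin/sh
# Hypothetical helper: print every executable named $1 found in the
# colon-separated directory list $2 (defaults to $PATH), in search order.
find_all_on_path() {
    name=$1
    list=${2:-$PATH}
    old_ifs=$IFS
    IFS=:
    for d in $list; do
        # An empty PATH entry means the current directory.
        [ -n "$d" ] || d=.
        if [ -f "$d/$name" ] && [ -x "$d/$name" ]; then
            printf '%s\n' "$d/$name"
        fi
    done
    IFS=$old_ifs
}

# Report all copies of the tools mentioned in this thread.
for tool in mpirun ompi-checkpoint ompi-restart; do
    echo "== $tool"
    find_all_on_path "$tool"
done
```

More than one line under a tool name would suggest two installations are
visible at once, which is the kind of 1.3.3/1.4.1 mix suspected above.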

-- Josh

> 
> I tried the same thing with version 1.3.3 and it works perfectly.
> 
> Any idea why?
> 
> thanks,
> Andreea




[OMPI users] Checkpoint/Restart error

2010-01-14 Thread Andreea Costea
Hi,

I wanted to try the C/R feature in OpenMPI version 1.4.1, which I
downloaded today. When I try to checkpoint, I get the following error
message:
[[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
HNP with PID 2337 Not found!

I tried the same thing with version 1.3.3 and it works perfectly.

Any idea why?

thanks,
Andreea