Re: [OMPI users] users Digest, Vol 1658, Issue 2

2010-08-13 Thread Joshua Hursey
I probably won't have an opportunity to work on reproducing this on 1.4.2.
The trunk has a number of bug fixes that probably will not be backported to the
1.4 series (things have changed too much since that branch). So I would suggest
trying the 1.5 series.

-- Josh

On Aug 13, 2010, at 10:12 AM,   
wrote:

> Josh
> 
> I am having problems compiling the sources from the latest trunk. It
> complains of libgomp.spec missing even though that file exists on my
> system. I will see if I have to change any other environment variables
> to have a successful compilation. I will keep you posted.
> 
> BTW, were you successful in reproducing the problem on a system with
> OpenMPI 1.4.2?
> 
> Thanks
> Ananda
> -Original Message-
> Date: Thu, 12 Aug 2010 09:12:26 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users 
> Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org>
> Content-Type: text/plain; charset=us-ascii
> 
> Can you try this with the current trunk (r23587 or later)?
> 
> I just added a number of new features and bug fixes, and I would be
> interested to see if it fixes the problem. In particular I suspect that
> this might be related to the Init/Finalize bounding of the checkpoint
> region.
> 
> -- Josh
> 
> On Aug 10, 2010, at 2:18 PM, 
>  wrote:
> 
>> Josh
>> 
>> Please find attached the Python program that reproduces the hang that
>> I described. The initial part of this file describes the prerequisite
>> modules and the steps to reproduce the problem. Please let me know if
>> you have any questions about reproducing the hang.
>> 
>> Please note that if I add the following lines at the end of the program
>> (in case sleep_time is True), the problem disappears, i.e., the program
>> resumes successfully after successful completion of the checkpoint.
>> # Add following lines at the end for sleep_time is True
>> else:
>>  time.sleep(0.1)
>> # End of added lines
>> 
>> 
>> Thanks a lot for your time in looking into this issue.
>> 
>> Regards
>> Ananda
>> 
>> Ananda B Mudar, PMP
>> Senior Technical Architect
>> Wipro Technologies
>> Ph: 972 765 8093
>> ananda.mu...@wipro.com
>> 
>> 
>> -Original Message-
>> Date: Mon, 9 Aug 2010 16:37:58 -0400
>> From: Joshua Hursey 
>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>> To: Open MPI Users 
>> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
>> Content-Type: text/plain; charset=windows-1252
>> 
>> I have not tried to checkpoint an mpi4py application, so I cannot say
>> for sure if it works or not. You might be hitting something with the
>> Python runtime interacting in an odd way with either Open MPI or BLCR.
>> 
>> Can you attach a debugger and get a backtrace on a stuck checkpoint?
>> That might show us where things are held up.
>> 
>> -- Josh
>> 
>> 
>> On Aug 9, 2010, at 4:04 PM, 
>>  wrote:
>> 
>>> Hi
>>> 
>>> I have integrated mpi4py with Open MPI 1.4.2 that was built with BLCR
>>> 0.8.2. When I run ompi-checkpoint on a program written using mpi4py, I
>>> see that the program sometimes doesn't resume after successful checkpoint
>>> creation. This doesn't always occur: most of the time the program resumes
>>> after successful checkpoint creation and completes successfully. Has anyone
>>> tested the checkpoint/restart functionality with mpi4py programs? Are there
>>> any best practices that I should keep in mind while checkpointing mpi4py
>>> programs?
>>> 
>>> Thanks for your time
>>> -  Ananda

Re: [OMPI users] users Digest, Vol 1658, Issue 2

2010-08-13 Thread ananda.mudar
Josh

I am having problems compiling the sources from the latest trunk. It
complains of libgomp.spec missing even though that file exists on my
system. I will see if I have to change any other environment variables
to have a successful compilation. I will keep you posted.

BTW, were you successful in reproducing the problem on a system with
OpenMPI 1.4.2?

Thanks
Ananda
-Original Message-
Date: Thu, 12 Aug 2010 09:12:26 -0400
From: Joshua Hursey 
Subject: Re: [OMPI users] Checkpointing mpi4py program
To: Open MPI Users 
Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org>
Content-Type: text/plain; charset=us-ascii

Can you try this with the current trunk (r23587 or later)?

I just added a number of new features and bug fixes, and I would be
interested to see if it fixes the problem. In particular I suspect that
this might be related to the Init/Finalize bounding of the checkpoint
region.

-- Josh

On Aug 10, 2010, at 2:18 PM, 
 wrote:

> Josh
> 
> Please find attached the Python program that reproduces the hang that
> I described. The initial part of this file describes the prerequisite
> modules and the steps to reproduce the problem. Please let me know if
> you have any questions about reproducing the hang.
> 
> Please note that if I add the following lines at the end of the program
> (in case sleep_time is True), the problem disappears, i.e., the program
> resumes successfully after successful completion of the checkpoint.
> # Add following lines at the end for sleep_time is True
> else:
>   time.sleep(0.1)
> # End of added lines
> 
> 
> Thanks a lot for your time in looking into this issue.
> 
> Regards
> Ananda
> 
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
> 
> 
> -Original Message-
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users 
> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
> Content-Type: text/plain; charset=windows-1252
> 
> I have not tried to checkpoint an mpi4py application, so I cannot say
> for sure if it works or not. You might be hitting something with the
> Python runtime interacting in an odd way with either Open MPI or BLCR.
> 
> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> That might show us where things are held up.
> 
> -- Josh
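
As a complement to the debugger backtrace requested above, here is a minimal sketch of one way to capture Python-level stack traces from a stuck mpi4py rank. It assumes the standard-library faulthandler module (Python 3.3 or later on a POSIX system), which is not something mentioned in this thread, and the helper name enable_traceback_on_signal is hypothetical. It shows only Python frames, so a native debugger such as gdb is still needed to see where Open MPI or BLCR itself is blocked.

```python
import faulthandler
import signal
import sys


def enable_traceback_on_signal(signum=signal.SIGUSR1, stream=None):
    """Dump the Python stack of every thread to `stream` when `signum` arrives.

    Hypothetical helper: call it once near the top of the mpi4py program.
    `stream` must be a real file with a file descriptor and must stay open.
    """
    if stream is None:
        stream = sys.stderr
    # faulthandler installs a C-level handler, so the dump is written even
    # when the interpreter is blocked inside a native call.
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum


# Example (assumption: called at program start, before any MPI work):
# enable_traceback_on_signal()
```

With this in place, running `kill -USR1 <pid>` against a hung rank should print each thread's Python stack to stderr without stopping the process. Whether SIGUSR1 is free to use alongside the checkpoint machinery is untested here; pick another signal if it conflicts.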
> 
> 
> On Aug 9, 2010, at 4:04 PM, 
>  wrote:
> 
>> Hi
>> 
>> I have integrated mpi4py with Open MPI 1.4.2 that was built with BLCR
>> 0.8.2. When I run ompi-checkpoint on a program written using mpi4py, I
>> see that the program sometimes doesn't resume after successful checkpoint
>> creation. This doesn't always occur: most of the time the program resumes
>> after successful checkpoint creation and completes successfully. Has anyone
>> tested the checkpoint/restart functionality with mpi4py programs? Are there
>> any best practices that I should keep in mind while checkpointing mpi4py
>> programs?
>> 
>> Thanks for your time
>> -  Ananda