Josh

I have a few more observations that I want to share with you.

I modified the earlier C program slightly so that it makes two MPI_Bcast() calls 
inside the while loop (with a 10-second interval between printed messages). The 
issue of MPI_Bcast() failing with the ERR_TRUNCATE error resurfaces when I 
checkpoint this program. Interestingly, the two MPI_Bcast() calls broadcast 
different data types: the first broadcasts an integer variable and the second 
broadcasts a floating-point (double) variable.

If I change these two MPI_Bcast() calls to broadcast the same data type, i.e. 
either broadcast two different integer variables one after another or two 
different floating-point variables one after another, the program continues 
successfully. The checkpoint command succeeds every time and the program 
resumes after each successful checkpoint, as sketched below.
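
To make the difference concrete, here is a minimal sketch of the two patterns 
(sleep_time and sleep_time1 are the variables from the attached program; 
other_int is only an illustrative second integer variable, not in the program):

    /* Failing pattern: two successive MPI_Bcast() calls with different datatypes */
    MPI_Bcast(&sleep_time,  1, MPI_INT,    0, MPI_COMM_WORLD);
    MPI_Bcast(&sleep_time1, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Working pattern: two successive MPI_Bcast() calls with the same datatype */
    MPI_Bcast(&sleep_time, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&other_int,  1, MPI_INT, 0, MPI_COMM_WORLD);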

When MPI_Bcast() failed with the ERR_TRUNCATE error message, I captured the 
output after setting "--mca crcp_base_verbose 20 --mca orte_debug_verbose 20". 
I have filtered out the messages before and after the point where the error 
occurred so that the log is free of clutter.
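
For reference, the full command I used was along these lines (the process count 
and the executable name are placeholders for my actual setup):

    mpirun -np 2 -am ft-enable-cr --mca crcp_base_verbose 20 \
        --mca orte_debug_verbose 20 ./bcast_test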

I am attaching both the C program (slightly modified from the earlier one I 
shared with you) and the filtered output log file to this thread. I hope these 
messages show you something that might be going wrong.

Please let me know if you need any additional information on this issue.

Thanks
Ananda
________________________________

Sent: Wed 8/18/2010 4:43 PM
To: 'us...@open-mpi.org'
Subject: Re: [OMPI users] Checkpointing mpi4py program (Probably bcast issue)



Josh



Thanks for addressing the issue. I will try the new version that has your fix 
and let you know.



BTW, I have also been in touch with the mpi4py team to debug this issue. 
According to them, mpi4py's bcast() is implemented with two collective calls: 
the first is an MPI_Bcast() of a single integer and the next is an MPI_Bcast() 
of a chunk of memory. Since the problem I was running into occurred during the 
MPI_Bcast() calls, I mimicked the mpi4py logic and wrote a program in C. I have 
attached it to this mail for your reference.
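
My understanding of that logic, expressed in C, is roughly the following (a 
sketch only: the variable names are mine, the size of 64 bytes is just an 
example, and the real mpi4py internals may differ in detail; it assumes the 
usual MPI_Comm_rank() setup plus <stdlib.h> for calloc/free):

    /* First broadcast the size of the pickled object as a single integer,
     * then broadcast the pickled bytes themselves as a chunk of memory. */
    int pickled_size = 0;
    char *pickled_buf = NULL;

    if (myid == 0) {
        pickled_size = 64;                  /* root knows the pickled length */
    }
    MPI_Bcast(&pickled_size, 1, MPI_INT, 0, MPI_COMM_WORLD);

    pickled_buf = calloc(pickled_size, 1);  /* every rank allocates the buffer */
    /* on the root, pickled_buf would be filled with the pickled object here */
    MPI_Bcast(pickled_buf, pickled_size, MPI_BYTE, 0, MPI_COMM_WORLD);
    free(pickled_buf);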



If you run this program without checkpoint control, it runs forever because of 
the infinite loop inside. However, if I run it under checkpoint control 
(mpirun -am ft-enable-cr), it occasionally fails with the following messages:

=== Error message START ======================================
[Host1:7398] *** An error occurred in MPI_Bcast
[Host1:7398] *** on communicator MPI_COMM_WORLD
[Host1:7398] *** MPI_ERR_TRUNCATE: message truncated
[Host1:7398] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
=== Error message END ======================================



While running the mpi4py program, I used to get these errors occasionally from 
cPickle().



I tried this program with OpenMPI 1.4.2 and the OpenMPI trunk versions, and the 
behavior is the same.



I have not hit the hang condition I had seen while checkpointing the mpi4py 
program, but I suspect this issue may be manifesting as that hang condition at 
times.



Let me know if you need any other information.



Thanks

Ananda





Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mu...@wipro.com

--- Original Message -------------------------------

Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-18 16:48:17

I just fixed the --stop bug that you highlighted in r23627.

As far as the mpi4py program, I don't really know what to suggest. I don't have 
a setup to test this locally and am completely unfamiliar with mpi4py. Can you 
reproduce this with just a C program?

-- Josh

On Aug 16, 2010, at 12:25 PM, <ananda.mudar_at_[hidden]> 
<ananda.mudar_at_[hidden]> wrote:

> Josh
>
> I have one more update on my observation while analyzing this issue.
>
> Just to refresh, I am using the openmpi trunk (r23596) with mpi4py-1.2.1 and 
> BLCR 0.8.2. When I checkpoint the python script written using mpi4py, the 
> program doesn't progress after the checkpoint is taken successfully. I tried 
> it with openmpi 1.4.2 and then with the latest trunk version as suggested. 
> I see similar behavior in both releases.
>
> I have one more interesting observation that I thought might be useful. I 
> tried the "-stop" option of ompi-checkpoint (trunk version), and mpirun 
> prints the following error messages when I run the command "ompi-checkpoint 
> -stop -v <pid of mpirun>":
>
> ==== Error messages in the window where mpirun command was running START 
> ======================================
> [hpdcnln001:15148] Error: ( app) Passed an invalid handle (0) [5 
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file 
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15149] Error: ( app) Passed an invalid handle (0) [5 
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file 
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15146] Error: ( app) Passed an invalid handle (0) [5 
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file 
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15147] Error: ( app) Passed an invalid handle (0) [5 
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file 
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> ==== Error messages in the window where mpirun command was running END 
> ======================================
>
> Please note that the checkpoint image was created at the end of it. However, 
> when I run the command "kill -CONT <pid of mpirun>", the program fails to move 
> forward, which is the same as the original problem I reported.
>
> Let me know if you need any additional information.
>
> Thanks for your time in advance
>
> - Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mudar_at_[hidden]
>
> From: Ananda Babu Mudar
> Sent: Sunday, August 15, 2010 11:25 PM
> To: users_at_[hidden]
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> Importance: High
>
> Josh
>
> I tried running the mpi4py program with the latest trunk version of openmpi. 
> I compiled openmpi-1.7a1r23596 from trunk and recompiled mpi4py to use this 
> library. Unfortunately I see the same behavior as I saw with openmpi 1.4.2, 
> i.e. the checkpoint is successful but the program doesn't proceed after that.
>
> I have attached the stack traces of all the MPI processes that are part of 
> the mpirun. I would really appreciate it if you could take a look at the 
> stack traces and let me know the potential problem. I am kind of stuck at 
> this point and need your assistance to move forward. Please let me know if 
> you need any additional information.
>
> Thanks for your time in advance
>
> Thanks
>
> Ananda
>
> -----Original Message-----
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> From: Joshua Hursey (jjhursey_at_[hidden])
> Date: 2010-08-13 12:28:31
>
> Nope. I probably won't get to it for a while. I'll let you know if I do.
>
> On Aug 13, 2010, at 12:17 PM, <ananda.mudar_at_[hidden]> 
> <ananda.mudar_at_[hidden]> wrote:
>
> > OK, I will do that.
> >
> > But did you try this program on a system where the latest trunk is
> > installed? Were you successful in checkpointing?
> >
> > - Ananda
> > -----Original Message-----
> > Message: 9
> > Date: Fri, 13 Aug 2010 10:21:29 -0400
> > From: Joshua Hursey <jjhursey_at_[hidden]>
> > Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > I probably won't have an opportunity to work on reproducing this on the
> > 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> > backported to the 1.4 series (things have changed too much since that
> > branch). So I would suggest trying the 1.5 series.
> >
> > -- Josh
> >
> > On Aug 13, 2010, at 10:12 AM, <ananda.mudar_at_[hidden]>
> > <ananda.mudar_at_[hidden]> wrote:
> >
> >> Josh
> >>
> >> I am having problems compiling the sources from the latest trunk. It
> >> complains of libgomp.spec missing even though that file exists on my
> >> system. I will see if I have to change any other environment variables
> >> to have a successful compilation. I will keep you posted.
> >>
> >> BTW, were you successful in reproducing the problem on a system with
> >> OpenMPI 1.4.2?
> >>
> >> Thanks
> >> Ananda
> >> -----Original Message-----
> >> Date: Thu, 12 Aug 2010 09:12:26 -0400
> >> From: Joshua Hursey <jjhursey_at_[hidden]>
> >> Subject: Re: [OMPI users] Checkpointing mpi4py program
> >> To: Open MPI Users <users_at_[hidden]>
> >> Message-ID: <1F1445AB-9208-4EF0-AF25-5926BD53C7E1_at_[hidden]>
> >> Content-Type: text/plain; charset=us-ascii
> >>
> >> Can you try this with the current trunk (r23587 or later)?
> >>
> >> I just added a number of new features and bug fixes, and I would be
> >> interested to see if it fixes the problem. In particular I suspect
> >> that
> >> this might be related to the Init/Finalize bounding of the checkpoint
> >> region.
> >>
> >> -- Josh
> >>
> >> On Aug 10, 2010, at 2:18 PM, <ananda.mudar_at_[hidden]>
> >> <ananda.mudar_at_[hidden]> wrote:
> >>
> >>> Josh
> >>>
> >>> Please find attached the python program that reproduces the hang that I
> >>> described. The initial part of this file describes the prerequisite
> >>> modules and the steps to reproduce the problem. Please let me know if
> >>> you have any questions about reproducing the hang.
> >>>
> >>> Please note that, if I add the following lines at the end of the program
> >>> (for the case where sleep_time is True), the problem disappears, i.e. the
> >>> program resumes successfully after a successful checkpoint:
> >>> # Added lines at the end, for the sleep_time is True case
> >>> else:
> >>>     time.sleep(0.1)
> >>> # End of added lines
> >>>
> >>>
> >>> Thanks a lot for your time in looking into this issue.
> >>>
> >>> Regards
> >>> Ananda
> >>>
> >>> Ananda B Mudar, PMP
> >>> Senior Technical Architect
> >>> Wipro Technologies
> >>> Ph: 972 765 8093
> >>> ananda.mudar_at_[hidden]
> >>>
> >>>
> >>> -----Original Message-----
> >>> Date: Mon, 9 Aug 2010 16:37:58 -0400
> >>> From: Joshua Hursey <jjhursey_at_[hidden]>
> >>> Subject: Re: [OMPI users] Checkpointing mpi4py program
> >>> To: Open MPI Users <users_at_[hidden]>
> >>> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
> >>> Content-Type: text/plain; charset=windows-1252
> >>>
> >>> I have not tried to checkpoint an mpi4py application, so I cannot say
> >>> for sure if it works or not. You might be hitting something with the
> >>> Python runtime interacting in an odd way with either Open MPI or
> >>> BLCR.
> >>>
> >>> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> >>> That might show us where things are held up.
> >>>
> >>> -- Josh
> >>>
> >>>
> >>> On Aug 9, 2010, at 4:04 PM, <ananda.mudar_at_[hidden]>
> >>> <ananda.mudar_at_[hidden]> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
> >>>> 0.8.2. When I run ompi-checkpoint on the program written using mpi4py,
> >>>> I see that the program sometimes doesn't resume after successful
> >>>> checkpoint creation. This doesn't always occur, meaning the program
> >>>> resumes after successful checkpoint creation most of the time and
> >>>> completes successfully. Has anyone tested the checkpoint/restart
> >>>> functionality with mpi4py programs? Are there any best practices that
> >>>> I should keep in mind while checkpointing mpi4py programs?
> >>>>
> >>>> Thanks for your time
> >>>> - Ananda


/*
 * Author: 
 * -------
 *      Ananda B Mudar
 *      Senior Technical Architect
 *      Wipro Technologies
 *      ananda dot mudar at wipro dot com
 *
 * Objective of the program: 
 * -------------------------
 *      Checkpointing a program that has two successive
 * MPI_Bcast() calls will sometimes result in the following errors:
 *
 * [Host1:7398] *** An error occurred in MPI_Bcast
 * [Host1:7398] *** on communicator MPI_COMM_WORLD
 * [Host1:7398] *** MPI_ERR_TRUNCATE: message truncated
 * [Host1:7398] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
 *
 * Background:
 * -----------
 * I wrote a similar program in Python using the mpi4py module and tried to
 * checkpoint it using ompi-checkpoint. Calling ompi-checkpoint on that
 * program failed in multiple ways:
 * (a) The program never resumes after the checkpoint image is taken
 * (b) The program used to fail in bcast() with cPickle errors
 *
 * This program calls MPI_Bcast() on an integer and on a double value inside a
 * while loop, in roughly 10-second cycles. If you run this program under
 * checkpoint control and invoke the checkpoint command, it will sometimes
 * fail with the error message shown above.
 *
 * If we change these MPI_Bcast() calls to broadcast the same datatype, i.e.
 * instead of an integer and a double, broadcast an integer twice or a char
 * twice, the program runs successfully after taking checkpoints.
 *
 * Pre-Requisites:
 * ---------------
 *      BLCR 0.8.2
 *      OpenMPI library configured with checkpoint functionality
 *              - Reproducible with OpenMPI 1.4.2, 1.5, and trunk
 *      gcc (This is not mandatory but we have used this)
 *
 * Steps to reproduce:
 * -------------------
 * 1. Run this program with mpirun -am ft-enable-cr and at least two processes
 * 2. While the program is running, run ompi-checkpoint on the pid of mpirun
 * 3. Sometimes the program moves forward successfully, but sometimes you will
 *    get the error mentioned above.
 *
 */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
        int myid, numprocs, namelen, number = 0, sleep_time;
        double sleep_time1; /* Change this to the same type as sleep_time for
                               the program to run successfully with checkpoints */
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        double start_time = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Get_processor_name(processor_name, &namelen);

        fprintf(stdout, "Process %d of %d on %s & number = %d\n",
                        myid, numprocs, processor_name, number);
        if (myid == 0) {
                number++;
        }

        /*
         * In mpi4py, bcast() is implemented as two collective MPI_Bcast()
         * calls: first for an integer and then for a chunk of memory.
         * The calls below mimic this behavior.
         */
        MPI_Bcast(&number, 1, MPI_INT, 0, MPI_COMM_WORLD);
        fprintf(stdout, "Process %d & new_number = %d\n", myid, number);

        if (myid == 0) {
                start_time = MPI_Wtime();
        }

        /*
         * Infinite while loop; hence MPI_Finalize() is not called.
         */
        while (1) {
                /*
                 * Wait at least 10 seconds before printing the next
                 * set of messages.
                 */
                if (myid == 0) {
                        if (MPI_Wtime() - start_time <= 10) {
                                sleep_time = 1;
                                sleep_time1 = 1.2345;
                        } else {
                                sleep_time = 0;
                                sleep_time1 = 0.1234;
                                start_time = MPI_Wtime();
                        }
                }
                /*
                 * In mpi4py, bcast() is implemented as two collective
                 * MPI_Bcast() calls: first for an integer and then for a
                 * chunk of memory.  The calls below mimic this behavior.
                 */
                MPI_Bcast(&sleep_time, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&sleep_time1, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

                /*
                 * If 10 seconds have elapsed, print the messages
                 */
                if (sleep_time == 0) {
                        fprintf(stdout, "Process %d of %d on %s, number =%d\n",
                                myid, numprocs, processor_name, number);
                        if (myid == 0) {
                                number++;
                        }
                        MPI_Bcast(&number, 1, MPI_INT, 0, MPI_COMM_WORLD);
                        fprintf(stdout,
                                "Process %d, new_number = %d\n", 
                                myid, number);
                }
        }
}

Attachment: debug-msgs-with-orte-debug-crcp-base-verbose-20.log
Description: debug-msgs-with-orte-debug-crcp-base-verbose-20.log
