[hwloc-devel] Create success (hwloc git dev-31-g4c0ce9e)

2013-12-18 Thread MPI Team
Creating nightly hwloc snapshot git tarball was a success.

Snapshot:   hwloc dev-31-g4c0ce9e
Start time: Wed Dec 18 21:01:01 EST 2013
End time:   Wed Dec 18 21:04:00 EST 2013

Your friendly daemon,
Cyrador


Re: [OMPI devel] Q: MPI-RTE / ompi_proc_t vs. ompi_process_info_t ?

2013-12-18 Thread Thomas Naughton

Hi Ralph,

OK, thanks for the clarification and the code pointers.
I'll update "rte.h" to reflect the changes.


Thanks,
--tjn

 _
  Thomas Naughton  naught...@ornl.gov
  Research Associate   (865) 576-4184


On Wed, 18 Dec 2013, Ralph Castain wrote:


There is no relation at all between ompi_proc_t and ompi_process_info_t. The 
ompi_proc_t is defined in the MPI layer and is used in that layer in various 
places very much like orte_proc_t is used in the ORTE layer.

If you look in ompi/mca/rte/orte/rte_orte.c, you'll see how we handle the 
revised function calls. Basically, we use the process name to retrieve the 
modex data via the opal_db, and then load a pointer to the hostname into the 
ompi_proc_t proc_hostname field. Thus, the definition of ompi_proc_t remains in 
the MPI layer.

So there was no need to change the ompi/mca/rte/rte.h file, nor to #define 
anything in the component .h file - just have to modify the wrapper code inside 
the RTE component itself.

HTH
Ralph


On Dec 18, 2013, at 1:50 PM, Thomas Naughton  wrote:


Hi Ralph,

Question about the MPI-RTE interface change in r29931.  The change was not
reflected in the "ompi/mca/rte/rte.h" file.

I'm curious how the newly added "struct ompi_proc_t" relates to the "struct 
ompi_process_info_t" that is described in the "rte.h" file?

I understand the general motivation for the API change but it is less clear
to me how the information previously defined in the header changes (or does
not change)?

Thanks,
--tjn

_
 Thomas Naughton  naught...@ornl.gov
 Research Associate   (865) 576-4184


On Mon, 16 Dec 2013, svn-commit-mai...@open-mpi.org wrote:


Author: rhc (Ralph Castain)
Date: 2013-12-16 22:26:00 EST (Mon, 16 Dec 2013)
New Revision: 29931
URL: https://svn.open-mpi.org/trac/ompi/changeset/29931

Log:
Revert r29917 and replace it with a fix that resolves the thread deadlock while 
retaining the desired debug info. In an earlier commit, we had changed the 
modex accordingly:

* automatically retrieve the hostname (and all RTE info) for all procs during 
MPI_Init if nprocs < cutoff

* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon 
the first call to modex_recv for that proc. This would provide the hostname for 
debugging purposes as we only report errors on messages, and so we must have 
called modex_recv to get the endpoint info

* BTLs are not to call modex_recv until they need the endpoint info for first 
message - i.e., not during add_procs so we don't call it for every process in 
the job, but only those with whom we communicate

My understanding is that only some BTLs have been modified to meet that third 
requirement, but those include the Cray ones where jobs are big enough that 
launch times were becoming an issue. Other BTLs would hopefully be modified as 
time went on and interest in using them at scale arose. Meantime, those BTLs 
would call modex_recv on every proc, and we would therefore be no worse than 
the prior behavior.

This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of 
the ompi_process_name_t for the proc so that the hostname can be easily 
inserted. I have advised the ORNL folks of the change.

cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock

Text files modified:
 trunk/ompi/mca/rte/orte/rte_orte.h| 7 ---
 trunk/ompi/mca/rte/orte/rte_orte_module.c |27 ++-
 trunk/ompi/proc/proc.c|26 ++
 trunk/ompi/runtime/ompi_module_exchange.c |10 +-
 4 files changed, 49 insertions(+), 21 deletions(-)
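
As a reading aid, here is a minimal sketch of the cutoff policy the log describes: eager retrieval of hostname/RTE info below the cutoff, lazy retrieval on the first modex_recv above it. The names and the cutoff value are purely illustrative, not the actual Open MPI code.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define CUTOFF 1024           /* illustrative value only */

typedef struct { int vpid; char hostname[64]; bool have_rte_info; } proc_t;

/* Placeholder for pulling hostname/RTE info from the modex database. */
static void fetch_rte_info(proc_t *p)
{
    snprintf(p->hostname, sizeof(p->hostname), "node%04d", p->vpid);
    p->have_rte_info = true;
}

/* During init: only prefetch everything for small jobs. */
static void init_procs(proc_t *procs, size_t nprocs)
{
    if (nprocs < CUTOFF) {
        for (size_t i = 0; i < nprocs; i++) fetch_rte_info(&procs[i]);
    }
}

/* On first modex_recv for a peer: fetch lazily for large jobs. */
static void modex_recv(proc_t *p)
{
    if (!p->have_rte_info) fetch_rte_info(p);
    /* ... hand the endpoint info back to the BTL ... */
}

int main(void)
{
    proc_t procs[4] = { {0}, {1}, {2}, {3} };
    init_procs(procs, 4);
    modex_recv(&procs[2]);
    printf("proc 2 -> %s\n", procs[2].hostname);
    return 0;
}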







Re: [OMPI devel] Recommended tool to measure packet counters

2013-12-18 Thread Siddhartha Jana
Ah got it ! Thanks

-- Sid


On 18 December 2013 07:44, Jeff Squyres (jsquyres) wrote:

> On Dec 14, 2013, at 8:02 AM, Siddhartha Jana 
> wrote:
>
> > Is there a preferred method/tool among developers of MPI-library for
> checking the count of the packets transmitted by the network card during
> two-sided communication?
> >
> > Is the use of
> > iptables -I INPUT -i eth0
> > iptables -I OUTPUT -o eth0
> >
> > recommended ?
>
> If you're using an ethernet, non-OS-bypass transport (e.g., TCP), you
> might also want to look at ethtool.
>
> Note that these counts will include control messages sent by Open MPI, too
> -- not just raw MPI traffic.  They also will not include any traffic sent
> across shared memory (or other transports).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>


Re: [OMPI devel] RFC: remove opal progress recursion depth counter

2013-12-18 Thread Nathan Hjelm
Oops, yeah. Meant 1.7.5. If people agree with this change I could
possibly slip it in before Friday, but that is unlikely.

On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
> U1.7.4 is leaving the station on Fri, Nathan, so next Tues => will 
> have to go into 1.7.5
> 
> 
> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm  wrote:
> 
> > What: Remove the opal_progress_recursion_depth_counter from
> > opal_progress.
> > 
> > Why: This counter adds two atomic adds to the critical path when
> > OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
> > through ompi, orte, and opal to find where this value was being used and
> > did not find anything either inside or outside opal_progress.
> > 
> > When: I want this change to go into 1.7.4 (if possible) so setting a
> > quick timeout for next Tuesday.
> > 
> > Let me know if there is a good reason to keep this counter and it will
> > be spared.
> > 
> > -Nathan Hjelm
> > HPC-5, LANL




Re: [OMPI devel] RFC: remove opal progress recursion depth counter

2013-12-18 Thread Ralph Castain
U1.7.4 is leaving the station on Fri, Nathan, so next Tues => will have 
to go into 1.7.5


On Dec 18, 2013, at 3:23 PM, Nathan Hjelm  wrote:

> What: Remove the opal_progress_recursion_depth_counter from
> opal_progress.
> 
> Why: This counter adds two atomic adds to the critical path when
> OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
> through ompi, orte, and opal to find where this value was being used and
> did not find anything either inside or outside opal_progress.
> 
> When: I want this change to go into 1.7.4 (if possible) so setting a
> quick timeout for next Tuesday.
> 
> Let me know if there is a good reason to keep this counter and it will
> be spared.
> 
> -Nathan Hjelm
> HPC-5, LANL



[OMPI devel] RFC: remove opal progress recursion depth counter

2013-12-18 Thread Nathan Hjelm
What: Remove the opal_progress_recursion_depth_counter from
opal_progress.

Why: This counter adds two atomic adds to the critical path when
OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
through ompi, orte, and opal to find where this value was being used and
did not find anything either inside or outside opal_progress.

When: I want this change to go into 1.7.4 (if possible) so setting a
quick timeout for next Tuesday.

Let me know if there is a good reason to keep this counter and it will
be spared.

-Nathan Hjelm
HPC-5, LANL
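
For readers unfamiliar with the cost in question: the counter adds one atomic increment on entry to the progress function and one atomic decrement on exit, paid on every call. A self-contained illustration of that pattern with C11 atomics (not the actual opal_progress code) follows:

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative stand-in for the recursion depth counter: every call to
 * progress() pays for one atomic add and one atomic subtract, even though
 * nothing ever reads the value. */
static atomic_int recursion_depth = 0;

static void progress(void)
{
    atomic_fetch_add(&recursion_depth, 1);   /* hot-path cost #1 */
    /* ... poll network devices, run completion callbacks ... */
    atomic_fetch_sub(&recursion_depth, 1);   /* hot-path cost #2 */
}

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        progress();
    }
    printf("final depth: %d\n", atomic_load(&recursion_depth));
    return 0;
}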




Re: [OMPI devel] Q: MPI-RTE / ompi_proc_t vs. ompi_process_info_t ?

2013-12-18 Thread Ralph Castain
There is no relation at all between ompi_proc_t and ompi_process_info_t. The 
ompi_proc_t is defined in the MPI layer and is used in that layer in various 
places very much like orte_proc_t is used in the ORTE layer.

If you look in ompi/mca/rte/orte/rte_orte.c, you'll see how we handle the 
revised function calls. Basically, we use the process name to retrieve the 
modex data via the opal_db, and then load a pointer to the hostname into the 
ompi_proc_t proc_hostname field. Thus, the definition of ompi_proc_t remains in 
the MPI layer.

So there was no need to change the ompi/mca/rte/rte.h file, nor to #define 
anything in the component .h file - just have to modify the wrapper code inside 
the RTE component itself.

HTH
Ralph
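
To make the description above concrete, here is a minimal self-contained sketch of that pattern: look the hostname up by process name in a key/value store and cache a pointer to it on the proc structure, so the structure definition stays in the MPI layer. All type and function names below are placeholders, not the actual OMPI/OPAL APIs.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder stand-ins for ompi_process_name_t / ompi_proc_t. */
typedef struct { int jobid; int vpid; } proc_name_t;
typedef struct { proc_name_t name; char *proc_hostname; } proc_t;

/* Placeholder for an opal_db-style lookup keyed by process name. */
static char *db_fetch_hostname(const proc_name_t *name)
{
    /* A real implementation would consult the modex/db; here we fake it. */
    char buf[64];
    snprintf(buf, sizeof(buf), "node%04d", name->vpid);
    return strdup(buf);
}

/* The wrapper pattern: fill proc->proc_hostname on demand. */
static void resolve_hostname(proc_t *proc)
{
    if (NULL == proc->proc_hostname) {
        proc->proc_hostname = db_fetch_hostname(&proc->name);
    }
}

int main(void)
{
    proc_t peer = { .name = { .jobid = 1, .vpid = 7 }, .proc_hostname = NULL };
    resolve_hostname(&peer);
    printf("peer %d is on %s\n", peer.name.vpid, peer.proc_hostname);
    free(peer.proc_hostname);
    return 0;
}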


On Dec 18, 2013, at 1:50 PM, Thomas Naughton  wrote:

> Hi Ralph,
> 
> Question about the MPI-RTE interface change in r29931.  The change was not
> reflected in the "ompi/mca/rte/rte.h" file.
> 
> I'm curious how the newly added "struct ompi_proc_t" relates to the "struct 
> ompi_process_info_t" that is described in the "rte.h" file?
> 
> I understand the general motivation for the API change but it is less clear
> to me how the information previously defined in the header changes (or does
> not change)?
> 
> Thanks,
> --tjn
> 
> _
>  Thomas Naughton  naught...@ornl.gov
>  Research Associate   (865) 576-4184
> 
> 
> On Mon, 16 Dec 2013, svn-commit-mai...@open-mpi.org wrote:
> 
>> Author: rhc (Ralph Castain)
>> Date: 2013-12-16 22:26:00 EST (Mon, 16 Dec 2013)
>> New Revision: 29931
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29931
>> 
>> Log:
>> Revert r29917 and replace it with a fix that resolves the thread deadlock 
>> while retaining the desired debug info. In an earlier commit, we had changed 
>> the modex accordingly:
>> 
>> * automatically retrieve the hostname (and all RTE info) for all procs 
>> during MPI_Init if nprocs < cutoff
>> 
>> * if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc 
>> upon the first call to modex_recv for that proc. This would provide the 
>> hostname for debugging purposes as we only report errors on messages, and so 
>> we must have called modex_recv to get the endpoint info
>> 
>> * BTLs are not to call modex_recv until they need the endpoint info for 
>> first message - i.e., not during add_procs so we don't call it for every 
>> process in the job, but only those with whom we communicate
>> 
>> My understanding is that only some BTLs have been modified to meet that 
>> third requirement, but those include the Cray ones where jobs are big enough 
>> that launch times were becoming an issue. Other BTLs would hopefully be 
>> modified as time went on and interest in using them at scale arose. 
>> Meantime, those BTLs would call modex_recv on every proc, and we would 
>> therefore be no worse than the prior behavior.
>> 
>> This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of 
>> the ompi_process_name_t for the proc so that the hostname can be easily 
>> inserted. I have advised the ORNL folks of the change.
>> 
>> cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock
>> 
>> Text files modified:
>>  trunk/ompi/mca/rte/orte/rte_orte.h| 7 ---
>>  trunk/ompi/mca/rte/orte/rte_orte_module.c |27 
>> ++-
>>  trunk/ompi/proc/proc.c|26 ++
>>  trunk/ompi/runtime/ompi_module_exchange.c |10 +-
>>  4 files changed, 49 insertions(+), 21 deletions(-)
> 



[OMPI devel] Q: MPI-RTE / ompi_proc_t vs. ompi_process_info_t ?

2013-12-18 Thread Thomas Naughton

Hi Ralph,

Question about the MPI-RTE interface change in r29931.  The change was not
reflected in the "ompi/mca/rte/rte.h" file.

I'm curious how the newly added "struct ompi_proc_t" relates to the 
"struct ompi_process_info_t" that is described in the "rte.h" file?


I understand the general motivation for the API change but it is less clear
to me how the information previously defined in the header changes (or does
not change)?

Thanks,
--tjn

 _
  Thomas Naughton  naught...@ornl.gov
  Research Associate   (865) 576-4184


On Mon, 16 Dec 2013, svn-commit-mai...@open-mpi.org wrote:


Author: rhc (Ralph Castain)
Date: 2013-12-16 22:26:00 EST (Mon, 16 Dec 2013)
New Revision: 29931
URL: https://svn.open-mpi.org/trac/ompi/changeset/29931

Log:
Revert r29917 and replace it with a fix that resolves the thread deadlock while 
retaining the desired debug info. In an earlier commit, we had changed the 
modex accordingly:

* automatically retrieve the hostname (and all RTE info) for all procs during 
MPI_Init if nprocs < cutoff

* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon 
the first call to modex_recv for that proc. This would provide the hostname for 
debugging purposes as we only report errors on messages, and so we must have 
called modex_recv to get the endpoint info

* BTLs are not to call modex_recv until they need the endpoint info for first 
message - i.e., not during add_procs so we don't call it for every process in 
the job, but only those with whom we communicate

My understanding is that only some BTLs have been modified to meet that third 
requirement, but those include the Cray ones where jobs are big enough that 
launch times were becoming an issue. Other BTLs would hopefully be modified as 
time went on and interest in using them at scale arose. Meantime, those BTLs 
would call modex_recv on every proc, and we would therefore be no worse than 
the prior behavior.

This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of 
the ompi_process_name_t for the proc so that the hostname can be easily 
inserted. I have advised the ORNL folks of the change.

cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock

Text files modified:
  trunk/ompi/mca/rte/orte/rte_orte.h| 7 ---
  trunk/ompi/mca/rte/orte/rte_orte_module.c |27 ++-
  trunk/ompi/proc/proc.c|26 ++
  trunk/ompi/runtime/ompi_module_exchange.c |10 +-
  4 files changed, 49 insertions(+), 21 deletions(-)




[OMPI devel] Problem with memory in mpi program

2013-12-18 Thread Yeni Lora
My program uses MPI and OpenMP. Even this small sample program takes
a lot of memory, and I don't know how much RAM an MPI program should
consume. I want to know whether MPI uses a lot of memory when it is
used together with OpenMP, or whether I am doing something wrong. To
measure the RAM usage of my program I read the file /proc/id_proc/stat,
where id_proc is the id of my process.
This is my example program:

#include <stdio.h>
#include "mpi.h"
#include <omp.h>
#include <vector>
#include <iostream>

int main(int argc, char** argv){

    int my_rank;    /* rank of process */
    int p;          /* number of processes */
    int provided;   /* thread level provided by MPI */

    MPI_Init_thread(&argc, &argv, MPI::THREAD_MULTIPLE, &provided);

    /* find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    char cad[4];
    MPI_Status status;

    omp_set_num_threads(2);
#pragma omp parallel
    {
        int h = omp_get_thread_num();

        if (h == 0) {
            MPI_Send(cad, 1, MPI::CHAR, my_rank, 11, MPI_COMM_WORLD);
        }
        else {
            std::vector<char> all(2, 0);
            MPI_Recv(&all[0], 2, MPI::CHAR, MPI::ANY_SOURCE,
                     MPI::ANY_TAG, MPI_COMM_WORLD, &status);
        }
    }

    /* shut down MPI */
    MPI_Finalize();

    return 0;
}

Compile:
mpic++ -fopenmp -fno-threadsafe-statics -o sample_program sample_program.c

Run
mpirun  sample_program

and the memory consumed: 190 MB

Please, I need help; it is very important for me to get low memory consumption.
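
A hedged measurement note, separate from the question of why the footprint is large: on Linux the resident set size is usually easier to read from /proc/<pid>/status (the VmRSS field) than from the raw fields of /proc/<pid>/stat. A small self-contained helper a process could call on itself:

#include <stdio.h>
#include <string.h>

/* Return the VmRSS value (in kB) from /proc/self/status; Linux-specific. */
static long self_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long rss_kb = -1;

    if (NULL == f) return -1;
    while (fgets(line, sizeof(line), f)) {
        if (0 == strncmp(line, "VmRSS:", 6)) {
            sscanf(line + 6, "%ld", &rss_kb);
            break;
        }
    }
    fclose(f);
    return rss_kb;
}

int main(void)
{
    printf("resident set size: %ld kB\n", self_rss_kb());
    return 0;
}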


Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Nathan Hjelm
Found the problem. Was accessing a boolean variable using intval. That
is a bug that has gone unnoticed on all platforms but thankfully Solaris
caught it.

Please try the attached patch.

-Nathan

On Wed, Dec 18, 2013 at 12:27:29PM +0100, Siegmar Gross wrote:
> Hi,
> 
> today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
> C 5.12. Unfortunately my problems with bus errors, which I reported
> December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are
> not solved yet. Has somebody time to look into that matter or is
> Solaris support abandoned, so that I have to stay with openmpi-1.6.x
> in the future? Thank you very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 
diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
index 7b55eb8..c043c06 100644
--- a/opal/mca/base/mca_base_var.c
+++ b/opal/mca/base/mca_base_var.c
@@ -1682,7 +1682,11 @@ static int var_value_string (mca_base_var_t *var, char **value_string)
 
         ret = (0 > ret) ? OPAL_ERR_OUT_OF_RESOURCE : OPAL_SUCCESS;
     } else {
-        ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
+        if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
+            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
+        } else {
+            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
+        }
 
         *value_string = strdup (tmp);
         if (NULL == value_string) {
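
For context on why Solaris caught this: the value lives in a union, and reading it back through a wider member than the one that was written picks up indeterminate bytes; on big-endian SPARC the written byte lands in the most significant position, so the bogus value tends to be large enough to misbehave visibly. A minimal illustration of the hazard (not the actual mca_base_var code):

#include <stdbool.h>
#include <stdio.h>

/* Minimal stand-in for a variable-value union holding a bool or an int. */
union value {
    bool boolval;   /* typically 1 byte  */
    int  intval;    /* typically 4 bytes */
};

int main(void)
{
    union value v = { .boolval = true };

    /* Reading back through the member that was written is fine... */
    printf("boolval = %d\n", (int) v.boolval);

    /* ...but reading the 4-byte intval after writing the 1-byte boolval
     * picks up 3 bytes of indeterminate storage.  On little-endian x86 the
     * stale bytes often happen to be zero, so the bug hides; on big-endian
     * SPARC the written byte is the most significant one and the result can
     * be a huge bogus value. */
    printf("intval  = %d  (unreliable)\n", v.intval);
    return 0;
}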




Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Jeff Squyres (jsquyres)
Siegmar --

Thanks for keeping us honest!  I just filed three tickets with the issues you 
reported:

https://svn.open-mpi.org/trac/ompi/ticket/3988
https://svn.open-mpi.org/trac/ompi/ticket/3989
https://svn.open-mpi.org/trac/ompi/ticket/3990


On Dec 18, 2013, at 6:27 AM, Siegmar Gross 
 wrote:

> Hi,
> 
> today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
> C 5.12. Unfortunately my problems with bus errors, which I reported
> December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are
> not solved yet. Has somebody time to look into that matter or is
> Solaris support abandoned, so that I have to stay with openmpi-1.6.x
> in the future? Thank you very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-18 Thread Ralph Castain
In the case of the send, there really isn't any problem with just replacing 
things - the non-blocking change won't impact anything, so no need to retain 
the old code. People were only concerned about the recv's as those places will 
require further repair, and they wanted to ensure we know where those places 
are located.

You also need to change those comparisons, however, as the return code isn't 
the number of bytes sent any more - it is just ORTE_SUCCESS or else an error 
code, so you should be testing for ORTE_SUCCESS ==
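
A minimal sketch of the check Ralph is asking for; the stub below exists only so the example compiles on its own, while the real call is ompi_rte_send_buffer_nb as shown in the quoted patch:

#include <stdio.h>

#define ORTE_SUCCESS 0

/* Stub: the real non-blocking send now returns ORTE_SUCCESS or an error
 * code, not the number of bytes sent. */
static int send_buffer_nb_stub(void) { return ORTE_SUCCESS; }

int main(void)
{
    int ret;

    /* Old pattern (wrong for the _nb API): "0 > ret" treated the return
     * value as a byte count.  New pattern: compare against ORTE_SUCCESS. */
    if (ORTE_SUCCESS != (ret = send_buffer_nb_stub())) {
        fprintf(stderr, "send failed: %d\n", ret);
        return 1;
    }
    printf("send posted\n");
    return 0;
}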




On Dec 18, 2013, at 6:42 AM, Adrian Reber  wrote:

> From: Adrian Reber 
> 
> This patch changes all send/send_buffer occurrences in the C/R code
> to send_nb/send_buffer_nb.
> The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
> The new code compiles but does not work.
> 
> Changes from V1:
> * #ifdef out the code (so it is preserved for later re-design)
> * marked the broken C/R code with ENABLE_FT_FIXED
> 
> Signed-off-by: Adrian Reber 
> ---
> ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 18 +++
> orte/mca/errmgr/base/errmgr_base_tool.c |  4 ++
> orte/mca/rml/ftrm/rml_ftrm.h| 19 
> orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> orte/mca/rml/ftrm/rml_ftrm_module.c | 63 +
> orte/mca/snapc/full/snapc_full_app.c| 20 
> orte/mca/snapc/full/snapc_full_global.c | 12 +
> orte/mca/snapc/full/snapc_full_local.c  |  4 ++
> orte/mca/sstore/central/sstore_central_app.c|  8 
> orte/mca/sstore/central/sstore_central_global.c |  4 ++
> orte/mca/sstore/central/sstore_central_local.c  | 12 +
> orte/mca/sstore/stage/sstore_stage_app.c|  8 
> orte/mca/sstore/stage/sstore_stage_global.c |  4 ++
> orte/mca/sstore/stage/sstore_stage_local.c  | 16 +++
> orte/tools/orte-checkpoint/orte-checkpoint.c|  4 ++
> orte/tools/orte-migrate/orte-migrate.c  |  4 ++
> 16 files changed, 130 insertions(+), 72 deletions(-)
> 
> diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> index cba7586..4f7bd7f 100644
> --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> @@ -5102,7 +5102,11 @@ static int wait_quiesce_drained(void)
> PACK_BUFFER(buffer, response, 1, OPAL_SIZE, "");
> 
> /* JJH - Performance Optimization? - Why not post all isends, 
> then wait? */
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if ( 0 > ( ret = ompi_rte_send_buffer(&(cur_peer_ref->proc_name), 
> buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> +#endif /* ENABLE_FT_FIXED */
> +if ( 0 > ( ret = 
> ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name), buffer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
> exit_status = ret;
> goto cleanup;
> }
> @@ -5303,7 +5307,11 @@ static int send_bookmarks(int peer_idx)
> PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32,
> "crcp:bkmrk: send_bookmarks: Unable to pack 
> total_msgs_recvd");
> 
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if ( 0 > ( ret = ompi_rte_send_buffer(_name, buffer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> +#endif /* ENABLE_FT_FIXED */
> +if ( 0 > ( ret = ompi_rte_send_buffer_nb(_name, buffer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
> opal_output(mca_crcp_bkmrk_component.super.output_handle,
> "crcp:bkmrk: send_bookmarks: Failed to send bookmark to 
> peer %s: Return %d\n",
> OMPI_NAME_PRINT(_name),
> @@ -5599,8 +5607,13 @@ static int 
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> /*
>  * Do the send...
>  */
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if ( 0 > ( ret = ompi_rte_send_buffer(_ref->proc_name, buffer,
>   OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) 
> {
> +#endif /* ENABLE_FT_FIXED */
> +if ( 0 > ( ret = ompi_rte_send_buffer_nb(_ref->proc_name, buffer,
> +  OMPI_CRCP_COORD_BOOKMARK_TAG, 
> orte_rml_send_callback, NULL)) ) {
> opal_output(mca_crcp_bkmrk_component.super.output_handle,
> "crcp:bkmrk: do_send_msg_detail: Unable to send message 
> details to peer %s: Return %d\n",
> OMPI_NAME_PRINT(_ref->proc_name),
> @@ -6217,8 +6230,13 @@ static int 
> do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> "crcp:bkmrk: recv_msg_details: Unable to ask peer for more 
> messages");
> PACK_BUFFER(buffer, total_found, 1, OPAL_UINT32,
> "crcp:bkmrk: recv_msg_details: Unable to ask peer for more 
> 

Re: [OMPI devel] [PATCH v2 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-18 Thread Ralph Castain
Hi Adrian

No point in keeping the old code for those places where you update the syntax 
of a non-blocking recv (i.e., you remove the no-longer-reqd extra param). I 
would only keep it where you have to replace a blocking recv with a 
non-blocking one as that is where the behavior will change.

Other than that, it looks fine to me.
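
For readers following the conversion, the reason the recv side needs more care: the blocking call returned only once the data had arrived, while the _nb form returns immediately and hands the data to a callback later. A self-contained toy sketch of that control-flow difference (not the ORTE RML API):

#include <stdbool.h>
#include <stdio.h>

typedef void (*recv_cb_t)(const char *data, void *ctx);

/* Toy "pending receive"; in real code the RML progress engine fires this. */
static recv_cb_t pending_cb;
static void *pending_ctx;

static void recv_buffer_nb(recv_cb_t cb, void *ctx)
{
    /* Returns immediately; the caller must not touch the buffer yet. */
    pending_cb = cb;
    pending_ctx = ctx;
}

static void progress(void)
{
    /* Pretend a message just arrived and complete the posted receive. */
    if (pending_cb) {
        pending_cb("bookmark payload", pending_ctx);
        pending_cb = NULL;
    }
}

static void on_recv(const char *data, void *ctx)
{
    bool *done = ctx;
    printf("callback got: %s\n", data);
    *done = true;
}

int main(void)
{
    bool done = false;
    recv_buffer_nb(on_recv, &done);   /* the old blocking recv waited here */
    while (!done) {
        progress();                    /* data is only valid after this fires */
    }
    return 0;
}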

On Dec 18, 2013, at 6:42 AM, Adrian Reber  wrote:

> From: Adrian Reber 
> 
> This patch changes all recv/recv_buffer occurrences in the C/R code
> to recv_nb/recv_buffer_nb.
> The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
> The new code compiles but does not work.
> 
> Changes from V1:
> * #ifdef out the code (so it is preserved for later re-design)
> * marked the broken C/R code with ENABLE_FT_FIXED
> 
> Signed-off-by: Adrian Reber 
> ---
> ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 19 ++
> orte/mca/errmgr/base/errmgr_base_tool.c |  6 +-
> orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
> orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++--
> orte/mca/snapc/full/snapc_full_app.c| 12 
> orte/mca/snapc/full/snapc_full_global.c | 25 
> orte/mca/snapc/full/snapc_full_local.c  | 24 
> orte/mca/sstore/central/sstore_central_app.c|  6 ++
> orte/mca/sstore/central/sstore_central_global.c | 11 ++--
> orte/mca/sstore/central/sstore_central_local.c  | 11 ++--
> orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
> orte/mca/sstore/stage/sstore_stage_global.c | 11 ++--
> orte/mca/sstore/stage/sstore_stage_local.c  | 11 ++--
> orte/tools/orte-checkpoint/orte-checkpoint.c|  9 ++-
> orte/tools/orte-migrate/orte-migrate.c  |  9 ++-
> 16 files changed, 124 insertions(+), 142 deletions(-)
> 
> diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> index 5d4005f..cba7586 100644
> --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> @@ -4739,6 +4739,8 @@ static int ft_event_post_drain_acks(void)
> drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;
> 
> /* Post the receive */
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb( 
> _msg_ack->peer,
> 
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> 0,
> @@ -4750,6 +4752,9 @@ static int ft_event_post_drain_acks(void)
> OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
> return ret;
> }
> +#endif /* ENABLE_FT_FIXED */
> +ompi_rte_recv_buffer_nb(_msg_ack->peer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> +0, drain_message_ack_cbfunc, NULL);
> }
> 
> return OMPI_SUCCESS;
> @@ -5330,6 +5335,8 @@ static int recv_bookmarks(int peer_idx)
> peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
> peer_name.vpid   = peer_idx;
> 
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if ( 0 > (ret = ompi_rte_recv_buffer_nb(_name,
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> 0,
> @@ -5342,6 +5349,9 @@ static int recv_bookmarks(int peer_idx)
> exit_status = ret;
> goto cleanup;
> }
> +#endif /* ENABLE_FT_FIXED */
> +ompi_rte_recv_buffer_nb(_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
> +   0, recv_bookmarks_cbfunc, NULL);
> 
> ++total_recv_bookmarks;
> 
> @@ -5616,6 +5626,8 @@ static int 
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> /*
>  * Recv the ACK msg
>  */
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if ( 0 > (ret = ompi_rte_recv_buffer(_ref->proc_name, buffer,
>  OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) ) 
> {
> opal_output(mca_crcp_bkmrk_component.super.output_handle,
> @@ -5626,6 +5638,9 @@ static int 
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> exit_status = ret;
> goto cleanup;
> }
> +#endif /* ENABLE_FT_FIXED */
> +ompi_rte_recv_buffer_nb(_ref->proc_name, 
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
> +orte_rml_recv_callback, NULL);
> 
> UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32,
>   "crcp:bkmrk: send_msg_details: Failed to unpack the ACK 
> from peer buffer.");
> @@ -5790,6 +5805,8 @@ static int 
> do_recv_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> /*
>  * Recv the msg
>  */
> +#ifdef ENABLE_FT_FIXED
> +/* This is the old, now broken code */
> if ( 0 > (ret = ompi_rte_recv_buffer(_ref->proc_name, buffer, 
> 

[OMPI devel] [PATCH v2 0/2] Trying to get the C/R code to compile again

2013-12-18 Thread Adrian Reber
From: Adrian Reber 

This is the second try to replace the usage of blocking send and
recv in the C/R code with the non-blocking versions. The new code
compiles (in contrast to the old code) but does not work yet.
This is the first step to get the C/R code working again. Right
now it only compiles.

Changes from V1:
* #ifdef out the broken code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Adrian Reber (2):
  Trying to get the C/R code to compile again. (recv_*_nb)
  Trying to get the C/R code to compile again. (send_*_nb)

 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c|  37 +++
 orte/mca/errmgr/base/errmgr_base_tool.c |  10 +-
 orte/mca/rml/ftrm/rml_ftrm.h|  46 +---
 orte/mca/rml/ftrm/rml_ftrm_component.c  |   4 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 141 
 orte/mca/snapc/full/snapc_full_app.c|  32 ++
 orte/mca/snapc/full/snapc_full_global.c |  37 +--
 orte/mca/snapc/full/snapc_full_local.c  |  28 +++--
 orte/mca/sstore/central/sstore_central_app.c|  14 +++
 orte/mca/sstore/central/sstore_central_global.c |  15 ++-
 orte/mca/sstore/central/sstore_central_local.c  |  23 +++-
 orte/mca/sstore/stage/sstore_stage_app.c|  13 +++
 orte/mca/sstore/stage/sstore_stage_global.c |  15 ++-
 orte/mca/sstore/stage/sstore_stage_local.c  |  27 -
 orte/tools/orte-checkpoint/orte-checkpoint.c|  13 ++-
 orte/tools/orte-migrate/orte-migrate.c  |  13 ++-
 16 files changed, 254 insertions(+), 214 deletions(-)

-- 
1.8.4.2



[OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-18 Thread Adrian Reber
From: Adrian Reber 

This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Signed-off-by: Adrian Reber 
---
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 18 +++
 orte/mca/errmgr/base/errmgr_base_tool.c |  4 ++
 orte/mca/rml/ftrm/rml_ftrm.h| 19 
 orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 63 +
 orte/mca/snapc/full/snapc_full_app.c| 20 
 orte/mca/snapc/full/snapc_full_global.c | 12 +
 orte/mca/snapc/full/snapc_full_local.c  |  4 ++
 orte/mca/sstore/central/sstore_central_app.c|  8 
 orte/mca/sstore/central/sstore_central_global.c |  4 ++
 orte/mca/sstore/central/sstore_central_local.c  | 12 +
 orte/mca/sstore/stage/sstore_stage_app.c|  8 
 orte/mca/sstore/stage/sstore_stage_global.c |  4 ++
 orte/mca/sstore/stage/sstore_stage_local.c  | 16 +++
 orte/tools/orte-checkpoint/orte-checkpoint.c|  4 ++
 orte/tools/orte-migrate/orte-migrate.c  |  4 ++
 16 files changed, 130 insertions(+), 72 deletions(-)

diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
index cba7586..4f7bd7f 100644
--- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
+++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
@@ -5102,7 +5102,11 @@ static int wait_quiesce_drained(void)
 PACK_BUFFER(buffer, response, 1, OPAL_SIZE, "");

 /* JJH - Performance Optimization? - Why not post all isends, then 
wait? */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > ( ret = ompi_rte_send_buffer(&(cur_peer_ref->proc_name), 
buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+#endif /* ENABLE_FT_FIXED */
+if ( 0 > ( ret = 
ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name), buffer, 
OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
 exit_status = ret;
 goto cleanup;
 }
@@ -5303,7 +5307,11 @@ static int send_bookmarks(int peer_idx)
 PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32,
 "crcp:bkmrk: send_bookmarks: Unable to pack total_msgs_recvd");

+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > ( ret = ompi_rte_send_buffer(_name, buffer, 
OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+#endif /* ENABLE_FT_FIXED */
+if ( 0 > ( ret = ompi_rte_send_buffer_nb(_name, buffer, 
OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
 "crcp:bkmrk: send_bookmarks: Failed to send bookmark to 
peer %s: Return %d\n",
 OMPI_NAME_PRINT(_name),
@@ -5599,8 +5607,13 @@ static int 
do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 /*
  * Do the send...
  */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > ( ret = ompi_rte_send_buffer(_ref->proc_name, buffer,
   OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+#endif /* ENABLE_FT_FIXED */
+if ( 0 > ( ret = ompi_rte_send_buffer_nb(_ref->proc_name, buffer,
+  OMPI_CRCP_COORD_BOOKMARK_TAG, 
orte_rml_send_callback, NULL)) ) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
 "crcp:bkmrk: do_send_msg_detail: Unable to send message 
details to peer %s: Return %d\n",
 OMPI_NAME_PRINT(_ref->proc_name),
@@ -6217,8 +6230,13 @@ static int 
do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 "crcp:bkmrk: recv_msg_details: Unable to ask peer for more 
messages");
 PACK_BUFFER(buffer, total_found, 1, OPAL_UINT32,
 "crcp:bkmrk: recv_msg_details: Unable to ask peer for more 
messages");
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */

 if ( 0 > ( ret = ompi_rte_send_buffer(_ref->proc_name, buffer, 
OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+#endif /* ENABLE_FT_FIXED */
+
+if ( 0 > ( ret = ompi_rte_send_buffer_nb(_ref->proc_name, buffer, 
OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
 "crcp:bkmrk: recv_msg_detail_resp: Unable to send message 
detail response to peer %s: Return %d\n",
 OMPI_NAME_PRINT(_ref->proc_name),
diff --git a/orte/mca/errmgr/base/errmgr_base_tool.c 
b/orte/mca/errmgr/base/errmgr_base_tool.c
index b982e46..e274bae 100644
--- 

[OMPI devel] [PATCH v2 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-18 Thread Adrian Reber
From: Adrian Reber 

This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Signed-off-by: Adrian Reber 
---
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 19 ++
 orte/mca/errmgr/base/errmgr_base_tool.c |  6 +-
 orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
 orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++--
 orte/mca/snapc/full/snapc_full_app.c| 12 
 orte/mca/snapc/full/snapc_full_global.c | 25 
 orte/mca/snapc/full/snapc_full_local.c  | 24 
 orte/mca/sstore/central/sstore_central_app.c|  6 ++
 orte/mca/sstore/central/sstore_central_global.c | 11 ++--
 orte/mca/sstore/central/sstore_central_local.c  | 11 ++--
 orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
 orte/mca/sstore/stage/sstore_stage_global.c | 11 ++--
 orte/mca/sstore/stage/sstore_stage_local.c  | 11 ++--
 orte/tools/orte-checkpoint/orte-checkpoint.c|  9 ++-
 orte/tools/orte-migrate/orte-migrate.c  |  9 ++-
 16 files changed, 124 insertions(+), 142 deletions(-)

diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
index 5d4005f..cba7586 100644
--- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
+++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
@@ -4739,6 +4739,8 @@ static int ft_event_post_drain_acks(void)
 drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;

 /* Post the receive */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb( 
_msg_ack->peer,
 
OMPI_CRCP_COORD_BOOKMARK_TAG,
 0,
@@ -4750,6 +4752,9 @@ static int ft_event_post_drain_acks(void)
 OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
 return ret;
 }
+#endif /* ENABLE_FT_FIXED */
+ompi_rte_recv_buffer_nb(_msg_ack->peer, 
OMPI_CRCP_COORD_BOOKMARK_TAG,
+0, drain_message_ack_cbfunc, NULL);
 }

 return OMPI_SUCCESS;
@@ -5330,6 +5335,8 @@ static int recv_bookmarks(int peer_idx)
 peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
 peer_name.vpid   = peer_idx;

+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > (ret = ompi_rte_recv_buffer_nb(_name,
 OMPI_CRCP_COORD_BOOKMARK_TAG,
 0,
@@ -5342,6 +5349,9 @@ static int recv_bookmarks(int peer_idx)
 exit_status = ret;
 goto cleanup;
 }
+#endif /* ENABLE_FT_FIXED */
+ompi_rte_recv_buffer_nb(_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
+   0, recv_bookmarks_cbfunc, NULL);

 ++total_recv_bookmarks;

@@ -5616,6 +5626,8 @@ static int 
do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 /*
  * Recv the ACK msg
  */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > (ret = ompi_rte_recv_buffer(_ref->proc_name, buffer,
  OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) ) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
@@ -5626,6 +5638,9 @@ static int 
do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 exit_status = ret;
 goto cleanup;
 }
+#endif /* ENABLE_FT_FIXED */
+ompi_rte_recv_buffer_nb(_ref->proc_name, 
OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
+orte_rml_recv_callback, NULL);

 UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32,
   "crcp:bkmrk: send_msg_details: Failed to unpack the ACK from 
peer buffer.");
@@ -5790,6 +5805,8 @@ static int 
do_recv_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 /*
  * Recv the msg
  */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > (ret = ompi_rte_recv_buffer(_ref->proc_name, buffer, 
OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) ) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
 "crcp:bkmrk: do_recv_msg_detail: %s <-- %s Failed to 
receive buffer from peer. Return %d\n",
@@ -5799,6 +5816,8 @@ static int 
do_recv_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 exit_status = ret;
 goto cleanup;
 }
+#endif /* ENABLE_FT_FIXED */
+ompi_rte_recv_buffer_nb(_ref->proc_name, 
OMPI_CRCP_COORD_BOOKMARK_TAG, 0, orte_rml_recv_callback, NULL);

 /* Pull out the communicator ID */
 UNPACK_BUFFER(buffer, 

Re: [OMPI devel] [patch] async-signal-safe signal handler

2013-12-18 Thread Jeff Squyres (jsquyres)
This patch looks good to me (sorry for the delay in replying -- MPI Forum + 
OMPI dev meeting got in the way).

Brian -- do you have any opinions on it?


On Dec 11, 2013, at 1:43 AM, Kawashima, Takahiro  
wrote:

> Hi,
> 
> Open MPI's signal handler (show_stackframe function defined in
> opal/util/stacktrace.c) calls non-async-signal-safe functions
> and it causes a problem.
> 
> See attached mpisigabrt.c. Passing corrupted memory to realloc(3)
> will cause SIGABRT and show_stackframe function will be invoked.
> But invoked show_stackframe function deadlocks in backtrace_symbols(3)
> on some systems because backtrace_symbols(3) calls malloc(3)
> internally and a deadlock of realloc/malloc mutex occurs.
> 
> Attached mpisigabrt.gstack.txt shows the stacktrace gotten
> by gdb in this deadlock situation on Ubuntu 12.04 LTS (precise)
> x86_64. Though I could not reproduce this behavior on RHEL 5/6,
> I can reproduce it also on K computer and its successor PRIMEHPC FX10.
> Passing non-heap memory to free(3) and double-free also cause
> this deadlock.
> 
> malloc (and backtrace_symbols) is not marked as async-signal-safe
> in POSIX and current glibc, though it seems to have been marked
> in old glibc. So we should not call it in the signal handler now.
> 
>  
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
>  http://cygwin.com/ml/libc-help/2013-06/msg5.html
> 
> I wrote a patch to address this issue. See the attached
> async-signal-safe-stacktrace.patch.
> 
> This patch calls backtrace_symbols_fd(3) instead of backtrace_symbols(3).
> Though backtrace_symbols_fd is not declared as async-signal-safe,
> it is described not to call malloc internally in its man. So it
> should be rather safer.
> 
> Output format of show_stackframe function is not changed by
> this patch. But the opal_backtrace_print function (backtrace
> framework) interface is changed for the output format compatibility.
> This requires changes in some additional files (ompi_mpi_abort.c
> etc.).
> 
> This patch also removes unnecessary fflush(3) calls, which are
> meaningless for write(2) system call but might cause a similar
> problem.
> 
> What do you think about this patch?
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
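
For reference, the safe pattern described above, writing the trace with backtrace_symbols_fd(3) so no malloc happens inside the handler, looks roughly like the sketch below. This is a generic glibc/execinfo example, not the Open MPI patch itself.

#include <execinfo.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Signal handler that avoids malloc: backtrace_symbols_fd() writes the
 * symbolized frames directly to a file descriptor, unlike
 * backtrace_symbols(), which allocates. */
static void show_stackframe(int sig)
{
    void *frames[32];
    int n = backtrace(frames, 32);
    const char msg[] = "caught signal, stack trace follows:\n";
    ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);

    (void) sig;
    (void) rc;
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1);
}

int main(void)
{
    signal(SIGSEGV, show_stackframe);
    signal(SIGABRT, show_stackframe);

    /* Deliberately hand realloc() a pointer it does not own, mirroring the
     * mpisigabrt.c reproducer described above; glibc aborts with SIGABRT. */
    char local;
    void *p = realloc(&local, 16);
    (void) p;
    return 0;
}

Note that glibc may allocate internally on the very first backtrace() call, so production code often primes it once during startup; linking with -rdynamic makes the printed symbols more readable.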



Re: [OMPI devel] Recommended tool to measure packet counters

2013-12-18 Thread Jeff Squyres (jsquyres)
On Dec 14, 2013, at 8:02 AM, Siddhartha Jana  wrote:

> Is there a preferred method/tool among developers of MPI-library for checking 
> the count of the packets transmitted by the network card during two-sided 
> communication?
> 
> Is the use of
> iptables -I INPUT -i eth0
> iptables -I OUTPUT -o eth0
> 
> recommended ?

If you're using an ethernet, non-OS-bypass transport (e.g., TCP), you might 
also want to look at ethtool.

Note that these counts will include control messages sent by Open MPI, too -- 
not just raw MPI traffic.  They also will not include any traffic sent across 
shared memory (or other transports).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
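
If a scriptable counter is enough, related per-interface counters are also exposed under sysfs on Linux; a minimal reader (the interface name and statistic names are just examples) might look like this:

#include <stdio.h>

/* Read one NIC statistic from sysfs (Linux-specific). Returns -1 on error. */
static long long read_stat(const char *ifname, const char *stat)
{
    char path[128];
    long long value = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/%s", ifname, stat);
    f = fopen(path, "r");
    if (NULL == f) return -1;
    if (1 != fscanf(f, "%lld", &value)) value = -1;
    fclose(f);
    return value;
}

int main(void)
{
    /* Sample before and after the MPI run (or a communication phase) and
     * subtract; remember these counters include non-MPI traffic too. */
    printf("eth0 tx_packets: %lld\n", read_stat("eth0", "tx_packets"));
    printf("eth0 rx_packets: %lld\n", read_stat("eth0", "rx_packets"));
    return 0;
}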



[OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-18 Thread Siegmar Gross
Hi,

today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun
C 5.12. Unfortunately my problems with bus errors, which I reported
December 4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are
not solved yet. Has somebody time to look into that matter or is
Solaris support abandoned, so that I have to stay with openmpi-1.6.x
in the future? Thank you very much for any help in advance.


Kind regards

Siegmar