Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2013-01-24 Thread Number Cruncher
I've looked in more detail at the current two MPI_Alltoallv algorithms 
and wanted to raise a couple of ideas.


Firstly, the new default "pairwise" algorithm.
* There is no optimisation for sparse/empty messages, compared to the old 
basic "linear" algorithm.
* The attached "pairwise-nop" patch adds this optimisation; on the 
test case I first described in this thread (1000's of small, sparse 
all-to-alls), it cuts runtime by approximately 30%.
* I think the upper bound on the loop counter for pairwise exchange is 
off-by-one. As the comment notes, "starting from 1 since local exhange 
[sic] is done"; but at step = size (the last iteration permitted by the 
"size + 1" bound), sendto and recvfrom both reduce to rank itself, and 
self-exchange is already handled in earlier code. See the index 
arithmetic sketch below.
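
To make the off-by-one concrete, here is the index arithmetic, using the 
same formulas as coll_tuned_alltoallv.c (the concrete values are just an 
example):

    /* sendto   = (rank + step) % size
       recvfrom = (rank + size - step) % size
       At step == size both reduce to rank itself, i.e. a self-exchange,
       which has already been performed before entering the loop. */
    int rank = 3, size = 8, step = size;
    int sendto   = (rank + step) % size;          /* (3 + 8) % 8 == 3 */
    int recvfrom = (rank + size - step) % size;   /* (3 + 8 - 8) % 8 == 3 */

So the loop only needs to run steps 1 .. size-1.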


The pairwise algorithm still kills performance on my gigabit ethernet 
network. My message transmission time must be small compared to latency, 
and the O(comm_size) forced synchronisation steps introduce a minimum 
delay of (single_link_latency * comm_size), i.e. the latency scales 
linearly with comm_size. The linear algorithm doesn't wait for each 
exchange, so its minimum latency is just a single transmit/receive.
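
As a rough worked example (illustrative numbers, not measurements): with 
~50 us of per-message latency on a gigabit network and comm_size = 32, the 
pairwise algorithm's 31 synchronised steps impose a floor of roughly 
31 * 50 us ~= 1.5 ms per alltoallv call, whereas overlapping all the 
exchanges keeps the floor near a single ~50 us round trip. Over thousands 
of calls that gap dominates.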


Which brings me to the second idea. The problem with the existing 
implementation of the linear algorithm is that the irecv/isend pattern 
is identical on all processes: every process starts by waiting for 
rank 0 to send to everyone, and finishes by waiting for rank (size-1) 
to send to everyone.


It seems preferable to at least post the send/recv requests in the same 
order as the pairwise algorithm. The attached "linear-alltoallv" patch 
implements this and, on my test case, shows a modest 5% improvement. 
I was wondering if it would address the concerns which led to the switch 
of default algorithm. A sketch of the intended posting order follows.
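
For concreteness, a minimal sketch of that posting order, written against 
the public MPI API rather than the internal MCA_PML_CALL macros (the 
buffer, count, extent and tag names here are my own shorthand, with rbuf/ 
sbuf as char* and rext/sext as extents in bytes; this is illustrative, not 
the patch itself):

    /* Post non-empty receives and sends in a rank-rotated order, so
       requests do not all queue up behind rank 0 first; then wait once. */
    int nreq = 0;
    for (int step = 1; step < size; ++step) {
        int recvfrom = (rank + size - step) % size;
        if (rcounts[recvfrom] > 0)
            MPI_Irecv(rbuf + rdispls[recvfrom] * rext, rcounts[recvfrom],
                      rdtype, recvfrom, tag, comm, &reqs[nreq++]);
    }
    for (int step = 1; step < size; ++step) {
        int sendto = (rank + step) % size;
        if (scounts[sendto] > 0)
            MPI_Isend(sbuf + sdispls[sendto] * sext, scounts[sendto],
                      sdtype, sendto, tag, comm, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);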


Simon
diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c	2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c	2013-01-24 15:12:13.299568308 +
@@ -70,7 +70,7 @@
 }

 /* Perform pairwise exchange starting from 1 since local exhange is done */
-for (step = 1; step < size + 1; step++) {
+for (step = 1; step < size; step++) {

 /* Determine sender and receiver for this step. */
 sendto  = (rank + step) % size;
diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c	2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c	2013-01-24 15:11:56.562118400 +
@@ -37,25 +37,31 @@
  ompi_status_public_t* status )

 { /* post receive first, then send, then waitall... should be fast (I hope) */
-int err, line = 0;
+int err, line = 0, nreq = 0;
 ompi_request_t* reqs[2];
 ompi_status_public_t statuses[2];

-/* post new irecv */
-err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag, 
-  comm, &reqs[0]));
-if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
-
-/* send data to children */
-err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag, 
-  MCA_PML_BASE_SEND_STANDARD, comm, &reqs[1]));
-if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+if (0 != rcount) {
+/* post new irecv */
+err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag, 
+  comm, &reqs[nreq++]));
+if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+}

-err = ompi_request_wait_all( 2, reqs, statuses );
-if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+if (0 != scount) {
+/* send data to children */
+err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag, 
+  MCA_PML_BASE_SEND_STANDARD, comm, &reqs[nreq++]));
+if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+}

-if (MPI_STATUS_IGNORE != status) {
-*status = statuses[0];
+if (0 != nreq) {
+err = ompi_request_wait_all( nreq, reqs, statuses );
+if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+
+if (MPI_STATUS_IGNORE != status) {
+*status = statuses[0];
+}
 }

 return (MPI_SUCCESS);
@@ -68,7 +74,7 @@
 if( MPI_ERR_IN_STATUS == err ) {
 /* At least we know the error was detected during the wait_all */
 int err_index = 0;
-if( MPI_SUCCESS != statuses[1].MPI_ERROR ) {
+if( nreq > 1 && MPI_SUCCESS != statuses[1].MPI_ERROR ) {
 err_index = 1;
 }
 if (MPI_STATUS_IGNORE != status) {
@@ -107,25 

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-23 Thread George Bosilca
[This archived message is truncated; only quoted text from earlier posts 
in the thread survives. See the original messages below.]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-22 Thread Number Cruncher
[This archived message is truncated; only quoted text from earlier posts 
in the thread survives. See the original messages below.]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-21 Thread George Bosilca
[This archived message is truncated; only quoted text from earlier posts 
in the thread survives. See the original messages below.]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-21 Thread Number Cruncher
I completely understand there's no "one size fits all", and I appreciate 
that there are workarounds to the change in behaviour. I'm only trying 
to make what little contribution I can by questioning the architecture 
of the pairwise algorithm.


I know that for every user you please, there will be some that aren't 
happy when a default changes (Windows 8 anyone?), but I'm trying to 
provide some real-world data. If 90% of apps are 10% faster and 10% are 
1000% slower, should the default change?


all_to_all is a really nice primitive from a developer point of view. 
Every process' code is symmetric and identical. Maybe I ought to worry 
that most of the exchange matrix is sparse; I probably could calculate 
an optimal exchange pattern myself. But I think this is the 
implementation's job, and I will continue to argue that *waiting* for 
each pairwise exchange is (a) unnecessary, (b) doesn't improve 
performance for *any* application and (c) at worst causes a huge slowdown 
relative to the previous algorithm for sparse cases.


In summary: I'm arguing that there's a problem with the pairwise 
implementation as it stands. It doesn't have any optimization for sparse 
all_to_all and imposes unnecessary synchronisation barriers in all cases.


Simon



On 20/12/2012 14:42, Iliev, Hristo wrote:

[quoted text elided; see Hristo Iliev's original 2012-12-20 post below]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-20 Thread Iliev, Hristo
Simon,

The goal of any MPI implementation is to be as fast as possible.
Unfortunately there is no "one size fits all" algorithm that works on all
networks and with all the possible peculiarities that your specific
communication scheme may have. That's why there are different algorithms and
you are given the option to select them dynamically at run time without the
need to recompile the code. I don't think the change of the default
algorithm (note that the pairwise algorithm has been there for many years -
it is not new, simply the new default) was introduced in order to piss
users off.

If you want OMPI to default to the previous algorithm:

1) Add this to the system-wide OMPI configuration file
$sysconf/openmpi-mca-params.conf (where $sysconf would most likely be
$PREFIX/etc, with $PREFIX being the OMPI installation directory):
coll_tuned_use_dynamic_rules = 1
coll_tuned_alltoallv_algorithm = 1

2) The settings from (1) can be overridden on a per-user basis by similar
settings in $HOME/.openmpi/mca-params.conf.

3) The settings from (1) and (2) can be overridden on a per-job basis by
exporting MCA parameters as environment variables:
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1

4) Finally, the settings from (1), (2), and (3) can be overridden for each
MPI program launch by supplying the appropriate MCA parameters to orterun
(a.k.a. mpirun and mpiexec).

There is also a largely undocumented feature of the "tuned" collective
component where a dynamic rules file can be supplied. In the file, a series
of cases tells the library which implementation to use based on the
communicator and message sizes. No idea if it works with ALLTOALLV.

Kind regards,
Hristo

(sorry for top posting - damn you, Outlook!)
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23,  D 52074  Aachen (Germany)

> [quoted text elided; see Number Cruncher's and the earlier original posts below]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Number Cruncher

On 19/12/12 11:08, Paul Kapinos wrote:
Did you *really* wanna dig into the code just in order to switch a 
default communication algorithm?


No, I didn't want to, but with a huge change in performance, I'm forced 
to do something! And having looked at the different algorithms, I think 
there's a problem with the new default whenever message sizes are small 
enough that connection latency will dominate. We're not all running 
Infiniband, and having to wait for each pairwise exchange to complete 
before initiating another seems wrong if the latency in waiting for 
completion dominates the transmission time.


E.g. if I have 10 small pairwise exchanges to perform, isn't it better to 
put all 10 outbound messages on the wire and wait for the 10 matching 
inbound messages, in any order? The new algorithm must wait for the first 
exchange to complete, then the second, then the third. Unlike before, 
does it not also have to wait to acknowledge each matching *zero-sized* 
request? I don't see why this temporal ordering matters.


Thanks for your help,
Simon






[remainder of the quoted thread elided; see the original posts from Paul 
Kapinos, Number Cruncher, and Hristo Iliev below]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Paul Kapinos
Did you *really* wanna dig into the code just in order to switch a default 
communication algorithm?


Note there are several ways to set the parameters; --mca on the command line is just 
one of them (suitable for quick online tests).


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables

Best
Paul Kapinos



On 12/19/12 11:44, Number Cruncher wrote:

[quoted text elided; see Number Cruncher's original post of 2012-12-19 11:44 below]
___

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Number Cruncher
Having run some more benchmarks, the new default is *really* bad for our 
application (2-10x slower), so I've been looking at the source to try 
and figure out why.


It seems that the biggest difference will occur when the all_to_all is 
actually sparse (e.g. our application): if most N-to-M process exchanges 
are zero in size, the 1.6.0 ompi_coll_tuned_alltoallv_intra_basic_linear 
algorithm will only post irecv/isend for non-zero exchanges; any 
zero-size exchanges are skipped. It then waits once for all requests 
to complete. In contrast, the new 
ompi_coll_tuned_alltoallv_intra_pairwise posts the zero-size 
exchanges for *every* N-to-M pair and waits for each pairwise exchange 
in turn. That is O(comm_size) waits, many of them for zero-size 
exchanges. I'm not clear what optimizations there are for zero-size 
isend/irecv, but surely there's a great deal more latency if each 
pairwise exchange has to be confirmed complete before executing the next?
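
In outline, the contrast is this (a paraphrase in user-level MPI, not the 
actual OMPI internals; the buffer/count/extent names are illustrative and 
the local-rank copy is omitted):

    /* 1.6.0 linear behaviour: skip empty exchanges, wait once at the end. */
    int nreq = 0;
    for (int peer = 0; peer < size; ++peer) {
        if (rcounts[peer] > 0)
            MPI_Irecv(rbuf + rdispls[peer] * rext, rcounts[peer], rdtype,
                      peer, tag, comm, &reqs[nreq++]);
        if (scounts[peer] > 0)
            MPI_Isend(sbuf + sdispls[peer] * sext, scounts[peer], sdtype,
                      peer, tag, comm, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    /* 1.6.1 pairwise behaviour: fully complete one exchange per step,
       zero-sized or not -- O(comm_size) sequential waits. */
    for (int step = 1; step < size; ++step) {
        int sendto   = (rank + step) % size;
        int recvfrom = (rank + size - step) % size;
        MPI_Sendrecv(sbuf + sdispls[sendto] * sext, scounts[sendto],
                     sdtype, sendto, tag,
                     rbuf + rdispls[recvfrom] * rext, rcounts[recvfrom],
                     rdtype, recvfrom, tag, comm, MPI_STATUS_IGNORE);
    }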


Relatedly, how would I direct OpenMPI to use the older algorithm 
programmatically? I don't want the user to have to use "--mca" in their 
"mpiexec". Is there a C API?


Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:

[quoted text elided; see Hristo Iliev's original 2012-11-16 post below]

Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-11-16 Thread Iliev, Hristo
Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion
with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). This said, not all algorithms
perform the same given a specific type of network interconnect. For example,
on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.

You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mca-params.conf or (to make it have
global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes
sense to activate process binding with --bind-to-core if you haven't
already done so. It prevents MPI processes from being migrated to other
NUMA nodes while running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23,  D 52074  Aachen (Germany)


> [quoted text elided; see the original 2012-11-15 post below]

[OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-11-15 Thread Number Cruncher
I've noticed a very significant (100%) slow down for MPI_Alltoallv calls 
as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet 
where process-to-process message sizes are fairly small (e.g. 100kbyte) 
and much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default 
algorithm to a pairwise exchange", but I'm not clear what this means or 
how to switch back to the old "non-default algorithm".


I attach a test program which illustrates the sort of usage in our MPI 
application. I have run this as 32 processes on four nodes, over 1Gb 
ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0,4,8,... 
on node 1, ranks 1,5,9,... on node 2, etc.


It constructs an array of integers and an nProcess x nProcess exchange 
matrix typical of part of our application. This is then exchanged several 
thousand times. Output from "mpicc -O3" runs is shown below.
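
(The attachment isn't reproduced in the archive; in outline, the pattern 
it exercises looks like the sketch below, where the counts arrays are 
mostly zero and niters is several thousand. The shape is assumed, not 
copied from the actual program.)

    /* Repeated sparse all-to-all: most entries of sendcounts[] are zero. */
    for (int iter = 0; iter < niters; ++iter)
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                      recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);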


My guess is that 1.6.1 is hitting additional latency not present in 
1.6.0. I also attach a plot showing network throughput on our actual 
mesh generation application. Nodes cfsc01-04 are running 1.6.0 and 
finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 
minutes later) and take over an hour to run. There seems to be much 
greater network demand in the 1.6.1 version, despite the user-code and 
input data being identical.


Thanks for any help you can give,
Simon

For 1.6.0:

Open MPI 1.6.0
[The full 32-column exchange matrix was garbled by line wrapping in the
archive; the recoverable leading/trailing counts and row totals are:]
Proc  0: 50 38 29 22 16 12  8  6  4  3  2  1  0 ...   Total: 198 x 100 int
Proc  1: 38 29 22 16 12  8  6  4  3  2  1  0 ...      Total: 148 x 100 int
Proc  2: 29 22 16 12  8  6  4  3  2  1  0 ...         Total: 109 x 100 int
Proc  3: 22 16 12  8  6  4  3  2  1  0 ...            Total:  80 x 100 int
Proc  4: 16 12  8  6  4  3  2  1  0 ...               Total:  58 x 100 int
Proc  5: 12  8  6  4  3  2  1  0 ...                  Total:  41 x 100 int
Proc  6:  8  6  4  3  2  1  0 ...                     Total:  29 x 100 int
Proc  7:  6  4  3  2  1  0 ...                        Total:  20 x 100 int
Proc  8:  4  3  2  1  0 ...                           Total:  14 x 100 int
Proc  9:  3  2  1  0 ...                              Total:   9 x 100 int
Proc 10:  2  1  0 ...                                 Total:   6 x 100 int
Proc 11:  1  0 ...                                    Total:   4 x 100 int
Proc 12:  0 ...                                       Total:   2 x 100 int
Proc 13:  0 ...                                       Total:   1 x 100 int
Proc 14:  0 ...                                       Total:   1 x 100 int
Proc 15:  0 ...                                       Total:   0 x 100 int
Proc 16:  0 ...                                       Total:   0 x 100 int
Proc 17:  0 ...                                       Total:   1 x 100 int
Proc 18:  0 ...                                       Total:   1 x 100 int
Proc 19:  0 ...                                       Total:   2 x 100 int
Proc 20:  ...  1                                      Total:   4 x 100 int
Proc 21:  ...  1  2                                   Total:   6 x 100 int
Proc 22:  ...  1  2  3                                Total:   9 x 100 int
Proc 23:  ...  1  2  3  4                             Total:  14 x 100 int
Proc 24:  ...  1  2  3  4  6                          Total:  20 x 100 int
Proc 25:  ...  1  2  3  4  6  8                       Total:  29 x 100 int
Proc 26:  ...  1  2  3  4  6  8 12                    Total:  41 x 100 int
Proc 27:  ...  1  2  3  4  6  8 12 16                 Total:  58 x 100 int
Proc 28:  ...  1  2  3  4  6  8 12 16 22              Total:  80 x 100 int
Proc 29:  ...  1  2  3  4  6  8 12 16 22 [remaining output truncated in the archive]