Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-20 Thread Kawashima, Takahiro
I've confirmed. Thanks.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Done -- thank you!
> 
> On Jan 11, 2013, at 3:52 AM, "Kawashima, Takahiro" 
>  wrote:
> 
> > Hi Open MPI core members and Rayson,
> > 
> > I've confirmed with the authors and created the BibTeX reference.
> > Could you add an entry to the "Open MPI Publications" page that
> > links to Fujitsu's PDF file? The attached file contains the
> > title, authors, abstract, link URL, and BibTeX reference.
> > 
> > Best regards,
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> > 
> >> Sorry for not replying sooner.
> >> I'm talking with the authors (they are not on this list) and
> >> will request linking the PDF soon if they allow.
> >> 
> >> Takahiro Kawashima,
> >> MPI development team,
> >> Fujitsu
> >> 
> >>> Our policy so far was that adding a paper to the list of publications on 
> >>> the Open MPI website was a discretionary action at the authors' request. 
> >>> I don't see any compelling reason to change. Moreover, Fujitsu being a 
> >>> contributor to the Open MPI community, there is no obstacle to adding a 
> >>> link to their paper -- at their request.
> >>> 
> >>>  George.
> >>> 
> >>> On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:
> >>> 
>  Hi Ralph,
>  
>  Since the whole journal is available online, and is reachable by
>  Google, I don't believe we can get into copyright issues by providing
>  a link to it (but then, I also know that there are countries that have
>  more crazy web page linking rules!).
>  
>  http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
>  
>  Rayson
>  
>  ==
>  Open Grid Scheduler - The Official Open Source Grid Engine
>  http://gridscheduler.sourceforge.net/
>  
>  Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
>  http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
>  
>  
>  On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
> > I'm unaware of any formal criteria. The papers currently located there 
> > are those written by members of the OMPI community, but we can 
> > certainly link to something written by someone else, so long as we 
> > don't get into copyright issues.
> > 
> > On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
> > 
> >> I found this paper recently, "MPI Library and Low-Level Communication
> >> on the K computer", available at:
> >> 
> >> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> >> 
> >> What are the criteria for adding papers to the "Open MPI Publications" 
> >> page?
> >> 
> >> Rayson


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-11 Thread Kawashima, Takahiro
Hi Open MPI core members and Rayson,

I've confirmed with the authors and created the BibTeX reference.
Could you add an entry to the "Open MPI Publications" page that
links to Fujitsu's PDF file? The attached file contains the
title, authors, abstract, link URL, and BibTeX reference.

Best regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Sorry for not replying sooner.
> I'm talking with the authors (they are not on this list) and
> will request linking the PDF soon if they allow.
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
> > Our policy so far was that adding a paper to the list of publications on the 
> > Open MPI website was a discretionary action at the authors' request. I 
> > don't see any compelling reason to change. Moreover, Fujitsu being a 
> > contributor to the Open MPI community, there is no obstacle to adding a 
> > link to their paper -- at their request.
> > 
> >   George.
> > 
> > On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:
> > 
> > > Hi Ralph,
> > > 
> > > Since the whole journal is available online, and is reachable by
> > > Google, I don't believe we can get into copyright issues by providing
> > > a link to it (but then, I also know that there are countries that have
> > > more crazy web page linking rules!).
> > > 
> > > http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
> > > 
> > > Rayson
> > > 
> > > ==
> > > Open Grid Scheduler - The Official Open Source Grid Engine
> > > http://gridscheduler.sourceforge.net/
> > > 
> > > Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
> > > http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
> > > 
> > > 
> > > On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
> > >> I'm unaware of any formal criteria. The papers currently located there 
> > >> are those written by members of the OMPI community, but we can certainly 
> > >> link to something written by someone else, so long as we don't get into 
> > >> copyright issues.
> > >> 
> > >> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
> > >> 
> > >>> I found this paper recently, "MPI Library and Low-Level Communication
> > >>> on the K computer", available at:
> > >>> 
> > >>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> > >>> 
> > >>> What are the criteria for adding papers to the "Open MPI Publications" 
> > >>> page?
> > >>> 
> > >>> Rayson

Title: MPI Library and Low-Level Communication on the K computer

Author(s): Naoyuki Shida, Shinji Sumimoto, Atsuya Uno

Abstract:

The key to raising application performance in a massively parallel system like the K computer is to increase the speed of communication between compute nodes. In the K computer, this inter-node communication is governed by the Message Passing Interface (MPI) communication library and low-level communication. This paper describes the implementation and performance of the MPI communication library, which exploits the new Tofu-interconnect architecture introduced in the K computer to enhance the performance of petascale applications, and low-level communication mechanism, which performs fine-grained control of the Tofu interconnect.

Paper:


paper11.pdf (PDF)



Presented: FUJITSU Scientific & Technical Journal, July 2012 (Vol. 48, No. 3)

Bibtex reference:


@Article{shida2012:mpi_kcomputer,
  author  = {Naoyuki Shida and Shinji Sumimoto and Atsuya Uno},
  title   = {{MPI} Library and Low-Level Communication on the {K computer}},
  journal = {FUJITSU Scientific \& Technical Journal},
  month   = {July},
  year    = {2012},
  volume  = {48},
  number  = {3},
  pages   = {324--330}
}







Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-10 Thread Kawashima, Takahiro
Hi,

Sorry for not replying sooner.
I'm talking with the authors (they are not on this list) and
will request linking the PDF soon if they allow.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Our policy so far was that adding a paper to the list of publications on the 
> Open MPI website was a discretionary action at the authors' request. I don't 
> see any compelling reason to change. Moreover, Fujitsu being a contributor to 
> the Open MPI community, there is no obstacle to adding a link to their paper 
> -- at their request.
> 
>   George.
> 
> On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:
> 
> > Hi Ralph,
> > 
> > Since the whole journal is available online, and is reachable by
> > Google, I don't believe we can get into copyright issues by providing
> > a link to it (but then, I also know that there are countries that have
> > more crazy web page linking rules!).
> > 
> > http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
> > 
> > Rayson
> > 
> > ==
> > Open Grid Scheduler - The Official Open Source Grid Engine
> > http://gridscheduler.sourceforge.net/
> > 
> > Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
> > http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
> > 
> > 
> > On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
> >> I'm unaware of any formal criteria. The papers currently located there are 
> >> those written by members of the OMPI community, but we can certainly link 
> >> to something written by someone else, so long as we don't get into 
> >> copyright issues.
> >> 
> >> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
> >> 
> >>> I found this paper recently, "MPI Library and Low-Level Communication
> >>> on the K computer", available at:
> >>> 
> >>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> >>> 
> >>> What are the criteria for adding papers to the "Open MPI Publications" 
> >>> page?
> >>> 
> >>> Rayson
> >>> 
> >>> ==
> >>> Open Grid Scheduler - The Official Open Source Grid Engine
> >>> http://gridscheduler.sourceforge.net/
> >>> 
> >>> 
> >>> On Fri, Nov 18, 2011 at 5:32 AM, George Bosilca  
> >>> wrote:
>  Dear Yuki and Takahiro,
>  
>  Thanks for the bug report and for the patch. I pushed a [nearly 
>  identical] patch in the trunk in 
>  https://svn.open-mpi.org/trac/ompi/changeset/25488. A special version 
>  for the 1.4 has been prepared and has been attached to the ticket #2916 
>  (https://svn.open-mpi.org/trac/ompi/ticket/2916).
>  
>  Thanks,
>  george.
>  
>  
>  On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:
>  
> > Dear Open MPI community,
> > 
> > I'm a member of MPI library development team in Fujitsu,
> > Takahiro Kawashima, who sent mail before, is my colleague.
> > We start to feed back.
> > 
> > First, we fixed about MPI_LB/MPI_UB and data packing problem.
> > 
> > Program crashes when it meets all of the following conditions:
> > a: The type of sending data is contiguous and derived type.
> > b: Either or both of MPI_LB and MPI_UB is used in the data type.
> > c: The size of sending data is smaller than extent(Data type has gap).
> > d: Send-count is bigger than 1.
> > e: Total size of data is bigger than "eager limit"
> > 
> > This problem occurs in attachment C program.
> > 
> > An incorrect-address accessing occurs
> > because an unintended value of "done" inputs and
> > the value of "max_allowd" becomes minus
> > in the following place in "ompi/datatype/datatype_pack.c(in version 
> > 1.4.3)".
> > 
> > 
> > (ompi/datatype/datatype_pack.c)
> > 188 packed_buffer = (unsigned char *) 
> > iov[iov_count].iov_base;
> > 189 done = pConv->bConverted - i * pData->size;  /* partial 
> > data from last pack */
> > 190 if( done != 0 ) {  /* still some data to copy from the 
> > last time */
> > 191 done = pData->size - done;
> > 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
> > pConv->pBaseBuf, pData, pConv->count );
> > 193 MEMCPY_CSUM( packed_buffer, user_memory, done, 
> > pConv );
> > 194 packed_buffer += done;
> > 195 max_allowed -= done;
> > 196 total_bytes_converted += done;
> > 197 user_memory += (extent - pData->size + done);
> > 198 }
> > 
> > This program assumes "done" as the size of partial data from last pack.
> > However, when the program crashes, "done" equals the sum of all 
> > transmitted data size.
> > It makes "max_allowed" to be a negative value.
> > 
> 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-10 Thread George Bosilca
Our policy so far was that adding a paper to the list of publications on the 
Open MPI website was a discretionary action at the authors' request. I don't 
see any compelling reason to change. Moreover, Fujitsu being a contributor to 
the Open MPI community, there is no obstacle to adding a link to their paper -- 
at their request.

  George.

On Jan 10, 2013, at 00:15 , Rayson Ho  wrote:

> Hi Ralph,
> 
> Since the whole journal is available online, and is reachable by
> Google, I don't believe we can get into copyright issues by providing
> a link to it (but then, I also know that there are countries that have
> more crazy web page linking rules!).
> 
> http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html
> 
> Rayson
> 
> ==
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> 
> Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
> http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html
> 
> 
> On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
>> I'm unaware of any formal criteria. The papers currently located there are 
>> those written by members of the OMPI community, but we can certainly link to 
>> something written by someone else, so long as we don't get into copyright 
>> issues.
>> 
>> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
>> 
>>> I found this paper recently, "MPI Library and Low-Level Communication
>>> on the K computer", available at:
>>> 
>>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
>>> 
>>> What are the criteria for adding papers to the "Open MPI Publications" page?
>>> 
>>> Rayson
>>> 
>>> ==
>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>> http://gridscheduler.sourceforge.net/
>>> 
>>> 
>>> On Fri, Nov 18, 2011 at 5:32 AM, George Bosilca  
>>> wrote:
 Dear Yuki and Takahiro,
 
 Thanks for the bug report and for the patch. I pushed a [nearly identical] 
 patch in the trunk in https://svn.open-mpi.org/trac/ompi/changeset/25488. 
 A special version for the 1.4 has been prepared and has been attached to 
 the ticket #2916 (https://svn.open-mpi.org/trac/ompi/ticket/2916).
 
 Thanks,
 george.
 
 
 On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:
 
> Dear Open MPI community,
> 
> I'm a member of MPI library development team in Fujitsu,
> Takahiro Kawashima, who sent mail before, is my colleague.
> We start to feed back.
> 
> First, we fixed about MPI_LB/MPI_UB and data packing problem.
> 
> Program crashes when it meets all of the following conditions:
> a: The type of sending data is contiguous and derived type.
> b: Either or both of MPI_LB and MPI_UB is used in the data type.
> c: The size of sending data is smaller than extent(Data type has gap).
> d: Send-count is bigger than 1.
> e: Total size of data is bigger than "eager limit"
> 
> This problem occurs in attachment C program.
> 
> An incorrect-address accessing occurs
> because an unintended value of "done" inputs and
> the value of "max_allowd" becomes minus
> in the following place in "ompi/datatype/datatype_pack.c(in version 
> 1.4.3)".
> 
> 
> (ompi/datatype/datatype_pack.c)
> 188 packed_buffer = (unsigned char *) iov[iov_count].iov_base;
> 189 done = pConv->bConverted - i * pData->size;  /* partial 
> data from last pack */
> 190 if( done != 0 ) {  /* still some data to copy from the 
> last time */
> 191 done = pData->size - done;
> 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
> pConv->pBaseBuf, pData, pConv->count );
> 193 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv 
> );
> 194 packed_buffer += done;
> 195 max_allowed -= done;
> 196 total_bytes_converted += done;
> 197 user_memory += (extent - pData->size + done);
> 198 }
> 
> This program assumes "done" as the size of partial data from last pack.
> However, when the program crashes, "done" equals the sum of all 
> transmitted data size.
> It makes "max_allowed" to be a negative value.
> 
> We modified the code as following and it passed our test suite.
> But we are not sure this fix is correct. Can anyone review this fix?
> Patch (against Open MPI 1.4 branch) is attached to this mail.
> 
> -if( done != 0 ) {  /* still some data to copy from the last 
> time */
> +if( (done + max_allowed) >= pData->size ) {  /* still some 
> data to copy from the last time */
> 
> Best 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2013-01-09 Thread Rayson Ho
Hi Ralph,

Since the whole journal is available online, and is reachable by
Google, I don't believe we can get into copyright issues by providing
a link to it (but then, I also know that there are countries that have
more crazy web page linking rules!).

http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol48-3.html

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Cloud HPC: 10,000-node OGS/GE Amazon EC2 cluster
http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html


On Thu, Sep 20, 2012 at 6:46 AM, Ralph Castain  wrote:
> I'm unaware of any formal criteria. The papers currently located there are 
> those written by members of the OMPI community, but we can certainly link to 
> something written by someone else, so long as we don't get into copyright 
> issues.
>
> On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:
>
>> I found this paper recently, "MPI Library and Low-Level Communication
>> on the K computer", available at:
>>
>> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
>>
>> What are the criteria for adding papers to the "Open MPI Publications" page?
>>
>> Rayson
>>
>> ==
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>>
>> On Fri, Nov 18, 2011 at 5:32 AM, George Bosilca  wrote:
>>> Dear Yuki and Takahiro,
>>>
>>> Thanks for the bug report and for the patch. I pushed a [nearly identical] 
>>> patch in the trunk in https://svn.open-mpi.org/trac/ompi/changeset/25488. A 
>>> special version for the 1.4 has been prepared and has been attached to the 
>>> ticket #2916 (https://svn.open-mpi.org/trac/ompi/ticket/2916).
>>>
>>>  Thanks,
>>>  george.
>>>
>>>
>>> On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:
>>>
 Dear Open MPI community,

 I'm a member of MPI library development team in Fujitsu,
 Takahiro Kawashima, who sent mail before, is my colleague.
 We start to feed back.

 First, we fixed about MPI_LB/MPI_UB and data packing problem.

 Program crashes when it meets all of the following conditions:
 a: The type of sending data is contiguous and derived type.
 b: Either or both of MPI_LB and MPI_UB is used in the data type.
 c: The size of sending data is smaller than extent(Data type has gap).
 d: Send-count is bigger than 1.
 e: Total size of data is bigger than "eager limit"

 This problem occurs in attachment C program.

 An incorrect-address accessing occurs
 because an unintended value of "done" inputs and
 the value of "max_allowd" becomes minus
 in the following place in "ompi/datatype/datatype_pack.c(in version 
 1.4.3)".


 (ompi/datatype/datatype_pack.c)
 188 packed_buffer = (unsigned char *) iov[iov_count].iov_base;
 189 done = pConv->bConverted - i * pData->size;  /* partial 
 data from last pack */
 190 if( done != 0 ) {  /* still some data to copy from the 
 last time */
 191 done = pData->size - done;
 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
 pConv->pBaseBuf, pData, pConv->count );
 193 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
 194 packed_buffer += done;
 195 max_allowed -= done;
 196 total_bytes_converted += done;
 197 user_memory += (extent - pData->size + done);
 198 }

 This program assumes "done" as the size of partial data from last pack.
 However, when the program crashes, "done" equals the sum of all 
 transmitted data size.
 It makes "max_allowed" to be a negative value.

 We modified the code as following and it passed our test suite.
 But we are not sure this fix is correct. Can anyone review this fix?
 Patch (against Open MPI 1.4 branch) is attached to this mail.

 -if( done != 0 ) {  /* still some data to copy from the last 
 time */
 +if( (done + max_allowed) >= pData->size ) {  /* still some 
 data to copy from the last time */

 Best regards,

 Yuki MATSUMOTO
 MPI development team,
 Fujitsu

 (2011/06/28 10:58), Takahiro Kawashima wrote:
> Dear Open MPI community,
>
> I'm a member of MPI library development team in Fujitsu. Shinji
> Sumimoto, whose name appears in Jeff's blog, is one of our bosses.
>
> As Rayson and Jeff noted, K computer, world's most powerful HPC system
> developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
> library. We, Fujitsu, are pleased to announce that, and also have special
> thanks to Open MPI 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2012-09-20 Thread Ralph Castain
I'm unaware of any formal criteria. The papers currently located there are 
those written by members of the OMPI community, but we can certainly link to 
something written by someone else, so long as we don't get into copyright 
issues.

On Sep 19, 2012, at 11:57 PM, Rayson Ho  wrote:

> I found this paper recently, "MPI Library and Low-Level Communication
> on the K computer", available at:
> 
> http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf
> 
> What are the criteria for adding papers to the "Open MPI Publications" page?
> 
> Rayson
> 
> ==
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> 
> 
> On Fri, Nov 18, 2011 at 5:32 AM, George Bosilca  wrote:
>> Dear Yuki and Takahiro,
>> 
>> Thanks for the bug report and for the patch. I pushed a [nearly identical] 
>> patch in the trunk in https://svn.open-mpi.org/trac/ompi/changeset/25488. A 
>> special version for the 1.4 has been prepared and has been attached to the 
>> ticket #2916 (https://svn.open-mpi.org/trac/ompi/ticket/2916).
>> 
>>  Thanks,
>>  george.
>> 
>> 
>> On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:
>> 
>>> Dear Open MPI community,
>>> 
>>> I'm a member of MPI library development team in Fujitsu,
>>> Takahiro Kawashima, who sent mail before, is my colleague.
>>> We start to feed back.
>>> 
>>> First, we fixed about MPI_LB/MPI_UB and data packing problem.
>>> 
>>> Program crashes when it meets all of the following conditions:
>>> a: The type of sending data is contiguous and derived type.
>>> b: Either or both of MPI_LB and MPI_UB is used in the data type.
>>> c: The size of sending data is smaller than extent(Data type has gap).
>>> d: Send-count is bigger than 1.
>>> e: Total size of data is bigger than "eager limit"
>>> 
>>> This problem occurs in attachment C program.
>>> 
>>> An incorrect-address accessing occurs
>>> because an unintended value of "done" inputs and
>>> the value of "max_allowd" becomes minus
>>> in the following place in "ompi/datatype/datatype_pack.c(in version 1.4.3)".
>>> 
>>> 
>>> (ompi/datatype/datatype_pack.c)
>>> 188 packed_buffer = (unsigned char *) iov[iov_count].iov_base;
>>> 189 done = pConv->bConverted - i * pData->size;  /* partial 
>>> data from last pack */
>>> 190 if( done != 0 ) {  /* still some data to copy from the last 
>>> time */
>>> 191 done = pData->size - done;
>>> 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
>>> pConv->pBaseBuf, pData, pConv->count );
>>> 193 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
>>> 194 packed_buffer += done;
>>> 195 max_allowed -= done;
>>> 196 total_bytes_converted += done;
>>> 197 user_memory += (extent - pData->size + done);
>>> 198 }
>>> 
>>> This program assumes "done" as the size of partial data from last pack.
>>> However, when the program crashes, "done" equals the sum of all transmitted 
>>> data size.
>>> It makes "max_allowed" to be a negative value.
>>> 
>>> We modified the code as following and it passed our test suite.
>>> But we are not sure this fix is correct. Can anyone review this fix?
>>> Patch (against Open MPI 1.4 branch) is attached to this mail.
>>> 
>>> -if( done != 0 ) {  /* still some data to copy from the last 
>>> time */
>>> +if( (done + max_allowed) >= pData->size ) {  /* still some 
>>> data to copy from the last time */
>>> 
>>> Best regards,
>>> 
>>> Yuki MATSUMOTO
>>> MPI development team,
>>> Fujitsu
>>> 
>>> (2011/06/28 10:58), Takahiro Kawashima wrote:
 Dear Open MPI community,
 
 I'm a member of MPI library development team in Fujitsu. Shinji
 Sumimoto, whose name appears in Jeff's blog, is one of our bosses.
 
 As Rayson and Jeff noted, K computer, world's most powerful HPC system
 developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
 library. We, Fujitsu, are pleased to announce that, and also have special
 thanks to Open MPI community.
 We are sorry to be late announce!
 
 Our MPI library is based on Open MPI 1.4 series, and has a new point-
 to-point component (BTL) and new topology-aware collective communication
 algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
 PLM, GRPCOMM etc).
 
 K computer connects 68,544 nodes by our custom interconnect.
 Its runtime environment is our proprietary one. So we don't use orted.
 We cannot tell start-up time yet because of disclosure restriction, sorry.
 
 We are surprised by the extensibility of Open MPI, and have proved that
 Open MPI is scalable to 68,000 processes level! We feel pleasure to
 utilize such a great open-source software.
 
 We cannot tell detail of our technology yet because of our 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2012-09-20 Thread Rayson Ho
I found this paper recently, "MPI Library and Low-Level Communication
on the K computer", available at:

http://www.fujitsu.com/downloads/MAG/vol48-3/paper11.pdf

What are the criteria for adding papers to the "Open MPI Publications" page?

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/


On Fri, Nov 18, 2011 at 5:32 AM, George Bosilca  wrote:
> Dear Yuki and Takahiro,
>
> Thanks for the bug report and for the patch. I pushed a [nearly identical] 
> patch in the trunk in https://svn.open-mpi.org/trac/ompi/changeset/25488. A 
> special version for the 1.4 has been prepared and has been attached to the 
> ticket #2916 (https://svn.open-mpi.org/trac/ompi/ticket/2916).
>
>   Thanks,
>   george.
>
>
> On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:
>
>> Dear Open MPI community,
>>
>> I'm a member of MPI library development team in Fujitsu,
>> Takahiro Kawashima, who sent mail before, is my colleague.
>> We start to feed back.
>>
>> First, we fixed about MPI_LB/MPI_UB and data packing problem.
>>
>> Program crashes when it meets all of the following conditions:
>> a: The type of sending data is contiguous and derived type.
>> b: Either or both of MPI_LB and MPI_UB is used in the data type.
>> c: The size of sending data is smaller than extent(Data type has gap).
>> d: Send-count is bigger than 1.
>> e: Total size of data is bigger than "eager limit"
>>
>> This problem occurs in attachment C program.
>>
>> An incorrect-address accessing occurs
>> because an unintended value of "done" inputs and
>> the value of "max_allowd" becomes minus
>> in the following place in "ompi/datatype/datatype_pack.c(in version 1.4.3)".
>>
>>
>> (ompi/datatype/datatype_pack.c)
>> 188 packed_buffer = (unsigned char *) iov[iov_count].iov_base;
>> 189 done = pConv->bConverted - i * pData->size;  /* partial data 
>> from last pack */
>> 190 if( done != 0 ) {  /* still some data to copy from the last 
>> time */
>> 191 done = pData->size - done;
>> 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
>> pConv->pBaseBuf, pData, pConv->count );
>> 193 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
>> 194 packed_buffer += done;
>> 195 max_allowed -= done;
>> 196 total_bytes_converted += done;
>> 197 user_memory += (extent - pData->size + done);
>> 198 }
>>
>> This program assumes "done" as the size of partial data from last pack.
>> However, when the program crashes, "done" equals the sum of all transmitted 
>> data size.
>> It makes "max_allowed" to be a negative value.
>>
>> We modified the code as following and it passed our test suite.
>> But we are not sure this fix is correct. Can anyone review this fix?
>> Patch (against Open MPI 1.4 branch) is attached to this mail.
>>
>> -if( done != 0 ) {  /* still some data to copy from the last 
>> time */
>> +if( (done + max_allowed) >= pData->size ) {  /* still some data 
>> to copy from the last time */
>>
>> Best regards,
>>
>> Yuki MATSUMOTO
>> MPI development team,
>> Fujitsu
>>
>> (2011/06/28 10:58), Takahiro Kawashima wrote:
>>> Dear Open MPI community,
>>>
>>> I'm a member of MPI library development team in Fujitsu. Shinji
>>> Sumimoto, whose name appears in Jeff's blog, is one of our bosses.
>>>
>>> As Rayson and Jeff noted, K computer, world's most powerful HPC system
>>> developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
>>> library. We, Fujitsu, are pleased to announce that, and also have special
>>> thanks to Open MPI community.
>>> We are sorry to be late announce!
>>>
>>> Our MPI library is based on Open MPI 1.4 series, and has a new point-
>>> to-point component (BTL) and new topology-aware collective communication
>>> algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
>>> PLM, GRPCOMM etc).
>>>
>>> K computer connects 68,544 nodes by our custom interconnect.
>>> Its runtime environment is our proprietary one. So we don't use orted.
>>> We cannot tell start-up time yet because of disclosure restriction, sorry.
>>>
>>> We are surprised by the extensibility of Open MPI, and have proved that
>>> Open MPI is scalable to 68,000 processes level! We feel pleasure to
>>> utilize such a great open-source software.
>>>
>>> We cannot tell detail of our technology yet because of our contract
>>> with RIKEN AICS, however, we will plan to feedback of our improvements
>>> and bug fixes. We can contribute some bug fixes soon, however, for
>>> contribution of our improvements will be next year with Open MPI
>>> agreement.
>>>
>>> Best regards,
>>>
>>> MPI development team,
>>> Fujitsu
>>>
>>>
 I got more information:

http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/

 Short version: yes, Open MPI is 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-11-18 Thread George Bosilca
Dear Yuki and Takahiro,

Thanks for the bug report and for the patch. I pushed a [nearly identical] 
patch in the trunk in https://svn.open-mpi.org/trac/ompi/changeset/25488. A 
special version for the 1.4 has been prepared and has been attached to the 
ticket #2916 (https://svn.open-mpi.org/trac/ompi/ticket/2916).

  Thanks,
  george.


On Nov 14, 2011, at 02:27 , Y.MATSUMOTO wrote:

> Dear Open MPI community,
> 
> I'm a member of MPI library development team in Fujitsu,
> Takahiro Kawashima, who sent mail before, is my colleague.
> We start to feed back.
> 
> First, we fixed about MPI_LB/MPI_UB and data packing problem.
> 
> Program crashes when it meets all of the following conditions:
> a: The type of sending data is contiguous and derived type.
> b: Either or both of MPI_LB and MPI_UB is used in the data type.
> c: The size of sending data is smaller than extent(Data type has gap).
> d: Send-count is bigger than 1.
> e: Total size of data is bigger than "eager limit"
> 
> This problem occurs in attachment C program.
> 
> An incorrect-address accessing occurs
> because an unintended value of "done" inputs and
> the value of "max_allowd" becomes minus
> in the following place in "ompi/datatype/datatype_pack.c(in version 1.4.3)".
> 
> 
> (ompi/datatype/datatype_pack.c)
> 188 packed_buffer = (unsigned char *) iov[iov_count].iov_base;
> 189 done = pConv->bConverted - i * pData->size;  /* partial data 
> from last pack */
> 190 if( done != 0 ) {  /* still some data to copy from the last 
> time */
> 191 done = pData->size - done;
> 192 OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, 
> pConv->pBaseBuf, pData, pConv->count );
> 193 MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
> 194 packed_buffer += done;
> 195 max_allowed -= done;
> 196 total_bytes_converted += done;
> 197 user_memory += (extent - pData->size + done);
> 198 }
> 
> This program assumes "done" as the size of partial data from last pack.
> However, when the program crashes, "done" equals the sum of all transmitted 
> data size.
> It makes "max_allowed" to be a negative value.
> 
> We modified the code as following and it passed our test suite.
> But we are not sure this fix is correct. Can anyone review this fix?
> Patch (against Open MPI 1.4 branch) is attached to this mail.
> 
> -if( done != 0 ) {  /* still some data to copy from the last time 
> */
> +if( (done + max_allowed) >= pData->size ) {  /* still some data 
> to copy from the last time */
> 
> Best regards,
> 
> Yuki MATSUMOTO
> MPI development team,
> Fujitsu
> 
> (2011/06/28 10:58), Takahiro Kawashima wrote:
>> Dear Open MPI community,
>> 
>> I'm a member of MPI library development team in Fujitsu. Shinji
>> Sumimoto, whose name appears in Jeff's blog, is one of our bosses.
>> 
>> As Rayson and Jeff noted, K computer, world's most powerful HPC system
>> developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
>> library. We, Fujitsu, are pleased to announce that, and also have special
>> thanks to Open MPI community.
>> We are sorry to be late announce!
>> 
>> Our MPI library is based on Open MPI 1.4 series, and has a new point-
>> to-point component (BTL) and new topology-aware collective communication
>> algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
>> PLM, GRPCOMM etc).
>> 
>> K computer connects 68,544 nodes by our custom interconnect.
>> Its runtime environment is our proprietary one. So we don't use orted.
>> We cannot tell start-up time yet because of disclosure restriction, sorry.
>> 
>> We are surprised by the extensibility of Open MPI, and have proved that
>> Open MPI is scalable to 68,000 processes level! We feel pleasure to
>> utilize such a great open-source software.
>> 
>> We cannot tell detail of our technology yet because of our contract
>> with RIKEN AICS, however, we will plan to feedback of our improvements
>> and bug fixes. We can contribute some bug fixes soon, however, for
>> contribution of our improvements will be next year with Open MPI
>> agreement.
>> 
>> Best regards,
>> 
>> MPI development team,
>> Fujitsu
>> 
>> 
>>> I got more information:
>>> 
>>>http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/
>>> 
>>> Short version: yes, Open MPI is used on K and was used to power the 8PF 
>>> runs.
>>> 
>>> w00t!
>>> 
>>> 
>>> 
>>> On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:
>>> 
 w00t!
 
 OMPI powers 8 petaflops!
 (at least I'm guessing that -- does anyone know if that's true?)
 
 
 On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
 
> Interesting... page 11:
> 
> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
> 
> Open MPI based:
> 
> * Open Standard, Open Source, Multi-Platform including PC Cluster.
> * Adding extension to 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-11-14 Thread Christopher Samuel

On 14/11/11 21:27, Y.MATSUMOTO wrote:

> I'm a member of the MPI library development team at Fujitsu.
> Takahiro Kawashima, who sent mail before, is my colleague.
> We are starting to feed back our fixes.

First of all I'd like to say congratulations on breaking
10PF, and also a big thanks for working on contributing
changes back to Open MPI!

Whilst I can't comment on the fix, I can confirm that I also
see segfaults with Open MPI 1.4.2 and 1.4.4 with your example
program.

Intel compilers 11.1:

- --
[bruce002:03973] *** Process received signal ***
[bruce002:03973] Signal: Segmentation fault (11)
[bruce002:03973] Signal code: Address not mapped (1)
[bruce002:03973] Failing at address: 0x10009
[bruce002:03973] [ 0] /lib64/libpthread.so.0 [0x3e1320eb10]
[bruce002:03973] [ 1] /usr/local/openmpi/1.4.4-intel/lib/libmpi.so.0 
[0x2ab5d79d]
[bruce002:03973] [ 2] 
/usr/local/openmpi/1.4.4-intel/lib/libopen-pal.so.0(opal_progress+0x87) 
[0x2b1fdc27]
[bruce002:03973] [ 3] /usr/local/openmpi/1.4.4-intel/lib/libmpi.so.0 
[0x2abce252]
[bruce002:03973] [ 4] 
/usr/local/openmpi/1.4.4-intel/lib/libmpi.so.0(PMPI_Recv+0x213) [0x2ab1e0f3]
[bruce002:03973] [ 5] ./tp_lb_ub_ng(main+0x29b) [0x4021ab]
[bruce002:03973] [ 6] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3e12a1d994]
[bruce002:03973] [ 7] ./tp_lb_ub_ng [0x401e59]
[bruce002:03973] *** End of error message ***
- --
mpiexec noticed that process rank 1 with PID 3973 on node bruce002 exited on 
signal 11 (Segmentation fault).
- --
[bruce002:03972] *** Process received signal ***
[bruce002:03972] Signal: Segmentation fault (11)
[bruce002:03972] Signal code: Address not mapped (1)
[bruce002:03972] Failing at address: 0xff84bad0
[bruce002:03972] [ 0] /lib64/libpthread.so.0 [0x3e1320eb10]
[bruce002:03972] [ 1] ./tp_lb_ub_ng(__intel_new_memcpy+0x2c) [0x403c9c]
[bruce002:03972] *** End of error message ***


GCC 4.4.4:

- --
[bruce002:04049] *** Process received signal ***
[bruce002:04049] Signal: Segmentation fault (11)
[bruce002:04049] Signal code: Address not mapped (1)
[bruce002:04049] Failing at address: 0x10009
[bruce002:04049] [ 0] /lib64/libpthread.so.0 [0x3e1320eb10]
[bruce002:04049] [ 1] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2ab51f27]
[bruce002:04049] [ 2] 
/usr/local/openmpi/1.4.4-gcc/lib/libopen-pal.so.0(opal_progress+0x5a) 
[0x2b14bb3a]
[bruce002:04049] [ 3] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2abb9985]
[bruce002:04049] [ 4] 
/usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0(PMPI_Recv+0x12f) [0x2ab1913f]
[bruce002:04049] [ 5] ./tp_lb_ub_ng(main+0x21c) [0x400dd0]
[bruce002:04049] [ 6] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3e12a1d994]
[bruce002:04049] [ 7] ./tp_lb_ub_ng [0x400af9]
[bruce002:04049] *** End of error message ***
- --
mpiexec noticed that process rank 1 with PID 4049 on node bruce002 exited on 
signal 11 (Segmentation fault).
- --
[bruce002:04048] *** Process received signal ***
[bruce002:04048] Signal: Segmentation fault (11)
[bruce002:04048] Signal code: Address not mapped (1)
[bruce002:04048] Failing at address: 0x2aaab0833000
[bruce002:04048] [ 0] /lib64/libpthread.so.0 [0x3e1320eb10]
[bruce002:04048] [ 1] /lib64/libc.so.6(memcpy+0x3ff) [0x3e12a7c63f]
[bruce002:04048] [ 2] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2aafef7b]
[bruce002:04048] [ 3] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2ab4fcdd]
[bruce002:04048] [ 4] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2abc1563]
[bruce002:04048] [ 5] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2abbce78]
[bruce002:04048] [ 6] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2ab52036]
[bruce002:04048] [ 7] 
/usr/local/openmpi/1.4.4-gcc/lib/libopen-pal.so.0(opal_progress+0x5a) 
[0x2b14bb3a]
[bruce002:04048] [ 8] /usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0 
[0x2abba5f5]
[bruce002:04048] [ 9] 
/usr/local/openmpi/1.4.4-gcc/lib/libmpi.so.0(MPI_Send+0x177) [0x2ab1b1d7]
[bruce002:04048] [10] ./tp_lb_ub_ng(main+0x1e4) [0x400d98]
[bruce002:04048] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3e12a1d994]
[bruce002:04048] [12] ./tp_lb_ub_ng [0x400af9]
[bruce002:04048] *** End of error message ***


-- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-11-14 Thread Y.MATSUMOTO

Dear Open MPI community,

I'm a member of the MPI library development team at Fujitsu.
Takahiro Kawashima, who sent mail before, is my colleague.
We are starting to feed back our fixes.

First, we fixed a problem with MPI_LB/MPI_UB and data packing.

The program crashes when all of the following conditions are met:
a: The type of the sent data is a contiguous, derived type.
b: MPI_LB and/or MPI_UB is used in the data type.
c: The size of the sent data is smaller than its extent (the data type has a gap).
d: The send count is bigger than 1.
e: The total size of the data is bigger than the "eager limit".

This problem occurs in the attached C program; a minimal sketch along these lines is shown below.
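
(The attached program is not reproduced in this archive. The following is only an
illustrative sketch, assuming an MPI_Type_struct with an MPI_UB marker; the file
name, sizes, and count are made up, not taken from the actual reproducer.)

/* tp_lb_ub_sketch.c -- illustrative sketch only, not the attached reproducer.
 * Builds a derived datatype whose extent (64 bytes) is larger than its data
 * (4 ints = 16 bytes) by adding an MPI_UB marker (deprecated, but valid in the
 * Open MPI 1.4 era), then sends count > 1 of it so that the total data size
 * exceeds a typical eager limit. Run with 2 processes. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    int blocklens[2]      = { 4, 1 };            /* 4 ints of data + the MPI_UB marker */
    MPI_Aint displs[2]    = { 0, 64 };           /* data size 16 bytes, extent 64 bytes */
    MPI_Datatype types[2] = { MPI_INT, MPI_UB };
    MPI_Datatype gaptype;
    int count = 8192;                            /* count > 1; 16 * 8192 bytes > eager limit */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_struct(2, blocklens, displs, types, &gaptype);
    MPI_Type_commit(&gaptype);

    buf = malloc((size_t)64 * count);
    memset(buf, 0, (size_t)64 * count);

    if (rank == 0)
        MPI_Send(buf, count, gaptype, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, count, gaptype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&gaptype);
    free(buf);
    MPI_Finalize();
    return 0;
}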

An access to an incorrect address occurs
because "done" receives an unintended value and
"max_allowed" becomes negative
at the following place in "ompi/datatype/datatype_pack.c" (in version 1.4.3).


(ompi/datatype/datatype_pack.c)
188         packed_buffer = (unsigned char *) iov[iov_count].iov_base;
189         done = pConv->bConverted - i * pData->size;  /* partial data from last pack */
190         if( done != 0 ) {  /* still some data to copy from the last time */
191             done = pData->size - done;
192             OMPI_DDT_SAFEGUARD_POINTER( user_memory, done, pConv->pBaseBuf, pData, pConv->count );
193             MEMCPY_CSUM( packed_buffer, user_memory, done, pConv );
194             packed_buffer += done;
195             max_allowed -= done;
196             total_bytes_converted += done;
197             user_memory += (extent - pData->size + done);
198         }

This code assumes "done" is the size of the partial data from the last pack.
However, when the program crashes, "done" equals the total size of all
transmitted data, which makes "max_allowed" negative.

We modified the code as follows and it passed our test suite.
But we are not sure this fix is correct. Can anyone review it?
A patch (against the Open MPI 1.4 branch) is attached to this mail.

-if( done != 0 ) {  /* still some data to copy from the last time */
+if( (done + max_allowed) >= pData->size ) {  /* still some data to copy from the last time */

Best regards,

Yuki MATSUMOTO
MPI development team,
Fujitsu

(2011/06/28 10:58), Takahiro Kawashima wrote:

Dear Open MPI community,

I'm a member of the MPI library development team at Fujitsu. Shinji
Sumimoto, whose name appears in Jeff's blog, is one of our bosses.

As Rayson and Jeff noted, the K computer, the world's most powerful HPC system,
developed by RIKEN and Fujitsu, uses Open MPI as the base of its MPI
library. We, Fujitsu, are pleased to announce that, and we also give special
thanks to the Open MPI community.
We are sorry for the late announcement!

Our MPI library is based on Open MPI 1.4 series, and has a new point-
to-point component (BTL) and new topology-aware collective communication
algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
PLM, GRPCOMM etc).

The K computer connects 68,544 nodes with our custom interconnect.
Its runtime environment is our proprietary one, so we don't use orted.
We cannot tell the start-up time yet because of disclosure restrictions, sorry.

We are surprised by the extensibility of Open MPI, and have proved that
Open MPI is scalable to the 68,000-process level! It is a pleasure to
use such great open-source software.

We cannot tell the details of our technology yet because of our contract
with RIKEN AICS; however, we plan to feed back our improvements
and bug fixes. We can contribute some bug fixes soon, but the
contribution of our improvements will come next year, with the Open MPI
agreement.

Best regards,

MPI development team,
Fujitsu



I got more information:

http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/

Short version: yes, Open MPI is used on K and was used to power the 8PF runs.

w00t!



On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:


w00t!

OMPI powers 8 petaflops!
(at least I'm guessing that -- does anyone know if that's true?)


On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:


Interesting... page 11:

http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf

Open MPI based:

* Open Standard, Open Source, Multi-Platform including PC Cluster.
* Adding extension to Open MPI for "Tofu" interconnect

Rayson

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Index: ompi/datatype/datatype_pack.c
===
--- ompi/datatype/datatype_pack.c   (revision 25474)
+++ ompi/datatype/datatype_pack.c   (working copy)
@@ -187,7 +187,7 @@
 
 packed_buffer = (unsigned char *) iov[iov_count].iov_base;
 done = pConv->bConverted - i * pData->size;  /* partial data from last pack */
-if( done != 0 ) {  /* still some data to copy from the last time */
+if( (done + max_allowed) >= pData->size ) {  /* still some data to 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-07-04 Thread Jeff Squyres
On Jul 3, 2011, at 8:40 PM, Kawashima wrote:

>> Does your llp send path preserve MPI matching ordering?  E.g., if some prior isend 
>> is already queued, could the llp send overtake it?
> 
> Yes, an LLP send may overtake a queued isend.
> But we use the correct PML send_sequence, so the LLP message is queued as an
> unexpected message on the receiver side, and I think it's no problem.

Good!  I just wanted to ask because I couldn't quite tell from your prior 
description.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-07-03 Thread Kawashima
Hi Jeff,

> Does your llp send path preserve MPI matching ordering?  E.g., if some prior isend 
> is already queued, could the llp send overtake it?

Yes, an LLP send may overtake a queued isend.
But we use the correct PML send_sequence, so the LLP message is queued as an
unexpected message on the receiver side, and I think it's no problem.
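
(As a generic, self-contained illustration -- not Open MPI's actual matching code,
all names are made up -- of why stamping every message with the shared per-peer
send_sequence preserves MPI matching order even when one message physically
overtakes another:)

#include <stdint.h>
#include <stdio.h>

#define WINDOW 64   /* arbitrary reorder window; assumes gaps stay below this */

typedef struct {
    uint16_t expected_seq;       /* next sequence number to match from this peer */
    const char *pending[WINDOW]; /* early arrivals, indexed by seq % WINDOW */
} peer_state_t;

/* Stand-in for "match against a posted receive or queue as unexpected". */
static void deliver(const char *msg) { printf("matched: %s\n", msg); }

static void on_arrival(peer_state_t *peer, uint16_t seq, const char *msg)
{
    if (seq != peer->expected_seq) {            /* overtook an earlier send */
        peer->pending[seq % WINDOW] = msg;      /* hold it until its turn */
        return;
    }
    deliver(msg);                               /* in order: match it now */
    peer->expected_seq++;
    while (peer->pending[peer->expected_seq % WINDOW] != NULL) {
        const char *next = peer->pending[peer->expected_seq % WINDOW];
        peer->pending[peer->expected_seq % WINDOW] = NULL;
        deliver(next);                          /* drain messages that are now in order */
        peer->expected_seq++;
    }
}

int main(void)
{
    peer_state_t peer = { 0, { NULL } };
    on_arrival(&peer, 1, "LLP send (overtook the isend)");  /* arrives first */
    on_arrival(&peer, 0, "queued isend");                   /* still matched first */
    return 0;
}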

> >rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >   (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >   ompi_comm_peer_lookup(comm, dst),
> >   MCA_PML_OB1_HDR_TYPE_MATCH));
> > 
> >if (rc == OMPI_SUCCESS) {
> >/* NOTE this is not thread safe */
> >OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >}

Takahiro Kawashima,
MPI development team,
Fujitsu

> Does your llp send path preserve MPI matching ordering?  E.g., if some prior isend 
> is already queued, could the llp send overtake it?
> 
> Sent from my phone. No type good. 
> 
> On Jun 29, 2011, at 8:27 AM, "Kawashima"  wrote:
> 
> > Hi Jeff,
> > 
> >>> First, we created a new BTL component, 'tofu BTL'. It's not so special
> >>> one but dedicated to our Tofu interconnect. But its latency was not
> >>> enough for us.
> >>> 
> >>> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> >>> It bypasses request object creation in PML and BML/BTL, and sends
> >>> a message immediately if possible.
> >> 
> >> Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send 
> >> immediate")  This call was designed to be part of a latency reduction 
> >> mechanism.  I forget offhand what we don't do before calling sendi, but 
> >> the rationale was that if the message was small enough, we could skip some 
> >> steps in the sending process and "just send it."
> > 
> > I know sendi, but its latency was not sufficient for us.
> > To come at sendi call, we must do:
> >  - allocate send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
> >  - initialize send request (MCA_PML_OB1_SEND_REQUEST_INIT)
> >  - select BTL module (mca_pml_ob1_send_request_start)
> >  - select protocol (mca_pml_ob1_send_request_start_btl)
> > We want to eliminate these overheads. We want to send more immediately.
> > 
> > Here is a code snippet:
> > 
> > 
> > 
> > #if OMPI_ENABLE_LLP
> > static inline int mca_pml_ob1_call_llp_send(void *buf,
> >size_t size,
> >int dst,
> >int tag,
> >ompi_communicator_t *comm)
> > {
> >int rc;
> >mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
> >mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;
> > 
> >match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
> >match->hdr_common.hdr_flags = 0;
> >match->hdr_ctx = comm->c_contextid;
> >match->hdr_src = comm->c_my_rank;
> >match->hdr_tag = tag;
> >match->hdr_seq = proc->send_sequence + 1;
> > 
> >rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >   (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >   ompi_comm_peer_lookup(comm, dst),
> >   MCA_PML_OB1_HDR_TYPE_MATCH));
> > 
> >if (rc == OMPI_SUCCESS) {
> >/* NOTE this is not thread safe */
> >OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >}
> > 
> >return rc;
> > }
> > #endif
> > 
> > int mca_pml_ob1_send(void *buf,
> > size_t count,
> > ompi_datatype_t * datatype,
> > int dst,
> > int tag,
> > mca_pml_base_send_mode_t sendmode,
> > ompi_communicator_t * comm)
> > {
> >int rc;
> >mca_pml_ob1_send_request_t *sendreq;
> > 
> > #if OMPI_ENABLE_LLP
> >/* try to send message via LLP if
> > *   - one of LLP modules is available, and
> > *   - datatype is basic, and
> > *   - data is small, and
> > *   - communication mode is standard, buffered, or ready, and
> > *   - destination is not myself
> > */
> >if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
> >(datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
> >(sendmode == MCA_PML_BASE_SEND_STANDARD ||
> > sendmode == MCA_PML_BASE_SEND_BUFFERED ||
> > sendmode == MCA_PML_BASE_SEND_READY) &&
> >(dst != comm->c_my_rank)) {
> >rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst, 
> > tag, comm);
> >if (rc != OMPI_ERR_NOT_AVAILABLE) {
> >/* successfully sent out via LLP or unrecoverable error occurred 
> > */
> >return rc;
> >}
> >}
> > #endif
> > 
> >MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
> >if (rc != OMPI_SUCCESS)
> >return rc;
> > 
> >MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
> >

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-07-02 Thread Jeff Squyres (jsquyres)
Does your llp send path preserve MPI matching ordering?  E.g., if some prior isend is 
already queued, could the llp send overtake it?

Sent from my phone. No type good. 

On Jun 29, 2011, at 8:27 AM, "Kawashima"  wrote:

> Hi Jeff,
> 
>>> First, we created a new BTL component, 'tofu BTL'. It's not so special
>>> one but dedicated to our Tofu interconnect. But its latency was not
>>> enough for us.
>>> 
>>> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
>>> It bypasses request object creation in PML and BML/BTL, and sends
>>> a message immediately if possible.
>> 
>> Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send immediate")  
>> This call was designed to be part of a latency reduction mechanism.  I 
>> forget offhand what we don't do before calling sendi, but the rationale was 
>> that if the message was small enough, we could skip some steps in the 
>> sending process and "just send it."
> 
> I know sendi, but its latency was not sufficient for us.
> To come at sendi call, we must do:
>  - allocate send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
>  - initialize send request (MCA_PML_OB1_SEND_REQUEST_INIT)
>  - select BTL module (mca_pml_ob1_send_request_start)
>  - select protocol (mca_pml_ob1_send_request_start_btl)
> We want to eliminate these overheads. We want to send more immediately.
> 
> Here is a code snippet:
> 
> 
> 
> #if OMPI_ENABLE_LLP
> static inline int mca_pml_ob1_call_llp_send(void *buf,
>size_t size,
>int dst,
>int tag,
>ompi_communicator_t *comm)
> {
>int rc;
>mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
>mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;
> 
>match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
>match->hdr_common.hdr_flags = 0;
>match->hdr_ctx = comm->c_contextid;
>match->hdr_src = comm->c_my_rank;
>match->hdr_tag = tag;
>match->hdr_seq = proc->send_sequence + 1;
> 
>rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
>   (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
>   ompi_comm_peer_lookup(comm, dst),
>   MCA_PML_OB1_HDR_TYPE_MATCH));
> 
>if (rc == OMPI_SUCCESS) {
>/* NOTE this is not thread safe */
>OPAL_THREAD_ADD32(&proc->send_sequence, 1);
>}
> 
>return rc;
> }
> #endif
> 
> int mca_pml_ob1_send(void *buf,
> size_t count,
> ompi_datatype_t * datatype,
> int dst,
> int tag,
> mca_pml_base_send_mode_t sendmode,
> ompi_communicator_t * comm)
> {
>int rc;
>mca_pml_ob1_send_request_t *sendreq;
> 
> #if OMPI_ENABLE_LLP
>/* try to send message via LLP if
> *   - one of LLP modules is available, and
> *   - datatype is basic, and
> *   - data is small, and
> *   - communication mode is standard, buffered, or ready, and
> *   - destination is not myself
> */
>if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
>(datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
>(sendmode == MCA_PML_BASE_SEND_STANDARD ||
> sendmode == MCA_PML_BASE_SEND_BUFFERED ||
> sendmode == MCA_PML_BASE_SEND_READY) &&
>(dst != comm->c_my_rank)) {
>rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst, tag, 
> comm);
>if (rc != OMPI_ERR_NOT_AVAILABLE) {
>/* successfully sent out via LLP or unrecoverable error occurred */
>return rc;
>}
>}
> #endif
> 
>MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
>if (rc != OMPI_SUCCESS)
>return rc;
> 
>MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
>  buf,
>  count,
>  datatype,
>  dst, tag,
>  comm, sendmode, false);
> 
>PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
> &(sendreq)->req_send.req_base,
> PERUSE_SEND);
> 
>MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
>if (rc != OMPI_SUCCESS) {
>MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
>return rc;
>}
> 
>ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
> 
>rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
>ompi_request_free( (ompi_request_t**) &sendreq );
>return rc;
> }
> 
> 
> 
> mca_pml_ob1_send is body of MPI_Send in Open MPI. Region of
> OMPI_ENABLE_LLP is added by us.
> 
> We don't have to use a send request if we could "send immediately".
> So we try to 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread Kawashima
Hi Jeff,

> > First, we created a new BTL component, 'tofu BTL'. It's not so special
> > one but dedicated to our Tofu interconnect. But its latency was not
> > enough for us.
> > 
> > So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> > It bypasses request object creation in PML and BML/BTL, and sends
> > a message immediately if possible.
> 
> Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send immediate")  
> This call was designed to be part of a latency reduction mechanism.  I forget 
> offhand what we don't do before calling sendi, but the rationale was that if 
> the message was small enough, we could skip some steps in the sending process 
> and "just send it."

I know sendi, but its latency was not sufficient for us.
To reach the sendi call, we must:
  - allocate a send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
  - initialize the send request (MCA_PML_OB1_SEND_REQUEST_INIT)
  - select a BTL module (mca_pml_ob1_send_request_start)
  - select a protocol (mca_pml_ob1_send_request_start_btl)
We want to eliminate these overheads. We want to send more immediately.

Here is a code snippet:



#if OMPI_ENABLE_LLP
static inline int mca_pml_ob1_call_llp_send(void *buf,
size_t size,
int dst,
int tag,
ompi_communicator_t *comm)
{
int rc;
mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;

match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
match->hdr_common.hdr_flags = 0;
match->hdr_ctx = comm->c_contextid;
match->hdr_src = comm->c_my_rank;
match->hdr_tag = tag;
match->hdr_seq = proc->send_sequence + 1;

rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
   (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
   ompi_comm_peer_lookup(comm, dst),
   MCA_PML_OB1_HDR_TYPE_MATCH));

if (rc == OMPI_SUCCESS) {
/* NOTE this is not thread safe */
OPAL_THREAD_ADD32(&proc->send_sequence, 1);
}

return rc;
}
#endif

int mca_pml_ob1_send(void *buf,
 size_t count,
 ompi_datatype_t * datatype,
 int dst,
 int tag,
 mca_pml_base_send_mode_t sendmode,
 ompi_communicator_t * comm)
{
int rc;
mca_pml_ob1_send_request_t *sendreq;

#if OMPI_ENABLE_LLP
/* try to send message via LLP if
 *   - one of LLP modules is available, and
 *   - datatype is basic, and
 *   - data is small, and
 *   - communication mode is standard, buffered, or ready, and
 *   - destination is not myself
 */
if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
(datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
(sendmode == MCA_PML_BASE_SEND_STANDARD ||
 sendmode == MCA_PML_BASE_SEND_BUFFERED ||
 sendmode == MCA_PML_BASE_SEND_READY) &&
(dst != comm->c_my_rank)) {
rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst, tag, 
comm);
if (rc != OMPI_ERR_NOT_AVAILABLE) {
/* successfully sent out via LLP or unrecoverable error occurred */
return rc;
}
}
#endif

MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
if (rc != OMPI_SUCCESS)
return rc;

MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
  buf,
  count,
  datatype,
  dst, tag,
  comm, sendmode, false);

PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
 &(sendreq)->req_send.req_base,
 PERUSE_SEND);

MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
if (rc != OMPI_SUCCESS) {
MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
return rc;
}

ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);

rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
ompi_request_free( (ompi_request_t**) &sendreq );
return rc;
}



mca_pml_ob1_send is the body of MPI_Send in Open MPI. The OMPI_ENABLE_LLP
region was added by us.

We don't need a send request if we can "send immediately".
So we try to send via LLP first. If LLP cannot send immediately
because the interconnect is busy or similar, LLP returns
OMPI_ERR_NOT_AVAILABLE, and we continue with the normal PML/BML/BTL send(i).
Since we want to use a simple memcpy instead of the complex convertor,
we restrict the datatypes that can go through the LLP.

Of course, we cannot use LLP for MPI_Isend.

> Note, too, that the coll modules can be laid overtop of each other -- 

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread Jeff Squyres
On Jun 29, 2011, at 3:57 AM, Kawashima wrote:

> First, we created a new BTL component, 'tofu BTL'. It's not a special
> one, just dedicated to our Tofu interconnect. But its latency was not
> enough for us.
> 
> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> It bypasses request object creation in PML and BML/BTL, and sends
> a message immediately if possible.

Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send immediate")  
This call was designed to be part of a latency reduction mechanism.  I forget 
offhand what we don't do before calling sendi, but the rationale was that if 
the message was small enough, we could skip some steps in the sending process 
and "just send it."

Note, too, that the coll modules can be laid overtop of each other -- e.g., if 
you only implement barrier (and some others) in tofu coll, then you can supply 
NULL for the other function pointers and the coll base will resolve those 
functions to other coll modules automatically.
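
To make that concrete, here is a toy, self-contained sketch of the
NULL-pointer convention. The struct below is a deliberately simplified
stand-in for the real coll module structure in ompi/mca/coll/coll.h (which
has many more entries and different signatures); only the resolution idea
is the point.

#include <stddef.h>
#include <stdio.h>

typedef int (*coll_fn_t)(void);

typedef struct {
    coll_fn_t coll_barrier;
    coll_fn_t coll_bcast;
    coll_fn_t coll_allreduce;
} toy_coll_module_t;

static int tofu_hw_barrier(void) { puts("barrier via Tofu hardware"); return 0; }
static int tuned_barrier(void)   { puts("barrier via tuned");         return 0; }
static int tuned_bcast(void)     { puts("bcast via tuned");           return 0; }
static int tuned_allreduce(void) { puts("allreduce via tuned");       return 0; }

/* a "tofu"-style module that only provides barrier */
static const toy_coll_module_t tofu_module  = { tofu_hw_barrier, NULL, NULL };
/* a "tuned"-style module that provides everything */
static const toy_coll_module_t tuned_module = { tuned_barrier, tuned_bcast,
                                                tuned_allreduce };

/* what the coll base conceptually does at communicator setup: for each
 * collective, take the higher-priority module's pointer if it is non-NULL,
 * otherwise fall back to the next module */
static coll_fn_t resolve(coll_fn_t preferred, coll_fn_t fallback)
{
    return (preferred != NULL) ? preferred : fallback;
}

int main(void)
{
    coll_fn_t barrier   = resolve(tofu_module.coll_barrier,
                                  tuned_module.coll_barrier);
    coll_fn_t bcast     = resolve(tofu_module.coll_bcast,
                                  tuned_module.coll_bcast);
    coll_fn_t allreduce = resolve(tofu_module.coll_allreduce,
                                  tuned_module.coll_allreduce);
    barrier();
    bcast();
    allreduce();
    return 0;
}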

> Also, we modified tuned COLL to implement interconnect-and-topology-
> specific bcast/allgather/alltoall/allreduce algorithms. These algorithm
> implementations also bypass PML/BML/BTL to eliminate protocol and software
> overhead.

Good.  As Sylvain mentioned, that was the intent of the coll framework -- it 
certainly isn't *necessary* for coll's to always implement their underlying 
sends/receives with the BTL.  The sm coll does this, for example -- it uses its 
own shared memory block for talking to the sm coll's in other processes 
on the same node, but it doesn't go through the sm BTL.
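
For readers who have not looked at that code, here is a toy, self-contained
illustration of the pattern (not the actual sm coll implementation): two
forked processes synchronize through a counting barrier placed in their own
shared mapping, with no BTL involved.

#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    atomic_int arrived;    /* ranks that have reached the barrier */
    atomic_int generation; /* bumped by the last arriver to release all */
} shm_barrier_t;

static void shm_barrier(shm_barrier_t *b, int nprocs)
{
    int gen = atomic_load(&b->generation);
    if (atomic_fetch_add(&b->arrived, 1) == nprocs - 1) {
        atomic_store(&b->arrived, 0);        /* last one in: reset ...   */
        atomic_fetch_add(&b->generation, 1); /* ... and release everyone */
    } else {
        while (atomic_load(&b->generation) == gen) {
            /* spin; real code would back off or drive the progress engine */
        }
    }
}

int main(void)
{
    shm_barrier_t *b = mmap(NULL, sizeof(*b), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED) return 1;
    atomic_init(&b->arrived, 0);
    atomic_init(&b->generation, 0);

    pid_t pid = fork();
    if (pid < 0) return 1;
    int rank = (pid == 0) ? 1 : 0;
    printf("rank %d: before barrier\n", rank);
    shm_barrier(b, 2);
    printf("rank %d: after barrier\n", rank);

    if (pid != 0) waitpid(pid, NULL, 0); /* parent reaps the child */
    return 0;
}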

> To achieve the above, we created 'tofu COMMON', like sm (ompi/mca/common/sm/).
> 
> Is any of this interesting?
> 
> Though our BTL and COLL are quite interconnect-specific, LLP may be
> contributed in the future.

Yes, it may be interesting to see what you did there.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread Kawashima
Hi Sylvain,

> > Also, we modified tuned COLL to implement interconnect-and-topology-
> > specific bcast/allgather/alltoall/allreduce algorithms. These algorithm
> > implementations also bypass PML/BML/BTL to eliminate protocol and software
> > overhead.
> This seems perfectly valid to me. The current coll components use normal 
> MPI_Send/Recv semantics, hence the PML/BML/BTL chain, but I always saw the 
> coll framework as a way to be able to integrate smoothly "custom" 
> collective components for a specific interconnect. I think that Mellanox 
> also did a specific collective component using directly their ConnectX HCA 
> capabilities.
> 
> However, modifying the "tuned" component may not be the better way to 
> integrate your collective work. You may consider creating a "tofu" coll 
> component which would only provide the collectives you optimized (and the 
> coll framework will fall back on tuned for the ones you didn't optimize).

Yes. I agree.
But sadly, my colleague implemented it badly.

We created another COLL component that uses the interconnect barrier,
like Mellanox FCA.

> > To achieve the above, we created 'tofu COMMON', like sm (ompi/mca/common/sm/).
> > 
> > Is any of this interesting?
> It may be interesting, yes. I don't know the tofu model, but if it is not 
> secret, contributing it is usually a good thing.
> 
> Your communication model may be similar to others and portions of code may 
> be shared with other technologies (I'm thinking of IB, MX, PSM,...). 
> People writing new code would also consider your model and let you take 
> advantage of it. Knowing how tofu is integrated into Open MPI may also 
> impact major decisions the open-source community is taking.

The Tofu communication model is similar to that of IB RDMA.
Actually, we used the source code of the openib BTL as a reference.
We'll consider contributing some of our code, and will join the discussion.

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread sylvain . jeaugey
Kawashima-san,

Congratulations on your machine, this is a stunning achievement!

> Kawashima  wrote :
> Also, we modified tuned COLL to implement interconnect-and-topology-
> specific bcast/allgather/alltoall/allreduce algorithms. These algorithm
> implementations also bypass PML/BML/BTL to eliminate protocol and software
> overhead.
This seems perfectly valid to me. The current coll components use normal 
MPI_Send/Recv semantics, hence the PML/BML/BTL chain, but I always saw the 
coll framework as a way to be able to integrate smoothly "custom" 
collective components for a specific interconnect. I think that Mellanox 
also did a specific collective component using directly their ConnectX HCA 
capabilities.

However, modifying the "tuned" component may not be the better way to 
integrate your collective work. You may consider creating a "tofu" coll 
component which would only provide the collectives you optimized (and the 
coll framework will fall back on tuned for the ones you didn't optimize).

> To achieve the above, we created 'tofu COMMON', like sm (ompi/mca/common/sm/).
> 
> Is any of this interesting?
It may be interesting, yes. I don't know the tofu model, but if it is not 
secret, contributing it is usually a good thing.

Your communication model may be similar to others and portions of code may 
be shared with other technologies (I'm thinking of IB, MX, PSM,...). 
People writing new code would also consider your model and let you take 
advantage of it. Knowing how tofu is integrated into Open MPI may also 
impact major decisions the open-source community is taking.

Sylvain

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread Kawashima
Hi Jeff, Ralph, and all,

Thank you for your reply.
RIKEN and Fujitsu will continue working toward 10 Pflops with Open MPI.

Here we can explain some parts of our MPI implementation:

As page 13 of Koh Hotta's presentation shows, we extended OMPI
communication layers.

> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
# Sorry, this figure is somewhat broken. Arrows point to incorrect layers.

First, we created a new BTL component, 'tofu BTL'. It's not a special
one, just dedicated to our Tofu interconnect. But its latency was not
enough for us.

So we created a new framework, 'LLP', and its component, 'tofu LLP'.
It bypasses request object creation in PML and BML/BTL, and sends
a message immediately if possible.
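
As a side note on how such a framework is typically invoked: the
MCA_LLP_CALL(send(...)) seen in the mca_pml_ob1_send snippet elsewhere in
this thread presumably follows the same token-pasting dispatch pattern as
Open MPI's own MCA_PML_CALL. The toy below only illustrates that pattern;
the structure, field, and function names are hypothetical, not our real
LLP interface.

#include <stddef.h>
#include <stdio.h>

typedef struct {
    int (*llp_send)(const void *buf, size_t size, int peer);
} toy_llp_module_t;

static int toy_tofu_llp_send(const void *buf, size_t size, int peer)
{
    (void)buf;
    printf("tofu LLP: %zu bytes to rank %d\n", size, peer);
    return 0;
}

/* the module the framework selected at startup */
static toy_llp_module_t toy_llp = { toy_tofu_llp_send };

/* MCA_LLP_CALL(send(buf, n, peer)) expands to toy_llp.llp_send(buf, n, peer) */
#define MCA_LLP_CALL(call) (toy_llp.llp_ ## call)

int main(void)
{
    char msg[] = "hi";
    return MCA_LLP_CALL(send(msg, sizeof(msg), 3));
}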

Also, we modified tuned COLL to implement interconnect-and-topology-
specific bcast/allgather/alltoall/allreduce algorithms. These algorithm
implementations also bypass PML/BML/BTL to eliminate protocol and software
overhead.

To achieve the above, we created 'tofu COMMON', like sm (ompi/mca/common/sm/).

Is any of this interesting?

Though our BTL and COLL are quite interconnect-specific, LLP may be
contributed in the future.

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu

> I echo what Ralph said -- congratulations!
> 
> Let us know when you'll be ready to contribute back what you can.
> 
> Thanks!
> 
> 
> On Jun 27, 2011, at 9:58 PM, Takahiro Kawashima wrote:
> 
> > Dear Open MPI community,
> > 
> > I'm a member of MPI library development team in Fujitsu. Shinji
> > Sumimoto, whose name appears in Jeff's blog, is one of our bosses.
> > 
> > As Rayson and Jeff noted, K computer, world's most powerful HPC system
> > developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
> > library. We, Fujitsu, are pleased to announce this, and also owe special
> > thanks to the Open MPI community.
> > We are sorry for the late announcement!
> > 
> > Our MPI library is based on Open MPI 1.4 series, and has a new point-
> > to-point component (BTL) and new topology-aware collective communication
> > algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
> > PLM, GRPCOMM etc).
> > 
> > K computer connects 68,544 nodes by our custom interconnect.
> > Its runtime environment is our proprietary one. So we don't use orted.
> > We cannot tell start-up time yet because of disclosure restriction, sorry.
> > 
> > We are surprised by the extensibility of Open MPI, and have proved that
> > Open MPI scales to the 68,000-process level! It is a pleasure to use
> > such great open-source software.
> > 
> > We cannot disclose details of our technology yet because of our contract
> > with RIKEN AICS; however, we plan to feed back our improvements and bug
> > fixes. We can contribute some bug fixes soon, but contribution of our
> > improvements will come next year, with the Open MPI community's agreement.
> > 
> > Best regards,
> > 
> > MPI development team,
> > Fujitsu
> > 
> > 
> >> I got more information:
> >> 
> >>   http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/
> >> 
> >> Short version: yes, Open MPI is used on K and was used to power the 8PF 
> >> runs.
> >> 
> >> w00t!
> >> 
> >> 
> >> 
> >> On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:
> >> 
> >>> w00t!  
> >>> 
> >>> OMPI powers 8 petaflops!
> >>> (at least I'm guessing that -- does anyone know if that's true?)
> >>> 
> >>> 
> >>> On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
> >>> 
>  Interesting... page 11:
>  
>  http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
>  
>  Open MPI based:
>  
>  * Open Standard, Open Source, Multi-Platform including PC Cluster.
>  * Adding extension to Open MPI for "Tofu" interconnect
>  
>  Rayson
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-27 Thread Jeff Squyres
I echo what Ralph said -- congratulations!

Let us know when you'll be ready to contribute back what you can.

Thanks!


On Jun 27, 2011, at 9:58 PM, Takahiro Kawashima wrote:

> Dear Open MPI community,
> 
> I'm a member of MPI library development team in Fujitsu. Shinji
> Sumimoto, whose name appears in Jeff's blog, is one of our bosses.
> 
> As Rayson and Jeff noted, K computer, world's most powerful HPC system
> developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
> library. We, Fujitsu, are pleased to announce this, and also owe special
> thanks to the Open MPI community.
> We are sorry for the late announcement!
> 
> Our MPI library is based on Open MPI 1.4 series, and has a new point-
> to-point component (BTL) and new topology-aware collective communication
> algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
> PLM, GRPCOMM etc).
> 
> K computer connects 68,544 nodes by our custom interconnect.
> Its runtime environment is our proprietary one. So we don't use orted.
> We cannot tell start-up time yet because of disclosure restriction, sorry.
> 
> We are surprised by the extensibility of Open MPI, and have proved that
> Open MPI scales to the 68,000-process level! It is a pleasure to use
> such great open-source software.
> 
> We cannot disclose details of our technology yet because of our contract
> with RIKEN AICS; however, we plan to feed back our improvements and bug
> fixes. We can contribute some bug fixes soon, but contribution of our
> improvements will come next year, with the Open MPI community's agreement.
> 
> Best regards,
> 
> MPI development team,
> Fujitsu
> 
> 
>> I got more information:
>> 
>>   http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/
>> 
>> Short version: yes, Open MPI is used on K and was used to power the 8PF runs.
>> 
>> w00t!
>> 
>> 
>> 
>> On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:
>> 
>>> w00t!  
>>> 
>>> OMPI powers 8 petaflops!
>>> (at least I'm guessing that -- does anyone know if that's true?)
>>> 
>>> 
>>> On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
>>> 
 Interesting... page 11:
 
 http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
 
 Open MPI based:
 
 * Open Standard, Open Source, Multi-Platform including PC Cluster.
 * Adding extension to Open MPI for "Tofu" interconnect
 
 Rayson
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-27 Thread Takahiro Kawashima
Dear Open MPI community,

I'm a member of MPI library development team in Fujitsu. Shinji
Sumimoto, whose name appears in Jeff's blog, is one of our bosses.

As Rayson and Jeff noted, K computer, world's most powerful HPC system
developed by RIKEN and Fujitsu, utilizes Open MPI as a base of its MPI
library. We, Fujitsu, are pleased to announce this, and also owe special
thanks to the Open MPI community.
We are sorry for the late announcement!

Our MPI library is based on Open MPI 1.4 series, and has a new point-
to-point component (BTL) and new topology-aware collective communication
algorithms (COLL). Also, it is adapted to our runtime environment (ESS,
PLM, GRPCOMM etc).

K computer connects 68,544 nodes by our custom interconnect.
Its runtime environment is our proprietary one. So we don't use orted.
We cannot tell start-up time yet because of disclosure restriction, sorry.

We are surprised by the extensibility of Open MPI, and have proved that
Open MPI scales to the 68,000-process level! It is a pleasure to use
such great open-source software.

We cannot disclose details of our technology yet because of our contract
with RIKEN AICS; however, we plan to feed back our improvements and bug
fixes. We can contribute some bug fixes soon, but contribution of our
improvements will come next year, with the Open MPI community's agreement.

Best regards,

MPI development team,
Fujitsu


> I got more information:
> 
>http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/
> 
> Short version: yes, Open MPI is used on K and was used to power the 8PF runs.
> 
> w00t!
> 
> 
> 
> On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:
> 
> > w00t!  
> > 
> > OMPI powers 8 petaflops!
> > (at least I'm guessing that -- does anyone know if that's true?)
> > 
> > 
> > On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
> > 
> >> Interesting... page 11:
> >> 
> >> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
> >> 
> >> Open MPI based:
> >> 
> >> * Open Standard, Open Source, Multi-Platform including PC Cluster.
> >> * Adding extension to Open MPI for "Tofu" interconnect
> >> 
> >> Rayson


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-27 Thread Rayson Ho
On Sat, Jun 25, 2011 at 9:23 PM, Jeff Squyres  wrote:
> I got more information:
>
>   http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/

That's really awesome!!

SC08: "Open MPI: 10^15 Flops Can't Be Wrong"

2011: "Open MPI: 8 * 10^15 Flops Can't Be Wrong"


And equally awesome is that Fujitsu is going to contribute its changes
back to Open MPI!! Can't wait to see presentations like:

"Open MPI: 10^17 Flops Can't Be Wrong", or even

"Open MPI: 10^18 Flops Can't Be Wrong" :-)

Rayson



>
> Short version: yes, Open MPI is used on K and was used to power the 8PF runs.
>
> w00t!
>
>
>
> On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:
>
>> w00t!
>>
>> OMPI powers 8 petaflops!
>> (at least I'm guessing that -- does anyone know if that's true?)
>>
>>
>> On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
>>
>>> Interesting... page 11:
>>>
>>> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
>>>
>>> Open MPI based:
>>>
>>> * Open Standard, Open Source, Multi-Platform including PC Cluster.
>>> * Adding extension to Open MPI for "Tofu" interconnect
>>>
>>> Rayson
>>>
>>> ==
>>> Grid Engine / Open Grid Scheduler
>>> http://gridscheduler.sourceforge.net/
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-26 Thread Ralph Castain
Any info available on the launch environment used, and how long it took to 
start the 8Pf job?

On Jun 25, 2011, at 7:23 PM, Jeff Squyres wrote:

> I got more information:
> 
>   http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/
> 
> Short version: yes, Open MPI is used on K and was used to power the 8PF runs.
> 
> w00t!
> 
> 
> 
> On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:
> 
>> w00t!  
>> 
>> OMPI powers 8 petaflops!
>> (at least I'm guessing that -- does anyone know if that's true?)
>> 
>> 
>> On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
>> 
>>> Interesting... page 11:
>>> 
>>> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
>>> 
>>> Open MPI based:
>>> 
>>> * Open Standard, Open Source, Multi-Platform including PC Cluster.
>>> * Adding extension to Open MPI for "Tofu" interconnect
>>> 
>>> Rayson
>>> 
>>> ==
>>> Grid Engine / Open Grid Scheduler
>>> http://gridscheduler.sourceforge.net/
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-25 Thread Jeff Squyres
I got more information:

   http://blogs.cisco.com/performance/open-mpi-powers-8-petaflops/

Short version: yes, Open MPI is used on K and was used to power the 8PF runs.

w00t!



On Jun 24, 2011, at 7:16 PM, Jeff Squyres wrote:

> w00t!  
> 
> OMPI powers 8 petaflops!
> (at least I'm guessing that -- does anyone know if that's true?)
> 
> 
> On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:
> 
>> Interesting... page 11:
>> 
>> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
>> 
>> Open MPI based:
>> 
>> * Open Standard, Open Source, Multi-Platform including PC Cluster.
>> * Adding extension to Open MPI for "Tofu" interconnect
>> 
>> Rayson
>> 
>> ==
>> Grid Engine / Open Grid Scheduler
>> http://gridscheduler.sourceforge.net/
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-24 Thread Jeff Squyres
w00t!  

OMPI powers 8 petaflops!
(at least I'm guessing that -- does anyone know if that's true?)


On Jun 24, 2011, at 7:03 PM, Rayson Ho wrote:

> Interesting... page 11:
> 
> http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf
> 
> Open MPI based:
> 
> * Open Standard, Open Source, Multi-Platform including PC Cluster.
> * Adding extension to Open MPI for "Tofu" interconnect
> 
> Rayson
> 
> ==
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net/
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-24 Thread Rayson Ho
Interesting... page 11:

http://www.fujitsu.com/downloads/TC/sc10/programming-on-k-computer.pdf

Open MPI based:

* Open Standard, Open Source, Multi-Platform including PC Cluster.
* Adding extension to Open MPI for "Tofu" interconnect

Rayson

==
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net/