Re: [Lustre-discuss] write RPC & congestion

2010-09-27 Thread burlen
  On 09/24/2010 06:36 PM, Andreas Dilger wrote:
> On 2010-09-24, at 19:10, Andreas Dilger wrote:
>> On 2010-09-24, at 18:20, burlen wrote:
>>> To be sure I understand this, is it correct that each OST has its own 
>>> pool of service threads? So the system-wide number of service threads is 
>>> bounded by oss_max_threads * num_osts?
>> Actually, the current oss_max_threads tunable is for the whole OSS (as the 
>> name implies).

Again many thanks for your help

With respect to an upper bound on the number of RPCs and RDMAs in flight 
system wide, does the situation change much on the Cray XT5 with Lustre 
1.8 and the OSSs directly connected to the 3D torus? I am asking after 
having seen the XT3 section in the manual; I'm not sure whether it applies 
to the XT5 and, if it does, how it might influence the above tunables.



Re: [Lustre-discuss] write RPC & congestion

2010-09-24 Thread burlen

Hi, Thanks for all the help,

Andreas Dilger wrote:
> When one of the server threads is ready to process a read/write request it 
> will get or put the data from/to the buffers that the client already 
> prepared.  The number of currently active IO requests is exactly the number 
> of active service threads (up to 512 by default).
To be sure I understand this, is it correct that each OST has its own 
pool of service threads? So the system-wide number of service threads is 
bounded by oss_max_threads * num_osts?

Thanks again
Burlen









Re: [Lustre-discuss] write RPC & congestion

2010-08-22 Thread burlen
Andreas Dilger wrote:
> On 2010-08-17, at 14:15, burlen wrote:
>   
>> I have some questions about Lustre RPC and the sequence of events that 
>> occurs during large concurrent write()s involving many processes and a 
>> large data size per process.  I understand there is a mechanism of flow 
>> control by credits, but I'm a little unclear on how it works in general 
>> after reading the "networking & io protocol" white paper.
>> 
>
> There are different levels of flow control.  There is one at the LNET level, 
> that controls low-level messages from overwhelming the server with messages, 
> and avoiding stalling small/reply messages at the back of a deep queue of 
> requests.
>
>   
>> Is it true that a write() RPC transfers data in chunks of at least 1 MB 
>> and at most (max_pages_per_rpc * page_size) bytes, where page_size = 2^16? 
>> Can I use these bounds to estimate the number of RPCs issued per MB of 
>> data written?
>> 
>
> Currently, 1MB is the largest bulk IO size, and is the typical size used by 
> clients for all IO.
>
>   
>> About how many concurrent incoming write() RPCs per OSS service thread 
>> can a single server handle before it stops responding to incoming RPCs?
>> 
>
> The server can handle tens of thousands of write _requests_, but note that 
> since Lustre has always been designed as an RDMA-capable protocol, the request 
> is relatively small (a few hundred bytes) and does not contain any of the 
> DATA.
>
> When one of the server threads is ready to process a read/write request it 
> will get or put the data from/to the buffers that the client already 
> prepared.  The number of currently active IO requests is exactly the number 
> of active service threads (up to 512 by default).
>
>   
>> What happens to an RPC when the server is too busy to handle it? Is it 
>> even issued by the client? Does the client have to poll and/or resend 
>> the RPC? Does the process of polling for flow control credits add 
>> significant network/server congestion?
>> 
>
> The clients limit the number of concurrent RPC requests, by default to 8 per 
> OST.  The LNET level message credits will also limit the number of in-flight 
> messages in case there is e.g. an LNET router between the client and server.
>
> The client will almost never time out a request, as it is informed how long 
> requests are currently taking to process and will wait patiently for its 
> earlier requests to finish processing.  If the client is going to time out a 
> request (based on an earlier request timeout that is about to be exceeded) 
> the server will inform it to continue waiting and give it a new processing 
> time estimate (unless of course the server is non-functional or so 
> overwhelmed that it can't even do that).
>
>   
>> Is it likely that a large number of RPCs/flow control credit requests 
>> will induce enough network congestion that a client's RPCs time out? 
>> How does the client handle such a timeout?
>> 
>
> Since the flow control credits are bounded and will be returned to the peer 
> as earlier requests complete, there is no additional traffic due to this.  
> However, considering that HPC clusters are distributed denial-of-service 
> engines, it is always possible to overwhelm the server under some conditions.  
> In case of a client RPC timeout (hundreds of seconds under load) the client 
> will resend the request and/or try to contact the backup server until one 
> responds.
>   
Thank you for your help.

Is my understanding correct?

A single RPC request will initiate an RDMA transfer of at most 
"max_pages_per_rpc" pages, where the page unit is the Lustre page size of 
65536 bytes. Each RDMA transfer is executed in 1 MB chunks. On a given 
client, if more than "max_pages_per_rpc" pages of data are available to 
transfer, multiple RPCs are issued and multiple RDMAs are initiated.

Would it be correct to say that the purpose of the "max_pages_per_rpc" 
parameter is to let the servers even out the individual progress of 
concurrent clients with a lot of data to move, and to apportion the 
available bandwidth more fairly among concurrently writing clients?
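
For reference, here is a rough sketch of how a client could inspect these 
per-target tunables at run time by reading the osc entries under /proc. The 
/proc/fs/lustre/osc/<target>/<tunable> layout is an assumption based on 
1.8-era clients, so verify the paths on your system.

/* Sketch: print max_pages_per_rpc and max_rpcs_in_flight for every OSC
 * device visible on this client.  Assumes the 1.8-era proc layout
 * /proc/fs/lustre/osc/<target>/<tunable>; this is not a stable API. */
#include <glob.h>
#include <stdio.h>

static void print_all(const char *pattern)
{
    glob_t g;
    size_t i;
    char line[64];

    if (glob(pattern, 0, NULL, &g) != 0)
        return;
    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (f == NULL)
            continue;
        if (fgets(line, sizeof(line), f) != NULL)
            printf("%s: %s", g.gl_pathv[i], line);   /* value keeps its newline */
        fclose(f);
    }
    globfree(&g);
}

int main(void)
{
    print_all("/proc/fs/lustre/osc/*/max_pages_per_rpc");
    print_all("/proc/fs/lustre/osc/*/max_rpcs_in_flight");
    return 0;
}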






[Lustre-discuss] write RPC & congestion

2010-08-17 Thread burlen
Hi, thanks for previous help.

I have some questions about Lustre RPC and the sequence of events that 
occurs during large concurrent write()s involving many processes and a 
large data size per process.  I understand there is a mechanism of flow 
control by credits, but I'm a little unclear on how it works in general 
after reading the "networking & io protocol" white paper.

Is it true that a write() RPC transfers data in chunks of at least 1 MB 
and at most (max_pages_per_rpc * page_size) bytes, where page_size = 2^16? 
Can I use these bounds to estimate the number of RPCs issued per MB of 
data written?

About how many concurrent incoming write() RPCs per OSS service thread 
can a single server handle before it stops responding to incoming RPCs?

What happens to an RPC when the server is too busy to handle it? Is it 
even issued by the client? Does the client have to poll and/or resend 
the RPC? Does the process of polling for flow control credits add 
significant network/server congestion?

Is it likely that a large number of RPCs/flow control credit requests 
will induce enough network congestion that a client's RPCs time out? 
How does the client handle such a timeout?

Burlen



Re: [Lustre-discuss] buffering

2010-08-12 Thread burlen
Andreas Dilger wrote:
> On 2010-08-11, at 23:36, burlen wrote:
>   
>> I am interested in how write()s are buffered in Lustre on the client, 
>> server, and the network in between. Specifically, I'd like to understand what 
>> happens during writes when a large number of clients are making large 
>> writes to all of the OSTs on an OSS, and the buffers are inadequate to 
>> handle the outgoing/incoming data.
>> 
>
> Lustre doesn't buffer dirty pages on the OSS, only on the client.  The 
> clients are granted a "reserve" of space in each OST filesystem to ensure 
> there is enough free space for any cached writes that they do.
>
>   
Thanks for your answer.

If I understand the way write() typically works on Linux: during a large 
write(), too large to be buffered in the page cache, once the page cache 
is full dirty pages would be flushed to disk. The data transfer would 
block at that point until the dirty pages are written to disk, after which 
the data transfer would resume into the resulting free pages. But in 
Lustre I assume that once the client's page cache is full, the dirty 
pages are sent over the network to the OSS, where they are written to 
disk. In that case, does the network layer effectively act like a 
buffer, so that the client may resume the data transfer into the page 
cache before the former set of dirty pages actually hits the disk? Or does 
the data transfer block until the dirty pages actually reach the disk?
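
As an aside, the standard way to find out when buffered data has actually 
reached stable storage is to ask for it explicitly. The following is a 
generic POSIX sketch, not Lustre-specific, and only illustrates that write() 
may return once the data is cached while fsync() blocks until the dirty 
pages have been written back; the file path is hypothetical.

/* Generic POSIX sketch (not Lustre-specific): write() may return as soon
 * as the data sits in the client's cache, while fsync() blocks until the
 * dirty pages have actually been written back (over the network to the
 * OSS and on to disk, in the Lustre case). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static char buf[1 << 20];   /* 1 MB of zeroes */

int main(void)
{
    /* hypothetical file on a Lustre mount */
    int fd = open("/mnt/lustre/example.dat",
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Returns once the data is buffered; it need not be on disk yet. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        perror("write");

    /* Blocks until the buffered data has reached stable storage. */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}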

Thanks
Burlen



[Lustre-discuss] buffering

2010-08-11 Thread burlen
Hi,

I am interested in how write()s are buffered in Lustre on the client, 
server, and the network in between. Specifically, I'd like to understand what 
happens during writes when a large number of clients are making large 
writes to all of the OSTs on an OSS, and the buffers are inadequate to 
handle the outgoing/incoming data. I know nothing about Lustre's 
buffering; can anyone point me to a source of information?

Thanks
Burlen



[Lustre-discuss] llapi stripe_size

2010-04-16 Thread burlen
Calling llapi_file_create reports the following error:

error: bad stripe_size 4096, must be an even multiple of 65536 bytes: 
Invalid argument (22)

The operations manual says: "This value must be an even multiple of system 
page size, as shown by getpagesize()." The value 4096 above was returned 
by a call to getpagesize(). (I don't intend to set the stripe size this 
small; it is just to illustrate.)

Where does the number 65536 come from, and what's the right way to 
determine it at runtime?
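
In case it helps anyone reading the archive, here is a minimal sketch that 
rounds the requested stripe size up to a 64 KiB multiple before calling 
llapi_file_create(). It assumes 65536 is a fixed Lustre-wide granularity 
rather than something derived from the local page size, and the header 
location, target path, and stripe count are placeholders to adapt.

/* Sketch: create a file with a stripe size rounded up to a multiple of
 * 64 KiB, which is what llapi_file_create() appears to require.  The
 * header location varies by Lustre version (<lustre/liblustreapi.h> on
 * older releases). */
#include <stdio.h>
#include <lustre/lustreapi.h>

#define STRIPE_GRANULARITY 65536ULL   /* assumed minimum stripe unit */

int main(void)
{
    unsigned long long want = 4096;   /* e.g. the value from getpagesize() */
    unsigned long long stripe_size =
        (want + STRIPE_GRANULARITY - 1) / STRIPE_GRANULARITY *
        STRIPE_GRANULARITY;

    /* hypothetical path; stripe_offset -1 = any OST, stripe_count 4,
     * stripe_pattern 0 = RAID0 */
    int rc = llapi_file_create("/mnt/lustre/striped.dat", stripe_size,
                               -1, 4, 0);
    if (rc != 0)
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
    return rc ? 1 : 0;
}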

Thanks
Burlen





Re: [Lustre-discuss] programmatic access to parameters

2010-03-26 Thread burlen
Thanks very much for your help and advice.

Andreas Dilger wrote:
> On 2010-03-25, at 15:12, Andreas Dilger wrote:
>>> The llapi_* functions are great, I see how to set the stripe count
>>> and size. I wasn't sure if there was also a function to query about
>>> the configuration, eg number of OST's deployed?
>>
>> There isn't directly such a function, but indirectly this is possible
>> to get from userspace without changing Lustre or the liblustreapi
>> library:
>
>
> I filed bug 22472 for this issue, with a proposed patch, though the 
> actual implementation may change before this is included into any 
> release.
>
> Cheers, Andreas
> -- 
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>



Re: [Lustre-discuss] programmatic access to parameters

2010-03-25 Thread burlen
> I don't know what your constraints are, but I should note that this sort
> of information (number of OSTs) can be obtained rather trivially from 
> any Lustre client via a shell prompt, to wit: 
True, but parsing the output of a C system() call is something I hoped 
to avoid. It might not be portable and might be fragile over time.
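
Here is a rough sketch of doing that count from C without shelling out, by 
listing the same /proc/fs/lustre/osc directory that the quoted 
"ls /proc/fs/lustre/osc | grep OST | wc -l" pipeline below uses. The proc 
layout is not a stable API, so treat the path as an assumption to verify.

/* Sketch: count OST targets visible to this client by listing
 * /proc/fs/lustre/osc.  The proc layout is version-dependent. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *d = opendir("/proc/fs/lustre/osc");
    struct dirent *e;
    int num_osts = 0;

    if (d == NULL) {
        perror("opendir /proc/fs/lustre/osc");
        return 1;
    }
    while ((e = readdir(d)) != NULL)
        if (strstr(e->d_name, "OST") != NULL)   /* skip ".", "..", non-OST entries */
            num_osts++;
    closedir(d);

    printf("%d OSTs\n", num_osts);
    return 0;
}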

This also gets at my motivation for asking for a header with the Lustre 
limits: if I hard-code something and down the road the limits change, we 
are suddenly shooting ourselves in the foot.

I think I made a mistake about the MPI hints in my last mail. The 
striping_* hints are part of the MPI standard at least as far back as 
2003. It says that they are reserved, but implementations are not 
required to interpret them. That's a pretty weak assurance.
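
For context, those hints are passed through an MPI_Info object at file-open 
time. A minimal sketch follows; the output path is hypothetical, and whether 
the hints take effect depends entirely on the MPI implementation and its 
ADIO driver.

/* Sketch: requesting Lustre striping through the reserved MPI-IO hints
 * "striping_factor" and "striping_unit".  Honoring them is up to the
 * MPI implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[] = "/mnt/lustre/output.dat";   /* hypothetical output file */
    MPI_File fh;
    MPI_Info info;
    int rc;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "128");    /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");  /* stripe size, bytes */

    rc = MPI_File_open(MPI_COMM_WORLD, path,
                       MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    if (rc == MPI_SUCCESS)
        MPI_File_close(&fh);
    else
        fprintf(stderr, "MPI_File_open failed (%d)\n", rc);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}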

I'd like this thread to be considered by the Lustre team as a feature 
request for better programmatic support. I think it makes sense because 
performance is fairly sensitive to both the deployed hardware and the 
striping parameters. There can also be more information about the 
specific IO needs available at the application level than at the MPI 
level. And MPI implementations don't have to honor hints.

Thanks, I am grateful for the help as I get up to speed on the Lustre fs.
Burlen

Cliff White wrote:
> burlen wrote:
>> System limits are sometimes provided in a header; I wasn't sure if 
>> Lustre adopted that approach. The llapi_* functions are great, and I see 
>> how to set the stripe count and size. I wasn't sure if there was also 
>> a function to query the configuration, e.g. the number of OSTs 
>> deployed?
>>
>> This would be for use in a global hybrid magnetospheric simulation 
>> that runs at a large scale (1E4-1E5 cores). Good striping 
>> parameters depend on the run and could be calculated at run time. It 
>> can make a significant difference in our run times to have these set 
>> correctly. I am not sure we always want the maximum stripe count. 
>> I think this depends on how many files we are synchronously 
>> writing and the total number of available OSTs. E.g., if there are 256 
>> OSTs on some system and we have 2 files to write, would it not make 
>> sense to set the stripe count to 128?
>>
>> We can't rely on our users to set the Lustre parameters correctly. We 
>> can't rely on the system defaults either; they typically aren't set 
>> optimally for our use case. MPI hints look promising, but the ADIO 
>> Lustre optimizations are fairly new and, as far as I understand, not 
>> publicly available in MPICH until the next release (maybe in May?). We 
>> run on a variety of systems with a variety of MPI implementations 
>> (e.g. Cray, SGI). The MPI hints will only be useful on implementations 
>> that support the particular hint. From a consistency point of view we 
>> need to make use of both MPI hints and direct access via the llapi so 
>> that we run well on all those systems, regardless of which MPI 
>> implementation is deployed. 
>
> I don't know what your constraints are, but I should note that this sort
> of information (number of OSTs) can be obtained rather trivially from 
> any Lustre client via a shell prompt, to wit:
> # lctl dl |grep OST |wc -l
> 2
> or:
> # ls /proc/fs/lustre/osc | grep OST |wc -l
> 2
>
> probably a few other ways to do that. Not as stylish as llapi_*..
>
> cliffw
>
>> Thanks
>> Burlen
>>
>>
>> Andreas Dilger wrote:
>>> On 2010-03-23, at 14:25, burlen wrote:
>>>> How can one programmatically probe the Lustre system an application is
>>>> running on?
>>> Lustre-specific interfaces are generally "llapi_*" functions, from 
>>> liblustreapi.
>>>
>>>> At compile time I'd like access to the various Lustre system limits, 
>>>> for example those listed in ch. 32 of the operations manual.
>>> There are no llapi_* functions for this today.  Can you explain a 
>>> bit better what you are trying to use this for?
>>>
>>> statfs(2) will tell you a number of limits, as will pathconf(3), and 
>>> those are standard POSIX APIs.
>>>
>>>> Incidentally, one I didn't see listed in that chapter is the maximum 
>>>> number of OSTs a single file can be striped across.
>>> That is the first thing listed:
>>>
>>>>> 32.1 Maximum Stripe Count
>>>>> The maximum number of stripe count is 160. This limit is 
>>>>> hard-coded, but is near the upper limit imposed by the underlying 
>>>>> ext3 file system. It may be increased in future releases. Under normal 
>>>>> circumstances, the stripe count is not affected by ACLs.

Re: [Lustre-discuss] programmatic access to parameters

2010-03-24 Thread burlen
System limits are sometimes provided in a header; I wasn't sure if 
Lustre adopted that approach. The llapi_* functions are great, and I see how 
to set the stripe count and size. I wasn't sure if there was also a 
function to query the configuration, e.g. the number of OSTs deployed?

This would be for use in a global hybrid magnetospheric simulation that 
runs at a large scale (1E4-1E5 cores). Good striping parameters 
depend on the run and could be calculated at run time. It can make a 
significant difference in our run times to have these set correctly. I 
am not sure we always want the maximum stripe count. I think 
this depends on how many files we are synchronously writing and the 
total number of available OSTs. E.g., if there are 256 OSTs on some 
system and we have 2 files to write, would it not make sense to set the 
stripe count to 128?
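
A sketch of that heuristic, assuming the 160-stripe-per-file limit quoted 
elsewhere in this thread and that the total OST count can be discovered as 
discussed earlier; the function name is hypothetical.

/* Sketch: divide the available OSTs evenly among the files being written
 * concurrently, clamped to the per-file stripe count limit (160 at the
 * time of this thread). */
static int choose_stripe_count(int num_osts, int num_files)
{
    const int max_stripe_count = 160;   /* assumed hard limit */
    int count;

    if (num_files < 1)
        num_files = 1;
    count = num_osts / num_files;       /* e.g. 256 OSTs, 2 files -> 128 */
    if (count < 1)
        count = 1;
    if (count > max_stripe_count)
        count = max_stripe_count;
    return count;
}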

We can't rely on our users to set the Lustre parameters correctly. We 
can't rely on the system defaults either; they typically aren't set 
optimally for our use case. MPI hints look promising, but the ADIO Lustre 
optimizations are fairly new and, as far as I understand, not publicly 
available in MPICH until the next release (maybe in May?). We run on a 
variety of systems with a variety of MPI implementations (e.g. Cray, 
SGI). The MPI hints will only be useful on implementations that support 
the particular hint. From a consistency point of view we need to make 
use of both MPI hints and direct access via the llapi so that we run 
well on all those systems, regardless of which MPI implementation is 
deployed.

Thanks
Burlen


Andreas Dilger wrote:
> On 2010-03-23, at 14:25, burlen wrote:
>> How can one programmatically probe the Lustre system an application is
>> running on?
>
> Lustre-specific interfaces are generally "llapi_*" functions, from 
> liblustreapi.
>
>> At compile time I'd like access to the various Lustre system limits, 
>> for example those listed in ch. 32 of the operations manual.
>
> There are no llapi_* functions for this today.  Can you explain a bit 
> better what you are trying to use this for?
>
> statfs(2) will tell you a number of limits, as will pathconf(3), and 
> those are standard POSIX APIs.
>
>> Incidentally, one I didn't see listed in that chapter is the maximum 
>> number of OSTs a single file can be striped across.
>
> That is the first thing listed:
>
>>> 32.1 Maximum Stripe Count
>>> The maximum number of stripe count is 160. This limit is hard-coded, 
>>> but is near the upper limit imposed by the underlying ext3 file 
>>> system. It may be increased in future releases. Under normal 
>>> circumstances, the stripe count is not affected by ACLs.
>>>
>> At run time I'd like to be able to probe the size (number of OSSs, OSTs, 
>> etc.) of the system the application is running on.
>
>
> One shortcut is that specifying "-1" for the stripe count will stripe a 
> file across all available OSTs, which is what most applications want 
> if they are not being striped over only 1 or 2 OSTs.
>
> If you are using MPIIO, the Lustre ADIO layer can optimize these 
> things for you, based on application hints.
>
> If you could elaborate on your needs, there may not be any need to 
> make your application more Lustre-aware.
>
> Cheers, Andreas
> -- 
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
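
For completeness, a small sketch of the standard POSIX route mentioned 
above; statfs(2) and pathconf(3) report generic filesystem limits (block 
size, free space, name length) rather than Lustre striping parameters, and 
the mount point is hypothetical.

/* Sketch: query generic filesystem limits for a Lustre mount point using
 * the standard POSIX interfaces mentioned above. */
#include <stdio.h>
#include <sys/statfs.h>
#include <unistd.h>

int main(void)
{
    const char *mnt = "/mnt/lustre";   /* hypothetical mount point */
    struct statfs sfs;
    long name_max;

    if (statfs(mnt, &sfs) == 0) {
        printf("block size:   %ld\n", (long)sfs.f_bsize);
        printf("total blocks: %llu\n", (unsigned long long)sfs.f_blocks);
        printf("free blocks:  %llu\n", (unsigned long long)sfs.f_bfree);
    }

    name_max = pathconf(mnt, _PC_NAME_MAX);
    if (name_max > 0)
        printf("max filename length: %ld\n", name_max);

    return 0;
}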



[Lustre-discuss] programmatic access to parameters

2010-03-23 Thread burlen
Hi,

How can one programmatically probe the Lustre system an application is 
running on?

At compile time I'd like access to the various Lustre system limits, 
for example those listed in ch. 32 of the operations manual. Incidentally, 
one I didn't see listed in that chapter is the maximum number of OSTs a 
single file can be striped across.

At run time I'd like to be able to probe the size (number of OSSs, OSTs, 
etc.) of the system the application is running on.

I hope not to have to hard code such values.

Thanks
Burlen

