Re: [Lustre-discuss] write RPC & congestion
On 09/24/2010 06:36 PM, Andreas Dilger wrote:
> On 2010-09-24, at 19:10, Andreas Dilger wrote:
>> On 2010-09-24, at 18:20, burlen wrote:
>>> To be sure I understand this, is it correct that each OST has its own pool
>>> of service threads? So the system-wide number of service threads is bounded
>>> by oss_max_threads*num_osts?
>> Actually, the current oss_max_threads tunable is for the whole OSS (as the
>> name implies).

Again, many thanks for your help.

With respect to an upper bound on the number of RPCs and RDMAs in flight system wide, does the situation change much on the Cray XT5 with Lustre 1.8 and OSSs directly connected to the 3D torus? I am asking after having seen the XT3 section in the manual; I'm not sure whether it applies to the XT5 and, if it does, how it might influence the above tunables.

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] write RPC & congestion
Hi,

Thanks for all the help.

Andreas Dilger wrote:
> When one of the server threads is ready to process a read/write request it
> will get or put the data from/to the buffers that the client already
> prepared. The number of currently active IO requests is exactly the number
> of active service threads (up to 512 by default).

To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?

Thanks again
Burlen
Re: [Lustre-discuss] write RPC & congestion
Andreas Dilger wrote:
> On 2010-08-17, at 14:15, burlen wrote:
>> I have some questions about Lustre RPC and the sequence of events that
>> occur during large concurrent write() involving many processes and large
>> data size per process. I understand there is a mechanism of flow
>> control by credits, but I'm a little unclear on how it works in general
>> after reading the "networking & io protocol" white paper.
>
> There are different levels of flow control. There is one at the LNET level
> that keeps low-level messages from overwhelming the server, and avoids
> stalling small/reply messages at the back of a deep queue of requests.
>
>> Is it true that a write() RPC transfers data in chunks of at least 1MB
>> and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16?
>> Can I use the bounds to estimate the number of RPCs issued per MB of
>> data to write?
>
> Currently, 1MB is the largest bulk IO size, and is the typical size used by
> clients for all IO.
>
>> About how many concurrent incoming write() RPCs per OSS service thread
>> can a single server handle before it stops responding to incoming RPCs?
>
> The server can handle tens of thousands of write _requests_, but note that
> since Lustre has always been designed as an RDMA-capable protocol the request
> is relatively small (a few hundred bytes) and does not contain any of the
> DATA.
>
> When one of the server threads is ready to process a read/write request it
> will get or put the data from/to the buffers that the client already
> prepared. The number of currently active IO requests is exactly the number
> of active service threads (up to 512 by default).
>
>> What happens to an RPC when the server is too busy to handle it; is it
>> even issued by the client? Does the client have to poll and/or resend
>> the RPC? Does the process of polling for flow control credits add
>> significant network/server congestion?
>
> The clients limit the number of concurrent RPC requests, by default to 8 per
> OST. The LNET-level message credits will also limit the number of in-flight
> messages in case there is, e.g., an LNET router between the client and server.
>
> The client will almost never time out a request, as it is informed how long
> requests are currently taking to process and will wait patiently for its
> earlier requests to finish processing. If the client is going to time out a
> request (based on an earlier request timeout that is about to be exceeded)
> the server will inform it to continue waiting and give it a new processing
> time estimate (unless of course the server is non-functional or so
> overwhelmed that it can't even do that).
>
>> Is it likely that a large number of RPCs/flow control credit requests
>> will induce enough network congestion that clients' RPCs time out?
>> How does the client handle such a timeout?
>
> Since the flow control credits are bounded, and will be returned to the peer
> as earlier requests complete, there is no additional traffic due to this.
> However, considering that HPC clusters are distributed denial-of-service
> engines, it is always possible to overwhelm the server under some conditions.
> In case of a client RPC timeout (hundreds of seconds under load) the client
> will resend the request and/or try to contact the backup server until one
> responds.

Thank you for your help. Is my understanding correct? A single RPC request will initiate an RDMA transfer of at most "max_pages_per_rpc" pages, where the page unit is the Lustre page size, 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given client, if there are more than "max_pages_per_rpc" pages of data available to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

Would it be correct to say: the purpose of the "max_pages_per_rpc" parameter is to enable the servers to even out the individual progress of concurrent clients with a lot of data to move, and to more fairly apportion the available bandwidth amongst concurrently writing clients?
[Lustre-discuss] write RPC & congestion
Hi, thanks for previous help.

I have some questions about Lustre RPC and the sequence of events that occur during large concurrent write() involving many processes and large data size per process. I understand there is a mechanism of flow control by credits, but I'm a little unclear on how it works in general after reading the "networking & io protocol" white paper.

Is it true that a write() RPC transfers data in chunks of at least 1MB and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16? Can I use the bounds to estimate the number of RPCs issued per MB of data to write?

About how many concurrent incoming write() RPCs per OSS service thread can a single server handle before it stops responding to incoming RPCs?

What happens to an RPC when the server is too busy to handle it; is it even issued by the client? Does the client have to poll and/or resend the RPC? Does the process of polling for flow control credits add significant network/server congestion?

Is it likely that a large number of RPCs/flow control credit requests will induce enough network congestion that clients' RPCs time out? How does the client handle such a timeout?

Burlen
Re: [Lustre-discuss] buffering
Andreas Dilger wrote:
> On 2010-08-11, at 23:36, burlen wrote:
>> I am interested in how write()s are buffered in Lustre on the client,
>> server, and network in between. Specifically I'd like to understand what
>> happens during writes when a large number of clients are making large
>> writes to all of the OSTs on an OSS, and the buffers are inadequate to
>> handle the outgoing/incoming data.
>
> Lustre doesn't buffer dirty pages on the OSS, only on the client. The
> clients are granted a "reserve" of space in each OST filesystem to ensure
> there is enough free space for any cached writes that they do.

Thanks for your answer. If I understand the way write() typically works on Linux: during a large write(), too large to be buffered in the page cache, once the page cache is full dirty pages are flushed to disk. The data transfer blocks at that point until the dirty pages are written to disk, at which point the transfer resumes into the resulting free pages.

But in Lustre I assume that once the client's page cache is full, the dirty pages are sent over the network to the OSS, where they are written to disk. In that case, does the network layer effectively act like a buffer, so that the client may resume the data transfer into the page cache before the former set of dirty pages actually hits the disk? Or does the data transfer block until the dirty pages actually reach the disk?

Thanks
Burlen
[Lustre-discuss] buffering
Hi,

I am interested in how write()s are buffered in Lustre on the client, server, and network in between. Specifically I'd like to understand what happens during writes when a large number of clients are making large writes to all of the OSTs on an OSS, and the buffers are inadequate to handle the outgoing/incoming data. I know nothing about Lustre's buffering; can anyone point me to a source of information?

Thanks
Burlen
[Lustre-discuss] llapi stripe_size
Calling llapi_file_create reports the following error:

error: bad stripe_size 4096, must be an even multiple of 65536 bytes: Invalid argument (22)

The operations manual says: "This value must be an even multiple of system page size, as shown by getpagesize." The value 4096 above was returned from a call to getpagesize. (I don't intend to set the stripe size so small; this is just to illustrate.)

Where does the number 65536 come from, and what's the right way to determine it at runtime?

Thanks
Burlen
Re: [Lustre-discuss] programmatic access to parameters
Thanks very much for your help and advice.

Andreas Dilger wrote:
> On 2010-03-25, at 15:12, Andreas Dilger wrote:
>>> The llapi_* functions are great, I see how to set the stripe count
>>> and size. I wasn't sure if there was also a function to query the
>>> configuration, e.g. the number of OSTs deployed?
>>
>> There isn't directly such a function, but indirectly this is possible
>> to get from userspace without changing Lustre or the liblustreapi
>> library.
>
> I filed bug 22472 for this issue, with a proposed patch, though the
> actual implementation may change before this is included into any
> release.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
Re: [Lustre-discuss] programmatic access to parameters
> I don't know what your constraints are, but should note that this sort
> of information (number of OSTs) can be obtained rather trivially from
> any lustre client via shell prompt, to wit:

True, but parsing the output of a C "system" call is something I hoped to avoid. It might not be portable and might be fragile over time. This also gets at my motivation for asking for a header with the Lustre limits: if I hard-code something and down the road the limits change, we are suddenly shooting ourselves in the foot.

I think I made a mistake about the MPI hints in my last mail. The striping_* hints have been part of the MPI standard at least as far back as 2003. It says that they are reserved, but implementations are not required to interpret them. That's a pretty weak assurance.

I'd like this thread to be considered by the Lustre team as a feature request for better programmatic support. I think it makes sense because performance is fairly sensitive to both the deployed hardware and the striping parameters. There can also be more information about the specific IO needs available at the application level than at the MPI level. And MPI implementations don't have to honor hints.

Thanks, I am grateful for the help as I get up to speed on the Lustre fs.
Burlen

Cliff White wrote:
> burlen wrote:
>> System limits are sometimes provided in a header; I wasn't sure if
>> Lustre adopted that approach. The llapi_* functions are great, I see
>> how to set the stripe count and size. I wasn't sure if there was also
>> a function to query the configuration, e.g. the number of OSTs
>> deployed?
>>
>> This would be for use in a global hybrid magnetospheric simulation
>> that runs at large scale (1E4-1E5 cores). The good striping
>> parameters depend on the run, and could be calculated at run time. It
>> can make a significant difference in our run times to have these set
>> correctly. I am not sure we always want the maximum stripe count.
>> I think this depends on how many files we are synchronously
>> writing, and the number of available OSTs in total. E.g., if there are
>> 256 OSTs on some system and we have 2 files to write, would it not make
>> sense to set the stripe count to 128?
>>
>> We can't rely on our users to set the Lustre parameters correctly. We
>> can't rely on the system defaults either; they typically aren't set
>> optimally for our use case. MPI hints look promising, but the ADIO
>> Lustre optimizations are fairly new, and as far as I understand not
>> publicly available in MPICH until the next release (maybe in May?). We
>> run on a variety of systems with a variety of MPI implementations
>> (e.g. Cray, SGI). The MPI hints will only be useful on implementations
>> that support the particular hint. From a consistency point of view we
>> need to both make use of MPI hints and direct access via the llapi, so
>> that we run well on all those systems, regardless of which MPI
>> implementation is deployed.
>
> I don't know what your constraints are, but should note that this sort
> of information (number of OSTs) can be obtained rather trivially from
> any lustre client via shell prompt, to wit:
> # lctl dl | grep OST | wc -l
> 2
> or:
> # ls /proc/fs/lustre/osc | grep OST | wc -l
> 2
>
> probably a few other ways to do that. Not as stylish as llapi_*..
>
> cliffw
>
>> Thanks
>> Burlen
>>
>> Andreas Dilger wrote:
>>> On 2010-03-23, at 14:25, burlen wrote:
>>>> How can one programmatically probe the lustre system an application
>>>> is running on?
>>> Lustre-specific interfaces are generally "llapi_*" functions, from
>>> liblustreapi.
>>>
>>>> At compile time I'd like access to the various lustre system limits,
>>>> for example those listed in ch. 32 of the operations manual.
>>> There are no llapi_* functions for this today. Can you explain a
>>> bit better what you are trying to use this for?
>>>
>>> statfs(2) will tell you a number of limits, as will pathconf(3), and
>>> those are standard POSIX APIs.
>>>
>>>> Incidentally, one I didn't see listed in that chapter is the maximum
>>>> number of OSTs a single file can be striped across.
>>> That is the first thing listed:
>>>
>>>>> 32.1 Maximum Stripe Count
>>>>> The maximum stripe count is 160. This limit is
>>>>> hard-coded, but is near the upper limit imposed by the underlying
>>>>> ext3 file system. It may be increased in future releases.
Re: [Lustre-discuss] programmatic access to parameters
System limits are sometimes provided in a header; I wasn't sure if Lustre adopted that approach. The llapi_* functions are great, I see how to set the stripe count and size. I wasn't sure if there was also a function to query the configuration, e.g. the number of OSTs deployed?

This would be for use in a global hybrid magnetospheric simulation that runs at large scale (1E4-1E5 cores). The good striping parameters depend on the run, and could be calculated at run time. It can make a significant difference in our run times to have these set correctly. I am not sure we always want the maximum stripe count. I think this depends on how many files we are synchronously writing, and the number of available OSTs in total. E.g., if there are 256 OSTs on some system and we have 2 files to write, would it not make sense to set the stripe count to 128?

We can't rely on our users to set the Lustre parameters correctly. We can't rely on the system defaults either; they typically aren't set optimally for our use case. MPI hints look promising, but the ADIO Lustre optimizations are fairly new, and as far as I understand not publicly available in MPICH until the next release (maybe in May?). We run on a variety of systems with a variety of MPI implementations (e.g. Cray, SGI). The MPI hints will only be useful on implementations that support the particular hint. From a consistency point of view we need to both make use of MPI hints and direct access via the llapi, so that we run well on all those systems, regardless of which MPI implementation is deployed.

Thanks
Burlen

Andreas Dilger wrote:
> On 2010-03-23, at 14:25, burlen wrote:
>> How can one programmatically probe the lustre system an application is
>> running on?
>
> Lustre-specific interfaces are generally "llapi_*" functions, from
> liblustreapi.
>
>> At compile time I'd like access to the various lustre system limits,
>> for example those listed in ch. 32 of the operations manual.
>
> There are no llapi_* functions for this today. Can you explain a bit
> better what you are trying to use this for?
>
> statfs(2) will tell you a number of limits, as will pathconf(3), and
> those are standard POSIX APIs.
>
>> Incidentally, one I didn't see listed in that chapter is the maximum
>> number of OSTs a single file can be striped across.
>
> That is the first thing listed:
>
>>> 32.1 Maximum Stripe Count
>>> The maximum stripe count is 160. This limit is hard-coded,
>>> but is near the upper limit imposed by the underlying ext3 file
>>> system. It may be increased in future releases. Under normal
>>> circumstances, the stripe count is not affected by ACLs.
>
>> At run time I'd like to be able to probe the size (number of OSSs, OSTs,
>> etc.) of the system the application is running on.
>
> One shortcut is that specifying "-1" for the stripe count will stripe a
> file across all available OSTs, which is what most applications want,
> if they are not being striped over only 1 or 2 OSTs.
>
> If you are using MPIIO, the Lustre ADIO layer can optimize these
> things for you, based on application hints.
>
> If you could elaborate on your needs, there may not be any need to
> make your application more Lustre-aware.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
[Lustre-discuss] programmatic access to parameters
Hi,

How can one programmatically probe the lustre system an application is running on? At compile time I'd like access to the various lustre system limits, for example those listed in ch. 32 of the operations manual. Incidentally, one I didn't see listed in that chapter is the maximum number of OSTs a single file can be striped across. At run time I'd like to be able to probe the size (number of OSSs, OSTs, etc.) of the system the application is running on. I hope not to have to hard code such values.

Thanks
Burlen