Re: [lustre-discuss] problem getting high performance output to single file
David,

You are right, there is a lock. As Patrick mentioned, https://jira.hpdd.intel.com/browse/LU-1669 will solve your problem. Please check it out. In my own experience, the Lustre 2.7.0 client does solve this problem very well, and I have gotten very good performance so far.

Regards,
Cuong

On Wed, May 20, 2015 at 4:46 AM, David A. Schneider david...@slac.stanford.edu wrote:

We do use checksums, but we can't turn them off. I know we've measured some performance penalty with checksums. I'll check about configuring the Lustre clients to use RDMA. We ran into something similar where our MPI programs were not taking advantage of the InfiniBand - we noticed much slower message passing than we expected - so it sounds like there is a similar thing we can do with Lustre, but I guess the locking is the main issue. All our compute nodes are currently running Red Hat 5, and it doesn't look like Lustre 2.6 was tested with RHEL 5, but we have been talking about moving everything to at least RHEL 6, maybe RHEL 7, so there's hope. Thanks for the help!

best,
David

On 05/19/15 11:10, Patrick Farrell wrote:

Ah. I think I know what's going on here: In Lustre 2.x client versions prior to 2.6, only one process on a given client can write to a given file at a time, regardless of how the file is striped. So if you are writing to the same file, there will be little to no benefit to putting an extra process on the same node. A *single* process on a node could benefit, but not the split you've described. The details, which are essentially just that a pair of per-file locks are used by any individual process writing to a file, are here: https://jira.hpdd.intel.com/browse/LU-1669

On 5/19/15, 12:59 PM, Mohr Jr, Richard Frank (Rick Mohr) rm...@utk.edu wrote:

On May 19, 2015, at 1:44 PM, Schneider, David A. david...@slac.stanford.edu wrote:

Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6GB of writing). When I ran the MPI job on one host (the hosts have 12 cores; by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that is limiting the throughput from a single client. Here are some things you could look into (although they might require more tinkering than you have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are running IP-over-IB. Can you configure the clients to use RDMA? (They would have nids like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version? Earlier Lustre versions used a single-threaded ptlrpcd to manage network traffic, but newer versions have a multi-threaded implementation. You may need to check compatibility with the Lustre version running on the servers, though.

3) Do you have checksums disabled? Try running "lctl get_param osc.*.checksums". If the values are "1", then checksums are enabled, which can slow down performance. You could try setting the value to "0" to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

--
Nguyen Viet Cuong
Re: [lustre-discuss] problem getting high performance output to single file
David,

What interconnect are you using for Lustre? (IB/o2ib [FDR, QDR, DDR], Ethernet/tcp [40GbE, 10GbE, 1GbE]) You can run 'lctl list_nids' to see what protocol LNet is binding to, then look at that interface for the specific type. Also, do you know anything about the server side of your Lustre FS? What make/model of block devices are used in the OSTs?

--Jeff

On 5/19/15 9:05 AM, Schneider, David A. wrote:

Thanks. For the client, where I am running from, I have:

$ cat /proc/fs/lustre/version
lustre: 2.1.6
kernel: patchless_client
build: jenkins--PRISTINE-2.6.18-348.4.1.el5

best,
David Schneider

From: Patrick Farrell [p...@cray.com]
Sent: Tuesday, May 19, 2015 9:03 AM
To: Schneider, David A.; John Bauer; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

For the clients: cat /proc/fs/lustre/version. For the servers it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu wrote:

Hi,

My first test was just to do the for loop where I allocate a 4MB buffer, initialize it, and delete it. That program ran at about 6 GB/sec. Once I write to a file, I drop down to 370 MB/sec. Our top performance for I/O to one file has been about 400 MB/sec.

As for the question "Which versions are you using in servers and clients?": I don't know what command determines this. I suspect it is older since we are on Red Hat 5. I will ask.

best,
David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

David,

You note that you write a 6GB file. I suspect that your Linux systems have significantly more memory than 6GB, meaning your file will end up being cached in the system buffers. It won't matter how many OSTs you use: you probably are not measuring the speed to the OSTs but rather the memory copy speed. What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider

--
Jeff Johnson
Co-Founder
Aeon Computing
jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001  f: 858-412-3845
m: 619-204-9061
4170 Morena Boulevard, Suite
Re: [lustre-discuss] problem getting high performance output to single file
We do use checksums, but we can't turn them off. I know we've measured some performance penalty with checksums. I'll check about configuring the Lustre clients to use RDMA. We ran into something similar where our MPI programs were not taking advantage of the InfiniBand - we noticed much slower message passing than we expected - so it sounds like there is a similar thing we can do with Lustre, but I guess the locking is the main issue. All our compute nodes are currently running Red Hat 5, and it doesn't look like Lustre 2.6 was tested with RHEL 5, but we have been talking about moving everything to at least RHEL 6, maybe RHEL 7, so there's hope. Thanks for the help!

best,
David

On 05/19/15 11:10, Patrick Farrell wrote:

Ah. I think I know what's going on here: In Lustre 2.x client versions prior to 2.6, only one process on a given client can write to a given file at a time, regardless of how the file is striped. So if you are writing to the same file, there will be little to no benefit to putting an extra process on the same node. A *single* process on a node could benefit, but not the split you've described. The details, which are essentially just that a pair of per-file locks are used by any individual process writing to a file, are here: https://jira.hpdd.intel.com/browse/LU-1669

On 5/19/15, 12:59 PM, Mohr Jr, Richard Frank (Rick Mohr) rm...@utk.edu wrote:

On May 19, 2015, at 1:44 PM, Schneider, David A. david...@slac.stanford.edu wrote:

Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6GB of writing). When I ran the MPI job on one host (the hosts have 12 cores; by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that is limiting the throughput from a single client. Here are some things you could look into (although they might require more tinkering than you have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are running IP-over-IB. Can you configure the clients to use RDMA? (They would have nids like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version? Earlier Lustre versions used a single-threaded ptlrpcd to manage network traffic, but newer versions have a multi-threaded implementation. You may need to check compatibility with the Lustre version running on the servers, though.

3) Do you have checksums disabled? Try running "lctl get_param osc.*.checksums". If the values are "1", then checksums are enabled, which can slow down performance. You could try setting the value to "0" to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
Re: [lustre-discuss] problem getting high performance output to single file
Hi,

My first test was just to do the for loop where I allocate a 4MB buffer, initialize it, and delete it. That program ran at about 6 GB/sec. Once I write to a file, I drop down to 370 MB/sec. Our top performance for I/O to one file has been about 400 MB/sec.

As for the question "Which versions are you using in servers and clients?": I don't know what command determines this. I suspect it is older since we are on Red Hat 5. I will ask.

best,
David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

David,

You note that you write a 6GB file. I suspect that your Linux systems have significantly more memory than 6GB, meaning your file will end up being cached in the system buffers. It won't matter how many OSTs you use: you probably are not measuring the speed to the OSTs but rather the memory copy speed. What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
Re: [lustre-discuss] problem getting high performance output to single file
On May 19, 2015, at 11:40 AM, Schneider, David A. david...@slac.stanford.edu wrote:

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles?

If you are looking for a simple shared-file test, you could try something like this:

1) Create a file with a stripe size of 1 GB and a stripe count of 6.

2) Write an MPI program where each process writes 1 GB of sequential data. Each process should first seek to (mpi_rank)*(1GB) and then write 1 GB. This ensures that all processes write to non-overlapping parts of the file.

3) Start the program running on 6 nodes (1 process per node).

In a scenario like that, you should effectively get file-per-process speeds even though you are writing to a shared file, because each process is writing to a different OST.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
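[A minimal sketch of the test Rick describes, using plain MPI plus POSIX I/O. The path, stripe flags, and per-write chunk size below are placeholders, not anything specified in the thread; older lfs versions spell the stripe-size flag -s rather than -S.]

    /* Sketch of the shared-file test above: each MPI rank seeks to
     * rank*1GB in a shared file and writes 1GB sequentially, so ranks
     * touch non-overlapping stripes. Assumes the file was pre-striped,
     * e.g.: lfs setstripe -c 6 -S 1g /lustre/shared_testfile
     * (placeholder path; flag spelling varies by lfs version).
     * Run with one rank per node, e.g. mpirun -np 6 -npernode 1. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (4UL << 20)   /* 4MB per write() */
    #define TOTAL (1UL << 30)   /* 1GB per rank */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *buf = malloc(CHUNK);
        memset(buf, rank, CHUNK);

        /* Each rank opens its own descriptor on the same pre-striped file. */
        int fd = open("/lustre/shared_testfile", O_WRONLY);
        if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

        /* Seek to this rank's private, stripe-aligned 1GB region. */
        lseek(fd, (off_t)rank * TOTAL, SEEK_SET);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (size_t done = 0; done < TOTAL; done += CHUNK)
            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK) {
                perror("write");
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
        close(fd);
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("wrote %d x 1GB in %.2f s\n", size, MPI_Wtime() - t0);
        MPI_Finalize();
        return 0;
    }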
Re: [lustre-discuss] problem getting high performance output to single file
For the clients: cat /proc/fs/lustre/version. For the servers it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu wrote:

Hi,

My first test was just to do the for loop where I allocate a 4MB buffer, initialize it, and delete it. That program ran at about 6 GB/sec. Once I write to a file, I drop down to 370 MB/sec. Our top performance for I/O to one file has been about 400 MB/sec.

As for the question "Which versions are you using in servers and clients?": I don't know what command determines this. I suspect it is older since we are on Red Hat 5. I will ask.

best,
David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

David,

You note that you write a 6GB file. I suspect that your Linux systems have significantly more memory than 6GB, meaning your file will end up being cached in the system buffers. It won't matter how many OSTs you use: you probably are not measuring the speed to the OSTs but rather the memory copy speed. What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
[lustre-discuss] problem getting high performance output to single file
I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider
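[For reference, a minimal sketch of the kind of single-process test described here: create the striped file first with lfs setstripe, then time a plain open/write loop. The path, chunk size, and total size are placeholders; and, as other replies in this thread note, a single writer on one client may not go faster no matter how many OSTs the file is striped across.]

    /* Single-process POSIX write test over a pre-striped file. Create
     * the file first, e.g.: lfs setstripe -c 8 /lustre/stripetest
     * (placeholder path; lfs setstripe creates the file with that
     * layout, and flag spellings vary by lfs version).
     * On older glibc, link with -lrt for clock_gettime(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (4UL << 20)   /* 4MB per write() */
    #define TOTAL (6UL << 30)   /* 6GB total */

    int main(void)
    {
        char *buf = malloc(CHUNK);
        if (!buf) return 1;
        memset(buf, 'x', CHUNK);

        int fd = open("/lustre/stripetest", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (size_t done = 0; done < TOTAL; done += CHUNK)
            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK) {
                perror("write");
                return 1;
            }
        close(fd);
        clock_gettime(CLOCK_MONOTONIC, &b);

        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        printf("%.0f MB/s\n", TOTAL / secs / 1e6);
        return 0;
    }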
Re: [lustre-discuss] problem getting high performance output to single file
Which versions are you using in servers and clients?

On Wed, May 20, 2015 at 12:40 AM, Schneider, David A. david...@slac.stanford.edu wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider

--
Nguyen Viet Cuong
Re: [lustre-discuss] problem getting high performance output to single file
David,

You note that you write a 6GB file. I suspect that your Linux systems have significantly more memory than 6GB, meaning your file will end up being cached in the system buffers. It won't matter how many OSTs you use: you probably are not measuring the speed to the OSTs but rather the memory copy speed. What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
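[One way to see John's point directly is to time the same write loop twice: once right after the last write() returns, which largely measures the memory copy into the page cache, and again after an fsync(), which waits for the data to be pushed out toward the OSTs. A sketch with a placeholder path and sizes; nothing here is from the thread itself.]

    /* Demonstrates the buffer-cache effect described above: the apparent
     * write rate before fsync() vs. the rate once dirty pages are
     * flushed. Path and sizes are placeholders.
     * On older glibc, link with -lrt for clock_gettime(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (4UL << 20)   /* 4MB per write() */
    #define TOTAL (6UL << 30)   /* 6GB total */

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        char *buf = malloc(CHUNK);
        if (!buf) return 1;
        memset(buf, 'x', CHUNK);

        int fd = open("/lustre/cachetest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        double t0 = now();
        for (size_t done = 0; done < TOTAL; done += CHUNK)
            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK) {
                perror("write");
                return 1;
            }
        double t1 = now();      /* writes returned, data possibly only in cache */

        if (fsync(fd) != 0) perror("fsync");
        double t2 = now();      /* data forced out toward the OSTs */
        close(fd);

        printf("after write(): %.0f MB/s, after fsync(): %.0f MB/s\n",
               TOTAL / (t1 - t0) / 1e6, TOTAL / (t2 - t0) / 1e6);
        return 0;
    }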
Re: [lustre-discuss] problem getting high performance output to single file
Thanks. For the client, where I am running from, I have:

$ cat /proc/fs/lustre/version
lustre: 2.1.6
kernel: patchless_client
build: jenkins--PRISTINE-2.6.18-348.4.1.el5

best,
David Schneider

From: Patrick Farrell [p...@cray.com]
Sent: Tuesday, May 19, 2015 9:03 AM
To: Schneider, David A.; John Bauer; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

For the clients: cat /proc/fs/lustre/version. For the servers it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu wrote:

Hi,

My first test was just to do the for loop where I allocate a 4MB buffer, initialize it, and delete it. That program ran at about 6 GB/sec. Once I write to a file, I drop down to 370 MB/sec. Our top performance for I/O to one file has been about 400 MB/sec.

As for the question "Which versions are you using in servers and clients?": I don't know what command determines this. I suspect it is older since we are on Red Hat 5. I will ask.

best,
David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

David,

You note that you write a 6GB file. I suspect that your Linux systems have significantly more memory than 6GB, meaning your file will end up being cached in the system buffers. It won't matter how many OSTs you use: you probably are not measuring the speed to the OSTs but rather the memory copy speed. What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
Re: [lustre-discuss] problem getting high performance output to single file
Hi Jeff,

I know we have InfiniBand; however, when I ran lctl, what I see (maybe I should not put our IP addresses on the internet, so I'll xxx them out) is:

.xx.xx.xx@tcp2
.xx.xx.xx@tcp

Unfortunately, I'm not sure how to look at the interface for these types; maybe they are in turn connected to InfiniBand. I don't know much about the OSTs. I know there is a RAID structure that allows for the 400 MB/sec on each one. In one of my tests, I believe I wrote 44GB in 100 separate files in under 10 seconds, so the system can support 4.4 GB/sec.

best,
David Schneider

From: Jeff Johnson [jeff.john...@aeoncomputing.com]
Sent: Tuesday, May 19, 2015 9:11 AM
To: Schneider, David A.; Patrick Farrell; John Bauer; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

David,

What interconnect are you using for Lustre? (IB/o2ib [FDR, QDR, DDR], Ethernet/tcp [40GbE, 10GbE, 1GbE]) You can run 'lctl list_nids' to see what protocol LNet is binding to, then look at that interface for the specific type. Also, do you know anything about the server side of your Lustre FS? What make/model of block devices are used in the OSTs?

--Jeff

On 5/19/15 9:05 AM, Schneider, David A. wrote:

Thanks. For the client, where I am running from, I have:

$ cat /proc/fs/lustre/version
lustre: 2.1.6
kernel: patchless_client
build: jenkins--PRISTINE-2.6.18-348.4.1.el5

best,
David Schneider

From: Patrick Farrell [p...@cray.com]
Sent: Tuesday, May 19, 2015 9:03 AM
To: Schneider, David A.; John Bauer; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

For the clients: cat /proc/fs/lustre/version. For the servers it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu wrote:

Hi,

My first test was just to do the for loop where I allocate a 4MB buffer, initialize it, and delete it. That program ran at about 6 GB/sec. Once I write to a file, I drop down to 370 MB/sec. Our top performance for I/O to one file has been about 400 MB/sec.

As for the question "Which versions are you using in servers and clients?": I don't know what command determines this. I suspect it is older since we are on Red Hat 5. I will ask.

best,
David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

David,

You note that you write a 6GB file. I suspect that your Linux systems have significantly more memory than 6GB, meaning your file will end up being cached in the system buffers. It won't matter how many OSTs you use: you probably are not measuring the speed to the OSTs but rather the memory copy speed. What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through MPI. Our cluster has high performance when I write to separate files, but when I use one file I see very little performance increase. As I understand it, our cluster defaults to one OST per file. There are many OSTs, though, which is how we get good performance when writing to multiple files. I have been using the command lfs setstripe to change the stripe count and block size. I can see that this works: when I do lfs getstripe, I see the output file is striped, but I'm getting very little I/O performance when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles? I am on Linux Red Hat 5. I'm not sure what version of Lustre this is.

I have skimmed through a 450-page PDF of Lustre documentation. I saw references to destructive testing one does in the beginning, but I'm not sure what I can do now. I think this is the first work we've done to get high performance when writing a single file, so I'm worried there is something buried in the Lustre configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I should check?

best,
David Schneider
Re: [lustre-discuss] problem getting high performance output to single file
On May 19, 2015, at 1:44 PM, Schneider, David A. david...@slac.stanford.edu wrote:

Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6GB of writing). When I ran the MPI job on one host (the hosts have 12 cores; by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that is limiting the throughput from a single client. Here are some things you could look into (although they might require more tinkering than you have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are running IP-over-IB. Can you configure the clients to use RDMA? (They would have nids like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version? Earlier Lustre versions used a single-threaded ptlrpcd to manage network traffic, but newer versions have a multi-threaded implementation. You may need to check compatibility with the Lustre version running on the servers, though.

3) Do you have checksums disabled? Try running "lctl get_param osc.*.checksums". If the values are "1", then checksums are enabled, which can slow down performance. You could try setting the value to "0" to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
Re: [lustre-discuss] problem getting high performance output to single file
Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6GB of writing). When I ran the MPI job on one host (the hosts have 12 cores; by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after a different rank.

I'm told that the hosts can handle a lot of I/O, but it seems there are some issues with getting that to work well. I believe we get good performance with different ranks on one host reading from different files. I'll look into tuning the MPI/HDF5 parameters now, with an eye toward designing my application to write from different hosts. My initial tests with MPI showed degraded performance when I used different hosts for the writing, but maybe there are some parameters that will help. I can try the OpenMPI forum at that point.

best,
David Schneider

From: Mohr Jr, Richard Frank (Rick Mohr) [rm...@utk.edu]
Sent: Tuesday, May 19, 2015 9:15 AM
To: Schneider, David A.
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

On May 19, 2015, at 11:40 AM, Schneider, David A. david...@slac.stanford.edu wrote:

When working from HDF5 and MPI, I have seen a number of references to tuning parameters; I haven't dug into this yet. I first want to make sure Lustre has high output performance at a basic level. I tried to write a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks; I write a 6GB file). Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles?

If you are looking for a simple shared-file test, you could try something like this:

1) Create a file with a stripe size of 1 GB and a stripe count of 6.

2) Write an MPI program where each process writes 1 GB of sequential data. Each process should first seek to (mpi_rank)*(1GB) and then write 1 GB. This ensures that all processes write to non-overlapping parts of the file.

3) Start the program running on 6 nodes (1 process per node).

In a scenario like that, you should effectively get file-per-process speeds even though you are writing to a shared file, because each process is writing to a different OST.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
Re: [lustre-discuss] problem getting high performance output to single file
Ah. I think I know what's going on here: In Lustre 2.x client versions prior to 2.6, only one process on a given client can write to a given file at a time, regardless of how the file is striped. So if you are writing to the same file, there will be little to no benefit to putting an extra process on the same node. A *single* process on a node could benefit, but not the split you've described. The details, which are essentially just that a pair of per-file locks are used by any individual process writing to a file, are here: https://jira.hpdd.intel.com/browse/LU-1669

On 5/19/15, 12:59 PM, Mohr Jr, Richard Frank (Rick Mohr) rm...@utk.edu wrote:

On May 19, 2015, at 1:44 PM, Schneider, David A. david...@slac.stanford.edu wrote:

Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6GB of writing). When I ran the MPI job on one host (the hosts have 12 cores; by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that is limiting the throughput from a single client. Here are some things you could look into (although they might require more tinkering than you have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are running IP-over-IB. Can you configure the clients to use RDMA? (They would have nids like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version? Earlier Lustre versions used a single-threaded ptlrpcd to manage network traffic, but newer versions have a multi-threaded implementation. You may need to check compatibility with the Lustre version running on the servers, though.

3) Do you have checksums disabled? Try running "lctl get_param osc.*.checksums". If the values are "1", then checksums are enabled, which can slow down performance. You could try setting the value to "0" to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu