Re: [OMPI users] file/process write speed is not scalable
I'm using OpenMPI v4.0.2. Is your problem similar to mine?

Thanks,
David

On Tue, Apr 14, 2020 at 7:33 AM Patrick Bégou via users <users@lists.open-mpi.org> wrote:
> Hi David,
>
> could you specify which version of OpenMPI you are using?
> I also have some parallel I/O trouble with one code but have not yet investigated.
>
> Thanks
>
> Patrick
Re: [OMPI users] file/process write speed is not scalable
Hi David,

could you specify which version of OpenMPI you are using? I also have some parallel I/O trouble with one code but have not yet investigated.

Thanks

Patrick

On 13/04/2020 at 17:11, Dong-In Kang via users wrote:
> Thank you for your suggestion.
> I am more concerned about the poor performance of the one MPI process/socket case.
> [...]
Re: [OMPI users] file/process write speed is not scalable
Thank you for your suggestion.

I am more concerned about the poor performance of the one-MPI-process-per-socket case. That model fits my real workload better. The performance I see is a lot worse than what the underlying hardware can support. The best case (all MPI processes on a single socket) is pretty good, about 80+% of the underlying hardware's speed. However, the one-MPI-process-per-socket model achieves only 30% of what I get with all MPI processes on a single socket. Both are doing the same thing: independent file writes. I used all the OSTs available.

As a reference point, I ran the same test on a ramdisk. In both cases the performance scales very well, and the results are close.

There seems to be extra overhead when multiple sockets are used for independent file I/O with Lustre. I don't know what causes that overhead.

Thanks,
David

On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:
> Note there could be some NUMA-IO effect, so I suggest you compare
> running every MPI task on socket 0, to running every MPI task on
> socket 1 and so on, and then compare to running one MPI task per
> socket.
> [...]
Re: [OMPI users] file/process write speed is not scalable
Note there could be some NUMA-IO effect, so I suggest you compare running every MPI task on socket 0, then every MPI task on socket 1, and so on, and then compare that to running one MPI task per socket.

Also, what performance do you measure?
- Is it in line with the filesystem/network expectation?
- Or is it much higher (in which case you are benchmarking the I/O cache)?

FWIW, I usually write files whose cumulative size is four times the node memory to avoid local caching effects (if you have a lot of RAM, that might take a while ...).

Keep in mind Lustre is also sensitive to the file layout. If you write one file per task, you likely want to use all the available OSTs, but no striping. If you want to write into a single file with 1MB blocks per MPI task, you likely want to stripe with 1MB blocks, and use the same number of OSTs as MPI tasks (so each MPI task ends up writing to its own OST).

Cheers,

Gilles

On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users <users@lists.open-mpi.org> wrote:
> Hi,
>
> I'm running the IOR benchmark on a big shared-memory machine with a Lustre file system.
> [...]
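[Editor's note: the two Lustre layouts described above can be set with `lfs setstripe`. This is only a sketch: the mount point `/mnt/lustre`, the directory/file names, and the task count of 8 are assumptions to adapt to your system.]

```shell
# One file per task: stripe count 1 ("no striping"); Lustre will
# round-robin the per-task files across all available OSTs.
# Setting the layout on a directory makes it the default for new
# files created inside it.
lfs setstripe -c 1 /mnt/lustre/ior_per_task_dir

# One shared file written in 1MB blocks per task: 1MB stripe size
# and a stripe count equal to the number of MPI tasks (assumed 8
# here), so each task's 1MB blocks land on its own OST.
lfs setstripe -S 1m -c 8 /mnt/lustre/ior_shared_file
```

`lfs getstripe <path>` can be used afterwards to verify the layout that was actually applied.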
[OMPI users] file/process write speed is not scalable
Hi,

I'm running the IOR benchmark on a big shared-memory machine with a Lustre file system. I set up IOR to use an independent file per process so that the aggregate bandwidth is maximized. I ran N MPI processes, where N < the number of cores in a socket. When I put those N MPI processes on a single socket, write performance is scalable. However, when I put those N MPI processes on N sockets (so, 1 MPI process/socket), performance does not scale, and stays flat beyond 4 MPI processes. I expected it to be as scalable as the case of N processes on a single socket. But it is not.

I think if each MPI process writes to an independent file, there should be no file locking among MPI processes. However, there seems to be some. Is there any way to avoid that locking or overhead? It may not be a file-lock issue, but I don't know the exact reason for the poor performance.

Any help will be appreciated.

David
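[Editor's note: the two placements compared above can be reproduced with Open MPI 4.x binding options. A sketch only: the task count, transfer/block sizes, and the output path are assumptions; adjust for your machine.]

```shell
# Pack 4 tasks onto one socket: mapping by core fills the cores of
# socket 0 first on most systems.
mpirun -np 4 --map-by core --bind-to core \
    ior -w -F -t 1m -b 4g -o /mnt/lustre/ior_test

# Spread the same 4 tasks one per socket (ppr = processes per
# resource), binding each to its socket.
mpirun -np 4 --map-by ppr:1:socket --bind-to socket \
    ior -w -F -t 1m -b 4g -o /mnt/lustre/ior_test
```

Here `-w` is a write-only test, `-F` selects file-per-process mode, `-t` is the transfer size, and `-b` is the per-process block size. `mpirun --report-bindings ...` can be added to confirm where the tasks actually landed.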