Note there could be some NUMA-I/O effect, so I suggest you compare running every MPI task on socket 0, then every MPI task on socket 1, and so on, and then compare that to running one MPI task per socket.
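For instance, with Open MPI you could pin the ranks explicitly for each of those comparisons (a rough sketch; the rank count, the core-to-socket layout, and the IOR arguments are placeholders for whatever your node and test actually use):

  # (a) all 8 ranks on socket 0 (check the socket layout first with "numactl -H" or lstopo)
  mpirun -np 8 --bind-to none numactl --cpunodebind=0 --membind=0 ior -F -w -b 16g -t 1m

  # (b) same thing on socket 1, and so on for the other sockets
  mpirun -np 8 --bind-to none numactl --cpunodebind=1 --membind=1 ior -F -w -b 16g -t 1m

  # (c) one rank per socket (8 ranks here, so this assumes at least 8 sockets; lower -np otherwise)
  mpirun -np 8 --map-by ppr:1:socket --bind-to socket --report-bindings ior -F -w -b 16g -t 1m

--report-bindings (or the numactl wrapper) lets you confirm the ranks really ended up where you intended before you compare the bandwidth numbers.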
Also, what performance do you measure?
- Is it in line with what you expect from the filesystem/network?
- Or is it much higher (in which case you are benchmarking the I/O cache)?

FWIW, I usually write files whose cumulative size is four times the node memory to avoid local caching effects (if you have a lot of RAM, that can take a while ...).

Keep in mind Lustre is also sensitive to the file layout.
If you write one file per task, you likely want to use all the available OSTs, but no striping.
If you want to write into a single file with 1 MB blocks per MPI task, you likely want to stripe with 1 MB blocks and use the same number of OSTs as MPI tasks (so each MPI task ends up writing to its own OST). There are a couple of command sketches at the end of this message.

Cheers,

Gilles

On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users <users@lists.open-mpi.org> wrote:
>
> Hi,
>
> I'm running the IOR benchmark on a big shared-memory machine with a Lustre file system.
> I set up IOR to use an independent file per process so that the aggregate bandwidth is maximized.
> I ran N MPI processes, where N < the number of cores in a socket.
> When I put those N MPI processes on a single socket, the write performance scales.
> However, when I put those N MPI processes on N sockets (so, 1 MPI process per socket),
> the performance does not scale; it stays flat beyond 4 MPI processes.
> I expected it to be as scalable as the case of N processes on a single socket, but it is not.
>
> I think that if each MPI process writes to its own independent file, there should be no file
> locking among MPI processes. However, there seems to be some. Is there any way to avoid that
> locking or overhead? It may not be a file-lock issue, but I don't know the exact reason for
> the poor performance.
>
> Any help will be appreciated.
>
> David
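To put numbers on the "four times the node memory" rule of thumb, here is a sizing sketch, assuming a hypothetical node with 256 GB of RAM and 8 MPI tasks writing one file each (adjust the block size to your own memory size and task count):

  # 8 tasks x 128 GB per file = 1 TB aggregate, i.e. 4x the assumed 256 GB of node memory
  mpirun -np 8 ior -a POSIX -F -w -e -b 128g -t 1m -o /lustre/scratch/iortest

The -e flag makes IOR fsync before closing the files, which also helps keep the page cache out of the measurement.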
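And a sketch of the two Lustre layouts described above, using lfs setstripe on hypothetical target directories (newly created files inherit the directory's layout; older Lustre versions spell the stripe-size option -s instead of -S):

  # one file per task: stripe count 1 (no striping); Lustre places successive
  # files on different OSTs, so all OSTs still get used
  lfs setstripe -c 1 /lustre/scratch/ior_file_per_proc

  # single shared file written in 1 MB blocks by 8 tasks: 1 MB stripe size,
  # stripe count equal to the number of tasks
  lfs setstripe -S 1m -c 8 /lustre/scratch/ior_shared

  # verify the layout
  lfs getstripe /lustre/scratch/ior_shared

With matching stripe size and stripe count, each task's 1 MB blocks tend to land on the same OST (assuming the tasks' blocks are interleaved in the file), which is the layout suggested above.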