I'm using Open MPI v4.0.2. Is your problem similar to mine?

Thanks,
David
On Tue, Apr 14, 2020 at 7:33 AM Patrick Bégou via users <users@lists.open-mpi.org> wrote:

> Hi David,
>
> could you specify which version of Open MPI you are using?
> I also have some parallel I/O trouble with one code but have not yet
> investigated.
>
> Thanks
>
> Patrick
>
> On 13/04/2020 at 17:11, Dong-In Kang via users wrote:
>
> Thank you for your suggestion.
> I am more concerned about the poor performance of the one MPI
> process per socket case, since that model better fits my real workload.
> The performance I see is far worse than what the underlying hardware
> can support.
> The best case (all MPI processes on a single socket) is pretty good,
> reaching about 80+% of the underlying hardware's speed.
> However, the one-MPI-process-per-socket model achieves only 30% of
> what I get with all MPI processes on a single socket.
> Both runs do the same thing: independent file writes.
> I used all the OSTs available.
>
> As a reference point, I ran the same test on a ramdisk.
> In both cases the performance scales very well, and the two results
> are close.
>
> There seems to be extra overhead when multiple sockets are used for
> independent file I/O with Lustre.
> I don't know what causes that overhead.
>
> Thanks,
> David
>
> On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users
> <users@lists.open-mpi.org> wrote:
>
>> Note there could be some NUMA-I/O effect, so I suggest you compare
>> running all MPI tasks on socket 0, then all MPI tasks on socket 1,
>> and so on, and then compare that to running one MPI task per socket
>> (a binding sketch follows after this thread).
>>
>> Also, what performance do you measure?
>> - Is it in line with what the filesystem/network can deliver?
>> - Or is it much higher (in which case you are benchmarking the I/O
>> cache)?
>>
>> FWIW, I usually write files whose cumulative size is four times the
>> node memory to avoid local caching effects (see the sizing sketch
>> below; if you have a lot of RAM, that might take a while ...).
>>
>> Keep in mind Lustre is also sensitive to the file layout.
>> If you write one file per task, you likely want to use all the
>> available OSTs, but no striping.
>> If you want to write into a single file with 1 MB blocks per MPI
>> task, you likely want to stripe with 1 MB blocks and use the same
>> number of OSTs as MPI tasks (so each MPI task ends up writing to its
>> own OST); the lfs commands for both layouts are sketched below.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
>> <users@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> I'm running the IOR benchmark on a big shared-memory machine with a
>>> Lustre file system.
>>> I set up IOR to use an independent file per process so that the
>>> aggregate bandwidth is maximized.
>>> I ran N MPI processes, where N < the number of cores in a socket.
>>> When I put those N MPI processes on a single socket, the write
>>> performance scales.
>>> However, when I put those N MPI processes on N sockets (so, 1 MPI
>>> process per socket), the performance does not scale, and stays flat
>>> beyond 4 MPI processes.
>>> I expected it to be as scalable as the case of N processes on a
>>> single socket, but it is not.
>>>
>>> I think that if each MPI process writes to an independent file,
>>> there should be no file locking among MPI processes. However, there
>>> seems to be some. Is there any way to avoid that locking or
>>> overhead? It may not be a file locking issue, but I don't know the
>>> exact reason for the poor performance.
>>>
>>> Any help will be appreciated.
>>>
>>> David
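To make Gilles's placement comparison concrete, here is a minimal sketch using Open MPI 4.x mapping/binding options. The task count and IOR arguments are placeholders, not values from the thread:

    # one MPI task per socket, each bound to its own socket
    mpirun -np 4 --map-by ppr:1:socket --bind-to socket --report-bindings \
        ior -a POSIX -F -b 16g -t 1m

    # baseline: all tasks confined to socket 0 (CPUs and memory) via numactl;
    # --bind-to none lets numactl control the placement
    mpirun -np 4 --bind-to none numactl --cpunodebind=0 --membind=0 \
        ior -a POSIX -F -b 16g -t 1m

--report-bindings prints where each rank actually landed, which is worth checking before comparing the two sets of numbers.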
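For the caching point, a sketch of sizing the run so the aggregate write is about four times node memory, as Gilles suggests. The 256 GB node size and the output path are assumptions for illustration:

    # 8 tasks * 128 GB/task = 1 TB aggregate, ~4x a 256 GB node
    # -F: one file per process, -w: write only, -e: fsync before close
    mpirun -np 8 ior -a POSIX -F -w -e -b 128g -t 1m \
        -o /lustre/scratch/ior_test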
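And for the layout advice, the two Lustre cases would look roughly like this with lfs(1); the directory names are placeholders, and new files inherit the layout of the directory they are created in:

    # file per process: stripe count 1 (no striping); the set of files
    # still spreads across all available OSTs
    lfs setstripe -c 1 /lustre/scratch/ior_fpp

    # single shared file, 1 MB blocks from 8 tasks: 1 MB stripes over
    # 8 OSTs, so each task keeps writing to its own OST
    lfs setstripe -c 8 -S 1m /lustre/scratch/ior_shared

    # verify the resulting layout
    lfs getstripe /lustre/scratch/ior_fpp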