If possible, consider changing to a non-blocking write using MPI_FILE_WRITE_ALL_BEGIN so that work can continue while the file is being written to disk. You may need to make a copy of the data being written if the buffer will be reused for another purpose while the write is in progress.
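A minimal sketch of that pattern, shown here in C with the explicit-offset variant MPI_File_write_at_all_begin (the buffer names, count, and offset are hypothetical, not taken from the codes discussed below):

```c
/* Sketch: overlap computation with a collective write via the
 * split-collective interface. The shadow copy frees 'data' for
 * immediate reuse; per the MPI standard, 'shadow' itself must not
 * be touched until MPI_File_write_at_all_end returns. */
#include <mpi.h>
#include <string.h>

void overlapped_write(MPI_File fh, MPI_Offset offset,
                      double *data, double *shadow, int count)
{
    MPI_Status status;

    /* Copy so that 'data' can be reused right away. */
    memcpy(shadow, data, count * sizeof(double));

    /* Start the collective write; returns before the I/O completes. */
    MPI_File_write_at_all_begin(fh, offset, shadow, count, MPI_DOUBLE);

    /* ... compute the next step into 'data' here ... */

    /* Complete the write before 'shadow' is reused. */
    MPI_File_write_at_all_end(fh, shadow, &status);
}
```

Note that split collectives are only begin/end pairs; how much of the I/O actually overlaps with computation depends on the MPI implementation.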
On Mon, Apr 6, 2020, at 6:35 PM, Collin Strassburger via users wrote:
> Gilles,
>
> I just checked the write implementation of the Fortran codes with which I have noticed the issue; while they are compiled with MPI, they are not using MPI-IO. Thank you for pointing out the important distinction!
>
> Thanks,
> Collin
>
> **From:** users <users-boun...@lists.open-mpi.org> **On Behalf Of** Gilles GOUAILLARDET via users
> **Sent:** Monday, April 6, 2020 11:01 AM
> **To:** Open MPI Users <users@lists.open-mpi.org>
> **Cc:** Gilles GOUAILLARDET <gilles.gouaillar...@gmail.com>
> **Subject:** Re: [OMPI users] Slow collective MPI File IO
>
> Collin,
>
> Do you have any data to back up your claim? As long as MPI-IO is used to perform file I/O, the Fortran bindings overhead should be hardly noticeable.
>
> Cheers,
> Gilles
>
> On April 6, 2020, at 23:22, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
> Hello,
>
> Just a quick comment on this; is your code written in C/C++ or Fortran? Fortran has issues with writing at a decent speed regardless of MPI setup and as such should be avoided for file IO (yet I still occasionally see it implemented).
>
> Collin
>
> **From:** users <users-boun...@lists.open-mpi.org> **On Behalf Of** Dong-In Kang via users
> **Sent:** Monday, April 6, 2020 10:02 AM
> **To:** Gabriel, Edgar <egabr...@central.uh.edu>
> **Cc:** Dong-In Kang <dik...@gmail.com>; Open MPI Users <users@lists.open-mpi.org>
> **Subject:** Re: [OMPI users] Slow collective MPI File IO
>
> Thank you, Edgar, for the information.
>
> I also tried MPI_File_write_at_all(), but it usually makes the performance worse. My program is very simple: each MPI process writes a consecutive portion of a file, with no interleaving among the MPI processes. I think in this case I can use MPI_File_write_at().
>
> I tested the maximum bandwidth of the target devices, and it is at least a few times higher than what a single process can achieve. I tested it using the same program, but opening individual files with MPI_COMM_SELF. I tested a 32MB chunk, which didn't show noticeable changes, and also a 512MB chunk, with no noticeable difference either. (There are performance differences between the 32MB chunk and the 512MB chunk, but they still don't make multi-process file I/O exceed the performance of single-process file I/O.) The local disk is at least 2 times faster than a single MPI process can achieve; the ramdisk is at least 5 times faster; and Lustre, I know, is at least 7-8 times faster or more, depending on the configuration.
>
> Regarding the caching effect: that would apply to MPI_File_read(). I can see very high bandwidth with MPI_File_read(), which I believe comes from caching in RAM. But MPI_File_write(), I think, is not affected by caching. Also, I create a new file for each test and remove it at the end of the test.
>
> I may be making a very simple mistake, but I don't know what it is. From a few reports on the internet, I saw that MPI file I/O can achieve several times the speed of single-process file I/O when a faster file system such as Lustre is used. I started this experiment because I couldn't get a speedup on the Lustre file system, and then moved it to the ramdisk and the local disk because that rules out Lustre configuration issues.
>
> Any comments are welcome.
>
> David
>
> On Mon, Apr 6, 2020 at 9:03 AM Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>> Hi,
>>
>> A couple of comments. First, if you use MPI_File_write_at, this is usually not considered collective I/O, even if it is executed by multiple processes; MPI_File_write_at_all would be collective I/O.
>>
>> Second, MPI I/O cannot do ‘magic’, but is bound by the hardware that you are providing. If a single process is already able to saturate the bandwidth of your file system and hardware, you will not see performance improvements from multiple processes (with some minor exceptions due to caching effects, but only for smaller problem sizes; the larger the amount of data you try to write, the smaller the caching effects become in file I/O). So the first question you have to answer is: what is the sustained bandwidth of your hardware, and are you able to saturate it already with a single process? If you are using a single hard drive (or even 2 or 3 hard drives in a RAID 0 configuration), this is almost certainly the case.
>>
>> Lastly, the configuration parameters of your tests also play a major role. As a general rule, the larger the amount of data you provide per file I/O call, the better the performance will be; 1MB of data per call is probably on the smaller side. The ompio implementation of MPI I/O internally breaks large individual I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance reasons; large collective I/O operations (e.g. MPI_File_write_at_all) are broken into chunks of 32MB. This gives you some hints on the quantities of data that you would have to use for performance reasons.
>>
>> Along the same lines, one final comment. You say you did 1000 writes of 1MB each; for a single process that is about 1GB of data. Depending on how much main memory your PC has, this amount of data can still be cached on modern systems, and you might be getting an unrealistically high bandwidth value for the 1-process case that you are comparing against (it depends a bit on what your benchmark does, and whether you force flushing the data to disk inside your measurement loop).
>>
>> Hope this gives you some pointers on where to start looking.
>> Thanks
>> Edgar
>>
>> **From:** users <users-boun...@lists.open-mpi.org> **On Behalf Of** Dong-In Kang via users
>> **Sent:** Monday, April 6, 2020 7:14 AM
>> **To:** users@lists.open-mpi.org
>> **Cc:** Dong-In Kang <dik...@gmail.com>
>> **Subject:** [OMPI users] Slow collective MPI File IO
>>
>> Hi,
>>
>> I am running an MPI program in which N processes write to a single file on a single shared-memory machine. I'm using Open MPI v4.0.2. Each MPI process writes a 1MB chunk of data 1,000 times sequentially, and there is no overlap in the file between any two MPI processes. I ran the program for -np = {1, 2, 4, 8}, and I am seeing that the speed of the collective write to the file for -np = {2, 4, 8} never exceeds the speed of -np = {1}. I did the experiment with a few different file systems {local disk, ram disk, Lustre FS}, and for all of them I see similar results: the speed of the collective write to a single shared file never exceeds the speed of the single-MPI-process case. Any tips or suggestions?
>>
>> I used the MPI_File_write_at() routine with the proper offset for each MPI process. (I also tried the MPI_File_write_at_all() routine, which makes the performance worse as np gets bigger.)
>> Before writing, MPI_Barrier() is used. The start time is taken right after MPI_Barrier() using MPI_Wtime(), and the end time is taken right after another MPI_Barrier(). The speed of the collective write is calculated as (total data written to the file) / (time between the first MPI_Barrier() and the second MPI_Barrier()).
>>
>> Any ideas on how to increase the speed?
>>
>> Thanks,
>> David
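For reference, here is a minimal, self-contained sketch of the benchmark pattern David describes above. The file name is an assumption; the chunk size and iteration count are taken from the thread, and the MPI_File_sync call is an addition that follows Edgar's point about forcing data to disk inside the timed region:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)   /* 1 MB per write, as in the thread       */
#define NITER 1000            /* 1000 writes per rank, i.e. ~1 GB each  */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    /* Each rank owns a contiguous, non-overlapping region of the file. */
    for (int i = 0; i < NITER; i++) {
        MPI_Offset offset = ((MPI_Offset)rank * NITER + i) * CHUNK;
        MPI_File_write_at(fh, offset, buf, CHUNK, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    }
    MPI_File_sync(fh);  /* flush to disk inside the timed region */

    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate bandwidth: %.1f MB/s\n",
               (double)nprocs * NITER * CHUNK / (1048576.0 * (t1 - t0)));

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Without the MPI_File_sync, the reported number can reflect the page cache rather than the device, which is one way the single-process baseline ends up unrealistically high.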