If possible, consider changing to a non-blocking write using MPI_FILE_WRITE_ALL_BEGIN so that work can continue while the file is being written to disk. You may need to make a copy of the data being written if the buffer will be reused for another purpose while the write is in progress.
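A minimal sketch of that pattern, shown here in C with the explicit-offset variant MPI_File_write_at_all_begin (the buffer names, count, and offset are hypothetical, not taken from the codes discussed below):

```c
/* Sketch: overlap computation with a collective write via the
 * split-collective interface. The shadow copy frees 'data' for
 * immediate reuse; per the MPI standard, 'shadow' itself must not
 * be touched until MPI_File_write_at_all_end returns. */
#include <mpi.h>
#include <string.h>

void overlapped_write(MPI_File fh, MPI_Offset offset,
                      double *data, double *shadow, int count)
{
    MPI_Status status;

    /* Copy so that 'data' can be reused right away. */
    memcpy(shadow, data, count * sizeof(double));

    /* Start the collective write; returns before the I/O completes. */
    MPI_File_write_at_all_begin(fh, offset, shadow, count, MPI_DOUBLE);

    /* ... compute the next step into 'data' here ... */

    /* Complete the write before 'shadow' is reused. */
    MPI_File_write_at_all_end(fh, shadow, &status);
}
```

Note that split collectives are only begin/end pairs; how much of the I/O actually overlaps with computation depends on the MPI implementation.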
On Mon, Apr 6, 2020, at 6:35 PM, Collin Strassburger via users wrote:
> Gilles,
>
> I just checked the write implementation of the Fortran codes with which I have noticed the issue; while they are compiled with MPI, they are not using MPI-IO. Thank you for pointing out the important distinction!
>
> Thanks,
> Collin
>
> **From:** users <users-boun...@lists.open-mpi.org> **On Behalf Of** Gilles GOUAILLARDET via users
> **Sent:** Monday, April 6, 2020 11:01 AM
> **To:** Open MPI Users <users@lists.open-mpi.org>
> **Cc:** Gilles GOUAILLARDET <gilles.gouaillar...@gmail.com>
> **Subject:** Re: [OMPI users] Slow collective MPI File IO
>
> Collin,
>
> Do you have any data to back up your claim? As long as MPI-IO is used to perform file I/O, the Fortran bindings overhead should be hardly noticeable.
>
> Cheers,
> Gilles
>
> On April 6, 2020, at 23:22, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
> Hello,
>
> Just a quick comment on this; is your code written in C/C++ or Fortran? Fortran has issues with writing at a decent speed regardless of MPI setup and as such should be avoided for file IO (yet I still occasionally see it implemented).
>
> Collin
>
> **From:** users <users-boun...@lists.open-mpi.org> **On Behalf Of** Dong-In Kang via users
> **Sent:** Monday, April 6, 2020 10:02 AM
> **To:** Gabriel, Edgar <egabr...@central.uh.edu>
> **Cc:** Dong-In Kang <dik...@gmail.com>; Open MPI Users <users@lists.open-mpi.org>
> **Subject:** Re: [OMPI users] Slow collective MPI File IO
>
> Thank you, Edgar, for the information.
>
> I also tried MPI_File_write_at_all(), but it usually makes the performance worse. My program is very simple: each MPI process writes a consecutive portion of a file, with no interleaving among the MPI processes. I think in this case I can use MPI_File_write_at().
>
> I tested the maximum bandwidth of the target devices, and it is at least a few times higher than what a single process can achieve. I tested it using the same program, but opening individual files with MPI_COMM_SELF. I tested a 32MB chunk, which didn't show noticeable changes, and also a 512MB chunk, with no noticeable difference either. (There are performance differences between the 32MB chunk and the 512MB chunk, but they still don't make multi-process file I/O exceed the performance of single-process file I/O.) The local disk is at least 2 times faster than a single MPI process can achieve; the ramdisk is at least 5 times faster; and Lustre, I know, is at least 7-8 times faster or more, depending on the configuration.
>
> Regarding the caching effect: that would apply to MPI_File_read(). I can see very high bandwidth with MPI_File_read(), which I believe comes from caching in RAM. But MPI_File_write(), I think, is not affected by caching. Also, I create a new file for each test and remove it at the end of the test.
>
> I may be making a very simple mistake, but I don't know what it is. From a few reports on the internet, I saw that MPI file I/O can achieve several times the speed of single-process file I/O when a faster file system such as Lustre is used. I started this experiment because I couldn't get a speedup on the Lustre file system, and then moved it to the ramdisk and the local disk because that rules out Lustre configuration issues.
>
> Any comments are welcome.
>
> David
>
> On Mon, Apr 6, 2020 at 9:03 AM Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>> Hi,
>>
>> A couple of comments. First, if you use MPI_File_write_at, this is usually not considered collective I/O, even if it is executed by multiple processes; MPI_File_write_at_all would be collective I/O.
>>
>> Second, MPI I/O cannot do ‘magic’, but is bound by the hardware that you are providing. If a single process is already able to saturate the bandwidth of your file system and hardware, you will not see performance improvements from multiple processes (with some minor exceptions due to caching effects, but only for smaller problem sizes; the larger the amount of data you try to write, the smaller the caching effects become in file I/O). So the first question you have to answer is: what is the sustained bandwidth of your hardware, and are you able to saturate it already with a single process? If you are using a single hard drive (or even 2 or 3 hard drives in a RAID 0 configuration), this is almost certainly the case.
>>
>> Lastly, the configuration parameters of your tests also play a major role. As a general rule, the larger the amount of data you provide per file I/O call, the better the performance will be; 1MB of data per call is probably on the smaller side. The ompio implementation of MPI I/O internally breaks large individual I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance reasons; large collective I/O operations (e.g. MPI_File_write_at_all) are broken into chunks of 32MB. This gives you some hints on the quantities of data that you would have to use for performance reasons.
>>
>> Along the same lines, one final comment. You say you did 1000 writes of 1MB each; for a single process that is about 1GB of data. Depending on how much main memory your PC has, this amount of data can still be cached on modern systems, and you might be getting an unrealistically high bandwidth value for the 1-process case that you are comparing against (it depends a bit on what your benchmark does, and whether you force flushing the data to disk inside your measurement loop).
>>
>> Hope this gives you some pointers on where to start looking.
>> Thanks
>> Edgar
>>
>> **From:** users <users-boun...@lists.open-mpi.org> **On Behalf Of** Dong-In Kang via users
>> **Sent:** Monday, April 6, 2020 7:14 AM
>> **To:** users@lists.open-mpi.org
>> **Cc:** Dong-In Kang <dik...@gmail.com>
>> **Subject:** [OMPI users] Slow collective MPI File IO
>>
>> Hi,
>>
>> I am running an MPI program in which N processes write to a single file on a single shared-memory machine. I'm using Open MPI v4.0.2. Each MPI process writes a 1MB chunk of data 1,000 times sequentially, and there is no overlap in the file between any two MPI processes. I ran the program for -np = {1, 2, 4, 8}, and I am seeing that the speed of the collective write to the file for -np = {2, 4, 8} never exceeds the speed of -np = {1}. I did the experiment with a few different file systems {local disk, ram disk, Lustre FS}, and for all of them I see similar results: the speed of the collective write to a single shared file never exceeds the speed of the single-MPI-process case. Any tips or suggestions?
>>
>> I used the MPI_File_write_at() routine with the proper offset for each MPI process. (I also tried the MPI_File_write_at_all() routine, which makes the performance worse as np gets bigger.)
>> Before writing, MPI_Barrier() is used. The start time is taken right after MPI_Barrier() using MPI_Wtime(), and the end time is taken right after another MPI_Barrier(). The speed of the collective write is calculated as (total data written to the file) / (time between the first MPI_Barrier() and the second MPI_Barrier()).
>>
>> Any ideas on how to increase the speed?
>>
>> Thanks,
>> David
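For reference, here is a minimal, self-contained sketch of the benchmark pattern David describes above. The file name is an assumption; the chunk size and iteration count are taken from the thread, and the MPI_File_sync call is an addition that follows Edgar's point about forcing data to disk inside the timed region:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)   /* 1 MB per write, as in the thread       */
#define NITER 1000            /* 1000 writes per rank, i.e. ~1 GB each  */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    /* Each rank owns a contiguous, non-overlapping region of the file. */
    for (int i = 0; i < NITER; i++) {
        MPI_Offset offset = ((MPI_Offset)rank * NITER + i) * CHUNK;
        MPI_File_write_at(fh, offset, buf, CHUNK, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    }
    MPI_File_sync(fh);  /* flush to disk inside the timed region */

    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate bandwidth: %.1f MB/s\n",
               (double)nprocs * NITER * CHUNK / (1048576.0 * (t1 - t0)));

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Without the MPI_File_sync, the reported number can reflect the page cache rather than the device, which is one way the single-process baseline ends up unrealistically high.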