Re: [OMPI users] file/process write speed is not scalable

2020-04-14 Thread Dong-In Kang via users
I'm using Open MPI v4.0.2.
Is your problem similar to mine?

Thanks,
David


On Tue, Apr 14, 2020 at 7:33 AM Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> Hi David,
>
> could you specify which version of Open MPI you are using?
> I also have some parallel I/O trouble with one code but have not
> investigated it yet.
> Thanks
>
> Patrick
>
> On 13/04/2020 at 17:11, Dong-In Kang via users wrote:
>
>
>  Thank you for your suggestion.
> I am more concerned about the poor performance of the one MPI process per
> socket case.
> That model fits my real workload better.
> The performance that I see is much worse than what the underlying
> hardware can support.
> The best case (all MPI processes on a single socket) is pretty good, which
> is about 80+% of the underlying hardware's speed.
> However, the one MPI process per socket model achieves only 30% of what I
> get with all MPI processes on a single socket.
> Both are doing the same thing - independent file writes.
> I used all the OSTs available.
>
> As a reference point, I did the same test on a ramdisk.
> In both cases, the performance scales very well, and the two results
> are close.
>
> There seems to be extra overhead when multiple sockets are used for
> independent file I/O with Lustre.
> I don't know what causes that overhead.
>
> Thanks,
> David
>
>
> On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
>
>> Note there could be some NUMA-IO effect, so I suggest you compare
>> running every MPI task on socket 0, then every MPI task on
>> socket 1, and so on, and then compare with running one MPI task per
>> socket.
>>
>> Also, what performance do you measure?
>> - Is this something in line with the filesystem/network expectation?
>> - Or is this much higher (in which case, you are benchmarking the I/O
>> cache)?
>>
>> FWIW, I usually write files whose cumulative size is four times the
>> node memory to avoid local caching effects
>> (if you have a lot of RAM, that might take a while ...)
>>
>> Keep in mind Lustre is also sensitive to the file layout.
>> If you write one file per task, you likely want to use all the
>> available OSTs, but no striping.
>> If you want to write into a single file with 1MB blocks per MPI task,
>> you likely want to stripe with 1MB blocks,
>> and use the same number of OSTs as MPI tasks (so each MPI task ends
>> up writing to its own OST).
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
>> <users@lists.open-mpi.org> wrote:
>> >
>> > Hi,
>> >
>> > I'm running the IOR benchmark on a big shared-memory machine with a
>> > Lustre file system.
>> > I set up IOR to use an independent file per process so that the
>> > aggregate bandwidth is maximized.
>> > I ran N MPI processes where N < # of cores in a socket.
>> > When I put those N MPI processes on a single socket, the write
>> > performance scales well.
>> > However, when I put those N MPI processes on N sockets (so, 1 MPI
>> > process/socket),
>> > the performance does not scale, and stays the same beyond 4 MPI
>> > processes.
>> > I expected it to be as scalable as the case of N processes on a
>> > single socket.
>> > But it is not.
>> >
>> > I think that if each MPI process writes to its own independent file,
>> > there should be no file locking among MPI processes. However, there
>> > seems to be some. Is there any way to avoid that locking or overhead?
>> > It may not be a file locking issue, but I don't know the exact reason
>> > for the poor performance.
>> >
>> > Any help will be appreciated.
>> >
>> > David
>>
>
>


Re: [OMPI users] file/process write speed is not scalable

2020-04-14 Thread Patrick Bégou via users
Hi David,

could you specify which version of Open MPI you are using?
I also have some parallel I/O trouble with one code but have not
investigated it yet.
Thanks

Patrick

On 13/04/2020 at 17:11, Dong-In Kang via users wrote:
>
>  Thank you for your suggestion.
> I am more concerned about the poor performance of the one MPI
> process per socket case.
> That model fits my real workload better.
> The performance that I see is much worse than what the underlying
> hardware can support.
> The best case (all MPI processes on a single socket) is pretty good,
> which is about 80+% of the underlying hardware's speed.
> However, the one MPI process per socket model achieves only 30% of what
> I get with all MPI processes on a single socket.
> Both are doing the same thing - independent file writes.
> I used all the OSTs available.
>
> As a reference point, I did the same test on a ramdisk.
> In both cases, the performance scales very well, and the two
> results are close.
>
> There seems to be extra overhead when multiple sockets are used for
> independent file I/O with Lustre.
> I don't know what causes that overhead.
>
> Thanks,
> David
>
>
> On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users
> <users@lists.open-mpi.org> wrote:
>
> Note there could be some NUMA-IO effect, so I suggest you compare
> running every MPI task on socket 0, then every MPI task on
> socket 1, and so on, and then compare with running one MPI task per
> socket.
>
> Also, what performance do you measure?
> - Is this something in line with the filesystem/network expectation?
> - Or is this much higher (in which case, you are benchmarking
> the I/O cache)?
>
> FWIW, I usually write files whose cumulative size is four times the
> node memory to avoid local caching effects
> (if you have a lot of RAM, that might take a while ...)
>
> Keep in mind Lustre is also sensitive to the file layout.
> If you write one file per task, you likely want to use all the
> available OSTs, but no striping.
> If you want to write into a single file with 1MB blocks per MPI task,
> you likely want to stripe with 1MB blocks,
> and use the same number of OSTs as MPI tasks (so each MPI task ends
> up writing to its own OST).
>
> Cheers,
>
> Gilles
>
> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
> <users@lists.open-mpi.org> wrote:
> >
> > Hi,
> >
> > I'm running the IOR benchmark on a big shared-memory machine with
> > a Lustre file system.
> > I set up IOR to use an independent file per process so that the
> > aggregate bandwidth is maximized.
> > I ran N MPI processes where N < # of cores in a socket.
> > When I put those N MPI processes on a single socket, the write
> > performance scales well.
> > However, when I put those N MPI processes on N sockets (so, 1
> > MPI process/socket),
> > the performance does not scale, and stays the same beyond
> > 4 MPI processes.
> > I expected it to be as scalable as the case of N processes on
> > a single socket.
> > But it is not.
> >
> > I think that if each MPI process writes to its own independent file,
> > there should be no file locking among MPI processes. However, there
> > seems to be some. Is there any way to avoid that locking or
> > overhead? It may not be a file locking issue, but I don't know
> > the exact reason for the poor performance.
> >
> > Any help will be appreciated.
> >
> > David
>



Re: [OMPI users] file/process write speed is not scalable

2020-04-13 Thread Dong-In Kang via users
Thank you for your suggestion.
I am more concerned about the poor performance of the one MPI process per
socket case.
That model fits my real workload better.
The performance that I see is much worse than what the underlying hardware
can support.
The best case (all MPI processes on a single socket) is pretty good, which
is about 80+% of the underlying hardware's speed.
However, the one MPI process per socket model achieves only 30% of what I
get with all MPI processes on a single socket.
Both are doing the same thing - independent file writes.
I used all the OSTs available.
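
One way I double-check that (just a sketch; the path and file name pattern
are placeholders for my actual test directory) is to look at the layout of
the per-process files with lfs:

    # show the stripe count and the OST index each per-process file landed on
    lfs getstripe /lustre/scratch/iortest.*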

As a reference point, I did the same test on a ramdisk.
In both cases, the performance scales very well, and the two results are
close.

There seems to be extra overhead when multiple sockets are used for
independent file I/O with Lustre.
I don't know what causes that overhead.

Thanks,
David


On Thu, Apr 9, 2020 at 11:07 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Note there could be some NUMA-IO effect, so I suggest you compare
> running every MPI task on socket 0, then every MPI task on
> socket 1, and so on, and then compare with running one MPI task per
> socket.
>
> Also, what performance do you measure?
> - Is this something in line with the filesystem/network expectation?
> - Or is this much higher (in which case, you are benchmarking the I/O
> cache)?
>
> FWIW, I usually write files whose cumulative size is four times the
> node memory to avoid local caching effects
> (if you have a lot of RAM, that might take a while ...)
>
> Keep in mind Lustre is also sensitive to the file layout.
> If you write one file per task, you likely want to use all the
> available OSTs, but no striping.
> If you want to write into a single file with 1MB blocks per MPI task,
> you likely want to stripe with 1MB blocks,
> and use the same number of OSTs as MPI tasks (so each MPI task ends
> up writing to its own OST).
>
> Cheers,
>
> Gilles
>
> On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
> <users@lists.open-mpi.org> wrote:
> >
> > Hi,
> >
> > I'm running the IOR benchmark on a big shared-memory machine with a
> > Lustre file system.
> > I set up IOR to use an independent file per process so that the
> > aggregate bandwidth is maximized.
> > I ran N MPI processes where N < # of cores in a socket.
> > When I put those N MPI processes on a single socket, the write
> > performance scales well.
> > However, when I put those N MPI processes on N sockets (so, 1 MPI
> > process/socket),
> > the performance does not scale, and stays the same beyond 4 MPI
> > processes.
> > I expected it to be as scalable as the case of N processes on a
> > single socket.
> > But it is not.
> >
> > I think that if each MPI process writes to its own independent file,
> > there should be no file locking among MPI processes. However, there
> > seems to be some. Is there any way to avoid that locking or overhead?
> > It may not be a file locking issue, but I don't know the exact reason
> > for the poor performance.
> >
> > Any help will be appreciated.
> >
> > David
>


Re: [OMPI users] file/process write speed is not scalable

2020-04-09 Thread Gilles Gouaillardet via users
Note there could be some NUMA-IO effect, so I suggest you compare
running every MPI task on socket 0, then every MPI task on
socket 1, and so on, and then compare with running one MPI task per
socket.
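
Something along these lines should work with Open MPI 4.x mpirun (the task
count, the IOR sizes and the /lustre/scratch path are only placeholders):

    # all 8 tasks packed on one socket (the first socket fills up first)
    mpirun -np 8 --map-by ppr:8:socket --bind-to core \
        ior -F -w -b 16g -t 1m -o /lustre/scratch/iortest

    # the same 8 tasks forced onto socket 1 (NUMA node 1), memory included
    mpirun -np 8 --bind-to none numactl --cpunodebind=1 --membind=1 \
        ior -F -w -b 16g -t 1m -o /lustre/scratch/iortest

    # one task per socket
    mpirun -np 8 --map-by ppr:1:socket --bind-to core \
        ior -F -w -b 16g -t 1m -o /lustre/scratch/iortest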

Also, what performance do you measure?
- Is this something in line with the filesystem/network expectation?
- Or is this much higher (in which case, you are benchmarking the I/O cache)?

FWIW, I usually write files whose cumulative size is four times the
node memory to avoid local caching effects
(if you have a lot of RAM, that might take a while ...)
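
As a rough sketch (the task count is only a placeholder), the per-task block
size can be derived from /proc/meminfo so that the aggregate data written is
about four times the node RAM:

    nprocs=8                                                  # placeholder
    mem_mb=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
    blk_mb=$(( mem_mb * 4 / nprocs ))                         # per-task MiB
    echo "per-task IOR block size: -b ${blk_mb}m"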

Keep in mind Lustre is also sensitive to the file layout.
If you write one file per task, you likely want to use all the
available OSTs, but no striping.
If you want to write into a single file with 1MB blocks per MPI task,
you likely want to stripe with 1MB blocks,
and use the same number of OSTs as MPI tasks (so each MPI task ends
up writing to its own OST).
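
In lfs terms that would look roughly like this (directory and file names are
placeholders; -c is the stripe count, -S the stripe size, which is -s on
older Lustre releases):

    # one file per task: each file on a single OST,
    # files spread round-robin over all OSTs
    lfs setstripe -c 1 /lustre/scratch/per_proc_dir

    # one shared file: 1MB stripes, one OST per MPI task (8 tasks here)
    lfs setstripe -c 8 -S 1m /lustre/scratch/shared_file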

Cheers,

Gilles

On Fri, Apr 10, 2020 at 6:41 AM Dong-In Kang via users
<users@lists.open-mpi.org> wrote:
>
> Hi,
>
> I'm running the IOR benchmark on a big shared-memory machine with a Lustre
> file system.
> I set up IOR to use an independent file per process so that the aggregate
> bandwidth is maximized.
> I ran N MPI processes where N < # of cores in a socket.
> When I put those N MPI processes on a single socket, the write performance
> scales well.
> However, when I put those N MPI processes on N sockets (so, 1 MPI
> process/socket),
> the performance does not scale, and stays the same beyond 4 MPI
> processes.
> I expected it to be as scalable as the case of N processes on a single
> socket.
> But it is not.
>
> I think that if each MPI process writes to its own independent file, there
> should be no file locking among MPI processes. However, there seems to be
> some. Is there any way to avoid that locking or overhead? It may not be a
> file locking issue, but I don't know the exact reason for the poor performance.
>
> Any help will be appreciated.
>
> David


[OMPI users] file/process write speed is not scalable

2020-04-09 Thread Dong-In Kang via users
Hi,

I'm running the IOR benchmark on a big shared-memory machine with a Lustre
file system.
I set up IOR to use an independent file per process so that the aggregate
bandwidth is maximized.
I ran N MPI processes where N < # of cores in a socket.
When I put those N MPI processes on a single socket, the write performance
scales well.
However, when I put those N MPI processes on N sockets (so, 1 MPI
process/socket),
the performance does not scale, and stays the same beyond 4 MPI
processes.
I expected it to be as scalable as the case of N processes on a single
socket.
But it is not.
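
For reference, the kind of invocation I mean looks roughly like this (the
task count, the sizes and the output path are only placeholders):

    # -F = one file per process, -w = write test, -e = fsync before close,
    # -b = per-task block size, -t = transfer size
    mpirun -np 8 ior -a POSIX -F -w -e -b 16g -t 1m -o /lustre/scratch/iortest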

I think that if each MPI process writes to its own independent file, there
should be no file locking among MPI processes. However, there seems to be some.
Is there any way to avoid that locking or overhead? It may not be a file
locking issue, but I don't know the exact reason for the poor performance.

Any help will be appreciated.

David