Re: [Gluster-devel] I/O performance

2019-02-13 Thread Xavi Hernandez
Here are the results of the last run:
https://docs.google.com/spreadsheets/d/19JqvuFKZxKifgrhLF-5-bgemYj8XKldUox1QwsmGj2k/edit?usp=sharing

Each test has been run with a rough approximation of the best configuration
I've found (in terms of the number of client and brick threads), but I haven't
done an exhaustive search for the best configuration in each case.

The "fio rand write" test seems to have a big regression. An initial check
of the data shows that 2 of the 5 runs have taken > 50% more time. I'll try
to check why.

Many of the tests show very high disk utilization, so comparisons may not
be accurate. In any case, it's clear that we need a method to automatically
adjust the number of worker threads to the given load to make this useful.
Without that, it's virtually impossible to find a fixed number of threads
that works well in all cases. I'm currently working on this.
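
Just to illustrate the direction I'm exploring (this is only a sketch; the
names and thresholds below are invented, not code that exists today):

/* Hypothetical load-based auto-tuner for the global thread pool, called
 * periodically (e.g. once per second) by a monitor thread. */
#include <pthread.h>
#include <stdint.h>

#define MIN_THREADS  4
#define MAX_THREADS 64

typedef struct {
    pthread_mutex_t lock;
    uint32_t active;               /* threads currently executing a request */
    uint32_t total;                /* threads currently alive in the pool   */
    uint64_t queued;               /* requests waiting for a free thread    */
} pool_stats_t;

static void
pool_autotune(pool_stats_t *st)
{
    pthread_mutex_lock(&st->lock);

    /* Requests are piling up and every thread is busy: grow the pool. */
    if (st->queued > st->total && st->total < MAX_THREADS)
        st->total++;               /* in a real pool: spawn one worker      */
    /* A good fraction of the threads sit idle: shrink gradually. */
    else if (st->active * 2 < st->total && st->total > MIN_THREADS)
        st->total--;               /* in a real pool: let one worker exit   */

    pthread_mutex_unlock(&st->lock);
}

The real heuristic will probably need to look at disk utilization and latency
too, not just queue length, but the general shape would be similar.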

Xavi

On Wed, Feb 13, 2019 at 11:34 AM Xavi Hernandez 
wrote:

> On Tue, Feb 12, 2019 at 1:30 AM Vijay Bellur  wrote:
>
>>
>>
>> On Tue, Feb 5, 2019 at 10:57 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah <
>>> pguru...@redhat.com> wrote:
>>>


 On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:

> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez 
> wrote:
>
>> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
>> pguru...@redhat.com> wrote:
>>
>>> Can the threads be categorised to do certain kinds of fops?
>>>
>>
>> Could be, but creating multiple thread groups for different tasks is
>> generally bad because many times you end up with lots of idle threads 
>> which
>> waste resources and could increase contention. I think we should only
>> differentiate threads if it's absolutely necessary.
>>
>>
>>> Read/write affinitise to certain set of threads, the other metadata
>>> fops to other set of threads. So we limit the read/write threads and not
>>> the metadata threads? Also if aio is enabled in the backend the threads
>>> will not be blocked on disk IO right?
>>>
>>
>> If we don't block the thread but we don't prevent more requests from going
>> to the disk, then we'll probably have the same problem. Anyway, I'll try 
>> to
>> run some tests with AIO to see if anything changes.
>>
>
> I've run some simple tests with AIO enabled and results are not good.
> A simple dd takes >25% more time. Multiple parallel dd take 35% more time
> to complete.
>


 Thank you. That is strange! I had a few questions: what tests are you
 running for measuring the io-threads performance (not particularly AIO)? Is
 it dd from multiple clients?

>>>
>>> Yes, it's a bit strange. What I see is that many threads from the thread
>>> pool are active but using very little CPU. I also see an AIO thread for
>>> each brick, but its CPU usage is not big either. Wait time is always 0 (I
>>> think this is a side effect of AIO activity). However system load grows
>>> very high. I've seen around 50, while on the normal test without AIO it
>>> stays around 20-25.
>>>
>>> Right now I'm running the tests on a single machine (no real network
>>> communication) using an NVMe disk as storage. I use a single mount point.
>>> The tests I'm running are these:
>>>
>>>- Single dd, 128 GiB, blocks of 1MiB
>>>- 16 parallel dd, 8 GiB per dd, blocks of 1MiB
>>>- fio in sequential write mode, direct I/O, blocks of 128k, 16
>>>threads, 8GiB per file
>>>- fio in sequential read mode, direct I/O, blocks of 128k, 16
>>>threads, 8GiB per file
>>>- fio in random write mode, direct I/O, blocks of 128k, 16 threads,
>>>8GiB per file
>>>- fio in random read mode, direct I/O, blocks of 128k, 16 threads,
>>>8GiB per file
>>>- smallfile create, 16 threads, 256 files per thread, 32 MiB per
>>>file (with one brick down, for the following test)
>>>- self-heal of an entire brick (from the previous smallfile test)
>>>- pgbench init phase with scale 100
>>>
>>> I run all these tests for a replica 3 volume and a disperse 4+2 volume.
>>>
>>
>>
>> Are these performance results available somewhere? I am quite curious to
>> understand the performance gains on NVMe!
>>
>
> I'm updating test results with the latest build. I'll report it here once
> it's complete.
>
> Xavi
>
>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-11 Thread Vijay Bellur
On Tue, Feb 5, 2019 at 10:57 PM Xavi Hernandez 
wrote:

> On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah 
> wrote:
>
>>
>>
>> On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:
>>
>>> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez 
>>> wrote:
>>>
 On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
 pguru...@redhat.com> wrote:

> Can the threads be categorised to do certain kinds of fops?
>

 Could be, but creating multiple thread groups for different tasks is
 generally bad because many times you end up with lots of idle threads which
 waste resources and could increase contention. I think we should only
 differentiate threads if it's absolutely necessary.


> Read/write affinitise to certain set of threads, the other metadata
> fops to other set of threads. So we limit the read/write threads and not
> the metadata threads? Also if aio is enabled in the backend the threads
> will not be blocked on disk IO right?
>

 If we don't block the thread but we don't prevent more requests from going
 to the disk, then we'll probably have the same problem. Anyway, I'll try to
 run some tests with AIO to see if anything changes.

>>>
>>> I've run some simple tests with AIO enabled and results are not good. A
>>> simple dd takes >25% more time. Multiple parallel dd take 35% more time to
>>> complete.
>>>
>>
>>
>> Thank you. That is strange! I had a few questions: what tests are you running
>> for measuring the io-threads performance (not particularly AIO)? Is it dd
>> from multiple clients?
>>
>
> Yes, it's a bit strange. What I see is that many threads from the thread
> pool are active but using very little CPU. I also see an AIO thread for
> each brick, but its CPU usage is not big either. Wait time is always 0 (I
> think this is a side effect of AIO activity). However system load grows
> very high. I've seen around 50, while on the normal test without AIO it
> stays around 20-25.
>
> Right now I'm running the tests on a single machine (no real network
> communication) using an NVMe disk as storage. I use a single mount point.
> The tests I'm running are these:
>
>- Single dd, 128 GiB, blocks of 1MiB
>- 16 parallel dd, 8 GiB per dd, blocks of 1MiB
>- fio in sequential write mode, direct I/O, blocks of 128k, 16
>threads, 8GiB per file
>- fio in sequential read mode, direct I/O, blocks of 128k, 16 threads,
>8GiB per file
>- fio in random write mode, direct I/O, blocks of 128k, 16 threads,
>8GiB per file
>- fio in random read mode, direct I/O, blocks of 128k, 16 threads,
>8GiB per file
>- smallfile create, 16 threads, 256 files per thread, 32 MiB per file
>(with one brick down, for the following test)
>- self-heal of an entire brick (from the previous smallfile test)
>- pgbench init phase with scale 100
>
> I run all these tests for a replica 3 volume and a disperse 4+2 volume.
>


Are these performance results available somewhere? I am quite curious to
understand the performance gains on NVMe!

Thanks,
Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-06 Thread Poornima Gurusiddaiah
Thank you for all the detailed explanation. If it's the disk that is
saturating, then if we run some of the above-mentioned tests (with multiple
threads) on plain XFS, we should hit the same saturation, right? I will try
out some tests; this is interesting.

Thanks,
Poornima

On Wed, Feb 6, 2019 at 12:27 PM Xavi Hernandez 
wrote:

> On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah 
> wrote:
>
>>
>>
>> On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:
>>
>>> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez 
>>> wrote:
>>>
 On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
 pguru...@redhat.com> wrote:

> Can the threads be categorised to do certain kinds of fops?
>

 Could be, but creating multiple thread groups for different tasks is
 generally bad because many times you end up with lots of idle threads which
 waste resources and could increase contention. I think we should only
 differentiate threads if it's absolutely necessary.


> Read/write affinitise to certain set of threads, the other metadata
> fops to other set of threads. So we limit the read/write threads and not
> the metadata threads? Also if aio is enabled in the backend the threads
> will not be blocked on disk IO right?
>

 If we don't block the thread but we don't prevent more requests from going
 to the disk, then we'll probably have the same problem. Anyway, I'll try to
 run some tests with AIO to see if anything changes.

>>>
>>> I've run some simple tests with AIO enabled and results are not good. A
>>> simple dd takes >25% more time. Multiple parallel dd take 35% more time to
>>> complete.
>>>
>>
>>
>> Thank you. That is strange! I had a few questions: what tests are you running
>> for measuring the io-threads performance (not particularly AIO)? Is it dd
>> from multiple clients?
>>
>
> Yes, it's a bit strange. What I see is that many threads from the thread
> pool are active but using very little CPU. I also see an AIO thread for
> each brick, but its CPU usage is not big either. Wait time is always 0 (I
> think this is a side effect of AIO activity). However system load grows
> very high. I've seen around 50, while on the normal test without AIO it
> stays around 20-25.
>
> Right now I'm running the tests on a single machine (no real network
> communication) using an NVMe disk as storage. I use a single mount point.
> The tests I'm running are these:
>
>- Single dd, 128 GiB, blocks of 1MiB
>- 16 parallel dd, 8 GiB per dd, blocks of 1MiB
>- fio in sequential write mode, direct I/O, blocks of 128k, 16
>threads, 8GiB per file
>- fio in sequential read mode, direct I/O, blocks of 128k, 16 threads,
>8GiB per file
>- fio in random write mode, direct I/O, blocks of 128k, 16 threads,
>8GiB per file
>- fio in random read mode, direct I/O, blocks of 128k, 16 threads,
>8GiB per file
>- smallfile create, 16 threads, 256 files per thread, 32 MiB per file
>(with one brick down, for the following test)
>- self-heal of an entire brick (from the previous smallfile test)
>- pgbench init phase with scale 100
>
> I run all these tests for a replica 3 volume and a disperse 4+2 volume.
>
> Xavi
>
>
>> Regards,
>> Poornima
>>
>>
>>> Xavi
>>>
>>>
 All this is based on the assumption that large number of parallel read
> writes make the disk perf bad but not the large number of dentry and
> metadata ops. Is that true?
>

 It depends. If metadata is not cached, it's as bad as a read or write
 since it requires a disk access (a clear example of this is the bad
 performance of 'ls' in cold cache, which is basically metadata reads). In
 fact, cached data reads are also very fast, and data writes could go to the
 cache and be updated later in background, so I think the important point is
 if things are cached or not, instead of if they are data or metadata. Since
 we don't have this information from the user side, it's hard to tell what's
 better. My opinion is that we shouldn't differentiate requests of
 data/metadata. If metadata requests happen to be faster, then that thread
 will be able to handle other requests immediately, which seems good enough.

 However there's one thing that I would do. I would differentiate reads
 (data or metadata) from writes. Normally writes come from cached
 information that is flushed to disk at some point, so this normally happens
 in the background. But reads tend to be in foreground, meaning that someone
 (user or application) is waiting for it. So I would give preference to
 reads over writes. To do so effectively, we need to not saturate the
 backend, otherwise when we need to send a read, it will still need to wait
 for all pending requests to complete. If disks are not saturated, we can
 have the answer to the read quite fast, and then continue processing the
 remaining writes.

 Anyway, I may 

Re: [Gluster-devel] I/O performance

2019-02-05 Thread Xavi Hernandez
On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah 
wrote:

>
>
> On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:
>> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez 
>> wrote:
>>
>>> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
>>> pguru...@redhat.com> wrote:
>>>
 Can the threads be categorised to do certain kinds of fops?

>>>
>>> Could be, but creating multiple thread groups for different tasks is
>>> generally bad because many times you end up with lots of idle threads which
>>> waste resources and could increase contention. I think we should only
>>> differentiate threads if it's absolutely necessary.
>>>
>>>
 Read/write affinitise to certain set of threads, the other metadata
 fops to other set of threads. So we limit the read/write threads and not
 the metadata threads? Also if aio is enabled in the backend the threads
 will not be blocked on disk IO right?

>>>
>>> If we don't block the thread but we don't prevent more requests from going to
>>> the disk, then we'll probably have the same problem. Anyway, I'll try to
>>> run some tests with AIO to see if anything changes.
>>>
>>
>> I've run some simple tests with AIO enabled and results are not good. A
>> simple dd takes >25% more time. Multiple parallel dd take 35% more time to
>> complete.
>>
>
>
> Thank you. That is strange! I had a few questions: what tests are you running
> for measuring the io-threads performance (not particularly AIO)? Is it dd
> from multiple clients?
>

Yes, it's a bit strange. What I see is that many threads from the thread
pool are active but using very little CPU. I also see an AIO thread for
each brick, but its CPU usage is not big either. Wait time is always 0 (I
think this is a side effect of AIO activity). However, system load grows
very high. I've seen around 50, while on the normal test without AIO it
stays around 20-25.

Right now I'm running the tests on a single machine (no real network
communication) using an NVMe disk as storage. I use a single mount point.
The tests I'm running are these:

   - Single dd, 128 GiB, blocks of 1MiB
   - 16 parallel dd, 8 GiB per dd, blocks of 1MiB
   - fio in sequential write mode, direct I/O, blocks of 128k, 16 threads,
   8GiB per file
   - fio in sequential read mode, direct I/O, blocks of 128k, 16 threads,
   8GiB per file
   - fio in random write mode, direct I/O, blocks of 128k, 16 threads, 8GiB
   per file
   - fio in random read mode, direct I/O, blocks of 128k, 16 threads, 8GiB
   per file
   - smallfile create, 16 threads, 256 files per thread, 32 MiB per file
   (with one brick down, for the following test)
   - self-heal of an entire brick (from the previous smallfile test)
   - pgbench init phase with scale 100

I run all these tests for a replica 3 volume and a disperse 4+2 volume.

Xavi


> Regards,
> Poornima
>
>
>> Xavi
>>
>>
>>> All this is based on the assumption that large number of parallel read
 writes make the disk perf bad but not the large number of dentry and
 metadata ops. Is that true?

>>>
>>> It depends. If metadata is not cached, it's as bad as a read or write
>>> since it requires a disk access (a clear example of this is the bad
>>> performance of 'ls' in cold cache, which is basically metadata reads). In
>>> fact, cached data reads are also very fast, and data writes could go to the
>>> cache and be updated later in background, so I think the important point is
>>> if things are cached or not, instead of if they are data or metadata. Since
>>> we don't have this information from the user side, it's hard to tell what's
>>> better. My opinion is that we shouldn't differentiate requests of
>>> data/metadata. If metadata requests happen to be faster, then that thread
>>> will be able to handle other requests immediately, which seems good enough.
>>>
>>> However there's one thing that I would do. I would differentiate reads
>>> (data or metadata) from writes. Normally writes come from cached
>>> information that is flushed to disk at some point, so this normally happens
>>> in the background. But reads tend to be in foreground, meaning that someone
>>> (user or application) is waiting for it. So I would give preference to
>>> reads over writes. To do so effectively, we need to not saturate the
>>> backend, otherwise when we need to send a read, it will still need to wait
>>> for all pending requests to complete. If disks are not saturated, we can
>>> have the answer to the read quite fast, and then continue processing the
>>> remaining writes.
>>>
>>> Anyway, I may be wrong, since all these things depend on too many
>>> factors. I haven't done any specific tests about this. It's more like a
>>> brainstorming. As soon as I can I would like to experiment with this and
>>> get some empirical data.
>>>
>>> Xavi
>>>
>>>
 Thanks,
 Poornima


 On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:
> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
> > Perhaps we could throttle both 

Re: [Gluster-devel] I/O performance

2019-02-05 Thread Poornima Gurusiddaiah
On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:

> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez
> wrote:
>
>> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah 
>> wrote:
>>
>>> Can the threads be categorised to do certain kinds of fops?
>>>
>>
>> Could be, but creating multiple thread groups for different tasks is
>> generally bad because many times you end up with lots of idle threads which
>> waste resources and could increase contention. I think we should only
>> differentiate threads if it's absolutely necessary.
>>
>>
>>> Read/write affinitise to certain set of threads, the other metadata fops
>>> to other set of threads. So we limit the read/write threads and not the
>>> metadata threads? Also if aio is enabled in the backend the threads will
>>> not be blocked on disk IO right?
>>>
>>
>> If we don't block the thread but we don't prevent more requests from going to
>> the disk, then we'll probably have the same problem. Anyway, I'll try to
>> run some tests with AIO to see if anything changes.
>>
>
> I've run some simple tests with AIO enabled and results are not good. A
> simple dd takes >25% more time. Multiple parallel dd take 35% more time to
> complete.
>


Thank you. That is strange! I had a few questions: what tests are you running
for measuring the io-threads performance (not particularly AIO)? Is it dd
from multiple clients?

Regards,
Poornima


> Xavi
>
>
>> All this is based on the assumption that large number of parallel read
>>> writes make the disk perf bad but not the large number of dentry and
>>> metadata ops. Is that true?
>>>
>>
>> It depends. If metadata is not cached, it's as bad as a read or write
>> since it requires a disk access (a clear example of this is the bad
>> performance of 'ls' in cold cache, which is basically metadata reads). In
>> fact, cached data reads are also very fast, and data writes could go to the
>> cache and be updated later in background, so I think the important point is
>> if things are cached or not, instead of if they are data or metadata. Since
>> we don't have this information from the user side, it's hard to tell what's
>> better. My opinion is that we shouldn't differentiate requests of
>> data/metadata. If metadata requests happen to be faster, then that thread
>> will be able to handle other requests immediately, which seems good enough.
>>
>> However there's one thing that I would do. I would differentiate reads
>> (data or metadata) from writes. Normally writes come from cached
>> information that is flushed to disk at some point, so this normally happens
>> in the background. But reads tend to be in foreground, meaning that someone
>> (user or application) is waiting for it. So I would give preference to
>> reads over writes. To do so effectively, we need to not saturate the
>> backend, otherwise when we need to send a read, it will still need to wait
>> for all pending requests to complete. If disks are not saturated, we can
>> have the answer to the read quite fast, and then continue processing the
>> remaining writes.
>>
>> Anyway, I may be wrong, since all these things depend on too many
>> factors. I haven't done any specific tests about this. It's more like a
>> brainstorming. As soon as I can I would like to experiment with this and
>> get some empirical data.
>>
>> Xavi
>>
>>
>>> Thanks,
>>> Poornima
>>>
>>>
>>> On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:
 On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
 > Perhaps we could throttle both aspects - number of I/O requests per
 disk

 While there it would be nice to detect and report  a disk with lower
 than
 peer performance: that happen sometimes when a disk is dying, and last
 time I was hit by that performance problem, I had a hard time finding
 the culprit.

 --
 Emmanuel Dreyfus
 m...@netbsd.org
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 https://lists.gluster.org/mailman/listinfo/gluster-devel

>>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-05 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez  wrote:

> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah 
> wrote:
>
>> Can the threads be categorised to do certain kinds of fops?
>>
>
> Could be, but creating multiple thread groups for different tasks is
> generally bad because many times you end up with lots of idle threads which
> waste resources and could increase contention. I think we should only
> differentiate threads if it's absolutely necessary.
>
>
>> Read/write affinitise to certain set of threads, the other metadata fops
>> to other set of threads. So we limit the read/write threads and not the
>> metadata threads? Also if aio is enabled in the backend the threads will
>> not be blocked on disk IO right?
>>
>
> If we don't block the thread but we don't prevent more requests from going to
> the disk, then we'll probably have the same problem. Anyway, I'll try to
> run some tests with AIO to see if anything changes.
>

I've run some simple tests with AIO enabled and results are not good. A
simple dd takes >25% more time. Multiple parallel dd take 35% more time to
complete.

Xavi


> All this is based on the assumption that large number of parallel read
>> writes make the disk perf bad but not the large number of dentry and
>> metadata ops. Is that true?
>>
>
> It depends. If metadata is not cached, it's as bad as a read or write
> since it requires a disk access (a clear example of this is the bad
> performance of 'ls' in cold cache, which is basically metadata reads). In
> fact, cached data reads are also very fast, and data writes could go to the
> cache and be updated later in background, so I think the important point is
> if things are cached or not, instead of if they are data or metadata. Since
> we don't have this information from the user side, it's hard to tell what's
> better. My opinion is that we shouldn't differentiate requests of
> data/metadata. If metadata requests happen to be faster, then that thread
> will be able to handle other requests immediately, which seems good enough.
>
> However there's one thing that I would do. I would differentiate reads
> (data or metadata) from writes. Normally writes come from cached
> information that is flushed to disk at some point, so this normally happens
> in the background. But reads tend to be in foreground, meaning that someone
> (user or application) is waiting for it. So I would give preference to
> reads over writes. To do so effectively, we need to not saturate the
> backend, otherwise when we need to send a read, it will still need to wait
> for all pending requests to complete. If disks are not saturated, we can
> have the answer to the read quite fast, and then continue processing the
> remaining writes.
>
> Anyway, I may be wrong, since all these things depend on too many factors.
> I haven't done any specific tests about this. It's more like a
> brainstorming. As soon as I can I would like to experiment with this and
> get some empirical data.
>
> Xavi
>
>
>> Thanks,
>> Poornima
>>
>>
>> On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:
>>> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
>>> > Perhaps we could throttle both aspects - number of I/O requests per
>>> disk
>>>
>>> While there it would be nice to detect and report  a disk with lower than
>>> peer performance: that happen sometimes when a disk is dying, and last
>>> time I was hit by that performance problem, I had a hard time finding
>>> the culprit.
>>>
>>> --
>>> Emmanuel Dreyfus
>>> m...@netbsd.org
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-01 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah 
wrote:

> Can the threads be categorised to do certain kinds of fops?
>

Could be, but creating multiple thread groups for different tasks is
generally bad because many times you end up with lots of idle threads which
waste resources and could increase contention. I think we should only
differentiate threads if it's absolutely necessary.


> Read/write affinitise to certain set of threads, the other metadata fops
> to other set of threads. So we limit the read/write threads and not the
> metadata threads? Also if aio is enabled in the backend the threads will
> not be blocked on disk IO right?
>

If we don't block the thread but we don't prevent more requests from going to
the disk, then we'll probably have the same problem. Anyway, I'll try to
run some tests with AIO to see if anything changes.

> All this is based on the assumption that large number of parallel read
> writes make the disk perf bad but not the large number of dentry and
> metadata ops. Is that true?
>

It depends. If metadata is not cached, it's as bad as a read or write since
it requires a disk access (a clear example of this is the bad performance
of 'ls' with a cold cache, which is basically metadata reads). In fact,
cached data reads are also very fast, and data writes could go to the cache
and be updated later in the background, so I think the important point is
whether things are cached or not, rather than whether they are data or
metadata. Since we don't have this information from the user side, it's
hard to tell what's better. My opinion is that we shouldn't differentiate
between data and metadata requests. If metadata requests happen to be
faster, then that thread will be able to handle other requests immediately,
which seems good enough.

However, there's one thing that I would do: I would differentiate reads
(data or metadata) from writes. Normally writes come from cached
information that is flushed to disk at some point, so this usually happens
in the background. But reads tend to be in the foreground, meaning that
someone (a user or an application) is waiting for them. So I would give
preference to reads over writes. To do this effectively, we need to avoid
saturating the backend; otherwise, when we need to send a read, it will
still have to wait for all pending requests to complete. If the disks are
not saturated, we can get the answer to the read quite fast, and then
continue processing the remaining writes.
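
As a rough sketch of what I have in mind (purely illustrative; the names,
structures and the depth value below are made up and don't correspond to
the current io-threads code):

/* Per-brick dispatcher: keep the backend below saturation (bounded
 * in-flight requests) and serve queued reads before queued writes.
 * MAX_IN_FLIGHT is a placeholder; a real implementation would tune it
 * per disk. */
#include <pthread.h>
#include <stdbool.h>

#define MAX_IN_FLIGHT 8

typedef struct io_req {
    struct io_req *next;
    bool is_read;                      /* data or metadata read           */
    void (*submit)(struct io_req *);   /* performs the actual FS call     */
} io_req_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    io_req_t *reads;                   /* queue of pending reads          */
    io_req_t *writes;                  /* queue of pending writes         */
    int in_flight;                     /* requests currently on the disk  */
} brick_queue_t;

/* Workers enqueue here instead of hitting the FS directly. */
void
brick_enqueue(brick_queue_t *q, io_req_t *req)
{
    pthread_mutex_lock(&q->lock);
    io_req_t **head = req->is_read ? &q->reads : &q->writes;
    req->next = *head;                 /* LIFO for brevity; a real queue is FIFO */
    *head = req;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Worker loop (run by several pool threads): never exceed MAX_IN_FLIGHT
 * concurrent FS calls, and always prefer reads over writes. */
void
brick_dispatch(brick_queue_t *q)
{
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->in_flight >= MAX_IN_FLIGHT || (!q->reads && !q->writes))
            pthread_cond_wait(&q->cond, &q->lock);

        io_req_t *req;
        if (q->reads) {
            req = q->reads;
            q->reads = req->next;
        } else {
            req = q->writes;
            q->writes = req->next;
        }
        q->in_flight++;
        pthread_mutex_unlock(&q->lock);

        req->submit(req);              /* the actual (blocking) disk I/O  */

        pthread_mutex_lock(&q->lock);
        q->in_flight--;
        pthread_cond_signal(&q->cond);
        pthread_mutex_unlock(&q->lock);
    }
}

The important properties are that reads never wait behind a long backlog of
writes, and that the disk never sees more than a bounded number of
concurrent requests, whatever the size of the thread pool.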

Anyway, I may be wrong, since all these things depend on too many factors.
I haven't done any specific tests about this. It's more like a
brainstorming. As soon as I can I would like to experiment with this and
get some empirical data.

Xavi


> Thanks,
> Poornima
>
>
> On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:
>> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
>> > Perhaps we could throttle both aspects - number of I/O requests per disk
>>
>> While there it would be nice to detect and report  a disk with lower than
>> peer performance: that happen sometimes when a disk is dying, and last
>> time I was hit by that performance problem, I had a hard time finding
>> the culprit.
>>
>> --
>> Emmanuel Dreyfus
>> m...@netbsd.org
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-01 Thread Poornima Gurusiddaiah
Can the threads be categorised to do certain kinds of fops? Reads/writes
could be affinitised to one set of threads and the other metadata fops to
another set, so that we limit the read/write threads but not the metadata
threads. Also, if AIO is enabled in the backend, the threads will not be
blocked on disk I/O, right?
All this is based on the assumption that a large number of parallel reads
and writes makes disk performance bad, but a large number of dentry and
metadata ops does not. Is that true?

Thanks,
Poornima


On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:

> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
> > Perhaps we could throttle both aspects - number of I/O requests per disk
>
> While there it would be nice to detect and report  a disk with lower than
> peer performance: that happen sometimes when a disk is dying, and last
> time I was hit by that performance problem, I had a hard time finding
> the culprit.
>
> --
> Emmanuel Dreyfus
> m...@netbsd.org
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-01 Thread Emmanuel Dreyfus
On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
> Perhaps we could throttle both aspects - number of I/O requests per disk

While we are at it, it would be nice to detect and report a disk with lower
than peer performance: that happens sometimes when a disk is dying, and the
last time I was hit by that performance problem, I had a hard time finding
the culprit.
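
Even something simple would already help, e.g. periodically comparing each
brick's average fop latency against the median of its replica peers and
logging when one is consistently much slower (a rough sketch; all names and
the threshold are invented):

#include <stdio.h>
#include <stdlib.h>

#define SLOW_FACTOR 3.0            /* "3x slower than the median peer"    */

static int
cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* latencies[]: average latency (ms) of each brick in the replica set;
 * self: index of the local brick. */
void
check_slow_disk(const double *latencies, int n, int self)
{
    double *sorted = malloc(n * sizeof(*sorted));
    if (!sorted)
        return;
    for (int i = 0; i < n; i++)
        sorted[i] = latencies[i];
    qsort(sorted, n, sizeof(*sorted), cmp_double);

    double median = sorted[n / 2];
    if (median > 0 && latencies[self] > SLOW_FACTOR * median)
        fprintf(stderr, "brick %d: avg latency %.1f ms is %.1fx the peer "
                "median (%.1f ms); the disk may be failing\n",
                self, latencies[self], latencies[self] / median, median);
    free(sorted);
}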

-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] I/O performance

2019-01-31 Thread Vijay Bellur
On Thu, Jan 31, 2019 at 11:12 PM Xavi Hernandez 
wrote:

> On Fri, Feb 1, 2019 at 7:54 AM Vijay Bellur  wrote:
>
>>
>>
>> On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
>> wrote:
>>
>>> Hi,
>>>
>>> I've been doing some tests with the global thread pool [1], and I've
>>> observed one important thing:
>>>
>>> Since this new thread pool has very low contention (apparently), it
>>> exposes other problems when the number of threads grows. What I've seen is
>>> that some workloads use all available threads on bricks to do I/O, causing
>>> avgload to grow rapidly and saturating the machine (or it seems so), which
>>> really makes everything slower. Reducing the maximum number of threads
>>> improves performance actually. Other workloads, though, do little I/O
>>> (probably most is locking or smallfile operations). In this case limiting
>>> the number of threads to a small value causes a performance reduction. To
>>> increase performance we need more threads.
>>>
>>> So this is making me think that maybe we should implement some sort of
>>> I/O queue with a maximum I/O depth for each brick (or disk if bricks share
>>> same disk). This way we can limit the amount of requests physically
>>> accessing the underlying FS concurrently, without actually limiting the
>>> number of threads that can be doing other things on each brick. I think
>>> this could improve performance.
>>>
>>
>> Perhaps we could throttle both aspects - number of I/O requests per disk
>> and the number of threads too?  That way we will have the ability to behave
>> well when there is bursty I/O to the same disk and when there are multiple
>> concurrent requests to different disks. Do you have a reason to not limit
>> the number of threads?
>>
>
> No, in fact the global thread pool does have a limit for the number of
> threads. I'm not saying to replace the thread limit for I/O depth control,
> I think we need both. I think we need to clearly identify which threads are
> doing I/O and limit them, even if there are more threads available. The
> reason is easy: suppose we have a fixed number of threads. If we have heavy
> load sent in parallel, it's quite possible that all threads get blocked
> doing some I/O. This has two consequences:
>
>1. There are no more threads to execute other things, like sending
>answers to the client, or start processing new incoming requests. So CPU is
>underutilized.
>2. Massive parallel access to a FS actually decreases performance
>
> This means that we can do less work and this work takes more time, which
> is bad.
>
> If we limit the number of threads that can actually be doing FS I/O, it's
> easy to keep FS responsive and we'll still have more threads to do other
> work.
>


Got it, thx.


>
>
>>
>>> Maybe this approach could also be useful in client side, but I think
>>> it's not so critical there.
>>>
>>
>> Agree, rate limiting on the server side would be more appropriate.
>>
>
> Only thing to consider here is that if we limit rate on servers but
> clients can generate more requests without limit, we may require lots of
> memory to track all ongoing requests. Anyway, I think this is not the most
> important thing now, so if we solve the server-side problem, then we can
> check if this is really needed or not (it could happen that client
> applications limit themselves automatically because they will be waiting
> for answers from server before sending more requests, unless the number of
> application running concurrently is really huge).
>

We could enable throttling in the rpc layer to handle a client performing
aggressive I/O.  RPC throttling should be able to handle the scenario
described above.
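
As a sketch of what that could look like (names are invented; this is not
existing rpcsvc code), a per-client token bucket checked on every incoming
request would be enough to smooth out an aggressive client:

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

typedef struct {
    pthread_mutex_t lock;
    double tokens;                 /* available request credits          */
    double rate;                   /* refill rate, requests per second   */
    double burst;                  /* maximum bucket size                */
    struct timespec last;          /* time of the last refill            */
} client_bucket_t;

/* Returns true if the request may be processed now; false means the
 * caller should queue it or apply back-pressure on the connection. */
bool
client_admit(client_bucket_t *b)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    pthread_mutex_lock(&b->lock);

    double elapsed = (now.tv_sec - b->last.tv_sec) +
                     (now.tv_nsec - b->last.tv_nsec) / 1e9;
    b->tokens += elapsed * b->rate;
    if (b->tokens > b->burst)
        b->tokens = b->burst;
    b->last = now;

    bool ok = (b->tokens >= 1.0);
    if (ok)
        b->tokens -= 1.0;

    pthread_mutex_unlock(&b->lock);
    return ok;
}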

-Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-01-31 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 7:54 AM Vijay Bellur  wrote:

>
>
> On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
> wrote:
>
>> Hi,
>>
>> I've been doing some tests with the global thread pool [1], and I've
>> observed one important thing:
>>
>> Since this new thread pool has very low contention (apparently), it
>> exposes other problems when the number of threads grows. What I've seen is
>> that some workloads use all available threads on bricks to do I/O, causing
>> avgload to grow rapidly and saturating the machine (or it seems so), which
>> really makes everything slower. Reducing the maximum number of threads
>> improves performance actually. Other workloads, though, do little I/O
>> (probably most is locking or smallfile operations). In this case limiting
>> the number of threads to a small value causes a performance reduction. To
>> increase performance we need more threads.
>>
>> So this is making me think that maybe we should implement some sort of
>> I/O queue with a maximum I/O depth for each brick (or disk if bricks share
>> same disk). This way we can limit the amount of requests physically
>> accessing the underlying FS concurrently, without actually limiting the
>> number of threads that can be doing other things on each brick. I think
>> this could improve performance.
>>
>
> Perhaps we could throttle both aspects - number of I/O requests per disk
> and the number of threads too?  That way we will have the ability to behave
> well when there is bursty I/O to the same disk and when there are multiple
> concurrent requests to different disks. Do you have a reason to not limit
> the number of threads?
>

No, in fact the global thread pool does have a limit for the number of
threads. I'm not saying we should replace the thread limit with I/O depth
control; I think we need both. I think we need to clearly identify which
threads are doing I/O and limit them, even if there are more threads
available. The reason is easy: suppose we have a fixed number of threads. If
we have heavy load sent in parallel, it's quite possible that all threads
get blocked doing some I/O. This has two consequences:

   1. There are no more threads to execute other things, like sending
   answers to the client or starting to process new incoming requests. So CPU
   is underutilized.
   2. Massive parallel access to a FS actually decreases performance.

This means that we can do less work and this work takes more time, which is
bad.

If we limit the number of threads that can actually be doing FS I/O, it's
easy to keep FS responsive and we'll still have more threads to do other
work.


>
>> Maybe this approach could also be useful in client side, but I think it's
>> not so critical there.
>>
>
> Agree, rate limiting on the server side would be more appropriate.
>

Only thing to consider here is that if we limit rate on servers but clients
can generate more requests without limit, we may require lots of memory to
track all ongoing requests. Anyway, I think this is not the most important
thing now, so if we solve the server-side problem, then we can check if
this is really needed or not (it could happen that client applications
limit themselves automatically because they will be waiting for answers
from server before sending more requests, unless the number of application
running concurrently is really huge).

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-01-31 Thread Vijay Bellur
On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
wrote:

> Hi,
>
> I've been doing some tests with the global thread pool [1], and I've
> observed one important thing:
>
> Since this new thread pool has very low contention (apparently), it
> exposes other problems when the number of threads grows. What I've seen is
> that some workloads use all available threads on bricks to do I/O, causing
> avgload to grow rapidly and saturating the machine (or it seems so), which
> really makes everything slower. Reducing the maximum number of threads
> improves performance actually. Other workloads, though, do little I/O
> (probably most is locking or smallfile operations). In this case limiting
> the number of threads to a small value causes a performance reduction. To
> increase performance we need more threads.
>
> So this is making me think that maybe we should implement some sort of I/O
> queue with a maximum I/O depth for each brick (or disk if bricks share same
> disk). This way we can limit the amount of requests physically accessing
> the underlying FS concurrently, without actually limiting the number of
> threads that can be doing other things on each brick. I think this could
> improve performance.
>

Perhaps we could throttle both aspects - number of I/O requests per disk
and the number of threads too?  That way we will have the ability to behave
well when there is bursty I/O to the same disk and when there are multiple
concurrent requests to different disks. Do you have a reason to not limit
the number of threads?


> Maybe this approach could also be useful in client side, but I think it's
> not so critical there.
>

Agree, rate limiting on the server side would be more appropriate.


Thanks,
Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] I/O performance

2019-01-31 Thread Xavi Hernandez
Hi,

I've been doing some tests with the global thread pool [1], and I've
observed one important thing:

Since this new thread pool has very low contention (apparently), it exposes
other problems when the number of threads grows. What I've seen is that
some workloads use all available threads on bricks to do I/O, causing
avgload to grow rapidly and saturating the machine (or so it seems), which
really makes everything slower. Reducing the maximum number of threads
actually improves performance. Other workloads, though, do little I/O
(probably most is locking or smallfile operations). In this case, limiting
the number of threads to a small value causes a performance reduction; to
increase performance we need more threads.

So this is making me think that maybe we should implement some sort of I/O
queue with a maximum I/O depth for each brick (or disk, if bricks share the
same disk). This way we can limit the number of requests physically accessing
the underlying FS concurrently, without actually limiting the number of
threads that can be doing other things on each brick. I think this could
improve performance.
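
Just to make the idea more concrete, the simplest form could be a counting
semaphore per brick wrapped around the actual filesystem calls (this is only
an illustrative sketch, not existing code; the depth value is a placeholder):

/* At most IO_DEPTH requests touch the underlying filesystem at once,
 * while any number of pool threads can keep doing non-I/O work. */
#include <semaphore.h>

#define IO_DEPTH 8                 /* max concurrent FS operations per brick */

typedef struct {
    sem_t slots;                   /* initialised with sem_init(&slots, 0, IO_DEPTH) */
} brick_io_gate_t;

static inline void
brick_io_begin(brick_io_gate_t *g)
{
    sem_wait(&g->slots);           /* blocks only while the disk is "full"   */
}

static inline void
brick_io_end(brick_io_gate_t *g)
{
    sem_post(&g->slots);
}

Every read/write fop in the posix layer would call brick_io_begin() before
the syscall and brick_io_end() after it, so the thread pool size no longer
determines how many requests hit the disk concurrently.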

Maybe this approach could also be useful on the client side, but I think it's
not so critical there.

What do you think ?

Xavi

[1] https://review.gluster.org/c/glusterfs/+/20636
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel