Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2019-01-25 Thread Raghavendra Gowdappa
On Sat, Jan 26, 2019 at 8:03 AM Raghavendra Gowdappa 
wrote:

>
>
> On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa 
> wrote:
>
>> Here is an update on the progress so far:
>> * The client profile attached so far shows that tuple creation is
>> dominated by writes and fstats. Note that fstats are side-effects of writes,
>> as writes invalidate the file's attributes in the kernel attribute cache.
>> * The rest of the init phase (which is marked by the msgs "setting primary
>> key" and "vacuum") is dominated by reads. The next biggest set of operations
>> is writes, followed by fstats.
>>
>> So, writes, reads and fstats are the only operations we need to optimize
>> to reduce the init-time latency. As mentioned in my previous mail, I did
>> the following tunings:
>> * Enabled only write-behind, md-cache and open-behind.
>> - write-behind was configured with a cache-size/window-size of 20MB
>> - open-behind was configured with read-after-open yes
>> - md-cache was loaded as a child of write-behind in the xlator graph. When
>> loaded as a parent of write-behind, responses of writes cached in
>> write-behind would invalidate its stats. But when loaded as a child of
>> write-behind this problem won't be there. Note that in both cases fstat
>> would pass through write-behind (in the former case due to no stats in
>> md-cache). However, in the latter case fstats can be served by md-cache.
>> - md-cache used to aggressively invalidate inodes. For the purpose of
>> this test, I just commented out the inode-invalidate code in md-cache. We
>> need to fine-tune the invalidation invocation logic.
>> - set group-metadata-cache to on, but turned off upcall notifications.
>> Note that this workload accesses all its data through a single mount
>> point, so there are no files shared across mounts and hence it's safe to
>> turn off invalidations.
>> * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>
>> With the above set of tunings I could reduce the init time of scale 8000
>> from 16.6 hrs to 11.4 hrs - an improvement in the range of 25% to 30%.
>>
>> Since the workload is dominated by reads, we think a good read-cache
>> where reads to regions just written are served from cache would greatly
>> improve the performance. Since kernel page-cache already provides that
>> functionality along with read-ahead (which is more intelligent and serves
>> more read patterns than supported by Glusterfs read-ahead), we wanted to
>> try that. But Manoj found a bug where reads following writes are not
>> served from page cache [5]. I am currently waiting for the resolution of
>> this bug. As an alternative, I can modify io-cache to serve reads from the
>> data just written. But the change involves its own challenges and hence I would
>> like to get a resolution on [5] (either positive or negative) before
>> proceeding with modifications to io-cache.
>>
>> As to the rpc latency, Krutika had long back identified that reading a
>> single rpc message involves at least 4 reads on the socket. This many reads
>> were needed to discover the structure of the message on the go. The reason
>> we wanted to discover the rpc message structure was to identify the part of
>> the message containing the read or write payload and make sure that the
>> payload is read directly into a buffer different from the one containing
>> the rest of the rpc message. This strategy makes sure payloads are not
>> copied again when buffers are moved across caches (read-ahead, io-cache
>> etc.) and also that the rest of the rpc message can be freed even though
>> the payload outlives it (when payloads are cached). However, we can
>> experiment with an approach where we either do away with the zero-copy
>> requirement or let the entire buffer containing the rpc message and
>> payload live in the cache.
>>
>> From my observations and discussions with Manoj and Xavi, this workload
>> is very sensitive to latency (more than to concurrency). So, I am hopeful the
>> above approaches will give positive results.
>>
>
> Manoj, Csaba and I figured out that invalidations by md-cache and FUSE
> auto-invalidations were dropping the kernel page-cache (more details in
> [5]).
>

Thanks to Miklos for the pointer on auto-invalidations.


> Changes to stats by writes from the same client (local writes) were
> triggering both these codepaths, dropping the cache. Since all the I/O done
> by this workload goes through the caches of a single client, the
> invalidations are not necessary, so I made code changes to fuse-bridge to
> disable auto-invalidations completely and commented out inode-invalidations
> in md-cache. Note that this doesn't regress the consistency/coherency of
> data seen in the caches as it's a single-client use case. With these two
> changes coupled with the earlier optimizations (client-io-threads=on,
> server/client-event-threads=4, md-cache as a child of write-behind in the
> xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000
> on a volume with NVMe 

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2019-01-25 Thread Raghavendra Gowdappa
On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa 
wrote:

> Here is an update on the progress so far:
> * The client profile attached so far shows that tuple creation is
> dominated by writes and fstats. Note that fstats are side-effects of writes,
> as writes invalidate the file's attributes in the kernel attribute cache.
> * The rest of the init phase (which is marked by the msgs "setting primary
> key" and "vacuum") is dominated by reads. The next biggest set of operations
> is writes, followed by fstats.
>
> So, writes, reads and fstats are the only operations we need to optimize
> to reduce the init-time latency. As mentioned in my previous mail, I did
> the following tunings:
> * Enabled only write-behind, md-cache and open-behind.
> - write-behind was configured with a cache-size/window-size of 20MB
> - open-behind was configured with read-after-open yes
> - md-cache was loaded as a child of write-behind in the xlator graph. When
> loaded as a parent of write-behind, responses of writes cached in
> write-behind would invalidate its stats. But when loaded as a child of
> write-behind this problem won't be there. Note that in both cases fstat
> would pass through write-behind (in the former case due to no stats in
> md-cache). However, in the latter case fstats can be served by md-cache.
> - md-cache used to aggressively invalidate inodes. For the purpose of
> this test, I just commented out the inode-invalidate code in md-cache. We
> need to fine-tune the invalidation invocation logic.
> - set group-metadata-cache to on, but turned off upcall notifications.
> Note that this workload accesses all its data through a single mount
> point, so there are no files shared across mounts and hence it's safe to
> turn off invalidations.
> * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>
> With the above set of tunings I could reduce the init time of scale 8000
> from 16.6 hrs to 11.4 hrs - an improvement in the range of 25% to 30%.
>
> Since the workload is dominated by reads, we think a good read-cache where
> reads to regions just written are served from cache would greatly improve
> the performance. Since kernel page-cache already provides that
> functionality along with read-ahead (which is more intelligent and serves
> more read patterns than supported by Glusterfs read-ahead), we wanted to
> try that. But Manoj found a bug where reads following writes are not
> served from page cache [5]. I am currently waiting for the resolution of
> this bug. As an alternative, I can modify io-cache to serve reads from the
> data just written. But the change involves its own challenges and hence I would
> like to get a resolution on [5] (either positive or negative) before
> proceeding with modifications to io-cache.
>
> As to the rpc latency, Krutika had long back identified that reading a
> single rpc message involves at least 4 reads on the socket. This many reads
> were needed to discover the structure of the message on the go. The reason
> we wanted to discover the rpc message structure was to identify the part of
> the message containing the read or write payload and make sure that the
> payload is read directly into a buffer different from the one containing
> the rest of the rpc message. This strategy makes sure payloads are not
> copied again when buffers are moved across caches (read-ahead, io-cache
> etc.) and also that the rest of the rpc message can be freed even though
> the payload outlives it (when payloads are cached). However, we can
> experiment with an approach where we either do away with the zero-copy
> requirement or let the entire buffer containing the rpc message and
> payload live in the cache.
>
> From my observations and discussions with Manoj and Xavi, this workload is
> very sensitive to latency (more than to concurrency). So, I am hopeful the above
> approaches will give positive results.
>

Manoj, Csaba and I figured out that invalidations by md-cache and FUSE
auto-invalidations were dropping the kernel page-cache (more details in
[5]). Changes to stats by writes from the same client (local writes) were
triggering both these codepaths, dropping the cache. Since all the I/O done
by this workload goes through the caches of a single client, the
invalidations are not necessary, so I made code changes to fuse-bridge to
disable auto-invalidations completely and commented out inode-invalidations
in md-cache. Note that this doesn't regress the consistency/coherency of
data seen in the caches as it's a single-client use case. With these two
changes coupled with the earlier optimizations (client-io-threads=on,
server/client-event-threads=4, md-cache as a child of write-behind in the
xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000
on a volume with an NVMe backend completed in 54m25s. This is a whopping 94%
improvement over the time we started out with (59280s vs 3360s).
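
For anyone reproducing this, the CLI-settable part of the above setup and the
init run would look roughly like the sketch below (from memory; nvme-r3 is the
volume shown in the volume info that follows; moving md-cache below
write-behind in the graph and the fuse-bridge/md-cache invalidation changes
are source-level modifications, not volume options, and option names should be
verified against your gluster version):

    # threading and md-cache tunables listed above (sketch)
    gluster volume set nvme-r3 performance.client-io-threads on
    gluster volume set nvme-r3 client.event-threads 4
    gluster volume set nvme-r3 server.event-threads 4
    gluster volume set nvme-r3 performance.md-cache-timeout 600

    # pgbench initialization at scale 8000 against a database kept on the
    # gluster mount ("pgbench_db" is a placeholder database name)
    time pgbench -i -s 8000 pgbench_db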

[root@shakthi4 ~]# gluster volume info

Volume Name: nvme-r3
Type: Replicate
Volume ID: 

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2019-01-11 Thread Raghavendra Gowdappa
Here is an update on the progress so far:
* The client profile attached so far (a capture sketch follows below) shows
that tuple creation is dominated by writes and fstats. Note that fstats are
side-effects of writes, as writes invalidate the file's attributes in the
kernel attribute cache.
* The rest of the init phase (which is marked by the msgs "setting primary
key" and "vacuum") is dominated by reads. The next biggest set of operations
is writes, followed by fstats.
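
For reference, a client profile like the one mentioned above can be captured
roughly as sketched here (assuming the io-stats xlator present in the default
client graph; option and xattr names are from memory, and the volume name,
mount point and output path are placeholders, so verify against your release):

    VOL=pgvol                 # hypothetical volume name
    MNT=/mnt/pgvol            # hypothetical client mount point

    # enable per-fop latency and hit accounting in io-stats (sketch)
    gluster volume set $VOL diagnostics.latency-measurement on
    gluster volume set $VOL diagnostics.count-fop-hits on

    # ... run the pgbench init phase on the client ...

    # ask the client's io-stats instance to dump its counters
    setfattr -n trusted.io-stats-dump -v /tmp/client-profile.txt $MNT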

So, writes, reads and fstats are the only operations we need to optimize to
reduce the init-time latency. As mentioned in my previous mail, I did the
following tunings (a rough mapping to volume options is sketched after the
list):
* Enabled only write-behind, md-cache and open-behind.
- write-behind was configured with a cache-size/window-size of 20MB
- open-behind was configured with read-after-open yes
- md-cache was loaded as a child of write-behind in the xlator graph. When
loaded as a parent of write-behind, responses of writes cached in
write-behind would invalidate its stats. But when loaded as a child of
write-behind this problem won't be there. Note that in both cases fstat
would pass through write-behind (in the former case due to no stats in
md-cache). However, in the latter case fstats can be served by md-cache.
- md-cache used to aggressively invalidate inodes. For the purpose of
this test, I just commented out the inode-invalidate code in md-cache. We
need to fine-tune the invalidation invocation logic.
- set group-metadata-cache to on, but turned off upcall notifications.
Note that this workload accesses all its data through a single mount
point, so there are no files shared across mounts and hence it's safe to
turn off invalidations.
* Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781
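
A rough mapping of the CLI-settable tunings above to volume options (a sketch;
loading md-cache as a child of write-behind and commenting out the
inode-invalidate code are source changes, not options, and exact option names
and value limits should be checked against your gluster release):

    VOL=pgvol   # hypothetical volume name

    # keep only write-behind, open-behind and md-cache on the client path
    gluster volume set $VOL performance.write-behind on
    gluster volume set $VOL performance.open-behind on
    gluster volume set $VOL performance.stat-prefetch on     # md-cache
    gluster volume set $VOL performance.read-ahead off
    gluster volume set $VOL performance.io-cache off
    gluster volume set $VOL performance.quick-read off

    # write-behind and open-behind tunables mentioned above
    gluster volume set $VOL performance.write-behind-window-size 20MB
    gluster volume set $VOL performance.read-after-open yes

    # md-cache settings: apply the metadata-cache group first, then turn
    # the upcall-based invalidation back off, as described above
    gluster volume set $VOL group metadata-cache
    gluster volume set $VOL features.cache-invalidation off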

With the above set of tunings I could reduce the init time of scale 8000
from 16.6 hrs to 11.4 hrs - an improvement in the range of 25% to 30%.

Since the workload is dominated by reads, we think a good read-cache where
reads to regions just written are served from cache would greatly improve
the performance. Since kernel page-cache already provides that
functionality along with read-ahead (which is more intelligent and serves
more read patterns than supported by Glusterfs read-ahead), we wanted to
try that. But Manoj found a bug where reads following writes are not
served from page cache [5]. I am currently waiting for the resolution of
this bug. As an alternative, I can modify io-cache to serve reads from the
data just written. But the change involves its own challenges and hence I would
like to get a resolution on [5] (either positive or negative) before
proceeding with modifications to io-cache.
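
A crude way to observe whether reads of just-written data are being served
from the client's page cache (a sketch only, not the reproducer attached to
[5]; the mount point is a placeholder and dropping caches needs root):

    MNT=/mnt/pgvol            # hypothetical gluster client mount

    # write a test file through the mount
    dd if=/dev/zero of=$MNT/pagecache-test bs=1M count=512 conv=fsync

    # read it back immediately: if reads after writes are cached, this
    # should complete at memory speed
    time dd if=$MNT/pagecache-test of=/dev/null bs=1M

    # drop the page cache and read again to get the uncached baseline
    echo 3 > /proc/sys/vm/drop_caches
    time dd if=$MNT/pagecache-test of=/dev/null bs=1M

If the first read is no faster than the read after dropping caches, the
just-written data is not being retained in (or is being invalidated from)
the page cache.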

As to the rpc latency, Krutika had long back identified that reading a
single rpc message involves at least 4 reads on the socket. This many reads
were needed to discover the structure of the message on the go. The reason
we wanted to discover the rpc message structure was to identify the part of
the message containing the read or write payload and make sure that the
payload is read directly into a buffer different from the one containing
the rest of the rpc message. This strategy makes sure payloads are not
copied again when buffers are moved across caches (read-ahead, io-cache
etc.) and also that the rest of the rpc message can be freed even though
the payload outlives it (when payloads are cached). However, we can
experiment with an approach where we either do away with the zero-copy
requirement or let the entire buffer containing the rpc message and
payload live in the cache.

From my observations and discussions with Manoj and Xavi, this workload is
very sensitive to latency (more than to concurrency). So, I am hopeful the above
approaches will give positive results.

[5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934

regards,
Raghavendra

On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa 
wrote:

>
>
> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> [pulling the conclusions up to enable better in-line]
>>>
>>> > Conclusions:
>>> >
>>> > We should never have a volume with caching-related xlators disabled.
>>> The price we pay for it is too high. We need to make them work consistently
>>> and aggressively to avoid as many requests as we can.
>>>
>>> Are there current issues in terms of behavior which are known/observed
>>> when these are enabled?
>>>
>>
>> We did have issues with pgbench in the past, but they have been fixed.
>> Please refer to bz [1] for details. On 5.1, it runs successfully with all
>> caching-related xlators enabled. Having said that, the only performance
>> xlators which gave improved performance were open-behind and write-behind
>> [2] (write-behind had some issues, which will be fixed by [3] and we'll
>> have to measure performance again with fix to [3]).
>>
>
> One quick update. Enabling write-behind and md-cache with fix for [3]
> reduced the total time taken for 

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2019-01-01 Thread Xavi Hernandez
On Mon, Dec 24, 2018 at 11:30 AM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> [pulling the conclusions up to enable better in-line]
>
> > Conclusions:
> >
> > We should never have a volume with caching-related xlators disabled. The
> price we pay for it is too high. We need to make them work consistently and
> aggressively to avoid as many requests as we can.
>
> Are there current issues in terms of behavior which are known/observed
> when these are enabled?
>
> > We need to analyze client/server xlators deeper to see if we can avoid
> some delays. However optimizing something that is already at the
> microsecond level can be very hard.
>
> That is true - are there any significant gains which can be accrued by
> putting efforts here or, should this be a lower priority?
>

I would say that for volumes based on spinning disks this is not a high
priority, but if we want to provide good performance for NVMe storage, this
is something that needs to be done. On NVMe, reads and writes can be served
in a few tens of microseconds, so adding 100 us in the network layer could
easily mean a performance reduction of 70% or more (for example, a request
served in 30 us by the device becomes 130 us end to end, roughly a 75% drop
in per-request throughput).


> > We need to determine what causes the fluctuations in brick side and
> avoid them.
> > This scenario is very similar to a smallfile/metadata workload, so this
> is probably one important cause of its bad performance.
>
> What kind of instrumentation is required to enable the determination?
>
> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
> wrote:
> >
> > Hi,
> >
> > I've done some tracing of the latency that the network layer introduces in
> gluster. I've made the analysis as part of the pgbench performance issue
> (in particular the initialization and scaling phase), so I decided to look
> at READV for this particular workload, but I think the results can be
> extrapolated to other operations that also have small latency (cached data
> from FS for example).
> >
> > Note that measuring latencies introduces some latency. It consists of a
> call to clock_get_time() for each probe point, so the real latency will be
> a bit lower, but still proportional to these numbers.
> >
>
> [snip]

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-31 Thread Sankarshan Mukhopadhyay
On Fri 28 Dec, 2018, 12:44 Raghavendra Gowdappa 
>
> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> [pulling the conclusions up to enable better in-line]
>>>
>>> > Conclusions:
>>> >
>>> > We should never have a volume with caching-related xlators disabled.
>>> The price we pay for it is too high. We need to make them work consistently
>>> and aggressively to avoid as many requests as we can.
>>>
>>> Are there current issues in terms of behavior which are known/observed
>>> when these are enabled?
>>>
>>
>> We did have issues with pgbench in the past, but they have been fixed.
>> Please refer to bz [1] for details. On 5.1, it runs successfully with all
>> caching-related xlators enabled. Having said that, the only performance
>> xlators which gave improved performance were open-behind and write-behind
>> [2] (write-behind had some issues, which will be fixed by [3] and we'll
>> have to measure performance again with fix to [3]).
>>
>
> One quick update. Enabling write-behind and md-cache with the fix for [3]
> reduced the total time taken for the pgbench init phase by roughly 20%-25%
> (from 12.5 min to 9.75 min at a scale of 100), though this is still a lot
> of time (around 12 hrs for a db of scale 8000). I'll follow up with a detailed
> report once my experiments are complete. Currently trying to optimize the
> read path.
>
>
>> For some reason, read-side caching didn't improve transactions per
>> second. I am working on this problem currently. Note that these bugs
>> measure transaction phase of pgbench, but what xavi measured in his mail is
>> init phase. Nevertheless, evaluation of read caching (metadata/data) will
>> still be relevant for init phase too.
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>
>
I think that what I am looking forward to is a well-defined set of next
steps, and periodic updates to this list, that eventually result in a
formal and recorded procedure to ensure that Gluster performs best for
these application workloads.


>>
>>
>>> > We need to analyze client/server xlators deeper to see if we can avoid
>>> some delays. However optimizing something that is already at the
>>> microsecond level can be very hard.
>>>
>>> That is true - are there any significant gains which can be accrued by
>>> putting efforts here or, should this be a lower priority?
>>>
>>
>> The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
>> and I) had encountered in the past [4]. The solution we used was to have
>> multiple rpc connections between a single brick and client. The solution
>> indeed fixed the bottleneck. So, there is definitely work involved here -
>> either to fix the single-connection model or go with the multiple-connection
>> model. It's preferred to improve the single connection and resort to multiple
>> connections only if bottlenecks in the single connection are not fixable.
>> Personally I think this is high priority along with having appropriate
>> client side caching.
>>
>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
>>
>>
>>> > We need to determine what causes the fluctuations in brick side and
>>> avoid them.
>>> > This scenario is very similar to a smallfile/metadata workload, so
>>> this is probably one important cause of its bad performance.
>>>
>>> What kind of instrumentation is required to enable the determination?
>>>
>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I've done some tracing of the latency that the network layer introduces in
>>> gluster. I've made the analysis as part of the pgbench performance issue
>>> (in particular the initialization and scaling phase), so I decided to look
>>> at READV for this particular workload, but I think the results can be
>>> extrapolated to other operations that also have small latency (cached data
>>> from FS for example).
>>> >
>>> > Note that measuring latencies introduces some latency. It consists of
>>> a call to clock_get_time() for each probe point, so the real latency will
>>> be a bit lower, but still proportional to these numbers.
>>> >
>>>
>>> [snip]

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-27 Thread Raghavendra Gowdappa
On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa 
wrote:

>
>
> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> [pulling the conclusions up to enable better in-line]
>>
>> > Conclusions:
>> >
>> > We should never have a volume with caching-related xlators disabled.
>> The price we pay for it is too high. We need to make them work consistently
>> and aggressively to avoid as many requests as we can.
>>
>> Are there current issues in terms of behavior which are known/observed
>> when these are enabled?
>>
>
> We did have issues with pgbench in the past, but they have been fixed.
> Please refer to bz [1] for details. On 5.1, it runs successfully with all
> caching-related xlators enabled. Having said that, the only performance
> xlators which gave improved performance were open-behind and write-behind
> [2] (write-behind had some issues, which will be fixed by [3] and we'll
> have to measure performance again with fix to [3]).
>

One quick update. Enabling write-behind and md-cache with the fix for [3]
reduced the total time taken for the pgbench init phase by roughly 20%-25%
(from 12.5 min to 9.75 min at a scale of 100), though this is still a lot
of time (around 12 hrs for a db of scale 8000). I'll follow up with a detailed
report once my experiments are complete. Currently trying to optimize the
read path.


> For some reason, read-side caching didn't improve transactions per second.
> I am working on this problem currently. Note that these bugs measure
> transaction phase of pgbench, but what xavi measured in his mail is init
> phase. Nevertheless, evaluation of read caching (metadata/data) will still
> be relevant for init phase too.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>
>
>> > We need to analyze client/server xlators deeper to see if we can avoid
>> some delays. However optimizing something that is already at the
>> microsecond level can be very hard.
>>
>> That is true - are there any significant gains which can be accrued by
>> putting efforts here or, should this be a lower priority?
>>
>
> The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
> and I) had encountered in the past [4]. The solution we used was to have
> multiple rpc connections between a single brick and client. The solution
> indeed fixed the bottleneck. So, there is definitely work involved here -
> either to fix the single-connection model or go with the multiple-connection
> model. It's preferred to improve the single connection and resort to multiple
> connections only if bottlenecks in the single connection are not fixable.
> Personally I think this is high priority along with having appropriate
> client side caching.
>
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
>
>
>> > We need to determine what causes the fluctuations in brick side and
>> avoid them.
>> > This scenario is very similar to a smallfile/metadata workload, so this
>> is probably one important cause of its bad performance.
>>
>> What kind of instrumentation is required to enable the determination?
>>
>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
>> wrote:
>> >
>> > Hi,
>> >
>> > I've done some tracing of the latency that the network layer introduces in
>> gluster. I've made the analysis as part of the pgbench performance issue
>> (in particular the initialization and scaling phase), so I decided to look
>> at READV for this particular workload, but I think the results can be
>> extrapolated to other operations that also have small latency (cached data
>> from FS for example).
>> >
>> > Note that measuring latencies introduces some latency. It consists of a
>> call to clock_get_time() for each probe point, so the real latency will be
>> a bit lower, but still proportional to these numbers.
>> >
>>
>> [snip]

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-27 Thread Raghavendra Gowdappa
On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa 
wrote:

>
>
> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> [pulling the conclusions up to enable better in-line]
>>
>> > Conclusions:
>> >
>> > We should never have a volume with caching-related xlators disabled.
>> The price we pay for it is too high. We need to make them work consistently
>> and aggressively to avoid as many requests as we can.
>>
>> Are there current issues in terms of behavior which are known/observed
>> when these are enabled?
>>
>
> We did have issues with pgbench in the past, but they have been fixed.
> Please refer to bz [1] for details. On 5.1, it runs successfully with all
> caching-related xlators enabled. Having said that, the only performance
> xlators which gave improved performance were open-behind and write-behind
> [2] (write-behind had some issues, which will be fixed by [3] and we'll
> have to measure performance again with fix to [3]). For some reason,
> read-side caching didn't improve transactions per second. I am working on
> this problem currently. Note that these bugs measure transaction phase of
> pgbench, but what xavi measured in his mail is init phase. Nevertheless,
> evaluation of read caching (metadata/data) will still be relevant for init
> phase too.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>
>
>> > We need to analyze client/server xlators deeper to see if we can avoid
>> some delays. However optimizing something that is already at the
>> microsecond level can be very hard.
>>
>> That is true - are there any significant gains which can be accrued by
>> putting efforts here or, should this be a lower priority?
>>
>
> The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
> and I) had encountered in the past [4]. The solution we used was to have
> multiple rpc connections between a single brick and client. The solution
> indeed fixed the bottleneck. So, there is definitely work involved here -
> either to fix the single-connection model or go with the multiple-connection
> model. It's preferred to improve the single connection and resort to multiple
> connections only if bottlenecks in the single connection are not fixable.
> Personally I think this is high priority along with having appropriate
> client side caching.
>

Having multiple connections between a single brick and client didn't help
pgbench init-phase performance. In fact, with a larger number of connections,
performance actually regressed.

[5]  https://review.gluster.org/#/c/glusterfs/+/19133/


> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
>
>
>> > We need to determine what causes the fluctuations in brick side and
>> avoid them.
>> > This scenario is very similar to a smallfile/metadata workload, so this
>> is probably one important cause of its bad performance.
>>
>> What kind of instrumentation is required to enable the determination?
>>
>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
>> wrote:
>> >
>> > Hi,
>> >
>> > I've done some tracing of the latency that the network layer introduces in
>> gluster. I've made the analysis as part of the pgbench performance issue
>> (in particular the initialization and scaling phase), so I decided to look
>> at READV for this particular workload, but I think the results can be
>> extrapolated to other operations that also have small latency (cached data
>> from FS for example).
>> >
>> > Note that measuring latencies introduces some latency. It consists of a
>> call to clock_get_time() for each probe point, so the real latency will be
>> a bit lower, but still proportional to these numbers.
>> >
>>
>> [snip]

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-25 Thread Raghavendra Gowdappa
On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa 
wrote:

>
>
> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> [pulling the conclusions up to enable better in-line]
>>
>> > Conclusions:
>> >
>> > We should never have a volume with caching-related xlators disabled.
>> The price we pay for it is too high. We need to make them work consistently
>> and aggressively to avoid as many requests as we can.
>>
>> Are there current issues in terms of behavior which are known/observed
>> when these are enabled?
>>
>
> We did have issues with pgbench in the past, but they have been fixed.
> Please refer to bz [1] for details. On 5.1, it runs successfully with all
> caching-related xlators enabled. Having said that, the only performance
> xlators which gave improved performance were open-behind and write-behind
> [2] (write-behind had some issues, which will be fixed by [3] and we'll
> have to measure performance again with fix to [3]). For some reason,
> read-side caching didn't improve transactions per second.
>

One possible reason read-caching in glusterfs didn't show increased
performance could be that the VFS already provides read-ahead (of 128KB) and
the page-cache. Whatever performance boost caching can provide may already
be leveraged by the VFS page-cache itself, making glusterfs caching
redundant. I'll run some tests to gather evidence to (dis)prove this
hypothesis.
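
One simple A/B experiment along these lines (a sketch, not the actual test
plan; the volume name is a placeholder) is to rerun the same read-heavy phase
with gluster's read-side caching xlators toggled off and then on, so that any
remaining benefit can be attributed to the kernel's own read-ahead and
page-cache:

    VOL=pgvol    # hypothetical volume name

    # leg 1: rely only on the kernel page-cache and read-ahead
    gluster volume set $VOL performance.read-ahead off
    gluster volume set $VOL performance.io-cache off
    # ... run the read-heavy workload and record timings ...

    # leg 2: re-enable gluster's read-side caching and repeat
    gluster volume set $VOL performance.read-ahead on
    gluster volume set $VOL performance.io-cache on
    # ... run the same workload again and compare ...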

> I am working on this problem currently. Note that these bugs measure the
> transaction phase of pgbench, but what Xavi measured in his mail is the init
> phase. Nevertheless, evaluation of read caching (metadata/data) will still
> be relevant for init phase too.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>
>
>> > We need to analyze client/server xlators deeper to see if we can avoid
>> some delays. However optimizing something that is already at the
>> microsecond level can be very hard.
>>
>> That is true - are there any significant gains which can be accrued by
>> putting efforts here or, should this be a lower priority?
>>
>
> The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
> and I) had encountered in the past [4]. The solution we used was to have
> multiple rpc connections between a single brick and client. The solution
> indeed fixed the bottleneck. So, there is definitely work involved here -
> either to fix the single-connection model or go with the multiple-connection
> model. It's preferred to improve the single connection and resort to multiple
> connections only if bottlenecks in the single connection are not fixable.
> Personally I think this is high priority along with having appropriate
> client side caching.
>
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
>
>
>> > We need to determine what causes the fluctuations in brick side and
>> avoid them.
>> > This scenario is very similar to a smallfile/metadata workload, so this
>> is probably one important cause of its bad performance.
>>
>> What kind of instrumentation is required to enable the determination?
>>
>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
>> wrote:
>> >
>> > Hi,
>> >
>> > I've done some tracing of the latency that the network layer introduces in
>> gluster. I've made the analysis as part of the pgbench performance issue
>> (in particular the initialization and scaling phase), so I decided to look
>> at READV for this particular workload, but I think the results can be
>> extrapolated to other operations that also have small latency (cached data
>> from FS for example).
>> >
>> > Note that measuring latencies introduces some latency. It consists of a
>> call to clock_get_time() for each probe point, so the real latency will be
>> a bit lower, but still proportional to these numbers.
>> >
>>
>> [snip]

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-24 Thread Raghavendra Gowdappa
On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> [pulling the conclusions up to enable better in-line]
>
> > Conclusions:
> >
> > We should never have a volume with caching-related xlators disabled. The
> price we pay for it is too high. We need to make them work consistently and
> aggressively to avoid as many requests as we can.
>
> Are there current issues in terms of behavior which are known/observed
> when these are enabled?
>

We did have issues with pgbench in the past, but they have been fixed.
Please refer to bz [1] for details. On 5.1, it runs successfully with all
caching-related xlators enabled. Having said that, the only performance
xlators which gave improved performance were open-behind and write-behind
[2] (write-behind had some issues, which will be fixed by [3], and we'll
have to measure performance again with the fix for [3]). For some reason,
read-side caching didn't improve transactions per second. I am working on
this problem currently. Note that these bugs measure the transaction phase
of pgbench, but what Xavi measured in his mail is the init phase.
Nevertheless, evaluation of read caching (metadata/data) will still be
relevant for the init phase too.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781


> > We need to analyze client/server xlators deeper to see if we can avoid
> some delays. However optimizing something that is already at the
> microsecond level can be very hard.
>
> That is true - are there any significant gains which can be accrued by
> putting efforts here or, should this be a lower priority?
>

The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
and I) had encountered in the past [4]. The solution we used was to have
multiple rpc connections between a single brick and client. The solution
indeed fixed the bottleneck. So, there is definitely work involved here -
either to fix the single-connection model or go with the multiple-connection
model. It's preferred to improve the single connection and resort to multiple
connections only if bottlenecks in the single connection are not fixable.
Personally I think this is high priority along with having appropriate
client side caching.

[4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52


> > We need to determine what causes the fluctuations in brick side and
> avoid them.
> > This scenario is very similar to a smallfile/metadata workload, so this
> is probably one important cause of its bad performance.
>
> What kind of instrumentation is required to enable the determination?
>
> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
> wrote:
> >
> > Hi,
> >
> > I've done some tracing of the latency that the network layer introduces in
> gluster. I've made the analysis as part of the pgbench performance issue
> (in particular the initialization and scaling phase), so I decided to look
> at READV for this particular workload, but I think the results can be
> extrapolated to other operations that also have small latency (cached data
> from FS for example).
> >
> > Note that measuring latencies introduces some latency. It consists of a
> call to clock_get_time() for each probe point, so the real latency will be
> a bit lower, but still proportional to these numbers.
> >
>
> [snip]

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-24 Thread Sankarshan Mukhopadhyay
[pulling the conclusions up to enable better in-line]

> Conclusions:
>
> We should never have a volume with caching-related xlators disabled. The 
> price we pay for it is too high. We need to make them work consistently and 
> aggressively to avoid as many requests as we can.

Are there current issues in terms of behavior which are known/observed
when these are enabled?

> We need to analyze client/server xlators deeper to see if we can avoid some 
> delays. However optimizing something that is already at the microsecond level 
> can be very hard.

That is true - are there any significant gains which can be accrued by
putting efforts here or, should this be a lower priority?

> We need to determine what causes the fluctuations in brick side and avoid 
> them.
> This scenario is very similar to a smallfile/metadata workload, so this is 
> probably one important cause of its bad performance.

What kind of instrumentation is required to enable the determination?

On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez  wrote:
>
> Hi,
>
> I've done some tracing of the latency that the network layer introduces in
> gluster. I've made the analysis as part of the pgbench performance issue (in
> particular the initialization and scaling phase), so I decided to look at
> READV for this particular workload, but I think the results can be 
> extrapolated to other operations that also have small latency (cached data 
> from FS for example).
>
> Note that measuring latencies introduces some latency. It consists of a call
> to clock_get_time() for each probe point, so the real latency will be a bit 
> lower, but still proportional to these numbers.
>

[snip]
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel