[Gluster-devel] Update on GCS 0.5 release
We've decided to delay the GCS 0.5 release by a few days (new date: 1st week of Jan), considering (a) most of the team members are out on holidays and (b) some of the critical issues/PRs from [1] are yet to be addressed.

Regards,
GCS team

[1] https://waffle.io/gluster/gcs?label=GCS%2F0.5

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench
On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <sankarshan.mukhopadh...@gmail.com> wrote:

> [pulling the conclusions up to enable better in-line]
>
> > Conclusions:
> >
> > We should never have a volume with caching-related xlators disabled. The price we pay for it is too high. We need to make them work consistently and aggressively to avoid as many requests as we can.
>
> Are there current issues in terms of behavior which are known/observed when these are enabled?

We did have issues with pgbench in the past, but they have been fixed; please refer to bz [1] for details. On 5.1, it runs successfully with all caching-related xlators enabled. Having said that, the only performance xlators which gave improved performance were open-behind and write-behind [2] (write-behind had some issues, which will be fixed by [3], and we'll have to measure performance again with the fix to [3]). For some reason, read-side caching didn't improve transactions per second; I am working on this problem currently. Note that these bugs measure the transaction phase of pgbench, but what Xavi measured in his mail is the init phase. Nevertheless, an evaluation of read caching (metadata/data) will still be relevant for the init phase too.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781

> > We need to analyze client/server xlators deeper to see if we can avoid some delays. However optimizing something that is already at the microsecond level can be very hard.
>
> That is true - are there any significant gains which can be accrued by putting efforts here or, should this be a lower priority?

The problem identified by Xavi is also the one we (Manoj, Krutika, Milind and I) had encountered in the past [4]. The solution we used was to have multiple rpc connections between a single brick and client. The solution indeed fixed the bottleneck.
So, there is definitely work involved here - either to fix the single-connection model or go with the multiple-connection model. It's preferable to improve the single connection and resort to multiple connections only if the bottlenecks in a single connection are not fixable. Personally, I think this is high priority along with having appropriate client-side caching.

[4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52

> > We need to determine what causes the fluctuations in brick side and avoid them.
> >
> > This scenario is very similar to a smallfile/metadata workload, so this is probably one important cause of its bad performance.
>
> What kind of instrumentation is required to enable the determination?
>
> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez wrote:
> >
> > Hi,
> >
> > I've done some tracing of the latency that the network layer introduces in gluster. I've made the analysis as part of the pgbench performance issue (in particular the initialization and scaling phase), so I decided to look at READV for this particular workload, but I think the results can be extrapolated to other operations that also have small latency (cached data from FS for example).
> >
> > Note that measuring latencies introduces some latency. It consists in a call to clock_gettime() for each probe point, so the real latency will be a bit lower, but still proportional to these numbers.
> >
> > [snip]
Re: [Gluster-devel] Implementing multiplexing for self heal client.
On 12/21/18 6:56 PM, Sankarshan Mukhopadhyay wrote:

On Fri, Dec 21, 2018 at 6:30 PM RAFI KC wrote:

Hi All,

What is the problem?

As of now the self-heal client runs as one daemon per node; this means even if there are multiple volumes, there will only be one self-heal daemon. So to apply each configuration change in the cluster, the self-heal daemon has to be reconfigured, but it doesn't have the ability to reconfigure dynamically. This means that when you have a lot of volumes in the cluster, every management operation that involves configuration changes, like volume start/stop, add/remove brick etc., will result in a self-heal daemon restart. If such operations are executed often, this not only slows down self-heal for a volume, but also grows the self-heal logs substantially.

What is the value of the number of volumes when you write "lot of volumes"? 1000 volumes, more etc

Yes, more than 1000 volumes. It also depends on how often you execute glusterd management operations (mentioned above). Each time the self-heal daemon is restarted, it prints the entire graph, and these graph traces contribute the majority of the log's size.

How to fix it?

We are planning to follow a procedure similar to attaching/detaching graphs dynamically, as in brick multiplexing. The detailed steps are as below:

1) First step is to make shd a per-volume daemon, to generate/reconfigure volfiles on a per-volume basis.
   1.1) This will help to attach the volfiles easily to an existing shd daemon
   1.2) This will help to send notifications to the shd daemon, as each volinfo keeps the daemon object
   1.3) Reconfiguring a particular subvolume is easier, as we can check the topology better
   1.4) With this change the volfiles will be moved to the workdir/vols/ directory

2) Writing new rpc requests like attach/detach_client_graph to support client attach/detach
   2.1) Also functions like graph reconfigure, mgmt_getspec_cbk have to be modified

3) Safely detaching a subvolume when there are pending frames to unwind.
   3.1) We can mark the client disconnected and make all the frames unwind with ENOTCONN
   3.2) We can wait for all the i/o to unwind until the new updated subvol attaches

4) Handle scenarios like glusterd restart, node reboot, etc.

At the moment we are not planning to limit the number of heal subvolumes per process because, with the current approach also, heal for every volume was being done from a single process, and we have not heard any major complaints about this.

Is the plan to not ever limit or, have a throttle set to a default high(er) value? How would system resources be impacted if the proposed design is implemented?

The plan is to implement it in a way that can support more than one multiplexed self-heal daemon. The throttling function as of now returns the same process to multiplex, but it can be easily modified to create a new process. This multiplexing logic won't utilize any additional resources beyond what it currently does.

Rafi KC
[Gluster-devel] Weekly Untriaged Bugs
[...truncated 6 lines...]
https://bugzilla.redhat.com/1654778 / build: Please update GlusterFS documentation to describe how to do a non-root install
https://bugzilla.redhat.com/1660404 / core: Conditional freeing of string after returning from dict_set_dynstr function
https://bugzilla.redhat.com/1655901 / core: glusterfsd 5.1 and 5.2 crashes in socket.so
https://bugzilla.redhat.com/1657645 / core: [Glusterfs-server-5.1] Gluster storage domain creation fails on MountError
https://bugzilla.redhat.com/1654021 / core: Gluster volume heal causes continuous info logging of "invalid argument"
https://bugzilla.redhat.com/1657202 / core: Possible memory leak in 5.1 brick process
https://bugzilla.redhat.com/1654398 / core: tests/bugs/core/brick-mux-fd-cleanup.t is failing
https://bugzilla.redhat.com/1658108 / disperse: [disperse] Dump respective itables in EC to statedumps.
https://bugzilla.redhat.com/1658472 / disperse: Mountpoint not accessible for few seconds when bricks are brought down to max redundancy after reset brick
https://bugzilla.redhat.com/1654753 / distribute: A distributed-disperse volume crashes when a symbolic link is renamed
https://bugzilla.redhat.com/1653250 / encryption-xlator: memory-leak in crypt xlator (glusterfs client)
https://bugzilla.redhat.com/1659334 / fuse: FUSE mount seems to be hung and not accessible
https://bugzilla.redhat.com/1659824 / fuse: Unable to mount gluster fs on glusterfs client: Transport endpoint is not connected
https://bugzilla.redhat.com/1657743 / fuse: Very high memory usage (25GB) on Gluster FUSE mountpoint
https://bugzilla.redhat.com/1656415 / geo-replication: geo-rep: arbiter test case fails on verify_hardlink_rename_data
https://bugzilla.redhat.com/1655333 / geo-replication: OSError: [Errno 116] Stale file handle due to rotated files
https://bugzilla.redhat.com/1654642 / gluster-smb: Very high memory usage with glusterfs VFS module
https://bugzilla.redhat.com/1657607 / posix: Convert nr_files to gf_atomic in posix_private structure
https://bugzilla.redhat.com/1659371 / posix: posix_janitor_thread_proc has bug that can't go into the janitor_walker if change the system time forward and change back
https://bugzilla.redhat.com/1659374 / posix: posix_janitor_thread_proc has bug that can't go into the janitor_walker if change the system time forward and change back
https://bugzilla.redhat.com/1659378 / posix: posix_janitor_thread_proc has bug that can't go into the janitor_walker if change the system time forward and change back
https://bugzilla.redhat.com/1657860 / project-infrastructure: Archives for ci-results mailinglist are getting wiped (with each mail?)
https://bugzilla.redhat.com/1659934 / project-infrastructure: Cannot unsubscribe the review.gluster.org
https://bugzilla.redhat.com/1659394 / project-infrastructure: Maintainer permissions on gluster-mixins project for Ankush
https://bugzilla.redhat.com/1658742 / rpc: Inconsistent type for 'remote-port' parameter
https://bugzilla.redhat.com/1657398 / transport: Unable to mount with custom certificate file
[...truncated 2 lines...]
Re: [Gluster-devel] Implementing multiplexing for self heal client.
On Fri, Dec 21, 2018 at 6:30 PM RAFI KC wrote:
>
> Hi All,
>
> What is the problem?
>
> As of now the self-heal client runs as one daemon per node; this means even if there are multiple volumes, there will only be one self-heal daemon. So to apply each configuration change in the cluster, the self-heal daemon has to be reconfigured, but it doesn't have the ability to reconfigure dynamically. This means that when you have a lot of volumes in the cluster, every management operation that involves configuration changes, like volume start/stop, add/remove brick etc., will result in a self-heal daemon restart. If such operations are executed often, this not only slows down self-heal for a volume, but also grows the self-heal logs substantially.

What is the value of the number of volumes when you write "lot of volumes"? 1000 volumes, more etc

> How to fix it?
>
> We are planning to follow a procedure similar to attaching/detaching graphs dynamically, as in brick multiplexing. The detailed steps are as below:
>
> 1) First step is to make shd a per-volume daemon, to generate/reconfigure volfiles on a per-volume basis.
>    1.1) This will help to attach the volfiles easily to an existing shd daemon
>    1.2) This will help to send notifications to the shd daemon, as each volinfo keeps the daemon object
>    1.3) Reconfiguring a particular subvolume is easier, as we can check the topology better
>    1.4) With this change the volfiles will be moved to the workdir/vols/ directory
>
> 2) Writing new rpc requests like attach/detach_client_graph to support client attach/detach
>    2.1) Also functions like graph reconfigure, mgmt_getspec_cbk have to be modified
>
> 3) Safely detaching a subvolume when there are pending frames to unwind.
>    3.1) We can mark the client disconnected and make all the frames unwind with ENOTCONN
>    3.2) We can wait for all the i/o to unwind until the new updated subvol attaches
>
> 4) Handle scenarios like glusterd restart, node reboot, etc.
>
> At the moment we are not planning to limit the number of heal subvolumes per process because, with the current approach also, heal for every volume was being done from a single process. We have not heard any major complaints about this.

Is the plan to not ever limit or, have a throttle set to a default high(er) value? How would system resources be impacted if the proposed design is implemented?
Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench
[pulling the conclusions up to enable better in-line]

> Conclusions:
>
> We should never have a volume with caching-related xlators disabled. The price we pay for it is too high. We need to make them work consistently and aggressively to avoid as many requests as we can.

Are there current issues in terms of behavior which are known/observed when these are enabled?

> We need to analyze client/server xlators deeper to see if we can avoid some delays. However optimizing something that is already at the microsecond level can be very hard.

That is true - are there any significant gains which can be accrued by putting efforts here or, should this be a lower priority?

> We need to determine what causes the fluctuations in brick side and avoid them.
>
> This scenario is very similar to a smallfile/metadata workload, so this is probably one important cause of its bad performance.

What kind of instrumentation is required to enable the determination?

On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez wrote:
>
> Hi,
>
> I've done some tracing of the latency that the network layer introduces in gluster. I've made the analysis as part of the pgbench performance issue (in particular the initialization and scaling phase), so I decided to look at READV for this particular workload, but I think the results can be extrapolated to other operations that also have small latency (cached data from FS for example).
>
> Note that measuring latencies introduces some latency. It consists in a call to clock_gettime() for each probe point, so the real latency will be a bit lower, but still proportional to these numbers.
>
> [snip]