[Gluster-devel] Update on GCS 0.5 release

2018-12-24 Thread Atin Mukherjee
We've decided to delay the GCS 0.5 release by a few days (new date: 1st
week of Jan), considering that (a) most of the team members are out on
holidays and (b) some of the critical issues/PRs from [1] are yet to be
addressed.

Regards,
GCS team

[1] https://waffle.io/gluster/gcs?label=GCS%2F0.5
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-24 Thread Raghavendra Gowdappa
On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> [pulling the conclusions up to enable better in-line]
>
> > Conclusions:
> >
> > We should never have a volume with caching-related xlators disabled. The
> > price we pay for it is too high. We need to make them work consistently
> > and aggressively to avoid as many requests as we can.
>
> Are there current issues in terms of behavior which are known/observed
> when these are enabled?
>

We did have issues with pgbench in the past, but they have been fixed;
please refer to bz [1] for details. On 5.1, it runs successfully with all
caching-related xlators enabled. Having said that, the only performance
xlators which gave improved performance were open-behind and write-behind
[2] (write-behind had some issues, which will be fixed by [3], and we'll
have to measure performance again with that fix). For some reason,
read-side caching didn't improve transactions per second; I am working on
this problem currently. Note that these bugs measure the transaction phase
of pgbench, but what Xavi measured in his mail is the init phase.
Nevertheless, the evaluation of read caching (metadata/data) will still be
relevant for the init phase too.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781


> > We need to analyze client/server xlators deeper to see if we can avoid
> > some delays. However, optimizing something that is already at the
> > microsecond level can be very hard.
>
> That is true - are there any significant gains which can be accrued by
> putting efforts here or, should this be a lower priority?
>

The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
and I) had encountered in the past [4]. The solution we used was to have
multiple rpc connections between a single brick and client, and it did
indeed fix the bottleneck. So there is definitely work involved here -
either to fix the single-connection model or to go with the
multiple-connection model. It's preferable to improve the single
connection and resort to multiple connections only if the bottlenecks in a
single connection are not fixable. Personally, I think this is high
priority, along with having appropriate client-side caching.

[4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
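
As a purely illustrative sketch of the multiple-connection idea (several
transports to the same brick, with requests spread across them
round-robin): the structure and names below are invented for the example
and are not gluster's actual rpc-clnt code.

#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical handle for one transport/connection to a brick. */
typedef struct {
    int sockfd;              /* underlying socket */
    /* ... per-connection state (in-flight request table, etc.) ... */
} conn_t;

/* A client endpoint that owns several connections to the same brick. */
typedef struct {
    conn_t      *conns;      /* array of open connections */
    size_t       nconns;     /* how many connections were established */
    atomic_uint  next;       /* round-robin cursor */
} brick_client_t;

/* Pick the connection to use for the next request. Spreading requests
 * over several sockets avoids funnelling everything through one socket,
 * which is the bottleneck being discussed above. */
static conn_t *pick_connection(brick_client_t *bc)
{
    unsigned idx = atomic_fetch_add(&bc->next, 1) % bc->nconns;
    return &bc->conns[idx];
}

The trade-off is more sockets and more server-side state per client, which
is one reason to first try to fix the single-connection path.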


> > We need to determine what causes the fluctuations on the brick side and
> > avoid them.
> > This scenario is very similar to a smallfile/metadata workload, so this
> > is probably one important cause of its bad performance.
>
> What kind of instrumentation is required to enable the determination?
>
> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
> wrote:
> >
> > Hi,
> >
> > I've done some tracing of the latency that the network layer introduces
> > in gluster. I've made the analysis as part of the pgbench performance
> > issue (in particular the initialization and scaling phase), so I decided
> > to look at READV for this particular workload, but I think the results
> > can be extrapolated to other operations that also have small latency
> > (cached data from the FS, for example).
> >
> > Note that measuring latencies introduces some latency. It consists of a
> > call to clock_get_time() for each probe point, so the real latency will
> > be a bit lower, but still proportional to these numbers.
> >
>
> [snip]
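
To make the probe overhead concrete, here is a small self-contained sketch
of the kind of measurement described above: two monotonic-clock probe
points bracketing the code of interest. This is only an illustration, not
the actual instrumentation used for the analysis.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Nanoseconds elapsed between two monotonic-clock probe points. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL +
           (end->tv_nsec - start->tv_nsec);
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* probe point: before */
    /* ... code path being measured, e.g. submitting a READV ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* probe point: after */

    /* Each probe adds the cost of one clock_gettime() call, so the
     * reported latency is slightly higher than the true latency. */
    printf("latency: %lld ns\n", (long long)elapsed_ns(&t0, &t1));
    return 0;
}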
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Implementing multiplexing for self heal client.

2018-12-24 Thread RAFI KC



On 12/21/18 6:56 PM, Sankarshan Mukhopadhyay wrote:

> On Fri, Dec 21, 2018 at 6:30 PM RAFI KC  wrote:
>
> > Hi All,
> >
> > What is the problem?
> > As of now the self-heal client runs as one daemon per node; this means
> > even if there are multiple volumes, there will only be one self-heal
> > daemon. So for each configuration change in the cluster to take effect,
> > the self-heal daemon has to be reconfigured, but it doesn't have the
> > ability to reconfigure dynamically. This means that when you have a lot
> > of volumes in the cluster, every management operation that involves
> > configuration changes, like volume start/stop, add/remove brick, etc.,
> > will result in a self-heal daemon restart. If such operations are
> > executed often, this not only slows down self-heal for a volume but
> > also increases the self-heal logs substantially.
>
> What is the value of the number of volumes when you write "lot of
> volumes"? 1000 volumes, or more?


Yes, more than 1000 volumes. It also depends on how often you execute
glusterd management operations (mentioned above). Each time the self-heal
daemon is restarted, it prints the entire graph, and these graph traces
contribute the majority of the log's size.

> > How to fix it?
> >
> > We are planning to follow a procedure of attaching/detaching graphs
> > dynamically, similar to brick multiplexing. The detailed steps are as
> > below:
> >
> > 1) The first step is to make shd a per-volume daemon, to
> > generate/reconfigure volfiles on a per-volume basis.
> >
> >    1.1) This will help to attach the volfiles easily to the existing
> > shd daemon.
> >
> >    1.2) This will help to send notifications to the shd daemon, as each
> > volinfo keeps the daemon object.
> >
> >    1.3) Reconfiguring a particular subvolume is easier, as we can check
> > the topology better.
> >
> >    1.4) With this change the volfiles will be moved to the
> > workdir/vols/ directory.
> >
> > 2) Writing new rpc requests like an attach/detach_client_graph function
> > to support client attach/detach.
> >
> >    2.1) Also, functions like graph reconfigure and mgmt_getspec_cbk
> > have to be modified.
> >
> > 3) Safely detaching a subvolume when there are pending frames to
> > unwind.
> >
> >    3.1) We can mark the client disconnected and make all the frames
> > unwind with ENOTCONN.
> >
> >    3.2) We can wait for all the I/O to unwind until the new updated
> > subvol attaches.
> >
> > 4) Handle scenarios like glusterd restart, node reboot, etc.
> >
> > At the moment we are not planning to limit the number of heal
> > subvolumes per process, because with the current approach heal for
> > every volume was already being done from a single process, and we have
> > not heard any major complaints about this.
>
> Is the plan to not ever limit, or to have a throttle set to a default
> high(er) value? How would system resources be impacted if the proposed
> design is implemented?


The plan is to implement it in a way that can support more than one
multiplexed self-heal daemon. The throttling function as of now returns
the same process to multiplex into, but it can easily be modified to
create a new process.


This multiplexing logic won't utilize any additional resources beyond
what it currently does.
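
As an illustration of the throttling hook described above, here is a rough
sketch that either reuses an existing shd process with spare capacity or
signals that a new one should be spawned. The names and the max_graphs
limit are assumptions made up for the example, not the actual
implementation.

#include <stddef.h>

/* Hypothetical view of one running self-heal daemon process. */
typedef struct shd_proc {
    int               pid;
    size_t            attached_graphs;   /* volumes multiplexed into it */
    struct shd_proc  *next;
} shd_proc_t;

/* Pick a process to attach the next volume's graph to. Returning NULL
 * tells the caller to spawn a new shd process. With max_graphs set very
 * high this degenerates to "always reuse the same process" (the current
 * behaviour described above); lowering it throttles how many volumes
 * share one daemon. */
static shd_proc_t *shd_select_process(shd_proc_t *procs, size_t max_graphs)
{
    for (shd_proc_t *p = procs; p != NULL; p = p->next) {
        if (p->attached_graphs < max_graphs)
            return p;
    }
    return NULL;   /* caller should fork/exec a new daemon */
}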



Rafi KC




___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Weekly Untriaged Bugs

2018-12-24 Thread jenkins
[...truncated 6 lines...]
https://bugzilla.redhat.com/1654778 / build: Please update GlusterFS 
documentation to describe how to do a non-root install
https://bugzilla.redhat.com/1660404 / core: Conditional freeing of string after 
returning from dict_set_dynstr function
https://bugzilla.redhat.com/1655901 / core: glusterfsd 5.1 and 5.2 crashes in 
socket.so
https://bugzilla.redhat.com/1657645 / core: [Glusterfs-server-5.1] Gluster 
storage domain creation fails on MountError
https://bugzilla.redhat.com/1654021 / core: Gluster volume heal causes 
continuous info logging of "invalid argument"
https://bugzilla.redhat.com/1657202 / core: Possible memory leak in 5.1 brick 
process
https://bugzilla.redhat.com/1654398 / core: 
tests/bugs/core/brick-mux-fd-cleanup.t is failing
https://bugzilla.redhat.com/1658108 / disperse: [disperse] Dump respective 
itables in  EC to statedumps.
https://bugzilla.redhat.com/1658472 / disperse: Mountpoint not accessible for 
few seconds when bricks are brought down to max redundancy after reset brick
https://bugzilla.redhat.com/1654753 / distribute: A distributed-disperse volume 
crashes when a symbolic link is renamed
https://bugzilla.redhat.com/1653250 / encryption-xlator: memory-leak in crypt 
xlator (glusterfs client)
https://bugzilla.redhat.com/1659334 / fuse: FUSE mount seems to be hung and not 
accessible
https://bugzilla.redhat.com/1659824 / fuse: Unable to mount gluster fs on 
glusterfs client: Transport endpoint is not connected
https://bugzilla.redhat.com/1657743 / fuse: Very high memory usage (25GB) on 
Gluster FUSE mountpoint
https://bugzilla.redhat.com/1656415 / geo-replication: geo-rep: arbiter test 
case fails on verify_hardlink_rename_data
https://bugzilla.redhat.com/1655333 / geo-replication: OSError: [Errno 116] 
Stale file handle due to rotated files
https://bugzilla.redhat.com/1654642 / gluster-smb: Very high memory usage with 
glusterfs VFS module
https://bugzilla.redhat.com/1657607 / posix: Convert nr_files to gf_atomic in 
posix_private structure
https://bugzilla.redhat.com/1659371 / posix: posix_janitor_thread_proc has bug 
that can't go into the janitor_walker if change the system time forward and 
change back
https://bugzilla.redhat.com/1659374 / posix: posix_janitor_thread_proc has bug 
that can't go into the janitor_walker if change the system time forward and 
change back
https://bugzilla.redhat.com/1659378 / posix: posix_janitor_thread_proc has bug 
that can't go into the janitor_walker if change the system time forward and 
change back
https://bugzilla.redhat.com/1657860 / project-infrastructure: Archives for 
ci-results mailinglist are getting wiped (with each mail?)
https://bugzilla.redhat.com/1659934 / project-infrastructure: Cannot 
unsubscribe the review.gluster.org
https://bugzilla.redhat.com/1659394 / project-infrastructure: Maintainer 
permissions on gluster-mixins project for Ankush
https://bugzilla.redhat.com/1658742 / rpc: Inconsistent type for 'remote-port' 
parameter
https://bugzilla.redhat.com/1657398 / transport: Unable to mount with custom 
certificate file
[...truncated 2 lines...]

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Implementing multiplexing for self heal client.

2018-12-24 Thread Sankarshan Mukhopadhyay
On Fri, Dec 21, 2018 at 6:30 PM RAFI KC  wrote:
>
> Hi All,
>
> What is the problem?
> As of now the self-heal client runs as one daemon per node; this means
> even if there are multiple volumes, there will only be one self-heal
> daemon. So for each configuration change in the cluster to take effect,
> the self-heal daemon has to be reconfigured, but it doesn't have the
> ability to reconfigure dynamically. This means that when you have a lot
> of volumes in the cluster, every management operation that involves
> configuration changes, like volume start/stop, add/remove brick, etc.,
> will result in a self-heal daemon restart. If such operations are
> executed often, this not only slows down self-heal for a volume but also
> increases the self-heal logs substantially.

What is the value of the number of volumes when you write "lot of
volumes"? 1000 volumes, or more?

>
>
> How to fix it?
>
> We are planning to follow a procedure of attaching/detaching graphs
> dynamically, similar to brick multiplexing. The detailed steps are as
> below:
>
> 1) The first step is to make shd a per-volume daemon, to
> generate/reconfigure volfiles on a per-volume basis.
>
>    1.1) This will help to attach the volfiles easily to the existing shd
> daemon.
>
>    1.2) This will help to send notifications to the shd daemon, as each
> volinfo keeps the daemon object.
>
>    1.3) Reconfiguring a particular subvolume is easier, as we can check
> the topology better.
>
>    1.4) With this change the volfiles will be moved to the workdir/vols/
> directory.
>
> 2) Writing new rpc requests like an attach/detach_client_graph function
> to support client attach/detach.
>
>    2.1) Also, functions like graph reconfigure and mgmt_getspec_cbk have
> to be modified.
>
> 3) Safely detaching a subvolume when there are pending frames to unwind.
>
>    3.1) We can mark the client disconnected and make all the frames
> unwind with ENOTCONN.
>
>    3.2) We can wait for all the I/O to unwind until the new updated
> subvol attaches.
>
> 4) Handle scenarios like glusterd restart, node reboot, etc.
>
> At the moment we are not planning to limit the number of heal subvolumes
> per process, because with the current approach heal for every volume was
> already being done from a single process, and we have not heard any
> major complaints about this.

Is the plan to not ever limit, or to have a throttle set to a default
high(er) value? How would system resources be impacted if the proposed
design is implemented?
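
Regarding step 3.1 of the quoted plan (mark the client disconnected and
unwind pending frames with ENOTCONN), here is a rough self-contained
sketch of the idea. The types and function names are invented for
illustration and are not the actual client xlator code.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending request ("frame") awaiting a reply. */
typedef struct frame {
    struct frame *next;
    void (*unwind)(struct frame *f, int op_ret, int op_errno);
} frame_t;

/* Hypothetical per-subvolume client state. */
typedef struct {
    bool     connected;
    frame_t *pending;    /* frames still waiting on the brick */
} client_t;

/* Detach path: stop accepting new requests and fail everything that is
 * still outstanding with ENOTCONN so callers can retry once the updated
 * subvolume is attached. */
static void client_mark_disconnected(client_t *c)
{
    c->connected = false;

    while (c->pending != NULL) {
        frame_t *f = c->pending;
        c->pending = f->next;
        f->unwind(f, -1, ENOTCONN);   /* unwind with ENOTCONN */
    }
}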
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-24 Thread Sankarshan Mukhopadhyay
[pulling the conclusions up to enable better in-line]

> Conclusions:
>
> We should never have a volume with caching-related xlators disabled. The 
> price we pay for it is too high. We need to make them work consistently and 
> aggressively to avoid as many requests as we can.

Are there current issues in terms of behavior which are known/observed
when these are enabled?

> We need to analyze client/server xlators deeper to see if we can avoid some
> delays. However, optimizing something that is already at the microsecond
> level can be very hard.

That is true - are there any significant gains which can be accrued by
putting efforts here or, should this be a lower priority?

> We need to determine what causes the fluctuations on the brick side and
> avoid them.
> This scenario is very similar to a smallfile/metadata workload, so this is
> probably one important cause of its bad performance.

What kind of instrumentation is required to enable the determination?

On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez  wrote:
>
> Hi,
>
> I've done some tracing of the latency that the network layer introduces in
> gluster. I've made the analysis as part of the pgbench performance issue
> (in particular the initialization and scaling phase), so I decided to look
> at READV for this particular workload, but I think the results can be
> extrapolated to other operations that also have small latency (cached data
> from the FS, for example).
>
> Note that measuring latencies introduces some latency. It consists of a
> call to clock_get_time() for each probe point, so the real latency will be
> a bit lower, but still proportional to these numbers.
>

[snip]
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel