Re: [Gluster-devel] Race in protocol/client and RPC

2018-02-01 Thread Shyam Ranganathan
On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> After having tried several things, it seems that solving these races
> will be complex. All attempts to fix them have caused failures in
> other connections. Since I have other work to do and this doesn't seem
> to be causing serious failures in production, I'll leave it for now
> and pick it up again when I have more time.

Xavi, could you convert the findings into a bug and post the details
there, so that it can be followed up? (If not already done.)

> 
> Xavi
> 
> On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez wrote:
> 
> Hi all,
> 
> I've identified a race in the RPC layer that causes spurious
> disconnections and CHILD_DOWN notifications.
> 
> The problem happens when protocol/client reconfigures a connection
> to move from glusterd to glusterfsd. This is done by calling
> rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> 
> This seems fine because client_rpc_notify() will call
> rpc_clnt_cleanup_and_start() when the disconnect notification is
> received. However, there's a problem.
> 
> Suppose the disconnection notification has been handled and we are
> just about to call rpc_clnt_cleanup_and_start(). If the reconnection
> timer fires at this point, rpc_clnt_reconnect() will run, the socket
> will be reconnected, a connection notification will be processed,
> and a handshake request will be sent to the server.
> 
> However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs
> are deleted. When the handshake reply arrives, we are unable to
> match its XID, so the request fails. The handshake therefore fails
> and the client is considered down, so a CHILD_DOWN notification is
> sent to the upper xlators.
> 
> In some tests this causes processing to start while a brick is
> unexpectedly down, leading to spurious test failures.
> 
> To solve the problem, I've made rpc_clnt_reconfig() disable the RPC
> connection using code similar to rpc_clnt_disable(). This prevents
> the background rpc_clnt_reconnect() timer from being executed,
> avoiding the race.
> 
> This seems to work fine for many tests, but it appears to cause
> some issues in gfapi-based tests. I'm still investigating.
> 
> Xavi
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Race in protocol/client and RPC

2018-02-01 Thread Xavi Hernandez
After having tried several things, it seems that solving these races will be
complex. All attempts to fix them have caused failures in other connections.
Since I have other work to do and this doesn't seem to be causing serious
failures in production, I'll leave it for now and pick it up again when I
have more time.

Xavi

On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez wrote:

> Hi all,
>
> I've identified a race in the RPC layer that causes spurious disconnections
> and CHILD_DOWN notifications.
>
> The problem happens when protocol/client reconfigures a connection to move
> from glusterd to glusterfsd. This is done by calling rpc_clnt_reconfig()
> followed by rpc_transport_disconnect().
>
> This seems fine because client_rpc_notify() will call
> rpc_clnt_cleanup_and_start() when the disconnect notification is received.
> However, there's a problem.
>
> Suppose the disconnection notification has been handled and we are just
> about to call rpc_clnt_cleanup_and_start(). If the reconnection timer fires
> at this point, rpc_clnt_reconnect() will run, the socket will be
> reconnected, a connection notification will be processed, and a handshake
> request will be sent to the server.
>
> However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs are
> deleted. When the handshake reply arrives, we are unable to match its XID,
> so the request fails. The handshake therefore fails and the client is
> considered down, so a CHILD_DOWN notification is sent to the upper xlators.
>
> In some tests this causes processing to start while a brick is unexpectedly
> down, leading to spurious test failures.
>
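The interleaving described above can be condensed into a small, self-contained
model. This is only an illustrative sketch in plain C (the names saved_xids,
send_request and cleanup_and_start are invented stand-ins, not GlusterFS APIs),
but it reproduces the essential ordering: the handshake XID is recorded before
the cleanup runs, so it is already gone by the time the reply arrives.

/* Illustrative model of the race, NOT GlusterFS code: all names here
 * (saved_xids, send_request, ...) are invented for the example. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_XIDS 16

/* Stand-in for the rpc_clnt table of requests waiting for a reply. */
static int saved_xids[MAX_XIDS];
static int saved_count;
static int next_xid = 1;

/* Send a request and remember its XID so the reply can be matched. */
static int send_request(const char *name)
{
    int xid = next_xid++;
    saved_xids[saved_count++] = xid;
    printf("sent %s with xid %d\n", name, xid);
    return xid;
}

/* Can a reply with this XID be matched to a pending request? */
static bool match_reply(int xid)
{
    for (int i = 0; i < saved_count; i++) {
        if (saved_xids[i] == xid)
            return true;
    }
    return false;
}

/* Stand-in for rpc_clnt_cleanup_and_start(): drops every saved request. */
static void cleanup_and_start(void)
{
    saved_count = 0;
    printf("cleanup_and_start: dropped all saved xids\n");
}

/* Stand-in for the rpc_clnt_reconnect() timer callback: reconnects and
 * immediately sends the handshake request. */
static int reconnect_timer_fires(void)
{
    printf("reconnect timer: socket reconnected\n");
    return send_request("handshake");
}

int main(void)
{
    /* The DISCONNECT notification has been handled and
     * cleanup_and_start() is about to run... */
    int handshake_xid = reconnect_timer_fires(); /* ...but the timer wins */
    cleanup_and_start();            /* the handshake xid is dropped here */

    /* The handshake reply arrives, but its XID can no longer be matched,
     * so the handshake fails and CHILD_DOWN is propagated upwards. */
    if (!match_reply(handshake_xid))
        printf("reply with xid %d is unknown -> handshake fails\n",
               handshake_xid);
    return 0;
}

Running this prints the failure path described above: the reply comes back
with an XID that is no longer known, which is the point where the real
client would declare the handshake failed and send CHILD_DOWN upwards.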
> To solve the problem, I've made rpc_clnt_reconfig() disable the RPC
> connection using code similar to rpc_clnt_disable(). This prevents the
> background rpc_clnt_reconnect() timer from being executed, avoiding the
> race.
>
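A corresponding sketch of the workaround, again a simplified stand-alone model
under the same assumptions rather than the actual patch: marking the connection
disabled before the disconnect turns a late firing of the reconnect timer into
a no-op, so no premature handshake (and hence no orphaned XID) can be produced.

/* Illustrative model of the workaround, NOT the actual GlusterFS patch:
 * names and structure are invented for the example. */
#include <stdbool.h>
#include <stdio.h>

/* What rpc_clnt_disable()-like code would set before disconnecting. */
static bool conn_disabled;

/* Stand-in for rpc_clnt_reconfig() + rpc_transport_disconnect() with the
 * workaround applied: the connection is disabled first. */
static void reconfig_and_disconnect(void)
{
    conn_disabled = true;
    printf("reconfig: connection disabled, transport disconnected\n");
}

/* Stand-in for the rpc_clnt_reconnect() timer callback. */
static void reconnect_timer_cb(void)
{
    if (conn_disabled) {
        printf("reconnect timer: connection disabled, nothing to do\n");
        return; /* no premature handshake, no orphaned XID */
    }
    printf("reconnect timer: reconnecting and sending handshake\n");
}

int main(void)
{
    reconfig_and_disconnect();
    reconnect_timer_cb(); /* fires in the old race window, now harmless */
    return 0;
}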
> This seems to work fine for many tests, but it appears to cause some issues
> in gfapi-based tests. I'm still investigating.
>
> Xavi
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Race in protocol/client and RPC

2018-02-01 Thread Xavi Hernandez
On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan wrote:

> On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> > After having tried several things, it seems that solving these races
> > will be complex. All attempts to fix them have caused failures in
> > other connections. Since I have other work to do and this doesn't seem
> > to be causing serious failures in production, I'll leave it for now
> > and pick it up again when I have more time.
>
> Xavi, could you convert the findings into a bug and post the details
> there, so that it can be followed up? (If not already done.)
>

I've just created this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1541032


> >
> > Xavi
> >
> > On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez wrote:
> >
> > Hi all,
> >
> > I've identified a race in the RPC layer that causes spurious
> > disconnections and CHILD_DOWN notifications.
> >
> > The problem happens when protocol/client reconfigures a connection
> > to move from glusterd to glusterfsd. This is done by calling
> > rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> >
> > This seems fine because client_rpc_notify() will call
> > rpc_clnt_cleanup_and_start() when the disconnect notification is
> > received. However, there's a problem.
> >
> > Suppose the disconnection notification has been handled and we are
> > just about to call rpc_clnt_cleanup_and_start(). If the reconnection
> > timer fires at this point, rpc_clnt_reconnect() will run, the socket
> > will be reconnected, a connection notification will be processed,
> > and a handshake request will be sent to the server.
> >
> > However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs
> > are deleted. When the handshake reply arrives, we are unable to
> > match its XID, so the request fails. The handshake therefore fails
> > and the client is considered down, so a CHILD_DOWN notification is
> > sent to the upper xlators.
> >
> > In some tests this causes processing to start while a brick is
> > unexpectedly down, leading to spurious test failures.
> >
> > To solve the problem, I've made rpc_clnt_reconfig() disable the RPC
> > connection using code similar to rpc_clnt_disable(). This prevents
> > the background rpc_clnt_reconnect() timer from being executed,
> > avoiding the race.
> >
> > This seems to work fine for many tests, but it appears to cause
> > some issues in gfapi-based tests. I'm still investigating.
> >
> > Xavi
> >
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 4.0: RC0 Tagged

2018-02-01 Thread Shyam Ranganathan
All pending actions as noted in the mail are complete, and 4.0.0rc0 has
been tagged.

This brings us to the most important phase of 4.0: testing the things
that are not covered by our automated regression runs.

For now, let's get some packages out (as soon as I trigger the release
job), and expect to hear more on the testing front in the upcoming days
and weeks.

Thanks,
Shyam
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Coverity covscan for 2018-02-01-d663b9a3 (master branch)

2018-02-01 Thread staticanalysis
GlusterFS Coverity covscan results are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2018-02-01-d663b9a3
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Release 4.0: Release notes (please read and contribute)

2018-02-01 Thread Shyam Ranganathan
On 01/29/2018 05:10 PM, Shyam Ranganathan wrote:
> Hi,
> 
> I have posted an initial draft version of the release notes here [1].
> 
> I would like to *suggest* the following contributors to help improve and
> finish the release notes by the 6th of Feb, 2018. As you read this mail, if
> you feel you cannot contribute, do let us know, so that we can find other
> appropriate contributors for those sections.

Reminder (1)

Please respond to confirm whether you will be able to provide the
release notes; the notes themselves can come in later.

This helps us plan for contingency in case you are unable to generate
the required notes.

Thanks!

> 
> NOTE: Please use the release tracker to post patches that modify the
> release notes; the bug ID is *1539842* (see [2]).
> 
> 1) Aravinda/Kotresh: Geo-replication section in the release notes
> 
> 2) Kaushal/Aravinda/ppai: GD2 section in the release notes
> 
> 3) Du/Poornima/Pranith: Performance section in the release notes
> 
> 4) Amar: Monitoring section in the release notes
> 
> Following are individual call outs for certain features:
> 
> 1) "Ability to force permissions while creating files/directories on a
> volume" - Niels
> 
> 2) "Replace MD5 usage to enable FIPS support" - Ravi, Amar
> 
> 3) "Dentry fop serializer xlator on brick stack" - Du
> 
> 4) "Add option to disable nftw() based deletes when purging the landfill
> directory" - Amar
> 
> 5) "Enhancements for directory listing in readdirp" - Nithya
> 
> 6) "xlators should not provide init(), fini() and others directly, but
> have class_methods" - Amar
> 
> 7) "New on-wire protocol (XDR) needed to support iattx and cleaner
> dictionary structure" - Amar
> 
> 8) "The protocol xlators should prevent sending binary values in a dict
> over the networks" - Amar
> 
> 9) "Translator to handle 'global' options" - Amar
> 
> Thanks,
> Shyam
> 
> [1] github link to draft release notes:
> https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md
> 
> [2] Initial gerrit patch for the release notes:
> https://review.gluster.org/#/c/19370/
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Replacing Centos 6 nodes with Centos 7

2018-02-01 Thread Nigel Babu
This seems to be working well so far. I noticed that this one wasn't
voting correctly; I've fixed that, and also fixed up all the jobs where
the voting wasn't accurate.

Overall regression time seems to have dropped to about 3.5h now, from 6h
or so. I attribute this to less slowdown in some specific test cases and
to the SSD disks. I'm going to add one more Centos 7 machine to the pool
today.

On Thu, Feb 1, 2018 at 9:26 AM, Nigel Babu wrote:

> Hello folks,
>
> Today, I'm putting the first Centos 7 node in our regression pool.
>
> slave28.cloud.gluster.org -> shut down and removed
> builder100.cloud.gluster.org -> new Centos 7 node (we'll be numbering new
> builders from 100 upwards)
>
> If this run goes well, we'll be replacing the nodes one by one with Centos
> 7. If you notice tests failing consistently on a Centos 7 node, please file
> a bug.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Release 3.12.6: Scheduled for the 12th of February

2018-02-01 Thread Jiffin Tony Thottan

Hi,

It's time to prepare the 3.12.6 release. The release falls on the 10th of
each month, but as that is a Saturday this time, it will be 12-02-2018.

This mail is to call out the following:

1) Are there any pending *blocker* bugs that need to be tracked for
3.12.6? If so, mark them against the provided tracker [1] as blockers
for the release, or at the very least post them as a response to this
mail.

2) Pending reviews in the 3.12 dashboard will be part of the release,
*iff* they pass regressions and have the required review votes, so use
the dashboard [2] to check on the status of your patches to 3.12 and
get these going.

3) I have checked what went into 3.10 after the 3.12 release and whether
those fixes are already included in the 3.12 branch. The status on this
is *green*, as all fixes ported to 3.10 have been ported to 3.12 as well.

Thanks,
Jiffin

[1] Release bug tracker:
https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.12.6

[2] 3.12 review dashboard:
https://review.gluster.org/#/projects/glusterfs,dashboards/dashboard:3-12-dashboard 

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel