Re: [Gluster-devel] About file descriptor leak in glusterfsd daemon after network failure

2014-08-25 Thread Jaden Liang
Hi Niels,

We have tested the patch for some days. It works well when the gluster peer
status changes to disconnected. However, if we restore the network just before
the peer status changes to disconnected, we find that glusterfsd still opens a
new fd and leaves the old one unreleased, even after we stop the process that
uses the file.
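
For reference, here is roughly how we check the fds held by glusterfsd on the
brick server. This is just a simplified sketch, equivalent to running
"ls -l /proc/<pid>/fd" against the glusterfsd process; the program and its
argument handling are illustrative, not our actual tooling.

/* List the open fds of a process, e.g. the glusterfsd daemon on the brick
 * server. A leaked fd shows up as a second entry pointing at the same
 * brick file after the cable is re-plugged. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char dirpath[PATH_MAX], linkpath[PATH_MAX * 2], target[PATH_MAX];
    DIR *dir;
    struct dirent *entry;
    ssize_t len;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(dirpath, sizeof(dirpath), "/proc/%s/fd", argv[1]);
    dir = opendir(dirpath);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;
        snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, entry->d_name);
        len = readlink(linkpath, target, sizeof(target) - 1);
        if (len < 0)
            continue;
        target[len] = '\0';
        printf("fd %s -> %s\n", entry->d_name, target);
    }
    closedir(dir);
    return 0;
}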

Why does glusterfsd open a new fd instead of reusing the original reopened fd?
Does glusterfsd have any kind of mechanism to reclaim such fds?
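
For context, the locking pattern in our service daemon is essentially the
following (a simplified sketch; the mount point and lock file path are
illustrative, not our actual code):

/* Acquire a non-blocking exclusive flock on a file inside the GlusterFS
 * mount. After the fd leak on the brick side, this is the call that fails
 * with EAGAIN ("Resource temporarily unavailable") when the daemon is
 * restarted. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/glustervol/service.lock"; /* illustrative path */
    int fd;

    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        fprintf(stderr, "flock failed: %s\n", strerror(errno));
        close(fd);
        return 1;
    }

    /* ... the daemon would do its work here while holding the lock ... */

    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}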



2014-08-20 21:54 GMT+08:00 Niels de Vos nde...@redhat.com:

 On Wed, Aug 20, 2014 at 07:16:16PM +0800, Jaden Liang wrote:
  Hi gluster-devel team,
 
  We are running a 2-replica volume on 2 servers. One of our service daemons
  opens a file with flock() in the volume. We can see that each glusterfsd
  daemon opens the replica file on its own server (visible in /proc/<pid>/fd).
  When we pull the cable of one server for about 10 minutes and then re-plug
  it, we find that glusterfsd opens a NEW file descriptor while still holding
  the old one that was opened on the first file access.
 
  Then we stop our service daemon, but glusterfsd (on the re-plugged server)
  only closes the new fd and leaves the old fd open, which we think may be an
  fd leak. When we restart our service daemon, it flocks the same file and the
  flock fails with "Resource temporarily unavailable" (EAGAIN).
 
  This situation does not reproduce every time, but it comes up often. We are
  still looking into the glusterfsd source code, but it is not an easy job, so
  we would like to ask for some help here. Our questions are:
 
  1. Has this issue been solved? Or is it a known issue?
  2. Does anyone know the file descriptor maintenance logic in
     glusterfsd (server-side)? When will an fd be closed or held?

 I think you are hitting bug 1129787:
 - https://bugzilla.redhat.com/show_bug.cgi?id=1129787
file locks are not released within an acceptable time when
a fuse-client uncleanly disconnects

 There has been a (short) discussion about this earlier, see
 http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040748.html

 Updating the proposed change is on my TODO list. In the end, the
 network.ping-timeout option should be used to define both the timeout towards
 the storage servers (as it is now) and the timeout from the storage server to
 the GlusterFS client.
 
 You can try out the patch at http://review.gluster.org/8065 and see if
 the network.tcp-timeout option works for you. Just remember that this
 option will get folded into network.ping-timeout later on. If you are
 interested in sending an updated patch, let me know :)

 Cheers,
 Niels




-- 
Best regards,
Jaden Liang


Re: [Gluster-devel] About file descriptor leak in glusterfsd daemon after network failure

2014-08-21 Thread Niels de Vos
It seems that this email was sent twice? Just in case you missed my
response to the other one, here it is:
- http://supercolony.gluster.org/pipermail/gluster-devel/2014-August/041972.html

Niels

On Wed, Aug 20, 2014 at 07:13:21PM +0800, Jaden Liang wrote:
 Hi gluster-devel team,
 
 We are running a 2-replica volume on 2 servers. One of our service daemons
 opens a file with flock() in the volume. We can see that each glusterfsd
 daemon opens the replica file on its own server (visible in /proc/<pid>/fd).
 When we pull the cable of one server for about 10 minutes and then re-plug
 it, we find that glusterfsd opens a NEW file descriptor while still holding
 the old one that was opened on the first file access.
 
 Then we stop our service daemon, but glusterfsd (on the re-plugged server)
 only closes the new fd and leaves the old fd open, which we think may be an
 fd leak. When we restart our service daemon, it flocks the same file and the
 flock fails with "Resource temporarily unavailable" (EAGAIN).
 
 This situation does not reproduce every time, but it comes up often. We are
 still looking into the glusterfsd source code, but it is not an easy job, so
 we would like to ask for some help here. Our questions are:
 
 1. Has this issue been solved? Or is it a known issue?
 2. Does anyone know the file descriptor maintenance logic in
    glusterfsd (server-side)? When will an fd be closed or held?
 
 Thank you very much.
 
 -- 
 Best regards,
 Jaden Liang



Re: [Gluster-devel] About file descriptor leak in glusterfsd daemon after network failure

2014-08-21 Thread Jaden's Gmail
Yes, that is another copy.
We are testing and reviewing the patch code, and will try some other scenarios
at the same time. We will report the results once we have confirmed a solution.

Thanks again for your reply.

Jaden Liang

 On 21 Aug 2014, at 17:00, Niels de Vos nde...@redhat.com wrote:
 
 It seems that this email was sent twice? Just in case you missed my
 response to the other one, here it is:
 - http://supercolony.gluster.org/pipermail/gluster-devel/2014-August/041972.html
 
 Niels
 
 On Wed, Aug 20, 2014 at 07:13:21PM +0800, Jaden Liang wrote:
 Hi gluster-devel team,
 
 We are running a 2-replica volume on 2 servers. One of our service daemons
 opens a file with flock() in the volume. We can see that each glusterfsd
 daemon opens the replica file on its own server (visible in /proc/<pid>/fd).
 When we pull the cable of one server for about 10 minutes and then re-plug
 it, we find that glusterfsd opens a NEW file descriptor while still holding
 the old one that was opened on the first file access.
 
 Then we stop our service daemon, but glusterfsd (on the re-plugged server)
 only closes the new fd and leaves the old fd open, which we think may be an
 fd leak. When we restart our service daemon, it flocks the same file and the
 flock fails with "Resource temporarily unavailable" (EAGAIN).
 
 This situation does not reproduce every time, but it comes up often. We are
 still looking into the glusterfsd source code, but it is not an easy job, so
 we would like to ask for some help here. Our questions are:
 
 1. Has this issue been solved? Or is it a known issue?
 2. Does anyone know the file descriptor maintenance logic in
    glusterfsd (server-side)? When will an fd be closed or held?
 
 Thank you very much.
 
 -- 
 Best regards,
 Jaden Liang
 


Re: [Gluster-devel] About file descriptor leak in glusterfsd daemon after network failure

2014-08-20 Thread Niels de Vos
On Wed, Aug 20, 2014 at 07:16:16PM +0800, Jaden Liang wrote:
 Hi gluster-devel team,
 
 We are running a 2-replica volume on 2 servers. One of our service daemons
 opens a file with flock() in the volume. We can see that each glusterfsd
 daemon opens the replica file on its own server (visible in /proc/<pid>/fd).
 When we pull the cable of one server for about 10 minutes and then re-plug
 it, we find that glusterfsd opens a NEW file descriptor while still holding
 the old one that was opened on the first file access.
 
 Then we stop our service daemon, but glusterfsd (on the re-plugged server)
 only closes the new fd and leaves the old fd open, which we think may be an
 fd leak. When we restart our service daemon, it flocks the same file and the
 flock fails with "Resource temporarily unavailable" (EAGAIN).
 
 This situation does not reproduce every time, but it comes up often. We are
 still looking into the glusterfsd source code, but it is not an easy job, so
 we would like to ask for some help here. Our questions are:
 
 1. Has this issue been solved? Or is it a known issue?
 2. Does anyone know the file descriptor maintenance logic in
    glusterfsd (server-side)? When will an fd be closed or held?

I think you are hitting bug 1129787:
- https://bugzilla.redhat.com/show_bug.cgi?id=1129787
   file locks are not released within an acceptable time when 
   a fuse-client uncleanly disconnects

There has been a (short) discussion about this earlier, see 
http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040748.html

Updating the proposed change is on my TODO list. In the end, the
network.ping-timeout option should be used to define both the timeout towards
the storage servers (as it is now) and the timeout from the storage server to
the GlusterFS client.

You can try out the patch at http://review.gluster.org/8065 and see if
the network.tcp-timeout option works for you. Just remember that this
option will get folded into network.ping-timeout later on. If you are
interested in sending an updated patch, let me know :)

Cheers,
Niels