Public bug reported:
Consider an fstab entry that uses both volfile-server and
backupvolfile-server on an Ubuntu 14.04.2 LTS server:
mygluster:/mydir /var/mydir glusterfs defaults,nobootwait,nofail,_netdev,backupvolfile-server=mygluster-bak 0 0
If the host mygluster is reachable at boot time, the mount succeeds.
However, if mygluster is offline (because of a DNS error, for example)
and mygluster-bak is online, the mount fails at boot time.
The bug only occurs at boot time. After boot, if we run 'mount
/var/mydir', the mount works using the mygluster-bak server as
expected.
## How to reproduce
Put the following entry in your fstab:
non-existent:/mydir /var/mydir glusterfs defaults,nobootwait,nofail,_netdev,backupvolfile-server=mygluster-bak 0 0
Mount the filesystem and check that the mount succeeded:
$ mount /var/mydir; mount | grep mydir; umount /var/mydir
non-existent:/mydir on /var/mydir type fuse.glusterfs
(rw,default_permissions,allow_other,max_read=131072)
Now reboot your system a few times and note that the mount sometimes
fails. When it does, run 'mount /var/mydir' and the filesystem mounts
successfully:
$ mount | grep mydir
$ mount /var/mydir; mount | grep mydir
non-existent:/mydir on /var/mydir type fuse.glusterfs
(rw,default_permissions,allow_other,max_read=131072)
### Logs
The boot.log, dmesg, mountall and the gluster logfile (var-lib-glance-
images.log in my specific case) will be attached (each one in a
separate comment because of
https://bugs.launchpad.net/launchpad/+bug/82652).
However, the only log that really helps is the gluster logfile, with
entries like:
[glusterfsd.c:1910:main] 0-/usr/sbin/glusterfs: Started running
/usr/sbin/glusterfs version 3.4.2 (/usr/sbin/glusterfs --volfile-id=/mydir
--volfile-server=non-existent /var/mydir)
[name.c:249:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution
failed on host non-existent
[fuse-bridge.c:5260:fini] 0-fuse: Unmounting '/var/mydir'.
[glusterfsd.c:1910:main] 0-/usr/sbin/glusterfs: Started running
/usr/sbin/glusterfs version 3.4.2 (/usr/sbin/glusterfs --volfile-id=/mydir
--volfile-server=mygluster-bak /var/mydir)
[fuse-bridge.c:5016:init] 0-fuse: Mountpoint /var/mydir seems to have a
stale mount, run 'umount /var/mydir' and try again.
[xlator.c:390:xlator_init] 0-fuse: Initialization of volume 'fuse' failed,
review your volfile again
## Log analysis and debugging
The "Mountpoint /var/mydir seems to have a stale mount, run 'umount
/var/mydir' and try again" log helps a lot.
I've changed the /sbin/mount.glusterfs script to increase verbosity and
discovered some more useful info:
- First, mount.glusterfs runs: /usr/sbin/glusterfs --volfile-id=/mydir
--volfile-server=non-existent /var/mydir
- Next, it runs 'stat -c %i /var/mydir' to test whether the inode is 1 (a
successful mount at this mount point) or another number. In a normal mount
attempt (running 'mount /var/mydir' after boot), this step returns a large
number like 4198417. However, during boot it returns no output and prints the
following error to stderr: **stat: cannot stat ‘/var/mydir’: Transport endpoint
is not connected**;
- mount.glusterfs then runs: /usr/sbin/glusterfs
--volfile-id=/mydir --volfile-server=mygluster-bak /var/mydir;
- Again, it runs 'stat -c %i /var/mydir' and gets the same **Transport endpoint
is not connected** error;
- Finally, mount.glusterfs prints "Mount failed. Please check the log file
for more details.", runs "umount /var/mydir" and exits with status 1.
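The fallback flow above can be sketched as a small shell fragment. The function and variable names here are illustrative only, not the actual script's; the inode check mirrors the `[ $inode -ne 1 ]` test visible in the mount.glusterfs patch below:

```shell
# Simplified sketch of the mount.glusterfs fallback flow described above.
# Names are illustrative, not the actual script's.

# The script treats a root inode of 1 as a successful FUSE mount; any
# other inode (or a stat failure) counts as a failed attempt.
mount_succeeded() {
    inode=$(stat -c %i "$1" 2>/dev/null)
    [ "$inode" = "1" ]
}

try_server() {
    # $1 = volfile server, $2 = mount point
    /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server="$1" "$2" 2>/dev/null
    mount_succeeded "$2"
}

# The flow from the list above would then be (run on the affected host):
# try_server non-existent /var/mydir || try_server mygluster-bak /var/mydir \
#     || { echo "Mount failed. Please check the log file for more details."; umount /var/mydir; }
```

The bug is the window between the two `try_server` calls: the first attempt can leave the mount point in a transient state where stat fails, making the second attempt fail too.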
So I ran some tests to get more info about the "Transport
endpoint is not connected" error and discovered that it occurs for a
very short time after a mount error; it's possible to hit it at
any time. The following command will sometimes reproduce the error
(it's sporadic):
$ /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent
/var/mydir; stat -c %i /var/mydir
4198417
$ /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent
/var/mydir; stat -c %i /var/mydir
stat: cannot stat ‘/var/mydir’: Transport endpoint is not connected
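Since the window is short and the failure sporadic, a small loop makes the race easier to observe. This is a hypothetical helper (not part of my original tests); on the affected host the first argument would be the glusterfs command shown above:

```shell
# Hypothetical stress helper: run a mount command N times and count how
# often 'stat' fails right afterwards (the transient stale-endpoint state).
stress_mount() {
    mount_cmd="$1"; mount_point="$2"; tries="$3"
    failures=0
    i=1
    while [ "$i" -le "$tries" ]; do
        $mount_cmd >/dev/null 2>&1
        stat -c %i "$mount_point" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# On the affected host, for example:
# stress_mount "/usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent /var/mydir" /var/mydir 20
```

A nonzero count would confirm the transient "Transport endpoint is not connected" state after failed mount attempts.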
To gather yet more debug info, I modified the mount.glusterfs script
again to run 'fuser -m /var/mydir' right after the first 'stat' fails
with "Transport endpoint is not connected", to list any PIDs using the
filesystem, and got:
$ fuser -m /var/mydir:
1 371 1280 1758 2287 2503
$ ps -ww -up 1 371 1280 1758 2287 2503:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 16.8 0.0 34156 3532 ? Ss 16:36 0:04 /sbin/init
root 371 0.1 0.0 20132 976 ? S 16:36 0:00
@sbin/plymouthd --mode=boot --attach-to-session
--pid-file=/run/initramfs/plymouth.pid
statd 1280 0.0 0.0 21540 1396 ? Ss 16:36 0:00 rpc.statd
-L
syslog 1758 0.0 0.0 255840 1216 ? Ssl 16:36 0:00 rsyslogd
Unfortunately, I could not capture the output for PIDs 2287 and 2503.
I don't know whether the second mount error is related to these PIDs
returned by 'fuser' (or whether their presence is normal), or whether it's
related to the "Transport endpoint is not connected" error, but it may be
a starting point.
## Possible solutions and workaround
I investigated the "seems to have a stale mount, run 'umount ...' and
try again" message and found this commit in the upstream code:
https://github.com/gluster/glusterfs/commit/08041c.
The commit message says "Also, mount.glusterfs script unmounts
mount-point on mount failure to prevent hung mounts", referring to the
umount line in mount.glusterfs:
https://github.com/gluster/glusterfs/commit/08041c#diff-7829823331339149cb845ff035efff54R165.
I do not know whether running umount (as implemented after the last mount
error and suggested in the gluster log) fixes the "Transport endpoint
is not connected" error or only some other specific mount hang, but a
possible solution is adding this line before the second mount attempt
(after the first failure):
--- mount.glusterfs.orig	2015-06-12 01:02:18.943119823 -0300
+++ mount.glusterfs	2015-06-12 01:24:52.824311071 -0300
@@ -226,6 +226,7 @@
     if [ $inode -ne 1 ]; then
         err=1;
         if [ -n "$cmd_line1" ]; then
+            umount $mount_point > /dev/null 2>&1;
             cmd_line1=$(echo "$cmd_line1 $mount_point");
             $cmd_line1;
             err=0;
After this "patch", the mount using the backupvolfile-server (which
previously failed most of the time) worked most of the time, but it
still failed occasionally. The change that always fixed the mount was:
--- mount.glusterfs.orig	2015-06-12 01:02:18.943119823 -0300
+++ mount.glusterfs	2015-06-12 01:28:07.610199716 -0300
@@ -226,6 +226,7 @@
     if [ $inode -ne 1 ]; then
         err=1;
         if [ -n "$cmd_line1" ]; then
+            sleep 0.1;
             cmd_line1=$(echo "$cmd_line1 $mount_point");
             $cmd_line1;
             err=0;
I tested the last patch across many reboots (more than 60) and the
mount worked in all of them.
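The fixed 0.1 s sleep is a heuristic that happened to be long enough here. A bounded poll on the mount point would express the same workaround more robustly; this is only a sketch (the function name is mine, not mount.glusterfs's):

```shell
# Hypothetical alternative to the fixed sleep: poll until the mount point
# can be stat'ed again (i.e. the stale FUSE endpoint has cleared), or give
# up after a bounded number of tries.
wait_for_mount_point() {
    mount_point="$1"; max_tries="$2"
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        stat -c %i "$mount_point" >/dev/null 2>&1 && return 0
        sleep 0.1
        i=$((i + 1))
    done
    return 1
}
```

Inserted before the second mount attempt (e.g. `wait_for_mount_point $mount_point 10`), this would wait only as long as the stale endpoint actually persists instead of a fixed interval.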
### Why not backport a fix from upstream
Since commit https://github.com/gluster/glusterfs/commit/b610f1,
upstream no longer uses two mount attempts in mount.glusterfs (it uses
another solution). So, although that commit solves the problem, it makes
no sense to backport it: it changes the gluster client behavior and
should not be shipped in a bugfix update.
** Affects: glusterfs (Ubuntu)
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1464494
Title:
Gluster mount using volfile-bak fails on boot