Re: [Gluster-users] Intermittent mount disconnect due to socket poller error

2018-03-07 Thread Ryan Lee
I happened to review the status of volume clients and realized they were 
reporting a mix of different op-versions: 3.13 clients were still 
connecting to the downgraded 3.12 server (likely a timing issue between 
downgrading clients and mounting volumes).  Remounting the reported 
clients has resulted in the correct op-version all around and about a 
week free of these errors.
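
For reference, a rough sketch of the check and the remount, assuming the
volume is named VOL and mounted at /mnt/VOL on the client (both are
placeholders here, and exact command output varies by release):

    # on a server: confirm the cluster-wide op-version
    % gluster volume get all cluster.op-version

    # list the clients currently connected to each brick
    % gluster volume status VOL clients

    # on a stale client: remount so it reconnects to the 3.12 bricks
    % umount /mnt/VOL
    % mount -t glusterfs SERVER1:/VOL /mnt/VOL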


On 2018-03-01 12:38, Ryan Lee wrote:
Thanks for your response - is there more that would be useful in 
addition to what I already attached?  We're logging at default level on 
the brick side and at error on clients.  I could turn it up for a few 
days to try to catch this problem in action (it's happened several more 
times since I first wrote).
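
If it would help in the meantime, a sketch of how the levels could be
raised temporarily (VOL as above; DEBUG is chatty, so it would only stay
on for a few days):

    % gluster volume set VOL diagnostics.client-log-level DEBUG
    % gluster volume set VOL diagnostics.brick-log-level DEBUG

    # back to the defaults afterwards
    % gluster volume reset VOL diagnostics.client-log-level
    % gluster volume reset VOL diagnostics.brick-log-level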


On 2018-02-28 18:38, Raghavendra Gowdappa wrote:

Is it possible to attach logfiles of problematic client and bricks?

On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee wrote:


    We've been on the Gluster 3.7 series for several years with things
    pretty stable.  Given that it's reached EOL, yesterday I upgraded to
    3.13.2.  Every Gluster mount and server was disabled then brought
    back up after the upgrade, changing the op-version to 31302 and then
    trying it all out.
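
    For reference, the op-version bump itself would normally be a single
    command of roughly this shape, run once every peer is on the new
    version (31302 being the op-version that corresponds to 3.13.2):

        # run on any one server after all peers are upgraded
        % gluster volume set all cluster.op-version 31302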

    It went poorly.  Every sizable read and write (hundreds of MB) led
    to 'Transport endpoint not connected' errors on the command line
    and immediate unavailability of the mount.  After searching
    unsuccessfully for similar problems and solutions, I ended up
    downgrading to 3.12.6 and changing the op-version to 31202.  That
    brought us back to usability with the majority of those operations
    succeeding enough to consider it online, but there are still
    occasional mount disconnects that we never saw with 3.7 - about 6 in
    the past 18 hours.  A disconnected mount never seems to come back on
    its own; manually remounting reconnects immediately.  Each
    disconnect affects only the one client, though a few clients have
    dropped at the same time when they were active at the same time.
    The lower-level log info seems to indicate a
    socket problem, potentially broken on the client side based on
    timing (but the timing is narrow, and I won't claim the clocks are
    that well synchronized across all our servers).  The client and one
    server claim a socket polling error with no data available, and the
    other server claims a writev error.  This seems to lead the client
    to the 'all subvolumes are down' state, even though all other
    clients are still connected.  Has anybody run into this?  Did I miss
    anything moving so many versions ahead?

    I've included the output of volume info and some excerpts from the
    logs.   We have two servers running glusterd and two replica volumes
    with a brick on each server.  Both experience disconnects; there are
    about 10 clients for each, with one using both.  We use SSL over
    internal IPv4. Names in all caps were replaced, as were IP addresses.

    Let me know if there's anything else I can provide.

    % gluster v info VOL
    Volume Name: VOL
    Type: Replicate
    Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: SERVER1:/glusterfs/VOL-brick1/data
    Brick2: SERVER2:/glusterfs/VOL-brick2/data
    Options Reconfigured:
    config.transport: tcp
    features.selinux: off
    transport.address-family: inet
    nfs.disable: on
    client.ssl: on
    performance.readdir-ahead: on
    auth.ssl-allow: [NAMES, including CLIENT]
    server.ssl: on
    ssl.certificate-depth: 3
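
    For anyone setting up a similar volume, the SSL-related options above
    would have been applied with volume set commands of roughly this
    shape (the allow list below is a placeholder for the certificate
    names anonymized above):

        % gluster volume set VOL client.ssl on
        % gluster volume set VOL server.ssl on
        % gluster volume set VOL auth.ssl-allow 'SERVER1,SERVER2,CLIENT'
        % gluster volume set VOL ssl.certificate-depth 3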

    Log excerpts (there was nothing related in glusterd.log):

    CLIENT:/var/log/glusterfs/mnt-VOL.log
    [2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
    [2018-02-28 19:35:58.477154] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
    [2018-02-28 19:35:58.486146] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>

    [2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
    [2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)

    SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
    [2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
    [2018-02-28 19:35:58.380530] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
    [2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
    [2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
    [2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f97bd37ee6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f97bcf1f4c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f97bcf0abd5] ) 0-: Reply submission failed
    [2018-02-28 19:35:58.381196] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
    [2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
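
    One way to line the two sides up, for what it's worth, is to grep the
    client mount log and the brick log for the same few seconds around a
    disconnect (log paths as above, timestamp taken from the excerpts):

        % grep '2018-02-28 19:35:58' /var/log/glusterfs/mnt-VOL.log
        % grep '2018-02-28 19:35:58' /var/log/glusterfs/bricks/VOL-brick2.log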