Hi,

I am sorry for the long delay, just I moved on with my tool and had to
narrow down the circumstances of this error.
So apparently this error is related to many remote nodes querying CIB (I
had an OCF monitor agent running on all nodes), once turned off it works
with ~100 nodes.


Parts of the logs from one of the remote nodes, which went offline:

[...]
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:569   )   trace: crm_remote_recv_once:    non-blocking, exiting
read: Resource temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:588   )   trace: crm_remote_recv_once:    Read less than the
advertised length: 622632 < 639978 bytes
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:644   )   debug: crm_remote_recv:    recv() failed: Resource
temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:625   )   trace: crm_remote_recv:    waiting to receive remote
msg, starting timeout 60000, remaining_timeout 1000
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:566   )   trace: crm_remote_recv_once:    Received 16384 more
bytes, 639016 total
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:588   )   trace: crm_remote_recv_once:    Read less than the
advertised length: 639016 < 639978 bytes
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:644   )   debug: crm_remote_recv:    recv() failed: Resource
temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
utils.c:1026  )   error: crm_abort:    crm_remote_parse_buffer: Forked
child 15923 to record non-fatal assert at remote.c:426 :
remote->buffer[sizeof(struct crm_remote_header_v0) +
header->payload_uncompressed - 1] == 0
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:    XML Error: Entity: line 1: parser
error : AttValue: ' expected
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:    XML Error: Entity: line 1: parser
error : attributes construct error
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:    XML Error: Entity: line 1: parser
error : Couldn't find end of Start Tag lrm_rsc_op line 1
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:    XML Error: Entity: line 1: parser
error : Premature end of data in tag lrm_resource line 1
[...]

Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2731  ) warning: string2xml:    Parsing failed (domain=1, level=3,
code=77): Premature end of data in tag lrmd_ipc_proxy line 1
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2738  )   error: string2xml:    Couldn't fully parse 638976 chars:
<lrmd_ipc_proxy lrmd_ipc_op="response"
lrmd_ipc_session="dba1138a-b075-4ff0-bf8f-7b1c32e82553" lrmd_ipc_msg_id="5"
lrmd_op="ipc_fwd" lrmd_remote_msg_id="5777"
lrmd_remote_msg_type="request"><lrmd_ipc_msg><cib-reply t="cib"
cib_op="cib_query" cib_callid="4"
cib_clientid="18024dbd-c874-4b20-8f75-7f027201961d" cib_callopt="4352"
cib_rc="0"><cib_calldata><cib admin_epoch="6
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial   <lrmd_ipc_proxy
lrmd_ipc_op="response"
lrmd_ipc_session="dba1138a-b075-4ff0-bf8f-7b1c32e82553" lrmd_ipc_msg_id="5"
lrmd_op="ipc_fwd" lrmd_remote_msg_id="5777" lrmd_remote_msg_type="request">
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial     <lrmd_ipc_msg>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial       <cib-reply t="cib"
cib_op="cib_query" cib_callid="4"
cib_clientid="18024dbd-c874-4b20-8f75-7f027201961d" cib_callopt="4352"
cib_rc="0">
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial         <cib_calldata>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial           <cib
admin_epoch="6" crm_feature_set="3.0.11" epoch="445" num_updates="144"
validate-with="pacemaker-2.6" cib-last-written="Tue Oct 18 16:01:11 2016"
update-origin="ip-10-169-147-9" update-client="crm_resource"
update-user="ip-10-165-93-92" have-quorum="1" dc-uuid="1">
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial             <configuration>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial               <crm_config>
[...]

Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial                 </lrm>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial               </node_state>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial             </status>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial           </cib>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial         </cib_calldata>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial       </cib-reply>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial     </lrmd_ipc_msg>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:    Partial   </lrmd_ipc_proxy>
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
strings.c:348   )    info: crm_compress_string:    Compressed 638868 bytes
into 20773 (ratio 30:1) in 112ms
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:361   )   trace: crm_remote_send:    Sending len[0]=40,
start=6d726c3c
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 40
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:254   )   trace: crm_send_tls:    Sent all 40 bytes
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 124
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:254   )   trace: crm_send_tls:    Sent all 124 bytes
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:625   )   trace: crm_remote_recv:    waiting to receive remote
msg, starting timeout 60000, remaining_timeout 60000
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:566   )   trace: crm_remote_recv_once:    Received 962 more bytes,
962 total
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
utils.c:1026  )   error: crm_abort:    crm_remote_header: Forked child
15924 to record non-fatal assert at remote.c:119 : endian == ENDIAN_LOCAL
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:122   )   error: crm_remote_header:    Invalid message detected,
endian mismatch: badadbbd is neither 312d3031 nor the swab'd 31302d31
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:644   )   debug: crm_remote_recv:    recv() failed: Resource
temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:625   )   trace: crm_remote_recv:    waiting to receive remote
msg, starting timeout 60000, remaining_timeout 60000
Oct 18 16:02:04 [10504] ip-10-203-186-119 pacemaker_remoted: (
utils.c:1026  )   error: crm_abort:    crm_remote_header: Forked child
15959 to record non-fatal assert at remote.c:119 : endian == ENDIAN_LOCAL
Oct 18 16:02:04 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:122   )   error: crm_remote_header:    Invalid message detected,
endian mismatch: badadbbd is neither 312d3031 nor the swab'd 31302d31
Oct 18 16:02:04 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:566   )   trace: crm_remote_recv_once:    Received 40 more bytes,
1002 total
Oct 18 16:02:04 [10504] ip-10-203-186-119 pacemaker_remoted: (
utils.c:1026  )   error: crm_abort:    crm_remote_header: Forked child
15960 to record non-fatal assert at remote.c:119 : endian == ENDIAN_LOCAL
Oct 18 16:02:04 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:122   )   error: crm_remote_header:    Invalid message detected,
endian mismatch: badadbbd is neither 312d3031 nor the swab'd 31302d31
[...]

Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:122   )   error: crm_remote_header:    Invalid message detected,
endian mismatch: badadbbd is neither 312d3031 nor the swab'd 31302d31
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted:
(tls_backend.:129   )    info: lrmd_remote_client_msg:    Client disconnect
detected in tls msg dispatcher.
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
ipc_proxy.c:402   )    info: ipc_proxy_remove_provider:    ipc proxy
connection for client dba1138a-b075-4ff0-bf8f-7b1c32e82553 pid 15764
destroyed because cluster node disconnected.
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:361   )   trace: crm_remote_send:    Sending len[0]=40,
start=6d726c3c
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 40
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:254   )   trace: crm_send_tls:    Sent all 40 bytes
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 149
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:254   )   trace: crm_send_tls:    Sent all 149 bytes
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
services.c:461   )    info: cancel_recurring_action:    Cancelling ocf
operation dbx_mounts_nodes_monitor_103000
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:361   )   trace: crm_remote_send:    Sending len[0]=40,
start=6d726c3c
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 40
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:246   )   error: crm_send_tls:    Connection terminated rc = -53
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 916
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:246   )   error: crm_send_tls:    Connection terminated rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:364   )   error: crm_remote_send:    Failed to send remote msg, rc
= -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted:
(lrmd_client.:584   )   error: lrmd_tls_send_msg:    Failed to send remote
lrmd tls msg, rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
lrmd.c:523   ) warning: send_client_notify:    Notification of client
remote-lrmd-ip-10-203-186-119:3121/c20a9a4e-b919-4e8a-8167-0cfa846fb24c
failed
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
services.c:461   )    info: cancel_recurring_action:    Cancelling ocf
operation dbx_nfs_mounts_datas_monitor_137000
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:361   )   trace: crm_remote_send:    Sending len[0]=40,
start=6d726c3c
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 40
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:246   )   error: crm_send_tls:    Connection terminated rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:    Message size: 903
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:246   )   error: crm_send_tls:    Connection terminated rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:364   )   error: crm_remote_send:    Failed to send remote msg, rc
= -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted:
(lrmd_client.:584   )   error: lrmd_tls_send_msg:    Failed to send remote
lrmd tls msg, rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
lrmd.c:523   ) warning: send_client_notify:    Notification of client
remote-lrmd-ip-10-203-186-119:3121/c20a9a4e-b919-4e8a-8167-0cfa846fb24c
failed
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
services.c:461   )    info: cancel_recurring_action:    Cancelling ocf
operation dbx_gss_datas_monitor_127000






On Fri, Sep 30, 2016 at 4:53 PM, Jan Pokorný <jpoko...@redhat.com> wrote:

> On 30/09/16 11:28 -0500, Radoslaw Garbacz wrote:
> > I have posted a question about this error attached to another thread, but
> > because it was old and there is no answer I thought it could have been
> > missed, so I am sorry for repeating it.
> >
> > Regarding the problem.
> > I have a cluster, and when the cluster gets bigger (around 40 remote
> nodes)
> > some remote nodes go offline after a while and their logs report some
> > message errors, there is no indication about anything wrong in the other
> > logs.
>
> I believe I would have a plausible explanation provided it may happen
> (not sure now, perhaps the ipc proxy setup would allow it) that two
> messages via the same connection are transmitted, with the second one
> being read as part of the first one.
>
> Could you please try running pacemaker_remoted with
> "PCMK_trace_files=remote.c" in the respective "sysconfig" file?
>
> > Details:
> > - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote
> > - maybe irrelevant, but either "cib" or "pengine" process goes to ~100%
> CPU
> > - it does not happen immediately
> > - smaller cluster (~20 remote nodes) does not show any problems
> > - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64
> > - corosync: 2.4.1-1.2.0da1.el6.x86_64
> > - libqb-1.0.0-1.28.4dff.el6.x86_64
> > - CentOS 6
> >
> > Logs:
> >
> > [...]
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_abort:        crm_remote_header: Triggered assert at remote.c:119 :
> > endian == ENDIAN_LOCAL
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_remote_header:        Invalid message detected, endian mismatch:
> > badadbbd is neither 63646330 nor the swab'd 30636463
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_abort:        crm_remote_header: Triggered assert at remote.c:119 :
> > endian == ENDIAN_LOCAL
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_remote_header:        Invalid message detected, endian mismatch:
> > badadbbd is neither 63646330 nor the swab'd 30636463
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_abort:        crm_remote_header: Triggered assert at remote.c:119 :
> > endian == ENDIAN_LOCAL
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_remote_header:        Invalid message detected, endian mismatch:
> > badadbbd is neither 63646330 nor the swab'd 30636463
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info:
> > lrmd_remote_client_msg:   Client disconnect detected in tls msg
> dispatcher.
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info:
> > ipc_proxy_remove_provider:        ipc proxy connection for client
> > ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
> > node disconnected.
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info:
> > cancel_recurring_action:  Cancelling ocf operation
> > monitor_all_monitor_191000
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_send_tls:     Connection terminated rc = -53
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_send_tls:     Connection terminated rc = -10
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > crm_remote_send:  Failed to send remote msg, rc = -10
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > lrmd_tls_send_msg:        Failed to send remote lrmd tls msg, rc = -10
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning:
> > send_client_notify:       Notification of client
> > remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22
> > failed
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice:
> > lrmd_remote_client_destroy:       LRMD client disconnecting remote
> client -
> > name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296-
> > 725735d17e22
> > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > handle_new_connection:    Error in connection setup (19626-21815-14):
> > Remote I/O error (121)
> > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> > handle_new_connection:    Error in connection setup (19626-21815-14):
> > Remote I/O error (121)
> > [...]
>
> --
> Jan (Poki)
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Reply via email to