Re: [ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch

2016-10-18 Thread Radoslaw Garbacz
Hi,

I am sorry for the long delay, just I moved on with my tool and had to
narrow down the circumstances of this error.
So apparently this error is related to many remote nodes querying CIB (I
had an OCF monitor agent running on all nodes), once turned off it works
with ~100 nodes.


Parts of the logs from one of the remote nodes, which went offline:

[...]
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:569   )   trace: crm_remote_recv_once:non-blocking, exiting
read: Resource temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:588   )   trace: crm_remote_recv_once:Read less than the
advertised length: 622632 < 639978 bytes
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:644   )   debug: crm_remote_recv:recv() failed: Resource
temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:625   )   trace: crm_remote_recv:waiting to receive remote
msg, starting timeout 6, remaining_timeout 1000
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:566   )   trace: crm_remote_recv_once:Received 16384 more
bytes, 639016 total
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:588   )   trace: crm_remote_recv_once:Read less than the
advertised length: 639016 < 639978 bytes
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:644   )   debug: crm_remote_recv:recv() failed: Resource
temporarily unavailable (-11)
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
utils.c:1026  )   error: crm_abort:crm_remote_parse_buffer: Forked
child 15923 to record non-fatal assert at remote.c:426 :
remote->buffer[sizeof(struct crm_remote_header_v0) +
header->payload_uncompressed - 1] == 0
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:XML Error: Entity: line 1: parser
error : AttValue: ' expected
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:XML Error: Entity: line 1: parser
error : attributes construct error
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:XML Error: Entity: line 1: parser
error : Couldn't find end of Start Tag lrm_rsc_op line 1
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2668  )   error: crm_xml_err:XML Error: Entity: line 1: parser
error : Premature end of data in tag lrm_resource line 1
[...]

Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2731  ) warning: string2xml:Parsing failed (domain=1, level=3,
code=77): Premature end of data in tag lrmd_ipc_proxy line 1
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2738  )   error: string2xml:Couldn't fully parse 638976 chars:

Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
[...]

Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial 
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
xml.c:2740  )   error: string2xml:Partial   
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
strings.c:348   )info: crm_compress_string:Compressed 638868 bytes
into 20773 (ratio 30:1) in 112ms
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:361   )   trace: crm_remote_send:Sending len[0]=40,
start=6d726c3c
Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:

Re: [ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch

2016-09-30 Thread Jan Pokorný
On 30/09/16 11:28 -0500, Radoslaw Garbacz wrote:
> I have posted a question about this error attached to another thread, but
> because it was old and there is no answer I thought it could have been
> missed, so I am sorry for repeating it.
> 
> Regarding the problem.
> I have a cluster, and when the cluster gets bigger (around 40 remote nodes)
> some remote nodes go offline after a while and their logs report some
> message errors, there is no indication about anything wrong in the other
> logs.

I believe I would have a plausible explanation provided it may happen
(not sure now, perhaps the ipc proxy setup would allow it) that two
messages via the same connection are transmitted, with the second one
being read as part of the first one.

Could you please try running pacemaker_remoted with
"PCMK_trace_files=remote.c" in the respective "sysconfig" file?

> Details:
> - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote
> - maybe irrelevant, but either "cib" or "pengine" process goes to ~100% CPU
> - it does not happen immediately
> - smaller cluster (~20 remote nodes) does not show any problems
> - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64
> - corosync: 2.4.1-1.2.0da1.el6.x86_64
> - libqb-1.0.0-1.28.4dff.el6.x86_64
> - CentOS 6
> 
> Logs:
> 
> [...]
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> ipc_proxy_remove_provider:ipc proxy connection for client
> ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
> node disconnected.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> cancel_recurring_action:  Cancelling ocf operation
> monitor_all_monitor_191000
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_send_tls: Connection terminated rc = -53
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_send_tls: Connection terminated rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_send:  Failed to send remote msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning:
> send_client_notify:   Notification of client
> remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22
> failed
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice:
> lrmd_remote_client_destroy:   LRMD client disconnecting remote client -
> name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296-
> 725735d17e22
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> handle_new_connection:Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> handle_new_connection:Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> [...]

-- 
Jan (Poki)


pgp4MIpK8j5hh.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org