Thanks Alex, I changed the timeout values as suggested, and the error has
mostly disappeared.
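For reference, this is roughly what the override looks like in my setup (a
sketch only; I set the values along the lines of your suggestion):

  export QPID_OPTS="-Dqpid.bdb.ha.remote_node_monitor_timeout=5000 \
    -Dqpid.bdb.ha.db_ping_socket_timeout=3000"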
However, I now have a much odder issue. I left the cluster running for the
past two days to see whether the timeout issue would go away. What I see,
however, is this:
2019-04-15 20:37:38,751 WARN  [IO-/10.81.215.198:24616]
(o.a.q.s.t.MultiVersionProtocolEngine) - Connection has taken more than
2000 ms to send complete protocol header.  Closing as possible DoS.
2019-04-15 20:37:38,751 INFO  [IO-/10.81.215.198:24616] (q.m.c.idle_close)
- [Broker] CON-1003 : Closed due to inactivity : Protocol header not
received within timeout period
2019-04-15 20:41:59,727 WARN  [IO-/10.81.215.198:58353]
(o.a.q.s.t.MultiVersionProtocolEngine) - Connection has taken more than
2000 ms to send complete protocol header.  Closing as possible DoS.
2019-04-15 20:41:59,727 INFO  [IO-/10.81.215.198:58353] (q.m.c.idle_close)
- [Broker] CON-1003 : Closed due to inactivity : Protocol header not
received within timeout period
2019-04-15 20:42:02,841 WARN  [IO-/10.81.215.198:15047]
(o.a.q.s.t.MultiVersionProtocolEngine) - Connection has taken more than
2000 ms to send complete protocol header.  Closing as possible DoS.
2019-04-15 20:42:02,842 INFO  [IO-/10.81.215.198:15047] (q.m.c.idle_close)
- [Broker] CON-1003 : Closed due to inactivity : Protocol header not
received within timeout period

This repeats for a while ... and then:

2019-04-15 20:45:05,856 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
2019-04-15 20:45:06,001 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
�
2019-04-15 20:45:06,146 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
2019-04-15 20:45:06,289 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
�%

2019-04-15 20:45:06,457 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
2019-04-15 20:45:06,623 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
2019-04-15 20:45:06,769 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
�
2019-04-15 20:45:06,923 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request: _
2019-04-15 20:45:07,094 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
�@'
2019-04-15 20:45:08,225 WARN  [ServiceDispatcher-node1:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
GET / HT

2019-04-15 20:45:16,383 INFO  [IO-/10.81.215.198:56910]
(q.m.p.unsupported_protocol_header) - [Broker] [port(9,000)] PRT-1007 :
Unsupported protocol header received "\x16\x03\x01\x02\x00\x01\x00\x01",
replying with AMQP_1_0
2019-04-15 20:45:16,543 INFO  [IO-/10.81.215.198:56930]
(q.m.p.unsupported_protocol_header) - [Broker] [port(9,000)] PRT-1007 :
Unsupported protocol header received "\x16\x03\x00\x00\xc0\x01\x00\x00",
replying with AMQP_1_0
2019-04-15 20:45:16,706 INFO  [IO-/10.81.215.198:56944]
(q.m.p.unsupported_protocol_header) - [Broker] [port(9,000)] PRT-1007 :
Unsupported protocol header received "\x16\x03\x01\x02\x00\x01\x00\x01",
replying with AMQP_1_0
2019-04-15 20:45:16,923 INFO  [IO-/10.81.215.198:56966]
(q.m.p.unsupported_protocol_header) - [Broker] [port(9,000)] PRT-1007 :
Unsupported protocol header received "\x80%\x01\x00\x02\x00\x0c\x00",
replying with AMQP_1_0
...
2019-04-15 20:48:53,014 WARN  [qtp1581198793-55] (o.e.j.u.URIUtil) -
/%80../%80../%80../%80../%80../%80../windows/win.ini
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8!
byte 80 in state 0
2019-04-15 20:48:54,178 WARN  [qtp1581198793-55] (o.e.j.u.URIUtil) -
/%80../%80../%80../%80../%80../%80../winnt/win.ini
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8!
byte 80 in state 0
2019-04-15 20:48:54,326 WARN  [qtp1581198793-53] (o.e.j.u.URIUtil) -
/%c0.%c0./%c0.%c0./%c0.%c0./%c0.%c0./%c0.%c0./windows/win.ini
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8!
byte C0 in state 0
2019-04-15 20:48:54,482 WARN  [qtp1581198793-53] (o.e.j.u.URIUtil) -
/%c0.%c0./%c0.%c0./%c0.%c0./%c0.%c0./%c0.%c0./winnt/win.ini
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8!
byte C0 in state 0
2019-04-15 20:48:54,630 WARN  [qtp1581198793-55] (o.e.j.u.URIUtil) -
/%c0%2e%c0%2e/%c0%2e%c0%2e/%c0%2e%c0%2e/%c0%2e%c0%2e/windows/win.ini
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8!
byte C0 in state 0
2019-04-15 20:48:54,783 WARN  [qtp1581198793-53] (o.e.j.u.URIUtil) -
/%c0%2e%c0%2e/%c0%2e%c0%2e/%c0%2e%c0%2e/%c0%2e%c0%2e/winnt/win.ini
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8!
byte C0 in state 0
2019-04-15 20:48:56,988 WARN  [ServiceDispatcher-rcovlnx0137:6000]
(c.s.j.r.u.ServiceDispatcher) - RCO_1_FIX_TEST Malformed service request:
�

After this, pretty much all of the remaining log output is garbled:
2019-04-15 20:57:30,486 INFO  [IO-/10.81.215.198:33900]
(-.a.-._.+.N-+B+-c+i+gC-++ec+i-+) - E|ce-+i-+ -e_f-_+i+g I/O f-_ c-++ec+i-+
'/10.81.215.198:33900' : C-++ec+i-+ _e_e+ by -ee_
2019-04-15 21:00:15,133 WARN  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.a.-._._.b._.Re-+ica+edE++i_-++e++Facade) - Ti+e-++ +hi+_+ de+e_+i+i+g
_+a+e f-_ +-de 'STO_2_FIX_TEST' f_-+ g_-+- fi|_ha
2019-04-15 21:00:15,387 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h.+ef+) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)] HA-1006 :
Lef+ : N-de : 'STO_2_FIX_TEST' (_+-+++|0137:6060)
2019-04-15 21:00:15,388 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h._-+e_cha+ged) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)]
HA-1010 : R-+e cha+ge _e--_+ed: N-de : 'STO_2_FIX_TEST' (_+-+++|0137:6060)
: f_-+ 'REPLICA' +- 'UNREACHABLE'
2019-04-15 21:00:15,388 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h.+ef+) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)] HA-1006 :
Lef+ : N-de : 'STO_1_FIX_TEST' (_+-+++|0137:6000)
2019-04-15 21:00:15,388 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h._-+e_cha+ged) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)]
HA-1010 : R-+e cha+ge _e--_+ed: N-de : 'STO_1_FIX_TEST' (_+-+++|0137:6000)
: f_-+ 'REPLICA' +- 'UNREACHABLE'
2019-04-15 21:00:16,396 WARN  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.a.-._._.b._.Re-+ica+edE++i_-++e++Facade) - N-de 'STO_2_FIX_TEST' f_-+
g_-+- fi|_ha i_ _e_--+di+g agai+.
2019-04-15 21:00:16,401 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h.+-i+ed) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)] HA-1005
: J-i+ed : N-de : 'STO_2_FIX_TEST' (_+-+++|0137:6060)
2019-04-15 21:00:16,401 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h._-+e_cha+ged) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)]
HA-1010 : R-+e cha+ge _e--_+ed: N-de : 'STO_2_FIX_TEST' (_+-+++|0137:6060)
: f_-+ 'UNREACHABLE' +- 'REPLICA'
2019-04-15 21:00:16,401 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h.+-i+ed) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)] HA-1005
: J-i+ed : N-de : 'STO_1_FIX_TEST' (_+-+++|0137:6000)
2019-04-15 21:00:16,401 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h._-+e_cha+ged) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)]
HA-1010 : R-+e cha+ge _e--_+ed: N-de : 'STO_1_FIX_TEST' (_+-+++|0137:6000)
: f_-+ 'UNREACHABLE' +- 'REPLICA'
2019-04-15 21:01:08,741 WARN  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.a.-._._.b._.Re-+ica+edE++i_-++e++Facade) - Ti+e-++ +hi+_+ de+e_+i+i+g
_+a+e f-_ +-de 'STO_2_FIX_TEST' f_-+ g_-+- fi|_ha
2019-04-15 21:01:09,242 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h.+ef+) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)] HA-1006 :
Lef+ : N-de : 'STO_2_FIX_TEST' (_+-+++|0137:6060)
2019-04-15 21:01:09,242 INFO  [G_-+--Cha+ge-Lea_+e_:fi|_ha:RCO_1_FIX_TEST]
(-.+.h._-+e_cha+ged) - [g_-(/fi|_ha)/+h+(/RCO_1_FIX_TEST)] [g_-(/fi|_ha)]
HA-1010 : R-+e cha+ge _e--_+ed: N-de : 'STO_2_FIX_TEST' (_+-+++|0137:6060)
: f_-+ 'REPLICA' +- 'UNREACHABLE'

This error appears on all 3 nodes. I looked through the old logs, and the
issue goes all the way back to January, when I started working on the broker,
so it is certainly not recent.
The web console shows that everything is OK, and JMS clients appear to work
fine.
I wonder whether setting up an ACL to block the ports that are causing the
error would even work, since the error may occur during the initial handshake
anyway.
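For what it's worth, the kind of rule I have in mind would be something like
this (a rough, untested sketch; port 9000 and the HA ports 6000/6060 match my
logs, but the 10.0.0.0/24 subnet is just a placeholder for the cluster nodes
and legitimate clients):

  # allow only trusted hosts to reach the AMQP and HA replication ports
  iptables -A INPUT -p tcp -m multiport --dports 9000,6000,6060 \
    -s 10.0.0.0/24 -j ACCEPT
  # drop everything else, e.g. the scanner traffic shown above
  iptables -A INPUT -p tcp -m multiport --dports 9000,6000,6060 -j DROP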

Thanks.



On Thu, Apr 11, 2019 at 6:17 PM Oleksandr Rudyy <[email protected]> wrote:

> Hi Igor,
>
> Each node in the group monitors the state of its counterparts by sending
> ping commands using "com.sleepycat.je.rep.util.DbPing".
> Older versions of the BDB JE environment did not react promptly to majority
> loss. As a result, the monitoring functionality was added to detect when
> other nodes in the group became unreachable and to generate special
> operational logs. Newer BDB JE versions are much better at detecting
> disconnects of other nodes, but BDB JE does not provide a good API that
> would allow listening for the state of other nodes in the group.
> Thus, the monitoring functionality was left in the broker.
>
> The db ping is issued periodically with the interval defined in the context
> variable "qpid.bdb.ha.remote_node_monitor_interval". By default, the value
> is set to 1000 milliseconds.
> The monitoring functionality waits for a timeout (specified in the context
> variable "qpid.bdb.ha.remote_node_monitor_timeout") for the db ping to
> receive the status details of the other node. The default timeout value is
> also 1000 milliseconds. If the timeout expires and no db ping response has
> been received, the node is considered UNREACHABLE and operational logs like
> the ones below are generated:
> HA-1006 : Left : Node : 'TEST_HA' (testnode1:5050)
> HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050)
> :from 'REPLICA' to 'UNREACHABLE'"
>
> There is a third timeout for DbPing to open a socket to the remote node.
> The value for this timeout is set with the context variable
> qpid.bdb.ha.db_ping_socket_timeout. The default value is also set to 1000
> milliseconds.
>
> It seems that in your environment the default timeout values could be
> increased in order to reduce the amount of misleading logs reported when
> db ping responses are not received before the timeout expires.
>
> I think the default timeout values are a bit optimistic. It would be more
> realistic to set qpid.bdb.ha.remote_node_monitor_timeout to 5 seconds and
> qpid.bdb.ha.db_ping_socket_timeout to 3 seconds, or even greater values.
>
> The timeouts can be overridden in multiple ways:
> * using JVM system properties, for example
>    export QPID_OPTS="-Dqpid.bdb.ha.remote_node_monitor_timeout=5000
> -Dqpid.bdb.ha.db_ping_socket_timeout=3000"
> * by setting context variables at the broker level (globally for all BDB HA
> Virtual host nodes) or directly at the BDB HA Virtual host node level
>
> If required, the context variable overrides can be added to a customized
> initial broker configuration, which can be used to create new broker
> instances.
>
> Kind Regards,
> Alex
>
>
> On Thu, 11 Apr 2019 at 17:34, Igor Natanzon <[email protected]>
> wrote:
>
> > Broker-J 7.1.1
> > I have the following HA node setup:
> > Datacenter 1: node 1 (master)
> > Datacenter 2: node 2 (replica), node 3 (replica)
> >
> > All nodes are set to sync because data preservation and consistency are
> > critical for this use case (latency is not a high priority).
> > Ping between the two datacenters (which are geographically near each
> > other) is on average 2 ms.
> >
> > What I see on a periodic basis are constant timeouts between the nodes,
> > which always recover approximately one second later.
> >
> > Log on node 1:
> >
> > 2019-04-10 20:35:24,898 WARN
> [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> > (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> > state for node 'STO2_TEST_HA' from group fix_test
> > 2019-04-10 20:35:24,899 INFO
> [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> > (q.m.h.left) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> HA-1006
> > : Left : Node : 'STO2_TEST_HA' (node2:5060)
> > 2019-04-10 20:35:24,899 INFO
> [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> > (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)]
> [grp(/fix_test)]
> > HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from
> > 'MASTER' to 'UNREACHABLE'
> > 2019-04-10 20:35:25,907 WARN
> [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> > (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'STO2_TEST_HA' from
> > group fix_test is responding again.
> > 2019-04-10 20:35:25,907 INFO
> [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> > (q.m.h.joined) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)]
> > HA-1005 : Joined : Node : 'STO2_TEST_HA' (node2:5060)
> > 2019-04-10 20:35:25,907 INFO
> [Group-Change-Learner:fix_test:RCO_TEST_HA1]
> > (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)]
> [grp(/fix_test)]
> > HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from
> > 'UNREACHABLE' to 'MASTER'
> >
> >
> > Sample log entry on node 2
> >
> > 2019-04-11 06:15:20,769 WARN
> [Group-Change-Learner:fix_test:STO2_TEST_HA]
> > (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> > state for node 'RCO_TEST_HA1' from group fix_test
> > 2019-04-11 06:15:20,775 INFO
> [Group-Change-Learner:fix_test:STO2_TEST_HA]
> > (q.m.h.left) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> HA-1006
> > : Left : Node : 'RCO_TEST_HA1' (node1:5000)
> > 2019-04-11 06:15:20,775 INFO
> [Group-Change-Learner:fix_test:STO2_TEST_HA]
> > (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)]
> [grp(/fix_test)]
> > HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from
> > 'REPLICA' to 'UNREACHABLE'
> > 2019-04-11 06:15:21,782 WARN
> [Group-Change-Learner:fix_test:STO2_TEST_HA]
> > (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'RCO_TEST_HA1' from
> > group fix_test is responding again.
> > 2019-04-11 06:15:21,784 INFO
> [Group-Change-Learner:fix_test:STO2_TEST_HA]
> > (q.m.h.joined) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)]
> > HA-1005 : Joined : Node : 'RCO_TEST_HA1' (node1:5000)
> > 2019-04-11 06:15:21,784 INFO
> [Group-Change-Learner:fix_test:STO2_TEST_HA]
> > (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)]
> [grp(/fix_test)]
> > HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from
> > 'UNREACHABLE' to 'REPLICA'
> >
> >
> >
> > Log on node 3 is the same.
> >
> > This warning repeats pretty much every 30 minutes or so (not consistently).
> > It doesn't really cause any harm, as I am assuming that latency spikes
> > between the datacenters are causing this issue, so is there a way to tune
> > replication to be a bit less sensitive to them? Specifically, it's obvious
> > that it successfully recovers replication 1 second after it detects a
> > problem (that pattern is very consistent, going all the way back to
> > January). Here's an example of one, also recovering 1 second later:
> >
> >
> > 2019-01-23 18:00:38,495 WARN  [Group-Change-Learner:fix_test:STO_TEST_HA]
> > (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining
> > state for node 'TEST_HA' from group fix_test
> > 2019-01-23 18:00:38,495 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> > (q.m.h.left) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> HA-1006
> > : Left : Node : 'TEST_HA' (testnode1:5050)
> > 2019-01-23 18:00:38,496 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> > (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)]
> [grp(/fix_test)]
> > HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050) :
> > from 'REPLICA' to 'UNREACHABLE'
> > 2019-01-23 18:00:39,505 WARN  [Group-Change-Learner:fix_test:STO_TEST_HA]
> > (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'TEST_HA' from group
> > fix_test is responding again.
> > 2019-01-23 18:00:39,506 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> > (q.m.h.joined) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)]
> > HA-1005 : Joined : Node : 'TEST_HA' (testnode1:5050)
> > 2019-01-23 18:00:39,506 INFO  [Group-Change-Learner:fix_test:STO_TEST_HA]
> > (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)]
> [grp(/fix_test)]
> > HA-1010 : Role change reported: Node : 'TEST_HA' (testnode1:5050) : from
> > 'UNREACHABLE' to 'REPLICA'
> >
> >
> >
> > It might be that HA is a bit over-sensitive to latency spikes, so I
> > wonder if I could make it more tolerant (maybe wait an extra second?).
> >
> > Thanks!
> >
>
