Re: Broker-J BDB JE High Availability Time Sync issue

2018-04-17 Thread Keith W
Hi Bryan

I haven't seen problem reports raised against Broker-J regarding this error.

The error actually originates from the HA feature within Berkeley BDB JE.
There's a useful reply within the following thread that explains some
background about how the maximum clock delta is used.

https://community.oracle.com/thread/1027853

For this problem to appear,  I see three possibilities:

1) The NTP synchronisation between the nodes was somehow ineffective
causing a genuine skew between the nodes
2) The processing BDB JE internal
com.sleepycat.je.rep.stream.Protocol.SNTPResponse was somehow delayed
causing a spurious exception to be reported on the master.  I am wondering
whether an unfortunately timed lengthy GC pause could cause this.
3) There is some kind of previously unknown defect in BDB JE HA itself.

To help you narrow in, I'd suggest changing the logging configuration so
you retain logs for longer so if you see a reoccurrence you have more to
investigate.   I also suggest turning GC logging with timestamps, at the
JVM level.

You could also consider raising je.rep.maxClockDelta.  Broker-J doesn’t
actually change the replica consistency policy settings from their
defaults, which includes the permitted clock delta setting.  A BDB HA node
never actually reads from the replicated environment until the node becomes
master.  This means the concerns described by
https://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/consistency.html
don't actually apply.   (As an aside I note that
 NoConsistencyRequiredPolicy (rather the TimeConsistencyPolicy default)
should serve Broker-J's use case just as well - but this is not something I
have tried or investigated completely).

Hope this helps.

Keith Wall.


On Mon, 16 Apr 2018 at 18:54, Bryan Dixon  wrote:

> We are using Broker-J 7.0.2 and just ran into a Berkeley HA Time Sync issue
> that I'm wondering if anyone else has run into.  The stackTrace is at the
> end of this post.   We are running on Windows Server 2012 R2 6.3 amd64 and
> our JDK is Oracle Corporation 1.8.0_162-b12.  We have 3 servers as part of
> our HA setup.
>
> This error occurred in our production environment which has been live for
> just a couple of weeks.  We never ran into this in our Test or Dev
> environments that have been running for a few months.   When one of our
> admins checked the clock times of all 3 servers they were completely in
> sync.  Another admin stated that the server clock times are synced with
> NTP.
> Unfortunately our log files rolled off and I don't know exactly when this
> error first occurred because the older log file are gone.
>
> 2018-04-16 04:10:57,039 ERROR [Group-Change-Learner:prodbrok
> er:prodbroker2]
> (o.a.q.s.u.ServerScopedRuntimeException) - Exception on master check
> com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) Environment must
> be
> closed, caused by: com.sleepycat.je.EnvironmentFailureException:
> Environment
> invalid because of previous exception: (JE 7.4.5)
> prodbroker2(2):D:\qpidwork\prodbroker2\config Clock delta: 8859 ms.
> between
> Feeder: prodbroker1 and this Replica exceeds max permissible delta: 2000
> ms.
> HANDSHAKE_ERROR: Error during the handshake between two nodes. Some
> validity
> or compatibility check failed, preventing further communication between the
> nodes. Environment is invalid and must be closed. Originally thrown by HA
> thread: UNKNOWN prodbroker2(2) Originally thrown by HA thread: UNKNOWN
> prodbroker2(2)
> at
> com.sleepycat.je.EnvironmentFailureException.wrapSelf(Enviro
> nmentFailureException.java:228)
> at
> com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(Environm
> entImpl.java:1766)
> at
> com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1775)
> at com.sleepycat.je.Environment.checkOpen(Environment.java:2473)
> at com.sleepycat.je.DbInternal.checkOpen(DbInternal.java:105)
> at
> com.sleepycat.je.rep.ReplicatedEnvironment.checkOpen(Replica
> tedEnvironment.java:1052)
> at
> com.sleepycat.je.rep.ReplicatedEnvironment.getState(Replicat
> edEnvironment.java:764)
> at
> org.apache.qpid.server.store.berkeleydb.replication.Replicat
> edEnvironmentFacade$RemoteNodeStateLearner.executeDatabasePi
> ngerOnNodeChangesIfMaster(ReplicatedEnvironmentFacade.java:2276)
> at
> org.apache.qpid.server.store.berkeleydb.replication.Replicat
> edEnvironmentFacade$RemoteNodeStateLearner.call(ReplicatedEn
> vironmentFacade.java:2042)
> at
> org.apache.qpid.server.store.berkeleydb.replication.Replicat
> edEnvironmentFacade$RemoteNodeStateLearner.call(ReplicatedEn
> vironmentFacade.java:2012)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
> tureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
> 

Broker-J BDB JE High Availability Time Sync issue

2018-04-16 Thread Bryan Dixon
We are using Broker-J 7.0.2 and just ran into a Berkeley HA Time Sync issue
that I'm wondering if anyone else has run into.  The stackTrace is at the
end of this post.   We are running on Windows Server 2012 R2 6.3 amd64 and
our JDK is Oracle Corporation 1.8.0_162-b12.  We have 3 servers as part of
our HA setup.

This error occurred in our production environment which has been live for
just a couple of weeks.  We never ran into this in our Test or Dev
environments that have been running for a few months.   When one of our
admins checked the clock times of all 3 servers they were completely in
sync.  Another admin stated that the server clock times are synced with NTP. 
Unfortunately our log files rolled off and I don't know exactly when this
error first occurred because the older log file are gone.

2018-04-16 04:10:57,039 ERROR [Group-Change-Learner:prodbroker:prodbroker2]
(o.a.q.s.u.ServerScopedRuntimeException) - Exception on master check
com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) Environment must be
closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment
invalid because of previous exception: (JE 7.4.5)
prodbroker2(2):D:\qpidwork\prodbroker2\config Clock delta: 8859 ms. between
Feeder: prodbroker1 and this Replica exceeds max permissible delta: 2000 ms.
HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity
or compatibility check failed, preventing further communication between the
nodes. Environment is invalid and must be closed. Originally thrown by HA
thread: UNKNOWN prodbroker2(2) Originally thrown by HA thread: UNKNOWN
prodbroker2(2)
at
com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:228)
at
com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766)
at
com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1775)
at com.sleepycat.je.Environment.checkOpen(Environment.java:2473)
at com.sleepycat.je.DbInternal.checkOpen(DbInternal.java:105)
at
com.sleepycat.je.rep.ReplicatedEnvironment.checkOpen(ReplicatedEnvironment.java:1052)
at
com.sleepycat.je.rep.ReplicatedEnvironment.getState(ReplicatedEnvironment.java:764)
at
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade$RemoteNodeStateLearner.executeDatabasePingerOnNodeChangesIfMaster(ReplicatedEnvironmentFacade.java:2276)
at
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade$RemoteNodeStateLearner.call(ReplicatedEnvironmentFacade.java:2042)
at
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade$RemoteNodeStateLearner.call(ReplicatedEnvironmentFacade.java:2012)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid
because of previous exception: (JE 7.4.5)
prodbroker2(2):D:\qpidwork\prodbroker2\config Clock delta: 8859 ms. between
Feeder: prodbroker1 and this Replica exceeds max permissible delta: 2000 ms.
HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity
or compatibility check failed, preventing further communication between the
nodes. Environment is invalid and must be closed. Originally thrown by HA
thread: UNKNOWN prodbroker2(2) Originally thrown by HA thread: UNKNOWN
prodbroker2(2)
at
com.sleepycat.je.rep.stream.ReplicaFeederHandshake.checkClockSkew(ReplicaFeederHandshake.java:424)
at
com.sleepycat.je.rep.stream.ReplicaFeederHandshake.execute(ReplicaFeederHandshake.java:261)
at 
com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:691)
at
com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:474)
at 
com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:409)
at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1873)



--
Sent from: http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html

-
To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
For additional commands, e-mail: users-h...@qpid.apache.org