Re: Failover very slow with kahadb while restart of master is fast

Tim Bain Fri, 29 Apr 2016 06:02:19 -0700

By reading the code (
http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-kahadb-store/5.11.1/org/apache/activemq/store/kahadb/disk/journal/Journal.java#Journal.recoveryCheck%28org.apache.activemq.store.kahadb.disk.journal.DataFile%29),
I found that we're reading the data files to confirm that they're not
corrupted.  That's the beauty of an open-source project: all of the source
code is available, so you can read it yourself and see what's being done.


You should confirm that your OS will properly invalidate its cache entry
for a given file when that file changes; having KahaDB read stale cache
entries would be a very bad thing.

Tim
We have found the main cause for this issue. Current Master and new Master
seem to do the exact same things on startup.
They read all the journals.

It seems that the OS is caching the files in memory in some cases.
The master seems to have all the jorunal files in the cache. So when it
restarts it is fast.
The new master seems to not have the journals in cache so they have to be
transfered over the network (~10MB/s) in our case. So this
causes the difference in startup time.

One big question remains:
Why are the whole journals read? Shouldn't a master that is shut down in a
controlled way leave a clean state, so on startup only the meta data index
is read?

Christian

On 15.02.2016 13:32, Christian Schneider wrote:

> At a customer we are experiencing a very strange behaviour of activemq.
>
> We have two brokers on separate machines A and B. Storage is on a third
> machine and imported using nfsv4.
> Kahadb contains about 18 GB of journal files.
>
> Scenario 1:
> - Stop B
> - Stop A (master now)
> - Start A
>
> The start of A takes about 50 seconds.
>
> Scnario 2:
> - Stop A
> - Stop B (master now)
> - Start A
>
> The start of A takes about 5 minutes.
>
> In the slow case the log at debug level shows a big gap between adding the
> last queue (12:04) and running the ActiveMQ Journal Checkpoint Worker again
> after the queue creation (12:09).
> I have no idea what activemq is doing in the mean time but it seems to
> take several minutes.
>
> Can anyone explain this behaviour or explain what activemq does in the
> mean time? I would also be happy about pointer to the code.
> What I do not understand is why it makes a difference if the master is
> started again or if the slave is becoming master. As they both share the
> same kahadb over nfs I would expect the same startup
> behaviour.
>
> Christian
>
> -----
>
> Slow start
> 2016-02-15 12:03:01,943 | INFO  | Refreshing
> org.apache.activemq.xbean.XBeanBrokerFactory$1@17b14695: startup date
> [Mon Feb 15 12:03:01 CET 2016]; root of context hierarchy |
> org.apache.activemq.xbean.XBeanBrokerFactory$1 | main
> ...
> 2016-02-15 12:03:03,018 | INFO  |
> PListStore:[/path/activemq/data/A/tmp_storage] started |
> org.apache.activemq.store.kahadb.plist.PListStoreImpl | main
> 2016-02-15 12:03:03,033 | INFO  | ignoring zero length, partially
> initialised journal data file: db-1.log number = 1 , length = 0 |
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:03,177 | INFO  | JobSchedulerStore:path/scheduler started
> | org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl | main
> 2016-02-15 12:03:03,180 | INFO  | JobScheduler using directory:
> path/scheduler | org.apache.activemq.broker.BrokerService | main
> 2016-02-15 12:03:03,335 | INFO  | Using Persistence Adapter:
> KahaDBPersistenceAdapter[/path] | org.apache.activemq.broker.BrokerService
> | main
> 2016-02-15 12:03:03,350 | INFO  | JMX consoles can connect to
> service:jmx:rmi:///jndi/rmi://localhost:1617/jmxrmi |
> org.apache.activemq.broker.jmx.ManagementContext | JMX connector
> 2016-02-15 12:03:05,198 | INFO  | Corrupt journal records found in
> 'path/db-1122.log' between offsets: 17893102..17893732 |
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 2016-02-15 12:03:47,925 | INFO  | KahaDB is version 5 |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> ....
> 2016-02-15 12:03:49,233 | INFO  | Recovering from the journal ... |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:03:49,237 | INFO  | Recovery replayed 20 operations from the
> journal in 0.094 seconds. |
> org.apache.activemq.store.kahadb.MessageDatabase | main
> 2016-02-15 12:04:26,863 | DEBUG | Checkpoint started. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:04:26,932 | DEBUG | Checkpoint done. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:04:26,940 | INFO  | Apache ActiveMQ 5.10.2 (A,
> ID:A-59167-1455534183070-1:1) is starting |
> org.apache.activemq.broker.BrokerService | main
> ...
> 2016-02-15 12:04:30,289 | DEBUG | A adding destination: queue://q1.DLQ |
> org.apache.activemq.broker.region.AbstractRegion | main
> 2016-02-15 12:04:57,026 | DEBUG | queue://q1.DLQ expiring messages .. |
> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:04:57,029 | DEBUG | q1.DLQ toPageIn: 5, Inflight: 0,
> pagedInMessages.size 0, pagedInPendingDispatch.size 0, enqueueCount: 0,
> dequeueCount: 0, memUsage:0 | org.apache.activemq.broker.region.Queue |
> ActiveMQ Broker[A] Scheduler
> 2016-02-15 12:09:20,440 | DEBUG | Checkpoint started. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:09:20,476 | DEBUG | Checkpoint done. |
> org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal
> Checkpoint Worker
> 2016-02-15 12:09:20,488 | DEBUG | queue://q1.DLQ expiring messages done. |
> org.apache.activemq.broker.region.Queue | ActiveMQ Broker[A] Scheduler
>
>
>

-- 
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com

Re: Failover very slow with kahadb while restart of master is fast

Reply via email to