Matt, et al.,

KahaDB is on shared, scaled-out NAS storage mounted over NFS. Sometimes ActiveMQ 
loses its NFS mounts when the Storage Team upgrades the OS on the storage nodes. 
They upgrade one node at a time, so the NFS mount must migrate to another 
storage node. Supposedly, it can take up to 30 seconds to migrate. The IP 
address of the new storage node is different from that of the original storage 
node. We avoid data corruption by stopping activemq.service on the broker that 
is in slave mode during the storage upgrade.
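
For what it's worth, a rough sketch of how the mount could be checked before 
activemq.service is started again after a storage upgrade. This is not 
something we run today. wait_for_mount is a hypothetical helper, and the 
defaults (10 tries, 5 seconds apart, to cover the 30-second migration window) 
are assumptions.

```shell
# wait_for_mount DIR [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as DIR can be listed, 1 if it never answers.
# Helper name and default values are assumptions, not an existing tool.
wait_for_mount() {
    wfm_dir=$1
    wfm_tries=${2:-10}
    wfm_delay=${3:-5}
    wfm_i=0
    while [ "$wfm_i" -lt "$wfm_tries" ]; do
        # timeout keeps the check from hanging on an unresponsive NFS mount
        if timeout -s KILL 5 ls -d -- "$wfm_dir" >/dev/null 2>&1; then
            return 0
        fi
        wfm_i=$((wfm_i + 1))
        sleep "$wfm_delay"
    done
    return 1
}
```

Assuming the KahaDB directory lives under /opt/local/activemq/data, it could 
be used as: wait_for_mount /opt/local/activemq/data && sudo systemctl start 
activemq.service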

Unfortunately, I did not check for I/O errors earlier, and I don't have 
/var/log/messages from before 2022/06/10. If this happens again, I will 
certainly follow your advice.


---

Karl Nordström

Systems Administrator

Penn State IT | Application Platforms

________________________________
From: Matt Pavlovich <mattr...@gmail.com>
Sent: Wednesday, June 15, 2022 6:24 PM
To: users@activemq.apache.org <users@activemq.apache.org>
Subject: Re: ActiveMQ 5.16.4 Data Corruption

Karl-

Is this on a local disk, RAID, SAN, or NAS? The first step is to confirm there 
was no disk corruption: check your syslog and dmesg output for any I/O error 
messages.
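
A minimal sketch of that check, assuming a RHEL-style host where syslog is 
written to /var/log/messages. scan_io_errors is a hypothetical helper, and the 
pattern list is a guess at typical kernel and NFS client messages, so extend 
it as needed.

```shell
# scan_io_errors [FILE...]
# Print lines that look like disk or NFS trouble from the given files,
# or from stdin when no file is given. grep exits 1 when nothing matches.
# The patterns are assumptions; adjust them for your kernel and NFS client.
scan_io_errors() {
    grep -iE 'I/O error|nfs: server .* not responding|stale file handle' "$@"
}

# Typical use on a broker host:
#   dmesg -T | scan_io_errors
#   sudo cat /var/log/messages | scan_io_errors
```

No output from either command only means that none of these patterns matched 
in the logs that are still retained.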

Thanks,
Matt

> On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <k...@psu.edu> wrote:
>
> Hello,
>
> We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 
> running on rhel7.
>
> The following was done on our acceptance cluster.
>
> I check activemq.log for messages to determine if activemq has corrupt data 
> files:
>
> [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" 
> /opt/local/activemq/data/activemq.log | head -1
> 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: 
> JobSchedulerStore: 
> /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | 
> org.apache.activemq.broker.BrokerService | main
>
> Then I move scheduleDB files after stopping activemq.service on both brokers.
>
> cd /opt/local/activemq/data/kahadb/scheduler
>
> sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv 
> scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
>
> After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with 
> ERROR | Failed to start job scheduler store.
>
> There was a corrupt journal file.
>
> [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
>
> 2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in 
> '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log'
>  between offsets: 11558626..11559784 | 
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
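
To pull the affected file and byte range out of activemq.log without reading 
the WARN lines by eye, something like the following might help. It is only a 
sketch: parse_corrupt_records is a hypothetical helper, and the pattern 
assumes the message format shown in the WARN line above, with each log entry 
on a single line and a start..end offset pair.

```shell
# parse_corrupt_records [FILE...]
# For every "Corrupt journal records found" line, print:
#   <journal file> <start offset> <end offset>
# Reads stdin when no file is given. The message format is assumed to match
# the WARN line quoted above.
parse_corrupt_records() {
    sed -nE "s/.*Corrupt journal records found in '([^']+)' +between offsets: ([0-9]+)\.\.([0-9]+).*/\1 \2 \3/p" "$@"
}

# Typical use:
#   parse_corrupt_records activemq.log*
```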
>
> We tried starting activemq without the db-1179.log file and with an empty 
> db-1179.log file. ActiveMQ complained about both.
>
> We eventually stopped activemq, renamed the scheduler/ directory and started 
> activemq.
>
> After we restarted, we have one db-*.log file with 50K messages.
>
> [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> 50,067 db-1.log
>
> Before, we had 125 log files and 8,697,209 messages!
>
> [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> ...
> 8,697,209 total
>
> So, we have millions of messages that we probably do not need. It took 2.5 
> hours to recover 7.5M entries before it failed, likely due to the corrupt 
> record.
>
> How can I get activemq to clean up these logs, so this recovery doesn't take 
> so long?
>
> How can I correct the data corruption?
>
> As a test, I removed the range of the file between offsets 
> 11558626..11559784. I used the "head -c" command, grep and vi to do that. 
> ActiveMQ did start.
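
A rough sketch of that kind of byte-range removal using head -c and tail -c 
only. cut_byte_range is a hypothetical helper, not an ActiveMQ tool. It treats 
START as the 0-based offset of the first byte to drop and END as the offset of 
the first byte to keep. Whether the end offset in the WARN message is 
inclusive is not known here, so that convention is an assumption.

```shell
# cut_byte_range SRC DST START END
# Write SRC to DST without bytes START..END-1 (0-based, END exclusive).
# SRC is left untouched, so the original journal file stays available.
cut_byte_range() {
    cbr_src=$1
    cbr_dst=$2
    cbr_start=$3
    cbr_end=$4
    {
        # everything before the damaged range
        head -c "$cbr_start" "$cbr_src"
        # everything from END onward (tail -c +N counts from 1)
        tail -c +"$((cbr_end + 1))" "$cbr_src"
    } > "$cbr_dst"
}

# Example, with the broker stopped and working on a copy:
#   cut_byte_range db-1179.log db-1179.log.trimmed 11558626 11559784
```

The records inside the removed range are lost, and every later byte in the 
file shifts to a lower offset, so this is only something to try on a copy.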
>
> I am hoping that this doesn't happen in production, because it won't be 
> acceptable to lose messages to get activemq to start up.
>
> ---
>
> Karl Nordström
>
> Systems Administrator
>
> Penn State IT | Application Platforms
