Karl-

Is this on a local disk, RAID, SAN or NAS? The first step is to confirm there 
was no disk corruption: check your syslog and dmesg output for any I/O error 
messages.
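
A minimal sketch of that check, assuming a stock RHEL 7 setup where syslog 
goes to /var/log/messages. The function name filter_io_errors and the pattern 
list are only illustrative (they are not anything ActiveMQ ships), so adjust 
them for your storage stack:

```shell
# Hypothetical helper: pass through only lines that look like storage errors.
# The pattern list is an assumption, not an exhaustive set of kernel messages.
filter_io_errors() {
    grep -iE 'i/o error|medium error|ext4-fs error|xfs.*corruption'
}

# Usage (run manually; reading these sources may require root):
#   dmesg | filter_io_errors
#   filter_io_errors < /var/log/messages
```

No output from either command suggests the kernel did not log a disk error, 
though it does not rule out a problem on the SAN or NAS side.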

Thanks,
Matt

> On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <k...@psu.edu> wrote:
> 
> Hello,
> 
> We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 
> running on rhel7.
> 
> The following was done on our acceptance cluster.
> 
> I check activemq.log for messages to determine if activemq has corrupt data 
> files:
> 
> [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" 
> /opt/local/activemq/data/activemq.log | head -1
> 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: 
> JobSchedulerStore: 
> /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | 
> org.apache.activemq.broker.BrokerService | main
> 
> Then I move scheduleDB files after stopping activemq.service on both brokers.
> 
> cd /opt/local/activemq/data/kahadb/scheduler
> 
> sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv 
> scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
> 
> After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with 
> ERROR | Failed to start job scheduler store.
> 
> There was a corrupt journal file.
> 
> [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
> 
> 2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in 
> '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log'
>  between offsets: 11558626..11559784 | 
> org.apache.activemq.store.kahadb.disk.journal.Journal | main
> 
> We tried starting activemq without the db-1179.log file, with an empty 
> db-1179.log file. ActiveMQ complained about both.
> 
> We eventually stopped activemq, renamed the scheduler/ directory and started 
> activemq.
> 
> After we restarted, we have one db-*.log file with 50K messages.
> 
> [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> 50,067 db-1.log
> 
> Before we had 125 log files and 8,697,209 messages!
> 
> [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> ...
> 8,697,209 total
> 
> So, we have millions of messages that we probably do not need. It took 2.5 
> hours to recover 7.5M entries before it failed; likely due to the corrupt 
> record.
> 
> How can I get activemq to clean up these logs, so this recovery doesn't take 
> so long?
> 
> How can I correct the data corruption?
> 
> For a test, I did remove the range of the file between offsets: 
> 11558626..11559784. I used the "head -c" command, grep and vi to do that. 
> ActiveMQ did start.
> 
> I am hoping that this doesn't happen in production, because it won't be 
> acceptable to lose messages to get activemq to start up.
> 
> ---
> 
> Karl Nordström
> 
> Systems Administrator
> 
> Penn State IT | Application Platforms
