Karl,

Is this on a local disk, RAID, SAN, or NAS? The first step is to confirm there was no disk corruption: check your syslog and dmesg output for any I/O error messages.
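Something along these lines would do as a first pass. It is only a sketch: the pattern list is an illustrative guess rather than an exhaustive set of kernel messages, `scan_io_errors` is a made-up helper name, and /var/log/messages is assumed to be where syslog lands (the usual rsyslog default on RHEL 7), so adjust for your hosts.

```shell
#!/bin/sh
# Illustrative patterns only (assumption): extend for your storage stack.
IO_PATTERN='I/O error|blk_update_request|Buffer I/O|EXT4-fs error|XFS .*error|medium error'

# scan_io_errors: read log text on stdin, print lines that look like
# disk or filesystem I/O errors. Always returns 0, even with no matches.
scan_io_errors() {
    grep -iE "$IO_PATTERN" || true
}

# Kernel ring buffer; may need root depending on kernel.dmesg_restrict.
{ dmesg -T 2>/dev/null || true; } | scan_io_errors

# Persistent syslog; the path is an assumption and usually needs root to read.
if [ -r /var/log/messages ]; then
    scan_io_errors < /var/log/messages
fi
```

No output does not prove the storage is healthy (the logs may have rotated since the corruption happened), but any hits dated around the first "Corrupt journal records" warning would be worth chasing before touching the KahaDB files.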
Thanks,
Matt

> On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <k...@psu.edu> wrote:
>
> Hello,
>
> We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9
> running on rhel7.
>
> The following was done on our acceptance cluster.
>
> I checked activemq.log for messages to determine if activemq has corrupt data
> files:
>
> [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
> 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main
>
> Then I moved the scheduleDB files after stopping activemq.service on both
> brokers.
>
> cd /opt/local/activemq/data/kahadb/scheduler
>
> sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
>
> After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with
> ERROR | Failed to start job scheduler store.
>
> There was a corrupt journal file.
>
> [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
>
> 2022-06-02 07:55:40,066 | WARN | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main
>
> We tried starting activemq without the db-1179.log file, and with an empty
> db-1179.log file. ActiveMQ complained about both.
>
> We eventually stopped activemq, renamed the scheduler/ directory and started
> activemq.
>
> After we restarted, we have one db-*.log file with 50K messages.
>
> [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> 50,067 db-1.log
>
> Before, we had 125 log files and 8,697,209 messages!
>
> [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> ...
> 8,697,209 total
>
> So, we have millions of messages that we probably do not need. It took 2.5
> hours to recover 7.5M entries before it failed, likely due to the corrupt
> record.
>
> How can I get activemq to clean up these logs, so this recovery doesn't take
> so long?
>
> How can I correct the data corruption?
>
> For a test, I did remove the range of the file between offsets
> 11558626..11559784. I used the "head -c" command, grep and vi to do that.
> ActiveMQ did start.
>
> I am hoping that this doesn't happen in production, because it won't be
> acceptable to lose messages to get activemq to start up.
>
> ---
>
> Karl Nordström
>
> Systems Administrator
>
> Penn State IT | Application Platforms