The commitlog code has changed DRASTICALLY between 2.x and trunk. If it's really a bunch of trailing 0s as was suggested later, then https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least one cause/case of that particular bug.
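
If you want to check the trailing-zeros theory on one of the affected segments, here is a rough Python sketch. It is not from the Cassandra codebase, and the function name `all_zero_from` is made up for illustration. It only reports whether every byte from a given offset to the end of the file is 0x00.

```python
def all_zero_from(path, offset, chunk_size=1 << 20):
    """Return True if every byte from `offset` to end of file is 0x00.

    Hypothetical helper for illustration only. If `offset` is at or past
    the end of the file, there is nothing to check and the result is True.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            if chunk.count(0) != len(chunk):
                # At least one byte in this chunk is not 0x00.
                return False
```

For the trace quoted below, that would be something like `all_zero_from("CommitLog-5-1626828286977.log", 15167277)`, assuming the position in the error message is at or before the start of the zero run. A False result does not rule out other kinds of corruption; it only says the tail is not purely zeros.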

On Mon, Jul 26, 2021 at 3:11 PM Leon Zaruvinsky <leonzaruvin...@gmail.com> wrote:

> And for completeness, a sample stack trace:
>
> ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog:
> Failed commit log replay. Commit disk failure policy is stop_on_startup;
> terminating thread (throwable0_message: Mutation checksum failure at 15167277
> in CommitLog-5-1626828286977.log)
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>   at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>   at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>   at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>   at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>   at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>   at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
>
>
> On Mon, Jul 26, 2021 at 6:08 PM Leon Zaruvinsky <leonzaruvin...@gmail.com>
> wrote:
>
>> Currently we're using commitlog_batch:
>>
>> commitlog_sync: batch
>> commitlog_sync_batch_window_in_ms: 2
>> commitlog_segment_size_in_mb: 32
>>
>> durable_writes is also true.
>>
>> Unfortunately we are still using Cassandra 2.2.x :( Though I'd be curious
>> if much in this space has changed since then (I've looked through the
>> changelogs and nothing stood out).
>>
>> On Mon, Jul 26, 2021 at 5:20 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> What commitlog settings are you using?
>>>
>>> Default is periodic with 10s sync. That leaves you a 10s window on hard
>>> poweroff/crash.
>>>
>>> I would also expect cassandra to cleanup and start cleanly, which
>>> version are you running?
>>>
>>>
>>>
>>> On Mon, Jul 26, 2021 at 1:00 PM Leon Zaruvinsky <
>>> leonzaruvin...@gmail.com> wrote:
>>>
>>>> Hi Cassandra community,
>>>>
>>>> We (and others) regularly run into commit log corruptions that are
>>>> caused by Cassandra, or the underlying infrastructure, being hard
>>>> restarted. I suspect that this is because it happens in the middle of a
>>>> commitlog file write to disk.
>>>>
>>>> Could anyone point me at resources / code to understand why this is
>>>> happening? Shouldn't Cassandra not be acking writes until the commitlog is
>>>> safely written to disk? I would expect that on startup, Cassandra should
>>>> be able to clean up bad commitlog files and recover gracefully.
>>>>
>>>> I've seen various references online to this issue as something that
>>>> will be fixed in the future - so I'm curious if there is any movement or
>>>> thoughts there.
>>>>
>>>> Thanks a bunch,
>>>> Leon
>>>>
>>>