Following up, I've found that we tend to encounter one of three types of
exceptions/commitlog corruptions:

1. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
   Mutation checksum failure at ... in CommitLog-5-1531150627243.log
       at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)

2. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
   Could not read commit log descriptor in file CommitLog-5-1550003067433.log
       at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)

3. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
   Encountered bad header at position ... of commit log
   CommitLog-5-1603991140803.log, with invalid CRC. The end of segment marker
   should be zero.
       at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
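
For anyone who wants to count how often each of the three cases shows up in
their own logs, here is a minimal sketch in Python. The regexes are built only
from the message text quoted above; the category labels and the `classify`
helper are names I made up for illustration, and other Cassandra versions may
word these messages differently.

```python
import re

# Message fragments taken from the three exceptions listed in this email.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "mutation_checksum_failure": re.compile(
        r"Mutation checksum failure at \S+ in (CommitLog-[\w-]+\.log)"
    ),
    "unreadable_descriptor": re.compile(
        r"Could not read commit log descriptor in file (CommitLog-[\w-]+\.log)"
    ),
    "bad_header": re.compile(
        r"Encountered bad header at position \S+ of commit log\s+(CommitLog-[\w-]+\.log)"
    ),
}


def classify(message):
    """Return (label, segment_file_name) for a known message, else None."""
    for label, pattern in PATTERNS.items():
        match = pattern.search(message)
        if match:
            return label, match.group(1)
    return None
```

Feeding each ERROR line of system.log through `classify` and tallying the
labels should give a rough picture of which case dominates.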

I believe exception (2) is mitigated by
https://issues.apache.org/jira/browse/CASSANDRA-11995 and
https://issues.apache.org/jira/browse/CASSANDRA-13918

But it's not clear to me how (1) and (3) can be mitigated.
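
To tell the "just 0s" segments apart from the ones that contain real data, a
rough check like the following may help. This is only a sketch: the
`describe_segment` helper is a hypothetical name, it does not parse the
commit log format or verify any CRC, and a trailing run of zeros is not by
itself proof of corruption.

```python
def describe_segment(path, chunk_size=1 << 20):
    """Return (total_bytes, trailing_zero_bytes) for the file at path."""
    total = 0
    trailing = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            stripped = chunk.rstrip(b"\x00")
            if stripped:
                # Non-zero data in this chunk: only the zeros after it count.
                trailing = len(chunk) - len(stripped)
            else:
                # Chunk is entirely zeros: it extends the current zero run.
                trailing += len(chunk)
    return total, trailing
```

A segment is entirely zeros when both numbers are equal and non-zero; if
`trailing` is smaller than `total`, the file holds data followed by that many
zero bytes.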

On Mon, Jul 26, 2021 at 6:40 PM Leon Zaruvinsky <leonzaruvin...@gmail.com>
wrote:

> Thanks for the links/comments Jeff and Bowen.
>
> We run xfs. Not sure that we can switch to zfs, so a different solution
> would be preferred.
>
> I’ll take a look through that patch – maybe I’ll try to backport and
> replicate.  We’ve seen both cases where the commitlog is just 0s (empty)
> and where it has had real data in it.
>
> Leon
>
> On Mon, Jul 26, 2021 at 6:38 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> The commitlog code has changed DRASTICALLY between 2.x and trunk.
>>
>> If it's really a bunch of trailing 0s as was suggested later, then
>> https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least
>> one cause/case of that particular bug.
>>
>>
>>
>> On Mon, Jul 26, 2021 at 3:11 PM Leon Zaruvinsky <leonzaruvin...@gmail.com>
>> wrote:
>>
>>> And for completeness, a sample stack trace:
>>>
>>> ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog: Failed commit log replay.
>>> Commit disk failure policy is stop_on_startup; terminating thread
>>> (throwable0_message: Mutation checksum failure at 15167277 in
>>> CommitLog-5-1626828286977.log)
>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
>>>  Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>>>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>>>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>>>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>>>     at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>>>     at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>>>     at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>>>     at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>>>     at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>>>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>>>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>>>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
>>>
>>>
>>> On Mon, Jul 26, 2021 at 6:08 PM Leon Zaruvinsky <
>>> leonzaruvin...@gmail.com> wrote:
>>>
>>>> Currently we're using batch commitlog sync:
>>>>
>>>>     commitlog_sync: batch
>>>>     commitlog_sync_batch_window_in_ms: 2
>>>>     commitlog_segment_size_in_mb: 32
>>>>
>>>> durable_writes is also true.
>>>>
>>>> Unfortunately we are still using Cassandra 2.2.x :( Though I'd be
>>>> curious if much in this space has changed since then (I've looked through
>>>> the changelogs and nothing stood out).
>>>>
>>>> On Mon, Jul 26, 2021 at 5:20 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> What commitlog settings are you using?
>>>>>
>>>>> Default is periodic with 10s sync. That leaves you a 10s window on
>>>>> hard poweroff/crash.
>>>>>
>>>>> I would also expect Cassandra to clean up and start cleanly. Which
>>>>> version are you running?
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 26, 2021 at 1:00 PM Leon Zaruvinsky <
>>>>> leonzaruvin...@gmail.com> wrote:
>>>>>
>>>>>> Hi Cassandra community,
>>>>>>
>>>>>> We (and others) regularly run into commit log corruptions that are
>>>>>> caused by Cassandra, or the underlying infrastructure, being hard
>>>>>> restarted.  I suspect that this is because it happens in the middle of a
>>>>>> commitlog file write to disk.
>>>>>>
>>>>>> Could anyone point me at resources / code to understand why this is
>>>>>> happening?  Shouldn't Cassandra hold off on acking writes until the
>>>>>> commitlog is safely written to disk?  I would expect that on startup,
>>>>>> Cassandra should be able to clean up bad commitlog files and recover
>>>>>> gracefully.
>>>>>>
>>>>>> I've seen various references online to this issue as something that
>>>>>> will be fixed in the future - so I'm curious if there is any movement or
>>>>>> thoughts there.
>>>>>>
>>>>>> Thanks a bunch,
>>>>>> Leon
>>>>>>
>>>>>
