Re: 1.2.19: AssertionError when running compactions on a CF with TTLed columns

2018-12-11 Thread Reynald Borer
Hi everyone,

I was finally able to sort out my problem in an "interesting" manner that I
think is worth sharing on the list!

What I did is the following: on each node, I stopped Cassandra, completely
dropped the data files of the column family, started Cassandra again and
issued a repair for this column family.
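
For anyone wanting to script the same steps, here is a minimal dry-run sketch that only builds the command list for one node and does not execute anything. The keyspace and column family names, the `service cassandra` commands and the `/var/lib/cassandra/data` path are assumptions for illustration (they depend on how Cassandra was installed); `nodetool repair <keyspace> <cf>` is the only part taken directly from the procedure above.

```python
import shlex


def rebuild_plan(keyspace, column_family, data_root="/var/lib/cassandra/data"):
    """Return the commands to run on ONE node, in order (dry run only).

    Assumptions: a package install managed through `service cassandra`,
    and the default data directory layout <data_root>/<keyspace>/<cf>/.
    """
    cf_dir = f"{data_root}/{keyspace}/{column_family}"
    return [
        "sudo service cassandra stop",
        # -maxdepth 1 -type f removes only the files directly in the
        # column family directory and leaves subdirectories untouched.
        f"sudo find {shlex.quote(cf_dir)} -maxdepth 1 -type f -delete",
        "sudo service cassandra start",
        f"nodetool repair {shlex.quote(keyspace)} {shlex.quote(column_family)}",
    ]


if __name__ == "__main__":
    # "my_keyspace" and "my_cf" are placeholder names.
    for command in rebuild_plan("my_keyspace", "my_cf"):
        print(command)
```

Printing the plan first makes it easy to review the delete path before running anything by hand on a node.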

The process took a while since the cluster has 40 nodes, but once it was
done, the nodes no longer exhibited this assertion error!

I believe this was triggered by my tweaking the "sstable_size_in_mb"
parameter. Somehow I ended up with data files of different sizes, which
confused Cassandra.
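
A rough way to check for such a mix of sizes is to list the data files of the column family and compare them with the configured target. The sketch below assumes the data files end in `-Data.db` and that the directory path and target size are supplied by the caller; the 1.5x tolerance is an arbitrary choice for illustration, not a Cassandra setting.

```python
from pathlib import Path


def sstable_sizes_mb(cf_dir):
    """Map each *-Data.db file directly inside cf_dir to its size in MB."""
    return {
        p.name: p.stat().st_size / (1024 * 1024)
        for p in sorted(Path(cf_dir).glob("*-Data.db"))
    }


def oversized(sizes_mb, target_mb, tolerance=1.5):
    """Return names of files larger than tolerance * target_mb.

    Only oversized files are flagged, since files smaller than the
    target are also to be expected.
    """
    return [name for name, mb in sizes_mb.items() if mb > tolerance * target_mb]
```

Calling `oversized(sstable_sizes_mb(path), target_mb=...)` with the current "sstable_size_in_mb" value gives a quick list of files that were probably written under a different setting.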

So, problem solved now :-)

Cheers,
Reynald


On Fri, Aug 31, 2018 at 7:45 AM Reynald Borer wrote:

> Hi everyone,
>
> I'm running a Cassandra 1.2.19 cluster of 40 nodes and compactions of a
> specific column family are sporadically raising an AssertionError like this
> (full stack trace visible under
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a):
>
> ERROR [CompactionExecutor:9137] 2018-08-27 11:43:05,197 org.apache.cassandra.service.CassandraDaemon - Exception in thread Thread[CompactionExecutor:9137,1,main]
> java.lang.AssertionError: 2
>     at org.apache.cassandra.db.compaction.LeveledManifest.replace(LeveledManifest.java:267)
>
> The data written in this column family can be seen as wide rows, that is,
> rows with lots of columns. Each column has a TTL of 7 days though.
>
> Whenever this happens, it seems to block compactions of this column family
> (I see the pending compactions increasing) until I restart the failing node.
>
> I have searched JIRA and this mailing list for this issue without much
> luck. I suspect it may be related to
> https://issues.apache.org/jira/browse/CASSANDRA-6563, although it's hard
> for me to confirm.
>
> I know this version is pretty old, but does this issue ring a bell for any
> of you?
>
> Here are some more details about my cluster:
>
> - it is composed of 40 nodes
> - it is pretty old and I'm in the process of upgrading it; it previously
> ran without issues under versions 1.0.12 and 1.1.12
> - it really affects only a single column family (schema can be seen at
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a#file-schema-txt)
> - my cluster is set up with RandomPartitioner (inherited from when it was
> set up on version 0.7) and a replication factor of 3
> - it's running weekly repairs (and this assertion happens mostly during
> repairs)
> - I also noticed that since the cluster was upgraded to 1.2.19, the disk
> size of this column family keeps increasing (it went from 400G to
> 1.2T!)
>
> Thanks in advance for your help.
>
> Best regards,
> Reynald


1.2.19: AssertionError when running compactions on a CF with TTLed columns

2018-08-30 Thread Reynald Borer
Hi everyone,

I'm running a Cassandra 1.2.19 cluster of 40 nodes and compactions of a
specific column family are sporadically raising an AssertionError like this
(full stack trace visible under
https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a):

ERROR [CompactionExecutor:9137] 2018-08-27 11:43:05,197 org.apache.cassandra.service.CassandraDaemon - Exception in thread Thread[CompactionExecutor:9137,1,main]
java.lang.AssertionError: 2
    at org.apache.cassandra.db.compaction.LeveledManifest.replace(LeveledManifest.java:267)

The data written in this column family can be seen as wide rows, that is,
rows with lots of columns. Each column has a TTL of 7 days though.

Whenever this happens, it seems to block compactions of this column family
(I see the pending compactions increasing) until I restart the failing node.
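
To spot affected nodes without waiting for pending compactions to pile up (as reported by `nodetool compactionstats`), the log can be scanned for this error. The sketch below is a minimal example that assumes each log entry starts with a header like the one quoted above, on a single unwrapped line, with the exception class on the line right after it; the function name is made up for illustration.

```python
import re

# Matches the start of an ERROR entry logged by a compaction thread and
# captures the thread name and the timestamp (without milliseconds).
HEADER = re.compile(
    r"ERROR \[(CompactionExecutor:\d+)\] "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def compaction_assertions(lines):
    """Return (thread, timestamp) for each CompactionExecutor ERROR entry
    whose next line is a java.lang.AssertionError."""
    hits = []
    pending = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            pending = (match.group(1), match.group(2))
            continue
        if pending and line.startswith("java.lang.AssertionError"):
            hits.append(pending)
        # Any non-header line ends the window for the previous header.
        pending = None
    return hits
```

Feeding it the lines of a node's log file (for example `open(path).read().splitlines()`) returns one tuple per occurrence, which is enough to tell which nodes need a restart.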

I have searched JIRA and this mailing list for this issue without much
luck. I suspect it may be related to
https://issues.apache.org/jira/browse/CASSANDRA-6563, although it's hard
for me to confirm.

I know this version is pretty old, but does this issue ring a bell for any
of you?

Here are some more details about my cluster:

- it is composed of 40 nodes
- it is pretty old and I'm in the process of upgrading it; it previously
ran without issues under versions 1.0.12 and 1.1.12
- it really affects only a single column family (schema can be seen at
https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a#file-schema-txt)
- my cluster is set up with RandomPartitioner (inherited from when it was
set up on version 0.7) and a replication factor of 3
- it's running weekly repairs (and this assertion happens mostly during
repairs)
- I also noticed that since the cluster was upgraded to 1.2.19, the disk
size of this column family keeps increasing (it went from 400G to
1.2T!)

Thanks in advance for your help.

Best regards,
Reynald