Re: many instances of org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1 on the heap

2020-08-04 Thread jelmer
It happened again today and I had a bit more time to probe. It seems
all non-periodic tasks execute on a single thread, so if that thread were
to get stuck, work would simply pile up until the node runs out of memory.
I took a series of stack dumps and they always looked something like this:

"NonPeriodicTasks:1" #103 daemon prio=5 os_prio=0 tid=0x7febe8342400 nid=0x4103 runnable [0x7febc78ed000]
   java.lang.Thread.State: RUNNABLE
        at com.google.common.collect.Iterators$7.computeNext(Iterators.java:652)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
        at com.github.benmanes.caffeine.cache.LocalCache.invalidateAll(LocalCache.java:108)
        at com.github.benmanes.caffeine.cache.LocalManualCache.invalidateAll(LocalManualCache.java:79)
        at org.apache.cassandra.cache.ChunkCache.invalidateFile(ChunkCache.java:197)
        at org.apache.cassandra.io.util.FileHandle$Cleanup.lambda$tidy$0(FileHandle.java:207)
        at org.apache.cassandra.io.util.FileHandle$Cleanup$$Lambda$217/794936631.accept(Unknown Source)
        at java.util.Optional.ifPresent(Optional.java:159)
        at org.apache.cassandra.io.util.FileHandle$Cleanup.tidy(FileHandle.java:207)
        at org.apache.cassandra.utils.concurrent.Ref$GlobalState.release(Ref.java:326)
        at org.apache.cassandra.utils.concurrent.Ref$State.ensureReleased(Ref.java:204)
        at org.apache.cassandra.utils.concurrent.Ref.ensureReleased(Ref.java:129)
        at org.apache.cassandra.utils.concurrent.SharedCloseableImpl.close(SharedCloseableImpl.java:45)
        at org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1.run(SSTableReader.java:2231)


The thread executing these tasks was always at 100% CPU.
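
To illustrate the pile-up described above, here is a minimal sketch using only java.util.concurrent. It is not Cassandra's own executor: the class and method names are hypothetical, and a latch stands in for the task that never finishes. It only shows that, with one worker thread, everything submitted behind a stuck task stays in the queue and on the heap.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;

// Hypothetical demo class; a plain JDK executor stands in for the real one.
class SingleThreadBacklogDemo {

    // Submits one task that blocks, then `tasks` trivial ones, and returns
    // how many are still waiting in the queue while the first is stuck.
    static int backlogBehindStuckTask(int tasks) {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // The "stuck" task: occupies the only worker until released.
            executor.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Wait until the worker has actually picked the stuck task up.
            started.await();

            // Every later task can only be queued, never run.
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> { });
            }
            return executor.getQueue().size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Unblock the worker so the demo can drain the queue and exit.
            release.countDown();
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("tasks queued behind the stuck one: "
                + backlogBehindStuckTask(1000));
    }
}
```

In the real case each queued task would be one of the InstanceTidier runnables from the subject line, which would be consistent with many of them showing up in a heap dump.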

One would expect invalidating a local cache to be a cheap operation,
yet it is not. What could cause chunk cache invalidation to be slow?
Cassandra does seem to be using an old version of Caffeine, and there have
been issues  with it in
the past where it would go into an endless loop under the wrong set of
circumstances.
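
One possible reading of the trace, which nothing in this thread confirms: the Iterators$7 frame under invalidateAll looks like a lazily evaluated iterator over the cache's keys, so each invalidateFile call may be walking the whole cache to find the keys of one file. The sketch below assumes exactly that and uses a plain ConcurrentHashMap instead of Caffeine. The class, the key format and the method names are all hypothetical.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo; a ConcurrentHashMap stands in for the chunk cache.
class KeyScanInvalidationDemo {

    // Hypothetical key format: file path plus chunk index.
    static String key(String path, int chunk) {
        return path + "#" + chunk;
    }

    // Removes every entry belonging to `path` by scanning all keys.
    // Returns how many keys had to be visited to do so.
    static int invalidateFile(Map<String, byte[]> cache, String path) {
        int visited = 0;
        String prefix = path + "#";
        for (Iterator<String> it = cache.keySet().iterator(); it.hasNext(); ) {
            String k = it.next();
            visited++;
            if (k.startsWith(prefix)) {
                it.remove();
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = new ConcurrentHashMap<>();
        for (String file : new String[] {"a", "b", "c"}) {
            for (int chunk = 0; chunk < 100; chunk++) {
                cache.put(key(file, chunk), new byte[0]);
            }
        }
        int visited = invalidateFile(cache, "b");
        System.out.println("visited " + visited + " keys, "
                + cache.size() + " entries left");
    }
}
```

If the assumption holds, the cost of one invalidation grows with the total number of cached chunks rather than with the size of the file being dropped, so thousands of SSTable releases against a large cache could keep a single thread busy for a long time even without an endless loop.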




On Mon, 3 Aug 2020 at 13:52, jelmer  wrote:

> It did look like there were repairs running at the time. The
> LiveSSTableCount for the entire node is about 2200 SSTables; for the keyspace
> that was being repaired it's just 150.
>
> We run Cassandra 3.11.6, so we should be unaffected by CASSANDRA-14096.
>
> We use http://cassandra-reaper.io/ for the repairs
>
>
>
> On Sat, 1 Aug 2020 at 01:49, Erick Ramirez 
> wrote:
>
>> I don't have specific experience relating to InstanceTidier but when I
>> saw this, I immediately thought of repairs blowing up the heap. 40K
>> instances indicates to me that you have thousands of SSTables -- are they
>> tiny (like 1MB or less)? Otherwise, are they dense nodes (~1TB or more)?
>>
>> How do you run repairs? I'm wondering if it's possible that there are
>> multiple repairs running in parallel like a cron job kicking in while the
>> previous repair is still running.
>>
>> You didn't specify your C* version but my guess is that it's pre-3.11.5.
>> FWIW the repair issue I'm referring to is CASSANDRA-14096 [1].
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-14096
>>
>


Re: many instances of org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1 on the heap

2020-08-03 Thread jelmer
It did look like there were repairs running at the time. The
LiveSSTableCount for the entire node is about 2200 SSTables; for the keyspace
that was being repaired it's just 150.

We run Cassandra 3.11.6, so we should be unaffected by CASSANDRA-14096.

We use http://cassandra-reaper.io/ for the repairs



On Sat, 1 Aug 2020 at 01:49, Erick Ramirez 
wrote:

> I don't have specific experience relating to InstanceTidier but when I
> saw this, I immediately thought of repairs blowing up the heap. 40K
> instances indicates to me that you have thousands of SSTables -- are they
> tiny (like 1MB or less)? Otherwise, are they dense nodes (~1TB or more)?
>
> How do you run repairs? I'm wondering if it's possible that there are
> multiple repairs running in parallel like a cron job kicking in while the
> previous repair is still running.
>
> You didn't specify your C* version but my guess is that it's pre-3.11.5.
> FWIW the repair issue I'm referring to is CASSANDRA-14096 [1].
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-14096
>


Re: many instances of org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1 on the heap

2020-07-31 Thread Erick Ramirez
Oh, I just saw on ASF Slack that you were already discussing it earlier
today with driftx in the #cassandra channel. Cheers!



Re: many instances of org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1 on the heap

2020-07-31 Thread Erick Ramirez
I don't have specific experience relating to InstanceTidier but when I saw
this, I immediately thought of repairs blowing up the heap. 40K instances
indicates to me that you have thousands of SSTables -- are they tiny (like
1MB or less)? Otherwise, are they dense nodes (~1TB or more)?

How do you run repairs? I'm wondering if it's possible that there are
multiple repairs running in parallel like a cron job kicking in while the
previous repair is still running.

You didn't specify your C* version but my guess is that it's pre-3.11.5.
FWIW the repair issue I'm referring to is CASSANDRA-14096 [1].

[1] https://issues.apache.org/jira/browse/CASSANDRA-14096