As more general advice, I'd strongly encourage you to update to 3.11.x from
2.2.8.  My personal experience is that it's significantly faster and more
space-efficient, and the garbage collection behavior under pressure is
drastically better.  There's also improved tooling for diagnosing
performance issues.

Also for narrowing down performance issues, I've had good luck with the
"ttop" module of Swiss Java Knife and with the async-profiler tool:
https://github.com/jvm-profiling-tools/async-profiler
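
In case it's useful, the invocations I usually start with look roughly like
this (the PID, sjk.jar path, and output file below are placeholders, and the
exact flags can vary between tool versions, so double-check each project's
README):

  # per-thread CPU usage, top 20 threads, via Swiss Java Knife
  java -jar sjk.jar ttop -p <cassandra-pid> -n 20 -o CPU

  # 60-second CPU profile / flame graph via async-profiler
  # (older async-profiler releases write .svg output rather than .html)
  ./profiler.sh -d 60 -f /tmp/cassandra-cpu.html <cassandra-pid>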

On Thu, Jul 1, 2021 at 5:42 AM Fred Habash <fmhab...@gmail.com> wrote:

> Great. Thanks.
>
> I examined the debug logs, and from the time c* starts until it crashes
> (roughly a 30-minute window) the log keeps spitting out the exact same messages.
>
> Google research leads to no clear answers as to ...
>
> - What does 'evicting cold readers' mean?
> - Why is it evicting readers for the same sstables throughout, i.e. from start to crash?
> - What does it mean for the AbstractQueryPager to say 'remaining rows to
> page: 2147483646'?
> - How is all of the above related?
>
> My Google research turned up a similar inquiry where the response alluded to
> a race condition and recommended upgrading to c* 3:
>
> Depending on the actual version, you may be running into a race condition
> or NPE that's not allowing the files to close properly. Try upgrading to
> the latest version of 3.x.
>
> Another hit referenced a page callback:
>
> *page callback that does not have an executor assigned to it*
>
>
> These messages repeat from start to crash:
> -----------------------------------------------------------------------
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-2] 2021-06-30 12:34:51,543
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/XXXX/YYYYY-cf0c43b028e811e68f2b1b695a8d5b2c/
>
> Eventually, the shared pool worker crashes
> --------------------------------------------------------------
> b-3223-big-Data.db
> WARN  [SharedPool-Worker-55] 2021-06-30 19:55:41,677
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[SharedPool-Worker-55,5,main]: {}
> java.lang.IllegalArgumentException: Not enough bytes. Offset: 39289.
> Length: 20585. Buffer size: 57984
> at
> org.apache.cassandra.db.composites.AbstractCType.checkRemaining(AbstractCType.java:362)
> ~[apache-cassandra-2.2.8.jar:2.2.8]
>
> On Wed, Jun 30, 2021 at 7:53 PM Kane Wilson <k...@raft.so> wrote:
>
>> Looks like it's doing a lot of reads immediately on startup
>> (AbstractQueryPager) which is potentially causing a lot of GC (guessing
>> that's what caused the StatusLogger).
>>
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
>>
>> is quite suspicious. You'll want to find out what query is causing a
>> massive scan at startup; you'll probably need to have a look through the
>> start of the logs to get a better idea of what's happening.
>>
>> On Thu, Jul 1, 2021 at 5:14 AM Fred Habash <fmhab...@gmail.com> wrote:
>>
>>> I have a node in the cluster where, when I start c*, the CPU reaches 100%
>>> with the java process on top. Within a few minutes, the JVM becomes
>>> unstable ('JVM instability' messages appear in system.log) and c* crashes.
>>>
>>> Once c* is up, the cluster's average read latency reaches multiple seconds
>>> and client apps are unhappy. For now, the only way out is to drain the node
>>> and let the cluster latency settle.
>>>
>>> None of these measures helped ...
>>> 1. Rebooting the EC2 instance
>>> 2. Replacing the EC2 instance altogether (new EC2, new c* install, etc.)
>>> 3. Stopping compactions (as a diagnostic measure)
>>>
>>> I'm trying to understand why the java process is chewing so much CPU, i.e.
>>> what is actually happening ...
>>>
>>> I see these messages in the debug.log. What functional task do these
>>> messages relate to, e.g. compactions?
>>>
>>>
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:95 - Fetched 1 live rows
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
>>> considering pager exhausted
>>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>>> MemtablePostFlush                 0         0             29         0
>>>             0
>>>
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
>>> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
>>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>>> ValidationExecutor                0         0              0         0
>>>             0
>>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>>> Sampler                           0         0              0         0
>>>             0
>>> INFO  [Service Thread] 2021-06-30 13:39:04,767 StatusLogger.java:56 -
>>> MemtableFlushWriter               0         0              6         0
>>>             0
>>> INFO  [Service Thread] 2021-06-30 13:39:04,767 StatusLogger.java:56 -
>>> InternalResponseStage             0         0              4         0
>>>             0
>>> DEBUG [SharedPool-Worker-131] 2021-06-30 13:39:05,078
>>> StorageProxy.java:1467 - Read timeout; received 1 of 2 responses (only
>>> digests)
>>> DEBUG [SharedPool-Worker-131] 2021-06-30 13:39:05,079
>>> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
>>> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
>>> DEBUG [SharedPool-Worker-158] 2021-06-30 13:39:05,079
>>> StorageProxy.java:1467 - Read timeout; received 1 of 2 responses (only
>>> digests)
>>> DEBUG [SharedPool-Worker-158] 2021-06-30 13:39:05,079
>>> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
>>> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
>>> DEBUG [SharedPool-Worker-90] 2021-06-30 13:39:05,080 StorageProxy.jav
>>> ....
>>> DEBUG [SharedPool-Worker-26] 2021-06-30 13:39:01,842
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5069-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,847
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5592-big-Data.db
>>> DEBUG [SharedPool-Worker-5] 2021-06-30 13:39:01,849
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-3993-big-Data.db
>>> DEBUG [SharedPool-Worker-5] 2021-06-30 13:39:01,849
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5927-big-Data.db
>>> DEBUG [SharedPool-Worker-5] 2021-06-30 13:39:01,849
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-1276-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5949-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-865-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5741-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-4098-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-1662-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-1339-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-4598-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,855
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-3676-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,855
>>> FileCacheService.java:102 - Evicting cold readers for
>>> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-2814-big-Data.db
>>> DEBUG [SharedPool-Worker-12] 2021-
>>>
>>> We are using c* 2.2.8
>>>
>>>
>>> ----------------------------------------
>>> Thank you
>>>
>>>
>>>
>>
>> --
>> raft.so - Cassandra consulting, support, and managed services
>>
>
>
> --
>
> ----------------------------------------
> Thank you
>
>
>
