Re: Cassandra crashed with OOM, and the system.log and debug.log doesn't match.

qiang zhang Tue, 11 Jul 2017 00:40:13 -0700

Thanks for your explanation!

> It's taking a full minute to sync your memtable to disk. This is either
indication that your disk is broken, or your JVM is pausing for GC.
The disk is ok, the long time JVM pausing happens many times, I didn't
disable the paging file in windows, may be that's the reason.


> That's a bit thin - depending on data model and data volume, you may be
able to construct a read that fills up your 3G heap, and causes you to OOM
with a single read. How much data is involved?  What does 'nodetool
tablestats' look like, and finally, how many reads/seconds are you doing on
this workload?
The total data size in this database is about 45GB, the read operation is
like this:
SELECT value FROM kairosdb.data_points WHERE key =
44617461506572665f3130000000015cf125b800000d6b6169726f735f646f75626c656974656d5f6e616d653d6974656d5f333a
AND column1 >= 4d6d2ac0 AND column1 <= 4ddb07c1 THRIFT LIMIT (partitions=1,
cells_per_partition=10000)
I think this operation can limit the max size of the result data, is the
memory usage of this operation also limited?

I've removed all the data already and start another test using the same
case. So the 'nodetool tablestats' information is not available now, I've
remember that the sstable file number of that column family is not too much
(less than 100).

I'm doing 10000 querys in about 300 seconds, and doing that once for every
30 minutes, while the same time, I'm inserting 10000 datas every second.

I'm hoping to find a way to prevent OOM happens, and keep cassandra node
alive for longer time. Is there any suggestions?

Thanks.


2017-07-11 0:18 GMT+08:00 Jeff Jirsa <jji...@apache.org>:

>
>
> On 2017-07-10 02:07 (-0700), 张强 <tzx...@gmail.com> wrote:
> > Hi experts, I've a single cassandra 3.11.0 node working with kairosdb (a
> > time series database), after running 4 days with stable workload, the
> > database client start to get "request errors", but there are not a lot of
> > error or warning messages in the cassandra log file, the client start to
> > receive error message at about 7-7 21:03:00, and kairosdb keep retrying
> > after that time, but there isn't much logs in the cassandra log file.
> > I've noticed the abnormal status at about 7-8 16:00:00, then I've typed a
> > "nodetool tablestats" command to get some information, the command got an
> > error, and while that time, the cassandra process start to crash, and
> > generated a dump file.
> > After C* shutdown, I take the logs to see what happened, and I found
> > something strange inside the logs.
> >
> > 1. In the system.log, there are two lines shows that no logs between
> > 2017-07-07 21:03:50 to 2017-07-08 16:07:33, I think that is a pretty long
> > period without any logs, and in gc.log file, there are a lot of logs
> shows
> > long time GC, that should be logged in system.log.
> > INFO  [ReadStage-1] 2017-07-07 21:03:50,824 NoSpamLogger.java:91 -
> Maximum
> > memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>
> Failing to allocate during read stage is a good indication that you're out
> of memory - either the heap is too small, or it's a direct memory
> allocation failure, or something, but that log line probably shouldn't be
> at INFO, because it seems like it's probably hiding a larger problem.
>
> > WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2017-07-08 16:07:33,347
> > NoSpamLogger.java:94 - Out of 1 commit log syncs over the past 0.00s with
> > average duration of 60367.73ms, 1 have exceeded the configured commit
> > interval by an average of 50367.73ms
>
> It's taking a full minute to sync your memtable to disk. This is either
> indication that your disk is broken, or your JVM is pausing for GC.
>
> >
> > 2. In the system.log, there is a log shows very long time GC, and then
> the
> > C* start to close.
> > WARN  [ScheduledTasks:1] 2017-07-08 16:07:46,846 NoSpamLogger.java:94 -
> > Some operations timed out, details available at debug level (debug.log)
> > WARN  [Service Thread] 2017-07-08 16:10:36,114 GCInspector.java:282 -
> > ConcurrentMarkSweep GC in 688850ms.  CMS Old Gen: 2114938312 ->
> 469583832;
> > Par Eden Space: 837584 -> 305319752; Par Survivor Space: 41943040 ->
> > 25784008
> > ......
> > ERROR [Thrift:22] 2017-07-08 16:10:56,322 CassandraDaemon.java:228 -
> > Exception in thread Thread[Thrift:22,5,main]
> > java.lang.OutOfMemoryError: Java heap space
>
> You ran out of heap. We try to clean up and kill things when this happens,
> but by definition, the JVM is in an undefined state, and we may not be able
> to shut things down properly.
>
> >
> > 3. In the debug.log, the last INFO level log is at 2017-07-07 14:43:59,
> the
> > log is:
> > INFO  [IndexSummaryManager:1] 2017-07-07 14:43:59,967
> > IndexSummaryRedistribution.java:75 - Redistributing index summaries
> > After that, there are DEBUG level logs until 2017-07-07 21:11:34, but no
> > more INFO level or other level logs in that log file, while there are
> still
> > many logs in the system.log after 2017-07-07 14:43:59. Why doesn't these
> > two log files match?
> >
> > My hardware is 4 core cpu and 12G ram, and I'm using windows server 2012
> > r2.
>
> That's a bit thin - depending on data model and data volume, you may be
> able to construct a read that fills up your 3G heap, and causes you to OOM
> with a single read. How much data is involved?  What does 'nodetool
> tablestats' look like, and finally, how many reads/seconds are you doing on
> this workload?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>

Re: Cassandra crashed with OOM, and the system.log and debug.log doesn't match.

Reply via email to