If the issue is related to I/O, you're going to want to determine whether you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is.
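For example, a saturated disk looks something like this (columns abridged and numbers invented for illustration):

    $ iostat -dmx 1
    Device:    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
    vda      12.00  880.00   0.40  46.00     96.50     78.40  105.20   1.12  99.80

A sustained avgqu-sz in the double digits with %util pinned near 100 means the device cannot keep up with what Cassandra is asking of it.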
On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> Hello Maxime
>
> Increasing the flush writers won't help if your disk I/O is not keeping up.
>
> I've had a look into the log file; below are some remarks:
>
> 1) There are a lot of SSTables on disk for some tables (events, for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction).
>
> 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of the loc, which would lead to an almost 1-to-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark.
>
> 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw:
>
> INFO [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0 files(0 bytes)
>
> INFO [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0 files(0 bytes)
>
> INFO [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending 0 files(0 bytes)
>
> INFO [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes), sending 0 files(0 bytes)
>
> As you can see, the existing 4 nodes are streaming data to the new node, and on average the data set size is about 3.3 - 4.5 GB per node. However, the SSTable count is around 150 files for nodes xxxx.xxxx.xxxx.20 and xxxx.xxxx.xxxx.81, but goes through the roof to reach 1315 files for xxxx.xxxx.xxxx.71 and 1640 files for xxxx.xxxx.xxxx.217.
>
> The total data set size is roughly the same, but the file count is 10x higher, which means you'll have a bunch of tiny files.
>
> I guess that upon reception of those files there will be a massive flush to disk, explaining the behaviour you're facing (a flush storm).
>
> I would suggest looking at nodes xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 and checking the total SSTable count for each table to confirm this intuition.
>
> Regards
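(For anyone following along: the per-table SSTable count DuyHai suggests checking is reported by nodetool cfstats. The keyspace name and figures below are placeholders, not output from Maxime's cluster:)

    $ nodetool cfstats mykeyspace
    ...
        Table: events
        SSTable count: 1315
    ...

A count in the thousands for a table holding only a few GB would confirm the "bunch of tiny files" theory.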
> On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maxim...@gmail.com> wrote:
>>
>> I've emailed you a raw log file of an instance of this happening.
>>
>> I've been monitoring the timing of events in tpstats and the logs more closely, and I believe this is what is happening:
>>
>> - For some reason, C* decides to provoke a flush storm (I say "some reason" because I'm sure there is one, but I have had difficulty pinning down how this behaviour changed between 1.* and more recent releases).
>> - So we see ~3000 flushes being enqueued.
>> - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see "all time blocked" numbers for it before C* stops responding; I suspect this is due to the sudden OOM and GC occurring.
>> - The last tpstats that comes back before the node goes down indicates 20 active, 3000 pending, and the rest 0. It's by far the most anomalous activity.
>>
>> Is there a way to throttle down this generation of flushes? C* complains if I set the queue_size to any value (deprecated now?), and boosting the threads does not seem to help, since even at 20 we're an order of magnitude off.
>>
>> Suggestions? Comments?
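(The knobs in play here live in cassandra.yaml. A minimal sketch with the 2.1-era names follows; the values are placeholders, not recommendations. The old memtable_flush_queue_size setting is deprecated in 2.1, which would explain the complaint above, and the SlabPoolCleaner log lines quoted further down in the thread are this cleanup mechanism firing:)

    # cassandra.yaml (2.1-era settings; values are placeholders)
    memtable_flush_writers: 2          # parallel flush threads; more only helps if the disk keeps up
    memtable_heap_space_in_mb: 1024    # on-heap memtable budget (defaults to 1/4 of the heap)
    memtable_cleanup_threshold: 0.25   # fraction of the budget that, once dirty, triggers flushing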
>> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>> Hello Maxime
>>>
>>> Can you put the complete logs and config somewhere? It would be interesting to know the cause of the OOM.
>>>
>>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:
>>>>
>>>> Thanks a lot, that is comforting. We are also small at the moment, so I can definitely relate to the idea of keeping things small and simple, at a level where it just works.
>>>>
>>>> I see the new Apache version has a lot of fixes, so I will try to upgrade before I look into downgrading.
>>>>
>>>> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:
>>>>>
>>>>> Since no one else has stepped in...
>>>>>
>>>>> We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance storage. It works fine, but you can see those little puppies struggle...
>>>>>
>>>>> And I ran into problems such as you observe...
>>>>>
>>>>> Upgrading Java to the latest 1.7 and - most importantly - reverting to the default configuration, esp. for heap, seemed to settle things down completely. Also make sure that you are using the 'recommended production settings' from the docs on your boxen.
>>>>>
>>>>> However, we are running 2.0.x, not 2.1.0, so YMMV.
>>>>>
>>>>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD storage - still a 'small' machine, but much more reasonable for C*.
>>>>>
>>>>> However, I can't say I am an expert, since I deliberately keep things so simple that we do not encounter problems - it just works, so I dig into other stuff.
>>>>>
>>>>> ml
>>>>>
>>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:
>>>>>>
>>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now.
>>>>>>
>>>>>> I started by adding a node similar to my current configuration: 4 GB of RAM + 2 cores on DigitalOcean. However, every time I would end up getting OOM errors after many log entries of the type:
>>>>>>
>>>>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap
>>>>>>
>>>>>> leading to:
>>>>>>
>>>>>> ka-120-Data.db (39291 bytes) for commitlog position ReplayPosition(segmentId=1414243978538, position=23699418)
>>>>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-13,5,main]: {}
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>> Thinking it had to do with either compaction or streaming - two activities I've had tremendous issues with in the past - I tried slowing setstreamthroughput down to extremely low values, all the way to 5. I also tried setting setcompactionthroughput to 0, and then, after reading that in some cases it might be too fast, down to 8. Nothing worked; it merely changed the mean time to OOM, but not in a way indicating that either was anywhere near a solution.
>>>>>>
>>>>>> The nodes were configured with 2 GB of heap initially; I tried to crank it up to 3 GB, stressing the host memory to its limit.
>>>>>>
>>>>>> After doing some exploration (I am considering writing up Cassandra ops documentation with lessons learned, since there seems to be little of it in organized fashion), I read that some people had strange issues on lower-end boxes like that, so I bit the bullet and upgraded my new node to an 8 GB + 4 core instance, which was anecdotally better.
>>>>>>
>>>>>> To my complete shock, the exact same issues are present, even after raising the heap to 6 GB. I figure it can't be a "normal" situation anymore, but must be a bug somehow.
>>>>>>
>>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes, and about 10 CFs of varying sizes. Runtime writes are between 300 and 900 per second. Cassandra 2.1.0, nothing too wild.
>>>>>>
>>>>>> Has anyone encountered these kinds of issues before? I would really enjoy hearing about the experiences of people trying to run small-sized clusters like mine. From everything I read, Cassandra operations go very well on large (16 GB + 8 core) machines, but I'm sad to report I've had nothing but trouble trying to run on smaller machines; perhaps I can learn from others' experience?
>>>>>>
>>>>>> Full logs can be provided to anyone interested.
>>>>>>
>>>>>> Cheers

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
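For reference, the stream and compaction throttles Maxime mentions are runtime nodetool settings (the values shown are the ones he tried; both can also be set permanently in cassandra.yaml):

    $ nodetool setstreamthroughput 5       # megabits/s per node; 0 disables the limit entirely
    $ nodetool setcompactionthroughput 8   # MB/s across all compactions; 0 means unthrottled, not stopped

Note the unit mismatch: streaming is capped in megabits per second (stream_throughput_outbound_megabits_per_sec), compaction in megabytes per second (compaction_throughput_mb_per_sec), and a value of 0 removes the throttle rather than pausing the activity, which is why "setting it to 0" can make an overloaded disk worse.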