Hello Mike, were you able to fix the issue? If so, what change helped you?
On Wed, Feb 24, 2016 at 5:36 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> Great that you found a specific release that triggers the problem - 2.1.x has a huge number of changes.
>
> How many partitions and rows do you have? What's the largest row count for a single partition? And all of these CQL tables are COMPACT STORAGE, correct? Are you writing a large number of skinny partitions or a smaller number of very wide partitions? It wouldn't surprise me if behavior for large partitions varies between releases, since they can be so memory-intensive.
>
> I see this change in 2.1.5 that could possibly introduce some memory usage: Write partition size estimates into a system table (CASSANDRA-7688)
>
> At this stage it would probably help for you to produce a reasonably small repro test case that you could file as a Jira. And if you could run that repro test case on 3.x to verify that the problem still exists, that would be helpful as well.
>
> How long does it take to repro the timeout?
>
> Can you repro the timeout using a single node?
>
> What is the pattern of the timeouts - just random and occasional, or heavy and continuous once they start?
>
> Are they occurring uniformly on all three nodes?
>
> If you bounce the cluster and continue testing, do the timeouts commence immediately, fairly soon, or only after about as long as they take from a clean fresh start?
>
> -- Jack Krupansky
>
> On Wed, Feb 24, 2016 at 7:04 PM, Mike Heffner <m...@librato.com> wrote:
>
>> Nate,
>>
>> We have run several install tests, bisecting the 2.1.x release line, and we believe that the regression was introduced in version 2.1.5. This is the first release that clearly hits the timeout for us.
>>
>> It looks like quite a large release, so our next step will likely be bisecting the major commits to see if we can narrow it down: https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt. Obviously, any suggestions on potential suspects are appreciated.
>>
>> These are the memtable settings we've changed from the defaults during our testing:
>>
>> memtable_allocation_type: offheap_objects
>> memtable_flush_writers: 8
>>
>> Cheers,
>>
>> Mike
>>
>> On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <n...@thelastpickle.com> wrote:
>>
>>> The biggest change which *might* explain your behavior has to do with the changes in memtable flushing between 2.0 and 2.1: https://issues.apache.org/jira/browse/CASSANDRA-5549
>>>
>>> However, the tpstats you posted show no dropped mutations; dropped mutations would have made me more certain of this as the cause.
>>>
>>> What values do you have right now for each of these (my recommendations for each on a c4.2xl with a stock cassandra-env.sh are in parentheses)?
>>>
>>> - memtable_flush_writers (2)
>>> - memtable_heap_space_in_mb (2048)
>>> - memtable_offheap_space_in_mb (2048)
>>> - memtable_cleanup_threshold (0.11)
>>> - memtable_allocation_type (offheap_objects)
>>>
>>> The biggest win IMO will be moving to offheap_objects. By default, everything is on heap. Regardless, spending some time tuning these for your workload will pay off.
>>>
>>> You may also want to be explicit about:
>>>
>>> - native_transport_max_concurrent_connections
>>> - native_transport_max_concurrent_connections_per_ip
>>>
>>> Depending on the driver, these may now be allowing 32k streams per connection(!), as detailed in v3 of the native protocol: https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152
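To make the client-side half of that concrete, here is a minimal sketch of capping in-flight requests per connection. It assumes the DataStax Java driver 3.x API (in 2.1-era drivers the analogous knob is setMaxSimultaneousRequestsPerConnectionThreshold); the contact point and limits are illustrative, not recommendations:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;
    import com.datastax.driver.core.Session;

    public class PoolLimits {
        public static void main(String[] args) {
            // Protocol v3 allows up to 32k concurrent streams per connection;
            // cap how many the driver will actually keep in flight.
            PoolingOptions pooling = new PoolingOptions()
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024)
                .setMaxRequestsPerConnection(HostDistance.REMOTE, 256);

            Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // illustrative contact point
                .withPoolingOptions(pooling)
                .build();
            Session session = cluster.connect();
            // ... issue writes ...
            cluster.close();
        }
    }

A driver-side cap like this bounds how many of protocol v3's streams a client can occupy, independent of the server-side native_transport settings above.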
>>>
>>> On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <m...@librato.com> wrote:
>>>
>>>> Anuj,
>>>>
>>>> We originally started testing with Java 8 + G1, but we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb package. We didn't detect any large GC pauses during the runs.
>>>>
>>>> The query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, with between 6 and 1500 rows per batch.
>>>>
>>>> Mike
>>>>
>>>> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
>>>>
>>>>> What's the GC overhead? Can you share your GC collector and settings?
>>>>>
>>>>> What's your query pattern? Do you use secondary indexes, batches, IN clauses, etc.?
>>>>>
>>>>> Anuj
>>>>>
>>>>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner <m...@librato.com> wrote:
>>>>>
>>>>> Alain,
>>>>>
>>>>> Thanks for the suggestions.
>>>>>
>>>>> Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages.
>>>>>
>>>>> Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses.
>>>>>
>>>>> We haven't noticed latency on any particular table that is higher than the others, or that correlates with the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload.
>>>>>
>>>>> I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release introduced it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike
>>>>>
>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here?
>>>>>>
>>>>>> Might this be due to some network hiccup accessing the disks, since they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses? (Use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)
>>>>>>
>>>>>> Something else you could check is the local write stats, to see if only one table is affected or whether this is keyspace or cluster wide. You can use the metrics exposed by Cassandra, or if you have no dashboards I believe a "nodetool cfstats <myks> | grep -e 'Table:' -e 'Local'" should give you a rough idea of local latencies.
>>>>>>
>>>>>> Those are just things I would check; I don't have a clue what is happening here, but I hope this helps.
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------
>>>>>> Alain Rodriguez
>>>>>> France
>>>>>>
>>>>>> The Last Pickle
>>>>>> http://www.thelastpickle.com
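For watching dropped messages continuously rather than via nodetool tpstats snapshots, here is a rough sketch of polling the dropped-MUTATION meter over JMX. The MBean name is an assumption based on the 2.1-era metrics layout, so verify it (e.g. in jconsole) against your build:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DroppedMutations {
        public static void main(String[] args) throws Exception {
            // Cassandra listens for JMX on port 7199 by default.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // Assumed MBean name; check it for your Cassandra version.
                ObjectName dropped = new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
                System.out.println("Dropped MUTATION messages: "
                    + mbs.getAttribute(dropped, "Count"));
            } finally {
                jmxc.close();
            }
        }
    }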
>>>>>>
>>>>>> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>
>>>>>>> Jaydeep,
>>>>>>>
>>>>>>> No, we don't use any lightweight transactions.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Are you guys using lightweight transactions in your write path?
>>>>>>>>
>>>>>>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <fabrice.faco...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Are your commitlog and data on the same disk? If yes, you should put the commitlogs on a separate disk that doesn't have a lot of IO.
>>>>>>>>>
>>>>>>>>> Other IO can have a great impact on your commitlog writes, and they may even block.
>>>>>>>>>
>>>>>>>>> An example of the impact IO can have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>>>>>>
>>>>>>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>>>>>
>>>>>>>>>> Jeff,
>>>>>>>>>>
>>>>>>>>>> We have both commitlog and data on a 4TB EBS volume with 10k IOPS.
>>>>>>>>>>
>>>>>>>>>> Mike
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> What disk size are you using?
>>>>>>>>>>>
>>>>>>>>>>> From: Mike Heffner
>>>>>>>>>>> Reply-To: "user@cassandra.apache.org"
>>>>>>>>>>> Date: Wednesday, February 10, 2016 at 2:24 PM
>>>>>>>>>>> To: "user@cassandra.apache.org"
>>>>>>>>>>> Cc: Peter Norton
>>>>>>>>>>> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>>>>>>>>>>
>>>>>>>>>>> Paulo,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the suggestion; we ran some tests against CMS and saw the same timeouts. On that note though, we are going to try doubling the instance sizes and testing with double the heap (even though current usage is low).
>>>>>>>>>>>
>>>>>>>>>>> Mike
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Are you using the same GC settings as the staging 2.0 cluster? If not, could you try using the default GC settings (CMS) and see if that changes anything? This is just a wild guess, but there were reports before of G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 for more context). Please ignore if you already tried reverting back to CMS.
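On the G1-vs-CMS question, a quick sketch for confirming which collectors a JVM is actually running with, using only the standard java.lang.management API (no Cassandra-specific assumptions; run it inside the process you care about, e.g. from a scratch tool with the same JVM flags):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcReport {
        public static void main(String[] args) {
            // Lists the collectors this JVM was started with (e.g. CMS vs. G1)
            // along with cumulative collection counts and total pause time.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }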
>>>>>>>>>>>>
>>>>>>>>>>>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long-time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (a mirror of our staging load).
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are writing at QUORUM and, while p95s look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle JDK 8u60, we're using G1GC, and any GC pauses are less than 500ms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We run on c4.2xl instances with GP2 EBS attached storage for the data and commitlog directories. The nodes are using EC2 enhanced networking and have the latest Intel network driver module. We are running on HVM instances using Ubuntu 14.04.2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is our cassandra.yaml: https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>>>>>>>>>>>
>>>>>>>>>>>>> Like I mentioned, we use 8u60 with G1GC and have used many of the GC settings in Al Tobey's tuning guide. This is our upstart config with JVM and other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've used several of the sysctl settings from Al's guide as well: https://gist.github.com/mheffner/ea40d58f58a517028152
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our client application is able to write using either Thrift batches via the Astyanax driver or CQL async INSERTs using the DataStax Java driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For testing against Thrift (our legacy infra uses this) we write batches of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is around 45ms, and our maximum (p100) sits below 150ms, except when it periodically spikes to the full 10 seconds.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Testing the same write path using CQL writes instead demonstrates similar behavior: low p99s except for periodic full timeouts. We enabled tracing for several operations but were unable to get a trace that completed successfully -- Cassandra started logging many messages such as:
>>>>>>>>>>>>>
>>>>>>>>>>>>> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node timeout
>>>>>>>>>>>>>
>>>>>>>>>>>>> And all the traces contained rows with a "null" source_elapsed row: https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've exhausted as many configuration permutations as we can think of. This cluster does not appear to be under any significant load, and latencies seem to largely fall into two bands: low normal or max timeout. This seems to imply that something is getting stuck and timing out at the max write timeout.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any suggestions on what to look for? We had debug enabled for a while, but we didn't see any message that pointed to something obvious. Happy to provide any more information that may help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are pretty much at the point of sprinkling debug around the code to track down what could be blocking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mike
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Mike Heffner <m...@librato.com>
>>>>>>>>>>>>> Librato, Inc.
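For reference, a minimal sketch of the async CQL INSERT pattern described above, using the DataStax Java driver at QUORUM; the keyspace, table, and columns are hypothetical stand-ins for the real schema:

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class AsyncWrites {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("myks");            // hypothetical keyspace
            PreparedStatement insert = session.prepare(
                "INSERT INTO metrics (key, ts, value) VALUES (?, ?, ?)"); // hypothetical table
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);  // writes at QUORUM, as above

            List<ResultSetFuture> futures = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                BoundStatement bound = insert.bind("row-" + i, (long) i, (double) i);
                futures.add(session.executeAsync(bound));
            }
            // Block until every write completes; a server-side write timeout
            // surfaces here as a WriteTimeoutException.
            for (ResultSetFuture f : futures) {
                f.getUninterruptibly();
            }
            cluster.close();
        }
    }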
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Close the World, Open the Net
>>>>>>>>> http://www.linux-wizard.net
>>>
>>> --
>>> -----------------
>>> Nate McCall
>>> Austin, TX
>>> @zznate
>>>
>>> Co-Founder & Sr. Technical Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>
>> --
>> Mike Heffner <m...@librato.com>
>> Librato, Inc.