Hello Mike, were you able to fix the issue? If so, what change helped you?
On Wed, Feb 24, 2016 at 5:36 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> Great that you found a specific release that triggers the problem - 2.1.x has a huge number of changes.
>
> How many partitions and rows do you have? What's the largest row count for a single partition? And all of these CQL tables are COMPACT STORAGE, correct? Are you writing a large number of skinny partitions or a smaller number of very wide partitions? It wouldn't surprise me if behavior for large partitions varies between releases, since they can be so memory-intensive.
>
> I see this change in 2.1.5 that could possibly introduce some memory usage: Write partition size estimates into a system table (CASSANDRA-7688)
>
> At this stage it would probably help for you to produce a reasonably small repro test case that you could file as a Jira. And if you could run that repro test case on 3.x to verify that the problem still exists, that would be helpful as well.
>
> How long does it take to repro the timeout?
>
> Can you repro the timeout using a single node?
>
> What is the pattern of the timeouts - just random and occasional, or heavy and continuous once they start?
>
> Are they occurring uniformly on all three nodes?
>
> If you bounce the cluster and continue testing, do the timeouts commence immediately, fairly soon, or only after about as long as they take from a clean fresh start?
>
> -- Jack Krupansky
>
> On Wed, Feb 24, 2016 at 7:04 PM, Mike Heffner <m...@librato.com> wrote:
>
>> Nate,
>>
>> We have run several install tests, bisecting the 2.1.x release line, and we believe that the regression was introduced in version 2.1.5. This is the first release that clearly hits the timeout for us.
>>
>> It looks like quite a large release, so our next step will likely be bisecting the major commits to see if we can narrow it down: https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt. Obviously, any suggestions on potential suspects are appreciated.
>>
>> These are the memtable settings we've changed from the defaults during our testing:
>>
>> memtable_allocation_type: offheap_objects
>> memtable_flush_writers: 8
>>
>> Cheers,
>>
>> Mike
>>
>> On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <n...@thelastpickle.com> wrote:
>>
>>> The biggest change which *might* explain your behavior has to do with the changes in memtable flushing between 2.0 and 2.1: https://issues.apache.org/jira/browse/CASSANDRA-5549
>>>
>>> However, the tpstats you posted show no dropped mutations; dropped mutations would have made me more certain of this as the cause.
>>>
>>> What values do you have right now for each of these (my recommendations for each on a c4.2xl with a stock cassandra-env.sh are in parentheses)?
>>>
>>> - memtable_flush_writers (2)
>>> - memtable_heap_space_in_mb (2048)
>>> - memtable_offheap_space_in_mb (2048)
>>> - memtable_cleanup_threshold (0.11)
>>> - memtable_allocation_type (offheap_objects)
>>>
>>> The biggest win IMO will be moving to offheap_objects. By default, everything is on heap. Regardless, spending some time tuning these for your workload will pay off.
>>>
>>> You may also want to be explicit about:
>>>
>>> - native_transport_max_concurrent_connections
>>> - native_transport_max_concurrent_connections_per_ip
>>>
>>> Depending on the driver, these may now be allowing 32k streams per connection(!), as detailed in v3 of the native protocol: https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152
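To make the client-side half of that concrete, here is a minimal sketch of capping in-flight requests per connection. It assumes the DataStax Java driver 3.x API (in 2.1-era drivers the analogous knob is setMaxSimultaneousRequestsPerConnectionThreshold); the contact point and limits are illustrative, not recommendations:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;
    import com.datastax.driver.core.Session;

    public class PoolLimits {
        public static void main(String[] args) {
            // Protocol v3 allows up to 32k concurrent streams per connection;
            // cap how many the driver will actually keep in flight.
            PoolingOptions pooling = new PoolingOptions()
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024)
                .setMaxRequestsPerConnection(HostDistance.REMOTE, 256);

            Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // illustrative contact point
                .withPoolingOptions(pooling)
                .build();
            Session session = cluster.connect();
            // ... issue writes ...
            cluster.close();
        }
    }

A driver-side cap like this bounds how many of protocol v3's streams a client can occupy, independent of the server-side native_transport settings above.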
>>>
>>> On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <m...@librato.com> wrote:
>>>
>>>> Anuj,
>>>>
>>>> We originally started testing with Java 8 + G1, but we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb package. We didn't detect any large GC pauses during the runs.
>>>>
>>>> The query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, with between 6 and 1500 rows per batch.
>>>>
>>>> Mike
>>>>
>>>> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
>>>>
>>>>> What's the GC overhead? Can you share your GC collector and settings?
>>>>>
>>>>> What's your query pattern? Do you use secondary indexes, batches, IN clauses, etc.?
>>>>>
>>>>> Anuj
>>>>>
>>>>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner <m...@librato.com> wrote:
>>>>>
>>>>> Alain,
>>>>>
>>>>> Thanks for the suggestions.
>>>>>
>>>>> Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages.
>>>>>
>>>>> Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses.
>>>>>
>>>>> We haven't noticed latency on any particular table that is higher than the others, or that correlates with the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload.
>>>>>
>>>>> I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release introduced it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike
>>>>>
>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here?
>>>>>>
>>>>>> Might this be due to some network hiccup accessing the disks, since they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses? (Use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)
>>>>>>
>>>>>> Something else you could check is the local write stats, to see if only one table is affected or whether this is keyspace or cluster wide. You can use the metrics exposed by Cassandra, or if you have no dashboards I believe a "nodetool cfstats <myks> | grep -e 'Table:' -e 'Local'" should give you a rough idea of local latencies.
>>>>>>
>>>>>> Those are just things I would check; I don't have a clue what is happening here, but I hope this helps.
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------
>>>>>> Alain Rodriguez
>>>>>> France
>>>>>>
>>>>>> The Last Pickle
>>>>>> http://www.thelastpickle.com
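For watching dropped messages continuously rather than via nodetool tpstats snapshots, here is a rough sketch of polling the dropped-MUTATION meter over JMX. The MBean name is an assumption based on the 2.1-era metrics layout, so verify it (e.g. in jconsole) against your build:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DroppedMutations {
        public static void main(String[] args) throws Exception {
            // Cassandra listens for JMX on port 7199 by default.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // Assumed MBean name; check it for your Cassandra version.
                ObjectName dropped = new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
                System.out.println("Dropped MUTATION messages: "
                    + mbs.getAttribute(dropped, "Count"));
            } finally {
                jmxc.close();
            }
        }
    }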
>>>>>>
>>>>>> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>
>>>>>>> Jaydeep,
>>>>>>>
>>>>>>> No, we don't use any lightweight transactions.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Are you guys using lightweight transactions in your write path?
>>>>>>>>
>>>>>>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <fabrice.faco...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Are your commitlog and data on the same disk? If yes, you should put the commitlogs on a separate disk that doesn't have a lot of IO.
>>>>>>>>>
>>>>>>>>> Other IO can have a great impact on your commitlog writes, and they may even block.
>>>>>>>>>
>>>>>>>>> An example of the impact IO can have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>>>>>>
>>>>>>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>>>>>
>>>>>>>>>> Jeff,
>>>>>>>>>>
>>>>>>>>>> We have both commitlog and data on a 4TB EBS volume with 10k IOPS.
>>>>>>>>>>
>>>>>>>>>> Mike
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> What disk size are you using?
>>>>>>>>>>>
>>>>>>>>>>> From: Mike Heffner
>>>>>>>>>>> Reply-To: "user@cassandra.apache.org"
>>>>>>>>>>> Date: Wednesday, February 10, 2016 at 2:24 PM
>>>>>>>>>>> To: "user@cassandra.apache.org"
>>>>>>>>>>> Cc: Peter Norton
>>>>>>>>>>> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>>>>>>>>>>
>>>>>>>>>>> Paulo,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the suggestion; we ran some tests against CMS and saw the same timeouts. On that note though, we are going to try doubling the instance sizes and testing with double the heap (even though current usage is low).
>>>>>>>>>>>
>>>>>>>>>>> Mike
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Are you using the same GC settings as the staging 2.0 cluster? If not, could you try using the default GC settings (CMS) and see if that changes anything? This is just a wild guess, but there were reports before of G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 for more context). Please ignore if you already tried reverting back to CMS.
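On the G1-vs-CMS question, a quick sketch for confirming which collectors a JVM is actually running with, using only the standard java.lang.management API (no Cassandra-specific assumptions; run it inside the process you care about, e.g. from a scratch tool with the same JVM flags):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcReport {
        public static void main(String[] args) {
            // Lists the collectors this JVM was started with (e.g. CMS vs. G1)
            // along with cumulative collection counts and total pause time.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }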
>>>>>>>>>>>>
>>>>>>>>>>>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long-time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (a mirror of our staging load).
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are writing at QUORUM and, while p95s look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle JDK 8u60, we're using G1GC, and any GC pauses are less than 500ms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We run on c4.2xl instances with GP2 EBS attached storage for the data and commitlog directories. The nodes are using EC2 enhanced networking and have the latest Intel network driver module. We are running on HVM instances using Ubuntu 14.04.2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is our cassandra.yaml: https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>>>>>>>>>>>
>>>>>>>>>>>>> Like I mentioned, we use 8u60 with G1GC and have used many of the GC settings in Al Tobey's tuning guide. This is our upstart config with JVM and other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've used several of the sysctl settings from Al's guide as well: https://gist.github.com/mheffner/ea40d58f58a517028152
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our client application is able to write using either Thrift batches via the Astyanax driver or CQL async INSERTs using the DataStax Java driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For testing against Thrift (our legacy infra uses this) we write batches of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is around 45ms, and our maximum (p100) sits below 150ms, except when it periodically spikes to the full 10 seconds.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Testing the same write path using CQL writes instead demonstrates similar behavior: low p99s except for periodic full timeouts. We enabled tracing for several operations but were unable to get a trace that completed successfully -- Cassandra started logging many messages such as:
>>>>>>>>>>>>>
>>>>>>>>>>>>> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node timeout
>>>>>>>>>>>>>
>>>>>>>>>>>>> And all the traces contained rows with a "null" source_elapsed row: https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've exhausted as many configuration permutations as we can think of. This cluster does not appear to be under any significant load, and latencies seem to largely fall into two bands: low normal or max timeout. This seems to imply that something is getting stuck and timing out at the max write timeout.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any suggestions on what to look for? We had debug enabled for a while, but we didn't see any message that pointed to something obvious. Happy to provide any more information that may help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are pretty much at the point of sprinkling debug around the code to track down what could be blocking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mike
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Mike Heffner <m...@librato.com>
>>>>>>>>>>>>> Librato, Inc.
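For reference, a minimal sketch of the async CQL INSERT pattern described above, using the DataStax Java driver at QUORUM; the keyspace, table, and columns are hypothetical stand-ins for the real schema:

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class AsyncWrites {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("myks");            // hypothetical keyspace
            PreparedStatement insert = session.prepare(
                "INSERT INTO metrics (key, ts, value) VALUES (?, ?, ?)"); // hypothetical table
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);  // writes at QUORUM, as above

            List<ResultSetFuture> futures = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                BoundStatement bound = insert.bind("row-" + i, (long) i, (double) i);
                futures.add(session.executeAsync(bound));
            }
            // Block until every write completes; a server-side write timeout
            // surfaces here as a WriteTimeoutException.
            for (ResultSetFuture f : futures) {
                f.getUninterruptibly();
            }
            cluster.close();
        }
    }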
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Close the World, Open the Net
>>>>>>>>> http://www.linux-wizard.net
>>>
>>> --
>>> -----------------
>>> Nate McCall
>>> Austin, TX
>>> @zznate
>>>
>>> Co-Founder & Sr. Technical Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>
>> --
>> Mike Heffner <m...@librato.com>
>> Librato, Inc.