Re: storing indexes on ssd
On a single node that's a bit less than half full, the index files are 87G. How will OS disk cache know to keep the index file blocks cached but not cache blocks from the data files? As far as I know it is not smart enough to handle that gracefully.

Re: ram expensiveness, see https://www.extremetech.com/computing/263031-ram-prices-roof-stuck-way -- it's really not an important point though, ram is still far more expensive than disk, regardless of whether the price has been going up.

On Tue, Feb 13, 2018 at 12:02 AM, Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:

> On Tue, Feb 13, 2018 at 1:30 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> Created https://issues.apache.org/jira/browse/CASSANDRA-14229
>
> This is confusing. You've already started the conversation here...
>
> How big are your index files in the end? Even if Cassandra doesn't cache
> them in or (off-) heap, they might as well just fit into the OS disk cache.
>
> From your ticket description:
>
> ... as ram continues to get more expensive,..
>
> Where did you get that from? I would expect quite the opposite.
>
> Regards,
> --
> Alex

--
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
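Whether an 87G index could stay resident at all is easy to sanity-check with arithmetic. A minimal sketch: the 87G figure comes from the message above, while the RAM and heap sizes below are hypothetical illustration values, not numbers from this cluster.

```python
# Back-of-envelope: how much of an 87G index file could the OS page cache
# hold? The 87G figure is from this thread; RAM and heap sizes are
# hypothetical illustration values.

def cacheable_fraction(index_bytes: int, ram_bytes: int, heap_bytes: int) -> float:
    """Fraction of the index that fits in the memory left over after the JVM
    heap, assuming (optimistically) the page cache held nothing else. In
    reality data-file reads compete for the same cache, which is the concern
    raised above."""
    available = max(ram_bytes - heap_bytes, 0)
    return min(available / index_bytes, 1.0)

GB = 1 << 30
frac = cacheable_fraction(index_bytes=87 * GB, ram_bytes=64 * GB, heap_bytes=16 * GB)
print(f"~{frac:.0%} of the index fits even with a dedicated page cache")
```

Even in this best case roughly half the index misses cache, and since data-file reads also pull blocks into the same cache, the real hit rate would be lower still.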
Re: storing indexes on ssd
Created https://issues.apache.org/jira/browse/CASSANDRA-14229

On Mon, Feb 12, 2018 at 12:10 AM, Mateusz Korniak <mateusz-li...@ant.gliwice.pl> wrote:

> On Saturday 10 of February 2018 23:09:40 Dan Kinder wrote:
>> We're optimizing Cassandra right now for fairly random reads on a large
>> dataset. In this dataset, the values are much larger than the keys. I was
>> wondering, is it possible to have Cassandra write the *index* files
>> (*-Index.db) to one drive (SSD), but write the *data* files (*-Data.db) to
>> another (HDD)? This would be an overall win for us since it's
>> cost-prohibitive to store the data itself all on SSD, but we hit the limits
>> if we just use HDD; effectively we would need to buy double, since we are
>> doing 2 random reads (index + data).
>
> Considered putting cassandra data on lvmcache?
> We are using this on small (3x2TB compressed data, 128/256MB cache) clusters
> since reaching I/O limits of 2xHDD in RAID10.
>
> --
> Mateusz Korniak
> "(...) I have a brother - serious, a homebody, a penny-pincher, a hypocrite,
> sanctimonious; in short - a pillar of society."
> Nikos Kazantzakis, "Zorba the Greek"

--
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
Re: Setting min_index_interval to 1?
@Hannu this was based on the assumption that if we receive a read for a key that is sampled, it'll be treated as cached and won't go to the index on disk. Part of my question was whether that's the case, I'm not sure.

Btw I ended up giving up on this; trying the key cache route already showed that it would require more memory than we have available. And even then, the performance started to tank; we saw irqbalance and other processes peg the CPU even with not too much load, so there was some numa-related problem there that I don't have time to look into.

On Fri, Feb 2, 2018 at 12:42 AM, Hannu Kröger <hkro...@gmail.com> wrote:

> Wouldn’t that still try to read the index on the disk? So you would just
> potentially have all keys on the memory and on the disk and reading would
> first happen in memory and then on the disk and only after that you would
> read the sstable.
>
> So you wouldn’t gain much, right?
>
> Hannu
>
> On 2 Feb 2018, at 02:25, Nate McCall <n...@thelastpickle.com> wrote:
>
>> Another was the crazy idea I started with of setting min_index_interval
>> to 1. My guess was that this would cause it to read all index entries, and
>> effectively have them all cached permanently. And it would read them
>> straight out of the SSTables on every restart. Would this work? Other than
>> probably causing a really long startup time, are there issues with this?
>
> I've never tried that. It sounds like you understand the potential impact
> on memory and startup time. If you have the data in such a way that you can
> easily experiment, I would like to see a breakdown of the impact on
> response time vs. memory usage as well as where the point of diminishing
> returns is on turning this down towards 1 (I think there will be a sweet
> spot somewhere).

--
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
storing indexes on ssd
Hi, We're optimizing Cassandra right now for fairly random reads on a large dataset. In this dataset, the values are much larger than the keys. I was wondering, is it possible to have Cassandra write the *index* files (*-Index.db) to one drive (SSD), but write the *data* files (*-Data.db) to another (HDD)? This would be an overall win for us since it's cost-prohibitive to store the data itself all on SSD, but we hit the limits if we just use HDD; effectively we would need to buy double, since we are doing 2 random reads (index + data). Thanks, -dan
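The "buy double" estimate above can be made concrete with a seek-budget calculation: on HDD, every uncached read costs one seek for the index block and one for the data block. A rough sketch; the ~100 IOPS per drive and the target read rate are illustrative assumptions, not numbers from the thread.

```python
import math

# Seek-budget sketch of the "buy double" claim: each uncached Cassandra read
# costs one seek for the index and one for the data, so on HDD the required
# drive count doubles. IOPS/drive and target rate are assumptions.

def disks_needed(reads_per_sec: int, seeks_per_read: int, hdd_iops: int = 100) -> int:
    """HDDs required to sustain a purely random-read workload."""
    return math.ceil(reads_per_sec * seeks_per_read / hdd_iops)

# Index on SSD (or fully cached): one HDD seek per read.
print(disks_needed(10_000, seeks_per_read=1))  # 100
# Index and data both on HDD: two seeks per read, twice the drives.
print(disks_needed(10_000, seeks_per_read=2))  # 200
```

Moving only the small index files to SSD removes one of the two seeks, which is why it halves the HDD count needed for the same read rate.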
Setting min_index_interval to 1?
Hi, I have an unusual case here: I'm wondering what will happen if I set min_index_interval to 1. Here's the logic.

Suppose I have a table where I really want to squeeze as many reads/sec out of it as possible, and where the row data size is much larger than the keys. E.g. the keys are a few bytes, the row data is ~500KB. This table would be a great candidate for key caching. Let's suppose I have enough memory to have every key cached. However, it's a lot of data, and the reads are very random. So it would take a very long time for that cache to warm up.

One solution is that I write a little app to go through every key to warm it up manually, and ensure that Cassandra has key_cache_keys_to_save set to save the whole thing on restart. (Anyone know of a better way of doing this?)

Another was the crazy idea I started with of setting min_index_interval to 1. My guess was that this would cause it to read all index entries, and effectively have them all cached permanently. And it would read them straight out of the SSTables on every restart. Would this work? Other than probably causing a really long startup time, are there issues with this?

Thanks,
-dan
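For scale, the memory cost of sampling every key can be estimated up front: min_index_interval = 1 puts every partition key in the index summary instead of roughly 1 in 128 (the default). A back-of-envelope sketch; the key count, 8-byte keys, and 24 bytes of per-entry overhead are all assumptions for illustration, not measurements.

```python
# Back-of-envelope memory cost of index sampling. With min_index_interval = 1
# every partition key lands in the index summary instead of 1 in 128 (the
# default). Key count, key size, and per-entry overhead are illustrative
# assumptions.

def summary_bytes(num_keys: int, key_bytes: int, interval: int,
                  overhead_bytes: int = 24) -> int:
    """Approximate summary size: one sampled entry (key + position +
    bookkeeping) per `interval` keys."""
    return (num_keys // interval) * (key_bytes + overhead_bytes)

MB = 1 << 20
keys = 20_000_000  # hypothetical per-node partition count (~10 TB of 500KB rows)
print(summary_bytes(keys, key_bytes=8, interval=128) / MB)  # default sampling
print(summary_bytes(keys, key_bytes=8, interval=1) / MB)    # every key sampled
```

Under these assumptions the jump is roughly a hundredfold (a few MB to hundreds of MB per node), on top of whatever the actual on-heap entry overhead adds.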
LCS major compaction on 3.2+ on JBOD
Hi I am wondering how major compaction behaves for a table using LCS on JBOD with Cassandra 3.2+'s JBOD improvements. Up to then I know that major compaction would use a single thread, include all SSTables in a single compaction, and spit out a bunch of SSTables in appropriate levels. Does 3.2+ do 1 compaction per disk, since they are separate leveled structures? Or does it do a single compaction task that writes SSTables to the appropriate disk by key range? -dan
Re:
Created https://issues.apache.org/jira/browse/CASSANDRA-13923 On Mon, Oct 2, 2017 at 12:06 PM, Dan Kinder <dkin...@turnitin.com> wrote: > Sure will do. > > On Mon, Oct 2, 2017 at 11:48 AM, Jeff Jirsa <jji...@gmail.com> wrote: > >> You're right, sorry I didnt read the full stack (gmail hid it from me) >> >> Would you open a JIRA with your stack traces, and note (somewhat loudly) >> that this is a regression? >> >> >> On Mon, Oct 2, 2017 at 11:43 AM, Dan Kinder <dkin...@turnitin.com> wrote: >> >>> Right, I just meant that calling it at all results in holding a read >>> lock, which unfortunately is blocking these read threads. >>> >>> On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa <jji...@gmail.com> wrote: >>> >>>> >>>> >>>> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder <dkin...@turnitin.com> >>>> wrote: >>>> >>>>> (As a side note, it seems silly to call shouldDefragment at all on a >>>>> read if the compaction strategy is not STSC) >>>>> >>>>> >>>>> >>>> It defaults to false: >>>> >>>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/j >>>> ava/org/apache/cassandra/db/compaction/AbstractCompactionStr >>>> ategy.java#L302 >>>> >>>> And nothing else other than STCS overrides it to true. >>>> >>>> >>>> >>> >>> >>> -- >>> Dan Kinder >>> Principal Software Engineer >>> Turnitin – www.turnitin.com >>> dkin...@turnitin.com >>> >> >> > > > -- > Dan Kinder > Principal Software Engineer > Turnitin – www.turnitin.com > dkin...@turnitin.com > -- Dan Kinder Principal Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re:
Sure will do. On Mon, Oct 2, 2017 at 11:48 AM, Jeff Jirsa <jji...@gmail.com> wrote: > You're right, sorry I didnt read the full stack (gmail hid it from me) > > Would you open a JIRA with your stack traces, and note (somewhat loudly) > that this is a regression? > > > On Mon, Oct 2, 2017 at 11:43 AM, Dan Kinder <dkin...@turnitin.com> wrote: > >> Right, I just meant that calling it at all results in holding a read >> lock, which unfortunately is blocking these read threads. >> >> On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa <jji...@gmail.com> wrote: >> >>> >>> >>> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder <dkin...@turnitin.com> >>> wrote: >>> >>>> (As a side note, it seems silly to call shouldDefragment at all on a >>>> read if the compaction strategy is not STSC) >>>> >>>> >>>> >>> It defaults to false: >>> >>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/j >>> ava/org/apache/cassandra/db/compaction/AbstractCompactionStr >>> ategy.java#L302 >>> >>> And nothing else other than STCS overrides it to true. >>> >>> >>> >> >> >> -- >> Dan Kinder >> Principal Software Engineer >> Turnitin – www.turnitin.com >> dkin...@turnitin.com >> > > -- Dan Kinder Principal Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re:
Right, I just meant that calling it at all results in holding a read lock, which unfortunately is blocking these read threads.

On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa <jji...@gmail.com> wrote:

> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> (As a side note, it seems silly to call shouldDefragment at all on a read
>> if the compaction strategy is not STCS)
>
> It defaults to false:
>
> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.java#L302
>
> And nothing else other than STCS overrides it to true.

--
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
Re:
Sorry, for that ReadStage exception, I take it back; I accidentally ended up too early in the logs. This node that has building ReadStage shows no exceptions in the logs.

nodetool tpstats

Pool Name                        Active   Pending   Completed   Blocked  All time blocked
ReadStage                             8      1882       45881          0                0
MiscStage                             0         0           0          0                0
CompactionExecutor                    9         9        2551          0                0
MutationStage                         0         0    35929880          0                0
GossipStage                           0         0       35793          0                0
RequestResponseStage                  0         0      751285          0                0
ReadRepairStage                       0         0         224          0                0
CounterMutationStage                  0         0           0          0                0
MemtableFlushWriter                   0         0         111          0                0
MemtablePostFlush                     0         0         239          0                0
ValidationExecutor                    0         0           0          0                0
ViewMutationStage                     0         0           0          0                0
CacheCleanupExecutor                  0         0           0          0                0
PerDiskMemtableFlushWriter_10         0         0         104          0                0
PerDiskMemtableFlushWriter_11         0         0         104          0                0
MemtableReclaimMemory                 0         0         116          0                0
PendingRangeCalculator                0         0          16          0                0
SecondaryIndexManagement              0         0           0          0                0
HintsDispatcher                       0         0          13          0                0
PerDiskMemtableFlushWriter_1          0         0         104          0                0
Native-Transport-Requests             0         0     2607030          0                0
PerDiskMemtableFlushWriter_2          0         0         104          0                0
MigrationStage                        0         0         278          0                0
PerDiskMemtableFlushWriter_0          0         0         115          0                0
Sampler                               0         0           0          0                0
PerDiskMemtableFlushWriter_5          0         0         104          0                0
InternalResponseStage                 0         0         298          0                0
PerDiskMemtableFlushWriter_6          0         0         104          0                0
PerDiskMemtableFlushWriter_3          0         0         104          0                0
PerDiskMemtableFlushWriter_4          0         0         104          0                0
PerDiskMemtableFlushWriter_9          0         0         104          0                0
AntiEntropyStage                      0         0           0          0                0
PerDiskMemtableFlushWriter_7          0         0         104          0                0
PerDiskMemtableFlushWriter_8          0         0         104          0                0

Message type      Dropped
READ                    0
RANGE_SLICE             0
_TRACE                  0
HINT                    0
MUTATION                0
COUNTER_MUTATION        0
BATCH_STORE             0
BATCH_REMOVE            0
REQUEST_RESPONSE        0
PAGED_RANGE             0
READ_REPAIR             0

On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Thanks for the responses.
>
> @Prem yes this is after the entire cluster is on 3.11, but no I did not
> run upgradesstables yet.
>
> @Thomas no I don't see any major GC going on.
>
> @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
> bring it back (thankfully this cluster is not serving live traffic). The
> nodes seemed okay for an hour or two, but I see the issue again, without me
> bouncing any nodes. This time it's ReadStage that's building up, and the
> exception I'm seeing in the logs is:
>
> DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 - Digest mismatch:
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key
> DecoratedKey(6150926370328526396, 696a6374652e6f7267)
> (2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
>         at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
Re:
Thanks for the responses.

@Prem yes this is after the entire cluster is on 3.11, but no I did not run upgradesstables yet.

@Thomas no I don't see any major GC going on.

@Jeff yeah it's fully upgraded. I decided to shut the whole thing down and bring it back (thankfully this cluster is not serving live traffic). The nodes seemed okay for an hour or two, but I see the issue again, without me bouncing any nodes. This time it's ReadStage that's building up, and the exception I'm seeing in the logs is:

DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(6150926370328526396, 696a6374652e6f7267) (2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
        at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_71]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]

Do you think running upgradesstables would help? Or relocatesstables? I presumed it shouldn't be necessary for Cassandra to function, just an optimization.

On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <thomas.steinmau...@dynatrace.com> wrote:

> Dan,
>
> do you see any major GC? We have been hit by the following memory leak in
> our loadtest environment with 3.11.0.
>
> https://issues.apache.org/jira/browse/CASSANDRA-13754
>
> So, depending on the heap size and uptime, you might get into heap
> troubles.
> > > > Thomas > > > > *From:* Dan Kinder [mailto:dkin...@turnitin.com] > *Sent:* Donnerstag, 28. September 2017 18:20 > *To:* user@cassandra.apache.org > *Subject:* > > > > Hi, > > I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the > following. The cluster does function, for a while, but then some stages > begin to back up and the node does not recover and does not drain the > tasks, even under no load. This happens both to MutationStage and > GossipStage. > > I do see the following exception happen in the logs: > > > > ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 > CassandraDaemon.java:228 - Exception in thread > Thread[ReadRepairStage:2328,5,main] > > org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out > - received only 1 responses. > > at org.apache.cassandra.service.DataResolver$ > RepairMergeListener.close(DataResolver.java:171) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at org.apache.cassandra.db.partitions. > UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_91] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > ~[na:1.8.0_91] > > at org.apache.cassandra.concurrent.NamedThreadFactory. 
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91] > > > > But it's hard to correlate precisely with things going bad. It is also > very strange to me since I have both read_repair_chance and > dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is > confusing why ReadRepairStage would err. > > Anyone have thoughts on this? It's pretty muddling, and causes nodes to > lock up. Once it happens Cassandra can't even shut down, I have to kill -9. > If I can't find a resolution I'm going to need to downgrade and restore to > backup... > > The only issue I found
Re:
I should also note, I also see nodes become locked up without seeing that Exception. But the GossipStage buildup does seem correlated with gossip activity, e.g. me restarting a different node. On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder <dkin...@turnitin.com> wrote: > Hi, > > I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the > following. The cluster does function, for a while, but then some stages > begin to back up and the node does not recover and does not drain the > tasks, even under no load. This happens both to MutationStage and > GossipStage. > > I do see the following exception happen in the logs: > > > ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 > CassandraDaemon.java:228 - Exception in thread > Thread[ReadRepairStage:2328,5,main] > > org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out > - received only 1 responses. > > at org.apache.cassandra.service.DataResolver$ > RepairMergeListener.close(DataResolver.java:171) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at org.apache.cassandra.db.partitions. > UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_91] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > ~[na:1.8.0_91] > > at org.apache.cassandra.concurrent.NamedThreadFactory. 
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > ~[apache-cassandra-3.11.0.jar:3.11.0] > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91] > > > But it's hard to correlate precisely with things going bad. It is also > very strange to me since I have both read_repair_chance and > dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is > confusing why ReadRepairStage would err. > > Anyone have thoughts on this? It's pretty muddling, and causes nodes to > lock up. Once it happens Cassandra can't even shut down, I have to kill -9. > If I can't find a resolution I'm going to need to downgrade and restore to > backup... > > The only issue I found that looked similar is https://issues.apache.org/ > jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10. > > > $ nodetool tpstats > > Pool Name Active Pending Completed > Blocked All time blocked > > ReadStage 0 0 582103 0 > 0 > > MiscStage 0 0 0 0 > 0 > > CompactionExecutor1111 2868 0 > 0 > > MutationStage 32 4593678 55057393 0 > 0 > > GossipStage1 2818 371487 0 > 0 > > RequestResponseStage 0 04345522 0 > 0 > > ReadRepairStage0 0 151473 0 > 0 > > CounterMutationStage 0 0 0 0 > 0 > > MemtableFlushWriter181 76 0 > 0 > > MemtablePostFlush 1 382139 0 > 0 > > ValidationExecutor 0 0 0 0 > 0 > > ViewMutationStage 0 0 0 0 > 0 > > CacheCleanupExecutor 0 0 0 0 > 0 > > PerDiskMemtableFlushWriter_10 0 0 69 0 > 0 > > PerDiskMemtableFlushWriter_11 0 0 69 0 > 0 > > MemtableReclaimMemory 0 0 81 0 > 0 > > PendingRangeCa
[no subject]
Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the following. The cluster does function, for a while, but then some stages begin to back up and the node does not recover and does not drain the tasks, even under no load. This happens both to MutationStage and GossipStage.

I do see the following exception happen in the logs:

ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
        at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]

But it's hard to correlate precisely with things going bad. It is also very strange to me since I have both read_repair_chance and dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is confusing why ReadRepairStage would err.

Anyone have thoughts on this? It's pretty muddling, and causes nodes to lock up. Once it happens Cassandra can't even shut down, I have to kill -9. If I can't find a resolution I'm going to need to downgrade and restore to backup...

The only issue I found that looked similar is https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.

$ nodetool tpstats

Pool Name                        Active   Pending   Completed   Blocked  All time blocked
ReadStage                             0         0      582103          0                0
MiscStage                             0         0           0          0                0
CompactionExecutor                   11        11        2868          0                0
MutationStage                        32   4593678    55057393          0                0
GossipStage                           1      2818      371487          0                0
RequestResponseStage                  0         0     4345522          0                0
ReadRepairStage                       0         0      151473          0                0
CounterMutationStage                  0         0           0          0                0
MemtableFlushWriter                   1        81          76          0                0
MemtablePostFlush                     1        38        2139          0                0
ValidationExecutor                    0         0           0          0                0
ViewMutationStage                     0         0           0          0                0
CacheCleanupExecutor                  0         0           0          0                0
PerDiskMemtableFlushWriter_10         0         0          69          0                0
PerDiskMemtableFlushWriter_11         0         0          69          0                0
MemtableReclaimMemory                 0         0          81          0                0
PendingRangeCalculator                0         0          32          0                0
SecondaryIndexManagement              0         0           0          0                0
HintsDispatcher                       0         0         596          0                0
PerDiskMemtableFlushWriter_1          0         0          69          0                0
Native-Transport-Requests            11         0     4547746          0               67
PerDiskMemtableFlushWriter_2          0         0          69          0                0
MigrationStage                        1         1      545586          0                0
PerDiskMemtableFlushWriter_0          0         0          80          0                0
Sampler                               0         0           0          0                0
PerDiskMemtableFlushWriter_5          0         0          69          0
Re: Problems with large partitions and compaction
What Cassandra version? CMS or G1? What are your timeouts set to?

"GC activity" - Even if there isn't a lot of activity per se, maybe there is a single long pause happening. I have seen large partitions cause lots of allocation fast.

Looking at SSTable Levels in nodetool cfstats can help; look at it for all your tables.

Don't recommend switching to STCS until you know more. You end up with massive compaction that takes a long time to settle down.

On Tue, Feb 14, 2017 at 5:50 PM, John Sanda <john.sa...@gmail.com> wrote:

> I have a table that uses LCS and has wound up with partitions upwards of
> 700 MB. I am seeing lots of the large partition warnings. Client requests
> are subsequently failing. The driver is not reporting timeout exceptions,
> just NoHostAvailableExceptions (in the logs I have reviewed so far). I know
> that I need to redesign the table to avoid such large partitions. What
> specifically goes wrong that results in the instability I am seeing? Or put
> another way, what issues will compacting really large partitions cause?
> Initially I thought that there was high GC activity, but after closer
> inspection that does not really seem to be happening. And most of the
> failures I am seeing are on reads, but for an entirely different table.
> Lastly, has anyone had success switching to STCS in this situation as a
> workaround?
>
> Thanks
>
> - John

--
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
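One way to see why 700 MB partitions and LCS interact badly: with LCS's usual 160 MB sstable target and fanout of 10, a single 700 MB partition is already several sstables' worth of data, and any compaction that touches it must rewrite it whole. A sketch under those default-value assumptions (check your table's actual sstable_size_in_mb):

```python
# Why a 700 MB partition is painful under LCS: with a 160 MB sstable target
# and a fanout of 10, one partition already spans several sstables' worth of
# data and must be rewritten whole by any compaction that includes it.
# The 160 MB / fanout-10 defaults are assumptions; verify against your schema.

def lcs_level_capacity_mb(level: int, sstable_size_mb: int = 160,
                          fanout: int = 10) -> int:
    """Approximate data capacity of LCS level N (L1 holds ~fanout sstables)."""
    return sstable_size_mb * fanout ** level

partition_mb = 700
print(f"one partition = ~{partition_mb / 160:.1f}x the sstable target")
for level in (1, 2, 3):
    print(f"L{level} capacity ~ {lcs_level_capacity_mb(level)} MB")
```

Since LCS recompacts a partition's range at every level it passes through, an oversized partition is re-read and re-written repeatedly, which is where the allocation pressure mentioned above tends to come from.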
Re: Cassandra Golang Driver and Support
Just want to put a plug in for gocql and the guys who work on it. I use it for production applications that sustain ~10,000 writes/sec on an 8 node cluster, and in the few times I have seen problems they have been responsive on issues and pull requests. Once or twice I have seen the API change but otherwise it has been stable. In general I have found it very intuitive to use and easy to configure.

On Thu, Apr 14, 2016 at 2:30 PM, Yawei Li wrote:

> Thanks for the info, Bryan!
> We are in general assessing the support level of GoCQL vs. the Java Driver.
> From http://gocql.github.io/, looks like it is a WIP (some TODO items, api
> is subject to change)? And https://github.com/gocql/gocql suggests the
> performance may degrade now and then, and the supported versions are up to
> 2.2.x? For us maintaining two stacks (Java and Go) may be expensive so I am
> checking what's the general strategy folks are using here.
>
> On Wed, Apr 13, 2016 at 11:31 AM, Bryan Cheng wrote:
>
>> Hi Yawei,
>>
>> While you're right that there's no first-party driver, we've had good
>> luck using gocql (https://github.com/gocql/gocql) in production at
>> moderate scale. What features in particular are you looking for that are
>> missing?
>>
>> --Bryan
>>
>> On Tue, Apr 12, 2016 at 10:06 PM, Yawei Li wrote:
>>
>>> Hi,
>>>
>>> It looks like to me that DataStax doesn't provide an official golang
>>> driver yet and the golang client libs are overall lagging behind the Java
>>> driver in terms of feature set, supported versions and possibly production
>>> stability?
>>>
>>> We are going to support a large number of services in both Java and Go.
>>> If the above impression is largely true, we are considering the option of
>>> focusing on the Java client and having GoLang programs talk to the Java
>>> service via RPC for data access. Has anyone tried a similar approach?
>>>
>>> Thanks
Re: MemtableReclaimMemory pending building up
Quick follow-up here, so far I've had these nodes stable for about 2 days now with the following (still mysterious) solution: *increase* memtable_heap_space_in_mb to 20GB. This was having issues at the default value of 1/4 heap (12GB in my case, I misspoke earlier and said 16GB). Upping it to 20GB seems to have made the issue go away so far. Best guess now is that it simply was memtable flush throughput. Playing with memtable_cleanup_threshold further may have also helped but I didn't want to create small SSTables.

Thanks again for the input @Alain.

On Fri, Mar 4, 2016 at 4:53 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Hi thanks for responding Alain. Going to provide more info inline.
>
> However a small update that is probably relevant: while the node was in
> this state (MemtableReclaimMemory building up), since this cluster is not
> serving live traffic I temporarily turned off ALL client traffic, and the
> node still never recovered, MemtableReclaimMemory never went down. Seems
> like there is one thread doing this reclaiming and it has gotten stuck
> somehow.
>
> Will let you know when I have more results from experimenting... but
> again, merci
>
> On Thu, Mar 3, 2016 at 2:32 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> I'll try to go through all the elements:
>>
>>> seeing this odd behavior happen, seemingly to single nodes at a time
>>
>> Is that one node at the time or always on the same node. Do you consider
>> your data model is fairly, evenly distributed ?
>
> of 6 nodes, 2 of them seem to be the recurring culprits. Could be related
> to a particular data partition.
>
>>> The node starts to take more and more memory (instance has 48GB memory on
>>> G1GC)
>>
>> Do you use 48 GB heap size or is that the total amount of memory in the
>> node ? Could we have your JVM settings (GC and heap sizes), also memtable
>> size and type (off heap?) and the amount of available memory ?
> Machine spec: 24 virtual cores, 64GB memory, 12 HDD JBOD (yes an absurd
> number of disks, not my choice)
>
> memtable_heap_space_in_mb: 10240  # 10GB (previously left as default which
> was 16GB and caused the issue more frequently)
> memtable_allocation_type: heap_buffers
> memtable_flush_writers: 12
>
> MAX_HEAP_SIZE="48G"
> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
> JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
> JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
> JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
>
>>> Note that there is a decent number of compactions going on as well but
>>> that is expected on these nodes and this particular one is catching up
>>> from a high volume of writes
>>
>> Are the *concurrent_compactors* correctly throttled (about 8 with good
>> machines) and the *compaction_throughput_mb_per_sec* high enough to cope
>> with what is thrown at the node ? Using SSD I often see the latter
>> unthrottled (using 0 value), but I would try small increments first.
>
> concurrent_compactors: 12
> compaction_throughput_mb_per_sec: 0
>
>>> Also interestingly, neither CPU nor disk utilization are pegged while
>>> this is going on
>>
>> First thing is making sure your memory management is fine. Having
>> information about the JVM and memory usage globally would help. Then, if
>> you are not fully using the resources you might want to try increasing the
>> number of *concurrent_writes* to a higher value (probably a way higher,
>> given the pending requests, but go safely, incrementally, first on a canary
>> node) and monitor tpstats + resources. Hope this will help Mutation pending
>> going down. My guess is that pending requests are messing with the JVM, but
>> it could be the exact contrary as well.
> concurrent_writes: 192
>
> It may be worth noting that the main reads going on are large batch reads,
> while these writes are happening (akin to analytics jobs).
>
> I'm going to look into JVM use a bit more but otherwise it seems like
> normal Young generation GCs are happening even as this problem surfaces.
>
>>> Native-Transport-Requests      25      0      547935519      0      2586907
>>
>> About Native requests being blocked, you can probably mitigate things by
>> increasing the native_transport_max_threads: 128 (try to double it and
>> continue tuning incrementally).
Re: MemtableReclaimMemory pending building up
operations = high memory pressure. > Reducing pending stuff somehow will probably get you out of trouble. > > Hope this first round of ideas will help you. > > C*heers, > --- > Alain Rodriguez - al...@thelastpickle.com > France > > The Last Pickle - Apache Cassandra Consulting > http://www.thelastpickle.com > > 2016-03-02 22:58 GMT+01:00 Dan Kinder <dkin...@turnitin.com>: > >> Also should note: Cassandra 2.2.5, Centos 6.7 >> >> On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder <dkin...@turnitin.com> wrote: >> >>> Hi y'all, >>> >>> I am writing to a cluster fairly fast and seeing this odd behavior >>> happen, seemingly to single nodes at a time. The node starts to take more >>> and more memory (instance has 48GB memory on G1GC). tpstats shows that >>> MemtableReclaimMemory Pending starts to grow first, then later >>> MutationStage builds up as well. By then most of the memory is being >>> consumed, GC is getting longer, node slows down and everything slows down >>> unless I kill the node. Also the number of Active MemtableReclaimMemory >>> threads seems to stay at 1. Also interestingly, neither CPU nor disk >>> utilization are pegged while this is going on; it's on jbod and there is >>> plenty of headroom there. (Note that there is a decent number of >>> compactions going on as well but that is expected on these nodes and this >>> particular one is catching up from a high volume of writes). >>> >>> Anyone have any theories on why this would be happening? 
>>> >>> >>> $ nodetool tpstats >>> Pool NameActive Pending Completed Blocked >>> All time blocked >>> MutationStage 192715481 311327142 0 >>> 0 >>> ReadStage 7 09142871 0 >>> 0 >>> RequestResponseStage 1 0 690823199 0 >>> 0 >>> ReadRepairStage 0 02145627 0 >>> 0 >>> CounterMutationStage 0 0 0 0 >>> 0 >>> HintedHandoff 0 0144 0 >>> 0 >>> MiscStage 0 0 0 0 >>> 0 >>> CompactionExecutor 1224 41022 0 >>> 0 >>> MemtableReclaimMemory 1 102 4263 0 >>> 0 >>> PendingRangeCalculator0 0 10 0 >>> 0 >>> GossipStage 0 0 148329 0 >>> 0 >>> MigrationStage0 0 0 0 >>> 0 >>> MemtablePostFlush 0 0 5233 0 >>> 0 >>> ValidationExecutor0 0 0 0 >>> 0 >>> Sampler 0 0 0 0 >>> 0 >>> MemtableFlushWriter 0 0 4270 0 >>> 0 >>> InternalResponseStage 0 0 16322698 0 >>> 0 >>> AntiEntropyStage 0 0 0 0 >>> 0 >>> CacheCleanupExecutor 0 0 0 0 >>> 0 >>> Native-Transport-Requests25 0 547935519 0 >>> 2586907 >>> >>> Message type Dropped >>> READ 0 >>> RANGE_SLICE 0 >>> _TRACE 0 >>> MUTATION287057 >>> COUNTER_MUTATION 0 >>> REQUEST_RESPONSE 0 >>> PAGED_RANGE 0 >>> READ_REPAIR149 >>> >>> >> >> >> -- >> Dan Kinder >> Principal Software Engineer >> Turnitin – www.turnitin.com >> dkin...@turnitin.com >> >
Re: MemtableReclaimMemory pending building up
Also should note: Cassandra 2.2.5, Centos 6.7 On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder <dkin...@turnitin.com> wrote: > Hi y'all, > > I am writing to a cluster fairly fast and seeing this odd behavior happen, > seemingly to single nodes at a time. The node starts to take more and more > memory (instance has 48GB memory on G1GC). tpstats shows that > MemtableReclaimMemory Pending starts to grow first, then later > MutationStage builds up as well. By then most of the memory is being > consumed, GC is getting longer, node slows down and everything slows down > unless I kill the node. Also the number of Active MemtableReclaimMemory > threads seems to stay at 1. Also interestingly, neither CPU nor disk > utilization are pegged while this is going on; it's on jbod and there is > plenty of headroom there. (Note that there is a decent number of > compactions going on as well but that is expected on these nodes and this > particular one is catching up from a high volume of writes). > > Anyone have any theories on why this would be happening? 
> > > $ nodetool tpstats > Pool NameActive Pending Completed Blocked > All time blocked > MutationStage 192715481 311327142 0 > 0 > ReadStage 7 09142871 0 > 0 > RequestResponseStage 1 0 690823199 0 > 0 > ReadRepairStage 0 02145627 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > HintedHandoff 0 0144 0 > 0 > MiscStage 0 0 0 0 > 0 > CompactionExecutor 1224 41022 0 > 0 > MemtableReclaimMemory 1 102 4263 0 > 0 > PendingRangeCalculator0 0 10 0 > 0 > GossipStage 0 0 148329 0 > 0 > MigrationStage0 0 0 0 > 0 > MemtablePostFlush 0 0 5233 0 > 0 > ValidationExecutor0 0 0 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0 4270 0 > 0 > InternalResponseStage 0 0 16322698 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > Native-Transport-Requests25 0 547935519 0 > 2586907 > > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION287057 > COUNTER_MUTATION 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR149 > > -- Dan Kinder Principal Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
MemtableReclaimMemory pending building up
Hi y'all,

I am writing to a cluster fairly fast and seeing this odd behavior happen, seemingly to single nodes at a time. The node starts to take more and more memory (instance has 48GB memory on G1GC). tpstats shows that MemtableReclaimMemory Pending starts to grow first, then later MutationStage builds up as well. By then most of the memory is being consumed, GC is getting longer, node slows down and everything slows down unless I kill the node. Also the number of Active MemtableReclaimMemory threads seems to stay at 1. Also interestingly, neither CPU nor disk utilization are pegged while this is going on; it's on jbod and there is plenty of headroom there. (Note that there is a decent number of compactions going on as well but that is expected on these nodes and this particular one is catching up from a high volume of writes).

Anyone have any theories on why this would be happening?

$ nodetool tpstats
Pool Name                    Active    Pending      Completed   Blocked  All time blocked
MutationStage                   192     715481      311327142         0                 0
ReadStage                         7          0        9142871         0                 0
RequestResponseStage              1          0      690823199         0                 0
ReadRepairStage                   0          0        2145627         0                 0
CounterMutationStage              0          0              0         0                 0
HintedHandoff                     0          0            144         0                 0
MiscStage                         0          0              0         0                 0
CompactionExecutor               12         24          41022         0                 0
MemtableReclaimMemory             1        102           4263         0                 0
PendingRangeCalculator            0          0             10         0                 0
GossipStage                       0          0         148329         0                 0
MigrationStage                    0          0              0         0                 0
MemtablePostFlush                 0          0           5233         0                 0
ValidationExecutor                0          0              0         0                 0
Sampler                           0          0              0         0                 0
MemtableFlushWriter               0          0           4270         0                 0
InternalResponseStage             0          0       16322698         0                 0
AntiEntropyStage                  0          0              0         0                 0
CacheCleanupExecutor              0          0              0         0                 0
Native-Transport-Requests        25          0      547935519         0           2586907

Message type        Dropped
READ                      0
RANGE_SLICE               0
_TRACE                    0
MUTATION             287057
COUNTER_MUTATION          0
REQUEST_RESPONSE          0
PAGED_RANGE               0
READ_REPAIR             149
Re: Production with Single Node
I could see this being desirable if you are deploying the exact same application as you deploy in other places with many nodes, and you know the load will be low. It may be a rare situation but in such a case you save big effort by not having to change your application logic. Not that I necessarily recommend it, but to answer John's question: my understanding is that if you want to keep it snappy and low-latency, you should watch out for GC pauses and consider your GC tuning carefully; since it is a single node, a long pause will cause the whole show to stop. Presumably your load won't be very high. Also if you are concerned with durability you may want to consider changing commitlog_sync <https://docs.datastax.com/en/cassandra/1.2/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__commitlog_sync> to batch. I believe this is the only way to guarantee write durability with one node. Again with the performance caveat; under high load it could cause problems. On Fri, Jan 22, 2016 at 12:34 PM, Jonathan Haddad <j...@jonhaddad.com> wrote: > My opinion: > http://rustyrazorblade.com/2013/09/cassandra-faq-can-i-start-with-a-single-node/ > > TL;DR: the only reason to run 1 node in prod is if you're super broke but > know you'll need to scale up almost immediately after going to prod (maybe > after getting some funding). > > If you're planning on doing it as a more permanent solution, you've chosen > the wrong database. > > On Fri, Jan 22, 2016 at 12:30 PM Jack Krupansky <jack.krupan...@gmail.com> > wrote: > >> The risks would be about the same as with a single-node Postgres or MySQL >> database, except that you wouldn't have the benefit of full SQL. >> >> How much data (rows, columns), what kind of load pattern (heavy write, >> heavy update, heavy query), and what types of queries (primary key-only, >> slices, filtering, secondary indexes, etc.)? 
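The commitlog_sync change mentioned above would look like this in a 2.1-era cassandra.yaml (a sketch; the 2 ms window is an illustrative value). In batch mode the commitlog is fsynced before writes are acknowledged, which is what buys the durability guarantee at the cost of write latency:

```yaml
# Durability over latency on a single node: fsync the commitlog before acking.
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2   # group fsyncs for up to 2 ms (illustrative)
```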
>> >> -- Jack Krupansky >> >> On Fri, Jan 22, 2016 at 3:24 PM, John Lammers < >> john.lamm...@karoshealth.com> wrote: >> >>> After deploying a number of production systems with up to 10 Cassandra >>> nodes each, we are looking at deploying a small, all-in-one-server system >>> with only a single, local node (Cassandra 2.1.11). >>> >>> What are the risks of such a configuration? >>> >>> The virtual disk would be running RAID 5 and the disk controller would >>> have a flash backed write-behind cache. >>> >>> What's the best way to configure Cassandra and/or respecify the hardware >>> for an all-in-one-box solution? >>> >>> Thanks-in-advance! >>> >>> --John >>> >>> >> -- Dan Kinder Principal Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re: compression cpu overhead
To clarify, writes have no *immediate* cpu cost from adding the write to the memtable; however, the compression overhead cost is paid when writing out a new SSTable (whether from flushing a memtable or compacting), correct? So it sounds like when reads >> writes then Tushar's comments are accurate, but for a high write workload flushing and compactions would create most of the overhead. On Tue, Nov 3, 2015 at 6:03 PM, Jon Haddad <jonathan.had...@gmail.com> wrote: > You won't see any overhead on writes because you don't actually write to > sstables when performing a write. Just the commit log & memtable. > Memtables are flushed asynchronously. > > On Nov 4, 2015, at 1:57 AM, Tushar Agrawal <agrawal.tus...@gmail.com> > wrote: > > For writes it's negligible. For reads it makes a significant difference > for high tps and low latency workload. You would see up to 3x higher cpu > with LZ4 vs no compression. It would be different for different h/w > configurations. > > > Thanks, > Tushar > (Sent from iPhone) > > On Nov 3, 2015, at 5:51 PM, Dan Kinder <dkin...@turnitin.com> wrote: > > Most concerned about write since that's where most of the cost is, but > perf numbers for any workload mix would be helpful. > > On Tue, Nov 3, 2015 at 3:48 PM, Graham Sanderson <gra...@vast.com> wrote: > >> On read or write? >> >> https://issues.apache.org/jira/browse/CASSANDRA-7039 and friends in 2.2 >> should make some difference, I didn’t immediately find perf numbers though. >> >> On Nov 3, 2015, at 5:42 PM, Dan Kinder <dkin...@turnitin.com> wrote: >> >> Hey all, >> >> Just wondering if anyone has seen or done any benchmarking for the >> actual CPU overhead added by various compression algorithms in Cassandra >> (at least LZ4) vs no compression. Clearly this is going to be workload >> dependent but even a rough gauge would be helpful (ex. 
"Turning on LZ4 >> compression increases my CPU load by ~2x") >> >> -dan >> >> >> > > > -- > Dan Kinder > Senior Software Engineer > Turnitin – www.turnitin.com > dkin...@turnitin.com > > > -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
compression cpu overhead
Hey all, Just wondering if anyone has seen or done any benchmarking for the actual CPU overhead added by various compression algorithms in Cassandra (at least LZ4) vs no compression. Clearly this is going to be workload dependent but even a rough gauge would be helpful (ex. "Turning on LZ4 compression increases my CPU load by ~2x") -dan
Re: compression cpu overhead
Most concerned about write since that's where most of the cost is, but perf numbers for any workload mix would be helpful. On Tue, Nov 3, 2015 at 3:48 PM, Graham Sanderson <gra...@vast.com> wrote: > On read or write? > > https://issues.apache.org/jira/browse/CASSANDRA-7039 and friends in 2.2 > should make some difference, I didn’t immediately find perf numbers though. > > On Nov 3, 2015, at 5:42 PM, Dan Kinder <dkin...@turnitin.com> wrote: > > Hey all, > > Just wondering if anyone has seen or done any benchmarking for the > actual CPU overhead added by various compression algorithms in Cassandra > (at least LZ4) vs no compression. Clearly this is going to be workload > dependent but even a rough gauge would be helpful (ex. "Turning on LZ4 > compression increases my CPU load by ~2x") > > -dan > > > -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re: memtable flush size with LCS
@Jeff Jirsa thanks, the memtable_* keys were the actual determining factor for my memtable flushes; they are what I needed to play with. On Thu, Oct 29, 2015 at 8:23 AM, Ken Hancock <ken.hanc...@schange.com> wrote: > Or if you're doing a high volume of writes, then your flushed file size > may be completely determined by other CFs that have consumed the commitlog > size, forcing any memtables whose commitlog is being deleted to be forced to > disk. > > > On Wed, Oct 28, 2015 at 2:51 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: > >> It’s worth mentioning that initial flushed file size is typically >> determined by memtable_cleanup_threshold and the memtable space options >> (memtable_heap_space_in_mb, memtable_offheap_space_in_mb, depending on >> memtable_allocation_type) >> >> >> >> From: Nate McCall >> Reply-To: "user@cassandra.apache.org" >> Date: Wednesday, October 28, 2015 at 11:45 AM >> To: Cassandra Users >> Subject: Re: memtable flush size with LCS >> >> >> do you mean that this property is ignored at memtable flush time, and so >>> memtables are already allowed to be much larger than sstable_size_in_mb? >>> >> >> Yes, 'sstable_size_in_mb' plays no part in the flush process. Flushing >> is based solely on runtime activity and the file size is determined by >> whatever was in the memtable at that time. >> >> >> >> -- >> - >> Nate McCall >> Austin, TX >> @zznate >> >> Co-Founder & Sr. Technical Consultant >> Apache Cassandra Consulting >> http://www.thelastpickle.com >> > > > > > -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
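The interplay Jeff describes can be sketched as simple arithmetic (a sketch in Go, assuming the documented 2.1-era default of memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1); a flush of the largest memtable is triggered when total memtable usage crosses memtable_cleanup_threshold * memtable_heap_space_in_mb):

```go
package main

import "fmt"

// flushTriggerMB estimates the total on-heap memtable usage (in MB) at which
// Cassandra flushes the largest memtable, per the defaults described above.
func flushTriggerMB(heapSpaceMB float64, flushWriters int) float64 {
	cleanupThreshold := 1.0 / float64(flushWriters+1) // default memtable_cleanup_threshold
	return heapSpaceMB * cleanupThreshold
}

func main() {
	// Example: memtable_heap_space_in_mb: 2048, memtable_flush_writers: 2
	fmt.Printf("%.0f\n", flushTriggerMB(2048, 2)) // prints 683
}
```

Note how raising memtable_flush_writers shrinks the per-flush size, which is one reason a high writer count can produce many small L0 sstables regardless of sstable_size_in_mb.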
memtable flush size with LCS
Hi all, The docs indicate that memtables are triggered to flush when data in the commitlog is expiring or based on memtable_flush_period_in_ms. But LCS has a specified sstable size; when using LCS, are memtables flushed when they hit the desired sstable size (default 160MB), or could L0 sstables be much larger than that? Wondering because I have an overwrite workload where larger memtables would be helpful, and whether I need to increase my LCS sstable size in order to allow for that. -dan
Re: memtable flush size with LCS
Thanks, I am using most of the suggested parameters to tune compactions. To clarify, when you say "The sstable_size_in_mb can be thought of as a target for the compaction process moving the file beyond L0." do you mean that this property is ignored at memtable flush time, and so memtables are already allowed to be much larger than sstable_size_in_mb? On Tue, Oct 27, 2015 at 2:57 PM, Nate McCall <n...@thelastpickle.com> wrote: > The sstable_size_in_mb can be thought of as a target for the compaction > process moving the file beyond L0. > > Note: If there are more than 32 SSTables in L0, it will switch over to > doing STCS for L0 (you can disable this behavior by passing > -Dcassandra.disable_stcs_in_l0=true as a system property). > > With a lot of overwrites, the settings you want to tune will be > gc_grace_seconds in combination with tombstone_threshold, > tombstone_compaction_interval and maybe unchecked_tombstone_compaction > (there are different opinions about this last one, YMMV). Making these more > aggressive and increasing your sstable_size_in_mb will allow for > potentially capturing more overwrites in a level which will lead to less > fragmentation. However, making the size too large will keep compaction from > triggering on further out levels which can then exacerbate problems > particularly if you have long-lived TTLs. > > In general, it is very workload specific, but monitoring the histogram for > the number of sstables used in a read (via > org.apache.cassandra.metrics.ColumnFamily.$KEYSPACE.$TABLE.SSTablesPerReadHistogram.95percentile > or shown manually in nodetool cfhistograms output) after any change will > help you narrow in on a good setting. > > See > http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html?scroll=compactSubprop__compactionSubpropertiesLCS > for more details. 
> > On Tue, Oct 27, 2015 at 3:42 PM, Dan Kinder <dkin...@turnitin.com> wrote: > > > > Hi all, > > > > The docs indicate that memtables are triggered to flush when data in the > commitlog is expiring or based on memtable_flush_period_in_ms. > > > > But LCS has a specified sstable size; when using LCS are memtables > flushed when they hit the desired sstable size (default 160MB) or could L0 > sstables be much larger than that? > > > > Wondering because I have an overwrite workload where larger memtables > would be helpful, and if I need to increase my LCS sstable size in order to > allow for that. > > > > -dan > > > > > -- > - > Nate McCall > Austin, TX > @zznate > > Co-Founder & Sr. Technical Consultant > Apache Cassandra Consulting > http://www.thelastpickle.com > -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
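The LCS subproperties Nate lists can be set together on a table. A hypothetical CQL sketch (the table name and every value below are placeholders for illustration, not recommendations; per the advice above, adjust incrementally while watching SSTablesPerReadHistogram):

```cql
ALTER TABLE myks.mytable WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 256,
  'tombstone_threshold': 0.2,
  'tombstone_compaction_interval': 86400,
  'unchecked_tombstone_compaction': 'true'
};
```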
future very wide row support
Hi,

My understanding is that wide row support (i.e. many columns/CQL-rows/cells per partition key) has gotten much better in the past few years; even though the theoretical limit of 2 billion has been much higher than the practical limit for a long time, it seems like Cassandra is now able to handle these better (ex. incremental compactions so Cassandra doesn't OOM). So I'm wondering:

- With more recent improvements (say, including up to 2.2 or maybe 3.0), is the practical limit still much lower than 2 billion? Do we have any idea what limits us in this regard? (Maybe repair is still another bottleneck?)
- Is the 2 billion limit an SSTable limitation? https://issues.apache.org/jira/browse/CASSANDRA-7447 seems to indicate that it might be. Is there any future work we think will increase this limit?

A couple of caveats: I am aware that even if such a large partition is possible it may not usually be practical, because it works against Cassandra's primary feature of sharding data to multiple nodes and parallelizing access. However some analytics/batch processing use-cases could benefit from the guarantee that a certain set of data is together on a node. It can also make certain data modeling situations a bit easier, where currently we just need to model around the limitation. Also, 2 billion rows of small columns only adds up to data in the tens of gigabytes, and the use of larger nodes these days means that practically one node could hold much larger partitions. And lastly, there are just cases where 99.999% of partition keys are going to be pretty small, but there are potential outliers that could be very large; it would be great for Cassandra to handle these even if it is suboptimal, helping us all avoid having to model around such exceptions.

Well, this turned into something of an essay... thanks for reading, and glad to receive input on this.
Re: Overwhelming tombstones with LCS
On Sun, Jul 5, 2015 at 1:40 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:

> Hey guys,
>
> I have a table with RF=3 and LCS. Data model makes use of wide rows. A certain query run against this table times out and tracing reveals the following error on two out of three nodes: *Scanned over 10 tombstones; query aborted (see tombstone_failure_threshold)* This basically means every request with CL higher than one fails. I have two questions:
>
> * How could it happen that only two out of three nodes have overwhelming tombstones? For the third node tracing shows sensible *Read 815 live and 837 tombstoned cells* traces.

One theory: before 2.1.6, compactions on wide rows with lots of tombstones could take forever or potentially never finish. What version of Cassandra are you on? It may be that you got lucky with one node that has been able to keep up but the others haven't been able to.

> * Anything I can do to fix those two nodes? I have already set gc_grace to 1 day and tried to make the compaction strategy more aggressive (unchecked_tombstone_compaction = true, tombstone_threshold = 0.01) to no avail - a couple of days have already passed and it still gives the same error.

You probably want major compaction, which is coming soon for LCS (https://issues.apache.org/jira/browse/CASSANDRA-7272) but not here yet. The alternative is, if you have enough time and headroom (this is going to do some pretty serious compaction so be careful), alter your table to STCS, let it compact into one SSTable, then convert back to LCS. It's pretty heavy-handed but as long as your gc_grace is low enough it'll do the job. Definitely do NOT do this if you have many tombstones in single wide rows and are not on 2.1.6.

> Thanks! Roman

-- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Commitlog still replaying after drain shutdown
Hi all, To quote Sebastian Estevez in one recent thread: You said you ran a nodetool drain before the restart, but your logs show commitlogs replayed. That does not add up... The docs seem to generally agree with this: if you did `nodetool drain` before restarting your node there shouldn't be any commitlogs. But my experience has been that if I do `nodetool drain`, I need to wait at least 30-60 seconds after it has finished if I really want no commitlog replay on restart. If I restart immediately (or even 10-20s later) then it replays plenty. (This was true on 2.X and is still true on 2.1.7 for me.) Is this unusual or the same thing others see? Is `nodetool drain` really supposed to wait until all memtables are flushed and commitlogs are deleted before it returns? Thanks, -dan
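One way to work around the behavior described above is to not trust the return of `nodetool drain` alone, but to wait until the commitlog directory is actually empty before restarting. A hedged sketch (the helper name is hypothetical, and the demo below runs it against a scratch directory rather than a real /var/lib/cassandra/commitlog):

```shell
# wait_for_empty_dir polls a directory until it contains no files,
# giving up after a timeout (in seconds).
wait_for_empty_dir() {
  dir="$1"
  timeout="${2:-60}"
  waited=0
  while [ "$(ls -A "$dir" 2>/dev/null | wc -l)" -ne 0 ]; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout"
      return 1
    fi
    sleep 1
  done
  echo "empty"
}

# Usage against a real node would be roughly:
#   nodetool drain && wait_for_empty_dir /var/lib/cassandra/commitlog && <restart>
# Demo on a scratch directory:
scratch=$(mktemp -d)
wait_for_empty_dir "$scratch"   # an empty dir reports "empty" immediately
```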
Re: counters still inconsistent after repair
Thanks Rob, this was helpful. More counters will be added soon, I'll let you know if those have any problems. On Mon, Jun 15, 2015 at 4:32 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Jun 15, 2015 at 2:52 PM, Dan Kinder dkin...@turnitin.com wrote: Potentially relevant facts: - Recently upgraded to 2.1.6 from 2.0.14 - This table has ~million rows, low contention, and fairly high increment rate Can you repro on a counter that was created after the upgrade? Mainly wondering: - Is this known or expected? I know Cassandra counters have had issues but thought by now it should be able to keep a consistent counter or at least repair it... All counters which haven't been written to since the 2.1 "new counters" change are still on disk as old counters and will remain that way until UPDATEd and then compacted together with all old shards. Old counters can exhibit this behavior. - Any way to reset this counter? Per Aleksey (in IRC) you can turn a replica for an old counter into a new counter by UPDATEing it once. In order to do that without modifying the count, you can [1] : UPDATE tablename SET countercolumn = countercolumn + 0 WHERE id = 1; The important caveat is that this must be done at least once per shard, with one shard per RF. The only way one can be sure that all shards have been UPDATEd is by contacting each replica node and doing the UPDATE + 0 there, because local writes are preferred. To summarize, the optimal process to upgrade your pre-existing counters to 2.1-era new counters: 1) get a list of all counter keys 2) get a list of replicas per counter key 3) connect to each replica for each counter key and issue an UPDATE + 0 for that counter key 4) run a major compaction As an aside, Aleksey suggests that the above process is so heavyweight that it may not be worth it. If you just leave them be, all counters you're actually using will become progressively more accurate over time. =Rob [1] Special thanks to Jeff Jirsa for verifying that this syntax works. 
-- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
counters still inconsistent after repair
Currently on 2.1.6 I'm seeing behavior like the following:

cqlsh:walker> select * from counter_table where field = 'test';

 field | value
-------+-------
  test |    30

(1 rows)

cqlsh:walker> select * from counter_table where field = 'test';

 field | value
-------+-------
  test |    90

(1 rows)

cqlsh:walker> select * from counter_table where field = 'test';

 field | value
-------+-------
  test |    30

(1 rows)

Using tracing I can see that one node has wrong data. However running repair on this table does not seem to have done anything; I still see the wrong value returned from this same node.

Potentially relevant facts:
- Recently upgraded to 2.1.6 from 2.0.14
- This table has ~million rows, low contention, and fairly high increment rate

Mainly wondering:
- Is this known or expected? I know Cassandra counters have had issues but thought by now it should be able to keep a consistent counter or at least repair it...
- Any way to reset this counter?
- Any other stuff I can check?
Re: Multiple cassandra instances per physical node
@James Rothering yeah I was thinking of container in a broad sense: either full virtual machines, docker containers, straight LXC, or whatever else would allow the Cassandra nodes to have their own IPs and bind to default ports. @Jonathan Haddad thanks for the blog post. To ensure the same host does not replicate its own data, would I basically need the nodes on a single host to be labeled as one rack? (Assuming I use vnodes) On Thu, May 21, 2015 at 1:02 PM, Sebastian Estevez sebastian.este...@datastax.com wrote: JBOD -- just a bunch of disks, no raid. All the best, Sebastián Estévez Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com On Thu, May 21, 2015 at 4:00 PM, James Rothering jrother...@codojo.me wrote: Hmmm ... Not familiar with JBOD. Is that just RAID-0? Also ... wrt the container talk, is that a Docker container you're talking about? On Thu, May 21, 2015 at 12:48 PM, Jonathan Haddad j...@jonhaddad.com wrote: If you run it in a container with dedicated IPs it'll work just fine. Just be sure you aren't using the same machine to replicate it's own data. On Thu, May 21, 2015 at 12:43 PM Manoj Khangaonkar khangaon...@gmail.com wrote: +1. 
I agree we need to be able to run multiple server instances on one physical machine. This is especially necessary in development and test environments where one is experimenting and needs a cluster, but do not have access to multiple physical machines. If you google , you can find a few blogs that talk about how to do this. But it is less than ideal. We need to be able to do it by changing ports in cassandra.yaml. ( The way it is done easily with Hadoop or Apache Kafka or Redis and many other distributed systems) regards On Thu, May 21, 2015 at 10:32 AM, Dan Kinder dkin...@turnitin.com wrote: Hi, I'd just like some clarity and advice regarding running multiple cassandra instances on a single large machine (big JBOD array, plenty of CPU/RAM). First, I am aware this was not Cassandra's original design, and doing this seems to unreasonably go against the commodity hardware intentions of Cassandra's design. In general it seems to be recommended against (at least as far as I've heard from @Rob Coli and others). However maybe this term commodity is changing... my hardware/ops team argues that due to cooling, power, and other datacenter costs, having slightly larger nodes (=32G RAM, =24 CPU, =8 disks JBOD) is actually a better price point. Now, I am not a hardware guy, so if this is not actually true I'd love to hear why, otherwise I pretty much need to take them at their word. Now, Cassandra features seemed to have improved such that JBOD works fairly well, but especially with memory/GC this seems to be reaching its limit. One Cassandra instance can only scale up so much. So my question is: suppose I take a 12 disk JBOD and run 2 Cassandra nodes (each with 5 data disks, 1 commit log disk) and either give each its own container IP or change the listen ports. Will this work? What are the risks? Will/should Cassandra support this better in the future? -- http://khangaonkar.blogspot.com/ -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
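On the rack question raised in this thread: one hedged sketch of how this could look with GossipingPropertyFileSnitch is to give every Cassandra instance on the same physical host the same rack name in conf/cassandra-rackdc.properties, so that NetworkTopologyStrategy (which tries to place replicas in distinct racks) avoids putting two copies of the same data on one host. The dc/rack names below are placeholders, and this only helps if the number of racks is at least the replication factor:

```
# cassandra-rackdc.properties for BOTH instances running on physical host 1
dc=DC1
rack=host1
```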
Multiple cassandra instances per physical node
Hi, I'd just like some clarity and advice regarding running multiple cassandra instances on a single large machine (big JBOD array, plenty of CPU/RAM). First, I am aware this was not Cassandra's original design, and doing this seems to unreasonably go against the commodity hardware intentions of Cassandra's design. In general it seems to be recommended against (at least as far as I've heard from @Rob Coli and others). However maybe this term commodity is changing... my hardware/ops team argues that due to cooling, power, and other datacenter costs, having slightly larger nodes (>=32G RAM, >=24 CPU, >=8 disks JBOD) is actually a better price point. Now, I am not a hardware guy, so if this is not actually true I'd love to hear why; otherwise I pretty much need to take them at their word. Now, Cassandra features seem to have improved such that JBOD works fairly well, but especially with memory/GC this seems to be reaching its limit. One Cassandra instance can only scale up so much. So my question is: suppose I take a 12 disk JBOD and run 2 Cassandra nodes (each with 5 data disks, 1 commit log disk) and either give each its own container IP or change the listen ports. Will this work? What are the risks? Will/should Cassandra support this better in the future?
Delete query range limitation
I understand that range deletes are currently not supported ( http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key ) Since Cassandra now does have range tombstones is there a reason why it can't be allowed? Is there a ticket for supporting this or is it a deliberate design decision not to?
Finding nodes that own a given token/partition key
Hey all, In certain cases it would be useful for us to find out which node(s) have the data for a given token/partition key. The only solution I'm aware of is to select from system.local and/or system.peers to grab the host_id and tokens, do `SELECT token(thing) FROM myks.mytable WHERE thing = 'value';`, then do the math (put the ring together) and figure out which node has the data. I'm assuming this is what token aware drivers are doing. Is there a simpler way to do this? A bit more context: we'd like to move some processing closer to data, but for a few reasons hadoop/spark aren't good options for the moment.
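The "do the math (put the ring together)" step can be sketched as follows (in Go, since that is what appears elsewhere in these threads): with the node tokens from system.local/system.peers sorted ascending, the primary owner of a key's token is the node with the smallest token greater than or equal to it, wrapping past the largest token back to the first node. The token values below are made up for illustration; real Murmur3 tokens come from `token(...)` or the driver.

```go
package main

import (
	"fmt"
	"sort"
)

// ownerIndex returns the index (into sortedTokens) of the node that primarily
// owns token t on the ring: the first node token >= t, wrapping to index 0.
// Additional replicas are found by walking the ring per the replication strategy.
func ownerIndex(sortedTokens []int64, t int64) int {
	i := sort.Search(len(sortedTokens), func(i int) bool { return sortedTokens[i] >= t })
	if i == len(sortedTokens) {
		return 0 // t is past the last token: wraps around to the first node
	}
	return i
}

func main() {
	tokens := []int64{-6000, -1000, 3000, 8000} // one token per node (no vnodes), made up
	fmt.Println(ownerIndex(tokens, 2500)) // falls in (-1000, 3000] -> prints 2
	fmt.Println(ownerIndex(tokens, 9000)) // past the last token -> prints 0
}
```

With vnodes the same lookup applies, except each node contributes many tokens and the result maps back to a host via the token-to-host table.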
Re: Finding nodes that own a given token/partition key
Thanks guys, think both of these answer my question. Guess I had overlooked nodetool getendpoints. Hopefully findable by future googlers now. On Thu, Mar 26, 2015 at 2:37 PM, Adam Holmberg adam.holmb...@datastax.com wrote: Dan, Depending on your context, many of the DataStax drivers have the token ring exposed client-side. For example, Python: http://datastax.github.io/python-driver/api/cassandra/metadata.html#tokens-and-ring-topology Java: http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/Metadata.html You may not have to construct this yourself. Adam Holmberg On Thu, Mar 26, 2015 at 3:53 PM, Roman Tkachenko ro...@mailgunhq.com wrote: Hi Dan, Have you tried using nodetool getendpoints? It shows you nodes that currently own the specific key. Roman On Thu, Mar 26, 2015 at 1:21 PM, Dan Kinder dkin...@turnitin.com wrote: Hey all, In certain cases it would be useful for us to find out which node(s) have the data for a given token/partition key. The only solutions I'm aware of is to select from system.local and/or system.peers to grab the host_id and tokens, do `SELECT token(thing) FROM myks.mytable WHERE thing = 'value';`, then do the math (put the ring together) and figure out which node has the data. I'm assuming this is what token aware drivers are doing. Is there a simpler way to do this? A bit more context: we'd like to move some processing closer to data, but for a few reasons hadoop/spark aren't good options for the moment. -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster
Per Aleksey Yeschenko's comment on that ticket, it does seem like a timestamp granularity issue, but it should work properly if it is within the same session. gocql by default uses 2 connections and 128 streams per connection. If you set it to 1 connection with 1 stream this problem goes away. I suppose that'll take care of it in testing. At least one interesting conclusion here: a gocql.Session does not map to one Cassandra session. This makes some sense given that gocql says a Session may be shared concurrently (so it had better not be just one Cassandra session), but it is a bit concerning that there is no way to make this 100% safe short of cutting the gocql.Session down to 1 connection and 1 stream. On Mon, Mar 2, 2015 at 5:34 PM, Peter Sanford psanf...@retailnext.net wrote: The more I think about it, the more this feels like a column timestamp issue. If two inserts have the same timestamp then the values are compared lexically to decide which one to keep (which I think explains the 99/100 and 999/1000 mystery). We can verify this by also selecting out the WRITETIME of the column:

...
var prevTS int
for i := 0; i < 10000; i++ {
	val := fmt.Sprintf("%d", i)
	db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()
	var result string
	var ts int
	db.Query("SELECT val, WRITETIME(val) FROM ut.test WHERE key = 'foo'").Scan(&result, &ts)
	if result != val {
		fmt.Printf("Expected %v but got: %v; (prevTS:%d, ts:%d)\n", val, result, prevTS, ts)
	}
	prevTS = ts
}

When I run it with this change I see that the timestamps are in fact the same: Expected 10 but got: 9; (prevTS:1425345839903000, ts:1425345839903000) Expected 100 but got: 99; (prevTS:1425345839939000, ts:1425345839939000) Expected 101 but got: 99; (prevTS:1425345839939000, ts:1425345839939000) Expected 1000 but got: 999; (prevTS:1425345840296000, ts:1425345840296000) It looks like we're only getting millisecond precision instead of microsecond for the column timestamps?!
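Peter's tie-break explanation can be checked locally: when two writes carry equal timestamps, the lexically greater value (compared as bytes) is kept, which is exactly why "9" survives a write of "10" and "99" survives "100". A self-contained sketch of that comparison (the function name is mine, not Cassandra's):

```go
package main

import (
	"fmt"
	"strings"
)

// winnerOnTie mimics the tie-break for two live cells with the same write
// timestamp: keep the lexically greater value (compared as bytes, not numbers).
func winnerOnTie(a, b string) string {
	if strings.Compare(a, b) >= 0 {
		return a
	}
	return b
}

func main() {
	// As byte strings, "99" > "100" and "999" > "1000", so the older value
	// survives a same-timestamp overwrite -- exactly the observed failures.
	fmt.Println(winnerOnTie("99", "100"))   // 99
	fmt.Println(winnerOnTie("999", "1000")) // 999
	// No anomaly within a digit count: "99" correctly replaces "98".
	fmt.Println(winnerOnTie("98", "99")) // 99
}
```

This also explains why the failures cluster at 10, 100, and 1000: those are the only iterations where the new value sorts lexically *before* the previous one.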
If you explicitly set the timestamp value when you do the insert, you can get actual microsecond precision and the issue should go away. -psanford On Mon, Mar 2, 2015 at 4:21 PM, Dan Kinder dkin...@turnitin.com wrote: Yeah I thought that was suspicious too, it's mysterious and fairly consistent. (By the way I had error checking but removed it for email brevity, but thanks for verifying :) ) On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford psanf...@retailnext.net wrote: Hmm. I was able to reproduce the behavior with your go program on my dev machine (C* 2.0.12). I was hoping it was going to just be an unchecked error from the .Exec() or .Scan(), but that is not the case for me. The fact that the issue seems to happen on loop iteration 10, 100 and 1000 is pretty suspicious. I took a tcpdump to confirm that the gocql was in fact sending the write 100 query and then on the next read Cassandra responded with 99. I'll be interested to see what the result of the jira ticket is. -psanford -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
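One way to apply psanford's suggestion from the client side is to generate strictly increasing microsecond timestamps yourself and attach them to every write (with CQL's USING TIMESTAMP, or gocql's per-query timestamp option if your driver version has one). A sketch of the generator part, which runs standalone; the attachment shown in the trailing comment is an assumption, not verified against any particular gocql release:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tsGen hands out strictly increasing microsecond timestamps, so two writes
// issued within the same microsecond (or millisecond) can never collide.
type tsGen struct {
	mu   sync.Mutex
	last int64
}

func (g *tsGen) next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := time.Now().UnixNano() / 1000 // microseconds since epoch
	if now <= g.last {
		now = g.last + 1 // clock hasn't advanced; bump monotonically
	}
	g.last = now
	return now
}

func main() {
	g := &tsGen{}
	a, b, c := g.next(), g.next(), g.next()
	fmt.Println(a < b && b < c) // true: always strictly increasing
	// Attaching it to a write would look roughly like (assuming gocql):
	//   session.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).
	//       WithTimestamp(g.next()).Exec()
}
```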
Re: Less frequent flushing with LCS
Nope, they flush every 5 to 10 minutes. On Mon, Mar 2, 2015 at 1:13 PM, Daniel Chia danc...@coursera.org wrote: Do the tables look like they're being flushed every hour? It seems like the setting memtable_flush_after_mins which I believe defaults to 60 could also affect how often your tables are flushed. Thanks, Daniel On Mon, Mar 2, 2015 at 11:49 AM, Dan Kinder dkin...@turnitin.com wrote: I see, thanks for the input. Compression is not enabled at the moment, but I may try increasing that number regardless. Also I don't think in-memory tables would work since the dataset is actually quite large. The pattern is more like a given set of rows will receive many overwriting updates and then not be touched for a while. On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com wrote: Theoretically sstable_size_in_mb could be causing it to flush (it's at the default 160MB)... though we are flushing well before we hit 160MB. I have not tried changing this but we don't necessarily want all the sstables to be large anyway, I've always wished that the log message told you *why* the SSTable was being flushed, which of the various bounds prompted the flush. In your case, the size on disk may be under 160MB because compression is enabled. I would start by increasing that size. Datastax DSE has in-memory tables for this use case. =Rob -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster
Yeah I thought that was suspicious too, it's mysterious and fairly consistent. (By the way I had error checking but removed it for email brevity, but thanks for verifying :) ) On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford psanf...@retailnext.net wrote: Hmm. I was able to reproduce the behavior with your go program on my dev machine (C* 2.0.12). I was hoping it was going to just be an unchecked error from the .Exec() or .Scan(), but that is not the case for me. The fact that the issue seems to happen on loop iteration 10, 100 and 1000 is pretty suspicious. I took a tcpdump to confirm that the gocql was in fact sending the write 100 query and then on the next read Cassandra responded with 99. I'll be interested to see what the result of the jira ticket is. -psanford -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster
Done: https://issues.apache.org/jira/browse/CASSANDRA-8892 On Mon, Mar 2, 2015 at 3:26 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Mar 2, 2015 at 11:44 AM, Dan Kinder dkin...@turnitin.com wrote: I had been having the same problem as in those older post: http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E As I said on that thread : It sounds unreasonable/unexpected to me, if you have a trivial repro case, I would file a JIRA. =Rob -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Re: Less frequent flushing with LCS
I see, thanks for the input. Compression is not enabled at the moment, but I may try increasing that number regardless. Also I don't think in-memory tables would work since the dataset is actually quite large. The pattern is more like a given set of rows will receive many overwriting updates and then not be touched for a while. On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com wrote: Theoretically sstable_size_in_mb could be causing it to flush (it's at the default 160MB)... though we are flushing well before we hit 160MB. I have not tried changing this but we don't necessarily want all the sstables to be large anyway, I've always wished that the log message told you *why* the SSTable was being flushed, which of the various bounds prompted the flush. In your case, the size on disk may be under 160MB because compression is enabled. I would start by increasing that size. Datastax DSE has in-memory tables for this use case. =Rob -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
Reboot: Read After Write Inconsistent Even On A One Node Cluster
Hey all, I had been having the same problem as in this older post: http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E To summarize it, on my local box with just one cassandra node I can update and then select the updated row and get an incorrect response. My understanding is this may have to do with not having fine-grained enough timestamp resolution, but regardless I'm wondering: is this actually a bug or is there any way to mitigate it? It causes sporadic failures in our unit tests, and having to Sleep() between tests isn't ideal. At least confirming it's a bug would be nice though. For those interested, here's a little go program that can reproduce the issue. When I run it I typically see: Expected 100 but got: 99 Expected 1000 but got: 999

--- main.go: ---

package main

import (
	"fmt"

	"github.com/gocql/gocql"
)

func main() {
	cf := gocql.NewCluster("localhost")
	db, _ := cf.CreateSession()

	// Keyspace ut = "update test"
	err := db.Query(`CREATE KEYSPACE IF NOT EXISTS ut WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }`).Exec()
	if err != nil {
		panic(err.Error())
	}
	err = db.Query("CREATE TABLE IF NOT EXISTS ut.test (key text, val text, PRIMARY KEY(key))").Exec()
	if err != nil {
		panic(err.Error())
	}
	err = db.Query("TRUNCATE ut.test").Exec()
	if err != nil {
		panic(err.Error())
	}
	err = db.Query("INSERT INTO ut.test (key) VALUES ('foo')").Exec()
	if err != nil {
		panic(err.Error())
	}

	for i := 0; i < 10000; i++ {
		val := fmt.Sprintf("%d", i)
		db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()
		var result string
		db.Query("SELECT val FROM ut.test WHERE key = 'foo'").Scan(&result)
		if result != val {
			fmt.Printf("Expected %v but got: %v\n", val, result)
		}
	}
}
Less frequent flushing with LCS
Hi all, We have a table in Cassandra where we frequently overwrite recent inserts. Compaction does a fine job with this, but ultimately larger memtables would reduce compactions. The question is: can we make Cassandra use larger memtables and flush less frequently? What currently triggers the flushes? OpsCenter shows them flushing consistently at about 110MB in size; we have plenty of memory to go larger. According to http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_memtable_thruput_c.html we can up the commit log space threshold, but this does not help; there is plenty of runway there. Theoretically sstable_size_in_mb could be causing it to flush (it's at the default 160MB)... though we are flushing well before we hit 160MB. I have not tried changing this, but we don't necessarily want all the sstables to be large anyway. Thanks, -dan
Re: large range read in Cassandra
For the benefit of others, I ended up finding out that the CQL library I was using (https://github.com/gocql/gocql) at this time leaves the page size unset by default (i.e. no paging), so Cassandra was trying to pull all rows of the partition into memory at once. Setting the page size to a reasonable number seems to have done the trick. On Tue, Nov 25, 2014 at 2:54 PM, Dan Kinder dkin...@turnitin.com wrote: Thanks, very helpful Rob, I'll watch for that. On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli rc...@eventbrite.com wrote: On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com wrote: To be clear, I expect this range query to take a long time and perform relatively heavy I/O. What I expected Cassandra to do was use auto-paging ( https://issues.apache.org/jira/browse/CASSANDRA-4415, http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3) so that we aren't literally pulling the entire thing in. Am I misunderstanding this use case? Could you clarify why exactly it would slow way down? It seems like with each read it should be doing a simple range read from one or two sstables. If you're paging through a single partition, that's likely to be fine. When you said range reads ... over rows my impression was you were talking about attempting to page through millions of partitions. With that confusion cleared up, the likely explanation for lack of availability in your case is heap pressure/GC time. Look for GCs around that time. Also, if you're using authentication, make sure that your authentication keyspace has a replication factor greater than 1. =Rob -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
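For future googlers, the loop shape after setting a page size looks like the following. A stand-in iterator is used here so the sketch runs without a cluster; the gocql calls referenced in the comment (Query.PageSize, Iter) are from memory, so check them against your driver version:

```go
package main

import "fmt"

// rowIter is a stand-in for a paged iterator like gocql's *Iter:
// Scan fills dest and reports whether a row was produced.
type rowIter struct {
	rows []string
	pos  int
}

func (it *rowIter) Scan(dest *string) bool {
	if it.pos >= len(it.rows) {
		return false
	}
	*dest = it.rows[it.pos]
	it.pos++
	return true
}

// drain consumes the iterator one row at a time, the same way you would drain
//   iter := session.Query("SELECT ...").PageSize(1000).Iter()
// so only one page of the partition is ever held in memory.
func drain(it *rowIter) int {
	var link string
	n := 0
	for it.Scan(&link) {
		n++ // per-row work (e.g. dispatching a link) would go here
	}
	return n
}

func main() {
	it := &rowIter{rows: []string{"a", "b", "c"}}
	fmt.Println(drain(it)) // 3
}
```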
Re: STCS limitation with JBOD?
Thanks for the info guys. Regardless of the reason for using nodetool compact, it seems like the question still stands... but the impression I'm getting is that nodetool compact on JBOD as I described will basically fall apart. Is that correct? To answer Colin's question as an aside: we have a dataset with fairly high insert load and periodic range reads (batch processing). We have a situation where we may want to rewrite some rows (changing the primary key) by deleting and inserting as a new row. This is not something we would do on a regular basis, but after or during the process a compact would greatly help to clear out tombstones/rewritten data. @Ryan Svihla it also sounds like your suggestion in this case would be: create a new column family, rewrite all data into that, truncate/remove the previous one, and replace it with the new one. On Tue, Jan 6, 2015 at 9:39 AM, Ryan Svihla r...@foundev.pro wrote: nodetool compact is the ultimate running-with-scissors solution; far more people manage to stab themselves in the eye. Customers running with scissors successfully notwithstanding. My favorite discussions usually tend to result: 1. We still have tombstones (so they set gc_grace_seconds to 0) 2. We added a node after fixing it and now a bunch of records that were deleted have come back (usually after setting gc_grace_seconds to 0 and then not blanking nodes that have been offline) 3. Why are my read latencies so spiky? (cause they're on STCS and now have a giant single huge SSTable which worked fine when their data set was tiny, now they're looking at 100 sstables on STCS, which means slooow reads) 4.
We still have tombstones (yeah I know, this again, but this is usually when they've switched to LCS, which basically noops with nodetool compact) All of this is manageable when you have a team that understands the tradeoffs of nodetool compact, but I categorically reject that it's a good experience for new users, as I've unfortunately had about a dozen fire drills this year as a result of nodetool compact alone. Data modeling around partitions that are truncated when falling out of scope is typically far more manageable, works with any compaction strategy, and doesn't require operational awareness at the same scale. On Fri, Jan 2, 2015 at 2:15 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Jan 2, 2015 at 11:28 AM, Colin co...@clark.ws wrote: Forcing a major compaction is usually a bad idea. What is your reason for doing that? I'd say often and not usually. Lots of people have schema where they create way too much garbage, and major compaction can be a good response. The docs' historic incoherent FUD notwithstanding. =Rob -- Thanks, Ryan Svihla -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
STCS limitation with JBOD?
Hi, Forcing a major compaction (using nodetool compact http://datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCompact.html) with STCS will result in a single sstable (ignoring repair data). However this seems like it could be a problem for large JBOD setups. For example if I have 12 disks, 1T each, then it seems like on this node I cannot have one column family store more than 1T worth of data (more or less), because all the data will end up in a single sstable that can exist only on one disk. Is this accurate? The compaction write path docs http://datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_write_path_c.html give a bit of hope that cassandra could split the one final sstable across the disks, but I doubt it is able to and want to confirm. I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in JBOD mode could be solutions to this (with their own problems), but want to verify that this actually is a problem. -dan
Re: large range read in Cassandra
Thanks Rob. To be clear, I expect this range query to take a long time and perform relatively heavy I/O. What I expected Cassandra to do was use auto-paging ( https://issues.apache.org/jira/browse/CASSANDRA-4415, http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3) so that we aren't literally pulling the entire thing in. Am I misunderstanding this use case? Could you clarify why exactly it would slow way down? It seems like with each read it should be doing a simple range read from one or two sstables. If this won't work then it may mean we need to start using Hive/Spark/Pig etc. sooner, or page it manually using LIMIT and WHERE [the last returned result]. On Mon, Nov 24, 2014 at 5:49 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder dkin...@turnitin.com wrote: We have a web crawler project currently based on Cassandra ( https://github.com/iParadigms/walker, written in Go and using the gocql driver), with the following relevant usage pattern: - Big range reads over a CF to grab potentially millions of rows and dispatch new links to crawl If you really mean millions of storage rows, this is just about the worst case for Cassandra. The problem you're having is probably that you shouldn't try to do this in Cassandra. Your timeouts are either from the read actually taking longer than the timeout or from the reads provoking heap pressure and resulting GC. =Rob
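The manual-paging fallback mentioned here (LIMIT plus a WHERE clause on the last returned position) can be sketched locally. Sorted int64s stand in for token(key) values, and fetchPage stands in for a CQL query shaped like SELECT key FROM ks.cf WHERE token(key) > ? LIMIT ?, so the names are illustrative only:

```go
package main

import (
	"fmt"
	"sort"
)

// fetchPage stands in for:
//   SELECT key FROM ks.cf WHERE token(key) > ? LIMIT ?
// over a ring of tokens sorted ascending.
func fetchPage(tokens []int64, after int64, limit int) []int64 {
	i := sort.Search(len(tokens), func(i int) bool { return tokens[i] > after })
	end := i + limit
	if end > len(tokens) {
		end = len(tokens)
	}
	return tokens[i:end]
}

func main() {
	tokens := []int64{-50, -7, 3, 19, 400, 401, 902}
	const pageSize = 3
	last := int64(-1 << 63) // start from the minimum token
	total := 0
	for {
		page := fetchPage(tokens, last, pageSize)
		if len(page) == 0 {
			break // no rows past `last`: we've walked the whole range
		}
		total += len(page)       // process the page here
		last = page[len(page)-1] // resume strictly after the last seen token
	}
	fmt.Println(total) // 7: every row visited exactly once, no duplicates
}
```

The key property is that each query resumes strictly after the last token seen, so memory use is bounded by the page size no matter how large the range is.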
Re: large range read in Cassandra
Thanks, very helpful Rob, I'll watch for that. On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli rc...@eventbrite.com wrote: On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com wrote: To be clear, I expect this range query to take a long time and perform relatively heavy I/O. What I expected Cassandra to do was use auto-paging ( https://issues.apache.org/jira/browse/CASSANDRA-4415, http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3) so that we aren't literally pulling the entire thing in. Am I misunderstanding this use case? Could you clarify why exactly it would slow way down? It seems like with each read it should be doing a simple range read from one or two sstables. If you're paging through a single partition, that's likely to be fine. When you said range reads ... over rows my impression was you were talking about attempting to page through millions of partitions. With that confusion cleared up, the likely explanation for lack of availability in your case is heap pressure/GC time. Look for GCs around that time. Also, if you're using authentication, make sure that your authentication keyspace has a replication factor greater than 1. =Rob -- Dan Kinder Senior Software Engineer Turnitin – www.turnitin.com dkin...@turnitin.com
large range read in Cassandra
Hi, We have a web crawler project currently based on Cassandra ( https://github.com/iParadigms/walker, written in Go and using the gocql driver), with the following relevant usage pattern: - Big range reads over a CF to grab potentially millions of rows and dispatch new links to crawl - Fast insert of new links (effectively using Cassandra to deduplicate) We ultimately planned on doing the batch processing step (the dispatching) in a system like Spark, but for the time being it is also in Go. We believe this should work fine given that Cassandra now properly allows chunked iteration of columns in a CF. The issue is, periodically while doing a particularly large range read, other operations time out because that node is busy. In an experimental cluster with only two nodes (and replication factor of 2), I'll get an error like: Operation timed out - received only 1 responses. Indicating that the second node took too long to reply. At the moment I have the long range reads set to consistency level ANY but the rest of the operations are on QUORUM, so on this cluster they require responses from both nodes. The relevant CF is also using LeveledCompactionStrategy. This happens in both Cassandra 2.0 and 2.1. Despite this error I don't see any significant I/O, memory consumption, or CPU usage. Here are some of the configuration values I've played with: Increasing timeouts: read_request_timeout_in_ms: 15000 range_request_timeout_in_ms: 30000 write_request_timeout_in_ms: 10000 request_timeout_in_ms: 10000 Getting rid of caches we don't need: key_cache_size_in_mb: 0 row_cache_size_in_mb: 0 Each of the 2 nodes has an HDD for the commit log and a single HDD for data.
Hence the following thread config (maybe since I/O is not an issue I should increase these?): concurrent_reads: 16 concurrent_writes: 32 concurrent_counter_writes: 32 Because I have a large number of columns and am not doing random I/O, I've increased this: column_index_size_in_kb: 2048 It's something of a mystery why this error comes up. Of course with a 3rd node it will get masked if I am doing QUORUM operations, but it still seems like it should not happen, and that there is some kind of head-of-line blocking or other issue in Cassandra. I would like to increase the amount of dispatching I'm doing, but because of this issue things bog down if I do. Any suggestions for other things we can try here would be appreciated. -dan