Re: Cassandra 5.0 Beta1 - vector searching results
> For your #1 - if there are going to be 100+ million vectors, wouldn't I want the search to go across nodes?

If you have a replication factor of 3 and 3 nodes, every node will have a complete copy of the data, so you'd only need to talk to one node. If your replication factor is 1, you'd have to talk to all three nodes.

On Wed, Mar 27, 2024 at 9:06 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Thank you all for the details on this.
>
> For your #1 - if there are going to be 100+ million vectors, wouldn't I want the search to go across nodes?
>
> Right now, we're running both Weaviate (8-node cluster), our main Cassandra 4 cluster (12 nodes), and a test 3-node Cassandra 5 cluster. Weaviate does some interesting things like product quantization to reduce size and improve search speed. They get amazing speed, but the drawback is, from what I can tell, that they load the entire index into RAM. We've been having a recurring issue where once it runs out of RAM, it doesn't get slow; it just stops working. Weaviate enables some powerful vector+boolean+range queries. I would love to only have one database!
>
> I'll look into how to do profiling - the terms you use are things I'm not familiar with, but I've got ChatGPT and Google... :)
>
> -Joe
>
> On 3/21/2024 10:51 PM, Caleb Rackliffe wrote:
>
>> To expand on Jonathan's response, the best way to get SAI to perform on the read side is to use it as a tool for large-partition search. In other words, if you can model your data such that your queries will be restricted to a single partition, two things will happen…
>>
>> 1.) With all queries (not just ANN queries), you will only hit as many nodes as your read consistency level and replication factor require. For vector searches, that means you should only hit one node, and it should be the coordinating node w/ a properly configured, token-aware client.
>>
>> 2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your table compaction strategy. This will essentially guarantee your (partition-restricted) SAI query hits a small number of SSTable-attached indexes. (It'll hit Memtable-attached indexes as well for any recently added data, so if you're seeing latencies shoot up, it's possible there could be contention on the Memtable-attached index that supports ANN queries. I haven't done a deep dive on it. You can always flush Memtables directly before queries to factor that out.)
>>
>> If you can do all of the above, the simple performance of the local index query and its post-filtering reads is probably the place to explore further. If you manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc.) I'd be happy to dig into it with you.
>>
>> Thanks for kicking the tires!
>>
>> On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user wrote:
>>
>>> Hi Joe,
>>>
>>> Have you considered submitting something for Community Over Code NA 2024? The CFP is still open for a few more weeks, options could be my Performance Engineering track or the Cassandra track – or both:
>>>
>>> https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D
>>>
>>> Regards, Paul Brebner
>>>
>>> From: Joe Obernberger
>>> Date: Friday, 22 March 2024 at 3:19 am
>>> To: user@cassandra.apache.org
>>> Subject: Cassandra 5.0 Beta1 - vector searching results
>>>
>>> EXTERNAL EMAIL - USE CAUTION when clicking links or attachments
>>>
>>> Hi All - I'd like to share some initial results for the vector search on Cassandra 5.0 beta1. 3-node cluster running in Kubernetes; fast NetApp storage.
>>>
>>> Have a table (doc.embeddings_googleflant5large) with definition:
>>>
>>> CREATE TABLE doc.embeddings_googleflant5large (
>>>     uuid text,
>>>     type text,
>>>     fieldname text,
>>>     offset int,
>>>     sourceurl text,
>>>     textdata text,
>>>     creationdate timestamp,
>>>     embeddings vector,
>>>     metadata boolean,
>>>     source text,
>>>     PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
>>> ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
>>>     AND additional_write_policy = '99p'
>>>     AND allow_auto_snapshot = true
>>>     AND bloom_filter_fp_chance = 0.01
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND cdc = false
>>>     AND comment = ''
>>>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>>>     AND compression = {'chunk_length_in_k
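The replication-factor arithmetic in the reply above can be sanity-checked with a tiny sketch (the helper below is hypothetical and assumes tokens are perfectly evenly distributed, so each node stores RF/N of the data):

```python
# Minimal sketch: with N nodes and replication factor RF, each node stores
# roughly RF/N of the data, so a query that must see *all* vectors needs to
# contact at least ceil(N / RF) nodes.
import math

def min_nodes_for_full_coverage(num_nodes: int, replication_factor: int) -> int:
    """Fewest nodes a scatter-gather search must touch to see every row."""
    return math.ceil(num_nodes / replication_factor)

# 3 nodes, RF=3: every node holds a full copy, so one node suffices.
print(min_nodes_for_full_coverage(3, 3))  # -> 1
# 3 nodes, RF=1: each node holds a disjoint third, so all three are needed.
print(min_nodes_for_full_coverage(3, 1))  # -> 3
```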
Re: Cassandra 5.0 Beta1 - vector searching results
To expand on Jonathan's response, the best way to get SAI to perform on the read side is to use it as a tool for large-partition search. In other words, if you can model your data such that your queries will be restricted to a single partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as your read consistency level and replication factor require. For vector searches, that means you should only hit one node, and it should be the coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your table compaction strategy. This will essentially guarantee your (partition-restricted) SAI query hits a small number of SSTable-attached indexes. (It'll hit Memtable-attached indexes as well for any recently added data, so if you're seeing latencies shoot up, it's possible there could be contention on the Memtable-attached index that supports ANN queries. I haven't done a deep dive on it. You can always flush Memtables directly before queries to factor that out.)

If you can do all of the above, the simple performance of the local index query and its post-filtering reads is probably the place to explore further. If you manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc.) I'd be happy to dig into it with you.

Thanks for kicking the tires!

On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user wrote:

Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The CFP is still open for a few more weeks, options could be my Performance Engineering track or the Cassandra track – or both:

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner

From: Joe Obernberger
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org
Subject: Cassandra 5.0 Beta1 - vector searching results

EXTERNAL EMAIL - USE CAUTION when clicking links or attachments

Hi All - I'd like to share some initial results for the vector search on Cassandra 5.0 beta1. 3-node cluster running in Kubernetes; fast NetApp storage.

Have a table (doc.embeddings_googleflant5large) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
    uuid text,
    type text,
    fieldname text,
    offset int,
    sourceurl text,
    textdata text,
    creationdate timestamp,
    embeddings vector,
    metadata boolean,
    source text,
    PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC, textdata ASC)
    AND additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND incremental_backups = true
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r:

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local  18.02 GiB  128  100.0%  f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB  128  100.0%  ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB  128  100.0%  92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large:

Total number of tables: 1
Keyspace: doc
    Read Count: 0
    Read Latency: NaN ms
    Write Count: 2893108
    Write Latency: 326.3586520174843 ms
    Pending Flushes: 0
    Table: embeddings_googleflant5large
        SSTable count: 6
        Old SSTable count: 0
        Max SSTable size: 5.108GiB
        Space used (live): 19318114423
        Space used (total): 19318114423
        Space used by snapshots (total): 0
        Off heap memory used (total): 4874912
        SSTable Compression Ratio: 0.97448
        Number of partitions (estimate): 58399
        Memtable cell count: 0
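Caleb's advice earlier in the thread (single-partition queries plus LCS) could look roughly like the following against the schema above. This is a hedged sketch only, not verified against 5.0 beta1; the partition-key values, the LIMIT, and the elided vector literal are placeholders:

```
-- Move the table to LCS so partition-restricted SAI reads touch few SSTables:
ALTER TABLE doc.embeddings_googleflant5large
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

-- A partition-restricted ANN query, which should only involve the replicas
-- that own this (uuid, type) partition:
SELECT textdata
FROM doc.embeddings_googleflant5large
WHERE uuid = 'some-uuid' AND type = 'some-type'
ORDER BY embeddings ANN OF [ /* query vector */ ]
LIMIT 10;
```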
Re: Token Ring Gaps in a 2 DC Setup
Yup, all repairs are complete. I'm reading at a CL of ONE pretty much everywhere.

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: aaron morton <aa...@thelastpickle.com>
Reply-To: user@cassandra.apache.org
Date: Tue, 20 Mar 2012 13:15:27 -0400
To: user@cassandra.apache.org
Subject: Re: Token Ring Gaps in a 2 DC Setup

mmm, has repair completed on all nodes?

> Also, while I was digging around, I noticed that we do a LOT of reads immediately after writes, and almost every read from the first DC was bringing a read-repair along with it.

What CL are you using?

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/03/2012, at 7:39 AM, Caleb Rackliffe wrote:

Hey Aaron,

I've run cleanup jobs across all 15 nodes, and after that, I still have about a 24 million to 15 million key ratio between the data centers. The first DC is a few months older than the second, and it also began its life before 1.0.7 was out, whereas the second started at 1.0.7. I wonder if running an upgradesstables would be interesting?

Also, while I was digging around, I noticed that we do a LOT of reads immediately after writes, and almost every read from the first DC was bringing a read-repair along with it. (Possibly because the distant DC had not yet received certain mutations?) I ended up turning RR off entirely, since I've got HH in place to handle short-duration failures :)

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: aaron morton <aa...@thelastpickle.com>
Reply-To: user@cassandra.apache.org
Date: Mon, 19 Mar 2012 13:34:38 -0400
To: user@cassandra.apache.org
Subject: Re: Token Ring Gaps in a 2 DC Setup

> I've also run repair on a few nodes in both data centers, but the sizes are still vastly different.

If repair is completing on all the nodes then the data is fully distributed. If you want to dig around… take a look at the data files on disk. Do the nodes in DC 1 have some larger, older data files? These may be waiting for compaction to catch up with them. If you have done any token moves, did you run cleanup afterwards?

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/03/2012, at 8:35 PM, Caleb Rackliffe wrote:

More detail… I'm running 1.0.7 on these boxes, and the keyspace readout from the CLI looks like this:

create keyspace Users
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 2}
  and durable_writes = true;

Thanks!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Date: Sun, 18 Mar 2012 02:47:05 -0400
To: user@cassandra.apache.org
Subject: Token Ring Gaps in a 2 DC Setup

Hi Everyone,

I have a cluster using NetworkTopologyStrategy that looks like this:

10.41.116.22   DC1  RAC1   Up  Normal  13.21 GB  10.00%  0
10.54.149.202  DC2  RAC1   Up  Normal   6.98 GB   0.00%  1
10.41.116.20   DC1  RAC2   Up  Normal  12.75 GB  10.00%  1701411830
10.41.116.16   DC1  RAC3   Up  Normal  12.62 GB  10.00%  3402823670
10.54.149.203  DC2  RAC1   Up  Normal   6.7 GB    0.00%  3402823671
10.41.116.18   DC1  RAC4   Up  Normal  10.8 GB   10.00%  5104235500
10.41.116.14   DC1  RAC5   Up  Normal  10.27 GB  10.00%  6805647340
10.54.149.204  DC2  RAC1   Up  Normal   6.7 GB    0.00%  6805647341
10.41.116.12   DC1  RAC6   Up  Normal  10.58 GB  10.00%  8507059170
10.41.116.10   DC1  RAC7   Up  Normal  10.89 GB  10.00%  10208471000
10.54.149.205  DC2  RAC1   Up  Normal   7.51 GB   0.00%  10208471001
10.41.116.8    DC1  RAC8   Up  Normal  10.48 GB  10.00
Re: Token Ring Gaps in a 2 DC Setup
Hey Aaron,

I've run cleanup jobs across all 15 nodes, and after that, I still have about a 24 million to 15 million key ratio between the data centers. The first DC is a few months older than the second, and it also began its life before 1.0.7 was out, whereas the second started at 1.0.7. I wonder if running an upgradesstables would be interesting?

Also, while I was digging around, I noticed that we do a LOT of reads immediately after writes, and almost every read from the first DC was bringing a read-repair along with it. (Possibly because the distant DC had not yet received certain mutations?) I ended up turning RR off entirely, since I've got HH in place to handle short-duration failures :)

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: aaron morton <aa...@thelastpickle.com>
Reply-To: user@cassandra.apache.org
Date: Mon, 19 Mar 2012 13:34:38 -0400
To: user@cassandra.apache.org
Subject: Re: Token Ring Gaps in a 2 DC Setup

> I've also run repair on a few nodes in both data centers, but the sizes are still vastly different.

If repair is completing on all the nodes then the data is fully distributed. If you want to dig around… take a look at the data files on disk. Do the nodes in DC 1 have some larger, older data files? These may be waiting for compaction to catch up with them. If you have done any token moves, did you run cleanup afterwards?

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/03/2012, at 8:35 PM, Caleb Rackliffe wrote:

More detail… I'm running 1.0.7 on these boxes, and the keyspace readout from the CLI looks like this:

create keyspace Users
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 2}
  and durable_writes = true;

Thanks!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Date: Sun, 18 Mar 2012 02:47:05 -0400
To: user@cassandra.apache.org
Subject: Token Ring Gaps in a 2 DC Setup

Hi Everyone,

I have a cluster using NetworkTopologyStrategy that looks like this:

10.41.116.22   DC1  RAC1   Up  Normal  13.21 GB  10.00%  0
10.54.149.202  DC2  RAC1   Up  Normal   6.98 GB   0.00%  1
10.41.116.20   DC1  RAC2   Up  Normal  12.75 GB  10.00%  1701411830
10.41.116.16   DC1  RAC3   Up  Normal  12.62 GB  10.00%  3402823670
10.54.149.203  DC2  RAC1   Up  Normal   6.7 GB    0.00%  3402823671
10.41.116.18   DC1  RAC4   Up  Normal  10.8 GB   10.00%  5104235500
10.41.116.14   DC1  RAC5   Up  Normal  10.27 GB  10.00%  6805647340
10.54.149.204  DC2  RAC1   Up  Normal   6.7 GB    0.00%  6805647341
10.41.116.12   DC1  RAC6   Up  Normal  10.58 GB  10.00%  8507059170
10.41.116.10   DC1  RAC7   Up  Normal  10.89 GB  10.00%  10208471000
10.54.149.205  DC2  RAC1   Up  Normal   7.51 GB   0.00%  10208471001
10.41.116.8    DC1  RAC8   Up  Normal  10.48 GB  10.00%  11909882800
10.41.116.24   DC1  RAC9   Up  Normal  10.89 GB  10.00%  13611294700
10.54.149.206  DC2  RAC1   Up  Normal   6.37 GB   0.00%  13611294701
10.41.116.26   DC1  RAC10  Up  Normal  11.17 GB  10.00%  15312706500

There are two data centers, one with 10 nodes/2 replicas and one with 5 nodes/1 replica. What I've attempted to do with my token assignments is have each node in the smaller DC handle 20% of the keyspace, and this would mean that I should see roughly equal usage on all 15 boxes.

It just doesn't seem to be happening that way, though. It looks like the 1-replica nodes are carrying about half the data the 2-replica nodes are. It's almost as if those nodes are only handling 10% of the keyspace instead of 20%. Does anybody have any suggestions as to what might be going on? I've run nodetool getendpoints against a bunch of keys, and I always get back three nodes, so I'm
Token Ring Gaps in a 2 DC Setup
Hi Everyone,

I have a cluster using NetworkTopologyStrategy that looks like this:

10.41.116.22   DC1  RAC1   Up  Normal  13.21 GB  10.00%  0
10.54.149.202  DC2  RAC1   Up  Normal   6.98 GB   0.00%  1
10.41.116.20   DC1  RAC2   Up  Normal  12.75 GB  10.00%  1701411830
10.41.116.16   DC1  RAC3   Up  Normal  12.62 GB  10.00%  3402823670
10.54.149.203  DC2  RAC1   Up  Normal   6.7 GB    0.00%  3402823671
10.41.116.18   DC1  RAC4   Up  Normal  10.8 GB   10.00%  5104235500
10.41.116.14   DC1  RAC5   Up  Normal  10.27 GB  10.00%  6805647340
10.54.149.204  DC2  RAC1   Up  Normal   6.7 GB    0.00%  6805647341
10.41.116.12   DC1  RAC6   Up  Normal  10.58 GB  10.00%  8507059170
10.41.116.10   DC1  RAC7   Up  Normal  10.89 GB  10.00%  10208471000
10.54.149.205  DC2  RAC1   Up  Normal   7.51 GB   0.00%  10208471001
10.41.116.8    DC1  RAC8   Up  Normal  10.48 GB  10.00%  11909882800
10.41.116.24   DC1  RAC9   Up  Normal  10.89 GB  10.00%  13611294700
10.54.149.206  DC2  RAC1   Up  Normal   6.37 GB   0.00%  13611294701
10.41.116.26   DC1  RAC10  Up  Normal  11.17 GB  10.00%  15312706500

There are two data centers, one with 10 nodes/2 replicas and one with 5 nodes/1 replica. What I've attempted to do with my token assignments is have each node in the smaller DC handle 20% of the keyspace, and this would mean that I should see roughly equal usage on all 15 boxes.

It just doesn't seem to be happening that way, though. It looks like the 1-replica nodes are carrying about half the data the 2-replica nodes are. It's almost as if those nodes are only handling 10% of the keyspace instead of 20%. Does anybody have any suggestions as to what might be going on? I've run nodetool getendpoints against a bunch of keys, and I always get back three nodes, so I'm pretty confused. I've also run repair on a few nodes in both data centers, but the sizes are still vastly different.

Thanks!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
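One way to sanity-check what the token assignments above give each data center (a hypothetical sketch: the ring size is assumed to be 10 × 1701411830 to match the abbreviated tokens in the listing, and real per-DC ownership is computed by Cassandra, not by this script):

```python
# Sketch: with NetworkTopologyStrategy, replicas are chosen per data center,
# so what matters is each node's share of the ring *among same-DC nodes*.
# The 0.00% the ring listing prints for DC2 reflects the global ring only.
RING = 10 * 1701411830  # assumed ring size matching the abbreviated tokens

DC1 = [0, 1701411830, 3402823670, 5104235500, 6805647340,
       8507059170, 10208471000, 11909882800, 13611294700, 15312706500]
DC2 = [1, 3402823671, 6805647341, 10208471001, 13611294701]

def dc_local_shares(tokens, ring=RING):
    """Fraction of the ring each node covers, counting only same-DC nodes."""
    ts = sorted(tokens)
    return {t: ((t - ts[i - 1]) % ring) / ring for i, t in enumerate(ts)}

for token, share in dc_local_shares(DC2).items():
    print(token, f"{share:.1%}")  # each DC2 node covers roughly 20%
```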
Re: consistency level question
If your replication factor is set to one, your cluster is obviously in a bad state following any node failure. At best, I think it would make sense that about a third of your operations fail, but I'm not sure why all of them would. I don't know if Hector just refuses to work with a compromised cluster, etc. I guess I'm wondering why your replication factor is set to 1…

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Tamar Fraenkel <ta...@tok-media.com>
Reply-To: user@cassandra.apache.org
Date: Sun, 18 Mar 2012 03:15:53 -0400
To: cassandra-u...@incubator.apache.org
Subject: consistency level question

Hi!

I have a 3-node Cassandra cluster. I use the Hector API. I give Hector one of the nodes' IP addresses and call setAutoDiscoverHosts(true) and setRunAutoDiscoveryAtStartup(true). describe on one node returns:

Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Durable Writes: true
Options: [replication_factor:1]

The odd thing is that when I take one of the nodes down, expecting all to continue running smoothly, I get exceptions of the format seen below, and no read or write succeeds. When I bring the node back up, the exceptions stop and reads and writes resume. Any idea or explanation why this is the case?

Thanks!

me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:66)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:285)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:268)
        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:246)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
        at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
        at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
        at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
        at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
        at me.prettyprint.cassandra.service.ColumnSliceIterator.hasNext(ColumnSliceIterator.java:60)

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736  Mob: +972 54 8356490  Fax: +972 2 5612956
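Caleb's "about a third of your operations" estimate above can be modeled with a toy ring (hypothetical, evenly spaced nodes and SimpleStrategy-style placement; real token ranges are not this uniform):

```python
# Rough model: with SimpleStrategy and RF=1 on a 3-node ring, each key lives
# on exactly one node, so one dead node should make roughly a third of
# requests fail with an UnavailableException at CL=ONE.
NODES = [0, 1, 2]   # three evenly spaced nodes (toy model)
DOWN = {2}          # one node taken down
RF = 1

def replicas(key_hash: int, rf: int):
    """SimpleStrategy-style placement: primary node plus the next rf-1 on the ring."""
    primary = key_hash % len(NODES)
    return {(primary + i) % len(NODES) for i in range(rf)}

unavailable = sum(
    1 for k in range(30000)
    if not (replicas(k, RF) - DOWN)  # no live replica -> CL=ONE cannot be met
)
print(unavailable / 30000)  # ~1/3 of keys fail with RF=1
```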
Re: Token Ring Gaps in a 2 DC Setup
More detail… I'm running 1.0.7 on these boxes, and the keyspace readout from the CLI looks like this:

create keyspace Users
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC2 : 1, DC1 : 2}
  and durable_writes = true;

Thanks!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Date: Sun, 18 Mar 2012 02:47:05 -0400
To: user@cassandra.apache.org
Subject: Token Ring Gaps in a 2 DC Setup

Hi Everyone,

I have a cluster using NetworkTopologyStrategy that looks like this:

10.41.116.22   DC1  RAC1   Up  Normal  13.21 GB  10.00%  0
10.54.149.202  DC2  RAC1   Up  Normal   6.98 GB   0.00%  1
10.41.116.20   DC1  RAC2   Up  Normal  12.75 GB  10.00%  1701411830
10.41.116.16   DC1  RAC3   Up  Normal  12.62 GB  10.00%  3402823670
10.54.149.203  DC2  RAC1   Up  Normal   6.7 GB    0.00%  3402823671
10.41.116.18   DC1  RAC4   Up  Normal  10.8 GB   10.00%  5104235500
10.41.116.14   DC1  RAC5   Up  Normal  10.27 GB  10.00%  6805647340
10.54.149.204  DC2  RAC1   Up  Normal   6.7 GB    0.00%  6805647341
10.41.116.12   DC1  RAC6   Up  Normal  10.58 GB  10.00%  8507059170
10.41.116.10   DC1  RAC7   Up  Normal  10.89 GB  10.00%  10208471000
10.54.149.205  DC2  RAC1   Up  Normal   7.51 GB   0.00%  10208471001
10.41.116.8    DC1  RAC8   Up  Normal  10.48 GB  10.00%  11909882800
10.41.116.24   DC1  RAC9   Up  Normal  10.89 GB  10.00%  13611294700
10.54.149.206  DC2  RAC1   Up  Normal   6.37 GB   0.00%  13611294701
10.41.116.26   DC1  RAC10  Up  Normal  11.17 GB  10.00%  15312706500

There are two data centers, one with 10 nodes/2 replicas and one with 5 nodes/1 replica. What I've attempted to do with my token assignments is have each node in the smaller DC handle 20% of the keyspace, and this would mean that I should see roughly equal usage on all 15 boxes.

It just doesn't seem to be happening that way, though. It looks like the 1-replica nodes are carrying about half the data the 2-replica nodes are. It's almost as if those nodes are only handling 10% of the keyspace instead of 20%. Does anybody have any suggestions as to what might be going on? I've run nodetool getendpoints against a bunch of keys, and I always get back three nodes, so I'm pretty confused. I've also run repair on a few nodes in both data centers, but the sizes are still vastly different.

Thanks!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
Re: consistency level question
That sounds right to me :)

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Tamar Fraenkel <ta...@tok-media.com>
Reply-To: user@cassandra.apache.org
Date: Sun, 18 Mar 2012 04:20:58 -0400
To: user@cassandra.apache.org
Subject: Re: consistency level question

Thanks! I updated the replication factor to 2, and now when I took one node down, everything continued running (I did see Hector complaining about the node being down), but things were saved to the DB and read from it.

Just so I understand: now, having a replication factor of 2, if I have 2 out of 3 nodes running, all my reads and writes with CL=1 should work, right?

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736  Mob: +972 54 8356490  Fax: +972 2 5612956

On Sun, Mar 18, 2012 at 9:57 AM, Watanabe Maki <watanabe.m...@gmail.com> wrote:

Because your RF is 1, you need all nodes up.

maki

On 2012/03/18, at 16:15, Tamar Fraenkel <ta...@tok-media.com> wrote:

Hi!

I have a 3-node Cassandra cluster. I use the Hector API. I give Hector one of the nodes' IP addresses and call setAutoDiscoverHosts(true) and setRunAutoDiscoveryAtStartup(true). describe on one node returns:

Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Durable Writes: true
Options: [replication_factor:1]

The odd thing is that when I take one of the nodes down, expecting all to continue running smoothly, I get exceptions of the format seen below, and no read or write succeeds. When I bring the node back up, the exceptions stop and reads and writes resume. Any idea or explanation why this is the case?

Thanks!

me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:66)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:285)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:268)
        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:246)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
        at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
        at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
        at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
        at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
        at me.prettyprint.cassandra.service.ColumnSliceIterator.hasNext(ColumnSliceIterator.java:60)

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736  Mob: +972 54 8356490  Fax: +972 2 5612956
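Tamar's conclusion above can be checked with a toy ring model (hypothetical, evenly spaced nodes and SimpleStrategy-style placement): with RF=2 on three nodes, every key has two replicas, so any single node failure still leaves at least one live replica for CL=ONE.

```python
# Toy check: RF=2 on a 3-node ring means no single node failure can leave a
# key without a live replica, so CL=ONE reads and writes keep succeeding.
NODES = [0, 1, 2]
DOWN = {1}  # any single node taken down

def replicas(key_hash: int, rf: int):
    """SimpleStrategy-style placement: primary node plus the next rf-1 on the ring."""
    primary = key_hash % len(NODES)
    return {(primary + i) % len(NODES) for i in range(rf)}

all_served = all(replicas(k, 2) - DOWN for k in range(3000))
print(all_served)  # True: every key still has a live replica
```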
Re: Lots and Lots of CompactionReducer Threads
With the exception of a few little warnings on start-up about the Memtable live ratio, there is nothing at WARN or above in the logs. Just before the JVM terminates, there are about 10,000 threads in Reducer executor pools that look like this in JConsole…

Name: CompactionReducer:1
State: TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@72938aea
Total blocked: 0  Total waited: 1

Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)

The results from tpstats don't look too interesting…

Pool Name              Active  Pending  Completed  Blocked  All time blocked
ReadStage                   0        0    3455159        0                 0
RequestResponseStage        0        0   10133276        0                 0
MutationStage               0        0    5898833        0                 0
ReadRepairStage             0        0    2078449        0                 0
ReplicateOnWriteStage       0        0          0        0                 0
GossipStage                 0        0     236388        0                 0
AntiEntropyStage            0        0          0        0                 0
MigrationStage              0        0          0        0                 0
MemtablePostFlusher         0        0        231        0                 0
StreamStage                 0        0          0        0                 0
FlushWriter                 0        0        231        0                 0
MiscStage                   0        0          0        0                 0
InternalResponseStage       0        0          0        0                 0
HintedHandoff               0        0         35        0                 0

Message type      Dropped
RANGE_SLICE             0
READ_REPAIR             0
BINARY                  0
READ                    0
MUTATION                0
REQUEST_RESPONSE        0

The results from info seem unremarkable as well…

Token            : 15312706500
Gossip active    : true
Load             : 5.6 GB
Generation No    : 1325995515
Uptime (seconds) : 67199
Heap Memory (MB) : 970.32 / 1968.00
Data Center      : datacenter1
Rack             : rack1
Exceptions       : 0

I'm using LeveledCompactionStrategy with no throttling, and I'm not changing the default on the number of concurrent compactors. What is interesting to me here is that Cassandra creates an executor for every single compaction in ParallelCompactionIterable. Why couldn't we just create a pool with Runtime.availableProcessors() threads and be done with it? Let me know if I left any info out. Thanks!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: aaron morton <aa...@thelastpickle.com>
Reply-To: user@cassandra.apache.org
Date: Sun, 8 Jan 2012 16:51:50 -0500
To: user@cassandra.apache.org
Subject: Re: Lots and Lots of CompactionReducer Threads

How many threads? Any errors in the server logs? What do nodetool tpstats and nodetool compactionstats say? Did you change compaction_strategy for the CFs? By default Cassandra will use as many compaction threads as you have cores; see concurrent_compactors in cassandra.yaml. Have you set the JVM heap settings? What does nodetool info show?

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/01/2012, at 3:51 PM, Caleb Rackliffe wrote:

Hi Everybody,

JConsole tells me I've got CompactionReducer threads stacking up, consuming memory, and never going away. Eventually, my Java process fails because it can't allocate any more native threads. Here's my setup…

Cassandra 1.0.5 on CentOS 6.0
4 GB of RAM
50 GB SSD HD
Memtable flush threshold = 128 MB
compaction throughput limit = 16 MB/sec
Multithreaded compaction = true

It may very well be that I'm doing something strange here, but it seems like those compaction threads should go away eventually. I'm hoping the combination of a low Memtable
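Caleb's suggestion above — one shared pool sized to the processor count instead of a fresh executor per compaction — can be illustrated generically. This is a Python stand-in for the design idea, not Cassandra's actual Java code in ParallelCompactionIterable; `compact` is a made-up placeholder for real compaction work:

```python
# Sketch: submit many "compactions" to a single pool bounded by CPU count,
# so the worker-thread population stays fixed no matter how many tasks
# arrive, instead of spinning up (and leaking) an executor per task.
import os
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=os.cpu_count())  # one shared, bounded pool

def compact(task_id: int) -> int:
    return task_id * 2  # stand-in for real compaction work

# 10,000 tasks reuse the same bounded worker set:
results = list(pool.map(compact, range(10000)))
pool.shutdown()
print(len(results))
```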
Re: Lots and Lots of CompactionReducer Threads
After some searching, I think I may have found something in the code itself, and so I've filed a bug report - https://issues.apache.org/jira/browse/CASSANDRA-3711

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Reply-To: user@cassandra.apache.org
Date: Sun, 8 Jan 2012 17:48:59 -0500
To: user@cassandra.apache.org
Cc: aa...@thelastpickle.com
Subject: Re: Lots and Lots of CompactionReducer Threads
Lots and Lots of CompactionReducer Threads
Hi Everybody,

JConsole tells me I've got CompactionReducer threads stacking up, consuming memory, and never going away. Eventually, my Java process fails because it can't allocate any more native threads. Here's my setup…

Cassandra 1.0.5 on CentOS 6.0
4 GB of RAM
50 GB SSD HD
Memtable flush threshold = 128 MB
compaction throughput limit = 16 MB/sec
Multithreaded compaction = true

It may very well be that I'm doing something strange here, but it seems like those compaction threads should go away eventually. I'm hoping the combination of a low Memtable flush threshold, low compaction T/P limit, and heavy write load doesn't mean those threads are hanging around because they're actually not done doing their compaction tasks.

Thanks,

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
OutOfMemory Errors with Cassandra 1.0.5
Hi Everybody,

I have a 10-node cluster running 1.0.5. The hardware/configuration for each box looks like this:

Hardware: 4 GB RAM, 400 GB SATAII HD for commitlog, 50 GB SATAIII SSD for data directory, 1 GB SSD swap partition
OS: CentOS 6, vm.swappiness = 0
Cassandra: disk access mode = standard, max memtable size = 128 MB, max new heap = 800 MB, max heap = 2 GB, stack size = 128k

I explicitly didn't put JNA on the classpath because I had a hard time figuring out how much native memory it would actually need. After a node runs for a couple of days, my swap partition is almost completely full, and even though the resident size of my Java process is right under 3 GB, I get this sequence in the logs, with death coming on a failure to allocate another thread…

WARN [pool-1-thread-1] 2012-01-05 09:06:38,078 Memtable.java (line 174) setting live ratio to maximum of 64 instead of 65.58206914005034
WARN [pool-1-thread-1] 2012-01-05 09:08:14,405 Memtable.java (line 174) setting live ratio to maximum of 64 instead of 1379.0945945945946
WARN [ScheduledTasks:1] 2012-01-05 09:08:31,593 GCInspector.java (line 146) Heap is 0.7523060581548427 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2012-01-05 09:08:31,611 StorageService.java (line 2535) Flushing CFS(Keyspace='Users', ColumnFamily='CounterCF') to relieve memory pressure
WARN [pool-1-thread-1] 2012-01-05 13:45:29,934 Memtable.java (line 169) setting live ratio to minimum of 1.0 instead of 0.004297106677189052
WARN [pool-1-thread-1] 2012-01-06 02:23:18,175 Memtable.java (line 169) setting live ratio to minimum of 1.0 instead of 0.0018187309961539236
WARN [ScheduledTasks:1] 2012-01-06 06:10:05,202 GCInspector.java (line 146) Heap is 0.7635993298476305 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2012-01-06 06:10:05,203 StorageService.java (line 2535) Flushing CFS(Keyspace='Users', ColumnFamily='CounterCF') to relieve memory pressure
WARN [ScheduledTasks:1] 2012-01-06 14:59:49,588 GCInspector.java (line 146) Heap is 0.7617639564886326 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2012-01-06 14:59:49,612 StorageService.java (line 2535) Flushing CFS(Keyspace='Users', ColumnFamily='CounterCF') to relieve memory pressure
ERROR [CompactionExecutor:6880] 2012-01-06 19:45:49,336 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[CompactionExecutor:6880,1,main]
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:691)
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:943)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1325)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132)
	at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Reducer.getCompactedRow(ParallelCompactionIterable.java:190)
	at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Reducer.getReduced(ParallelCompactionIterable.java:164)
	at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Reducer.getReduced(ParallelCompactionIterable.java:144)
	at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116)
	at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
	at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:172)
	at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:57)
	at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:134)
	at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:114)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)

Has anybody seen this sort of problem before? Thanks to anyone who takes a look. I can provide more information than this, but I figure that's enough to start…

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
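A back-of-the-envelope way to see why thread creation, not heap, is the breaking point: each Java thread reserves its stack in native memory, outside -Xmx. Using the figures reported in this thread (-Xss128k from the node config, ~10,000 leaked reducer threads per JConsole; both carried over from the messages as assumptions, not measurements):

```java
import java.util.Locale;

public class ThreadStackEstimate {
    public static void main(String[] args) {
        // Each thread's stack is native memory; 10,000 threads at 128 KB each
        // is on the order of a gigabyte before the heap is even counted.
        long stackBytes = 128L * 1024;   // -Xss128k
        long threads = 10_000;           // leaked CompactionReducer threads
        double gb = (threads * stackBytes) / (1024.0 * 1024 * 1024);
        System.out.printf(Locale.ROOT, "~%.2f GB of native stack reservations%n", gb);
    }
}
```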
Re: OutOfMemory Errors with Cassandra 1.0.5
One other item…

java -version
java version "1.7.0_01"
Java(TM) SE Runtime Environment (build 1.7.0_01-b08)
Java HotSpot(TM) 64-Bit Server VM (build 21.1-b02, mixed mode)

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Reply-To: user@cassandra.apache.org
Date: Fri, 6 Jan 2012 15:28:30 -0500
To: user@cassandra.apache.org
Subject: OutOfMemory Errors with Cassandra 1.0.5

Hi Everybody,

I have a 10-node cluster running 1.0.5. The hardware/configuration for each box looks like this:

Hardware: 4 GB RAM, 400 GB SATAII HD for commitlog, 50 GB SATAIII SSD for data directory, 1 GB SSD swap partition
OS: CentOS 6, vm.swappiness = 0
Cassandra: disk access mode = standard, max memtable size = 128 MB, max new heap = 800 MB, max heap = 2 GB, stack size = 128k

I explicitly didn't put JNA on the classpath because I had a hard time figuring out how much native memory it would actually need. After a node runs for a couple of days, my swap partition is almost completely full, and even though the resident size of my Java process is right under 3 GB, I get this sequence in the logs, with death coming on a failure to allocate another thread…

WARN [pool-1-thread-1] 2012-01-05 09:06:38,078 Memtable.java (line 174) setting live ratio to maximum of 64 instead of 65.58206914005034
WARN [pool-1-thread-1] 2012-01-05 09:08:14,405 Memtable.java (line 174) setting live ratio to maximum of 64 instead of 1379.0945945945946
WARN [ScheduledTasks:1] 2012-01-05 09:08:31,593 GCInspector.java (line 146) Heap is 0.7523060581548427 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2012-01-05 09:08:31,611 StorageService.java (line 2535) Flushing CFS(Keyspace='Users', ColumnFamily='CounterCF') to relieve memory pressure
WARN [pool-1-thread-1] 2012-01-05 13:45:29,934 Memtable.java (line 169) setting live ratio to minimum of 1.0 instead of 0.004297106677189052
WARN [pool-1-thread-1] 2012-01-06 02:23:18,175 Memtable.java (line 169) setting live ratio to minimum of 1.0 instead of 0.0018187309961539236
WARN [ScheduledTasks:1] 2012-01-06 06:10:05,202 GCInspector.java (line 146) Heap is 0.7635993298476305 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2012-01-06 06:10:05,203 StorageService.java (line 2535) Flushing CFS(Keyspace='Users', ColumnFamily='CounterCF') to relieve memory pressure
WARN [ScheduledTasks:1] 2012-01-06 14:59:49,588 GCInspector.java (line 146) Heap is 0.7617639564886326 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2012-01-06 14:59:49,612 StorageService.java (line 2535) Flushing CFS(Keyspace='Users', ColumnFamily='CounterCF') to relieve memory pressure
ERROR [CompactionExecutor:6880] 2012-01-06 19:45:49,336 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[CompactionExecutor:6880,1,main]
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:691)
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:943)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1325)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132)
	at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Reducer.getCompactedRow(ParallelCompactionIterable.java:190)
	at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Reducer.getReduced(ParallelCompactionIterable.java:164)
	at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Reducer.getReduced(ParallelCompactionIterable.java:144)
	at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116)
	at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
Re: OutOfMemory Errors with Cassandra 1.0.5
I saw this article - http://comments.gmane.org/gmane.comp.db.cassandra.user/2225

I'm using the Hector client (for connection pooling), with ~3200 threads active according to JConsole.

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Date: Fri, 6 Jan 2012 15:40:26 -0500
To: user@cassandra.apache.org
Subject: Re: OutOfMemory Errors with Cassandra 1.0.5

One other item…

java -version
java version "1.7.0_01"
Java(TM) SE Runtime Environment (build 1.7.0_01-b08)
Java HotSpot(TM) 64-Bit Server VM (build 21.1-b02, mixed mode)

From: Caleb Rackliffe <ca...@steelhouse.com>
Reply-To: user@cassandra.apache.org
Date: Fri, 6 Jan 2012 15:28:30 -0500
To: user@cassandra.apache.org
Subject: OutOfMemory Errors with Cassandra 1.0.5
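If thread exhaustion is the suspicion, the same live-thread count JConsole reports can be polled from inside the process via the standard JMX bean, which is handy for alerting before the native-thread ceiling is hit. A small sketch (thresholds and alerting left out):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCount {
    public static void main(String[] args) {
        // ThreadMXBean exposes the same live-thread figure JConsole displays.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + threads.getThreadCount());
    }
}
```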
Re: OutOfMemory Errors with Cassandra 1.0.5 (fixed)
Okay, it looks like I was slightly underestimating the number of connections open on the cluster. This probably won't be a problem after I tighten up the Hector pool maximums. Sorry for the spam…

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Caleb Rackliffe <ca...@steelhouse.com>
Reply-To: user@cassandra.apache.org
Date: Fri, 6 Jan 2012 20:13:37 -0500
To: user@cassandra.apache.org
Subject: Re: OutOfMemory Errors with Cassandra 1.0.5

I saw this article - http://comments.gmane.org/gmane.comp.db.cassandra.user/2225

I'm using the Hector client (for connection pooling), with ~3200 threads active according to JConsole.
Memtable live ratio of infinity
Hi All,

I saw the following log message today on a node running Cassandra 1.0.5:

WARN [pool-1-thread-1] 2011-12-15 20:28:53,915 Memtable.java (line 174) setting live ratio to maximum of 64 instead of Infinity

I guess this means calculated throughput is either very low or the Memtable is huge. Either way, the following line in Memtable looks a bit weird:

double newRatio = (double) deepSize / currentThroughput.get();

Has anyone seen this warning before?

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
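For what it's worth, an Infinity here doesn't require a huge Memtable: in Java, floating-point division by zero quietly yields Infinity rather than throwing, so a throughput counter still at zero would produce exactly this warning. A minimal sketch with hypothetical values:

```java
public class LiveRatioInfinity {
    public static void main(String[] args) {
        // If no write throughput has been measured yet, the divisor is zero,
        // and double division by zero evaluates to Infinity.
        double deepSize = 1048576.0;   // hypothetical in-memory size in bytes
        double throughput = 0.0;       // nothing counted yet
        double newRatio = deepSize / throughput;
        System.out.println(newRatio);  // prints "Infinity"
    }
}
```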
Cannot Start Cassandra 1.0.5 with JNA on the CLASSPATH
Hi All,

I'm trying to start up Cassandra 1.0.5 on a CentOS 6 machine. I installed JNA through yum and made a symbolic link to jna.jar in my Cassandra lib directory. When I run bin/cassandra -f, I get the following:

INFO 09:14:31,552 Logging initialized
INFO 09:14:31,555 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_29
INFO 09:14:31,555 Heap size: 3405774848/3405774848
INFO 09:14:31,555 Classpath: bin/../conf:bin/../build/classes/main:bin/../build/classes/thrift:bin/../lib/antlr-3.2.jar:bin/../lib/apache-cassandra-1.0.5.jar:bin/../lib/apache-cassandra-clientutil-1.0.5.jar:bin/../lib/apache-cassandra-thrift-1.0.5.jar:bin/../lib/avro-1.4.0-fixes.jar:bin/../lib/avro-1.4.0-sources-fixes.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/compress-lzf-0.8.4.jar:bin/../lib/concurrentlinkedhashmap-lru-1.2.jar:bin/../lib/guava-r08.jar:bin/../lib/high-scale-lib-1.1.2.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jamm-0.2.5.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-0.6.jar:bin/../lib/log4j-1.2.16.jar:bin/../lib/servlet-api-2.5-20081211.jar:bin/../lib/slf4j-api-1.6.1.jar:bin/../lib/slf4j-log4j12-1.6.1.jar:bin/../lib/snakeyaml-1.6.jar:bin/../lib/snappy-java-1.0.4.1.jar:bin/../lib/jamm-0.2.5.jar
Killed

If I remove the symlink to JNA, it starts up just fine. Also, I do have entries in my limits.conf for JNA:

root soft memlock unlimited
root hard memlock unlimited

Has anyone else seen this behavior?

Thanks,

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
Causes of a High Memtable Live Ratio
Hi All,

From what I've read in the source, a Memtable's live ratio is the ratio of Memtable memory usage to the current write throughput. If this is too high, I imagine the system could be in a possibly unsafe state, as the comment in Memtable.java indicates. Today, while bulk loading some data, I got the following message:

WARN [pool-1-thread-1] 2011-11-18 21:08:57,331 Memtable.java (line 172) setting live ratio to maximum of 64 instead of 78.87903667214012

Should I be worried? If so, does anybody have any suggestions for how to address it?

Thanks :)

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
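The ratio and the cap of 64 visible in the warning can be sketched in a few lines. This mirrors the clamping the log message describes, not Cassandra's actual Memtable code, and the byte counts are hypothetical:

```java
public class LiveRatioClamp {
    public static void main(String[] args) {
        // Live ratio ~= deep (in-memory) size / serialized write throughput.
        double deepSize = 78879.0;     // hypothetical resident bytes
        double throughput = 1000.0;    // hypothetical bytes counted at write time
        double ratio = deepSize / throughput;   // ~78.88, like the warning's value

        // The warnings show the computed value being clamped into [1.0, 64.0].
        double clamped = Math.min(64.0, Math.max(1.0, ratio));
        System.out.println(clamped);   // prints 64.0
    }
}
```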
Re: super sub slice query?
I had the same question you did, I think. Below is as far as I got with Hector…

I have a column family of super-columns with long names. The columns in each super-column also have long names. I'm using Hector, and what I want to do is get the last column in each super-column, for a range of super-columns. I was able to get the last column in a column family like this…

Cluster cluster = HFactory.getOrCreateCluster("Cortex", config);
Keyspace keyspace = HFactory.createKeyspace("Products", cluster);
RangeSlicesQuery<String, String, String> rangeSlicesQuery = HFactory.createRangeSlicesQuery(keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily("Attributes");
rangeSlicesQuery.setKeys("id0", "id0");
rangeSlicesQuery.setRange("", "", true, 1);
QueryResult<OrderedRows<String, String, String>> result = rangeSlicesQuery.execute();

…but no luck with the additional dimension.

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com

From: Guy Incognito <dnd1...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Thu, 27 Oct 2011 06:34:08 -0400
To: user@cassandra.apache.org
Subject: super sub slice query?

is there such a thing? a query that runs against a SC family and returns a subset of subcolumns from a set of super-columns? is there a way to have e.g. a slice query (or super slice query) only return the column names, rather than the value as well?
Reading Last Values From a SuperColumn
Hi Everybody,

I have a column family of super-columns with long names. The columns in each super-column also have long names. I'm using Hector, and what I want to do is get the last column in each super-column, for a range of super-columns. I was able to get the last column in a column family like this…

Cluster cluster = HFactory.getOrCreateCluster("Cortex", config);
Keyspace keyspace = HFactory.createKeyspace("Products", cluster);
RangeSlicesQuery<String, String, String> rangeSlicesQuery = HFactory.createRangeSlicesQuery(keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily("Attributes");
rangeSlicesQuery.setKeys("id0", "id0");
rangeSlicesQuery.setRange("", "", true, 1);
QueryResult<OrderedRows<String, String, String>> result = rangeSlicesQuery.execute();

…but no luck with the additional dimension.

Thanks in advance!

Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com