Re: Lucene index plugin for Apache Cassandra
Unfortunately, we haven't published any benchmarks yet, but we plan to do so as soon as possible. However, you can expect behavior similar to that of Elasticsearch or Solr, with some overhead due to the need to index both Cassandra's row key and the partition's token. You can also take a look at this presentation http://planetcassandra.org/video-presentations/vp/cassandra-summit-europe-2014/vd/stratio-advanced-search-and-top-k-queries-in-cassandra/ to see how cluster distribution is done.

2015-06-12 0:45 GMT+02:00 Ben Bromhead b...@instaclustr.com:

Looks awesome, do you have any examples/benchmarks of using these indexes for various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?

On 10 June 2015 at 09:08, Andres de la Peña adelap...@stratio.com wrote:

Hi all,

With the release of Cassandra 2.1.6, Stratio is glad to present its open source Lucene-based implementation of C* secondary indexes https://github.com/Stratio/cassandra-lucene-index as a plugin that can be attached to Apache Cassandra. Before this release, the Lucene index was distributed inside a fork of Apache Cassandra, with all the difficulties that implies. As of now, the fork is discontinued and new users should use the recently created plugin, which maintains all the features of Stratio Cassandra https://github.com/Stratio/stratio-cassandra.

Stratio's Lucene index extends Cassandra’s functionality to provide near real-time distributed search engine capabilities like those of ElasticSearch or Solr, including full text search, free multivariable search, relevance queries and field-based sorting. Each node indexes its own data, so high availability and scalability are guaranteed.

We hope this will be useful to the Apache Cassandra community.

Regards,
--
Andrés de la Peña
http://www.stratio.com/
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd https://twitter.com/StratioBD

--
Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr http://twitter.com/instaclustr | (650) 284 9692
Question about nodetool status ... output
Hi,

I have one node in my 5-node cluster that effectively owns 100% and it looks like my cluster is rather imbalanced. Is it common to have it this imbalanced for 4-5 nodes? My current output for a keyspace is:

$ nodetool status myks
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  X.X.X.33  203.92 GB  256     41.3%             871968c9-1d6b-4f06-ba90-8b3a8d92dcf0  RAC1
UN  X.X.X.32  200.44 GB  256     34.2%             d7cacd89-8613-4de5-8a5e-a2c53c41ea45  RAC1
UN  X.X.X.51  197.17 GB  256     100.0%            344b0adf-2b5d-47c8-8881-9a3f56be6f3b  RAC1
UN  X.X.X.52  113.63 GB  1       46.3%             55daa807-af49-44c5-9742-fe456df621a1  RAC1
UN  X.X.X.31  204.49 GB  256     78.3%             48cb0782-6c9a-4805-9330-38e192b6b680  RAC1

My keyspace has RF=3 and originally I added X.X.X.52 (num_tokens=1 was a mistake) and then X.X.X.51. I haven't executed `nodetool cleanup` on any nodes yet.

For the curious, the full ring can be found here: https://gist.github.com/JensRantil/57ee515e647e2f154779

Cheers,
Jens

--
Jens Rantil
Backend engineer
Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook https://www.facebook.com/#!/tink.se
Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
Twitter https://twitter.com/tink
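A quick sanity check on the numbers in this message: with RF=3, "Owns (effective)" counts replicas, so the column should add up to roughly 300% across the cluster rather than 100%. The sketch below sums the column and compares each node with the even share of RF * 100 / N. The `owns` values are simply transcribed from the output above; the helper name `ownership_report` is made up for illustration.

```python
# "Owns (effective)" values transcribed from the nodetool output above.
owns = {
    "X.X.X.33": 41.3,
    "X.X.X.32": 34.2,
    "X.X.X.51": 100.0,
    "X.X.X.52": 46.3,
    "X.X.X.31": 78.3,
}
REPLICATION_FACTOR = 3  # RF=3, as stated in the message


def ownership_report(owns, rf):
    """Return (total, even_share, deviations) for an effective-ownership column."""
    total = sum(owns.values())
    # With rf replicas of every row, a perfectly balanced node owns rf * 100 / N percent.
    even_share = rf * 100.0 / len(owns)
    deviations = {node: round(pct - even_share, 1) for node, pct in owns.items()}
    return total, even_share, deviations


total, even_share, deviations = ownership_report(owns, REPLICATION_FACTOR)
print(f"total effective ownership: {total:.1f}%")
print(f"even share per node: {even_share:.1f}%")
for node, delta in sorted(deviations.items(), key=lambda kv: kv[1]):
    print(f"{node}: {delta:+.1f} points from even")
```

The total comes to about 300%, so the column itself is consistent with RF=3; what stands out is how unevenly it is split around the 60% even share.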
Re: Question about nodetool status ... output
Your data model also contributes to the balance (or lack of it) of the cluster. If your data partitioning is really bad, Cassandra will not do any magic.

Regarding that cluster, I would decommission the x.52 node and add it again with the correct configuration. After the bootstrap, run a cleanup. If it is still that off-balance, you need to look into your data model.

Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: http://linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com
Question regarding concurrent bootstrapping
Hi,

Let's say I have an existing cluster and do the following:

1. I start a new joining node (A). It enters state Up/Joining. Streaming automatically starts to this node.
2. I wait two minutes (best practice for bootstrapping).
3. I start a second node (B) to join the cluster. It allocates some of A's previous parts of the ring and enters state Up/Joining. Streaming automatically starts to this node.

Will streaming of data that A is no longer responsible for (after B joined) stop immediately? That is, after (3), will the data streamed to A be only what it is responsible for?

This is important for planning when one is expanding a cluster with multiple smaller nodes.

Thanks,
Jens

--
Jens Rantil
Backend engineer
Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook https://www.facebook.com/#!/tink.se
Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
Twitter https://twitter.com/tink
Re: Question about nodetool status ... output
Hi Carlos,

Yes, I should have been more specific about that; basically all my primary IDs are random UUIDs, so I find it very hard to believe that my data model is the problem here.

I will run a full repair of the cluster, execute a cleanup and recommission the node, then.

Thanks,
Jens

--
Jens Rantil
Backend engineer
Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook https://www.facebook.com/#!/tink.se
Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
Twitter https://twitter.com/tink
Re: Lucene index plugin for Apache Cassandra
Seems like an interesting tool! What operational recommendations would you make to users of this tool (extra hardware capacity, extra metrics to monitor, etc.)?

Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: http://linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com
Re: Atomic behavior and efficiency of a DELETE query with an IN clause
Similarly, should we send multiple SELECT requests or a single one with a SELECT...IN?

On Wednesday, June 10, 2015 11:27 AM, Sotirios Delimanolis sotodel...@yahoo.com wrote:

Will this "eventually they will all go through" behavior apply to the IN? How is this query written to the commitlog? Do you mean prepare a query like

DELETE FROM MastersOfTheUniverse WHERE mastersID = ?;

and execute it asynchronously 3000 times, or add 3000 of these DELETE (bound) prepared statements to a BATCH statement executed asynchronously?

On Wednesday, June 10, 2015 9:51 AM, Jonathan Haddad j...@jonhaddad.com wrote:

Batches don't work like that. It's possible for some to succeed, and later, the rest will. Atomic is the incorrect word to use, it's more like eventually they will all go through. Do not use IN(), use a whole bunch of prepared statements asynchronously.

On Wed, Jun 10, 2015 at 9:26 AM Sotirios Delimanolis sotodel...@yahoo.com wrote:

Hi,

When executing a DELETE statement with an IN clause, where the list contains partition keys, what is the underlying behaviour with regards to atomicity?

DELETE FROM MastersOfTheUniverse WHERE mastersID IN ('Man-At-Arms', 'Teela');

Is it going to act like an atomic batch where if one fails, all fail? If that is the case, is there any reason to use a BATCH statement with multiple single DELETE statements, or should we always prefer a DELETE with an IN clause?

For example, given 3000 keys for rows I want to delete, should I issue a single DELETE query and provide all the keys in the IN argument, or should I add 3000 DELETE queries to a BATCH statement?

Thank you,
Sotirios
Re: Atomic behavior and efficiency of a DELETE query with an IN clause
Multiple async requests. IN() is a performance nightmare unless you're querying against a single partition key.
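The "prepare once, execute asynchronously per key" approach recommended in this thread might look roughly like the sketch below. It assumes a session object with `prepare` and `execute_async` methods as offered by the DataStax Python driver's `Session` (the thread does not say which driver is in use); `delete_partitions`, `_drain` and `max_in_flight` are names invented here. The in-flight window is a simple way to avoid queueing all 3000 requests at once.

```python
def _drain(in_flight):
    """Wait for every pending future; return the keys whose delete failed."""
    failed = []
    for key, future in in_flight:
        try:
            future.result()  # blocks until the request completes, raises on error
        except Exception:
            failed.append(key)
    in_flight.clear()
    return failed


def delete_partitions(session, keys, max_in_flight=128):
    """Delete one partition per key with async single-partition DELETEs.

    `session` is assumed to behave like a connected driver session.
    Returns the keys that failed so the caller can retry them.
    """
    prepared = session.prepare(
        "DELETE FROM MastersOfTheUniverse WHERE mastersID = ?"
    )
    failed = []
    in_flight = []
    for key in keys:
        in_flight.append((key, session.execute_async(prepared, (key,))))
        if len(in_flight) >= max_in_flight:
            failed.extend(_drain(in_flight))
    failed.extend(_drain(in_flight))
    return failed
```

Each request succeeds or fails on its own, which matches the "no atomicity across partitions" point made in the thread. Because a DELETE is idempotent, the returned keys can simply be submitted again.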
Re: Question regarding concurrent bootstrapping
On Fri, Jun 12, 2015 at 5:21 AM, Jens Rantil jens.ran...@tink.se wrote:

Let's say I have an existing cluster and do the following: 1. I start a new joining node (A). It enters state Up/Joining. Streaming automatically start to this node. 2. I wait two minutes (best practise for bootstrapping). 3. I start a second node (B) to join the cluster. It allocates some of A:s previous parts of the ring and enters state Up/Joining. Streaming automatically starts to this node. Will streaming of data that A is no longer responsible (after B joined) stop immediately? That is, after (3), will data streamed to A only be what it is responsible of?

It depends on the version of Cassandra.

A will get data it shouldn't get in any version that doesn't contain the CASSANDRA-2434 patch, and that data stays on A if you do not run cleanup on A when A is done bootstrapping.

In a version containing 2434, the attempt to bootstrap B will fail and will not work until A is done bootstrapping, unless you set the property -Dcassandra.consistent.rangemovement=false while starting it.

In general, one DOES NOT WANT TO SET -Dcassandra.consistent.rangemovement=false! Consistent range movement is what fixes 2434, and 2434 is bad for consistency.

Instead, consider expanding clusters to their initial size when they are empty, and disabling bootstrapping while doing so.

Lots and lots of background on: https://issues.apache.org/jira/browse/CASSANDRA-2434
Related ticket: https://issues.apache.org/jira/browse/CASSANDRA-7069

=Rob

PS - BTW, the fact that 2434 existed for so long, in versions where repair was often broken/unused, is the strongest single item of information in support of the Coli Conjecture...
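When scripting a cluster expansion on a version with consistent range movement, one way to respect the "one bootstrap at a time" behaviour described here is to start the next node only once no node reports state Joining. The sketch below only does the parsing part, assuming the `nodetool status` layout shown earlier in this digest. The function name is made up, and the sample text (including the UJ row and its values) is invented purely for illustration.

```python
import re

# Status lines start with a two-letter code: U/D (up/down) followed by
# N/L/J/M (normal/leaving/joining/moving), then the node address.
_STATUS_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def joining_nodes(status_output):
    """Return the addresses of nodes whose state is Joining."""
    joining = []
    for line in status_output.splitlines():
        match = _STATUS_LINE.match(line.strip())
        if match and match.group(2) == "J":
            joining.append(match.group(3))
    return joining


# Hypothetical output: the UJ row is made up for this example.
sample = """\
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID  Rack
UN  X.X.X.33  203.92 GB  256     41.3%             871968c9-1d6b-4f06-ba90-8b3a8d92dcf0  RAC1
UJ  X.X.X.51  12.10 GB   256     ?                 344b0adf-2b5d-47c8-8881-9a3f56be6f3b  RAC1
"""
print(joining_nodes(sample))
```

In a real script the text would come from running `nodetool status`, for example via `subprocess.run(["nodetool", "status"], capture_output=True, text=True).stdout`, polled until the returned list is empty.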
Dropped mutation messages
I am preparing to migrate a large amount of data to Cassandra. In order to test my migration code, I’ve been doing some dry runs to a test cluster. My test cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a weird combination, but my production cluster that will eventually receive this data is RF=3. I am running with RF=1 so it’s faster while I work out the kinks in the migration.

There are a few things that have puzzled me, after writing several tens of millions of records to my test cluster.

My main concern is that I have a few tens of thousands of dropped mutation messages. I’m overloading my cluster. I never have more than about 10% CPU utilization (even my I/O wait is negligible). A curious thing about that is that the driver hasn’t thrown any exceptions, even though mutations have been dropped. I’ve seen dropped mutation messages on my production cluster, but like this, I’ve never gotten errors back from the client. I had always assumed that one node dropped mutation messages, but the other two did not, and so quorum was satisfied. With RF=1, I don’t understand how mutation messages are being dropped and the client doesn’t tell me about it. Does this mean my cluster is missing data, and I have no idea?

Each node has a couple dozen all-time blocked FlushWriters. Is that bad?

I have around 100 dropped counter mutations, which is very weird because I don’t write any counters. I have counters in my schema for tracking view counts, but the migration code doesn’t write them. How could I get dropped counter mutation messages when I don’t modify them?

Any insights would be appreciated. Thanks in advance.

Robert
Re: Cassandra 2.2, 3.0, and beyond
On Thu, Jun 11, 2015 at 6:56 PM, Mohammed Guller moham...@glassbeam.com wrote:

By that logic, 2.1.0 should have been somewhat as stable as 2.0.10 (the last release of 2.0.x branch before 2.1.0). However, we found out that it took almost 9 months for 2.1.x series to become stable and suitable for production. Going by past history, I am worried that it may take the same time for 2.2 to become stable.

The instability of initial point releases is a significant part of the motivation for the new release cadence.[1] If new versions continued to take just as long to be production ready, the new process would have failed at one of its major goals...

For the record, I agree with the reasoning in the linked post and am cautiously optimistic about the effect it will have on the stability of released versions. :D

=Rob

[1] http://mail-archives.apache.org/mod_mbox/cassandra-dev/201503.mbox/%3CCALdd-zjAyiTbZksMeq2LxGwLF5LPhoi_4vsjy8JBHBRnsxH=8...@mail.gmail.com%3E

Unfortunately, even after DataStax hired half a dozen full-time test engineers, 2.1.0 continued the proud tradition of being unready for production use, with “wait for .5 before upgrading” once again looking like a good guideline.

I’m starting to think that the entire model of “write a bunch of new features all at once and then try to stabilize it for release” is broken. We’ve been trying that for years and empirically speaking the evidence is that it just doesn’t work, either from a stability standpoint or even just shipping on time.

...

So, I’d like to try something different. I think we were on the right track with shorter releases with more compatibility. But I’d like to throw in a twist. Intel cuts down on risk with a “tick-tock” schedule for new architectures and process shrinks instead of trying to do both at once. We can do something similar here:

One month releases. Period. If it’s not done, it can wait.
*Every other release only accepts bug fixes.*

By itself, one-month releases are going to dramatically reduce the complexity of testing and debugging new releases -- and bugs that do slip past us will only affect a smaller percentage of users, avoiding the “big release has a bunch of bugs no one has seen before and pretty much everyone is hit by something” scenario. ***But by adding in the second rule, I think we have a real chance to make a quantum leap here: stable, production-ready releases every two months.***

(*** emphasis mine)
Re: Dropped mutation messages
I meant to say I’m *not* overloading my cluster.
RE: Lucene index plugin for Apache Cassandra
The plugin looks cool. Thank you for open sourcing it. Does it support faceting and other Solr functionality?

Mohammed
Re: Lucene index plugin for Apache Cassandra
I really appreciate your interest Well, the first recommendation is to not use it unless you need it, because a properly Cassandra denormalized model is almost always preferable to indexing. Lucene indexing is a good option when there is no viable denormalization alternative. This is the case of range queries over multiple dimensions, full-text search or maybe complex boolean predicates. It's also appropriate for Spark/Hadoop jobs mapping a small fraction of the total amount of rows in a certain table, if you can pay the cost of indexing. Lucene indexes run inside C*, so users should closely monitor the amount of used memory. It's also a good idea to put the Lucene directory files in a separate disk to those used by C* itself. Additionally, you should consider that indexed tables write throughput will be appreciably reduced, maybe to a few thousands rows per second. It's really hard to estimate the amount of resources needed by the index due to the great variety of indexing and querying ways that Lucene offers, so the only thing we can suggest is to empirically find the optimal setup for your use case. 2015-06-12 12:00 GMT+02:00 Carlos Rolo r...@pythian.com: Seems like an interesting tool! What operational recommendations would you make to users of this tool (Extra hardware capacity, extra metrics to monitor, etc)? Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo http://linkedin.com/in/carlosjuzarterolo* Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649 www.pythian.com On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña adelap...@stratio.com wrote: Unfortunately, we don't have published any benchmarks yet, but we have plans to do it as soon as possible. However, you can expect a similar behavior as those of Elasticsearch or Solr, with some overhead due to the need for indexing both the Cassandra's row key and the partition's token. 
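The suggestion to find the optimal setup empirically can be approached with a small timing harness. What follows is only a minimal sketch: `measure_write_throughput` and the `write_row` callable are hypothetical names, and in a real test `write_row` would wrap whatever driver call inserts one row into the indexed table. Running the same harness against the table with and without the Lucene index gives a rough idea of the indexing cost.

```python
import time
from typing import Callable, Iterable


def measure_write_throughput(write_row: Callable[[dict], None],
                             rows: Iterable[dict]) -> float:
    """Write every row with write_row and return the rate in rows per second."""
    count = 0
    start = time.perf_counter()
    for row in rows:
        write_row(row)  # hypothetical: one insert against the table under test
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in writer that appends to a list instead of calling a database.
    sink = []
    sample_rows = ({"id": i, "body": "text %d" % i} for i in range(10000))
    rate = measure_write_throughput(sink.append, sample_rows)
    print("wrote %d rows at %.0f rows/s" % (len(sink), rate))
```

The harness measures only client-side wall-clock time, so memory use and disk activity on the nodes still need to be watched separately.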
RE: Support for ad-hoc query
I will note here that the limitations on ad-hoc querying (and aggregates) make it much more difficult to deal with data quality problems, QA testing, and similar efforts, especially where people are used to a more relational, ad-hoc model. We have often had to extract data from Cassandra to Hadoop for querying by Hive. Example: “We found a few records with incorrect data. How many more records like that are out there?”

Sean Durity

From: Peter Lin [mailto:wool...@gmail.com]
Sent: Wednesday, June 10, 2015 8:17 AM
To: user@cassandra.apache.org
Subject: Re: Support for ad-hoc query

I'll second Jack's detailed response and add that you really should do some discovery to figure out what kinds of queries you may need to support. It might not be possible, and often that is the case, but it's worthwhile to ask the end users what kind of reports they need to run. Allowing arbitrary ad-hoc queries is a known anti-pattern for Cassandra.

If the system needs to query multiple cf to derive/calculate some result, using Cassandra alone isn't going to do it. You'll need some other system to give you better query capabilities, like Hive.

If you need data warehouse like features, look at http://www.kylin.io/ . They are doing some interesting things.

peter

On Wed, Jun 10, 2015 at 7:58 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

Knowing your queries in advance is a hard-core requirement for effective deployment of Cassandra. Ad hoc queries are a very clear anti-pattern for Cassandra.

DSE Search does provide support for advanced, complex, and ad hoc queries. Stratio and TupleJump Stargate can also be used.

Back to the question of what you mean by ad hoc queries:

1. Do you expect real-time results, like sub-second, or are these long-running queries that might take seconds, 10 seconds or more, or even minutes to run?
2. Will they be very rare or quite frequent - how much load do you expect them to place on the cluster?
3. How complex do you expect them to be - how many clauses and operators?
4. What is their net cardinality - are they selecting just a few rows or many rows?
5. Do they have individual query clauses that select many rows even if the net combination of all select clauses is not so many rows?

The requirement to perform advanced, complex, and ad hoc queries using DSE Search or the other techniques will almost certainly require that you use moderately more capable hardware, especially more RAM, for each node, and probably more nodes as well to reduce the row count per node, since ad hoc queries will tend to be compute-intensive based on the number of rows on the node.

Yes, it can be done. No, it is not free or cheap. And, no, it does not come out of the box for a non-DSE Cassandra release. And, yes, you must address this requirement before deployment, not after deployment.

-- Jack Krupansky

On Wed, Jun 10, 2015 at 1:18 AM, Srinivasa T N seen...@gmail.com wrote:

Thanks guys for the inputs. By ad-hoc queries I mean that I don't know the queries at cf design time. The data may be from a single cf or multiple cf. (This feature may be required if I want to do analysis on the data stored in Cassandra; do you have any better ideas?)

Regards,
Seenu.

On Tue, Jun 9, 2015 at 5:57 PM, Peter Lin wool...@gmail.com wrote:

What do you mean by ad-hoc queries? Do you mean simple queries against a single column family, aka table? Or do you mean MDX style queries that look at multiple tables?

If it's MDX style queries, many people extract data from Cassandra into a data warehouse that supports multi-dimensional cubes. This works well when the extracted data is a small subset and fits neatly in a data warehouse. As others have stated, Cassandra isn't great at ad-hoc. For MDX style queries, Cassandra wasn't designed for it.

One thing we've done for our own project is to combine Solr with our own fuzzy index to make ad-hoc queries against a single table more friendly.

On Tue, Jun 9, 2015 at 2:38 AM, Srinivasa T N seen...@gmail.com wrote:

Hi All,

I have a web application running with my backend data stored in Cassandra. Now I want to do some analysis on the stored data, which requires some ad-hoc queries fired on Cassandra. How can I do the same?

Regards,
Seenu.
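The data-quality question quoted in this thread ("How many more records like that are out there?") is the kind of check that tends to run over an extract rather than against Cassandra itself. As a minimal sketch, assuming the rows have already been exported to CSV, the count reduces to a scan with a predicate. The column names, the sample rows and the `price_is_invalid` rule are all hypothetical, made up for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterable


def count_matching(rows: Iterable[Dict[str, str]],
                   is_bad: Callable[[Dict[str, str]], bool]) -> int:
    """Count the rows for which the predicate reports a problem."""
    return sum(1 for row in rows if is_bad(row))


def price_is_invalid(row: Dict[str, str]) -> bool:
    """Hypothetical rule: a price must parse as a non-negative number."""
    try:
        return float(row["price"]) < 0
    except ValueError:
        return True


# Hypothetical extract standing in for data exported from a Cassandra table.
EXTRACT = """id,store,price
1,0042,19.99
2,0042,-5.00
3,0107,
4,0107,7.50
"""

reader = csv.DictReader(io.StringIO(EXTRACT))
bad = count_matching(reader, price_is_invalid)
print("records with an invalid price:", bad)
```

On a real cluster the same predicate would more likely be expressed as a Hive or Spark job over the extracted data, as described in the messages above.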
Re: Support for ad-hoc query
No dispute about that. But the main design requirement Cassandra strives to meet is to be a blazing fast transactional database: here's the key, give me the data, and here's the key, write this data. Any additional query requirements are a distant second at best. A big part of that transactional speed requirement is achieved by jettisoning the overhead required for ad hoc queries.

I think it is inevitable that Cassandra will eventually address the requirement for ad hoc queries when it finally decides what it wants to be when it grows up (i.e., whether to just be a niche or to subsume all of SQL), but in the meantime DSE Search/Solr, Stratio, and TupleJump Stargate, as well as extraction and indexing in Elasticsearch, are moderately reasonable near-term solutions.

And I agree that having to fully model eventual (and evolving!) data requirements and emergent anomalous conditions upfront is too big a burden for many enterprises.

-- Jack Krupansky

On Fri, Jun 12, 2015 at 10:07 AM, sean_r_dur...@homedepot.com wrote:

I will note here that the limitations on ad-hoc querying (and aggregates) make it much more difficult to deal with data quality problems, QA testing, and similar efforts, especially where people are used to a more relational, ad-hoc model. We have often had to extract data from Cassandra to Hadoop for querying by hive. Example: “We found a few records with incorrect data. How many more records like that are out there?”

Sean Durity

From: Peter Lin [mailto:wool...@gmail.com]
Sent: Wednesday, June 10, 2015 8:17 AM
To: user@cassandra.apache.org
Subject: Re: Support for ad-hoc query

I'll second Jack's detailed response and add that you really should do some discovery to figure out what kinds of queries you may need to support. It might not be possible and often that is the case, but it's worth while to ask the end users what kind of reports they need to run. Allowing arbitrary ad-hoc queries is a known anti-pattern for cassandra.
My dse-spark app goes well with spark-submit, BUT GOT STUCK while executing by sbt run or java jar run on my win-pc
My dse-spark app runs fine with spark-submit, but it gets stuck when I execute it with sbt run or as a java jar on my Windows PC, which means the driver process is on a PC that is not a DSE cluster node. What frustrates me is that when I look through the logs I see no error; it just hangs there, and the stage progress always stays at 0/(some number bigger than 3000). How can I find the problem?
connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15
Hello,

We recently upgraded a cluster from 2.0.12 to 2.0.15, and now whenever we stop/kill a Cassandra process, some other nodes keep a connection with the dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.

So, if I start the killed node again, it cannot handshake with the nodes that still have a connection in the CLOSE_WAIT state until that connection is closed. They therefore see each other as down for 5-20 minutes, until they can handshake again.

I believe this is somehow related to the fixes for CASSANDRA-8336 and CASSANDRA-9238, and it could also be a duplicate of CASSANDRA-8072. I will continue to investigate to see if I find more evidence, but any help at this point would be appreciated, or at least a confirmation that it could be related to any of these tickets.

Cheers,

--
Paulo Motta
Chaordic | Platform
http://www.chaordic.com.br/
+55 48 3232.3200
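For anyone trying to reproduce this, one way to see which peers are holding the stale connections is to filter the socket table for CLOSE_WAIT entries on the storage port. The sketch below is only an illustration: it assumes the six-column text layout printed by Linux `netstat -tan`, the function name `close_wait_peers` is made up, and the sample addresses are invented.

```python
from collections import Counter


def close_wait_peers(netstat_output: str, port: int = 7000) -> Counter:
    """Count CLOSE_WAIT sockets per remote host where either end uses `port`.

    Assumes lines shaped like Linux `netstat -tan` output:
    proto recv-q send-q local-address foreign-address state
    """
    peers = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "CLOSE_WAIT":
            continue
        local_port = local.rsplit(":", 1)[-1]
        remote_host, remote_port = remote.rsplit(":", 1)
        if str(port) in (local_port, remote_port):
            peers[remote_host] += 1
    return peers


# Invented sample in the assumed layout; real input would come from netstat.
SAMPLE = """Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 10.0.0.1:7000           10.0.0.2:51234          CLOSE_WAIT
tcp        0      0 10.0.0.1:7000           10.0.0.3:40000          ESTABLISHED
tcp        1      0 10.0.0.1:45000          10.0.0.2:7000           CLOSE_WAIT
tcp        0      0 10.0.0.1:9042           10.0.0.9:33333          CLOSE_WAIT
"""

stale = close_wait_peers(SAMPLE)
for host, count in stale.items():
    print(host, count)
```

Running it on each surviving node right after killing a peer, and again every minute or so, would show how long the CLOSE_WAIT entries for the dead node actually linger.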