Re: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Andres de la Peña
Unfortunately, we haven't published any benchmarks yet, but we plan to
do so as soon as possible. However, you can expect behavior similar to that
of Elasticsearch or Solr, with some overhead due to the need to index both
Cassandra's row key and the partition's token.
You can also take a look at this presentation
http://planetcassandra.org/video-presentations/vp/cassandra-summit-europe-2014/vd/stratio-advanced-search-and-top-k-queries-in-cassandra/
to see how cluster distribution is done.

2015-06-12 0:45 GMT+02:00 Ben Bromhead b...@instaclustr.com:

 Looks awesome, do you have any examples/benchmarks of using these indexes
 for various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?

 On 10 June 2015 at 09:08, Andres de la Peña adelap...@stratio.com wrote:

 Hi all,

 With the release of Cassandra 2.1.6, Stratio is glad to present its open
 source Lucene-based implementation of C* secondary indexes
 https://github.com/Stratio/cassandra-lucene-index as a plugin that can
 be attached to Apache Cassandra. Previously, the Lucene index was
 distributed inside a fork of Apache Cassandra, with all the difficulties
 that implies. As of now, the fork is discontinued and new users should use
 the recently created plugin, which maintains all the features of Stratio
 Cassandra https://github.com/Stratio/stratio-cassandra.



 Stratio's Lucene index extends Cassandra’s functionality to provide near
 real-time distributed search engine capabilities similar to those of
 Elasticsearch or Solr, including full-text search, free multivariable
 search, relevance queries and field-based sorting. Each node indexes its
 own data, so high availability and scalability are guaranteed.


 We hope this will be useful to the Apache Cassandra community.


 Regards,

 --

 Andrés de la Peña


 http://www.stratio.com/
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón, Madrid
 Tel: +34 91 352 59 42 // *@stratiobd https://twitter.com/StratioBD*




 --

 Ben Bromhead

 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | (650) 284 9692




-- 

Andrés de la Peña


http://www.stratio.com/
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd https://twitter.com/StratioBD*


Question about nodetool status ... output

2015-06-12 Thread Jens Rantil
Hi,

I have one node in my 5-node cluster that effectively owns 100% and it
looks like my cluster is rather imbalanced. Is it common to have it this
imbalanced for 4-5 nodes?

My current output for a keyspace is:

$ nodetool status myks
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  X.X.X.33  203.92 GB  256     41.3%             871968c9-1d6b-4f06-ba90-8b3a8d92dcf0  RAC1
UN  X.X.X.32  200.44 GB  256     34.2%             d7cacd89-8613-4de5-8a5e-a2c53c41ea45  RAC1
UN  X.X.X.51  197.17 GB  256     100.0%            344b0adf-2b5d-47c8-8881-9a3f56be6f3b  RAC1
UN  X.X.X.52  113.63 GB  1       46.3%             55daa807-af49-44c5-9742-fe456df621a1  RAC1
UN  X.X.X.31  204.49 GB  256     78.3%             48cb0782-6c9a-4805-9330-38e192b6b680  RAC1

My keyspace has RF=3 and originally I added X.X.X.52 (num_tokens=1 was a
mistake) and then X.X.X.51. I haven't executed `nodetool cleanup` on any
nodes yet.
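
As a quick sanity check on the figures above (a sketch only, not something
nodetool produces): with RF=3 every row lives on three nodes, so the
"Owns (effective)" column should add up to roughly 300%, and a perfectly even
5-node cluster would show about 60% per node. The dictionary below simply
copies the percentages from the output; the variable names are illustrative.

```python
# Effective ownership per node, copied from the `nodetool status myks` output.
owns = {
    "X.X.X.33": 41.3,
    "X.X.X.32": 34.2,
    "X.X.X.51": 100.0,
    "X.X.X.52": 46.3,
    "X.X.X.31": 78.3,
}
replication_factor = 3  # RF of the keyspace, as stated in the message

total = sum(owns.values())
# Share each node would hold if ownership were spread perfectly evenly.
ideal = 100.0 * replication_factor / len(owns)

print(f"total effective ownership: {total:.1f}%")
print(f"ideal share per node:      {ideal:.1f}%")
for address, pct in sorted(owns.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{address}: {pct:5.1f}% ({pct - ideal:+.1f} vs. ideal)")
```

The per-node deviation makes it easy to see which nodes sit furthest from an
even share.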

For the curious, the full ring can be found here:
https://gist.github.com/JensRantil/57ee515e647e2f154779

Cheers,
Jens

-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook https://www.facebook.com/#!/tink.se Linkedin
http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitter https://twitter.com/tink


Re: Question about nodetool status ... output

2015-06-12 Thread Carlos Rolo
Your data model also contributes to the balance (or lack thereof) of the
cluster. If your data is badly partitioned, Cassandra will not do any
magic.

Regarding that cluster, I would decommission the x.52 node and add it again
with the correct configuration. After the bootstrap, run a cleanup. If it is
still that off-balance, you need to look into your data model.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
http://linkedin.com/in/carlosjuzarterolo*
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com


Question regarding concurrent bootstrapping

2015-06-12 Thread Jens Rantil
Hi,

Let's say I have an existing cluster and do the following:

   1. I start a new joining node (A). It enters state Up/Joining.
   Streaming automatically starts to this node.
   2. I wait two minutes (best practice for bootstrapping).
   3. I start a second node (B) to join the cluster. It allocates some of
   A's previous parts of the ring and enters state Up/Joining. Streaming
   automatically starts to this node.

Will streaming of data that A is no longer responsible for (after B joined)
stop immediately? That is, after (3), will the data streamed to A be only what
it is responsible for?

This is important for planning when one is expanding a cluster with
multiple smaller nodes.

Thanks,
Jens

-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook https://www.facebook.com/#!/tink.se Linkedin
http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitter https://twitter.com/tink


Re: Question about nodetool status ... output

2015-06-12 Thread Jens Rantil
Hi Carlos,

Yes, I should have been more specific about that; basically all my primary
IDs are random UUIDs, so I find it very hard to believe that my data
model is the problem here. I will run a full repair of the cluster,
execute a cleanup and recommission the node, then.

Thanks,
Jens


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook https://www.facebook.com/#!/tink.se Linkedin
http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitter https://twitter.com/tink


Re: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Carlos Rolo
Seems like an interesting tool!

What operational recommendations would you make to users of this tool
(Extra hardware capacity, extra metrics to monitor, etc)?

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
http://linkedin.com/in/carlosjuzarterolo*
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com


Re: Atomic behavior and efficiency of a DELETE query with an IN clause

2015-06-12 Thread Sotirios Delimanolis
Similarly, should we send multiple SELECT requests or a single one with a 
SELECT...IN ? 


 On Wednesday, June 10, 2015 11:27 AM, Sotirios Delimanolis 
sotodel...@yahoo.com wrote:
   

 Will this "eventually they will all go through" behavior apply to the IN? How
is this query written to the commitlog?

Do you mean prepare a query like

DELETE FROM MastersOfTheUniverse WHERE mastersID = ?;

and execute it asynchronously 3000 times, or add 3000 of these DELETE (bound)
prepared statements to a BATCH statement executed asynchronously?




 On Wednesday, June 10, 2015 9:51 AM, Jonathan Haddad j...@jonhaddad.com 
wrote:
   

 Batches don't work like that.  It's possible for some to succeed, and later,
the rest will.  "Atomic" is the incorrect word to use; it's more like
"eventually they will all go through".

Do not use IN(), use a whole bunch of prepared statements asynchronously.

On Wed, Jun 10, 2015 at 9:26 AM Sotirios Delimanolis sotodel...@yahoo.com
wrote:

Hi,

When executing a DELETE statement with an IN clause, where the list contains
partition keys, what is the underlying behaviour with regards to atomicity?

DELETE FROM MastersOfTheUniverse WHERE mastersID IN ('Man-At-Arms', 'Teela');

Is it going to act like an atomic batch where if one fails, all fail? If that
is the case, is there any reason to use a BATCH statement with multiple single
DELETE statements, or should we always prefer a DELETE with an IN clause?

For example, given 3000 keys for rows I want to delete, should I issue a single
DELETE query and provide all the keys in the IN argument or should I add 3000
DELETE queries to a BATCH statement?

Thank you,
Sotirios




   

  

Re: Atomic behavior and efficiency of a DELETE query with an IN clause

2015-06-12 Thread Jonathan Haddad
Multiple async requests.  IN() is a performance nightmare unless you're
querying against a single partition key.
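
A minimal sketch of the "multiple async requests" approach, under stated
assumptions: `session` is assumed to behave like a DataStax Python driver
Session (`prepare()` returning a statement, `execute_async()` returning a
future whose `result()` blocks and raises on failure). The helper name
`delete_keys_async` is made up for illustration; the table and column names
are taken from the question in this thread.

```python
def delete_keys_async(session, keys):
    """Issue one prepared single-partition DELETE per key, all asynchronously.

    Returns the list of keys whose delete failed, so the caller can retry them.
    """
    prepared = session.prepare(
        "DELETE FROM MastersOfTheUniverse WHERE mastersID = ?"
    )
    # Fire every request first so they are all in flight concurrently.
    futures = [(key, session.execute_async(prepared, (key,))) for key in keys]

    failed = []
    for key, future in futures:
        try:
            future.result()  # wait for this particular delete to finish
        except Exception:
            failed.append(key)  # one failure does not affect the other deletes
    return failed
```

Each delete succeeds or fails on its own, which matches the point that there
is no all-or-nothing behaviour to rely on. For thousands of keys it may be
worth capping the number of requests in flight rather than sending them all
at once.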



Re: Question regarding concurrent bootstrapping

2015-06-12 Thread Robert Coli
On Fri, Jun 12, 2015 at 5:21 AM, Jens Rantil jens.ran...@tink.se wrote:

 Let's say I have an existing cluster and do the following:

1. I start a new joining node (A). It enters state Up/Joining.
Streaming automatically start to this node.
2. I wait two minutes (best practise for bootstrapping).
3. I start a second node (B) to join the cluster. It allocates some of
A:s previous parts of the ring and enters state Up/Joining. Streaming
automatically starts to this node.

 Will streaming of data that A is no longer responsible (after B joined)
 stop immediately? That is, after (3), will data streamed to A only be what
 it is responsible of?


It depends on the version of Cassandra. In any version that doesn't contain
the CASSANDRA-2434 patch, A will get data it shouldn't get, and that data
stays on A if you do not run cleanup on A when A is done bootstrapping.

In a version containing 2434, the attempt to bootstrap B will fail and will
not work until A is done bootstrapping, unless you set the
property -Dcassandra.consistent.rangemovement=false while starting it.

In general, one DOES NOT WANT TO
SET -Dcassandra.consistent.rangemovement=false! Consistent range movement
(the default) is what fixes 2434, and 2434 is bad for consistency.

Instead, consider expanding clusters to their initial size while they are
empty, and disabling bootstrapping while doing so.

Lots and lots of background on :
https://issues.apache.org/jira/browse/CASSANDRA-2434

Related ticket : https://issues.apache.org/jira/browse/CASSANDRA-7069

=Rob
PS - BTW, the fact that 2434 existed for so long, in versions where repair
was often broken/unused, is the strongest single item of information in
support of the Coli Conjecture...


Dropped mutation messages

2015-06-12 Thread Robert Wille
I am preparing to migrate a large amount of data to Cassandra. In order to test 
my migration code, I’ve been doing some dry runs to a test cluster. My test 
cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a 
weird combination, but my production cluster that will eventually receive this 
data is RF=3. I am running with RF=1 so it's faster while I work out the kinks 
in the migration.
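
For reference, a small sketch of what CL=QUORUM means at each replication
factor: a quorum is floor(RF / 2) + 1 replicas. With RF=1 the single replica
must acknowledge every write, while RF=3 tolerates one missing
acknowledgement. The helper name `quorum` is only illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM operation: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


for rf in (1, 3):
    print(f"RF={rf}: QUORUM needs {quorum(rf)} of {rf} replica(s)")
```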

There are a few things that have puzzled me after writing several tens of 
millions of records to my test cluster.

My main concern is that I have a few tens of thousands of dropped mutation 
messages. I’m overloading my cluster. I never have more than about 10% CPU 
utilization (even my I/O wait is negligible). A curious thing about that is 
that the driver hasn’t thrown any exceptions, even though mutations have been 
dropped. I’ve seen dropped mutation messages on my production cluster, but like 
this, I’ve never gotten errors back from the client. I had always assumed that 
one node dropped mutation messages, but the other two did not, and so quorum 
was satisfied. With RF=1, I don’t understand how mutation messages are being 
dropped and the client doesn’t tell me about it. Does this mean my cluster is 
missing data, and I have no idea?

Each node has a couple dozen all-time blocked FlushWriters. Is that bad?

I have around 100 dropped counter mutations, which is very weird because I 
don’t write any counters. I have counters in my schema for tracking view 
counts, but the migration code doesn’t write them. How could I get dropped 
counter mutation messages when I don’t modify them?

Any insights would be appreciated. Thanks in advance.

Robert



Re: Cassandra 2.2, 3.0, and beyond

2015-06-12 Thread Robert Coli
On Thu, Jun 11, 2015 at 6:56 PM, Mohammed Guller moham...@glassbeam.com
wrote:

  By that logic, 2.1.0  should have been somewhat as stable as 2.0.10 (the
 last release of 2.0.x branch before 2.1.0). However, we found out that it
 took almost 9 months for 2.1.x series to become stable and suitable for
 production. Going by past history, I am worried that it may take the same
 time for 2.2 to become stable.


The instability of initial point releases is a significant part of the
motivation for the new release cadence.[1] If new versions continued to
take just as long to be production ready, the new process would have failed
at one of its major goals...

For the record, I agree with the reasoning in the linked post and am
cautiously optimistic about the effect it will have on the stability of
released versions. :D

=Rob
[1]
http://mail-archives.apache.org/mod_mbox/cassandra-dev/201503.mbox/%3CCALdd-zjAyiTbZksMeq2LxGwLF5LPhoi_4vsjy8JBHBRnsxH=8...@mail.gmail.com%3E

Unfortunately, even after DataStax hired half a dozen full-time test
engineers, 2.1.0 continued the proud tradition of being unready for
production use, with “wait for .5 before upgrading” once again looking like
a good guideline.

I’m starting to think that the entire model of “write a bunch of new
features all at once and then try to stabilize it for release” is broken.
We’ve been trying that for years and empirically speaking the evidence is
that it just doesn’t work, either from a stability standpoint or even just
shipping on time.
...
So, I’d like to try something different.  I think we were on the right
track with shorter releases with more compatibility.  But I’d like to throw
in a twist.  Intel cuts down on risk with a “tick-tock” schedule for new
architectures and process shrinks instead of trying to do both at once. We
can do something similar here:

One month releases.  Period.  If it’s not done, it can wait.
*Every other release only accepts bug fixes.*

By itself, one-month releases are going to dramatically reduce the
complexity of testing and debugging new releases -- and bugs that do slip
past us will only affect a smaller percentage of users, avoiding the “big
release has a bunch of bugs no one has seen before and pretty much everyone
is hit by something” scenario.  ***But by adding in the second rule, I
think we have a real chance to make a quantum leap here: stable,
production-ready releases every two months.***


(*** emphasis mine)


Re: Dropped mutation messages

2015-06-12 Thread Robert Wille
I meant to say I’m *not* overloading my cluster.


RE: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Mohammed Guller
The plugin looks cool. Thank you for open sourcing it.

Does it support faceting and other Solr functionality?

Mohammed


Re: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Andres de la Peña
I really appreciate your interest.

Well, the first recommendation is to not use it unless you need it, because
a properly denormalized Cassandra model is almost always preferable to
indexing. Lucene indexing is a good option when there is no viable
denormalization alternative. This is the case for range queries over
multiple dimensions, full-text search or maybe complex boolean predicates.
It's also appropriate for Spark/Hadoop jobs mapping a small fraction of the
total number of rows in a certain table, if you can pay the cost of
indexing.

Lucene indexes run inside C*, so users should closely monitor the amount of
used memory. It's also a good idea to put the Lucene directory files on a
separate disk from those used by C* itself. Additionally, you should consider
that the write throughput of indexed tables will be appreciably reduced,
maybe to a few thousand rows per second.

It's really hard to estimate the amount of resources needed by the index
due to the wide variety of indexing and querying options that Lucene offers,
so the only thing we can suggest is to empirically find the optimal setup
for your use case.

2015-06-12 12:00 GMT+02:00 Carlos Rolo r...@pythian.com:

 Seems like an interesting tool!

 What operational recommendations would you make to users of this tool
 (Extra hardware capacity, extra metrics to monitor, etc)?

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
 http://linkedin.com/in/carlosjuzarterolo*
 Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
 www.pythian.com

 On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña adelap...@stratio.com
  wrote:

 Unfortunately, we don't have published any benchmarks yet, but we have
 plans to do it as soon as possible. However, you can expect a similar
 behavior as those of Elasticsearch or Solr, with some overhead due to the
 need for indexing both the Cassandra's row key and the partition's token.
 You can also take a look at this presentation
 http://planetcassandra.org/video-presentations/vp/cassandra-summit-europe-2014/vd/stratio-advanced-search-and-top-k-queries-in-cassandra/
 to see how cluster distribution is done.

 2015-06-12 0:45 GMT+02:00 Ben Bromhead b...@instaclustr.com:

 Looks awesome, do you have any examples/benchmarks of using these
 indexes for various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?

 On 10 June 2015 at 09:08, Andres de la Peña adelap...@stratio.com
 wrote:

 Hi all,

 With the release of Cassandra 2.1.6, Stratio is glad to present its
 open source Lucene-based implementation of C* secondary indexes
 https://github.com/Stratio/cassandra-lucene-index as a plugin that
 can be attached to Apache Cassandra. Before the above changes, Lucene index
 was distributed inside a fork of Apache Cassandra, with all the
 difficulties implied. As of now, the fork is discontinued and new users
 should use the recently created plugin, which maintains all the features 
 of Stratio
 Cassandra https://github.com/Stratio/stratio-cassandra.



 Stratio's Lucene index extends Cassandra’s functionality to provide
 near real-time distributed search engine capabilities such as with
 ElasticSearch or Solr, including full text search capabilities, free
 multivariable search, relevance queries and field-based sorting. Each node
 indexes its own data, so high availability and scalability is guaranteed.


 We hope this will be useful to the Apache Cassandra community.


 Regards,


-- 

Andrés de la Peña


http://www.stratio.com/
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd https://twitter.com/StratioBD*


RE: Support for ad-hoc query

2015-06-12 Thread SEAN_R_DURITY
I will note here that the limitations on ad-hoc querying (and aggregates) make 
it much more difficult to deal with data quality problems, QA testing, and 
similar efforts, especially where people are used to a more relational, ad-hoc 
model. We have often had to extract data from Cassandra to Hadoop so that we 
could query it with Hive.

Example: “We found a few records with incorrect data. How many more records 
like that are out there?”
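
To make that concrete, here is a minimal Python sketch of the kind of
question that is easy once the data has been extracted but hard to ask of
Cassandra directly: scan every row and count the ones an arbitrary
predicate flags. The column names and sample rows are made up for
illustration; in practice the rows would come from the Hadoop/Hive extract.

```python
from typing import Callable, Iterable, Mapping


def count_matching(rows: Iterable[Mapping[str, object]],
                   is_bad: Callable[[Mapping[str, object]], bool]) -> int:
    """Scan every extracted row once and count those the predicate flags."""
    return sum(1 for row in rows if is_bad(row))


# Hypothetical extract; the columns are invented for this example.
extract = [
    {"order_id": 1, "store": "0121", "total": 19.99},
    {"order_id": 2, "store": "", "total": 5.00},
    {"order_id": 3, "store": "0456", "total": -3.50},
    {"order_id": 4, "store": "", "total": 12.00},
]

# Two different ad-hoc checks against the same data, decided after the fact.
missing_store = count_matching(extract, lambda r: r["store"] == "")
negative_total = count_matching(extract, lambda r: r["total"] < 0)
```

Each new predicate is a full scan, which is exactly the access pattern a
key-oriented data model does not plan for.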


Sean Durity

From: Peter Lin [mailto:wool...@gmail.com]
Sent: Wednesday, June 10, 2015 8:17 AM
To: user@cassandra.apache.org
Subject: Re: Support for ad-hoc query


I'll second Jack's detailed response and add that you really should do some 
discovery to figure out what kinds of queries you may need to support.
It might not be possible and often that is the case, but it's worth while to 
ask the end users what kind of reports they need to run. Allowing arbitrary 
ad-hoc queries is a known anti-pattern for cassandra. If the system needs to 
query multiple cf to derive/calculate some result, using Cassandra alone isn't 
going to do it. You'll need some other system to give you better query 
capabilities like Hive.
If you need data warehouse like features, look at http://www.kylin.io/ . They 
are doing some interesting things.
peter

On Wed, Jun 10, 2015 at 7:58 AM, Jack Krupansky 
jack.krupan...@gmail.com wrote:
Knowing your queries in advance is a hard-core requirement for effective 
deployment of Cassandra. Ad hoc queries are a very clear anti-pattern for 
Cassandra. DSE Search does provide support for advanced, complex, and ad hoc 
queries. Stratio and TupleJump Stargate can also be used.

Back to the question of what you mean by ad hoc queries:

1. Do you expect real-time results, like sub-second, or are these long-running 
queries that might take seconds, 10 seconds or more, or even minutes to run?
2. Will they be very rare or quite frequent - how much load do you expect them 
to place on the cluster?
3. How complex do you expect them to be - how many clauses and operators?
4. What is their net cardinality - are they selecting just a few rows or many 
rows?
5. Do they have individual query clauses that select many rows even if the net 
combination of all select clauses is not so many rows?

The requirement to perform advanced, complex, and ad hoc queries using DSE 
Search or the other techniques will almost certainly require that you use 
moderately more capable hardware, especially more RAM, for each node, and 
probably more nodes as well to reduce the row count per node since ad hoc 
queries will tend to be compute-intensive based on number of rows on the node.

Yes, it can be done. No, it is not free or cheap. And, no, it does not come out 
of the box for a non-DSE Cassandra release. And, yes, you must address this 
requirement before deployment, not after deployment.


-- Jack Krupansky

On Wed, Jun 10, 2015 at 1:18 AM, Srinivasa T N 
seen...@gmail.com wrote:
Thanks guys for the inputs.
By ad-hoc queries I mean that I don't know the queries during cf design time.  
The data may be from single cf or multiple cf.  (This feature maybe required if 
I want to do analysis on the data stored in cassandra, do you have any better 
ideas)?
Regards,
Seenu.

On Tue, Jun 9, 2015 at 5:57 PM, Peter Lin 
wool...@gmail.com wrote:

what do you mean by ad-hoc queries?
Do you mean simple queries against a single column family aka table?
Or do you mean MDX style queries that looks at multiple tables?
if it's MDX style queries, many people extract data from Cassandra into a data 
warehouse that support multi-dimensional cubes. This works well when the 
extracted data is a small subset and fits neatly in a data warehouse.
As others have stated, Cassandra isn't great at ad-hoc. For MDX style queries, 
Cassandra wasn't designed for it. One thing we've done for our own project is 
to combine solr with our own fuzzy index to make ad-hoc queries against a 
single table more friendly.


On Tue, Jun 9, 2015 at 2:38 AM, Srinivasa T N 
seen...@gmail.com wrote:
Hi All,
   I have an web application running with my backend data stored in cassandra.  
Now I want to do some analysis on the data stored which requires some ad-hoc 
queries fired on cassandra.  How can I do the same?
Regards,
Seenu.








Re: Support for ad-hoc query

2015-06-12 Thread Jack Krupansky
No dispute about that. But the main design requirement Cassandra strives to
meet is to be a blazing fast transactional database - here's the key, give
me the data, and here's the key, write this data. Any additional query
requirements are a distant second at best. A big part of that transactional
speed requirement is achieved by jettisoning the overhead required for ad
hoc queries.

I think it is inevitable that Cassandra will eventually address the
requirement for ad hoc queries when it finally decides what it wants to be
when it grows up (i.e., whether to just be a niche or to subsume all of
SQL), but in the meantime DSE Search/Solr, Stratio, and TupleJump Stargate,
as well as extraction and indexing in Elasticsearch, are moderately
reasonable near-term solutions.

And I agree that having to fully model eventual (and evolving!) data
requirements and emergent anomalous conditions upfront is too big a burden
for many enterprises.


-- Jack Krupansky

On Fri, Jun 12, 2015 at 10:07 AM, sean_r_dur...@homedepot.com wrote:

  I will note here that the limitations on ad-hoc querying (and
 aggregates) make it much more difficult to deal with data quality problems,
 QA testing, and similar efforts, especially where people are used to a more
 relational, ad-hoc model. We have often had to extract data from Cassandra
 to Hadoop for querying by hive.



 Example: “We found a few records with incorrect data. How many more
 records like that are out there?”





 Sean Durity



 *From:* Peter Lin [mailto:wool...@gmail.com]
 *Sent:* Wednesday, June 10, 2015 8:17 AM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Support for ad-hoc query





 I'll second Jack's detailed response and add that you really should do
 some discovery to figure out what kinds of queries you may need to support.

 It might not be possible and often that is the case, but it's worth while
 to ask the end users what kind of reports they need to run. Allowing
 arbitrary ad-hoc queries is a known anti-pattern for cassandra. If the
 system needs to query multiple cf to derive/calculate some result, using
 Cassandra alone isn't going to do it. You'll need some other system to give
 you better query capabilities like Hive.

 If you need data warehouse like features, look at http://www.kylin.io/ .
 They are doing some interesting things.

 peter



 On Wed, Jun 10, 2015 at 7:58 AM, Jack Krupansky jack.krupan...@gmail.com
 wrote:

 Knowing your queries in advance is a hard-core requirement for effective
 deployment of Cassandra. Ad hoc queries are a very clear anti-pattern for
 Cassandra. DSE Search does provide support for advanced, complex, and ad
 hoc queries. Stratio and TupleJump Stargate can also be used.



 Back to the question of what you mean by ad hoc queries:



 1. Do you expect real-time results, like sub-second, or are these
 long-running queries that might take seconds, 10 seconds or more, or even
 minutes to run?

 2. Will they be very rare or quite frequent - how much load do you expect
 them to place on the cluster?

 3. How complex do you expect them to be - how many clauses and operators?

 4. What is their net cardinality - are they selecting just a few rows or
 many rows?

 5. Do they have individual query clauses that select many rows even if the
 net combination of all select clauses is not so many rows?



 The requirement to perform advanced, complex, and ad hoc queries using DSE
 Search or the other techniques will almost certainly require that you use
 moderately more capable hardware, especially more RAM, for each node, and
 probably more nodes as well to reduce the row count per node since ad hoc
 queries will tend to be compute-intensive based on number of rows on the
 node.



 Yes, it can be done. No, it is not free or cheap. And, no, it does not
 come out of the box for a non-DSE Cassandra release. And, yes, you must
 address this requirement before deployment, not after deployment.




   -- Jack Krupansky



 On Wed, Jun 10, 2015 at 1:18 AM, Srinivasa T N seen...@gmail.com wrote:

 Thanks guys for the inputs.

 By ad-hoc queries I mean that I don't know the queries during cf design
 time.  The data may be from single cf or multiple cf.  (This feature maybe
 required if I want to do analysis on the data stored in cassandra, do you
 have any better ideas)?

 Regards,

 Seenu.



 On Tue, Jun 9, 2015 at 5:57 PM, Peter Lin wool...@gmail.com wrote:



 what do you mean by ad-hoc queries?

 Do you mean simple queries against a single column family aka table?

 Or do you mean MDX style queries that looks at multiple tables?

 if it's MDX style queries, many people extract data from Cassandra into a
 data warehouse that support multi-dimensional cubes. This works well when
 the extracted data is a small subset and fits neatly in a data warehouse.

 As others have stated, Cassandra isn't great at ad-hoc. For MDX style
 queries, Cassandra wasn't designed for it. One thing we've done for our own
 project is to combine solr 

My dse-spark app goes well with spark-submit, BUT GOT STUCK while executing by sbt run or java jar run on my win-pc

2015-06-12 Thread 126
My dse-spark app runs fine with spark-submit, but it gets stuck when I run it 
with sbt run or as a Java jar on my Windows PC, which means the driver process 
is on a PC outside the DSE cluster. What frustrates me is that when I look 
through the logs I see no errors; it just hangs there, with stage progress 
always staying at 0/(some number bigger than 3000).
How can I find the problem?


connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

2015-06-12 Thread Paulo Ricardo Motta Gomes
Hello,

We recently upgraded a cluster from 2.0.12 to 2.0.15, and now whenever we
stop/kill a Cassandra process, some other nodes keep a connection with the
dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.

So, if I start the killed node again, it cannot handshake with the nodes
that still have a connection in the CLOSE_WAIT state until that connection
is closed. They see each other as down for 5-20 minutes, until they can
handshake again.

I believe this is somehow related to the fixes for CASSANDRA-8336 and
CASSANDRA-9238, and it could also be a duplicate of CASSANDRA-8072. I will
continue to investigate to see if I find more evidence, but any help at
this point would be appreciated, or at least a confirmation that it could
be related to any of these tickets.
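
For anyone who wants to check their own nodes, here is a minimal Python
sketch that lists the peers holding a CLOSE_WAIT connection on the storage
port. It assumes the usual Linux `netstat -tan` column layout (Proto,
Recv-Q, Send-Q, Local Address, Foreign Address, State); the sample output
below is made up, not taken from our cluster.

```python
from typing import List


def close_wait_peers(netstat_output: str, port: int = 7000) -> List[str]:
    """Return the remote IP of each TCP connection in CLOSE_WAIT on the given port."""
    suffix = ":%d" % port
    peers = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a TCP connection row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "CLOSE_WAIT":
            continue
        # The storage port may be on either side of the connection.
        if local.endswith(suffix) or remote.endswith(suffix):
            peers.append(remote.rsplit(":", 1)[0])
    return peers


# Made-up sample in the assumed `netstat -tan` format.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:7000           10.0.0.2:51234          CLOSE_WAIT
tcp        0      0 10.0.0.1:7000           10.0.0.3:40022          ESTABLISHED
tcp        1      0 10.0.0.1:38110          10.0.0.2:7000           CLOSE_WAIT
tcp        0      0 10.0.0.1:9042           10.0.0.9:55000          CLOSE_WAIT
"""

stuck = close_wait_peers(SAMPLE)
```

Running this periodically after killing a node should show how long the
stale connections to the dead node linger before they are closed.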

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200