Re: Dropped mutation messages

2015-06-12 Thread Robert Wille
I meant to say I’m *not* overloading my cluster.

On Jun 12, 2015, at 6:52 PM, Robert Wille  wrote:

> I am preparing to migrate a large amount of data to Cassandra. In order to 
> test my migration code, I’ve been doing some dry runs to a test cluster. My 
> test cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and 
> CL=QUORUM is a weird combination, but my production cluster that will 
> eventually receive this data is RF=3. I am running with RF=1 so it’s faster 
> while I work out the kinks in the migration.
> 
> There are a few things that have puzzled me after writing several tens of 
> millions of records to my test cluster.
> 
> My main concern is that I have a few tens of thousands of dropped mutation 
> messages. I’m overloading my cluster. I never have more than about 10% CPU 
> utilization (even my I/O wait is negligible). A curious thing about that is 
> that the driver hasn’t thrown any exceptions, even though mutations have been 
> dropped. I’ve seen dropped mutation messages on my production cluster, but 
> like this, I’ve never gotten errors back from the client. I had always 
> assumed that one node dropped mutation messages, but the other two did not, 
> and so quorum was satisfied. With RF=1, I don’t understand how mutation 
> messages are being dropped and the client doesn’t tell me about it. Does this 
> mean my cluster is missing data, and I have no idea?
> 
> Each node has a couple dozen all-time blocked FlushWriters. Is that bad?
> 
> I have around 100 dropped counter mutations, which is very weird because I 
> don’t write any counters. I have counters in my schema for tracking view 
> counts, but the migration code doesn’t write them. How could I get dropped 
> counter mutation messages when I don’t modify them?
> 
> Any insights would be appreciated. Thanks in advance.
> 
> Robert
> 



Dropped mutation messages

2015-06-12 Thread Robert Wille
I am preparing to migrate a large amount of data to Cassandra. In order to test 
my migration code, I’ve been doing some dry runs to a test cluster. My test 
cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a 
weird combination, but my production cluster that will eventually receive this 
data is RF=3. I am running with RF=1 so it’s faster while I work out the kinks 
in the migration.

There are a few things that have puzzled me after writing several tens of 
millions of records to my test cluster.

My main concern is that I have a few tens of thousands of dropped mutation 
messages. I’m overloading my cluster. I never have more than about 10% CPU 
utilization (even my I/O wait is negligible). A curious thing about that is 
that the driver hasn’t thrown any exceptions, even though mutations have been 
dropped. I’ve seen dropped mutation messages on my production cluster, but like 
this, I’ve never gotten errors back from the client. I had always assumed that 
one node dropped mutation messages, but the other two did not, and so quorum 
was satisfied. With RF=1, I don’t understand how mutation messages are being 
dropped and the client doesn’t tell me about it. Does this mean my cluster is 
missing data, and I have no idea?
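
For reference, the number of replicas a QUORUM request waits for is
floor(RF / 2) + 1. A minimal sketch of that arithmetic in Python (the
function name is made up for illustration and is not part of any driver):

```python
def quorum(replication_factor: int) -> int:
    """Number of replica acknowledgements a QUORUM request waits for."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Integer division gives floor(RF / 2); a quorum is one more than that.
    return replication_factor // 2 + 1


for rf in (1, 3):
    print(f"RF={rf}: QUORUM needs {quorum(rf)} replica(s)")
```

So on the RF=1 test cluster a QUORUM write is acknowledged by the single
replica that owns the row, while on the RF=3 production cluster two of the
three replicas must acknowledge it.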

Each node has a couple dozen all-time blocked FlushWriters. Is that bad?

I have around 100 dropped counter mutations, which is very weird because I 
don’t write any counters. I have counters in my schema for tracking view 
counts, but the migration code doesn’t write them. How could I get dropped 
counter mutation messages when I don’t modify them?

Any insights would be appreciated. Thanks in advance.

Robert



Re: Cassandra 2.2, 3.0, and beyond

2015-06-12 Thread Robert Coli
On Thu, Jun 11, 2015 at 6:56 PM, Mohammed Guller 
wrote:

>  By that logic, 2.1.0  should have been somewhat as stable as 2.0.10 (the
> last release of 2.0.x branch before 2.1.0). However, we found out that it
> took almost 9 months for 2.1.x series to become stable and suitable for
> production. Going by past history, I am worried that it may take the same
> time for 2.2 to become stable.
>

The instability of initial point releases is a significant part of the
motivation for the new release cadence.[1] If new versions continued to
take just as long to be production ready, the new process would have failed
at one of its major goals...

For the record, I agree with the reasoning in the linked post and am
cautiously optimistic about the effect it will have on the stability of
released versions. :D

=Rob
[1]
http://mail-archives.apache.org/mod_mbox/cassandra-dev/201503.mbox/%3CCALdd-zjAyiTbZksMeq2LxGwLF5LPhoi_4vsjy8JBHBRnsxH=8...@mail.gmail.com%3E
"
Unfortunately, even after DataStax hired half a dozen full-time test
engineers, 2.1.0 continued the proud tradition of being unready for
production use, with "wait for .5 before upgrading" once again looking like
a good guideline.

I’m starting to think that the entire model of “write a bunch of new
features all at once and then try to stabilize it for release” is broken.
We’ve been trying that for years and empirically speaking the evidence is
that it just doesn’t work, either from a stability standpoint or even just
shipping on time.
...
So, I’d like to try something different.  I think we were on the right
track with shorter releases with more compatibility.  But I’d like to throw
in a twist.  Intel cuts down on risk with a “tick-tock” schedule for new
architectures and process shrinks instead of trying to do both at once. We
can do something similar here:

One month releases.  Period.  If it’s not done, it can wait.
*Every other release only accepts bug fixes.*

By itself, one-month releases are going to dramatically reduce the
complexity of testing and debugging new releases -- and bugs that do slip
past us will only affect a smaller percentage of users, avoiding the “big
release has a bunch of bugs no one has seen before and pretty much everyone
is hit by something” scenario.  ***But by adding in the second rule, I
think we have a real chance to make a quantum leap here: stable,
production-ready releases every two months.***
"

(*** emphasis mine)


Re: Question regarding concurrent bootstrapping

2015-06-12 Thread Robert Coli
On Fri, Jun 12, 2015 at 5:21 AM, Jens Rantil  wrote:

> Let's say I have an existing cluster and do the following:
>
>    1. I start a new joining node (A). It enters state "Up/Joining".
>    Streaming automatically starts to this node.
>    2. I wait two minutes (best practise for bootstrapping).
>    3. I start a second node (B) to join the cluster. It allocates some of
>    A's previous parts of the ring and enters state "Up/Joining". Streaming
>    automatically starts to this node.
>
> Will streaming of data that A is no longer responsible for (after B joined)
> stop immediately? That is, after (3), will data streamed to A only be what
> it is responsible for?
>

It depends on the version of Cassandra. A will get data it "shouldn't" get
in any version that doesn't contain the CASSANDRA-2434 patch, and that data
stays on A unless you run "cleanup" on A when A is done bootstrapping.

In a version containing 2434, the attempt to bootstrap B will fail and will
not work until A is done bootstrapping, unless you set the
property -Dcassandra.consistent.rangemovement=false while starting it.

In general, one DOES NOT WANT TO
SET -Dcassandra.consistent.rangemovement=false! Consistent range movement
fixes 2434, and 2434 is bad for consistency.

Instead, consider expanding clusters to their initial size when they are
empty, and disabling bootstrapping while doing so.

Lots and lots of background on :
https://issues.apache.org/jira/browse/CASSANDRA-2434

Related ticket : https://issues.apache.org/jira/browse/CASSANDRA-7069

=Rob
PS - BTW, the fact that 2434 existed for so long, in versions where repair
was often broken/unused, is the strongest single item of information in
support of the Coli Conjecture...


Re: Atomic behavior and efficiency of a DELETE query with an IN clause

2015-06-12 Thread Jonathan Haddad
Multiple async requests.  IN() is a performance nightmare unless you're
querying against a single partition key.
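
A minimal sketch of the "multiple async requests" pattern in Python. To stay
self-contained it does not talk to Cassandra: delete_row is a hypothetical
stand-in for executing the bound prepared statement
DELETE FROM MastersOfTheUniverse WHERE mastersID = ? through a real driver,
and delete_all and max_in_flight are likewise names invented for this
illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def delete_row(masters_id):
    """Hypothetical stand-in for one single-partition DELETE sent via a driver."""
    return masters_id


def delete_all(keys, delete_fn=delete_row, max_in_flight=32):
    """Issue one delete per key concurrently; return (succeeded, failed)."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Map each in-flight request back to the key it is deleting.
        futures = {pool.submit(delete_fn, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
                succeeded.append(key)
            except Exception as exc:
                # Each request fails on its own, so only this key needs a retry.
                failed.append((key, exc))
    return succeeded, failed


ok, bad = delete_all(["Man-At-Arms", "Teela"])
print(sorted(ok), bad)
```

Each request targets a single partition and succeeds or fails independently,
so the caller can retry only the keys that ended up in the failed list.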

On Fri, Jun 12, 2015 at 1:09 PM Sotirios Delimanolis 
wrote:

> Similarly, should we send multiple SELECT requests or a single one with a
> SELECT...IN ?
>
>
>
>   On Wednesday, June 10, 2015 11:27 AM, Sotirios Delimanolis <
> sotodel...@yahoo.com> wrote:
>
>
> Will this "eventually they will all go through" behavior apply to the IN?
> How is this query written to the commitlog?
>
> Do you mean prepare a query like
>
> DELETE FROM MastersOfTheUniverse WHERE mastersID = ?;
>
> and execute it asynchronously 3000 times or add 3000 of these DELETE (bound) 
> prepared statements to a BATCH statement executed asynchronously?
>
>
>
>
>
>
>   On Wednesday, June 10, 2015 9:51 AM, Jonathan Haddad 
> wrote:
>
>
> Batches don't work like that.  It's possible for some to succeed, and
> later, the rest will.  Atomic is the incorrect word to use, it's more like
> "eventually they will all go through".
>
> Do not use IN(), use a whole bunch of prepared statements asynchronously.
>
> On Wed, Jun 10, 2015 at 9:26 AM Sotirios Delimanolis 
> wrote:
>
> Hi,
>
> When executing a DELETE statement with an IN clause, where the list
> contains partition keys, what is the underlying behaviour with regards to
> atomicity?
>
> DELETE FROM MastersOfTheUniverse WHERE mastersID IN ('Man-At-Arms', 'Teela');
>
>
> Is it going to act like an atomic batch where if one fails, all fail? If
> that is the case, is there any reason to use a BATCH statement with
> multiple single DELETE statements or should we always prefer a DELETE with
> an IN clause?
>
> For example, given 3000 keys for rows I want to delete, should I issue a
> single DELETE query and provide all the keys in the IN argument or should
> I add 3000 DELETE queries to a BATCH statement?
>
> Thank you,
> Sotirios
>
>
>
>
>
>
>


Re: Atomic behavior and efficiency of a DELETE query with an IN clause

2015-06-12 Thread Sotirios Delimanolis
Similarly, should we send multiple SELECT requests or a single one with a 
SELECT...IN ? 


 On Wednesday, June 10, 2015 11:27 AM, Sotirios Delimanolis 
 wrote:
   

 Will this "eventually they will all go through" behavior apply to the IN? How 
is this query written to the commitlog?

Do you mean prepare a query like

DELETE FROM MastersOfTheUniverse WHERE mastersID = ?;

and execute it asynchronously 3000 times or add 3000 of these DELETE (bound) 
prepared statements to a BATCH statement executed asynchronously?




 On Wednesday, June 10, 2015 9:51 AM, Jonathan Haddad  
wrote:
   

 Batches don't work like that.  It's possible for some to succeed, and later, 
the rest will.  Atomic is the incorrect word to use, it's more like "eventually 
they will all go through".

Do not use IN(), use a whole bunch of prepared statements asynchronously.

On Wed, Jun 10, 2015 at 9:26 AM Sotirios Delimanolis wrote:

Hi,

When executing a DELETE statement with an IN clause, where the list contains 
partition keys, what is the underlying behaviour with regards to atomicity?

DELETE FROM MastersOfTheUniverse WHERE mastersID IN ('Man-At-Arms', 'Teela');

Is it going to act like an atomic batch where if one fails, all fail? If that 
is the case, is there any reason to use a BATCH statement with multiple single 
DELETE statements or should we always prefer a DELETE with an IN clause?

For example, given 3000 keys for rows I want to delete, should I issue a single 
DELETE query and provide all the keys in the IN argument or should I add 3000 
DELETE queries to a BATCH statement?

Thank you,
Sotirios




   

  

RE: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Mohammed Guller
The plugin looks cool. Thank you for open sourcing it.

Does it support faceting and other Solr functionality?

Mohammed

From: Andres de la Peña [mailto:adelap...@stratio.com]
Sent: Friday, June 12, 2015 3:43 AM
To: user@cassandra.apache.org
Subject: Re: Lucene index plugin for Apache Cassandra

I really appreciate your interest.

Well, the first recommendation is to not use it unless you need it, because a 
properly denormalized Cassandra model is almost always preferable to indexing. 
Lucene indexing is a good option when there is no viable denormalization 
alternative. This is the case for range queries over multiple dimensions, 
full-text search or maybe complex boolean predicates. It's also appropriate for 
Spark/Hadoop jobs mapping a small fraction of the total number of rows in a 
certain table, if you can pay the cost of indexing.

Lucene indexes run inside C*, so users should closely monitor the amount of 
used memory. It's also a good idea to put the Lucene directory files on a 
separate disk from those used by C* itself. Additionally, you should consider 
that the write throughput of indexed tables will be appreciably reduced, maybe 
to a few thousand rows per second.

It's really hard to estimate the amount of resources needed by the index due to 
the great variety of indexing and querying ways that Lucene offers, so the only 
thing we can suggest is to empirically find the optimal setup for your use case.

2015-06-12 12:00 GMT+02:00 Carlos Rolo <r...@pythian.com>:
Seems like an interesting tool!
What operational recommendations would you make to users of this tool (Extra 
hardware capacity, extra metrics to monitor, etc)?

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: 
linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña <adelap...@stratio.com> wrote:
Unfortunately, we haven't published any benchmarks yet, but we plan to do so 
as soon as possible. However, you can expect behavior similar to that of 
Elasticsearch or Solr, with some overhead due to the need to index both the 
Cassandra row key and the partition's token. You can also take a look at this 
presentation to see how cluster distribution is done.

2015-06-12 0:45 GMT+02:00 Ben Bromhead <b...@instaclustr.com>:
Looks awesome, do you have any examples/benchmarks of using these indexes for 
various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?

On 10 June 2015 at 09:08, Andres de la Peña <adelap...@stratio.com> wrote:
Hi all,

With the release of Cassandra 2.1.6, Stratio is glad to present its open source 
Lucene-based implementation of C* secondary indexes as a plugin that can be 
attached to Apache Cassandra. Before the above changes, the Lucene index was 
distributed inside a fork of Apache Cassandra, with all the difficulties 
implied. As of now, the fork is discontinued and new users should use the 
recently created plugin, which maintains all the features of Stratio 
Cassandra.

Stratio's Lucene index extends Cassandra’s functionality to provide near 
real-time distributed search engine capabilities like those of Elasticsearch or 
Solr, including full-text search capabilities, free multivariable search, 
relevance queries and field-based sorting. Each node indexes its own data, so 
high availability and scalability are guaranteed.

We hope this will be useful to the Apache Cassandra community.

Regards,

--

Andrés de la Peña

[http://www.stratio.com/wp-content/uploads/2014/05/stratio_logo_2014.png]
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd



--

Ben Bromhead

Instaclustr | www.instaclustr.com | 
@instaclustr | (650) 284 9692





connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

2015-06-12 Thread Paulo Ricardo Motta Gomes
Hello,

We recently upgraded a cluster from 2.0.12 to 2.0.15 and now whenever we
stop/kill a cassandra process, some other nodes keep a connection with the
dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.

So, if I start the killed node again, it cannot handshake with the nodes
which have a connection in the CLOSE_WAIT state until that connection is
closed, so they remain in the down state to each other for 5-20 minutes,
until they can handshake again.

I believe this is somehow related to the fixes for CASSANDRA-8336 and
CASSANDRA-9238, and could also be a duplicate of CASSANDRA-8072. I will
continue to investigate to see if I find more evidence, but any help at
this point would be appreciated, or at least a confirmation that it could
be related to any of these tickets.
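
One way to watch for the symptom is to count the CLOSE_WAIT connections that
involve the storage port. The sketch below parses text in the column layout
printed by `netstat -tan` on Linux; the helper name is hypothetical and the
sample lines are made up for illustration, not captured from the cluster
described here.

```python
def count_close_wait(netstat_output: str, port: int = 7000) -> int:
    """Count TCP connections in CLOSE_WAIT whose local or remote port is `port`.

    Expects the `netstat -tan` column layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state == "CLOSE_WAIT" and (local.endswith(suffix) or remote.endswith(suffix)):
            count += 1
    return count


# Made-up sample input, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address      Foreign Address    State
tcp        0      0 10.0.0.1:7000      10.0.0.2:51234     CLOSE_WAIT
tcp        0      0 10.0.0.1:7000      10.0.0.3:51300     ESTABLISHED
tcp        1      0 10.0.0.1:48822     10.0.0.2:7000      CLOSE_WAIT
tcp        0      0 10.0.0.1:9042      10.0.0.9:40000     CLOSE_WAIT
"""

print(count_close_wait(SAMPLE))
```

Running this periodically against real `netstat -tan` output on each node
would show how long the stale connections linger after a peer is killed.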

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br*
+55 48 3232.3200


Re: Support for ad-hoc query

2015-06-12 Thread Jack Krupansky
No dispute about that. But the main design requirement Cassandra strives to
meet is to be a blazing fast transactional database - here's the key, give
me the data, and here's the key, write this data. Any additional query
requirements are a distant second at best. A big part of that transactional
speed requirement is achieved by jettisoning the overhead required for ad
hoc queries.

I think it is inevitable that Cassandra will eventually address the
requirement for ad hoc queries when it finally decides what it wants to be
when it grows up (i.e., whether to just be a niche or to subsume all of
SQL), but in the meantime DSE Search/Solr, Stratio, and TupleJump Stargate,
as well as extraction and indexing in Elasticsearch, are moderately
reasonable near-term solutions.

And I agree that having to fully model eventual (and evolving!) data
requirements and emergent anomalous conditions upfront is too big a burden
for many enterprises.


-- Jack Krupansky

On Fri, Jun 12, 2015 at 10:07 AM,  wrote:

>  I will note here that the limitations on ad-hoc querying (and
> aggregates) make it much more difficult to deal with data quality problems,
> QA testing, and similar efforts, especially where people are used to a more
> relational, ad-hoc model. We have often had to extract data from Cassandra
> to Hadoop for querying with Hive.
>
>
>
> Example: “We found a few records with incorrect data. How many more
> records like that are out there?”
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Peter Lin [mailto:wool...@gmail.com]
> *Sent:* Wednesday, June 10, 2015 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Support for ad-hoc query
>
>
>
>
>
> I'll second Jack's detailed response and add that you really should do
> some discovery to figure out what kinds of queries you may need to support.
>
> It might not be possible, and often that is the case, but it's worthwhile
> to ask the end users what kind of reports they need to run. Allowing
> arbitrary ad-hoc queries is a known anti-pattern for Cassandra. If the
> system needs to query multiple cf to derive/calculate some result, using
> Cassandra alone isn't going to do it. You'll need some other system to give
> you better query capabilities, like Hive.
>
> If you need data-warehouse-like features, look at http://www.kylin.io/ .
> They are doing some interesting things.
>
> peter
>
>
>
> On Wed, Jun 10, 2015 at 7:58 AM, Jack Krupansky 
> wrote:
>
> Knowing your queries in advance is a hard-core requirement for effective
> deployment of Cassandra. Ad hoc queries are a very clear anti-pattern for
> Cassandra. DSE Search does provide support for advanced, complex, and ad
> hoc queries. Stratio and TupleJump Stargate can also be used.
>
>
>
> Back to the question of what you mean by ad hoc queries:
>
>
>
> 1. Do you expect real-time results, like sub-second, or are these
> long-running queries that might take seconds, 10 seconds or more, or even
> minutes to run?
>
> 2. Will they be very rare or quite frequent - how much load do you expect
> them to place on the cluster?
>
> 3. How complex do you expect them to be - how many clauses and operators?
>
> 4. What is their net cardinality - are they selecting just a few rows or
> many rows?
>
> 5. Do they have individual query clauses that select many rows even if the
> net combination of all select clauses is not so many rows?
>
>
>
> The requirement to perform advanced, complex, and ad hoc queries using DSE
> Search or the other techniques will almost certainly require that you use
> moderately more capable hardware, especially more RAM, for each node, and
> probably more nodes as well to reduce the row count per node since ad hoc
> queries will tend to be compute-intensive based on number of rows on the
> node.
>
>
>
> Yes, it can be done. No, it is not free or cheap. And, no, it does not
> come out of the box for a non-DSE Cassandra release. And, yes, you must
> address this requirement before deployment, not after deployment.
>
>
>
>
>   -- Jack Krupansky
>
>
>
> On Wed, Jun 10, 2015 at 1:18 AM, Srinivasa T N  wrote:
>
> Thanks guys for the inputs.
>
> By ad-hoc queries I mean that I don't know the queries during cf design
> time.  The data may be from a single cf or multiple cf.  (This feature may
> be required if I want to do analysis on the data stored in Cassandra; do
> you have any better ideas?)
>
> Regards,
>
> Seenu.
>
>
>
> On Tue, Jun 9, 2015 at 5:57 PM, Peter Lin  wrote:
>
>
>
> what do you mean by ad-hoc queries?
>
> Do you mean simple queries against a single column family aka table?
>
> Or do you mean MDX style queries that look at multiple tables?
>
> if it's MDX style queries, many people extract data from Cassandra into a
> data warehouse that supports multi-dimensional cubes. This works well when
> the extracted data is a small subset and fits neatly in a data warehouse.
>
> As others have stated, Cassandra isn't great at ad-hoc. For MDX style
> queries, Cassandra wasn't designed for it. One thing we've done fo

RE: Support for ad-hoc query

2015-06-12 Thread SEAN_R_DURITY
I will note here that the limitations on ad-hoc querying (and aggregates) make 
it much more difficult to deal with data quality problems, QA testing, and 
similar efforts, especially where people are used to a more relational, ad-hoc 
model. We have often had to extract data from Cassandra to Hadoop for querying 
with Hive.

Example: “We found a few records with incorrect data. How many more records 
like that are out there?”


Sean Durity

From: Peter Lin [mailto:wool...@gmail.com]
Sent: Wednesday, June 10, 2015 8:17 AM
To: user@cassandra.apache.org
Subject: Re: Support for ad-hoc query


I'll second Jack's detailed response and add that you really should do some 
discovery to figure out what kinds of queries you may need to support.

It might not be possible, and often that is the case, but it's worthwhile to 
ask the end users what kind of reports they need to run. Allowing arbitrary 
ad-hoc queries is a known anti-pattern for Cassandra. If the system needs to 
query multiple cf to derive/calculate some result, using Cassandra alone isn't 
going to do it. You'll need some other system to give you better query 
capabilities, like Hive.

If you need data-warehouse-like features, look at http://www.kylin.io/ . They 
are doing some interesting things.

peter

On Wed, Jun 10, 2015 at 7:58 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
Knowing your queries in advance is a hard-core requirement for effective 
deployment of Cassandra. Ad hoc queries are a very clear anti-pattern for 
Cassandra. DSE Search does provide support for advanced, complex, and ad hoc 
queries. Stratio and TupleJump Stargate can also be used.

Back to the question of what you mean by ad hoc queries:

1. Do you expect real-time results, like sub-second, or are these long-running 
queries that might take seconds, 10 seconds or more, or even minutes to run?
2. Will they be very rare or quite frequent - how much load do you expect them 
to place on the cluster?
3. How complex do you expect them to be - how many clauses and operators?
4. What is their net cardinality - are they selecting just a few rows or many 
rows?
5. Do they have individual query clauses that select many rows even if the net 
combination of all select clauses is not so many rows?

The requirement to perform advanced, complex, and ad hoc queries using DSE 
Search or the other techniques will almost certainly require that you use 
moderately more capable hardware, especially more RAM, for each node, and 
probably more nodes as well to reduce the row count per node since ad hoc 
queries will tend to be compute-intensive based on number of rows on the node.

Yes, it can be done. No, it is not free or cheap. And, no, it does not come out 
of the box for a non-DSE Cassandra release. And, yes, you must address this 
requirement before deployment, not after deployment.


-- Jack Krupansky

On Wed, Jun 10, 2015 at 1:18 AM, Srinivasa T N <seen...@gmail.com> wrote:
Thanks guys for the inputs.

By ad-hoc queries I mean that I don't know the queries during cf design time.  
The data may be from a single cf or multiple cf.  (This feature may be required 
if I want to do analysis on the data stored in Cassandra; do you have any 
better ideas?)
Regards,
Seenu.

On Tue, Jun 9, 2015 at 5:57 PM, Peter Lin <wool...@gmail.com> wrote:

What do you mean by ad-hoc queries?

Do you mean simple queries against a single column family, aka table?

Or do you mean MDX-style queries that look at multiple tables?

If it's MDX-style queries, many people extract data from Cassandra into a data 
warehouse that supports multi-dimensional cubes. This works well when the 
extracted data is a small subset and fits neatly in a data warehouse.

As others have stated, Cassandra isn't great at ad-hoc. For MDX-style queries, 
Cassandra wasn't designed for it. One thing we've done for our own project is 
to combine Solr with our own fuzzy index to make ad-hoc queries against a 
single table more friendly.


On Tue, Jun 9, 2015 at 2:38 AM, Srinivasa T N <seen...@gmail.com> wrote:
Hi All,
   I have a web application running with my backend data stored in Cassandra.  
Now I want to do some analysis on the stored data, which requires some ad-hoc 
queries fired on Cassandra.  How can I do that?
Regards,
Seenu.








My dse-spark app goes well with spark-submit, BUT GOT STUCK while executing by sbt run or java jar run on my win-pc

2015-06-12 Thread 126
My dse-spark app runs fine with spark-submit, but gets stuck when executed via 
sbt run or as a java jar on my Windows PC, which means the driver process is on 
a machine other than a DSE cluster node. What frustrates me is that when I 
looked through the logs, I saw no errors; it just hangs there, and the stage 
progress always stays at 0/(some number bigger than 3000).
How can I find the problem?


Re: Question about "nodetool status ..." output

2015-06-12 Thread Jens Rantil
Hi Carlos,

Yes, I should have been more specific about that; basically all my primary
IDs are random UUIDs, so I find it very hard to believe that my data
model should be the problem here. I will run a full repair of the cluster,
execute a cleanup and recommission the node, then.

Thanks,
Jens

On Fri, Jun 12, 2015 at 2:38 PM, Carlos Rolo  wrote:

> Your data model also contributes to the balance (or lack thereof) of the
> cluster. If you have really bad data partitioning, Cassandra will not do
> any magic.
>
> Regarding that cluster, I would decommission the x.52 node and add it
> again with the correct configuration. After the bootstrap, run a cleanup.
> If it is still that off-balance, you need to look into your data model.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
> Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
> www.pythian.com
>
> On Fri, Jun 12, 2015 at 11:58 AM, Jens Rantil  wrote:
>
>> Hi,
>>
>> I have one node in my 5-node cluster that effectively owns 100% and it
>> looks like my cluster is rather imbalanced. Is it common to have it this
>> imbalanced for 4-5 nodes?
>>
>> My current output for a keyspace is:
>>
>> $ nodetool status myks
>> Datacenter: Cassandra
>> =====================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
>> UN  X.X.X.33  203.92 GB  256     41.3%             871968c9-1d6b-4f06-ba90-8b3a8d92dcf0  RAC1
>> UN  X.X.X.32  200.44 GB  256     34.2%             d7cacd89-8613-4de5-8a5e-a2c53c41ea45  RAC1
>> UN  X.X.X.51  197.17 GB  256     100.0%            344b0adf-2b5d-47c8-8881-9a3f56be6f3b  RAC1
>> UN  X.X.X.52  113.63 GB  1       46.3%             55daa807-af49-44c5-9742-fe456df621a1  RAC1
>> UN  X.X.X.31  204.49 GB  256     78.3%             48cb0782-6c9a-4805-9330-38e192b6b680  RAC1
>>
>> My keyspace has RF=3 and originally I added X.X.X.52 (num_tokens=1 was a
>> mistake) and then X.X.X.51. I haven't executed `nodetool cleanup` on any
>> nodes yet.
>>
>> For the curious, the full ring can be found here:
>> https://gist.github.com/JensRantil/57ee515e647e2f154779
>>
>> Cheers,
>> Jens
>>
>> --
>> Jens Rantil
>> Backend engineer
>> Tink AB
>>
>> Email: jens.ran...@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>>
>>
>
>
> --
>
>
>
>


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Re: Question about "nodetool status ..." output

2015-06-12 Thread Carlos Rolo
Your data model also contributes to the balance (or lack thereof) of the
cluster. If you have really bad data partitioning, Cassandra will not do
any magic.

Regarding that cluster, I would decommission the x.52 node and add it again
with the correct configuration. After the bootstrap, run a cleanup. If it is
still that off-balance, you need to look into your data model.
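
As a quick sanity check on the quoted `nodetool status myks` output: with
RF=3, the "Owns (effective)" column should add up to roughly 300%, and a
perfectly balanced 5-node cluster would show 60% per node. A small sketch of
that arithmetic in Python, using the percentages from the output in this
thread (the variable names are only illustrative):

```python
RF = 3  # replication factor of the keyspace, as stated in the thread

# "Owns (effective)" per node, copied from the nodetool output in this thread.
owns = {
    "X.X.X.33": 41.3,
    "X.X.X.32": 34.2,
    "X.X.X.51": 100.0,
    "X.X.X.52": 46.3,
    "X.X.X.31": 78.3,
}

total = sum(owns.values())
ideal = RF * 100.0 / len(owns)

print(f"total effective ownership: {total:.1f}% (expected about {RF * 100}%)")
print(f"ideal share per node: {ideal:.1f}%")
for address, pct in sorted(owns.items(), key=lambda item: item[1], reverse=True):
    # Positive deviation means the node holds more than an even share.
    print(f"{address}: {pct:5.1f}%  ({pct - ideal:+.1f} vs ideal)")
```

The total comes out at about 300%, so the figures are internally consistent;
the imbalance shows up in how far individual nodes sit from the 60% share.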

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Fri, Jun 12, 2015 at 11:58 AM, Jens Rantil  wrote:

> Hi,
>
> I have one node in my 5-node cluster that effectively owns 100% and it
> looks like my cluster is rather imbalanced. Is it common to have it this
> imbalanced for 4-5 nodes?
>
> My current output for a keyspace is:
>
> $ nodetool status myks
> Datacenter: Cassandra
> =====================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  X.X.X.33  203.92 GB  256     41.3%             871968c9-1d6b-4f06-ba90-8b3a8d92dcf0  RAC1
> UN  X.X.X.32  200.44 GB  256     34.2%             d7cacd89-8613-4de5-8a5e-a2c53c41ea45  RAC1
> UN  X.X.X.51  197.17 GB  256     100.0%            344b0adf-2b5d-47c8-8881-9a3f56be6f3b  RAC1
> UN  X.X.X.52  113.63 GB  1       46.3%             55daa807-af49-44c5-9742-fe456df621a1  RAC1
> UN  X.X.X.31  204.49 GB  256     78.3%             48cb0782-6c9a-4805-9330-38e192b6b680  RAC1
>
> My keyspace has RF=3 and originally I added X.X.X.52 (num_tokens=1 was a
> mistake) and then X.X.X.51. I haven't executed `nodetool cleanup` on any
> nodes yet.
>
> For the curious, the full ring can be found here:
> https://gist.github.com/JensRantil/57ee515e647e2f154779
>
> Cheers,
> Jens
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
>


Question regarding concurrent bootstrapping

2015-06-12 Thread Jens Rantil
Hi,

Let's say I have an existing cluster and do the following:

   1. I start a new joining node (A). It enters state "Up/Joining".
   Streaming automatically starts to this node.
   2. I wait two minutes (best practice for bootstrapping).
   3. I start a second node (B) to join the cluster. It takes over some of
   A's previously allocated parts of the ring and enters state "Up/Joining".
   Streaming automatically starts to this node.

Will streaming to A of data that A is no longer responsible for (after B
joined) stop immediately? That is, after (3), will the data streamed to A be
only what it is responsible for?

This matters for planning when one is expanding a cluster with multiple
smaller nodes.

Thanks,
Jens

-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Re: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Andres de la Peña
I really appreciate your interest.

Well, the first recommendation is to not use it unless you need it, because
a properly denormalized Cassandra model is almost always preferable to
indexing. Lucene indexing is a good option when there is no viable
denormalization alternative. This is the case for range queries over
multiple dimensions, full-text search or maybe complex boolean predicates.
It's also appropriate for Spark/Hadoop jobs that map over a small fraction
of the total number of rows in a certain table, if you can pay the cost of
indexing.

Lucene indexes run inside C*, so users should closely monitor the amount of
used memory. It's also a good idea to put the Lucene directory files on a
disk separate from those used by C* itself. Additionally, you should
consider that the write throughput of indexed tables will be appreciably
reduced, maybe to a few thousand rows per second.

It's really hard to estimate the amount of resources needed by the index
due to the great variety of indexing and querying options that Lucene offers,
so the only thing we can suggest is to empirically find the optimal setup
for your use case.

2015-06-12 12:00 GMT+02:00 Carlos Rolo :

> Seems like an interesting tool!
>
> What operational recommendations would you make to users of this tool
> (Extra hardware capacity, extra metrics to monitor, etc)?
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
> Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
> www.pythian.com
>
> On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña wrote:
>
>> Unfortunately, we haven't published any benchmarks yet, but we plan
>> to do so as soon as possible. However, you can expect behavior
>> similar to that of Elasticsearch or Solr, with some overhead due to the
>> need to index both Cassandra's row key and the partition's token.
>> You can also take a look at this presentation to see how cluster
>> distribution is done.
>>
>> 2015-06-12 0:45 GMT+02:00 Ben Bromhead :
>>
>>> Looks awesome, do you have any examples/benchmarks of using these
>>> indexes for various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?
>>>
>>> On 10 June 2015 at 09:08, Andres de la Peña 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> With the release of Cassandra 2.1.6, Stratio is glad to present its
>>>> open source Lucene-based implementation of C* secondary indexes as a
>>>> plugin that can be attached to Apache Cassandra. Before the above
>>>> changes, Lucene index was distributed inside a fork of Apache Cassandra,
>>>> with all the difficulties implied. As of now, the fork is discontinued
>>>> and new users should use the recently created plugin, which maintains
>>>> all the features of Stratio Cassandra.
>>>>
>>>> Stratio's Lucene index extends Cassandra’s functionality to provide
>>>> near real-time distributed search engine capabilities such as with
>>>> ElasticSearch or Solr, including full text search capabilities, free
>>>> multivariable search, relevance queries and field-based sorting. Each node
>>>> indexes its own data, so high availability and scalability is guaranteed.
>>>>
>>>> We hope this will be useful to the Apache Cassandra community.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>>
>>>> Andrés de la Peña
>>>>
>>>> Avenida de Europa, 26. Ática 5. 3ª Planta
>>>> 28224 Pozuelo de Alarcón, Madrid
>>>> Tel: +34 91 352 59 42 // @stratiobd
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ben Bromhead
>>>
>>> Instaclustr | www.instaclustr.com | @instaclustr
>>>  | (650) 284 9692
>>>
>>
>>
>>
>> --
>>
>> Andrés de la Peña
>>
>>
>> 
>> Avenida de Europa, 26. Ática 5. 3ª Planta
>> 28224 Pozuelo de Alarcón, Madrid
>> Tel: +34 91 352 59 42 // @stratiobd
>>
>
>
> --
>
>
>
>


-- 

Andrés de la Peña



Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd


Re: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Carlos Rolo
Seems like an interesting tool!

What operational recommendations would you make to users of this tool
(Extra hardware capacity, extra metrics to monitor, etc)?

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Fri, Jun 12, 2015 at 11:07 AM, Andres de la Peña 
wrote:

> Unfortunately, we haven't published any benchmarks yet, but we plan
> to do so as soon as possible. However, you can expect behavior
> similar to that of Elasticsearch or Solr, with some overhead due to the
> need to index both Cassandra's row key and the partition's token.
> You can also take a look at this presentation to see how cluster
> distribution is done.
>
> 2015-06-12 0:45 GMT+02:00 Ben Bromhead :
>
>> Looks awesome, do you have any examples/benchmarks of using these indexes
>> for various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?
>>
>> On 10 June 2015 at 09:08, Andres de la Peña 
>> wrote:
>>
>>> Hi all,
>>>
>>> With the release of Cassandra 2.1.6, Stratio is glad to present its
>>> open source Lucene-based implementation of C* secondary indexes as a
>>> plugin that can be attached to Apache Cassandra. Before the above changes,
>>> Lucene index was distributed inside a fork of Apache Cassandra, with all
>>> the difficulties implied. As of now, the fork is discontinued and new users
>>> should use the recently created plugin, which maintains all the features
>>> of Stratio Cassandra.
>>>
>>>
>>>
>>> Stratio's Lucene index extends Cassandra’s functionality to provide near
>>> real-time distributed search engine capabilities such as with ElasticSearch
>>> or Solr, including full text search capabilities, free multivariable
>>> search, relevance queries and field-based sorting. Each node indexes its
>>> own data, so high availability and scalability is guaranteed.
>>>
>>>
>>> We hope this will be useful to the Apache Cassandra community.
>>>
>>>
>>> Regards,
>>>
>>> --
>>>
>>> Andrés de la Peña
>>>
>>>
>>> 
>>> Avenida de Europa, 26. Ática 5. 3ª Planta
>>> 28224 Pozuelo de Alarcón, Madrid
>>> Tel: +34 91 352 59 42 // @stratiobd
>>>
>>
>>
>>
>> --
>>
>> Ben Bromhead
>>
>> Instaclustr | www.instaclustr.com | @instaclustr
>>  | (650) 284 9692
>>
>
>
>
> --
>
> Andrés de la Peña
>
>
> 
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 352 59 42 // @stratiobd
>


Question about "nodetool status ..." output

2015-06-12 Thread Jens Rantil
Hi,

I have one node in my 5-node cluster that effectively owns 100% and it
looks like my cluster is rather imbalanced. Is it common to have it this
imbalanced for 4-5 nodes?

My current output for a keyspace is:

$ nodetool status myks
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  X.X.X.33  203.92 GB  256     41.3%             871968c9-1d6b-4f06-ba90-8b3a8d92dcf0  RAC1
UN  X.X.X.32  200.44 GB  256     34.2%             d7cacd89-8613-4de5-8a5e-a2c53c41ea45  RAC1
UN  X.X.X.51  197.17 GB  256     100.0%            344b0adf-2b5d-47c8-8881-9a3f56be6f3b  RAC1
UN  X.X.X.52  113.63 GB  1       46.3%             55daa807-af49-44c5-9742-fe456df621a1  RAC1
UN  X.X.X.31  204.49 GB  256     78.3%             48cb0782-6c9a-4805-9330-38e192b6b680  RAC1

My keyspace has RF=3 and originally I added X.X.X.52 (num_tokens=1 was a
mistake) and then X.X.X.51. I haven't executed `nodetool cleanup` on any
nodes yet.
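
As a sanity check on the numbers above, here is a minimal sketch of the arithmetic. It assumes a single datacenter, where the effective ownership column should add up to roughly RF x 100% and a perfectly balanced 5-node ring would give each node RF/5 of the data. The variable names are only illustrative.

```python
# Effective ownership per node, copied from the `nodetool status myks` output above.
RF = 3
owns = {
    "X.X.X.33": 41.3,
    "X.X.X.32": 34.2,
    "X.X.X.51": 100.0,
    "X.X.X.52": 46.3,
    "X.X.X.31": 78.3,
}

# With replication, effective ownership is expected to sum to about RF * 100%.
total = sum(owns.values())

# A perfectly balanced ring would give every node the same share.
ideal = RF * 100.0 / len(owns)

# Positive deviation: the node owns more than its fair share; negative: less.
deviation = {address: pct - ideal for address, pct in owns.items()}

print(f"sum of effective ownership: {total:.1f}% (expected about {RF * 100}%)")
print(f"ideal share per node: {ideal:.1f}%")
for address, delta in sorted(deviation.items(), key=lambda item: item[1]):
    print(f"{address}: {delta:+.1f} points from ideal")
```

The percentages do add up to about 300%, so the output is internally consistent for RF=3. The imbalance shows up as the spread around the 60% ideal, with X.X.X.51 furthest above it and X.X.X.32 furthest below.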

For the curious, the full ring can be found here:
https://gist.github.com/JensRantil/57ee515e647e2f154779

Cheers,
Jens

-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Re: Lucene index plugin for Apache Cassandra

2015-06-12 Thread Andres de la Peña
Unfortunately, we haven't published any benchmarks yet, but we plan
to do so as soon as possible. However, you can expect behavior
similar to that of Elasticsearch or Solr, with some overhead due to the
need to index both Cassandra's row key and the partition's token.
You can also take a look at this presentation to see how cluster
distribution is done.

2015-06-12 0:45 GMT+02:00 Ben Bromhead :

> Looks awesome, do you have any examples/benchmarks of using these indexes
> for various cluster sizes e.g. 20 nodes, 60 nodes, 100s+?
>
> On 10 June 2015 at 09:08, Andres de la Peña  wrote:
>
>> Hi all,
>>
>> With the release of Cassandra 2.1.6, Stratio is glad to present its open
>> source Lucene-based implementation of C* secondary indexes as a plugin
>> that can be attached to Apache Cassandra. Before the above changes, Lucene
>> index was distributed inside a fork of Apache Cassandra, with all the
>> difficulties implied. As of now, the fork is discontinued and new users
>> should use the recently created plugin, which maintains all the features
>> of Stratio Cassandra.
>>
>>
>>
>> Stratio's Lucene index extends Cassandra’s functionality to provide near
>> real-time distributed search engine capabilities such as with ElasticSearch
>> or Solr, including full text search capabilities, free multivariable
>> search, relevance queries and field-based sorting. Each node indexes its
>> own data, so high availability and scalability is guaranteed.
>>
>>
>> We hope this will be useful to the Apache Cassandra community.
>>
>>
>> Regards,
>>
>> --
>>
>> Andrés de la Peña
>>
>>
>> 
>> Avenida de Europa, 26. Ática 5. 3ª Planta
>> 28224 Pozuelo de Alarcón, Madrid
>> Tel: +34 91 352 59 42 // @stratiobd
>>
>
>
>
> --
>
> Ben Bromhead
>
> Instaclustr | www.instaclustr.com | @instaclustr
>  | (650) 284 9692
>



-- 

Andrés de la Peña



Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd