Re: query by column size
There is no automatic indexing in Cassandra. There are secondary indexes, but not for cases like this. You could use a solution like DSE to get data automatically indexed in Solr on each node as soon as it arrives; then you could run such a query against Solr. If the query can be slow, you could run an MR job over all rows, filtering out the ones you want. []s

From: user@cassandra.apache.org
Subject: Re: query by column size

Greetings, I have one column family with 10 columns; in one of the columns we store xml/json. Is there a way I can query that column where the size is greater than 50kb, assuming I have an index on that column? thanks CV.
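Since there is no server-side way to index on value size, both suggestions above boil down to scanning and filtering on the client. A minimal sketch of that filter step in plain Python (the row list stands in for whatever a token-range scan, Spark job, or MR job would feed you; the field names are made up for illustration):

```python
# Client-side filter for "rows whose xml/json column is over 50 KB".
# In a real job the rows would come from a full scan (driver paging,
# Spark, or MapReduce); a small in-memory list stands in for them here.
THRESHOLD = 50 * 1024  # 50 KB

rows = [
    {"id": 1, "payload": "x" * 1024},         # 1 KB  -> filtered out
    {"id": 2, "payload": "y" * (60 * 1024)},  # 60 KB -> kept
]

def rows_over_threshold(rows, threshold=THRESHOLD):
    # len() of the encoded text approximates the stored column size
    return [r["id"] for r in rows
            if len(r["payload"].encode("utf-8")) > threshold]

print(rows_over_threshold(rows))  # -> [2]
```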
sstables remain after compaction
Hello, Pre-1.0, after sstables were compacted, the old sstables would remain until the first GC kicked in. As of Cassandra 1.0, the sstables are removed as soon as compaction is done. Is it possible for old sstables to remain for other reasons (e.g. an in-flight read still referencing them)? Thank you. Jason
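Jason's "read referencing" guess is plausible: an obsolete sstable's files can only be removed once no in-flight read still holds a reference to them. A toy reference-counting sketch of that lifecycle (illustrative only, not Cassandra's actual code):

```python
class SSTableRef:
    """Toy model: a file is deleted only when compaction has obsoleted it
    AND the last in-flight read has released its reference."""

    def __init__(self, name):
        self.name = name
        self.refs = 0          # in-flight reads using this sstable
        self.obsolete = False  # compaction replaced it
        self.deleted = False

    def acquire(self):
        self.refs += 1

    def release(self):
        self.refs -= 1
        self._maybe_delete()

    def mark_obsolete(self):
        self.obsolete = True
        self._maybe_delete()

    def _maybe_delete(self):
        if self.obsolete and self.refs == 0 and not self.deleted:
            self.deleted = True  # a real system would unlink the file here

s = SSTableRef("old-sstable")
s.acquire()        # a long-running read starts
s.mark_obsolete()  # compaction finishes; the file must linger
assert not s.deleted
s.release()        # the read completes; now the file can go
assert s.deleted
```

Under this model, an old sstable lingering briefly after compaction is expected whenever reads overlap the compaction; it should disappear once those reads finish.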
Re: best supported spark connector for Cassandra
Not for sure ;) If you need Cassandra support I can forward you to someone to talk to at Pythian.

Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Tel: 1649 www.pythian.com

On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

Actually, I am not the one looking for support, but I thank you a lot anyway. From your message I guess the answer is yes: Datastax is not the only Cassandra vendor offering support and changing the official Cassandra source at this moment, is this right?

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you.

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net:

Thanks for the hint Gaspar. Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0? I had an interest in knowing more about Stratio when I was working at a start-up. Now, at a blue chip, it seems one of the hardest obstacles to using Cassandra in a project is the need for a team supporting it, and people seem especially concerned about how many vendors an open source solution has available to provide support. This seems to be an advantage of HBase, as there are many vendors supporting it, but I wonder if Stratio can be considered an alternative to Datastax regarding Cassandra support? It's not my call here to decide anything, but as part of the community it helps to have this business scenario clear. I could say Cassandra could be the best-fit technical solution for some projects, but sometimes non-technical factors are in the game, like this need for having more than one vendor available...
From: gmu...@stratio.com
Subject: Re: best supported spark connector for Cassandra

My suggestion is to use Java or Scala instead of Python. For Java/Scala, both the Datastax and Stratio drivers are valid and similar options. As far as I know, they both take care of data locality and are not based on the Hadoop interface. The advantage of Stratio Deep is that it allows you to integrate Spark not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others as well. Stratio has forked Cassandra to include some additional features, such as Lucene-based secondary indexes, so the Stratio driver works fine with Apache Cassandra and also with their fork. You can find some examples of using Deep here: https://github.com/Stratio/deep-examples Please do not hesitate to contact us if you need some help with Stratio Deep.

2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com:

I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

I just finished a Scala course; nice exercise to check what I learned :D Thanks for the answer!

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark.
Take some time to dig down into their code to understand the logic.

On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

Taking the opportunity, since Spark was being discussed in another thread, I decided to start a new one, as I have an interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better Hadoop, and it seems the market is going this way now. I can also see I have many more options for deciding how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this Java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I also have found Python Cassandra support in Spark's repo, but it seems experimental yet: https://github.com/apache/spark/tree/master/examples/src/main/python Finally, I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have also forked Cassandra; I am still a little confused about it. Question: which driver should I use if I want to use Java? And which if I want to use Python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using Python and/or C++, but I wonder if it doesn't pay to use the Java driver instead. Thanks in advance
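To make the data-locality point above concrete: the connector builds each Spark partition from a set of token ranges and reports the Cassandra replicas that own those ranges as the partition's preferred locations, so Spark tries to schedule the task on a node that already holds the data. A simplified sketch of that idea in Python (the replica map and addresses are made up; the real logic lives in the CassandraRDD code linked above):

```python
# Simplified model of "preferred locations" for a Spark partition.
# replicas: token range -> hosts that own a replica of that range.
replicas = {
    (0, 100):   ["10.0.0.1", "10.0.0.2"],
    (100, 200): ["10.0.0.2", "10.0.0.3"],
    (200, 300): ["10.0.0.3", "10.0.0.1"],
}

def preferred_locations(token_ranges):
    """Hosts that hold ALL of the partition's token ranges, i.e. nodes
    where the task can run without fetching data over the network."""
    hosts = set(replicas[token_ranges[0]])
    for r in token_ranges[1:]:
        hosts &= set(replicas[r])
    return hosts

# Only 10.0.0.2 replicates both ranges, so that's where the task
# would ideally be scheduled.
print(sorted(preferred_locations([(0, 100), (100, 200)])))  # -> ['10.0.0.2']
```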
Re: Added new nodes to cluster but no streams
Hi Batranut, A few minutes between each node will do. Cheers, Jens

On Fri, Feb 13, 2015 at 1:12 PM, Batranut Bogdan batra...@yahoo.com wrote:

Hello, when adding a new node to the cluster, do I need to wait for each node to receive all the data from the other nodes in the cluster, or just wait a few minutes before I start each node?

On Thursday, February 12, 2015 7:21 PM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Feb 12, 2015 at 3:20 AM, Batranut Bogdan batra...@yahoo.com wrote: I have added new nodes to the existing cluster. In OpsCenter I do not see any streams... I presume that the new nodes get the data from the rest of the cluster via streams. The existing cluster is of TB magnitude, and the space used on the new nodes is ~90 GB. I must admit that I have restarted the new nodes several times after adding them. Does this affect bootstrap? AFAIK the new nodes should start loading a part of all the data in the existing cluster.

If it stays like this for a while, it sounds like your bootstraps have hung. Note that in general you should add nodes one at a time; especially if you are on a version without the fix for CASSANDRA-2434, adding multiple nodes at once might in theory contribute to their bootstraps hanging. Stop cassandra on the joining nodes, wipe/move aside their data directories, and try again one at a time. =Rob

-- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919 Twitter https://twitter.com/tink
Re: best supported spark connector for Cassandra
Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you.

-- Gaspar Muñoz
Re: Recommissioned a node
I created an issue for this: https://issues.apache.org/jira/browse/CASSANDRA-8801

On Thu, Feb 12, 2015 at 10:18 AM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Feb 12, 2015 at 7:04 AM, Eric Stevens migh...@gmail.com wrote: IMO, especially with the threat of unrecoverable consistency violations, this should be a critical bug.

You should file a JIRA, and let the list know what it is? :D If I'm honest, I was never sure whether it was just me being unreasonably literal in presuming that decommission made the node forget its prior state. It is nice to hear from other operators that this matches their expectations. But yes, the current behavior seems to have risks that forgetting doesn't, and I don't understand what benefits (if any) it has. As a brief aside, this is Yet Another Reason why you probably don't ever want a Cassandra node to automatically start on boot, or restart: if you don't know its configuration, it could join a cluster, which might be Meaningfully Bad in some circumstances. =Rob
Re: best supported spark connector for Cassandra
Actually, I am not the one looking for support, but I thank you a lot anyway. But from your message I guess the answer is yes, Datastax is not the only Cassandra vendor offering support and changing official Cassandra source at this moment, is this right?
Re: How to speed up SELECT * query in Cassandra
If you are using Spark you need to be _really_ careful about your tombstones. In our experience a single partition with too many tombstones can take down a whole batch job (until something like https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a major obstacle for us to overcome when using Spark. Cheers, Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky ho...@avast.com wrote:

Well, I always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do a full table scan. I have to say that our experience is that Cassandra is perfect for writing, and for reading specific values by key, but definitely not for reading all of the data out of it. Some of our projects found out that doing that on a non-trivial dataset in a timely manner is close to impossible in many situations. We are slowly moving to storing the data in HDFS and possibly reprocessing it on a daily basis for such use cases (statistics). This is nothing against Cassandra; it cannot be perfect for everything. But I am really interested in how it can work well with Spark/Hadoop, where you basically need to read all the data as well (as far as I understand it). Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:

"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved, before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

"cassandra makes a very poor datawarehouse or long term time series store" Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map-reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop, is this right? I think the problem with Spark + Cassandra or Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, and aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of Cassandra's distributed nature vs partitioning data on Hadoop makes Spark on HDFS actually faster than on Cassandra. -- *Colin Clark* +1 612 859 6129 Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:

On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra Enterprise, you can use Hive, AFAIK.

Even better, you can use Spark/Shark with DSE.
Cheers, Jens

-- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919 Twitter https://twitter.com/tink
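Jens's tombstone warning at the top of this thread is easy to model: a scan has to step over every tombstone in a partition before it can return the live cells, so the cost of a full-table read scales with the total cells written, not the cells surviving. A toy illustration in Python (not Cassandra's actual read path):

```python
# Toy model of why tombstone-heavy partitions hurt full scans: the reader
# touches every cell, dead or alive, to find the few live ones.
def scan_partition(cells):
    """cells: list of (value, is_tombstone). Returns the live values and
    how many cells the reader had to touch to find them."""
    live, touched = [], 0
    for value, is_tombstone in cells:
        touched += 1
        if not is_tombstone:
            live.append(value)
    return live, touched

# 3 live cells buried under 100,000 tombstones
cells = [(None, True)] * 100_000 + [(1, False), (2, False), (3, False)]
live, touched = scan_partition(cells)
print(len(live), touched)  # 3 live cells, 100003 cells touched
```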
Re: best supported spark connector for Cassandra
For SQL queries on Cassandra I used to use Presto: https://prestodb.io/ It's a nice tool from FB and seems to work well with Cassandra. You can use their JDBC driver with your favourite java SQL tool. Inside my apps, I never needed to use SQL queries. []s
Re: best supported spark connector for Cassandra
I used to use Calliope, which was really awesome before DataStax's native integration with Spark. Now I'm quite happy with the official DataStax spark connector; it's very straightforward to use. I never tried to use these drivers with Java, though; I'd suggest you use them with Scala, which is the best option for writing Spark jobs.
Re: best supported spark connector for Cassandra
Hi Marcelo, Were you able to use the Spark SQL features of the Cassandra connector? I couldn't make a .jar that wouldn't conflict with the Spark SQL native .jar… So I ended up using only the basic features, and cannot use SQL queries.

On Feb 13, 2015, at 7:49 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: I used to use calliope, which was really awesome before DataStax native integration with Spark. Now I'm quite happy with the official DataStax spark connector; it's very straightforward to use. I never tried to use these drivers with Java though; I'd suggest you use them with Scala, which is the best option to write spark jobs.

On Fri, Feb 13, 2015 at 12:12 PM, Carlos Rolo r...@pythian.com wrote: Not for sure ;) If you need Cassandra support I can forward you to someone to talk to at Pythian. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com

On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Actually, I am not the one looking for support, but I thank you a lot anyway. But from your message I guess the answer is yes, Datastax is not the only Cassandra vendor offering support and changing official Cassandra source at this moment, is this right?

From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you.

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net: Thanks for the hint Gaspar.
Re: query by column size
I already have a secondary index on that column, but how do I query that column by size? thanks chandra

On Fri, Feb 13, 2015 at 3:30 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: There is no automatic indexing in Cassandra. There are secondary indexes, but not for these cases. You could use a solution like DSE, to get data automatically indexed on Solr, in each node, as soon as data comes in. Then you could do such a query on Solr. If the query can be slow, you could run an MR job over all rows, filtering the ones you want. []s

From: user@cassandra.apache.org Subject: Re: query by column size

Greetings, I have one column family with 10 columns; in one of the columns we store xml/json. Is there a way I can query that column where size > 50kb, assuming I have an index on that column? thanks CV.
Re: query by column size
On Fri, Feb 13, 2015 at 11:18 AM, chandra Varahala hadoopandcassan...@gmail.com wrote: I already have a secondary index on that column, but how do I query that column by size?

You can't. If this is a query that you want to do regularly and efficiently, I suggest creating a second table to act as an index (or materialized view of sorts). Whenever your application writes a row to the original table with a column > 50kb, it should also update the second table. -- Tyler Hobbs DataStax http://datastax.com/
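A minimal sketch of the dual-write pattern Tyler describes, in Python pseudologic rather than a real driver call. The table names (documents, large_documents) and the single 'large' bucket are hypothetical; the 50 KB threshold comes from the question in this thread.

```python
# Sketch of the "second table as index" pattern: the application decides at
# write time whether the row also belongs in the index table.

SIZE_THRESHOLD = 50 * 1024  # 50 KB, per the original question

def payload_size(payload: str) -> int:
    """Size of the stored text in bytes (UTF-8, as it would be stored)."""
    return len(payload.encode("utf-8"))

def statements_for_write(doc_id: str, payload: str):
    """Return the CQL statements the application should run for one write.

    The first statement always updates the main table; the second is added
    only when the payload crosses the threshold, so "all large docs" becomes
    a plain partition lookup on large_documents instead of an impossible
    size-based secondary-index query.
    """
    stmts = [
        ("INSERT INTO documents (doc_id, payload) VALUES (?, ?)",
         (doc_id, payload)),
    ]
    if payload_size(payload) > SIZE_THRESHOLD:
        # A single synthetic partition is fine for a modest number of large
        # docs; shard the partition key if this set grows unbounded.
        stmts.append(
            ("INSERT INTO large_documents (bucket, doc_id, size) "
             "VALUES ('large', ?, ?)",
             (doc_id, payload_size(payload))),
        )
    return stmts
```

Note the trade-off: the application owns the consistency between the two tables, so deletes and overwrites must update both.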
Re: Added new nodes to cluster but no streams
Got it, thank you very much.

On Friday, February 13, 2015 4:04 PM, Jens Rantil jens.ran...@tink.se wrote: Hi Batranut, A few minutes between each node will do. Cheers, Jens

On Fri, Feb 13, 2015 at 1:12 PM, Batranut Bogdan batra...@yahoo.com wrote: Hello, When adding a new node to the cluster, do I need to wait for each node to receive all the data from the other nodes in the cluster, or just wait a few minutes before I start each node?

On Thursday, February 12, 2015 7:21 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Feb 12, 2015 at 3:20 AM, Batranut Bogdan batra...@yahoo.com wrote: I have added new nodes to the existing cluster. In Opscenter I do not see any streams... I presume that the new nodes get the data from the rest of the cluster via streams. The existing cluster has TB magnitude, and space used in the new nodes is ~90 GB. I must admit that I have restarted the new nodes several times after adding them. Does this affect bootstrap? AFAIK the new nodes should start loading a part of all the data in the existing cluster.

If it stays like this for a while, it sounds like your bootstraps have hung. Note that in general you should add nodes one at a time; especially if you are on a version without the fix for CASSANDRA-2434, in theory adding multiple nodes at once might contribute to their bootstraps hanging. Stop cassandra on the joining nodes, wipe/move aside their data directories, and try again one at a time. =Rob

-- Jens Rantil Backend engineer Tink AB Email: jens.rantil@tink.se Phone: +46 708 84 18 32 Web: www.tink.se
Re: Pagination support on Java Driver Query API
The syntax suggested by Ondrej is not working in some cases in 2.0.11, and I logged an issue for the same: https://issues.apache.org/jira/browse/CASSANDRA-8797 Thanks Ajay

On Feb 12, 2015 11:01 PM, Bulat Shakirzyanov bulat.shakirzya...@datastax.com wrote: Fixed my Mail.app settings so you can see my actual name, sorry.

On Feb 12, 2015, at 8:55 AM, DataStax bulat.shakirzya...@datastax.com wrote: Hello, As was mentioned earlier, the Java driver doesn't actually perform pagination. Instead, it uses the Cassandra native protocol to set the page size of the result set (https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v2.spec#L699-L730). When Cassandra sends the result back to the Java driver, it includes a binary token. This token represents the paging state. To fetch the next page, the driver re-executes the same statement with the original page size and the paging state attached. If there is another page available, Cassandra responds with a new paging state that can be used to fetch it. You could also try reporting this issue on the Cassandra user mailing list.

On Feb 12, 2015, at 8:35 AM, Eric Stevens migh...@gmail.com wrote: I don't know what the shape of the page state data is deep inside the JavaDriver; I've actually tried to dig into that in the past to see if I could reproduce it as a general-purpose any-query kind of thing. I gave up before I fully understood it, but I think it's actually a handle to an in-memory state maintained by the coordinator, which is only maintained for the lifetime of the statement (i.e. it's not stateless paging). That would make it a bad candidate for stateless paging scenarios such as REST requests, where a typical setup would load balance across HTTP hosts, never mind across coordinators.
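The protocol flow Bulat describes can be sketched as a toy model: the client keeps only an opaque token between requests, and each page is an independent round trip. FakeServer below stands in for Cassandra; real drivers expose the same idea through their result-set paging APIs, not this exact interface.

```python
# Toy model of native-protocol paging: execute(page_size, paging_state)
# returns one page plus the token needed to fetch the next.

class FakeServer:
    def __init__(self, rows):
        self.rows = rows

    def execute(self, page_size, paging_state=None):
        """Return (rows, next_paging_state). The state is just an offset
        here; in Cassandra it is an opaque blob the client must not parse."""
        start = int(paging_state) if paging_state is not None else 0
        page = self.rows[start:start + page_size]
        nxt = start + page_size
        next_state = str(nxt) if nxt < len(self.rows) else None
        return page, next_state

def fetch_all(server, page_size):
    """Drive the paging loop exactly as a client would: re-execute with the
    returned state until the server stops handing one back."""
    out, state = [], None
    while True:
        page, state = server.execute(page_size, state)
        out.extend(page)
        if state is None:
            return out
```

This also makes Eric's objection concrete: if the real paging state were a handle to coordinator-local memory rather than a self-contained offset like this one, the token would be useless once a load balancer routed the next HTTP request to a different host.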
It shouldn't be too much work to abstract this basic idea for manual paging into a general-purpose class that takes List[ClusteringKeyDef[T, O:Ordering]], and can produce a connection-agnostic PageState from a ResultSet or Row, or accepts a PageState to produce a WHERE CQL fragment. Also RE: possibly multiple queries to satisfy a page - yes, that's unfortunate. Since you're on 2.0.11, see Ondřej's answer to avoid it.

On Thu, Feb 12, 2015 at 8:13 AM, Ajay ajay.ga...@gmail.com wrote: Thanks Eric. I figured out the same but didn't get time to put it on the mail. Thanks. But it is highly tied to how data is stored internally in Cassandra: basically how partition keys are used to distribute data (less likely to change, and we are not directly dependent on the partitioning algorithm) and how clustering keys are used to sort the data within a partition (multi-level sorting, and hence the restrictions on the ORDER BY clause), which I think could change down the lane in Cassandra 3.x or 4.x in a different way for some better storage or retrieval. That said, I am hesitant to implement this client-side logic for pagination because a) pages 2+ might need more than one query to Cassandra; b) the implementation is tied to Cassandra internal storage details which can change (though not often); c) in our case, we are building REST APIs which will be deployed on Tomcat clusters, hence whatever we cache to support pagination needs to be cached in a distributed way for failover support. It (pagination support) is best done at the server side, like ROWNUM in SQL, or better done in the Java driver to hide the internal details, where it can be optimized better as the server sends the paging state to the driver. Thanks Ajay

On Feb 12, 2015 8:22 PM, Eric Stevens migh...@gmail.com wrote: Your page state then needs to track the last ck1 and last ck2 you saw. Pages 2+ will end up needing up to two queries if the first query doesn't fill the page size.
CREATE TABLE foo (
  partitionkey int,
  ck1 int,
  ck2 int,
  col1 int,
  col2 int,
  PRIMARY KEY ((partitionkey), ck1, ck2)
) WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc);

INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,1,1,1);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,2,2,2);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,3,3,3);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,1,4,4);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,2,5,5);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,3,6,6);

If you're pulling the whole of partition 1 and your page size is 2, your first page looks like:

*PAGE 1*

SELECT * FROM foo WHERE partitionkey = 1 LIMIT 2;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   1 |   3 |    3 |    3
            1 |   1 |   2 |    2 |    2

You got enough rows to satisfy the page. Your page state is taken from the last row: (ck1=1, ck2=2)

*PAGE 2*

Notice that you have a page state, and add some limiting clauses on the statement:

SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 1
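Eric's two-query scheme for the table above can be sketched as pure logic, with no driver involved. The clause builder and the in-memory stand-in below assume the example schema's (ck1 asc, ck2 desc) clustering order; nothing here talks to a real cluster.

```python
# Manual paging state for PRIMARY KEY ((partitionkey), ck1, ck2)
# WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc).

def next_page_clauses(last_ck1, last_ck2):
    """Build the WHERE fragments for the up-to-two queries of the next page.

    ck2 is DESC, so "later" rows within the same ck1 group have *smaller*
    ck2; ck1 is ASC, so later groups have larger ck1. Run the first query;
    if it returns fewer rows than the page size, top up with the second.
    """
    same_ck1 = f"ck1 = {last_ck1} AND ck2 < {last_ck2}"
    next_ck1 = f"ck1 > {last_ck1}"
    return same_ck1, next_ck1

def fetch_page(rows, last, size):
    """Pure-Python stand-in for running those two queries against the
    partition; rows are (ck1, ck2) tuples already in table order."""
    last_ck1, last_ck2 = last
    page = [r for r in rows if r[0] == last_ck1 and r[1] < last_ck2][:size]
    if len(page) < size:
        page += [r for r in rows if r[0] > last_ck1][:size - len(page)]
    return page
```

Applied to the six sample rows with page size 2 and the page state (ck1=1, ck2=2), this yields (1,1) from the first query and tops up with (2,3) from the second, which is exactly the "up to two queries per page" cost Ajay objects to.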
Re: sstables remain after compaction
On Fri, Feb 13, 2015 at 1:35 AM, Jason Wee peich...@gmail.com wrote: Pre cassandra 1.0, after sstables are compacted, the old sstables will remain until the first gc kicks in. For cassandra 1.0, the sstables will be removed after compaction is done. Will it be possible the old sstables remain due to whatever reasons (e.g. read referencing)?

If I understand your question properly, the answer is "no", or "not for longer than the duration of a running thread". If compaction is working properly in a post-needs-the-java-GC-to-delete-files version of Cassandra, the input files should be deleted ASAP. If a thread is actively accessing a file, I would imagine deletion blocks for that long, but that's not likely to be very long. =Rob
Re: Added new nodes to cluster but no streams
Hello, When adding a new node to the cluster, do I need to wait for each node to receive all the data from the other nodes in the cluster, or just wait a few minutes before I start each node?

On Thursday, February 12, 2015 7:21 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Feb 12, 2015 at 3:20 AM, Batranut Bogdan batra...@yahoo.com wrote: I have added new nodes to the existing cluster. In Opscenter I do not see any streams... I presume that the new nodes get the data from the rest of the cluster via streams. The existing cluster has TB magnitude, and space used in the new nodes is ~90 GB. I must admit that I have restarted the new nodes several times after adding them. Does this affect bootstrap? AFAIK the new nodes should start loading a part of all the data in the existing cluster.

If it stays like this for a while, it sounds like your bootstraps have hung. Note that in general you should add nodes one at a time; especially if you are on a version without the fix for CASSANDRA-2434, in theory adding multiple nodes at once might contribute to their bootstraps hanging. Stop cassandra on the joining nodes, wipe/move aside their data directories, and try again one at a time. =Rob
Re: sstables remain after compaction
Thanks Rob. I trigger user-defined compaction on big sstables (big as in the size per sstable reaches more than 50 GB, some 100 GB). Occasionally, after user-defined compaction, I see some sstables remain, even after 12 hours have elapsed. You mentioned a thread; could you tell which threads those are, or perhaps highlight them in the code? Jason

On Sat, Feb 14, 2015 at 3:58 AM, Robert Coli rc...@eventbrite.com wrote: On Fri, Feb 13, 2015 at 1:35 AM, Jason Wee peich...@gmail.com wrote: Pre cassandra 1.0, after sstables are compacted, the old sstables will remain until the first gc kicks in. For cassandra 1.0, the sstables will be removed after compaction is done. Will it be possible the old sstables remain due to whatever reasons (e.g. read referencing)? If I understand your question properly, the answer is "no", or "not for longer than the duration of a running thread". If compaction is working properly in a post-needs-the-java-GC-to-delete-files version of Cassandra, the input files should be deleted ASAP. If a thread is actively accessing a file, I would imagine deletion blocks for that long, but that's not likely to be very long. =Rob
Storing bi-temporal data in Cassandra
Has anyone designed a bi-temporal table in Cassandra? It doesn't look like I can do this using CQL for now. Taking the time series example from well-known modeling tutorials in Cassandra:

CREATE TABLE temperatures (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

If I add another column transaction_time:

CREATE TABLE temperatures (
  weatherstation_id text,
  event_time timestamp,
  transaction_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time, transaction_time)
) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

If I try to run a query using the following CQL, it throws an error:

select * from temperatures where weatherstation_id = 'foo' and event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and transaction_time < '2015-01-02 00:00:00'

It works if I use an equals clause for the event_time. I am trying to get the state as of a particular transaction_time. -Raj
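The query fails because CQL only allows a non-EQ relation on the last restricted clustering column: with a range on event_time, no restriction on transaction_time is permitted. One workaround (a sketch under that assumption, not the only possible design) is to query the event_time range alone and resolve the "as of transaction_time" part client-side:

```python
# Client-side bi-temporal resolution: for each event_time returned by the
# range query, keep the version with the greatest transaction_time that is
# <= the requested as-of point.

def as_of(rows, tx_time):
    """rows: iterable of (event_time, transaction_time, temperature)
    tuples from the event_time range query; tx_time: the as-of point.
    Returns {event_time: temperature} for the visible versions."""
    latest = {}
    for event_time, transaction_time, temperature in rows:
        if transaction_time > tx_time:
            continue  # recorded after the as-of point; invisible
        best = latest.get(event_time)
        if best is None or transaction_time > best[0]:
            latest[event_time] = (transaction_time, temperature)
    return {ev: temp for ev, (tx, temp) in latest.items()}
```

The cost is reading every recorded version in the event_time window; if versions per event are few, that is usually acceptable, and otherwise a second table keyed for the transaction-time access pattern is the more Cassandra-idiomatic answer.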