Re: How to speed up SELECT * query in Cassandra
Well, I have always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do a full table scan. I have to say that our experience is that Cassandra is perfect for writing and for reading specific values by key, but definitely not for reading all of the data out of it. Some of our projects found that doing that with a non-trivial amount of data in a timely manner is close to impossible in many situations. We are slowly moving to storing the data in HDFS and possibly reprocessing it on a daily basis for such use cases (statistics). This is nothing against Cassandra; it cannot be perfect for everything. But I am really interested in how it can work well with Spark/Hadoop, where you basically need to read all the data as well (as far as I understand it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data. Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map/reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right?
I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.

--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"If you use Cassandra enterprise, you can use Hive, AFAIK."
Even better, you can use Spark/Shark with DSE.
Cheers,
Jens

--
Jens Rantil, Backend engineer, Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
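As an aside to the full-scan discussion above: the usual way to make a SELECT * scan tractable from a plain client (without Spark) is to split it into per-token-range queries that can run in parallel. A rough sketch of the range bookkeeping in Python, with the driver session left out and the table/column names purely illustrative:

```python
# Sketch: split a full-table scan into token-range slices on the
# Murmur3 ring. Only the range arithmetic is shown; a real client
# would run one query per range, in parallel, via a driver session.

MIN_TOKEN = -2**63       # Murmur3Partitioner lower bound
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner upper bound

def token_ranges(n_splits):
    """Yield (start, end] token ranges covering the whole ring."""
    step = (MAX_TOKEN - MIN_TOKEN) // n_splits
    start = MIN_TOKEN
    for i in range(n_splits):
        # Force the last range to end exactly at MAX_TOKEN.
        end = MAX_TOKEN if i == n_splits - 1 else start + step
        yield (start, end)
        start = end

def scan_queries(table, pk, n_splits):
    """Build one SELECT per token range (parameters bound separately)."""
    return [
        (f"SELECT * FROM {table} WHERE token({pk}) > ? AND token({pk}) <= ?",
         (lo, hi))
        for lo, hi in token_ranges(n_splits)
    ]
```

Note the simplification: the first range uses a strict `>` on MIN_TOKEN, so a row hashing exactly to the minimum token would be missed; real range-scanning clients special-case that wraparound.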
How to use CqlBulkOutputFormat with MultipleOutputs
I have been trying to import data into multiple column families through a Hadoop job. I was able to use CqlOutputFormat to move data to a single column family, but I don't think it supports imports to multiple column families. From some searching I saw that CqlBulkOutputFormat has support for writing to multiple column families, but I have not been able to get it working, and I could not find any examples of this. It would be great if someone could help me with an example of using CqlBulkOutputFormat with MultipleOutputs.
Re: changes to metricsReporterConfigFile requires restart of cassandra?
AFAIK, yes. If you only want a subset of the metrics, I would suggest exporting them all and filtering on the Graphite side.

On Wed, Feb 11, 2015 at 6:54 AM, Erik Forsberg forsb...@opera.com wrote:
Hi! I was pleased to find out that Cassandra 2.0.x has added support for pluggable metrics export, which even includes a Graphite metrics sender. Question: will changes to the metricsReporterConfigFile require a restart of Cassandra to take effect? I.e., if I want to add a new exported metric to that file, will I have to restart my cluster? Thanks, \EF
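For reference, the pluggable reporters discussed above are configured via the metrics-reporter-config library's YAML file. A sketch of what such a file looks like for a Graphite reporter (the hostname, prefix, and pattern below are placeholders; check the library's documentation for the exact schema):

```yaml
# Sketch of a metricsReporterConfigFile for a Graphite sender.
# Host, prefix and patterns are illustrative placeholders.
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra-node1'
    hosts:
      - host: 'graphite.example.com'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.+'
```

The `predicate` patterns can filter at the source, but as noted above it is often simpler to ship everything and filter on the Graphite side.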
Re: best supported spark connector for Cassandra
I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
I just finished a Scala course; this will be a nice exercise to check what I learned :D Thanks for the answer!

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig down their code to understand the logic.

On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I have an interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible, and we reached the conclusion that it was faster to develop an internal tool than to insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better Hadoop, and it seems the market is going this way now. I can also see that I have many more options for deciding how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat.
I have found this Java driver made by DataStax: https://github.com/datastax/spark-cassandra-connector I have also found Python Cassandra support in Spark's repo, but it still seems experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally, I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have also forked Cassandra; I am still a little confused about that. Question: which driver should I use if I want to use Java? And which if I want to use Python? I think the way Spark integrates with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using Python and/or C++, but I wonder whether it wouldn't pay off to use the Java driver instead. Thanks in advance
Re: Recommissioned a node
Yes, including the system and commitlog directories. Then when it starts, it is like a brand-new node and will bootstrap to join.

On Wed, Feb 11, 2015 at 8:56 AM, Stefano Ortolani ostef...@gmail.com wrote:
Hi Eric, thanks for your answer. The reason it got recommissioned was simply that the machine got restarted (with auto_bootstrap set to true). A cleaner, and correct, recommission would have just required wiping the data folder, am I correct? Or would I have needed to change something else in the node configuration? Cheers, Stefano

On Wed, Feb 11, 2015 at 6:47 AM, Eric Stevens migh...@gmail.com wrote:
AFAIK it should be OK after the repair completed (it was missing all writes while it was decommissioning and while it was offline, and nobody would have been keeping hinted handoffs for it, so repair was the right thing to do). Unless RF=N, you are now due for a cleanup on the other nodes. Generally speaking, though, this was probably not a good idea. When the node came back online, it rejoined the cluster immediately and would have been serving client requests without having a consistent view of the data. A safer approach would be to wipe the data directory and bootstrap it as a clean new member. I'm curious what prompted that cycle of decommission then recommission.

On Tue, Feb 10, 2015 at 10:13 PM, Stefano Ortolani ostef...@gmail.com wrote:
Hi, I recommissioned a node after decommissioning it. That happened (1) after a successful decommission (checked), (2) without wiping the data directory on the node, (3) simply by restarting the Cassandra service. The node now reports itself healthy and up and running. Knowing that I issued the repair command and patiently waited for its completion, can I assume the cluster and its internals (replicas, balance between those) to be healthy and as new? Regards, Stefano
Re: Recommissioned a node
It could, because the tombstones that mark data as deleted may have been removed; there would then be nothing saying that data is gone. If you are worried about it, turn up your gc_grace_seconds. Also, don't revive nodes back into a cluster with old data sitting on them.

On Wed Feb 11 2015 at 11:18:19 AM Stefano Ortolani ostef...@gmail.com wrote:
Hi Robert, it all happened within 30 minutes, so way before the default gc_grace_seconds (864000), so I should be fine. However, this is quite shocking if you ask me. The mere possibility of getting to an inconsistent state just by restarting a node is appalling... Can other people confirm that a restart after gc_grace_seconds had passed would have violated consistency permanently? Cheers, Stefano

On Wed, Feb 11, 2015 at 10:56 AM, Robert Coli rc...@eventbrite.com wrote:
On Tue, Feb 10, 2015 at 9:13 PM, Stefano Ortolani ostef...@gmail.com wrote:
"I recommissioned a node after decommissioning it. That happened (1) after a successful decommission (checked), (2) without wiping the data directory on the node, (3) simply by restarting the Cassandra service. The node now reports itself healthy and up and running. Knowing that I issued the repair command and patiently waited for its completion, can I assume the cluster and its internals (replicas, balance between those) to be healthy and as new?"
Did you recommission before or after gc_grace_seconds passed? If after, you have violated consistency in a manner that, in my understanding, one cannot recover from. If before, you're pretty much fine. However, this is a longstanding issue that I personally consider a bug: your decommissioned node doesn't forget its state. In my opinion, once you tell it to leave the cluster, it should forget everything it knew as a member of that cluster. If you file this behavior as a JIRA bug, please let the list know. =Rob
Re: Pagination support on Java Driver Query API
Hi Eric, thanks for your reply. I am using Cassandra 2.0.11, and in that version I cannot append a condition on the last clustering key column value of the last row in the previous batch. It fails with "Preceding column is either not restricted or by a non-EQ relation", which means I need to specify an equality condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks, Ajay

"I can't believe that everyone reads/processes all rows at once (without pagination)."

Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. It sounds, though, like you're talking about paginating a subset of data: larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state: keeping track of the last partitioning and clustering key seen, and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset, where clusteringKey > the provided value. So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead.

On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote:
Thanks, Alex. But is there any workaround possible? I can't believe that everyone reads/processes all rows at once (without pagination).
Thanks, Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote:
On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote:
"1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through the fetch size. But it is limited in that we cannot skip or go back; the fetch state is not exposed."
Cassandra doesn't support skipping, so this is not really a limitation of the driver.
--
[:-a) Alex Popescu, Sen. Product Manager @ DataStax, @al3xandru
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscr...@lists.datastax.com.
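Eric's paging-state bookkeeping above can be sketched in plain Python. The CQL query itself is stubbed out as a `rows_for` callback and all names are illustrative; the point is that, across a deterministic list of partition keys, the only state a client has to persist between pages is (index of last partition key, last clustering value seen):

```python
# Sketch of client-side paging state: (pq_index, last_clustering).
# rows_for(pk, after) stands in for a CQL query returning ordered
# clustering values strictly greater than `after` for partition pk.

def next_page(partition_keys, rows_for, page_size, state=None):
    """Return (rows, new_state); new_state is None when exhausted."""
    idx, after = state if state else (0, None)
    page = []
    while idx < len(partition_keys) and len(page) < page_size:
        # A real client would push this truncation down as a CQL LIMIT.
        got = rows_for(partition_keys[idx], after)[: page_size - len(page)]
        page.extend((partition_keys[idx], ck) for ck in got)
        if got and len(page) == page_size:
            # Page full: resume later within this partition.
            return page, (idx, got[-1])
        # Partition exhausted (or empty): move to the next one.
        idx, after = idx + 1, None
    return page, None
```

A resumed call simply passes the returned state back in, so no server-side cursor is needed.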
Re: Pagination support on Java Driver Query API
Basically I am trying different queries with your approach. One such query is like: SELECT * FROM mycf WHERE <condition on partition key> ORDER BY ck1 ASC, ck2 DESC, where ck1 and ck2 are clustering keys in that order. How do we achieve pagination support here? Thanks, Ajay

On Feb 11, 2015 11:16 PM, Ajay ajay.ga...@gmail.com wrote:
Hi Eric, thanks for your reply. I am using Cassandra 2.0.11, and in that version I cannot append a condition on the last clustering key column value of the last row in the previous batch. It fails with "Preceding column is either not restricted or by a non-EQ relation", which means I need to specify an equality condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks, Ajay

"I can't believe that everyone reads/processes all rows at once (without pagination)."

Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. It sounds, though, like you're talking about paginating a subset of data: larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state: keeping track of the last partitioning and clustering key seen, and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset, where clusteringKey > the provided value.
So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead.

On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote:
Thanks, Alex. But is there any workaround possible? I can't believe that everyone reads/processes all rows at once (without pagination). Thanks, Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote:
On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote:
"1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through the fetch size. But it is limited in that we cannot skip or go back; the fetch state is not exposed."
Cassandra doesn't support skipping, so this is not really a limitation of the driver.
--
[:-a) Alex Popescu, Sen. Product Manager @ DataStax, @al3xandru
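For the mixed-order case asked about above (ORDER BY ck1 ASC, ck2 DESC), the same last-row bookkeeping still works: the next page is every row sorting after the last (ck1, ck2) seen, i.e. ck1 > c1, or ck1 = c1 and ck2 < c2. CQL cannot express that OR in one statement, so a client would typically issue two queries per page (the "same ck1, smaller ck2" remainder, then "ck1 > c1"). A sketch of the predicate over an in-memory list, with illustrative names and numeric clustering values assumed:

```python
# Sketch: resuming a page under ORDER BY ck1 ASC, ck2 DESC.
# A row sorts after (c1, c2) iff ck1 > c1, or ck1 == c1 and ck2 < c2.

def sort_key(row):
    ck1, ck2 = row
    return (ck1, -ck2)  # ASC on ck1, DESC on ck2 (numeric ck2 assumed)

def rows_after(rows, last):
    """All rows that would appear after `last` in the query's order."""
    c1, c2 = last
    return [r for r in sorted(rows, key=sort_key)
            if r[0] > c1 or (r[0] == c1 and r[1] < c2)]
```

Against Cassandra, `rows_after` would become those two CQL queries concatenated, with a LIMIT on each.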
Re: Recommissioned a node
And after decreasing your RF (rare, but it happens).

On Wed Feb 11 2015 at 11:31:38 AM Robert Coli rc...@eventbrite.com wrote:
On Wed, Feb 11, 2015 at 11:20 AM, Jonathan Haddad j...@jonhaddad.com wrote:
"It could, because the tombstones that mark data as deleted may have been removed; there would then be nothing saying that data is gone. If you are worried about it, turn up your gc_grace_seconds. Also, don't revive nodes back into a cluster with old data sitting on them."
Also, run cleanup after range movements: https://issues.apache.org/jira/browse/CASSANDRA-7764 =Rob
Re: How to speed up SELECT * query in Cassandra
For your information, Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof": http://en.wikipedia.org/wiki/Philosophic_burden_of_proof You stated: "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra". It's up to YOU to prove it right, not up to me to prove it wrong. All the other blah blah is trolling. Come back to me once you have some decent benchmarks supporting your statement; until then, the question is closed.

On Wed, Feb 11, 2015 at 3:17 PM, Colin co...@clark.ws wrote:
Did you want me to include specific examples from my employment at DataStax, or start from the ground up? All Spark on Cassandra is, is a better alternative to the previous use of Hive. The fact that DataStax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all: where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :) (If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data.
Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map/reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"If you use Cassandra enterprise, you can use Hive, AFAIK."
Even better, you can use Spark/Shark with DSE.
Cheers,
Jens

--
Jens Rantil, Backend engineer, Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Adding new node - OPSCenter problems
Hello all, I have added 3 new nodes to an existing cluster. I must point out that I copied the cassandra.yaml file from an existing node and just changed listen_address, per the instructions here: "Adding nodes to an existing cluster" (DataStax Cassandra 2.0 documentation on www.datastax.com; steps to add nodes when using virtual nodes). I installed the DataStax agents, but in OpsCenter I see the new nodes in a different, empty cluster. Also, OpsCenter does not see the datacenter for the new nodes. They are all in the same datacenter, and the cluster name is the same for all 9 nodes. Any ideas?
changes to metricsReporterConfigFile requires restart of cassandra?
Hi! I was pleased to find out that Cassandra 2.0.x has added support for pluggable metrics export, which even includes a Graphite metrics sender. Question: will changes to the metricsReporterConfigFile require a restart of Cassandra to take effect? I.e., if I want to add a new exported metric to that file, will I have to restart my cluster? Thanks, \EF
Re: How to speed up SELECT * query in Cassandra
Did you want me to include specific examples from my employment at DataStax, or start from the ground up? All Spark on Cassandra is, is a better alternative to the previous use of Hive. The fact that DataStax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all: where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :) (If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data. Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right?
I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"If you use Cassandra enterprise, you can use Hive, AFAIK."
Even better, you can use Spark/Shark with DSE.
Cheers,
Jens
--
Jens Rantil, Backend engineer, Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Re: Adding new node - OPSCenter problems
Hello, nodetool status shows the existing nodes as UN and the 3 new ones as UJ. What is strange is that in the Owns column the 3 new nodes have "?" instead of a percentage value. In the Rack column I see that all of them are in rack1.

On Wednesday, February 11, 2015 4:50 PM, Carlos Rolo r...@pythian.com wrote:
Hello, what is the output of nodetool status? All nodes should appear; otherwise there is some configuration error. Regards,
Carlos Juzarte Rolo, Cassandra Consultant, Pythian - Love your data
rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649 | www.pythian.com

On Wed, Feb 11, 2015 at 3:46 PM, Batranut Bogdan batra...@yahoo.com wrote:
Hello all, I have added 3 new nodes to an existing cluster. I must point out that I copied the cassandra.yaml file from an existing node and just changed listen_address, per the instructions here: "Adding nodes to an existing cluster" (DataStax Cassandra 2.0 documentation on www.datastax.com). I installed the DataStax agents, but in OpsCenter I see the new nodes in a different, empty cluster. Also, OpsCenter does not see the datacenter for the new nodes. They are all in the same datacenter, and the cluster name is the same for all 9 nodes. Any ideas?
Re: How to speed up SELECT * query in Cassandra
No, the question isn't closed. You don't get to decide that. I don't run a website making claims regarding Cassandra and Spark; your employer does. Again, where are your benchmarks? I will publish mine, then we'll see what you've got.
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 8:39 AM, DuyHai Doan doanduy...@gmail.com wrote:
For your information, Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof". You stated: "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra". It's up to YOU to prove it right, not up to me to prove it wrong. All the other blah blah is trolling. Come back to me once you have some decent benchmarks supporting your statement; until then, the question is closed.

On Wed, Feb 11, 2015 at 3:17 PM, Colin co...@clark.ws wrote:
Did you want me to include specific examples from my employment at DataStax, or start from the ground up? All Spark on Cassandra is, is a better alternative to the previous use of Hive. The fact that DataStax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all: where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :) (If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?
On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data. Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map/reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance.
The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: best supported spark connector for Cassandra
I just finished a scala course; nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig through their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better hadoop, and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I have also found python Cassandra support in spark's repo, but it seems experimental yet: https://github.com/apache/spark/tree/master/examples/src/main/python Finally, I have found stratio deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra as well; I am still a little confused about it. 
Question: which driver should I use if I want to use Java? And which if I want to use python? I think the way Spark integrates with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using python and/or C++, but I wonder if it wouldn't pay off to use the java driver instead. Thanks in advance
Re: Adding new node - OPSCenter problems
Hello, What is the output of nodetool status? All nodes should appear, otherwise there is some configuration error. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 3:46 PM, Batranut Bogdan batra...@yahoo.com wrote: Hello all, I have added 3 new nodes to the existing cluster. I must point out that I have copied the cassandra.yaml file from an existing node and just changed listen_address per the instructions here: Adding nodes to an existing cluster | DataStax Cassandra 2.0 Documentation http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html I installed the DataStax agents, but in OpsCenter I see the new nodes in a different, empty cluster. Also, OpsCenter does not see the datacenter for the new nodes. They are all in the same datacenter, and the name of the cluster is the same for all 9 nodes. Any ideas?
Re: Recommissioned a node
Hi Eric, thanks for your answer. The reason why it got recommissioned was simply that the machine got restarted (with auto_bootstrap set to true). A cleaner, and correct, recommission would have just required wiping the data folder, am I correct? Or would I have needed to change something else in the node configuration? Cheers, Stefano On Wed, Feb 11, 2015 at 6:47 AM, Eric Stevens migh...@gmail.com wrote: AFAIK it should be ok after the repair completed (it was missing all writes while it was decommissioning and while it was offline, and nobody would have been keeping hinted handoffs for it, so repair was the right thing to do). Unless RF=N, you're now due for a cleanup on the other nodes. Generally speaking though, this was probably not a good idea. When the node came back online, it rejoined the cluster immediately and would have been serving client requests without having a consistent view of the data. A safer approach would be to wipe the data directory and bootstrap it as a clean new member. I'm curious what prompted that cycle of decommission then recommission. On Tue, Feb 10, 2015 at 10:13 PM, Stefano Ortolani ostef...@gmail.com wrote: Hi, I recommissioned a node after decommissioning it. That happened (1) after a successful decommission (checked), (2) without wiping the data directory on the node, (3) simply by restarting the cassandra service. The node now reports itself healthy and up and running. Knowing that I issued the repair command and patiently waited for its completion, can I assume the cluster, and its internals (replicas, balance between those), to be healthy and as new? Regards, Stefano
Re: Pagination support on Java Driver Query API
"I can't believe that everyone reads / processes all rows at once (without pagination)." Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. But it sounds like you're talking about paginating a subset of data - larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state, typically keeping track of the last partitioning and clustering key seen, and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset where clusteringKey > the provided value. So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead. On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote: Thanks Alex. But is there any workaround possible? I can't believe that everyone reads / processes all rows at once (without pagination). Thanks Ajay On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote: On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote: 1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through FetchSize. But it is limited in a way that we cannot skip or go back; the FetchState is not exposed. 
Cassandra doesn't support skipping so this is not really a limitation of the driver. -- [:-a) Alex Popescu Sen. Product Manager @ DataStax @al3xandru To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscr...@lists.datastax.com.
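The manual paging-state approach described above can be sketched driver-agnostically. The function below only builds the next CQL string from a (partition-key offset, last clustering key) pair; the name `next_page_query` and the `pk`/`ck` column names are illustrative assumptions, not part of any driver API:

```python
# Hypothetical sketch: resume a multi-partition scan from a saved paging
# state of the form (index of the partition key being served, last
# clustering key seen in that partition), as described in the thread.

def next_page_query(table, partition_keys, paging_state, page_size):
    """Build the next SELECT from paging_state, or the initial one if None."""
    if paging_state is None:
        pq_index, last_ck = 0, None
    else:
        pq_index, last_ck = paging_state
    pk = partition_keys[pq_index]
    if last_ck is None:
        # First page for this partition key.
        return "SELECT * FROM %s WHERE pk = %s LIMIT %d" % (table, pk, page_size)
    # Resume within the same partition, past the last clustering key seen.
    return ("SELECT * FROM %s WHERE pk = %s AND ck > %s LIMIT %d"
            % (table, pk, last_ck, page_size))

first = next_page_query("events", [101, 102], None, 50)
resumed = next_page_query("events", [101, 102], (1, 42), 50)
```

The caller advances the partition-key offset once a page comes back with fewer than `page_size` rows, i.e. when the current partition is exhausted.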
Safely delete tmplink files - 2.1.2
Hi, we are expecting the 2.1.3 release to fix the deletion of tmplink files. In the meantime, is it safe to delete these files without shutting down Cassandra? Thanks,
Re: Two problems with Cassandra
On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: 2. While trying to update the full dataset with a simple transformation (again via the python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. What does a simple transformation mean here? Assuming a reasonably sized heap, OOM sounds like you're trying to update a large number of large partitions in a single operation. In general, in Cassandra, you're best off interacting with a single or small number of partitions in any given interaction. =Rob
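Rob's advice, touching a bounded number of rows per interaction instead of materializing everything, can be illustrated with a small driver-agnostic batching helper (the helper name is an assumption; with the python driver you would feed it a paged result set):

```python
from itertools import islice

def in_batches(rows, batch_size):
    """Yield lists of at most batch_size rows; only one batch is held in
    memory at a time, so a full-table update never materializes the whole
    result set (rows may be any iterator, e.g. a paged driver result)."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Transform rows a bounded batch at a time instead of all at once.
batches = list(in_batches(range(10), 4))
```

Each batch can then be transformed and written back before the next one is fetched, which keeps heap usage on both the client and the coordinator roughly constant.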
Re: Safely delete tmplink files - 2.1.2
On Wed, Feb 11, 2015 at 7:52 AM, Demian Berjman dberj...@despegar.com wrote: Hi, we are expecting the 2.1.3 release to fix the deletion of tmplink files. In the meantime, is it safe to delete these files without shutting down Cassandra? If I were experiencing issues with 2.1.2, I would downgrade to 2.1.1, FWIW. My belief is that it is safe to delete these files (but they may not actually be deleted, because Cassandra may have them open). Before doing anything in production I would ask on the JIRA ticket for the issue. Also, in case you are running 2.1.2 in production... https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ =Rob
Re: How to speed up SELECT * query in Cassandra
Hi, here are some snippets of code in Scala which should get you started. Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None => initialQuery(lowerLimit)
  }
  session.execute(query).all
}

private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}

private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}", "(" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ")")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency) yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency) yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s", // template
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}

case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}

On 02/11/2015 02:21 PM, Ja Sam wrote: Your answer looks very promising. How do you calculate start and stop? 
On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote: The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop). I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: High GC activity on node with 4TB on data
Hi Chris, On 02/09/2015 04:22 PM, Chris Lohfink wrote: - number of tombstones - how can I reliably find it out? https://github.com/spotify/cassandra-opstools https://github.com/cloudian/support-tools thanks. If you're not getting much compression, it may be worth trying to disable it; it may contribute, but it's very unlikely that it's the cause of the GC pressure itself. 7000 sstables but STCS? Sounds like compactions couldn't keep up. Do you have a lot of pending compactions (nodetool)? You may want to increase your compaction throughput (nodetool) to see if you can catch up a little; it would cause a lot of heap overhead to do reads with that many. You may even need to take more drastic measures if it can't catch back up. I am sorry, I was wrong. We actually do use LCS (the switch was done recently). There are almost no pending compactions. We have increased the sstable size to 768M, so it should help as well. It may also be good to check `nodetool cfstats` for very wide partitions. There are basically none, this is fine. It seems that the problem really comes from having so much data in so many sstables, so the org.apache.cassandra.io.compress.CompressedRandomAccessReader classes consume more memory than 0.75*HEAP_SIZE, which triggers the CMS over and over. We have turned off compression and so far the situation seems to be fine. Cheers Jirka H. There's a good chance that if you're under load and have over an 8gb heap, your GCs could use tuning. The bigger the nodes, the more manual tweaking it will require to get the most out of them. https://issues.apache.org/jira/browse/CASSANDRA-8150 also has some ideas. Chris On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky ho...@avast.com wrote: Hi all, thank you all for the info. To answer the questions: - we have 2 DCs with 5 nodes in each, each node has 256G of memory, 24x1T drives, 2x Xeon CPU - there are multiple cassandra instances running for different projects. The node itself is powerful enough. 
- there are 2 keyspaces, one with 3 replicas per DC, one with 1 replica per DC (because of the amount of data and because it serves more or less like a cache) - there are about 4k/s Request-response, 3k/s Read and 2k/s Mutation requests - the numbers are a sum over all nodes - we use STCS (LCS would be quite IO-heavy for this amount of data) - number of tombstones - how can I reliably find it out? - the biggest CF (3.6T per node) has 7000 sstables Now, I understand that the best practice for Cassandra is to run with the minimum heap size that is enough, which for this case we thought is about 12G - there is always 8G consumed by the SSTable readers. Also, I thought that a high number of tombstones creates pressure in the new space (which can then cause pressure in the old space as well), but this is not what we are seeing. We see continuous GC activity in the Old generation only. Also, I noticed that the biggest CF has a compression factor of 0.99, which basically means that the data come in compressed already. Do you think that turning off the compression should help with memory consumption? Also, I think that tuning CMSInitiatingOccupancyFraction=75 might help here, as it seems that 8G is something that Cassandra needs for bookkeeping this amount of data, and that this was slightly above the 75% limit which triggered the CMS again and again. I will definitely have a look at the presentation. Regards Jiri Horky On 02/08/2015 10:32 PM, Mark Reddy wrote: Hey Jiri, While I don't have any experience running 4TB nodes (yet), I would recommend taking a look at a presentation by Aaron Morton on large nodes: http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/ to see if you can glean anything from that. I would note that at the start of his talk he mentions that in version 1.2 we can now talk about nodes around 1 - 3 TB in size, so if you are storing anything more than that you are getting into very specialised use cases. 
If you could provide us with some more information about your cluster setup (No. of CFs, read/write patterns, do you delete / update often, etc.) that may help in getting you to a better place. Regards, Mark On 8 February 2015 at 21:10, Kevin Burton bur...@spinn3r.com mailto:bur...@spinn3r.com wrote: Do you have a lot of individual tables? Or lots of small compactions? I think the general consensus is that (at least for Cassandra), 8GB heaps are ideal. If you have lots of small tables it’s a known anti-pattern (I believe) because the Cassandra internals could do a better job on handling the in memory metadata representation. I think this has
How to speed up SELECT * query in Cassandra
Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: How to speed up SELECT * query in Cassandra
Look for the message "Re: Fastest way to map/parallel read all values in a table?" in the mailing list; it was recently discussed. You can have several parallel processes, each one reading a slice of the data, by splitting min/max murmur3 hash ranges. In the company I used to work for, we developed a system to run custom python processes on demand to process Cassandra data, among other things to be able to do that. I hope it will be released as open source soon; it seems there are a lot of people who always have this same problem. If you use Cassandra enterprise, you can use hive, AFAIK. A good idea would be running a hadoop or spark process over your cluster and doing the processing you want, but sometimes I think it might be a bit hard to achieve good results with that, mainly because these tools work fine but are automagic. It's hard to control where intermediate data will be stored, for example. From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: How to speed up SELECT * query in Cassandra
I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? For example, we use hadoop and cassandra in conjunction with each other; we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor data warehouse or long-term time series store) and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: Two problems with Cassandra
Hi Carlos, I tried on a single node and a 4-node cluster. On the 4-node cluster I set up the tables with replication factor = 2. I usually iterate over a subset, but it can be about ~40% right now. Some of my column values could be quite big… I remember I was exporting to csv and I had to change the default csv max column length. If I just update, there are no problems; it's reading and updating that kills everything (could it have something to do with the driver?) I'm using the 2.0.8 release right now. I was trying to tweak memory sizes. If I give Cassandra too much memory (8 or 16 GB) it dies much faster due to GC not being able to keep up. But it consistently dies on a specific row in the single-instance case… Is this enough info to point me somewhere? Thank you, Pavel On Feb 11, 2015, at 1:48 PM, Carlos Rolo r...@pythian.com wrote: Hello Pavel, What is the size of the Cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update? Or just parts of it? IMO the information is too short to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same problem. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, I'm using Cassandra to store NLP data; the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I've run into two problems (I'm using the latest Cassandra): 1. I was trying to copy from one Cassandra cluster to another via the python driver, however the driver confused the two instances 2. 
While trying to update the full dataset with a simple transformation (again via the python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. I'm really concerned about #2, since we're moving all heavy processing to a Spark cluster and will expand it, and I would expect much heavier traffic to/from Cassandra. Any hints, war stories, etc. very appreciated! Thank you, Pavel Velikhov
Re: How to speed up SELECT * query in Cassandra
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary Twitter https://twitter.com/tink
Re: How to speed up SELECT * query in Cassandra
The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop). I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
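To make the (start, stop) tokens concrete: the partitioner's token space can be split into equal, contiguous sub-ranges, one per parallel worker. A sketch of that calculation (the Murmur3 min/max constants are the partitioner's documented token bounds; the function name and query template are illustrative):

```python
# Split the Murmur3 token space (-2**63 .. 2**63 - 1) into `concurrency`
# contiguous sub-ranges; each worker then scans its own range with e.g.
#   SELECT * FROM t WHERE token(pk) > start AND token(pk) <= stop
MURMUR3_MIN = -2**63
MURMUR3_MAX = 2**63 - 1

def token_ranges(concurrency):
    span = MURMUR3_MAX - MURMUR3_MIN
    step = span // concurrency
    ranges = []
    for i in range(concurrency):
        start = MURMUR3_MIN + i * step
        # The last range absorbs the rounding remainder so the whole
        # token space is covered.
        stop = MURMUR3_MAX if i == concurrency - 1 else start + step
        ranges.append((start, stop))
    return ranges

ranges = token_ranges(4)
```

For the RandomPartitioner the bounds would be 0 and 2**127 - 1 instead, as in the Scala snippet posted later in this thread.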
Re: Two problems with Cassandra
Hello Pavel, What is the size of the Cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update? Or just parts of it? IMO the information is too short to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same problem. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, I'm using Cassandra to store NLP data; the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I've run into two problems (I'm using the latest Cassandra): 1. I was trying to copy from one Cassandra cluster to another via the python driver, however the driver confused the two instances 2. While trying to update the full dataset with a simple transformation (again via the python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. I'm really concerned about #2, since we're moving all heavy processing to a Spark cluster and will expand it, and I would expect much heavier traffic to/from Cassandra. Any hints, war stories, etc. very appreciated! Thank you, Pavel Velikhov
Re: How to speed up SELECT * query in Cassandra
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement? On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: "cassandra makes a very poor data warehouse or long-term time series store" Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra." I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark or hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map / reduce job used a temporary CF in Cassandra to store intermediate results? From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? 
For example, we use hadoop and cassandra in conjunction with each other; we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor data warehouse or long-term time series store) and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: road map for Cassandra 3.0
Look at the JIRA, filtering by 3.0, but it's not very accurate. There are a lot of new features scheduled for 3.0. Some of them will make it in time for 3.0.0, like User Defined Functions, I guess. Other features will be shipped with future 3.x minor versions. On Wed, Feb 11, 2015 at 1:25 PM, Ernesto Reinaldo Barreiro reier...@gmail.com wrote: Hi, Is there a public road map for Cassandra 3.0? Are there any estimates for the release date of 3.0? -- Regards - Ernesto Reinaldo Barreiro
Re: road map for Cassandra 3.0
Thanks for your answer! On Wed, Feb 11, 2015 at 1:03 PM, DuyHai Doan doanduy...@gmail.com wrote: Look at the JIRA, filtering by 3.0, but it's not very accurate. There are a lot of new features scheduled for 3.0. Some of them will make it in time for 3.0.0, like User Defined Functions, I guess. Other features will be shipped with future 3.x minor versions. On Wed, Feb 11, 2015 at 1:25 PM, Ernesto Reinaldo Barreiro reier...@gmail.com wrote: Hi, Is there a public road map for Cassandra 3.0? Are there any estimates for the release date of 3.0? -- Regards - Ernesto Reinaldo Barreiro
Re: How to speed up SELECT * query in Cassandra
Your answer looks very promising. How do you calculate start and stop? On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote: The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop). I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: Two problems with Cassandra
Update should not be a problem because no read is done, so there is no need to pull the data out. Is that row bigger than your memory capacity (or HEAP size)? For dealing with large heaps you can refer to this ticket: CASSANDRA-8150. It provides some nice tips. If someone else can share their experience, that would be good. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 12:05 PM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi Carlos, I tried on a single node and a 4-node cluster. On the 4-node cluster I set up the tables with replication factor = 2. I usually iterate over a subset, but it can be about ~40% right now. Some of my column values can be quite big… I remember I was exporting to csv and I had to change the default csv max column length. If I just update, there are no problems; it's reading and updating that kills everything (could it have something to do with the driver?). I'm using the 2.0.8 release right now. I was trying to tweak memory sizes. If I give Cassandra too much memory (8 or 16 GB) it dies much faster due to GC not being able to keep up. But it consistently dies on a specific row in the single-instance case… Is this enough info to point me somewhere? Thank you, Pavel On Feb 11, 2015, at 1:48 PM, Carlos Rolo r...@pythian.com wrote: Hello Pavel, What is the size of the cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update? Or just parts of it? IMO the information is too short to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same problem. 
On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, I'm using Cassandra to store NLP data; the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I've run into two problems (I'm using the latest Cassandra): 1. I was trying to copy from one Cassandra cluster to another via the Python driver, however the driver confused the two instances. 2. While trying to update the full dataset with a simple transformation (again via the Python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. I'm really concerned about #2, since we're moving all heavy processing to a Spark cluster and will expand it, and I would expect much heavier traffic to/from Cassandra. Any hints, war stories, etc. very appreciated! Thank you, Pavel Velikhov
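A common cause of the out-of-memory behaviour Pavel describes is materializing the whole result set (and all pending updates) at once instead of streaming it. The real fix in the Python driver would be result paging via fetch_size; the stdlib-only sketch below only illustrates the page-at-a-time pattern — the fake row source and the transform are invented for illustration:

```python
import itertools

def fetch_pages(rows, page_size):
    """Yield rows one page at a time so only page_size rows are resident,
    mimicking the driver's fetch_size-based result paging."""
    it = iter(rows)
    while True:
        page = list(itertools.islice(it, page_size))
        if not page:
            return
        yield page

def transform_in_pages(rows, page_size, transform):
    """Apply `transform` page by page, holding only one page's worth of
    pending updates in memory instead of the full dataset."""
    updated = 0
    for page in fetch_pages(rows, page_size):
        pending = [transform(row) for row in page]
        # Real code would execute the UPDATE statements for `pending` here
        # (and could throttle between pages instead of ad-hoc sleeps).
        updated += len(pending)
    return updated
```

With this shape, memory use is bounded by the page size rather than the table size, which is usually what keeps a full-table read+update loop from blowing up the client or pressuring the server.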
best supported spark connector for Cassandra
Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I am interested in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than to insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better Hadoop, and it seems the market is going this way now. I can also see I have many more options for deciding how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this Java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I have also found Python Cassandra support in Spark's repo, but it seems still experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra as well; I am still a little confused about that. Question: which driver should I use if I want to use Java? And which if I want to use Python? From my past experience, I think the way Spark integrates with Cassandra makes all the difference in the world, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using Python and/or C++, but I wonder whether it wouldn't pay off to use the Java driver instead. Thanks in advance
Re: best supported spark connector for Cassandra
Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from there all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig into their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Taking the opportunity Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. [...]
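The data-locality idea DuyHai points at can be illustrated with a toy model. This is only a rough analogue of what the connector's preferred-locations logic does — the ring layout, tokens, and IPs below are entirely invented: each token range has an owning replica, and the scheduler can place the Spark task for a range on a node that holds a replica of it, so the read stays local instead of crossing the network:

```python
from bisect import bisect_left

# Hypothetical single-replica ring: (end_token, owning_node).  Each node
# owns the tokens from the previous entry's end (exclusive) up to its own
# end token (inclusive).
RING = [(-4611686018427387904, "10.0.0.1"),
        (0,                    "10.0.0.2"),
        (4611686018427387904,  "10.0.0.3"),
        (9223372036854775807,  "10.0.0.4")]

def preferred_host(token):
    """Return the node owning `token`: the first ring entry whose end
    token is >= the queried token."""
    end_tokens = [end for end, _ in RING]
    idx = bisect_left(end_tokens, token)
    return RING[idx][1]

def preferred_locations(ranges):
    """Map each (start, end] token range to the node owning its end token,
    i.e. where the Spark partition for that range should preferably run."""
    return {(start, end): preferred_host(end) for start, end in ranges}
```

The real connector asks the cluster metadata for replica addresses per token range and reports them to Spark as preferred locations, so this sketch only captures the shape of the mapping, not the actual API.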