Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-07-11 Thread Rohit Rai
Hi Gerard,

This had been on my to-do list for a long time... I just published a Calliope
snapshot built against Hadoop 2.2.x. Take it for a spin if you get a chance.
You can get the jars from here:

   - https://oss.sonatype.org/service/local/repositories/snapshots/content/com/tuplejump/calliope_2.10/0.9.4-H2-SNAPSHOT/calliope_2.10-0.9.4-H2-SNAPSHOT.jar
   - https://oss.sonatype.org/service/local/repositories/snapshots/content/com/tuplejump/calliope-macros_2.10/0.9.4-H2-SNAPSHOT/calliope-macros_2.10-0.9.4-H2-SNAPSHOT.jar

Or to use from Maven:

<dependency>
  <groupId>com.tuplejump</groupId>
  <artifactId>calliope_2.10</artifactId>
  <version>0.9.4-H2-SNAPSHOT</version>
</dependency>

and SBT:

libraryDependencies += "com.tuplejump" %% "calliope" % "0.9.4-H2-SNAPSHOT"
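
Since this is a snapshot on the Sonatype OSS snapshots repository, an sbt build
will also need that resolver. A minimal build.sbt sketch (the resolver URL is the
standard Sonatype snapshots location, not something quoted from this thread):

// build.sbt sketch: resolve the Hadoop 2 Calliope snapshot from Sonatype OSS snapshots.
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "com.tuplejump" %% "calliope" % "0.9.4-H2-SNAPSHOT"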


It passes all the tests so I am assuming all is fine, but we haven't tested
it very extensively.

Regards,
Rohit


Founder & CEO, Tuplejump, Inc.

www.tuplejump.com
The Data Engineering Platform



Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-27 Thread Gerard Maas
I got an answer on SO to this question, basically confirming that the
CQLSSTableWriter cannot be used in Spark (at least in the form shown in the
code snippet). DataStax filed a bug on that, and it might get solved in a
future version.

"As you have observed, a single writer can only be used in serial
(ConcurrentModificationExceptions will happen if you do not), and creating
multiple writers in the JVM fails due to static schema construction within
the Cassandra code that the SSTableWriter uses.

I'm not aware of any workaround other than to spawn multiple JVMs, each
writing to a separate directory.

We have filed a Cassandra JIRA ticket to address this issue:
https://issues.apache.org/jira/browse/CASSANDRA-7463"

 - Tupshin Harper (http://stackoverflow.com/users/881195/tupshin-harper)

S.O. question:
http://stackoverflow.com/questions/24396902/using-cqlsstablewriter-concurrently/24455785#24455785
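
To make the constraint concrete, here is a minimal sketch of the "one writer per
JVM, used serially" pattern described above. It assumes the stock CQLSSTableWriter
builder API from cassandra-all; the table DDL, the INSERT statement and the Message
fields are placeholders, not the actual customer.rawts schema from this thread. Each
executor JVM gets its own output directory and a single writer, and all tasks funnel
their rows through it, which sidesteps the duplicate Schema.load but of course
serializes the writes (the throughput concern raised below still applies).

import java.io.File
import java.util.UUID
import org.apache.cassandra.io.sstable.CQLSSTableWriter

// Sketch: one CQLSSTableWriter per executor JVM, used serially by all tasks.
// The DDL / INSERT / column layout below are placeholders for the real table.
object SSTableSink {
  private val schemaDdl  = "CREATE TABLE customer.rawts (key text PRIMARY KEY, value text)"
  private val insertStmt = "INSERT INTO customer.rawts (key, value) VALUES (?, ?)"

  // One unique output directory per JVM, so executors running on the same host
  // never write into each other's SSTable sets.
  private lazy val outDir: File = {
    val d = new File(s"/tmp/cass/${UUID.randomUUID()}")
    d.mkdirs()
    d
  }

  // Built lazily on first use inside the executor; never more than one per JVM,
  // which is what the static Schema.load check requires.
  private lazy val writer = CQLSSTableWriter.builder()
    .inDirectory(outDir)
    .forTable(schemaDdl)
    .using(insertStmt)
    .build()

  // All tasks in this JVM go through here, so the writer only ever sees serial use.
  def write(key: String, value: String): Unit = synchronized {
    writer.addRow(key, value)
  }

  // A real job would also close() the writer (e.g. from a shutdown hook) before
  // streaming the generated SSTables into the ring with sstableloader.
}

// Intended use from the Spark job, e.g.:
//   rdd.foreachPartition(_.foreach(msg => SSTableSink.write(msg.key, msg.value)))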





Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-27 Thread Gerard Maas
Hi Rohit,

Thanks for your message. We are currently on Spark 0.9.1, Cassandra 2.0.6
and Calliope GA (would love to try the pre-release version if you want
beta testers :-). Our Hadoop version is CDH4.4 and, of course, our Spark
assembly is compiled against it.

We have got really interesting performance results from using Calliope and
will probably try to compile it against Hadoop 2. Compared to the DataStax
Java driver, out of the box, the Calliope lib gives us ~4.5x insert
performance with higher network and CPU usage (which is what we want in
batch insert mode: fast).

With additional code optimizations using the DataStax driver, we were able
to reduce that gap to 2x, but Calliope was still easier and faster to use.

Will you be attending the Spark Summit? I'll be around.

We'll be in touch in any case :-)

-kr, Gerard.





Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-26 Thread Rohit Rai
Hi Gerard,

What versions of Spark, Hadoop, Cassandra and Calliope are you using?
We never built Calliope against Hadoop 2, as we and our clients either don't
use Hadoop in their deployments or use it only as the infrastructure component
for Spark, in which case Hadoop 1 vs. Hadoop 2 makes no difference for them.

I know of at least one case where a user had built Calliope against 2.0
and was using it happily. If you need assistance with it, we are here to
help. Feel free to reach out to me directly and we can work out a solution
for you.

Regards,
Rohit


Founder & CEO, Tuplejump, Inc.

www.tuplejump.com
The Data Engineering Platform




Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Hi,

(My apologies for the cross-post from SO.)

I'm trying to create Cassandra SSTables from the results of a batch
computation in Spark. Ideally, each partition should create the SSTable for
the data it holds in order to parallelize the process as much as possible
(and probably even stream it to the Cassandra ring as well)

After the initial hurdles with the CQLSSTableWriter (like requiring the
yaml file), I'm confronted now with this issue:

java.lang.RuntimeException: Attempting to load already loaded column
family customer.rawts
at org.apache.cassandra.config.Schema.load(Schema.java:347)
at org.apache.cassandra.config.Schema.load(Schema.java:112)
at 
org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:336)

I'm creating a writer on each parallel partition like this:

def store(rdd: RDD[Message]) = {
  rdd.foreachPartition( msgIterator => {
    val writer = CQLSSTableWriter.builder()
      .inDirectory("/tmp/cass")
      .forTable(schema)
      .using(insertSttmt)
      .build()
    msgIterator.foreach(msg => {...})
  })
}

And if I'm reading the exception correctly, I can only create one writer
per table in one JVM. Digging a bit further in the code, it looks like the
Schema.load(...) singleton enforces that limitation.

I guess writes to the writer will not be thread-safe, and even if they were,
the contention created by all parallel tasks trying to dump a few GB of data
to disk at the same time would defeat the purpose of using SSTables for bulk
upload anyway.

So, are there ways to use the CQLSSTableWriter concurrently?

If not, what is the next best option to load batch data at high throughput
in Cassandra?

Will the upcoming Spark-Cassandra integration help with this? (i.e. should I
just sit back and relax, and the problem will solve itself?)

Thanks,

Gerard.


Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
Can you not use a Cassandra OutputFormat? It seems they have a BulkOutputFormat.
An example of using it with Hadoop is here:
http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html

Using it with Spark will be similar to the examples:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala
and
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraCQLTest.scala
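
For reference, a rough sketch of what the CqlOutputFormat route from the second
example might look like against this thread's customer.rawts table, written as a
fragment in the style of the thread's snippets (e.g. for spark-shell). The contact
point, port, column names and the bound UPDATE statement are placeholders, and it
assumes the org.apache.cassandra.hadoop.cql3 classes from cassandra-all, so the
Hadoop-version caveat discussed elsewhere in the thread still applies:

import java.nio.ByteBuffer
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.{CqlConfigHelper, CqlOutputFormat}
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings saveAsNewAPIHadoopFile onto pair RDDs

// Sketch modelled on the CassandraCQLTest example linked above.
val sc   = new SparkContext("local[2]", "cql-output-format-sketch")
val job  = new Job()
val conf = job.getConfiguration
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1")     // placeholder contact point
ConfigHelper.setOutputRpcPort(conf, "9160")
ConfigHelper.setOutputColumnFamily(conf, "customer", "rawts")
ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner")
CqlConfigHelper.setOutputCql(conf, "UPDATE customer.rawts SET value = ?")

// CqlOutputFormat expects (partition-key columns as a Map, bound variables as a List).
val rows = sc.parallelize(Seq("a" -> "1", "b" -> "2")).map { case (k, v) =>
  val key = new java.util.HashMap[String, ByteBuffer]()
  key.put("key", ByteBufferUtil.bytes(k))
  val boundValues: java.util.List[ByteBuffer] =
    java.util.Collections.singletonList(ByteBufferUtil.bytes(v))
  (key: java.util.Map[String, ByteBuffer], boundValues)
}

rows.saveAsNewAPIHadoopFile(
  "customer",                                   // the example passes the keyspace as the "path"
  classOf[java.util.Map[String, ByteBuffer]],
  classOf[java.util.List[ByteBuffer]],
  classOf[CqlOutputFormat],
  conf
)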




Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Thanks Nick.

We used the CassandraOutputFormat through Calliope. The Calliope API makes
the CassandraOutputFormat quite accessible  and is cool to work with.  It
worked fine at prototype level, but we had Hadoop version conflicts when we
put it in our Spark environment (Using our Spark assembly compiled with
CDH4.4). The conflict seems to be at the Cassandra-all lib level, which is
compiled against a different hadoop version  (v1).

We could not get round that issue. (Any pointers in that direction?)

That's why I'm trying the direct CQLSSTableWriter way but it looks blocked
as well.

-kr, Gerard.







Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
Right, ok.

I can't say I've used the Cassandra OutputFormats before. But perhaps if
you use them directly (instead of via Calliope) you may be able to get them to
work, albeit with less concise code?

Or perhaps you may be able to build Cassandra from source with Hadoop 2 /
CDH4 support:
https://groups.google.com/forum/#!topic/nosql-databases/Y-9amAdZk1s
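
One lighter-weight thing to try before rebuilding (a sketch only, not something
verified in this thread): keep the stock cassandra-all artifact and exclude the
Hadoop 1 dependency it pulls in transitively, so the job links against the Hadoop 2
/ CDH4 client already in the Spark assembly. Whether cassandra-all's output formats
are binary compatible with Hadoop 2 at runtime is exactly the open question here,
so rebuilding from source remains the safer route.

// build.sbt sketch. Assumption: cassandra-all 2.0.6 transitively pulls in the
// Hadoop 1 "hadoop-core" artifact; excluding it leaves the CDH4 / Hadoop 2 client
// from the Spark assembly as the only Hadoop on the classpath.
libraryDependencies += ("org.apache.cassandra" % "cassandra-all" % "2.0.6")
  .exclude("org.apache.hadoop", "hadoop-core")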



