Re: Cassandra & Spark

2018-08-25 Thread Affan Syed
>>> Which then acts as a layer between Cassandra and your applications >>> storing into Cassandra (memory data grid, I think it is called) >>> >>> Basically, think of it as a big cache >>> >>> It is an in-memory thingy ☺ >>>

Re: Cassandra & Spark

2018-08-25 Thread CharSyam
memory data grid, I think it is called) >> >> Basically, think of it as a big cache >> >> It is an in-memory thingy ☺ >> >> And then you can run some super fast queries >> >> -Tobias >> >> *From

Re: Cassandra & Spark

2018-08-25 Thread Affan Syed
Tobias > *From:* DuyHai Doan > *Date:* Thursday, 8 June 2017 at 15:42 > *To:* Tobias Eriksson > *Cc:* 한 승호, "user@cassandra.apache.org" <user@cassandra.apache.org> > *Subject:* Re: Cassandra & Spark > Interesting >

IllegalArgumentException while saving rdd after repartition by cassandra replica set using cassandra spark connector

2018-07-22 Thread M Singh
Hi folks: I am working on a project to save a Spark dataframe to Cassandra and am getting an exception about the row size not being valid (see below). I tried to trace the code in the connector, and it appears that the row size (3 below) is different from the column count (which turns out to be 1). I
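For readers hitting the same wall, a minimal sketch of the pattern involved (the keyspace ks, table kv, and single text partition key are assumptions here, not the poster's schema). repartitionByCassandraReplica requires the RDD element type to map exactly onto the table's partition key columns, which is one plausible source of a row-size vs. column-count mismatch:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Case class matching ONLY the partition key of the hypothetical table ks.kv.
    case class Key(key: String)

    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))
    val keys = sc.parallelize(Seq(Key("a"), Key("b"), Key("c")))

    // Group keys so each Spark partition lands on a node that replicates them.
    val localKeys = keys.repartitionByCassandraReplica("ks", "kv", partitionsPerHost = 10)

    // Join back against the full rows and save derived values to another table.
    localKeys.joinWithCassandraTable("ks", "kv")
      .map { case (k, row) => (k.key, row.getString("value")) }
      .saveToCassandra("ks", "kv_copy", SomeColumns("key", "value"))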

Resources for Monitoring Cassandra, Spark, Solr

2018-07-02 Thread Rahul Singh
Folks, We often get questions on monitoring here so I assembled this post with articles from those in the community as well as links to the component tools to give folks a more comprehensive listing. https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance

Re: cassandra spark-connector-sqlcontext too many tasks

2018-03-17 Thread Ben Slater
I think that is probably a question for the Spark Connector forum: https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user as it’s much more related to the function of the connector than functionality of Cassandra itself. Cheers Ben On Sat, 17 Mar 2018 at 21:18

cassandra spark-connector-sqlcontext too many tasks

2018-03-17 Thread onmstester onmstester
I'm querying a single Cassandra partition using sqlContext and its temp view, which creates more than 2000 tasks on Spark and takes about 360 seconds: sqlContext.read().format("org.apache.spark.sql.cassandra").options(ops).load().createOrReplaceTempView("tableName") But using
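One thing to check (a sketch under assumed names, not the poster's code): filtering on the partition key through the DataFrame API lets the connector push the predicate down to Cassandra, so the scan covers one partition instead of every token range:

    import org.apache.spark.sql.functions.col

    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "tableName")) // assumed names
      .load()
      .filter(col("id") === "somePartitionKey") // "id" assumed to be the partition key
    df.explain() // verify the filter shows up as a pushed predicate, not a full scan
    df.createOrReplaceTempView("tableName")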

Re: Cassandra/Spark failing to process large table

2018-03-08 Thread kurt greaves
Note that read repairs only occur for QUORUM/equivalent and higher, and also with a 10% (default) chance on anything less than QUORUM (ONE/LOCAL_ONE). This is configured at the table level through the dclocal_read_repair_chance and read_repair_chance settings (which are going away in 4.0). So if
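For anyone who wants to adjust those dials, they are ordinary table options in pre-4.0 Cassandra; a sketch via the Java driver, with the contact point, keyspace, and table as placeholders:

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()
    // Keep the in-DC 10% chance, disable the cross-DC one (both options are removed in 4.0).
    session.execute("ALTER TABLE ks.tbl WITH dclocal_read_repair_chance = 0.1 AND read_repair_chance = 0.0")
    session.close()
    cluster.close()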

Re: Cassandra/Spark failing to process large table

2018-03-08 Thread Faraz Mateen
Hi Ben, That makes sense. I also read about "read repairs". So, once an inconsistent record is read, Cassandra synchronizes its replicas on other nodes as well. I ran the same Spark query again, this time with the default consistency level (LOCAL_ONE), and the result was correct. Thanks again for the

Re: Cassandra/Spark failing to process large table

2018-03-06 Thread Ben Slater
Hi Faraz, Yes, it likely does mean there is inconsistency in the replicas. However, you shouldn’t be too freaked out about it - Cassandra is designed to allow for this inconsistency to occur, and the consistency levels allow you to achieve consistent results despite replicas not being consistent. To

Re: Cassandra/Spark failing to process large table

2018-03-06 Thread Faraz Mateen
Thanks a lot for the response. Setting consistency to ALL/TWO started giving me consistent count results on both cqlsh and Spark. As expected, my query time increased by 1.5x (before, it was taking ~1.6 hours, but with consistency level ALL the same query takes ~2.4 hours to complete).

Re: Cassandra/Spark failing to process large table

2018-03-03 Thread Ben Slater
Both cqlsh and the Spark Cassandra connector query at consistency level ONE (LOCAL_ONE for the Spark connector) by default, so if there is any inconsistency in your replicas this can result in inconsistent query results. See http://cassandra.apache.org/doc/latest/tools/cqlsh.html and
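To illustrate the knob Ben is pointing at, a sketch of raising the connector's read consistency above its LOCAL_ONE default (keyspace and table names are placeholders):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("consistent-count")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.input.consistency.level", "QUORUM") // default is LOCAL_ONE

    val sc = new SparkContext(conf)
    // With QUORUM reads, the count tolerates replicas that have not yet been repaired.
    val count = sc.cassandraTable("ks", "tbl").cassandraCount()
    println(count)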

Re: Cassandra/Spark failing to process large table

2018-03-03 Thread Kant Kodali
The fact that cqlsh itself gives different results tells me that this has nothing to do with Spark. Moreover, the Spark results are monotonically increasing, which seems more consistent than cqlsh, so I believe Spark can be taken out of the equation. Now, while you are running these queries is

Cassandra/Spark failing to process large table

2018-03-02 Thread Faraz Mateen
Hi everyone, I am trying to use Spark to process a large Cassandra table (~402 million entries and 84 columns), but I am getting inconsistent results. Initially the requirement was to copy some columns from this table to another table. After copying the data, I noticed that some entries in the

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Jon Haddad
nt = 1 > > > > From: Cassa L <lcas...@gmail.com> > Date: Friday, October 27, 2017 at 1:50 AM > To: Jörn Franke <jornfra...@gmail.com> > Cc: user <u...@spark.apache.org>, <user@cassandra.apache.org> > Subject: Re: Why don't I see my spark jobs run

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Thakrar, Jayesh
Cassa L <lcas...@gmail.com> Date: Friday, October 27, 2017 at 1:50 AM To: Jörn Franke <jornfra...@gmail.com> Cc: user <u...@spark.apache.org>, <user@cassandra.apache.org> Subject: Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster? No, I don't

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Cassa L
No, I don't use YARN. This is the standalone Spark that comes with the DataStax Enterprise version of Cassandra. On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke wrote: > Do you use YARN? Then you need to configure the queues with the right > scheduler and method. > On 27. Oct
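For what it's worth, in standalone mode the first application grabs every core by default, which serializes later submissions; a sketch of capping an application so others can run alongside it (the numbers are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("job-a")
      .set("spark.cores.max", "4")        // leave cores free so a second app gets executors
      .set("spark.executor.memory", "2g") // illustrative sizing
    val sc = new SparkContext(conf)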

Re: Cassandra & Spark

2017-06-08 Thread Tobias Eriksson
qvantel.com> Cc: 한 승호 <shha...@outlook.com>, "user@cassandra.apache.org" <user@cassandra.apache.org> Subject: Re: Cassandra & Spark Interesting Tobias, when you said "Instead we transferred the data to Apache Kudu", did you transfer all Cassandra d

Re: Cassandra & Spark

2017-06-08 Thread DuyHai Doan
difference! > “my two cents!” > -Tobias > *From:* 한 승호 <shha...@outlook.com> > *Date:* Thursday, 8 June 2017 at 10:25 > *To:* "user@cassandra.apache.org" <user@cassandra.apache.org> > *Subject:* Cassandra & Spark

Re: Cassandra & Spark

2017-06-08 Thread Tobias Eriksson
Spark. Instead we transferred the data to Apache Kudu, and then we ran our analysis on Kudu, and what a difference! “my two cents!” -Tobias From: 한 승호 <shha...@outlook.com> Date: Thursday, 8 June 2017 at 10:25 To: "user@cassandra.apache.org" <user@cassandra.apache.org>

Re: Cassandra & Spark

2017-06-08 Thread Kant Kodali
If you use containers like Docker, Plan A can work provided you do the resource and capacity planning. I tend to think that Plan B is more standard and easier, although you can wait to hear from others for a second opinion. Caution: data locality will make sense if the disk throughput is

Cassandra & Spark

2017-06-08 Thread 한 승호
Hello, I am Seung-ho and I work as a Data Engineer in Korea. I need some advice. My company is considering replacing an RDBMS-based system with Cassandra and Hadoop. The purpose of this system is to analyze Cassandra and HDFS data with Spark. It seems many use cases put emphasis on data

Re: Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-23 Thread Edward Ribeiro
which processes data before inserting it into Cassandra, and doesn't use Cassandra as a temporary store. 2015-07-23 2:04 GMT+02:00 Renato Perini renato.per...@gmail.com: Problem: Log analytics. Solutions: 1) Aggregating logs using Flume and storing the aggregations into Cassandra. Spark

Re: Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-23 Thread Ipremyadav
. Solutions: 1) Aggregating logs using Flume and storing the aggregations into Cassandra. Spark reads data from Cassandra, makes some computations, and writes the results to distinct tables, still in Cassandra. 2) Aggregating logs using Flume to a sink, streaming data directly into Spark

Re: Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-22 Thread Pierre Devops
Perini renato.per...@gmail.com: Problem: Log analytics. Solutions: 1) Aggregating logs using Flume and storing the aggregations into Cassandra. Spark reads data from Cassandra, makes some computations, and writes the results to distinct tables, still in Cassandra. 2) Aggregating logs

Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-22 Thread Renato Perini
Problem: Log analytics. Solutions: 1) Aggregating logs using Flume and storing the aggregations into Cassandra. Spark reads data from Cassandra, makes some computations, and writes the results to distinct tables, still in Cassandra. 2) Aggregating logs using Flume to a sink

Re: RDD partitions per executor in Cassandra Spark Connector

2015-03-03 Thread Carl Yeksigian
These questions would be better addressed to the Spark Cassandra Connector mailing list, which can be found here: https://github.com/datastax/spark-cassandra-connector/#community Thanks, Carl On Tue, Mar 3, 2015 at 4:42 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, is there a paper or

RDD partitions per executor in Cassandra Spark Connector

2015-03-02 Thread Rumph, Frens Jan
Hi all, I didn't find the *issues* button on https://github.com/datastax/spark-cassandra-connector/ so I'm posting here. Anyone have an idea why token ranges are grouped into one partition per executor? I expected at least one per core. Any suggestions on how to work around this? Doing a
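A possible workaround in the meantime (treat this as a sketch; the property name varies by connector version): the partition count is driven by the connector's split size, not by core count, and shrinking the split produces more, smaller partitions:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("more-partitions")
      // Early connector releases use spark.cassandra.input.split.size;
      // later ones use spark.cassandra.input.split.size_in_mb. Smaller splits => more tasks.
      .set("spark.cassandra.input.split.size_in_mb", "16")

    val sc = new SparkContext(conf)
    val rdd = sc.cassandraTable("ks", "tbl") // placeholder keyspace/table
    println(rdd.partitions.length)           // inspect the resulting parallelism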

Re: Running Cassandra + Spark on AWS - architecture questions

2015-02-23 Thread Clint Kelly
, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi all, I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed workload Cassandra + Spark installation would look like, especially on AWS. What I gather is that you use OpsCenter to set up the following

Re: Running Cassandra + Spark on AWS - architecture questions

2015-02-22 Thread Eric Stevens
aggregated table, it is less IO intensive than the analytics DC with lots of reads and writes to compute aggregations. On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi all, I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed workload Cassandra

Running Cassandra + Spark on AWS - architecture questions

2015-02-20 Thread Clint Kelly
Hi all, I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed workload Cassandra + Spark installation would look like, especially on AWS. What I gather is that you use OpsCenter to set up the following: - One virtual data center for real-time processing (e.g., ingestion
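One concrete piece of that setup is keyspace replication spanning the two virtual data centers, so analytics reads never touch the real-time DC; a sketch via the Java driver, with the DC names and replication factors assumed:

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()
    // "Cassandra" = real-time DC, "Analytics" = Spark DC (names and factors are assumptions).
    session.execute(
      "CREATE KEYSPACE IF NOT EXISTS ks WITH replication = " +
      "{'class': 'NetworkTopologyStrategy', 'Cassandra': 3, 'Analytics': 2}")
    session.close()
    cluster.close()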

Re: Running Cassandra + Spark on AWS - architecture questions

2015-02-20 Thread DuyHai Doan
intensive than the analytics DC with lots of reads and writes to compute aggregations. On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi all, I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed workload Cassandra + Spark installation would look

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-04 Thread Hugo José Pinto
Many thanks once again. I rethought the target data structure, and things started coming together to allow for really elegant, compact ESP preprocessing and storage. Best. Sent from my iPhone. On 03/01/2015, at 23:53, Peter Lin wool...@gmail.com wrote: if you like SQL dialect, try

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Peter Lin
It looks like you're using the wrong tool and architecture. If the use case really needs continuous-query-like event processing, use an ESP product to do that. You can still store data in Cassandra for persistence. The design you want is to have two paths: event stream and persistence. At the

Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Hugo José Pinto
Hello. We're currently using Hazelcast (http://hazelcast.org/) as a distributed in-memory data grid. That's been working sort-of-well for us, but going solely in-memory has exhausted its path in our use case, and we're considering porting our application to a NoSQL persistent store. After the

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread DuyHai Doan
Hello Hugo, I was facing the same kind of requirement from some users. Long story short, below are the possible strategies, with the advantages and drawbacks of each: 1) Put Spark in front of the back-end; every incoming modification/update/insert goes into Spark first, then Spark will forward it to

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Jabbar Azam
Hello, Or you can have a look at Akka http://www.akka.io for event processing and use Cassandra for persistence (Peter's suggestion). On Sat Jan 03 2015 at 11:59:45 AM Peter Lin wool...@gmail.com wrote: It looks like you're using the wrong tool and architecture. If the use case really needs

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Hugo José Pinto
Thank you all for your answers. It seems I'll have to go with some event-driven processing before/during the Cassandra write path. My concern would be that I'd love to first guarantee the disk write of the Cassandra persistence and then do the event processing (which is mostly CRUD

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Colin
Use a message bus with a transactional get: get the message, send it to Cassandra, and upon write success, submit it to the ESP and commit the get on the bus. Messaging systems like RabbitMQ support this semantic. Using Cassandra as a queuing mechanism is an anti-pattern. -- Colin Clark +1-320-221-9531 On Jan 3,
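A sketch of Colin's sequence with the RabbitMQ Java client from Scala (the queue name and the two downstream calls are placeholders, not a real pipeline):

    import com.rabbitmq.client.ConnectionFactory

    def writeToCassandra(msg: String): Unit = println(s"cassandra write: $msg") // stand-in for the durable write
    def submitToEsp(msg: String): Unit = println(s"esp submit: $msg")           // stand-in for the ESP hand-off

    val factory = new ConnectionFactory()
    factory.setHost("localhost")
    val conn = factory.newConnection()
    val channel = conn.createChannel()

    // autoAck = false: the message stays on the queue until explicitly acked,
    // so a crash before the ack redelivers rather than loses the event.
    val resp = channel.basicGet("events", false) // "events" is a hypothetical queue
    if (resp != null) {
      val body = new String(resp.getBody, "UTF-8")
      writeToCassandra(body)                                   // 1) durable write first
      submitToEsp(body)                                        // 2) then event processing
      channel.basicAck(resp.getEnvelope.getDeliveryTag, false) // 3) commit the get
    }
    channel.close()
    conn.close()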

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Peter Lin
Listen to Colin's advice; avoid the temptation of anti-patterns. On Sat, Jan 3, 2015 at 6:10 PM, Colin colpcl...@gmail.com wrote: Use a message bus with a transactional get: get the message, send it to Cassandra, and upon write success, submit it to the ESP and commit the get on the bus. Messaging systems like

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Hugo José Pinto
Thanks :) Duly noted - this is all uncharted territory for us, hence the value of seasoned advice. Best -- Hugo José Pinto On 03/01/2015, at 23:43, Peter Lin wool...@gmail.com wrote: Listen to Colin's advice; avoid the temptation of anti-patterns. On Sat, Jan 3, 2015 at 6:10

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Peter Lin
If you like the SQL dialect, try out products that use StreamSQL to do continuous queries. Esper comes to mind. Google to see what other products support StreamSQL. On Sat, Jan 3, 2015 at 6:48 PM, Hugo José Pinto hugo.pi...@inovaworks.com wrote: Thanks :) Duly noted - this is all uncharted

Re: cassandra + spark / pyspark

2014-09-12 Thread Francisco Madrid-Salvador
Hi Oleg, Connectors don't deal with HA; they rely on Spark for that, so neither the DataStax connector, Stratio Deep, nor Calliope has anything to do with Spark's HA. You should have previously configured Spark so that it meets your high-availability needs. Furthermore, as I mentioned in a

Re: cassandra + spark / pyspark

2014-09-11 Thread abhinav chowdary
Adding to the conversation... there are 3 great open source options available: 1. Calliope http://tuplejump.github.io/calliope/ This is the first library that was out, some time late last year (as I recall), and I have been using it for a while; mostly very stable, uses Hadoop I/O in

Re: cassandra + spark / pyspark

2014-09-11 Thread DuyHai Doan
2. still uses Thrift for minor stuff -- I think that the only call using Thrift is describe_ring, to get an estimate of the ratio of partition keys within the token range. 3. Stratio has a talk today at the SF Summit, presenting Stratio META. For the folks not attending the conference, video should be

Re: cassandra + spark / pyspark

2014-09-11 Thread Oleg Ruchovets
OK. DataStax and Stratio require Mesos, Hadoop YARN, or another third party to get Spark cluster HA. What about Calliope? Is it sufficient to have Cassandra + Calliope + Spark to be able to process aggregations? In my case we have quite a lot of data, so doing aggregation only in memory -

Re: cassandra + spark / pyspark

2014-09-11 Thread Rohit Rai
Hi Oleg, I am the creator of Calliope. Calliope doesn't force any deployment model... that means you can run it with Mesos or Hadoop or standalone. To be fair, I think the other libs mentioned here should work too. Spark cluster HA can be provided using ZooKeeper even in the standalone

Re: cassandra + spark / pyspark

2014-09-11 Thread Oleg Ruchovets
Thank you Rohit. I sent the email to you. Thanks Oleg. On Thu, Sep 11, 2014 at 10:51 PM, Rohit Rai ro...@tuplejump.com wrote: Hi Oleg, I am the creator of Calliope. Calliope doesn't force any deployment model... that means you can run it with Mesos or Hadoop or Standalone. To be fair I

cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Hi, I'm trying to evaluate different options for Spark + Cassandra and I have a couple of questions. My aim is to use Cassandra + Spark without Hadoop: 1) Is it possible to use only Cassandra as the input/output for PySpark? 2) In case I use Spark (Java, Scala), is it possible to use only
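On question 2, a minimal Scala sketch of Cassandra as both input and output via the DataStax connector, with no Hadoop anywhere (keyspace, tables, and columns are placeholders); the replies below cover the PySpark side:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-only")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read from one table, aggregate, write to another - no HDFS or YARN involved.
    val totals = sc.cassandraTable("ks", "events")
      .map(row => (row.getString("user"), row.getLong("amount")))
      .reduceByKey(_ + _)

    totals.saveToCassandra("ks", "user_totals", SomeColumns("user", "total"))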

cassandra + spark / pyspark

2014-09-10 Thread Francisco Madrid-Salvador
Hi Oleg, If you want to use Cassandra + Spark without Hadoop, perhaps Stratio Deep is your best choice (https://github.com/Stratio/stratio-deep). It's an open-source Spark + Cassandra connector that doesn't make any use of Hadoop or any Hadoop component. http://docs.openstratio.org/deep/0.3.3

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
. 2014 17:35, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I'm trying to evaluate different options for Spark + Cassandra and I have a couple of questions. My aim is to use Cassandra + Spark without Hadoop: 1) Is it possible to use only Cassandra as the input/output for PySpark? 2

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
: Hi Oleg, If you want to use Cassandra + Spark without Hadoop, perhaps Stratio Deep is your best choice (https://github.com/Stratio/stratio-deep). It's an open-source Spark + Cassandra connector that doesn't make any use of Hadoop or any Hadoop component. http://docs.openstratio.org/deep/0.3.3

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
try to evaluate different options for Spark + Cassandra and I have a couple of questions. My aim is to use Cassandra + Spark without Hadoop: 1) Is it possible to use only Cassandra as the input/output for PySpark? 2) In case I use Spark (Java, Scala), is it possible to use only

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
As far as I know, the DataStax connector uses Thrift to connect Spark with Cassandra, although Thrift is already deprecated; could someone confirm this point? -- The Scala connector is using the latest Java driver, so no, there is no Thrift there. For the Java version, I'm not sure; I have not

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Source code check for the Java version: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/RDDJavaFunctions.java#L26 It's using the RDDFunctions from the Scala code, so yes, it's the Java driver again. On Wed, Sep

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Interesting, actually: we have Hadoop in our ecosystem. It has a single point of failure, and I am not sure about inter-data-center replication. The plan is to use Cassandra - no single point of failure, and there is data center replication - with Spark for aggregation/transformation. BUT Storm

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Stupid question: do you really need both Storm and Spark? Can't you implement the Storm jobs in Spark? It will be operationally simpler to have fewer moving parts. I'm not saying that Storm is not the right fit; it may be totally suitable for some usages. But if you want to avoid the SPOF thing

Re: cassandra + spark / pyspark

2014-09-10 Thread Paco Madrid
Good to know. Thanks, DuyHai! I'll take a look (but most probably tomorrow ;-)) Paco 2014-09-10 20:15 GMT+02:00 DuyHai Doan doanduy...@gmail.com: Source code check for the Java version:

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Typo. I am talking about Spark only. Thanks, Oleg. On Thursday, September 11, 2014, DuyHai Doan doanduy...@gmail.com wrote: Stupid question: do you really need both Storm and Spark? Can't you implement the Storm jobs in Spark? It will be operationally simpler to have fewer moving parts. I'm not