>> Which then act as a layer between Cassandra and your applications
>> storing into Cassandra (memory datagrid I think it is called)
>>
>> Basically, think of it as a big cache
>>
>> It is an in-memory thingi ☺
>>
>> And then you can run some super fast queries
>>
>> -Tobias

> From: DuyHai Doan
> Date: Thursday, 8 June 2017 at 15:42
> To: Tobias Eriksson
> Cc: 한 승호, "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Cassandra & Spark
>
> Interesting
Hi Folks:
I am working on a project to save a Spark dataframe to Cassandra and am
getting an exception regarding the row size not being valid (see below). I
tried to trace the code in the connector, and it appears that the row size
(3 below) is different from the column count (which turns out to be 1). I
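A hedged sketch, not from the original post: the connector checks that the
DataFrame's schema matches the target table's columns, so one common cause of
a row-size vs. column-count mismatch is writing a wider DataFrame than the
table defines. Keyspace, table, path, and column names below are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    // Source DataFrame with, say, 3 columns
    val df = spark.read.parquet("/path/to/source")

    // Select exactly the columns the target table defines before saving,
    // so the row size the connector builds matches the table's column count.
    df.select("id")
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "target_table"))
      .save()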
Folks,
We often get questions on monitoring here so I assembled this post with
articles from those in the community as well as links to the component tools to
give folks a more comprehensive listing.
https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance
I think that is probably a question for the Spark Connector forum:
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
as it's much more related to the function of the connector than to the
functionality of Cassandra itself.
Cheers
Ben
On Sat, 17 Mar 2018 at 21:18
I'm querying a single Cassandra partition using sqlContext and its tempView,
which creates more than 2000 tasks on Spark and takes about 360 seconds:
sqlContext.read().format("org.apache.spark.sql.cassandra").options(ops).load.createOrReplaceTempView("tableName")
But using
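A hedged sketch, not from the thread: if the filter restricts the partition
key, the connector can push the predicate down and read just that one
partition instead of scanning every token range. The column name
partition_key and keyspace/table names are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "tableName"))
      .load()
      .filter("partition_key = 'some-value'") // pushed down to a single-partition read

    df.explain() // the physical plan should show the filter pushed into the Cassandra scan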
Note that read repairs only occur for QUORUM/equivalent and higher, and
also with a 10% (default) chance on anything less than QUORUM
(ONE/LOCAL_ONE). This is configured at the table level through the
dclocal_read_repair_chance and read_repair_chance settings (which are going
away in 4.0). So if
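A hedged sketch of adjusting those table-level settings from a Spark job,
assuming a hypothetical table ks.tbl; running the same ALTER TABLE in cqlsh
works just as well.

    import org.apache.spark.SparkConf
    import com.datastax.spark.connector.cql.CassandraConnector

    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
    CassandraConnector(conf).withSessionDo { session =>
      // 10% async read repair within the local DC, none cross-DC (pre-4.0 options)
      session.execute(
        "ALTER TABLE ks.tbl WITH dclocal_read_repair_chance = 0.1 AND read_repair_chance = 0.0")
    }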
Hi Ben,
That makes sense. I also read about "read repairs". So, once an
inconsistent record is read, Cassandra synchronizes its replicas on other
nodes as well. I ran the same Spark query again, this time with the default
consistency level (LOCAL_ONE), and the result was correct.
Thanks again for the
Hi Faraz
Yes, it likely does mean there is inconsistency in the replicas. However,
you shouldn't be too freaked out about it - Cassandra is designed to allow
for this inconsistency to occur, and the consistency levels allow you to
achieve consistent results despite replicas not being consistent. To
Thanks a lot for the response.
Setting consistency to ALL/TWO started giving me consistent count results
on both cqlsh and Spark. As expected, my query time has increased by 1.5x
(before, it was taking ~1.6 hours, but with consistency level ALL the same
query takes ~2.4 hours to complete).
Both cqlsh and the Spark Cassandra connector query at consistency level ONE
(LOCAL_ONE for the Spark connector) by default, so if there is any
inconsistency in your replicas this can result in inconsistent query results.
See http://cassandra.apache.org/doc/latest/tools/cqlsh.html and
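A hedged sketch of raising the connector's read consistency for a Spark
count; the property name is from the Spark Cassandra Connector reference,
while the keyspace/table names are assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.input.consistency.level", "ALL") // default is LOCAL_ONE
    val sc = new SparkContext(conf)

    // With CL=ALL every replica must answer, so the count comes out the same
    // across runs even when replicas have drifted (at the cost of slower reads).
    println(sc.cassandraTable("ks", "tbl").count())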
The fact that cqlsh itself gives different results tells me that this has
nothing to do with Spark. Moreover, the Spark results are monotonically
increasing, which seems more consistent than cqlsh, so I believe Spark
can be taken out of the equation.
Now, while you are running these queries is
Hi everyone,
I am trying to use Spark to process a large Cassandra table (~402 million
entries and 84 columns), but I am getting inconsistent results. Initially
the requirement was to copy some columns from this table to another table.
After copying the data, I noticed that some entries in the
From: Cassa L <lcas...@gmail.com>
Date: Friday, October 27, 2017 at 1:50 AM
To: Jörn Franke <jornfra...@gmail.com>
Cc: user <u...@spark.apache.org>, <user@cassandra.apache.org>
Subject: Re: Why don't I see my spark jobs running in parallel in
Cassandra/Spark DSE cluster?
No, I don't use YARN. This is the standalone Spark that comes with the
DataStax Enterprise version of Cassandra.
On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke wrote:
> Do you use YARN? Then you need to configure the queues with the right
> scheduler and method.
>
> On 27. Oct
To: Tobias Eriksson
Cc: 한 승호 <shha...@outlook.com>, "user@cassandra.apache.org"
<user@cassandra.apache.org>
Subject: Re: Cassandra & Spark
Interesting
Tobias, when you said "Instead we transferred the data to Apache Kudu", did you
transfer all Cassandra d

> Instead we transferred the data to Apache Kudu, and then we ran our analysis
> on Kudu, and what a difference !
>
> "my two cents!"
>
> -Tobias
>
> From: 한 승호 <shha...@outlook.com>
> Date: Thursday, 8 June 2017 at 10:25
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Cassandra & Spark
If you use containers like Docker, Plan A can work provided you do the
resource and capacity planning. I tend to think that Plan B is more
standard and easier, although you can wait to hear from others for a second
opinion.
Caution: data locality will make sense if the disk throughput is
Hello,
I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.
My company is considering replacing an RDBMS-based system with Cassandra and
Hadoop.
The purpose of this system is to analyze Cassandra and HDFS data with Spark.
It seems many use cases put emphasis on data
which processes data before inserting it into Cassandra, and doesn't
use Cassandra as a temporary store.
2015-07-23 2:04 GMT+02:00 Renato Perini renato.per...@gmail.com:
Problem: Log analytics.
Solutions:
1) Aggregating logs using Flume and storing the aggregations into
Cassandra. Spark reads data from Cassandra, makes some computations
and writes the results in distinct tables, still in Cassandra.
2) Aggregating logs using Flume to a sink, streaming data directly
into Spark
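A hedged sketch of option 1, not from the thread: Spark reads the
Flume-written logs from Cassandra, computes a per-level count, and writes
the result to a distinct table. Keyspace, tables, and columns are
illustrative.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val sc = new SparkContext(
      new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1"))

    sc.cassandraTable("logs", "raw")                  // read from Cassandra
      .map(row => (row.getString("level"), 1L))       // e.g. count by log level
      .reduceByKey(_ + _)
      .saveToCassandra("logs", "counts_by_level",     // write back to a distinct table
        SomeColumns("level", "count"))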
These questions would be better addressed to the Spark Cassandra Connector
mailing list, which can be found here:
https://github.com/datastax/spark-cassandra-connector/#community
Thanks,
Carl
On Tue, Mar 3, 2015 at 4:42 AM, Pavel Velikhov pavel.velik...@gmail.com
wrote:
Hi, is there a paper or
Hi all,
I didn't find the *issues* button on
https://github.com/datastax/spark-cassandra-connector/ so I'm posting here.
Anyone have an idea why token ranges are grouped into one partition per
executor? I expected at least one per core. Any suggestions on how to work
around this? Doing a
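A hedged sketch, assuming connector 1.x: the connector groups token ranges
into Spark partitions by estimated data volume, so shrinking the split size
yields more, smaller partitions and hence more parallelism. Keyspace/table
names are made up.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.input.split.size_in_mb", "16") // default 64; smaller -> more partitions
    val sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("ks", "tbl")
    println(rdd.partitions.length) // should grow as the split size shrinks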
aggregated table, it is less IO intensive than the analytics DC with a lot
of reads and writes to compute aggregations.
On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com wrote:
Hi all,
I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed
workload Cassandra + Spark installation would look like, especially on
AWS. What I gather is that you use OpsCenter to set up the following:
- One virtual data center for real-time processing (e.g., ingestion
Many thanks once again.
I rethought the target data structure, and things started coming together to
allow for really elegant, compact ESP preprocessing and storage.
Best.
Sent from my iPhone
On 03/01/2015, at 23:53, Peter Lin wool...@gmail.com wrote:
If you like the SQL dialect, try
It looks like you're using the wrong tool and architecture.
If the use case really needs continuous-query-style event processing, use an ESP
product to do that. You can still store data in Cassandra for persistence.
The design you want is to have two paths: event stream and persistence. At the
Hello.
We're currently using Hazelcast (http://hazelcast.org/) as a distributed
in-memory data grid. That's been working sort-of-well for us, but going
solely in-memory has exhausted its path in our use case, and we're
considering porting our application to a NoSQL persistent store. After the
Hello Hugo
I was facing the same kind of requirement from some users. Long story
short, below are the possible strategies, with the advantages and drawbacks
of each:
1) Put Spark in front of the back-end: every incoming
modification/update/insert goes into Spark first, then Spark will forward
it to
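A hedged sketch of strategy 1, not from the thread: updates stream into
Spark first (here from a socket, purely for illustration) and Spark forwards
each write to Cassandra. Keyspace, table, and columns are assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._
    import com.datastax.spark.connector.SomeColumns

    val ssc = new StreamingContext(
      new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1"), Seconds(1))

    ssc.socketTextStream("localhost", 9999)            // stand-in for the real ingest source
      .map(line => (java.util.UUID.randomUUID(), line))
      .saveToCassandra("ks", "events", SomeColumns("id", "body"))

    ssc.start()
    ssc.awaitTermination()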
Hello,
Or you can have a look at Akka (http://www.akka.io) for event processing and
use Cassandra for persistence (Peter's suggestion).
On Sat Jan 03 2015 at 11:59:45 AM Peter Lin wool...@gmail.com wrote:
It looks like you're using the wrong tool and architecture.
If the use case really needs
Thank you all for your answers.
It seems I'll have to go with some event-driven processing before/during the
Cassandra write path.
My concern would be that I'd love to first guarantee the disk write of the
Cassandra persistence and then do the event processing (which is mostly CRUD
Use a message bus with a transactional get: get the message, send it to
Cassandra, and upon write success submit it to the ESP and commit the get on
the bus. Messaging systems like RabbitMQ support this semantic.
Using Cassandra as a queuing mechanism is an anti-pattern.
--
Colin Clark
+1-320-221-9531
On Jan 3,
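A hedged sketch of that pattern with the RabbitMQ Java client and the
DataStax Java driver; the queue name, keyspace, table, and the ESP stub are
all assumptions, not anything from the thread.

    import com.rabbitmq.client.ConnectionFactory
    import com.datastax.driver.core.Cluster

    val channel = new ConnectionFactory().newConnection().createChannel()
    val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()
    def submitToEsp(body: String): Unit = println(s"ESP <- $body") // placeholder

    val response = channel.basicGet("events", false) // autoAck = false: the "transactional get"
    if (response != null) {
      val body = new String(response.getBody, "UTF-8")
      // Persist to Cassandra first; if this throws, the message is never
      // acked and will be redelivered.
      session.execute("INSERT INTO ks.events (id, body) VALUES (uuid(), ?)", body)
      submitToEsp(body)
      // Commit the get: only now is the message removed from the queue.
      channel.basicAck(response.getEnvelope.getDeliveryTag, false)
    }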
listen to colin's advice, avoid the temptation of anti-patterns.
On Sat, Jan 3, 2015 at 6:10 PM, Colin colpcl...@gmail.com wrote:
Use a message bus with a transactional get, get the message, send to
cassandra, upon write success, submit to esp, commit get on bus. Messaging
systems like
Thanks :)
Duly noted - this is all uncharted territory for us, hence the value of
seasoned advice.
Best
--
Hugo José Pinto
On 03/01/2015, at 23:43, Peter Lin wool...@gmail.com wrote:
listen to colin's advice, avoid the temptation of anti-patterns.
On Sat, Jan 3, 2015 at 6:10
If you like the SQL dialect, try out products that use StreamSQL to do
continuous queries. Esper comes to mind. Google to see what other products
support StreamSQL.
On Sat, Jan 3, 2015 at 6:48 PM, Hugo José Pinto hugo.pi...@inovaworks.com
wrote:
Thanks :)
Duly noted - this is all uncharted
Hi Oleg,
Connectors don't deal with HA, they rely on Spark for that, so neither
the Datastax connector, Stratio Deep nor Calliope have anything to do
with Spark's HA. You should have previously configured Spark so that it
meets your high availability needs. Furthermore, as I mentioned in a
Adding to the conversation...
there are 3 great open source options available:
1. Calliope http://tuplejump.github.io/calliope/
This is the first library that was out, some time late last year (as I
recall), and I have been using this for a while; mostly very stable,
uses Hadoop I/O in
2. still uses Thrift for minor stuff -- I think that the only call using
Thrift is describe_ring, to get an estimate of the ratio of partition keys
within the token range
3. Stratio has a talk today at the SF Summit, presenting Stratio META. For
the folks not attending the conference, the video should be
Ok.
DataStax and Stratio require Mesos, Hadoop YARN, or another third party to
get Spark cluster HA.
What about Calliope?
Is it sufficient to have Cassandra + Calliope + Spark to be able to process
aggregations?
In my case we have quite a lot of data, so doing aggregation only in memory
-
Hi Oleg,
I am the creator of Calliope. Calliope doesn't force any deployment
model... that means you can run it with Mesos or Hadoop or standalone. To
be fair, I think the other libs mentioned here should work too.
The Spark cluster HA can be provided using ZooKeeper even in the standalone
Thank you Rohit.
I sent the email to you.
Thanks
Oleg.
On Thu, Sep 11, 2014 at 10:51 PM, Rohit Rai ro...@tuplejump.com wrote:
Hi Oleg,
I am the creator of Calliope. Calliope doesn't force any deployment
model... that means you can run it with Mesos or Hadoop or Standalone. To
be fair I
Hi,
I'm trying to evaluate different options for Spark + Cassandra, and I have a
couple of questions.
My aim is to use Cassandra + Spark without Hadoop:
1) Is it possible to use only Cassandra as the input/output parameter for
PySpark?
2) In case I use Spark (Java, Scala), is it possible to use only
Hi Oleg,
If you want to use Cassandra + Spark without Hadoop, perhaps Stratio Deep
is your best choice (https://github.com/Stratio/stratio-deep). It's an
open-source Spark + Cassandra connector that doesn't make any use of
Hadoop or any Hadoop component.
http://docs.openstratio.org/deep/0.3.3
As far as I know, the Datastax connector uses Thrift to connect Spark with
Cassandra, although Thrift is already deprecated. Could someone confirm this
point?
-- the Scala connector is using the latest Java driver, so no, there is no
Thrift there.
For the Java version, I'm not sure, have not
Source code check for the Java version:
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/RDDJavaFunctions.java#L26
It's using the RDDFunctions from scala code so yes, it's Java driver again.
On Wed, Sep
Interesting things actually:
We have Hadoop in our ecosystem. It has a single point of failure and I
am not sure about inter-data-center replication.
The plan is to use Cassandra - no single point of failure, and there is data
center replication.
For aggregation/transformation we are using Spark. BUT Storm
Stupid question: do you really need both Storm and Spark? Can't you
implement the Storm jobs in Spark? It will be operationally simpler to
have fewer moving parts. I'm not saying that Storm is not the right fit; it
may be totally suitable for some usages.
But if you want to avoid the SPOF thing
Good to know. Thanks, DuyHai! I'll take a look (but most probably tomorrow
;-))
Paco
2014-09-10 20:15 GMT+02:00 DuyHai Doan doanduy...@gmail.com:
Source code check for the Java version:
Typo. I am talking about Spark only.
Thanks
Oleg.
On Thursday, September 11, 2014, DuyHai Doan doanduy...@gmail.com wrote:
Stupid question: do you really need both Storm and Spark? Can't you
implement the Storm jobs in Spark? It will be operationally simpler to
have fewer moving parts. I'm not