Hi,

What I wanted was a dashboard with graphs/diagrams, and the page should not take minutes to load. That was the problem with Spark on Cassandra: we could not push the parallelization far enough to have the diagrams rendered in seconds. Now, with Kudu, we get decent results rendering the diagrams/graphs.
The way we transfer data from Cassandra (the production system storage) to Kudu is through an Apache Kafka topic (or several topics, actually), and then we have an application which ingests the data into Kudu:

Other Systems --> Domain Storage App(s) --> Cassandra --> Kafka --> Kudu Ingestion App --> Kudu <-- Dashboard App(s)

If you want to play with really fast analytics, then perhaps consider looking at Apache Ignite (https://ignite.apache.org), which acts as a layer between Cassandra and your applications storing into Cassandra (an "in-memory data grid", I think it is called). Basically, think of it as a big cache; it is an in-memory thingi ☺ And then you can run some super fast queries.

-Tobias

From: DuyHai Doan <doanduy...@gmail.com>
Date: Thursday, 8 June 2017 at 15:42
To: Tobias Eriksson <tobias.eriks...@qvantel.com>
Cc: 한 승호 <shha...@outlook.com>, "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Cassandra & Spark

Interesting, Tobias. When you said "Instead we transferred the data to Apache Kudu", did you transfer all the Cassandra data into Kudu with a single migration and then tap into Kudu for aggregation, or did you run a data import every day/week/month from Cassandra into Kudu?

From my point of view, the difficulty is not having a static set of data and running aggregation on it; there are a lot of alternatives out there. The difficulty is being able to run analytics on a live/production/changing dataset, with all the data movement & updates that implies.

Regards

On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:

Hi,

Something to consider before moving to Apache Spark and Cassandra: I have a background where we had tons of data in Cassandra, and we wanted to use Apache Spark to run various jobs. We loved what we could do with Spark, BUT…
We realized soon that we wanted to run multiple jobs in parallel. Some jobs would take 30 minutes and some 45 seconds. Spark is by default arranged so that it takes up all the resources there are; this can be tweaked by using Mesos or YARN. But even with Mesos and YARN we found it complicated to run multiple jobs in parallel. So eventually we ended up throwing out Spark. Instead we transferred the data to Apache Kudu and ran our analysis on Kudu, and what a difference!

"my two cents!"

-Tobias

From: 한 승호 <shha...@outlook.com>
Date: Thursday, 8 June 2017 at 10:25
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Cassandra & Spark

Hello,

I am Seung-ho, and I work as a data engineer in Korea. I need some advice.

My company is recently considering replacing an RDBMS-based system with Cassandra and Hadoop. The purpose of this system is to analyze Cassandra and HDFS data with Spark.

It seems many use cases put emphasis on data locality; for instance, both Cassandra and the Spark executor should be on the same node. The thing is, my company's data analyst team wants to analyze heterogeneous data sources, Cassandra and HDFS, using Spark. So I wonder what the best practice would be for using Cassandra and Hadoop in such a case.

Plan A: Both HDFS and Cassandra with NodeManager (Spark executor) on the same node
Plan B: Cassandra + NodeManager / HDFS + NodeManager on separate nodes, but in the same cluster

Which would be better or more correct, or is there a better way?

I appreciate your advice in advance :)

Best Regards,
Seung-Ho Han

Sent from Mail for Windows 10 <https://go.microsoft.com/fwlink/?LinkId=550986>
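[Editor's note] For concreteness, the Kudu ingestion hop Tobias describes earlier in the thread (Cassandra --> Kafka --> Kudu Ingestion App --> Kudu) amounts to turning each Kafka change event into an upsert keyed on the Kudu table's primary key, so that replaying a topic is idempotent. Below is a dependency-free Python sketch of that per-record logic only; all field names are hypothetical, and a real ingestion app would use the Kafka consumer and Kudu client libraries instead of the in-memory dict standing in for the table here.

```python
import json

# Hypothetical schema: the Kudu table is keyed on (customer_id, event_time).
PRIMARY_KEY = ("customer_id", "event_time")

def event_to_upsert(raw_message: bytes) -> dict:
    """Turn one Kafka message (a JSON change event exported from Cassandra)
    into a Kudu-style upsert: a primary-key tuple plus column values to set."""
    event = json.loads(raw_message)
    key = tuple(event[k] for k in PRIMARY_KEY)
    columns = {k: v for k, v in event.items() if k not in PRIMARY_KEY}
    return {"key": key, "columns": columns}

def apply_upserts(table: dict, messages: list) -> dict:
    """Apply upserts in topic order; a later event for the same key simply
    overwrites the earlier column values, which makes re-ingestion after a
    consumer restart idempotent."""
    for msg in messages:
        op = event_to_upsert(msg)
        table.setdefault(op["key"], {}).update(op["columns"])
    return table

if __name__ == "__main__":
    msgs = [
        json.dumps({"customer_id": 1, "event_time": 100, "plan": "basic"}).encode(),
        json.dumps({"customer_id": 1, "event_time": 100, "plan": "premium"}).encode(),
    ]
    table = apply_upserts({}, msgs)
    print(table)  # {(1, 100): {'plan': 'premium'}}
```

Because every event carries its full primary key, the same logic covers both the one-off bulk migration and the continuous day-to-day import that DuyHai asks about: replaying the topic from the beginning converges on the same table state.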