Re: share datasets across multiple spark-streaming applications for lookup

JG Perrin Thu, 02 Nov 2017 06:45:36 -0700

Or Databricks Delta (announced at Spark Summit) or IBM Event Store, depending 
on the use case.


On Oct 31, 2017, at 14:30, Joseph Pride 
<jos...@versanalytics.com<mailto:jos...@versanalytics.com>> wrote:

Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It 
marries Spark with a co-located in-memory GemFire (or something gem-related) 
database. So you can access the data with SQL, JDBC, ODBC (if you wanna go 
Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM 
on each machine, so you get data locality instead of a bottleneck between load, 
save, and compute. The data is supposed to persist between applications, 
cluster startups, or multiple applications doing stuff to the data at the same 
time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty 
good.

—Joe Pride

On Oct 31, 2017, at 11:14 AM, Gene Pang 
<gene.p...@gmail.com<mailto:gene.p...@gmail.com>> wrote:

Hi,

Alluxio enables sharing dataframes across different applications. This blog 
post<https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio> 
talks about dataframes and Alluxio, and this Spark Summit 
presentation<https://spark-summit.org/2017/events/best-practices-for-using-alluxio-with-apache-spark/>
 has additional information.

Thanks,
Gene

On Tue, Oct 31, 2017 at 6:04 PM, Revin Chalil 
<rcha...@expedia.com<mailto:rcha...@expedia.com>> wrote:
Any info on the below will be really appreciated.

I read about Alluxio and Ignite. Has anybody used any of them? Do they work 
well with multiple Apps doing lookups simultaneously? Are there better options? 
Thank you.

From: roshan joe <impdocs2...@gmail.com<mailto:impdocs2...@gmail.com>>
Date: Monday, October 30, 2017 at 7:53 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: share datasets across multiple spark-streaming applications for lookup

Hi,

What is the recommended way to share datasets across multiple spark-streaming 
applications, so that the incoming data can be looked up against this shared 
dataset?

The shared dataset is also incrementally refreshed and stored on S3. Below is 
the scenario.

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.


Streaming App-3 consumes data from Source-3, needs to look up against DS-1 and 
DS-2 and write to DS-3 in S3.
Streaming App-4 consumes data from Source-4, needs to look up against DS-1 and 
DS-2 and write to DS-3 in S3.
Streaming App-n consumes data from Source-n, needs to look up against DS-1 and 
DS-2 and write to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming 
apps. Any input is appreciated. Thank you!
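
To make the scenario concrete, here is a minimal sketch of the per-application lookup logic in plain Python. It is not Spark code and not a recommendation from this thread. It only models an in-memory lookup table that reloads itself once it is older than a chosen age, on the assumption that each app re-reads DS-1 and DS-2 periodically. The names `RefreshingLookup`, `loader`, `max_age_s`, and `enrich` are hypothetical, and `loader` stands in for whatever actually reads a dataset from S3.

```python
import time
from typing import Any, Callable, Dict, Optional


class RefreshingLookup:
    """In-memory lookup table that reloads itself once it is max_age_s old.

    `loader` is a hypothetical callable returning the current contents of a
    shared dataset (e.g. DS-1) as a dict; how it reads S3 is out of scope.
    """

    def __init__(
        self,
        loader: Callable[[], Dict[str, Any]],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._max_age_s = max_age_s
        self._clock = clock
        self._table: Dict[str, Any] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Any:
        """Return the value for key, reloading the table first if it is stale."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age_s:
            self._table = dict(self._loader())
            self._loaded_at = now
        return self._table.get(key)


def enrich(record: Dict[str, Any], ds1: RefreshingLookup, ds2: RefreshingLookup) -> Dict[str, Any]:
    """Attach the DS-1 and DS-2 lookup results to one incoming record."""
    key = record["key"]
    return {**record, "ds1": ds1.get(key), "ds2": ds2.get(key)}
```

Each streaming app (App-3, App-4, ..., App-n) would hold its own `RefreshingLookup` for DS-1 and DS-2 and call `enrich` per incoming record. This keeps the apps independent, at the cost of every app holding its own copy of the data, which is the duplication a shared store would avoid.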
