Hello,

I have the following scenario and was wondering if I can use Spark to
address it.

I want to query two different data stores (say, ElasticSearch and MySQL)
and then merge the two result sets on a join key they have in common. Is
it appropriate to use Spark for this join if the intermediate data sets
are large? (This is a no-ETL scenario.)
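To make the scenario concrete, here is a rough sketch of the kind of job I
have in mind. It assumes the elasticsearch-hadoop (elasticsearch-spark)
connector and the MySQL JDBC driver are on the classpath; the hosts, the
"orders" index, the "customers" table and the "customer_id" join key are
all made up for illustration:

import org.apache.spark.sql.SparkSession

object CrossStoreJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-mysql-join")
      .master("local[*]")   // local master just for trying it out
      .getOrCreate()

    // Elasticsearch side, read through the elasticsearch-hadoop connector
    // (host and index name are placeholders)
    val esDf = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")
      .load("orders")

    // MySQL side, read through Spark's built-in JDBC source
    // (database, table and credentials are placeholders)
    val mysqlDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/shop")
      .option("dbtable", "customers")
      .option("user", "spark")
      .option("password", "secret")
      .load()

    // Join on the common key; both reads are split into partitions that
    // the executors pull directly from the two stores
    val joined = esDf.join(mysqlDf, Seq("customer_id"))

    joined.show(20)
    spark.stop()
  }
}

My question is essentially whether this pattern is still reasonable when
the two result sets are large.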

I was thinking of two possibilities -

1) Send the intermediate data sets to Spark through a stream and get Spark
to do the join (a rough sketch of what I mean follows after these two
options). The complexity here is that there would be multiple concurrent
streams to deal with. If I don't use streams, there would be intermediate
disk writes and data transfer to the Spark master.

2) Don't use Spark and do the same with some in-memory distributed engine
like MemSQL or Redis.
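To show what I mean by option 1, here is a rough sketch using Structured
Streaming's stream-stream join (so it assumes a Spark version that supports
that). The "rate" sources are only stand-ins for whatever would actually
carry the two result sets (e.g. Kafka topics), and the key and column names
are invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

object StreamStreamJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-stream-join")
      .master("local[*]")   // local master just for trying it out
      .getOrCreate()

    // Stand-in for the stream carrying the Elasticsearch result set
    val esStream = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()
      .select(col("timestamp").as("esTime"), (col("value") % 100).as("esKey"))
      .withWatermark("esTime", "10 minutes")

    // Stand-in for the stream carrying the MySQL result set
    val mysqlStream = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()
      .select(col("timestamp").as("mysqlTime"), (col("value") % 100).as("mysqlKey"))
      .withWatermark("mysqlTime", "10 minutes")

    // Inner join on the key; the watermarks plus the time bound let Spark
    // drop old state instead of buffering both streams forever
    val joined = esStream.join(
      mysqlStream,
      expr("esKey = mysqlKey AND " +
           "mysqlTime BETWEEN esTime - INTERVAL 10 MINUTES " +
           "AND esTime + INTERVAL 10 MINUTES"))

    joined.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}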

What's the experts' view on this?

Regards,
Ashish
