I think one of the advantages of using Akka Streams within Spark is that it
is a general-purpose stream-processing toolkit with backpressure, not
specific to Kafka. If the approach works out, Spark could be very useful as
a coordination framework for discrete streams processed on each executor.
I've been toying with the idea of building what is essentially an RDD of
task messages, where each partition becomes an Akka stream that is
materialized on an executor and run to completion as that executor's
'task', letting Spark coordinate the completion of the entire job. For
example, I might make an RDD that is just a set of URLs I want to download
and produce to Kafka, but suppose there are so many URLs that I need to
spread that work across many servers. Inside a foreachPartition block, I
could set up an Akka stream to accomplish that task in a backpressured,
stream-oriented way, so that the entire Spark job completes only once every
URL has been produced to Kafka, with an individual Akka stream running
inside each executor.
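
To make that concrete, here is a rough, untested sketch of the shape I
have in mind. It assumes akka-stream and the akka-stream-kafka (Reactive
Kafka) connector are on the executor classpath; fetchUrl, the topic name,
and the broker address are placeholders of mine, not anything these
projects define:

  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import akka.stream.scaladsl.Source
  import akka.kafka.ProducerSettings
  import akka.kafka.scaladsl.Producer
  import org.apache.kafka.clients.producer.ProducerRecord
  import org.apache.kafka.common.serialization.StringSerializer
  import org.apache.spark.rdd.RDD
  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration

  val urlsRdd: RDD[String] = ??? // the RDD of URLs, however it is built

  urlsRdd.foreachPartition { urls =>
    implicit val system = ActorSystem("url-downloader")
    implicit val materializer = ActorMaterializer()

    val settings =
      ProducerSettings(system, new StringSerializer, new StringSerializer)
        .withBootstrapServers("broker:9092")

    // Placeholder for the actual HTTP download; mapAsync bounds the
    // number of requests in flight, which is where the backpressure
    // comes from.
    def fetchUrl(url: String): Future[String] = ???

    val done = Source.fromIterator(() => urls)
      .mapAsync(parallelism = 8)(fetchUrl)
      .map(body => new ProducerRecord[String, String]("downloads", body))
      .runWith(Producer.plainSink(settings))

    // Block until this partition's stream finishes, so the Spark task
    // (and therefore the whole job) completes only once everything has
    // been produced to Kafka.
    Await.result(done, Duration.Inf)
    system.terminate()
  }

Each executor would run its partition's stream to completion, and Spark's
normal task accounting would tell me when the whole job is done.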

I realize that this is not the original question on this thread, and I
don't mean to hijack it. I am also interested in the potential of Akka
Streams sources feeding a Spark Streaming job directly, which could be
adapted for both Kafka and non-Kafka use cases; my emphasis is on the
cases that aren't Kafka-specific. Some portions feel like a bit of a
mismatch, but with Structured Streaming I think there is greater
opportunity for some kind of symbiotic adapter layer on the input side of
things. The Apache Gearpump
<https://gearpump.apache.org/overview.html> project in incubation may
demonstrate how this adaptation can be approached, and the nascent Alpakka
project <https://github.com/akka/alpakka> is an example of the
general-purpose applications of Akka Streams.
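
As a toy illustration of the adapter idea on the DStream side (the
Structured Streaming Source API would be the more interesting target), a
custom Receiver could simply hand an Akka stream's elements to store().
This is an untested sketch with names of my own invention, and it also
shows the mismatch I mean: store() gives the Akka stream no backpressure
signal.

  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import akka.stream.scaladsl.{Sink, Source}
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class AkkaStreamReceiver(makeSource: () => Source[String, _])
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    @volatile private var system: ActorSystem = _

    def onStart(): Unit = {
      system = ActorSystem("akka-stream-receiver")
      implicit val materializer = ActorMaterializer()(system)
      // Push every element into Spark's receiver buffer; nothing here
      // slows the stream down when Spark falls behind.
      makeSource().runWith(Sink.foreach[String](elem => store(elem)))
    }

    def onStop(): Unit = {
      if (system != null) system.terminate()
    }
  }

  // e.g. with a dummy source:
  // val lines = ssc.receiverStream(
  //   new AkkaStreamReceiver(() => Source(1 to 100).map(_.toString)))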

It is important to note that Akka Streams is billed as a toolbox rather
than a framework, because it does not handle coordination of parallelism
or multi-host concurrency. I think Spark could end up being a very
convenient framework for this aspect of a distributed application's
architecture. It may be possible to do some of this without modifying
either project, but I haven't actually attempted the implementation yet.


> On Nov 12, 2016, at 9:42 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi Luciano,
> 
> Mind sharing why to have a structured streaming source/sink for Akka
> if Kafka's available and Akka Streams has a Kafka module? #curious
> 
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Sat, Nov 12, 2016 at 4:07 PM, Luciano Resende <luckbr1...@gmail.com> wrote:
>> If you are interested in Akka streaming, it is being maintained in Apache
>> Bahir. For Akka there isn't a structured streaming version yet, but we would
>> be interested in collaborating in the structured streaming version for sure.
>> 
>> On Thu, Nov 10, 2016 at 8:46 AM shyla deshpande <deshpandesh...@gmail.com>
>> wrote:
>>> 
>>> I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka,
>>> Spark Streaming and Cassandra using Structured Streaming. But the kafka
>>> source support for Structured Streaming is not yet available. So now I am
>>> trying to use Akka Stream as the source to Spark Streaming.
>>> 
>>> Want to make sure I am heading in the right direction. Please direct me to
>>> any sample code and reading material for this.
>>> 
>>> Thanks
>>> 
>> --
>> Sent from my Mobile device
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
