I think one of the advantages of using Akka Streams within Spark is that it is a general-purpose stream-processing toolset with backpressure, not specific to Kafka. If the approach works out, Spark could be a great coordination framework for discrete streams processed on each executor. I've been toying with the idea of building essentially an RDD of task messages, where each task becomes an Akka Stream that is materialized on an executor and run to completion as that executor's task, letting Spark coordinate the completion of the entire job. For example, I might make an RDD that is just a set of URLs I want to download and produce to Kafka, but with so many URLs that I need to spread the work across many servers. Inside a foreachPartition block, I could set up an Akka Stream to do that work in a backpressured, stream-oriented way, so that the Spark job as a whole completes once every URL has been produced to Kafka by the individual streams running on each executor.
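To make that concrete, here is a rough, untested sketch of what I have in mind. It assumes the akka-stream-kafka (Reactive Kafka) connector is on the classpath; sc is the SparkContext, allUrls, the topic name, and the broker address are placeholders I made up, and fetchUrl is a naive stand-in for a real HTTP client:

import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

sc.parallelize(allUrls).foreachPartition { urls =>
  // One stream per Spark task: Spark coordinates the tasks across
  // executors, Akka Streams backpressures the work inside each one.
  implicit val system = ActorSystem("url-downloader")
  implicit val mat = ActorMaterializer()
  implicit val ec: ExecutionContext = system.dispatcher

  // Naive stand-in for a real non-blocking HTTP client (e.g. Akka HTTP).
  def fetchUrl(url: String): Future[String] =
    Future(scala.io.Source.fromURL(url).mkString)

  val settings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("kafka:9092") // placeholder broker address

  val done = Source.fromIterator(() => urls)
    .mapAsync(parallelism = 4)(url => fetchUrl(url).map(body => url -> body))
    .map { case (url, body) => new ProducerRecord("downloads", url, body) }
    .runWith(Producer.plainSink(settings))

  // Block so the Spark task only finishes when the stream has drained;
  // the job as a whole completes when every partition's stream has.
  Await.result(done, Duration.Inf)
  system.terminate()
}

The mapAsync parallelism and the blocking Await are the two knobs that tie the stream's backpressure to the lifetime of the Spark task.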
I realize that this is not the original question on this thread, and I don't mean to hijack it. I am also interested in the potential of Akka Stream sources feeding a Spark Streaming job directly, which could be adapted to both Kafka and non-Kafka use cases; my emphasis is on the use cases that aren't Kafka-specific. Some portions feel like a bit of a mismatch, but with Structured Streaming I think there is greater opportunity for some kind of symbiotic adapter layer on the input side of things (a rough sketch of a receiver-based bridge follows the quoted thread below). The Apache Gearpump <https://gearpump.apache.org/overview.html> project in incubation may demonstrate how this adaptation can be approached, and the nascent Alpakka project <https://github.com/akka/alpakka> is an example of the generic applications of Akka Streams. It is important to note that Akka Streams is billed as a toolbox rather than a framework, because it doesn't handle coordination of parallelism or multi-host concurrency. Spark could end up being a very convenient framework for that aspect of a distributed application's architecture, perhaps without any modification to either project, but I haven't attempted the implementation yet.

> On Nov 12, 2016, at 9:42 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
> Hi Luciano,
>
> Mind sharing why to have a structured streaming source/sink for Akka
> if Kafka's available and Akka Streams has a Kafka module? #curious
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Nov 12, 2016 at 4:07 PM, Luciano Resende <luckbr1...@gmail.com> wrote:
>> If you are interested in Akka streaming, it is being maintained in Apache
>> Bahir. There isn't a structured streaming version for Akka yet, but we
>> would be interested in collaborating on one for sure.
>>
>> On Thu, Nov 10, 2016 at 8:46 AM shyla deshpande <deshpandesh...@gmail.com>
>> wrote:
>>>
>>> I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka,
>>> Spark Streaming and Cassandra with Structured Streaming, but the Kafka
>>> source support for Structured Streaming is not yet available. So now I
>>> am trying to use Akka Streams as the source for Spark Streaming.
>>>
>>> I want to make sure I am heading in the right direction. Please direct
>>> me to any sample code and reading material for this.
>>>
>>> Thanks
>>>
>> --
>> Sent from my Mobile device
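P.S. Here is the receiver-based bridge I alluded to above, for feeding a Spark Streaming job from an arbitrary Akka Stream source. Again a rough, untested sketch: the makeSource factory is hypothetical and has to be serializable, since the receiver is shipped to the executors, and store() is invoked one element at a time, so the stream backpressures on however fast Spark accepts data.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class AkkaStreamReceiver(makeSource: () => Source[String, _])
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Created on the executor in onStart, never serialized with the receiver.
  @transient private var system: ActorSystem = _

  override def onStart(): Unit = {
    system = ActorSystem("stream-receiver")
    implicit val mat = ActorMaterializer()(system)
    // Hand each element the stream emits to Spark via store(); onStart
    // must not block, and runWith returns as soon as the stream starts.
    makeSource().runWith(Sink.foreach[String](s => store(s)))
  }

  override def onStop(): Unit =
    if (system != null) system.terminate()
}

// Hypothetical usage with a trivial ticking source:
//   import scala.concurrent.duration._
//   val ssc = new StreamingContext(sc, Seconds(5))
//   val lines = ssc.receiverStream(new AkkaStreamReceiver(() =>
//     Source.tick(0.seconds, 1.second, "ping")))

A proper Structured Streaming source would need offset tracking on top of this, which is where I suspect the real adapter-layer work lies.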