Hey Nilesh,

Great to hear you're using Spark Streaming. In my opinion the crux of your question comes down to what you want to do with the data in the future, and/or whether there is utility in using it from more than one Spark/Streaming job.
1) *One-time use, fire and forget* - as you rightly point out, hooking up to the Akka actors makes sense if the usefulness of the data is short-lived and you don't need the ability to readily go back into archived data.

2) *Fault tolerance & multiple uses* - consider using a message queue like Apache Kafka [1]: write messages from your Akka actors into a Kafka topic with multiple partitions and replication, then use Spark Streaming job(s) to read from Kafka. You can tune Kafka to keep the last *N* days of data online, so if your Spark Streaming job dies it can pick up at the point it left off (see the sketch at the bottom of this message).

3) *Keep indefinitely* - files in HDFS, 'nuff said.

We're currently using (2) Kafka & (3) HDFS to process around 400M "web clickstream events" a week. Everything is written into Kafka and kept 'online' for 7 days, and also written out to HDFS in compressed date-sequential files. We use several Spark Streaming jobs to process the real-time events straight from Kafka. Kafka supports multiple consumers, so each job sees its own view of the message queue and all of its events. If any of the Streaming jobs die or are restarted, they continue consuming from Kafka from the last processed message without affecting any of the other consumer processes.

Best,
MC

[1] http://kafka.apache.org/


On 10 June 2014 13:05, Nilesh Chakraborty <nil...@nileshc.com> wrote:

> Hello!
>
> Spark Streaming supports HDFS as input source, and also Akka actor
> receivers, or TCP socket receivers.
>
> For my use case I think it's probably more convenient to read the data
> directly from Actors, because I already need to set up a multi-node Akka
> cluster (on the same nodes that Spark runs on) and write some actors to
> perform some parallel operations. Writing actor receivers to consume the
> results of my business-logic actors and then feed into Spark is pretty
> seamless. Note that the actors generate a large amount of data (a few GBs
> to tens of GBs).
>
> The other option would be to set up HDFS on the same cluster as Spark,
> write the data from the Actors to HDFS, and then use HDFS as input source
> for Spark Streaming. Does this result in better performance due to data
> locality (with HDFS data replication turned on)? I think performance
> should be almost the same with actors, since Spark workers local to the
> worker actors should get the data fast, and some optimization like this
> is definitely done I assume?
>
> I suppose the only benefit with HDFS would be better fault tolerance, and
> the ability to checkpoint and recover even if the master fails.
>
> Cheers,
> Nilesh
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-of-Akka-or-TCP-Socket-input-sources-vs-HDFS-Data-locality-in-Spark-Streaming-tp7317.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
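
P.S. In case it helps, here's a minimal sketch of what one of the Streaming consumers in (2) could look like, using the Spark 1.x Kafka receiver API (KafkaUtils.createStream). The topic name, ZooKeeper quorum, consumer group id and batch interval below are placeholders for illustration, not our actual settings:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickstreamConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-consumer")
    val ssc  = new StreamingContext(conf, Seconds(10))   // batch interval (placeholder)

    // Each job uses its own consumer group id, so it keeps an independent
    // view of the topic and can be restarted without affecting other jobs.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk1:2181,zk2:2181,zk3:2181",   // ZooKeeper quorum (placeholder)
      "clickstream-job-1",            // consumer group id (placeholder)
      Map("clickstream" -> 4)         // topic -> number of receiver threads
    )

    stream.map(_._2)                  // drop the Kafka key, keep the message body
          .count()
          .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The consumer group id is the important bit: give each Streaming job its own group and it tracks its own offsets in the topic, which is what lets one job die and resume from its last processed message while the others carry on untouched.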