Can anyone share code examples of connecting Spark to ZeroMQ data producers for simple real-time analytics? Even the most basic example would be nice :)
Thanks!

On Mon, Dec 23, 2013 at 2:42 PM, Ognen Duzlevski <[email protected]> wrote:

> Hello, I am new to Spark and have installed it, played with it a bit,
> mostly I am reading through the "Fast data processing with Spark" book.
>
> One of the first things I realized is that I have to learn Scala, the
> real-time data analytics part is not supported by the Python API, correct?
> I don't mind, Scala seems to be a lovely language! :)
>
> Anyways, I would like to set up a data analysis pipeline where I have
> already done the job of exposing a port on the internet (amazon elastic
> load balancer) that feeds real-time data from tens-hundreds of thousands of
> clients in real-time into a set of internal instances which are essentially
> zeroMQ sockets (I do this via mongrel2 and associated handlers).
>
> These handlers can themselves create 0mq sockets to feed data into a
> "pipeline" via a 0mq push/pull, pub/sub or whatever mechanism works best.
>
> One of the pipelines I am evaluating is Spark.
>
> There seems to be information on Spark but for some reason I find it to be
> very Hadoop specific. HDFS is mentioned a lot, for example. What if I don't
> use Hadoop/HDFS?
>
> What do people do when they want to inhale real-time information? Let's
> say I want to use 0mq. Does Spark allow for that? How would I go about
> doing this?
>
> What about "dumping" all the data into a persistent store? Can I dump into
> DynamoDB or Mongo or...? How about Amazon S3? I suppose my 0mq handlers can
> do that upon receipt of data before it "sees" the pipeline but sometimes
> storing intermediate results helps too...
>
> Thanks!
> OD
