Hello, I am new to Spark. I have installed it and played with it a bit, but
mostly I am reading through the "Fast Data Processing with Spark" book.

One of the first things I realized is that I have to learn Scala, since the
real-time data analytics part is not supported by the Python API, correct?
I don't mind, Scala seems to be a lovely language! :)

Anyway, I would like to set up a data analysis pipeline. I have already
exposed a port on the internet (Amazon Elastic Load Balancer) that feeds
real-time data from tens to hundreds of thousands of clients into a set of
internal instances, which are essentially ZeroMQ sockets (I do this via
Mongrel2 and associated handlers).

These handlers can themselves create 0mq sockets to feed data into a
"pipeline" via a 0mq push/pull, pub/sub or whatever mechanism works best.

One of the pipelines I am evaluating is Spark.

There is information on Spark out there, but I find much of it to be very
Hadoop-specific. HDFS is mentioned a lot, for example. What if I don't use
Hadoop/HDFS?

What do people do when they want to ingest real-time information? Let's say
I want to use 0mq. Does Spark allow for that? How would I go about doing
this?

What about "dumping" all the data into a persistent store? Can I dump into
DynamoDB or MongoDB or...? How about Amazon S3? I suppose my 0mq handlers
can do that upon receipt of the data, before it "sees" the pipeline, but
sometimes storing intermediate results helps too...

Thanks!
OD
