Noob Spark questions

Ognen Duzlevski Mon, 23 Dec 2013 12:44:21 -0800

Hello, I am new to Spark. I have installed it and played with it a bit, but
mostly I am reading through the "Fast Data Processing with Spark" book.


One of the first things I realized is that I have to learn Scala, since the
real-time data analytics part is not supported by the Python API, correct?
I don't mind; Scala seems to be a lovely language! :)

Anyway, I would like to set up a data analysis pipeline. I have already
exposed a port on the internet (an Amazon Elastic Load Balancer) that feeds
real-time data from tens to hundreds of thousands of clients into a set of
internal instances, which are essentially ZeroMQ sockets (I do this via
Mongrel2 and its associated handlers).

These handlers can themselves create 0mq sockets to feed data into a
"pipeline" via a 0mq push/pull, pub/sub or whatever mechanism works best.
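
To make the push/pull option concrete, here is a minimal sketch of what one
handler might do, assuming pyzmq is available. The JSON encoding, the record
fields and the endpoint address in the comment are made up for illustration;
they are not part of my actual setup.

```python
import json


def encode_record(record):
    """Serialize one client record as UTF-8 JSON bytes (one 0mq message)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def decode_record(payload):
    """Inverse of encode_record, for whatever sits on the PULL side."""
    return json.loads(payload.decode("utf-8"))


def open_push_socket(endpoint):
    """Create a PUSH socket connected to the pipeline's PULL endpoint."""
    import zmq  # assumption: pyzmq is installed on the handler hosts

    context = zmq.Context.instance()
    sock = context.socket(zmq.PUSH)
    sock.connect(endpoint)
    return sock


def forward(sock, record):
    """Push one record downstream; sock only needs a send(bytes) method."""
    sock.send(encode_record(record))


# Hypothetical usage inside a handler (the endpoint is an assumption):
#   push = open_push_socket("tcp://127.0.0.1:5557")
#   forward(push, {"client_id": 42, "event": "ping"})
```

The consuming end would bind a PULL socket on the same endpoint and call
decode_record on each message it receives. Whether Spark can be that
consumer is exactly what I am asking.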

One of the pipelines I am evaluating is Spark.

There is plenty of information on Spark, but I find most of it to be very
Hadoop-specific. HDFS is mentioned a lot, for example. What if I don't
use Hadoop/HDFS?

What do people do when they want to ingest real-time information? Let's say
I want to use 0mq. Does Spark allow for that? How would I go about doing
this?

What about "dumping" all the data into a persistent store? Can I dump into
DynamoDB or Mongo or...? How about Amazon S3? I suppose my 0mq handlers can
do that upon receipt of data before it "sees" the pipeline but sometimes
storing intermediate results helps too...

Thanks!
OD
