The default zeromq receiver that comes with the Spark repository does NOT
guarantee on which machine the zeromq receiver will be launched; it can be
on any of the worker machines in the cluster, not necessarily on the
application machine (called the "driver" in our terms). So your
understanding of the code and the situation is generally correct (except
the "passing through application machine" part).

So if you want to do distributed receiving, you have to create multiple
zeromq streams, which will create multiple zeromq receivers. The system
tries to spread the receivers across multiple machines. But you have to
make sure that the data is correctly "fanned out" across these multiple
receivers. Other ingestion mechanisms like Kafka actually allow this
natively - Kafka lets the user define "partitions" of a data stream, and
the Kafka system takes care of partitioning the relevant data stream and
sending it to multiple receivers without duplicating the records.
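Concretely, the multiple-streams pattern can be sketched as below. This is
a sketch, not a definitive implementation: the publisher URLs, stream
count, and the bytes-to-records converter are placeholders, and it assumes
the `ZeroMQUtils.createStream` API from Spark's external zeromq module.
You also have to arrange the fan-out on the ZeroMQ side yourself, e.g. by
giving each receiver its own endpoint.

```scala
import akka.util.ByteString
import akka.zeromq.Subscribe
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.zeromq.ZeroMQUtils

object DistributedZeroMQIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DistributedZeroMQIngest")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical endpoints: the upstream system must fan the data out
    // across these, so that each receiver sees a disjoint slice of it.
    val publisherUrls = Seq(
      "tcp://host1:5555",
      "tcp://host2:5555",
      "tcp://host3:5555")

    // Placeholder converter from raw ZeroMQ frames to records.
    def bytesToStrings(bytes: Seq[ByteString]): Iterator[String] =
      bytes.map(_.utf8String).iterator

    // One createStream call per endpoint creates one receiver per
    // endpoint; Spark tries to spread these receivers across workers.
    val streams = publisherUrls.map { url =>
      ZeroMQUtils.createStream(ssc, url, Subscribe(""), bytesToStrings _)
    }

    // Union the per-receiver streams into a single DStream to process.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that, unlike Kafka, nothing here prevents duplicate records if two
endpoints publish the same data - the partitioning discipline is entirely
on the publisher side.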

TD


On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:

> Hi
>
> I am a newbie to spark and spark streaming - I just recently became aware
> of
> it, but it seems really relevant to what I am trying to achieve.
>
> I am looking into using spark streaming, with an input stream from zeromq.
> I am trying to figure out what machine is actually listening on the zeromq
> socket.
>
> If I have a 10-machine cluster, and one machine acting as the "application
> machine", meaning the machine that runs the code I write, and submit jobs
> to
> the cluster - which of these machines will subscribe to the zeromq socket?
> Looking at the code, I see a "Sub" socket is created, meaning all the data
> will pass through the socket, which leads me to believe that the
> application
> machine is the one listening on the socket, and passes all received data to
> the cluster. This means that this single machine might become a
> bottleneck in a high throughput use case.
> A better approach would be to have each node in the cluster listen on a
> "fanout" socket, meaning each node in the cluster receives a part of the
> data.
>
> I am not sure about any of this, as I am not an expert in ZeroMQ, and
> definitely not in spark. If someone can clarify how this works, I'd really
> appreciate it.
> In addition, if there are other input streams that operate in a
> different manner, one that does not require the entire data to pass
> through the "application machine", I'd really appreciate knowing that too.
>
> Thank you very much in advance,
> Amir
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
