Hi TD,

in the case of multiple streams, will the streaming code look like this:

val ssc = ...
(1 to n).foreach { _ =>
  val nwStream = KafkaUtils.createStream(...)

  nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(rdd => saveToCassandra(rdd))
}
ssc.start()

Will this create any problems in execution (like reading/writing broadcast
variables, etc.)?
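
To be concrete, here is the loop shape I have in mind, with local stand-ins
for the Kafka source and the Cassandra sink (createStream and saveToCassandra
below are placeholders of my own for illustration, not Spark or connector API):

```scala
object MultiStreamShape {
  // Stand-in for KafkaUtils.createStream: each call yields an independent
  // "stream" of records, simulating one receiver per loop iteration.
  def createStream(i: Int): Seq[String] = Seq(s"msg-$i-a", s"msg-$i-b")

  // Stand-in sink: collects what each pipeline would write to Cassandra.
  val written = scala.collection.mutable.ArrayBuffer[String]()
  def saveToCassandra(batch: Seq[String]): Unit = written ++= batch

  def main(args: Array[String]): Unit = {
    val n = 3
    // Same loop shape as the sketch above: n independent pipelines,
    // each with its own source, transformation, and sink.
    (1 to n).foreach { i =>
      val nwStream = createStream(i)
      saveToCassandra(nwStream.map(_.toUpperCase))
    }
    println(written.size) // 6: two records from each of the three pipelines
  }
}
```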

Thanks,
Sourav
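
P.S. My mental model of the Kafka "fan out" you described - each record's key
hashing to exactly one partition/receiver, so no record is duplicated - as a
toy sketch (partitionFor is my own illustration, not Kafka's actual
partitioner code):

```scala
object FanOutSketch {
  // Toy key-hash partitioning: a record's key deterministically picks one of
  // numPartitions receivers, so each record goes to exactly one receiver.
  def partitionFor(key: String, numPartitions: Int): Int =
    math.abs(key.hashCode % numPartitions)

  def main(args: Array[String]): Unit = {
    val keys = Seq("user-1", "user-2", "user-3", "user-4")
    val assigned = keys.map(k => k -> partitionFor(k, 3))
    assigned.foreach { case (k, p) => println(s"$k -> receiver $p") }
    // Every key lands on exactly one receiver in the range [0, 3).
    assert(assigned.forall { case (_, p) => p >= 0 && p < 3 })
  }
}
```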


On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das <[email protected]> wrote:

> The default zeromq receiver that comes with the Spark repository does not
> guarantee on which machine the zeromq receiver will be launched; it can be
> on any of the worker machines in the cluster, NOT the application machine
> (called the "driver" in our terms). And your understanding of the code and
> the situation is generally correct (except the "passing through the
> application machine" part).
>
> So if you want to do distributed receiving, you have to create multiple
> zeromq streams, which will create multiple zeromq receivers. The system
> tries to spread the receivers across multiple machines, but you have to
> make sure that the data is correctly "fanned out" across these multiple
> receivers. Other ingestion mechanisms like Kafka actually allow this
> natively - Kafka lets the user define "partitions" of a data stream, and
> the Kafka system takes care of partitioning the relevant data stream and
> sending it to multiple receivers without duplicating the records.
>
> TD
>
>
> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:
>
>> Hi
>>
>> I am a newbie to spark and spark streaming - I just recently became aware
>> of
>> it, but it seems really relevant to what I am trying to achieve.
>>
>> I am looking into using spark streaming, with an input stream from zeromq.
>> I am trying to figure out what machine is actually listening on the zeromq
>> socket.
>>
>> If I have a 10-machine cluster, and one machine acting as the "application
>> machine", meaning the machine that runs the code I write and submits jobs
>> to the cluster - which of these machines will subscribe to the zeromq
>> socket? Looking at the code, I see a "Sub" socket is created, meaning all
>> the data will pass through that socket, which leads me to believe that the
>> application machine is the one listening on the socket, and passes all
>> received data to the cluster. This means that this single machine might
>> become a bottleneck in a high-throughput use case.
>> A better approach would be to have each node in the cluster listen on a
>> "fanout" socket, meaning each node in the cluster receives a part of the
>> data.
>>
>> I am not sure about any of this, as I am not an expert in ZeroMQ, and
>> definitely not in spark. If someone can clarify how this works, I'd really
>> appreciate it.
>> In addition, if there are other input streams that operate in a different
>> manner, one that does not require all the data to pass through the
>> "application machine", I'd really appreciate knowing that too.
>>
>> Thank you very much in advance,
>> Amir
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

[email protected]

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com
