Hi,

The mentioned approach did not work for me. It did create multiple
NetworkReceivers on the workers, but the DAG scheduler is failing with:
NotSerializableException: org.apache.spark.streaming.StreamingContext

Can any one of you help me figure this out?

Thanks,
Sourav
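
A note on the exception above: the usual cause of a
NotSerializableException on StreamingContext is a DStream closure that
captures `ssc`, either directly or through an enclosing class that holds
it. A minimal sketch of the multi-stream pattern whose closures avoid
capturing `ssc` follows; the ZooKeeper address, group id, topic name, and
the println placeholder standing in for saveToCassandra are assumptions,
not details from this thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiStreamApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The closures below must not reference `ssc` (or `this`, if the
    // enclosing instance holds the StreamingContext); otherwise Spark
    // tries to serialize the StreamingContext and the job fails with
    // NotSerializableException.
    (1 to 4).foreach { _ =>
      val stream = KafkaUtils.createStream(
        ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
      stream.map(_._2)
        .flatMap(_.split(" "))
        .map(w => (w, 1L))
        .reduceByKeyAndWindow(_ + _, Seconds(60))
        .foreachRDD { rdd =>
          // Do the output action with a closure that only captures
          // serializable values -- placeholder for saveToCassandra().
          rdd.foreachPartition(_.foreach(println))
        }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```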




On Wed, Feb 19, 2014 at 11:55 AM, Sourav Chandra <
[email protected]> wrote:

> Hi TD,
>
> in case of multiple streams will the streaming code be like:
>
> val ssc = ...
> (1 to n).foreach { _ =>
>   val nwStream = KafkaUtils.createStream(...)
>   nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(saveToCassandra())
> }
> ssc.start()
>
> Will it create any problem in execution (like reading/writing broadcast
> variables, etc.)?
>
> Thanks,
> Sourav
>
>
> On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das <
> [email protected]> wrote:
>
>> The default zeromq receiver that comes with the Spark repository does
>> not guarantee which machine the zeromq receiver will be launched on; it
>> can be on any of the worker machines in the cluster, NOT the application
>> machine (called the "driver" in our terms). And your understanding of the
>> code and the situation is generally correct (except the "passing through
>> the application machine" part).
>>
>> So if you want to do distributed receiving, you have to create multiple
>> zeromq streams, which will create multiple zeromq receivers. The system
>> tries to spread the receivers across multiple machines, but you have to
>> make sure that the data is correctly "fanned out" across these multiple
>> receivers. Other ingestion mechanisms like Kafka actually allow this
>> natively - they let the user define "partitions" of a data stream, and
>> the Kafka system takes care of partitioning the relevant data stream and
>> sending it to multiple receivers without duplicating records.
>>
>> TD
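
The multi-receiver fan-out described above can be sketched as follows;
this is a hedged sketch, with `zkQuorum`, `groupId`, and `topic` as
placeholder values and assuming the KafkaUtils API of that era:

```scala
// One receiver per stream; Kafka's partitioning fans the topic's data
// out across them without duplicating records.
val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> 1))
}
// Union the partial streams back into a single DStream for processing.
val unified = ssc.union(kafkaStreams)
```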
>>
>>
>> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I am a newbie to spark and spark streaming - I just recently became
>>> aware of
>>> it, but it seems really relevant to what I am trying to achieve.
>>>
>>> I am looking into using spark streaming, with an input stream from
>>> zeromq.
>>> I am trying to figure out what machine is actually listening on the
>>> zeromq
>>> socket.
>>>
>>> If I have a 10-machine cluster, and one machine acting as the
>>> "application
>>> machine", meaning the machine that runs the code I write, and submit
>>> jobs to
>>> the cluster - which of these machines will subscribe to the zeromq
>>> socket?
>>> Looking at the code, I see a "Sub" socket is created, meaning all the
>>> data
>>> will pass through the socket, which leads me to believe that the
>>> application
>>> machine is the one listening on the socket, and passes all received data
>>> to
>>> the cluster. This means that this single machine might become a
>>> bottleneck in a high-throughput use case.
>>> A better approach would be to have each node in the cluster listen on a
>>> "fanout" socket, meaning each node in the cluster receives a part of the
>>> data.
>>>
>>> I am not sure about any of this, as I am not an expert in ZeroMQ, and
>>> definitely not in spark. If someone can clarify how this works, I'd
>>> really
>>> appreciate it.
>>> In addition, if there are other input streams that operate in a
>>> different manner and do not require the entire data to pass through the
>>> "application machine", I'd really appreciate knowing that too.
>>>
>>> Thank you very much in advance,
>>> Amir
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>
>
>



-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

[email protected]

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com
