Yes, I still do not understand what is wrong in the code that leads to this.
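The exception in the snippet quoted below usually means a DStream operation's closure is referencing the StreamingContext from its enclosing scope, so Spark tries to serialize the context along with the task. The mechanism can be reproduced without Spark at all; here is a minimal standalone sketch (FakeStreamingContext and ClosureCaptureDemo are made-up stand-ins, not Spark classes):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for StreamingContext: deliberately NOT Serializable,
// just like the real org.apache.spark.streaming.StreamingContext.
class FakeStreamingContext

object ClosureCaptureDemo {
  // Returns true if `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def demo(): (Boolean, Boolean) = {
    val ssc = new FakeStreamingContext

    // Referencing `ssc` inside the function body drags it into the
    // closure's captured state, so serializing the function fails --
    // the same thing happens when a DStream closure touches the context.
    val capturing: Int => Int = x => { ssc.hashCode; x + 1 }

    // A closure that only uses its own parameters serializes cleanly.
    val clean: Int => Int = x => x + 1

    (serializes(capturing), serializes(clean))
  }
}
```

The usual remedy is to make sure the functions passed to flatMap/filter/foreachRDD only use their arguments (or serializable locals), never the StreamingContext or anything holding a reference to it.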
On Wed, Feb 19, 2014 at 9:30 PM, dachuan <[email protected]> wrote:

> I am curious about why StreamingContext needs to be serialized; it's
> supposed to live on the driver node alone.
>
> On Wed, Feb 19, 2014 at 10:29 AM, Sourav Chandra <[email protected]> wrote:
>
>> I am using KafkaDStream and it is saying the StreamingContext class is
>> not serializable.
>>
>> Code snippet:
>>
>> val ssc = new StreamingContext(...)
>> (1 to 4).foreach(i => {
>>   val stream = KafkaUtils.createStream(...)
>>     .flatMap(...)
>>     .reduceByKeyAndWindow(...)
>>     .filter(...)
>>     .foreach(saveToCassandra())
>> })
>>
>> ssc.start()
>>
>> I am using 2 nodes.
>>
>> Thanks,
>> Sourav
>>
>> On Wed, Feb 19, 2014 at 8:21 PM, dachuan <[email protected]> wrote:
>>
>>> I don't know the final answer, but I'd like to discuss your problem.
>>>
>>> How many nodes are you using, and how many NetworkReceivers have you
>>> started?
>>>
>>> Which specific class is not serializable?
>>>
>>> thanks,
>>> dachuan.
>>>
>>> On Wed, Feb 19, 2014 at 9:42 AM, Sourav Chandra <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> The mentioned approach did not work for me. It did create multiple
>>>> NetworkReceivers in the workers, but the DAG scheduler is failing with
>>>> NotSerializableException: org.apache.spark.streaming.StreamingContext
>>>>
>>>> Can anyone help me figure this out?
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>> On Wed, Feb 19, 2014 at 11:55 AM, Sourav Chandra <[email protected]> wrote:
>>>>
>>>>> Hi TD,
>>>>>
>>>>> In case of multiple streams, will the streaming code look like:
>>>>>
>>>>> val ssc = ...
>>>>> (1 to n).foreach {
>>>>>   val nwStream = KafkaUtils.createStream(...)
>>>>>   nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...)
>>>>>     .foreachRDD(saveToCassandra())
>>>>> }
>>>>> ssc.start()
>>>>>
>>>>> Will it create any problem in execution (like reading/writing
>>>>> broadcast variables etc.)?
>>>>>
>>>>> Thanks,
>>>>> Sourav
>>>>>
>>>>> On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das <[email protected]> wrote:
>>>>>
>>>>>> The default zeromq receiver that comes with the Spark repository does
>>>>>> not guarantee which machine the zeromq receiver will be launched on;
>>>>>> it can be on any of the worker machines in the cluster, NOT the
>>>>>> application machine (called the "driver" in our terms). And your
>>>>>> understanding of the code and the situation is generally correct
>>>>>> (except the "passing through the application machine" part).
>>>>>>
>>>>>> So if you want to do distributed receiving, you have to create
>>>>>> multiple zeromq streams, which will create multiple zeromq receivers.
>>>>>> The system tries to spread the receivers across multiple machines.
>>>>>> But you have to make sure that the data is correctly "fanned out"
>>>>>> across these multiple receivers. Other ingestion mechanisms like
>>>>>> Kafka actually allow this natively - Kafka lets the user define
>>>>>> "partitions" of a data stream, and the system takes care of
>>>>>> partitioning the relevant data stream and sending it to multiple
>>>>>> receivers without duplicating the records.
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am a newbie to Spark and Spark Streaming - I just recently became
>>>>>>> aware of it, but it seems really relevant to what I am trying to
>>>>>>> achieve.
>>>>>>>
>>>>>>> I am looking into using Spark Streaming with an input stream from
>>>>>>> zeromq. I am trying to figure out which machine actually listens on
>>>>>>> the zeromq socket.
>>>>>>> If I have a 10-machine cluster, and one machine acting as the
>>>>>>> "application machine", meaning the machine that runs the code I
>>>>>>> write and submits jobs to the cluster - which of these machines will
>>>>>>> subscribe to the zeromq socket?
>>>>>>>
>>>>>>> Looking at the code, I see a "Sub" socket is created, meaning all
>>>>>>> the data will pass through that socket, which leads me to believe
>>>>>>> that the application machine is the one listening on the socket and
>>>>>>> passing all received data to the cluster. This means that this
>>>>>>> single machine might become a bottleneck in a high-throughput use
>>>>>>> case.
>>>>>>>
>>>>>>> A better approach would be to have each node in the cluster listen
>>>>>>> on a "fanout" socket, meaning each node in the cluster receives a
>>>>>>> part of the data.
>>>>>>>
>>>>>>> I am not sure about any of this, as I am not an expert in ZeroMQ,
>>>>>>> and definitely not in Spark. If someone can clarify how this works,
>>>>>>> I'd really appreciate it. In addition, if there are other input
>>>>>>> streams that operate in a different manner, one that does not
>>>>>>> require the entire data to pass through the "application machine",
>>>>>>> I'd really appreciate knowing that too.
>>>>>>>
>>>>>>> Thank you very much in advance,
>>>>>>> Amir
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>
>>>>> --
>>>>> Sourav Chandra
>>>>> Senior Software Engineer
>>>>> [email protected]
>>>>> o: +91 80 4121 8723
>>>>> m: +91 988 699 3746
>>>>> skype: sourav.chandra
>>>>> Livestream
>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main,
>>>>> 3rd Block, Koramangala Industrial Area, Bangalore 560034
>>>>> www.livestream.com
>>>>
>>>
>>> --
>>> Dachuan Huang
>>> Cellphone: 614-390-7234
>>> 2015 Neil Avenue
>>> Ohio State University
>>> Columbus, Ohio
>>> U.S.A. 43210
>>
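TD's point above, that Kafka fans a stream out across receivers by assigning each record to exactly one partition, can be sketched with a toy hash partitioner. This is an illustration of the idea only; FanOutSketch and partitionFor are made-up names, not Kafka or Spark API, and Kafka's real partitioner is more involved:

```scala
object FanOutSketch {
  // Toy key-based partitioner: every key maps to exactly one of
  // numPartitions receivers, so the contract is: one record, one partition.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Split a batch of (key, value) records across receivers; each record
  // lands in exactly one group, so nothing is duplicated.
  def fanOut(records: Seq[(String, String)],
             numReceivers: Int): Map[Int, Seq[(String, String)]] =
    records.groupBy { case (key, _) => partitionFor(key, numReceivers) }
}
```

Because the partition is a pure function of the key, records with the same key always reach the same receiver, and the partition sizes sum to the input size, which is exactly the "fanned out without duplicating the records" behavior described above.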
