I am curious why the StreamingContext needs to be serialized at all; it is supposed to live on the driver node alone.
On Wed, Feb 19, 2014 at 10:29 AM, Sourav Chandra <[email protected]> wrote:

> I am using KafkaDStream and it is saying the StreamingContext class is not
> serializable.
>
> Code snippet:
>
> val ssc = new StreamingContext(...)
> (1 to 4).foreach(i => {
>   val stream = KafkaUtils.createStream(...)
>     .flatMap(...)
>     .reduceByKeyAndWindow(...)
>     .filter(...)
>     .foreach(saveToCassandra())
> })
>
> ssc.start()
>
> I am using 2 nodes.
>
> Thanks,
> Sourav
>
>
> On Wed, Feb 19, 2014 at 8:21 PM, dachuan <[email protected]> wrote:
>
>> I don't know the final answer, but I'd like to discuss your problem.
>>
>> How many nodes are you using, and how many NetworkReceivers have you
>> started?
>>
>> Which specific class is not serializable?
>>
>> thanks,
>> dachuan.
>>
>>
>> On Wed, Feb 19, 2014 at 9:42 AM, Sourav Chandra <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The mentioned approach did not work for me. It did create multiple
>>> NetworkReceivers in each worker, but the DAG scheduler is failing with
>>> NotSerializableException: org.apache.spark.streaming.StreamingContext
>>>
>>> Can any one of you help me figure this out?
>>>
>>> Thanks,
>>> Sourav
>>>
>>>
>>> On Wed, Feb 19, 2014 at 11:55 AM, Sourav Chandra <[email protected]> wrote:
>>>
>>>> Hi TD,
>>>>
>>>> In the case of multiple streams, will the streaming code be like:
>>>>
>>>> val ssc = ...
>>>> (1 to n).foreach { i =>
>>>>   val nwStream = KafkaUtils.createStream(...)
>>>>   nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(saveToCassandra())
>>>> }
>>>> ssc.start()
>>>>
>>>> Will it create any problem in execution (like reading/writing broadcast
>>>> variables etc.)?
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>>
>>>> On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das <[email protected]> wrote:
>>>>
>>>>> The default zeromq receiver that comes with the Spark repository does
>>>>> not guarantee which machine the zeromq receiver will be launched on; it
>>>>> can be on any of the worker machines in the cluster, NOT the application
>>>>> machine (called the "driver" in our terms). And your understanding of the
>>>>> code and the situation is generally correct (except the "passing through
>>>>> the application machine" part).
>>>>>
>>>>> So if you want to do distributed receiving, you have to create multiple
>>>>> zeromq streams, which will create multiple zeromq receivers. The system
>>>>> tries to spread the receivers across multiple machines, but you have to
>>>>> make sure that the data is correctly "fanned out" across these multiple
>>>>> receivers. Other ingestion mechanisms like Kafka actually allow this
>>>>> natively - Kafka lets the user define "partitions" of a data stream, and
>>>>> the Kafka system takes care of partitioning the relevant data stream and
>>>>> sending it to multiple receivers without duplicating the records.
>>>>>
>>>>> TD
>>>>>
>>>>>
>>>>> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a newbie to spark and spark streaming - I just recently became
>>>>>> aware of it, but it seems really relevant to what I am trying to
>>>>>> achieve.
>>>>>>
>>>>>> I am looking into using spark streaming with an input stream from
>>>>>> zeromq. I am trying to figure out which machine is actually listening
>>>>>> on the zeromq socket.
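An aside on the NotSerializableException above: Spark must Java-serialize every closure it ships to the workers, and a closure that merely mentions `ssc` captures the entire StreamingContext. The same mechanism can be shown with plain JVM serialization and no Spark dependency at all; the sketch below uses a hypothetical `FakeStreamingContext` as a stand-in for the real (non-serializable) context, it is not Spark's class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ClosureCaptureDemo {

    // Hypothetical stand-in for StreamingContext: deliberately NOT Serializable.
    static class FakeStreamingContext {
        final int batchSeconds = 10;
    }

    // A serializable function type, analogous to the closures Spark ships to workers.
    interface SerFunction<A, B> extends Function<A, B>, Serializable {}

    // Try to Java-serialize an object the same way Spark serializes task closures.
    static boolean canSerialize(Object obj) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FakeStreamingContext ssc = new FakeStreamingContext();

        // BAD: the closure captures `ssc` itself, so serializing the closure
        // drags the whole non-serializable context along with it.
        SerFunction<Integer, Integer> bad = x -> x + ssc.batchSeconds;

        // GOOD: copy the needed value into a local variable first; the
        // closure then captures only a plain int.
        int seconds = ssc.batchSeconds;
        SerFunction<Integer, Integer> good = x -> x + seconds;

        System.out.println("bad serializable:  " + canSerialize(bad));  // false
        System.out.println("good serializable: " + canSerialize(good)); // true
    }
}
```

The practical fix in the streaming code above is the same idea: avoid referencing `ssc` (or any object that holds it) inside the `flatMap`/`filter`/`foreachRDD` closures, and copy whatever values the closures need into local variables first.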
>>>>>> If I have a 10-machine cluster, and one machine acting as the
>>>>>> "application machine", meaning the machine that runs the code I write
>>>>>> and submits jobs to the cluster - which of these machines will
>>>>>> subscribe to the zeromq socket?
>>>>>> Looking at the code, I see a "SUB" socket is created, meaning all the
>>>>>> data will pass through that socket, which leads me to believe that the
>>>>>> application machine is the one listening on the socket and passes all
>>>>>> received data to the cluster. This means that this single machine
>>>>>> might become a bottleneck in a high-throughput use case.
>>>>>> A better approach would be to have each node in the cluster listen on
>>>>>> a "fanout" socket, meaning each node in the cluster receives a part
>>>>>> of the data.
>>>>>>
>>>>>> I am not sure about any of this, as I am not an expert in ZeroMQ, and
>>>>>> definitely not in Spark. If someone can clarify how this works, I'd
>>>>>> really appreciate it.
>>>>>> In addition, if there are other input streams that operate in a
>>>>>> different manner that does not require the entire data to pass
>>>>>> through the "application machine", I'd really appreciate knowing that
>>>>>> too.
>>>>>>
>>>>>> Thank you very much in advance,
>>>>>> Amir
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
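TD's contrast between a plain SUB socket (every subscriber sees every message) and Kafka-style partitioned ingestion can be illustrated without Spark, Kafka, or ZeroMQ. The sketch below uses hypothetical names (`partitionFor`, `fanOut`) and is not Kafka's actual partitioner, but it shows the invariant that makes distributed receiving work: every record maps to exactly one partition, so running one receiver per partition splits the stream with no duplicated records.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionFanOutDemo {

    // Kafka-style key partitioner: every record maps to exactly one partition,
    // so one receiver per partition splits the stream without duplicates.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Fan a batch of (key, value) records out into per-receiver buckets.
    static Map<Integer, List<Map.Entry<String, String>>> fanOut(
            List<Map.Entry<String, String>> records, int numPartitions) {
        return records.stream().collect(
                Collectors.groupingBy(r -> partitionFor(r.getKey(), numPartitions)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
                Map.entry("user-1", "click"),
                Map.entry("user-2", "view"),
                Map.entry("user-1", "purchase"),
                Map.entry("user-3", "click"));

        Map<Integer, List<Map.Entry<String, String>>> byReceiver = fanOut(records, 4);

        // Every record lands in exactly one bucket: nothing duplicated, nothing dropped.
        int total = byReceiver.values().stream().mapToInt(List::size).sum();
        System.out.println(total == records.size()); // true

        // All records for a given key land on the same receiver.
        boolean sameKeySameBucket = byReceiver.entrySet().stream().allMatch(e ->
                e.getValue().stream().allMatch(r -> partitionFor(r.getKey(), 4) == e.getKey()));
        System.out.println(sameKeySameBucket); // true
    }
}
```

A plain ZeroMQ SUB socket gives every subscriber the full stream, so multiple receivers would duplicate records; the poster's "fanout" idea needs exactly this kind of deterministic record-to-partition mapping on the publishing side.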
--
Sourav Chandra
Senior Software Engineer
[email protected]
o: +91 80 4121 8723
m: +91 988 699 3746
skype: sourav.chandra
Livestream
"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,
Bangalore 560034
www.livestream.com

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210
