I am curious why the StreamingContext needs to be serialized at all; it is supposed to live on the driver node alone.
On Wed, Feb 19, 2014 at 10:29 AM, Sourav Chandra <[email protected]> wrote:

> I am using KafkaDStream and it is saying the StreamingContext class is not
> serializable.
>
> Code snippet:
>
> val ssc = new StreamingContext(...)
> (1 to 4).foreach(i => {
>   val stream = KafkaUtils.createStream(...)
>     .flatMap(...)
>     .reduceByKeyAndWindow(...)
>     .filter(...)
>     .foreach(saveToCassandra())
> })
>
> ssc.start()
>
> I am using 2 nodes.
>
> Thanks,
> Sourav
>
>
> On Wed, Feb 19, 2014 at 8:21 PM, dachuan <[email protected]> wrote:
>
>> I don't know the final answer, but I'd like to discuss your problem.
>>
>> How many nodes are you using, and how many NetworkReceivers have you
>> started?
>>
>> Which specific class is not serializable?
>>
>> thanks,
>> dachuan.
>>
>>
>> On Wed, Feb 19, 2014 at 9:42 AM, Sourav Chandra <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The mentioned approach did not work for me. It did create multiple
>>> NetworkReceivers in each worker, but the DAG scheduler is failing with
>>> NotSerializableException: org.apache.spark.streaming.StreamingContext
>>>
>>> Can any one of you help me figure this out?
>>>
>>> Thanks,
>>> Sourav
>>>
>>>
>>> On Wed, Feb 19, 2014 at 11:55 AM, Sourav Chandra <[email protected]> wrote:
>>>
>>>> Hi TD,
>>>>
>>>> In the case of multiple streams, will the streaming code be like:
>>>>
>>>> val ssc = ...
>>>> (1 to n).foreach { i =>
>>>>   val nwStream = KafkaUtils.createStream(...)
>>>>   nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(saveToCassandra())
>>>> }
>>>> ssc.start()
>>>>
>>>> Will it create any problem in execution (like reading/writing broadcast
>>>> variables etc.)?
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>>
>>>> On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das <[email protected]> wrote:
>>>>
>>>>> The default zeromq receiver that comes with the Spark repository does
>>>>> not guarantee which machine the zeromq receiver will be launched on; it
>>>>> can be on any of the worker machines in the cluster, NOT the application
>>>>> machine (called the "driver" in our terms). And your understanding of the
>>>>> code and the situation is generally correct (except the "passing through
>>>>> the application machine" part).
>>>>>
>>>>> So if you want to do distributed receiving, you have to create multiple
>>>>> zeromq streams, which will create multiple zeromq receivers. The system
>>>>> tries to spread the receivers across multiple machines, but you have to
>>>>> make sure that the data is correctly "fanned out" across these multiple
>>>>> receivers. Other ingestion mechanisms like Kafka actually allow this
>>>>> natively - Kafka lets the user define "partitions" of a data stream, and
>>>>> the Kafka system takes care of partitioning the relevant data stream and
>>>>> sending it to multiple receivers without duplicating the records.
>>>>>
>>>>> TD
>>>>>
>>>>>
>>>>> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a newbie to spark and spark streaming - I just recently became
>>>>>> aware of it, but it seems really relevant to what I am trying to
>>>>>> achieve.
>>>>>>
>>>>>> I am looking into using spark streaming with an input stream from
>>>>>> zeromq. I am trying to figure out which machine is actually listening
>>>>>> on the zeromq socket.
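An aside on the NotSerializableException above: Spark must Java-serialize every closure it ships to the workers, and a closure that merely mentions `ssc` captures the entire StreamingContext. The same mechanism can be shown with plain JVM serialization and no Spark dependency at all; the sketch below uses a hypothetical `FakeStreamingContext` as a stand-in for the real (non-serializable) context, it is not Spark's class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ClosureCaptureDemo {

    // Hypothetical stand-in for StreamingContext: deliberately NOT Serializable.
    static class FakeStreamingContext {
        final int batchSeconds = 10;
    }

    // A serializable function type, analogous to the closures Spark ships to workers.
    interface SerFunction<A, B> extends Function<A, B>, Serializable {}

    // Try to Java-serialize an object the same way Spark serializes task closures.
    static boolean canSerialize(Object obj) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FakeStreamingContext ssc = new FakeStreamingContext();

        // BAD: the closure captures `ssc` itself, so serializing the closure
        // drags the whole non-serializable context along with it.
        SerFunction<Integer, Integer> bad = x -> x + ssc.batchSeconds;

        // GOOD: copy the needed value into a local variable first; the
        // closure then captures only a plain int.
        int seconds = ssc.batchSeconds;
        SerFunction<Integer, Integer> good = x -> x + seconds;

        System.out.println("bad serializable:  " + canSerialize(bad));  // false
        System.out.println("good serializable: " + canSerialize(good)); // true
    }
}
```

The practical fix in the streaming code above is the same idea: avoid referencing `ssc` (or any object that holds it) inside the `flatMap`/`filter`/`foreachRDD` closures, and copy whatever values the closures need into local variables first.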
>>>>>> If I have a 10-machine cluster, and one machine acting as the
>>>>>> "application machine", meaning the machine that runs the code I write
>>>>>> and submits jobs to the cluster - which of these machines will
>>>>>> subscribe to the zeromq socket?
>>>>>> Looking at the code, I see a "SUB" socket is created, meaning all the
>>>>>> data will pass through that socket, which leads me to believe that the
>>>>>> application machine is the one listening on the socket and passes all
>>>>>> received data to the cluster. This means that this single machine
>>>>>> might become a bottleneck in a high-throughput use case.
>>>>>> A better approach would be to have each node in the cluster listen on
>>>>>> a "fanout" socket, meaning each node in the cluster receives a part
>>>>>> of the data.
>>>>>>
>>>>>> I am not sure about any of this, as I am not an expert in ZeroMQ, and
>>>>>> definitely not in Spark. If someone can clarify how this works, I'd
>>>>>> really appreciate it.
>>>>>> In addition, if there are other input streams that operate in a
>>>>>> different manner that does not require the entire data to pass
>>>>>> through the "application machine", I'd really appreciate knowing that
>>>>>> too.
>>>>>>
>>>>>> Thank you very much in advance,
>>>>>> Amir
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
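TD's contrast between a plain SUB socket (every subscriber sees every message) and Kafka-style partitioned ingestion can be illustrated without Spark, Kafka, or ZeroMQ. The sketch below uses hypothetical names (`partitionFor`, `fanOut`) and is not Kafka's actual partitioner, but it shows the invariant that makes distributed receiving work: every record maps to exactly one partition, so running one receiver per partition splits the stream with no duplicated records.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionFanOutDemo {

    // Kafka-style key partitioner: every record maps to exactly one partition,
    // so one receiver per partition splits the stream without duplicates.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Fan a batch of (key, value) records out into per-receiver buckets.
    static Map<Integer, List<Map.Entry<String, String>>> fanOut(
            List<Map.Entry<String, String>> records, int numPartitions) {
        return records.stream().collect(
                Collectors.groupingBy(r -> partitionFor(r.getKey(), numPartitions)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
                Map.entry("user-1", "click"),
                Map.entry("user-2", "view"),
                Map.entry("user-1", "purchase"),
                Map.entry("user-3", "click"));

        Map<Integer, List<Map.Entry<String, String>>> byReceiver = fanOut(records, 4);

        // Every record lands in exactly one bucket: nothing duplicated, nothing dropped.
        int total = byReceiver.values().stream().mapToInt(List::size).sum();
        System.out.println(total == records.size()); // true

        // All records for a given key land on the same receiver.
        boolean sameKeySameBucket = byReceiver.entrySet().stream().allMatch(e ->
                e.getValue().stream().allMatch(r -> partitionFor(r.getKey(), 4) == e.getKey()));
        System.out.println(sameKeySameBucket); // true
    }
}
```

A plain ZeroMQ SUB socket gives every subscriber the full stream, so multiple receivers would duplicate records; the poster's "fanout" idea needs exactly this kind of deterministic record-to-partition mapping on the publishing side.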
--
Sourav Chandra
Senior Software Engineer
[email protected]
o: +91 80 4121 8723
m: +91 988 699 3746
skype: sourav.chandra
Livestream
"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,
Bangalore 560034
www.livestream.com

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210
