Yes, I still do not understand what is wrong in the code that leads to this.
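The exception in the snippet quoted below usually means a DStream operation's closure is referencing the StreamingContext from its enclosing scope, so Spark tries to serialize the context along with the task. The mechanism can be reproduced without Spark at all; here is a minimal standalone sketch (FakeStreamingContext and ClosureCaptureDemo are made-up stand-ins, not Spark classes):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for StreamingContext: deliberately NOT Serializable,
// just like the real org.apache.spark.streaming.StreamingContext.
class FakeStreamingContext

object ClosureCaptureDemo {
  // Returns true if `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def demo(): (Boolean, Boolean) = {
    val ssc = new FakeStreamingContext

    // Referencing `ssc` inside the function body drags it into the
    // closure's captured state, so serializing the function fails --
    // the same thing happens when a DStream closure touches the context.
    val capturing: Int => Int = x => { ssc.hashCode; x + 1 }

    // A closure that only uses its own parameters serializes cleanly.
    val clean: Int => Int = x => x + 1

    (serializes(capturing), serializes(clean))
  }
}
```

The usual remedy is to make sure the functions passed to flatMap/filter/foreachRDD only use their arguments (or serializable locals), never the StreamingContext or anything holding a reference to it.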
On Wed, Feb 19, 2014 at 9:30 PM, dachuan <[email protected]> wrote:

> I am curious about why StreamingContext needs to be serialized; it's
> supposed to live on the driver node alone.
>
> On Wed, Feb 19, 2014 at 10:29 AM, Sourav Chandra <[email protected]> wrote:
>
>> I am using KafkaDStream and it is saying the StreamingContext class is
>> not serializable.
>>
>> Code snippet:
>>
>> val ssc = new StreamingContext(...)
>> (1 to 4).foreach(i => {
>>   val stream = KafkaUtils.createStream(...)
>>     .flatMap(...)
>>     .reduceByKeyAndWindow(...)
>>     .filter(...)
>>     .foreach(saveToCassandra())
>> })
>>
>> ssc.start()
>>
>> I am using 2 nodes.
>>
>> Thanks,
>> Sourav
>>
>> On Wed, Feb 19, 2014 at 8:21 PM, dachuan <[email protected]> wrote:
>>
>>> I don't know the final answer, but I'd like to discuss your problem.
>>>
>>> How many nodes are you using, and how many NetworkReceivers have you
>>> started?
>>>
>>> Which specific class is not serializable?
>>>
>>> thanks,
>>> dachuan.
>>>
>>> On Wed, Feb 19, 2014 at 9:42 AM, Sourav Chandra <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> The mentioned approach did not work for me. It did create multiple
>>>> NetworkReceivers in the workers, but the DAG scheduler is failing with
>>>> NotSerializableException: org.apache.spark.streaming.StreamingContext
>>>>
>>>> Can anyone help me figure this out?
>>>>
>>>> Thanks,
>>>> Sourav
>>>>
>>>> On Wed, Feb 19, 2014 at 11:55 AM, Sourav Chandra <[email protected]> wrote:
>>>>
>>>>> Hi TD,
>>>>>
>>>>> In case of multiple streams, will the streaming code look like:
>>>>>
>>>>> val ssc = ...
>>>>> (1 to n).foreach {
>>>>>   val nwStream = KafkaUtils.createStream(...)
>>>>>   nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...)
>>>>>     .foreachRDD(saveToCassandra())
>>>>> }
>>>>> ssc.start()
>>>>>
>>>>> Will it create any problem in execution (like reading/writing
>>>>> broadcast variables etc.)?
>>>>>
>>>>> Thanks,
>>>>> Sourav
>>>>>
>>>>> On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das <[email protected]> wrote:
>>>>>
>>>>>> The default zeromq receiver that comes with the Spark repository does
>>>>>> not guarantee which machine the zeromq receiver will be launched on;
>>>>>> it can be on any of the worker machines in the cluster, NOT the
>>>>>> application machine (called the "driver" in our terms). And your
>>>>>> understanding of the code and the situation is generally correct
>>>>>> (except the "passing through the application machine" part).
>>>>>>
>>>>>> So if you want to do distributed receiving, you have to create
>>>>>> multiple zeromq streams, which will create multiple zeromq receivers.
>>>>>> The system tries to spread the receivers across multiple machines.
>>>>>> But you have to make sure that the data is correctly "fanned out"
>>>>>> across these multiple receivers. Other ingestion mechanisms like
>>>>>> Kafka actually allow this natively - Kafka lets the user define
>>>>>> "partitions" of a data stream, and the system takes care of
>>>>>> partitioning the relevant data stream and sending it to multiple
>>>>>> receivers without duplicating the records.
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am a newbie to Spark and Spark Streaming - I just recently became
>>>>>>> aware of it, but it seems really relevant to what I am trying to
>>>>>>> achieve.
>>>>>>>
>>>>>>> I am looking into using Spark Streaming with an input stream from
>>>>>>> zeromq. I am trying to figure out which machine actually listens on
>>>>>>> the zeromq socket.
>>>>>>> If I have a 10-machine cluster, and one machine acting as the
>>>>>>> "application machine", meaning the machine that runs the code I
>>>>>>> write and submits jobs to the cluster - which of these machines will
>>>>>>> subscribe to the zeromq socket?
>>>>>>>
>>>>>>> Looking at the code, I see a "Sub" socket is created, meaning all
>>>>>>> the data will pass through that socket, which leads me to believe
>>>>>>> that the application machine is the one listening on the socket and
>>>>>>> passing all received data to the cluster. This means that this
>>>>>>> single machine might become a bottleneck in a high-throughput use
>>>>>>> case.
>>>>>>>
>>>>>>> A better approach would be to have each node in the cluster listen
>>>>>>> on a "fanout" socket, meaning each node in the cluster receives a
>>>>>>> part of the data.
>>>>>>>
>>>>>>> I am not sure about any of this, as I am not an expert in ZeroMQ,
>>>>>>> and definitely not in Spark. If someone can clarify how this works,
>>>>>>> I'd really appreciate it. In addition, if there are other input
>>>>>>> streams that operate in a different manner, one that does not
>>>>>>> require the entire data to pass through the "application machine",
>>>>>>> I'd really appreciate knowing that too.
>>>>>>>
>>>>>>> Thank you very much in advance,
>>>>>>> Amir
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>
>>>>> --
>>>>> Sourav Chandra
>>>>> Senior Software Engineer
>>>>> [email protected]
>>>>> o: +91 80 4121 8723
>>>>> m: +91 988 699 3746
>>>>> skype: sourav.chandra
>>>>> Livestream
>>>>> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main,
>>>>> 3rd Block, Koramangala Industrial Area, Bangalore 560034
>>>>> www.livestream.com
>>>>
>>>
>>> --
>>> Dachuan Huang
>>> Cellphone: 614-390-7234
>>> 2015 Neil Avenue
>>> Ohio State University
>>> Columbus, Ohio
>>> U.S.A. 43210
>>
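TD's point above, that Kafka fans a stream out across receivers by assigning each record to exactly one partition, can be sketched with a toy hash partitioner. This is an illustration of the idea only; FanOutSketch and partitionFor are made-up names, not Kafka or Spark API, and Kafka's real partitioner is more involved:

```scala
object FanOutSketch {
  // Toy key-based partitioner: every key maps to exactly one of
  // numPartitions receivers, so the contract is: one record, one partition.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Split a batch of (key, value) records across receivers; each record
  // lands in exactly one group, so nothing is duplicated.
  def fanOut(records: Seq[(String, String)],
             numReceivers: Int): Map[Int, Seq[(String, String)]] =
    records.groupBy { case (key, _) => partitionFor(key, numReceivers) }
}
```

Because the partition is a pure function of the key, records with the same key always reach the same receiver, and the partition sizes sum to the input size, which is exactly the "fanned out without duplicating the records" behavior described above.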
