Hi, The mentioned approach did not work for me. It did create multiple NetworkReceivers in each workers but DAG scheduler is failing with NotSerializableException"org.apache.spark.streaming.StreamingContext
Can any one of you help me figuring this out? Thanks, Sourav On Wed, Feb 19, 2014 at 11:55 AM, Sourav Chandra < [email protected]> wrote: > Hi TD, > > in case of multiple streams will the streaming code be like: > > val ssc = ... > (1 to n).foreach { > val nwStream = kafkaUtils.createStream(...) > > nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(saveToCassandra()) > } > ssc.start() > > Will it create any problem in execution (like reading/writing broadcast > variable etc) > > Thanks, > Sourav > > > On Wed, Feb 19, 2014 at 11:31 AM, Tathagata Das < > [email protected]> wrote: > >> The default zeromq receiver that comes with the Spark repository does >> guarantee which machine the zeromq receiver will be launched, it can be on >> any of the worker machines in the cluster, NOT the application machine >> (called the "driver" in our terms). And your understanding of the code and >> the situation is generally correct (except the "passing through application >> machine" part). >> >> So if you want to do a distributed receiving, you have to create multiple >> zeromq streams, which will create multiple zeromq receivers. The system >> tries to spread the receivers across multiple machines. But you have to >> make sure that the data is correctly "fanned out" across these multiple >> receivers. Other ingestion mechanisms like Kafka actually allows this >> natively - it allows the user to define "partitions" of a data stream and >> the Kafka system takes care of partitioning the relavant data stream and >> sending it to multiple receivers without duplicating the records. >> >> TD >> >> >> On Sun, Feb 16, 2014 at 8:48 AM, amirtuval <[email protected]> wrote: >> >>> Hi >>> >>> I am a newbie to spark and spark streaming - I just recently became >>> aware of >>> it, but it seems really relevant to what I am trying to achieve. >>> >>> I am looking into using spark streaming, with an input stream from >>> zeromq. >>> I am trying to figure out what machine is actually listening on the >>> zeromq >>> socket. >>> >>> If I have a 10-machine cluster, and one machine acting as the >>> "application >>> machine", meaning the machine that runs the code I write, and submit >>> jobs to >>> the cluster - which of these machines will subscribe to the zeromq >>> socket? >>> Looking at the code, I see a "Sub" socket is created, meaning all the >>> data >>> will pass through the socket, which leads me to believe that the >>> application >>> machine is the one listening on the socket, and passes all received data >>> to >>> the cluster. This means that this single machine might become a bottle >>> neck >>> in a high throughput use case. >>> A better approach would be to have each node in the cluster listen on a >>> "fanout" socket, meaning each node in the cluster receive a part of the >>> data. >>> >>> I am not sure about any of this, as I am not an expert in ZeroMQ, and >>> definitely not in spark. If someone can clarify how this works, I'd >>> really >>> appreciate it. >>> In addition, if there other input streams, that operate in a different >>> manner that does not require the entire data to pass through the >>> "application machine", I'd really appreciate knowing that too. >>> >>> Thank you very much in advance, >>> Amir >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-a-cluster-tp1576.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >> >> > > > -- > > Sourav Chandra > > Senior Software Engineer > > · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · > > [email protected] > > o: +91 80 4121 8723 > > m: +91 988 699 3746 > > skype: sourav.chandra > > Livestream > > "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd > Block, Koramangala Industrial Area, > > Bangalore 560034 > > www.livestream.com > -- Sourav Chandra Senior Software Engineer · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · [email protected] o: +91 80 4121 8723 m: +91 988 699 3746 skype: sourav.chandra Livestream "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd Block, Koramangala Industrial Area, Bangalore 560034 www.livestream.com
