Hi TD,

I think the FileNotFoundException is due to the spark.cleaner.ttl parameter, which is set to 3600 seconds, i.e. 1 hour. That is why the temporary metadata files are deleted.
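For reference, a minimal sketch of where this setting would live in the driver (the app name and batch interval below are illustrative, not from our application; spark.cleaner.ttl takes a value in seconds):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-streaming-app") // illustrative name
      // TTL in seconds for Spark's periodic metadata cleaner; metadata
      // older than this (e.g. broadcast files) may be deleted.
      .set("spark.cleaner.ttl", "3600")
    val ssc = new StreamingContext(conf, Seconds(1)) // illustrative batch interval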
Please correct me if I am wrong. Also, if that is the case, why was the file not downloaded and created again? Is it because our application is doing nothing, i.e. there are no messages coming from Kafka? Will the file be downloaded again once the application starts receiving data?

Thanks,
Sourav

On Fri, Feb 14, 2014 at 2:55 PM, Sourav Chandra <sourav.chan...@livestream.com> wrote:

> Hi TD,
>
> I kept the streaming application running for ~1 hour even though there were no messages in Kafka, just to check the memory usage, and found that stages have started failing (with exception java.io.FileNotFoundException: http://10.10.127.230:57124/broadcast_1) and that there are 1000 active stages.
>
> Questions:
> 1. Why did it suddenly start failing, unable to find the broadcast_1 file? Is there some background cleanup that causes this? How can we overcome it?
> 2. Are the 1000 active stages due to the spark.streaming.concurrentJobs parameter?
> 3. Why are these stages hanging (the UI shows no tasks started)? Shouldn't they also fail? What is the logic behind this?
> 4. Why does Tasks: Succeeded/Total in the failed stages show something like (0/12) (30 failed)? I can understand that it has 12 tasks in total and none succeeded, but where does the 30 failed come from? Is it an internal retry? If so, why is it not the same for all the other failed stages?
>
> I have attached the snapshots.
>
> On Fri, Feb 14, 2014 at 2:35 PM, Pankaj Mittal <pankaj.mit...@livestream.com> wrote:
>
>> Hi TD,
>> There is no persist method that accepts a boolean; there is only persist(StorageLevel) and the default persist().
>> I have a question: RDDs remain in the cache for some remember duration, which is initialized to the slide duration. Is it possible to set this to, say, an hour without changing the slide duration?
>>
>> Thanks,
>> Pankaj
>>
>> On Fri, Feb 14, 2014 at 7:50 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Answers inline. Hope these answer your questions.
>>>
>>> TD
>>>
>>> On Thu, Feb 13, 2014 at 5:49 PM, Sourav Chandra <sourav.chan...@livestream.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a couple of questions:
>>>>
>>>> 1. While going through the spark-streaming code, I found a configuration in the JobScheduler/JobGenerator (spark.streaming.concurrentJobs) which is set to 1. There is no documentation for this parameter. After setting it to 1000 in the driver program, our streaming application's performance improved.
>>>
>>> That is a parameter that allows Spark Streaming to launch multiple Spark jobs simultaneously. While it can improve performance in many scenarios (as it has in your case), it can actually increase the processing time of each batch and increase end-to-end latency in certain scenarios. So it is something that needs to be used with caution. That said, we should definitely have exposed it in the documentation.
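Since the parameter is undocumented, here is a minimal sketch of how it would be set in the driver (the value 4 and the app name are purely illustrative; as noted above, higher values trade per-batch latency for throughput):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-streaming-app") // illustrative name
      // Number of streaming jobs the JobScheduler may run concurrently.
      // The default is 1; raising it can improve throughput but may also
      // increase per-batch processing time and end-to-end latency.
      .set("spark.streaming.concurrentJobs", "4") // illustrative value
    val ssc = new StreamingContext(conf, Seconds(1))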
>>>> What is this variable used for? Is it safe to use/tweak this parameter?
>>>>
>>>> 2. Can someone explain the usage of the MapOutputTracker and BlockManager components? I have gone through Matei's YouTube video about Spark internals, but this was not covered in detail.
>>>
>>> I am not sure there is a detailed document anywhere that explains them, but I can give you a high-level overview of both.
>>>
>>> BlockManager is like a distributed key-value store for large blobs of data (called blocks). It has a master-worker architecture (loosely, it is like the HDFS file system): the BlockManagers on the workers store the data blocks, and the BlockManagerMaster stores the metadata about which blocks are stored where. All the cached RDD partitions and all shuffle data are stored and managed by the BlockManager. It also transfers blocks between workers as needed (shuffles etc. all happen through the BlockManager). Specifically for Spark Streaming, the data received from outside is stored in the BlockManagers of the worker nodes, and the IDs of the blocks are reported to the BlockManagerMaster.
>>>
>>> MapOutputTracker is a simpler component that keeps track of the location of the output of the map stage, so that workers running the reduce stage know which machines to pull the data from. It also has a master-worker structure: the master has full knowledge of the map output, and the worker component pulls that knowledge from the master on demand when the reduce tasks are executed on the worker.
>>>
>>>> 3. Can someone explain the usage of cache w.r.t. Spark Streaming? For example, if we do stream.cache(), will the cache remain constant, with all the partitions of the RDDs for that stream present across the nodes, or will it be regularly updated as new batches come in?
>>>
>>> If you call DStream.persist (DStream.cache is just persist with the default storage level), then all RDDs generated by the DStream will be persisted in the cache (in the BlockManager). As new RDDs are generated and persisted, old RDDs from the same DStream will fall out of memory, either by LRU or explicitly if spark.streaming.unpersist is set to true.
>>>
>>>> Thanks,
>>>> Sourav Chandra
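Pankaj's remember-duration question above maps to StreamingContext.remember(), which extends how long generated RDDs are retained without changing the slide duration. A minimal end-to-end sketch of both knobs (the socket source, storage level, and durations are illustrative, not from this thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("kafka-streaming-app") // illustrative
    val ssc = new StreamingContext(conf, Seconds(1))

    // Illustrative source; the application in this thread reads from Kafka.
    val stream = ssc.socketTextStream("localhost", 9999)

    // DStream.persist takes a StorageLevel (there is no boolean overload);
    // every RDD generated by this DStream is cached in the BlockManager.
    stream.persist(StorageLevel.MEMORY_ONLY_SER)

    // Keep generated RDDs for an hour without touching the slide duration,
    // which is what the remember duration controls.
    ssc.remember(Minutes(60))

    stream.count().print()
    ssc.start()
    ssc.awaitTermination()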