Hi Russell,

This is super helpful, thank you so much. Could you elaborate on the differences between Structured Streaming and DStreams? How would the number of receivers required, and so on, change?
On Sat, 8 Aug, 2020, 10:28 pm Russell Spitzer, <russell.spit...@gmail.com> wrote:

> Note, none of this applies to Direct streaming approaches, only receiver-based DStreams.
>
> You can think of a receiver as a long-running task that never finishes. Each receiver is submitted to an executor slot somewhere; it then runs indefinitely and internally has a method which passes records over to a block management system. There is a timing that you set which decides when each block is "done", and records arriving after that time has passed go into the next block (see the parameter spark.streaming.blockInterval at <https://spark.apache.org/docs/latest/configuration.html#spark-streaming>). Once a block is done it can be processed in the next Spark batch. The gap between a block starting and a block being finished is why you can lose data in receiver streaming without Write Ahead Logging. Usually your block interval divides evenly into your batch interval, so you'll get X blocks per batch. Each block becomes one partition of the job being done in a Streaming batch. Multiple receivers can be unified into a single DStream, which just means the blocks produced by all of those receivers are handled in the same Streaming batch.
>
> So if you have 5 different receivers, you need at minimum 6 executor cores: 1 core for each receiver, and 1 core to actually do your processing work. In a real-world case you probably want significantly more cores on the processing side than just 1. Without repartitioning you will never have more than (batch interval / block interval) * number of receivers partitions per batch.
>
> A quick example:
>
> I run 5 receivers with a block interval of 100 ms and a Spark batch interval of 1 second, and I use union to group them all together. I will most likely end up with one Spark job for each batch every second running with 50 partitions (1000 ms / 100 ms per block per receiver * 5 receivers). If I have a total of 10 cores in the system, 5 of them are running receivers, and the remaining 5 must process the 50 partitions of data generated by the last second of work.
>
> And again, just to reiterate, if you are doing a direct streaming approach or structured streaming, none of this applies.
>
> On Sat, Aug 8, 2020 at 10:03 AM Dark Crusader <relinquisheddra...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm having some trouble figuring out how receivers tie into the Spark driver-executor structure.
>> Do all executors have a receiver that is blocked as soon as it receives some stream data?
>> Or can multiple streams of data be taken as input into a single executor?
>>
>> I have stream data coming in every second from 5 different sources, and I want to aggregate data from each of them. Does this mean I need 5 executors, or does it have to do with threads on the executor?
>>
>> I might be mixing up a few concepts here. Any help would be appreciated.
>> Thank you.
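
A minimal Scala sketch of the receiver-based setup Russell describes above, assuming 5 hypothetical socket sources (host1..host5 on port 9999 are placeholders): block interval of 100 ms, a 1-second batch interval, and a union of the 5 receiver DStreams, which yields roughly 50 partitions per batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ReceiverUnionExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("receiver-union-example")
          // Records received inside each 100 ms window form one block,
          // and each block becomes one partition of the batch's RDD.
          .set("spark.streaming.blockInterval", "100ms")

        // 1 s batch interval: with a 100 ms block interval each receiver
        // contributes ~10 blocks (partitions) per batch.
        val ssc = new StreamingContext(conf, Seconds(1))

        // 5 receivers: each one occupies an executor core for its whole lifetime.
        val hosts = Seq("host1", "host2", "host3", "host4", "host5") // hypothetical sources
        val streams = hosts.map(h => ssc.socketTextStream(h, 9999))

        // Union the 5 DStreams so all their blocks land in the same batch:
        // ~10 blocks per receiver * 5 receivers = ~50 partitions per 1 s batch.
        val unified = ssc.union(streams)

        // The remaining executor cores (total cores minus the 5 receiver cores)
        // process those partitions each second.
        unified.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

With 10 executor cores in total, 5 are pinned by the receivers and the remaining 5 process the ~50 partitions produced every second; calling repartition() on the unified stream is the usual way to change that processing parallelism.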