Each receiver occupies one core, so if your single node has 4 cores, there are still 3 cores left for processing (for the executors' tasks). And yes, the receiver stays on the same machine unless a failure happens.
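
A rough way to see the arithmetic is the minimal Python sketch below. The function names are hypothetical (made up for illustration), it assumes every receiver holds exactly one core for the whole life of the application, and it does not call any Spark API.

```python
def cores_left_for_processing(total_cores: int, num_receivers: int) -> int:
    """Cores available for batch jobs once each receiver holds one core.

    Assumption: one long-running receiver == one permanently occupied core.
    """
    if total_cores < 1 or num_receivers < 0:
        raise ValueError("total_cores must be >= 1 and num_receivers >= 0")
    # Never report a negative number of cores.
    return max(total_cores - num_receivers, 0)


def can_process_batches(total_cores: int, num_receivers: int) -> bool:
    """True when at least one core remains to process the received data."""
    return cores_left_for_processing(total_cores, num_receivers) > 0


if __name__ == "__main__":
    # (total cores, receivers): the 4-core node from this thread, plus two edge cases.
    for cores, receivers in [(4, 1), (4, 3), (1, 1)]:
        left = cores_left_for_processing(cores, receivers)
        ok = can_process_batches(cores, receivers)
        print(f"{cores} cores, {receivers} receiver(s): "
              f"{left} left for processing, can process: {ok}")
```

The last case (1 core, 1 receiver) is the one to watch out for: the receiver takes the only core, so data is received but nothing is left to process it. That is why the number of cores should be greater than the number of receivers.
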

Thanks
Best Regards

On Tue, May 19, 2015 at 10:57 PM, Shushant Arora <shushantaror...@gmail.com> wrote:

> Thanks Akhil and Dibyendu.
>
> In high level receiver based streaming, do executors run on the
> receivers themselves for data locality? Or is the data always
> transferred to executor nodes, with the executor nodes differing in
> each run of a job while the receiver node remains the same (same
> machine) throughout the life of the streaming application unless a
> node failure happens?
>
> On Tue, May 19, 2015 at 9:29 PM, Dibyendu Bhattacharya
> <dibyendu.bhattach...@gmail.com> wrote:
>
>> Just to add, there is a Receiver based Kafka consumer which uses
>> the Kafka Low Level Consumer API.
>>
>> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
>>
>> Regards,
>> Dibyendu
>>
>> On Tue, May 19, 2015 at 9:00 PM, Akhil Das
>> <ak...@sigmoidanalytics.com> wrote:
>>
>>> On Tue, May 19, 2015 at 8:10 PM, Shushant Arora
>>> <shushantaror...@gmail.com> wrote:
>>>
>>>> So for Kafka + Spark Streaming, receiver based streaming uses the
>>>> high level API and non receiver based streaming uses the low
>>>> level API.
>>>>
>>>> 1. In high level receiver based streaming, does it register
>>>> consumers at each job start (whenever a new job is launched by
>>>> the streaming application, say every second)?
>>>
>>> -> Receiver based streaming will always have the receiver running
>>> in parallel while your job is running. So by default, every 200ms
>>> (spark.streaming.blockInterval) the receiver will generate a block
>>> of data which is read from Kafka.
>>>
>>>> 2. Will the number of executors in high level receiver based jobs
>>>> always be equal to the number of partitions in the topic?
>>>
>>> -> Not sure where you got this from. For the non stream based one,
>>> I think the number of partitions in Spark will be equal to the
>>> number of Kafka partitions for the given topic.
>>>
>>>> 3. Will data from a single topic be consumed by executors in
>>>> parallel, or does only one receiver consume in multiple threads
>>>> and assign to executors in the high level receiver based
>>>> approach?
>>>
>>> -> They will consume the data in parallel. For the receiver based
>>> approach, you can actually specify the number of receivers that
>>> you want to spawn for consuming the messages.
>>>
>>>> On Tue, May 19, 2015 at 2:38 PM, Akhil Das
>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>
>>>>> spark.streaming.concurrentJobs takes an integer value, not a
>>>>> boolean. If you set it to 2 then 2 jobs will run in parallel.
>>>>> The default value is 1, and the next job will start once the
>>>>> current one completes.
>>>>>
>>>>>> Actually, in the current implementation of Spark Streaming and
>>>>>> under default configuration, only one job is active (i.e.
>>>>>> under execution) at any point of time. So if one batch's
>>>>>> processing takes longer than 10 seconds, then the next batch's
>>>>>> jobs will stay queued.
>>>>>> This can be changed with an experimental Spark property
>>>>>> "spark.streaming.concurrentJobs" which is by default set to 1.
>>>>>> It's not currently documented (maybe I should add it).
>>>>>> The reason it is set to 1 is that concurrent jobs can
>>>>>> potentially lead to weird sharing of resources, which can make
>>>>>> it hard to debug whether there are sufficient resources in the
>>>>>> system to process the ingested data fast enough. With only 1
>>>>>> job running at a time, it is easy to see that if batch
>>>>>> processing time < batch interval, then the system will be
>>>>>> stable. Granted that this may not be the most efficient use of
>>>>>> resources under certain conditions. We definitely hope to
>>>>>> improve this in the future.
>>>>>
>>>>> Copied from TD's answer written in SO
>>>>> <http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming>.
>>>>>
>>>>> Examples of non-receiver based streaming are the fileStream and
>>>>> directStream ones. You can read a bit more about it here:
>>>>> https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, May 19, 2015 at 2:13 PM, Shushant Arora
>>>>> <shushantaror...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Akhil.
>>>>>> When I don't set spark.streaming.concurrentJobs to true, will
>>>>>> all the pending jobs start one by one after one job completes,
>>>>>> or does it not create the jobs which could not be started at
>>>>>> their desired interval?
>>>>>>
>>>>>> And what is the difference between receiver and non-receiver
>>>>>> based streaming, and when is each used? Is there any
>>>>>> documentation for that?
>>>>>>
>>>>>> On Tue, May 19, 2015 at 1:35 PM, Akhil Das
>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>
>>>>>>> It will be a single job running at a time by default (you can
>>>>>>> also configure spark.streaming.concurrentJobs to run jobs in
>>>>>>> parallel, which is not recommended in production).
>>>>>>>
>>>>>>> Now, your batch duration being 1 sec and processing time
>>>>>>> being 2 minutes, if you are using receiver based streaming
>>>>>>> then those receivers will keep on receiving data while the
>>>>>>> job is running (which will accumulate in memory if you set
>>>>>>> StorageLevel as MEMORY_ONLY and end up in block not found
>>>>>>> exceptions, as Spark drops some blocks which are yet to be
>>>>>>> processed to make room for new blocks). If you are using a
>>>>>>> non-receiver based approach, you will not have this problem
>>>>>>> of dropping blocks.
>>>>>>>
>>>>>>> Ideally, if your data is small and you have enough memory to
>>>>>>> hold your data, then it will run smoothly without any issues.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Tue, May 19, 2015 at 1:23 PM, Shushant Arora
>>>>>>> <shushantaror...@gmail.com> wrote:
>>>>>>>
>>>>>>>> What happens if, in a streaming application, one job is not
>>>>>>>> yet finished when the stream interval is reached? Does it
>>>>>>>> start the next job, or wait for the first to finish while
>>>>>>>> the remaining jobs keep accumulating in the queue?
>>>>>>>>
>>>>>>>> Say I have a streaming application with a stream interval of
>>>>>>>> 1 sec, but my job takes 2 min to process a 1 sec stream.
>>>>>>>> What will happen? At any time, will there be only one job
>>>>>>>> running, or multiple?