Re: spark streaming doubt

Akhil Das Wed, 20 May 2015 00:09:00 -0700

One receiver basically runs on 1 core, so if your single node is having 4
cores, there are still 3 cores left for the processing (for executors). And
yes receiver remains on the same machine unless some failure happens.


Thanks
Best Regards

On Tue, May 19, 2015 at 10:57 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Thanks Akhil andDibyendu.
>
> Does in high level receiver based streaming executors run on receivers
> itself to have data localisation ? Or its always data is transferred to
> executor nodes and executor nodes differ in each run of job but receiver
> node remains same(same machines) throughout life of streaming application
> unless node failure happens?
>
>
>
> On Tue, May 19, 2015 at 9:29 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Just to add, there is a Receiver based Kafka consumer which uses Kafka
>> Low Level Consumer API.
>>
>> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
>>
>>
>> Regards,
>> Dibyendu
>>
>> On Tue, May 19, 2015 at 9:00 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>>
>>> On Tue, May 19, 2015 at 8:10 PM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> So for Kafka+spark streaming, Receiver based streaming used highlevel
>>>> api and non receiver based streaming used low level api.
>>>>
>>>> 1.In high level receiver based streaming does it registers consumers at
>>>> each job start(whenever a new job is launched by streaming application say
>>>> at each second)?
>>>>
>>>
>>> -> Receiver based streaming will always have the receiver running
>>> parallel while your job is running, So by default for every 200ms
>>> (spark.streaming.blockInterval) the receiver will generate a block of data
>>> which is read from Kafka.
>>> 
>>>
>>>
>>>> 2.No of executors in highlevel receiver based jobs will always equal to
>>>> no of partitions in topic ?
>>>>
>>>
>>> -> Not sure from where did you came up with this. For the non stream
>>> based one, i think the number of partitions in spark will be equal to the
>>> number of kafka partitions for the given topic.
>>> 
>>>
>>>
>>>> 3.Will data from a single topic be consumed by executors in parllel or
>>>> only one receiver consumes in multiple threads and assign to executors in
>>>> high level receiver based approach ?
>>>>
>>>> -> They will consume the data parallel. For the receiver based
>>> approach, you can actually specify the number of receiver that you want to
>>> spawn for consuming the messages.
>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 19, 2015 at 2:38 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> spark.streaming.concurrentJobs takes an integer value, not boolean.
>>>>> If you set it as 2 then 2 jobs will run parallel. Default value is 1 and
>>>>> the next job will start once it completes the current one.
>>>>>
>>>>>
>>>>>> Actually, in the current implementation of Spark Streaming and under
>>>>>> default configuration, only job is active (i.e. under execution) at any
>>>>>> point of time. So if one batch's processing takes longer than 10 seconds,
>>>>>> then then next batch's jobs will stay queued.
>>>>>> This can be changed with an experimental Spark property
>>>>>> "spark.streaming.concurrentJobs" which is by default set to 1. Its not
>>>>>> currently documented (maybe I should add it).
>>>>>> The reason it is set to 1 is that concurrent jobs can potentially
>>>>>> lead to weird sharing of resources and which can make it hard to debug 
>>>>>> the
>>>>>> whether there is sufficient resources in the system to process the 
>>>>>> ingested
>>>>>> data fast enough. With only 1 job running at a time, it is easy to see 
>>>>>> that
>>>>>> if batch processing time < batch interval, then the system will be 
>>>>>> stable.
>>>>>> Granted that this may not be the most efficient use of resources under
>>>>>> certain conditions. We definitely hope to improve this in the future.
>>>>>
>>>>>
>>>>> Copied from TD's answer written in SO
>>>>> <http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming>
>>>>> .
>>>>>
>>>>> Non-receiver based streaming for example you can say are the
>>>>> fileStream, directStream ones. You can read a bit of information from here
>>>>> https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, May 19, 2015 at 2:13 PM, Shushant Arora <
>>>>> shushantaror...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Akhil.
>>>>>> When I don't  set spark.streaming.concurrentJobs to true. Will the
>>>>>> all pending jobs starts one by one after 1 jobs completes,or it does not
>>>>>> creates jobs which could not be started at its desired interval.
>>>>>>
>>>>>> And Whats the difference and usage of Receiver vs non-receiver based
>>>>>> streaming. Is there any documentation for that?
>>>>>>
>>>>>> On Tue, May 19, 2015 at 1:35 PM, Akhil Das <
>>>>>> ak...@sigmoidanalytics.com> wrote:
>>>>>>
>>>>>>> It will be a single job running at a time by default (you can also
>>>>>>> configure the spark.streaming.concurrentJobs to run jobs parallel which 
>>>>>>> is
>>>>>>> not recommended to put in production).
>>>>>>>
>>>>>>> Now, your batch duration being 1 sec and processing time being 2
>>>>>>> minutes, if you are using a receiver based streaming then ideally those
>>>>>>> receivers will keep on receiving data while the job is running (which 
>>>>>>> will
>>>>>>> accumulate in memory if you set StorageLevel as MEMORY_ONLY and end up 
>>>>>>> in
>>>>>>> block not found exceptions as spark drops some blocks which are yet to
>>>>>>> process to accumulate new blocks). If you are using a non-receiver based
>>>>>>> approach, you will not have this problem of dropping blocks.
>>>>>>>
>>>>>>> Ideally, if your data is small and you have enough memory to hold
>>>>>>> your data then it will run smoothly without any issues.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Tue, May 19, 2015 at 1:23 PM, Shushant Arora <
>>>>>>> shushantaror...@gmail.com> wrote:
>>>>>>>
>>>>>>>> What happnes if in a streaming application one job is not yet
>>>>>>>> finished and stream interval reaches. Does it starts next job or wait 
>>>>>>>> for
>>>>>>>> first to finish and rest jobs will keep on accumulating in queue.
>>>>>>>>
>>>>>>>>
>>>>>>>> Say I have a streaming application with stream interval of 1 sec,
>>>>>>>> but my job takes 2 min to process 1 sec stream , what will happen ?  
>>>>>>>> At any
>>>>>>>> time there will be only one job running or multiple ?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: spark streaming doubt

Reply via email to