Somewhat agree on subclassing and its issues. It looks like the only alternative in Spark 1.3.0 is to create a custom build. Is there an enhancement filed for this? If not, I'll file one.
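For reference, the hook being discussed further down the thread could be sketched roughly as follows. This is only an illustration of the proposed shape: the TopicSelector alias, the selectTopics function, and the commented-out KafkaUtils overload are assumptions, and nothing like this exists in Spark 1.3.0.

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.Time

object TopicSelectionSketch {
  // The idea from this thread: before each batch's offsets are decided, ask a
  // user-supplied function which topic-partitions to read, so topics can be
  // added or removed on the fly without rebuilding the stream.
  type TopicSelector = Time => Set[TopicAndPartition]

  val selectTopics: TopicSelector = { batchTime =>
    // In practice this might consult ZooKeeper or a config service;
    // hard-coded here to keep the sketch short.
    Set(TopicAndPartition("events", 0), TopicAndPartition("events", 1))
  }

  def main(args: Array[String]): Unit = {
    // A hypothetical KafkaUtils overload might then be invoked roughly as:
    //   KafkaUtils.createDirectStream(ssc, kafkaParams, selectTopics)
    // No such overload exists in Spark 1.3.0.
    println(selectTopics(Time(0L)))
  }
}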
Thanks!
-neelesh

On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das <t...@databricks.com> wrote:

> The challenge of opening up these internal classes to the public (even with a Developer API tag) is that it prevents us from making non-trivial changes without breaking API compatibility for all those who have subclassed. It's a tradeoff that is hard to optimize. That's why we favor exposing more optional parameters in the stable API (KafkaUtils), so that we can maintain binary compatibility with user code while still being able to make non-trivial changes internally.
>
> That said, it may be worthwhile to actually take an optional compute function as a parameter through KafkaUtils, as Cody suggested ((Time, current offsets, kafka metadata, etc.) => Option[KafkaRDD]). It's worth thinking about the implications in the context of driver restarts, etc. (those functions will get called again on restart, and a different return value from before can screw up the semantics).
>
> TD
>
> On Wed, Apr 1, 2015 at 12:28 PM, Neelesh <neele...@gmail.com> wrote:
>
>> +1 for subclassing. It's more flexible if we can subclass the implementation classes.
>>
>> On Apr 1, 2015 12:19 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
>>
>>> As I said in the original ticket, I think the implementation classes should be exposed so that people can subclass and override compute() to suit their needs.
>>>
>>> Just adding a function from Time => Set[TopicAndPartition] wouldn't be sufficient for some of my current production use cases.
>>>
>>> compute() isn't really a function from Time => Option[KafkaRDD]; it's a function from (Time, current offsets, kafka metadata, etc.) => Option[KafkaRDD].
>>>
>>> I think it's more straightforward to give access to that additional state via subclassing than it is to add more callbacks for every possible use case.
>>>
>>> On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das <t...@databricks.com> wrote:
>>>
>>>> We should be able to support that use case in the direct API. It may be as simple as allowing users to pass in a function that returns the set of topic+partitions to read from, that is, a function (Time) => Set[TopicAndPartition]. This gets called every batch interval before the offsets are decided, and it would allow users to add topics, delete topics, and modify partitions on the fly.
>>>>
>>>> What do you think, Cody?
>>>>
>>>> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh <neele...@gmail.com> wrote:
>>>>
>>>>> Thanks Cody!
>>>>>
>>>>> On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>
>>>>>> If you want to change topics from batch to batch, you can always just create a KafkaRDD repeatedly.
>>>>>>
>>>>>> The streaming code as it stands assumes a consistent set of topics, though. The implementation is private, so you can't subclass it without building your own Spark.
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh <neele...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Cody, that was really helpful. I have a much better understanding now. One last question: Kafka topics are initialized once in the driver; is there an easy way of adding/removing topics on the fly? KafkaRDD#getPartitions() seems to be computed only once, with no way of refreshing it.
>>>>>>>
>>>>>>> Thanks again!
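A minimal sketch of the per-batch KafkaRDD workaround Cody describes above, assuming Spark 1.3.0's KafkaUtils.createRDD; the broker list, topic names, offsets, and local master are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object PerBatchKafkaRDD {
  def main(args: Array[String]): Unit = {
    // local[2] only so the sketch runs standalone; normally set via spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("per-batch-kafka-rdd").setMaster("local[2]"))
    // "metadata.broker.list" is how the direct API locates brokers in Kafka 0.8.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Offsets would normally come from your own bookkeeping (ZooKeeper, a DB, ...);
    // topic names and offsets are hard-coded here only to keep the sketch short.
    val offsetRanges = Array(
      OffsetRange("topicA", 0, 0L, 100L),
      OffsetRange("topicB", 0, 50L, 150L)
    )

    // One RDD per "batch"; repeat with fresh offset ranges (and a fresh topic
    // list, if it changed) each time around your own scheduling loop.
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    println(s"read ${rdd.count()} messages")
    sc.stop()
  }
}

In this approach the offset bookkeeping (where each batch starts and stops, and which topics exist) is entirely up to the caller, which is what makes changing topics between batches possible today.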
>>>>>>> On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>
>>>>>>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>>>>>>
>>>>>>>> The Kafka consumers run in the executors.
>>>>>>>>
>>>>>>>> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh <neele...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> With receivers, it was pretty obvious which code ran where: each receiver occupied a core and ran on the workers. However, with the new Kafka direct input streams, it's hard for me to understand where the code that reads from the Kafka brokers runs. Does it run on the driver (I hope not), or does it run on the workers?
>>>>>>>>>
>>>>>>>>> Any help appreciated. Thanks!
>>>>>>>>> -neelesh
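To make the driver/executor split concrete, here is a minimal sketch of the direct stream as it exists in Spark 1.3.0, with placeholder broker addresses, topic name, and local master. The driver only computes the offset ranges for each batch; the Kafka reads happen inside the executor tasks, which is the point Cody makes above.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    // local[2] only so the sketch runs standalone; normally set via spark-submit.
    val conf = new SparkConf().setAppName("direct-kafka").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events") // fixed for the lifetime of the stream, as noted above

    // The driver decides the per-batch offset ranges here; no receiver is used.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Transformations on the stream run as tasks on the executors, and it is
    // those tasks that open the Kafka consumers and read the message data.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}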