Somewhat agree on subclassing and its issues. It looks like the only alternative in Spark 1.3.0 is to create a custom build. Is there an enhancement filed for this? If not, I'll file one.
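For reference, the hook being discussed further down the thread could be sketched roughly as follows. This is only an illustration of the proposed shape: the TopicSelector alias, the selectTopics function, and the commented-out KafkaUtils overload are assumptions, and nothing like this exists in Spark 1.3.0.

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.Time

object TopicSelectionSketch {
  // The idea from this thread: before each batch's offsets are decided, ask a
  // user-supplied function which topic-partitions to read, so topics can be
  // added or removed on the fly without rebuilding the stream.
  type TopicSelector = Time => Set[TopicAndPartition]

  val selectTopics: TopicSelector = { batchTime =>
    // In practice this might consult ZooKeeper or a config service;
    // hard-coded here to keep the sketch short.
    Set(TopicAndPartition("events", 0), TopicAndPartition("events", 1))
  }

  def main(args: Array[String]): Unit = {
    // A hypothetical KafkaUtils overload might then be invoked roughly as:
    //   KafkaUtils.createDirectStream(ssc, kafkaParams, selectTopics)
    // No such overload exists in Spark 1.3.0.
    println(selectTopics(Time(0L)))
  }
}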
Thanks!
-neelesh

On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das <t...@databricks.com> wrote:

> The challenge of opening up these internal classes to the public (even with a Developer API tag) is that it prevents us from making non-trivial changes without breaking API compatibility for all those who have subclassed. It's a tradeoff that is hard to optimize. That's why we favor exposing more optional parameters in the stable API (KafkaUtils), so that we can maintain binary compatibility with user code while still being able to make non-trivial changes internally.
>
> That said, it may be worthwhile to actually take an optional compute function as a parameter through KafkaUtils, as Cody suggested ((Time, current offsets, kafka metadata, etc.) => Option[KafkaRDD]). It's worth thinking about the implications in the context of driver restarts, etc. (those functions will get called again on restart, and a different return value from before can screw up the semantics).
>
> TD
>
> On Wed, Apr 1, 2015 at 12:28 PM, Neelesh <neele...@gmail.com> wrote:
>
>> +1 for subclassing. It's more flexible if we can subclass the implementation classes.
>>
>> On Apr 1, 2015 12:19 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
>>
>>> As I said in the original ticket, I think the implementation classes should be exposed so that people can subclass and override compute() to suit their needs.
>>>
>>> Just adding a function from Time => Set[TopicAndPartition] wouldn't be sufficient for some of my current production use cases.
>>>
>>> compute() isn't really a function from Time => Option[KafkaRDD]; it's a function from (Time, current offsets, kafka metadata, etc.) => Option[KafkaRDD].
>>>
>>> I think it's more straightforward to give access to that additional state via subclassing than it is to add more callbacks for every possible use case.
>>>
>>> On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das <t...@databricks.com> wrote:
>>>
>>>> We should be able to support that use case in the direct API. It may be as simple as allowing users to pass in a function that returns the set of topic+partitions to read from, that is, a function (Time) => Set[TopicAndPartition]. This gets called every batch interval before the offsets are decided, and it would allow users to add topics, delete topics, and modify partitions on the fly.
>>>>
>>>> What do you think, Cody?
>>>>
>>>> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh <neele...@gmail.com> wrote:
>>>>
>>>>> Thanks Cody!
>>>>>
>>>>> On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>
>>>>>> If you want to change topics from batch to batch, you can always just create a KafkaRDD repeatedly.
>>>>>>
>>>>>> The streaming code as it stands assumes a consistent set of topics, though. The implementation is private, so you can't subclass it without building your own Spark.
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh <neele...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Cody, that was really helpful. I have a much better understanding now. One last question: Kafka topics are initialized once in the driver; is there an easy way of adding/removing topics on the fly? KafkaRDD#getPartitions() seems to be computed only once, with no way of refreshing it.
>>>>>>>
>>>>>>> Thanks again!
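A minimal sketch of the per-batch KafkaRDD workaround Cody describes above, assuming Spark 1.3.0's KafkaUtils.createRDD; the broker list, topic names, offsets, and local master are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object PerBatchKafkaRDD {
  def main(args: Array[String]): Unit = {
    // local[2] only so the sketch runs standalone; normally set via spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("per-batch-kafka-rdd").setMaster("local[2]"))
    // "metadata.broker.list" is how the direct API locates brokers in Kafka 0.8.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Offsets would normally come from your own bookkeeping (ZooKeeper, a DB, ...);
    // topic names and offsets are hard-coded here only to keep the sketch short.
    val offsetRanges = Array(
      OffsetRange("topicA", 0, 0L, 100L),
      OffsetRange("topicB", 0, 50L, 150L)
    )

    // One RDD per "batch"; repeat with fresh offset ranges (and a fresh topic
    // list, if it changed) each time around your own scheduling loop.
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    println(s"read ${rdd.count()} messages")
    sc.stop()
  }
}

In this approach the offset bookkeeping (where each batch starts and stops, and which topics exist) is entirely up to the caller, which is what makes changing topics between batches possible today.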
>>>>>>> On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>
>>>>>>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>>>>>>
>>>>>>>> The Kafka consumers run in the executors.
>>>>>>>>
>>>>>>>> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh <neele...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> With receivers, it was pretty obvious which code ran where: each receiver occupied a core and ran on the workers. However, with the new Kafka direct input streams, it's hard for me to understand where the code that reads from the Kafka brokers runs. Does it run on the driver (I hope not), or does it run on the workers?
>>>>>>>>>
>>>>>>>>> Any help appreciated. Thanks!
>>>>>>>>> -neelesh
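To make the driver/executor split concrete, here is a minimal sketch of the direct stream as it exists in Spark 1.3.0, with placeholder broker addresses, topic name, and local master. The driver only computes the offset ranges for each batch; the Kafka reads happen inside the executor tasks, which is the point Cody makes above.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    // local[2] only so the sketch runs standalone; normally set via spark-submit.
    val conf = new SparkConf().setAppName("direct-kafka").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events") // fixed for the lifetime of the stream, as noted above

    // The driver decides the per-batch offset ranges here; no receiver is used.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Transformations on the stream run as tasks on the executors, and it is
    // those tasks that open the Kafka consumers and read the message data.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}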