[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762756#comment-15762756 ]

Michael Armbrust commented on SPARK-17344:

[KAFKA-4462] aims to give us backwards compatibility for clients, which will be great. The fact that there is a long-term plan here makes me less allergic to the idea of copy/pasting the 0.10.x {{Source}} and porting it to 0.8.0 as an interim solution for those who can't upgrade yet.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured
> Streaming.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617004#comment-15617004 ]

Michael Allman commented on SPARK-17344:

We (at VideoAmp) would love to use structured streaming with Kafka. However, we use Kafka 0.8 and have no present desire or reason to upgrade. There's nothing valuable to us in the newer versions, and performing an upgrade would be a highly non-trivial undertaking given that so many of our production systems integrate with Kafka. We really don't want to mess with something like that.

I believe the new streaming API has great potential to simplify our streaming apps, and I'm eager to try a fresh approach to streaming in Spark.

I cannot commit to making a contribution to this effort, but I wanted to voice my opinion and cast my vote. Thanks.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured
> Streaming.
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567457#comment-15567457 ]

Cody Koeninger commented on SPARK-17344:

Given the choice between rewriting underlying kafka consumers and having a split codebase, I'd rather have a split codebase.

Of course I'd rather not sink development effort into an old version of kafka at all, until the structured stream for 0.10 is working for my use cases.

But if you want to wrap the 0.8 rdd in a structured stream, go for it, I'll help you figure out how to do it. Seriously. Don't expect larger project uptake, but if you just need something to work for you
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567367#comment-15567367 ]

Jeremy Smith commented on SPARK-17344:

> By contrast, writing a streaming source shim around the existing simple
> consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't
> have stuff like SSL, dynamic topics, or offset committing.

Serious question: Would it be so bad to have a bifurcated codebase here? People who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for them, and are probably not all that concerned about the features you mentioned.

In general, structured streaming already provides a lot of the capabilities that I for one am concerned about when using Kafka - offsets are tracked natively by SS, so offset committing isn't that big of a deal; in a CDH cluster specifically, you are probably using network-level security and aren't viewing the lack of SSL as a blocker; and finally you're already resigned to static topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't have to throw away all your checkpoints just to do something with a new topic in the same application. Currently, you have to do this, because the entire graph is stored in the checkpoints along with all the topics you're ever going to look at. Structured streaming at least gives you separate checkpoints per source, rather than for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind to the beginning, or throw away everything from the past, or (as before) rely on the incredibly fragile StreamingContext checkpoints. Or, commit the topic/partition/offset to the sink so you can recover the actually processed messages from there. Again, decoupling each operation from the entire state of the StreamingContext is a huge step up, because you can actually upgrade your application code (at least in certain ways) without having to worry about re-processing stuff due to discarding the checkpoints.
* It will dramatically simplify the usage of Kafka from Spark in general. 9/10 use cases involve some sort of structured data, the processing of which will have dramatically better performance when being used with Tungsten than with RDD-level operations.

So if the simple-consumer based Kafka source would be so easy, at the expense of some features, why not introduce it? I have a tremendous amount of respect for the complexity of Kafka and the work you're doing with it, but I also get a sense that the conceptual "perfect" here is the enemy of the good. The weekend project you mentioned would result in a dramatic improvement in the experience for a large percentage of users who are currently using Spark and Kafka together.

Most companies are using some kind of Hadoop distribution (e.g. HDP or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH doesn't), but at what rate are people actually able to update HDP? I don't have any data on it (ironically) but I'm guessing that 0.9 still represents a fairly significant portion of the Kafka install base.

Just my two cents on the matter.
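To make the "commit the topic/partition/offset to the sink" point concrete, here is a minimal sketch in Python. It assumes every row written to the sink carries the topic, partition and offset of the Kafka message it came from; the row layout and the function name `recover_next_offsets` are hypothetical and not part of any Spark or Kafka API.

```python
from typing import Dict, Iterable, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)


def recover_next_offsets(sink_rows: Iterable[dict]) -> Dict[TopicPartition, int]:
    """Return, per topic/partition, the next offset to read after a restart.

    Assumption: each sink row is a dict with "topic", "partition" and
    "offset" keys describing the source message it was derived from.
    """
    next_offsets: Dict[TopicPartition, int] = {}
    for row in sink_rows:
        key = (row["topic"], row["partition"])
        # The next offset to read is one past the highest offset already stored.
        candidate = row["offset"] + 1
        if candidate > next_offsets.get(key, 0):
            next_offsets[key] = candidate
    return next_offsets
```

With this layout, a restarted job could resume from what actually reached the sink instead of from a StreamingContext checkpoint. Partitions that never produced a row are absent from the result, so the caller would still need its own starting-offset policy for them.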
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567321#comment-15567321 ]

Michael Armbrust commented on SPARK-17344:

These are good questions. A few thoughts:

bq. How long would it take CDH to distribute 0.10 if there was a compelling Spark client for it?

Even if they were going to release kafka 0.10 in CDH yesterday, my experience is that it will take a long time for many people to upgrade. We spent a fair amount of effort on multi-version compatibility for Hive in Spark SQL and it was a great boost for adoption. I think this could be the same thing.

bq. How are you going to handle SSL? You can't avoid the complexity of caching consumers if you still want the benefits of prefetching, and doing an SSL handshake for every batch will kill performance if they aren't cached.

An option here would be to use the internal client directly. This way we can leverage all the work that they did to support SSL, etc., yet make it speak specific versions of the protocol as we need. I did a [really rough prototype|https://gist.github.com/marmbrus/7d116b0a9672337497ddfccc0657dbf0] using the APIs described above and it is not that much code. There is clearly a lot more we'd need to do, but I think we should strongly consider this option.

Caching connections to the specific brokers should probably still be implemented for the reasons you describe (and this is already handled by the internal client). An advantage here is you'd actually be able to share connections across queries without running into correctness problems.
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566244#comment-15566244 ]

Cody Koeninger commented on SPARK-17344:

How long would it take CDH to distribute 0.10 if there was a compelling Spark client for it?

How are you going to handle SSL?

You can't avoid the complexity of caching consumers if you still want the benefits of prefetching, and doing an SSL handshake for every batch will kill performance if they aren't cached.

Also note that this is a pretty prime example of what I'm talking about in my dev mailing list discussion on SIPs. This issue has been brought up, and decided against continuing support of 0.8, multiple times.

You guys started making promises about structured streaming for Kafka over half a year ago, and still don't have it feature complete. This is a big potential detour for uncertain gain. The real underlying problem is still how you're going to do better than simply wrapping a DStream, and I don't see how this is directly relevant.
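The caching argument can be illustrated with a small Python sketch: connections are keyed by broker address and created only on first use, so an expensive handshake is paid once per broker rather than once per batch. `ConnectionCache` and the `connect` callable are hypothetical names for illustration only; this is not how Spark's Kafka integration or the Kafka client is actually implemented.

```python
from typing import Callable, Dict, Tuple

BrokerAddress = Tuple[str, int]  # (host, port)


class ConnectionCache:
    """Reuse one connection per broker instead of reconnecting every batch.

    `connect` is an assumed callable that performs the expensive setup
    (for example a TLS handshake) and returns a connection object.
    """

    def __init__(self, connect: Callable[[str, int], object]) -> None:
        self._connect = connect
        self._open: Dict[BrokerAddress, object] = {}

    def get(self, host: str, port: int) -> object:
        key = (host, port)
        if key not in self._open:
            # Only the first request for a broker pays the connection cost.
            self._open[key] = self._connect(host, port)
        return self._open[key]

    def size(self) -> int:
        """Number of brokers that currently have a cached connection."""
        return len(self._open)
```

A real cache would also need eviction, error handling and thread safety, which is part of the complexity the comment above is pointing at.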
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566191#comment-15566191 ]

Michael Armbrust commented on SPARK-17344:

I think the fact that CDH is still distributing 0.9 is a pretty convincing argument.

I'm also not convinced it's a bad idea to speak the protocol directly. Our use case ends up being significantly simpler than most other consumer implementations since we have the opportunity to do global coordination on the driver. As such, we'd really only need to correctly handle two types of requests: [TopicMetadataRequest|https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-TopicMetadataRequest] and [FetchRequest|https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchRequest].

- The variations here across versions are minimal for these messages.
- We could avoid having N different artifacts for N versions of kafka.
- We could remove the complexity of caching consumers on executors (though still set preferred locations to encourage collocation).
- We could avoid extra copies of the payload when going from the kafka library into tungsten.

I agree we shouldn't make this decision lightly, but looking at our past experience supporting multiple versions of Hadoop/Hive as transparently as possible, I think this could be a big boost for adoption.
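As a rough illustration of "global coordination on the driver", the Python sketch below plans one batch: given the next offset to read per partition and the latest offset available per partition (which a metadata/offset lookup would supply), it produces the offset ranges that executors would then fetch. All names (`OffsetRange`, `plan_batch`, `max_per_partition`) are hypothetical, and starting unseen partitions at offset 0 is an assumption made for the sketch, not a statement about the prototype linked in this thread.

```python
from typing import Dict, List, NamedTuple, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(next_offsets: Dict[TopicPartition, int],
               latest_available: Dict[TopicPartition, int],
               max_per_partition: int) -> List[OffsetRange]:
    """Decide on the driver which offsets each partition contributes to a batch."""
    ranges: List[OffsetRange] = []
    for (topic, partition), latest in sorted(latest_available.items()):
        # Assumption: a partition we have never read starts at offset 0.
        start = next_offsets.get((topic, partition), 0)
        until = min(latest, start + max_per_partition)
        if until > start:  # skip partitions with no new data
            ranges.append(OffsetRange(topic, partition, start, until))
    return ranges
```

Because the driver alone decides the ranges, each executor task only needs to issue a plain fetch for a fixed `[from_offset, until_offset)` span, which is why the comment argues that so few request types would have to be handled.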
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558474#comment-15558474 ]

Cody Koeninger commented on SPARK-17344:

I think this is premature until you have a fully operational battlestation, er, structured stream, that has all the necessary features for 0.10.

Regarding the conversation with Michael about possibly using the kafka protocol directly as a way to work around the differences between 0.8 and 0.10, please don't consider that. Every kafka consumer implementation I've ever used has bugs, and we don't need to spend time writing another buggy one.

By contrast, writing a streaming source shim around the existing simple consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't have stuff like SSL, dynamic topics, or offset committing.
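For readers wondering what such a shim amounts to, here is a toy Python model. It wraps a range-readable log (standing in for the existing offset-range-based 0.8 RDD) behind two calls, "what is the latest offset?" and "give me everything between two offsets", which loosely echoes the getOffset/getBatch shape of the Structured Streaming source interface. The class `ToyRangeSource` and the in-memory log are hypothetical; nothing here is Spark code.

```python
from typing import Dict, List, Optional, Tuple

Offsets = Dict[int, int]        # partition -> next offset to read
Row = Tuple[int, int, str]      # (partition, offset, message)


class ToyRangeSource:
    """Toy stand-in for a streaming source built on a range-readable store."""

    def __init__(self, log: Dict[int, List[str]]) -> None:
        # partition -> messages; a message's list index is its offset.
        self._log = log

    def get_offset(self) -> Optional[Offsets]:
        """Latest available offsets, or None if no data has arrived yet."""
        if not any(self._log.values()):
            return None
        return {p: len(msgs) for p, msgs in self._log.items()}

    def get_batch(self, start: Optional[Offsets], end: Offsets) -> List[Row]:
        """Read everything in [start, end) for each partition listed in `end`."""
        rows: List[Row] = []
        for partition in sorted(end):
            begin = 0 if start is None else start.get(partition, 0)
            for offset in range(begin, end[partition]):
                rows.append((partition, offset, self._log[partition][offset]))
        return rows
```

The shim itself stays small because the range read already exists; as the comment notes, what it leaves out is everything around it, such as SSL, dynamic topics and offset committing.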
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556129#comment-15556129 ]

Jeremy Smith commented on SPARK-17344:

+1

We're on CDH, and it will probably be a while before they support Kafka 0.10. At the same time, we don't use their Spark and we're looking forward to upgrading to 2.0.x and using structured streaming.

I was just going to write our own Kafka Source implementation which uses the existing KafkaRDD, but it would be much easier to get buy-in for an official Spark module. I will gladly add my vote to this issue if it's reopened.
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552915#comment-15552915 ]

Michael Armbrust commented on SPARK-17344:

BTW, people should still comment here if this is what is preventing them from using Structured Streaming. I think we should be swayed by enough demand.