[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762756#comment-15762756
 ] 

Michael Armbrust commented on SPARK-17344:
--

[KAFKA-4462] aims to give us backwards compatibility for clients, which will be 
great.  The fact that there is a long-term plan here makes me less allergic to 
the idea of copy/pasting the 0.10.x {{Source}} and porting it to 0.8.0 as an 
interim solution for those who can't upgrade yet.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-28 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617004#comment-15617004
 ] 

Michael Allman commented on SPARK-17344:


We (at VideoAmp) would love to use structured streaming with Kafka. However, we 
use Kafka 0.8 and have no present desire or reason to upgrade. There's nothing 
valuable to us in the newer versions, and performing an upgrade would be a 
highly non-trivial undertaking given that so many of our production systems 
integrate with Kafka. We really don't want to mess with something like that.

I believe the new streaming API has great potential to simplify our streaming 
apps, and I'm eager to try a fresh approach to streaming in Spark. I cannot 
commit to making a contribution to this effort, but I wanted to voice my 
opinion and cast my vote.

Thanks.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567457#comment-15567457
 ] 

Cody Koeninger commented on SPARK-17344:


Given the choice between rewriting underlying kafka consumers and having a 
split codebase, I'd rather have a split codebase.  Of course I'd rather not 
sink development effort into an old version of kafka at all, until the 
structured stream for 0.10 is working for my use cases.

But if you want to wrap the 0.8 rdd in a structured stream, go for it, I'll 
help you figure out how to do it.  Seriously.  Don't expect larger project uptake, 
but if you just need something to work for you




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Jeremy Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567367#comment-15567367
 ] 

Jeremy Smith commented on SPARK-17344:
--

 > By contrast, writing a streaming source shim around the existing simple 
 > consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't 
 > have stuff like SSL, dynamic topics, or offset committing.

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind 
to the beginning, or throw away everything from the past, or (as before) rely 
on the incredibly fragile StreamingContext checkpoints. Or, commit the 
topic/partition/offset to the sink so you can recover the actually processed 
messages from there. Again, decoupling each operation from the entire state of 
the StreamingContext is a huge step up, because you can actually upgrade your 
application code (at least in certain ways) without having to worry about 
re-processing stuff due to discarding the checkpoints.
* It will dramatically simplify the usage of Kafka from Spark in general. 9/10 
use cases involve some sort of structured data, the processing of which will 
have dramatically better performance when being used with tungsten than with 
RDD-level operations.
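
The second point above (writing the topic/partition/offset to the sink so that processed messages can be recovered from there) can be illustrated with a small, self-contained sketch. This is not Spark or Kafka code: `OffsetTrackingSink` and its methods are hypothetical names, the sink is an in-memory stand-in, and records are assumed to arrive in offset order within each partition.

```python
from typing import Dict, Iterable, List, Tuple

TopicPartition = Tuple[str, int]
Record = Tuple[str, int, int, str]  # (topic, partition, offset, value)


class OffsetTrackingSink:
    """Hypothetical in-memory sink that stores offsets next to the data."""

    def __init__(self) -> None:
        self.rows: List[Record] = []
        # Highest offset written so far for each (topic, partition).
        self.committed: Dict[TopicPartition, int] = {}

    def write(self, records: Iterable[Record]) -> int:
        """Append records, skipping replays; return how many were written."""
        written = 0
        for topic, partition, offset, value in records:
            key = (topic, partition)
            if offset <= self.committed.get(key, -1):
                continue  # already in the sink, e.g. replayed after a restart
            self.rows.append((topic, partition, offset, value))
            self.committed[key] = offset
            written += 1
        return written

    def resume_offsets(self) -> Dict[TopicPartition, int]:
        """Next offset to read for each partition the sink has seen."""
        return {tp: last + 1 for tp, last in self.committed.items()}
```

In a real sink the data and the offsets would need to be written atomically (for example in one transaction); the sketch only shows why keeping them together makes recovery independent of the application's checkpoints.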

So if the simple-consumer based Kafka source would be so easy, at the expense 
of some features, why not introduce it? I have a tremendous amount of respect 
for the complexity of Kafka and the work you're doing with it, but I also get a 
sense that the conceptual "perfect" here is the enemy of the good. The weekend 
project you mentioned would result in a dramatic improvement in the experience 
for a large percentage of users who are currently using Spark and Kafka 
together. Most companies are using some kind of Hadoop distribution (e.g. HDP 
or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH 
doesn't), but at what rate are people actually able to update HDP? I don't have 
any data on it (ironically) but I'm guessing that 0.9 still represents a fairly 
significant portion of the Kafka install base.

Just my two cents on the matter.




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567321#comment-15567321
 ] 

Michael Armbrust commented on SPARK-17344:
--

These are good questions.  A few thoughts:

bq. How long would it take CDH to distribute 0.10 if there was a compelling 
Spark client for it?

Even if they were going to release kafka 0.10 in CDH yesterday, my experience 
is that it will take a long time for many people to upgrade.  We spent a fair 
amount of effort on multi-version compatibility for Hive in Spark SQL and it 
was a great boost for adoption.  I think this could be the same thing.

bq. How are you going to handle SSL?  You can't avoid the complexity of caching 
consumers if you still want the benefits of prefetching, and doing an SSL 
handshake for every batch will kill performance if they aren't cached.

An option here would be to use the internal client directly.  This way we can 
leverage all the work that they did to support SSL, etc., yet make it speak 
specific versions of the protocol as we need. I did a [really rough 
prototype|https://gist.github.com/marmbrus/7d116b0a9672337497ddfccc0657dbf0] 
using the APIs described above and it is not that much code.  There is clearly 
a lot more we'd need to do, but I think we should strongly consider this option.

Caching connections to the specific brokers should probably still be 
implemented for the reasons you describe (and this is already handled by the 
internal client).  An advantage here is you'd actually be able to share 
connections across queries without running into correctness problems.




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566244#comment-15566244
 ] 

Cody Koeninger commented on SPARK-17344:


How long would it take CDH to distribute 0.10 if there was a compelling Spark 
client for it?

How are you going to handle SSL?

You can't avoid the complexity of caching consumers if you still want the 
benefits of prefetching, and doing an SSL handshake for every batch will kill 
performance if they aren't cached.

Also note that this is a pretty prime example of what I'm talking about in my 
dev mailing list discussion on SIPs.  Continuing support for 0.8 has been 
brought up, and decided against, multiple times.

You guys started making promises about structured streaming for Kafka over half 
a year ago, and still don't have it feature complete.  This is a big potential 
detour for uncertain gain.  The real underlying problem is still how you're 
going to do better than simply wrapping a DStream, and I don't see how this is 
directly relevant.




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566191#comment-15566191
 ] 

Michael Armbrust commented on SPARK-17344:
--

I think the fact that CDH is still distributing 0.9 is a pretty convincing 
argument.

I'm also not convinced it's a bad idea to speak the protocol directly.  Our 
use case ends up being significantly simpler than most other consumer 
implementations since we have the opportunity to do global coordination on the 
driver.  As such, we'd really only need to correctly handle two types of 
requests: 
[TopicMetadataRequest|https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-TopicMetadataRequest]
 and 
[FetchRequest|https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchRequest].
 - The variations here across versions are minimal for these messages.
 - We could avoid having N different artifacts for N versions of kafka
 - We could remove the complexity of caching consumers on executors (though 
still set preferred locations to encourage collocation).
 - We could avoid extra copies of the payload when going from the kafka library 
into tungsten.
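
To make the "global coordination on the driver" idea concrete, here is a minimal sketch under stated assumptions. It is plain Python, not the prototype or any Spark/Kafka API: `FetchTask` and `plan_batch` are hypothetical names, the leader map stands in for what a metadata request would return, and a partition with no recorded start is assumed to begin at offset 0.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]


@dataclass(frozen=True)
class FetchTask:
    topic: str
    partition: int
    from_offset: int     # inclusive
    until_offset: int    # exclusive
    preferred_host: str  # locality hint only; "" means no preference


def plan_batch(
    start: Dict[TopicPartition, int],
    latest: Dict[TopicPartition, int],
    leaders: Dict[TopicPartition, str],
) -> List[FetchTask]:
    """Driver-side planning: one fetch task per partition with new data."""
    tasks = []
    for tp in sorted(latest):
        begin = start.get(tp, 0)  # assumption: unseen partitions start at 0
        until = latest[tp]
        if until > begin:
            topic, partition = tp
            tasks.append(
                FetchTask(topic, partition, begin, until, leaders.get(tp, ""))
            )
    return tasks
```

Because the driver fixes every offset range up front, each executor task in this model only has to fetch a known range from a known broker, which is why so few request types would be needed.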

I agree we shouldn't make this decision lightly, but looking at our past 
experience supporting multiple versions of Hadoop/Hive as transparently as 
possible, I think this could be a big boost for adoption.




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558474#comment-15558474
 ] 

Cody Koeninger commented on SPARK-17344:


I think this is premature until you have a fully operational battlestation, er, 
structured stream, that has all the necessary features for 0.10

Regarding the conversation with Michael about possibly using the kafka protocol 
directly as a way to work around the differences between 0.8 and 0.10, please 
don't consider that.  Every kafka consumer implementation I've ever used has 
bugs, and we don't need to spend time writing another buggy one.  

By contrast, writing a streaming source shim around the existing simple 
consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't have 
stuff like SSL, dynamic topics, or offset committing.
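
As a rough illustration of what such a shim involves, the sketch below models a source with two operations: report the latest available offsets, and return the records between two offset snapshots. It is loosely modeled on the getOffset/getBatch shape of Spark's streaming source interface as I understand it, but it is plain Python with hypothetical names (`InMemoryLog`, `ShimSource`), and the in-memory log stands in for both the topic and the existing batch reader.

```python
from typing import Dict, List, Optional, Tuple

Offsets = Dict[int, int]  # partition -> next offset to read (exclusive end)


class InMemoryLog:
    """Stand-in for a topic; read_range plays the role of the batch reader."""

    def __init__(self, partitions: Dict[int, List[str]]) -> None:
        self.partitions = partitions

    def latest_offsets(self) -> Offsets:
        return {p: len(msgs) for p, msgs in self.partitions.items()}

    def read_range(self, partition: int, start: int, end: int) -> List[str]:
        return self.partitions[partition][start:end]


class ShimSource:
    """Hypothetical streaming source wrapped around a batch-style reader."""

    def __init__(self, log: InMemoryLog) -> None:
        self.log = log

    def get_offset(self) -> Optional[Offsets]:
        """Latest available offsets, or None if there is no data at all."""
        latest = self.log.latest_offsets()
        return latest if any(latest.values()) else None

    def get_batch(
        self, start: Optional[Offsets], end: Offsets
    ) -> List[Tuple[int, int, str]]:
        """Rows of (partition, offset, value) in the range (start, end]."""
        rows = []
        for partition in sorted(end):
            begin = (start or {}).get(partition, 0)
            values = self.log.read_range(partition, begin, end[partition])
            for i, value in enumerate(values):
                rows.append((partition, begin + i, value))
        return rows
```

The engine, not the source, remembers which offset snapshot was last processed, so the shim itself needs no offset committing; that matches the feature list given above.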




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-07 Thread Jeremy Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556129#comment-15556129
 ] 

Jeremy Smith commented on SPARK-17344:
--

+1

We're on CDH, and it will probably be a while before they support Kafka 0.10. 
At the same time, we don't use their Spark and we're looking forward to 
upgrading to 2.0.x and using structured streaming.

I was just going to write our own Kafka Source implementation which uses the 
existing KafkaRDD but it would be much easier to get buy-in for an official 
Spark module.

I will gladly add my vote to this issue if it's reopened.




[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-06 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552915#comment-15552915
 ] 

Michael Armbrust commented on SPARK-17344:
--

BTW, people should still comment here if this is what is preventing them from 
using Structured Streaming.  I think we should be swayed by enough demand.
