Spark 2.2.0 or Spark 2.3.0?

2017-05-01 Thread kant kodali
Hi All, If I understand the Spark standard release process correctly, it looks like the official release is going to be sometime at the end of this month, and it is going to be 2.2.0, right (not 2.3.0)? I am eagerly looking for Spark 2.2.0 because of the "update mode" option in Spark Streaming. Please corr
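The "update" output mode being referred to landed in Structured Streaming for 2.2.0. A minimal sketch of how it is used (the socket source, host, and port are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("update-mode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical streaming source; any source works the same way.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()

// "update" emits only the rows that changed since the last trigger,
// unlike "complete" (whole result table) or "append" (only new rows).
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()
```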

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan, IMO, the problem is that the Spark Avro version conflicts with the Parquet Avro version. As discussed upthread, I don’t think there’s a way to reliably make sure that Avro 1.8 is on the classpath first while using spark-submit. Relocating avro in our project wouldn’t solve the problem,

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
Michael, I think that the problem is with your classpath. Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is what pulls in parquet-avro and, transitively, Avro 1.8. Spark has no runtime dependency on Avro 1.8. It is understandably annoying that using the same version of Parquet
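For context, a hedged sketch of how that transitive pull typically looks in a downstream build (coordinates are the usual ones; the exact Avro version parquet-avro drags in depends on its own POM):

```scala
// build.sbt fragment (sketch): the downstream project, not Spark, declares
// parquet-avro, and parquet-avro 1.8.2 in turn pulls in Avro 1.8.x transitively,
// while Spark itself ships Avro 1.7.7.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"    % "2.2.0" % "provided",
  "org.apache.parquet" %  "parquet-avro" % "1.8.2"
)
```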

Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-05-01 Thread Michael Armbrust
This vote passes. Thanks to everyone for testing! I'll begin packaging the release. +1: Sean Owen (binding), Michael Armbrust (binding), Reynold Xin (binding), Tom Graves (binding), Dong Joon Hyun, Holden Karau, Vaquar Khan, Kazuaki Ishizaki, Denny Lee, Felix Cheung. -1: None. On Fri, Apr 28, 2017 at 11:17

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Michael Heuer
Please excuse me if I'm misunderstanding -- the problem is not with our library or our classpath. There is a conflict within Spark itself, in that Parquet 1.8.2 expects to find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark already has to work around this for unit tests to pass

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan! I think relocating the avro dependency inside of Spark would make a lot of sense. Otherwise, we’d need Spark to move to Avro 1.8.0, or Parquet to cut a new 1.8.3 release that either reverts back to Avro 1.7.7 or that eliminates the code that is binary incompatible between Avro 1.7.7 an

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
Thanks for the extra context, Frank. I agree that it sounds like your problem comes from the conflict between your jars and what comes with Spark. It's the same concern that makes everyone shudder when anything has a public dependency on Jackson. :) What we usually do to get around situations like
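Purely as an illustration of the relocation idea discussed in this thread (assuming the sbt-assembly plugin; the shaded package name is made up), a downstream build can rename the conflicting packages inside its überjar:

```scala
// build.sbt fragment (sketch, assumes sbt-assembly): rename Avro's packages in the
// downstream überjar so the bundled copy cannot collide with whatever Avro version
// Spark itself puts on the runtime classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "myproject.shaded.avro.@1").inAll
)
```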

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Koert Kuipers
Sounds like you are running into the fact that you cannot really put your classes before Spark's on the classpath? Spark's switches to support this never really worked for me either. Inability to control the classpath + inconsistent jars => trouble? On Mon, May 1, 2017 at 2:36 PM, Frank Austin Notha
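The switches presumably being alluded to are the experimental user-classpath-first settings; a minimal sketch (in practice they are usually passed to spark-submit as --conf key=value, since the driver-side flag must be in place before the driver JVM assembles its classpath):

```scala
import org.apache.spark.SparkConf

// Ask Spark to prefer the user's jars over its own when loading classes.
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")     // driver side
  .set("spark.executor.userClassPathFirst", "true")   // executor side
```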

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan, We do set Avro to 1.8 in our downstream project. We also set Spark as a provided dependency and build an überjar. We run via spark-submit, which builds the classpath with our überjar and all of the Spark deps. This leads to Avro 1.7.7 getting picked off of the classpath at runtime, wh

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
Frank, The issue you're running into is caused by using parquet-avro with Avro 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark can't update Avro because it is a breaking change that would force users to rebuild specific Avro classes in some cases. But you should be free to
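A minimal sketch of what Ryan is suggesting, in sbt terms (the exact Avro 1.8.x patch version is an assumption):

```scala
// build.sbt fragment (sketch): pin Avro to 1.8.x in the downstream build. This fixes
// the version the build resolves, but whether it wins at runtime still depends on how
// spark-submit assembles the classpath.
dependencyOverrides += "org.apache.avro" % "avro" % "1.8.1"
```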

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Cody Koeninger
Yeah, seems reasonable. On Mon, May 1, 2017 at 12:40 PM, Jacek Laskowski wrote: > Hi, > > Thanks Cody and Michael! I didn't expect to get two answers so quickly and > from THE brains behind spark - Kafka integration. #impressed > > Yes, Michael has nailed it. Using save's path was so natural to m

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan et al, The issue we’ve seen using a build of the Spark 2.2.0 branch from a downstream project is that parquet-avro uses one of the new Avro 1.8.0 methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a dependency. My colleague Michael (who posted earlier on this thread
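The preview does not name the exact missing method; purely as an illustration of the failure mode, here is a hedged sketch of downstream code that exercises parquet-avro 1.8.2 directly and would hit a NoSuchMethodError at runtime if Avro 1.7.7 ends up on the classpath (path and schema are made up):

```scala
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Minimal Avro schema with a single string field.
val schema = SchemaBuilder.record("Sample").fields()
  .requiredString("name")
  .endRecord()

// Writing through parquet-avro 1.8.2; the library internally calls Avro 1.8 APIs,
// so with Avro 1.7.7 on the runtime classpath this path can fail with
// NoSuchMethodError rather than a compile-time error.
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("/tmp/sample.parquet"))
  .withSchema(schema)
  .build()

writer.write(new GenericRecordBuilder(schema).set("name", "x").build())
writer.close()
```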

Re: New Optimizer Hint

2017-05-01 Thread Josh Rosen
The issue of UDFS which return structs being evaluated many times when accessing the returned struct's fields sounds like https://issues.apache.org/jira/browse/SPARK-17728; that issue mentions a trick of using *array* and *explode* to prevent project collapsing. On Thu, Apr 20, 2017 at 8:55 AM Rey
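A hedged sketch of the trick mentioned (the UDF, field names, and data are hypothetical; the pattern follows the workaround noted in SPARK-17728):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, udf}

val spark = SparkSession.builder().appName("spark-17728-sketch").getOrCreate()
import spark.implicits._

// Hypothetical struct-returning UDF.
case class Parsed(a: Int, b: Int)
val parse = udf((s: String) => Parsed(s.length, s.hashCode))

val df = Seq("foo", "bar").toDF("s")

// Naive form: after projection collapsing, parse(s) can be evaluated once per
// referenced field rather than once per row.
val naive = df
  .withColumn("p", parse(col("s")))
  .select(col("p.a"), col("p.b"))

// Workaround: routing the struct through explode(array(...)) introduces a
// generator, which blocks the collapse, so the UDF runs once per row.
val once = df
  .select(explode(array(parse(col("s")))).as("p"))
  .select(col("p.a"), col("p.b"))
```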

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Jacek Laskowski
Hi, Thanks Cody and Michael! I didn't expect to get two answers so quickly and from THE brains behind spark - Kafka integration. #impressed Yes, Michael has nailed it. Using save's path was so natural to me after months with Spark that I was surprised to not have seen it instead of the custom and

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Michael Armbrust
He's just suggesting that since the DataStreamWriter start() method can fill in an option named "path", we should make that a synonym for "topic". Then you could do something like df.writeStream.format("kafka").start("topic"). Seems reasonable if people don't think that is confusing. On Mon, May
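A sketch of the two ways the Kafka sink can be addressed today, plus the shorthand proposed here (bootstrap servers, topic name, and checkpoint paths are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("kafka-sink-sketch").getOrCreate()

// Hypothetical streaming source; the kafka sink only needs a "value" column
// (and optionally "key" and "topic").
val df = spark.readStream
  .format("rate")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// 1) topic option: every row goes to one topic.
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "events")
  .option("checkpointLocation", "/tmp/kafka-sink-cp1")
  .start()

// 2) topic column: each row names its own destination topic.
df.withColumn("topic", lit("events"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/kafka-sink-cp2")
  .start()

// 3) Proposed in this thread (not implemented at the time): let start()'s
//    path argument act as a synonym for the topic option.
// df.writeStream.format("kafka").start("events")
```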

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
I agree with Sean. Spark only pulls in parquet-avro for tests. For execution, it implements the record materialization APIs in Parquet to go directly to Spark SQL rows. This doesn't actually leak an Avro 1.8 dependency into Spark as far as I can tell. rb On Mon, May 1, 2017 at 8:34 AM, Sean Owen

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Cody Koeninger
I'm confused about what you're suggesting. Are you saying that a Kafka sink should take a filesystem path as an option? On Mon, May 1, 2017 at 8:52 AM, Jacek Laskowski wrote: > Hi, > > I've just found out that KafkaSourceProvider supports topic option > that sets the Kafka topic to save a DataFr

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Sean Owen
See discussion at https://github.com/apache/spark/pull/17163 -- I think the issue is that fixing this trades one problem for a slightly bigger one. On Mon, May 1, 2017 at 4:13 PM Michael Heuer wrote: > Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does > not bump the depend

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Michael Heuer
Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does not bump the dependency version for avro (currently at 1.7.7). Though perhaps not clear from the issue I reported [0], this means that Spark is internally inconsistent, in that a call through parquet (which depends on avro 1.

[KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Jacek Laskowski
Hi, I've just found out that KafkaSourceProvider supports a topic option that sets the Kafka topic to save a DataFrame to. You can also use a topic column to assign rows to topics. Given those features, I've been wondering why the "path" option is not supported (even with the lowest precedence), so that when no topic