Hi All,
If I understand the Spark release process correctly, it looks like
the official release is going to be sometime around the end of this month, and
it is going to be 2.2.0, right (not 2.3.0)? I am eagerly looking forward to
Spark 2.2.0 because of the "update mode" option in Structured Streaming. Please corr
Hi Ryan,
IMO, the problem is that the Spark Avro version conflicts with the Parquet Avro
version. As discussed upthread, I don’t think there’s a way to reliably make
sure that Avro 1.8 is on the classpath first while using spark-submit.
Relocating avro in our project wouldn’t solve the problem,
Michael, I think that the problem is with your classpath.
Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is
what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
dependency on Avro 1.8. It is understandably annoying that using the same
version of Parquet
This vote passes. Thanks to everyone for testing! I'll begin packaging
the release.
+1
Sean Owen (binding)
Michael Armbrust (binding)
Reynold Xin (binding)
Tom Graves (binding)
Dong Joon Hyun
Holden Karau
Vaquar Khan
Kazuaki Ishizaki
Denny Lee
Felix Cheung
-1
None
On Fri, Apr 28, 2017 at 11:17
Please excuse me if I'm misunderstanding -- the problem is not with our
library or our classpath.
There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark
already has to work around this for unit tests to pass
Hi Ryan!
I think relocating the avro dependency inside of Spark would make a lot of
sense. Otherwise, we’d need Spark to move to Avro 1.8.0, or Parquet to cut a
new 1.8.3 release that either reverts back to Avro 1.7.7 or that eliminates the
code that is binary incompatible between Avro 1.7.7 an
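For reference, the kind of relocation being discussed could look roughly like this with the Maven shade plugin (a sketch only; the `org.sparkproject.avro` shaded package name is hypothetical, not what Spark actually uses):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Rewrite Avro's package so the shaded copy cannot collide with
           whatever Avro version the user's classpath brings in. -->
      <relocation>
        <pattern>org.apache.avro</pattern>
        <shadedPattern>org.sparkproject.avro</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```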
Thanks for the extra context, Frank. I agree that it sounds like your
problem comes from the conflict between your Jars and what comes with
Spark. It's the same concern that makes everyone shudder when anything has a
public dependency on Jackson. :)
What we usually do to get around situations like
Sounds like you are running into the fact that you cannot really put your
classes before Spark's on the classpath? Spark's switches to support this never
really worked for me either.
inability to control the classpath + inconsistent jars => trouble ?
On Mon, May 1, 2017 at 2:36 PM, Frank Austin Notha
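The "switches" alluded to above are the experimental user-classpath-first flags. A sketch of what that looks like in spark-defaults.conf (these settings exist but, as noted, are known to be unreliable in practice):

```properties
# Experimental: prefer the user's jars over Spark's own at runtime.
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```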
Hi Ryan,
We do set Avro to 1.8 in our downstream project. We also set Spark as a
provided dependency, and build an überjar. We run via spark-submit, which
builds the classpath with our überjar and all of the Spark deps. This leads to
Avro 1.7.7 getting picked off of the classpath at runtime, wh
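One way to confirm which jar is actually winning at runtime is to ask the JVM where it loaded a class from. A self-contained sketch (the Avro class name in `main` is just an example; it only resolves if some Avro jar is on the classpath):

```scala
// Sketch: report where a class was loaded from, or that it is absent.
object WhichJar {
  def locate(className: String): String =
    try {
      val src = Class.forName(className).getProtectionDomain.getCodeSource
      // Bootstrap-classpath classes (e.g. java.lang.String) have no code source.
      if (src == null) "bootstrap classpath" else src.getLocation.toString
    } catch {
      case _: ClassNotFoundException => "not on classpath"
    }

  def main(args: Array[String]): Unit = {
    // Prints the jar URL of whichever Avro jar spark-submit put first, if any.
    println(locate("org.apache.avro.Schema"))
  }
}
```

Running this inside a Spark job would print the URL of the Avro jar that actually won, which is a quick way to verify the 1.7.7-vs-1.8 conflict described above.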
Frank,
The issue you're running into is caused by using parquet-avro with Avro
1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
can't update Avro because it is a breaking change that would force users to
rebuild specific Avro classes in some cases. But you should be free to
Yeah, seems reasonable.
On Mon, May 1, 2017 at 12:40 PM, Jacek Laskowski wrote:
> Hi,
>
> Thanks Cody and Michael! I didn't expect to get two answers so quickly and
> from THE brains behind spark - Kafka integration. #impressed
>
> Yes, Michael has nailed it. Using save's path was so natural to m
Hi Ryan et al,
The issue we’ve seen using a build of the Spark 2.2.0 branch from a downstream
project is that parquet-avro uses one of the new Avro 1.8.0 methods, and you
get a NoSuchMethodError since Spark puts Avro 1.7.7 as a dependency. My
colleague Michael (who posted earlier on this thread
The issue of UDFs which return structs being evaluated many times when
accessing the returned struct's fields sounds like
https://issues.apache.org/jira/browse/SPARK-17728; that issue mentions a
trick of using *array* and *explode* to prevent project collapsing.
On Thu, Apr 20, 2017 at 8:55 AM Rey
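The array/explode trick mentioned in SPARK-17728 can be sketched roughly like this (hypothetical DataFrame and column names; a sketch, not a tested recipe):

```scala
import org.apache.spark.sql.functions.{array, col, explode}

// Without the trick, each field access below may re-run expensiveUdf,
// because the projection gets collapsed into the UDF call:
//   df.withColumn("s", expensiveUdf(col("x"))).select(col("s.a"), col("s.b"))

// Wrapping the UDF result in an array and exploding it inserts a Generate
// node, which blocks project collapsing, so the struct is computed once:
val materialized = df
  .withColumn("s", explode(array(expensiveUdf(col("x")))))
  .select(col("s.a"), col("s.b"))
```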
Hi,
Thanks Cody and Michael! I didn't expect to get two answers so quickly and
from THE brains behind spark - Kafka integration. #impressed
Yes, Michael has nailed it. Using save's path was so natural to me after
months with Spark that I was surprised to not have seen it instead of the
custom and
He's just suggesting that since the DataStreamWriter start() method can
fill in an option named "path", we should make that a synonym for "topic".
Then you could do something like:
df.writeStream.format("kafka").start("topic")
Seems reasonable if people don't think that is confusing.
On Mon, May
I agree with Sean. Spark only pulls in parquet-avro for tests. For
execution, it implements the record materialization APIs in Parquet to go
directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
dependency into Spark as far as I can tell.
rb
On Mon, May 1, 2017 at 8:34 AM, Sean Owen
I'm confused about what you're suggesting. Are you saying that a
Kafka sink should take a filesystem path as an option?
On Mon, May 1, 2017 at 8:52 AM, Jacek Laskowski wrote:
> Hi,
>
> I've just found out that KafkaSourceProvider supports topic option
> that sets the Kafka topic to save a DataFr
See discussion at https://github.com/apache/spark/pull/17163 -- I think the
issue is that fixing this trades one problem for a slightly bigger one.
On Mon, May 1, 2017 at 4:13 PM Michael Heuer wrote:
> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does
> not bump the depend
Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does
not bump the dependency version for avro (currently at 1.7.7). Though
perhaps not clear from the issue I reported [0], this means that Spark is
internally inconsistent, in that a call through parquet (which depends on
avro 1.
Hi,
I've just found out that KafkaSourceProvider supports topic option
that sets the Kafka topic to save a DataFrame to.
You can also use the topic column to assign rows to topics.
Given the features, I've been wondering why "path" option is not
supported (even of least precedence) so when no topic