[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-15 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r314223378
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -53,7 +54,9 @@ case class AvroDataToCatalyst(
 
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
 
 Review comment:
  Exactly: to read data you can use a compatible evolved schema, but you also 
need to pass the schema with which the data was originally serialized, using 
this GenericDatumReader constructor: 
https://avro.apache.org/docs/1.8.2/api/java/org/apache/avro/generic/GenericDatumReader.html#GenericDatumReader(org.apache.avro.Schema,%20org.apache.avro.Schema)
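
  A minimal sketch of that two-schema constructor (the schema definitions and
variable names here are illustrative, not taken from the PR):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericDatumReader

// Illustrative schemas: the reader's schema adds a field with a default,
// which is a compatible evolution of the writer's schema.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
       {"name":"id","type":"long"}]}""")
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
       {"name":"id","type":"long"},
       {"name":"name","type":"string","default":""}]}""")

// The two-argument constructor resolves data written with writerSchema
// against the evolved readerSchema during deserialization.
val reader = new GenericDatumReader[Any](writerSchema, readerSchema)
```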


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-15 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r314222409
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -53,7 +54,9 @@ case class AvroDataToCatalyst(
 
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
+  @transient private lazy val reader = writerJsonFormatSchema
 
 Review comment:
  It's the whole point of the PR; it's explained in the description and in the 
Jira ticket.





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-08-15 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r314221371
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -28,39 +28,59 @@ object functions {
 // scalastyle:on: object.name
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 3.0.0
*/
   @Experimental
   def from_avro(
   data: Column,
-  jsonFormatSchema: String): Column = {
-new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, Map.empty))
+  jsonFormatSchema: String,
+  writerJsonFormatSchema: Option[String]): Column = {
 
 Review comment:
   You can create an Option in Java using `Option.apply(...)`
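
  For reference, `Option.apply` null-checks its argument, which is what makes
it convenient to call from Java, where the `Some`/`None` singletons are awkward
to reference directly. A quick Scala sketch (values are made up):

```scala
// Option.apply wraps a value, mapping null to None -- this is why
// scala.Option.apply(...) is the easiest way to build an Option from Java.
val some = Option.apply("evolved-schema-json")  // Some("evolved-schema-json")
val none = Option.apply(null)                   // None
```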





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-07-18 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r305049306
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -28,39 +28,59 @@ object functions {
 // scalastyle:on: object.name
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 3.0.0
*/
   @Experimental
   def from_avro(
   data: Column,
-  jsonFormatSchema: String): Column = {
-new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, Map.empty))
+  jsonFormatSchema: String,
+  writerJsonFormatSchema: Option[String]): Column = {
 
 Review comment:
  It's not possible, because the other version already has a default value: 
  `[error] in object functions, multiple overloaded alternatives of method 
from_avro define default arguments`
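
  The Scala restriction referred to here can be reproduced with a minimal
sketch (the names are illustrative, not the PR's code):

```scala
object Demo {
  // Compiles: only one of the overloads declares a default argument.
  def from_avro(data: String, schema: String): String = data
  def from_avro(
      data: String,
      schema: String,
      writerSchema: Option[String] = None): String = data

  // Would NOT compile if a second overload also declared a default:
  //   "in object Demo, multiple overloaded alternatives of method from_avro
  //    define default arguments"
  // def from_avro(data: String, schema: String = "{}"): String = data
}
```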





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-07-17 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r304597229
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -53,7 +54,9 @@ case class AvroDataToCatalyst(
 
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
+  @transient private lazy val reader = writerJsonFormatSchema
 
 Review comment:
  No: if `writerJsonFormatSchema` is set, then both `writerJsonFormatSchema` 
and `jsonFormatSchema` are used (in the `.map`); otherwise only 
`jsonFormatSchema` is used (in the `.getOrElse`).
   
   In the first case the GenericDatumReader is built with the constructor 
accepting the two schemas (reader and writer) while in the second case it's 
built with the constructor accepting only one schema (which is assumed to be 
the same for both operations): 
https://avro.apache.org/docs/1.8.2/api/java/org/apache/avro/generic/GenericDatumReader.html
   
   Please refer to the Jira ticket for a more detailed explanation: 
https://issues.apache.org/jira/browse/SPARK-27506
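
  Paraphrasing the logic described above (a sketch, not the PR's exact code;
`writerJsonFormatSchema: Option[String]` and the parsed `avroSchema` are
assumed to be in scope, as in AvroDataToCatalyst):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericDatumReader

lazy val reader: GenericDatumReader[Any] = writerJsonFormatSchema
  .map { writerJson =>
    // Writer's schema provided: resolve it against the (compatible,
    // evolved) reader's schema.
    val writerSchema = new Schema.Parser().parse(writerJson)
    new GenericDatumReader[Any](writerSchema, avroSchema)
  }
  .getOrElse {
    // No writer's schema: assume the same schema was used for both
    // serialization and deserialization.
    new GenericDatumReader[Any](avroSchema)
  }
```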





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-05-08 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r282146360
 
 

 ##
 File path: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala
 ##
 @@ -22,21 +22,24 @@ import org.apache.spark.annotation.Experimental
 package object avro {
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 2.4.0
*/
   @Experimental
   @deprecated("Please use 'org.apache.spark.sql.avro.functions.from_avro' 
instead.", "3.0.0")
   def from_avro(
   data: Column,
-  jsonFormatSchema: String): Column =
-org.apache.spark.sql.avro.functions.from_avro(data, jsonFormatSchema)
+  jsonFormatSchema: String,
 
 Review comment:
  I guess so; I'll remove it.





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-05-08 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r282146108
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/functions.scala
 ##
 @@ -28,39 +28,59 @@ object functions {
 // scalastyle:on: object.name
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 3.0.0
*/
   @Experimental
   def from_avro(
   data: Column,
-  jsonFormatSchema: String): Column = {
-new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, Map.empty))
+  jsonFormatSchema: String,
+  writerJsonFormatSchema: Option[String]): Column = {
+new Column(
+  AvroDataToCatalyst(
+data.expr,
+jsonFormatSchema,
+Map.empty,
+writerJsonFormatSchema
+  )
+)
   }
 
   /**
-   * Converts a binary column of avro format into its corresponding catalyst 
value. The specified
-   * schema must match the read data, otherwise the behavior is undefined: it 
may fail or return
-   * arbitrary result.
+   * Converts a binary column of avro format into its corresponding catalyst 
value. If a writer's
+   * schema is provided, a different (but compatible) schema can be used for 
reading. If no writer's
+   * schema is provided, the specified schema must match the read data, 
otherwise the behavior is
+   * undefined: it may fail or return arbitrary result.
*
* @param data the binary column.
* @param jsonFormatSchema the avro schema in JSON string format.
* @param options options to control how the Avro record is parsed.
+   * @param writerJsonFormatSchema the avro schema in JSON string format used 
to serialize the data.
*
* @since 3.0.0
*/
   @Experimental
   def from_avro(
   data: Column,
   jsonFormatSchema: String,
-  options: java.util.Map[String, String]): Column = {
-new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, 
options.asScala.toMap))
+  options: java.util.Map[String, String],
+  writerJsonFormatSchema: Option[String]): Column = {
 
 Review comment:
  Sure, that makes sense, though I can only set the default for one of the two 
`from_avro` overloads; otherwise the compiler complains that there are multiple 
overloaded functions with default arguments.





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-05-08 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r282145261
 
 

 ##
 File path: 
external/avro/src/test/java/org/apache/spark/sql/avro/JavaAvroFunctionsSuite.java
 ##
 @@ -66,8 +66,8 @@ public void testToAvroFromAvro() {
 String avroTypeStr = "{\"type\": \"string\", \"name\": \"str\"}";
 
 Dataset actual = avroDF.select(
-  from_avro(avroDF.col("a"), avroTypeLong),
-  from_avro(avroDF.col("b"), avroTypeStr));
+  from_avro(avroDF.col("a"), avroTypeLong, scala.Option.apply(null)),
 
 Review comment:
  It doesn't work for me; I tried a number of ways to pass a None from Java, 
and this is the only one I managed to make work. I know it's ugly, but it's 
only a test.





[GitHub] [spark] giamo commented on a change in pull request #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-04-30 Thread GitBox
URL: https://github.com/apache/spark/pull/24405#discussion_r279774509
 
 

 ##
 File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 ##
 @@ -53,8 +54,10 @@ case class AvroDataToCatalyst(
 
   @transient private lazy val avroSchema = new 
Schema.Parser().parse(jsonFormatSchema)
 
-  @transient private lazy val reader = new GenericDatumReader[Any](avroSchema)
-
+  @transient private lazy val reader =
+writerJsonFormatSchema.fold(new GenericDatumReader[Any](avroSchema))(
 
 Review comment:
  Changed.

