[jira] [Closed] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li closed SPARK-24256.
--
> ExpressionEncoder should support user-defined types as fields of Scala case
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Fangshi Li
> Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that Dataset is not flexible for other
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset.
> Although we can use AvroEncoder to define a Dataset whose type is an Avro
> Generic or Specific Record, such an Avro-typed Dataset has many limitations:
> # We cannot use joinWith on this Dataset, since the result is a tuple, but
> Avro types cannot be a field of that tuple.
> # We cannot use some type-safe aggregation methods on this Dataset, such as
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
> # We cannot augment an Avro SpecificRecord with additional primitive fields
> in a case class, which we find is a very common use case.
> The limitation is that ExpressionEncoder does not support serde of a Scala case
> class/tuple whose subfields are of any other user-defined type with its own
> Encoder for serde.
> To address this issue, we propose a trait as a contract (between
> ExpressionEncoder and any other user-defined Encoder) to enable case
> classes/tuples/Java beans to support user-defined types.
> With this proposed patch and our minor modification in AvroEncoder, we remove
> the above-mentioned limitations with the cluster-default conf
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally, and it has been in use for a few
> quarters. We want to propose it upstream, as we think this is a useful feature
> that makes Dataset more flexible for user types.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
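The "trait as a contract" idea described in the issue can be sketched without Spark at all. The following Spark-free Scala sketch uses only illustrative names (FieldEncoder, UserId are assumptions, not the actual patch's API): a derived tuple encoder discovers a user-supplied encoder for a user-defined field type implicitly, which mirrors how the proposed contract would let ExpressionEncoder delegate Avro subfields to AvroEncoder.

```scala
// Spark-free sketch of the proposed "trait as a contract" pattern.
// All names here are illustrative assumptions, not the actual patch's API.

// The contract: any user-defined type exposes serde through this trait.
trait FieldEncoder[T] {
  def serialize(t: T): String
  def deserialize(s: String): T
}

// A user-defined type with its own encoder, standing in for an Avro record.
final case class UserId(raw: String)

object FieldEncoder {
  implicit val longEnc: FieldEncoder[Long] = new FieldEncoder[Long] {
    def serialize(t: Long): String = t.toString
    def deserialize(s: String): Long = s.toLong
  }
  implicit val userIdEnc: FieldEncoder[UserId] = new FieldEncoder[UserId] {
    def serialize(t: UserId): String = t.raw
    def deserialize(s: String): UserId = UserId(s)
  }
  // The derived tuple encoder discovers its field encoders implicitly --
  // the analogue of ExpressionEncoder finding an AvroEncoder for a subfield.
  // (Toy format: fields joined with '|'; breaks if a field contains '|'.)
  implicit def tuple2Enc[A, B](implicit ea: FieldEncoder[A],
                               eb: FieldEncoder[B]): FieldEncoder[(A, B)] =
    new FieldEncoder[(A, B)] {
      def serialize(t: (A, B)): String =
        ea.serialize(t._1) + "|" + eb.serialize(t._2)
      def deserialize(s: String): (A, B) = {
        val i = s.indexOf('|')
        (ea.deserialize(s.substring(0, i)), eb.deserialize(s.substring(i + 1)))
      }
    }
}
```

With this in scope, `implicitly[FieldEncoder[(Long, UserId)]]` resolves automatically, so a tuple containing a user-defined type round-trips without the tuple encoder knowing anything about UserId.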
[jira] [Resolved] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li resolved SPARK-24256.
Resolution: Won't Fix
[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603800#comment-16603800 ] Fangshi Li commented on SPARK-24256:
To summarize our discussion for this PR: spark-avro is now merged into Spark as a built-in data source. The upstream community is not merging the AvroEncoder to support Avro types in Dataset; instead, the plan is to expose the user-defined type (UDT) API to support defining arbitrary user types in Dataset. The purpose of this patch was to enable ExpressionEncoder to work together with other kinds of Encoders, while upstream prefers to go with UDT. Given this, we can close this PR, and we will start the discussion on UDT in another channel.
[jira] [Updated] (SPARK-24287) Spark -packages option should support classifier, no-transitive, and custom conf
[ https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24287:
---
Summary: Spark -packages option should support classifier, no-transitive, and custom conf (was: Spark 2.x -packages option should support classifier, no-transitive, and custom conf)
> Spark -packages option should support classifier, no-transitive, and custom
> conf
>
> Key: SPARK-24287
> URL: https://issues.apache.org/jira/browse/SPARK-24287
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Fangshi Li
> Priority: Major
>
> We should extend Spark's --packages option to support:
> # Turning off transitive dependencies for a given artifact (like spark-avro)
> # Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
> # Resolving a given artifact with a custom ivy conf
> # Excluding particular transitive dependencies of a given artifact. We
> currently only have top-level exclusion rules that apply to all artifacts.
> We have tested this patch internally and it greatly increases the flexibility
> of the --packages option.
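For context, a hedged sketch of the gap this issue describes. The first command uses flags that do exist in Spark 2.x spark-submit (--packages, --exclude-packages); the per-artifact syntax in the second command is purely illustrative of the kind of extension the proposal asks for, and is NOT accepted by spark-submit today.

```shell
# Existing Spark 2.x behavior: coordinates are group:artifact:version, and
# exclusions are top-level only (they apply to every listed artifact).
spark-submit \
  --packages com.databricks:spark-avro_2.11:4.0.0 \
  --exclude-packages org.slf4j:slf4j-log4j12 \
  app.jar

# Hypothetical per-artifact syntax (classifier, transitivity control per
# coordinate), shown only to illustrate the proposal -- not a real flag format.
spark-submit \
  --packages "org.apache.avro:avro-mapred:1.7.4:h2?transitive=false" \
  app.jar
```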
[jira] [Updated] (SPARK-24287) Spark 2.x -packages option should support classifier, no-transitive, and custom conf
[ https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24287:
---
Description:
We should extend Spark's --packages option to support:
# Turning off transitive dependencies for a given artifact (like spark-avro)
# Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
# Resolving a given artifact with a custom ivy conf
# Excluding particular transitive dependencies of a given artifact. We currently only have top-level exclusion rules that apply to all artifacts.
We have tested this patch internally and it greatly increases the flexibility of the --packages option.

was:
We should extend Spark's --packages option to support:
# Turning off transitive dependencies for a given artifact (like spark-avro)
# Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
# Resolving a given artifact with a custom ivy conf
# Excluding particular transitive dependencies of a given artifact. We currently only have top-level exclusion rules that apply to all artifacts.
We have tested this patch internally and it increases the flexibility of the --packages option.
[jira] [Updated] (SPARK-24287) Spark 2.x -packages option should support classifier, no-transitive, and custom conf
[ https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24287:
---
Description:
We should extend Spark's --packages option to support:
# Turning off transitive dependencies for a given artifact (like spark-avro)
# Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
# Resolving a given artifact with a custom ivy conf
# Excluding particular transitive dependencies of a given artifact. We currently only have top-level exclusion rules that apply to all artifacts.
We have tested this patch internally and it increases the flexibility of the --packages option.

was:
We should extend Spark's --packages option to support:
# Turning off transitive dependencies for a given artifact (like spark-avro)
# Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
# Resolving a given artifact with a custom ivy conf
We have tested this patch internally and it increases the flexibility of the --packages option.
[jira] [Created] (SPARK-24287) Spark 2.x -packages option should support classifier, no-transitive, and custom conf
Fangshi Li created SPARK-24287:
--
Summary: Spark 2.x -packages option should support classifier, no-transitive, and custom conf
Key: SPARK-24287
URL: https://issues.apache.org/jira/browse/SPARK-24287
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.3.0
Reporter: Fangshi Li

We should extend Spark's --packages option to support:
# Turning off transitive dependencies for a given artifact (like spark-avro)
# Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
# Resolving a given artifact with a custom ivy conf
We have tested this patch internally and it increases the flexibility of the --packages option.
[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24256:
---
Description:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that Dataset is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
# We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
# We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
# We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The limitation is that ExpressionEncoder does not support serde of a Scala case class/tuple whose subfields are of any other user-defined type with its own Encoder for serde.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable case classes/tuples/Java beans to support user-defined types.
With this proposed patch and our minor modification in AvroEncoder, we remove the above-mentioned limitations with the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.

was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that Dataset is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The limitation that Spark does not support defining a Scala case class/tuple with subfields of any other user-defined type exists because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, and thus cannot use any Encoder to serde the user-defined fields in the case class/tuple.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.
[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24256:
---
Description:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that Dataset is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The limitation that Spark does not support defining a Scala case class/tuple with subfields of any other user-defined type exists because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, and thus cannot use any Encoder to serde the user-defined fields in the case class/tuple.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.

was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that it is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The limitation that Spark does not support defining a Scala case class/tuple with subfields of any other user-defined type exists because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, and thus cannot use any Encoder to serde the user-defined fields in the case class/tuple.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.
[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24256:
---
Description:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that it is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The limitation that Spark does not support defining a Scala case class/tuple with subfields of any other user-defined type exists because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, and thus cannot use any Encoder to serde the user-defined fields in the case class/tuple.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.

was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that it is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The root cause of these limitations is that Spark does not support simply defining a Scala case class/tuple with subfields of any other user-defined type, since ExpressionEncoder does not discover the implicit Encoder for the user-defined fields, and thus cannot use any Encoder to serde the user-defined fields in the case class/tuple.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we make it work with the conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.
[jira] [Created] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
Fangshi Li created SPARK-24256:
--
Summary: ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
Key: SPARK-24256
URL: https://issues.apache.org/jira/browse/SPARK-24256
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Fangshi Li

Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find that it is not flexible for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro Generic or Specific Record, such an Avro-typed Dataset has many limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple, but Avro types cannot be a field of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
The root cause of these limitations is that Spark does not support simply defining a Scala case class/tuple with subfields of any other user-defined type, since ExpressionEncoder does not discover the implicit Encoder for the user-defined fields, and thus cannot use any Encoder to serde the user-defined fields in the case class/tuple.
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable a case class/tuple/Java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
With this proposed patch and our minor modification in AvroEncoder, we make it work with the conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream, as we think this is a useful feature that makes Dataset more flexible for user types.
[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
[ https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24216: --- Description: When a user creates an aggregator object in Scala and passes the aggregator to Spark Dataset's agg() method, Spark will initialize TypedAggregateExpression with the nodeName field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment, depending on how the user creates the aggregator object. For example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw a java.lang.InternalError: "Malformed class name". This has been reported in scalatest [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] and discussed in several Scala upstream JIRAs, such as SI-8110 and SI-5425. To fix this issue, we follow the solution in [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044]: add a safer version of getSimpleName as a util method, and have TypedAggregateExpression invoke this util method rather than getClass.getSimpleName. was: When we create an aggregator object within a function in Scala and pass the aggregator to Spark Dataset's aggregation method, Spark will initialize TypedAggregateExpression with the name field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment; for example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw the error "Malformed class name". This has been reported in scalatest [https://github.com/scalatest/scalatest/pull/1044] and the Scala upstream JIRA [https://issues.scala-lang.org/browse/SI-8110]. To fix this issue, we follow the solution in [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.
> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala > --- > > Key: SPARK-24216 > URL: https://issues.apache.org/jira/browse/SPARK-24216 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Fangshi Li >Priority: Major > > When a user creates an aggregator object in Scala and passes the aggregator to > Spark Dataset's agg() method, Spark will initialize > TypedAggregateExpression with the nodeName field as > aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a > Scala environment, depending on how the user creates the aggregator object. For > example, if the aggregator class's fully qualified name is > "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw a > java.lang.InternalError: "Malformed class name". This has been reported in > scalatest > [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] > and discussed in several Scala upstream JIRAs, such as SI-8110 and SI-5425. > To fix this issue, we follow the solution in > [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044]: > add a safer version of getSimpleName as a util method, and have > TypedAggregateExpression invoke this util method rather than > getClass.getSimpleName.
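As a rough illustration of such a util method (the object name NameUtils, the method name safeSimpleName, and the fallback parsing are all hypothetical sketches of the scalatest approach, not the actual Spark patch):

```scala
// Hypothetical sketch of a safer getSimpleName. It falls back to
// parsing getName when the JVM throws InternalError("Malformed class
// name") on Scala-generated nested classes such as "MyUtils$myAgg$2$".
object NameUtils {
  def safeSimpleName(cls: Class[_]): String =
    try cls.getSimpleName
    catch {
      case _: InternalError =>
        // Drop the package prefix, then keep the last non-empty,
        // non-numeric '$'-separated segment ("MyUtils$myAgg$2$" -> "myAgg").
        val name = cls.getName
        val short = name.substring(name.lastIndexOf('.') + 1)
        short.split('$')
          .filter(s => s.nonEmpty && !s.forall(_.isDigit))
          .lastOption
          .getOrElse(short)
    }
}
```

For well-behaved classes the fallback is never taken, so behavior is unchanged; only the malformed-name path differs from a plain getSimpleName call.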
[jira] [Updated] (SPARK-23862) Spark ExpressionEncoder should support java enum type in scala
[ https://issues.apache.org/jira/browse/SPARK-23862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-23862: --- Issue Type: Bug (was: Improvement) > Spark ExpressionEncoder should support java enum type in scala > -- > > Key: SPARK-23862 > URL: https://issues.apache.org/jira/browse/SPARK-23862 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > In SPARK-21255, Spark upstream added support for creating encoders for java > enum types, but the support was only added to the Java API (for enums used within > Java beans). Since java enums can come from third-party Java libraries, we > have use cases that require > 1. using java enum types as fields of a Scala case class > 2. using a java enum as the type T in Dataset[T] > Spark ExpressionEncoder already supports ser/de of many Java types in > ScalaReflection, so we propose to add support for java enums as well, as a > follow-up of SPARK-21255.
[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
[ https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24216: --- Description: When we create an aggregator object within a function in Scala and pass the aggregator to Spark Dataset's aggregation method, Spark will initialize TypedAggregateExpression with the name field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment; for example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw the error "Malformed class name". This has been reported in scalatest [https://github.com/scalatest/scalatest/pull/1044] and the Scala upstream JIRA [https://issues.scala-lang.org/browse/SI-8110]. To fix this issue, we follow the solution in [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName. was: When we create an aggregator object within a function in Scala and pass the aggregator to Spark Dataset's aggregation method, Spark will initialize TypedAggregateExpression with the name field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment; for example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw the exception "Malformed class name". This has been reported in scalatest [https://github.com/scalatest/scalatest/pull/1044] and the Scala upstream JIRA [https://issues.scala-lang.org/browse/SI-8110]. To fix this issue, we follow the solution in [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.
> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala > --- > > Key: SPARK-24216 > URL: https://issues.apache.org/jira/browse/SPARK-24216 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Fangshi Li >Priority: Major > > When we create an aggregator object within a function in Scala and pass the > aggregator to Spark Dataset's aggregation method, Spark will initialize > TypedAggregateExpression with the name field as > aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a > Scala environment; for example, if the aggregator class's fully qualified name > is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw the error > "Malformed class name". This has been reported in scalatest > [https://github.com/scalatest/scalatest/pull/1044] and the Scala upstream JIRA > [https://issues.scala-lang.org/browse/SI-8110]. > To fix this issue, we follow the solution in > [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of > getSimpleName as a util method, and TypedAggregateExpression will invoke this > util method rather than getClass.getSimpleName.
[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
[ https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24216: --- Description: When we create an aggregator object within a function in Scala and pass the aggregator to Spark Dataset's aggregation method, Spark will initialize TypedAggregateExpression with the name field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment; for example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw the exception "Malformed class name". This has been reported in scalatest [https://github.com/scalatest/scalatest/pull/1044] and the Scala upstream JIRA [https://issues.scala-lang.org/browse/SI-8110]. To fix this issue, we follow the solution in [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName. was: When we create an aggregator object within a function in Scala and pass the aggregator to Spark Dataset's aggregation method, Spark will initialize TypedAggregateExpression with the name field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment; for example, if the aggregator class's fully qualified name is "com.linkedin.spark.common.lib.SparkUtils$keyAgg$2$", getSimpleName will throw the exception "Malformed class name". This has been reported in scalatest https://github.com/scalatest/scalatest/pull/1044 and the Scala upstream JIRA https://issues.scala-lang.org/browse/SI-8110. To fix this issue, we follow the solution in https://github.com/scalatest/scalatest/pull/1044 to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.
> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala > --- > > Key: SPARK-24216 > URL: https://issues.apache.org/jira/browse/SPARK-24216 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Fangshi Li >Priority: Major > > When we create an aggregator object within a function in Scala and pass the > aggregator to Spark Dataset's aggregation method, Spark will initialize > TypedAggregateExpression with the name field as > aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a > Scala environment; for example, if the aggregator class's fully qualified name > is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw the exception > "Malformed class name". This has been reported in scalatest > [https://github.com/scalatest/scalatest/pull/1044] and the Scala upstream JIRA > [https://issues.scala-lang.org/browse/SI-8110]. > To fix this issue, we follow the solution in > [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of > getSimpleName as a util method, and TypedAggregateExpression will invoke this > util method rather than getClass.getSimpleName.
[jira] [Created] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
Fangshi Li created SPARK-24216: -- Summary: Spark TypedAggregateExpression uses getSimpleName that is not safe in scala Key: SPARK-24216 URL: https://issues.apache.org/jira/browse/SPARK-24216 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.3.1 Reporter: Fangshi Li When we create an aggregator object within a function in Scala and pass the aggregator to Spark Dataset's aggregation method, Spark will initialize TypedAggregateExpression with the name field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment; for example, if the aggregator class's fully qualified name is "com.linkedin.spark.common.lib.SparkUtils$keyAgg$2$", getSimpleName will throw the exception "Malformed class name". This has been reported in scalatest https://github.com/scalatest/scalatest/pull/1044 and the Scala upstream JIRA https://issues.scala-lang.org/browse/SI-8110. To fix this issue, we follow the solution in https://github.com/scalatest/scalatest/pull/1044 to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.
[jira] [Updated] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
[ https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-23815: --- Priority: Major (was: Minor) > Spark writer dynamic partition overwrite mode fails to write output on multi > level partition > > > Key: SPARK-23815 > URL: https://issues.apache.org/jira/browse/SPARK-23815 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Spark introduced a new writer mode to overwrite only related partitions in > SPARK-20236. While using this feature in our production cluster, we > found a bug when writing multi-level partitions on HDFS. > A simple test case to reproduce this issue: > val df = Seq(("1","2","3")).toDF("col1", "col2","col3") > > df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location") > If the HDFS location "/my/hdfs/location" does not exist, there will be no output. > This seems to be caused by the job commit change of SPARK-20236 in > HadoopMapReduceCommitProtocol. > In the commit job process, the output has been written into the staging dir > /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and then the code calls > fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to > /my/hdfs/location/col1=1/col2=2. However, in our case the operation will fail > on HDFS because /my/hdfs/location/col1=1 does not exist. HDFS rename cannot > create directories more than one level deep. > This does not happen in the unit tests covering SPARK-20236, which use the > local file system. > We are proposing a fix: when cleaning the current partition dir > /my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails > (because /my/hdfs/location/col1=1/col2=2 may not exist), we call the mkdirs op to > create the parent dir /my/hdfs/location/col1=1 (if the parent dir does not > exist) so the following rename op can succeed.
> > Reference: In the official HDFS documentation ([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html]), the rename operation has preconditions: > {code} > {{dest}} must be root, or have a parent that exists > {code}
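The proposed commit-time behavior can be sketched against the Hadoop FileSystem API (the method name commitPartition and the surrounding structure are illustrative assumptions, not the actual patch in HadoopMapReduceCommitProtocol):

```scala
// Illustrative sketch of the proposed fix, not the actual patch:
// make sure the destination's parent directory exists before renaming,
// because HDFS rename will not create intermediate directories.
import org.apache.hadoop.fs.{FileSystem, Path}

def commitPartition(fs: FileSystem, stagingDir: Path, destDir: Path): Unit = {
  // Clean a stale destination; delete may return false when the
  // partition dir (e.g. /my/hdfs/location/col1=1/col2=2) never existed.
  if (!fs.delete(destDir, /* recursive = */ true) && !fs.exists(destDir.getParent)) {
    // Create the missing parent, e.g. /my/hdfs/location/col1=1,
    // so the following rename has a valid destination parent.
    fs.mkdirs(destDir.getParent)
  }
  fs.rename(stagingDir, destDir)
}
```

This matches the rename precondition quoted above: after mkdirs, dest has a parent that exists, so the rename from the staging dir can succeed even on a fresh multi-level output path.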
[jira] [Updated] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
[ https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-23815: --- Description: Spark introduced a new writer mode to overwrite only related partitions in SPARK-20236. While using this feature in our production cluster, we found a bug when writing multi-level partitions on HDFS. A simple test case to reproduce this issue: val df = Seq(("1","2","3")).toDF("col1", "col2","col3") df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location") If the HDFS location "/my/hdfs/location" does not exist, there will be no output. This seems to be caused by the job commit change of SPARK-20236 in HadoopMapReduceCommitProtocol. In the commit job process, the output has been written into the staging dir /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and then the code calls fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to /my/hdfs/location/col1=1/col2=2. However, in our case the operation will fail on HDFS because /my/hdfs/location/col1=1 does not exist. HDFS rename cannot create directories more than one level deep. This does not happen in the unit tests covering SPARK-20236, which use the local file system. We are proposing a fix: when cleaning the current partition dir /my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails (because /my/hdfs/location/col1=1/col2=2 may not exist), we call the mkdirs op to create the parent dir /my/hdfs/location/col1=1 (if the parent dir does not exist) so the following rename op can succeed. Reference: In the official HDFS documentation ([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html]), the rename operation has preconditions: {code} {{dest}} must be root, or have a parent that exists {code} was: Spark introduced a new writer mode to overwrite only related partitions in SPARK-20236.
While using this feature in our production cluster, we found a bug when writing multi-level partitions on HDFS. A simple test case to reproduce this issue: val df = Seq(("1","2","3")).toDF("col1", "col2","col3") df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location") If the HDFS location "/my/hdfs/location" does not exist, there will be no output. This seems to be caused by the job commit change of SPARK-20236 in HadoopMapReduceCommitProtocol. In the commit job process, the output has been written into the staging dir /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and then the code calls fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to /my/hdfs/location/col1=1/col2=2. However, in our case the operation will fail on HDFS because /my/hdfs/location/col1=1 does not exist. HDFS rename cannot create directories more than one level deep. This does not happen in the unit tests covering SPARK-20236, which use the local file system. We are proposing a fix: when cleaning the current partition dir /my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails (because /my/hdfs/location/col1=1/col2=2 may not exist), we call the mkdirs op to create the parent dir /my/hdfs/location/col1=1 (if the parent dir does not exist) so the following rename op can succeed. > Spark writer dynamic partition overwrite mode fails to write output on multi > level partition > > > Key: SPARK-23815 > URL: https://issues.apache.org/jira/browse/SPARK-23815 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Minor > > Spark introduced a new writer mode to overwrite only related partitions in > SPARK-20236. While using this feature in our production cluster, we > found a bug when writing multi-level partitions on HDFS.
> A simple test case to reproduce this issue: > val df = Seq(("1","2","3")).toDF("col1", "col2","col3") > > df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location") > If the HDFS location "/my/hdfs/location" does not exist, there will be no output. > This seems to be caused by the job commit change of SPARK-20236 in > HadoopMapReduceCommitProtocol. > In the commit job process, the output has been written into the staging dir > /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and then the code calls > fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to > /my/hdfs/location/col1=1/col2=2. However, in our case the operation will fail > on HDFS because /my/hdfs/location/col1=1 does not exist. HDFS rename cannot > create directories more than one level deep. > This does not happen in the unit tests covering SPARK-20236, which use the > local file system. > We are proposing a fix: when cleaning the current partition dir > /my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails > (because /my/hdfs/location/col1=1/col2=2 may not exist), we call the mkdirs op to > create the parent dir /my/hdfs/location/col1=1 (if the parent dir does not > exist) so the following rename op can succeed.
[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425832#comment-16425832 ] Fangshi Li commented on SPARK-21255: Created a follow-up issue to add java enum support in the Scala API: https://issues.apache.org/jira/browse/SPARK-23862 > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike >Assignee: Mike >Priority: Major > Fix For: 2.3.0 > > > When you try to create an encoder for an Enum type (or a bean with an enum property) > via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. > I did a little research and it turns out that in JavaTypeInference:126 the > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class, named > "declaringClass", which we then try to inspect recursively. Eventually we > try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" > with no read method, which leads to an NPE at TypeToken:495. > I think adding the property name "declaringClass" to the filtering will resolve this.
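The suggested filtering fix can be illustrated standalone with plain bean introspection (the helper name inspectableProperties is hypothetical; the real change would live in JavaTypeInference):

```scala
import java.beans.Introspector

// Standalone illustration of the suggested fix: skip the synthetic
// "declaringClass" property of enum types alongside "class" when
// introspecting a bean, so type inference never recurses from an enum
// into Class and eventually ClassLoader.
def inspectableProperties(cls: Class[_]) =
  Introspector.getBeanInfo(cls).getPropertyDescriptors.filterNot { p =>
    p.getName == "class" || p.getName == "declaringClass"
  }
```

With this filter in place, introspecting an enum type yields only its genuine bean properties, and the recursive inspection that triggered the NPE at TypeToken:495 is never entered.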
[jira] [Comment Edited] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425832#comment-16425832 ] Fangshi Li edited comment on SPARK-21255 at 4/4/18 4:44 PM: Created a follow-up issue to add java enum support in the Scala API: SPARK-23862 was (Author: shengzhixia): Created a follow-up issue to add java enum support in the Scala API: https://issues.apache.org/jira/browse/SPARK-23862 > NPE when creating encoder for enum > -- > > Key: SPARK-21255 > URL: https://issues.apache.org/jira/browse/SPARK-21255 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 > Environment: org.apache.spark:spark-core_2.10:2.1.0 > org.apache.spark:spark-sql_2.10:2.1.0 >Reporter: Mike >Assignee: Mike >Priority: Major > Fix For: 2.3.0 > > > When you try to create an encoder for an Enum type (or a bean with an enum property) > via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495. > I did a little research and it turns out that in JavaTypeInference:126 the > following code > {code:java} > val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) > val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == > "class") > val fields = properties.map { property => > val returnType = > typeToken.method(property.getReadMethod).getReturnType > val (dataType, nullable) = inferDataType(returnType) > new StructField(property.getName, dataType, nullable) > } > (new StructType(fields), true) > {code} > filters out properties named "class", because we wouldn't want to serialize > that. But enum types have another property of type Class, named > "declaringClass", which we then try to inspect recursively. Eventually we > try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" > with no read method, which leads to an NPE at TypeToken:495. > I think adding the property name "declaringClass" to the filtering will resolve this.
[jira] [Created] (SPARK-23862) Spark ExpressionEncoder should support java enum type in scala
Fangshi Li created SPARK-23862: -- Summary: Spark ExpressionEncoder should support java enum type in scala Key: SPARK-23862 URL: https://issues.apache.org/jira/browse/SPARK-23862 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Fangshi Li In SPARK-21255, Spark upstream added support for creating encoders for java enum types, but the support was only added to the Java API (for enums used within Java beans). Since java enums can come from third-party Java libraries, we have use cases that require 1. using java enum types as fields of a Scala case class 2. using a java enum as the type T in Dataset[T] Spark ExpressionEncoder already supports ser/de of many Java types in ScalaReflection, so we propose to add support for java enums as well, as a follow-up of SPARK-21255.
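The two requested use cases can be shown as a hypothetical usage sketch (TimeUnit stands in for any third-party java enum; this requires a SparkSession and the proposed encoder support, and does not compile against stock Spark 2.3, whose Scala API lacks a java enum encoder):

```scala
// Hypothetical usage this proposal would enable; a sketch, not code
// that works on unpatched Spark 2.3.
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.SparkSession

// Use case 1: a java enum as a field of a Scala case class.
case class Timeout(name: String, unit: TimeUnit)

object EnumDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val timeouts = Seq(Timeout("rpc", TimeUnit.SECONDS)).toDS()

    // Use case 2: a java enum as the type T in Dataset[T].
    val units = Seq(TimeUnit.SECONDS, TimeUnit.MINUTES).toDS()

    timeouts.show(); units.show()
    spark.stop()
  }
}
```

A natural design choice here, consistent with how SPARK-21255 handled the Java API, is to ser/de the enum as its string name and reconstruct it via the enum's valueOf.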
[jira] [Created] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition
Fangshi Li created SPARK-23815: -- Summary: Spark writer dynamic partition overwrite mode fails to write output on multi level partition Key: SPARK-23815 URL: https://issues.apache.org/jira/browse/SPARK-23815 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Fangshi Li Spark introduced a new writer mode to overwrite only related partitions in SPARK-20236. While using this feature in our production cluster, we found a bug when writing multi-level partitions on HDFS. A simple test case to reproduce this issue: val df = Seq(("1","2","3")).toDF("col1", "col2","col3") df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location") If the HDFS location "/my/hdfs/location" does not exist, there will be no output. This seems to be caused by the job commit change of SPARK-20236 in HadoopMapReduceCommitProtocol. In the commit job process, the output has been written into the staging dir /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and then the code calls fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to /my/hdfs/location/col1=1/col2=2. However, in our case the operation will fail on HDFS because /my/hdfs/location/col1=1 does not exist. HDFS rename cannot create directories more than one level deep. This does not happen in the unit tests covering SPARK-20236, which use the local file system. We are proposing a fix: when cleaning the current partition dir /my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails (because /my/hdfs/location/col1=1/col2=2 may not exist), we call the mkdirs op to create the parent dir /my/hdfs/location/col1=1 (if the parent dir does not exist) so the following rename op can succeed.