[jira] [Closed] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-09-04 Thread Fangshi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li closed SPARK-24256.
--

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that Dataset is not flexible enough for
> other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in
> Dataset. Although we can use AvroEncoder to define a Dataset typed to an Avro
> Generic or Specific Record, such an Avro-typed Dataset has several
> limitations:
>  # We cannot use joinWith on this Dataset, since the result is a tuple and
> Avro types cannot be fields of that tuple.
>  # We cannot use some type-safe aggregation methods on this Dataset, such as
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields
> in a case class, which we find is a very common use case.
> The underlying limitation is that ExpressionEncoder does not support ser/de
> of a Scala case class or tuple whose fields are user-defined types with their
> own Encoders.
> To address this, we propose a trait as a contract (between ExpressionEncoder
> and any other user-defined Encoder) that enables case classes, tuples, and
> Java beans to support user-defined types as fields.
> With this proposed patch and a minor modification in AvroEncoder, we remove
> the above-mentioned limitations with the cluster-default conf
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
> com.databricks.spark.avro.AvroEncoder$
> We have implemented this patch internally and it has been in use for a few
> quarters. We propose it upstream because we think it is a useful feature that
> makes Dataset more flexible for user-defined types.
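A minimal Scala sketch of the limitation, with hypothetical names; UserEvent stands
in for an Avro SpecificRecord-generated class that has its own user-defined Encoder
(such as spark-avro's AvroEncoder) but is not a type ExpressionEncoder can handle as
a field:

{code:scala}
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical stand-in for an Avro-generated SpecificRecord class whose ser/de
// is handled by a user-defined Encoder, not by ExpressionEncoder.
class UserEvent { var id: String = _ }

// Limitation 3: augmenting the record with a primitive field in a case class.
case class EnrichedEvent(event: UserEvent, weight: Double)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val events: Dataset[UserEvent] = ???  // built with the user-defined AvroEncoder

// Each of these fails, because the ExpressionEncoder derived for a case class or
// tuple cannot delegate the UserEvent field to its user-defined Encoder:
val enriched = events.map(e => EnrichedEvent(e, 1.0))                  // case class with a user-typed field
val joined   = events.joinWith(events, lit(true))                      // limitation 1: result is (UserEvent, UserEvent)
val reduced  = events.groupByKey(_.hashCode).reduceGroups((a, _) => a) // limitation 2: result is (Int, UserEvent)
{code}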






[jira] [Resolved] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-09-04 Thread Fangshi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li resolved SPARK-24256.

Resolution: Won't Fix

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that Dataset is not flexible enough for
> other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in
> Dataset. Although we can use AvroEncoder to define a Dataset typed to an Avro
> Generic or Specific Record, such an Avro-typed Dataset has several
> limitations:
>  # We cannot use joinWith on this Dataset, since the result is a tuple and
> Avro types cannot be fields of that tuple.
>  # We cannot use some type-safe aggregation methods on this Dataset, such as
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields
> in a case class, which we find is a very common use case.
> The underlying limitation is that ExpressionEncoder does not support ser/de
> of a Scala case class or tuple whose fields are user-defined types with their
> own Encoders.
> To address this, we propose a trait as a contract (between ExpressionEncoder
> and any other user-defined Encoder) that enables case classes, tuples, and
> Java beans to support user-defined types as fields.
> With this proposed patch and a minor modification in AvroEncoder, we remove
> the above-mentioned limitations with the cluster-default conf
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
> com.databricks.spark.avro.AvroEncoder$
> We have implemented this patch internally and it has been in use for a few
> quarters. We propose it upstream because we think it is a useful feature that
> makes Dataset more flexible for user-defined types.
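The trait itself is not shown in this thread; a hypothetical sketch of the shape
such a contract between ExpressionEncoder and a user-defined Encoder might take
(names and signatures are illustrative only, not the actual patch):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical contract: a user-defined Encoder exposes the pieces that
// ExpressionEncoder needs in order to embed the type as a field of a
// case class, tuple, or Java bean.
trait ExpressionEncoderCompatible[T] {
  def fieldSchema: DataType                          // Catalyst type of the user-defined field
  def serializerFor(input: Expression): Expression   // expression serializing the field to Catalyst
  def deserializerFor(path: Expression): Expression  // expression deserializing the field back to T
}
{code}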






[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-09-04 Thread Fangshi Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603800#comment-16603800
 ] 

Fangshi Li commented on SPARK-24256:


To summarize our discussion for this PR:
spark-avro is now merged into Spark as a built-in data source. The upstream
community is not merging the AvroEncoder to support Avro types in Dataset;
instead, the plan is to expose the user-defined type (UDT) API to support
defining arbitrary user types in Dataset.

The purpose of this patch is to enable ExpressionEncoder to work together with
other kinds of Encoders, but upstream prefers to go with UDT. Given this, we
can close this PR and will start the discussion on UDT in another channel.
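For reference, the UDT route mentioned above means implementing Spark's
UserDefinedType API. A rough sketch only; the UDT API is internal and evolving in
Spark 2.x, so the exact signatures may differ:

{code:scala}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[PointUDT])
case class Point(x: Double, y: Double)

class PointUDT extends UserDefinedType[Point] {
  // Catalyst representation of the user type.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  override def serialize(p: Point): Any = new GenericArrayData(Array[Any](p.x, p.y))
  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
  }
  override def userClass: Class[Point] = classOf[Point]
}
{code}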

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that Dataset is not flexible enough for
> other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in
> Dataset. Although we can use AvroEncoder to define a Dataset typed to an Avro
> Generic or Specific Record, such an Avro-typed Dataset has several
> limitations:
>  # We cannot use joinWith on this Dataset, since the result is a tuple and
> Avro types cannot be fields of that tuple.
>  # We cannot use some type-safe aggregation methods on this Dataset, such as
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields
> in a case class, which we find is a very common use case.
> The underlying limitation is that ExpressionEncoder does not support ser/de
> of a Scala case class or tuple whose fields are user-defined types with their
> own Encoders.
> To address this, we propose a trait as a contract (between ExpressionEncoder
> and any other user-defined Encoder) that enables case classes, tuples, and
> Java beans to support user-defined types as fields.
> With this proposed patch and a minor modification in AvroEncoder, we remove
> the above-mentioned limitations with the cluster-default conf
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
> com.databricks.spark.avro.AvroEncoder$
> We have implemented this patch internally and it has been in use for a few
> quarters. We propose it upstream because we think it is a useful feature that
> makes Dataset more flexible for user-defined types.






[jira] [Updated] (SPARK-24287) Spark -packages option should support classifier, no-transitive, and custom conf

2018-05-15 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24287:
---
Summary: Spark -packages option should support classifier, no-transitive, 
and custom conf  (was: Spark 2.x -packages option should support classifier, 
no-transitive, and custom conf)

> Spark -packages option should support classifier, no-transitive, and custom 
> conf
> 
>
> Key: SPARK-24287
> URL: https://issues.apache.org/jira/browse/SPARK-24287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> We should extend Spark's --packages option to support:
>  # Turning off transitive dependency resolution for a given artifact (like
> spark-avro)
>  # Resolving a given artifact with a classifier (like
> avro-mapred-1.7.4-h2.jar)
>  # Resolving a given artifact with a custom ivy conf
>  # Excluding particular transitive dependencies of a given artifact. We
> currently only have a top-level exclusion rule that applies to all artifacts.
> We have tested this patch internally and it greatly increases the flexibility
> of the --packages option.
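For context, a short sketch of what Spark's existing dependency-resolution confs
can already express; the per-artifact classifier, transitivity, and ivy-conf
controls proposed above have no equivalent today:

{code:scala}
import org.apache.spark.sql.SparkSession

// Today, --packages (spark.jars.packages) takes plain Maven coordinates, and
// spark.jars.excludes is a single top-level exclusion list applied to every artifact.
val spark = SparkSession.builder()
  .config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
  .config("spark.jars.excludes", "org.apache.avro:avro-mapred") // top-level only, no per-artifact rule
  .getOrCreate()
{code}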






[jira] [Updated] (SPARK-24287) Spark 2.x -packages option should support classifier, no-transitive, and custom conf

2018-05-15 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24287:
---
Description: 
We should extend Spark's --packages option to support:
 # Turning off transitive dependency resolution for a given artifact (like
spark-avro)
 # Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
 # Resolving a given artifact with a custom ivy conf
 # Excluding particular transitive dependencies of a given artifact. We
currently only have a top-level exclusion rule that applies to all artifacts.

We have tested this patch internally and it greatly increases the flexibility
of the --packages option.

  was:
We should extend Spark's -package option to support:
 # Turn-off transitive dependency on a given artifact(like spark-avro)
 #  Resolving a given artifact with classifier (like avro-mapred-1.7.4-h2.jar
 #  Resolving a given artifact with custom ivy conf
 # Excluding particular transitive dependencies from a given artifact. We 
currently only have top-level exclusion rule applies for all artifacts.

We have tested this patch internally and it increases the flexibility when user 
uses -packages option 


> Spark 2.x -packages option should support classifier, no-transitive, and 
> custom conf
> 
>
> Key: SPARK-24287
> URL: https://issues.apache.org/jira/browse/SPARK-24287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> We should extend Spark's --packages option to support:
>  # Turning off transitive dependency resolution for a given artifact (like
> spark-avro)
>  # Resolving a given artifact with a classifier (like
> avro-mapred-1.7.4-h2.jar)
>  # Resolving a given artifact with a custom ivy conf
>  # Excluding particular transitive dependencies of a given artifact. We
> currently only have a top-level exclusion rule that applies to all artifacts.
> We have tested this patch internally and it greatly increases the flexibility
> of the --packages option.






[jira] [Updated] (SPARK-24287) Spark 2.x -packages option should support classifier, no-transitive, and custom conf

2018-05-15 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24287:
---
Description: 
We should extend Spark's --packages option to support:
 # Turning off transitive dependency resolution for a given artifact (like
spark-avro)
 # Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
 # Resolving a given artifact with a custom ivy conf
 # Excluding particular transitive dependencies of a given artifact. We
currently only have a top-level exclusion rule that applies to all artifacts.

We have tested this patch internally and it increases the flexibility of the
--packages option.

  was:
We should extend Spark's -package option to support:
 # Turn-off transitive dependency on a given artifact(like spark-avro)
 #  Resolving a given artifact with classifier (like avro-mapred-1.7.4-h2.jar
 #  Resolving a given artifact with custom ivy conf

We have tested this patch internally and it increases the flexibility when user 
uses -packages option 


> Spark 2.x -packages option should support classifier, no-transitive, and 
> custom conf
> 
>
> Key: SPARK-24287
> URL: https://issues.apache.org/jira/browse/SPARK-24287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> We should extend Spark's --packages option to support:
>  # Turning off transitive dependency resolution for a given artifact (like
> spark-avro)
>  # Resolving a given artifact with a classifier (like
> avro-mapred-1.7.4-h2.jar)
>  # Resolving a given artifact with a custom ivy conf
>  # Excluding particular transitive dependencies of a given artifact. We
> currently only have a top-level exclusion rule that applies to all artifacts.
> We have tested this patch internally and it increases the flexibility of the
> --packages option.






[jira] [Created] (SPARK-24287) Spark 2.x -packages option should support classifier, no-transitive, and custom conf

2018-05-15 Thread Fangshi Li (JIRA)
Fangshi Li created SPARK-24287:
--

 Summary: Spark 2.x -packages option should support classifier, 
no-transitive, and custom conf
 Key: SPARK-24287
 URL: https://issues.apache.org/jira/browse/SPARK-24287
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Fangshi Li


We should extend Spark's --packages option to support:
 # Turning off transitive dependency resolution for a given artifact (like
spark-avro)
 # Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar)
 # Resolving a given artifact with a custom ivy conf

We have tested this patch internally and it increases the flexibility of the
--packages option.






[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-12 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24256:
---
Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as
Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
supports these types, but we find that Dataset is not flexible enough for other
user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset.
Although we can use AvroEncoder to define a Dataset typed to an Avro Generic or
Specific Record, such an Avro-typed Dataset has several limitations:
 # We cannot use joinWith on this Dataset, since the result is a tuple and
Avro types cannot be fields of that tuple.
 # We cannot use some type-safe aggregation methods on this Dataset, such as
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 # We cannot augment an Avro SpecificRecord with additional primitive fields
in a case class, which we find is a very common use case.

The underlying limitation is that ExpressionEncoder does not support ser/de of
a Scala case class or tuple whose fields are user-defined types with their own
Encoders.

To address this, we propose a trait as a contract (between ExpressionEncoder
and any other user-defined Encoder) that enables case classes, tuples, and Java
beans to support user-defined types as fields.

With this proposed patch and a minor modification in AvroEncoder, we remove the
above-mentioned limitations with the cluster-default conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
com.databricks.spark.avro.AvroEncoder$

We have implemented this patch internally and it has been in use for a few
quarters. We propose it upstream because we think it is a useful feature that
makes Dataset more flexible for user-defined types.

  was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find Dataset is not flexible for other 
user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
 1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
 2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The limitation that Spark does not support define a Scala case class/tuple with 
subfields being any other user-defined type, is because ExpressionEncoder does 
not discover the implicit Encoder for the user-defined field types, thus can 
not use any Encoder to serde the user-defined fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we remove 
these limitations with cluster-default conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 


> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that Dataset is not flexible enough for
> other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in
> Dataset. Although we can use AvroEncoder to define a Dataset typed to an Avro
> Generic or Specific Record, such an Avro-typed Dataset has several
> limitations:
>  # We cannot use joinWith on this Dataset, since the result is a tuple and
> Avro types cannot be fields of this 

[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24256:
---
Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as
Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
supports these types, but we find that Dataset is not flexible enough for other
user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset.
Although we can use AvroEncoder to define a Dataset typed to an Avro Generic or
Specific Record, such an Avro-typed Dataset has several limitations:
 1. We cannot use joinWith on this Dataset, since the result is a tuple and
Avro types cannot be fields of that tuple.
 2. We cannot use some type-safe aggregation methods on this Dataset, such as
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields
in a case class, which we find is a very common use case.

Spark does not support defining a Scala case class/tuple with fields of any
other user-defined type because ExpressionEncoder does not discover the
implicit Encoder for the user-defined field types, and thus cannot use any
Encoder to ser/de the user-defined fields in a case class/tuple.

To address this issue, we propose a trait as a contract (between
ExpressionEncoder and any other user-defined Encoder) that enables a case
class/tuple/Java bean's ExpressionEncoder to discover the
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and a minor modification in AvroEncoder, we remove
these limitations with the cluster-default conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
com.databricks.spark.avro.AvroEncoder$

We have implemented this patch internally and it has been in use for a few
quarters. We propose it upstream because we think it is a useful feature that
makes Dataset more flexible for user-defined types.

 

  was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
 1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
 2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The limitation that Spark does not support define a Scala case class/tuple with 
subfields being any other user-defined type, is because ExpressionEncoder does 
not discover the implicit Encoder for the user-defined field types, thus can 
not use any Encoder to serde the user-defined fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we remove 
these limitations with cluster-default conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 


> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that Dataset is not flexible enough for
> other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in
> Dataset. Although we can use AvroEncoder to define a Dataset typed to the 

[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24256:
---
Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as
Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
supports these types, but we find that it is not flexible enough for other
user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset.
Although we can use AvroEncoder to define a Dataset typed to an Avro Generic or
Specific Record, such an Avro-typed Dataset has several limitations:
 1. We cannot use joinWith on this Dataset, since the result is a tuple and
Avro types cannot be fields of that tuple.
 2. We cannot use some type-safe aggregation methods on this Dataset, such as
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields
in a case class, which we find is a very common use case.

Spark does not support defining a Scala case class/tuple with fields of any
other user-defined type because ExpressionEncoder does not discover the
implicit Encoder for the user-defined field types, and thus cannot use any
Encoder to ser/de the user-defined fields in a case class/tuple.

To address this issue, we propose a trait as a contract (between
ExpressionEncoder and any other user-defined Encoder) that enables a case
class/tuple/Java bean's ExpressionEncoder to discover the
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and a minor modification in AvroEncoder, we remove
these limitations with the cluster-default conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
com.databricks.spark.avro.AvroEncoder$

We have implemented this patch internally and it has been in use for a few
quarters. We propose it upstream because we think it is a useful feature that
makes Dataset more flexible for user-defined types.

 

  was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The root cause for these limitations is that Spark does not support simply 
define a Scala case class/tuple with subfields being any other user-defined 
type, since ExpressionEncoder does not discover the implicit Encoder for the 
user-defined fields, thus can not use any Encoder to serde the user-defined 
fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we make it 
work with conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 


> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as
> Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
> supports these types, but we find that it is not flexible enough for other
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in
> Dataset. Although we can use AvroEncoder to define a Dataset typed to an Avro
> Generic 

[jira] [Created] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Fangshi Li (JIRA)
Fangshi Li created SPARK-24256:
--

 Summary: ExpressionEncoder should support user-defined types as 
fields of Scala case class and tuple
 Key: SPARK-24256
 URL: https://issues.apache.org/jira/browse/SPARK-24256
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Fangshi Li


Right now, ExpressionEncoder supports ser/de of primitive types, as well as
Scala case classes, tuples, and Java bean classes. Spark's Dataset natively
supports these types, but we find that it is not flexible enough for other
user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset.
Although we can use AvroEncoder to define a Dataset typed to an Avro Generic or
Specific Record, such an Avro-typed Dataset has several limitations:
1. We cannot use joinWith on this Dataset, since the result is a tuple and Avro
types cannot be fields of that tuple.
2. We cannot use some type-safe aggregation methods on this Dataset, such as
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields in
a case class, which we find is a very common use case.

The root cause of these limitations is that Spark does not support simply
defining a Scala case class/tuple with fields of any other user-defined type,
since ExpressionEncoder does not discover the implicit Encoder for the
user-defined fields and thus cannot use any Encoder to ser/de the user-defined
fields in a case class/tuple.

To address this issue, we propose a trait as a contract (between
ExpressionEncoder and any other user-defined Encoder) that enables a case
class/tuple/Java bean's ExpressionEncoder to discover the
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and a minor modification in AvroEncoder, we make it
work with the conf
spark.expressionencoder.org.apache.avro.specific.SpecificRecord =
com.databricks.spark.avro.AvroEncoder$

We have implemented this patch internally and it has been in use for a few
quarters. We propose it upstream because we think it is a useful feature that
makes Dataset more flexible for user-defined types.

 






[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-05-10 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24216:
---
Description: 
When a user creates an aggregator object in Scala and passes it to Spark
Dataset's agg() method, Spark initializes TypedAggregateExpression with the
nodeName field set to aggregator.getClass.getSimpleName. However, getSimpleName
is not safe in a Scala environment, depending on how the user creates the
aggregator object. For example, if the aggregator class's fully qualified name
is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws
java.lang.InternalError "Malformed class name". This has been reported in
scalatest
[scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] and
discussed in several Scala upstream JIRAs such as SI-8110 and SI-5425.

To fix this issue, we follow the solution in
[scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] to
add a safer version of getSimpleName as a util method, and
TypedAggregateExpression will invoke this util method rather than
getClass.getSimpleName.

  was:
When we create a aggregator object within a function in scala and pass the 
aggregator to Spark Dataset's aggregation method, Spark's will initialize 
TypedAggregateExpression with the name field as 
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in scala 
environment, for example, if the aggregator class full qualified name is 
"com.my.company.MyUtils$myAgg$2$", the getSimpleName will throw error 
"Malformed class name". This has been reported in scalatest 
[https://github.com/scalatest/scalatest/pull/1044] and scala upstream jira 
[https://issues.scala-lang.org/browse/SI-8110].

To fix this issue, we follow the solution in 
[https://github.com/scalatest/scalatest/pull/1044] to add safer version of 
getSimpleName as a util method, and TypedAggregateExpression will invoke this 
util method rather than getClass.getSimpleName.


> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
> ---
>
> Key: SPARK-24216
> URL: https://issues.apache.org/jira/browse/SPARK-24216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Fangshi Li
>Priority: Major
>
> When a user creates an aggregator object in Scala and passes it to Spark
> Dataset's agg() method, Spark initializes TypedAggregateExpression with the
> nodeName field set to aggregator.getClass.getSimpleName. However,
> getSimpleName is not safe in a Scala environment, depending on how the user
> creates the aggregator object. For example, if the aggregator class's fully
> qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws
> java.lang.InternalError "Malformed class name". This has been reported in
> scalatest
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044]
> and discussed in several Scala upstream JIRAs such as SI-8110 and SI-5425.
> To fix this issue, we follow the solution in
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044]
> to add a safer version of getSimpleName as a util method, and
> TypedAggregateExpression will invoke this util method rather than
> getClass.getSimpleName.
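A hedged sketch of the "safer getSimpleName" approach described above, following
the scalatest workaround; illustrative only, not the exact utility added to Spark:

{code:scala}
def safeSimpleName(cls: Class[_]): String =
  try {
    cls.getSimpleName
  } catch {
    // Scala-generated names such as "com.my.company.MyUtils$myAgg$2$" can make
    // Class.getSimpleName throw java.lang.InternalError("Malformed class name").
    case _: InternalError =>
      // Fall back to the last dot-separated segment of the full name and drop
      // the compiler-generated numeric and trailing-$ suffixes, e.g.
      // "com.my.company.MyUtils$myAgg$2$" -> "MyUtils$myAgg".
      cls.getName.split('.').last.replaceAll("""(\$\d+)*\$*$""", "")
  }
{code}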






[jira] [Updated] (SPARK-23862) Spark ExpressionEncoder should support java enum type in scala

2018-05-09 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-23862:
---
Issue Type: Bug  (was: Improvement)

> Spark ExpressionEncoder should support java enum type in scala
> --
>
> Key: SPARK-23862
> URL: https://issues.apache.org/jira/browse/SPARK-23862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> In SPARK-21255, Spark upstream added support for creating encoders for Java
> enum types, but the support was only added to the Java API (for enums used
> within Java beans). Since a Java enum can come from a third-party Java
> library, we have use cases that require:
> 1. using Java enum types as fields of a Scala case class
> 2. using a Java enum as the type T in Dataset[T]
> Spark's ExpressionEncoder already supports ser/de of many Java types in
> ScalaReflection, so we propose adding support for Java enums as well, as a
> follow-up to SPARK-21255.
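A short Scala sketch of the two use cases, using a JDK enum as a stand-in for a
third-party Java enum; in the affected version both Datasets fail because no
encoder can be derived for the enum:

{code:scala}
import java.time.DayOfWeek  // a Java enum, standing in for one from a third-party library
import org.apache.spark.sql.{Dataset, SparkSession}

case class Shift(name: String, day: DayOfWeek)  // use case 1: Java enum as a case class field

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Fails in the affected version: no encoder can be derived for the DayOfWeek field.
val shifts: Dataset[Shift] = Seq(Shift("early", DayOfWeek.MONDAY)).toDS()

// Use case 2: Java enum as the type T in Dataset[T]; fails because no implicit
// Encoder[DayOfWeek] is available.
val days: Dataset[DayOfWeek] = Seq(DayOfWeek.MONDAY, DayOfWeek.FRIDAY).toDS()
{code}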






[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-05-09 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24216:
---
Description: 
When we create an aggregator object within a function in Scala and pass the
aggregator to Spark Dataset's aggregation method, Spark initializes
TypedAggregateExpression with the name field set to
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a
Scala environment; for example, if the aggregator class's fully qualified name
is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws the error
"Malformed class name". This has been reported in scalatest
[https://github.com/scalatest/scalatest/pull/1044] and in the Scala upstream
JIRA [https://issues.scala-lang.org/browse/SI-8110].

To fix this issue, we follow the solution in
[https://github.com/scalatest/scalatest/pull/1044] to add a safer version of
getSimpleName as a util method, and TypedAggregateExpression will invoke this
util method rather than getClass.getSimpleName.

  was:
When we create a aggregator object within a function in scala and pass the 
aggregator to Spark Dataset's aggregation method, Spark's will initialize 
TypedAggregateExpression with the name field as 
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in scala 
environment, for example, if the aggregator class full qualified name is 
"com.my.company.MyUtils$myAgg$2$", the getSimpleName will throw exception 
"Malformed class name". This has been reported in scalatest 
[https://github.com/scalatest/scalatest/pull/1044] and scala upstream jira 
[https://issues.scala-lang.org/browse/SI-8110].

To fix this issue, we follow the solution in 
[https://github.com/scalatest/scalatest/pull/1044] to add safer version of 
getSimpleName as a util method, and TypedAggregateExpression will invoke this 
util method rather than getClass.getSimpleName.


> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
> ---
>
> Key: SPARK-24216
> URL: https://issues.apache.org/jira/browse/SPARK-24216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Fangshi Li
>Priority: Major
>
> When we create an aggregator object within a function in Scala and pass the
> aggregator to Spark Dataset's aggregation method, Spark initializes
> TypedAggregateExpression with the name field set to
> aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a
> Scala environment; for example, if the aggregator class's fully qualified
> name is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws the error
> "Malformed class name". This has been reported in scalatest
> [https://github.com/scalatest/scalatest/pull/1044] and in the Scala upstream
> JIRA [https://issues.scala-lang.org/browse/SI-8110].
> To fix this issue, we follow the solution in
> [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of
> getSimpleName as a util method, and TypedAggregateExpression will invoke this
> util method rather than getClass.getSimpleName.






[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-05-08 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24216:
---
Description: 
When we create an aggregator object within a function in Scala and pass the
aggregator to Spark Dataset's aggregation method, Spark initializes
TypedAggregateExpression with the name field set to
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a
Scala environment; for example, if the aggregator class's fully qualified name
is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws the exception
"Malformed class name". This has been reported in scalatest
[https://github.com/scalatest/scalatest/pull/1044] and in the Scala upstream
JIRA [https://issues.scala-lang.org/browse/SI-8110].

To fix this issue, we follow the solution in
[https://github.com/scalatest/scalatest/pull/1044] to add a safer version of
getSimpleName as a util method, and TypedAggregateExpression will invoke this
util method rather than getClass.getSimpleName.

  was:
When we create a aggregator object within a function in scala and pass the 
aggregator to Spark Dataset's aggregation method, Spark's will initialize 
TypedAggregateExpression with the name field as 
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in scala 
environment, for example, if the aggregator class full qualified name is 
"com.linkedin.spark.common.lib.SparkUtils$keyAgg$2$", the getSimpleName will 
throw exception "Malformed class name". This has been reported in scalatest 
https://github.com/scalatest/scalatest/pull/1044 and scala upstream jira 
https://issues.scala-lang.org/browse/SI-8110.


To fix this issue, we follow the solution in 
https://github.com/scalatest/scalatest/pull/1044 to add safer version of 
getSimpleName as a util method, and TypedAggregateExpression will invoke this 
util method rather than getClass.getSimpleName.


> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
> ---
>
> Key: SPARK-24216
> URL: https://issues.apache.org/jira/browse/SPARK-24216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Fangshi Li
>Priority: Major
>
> When we create an aggregator object within a function in Scala and pass the
> aggregator to Spark Dataset's aggregation method, Spark initializes
> TypedAggregateExpression with the name field set to
> aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a
> Scala environment; for example, if the aggregator class's fully qualified
> name is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws the exception
> "Malformed class name". This has been reported in scalatest
> [https://github.com/scalatest/scalatest/pull/1044] and in the Scala upstream
> JIRA [https://issues.scala-lang.org/browse/SI-8110].
> To fix this issue, we follow the solution in
> [https://github.com/scalatest/scalatest/pull/1044] to add a safer version of
> getSimpleName as a util method, and TypedAggregateExpression will invoke this
> util method rather than getClass.getSimpleName.






[jira] [Created] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-05-08 Thread Fangshi Li (JIRA)
Fangshi Li created SPARK-24216:
--

 Summary: Spark TypedAggregateExpression uses getSimpleName that is 
not safe in scala
 Key: SPARK-24216
 URL: https://issues.apache.org/jira/browse/SPARK-24216
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.3.1
Reporter: Fangshi Li


When we create an aggregator object within a function in Scala and pass the
aggregator to Spark Dataset's aggregation method, Spark initializes
TypedAggregateExpression with the name field set to
aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a
Scala environment; for example, if the aggregator class's fully qualified name
is "com.linkedin.spark.common.lib.SparkUtils$keyAgg$2$", getSimpleName throws
the exception "Malformed class name". This has been reported in scalatest
https://github.com/scalatest/scalatest/pull/1044 and in the Scala upstream JIRA
https://issues.scala-lang.org/browse/SI-8110.

To fix this issue, we follow the solution in
https://github.com/scalatest/scalatest/pull/1044 to add a safer version of
getSimpleName as a util method, and TypedAggregateExpression will invoke this
util method rather than getClass.getSimpleName.






[jira] [Updated] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-04-06 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-23815:
---
Priority: Major  (was: Minor)

> Spark writer dynamic partition overwrite mode fails to write output on multi 
> level partition
> 
>
> Key: SPARK-23815
> URL: https://issues.apache.org/jira/browse/SPARK-23815
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Spark introduced a new writer mode in SPARK-20236 to overwrite only the
> affected partitions. While using this feature in our production cluster, we
> found a bug when writing multi-level partitions on HDFS.
> A simple test case to reproduce this issue:
>  val df = Seq(("1","2","3")).toDF("col1", "col2","col3")
>  df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location")
> If the HDFS location "/my/hdfs/location" does not exist, there will be no
> output.
> This seems to be caused by the job-commit change of SPARK-20236 in
> HadoopMapReduceCommitProtocol.
> In the commit-job process, the output has been written into the staging dir
> /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls
> fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to
> /my/hdfs/location/col1=1/col2=2. However, in our case the operation fails on
> HDFS because /my/hdfs/location/col1=1 does not exist, and HDFS rename cannot
> create more than one level of missing directories.
> This does not happen in the unit tests added with SPARK-20236, which use the
> local file system.
> We are proposing a fix: when cleaning the current partition dir
> /my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails
> (because /my/hdfs/location/col1=1/col2=2 may not exist), we call mkdirs to
> create the parent dir /my/hdfs/location/col1=1 (if it does not exist) so the
> following rename op can succeed.
> Reference: in the official HDFS documentation
> ([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html]),
> the rename operation has the precondition:
> {code}
> {{dest}} must be root, or have a parent that exists
> {code}
>  
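A hedged sketch of the proposed commit-time fix using the Hadoop FileSystem API;
the method name is illustrative and not taken from the actual patch:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Ensure the destination's parent exists before renaming a staged partition
// directory, since HDFS rename cannot create missing parent directories.
def commitPartition(fs: FileSystem, staged: Path, dest: Path): Unit = {
  // Clean any previous output for this partition; returns false when nothing was
  // deleted, e.g. /my/hdfs/location/col1=1/col2=2 (and possibly col1=1) is missing.
  if (!fs.delete(dest, /* recursive = */ true)) {
    fs.mkdirs(dest.getParent)  // create /my/hdfs/location/col1=1 so the rename can succeed
  }
  if (!fs.rename(staged, dest)) {
    throw new java.io.IOException(s"Failed to rename $staged to $dest")
  }
}
{code}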






[jira] [Updated] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-04-04 Thread Fangshi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-23815:
---
Description: 
Spark introduced a new writer mode in SPARK-20236 to overwrite only the
affected partitions. While using this feature in our production cluster, we
found a bug when writing multi-level partitions on HDFS.

A simple test case to reproduce this issue:
 val df = Seq(("1","2","3")).toDF("col1", "col2","col3")
 df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location")

If the HDFS location "/my/hdfs/location" does not exist, there will be no
output.

This seems to be caused by the job-commit change of SPARK-20236 in
HadoopMapReduceCommitProtocol.

In the commit-job process, the output has been written into the staging dir
/my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls
fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to
/my/hdfs/location/col1=1/col2=2. However, in our case the operation fails on
HDFS because /my/hdfs/location/col1=1 does not exist, and HDFS rename cannot
create more than one level of missing directories.

This does not happen in the unit tests added with SPARK-20236, which use the
local file system.

We are proposing a fix: when cleaning the current partition dir
/my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails
(because /my/hdfs/location/col1=1/col2=2 may not exist), we call mkdirs to
create the parent dir /my/hdfs/location/col1=1 (if it does not exist) so the
following rename op can succeed.

Reference: in the official HDFS documentation
([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html]),
the rename operation has the precondition:

{code}
{{dest}} must be root, or have a parent that exists
{code}
 

  was:
Spark introduced new writer mode to overwrite only related partitions in 
SPARK-20236. While we are using this feature in our production cluster, we 
found a bug when writing multi-level partitions on HDFS.


A simple test case to reproduce this issue:
val df = Seq(("1","2","3")).toDF("col1", "col2","col3")
df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location")


If HDFS location "/my/hdfs/location" does not exist, there will be no output.


This seems to be caused by the job commit change in SPARK-20236 in 
HadoopMapReduceCommitProtocol.


In the commit job process, the output has been written into staging dir 
/my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and then the code calls 
fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to 
/my/hdfs/location/col1=1/col2=2. However, in our case the operation will fail 
on HDFS because /my/hdfs/location/col1=1 does not exists. HDFS rename can not 
create directory for more than one level. 

This does not happen in unit test covered with SPARK-20236 with local file 
system.


We are proposing a fix. When cleaning current partition dir 
/my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails 
(because /my/hdfs/location/col1=1/col2=2 may not exist), we call mkdirs op to 
create the parent dir /my/hdfs/location/col1=1 (if the parent dir does not 
exist) so the following rename op can succeed.


> Spark writer dynamic partition overwrite mode fails to write output on multi 
> level partition
> 
>
> Key: SPARK-23815
> URL: https://issues.apache.org/jira/browse/SPARK-23815
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Minor
>
> Spark introduced a new writer mode in SPARK-20236 to overwrite only the
> affected partitions. While using this feature in our production cluster, we
> found a bug when writing multi-level partitions on HDFS.
> A simple test case to reproduce this issue:
>  val df = Seq(("1","2","3")).toDF("col1", "col2","col3")
>  df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location")
> If the HDFS location "/my/hdfs/location" does not exist, there will be no
> output.
> This seems to be caused by the job-commit change of SPARK-20236 in
> HadoopMapReduceCommitProtocol.
> In the commit-job process, the output has been written into the staging dir
> /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls
> fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to
> /my/hdfs/location/col1=1/col2=2. However, in our case the operation fails on
> HDFS because /my/hdfs/location/col1=1 does not exist, and HDFS rename cannot
> create more than one level of missing directories.
> This does not happen in the unit tests added with SPARK-20236, which use the
> local file system.
> We are proposing a fix: when cleaning the current partition dir
> 

[jira] [Commented] (SPARK-21255) NPE when creating encoder for enum

2018-04-04 Thread Fangshi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425832#comment-16425832
 ] 

Fangshi Li commented on SPARK-21255:


Created a follow-up issue to add Java enum support in the Scala API:
https://issues.apache.org/jira/browse/SPARK-23862

> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
>Assignee: Mike
>Priority: Major
> Fix For: 2.3.0
>
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out that in JavaTypeInference:126 the
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.
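The suggested adjustment as a small self-contained sketch (illustrative; the actual
fix in Spark may differ): skip the enum-specific read-only property "declaringClass"
in addition to "class" when inferring bean properties:

{code:scala}
import java.beans.{Introspector, PropertyDescriptor}

def inferableProperties(cls: Class[_]): Array[PropertyDescriptor] = {
  val beanInfo = Introspector.getBeanInfo(cls)
  // Skip "class" (as today) and also "declaringClass", which every Java enum
  // exposes and which otherwise triggers the recursive inspection described above.
  beanInfo.getPropertyDescriptors.filterNot { p =>
    p.getName == "class" || p.getName == "declaringClass"
  }
}
{code}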






[jira] [Comment Edited] (SPARK-21255) NPE when creating encoder for enum

2018-04-04 Thread Fangshi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425832#comment-16425832
 ] 

Fangshi Li edited comment on SPARK-21255 at 4/4/18 4:44 PM:


Created a follow-up issue to add Java enum support in the Scala API: SPARK-23862


was (Author: shengzhixia):
created a follow up issue to add the java enum support in scala api: 
https://issues.apache.org/jira/browse/SPARK-23862

> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
>Assignee: Mike
>Priority: Major
> Fix For: 2.3.0
>
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out that in JavaTypeInference:126 the
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.






[jira] [Created] (SPARK-23862) Spark ExpressionEncoder should support java enum type in scala

2018-04-03 Thread Fangshi Li (JIRA)
Fangshi Li created SPARK-23862:
--

 Summary: Spark ExpressionEncoder should support java enum type in 
scala
 Key: SPARK-23862
 URL: https://issues.apache.org/jira/browse/SPARK-23862
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Fangshi Li


In SPARK-21255, Spark upstream added support for creating encoders for Java
enum types, but the support was only added to the Java API (for enums used
within Java beans). Since a Java enum can come from a third-party Java library,
we have use cases that require:
1. using Java enum types as fields of a Scala case class
2. using a Java enum as the type T in Dataset[T]

Spark's ExpressionEncoder already supports ser/de of many Java types in
ScalaReflection, so we propose adding support for Java enums as well, as a
follow-up to SPARK-21255.






[jira] [Created] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-03-28 Thread Fangshi Li (JIRA)
Fangshi Li created SPARK-23815:
--

 Summary: Spark writer dynamic partition overwrite mode fails to 
write output on multi level partition
 Key: SPARK-23815
 URL: https://issues.apache.org/jira/browse/SPARK-23815
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Fangshi Li


Spark introduced a new writer mode in SPARK-20236 to overwrite only the
affected partitions. While using this feature in our production cluster, we
found a bug when writing multi-level partitions on HDFS.

A simple test case to reproduce this issue:
val df = Seq(("1","2","3")).toDF("col1", "col2","col3")
df.write.partitionBy("col1","col2").mode("overwrite").save("/my/hdfs/location")

If the HDFS location "/my/hdfs/location" does not exist, there will be no
output.

This seems to be caused by the job-commit change of SPARK-20236 in
HadoopMapReduceCommitProtocol.

In the commit-job process, the output has been written into the staging dir
/my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls
fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to
/my/hdfs/location/col1=1/col2=2. However, in our case the operation fails on
HDFS because /my/hdfs/location/col1=1 does not exist, and HDFS rename cannot
create more than one level of missing directories.

This does not happen in the unit tests added with SPARK-20236, which use the
local file system.

We are proposing a fix: when cleaning the current partition dir
/my/hdfs/location/col1=1/col2=2 before the rename op, if the delete op fails
(because /my/hdfs/location/col1=1/col2=2 may not exist), we call mkdirs to
create the parent dir /my/hdfs/location/col1=1 (if it does not exist) so the
following rename op can succeed.


