[jira] [Commented] (SPARK-30983) Support more than 5 typed column in typed Dataset.select API

2020-02-27 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047278#comment-17047278
 ] 

L. C. Hsieh commented on SPARK-30983:
-

cc [~cloud_fan]

> Support more than 5 typed column in typed Dataset.select API
> 
>
> Key: SPARK-30983
> URL: https://issues.apache.org/jira/browse/SPARK-30983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Because Dataset only provides overloaded typed select APIs for up to 5 typed 
> columns, a select call with more than 5 typed columns falls back to the 
> untyped API.
> Currently users cannot call typed select with more than 5 typed columns. 
> There are a few options:
> 1. Expose Dataset.selectUntyped (possibly renamed) to accept any number of 
> typed columns (at most 22, due to the limit of ExpressionEncoder.tuple). 
> Pros: little code to add to Dataset. Cons: the returned type is the generic 
> Dataset[_], not a specific one like Dataset[(U1, U2)] as with the existing 
> overloads.
> 2. Add more overloaded typed select APIs, up to 22 typed column inputs. Pros: 
> a precise return type. Cons: a lot of code added to Dataset for a corner 
> case.






[jira] [Created] (SPARK-30983) Support more than 5 typed column in typed Dataset.select API

2020-02-27 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-30983:
---

 Summary: Support more than 5 typed column in typed Dataset.select 
API
 Key: SPARK-30983
 URL: https://issues.apache.org/jira/browse/SPARK-30983
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: L. C. Hsieh


Because Dataset only provides overloaded typed select APIs for up to 5 typed 
columns, a select call with more than 5 typed columns falls back to the untyped 
API.

Currently users cannot call typed select with more than 5 typed columns. There 
are a few options:

1. Expose Dataset.selectUntyped (possibly renamed) to accept any number of typed 
columns (at most 22, due to the limit of ExpressionEncoder.tuple). Pros: little 
code to add to Dataset. Cons: the returned type is the generic Dataset[_], not a 
specific one like Dataset[(U1, U2)] as with the existing overloads.

2. Add more overloaded typed select APIs, up to 22 typed column inputs. Pros: a 
precise return type. Cons: a lot of code added to Dataset for a corner case.
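
For illustration, a small self-contained Scala sketch of the current fallback 
behavior; the Dataset contents, column names and session setup are illustrative 
only and not part of the issue:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("typed-select").getOrCreate()
import spark.implicits._

val ds = Seq((1, 2, 3, 4, 5, 6)).toDS()

// Five typed columns: a typed overload exists, so the result is a
// Dataset[(Int, Int, Int, Int, Int)].
val typed5 = ds.select(
  col("_1").as[Int], col("_2").as[Int], col("_3").as[Int],
  col("_4").as[Int], col("_5").as[Int])

// Six typed columns: no typed overload exists, so the call resolves to the
// untyped select(Column*) and returns a DataFrame (Dataset[Row]) instead.
val untyped6 = ds.select(
  col("_1").as[Int], col("_2").as[Int], col("_3").as[Int],
  col("_4").as[Int], col("_5").as[Int], col("_6").as[Int])
{code}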









[jira] [Updated] (SPARK-30982) List All the removed APIs of Spark SQL and Core

2020-02-27 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30982:
-
Attachment: sql_signature.diff
added_sql_class
1_process_sql_script.sh

> List All the removed APIs of Spark SQL and Core
> ---
>
> Key: SPARK-30982
> URL: https://issues.apache.org/jira/browse/SPARK-30982
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: zhengruifeng
>Priority: Major
> Attachments: 1_process_core_script.sh, 1_process_sql_script.sh, 
> added_core_class, added_sql_class, core_signature.diff, sql_signature.diff
>
>







[jira] [Updated] (SPARK-30982) List All the removed APIs of Spark SQL and Core

2020-02-27 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30982:
-
Attachment: core_signature.diff
added_core_class
1_process_core_script.sh

> List All the removed APIs of Spark SQL and Core
> ---
>
> Key: SPARK-30982
> URL: https://issues.apache.org/jira/browse/SPARK-30982
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: zhengruifeng
>Priority: Major
> Attachments: 1_process_core_script.sh, added_core_class, 
> core_signature.diff
>
>







[jira] [Assigned] (SPARK-30982) List All the removed APIs of Spark SQL and Core

2020-02-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-30982:
---

Assignee: zhengruifeng  (was: Xiao Li)

> List All the removed APIs of Spark SQL and Core
> ---
>
> Key: SPARK-30982
> URL: https://issues.apache.org/jira/browse/SPARK-30982
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: zhengruifeng
>Priority: Major
>







[jira] [Created] (SPARK-30982) List All the removed APIs of Spark SQL and Core

2020-02-27 Thread Xiao Li (Jira)
Xiao Li created SPARK-30982:
---

 Summary: List All the removed APIs of Spark SQL and Core
 Key: SPARK-30982
 URL: https://issues.apache.org/jira/browse/SPARK-30982
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li









[jira] [Assigned] (SPARK-30981) Fix flaky "Test basic decommissioning" test

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30981:
-

Assignee: (was: Dongjoon Hyun)

> Fix flaky "Test basic decommissioning" test
> ---
>
> Key: SPARK-30981
> URL: https://issues.apache.org/jira/browse/SPARK-30981
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 126 times 
> over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}






[jira] [Commented] (SPARK-30981) Fix flaky "Test basic decommissioning" test

2020-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047270#comment-17047270
 ] 

Dongjoon Hyun commented on SPARK-30981:
---

Could you take a look at this, [~holden]?

> Fix flaky "Test basic decommissioning" test
> ---
>
> Key: SPARK-30981
> URL: https://issues.apache.org/jira/browse/SPARK-30981
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 126 times 
> over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}






[jira] [Created] (SPARK-30981) Fix flaky "Test basic decommissioning" test

2020-02-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30981:
-

 Summary: Fix flaky "Test basic decommissioning" test
 Key: SPARK-30981
 URL: https://issues.apache.org/jira/browse/SPARK-30981
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun
Assignee: Dongjoon Hyun


- https://github.com/apache/spark/pull/27721
{code}
- Test basic decommissioning *** FAILED ***
  The code passed to eventually never returned normally. Attempted 126 times 
over 2.010095245067 minutes. Last failure message: "++ id -u
{code}






[jira] [Updated] (SPARK-25474) Update the docs for spark.sql.statistics.fallBackToHdfs

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25474:
--
Summary: Update the docs for spark.sql.statistics.fallBackToHdfs  (was: 
Support `spark.sql.statistics.fallBackToHdfs` in data source tables)

> Update the docs for spark.sql.statistics.fallBackToHdfs
> ---
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
> Environment: Spark 2.3.1
> Hadoop 2.7.2
>Reporter: Ayush Anubhava
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> *Description:* The size in bytes of the query comes out in EB for a Parquet 
> data source. This impacts performance, since join queries then always go to 
> Sort Merge Join.
> *Precondition:* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +----+----+--+
> | a  | b  |
> +----+----+--+
> | 1  | a  |
> | 2  | b  |
> +----+----+--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}
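
For reference, a minimal sketch of enabling the fallback discussed above so that 
sizeInBytes comes from the files on storage instead of the 8.0 EB default; the 
application name is illustrative, and the table name is reused from the example 
above:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled build and an existing partitioned Parquet table t1110.
val spark = SparkSession.builder()
  .appName("fallback-to-hdfs")
  .enableHiveSupport()
  // Estimate table size from the files on storage when catalog statistics are missing.
  .config("spark.sql.statistics.fallBackToHdfs", "true")
  .getOrCreate()

// With the fallback enabled, sizeInBytes reflects the actual file size, so small
// tables can qualify for a broadcast join instead of always using SortMergeJoin.
spark.sql("explain cost select * from t1110").show(truncate = false)
{code}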






[jira] [Resolved] (SPARK-30902) default table provider should be decided by catalog implementations

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30902.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27650
[https://github.com/apache/spark/pull/27650]

> default table provider should be decided by catalog implementations
> ---
>
> Key: SPARK-30902
> URL: https://issues.apache.org/jira/browse/SPARK-30902
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Closed] (SPARK-26599) BroardCast hint can not work with PruneFileSourcePartitions

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26599.
-

> BroardCast hint can not work with PruneFileSourcePartitions
> ---
>
> Key: SPARK-26599
> URL: https://issues.apache.org/jira/browse/SPARK-26599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Major
>
> The broadcast hint cannot work with `PruneFileSourcePartitions`: for example, 
> when the filter condition p is a partition field, table b in the SQL below 
> cannot be broadcast.
> `sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
>   "join (select a,b from par_1 where p=1) b on a.a=b.a").explain`






[jira] [Updated] (SPARK-26599) BroardCast hint can not work with PruneFileSourcePartitions

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26599:
--
Issue Type: Bug  (was: Improvement)

> BroardCast hint can not work with PruneFileSourcePartitions
> ---
>
> Key: SPARK-26599
> URL: https://issues.apache.org/jira/browse/SPARK-26599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Major
>
> The broadcast hint cannot work with `PruneFileSourcePartitions`: for example, 
> when the filter condition p is a partition field, table b in the SQL below 
> cannot be broadcast.
> `sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
>   "join (select a,b from par_1 where p=1) b on a.a=b.a").explain`






[jira] [Updated] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive

2020-02-27 Thread Pradyumn Agrawal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradyumn Agrawal updated SPARK-30980:
-
Shepherd: Apache Spark

> Issue not resolved of Caught Hive MetaException attempting to get partition 
> metadata by filter from Hive
> 
>
> Key: SPARK-30980
> URL: https://issues.apache.org/jira/browse/SPARK-30980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.2
> Environment: 2.4.0-CDH6.3.1 (which I guess points to Spark Version 
> 2.4.2)
>Reporter: Pradyumn Agrawal
>Priority: Major
>
> I am querying a table created in Hive and repeatedly get the following 
> exception when querying the data.
>  
> {code:java}
> // code placeholder
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARKjava.lang.RuntimeException: 
> Caught Hive MetaException attempting to get partition metadata by filter from 
> Hive. You can set the Spark configuration setting 
> spark.sql.hive.manageFilesourcePartitions to false to work around this 
> problem, however this will result in degraded performance. Please report a 
> bug: https://issues.apache.org/jira/browse/SPARK at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.sc
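
The exception text above already names a workaround; a minimal sketch of applying 
it when building the session (the application name is illustrative, and the 
performance trade-off noted in the message applies):

{code:scala}
import org.apache.spark.sql.SparkSession

// Workaround suggested by the error message: stop Spark from managing
// file-source partition metadata. This avoids the metastore filter call that
// fails here, at the cost of degraded performance (no metastore-side pruning).
val spark = SparkSession.builder()
  .appName("metastore-filter-workaround")
  .enableHiveSupport()
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .getOrCreate()
{code}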

[jira] [Updated] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive

2020-02-27 Thread Pradyumn Agrawal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradyumn Agrawal updated SPARK-30980:
-
Description: 
I am querying a table created in Hive and repeatedly get the following exception 
when querying the data.

 
{code:java}
// code placeholder
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARKjava.lang.RuntimeException: Caught 
Hive MetaException attempting to get partition metadata by filter from Hive. 
You can set the Spark configuration setting 
spark.sql.hive.manageFilesourcePartitions to false to work around this problem, 
however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
 at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)

[jira] [Created] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive

2020-02-27 Thread Pradyumn Agrawal (Jira)
Pradyumn Agrawal created SPARK-30980:


 Summary: Issue not resolved of Caught Hive MetaException 
attempting to get partition metadata by filter from Hive
 Key: SPARK-30980
 URL: https://issues.apache.org/jira/browse/SPARK-30980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.2, 2.4.0
 Environment: 2.4.0-CDH6.3.1 (which I guess points to Spark Version 
2.4.2)
Reporter: Pradyumn Agrawal


I am querying a table created in Hive and repeatedly get the following exception 
when querying the data.

 
{code:java}
// code placeholder
{code}
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARKjava.lang.RuntimeException: Caught 
Hive MetaException attempting to get partition metadata by filter from Hive. 
You can set the Spark configuration setting 
spark.sql.hive.manageFilesourcePartitions to false to work around this problem, 
however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
 at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPla

[jira] [Assigned] (SPARK-30972) PruneHiveTablePartitions should be executed as earlyScanPushDownRules

2020-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30972:
---

Assignee: wuyi

> PruneHiveTablePartitions should be executed as earlyScanPushDownRules
> -
>
> Key: SPARK-30972
> URL: https://issues.apache.org/jira/browse/SPARK-30972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Similar to PruneFileSourcePartitions, PruneHiveTablePartitions should also be 
> executed among the earlyScanPushDownRules so that it does not affect 
> statistics computation later.






[jira] [Resolved] (SPARK-30972) PruneHiveTablePartitions should be executed as earlyScanPushDownRules

2020-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30972.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27723
[https://github.com/apache/spark/pull/27723]

> PruneHiveTablePartitions should be executed as earlyScanPushDownRules
> -
>
> Key: SPARK-30972
> URL: https://issues.apache.org/jira/browse/SPARK-30972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Similar to PruneFileSourcePartitions, PruneHiveTablePartitions should also be 
> executed among the earlyScanPushDownRules so that it does not affect 
> statistics computation later.






[jira] [Resolved] (SPARK-30681) Add higher order functions API to PySpark

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30681.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27406
[https://github.com/apache/spark/pull/27406]

> Add higher order functions API to PySpark
> -
>
> Key: SPARK-30681
> URL: https://issues.apache.org/jira/browse/SPARK-30681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> As of 3.0.0 higher order functions are available in SQL and Scala, but not in 
> PySpark, forcing Python users to invoke these through {{expr}}, 
> {{selectExpr}} or {{sql}}.
> This is error prone and not well documented. Spark should provide 
> {{pyspark.sql}} wrappers that accept plain Python functions (of course within 
> limits of {{(*Column) -> Column}}) as arguments.
> {code:python}
> df.select(transform("values", lambda c: trim(upper(c))))
> def increment_values(k: Column, v: Column) -> Column:
>     return v + 1
> df.select(transform_values("data", increment_values))
> {code}
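
For comparison, a minimal sketch of the Scala-side calls that already exist in 
3.0.0 and that the requested PySpark wrappers would mirror; the DataFrame 
contents and column names are illustrative only:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform, transform_values, trim, upper}

val spark = SparkSession.builder().master("local[*]").appName("hof").getOrCreate()
import spark.implicits._

val df = Seq((Seq(" a ", " b "), Map("k1" -> 1, "k2" -> 2))).toDF("values", "data")

// Array transform: trim and upper-case every element.
df.select(transform(col("values"), c => trim(upper(c)))).show()

// Map transform: the function receives (key, value) columns and returns the new value.
df.select(transform_values(col("data"), (k, v) => v + 1)).show()
{code}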






[jira] [Assigned] (SPARK-30682) Add higher order functions API to SparkR

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30682:


Assignee: Maciej Szymkiewicz

> Add higher order functions API to SparkR
> 
>
> Key: SPARK-30682
> URL: https://issues.apache.org/jira/browse/SPARK-30682
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> As of 3.0.0 higher order functions are available in SQL and Scala, but not in 
> SparkR, forcing R users to invoke these through {{expr}}, {{selectExpr}} or 
> {{sql}}.
> It would be great if Spark provided high-level wrappers that accept plain R 
> functions operating on SQL expressions. 






[jira] [Resolved] (SPARK-30682) Add higher order functions API to SparkR

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30682.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27433
[https://github.com/apache/spark/pull/27433]

> Add higher order functions API to SparkR
> 
>
> Key: SPARK-30682
> URL: https://issues.apache.org/jira/browse/SPARK-30682
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> As of 3.0.0 higher order functions are available in SQL and Scala, but not in 
> SparkR, forcing R users to invoke these through {{expr}}, {{selectExpr}} or 
> {{sql}}.
> It would be great if Spark provided high-level wrappers that accept plain R 
> functions operating on SQL expressions. 






[jira] [Assigned] (SPARK-30681) Add higher order functions API to PySpark

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30681:


Assignee: Maciej Szymkiewicz

> Add higher order functions API to PySpark
> -
>
> Key: SPARK-30681
> URL: https://issues.apache.org/jira/browse/SPARK-30681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> As of 3.0.0 higher order functions are available in SQL and Scala, but not in 
> PySpark, forcing Python users to invoke these through {{expr}}, 
> {{selectExpr}} or {{sql}}.
> This is error prone and not well documented. Spark should provide 
> {{pyspark.sql}} wrappers that accept plain Python functions (of course within 
> limits of {{(*Column) -> Column}}) as arguments.
> {code:python}
> df.select(transform("values", lambda c: trim(upper(c))))
> def increment_values(k: Column, v: Column) -> Column:
>     return v + 1
> df.select(transform_values("data", increment_values))
> {code}






[jira] [Resolved] (SPARK-30955) Exclude Generate output when aliasing in nested column pruning

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30955.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27702
[https://github.com/apache/spark/pull/27702]

> Exclude Generate output when aliasing in nested column pruning
> --
>
> Key: SPARK-30955
> URL: https://issues.apache.org/jira/browse/SPARK-30955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> When aliasing in nested column pruning on Project on top of Generate, we 
> should exclude Generate outputs.






[jira] [Resolved] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2020-02-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-14643.
-
Resolution: Won't Fix

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Task
>  Components: Build, Project Infra
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.
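
As a rough illustration of the overload pattern the writeup describes (the types 
and names below are stand-ins, not Spark's actual API): one overload takes a 
Scala function and another takes a Java-style functional interface, and once 
scala.Function1 itself compiles to a functional interface in 2.12, a Java 8 
lambda argument can match both:

{code:scala}
// Stand-in for a Java functional interface such as Spark's Java API MapFunction.
trait JMapFunction[T, U] {
  def call(t: T): U
}

// Stand-in for a Dataset-like class carrying the problematic overload pair.
class FakeDataset[T](private val data: Seq[T]) {
  // Scala-friendly overload.
  def map[U](f: T => U): FakeDataset[U] = new FakeDataset(data.map(f))

  // Java-friendly overload. With Scala 2.12, a Java 8 lambda passed to `map`
  // can satisfy either parameter type, producing the ambiguity discussed above.
  def map[U](f: JMapFunction[T, U]): FakeDataset[U] = new FakeDataset(data.map(f.call))
}
{code}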






[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2020-02-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047170#comment-17047170
 ] 

Hyukjin Kwon commented on SPARK-26836:
--

I am lowering the priority to Critical as it is at least not a regression and 
does not look like it blocks Spark 3.0; however, we should indeed treat 
correctness issues as at least Critical+.

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamás Németh
>Priority: Critical
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definition the avro.schema.url always points to the latest 
> partition's _schema.avsc file, which is always the latest schema. (Avro 
> schemas are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition whose avro.schema.url 
> property is also set to the _schema.avsc that was used when it was added, and 
> of course I always update the table's avro.schema.url property to the latest 
> one.
> Querying this table works fine until the schema evolves in a way that adds a 
> new optional property in the middle. 
> When this happens, the Spark SQL query mixes up the columns in the old 
> partition and shows the wrong data for them.
> If I query the table with Hive, everything is perfectly fine and it gives me 
> back the correct columns, both for the partitions created with the old schema 
> and for those created with the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data created with the schema before evolving and 
> the second one had the evolved data. (The evolved schema is the same as in 
> your test case, except I moved the extra_field column from second to last, 
> and I generated two lines of Avro data with the evolved schema.)
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here, as you can see, the table schema URL points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> FYI: if I run my select * query at this point, everything is fine and no 
> column switch happens.
> 4. Then I changed the first partition's avro.schema.url to point to the 
> schema which is under the partition folder (the non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then, if you run a _select * from default.spark_test_, the columns will be 
> mixed up (in the data below, the first_name column becomes the extra_field 
> column, I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 9,Christo

[jira] [Updated] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2020-02-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26836:
-
Priority: Critical  (was: Blocker)

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamás Németh
>Priority: Critical
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definition the avro.schema.url always points to the latest 
> partition's _schema.avsc file, which is always the latest schema. (Avro 
> schemas are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition whose avro.schema.url 
> property is also set to the _schema.avsc that was used when it was added, and 
> of course I always update the table's avro.schema.url property to the latest 
> one.
> Querying this table works fine until the schema evolves in a way that adds a 
> new optional property in the middle. 
> When this happens, the Spark SQL query mixes up the columns in the old 
> partition and shows the wrong data for them.
> If I query the table with Hive, everything is perfectly fine and it gives me 
> back the correct columns, both for the partitions created with the old schema 
> and for those created with the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data created with the schema before evolving and 
> the second one had the evolved data. (The evolved schema is the same as in 
> your test case, except I moved the extra_field column from second to last, 
> and I generated two lines of Avro data with the evolved schema.)
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here, as you can see, the table schema URL points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> FYI: if I run my select * query at this point, everything is fine and no 
> column switch happens.
> 4. Then I changed the first partition's avro.schema.url to point to the 
> schema which is under the partition folder (the non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then, if you run a _select * from default.spark_test_, the columns will be 
> mixed up (in the data below, the first_name column becomes the extra_field 
> column, I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 9,Christopher,Eccleston,null,2019-02-05 
> 10,David,Tennant,null,2019-02-05 
> 21,fishfinger,Jim,Baker,2019-02-06 
> 24,fishfinger,Bean,Pertwee,2019-02-06
> {code}
> If I try the same query from Hive and not fro

[jira] [Updated] (SPARK-30979) spark-submit - no need to resolve dependencies in kubernetes cluster mode

2020-02-27 Thread Dyno (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dyno updated SPARK-30979:
-
Summary: spark-submit - no need to resolve dependencies in kubernetes 
cluster mode  (was: no need to resolve dependencies in kubernetes cluster mode)

> spark-submit - no need to resolve dependencies in kubernetes cluster mode
> -
>
> Key: SPARK-30979
> URL: https://issues.apache.org/jira/browse/SPARK-30979
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.5
> Environment: spark-operator with spark-2.4.4
>Reporter: Dyno
>Priority: Minor
>
> [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L301]
>  
> When using spark-operator, we observed that the operator tries to download 
> all the dependencies, but this is not really necessary because the driver 
> will do it again.
> Should the check be:
> ```
> if (!isMesosCluster && !isStandAloneCluster && !isKubernetesCluster) {
> }
> ```






[jira] [Created] (SPARK-30979) no need to resolve dependencies in kubernetes cluster mode

2020-02-27 Thread Dyno (Jira)
Dyno created SPARK-30979:


 Summary: no need to resolve dependencies in kubernetes cluster mode
 Key: SPARK-30979
 URL: https://issues.apache.org/jira/browse/SPARK-30979
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.5
 Environment: spark-operator with spark-2.4.4
Reporter: Dyno


[https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L301]

 

When using spark-operator, we observed that the operator tries to download all 
the dependencies, but this is not really necessary because the driver will do it 
again.

Should the check be:

```
if (!isMesosCluster && !isStandAloneCluster && !isKubernetesCluster) {
}
```

 






[jira] [Assigned] (SPARK-30968) Upgrade aws-java-sdk-sts to 1.11.655

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30968:
-

Assignee: Dongjoon Hyun

> Upgrade aws-java-sdk-sts to 1.11.655
> 
>
> Key: SPARK-30968
> URL: https://issues.apache.org/jira/browse/SPARK-30968
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>







[jira] [Resolved] (SPARK-30968) Upgrade aws-java-sdk-sts to 1.11.655

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30968.
---
Fix Version/s: 3.0.0
   2.4.6
   Resolution: Fixed

Issue resolved by pull request 27720
[https://github.com/apache/spark/pull/27720]

> Upgrade aws-java-sdk-sts to 1.11.655
> 
>
> Key: SPARK-30968
> URL: https://issues.apache.org/jira/browse/SPARK-30968
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.6, 3.0.0
>
>







[jira] [Commented] (SPARK-30855) Issue using 'explode' function followed by a (*)star expand selection of resulting struct

2020-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047029#comment-17047029
 ] 

Dongjoon Hyun commented on SPARK-30855:
---

According to the public plan, no. Probably, 3.0.0 RC1?
- https://spark.apache.org/versioning-policy.html
3.0.0 will arrive before Spark Summit 2020.

> Issue using 'explode' function followed by a (*)star expand selection of 
> resulting struct
> -
>
> Key: SPARK-30855
> URL: https://issues.apache.org/jira/browse/SPARK-30855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Benoit Roy
>Priority: Major
>
> An exception occurs when trying to use a _* expand_ selection after 
> performing an explode on an array of structs.
> I am testing this on preview2 release of spark.
> Here's a public repo containing a very simple scala test case that reproduces 
> the issue
> {code:java}
>  git clone g...@github.com:benoitroy/spark-30855.git{code}
>  Simply execute the *Spark30855Tests* class.
> On a simple schema such as:
> {code:java}
> root
>  |-- k1: string (nullable = true)
>  |-- k2: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- k2.k1: struct (nullable = true)
>  ||||-- k2.k1.k1: string (nullable = true)
>  ||||-- k2.k1.k2: string (nullable = true)
>  |||-- k2.k2: string (nullable = true) {code}
>  The following test case will fail on the 'col.*' selection.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import org.scalatest.funsuite.AnyFunSuite
> class Spark38055Tests extends AnyFunSuite {
>   test("") {
> //
> val path = "src/test/data/json/data.json"
> //
> val spark = SparkSession
>   .builder()
>   .appName("Testing.")
>   .config("spark.master", "local")
>   .getOrCreate();
> //
> val df = spark.read.json(path)
> // SUCCESS!
> df.printSchema()
> // SUCCESS!
> df.select(explode(col("k2"))).show()
> // SUCCESS!
> df.select(explode(col("k2"))).select("col.*").printSchema()
> // FAIL!
> df.select(explode(col("k2"))).select("col.*").show()
>   }
> } {code}
>  
> The test class demonstrates two cases, one where it fails (as shown above) 
> and another where it succeeds.  There is only a slight variation on the 
> schema of both cases.  The succeeding case works on the following schema:
> {code:java}
> root
>  |-- k1: string (nullable = true)
>  |-- k2: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- k2.k1: struct (nullable = true)
>  ||||-- k2.k1.k1: string (nullable = true)
>  |||-- k2.k2: string (nullable = true) {code}
> You will notice that this schema simply removes a field from the nested 
> struct 'k2.k1'.  
>  
> The stacktrace produced by the failing case is shown below:
> {code:java}
>  Binding attribute, tree: _gen_alias_23#23Binding attribute, tree: 
> _gen_alias_23#23org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>  Binding attribute, tree: _gen_alias_23#23 at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) 
> at 
> org.ap

[jira] [Comment Edited] (SPARK-30855) Issue using 'explode' function followed by a (*)star expand selection of resulting struct

2020-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047029#comment-17047029
 ] 

Dongjoon Hyun edited comment on SPARK-30855 at 2/27/20 11:00 PM:
-

According to the public plan, no. Probably, 3.0.0 RC1?
- https://spark.apache.org/versioning-policy.html

3.0.0 will arrive before Spark Summit 2020.


was (Author: dongjoon):
According to the public plan, no. Probably, 3.0.0 RC1?
- https://spark.apache.org/versioning-policy.html
3.0.0 will arrive before Spark Summit 2020.

> Issue using 'explode' function followed by a (*)star expand selection of 
> resulting struct
> -
>
> Key: SPARK-30855
> URL: https://issues.apache.org/jira/browse/SPARK-30855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Benoit Roy
>Priority: Major
>
> An exception occurs when trying to use a _* expand_ selection after 
> performing an explode on an array of struct.
> I am testing this on the preview2 release of Spark.
> Here's a public repo containing a very simple Scala test case that reproduces 
> the issue:
> {code:java}
>  git clone g...@github.com:benoitroy/spark-30855.git{code}
>  Simply execute the *Spark30855Tests* class.
> On a simple schema such as:
> {code:java}
> root
>  |-- k1: string (nullable = true)
>  |-- k2: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- k2.k1: struct (nullable = true)
>  ||||-- k2.k1.k1: string (nullable = true)
>  ||||-- k2.k1.k2: string (nullable = true)
>  |||-- k2.k2: string (nullable = true) {code}
>  The following test case will fail on the 'col.*' selection.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import org.scalatest.funsuite.AnyFunSuite
> class Spark38055Tests extends AnyFunSuite {
>   test("") {
> //
> val path = "src/test/data/json/data.json"
> //
> val spark = SparkSession
>   .builder()
>   .appName("Testing.")
>   .config("spark.master", "local")
>   .getOrCreate();
> //
> val df = spark.read.json(path)
> // SUCCESS!
> df.printSchema()
> // SUCCESS!
> df.select(explode(col("k2"))).show()
> // SUCCESS!
> df.select(explode(col("k2"))).select("col.*").printSchema()
> // FAIL!
> df.select(explode(col("k2"))).select("col.*").show()
>   }
> } {code}
>  
> The test class demonstrates two cases, one where it fails (as shown above) 
> and another where it succeeds.  There is only a slight variation on the 
> schema of both cases.  The succeeding case works on the following schema:
> {code:java}
> root
>  |-- k1: string (nullable = true)
>  |-- k2: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- k2.k1: struct (nullable = true)
>  ||||-- k2.k1.k1: string (nullable = true)
>  |||-- k2.k2: string (nullable = true) {code}
> You will notice that this schema simply removes a field from the nested 
> struct 'k2.k1'.  
>  
> The stacktrace produced by the failing case is shown below:
> {code:java}
>  Binding attribute, tree: _gen_alias_23#23Binding attribute, tree: 
> _gen_alias_23#23org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>  Binding attribute, tree: _gen_alias_23#23 at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode

[jira] [Commented] (SPARK-30442) Write mode ignored when using CodecStreams

2020-02-27 Thread Abhishek Madav (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047024#comment-17047024
 ] 

Abhishek Madav commented on SPARK-30442:


In case of task failures, say the task fails to write to local disk or is 
interrupted, the file is left empty but already materialized on the file system. The 
next task attempt that retries the write to this location sees the existing file and 
fails with a FileAlreadyExistsException, which makes the write not resilient to task 
failures.
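
To make the suggestion concrete, here is a hedged sketch of how the flag could be threaded through instead of being hardcoded; the helper name and signature below are illustrative and do not claim to match the actual CodecStreams API, only the idea of passing the caller's write mode down to Hadoop's FileSystem.create.

{code:java}
import java.io.OutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: a stand-in for the stream-creation helper, with the overwrite flag
// supplied by the caller instead of being hardcoded to false.
object CodecStreamsSketch {
  def createOutputStream(conf: Configuration, file: Path, overwrite: Boolean): OutputStream = {
    val fs: FileSystem = file.getFileSystem(conf)
    // With overwrite = false an existing path typically fails with
    // FileAlreadyExistsException; with overwrite = true a retried task can
    // replace the partially written file left behind by the failed attempt.
    fs.create(file, overwrite)
  }
}
{code}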

> Write mode ignored when using CodecStreams
> --
>
> Key: SPARK-30442
> URL: https://issues.apache.org/jira/browse/SPARK-30442
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Jesse Collins
>Priority: Major
>
> Overwrite is hardcoded to false in the codec stream. This can cause issues, 
> particularly with AWS tools, which make it impossible to retry the write.
> Ideally, this should be read from the write mode set for the DataWriter that 
> is writing through this codec class.
> [https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala#L81]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046958#comment-17046958
 ] 

Jorge Machado commented on SPARK-26412:
---

Thanks for the tip. It helps.

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30928) ML, GraphX 3.0 QA: API: Binary incompatible changes

2020-02-27 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043874#comment-17043874
 ] 

Huaxin Gao edited comment on SPARK-30928 at 2/27/20 7:55 PM:
-

  

I audited all the ML, MLlib, and GraphX-related MiMa exclusions added for 3.0:

A few of them are not necessary. Will open a PR to remove those.

 

 The following are false positives:

https://issues.apache.org/jira/browse/SPARK-16872 (private constructor)

https://issues.apache.org/jira/browse/SPARK-25838 (private or protected member 
variables)

https://issues.apache.org/jira/browse/SPARK-11215 (protected methods)

https://issues.apache.org/jira/browse/SPARK-26616 (private constructor or 
private object)

https://issues.apache.org/jira/browse/SPARK-25765 (private constructor)

https://issues.apache.org/jira/browse/SPARK-23042 (private object)

https://issues.apache.org/jira/browse/SPARK-30329 (private method and also ReversedMissingMethodProblem)

 

 The following are caused by inheritance structure changes. The external APIs 
are still the same for users, so we don't need to document these in the migration 
guide.

https://issues.apache.org/jira/browse/SPARK-29645 (the param is moved from the 
individual algorithm to shared Params.)

https://issues.apache.org/jira/browse/SPARK-28968 (the param is moved from the 
individual algorithm to shared Params.)

https://issues.apache.org/jira/browse/SPARK-3037 (AFT extends Estimator -> AFT 
extends Regressor)

 

 Need to check migration guide for the following:

Remove deprecated APIs:

https://issues.apache.org/jira/browse/SPARK-28980

https://issues.apache.org/jira/browse/SPARK-27410

https://issues.apache.org/jira/browse/SPARK-26127

https://issues.apache.org/jira/browse/SPARK-26090

https://issues.apache.org/jira/browse/SPARK-25382

Binary incompatible changes

https://issues.apache.org/jira/browse/SPARK-28780

https://issues.apache.org/jira/browse/SPARK-26133

https://issues.apache.org/jira/browse/SPARK-30144

https://issues.apache.org/jira/browse/SPARK-30630


was (Author: huaxingao):
 

I audited all the ML, MLLIb, GraphX-related MiMa exclusions added for 3.0:

A few of them are not necessary. Will open a PR to remove those.

 

 The following are false positive:

https://issues.apache.org/jira/browse/SPARK-16872 (private constructor)

https://issues.apache.org/jira/browse/SPARK-25838 (private or protected member 
variables)

https://issues.apache.org/jira/browse/SPARK-11215 (protected methods)

https://issues.apache.org/jira/browse/SPARK-26616 (private constructor or 
private object)

https://issues.apache.org/jira/browse/SPARK-25765 (private constructor)

https://issues.apache.org/jira/browse/SPARK-23042 (private object)

 

 The following are caused by inheritance structure change. The external APIs 
are still the same for users, so we don't need to document these in migration 
guide.

https://issues.apache.org/jira/browse/SPARK-29645 (the param is moved from 
individual algorithm to shared Params. 

https://issues.apache.org/jira/browse/SPARK-28968 (the param is moved from 
individual algorithm to shared Params. 

https://issues.apache.org/jira/browse/SPARK-3037 (AFT extends Estimator -> AFT 
extends Regressor)

 

 Need to check migration guide for the following:

Remove deprecated APIs:

https://issues.apache.org/jira/browse/SPARK-28980

https://issues.apache.org/jira/browse/SPARK-27410

https://issues.apache.org/jira/browse/SPARK-26127

https://issues.apache.org/jira/browse/SPARK-26090

https://issues.apache.org/jira/browse/SPARK-25382

Binary incompatible changes

https://issues.apache.org/jira/browse/SPARK-28780

https://issues.apache.org/jira/browse/SPARK-26133

https://issues.apache.org/jira/browse/SPARK-30144

https://issues.apache.org/jira/browse/SPARK-30630

> ML, GraphX 3.0 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-30928
> URL: https://issues.apache.org/jira/browse/SPARK-30928
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046929#comment-17046929
 ] 

Xingbo Jiang commented on SPARK-30969:
--

I created https://issues.apache.org/jira/browse/SPARK-30978 to deprecate the 
multiple workers on the same host support with Standalone backend.

> Remove resource coordination support from Standalone
> 
>
> Key: SPARK-30969
> URL: https://issues.apache.org/jira/browse/SPARK-30969
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Critical
>
> Resource coordination is used for the case where multiple workers run on 
> the same host. However, this should be a rare or even impossible use case in 
> the current Standalone mode (which already allows multiple executors in a single 
> worker). We should remove support for it to simplify the implementation and reduce 
> the potential maintenance cost in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30978) Remove multiple workers on the same host support from Standalone backend

2020-02-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-30978:
-
Description: 
Based on our experience, there is no scenario that necessarily requires 
deploying multiple Workers on the same node with the Standalone backend. A Worker 
should book all the resources reserved for Spark on the host where it is launched, 
and it can then allocate those resources to one or more executors it launches. Since 
each executor runs in a separate JVM, we can limit the memory of 
each executor to avoid long GC pauses.

The remaining concern is that local-cluster mode is implemented by launching 
multiple Workers on the local host, so we might need to re-implement 
LocalSparkCluster to launch only one Worker and multiple executors. That should 
be fine because local-cluster mode is only used to run Spark unit test 
cases, so end users should not be affected by this change.

Removing support for multiple Workers on the same host would simplify the deploy 
model of the Standalone backend, and also reduce the burden of supporting this legacy 
deploy pattern in future feature development.

The proposal is to update the documentation to deprecate support for the system 
environment variable `SPARK_WORKER_INSTANCES` in 3.0, and remove the support in the 
next major version (3.1.0).

  was:
Based on our experience, there is no scenario that necessarily requires 
deploying multiple Workers on the same node with Standalone backend. A worker 
should book all the resources reserved to Spark on the host it is launched, 
then it can allocate those resources to one or more executors launched by this 
worker. Since each executor runs in a separated JVM, we can limit the memory of 
each executor to avoid long GC pause.

The remaining concern is the local-cluster mode is implemented by launching 
multiple workers on the local host, we might need to re-implement 
LocalSparkCluster to launch only one Worker and multiple executors. It should 
be fine because local-cluster mode is only used in running Spark unit test 
cases, thus end users should not be affected by this change.

Removing multiple workers on the same host support could simplify the deploy 
model of Standalone backend, and also reduce the burden to support legacy 
deploy pattern in the future feature developments.


> Remove multiple workers on the same host support from Standalone backend
> 
>
> Key: SPARK-30978
> URL: https://issues.apache.org/jira/browse/SPARK-30978
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
>
> Based on our experience, there is no scenario that necessarily requires 
> deploying multiple Workers on the same node with the Standalone backend. A Worker 
> should book all the resources reserved for Spark on the host where it is launched, 
> and it can then allocate those resources to one or more executors it launches. 
> Since each executor runs in a separate JVM, we can limit the 
> memory of each executor to avoid long GC pauses.
> The remaining concern is that local-cluster mode is implemented by launching 
> multiple Workers on the local host, so we might need to re-implement 
> LocalSparkCluster to launch only one Worker and multiple executors. That should 
> be fine because local-cluster mode is only used to run Spark unit test 
> cases, so end users should not be affected by this change.
> Removing support for multiple Workers on the same host would simplify the deploy 
> model of the Standalone backend, and also reduce the burden of supporting this 
> legacy deploy pattern in future feature development.
> The proposal is to update the documentation to deprecate support for the system 
> environment variable `SPARK_WORKER_INSTANCES` in 3.0, and remove the support in the 
> next major version (3.1.0).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30978) Remove multiple workers on the same host support from Standalone backend

2020-02-27 Thread Xingbo Jiang (Jira)
Xingbo Jiang created SPARK-30978:


 Summary: Remove multiple workers on the same host support from 
Standalone backend
 Key: SPARK-30978
 URL: https://issues.apache.org/jira/browse/SPARK-30978
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.0.0, 3.1.0
Reporter: Xingbo Jiang
Assignee: Xingbo Jiang


Based on our experience, there is no scenario that necessarily requires 
deploying multiple Workers on the same node with the Standalone backend. A Worker 
should book all the resources reserved for Spark on the host where it is launched, 
and it can then allocate those resources to one or more executors it launches. Since 
each executor runs in a separate JVM, we can limit the memory of 
each executor to avoid long GC pauses.

The remaining concern is that local-cluster mode is implemented by launching 
multiple Workers on the local host, so we might need to re-implement 
LocalSparkCluster to launch only one Worker and multiple executors. That should 
be fine because local-cluster mode is only used to run Spark unit test 
cases, so end users should not be affected by this change.

Removing support for multiple Workers on the same host would simplify the deploy 
model of the Standalone backend, and also reduce the burden of supporting this legacy 
deploy pattern in future feature development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0

2020-02-27 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-30977:
--
Target Version/s: 3.0.0

> ResourceProfile and Builder should be private in spark 3.0
> --
>
> Key: SPARK-30977
> URL: https://issues.apache.org/jira/browse/SPARK-30977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like ResourceProfile and ResourceProfileBuilder accidentally got 
> opened up - they should be private[spark] until the stage-level scheduling 
> feature is complete, which won't make the 3.0 release. So make them private 
> in the 3.0 branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0

2020-02-27 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-30977:
-

Assignee: (was: Thomas Graves)

> ResourceProfile and Builder should be private in spark 3.0
> --
>
> Key: SPARK-30977
> URL: https://issues.apache.org/jira/browse/SPARK-30977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like ResourceProfile and ResourceProfileBuilder accidentally got 
> opened up - they should be private[spark] until the stage-level scheduling 
> feature is complete, which won't make the 3.0 release. So make them private 
> in the 3.0 branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0

2020-02-27 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046891#comment-17046891
 ] 

Thomas Graves commented on SPARK-30977:
---

I'm working on this; I should have a PR by the end of the day.

> ResourceProfile and Builder should be private in spark 3.0
> --
>
> Key: SPARK-30977
> URL: https://issues.apache.org/jira/browse/SPARK-30977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like ResourceProfile and ResourceProfileBuilder accidentally got 
> opened up - they should be private[spark] until the stage-level scheduling 
> feature is complete, which won't make the 3.0 release. So make them private 
> in the 3.0 branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30977) ResourceProfile and Builder should be private in spark 3.0

2020-02-27 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30977:
-

 Summary: ResourceProfile and Builder should be private in spark 3.0
 Key: SPARK-30977
 URL: https://issues.apache.org/jira/browse/SPARK-30977
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves


It looks like ResourceProfile and ResourceProfileBuilder accidentally got 
opened up - they should be private[spark] until the stage-level scheduling 
feature is complete, which won't make the 3.0 release. So make them private in 
the 3.0 branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-30976) Improve Maven Install Logic in build/mvn

2020-02-27 Thread Yin Huai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai deleted SPARK-30976:
-


> Improve Maven Install Logic in build/mvn
> 
>
> Key: SPARK-30976
> URL: https://issues.apache.org/jira/browse/SPARK-30976
> Project: Spark
>  Issue Type: Improvement
>Reporter: Wesley Hsiao
>Priority: Major
>
> The current code in build/mvn lacks a validation step to test the installed Maven 
> binary. This is a point of failure where Apache Jenkins jobs can 
> fail because the Maven binary fails to run due to a corrupted download from an 
> Apache mirror.
> To improve the stability of Apache Jenkins builds, a Maven binary 
> test should be added after the Maven download to verify that the 
> binary works. If it doesn't pass the test, then download and install from 
> the Apache archive repository.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30976) Improve Maven Install Logic in build/mvn

2020-02-27 Thread Wesley Hsiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wesley Hsiao updated SPARK-30976:
-
Description: 
The current code in build/mvn lacks a validation step to test the installed Maven 
binary. This is a point of failure where Apache Jenkins jobs can fail 
because the Maven binary fails to run due to a corrupted download from an Apache 
mirror.

To improve the stability of Apache Jenkins builds, a Maven binary test 
should be added after the Maven download to verify that the Maven binary 
works. If it doesn't pass the test, then download and install from 
the Apache archive repository.

  was:
The current code at 
[https://github.com/databricks/runtime/blob/master/build/mvn] lacks a 
validation step to test the installed maven binary at 
[https://github.com/databricks/runtime/blob/db9c17c77bb8e46f45038a992b4f12427e2a2692/build/mvn#L88-L102.]
  This is a point of failure where apache jenkins machine jobs can fail where a 
maven binary can fail to run due to a corrupted download from an apache mirror. 

To improve the stability of apache jenkins machine builds, a maven binary test 
logic should be added after maven download to verify that the maven binary 
works.  If it doesn't pass the test, then download and install from archive 
apache rep


> Improve Maven Install Logic in build/mvn
> 
>
> Key: SPARK-30976
> URL: https://issues.apache.org/jira/browse/SPARK-30976
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Wesley Hsiao
>Priority: Major
>
> The current code in build/mvn lacks a validation step to test the installed Maven 
> binary. This is a point of failure where Apache Jenkins jobs can 
> fail because the Maven binary fails to run due to a corrupted download from an 
> Apache mirror.
> To improve the stability of Apache Jenkins builds, a Maven binary 
> test should be added after the Maven download to verify that the 
> binary works. If it doesn't pass the test, then download and install from 
> the Apache archive repository.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30976) Improve Maven Install Logic in build/mvn

2020-02-27 Thread Wesley Hsiao (Jira)
Wesley Hsiao created SPARK-30976:


 Summary: Improve Maven Install Logic in build/mvn
 Key: SPARK-30976
 URL: https://issues.apache.org/jira/browse/SPARK-30976
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Wesley Hsiao


The current code at 
[https://github.com/databricks/runtime/blob/master/build/mvn] lacks a 
validation step to test the installed Maven binary at 
[https://github.com/databricks/runtime/blob/db9c17c77bb8e46f45038a992b4f12427e2a2692/build/mvn#L88-L102.]
This is a point of failure where Apache Jenkins jobs can fail because the 
Maven binary fails to run due to a corrupted download from an Apache mirror.

To improve the stability of Apache Jenkins builds, a Maven binary test 
should be added after the Maven download to verify that the Maven binary 
works. If it doesn't pass the test, then download and install from 
the Apache archive repository.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-02-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-30951:
--
Description: 
tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
containing dates before October 15, 1582. This could be an issue when such 
sites try to upgrade to Spark 3.0.

From SPARK-26651:
{quote}"The changes might impact on the results for dates and timestamps before 
October 15, 1582 (Gregorian)
{quote}
We recently discovered that some large scale Spark 2.x applications rely on 
dates before October 15, 1582.

Two cases came up recently:
 * An application that uses a commercial third-party library to encode 
sensitive dates. On insert, the library encodes the actual date as some other 
date. On select, the library decodes the date back to the original date. The 
encoded value could be any date, including one before October 15, 1582 (e.g., 
"0602-04-04").
 * An application that uses a specific unlikely date (e.g., "1200-01-01") as a 
marker to indicate "unknown date" (in lieu of null)

Both sites ran into problems after another component in their system was 
upgraded to use the proleptic Gregorian calendar. Spark applications that read 
files created by the upgraded component were interpreting encoded or marker 
dates incorrectly, and vice versa. Also, their data now had a mix of calendars 
(hybrid and proleptic Gregorian) with no metadata to indicate which file used 
which calendar.

Both sites had enormous amounts of existing data, so re-encoding the dates 
using some other scheme was not a feasible solution.

This is relevant to Spark 3:

Any Spark 2 application that uses such date-encoding schemes may run into 
trouble when run on Spark 3. The application may not properly interpret the 
dates previously written by Spark 2. Also, once the Spark 3 version of the 
application writes data, the tables will have a mix of calendars (hybrid and 
proleptic gregorian) with no metadata to indicate which file uses which 
calendar.

Similarly, sites might run with mixed Spark versions, resulting in data written 
by one version that cannot be interpreted by the other. And as above, the 
tables will now have a mix of calendars with no way to detect which file uses 
which calendar.

As with the two real-life example cases, these applications may have enormous 
amounts of legacy data, so re-encoding the dates using some other scheme may 
not be feasible.

We might want to consider a configuration setting to allow the user to specify 
the calendar for storing and retrieving date and timestamp values (not sure how 
such a flag would affect other date and timestamp-related functions). I realize 
the change is far bigger than just adding a configuration setting.

Here's a quick example of where trouble may happen, using the real-life case of 
the marker date.

In Spark 2.4:
{noformat}
scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
res0: Long = 1
scala>
{noformat}
In Spark 3.0 (reading from the same legacy file):
{noformat}
scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
res0: Long = 0
scala> 
{noformat}
By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches 
related to dates before 1582, the Hive community made the following changes:
 * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
checks a configuration setting to determine which calendar to use.
 * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
stores the calendar type in the metadata.
 * When reading date or timestamp data from ORC, Parquet, and Avro files, Hive 
checks the metadata for the calendar type.
 * When reading date or timestamp data from ORC, Parquet, and Avro files that 
lack calendar metadata, Hive's behavior is determined by a configuration 
setting. This allows Hive to read legacy data (note: if the data already 
consists of a mix of calendar types with no metadata, there is no good 
solution).

  was:
From SPARK-26651:
{quote}"The changes might impact on the results for dates and timestamps before 
October 15, 1582 (Gregorian)
{quote}
We recently discovered that some large scale Spark 2.x applications rely on 
dates before October 15, 1582.

Two cases came up recently:
 * An application that uses a commercial third-party library to encode 
sensitive dates. On insert, the library encodes the actual date as some other 
date. On select, the library decodes the date back to the original date. The 
encoded value could be any date, including one before October 15, 1582 (e.g., 
"0602-04-04").
 * An application that uses a specific unlikely date (e.g., "1200-01-01") as a 
marker to indicate "unknown date" (in lieu of null)

Both sites ran into problems after another component in their system was 
upgraded t

[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-27 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046801#comment-17046801
 ] 

Bryan Cutler commented on SPARK-30961:
--

Yes, we should be able to keep Spark 3.x up to date with the latest pyarrow. It 
is currently being tested against 0.15.1 and I've tested manually with 0.16.0 
also.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> There seems to be a bug in the Arrow-enabled toPandas conversion from a Spark 
> DataFrame to a pandas DataFrame when the DataFrame has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
> [['2019-12-06']], 'created_at: string') \
> .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread Xiangrui Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046795#comment-17046795
 ] 

Xiangrui Meng commented on SPARK-30969:
---

[~Ngone51] [~jiangxb1987] Is there a JIRA to deprecate multiple workers running 
on the same host? Could you create and link here? I think we should deprecate 
it in 3.0.

> Remove resource coordination support from Standalone
> 
>
> Key: SPARK-30969
> URL: https://issues.apache.org/jira/browse/SPARK-30969
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Critical
>
> Resource coordination is used for the case where multiple workers run on 
> the same host. However, this should be a rare or even impossible use case in 
> the current Standalone mode (which already allows multiple executors in a single 
> worker). We should remove support for it to simplify the implementation and reduce 
> the potential maintenance cost in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30969:
--
Environment: (was: Resource coordination is used for the case where 
multiple workers running on the same host. However, it should be a rarely or 
event impossible use case in current Standalone(which already allow multiple 
executor in a single worker). We should remove support for it to simply the 
implementation and reduce the potential maintain cost in the future.)

> Remove resource coordination support from Standalone
> 
>
> Key: SPARK-30969
> URL: https://issues.apache.org/jira/browse/SPARK-30969
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Critical
>
> Resource coordination is used for the case where multiple workers run on 
> the same host. However, this should be a rare or even impossible use case in 
> the current Standalone mode (which already allows multiple executors in a single 
> worker). We should remove support for it to simplify the implementation and reduce 
> the potential maintenance cost in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30969:
--
Priority: Critical  (was: Major)

> Remove resource coordination support from Standalone
> 
>
> Key: SPARK-30969
> URL: https://issues.apache.org/jira/browse/SPARK-30969
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Resource coordination is used for the case where 
> multiple workers running on the same host. However, it should be a rarely or 
> event impossible use case in current Standalone(which already allow multiple 
> executor in a single worker). We should remove support for it to simply the 
> implementation and reduce the potential maintain cost in the future.
>Reporter: wuyi
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-30969:
-

Assignee: wuyi

> Remove resource coordination support from Standalone
> 
>
> Key: SPARK-30969
> URL: https://issues.apache.org/jira/browse/SPARK-30969
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Resource coordination is used for the case where 
> multiple workers running on the same host. However, it should be a rarely or 
> event impossible use case in current Standalone(which already allow multiple 
> executor in a single worker). We should remove support for it to simply the 
> implementation and reduce the potential maintain cost in the future.
>Reporter: wuyi
>Assignee: wuyi
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30969:
--
Description: Resource coordination is used for the case where multiple 
workers run on the same host. However, this should be a rare or even 
impossible use case in the current Standalone mode (which already allows multiple 
executors in a single worker). We should remove support for it to simplify the 
implementation and reduce the potential maintenance cost in the future.

> Remove resource coordination support from Standalone
> 
>
> Key: SPARK-30969
> URL: https://issues.apache.org/jira/browse/SPARK-30969
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Resource coordination is used for the case where 
> multiple workers running on the same host. However, it should be a rarely or 
> event impossible use case in current Standalone(which already allow multiple 
> executor in a single worker). We should remove support for it to simply the 
> implementation and reduce the potential maintain cost in the future.
>Reporter: wuyi
>Assignee: wuyi
>Priority: Critical
>
> Resource coordination is used for the case where multiple workers run on 
> the same host. However, this should be a rare or even impossible use case in 
> the current Standalone mode (which already allows multiple executors in a single 
> worker). We should remove support for it to simplify the implementation and reduce 
> the potential maintenance cost in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30975) Rename config for spark.<>.memoryOverhead

2020-02-27 Thread Miquel Angel Andreu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miquel Angel Andreu updated SPARK-30975:

Summary: Rename config for spark.<>.memoryOverhead  (was: Rename  config to 
spark.executor.memoryOverhead)

> Rename config for spark.<>.memoryOverhead
> -
>
> Key: SPARK-30975
> URL: https://issues.apache.org/jira/browse/SPARK-30975
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Spark Submit
>Affects Versions: 2.4.5
>Reporter: Miquel Angel Andreu
>Priority: Minor
> Fix For: 2.4.6
>
>
> The configuration for Spark was changed recently and we have to keep 
> consistency in the code, so we need to rename the OverHeadMemory references in the 
> code to the new name: {{spark.executor.memoryOverhead}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30975) Rename config to spark.executor.memoryOverhead

2020-02-27 Thread Miquel Angel Andreu (Jira)
Miquel Angel Andreu created SPARK-30975:
---

 Summary: Rename  config to spark.executor.memoryOverhead
 Key: SPARK-30975
 URL: https://issues.apache.org/jira/browse/SPARK-30975
 Project: Spark
  Issue Type: Task
  Components: Documentation, Spark Submit
Affects Versions: 2.4.5
Reporter: Miquel Angel Andreu
 Fix For: 2.4.6


The configuration for Spark was changed recently and we have to keep 
consistency in the code, so we need to rename the OverHeadMemory references in the 
code to the new name: {{spark.executor.memoryOverhead}}
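
For illustration, a small sketch of using the current key programmatically; as far as I know this key replaced the deprecated YARN-prefixed spark.yarn.executor.memoryOverhead, and the values below are arbitrary examples.

{code:java}
import org.apache.spark.SparkConf

// Sketch only: arbitrary example values, shown to illustrate the renamed keys.
object MemoryOverheadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("memory-overhead-example")
      // Per-executor non-heap memory overhead; size strings such as "1g" or "512m" are accepted.
      .set("spark.executor.memoryOverhead", "1g")
      .set("spark.driver.memoryOverhead", "512m")
    println(conf.get("spark.executor.memoryOverhead"))
  }
}
{code}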



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28994) Document working of Adaptive

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046710#comment-17046710
 ] 

Takeshi Yamamuro commented on SPARK-28994:
--

Adaptive? This means adaptive execution? btw, is it worth documenting this in 
the SQL references?

> Document working of Adaptive
> 
>
> Key: SPARK-28994
> URL: https://issues.apache.org/jira/browse/SPARK-28994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28993) Document Working of Bucketing

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046708#comment-17046708
 ] 

Takeshi Yamamuro commented on SPARK-28993:
--

Any update? btw, is it worth documenting this in the SQL references?

> Document Working of Bucketing
> -
>
> Key: SPARK-28993
> URL: https://issues.apache.org/jira/browse/SPARK-28993
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28965) Document workings of CBO

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046706#comment-17046706
 ] 

Takeshi Yamamuro commented on SPARK-28965:
--

Any update? btw, is it worth documenting this in the SQL references?

> Document workings of CBO
> 
>
> Key: SPARK-28965
> URL: https://issues.apache.org/jira/browse/SPARK-28965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28995) Document working of Spark Streaming

2020-02-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28995.
--
Resolution: Invalid

> Document working of Spark Streaming
> ---
>
> Key: SPARK-28995
> URL: https://issues.apache.org/jira/browse/SPARK-28995
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28995) Document working of Spark Streaming

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046704#comment-17046704
 ] 

Takeshi Yamamuro commented on SPARK-28995:
--

I think this is not related to the SQL refs, so I'll close this. Please reopen 
it if there is any problem.

> Document working of Spark Streaming
> ---
>
> Key: SPARK-28995
> URL: https://issues.apache.org/jira/browse/SPARK-28995
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29458) Document scalar functions usage in APIs in SQL getting started.

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046701#comment-17046701
 ] 

Takeshi Yamamuro commented on SPARK-29458:
--

[~dkbiswal] Any update?

> Document scalar functions usage in APIs in SQL getting started.
> ---
>
> Key: SPARK-29458
> URL: https://issues.apache.org/jira/browse/SPARK-29458
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30095) create function syntax has to be enhance in Doc for multiple dependent jars

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046698#comment-17046698
 ] 

Takeshi Yamamuro commented on SPARK-30095:
--

[~abhishek.akg] Any update?

> create function syntax has to be enhance in Doc for multiple dependent jars 
> 
>
> Key: SPARK-30095
> URL: https://issues.apache.org/jira/browse/SPARK-30095
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> The CREATE FUNCTION example and syntax have to be enhanced as below:
> 1. Case 1: How to use multiple dependent jars in the path while creating a 
> function is not clear. -- Syntax to be given (see the sketch below).
> 2. Case 2: The different schemes supported, such as file:///, are not 
> documented. -- Supported schemes to be provided.
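
As an illustration of what the enhanced doc could show, here is a hedged sketch of creating a function that depends on more than one jar, with explicit URI schemes; the class name and jar paths are hypothetical placeholders, and the exact syntax should be double-checked against the CREATE FUNCTION reference page.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: 'com.example.udf.MyUpper' and the jar locations are made-up placeholders.
object CreateFunctionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("create-function-example")
      .master("local[*]")
      .enableHiveSupport() // permanent functions are stored in the Hive metastore
      .getOrCreate()

    // Multiple dependent jars are listed as comma-separated USING resources,
    // and each resource URI may carry a scheme such as file:/// or hdfs://.
    spark.sql(
      """CREATE FUNCTION my_upper AS 'com.example.udf.MyUpper'
        |USING JAR 'file:///opt/udfs/my-udf.jar',
        |      JAR 'hdfs:///libs/udf-deps.jar'""".stripMargin)

    spark.sql("SELECT my_upper('spark')").show()

    spark.stop()
  }
}
{code}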



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference

2020-02-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30635.
--
Resolution: Duplicate

I think this has been resolved by SPARK-28794 
([https://github.com/apache/spark/pull/26759/files]). Please reopen it if you 
have any problem.

> Document PARTITIONED BY  Clause of CREATE statement in SQL Reference
> 
>
> Key: SPARK-30635
> URL: https://issues.apache.org/jira/browse/SPARK-30635
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30693) Document STORED AS Clause of CREATE statement in SQL Reference

2020-02-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30693.
--
Resolution: Duplicate

I think this has been resolved by SPARK-28794 
([https://github.com/apache/spark/pull/26759/files]). Please reopen it if you 
have any problem.

> Document STORED AS Clause of CREATE statement in SQL Reference
> --
>
> Key: SPARK-30693
> URL: https://issues.apache.org/jira/browse/SPARK-30693
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference

2020-02-27 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046690#comment-17046690
 ] 

Takeshi Yamamuro commented on SPARK-30635:
--

[~jobitmathew] still working on it?

> Document PARTITIONED BY  Clause of CREATE statement in SQL Reference
> 
>
> Key: SPARK-30635
> URL: https://issues.apache.org/jira/browse/SPARK-30635
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30974) org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function.

2020-02-27 Thread Akshay (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshay updated SPARK-30974:
---
Summary: org.apache.spark.sql.AnalysisException: expression 
'default.udfvalidation.`empname`' is neither present in the group by, nor is it 
an aggregate function.   (was: org.apache.spark.sql.AnalysisException: 
expression 'default.udfvalidation.`empname`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;)

> org.apache.spark.sql.AnalysisException: expression 
> 'default.udfvalidation.`empname`' is neither present in the group by, nor is 
> it an aggregate function. 
> --
>
> Key: SPARK-30974
> URL: https://issues.apache.org/jira/browse/SPARK-30974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Akshay
>Priority: Minor
>
> I'm getting the following exception while executing the query in spark 2.4.2
> !image-2020-02-27-20-07-03-701.png!
>  
>  
> !image-2020-02-27-20-03-01-399.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30974) org.apache.spark.sql.AnalysisException: expression 'default.udfvalidation.`empname`' is neither present in the group by, nor is it an aggregate function. Add to group by

2020-02-27 Thread Akshay (Jira)
Akshay created SPARK-30974:
--

 Summary: org.apache.spark.sql.AnalysisException: expression 
'default.udfvalidation.`empname`' is neither present in the group by, nor is it 
an aggregate function. Add to group by or wrap in first() (or first_value) if 
you don't care which value you get.;;
 Key: SPARK-30974
 URL: https://issues.apache.org/jira/browse/SPARK-30974
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.2
Reporter: Akshay


I'm getting the following exception while executing the query in spark 2.4.2

!image-2020-02-27-20-07-03-701.png!

 

 

!image-2020-02-27-20-03-01-399.png!
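
For context, here is a minimal sketch (hypothetical table and column names, not the reporter's actual query) of the pattern that raises this AnalysisException and the two fixes the error message itself suggests:

{code:java}
import org.apache.spark.sql.SparkSession

object GroupByFixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the reported default.udfvalidation table.
    Seq(("a", "dev", 100L), ("b", "dev", 200L))
      .toDF("empname", "dept", "salary")
      .createOrReplaceTempView("udfvalidation")

    // Fails at analysis time with the reported AnalysisException:
    // spark.sql("SELECT empname, SUM(salary) FROM udfvalidation GROUP BY dept").show()

    // Fix 1: add the column to the GROUP BY clause.
    spark.sql("SELECT empname, SUM(salary) FROM udfvalidation GROUP BY dept, empname").show()

    // Fix 2: wrap it in first() if any value per group is acceptable.
    spark.sql("SELECT first(empname), SUM(salary) FROM udfvalidation GROUP BY dept").show()

    spark.stop()
  }
}
{code}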



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29969) parse_url function result in incorrect result

2020-02-27 Thread YoungGyu Chun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045059#comment-17045059
 ] 

YoungGyu Chun edited comment on SPARK-29969 at 2/27/20 2:16 PM:


[~xiaoxigua] is this issue related to Spark? The example provided looks like 
an issue with beeline or Hive.


was (Author: younggyuchun):
[~xiaoxigua] is this issue is relating to Spark? The example provided looks 
like the issue of the beeline or hive.

> parse_url function result in incorrect result
> -
>
> Key: SPARK-29969
> URL: https://issues.apache.org/jira/browse/SPARK-29969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.4
>Reporter: Victor Zhang
>Priority: Major
> Attachments: hive-result.jpg, spark-result.jpg
>
>
> In this Jira using java.net.URI instead of java.net.URL for performance 
> reason.
> https://issues.apache.org/jira/browse/SPARK-16826
> However, in the case of some unconventional parameters, it can lead to 
> incorrect results.
> For example, when the URL is encoded, the function cannot resolve the correct 
> result.
>  
> {code}
> 0: jdbc:hive2://localhost:1> SELECT 
> parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
>  'HOST');
> ++--+
> | 
> parse_url(http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%,
>  HOST) |
> ++--+
> | NULL |
> ++--+
> 1 row selected (0.094 seconds)
>  
> hive> SELECT 
> parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
>  'HOST');
> OK
> HEADER: _c0
> uzzf.down.gsxzq.com
> Time taken: 4.423 seconds, Fetched: 1 row(s)
> {code}
>  
> Here's a similar problem.
> https://issues.apache.org/jira/browse/SPARK-23056
> Our team used the spark function to run data for months, but now we have to 
> run it again.
> It's just too painful.:(:(:(
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30956) Use intercept instead of try-catch to assert failures in IntervalUtilsSuite

2020-02-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30956.
--
Fix Version/s: 3.0.0
 Assignee: Kent Yao
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27700]
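
For reference, the change replaces a try-catch-with-fail() pattern with ScalaTest's {{intercept}}; a minimal sketch of the two styles (the general pattern only, not the actual IntervalUtilsSuite code):

{code:java}
import org.scalatest.funsuite.AnyFunSuite

class InterceptVsTryCatchSuite extends AnyFunSuite {

  test("assert an expected failure") {
    // try-catch style: verbose, and forgetting fail() makes the test pass silently.
    try {
      require(false, "boom")
      fail("Expected an IllegalArgumentException")
    } catch {
      case e: IllegalArgumentException => assert(e.getMessage.contains("boom"))
    }

    // intercept style: fails automatically if nothing is thrown and
    // returns the exception for further assertions.
    val e = intercept[IllegalArgumentException] {
      require(false, "boom")
    }
    assert(e.getMessage.contains("boom"))
  }
}
{code}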

> Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
> ---
>
> Key: SPARK-30956
> URL: https://issues.apache.org/jira/browse/SPARK-30956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Addressed the comment from 
> https://github.com/apache/spark/pull/27672#discussion_r383719562 to use 
> `intercept` instead of `try-catch` block to assert  failures in the 
> IntervalUtilsSuite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30855) Issue using 'explode' function followed by a (*)star expand selection of resulting struct

2020-02-27 Thread Benoit Roy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046654#comment-17046654
 ] 

Benoit Roy commented on SPARK-30855:


OK, thanks for letting me know. Are there any plans for another preview release 
in the coming months?

> Issue using 'explode' function followed by a (*)star expand selection of 
> resulting struct
> -
>
> Key: SPARK-30855
> URL: https://issues.apache.org/jira/browse/SPARK-30855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Benoit Roy
>Priority: Major
>
> An exception occurs when trying to use a _* expand_ selection after 
> performing an explode on a array of struct.
> I am testing this on preview2 release of spark.
> Here's a public repo containing a very simple scala test case that reproduces 
> the issue
> {code:java}
>  git clone g...@github.com:benoitroy/spark-30855.git{code}
>  Simply execute the *Spark30855Tests* class.
> On a simple schema such as:
> {code:java}
> root
>  |-- k1: string (nullable = true)
>  |-- k2: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- k2.k1: struct (nullable = true)
>  ||||-- k2.k1.k1: string (nullable = true)
>  ||||-- k2.k1.k2: string (nullable = true)
>  |||-- k2.k2: string (nullable = true) {code}
>  The following test case will fail on the 'col.*' selection.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> import org.scalatest.funsuite.AnyFunSuite
> class Spark38055Tests extends AnyFunSuite {
>   test("") {
> //
> val path = "src/test/data/json/data.json"
> //
> val spark = SparkSession
>   .builder()
>   .appName("Testing.")
>   .config("spark.master", "local")
>   .getOrCreate();
> //
> val df = spark.read.json(path)
> // SUCCESS!
> df.printSchema()
> // SUCCESS!
> df.select(explode(col("k2"))).show()
> // SUCCESS!
> df.select(explode(col("k2"))).select("col.*").printSchema()
> // FAIL!
> df.select(explode(col("k2"))).select("col.*").show()
>   }
> } {code}
>  
> The test class demonstrates two cases, one where it fails (as shown above) 
> and another where it succeeds.  There is only a slight variation on the 
> schema of both cases.  The succeeding case works on the following schema:
> {code:java}
> root
>  |-- k1: string (nullable = true)
>  |-- k2: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- k2.k1: struct (nullable = true)
>  ||||-- k2.k1.k1: string (nullable = true)
>  |||-- k2.k2: string (nullable = true) {code}
> You will notice that this schema simply removes a field from the nested 
> struct 'k2.k1'.  
>  
> The stacktrace produced by the failing case is show below:
> {code:java}
>  Binding attribute, tree: _gen_alias_23#23Binding attribute, tree: 
> _gen_alias_23#23org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>  Binding attribute, tree: _gen_alias_23#23 at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) 
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(

[jira] [Resolved] (SPARK-30937) Migration guide for Hive 2.3

2020-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30937.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27670
[https://github.com/apache/spark/pull/27670]

> Migration guide for Hive 2.3
> 
>
> Key: SPARK-30937
> URL: https://issues.apache.org/jira/browse/SPARK-30937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Add a migration guide for users after Spark upgrades the built-in Hive from 1.2 
> to 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30937) Migration guide for Hive 2.3

2020-02-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30937:
---

Assignee: wuyi

> Migration guide for Hive 2.3
> 
>
> Key: SPARK-30937
> URL: https://issues.apache.org/jira/browse/SPARK-30937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Add a migration guide for users after Spark upgrades the built-in Hive from 1.2 
> to 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30973) ScriptTransformationExec should wait for the termination of process when scriptOutputReader hasNext return false

2020-02-27 Thread Sun Ke (Jira)
Sun Ke created SPARK-30973:
--

 Summary: ScriptTransformationExec should wait for the termination 
of process when scriptOutputReader hasNext return false
 Key: SPARK-30973
 URL: https://issues.apache.org/jira/browse/SPARK-30973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 2.4.4
Reporter: Sun Ke






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30972) PruneHiveTablePartitions should be executed as earlyScanPushDownRules

2020-02-27 Thread wuyi (Jira)
wuyi created SPARK-30972:


 Summary: PruneHiveTablePartitions should be executed as 
earlyScanPushDownRules
 Key: SPARK-30972
 URL: https://issues.apache.org/jira/browse/SPARK-30972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


Similar to PruneFileSourcePartitions, PruneHiveTablePartitions should also be 
executed as one of the earlyScanPushDownRules to eliminate its impact on later 
statistics computation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30971) Support MySQL Kerberos login in JDBC connector

2020-02-27 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-30971.
---
Resolution: Won't Do

> Support MySQL Kerberos login in JDBC connector
> --
>
> Key: SPARK-30971
> URL: https://issues.apache.org/jira/browse/SPARK-30971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30971) Support MySQL Kerberos login in JDBC connector

2020-02-27 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046550#comment-17046550
 ] 

Gabor Somogyi commented on SPARK-30971:
---

Just for the record, I've created this jira since MySQL doesn't provide Kerberos 
authentication at the moment.

> Support MySQL Kerberos login in JDBC connector
> --
>
> Key: SPARK-30971
> URL: https://issues.apache.org/jira/browse/SPARK-30971
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30971) Support MySQL Kerberos login in JDBC connector

2020-02-27 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-30971:
-

 Summary: Support MySQL Kerberos login in JDBC connector
 Key: SPARK-30971
 URL: https://issues.apache.org/jira/browse/SPARK-30971
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046542#comment-17046542
 ] 

Hyukjin Kwon edited comment on SPARK-26412 at 2/27/20 12:03 PM:


You cannot split one iterator into multiple iterators, since an iterator is 
supposed to be consumed only once; Python doesn't support that.
You should do something like

{code}
class SomeClass():
    def __init__(self, a, b, c):
        pass

def map_func(batch_iter):
    for a, b, c in batch_iter:
        dataset = SomeClass(a, b, c)
{code}

You can just pass strings and do {{json.loads}}, which is pretty easy. There is 
no standard type for JSON in Spark, since JSON is not an ANSI-standard type.
Adding new types is a huge job because you have to think about how to ser/de them 
in Scala, R, and Java, for instance.


was (Author: hyukjin.kwon):
You cannot separate one iterator to multiple iterators since iterator is 
supposed to be consumed once. Python doesn't support this way.
You should do something like

{code}
class SomeClass():
  def __init__(a,b,c):
 pass

def map_func(batch_iter):
   for a, b, c in batch_iter
   dataset = SomeClass(a, b, c) 
{code}

You can just pass strings and do {{json.loads}} which is pretty easy. There is 
no standard type for JSON in Spark which isn't ANSI standard.
Adding new types is a huge job because you should think about how to ser/de in 
Scala, R, Java for instance.

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046542#comment-17046542
 ] 

Hyukjin Kwon commented on SPARK-26412:
--

You cannot split one iterator into multiple iterators, since an iterator is 
supposed to be consumed only once; Python doesn't support that.
You should do something like

{code}
class SomeClass():
    def __init__(self, a, b, c):
        pass

def map_func(batch_iter):
    for a, b, c in batch_iter:
        dataset = SomeClass(a, b, c)
{code}

You can just pass strings and do {{json.loads}}, which is pretty easy. There is 
no standard type for JSON in Spark, since JSON is not an ANSI-standard type.
Adding new types is a huge job because you have to think about how to ser/de them 
in Scala, R, and Java, for instance.

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30970) Fix NPE in resolving k8s master url

2020-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046454#comment-17046454
 ] 

Dongjoon Hyun commented on SPARK-30970:
---

BTW, [~Qin Yao]. Could you check 2.3.4 behavior and update the Affected Version 
if needed?

> Fix NPE in resolving k8s master url
> ---
>
> Key: SPARK-30970
> URL: https://issues.apache.org/jira/browse/SPARK-30970
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ```
> bin/spark-sql --master  k8s:///https://kubernetes.docker.internal:6443 --conf 
> spark.kubernetes.container.image=yaooqinn/spark:v2.4.4
> Exception in thread "main" java.lang.NullPointerException
>   at 
> org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> ```
> {code}
> Although k8s:///https://kubernetes.docker.internal:6443 is a wrong master 
> URL, it should not throw an NPE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30970) Fix NPE in resolving k8s master url

2020-02-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046453#comment-17046453
 ] 

Dongjoon Hyun commented on SPARK-30970:
---

Since the root cause is a user mistake, the prevention logic will be a minor 
bug fix.
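
For illustration only, a hypothetical sketch of the kind of prevention logic meant here (this is not the actual Utils.checkAndGetK8sMasterUrl code): parse the part after the k8s:// prefix and validate its scheme instead of dereferencing a possibly null value.

{code:java}
import java.net.URI

object K8sMasterUrlSketch {
  // Hypothetical helper, not Spark's real implementation.
  def checkMasterUrl(rawMasterURL: String): String = {
    require(rawMasterURL.startsWith("k8s://"),
      s"Kubernetes master URL must start with k8s://: $rawMasterURL")
    val remainder = rawMasterURL.stripPrefix("k8s://")
    Option(new URI(remainder).getScheme) match {
      case Some(s) if s.equalsIgnoreCase("http") || s.equalsIgnoreCase("https") =>
        s"k8s://$remainder"
      case Some(other) =>
        throw new IllegalArgumentException(s"Invalid Kubernetes master scheme: $other")
      case None =>
        // "k8s:///https://host:port" leaves a remainder starting with "/", so the
        // parsed scheme is null; fail with a clear message instead of an NPE.
        throw new IllegalArgumentException(s"Invalid Kubernetes master URL: $rawMasterURL")
    }
  }
}
{code}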

> Fix NPE in resolving k8s master url
> ---
>
> Key: SPARK-30970
> URL: https://issues.apache.org/jira/browse/SPARK-30970
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ```
> bin/spark-sql --master  k8s:///https://kubernetes.docker.internal:6443 --conf 
> spark.kubernetes.container.image=yaooqinn/spark:v2.4.4
> Exception in thread "main" java.lang.NullPointerException
>   at 
> org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> ```
> {code}
> Although k8s:///https://kubernetes.docker.internal:6443 is a wrong master 
> URL, it should not throw an NPE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30970) Fix NPE in resolving k8s master url

2020-02-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30970:
--
Priority: Minor  (was: Major)

> Fix NPE in resolving k8s master url
> ---
>
> Key: SPARK-30970
> URL: https://issues.apache.org/jira/browse/SPARK-30970
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ```
> bin/spark-sql --master  k8s:///https://kubernetes.docker.internal:6443 --conf 
> spark.kubernetes.container.image=yaooqinn/spark:v2.4.4
> Exception in thread "main" java.lang.NullPointerException
>   at 
> org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> ```
> {code}
> Although k8s:///https://kubernetes.docker.internal:6443 is a wrong master 
> URL, it should not throw an NPE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30929) ML, GraphX 3.0 QA: API: New Scala APIs, docs

2020-02-27 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30929:
-
Environment: (was: Audit new public Scala APIs added to MLlib & GraphX. 
Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue)

> ML, GraphX 3.0 QA: API: New Scala APIs, docs
> 
>
> Key: SPARK-30929
> URL: https://issues.apache.org/jira/browse/SPARK-30929
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30929) ML, GraphX 3.0 QA: API: New Scala APIs, docs

2020-02-27 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30929:
-
Description: 
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue

> ML, GraphX 3.0 QA: API: New Scala APIs, docs
> 
>
> Key: SPARK-30929
> URL: https://issues.apache.org/jira/browse/SPARK-30929
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs

2020-02-27 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046228#comment-17046228
 ] 

zhengruifeng edited comment on SPARK-30932 at 2/27/20 9:47 AM:
---

I checked added classes from {{added_ml_class:}}
 * FMClassifier, FMRegressor  has related Java example and doc;
 * RobustScaler has related Java example and doc;
 * MultilabelClassificationEvaluator,RankingEvaluator do not have related Java 
examples; However, other evaluators do not have examples, either;  *We may need 
to add some basic description in doc/ml-tuning.*
 * -org.apache.spark.ml.functions has no related doc, is only used in 
{{FunctionsSuite}}; *I am not sure we should make it public;*-
 * -org.apache.spark.ml.\{FitStart, FitEnd, LoadInstanceStart, LoadInstanceEnd, 
SaveInstanceStart, SaveInstanceEnd, TransformStart, TransformEnd} are marked 
{{Unstable}} and has no related doc;-

 


was (Author: podongfeng):
I checked added classes from {{added_ml_class:}}
 * FMClassifier, FMRegressor  has related Java example and doc;
 * RobustScaler has related Java example and doc;
 * MultilabelClassificationEvaluator,RankingEvaluator do not have related Java 
examples; However, other evaluators do not have examples, either;  *We may need 
to add some basic description in doc/ml-tuning.*
 * org.apache.spark.ml.functions has no related doc, is only used in 
\{{FunctionsSuite}}; *I am not sure we should make it public;*
 * org.apache.spark.ml.\{FitStart, FitEnd, LoadInstanceStart, LoadInstanceEnd, 
SaveInstanceStart, SaveInstanceEnd, TransformStart, TransformEnd} are marked 
\{{Unstable}} and has no related doc;

 

> ML 3.0 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-30932
> URL: https://issues.apache.org/jira/browse/SPARK-30932
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: 1_process_script.sh, added_ml_class, common_ml_class, 
> signature.diff
>
>
> Check Java compatibility for this release:
>  * APIs in {{spark.ml}}
>  * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
>  * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
>  ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
>  *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
>  ** Check Scala objects (especially with nesting!) carefully. These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
>  ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc. (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
>  * Check for differences in generated Scala vs Java docs. E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
>  * Remember that we should not break APIs from previous releases. If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
>  * If needed for complex issues, create small Java unit tests which execute 
> each method. (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
>  * There are not great tools. In the past, this task has been done by:
>  ** Generating API docs
>  ** Building JAR and outputting the Java class signatures for MLlib
>  ** Manually inspecting and searching the docs and class signatures for issues
>  * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30970) Fix NPE in resolving k8s master url

2020-02-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-30970:


 Summary: Fix NPE in resolving k8s master url
 Key: SPARK-30970
 URL: https://issues.apache.org/jira/browse/SPARK-30970
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Spark Core
Affects Versions: 2.4.5, 3.0.0, 3.1.0
Reporter: Kent Yao


{code:java}
```
bin/spark-sql --master  k8s:///https://kubernetes.docker.internal:6443 --conf 
spark.kubernetes.container.image=yaooqinn/spark:v2.4.4
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739)
at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
{code}
Although k8s:///https://kubernetes.docker.internal:6443 is a wrong master URL, 
it should not throw an NPE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30969) Remove resource coordination support from Standalone

2020-02-27 Thread wuyi (Jira)
wuyi created SPARK-30969:


 Summary: Remove resource coordination support from Standalone
 Key: SPARK-30969
 URL: https://issues.apache.org/jira/browse/SPARK-30969
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: Resource coordination is used for the case where multiple 
workers run on the same host. However, it should be a rare or even 
impossible use case in current Standalone (which already allows multiple executors 
in a single worker). We should remove support for it to simplify the 
implementation and reduce the potential maintenance cost in the future.
Reporter: wuyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046373#comment-17046373
 ] 

Jorge Machado commented on SPARK-26412:
---

Well, I was thinking of something more, like passing a, b, c to 
another object. Like:

 
{code:java}
class SomeClass():
  def __init__(a,b,c):
 pass

def map_func(batch_iter):

   dataset = SomeClass(batch_iter[0], batch_iter[1], batch_iter[2]) <- this 
does not work. 

{code}
And another thing: it would be great if we could just yield JSON, for example, 
instead of these fixed types.

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046365#comment-17046365
 ] 

Hyukjin Kwon commented on SPARK-26412:
--

You can do it via:

{code}
def map_func(batch_iter):
for a, b, c in batch_iter:
yield a, b, c
{code}


> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29969) parse_url function result in incorrect result

2020-02-27 Thread Victor Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046359#comment-17046359
 ] 

Victor Zhang commented on SPARK-29969:
--

[~younggyuchun] 

The description may be a bit confusing. I use beeline to connect to the Spark 
Thrift Server.

The parse_url function in Spark depends on java.net.URI, while Hive's depends on 
java.net.URL.

 

 
{code:java}
spark-sql> SELECT 
parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
 'HOST');
NULL
Time taken: 1.211 seconds, Fetched 1 row(s)
{code}
 

 
{code:java}
hive> SELECT 
parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
 'HOST');
OK
HEADER: _c0
uzzf.down.gsxzq.com
Time taken: 0.039 seconds, Fetched: 1 row(s)
{code}
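
As a quick, JVM-only illustration of that difference (no Spark involved), a minimal sketch, which should print the host via java.net.URL and a URISyntaxException via java.net.URI for this exact input:

{code:java}
import java.net.{URI, URISyntaxException, URL}

object UriVsUrlSketch {
  def main(args: Array[String]): Unit = {
    val raw = "http://uzzf.down.gsxzq.com/download/" +
      "%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%"

    // java.net.URL does not validate percent-escapes, so the host is recovered.
    println(new URL(raw).getHost)  // uzzf.down.gsxzq.com

    // java.net.URI rejects the malformed trailing "%" escape, which is consistent
    // with Spark's URI-based parse_url returning NULL for this input.
    try println(new URI(raw).getHost)
    catch { case e: URISyntaxException => println("URISyntaxException: " + e.getMessage) }
  }
}
{code}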

> parse_url function result in incorrect result
> -
>
> Key: SPARK-29969
> URL: https://issues.apache.org/jira/browse/SPARK-29969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.4
>Reporter: Victor Zhang
>Priority: Major
> Attachments: hive-result.jpg, spark-result.jpg
>
>
> In this Jira using java.net.URI instead of java.net.URL for performance 
> reason.
> https://issues.apache.org/jira/browse/SPARK-16826
> However, in the case of some unconventional parameters, it can lead to 
> incorrect results.
> For example, when the URL is encoded, the function cannot resolve the correct 
> result.
>  
> {code}
> 0: jdbc:hive2://localhost:1> SELECT 
> parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
>  'HOST');
> ++--+
> | 
> parse_url(http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%,
>  HOST) |
> ++--+
> | NULL |
> ++--+
> 1 row selected (0.094 seconds)
>  
> hive> SELECT 
> parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
>  'HOST');
> OK
> HEADER: _c0
> uzzf.down.gsxzq.com
> Time taken: 4.423 seconds, Fetched: 1 row(s)
> {code}
>  
> Here's a similar problem.
> https://issues.apache.org/jira/browse/SPARK-23056
> Our team used the spark function to run data for months, but now we have to 
> run it again.
> It's just too painful.:(:(:(
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-27 Thread Nicolas Renkamp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046337#comment-17046337
 ] 

Nicolas Renkamp commented on SPARK-30961:
-

[~bryanc] Thanks for the quick reply. 

Actually, I was using pyarrow 0.15.1 in this example. I did not realize that 
Spark 2.4.x should be used with pyarrow 0.8.0, or at most 0.11.1.

Thanks for the background information and the links to the other issues.

 

From a user's perspective, it would be great if Spark 3.x were compatible 
with the latest pyarrow version. Are you aiming for that?

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
>
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org