[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)

2020-01-24 Thread Mathew Wicks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023439#comment-17023439
 ] 

Mathew Wicks commented on SPARK-28921:
--

[~dongjoon], it's just bad practice not to update all jars that depend on each 
other, so I never tried updating only one. However, I also remember reading, in 
other threads about this issue, that people ran into errors when they updated 
only one.
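
For reference, here is a minimal sbt sketch of pinning the fabric8 client to the 
fixed release (assumption: an sbt build that pulls in Spark's Kubernetes module; 
the override mechanism shown is standard sbt, not anything Spark-specific):

{code:scala}
// build.sbt -- hedged sketch: force the kubernetes-client version that contains
// the fix linked in the issue below, overriding whatever Spark's k8s module pulls in.
dependencyOverrides += "io.fabric8" % "kubernetes-client" % "4.4.2"
{code}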

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE: 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25330:
--
Fix Version/s: 2.3.2

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25330:
-

Assignee: Yuming Wang

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25330:
--
Fix Version/s: 2.4.0

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25330.
---
Resolution: Fixed

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29721:
--
Summary: Spark SQL reads unnecessary nested fields after using explode  
(was: Spark SQL reads unnecessary nested fields from Parquet after using 
explode)

> Spark SQL reads unnecessary nested fields after using explode
> -
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns of that nested structure are still fetched from the data 
> source.
> We are working on a project to create a Parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29721.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26978
[https://github.com/apache/spark/pull/26978]

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns of that nested structure are still fetched from the data 
> source.
> We are working on a project to create a Parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29721:
-

Assignee: L. C. Hsieh

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
>
> This is a follow-up for SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns of that nested structure are still fetched from the data 
> source.
> We are working on a project to create a Parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30617.
-

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a custom ExternalCatalog that retrieves metadata from 
> multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so 
> that we can run mixed queries between Hive and our online data.
>  # But as Spark requires that the value of spark.sql.catalogImplementation be 
> one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark removes the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!
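
For context, a minimal sketch of how the restricted setting is specified today 
(assumptions: a local session for illustration; "in-memory" and "hive" are 
currently the only accepted values, and the request above is to allow other 
catalog implementations here):

{code:scala}
import org.apache.spark.sql.SparkSession

// spark.sql.catalogImplementation currently only accepts "in-memory" or "hive"
// (enableHiveSupport() is what sets it to "hive").
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalog-demo")
  .config("spark.sql.catalogImplementation", "in-memory")
  .getOrCreate()
{code}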



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30617.
---
Resolution: Duplicate

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a custom ExternalCatalog that retrieves metadata from 
> multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so 
> that we can run mixed queries between Hive and our online data.
>  # But as Spark requires that the value of spark.sql.catalogImplementation be 
> one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark removes the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-30617:
---

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a custom ExternalCatalog that retrieves metadata from 
> multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so 
> that we can run mixed queries between Hive and our online data.
>  # But as Spark requires that the value of spark.sql.catalogImplementation be 
> one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark removes the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-24 Thread weiwenda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weiwenda resolved SPARK-30617.
--
Resolution: Pending Closed

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a custom ExternalCatalog that retrieves metadata from 
> multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so 
> that we can run mixed queries between Hive and our online data.
>  # But as Spark requires that the value of spark.sql.catalogImplementation be 
> one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark removes the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023354#comment-17023354
 ] 

Dongjoon Hyun commented on SPARK-28900:
---

Thank you for the update.

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant to be for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds looks like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate random port for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> else
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> fi
> if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
>   if [[ $retcode1 -ne 0 ]]; then
> echo "Packaging Spark with Maven failed"
>   fi
>   if [[ $retcode2 -ne 0 ]]; then
> echo "Testing Spark with Maven failed"
>   fi
>   exit 1
> fi
> {code}
> The PR builder (one of them at least) looks like:
> {code}
> #!/bin/bash
> set -e  # fail on any non-zero exit code
> set -x
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> echo "fixing target dir permissions"
> chmod -R +w target/* || true  # stupid hack by sknapp to ensure that the 
> chmod always exits w/0 and doesn't bork the script
> echo "running git clean -fdx"
> git clean -fdx
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export 

[jira] [Commented] (SPARK-29189) Add an option to ignore block locations when listing file

2020-01-24 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023340#comment-17023340
 ] 

Reynold Xin commented on SPARK-29189:
-

This is great, but how would users know when to set this? Shouldn't we make a 
slight incremental improvement and just automatically detect the common object 
stores and disable the locality check?
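
A rough sketch of that idea (purely illustrative, not Spark's implementation: the 
scheme list, helper name, and session wiring are assumptions; "spark.locality.wait" 
is the existing scheduler setting mentioned in the issue):

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: if the data lives on a common object store, locality is
// meaningless, so the locality wait could be zeroed out when building the session.
val objectStoreSchemes = Set("s3a", "s3n", "gs", "wasb", "abfs", "oss")

def buildSession(dataPath: String): SparkSession = {
  val scheme = Option(new java.net.URI(dataPath).getScheme).map(_.toLowerCase)
  val builder = SparkSession.builder().master("local[*]").appName("locality-demo")
  if (scheme.exists(objectStoreSchemes.contains)) {
    builder.config("spark.locality.wait", "0")
  }
  builder.getOrCreate()
}
{code}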

> Add an option to ignore block locations when listing file
> -
>
> Key: SPARK-29189
> URL: https://issues.apache.org/jira/browse/SPARK-29189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wang, Gang
>Assignee: Wang, Gang
>Priority: Major
> Fix For: 3.0.0
>
>
> In our PROD env we have a pure Spark cluster, where computation is separated 
> from the storage layer; I think this is also pretty common. In such a deploy 
> mode, data locality is never reachable. 
>  And there are some configurations in the Spark scheduler to reduce the 
> waiting time for data locality (e.g. "spark.locality.wait"). The problem is 
> that, in the file-listing phase, the location information of all the files, 
> with all the blocks inside each file, is fetched from the distributed file 
> system. In a PROD environment a table can be so huge that even fetching all 
> this location information takes tens of seconds.
>  To improve this scenario, Spark should provide an option to ignore data 
> locality entirely: all we need in the file-listing phase are the file 
> locations, without any block location information.
>  
> And we made a benchmark in our PROD env; after ignoring the block locations, 
> we saw a huge improvement.
> ||Table Size||Total File Number||Total Block Number||List File Duration(With 
> Block Location)||List File Duration(Without Block Location)||
> |22.6T|3|12|16.841s|1.730s|
> |28.8 T|42001|148964|10.099s|2.858s|
> |3.4 T|2| 2|5.833s|4.881s|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps

2020-01-24 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-30640:


 Summary: Prevent unnecessary copies of data in Arrow to Pandas 
conversion with Timestamps
 Key: SPARK-30640
 URL: https://issues.apache.org/jira/browse/SPARK-30640
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.4
Reporter: Bryan Cutler


During conversion of Arrow to Pandas, timestamp columns are modified to 
localize for the current timezone. If there are no timestamp columns, this can 
sometimes result in unnecessary copies of the data. See 
[https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for 
discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27117) current_date/current_timestamp should be reserved keywords in ansi parser mode

2020-01-24 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023317#comment-17023317
 ] 

Reynold Xin commented on SPARK-27117:
-

I changed the title to make it more clear to end users what's happening.
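
To illustrate the ambiguity the change addresses (hedged sketch; assumes an active 
SparkSession `spark` with `import spark.implicits._`, and which reference wins 
depends on the parser mode and Spark version):

{code:scala}
// A column literally named "current_date" collides with the CURRENT_DATE function.
// Making the name a reserved keyword in ANSI parser mode removes the ambiguity.
val df = Seq("2020-01-24").toDF("current_date")
df.createOrReplaceTempView("t")
spark.sql("SELECT current_date FROM t").show()
{code}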

 

> current_date/current_timestamp should be reserved keywords in ansi parser mode
> --
>
> Key: SPARK-27117
> URL: https://issues.apache.org/jira/browse/SPARK-27117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27117) current_date/current_timestamp should be reserved keywords in ansi parser mode

2020-01-24 Thread Reynold Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-27117:

Summary: current_date/current_timestamp should be reserved keywords in ansi 
parser mode  (was: current_date/current_timestamp should not refer to columns 
with ansi parser mode)

> current_date/current_timestamp should be reserved keywords in ansi parser mode
> --
>
> Key: SPARK-27117
> URL: https://issues.apache.org/jira/browse/SPARK-27117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0

2020-01-24 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25382:
-
Docs Text: In Spark 3.0.0, the deprecated ImageSchema class and its 
readImages methods have been removed. Use 
`spark.read.format("image").load(path)` instead.  (was: In Spark 3.0.0, the 
deprecated ImageSchema class and its readImages methods have been removed. Use 
`spark.read.format(\"image\").load(path)` instead.)
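
For context, a hedged usage sketch of the replacement data source named in the 
docs text (assumes an active SparkSession `spark`; the path and selected columns 
are illustrative):

{code:scala}
// Reads a directory of images into a DataFrame with a single "image" struct column.
val images = spark.read.format("image").load("/path/to/images")
images.select("image.origin", "image.width", "image.height").show(truncate = false)
{code}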

> Remove ImageSchema.readImages in 3.0
> 
>
> Key: SPARK-25382
> URL: https://issues.apache.org/jira/browse/SPARK-25382
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> A follow-up task from SPARK-25345. We might need to support sampling 
> (SPARK-25383) in order to remove readImages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0

2020-01-24 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25382:
-
Docs Text: In Spark 3.0.0, the deprecated ImageSchema class and its 
readImages methods have been removed. Use 
`spark.read.format(\"image\").load(path)` instead.  (was: In Spark 3.0.0, the 
deprecated ImageSchema class and its readImages methods have been removed.)

> Remove ImageSchema.readImages in 3.0
> 
>
> Key: SPARK-25382
> URL: https://issues.apache.org/jira/browse/SPARK-25382
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> A follow-up task from SPARK-25345. We might need to support sampling 
> (SPARK-25383) in order to remove readImages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30639) Upgrade Jersey to 2.30

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30639:
--
Parent: SPARK-29194
Issue Type: Sub-task  (was: Improvement)

> Upgrade Jersey to 2.30
> --
>
> Key: SPARK-30639
> URL: https://issues.apache.org/jira/browse/SPARK-30639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30639) Upgrade Jersey to 2.30

2020-01-24 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30639:
-

 Summary: Upgrade Jersey to 2.30
 Key: SPARK-30639
 URL: https://issues.apache.org/jira/browse/SPARK-30639
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30638) add resources as parameter to the PluginContext

2020-01-24 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30638:
-

 Summary: add resources as parameter to the PluginContext
 Key: SPARK-30638
 URL: https://issues.apache.org/jira/browse/SPARK-30638
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


Add the allocated resources and the ResourceProfile as parameters to the 
PluginContext, so that any plugin in the driver or executors can use this 
information to initialize devices or otherwise make use of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30632) to_timestamp() doesn't work with certain timezones

2020-01-24 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023263#comment-17023263
 ] 

Maxim Gekk edited comment on SPARK-30632 at 1/24/20 9:12 PM:
-

Spark 2.4 and earlier versions use SimpleDateFormat to parse timestamp strings. 
Unfortunately, that class doesn't support time zones in formats like 
"America/Los_Angeles", see 
[https://stackoverflow.com/questions/23242211/java-simpledateformat-parse-timezone-like-america-los-angeles]
 . Spark 3.0 has migrated to DateTimeFormatter, which doesn't have this issue. 
Porting the changes back to Spark 2.4 is risky and would destabilize it, IMHO. 
One of the reasons is that it requires changing the calendar system to the 
Proleptic Gregorian calendar, see https://issues.apache.org/jira/browse/SPARK-26651


was (Author: maxgekk):
Spark 2.4 and earlier versions use SimpleDateFormat to parse timestamp strings. 
Unfortunately, that class doesn't support time zones in formats like 
"America/Los_Angeles", see 
[https://stackoverflow.com/questions/23242211/java-simpledateformat-parse-timezone-like-america-los-angeles]
 . Spark 3.0 has migrated to DateTimeFormatter, which doesn't have this issue. 
Porting the changes back to Spark 2.4 is risky and would destabilize it, IMHO.

> to_timestamp() doesn't work with certain timezones
> --
>
> Key: SPARK-30632
> URL: https://issues.apache.org/jira/browse/SPARK-30632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Anton Daitche
>Priority: Major
>
> It seems that to_timestamp() doesn't work with time zones of the form 
> "Region/City", e.g. America/Los_Angeles.
> The code
> {code:scala}
> val df = Seq(
> ("2019-01-24 11:30:00.123", "America/Los_Angeles"), 
> ("2020-01-01 01:30:00.123", "PST")
> ).toDF("ts_str", "tz_name")
> val ts_parsed = to_timestamp(
> concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z"
> ).as("timestamp")
> df.select(ts_parsed).show(false)
> {code}
> prints
> {code}
> +-------------------+
> |timestamp          |
> +-------------------+
> |null               |
> |2020-01-01 10:30:00|
> +-------------------+
> {code}
> So, the datetime string with timezone PST is properly parsed, whereas the one 
> with America/Los_Angeles is converted to null. According to 
> [this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] 
> response on GitHub, this code works when run on the recent master version. 
> See also the discussion in 
> [this|https://github.com/apache/spark/pull/24195#issue] issue for more 
> context.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30632) to_timestamp() doesn't work with certain timezones

2020-01-24 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023263#comment-17023263
 ] 

Maxim Gekk commented on SPARK-30632:


Spark 2.4 and earlier versions use SimpleDateFormat to parse timestamp strings. 
Unfortunately, that class doesn't support time zones in formats like 
"America/Los_Angeles", see 
[https://stackoverflow.com/questions/23242211/java-simpledateformat-parse-timezone-like-america-los-angeles]
 . Spark 3.0 has migrated to DateTimeFormatter, which doesn't have this issue. 
Porting the changes back to Spark 2.4 is risky and would destabilize it, IMHO.
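
A minimal java.time sketch of the difference (not Spark's internal parser; the 
"VV" pattern letter is the standard DateTimeFormatter way to parse region-based 
zone IDs):

{code:scala}
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

// DateTimeFormatter can parse zone IDs such as "America/Los_Angeles" via "VV";
// SimpleDateFormat has no pattern letter for region-based zone IDs.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS VV")
val parsed = ZonedDateTime.parse("2019-01-24 11:30:00.123 America/Los_Angeles", fmt)
println(parsed.withZoneSameInstant(ZoneId.of("UTC")))
{code}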

> to_timestamp() doesn't work with certain timezones
> --
>
> Key: SPARK-30632
> URL: https://issues.apache.org/jira/browse/SPARK-30632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Anton Daitche
>Priority: Major
>
> It seems that to_timestamp() doesn't work with time zones of the form 
> "Region/City", e.g. America/Los_Angeles.
> The code
> {code:scala}
> val df = Seq(
> ("2019-01-24 11:30:00.123", "America/Los_Angeles"), 
> ("2020-01-01 01:30:00.123", "PST")
> ).toDF("ts_str", "tz_name")
> val ts_parsed = to_timestamp(
> concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z"
> ).as("timestamp")
> df.select(ts_parsed).show(false)
> {code}
> prints
> {code}
> +-------------------+
> |timestamp          |
> +-------------------+
> |null               |
> |2020-01-01 10:30:00|
> +-------------------+
> {code}
> So, the datetime string with timezone PST is properly parsed, whereas the one 
> with America/Los_Angeles is converted to null. According to 
> [this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] 
> response on GitHub, this code works when run on the recent master version. 
> See also the discussion in 
> [this|https://github.com/apache/spark/pull/24195#issue] issue for more 
> context.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2020-01-24 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023250#comment-17023250
 ] 

Shane Knapp commented on SPARK-28900:
-

FYI:  i will be OOO next week (mon-thur) with VERY limited availability until 
friday (when i need to create the branch-3.0 jobs).

and now to revisit this issue since it's been filed and address some points 
initially raised:



 
{code:java}
Narrowly, my suggestion is:
* Make the master Maven-based builds use dev/run-tests too, so that Pyspark 
tests are run. It's meant to support this, if AMPLAB_JENKINS_BUILD_TOOL is set 
to "maven". I'm not sure if we've tested this, then, if it's not used. We may 
need new Jenkins jobs to make sure it works. 
* Leave the Spark 2.x builds as-is as 'legacy'. 
{code}
 

re maven and dev/run-tests:  this will be super easy and i can probably get 
that done really quickly.  would dev/run-tests *replace* the mvn test block in 
the build script config?

re 2.x builds: easy.

 
{code:java}
Why also test with SBT? Maven is the build of reference and presumably one test 
job is enough? if it was because the Maven configs weren't running all the 
tests, and we can fix that, then are the SBT builds superfluous? Maybe keep one 
to verify SBT builds still work
{code}
i still am unsure why we have both, but would be more than happy to delete the 
SBT builds (esp if we have the maven test jobs run dev/run-tests).

 
{code:java}
Shouldn't the PR builder look more like the other Jenkins builds? maybe it 
needs to be special, a bit. But should all of them be using run-tests-jenkins?
{code}
 

for the most part, dev/run-tests-jenkins exists for pull request builds and 
posting results to PRs.  it also runs extra linting tests etc and acts mostly 
as a wrapper for dev/run-tests.  i'm nearly certain we can leave this as-is.

 
{code:java}
Looks like some cruft in the configs that has built up over time. Can we 
review/delete some? things like Java 7 home, hard-coding a Maven path. Perhaps 
standardizing on the simpler run-tests invocation does this?
{code}
i've actually been doing a lot of cleanup in the build configs.  i have a ways 
to go but things are MUCH cleaner.  

 

 

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant to be for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds looks like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate random port for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> 

[jira] [Resolved] (SPARK-30630) Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30630.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27330
[https://github.com/apache/spark/pull/27330]

> Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0
> -
>
> Key: SPARK-30630
> URL: https://issues.apache.org/jira/browse/SPARK-30630
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> Currently, GBT has
> {code:java}
> /**
>  * Number of trees in ensemble
>  */
> @Since("2.0.0")
> val getNumTrees: Int = trees.length{code}
> and
> {code:java}
> /** Number of trees in ensemble */
> val numTrees: Int = trees.length{code}
> I will deprecate numTrees in 2.4.5 and remove it in 3.0.0
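
A minimal sketch of the deprecation path described above (GBTEnsembleLike is a 
hypothetical stand-in, not the actual Spark patch):

{code:scala}
class GBTEnsembleLike(trees: Array[AnyRef]) {
  /** Number of trees in the ensemble. */
  def getNumTrees: Int = trees.length

  // Kept for source compatibility in 2.4.x, removed in 3.0.0.
  @deprecated("Use getNumTrees instead.", "2.4.5")
  def numTrees: Int = trees.length
}
{code}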



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30630) Deprecate numTrees in GBT

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30630:
-

Assignee: Huaxin Gao

> Deprecate numTrees in GBT
> -
>
> Key: SPARK-30630
> URL: https://issues.apache.org/jira/browse/SPARK-30630
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Currently, GBT has
> {code:java}
> /**
>  * Number of trees in ensemble
>  */
> @Since("2.0.0")
> val getNumTrees: Int = trees.length{code}
> and
> {code:java}
> /** Number of trees in ensemble */
> val numTrees: Int = trees.length{code}
> I will deprecate numTrees in 2.4.5 and remove it in 3.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30630) Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30630:
--
Summary: Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0  (was: 
Deprecate numTrees in GBT)

> Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0
> -
>
> Key: SPARK-30630
> URL: https://issues.apache.org/jira/browse/SPARK-30630
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Currently, GBT has
> {code:java}
> /**
>  * Number of trees in ensemble
>  */
> @Since("2.0.0")
> val getNumTrees: Int = trees.length{code}
> and
> {code:java}
> /** Number of trees in ensemble */
> val numTrees: Int = trees.length{code}
> I will deprecate numTrees in 2.4.5 and remove it in 3.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30626) Add SPARK_APPLICATION_ID into driver pod env

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30626:
--
Affects Version/s: (was: 2.4.4)

> Add SPARK_APPLICATION_ID into driver pod env
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Jiaxin Shan
>Assignee: Jiaxin Shan
>Priority: Minor
> Fix For: 3.0.0
>
>
> This should be a minor improvement.
> The use case is that we want to look up environment variables, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have this variable, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 
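
A hedged sketch of the use case described above (the log directory layout and 
fallback value are illustrative, not part of the proposal):

{code:scala}
// Once the driver pod exports SPARK_APPLICATION_ID, a per-application log folder
// can be derived from it, just as is already possible on executors.
val appId = sys.env.getOrElse("SPARK_APPLICATION_ID", "unknown-application")
val appLogDir = s"/var/log/spark/$appId"
println(s"driver logs would be redirected to $appLogDir")
{code}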



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30626) Add SPARK_APPLICATION_ID into driver pod env

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30626:
--
Summary: Add SPARK_APPLICATION_ID into driver pod env  (was: [K8S] Spark 
driver pod doesn't have SPARK_APPLICATION_ID env)

> Add SPARK_APPLICATION_ID into driver pod env
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jiaxin Shan
>Assignee: Jiaxin Shan
>Priority: Minor
> Fix For: 3.0.0
>
>
> This should be a minor improvement.
> The use case is that we want to look up environment variables, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have this variable, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30626:
-

Assignee: Jiaxin Shan

> [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jiaxin Shan
>Assignee: Jiaxin Shan
>Priority: Minor
>
> This should be a minor improvement.
> The use case is that we want to look up environment variables, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have this variable, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30626.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27347
[https://github.com/apache/spark/pull/27347]

> [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jiaxin Shan
>Assignee: Jiaxin Shan
>Priority: Minor
> Fix For: 3.0.0
>
>
> This should be a minor improvement.
> The use case is that we want to look up environment variables, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have this variable, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023224#comment-17023224
 ] 

Dongjoon Hyun commented on SPARK-28900:
---

Hi, All.
Can we restart this before the `branch-3.0` cut, since we need to duplicate all 
`master` Jenkins jobs when cutting the branch?

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant to be for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds looks like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate a random port for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> else
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> fi
> if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
>   if [[ $retcode1 -ne 0 ]]; then
> echo "Packaging Spark with Maven failed"
>   fi
>   if [[ $retcode2 -ne 0 ]]; then
> echo "Testing Spark with Maven failed"
>   fi
>   exit 1
> fi
> {code}
> The PR builder (one of them at least) looks like:
> {code}
> #!/bin/bash
> set -e  # fail on any non-zero exit code
> set -x
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> echo "fixing target dir permissions"
> chmod -R +w target/* || true  # stupid hack by sknapp to ensure that the 
> chmod always exits w/0 and doesn't bork the script
> echo "running git clean -fdx"
> git clean -fdx
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> 

[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version supports JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28704:
--
Target Version/s: 3.1.0

> Test backward compatibility on JDK9+ once we have a version supports JDK9+
> --
>
> Key: SPARK-28704
> URL: https://issues.apache.org/jira/browse/SPARK-28704
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or 
> later because our previous version does not support JAVA_9 or later. We 
> should add it back once we have a version that supports JAVA_9 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version supports JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28704:
--
Labels: 3.1.0  (was: )

> Test backward compatibility on JDK9+ once we have a version supports JDK9+
> --
>
> Key: SPARK-28704
> URL: https://issues.apache.org/jira/browse/SPARK-28704
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: 3.1.0
>
> We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or 
> later because our previous version does not support JAVA_9 or later. We 
> should add it back once we have a version that supports JAVA_9 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version supports JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28704:
--
Parent: (was: SPARK-29194)
Issue Type: Test  (was: Sub-task)

> Test backward compatibility on JDK9+ once we have a version supports JDK9+
> --
>
> Key: SPARK-28704
> URL: https://issues.apache.org/jira/browse/SPARK-28704
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or 
> later because our previous version does not support JAVA_9 or later. We 
> should add it back once we have a version that supports JAVA_9 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28704) Test backward compatibility on JDK9+ once we have a version supports JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28704:
--
Labels:   (was: 3.1.0)

> Test backward compatibility on JDK9+ once we have a version supports JDK9+
> --
>
> Key: SPARK-28704
> URL: https://issues.apache.org/jira/browse/SPARK-28704
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We skip the HiveExternalCatalogVersionsSuite test when testing with JAVA_9 or 
> later because our previous version does not support JAVA_9 or later. We 
> should add it back once we have a version that supports JAVA_9 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29924) Document Arrow requirement in JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29924:
--
Description: At least, we need to mention 
`io.netty.tryReflectionSetAccessible=true` is required for Arrow runtime on 
JDK9+ environment  (was: At least, we need to mention 
`io.netty.tryReflectionSetAccessible=true` is required for Arrow runtime on 
JDK9+ environment

Also, SparkR's minimum arrow became also 0.15.1 due to Arrow source code 
incompatibility. We need to update R document like sparkr.md)

> Document Arrow requirement in JDK9+
> ---
>
> Key: SPARK-29924
> URL: https://issues.apache.org/jira/browse/SPARK-29924
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> At least, we need to mention `io.netty.tryReflectionSetAccessible=true` is 
> required for Arrow runtime on JDK9+ environment
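For reference, a minimal sketch (not from this ticket) of one way to pass the property to the driver and executor JVMs from PySpark; the app name and the 2.x/3.0-era Arrow config name are illustrative assumptions:
{code:python}
# Sketch: supply the Netty reflection flag that Arrow needs on JDK 9+.
from pyspark.sql import SparkSession

netty_flag = "-Dio.netty.tryReflectionSetAccessible=true"

spark = (
    SparkSession.builder
    .appName("arrow-on-jdk11")  # illustrative name
    .config("spark.driver.extraJavaOptions", netty_flag)
    .config("spark.executor.extraJavaOptions", netty_flag)
    .config("spark.sql.execution.arrow.enabled", "true")  # Arrow-backed toPandas()
    .getOrCreate()
)

# toPandas() goes through Arrow and needs the flag above on JDK 9+.
print(spark.range(10).toPandas().head())
{code}
The same flags can also be set in spark-defaults.conf or via spark-submit --conf, which may be necessary when the driver JVM is launched before the builder runs.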



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29924) Document Arrow requirement in JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29924.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27356
[https://github.com/apache/spark/pull/27356]

> Document Arrow requirement in JDK9+
> ---
>
> Key: SPARK-29924
> URL: https://issues.apache.org/jira/browse/SPARK-29924
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> At least, we need to mention `io.netty.tryReflectionSetAccessible=true` is 
> required for Arrow runtime on JDK9+ environment
> Also, SparkR's minimum arrow became also 0.15.1 due to Arrow source code 
> incompatibility. We need to update R document like sparkr.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29924) Document Arrow requirement in JDK9+

2020-01-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29924:
-

Assignee: Dongjoon Hyun

> Document Arrow requirement in JDK9+
> ---
>
> Key: SPARK-29924
> URL: https://issues.apache.org/jira/browse/SPARK-29924
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> At least, we need to mention `io.netty.tryReflectionSetAccessible=true` is 
> required for Arrow runtime on JDK9+ environment
> Also, SparkR's minimum arrow became also 0.15.1 due to Arrow source code 
> incompatibility. We need to update R document like sparkr.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-01-24 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023202#comment-17023202
 ] 

Shane Knapp commented on SPARK-30637:
-

ok, i was able to easily uninstall 1.0.2 and reinstall 2.0.0 on my staging 
worker w/o issue.  which, i have to admit, makes me really nervous.  :)

 
{noformat}
* installing *source* package ‘testthat’ ...
** package ‘testthat’ successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include 
-DCOMPILING_TESTTHAT -fpic  -g -O2 -fstack-protector-strong -Wformat 
-Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c init.c -o init.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I../inst/include 
-DCOMPILING_TESTTHAT -fpic  -g -O2 -fstack-protector-strong -Wformat 
-Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c reassign.c -o 
reassign.o
g++  -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT 
-fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c test-catch.cpp -o test-catch.o
g++  -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT 
-fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c test-example.cpp -o test-example.o
g++  -I/usr/share/R/include -DNDEBUG -I../inst/include -DCOMPILING_TESTTHAT 
-fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security 
-Wdate-time -D_FORTIFY_SOURCE=2 -g  -c test-runner.cpp -o test-runner.o
g++ -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o 
testthat.so init.o reassign.o test-catch.o test-example.o test-runner.o 
-L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/testthat/libs
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (testthat){noformat}

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> i will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2020-01-24 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023194#comment-17023194
 ] 

Shane Knapp commented on SPARK-23435:
-

since we can't have different R environments for different spark branches, we 
should confirm that testthat 2.0.0 doesn't break the 2.4 branch before the 
jenkins workers are upgraded.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its methods have changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1, though, so we need to check whether it is going to work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-01-24 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp updated SPARK-30637:

Parent: SPARK-23435
Issue Type: Sub-task  (was: Task)

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> i will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-01-24 Thread Shane Knapp (Jira)
Shane Knapp created SPARK-30637:
---

 Summary: upgrade testthat on jenkins workers to 2.0.0
 Key: SPARK-30637
 URL: https://issues.apache.org/jira/browse/SPARK-30637
 Project: Spark
  Issue Type: Task
  Components: Build, jenkins, R
Affects Versions: 3.0.0
Reporter: Shane Knapp
Assignee: Shane Knapp


see:  https://issues.apache.org/jira/browse/SPARK-23435

i will investigate upgrading testthat on my staging worker, and if that goes 
smoothly we can upgrade it on all jenkins workers.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30636) Unable to add packages on spark-packages.org

2020-01-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30636:

Priority: Critical  (was: Blocker)

> Unable to add packages on spark-packages.org
> 
>
> Key: SPARK-30636
> URL: https://issues.apache.org/jira/browse/SPARK-30636
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4
>Reporter: Xiao Li
>Assignee: Burak Yavuz
>Priority: Critical
>
> Unable to add new packages to spark-packages.org. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30636) Unable to add packages on spark-packages.org

2020-01-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-30636:
---

Assignee: Burak Yavuz

> Unable to add packages on spark-packages.org
> 
>
> Key: SPARK-30636
> URL: https://issues.apache.org/jira/browse/SPARK-30636
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4
>Reporter: Xiao Li
>Assignee: Burak Yavuz
>Priority: Blocker
>
> Unable to add new packages to spark-packages.org. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30636) Unable to add packages on spark-packages.org

2020-01-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30636:

Affects Version/s: (was: 3.0.0)
   2.4.4

> Unable to add packages on spark-packages.org
> 
>
> Key: SPARK-30636
> URL: https://issues.apache.org/jira/browse/SPARK-30636
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4
>Reporter: Xiao Li
>Priority: Blocker
>
> Unable to add new packages to spark-packages.org. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30636) Unable to add packages on spark-packages.org

2020-01-24 Thread Xiao Li (Jira)
Xiao Li created SPARK-30636:
---

 Summary: Unable to add packages on spark-packages.org
 Key: SPARK-30636
 URL: https://issues.apache.org/jira/browse/SPARK-30636
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Xiao Li


Unable to add new packages to spark-packages.org. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023118#comment-17023118
 ] 

Dongjoon Hyun edited comment on SPARK-30218 at 1/24/20 5:10 PM:


No, this is fixed. The following is a similar case. User should do the 
disambiguation.
{code}
spark-sql> create table T (a int);
Time taken: 0.348 seconds

spark-sql> select a from T, T;
Error in query: Reference 'a' is ambiguous, could be: default.t.a, 
default.t.a.; line 1 pos 7
{code}


was (Author: dongjoon):
No, this is fixed. The following is the same case. User should do the 
disambiguation.
{code}
spark-sql> create table T (a int);
Time taken: 0.348 seconds

spark-sql> select a from T, T;
Error in query: Reference 'a' is ambiguous, could be: default.t.a, 
default.t.a.; line 1 pos 7
{code}

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023118#comment-17023118
 ] 

Dongjoon Hyun edited comment on SPARK-30218 at 1/24/20 5:10 PM:


No, this is fixed. The following is the same case. User should do the 
disambiguation.
{code}
spark-sql> create table T (a int);
Time taken: 0.348 seconds

spark-sql> select a from T, T;
Error in query: Reference 'a' is ambiguous, could be: default.t.a, 
default.t.a.; line 1 pos 7
{code}


was (Author: dongjoon):
No, this is fixed. The following is the same case. User should do the 
disambiguation.
{code}
spark-sql> create table T (a int);
Error in query: Table T already exists.;
spark-sql> select a from T, T;
Error in query: cannot resolve '`a`' given input columns: [default.t.id, 
default.t.id]; line 1 pos 7;
{code}

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023118#comment-17023118
 ] 

Dongjoon Hyun commented on SPARK-30218:
---

No, this is fixed. The following is the same case. User should do the 
disambiguation.
{code}
spark-sql> create table T (a int);
Error in query: Table T already exists.;
spark-sql> select a from T, T;
Error in query: cannot resolve '`a`' given input columns: [default.t.id, 
default.t.id]; line 1 pos 7;
{code}
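For reference, a minimal PySpark sketch of one possible way to apply that disambiguation to the join from this report (a workaround, not necessarily the project's recommended fix): refer to the columns through the DataFrame aliases instead of through the shared parent DataFrame.
{code:python}
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-disambiguation-sketch").getOrCreate()

data = spark.createDataFrame(
    [["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 2], ["id2", "A", 3],
     ["id1", "B", 1], ["id1", "B", 5], ["id2", "B", 10]],
    ["id", "kind", "timestamp"])

df_left = data.where(F.col("kind") == "A").alias("left")
df_right = data.where(F.col("kind") == "B").alias("right")

# Alias-qualified references resolve against one join side each, so the
# between-condition is applied to the intended columns.
conds = [
    F.col("left.id") == F.col("right.id"),
    F.col("right.timestamp").between(F.col("left.timestamp"),
                                     F.col("left.timestamp") + 2),
]
df_left.join(df_right, conds, how="left").show()
{code}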

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-24 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023036#comment-17023036
 ] 

Wenchen Fan commented on SPARK-30612:
-

I think the example from [~brkyvz] is right. The column name qualifier should 
only refer to what is specified in the table name.

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, it 
> fails to resolve for v2 tables.
> v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-24 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023019#comment-17023019
 ] 

Jeff Evans commented on SPARK-19248:


I'm not a Spark maintainer, so can't answer definitively.  However, I would 
guess they won't change the default value.  This was deliberately added in 2.0 
with a default value of false, and usually breaking changes like this are 
introduced in new major versions (speaking in general terms).

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2's execution of regexes. Using PySpark in 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-24 Thread Rahul Kumar Challapalli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022969#comment-17022969
 ] 

Rahul Kumar Challapalli commented on SPARK-30218:
-

[~dongjoon] I am not sure, but I was pointing out what the OP was asking. Since we 
don't disambiguate the columns in this case, should we keep this issue open?

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference

2020-01-24 Thread jobit mathew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022966#comment-17022966
 ] 

jobit mathew commented on SPARK-30635:
--

I will work on this

> Document PARTITIONED BY  Clause of CREATE statement in SQL Reference
> 
>
> Key: SPARK-30635
> URL: https://issues.apache.org/jira/browse/SPARK-30635
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30635) Document PARTITIONED BY Clause of CREATE statement in SQL Reference

2020-01-24 Thread jobit mathew (Jira)
jobit mathew created SPARK-30635:


 Summary: Document PARTITIONED BY  Clause of CREATE statement in 
SQL Reference
 Key: SPARK-30635
 URL: https://issues.apache.org/jira/browse/SPARK-30635
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 2.4.4
Reporter: jobit mathew






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)

2020-01-24 Thread Yurii Oleynikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yurii Oleynikov updated SPARK-30634:

Description: 
Hi ,

I have an application that does Arbitrary Stateful Processing in Structured 
Streaming and uses delta.merge to update a Delta table, and I faced strange 
behaviour:

1. I've noticed that logs inside the implementation of 
{{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} in my 
application are output twice.

2. While finding the root cause I've also found that the number of state rows 
reported by Spark doubles as well.

 

I thought that maybe there's a bug in my code, so I went back to 
{{JavaStructuredSessionization}} from the Apache Spark examples and changed it a 
bit. I still got the same result.

The problem happens only if I do not call batchDf.persist() inside 
foreachBatch.
{code:java}
StreamingQuery query = sessionUpdates
.writeStream()
.outputMode("update")
.foreachBatch((VoidFunction2, Long>) (batchDf, 
v2) -> {
// the following doubles the number of Spark state rows and causes 
MapGroupsWithStateFunction to log twice without persisting
deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), 
mergeExpr)
.whenNotMatched().insertAll()
.whenMatched()
.updateAll()
.execute();
})
.trigger(Trigger.ProcessingTime(1))
.queryName("ACME")
.start(); 
{code}
According to 
[https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and 
[Apache spark 
docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch]
 there seems to be no need to persist a dataset/dataframe inside 
{{foreachBatch}}.

Sample code from Apache Spark examples with delta: 
[JavaStructuredSessionization with Delta 
merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]

 

 

Appreciate your clarification.

 

  was:
Hi , I've faced strange behaviour with Delta merge and Arbitrary Stateful 
Processing in Structured streaming.

I have an application that does Arbitrary Stateful Processing in Structured 
Streaming and uses delta.merge to update a Delta table.

 

I've noticed that logs inside the implementation of 
{{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} in my 
application are output twice.

While finding the root cause I've also found that the number of state rows 
reported by Spark doubles as well.

I thought that maybe there's a bug in my code, so I went back to the 
{{JavaStructuredSessionization}} Apache Spark example and changed it a bit. 
I still got the same result.

The problem happens only if I do not call batchDf.persist() inside 
foreachBatch.
{code:java}
StreamingQuery query = sessionUpdates
.writeStream()
.outputMode("update")
.foreachBatch((VoidFunction2, Long>) (batchDf, 
v2) -> {
// the following doubles the number of Spark state rows and causes 
MapGroupsWithStateFunction to log twice without persisting
deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), 
mergeExpr)
.whenNotMatched().insertAll()
.whenMatched()
.updateAll()
.execute();
})
.trigger(Trigger.ProcessingTime(1))
.queryName("ACME")
.start(); 
{code}
According to 
[https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and 
[Apache spark 
docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch]
 there seems to be no need to persist a dataset/dataframe inside 
{{foreachBatch}}.

Sample code from Apache Spark examples with delta: 
[JavaStructuredSessionization with Delta 
merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]

 

 

Appreciate your clarification.

 


> Delta Merge and Arbitrary Stateful Processing in Structured streaming  
> (foreachBatch)
> -
>
> Key: SPARK-30634
> URL: https://issues.apache.org/jira/browse/SPARK-30634
> Project: Spark
>  Issue Type: Question
>  Components: Examples, Spark Core, Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 (scala 2.11.12)
> Delta: 0.5.0
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> OS: Ubuntu 18.04 LTS
>  
>Reporter: Yurii Oleynikov
>Priority: Trivial
> Attachments: Capture1.PNG
>
>
> Hi ,
> I have an application that does Arbitrary Stateful Processing in Structured 
> Streaming and uses delta.merge to update a Delta table, and I faced strange 
> behaviour:
> 1. I've noticed that logs inside 

[jira] [Created] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)

2020-01-24 Thread Yurii Oleynikov (Jira)
Yurii Oleynikov created SPARK-30634:
---

 Summary: Delta Merge and Arbitrary Stateful Processing in 
Structured streaming  (foreachBatch)
 Key: SPARK-30634
 URL: https://issues.apache.org/jira/browse/SPARK-30634
 Project: Spark
  Issue Type: Question
  Components: Examples, Spark Core, Structured Streaming
Affects Versions: 2.4.3
 Environment: Spark 2.4.3 (scala 2.11.12)

Delta: 0.5.0

Java(TM) SE Runtime Environment (build 1.8.0_91-b14)

OS: Ubuntu 18.04 LTS

 
Reporter: Yurii Oleynikov
 Attachments: Capture1.PNG

Hi , I've faced strange behaviour with Delta merge and Arbitrary Stateful 
Processing in Structured streaming.

I have an application that does Arbitrary Stateful Processing in Structured 
Streaming and uses delta.merge to update a Delta table.

 

I've noticed that logs inside the implementation of 
{{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} in my 
application are output twice.

While finding the root cause I've also found that the number of state rows 
reported by Spark doubles as well.

I thought that maybe there's a bug in my code, so I went back to the 
{{JavaStructuredSessionization}} Apache Spark example and changed it a bit. 
I still got the same result.

The problem happens only if I do not call batchDf.persist() inside 
foreachBatch.
{code:java}
StreamingQuery query = sessionUpdates
.writeStream()
.outputMode("update")
.foreachBatch((VoidFunction2, Long>) (batchDf, 
v2) -> {
// the following doubles the number of Spark state rows and causes 
MapGroupsWithStateFunction to log twice without persisting
deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), 
mergeExpr)
.whenNotMatched().insertAll()
.whenMatched()
.updateAll()
.execute();
})
.trigger(Trigger.ProcessingTime(1))
.queryName("ACME")
.start(); 
{code}
According to 
[https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and 
[Apache spark 
docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch]
 there seems to be no need to persist a dataset/dataframe inside 
{{foreachBatch}}.

Sample code from Apache Spark examples with delta: 
[JavaStructuredSessionization with Delta 
merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]

 

 

Appreciate your clarification.
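For reference, a minimal PySpark sketch of the persist workaround described above; the Delta table path, merge keys, and the stand-in streaming source are assumptions, not taken from the original application:
{code:python}
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-foreachbatch-sketch").getOrCreate()

# Stand-in streaming source; the real job derives its updates from
# (flat)mapGroupsWithState as in JavaStructuredSessionization.
session_updates = (spark.readStream.format("rate").option("rowsPerSecond", 1).load()
                   .select(F.col("value").alias("id"), "timestamp"))

def merge_batch(batch_df, batch_id):
    batch_df.persist()  # keeps the micro-batch from being recomputed by the merge
    target = DeltaTable.forPath(spark, "/tmp/delta/sessions")  # assumed, pre-created table
    (target.alias("sessions")
        .merge(batch_df.alias("updates"), "sessions.id = updates.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
    batch_df.unpersist()

query = (session_updates.writeStream
         .outputMode("update")
         .foreachBatch(merge_batch)
         .queryName("ACME")
         .start())
{code}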

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)

2020-01-24 Thread Yurii Oleynikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yurii Oleynikov updated SPARK-30634:

Attachment: Capture1.PNG

> Delta Merge and Arbitrary Stateful Processing in Structured streaming  
> (foreachBatch)
> -
>
> Key: SPARK-30634
> URL: https://issues.apache.org/jira/browse/SPARK-30634
> Project: Spark
>  Issue Type: Question
>  Components: Examples, Spark Core, Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 (scala 2.11.12)
> Delta: 0.5.0
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> OS: Ubuntu 18.04 LTS
>  
>Reporter: Yurii Oleynikov
>Priority: Trivial
> Attachments: Capture1.PNG
>
>
> Hi , I've faced strange behaviour with Delta merge and Arbitrary Stateful 
> Processing in Structured streaming.
> I have an application that does Arbitrary Stateful Processing in Structured 
> Streaming and uses delta.merge to update a Delta table.
>  
> I've noticed that logs inside the implementation of 
> {{MapGroupsWithStateFunction}}/{{FlatMapGroupsWithStateFunction}} in my 
> application are output twice.
> While finding the root cause I've also found that the number of state rows 
> reported by Spark doubles as well.
> I thought that maybe there's a bug in my code, so I went back to the 
> {{JavaStructuredSessionization}} Apache Spark example and changed it a bit. 
> I still got the same result.
> The problem happens only if I do not call batchDf.persist() inside 
> foreachBatch.
> {code:java}
> StreamingQuery query = sessionUpdates
> .writeStream()
> .outputMode("update")
> .foreachBatch((VoidFunction2, Long>) (batchDf, 
> v2) -> {
> // the following doubles the number of Spark state rows and causes 
> MapGroupsWithStateFunction to log twice without persisting
> deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), 
> mergeExpr)
> .whenNotMatched().insertAll()
> .whenMatched()
> .updateAll()
> .execute();
> })
> .trigger(Trigger.ProcessingTime(1))
> .queryName("ACME")
> .start(); 
> {code}
> According to 
> [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and 
> [Apache spark 
> docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch]
>  there seems to be no need to persist a dataset/dataframe inside 
> {{foreachBatch}}.
> Sample code from Apache Spark examples with delta: 
> [JavaStructuredSessionization with Delta 
> merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]
>  
>  
> Appreciate your clarification.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-24 Thread Tobias Hermann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022900#comment-17022900
 ] 

Tobias Hermann commented on SPARK-30421:


[~dongjoon] I'm glad we are aligned now. :)

For future reference:

The original Pandas example
{quote}df.drop(columns=["col1"]).loc[df["col1"] == 1]
{quote}
accesses the (unnamed) dataframe resulting from the drop call by row index 
(loc). This would even work (though it would not be very meaningful) with a totally 
independent dataframe used for the filtering.
{quote}df_foo = pd.DataFrame(data={'foo': [0, 1]})
df_bar = pd.DataFrame(data={'bar': ["a", "b"]})
df_bar.loc[df_foo["foo"] == 1]
{quote}

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-24 Thread weiwenda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022869#comment-17022869
 ] 

weiwenda commented on SPARK-30617:
--

[~dongjoon] Thanks for your advice. I will fill in the Fix Version / Affected 
Version fields carefully next time.

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a complex ExternalCatalog that retrieves metadata from 
> multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), so 
> that we can run mixed queries between Hive and our online data.
>  # But as Spark requires that the value of spark.sql.catalogImplementation be 
> one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark can remove the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30633) Codegen fails when xxHash seed is not an integer

2020-01-24 Thread Patrick Cording (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Cording updated SPARK-30633:

Description: 
If the seed for xxHash is not an integer the generated code does not compile.

Steps to reproduce:
{code:java}
import org.apache.spark.sql.catalyst.expressions.XxHash64
import org.apache.spark.sql.Column

val file = "..."
val column = col("...")

val df = spark.read.csv(file)

def xxHash(seed: Long, cols: Column*): Column = new Column(
   XxHash64(cols.map(_.expr), seed)
)

val seed = (Math.pow(2, 32)+1).toLong
df.select(xxHash(seed, column)).show()
{code}
Appending an L to the seed when the datatype is long fixes the issue.

  was:
If the seed for xxHash is not an integer the generated code does not compile.


Steps to reproduce:
{code:java}
import org.apache.spark.sql.catalyst.expressions.XxHash64
import org.apache.spark.sql.Column

val file = "..."
val column = col("...")

val df = spark.read.csv(file)

def xxHash(seed: Long, cols: Column*): Column = new Column( 
XxHash64(cols.map(_.expr), seed)
)

val seed = (Math.pow(2, 32)+1).toLong
df.select(xxHash(seed, column)).show()
{code}

Appending an L to the seed when the datatype is long fixes the issue.


> Codegen fails when xxHash seed is not an integer
> 
>
> Key: SPARK-30633
> URL: https://issues.apache.org/jira/browse/SPARK-30633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Patrick Cording
>Priority: Major
>
> If the seed for xxHash is not an integer the generated code does not compile.
> Steps to reproduce:
> {code:java}
> import org.apache.spark.sql.catalyst.expressions.XxHash64
> import org.apache.spark.sql.Column
> val file = "..."
> val column = col("...")
> val df = spark.read.csv(file)
> def xxHash(seed: Long, cols: Column*): Column = new Column(
>XxHash64(cols.map(_.expr), seed)
> )
> val seed = (Math.pow(2, 32)+1).toLong
> df.select(xxHash(seed, column)).show()
> {code}
> Appending an L to the seed when the datatype is long fixes the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30633) Codegen fails when xxHash seed is not an integer

2020-01-24 Thread Patrick Cording (Jira)
Patrick Cording created SPARK-30633:
---

 Summary: Codegen fails when xxHash seed is not an integer
 Key: SPARK-30633
 URL: https://issues.apache.org/jira/browse/SPARK-30633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Patrick Cording


If the seed for xxHash is not an integer the generated code does not compile.


Steps to reproduce:
{code:java}
import org.apache.spark.sql.catalyst.expressions.XxHash64
import org.apache.spark.sql.Column

val file = "..."
val column = col("...")

val df = spark.read.csv(file)

def xxHash(seed: Long, cols: Column*): Column = new Column( 
XxHash64(cols.map(_.expr), seed)
)

val seed = (Math.pow(2, 32)+1).toLong
df.select(xxHash(seed, column)).show()
{code}

Appending an L to the seed when the datatype is long fixes the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022850#comment-17022850
 ] 

Dongjoon Hyun commented on SPARK-30421:
---

While rethinking about this, the original column's index might be different 
because it can be considered a value array without any meaning. Got it, 
[~tobias_hermann].

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30421) Dropped columns still available for filtering

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022850#comment-17022850
 ] 

Dongjoon Hyun edited comment on SPARK-30421 at 1/24/20 10:44 AM:
-

On rethinking this, the original column's index might be different, because it 
can be considered a value array without any meaning. Got it, 
[~tobias_hermann].


was (Author: dongjoon):
While rethinking about this, the original column's index might be different 
because it can be considered a value array without any meaning. Got it, 
[~tobias_hermann].

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022835#comment-17022835
 ] 

Dongjoon Hyun commented on SPARK-30421:
---

Nope, your example is different; I illustrated what I wanted to show.
"Pandas supports filtering with *the original column's index* on the dropped 
data frame."
That's my point. I intentionally didn't declare `df2` or `df2["bar"]`.
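To make the contrast concrete, here is a small sketch (spark-shell, building on the example in the description below; behaviour as reported against 2.4.4): a free-standing `$"bar"` reference still filters the dropped frame, whereas resolving the column through the dropped DataFrame itself is expected to fail analysis:

{code:scala}
val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
val df2 = df.drop("bar")

// Reported behaviour: the free-standing $"bar" gets resolved against df2's
// child plan, so the filter still runs even though "bar" was dropped.
df2.where($"bar" === "a").show()

// Resolving through df2 itself checks df2's own schema, so this should throw
// an AnalysisException (cannot resolve column name "bar" among (foo)).
df2.where(df2("bar") === "a").show()
{code}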

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30632) to_timestamp() doesn't work with certain timezones

2020-01-24 Thread Anton Daitche (Jira)
Anton Daitche created SPARK-30632:
-

 Summary: to_timestamp() doesn't work with certain timezones
 Key: SPARK-30632
 URL: https://issues.apache.org/jira/browse/SPARK-30632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4, 2.3.0
Reporter: Anton Daitche


It seems that to_timestamp() doesn't work with time zones of the form 
Area/Location, e.g. America/Los_Angeles.

The code

{code:scala}
import org.apache.spark.sql.functions.{concat_ws, to_timestamp}

val df = Seq(
  ("2019-01-24 11:30:00.123", "America/Los_Angeles"),
  ("2020-01-01 01:30:00.123", "PST")
).toDF("ts_str", "tz_name")

val ts_parsed = to_timestamp(
  concat_ws(" ", $"ts_str", $"tz_name"), "yyyy-MM-dd HH:mm:ss.SSS z"
).as("timestamp")

df.select(ts_parsed).show(false)
{code}

prints


{code}
+-------------------+
|timestamp          |
+-------------------+
|null               |
|2020-01-01 10:30:00|
+-------------------+
{code}

So, the datetime string with the timezone PST is parsed properly, whereas the one 
with America/Los_Angeles is converted to null. According to 
[this|https://github.com/apache/spark/pull/24195#issuecomment-578055146] 
response on GitHub, this code works when run against a recent master build.

See also the discussion in 
[this|https://github.com/apache/spark/pull/24195#issue] pull request for more context.
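Until a release with the fix is available, one possible workaround sketch (assuming the Column-typed to_utc_timestamp overload that Spark 2.4 provides) is to parse the local part of the string without the zone name and then reinterpret it in the zone carried by tz_name. Note that the result is the corresponding UTC instant rather than a session-time-zone rendering:

{code:scala}
import org.apache.spark.sql.functions.{to_timestamp, to_utc_timestamp}

// df has the ts_str and tz_name columns from the snippet above.
val parsedUtc = to_utc_timestamp(
  to_timestamp($"ts_str"),  // default parsing handles "yyyy-MM-dd HH:mm:ss.SSS"
  $"tz_name"                // accepts both "PST" and "America/Los_Angeles"
).as("timestamp_utc")

df.select(parsedUtc).show(false)
{code}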




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org