[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-30 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390851#comment-17390851
 ] 

Senthil Kumar commented on SPARK-36327:
---

Hi [~sunchao]

Hive creates .staging directories inside the "/db/table" location, but spark-sql 
creates .staging directories inside the "/db/" location when we use Hadoop 
federation (viewFs). For other filesystems such as hdfs it works as expected 
(creating .staging inside the /db/table/ location).

HIVE:
{{
# beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO : Loading data to table dicedb.part_test partition (j=1) from 
**viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-1**

}}

But Spark's behaviour:

{{
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: 
**viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
... 
}}


The reason we require this change: if we allow spark-sql to create the .staging 
directory inside the /db/ location, we end up with security issues, because we 
would need to grant permission on the "viewfs:///db/" location to all users who 
submit Spark jobs.

After this change is applied, spark-sql creates .staging inside /db/table/, 
similar to Hive, as below:

{{
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, 
current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: 
**viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
 
}}

The reason we don't see this issue in Hive but only in spark-sql: in Hive, a 
"/db/table/tmp" directory structure is passed as the path, so path.getParent 
returns "/db/table/". But in Spark we pass "/db/table" directly, so 
"path.getParent" should not be used for Hadoop federation (viewFs).
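
A minimal sketch of the change this implies, shown against the 
{{newVersionExternalTempPath}} method quoted in the issue below. This is only 
an illustration under the assumption stated above (Spark already receives the 
table directory as {{path}}); it is not the merged patch:

{code:scala}
// Sketch only: for viewfs, stage relative to the table location itself
// rather than path.getParent (which is the database directory).
private def newVersionExternalTempPath(
    path: Path,                // table location, e.g. .../dicedb/part_test
    hadoopConf: Configuration,
    stagingDir: String): Path = {
  val extURI: URI = path.toUri
  if (extURI.getScheme == "viewfs") {
    getExtTmpPathRelTo(path, hadoopConf, stagingDir)  // was: path.getParent
  } else {
    // "-ext-1" mirrors the snippet quoted below (it may be truncated in the mail)
    new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
  }
}
{code}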

 

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in the file *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy is not properly handled for the viewFS case: 
> the parent path (the db path) is passed rather than the actual directory 
> (the table location).
> {{
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
>     path: Path,
>     hadoopConf: Configuration,
>     stagingDir: String): Path = {
>   val extURI: URI = path.toUri
>   if (extURI.getScheme == "viewfs") {
>     getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
>   } else {
>     new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
>   }
> }
> }}
> Please refer to these lines:
> ===
> if (extURI.getScheme == "viewfs") {
>   getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36370) Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3

2021-07-30 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36370:
-

 Summary: Avoid using SelectionMixin._builtin_table which is 
removed in pandas 1.3
 Key: SPARK-36370
 URL: https://issues.apache.org/jira/browse/SPARK-36370
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36369) Fix Index.union to follow pandas 1.3

2021-07-30 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36369:
-

 Summary: Fix Index.union to follow pandas 1.3
 Key: SPARK-36369
 URL: https://issues.apache.org/jira/browse/SPARK-36369
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36368) Fix CategoricalOps.astype to follow pandas 1.3

2021-07-30 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36368:
-

 Summary: Fix CategoricalOps.astype to follow pandas 1.3
 Key: SPARK-36368
 URL: https://issues.apache.org/jira/browse/SPARK-36368
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-07-30 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-36367:
--
Issue Type: Umbrella  (was: Improvement)

> Fix the behavior to follow pandas >= 1.3
> 
>
> Key: SPARK-36367
> URL: https://issues.apache.org/jira/browse/SPARK-36367
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36367:


Assignee: Apache Spark

> Fix the behavior to follow pandas >= 1.3
> 
>
> Key: SPARK-36367
> URL: https://issues.apache.org/jira/browse/SPARK-36367
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390822#comment-17390822
 ] 

Apache Spark commented on SPARK-36367:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33598

> Fix the behavior to follow pandas >= 1.3
> 
>
> Key: SPARK-36367
> URL: https://issues.apache.org/jira/browse/SPARK-36367
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36367:


Assignee: (was: Apache Spark)

> Fix the behavior to follow pandas >= 1.3
> 
>
> Key: SPARK-36367
> URL: https://issues.apache.org/jira/browse/SPARK-36367
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390821#comment-17390821
 ] 

Apache Spark commented on SPARK-36367:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33598

> Fix the behavior to follow pandas >= 1.3
> 
>
> Key: SPARK-36367
> URL: https://issues.apache.org/jira/browse/SPARK-36367
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-07-30 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36367:
-

 Summary: Fix the behavior to follow pandas >= 1.3
 Key: SPARK-36367
 URL: https://issues.apache.org/jira/browse/SPARK-36367
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin


Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36366) Google Kubernetes Engine authentication fails

2021-07-30 Thread Tiago Reis (Jira)
Tiago Reis created SPARK-36366:
--

 Summary: Google Kubernetes Engine authentication fails
 Key: SPARK-36366
 URL: https://issues.apache.org/jira/browse/SPARK-36366
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.1.2
 Environment: 
{code}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15", 
GitCommit:"73dd5c840662bb066a146d0871216333181f4b64", GitTreeState:"clean", 
BuildDate:"2021-01-13T13:22:41Z", GoVersion:"go1.13.15", Compiler:"gc", 
Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", 
GitVersion:"v1.18.19-gke.1701", 
GitCommit:"d7cecefb99b58e8968f59b59d76448eb1e6ea403", GitTreeState:"clean", 
BuildDate:"2021-06-23T21:51:59Z", GoVersion:"go1.13.15b4", Compiler:"gc", 
Platform:"linux/amd64"}

$ spark-submit --version
version 3.1.2
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.10
{code}
Reporter: Tiago Reis


When connecting to a Google Kubernetes Engine cluster, the command {{gcloud 
container clusters get-credentials}} is used, which generates a 
{{~/.kube/config}} file. The distinctive trait in this config file is that it 
uses an {{auth-provider}} relying on {{gcloud}} to inject the keys {{expiry}} 
and {{access-token}} from the general Google SDK auth config, as seen here:
{code:json}
users:
- name: gke_my-project_my-region_my-cluster
  user:
auth-provider:
  config:
cmd-args: config config-helper --format=json
cmd-path: /Users/reist01/google-cloud-sdk/bin/gcloud
expiry-key: '{.credential.token_expiry}'
token-key: '{.credential.access_token}'
{code}
{{kubectl}}, because it uses {{client-go}}, supports the auth-provider and 
fetches the token and expiry from the JSON returned by config-helper. Since 
Spark uses the fabric8 client, this is not yet supported, breaking when running 
spark-submit:
{code:java}
Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://my-endpoint/api/v1/namespaces/my-namespace/pods. Message: 
Forbidden! User gke_my-project_my-region_my-cluster doesn't have permission. 
pods is forbidden: User "system:anonymous" cannot create resource "pods" in API 
group "" in the namespace "my-namespace".
{code}
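
A possible workaround sketch until the auth-provider is supported: fetch an 
access token out-of-band from {{gcloud config config-helper}} and pass it to 
the submission client explicitly. The property name comes from the Spark on 
Kubernetes docs; the environment variable is hypothetical, so treat this as an 
illustration rather than a verified fix:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumes the token was exported beforehand, e.g.
//   export GKE_ACCESS_TOKEN=$(gcloud config config-helper \
//     --format='value(credential.access_token)')
val token = sys.env("GKE_ACCESS_TOKEN")

val spark = SparkSession.builder()
  .master("k8s://https://my-endpoint")
  // Documented Spark-on-K8s property for the submission client's OAuth token.
  .config("spark.kubernetes.authenticate.submission.oauthToken", token)
  .getOrCreate()
{code}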



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36366) Google Kubernetes Engine authentication fails

2021-07-30 Thread Tiago Reis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tiago Reis updated SPARK-36366:
---
Description: 
When connecting to a Google Kubernetes Engine cluster, the command {{gcloud 
container clusters get-credentials}} is used, which generates a 
{{~/.kube/config}} file. The distinctive trait in this config file is that it 
uses an {{auth-provider}} relying on {{gcloud}} to inject the keys {{expiry}} 
and {{access-token}} from the general Google SDK auth config, as seen here:
{code:json}
users:
- name: gke_my-project_my-region_my-cluster
  user:
auth-provider:
  config:
cmd-args: config config-helper --format=json
cmd-path: /Users/user/google-cloud-sdk/bin/gcloud
expiry-key: '{.credential.token_expiry}'
token-key: '{.credential.access_token}'
{code}
{{kubectl}}, because it uses {{client-go}}, supports the auth-provider and 
fetches the token and expiry from the JSON returned by config-helper. Since 
Spark uses the fabric8 client, this is not yet supported, breaking when running 
spark-submit:
{code:java}
Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://my-endpoint/api/v1/namespaces/my-namespace/pods. Message: 
Forbidden! User gke_my-project_my-region_my-cluster doesn't have permission. 
pods is forbidden: User "system:anonymous" cannot create resource "pods" in API 
group "" in the namespace "my-namespace".
{code}

  was:
When connecting to a Google Kubernetes Engine cluster, the command {{gcloud 
container clusters get-credentials}} is used, which generates a 
{{~/.kube/config}} file. The distinctive trait in this config file is that it 
uses an {{auth-provider}} relying on {{gcloud}} to inject the keys {{expiry}} 
and {{access-token}} from the general Google SDK auth config, as seen here:
{code:json}
users:
- name: gke_my-project_my-region_my-cluster
  user:
auth-provider:
  config:
cmd-args: config config-helper --format=json
cmd-path: /Users/reist01/google-cloud-sdk/bin/gcloud
expiry-key: '{.credential.token_expiry}'
token-key: '{.credential.access_token}'
{code}
{{kubectl}}, because it uses {{client-go}}, supports the auth-provider and 
fetches the token and expiry from the JSON returned by config-helper. Since 
Spark uses the fabric8 client, this is not yet supported, breaking when running 
spark-submit:
{code:java}
Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://my-endpoint/api/v1/namespaces/my-namespace/pods. Message: 
Forbidden! User gke_my-project_my-region_my-cluster doesn't have permission. 
pods is forbidden: User "system:anonymous" cannot create resource "pods" in API 
group "" in the namespace "my-namespace".
{code}


> Google Kubernetes Engine authentication fails
> -
>
> Key: SPARK-36366
> URL: https://issues.apache.org/jira/browse/SPARK-36366
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.2
> Environment: {code}
> $ kubectl version
> Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15", 
> GitCommit:"73dd5c840662bb066a146d0871216333181f4b64", GitTreeState:"clean", 
> BuildDate:"2021-01-13T13:22:41Z", GoVersion:"go1.13.15", Compiler:"gc", 
> Platform:"darwin/amd64"}
> Server Version: version.Info{Major:"1", Minor:"18+", 
> GitVersion:"v1.18.19-gke.1701", 
> GitCommit:"d7cecefb99b58e8968f59b59d76448eb1e6ea403", GitTreeState:"clean", 
> BuildDate:"2021-06-23T21:51:59Z", GoVersion:"go1.13.15b4", Compiler:"gc", 
> Platform:"linux/amd64"}
> $ spark-submit --version
> version 3.1.2
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.10
> {code}
>Reporter: Tiago Reis
>Priority: Minor
>  Labels: google, kubernetes, kubernetesexecutor, newbie
>
> When connecting to a Google Kubernetes Engine cluster, the command {{gcloud 
> container clusters get-credentials}} is used, which generates a 
> {{~/.kube/config}} file. The distinctive trait in this config file is that it 
> uses an {{auth-provider}} relying on {{gcloud}} to inject the keys {{expiry}} 
> and {{access-token}} from the general Google SDK auth config, as seen here:
> {code:json}
> users:
> - name: gke_my-project_my-region_my-cluster
>   user:
> auth-provider:
>   config:
> cmd-args: config config-helper --format=json
> cmd-path: /Users/user/google-cloud-sdk/bin/gcloud
> expiry-key: '{.credential.token_expiry}'
> token-key: '{.credential.access_token}'
> {code}
> {{kubectl}}, because it uses {{client-go}}, supports the auth-provider and 
> fetches the token and expiry from the JSON returned by config-helper. Since 
> Spark uses the fabric8 client, this is not yet supported, breaking when 

[jira] [Assigned] (SPARK-36365) Remove old workarounds related to null ordering.

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36365:


Assignee: (was: Apache Spark)

> Remove old workarounds related to null ordering.
> 
>
> Key: SPARK-36365
> URL: https://issues.apache.org/jira/browse/SPARK-36365
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> In pandas-on-Spark, there are still some remaining places that call 
> {{Column._jc.(asc|desc)_nulls_(first|last)}}, a workaround carried over from 
> Koalas to support Spark 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36365) Remove old workarounds related to null ordering.

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36365:


Assignee: Apache Spark

> Remove old workarounds related to null ordering.
> 
>
> Key: SPARK-36365
> URL: https://issues.apache.org/jira/browse/SPARK-36365
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> In pandas-on-Spark, there are still some remaining places that call 
> {{Column._jc.(asc|desc)_nulls_(first|last)}}, a workaround carried over from 
> Koalas to support Spark 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36365) Remove old workarounds related to null ordering.

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390813#comment-17390813
 ] 

Apache Spark commented on SPARK-36365:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33597

> Remove old workarounds related to null ordering.
> 
>
> Key: SPARK-36365
> URL: https://issues.apache.org/jira/browse/SPARK-36365
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> In pandas-on-Spark, there are still some remaining places that call 
> {{Column._jc.(asc|desc)_nulls_(first|last)}}, a workaround carried over from 
> Koalas to support Spark 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36365) Remove old workarounds related to null ordering.

2021-07-30 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-36365:
--
Summary: Remove old workarounds related to null ordering.  (was: Remove old 
workarounds related to ordering.)

> Remove old workarounds related to null ordering.
> 
>
> Key: SPARK-36365
> URL: https://issues.apache.org/jira/browse/SPARK-36365
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> In pandas-on-Spark, there are still some remaining places that call 
> {{Column._jc.(asc|desc)_nulls_(first|last)}}, a workaround carried over from 
> Koalas to support Spark 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36338) Move distributed-sequence implementation to Scala side

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390807#comment-17390807
 ] 

Apache Spark commented on SPARK-36338:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33596

> Move distributed-sequence implementation to Scala side
> --
>
> Key: SPARK-36338
> URL: https://issues.apache.org/jira/browse/SPARK-36338
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> https://github.com/apache/spark/blob/c22f7a4834e6fb7b69c4cc4af87c61c2fbbe0786/python/pyspark/pandas/internal.py#L925-L945
> This can be implemented on the JVM side to make it more performant without 
> extra serializations, and to work around the nullability.
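
A rough sketch of the distributed-sequence idea on the JVM side, analogous to 
RDD.zipWithIndex (count rows per partition first, then offset each partition); 
this illustrates the technique only and is not the code from the PR:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").getOrCreate()
val rdd = spark.range(0, 10, 1, 4).rdd  // cache in practice; counted twice here

// Pass 1: count rows per partition.
val counts = rdd
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
  .collect().sortBy(_._1).map(_._2)

// Prefix sums give each partition's starting index.
val offsets = counts.scanLeft(0L)(_ + _)

// Pass 2: attach a globally sequential id without shuffling the data.
val indexed = rdd.mapPartitionsWithIndex { (i, it) =>
  it.zipWithIndex.map { case (v, j) => (offsets(i) + j, v) }
}
indexed.collect().foreach(println)
{code}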



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36345:


Assignee: Dongjoon Hyun

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Dongjoon Hyun
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36345.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33595
[https://github.com/apache/spark/pull/33595]

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36365) Remove old workarounds related to ordering.

2021-07-30 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36365:
-

 Summary: Remove old workarounds related to ordering.
 Key: SPARK-36365
 URL: https://issues.apache.org/jira/browse/SPARK-36365
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin


In pandas-on-Spark, there are still some remaining places that call 
{{Column._jc.(asc|desc)_nulls_(first|last)}}, a workaround carried over from 
Koalas to support Spark 2.3.
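
For context, the null-ordering methods have been on the Scala Column API since 
Spark 2.1 and were exposed to PySpark in 2.4, which is why the Koalas-era 
{{_jc}} detour existed and can now go. A small sketch of the public API (the 
DataFrame here is illustrative, not from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(3).selectExpr("IF(id = 1, NULL, id) AS value")

// Public ordering API that replaces reaching into Column._jc:
df.sort(col("value").asc_nulls_first).show()
{code}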



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36364) Move window and aggregate functions to DataTypeOps

2021-07-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36364:


 Summary: Move window and aggregate functions to DataTypeOps
 Key: SPARK-36364
 URL: https://issues.apache.org/jira/browse/SPARK-36364
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36140) Replace DataTypeOps tests that have operations on different Series

2021-07-30 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-36140.
--
Resolution: Done

> Replace DataTypeOps tests that have operations on different Series
> --
>
> Key: SPARK-36140
> URL: https://issues.apache.org/jira/browse/SPARK-36140
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Replace DataTypeOps tests that have operations on different Series for a 
> shorter test duration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36345:


Assignee: (was: Apache Spark)

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390791#comment-17390791
 ] 

Apache Spark commented on SPARK-36345:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33595

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390790#comment-17390790
 ] 

Apache Spark commented on SPARK-36345:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33595

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36345:


Assignee: Apache Spark

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-35881:
--
Fix Version/s: 3.2.0

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method, is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  
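
The reflection workaround described above can be sketched roughly as follows 
(assumption: a zero-argument private method named getFinalPhysicalPlan, as the 
description states; this is not the RAPIDS plugin's actual code):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Reach the private method via reflection, then branch on columnar support.
def executeFinalStage(aqe: AdaptiveSparkPlanExec): RDD[_] = {
  val m = classOf[AdaptiveSparkPlanExec].getDeclaredMethod("getFinalPhysicalPlan")
  m.setAccessible(true)
  val finalPlan = m.invoke(aqe).asInstanceOf[SparkPlan]
  if (finalPlan.supportsColumnar) finalPlan.executeColumnar()
  else finalPlan.execute()
}
{code}

A supported mechanism would let callers do this through public methods on 
AdaptiveSparkPlanExec itself, which is what the ticket asks for.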



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35881:
--
Fix Version/s: (was: 3.2.0)

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.3.0
>
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method, is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36363) AKS Spark UI does not have executor tab showing up

2021-07-30 Thread Koushik (Jira)
Koushik created SPARK-36363:
---

 Summary: AKS Spark UI does not have executor tab showing up
 Key: SPARK-36363
 URL: https://issues.apache.org/jira/browse/SPARK-36363
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Koushik


Spark UI Executor tab is showing blank, and I see the below error in the network 
tab:

https://keplerfnet-aks-prod.az.3pc.att.com/proxy:10.128.0.76:4043/executors/

Failed to load resource: the server responded with a status of 404 ()

DevTools failed to load source map: Could not load content for 
https://keplerfnet-aks-prod.az.3pc.att.com/proxy:10.128.0.76:4043/static/vis.map:
 HTTP error: status code 502, net::ERR_HTTP_RESPONSE_CODE_FAILURE



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35881.
---
Fix Version/s: 3.3.0
   3.2.0
   Resolution: Fixed

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method, is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-07-30 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-35881:
-

Assignee: Andy Grove

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method, is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36350) Make nanvl work with DataTypeOps

2021-07-30 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36350.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 33582
https://github.com/apache/spark/pull/33582

> Make nanvl work with DataTypeOps
> 
>
> Key: SPARK-36350
> URL: https://issues.apache.org/jira/browse/SPARK-36350
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> We can move some logic related to {{F.nanvl}} to {{DataTypeOps}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36362) Omnibus Java code static analyzer warning fixes

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390705#comment-17390705
 ] 

Apache Spark commented on SPARK-36362:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/33594

> Omnibus Java code static analyzer warning fixes
> ---
>
> Key: SPARK-36362
> URL: https://issues.apache.org/jira/browse/SPARK-36362
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Inspired by a recent Java code touch-up, I wanted to fix in one pass several 
> lingering non-trivial issues with the Java code that a static analyzer turns 
> up. Only a few of these have material effects, but some do, and I figured we 
> could avoid taking N PRs over time to address them.
> * Some int*int multiplications that widen to long and could overflow
> * Unnecessarily non-static inner classes
> * Some tests "catch (AssertionError)" and do nothing
> * Manual array iteration vs very slightly faster/simpler foreach
> * Incorrect generic types that just happen to not cause a runtime error
> * Missed opportunities for try-close
> * Mutable enums which shouldn't be
> * ... and a few other minor things



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36362) Omnibus Java code static analyzer warning fixes

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36362:


Assignee: Apache Spark  (was: Sean R. Owen)

> Omnibus Java code static analyzer warning fixes
> ---
>
> Key: SPARK-36362
> URL: https://issues.apache.org/jira/browse/SPARK-36362
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Inspired by a recent Java code touch-up, I wanted to fix in one pass several 
> lingering non-trivial issues with the Java code that a static analyzer turns 
> up. Only a few of these have material effects, but some do, and I figured we 
> could avoid taking N PRs over time to address them.
> * Some int*int multiplications that widen to long and could overflow
> * Unnecessarily non-static inner classes
> * Some tests "catch (AssertionError)" and do nothing
> * Manual array iteration vs very slightly faster/simpler foreach
> * Incorrect generic types that just happen to not cause a runtime error
> * Missed opportunities for try-close
> * Mutable enums which shouldn't be
> * ... and a few other minor things



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36362) Omnibus Java code static analyzer warning fixes

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36362:


Assignee: Sean R. Owen  (was: Apache Spark)

> Omnibus Java code static analyzer warning fixes
> ---
>
> Key: SPARK-36362
> URL: https://issues.apache.org/jira/browse/SPARK-36362
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Inspired by a recent Java code touch-up, I wanted to fix in one pass several 
> lingering non-trivial issues with the Java code that a static analyzer turns 
> up. Only a few of these have material effects, but some do, and I figured we 
> could avoid taking N PRs over time to address them.
> * Some int*int multiplications that widen to long and could overflow
> * Unnecessarily non-static inner classes
> * Some tests "catch (AssertionError)" and do nothing
> * Manual array iteration vs very slightly faster/simpler foreach
> * Incorrect generic types that just happen to not cause a runtime error
> * Missed opportunities for try-close
> * Mutable enums which shouldn't be
> * ... and a few other minor things



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36362) Omnibus Java code static analyzer warning fixes

2021-07-30 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-36362:


 Summary: Omnibus Java code static analyzer warning fixes
 Key: SPARK-36362
 URL: https://issues.apache.org/jira/browse/SPARK-36362
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL, Tests
Affects Versions: 3.2.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


Inspired by a recent Java code touch-up, I wanted to fix in one pass several 
lingering non-trivial issues with the Java code that a static analyzer turns 
up. Only a few of these have material effects, but some do, and I figured we 
could avoid taking N PRs over time to address them.

* Some int*int multiplications that widen to long and could overflow (see the 
sketch after this list)
* Unnecessarily non-static inner classes
* Some tests "catch (AssertionError)" and do nothing
* Manual array iteration vs very slightly faster/simpler foreach
* Incorrect generic types that just happen to not cause a runtime error
* Missed opportunities for try-close
* Mutable enums which shouldn't be
* ... and a few other minor things
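
For the first item, the pitfall looks like this; a minimal illustration in 
Scala (the ticket targets Java code, where the widening rules are the same):

{code:scala}
val bytesPerBlock: Int = 512 * 1024 * 1024
val numBlocks: Int = 8

// Wrong: the multiply happens in 32-bit Int and overflows before widening,
// so this yields 0 instead of 4294967296.
val overflowed: Long = bytesPerBlock * numBlocks

// Right: widen one operand first so the multiply happens in 64-bit Long.
val correct: Long = bytesPerBlock.toLong * numBlocks
{code}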



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36358.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33593
[https://github.com/apache/spark/pull/33593]

> Upgrade Kubernetes Client Version to 5.6.0
> --
>
> Key: SPARK-36358
> URL: https://issues.apache.org/jira/browse/SPARK-36358
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.3.0
>
>
> This way [Retry HTTP operation in case IOException too (exponential 
> backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] will be 
> included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36358:


Assignee: Attila Zsolt Piros  (was: Apache Spark)

> Upgrade Kubernetes Client Version to 5.6.0
> --
>
> Key: SPARK-36358
> URL: https://issues.apache.org/jira/browse/SPARK-36358
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> This way [Retry HTTP operation in case IOException too (exponential 
> backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] will be 
> included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390627#comment-17390627
 ] 

Apache Spark commented on SPARK-36358:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/33593

> Upgrade Kubernetes Client Version to 5.6.0
> --
>
> Key: SPARK-36358
> URL: https://issues.apache.org/jira/browse/SPARK-36358
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> This way [Retry HTTP operation in case IOException too (exponential 
> backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] will be 
> included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390626#comment-17390626
 ] 

Apache Spark commented on SPARK-36358:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/33593

> Upgrade Kubernetes Client Version to 5.6.0
> --
>
> Key: SPARK-36358
> URL: https://issues.apache.org/jira/browse/SPARK-36358
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> This way [Retry HTTP operation in case IOException too (exponential 
> backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] will be 
> included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36358:


Assignee: Apache Spark  (was: Attila Zsolt Piros)

> Upgrade Kubernetes Client Version to 5.6.0
> --
>
> Key: SPARK-36358
> URL: https://issues.apache.org/jira/browse/SPARK-36358
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> This way [Retry HTTP operation in case IOException too (exponential 
> backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] will be 
> included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36360) StreamingSource duplicates appName

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36360:


Assignee: Apache Spark

> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Assignee: Apache Spark
>Priority: Minor
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting via the 
> {{spark.metrics.namespace}} configuration property, as {{spark.app.name}} 
> will still be included in the metric name. Using a metrics namespace thus 
> results in a duplicated indicator for {{spark.app.name}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36360) StreamingSource duplicates appName

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36360:


Assignee: (was: Apache Spark)

> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Minor
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting via the 
> {{spark.metrics.namespace}} configuration property, as {{spark.app.name}} 
> will still be included in the metric name. Using a metrics namespace thus 
> results in a duplicated indicator for {{spark.app.name}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36360) StreamingSource duplicates appName

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390585#comment-17390585
 ] 

Apache Spark commented on SPARK-36360:
--

User 'mrclneumann' has created a pull request for this issue:
https://github.com/apache/spark/pull/33592

> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Minor
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting via the 
> {{spark.metrics.namespace}} configuration property, as {{spark.app.name}} 
> will still be included in the metric name. Using a metrics namespace thus 
> results in a duplicated indicator for {{spark.app.name}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36338) Move distributed-sequence implementation to Scala side

2021-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36338.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33570
[https://github.com/apache/spark/pull/33570]

> Move distributed-sequence implementation to Scala side
> --
>
> Key: SPARK-36338
> URL: https://issues.apache.org/jira/browse/SPARK-36338
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> https://github.com/apache/spark/blob/c22f7a4834e6fb7b69c4cc4af87c61c2fbbe0786/python/pyspark/pandas/internal.py#L925-L945
> This can be implemented on the JVM side to make it more performant, avoiding 
> extra serialization and working around the nullability.
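
For context, a minimal sketch of how a gap-free distributed-sequence index 
could be computed on the JVM side (an illustration only; the actual change in 
pull request 33570 may differ):

{code:scala}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Prepend a contiguous 0-based index column by zipping each row with its
// global position, without collecting the data to the driver.
def withDistributedSequence(df: DataFrame, name: String): DataFrame = {
  val schema = StructType(StructField(name, LongType, nullable = false) +: df.schema.fields)
  val indexed = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(idx +: row.toSeq) }
  df.sparkSession.createDataFrame(indexed, schema)
}
{code}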



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36338) Move distributed-sequence implementation to Scala side

2021-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36338:


Assignee: Hyukjin Kwon

> Move distributed-sequence implementation to Scala side
> --
>
> Key: SPARK-36338
> URL: https://issues.apache.org/jira/browse/SPARK-36338
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/blob/c22f7a4834e6fb7b69c4cc4af87c61c2fbbe0786/python/pyspark/pandas/internal.py#L925-L945
> This can be implemented on the JVM side to make it more performant, avoiding 
> extra serialization and working around the nullability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36092:


Assignee: Apache Spark

> Migrate to GitHub Actions Codecov from Jenkins
> --
>
> Key: SPARK-36092
> URL: https://issues.apache.org/jira/browse/SPARK-36092
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> We currently use the manual Codecov site to work around our Jenkins CI 
> security issue. Now that we use GitHub Actions, we can leverage Codecov to 
> report the coverage for PySpark.
> See also https://github.com/codecov/codecov-action
> See also https://github.com/codecov/codecov-action



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390580#comment-17390580
 ] 

Apache Spark commented on SPARK-36092:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33591

> Migrate to GitHub Actions Codecov from Jenkins
> --
>
> Key: SPARK-36092
> URL: https://issues.apache.org/jira/browse/SPARK-36092
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We currently use the manual Codecov site to work around our Jenkins CI 
> security issue. Now that we use GitHub Actions, we can leverage Codecov to 
> report the coverage for PySpark.
> See also https://github.com/codecov/codecov-action
> See also https://github.com/codecov/codecov-action



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36092:


Assignee: (was: Apache Spark)

> Migrate to GitHub Actions Codecov from Jenkins
> --
>
> Key: SPARK-36092
> URL: https://issues.apache.org/jira/browse/SPARK-36092
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We currently use the manual Codecov site to work around our Jenkins CI 
> security issue. Now that we use GitHub Actions, we can leverage Codecov to 
> report the coverage for PySpark.
> See also https://github.com/codecov/codecov-action
> See also https://github.com/codecov/codecov-action



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-07-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36361:


 Summary: Install coverage in Python 3.9 and PyPy 3 in GitHub 
Actions image
 Key: SPARK-36361
 URL: https://issues.apache.org/jira/browse/SPARK-36361
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


SPARK-36092 requires the coverage package to be installed in both Python 3.9 
and PyPy. Currently it is installed manually.

To save installation time, it would be great to have it pre-installed in the 
image we use.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36360) StreamingSource duplicates appName

2021-07-30 Thread Marcel Neumann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Neumann updated SPARK-36360:
---
Description: The StreamingSource includes the appName in its sourceName. 
This is not desired for people using a custom namespace for metrics reporting 
via the {{spark.metrics.namespace}} configuration property, as 
{{spark.app.name}} will still be included in the metric name. Using a 
metrics namespace results in a duplicated indicator for {{spark.app.name}}.  
(was: The StreamingSource includes the appName in its sourceName. This is not 
desired for people using a custom namespace for metrics reporting via the 
{{spark.metrics.namespace}} configuration property, as {{spark.app.name}} 
will still be included in the metric name. Not using a metrics namespace 
results in a duplicated indicator for {{spark.app.name}}.)

> StreamingSource duplicates appName
> --
>
> Key: SPARK-36360
> URL: https://issues.apache.org/jira/browse/SPARK-36360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Marcel Neumann
>Priority: Minor
>
> The StreamingSource includes the appName in its sourceName. This is not 
> desired for people using a custom namespace for metrics reporting via the 
> {{spark.metrics.namespace}} configuration property, as {{spark.app.name}} 
> will still be included in the metric name. Using a metrics namespace thus 
> results in a duplicated indicator for {{spark.app.name}}.
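
As a rough illustration of how the duplication arises (not Spark's exact code 
path; the names below are made up), the reported metric name is the namespace 
prefix plus the source name, and StreamingSource's sourceName already embeds 
the appName:

{code:scala}
import com.codahale.metrics.MetricRegistry

val appName = "my-stream-app"                  // spark.app.name (assumed)
val namespace = appName                        // spark.metrics.namespace=${spark.app.name}
val sourceName = s"$appName.StreamingMetrics"  // sourceName embeds the appName
val full = MetricRegistry.name(namespace, "driver", sourceName,
  "streaming.lastCompletedBatch_processingDelay")
// => my-stream-app.driver.my-stream-app.StreamingMetrics.streaming.lastCompletedBatch_processingDelay
{code}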



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36360) StreamingSource duplicates appName

2021-07-30 Thread Marcel Neumann (Jira)
Marcel Neumann created SPARK-36360:
--

 Summary: StreamingSource duplicates appName
 Key: SPARK-36360
 URL: https://issues.apache.org/jira/browse/SPARK-36360
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Marcel Neumann


The StreamingSource includes the appName in its sourceName. This is not desired 
for people using a custom namespace for metrics reporting via the 
{{spark.metrics.namespace}} configuration property, as {{spark.app.name}} 
will still be included in the metric name. Not using a metrics namespace 
results in a duplicated indicator for {{spark.app.name}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36359) Coalesce returns the first expression if it is non nullable

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36359:


Assignee: Apache Spark

> Coalesce returns the first expression if it is non nullable
> ---
>
> Key: SPARK-36359
> URL: https://issues.apache.org/jira/browse/SPARK-36359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36254) Install mlflow/sklearn in Github Actions CI

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390538#comment-17390538
 ] 

Apache Spark commented on SPARK-36254:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33589

> Install mlflow/sklearn in Github Actions CI
> ---
>
> Key: SPARK-36254
> URL: https://issues.apache.org/jira/browse/SPARK-36254
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> Since pandas-on-Spark includes the mlflow features and related tests, we 
> should install mlflow and its dependencies in our Github Actions CI so that 
> the tests won't be skipped from Spark 3.2 onwards.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36359) Coalesce returns the first expression if it is non nullable

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390537#comment-17390537
 ] 

Apache Spark commented on SPARK-36359:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/33590

> Coalesce returns the first expression if it is non nullable
> ---
>
> Key: SPARK-36359
> URL: https://issues.apache.org/jira/browse/SPARK-36359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36359) Coalesce returns the first expression if it is non nullable

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36359:


Assignee: (was: Apache Spark)

> Coalesce returns the first expression if it is non nullable
> ---
>
> Key: SPARK-36359
> URL: https://issues.apache.org/jira/browse/SPARK-36359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36359) Coalesce returns the first expression if it is non nullable

2021-07-30 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36359:
---

 Summary: Coalesce returns the first expression if it is non 
nullable
 Key: SPARK-36359
 URL: https://issues.apache.org/jira/browse/SPARK-36359
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang
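
The description is empty; presumably the intended simplification is along 
these lines (a sketch of the idea, not the actual rule):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Coalesce, Literal}
import org.apache.spark.sql.types.IntegerType

// `a` can never be null, so Coalesce can never fall through to the literal:
// the whole expression is equivalent to just `a`.
val a = AttributeReference("a", IntegerType, nullable = false)()
val expr = Coalesce(Seq(a, Literal(0)))
{code}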






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-36358:
---
Description: This way [Retry HTTP operation in case IOException too 
(exponential backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] 
will be included  (was: This way 
[https://github.com/fabric8io/kubernetes-client/pull/3293|Retry HTTP operation 
in case IOException too (exponential backoff)] will be included)

> Upgrade Kubernetes Client Version to 5.6.0
> --
>
> Key: SPARK-36358
> URL: https://issues.apache.org/jira/browse/SPARK-36358
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> This way [Retry HTTP operation in case IOException too (exponential 
> backoff)|https://github.com/fabric8io/kubernetes-client/pull/3293] will be 
> included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36358) Upgrade Kubernetes Client Version to 5.6.0

2021-07-30 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-36358:
--

 Summary: Upgrade Kubernetes Client Version to 5.6.0
 Key: SPARK-36358
 URL: https://issues.apache.org/jira/browse/SPARK-36358
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros


This way [https://github.com/fabric8io/kubernetes-client/pull/3293|Retry HTTP 
operation in case IOException too (exponential backoff)] will be included



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28330) ANSI SQL: Top-level <result offset clause> in <query expression>

2021-07-30 Thread Alexander Bij (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390508#comment-17390508
 ] 

Alexander Bij commented on SPARK-28330:
---

I'm looking forward to this feature!

I noticed it is absent when using the DBeaver sql-client (Simba Spark driver) 
to look at table data: it downloads full result sets when viewing tables.

By comparison, Hive SQL has OFFSET implemented and working in DBeaver, which 
pages through results when browsing tables.

All the PRs are closed (not merged) and mention that the work was suspended 
(as of 27 April 2021).

_At least you can upvote the feature to raise its importance._

> ANSI SQL: Top-level <result offset clause> in <query expression>
> 
>
> Key: SPARK-28330
> URL: https://issues.apache.org/jira/browse/SPARK-28330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. {{LIMIT}} and {{OFFSET}}
> LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
> generated by the rest of the query:
> {noformat}
> SELECT select_list
> FROM table_expression
> [ ORDER BY ... ]
> [ LIMIT { number | ALL } ] [ OFFSET number ]
> {noformat}
> If a limit count is given, no more than that many rows will be returned (but 
> possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
> as omitting the LIMIT clause, as is LIMIT with a NULL argument.
> OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 
> is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.
> If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
> to count the LIMIT rows that are returned.
> https://www.postgresql.org/docs/11/queries-limit.html
> *Feature ID*: F861
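
A quick sketch of the requested syntax (PostgreSQL-style, per the description 
above; assumes a spark-shell session with an `events` table, and top-level 
OFFSET was not supported by Spark at the time of this thread):

{code:scala}
// Skip the first 20 rows, then return the next 10, in a stable order.
spark.sql("SELECT id FROM events ORDER BY id LIMIT 10 OFFSET 20").show()
{code}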



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36346) Support TimestampNTZ type in Orc file source

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36346:


Assignee: (was: Apache Spark)

> Support TimestampNTZ type in Orc file source
> 
>
> Key: SPARK-36346
> URL: https://issues.apache.org/jira/browse/SPARK-36346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> As per https://orc.apache.org/docs/types.html, Orc supports both 
> TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type):
> * A TIMESTAMP => TIMESTAMP_LTZ
> * Timestamp with local time zone => TIMESTAMP_NTZ
> In Spark 3.1 or prior, Spark only considered TIMESTAMP.
> Since 3.2, with the support of timestamp without time zone type:
> * Orc writer follows the definition and uses "Timestamp with local time zone" 
> on writing TIMESTAMP_NTZ.
> * Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36346) Support TimestampNTZ type in Orc file source

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36346:


Assignee: Apache Spark

> Support TimestampNTZ type in Orc file source
> 
>
> Key: SPARK-36346
> URL: https://issues.apache.org/jira/browse/SPARK-36346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> As per https://orc.apache.org/docs/types.html, Orc supports both 
> TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type):
> * A TIMESTAMP => TIMESTAMP_LTZ
> * Timestamp with local time zone => TIMESTAMP_NTZ
> In Spark 3.1 or prior, Spark only considered TIMESTAMP.
> Since 3.2, with the support of timestamp without time zone type:
> * Orc writer follows the definition and uses "Timestamp with local time zone" 
> on writing TIMESTAMP_NTZ.
> * Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36346) Support TimestampNTZ type in Orc file source

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390506#comment-17390506
 ] 

Apache Spark commented on SPARK-36346:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33588

> Support TimestampNTZ type in Orc file source
> 
>
> Key: SPARK-36346
> URL: https://issues.apache.org/jira/browse/SPARK-36346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> As per https://orc.apache.org/docs/types.html, Orc supports both 
> TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type):
> * A TIMESTAMP => TIMESTAMP_LTZ
> * Timestamp with local time zone => TIMESTAMP_NTZ
> In Spark 3.1 or prior, Spark only considered TIMESTAMP.
> Since 3.2, with the support of timestamp without time zone type:
> * Orc writer follows the definition and uses "Timestamp with local time zone" 
> on writing TIMESTAMP_NTZ.
> * Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36357) Support pushdown Timestamp with local time zone for orc

2021-07-30 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390504#comment-17390504
 ] 

jiaan.geng commented on SPARK-36357:


I'm working on it.

> Support pushdown Timestamp with local time zone for orc
> ---
>
> Key: SPARK-36357
> URL: https://issues.apache.org/jira/browse/SPARK-36357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36357) Support pushdown Timestamp with local time zone for orc

2021-07-30 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-36357:
--

 Summary: Support pushdown Timestamp with local time zone for orc
 Key: SPARK-36357
 URL: https://issues.apache.org/jira/browse/SPARK-36357
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36346) Support TimestampNTZ type in Orc file source

2021-07-30 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-36346:
---
Description: 
As per https://orc.apache.org/docs/types.html, Orc supports both TIMESTAMP_NTZ 
and TIMESTAMP_LTZ (Spark's current default timestamp type):

* A TIMESTAMP => TIMESTAMP_LTZ
* Timestamp with local time zone => TIMESTAMP_NTZ
In Spark 3.1 or prior, Spark only considered TIMESTAMP.
Since 3.2, with the support of timestamp without time zone type:
* Orc writer follows the definition and uses "Timestamp with local time zone" 
on writing TIMESTAMP_NTZ.
* Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.


  was:
As per https://orc.apache.org/docs/types.html, Orc supports both TIMESTAMP_NTZ 
and TIMESTAMP_LTZ (Spark's current default timestamp type):

A TIMESTAMP => TIMESTAMP_LTZ
Timestamp with local time zone => TIMESTAMP_NTZ
In Spark 3.1 or prior, Spark only considered TIMESTAMP.
Since 3.2, with the support of timestamp without time zone type:
* Orc writer follows the definition and uses "Timestamp with local time zone" 
on writing TIMESTAMP_NTZ.
* Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.



> Support TimestampNTZ type in Orc file source
> 
>
> Key: SPARK-36346
> URL: https://issues.apache.org/jira/browse/SPARK-36346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> As per https://orc.apache.org/docs/types.html, Orc supports both 
> TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type):
> * A TIMESTAMP => TIMESTAMP_LTZ
> * Timestamp with local time zone => TIMESTAMP_NTZ
> In Spark 3.1 or prior, Spark only considered TIMESTAMP.
> Since 3.2, with the support of timestamp without time zone type:
> * Orc writer follows the definition and uses "Timestamp with local time zone" 
> on writing TIMESTAMP_NTZ.
> * Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36346) Support TimestampNTZ type in Orc file source

2021-07-30 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-36346:
---
Description: 
As per https://orc.apache.org/docs/types.html, Orc supports both TIMESTAMP_NTZ 
and TIMESTAMP_LTZ (Spark's current default timestamp type):

A TIMESTAMP => TIMESTAMP_LTZ
Timestamp with local time zone => TIMESTAMP_NTZ
In Spark 3.1 or prior, Spark only considered TIMESTAMP.
Since 3.2, with the support of timestamp without time zone type:
* Orc writer follows the definition and uses "Timestamp with local time zone" 
on writing TIMESTAMP_NTZ.
* Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.


> Support TimestampNTZ type in Orc file source
> 
>
> Key: SPARK-36346
> URL: https://issues.apache.org/jira/browse/SPARK-36346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> As per https://orc.apache.org/docs/types.html, Orc supports both 
> TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type):
> A TIMESTAMP => TIMESTAMP_LTZ
> Timestamp with local time zone => TIMESTAMP_NTZ
> In Spark 3.1 or prior, Spark only considered TIMESTAMP.
> Since 3.2, with the support of timestamp without time zone type:
> * Orc writer follows the definition and uses "Timestamp with local time zone" 
> on writing TIMESTAMP_NTZ.
> * Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36356) RemoveRedundantAlias should keep output schema

2021-07-30 Thread angerszhu (Jira)
angerszhu created SPARK-36356:
-

 Summary: RemoveRedundantAlias  should keep output schema
 Key: SPARK-36356
 URL: https://issues.apache.org/jira/browse/SPARK-36356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu
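
The issue body is empty, but the concern can be illustrated with a sketch 
(assumes a running SparkSession named `spark`, as in spark-shell): removing a 
"redundant" alias must not change the reported output schema. Here the alias 
only changes the case of the name, so dropping it would silently rename the 
output column:

{code:scala}
import spark.implicits._

val df = spark.range(1).select($"id".as("ID"))
df.printSchema()  // should still report "ID" after RemoveRedundantAlias runs
{code}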






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-07-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-36353:
--
Description: 
!image-2021-07-30-17-46-59-196.png|width=539,height=220!

[https://github.com/apache/spark/pull/33587]

 

Only first level?

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-07-30-17-46-59-196.png
>
>
> !image-2021-07-30-17-46-59-196.png|width=539,height=220!
> [https://github.com/apache/spark/pull/33587]
>  
> Only first level?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-07-30 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390458#comment-17390458
 ] 

angerszhu commented on SPARK-36353:
---

I will raise a PR soon.

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-07-30-17-46-59-196.png
>
>
> !image-2021-07-30-17-46-59-196.png|width=539,height=220!
> [https://github.com/apache/spark/pull/33587]
>  
> Only first level?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-07-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-36353:
--
Attachment: image-2021-07-30-17-46-59-196.png

> RemoveNoopOperators should keep output schema
> -
>
> Key: SPARK-36353
> URL: https://issues.apache.org/jira/browse/SPARK-36353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-07-30-17-46-59-196.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36355) NamedExpression add method `withName(newName: String)`

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390456#comment-17390456
 ] 

Apache Spark commented on SPARK-36355:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33587

> NamedExpression add method `withName(newName: String)`
> --
>
> Key: SPARK-36355
> URL: https://issues.apache.org/jira/browse/SPARK-36355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36355) NamedExpression add method `withName(newName: String)`

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36355:


Assignee: Apache Spark

> NamedExpression add method `withName(newName: String)`
> --
>
> Key: SPARK-36355
> URL: https://issues.apache.org/jira/browse/SPARK-36355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36355) NamedExpression add method `withName(newName: String)`

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36355:


Assignee: (was: Apache Spark)

> NamedExpression add method `withName(newName: String)`
> --
>
> Key: SPARK-36355
> URL: https://issues.apache.org/jira/browse/SPARK-36355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36355) NamedExpression add method `withName(newName: String)`

2021-07-30 Thread angerszhu (Jira)
angerszhu created SPARK-36355:
-

 Summary: NamedExpression add method `withName(newName: String)`
 Key: SPARK-36355
 URL: https://issues.apache.org/jira/browse/SPARK-36355
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu
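
The body is empty; judging from the title, the intent is presumably a helper 
like the following (an illustrative sketch only of what renaming would 
preserve, not the actual API):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}

// A withName helper for an Alias would keep the child and exprId and change
// only the name, so plans can restore their original output schema.
val old = Alias(Literal(1), "oldName")()
val renamed = Alias(old.child, "newName")(exprId = old.exprId)
{code}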






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36065) date_trunc returns incorrect output

2021-07-30 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390421#comment-17390421
 ] 

Peter Toth commented on SPARK-36065:


I think the output is correct: there was a time zone change (+00:02:16) at 
1891-10-01 00:00:00 in Bratislava, which means that local 1891-10-01 00:00:00 
corresponds to 1891-10-01 00:02:16.
I found this site that shows the TZ changes: 
https://www.timeanddate.com/time/zone/slovakia/bratislava
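
A quick way to verify this from a Scala REPL (assumes the JVM's tzdata carries 
the rule; Europe/Bratislava is an alias of Europe/Prague):

{code:scala}
import java.time.ZoneId

val rules = ZoneId.of("Europe/Bratislava").getRules
// Print the 19th-century offset transitions; expect one at 1891-10-01 going
// from the local mean time offset (+00:57:44) to +01:00, i.e. the +00:02:16
// shift mentioned above.
rules.getTransitions.forEach { t =>
  if (t.getInstant.toString.startsWith("18")) println(t)
}
{code}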

> date_trunc returns incorrect output
> ---
>
> Key: SPARK-36065
> URL: https://issues.apache.org/jira/browse/SPARK-36065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sumeet
>Priority: Major
>  Labels: date_trunc, sql, timestamp
>
> Hi,
> Running date_trunc on any hour of "1891-10-01" returns incorrect output for 
> "Europe/Bratislava" timezone.
> Use the following steps in order to reproduce the issue:
>  * Run spark-shell using:
> {code:java}
> TZ="Europe/Bratislava" ./bin/spark-shell --conf 
> spark.driver.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.sql.session.timeZone="Europe/Bratislava"{code}
>  * Generate test data:
> {code:java}
> ((0 until 9).map(hour => s"1891-10-01 00:0$hour:00") ++ (10 until 
> 24).map(hour => s"1891-10-01 
> 00:$hour:00")).toDF("ts_string").createOrReplaceTempView("temp_ts")
> {code}
>  * Run query:
> {code:java}
> sql("select ts_string, cast(ts_string as TIMESTAMP) as ts, date_trunc('day', 
> ts_string) from temp_ts").show(false)
> {code}
>  * Output:
> {code:java}
> +---+---+--+
> |ts_string  |ts |date_trunc(day, ts_string)|
> +---+---+--+
> |1891-10-01 00:00:00|1891-10-01 00:02:16|1891-10-01 00:02:16   |
> |1891-10-01 00:01:00|1891-10-01 00:03:16|1891-10-01 00:02:16   |
> |1891-10-01 00:02:00|1891-10-01 00:04:16|1891-10-01 00:02:16   |
> |1891-10-01 00:03:00|1891-10-01 00:03:00|1891-10-01 00:02:16   |
> |1891-10-01 00:04:00|1891-10-01 00:04:00|1891-10-01 00:02:16   |
> |1891-10-01 00:05:00|1891-10-01 00:05:00|1891-10-01 00:02:16   |
> |1891-10-01 00:06:00|1891-10-01 00:06:00|1891-10-01 00:02:16   |
> |1891-10-01 00:07:00|1891-10-01 00:07:00|1891-10-01 00:02:16   |
> |1891-10-01 00:08:00|1891-10-01 00:08:00|1891-10-01 00:02:16   |
> |1891-10-01 00:10:00|1891-10-01 00:10:00|1891-10-01 00:02:16   |
> |1891-10-01 00:11:00|1891-10-01 00:11:00|1891-10-01 00:02:16   |
> |1891-10-01 00:12:00|1891-10-01 00:12:00|1891-10-01 00:02:16   |
> |1891-10-01 00:13:00|1891-10-01 00:13:00|1891-10-01 00:02:16   |
> |1891-10-01 00:14:00|1891-10-01 00:14:00|1891-10-01 00:02:16   |
> |1891-10-01 00:15:00|1891-10-01 00:15:00|1891-10-01 00:02:16   |
> |1891-10-01 00:16:00|1891-10-01 00:16:00|1891-10-01 00:02:16   |
> |1891-10-01 00:17:00|1891-10-01 00:17:00|1891-10-01 00:02:16   |
> |1891-10-01 00:18:00|1891-10-01 00:18:00|1891-10-01 00:02:16   |
> |1891-10-01 00:19:00|1891-10-01 00:19:00|1891-10-01 00:02:16   |
> |1891-10-01 00:20:00|1891-10-01 00:20:00|1891-10-01 00:02:16   |
> +---+---+--+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36354) EventLogFileReaders should not complain in case of no event log files

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390390#comment-17390390
 ] 

Apache Spark commented on SPARK-36354:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33586

> EventLogFileReaders should not complain in case of no event log files
> -
>
> Key: SPARK-36354
> URL: https://issues.apache.org/jira/browse/SPARK-36354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> 21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log 
> s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
> java.lang.IllegalArgumentException: requirement failed: Log directory must 
> contain at least one event log file!
> at scala.Predef$.require(Predef.scala:281)
> at 
> org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
> {code}
> {code}
> $ aws s3 ls s3://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771/
> 2021-06-26 22:31:40  0 
> appstatus_spark-95b5c736c8e44037afcf152534d08771.inprogress
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36354) EventLogFileReaders should not complain in case of no event log files

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36354:


Assignee: Apache Spark

> EventLogFileReaders should not complain in case of no event log files
> -
>
> Key: SPARK-36354
> URL: https://issues.apache.org/jira/browse/SPARK-36354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> 21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log 
> s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
> java.lang.IllegalArgumentException: requirement failed: Log directory must 
> contain at least one event log file!
> at scala.Predef$.require(Predef.scala:281)
> at 
> org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
> {code}
> {code}
> $ aws s3 ls s3://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771/
> 2021-06-26 22:31:40  0 
> appstatus_spark-95b5c736c8e44037afcf152534d08771.inprogress
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36354) EventLogFileReaders should not complain in case of no event log files

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36354:


Assignee: (was: Apache Spark)

> EventLogFileReaders should not complain in case of no event log files
> -
>
> Key: SPARK-36354
> URL: https://issues.apache.org/jira/browse/SPARK-36354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> 21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log 
> s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
> java.lang.IllegalArgumentException: requirement failed: Log directory must 
> contain at least one event log file!
> at scala.Predef$.require(Predef.scala:281)
> at 
> org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
> {code}
> {code}
> $ aws s3 ls s3://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771/
> 2021-06-26 22:31:40  0 
> appstatus_spark-95b5c736c8e44037afcf152534d08771.inprogress
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36354) EventLogFileReaders should not complain in case of no event log files

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390392#comment-17390392
 ] 

Apache Spark commented on SPARK-36354:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33586

> EventLogFileReaders should not complain in case of no event log files
> -
>
> Key: SPARK-36354
> URL: https://issues.apache.org/jira/browse/SPARK-36354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> 21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log 
> s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
> java.lang.IllegalArgumentException: requirement failed: Log directory must 
> contain at least one event log file!
> at scala.Predef$.require(Predef.scala:281)
> at 
> org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
> {code}
> {code}
> $ aws s3 ls s3://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771/
> 2021-06-26 22:31:40  0 
> appstatus_spark-95b5c736c8e44037afcf152534d08771.inprogress
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36354) EventLogFileReaders should not complain in case of no event log files

2021-07-30 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36354:
-

 Summary: EventLogFileReaders should not complain in case of no 
event log files
 Key: SPARK-36354
 URL: https://issues.apache.org/jira/browse/SPARK-36354
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2, 3.2.0
Reporter: Dongjoon Hyun


{code}
21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log 
s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
java.lang.IllegalArgumentException: requirement failed: Log directory must 
contain at least one event log file!
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
{code}

{code}
$ aws s3 ls s3://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771/
2021-06-26 22:31:40  0 
appstatus_spark-95b5c736c8e44037afcf152534d08771.inprogress
{code}
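
A hypothetical shape of the guard (not necessarily the actual patch): treat a 
rolling event log directory that only contains the appstatus marker as not 
ready yet, instead of failing the whole listing:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Only consider a rolling event log dir readable once it contains at least
// one "events_" file; otherwise callers can skip it quietly.
def hasEventLogFiles(dir: Path, conf: Configuration): Boolean = {
  val fs = dir.getFileSystem(conf)
  fs.listStatus(dir).exists(_.getPath.getName.startsWith("events_"))
}
{code}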



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35976) Adjust `astype` method for ExtensionDtype in pandas API on Spark

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390374#comment-17390374
 ] 

Apache Spark commented on SPARK-35976:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33585

> Adjust `astype` method for ExtensionDtype in pandas API on Spark
> 
>
> Key: SPARK-35976
> URL: https://issues.apache.org/jira/browse/SPARK-35976
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, `astype` method for ExtensionDtype in pandas API on Spark is not 
> consistent with pandas. For example, 
> [https://github.com/apache/spark/pull/33095#discussion_r661704734.]
> [https://github.com/apache/spark/pull/33095#discussion_r662623005.]
>  
> We ought to fill in the gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35976) Adjust `astype` method for ExtensionDtype in pandas API on Spark

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390372#comment-17390372
 ] 

Apache Spark commented on SPARK-35976:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33585

> Adjust `astype` method for ExtensionDtype in pandas API on Spark
> 
>
> Key: SPARK-35976
> URL: https://issues.apache.org/jira/browse/SPARK-35976
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, `astype` method for ExtensionDtype in pandas API on Spark is not 
> consistent with pandas. For example, 
> [https://github.com/apache/spark/pull/33095#discussion_r661704734.]
> [https://github.com/apache/spark/pull/33095#discussion_r662623005.]
>  
> We ought to fill in the gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35976) Adjust `astype` method for ExtensionDtype in pandas API on Spark

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35976:


Assignee: Apache Spark

> Adjust `astype` method for ExtensionDtype in pandas API on Spark
> 
>
> Key: SPARK-35976
> URL: https://issues.apache.org/jira/browse/SPARK-35976
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, `astype` method for ExtensionDtype in pandas API on Spark is not 
> consistent with pandas. For example, 
> [https://github.com/apache/spark/pull/33095#discussion_r661704734.]
> [https://github.com/apache/spark/pull/33095#discussion_r662623005.]
>  
> We ought to fill in the gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35976) Adjust `astype` method for ExtensionDtype in pandas API on Spark

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35976:


Assignee: (was: Apache Spark)

> Adjust `astype` method for ExtensionDtype in pandas API on Spark
> 
>
> Key: SPARK-35976
> URL: https://issues.apache.org/jira/browse/SPARK-35976
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, `astype` method for ExtensionDtype in pandas API on Spark is not 
> consistent with pandas. For example, 
> [https://github.com/apache/spark/pull/33095#discussion_r661704734.]
> [https://github.com/apache/spark/pull/33095#discussion_r662623005.]
>  
> We ought to fill in the gap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36351) Separate partition filters and data filters in PushDownUtils

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390364#comment-17390364
 ] 

Apache Spark commented on SPARK-36351:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33584

> Separate partition filters and data filters in PushDownUtils
> 
>
> Key: SPARK-36351
> URL: https://issues.apache.org/jira/browse/SPARK-36351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Currently, DSv2 partition filters and data filters are separated in 
> PruneFileSourcePartitions. It's better to separate these in PushDownUtils, 
> where we do filter/aggregate push down and column pruning, so we can still 
> push down aggregate for FileScan if the filters are only partition filters.
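
A sketch of the separation itself (assuming `partitionColumns` is the table's 
partition attribute set; the real change in PushDownUtils may differ):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}

// A filter is a partition filter iff every attribute it references is a
// partition column; everything else is a data filter.
def splitFilters(
    filters: Seq[Expression],
    partitionColumns: AttributeSet): (Seq[Expression], Seq[Expression]) =
  filters.partition(_.references.subsetOf(partitionColumns))
{code}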



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36351) Separate partition filters and data filters in PushDownUtils

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36351:


Assignee: (was: Apache Spark)

> Separate partition filters and data filters in PushDownUtils
> 
>
> Key: SPARK-36351
> URL: https://issues.apache.org/jira/browse/SPARK-36351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Currently, DSv2 partition filters and data filters are separated in 
> PruneFileSourcePartitions. It's better to separate these in PushDownUtils, 
> where we do filter/aggregate push down and column pruning, so we can still 
> push down aggregate for FileScan if the filters are only partition filters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36351) Separate partition filters and data filters in PushDownUtils

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36351:


Assignee: Apache Spark

> Separate partition filters and data filters in PushDownUtils
> 
>
> Key: SPARK-36351
> URL: https://issues.apache.org/jira/browse/SPARK-36351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, DSv2 partition filters and data filters are separated in 
> PruneFileSourcePartitions. It's better to separate these in PushDownUtils, 
> where we do filter/aggregate push down and column pruning, so we can still 
> push down aggregate for FileScan if the filters are only partition filters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390358#comment-17390358
 ] 

Dongjoon Hyun edited comment on SPARK-36327 at 7/30/21, 7:22 AM:
-

I commented on the PR and looped in other reviewers, too, [~senthh].


was (Author: dongjoon):
I commented on the PR and looped in other reviewers, too.

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has not been properly handled for the viewFS 
> condition.
> The parent path (db path) is passed rather than the actual directory (table 
> location).
> {code:scala}
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
>     path: Path,
>     hadoopConf: Configuration,
>     stagingDir: String): Path = {
>   val extURI: URI = path.toUri
>   if (extURI.getScheme == "viewfs") {
>     getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
>   } else {
>     new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
>   }
> }
> {code}
> Please refer to the lines below:
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390358#comment-17390358
 ] 

Dongjoon Hyun commented on SPARK-36327:
---

I commented on the PR and looped in other reviewers, too.

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has not been properly handled for the viewFS 
> condition.
> The parent path (db path) is passed rather than the actual directory (table 
> location).
> {code:scala}
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
>     path: Path,
>     hadoopConf: Configuration,
>     stagingDir: String): Path = {
>   val extURI: URI = path.toUri
>   if (extURI.getScheme == "viewfs") {
>     getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
>   } else {
>     new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
>   }
> }
> {code}
> Please refer to the lines below:
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36254) Install mlflow in Github Actions CI

2021-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36254.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33567
[https://github.com/apache/spark/pull/33567]

> Install mlflow in Github Actions CI
> ---
>
> Key: SPARK-36254
> URL: https://issues.apache.org/jira/browse/SPARK-36254
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> Since pandas-on-Spark includes the mlflow features and related tests, we 
> should install mlflow and its dependencies in our Github Actions CI so that 
> the tests won't be skipped from Spark 3.2 onwards.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36254) Install mlflow/sklearn in Github Actions CI

2021-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36254:
--
Summary: Install mlflow/sklearn in Github Actions CI  (was: Install mlflow 
in Github Actions CI)

> Install mlflow/sklearn in Github Actions CI
> ---
>
> Key: SPARK-36254
> URL: https://issues.apache.org/jira/browse/SPARK-36254
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> Since pandas-on-Spark includes the mlflow features and related tests, we 
> should install mlflow and its dependencies in our GitHub Actions CI so that 
> the tests won't be skipped from Spark 3.2 onwards.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390341#comment-17390341
 ] 

Dongjoon Hyun commented on SPARK-36345:
---

Thank you for reporting. I revised the title and will take care of this, 
[~itholic] and [~hyukjin.kwon].

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36254) Install mlflow in Github Actions CI

2021-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36254:
-

Assignee: Haejoon Lee

> Install mlflow in Github Actions CI
> ---
>
> Key: SPARK-36254
> URL: https://issues.apache.org/jira/browse/SPARK-36254
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Since pandas-on-Spark includes the mlflow features and related tests, we 
> should install mlflow and its dependencies in our GitHub Actions CI so that 
> the tests won't be skipped from Spark 3.2 onwards.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36345) Add mlflow/sklearn to GHA docker image

2021-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36345:
--
Summary: Add mlflow/sklearn to GHA docker image  (was: Create docker image 
that has mlflow and sklearn.)

> Add mlflow/sklearn to GHA docker image
> --
>
> Key: SPARK-36345
> URL: https://issues.apache.org/jira/browse/SPARK-36345
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> In GitHub Actions CI, we install `mlflow>=1.0` and `sklearn` in the step 
> "List Python packages (Python 3.9)" of the "pyspark" job.
>  
> We can reduce the cost of CI by creating an image that has both packages 
> pre-installed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36093) (RemoveRedundantAliases should keep output schema name) The result incorrect if the partition path case is inconsistent

2021-07-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-36093:
--
Summary: (RemoveRedundantAliases should keep output schema name) The result 
incorrect if the partition path case is inconsistent  (was: The result 
incorrect if the partition path case is inconsistent)

> (RemoveRedundantAliases should keep output schema name) The result incorrect 
> if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Please reproduce this issue using HDFS; local HDFS cannot reproduce it.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> +----+----------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}
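
A hedged reading of the name-preservation aspect the revised title points at 
(an illustration inferred from the title and the repro above, not from the 
actual patch): an alias that only changes the case of a column name is 
redundant expression-wise, but it carries the user-visible output name, and 
the partition directory written on a case-sensitive filesystem like HDFS is 
built from that name.

{code:scala}
// In spark-shell, after the setup above: an alias that only changes case.
val df = spark.sql("SELECT cal_dt AS CAL_DT FROM t1_v")

// If an optimizer rule such as RemoveRedundantAliases drops the alias, the
// output name silently reverts to "cal_dt", and a partition path derived
// from it (CAL_DT=... vs cal_dt=...) no longer matches on HDFS.
df.queryExecution.optimizedPlan.output.map(_.name)
// expected: Seq("CAL_DT")
{code}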



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36353) RemoveNoopOperators should keep output schema

2021-07-30 Thread angerszhu (Jira)
angerszhu created SPARK-36353:
-

 Summary: RemoveNoopOperators should keep output schema
 Key: SPARK-36353
 URL: https://issues.apache.org/jira/browse/SPARK-36353
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name

2021-07-30 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390338#comment-17390338
 ] 

angerszhu commented on SPARK-36352:
---

RemoveNoopOperators

CollapseProject
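
A minimal sketch of the invariant the issue title suggests checking (an 
illustration, not the actual patch; the two rules named above are presumably 
ones that can alter output names):

{code:scala}
// In spark-shell: the optimizer should preserve the analyzed plan's
// user-visible output schema names.
val df = spark.sql("SELECT cal_dt AS CAL_DT FROM t1")

val analyzedNames  = df.queryExecution.analyzed.output.map(_.name)
val optimizedNames = df.queryExecution.optimizedPlan.output.map(_.name)

// The check the title asks for: the result plan keeps the schema names.
assert(analyzedNames == optimizedNames,
  s"output schema name changed: $analyzedNames -> $optimizedNames")
{code}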

> Spark should check result plan's output schema name
> ---
>
> Key: SPARK-36352
> URL: https://issues.apache.org/jira/browse/SPARK-36352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36352) Spark should check result plan's output schema name

2021-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390329#comment-17390329
 ] 

Apache Spark commented on SPARK-36352:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33583

> Spark should check result plan's output schema name
> ---
>
> Key: SPARK-36352
> URL: https://issues.apache.org/jira/browse/SPARK-36352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36352) Spark should check result plan's output schema name

2021-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36352:


Assignee: Apache Spark

> Spark should check result plan's output schema name
> ---
>
> Key: SPARK-36352
> URL: https://issues.apache.org/jira/browse/SPARK-36352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


