[jira] [Updated] (SPARK-33668) Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties."

2020-12-04 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-33668:

Description: 
The test is flaky, with multiple failed instances; the reason for the 
failure has been similar to:
{code:java}
  The code passed to eventually never returned normally. Attempted 109 times 
over 3.007988241397 minutes. Last failure message: Failure executing: GET 
at: 
https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false.
 Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: 
Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, 
kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, 
uid=null, additionalProperties={}), kind=Status, message=pods 
"spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=NotFound, status=Failure, 
additionalProperties={}).. (KubernetesSuite.scala:402)
{code}

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console

From the above failures, it seems that the executor finishes too quickly and is 
removed by Spark before the test can complete. 

So, in order to mitigate this situation, one way is to use the flag

{code}
   "spark.kubernetes.executor.deleteOnTermination"
{code}
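A minimal sketch of how this could look in the test's Spark configuration, 
assuming the mitigation is to set the flag to false so that finished executor 
pods stay around long enough for their logs to be fetched (not the actual 
patch):

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical sketch, not the actual fix: keep executor pods after they
// terminate so the test can still read their logs while `eventually` polls.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.deleteOnTermination", "false")
{code}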

  was:
The test is flaking at more than one instance, and the reason for the 
failure is
{code:java}
  The code passed to eventually never returned normally. Attempted 109 times 
over 3.007988241397 minutes. Last failure message: Failure executing: GET 
at: 
https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false.
 Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: 
Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, 
kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, 
uid=null, additionalProperties={}), kind=Status, message=pods 
"spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=NotFound, status=Failure, 
additionalProperties={}).. (KubernetesSuite.scala:402)
{code}

From the above failure, it seems that the executor finishes too quickly and is 
removed by Spark before the test can complete. 

So, in order to mitigate this situation, one way is to use the flag

{code}
   "spark.kubernetes.executor.deleteOnTermination"
{code}


> Fix flaky test "Verify logging configuration is picked from the provided 
> SPARK_CONF_DIR/log4j.properties."
> --
>
> Key: SPARK-33668
> URL: https://issues.apache.org/jira/browse/SPARK-33668
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> The test is flaky, with multiple failed instances; the reason for the 
> failure has been similar to:
> {code:java}
>   The code passed to eventually never returned normally. Attempted 109 times 
> over 3.007988241397 minutes. Last failure message: Failure executing: GET 
> at: 
> https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false.
>  Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: 
> Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, 
> kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, 
> uid=null, additionalProperties={}), kind=Status, message=pods 
> "spark-pi-97a9bc76308e7fe3-exec-1" not found, 
> metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=NotFound, status=Failure, additionalProperties={}).. 
> (KubernetesSuite.scala:402)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console
> From the above failures, it seems that the executor finishes too quickly and is 

[jira] [Created] (SPARK-33668) Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties."

2020-12-04 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-33668:
---

 Summary: Fix flaky test "Verify logging configuration is picked 
from the provided SPARK_CONF_DIR/log4j.properties."
 Key: SPARK-33668
 URL: https://issues.apache.org/jira/browse/SPARK-33668
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Prashant Sharma


The test is flaking at more than one instance, and the reason for the 
failure is
{code:java}
  The code passed to eventually never returned normally. Attempted 109 times 
over 3.007988241397 minutes. Last failure message: Failure executing: GET 
at: 
https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false.
 Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: 
Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, 
kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, 
uid=null, additionalProperties={}), kind=Status, message=pods 
"spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=NotFound, status=Failure, 
additionalProperties={}).. (KubernetesSuite.scala:402)
{code}

From the above failure, it seems that the executor finishes too quickly and is 
removed by Spark before the test can complete. 

So, in order to mitigate this situation, one way is to use the flag

{code}
   "spark.kubernetes.executor.deleteOnTermination"
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244426#comment-17244426
 ] 

Apache Spark commented on SPARK-33667:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30615

> Respect case sensitivity in V1 SHOW PARTITIONS
> --
>
> Key: SPARK-33667
> URL: https://issues.apache.org/jira/browse/SPARK-33667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
> *spark.sql.caseSensitive* which is false by default, for instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > PARTITIONED BY (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
> Error in query: Non-partitioning column(s) [YEAR, Month] are specified for 
> SHOW PARTITIONS;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244425#comment-17244425
 ] 

Apache Spark commented on SPARK-33667:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30615

> Respect case sensitivity in V1 SHOW PARTITIONS
> --
>
> Key: SPARK-33667
> URL: https://issues.apache.org/jira/browse/SPARK-33667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
> *spark.sql.caseSensitive* which is false by default, for instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > PARTITIONED BY (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
> Error in query: Non-partitioning column(s) [YEAR, Month] are specified for 
> SHOW PARTITIONS;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33667:


Assignee: (was: Apache Spark)

> Respect case sensitivity in V1 SHOW PARTITIONS
> --
>
> Key: SPARK-33667
> URL: https://issues.apache.org/jira/browse/SPARK-33667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
> *spark.sql.caseSensitive* which is false by default, for instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > PARTITIONED BY (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
> Error in query: Non-partitioning column(s) [YEAR, Month] are specified for 
> SHOW PARTITIONS;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33667:


Assignee: Apache Spark

> Respect case sensitivity in V1 SHOW PARTITIONS
> --
>
> Key: SPARK-33667
> URL: https://issues.apache.org/jira/browse/SPARK-33667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
> *spark.sql.caseSensitive* which is false by default, for instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > PARTITIONED BY (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
> Error in query: Non-partitioning column(s) [YEAR, Month] are specified for 
> SHOW PARTITIONS;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33667:
---
Description: 
SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
*spark.sql.caseSensitive* which is false by default, for instance:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW 
PARTITIONS;
{code}
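As an illustration only (not the actual patch), the partition spec could be 
normalized against the table's partition columns with a resolver that honors 
*spark.sql.caseSensitive*; the helper below is hypothetical:

{code:scala}
// Hypothetical helper: map user-supplied partition spec keys onto the table's
// partition column names, case-insensitively when spark.sql.caseSensitive=false.
def normalizePartitionSpec(
    spec: Map[String, String],
    partitionColumns: Seq[String],
    caseSensitive: Boolean): Map[String, String] = {
  val resolver: (String, String) => Boolean =
    if (caseSensitive) _ == _ else _.equalsIgnoreCase(_)
  spec.map { case (key, value) =>
    val col = partitionColumns.find(resolver(_, key)).getOrElse(
      throw new IllegalArgumentException(s"$key is not a partitioning column"))
    col -> value
  }
}

// normalizePartitionSpec(Map("YEAR" -> "2015", "Month" -> "1"),
//   Seq("year", "month"), caseSensitive = false)
// => Map("year" -> "2015", "month" -> "1")
{code}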
 

  was:
SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
*spark.sql.caseSensitive* which is true by default, for instance:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW 
PARTITIONS;
{code}
 


> Respect case sensitivity in V1 SHOW PARTITIONS
> --
>
> Key: SPARK-33667
> URL: https://issues.apache.org/jira/browse/SPARK-33667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
> *spark.sql.caseSensitive* which is false by default, for instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > PARTITIONED BY (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
> Error in query: Non-partitioning column(s) [YEAR, Month] are specified for 
> SHOW PARTITIONS;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33667:
--

 Summary: Respect case sensitivity in V1 SHOW PARTITIONS
 Key: SPARK-33667
 URL: https://issues.apache.org/jira/browse/SPARK-33667
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Maxim Gekk


SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
*spark.sql.caseSensitive* which is true by default, for instance:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW 
PARTITIONS;
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33614:


Assignee: (was: Apache Spark)

> Fix the constant folding rule to skip it if the expression fails to execute
> ---
>
> Key: SPARK-33614
> URL: https://issues.apache.org/jira/browse/SPARK-33614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33614:


Assignee: Apache Spark

> Fix the constant folding rule to skip it if the expression fails to execute
> ---
>
> Key: SPARK-33614
> URL: https://issues.apache.org/jira/browse/SPARK-33614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244402#comment-17244402
 ] 

Apache Spark commented on SPARK-33614:
--

User 'luluorta' has created a pull request for this issue:
https://github.com/apache/spark/pull/30614

> Fix the constant folding rule to skip it if the expression fails to execute
> ---
>
> Key: SPARK-33614
> URL: https://issues.apache.org/jira/browse/SPARK-33614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244401#comment-17244401
 ] 

Apache Spark commented on SPARK-33614:
--

User 'luluorta' has created a pull request for this issue:
https://github.com/apache/spark/pull/30614

> Fix the constant folding rule to skip it if the expression fails to execute
> ---
>
> Key: SPARK-33614
> URL: https://issues.apache.org/jira/browse/SPARK-33614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33666) Fix Flaky Test: HiveThriftHttpServerSuite

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33666:


Assignee: (was: Apache Spark)

> Fix Flaky Test: HiveThriftHttpServerSuite
> -
>
> Key: SPARK-33666
> URL: https://issues.apache.org/jira/browse/SPARK-33666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33666) Fix Flaky Test: HiveThriftHttpServerSuite

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244387#comment-17244387
 ] 

Apache Spark commented on SPARK-33666:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30613

> Fix Flaky Test: HiveThriftHttpServerSuite
> -
>
> Key: SPARK-33666
> URL: https://issues.apache.org/jira/browse/SPARK-33666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33666) Fix Flaky Test: HiveThriftHttpServerSuite

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33666:


Assignee: Apache Spark

> Fix Flaky Test: HiveThriftHttpServerSuite
> -
>
> Key: SPARK-33666
> URL: https://issues.apache.org/jira/browse/SPARK-33666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33666) Fix Flaky Test: HiveThriftHttpServerSuite

2020-12-04 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33666:
-

 Summary: Fix Flaky Test: HiveThriftHttpServerSuite
 Key: SPARK-33666
 URL: https://issues.apache.org/jira/browse/SPARK-33666
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.1.0, 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33651) allow CREATE EXTERNAL TABLE with LOCATION for data source tables

2020-12-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33651.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30595
[https://github.com/apache/spark/pull/30595]
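For context, a hedged sketch of the kind of statement this change permits for 
data source tables; the table name and path below are placeholders, not taken 
from the PR:

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustration only: CREATE EXTERNAL TABLE with an explicit LOCATION for a
// data source (USING) table. Table name and path are placeholders.
val spark = SparkSession.builder().master("local[*]").appName("ext-table-demo").getOrCreate()
spark.sql(
  """CREATE EXTERNAL TABLE ext_tbl (id INT, name STRING)
    |USING parquet
    |LOCATION '/tmp/ext_tbl'""".stripMargin)
{code}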

> allow CREATE EXTERNAL TABLE with LOCATION for data source tables
> 
>
> Key: SPARK-33651
> URL: https://issues.apache.org/jira/browse/SPARK-33651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-12-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244356#comment-17244356
 ] 

Dongjoon Hyun edited comment on SPARK-33564 at 12/4/20, 11:43 PM:
--

Here are the short answers.

Yes, it does. 
> Just to understand, the configuration needs to be set up before 
> start-master.sh?

Apache Spark master/worker metric system is not designed like that. It's not 
per-application metrics, is it?
> If that is the case, how do I change metrics configs between applications?

Yes. You need to setup both. The collection depends on your collection system.
> Is it possible to run one application using PrometheusServlet and another 
> application to use a different sink on this same cluster?

`metrics.properties` is documented here 
(http://spark.apache.org/docs/1.0.2/monitoring.html) since 1.0 .
> Also, is there documentation about the subject? Because nowhere it is 
> mentioned that the conf should be set before starting the cluster.

Prometheus metrics are for standalone and K8s clusters.
> Final question: how to achieve this using YARN?


was (Author: dongjoon):
Here are the short answers.

Yes, it does. 
> Just to understand, the configuration needs to be set up before 
> start-master.sh?

Apache Spark master/worker metric system is not designed like that. It's not 
per-application metrics, is it?
> If that is the case, how do I change metrics configs between applications?

Yes. You need to setup both. The collection depends on your collection system.
> Is it possible to run one application using PrometheusServlet and another 
> application to use a different sink on this same cluster?

`metrics.properties` is documented here 
(http://spark.apache.org/docs/1.0.2/monitoring.html) since 1.0 .
> Also, is there documentation about the subject? Because nowhere it is 
> mentioned that the conf should be set before starting the cluster.

Prometheus metrics are for standalone and K8s clusters.
> Final question: how to achieve this using YARN?

Master/Worker are Spark standalone deployment. It's irrelevant to YARN, isn't 
it?
> Do I have to have the metrics config set before launching YARN?

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> 

[jira] [Comment Edited] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-12-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244356#comment-17244356
 ] 

Dongjoon Hyun edited comment on SPARK-33564 at 12/4/20, 11:42 PM:
--

Here are the short answers.

Yes, it does. 
> Just to understand, the configuration needs to be set up before 
> start-master.sh?

Apache Spark master/worker metric system is not designed like that. It's not 
per-application metrics, is it?
> If that is the case, how do I change metrics configs between applications?

Yes. You need to setup both. The collection depends on your collection system.
> Is it possible to run one application using PrometheusServlet and another 
> application to use a different sink on this same cluster?

`metrics.properties` is documented here 
(http://spark.apache.org/docs/1.0.2/monitoring.html) since 1.0 .
> Also, is there documentation about the subject? Because nowhere it is 
> mentioned that the conf should be set before starting the cluster.

Prometheus metrics are for standalone and K8s clusters.
> Final question: how to achieve this using YARN?

Master/Worker are Spark standalone deployment. It's irrelevant to YARN, isn't 
it?
> Do I have to have the metrics config set before launching YARN?


was (Author: dongjoon):
Here are the short answers.

Yes, it does. 
> Just to understand, the configuration needs to be set up before 
> start-master.sh?

Apache Spark master/worker metric system is not designed like that. It's not 
per-application metrics, is it?
> If that is the case, how do I change metrics configs between applications?

No.
> Is it possible to run one application using PrometheusServlet and another 
> application to use a different sink on this same cluster?

`metrics.properties` is documented here 
(http://spark.apache.org/docs/1.0.2/monitoring.html) since 1.0 .
> Also, is there documentation about the subject? Because nowhere it is 
> mentioned that the conf should be set before starting the cluster.

Prometheus metrics are for standalone and K8s clusters.
> Final question: how to achieve this using YARN?

Master/Worker are Spark standalone deployment. It's irrelevant to YARN, isn't 
it?
> Do I have to have the metrics config set before launching YARN?

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> 

[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-12-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244356#comment-17244356
 ] 

Dongjoon Hyun commented on SPARK-33564:
---

Here are the short answers.

Yes, it does. 
> Just to understand, the configuration needs to be set up before 
> start-master.sh?

Apache Spark master/worker metric system is not designed like that. It's not 
per-application metrics, is it?
> If that is the case, how do I change metrics configs between applications?

No.
> Is it possible to run one application using PrometheusServlet and another 
> application to use a different sink on this same cluster?

`metrics.properties` is documented here 
(http://spark.apache.org/docs/1.0.2/monitoring.html) since 1.0 .
> Also, is there documentation about the subject? Because nowhere it is 
> mentioned that the conf should be set before starting the cluster.

Prometheus metrics are for standalone and K8s clusters.
> Final question: how to achieve this using YARN?

Master/Worker are Spark standalone deployment. It's irrelevant to YARN, isn't 
it?
> Do I have to have the metrics config set before launching YARN?

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> {{(the response is the HTML of the Master web UI page, containing link and 
> script tags such as /static/webui.css, /static/vis-timeline-graph2d.min.js 
> and /static/sorttable.js, rather than Prometheus metrics)}}

[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2020-12-04 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244341#comment-17244341
 ] 

Baohe Zhang commented on SPARK-26399:
-

It seems the work for this ticket has already been done by 
https://issues.apache.org/jira/browse/SPARK-23431 and 
https://issues.apache.org/jira/browse/SPARK-32446.
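For reference, a minimal sketch of hitting the stage endpoint with a 
task-status filter along the lines of SPARK-32446; the host, application id, 
stage id and query parameter names below are assumptions for illustration, not 
verified against the final API:

{code:scala}
import scala.io.Source

// Hypothetical example: query a stage from the History Server REST API and ask
// for only FAILED tasks. All values and parameter names are placeholders.
val historyServer = "http://localhost:18080"
val appId = "app-20201125173618-0002"
val stageId = 0
val url = s"$historyServer/api/v1/applications/$appId/stages/$stageId" +
  "?details=true&taskStatus=FAILED"
println(Source.fromURL(url).mkString.take(500))
{code}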

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<history server>:18080/api/v1/applications/<app id>/<stage id>/<stage attempt>/executorSummary{code}
> Add parameters to the stages REST API to specify:
> *  filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
> * task metric quantiles, adding the task summary if specified
> * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33141) capture SQL configs when creating permanent views

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244322#comment-17244322
 ] 

Apache Spark commented on SPARK-33141:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30611

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33141) capture SQL configs when creating permanent views

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244320#comment-17244320
 ] 

Apache Spark commented on SPARK-33141:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30611

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33662) Setting version to 3.2.0-SNAPSHOT

2020-12-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33662.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30606
[https://github.com/apache/spark/pull/30606]

> Setting version to 3.2.0-SNAPSHOT
> -
>
> Key: SPARK-33662
> URL: https://issues.apache.org/jira/browse/SPARK-33662
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33662) Setting version to 3.2.0-SNAPSHOT

2020-12-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33662:
-

Assignee: Dongjoon Hyun

> Setting version to 3.2.0-SNAPSHOT
> -
>
> Key: SPARK-33662
> URL: https://issues.apache.org/jira/browse/SPARK-33662
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33660.
--
Fix Version/s: 3.2.0
   3.0.2
   Resolution: Fixed

Issue resolved by pull request 30605
[https://github.com/apache/spark/pull/30605]

> Update Kafka Headers Documentation in Structured Streaming
> --
>
> Key: SPARK-33660
> URL: https://issues.apache.org/jira/browse/SPARK-33660
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0, 3.0.2, 3.2.0
>
>
> This is to update the documentation of the Kafka Integration Guide, in 
> particular the Headers section.
> The documentation suggests using `Map` as the type for headers, but the type 
> is actually 
> `Array[(String, Array[Byte])]`
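> For illustration (not part of the guide text), a minimal sketch of reading 
> the headers column with that type, assuming the built-in Kafka source and its 
> includeHeaders option; bootstrap servers and topic are placeholders:
> {code:scala}
> // Hypothetical sketch: with includeHeaders=true the headers column is
> // array<struct<key: string, value: binary>>, i.e. Array[(String, Array[Byte])]
> // on the Scala side rather than a Map.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("kafka-headers-demo").getOrCreate()
>
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
>   .option("subscribe", "topic1")                        // placeholder
>   .option("includeHeaders", "true")
>   .load()
>   .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
> {code}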



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-33660:


Assignee: German Schiavon Matteo

> Update Kafka Headers Documentation in Structured Streaming
> --
>
> Key: SPARK-33660
> URL: https://issues.apache.org/jira/browse/SPARK-33660
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0
>
>
> This is to update the documentation of the Kafka Integration Guide, in 
> particular the Headers section.
> The documentation suggests using `Map` as the type for headers, but the type 
> is actually 
> `Array[(String, Array[Byte])]`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24837) Add kafka as spark metrics sink

2020-12-04 Thread Zikun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244300#comment-17244300
 ] 

Zikun commented on SPARK-24837:
---

Why is this marked as resolved? The Kafka sink for Spark metrics is still 
missing, right?

> Add kafka as spark metrics sink
> ---
>
> Key: SPARK-24837
> URL: https://issues.apache.org/jira/browse/SPARK-24837
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sandish Kumar HN
>Priority: Major
>  Labels: bulk-closed
>
> Sink spark metrics to kafka producer 
> spark/core/src/main/scala/org/apache/spark/metrics/sink/
>  someone assign this to me?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33665) Enable ShuffleBlockPusher to stop pushing blocks for a particular shuffle partition

2020-12-04 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-33665:
-

 Summary: Enable ShuffleBlockPusher to stop pushing blocks for a 
particular shuffle partition
 Key: SPARK-33665
 URL: https://issues.apache.org/jira/browse/SPARK-33665
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.1.0
Reporter: Chandni Singh


{{ShuffleBlockPusher}}, which was introduced in SPARK-32917, stops pushing 
shuffle blocks for the entire shuffle when it receives a "too late" exception. 
However, with the change [https://github.com/apache/spark/pull/30433], there is 
also a need to stop pushing shuffle blocks for a particular reduce partition. 
Refer to https://github.com/apache/spark/pull/30433#discussion_r533694433
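As a rough, hypothetical illustration of the idea (names and structure are 
invented here, not the proposed change), the pusher could track stopped reduce 
partitions alongside the existing whole-shuffle stop flag:

{code:scala}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch only: remember which reduce partitions should no longer
// receive pushed blocks, in addition to the existing "stop everything" flag
// used for the too-late case.
class PushSkipTracker {
  private val stoppedPartitions = ConcurrentHashMap.newKeySet[Int]()
  @volatile private var stopAllPushes = false

  def stopPartition(reduceId: Int): Unit = { stoppedPartitions.add(reduceId) }
  def stopShuffle(): Unit = { stopAllPushes = true }

  // Consulted before pushing each block.
  def shouldPush(reduceId: Int): Boolean =
    !stopAllPushes && !stoppedPartitions.contains(reduceId)
}
{code}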

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33664) Migrate ALTER TABLE ... RENAME TO to new resolution framework

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244269#comment-17244269
 ] 

Apache Spark commented on SPARK-33664:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30610

> Migrate ALTER TABLE ... RENAME TO to new resolution framework
> -
>
> Key: SPARK-33664
> URL: https://issues.apache.org/jira/browse/SPARK-33664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate ALTER TABLE ... RENAME TO to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33664) Migrate ALTER TABLE ... RENAME TO to new resolution framework

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33664:


Assignee: Apache Spark

> Migrate ALTER TABLE ... RENAME TO to new resolution framework
> -
>
> Key: SPARK-33664
> URL: https://issues.apache.org/jira/browse/SPARK-33664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Migrate ALTER TABLE ... RENAME TO to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33664) Migrate ALTER TABLE ... RENAME TO to new resolution framework

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33664:


Assignee: (was: Apache Spark)

> Migrate ALTER TABLE ... RENAME TO to new resolution framework
> -
>
> Key: SPARK-33664
> URL: https://issues.apache.org/jira/browse/SPARK-33664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate ALTER TABLE ... RENAME TO to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33664) Migrate ALTER TABLE ... RENAME TO to new resolution framework

2020-12-04 Thread Terry Kim (Jira)
Terry Kim created SPARK-33664:
-

 Summary: Migrate ALTER TABLE ... RENAME TO to new resolution 
framework
 Key: SPARK-33664
 URL: https://issues.apache.org/jira/browse/SPARK-33664
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


Migrate ALTER TABLE ... RENAME TO to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33663) Fix misleading message for uncaching when createOrReplaceTempView is called

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33663:


Assignee: (was: Apache Spark)

> Fix misleading message for uncaching when createOrReplaceTempView is called
> ---
>
> Key: SPARK-33663
> URL: https://issues.apache.org/jira/browse/SPARK-33663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> To repro:
> {code:java}
> scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
> res0: org.apache.spark.sql.DataFrame = [] 
>   
> scala> val df = spark.table("table")
> df: org.apache.spark.sql.DataFrame = [2: int]
> scala> df.createOrReplaceTempView("t2")
> 20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache 
> $name
> org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
> 'UnresolvedRelation [t2], [], false
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
>   at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
>   at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
>   at 
> org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
>   at 
> org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
> {code}
> It shouldn't log because `t2` does not exist yet.
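> As a hedged sketch of the idea only (not the actual change), the uncache 
> attempt could be skipped when the view does not exist yet, and the warning 
> should interpolate the name instead of printing the literal $name:
> {code:scala}
> // Hypothetical sketch: only uncache an existing view, and build the warning
> // with an s-interpolator so it shows the real name rather than "$name".
> import org.apache.spark.sql.SparkSession
>
> def uncacheIfExists(spark: SparkSession, name: String): Unit = {
>   if (spark.catalog.tableExists(name)) {
>     try {
>       spark.catalog.uncacheTable(name)
>     } catch {
>       case e: Exception =>
>         println(s"Exception when attempting to uncache $name: ${e.getMessage}")
>     }
>   }
> }
> {code}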



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33663) Fix misleading message for uncaching when createOrReplaceTempView is called

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244229#comment-17244229
 ] 

Apache Spark commented on SPARK-33663:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30608

> Fix misleading message for uncaching when createOrReplaceTempView is called
> ---
>
> Key: SPARK-33663
> URL: https://issues.apache.org/jira/browse/SPARK-33663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> To repro:
> {code:java}
> scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
> res0: org.apache.spark.sql.DataFrame = [] 
>   
> scala> val df = spark.table("table")
> df: org.apache.spark.sql.DataFrame = [2: int]
> scala> df.createOrReplaceTempView("t2")
> 20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache 
> $name
> org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
> 'UnresolvedRelation [t2], [], false
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
>   at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
>   at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
>   at 
> org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
>   at 
> org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
> {code}
> It shouldn't log this warning, because `t2` does not exist yet and there is nothing to uncache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33663) Fix misleading message for uncaching when createOrReplaceTempView is called

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33663:


Assignee: Apache Spark

> Fix misleading message for uncaching when createOrReplaceTempView is called
> ---
>
> Key: SPARK-33663
> URL: https://issues.apache.org/jira/browse/SPARK-33663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> To repro:
> {code:java}
> scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
> res0: org.apache.spark.sql.DataFrame = [] 
>   
> scala> val df = spark.table("table")
> df: org.apache.spark.sql.DataFrame = [2: int]
> scala> df.createOrReplaceTempView("t2")
> 20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache 
> $name
> org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
> 'UnresolvedRelation [t2], [], false
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
>   at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
>   at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
>   at 
> org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
>   at 
> org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
> {code}
> It shouldn't log this warning, because `t2` does not exist yet and there is nothing to uncache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33663) Fix misleading message for uncaching when createOrReplaceTempView is called

2020-12-04 Thread Terry Kim (Jira)
Terry Kim created SPARK-33663:
-

 Summary: Fix misleading message for uncaching when 
createOrReplaceTempView is called
 Key: SPARK-33663
 URL: https://issues.apache.org/jira/browse/SPARK-33663
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


To repro:

{code:java}
scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
res0: org.apache.spark.sql.DataFrame = []   

scala> val df = spark.table("table")
df: org.apache.spark.sql.DataFrame = [2: int]

scala> df.createOrReplaceTempView("t2")
20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name
org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
'UnresolvedRelation [t2], [], false

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
at 
org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
at 
org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)

{code}

It shouldn't log this warning, because `t2` does not exist yet and there is nothing to uncache.
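
For illustration only (this is not the change in the pull request for this issue), a minimal Scala sketch of the kind of guard a caller could use so that no misleading warning is logged for a view name that does not resolve yet; `safeUncache` is a hypothetical helper:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: only attempt to uncache names that actually resolve,
// so creating a brand-new temp view like `t2` does not trigger the warning.
def safeUncache(spark: SparkSession, name: String): Unit = {
  if (spark.catalog.tableExists(name)) {
    spark.catalog.uncacheTable(name)
  }
  // Otherwise nothing is cached under `name`, so stay silent.
}
{code}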



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2020-12-04 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-24907:
---
Affects Version/s: (was: 2.3.0)
   (was: 3.0.0)
   3.1.0

> Migrate JDBC data source to DataSource API v2
> -
>
> Key: SPARK-24907
> URL: https://issues.apache.org/jira/browse/SPARK-24907
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Teng Peng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-12-04 Thread Paulo Roberto de Oliveira Castro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Roberto de Oliveira Castro reopened SPARK-33564:
--

Just to understand: does the configuration need to be set up before running 
start-master.sh? If so, how do I change metrics configs between applications? 
Is it possible to run one application using PrometheusServlet and another 
application using a different sink on the same cluster?

Also, is there documentation on this? Nowhere is it mentioned that the conf 
should be set before starting the cluster.

Final question: how can this be achieved with YARN? Does the metrics config 
have to be set before launching YARN?

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{                    ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> (response truncated: instead of Prometheus metrics, the endpoint returns the 
> HTML of the Spark Master web UI page, "Spark Master at 
> spark://MacBook-Pro-de-Paulo-2.local:7077", version 3.0.0)

[jira] [Issue Comment Deleted] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-12-04 Thread Paulo Roberto de Oliveira Castro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Roberto de Oliveira Castro updated SPARK-33564:
-
Comment: was deleted

(was: Just to understand, the configuration needs to be set up before 
start-master.sh? If that is the case, how do I change metrics configs between 
applications? Is it possible to run one application using PrometheusServlet and 
another application to use a different sink on this same cluster?

Also, is there documentation about the subject? Because nowhere it is mentioned 
that the conf should be set before starting the cluster.

Final question: how to achieve this using YARN? Do I have to have the metrics 
config set before launching YARN?)

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{                    ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> (response truncated: instead of Prometheus metrics, the endpoint returns the 
> HTML of the Spark Master web UI page, "Spark Master at 
> spark://MacBook-Pro-de-Paulo-2.local:7077", version 3.0.0)

[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2020-12-04 Thread Paulo Roberto de Oliveira Castro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244182#comment-17244182
 ] 

Paulo Roberto de Oliveira Castro commented on SPARK-33564:
--

Just to understand: does the configuration need to be set up before running 
start-master.sh? If so, how do I change metrics configs between applications? 
Is it possible to run one application using PrometheusServlet and another 
application using a different sink on the same cluster?

Also, is there documentation on this? Nowhere is it mentioned that the conf 
should be set before starting the cluster.

Final question: how can this be achieved with YARN? Does the metrics config 
have to be set before launching YARN?

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{                    ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> (response truncated: instead of Prometheus metrics, the endpoint returns the 
> HTML of the Spark Master web UI page, "Spark Master at 
> spark://MacBook-Pro-de-Paulo-2.local:7077", version 3.0.0)

[jira] [Commented] (SPARK-33662) Setting version to 3.2.0-SNAPSHOT

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244114#comment-17244114
 ] 

Apache Spark commented on SPARK-33662:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30606

> Setting version to 3.2.0-SNAPSHOT
> -
>
> Key: SPARK-33662
> URL: https://issues.apache.org/jira/browse/SPARK-33662
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33662) Setting version to 3.2.0-SNAPSHOT

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33662:


Assignee: Apache Spark

> Setting version to 3.2.0-SNAPSHOT
> -
>
> Key: SPARK-33662
> URL: https://issues.apache.org/jira/browse/SPARK-33662
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33662) Setting version to 3.2.0-SNAPSHOT

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33662:


Assignee: (was: Apache Spark)

> Setting version to 3.2.0-SNAPSHOT
> -
>
> Key: SPARK-33662
> URL: https://issues.apache.org/jira/browse/SPARK-33662
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33662) Setting version to 3.2.0-SNAPSHOT

2020-12-04 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33662:
-

 Summary: Setting version to 3.2.0-SNAPSHOT
 Key: SPARK-33662
 URL: https://issues.apache.org/jira/browse/SPARK-33662
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33661) Unable to load RandomForestClassificationModel trained in Spark 2.x

2020-12-04 Thread Marcus Levine (Jira)
Marcus Levine created SPARK-33661:
-

 Summary: Unable to load RandomForestClassificationModel trained in 
Spark 2.x
 Key: SPARK-33661
 URL: https://issues.apache.org/jira/browse/SPARK-33661
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.1
Reporter: Marcus Levine


When attempting to use Spark 3.x to load a RandomForestClassificationModel that 
was trained in Spark 2.x, an exception is raised:

{code:python}
...
RandomForestClassificationModel.load('/path/to/my/model')
  File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 330, in load
  File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 291, in 
load
  File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 280, in load
  File "/usr/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1305, in __call__
  File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in 
deco
  File "", line 3, in raise_from
pyspark.sql.utils.AnalysisException: No such struct field rawCount in id, 
prediction, impurity, impurityStats, gain, leftChild, rightChild, split;
{code}

There seems to be a schema incompatibility between the trained model data saved 
by Spark 2.x and the data expected for a model trained in Spark 3.x.

If this issue is not resolved, users will be forced to retrain, under Spark 3.x, 
any existing random forest models they trained in Spark 2.x before they can 
upgrade.
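
Until the loader handles the 2.x on-disk layout, the workaround described above amounts to retraining and re-saving under Spark 3.x. A minimal Scala sketch, with hypothetical paths and a hypothetical `trainingDf` holding "features" and "label" columns:

{code:scala}
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
// Hypothetical training data with "features" (vector) and "label" columns.
val trainingDf = spark.read.parquet("/path/to/training/data")

// Retrain under Spark 3.x and persist; the re-saved model uses the 3.x
// tree-node schema (which includes rawCount) and loads without the error.
val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(trainingDf)
model.write.overwrite().save("/path/to/my/model-3x")

val reloaded = RandomForestClassificationModel.load("/path/to/my/model-3x")
{code}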



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32595) Do truncate and append atomically in JDBCWriteBuilder

2020-12-04 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-32595:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> Do truncate and append atomically in JDBCWriteBuilder
> -
>
> Key: SPARK-32595
> URL: https://issues.apache.org/jira/browse/SPARK-32595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Priority: Major
>
> In JDBCWriteBuilder.buildForV1Write, we need to do truncate and append 
> atomically.
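
The intent, sketched with plain JDBC rather than Spark's write path (connection URL and table names are hypothetical; note that some databases implicitly commit TRUNCATE, in which case DELETE would be needed instead):

{code:scala}
import java.sql.DriverManager

val url = "jdbc:postgresql://localhost/test" // hypothetical connection URL
val conn = DriverManager.getConnection(url)
try {
  conn.setAutoCommit(false)
  val stmt = conn.createStatement()
  // Truncate and append inside one transaction, so a failed append cannot
  // leave the target table truncated but empty.
  stmt.executeUpdate("TRUNCATE TABLE target_table")
  stmt.executeUpdate("INSERT INTO target_table SELECT * FROM staging_table")
  conn.commit() // both steps become visible together
} catch {
  case e: Exception =>
    conn.rollback() // the old rows survive if the append fails
    throw e
} finally {
  conn.close()
}
{code}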



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32593) JDBC nested columns support

2020-12-04 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-32593:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> JDBC nested columns support
> ---
>
> Key: SPARK-32593
> URL: https://issues.apache.org/jira/browse/SPARK-32593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add nested column support  and nested column pruning in JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32833) JDBC V2 Datasource aggregate push down

2020-12-04 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-32833:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> JDBC V2 Datasource aggregate push down
> --
>
> Key: SPARK-32833
> URL: https://issues.apache.org/jira/browse/SPARK-32833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Push down aggregate to data source layer for better performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244043#comment-17244043
 ] 

Apache Spark commented on SPARK-33660:
--

User 'Gschiavon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30605

> Update Kafka Headers Documentation in Structured Streaming
> --
>
> Key: SPARK-33660
> URL: https://issues.apache.org/jira/browse/SPARK-33660
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0
>
>
> This is to update the documentation of the Kafka Integration Guide, in 
> particular the Headers section.
> The documentation suggests using `Map` as the type for headers, but the 
> actual type is `Array[(String, Array[Byte])]`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33660:


Assignee: (was: Apache Spark)

> Update Kafka Headers Documentation in Structured Streaming
> --
>
> Key: SPARK-33660
> URL: https://issues.apache.org/jira/browse/SPARK-33660
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0
>
>
> This is to update the documentation of the Kafka Integration Guide, in 
> particular the Headers section.
> The documentation suggests using `Map` as the type for headers, but the 
> actual type is `Array[(String, Array[Byte])]`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33660:


Assignee: Apache Spark

> Update Kafka Headers Documentation in Structured Streaming
> --
>
> Key: SPARK-33660
> URL: https://issues.apache.org/jira/browse/SPARK-33660
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> This is to update the documentation of the Kafka Integration Guide, in 
> particular the Headers section.
> The documentation suggests using `Map` as the type for headers, but the 
> actual type is `Array[(String, Array[Byte])]`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread German Schiavon Matteo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

German Schiavon Matteo updated SPARK-33660:
---
Component/s: docs

> Update Kafka Headers Documentation in Structured Streaming
> --
>
> Key: SPARK-33660
> URL: https://issues.apache.org/jira/browse/SPARK-33660
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0
>
>
> This is to update the documentation of the Kafka Integration Guide, in 
> particular the Headers section.
> The documentation suggests using `Map` as the type for headers, but the 
> actual type is `Array[(String, Array[Byte])]`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33660) Update Kafka Headers Documentation in Structured Streaming

2020-12-04 Thread German Schiavon Matteo (Jira)
German Schiavon Matteo created SPARK-33660:
--

 Summary: Update Kafka Headers Documentation in Structured Streaming
 Key: SPARK-33660
 URL: https://issues.apache.org/jira/browse/SPARK-33660
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.1, 3.0.0
Reporter: German Schiavon Matteo
 Fix For: 3.1.0


This is to update the documentation of the Kafka Integration Guide, in 
particular the Headers section.

The documentation suggests using `Map` as the type for headers, but the actual 
type is `Array[(String, Array[Byte])]`.
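
For reference, a minimal Scala sketch that reads headers with the documented option and shows the actual column type; the broker address and topic are hypothetical:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// With includeHeaders enabled, the "headers" column arrives as
// Array[(String, Array[Byte])], i.e. ARRAY<STRUCT<key: STRING, value: BINARY>>.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
  .option("subscribe", "events")                        // hypothetical topic
  .option("includeHeaders", "true")
  .load()

df.printSchema() // includes: headers: array (element: struct<key: string, value: binary>)
{code}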



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32681) PySpark type hints support

2020-12-04 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244012#comment-17244012
 ] 

Maciej Szymkiewicz commented on SPARK-32681:


Thanks [~hyukjin.kwon]. As far as I am aware, things are in good shape now ‒ 
there are some things I'd like to try in the future, but those target infra 
and/or tests and can safely wait until after the release.

> PySpark type hints support
> --
>
> Key: SPARK-32681
> URL: https://issues.apache.org/jira/browse/SPARK-32681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Critical
>
>  https://github.com/zero323/pyspark-stubs demonstrates a lot of benefits to 
> improve usability in PySpark by leveraging Python type hints.
> By having the type hints in PySpark we can, for example:
> - automatically document the input and output types
> - leverage IDE for error detection and auto-completion
> - have cleaner definitions that are easier to understand.
> This JIRA is an umbrella JIRA that targets to port 
> https://github.com/zero323/pyspark-stubs and related items to smoothly run 
> within PySpark.
> It was also discussed in the dev mailing list:  
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33565) python/run-tests.py calling python3.8

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33565:
-
Fix Version/s: (was: 3.2.0)
   3.1.0

> python/run-tests.py calling python3.8
> -
>
> Key: SPARK-33565
> URL: https://issues.apache.org/jira/browse/SPARK-33565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: Shane Knapp
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> this line in run-tests.py on master:
> python_execs = [x for x in ["python3.6", "python3.8", "pypy3"] if which(x)]
>  
> and this line in branch-3.0:
> python_execs = [x for x in ["python3.8", "python2.7", "pypy3", "pypy"] if 
> which(x)]
> ...are currently breaking builds on the new ubuntu 20.04LTS workers.
> the default  system python is /usr/bin/python3.8 and we do NOT have a working 
> python3.8 anaconda deployment yet.  this is causing python test breakages.
> PRs incoming
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33466) Imputer support mode(most_frequent) strategy

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33466:
-
Fix Version/s: (was: 3.2.0)
   3.1.0

> Imputer support mode(most_frequent) strategy
> 
>
> Key: SPARK-33466
> URL: https://issues.apache.org/jira/browse/SPARK-33466
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> [sklearn.impute.SimpleImputer|https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer]
> supports *most_frequent(mode)*, which replaces missing values with the most 
> frequent value along each column.
> It should be easy to implement it in MLlib.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33615) Make spark.archives working in Kubernates

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33615.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30581
[https://github.com/apache/spark/pull/30581]

> Make spark.archives working in Kubernates
> -
>
> Key: SPARK-33615
> URL: https://issues.apache.org/jira/browse/SPARK-33615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> {{--archive}} submit option and {{spark.archives}} configuration were added 
> at SPARK-33530.
> Looks like there is a bug. I found this while working on adding an IT test. 
> Please refer to the PR for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33615) Make spark.archives working in Kubernates

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33615:


Assignee: Hyukjin Kwon

> Make spark.archives working in Kubernates
> -
>
> Key: SPARK-33615
> URL: https://issues.apache.org/jira/browse/SPARK-33615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {{--archive}} submit option and {{spark.archives}} configuration were added 
> at SPARK-33530.
> Looks like there is a bug. I found this while working on adding an IT test. 
> Please refer to the PR for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27237) Introduce State schema validation among query restart

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27237:


Assignee: Jungtaek Lim

> Introduce State schema validation among query restart
> -
>
> Key: SPARK-27237
> URL: https://issues.apache.org/jira/browse/SPARK-27237
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> Even though the Spark Structured Streaming guide clearly documents that "Any 
> change in number or type of grouping keys or aggregates is not allowed.", 
> Spark doesn't do anything when end users try it, which ends up in 
> nondeterministic outputs or unexpected exceptions.
> Even worse, if the query happens not to crash, it could write the new 
> messed-up values to state, which completely breaks the state unless end users 
> roll back to a specific batch by manually editing the checkpoint.
> The restriction is clear: the number of columns and the data type of each 
> must not change across query runs. We can store the schema of the state along 
> with the state and verify whether the (possibly) new schema is compatible 
> when the state schema is modified. With this validation we can prevent query 
> runs that would show nondeterministic behavior when the schema is 
> incompatible, and we can give more informative error messages to end users.
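
A minimal sketch of what such a compatibility check could look like, assuming the rule stated above (field count and data types must match, names may change); this is illustrative only, not the validation Spark ships:

{code:scala}
import org.apache.spark.sql.types.StructType

// Compare a previously stored state schema with the schema of the restarted
// query: same number of fields and the same data type at each position.
def isCompatible(stored: StructType, current: StructType): Boolean = {
  stored.fields.length == current.fields.length &&
    stored.fields.zip(current.fields).forall { case (s, c) =>
      s.dataType == c.dataType // field names are allowed to differ
    }
}
{code}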



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27237) Introduce State schema validation among query restart

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27237.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 24173
[https://github.com/apache/spark/pull/24173]

> Introduce State schema validation among query restart
> -
>
> Key: SPARK-27237
> URL: https://issues.apache.org/jira/browse/SPARK-27237
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> Even though the Spark Structured Streaming guide clearly documents that "Any 
> change in number or type of grouping keys or aggregates is not allowed.", 
> Spark doesn't do anything when end users try it, which ends up in 
> nondeterministic outputs or unexpected exceptions.
> Even worse, if the query happens not to crash, it could write the new 
> messed-up values to state, which completely breaks the state unless end users 
> roll back to a specific batch by manually editing the checkpoint.
> The restriction is clear: the number of columns and the data type of each 
> must not change across query runs. We can store the schema of the state along 
> with the state and verify whether the (possibly) new schema is compatible 
> when the state schema is modified. With this validation we can prevent query 
> runs that would show nondeterministic behavior when the schema is 
> incompatible, and we can give more informative error messages to end users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33632) to_date doesn't behave as documented

2020-12-04 Thread Frank Oosterhuis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243858#comment-17243858
 ] 

Frank Oosterhuis edited comment on SPARK-33632 at 12/4/20, 9:11 AM:


[~qwe1398775315] is right, and the spec is actually somewhat clear about this. 
I had tried "y" and "", but not "yy"

"Year: The count of letters determines the minimum field width below which 
padding is used. If the count of letters is two, then a reduced two digit form 
is used. For printing, this outputs the rightmost two digits. *For parsing, 
this will parse using the base value of 2000, resulting in a year within the 
range 2000 to 2099 inclusive.* If the count of letters is less than four (but 
not two), then the sign is only output for negative years. Otherwise, the sign 
is output if the pad width is exceeded when ‘G’ is not present. 7 or more 
letters will fail."

 The table could be a bit clearer.
 !screenshot-1.png! 


was (Author: frankivo):
[~qwe1398775315] is right, and the spec is actually somewhat clear about this.

"Year: The count of letters determines the minimum field width below which 
padding is used. If the count of letters is two, then a reduced two digit form 
is used. For printing, this outputs the rightmost two digits. *For parsing, 
this will parse using the base value of 2000, resulting in a year within the 
range 2000 to 2099 inclusive.* If the count of letters is less than four (but 
not two), then the sign is only output for negative years. Otherwise, the sign 
is output if the pad width is exceeded when ‘G’ is not present. 7 or more 
letters will fail."

 The table could be a bit clearer.
 !screenshot-1.png! 

> to_date doesn't behave as documented
> 
>
> Key: SPARK-33632
> URL: https://issues.apache.org/jira/browse/SPARK-33632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Frank Oosterhuis
>Priority: Major
> Attachments: image-2020-12-04-11-45-10-379.png, screenshot-1.png
>
>
> I'm trying to use to_date on a string formatted as "10/31/20".
> Expected output is "2020-10-31".
> Actual output is "0020-01-31".
> The 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]
>  suggests 2020 or 20 as input for "y".
> Example below. Expected behaviour is included in the udf.
> {code:scala}
> import java.sql.Date
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{to_date, udf}
> object ToDate {
>   val toDate = udf((date: String) => {
> val split = date.split("/")
> val month = "%02d".format(split(0).toInt)
> val day = "%02d".format(split(1).toInt)
> val year = split(2).toInt + 2000
> Date.valueOf(s"${year}-${month}-${day}")
>   })
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder().master("local[2]").getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.implicits._
> Seq("1/1/20", "10/31/20")
>   .toDF("raw")
>   .withColumn("to_date", to_date($"raw", "m/d/y"))
>   .withColumn("udf", toDate($"raw"))
>   .show
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33632) to_date doesn't behave as documented

2020-12-04 Thread Frank Oosterhuis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243858#comment-17243858
 ] 

Frank Oosterhuis commented on SPARK-33632:
--

[~qwe1398775315] is right, and the spec is actually somewhat clear about this.

"Year: The count of letters determines the minimum field width below which 
padding is used. If the count of letters is two, then a reduced two digit form 
is used. For printing, this outputs the rightmost two digits. *For parsing, 
this will parse using the base value of 2000, resulting in a year within the 
range 2000 to 2099 inclusive.* If the count of letters is less than four (but 
not two), then the sign is only output for negative years. Otherwise, the sign 
is output if the pad width is exceeded when ‘G’ is not present. 7 or more 
letters will fail."

 The table could be a bit clearer.
 !screenshot-1.png! 
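
Concretely: lowercase "m" is minute-of-hour and a single "y" accepts "20" as the literal year 20, which is why "10/31/20" with pattern "m/d/y" comes out as 0020-01-31. A minimal Scala sketch with the intended pattern:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// "M/d/yy" parses the month, the day, and a two-digit year with base 2000,
// so "10/31/20" becomes 2020-10-31 as expected.
Seq("1/1/20", "10/31/20")
  .toDF("raw")
  .withColumn("parsed", to_date($"raw", "M/d/yy"))
  .show()
{code}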

> to_date doesn't behave as documented
> 
>
> Key: SPARK-33632
> URL: https://issues.apache.org/jira/browse/SPARK-33632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Frank Oosterhuis
>Priority: Major
> Attachments: image-2020-12-04-11-45-10-379.png, screenshot-1.png
>
>
> I'm trying to use to_date on a string formatted as "10/31/20".
> Expected output is "2020-10-31".
> Actual output is "0020-01-31".
> The 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]
>  suggests 2020 or 20 as input for "y".
> Example below. Expected behaviour is included in the udf.
> {code:scala}
> import java.sql.Date
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{to_date, udf}
> object ToDate {
>   val toDate = udf((date: String) => {
> val split = date.split("/")
> val month = "%02d".format(split(0).toInt)
> val day = "%02d".format(split(1).toInt)
> val year = split(2).toInt + 2000
> Date.valueOf(s"${year}-${month}-${day}")
>   })
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder().master("local[2]").getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.implicits._
> Seq("1/1/20", "10/31/20")
>   .toDF("raw")
>   .withColumn("to_date", to_date($"raw", "m/d/y"))
>   .withColumn("udf", toDate($"raw"))
>   .show
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33632) to_date doesn't behave as documented

2020-12-04 Thread Frank Oosterhuis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Oosterhuis updated SPARK-33632:
-
Attachment: screenshot-1.png

> to_date doesn't behave as documented
> 
>
> Key: SPARK-33632
> URL: https://issues.apache.org/jira/browse/SPARK-33632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Frank Oosterhuis
>Priority: Major
> Attachments: image-2020-12-04-11-45-10-379.png, screenshot-1.png
>
>
> I'm trying to use to_date on a string formatted as "10/31/20".
> Expected output is "2020-10-31".
> Actual output is "0020-01-31".
> The 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]
>  suggests 2020 or 20 as input for "y".
> Example below. Expected behaviour is included in the udf.
> {code:scala}
> import java.sql.Date
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{to_date, udf}
> object ToDate {
>   val toDate = udf((date: String) => {
> val split = date.split("/")
> val month = "%02d".format(split(0).toInt)
> val day = "%02d".format(split(1).toInt)
> val year = split(2).toInt + 2000
> Date.valueOf(s"${year}-${month}-${day}")
>   })
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder().master("local[2]").getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> import spark.implicits._
> Seq("1/1/20", "10/31/20")
>   .toDF("raw")
>   .withColumn("to_date", to_date($"raw", "m/d/y"))
>   .withColumn("udf", toDate($"raw"))
>   .show
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243856#comment-17243856
 ] 

Apache Spark commented on SPARK-33571:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30604

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
> Fix For: 3.1.0
>
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), 
> the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`; it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing and to show the behavior; they use 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here: 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.
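> A minimal sketch of forcing the rebase mode explicitly when reading such files, 
> assuming the `spark.sql.legacy.parquet.datetimeRebaseModeInRead` SQL config of 
> Spark 3.0.x (the input path below is a placeholder):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object RebaseReadSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local[2]").getOrCreate()
>
>     // LEGACY: rebase values from the hybrid Julian+Gregorian calendar to the
>     // proleptic Gregorian calendar while reading.
>     spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
>     spark.read.parquet("/path/written/by/spark-2.4.5").show()
>
>     // CORRECTED: read the stored values as-is, without rebasing.
>     spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
>     spark.read.parquet("/path/written/by/spark-2.4.5").show()
>   }
> }
> {code}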



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32681) PySpark type hints support

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32681.
--
Resolution: Done

I will mark this as done for now, but [~zero323] please feel free to add a 
ticket here and fix it if you happen to have more issues to address before the 
Spark 3.1.0 release.

> PySpark type hints support
> --
>
> Key: SPARK-32681
> URL: https://issues.apache.org/jira/browse/SPARK-32681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Critical
>
>  https://github.com/zero323/pyspark-stubs demonstrates many benefits of 
> leveraging Python type hints to improve usability in PySpark.
> By having the type hints in PySpark we can, for example:
> - automatically document the input and output types
> - leverage IDEs for error detection and auto-completion
> - have cleaner definitions that are easier to understand.
> This JIRA is an umbrella JIRA that targets to port 
> https://github.com/zero323/pyspark-stubs and related items to smoothly run 
> within PySpark.
> It was also discussed in the dev mailing list:  
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7768) Make user-defined type (UDT) API public

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-7768:

Target Version/s: 3.2.0  (was: 3.1.0)

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243847#comment-17243847
 ] 

Hyukjin Kwon commented on SPARK-7768:
-

I am retargeting this to 3.2.0.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27495:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if it's useful.
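> As an illustration of such a programmatic per-stage API, a sketch assuming a 
> ResourceProfile-style builder API (resource amounts, paths and names below are 
> placeholders):
> {code:scala}
> import org.apache.spark.SparkContext
> import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
>
> object StageLevelSketch {
>   // Ask for GPU-heavy executors only for the ML stage of the job.
>   def mlStageProfile() = {
>     val executorReqs = new ExecutorResourceRequests()
>       .cores(8)
>       .memory("24g")
>       .resource("gpu", 1, "/opt/spark/getGpus.sh")   // placeholder discovery script
>     val taskReqs = new TaskResourceRequests().cpus(4).resource("gpu", 1)
>     new ResourceProfileBuilder().require(executorReqs).require(taskReqs).build()
>   }
>
>   def run(sc: SparkContext): Unit = {
>     val etl = sc.textFile("hdfs:///input").map(_.length)            // many small ETL tasks
>     val ml = etl.withResources(mlStageProfile()).mapPartitions(iter => iter)  // GPU stage
>     ml.count()
>   }
> }
> {code}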
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max resources needed for any part of the job.  To do this 
> today, you can break it into separate jobs where each job requests the 
> corresponding resources needed, but then you have to write the 

[jira] [Commented] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243846#comment-17243846
 ] 

Hyukjin Kwon commented on SPARK-27495:
--

[~tgraves], I switched the target version to 3.2.0 because the branch has been cut 
now, but please edit it if I got it wrong.

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if it's useful.
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max resources needed for any part of the job.  To do this 
> today, you can 

[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24942:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them when the executor idle timeout expires, and then acquire 
> them again.
> - There can be deadlock with two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. And after hitting the idle timeout and releasing them, they 
> may acquire resources again, but just continually trade resources between 
> each other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24941) Add RDDBarrier.coalesce() function

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243845#comment-17243845
 ] 

Hyukjin Kwon commented on SPARK-24941:
--

I switched it to 3.2.0

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the hdfs input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int).
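> A sketch of the workaround available today, before such an API exists: cap the 
> number of barrier tasks by coalescing the input RDD before calling barrier() 
> (the path and task count below are placeholders).
> {code:scala}
> import org.apache.spark.SparkContext
>
> object BarrierCoalesceSketch {
>   // Illustrative only: control the number of barrier tasks by coalescing the
>   // input RDD first; the proposed RDDBarrier.coalesce() would allow this after
>   // barrier() as well.
>   def runBarrierJob(sc: SparkContext, numTasks: Int): Long = {
>     sc.textFile("hdfs:///large/input")
>       .coalesce(numTasks)
>       .barrier()
>       .mapPartitions(iter => iter)
>       .count()
>   }
> }
> {code}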



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24941:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the hdfs input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243844#comment-17243844
 ] 

Hyukjin Kwon commented on SPARK-24942:
--

Let me retarget it to 3.2.0. Branch will be cut out soon.

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them when the executor idle timeout expires, and then acquire 
> them again.
> - There can be deadlock with two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. And after hitting the idle timeout and releasing them, they 
> may acquire resources again, but just continually trade resources between 
> each other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25383) Image data source supports sample pushdown

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243843#comment-17243843
 ] 

Hyukjin Kwon commented on SPARK-25383:
--

Branch 3.2 will be cut out soon. Let me retarget it to 3.2.0.

> Image data source supports sample pushdown
> --
>
> Key: SPARK-25383
> URL: https://issues.apache.org/jira/browse/SPARK-25383
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-25349, we should update image data source to support sampling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243841#comment-17243841
 ] 

Hyukjin Kwon commented on SPARK-27780:
--

I will switch to 3.2.0. Branch will be cut out soon.

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.
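> A toy sketch of what option 3 could look like (illustrative only, not Spark's 
> actual protocol or message types):
> {code:scala}
> // Hypothetical message and negotiation logic for a per-connection version exchange.
> case class ProtocolVersion(major: Int, minor: Int)
>
> object ShuffleVersioningSketch {
>   // Pick the lower of the two advertised versions, and fail fast with a clear
>   // error message if the major versions are incompatible.
>   def negotiate(client: ProtocolVersion, server: ProtocolVersion): ProtocolVersion = {
>     require(client.major == server.major,
>       s"Incompatible shuffle protocol versions: client=$client, server=$server")
>     if (client.minor <= server.minor) client else server
>   }
> }
> {code}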



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25383) Image data source supports sample pushdown

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25383:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Image data source supports sample pushdown
> --
>
> Key: SPARK-25383
> URL: https://issues.apache.org/jira/browse/SPARK-25383
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-25349, we should update image data source to support sampling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27780:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28629:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> A general mistake for new contributors is to forget to add the corresponding 
> rules to extendedResolutionRules, postHocResolutionRules, and 
> extendedCheckRules in HiveSessionStateBuilder. We need a way to avoid missing 
> these rules, or to capture them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243840#comment-17243840
 ] 

Hyukjin Kwon commented on SPARK-28629:
--

[~smilegator] I am going to retarget it to 3.2.0, but please feel free to fix it 
if I got it wrong.

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> A general mistake for new contributors is to forget to add the corresponding 
> rules to extendedResolutionRules, postHocResolutionRules, and 
> extendedCheckRules in HiveSessionStateBuilder. We need a way to avoid missing 
> these rules, or to capture them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31555) Improve cache block migration

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31555:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Improve cache block migration
> -
>
> Key: SPARK-31555
> URL: https://issues.apache.org/jira/browse/SPARK-31555
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> We should explore the following improvements to cache block migration:
> 1) Peer selection (right now may overbalance on certain peers)
> 2) Do we need to configure the number of blocks to be migrated at the same 
> time
> 3) Are there any blocks we don't need to replicate (e.g. they are already 
> stored on the desired number of executors even once we remove the executors 
> slated for decommissioning).
> 4) Do we want to prioritize migrating blocks with no replicas
> 5) Log the attempt number for debugging 
> 6) Clarify the logic for determining the number of replicas
> 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count 
> to wait for the executors to come up. imho this is the least important.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31555) Improve cache block migration

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243839#comment-17243839
 ] 

Hyukjin Kwon commented on SPARK-31555:
--

[~holden] and [~dongjoon], I switched it to 3.2.0, but feel free to edit it if I 
got it wrong.

> Improve cache block migration
> -
>
> Key: SPARK-31555
> URL: https://issues.apache.org/jira/browse/SPARK-31555
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> We should explore the following improvements to cache block migration:
> 1) Peer selection (right now may overbalance on certain peers)
> 2) Do we need to configure the number of blocks to be migrated at the same 
> time
> 3) Are there any blocks we don't need to replicate (e.g. they are already 
> stored on the desired number of executors even once we remove the executors 
> slated for decommissioning).
> 4) Do we want to prioritize migrating blocks with no replicas
> 5) Log the attempt number for debugging 
> 6) Clarify the logic for determining the number of replicas
> 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count 
> to wait for the executors to come up. imho this is the least important.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33461) Propagating SPARK_CONF_DIR in K8s and tests

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33461:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Propagating SPARK_CONF_DIR in K8s and tests
> ---
>
> Key: SPARK-33461
> URL: https://issues.apache.org/jira/browse/SPARK-33461
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
>
> Foundational work for propagating SPARK_CONF_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14922:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.2, 2.2.1
>Reporter: Xiao Li
>Priority: Major
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25075) Build and test Spark against Scala 2.13

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25075:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31976) use MemoryUsage to control the size of block

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31976:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> use MemoryUsage to control the size of block
> 
>
> Key: SPARK-31976
> URL: https://issues.apache.org/jira/browse/SPARK-31976
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz of the block.
> So it may be reasonable to control the size of a block by memory usage, instead 
> of by the number of rows.
>  
> note1: param blockSize is already used in ALS and MLP to stack vectors 
> (expected to be dense);
> note2: we may refer to {{Strategy.maxMemoryInMB}} in the tree models;
>  
> There may be two ways to implement this:
> 1, compute the sparsity of the input vectors ahead of training (this can be 
> computed together with other statistics, maybe with no extra pass), and infer a 
> reasonable number of vectors to stack;
> 2, stack the input vectors adaptively, by monitoring the memory usage in a 
> block;
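> A rough sketch of option 1 (illustrative only, not the MLlib implementation; 
> the per-entry size constants are assumptions):
> {code:scala}
> object BlockSizingSketch {
>   // Approximate size in bytes of one sparse vector with `avgNnz` active entries:
>   // 8 bytes per value + 4 bytes per index, plus a small fixed per-vector overhead.
>   private def approxVectorBytes(avgNnz: Double): Double = avgNnz * 12.0 + 64.0
>
>   // Number of vectors to stack per block so a block stays within maxMemoryInMB.
>   def vectorsPerBlock(maxMemoryInMB: Int, avgNnz: Double): Int = {
>     val budgetBytes = maxMemoryInMB.toLong * 1024L * 1024L
>     math.max(1, (budgetBytes / approxVectorBytes(avgNnz)).toInt)
>   }
> }
> {code}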



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25186) Stabilize Data Source V2 API

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25186:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Stabilize Data Source V2 API 
> -
>
> Key: SPARK-25186
> URL: https://issues.apache.org/jira/browse/SPARK-25186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31981) Keep TimestampType when taking an average of a Timestamp

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31981:
-
Target Version/s:   (was: 3.1.0)

> Keep TimestampType when taking an average of a Timestamp
> 
>
> Key: SPARK-31981
> URL: https://issues.apache.org/jira/browse/SPARK-31981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> Currently, when you take an average of a Timestamp, you'll end up with a 
> Double, representing the seconds since epoch. This is because of old Hive 
> behavior. I strongly believe that it is better to return a Timestamp.
> root@8c4241b617ec:/# psql postgres postgres
> psql (12.3 (Debian 12.3-1.pgdg100+1))
> Type "help" for help.
> postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP);
> CREATE TABLE
> postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
> INSERT 0 1
> postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
> INSERT 0 1
> postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
> INSERT 0 1
> postgres=# SELECT AVG(ts) FROM timestamp_demo;
> ERROR: function avg(timestamp without time zone) does not exist
> LINE 1: SELECT AVG(ts) FROM timestamp_demo;
>  
> root@bab43a5731e8:/# mysql
> Welcome to the MySQL monitor. Commands end with ; or \g.
> Your MySQL connection id is 9
> Server version: 8.0.20 MySQL Community Server - GPL
> Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
> Oracle is a registered trademark of Oracle Corporation and/or its
> affiliates. Other names may be trademarks of their respective
> owners.
> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
> mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP);
> Query OK, 0 rows affected (0.05 sec)
> mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
> Query OK, 1 row affected (0.01 sec)
> mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
> Query OK, 1 row affected (0.01 sec)
> mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
> Query OK, 1 row affected (0.01 sec)
> mysql> SELECT AVG(ts) FROM timestamp_demo;
> +-----------------+
> | AVG(ts)         |
> +-----------------+
> | 20180101182211. |
> +-----------------+
> 1 row in set (0.00 sec)
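> For comparison, a sketch of the current Spark behaviour and the explicit-cast 
> workaround (illustrative only):
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{avg, col}
>
> object AvgTimestampSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local[2]").getOrCreate()
>     import spark.implicits._
>
>     val df = Seq("2017-01-01 18:22:11", "2018-01-01 18:22:11", "2019-01-01 18:22:11")
>       .toDF("ts_str")
>       .select(col("ts_str").cast("timestamp").as("ts"))
>
>     // Today: averaging over epoch seconds yields a Double ...
>     val avgSeconds = df.select(avg($"ts".cast("double")).as("avg_seconds"))
>     // ... which has to be cast back explicitly to get a TimestampType result.
>     avgSeconds.select($"avg_seconds".cast("timestamp").as("avg_ts")).show(false)
>   }
> }
> {code}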
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30334:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting events 
> in a wide variety of formats. Click events in product analytics can be stored 
> as json. Some application logs can be in the form of delimited key=value 
> text. Some data may be in xml.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> +------------+-------+----------------------+
> | ts         | event | raw                  |
> +------------+-------+----------------------+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> +------------+-------+----------------------+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.
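> For reference, the same access spelled out by hand today with the existing 
> from_json; the proposed as_json() does not exist yet, and the schema and data 
> below are illustrative:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>
> object SemiStructuredSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local[2]").getOrCreate()
>     import spark.implicits._
>
>     val data = Seq(("2019-10-12", "click", """{"field":"value"}""")).toDF("ts", "event", "raw")
>
>     // Today the schema and the parsing step have to be written by hand;
>     // the proposed column metadata would let Spark do this automatically.
>     val rawSchema = StructType(Seq(StructField("field", StringType)))
>     data.select(from_json($"raw", rawSchema).getField("field").as("field")).show()
>   }
> }
> {code}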



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30978) Remove multiple workers on the same host support from Standalone backend

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243831#comment-17243831
 ] 

Hyukjin Kwon commented on SPARK-30978:
--

I changed this to 3.2.0, [~jiangxb1987]

> Remove multiple workers on the same host support from Standalone backend
> 
>
> Key: SPARK-30978
> URL: https://issues.apache.org/jira/browse/SPARK-30978
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
>
> Based on our experience, there is no scenario that necessarily requires 
> deploying multiple Workers on the same node with Standalone backend. A worker 
> should book all the resources reserved for Spark on the host it is launched on, 
> and can then allocate those resources to one or more executors launched by 
> this worker. Since each executor runs in a separate JVM, we can limit the 
> memory of each executor to avoid long GC pauses.
> The remaining concern is that local-cluster mode is implemented by launching 
> multiple workers on the local host, so we might need to re-implement 
> LocalSparkCluster to launch only one Worker and multiple executors. This should 
> be fine because local-cluster mode is only used to run Spark unit test 
> cases, so end users should not be affected by this change.
> Removing support for multiple workers on the same host could simplify the deploy 
> model of the Standalone backend, and also reduce the burden of supporting the 
> legacy deploy pattern in future feature development.
> The proposal is to update the documentation to deprecate support for the system 
> environment variable `SPARK_WORKER_INSTANCES` in 3.0, and remove the support in 
> the next major version (3.1.0).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30324:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Simplify API for JSON access in DataFrames/SQL
> --
>
> Key: SPARK-30324
> URL: https://issues.apache.org/jira/browse/SPARK-30324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> get_json_object() is a UDF to parse JSON fields. It is verbose and hard to 
> use, e.g. I wasn't expecting the path to a field to have to start with "$.". 
> We can simplify all of this when a column is of StringType, and a nested 
> field is requested. This API sugar will be rewritten in the query planner as 
> get_json_object.
> This nested access can then be extended in the future to other 
> semi-structured formats.
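> A small illustration of the current API being referred to (the column and path 
> names below are made up):
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.get_json_object
>
> object JsonAccessSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local[2]").getOrCreate()
>     import spark.implicits._
>
>     val df = Seq("""{"user":{"id":42}}""").toDF("raw")
>
>     // Today: verbose, and the path must start with "$."
>     df.select(get_json_object($"raw", "$.user.id").as("user_id")).show()
>
>     // The proposal: let raw.user.id on a StringType column be rewritten to the above.
>   }
> }
> {code}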



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30978) Remove multiple workers on the same host support from Standalone backend

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30978:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Remove multiple workers on the same host support from Standalone backend
> 
>
> Key: SPARK-30978
> URL: https://issues.apache.org/jira/browse/SPARK-30978
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
>
> Based on our experience, there is no scenario that necessarily requires 
> deploying multiple Workers on the same node with Standalone backend. A worker 
> should book all the resources reserved for Spark on the host it is launched on, 
> and can then allocate those resources to one or more executors launched by 
> this worker. Since each executor runs in a separate JVM, we can limit the 
> memory of each executor to avoid long GC pauses.
> The remaining concern is that local-cluster mode is implemented by launching 
> multiple workers on the local host, so we might need to re-implement 
> LocalSparkCluster to launch only one Worker and multiple executors. This should 
> be fine because local-cluster mode is only used to run Spark unit test 
> cases, so end users should not be affected by this change.
> Removing support for multiple workers on the same host could simplify the deploy 
> model of the Standalone backend, and also reduce the burden of supporting the 
> legacy deploy pattern in future feature development.
> The proposal is to update the documentation to deprecate support for the system 
> environment variable `SPARK_WORKER_INSTANCES` in 3.0, and remove the support in 
> the next major version (3.1.0).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases

2020-12-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243830#comment-17243830
 ] 

Hyukjin Kwon commented on SPARK-25752:
--

I changed this to 3.2.0.

> Add trait to easily whitelist logical operators that produce named output 
> from CleanupAliases
> -
>
> Key: SPARK-25752
> URL: https://issues.apache.org/jira/browse/SPARK-25752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> The rule `CleanupAliases` cleans up aliases from logical operators that do 
> not match a whitelist. This whitelist is hardcoded inside the rule which is 
> cumbersome. This PR is to clean that up by adding a trait `HasNamedOutput` 
> that will be ignored by `CleanupAliases`; other operators that require aliases 
> to be preserved should extend it.
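> A conceptual sketch of the idea with simplified stand-in types (not the actual 
> Catalyst classes):
> {code:scala}
> // Stand-in types to illustrate the whitelist-by-trait idea.
> trait LogicalOp { def children: Seq[LogicalOp] }
>
> // Marker trait: operators that produce named output and must keep their aliases.
> trait HasNamedOutput extends LogicalOp
>
> case class Project(children: Seq[LogicalOp]) extends LogicalOp
> case class CreateView(name: String, children: Seq[LogicalOp]) extends LogicalOp with HasNamedOutput
>
> object CleanupAliasesSketch {
>   // Instead of a hardcoded whitelist of operator classes, keep aliases for any
>   // operator that mixes in HasNamedOutput.
>   def shouldPreserveAliases(op: LogicalOp): Boolean = op.isInstanceOf[HasNamedOutput]
> }
> {code}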



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25752:
-
Target Version/s: 3.2.0  (was: 3.1.0)

> Add trait to easily whitelist logical operators that produce named output 
> from CleanupAliases
> -
>
> Key: SPARK-25752
> URL: https://issues.apache.org/jira/browse/SPARK-25752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> The rule `CleanupAliases` cleans up aliases from logical operators that do 
> not match a whitelist. This whitelist is hardcoded inside the rule which is 
> cumbersome. This PR is to clean that up by adding a trait `HasNamedOutput` 
> that will be ignored by `CleanupAliases`; other operators that require aliases 
> to be preserved should extend it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32846) Support createDataFrame from an RDD of pd.DataFrames

2020-12-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32846:
-
Target Version/s:   (was: 3.1.0)

> Support createDataFrame from an RDD of pd.DataFrames
> 
>
> Key: SPARK-32846
> URL: https://issues.apache.org/jira/browse/SPARK-32846
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.1
>Reporter: Linar Savion
>Priority: Minor
>  Labels: arrow, pandas, sql
>
> Add support to createDataFrame from a distributed collection of 
> pandas.DataFrames by converting the RDD of pd.DataFrames to an RDD of Arrow record 
> batches, then directly creating the Spark DataFrame from it.
>  
> Performance is significantly better (vectorized) than creating a spark DF by 
> converting each df to a list of rows, similar to the improvement of 
> SPARK-20791.
>  
> Initial example & benchmark for older spark versions: 
> [https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5]
>  
> I'm currently working on a PR and will post it soon.
>  
> Extends the work done in: 
> https://issues.apache.org/jira/browse/SPARK-20791 
> https://issues.apache.org/jira/browse/SPARK-23030 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33640) Extend connection timeout to DB server for DB2IntegrationSuite and its variants

2020-12-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33640.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30583
[https://github.com/apache/spark/pull/30583]

> Extend connection timeout to DB server for DB2IntegrationSuite and its 
> variants
> ---
>
> Key: SPARK-33640
> URL: https://issues.apache.org/jira/browse/SPARK-33640
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> The container image ibmcom/db2 creates a database when it starts up.
> The database creation can take over 2 minutes.
> DB2IntegrationSuite and its variants use the container image but the 
> connection timeout is set to 2 minutes so these suites almost always fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org