[jira] [Updated] (SPARK-30211) Use python3 in make-distribution.sh

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30211:
--
Summary: Use python3 in make-distribution.sh  (was: Update python version 
in make-distribution.sh)

> Use python3 in make-distribution.sh
> ---
>
> Key: SPARK-30211
> URL: https://issues.apache.org/jira/browse/SPARK-30211
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-30211) Update python version in make-distribution.sh

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30211.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26844
[https://github.com/apache/spark/pull/26844]

> Update python version in make-distribution.sh
> -
>
> Key: SPARK-30211
> URL: https://issues.apache.org/jira/browse/SPARK-30211
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-30214) Support COMMENT ON syntax

2019-12-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-30214:


 Summary: Support COMMENT ON syntax
 Key: SPARK-30214
 URL: https://issues.apache.org/jira/browse/SPARK-30214
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


https://prestosql.io/docs/current/sql/comment.html
https://www.postgresql.org/docs/12/sql-comment.html

We are going to disallow setting reserved properties via dbproperties or 
tblproperties directly; these will need a dedicated clause in the CREATE syntax 
or specific ALTER commands, such as COMMENT ON.
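
A minimal sketch of what the syntax could look like, modeled on the PostgreSQL 
and Presto forms linked above; the final grammar Spark adopts may differ, and 
the object names are illustrative only (an active SparkSession named `spark` is 
assumed).

{code:scala}
// Assumed syntax, following the PostgreSQL form referenced above; not a
// confirmed Spark grammar for this ticket. Names are placeholders.
spark.sql("COMMENT ON DATABASE testdb IS 'the test database'")
spark.sql("COMMENT ON TABLE testdb.cars IS 'vehicle inventory'")
spark.sql("COMMENT ON TABLE testdb.cars IS NULL")  // NULL clears the comment
{code}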






[jira] [Created] (SPARK-30213) Remove the mutable status in QueryStage when enable AQE

2019-12-10 Thread Ke Jia (Jira)
Ke Jia created SPARK-30213:
--

 Summary: Remove the mutable status in QueryStage when enable AQE
 Key: SPARK-30213
 URL: https://issues.apache.org/jira/browse/SPARK-30213
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Currently ShuffleQueryStageExec contains mutable state, e.g. the 
mapOutputStatisticsFuture variable, so that state is not easy to carry over when 
we copy a ShuffleQueryStageExec.
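
An illustrative sketch (not Spark's actual classes) of why a mutable member is 
awkward to carry across copies, and how holding the state in the constructor 
avoids the problem; all names below are invented for the example.

{code:scala}
import scala.concurrent.Future

// Illustration only, not Spark's QueryStage code.
// With a mutable var, a copy silently starts without the field:
class StageWithVar(val id: Int) {
  var statsFuture: Future[Long] = null                   // mutable status, set later
  def copyStage(): StageWithVar = new StageWithVar(id)   // statsFuture is lost here
}

// Keeping the state in the constructor lets copies carry it automatically:
case class StageImmutable(id: Int, statsFuture: Future[Long]) {
  def withId(newId: Int): StageImmutable = copy(id = newId)  // future preserved
}
{code}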






[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2019-12-10 Thread Rinaz Belhaj (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993231#comment-16993231
 ] 

Rinaz Belhaj commented on SPARK-19335:
--

+1 This feature would be very useful. Any updates on this?

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert, is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data.
> Spark DataFrames/Datasets do not currently support an update feature via the 
> JDBC writer, which allows only Overwrite or Append.
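
Until something native exists, a common workaround is to write each partition 
over plain JDBC with an upsert statement. The sketch below is not a proposed 
Spark API: it assumes a PostgreSQL target (INSERT ... ON CONFLICT), a DataFrame 
`df` with columns `id` and `value`, and a placeholder JDBC URL.

{code:scala}
import java.sql.DriverManager
import org.apache.spark.sql.Row

// Hedged workaround sketch: per-partition upsert through plain JDBC.
// Table name, column names, and the connection URL are placeholders.
df.foreachPartition { rows: Iterator[Row] =>
  val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "password")
  val stmt = conn.prepareStatement(
    "INSERT INTO target_table (id, value) VALUES (?, ?) " +
    "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value")
  try {
    conn.setAutoCommit(false)
    rows.foreach { r =>
      stmt.setLong(1, r.getAs[Long]("id"))
      stmt.setString(2, r.getAs[String]("value"))
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}
{code}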






[jira] [Updated] (SPARK-30212) COUNT(DISTINCT) window function should be supported

2019-12-10 Thread Kernel Force (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kernel Force updated SPARK-30212:
-
Summary: COUNT(DISTINCT) window function should be supported  (was: Could 
not use COUNT(DISTINCT) window function in SparkSQL)

> COUNT(DISTINCT) window function should be supported
> ---
>
> Key: SPARK-30212
> URL: https://issues.apache.org/jira/browse/SPARK-30212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4
> Scala 2.11.12
> Hive 2.3.6
>Reporter: Kernel Force
>Priority: Major
>  Labels: SQL, distinct, window_function
>
> Suppose we have a typical table in Hive like below:
> {code:sql}
> CREATE TABLE DEMO_COUNT_DISTINCT (
> demo_date string,
> demo_id string
> );
> {code}
> {noformat}
> +-------------------------------+-----------------------------+
> | demo_count_distinct.demo_date | demo_count_distinct.demo_id |
> +-------------------------------+-----------------------------+
> | 20180301                      | 101                         |
> | 20180301                      | 102                         |
> | 20180301                      | 103                         |
> | 20180401                      | 201                         |
> | 20180401                      | 202                         |
> +-------------------------------+-----------------------------+
> {noformat}
> Now I want to count the distinct number of DEMO_DATE values while also keeping 
> every column's data in each row.
> So I use the COUNT(DISTINCT) window function like below in Hive Beeline, and it 
> works:
> {code:sql}
> SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>  FROM DEMO_COUNT_DISTINCT T;
> {code}
> {noformat}
> +-------------+-----------+------------+
> | t.demo_date | t.demo_id | uniq_dates |
> +-------------+-----------+------------+
> | 20180401    | 202       | 2          |
> | 20180401    | 201       | 2          |
> | 20180301    | 103       | 2          |
> | 20180301    | 102       | 2          |
> | 20180301    | 101       | 2          |
> +-------------+-----------+------------+
> {noformat}
> But when I run the same SQL in Spark SQL, it throws an exception.
> {code:sql}
> spark.sql("""
> SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>  FROM DEMO_COUNT_DISTINCT T
> """).show
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Distinct window functions are not 
> supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> unboundedfollowing$()));;
> Project [demo_date#1, demo_id#2, UNIQ_DATES#0L]
> +- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L]
>  +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS UNIQ_DATES#0L], [null]
>  +- Project [demo_date#1, demo_id#2]
>  +- SubqueryAlias `T`
>  +- SubqueryAlias `default`.`demo_count_distinct`
>  +- HiveTableRelation `default`.`demo_count_distinct`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2]
> {noformat}
> Then I tried the countDistinct function but also got an exception.
> {code:sql}
> spark.sql("""
> SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>  FROM DEMO_COUNT_DISTINCT T
> """).show
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 2 pos 12
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
>  at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
>  ..
> {noformat}






[jira] [Updated] (SPARK-30212) Could not use COUNT(DISTINCT) window function in SparkSQL

2019-12-10 Thread Kernel Force (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kernel Force updated SPARK-30212:
-
Labels: SQL distinct window_function  (was: )

> Could not use COUNT(DISTINCT) window function in SparkSQL
> -
>
> Key: SPARK-30212
> URL: https://issues.apache.org/jira/browse/SPARK-30212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4
> Scala 2.11.12
> Hive 2.3.6
>Reporter: Kernel Force
>Priority: Major
>  Labels: SQL, distinct, window_function
>
> Suppose we have a typical table in Hive like below:
> {code:sql}
> CREATE TABLE DEMO_COUNT_DISTINCT (
> demo_date string,
> demo_id string
> );
> {code}
> {noformat}
> +-------------------------------+-----------------------------+
> | demo_count_distinct.demo_date | demo_count_distinct.demo_id |
> +-------------------------------+-----------------------------+
> | 20180301                      | 101                         |
> | 20180301                      | 102                         |
> | 20180301                      | 103                         |
> | 20180401                      | 201                         |
> | 20180401                      | 202                         |
> +-------------------------------+-----------------------------+
> {noformat}
> Now I want to count the distinct number of DEMO_DATE values while also keeping 
> every column's data in each row.
> So I use the COUNT(DISTINCT) window function like below in Hive Beeline, and it 
> works:
> {code:sql}
> SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>  FROM DEMO_COUNT_DISTINCT T;
> {code}
> {noformat}
> +-------------+-----------+------------+
> | t.demo_date | t.demo_id | uniq_dates |
> +-------------+-----------+------------+
> | 20180401    | 202       | 2          |
> | 20180401    | 201       | 2          |
> | 20180301    | 103       | 2          |
> | 20180301    | 102       | 2          |
> | 20180301    | 101       | 2          |
> +-------------+-----------+------------+
> {noformat}
> But when I run the same SQL in Spark SQL, it throws an exception.
> {code:sql}
> spark.sql("""
> SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>  FROM DEMO_COUNT_DISTINCT T
> """).show
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Distinct window functions are not 
> supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> unboundedfollowing$()));;
> Project [demo_date#1, demo_id#2, UNIQ_DATES#0L]
> +- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L]
>  +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS UNIQ_DATES#0L], [null]
>  +- Project [demo_date#1, demo_id#2]
>  +- SubqueryAlias `T`
>  +- SubqueryAlias `default`.`demo_count_distinct`
>  +- HiveTableRelation `default`.`demo_count_distinct`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2]
> {noformat}
> Then I tried the countDistinct function but also got an exception.
> {code:sql}
> spark.sql("""
> SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>  FROM DEMO_COUNT_DISTINCT T
> """).show
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 2 pos 12
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
>  at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
>  ..
> {noformat}






[jira] [Created] (SPARK-30212) Could not use COUNT(DISTINCT) window function in SparkSQL

2019-12-10 Thread Dilly King (Jira)
Dilly King created SPARK-30212:
--

 Summary: Could not use COUNT(DISTINCT) window function in SparkSQL
 Key: SPARK-30212
 URL: https://issues.apache.org/jira/browse/SPARK-30212
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: Spark 2.4.4

Scala 2.11.12

Hive 2.3.6
Reporter: Dilly King


Suppose we have a typical table in Hive like below:

{code:sql}
CREATE TABLE DEMO_COUNT_DISTINCT (
demo_date string,
demo_id string
);
{code}

{noformat}
+-------------------------------+-----------------------------+
| demo_count_distinct.demo_date | demo_count_distinct.demo_id |
+-------------------------------+-----------------------------+
| 20180301                      | 101                         |
| 20180301                      | 102                         |
| 20180301                      | 103                         |
| 20180401                      | 201                         |
| 20180401                      | 202                         |
+-------------------------------+-----------------------------+
{noformat}


Now I want to count the distinct number of DEMO_DATE values while also keeping 
every column's data in each row.
So I use the COUNT(DISTINCT) window function like below in Hive Beeline, and it works:

{code:sql}
SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
 FROM DEMO_COUNT_DISTINCT T;
{code}

{noformat}
+-------------+-----------+------------+
| t.demo_date | t.demo_id | uniq_dates |
+-------------+-----------+------------+
| 20180401    | 202       | 2          |
| 20180401    | 201       | 2          |
| 20180301    | 103       | 2          |
| 20180301    | 102       | 2          |
| 20180301    | 101       | 2          |
+-------------+-----------+------------+
{noformat}


But when I run the same SQL in Spark SQL, it throws an exception.

{code:sql}
spark.sql("""
SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
 FROM DEMO_COUNT_DISTINCT T
""").show
{code}

{noformat}
org.apache.spark.sql.AnalysisException: Distinct window functions are not 
supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()));;
Project [demo_date#1, demo_id#2, UNIQ_DATES#0L]
+- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L]
 +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
AS UNIQ_DATES#0L], [null]
 +- Project [demo_date#1, demo_id#2]
 +- SubqueryAlias `T`
 +- SubqueryAlias `default`.`demo_count_distinct`
 +- HiveTableRelation `default`.`demo_count_distinct`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2]
{noformat}


Then I tried the countDistinct function but also got an exception.

{code:sql}
spark.sql("""
SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
 FROM DEMO_COUNT_DISTINCT T
""").show
{code}

{noformat}
org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. 
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.; line 2 pos 12
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
 at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
 ..
{noformat}
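
A workaround that does run on current Spark is to emulate the distinct count 
with collect_set over the same window; the sketch below uses the table from this 
report and is only a stopgap, not a fix for the unsupported feature.

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, size}

// size(collect_set(...)) over an all-rows window yields the same per-row
// distinct count that COUNT(DISTINCT ...) OVER (...) would.
val w = Window.partitionBy()  // like PARTITION BY NULL: a single global partition
spark.table("DEMO_COUNT_DISTINCT")
  .withColumn("UNIQ_DATES", size(collect_set(col("DEMO_DATE")).over(w)))
  .show()
{code}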







[jira] [Created] (SPARK-30211) Update python version in make-distribution.sh

2019-12-10 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30211:
---

 Summary: Update python version in make-distribution.sh
 Key: SPARK-30211
 URL: https://issues.apache.org/jira/browse/SPARK-30211
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Yuming Wang
Assignee: Yuming Wang









[jira] [Resolved] (SPARK-30204) Support for config Pod DNS for Kubernetes

2019-12-10 Thread vanderliang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vanderliang resolved SPARK-30204.
-
Resolution: Fixed

> Support for config Pod DNS for Kubernetes
> -
>
> Key: SPARK-30204
> URL: https://issues.apache.org/jira/browse/SPARK-30204
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: vanderliang
>Priority: Major
>
> Currently we cannot configure the pod DNS nameservers and searches when 
> submitting a job via the CLI for Kubernetes. However, this is a common scenario 
> for hybrid cloud, where we use public cloud compute resources together with a 
> private DNS.
>  
> {code:yaml}
> # code placeholder: example pod spec showing the dnsConfig fields
> apiVersion: v1
> kind: Pod
> metadata:
>   namespace: default
>   name: dns-example
> spec:
>   containers:
> - name: test
>   image: nginx
>   dnsConfig:
> nameservers:
>   - 1.2.3.4
> searches:
>   - ns1.svc.cluster-domain.example
>   - my.dns.search.suffix
> options:
>   - name: ndots
> value: "2"
>   - name: edns0
> {code}
> As a result, we propose the following properties to specify the pod DNS config 
> (a sketch of their use follows this list):
>  * spark.kubernetes.dnsConfig.nameservers: comma-separated list of the 
> Kubernetes DNS nameservers for the driver and executors.
>  * spark.kubernetes.dnsConfig.searches: comma-separated list of the 
> Kubernetes DNS search domains for the driver and executors.
>  * spark.kubernetes.dnsConfig.options.[OptionVariableName]: adds the DNS 
> option variable specified by OptionVariableName to the driver and executor 
> pods. The user can specify multiple of these to set multiple option variables.
>  
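
A minimal sketch of how the proposed keys might be passed, assuming they are 
adopted exactly as named; the spark.kubernetes.dnsConfig.* names come from this 
proposal and are not existing Spark configuration.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: these keys are the *proposed* names from this ticket.
val spark = SparkSession.builder()
  .config("spark.kubernetes.dnsConfig.nameservers", "1.2.3.4")
  .config("spark.kubernetes.dnsConfig.searches",
    "ns1.svc.cluster-domain.example,my.dns.search.suffix")
  .config("spark.kubernetes.dnsConfig.options.ndots", "2")
  .getOrCreate()
{code}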






[jira] [Assigned] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled

2019-12-10 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29152:
--

Assignee: Rakesh Raushan

> Spark Executor Plugin API shutdown is not proper when dynamic allocation 
> enabled
> 
>
> Key: SPARK-29152
> URL: https://issues.apache.org/jira/browse/SPARK-29152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Major
>
> *Issue Description*
> The Spark Executor Plugin API *does not handle shutdown properly* when dynamic 
> allocation is enabled: the plugin's shutdown method is not invoked when 
> *executors become dead* after the inactive time.
> *Test Precondition*
> 1. Create a plugin and build a jar named SparkExecutorplugin.jar:
> import org.apache.spark.ExecutorPlugin;
> public class ExecutoTest1 implements ExecutorPlugin {
>     public void init() {
>         System.out.println("Executor Plugin Initialised.");
>     }
>     public void shutdown() {
>         System.out.println("Executor plugin closed successfully.");
>     }
> }
> 2. Put the jar in the folder /spark/examples/jars.
> *Test Steps*
> 1. Launch bin/spark-sql with dynamic allocation enabled:
> ./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1 --jars 
> /opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=2 --conf 
> spark.dynamicAllocation.minExecutors=1
> 2. Create a table, insert data, and run select * from tablename.
> 3. Check the Spark UI Jobs tab / SQL tab.
> 4. Check every executor's application log file (the Executors tab lists all 
> executors) for the executor plugin initialization and shutdown messages, for 
> example:
> /yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/
>  stdout
> 5. Wait for the executor to become dead after the inactive time and check the 
> same container log.
> 6. Kill the spark-sql session and check the container log for the executor 
> plugin shutdown message.
> *Expected Output*
> 1. The job should succeed: the create table, insert, and select queries should 
> all succeed.
> 2. While the query runs, every executor's log should contain the plugin init 
> message: "Executor Plugin Initialised."
> 3. Once the executors are dead, the shutdown message should appear in the log 
> file: "Executor plugin closed successfully."
> 4. Once the SQL application is closed, the shutdown message should appear in 
> the log: "Executor plugin closed successfully."
> *Actual Output*
> The shutdown method is not called when an executor becomes dead after the 
> inactive time.
> *Observation*
> Without dynamic allocation the executor plugin works fine, but after enabling 
> dynamic allocation the executor shutdown is not processed.






[jira] [Resolved] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled

2019-12-10 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29152.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26810
[https://github.com/apache/spark/pull/26810]

> Spark Executor Plugin API shutdown is not proper when dynamic allocation 
> enabled
> 
>
> Key: SPARK-29152
> URL: https://issues.apache.org/jira/browse/SPARK-29152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Major
> Fix For: 3.0.0
>
>
> *Issue Description*
> The Spark Executor Plugin API *does not handle shutdown properly* when dynamic 
> allocation is enabled: the plugin's shutdown method is not invoked when 
> *executors become dead* after the inactive time.
> *Test Precondition*
> 1. Create a plugin and build a jar named SparkExecutorplugin.jar:
> import org.apache.spark.ExecutorPlugin;
> public class ExecutoTest1 implements ExecutorPlugin {
>     public void init() {
>         System.out.println("Executor Plugin Initialised.");
>     }
>     public void shutdown() {
>         System.out.println("Executor plugin closed successfully.");
>     }
> }
> 2. Put the jar in the folder /spark/examples/jars.
> *Test Steps*
> 1. Launch bin/spark-sql with dynamic allocation enabled:
> ./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1 --jars 
> /opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=2 --conf 
> spark.dynamicAllocation.minExecutors=1
> 2. Create a table, insert data, and run select * from tablename.
> 3. Check the Spark UI Jobs tab / SQL tab.
> 4. Check every executor's application log file (the Executors tab lists all 
> executors) for the executor plugin initialization and shutdown messages, for 
> example:
> /yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/
>  stdout
> 5. Wait for the executor to become dead after the inactive time and check the 
> same container log.
> 6. Kill the spark-sql session and check the container log for the executor 
> plugin shutdown message.
> *Expected Output*
> 1. The job should succeed: the create table, insert, and select queries should 
> all succeed.
> 2. While the query runs, every executor's log should contain the plugin init 
> message: "Executor Plugin Initialised."
> 3. Once the executors are dead, the shutdown message should appear in the log 
> file: "Executor plugin closed successfully."
> 4. Once the SQL application is closed, the shutdown message should appear in 
> the log: "Executor plugin closed successfully."
> *Actual Output*
> The shutdown method is not called when an executor becomes dead after the 
> inactive time.
> *Observation*
> Without dynamic allocation the executor plugin works fine, but after enabling 
> dynamic allocation the executor shutdown is not processed.






[jira] [Commented] (SPARK-30209) Display stageId, attemptId, taskId with SQL max metric in UI

2019-12-10 Thread Niranjan Artal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993008#comment-16993008
 ] 

Niranjan Artal commented on SPARK-30209:


I am working on it.

> Display stageId, attemptId, taskId with SQL max metric in UI
> 
>
> Key: SPARK-30209
> URL: https://issues.apache.org/jira/browse/SPARK-30209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Niranjan Artal
>Priority: Major
>
> It would be helpful if we could add stageId, stage attemptId, and taskId in the 
> SQL UI for each of the max metric values. These additional metrics help in 
> debugging jobs more quickly. For a given operator, it will be easy to identify 
> from the Spark UI the task which takes the maximum time to complete.






[jira] [Updated] (SPARK-30209) Display stageId, attemptId, taskId with SQL max metric in UI

2019-12-10 Thread Niranjan Artal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranjan Artal updated SPARK-30209:
---
Description: It would be helpful if we could add stageId, stage attemptId 
and taskId for in SQL UI for each of the max metrics values.  These additional 
metrics help in debugging the jobs quicker.  For a  given operator, it will be 
easy to identify the task which is taking maximum time to complete from the 
Spark UI.  (was: It would be helpful if we could add stageId, stage attemptId 
and taskId in SQL UI.  These additional metrics help in debugging the jobs 
quicker.  For a  given operator, it will be easy to identify the task which is 
taking maximum time to complete from the Spark UI.)

> Display stageId, attemptId, taskId with SQL max metric in UI
> 
>
> Key: SPARK-30209
> URL: https://issues.apache.org/jira/browse/SPARK-30209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Niranjan Artal
>Priority: Major
>
> It would be helpful if we could add stageId, stage attemptId, and taskId in the 
> SQL UI for each of the max metric values. These additional metrics help in 
> debugging jobs more quickly. For a given operator, it will be easy to identify 
> from the Spark UI the task which takes the maximum time to complete.






[jira] [Created] (SPARK-30210) Give more informative error for BinaryClassificationEvaluator when data with only one label is provided

2019-12-10 Thread Paul Anzel (Jira)
Paul Anzel created SPARK-30210:
--

 Summary: Give more informative error for 
BinaryClassificationEvaluator when data with only one label is provided
 Key: SPARK-30210
 URL: https://issues.apache.org/jira/browse/SPARK-30210
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.5
 Environment: Pyspark on Databricks
Reporter: Paul Anzel


Hi all,

When I was trying to do some machine learning work with pyspark I ran into a 
confusing error message:
# Model and train/test set generated
evaluator = BinaryClassificationEvaluator(labelCol=label, 
metricName='areaUnderROC')
prediction = model.transform(test_data)
auc = evaluator.evaluate(prediction)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 37 in 
stage 21.0 failed 4 times, most recent failure: Lost task 37.3 in stage 21.0 
(TID 2811, 10.139.65.48, executor 16): java.lang.ArrayIndexOutOfBoundsException
After some investigation, I found that the issue was that the data I was trying 
to predict on had only one label represented, rather than both positive and 
negative labels. Easy enough to fix, but I would like to ask if we could replace 
this error with one that explicitly points out the issue. Would it be acceptable 
to have an ahead-of-time check that ensures all labels are represented? 
Alternatively, can we change the docs for BinaryClassificationEvaluator to 
explain what this error means?
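
A minimal sketch of the ahead-of-time check suggested above, as a caller-side 
guard rather than the evaluator change being requested; the DataFrame and the 
column name "label" are assumptions taken from the snippet above.

{code:scala}
// Fail fast with a clear message when the evaluation data has only one class.
val distinctLabels = prediction.select("label").distinct().count()
require(distinctLabels >= 2,
  s"BinaryClassificationEvaluator needs both classes present, found $distinctLabels label value(s)")
{code}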






[jira] [Updated] (SPARK-30210) Give more informative error for BinaryClassificationEvaluator when data with only one label is provided

2019-12-10 Thread Paul Anzel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Anzel updated SPARK-30210:
---
Description: 
Hi all,

When I was trying to do some machine learning work with pyspark I ran into a 
confusing error message:

{{# Model and train/test set generated...}}
{{ evaluator = BinaryClassificationEvaluator(labelCol=label, 
metricName='areaUnderROC')}}
{{ prediction = model.transform(test_data)}}
{{ auc = evaluator.evaluate(prediction)}}

{{org.apache.spark.SparkException: Job aborted due to stage failure: Task 37 in 
stage 21.0 failed 4 times, most recent failure: Lost task 37.3 in stage 21.0 
(TID 2811, 10.139.65.48, executor 16): 
java.lang.ArrayIndexOutOfBoundsException}}


 After some investigation, I found that the issue was that the data I was 
trying to predict on only had one label represented, rather than both positive 
and negative labels. Easy enough to fix, but I would like to ask if we could 
replace this error with one that explicitly points out the issue. Would it be 
acceptable to have a check ahead of time on labels that ensures all labels are 
represented? Alternately, can we change the docs for 
BinaryClassificationEvaluator to explain what this error means?

  was:
Hi all,

When I was trying to do some machine learning work with pyspark I ran into a 
confusing error message:
# Model and train/test set generated
evaluator = BinaryClassificationEvaluator(labelCol=label, 
metricName='areaUnderROC')
prediction = model.transform(test_data)
auc = evaluator.evaluate(prediction)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 37 in 
stage 21.0 failed 4 times, most recent failure: Lost task 37.3 in stage 21.0 
(TID 2811, 10.139.65.48, executor 16): java.lang.ArrayIndexOutOfBoundsException
After some investigation, I found that the issue was that the data I was trying 
to predict on only had one label represented, rather than both positive and 
negative labels. Easy enough to fix, but I would like to ask if we could 
replace this error with one that explicitly points out the issue. Would it be 
acceptable to have a check ahead of time on labels that ensures all labels are 
represented? Alternately, can we change the docs for 
BinaryClassificationEvaluator to explain what this error means?


> Give more informative error for BinaryClassificationEvaluator when data with 
> only one label is provided
> ---
>
> Key: SPARK-30210
> URL: https://issues.apache.org/jira/browse/SPARK-30210
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.5
> Environment: Pyspark on Databricks
>Reporter: Paul Anzel
>Priority: Minor
>
> Hi all,
> When I was trying to do some machine learning work with pyspark I ran into a 
> confusing error message:
> {{# Model and train/test set generated...}}
> {{ evaluator = BinaryClassificationEvaluator(labelCol=label, 
> metricName='areaUnderROC')}}
> {{ prediction = model.transform(test_data)}}
> {{ auc = evaluator.evaluate(prediction)}}
> {{org.apache.spark.SparkException: Job aborted due to stage failure: Task 37 
> in stage 21.0 failed 4 times, most recent failure: Lost task 37.3 in stage 
> 21.0 (TID 2811, 10.139.65.48, executor 16): 
> java.lang.ArrayIndexOutOfBoundsException}}
>  After some investigation, I found that the issue was that the data I was 
> trying to predict on had only one label represented, rather than both positive 
> and negative labels. Easy enough to fix, but I would like to ask if we could 
> replace this error with one that explicitly points out the issue. Would it be 
> acceptable to have an ahead-of-time check that ensures all labels are 
> represented? Alternatively, can we change the docs for 
> BinaryClassificationEvaluator to explain what this error means?






[jira] [Created] (SPARK-30209) Display stageId, attemptId, taskId with SQL max metric in UI

2019-12-10 Thread Niranjan Artal (Jira)
Niranjan Artal created SPARK-30209:
--

 Summary: Display stageId, attemptId, taskId with SQL max metric in 
UI
 Key: SPARK-30209
 URL: https://issues.apache.org/jira/browse/SPARK-30209
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.0.0
Reporter: Niranjan Artal


It would be helpful if we could add stageId, stage attemptId, and taskId in the 
SQL UI. These additional metrics help in debugging jobs more quickly. For a 
given operator, it will be easy to identify from the Spark UI the task which 
takes the maximum time to complete.






[jira] [Updated] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-10 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21869:
-
Fix Version/s: (was: 3.0.0)

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.






[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-10 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992975#comment-16992975
 ] 

Shixiong Zhu commented on SPARK-21869:
--

Reopened this. https://github.com/apache/spark/pull/25853 has been reverted.

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.






[jira] [Reopened] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-10 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-21869:
--

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.






[jira] [Resolved] (SPARK-29976) Allow speculation even if there is only one task

2019-12-10 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-29976.
---
Fix Version/s: 3.0.0
 Assignee: Yuchen Huo
   Resolution: Fixed

> Allow speculation even if there is only one task
> 
>
> Key: SPARK-29976
> URL: https://issues.apache.org/jira/browse/SPARK-29976
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuchen Huo
>Assignee: Yuchen Huo
>Priority: Major
> Fix For: 3.0.0
>
>
> In the current speculative execution implementation, if there is only one task 
> in the stage, no speculative run is conducted. However, there might be cases 
> where an executor has a problem writing to its disk and just hangs forever. In 
> this case, if the single-task stage gets assigned to the problematic executor, 
> the whole job hangs forever. It would be better if we could run the task on 
> another executor when this happens.
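
For context, a sketch of the speculation settings involved; once single-task 
stages are allowed to speculate, these existing settings would cover them too. 
The values are illustrative, not recommendations.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative values only.
val spark = SparkSession.builder()
  .config("spark.speculation", "true")
  .config("spark.speculation.quantile", "0.75")   // fraction of tasks finished before speculating
  .config("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
  .getOrCreate()
{code}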






[jira] [Created] (SPARK-30208) A race condition when reading from Kafka in PySpark

2019-12-10 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-30208:


 Summary: A race condition when reading from Kafka in PySpark
 Key: SPARK-30208
 URL: https://issues.apache.org/jira/browse/SPARK-30208
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.4
Reporter: Jiawen Zhu


When using PySpark to read from Kafka, there is a race condition in which Spark 
may use a KafkaConsumer from multiple threads at the same time and throw the 
following error:

{code}
java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
multi-threaded access
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:2215)
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2104)
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2059)
at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.close(KafkaDataConsumer.scala:451)
at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$NonCachedKafkaDataConsumer.release(KafkaDataConsumer.scala:508)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.close(KafkaSourceRDD.scala:126)
at 
org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:131)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:130)
at 
org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:162)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:144)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:142)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:142)
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:130)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:155)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

When using PySpark, reading from Kafka actually happens in a separate writer 
thread rather than the task thread. When a task is terminated early (e.g., there 
is a limit operator), the task thread may close the KafkaConsumer while the 
writer thread is still using it.
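
A generic, hedged illustration of the guard pattern that avoids this kind of 
race; this is not Spark's KafkaDataConsumer code, and the class below is 
invented for the example. The idea is that a close requested while another 
thread still holds the resource is deferred until release.

{code:scala}
// Illustration only: defer close() while the resource is in use.
final class GuardedResource[R <: AutoCloseable](resource: R) {
  private var inUse = false
  private var closeRequested = false

  def acquire(): R = synchronized { inUse = true; resource }

  def release(): Unit = synchronized {
    inUse = false
    if (closeRequested) resource.close()  // deferred close happens here
  }

  def close(): Unit = synchronized {
    if (inUse) closeRequested = true      // another thread is still using it
    else resource.close()
  }
}
{code}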






[jira] [Resolved] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30205.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26835

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Karthikeyan Singaravelan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Importing ABCs from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This will cause an ImportError for 
> pyspark in Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks






[jira] [Assigned] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30205:
-

Assignee: Karthikeyan Singaravelan

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Karthikeyan Singaravelan
>Assignee: Karthikeyan Singaravelan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Importing ABCs from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This will cause an ImportError for 
> pyspark in Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks






[jira] [Commented] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions

2019-12-10 Thread Matt Boegner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992835#comment-16992835
 ] 

Matt Boegner commented on SPARK-30130:
--

[~Ankitraj] apologies, a typo was introduced when I copied the sample queries 
into the Jira code block. The query has been edited and should generate the 
error. Let me know if you have any questions.

> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions
> 
>
> Key: SPARK-30130
> URL: https://issues.apache.org/jira/browse/SPARK-30130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Matt Boegner
>Priority: Minor
>
> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions.
> {code:java}
> val df = spark.sql("""
>  with a as (select 0 as test, count(*) group by test)
>  select * from a
>  """)
>  df.show(){code}
> This results in an error message like {color:#e01e5a}GROUP BY position 0 is 
> not in select list (valid range is [1, 2]){color} .
>  
> However, this error does not appear in a traditional subselect format. For 
> example, this query executes correctly:
> {code:java}
> val df = spark.sql("""
>  select * from (select 0 as test, count(*) group by test) a
>  """)
>  df.show(){code}
>  
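
If the ordinal resolution is indeed what misfires here, as the error text 
suggests, one hedged way to check (and temporarily work around it) is to turn 
off ordinal GROUP BY resolution; this is a diagnostic suggestion, not a 
confirmed root cause.

{code:scala}
// spark.sql.groupByOrdinal controls whether integer literals in GROUP BY are
// treated as select-list positions; disabling it avoids the ordinal
// interpretation (at the cost of losing ordinal GROUP BY elsewhere).
spark.conf.set("spark.sql.groupByOrdinal", "false")
spark.sql("""
  with a as (select 0 as test, count(*) group by test)
  select * from a
""").show()
{code}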






[jira] [Updated] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions

2019-12-10 Thread Matt Boegner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Boegner updated SPARK-30130:
-
Description: 
Hardcoded numeric values in common table expressions which utilize GROUP BY are 
interpreted as ordinal positions.
{code:java}
val df = spark.sql("""
 with a as (select 0 as test, count(*) group by test)
 select * from a
 """)
 df.show(){code}
This results in an error message like {color:#e01e5a}GROUP BY position 0 is not 
in select list (valid range is [1, 2]){color} .

 

However, this error does not appear in a traditional subselect format. For 
example, this query executes correctly:
{code:java}
val df = spark.sql("""
 select * from (select 0 as test, count(*) group by test) a
 """)
 df.show(){code}
 

  was:
Hardcoded numeric values in common table expressions which utilize GROUP BY are 
interpreted as ordinal positions. 
{code:java}
val df = spark.sql("""
 with a as (select 0 as test, count group by test)
 select * from a
 """)
 df.show(){code}

 This results in an error message like {color:#e01e5a}GROUP BY position 0 is 
not in select list (valid range is [1, 2]){color} .

 

However, this error does not appear in a traditional subselect format. For 
example, this query executes correctly:
{code:java}
val df = spark.sql("""
 select * from (select 0 as test, count group by test) a
 """)
 df.show(){code}

  


> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions
> 
>
> Key: SPARK-30130
> URL: https://issues.apache.org/jira/browse/SPARK-30130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Matt Boegner
>Priority: Minor
>
> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions.
> {code:java}
> val df = spark.sql("""
>  with a as (select 0 as test, count(*) group by test)
>  select * from a
>  """)
>  df.show(){code}
> This results in an error message like {color:#e01e5a}GROUP BY position 0 is 
> not in select list (valid range is [1, 2]){color} .
>  
> However, this error does not appear in a traditional subselect format. For 
> example, this query executes correctly:
> {code:java}
> val df = spark.sql("""
>  select * from (select 0 as test, count(*) group by test) a
>  """)
>  df.show(){code}
>  






[jira] [Assigned] (SPARK-29587) Real data type is not supported in Spark SQL which is supporting in postgresql

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29587:
---

Assignee: Kent Yao

> Real data type is not supported in Spark SQL which is supporting in postgresql
> --
>
> Key: SPARK-29587
> URL: https://issues.apache.org/jira/browse/SPARK-29587
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Assignee: Kent Yao
>Priority: Minor
>
> The REAL data type is not supported in Spark SQL, while it is supported in 
> PostgreSQL.
> +*In PostgreSQL the query succeeds:*+
> CREATE TABLE weather2(prcp real);
> insert into weather2 values(2.5);
> select * from weather2;
>  
> ||  ||prcp||
> |1|2,5|
> +*In Spark SQL we get an error:*+
> spark-sql> CREATE TABLE weather2(prcp real);
> Error in query:
> DataType real is not supported.(line 1, pos 27)
> == SQL ==
> CREATE TABLE weather2(prcp real)
> ---
> It would be better to add support for the "real" datatype in Spark SQL as well.
>  
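
A hedged note until REAL is accepted: FLOAT is Spark SQL's existing 4-byte 
floating-point type, which is what REAL denotes in PostgreSQL, so the reported 
DDL can be written today as below (USING parquet is added only to keep the 
example self-contained).

{code:scala}
// Workaround sketch while REAL is unsupported: use FLOAT instead.
spark.sql("CREATE TABLE weather2 (prcp FLOAT) USING parquet")
spark.sql("INSERT INTO weather2 VALUES (2.5)")
spark.sql("SELECT * FROM weather2").show()
{code}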






[jira] [Resolved] (SPARK-29587) Real data type is not supported in Spark SQL which is supporting in postgresql

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29587.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Real data type is not supported in Spark SQL which is supporting in postgresql
> --
>
> Key: SPARK-29587
> URL: https://issues.apache.org/jira/browse/SPARK-29587
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> The REAL data type is not supported in Spark SQL, while it is supported in 
> PostgreSQL.
> +*In PostgreSQL the query succeeds:*+
> CREATE TABLE weather2(prcp real);
> insert into weather2 values(2.5);
> select * from weather2;
>  
> ||  ||prcp||
> |1|2,5|
> +*In Spark SQL we get an error:*+
> spark-sql> CREATE TABLE weather2(prcp real);
> Error in query:
> DataType real is not supported.(line 1, pos 27)
> == SQL ==
> CREATE TABLE weather2(prcp real)
> ---
> It would be better to add support for the "real" datatype in Spark SQL as well.
>  






[jira] [Reopened] (SPARK-29587) Real data type is not supported in Spark SQL which is supporting in postgresql

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-29587:
-

> Real data type is not supported in Spark SQL which is supporting in postgresql
> --
>
> Key: SPARK-29587
> URL: https://issues.apache.org/jira/browse/SPARK-29587
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Assignee: Kent Yao
>Priority: Minor
>
> The REAL data type is not supported in Spark SQL, while it is supported in 
> PostgreSQL.
> +*In PostgreSQL the query succeeds:*+
> CREATE TABLE weather2(prcp real);
> insert into weather2 values(2.5);
> select * from weather2;
>  
> ||  ||prcp||
> |1|2,5|
> +*In Spark SQL we get an error:*+
> spark-sql> CREATE TABLE weather2(prcp real);
> Error in query:
> DataType real is not supported.(line 1, pos 27)
> == SQL ==
> CREATE TABLE weather2(prcp real)
> ---
> It would be better to add support for the "real" datatype in Spark SQL as well.
>  






[jira] [Resolved] (SPARK-30200) Add ExplainMode for Dataset.explain

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30200.
---
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26829

> Add ExplainMode for Dataset.explain
> ---
>
> Key: SPARK-30200
> URL: https://issues.apache.org/jira/browse/SPARK-30200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> This PR adds ExplainMode for explaining a Dataset/DataFrame with a given format 
> mode (ExplainMode). Matching the SQL EXPLAIN command, ExplainMode has five 
> modes: Simple, Extended, Codegen, Cost, and Formatted.
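
A short usage sketch, assuming the mode is selected by passing one of the names 
above (lower-cased) to Dataset.explain; the query itself is arbitrary.

{code:scala}
import org.apache.spark.sql.functions.col

// Pick the explain format per call.
val df = spark.range(10).groupBy((col("id") % 3).as("k")).count()
df.explain("formatted")  // also: "simple", "extended", "codegen", "cost"
{code}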






[jira] [Created] (SPARK-30207) Enhance the SQL NULL Semantics document

2019-12-10 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-30207:
---

 Summary: Enhance the SQL NULL Semantics document
 Key: SPARK-30207
 URL: https://issues.apache.org/jira/browse/SPARK-30207
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Yuanjian Li


Enhance the SQL NULL semantics document, sql-ref-null-semantics.html.

Clarify the behavior of `UNKNOWN` for both the `EXISTS` and `IN` operations.
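
A small illustration of the UNKNOWN case the doc change targets: a NULL inside 
an IN list turns a non-matching predicate into UNKNOWN (shown as NULL), so 
neither the predicate nor its negation holds.

{code:scala}
// Both expressions evaluate to NULL (UNKNOWN), not false.
spark.sql("SELECT 1 IN (2, NULL) AS in_result, 1 NOT IN (2, NULL) AS not_in_result").show()
{code}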






[jira] [Resolved] (SPARK-30125) Remove PostgreSQL dialect

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30125.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26763
[https://github.com/apache/spark/pull/26763]

> Remove PostgreSQL dialect
> -
>
> Key: SPARK-30125
> URL: https://issues.apache.org/jira/browse/SPARK-30125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> As discussed in 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html],
>  we need to remove the PostgreSQL dialect from the code base for several reasons:
> 1. The current approach makes the codebase complicated and hard to maintain.
> 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.
>  
> Currently we have 3 features under the PostgreSQL dialect:
> 1. SPARK-27931: when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. 
> are also allowed as true strings.
> 2. SPARK-29364: `date - date` returns an interval in Spark (SQL standard 
> behavior), but returns an int in PostgreSQL.
> 3. SPARK-28395: `int / int` returns a double in Spark, but returns an int in 
> PostgreSQL. (There is no standard.)
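
For reference, a quick illustration of the Spark-side behavior of items 2 and 3 
above, which remain the defaults once the dialect is removed; the literal dates 
are arbitrary.

{code:scala}
spark.sql("SELECT 3 / 2").show()
// int / int yields a double (1.5); PostgreSQL would return the int 1.

spark.sql("SELECT DATE'2019-12-11' - DATE'2019-12-10'").show()
// date - date yields an interval (SQL standard), not an int as in PostgreSQL.
{code}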






[jira] [Assigned] (SPARK-30125) Remove PostgreSQL dialect

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30125:
---

Assignee: Yuanjian Li

> Remove PostgreSQL dialect
> -
>
> Key: SPARK-30125
> URL: https://issues.apache.org/jira/browse/SPARK-30125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> As the discussion in 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html],
>  we need to remove the PostgreSQL dialect from the code base for several reasons:
> 1. The current approach makes the codebase complicated and hard to maintain.
> 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.
>  
> Currently we have 3 features under the PostgreSQL dialect:
> 1. SPARK-27931: when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. 
> are also allowed as true strings.
> 2. SPARK-29364: `date - date` returns an interval in Spark (SQL standard 
> behavior), but returns an int in PostgreSQL.
> 3. SPARK-28395: `int / int` returns a double in Spark, but returns an int in 
> PostgreSQL. (there is no standard)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30205:
--
Labels:   (was: python3)

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Python version : 3.9
> Operating System : Linux
>Reporter: Karthikeyan Singaravelan
>Priority: Minor
>
> Importing ABC from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This causes an ImportError for 
> PySpark on Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks
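
For reference, a minimal sketch of the kind of change involved (my illustration, 
not the actual patch):
{code:python}
# Deprecated since Python 3.4 and removed in Python 3.9:
#   from collections import Iterable
# Works on Python 3.3+ up to and including 3.9:
from collections.abc import Iterable

class ResultIterable(Iterable):
    """Simplified stand-in for pyspark.resultiterable.ResultIterable."""
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)
{code}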



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30205:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Python version : 3.9
> Operating System : Linux
>Reporter: Karthikeyan Singaravelan
>Priority: Minor
>  Labels: python3
>
> Importing ABC from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This causes an ImportError for 
> PySpark on Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30205:
--
Environment: (was: Python version : 3.9
Operating System : Linux)

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Karthikeyan Singaravelan
>Priority: Minor
>
> Importing ABC from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This causes an ImportError for 
> PySpark on Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30205:
--
Issue Type: Improvement  (was: Bug)

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.4
> Environment: Python version : 3.9
> Operating System : Linux
>Reporter: Karthikeyan Singaravelan
>Priority: Minor
>  Labels: python3
>
> Importing ABC from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This causes an ImportError for 
> PySpark on Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30205) Import ABC from collections.abc to remove deprecation warnings

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30205:
--
Summary: Import ABC from collections.abc to remove deprecation warnings  
(was: Importing ABC from collections module is removed in Python 3.9)

> Import ABC from collections.abc to remove deprecation warnings
> --
>
> Key: SPARK-30205
> URL: https://issues.apache.org/jira/browse/SPARK-30205
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
> Environment: Python version : 3.9
> Operating System : Linux
>Reporter: Karthikeyan Singaravelan
>Priority: Minor
>  Labels: python3
>
> Importing ABC from the collections module directly has been deprecated since 
> Python 3.4 and is removed in Python 3.9. This causes an ImportError for 
> PySpark on Python 3.9 in the resultiterable module, where Iterable is imported 
> from collections at 
> https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
>  
> Relevant CPython PR : https://github.com/python/cpython/pull/10596.
> I am a new contributor and would like to work on this issue.
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30197) Add minimum `requirements-dev.txt` file to `python` directory

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30197:
--
Summary: Add minimum `requirements-dev.txt` file to `python` directory  
(was: Add `requirements.txt` file to `python` directory)

> Add minimum `requirements-dev.txt` file to `python` directory
> -
>
> Key: SPARK-30197
> URL: https://issues.apache.org/jira/browse/SPARK-30197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-30197) Add `requirements.txt` file to `python` directory

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30197.
-

> Add `requirements.txt` file to `python` directory
> -
>
> Key: SPARK-30197
> URL: https://issues.apache.org/jira/browse/SPARK-30197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30197) Add `requirements.txt` file to `python` directory

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30197:
--
Priority: Minor  (was: Major)

> Add `requirements.txt` file to `python` directory
> -
>
> Key: SPARK-30197
> URL: https://issues.apache.org/jira/browse/SPARK-30197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30197) Add `requirements.txt` file to `python` directory

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30197.
---
Resolution: Won't Do

> Add `requirements.txt` file to `python` directory
> -
>
> Key: SPARK-30197
> URL: https://issues.apache.org/jira/browse/SPARK-30197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30206) Rename normalizeFilters in DataSourceStrategy to be generic

2019-12-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30206.
---
Fix Version/s: 3.0.0
 Assignee: Anton Okolnychyi
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26830

> Rename normalizeFilters in DataSourceStrategy to be generic
> ---
>
> Key: SPARK-30206
> URL: https://issues.apache.org/jira/browse/SPARK-30206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Anton Okolnychyi
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30206) Rename normalizeFilters in DataSourceStrategy to be generic

2019-12-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30206:
-

 Summary: Rename normalizeFilters in DataSourceStrategy to be 
generic
 Key: SPARK-30206
 URL: https://issues.apache.org/jira/browse/SPARK-30206
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29967) KMeans support instance weighting

2019-12-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29967.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26739
[https://github.com/apache/spark/pull/26739]

> KMeans support instance weighting
> -
>
> Key: SPARK-29967
> URL: https://issues.apache.org/jira/browse/SPARK-29967
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Since https://issues.apache.org/jira/browse/SPARK-9610, we have started to 
> support instance weighting in ML.
> However, clustering and some other feature implementations still do not 
> support instance weighting.
> I think we need to start supporting weighting in KMeans, like scikit-learn 
> does.
> It will contain three parts:
> 1, move the implementation from .mllib to .ml
> 2, make .mllib.KMeans a wrapper of .ml.KMeans
> 3, support instance weighting in .ml.KMeans
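
A short sketch of how the new weighting support might be used from PySpark once 
this lands (the `weightCol` parameter is the one added here; the data is made 
up):
{code:python}
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]), 1.0),
     (Vectors.dense([0.1, 0.1]), 1.0),
     (Vectors.dense([9.0, 9.0]), 5.0)],   # the heavier point pulls its center
    ["features", "weight"])

kmeans = KMeans(k=2, seed=1, weightCol="weight")
model = kmeans.fit(df)
print(model.clusterCenters())
{code}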



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29967) KMeans support instance weighting

2019-12-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29967:


Assignee: Huaxin Gao

> KMeans support instance weighting
> -
>
> Key: SPARK-29967
> URL: https://issues.apache.org/jira/browse/SPARK-29967
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
>
> Since https://issues.apache.org/jira/browse/SPARK-9610, we have started to 
> support instance weighting in ML.
> However, clustering and some other feature implementations still do not 
> support instance weighting.
> I think we need to start supporting weighting in KMeans, like scikit-learn 
> does.
> It will contain three parts:
> 1, move the implementation from .mllib to .ml
> 2, make .mllib.KMeans a wrapper of .ml.KMeans
> 3, support instance weighting in .ml.KMeans



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20135) spark thriftserver2: no job running but containers not release on yarn

2019-12-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992644#comment-16992644
 ] 

angerszhu commented on SPARK-20135:
---

I met the same problem in Spark 2.4.0.
[~xwc3504]
Do you have any ideas now?

> spark thriftserver2: no job running but containers not release on yarn
> --
>
> Key: SPARK-20135
> URL: https://issues.apache.org/jira/browse/SPARK-20135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark 2.0.1 with hadoop 2.6.0 
>Reporter: bruce xu
>Priority: Major
> Attachments: 0329-1.png, 0329-2.png, 0329-3.png
>
>
> I enabled the executor dynamic allocation feature, however it doesn't work 
> sometimes.
> I set the initial executor number to 50; after the job finished, the cores and 
> memory resources were not released. 
> In the Spark web UI, the active job/running task/stage count is 0, but the 
> executors page shows 1276 cores and 7288 active tasks.
> In the YARN web UI, the thriftserver job still holds 639 running containers 
> without releasing them. 
> This may be a bug. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30204) Support for config Pod DNS for Kubernetes

2019-12-10 Thread vanderliang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vanderliang updated SPARK-30204:

Description: 
Currently we cannot configure the pod DNS nameservers and searches when 
submitting a job via the CLI for Kubernetes. However, this is a common scenario 
for hybrid clouds, where we use public cloud compute resources together with a 
private DNS. 

 
{code:java}
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
- name: test
  image: nginx
  dnsConfig:
nameservers:
  - 1.2.3.4
searches:
  - ns1.svc.cluster-domain.example
  - my.dns.search.suffix
options:
  - name: ndots
value: "2"
  - name: edns0
{code}
As a result, we can use the following property to specify the pod dns config.
 * spark.kubernetes.dnsConfig.nameservers, Comma separated list of the 
Kubernetes dns nameservers for driver and executor.
 * spark.kubernetes.dnsConfig.searches, Comma separated list of the Kubernetes 
dns searches for driver and executor.
 * spark.kubernetes.dnsConfig.options.[OptionVariableName], Add the dns option 
variable specified by OptionVariableName to the Driver And Executor process. 
The user can specify multiple of these to set multiple options variables.

 

  was:
Currently we cannot configure the pod DNS nameservers, searches, and options 
when submitting a job via the CLI for Kubernetes. However, this is a common 
scenario for hybrid clouds, where we use public cloud compute resources together 
with a private DNS. 
{code:java}
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
- name: test
  image: nginx
  dnsConfig:
nameservers:
  - 1.2.3.4
searches:
  - ns1.svc.cluster-domain.example
  - my.dns.search.suffix
options:
  - name: ndots
value: "2"
  - name: edns0
{code}


> Support for config Pod DNS for Kubernetes
> -
>
> Key: SPARK-30204
> URL: https://issues.apache.org/jira/browse/SPARK-30204
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: vanderliang
>Priority: Major
>
> Currently we cannot configure the pod DNS nameservers and searches when 
> submitting a job via the CLI for Kubernetes. However, this is a common 
> scenario for hybrid clouds, where we use public cloud compute resources 
> together with a private DNS. 
>  
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   namespace: default
>   name: dns-example
> spec:
>   containers:
> - name: test
>   image: nginx
>   dnsConfig:
> nameservers:
>   - 1.2.3.4
> searches:
>   - ns1.svc.cluster-domain.example
>   - my.dns.search.suffix
> options:
>   - name: ndots
> value: "2"
>   - name: edns0
> {code}
> As a result, we can use the following property to specify the pod dns config.
>  * spark.kubernetes.dnsConfig.nameservers, Comma separated list of the 
> Kubernetes dns nameservers for driver and executor.
>  * spark.kubernetes.dnsConfig.searches, Comma separated list of the 
> Kubernetes dns searches for driver and executor.
>  * spark.kubernetes.dnsConfig.options.[OptionVariableName], Add the dns 
> option variable specified by OptionVariableName to the Driver And Executor 
> process. The user can specify multiple of these to set multiple options 
> variables.
>  
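
If the proposal is adopted, usage from PySpark might look roughly like this (the 
spark.kubernetes.dnsConfig.* keys are only proposed in this ticket and do not 
exist yet):
{code:python}
from pyspark.sql import SparkSession

# Hypothetical configuration using the proposed property names.
spark = (SparkSession.builder
         .config("spark.kubernetes.dnsConfig.nameservers", "1.2.3.4")
         .config("spark.kubernetes.dnsConfig.searches",
                 "ns1.svc.cluster-domain.example,my.dns.search.suffix")
         .config("spark.kubernetes.dnsConfig.options.ndots", "2")
         .getOrCreate())
{code}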



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30204) Support for config Pod DNS for Kubernetes

2019-12-10 Thread vanderliang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vanderliang updated SPARK-30204:

Description: 
Currently we cannot configure the pod DNS nameservers, searches, and options 
when submitting a job via the CLI for Kubernetes. However, this is a common 
scenario for hybrid clouds, where we use public cloud compute resources together 
with a private DNS. 
{code:java}
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
- name: test
  image: nginx
  dnsConfig:
nameservers:
  - 1.2.3.4
searches:
  - ns1.svc.cluster-domain.example
  - my.dns.search.suffix
options:
  - name: ndots
value: "2"
  - name: edns0
{code}

  was:
Currently we cannot configure the pod DNS nameservers and searches when 
submitting a job via the CLI for Kubernetes. However, this is a common scenario 
for hybrid clouds, where we use public cloud compute resources together with a 
private DNS. 
{code:java}
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
- name: test
  image: nginx
  dnsConfig:
nameservers:
  - 1.2.3.4
searches:
  - ns1.svc.cluster-domain.example
  - my.dns.search.suffix
options:
  - name: ndots
value: "2"
  - name: edns0
{code}


> Support for config Pod DNS for Kubernetes
> -
>
> Key: SPARK-30204
> URL: https://issues.apache.org/jira/browse/SPARK-30204
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: vanderliang
>Priority: Major
>
> Currently we cannot configure the pod DNS nameservers, searches, and options 
> when submitting a job via the CLI for Kubernetes. However, this is a common 
> scenario for hybrid clouds, where we use public cloud compute resources 
> together with a private DNS. 
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   namespace: default
>   name: dns-example
> spec:
>   containers:
> - name: test
>   image: nginx
>   dnsConfig:
> nameservers:
>   - 1.2.3.4
> searches:
>   - ns1.svc.cluster-domain.example
>   - my.dns.search.suffix
> options:
>   - name: ndots
> value: "2"
>   - name: edns0
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30204) Support for config Pod DNS for Kubernetes

2019-12-10 Thread vanderliang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vanderliang updated SPARK-30204:

Description: 
Currently we cannot configure the pod DNS nameservers and searches when 
submitting a job via the CLI for Kubernetes. However, this is a common scenario 
for hybrid clouds, where we use public cloud compute resources together with a 
private DNS. 
{code:java}
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
- name: test
  image: nginx
  dnsConfig:
nameservers:
  - 1.2.3.4
searches:
  - ns1.svc.cluster-domain.example
  - my.dns.search.suffix
options:
  - name: ndots
value: "2"
  - name: edns0
{code}

  was:Currently we cannot configure the pod DNS nameservers and searches when 
submitting a job via the CLI for Kubernetes. However, this is a common scenario 
for hybrid clouds, where we use public cloud compute resources together with a 
private DNS. 


> Support for config Pod DNS for Kubernetes
> -
>
> Key: SPARK-30204
> URL: https://issues.apache.org/jira/browse/SPARK-30204
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: vanderliang
>Priority: Major
>
> Currently we cannot configure the pod DNS nameservers and searches when 
> submitting a job via the CLI for Kubernetes. However, this is a common 
> scenario for hybrid clouds, where we use public cloud compute resources 
> together with a private DNS. 
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   namespace: default
>   name: dns-example
> spec:
>   containers:
> - name: test
>   image: nginx
>   dnsConfig:
> nameservers:
>   - 1.2.3.4
> searches:
>   - ns1.svc.cluster-domain.example
>   - my.dns.search.suffix
> options:
>   - name: ndots
> value: "2"
>   - name: edns0
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30205) Importing ABC from collections module is removed in Python 3.9

2019-12-10 Thread Karthikeyan Singaravelan (Jira)
Karthikeyan Singaravelan created SPARK-30205:


 Summary: Importing ABC from collections module is removed in 
Python 3.9
 Key: SPARK-30205
 URL: https://issues.apache.org/jira/browse/SPARK-30205
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
 Environment: Python version : 3.9
Operating System : Linux
Reporter: Karthikeyan Singaravelan


Importing ABC from the collections module directly has been deprecated since 
Python 3.4 and is removed in Python 3.9. This causes an ImportError for PySpark 
on Python 3.9 in the resultiterable module, where Iterable is imported from 
collections at 
https://github.com/tirkarthi/spark/blob/aa9da9365ff31948e42ab4c6dcc6cb4cec5fd852/python/pyspark/resultiterable.py#L23.
 

Relevant CPython PR : https://github.com/python/cpython/pull/10596.

I am a new contributor and would like to work on this issue.

Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30204) Support for config Pod DNS for Kubernetes

2019-12-10 Thread vanderliang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vanderliang updated SPARK-30204:

Summary: Support for config Pod DNS for Kubernetes  (was: Support for 
config DNS for Kubernetes)

> Support for config Pod DNS for Kubernetes
> -
>
> Key: SPARK-30204
> URL: https://issues.apache.org/jira/browse/SPARK-30204
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: vanderliang
>Priority: Major
>
> Currently we cannot configure the pod DNS nameservers and searches when 
> submitting a job via the CLI for Kubernetes. However, this is a common 
> scenario for hybrid clouds, where we use public cloud compute resources 
> together with a private DNS. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30204) Support for config DNS for Kubernetes

2019-12-10 Thread vanderliang (Jira)
vanderliang created SPARK-30204:
---

 Summary: Support for config DNS for Kubernetes
 Key: SPARK-30204
 URL: https://issues.apache.org/jira/browse/SPARK-30204
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: vanderliang


Currently we cannot configure the pod DNS nameservers and searches when 
submitting a job via the CLI for Kubernetes. However, this is a common scenario 
for hybrid clouds, where we use public cloud compute resources together with a 
private DNS. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30151) Issue better error message when user-specified schema not match relation schema

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30151:
---

Assignee: wuyi

> Issue better error message when user-specified schema not match relation 
> schema
> ---
>
> Key: SPARK-30151
> URL: https://issues.apache.org/jira/browse/SPARK-30151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> In DataSource.resolveRelation(), when the relation schema does not match the 
> user-specified schema, it raises an exception saying that "$className does not 
> allow user-specified schemas."  However, it does allow a user-specified schema 
> if it matches the relation schema. Instead, we should issue a better error 
> message that tells the user what is really happening here, e.g. point out the 
> mismatched fields to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30151) Issue better error message when user-specified schema not match relation schema

2019-12-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30151.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26781
[https://github.com/apache/spark/pull/26781]

> Issue better error message when user-specified schema not match relation 
> schema
> ---
>
> Key: SPARK-30151
> URL: https://issues.apache.org/jira/browse/SPARK-30151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> In DataSource.resolveRelation(), when the relation schema does not match the 
> user-specified schema, it raises an exception saying that "$className does not 
> allow user-specified schemas."  However, it does allow a user-specified schema 
> if it matches the relation schema. Instead, we should issue a better error 
> message that tells the user what is really happening here, e.g. point out the 
> mismatched fields to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30203) store assignable if there exists an appropriate user-defined cast function

2019-12-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-30203:
-
Description: 
h3. 9.2 Store assignment
h4. Syntax Rules

1) Let T be the TARGET and let V be the VALUE in an application of the Syntax 
Rules of this Subclause.
 2) Let TD and SD be the declared types of T and V, respectively.
 3) If TD is character string, binary string, numeric, boolean, datetime, 
interval, or a user-defined type, then
 either SD shall be assignable to TD or there shall exist an appropriate 
user-defined cast function UDCF
 from SD to TD.
 _NOTE 319 — “Appropriate user-defined cast function” is defined in Subclause 
4.11, “Data conversions”_
h3.  4.11 Data conversions

Implicit type conversion can occur in expressions, fetch operations, single row 
select operations, inserts, deletes,
 and updates. Explicit type conversions can be specified by the use of the CAST 
operator.

 

The current implementation of ANSI store assignment takes this rule out of 
context. 

According to this rule, `there shall exist an appropriate user-defined cast 
function UDCF`, the Spark legacy store assignment is just fine, because we do 
have *appropriate cast functions*.

At least when judged against the ANSI cast rules, the current ANSI assignment 
policy is too strict.

 
{code:java}
* (SD) - (TD) -
* | EN  AN  C  D  T  TS  YM  DT  BO  UDT  B  RT  CT  RW
* EN  |  Y   Y  Y  N  N   N   M   M   N   M   N   M   N   N
* AN  |  Y   Y  Y  N  N   N   N   N   N   M   N   M   N   N
*  C  |  Y   Y  Y  Y  Y   Y   Y   Y   Y   M   N   M   N   N
*  D  |  N   N  Y  Y  N   Y   N   N   N   M   N   M   N   N
*  T  |  N   N  Y  N  Y   Y   N   N   N   M   N   M   N   N
* TS  |  N   N  Y  Y  Y   Y   N   N   N   M   N   M   N   N
* YM  |  M   N  Y  N  N   N   Y   N   N   M   N   M   N   N
* DT  |  M   N  Y  N  N   N   N   Y   N   M   N   M   N   N
* BO  |  N   N  Y  N  N   N   N   N   Y   M   N   M   N   N
* UDT |  M   M  M  M  M   M   M   M   M   M   M   M   M   N
*  B  |  N   N  N  N  N   N   N   N   N   M   Y   M   N   N
* RT  |  M   M  M  M  M   M   M   M   M   M   M   M   N   N
* CT  |  N   N  N  N  N   N   N   N   N   M   N   N   M   N
* RW  |  N   N  N  N  N   N   N   N   N   N   N   N   N   M
*
* Where:
* EN = Exact Numeric
* AN = Approximate Numeric
* C = Character (Fixed- or Variable-Length, or Character Large Object)
* D = Date
* T = Time
* TS = Timestamp
* YM = Year-Month Interval
* DT = Day-Time Interval
* BO = Boolean
* UDT = User-Defined Type
* B = Binary (Fixed- or Variable-Length or Binary Large Object)
* RT = Reference type
* CT = Collection type
* RW = Row type
{code}
 

 

_cc [~cloud_fan] [~gengliang]_ [~maropu] 
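
As a concrete illustration of the gap being described (my own sketch, assuming 
the spark.sql.storeAssignmentPolicy conf and the current 3.0 behavior):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS t (i INT) USING parquet")

# Legacy policy: the string literal is silently cast to int on insert.
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO t VALUES ('1')")

# ANSI policy: the same insert is rejected at analysis time, even though the
# cast matrix above marks character -> exact numeric as a valid ANSI cast.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
try:
    spark.sql("INSERT INTO t VALUES ('1')")
except AnalysisException as e:
    print(e)
{code}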

 

  was:
h3. 9.2 Store assignment
h4. Syntax Rules

1) Let T be the TARGET and let V be the VALUE in an application of the Syntax 
Rules of this Subclause.
 2) Let TD and SD be the declared types of T and V, respectively.
 3) If TD is character string, binary string, numeric, boolean, datetime, 
interval, or a user-defined type, then
 either SD shall be assignable to TD or there shall exist an appropriate 
user-defined cast function UDCF
 from SD to TD.
 _NOTE 319 — “Appropriate user-defined cast function” is defined in Subclause 
4.11, “Data conversions”_
h3.  4.11 Data conversions

Implicit type conversion can occur in expressions, fetch operations, single row 
select operations, inserts, deletes,
and updates. Explicit type conversions can be specified by the use of the CAST 
operator.

 

The current implementation of ANSI store assignment takes this rule out of 
context. 

According to this rule, `there shall exist an appropriate user-defined cast 
function UDCF`, the Spark legacy store assignment is just fine, because we do 
have *appropriate cast functions*.

At least when judged against the ANSI cast rules, the current ANSI assignment 
policy is too strict.

 
{code:java}
* (SD) - (TD) -
* | EN  AN  C  D  T  TS  YM  DT  BO  UDT  B  RT  CT  RW
* EN  |  Y   Y  Y  N  N   N   M   M   N   M   N   M   N   N
* AN  |  Y   Y  Y  N  N   N   N   N   N   M   N   M   N   N
*  C  |  Y   Y  Y  Y  Y   Y   Y   Y   Y   M   N   M   N   N
*  D  |  N   N  Y  Y  N   Y   N   N   N   M   N   M   N   N
*  T  |  N   N  Y  N  Y   Y   N   N   N   M   N   M   N   N
* TS  |  N   N  Y  Y  Y   Y   N   N   N   M   N   M   N   N
* YM  |  M   N  Y  N  N   N   Y   N   N   M   N   M   N   N
* DT  |  M   N  Y  N  N   N   N   Y   N   M   N   M   N   N
* BO  |  N   N  Y  N  N   N   N   N   Y   M   N   M   N   N
* UDT |  M   M  M  M  M   M   M   M   M   M   M   M   M   N
*  B  |  N   N  N  N  N   N   N   N   N   M   Y   M   N   N
* RT  |  M   M  M  M  M   M   M   M   M   M   M   M   N   N
* CT  |  N   N  N  N  N   N   N   N   N   M   N   N   M   N
* RW  |  N   N  N  N  N   N   N   N   N   N   N   N   N   M
*
* Where:
* 

[jira] [Created] (SPARK-30203) store assignable if there exists an appropriate user-defined cast function

2019-12-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-30203:


 Summary: store assignable if there exists an appropriate 
user-defined cast function
 Key: SPARK-30203
 URL: https://issues.apache.org/jira/browse/SPARK-30203
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


h3. 9.2 Store assignment
h4. Syntax Rules

1) Let T be the TARGET and let V be the VALUE in an application of the Syntax 
Rules of this Subclause.
 2) Let TD and SD be the declared types of T and V, respectively.
 3) If TD is character string, binary string, numeric, boolean, datetime, 
interval, or a user-defined type, then
 either SD shall be assignable to TD or there shall exist an appropriate 
user-defined cast function UDCF
 from SD to TD.
 _NOTE 319 — “Appropriate user-defined cast function” is defined in Subclause 
4.11, “Data conversions”_
h3.  4.11 Data conversions

Implicit type conversion can occur in expressions, fetch operations, single row 
select operations, inserts, deletes,
and updates. Explicit type conversions can be specified by the use of the CAST 
operator.

 

The current implementation of ANSI store assignment takes this rule out of 
context. 

According to this rule, `there shall exist an appropriate user-defined cast 
function UDCF`, the Spark legacy store assignment is just fine, because we do 
have *appropriate cast functions*.

At least when judged against the ANSI cast rules, the current ANSI assignment 
policy is too strict.

 
{code:java}
* (SD) - (TD) -
* | EN  AN  C  D  T  TS  YM  DT  BO  UDT  B  RT  CT  RW
* EN  |  Y   Y  Y  N  N   N   M   M   N   M   N   M   N   N
* AN  |  Y   Y  Y  N  N   N   N   N   N   M   N   M   N   N
*  C  |  Y   Y  Y  Y  Y   Y   Y   Y   Y   M   N   M   N   N
*  D  |  N   N  Y  Y  N   Y   N   N   N   M   N   M   N   N
*  T  |  N   N  Y  N  Y   Y   N   N   N   M   N   M   N   N
* TS  |  N   N  Y  Y  Y   Y   N   N   N   M   N   M   N   N
* YM  |  M   N  Y  N  N   N   Y   N   N   M   N   M   N   N
* DT  |  M   N  Y  N  N   N   N   Y   N   M   N   M   N   N
* BO  |  N   N  Y  N  N   N   N   N   Y   M   N   M   N   N
* UDT |  M   M  M  M  M   M   M   M   M   M   M   M   M   N
*  B  |  N   N  N  N  N   N   N   N   N   M   Y   M   N   N
* RT  |  M   M  M  M  M   M   M   M   M   M   M   M   N   N
* CT  |  N   N  N  N  N   N   N   N   N   M   N   N   M   N
* RW  |  N   N  N  N  N   N   N   N   N   N   N   N   N   M
*
* Where:
* EN = Exact Numeric
* AN = Approximate Numeric
* C = Character (Fixed- or Variable-Length, or Character Large Object)
* D = Date
* T = Time
* TS = Timestamp
* YM = Year-Month Interval
* DT = Day-Time Interval
* BO = Boolean
* UDT = User-Defined Type
* B = Binary (Fixed- or Variable-Length or Binary Large Object)
* RT = Reference type
* CT = Collection type
* RW = Row type
{code}
 

 

_cc [~cloud_fan] [~gengliang]_ [~maropu] 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30202) impl QuantileTransform

2019-12-10 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30202:


 Summary: impl QuantileTransform
 Key: SPARK-30202
 URL: https://issues.apache.org/jira/browse/SPARK-30202
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Recently, I encountered some practical scenarios that required mapping data to 
another distribution.

I found that QuantileTransformer in scikit-learn is what I needed: I locally 
fitted a model on a sampled dataset and broadcast it to transform the whole 
dataset in PySpark.

After that, I implemented QuantileTransform as a new Estimator atop Spark. The 
implementation follows scikit-learn's, but there are still several differences:

1, use QuantileSummaries for approximation, regardless of the size of the dataset;

2, use linear interpolation, with logic similar to the existing 
IsotonicRegression, while scikit-learn uses a bi-directional interpolation;

3, when skipZero=true, treat sparse vectors just like dense ones, while 
scikit-learn has two different code paths for sparse and dense datasets.
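
A rough sketch of the driver-side workaround described above (fit scikit-learn's 
QuantileTransformer on a sample, then broadcast it); column and variable names 
are illustrative:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, udf
from pyspark.sql.types import DoubleType
from sklearn.preprocessing import QuantileTransformer

spark = SparkSession.builder.getOrCreate()
df = spark.range(100000).select((rand(seed=1) * rand(seed=2)).alias("x"))

# Fit on a small driver-side sample, then ship the fitted model to executors.
sample = df.sample(fraction=0.01, seed=3).toPandas()[["x"]].to_numpy()
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal").fit(sample)
bc_qt = spark.sparkContext.broadcast(qt)

@udf(returnType=DoubleType())
def quantile_transform(x):
    # Transform one value at a time; fine for a sketch, slow for real workloads.
    return float(bc_qt.value.transform([[x]])[0][0])

df.withColumn("x_transformed", quantile_transform("x")).show(5)
{code}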



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2019-12-10 Thread ulysses you (Jira)
ulysses you created SPARK-30201:
---

 Summary: HiveOutputWriter standardOI should use 
ObjectInspectorCopyOption.DEFAULT
 Key: SPARK-30201
 URL: https://issues.apache.org/jira/browse/SPARK-30201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: ulysses you


Currently Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector 
option, which converts any string to a UTF-8 string. When writing data that is 
not valid UTF-8, `EFBFBD` (the UTF-8 replacement character) appears.
We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through 
unchanged.

Here is the way to reproduce:
1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
2. create table test1 (c string) location '$file_path';
3. select hex(c) from test1; // AABBCC
4. create table test2 (c string) as select c from test1;
5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD
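
The steps above, written out as a PySpark sketch (paths and table names are 
illustrative; step 1 just writes the raw bytes 0xAA 0xBB 0xCC):
{code:python}
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. A file whose content is the raw bytes AA BB CC, which is not valid UTF-8.
os.makedirs("/tmp/non_utf8", exist_ok=True)
with open("/tmp/non_utf8/data", "wb") as f:
    f.write(bytes.fromhex("AABBCC"))

spark.sql("CREATE TABLE test1 (c STRING) LOCATION '/tmp/non_utf8'")
spark.sql("SELECT hex(c) FROM test1").show()   # AABBCC
spark.sql("CREATE TABLE test2 (c STRING) AS SELECT c FROM test1")
# With ObjectInspectorCopyOption.JAVA the bytes are re-encoded as the UTF-8
# replacement character, so this shows EFBFBD... instead of AABBCC.
spark.sql("SELECT hex(c) FROM test2").show()
{code}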




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28664) ORDER BY in aggregate function

2019-12-10 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992427#comment-16992427
 ] 

jiaan.geng commented on SPARK-28664:


[https://github.com/postgres/postgres/blob/44e95b5728a4569c494fa4ea4317f8a2f50a206b/src/test/regress/expected/aggregates.out#L2239]

[~yumwang] I didn't understand the meaning of this syntax. If it is valuable, 
I will work on it.

> ORDER BY in aggregate function
> --
>
> Key: SPARK-28664
> URL: https://issues.apache.org/jira/browse/SPARK-28664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> SELECT min(x ORDER BY y) FROM (VALUES(1, NULL)) AS d(x,y);
> SELECT min(x ORDER BY y) FROM (VALUES(1, 2)) AS d(x,y);
> {code}
> https://github.com/postgres/postgres/blob/44e95b5728a4569c494fa4ea4317f8a2f50a206b/src/test/regress/sql/aggregates.sql#L978-L982



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30092) Number of active tasks is negative in Live UI Executors page

2019-12-10 Thread ZhongYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992401#comment-16992401
 ] 

ZhongYu commented on SPARK-30092:
-

It is hard to give steps that reproduce this issue every time, but the 
following steps reproduce it with relatively high probability.
 # Deploy YARN using AWS EC2 (or other virtual machines)
 # Start a Spark job on YARN in client mode.
 # Stop some YARN EC2 slaves on which the Spark job is running 

> Number of active tasks is negative in Live UI Executors page
> 
>
> Key: SPARK-30092
> URL: https://issues.apache.org/jira/browse/SPARK-30092
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.1
> Environment: Hadoop version: 2.7.3
> ResourceManager version: 2.7.3
>Reporter: ZhongYu
>Priority: Major
> Attachments: wx20191202-102...@2x.png
>
>
> The number of active tasks is negative in the Live UI Executors page when 
> there are executor losses and task failures. I am using Spark on YARN, built 
> on AWS spot instances. When a YARN worker is lost, there is a high probability 
> that the number of active tasks becomes negative in the Spark Live UI.  
> I saw the related tickets below, which were resolved in earlier versions of 
> Spark, but the same thing happened again in Spark 2.4.1. See attachment.
> https://issues.apache.org/jira/browse/SPARK-8560
> https://issues.apache.org/jira/browse/SPARK-10141
> https://issues.apache.org/jira/browse/SPARK-19356



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org