[jira] [Updated] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

2019-05-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27896:
--
Fix Version/s: (was: 2.4.4)

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --
>
> Key: SPARK-27896
> URL: https://issues.apache.org/jira/browse/SPARK-27896
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Reported by Samuel Kubler via email:
> In the code 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
>  I think there is a small mistake in the class “Silhouette” when you 
> calculate the silhouette coefficient for a point. According to the reference 
> paper “Silhouettes: a graphical aid to the interpretation and validation of 
> cluster analysis” (Peter J. Rousseeuw, 1987), for points that are alone in a 
> cluster it is not the currentClusterDissimilarity that is supposed to be set 
> to 0, as is done in your code (“val currentClusterDissimilarity = if 
> (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient 
> itself. Indeed, “When cluster A contains only a single object it is unclear 
> how a(i) should be defined, and then we simply set s(i) equal to zero”.
> The problem with defining currentClusterDissimilarity as zero is that the 
> silhouette coefficient can no longer be used as a criterion to determine the 
> optimal number of clusters, because the algorithm will report that the more 
> clusters you have, the better your clustering is. Indeed, in that case, as 
> the number of clusters increases, s(i) converges toward 1 (so the algorithm 
> appears to be more efficient). I have, besides, checked this result on my 
> own clustering example.
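
A minimal sketch of the corrected rule, under the assumption (from the report and Rousseeuw's paper) that s(i) itself should be zero for a 1-element cluster. This is not a copy of Spark's ClusteringEvaluator code; the method and parameter names are illustrative only:

{code:scala}
// Silhouette of one point: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
// except that a point alone in its cluster gets s(i) = 0 directly,
// rather than forcing a(i) (the within-cluster dissimilarity) to 0.
def pointSilhouette(
    pointClusterNumOfPoints: Long,
    currentClusterDissimilarity: Double,      // a(i)
    neighboringClusterDissimilarity: Double   // b(i)
  ): Double = {
  if (pointClusterNumOfPoints == 1) {
    0.0  // 1-element cluster: the coefficient itself is defined as 0
  } else {
    val a = currentClusterDissimilarity
    val b = neighboringClusterDissimilarity
    (b - a) / math.max(a, b)
  }
}
{code}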






[jira] [Updated] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event

2019-05-31 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-23626:

Affects Version/s: 2.4.3

>  DAGScheduler blocked due to JobSubmitted event
> ---
>
> Key: SPARK-23626
> URL: https://issues.apache.org/jira/browse/SPARK-23626
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.1, 2.3.3, 3.0.0, 2.4.3
>Reporter: Ajith S
>Priority: Major
>
> DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted 
> events have to be processed, because DAGSchedulerEventProcessLoop is 
> single-threaded and blocks other events in the queue, such as TaskCompletion.
> Processing a JobSubmitted event is time-consuming depending on the nature of 
> the job (for example: calculating parent stage dependencies, shuffle 
> dependencies, partitions), and thus it delays all subsequent events.
>  
> I see multiple JIRAs referring to this behavior:
> https://issues.apache.org/jira/browse/SPARK-2647
> https://issues.apache.org/jira/browse/SPARK-4961
>  
> Similarly, in my cluster the partition calculation for some jobs is 
> time-consuming (similar to the stack trace in SPARK-2647), which slows down 
> the DAGSchedulerEventProcessLoop. This causes user jobs to slow down even if 
> their tasks finish within seconds, because TaskCompletion events are 
> processed at a slower rate due to the blockage.
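
A toy sketch of the single-threaded event-loop pattern described above (this is not Spark's DAGSchedulerEventProcessLoop; the class names and timings are illustrative). One slow JobSubmitted handler delays every TaskCompletion event already waiting in the queue:

{code:scala}
import java.util.concurrent.LinkedBlockingQueue

sealed trait Event
case class JobSubmitted(jobId: Int) extends Event
case class TaskCompletion(taskId: Int) extends Event

object SingleThreadedLoopDemo extends App {
  val queue = new LinkedBlockingQueue[Event]()

  // A single thread drains the queue, so a slow handler stalls everything behind it.
  val loop = new Thread(() => while (true) {
    queue.take() match {
      case JobSubmitted(_)    => Thread.sleep(5000) // e.g. expensive stage/partition computation
      case TaskCompletion(id) => println(s"task $id completed")
    }
  })
  loop.setDaemon(true)
  loop.start()

  queue.put(JobSubmitted(1))                          // handled first, takes ~5s
  (1 to 3).foreach(i => queue.put(TaskCompletion(i))) // already finished, but must wait
  Thread.sleep(7000)                                  // keep the demo alive long enough to print
}
{code}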






[jira] [Updated] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event

2019-05-31 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-23626:

Summary:  DAGScheduler blocked due to JobSubmitted event  (was: Spark 
DAGScheduler scheduling performance hindered on JobSubmitted Event)

>  DAGScheduler blocked due to JobSubmitted event
> ---
>
> Key: SPARK-23626
> URL: https://issues.apache.org/jira/browse/SPARK-23626
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.1, 2.3.3, 3.0.0
>Reporter: Ajith S
>Priority: Major
>
> DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted 
> events have to be processed, because DAGSchedulerEventProcessLoop is 
> single-threaded and blocks other events in the queue, such as TaskCompletion.
> Processing a JobSubmitted event is time-consuming depending on the nature of 
> the job (for example: calculating parent stage dependencies, shuffle 
> dependencies, partitions), and thus it delays all subsequent events.
>  
> I see multiple JIRAs referring to this behavior:
> https://issues.apache.org/jira/browse/SPARK-2647
> https://issues.apache.org/jira/browse/SPARK-4961
>  
> Similarly, in my cluster the partition calculation for some jobs is 
> time-consuming (similar to the stack trace in SPARK-2647), which slows down 
> the DAGSchedulerEventProcessLoop. This causes user jobs to slow down even if 
> their tasks finish within seconds, because TaskCompletion events are 
> processed at a slower rate due to the blockage.






[jira] [Updated] (SPARK-23626) Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2019-05-31 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-23626:

Labels:   (was: bulk-closed)

> Spark DAGScheduler scheduling performance hindered on JobSubmitted Event
> 
>
> Key: SPARK-23626
> URL: https://issues.apache.org/jira/browse/SPARK-23626
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.1, 2.3.3, 3.0.0
>Reporter: Ajith S
>Priority: Major
>
> DAGScheduler becomes a bottleneck in a cluster when multiple JobSubmitted 
> events have to be processed, because DAGSchedulerEventProcessLoop is 
> single-threaded and blocks other events in the queue, such as TaskCompletion.
> Processing a JobSubmitted event is time-consuming depending on the nature of 
> the job (for example: calculating parent stage dependencies, shuffle 
> dependencies, partitions), and thus it delays all subsequent events.
>  
> I see multiple JIRAs referring to this behavior:
> https://issues.apache.org/jira/browse/SPARK-2647
> https://issues.apache.org/jira/browse/SPARK-4961
>  
> Similarly, in my cluster the partition calculation for some jobs is 
> time-consuming (similar to the stack trace in SPARK-2647), which slows down 
> the DAGSchedulerEventProcessLoop. This causes user jobs to slow down even if 
> their tasks finish within seconds, because TaskCompletion events are 
> processed at a slower rate due to the blockage.






[jira] [Updated] (SPARK-20856) support statement using nested joins

2019-05-31 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20856:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-27764

> support statement using nested joins
> 
>
> Key: SPARK-20856
> URL: https://issues.apache.org/jira/browse/SPARK-20856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Major
>  Labels: bulk-closed
>
> While DB2, Oracle, etc. support a join expressed as follows, Spark SQL does 
> not.
> Not supported:
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> versus the supported form:
> select * from 
>   cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
> join cert.tbint tbint on tint.rnum = tbint.rnum
>
> ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
> 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', 
> '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', 
> '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)
> == SQL ==
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> -^^^
> , Query: select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum.
> SQLState:  HY000
> ErrorCode: 500051






[jira] [Commented] (SPARK-20856) support statement using nested joins

2019-05-31 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853575#comment-16853575
 ] 

Yuming Wang commented on SPARK-20856:
-

PostgreSQL 11.3 also supports this join case:
{code:sql}
CREATE TABLE J1_TBL (
  i integer,
  j integer,
  t text
);

CREATE TABLE J2_TBL (
  i integer,
  k integer
);

CREATE TABLE J3_TBL (
  i integer,
  k integer
);

INSERT INTO J1_TBL VALUES (1, 4, 'one');
INSERT INTO J2_TBL VALUES (1, -1);
INSERT INTO J3_TBL VALUES (1, -1);

select * from J1_TBL t1 join J2_TBL t2 join J3_TBL t3 on t3.i = t2.i on t2.i = 
t1.i;
{code}
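
For comparison, a hedged sketch of the same nested-ON join against Spark SQL (the table names mirror the PostgreSQL example above); with the current parser this is expected to fail with a ParseException like the one quoted in the issue description:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq((1, 4, "one")).toDF("i", "j", "t").createOrReplaceTempView("J1_TBL")
Seq((1, -1)).toDF("i", "k").createOrReplaceTempView("J2_TBL")
Seq((1, -1)).toDF("i", "k").createOrReplaceTempView("J3_TBL")

spark.sql(
  "select * from J1_TBL t1 join J2_TBL t2 join J3_TBL t3 on t3.i = t2.i on t2.i = t1.i"
).show()
{code}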

> support statement using nested joins
> 
>
> Key: SPARK-20856
> URL: https://issues.apache.org/jira/browse/SPARK-20856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Major
>  Labels: bulk-closed
>
> While DB2, Oracle, etc. support a join expressed as follows, Spark SQL does 
> not.
> Not supported:
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> versus the supported form:
> select * from 
>   cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
> join cert.tbint tbint on tint.rnum = tbint.rnum
>
> ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
> 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', 
> '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', 
> '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)
> == SQL ==
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> -^^^
> , Query: select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum.
> SQLState:  HY000
> ErrorCode: 500051






[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24077:
-
Description: 
The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks 
confusing: 

{code}
scala> 
org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
 TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'")
{code}

{code}
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)

== SQL ==
 CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'
 -^^^
 at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
 at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
 ... 48 elided
{code}



  was:
{code}
scala> 
org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
 TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'")
{code}

{code}
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)

== SQL ==
 CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'
 -^^^
 at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
 at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
 ... 48 elided
{code}

The error message of {{(name = "udf")}} 


> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks 
> confusing: 
> {code}
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> {code}
> {code}
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided
> {code}
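
As a hedged workaround sketch (separate from the better-error-message fix this issue asks for), the existence check can be done through the catalog API before issuing the statement, since {{CREATE TEMPORARY FUNCTION}} does not accept {{IF NOT EXISTS}}. The function name and class below are simply the ones from the report:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()

if (!spark.catalog.functionExists("yuzhouwan")) {
  spark.sql("CREATE TEMPORARY FUNCTION yuzhouwan AS 'org.apache.spark.sql.hive.udf.YuZhouWan'")
}
{code}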






[jira] [Commented] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-05-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853573#comment-16853573
 ] 

Hyukjin Kwon commented on SPARK-24077:
--

Reopened after editing the JIRA.

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks 
> confusing: 
> {code}
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> {code}
> {code}
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided
> {code}






[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24077:
-
Description: 
{code}
scala> 
org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
 TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'")
{code}

{code}
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)

== SQL ==
 CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'
 -^^^
 at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
 at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
 ... 48 elided
{code}

The error message of 

  was:
Why does Spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?

 

scala> 
org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
 TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'")

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)

== SQL ==
 CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'
 -^^^
 at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
 at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
 ... 48 elided


> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> {code}
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> {code}
> {code}
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided
> {code}
> The error message of 






[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24077:
-
Description: 
{code}
scala> 
org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
 TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'")
{code}

{code}
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)

== SQL ==
 CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'
 -^^^
 at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
 at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
 ... 48 elided
{code}

The error message of {{(name = "udf")}} 

  was:
{code}
scala> 
org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
 TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'")
{code}

{code}
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)

== SQL ==
 CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
'org.apache.spark.sql.hive.udf.YuZhouWan'
 -^^^
 at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
 at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
 at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
 ... 48 elided
{code}

The error message of 


> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> {code}
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> {code}
> {code}
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided
> {code}
> The error message of {{(name = "udf")}} 






[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-05-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24077:
-
Summary: Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT 
EXISTS`  (was: Issue a better error message for `CREATE TEMPORARY FUNCTION IF 
NOT EXISTS`?)

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> Why does Spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
>  
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided






[jira] [Resolved] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

2019-05-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27896.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.4

This is resolved via https://github.com/apache/spark/pull/24756

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --
>
> Key: SPARK-27896
> URL: https://issues.apache.org/jira/browse/SPARK-27896
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> Reported by Samuel Kubler via email:
> In the code 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
>  I think there is a small mistake in the class “Silhouette” when you 
> calculate the silhouette coefficient for a point. According to the reference 
> paper “Silhouettes: a graphical aid to the interpretation and validation of 
> cluster analysis” (Peter J. Rousseeuw, 1987), for points that are alone in a 
> cluster it is not the currentClusterDissimilarity that is supposed to be set 
> to 0, as is done in your code (“val currentClusterDissimilarity = if 
> (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient 
> itself. Indeed, “When cluster A contains only a single object it is unclear 
> how a(i) should be defined, and then we simply set s(i) equal to zero”.
> The problem with defining currentClusterDissimilarity as zero is that the 
> silhouette coefficient can no longer be used as a criterion to determine the 
> optimal number of clusters, because the algorithm will report that the more 
> clusters you have, the better your clustering is. Indeed, in that case, as 
> the number of clusters increases, s(i) converges toward 1 (so the algorithm 
> appears to be more efficient). I have, besides, checked this result on my 
> own clustering example.






[jira] [Resolved] (SPARK-27794) Use secure URLs for downloading CRAN artifacts

2019-05-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27794.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.4

This is resolved via 
- https://github.com/apache/spark/pull/24664 (master)
- https://github.com/apache/spark/pull/24758 (branch-2.4)

> Use secure URLs for downloading CRAN artifacts
> --
>
> Key: SPARK-27794
> URL: https://issues.apache.org/jira/browse/SPARK-27794
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> Currently, artifacts from CRAN are downloaded from 
> http://cran.us.r-project.org . Ideally, this should be an HTTPS URL. It seems 
> like the main redirector is https://cloud.r-project.org .
> On a lightly related note, there's also still a Dockerfile downloading Scala 
> over HTTP, which can be changed to HTTPS.






[jira] [Assigned] (SPARK-27885) Announce deprecation of Python 2 support

2019-05-31 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27885:
-

Assignee: Xiangrui Meng

> Announce deprecation of Python 2 support
> 
>
> Key: SPARK-27885
> URL: https://issues.apache.org/jira/browse/SPARK-27885
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> * Draft the message.
> * Update Spark website and announce deprecation of Python 2 support in the 
> next major release in 2019 and remove the support in a release after 
> 2020/01/01. It should show up in the "Latest News" section.
> * Announce it on users@ and dev@






[jira] [Created] (SPARK-27914) Improve parser error message for ALTER TABLE ADD COLUMNS statement

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27914:
--

 Summary: Improve parser error message for ALTER TABLE ADD COLUMNS 
statement
 Key: SPARK-27914
 URL: https://issues.apache.org/jira/browse/SPARK-27914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The {{ALTER TABLE ADD COLUMNS}} statement is often misspelled as {{ALTER TABLE 
ADD COLUMN}}. However, when a user runs such a statement, the error message 
is confusing. For example, the error message for


{code:sql}
ALTER TABLE test ADD COLUMN (x INT);
{code}

is
{code:java}
no viable alternative at input 'ALTER TABLE test ADD COLUMN'(line 1, pos 21)
{code}
which is misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message instructing users to change 
{{COLUMN}} to {{COLUMNS}}.
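
A hedged illustration of surfacing a friendlier message from user code (this is not the proposed grammar-rule change; the wrapper below is purely hypothetical):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Run a DDL statement and translate the common COLUMN/COLUMNS typo into a clearer hint.
def runDdl(sql: String): Unit =
  try { spark.sql(sql); () }
  catch {
    case e: ParseException if sql.toUpperCase.contains("ADD COLUMN ") =>
      sys.error(s"Did you mean ALTER TABLE ... ADD COLUMNS (...)? Parser said: ${e.getMessage}")
  }

runDdl("ALTER TABLE test ADD COLUMN (x INT)")
{code}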






[jira] [Created] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2019-05-31 Thread Owen O'Malley (JIRA)
Owen O'Malley created SPARK-27913:
-

 Summary: Spark SQL's native ORC reader implements its own schema 
evolution
 Key: SPARK-27913
 URL: https://issues.apache.org/jira/browse/SPARK-27913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.3
Reporter: Owen O'Malley


ORC's reader handles a wide range of schema evolution, but the Spark SQL native 
ORC bindings do not provide the desired schema to the ORC reader. This causes a 
regression when moving spark.sql.orc.impl from 'hive' to 'native'.
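
A hedged reproduction sketch (the path and column names are made up; spark.sql.orc.impl and the DataFrameReader/DataFrameWriter calls are the real knobs): write a file with an old schema, then read it back with an evolved schema under each implementation and compare the behavior.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val path = "/tmp/orc-evolution-demo"

// Old schema: a single BIGINT column `a`.
spark.range(5).selectExpr("id AS a").write.mode("overwrite").orc(path)

// Evolved schema adds column `b`; switch between "native" and "hive" to compare.
spark.conf.set("spark.sql.orc.impl", "native")
spark.read.schema("a BIGINT, b STRING").orc(path).show()
{code}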






[jira] [Created] (SPARK-27912) Improve parser error message for CASE clause

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27912:
--

 Summary: Improve parser error message for CASE clause
 Key: SPARK-27912
 URL: https://issues.apache.org/jira/browse/SPARK-27912
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The {{CASE}} clause is commonly used in SQL queries, but people can forget the 
trailing {{END}}. When a user runs such a statement, the error message is 
confusing. For example, the error message for


{code:sql}
SELECT (CASE WHEN a THEN b ELSE c) FROM a;
{code}

is
{code:java}
no viable alternative at input '(CASE WHEN a THEN b ELSE c)'(line 1, pos 33)
{code}
which is misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message such as
{code:java}
missing trailing END for case clause
{code}






[jira] [Updated] (SPARK-27911) PySpark Packages should automatically choose correct scala version

2019-05-31 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-27911:
-
Description: 
Today, users of pyspark (and Scala) need to manually specify the version of 
Scala that their Spark installation is using when adding a Spark package to 
their application. This extra configuration is confusing to users who may not 
even know which version of Scala they are using (for example, if they installed 
Spark using {{pip}}). The confusion here is exacerbated by releases in Spark 
that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}.

https://spark.apache.org/releases/spark-release-2-4-2.html
https://spark.apache.org/releases/spark-release-2-4-3.html

Since Spark can know which version of Scala it was compiled for, we should give 
users the option to automatically choose the correct version.  This could be as 
simple as a substitution for {{$scalaVersion}} or something when resolving a 
package (similar to SBT's support for automatically handling Scala dependencies).

Here are some concrete examples of users getting it wrong and getting confused:
https://github.com/delta-io/delta/issues/6
https://github.com/delta-io/delta/issues/63

  was:
Today, users of pyspark (and Scala) need to manually specify the version of 
Scala that their Spark installation is using when adding a Spark package to 
their application. This extra configuration confusing to users who may not even 
know which version of Scala they are using (for example, if they installed 
Spark using {{pip}}). The confusion here is exacerbated by releases in Spark 
that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}.

https://spark.apache.org/releases/spark-release-2-4-2.html
https://spark.apache.org/releases/spark-release-2-4-3.html

Since Spark can know which version of Scala it was compiled for, we should give 
users the option to automatically choose the correct version.  This could be as 
simple as a substitution for {{$scalaVersion}} or something when resolving a 
package (similar to SBTs support for automatically handling scala dependencies).

Here are some concrete examples of users getting it wrong and getting confused:
https://github.com/delta-io/delta/issues/6
https://github.com/delta-io/delta/issues/63


> PySpark Packages should automatically choose correct scala version
> --
>
> Key: SPARK-27911
> URL: https://issues.apache.org/jira/browse/SPARK-27911
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Michael Armbrust
>Priority: Major
>
> Today, users of pyspark (and Scala) need to manually specify the version of 
> Scala that their Spark installation is using when adding a Spark package to 
> their application. This extra configuration is confusing to users who may not 
> even know which version of Scala they are using (for example, if they 
> installed Spark using {{pip}}). The confusion here is exacerbated by releases 
> in Spark that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}.
> https://spark.apache.org/releases/spark-release-2-4-2.html
> https://spark.apache.org/releases/spark-release-2-4-3.html
> Since Spark can know which version of Scala it was compiled for, we should 
> give users the option to automatically choose the correct version.  This 
> could be as simple as a substitution for {{$scalaVersion}} or something when 
> resolving a package (similar to SBT's support for automatically handling Scala 
> dependencies).
> Here are some concrete examples of users getting it wrong and getting 
> confused:
> https://github.com/delta-io/delta/issues/6
> https://github.com/delta-io/delta/issues/63






[jira] [Created] (SPARK-27911) PySpark Packages should automatically choose correct scala version

2019-05-31 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-27911:


 Summary: PySpark Packages should automatically choose correct 
scala version
 Key: SPARK-27911
 URL: https://issues.apache.org/jira/browse/SPARK-27911
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Michael Armbrust


Today, users of pyspark (and Scala) need to manually specify the version of 
Scala that their Spark installation is using when adding a Spark package to 
their application. This extra configuration is confusing to users who may not even 
know which version of Scala they are using (for example, if they installed 
Spark using {{pip}}). The confusion here is exacerbated by releases in Spark 
that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}.

https://spark.apache.org/releases/spark-release-2-4-2.html
https://spark.apache.org/releases/spark-release-2-4-3.html

Since Spark can know which version of Scala it was compiled for, we should give 
users the option to automatically choose the correct version.  This could be as 
simple as a substitution for {{$scalaVersion}} or something when resolving a 
package (similar to SBT's support for automatically handling Scala dependencies).

Here are some concrete examples of users getting it wrong and getting confused:
https://github.com/delta-io/delta/issues/6
https://github.com/delta-io/delta/issues/63
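
A hedged sketch of the kind of substitution the proposal describes (this is not an existing Spark API; the coordinate template and resolution logic are illustrative, roughly analogous to SBT's %% operator):

{code:scala}
import scala.util.Properties

// Binary Scala version of the running JVM, e.g. "2.12".
val scalaBinaryVersion = Properties.versionNumberString.split('.').take(2).mkString(".")

// Hypothetical coordinate template a user could pass to --packages.
val template = "io.delta:delta-core_$scalaVersion:0.1.0"
val resolved = template.replace("$scalaVersion", scalaBinaryVersion)

println(resolved) // e.g. io.delta:delta-core_2.12:0.1.0
{code}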






[jira] [Created] (SPARK-27910) Improve parser error message for misused numeric identifiers

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27910:
--

 Summary: Improve parser error message for misused numeric 
identifiers
 Key: SPARK-27910
 URL: https://issues.apache.org/jira/browse/SPARK-27910
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


Numeric identifiers are commonly misused in Spark SQL queries. For example, the 
error message for
{code:sql}
CREATE TABLE test (`1` INT);
SELECT test.1 FROM test;
{code}

is
{code:java}
Error in query:
mismatched input '.1' expecting {, '(', ',', '.', '[', 'ADD', 'AFTER', 
'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 
'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 
'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
DATABASES, 'DAY', 'DAYS', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 
'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 
'DISTRIBUTE', 'DROP', 'ELSE', 'END', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 
'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 
'FIELDS', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 
'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 
'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'HOURS', 'IF', 'IGNORE', 'IMPORT', 'IN', 
'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 
'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 
'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 
'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MICROSECOND', 
'MICROSECONDS', 'MILLISECOND', 'MILLISECONDS', 'MINUTE', 'MINUTES', 'MONTH', 
'MONTHS', 'MSCK', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 
'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 
'OVERLAPS', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 
'PIVOT', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PURGE', 'QUERY', 
'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 
'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 
'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 
'SECOND', 'SECONDS', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 
'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 
'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'TABLE', 
'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 
'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRUE', 
'TRUNCATE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNLOCK', 
'UNSET', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'WEEK', 'WEEKS', 'WHEN', 
'WHERE', 'WINDOW', 'WITH', 'YEAR', 'YEARS', EQ, '<=>', '<>', '!=', '<', LTE, 
'>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^', IDENTIFIER, 
BACKQUOTED_IDENTIFIER}(line 1, pos 11)

== SQL ==
SELECT test.1 FROM test
{code}
which is verbose and misleading.
 

One possible fix is to explicitly capture these misused numeric identifiers in 
a grammar rule and print a user-friendly error message such as
{code:java}
Numeric identifiers detected. Consider using quoted version test.`1`
{code}
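
For reference, a hedged sketch of the back-quoted form that does parse today (assuming the same table as in the example above, created here with an explicit data source):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("CREATE TABLE test (`1` INT) USING parquet")
spark.sql("SELECT test.`1` FROM test").show() // back-quoted numeric identifier parses fine
{code}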






[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0

2019-05-31 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853412#comment-16853412
 ] 

Xiao Li commented on SPARK-17164:
-

I think we can issue a better error message here, like 
https://issues.apache.org/jira/browse/SPARK-27890

> Query with colon in the table name fails to parse in 2.0
> 
>
> Key: SPARK-17164
> URL: https://issues.apache.org/jira/browse/SPARK-17164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Priority: Major
>
> Running a simple query with a colon in the table name fails to parse in 2.0:
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine 
> in 1.6.






[jira] [Updated] (SPARK-17164) Query with colon in the table name fails to parse in 2.0

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17164:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-27901

> Query with colon in the table name fails to parse in 2.0
> 
>
> Key: SPARK-17164
> URL: https://issues.apache.org/jira/browse/SPARK-17164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Priority: Major
>
> Running a simple query with a colon in the table name fails to parse in 2.0:
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine 
> in 1.6.






[jira] [Assigned] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27909:


Assignee: (was: Apache Spark)

> Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
> --
>
> Key: SPARK-27909
> URL: https://issues.apache.org/jira/browse/SPARK-27909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> CTE substitution currently works by running all analyzer rules on plans after 
> each substitution. It does this to fix a recursive CTE case, but this design 
> requires the ResolveRelations rule to throw an AnalysisException when it 
> cannot resolve a table or else the CTE substitution will run again and may 
> possibly recurse infinitely.
> Table resolution should be possible across multiple independent rules. To 
> accomplish this, the current ResolveRelations rule detects cases where other 
> rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
> UnresolvedRelation unmodified only in those cases.






[jira] [Assigned] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27909:


Assignee: Apache Spark

> Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
> --
>
> Key: SPARK-27909
> URL: https://issues.apache.org/jira/browse/SPARK-27909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> CTE substitution currently works by running all analyzer rules on plans after 
> each substitution. It does this to fix a recursive CTE case, but this design 
> requires the ResolveRelations rule to throw an AnalysisException when it 
> cannot resolve a table or else the CTE substitution will run again and may 
> possibly recurse infinitely.
> Table resolution should be possible across multiple independent rules. To 
> accomplish this, the current ResolveRelations rule detects cases where other 
> rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
> UnresolvedRelation unmodified only in those cases.






[jira] [Resolved] (SPARK-27374) Fetch assigned resources from TaskContext

2019-05-31 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27374.
---
Resolution: Duplicate

> Fetch assigned resources from TaskContext
> -
>
> Key: SPARK-27374
> URL: https://issues.apache.org/jira/browse/SPARK-27374
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-27362) Kubernetes support for GPU-aware scheduling

2019-05-31 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27362.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Kubernetes support for GPU-aware scheduling
> ---
>
> Key: SPARK-27362
> URL: https://issues.apache.org/jira/browse/SPARK-27362
> Project: Spark
>  Issue Type: Story
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Design and implement k8s support for GPU-aware scheduling.






[jira] [Resolved] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling

2019-05-31 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27373.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Design: Kubernetes support for GPU-aware scheduling
> ---
>
> Key: SPARK-27373
> URL: https://issues.apache.org/jira/browse/SPARK-27373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-05-31 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27909:
-

 Summary: Fix CTE substitution dependence on ResolveRelations 
throwing AnalysisException
 Key: SPARK-27909
 URL: https://issues.apache.org/jira/browse/SPARK-27909
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


CTE substitution currently works by running all analyzer rules on plans after 
each substitution. It does this to fix a recursive CTE case, but this design 
requires the ResolveRelations rule to throw an AnalysisException when it cannot 
resolve a table or else the CTE substitution will run again and may possibly 
recurse infinitely.

Table resolution should be possible across multiple independent rules. To 
accomplish this, the current ResolveRelations rule detects cases where other 
rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
UnresolvedRelation unmodified only in those cases.






[jira] [Resolved] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory

2019-05-31 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27897.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> GPU Scheduling - move example discovery Script to scripts directory
> ---
>
> Key: SPARK-27897
> URL: https://issues.apache.org/jira/browse/SPARK-27897
> Project: Spark
>  Issue Type: Story
>  Components: Examples
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-27725 (GPU Scheduling - add an example discovery script) added a script 
> at 
> [https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh].
> Instead of having it in the resources directory, let's move it to the scripts 
> directory.






[jira] [Updated] (SPARK-27907) HiveUDAF with 0 rows throw NPE

2019-05-31 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-27907:

Description: 
When a query returns zero rows, HiveUDAFFunction throws an NPE.

CASE 1:
create table abc(a int)
select histogram_numeric(a,2) from abc // NPE

Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost, executor driver): 
java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveUDAFFunction.eval(hiveUDFs.scala:471)
at org.apache.spark.sql.hive.HiveUDAFFunction.eval(hiveUDFs.scala:315)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.eval(interfaces.scala:543)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(AggregationIterator.scala:231)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:122)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


CASE 2:

create table abc(a int)
insert into abc values (1)
select histogram_numeric(a,2) from abc where a=3 //NPE

Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor driver): 
java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477)
at 
org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at 

[jira] [Updated] (SPARK-27907) HiveUDAF with 0 rows throw NPE

2019-05-31 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-27907:

Summary: HiveUDAF with 0 rows throw NPE  (was: HiveUDAF with 0 rows throw 
NPE when try to serialize)

> HiveUDAF with 0 rows throw NPE
> --
>
> Key: SPARK-27907
> URL: https://issues.apache.org/jira/browse/SPARK-27907
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3, 3.1.0
>Reporter: Ajith S
>Priority: Major
>
> When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE
> create table abc(a int)
> insert into abc values (1)
> insert into abc values (2)
> select histogram_numeric(a,2) from abc where a=3 //NPE
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor 
> driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:122)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27908) Improve parser error message for SELECT TOP statement

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27908:
--

 Summary: Improve parser error message for SELECT TOP statement
 Key: SPARK-27908
 URL: https://issues.apache.org/jira/browse/SPARK-27908
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The {{SELECT TOP}} statement is not supported in Spark SQL. However, when a 
user issues such a statement, the error message is confusing. For example, the 
error message for


{code:sql}
SELECT TOP 1 FROM test;
{code}

is
{code:java}
Error in query:
mismatched input '1' expecting {, '(', ',', '.', '[', 'ADD', 'AFTER', 
'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 
'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 
'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
DATABASES, 'DAY', 'DAYS', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 
'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 
'DISTRIBUTE', 'DROP', 'ELSE', 'END', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 
'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 
'FIELDS', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 
'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 
'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'HOURS', 'IF', 'IGNORE', 'IMPORT', 'IN', 
'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 
'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 
'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 
'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MICROSECOND', 
'MICROSECONDS', 'MILLISECOND', 'MILLISECONDS', 'MINUTE', 'MINUTES', 'MONTH', 
'MONTHS', 'MSCK', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 
'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 
'OVERLAPS', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 
'PIVOT', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PURGE', 'QUERY', 
'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 
'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 
'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 
'SECOND', 'SECONDS', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 
'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 
'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'TABLE', 
'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 
'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRUE', 
'TRUNCATE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNLOCK', 
'UNSET', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'WEEK', 'WEEKS', 'WHEN', 
'WHERE', 'WINDOW', 'WITH', 'YEAR', 'YEARS', EQ, '<=>', '<>', '!=', '<', LTE, 
'>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^', IDENTIFIER, 
BACKQUOTED_IDENTIFIER}(line 1, pos 11)

== SQL ==
SELECT TOP 1 FROM test
---^^^
{code}
which is verbose and misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message such as
{code:java}
SELECT TOP statements are not supported.
{code}
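
For context (an editorial addition, not part of the original report), the 
supported way to express the same intent in Spark SQL is the {{LIMIT}} clause:
{code:sql}
-- equivalent supported form: restrict the result to one row
SELECT * FROM test LIMIT 1;
{code}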



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27907) HiveUDAF with 0 rows throw NPE when try to serialize

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27907:


Assignee: (was: Apache Spark)

> HiveUDAF with 0 rows throw NPE when try to serialize
> 
>
> Key: SPARK-27907
> URL: https://issues.apache.org/jira/browse/SPARK-27907
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3, 3.1.0
>Reporter: Ajith S
>Priority: Major
>
> When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE
> create table abc(a int)
> insert into abc values (1)
> insert into abc values (2)
> select histogram_numeric(a,2) from abc where a=3 //NPE
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor 
> driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:122)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27907) HiveUDAF with 0 rows throw NPE when try to serialize

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27907:


Assignee: Apache Spark

> HiveUDAF with 0 rows throw NPE when try to serialize
> 
>
> Key: SPARK-27907
> URL: https://issues.apache.org/jira/browse/SPARK-27907
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3, 3.1.0
>Reporter: Ajith S
>Assignee: Apache Spark
>Priority: Major
>
> When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE
> create table abc(a int)
> insert into abc values (1)
> insert into abc values (2)
> select histogram_numeric(a,2) from abc where a=3 //NPE
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor 
> driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:122)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27907) HiveUDAF with 0 rows throw NPE when try to serialize

2019-05-31 Thread Ajith S (JIRA)
Ajith S created SPARK-27907:
---

 Summary: HiveUDAF with 0 rows throw NPE when try to serialize
 Key: SPARK-27907
 URL: https://issues.apache.org/jira/browse/SPARK-27907
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3, 2.3.3, 3.0.0, 3.1.0
Reporter: Ajith S


When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE

create table abc(a int)
insert into abc values (1)
insert into abc values (2)
select histogram_numeric(a,2) from abc where a=3 //NPE

Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor driver): 
java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477)
at 
org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:122)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
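
For comparison (an observation added here, not from the original report), a 
built-in aggregate over the same empty input returns NULL instead of failing, 
which is the behavior one would expect from the Hive UDAF wrapper as well:
{code:sql}
-- built-in aggregate on zero rows returns NULL rather than throwing an NPE
SELECT max(a) FROM abc WHERE a = 3;
{code}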



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27905) Add higher order function`forall`

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27905:


Assignee: (was: Apache Spark)

> Add higher order function`forall`
> -
>
> Key: SPARK-27905
> URL: https://issues.apache.org/jira/browse/SPARK-27905
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Priority: Major
>
> Add the SQL function forall.
> `forall` tests an array to see if the predicate holds for every item of the array.
> This complements the `exists` higher order function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27905) Add higher order function`forall`

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27905:


Assignee: Apache Spark

> Add higher order function`forall`
> -
>
> Key: SPARK-27905
> URL: https://issues.apache.org/jira/browse/SPARK-27905
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Assignee: Apache Spark
>Priority: Major
>
> Add the SQL function forall.
> `forall` tests an array to see if the predicate holds for every item of the array.
> This complements the `exists` higher order function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27906) Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement

2019-05-31 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-27906:
---
Description: 
The {{CREATE LOCAL TEMPORARY TABLE}} statement is not supported in Spark SQL. 
However, when a user issues such a statement, the error message is confusing. 
For example, the error message for


{code:sql}
CREATE LOCAL TEMPORARY TABLE my_table (x INT);
{code}

is
{code:java}
no viable alternative at input 'CREATE LOCAL'(line 1, pos 7)
{code}
which is misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message such as
{code:java}
CREATE LOCAL TEMPORARY TABLE statements are not supported.
{code}

  was:
The {{SHOW VIEW}} statement is not supported in Spark SQL. However, when a 
user issues such a statement, the error message is confusing. For example, the 
error message for


{code:sql}
SHOW VIEWS IN my_database
{code}

is
{code:java}
missing 'FUNCTIONS' at 'IN'(line 1, pos 11)
{code}
which is misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message such as
{code:java}
SHOW VIEW statements are not supported.
{code}


> Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement
> ---
>
> Key: SPARK-27906
> URL: https://issues.apache.org/jira/browse/SPARK-27906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The {{CREATE LOCAL TEMPORARY TABLE}} statement is not supported in Spark SQL. 
> However, when a user issues such a statement, the error message is confusing. 
> For example, the error message for
> {code:sql}
> CREATE LOCAL TEMPORARY TABLE my_table (x INT);
> {code}
> is
> {code:java}
> no viable alternative at input 'CREATE LOCAL'(line 1, pos 7)
> {code}
> which is misleading.
>  
> One possible fix is to explicitly capture these statements in a grammar rule 
> and print a user-friendly error message such as
> {code:java}
> CREATE LOCAL TEMPORARY TABLE statements are not supported.
> {code}
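> For context (an editorial addition, not part of the original report), a 
> supported alternative for session-scoped temporary relations in Spark SQL is 
> {{CREATE TEMPORARY VIEW}}:
> {code:sql}
> -- supported alternative: a session-scoped temporary view
> CREATE TEMPORARY VIEW my_table AS SELECT 1 AS x;
> {code}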



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27906) Improve parser error message for CREATE LOCAL TABLE statement

2019-05-31 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-27906:
---
Description: 
The {{SHOW VIEW}} statement is not supported in Spark SQL. However, when a 
user issues such a statement, the error message is confusing. For example, the 
error message for


{code:sql}
SHOW VIEWS IN my_database
{code}

is
{code:java}
missing 'FUNCTIONS' at 'IN'(line 1, pos 11)
{code}
which is misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message such as
{code:java}
SHOW VIEW statements are not supported.
{code}

> Improve parser error message for CREATE LOCAL TABLE statement
> -
>
> Key: SPARK-27906
> URL: https://issues.apache.org/jira/browse/SPARK-27906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The {{SHOW VIEW}} statement is not supported in Spark SQL. However, when 
> a user issues such a statement, the error message is confusing. For example, 
> the error message for
> {code:sql}
> SHOW VIEWS IN my_database
> {code}
> is
> {code:java}
> missing 'FUNCTIONS' at 'IN'(line 1, pos 11)
> {code}
> which is misleading.
>  
> One possible fix is to explicitly capture these statements in a grammar rule 
> and print a user-friendly error message such as
> {code:java}
> SHOW VIEW statements are not supported.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions

2019-05-31 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-27903:
---
Description: 
When parentheses are mismatched in expressions in queries, the error message is 
confusing. This is especially true for large queries, where mismatched parens 
are tedious for humans to figure out. 

For example, the error message for 
{code:sql} 
SELECT ((x + y) * z FROM t; 
{code} 
is 
{code:java} 
mismatched input 'FROM' expecting ','(line 1, pos 20) 
{code} 

One possible fix is to explicitly capture such mismatched parentheses in a 
grammar rule and print a user-friendly error message such as 
{code:java} 
mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, pos 
20) 
{code} 

  was:





> Improve parser error message for mismatched parentheses in expressions
> --
>
> Key: SPARK-27903
> URL: https://issues.apache.org/jira/browse/SPARK-27903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> When parentheses are mismatched in expressions in queries, the error message 
> is confusing. This is especially true for large queries, where mismatched 
> parens are tedious for humans to figure out. 
> For example, the error message for 
> {code:sql} 
> SELECT ((x + y) * z FROM t; 
> {code} 
> is 
> {code:java} 
> mismatched input 'FROM' expecting ','(line 1, pos 20) 
> {code} 
> One possible fix is to explicitly capture such mismatched parentheses in a 
> grammar rule and print a user-friendly error message such as 
> {code:java} 
> mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, 
> pos 20) 
> {code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27906) Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement

2019-05-31 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-27906:
---
Summary: Improve parser error message for CREATE LOCAL TEMPORARY TABLE 
statement  (was: Improve parser error message for CREATE LOCAL TABLE statement)

> Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement
> ---
>
> Key: SPARK-27906
> URL: https://issues.apache.org/jira/browse/SPARK-27906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The {{SHOW VIEW}} statement is not supported in Spark SQL. However, when 
> a user issues such a statement, the error message is confusing. For example, 
> the error message for
> {code:sql}
> SHOW VIEWS IN my_database
> {code}
> is
> {code:java}
> missing 'FUNCTIONS' at 'IN'(line 1, pos 11)
> {code}
> which is misleading.
>  
> One possible fix is to explicitly capture these statements in a grammar rule 
> and print a user-friendly error message such as
> {code:java}
> SHOW VIEW statements are not supported.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27906) Improve parser error message for CREATE LOCAL TABLE statement

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27906:
--

 Summary: Improve parser error message for CREATE LOCAL TABLE 
statement
 Key: SPARK-27906
 URL: https://issues.apache.org/jira/browse/SPARK-27906
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions

2019-05-31 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-27903:
---
Description: 




  was:
When parentheses are mismatched in expressions in queries, the error message is 
confusing. This is especially true for large queries, where mismatched parens 
are tedious for humans to figure out.

For example, the error message for 
{code:sql}
SELECT ((x + y) * z FROM t;
{code}
is
{code:java}
mismatched input 'FROM' expecting ','(line 1, pos 20)
{code}

One possible fix is to explicitly capture such mismatched parentheses in a 
grammar rule and print a user-friendly error message such as
{code:java}
mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, pos 
20)
{code}



> Improve parser error message for mismatched parentheses in expressions
> --
>
> Key: SPARK-27903
> URL: https://issues.apache.org/jira/browse/SPARK-27903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27905) Add higher order function`forall`

2019-05-31 Thread Nikolas Vanderhoof (JIRA)
Nikolas Vanderhoof created SPARK-27905:
--

 Summary: Add higher order function`forall`
 Key: SPARK-27905
 URL: https://issues.apache.org/jira/browse/SPARK-27905
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Nikolas Vanderhoof


Add the SQL function forall.

`forall` tests an array to see if the predicate holds for every item of the array.

This complements the `exists` higher order function.
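
A sketch of the intended usage, mirroring the existing `exists` function (the 
exact name and argument order are assumptions until the implementation lands):
{code:sql}
-- expected to return true only when every element satisfies the predicate
SELECT forall(array(1, 2, 3), x -> x > 0);
{code}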



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27904) Improve parser error message for SHOW VIEW statement

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27904:
--

 Summary: Improve parser error message for SHOW VIEW statement
 Key: SPARK-27904
 URL: https://issues.apache.org/jira/browse/SPARK-27904
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The {{SHOW VIEW}} statement is not supported in Spark SQL. However, when a 
user issues such a statement, the error message is confusing. For example, the 
error message for


{code:sql}
SHOW VIEWS IN my_database
{code}

is
{code:java}
missing 'FUNCTIONS' at 'IN'(line 1, pos 11)
{code}
which is misleading.
 

One possible fix is to explicitly capture these statements in a grammar rule 
and print a user-friendly error message such as
{code:java}
SHOW VIEW statements are not supported.
{code}
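
For context (an editorial addition, not part of the original report), listing 
views currently goes through {{SHOW TABLES}}, whose output includes views:
{code:sql}
-- supported form: views are listed alongside tables
SHOW TABLES IN my_database;
{code}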



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21529) Improve the error message for unsupported Uniontype

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21529:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-27901

> Improve the error message for unsupported Uniontype
> ---
>
> Key: SPARK-21529
> URL: https://issues.apache.org/jira/browse/SPARK-21529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Qubole, DataBricks
>Reporter: Elliot West
>Priority: Major
>  Labels: hive, starter, uniontype
>
> We encounter errors when attempting to read Hive tables whose schema contains 
> the {{uniontype}}. It appears that Catalyst
> does not support the {{uniontype}}, which renders these tables unreadable by 
> Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive
> query engine, it is fully supported by the storage engine and also by the Avro 
> data format, which we use for these tables. Therefore, I believe it is
> a valid, usable type construct that should be supported by Spark.
> We've attempted to read the table as follows:
> {code}
> spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' 
> limit 5").show
> val tblread = spark.read.table("etl.tbl")
> {code}
> But this always results in the same error message. The pertinent error 
> messages are as follows (full stack trace below):
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype ...
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '<' expecting
> {, '('}
> (line 1, pos 9)
> == SQL ==
> uniontype -^^^
> {code}
> h2. Full stack trace
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>>
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648)
> at 
> 

[jira] [Updated] (SPARK-21529) Improve the error message for unsupported Uniontype

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21529:

Labels: hive starter uniontype  (was: bulk-closed hive uniontype)

> Improve the error message for unsupported Uniontype
> ---
>
> Key: SPARK-21529
> URL: https://issues.apache.org/jira/browse/SPARK-21529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Qubole, DataBricks
>Reporter: Elliot West
>Priority: Major
>  Labels: hive, starter, uniontype
>
> We encounter errors when attempting to read Hive tables whose schema contains 
> the {{uniontype}}. It appears that Catalyst
> does not support the {{uniontype}}, which renders these tables unreadable by 
> Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive
> query engine, it is fully supported by the storage engine and also by the Avro 
> data format, which we use for these tables. Therefore, I believe it is
> a valid, usable type construct that should be supported by Spark.
> We've attempted to read the table as follows:
> {code}
> spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' 
> limit 5").show
> val tblread = spark.read.table("etl.tbl")
> {code}
> But this always results in the same error message. The pertinent error 
> messages are as follows (full stack trace below):
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype ...
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '<' expecting
> {, '('}
> (line 1, pos 9)
> == SQL ==
> uniontype -^^^
> {code}
> h2. Full stack trace
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>>
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648)
> at 
> 

[jira] [Updated] (SPARK-21529) Improve the error message for unsupported Uniontype

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21529:

Summary: Improve the error message for unsupported Uniontype  (was: 
Uniontype not supported when reading from Hive tables.)

> Improve the error message for unsupported Uniontype
> ---
>
> Key: SPARK-21529
> URL: https://issues.apache.org/jira/browse/SPARK-21529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Qubole, DataBricks
>Reporter: Elliot West
>Priority: Major
>  Labels: bulk-closed, hive, uniontype
>
> We encounter errors when attempting to read Hive tables whose schema contains 
> the {{uniontype}}. It appears that Catalyst
> does not support the {{uniontype}}, which renders these tables unreadable by 
> Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive
> query engine, it is fully supported by the storage engine and also by the Avro 
> data format, which we use for these tables. Therefore, I believe it is
> a valid, usable type construct that should be supported by Spark.
> We've attempted to read the table as follows:
> {code}
> spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' 
> limit 5").show
> val tblread = spark.read.table("etl.tbl")
> {code}
> But this always results in the same error message. The pertinent error 
> messages are as follows (full stack trace below):
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype ...
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '<' expecting
> {, '('}
> (line 1, pos 9)
> == SQL ==
> uniontype -^^^
> {code}
> h2. Full stack trace
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>>
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648)
> at 
> 

[jira] [Reopened] (SPARK-21529) Improve the error message for unsupported Uniontype

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-21529:
-

> Improve the error message for unsupported Uniontype
> ---
>
> Key: SPARK-21529
> URL: https://issues.apache.org/jira/browse/SPARK-21529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Qubole, DataBricks
>Reporter: Elliot West
>Priority: Major
>  Labels: bulk-closed, hive, uniontype
>
> We encounter errors when attempting to read Hive tables whose schema contains 
> the {{uniontype}}. It appears that Catalyst
> does not support the {{uniontype}}, which renders these tables unreadable by 
> Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive
> query engine, it is fully supported by the storage engine and also by the Avro 
> data format, which we use for these tables. Therefore, I believe it is
> a valid, usable type construct that should be supported by Spark.
> We've attempted to read the table as follows:
> {code}
> spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' 
> limit 5").show
> val tblread = spark.read.table("etl.tbl")
> {code}
> But this always results in the same error message. The pertinent error 
> messages are as follows (full stack trace below):
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype ...
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '<' expecting
> {, '('}
> (line 1, pos 9)
> == SQL ==
> uniontype -^^^
> {code}
> h2. Full stack trace
> {code}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>>
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648)
> at 
> 

[jira] [Created] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27903:
--

 Summary: Improve parser error message for mismatched parentheses 
in expressions
 Key: SPARK-27903
 URL: https://issues.apache.org/jira/browse/SPARK-27903
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


When parentheses are mismatched in expressions in queries, the error message is 
confusing. This is especially true for large queries, where mismatched parens 
are tedious for humans to figure out.

For example, the error message for 
{code:sql}
SELECT ((x + y) * z FROM t;
{code}
is
{code:java}
mismatched input 'FROM' expecting ','(line 1, pos 20)
{code}

One possible fix is to explicitly capture such mismatched parentheses in a 
grammar rule and print a user-friendly error message such as
{code:java}
mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, pos 
20)
{code}
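
For reference (added here, not in the original report), the balanced form of 
the example query parses without error:
{code:sql}
-- balanced parentheses: the query is accepted by the parser
SELECT ((x + y) * z) FROM t;
{code}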




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-05-31 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853250#comment-16853250
 ] 

Steve Loughran commented on SPARK-27098:


I don't think we'd silently swallow exceptions during a copy; more likely, "we 
take so long doing it that something times out".

Maybe [~gabor.bota] has some suggestions; he's worked on Ceph support through 
s3a.

> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Attachments: sanitized_stdout_1.txt
>
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint, 
> occasionally a file part will be missing; e.g. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79567600 
> part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79388012 
> part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:14   79308387 
> part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:15   79455483 
> part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:17   79512342 
> part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79403307 
> part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79617769 
> part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:19   79333534 
> part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:20   79543324 
> part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> ```
> However, the write succeeds and leaves a _SUCCESS file.
> This can be caught by additionally checking afterward whether the number of 
> written file parts agrees with the number of partitions, but Spark should at 
> least fail on its own and leave a meaningful stack trace in this case.
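
A minimal sketch of the post-write check described above (my own illustration, not the reporter's code; paths and data are placeholders): compare the number of part files left behind with the number of partitions that were written.
{code}
// Sketch only: write, then verify that the number of part files matches the number
// of partitions. Assumes the s3a filesystem is configured for the bucket.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("part-count-check").getOrCreate()
val df = spark.range(0L, 1000000L).repartition(20).toDF("id")
val outputPath = "s3a://my-bucket/folder"

val expectedParts = df.rdd.getNumPartitions
df.write.mode("overwrite").parquet(outputPath)

val fs = new Path(outputPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
val actualParts = fs.listStatus(new Path(outputPath))
  .count(_.getPath.getName.startsWith("part-"))

require(actualParts == expectedParts,
  s"Expected $expectedParts part files but found $actualParts under $outputPath")
{code}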



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27902) Improve error message for DESCRIBE statement

2019-05-31 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27902:
--

 Summary: Improve error message for DESCRIBE statement
 Key: SPARK-27902
 URL: https://issues.apache.org/jira/browse/SPARK-27902
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The {{DESCRIBE}} statement only supports queries such as {{SELECT}}. However, when 
other statements are used as the clause of a {{DESCRIBE}}, the error message is 
confusing.

For example, the error message for 
{code:sql}
DESCRIBE INSERT INTO desc_temp1 values (1, 'val1');
{code}
is
{code:java}
mismatched input 'desc_temp1' expecting {, '.'}(line 1, pos 21)
{code}
which is misleading and makes it hard for end users to figure out the real cause.


One possible way to fix this is to explicitly capture such wrong clauses and 
print a user-friendly error message such as
{code:java}
mismatched insert clause 'INSERT INTO desc_temp1 values (1, 'val1');'
expecting normal query clauses.
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21136:

Priority: Critical  (was: Minor)

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Assignee: Yesheng Ma
>Priority: Critical
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-05-31 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853245#comment-16853245
 ] 

Martin Loncaric commented on SPARK-27098:
-

After upgrading to Hadoop 2.9 and using 
{{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}, the problem 
is substantially less frequent, but still present. I think this suggests that 
moving files sometimes quietly fails.

> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Attachments: sanitized_stdout_1.txt
>
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint. 
> Occasionally a file part will be missing; i.e. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79567600 
> part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79388012 
> part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:14   79308387 
> part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:15   79455483 
> part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:17   79512342 
> part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79403307 
> part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79617769 
> part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:19   79333534 
> part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:20   79543324 
> part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> ```
> However, the write succeeds and leaves a _SUCCESS file.
> This can be caught by additionally checking afterward whether the number of 
> written file parts agrees with the number of partitions, but Spark should at 
> least fail on its own and leave a meaningful stack trace in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21136) Misleading error message for typo in SQL

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21136:
---

Assignee: Yesheng Ma

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Assignee: Yesheng Ma
>Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24077:

Labels: starter  (was: )

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
> ---
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> Why does Spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
>  
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21136:

Labels:   (was: bulk-closed)

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24077) Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-24077:
-

> Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
> 
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>
> Why does Spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
>  
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24077:

Issue Type: Sub-task  (was: Question)
Parent: SPARK-27901

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
> ---
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>
> Why does Spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
>  
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24077:

Summary: Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT 
EXISTS`?  (was: Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT 
EXISTS`?)

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
> ---
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>
> Why does Spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
>  
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21136) Misleading error message for typo in SQL

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-21136:
-

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: bulk-closed
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21136:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-27901

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: bulk-closed
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21154) ParseException when Create View from another View in Spark SQL

2019-05-31 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853234#comment-16853234
 ] 

Xiao Li commented on SPARK-21154:
-

This issue should have been resolved in newer versions of Spark.

> ParseException when Create View from another View in Spark SQL 
> ---
>
> Key: SPARK-21154
> URL: https://issues.apache.org/jira/browse/SPARK-21154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Brian Zhang
>Priority: Major
>  Labels: bulk-closed
>
> When creating a view from another existing view in Spark SQL, we can hit a 
> ParseException.
> Here are the details on how to reproduce it:
> *Hive* (I'm using 1.1.0):
> hive> *CREATE TABLE my_table (id int, name string);*
> OK
> Time taken: 0.107 seconds
> hive> *CREATE VIEW my_view(view_id,view_name) AS SELECT * FROM my_table;*
> OK
> Time taken: 0.075 seconds
> # View Information
> View Original Text: SELECT * FROM my_table
> View Expanded Text: SELECT `id` AS `view_id`, `name` AS `view_name` FROM 
> (SELECT `my_table`.`id`, `my_table`.`name` FROM `default`.`my_table`) 
> `default.my_view`
> Time taken: 0.04 seconds, Fetched: 28 row(s)
> *Spark* (Same behavior for spark 2.1.0 and 2.1.1):
> scala> *sqlContext.sql("CREATE VIEW my_view_spark AS SELECT * FROM my_view");*
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `view_id`, `gen_attr_1` AS `view_name` FROM (SELECT 
> `gen_attr_0`, `gen_attr_1` FROM (SELECT `gen_attr_2` AS `gen_attr_0`, 
> `gen_attr_3` AS `gen_attr_1` FROM (SELECT `gen_attr_2`, `gen_attr_3` FROM 
> (SELECT `id` AS `gen_attr_2`, `name` AS `gen_attr_3` FROM 
> `default`.`my_table`) AS gen_subquery_0) AS default.my_view) AS my_view) AS 
> my_view
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:222)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:176)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
>   ... 74 elided
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 62)
> == SQL ==
> SELECT `gen_attr_0` AS `view_id`, `gen_attr_1` AS `view_name` FROM (SELECT 
> `gen_attr_0`, `gen_attr_1` FROM (SELECT `gen_attr_2` AS `gen_attr_0`, 
> `gen_attr_3` AS `gen_attr_1` FROM (SELECT `gen_attr_2`, `gen_attr_3` FROM 
> (SELECT `id` AS `gen_attr_2`, `name` AS `gen_attr_3` FROM 
> `default`.`my_table`) AS gen_subquery_0) AS default.my_view) AS my_view) AS 
> my_view
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:219)
>   ... 90 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-05-31 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853227#comment-16853227
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

What I meant was *upgrading*, [~igor.calabria]. Currently, the latest one is 
already 4.2.2.

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I try spark-submit to k8s with cluster mode. Driver pod failed to exit with 
> An Okhttp Websocket Non-Daemon Thread.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27809) Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27809:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-27901

> Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement
> --
>
> Key: SPARK-27809
> URL: https://issues.apache.org/jira/browse/SPARK-27809
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> Each time I write a complex CREATE DATABASE/VIEW statement, I have to 
> open the .g4 file to find the EXACT order of clauses in the 
> statement. When the order is not right, I get a strange, confusing error 
> message generated by ANTLR4.
> The original g4 grammar for CREATE VIEW is
> {code:sql}
> CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [db_name.]view_name
>   [(col_name1 [COMMENT col_comment1], ...)]
>   [COMMENT table_comment]
>   [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> AS select_statement
> {code}
> The proposal is to make the following clauses order insensitive.
> {code:sql}
>   [COMMENT table_comment]
>   [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> {code}
> –
>  The original g4 grammar for CREATE DATABASE is
> {code:sql}
> CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name
>   [COMMENT comment_text]
>   [LOCATION path]
>   [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
> {code}
> The proposal is to make the following clauses order insensitive.
> {code:sql}
>   [COMMENT comment_text]
>   [LOCATION path]
>   [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
> {code}
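
As an illustration of the intended effect (my example, not text from the ticket), in a spark-shell session both of the orderings below would be accepted once the clauses are order insensitive; today only the first, grammar-defined order parses.
{code}
scala> spark.sql("CREATE DATABASE IF NOT EXISTS db1 COMMENT 'demo' LOCATION '/tmp/db1' WITH DBPROPERTIES ('owner'='me')")

scala> spark.sql("CREATE DATABASE IF NOT EXISTS db2 LOCATION '/tmp/db2' COMMENT 'demo' WITH DBPROPERTIES ('owner'='me')")
{code}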



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27890:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-27901

> Improve SQL parser error message when missing backquotes for identifiers with 
> hyphens
> -
>
> Key: SPARK-27890
> URL: https://issues.apache.org/jira/browse/SPARK-27890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yesheng Ma
>Priority: Major
>
> Current SQL parser's error message for hyphen-connected identifiers without 
> surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A 
> possible approach to tackle this is to explicitly capture these wrong usages 
> in the SQL parser. In this way, the end users can fix these errors more 
> quickly.
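
For a concrete illustration (mine, not from the ticket) of the usage in question, in a spark-shell session:
{code}
scala> spark.sql("SELECT * FROM hyphen-table").show    // fails today with a generic ANTLR error
scala> spark.sql("SELECT * FROM `hyphen-table`").show  // intended spelling, with surrounding backquotes
{code}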



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27901) Improve the error messages of SQL parser

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27901:

Issue Type: Umbrella  (was: Bug)

> Improve the error messages of SQL parser
> 
>
> Key: SPARK-27901
> URL: https://issues.apache.org/jira/browse/SPARK-27901
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> SQL is one of the most popular APIs for Apache Spark. Our SQL parser is built on 
> ANTLR4. The error messages generated by ANTLR4 are not always helpful. This 
> umbrella Jira is to track all the improvements in our parser error handling. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27901) Improve the error messages of SQL parser

2019-05-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-27901:
---

 Summary: Improve the error messages of SQL parser
 Key: SPARK-27901
 URL: https://issues.apache.org/jira/browse/SPARK-27901
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


SQL is one of the most popular APIs for Apache Spark. Our SQL parser is built on 
ANTLR4. The error messages generated by ANTLR4 are not always helpful. This 
umbrella Jira is to track all the improvements in our parser error handling. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-27899:
---

Assignee: Lantao Jin

> Make HiveMetastoreClient.getTableObjectsByName available in 
> ExternalCatalog/SessionCatalog API
> --
>
> Key: SPARK-27899
> URL: https://issues.apache.org/jira/browse/SPARK-27899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Lantao Jin
>Priority: Major
>
> The new Spark ThriftServer SparkGetTablesOperation implemented in 
> https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
> request for every table. This can get very slow for large schemas (~50ms per 
> table with an external Hive metastore).
> Hive ThriftServer GetTablesOperation uses 
> HiveMetastoreClient.getTableObjectsByName to get table information in bulk, 
> but we don't expose that through our APIs that go through Hive -> 
> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> 
> SessionCatalog.
> If we added and exposed getTableObjectsByName through our catalog APIs, we 
> could resolve that performance problem in SparkGetTablesOperation.
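
A rough sketch of the shape such an addition could take (names and signatures are guesses, not the committed API): a bulk variant threaded through the catalog layers so SparkGetTablesOperation can fetch many tables in one metastore round trip.
{code}
// Hypothetical signature only, illustrating the bulk lookup described above.
import org.apache.spark.sql.catalyst.catalog.CatalogTable

trait BulkTableLookup {
  // One metastore round trip for many tables, mirroring
  // HiveMetastoreClient.getTableObjectsByName.
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}

// Usage sketch inside a GetTables-style operation:
// val catalogTables = externalCatalog.getTablesByName(dbName, matchingTableNames)
{code}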



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853209#comment-16853209
 ] 

Xiao Li commented on SPARK-27899:
-

Also cc [~cltlfcjin]

 

> Make HiveMetastoreClient.getTableObjectsByName available in 
> ExternalCatalog/SessionCatalog API
> --
>
> Key: SPARK-27899
> URL: https://issues.apache.org/jira/browse/SPARK-27899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> The new Spark ThriftServer SparkGetTablesOperation implemented in 
> https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
> request for every table. This can get very slow for large schemas (~50ms per 
> table with an external Hive metastore).
> Hive ThriftServer GetTablesOperation uses 
> HiveMetastoreClient.getTableObjectsByName to get table information in bulk, 
> but we don't expose that through our APIs that go through Hive -> 
> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> 
> SessionCatalog.
> If we added and exposed getTableObjectsByName through our catalog APIs, we 
> could resolve that performance problem in SparkGetTablesOperation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27899:

Target Version/s: 3.0.0

> Make HiveMetastoreClient.getTableObjectsByName available in 
> ExternalCatalog/SessionCatalog API
> --
>
> Key: SPARK-27899
> URL: https://issues.apache.org/jira/browse/SPARK-27899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> The new Spark ThriftServer SparkGetTablesOperation implemented in 
> https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
> request for every table. This can get very slow for large schemas (~50ms per 
> table with an external Hive metastore).
> Hive ThriftServer GetTablesOperation uses 
> HiveMetastoreClient.getTableObjectsByName to get table information in bulk, 
> but we don't expose that through our APIs that go through Hive -> 
> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> 
> SessionCatalog.
> If we added and exposed getTableObjectsByName through our catalog APIs, we 
> could resolve that performance problem in SparkGetTablesOperation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-05-31 Thread Igor Calabria (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853159#comment-16853159
 ] 

Igor Calabria commented on SPARK-27812:
---

[~dongjoon] Do you mean downgrading? Because the issue was introduced when the 
kubernetes client was updated. I took a look at both OkHttp's and fabric8's 
kubernetes client code between the upgraded tags and I couldn't find anything 
obvious that caused this. 

 

Maybe the right path for Spark is to actually deal with "rogue" user threads on 
shutdown/exceptions instead of simply relying on them not being created by 
libs or user code.
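
For illustration, a rough sketch (my own, not a proposed patch) of what dealing with lingering non-daemon threads could look like: after the driver's main logic returns, log whatever non-daemon threads are still alive and force the JVM down.
{code}
// Sketch only: detect non-daemon threads that would keep the JVM alive and force exit.
import scala.collection.JavaConverters._

object NonDaemonThreadWatchdog {
  def exitIfLingering(exitCode: Int = 0): Unit = {
    val lingering = Thread.getAllStackTraces.keySet.asScala
      .filter(t => !t.isDaemon && t.isAlive && (t ne Thread.currentThread()))
    lingering.foreach { t =>
      System.err.println(s"Non-daemon thread still alive at shutdown: ${t.getName}")
    }
    if (lingering.nonEmpty) {
      // System.exit runs shutdown hooks and then halts the JVM even though
      // non-daemon threads (e.g. an OkHttp websocket thread) are still running.
      System.exit(exitCode)
    }
  }
}

// e.g. at the very end of the driver's main():
// NonDaemonThreadWatchdog.exitIfLingering()
{code}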

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I try spark-submit to k8s with cluster mode. Driver pod failed to exit with 
> An Okhttp Websocket Non-Daemon Thread.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27395) Improve EXPLAIN command

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27395:


Assignee: (was: Apache Spark)

> Improve EXPLAIN command
> ---
>
> Key: SPARK-27395
> URL: https://issues.apache.org/jira/browse/SPARK-27395
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, when the query is complex or the output schema is long, the 
> outputs of our EXPLAIN command are not readable. Our documentation does not 
> explain how to read the plans [e.g., the meaning of each field]. It is 
> confusing to end users. The current format also limits our ability to add more useful 
> details to each operator, since the output is already very long. We need to reformat 
> the query plans for better readability. 
> In this release, we need to improve the usability and documentation of 
> EXPLAIN. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27395) Improve EXPLAIN command

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27395:


Assignee: Apache Spark

> Improve EXPLAIN command
> ---
>
> Key: SPARK-27395
> URL: https://issues.apache.org/jira/browse/SPARK-27395
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when the query is complex or the output schema is long, the 
> outputs of our EXPLAIN command are not readable. Our documentation does not 
> explain how to read the plans [e.g., the meaning of each field]. It is 
> confusing to end users. The current format also limits our ability to add more useful 
> details to each operator, since the output is already very long. We need to reformat 
> the query plans for better readability. 
> In this release, we need to improve the usability and documentation of 
> EXPLAIN. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-31 Thread hemshankar sahu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853140#comment-16853140
 ] 

hemshankar sahu commented on SPARK-27891:
-

Attached log for spark 2.3.1 (spark_2.3.1_failure.log)

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
> Attachments: application_1559242207407_0001.log, 
> spark_2.3.1_failure.log
>
>
> When the Spark job runs on a secured cluster for longer than the time that is 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the Spark job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-31 Thread hemshankar sahu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemshankar sahu updated SPARK-27891:

Attachment: spark_2.3.1_failure.log

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
> Attachments: application_1559242207407_0001.log, 
> spark_2.3.1_failure.log
>
>
> When the Spark job runs on a secured cluster for longer than the time that is 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the Spark job fails.
> The following command was used to submit the Spark job:
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-05-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26192:
--
Fix Version/s: 2.4.4

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27900) Spark on K8s will not report container failure due to oom

2019-05-31 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27900:

Description: 
A spark pi job is running:

spark-pi-driver 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
 kind: SparkApplication
 metadata:
 name: spark-pi
 namespace: spark
 spec:
 type: Scala
 mode: cluster
 image: "skonto/spark:k8s-3.0.0-sa"
 imagePullPolicy: Always
 mainClass: org.apache.spark.examples.SparkPi
 mainApplicationFile: 
"local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
 arguments:
 - "100"
 sparkVersion: "2.4.0"
 restartPolicy:
 type: Never
 nodeSelector:
 "spark": "autotune"
 driver:
 memory: "1g"
 labels:
 version: 2.4.0
 serviceAccount: spark-sa
 executor:
 instances: 2
 memory: "1g"
 labels:
 version: 2.4.0{quote}
At some point the driver fails but it is still running and so the pods are 
still running:

19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 3.0 KiB, free 110.0 MiB)
 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
in memory (estimated size 1765.0 B, free 110.0 MiB)
 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 
MiB)
 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
DAGScheduler.scala:1180
 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks 
are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 
tasks
 Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: 
Java heap space
 at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
 at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
 Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached

$ kubectl describe pod spark-pi2-driver -n spark
 Name: spark-pi2-driver
 Namespace: spark
 Priority: 0
 PriorityClassName: 
 Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
 Start Time: Fri, 31 May 2019 16:28:59 +0300
 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
 spark-role=driver
 sparkoperator.k8s.io/app-name=spark-pi2
 sparkoperator.k8s.io/launched-by-spark-operator=true
 sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
 version=2.4.0
 Annotations: 
 Status: Running
 IP: 10.12.103.4
 Controlled By: SparkApplication/spark-pi2
 Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
 Image: skonto/spark:k8s-3.0.0-sa
 Image ID: 
docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 org.apache.spark.examples.SparkPi
 spark-internal
 100
 State: Running

In the container processes are in _interruptible sleep_:

PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit 
--deploy-mode client --conf spar
 287 0 185 S 2344 0% 3 0% sh
 294 287 185 R 1536 0% 3 0% top
 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
/opt/spark/conf/spark.prope

Liveness checks might be a workaround, but REST APIs may still be working if 
threads in the JVM are still running, as in this case (I did check the Spark UI and 
it was there).

 

 

  was:
{quote}A driver is running 
{quote}
spark-pi-driver 1/1 Running 0 1h
spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
 name: spark-pi
 namespace: spark
spec:
 type: Scala
 mode: cluster
 image: "skonto/spark:k8s-3.0.0-sa"
 imagePullPolicy: Always
 mainClass: org.apache.spark.examples.SparkPi
 mainApplicationFile: 
"local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
 arguments:
 - "100"
 sparkVersion: "2.4.0"
 restartPolicy:
 type: Never
 nodeSelector:
 "spark": "autotune"
 driver:
 memory: 

[jira] [Created] (SPARK-27900) Spark on K8s will not report container failure due to oom

2019-05-31 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-27900:
---

 Summary: Spark on K8s will not report container failure due to oom
 Key: SPARK-27900
 URL: https://issues.apache.org/jira/browse/SPARK-27900
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.3, 3.0.0
Reporter: Stavros Kontopoulos


{quote}A driver is running 
{quote}
spark-pi-driver 1/1 Running 0 1h
spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
 name: spark-pi
 namespace: spark
spec:
 type: Scala
 mode: cluster
 image: "skonto/spark:k8s-3.0.0-sa"
 imagePullPolicy: Always
 mainClass: org.apache.spark.examples.SparkPi
 mainApplicationFile: 
"local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
 arguments:
 - "100"
 sparkVersion: "2.4.0"
 restartPolicy:
 type: Never
 nodeSelector:
 "spark": "autotune"
 driver:
 memory: "1g"
 labels:
 version: 2.4.0
 serviceAccount: spark-sa
 executor:
 instances: 2
 memory: "1g"
 labels:
 version: 2.4.0
{quote}
At some point the driver fails but it is still running and so the pods are 
still running:

19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 3.0 KiB, free 110.0 MiB)
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 1765.0 B, free 110.0 MiB)
19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 
MiB)
19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at 
DAGScheduler.scala:1180
19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks 
are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java 
heap space
 at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
 at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached

$ kubectl describe pod spark-pi2-driver -n spark
Name: spark-pi2-driver
Namespace: spark
Priority: 0
PriorityClassName: 
Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
Start Time: Fri, 31 May 2019 16:28:59 +0300
Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
 spark-role=driver
 sparkoperator.k8s.io/app-name=spark-pi2
 sparkoperator.k8s.io/launched-by-spark-operator=true
 sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
 version=2.4.0
Annotations: 
Status: Running
IP: 10.12.103.4
Controlled By: SparkApplication/spark-pi2
Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
 Image: skonto/spark:k8s-3.0.0-sa
 Image ID: 
docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 org.apache.spark.examples.SparkPi
 spark-internal
 100
 State: Running

In the container processes are in _interruptible sleep_:

PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit 
--deploy-mode client --conf spar
 287 0 185 S 2344 0% 3 0% sh
 294 287 185 R 1536 0% 3 0% top
 1 0 185 S 776 0% 0 0% /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file 
/opt/spark/conf/spark.prope



Liveness checks might be a workaround, but REST APIs may still be working if 
threads in the JVM are still running, as in this case (I did check the Spark UI and 
it was there).

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27873:


Assignee: Apache Spark

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> --
>
> Key: SPARK-27873
> URL: https://issues.apache.org/jira/browse/SPARK-27873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcin Mejran
>Assignee: Apache Spark
>Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for 
> storing corrupt records, then you need to add a new schema column 
> corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false the schema vs. 
> header validation fails because there is an extra column corresponding to 
> columnNameOfCorruptRecord.
> Since the FAILFAST mode doesn't print informative error messages about which 
> rows failed to parse, there is no other way to track down broken rows without 
> setting a corrupt record column.
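
A minimal repro sketch of the combination described above (file contents and paths are mine, not from the ticket): a user-supplied schema that includes the corrupt-record column, a header row, and enforceSchema=false.
{code}
// Sketch only: header + PERMISSIVE mode + corrupt-record column + enforceSchema=false.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-corrupt-record-repro").getOrCreate()

// Data schema plus the extra column that will hold malformed rows.
val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("enforceSchema", "false")  // header validation now compares 2 header columns to 3 schema columns
  .schema(schema)
  .csv("/tmp/people.csv")            // file has a 2-column header: id,name

df.show()
{code}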



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27873:


Assignee: (was: Apache Spark)

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> --
>
> Key: SPARK-27873
> URL: https://issues.apache.org/jira/browse/SPARK-27873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcin Mejran
>Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for 
> storing corrupt records, then you need to add a new schema column 
> corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs. 
> header validation fails because there is an extra column corresponding to 
> columnNameOfCorruptRecord.
> Since the FAILFAST mode doesn't print informative error messages about which 
> rows failed to parse, there is no other way to track down broken rows than 
> setting a corrupt record column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Juliusz Sompolski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853083#comment-16853083
 ] 

Juliusz Sompolski commented on SPARK-27899:
---

cc [~LI,Xiao], [~yumwang]

> Make HiveMetastoreClient.getTableObjectsByName available in 
> ExternalCatalog/SessionCatalog API
> --
>
> Key: SPARK-27899
> URL: https://issues.apache.org/jira/browse/SPARK-27899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> The new Spark ThriftServer SparkGetTablesOperation implemented in 
> https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
> request for every table. This can get very slow for large schemas (~50ms per 
> table with an external Hive metastore).
> Hive ThriftServer GetTablesOperation uses 
> HiveMetastoreClient.getTableObjectsByName to get table information in bulk, 
> but we don't expose that through our APIs that go through Hive -> 
> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> 
> SessionCatalog.
> If we added and exposed getTableObjectsByName through our catalog APIs, we 
> could resolve that performance problem in SparkGetTablesOperation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API

2019-05-31 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-27899:
-

 Summary: Make HiveMetastoreClient.getTableObjectsByName available 
in ExternalCatalog/SessionCatalog API
 Key: SPARK-27899
 URL: https://issues.apache.org/jira/browse/SPARK-27899
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Juliusz Sompolski


The new Spark ThriftServer SparkGetTablesOperation implemented in 
https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata 
request for every table. This can get very slow for large schemas (~50ms per 
table with an external Hive metastore).
Hive ThriftServer GetTablesOperation uses 
HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but 
we don't expose that through our APIs that go through Hive -> HiveClientImpl 
(HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog.

If we added and exposed getTableObjectsByName through our catalog APIs, we 
could resolve that performance problem in SparkGetTablesOperation.
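
A rough sketch of what a bulk-lookup addition could look like; the trait and method name are illustrative only, not the actual Spark API:

{code:scala}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical extension of the catalog interfaces: one metastore round trip for many
// tables, mirroring HiveMetastoreClient.getTableObjectsByName, instead of one
// getTableMetadata call per table.
trait BulkTableLookup {
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}
{code}

SparkGetTablesOperation could then replace its per-table getTableMetadata loop with a single call to such a method.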



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26822) Upgrade the deprecated module 'optparse'

2019-05-31 Thread Neo Chien (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neo Chien closed SPARK-26822.
-

> Upgrade the deprecated module 'optparse'
> 
>
> Key: SPARK-26822
> URL: https://issues.apache.org/jira/browse/SPARK-26822
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Neo Chien
>Assignee: Neo Chien
>Priority: Minor
>  Labels: pull-request-available, test
> Fix For: 3.0.0
>
>
> Follow the [official 
> document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
>  to upgrade the deprecated module 'optparse' to 'argparse'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27896:


Assignee: Apache Spark  (was: Sean Owen)

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --
>
> Key: SPARK-27896
> URL: https://issues.apache.org/jira/browse/SPARK-27896
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Reported by Samuel Kubler via email:
> In the code 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
>  I think there is a little mistake in the class “Silhouette” when you 
> calculate the silhouette coefficient for a point. Indeed, according to the 
> reference paper “Silhouettes: a graphical aid to the 
> interpretation and validation of cluster analysis” (Peter J. Rousseeuw, 1986), 
> for points that are alone in a cluster it is not 
> currentClusterDissimilarity that is supposed to be equal to 0, as it is 
> in your code (“val currentClusterDissimilarity = if 
> (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient itself. 
> Indeed, “When cluster A contains only a single object it is unclear how a(i) 
> should be defined, and then we simply set s(i) equal to zero”.
> The problem with defining currentClusterDissimilarity as zero, as you have 
> done, is that you can't use the silhouette coefficient anymore as a criterion 
> to determine the optimal number of clusters in your clustering 
> process, because your algorithm will answer that the more clusters you have, 
> the better your clustering is. Indeed, in that case, as the 
> number of clusters increases, s(i) converges toward 1 (so your 
> algorithm seems more efficient). I have, besides, checked this result on my 
> own clustering example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27896:


Assignee: Sean Owen  (was: Apache Spark)

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --
>
> Key: SPARK-27896
> URL: https://issues.apache.org/jira/browse/SPARK-27896
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Reported by Samuel Kubler via email:
> In the code 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
>  I think there is a little mistake in the class “Silhouette” when you 
> calculate the silhouette coefficient for a point. Indeed, according to the 
> reference paper “Silhouettes: a graphical aid to the 
> interpretation and validation of cluster analysis” (Peter J. Rousseeuw, 1986), 
> for points that are alone in a cluster it is not 
> currentClusterDissimilarity that is supposed to be equal to 0, as it is 
> in your code (“val currentClusterDissimilarity = if 
> (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient itself. 
> Indeed, “When cluster A contains only a single object it is unclear how a(i) 
> should be defined, and then we simply set s(i) equal to zero”.
> The problem with defining currentClusterDissimilarity as zero, as you have 
> done, is that you can't use the silhouette coefficient anymore as a criterion 
> to determine the optimal number of clusters in your clustering 
> process, because your algorithm will answer that the more clusters you have, 
> the better your clustering is. Indeed, in that case, as the 
> number of clusters increases, s(i) converges toward 1 (so your 
> algorithm seems more efficient). I have, besides, checked this result on my 
> own clustering example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27898:


Assignee: Apache Spark

> Support 4 date operators(date + integer, integer + date, date - integer and 
> date - date)
> 
>
> Key: SPARK-27898
> URL: https://issues.apache.org/jira/browse/SPARK-27898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Support 4 date operators (date + integer, integer + date, date - integer, and 
> date - date):
> |Operator|Example|Result|
> |+|date '2001-09-28' + integer '7'|date '2001-10-05'|
> |-|date '2001-10-01' - integer '7'|date '2001-09-24'|
> |-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)|
> [https://www.postgresql.org/docs/12/functions-datetime.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27898:


Assignee: (was: Apache Spark)

> Support 4 date operators(date + integer, integer + date, date - integer and 
> date - date)
> 
>
> Key: SPARK-27898
> URL: https://issues.apache.org/jira/browse/SPARK-27898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support 4 date operators (date + integer, integer + date, date - integer, and 
> date - date):
> |Operator|Example|Result|
> |+|date '2001-09-28' + integer '7'|date '2001-10-05'|
> |-|date '2001-10-01' - integer '7'|date '2001-09-24'|
> |-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)|
> [https://www.postgresql.org/docs/12/functions-datetime.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)

2019-05-31 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27898:
---

 Summary: Support 4 date operators(date + integer, integer + date, 
date - integer and date - date)
 Key: SPARK-27898
 URL: https://issues.apache.org/jira/browse/SPARK-27898
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Support 4 date operators (date + integer, integer + date, date - integer, and 
date - date):
|Operator|Example|Result|
|+|date '2001-09-28' + integer '7'|date '2001-10-05'|
|-|date '2001-10-01' - integer '7'|date '2001-09-24'|
|-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)|

[https://www.postgresql.org/docs/12/functions-datetime.html]
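
For reference, the results in the table above can already be produced with existing Spark SQL functions; a minimal sketch, assuming spark is an existing SparkSession:

{code:scala}
// Equivalents of the proposed operators using functions Spark SQL already provides.
spark.sql("SELECT date_add(DATE '2001-09-28', 7)").show()                 // 2001-10-05
spark.sql("SELECT date_sub(DATE '2001-10-01', 7)").show()                 // 2001-09-24
spark.sql("SELECT datediff(DATE '2001-10-01', DATE '2001-09-28')").show() // 3
{code}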



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27897:


Assignee: Apache Spark  (was: Thomas Graves)

> GPU Scheduling - move example discovery Script to scripts directory
> ---
>
> Key: SPARK-27897
> URL: https://issues.apache.org/jira/browse/SPARK-27897
> Project: Spark
>  Issue Type: Story
>  Components: Examples
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-27725 (GPU Scheduling - add an example discovery script) added a script 
> at 
> [https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh.]
> Instead of having it in the resources directory, let's move it to the scripts 
> directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory

2019-05-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27897:


Assignee: Thomas Graves  (was: Apache Spark)

> GPU Scheduling - move example discovery Script to scripts directory
> ---
>
> Key: SPARK-27897
> URL: https://issues.apache.org/jira/browse/SPARK-27897
> Project: Spark
>  Issue Type: Story
>  Components: Examples
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> SPARK-27725 (GPU Scheduling - add an example discovery script) added a script 
> at 
> [https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh.]
> Instead of having it in the resources directory, let's move it to the scripts 
> directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory

2019-05-31 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-27897:
-

 Summary: GPU Scheduling - move example discovery Script to scripts 
directory
 Key: SPARK-27897
 URL: https://issues.apache.org/jira/browse/SPARK-27897
 Project: Spark
  Issue Type: Story
  Components: Examples
Affects Versions: 3.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves


SPARK-27725 (GPU Scheduling - add an example discovery script) added a script at 
[https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh.]

Instead of having it in the resources directory, let's move it to the scripts 
directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27880) Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)

2019-05-31 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27880:

Description: 
{code:sql}
bool_and/booland_statefunc(expression) -- true if all input values are true, 
otherwise false
{code}
{code:sql}
bool_or/boolor_statefunc(expression) -- true if at least one input value is 
true, otherwise false
{code}
{code:sql}
every(expression) -- equivalent to bool_and
{code}
More details:
 [https://www.postgresql.org/docs/9.3/functions-aggregate.html]

  was:
{code:sql}
bool_and(expression) -- true if all input values are true, otherwise false
{code}
{code:sql}
bool_or(expression) -- true if at least one input value is true, otherwise false
{code}
{code:sql}
every(expression) -- equivalent to bool_and
{code}

More details:
https://www.postgresql.org/docs/9.3/functions-aggregate.html


> Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)
> -
>
> Key: SPARK-27880
> URL: https://issues.apache.org/jira/browse/SPARK-27880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> bool_and/booland_statefunc(expression) -- true if all input values are true, 
> otherwise false
> {code}
> {code:sql}
> bool_or/boolor_statefunc(expression) -- true if at least one input value is 
> true, otherwise false
> {code}
> {code:sql}
> every(expression) -- equivalent to bool_and
> {code}
> More details:
>  [https://www.postgresql.org/docs/9.3/functions-aggregate.html]
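
Until these aggregates are implemented, equivalent results can be expressed with min/max over an integer cast; a minimal sketch, where the table t and boolean column b are illustrative and spark is an existing SparkSession:

{code:scala}
// bool_and(b) / every(b): true iff every non-null value of b is true.
// bool_or(b):             true iff at least one value of b is true.
spark.sql("""
  SELECT min(CAST(b AS INT)) = 1 AS bool_and_b,
         max(CAST(b AS INT)) = 1 AS bool_or_b
  FROM t
""").show()
{code}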



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

2019-05-31 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852962#comment-16852962
 ] 

Sean Owen commented on SPARK-27896:
---

Copying follow up from email:

Yes the paper does say the silhouette is 0 in this case. That's an
argument to change it.

On the other hand, I am not sure if I agree with the paper here. If A
consists of one point, then that point's assignment is optimal in a
sense. Setting the silhouette to 0 indicates that assigning it to B,
which is a cluster of more distant points, is just as good. I don't
think that makes as much sense as 1, which it returns now.

You could argue that the silhouette is specifically penalizing this type of
assignment in a way that Euclidean distance does not.
Wikipedia's definition follows the paper:
https://en.wikipedia.org/wiki/Silhouette_(clustering)
It looks like sklearn also follows the paper's definition:
https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/metrics/cluster/unsupervised.py#L235

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --
>
> Key: SPARK-27896
> URL: https://issues.apache.org/jira/browse/SPARK-27896
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Reported by Samuel Kubler via email:
> In the code 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
>  I think there is a little mistake in the class “Silhouette” when you 
> calculate the silhouette coefficient for a point. Indeed, according to the 
> reference paper “Silhouettes: a graphical aid to the 
> interpretation and validation of cluster analysis” (Peter J. Rousseeuw, 1986), 
> for points that are alone in a cluster it is not 
> currentClusterDissimilarity that is supposed to be equal to 0, as it is 
> in your code (“val currentClusterDissimilarity = if 
> (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient itself. 
> Indeed, “When cluster A contains only a single object it is unclear how a(i) 
> should be defined, and then we simply set s(i) equal to zero”.
> The problem with defining currentClusterDissimilarity as zero, as you have 
> done, is that you can't use the silhouette coefficient anymore as a criterion 
> to determine the optimal number of clusters in your clustering 
> process, because your algorithm will answer that the more clusters you have, 
> the better your clustering is. Indeed, in that case, as the 
> number of clusters increases, s(i) converges toward 1 (so your 
> algorithm seems more efficient). I have, besides, checked this result on my 
> own clustering example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

2019-05-31 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27896:
-

 Summary: Fix definition of clustering silhouette coefficient for 
1-element clusters
 Key: SPARK-27896
 URL: https://issues.apache.org/jira/browse/SPARK-27896
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.3
Reporter: Sean Owen
Assignee: Sean Owen


Reported by Samuel Kubler via email:

In the code 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala,
 I think there is a little mistake in the class “Silhouette” when you calculate 
the silhouette coefficient for a point. Indeed, according to the reference 
paper “Silhouettes: a graphical aid to the interpretation and 
validation of cluster analysis” (Peter J. Rousseeuw, 1986), for points that 
are alone in a cluster it is not currentClusterDissimilarity that is 
supposed to be equal to 0, as it is in your code (“val 
currentClusterDissimilarity = if (pointClusterNumOfPoints == 1) {0.0}”), but the 
silhouette coefficient itself. Indeed, “When cluster A contains only a single 
object it is unclear how a(i) should be defined, and then we simply set s(i) 
equal to zero”.

The problem with defining currentClusterDissimilarity as zero, as you have 
done, is that you can't use the silhouette coefficient anymore as a criterion to 
determine the optimal number of clusters in your clustering 
process, because your algorithm will answer that the more clusters you have, the 
better your clustering is. Indeed, in that case, as the number 
of clusters increases, s(i) converges toward 1 (so your algorithm 
seems more efficient). I have, besides, checked this result on my own 
clustering example.
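
For reference, a minimal sketch of the per-point silhouette as the paper defines it, with the 1-element special case applied to s(i) rather than to a(i); this is an illustration of the definition, not the ClusteringEvaluator code:

{code:scala}
// a           = average dissimilarity of point i to the other points of its own cluster A
// b           = smallest average dissimilarity of point i to the points of any other cluster
// clusterSize = |A|, the number of points in i's cluster
def silhouette(a: Double, b: Double, clusterSize: Long): Double =
  if (clusterSize == 1) 0.0 // Rousseeuw: set s(i) = 0 when A contains a single object
  else (b - a) / math.max(a, b)
{code}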



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2019-05-31 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852874#comment-16852874
 ] 

Stavros Kontopoulos commented on SPARK-24815:
-

@[~Karthik Palaniappan] I can help with the design. Btw, there is a refactoring 
happening here: [https://github.com/apache/spark/pull/24704]

My concern with batch-mode dynamic allocation is that the task list may not tell the 
whole story: what if the number of tasks stays the same but the load changes per 
task/partition, e.g. with a Kafka source? As for state, I think you need to rebalance 
it, which translates to dynamic re-partitioning for the micro-batch mode in 
structured streaming. For continuous streaming it is harder, I think, but maybe 
a unified approach could solve it for both batch and continuous streaming, as in 
Flink: https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27791) Support SQL year-month INTERVAL type

2019-05-31 Thread Zhu, Lipeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852870#comment-16852870
 ] 

Zhu, Lipeng commented on SPARK-27791:
-

[~yumwang] 

I think it should be 
{code:sql}
select current_date - interval '1-1' year_month
{code}
 

> Support SQL year-month INTERVAL type
> 
>
> Key: SPARK-27791
> URL: https://issues.apache.org/jira/browse/SPARK-27791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The INTERVAL type must conform to the SQL year-month INTERVAL type, which has 2 
> logical types:
> # YEAR - Unconstrained except by the leading field precision
> # MONTH - Months (within years) (0-11)
> It must also support arithmetic operations involving values of type datetime or 
> interval.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27700) SparkSubmit closes with SocketTimeoutException in kubernetes mode.

2019-05-31 Thread Udbhav Agrawal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udbhav Agrawal resolved SPARK-27700.

Resolution: Duplicate

> SparkSubmit closes with SocketTimeoutException in kubernetes mode.
> --
>
> Key: SPARK-27700
> URL: https://issues.apache.org/jira/browse/SPARK-27700
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Udbhav Agrawal
>Priority: Major
> Attachments: socket timeout
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27791) Support SQL year-month INTERVAL type

2019-05-31 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852761#comment-16852761
 ] 

Yuming Wang commented on SPARK-27791:
-

Hi, [~maxgekk] Is this ticket to cover these 2 cases?
{code:sql}
SELECT interval '1' year to month;
SELECT interval '1-2' year to month;
{code}
[https://github.com/postgres/postgres/blob/df1a699e5ba3232f373790b2c9485ddf720c4a70/src/test/regress/sql/interval.sql#L180-L181]

> Support SQL year-month INTERVAL type
> 
>
> Key: SPARK-27791
> URL: https://issues.apache.org/jira/browse/SPARK-27791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The INTERVAL type must conform to the SQL year-month INTERVAL type, which has 2 
> logical types:
> # YEAR - Unconstrained except by the leading field precision
> # MONTH - Months (within years) (0-11)
> It must also support arithmetic operations involving values of type datetime or 
> interval.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27895) Spark streaming - RDD filter is always refreshing providing updated filtered items

2019-05-31 Thread Ilias Karalis (JIRA)
Ilias Karalis created SPARK-27895:
-

 Summary: Spark streaming - RDD filter is always refreshing 
providing updated filtered items
 Key: SPARK-27895
 URL: https://issues.apache.org/jira/browse/SPARK-27895
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.4.3, 2.4.2, 2.4.0
 Environment: Intellij, running local in windows10 laptop.

 
Reporter: Ilias Karalis


Spark streaming: 2.4.x

Scala: 2.11.11

 

In foreachRDD of a DStream, if a filter is used on the RDD, the filter keeps 
re-evaluating, providing new results continuously until the next batch is processed. 
For the new batch, the same thing occurs.

With the same code, if we do rdd.collect() and then run the filter on the 
collection, we get the results just once, and they remain stable until a new batch 
comes in.

The filter function is based on random probability (reservoir sampling).

 

val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))

def chooseX(x: (Long, Long)): Boolean = {
  val r = scala.util.Random
  val p = r.nextFloat()
  edgeTotalCounter += 1
  if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
    edgeLocalRDDCounter += 1
    println("Edge " + x + " has been selected and is number : " + edgeLocalRDDCounter + ".")
    true
  } else {
    false
  }
}

 

edgeLocalRDDCounter counts the edges selected from inputRdd.

What is strange is that the counter first increases from 1 to y, then the filter 
unexpectedly runs again and the counter increases again from y+1 to z. After that, 
each time the filter unexpectedly runs again, it produces results for which the 
counter starts from y+1. Each run of the filter produces different results and 
selects a different number of edges.

toSampleRDD always changes according to the newly produced results.

When a new batch comes in, the same behavior starts over for the new batch.
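
A possible explanation (an assumption, not a confirmed diagnosis): RDD transformations are lazy, so a filter with a non-deterministic predicate such as chooseX is re-evaluated every time toSampleRDD is used by an action, and each evaluation draws fresh random numbers. A minimal sketch of pinning the sampled subset by caching it, reusing the names from the snippet above:

{code:scala}
// Materialize the random selection once; later actions reuse the cached partitions
// instead of re-running the non-deterministic filter.
val toSampleRDD = inputRdd.filter(x => chooseX(x)).cache()
toSampleRDD.count() // force one evaluation
{code}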



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


