[jira] [Updated] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters
[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27896: -- Fix Version/s: (was: 2.4.4) > Fix definition of clustering silhouette coefficient for 1-element clusters > -- > > Key: SPARK-27896 > URL: https://issues.apache.org/jira/browse/SPARK-27896 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > Reported by Samuel Kubler via email: > In the code > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala, > I think there is a little mistake in the class “Silhouette” when you > calculate the Silhouette coefficient for a point. Indeed, according to the > scientific paper of reference “Silhouettes: a graphical aid to the > interpretation and validation of cluster analysis” Peter J. ROUSSEEUW 1986, > for the points which are alone in a cluster it is not the > currentClusterDissimilarity which is supposed to be equal to 0 like it is > the case in your code (“val currentClusterDissimilarity = if > (pointClusterNumOfPoints == 1) {0.0}”) but the silhouette coefficient itself. > Indeed, “When cluster A contains only a single object it is unclear how a(i) > should be defined, and then we simply set s(i) equal to zero”. > The problem of defining the currentClusterDissimilarity to zero like you have > done is that you can’t use the silhouette coefficient anymore as a criterion > to determine the optimal value of the number of clusters in your clustering > process because your algorithm will answer that the more clusters you have, > the better your clustering algorithm will be. Indeed, in that case when the > number of clustering classes increases, s(i) converges toward 1 (so your > algorithm seems to be more efficient). I have, besides, checked this result on my > own clustering example. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
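The corrected rule from the Rousseeuw paper can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the Spark ML code: `silhouette_sample`, the label list, and the precomputed distance matrix are all assumptions of this sketch. The key line is the early return: a point alone in its cluster gets s(i) = 0 directly, instead of getting a(i) = 0, which would push s(i) toward 1 as the number of clusters grows.

```python
def silhouette_sample(i, labels, dist):
    """Silhouette s(i) per Rousseeuw's paper; `dist` is a precomputed
    symmetric distance matrix and `labels` the cluster assignment."""
    own = labels[i]
    members = [j for j, lab in enumerate(labels) if lab == own and j != i]
    # Corrected rule: a singleton cluster yields s(i) = 0 itself, rather
    # than a(i) = 0 (the bug this issue fixes).
    if not members:
        return 0.0
    a = sum(dist[i][j] for j in members) / len(members)  # within-cluster a(i)
    b = min(                                             # nearest-cluster b(i)
        sum(dist[i][j] for j, lab in enumerate(labels) if lab == other)
        / sum(1 for lab in labels if lab == other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)
```

With the old behavior, every singleton point scored close to 1, so adding clusters always "improved" the average silhouette; with the rule above, singletons contribute a neutral 0.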
[jira] [Updated] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-23626: Affects Version/s: 2.4.3 > DAGScheduler blocked due to JobSubmitted event > --- > > Key: SPARK-23626 > URL: https://issues.apache.org/jira/browse/SPARK-23626 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.1, 2.3.3, 3.0.0, 2.4.3 >Reporter: Ajith S >Priority: Major > > DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted > events have to be processed, as DAGSchedulerEventProcessLoop is single-threaded > and blocks other events in the queue, such as TaskCompletion. > The JobSubmitted event is time-consuming depending on the nature of the job > (for example: calculating parent stage dependencies, shuffle dependencies, > partitions) and thus delays all the events queued behind it. > > I see multiple JIRAs referring to this behavior: > https://issues.apache.org/jira/browse/SPARK-2647 > https://issues.apache.org/jira/browse/SPARK-4961 > > Similarly, in my cluster some jobs' partition calculation is time-consuming > (similar to the stack at SPARK-2647), hence it slows down the Spark > DAGSchedulerEventProcessLoop, which causes user jobs to slow down even if > their tasks finish within seconds, as TaskCompletion events are processed > at a slower rate due to the blockage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-23626: Summary: DAGScheduler blocked due to JobSubmitted event (was: Spark DAGScheduler scheduling performance hindered on JobSubmitted Event) > DAGScheduler blocked due to JobSubmitted event > --- > > Key: SPARK-23626 > URL: https://issues.apache.org/jira/browse/SPARK-23626 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.1, 2.3.3, 3.0.0 >Reporter: Ajith S >Priority: Major > > DAGScheduler becomes a bottleneck in cluster when multiple JobSubmitted > events has to be processed as DAGSchedulerEventProcessLoop is single threaded > and it will block other tasks in queue like TaskCompletion. > The JobSubmitted event is time consuming depending on the nature of the job > (Example: calculating parent stage dependencies, shuffle dependencies, > partitions) and thus it blocks all the events to be processed. > > I see multiple JIRA referring to this behavior > https://issues.apache.org/jira/browse/SPARK-2647 > https://issues.apache.org/jira/browse/SPARK-4961 > > Similarly in my cluster some jobs partition calculation is time consuming > (Similar to stack at SPARK-2647) hence it slows down the spark > DAGSchedulerEventProcessLoop which results in user jobs to slowdown, even if > its tasks are finished within seconds, as TaskCompletion Events are processed > at a slower rate due to blockage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23626) Spark DAGScheduler scheduling performance hindered on JobSubmitted Event
[ https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-23626: Labels: (was: bulk-closed) > Spark DAGScheduler scheduling performance hindered on JobSubmitted Event > > > Key: SPARK-23626 > URL: https://issues.apache.org/jira/browse/SPARK-23626 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.1, 2.3.3, 3.0.0 >Reporter: Ajith S >Priority: Major > > DAGScheduler becomes a bottleneck in cluster when multiple JobSubmitted > events has to be processed as DAGSchedulerEventProcessLoop is single threaded > and it will block other tasks in queue like TaskCompletion. > The JobSubmitted event is time consuming depending on the nature of the job > (Example: calculating parent stage dependencies, shuffle dependencies, > partitions) and thus it blocks all the events to be processed. > > I see multiple JIRA referring to this behavior > https://issues.apache.org/jira/browse/SPARK-2647 > https://issues.apache.org/jira/browse/SPARK-4961 > > Similarly in my cluster some jobs partition calculation is time consuming > (Similar to stack at SPARK-2647) hence it slows down the spark > DAGSchedulerEventProcessLoop which results in user jobs to slowdown, even if > its tasks are finished within seconds, as TaskCompletion Events are processed > at a slower rate due to blockage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
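The bottleneck described in this issue can be illustrated with a minimal single-consumer event loop. This is a toy model, not Spark's scheduler code: the event names and sleep durations are stand-ins for real handler costs.

```python
import queue
import threading
import time

events = queue.Queue()
handled = []

def event_loop():
    # Single consumer thread, like DAGSchedulerEventProcessLoop: events are
    # handled strictly one at a time, so one slow handler delays everything
    # queued behind it.
    while True:
        name, cost = events.get()
        if name == "Stop":
            return
        time.sleep(cost)  # stands in for expensive stage/partition computation
        handled.append(name)

t = threading.Thread(target=event_loop)
t.start()
events.put(("JobSubmitted", 0.2))    # slow: computes parent stages, partitions
events.put(("TaskCompletion", 0.0))  # cheap, but must wait for the slow event
events.put(("Stop", 0.0))
t.join()
# handled ends up ["JobSubmitted", "TaskCompletion"]: the cheap completion
# event could not be processed until the slow submission handler finished.
```

Even though the TaskCompletion handler is essentially free, it observes the full latency of the JobSubmitted handler in front of it, which is exactly the slowdown reported here.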
[jira] [Updated] (SPARK-20856) support statement using nested joins
[ https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20856: Issue Type: Sub-task (was: Improvement) Parent: SPARK-27764 > support statement using nested joins > > > Key: SPARK-20856 > URL: https://issues.apache.org/jira/browse/SPARK-20856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: N Campbell >Priority: Major > Labels: bulk-closed > > While DB2, ORACLE etc support a join expressed as follows, SPARK SQL does > not. > Not supported > select * from > cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint > on tbint.rnum = tint.rnum > on tint.rnum = tsint.rnum > versus written as shown > select * from > cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner > join cert.tbint tbint on tint.rnum = tbint.rnum > > ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', > 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, > 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', > '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', > '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5) > == SQL == > select * from > cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint > on tbint.rnum = tint.rnum > on tint.rnum = tsint.rnum > -^^^ > , Query: select * from > cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint > on tbint.rnum = tint.rnum > on tint.rnum = tsint.rnum. > SQLState: HY000 > ErrorCode: 500051 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20856) support statement using nested joins
[ https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853575#comment-16853575 ] Yuming Wang commented on SPARK-20856: - PostgreSQL 11.3 also supports this join case: {code:sql} CREATE TABLE J1_TBL ( i integer, j integer, t text ); CREATE TABLE J2_TBL ( i integer, k integer ); CREATE TABLE J3_TBL ( i integer, k integer ); INSERT INTO J1_TBL VALUES (1, 4, 'one'); INSERT INTO J2_TBL VALUES (1, -1); INSERT INTO J3_TBL VALUES (1, -1); select * from J1_TBL t1 join J2_TBL t2 join J3_TBL t3 on t3.i = t2.i on t2.i = t1.i; {code} > support statement using nested joins > > > Key: SPARK-20856 > URL: https://issues.apache.org/jira/browse/SPARK-20856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: N Campbell >Priority: Major > Labels: bulk-closed > > While DB2, ORACLE etc support a join expressed as follows, SPARK SQL does > not. > Not supported > select * from > cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint > on tbint.rnum = tint.rnum > on tint.rnum = tsint.rnum > versus written as shown > select * from > cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner > join cert.tbint tbint on tint.rnum = tbint.rnum > > ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', > 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, > 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', > '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', > '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5) > == SQL == > select * from > cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint > on tbint.rnum = tint.rnum > on tint.rnum = tsint.rnum > -^^^ > , Query: select * from > cert.tsint tsint 
inner join cert.tint tint inner join cert.tbint tbint > on tbint.rnum = tint.rnum > on tint.rnum = tsint.rnum. > SQLState: HY000 > ErrorCode: 500051 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
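For comparison, the sequential rewrite that Spark does accept can be demonstrated with `sqlite3` from the Python standard library. The table names mirror the PostgreSQL example above; the nested double-`ON` form that Spark rejects is shown only in a comment, since this sketch runs only the rewritten query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE j1_tbl (i INTEGER, j INTEGER, t TEXT);
    CREATE TABLE j2_tbl (i INTEGER, k INTEGER);
    CREATE TABLE j3_tbl (i INTEGER, k INTEGER);
    INSERT INTO j1_tbl VALUES (1, 4, 'one');
    INSERT INTO j2_tbl VALUES (1, -1);
    INSERT INTO j3_tbl VALUES (1, -1);
""")
# Spark rejects the nested form, where both ON clauses trail the joins:
#   SELECT * FROM j1_tbl t1 JOIN j2_tbl t2 JOIN j3_tbl t3
#     ON t3.i = t2.i ON t2.i = t1.i
# The equivalent sequential rewrite, which Spark does parse, attaches each
# ON clause directly to its join:
rows = con.execute("""
    SELECT * FROM j1_tbl t1
    JOIN j2_tbl t2 ON t1.i = t2.i
    JOIN j3_tbl t3 ON t2.i = t3.i
""").fetchall()
# rows == [(1, 4, 'one', 1, -1, 1, -1)]
```

The two forms are semantically equivalent for inner joins; the nested syntax only changes the order in which the join tree is built.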
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24077: - Description: The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks confusing: {code} scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") {code} {code} org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 48 elided {code} was: {code} scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") {code} {code} org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 
48 elided {code} The error message of {{(name = "udf")}} > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` > -- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks > confusing: > {code} > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > {code} > {code} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853573#comment-16853573 ] Hyukjin Kwon commented on SPARK-24077: -- Reopened after editing the JIRA. > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` > -- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks > confusing: > {code} > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > {code} > {code} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24077: - Description: {code} scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") {code} {code} org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 48 elided {code} The error message of was: Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 
48 elided > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` > -- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > {code} > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > {code} > {code} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided > {code} > The error message of -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24077: - Description: {code} scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") {code} {code} org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 48 elided {code} The error message of {{(name = "udf")}} was: {code} scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") {code} {code} org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 
48 elided {code} The error message of > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` > -- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > {code} > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > {code} > {code} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided > {code} > The error message of {{(name = "udf")}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24077: - Summary: Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` (was: Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?) > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` > -- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
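One way to produce the "better error message" this issue asks for can be sketched as a pre-check outside the parser. This is a hypothetical helper: Spark's actual fix would live in a dedicated grammar rule, and `friendly_parse_error` and its message text are assumptions of this sketch.

```python
import re

# Hypothetical pre-check that turns the generic "mismatched input 'NOT'"
# parse failure into a targeted message for this unsupported clause.
TEMP_FUNC_INE = re.compile(
    r"CREATE\s+TEMPORARY\s+FUNCTION\s+IF\s+NOT\s+EXISTS", re.IGNORECASE)

def friendly_parse_error(sql):
    if TEMP_FUNC_INE.search(sql):
        return ("CREATE TEMPORARY FUNCTION does not support IF NOT EXISTS; "
                "remove the IF NOT EXISTS clause")
    return None  # fall back to the generic parser error
```

The point is that the statement is recognizable before (or during) parsing, so the user can be told precisely which clause is unsupported instead of seeing `mismatched input 'NOT' expecting {'.', 'AS'}`.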
[jira] [Resolved] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters
[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27896. --- Resolution: Fixed Fix Version/s: 3.0.0 2.4.4 This is resolved via https://github.com/apache/spark/pull/24756 > Fix definition of clustering silhouette coefficient for 1-element clusters > -- > > Key: SPARK-27896 > URL: https://issues.apache.org/jira/browse/SPARK-27896 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.4.4, 3.0.0 > > > Reported by Samuel Kubler via email: > In the code > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala, > I think there is a little mistake in the class “Silhouette” when you > calculate the Silhouette coefficient for a point. Indeed, according to the > scientific paper of reference “Silhouettes: a graphical aid to the > interpretation and validation of cluster analysis” Peter J. ROUSSEEUW 1986, > for the points which are alone in a cluster it is not the > currentClusterDissimilarity which is supposed to be equal to 0 like it is > the case in your code (“val currentClusterDissimilarity = if > (pointClusterNumOfPoints == 1) {0.0}” but the silhouette coefficient itself. > Indeed, “When cluster A contains only a single object it is unclear how a(i) > should be defined, and the we simply set s(i) equal to zero”. > The problem of defining the currentClusterDissimilarity to zero like you have > done is that you can’t use the silhouette coefficient anymore as a criterion > to determine the optimal value of the number of clusters in your clustering > process because your algorithm will answer that the more clusters you have, > the better will be your clustering algorithm. Indeed, in that case when the > number of clustering classes increases, s(i) converges toward 1. 
(so your > algorithm seems to be more efficient). I have, besides, checked this result on my > own clustering example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27794) Use secure URLs for downloading CRAN artifacts
[ https://issues.apache.org/jira/browse/SPARK-27794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27794. --- Resolution: Fixed Fix Version/s: 3.0.0 2.4.4 This is resolved via - https://github.com/apache/spark/pull/24664 (master) - https://github.com/apache/spark/pull/24758 (branch-2.4) > Use secure URLs for downloading CRAN artifacts > -- > > Key: SPARK-27794 > URL: https://issues.apache.org/jira/browse/SPARK-27794 > Project: Spark > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.4.4, 3.0.0 > > > Currently, artifacts from CRAN are downloaded from > http://cran.us.r-project.org . Ideally, this should be an HTTPS URL. It seems > like the main redirector is https://cloud.r-project.org . > On a lightly related note, there's also still a Dockerfile downloading Scala > over HTTP, which can be changed to HTTPS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27885) Announce deprecation of Python 2 support
[ https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27885: - Assignee: Xiangrui Meng > Announce deprecation of Python 2 support > > > Key: SPARK-27885 > URL: https://issues.apache.org/jira/browse/SPARK-27885 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > * Draft the message. > * Update Spark website and announce deprecation of Python 2 support in the > next major release in 2019 and remove the support in a release after > 2020/01/01. It should show up in the "Latest News" section. > * Announce it on users@ and dev@ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27914) Improve parser error message for ALTER TABLE ADD COLUMNS statement
Yesheng Ma created SPARK-27914: -- Summary: Improve parser error message for ALTER TABLE ADD COLUMNS statement Key: SPARK-27914 URL: https://issues.apache.org/jira/browse/SPARK-27914 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The {{ALTER TABLE ADD COLUMNS}} statement is often misspelled as {{ALTER TABLE ADD COLUMN}}. However, when a user runs such a statement, the error message is confusing. For example, the error message for {code:sql} ALTER TABLE test ADD COLUMN (x INT); {code} is {code:java} no viable alternative at input 'ALTER TABLE test ADD COLUMN'(line 1, pos 21) {code} which is misleading. One possible fix is to explicitly capture these statements in a grammar rule and print a user-friendly error message instructing users to change {{COLUMN}} to {{COLUMNS}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
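The grammar-rule approach suggested in this issue can be approximated with a hypothetical post-hoc check; `add_columns_hint` is an illustration of the idea, not Spark code.

```python
import re

# Hypothetical check mirroring the proposed grammar rule: catch the COLUMN
# misspelling in an ALTER TABLE statement and point the user at COLUMNS.
ADD_COLUMN = re.compile(r"\bADD\s+COLUMN\b", re.IGNORECASE)

def add_columns_hint(sql):
    if sql.upper().lstrip().startswith("ALTER TABLE") and ADD_COLUMN.search(sql):
        return "ALTER TABLE expects ADD COLUMNS, not ADD COLUMN"
    return None  # no hint; let the normal parser error through
```

Because `\b` requires a word boundary after `COLUMN`, the correctly spelled `ADD COLUMNS` does not trigger the hint.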
[jira] [Created] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
Owen O'Malley created SPARK-27913: - Summary: Spark SQL's native ORC reader implements its own schema evolution Key: SPARK-27913 URL: https://issues.apache.org/jira/browse/SPARK-27913 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.3 Reporter: Owen O'Malley ORC's reader handles a wide range of schema evolution, but the Spark SQL native ORC bindings do not provide the desired schema to the ORC reader. This causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
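The name-based column mapping that schema evolution performs can be sketched in Python. This is an illustrative model of the idea only, not the ORC reader API: `evolve`, the schema lists, and the tuple-based rows are assumptions of this sketch.

```python
def evolve(rows, file_schema, read_schema):
    """Project rows (tuples in file_schema order) onto read_schema by column
    name, filling columns absent from the file with None. This is the kind
    of reconciliation a reader can do only if it is handed the desired read
    schema -- the information the native ORC bindings fail to pass down."""
    idx = {name: pos for pos, name in enumerate(file_schema)}
    return [
        tuple(row[idx[col]] if col in idx else None for col in read_schema)
        for row in rows
    ]
```

Without the desired schema, the reader can only return columns in the file's own order and shape, which is why older files written before a column was added (or reordered) break under the 'native' implementation.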
[jira] [Created] (SPARK-27912) Improve parser error message for CASE clause
Yesheng Ma created SPARK-27912: -- Summary: Improve parser error message for CASE clause Key: SPARK-27912 URL: https://issues.apache.org/jira/browse/SPARK-27912 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The {{CASE}} clause is commonly used in SQL queries, but people can forget the trailing {{END}}. When a user runs such a statement, the error message is confusing. For example, the error message for {code:sql} SELECT (CASE WHEN a THEN b ELSE c) FROM a; {code} is {code:java} no viable alternative at input '(CASE WHEN a THEN b ELSE c)'(line 1, pos 33) {code} which is misleading. One possible fix is to explicitly capture these statements in a grammar rule and print a user-friendly error message such as {code:java} missing trailing END for CASE clause {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27911) PySpark Packages should automatically choose correct scala version
[ https://issues.apache.org/jira/browse/SPARK-27911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-27911: - Description: Today, users of pyspark (and Scala) need to manually specify the version of Scala that their Spark installation is using when adding a Spark package to their application. This extra configuration is confusing to users who may not even know which version of Scala they are using (for example, if they installed Spark using {{pip}}). The confusion here is exacerbated by releases in Spark that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}. https://spark.apache.org/releases/spark-release-2-4-2.html https://spark.apache.org/releases/spark-release-2-4-3.html Since Spark can know which version of Scala it was compiled for, we should give users the option to automatically choose the correct version. This could be as simple as a substitution for {{$scalaVersion}} or something when resolving a package (similar to SBTs support for automatically handling scala dependencies). Here are some concrete examples of users getting it wrong and getting confused: https://github.com/delta-io/delta/issues/6 https://github.com/delta-io/delta/issues/63 was: Today, users of pyspark (and Scala) need to manually specify the version of Scala that their Spark installation is using when adding a Spark package to their application. This extra configuration confusing to users who may not even know which version of Scala they are using (for example, if they installed Spark using {{pip}}). The confusion here is exacerbated by releases in Spark that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}. https://spark.apache.org/releases/spark-release-2-4-2.html https://spark.apache.org/releases/spark-release-2-4-3.html Since Spark can know which version of Scala it was compiled for, we should give users the option to automatically choose the correct version. 
This could be as simple as a substitution for {{$scalaVersion}} or something when resolving a package (similar to SBTs support for automatically handling scala dependencies). Here are some concrete examples of users getting it wrong and getting confused: https://github.com/delta-io/delta/issues/6 https://github.com/delta-io/delta/issues/63 > PySpark Packages should automatically choose correct scala version > -- > > Key: SPARK-27911 > URL: https://issues.apache.org/jira/browse/SPARK-27911 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Michael Armbrust >Priority: Major > > Today, users of pyspark (and Scala) need to manually specify the version of > Scala that their Spark installation is using when adding a Spark package to > their application. This extra configuration is confusing to users who may not > even know which version of Scala they are using (for example, if they > installed Spark using {{pip}}). The confusion here is exacerbated by releases > in Spark that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}. > https://spark.apache.org/releases/spark-release-2-4-2.html > https://spark.apache.org/releases/spark-release-2-4-3.html > Since Spark can know which version of Scala it was compiled for, we should > give users the option to automatically choose the correct version. This > could be as simple as a substitution for {{$scalaVersion}} or something when > resolving a package (similar to SBTs support for automatically handling scala > dependencies). > Here are some concrete examples of users getting it wrong and getting > confused: > https://github.com/delta-io/delta/issues/6 > https://github.com/delta-io/delta/issues/63 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27911) PySpark Packages should automatically choose correct scala version
Michael Armbrust created SPARK-27911: Summary: PySpark Packages should automatically choose correct scala version Key: SPARK-27911 URL: https://issues.apache.org/jira/browse/SPARK-27911 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.4.3 Reporter: Michael Armbrust Today, users of pyspark (and Scala) need to manually specify the version of Scala that their Spark installation is using when adding a Spark package to their application. This extra configuration is confusing to users who may not even know which version of Scala they are using (for example, if they installed Spark using {{pip}}). The confusion here is exacerbated by releases in Spark that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}. https://spark.apache.org/releases/spark-release-2-4-2.html https://spark.apache.org/releases/spark-release-2-4-3.html Since Spark can know which version of Scala it was compiled for, we should give users the option to automatically choose the correct version. This could be as simple as a substitution for {{$scalaVersion}} or something when resolving a package (similar to SBT's support for automatically handling Scala dependencies). Here are some concrete examples of users getting it wrong and getting confused: https://github.com/delta-io/delta/issues/6 https://github.com/delta-io/delta/issues/63 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
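The {{$scalaVersion}} substitution proposed in this issue could look roughly like the sketch below: take the Scala version Spark was compiled for, reduce it to the major.minor pair used in artifact suffixes, and splice it into the Maven coordinate. The placeholder name and function are hypothetical — this feature does not exist in Spark as described.

```python
def resolve_package(coordinate: str, scala_version: str) -> str:
    """Substitute a $scalaVersion placeholder in a Maven coordinate with the
    major.minor Scala version used in artifact suffixes (e.g. 2.12.8 -> 2.12).

    Hypothetical sketch of the substitution proposed in SPARK-27911."""
    major_minor = ".".join(scala_version.split(".")[:2])
    return coordinate.replace("$scalaVersion", major_minor)
```

With this, a user could write `--packages io.delta:delta-core_$scalaVersion:0.1.0` and never need to know which Scala version their pip-installed Spark carries, which is exactly the mistake in the linked Delta issues.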
[jira] [Created] (SPARK-27910) Improve parser error message for misused numeric identifiers
Yesheng Ma created SPARK-27910: -- Summary: Improve parser error message for misused numeric identifiers Key: SPARK-27910 URL: https://issues.apache.org/jira/browse/SPARK-27910 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma Numeric identifiers are misused commonly in Spark SQL queries. For example, the error message for {code:sql} CREATE TABLE test (`1` INT); SELECT test.1 FROM test; {code} is {code:java} Error in query: mismatched input '.1' expecting {, '(', ',', '.', '[', 'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DAY', 'DAYS', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DROP', 'ELSE', 'END', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'HOURS', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MICROSECOND', 'MICROSECONDS', 'MILLISECOND', 'MILLISECONDS', 'MINUTE', 'MINUTES', 'MONTH', 'MONTHS', 'MSCK', 'NATURAL', 'NO', 
NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SECONDS', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRUE', 'TRUNCATE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNLOCK', 'UNSET', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'WEEK', 'WEEKS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', 'YEARS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11) == SQL == SELECT test.1 FROM test {code} which is verbose and misleading. One possible way to fix is to explicitly capture these misused numeric identifiers in a grammar rule and print user-friendly error message such as {code:java} Numeric identifiers detected. Consider using quoted version test.`1` {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
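The friendlier message proposed above — detect a numeric identifier like {{test.1}} and suggest the backquoted form — can be sketched as a small check on the query text. This is an illustrative sketch, not Spark's parser; a real fix would capture the pattern in a grammar rule.

```python
import re

def suggest_quoted_identifier(sql: str):
    """Return a hint suggesting the backquoted form for a numeric column
    reference like test.1, or None if no such reference appears.

    Illustrative sketch of the error message proposed in SPARK-27910."""
    m = re.search(r"\b([A-Za-z_]\w*)\.(\d+)\b", sql)
    if m:
        table, col = m.group(1), m.group(2)
        return f"Numeric identifier detected. Consider using the quoted form {table}.`{col}`"
    return None
```

The point of the suggestion is to replace the wall of expected tokens above with one actionable line.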
[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853412#comment-16853412 ] Xiao Li commented on SPARK-17164: - I think we can issue a better error message here. like https://issues.apache.org/jira/browse/SPARK-27890 > Query with colon in the table name fails to parse in 2.0 > > > Key: SPARK-17164 > URL: https://issues.apache.org/jira/browse/SPARK-17164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Priority: Major > > Running a simple query with colon in table name fails to parse in 2.0 > {code} > == SQL == > SELECT * FROM a:b > ---^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 48 elided > {code} > Please note that this is a regression from Spark 1.6 as the query runs fine > in 1.6. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17164: Issue Type: Sub-task (was: Bug) Parent: SPARK-27901 > Query with colon in the table name fails to parse in 2.0 > > > Key: SPARK-17164 > URL: https://issues.apache.org/jira/browse/SPARK-17164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Priority: Major > > Running a simple query with colon in table name fails to parse in 2.0 > {code} > == SQL == > SELECT * FROM a:b > ---^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 48 elided > {code} > Please note that this is a regression from Spark 1.6 as the query runs fine > in 1.6. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27909: Assignee: (was: Apache Spark) > Fix CTE substitution dependence on ResolveRelations throwing AnalysisException > -- > > Key: SPARK-27909 > URL: https://issues.apache.org/jira/browse/SPARK-27909 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > CTE substitution currently works by running all analyzer rules on plans after > each substitution. It does this to fix a recursive CTE case, but this design > requires the ResolveRelations rule to throw an AnalysisException when it > cannot resolve a table or else the CTE substitution will run again and may > possibly recurse infinitely. > Table resolution should be possible across multiple independent rules. To > accomplish this, the current ResolveRelations rule detects cases where other > rules (like ResolveDataSource) will resolve a TableIdentifier and returns the > UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-27909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27909: Assignee: Apache Spark > Fix CTE substitution dependence on ResolveRelations throwing AnalysisException > -- > > Key: SPARK-27909 > URL: https://issues.apache.org/jira/browse/SPARK-27909 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > CTE substitution currently works by running all analyzer rules on plans after > each substitution. It does this to fix a recursive CTE case, but this design > requires the ResolveRelations rule to throw an AnalysisException when it > cannot resolve a table or else the CTE substitution will run again and may > possibly recurse infinitely. > Table resolution should be possible across multiple independent rules. To > accomplish this, the current ResolveRelations rule detects cases where other > rules (like ResolveDataSource) will resolve a TableIdentifier and returns the > UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27374) Fetch assigned resources from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-27374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27374. --- Resolution: Duplicate > Fetch assigned resources from TaskContext > - > > Key: SPARK-27374 > URL: https://issues.apache.org/jira/browse/SPARK-27374 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27362. --- Resolution: Fixed Fix Version/s: 3.0.0 > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement k8s support for GPU-aware scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27373. --- Resolution: Fixed Fix Version/s: 3.0.0 > Design: Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27373 > URL: https://issues.apache.org/jira/browse/SPARK-27373 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
Ryan Blue created SPARK-27909: - Summary: Fix CTE substitution dependence on ResolveRelations throwing AnalysisException Key: SPARK-27909 URL: https://issues.apache.org/jira/browse/SPARK-27909 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue CTE substitution currently works by running all analyzer rules on plans after each substitution. It does this to fix a recursive CTE case, but this design requires the ResolveRelations rule to throw an AnalysisException when it cannot resolve a table or else the CTE substitution will run again and may possibly recurse infinitely. Table resolution should be possible across multiple independent rules. To accomplish this, the current ResolveRelations rule detects cases where other rules (like ResolveDataSource) will resolve a TableIdentifier and returns the UnresolvedRelation unmodified only in those cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
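The design problem described in this issue — substitution re-running until relations resolve, with an AnalysisException as the only thing stopping infinite recursion — can be illustrated with a toy substitution loop. This is a deliberately simplified model, not Spark's analyzer: it inlines CTE names until a fixpoint and needs an explicit bound to terminate on a self-referencing definition.

```python
import re

def substitute_ctes(query: str, ctes: dict, max_rounds: int = 10) -> str:
    """Repeatedly inline CTE names into the query until nothing changes.

    Toy model of the behavior described in SPARK-27909: without a bound (or an
    exception on unresolved names), a self-referencing CTE loops forever."""
    for _ in range(max_rounds):
        expanded = query
        for name, body in ctes.items():
            # \b keeps the name from matching inside longer identifiers
            expanded = re.sub(rf"\b{name}\b", f"({body})", expanded)
        if expanded == query:  # fixpoint reached: all names inlined
            return query
        query = expanded
    raise RuntimeError("substitution did not converge; possible recursive CTE")
```

Decoupling "is this name a CTE?" from "did resolution throw?" is what lets table resolution be spread across multiple independent rules, as the issue proposes.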
[jira] [Resolved] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory
[ https://issues.apache.org/jira/browse/SPARK-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27897. --- Resolution: Fixed Fix Version/s: 3.0.0 > GPU Scheduling - move example discovery Script to scripts directory > --- > > Key: SPARK-27897 > URL: https://issues.apache.org/jira/browse/SPARK-27897 > Project: Spark > Issue Type: Story > Components: Examples >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Minor > Fix For: 3.0.0 > > > SPARK-27725 GPU Scheduling - add an example discovery Script added a script > at > [https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh.] > Instead of having it in the resources directory lets move it to the scripts > directory -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27907) HiveUDAF with 0 rows throw NPE
[ https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-27907: Description: When query returns zero rows, the HiveUDAFFunction throws NPE CASE 1: create table abc(a int) select histogram_numeric(a,2) from abc // NPE Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveUDAFFunction.eval(hiveUDFs.scala:471) at org.apache.spark.sql.hive.HiveUDAFFunction.eval(hiveUDFs.scala:315) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.eval(interfaces.scala:543) at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(AggregationIterator.scala:231) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:122) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) CASE 2: create table abc(a int) insert into abc values (1) select histogram_numeric(a,2) from abc where a=3 //NPE Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477) at org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570) at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at
[jira] [Updated] (SPARK-27907) HiveUDAF with 0 rows throw NPE
[ https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-27907: Summary: HiveUDAF with 0 rows throw NPE (was: HiveUDAF with 0 rows throw NPE when try to serialize) > HiveUDAF with 0 rows throw NPE > -- > > Key: SPARK-27907 > URL: https://issues.apache.org/jira/browse/SPARK-27907 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3, 3.0.0, 2.4.3, 3.1.0 >Reporter: Ajith S >Priority: Major > > When query returns zero rows, the HiveUDAFFunction.seralize throws NPE > create table abc(a int) > insert into abc values (1) > insert into abc values (2) > select histogram_numeric(a,2) from abc where a=3 //NPE > Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477) > at > org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:122) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
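The class of fix this bug calls for — guard the zero-row case instead of dereferencing a null aggregation buffer in serialize/eval — can be sketched as follows. This is an illustrative sketch in Python; in Spark the buffer is a Hive AggregationBuffer and the actual change belongs in HiveUDAFFunction (hiveUDFs.scala).

```python
def serialize_buffer(buffer, serializer):
    """Serialize an aggregation buffer, tolerating the zero-row case where no
    buffer was ever created.

    Sketch of the null-guard idea behind the SPARK-27907 fix, not the real
    HiveUDAFFunction code."""
    if buffer is None:  # the query matched zero rows, so nothing was aggregated
        return b""
    return serializer(buffer)
```

With a guard like this, {{select histogram_numeric(a,2) from abc where a=3}} on an empty match would produce an empty/null result instead of a NullPointerException.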
[jira] [Created] (SPARK-27908) Improve parser error message for SELECT TOP statement
Yesheng Ma created SPARK-27908: -- Summary: Improve parser error message for SELECT TOP statement Key: SPARK-27908 URL: https://issues.apache.org/jira/browse/SPARK-27908 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The {{SELECT TOP}} statement is actually not supported in Spark SQL. However, when a user queries such a statement, the error message is confusing. For example, the error message for {code:sql} SELECT TOP 1 FROM test; {code} is {code:java} Error in query: mismatched input '1' expecting {, '(', ',', '.', '[', 'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DAY', 'DAYS', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DROP', 'ELSE', 'END', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'HOURS', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MICROSECOND', 'MICROSECONDS', 'MILLISECOND', 'MILLISECONDS', 'MINUTE', 'MINUTES', 
'MONTH', 'MONTHS', 'MSCK', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SECONDS', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRUE', 'TRUNCATE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNLOCK', 'UNSET', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'WEEK', 'WEEKS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', 'YEARS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 11) == SQL == SELECT TOP 1 FROM test ---^^^ {code} which is verbose and misleading. One possible way to fix is to explicitly capture these statements in a grammar rule and print user-friendly error message such as {code:java} SELECT TOP statements are not supported. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27907) HiveUDAF with 0 rows throws NPE when trying to serialize
[ https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27907: Assignee: (was: Apache Spark) > HiveUDAF with 0 rows throws NPE when trying to serialize > > > Key: SPARK-27907 > URL: https://issues.apache.org/jira/browse/SPARK-27907 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3, 3.0.0, 2.4.3, 3.1.0 >Reporter: Ajith S >Priority: Major > > When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE > create table abc(a int) > insert into abc values (1) > insert into abc values (2) > select histogram_numeric(a,2) from abc where a=3 //NPE > Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477) > at > org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:122) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27907) HiveUDAF with 0 rows throws NPE when trying to serialize
[ https://issues.apache.org/jira/browse/SPARK-27907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27907: Assignee: Apache Spark > HiveUDAF with 0 rows throws NPE when trying to serialize > > > Key: SPARK-27907 > URL: https://issues.apache.org/jira/browse/SPARK-27907 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3, 3.0.0, 2.4.3, 3.1.0 >Reporter: Ajith S >Assignee: Apache Spark >Priority: Major > > When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE > create table abc(a int) > insert into abc values (1) > insert into abc values (2) > select histogram_numeric(a,2) from abc where a=3 //NPE > Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477) > at > org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:122) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27907) HiveUDAF with 0 rows throws NPE when trying to serialize
Ajith S created SPARK-27907: --- Summary: HiveUDAF with 0 rows throws NPE when trying to serialize Key: SPARK-27907 URL: https://issues.apache.org/jira/browse/SPARK-27907 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 2.3.3, 3.0.0, 3.1.0 Reporter: Ajith S When a query returns zero rows, HiveUDAFFunction.serialize throws an NPE:

create table abc(a int)
insert into abc values (1)
insert into abc values (2)
select histogram_numeric(a,2) from abc where a=3 //NPE

Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477) at org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570) at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:122) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
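The NPE arises because, with zero matching rows, the partial-aggregation buffer is never populated before serialization is attempted. A toy Python sketch of the guard-style fix (hypothetical names; the real change would live in `HiveUDAFFunction.serialize` in Scala):

```python
import pickle

class HistogramAggBuffer:
    """Toy stand-in for a UDAF partial-aggregation buffer."""
    def __init__(self):
        self.values = []

def serialize(buffer):
    # Guard: with zero input rows the buffer may never have been created,
    # so serialize an empty buffer instead of dereferencing None (the
    # Python analogue of the NullPointerException in the report).
    if buffer is None:
        buffer = HistogramAggBuffer()
    return pickle.dumps(buffer.values)
```

Here `serialize(None)` round-trips to an empty list rather than crashing, which mirrors the expected behavior of `histogram_numeric` over an empty input: an empty (or null) aggregate, not a task failure.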
[jira] [Assigned] (SPARK-27905) Add higher-order function `forall`
[ https://issues.apache.org/jira/browse/SPARK-27905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27905: Assignee: (was: Apache Spark) > Add higher-order function `forall` > - > > Key: SPARK-27905 > URL: https://issues.apache.org/jira/browse/SPARK-27905 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nikolas Vanderhoof >Priority: Major > > Add the SQL function forall. > `forall` tests an array to see if the predicate holds for every item of the array. > This complements the `exists` higher-order function. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27905) Add higher-order function `forall`
[ https://issues.apache.org/jira/browse/SPARK-27905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27905: Assignee: Apache Spark > Add higher-order function `forall` > - > > Key: SPARK-27905 > URL: https://issues.apache.org/jira/browse/SPARK-27905 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nikolas Vanderhoof >Assignee: Apache Spark >Priority: Major > > Add the SQL function forall. > `forall` tests an array to see if the predicate holds for every item of the array. > This complements the `exists` higher-order function. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27906) Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-27906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-27906: --- Description: The {{CREATE LOCAL TEMPORARY TABLE}} statement is actually not supported in Spark SQL. However, when a user queries such a statement, the error message is confusing. For example, the error message for {code:sql} CREATE LOCAL TEMPORARY TABLE my_table (x INT); {code} is {code:java} no viable alternative at input 'CREATE LOCAL'(line 1, pos 7) {code} which is misleading. One possible way to fix is to explicitly capture these statements in a grammar rule and print a user-friendly error message such as {code:java} CREATE LOCAL TEMPORARY TABLE statements are not supported. {code} was: The {{SHOW VIEW}} statement is actually not supported in Spark SQL. However, when a user queries such a statement, the error message is confusing. For example, the error message for {code:sql} SHOW VIEWS IN my_database {code} is {code:java} missing 'FUNCTIONS' at 'IN'(line 1, pos 11) {code} which is misleading. One possible way to fix is to explicitly capture these statements in a grammar rule and print a user-friendly error message such as {code:java} SHOW VIEW statements are not supported. {code} > Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement > --- > > Key: SPARK-27906 > URL: https://issues.apache.org/jira/browse/SPARK-27906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The {{CREATE LOCAL TEMPORARY TABLE}} statement is actually not supported in > Spark SQL. However, when a user queries such a statement, the error message > is confusing. For example, the error message for > {code:sql} > CREATE LOCAL TEMPORARY TABLE my_table (x INT); > {code} > is > {code:java} > no viable alternative at input 'CREATE LOCAL'(line 1, pos 7) > {code} > which is misleading. 
> > One possible way to fix is to explicitly capture these statements in a > grammar rule and print a user-friendly error message such as > {code:java} > CREATE LOCAL TEMPORARY TABLE statements are not supported. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27906) Improve parser error message for CREATE LOCAL TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-27906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-27906: --- Description: The {{SHOW VIEW}} statement is actually not supported in Spark SQL. However, when a user queries such a statement, the error message is confusing. For example, the error message for {code:sql} SHOW VIEWS IN my_database {code} is {code:java} missing 'FUNCTIONS' at 'IN'(line 1, pos 11) {code} which is misleading. One possible way to fix is to explicitly capture these statements in a grammar rule and print a user-friendly error message such as {code:java} SHOW VIEW statements are not supported. {code} > Improve parser error message for CREATE LOCAL TABLE statement > - > > Key: SPARK-27906 > URL: https://issues.apache.org/jira/browse/SPARK-27906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The {{SHOW VIEW}} statement is actually not supported in Spark SQL. However, when > a user queries such a statement, the error message is confusing. For example, > the error message for > {code:sql} > SHOW VIEWS IN my_database > {code} > is > {code:java} > missing 'FUNCTIONS' at 'IN'(line 1, pos 11) > {code} > which is misleading. > > One possible way to fix is to explicitly capture these statements in a > grammar rule and print a user-friendly error message such as > {code:java} > SHOW VIEW statements are not supported. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions
[ https://issues.apache.org/jira/browse/SPARK-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-27903: --- Description: When parentheses are mismatched in expressions in queries, the error message is confusing. This is especially true for large queries, where mismatched parens are tedious for a human to figure out. For example, the error message for {code:sql} SELECT ((x + y) * z FROM t; {code} is {code:java} mismatched input 'FROM' expecting ','(line 1, pos 20) {code} One possible way to fix is to explicitly capture such mismatched parens in a grammar rule and print a user-friendly error message such as {code:java} mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, pos 20) {code} was: > Improve parser error message for mismatched parentheses in expressions > -- > > Key: SPARK-27903 > URL: https://issues.apache.org/jira/browse/SPARK-27903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > When parentheses are mismatched in expressions in queries, the error message > is confusing. This is especially true for large queries, where mismatched > parens are tedious for a human to figure out. > For example, the error message for > {code:sql} > SELECT ((x + y) * z FROM t; > {code} > is > {code:java} > mismatched input 'FROM' expecting ','(line 1, pos 20) > {code} > One possible way to fix is to explicitly capture such mismatched > parens in a grammar rule and print a user-friendly error message such as > {code:java} > mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, > pos 20) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
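Pinpointing the offending parenthesis is cheap with a linear scan over the statement. A minimal Python sketch of that idea (illustrative only; `find_unbalanced_paren` is a hypothetical helper, and a real fix would live in the ANTLR grammar or its error strategy):

```python
def find_unbalanced_paren(sql: str):
    """Return the position of the first unmatched parenthesis, or None.

    A stack of open-paren positions lets us report exactly which paren
    has no partner, enabling a message like
    "mismatched parentheses for expression '...'(line 1, pos N)".
    """
    stack = []
    for pos, ch in enumerate(sql):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                return pos          # closing paren with no opener
            stack.pop()
    return stack[0] if stack else None  # leftover opener, if any

pos = find_unbalanced_paren("SELECT ((x + y) * z FROM t;")
# pos is the index of the '(' that was never closed
```

This does not handle parens inside string literals or comments; a production implementation would track those states too.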
[jira] [Updated] (SPARK-27906) Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-27906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-27906: --- Summary: Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement (was: Improve parser error message for CREATE LOCAL TABLE statement) > Improve parser error message for CREATE LOCAL TEMPORARY TABLE statement > --- > > Key: SPARK-27906 > URL: https://issues.apache.org/jira/browse/SPARK-27906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The {{SHOW VIEW}} statement is actually not supported in Spark SQL. However, when > a user queries such a statement, the error message is confusing. For example, > the error message for > {code:sql} > SHOW VIEWS IN my_database > {code} > is > {code:java} > missing 'FUNCTIONS' at 'IN'(line 1, pos 11) > {code} > which is misleading. > > One possible way to fix is to explicitly capture these statements in a > grammar rule and print a user-friendly error message such as > {code:java} > SHOW VIEW statements are not supported. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27906) Improve parser error message for CREATE LOCAL TABLE statement
Yesheng Ma created SPARK-27906: -- Summary: Improve parser error message for CREATE LOCAL TABLE statement Key: SPARK-27906 URL: https://issues.apache.org/jira/browse/SPARK-27906 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions
[ https://issues.apache.org/jira/browse/SPARK-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-27903: --- Description: was: When parentheses are mismatched in expressions in queries, the error message is confusing. This is especially true for large queries, where mismatched parens are tedious for human to figure out. For example, the error message for {code:sql} SELECT ((x + y) * z FROM t; {code} is {code:java} mismatched input 'FROM' expecting ','(line 1, pos 20) {code} One possible way to fix is to explicitly capture such kind of mismatched parens in a grammar rule and print user-friendly error message such as {code:java} mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, pos 20) {code} > Improve parser error message for mismatched parentheses in expressions > -- > > Key: SPARK-27903 > URL: https://issues.apache.org/jira/browse/SPARK-27903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27905) Add higher-order function `forall`
Nikolas Vanderhoof created SPARK-27905: -- Summary: Add higher-order function `forall` Key: SPARK-27905 URL: https://issues.apache.org/jira/browse/SPARK-27905 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Nikolas Vanderhoof Add the SQL function forall. `forall` tests an array to see if the predicate holds for every item of the array. This complements the `exists` higher-order function. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
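The semantics of `forall` can be sketched in a few lines of Python. This is only the plain-logic core: Spark SQL's handling of NULL elements and NULL arrays has extra three-valued-logic subtleties that are not modeled here.

```python
def forall(arr, pred):
    # True iff pred holds for every element; vacuously True for [].
    return all(pred(x) for x in arr)

def exists(arr, pred):
    # True iff pred holds for at least one element.
    return any(pred(x) for x in arr)

# forall is the dual of exists: forall(p) == not exists(not p).
is_even = lambda x: x % 2 == 0
assert forall([2, 4, 6], is_even)
assert not forall([2, 3], is_even)
assert forall([], is_even)  # vacuous truth on the empty array
```

The duality is why `forall` is described as complementing `exists`: each can be defined in terms of the other by negating the predicate and the result.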
[jira] [Created] (SPARK-27904) Improve parser error message for SHOW VIEW statement
Yesheng Ma created SPARK-27904: -- Summary: Improve parser error message for SHOW VIEW statement Key: SPARK-27904 URL: https://issues.apache.org/jira/browse/SPARK-27904 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The {{SHOW VIEW}} statement is actually not supported in Spark SQL. However, when a user queries such a statement, the error message is confusing. For example, the error message for {code:sql} SHOW VIEWS IN my_database {code} is {code:java} missing 'FUNCTIONS' at 'IN'(line 1, pos 11) {code} which is misleading. One possible way to fix is to explicitly capture these statements in a grammar rule and print a user-friendly error message such as {code:java} SHOW VIEW statements are not supported. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21529) Improve the error message for unsupported Uniontype
[ https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21529: Issue Type: Sub-task (was: Bug) Parent: SPARK-27901 > Improve the error message for unsupported Uniontype > --- > > Key: SPARK-21529 > URL: https://issues.apache.org/jira/browse/SPARK-21529 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 > Environment: Qubole, DataBricks >Reporter: Elliot West >Priority: Major > Labels: hive, starter, uniontype > > We encounter errors when attempting to read Hive tables whose schema contains > the {{uniontype}}. It appears that Catalyst > does not support the {{uniontype}}, which renders this table unreadable by > Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive > query engine, it is fully supported by the storage engine and also the Avro > data format, which we use for these tables. Therefore, I believe it is > a valid, usable type construct that should be supported by Spark. > We've attempted to read the table as follows: > {code} > spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' > limit 5").show > val tblread = spark.read.table("etl.tbl") > {code} > But this always results in the same error message. The pertinent error > messages are as follows (full stack trace below): > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype ... > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting > {, '('} > (line 1, pos 9) > == SQL == > uniontype -^^^ > {code} > h2. 
Full stack trace > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>> > at > org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230) > 
at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648) > at >
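The `ParseException` above surfaces because Catalyst's type-string parser has no rule for `uniontype`, so it fails on the `<` with a generic token-expectation message. The improvement asked for here is to recognize the token and fail deliberately. A toy Python sketch of that shape (hypothetical names; not Catalyst's parser, and the suggested remediation text is an illustration, not official guidance):

```python
PRIMITIVE_TYPES = {"int", "bigint", "boolean", "string", "double"}
COMPLEX_TYPES = {"array", "map", "struct"}

def parse_hive_type(type_string: str) -> str:
    """Toy head-type check illustrating a targeted error for uniontype."""
    # The head of a Hive type string is the name before any '<...>' params.
    head = type_string.split("<", 1)[0].strip().lower()
    if head == "uniontype":
        # Known-but-unsupported: fail with an explicit, actionable message
        # instead of a generic "mismatched input '<'" parse error.
        raise ValueError(
            "Hive uniontype columns are not supported by Spark SQL"
        )
    if head in PRIMITIVE_TYPES or head in COMPLEX_TYPES:
        return head
    raise ValueError(f"Cannot recognize hive type string: {type_string}")
```

The key point is the dedicated branch: distinguishing "unsupported but recognized" from "unparseable" is what turns the confusing error into a clear one.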
[jira] [Updated] (SPARK-21529) Improve the error message for unsupported Uniontype
[ https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21529: Labels: hive starter uniontype (was: bulk-closed hive uniontype) > Improve the error message for unsupported Uniontype > --- > > Key: SPARK-21529 > URL: https://issues.apache.org/jira/browse/SPARK-21529 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Qubole, DataBricks >Reporter: Elliot West >Priority: Major > Labels: hive, starter, uniontype > > We encounter errors when attempting to read Hive tables whose schema contains > the {{uniontype}}. It appears perhaps that Catalyst > does not support the {{uniontype}} which renders this table unreadable by > Spark (2.1). Although, {{uniontype}} is arguably incomplete in the Hive > query engine, it is fully supported by the storage engine and also the Avro > data format, which we use for these tables. Therefore, I believe it is > a valid, usable type construct that should be supported by Spark. > We've attempted to read the table as follows: > {code} > spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' > limit 5").show > val tblread = spark.read.table("etl.tbl") > {code} > But this always results in the same error message. The pertinent error > messages are as follows (full stack trace below): > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype ... > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting > {, '('} > (line 1, pos 9) > == SQL == > uniontype -^^^ > {code} > h2. 
Full stack trace > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>> > at > org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230) > 
at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648) > at >
[jira] [Updated] (SPARK-21529) Improve the error message for unsupported Uniontype
[ https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21529: Summary: Improve the error message for unsupported Uniontype (was: Uniontype not supported when reading from Hive tables.) > Improve the error message for unsupported Uniontype > --- > > Key: SPARK-21529 > URL: https://issues.apache.org/jira/browse/SPARK-21529 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Qubole, DataBricks >Reporter: Elliot West >Priority: Major > Labels: bulk-closed, hive, uniontype > > We encounter errors when attempting to read Hive tables whose schema contains > the {{uniontype}}. It appears that Catalyst > does not support the {{uniontype}}, which renders these tables unreadable by > Spark (2.1). Although {{uniontype}} is arguably incomplete in the Hive > query engine, it is fully supported by the storage engine and also by the Avro > data format, which we use for these tables. Therefore, I believe it is > a valid, usable type construct that should be supported by Spark. > We've attempted to read the table as follows: > {code} > spark.sql("select * from etl.tbl where acquisition_instant='20170706T133545Z' > limit 5").show > val tblread = spark.read.table("etl.tbl") > {code} > But this always results in the same error message. The pertinent error > messages are as follows (full stack trace below): > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype ... > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting > {, '('} > (line 1, pos 9) > == SQL == > uniontype -^^^ > {code} > h2. 
Full stack trace > {code} > org.apache.spark.SparkException: Cannot recognize hive type string: > uniontype>>,n:boolean,o:string,p:bigint,q:string>,struct,ag:boolean,ah:string,ai:bigint,aj:string>> > at > org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:800) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:377) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:377) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:373) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:373) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230) > 
at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:371) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:79) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:648) > at >
[jira] [Reopened] (SPARK-21529) Improve the error message for unsupported Uniontype
[ https://issues.apache.org/jira/browse/SPARK-21529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-21529: -
[jira] [Created] (SPARK-27903) Improve parser error message for mismatched parentheses in expressions
Yesheng Ma created SPARK-27903: -- Summary: Improve parser error message for mismatched parentheses in expressions Key: SPARK-27903 URL: https://issues.apache.org/jira/browse/SPARK-27903 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma When parentheses are mismatched in an expression in a query, the error message is confusing. This is especially true for large queries, where mismatched parentheses are tedious for a human to track down. For example, the error message for {code:sql} SELECT ((x + y) * z FROM t; {code} is {code:java} mismatched input 'FROM' expecting ','(line 1, pos 20) {code} One possible fix is to explicitly capture this kind of mismatched parenthesis in a grammar rule and print a user-friendly error message such as {code:java} mismatched parentheses for expression 'SELECT ((x + y) * z FROM t;'(line 1, pos 20) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
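The kind of pre-parse check suggested in the ticket can be sketched outside of Spark. The following standalone Python function is purely illustrative (it is not Spark's ANTLR-based parser and ignores quoted strings); it scans a SQL string and reports the line and 0-based position of the first unmatched parenthesis, which is the location a friendlier message could point at:

```python
# Illustrative sketch only: a minimal pre-parse balance check for
# parentheses in a SQL string, of the kind that could yield a more
# targeted message than a generic "mismatched input" token error.
def check_parens(sql: str):
    """Return (line, pos) of the first unmatched '(' or ')', or None."""
    stack = []  # positions of '(' not yet matched by ')'
    line, col = 1, 0
    for ch in sql:
        if ch == "\n":
            line, col = line + 1, 0
            continue
        if ch == "(":
            stack.append((line, col))
        elif ch == ")":
            if not stack:
                return (line, col)  # ')' with no matching '('
            stack.pop()
        col += 1
    return stack[0] if stack else None  # leftover '(' is unmatched

# The example query from the report has one unclosed '(':
print(check_parens("SELECT ((x + y) * z FROM t"))  # -> (1, 7)
```

Note that pointing at the unmatched `(` (position 7) is arguably more helpful than the original error, which points at `FROM` (position 20), the first token the grammar could not continue with.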
[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error
[ https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853250#comment-16853250 ] Steve Loughran commented on SPARK-27098: I don't think we'd silently swallow exceptions during a copy, more likely "we take so long doing it that something times out" Maybe [~gabor.bota] has some suggestions; he's worked on Ceph support through s3a. > Flaky missing file parts when writing to Ceph without error > --- > > Key: SPARK-27098 > URL: https://issues.apache.org/jira/browse/SPARK-27098 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Martin Loncaric >Priority: Major > Attachments: sanitized_stdout_1.txt > > > https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233 > Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint. > occasionally a file part will be missing; i.e. part 3 here: > ``` > > aws s3 ls my-bucket/folder/ > 2019-02-28 13:07:21 0 _SUCCESS > 2019-02-28 13:06:58 79428651 > part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:06:59 79586172 > part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:00 79561910 > part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:01 79192617 > part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:07 79364413 > part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:08 79623254 > part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:10 79445030 > part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:10 79474923 > part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:11 79477310 > part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:12 79331453 > 
part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:13 79567600 > part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:13 79388012 > part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:14 79308387 > part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:15 79455483 > part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:17 79512342 > part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:18 79403307 > part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:18 79617769 > part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:19 79333534 > part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:20 79543324 > part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > ``` > However, the write succeeds and leaves a _SUCCESS file. > This can be caught by additionally checking afterward whether the number of > written file parts agrees with the number of partitions, but Spark should at > least fail on its own and leave a meaningful stack trace in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
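The workaround described in the last paragraph — checking afterward that the number of written part files agrees with the number of partitions — can be sketched as a standalone validation. This is illustrative Python, not Spark code, and the abbreviated file names below are stand-ins for a real object-store listing:

```python
import re

# Illustrative sketch only: after a job reports _SUCCESS, independently
# verify that the part files under the output prefix cover partitions
# 0..n-1, catching the "silently missing part" failure mode above.
PART_RE = re.compile(r"part-(\d+)-")

def missing_parts(filenames, expected_partitions):
    """Return the sorted list of partition indices with no part file."""
    present = set()
    for name in filenames:
        m = PART_RE.search(name)
        if m:
            present.add(int(m.group(1)))
    return sorted(set(range(expected_partitions)) - present)

# With part 3 absent, as in the listing above (assuming 20 partitions;
# the UUID portion of the names is abbreviated here):
listing = ["_SUCCESS"] + [
    f"part-{i:05d}-5789ebf5-c000.snappy.parquet" for i in range(20) if i != 3
]
print(missing_parts(listing, 20))  # -> [3]
```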
[jira] [Created] (SPARK-27902) Improve error message for DESCRIBE statement
Yesheng Ma created SPARK-27902: -- Summary: Improve error message for DESCRIBE statement Key: SPARK-27902 URL: https://issues.apache.org/jira/browse/SPARK-27902 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The {{DESCRIBE}} statement only supports queries such as {{SELECT}}. However, when another kind of statement is used as the clause of {{DESCRIBE}}, the error message is confusing. For example, the error message for {code:sql} DESCRIBE INSERT INTO desc_temp1 values (1, 'val1'); {code} is {code:java} mismatched input 'desc_temp1' expecting {, '.'}(line 1, pos 21)}} {code} which is misleading and makes it hard for end users to figure out the real cause. One possible fix is to explicitly capture this kind of invalid clause and print a user-friendly error message such as {code:java} mismatched insert clause 'INSERT INTO desc_temp1 values (1, 'val1');' expecting normal query clauses. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
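The suggested friendlier diagnostic can be illustrated with a toy pre-check. This is plain Python, not Spark's parser, and the keyword lists are simplified assumptions: if the clause following DESCRIBE begins with a statement keyword rather than a query, report that directly instead of a token-level mismatch:

```python
# Toy illustration only (not Spark's parser): classify the clause that
# follows DESCRIBE and produce a targeted message when it is not a query.
STATEMENT_STARTERS = {"INSERT", "CREATE", "DROP", "ALTER", "UPDATE", "DELETE"}

def describe_error(sql: str):
    """Return a friendly error for 'DESCRIBE <clause>', or None if it looks fine."""
    tokens = sql.strip().rstrip(";").split()
    if not tokens or tokens[0].upper() != "DESCRIBE":
        return None
    head = tokens[1].upper() if len(tokens) > 1 else ""
    if head in STATEMENT_STARTERS:
        clause = " ".join(tokens[1:])
        return f"mismatched statement clause '{clause}': DESCRIBE expects a query"
    return None

print(describe_error("DESCRIBE INSERT INTO desc_temp1 values (1, 'val1');"))
```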
[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21136: Priority: Critical (was: Minor) > Misleading error message for typo in SQL > > > Key: SPARK-21136 > URL: https://issues.apache.org/jira/browse/SPARK-21136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Assignee: Yesheng Ma >Priority: Critical > > {code} > scala> spark.sql("select * from a left joinn b on a.id = b.id").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) > == SQL == > select * from a left joinn b on a.id = b.id > -^^^ > {code} > The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of > the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in > themselves, a misleading error like this can hinder debugging substantially. > I tried to see if maybe I could fix this. Am I correct to deduce that the > error message originates in ANTLR4, which parses the query based on the > syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out > how that syntax definition works, and why it misattributes the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error
[ https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853245#comment-16853245 ] Martin Loncaric commented on SPARK-27098: - After upgrading to Hadoop 2.9 and using {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}, the problem is substantially less frequent, but still present. I think this suggests that moving files sometimes quietly fails.
[jira] [Assigned] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-21136: --- Assignee: Yesheng Ma
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24077: Labels: starter (was: ) > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > --- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21136: Labels: (was: bulk-closed)
[jira] [Reopened] (SPARK-24077) Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-24077: -
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24077: Issue Type: Sub-task (was: Question) Parent: SPARK-27901
[jira] [Updated] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24077: Summary: Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? (was: Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?)
[jira] [Reopened] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-21136: -
[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21136: Issue Type: Sub-task (was: Bug) Parent: SPARK-27901 > Misleading error message for typo in SQL > > > Key: SPARK-21136 > URL: https://issues.apache.org/jira/browse/SPARK-21136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Priority: Minor > Labels: bulk-closed > > {code} > scala> spark.sql("select * from a left joinn b on a.id = b.id").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) > == SQL == > select * from a left joinn b on a.id = b.id > -^^^ > {code} > The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of > the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in > themselves, a misleading error like this can hinder debugging substantially. > I tried to see if maybe I could fix this. Am I correct to deduce that the > error message originates in ANTLR4, which parses the query based on the > syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out > how that syntax definition works, and why it misattributes the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21154) ParseException when Create View from another View in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-21154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853234#comment-16853234 ] Xiao Li commented on SPARK-21154: - This issue should have been resolved in the new version of Spark > ParseException when Create View from another View in Spark SQL > --- > > Key: SPARK-21154 > URL: https://issues.apache.org/jira/browse/SPARK-21154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Brian Zhang >Priority: Major > Labels: bulk-closed > > When creating a View from another existing View in Spark SQL, we will see a > ParseException if the existing View was created with an explicit column alias list. > Here is the detail on how to reproduce it: > *Hive* (I'm using 1.1.0): > hive> *CREATE TABLE my_table (id int, name string);* > OK > Time taken: 0.107 seconds > hive> *CREATE VIEW my_view(view_id,view_name) AS SELECT * FROM my_table;* > OK > Time taken: 0.075 seconds > # View Information > View Original Text: SELECT * FROM my_table > View Expanded Text: SELECT `id` AS `view_id`, `name` AS `view_name` FROM > (SELECT `my_table`.`id`, `my_table`.`name` FROM `default`.`my_table`) > `default.my_view` > Time taken: 0.04 seconds, Fetched: 28 row(s) > *Spark* (Same behavior for spark 2.1.0 and 2.1.1): > scala> *sqlContext.sql("CREATE VIEW my_view_spark AS SELECT * FROM my_view");* > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `view_id`, `gen_attr_1` AS `view_name` FROM (SELECT > `gen_attr_0`, `gen_attr_1` FROM (SELECT `gen_attr_2` AS `gen_attr_0`, > `gen_attr_3` AS `gen_attr_1` FROM (SELECT `gen_attr_2`, `gen_attr_3` FROM > (SELECT `id` AS `gen_attr_2`, `name` AS `gen_attr_3` FROM > `default`.`my_table`) AS gen_subquery_0) AS default.my_view) AS my_view) AS > my_view > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:222) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:176) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) > at org.apache.spark.sql.Dataset.(Dataset.scala:185) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699) > ... 
74 elided > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 62) > == SQL == > SELECT `gen_attr_0` AS `view_id`, `gen_attr_1` AS `view_name` FROM (SELECT > `gen_attr_0`, `gen_attr_1` FROM (SELECT `gen_attr_2` AS `gen_attr_0`, > `gen_attr_3` AS `gen_attr_1` FROM (SELECT `gen_attr_2`, `gen_attr_3` FROM > (SELECT `id` AS `gen_attr_2`, `name` AS `gen_attr_3` FROM > `default`.`my_table`) AS gen_subquery_0) AS default.my_view) AS my_view) AS > my_view > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:219) > ... 90 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
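A possible workaround (an assumption on my part, not verified against every Hive version): define the Hive view without the explicit column-alias list, so Hive's expanded view text does not contain the `default.my_view` alias that Spark's parser rejects:

```sql
-- Hypothetical workaround: alias columns inside the SELECT instead of in the
-- view header, keeping the expanded view text parseable by Spark.
CREATE VIEW my_view AS SELECT id AS view_id, name AS view_name FROM my_table;
```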
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853227#comment-16853227 ] Dongjoon Hyun commented on SPARK-27812: --- What I meant was *upgrading*, [~igor.calabria]. Currently, the latest one is already 4.2.2. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27809) Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement
[ https://issues.apache.org/jira/browse/SPARK-27809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27809: Issue Type: Sub-task (was: Bug) Parent: SPARK-27901 > Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement > -- > > Key: SPARK-27809 > URL: https://issues.apache.org/jira/browse/SPARK-27809 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > Fix For: 3.0.0 > > > Each time, when I write a complex CREATE DATABASE/VIEW statements, I have to > open the .g4 file to find the EXACT order of clauses in CREATE TABLE > statement. When the order is not right, I will get A strange confusing error > message generated from ANTLR4. > The original g4 grammar for CREATE VIEW is > {code:sql} > CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [db_name.]view_name > [(col_name1 [COMMENT col_comment1], ...)] > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > AS select_statement > {code} > The proposal is to make the following clauses order insensitive. > {code:sql} > [COMMENT table_comment] > [TBLPROPERTIES (key1=val1, key2=val2, ...)] > {code} > – > The original g4 grammar for CREATE DATABASE is > {code:sql} > CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name > [COMMENT comment_text] > [LOCATION path] > [WITH DBPROPERTIES (key1=val1, key2=val2, ...)] > {code} > The proposal is to make the following clauses order insensitive. > {code:sql} > [COMMENT comment_text] > [LOCATION path] > [WITH DBPROPERTIES (key1=val1, key2=val2, ...)] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
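To illustrate the pain point, a statement like the following sketch (database name and properties are hypothetical) is rejected by the original grammar solely because LOCATION precedes COMMENT; under the proposal both orders would parse:

```sql
-- Rejected by the original grammar (clauses out of the fixed order);
-- accepted once the optional clauses are order insensitive.
CREATE DATABASE IF NOT EXISTS db1
  LOCATION '/tmp/db1'
  COMMENT 'example database'
  WITH DBPROPERTIES (owner='etl');
```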
[jira] [Updated] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens
[ https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27890: Issue Type: Sub-task (was: Improvement) Parent: SPARK-27901 > Improve SQL parser error message when missing backquotes for identifiers with > hyphens > - > > Key: SPARK-27890 > URL: https://issues.apache.org/jira/browse/SPARK-27890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yesheng Ma >Priority: Major > > Current SQL parser's error message for hyphen-connected identifiers without > surrounding backquotes(e.g. {{hyphen-table}}) is confusing for end users. A > possible approach to tackle this is to explicitly capture these wrong usages > in the SQL parser. In this way, the end users can fix these errors more > quickly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
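For example (hypothetical table name), the unquoted form is lexed as a subtraction of two identifiers and produces the confusing message; backquotes make it a single identifier, which is exactly what the improved error message should point users toward:

```sql
-- Confusing failure: parsed as the expression hyphen minus table.
SELECT * FROM hyphen-table;

-- Works: backquotes make the hyphenated name one identifier.
SELECT * FROM `hyphen-table`;
```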
[jira] [Updated] (SPARK-27901) Improve the error messages of SQL parser
[ https://issues.apache.org/jira/browse/SPARK-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27901: Issue Type: Umbrella (was: Bug) > Improve the error messages of SQL parser > > > Key: SPARK-27901 > URL: https://issues.apache.org/jira/browse/SPARK-27901 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > SQL is one of the most popular APIs for Apache Spark. Our SQL parser is built on > ANTLR4. The error messages generated by ANTLR4 are not always helpful. This > umbrella Jira is to track all the improvements in our parser error handling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27901) Improve the error messages of SQL parser
Xiao Li created SPARK-27901: --- Summary: Improve the error messages of SQL parser Key: SPARK-27901 URL: https://issues.apache.org/jira/browse/SPARK-27901 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li SQL is one of the most popular APIs for Apache Spark. Our SQL parser is built on ANTLR4. The error messages generated by ANTLR4 are not always helpful. This umbrella Jira is to track all the improvements in our parser error handling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
[ https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-27899: --- Assignee: Lantao Jin > Make HiveMetastoreClient.getTableObjectsByName available in > ExternalCatalog/SessionCatalog API > -- > > Key: SPARK-27899 > URL: https://issues.apache.org/jira/browse/SPARK-27899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Juliusz Sompolski >Assignee: Lantao Jin >Priority: Major > > The new Spark ThriftServer SparkGetTablesOperation implemented in > https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata > request for every table. This can get very slow for large schemas (~50ms per > table with an external Hive metastore). > Hive ThriftServer GetTablesOperation uses > HiveMetastoreClient.getTableObjectsByName to get table information in bulk, > but we don't expose that through our APIs that go through Hive -> > HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> > SessionCatalog. > If we added and exposed getTableObjectsByName through our catalog APIs, we > could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
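A sketch of what the exposed bulk API could look like (the method name and signature below are illustrative assumptions, mirroring HiveMetastoreClient.getTableObjectsByName; CatalogTable is a stand-in for Spark's org.apache.spark.sql.catalyst.catalog.CatalogTable):

```scala
// Illustrative sketch only, not Spark's actual API.
final case class CatalogTable(db: String, name: String)

trait ExternalCatalogLike {
  // Existing per-table lookup: one metastore RPC per table
  // (~50ms each against an external Hive metastore, per the report above).
  def getTableMetadata(db: String, table: String): CatalogTable

  // Proposed bulk lookup (hypothetical name/signature): one RPC per batch,
  // so SparkGetTablesOperation can fetch N tables in a single round trip.
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}
```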
[jira] [Commented] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
[ https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853209#comment-16853209 ] Xiao Li commented on SPARK-27899: - Also cc [~cltlfcjin] > Make HiveMetastoreClient.getTableObjectsByName available in > ExternalCatalog/SessionCatalog API > -- > > Key: SPARK-27899 > URL: https://issues.apache.org/jira/browse/SPARK-27899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Juliusz Sompolski >Priority: Major > > The new Spark ThriftServer SparkGetTablesOperation implemented in > https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata > request for every table. This can get very slow for large schemas (~50ms per > table with an external Hive metastore). > Hive ThriftServer GetTablesOperation uses > HiveMetastoreClient.getTableObjectsByName to get table information in bulk, > but we don't expose that through our APIs that go through Hive -> > HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> > SessionCatalog. > If we added and exposed getTableObjectsByName through our catalog APIs, we > could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
[ https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27899: Target Version/s: 3.0.0 > Make HiveMetastoreClient.getTableObjectsByName available in > ExternalCatalog/SessionCatalog API > -- > > Key: SPARK-27899 > URL: https://issues.apache.org/jira/browse/SPARK-27899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Juliusz Sompolski >Priority: Major > > The new Spark ThriftServer SparkGetTablesOperation implemented in > https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata > request for every table. This can get very slow for large schemas (~50ms per > table with an external Hive metastore). > Hive ThriftServer GetTablesOperation uses > HiveMetastoreClient.getTableObjectsByName to get table information in bulk, > but we don't expose that through our APIs that go through Hive -> > HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> > SessionCatalog. > If we added and exposed getTableObjectsByName through our catalog APIs, we > could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853159#comment-16853159 ] Igor Calabria commented on SPARK-27812: --- [~dongjoon] Do you mean downgrading? Because the issue was introduced when kubernetes client was updated. I took a look at both OkHttp and fabric8's kubernetes client code between the upgraded tags and I couldn't find anything obvious that caused this. Maybe the right path for spark is to actually deal with "rogue" user threads on shutdown/exceptions instead of simply relying that they won't be created by libs or user code. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
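To confirm which threads keep the driver JVM alive, a small diagnostic sketch (run it at the end of main or from a shutdown hook) can enumerate live non-daemon threads such as the OkHttp WebSocket thread:

```scala
import scala.collection.JavaConverters._

// Diagnostic sketch: print every live non-daemon thread. After main() returns,
// any thread listed here (other than "main" itself) can keep the JVM alive,
// which is how the OkHttp WebSocket thread blocks driver-pod exit.
object NonDaemonThreadCheck {
  def main(args: Array[String]): Unit = {
    val blockers = Thread.getAllStackTraces.keySet.asScala
      .filter(t => t.isAlive && !t.isDaemon)
    blockers.foreach(t => println(s"non-daemon thread: ${t.getName}"))
  }
}
```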
[jira] [Assigned] (SPARK-27395) Improve EXPLAIN command
[ https://issues.apache.org/jira/browse/SPARK-27395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27395: Assignee: (was: Apache Spark) > Improve EXPLAIN command > --- > > Key: SPARK-27395 > URL: https://issues.apache.org/jira/browse/SPARK-27395 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Currently, when the query is complex or the output schema is long, the > outputs of our EXPLAIN command are not readable. Our documentation does not > explain how to read the plans [e.g., the meaning of each field]. It is > confusing to end users. The current format limits us to add more useful > details to each operator, since it is already very long. We need to reformat > the query plans for better readability. > In this release, we need to improve the usability and documentation of > EXPLAIN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27395) Improve EXPLAIN command
[ https://issues.apache.org/jira/browse/SPARK-27395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27395: Assignee: Apache Spark > Improve EXPLAIN command > --- > > Key: SPARK-27395 > URL: https://issues.apache.org/jira/browse/SPARK-27395 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Currently, when the query is complex or the output schema is long, the > outputs of our EXPLAIN command are not readable. Our documentation does not > explain how to read the plans [e.g., the meaning of each field]. It is > confusing to end users. The current format limits us to add more useful > details to each operator, since it is already very long. We need to reformat > the query plans for better readability. > In this release, we need to improve the usability and documentation of > EXPLAIN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853140#comment-16853140 ] hemshankar sahu commented on SPARK-27891: - Attached log for spark 2.3.1 (spark_2.3.1_failure.log) > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > Attachments: application_1559242207407_0001.log, > spark_2.3.1_failure.log > > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml, the spark job fails. > Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs attached > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu updated SPARK-27891: Attachment: spark_2.3.1_failure.log > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > Attachments: application_1559242207407_0001.log, > spark_2.3.1_failure.log > > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml, the spark job fails. > Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs attached > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26192: -- Fix Version/s: 2.4.4 > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Minor > Fix For: 2.4.4, 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27900) Spark on K8s will not report container failure due to oom
[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27900: Description: A spark pi job is running: spark-pi-driver 1/1 Running 0 1h spark-pi2-1559309337787-exec-1 1/1 Running 0 1h spark-pi2-1559309337787-exec-2 1/1 Running 0 1h with the following setup: {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" kind: SparkApplication metadata: name: spark-pi namespace: spark spec: type: Scala mode: cluster image: "skonto/spark:k8s-3.0.0-sa" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" arguments: - "100" sparkVersion: "2.4.0" restartPolicy: type: Never nodeSelector: "spark": "autotune" driver: memory: "1g" labels: version: 2.4.0 serviceAccount: spark-sa executor: instances: 2 memory: "1g" labels: version: 2.4.0{quote} At some point the driver fails but it is still running and so the pods are still running: 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB) 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB) 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB) 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180 19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks Exception in 
thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached $ kubectl describe pod spark-pi2-driver -n spark Name: spark-pi2-driver Namespace: spark Priority: 0 PriorityClassName: Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44 Start Time: Fri, 31 May 2019 16:28:59 +0300 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661 spark-role=driver sparkoperator.k8s.io/app-name=spark-pi2 sparkoperator.k8s.io/launched-by-spark-operator=true sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526 version=2.4.0 Annotations: Status: Running IP: 10.12.103.4 Controlled By: SparkApplication/spark-pi2 Containers: spark-kubernetes-driver: Container ID: docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f Image: skonto/spark:k8s-3.0.0-sa Image ID: docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9 Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 100 State: Running In the container processes are in _interruptible sleep_: PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar 287 0 185 S 2344 0% 3 0% sh 294 287 185 R 1536 0% 3 0% top 1 0 185 S 776 0% 0 0% /sbin/tini -s – /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope Liveness checks might be a workaround but 
rest apis may be still working if threads in jvm still are running as in this case (I did check the spark ui and it was there). was: {quote}A driver is running {quote} spark-pi-driver 1/1 Running 0 1h spark-pi2-1559309337787-exec-1 1/1 Running 0 1h spark-pi2-1559309337787-exec-2 1/1 Running 0 1h with the following setup: {quote}apiVersion: "sparkoperator.k8s.io/v1beta1" kind: SparkApplication metadata: name: spark-pi namespace: spark spec: type: Scala mode: cluster image: "skonto/spark:k8s-3.0.0-sa" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar" arguments: - "100" sparkVersion: "2.4.0" restartPolicy: type: Never nodeSelector: "spark": "autotune" driver: memory:
[jira] [Created] (SPARK-27900) Spark on K8s will not report container failure due to oom
Stavros Kontopoulos created SPARK-27900:
-------------------------------------------

             Summary: Spark on K8s will not report container failure due to oom
                 Key: SPARK-27900
                 URL: https://issues.apache.org/jira/browse/SPARK-27900
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 2.4.3, 3.0.0
            Reporter: Stavros Kontopoulos

A driver is running:

{quote}
spark-pi-driver                  1/1  Running  0  1h
spark-pi2-1559309337787-exec-1   1/1  Running  0  1h
spark-pi2-1559309337787-exec-2   1/1  Running  0  1h
{quote}

with the following setup:

{quote}
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "100"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0
{quote}

At some point the driver fails, but the process keeps running, and so the pods keep running too:

{quote}
19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
	at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
	at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
	at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
{quote}

{quote}
Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached

$ kubectl describe pod spark-pi2-driver -n spark
Name:               spark-pi2-driver
Namespace:          spark
Priority:           0
PriorityClassName:
Node:               gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
Start Time:         Fri, 31 May 2019 16:28:59 +0300
Labels:             spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
                    spark-role=driver
                    sparkoperator.k8s.io/app-name=spark-pi2
                    sparkoperator.k8s.io/launched-by-spark-operator=true
                    sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
                    version=2.4.0
Annotations:
Status:             Running
IP:                 10.12.103.4
Controlled By:      SparkApplication/spark-pi2
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
    Image:         skonto/spark:k8s-3.0.0-sa
    Image ID:      docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      spark-internal
      100
    State:          Running
{quote}

In the container, processes are in _interruptible sleep_:

{quote}
PID   PPID  USER  STAT  VSZ    %VSZ  CPU  %CPU  COMMAND
 15      1   185  S     2114m    7%    0    0%  /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
287      0   185  S      2344    0%    3    0%  sh
294    287   185  R      1536    0%    3    0%  top
  1      0   185  S       776    0%    0    0%  /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope
{quote}

Liveness checks might be a workaround, but REST APIs may still be working if JVM threads are still running, as in this case (I did check the Spark UI and it was there).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
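One possible mitigation (a configuration sketch built on the standard HotSpot flag, not the fix discussed in this ticket) is to make the driver and executor JVMs exit as soon as an OutOfMemoryError is thrown, so that the container terminates and Kubernetes records the failure instead of leaving the pod in Running state:

```shell
# Sketch only: -XX:+ExitOnOutOfMemoryError is a standard HotSpot flag (JDK 8u92+);
# whether it is appropriate for a given workload is an assumption, not the ticket's fix.
spark-submit \
  --conf spark.driver.extraJavaOptions="-XX:+ExitOnOutOfMemoryError" \
  --conf spark.executor.extraJavaOptions="-XX:+ExitOnOutOfMemoryError" \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar 100
```

With such a flag, the OOM thrown in the dag-scheduler-event-loop thread would kill the JVM rather than leave it in interruptible sleep, which is what keeps the pod alive in the report above.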
[jira] [Assigned] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27873:
------------------------------------

    Assignee: Apache Spark

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-27873
>                 URL: https://issues.apache.org/jira/browse/SPARK-27873
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Marcin Mejran
>            Assignee: Apache Spark
>            Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for
> storing corrupt records, you need to add a new schema column corresponding
> to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs.
> header validation fails because there is an extra column corresponding to
> columnNameOfCorruptRecord.
> Since the FAILFAST mode doesn't print informative error messages about which
> rows failed to parse, there is no other way to track down broken rows without
> setting a corrupt record column.
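A toy, pure-Python reproduction of the mismatch (illustrative names only, not Spark's actual validation code): the schema carries one extra synthetic column, so a naive header check fails unless that column is excluded from the comparison.

```python
def check_header(header_cols, schema_cols, corrupt_col="_corrupt_record"):
    """Compare a CSV header against a schema that may carry a corrupt-record column."""
    # Naive check (the reported failure mode): the schema has one extra column,
    # so comparing it directly against the header always fails.
    naive_ok = header_cols == schema_cols
    # Possible fix: ignore the synthetic corrupt-record column when validating.
    fixed_ok = header_cols == [c for c in schema_cols if c != corrupt_col]
    return naive_ok, fixed_ok

header = ["id", "name"]
schema = ["id", "name", "_corrupt_record"]
print(check_header(header, schema))  # (False, True)
```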
[jira] [Assigned] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27873:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
[ https://issues.apache.org/jira/browse/SPARK-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853083#comment-16853083 ] Juliusz Sompolski commented on SPARK-27899: --- cc [~LI,Xiao], [~yumwang] > Make HiveMetastoreClient.getTableObjectsByName available in > ExternalCatalog/SessionCatalog API > -- > > Key: SPARK-27899 > URL: https://issues.apache.org/jira/browse/SPARK-27899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Juliusz Sompolski >Priority: Major > > The new Spark ThriftServer SparkGetTablesOperation implemented in > https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata > request for every table. This can get very slow for large schemas (~50ms per > table with an external Hive metastore). > Hive ThriftServer GetTablesOperation uses > HiveMetastoreClient.getTableObjectsByName to get table information in bulk, > but we don't expose that through our APIs that go through Hive -> > HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> > SessionCatalog. > If we added and exposed getTableObjectsByName through our catalog APIs, we > could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27899) Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API
Juliusz Sompolski created SPARK-27899: - Summary: Make HiveMetastoreClient.getTableObjectsByName available in ExternalCatalog/SessionCatalog API Key: SPARK-27899 URL: https://issues.apache.org/jira/browse/SPARK-27899 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Juliusz Sompolski The new Spark ThriftServer SparkGetTablesOperation implemented in https://github.com/apache/spark/pull/22794 does a catalog.getTableMetadata request for every table. This can get very slow for large schemas (~50ms per table with an external Hive metastore). Hive ThriftServer GetTablesOperation uses HiveMetastoreClient.getTableObjectsByName to get table information in bulk, but we don't expose that through our APIs that go through Hive -> HiveClientImpl (HiveClient) -> HiveExternalCatalog (ExternalCatalog) -> SessionCatalog. If we added and exposed getTableObjectsByName through our catalog APIs, we could resolve that performance problem in SparkGetTablesOperation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
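As a rough illustration of the report's arithmetic (the ~50 ms figure comes from the description above; the one-round-trip-per-batch bulk model is an assumption about how a getTableObjectsByName-style API would behave):

```python
# Back-of-envelope cost model for table metadata fetching.
PER_TABLE_MS = 50  # ~50 ms per getTableMetadata call (from the report)

def per_table_cost_ms(num_tables):
    # One metastore round trip per table: current SparkGetTablesOperation.
    return num_tables * PER_TABLE_MS

def bulk_cost_ms(num_tables, batch_size=1000):
    # Hypothetical bulk call: one round trip per batch of table names.
    batches = -(-num_tables // batch_size)  # ceiling division
    return batches * PER_TABLE_MS

print(per_table_cost_ms(2000), "ms per-table")  # 100000 ms = 100 s of round trips
print(bulk_cost_ms(2000), "ms bulk")            # 100 ms across 2 bulk requests
```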
[jira] [Closed] (SPARK-26822) Upgrade the deprecated module 'optparse'
[ https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neo Chien closed SPARK-26822.
-----------------------------

> Upgrade the deprecated module 'optparse'
> ----------------------------------------
>
>                 Key: SPARK-26822
>                 URL: https://issues.apache.org/jira/browse/SPARK-26822
>             Project: Spark
>          Issue Type: Task
>          Components: Tests
>    Affects Versions: 2.4.0
>            Reporter: Neo Chien
>            Assignee: Neo Chien
>            Priority: Minor
>              Labels: pull-request-available, test
>             Fix For: 3.0.0
>
> Follow the [official document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
> to upgrade the deprecated module 'optparse' to 'argparse'.
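For reference, the argparse side of the upgrade looks like the following (the option names here are invented for illustration; they are not the ones used in Spark's scripts):

```python
import argparse

# Minimal optparse -> argparse migration sketch: argparse replaces
# OptionParser/add_option with ArgumentParser/add_argument.
parser = argparse.ArgumentParser(description="example upgraded from optparse")
parser.add_argument("-p", "--parallelism", type=int, default=4,
                    help="number of parallel test runners (hypothetical option)")
parser.add_argument("--modules", default="all",
                    help="comma-separated modules to test (hypothetical option)")

args = parser.parse_args(["--parallelism", "8"])
print(args.parallelism, args.modules)  # 8 all
```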
[jira] [Assigned] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters
[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27896:
------------------------------------

    Assignee: Apache Spark  (was: Sean Owen)
[jira] [Assigned] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters
[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27896:
------------------------------------

    Assignee: Sean Owen  (was: Apache Spark)
[jira] [Assigned] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)
[ https://issues.apache.org/jira/browse/SPARK-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27898: Assignee: Apache Spark > Support 4 date operators(date + integer, integer + date, date - integer and > date - date) > > > Key: SPARK-27898 > URL: https://issues.apache.org/jira/browse/SPARK-27898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Support 4 date operators(date + integer, integer + date, date - integer and > date - date): > |Operator|Example|Result| > |+|date '2001-09-28' + integer '7'|date '2001-10-05'| > |-|date '2001-10-01' - integer '7'|date '2001-09-24'| > |-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)| > [https://www.postgresql.org/docs/12/functions-datetime.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)
[ https://issues.apache.org/jira/browse/SPARK-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27898:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)
Yuming Wang created SPARK-27898:
-----------------------------------

             Summary: Support 4 date operators(date + integer, integer + date, date - integer and date - date)
                 Key: SPARK-27898
                 URL: https://issues.apache.org/jira/browse/SPARK-27898
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang

Support 4 date operators (date + integer, integer + date, date - integer and date - date):

|Operator|Example|Result|
|+|date '2001-09-28' + integer '7'|date '2001-10-05'|
|-|date '2001-10-01' - integer '7'|date '2001-09-24'|
|-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)|

[https://www.postgresql.org/docs/12/functions-datetime.html]
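The table's semantics can be checked against plain Python date arithmetic (an illustration of the intended behavior only; the actual Spark SQL work happens in the analyzer and expression layer):

```python
from datetime import date, timedelta

# The four operators from the table, expressed with datetime:
assert date(2001, 9, 28) + timedelta(days=7) == date(2001, 10, 5)   # date + integer
assert timedelta(days=7) + date(2001, 9, 28) == date(2001, 10, 5)   # integer + date
assert date(2001, 10, 1) - timedelta(days=7) == date(2001, 9, 24)   # date - integer
assert (date(2001, 10, 1) - date(2001, 9, 28)).days == 3            # date - date
```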
[jira] [Assigned] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory
[ https://issues.apache.org/jira/browse/SPARK-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27897: Assignee: Apache Spark (was: Thomas Graves) > GPU Scheduling - move example discovery Script to scripts directory > --- > > Key: SPARK-27897 > URL: https://issues.apache.org/jira/browse/SPARK-27897 > Project: Spark > Issue Type: Story > Components: Examples >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Minor > > SPARK-27725 GPU Scheduling - add an example discovery Script added a script > at > [https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh.] > Instead of having it in the resources directory lets move it to the scripts > directory -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory
[ https://issues.apache.org/jira/browse/SPARK-27897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27897:
------------------------------------

    Assignee: Thomas Graves  (was: Apache Spark)
[jira] [Created] (SPARK-27897) GPU Scheduling - move example discovery Script to scripts directory
Thomas Graves created SPARK-27897:
-------------------------------------

             Summary: GPU Scheduling - move example discovery Script to scripts directory
                 Key: SPARK-27897
                 URL: https://issues.apache.org/jira/browse/SPARK-27897
             Project: Spark
          Issue Type: Story
          Components: Examples
    Affects Versions: 3.0.0
            Reporter: Thomas Graves
            Assignee: Thomas Graves

SPARK-27725 (GPU Scheduling - add an example discovery Script) added a script at [https://github.com/apache/spark/blob/master/examples/src/main/resources/getGpusResources.sh]. Instead of having it in the resources directory, let's move it to the scripts directory.
[jira] [Updated] (SPARK-27880) Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY)
[ https://issues.apache.org/jira/browse/SPARK-27880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27880: Description: {code:sql} bool_and/booland_statefunc(expression) -- true if all input values are true, otherwise false {code} {code:sql} bool_or/boolor_statefunc(expression) -- true if at least one input value is true, otherwise false {code} {code:sql} every(expression) -- equivalent to bool_and {code} More details: [https://www.postgresql.org/docs/9.3/functions-aggregate.html] was: {code:sql} bool_and(expression) -- true if all input values are true, otherwise false {code} {code:sql} bool_or(expression) -- true if at least one input value is true, otherwise false {code} {code:sql} every(expression) -- equivalent to bool_and {code} More details: https://www.postgresql.org/docs/9.3/functions-aggregate.html > Implement boolean aggregates(BOOL_AND, BOOL_OR and EVERY) > - > > Key: SPARK-27880 > URL: https://issues.apache.org/jira/browse/SPARK-27880 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > bool_and/booland_statefunc(expression) -- true if all input values are true, > otherwise false > {code} > {code:sql} > bool_or/boolor_statefunc(expression) -- true if at least one input value is > true, otherwise false > {code} > {code:sql} > every(expression) -- equivalent to bool_and > {code} > More details: > [https://www.postgresql.org/docs/9.3/functions-aggregate.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
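The three aggregates above map directly onto Python's built-ins; a quick semantic sketch (SQL null handling is more subtle and deliberately ignored here):

```python
# bool_and: true only if every input value is true.
# bool_or:  true if at least one input value is true.
# every:    an alias for bool_and.
rows = [True, True, False]

bool_and = all(rows)
bool_or = any(rows)
every = all(rows)

print(bool_and, bool_or, every)  # False True False
```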
[jira] [Commented] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters
[ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852962#comment-16852962 ]

Sean Owen commented on SPARK-27896:
-----------------------------------

Copying follow-up from email: Yes, the paper does say the silhouette is 0 in this case. That's an argument to change it. On the other hand, I am not sure I agree with the paper here. If A consists of one point, then that point's assignment is optimal in a sense. Setting the silhouette to 0 indicates that assigning it to B, which is a cluster of more distant points, is just as good. I don't think that makes as much sense as 1, which it returns now. You could argue that the silhouette is specifically penalizing this type of assignment in a way that Euclidean distance does not.

Wikipedia's definition follows the paper: https://en.wikipedia.org/wiki/Silhouette_(clustering)

It looks like sklearn also follows the paper's definition: https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/metrics/cluster/unsupervised.py#L235
[jira] [Created] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters
Sean Owen created SPARK-27896:
---------------------------------

             Summary: Fix definition of clustering silhouette coefficient for 1-element clusters
                 Key: SPARK-27896
                 URL: https://issues.apache.org/jira/browse/SPARK-27896
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Sean Owen
            Assignee: Sean Owen

Reported by Samuel Kubler via email:

In the code https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala, I think there is a small mistake in the class “Silhouette” when you calculate the silhouette coefficient for a point. According to the reference paper, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis” (Peter J. Rousseeuw, 1986), for points that are alone in a cluster it is not the currentClusterDissimilarity that is supposed to be equal to 0, as in your code (“val currentClusterDissimilarity = if (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient itself. Indeed, “When cluster A contains only a single object it is unclear how a(i) should be defined, and then we simply set s(i) equal to zero”.

The problem with defining currentClusterDissimilarity as zero is that you can no longer use the silhouette coefficient as a criterion to determine the optimal number of clusters, because the algorithm will answer that the more clusters you have, the better your clustering is. Indeed, in that case, as the number of clusters increases, s(i) converges toward 1 (so the algorithm appears more efficient). I have, besides, checked this result on my own clustering example.
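The two conventions under discussion can be sketched in a few lines of pure Python (this mirrors the argument above, not Spark's actual ClusteringEvaluator code):

```python
def silhouette_rousseeuw(a, b, singleton):
    # Rousseeuw (1986): for a 1-element cluster, a(i) is undefined,
    # so s(i) is simply defined as 0.
    if singleton:
        return 0.0
    return (b - a) / max(a, b)

def silhouette_reported(a, b, singleton):
    # Behavior the reporter describes: forcing a(i) to 0 for singletons
    # makes s(i) = (b - 0) / max(0, b) = 1, rewarding tiny clusters.
    if singleton:
        a = 0.0
    return (b - a) / max(a, b)

# For any positive mean dissimilarity b to the nearest other cluster:
print(silhouette_rousseeuw(0.0, 4.0, singleton=True))  # 0.0
print(silhouette_reported(0.0, 4.0, singleton=True))   # 1.0
```

This is why the reported definition degenerates as a model-selection criterion: as the number of clusters grows and singletons multiply, the average s(i) drifts toward 1.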
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852874#comment-16852874 ]

Stavros Kontopoulos commented on SPARK-24815:
---------------------------------------------

@[~Karthik Palaniappan] I can help with the design. Btw, there is a refactoring happening here: [https://github.com/apache/spark/pull/24704]

My concern with batch-mode dynamic allocation is that the task list may not tell the whole story: what if the number of tasks stays the same but the load changes per task/partition, e.g. with a Kafka source? As for state, I think you need to rebalance it, which translates to dynamic re-partitioning for the micro-batch mode in structured streaming. For continuous streaming it is harder, I think, but maybe a unified approach could solve it for both micro-batch and continuous streaming, as in Flink: https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html

> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
>                 Key: SPARK-24815
>                 URL: https://issues.apache.org/jira/browse/SPARK-24815
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: Karthik Palaniappan
>            Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing
> containers to match the actual workload. On multi-tenant clusters, it ensures
> that a Spark job is taking no more resources than necessary. In cloud
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured
> streaming job, the batch dynamic allocation algorithm kicks in. It requests
> more executors if the task backlog is a certain size, and removes executors
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a
> particular implementation in SparkContext.scala (this should be a separate
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the
> batch algorithm. Eventually, continuous processing might need its own
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when
> Core's dynamic allocation is enabled
[jira] [Commented] (SPARK-27791) Support SQL year-month INTERVAL type
[ https://issues.apache.org/jira/browse/SPARK-27791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852870#comment-16852870 ] Zhu, Lipeng commented on SPARK-27791: - [~yumwang] I think it should be {code:sql} select current_date - interval '1-1' year_month {code} > Support SQL year-month INTERVAL type > > > Key: SPARK-27791 > URL: https://issues.apache.org/jira/browse/SPARK-27791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > The INTERVAL type must conform to SQL year-month INTERVAL type, has 2 logical > types: > # YEAR - Unconstrained except by the leading field precision > # MONTH - Months (within years) (0-11) > And support arithmetic operations involving values of type datetime or > interval. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27700) SparkSubmit closes with SocketTimeoutException in kubernetes mode.
[ https://issues.apache.org/jira/browse/SPARK-27700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udbhav Agrawal resolved SPARK-27700.
------------------------------------

    Resolution: Duplicate

> SparkSubmit closes with SocketTimeoutException in kubernetes mode.
> ------------------------------------------------------------------
>
>                 Key: SPARK-27700
>                 URL: https://issues.apache.org/jira/browse/SPARK-27700
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.0
>            Reporter: Udbhav Agrawal
>            Priority: Major
>         Attachments: socket timeout
[jira] [Commented] (SPARK-27791) Support SQL year-month INTERVAL type
[ https://issues.apache.org/jira/browse/SPARK-27791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852761#comment-16852761 ] Yuming Wang commented on SPARK-27791: - Hi, [~maxgekk] Is this ticket to cover these 2 cases? {code:sql} SELECT interval '1' year to month; SELECT interval '1-2' year to month; {code} [https://github.com/postgres/postgres/blob/df1a699e5ba3232f373790b2c9485ddf720c4a70/src/test/regress/sql/interval.sql#L180-L181] > Support SQL year-month INTERVAL type > > > Key: SPARK-27791 > URL: https://issues.apache.org/jira/browse/SPARK-27791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > The INTERVAL type must conform to SQL year-month INTERVAL type, has 2 logical > types: > # YEAR - Unconstrained except by the leading field precision > # MONTH - Months (within years) (0-11) > And support arithmetic operations involving values of type datetime or > interval. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
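A toy sketch of the year-month interval semantics under discussion (pure Python with hypothetical helper names; day-of-month clamping for short target months is deliberately ignored):

```python
from datetime import date

def parse_year_month(s):
    # "1-2" (one year, two months) -> 14 months.
    years, months = s.split("-")
    return int(years) * 12 + int(months)

def add_months(d, months):
    # Shift a date by whole months; assumes the day exists in the target month.
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)

# interval '1-2' year to month added to a date:
print(add_months(date(2019, 1, 15), parse_year_month("1-2")))   # 2020-03-15
# current_date - interval '1-1' year_month, as in the comment above:
print(add_months(date(2019, 5, 15), -parse_year_month("1-1")))  # 2018-04-15
```

The key point the logical types capture is that a year-month interval is a single month count, with the MONTH field constrained to 0-11 after carrying into YEAR.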
[jira] [Created] (SPARK-27895) Spark streaming - RDD filter is always refreshing providing updated filtered items
Ilias Karalis created SPARK-27895:
-------------------------------------

             Summary: Spark streaming - RDD filter is always refreshing providing updated filtered items
                 Key: SPARK-27895
                 URL: https://issues.apache.org/jira/browse/SPARK-27895
             Project: Spark
          Issue Type: Bug
          Components: DStreams
    Affects Versions: 2.4.3, 2.4.2, 2.4.0
         Environment: IntelliJ, running locally on a Windows 10 laptop.
            Reporter: Ilias Karalis

Spark streaming: 2.4.x
Scala: 2.11.11

In foreachRDD of a DStream, if a filter is used on the RDD, the filter keeps refreshing and produces new filtered results continuously until a new batch is processed; the same then happens for the new batch. With the same code, if we do rdd.collect() and then run the filter on the collection, we get results just once, and they remain stable until a new batch comes in.

The filter function is based on random probability (reservoir sampling):

{code}
val toSampleRDD: RDD[(Long, Long)] = inputRdd.filter(x => chooseX(x))

def chooseX(x: (Long, Long)): Boolean = {
  val r = scala.util.Random
  val p = r.nextFloat()
  edgeTotalCounter += 1
  if (p < (sampleLength.toFloat / edgeTotalCounter.toFloat)) {
    edgeLocalRDDCounter += 1
    println("Edge " + x + " has been selected and is number: " + edgeLocalRDDCounter + ".")
    true
  } else false
}
{code}

edgeLocalRDDCounter counts the edges selected from inputRdd. Strangely, the counter first increases from 1 to y, then the filter unexpectedly runs again and the counter increases from y+1 to z. After that, each time the filter unexpectedly re-runs, it produces results for which the counter starts from y+1. Each run of the filter produces different results and filters a different number of edges. toSampleRDD always changes according to the newly produced results. When a new batch comes in, the same behavior starts over for the new batch.
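The reported behavior is consistent with lazy evaluation rather than a bug: an RDD's filter predicate re-runs on every action, and this predicate is both random and stateful. A pure-Python sketch of the same trap (no Spark involved; names loosely mirror the report):

```python
import random

data = list(range(100))     # stand-in for the input RDD's elements
sample_length = 10
edge_total_counter = 0      # global state mutated by the predicate

def choose_x(x):
    # Random, stateful predicate like chooseX above (reservoir-style accept).
    global edge_total_counter
    edge_total_counter += 1
    return random.random() < sample_length / edge_total_counter

def to_sample():
    # Rebuilt on every call, like a lazy RDD recomputed for each action.
    return [x for x in data if choose_x(x)]

first = to_sample()
second = to_sample()
# Two "actions" re-ran the predicate over all 100 elements each time,
# so the global counter kept growing, and the two samples can differ.
print(edge_total_counter)  # 200
# Materializing once (like rdd.collect() or caching) freezes the sample:
frozen = to_sample()
print(frozen == frozen)  # True: re-reading a list does not re-run the filter
```

The remedy is the same in Spark as here: either make the predicate pure (no randomness or shared counters), or materialize/cache the filtered result once before reusing it.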