[jira] [Assigned] (SPARK-32281) Spark wipes out SORTED spec in metastore when DESC is used

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32281:


Assignee: Apache Spark

> Spark wipes out SORTED spec in metastore when DESC is used
> --
>
> Key: SPARK-32281
> URL: https://issues.apache.org/jira/browse/SPARK-32281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> When altering a Hive bucketed table or updating its statistics, Spark will 
> wipe out the SORTED specification in the metastore if the specification uses 
> DESC.
>  For example:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, 
> d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.045 seconds)
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | SORTED BY (|
> |   c ASC,   |
> |   d DESC)  |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'transient_lastDdlTime'='1594488043')|
> ++
> 21 rows selected (0.042 seconds)
> 0: jdbc:hive2://localhost:1> 
> -
> -
> -
> scala> // in spark
> scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
> 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> res0: org.apache.spark.sql.DataFrame = []
> scala> 
> -
> -
> -
> 0: jdbc:hive2://localhost:1> -- back in beeline
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'foo'='bar', |
> |   'spark.sql.partitionProvider'='catalog', |
> |   'transient_lastDdlTime'='1594488098')|
> ++
> 20 rows selected (0.038 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> Note that the SORTED specification disappears.
> Another example, this time using insert:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, 
> d int) clustered by (c) 

[jira] [Assigned] (SPARK-32281) Spark wipes out SORTED spec in metastore when DESC is used

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32281:


Assignee: (was: Apache Spark)

> Spark wipes out SORTED spec in metastore when DESC is used
> --
>
> Key: SPARK-32281
> URL: https://issues.apache.org/jira/browse/SPARK-32281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When altering a Hive bucketed table or updating its statistics, Spark will 
> wipe out the SORTED specification in the metastore if the specification uses 
> DESC.
>  For example:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, 
> d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.045 seconds)
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | SORTED BY (|
> |   c ASC,   |
> |   d DESC)  |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'transient_lastDdlTime'='1594488043')|
> ++
> 21 rows selected (0.042 seconds)
> 0: jdbc:hive2://localhost:1> 
> -
> -
> -
> scala> // in spark
> scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
> 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> res0: org.apache.spark.sql.DataFrame = []
> scala> 
> -
> -
> -
> 0: jdbc:hive2://localhost:1> -- back in beeline
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'foo'='bar', |
> |   'spark.sql.partitionProvider'='catalog', |
> |   'transient_lastDdlTime'='1594488098')|
> ++
> 20 rows selected (0.038 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> Note that the SORTED specification disappears.
> Another example, this time using insert:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, 
> d int) clustered by (c) sorted by (c asc, d desc)

[jira] [Commented] (SPARK-32281) Spark wipes out SORTED spec in metastore when DESC is used

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212156#comment-17212156
 ] 

Apache Spark commented on SPARK-32281:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30011

> Spark wipes out SORTED spec in metastore when DESC is used
> --
>
> Key: SPARK-32281
> URL: https://issues.apache.org/jira/browse/SPARK-32281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When altering a Hive bucketed table or updating its statistics, Spark will 
> wipe out the SORTED specification in the metastore if the specification uses 
> DESC.
>  For example:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hive2://localhost:1> create table bucketed (a int, b int, c int, 
> d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.045 seconds)
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | SORTED BY (|
> |   c ASC,   |
> |   d DESC)  |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'transient_lastDdlTime'='1594488043')|
> ++
> 21 rows selected (0.042 seconds)
> 0: jdbc:hive2://localhost:1> 
> -
> -
> -
> scala> // in spark
> scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
> 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> res0: org.apache.spark.sql.DataFrame = []
> scala> 
> -
> -
> -
> 0: jdbc:hive2://localhost:1> -- back in beeline
> 0: jdbc:hive2://localhost:1> show create table bucketed;
> ++
> |   createtab_stmt   |
> ++
> | CREATE TABLE `bucketed`(   |
> |   `a` int, |
> |   `b` int, |
> |   `c` int, |
> |   `d` int) |
> | CLUSTERED BY ( |
> |   c)   |
> | INTO 10 BUCKETS|
> | ROW FORMAT SERDE   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT  |
> |   'org.apache.hadoop.mapred.TextInputFormat'   |
> | OUTPUTFORMAT   |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION   |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (|
> |   'foo'='bar', |
> |   'spark.sql.partitionProvider'='catalog', |
> |   'transient_lastDdlTime'='1594488098')|
> ++
> 20 rows selected (0.038 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> Note that the SORTED specification disappears.
> Another example, this time using insert:
> {noformat}
> 0: jdbc:hive2://localhost:1> -- in beeline
> 0: jdbc:hi
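
As an aside (not part of the report above): one way to see what Spark itself 
records for the table is to read the catalog entry from the spark-shell. A 
minimal sketch, assuming the bucketed table from the transcript exists; note 
that Spark's BucketSpec keeps only bucket and sort column names, with no 
ASC/DESC direction, which is presumably why the DESC part cannot round-trip.
{code:java}
import org.apache.spark.sql.catalyst.TableIdentifier

// Spark's view of the Hive table; bucketSpec carries only the number of
// buckets plus the bucket and sort column *names* (no sort direction).
val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("bucketed"))
meta.bucketSpec.foreach(println)   // e.g. 10 buckets, bucket columns: [c], sort columns: [c, d]
{code}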

[jira] [Commented] (SPARK-33117) Update zstd-jni to 1.4.5-6

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212135#comment-17212135
 ] 

Apache Spark commented on SPARK-33117:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30010

> Update zstd-jni to 1.4.5-6
> --
>
> Key: SPARK-33117
> URL: https://issues.apache.org/jira/browse/SPARK-33117
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33117) Update zstd-jni to 1.4.5-6

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33117:


Assignee: Apache Spark

> Update zstd-jni to 1.4.5-6
> --
>
> Key: SPARK-33117
> URL: https://issues.apache.org/jira/browse/SPARK-33117
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33117) Update zstd-jni to 1.4.5-6

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33117:


Assignee: (was: Apache Spark)

> Update zstd-jni to 1.4.5-6
> --
>
> Key: SPARK-33117
> URL: https://issues.apache.org/jira/browse/SPARK-33117
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33013) The constrains may grow exponentially in sql optimizer 'InferFiltersFromConstraints', which leads to driver oom

2020-10-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212131#comment-17212131
 ] 

Takeshi Yamamuro commented on SPARK-33013:
--

Yea, that's a known issue, so I will close this. Thanks!

> The constrains may grow exponentially in sql optimizer 
> 'InferFiltersFromConstraints', which leads to driver oom
> ---
>
> Key: SPARK-33013
> URL: https://issues.apache.org/jira/browse/SPARK-33013
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: zhou xiang
>Priority: Major
>
>  
>  
>  Consider the case below:
> {code:java}
> Seq((1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)).toDF("a", "b", "c", 
> "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", 
> "s", "t").write.saveAsTable("test") 
> val df = spark.table("test") 
> val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+q+r+s+t > 100") 
> val df3 = df2.select('a as 'a1, 'b as 'b1, 'c as 'c1, 'd as 'd1, 'e as 'e1, 
> 'f as 'f1, 'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1, 'k as 'k1, 'l as 'l1, 
> 'm as 'm1, 'n as 'n1, 'o as 'o1, 'p as 'p1, 'q as 'q1, 'r as 'r1, 's as 's1, 
> 't as 't1) 
> val df4 = df3.join(df2, df3("a1") === df2("a")) 
> df4.explain(true)
> {code}
> If you run this in the Spark shell, it gets stuck at "df4.explain(true)". 
> The reason is that the SQL optimizer rule 'InferFiltersFromConstraints' tries 
> to infer all the constraints from the plan. The plan has a constraint that 
> involves about 20 columns, and each column has an alias. The rule replaces 
> each column with its alias while also keeping the original constraint, which 
> makes the constraints grow exponentially and eventually causes the driver to OOM.
> The related code:
> {code:java}
>   /**
>* Generates all valid constraints including an set of aliased constraints 
> by replacing the
>* original constraint expressions with the corresponding alias
>*/
>   protected def getAllValidConstraints(projectList: Seq[NamedExpression]): 
> Set[Expression] = {
> var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
> projectList.foreach {
>   case a @ Alias(l: Literal, _) =>
> allConstraints += EqualNullSafe(a.toAttribute, l)
>   case a @ Alias(e, _) =>
> // For every alias in `projectList`, replace the reference in 
> constraints by its attribute.
> allConstraints ++= allConstraints.map(_ transform {
>   case expr: Expression if expr.semanticEquals(e) =>
> a.toAttribute
> })
> allConstraints += EqualNullSafe(e, a.toAttribute)
>   case _ => // Don't change.
> }
> allConstraints
>   }
> {code}
>  
>  
>  
>   
>   
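
To make the growth concrete (an illustration added here, not part of the 
report): each alias pass keeps both the original constraint and the rewritten 
copy, so one constraint that mentions all n aliased columns can fan out into 
roughly 2^n variants.
{code:java}
// Toy model of the fan-out in getAllValidConstraints: for each of the n aliases
// the rule keeps every existing constraint *and* an alias-substituted copy, so
// a constraint referencing all n aliased columns roughly doubles n times.
val n = 20
val variants = BigInt(2).pow(n)
println(s"up to $variants constraint variants from a single constraint")  // 1048576
{code}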



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33013) The constrains may grow exponentially in sql optimizer 'InferFiltersFromConstraints', which leads to driver oom

2020-10-11 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33013.
--
Resolution: Not A Problem

> The constrains may grow exponentially in sql optimizer 
> 'InferFiltersFromConstraints', which leads to driver oom
> ---
>
> Key: SPARK-33013
> URL: https://issues.apache.org/jira/browse/SPARK-33013
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: zhou xiang
>Priority: Major
>
>  
>  
>  Consider the case below:
> {code:java}
> Seq((1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)).toDF("a", "b", "c", 
> "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", 
> "s", "t").write.saveAsTable("test") 
> val df = spark.table("test") 
> val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+q+r+s+t > 100") 
> val df3 = df2.select('a as 'a1, 'b as 'b1, 'c as 'c1, 'd as 'd1, 'e as 'e1, 
> 'f as 'f1, 'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1, 'k as 'k1, 'l as 'l1, 
> 'm as 'm1, 'n as 'n1, 'o as 'o1, 'p as 'p1, 'q as 'q1, 'r as 'r1, 's as 's1, 
> 't as 't1) 
> val df4 = df3.join(df2, df3("a1") === df2("a")) 
> df4.explain(true)
> {code}
> If you run this in the Spark shell, it gets stuck at "df4.explain(true)". 
> The reason is that the SQL optimizer rule 'InferFiltersFromConstraints' tries 
> to infer all the constraints from the plan. The plan has a constraint that 
> involves about 20 columns, and each column has an alias. The rule replaces 
> each column with its alias while also keeping the original constraint, which 
> makes the constraints grow exponentially and eventually causes the driver to OOM.
> The related code:
> {code:java}
>   /**
>* Generates all valid constraints including an set of aliased constraints 
> by replacing the
>* original constraint expressions with the corresponding alias
>*/
>   protected def getAllValidConstraints(projectList: Seq[NamedExpression]): 
> Set[Expression] = {
> var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
> projectList.foreach {
>   case a @ Alias(l: Literal, _) =>
> allConstraints += EqualNullSafe(a.toAttribute, l)
>   case a @ Alias(e, _) =>
> // For every alias in `projectList`, replace the reference in 
> constraints by its attribute.
> allConstraints ++= allConstraints.map(_ transform {
>   case expr: Expression if expr.semanticEquals(e) =>
> a.toAttribute
> })
> allConstraints += EqualNullSafe(e, a.toAttribute)
>   case _ => // Don't change.
> }
> allConstraints
>   }
> {code}
>  
>  
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33116.
--
Resolution: Invalid

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col; the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns listed after the 
> PARTITION BY clause? I do not think there is such a limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212130#comment-17212130
 ] 

Takeshi Yamamuro commented on SPARK-33116:
--

In SQL, the two queries are different, I think. The statement below is cited 
from the PostgreSQL doc:
{code:java}
since there is no ORDER BY in the OVER clause, the window frame is the same as 
the partition, which for lack of PARTITION BY is the whole table; in other 
words each sum is taken over the whole table and so we get the same result for 
each output row. But if we add an ORDER BY clause, we get very different 
results:...
{code}
[https://www.postgresql.org/docs/current/tutorial-window.html]
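
A short illustration of that point (added here, not in the original report): 
with an ORDER BY in the OVER clause the default frame is RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW, so the values are running aggregates; 
spelling out an unbounded frame gives the whole-partition numbers again. A 
spark-shell sketch, assuming the product_catalog table from the description 
below exists:
{code:java}
// Default frame with ORDER BY: running aggregate up to the current row.
spark.sql("""
  SELECT category, price,
         sum(price) OVER (PARTITION BY category ORDER BY price) AS running_sum
  FROM product_catalog
""").show()

// Explicit unbounded frame: whole-partition aggregate even with ORDER BY.
spark.sql("""
  SELECT category, price,
         sum(price) OVER (PARTITION BY category ORDER BY price
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_p
  FROM product_catalog
""").show()
{code}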

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col; the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns listed after the 
> PARTITION BY clause? I do not think there is such a limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33117) Update zstd-jni to 1.4.5-6

2020-10-11 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33117:
-

 Summary: Update zstd-jni to 1.4.5-6
 Key: SPARK-33117
 URL: https://issues.apache.org/jira/browse/SPARK-33117
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32907) adaptively blockify instances

2020-10-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-32907:
--

Assignee: zhengruifeng

> adaptively blockify instances
> -
>
> Key: SPARK-32907
> URL: https://issues.apache.org/jira/browse/SPARK-32907
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Attachments: blockify_svc_perf_20201010.xlsx
>
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz of a block, so it is reasonable to control the block 
> size.
>  
> I had some offline discussion with [~weichenxu123], and we think the following 
> changes are worthwhile (the first point is sketched after this list):
> 1. infer an appropriate blockSize (MB) based on numFeatures and nnz by 
> default;
> 2. implementations should use a relatively small memory footprint when 
> processing one block and should not use a large pre-allocated buffer, so we 
> need to revert GMM;
> 3. use the new blockify strategy in LinearSVC/LoR/LiR/AFT;
>  
>  
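
As a purely illustrative sketch of point 1 (the helper name and the heuristic 
below are hypothetical, not the actual implementation from the linked work): 
pick the number of rows per block so that the estimated size of one block 
stays near a small target.
{code:java}
// Hypothetical heuristic: estimate the per-row cost from numFeatures and the
// average nnz, then size a block to stay near a target of `blockSizeMB`.
def rowsPerBlock(numFeatures: Long, avgNnz: Double, blockSizeMB: Double = 1.0): Long = {
  // ~12 bytes per stored non-zero (value + index) for sparse rows,
  // ~8 bytes per feature for dense rows; take the cheaper representation.
  val bytesPerRow = math.min(12.0 * avgNnz, 8.0 * numFeatures)
  math.max(1L, (blockSizeMB * 1024 * 1024 / bytesPerRow).toLong)
}

// e.g. 1M features with ~100 non-zeros per row: many rows still fit in a 1 MB block.
println(rowsPerBlock(numFeatures = 1000000L, avgNnz = 100.0))
{code}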



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2020-10-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27872:
--
Fix Version/s: 2.4.8

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 2.4.8, 3.0.0
>
>
> The driver and executors use different service accounts when the driver has 
> one set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
>  Executors will not use the driver's service account and so will not be able 
> to get the secret needed to pull the related image. 
> I am not sure what the assumption is for using the default account for 
> executors; probably it is because that account is limited (executors don't 
> create resources, after all). This is an inconsistency that could be worked 
> around with the pod template feature in Spark 3.0.0, but it breaks pull 
> secrets, and in general I think it is a bug to have it. 
>  
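
A hedged sketch of the pod-template workaround mentioned above; the file path 
is a placeholder, and the template itself would carry the desired 
serviceAccountName/imagePullSecrets for the executor pods. In cluster mode the 
setting would normally be passed via --conf on spark-submit rather than set in 
code.
{code:java}
import org.apache.spark.sql.SparkSession

// Point executors at a pod template so they pick up the same service account
// and pull secrets as the driver (placeholder path, illustration only).
val spark = SparkSession.builder()
  .config("spark.kubernetes.executor.podTemplateFile", "/path/to/executor-pod-template.yaml")
  .getOrCreate()
{code}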



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32047) Add provider disable possibility just like in delegation token provider

2020-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32047.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29964
[https://github.com/apache/spark/pull/29964]

> Add provider disable possibility just like in delegation token provider
> ---
>
> Key: SPARK-32047
> URL: https://issues.apache.org/jira/browse/SPARK-32047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> There is an enable flag in the delegation token provider area, 
> "spark.security.credentials.%s.enabled".
> It would be good to add something similar to the JDBC secure connection 
> provider area, because this would make the embedded providers interchangeable 
> (an embedded provider can be turned off and another provider with a different 
> name can be registered). This makes sense only if we create an API for the 
> secure JDBC connection provider.
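
For illustration only: the delegation-token flag below follows the existing 
pattern quoted above, while the JDBC-provider flag next to it is hypothetical 
(the final name was left to the implementation).
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Existing pattern: disable a built-in delegation token provider by name.
  .config("spark.security.credentials.hive.enabled", "false")
  // Hypothetical analogous flag for a built-in JDBC connection provider.
  .config("spark.sql.jdbc.connectionProvider.basic.enabled", "false")
  .getOrCreate()
{code}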



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32047) Add provider disable possibility just like in delegation token provider

2020-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32047:


Assignee: Gabor Somogyi

> Add provider disable possibility just like in delegation token provider
> ---
>
> Key: SPARK-32047
> URL: https://issues.apache.org/jira/browse/SPARK-32047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> There is an enable flag in the delegation token provider area, 
> "spark.security.credentials.%s.enabled".
> It would be good to add something similar to the JDBC secure connection 
> provider area, because this would make the embedded providers interchangeable 
> (an embedded provider can be turned off and another provider with a different 
> name can be registered). This makes sense only if we create an API for the 
> secure JDBC connection provider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running

2020-10-11 Thread Jim Kleckner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212105#comment-17212105
 ] 

Jim Kleckner commented on SPARK-24266:
--

I believe that the PR is ready to merge to the 3.0 branch for a target of 3.0.2:

https://github.com/apache/spark/pull/29533

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du updated SPARK-33116:

Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col; the result is NOT correct. The 
min result is OK. Why are the other results like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)

Does this mean that we can only ORDER BY the columns listed after the 
PARTITION BY clause? I do not think there is such a limitation in standard SQL.


  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p 

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du updated SPARK-33116:

Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)

Does it seem that we can only order by the columns after partition by clause?
I do not think there are such limitation in standard SQL.


  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du updated SPARK-33116:

Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)

  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct.
SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
|

[jira] [Created] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)
Will Du created SPARK-33116:
---

 Summary: Spark SQL window function with order by cause result 
incorrect
 Key: SPARK-33116
 URL: https://issues.apache.org/jira/browse/SPARK-33116
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Will Du


Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct.
SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32907) adaptively blockify instances

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212092#comment-17212092
 ] 

Apache Spark commented on SPARK-32907:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/30009

> adaptively blockify instances
> -
>
> Key: SPARK-32907
> URL: https://issues.apache.org/jira/browse/SPARK-32907
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: blockify_svc_perf_20201010.xlsx
>
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz of a block, so it is reasonable to control the block 
> size.
>  
> I had some offline discussion with [~weichenxu123], and we think the following 
> changes are worthwhile:
> 1. infer an appropriate blockSize (MB) based on numFeatures and nnz by 
> default;
> 2. implementations should use a relatively small memory footprint when 
> processing one block and should not use a large pre-allocated buffer, so we 
> need to revert GMM;
> 3. use the new blockify strategy in LinearSVC/LoR/LiR/AFT;
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) Migrate build to sbt 1.3.13

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212091#comment-17212091
 ] 

Apache Spark commented on SPARK-21708:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30008

>  Migrate build to sbt 1.3.13
> 
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: PJ Fanning
>Assignee: Denis Pyshev
>Priority: Major
> Fix For: 3.1.0
>
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) Migrate build to sbt 1.3.13

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212090#comment-17212090
 ] 

Apache Spark commented on SPARK-21708:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30008

>  Migrate build to sbt 1.3.13
> 
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: PJ Fanning
>Assignee: Denis Pyshev
>Priority: Major
> Fix For: 3.1.0
>
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33013) The constraints may grow exponentially in the SQL optimizer 'InferFiltersFromConstraints', which leads to driver OOM

2020-10-11 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212085#comment-17212085
 ] 

Yang Jie commented on SPARK-33013:
--

It appears to be a known issue; you can try setting 
"spark.sql.constraintPropagation.enabled=false" to avoid it.

> The constraints may grow exponentially in the SQL optimizer 
> 'InferFiltersFromConstraints', which leads to driver OOM
> ---
>
> Key: SPARK-33013
> URL: https://issues.apache.org/jira/browse/SPARK-33013
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: zhou xiang
>Priority: Major
>
>  
>  
>  Consider the case below:
> {code:java}
> Seq((1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)).toDF("a", "b", "c", 
> "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", 
> "s", "t").write.saveAsTable("test") 
> val df = spark.table("test") 
> val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+q+r+s+t > 100") 
> val df3 = df2.select('a as 'a1, 'b as 'b1, 'c as 'c1, 'd as 'd1, 'e as 'e1, 
> 'f as 'f1, 'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1, 'k as 'k1, 'l as 'l1, 
> 'm as 'm1, 'n as 'n1, 'o as 'o1, 'p as 'p1, 'q as 'q1, 'r as 'r1, 's as 's1, 
> 't as 't1) 
> val df4 = df3.join(df2, df3("a1") === df2("a")) 
> df4.explain(true)
> {code}
> If you run this in the spark shell, it will get stuck at "df4.explain(true)". 
> The reason is that in the SQL optimizer rule 'InferFiltersFromConstraints', Spark tries to 
> infer all the constraints from the plan. The plan has a constraint involving 
> about 20 columns, and each column has an alias. The rule replaces each column 
> with its alias while also keeping the original constraint, which makes the 
> constraints grow exponentially and eventually makes the driver OOM.
> The related code:
> {code:java}
>   /**
>* Generates all valid constraints including an set of aliased constraints 
> by replacing the
>* original constraint expressions with the corresponding alias
>*/
>   protected def getAllValidConstraints(projectList: Seq[NamedExpression]): 
> Set[Expression] = {
> var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
> projectList.foreach {
>   case a @ Alias(l: Literal, _) =>
> allConstraints += EqualNullSafe(a.toAttribute, l)
>   case a @ Alias(e, _) =>
> // For every alias in `projectList`, replace the reference in 
> constraints by its attribute.
> allConstraints ++= allConstraints.map(_ transform {
>   case expr: Expression if expr.semanticEquals(e) =>
> a.toAttribute
> })
> allConstraints += EqualNullSafe(e, a.toAttribute)
>   case _ => // Don't change.
> }
> allConstraints
>   }
> {code}
>  
>  
>  
>   
>   
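
As a back-of-the-envelope illustration of the blow-up described above (a toy sketch, not Spark code): keeping every original constraint while also adding an aliased copy doubles the constraint set once per aliased column, so a single constraint over 20 aliased columns can expand to 2^20 variants.

{code:scala}
// Toy model of the doubling: one symbolic constraint over n columns, each column aliased once.
// n = 10 already yields 1024 variants; the 20-column case from the report gives 2^20.
val cols = (1 to 10).map(i => f"c$i%02d")
var constraints = Set(cols.mkString(" + ") + " > 100")
cols.foreach { c =>
  // keep every existing constraint and add a copy with this column replaced by its alias
  constraints = constraints ++ constraints.map(_.replace(c, c + "_alias"))
}
println(constraints.size)  // 1024 (= 2^10)
{code}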



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2020-10-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212078#comment-17212078
 ] 

Hyukjin Kwon commented on SPARK-33044:
--

We need confirmation from [~shaneknapp] and [~LuciferYang].

> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> The {{Master}} branch seems to be almost ready for Scala 2.13 now; we need a 
> Jenkins test job to verify the current work results and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2020-10-11 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212077#comment-17212077
 ] 

Yang Jie commented on SPARK-33044:
--

[~dongjoon], has this issue been completed?

> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> The {{Master}} branch seems to be almost ready for Scala 2.13 now; we need a 
> Jenkins test job to verify the current work results and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33106) Fix sbt resolvers clash

2020-10-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33106:
-

Assignee: Denis Pyshev

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Denis Pyshev
>Priority: Minor
>
> During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as-is.
> That leads to local resolver names clashing, which shows up as a warning 
> from sbt:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.
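
A hedged sketch of one way to address this in build.sbt (the resolver name and path below are illustrative, not necessarily the actual fix): either drop the duplicate entry from `resolvers`, or give the publishing resolver a name other than "local" so it no longer clashes with the default project resolver of the same name.

{code:scala}
// build.sbt -- illustrative only: rename the publishing resolver away from "local"
publishTo := Some(MavenCache("local-publish", file(sys.props("user.home") + "/.m2/local-publish")))
{code}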



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33106) Fix sbt resolvers clash

2020-10-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33106.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30006
[https://github.com/apache/spark/pull/30006]

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Denis Pyshev
>Priority: Minor
> Fix For: 3.1.0
>
>
> During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as-is.
> That leads to local resolver names clashing, which shows up as a warning 
> from sbt:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212067#comment-17212067
 ] 

Apache Spark commented on SPARK-33115:
--

User 'gemelen' has created a pull request for this issue:
https://github.com/apache/spark/pull/30007

> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]    ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]   ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: bad use of '>' 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used
> [error]   
>  ^
> {code}
> {code:java}
>  sbt:spark-parent> unsafe/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
>   error: malformed HTML 
> [error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
> [error]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33115:


Assignee: Apache Spark

> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Apache Spark
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]    ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]   ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: bad use of '>' 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used
> [error]   
>  ^
> {code}
> {code:java}
>  sbt:spark-parent> unsafe/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
>   error: malformed HTML 
> [error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
> [error]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33115:


Assignee: (was: Apache Spark)

> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]    ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]   ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: bad use of '>' 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used
> [error]   
>  ^
> {code}
> {code:java}
>  sbt:spark-parent> unsafe/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
>   error: malformed HTML 
> [error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
> [error]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212066#comment-17212066
 ] 

Apache Spark commented on SPARK-33115:
--

User 'gemelen' has created a pull request for this issue:
https://github.com/apache/spark/pull/30007

> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]    ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]   ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: bad use of '>' 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used
> [error]   
>  ^
> {code}
> {code:java}
>  sbt:spark-parent> unsafe/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
>   error: malformed HTML 
> [error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
> [error]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33106) Fix sbt resolvers clash

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33106:


Assignee: (was: Apache Spark)

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as-is.
> That leads to local resolver names clashing, which shows up as a warning 
> from sbt:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33106) Fix sbt resolvers clash

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212059#comment-17212059
 ] 

Apache Spark commented on SPARK-33106:
--

User 'gemelen' has created a pull request for this issue:
https://github.com/apache/spark/pull/30006

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as-is.
> That leads to local resolver names clashing, which shows up as a warning 
> from sbt:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33106) Fix sbt resolvers clash

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212058#comment-17212058
 ] 

Apache Spark commented on SPARK-33106:
--

User 'gemelen' has created a pull request for this issue:
https://github.com/apache/spark/pull/30006

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as-is.
> That leads to local resolver names clashing, which shows up as a warning 
> from sbt:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33106) Fix sbt resolvers clash

2020-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33106:


Assignee: Apache Spark

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Apache Spark
>Priority: Minor
>
> During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as-is.
> That leads to local resolver names clashing, which shows up as a warning 
> from sbt:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Denis Pyshev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Pyshev updated SPARK-33115:
-
Description: 
`build/sbt publishLocal` task fails in two modules:
{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used 
[error]    ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used 
[error]   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used
[error] 
   ^
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML 
[error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
[error]
{code}

  was:
`build/sbt publishLocal` task fails in two modules:
{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]    
^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]    
   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]  
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML 
[error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
[error]
{code}


> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used 
> [error]    ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boole

[jira] [Updated] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Denis Pyshev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Pyshev updated SPARK-33115:
-
Description: 
`build/sbt publishLocal` task fails in two modules:
{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]    
^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]    
   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]  
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML 
[error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
[error]
{code}

  was:
`build/sbt publishLocal` task fails in two modules:
{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error] 
   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]    
   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]  
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML 
[error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
[error]
{code}


> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used [error]   
>  ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used [error]   
> 

[jira] [Updated] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Denis Pyshev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Pyshev updated SPARK-33115:
-
Description: 
`build/sbt publishLocal` task fails in two modules:
{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error] 
   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]    
   ^ 
[error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' 
[error]    * An alias class for the type "ConcurrentHashMap, 
Boolean>", which is used [error]  
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc 
[info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... 
[error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML 
[error]    * Trims whitespaces (<= ASCII 32) from both ends of this string. 
[error]
{code}

  was:
`build/sbt publishLocal` task fails in two modules:


{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc [info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... [error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error] 
   ^ [error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error] 
  ^ [error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error]  
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc [info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... [error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML [error]    * Trims whitespaces (<= ASCII 32) from both 
ends of this string. [error]
{code}


> `kvstore` and `unsafe` doc tasks fail
> -
>
> Key: SPARK-33115
> URL: https://issues.apache.org/jira/browse/SPARK-33115
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> `build/sbt publishLocal` task fails in two modules:
> {code:java}
> [error] stack trace is suppressed; run last kvstore / Compile / doc for the 
> full output
> [error] stack trace is suppressed; run last unsafe / Compile / doc for the 
> full output
> {code}
> {code:java}
>  sbt:spark-parent> kvstore/Compile/doc 
> [info] Main Java API documentation to 
> /home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: malformed HTML [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used [error]   
>  ^ 
> [error] 
> /home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
>   error: unknown tag: Object 
> [error]    * An alias class for the type 
> "ConcurrentHashMap, Boolean>", which is used [error]   
>    

[jira] [Created] (SPARK-33115) `kvstore` and `unsafe` doc tasks fail

2020-10-11 Thread Denis Pyshev (Jira)
Denis Pyshev created SPARK-33115:


 Summary: `kvstore` and `unsafe` doc tasks fail
 Key: SPARK-33115
 URL: https://issues.apache.org/jira/browse/SPARK-33115
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 3.1.0
Reporter: Denis Pyshev


`build/sbt publishLocal` task fails in two modules:


{code:java}
[error] stack trace is suppressed; run last kvstore / Compile / doc for the 
full output
[error] stack trace is suppressed; run last unsafe / Compile / doc for the full 
output
{code}
{code:java}
 sbt:spark-parent> kvstore/Compile/doc [info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/kvstore/target/scala-2.12/api... [error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: malformed HTML [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error] 
   ^ [error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: unknown tag: Object [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error] 
  ^ [error] 
/home/gemelen/work/src/spark/common/kvstore/src/main/java/org/apache/spark/util/kvstore/InMemoryStore.java:167:1:
  error: bad use of '>' [error]    * An alias class for the type 
"ConcurrentHashMap, Boolean>", which is used [error]  
{code}
{code:java}
 sbt:spark-parent> unsafe/Compile/doc [info] Main Java API documentation to 
/home/gemelen/work/src/spark/common/unsafe/target/scala-2.12/api... [error] 
/home/gemelen/work/src/spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:566:1:
  error: malformed HTML [error]    * Trims whitespaces (<= ASCII 32) from both 
ends of this string. [error]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled stringsAsFactors=TRUE

2020-10-11 Thread Jacek Pliszka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Pliszka updated SPARK-33113:
--
Summary: [SparkR] gapply works with arrow disabled, fails with arrow 
enabled stringsAsFactors=TRUE  (was: [SparkR] gapply works with arrow disabled, 
fails with arrow enabled)

> [SparkR] gapply works with arrow disabled, fails with arrow enabled 
> stringsAsFactors=TRUE
> -
>
> Key: SPARK-33113
> URL: https://issues.apache.org/jira/browse/SPARK-33113
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Jacek Pliszka
>Priority: Major
>
> Running in databricks on Azure
> library("arrow")
>  library("SparkR")
> df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
>  udf <- function(key, x) data.frame(out=c("dfs"))
>  
> This works:
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
>  df1 <- gapply(df, c("ColumnA"), udf, "out String")
>  collect(df1)
> This fails:
> sparkR.session(master = "local[*]", 
> sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
>  df2 <- gapply(df, c("ColumnA"), udf, "out String")
>  collect(df2)
>  
> with error
>  \{{ Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : 
> }}Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 
> 'n' argument
>  Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 
> 'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or 
> 'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead.
>   
>  Clicking through Failed Stages to Failure Reason:
>   
>  Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, 
> most recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, 
> executor 0): java.lang.UnsupportedOperationException
>  at 
> org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
>  at 
> org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
>  at 
> org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>  at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>  at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>  at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
>  at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1429)
>  at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
>  at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
>  at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
>  at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
>  at org.apache.spark.scheduler.Task.run(Task.scala:117)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWor

[jira] [Updated] (SPARK-33113) [SparkR] gapply works with arrow disabled, fails with arrow enabled

2020-10-11 Thread Jacek Pliszka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Pliszka updated SPARK-33113:
--
Description: 
Running in databricks on Azure

library("arrow")
 library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
 udf <- function(key, x) data.frame(out=c("dfs"))

 

This works:

sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
 df1 <- gapply(df, c("ColumnA"), udf, "out String")
 collect(df1)

This fails:

sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
 df2 <- gapply(df, c("ColumnA"), udf, "out String")
 collect(df2)

 

with error
 \{{ Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : 
}}Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 
'n' argument
 Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 
'n' argument In addition: Warning messages: 1: Use 'read_ipc_stream' or 
'read_feather' instead. 2: Use 'read_ipc_stream' or 'read_feather' instead.
  
 Clicking through Failed Stages to Failure Reason:
  
 Job aborted due to stage failure: Task 49 in stage 1843.0 failed 4 times, most 
recent failure: Lost task 49.3 in stage 1843.0 (TID 89810, 10.99.0.5, executor 
0): java.lang.UnsupportedOperationException
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
 at 
org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
 at 
org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.$anonfun$next$1(ArrowConverters.scala:131)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:140)
 at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.next(ArrowConverters.scala:115)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
 at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
 at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
 at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
 at scala.collection.AbstractIterator.to(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
 at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
 at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToR$3(Dataset.scala:3589)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
 at org.apache.spark.scheduler.Task.run(Task.scala:117)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  
  

 

 

 

 

  was:
Running in databricks on Azure

library("arrow")
library("SparkR")

df <- as.DataFrame(list("A", "B", "C"), schema="ColumnA")
udf <- function(key, x) data.frame(out=c("dfs"), stringAsFactors=FALSE)

 

This works:

sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "false"))
df1 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df1)

This fails:

sparkR.session(master = "local[*]", 
sparkConfig=list(spark.sql.execution.arrow.sparkr.enabled = "true"))
df2 <- gapply(df, c("ColumnA"), udf, "out String")
collect(df2)

 

with error
{{ 

[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-10-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211987#comment-17211987
 ] 

Dongjoon Hyun commented on SPARK-27733:
---

HIVE-21737 seems to be blocked by AVRO-2817 .

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related 
> transitive dependencies bring in older versions of Avro, so this 
> remains blocked until HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-10-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211984#comment-17211984
 ] 

Dongjoon Hyun commented on SPARK-27733:
---

Ya, it's removed for all of these and more. :)

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related 
> transitive dependencies bring in older versions of Avro, so this 
> remains blocked until HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-10-11 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211955#comment-17211955
 ] 

t oo commented on SPARK-27733:
--

Hive 1 is gone, I think.

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related 
> transitive dependencies bring in older versions of Avro, so this 
> remains blocked until HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33114) Add metadata in MapStatus to support custom shuffle manager

2020-10-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33114:

Target Version/s:   (was: 3.0.1)

> Add metadata in MapStatus to support custom shuffle manager
> ---
>
> Key: SPARK-33114
> URL: https://issues.apache.org/jira/browse/SPARK-33114
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1
>Reporter: BoYang
>Priority: Major
>
> The current MapStatus class is tightly bound to the local (sort-merge) shuffle, 
> which uses BlockManagerId to store the shuffle data location. It cannot 
> support other custom shuffle manager implementations.
> We could add "metadata" to MapStatus and allow different shuffle manager 
> implementations to store information relevant to them.
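
Purely as a hypothetical sketch of the shape such a change could take (none of these names are existing Spark APIs):

{code:scala}
import org.apache.spark.storage.BlockManagerId

// Illustrative only: a MapStatus-like interface that carries opaque,
// shuffle-manager-defined metadata alongside the usual location, so a custom
// ShuffleManager can record its own addressing information per map output.
trait MapStatusWithMetadata {
  def location: BlockManagerId        // as today: where the map output lives
  def metadata: Option[Array[Byte]]   // proposed: bytes only the custom shuffle manager interprets
}
{code}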



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33107) Remove hive-2.3 workaround code

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211869#comment-17211869
 ] 

Apache Spark commented on SPARK-33107:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30005

> Remove hive-2.3 workaround code
> ---
>
> Key: SPARK-33107
> URL: https://issues.apache.org/jira/browse/SPARK-33107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> We can make the code clearer and more readable after SPARK-33082.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33107) Remove hive-2.3 workaround code

2020-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211868#comment-17211868
 ] 

Apache Spark commented on SPARK-33107:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30005

> Remove hive-2.3 workaround code
> ---
>
> Key: SPARK-33107
> URL: https://issues.apache.org/jira/browse/SPARK-33107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> We can make the code clearer and more readable after SPARK-33082.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org