[jira] [Commented] (SPARK-35589) BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354822#comment-17354822
 ] 

Apache Spark commented on SPARK-35589:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32727

> BlockManagerMasterEndpoint should not ignore index-only shuffle file during 
> updating
> 
>
> Key: SPARK-35589
> URL: https://issues.apache.org/jira/browse/SPARK-35589
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-35589) BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35589:


Assignee: Apache Spark

> BlockManagerMasterEndpoint should not ignore index-only shuffle file during 
> updating
> 
>
> Key: SPARK-35589
> URL: https://issues.apache.org/jira/browse/SPARK-35589
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-35589) BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354821#comment-17354821
 ] 

Apache Spark commented on SPARK-35589:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32727

> BlockManagerMasterEndpoint should not ignore index-only shuffle file during 
> updating
> 
>
> Key: SPARK-35589
> URL: https://issues.apache.org/jira/browse/SPARK-35589
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-35589) BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35589:


Assignee: (was: Apache Spark)

> BlockManagerMasterEndpoint should not ignore index-only shuffle file during 
> updating
> 
>
> Key: SPARK-35589
> URL: https://issues.apache.org/jira/browse/SPARK-35589
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-35587) Initial porting of Koalas documentation

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354820#comment-17354820
 ] 

Apache Spark commented on SPARK-35587:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32726

> Initial porting of Koalas documentation
> ---
>
> Key: SPARK-35587
> URL: https://issues.apache.org/jira/browse/SPARK-35587
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims at the initial porting of the Koalas documentation.






[jira] [Assigned] (SPARK-35587) Initial porting of Koalas documentation

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35587:


Assignee: (was: Apache Spark)

> Initial porting of Koalas documentation
> ---
>
> Key: SPARK-35587
> URL: https://issues.apache.org/jira/browse/SPARK-35587
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims at the initial porting of the Koalas documentation.






[jira] [Assigned] (SPARK-35587) Initial porting of Koalas documentation

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35587:


Assignee: Apache Spark

> Initial porting of Koalas documentation
> ---
>
> Key: SPARK-35587
> URL: https://issues.apache.org/jira/browse/SPARK-35587
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims at the initial porting of the Koalas documentation.






[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354818#comment-17354818
 ] 

Apache Spark commented on SPARK-33933:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32725

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Yu Zhong
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in normal 
> queries, as shown below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works, but 
> since the data to broadcast is very small, that should not be necessary.
> After investigation, the root cause appears to be the following: when AQE is 
> enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, creates 
> query stages for the materialized parts via createQueryStages, and 
> materializes those newly created query stages to submit map stages or 
> broadcasts. When a ShuffleQueryStage is materialized before a 
> BroadcastQueryStage, the map job and the broadcast job are submitted at almost 
> the same time, but the map job holds all the computing resources. If the map 
> job runs slowly (lots of data to process with limited resources), the 
> broadcast job cannot be started (and finished) before 
> spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in 
> SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  
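As a stopgap (not a fix for the scheduling behavior described above), the timeout
can be raised or broadcast joins disabled entirely. A minimal sketch, reusing the
`spark` session from the reproduction above:

{code:java}
// Workaround sketch only: raise the broadcast timeout (default is 300 seconds)
// so the broadcast job can still start after a long map stage...
spark.conf.set("spark.sql.broadcastTimeout", "1200")
// ...or fall back to shuffle joins entirely:
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}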






[jira] [Created] (SPARK-35589) BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating

2021-05-31 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35589:
-

 Summary: BlockManagerMasterEndpoint should not ignore index-only 
shuffle file during updating
 Key: SPARK-35589
 URL: https://issues.apache.org/jira/browse/SPARK-35589
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-35588) Merge Binder integration and quickstart notebook.

2021-05-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35588:


 Summary: Merge Binder integration and quickstart notebook.
 Key: SPARK-35588
 URL: https://issues.apache.org/jira/browse/SPARK-35588
 Project: Spark
  Issue Type: Sub-task
  Components: docs, PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


We should merge:

https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb






[jira] [Assigned] (SPARK-35585) Support propagate empty relation through project/filter

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35585:


Assignee: Apache Spark

> Support propagate empty relation through project/filter
> ---
>
> Key: SPARK-35585
> URL: https://issues.apache.org/jira/browse/SPARK-35585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> Support propagating an empty local relation through Project and Filter, e.g. 
> for a plan shape like the following:
> {code:java}
> Aggregate
>   Project
> Join
>   ShuffleStage
>   ShuffleStage
> {code}
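A minimal, self-contained sketch of the idea (hypothetical types, not Spark's
actual optimizer rule): once a child is known to be an empty relation, a Project
or Filter above it can itself be replaced by an empty relation with the same
output.

{code:java}
// Hypothetical mini plan model, used only to illustrate the propagation rule.
sealed trait Plan { def output: Seq[String]; def isEmpty: Boolean = false }
case class EmptyRelation(output: Seq[String]) extends Plan { override def isEmpty = true }
case class Project(exprs: Seq[String], child: Plan) extends Plan { def output = exprs }
case class Filter(condition: String, child: Plan) extends Plan { def output = child.output }
case class Scan(name: String, output: Seq[String]) extends Plan

def propagateEmpty(plan: Plan): Plan = plan match {
  case Project(exprs, child) =>
    val c = propagateEmpty(child)
    if (c.isEmpty) EmptyRelation(exprs) else Project(exprs, c)
  case Filter(cond, child) =>
    val c = propagateEmpty(child)
    if (c.isEmpty) EmptyRelation(c.output) else Filter(cond, c)
  case other => other
}

// propagateEmpty(Filter("a > 1", Project(Seq("a"), EmptyRelation(Seq("a", "b")))))
//   returns EmptyRelation(Seq("a"))
{code}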






[jira] [Commented] (SPARK-35585) Support propagate empty relation through project/filter

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354815#comment-17354815
 ] 

Apache Spark commented on SPARK-35585:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32724

> Support propagate empty relation through project/filter
> ---
>
> Key: SPARK-35585
> URL: https://issues.apache.org/jira/browse/SPARK-35585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Minor
>
> Support propagating an empty local relation through Project and Filter, e.g. 
> for a plan shape like the following:
> {code:java}
> Aggregate
>   Project
> Join
>   ShuffleStage
>   ShuffleStage
> {code}






[jira] [Assigned] (SPARK-35585) Support propagate empty relation through project/filter

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35585:


Assignee: (was: Apache Spark)

> Support propagate empty relation through project/filter
> ---
>
> Key: SPARK-35585
> URL: https://issues.apache.org/jira/browse/SPARK-35585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Minor
>
> Support propagating an empty local relation through Project and Filter, e.g. 
> for a plan shape like the following:
> {code:java}
> Aggregate
>   Project
> Join
>   ShuffleStage
>   ShuffleStage
> {code}






[jira] [Commented] (SPARK-35585) Support propagate empty relation through project/filter

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354816#comment-17354816
 ] 

Apache Spark commented on SPARK-35585:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32724

> Support propagate empty relation through project/filter
> ---
>
> Key: SPARK-35585
> URL: https://issues.apache.org/jira/browse/SPARK-35585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> Support propagating an empty local relation through Project and Filter, e.g. 
> for a plan shape like the following:
> {code:java}
> Aggregate
>   Project
> Join
>   ShuffleStage
>   ShuffleStage
> {code}






[jira] [Commented] (SPARK-35423) The output of PCA is inconsistent

2021-05-31 Thread shahid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354810#comment-17354810
 ] 

shahid commented on SPARK-35423:


I would like to analyse this issue

> The output of PCA is inconsistent
> -
>
> Key: SPARK-35423
> URL: https://issues.apache.org/jira/browse/SPARK-35423
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.1
> Environment: Spark Version: 3.1.1 
>Reporter: cqfrog
>Priority: Major
>
> 1. The example from the documentation
>  
> {code:java}
> import org.apache.spark.ml.feature.PCA
> import org.apache.spark.ml.linalg.Vectors
> val data = Array(
>   Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
>   Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
>   Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
> )
> val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(3)
>   .fit(df)
> val result = pca.transform(df).select("pcaFeatures")
> result.show(false)
> {code}
>  
>  
> the output shows:
> {code:java}
> +---+
> |pcaFeatures|
> +---+
> |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
> |[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
> |[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
> +---+
> {code}
> 2. Change the vector format
> I modified the code from "Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))" to 
> "Vectors.dense(0.0,1.0,0.0,7.0,0.0)", but the output shows:
> {code:java}
> ++
> |pcaFeatures |
> ++
> |[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
> |[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
> |[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
> ++
> {code}
> It's strange that the two outputs are inconsistent. Why?
> Thanks.
>  
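For reference, a quick sketch (using the same API as the example above) confirms
that the sparse and dense literals describe the same vector, which suggests the
difference comes from how the sparse and dense inputs are processed rather than
from the data itself:

{code:java}
import org.apache.spark.ml.linalg.Vectors

// The two input forms from the report above are element-wise identical.
val sparse = Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))
val dense  = Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0)
println(sparse.toArray.sameElements(dense.toArray))  // true
{code}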






[jira] [Updated] (SPARK-35579) Fix a bug in janino or work around it in Spark.

2021-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-35579:

Priority: Blocker  (was: Critical)

> Fix a bug in janino or work around it in Spark.
> ---
>
> Key: SPARK-35579
> URL: https://issues.apache.org/jira/browse/SPARK-35579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> See the test in SPARK-35578






[jira] [Commented] (SPARK-35583) Move JDBC data source options from Python and Scala into a single page

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354804#comment-17354804
 ] 

Apache Spark commented on SPARK-35583:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32723

> Move JDBC data source options from Python and Scala into a single page
> --
>
> Key: SPARK-35583
> URL: https://issues.apache.org/jira/browse/SPARK-35583
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491
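For context, these are the kinds of options the consolidated page would cover; a
minimal, illustrative Scala example (the URL, table name, and credentials below
are placeholders, not values from this issue):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("jdbc-options").getOrCreate()

// Common JDBC data source options: url, dbtable (or query), user, password.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")  // placeholder URL
  .option("dbtable", "public.my_table")                    // placeholder table
  .option("user", "username")
  .option("password", "password")
  .load()
{code}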






[jira] [Assigned] (SPARK-35583) Move JDBC data source options from Python and Scala into a single page

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35583:


Assignee: (was: Apache Spark)

> Move JDBC data source options from Python and Scala into a single page
> --
>
> Key: SPARK-35583
> URL: https://issues.apache.org/jira/browse/SPARK-35583
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Assigned] (SPARK-35583) Move JDBC data source options from Python and Scala into a single page

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35583:


Assignee: Apache Spark

> Move JDBC data source options from Python and Scala into a single page
> --
>
> Key: SPARK-35583
> URL: https://issues.apache.org/jira/browse/SPARK-35583
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Commented] (SPARK-35583) Move JDBC data source options from Python and Scala into a single page

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354803#comment-17354803
 ] 

Apache Spark commented on SPARK-35583:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32723

> Move JDBC data source options from Python and Scala into a single page
> --
>
> Key: SPARK-35583
> URL: https://issues.apache.org/jira/browse/SPARK-35583
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Created] (SPARK-35587) Initial porting of Koalas documentation

2021-05-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35587:


 Summary: Initial porting of Koalas documentation
 Key: SPARK-35587
 URL: https://issues.apache.org/jira/browse/SPARK-35587
 Project: Spark
  Issue Type: Sub-task
  Components: docs, PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


This JIRA aims at the initial porting of the Koalas documentation.






[jira] [Updated] (SPARK-34885) Port/integrate Koalas documentation into PySpark

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34885:
-
Target Version/s: 3.2.0

> Port/integrate Koalas documentation into PySpark
> 
>
> Key: SPARK-34885
> URL: https://issues.apache.org/jira/browse/SPARK-34885
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port [Koalas 
> documentation|https://koalas.readthedocs.io/en/latest/index.html] 
> appropriately to [PySpark 
> documentation|https://spark.apache.org/docs/latest/api/python/index.html].






[jira] [Updated] (SPARK-34885) Port/integrate Koalas documentation into PySpark

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34885:
-
Parent: (was: SPARK-34849)
Issue Type: Improvement  (was: Sub-task)

> Port/integrate Koalas documentation into PySpark
> 
>
> Key: SPARK-34885
> URL: https://issues.apache.org/jira/browse/SPARK-34885
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port [Koalas 
> documentation|https://koalas.readthedocs.io/en/latest/index.html] 
> appropriately to [PySpark 
> documentation|https://spark.apache.org/docs/latest/api/python/index.html].






[jira] [Assigned] (SPARK-35586) Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35586:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for 
> Kubernetes integration tests
> --
>
> Key: SPARK-35586
> URL: https://issues.apache.org/jira/browse/SPARK-35586
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In kubernetes/integration-tests/pom.xml, there is no default value for 
> spark.kubernetes.test.sparkTgz so running tests with the following command 
> will fail.
> {code}
> build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
> -Pkubernetes-integration-tests -Psparkr  -pl 
> resource-managers/kubernetes/integration-tests integration-test
> {code}
> {code}
> + mkdir -p 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> + tar -xzvf --test-exclude-tags --strip-components=1 -C 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> tar (child): --test-exclude-tags: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> [ERROR] Command execution failed.
> {code}






[jira] [Assigned] (SPARK-35586) Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35586:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for 
> Kubernetes integration tests
> --
>
> Key: SPARK-35586
> URL: https://issues.apache.org/jira/browse/SPARK-35586
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In kubernetes/integration-tests/pom.xml, there is no default value for 
> spark.kubernetes.test.sparkTgz so running tests with the following command 
> will fail.
> {code}
> build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
> -Pkubernetes-integration-tests -Psparkr  -pl 
> resource-managers/kubernetes/integration-tests integration-test
> {code}
> {code}
> + mkdir -p 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> + tar -xzvf --test-exclude-tags --strip-components=1 -C 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> tar (child): --test-exclude-tags: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> [ERROR] Command execution failed.
> {code}






[jira] [Commented] (SPARK-35586) Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354798#comment-17354798
 ] 

Apache Spark commented on SPARK-35586:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32722

> Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for 
> Kubernetes integration tests
> --
>
> Key: SPARK-35586
> URL: https://issues.apache.org/jira/browse/SPARK-35586
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In kubernetes/integration-tests/pom.xml, there is no default value for 
> spark.kubernetes.test.sparkTgz so running tests with the following command 
> will fail.
> {code}
> build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
> -Pkubernetes-integration-tests -Psparkr  -pl 
> resource-managers/kubernetes/integration-tests integration-test
> {code}
> {code}
> + mkdir -p 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> + tar -xzvf --test-exclude-tags --strip-components=1 -C 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> tar (child): --test-exclude-tags: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> [ERROR] Command execution failed.
> {code}






[jira] [Updated] (SPARK-35586) Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests

2021-05-31 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35586:
---
Description: 
In kubernetes/integration-tests/pom.xml, there is no default value for 
spark.kubernetes.test.sparkTgz so running tests with the following command will 
fail.

{code}
build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
-Pkubernetes-integration-tests -Psparkr  -pl 
resource-managers/kubernetes/integration-tests integration-test
{code}
{code}
+ mkdir -p 
/home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
+ tar -xzvf --test-exclude-tags --strip-components=1 -C 
/home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
tar (child): --test-exclude-tags: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
[ERROR] Command execution failed.
{code}


  was:
In kubernetes/integration-tests/pom.xml, there are no default value for 
spark.kubernetes.test.sparkTgz so running tests with the following command will 
fail.

{code}
build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
-Pkubernetes-integration-tests -Psparkr  -pl 
resource-managers/kubernetes/integration-tests integration-test
{code}
{code}
+ mkdir -p 
/home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
+ tar -xzvf --test-exclude-tags --strip-components=1 -C 
/home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
tar (child): --test-exclude-tags: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
[ERROR] Command execution failed.
{code}



> Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for 
> Kubernetes integration tests
> --
>
> Key: SPARK-35586
> URL: https://issues.apache.org/jira/browse/SPARK-35586
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In kubernetes/integration-tests/pom.xml, there is no default value for 
> spark.kubernetes.test.sparkTgz so running tests with the following command 
> will fail.
> {code}
> build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
> -Pkubernetes-integration-tests -Psparkr  -pl 
> resource-managers/kubernetes/integration-tests integration-test
> {code}
> {code}
> + mkdir -p 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> + tar -xzvf --test-exclude-tags --strip-components=1 -C 
> /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
> tar (child): --test-exclude-tags: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> [ERROR] Command execution failed.
> {code}






[jira] [Created] (SPARK-35586) Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests

2021-05-31 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35586:
--

 Summary: Set a default value for spark.kubernetes.test.sparkTgz in 
pom.xml for Kubernetes integration tests
 Key: SPARK-35586
 URL: https://issues.apache.org/jira/browse/SPARK-35586
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In kubernetes/integration-tests/pom.xml, there are no default value for 
spark.kubernetes.test.sparkTgz so running tests with the following command will 
fail.

{code}
build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes 
-Pkubernetes-integration-tests -Psparkr  -pl 
resource-managers/kubernetes/integration-tests integration-test
{code}
{code}
+ mkdir -p 
/home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
+ tar -xzvf --test-exclude-tags --strip-components=1 -C 
/home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
tar (child): --test-exclude-tags: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
[ERROR] Command execution failed.
{code}







[jira] [Assigned] (SPARK-35077) Migrate to transformWithPruning for leftover optimizer rules

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35077:


Assignee: Apache Spark

> Migrate to transformWithPruning for leftover optimizer rules
> 
>
> Key: SPARK-35077
> URL: https://issues.apache.org/jira/browse/SPARK-35077
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> E.g., PushDownPredicates and a few others.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example
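For readers unfamiliar with the pruning idea behind these rules, here is a
minimal, self-contained sketch (hypothetical types, not Spark's actual TreeNode
API): each node caches which patterns occur in its subtree, so a transform can
skip subtrees that cannot possibly match.

{code:java}
// Hypothetical mini tree, used only to illustrate pattern-pruned transforms.
sealed trait Node {
  def children: Seq[Node]
  def pattern: String
  // Patterns appearing anywhere in this subtree, computed once per node.
  lazy val subtreePatterns: Set[String] = children.flatMap(_.subtreePatterns).toSet + pattern
}
case class Filter(condition: String, child: Node) extends Node {
  def children = Seq(child); def pattern = "FILTER"
}
case class Scan(table: String) extends Node {
  def children = Nil; def pattern = "SCAN"
}

def transformWithPruning(node: Node, required: String)(rule: PartialFunction[Node, Node]): Node = {
  if (!node.subtreePatterns.contains(required)) return node  // prune: rule cannot apply below
  val withNewChildren = node match {
    case f: Filter => f.copy(child = transformWithPruning(f.child, required)(rule))
    case s: Scan   => s
  }
  rule.applyOrElse(withNewChildren, identity[Node])
}

// e.g. drop trivial filters only where a FILTER pattern is present in the subtree:
// transformWithPruning(Filter("1=1", Scan("t")), "FILTER") { case Filter("1=1", c) => c }
{code}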






[jira] [Commented] (SPARK-35077) Migrate to transformWithPruning for leftover optimizer rules

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354795#comment-17354795
 ] 

Apache Spark commented on SPARK-35077:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32721

> Migrate to transformWithPruning for leftover optimizer rules
> 
>
> Key: SPARK-35077
> URL: https://issues.apache.org/jira/browse/SPARK-35077
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> E.g., PushDownPredicates and a few others.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Assigned] (SPARK-35077) Migrate to transformWithPruning for leftover optimizer rules

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35077:


Assignee: (was: Apache Spark)

> Migrate to transformWithPruning for leftover optimizer rules
> 
>
> Key: SPARK-35077
> URL: https://issues.apache.org/jira/browse/SPARK-35077
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> E.g., PushDownPredicates and a few others.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Assigned] (SPARK-35077) Migrate to transformWithPruning for leftover optimizer rules

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35077:


Assignee: Apache Spark

> Migrate to transformWithPruning for leftover optimizer rules
> 
>
> Key: SPARK-35077
> URL: https://issues.apache.org/jira/browse/SPARK-35077
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> E.g., PushDownPredicates and a few others.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Created] (SPARK-35585) Support propagate empty relation through project/filter

2021-05-31 Thread XiDuo You (Jira)
XiDuo You created SPARK-35585:
-

 Summary: Support propagate empty relation through project/filter
 Key: SPARK-35585
 URL: https://issues.apache.org/jira/browse/SPARK-35585
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: XiDuo You


Support propagating an empty local relation through Project and Filter, e.g. for 
a plan shape like the following:
{code:java}
Aggregate
  Project
Join
  ShuffleStage
  ShuffleStage
{code}







[jira] [Commented] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354783#comment-17354783
 ] 

Apache Spark commented on SPARK-35576:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32720

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 
> 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.
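A minimal sketch of the kind of redaction intended (illustrative only, not
Spark's internal implementation): mask the value whenever the key matches a
sensitive-looking pattern before the rows are returned. The regex and mask
string below are assumptions for illustration.

{code:java}
// Illustrative redaction helper; regex and mask string are assumptions.
val sensitiveKey = "(?i)password|secret|token".r

def redact(key: String, value: String): (String, String) =
  if (sensitiveKey.findFirstIn(key).isDefined) (key, "*********(redacted)") else (key, value)

// redact("javax.jdo.option.ConnectionPassword", "123456")
//   => ("javax.jdo.option.ConnectionPassword", "*********(redacted)")
{code}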






[jira] [Commented] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354782#comment-17354782
 ] 

Apache Spark commented on SPARK-35576:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32720

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 
> 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.






[jira] [Assigned] (SPARK-35544) Add tree pattern pruning into Analyzer rules

2021-05-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-35544:
--

Assignee: Yingyi Bu

> Add tree pattern pruning into Analyzer rules
> 
>
> Key: SPARK-35544
> URL: https://issues.apache.org/jira/browse/SPARK-35544
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
>
> Analyzer rules have rule ID pruning, but do not have tree pattern pruning yet.






[jira] [Resolved] (SPARK-35544) Add tree pattern pruning into Analyzer rules

2021-05-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35544.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32686
[https://github.com/apache/spark/pull/32686]

> Add tree pattern pruning into Analyzer rules
> 
>
> Key: SPARK-35544
> URL: https://issues.apache.org/jira/browse/SPARK-35544
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Analyzer rules have rule ID pruning, but do not have tree pattern pruning yet.






[jira] [Commented] (SPARK-35584) Increase the timeout in FallbackStorageSuite

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354765#comment-17354765
 ] 

Apache Spark commented on SPARK-35584:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32719

> Increase the timeout in FallbackStorageSuite
> 
>
> Key: SPARK-35584
> URL: https://issues.apache.org/jira/browse/SPARK-35584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Minor
>
> {{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
> starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
> APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
> shuffle data to fallback storage - Upload from all decommissioned executors}}
> {{- Upload multi stages *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:243)}}
> {{- lz4 - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- lzf - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00972281101 seconds. Last failure message: 
> fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
> {{- snappy - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- zstd - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{Run completed in 1 minute, 37 seconds.}}
> {{Total number of tests run: 9}}
> {{Suites: completed 2, aborted 0}}
> {{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
> {{*** 5 TESTS FAILED ***}}
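If the fix is simply a larger deadline, the ScalaTest pattern would look roughly
like the sketch below; the polled condition here is a stand-in for checks such as
fallbackStorage.exists(0, file) from the suite.

{code:java}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Seconds, Span}

// Sketch only: raise the eventually timeout/interval so slower environments
// (such as the aarch64 run above) get more than the ~10 seconds seen in the log.
val start = System.currentTimeMillis()
def uploadVisible: Boolean = System.currentTimeMillis() - start > 15000  // stand-in condition

eventually(timeout(Span(60, Seconds)), interval(Span(1, Seconds))) {
  assert(uploadVisible)
}
{code}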






[jira] [Assigned] (SPARK-35584) Increase the timeout in FallbackStorageSuite

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35584:


Assignee: (was: Apache Spark)

> Increase the timeout in FallbackStorageSuite
> 
>
> Key: SPARK-35584
> URL: https://issues.apache.org/jira/browse/SPARK-35584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Minor
>
> {{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
> starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
> APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
> shuffle data to fallback storage - Upload from all decommissioned executors}}
> {{- Upload multi stages *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:243)}}
> {{- lz4 - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- lzf - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00972281101 seconds. Last failure message: 
> fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
> {{- snappy - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- zstd - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{Run completed in 1 minute, 37 seconds.}}
> {{Total number of tests run: 9}}
> {{Suites: completed 2, aborted 0}}
> {{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
> {{*** 5 TESTS FAILED ***}}






[jira] [Assigned] (SPARK-35584) Increase the timeout in FallbackStorageSuite

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35584:


Assignee: Apache Spark

> Increase the timeout in FallbackStorageSuite
> 
>
> Key: SPARK-35584
> URL: https://issues.apache.org/jira/browse/SPARK-35584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Minor
>
> {{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
> starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
> APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
> shuffle data to fallback storage - Upload from all decommissioned executors}}
> {{- Upload multi stages *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:243)}}
> {{- lz4 - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- lzf - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00972281101 seconds. Last failure message: 
> fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
> {{- snappy - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- zstd - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{Run completed in 1 minute, 37 seconds.}}
> {{Total number of tests run: 9}}
> {{Suites: completed 2, aborted 0}}
> {{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
> {{*** 5 TESTS FAILED ***}}






[jira] [Updated] (SPARK-35584) Increase the timeout in FallbackStorageSuite

2021-05-31 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-35584:

Description: 
The aarch64 case failed due to:

 

{{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
shuffle data to fallback storage - Upload from all decommissioned executors}}
 {{- Upload multi stages *** FAILED ***}}
 \{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:243)}}
 {{- lz4 - Newly added executors should access old data from remote storage *** 
FAILED ***}}
 \{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:268)}}
 {{- lzf - Newly added executors should access old data from remote storage *** 
FAILED ***}}
 \{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.00972281101 seconds. Last failure message: 
fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
 {{- snappy - Newly added executors should access old data from remote storage 
*** FAILED ***}}
 \{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:268)}}
 {{- zstd - Newly added executors should access old data from remote storage 
*** FAILED ***}}
 {{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, file) 
was false. (FallbackStorageSuite.scala:268)}}
 {{Run completed in 1 minute, 37 seconds.}}
 {{Total number of tests run: 9}}
 {{Suites: completed 2, aborted 0}}
 {{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
 {{*** 5 TESTS FAILED ***}}

  was:
{{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
shuffle data to fallback storage - Upload from all decommissioned executors}}
{{- Upload multi stages *** FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:243)}}
{{- lz4 - Newly added executors should access old data from remote storage *** 
FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:268)}}
{{- lzf - Newly added executors should access old data from remote storage *** 
FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.00972281101 seconds. Last failure message: 
fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
{{- snappy - Newly added executors should access old data from remote storage 
*** FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:268)}}
{{- zstd - Newly added executors should access old data from remote storage *** 
FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, file) 
was false. (FallbackStorageSuite.scala:268)}}
{{Run completed in 1 minute, 37 seconds.}}
{{Total number of tests run: 9}}
{{Suites: completed 2, aborted 0}}
{{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
{{*** 5 TESTS FAILED ***}}


> Increase the timeout in FallbackStorageSuite
> 
>
> Key: SPARK-35584
> URL: https://issues.apache.org/jira/browse/SPARK-35584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Minor
>
> The aarch64 case failed due to:
>  
> {{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
> starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
> APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
> shuffle data to fallback storage - Upload from all decommissioned executors}}
>  {{- Upload multi stages *** FAILED 

[jira] [Commented] (SPARK-34059) Use for/foreach rather than map to make sure execute it eagerly

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354762#comment-17354762
 ] 

Apache Spark commented on SPARK-34059:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32719

> Use for/foreach rather than map to make sure execute it eagerly 
> 
>
> Key: SPARK-34059
> URL: https://issues.apache.org/jira/browse/SPARK-34059
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.1, 3.2.0
>
>
> This is virtually a clone of SPARK-16694. There are some more new places where 
> map should be replaced with for/foreach. Please see the original ticket and PR for more details.
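For illustration, a minimal, self-contained Scala sketch of the general point (not the actual Spark code paths touched by the PR): calling map purely for its side effects allocates a throwaway collection, and on a lazy collection the side effect may never run at all, whereas foreach (or a for loop) executes eagerly.

{code:scala}
object ForeachVsMapSketch {
  def main(args: Array[String]): Unit = {
    val names = Seq("a", "b", "c")

    // map used only for a side effect builds a useless Seq[Unit] ...
    names.map(n => println(s"register $n"))

    // ... and on a lazy view nothing is printed unless something forces it:
    names.view.map(n => println(s"never printed for $n"))

    // foreach (or a for loop) executes the side effect eagerly and returns Unit:
    names.foreach(n => println(s"register $n"))
    for (n <- names) println(s"register $n")
  }
}
{code}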



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34059) Use for/foreach rather than map to make sure execute it eagerly

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354761#comment-17354761
 ] 

Apache Spark commented on SPARK-34059:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32719

> Use for/foreach rather than map to make sure execute it eagerly 
> 
>
> Key: SPARK-34059
> URL: https://issues.apache.org/jira/browse/SPARK-34059
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.1, 3.2.0
>
>
> This is virtually a clone of SPARK-16694. There are some more new places where 
> map should be replaced with for/foreach. Please see the original ticket and PR for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32166) Metastore problem on Spark3.0 with Hive3.0

2021-05-31 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354759#comment-17354759
 ] 

angerszhu commented on SPARK-32166:
---

http://apache-spark-user-list.1001560.n3.nabble.com/Re-Metastore-problem-on-Spark2-3-with-Hive3-0-td33474.html
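For reference only, a hedged Scala sketch of pointing Spark at a Hive 3.x metastore client that matches the server instead of the built-in one; the metastore URI is a placeholder, and whether this avoids the fire_listener_event error reported here is an assumption, not something verified in this ticket.

{code:scala}
import org.apache.spark.sql.SparkSession

object Hive3MetastoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive3-metastore-sketch")
      // Ask Spark to use a Hive 3.x metastore client rather than the built-in one;
      // "maven" downloads client jars matching the requested version.
      .config("spark.sql.hive.metastore.version", "3.1.0")
      .config("spark.sql.hive.metastore.jars", "maven")
      .config("hive.metastore.uris", "thrift://metastore-host:9083") // placeholder host
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()
  }
}
{code}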

>  Metastore problem on Spark3.0 with Hive3.0
> ---
>
> Key: SPARK-32166
> URL: https://issues.apache.org/jira/browse/SPARK-32166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: hzk
>Priority: Major
>
> When I use spark-sql to create a table, the problem appears.
> {code:java}
> create table bigbig as select b.user_id , b.name , b.age , c.address , c.city 
> , a.position , a.object , a.problem , a.complaint_time from ( select user_id 
> , position , object , problem , complaint_time from 
> HIVE_COMBINE_7efde4e2dcb34c218b3fb08872e698d5 ) as a left join 
> HIVE_ODS_17_TEST_DEMO_ODS_USERS_INFO_20200608141945 as b on b.user_id = 
> a.user_id left join HIVE_ODS_17_TEST_ADDRESS_CITY_20200608141942 as c on 
> c.address_id = b.address_id;
> {code}
> It opened a connection to the Hive metastore.
> My Hive version is 3.1.0.
> {code:java}
> org.apache.thrift.TApplicationException: Required field 'filesAdded' is 
> unset! 
> Struct:InsertEventRequestData(filesAdded:null)org.apache.thrift.TApplicationException:
>  Required field 'filesAdded' is unset! 
> Struct:InsertEventRequestData(filesAdded:null) at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111) 
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
>  at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source) at 
> org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947) at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847) at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
>  at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> 

[jira] [Commented] (SPARK-21957) Add current_user function

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354755#comment-17354755
 ] 

Apache Spark commented on SPARK-21957:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32718

> Add current_user function
> -
>
> Key: SPARK-21957
> URL: https://issues.apache.org/jira/browse/SPARK-21957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Priority: Minor
>  Labels: bulk-closed
>
> Spark doesn't support the {{current_user}} function.
> Although the user can be retrieved in other ways, the function would make it 
> easier to migrate existing Hive queries to Spark, and it can also be 
> convenient for people who are just using SQL to interact with Spark.
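For reference, a minimal Scala sketch of the "other ways" the user can already be retrieved today; the current_user() call itself is only what this ticket proposes and is left commented out.

{code:scala}
import org.apache.spark.sql.SparkSession

object CurrentUserSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("current-user-sketch")
      .master("local[1]")
      .getOrCreate()

    // Existing ways to get the user without a SQL function:
    val fromContext = spark.sparkContext.sparkUser    // the user Spark runs as
    val fromJvm = System.getProperty("user.name")     // the plain JVM view
    println(s"sparkUser=$fromContext, user.name=$fromJvm")

    // What the proposed Hive-compatible function would look like once added:
    // spark.sql("SELECT current_user()").show()

    spark.stop()
  }
}
{code}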



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21957) Add current_user function

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354756#comment-17354756
 ] 

Apache Spark commented on SPARK-21957:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32718

> Add current_user function
> -
>
> Key: SPARK-21957
> URL: https://issues.apache.org/jira/browse/SPARK-21957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Priority: Minor
>  Labels: bulk-closed
>
> Spark doesn't support the {{current_user}} function.
> Although the user can be retrieved in other ways, the function would make it 
> easier to migrate existing Hive queries to Spark, and it can also be 
> convenient for people who are just using SQL to interact with Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35584) Increase the timeout in FallbackStorageSuite

2021-05-31 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354753#comment-17354753
 ] 

Yikun Jiang commented on SPARK-35584:
-

I also see some randomly failing timeout tests on GitHub Actions, for example [1][2]:

[[1] 
https://github.com/apache/spark/actions/runs/489319612|https://github.com/apache/spark/actions/runs/489319612]

[[2]https://github.com/apache/spark/actions/runs/479317320|https://github.com/apache/spark/actions/runs/479317320]
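For reference, the failures above all come from a ScalaTest {{eventually}} block giving up after roughly 10 seconds (20 attempts). Below is a minimal sketch of what a larger budget looks like, with a stand-in condition instead of the real fallbackStorage.exists(0, file) check.

{code:scala}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object EventuallyTimeoutSketch {
  // Stand-in for the condition the suite polls; it becomes true after ~12 seconds,
  // so a 10-second budget fails while a 30-second one passes.
  private val start = System.nanoTime()
  private def conditionMet: Boolean = System.nanoTime() - start > 12L * 1000 * 1000 * 1000

  def main(args: Array[String]): Unit = {
    // Raising the timeout gives slower hosts (e.g. aarch64 CI machines) more headroom.
    eventually(timeout(30.seconds), interval(500.milliseconds)) {
      assert(conditionMet, "stand-in for: fallbackStorage.exists(0, file) was false")
    }
    println("condition eventually became true")
  }
}
{code}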

 

> Increase the timeout in FallbackStorageSuite
> 
>
> Key: SPARK-35584
> URL: https://issues.apache.org/jira/browse/SPARK-35584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Minor
>
> {{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
> starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
> APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
> shuffle data to fallback storage - Upload from all decommissioned executors}}
> {{- Upload multi stages *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:243)}}
> {{- lz4 - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- lzf - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00972281101 seconds. Last failure message: 
> fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
> {{- snappy - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{- zstd - Newly added executors should access old data from remote storage 
> *** FAILED ***}}
> {{ The code passed to eventually never returned normally. Attempted 20 times 
> over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, 
> file) was false. (FallbackStorageSuite.scala:268)}}
> {{Run completed in 1 minute, 37 seconds.}}
> {{Total number of tests run: 9}}
> {{Suites: completed 2, aborted 0}}
> {{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
> {{*** 5 TESTS FAILED ***}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35584) Increase the timeout in FallbackStorageSuite

2021-05-31 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-35584:
---

 Summary: Increase the timeout in FallbackStorageSuite
 Key: SPARK-35584
 URL: https://issues.apache.org/jira/browse/SPARK-35584
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.1.2
Reporter: Yikun Jiang


{{Discovery starting. Discovery completed in 2 seconds, 396 milliseconds. Run 
starting. Expected test count is: 9 FallbackStorageSuite: - fallback storage 
APIs - copy/exists - SPARK-34142: fallback storage API - cleanUp - migrate 
shuffle data to fallback storage - Upload from all decommissioned executors}}
{{- Upload multi stages *** FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:243)}}
{{- lz4 - Newly added executors should access old data from remote storage *** 
FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.010694845 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:268)}}
{{- lzf - Newly added executors should access old data from remote storage *** 
FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.00972281101 seconds. Last failure message: 
fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:268)}}
{{- snappy - Newly added executors should access old data from remote storage 
*** FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.009750581 seconds. Last failure message: fallbackStorage.exists(0, 
file) was false. (FallbackStorageSuite.scala:268)}}
{{- zstd - Newly added executors should access old data from remote storage *** 
FAILED ***}}
{{ The code passed to eventually never returned normally. Attempted 20 times 
over 10.00968885 seconds. Last failure message: fallbackStorage.exists(0, file) 
was false. (FallbackStorageSuite.scala:268)}}
{{Run completed in 1 minute, 37 seconds.}}
{{Total number of tests run: 9}}
{{Suites: completed 2, aborted 0}}
{{Tests: succeeded 4, failed 5, canceled 0, ignored 0, pending 0}}
{{*** 5 TESTS FAILED ***}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35583) Move JDBC data source options from Python and Scala into a single page

2021-05-31 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35583:
---

 Summary: Move JDBC data source options from Python and Scala into 
a single page
 Key: SPARK-35583
 URL: https://issues.apache.org/jira/browse/SPARK-35583
 Project: Spark
  Issue Type: Sub-task
  Components: docs
Affects Versions: 3.2.0
Reporter: Haejoon Lee


Refer to https://issues.apache.org/jira/browse/SPARK-34491



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35433.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32658
[https://github.com/apache/spark/pull/32658]

> Move CSV data source options from Python and Scala into a single page.
> --
>
> Key: SPARK-35433
> URL: https://issues.apache.org/jira/browse/SPARK-35433
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35433:


Assignee: Haejoon Lee

> Move CSV data source options from Python and Scala into a single page.
> --
>
> Key: SPARK-35433
> URL: https://issues.apache.org/jira/browse/SPARK-35433
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35578) Add a test case for a janino bug

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35578.
--
Fix Version/s: 3.2.0
 Assignee: Wenchen Fan
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32716

> Add a test case for a janino bug
> 
>
> Key: SPARK-35578
> URL: https://issues.apache.org/jira/browse/SPARK-35578
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35582) Remove # noqa in Python API documents.

2021-05-31 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354732#comment-17354732
 ] 

Haejoon Lee commented on SPARK-35582:
-

I'm working on this

> Remove # noqa in Python API documents.
> --
>
> Key: SPARK-35582
> URL: https://issues.apache.org/jira/browse/SPARK-35582
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are some unnecessary "# noqa" markers exposed in the Python API 
> documentation.
>  
> For example, 
> [https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.parquet.html#pyspark.sql.DataFrameReader.parquet.]
>  
> We should remove this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35582) Remove # noqa in Python API documents.

2021-05-31 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35582:
---

 Summary: Remove # noqa in Python API documents.
 Key: SPARK-35582
 URL: https://issues.apache.org/jira/browse/SPARK-35582
 Project: Spark
  Issue Type: Sub-task
  Components: docs, PySpark
Affects Versions: 3.2.0
Reporter: Haejoon Lee


There are some unnecessary "# noqa" markers exposed in the Python API 
documentation.

 

For example, 
[https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.parquet.html#pyspark.sql.DataFrameReader.parquet.]

 

We should remove this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35573) Support R 4.1.0

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35573:


Assignee: Hyukjin Kwon  (was: Dongjoon Hyun)

> Support R 4.1.0
> ---
>
> Key: SPARK-35573
> URL: https://issues.apache.org/jira/browse/SPARK-35573
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.3, 3.2.0, 3.1.3
>
>
> Currently, there exist 6 SparkR UT failures in R 4.1.0.
> Up to R 4.0.5, there were no errors.
> {code}
> ══ Failed 
> ══
> ── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow 
> optimi
> collect(createDataFrame(rdf)) not equal to `expected`.
> Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 4. Error (test_sparkSQL.R:1454:3): column functions 
> ─
> Error: (converted from warning) cannot xtfrm data frames
> Backtrace:
>   1. base::sort(collect(distinct(select(df, input_file_name() 
> test_sparkSQL.R:1454:2
>   2. base::sort.default(collect(distinct(select(df, input_file_name()
>   5. base::order(x, na.last = na.last, decreasing = decreasing)
>   6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
>   7. base:::FUN(X[[i]], ...)
>  10. base::xtfrm.data.frame(x)
> ── 5. Failure (test_utils.R:67:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> ── 6. Failure (test_utils.R:80:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35573) Make SparkR tests pass with R 4.1+

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35573:
-
Summary: Make SparkR tests pass with R 4.1+  (was: Support R 4.1.0)

> Make SparkR tests pass with R 4.1+
> --
>
> Key: SPARK-35573
> URL: https://issues.apache.org/jira/browse/SPARK-35573
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.3, 3.2.0, 3.1.3
>
>
> Currently, there exist 6 SparkR UT failures in R 4.1.0.
> Up to R 4.0.5, there were no errors.
> {code}
> ══ Failed 
> ══
> ── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow 
> optimi
> collect(createDataFrame(rdf)) not equal to `expected`.
> Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 4. Error (test_sparkSQL.R:1454:3): column functions 
> ─
> Error: (converted from warning) cannot xtfrm data frames
> Backtrace:
>   1. base::sort(collect(distinct(select(df, input_file_name() 
> test_sparkSQL.R:1454:2
>   2. base::sort.default(collect(distinct(select(df, input_file_name()
>   5. base::order(x, na.last = na.last, decreasing = decreasing)
>   6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
>   7. base:::FUN(X[[i]], ...)
>  10. base::xtfrm.data.frame(x)
> ── 5. Failure (test_utils.R:67:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> ── 6. Failure (test_utils.R:80:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35573) Support R 4.1.0

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35573:
-
Fix Version/s: 3.1.3
   3.0.3

> Support R 4.1.0
> ---
>
> Key: SPARK-35573
> URL: https://issues.apache.org/jira/browse/SPARK-35573
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.3, 3.2.0, 3.1.3
>
>
> Currently, there exist 6 SparkR UT failures in R 4.1.0.
> Up to R 4.0.5, there were no errors.
> {code}
> ══ Failed 
> ══
> ── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow 
> optimi
> collect(createDataFrame(rdf)) not equal to `expected`.
> Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 4. Error (test_sparkSQL.R:1454:3): column functions 
> ─
> Error: (converted from warning) cannot xtfrm data frames
> Backtrace:
>   1. base::sort(collect(distinct(select(df, input_file_name() 
> test_sparkSQL.R:1454:2
>   2. base::sort.default(collect(distinct(select(df, input_file_name()
>   5. base::order(x, na.last = na.last, decreasing = decreasing)
>   6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
>   7. base:::FUN(X[[i]], ...)
>  10. base::xtfrm.data.frame(x)
> ── 5. Failure (test_utils.R:67:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> ── 6. Failure (test_utils.R:80:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35573) Support R 4.1.0

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35573.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32709
[https://github.com/apache/spark/pull/32709]

> Support R 4.1.0
> ---
>
> Key: SPARK-35573
> URL: https://issues.apache.org/jira/browse/SPARK-35573
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, there exist 6 SparkR UT failures in R 4.1.0.
> Up to R 4.0.5, there were no errors.
> {code}
> ══ Failed 
> ══
> ── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow 
> optimi
> collect(createDataFrame(rdf)) not equal to `expected`.
> Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 4. Error (test_sparkSQL.R:1454:3): column functions 
> ─
> Error: (converted from warning) cannot xtfrm data frames
> Backtrace:
>   1. base::sort(collect(distinct(select(df, input_file_name() 
> test_sparkSQL.R:1454:2
>   2. base::sort.default(collect(distinct(select(df, input_file_name()
>   5. base::order(x, na.last = na.last, decreasing = decreasing)
>   6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
>   7. base:::FUN(X[[i]], ...)
>  10. base::xtfrm.data.frame(x)
> ── 5. Failure (test_utils.R:67:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> ── 6. Failure (test_utils.R:80:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35573) Support R 4.1.0

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35573:


Assignee: Dongjoon Hyun

> Support R 4.1.0
> ---
>
> Key: SPARK-35573
> URL: https://issues.apache.org/jira/browse/SPARK-35573
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Currently, there exist 6 SparkR UT failures in R 4.1.0.
> Up to R 4.0.5, there were no errors.
> {code}
> ══ Failed 
> ══
> ── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow 
> optimi
> collect(createDataFrame(rdf)) not equal to `expected`.
> Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - 
> type
> collect(ret) not equal to `rdf`.
> Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')
> ── 4. Error (test_sparkSQL.R:1454:3): column functions 
> ─
> Error: (converted from warning) cannot xtfrm data frames
> Backtrace:
>   1. base::sort(collect(distinct(select(df, input_file_name() 
> test_sparkSQL.R:1454:2
>   2. base::sort.default(collect(distinct(select(df, input_file_name()
>   5. base::order(x, na.last = na.last, decreasing = decreasing)
>   6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
>   7. base:::FUN(X[[i]], ...)
>  10. base::xtfrm.data.frame(x)
> ── 5. Failure (test_utils.R:67:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> ── 6. Failure (test_utils.R:80:3): cleanClosure on R functions 
> ─
> `actual` not equal to `g`.
> names for current but not for target
> Length mismatch: comparison on first 0 components
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35453) Move Koalas accessor to pandas_on_spark accessor

2021-05-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35453.
--
Fix Version/s: 3.2.0
 Assignee: Haejoon Lee
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32674

> Move Koalas accessor to pandas_on_spark accessor
> 
>
> Key: SPARK-35453
> URL: https://issues.apache.org/jira/browse/SPARK-35453
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> The existing Koalas has the "Koalas accessor", which is named after the Koalas 
> project.
>  
> We should rename this accessor to "Pandas on Spark accessor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35561) partition result is incorrect when insert into partition table with int datatype partition column

2021-05-31 Thread YuanGuanhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354728#comment-17354728
 ] 

YuanGuanhu commented on SPARK-35561:


[~Stelyus] I know, but the surprising thing is: if I execute the statement `insert 
into table orc_part03 partition (p_int=002) select * from partitiontb04 where 
id > 10006`, the partition is 002. I think we should have the same behavior in both cases.

> partition result is incorrect when insert into partition table with int 
> datatype partition column
> -
>
> Key: SPARK-35561
> URL: https://issues.apache.org/jira/browse/SPARK-35561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.1, 3.1.2
>Reporter: YuanGuanhu
>Priority: Major
>
> When inserting into a partitioned table whose partition column has int datatype, 
> if the partition column value starts with 0, like 001, the partition result is 
> wrong.
>  
> *How to reproduce the problem:*
> CREATE TABLE partitiontb04 (id INT, c_string STRING) STORED AS orc; 
>  insert into table partitiontb04 values (10001,'test1');
>  CREATE TABLE orc_part03(id INT, c_string STRING) partitioned by (p_int int) 
> STORED AS orc;
>  insert into table orc_part03 partition (p_int=001) select * from 
> partitiontb04 where id < 10006;
>  show partitions orc_part03;
> expect result:
> p_int=001
>  
> actual result:
> p_int=1
>  
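A hedged note on the likely cause (an assumption, not something confirmed in this ticket): the static partition value seems to be interpreted using the partition column's data type, and an int simply has no leading zeros, so 001 normalizes to 1. A trivial Scala illustration of that normalization:

{code:scala}
object LeadingZeroPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spec = "001"
    // Interpreted as an int (the partition column type), the leading zeros vanish:
    println(spec.toInt)   // 1   -> directory p_int=1
    // Kept as a string, the original text survives:
    println(spec)         // 001 -> what a string partition column would preserve
  }
}
{code}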



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35396) Support to manual close/release entries in MemoryStore and InMemoryRelation instead of relying on GC

2021-05-31 Thread Chendi.Xue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue reopened SPARK-35396:


Reopening this JIRA, since it aims to add manual close support to both 
MemoryStore and InMemoryRelation, and the second PR was just submitted.

https://github.com/apache/spark/pull/32717

> Support to manual close/release entries in MemoryStore and InMemoryRelation 
> instead of relying on GC
> -
>
> Key: SPARK-35396
> URL: https://issues.apache.org/jira/browse/SPARK-35396
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR proposes an add-on to support manually closing entries in 
> MemoryStore and InMemoryRelation.
> h3. What changes were proposed in this pull request?
> Currently:
> MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap 
> or OffHeap entries.
> When memoryStore.remove(blockId) is called, the code simply removes one 
> entry from the LinkedHashMap and relies on Java GC to do the release work.
> This PR:
> We propose an add-on that manually closes any object stored in MemoryStore 
> or InMemoryRelation if the object extends AutoCloseable.
> Verification:
> In our own use case, we implemented a user-defined off-heap-hashRelation for 
> BHJ, and we verified that, by adding this manual close, our 
> off-heap-hashRelation is released when evict is called.
> Also, with this PR, a user-defined cachedBatch we implemented is released when 
> InMemoryRelation.clearCache() is called.
> h3. Why are the changes needed?
> These changes help clean up off-heap user-defined objects that may be 
> cached in InMemoryRelation or MemoryStore.
> h3. Does this PR introduce _any_ user-facing change?
> NO
> h3. How was this patch tested?
> WIP
> Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com]
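A minimal, self-contained Scala sketch of the eviction pattern described above, with simplified stand-ins for BlockId and the entry map; the real MemoryStore types, locking, and memory accounting are omitted, so this is an illustration rather than the PR's code.

{code:scala}
import java.util.{LinkedHashMap => JLinkedHashMap}

object ManualCloseSketch {
  final case class BlockId(name: String)

  private val entries = new JLinkedHashMap[BlockId, AnyRef]()

  def put(id: BlockId, value: AnyRef): Unit = entries.put(id, value)

  def remove(id: BlockId): Boolean = {
    val entry = entries.remove(id)
    entry match {
      // If the cached object manages off-heap resources, release them now
      // instead of waiting for GC to get around to it.
      case closeable: AutoCloseable =>
        try closeable.close()
        catch { case e: Exception => println(s"close failed for $id: $e") }
      case _ => // plain on-heap entry, nothing extra to do
    }
    entry != null
  }

  def main(args: Array[String]): Unit = {
    val offHeap = new AutoCloseable {
      override def close(): Unit = println("releasing off-heap buffer")
    }
    put(BlockId("rdd_0_0"), offHeap)
    remove(BlockId("rdd_0_0")) // prints "releasing off-heap buffer"
  }
}
{code}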



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35396) Support to manual close/release entries in MemoryStore and InMemoryRelation instead of relying on GC

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354724#comment-17354724
 ] 

Apache Spark commented on SPARK-35396:
--

User 'xuechendi' has created a pull request for this issue:
https://github.com/apache/spark/pull/32717

> Support to manual close/release entries in MemoryStore and InMemoryRelation 
> instead of relying on GC
> -
>
> Key: SPARK-35396
> URL: https://issues.apache.org/jira/browse/SPARK-35396
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR proposes an add-on to support manually closing entries in 
> MemoryStore and InMemoryRelation.
> h3. What changes were proposed in this pull request?
> Currently:
> MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap 
> or OffHeap entries.
> When memoryStore.remove(blockId) is called, the code simply removes one 
> entry from the LinkedHashMap and relies on Java GC to do the release work.
> This PR:
> We propose an add-on that manually closes any object stored in MemoryStore 
> or InMemoryRelation if the object extends AutoCloseable.
> Verification:
> In our own use case, we implemented a user-defined off-heap-hashRelation for 
> BHJ, and we verified that, by adding this manual close, our 
> off-heap-hashRelation is released when evict is called.
> Also, with this PR, a user-defined cachedBatch we implemented is released when 
> InMemoryRelation.clearCache() is called.
> h3. Why are the changes needed?
> These changes help clean up off-heap user-defined objects that may be 
> cached in InMemoryRelation or MemoryStore.
> h3. Does this PR introduce _any_ user-facing change?
> NO
> h3. How was this patch tested?
> WIP
> Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35396) Support to manual close/release entries in MemoryStore and InMemoryRelation instead of relying on GC

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354723#comment-17354723
 ] 

Apache Spark commented on SPARK-35396:
--

User 'xuechendi' has created a pull request for this issue:
https://github.com/apache/spark/pull/32717

> Support to manual close/release entries in MemoryStore and InMemoryRelation 
> instead of relying on GC
> -
>
> Key: SPARK-35396
> URL: https://issues.apache.org/jira/browse/SPARK-35396
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Chendi.Xue
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR proposes an add-on to support manually closing entries in 
> MemoryStore and InMemoryRelation.
> h3. What changes were proposed in this pull request?
> Currently:
> MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all OnHeap 
> or OffHeap entries.
> When memoryStore.remove(blockId) is called, the code simply removes one 
> entry from the LinkedHashMap and relies on Java GC to do the release work.
> This PR:
> We propose an add-on that manually closes any object stored in MemoryStore 
> or InMemoryRelation if the object extends AutoCloseable.
> Verification:
> In our own use case, we implemented a user-defined off-heap-hashRelation for 
> BHJ, and we verified that, by adding this manual close, our 
> off-heap-hashRelation is released when evict is called.
> Also, with this PR, a user-defined cachedBatch we implemented is released when 
> InMemoryRelation.clearCache() is called.
> h3. Why are the changes needed?
> These changes help clean up off-heap user-defined objects that may be 
> cached in InMemoryRelation or MemoryStore.
> h3. Does this PR introduce _any_ user-facing change?
> NO
> h3. How was this patch tested?
> WIP
> Signed-off-by: Chendi Xue [chendi@intel.com|mailto:chendi@intel.com]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34731) ConcurrentModificationException in EventLoggingListener when redacting properties

2021-05-31 Thread John Pugliesi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354670#comment-17354670
 ] 

John Pugliesi commented on SPARK-34731:
---

To clarify, does this issue potentially prevent event logs from being 
created/written entirely? We're seeing this exception in some of our Spark 
3.1.1 applications - namely the applications with particularly large Window 
queries - where the final event log is never successfully written out (using an 
s3a:// spark.eventLog.dir, for what it's worth):

{code:bash}
# spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir s3a://my-bucket/spark-event-logs/
{code}

> ConcurrentModificationException in EventLoggingListener when redacting 
> properties
> -
>
> Key: SPARK-34731
> URL: https://issues.apache.org/jira/browse/SPARK-34731
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> Reproduction:
> The key elements of reproduction are enabling event logging, setting 
> spark.executor.cores, and some bad luck:
> {noformat}
> $ bin/spark-shell --conf spark.ui.showConsoleProgress=false \
> --conf spark.executor.cores=1 --driver-memory 4g --conf \
> "spark.ui.showConsoleProgress=false" \
> --conf spark.eventLog.enabled=true \
> --conf spark.eventLog.dir=/tmp/spark-events
> ...
> scala> (0 to 500).foreach { i =>
>  |   val df = spark.range(0, 2).toDF("a")
>  |   df.filter("a > 12").count
>  | }
> 21/03/12 18:16:44 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.util.ConcurrentModificationException
>   at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>   at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.mutable.MapLike.toSeq(MapLike.scala:75)
>   at scala.collection.mutable.MapLike.toSeq$(MapLike.scala:72)
>   at scala.collection.mutable.AbstractMap.toSeq(Map.scala:82)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.redactProperties(EventLoggingListener.scala:290)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:162)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>   at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>   at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>   at 
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1379)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> {noformat}
> Analysis from quick reading of the code:
> DAGScheduler posts a JobSubmitted event containing a clone of a properties 
> object 
> [here|https://github.com/apache/spark/blob/4f1e434ec57070b52b28f98c66b53ca6ec4de7a4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L834].
> This event is handled 
> [here|https://github.com/apache/spark/blob/4f1e434ec57070b52b28f98c66b53ca6ec4de7a4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2394].
> DAGScheduler#handleJobSubmitted stores the properties object in a [Job 
> 
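The (truncated) analysis above boils down to a shared java.util.Properties being iterated by the event-logging listener while another thread mutates it. A minimal, self-contained Scala reproduction of that race and of the snapshot-first shape of a fix follows (an illustration only, not the actual Spark patch).

{code:scala}
import java.util.{ConcurrentModificationException, Properties}

object PropertiesRaceSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    (0 until 1000).foreach(i => props.setProperty(s"key$i", s"value$i"))

    // Writer thread keeps structurally modifying the shared Hashtable.
    val writer = new Thread(() => {
      var i = 0
      while (i < 500000) {
        props.setProperty(s"extra${i % 1000}", "x")
        props.remove(s"extra${(i + 500) % 1000}")
        i += 1
      }
    })
    writer.start()

    // Reader iterates the live table, much like redactProperties' toSeq does;
    // the fail-fast iterator can throw ConcurrentModificationException.
    try {
      val it = props.entrySet().iterator()
      var count = 0
      while (it.hasNext) { it.next(); count += 1 }
      println(s"unsafe iteration happened to survive: $count entries")
    } catch {
      case e: ConcurrentModificationException => println(s"reproduced: $e")
    }

    // Hashtable.clone() is synchronized, so copy first and iterate the copy.
    val snapshot = props.clone().asInstanceOf[Properties]
    println(s"safe snapshot has ${snapshot.size()} entries")

    writer.join()
  }
}
{code}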

[jira] [Resolved] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35576.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/32712

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 
> 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.
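A hedged Scala sketch of the kind of redaction the ticket asks for: mask the value whenever the key matches a sensitive-looking pattern before the rows are shown (in the spirit of spark.redaction.regex). The pattern, helper names, and replacement text here are illustrative assumptions, not the actual patch.

{code:scala}
object SetCommandRedactionSketch {
  // Illustrative pattern in the spirit of spark.redaction.regex.
  private val redactionPattern = "(?i)secret|password|token|access[.]key".r
  private val Redacted = "*********(redacted)"

  // Redact the value whenever the key looks sensitive; this is the general
  // shape of what SET / SET <key> output needs, not the real implementation.
  def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      if (redactionPattern.findFirstIn(k).isDefined) (k, Redacted) else (k, v)
    }

  def main(args: Array[String]): Unit = {
    val rows = redact(Seq(
      "javax.jdo.option.ConnectionPassword" -> "123456",
      "spark.sql.shuffle.partitions" -> "200"))
    rows.foreach { case (k, v) => println(f"$k%-40s $v") }
    // javax.jdo.option.ConnectionPassword     *********(redacted)
    // spark.sql.shuffle.partitions            200
  }
}
{code}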



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Affects Version/s: 1.6.3

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 
> 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Affects Version/s: 2.0.2

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Affects Version/s: 2.1.3

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Affects Version/s: 2.2.3

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.2, 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Affects Version/s: 2.3.4

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 2.3.4, 2.4.8, 3.0.2, 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Affects Version/s: 2.4.8
   3.0.2

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35576:
--
Issue Type: Bug  (was: Task)

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35581) Casting special strings to DATE/TIMESTAMP returns inconsistent results

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35581:


Assignee: Max Gekk  (was: Apache Spark)

> Casting special strings to DATE/TIMESTAMP returns inconsistent results
> --
>
> Key: SPARK-35581
> URL: https://issues.apache.org/jira/browse/SPARK-35581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When casting the special values "now", "today", "tomorrow", and "yesterday" 
> to DATE/TIMESTAMP, Spark may return inconsistent results.
> Looks like Spark runs the expression on each executor, on every row 
> independently. So the results could differ across executors if they have 
> different system time, and across rows because of the resolution of "now".
> https://github.com/databricks/runtime/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L876
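
A small spark-shell sketch of the reported behavior (an assumed illustration; the column and variable names are not from the issue):
{code:java}
// 'now' is resolved during evaluation, so rows of the same query can observe
// different timestamps, and executors with skewed clocks can disagree.
val df = spark.range(0, 4, 1, 4)   // 4 rows spread over 4 partitions
  .selectExpr("id", "CAST('now' AS TIMESTAMP) AS now_ts")

df.show(truncate = false)
{code}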



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35581) Casting special strings to DATE/TIMESTAMP returns inconsistent results

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35581:


Assignee: Apache Spark  (was: Max Gekk)

> Casting special strings to DATE/TIMESTAMP returns inconsistent results
> --
>
> Key: SPARK-35581
> URL: https://issues.apache.org/jira/browse/SPARK-35581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> When casting the special values "now", "today", "tomorrow", and "yesterday" 
> to DATE/TIMESTAMP, Spark may return inconsistent results.
> Looks like Spark runs the expression on each executor, on every row 
> independently. So the results could differ across executors if they have 
> different system time, and across rows because of the resolution of "now".
> https://github.com/databricks/runtime/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L876



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35581) Casting special strings to DATE/TIMESTAMP returns inconsistent results

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354651#comment-17354651
 ] 

Apache Spark commented on SPARK-35581:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32714

> Casting special strings to DATE/TIMESTAMP returns inconsistent results
> --
>
> Key: SPARK-35581
> URL: https://issues.apache.org/jira/browse/SPARK-35581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When casting the special values "now", "today", "tomorrow", and "yesterday" 
> to DATE/TIMESTAMP, Spark may return inconsistent results.
> Looks like Spark runs the expression on each executor, on every row 
> independently. So the results could differ across executors if they have 
> different system time, and across rows because of the resolution of "now".
> https://github.com/databricks/runtime/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L876



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35581) Casting special strings to DATE/TIMESTAMP returns inconsistent results

2021-05-31 Thread Max Gekk (Jira)
Max Gekk created SPARK-35581:


 Summary: Casting special strings to DATE/TIMESTAMP returns 
inconsistent results
 Key: SPARK-35581
 URL: https://issues.apache.org/jira/browse/SPARK-35581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk


When casting the special values "now", "today", "tomorrow", and "yesterday" to 
DATE/TIMESTAMP, Spark may return inconsistent results.

Looks like Spark runs the expression on each executor, on every row 
independently. So the results could differ across executors if they have 
different system time, and across rows because of the resolution of "now".

https://github.com/databricks/runtime/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L876



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-31 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354637#comment-17354637
 ] 

Adam Binford commented on SPARK-35564:
--

A 2x gain would be pretty significant to us; I don't know about others. I'm 
planning to implement this in our fork, and if I get good results I'll put up a 
PR for further discussion. We could optionally add a config for this if it's 
workload-dependent. Also, the only thing it is likely to do to the generated 
code is reduce the overall size, albeit with more function calls in the worst 
cases. I don't know enough about Java to say whether smaller code size adds any 
value.

>Oh, this is another issue. I noticed it last time when I worked on another PR 
>recently, but don't have time to look at it yet.

I created https://issues.apache.org/jira/browse/SPARK-35580 to track what I've 
figured out so far. Not sure what the right fix is.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.
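
A hedged Scala rendering of the issue's example, useful for inspecting the duplicated work in the generated code (the DataFrame contents and column names are illustrative assumptions):
{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("word", "1234").toDF("_1")
val cleaned = regexp_replace(col("_1"), "\\d", "")
val out = df.withColumn("numbers_removed", when(length(cleaned) > 0, cleaned))

// With no `otherwise` branch, `cleaned` is not treated as a subexpression today,
// so the regexp replacement shows up twice in the generated code.
out.explain("codegen")
{code}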



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35580) Support subexpression elimination for higher order functions

2021-05-31 Thread Adam Binford (Jira)
Adam Binford created SPARK-35580:


 Summary: Support subexpression elimination for higher order 
functions
 Key: SPARK-35580
 URL: https://issues.apache.org/jira/browse/SPARK-35580
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Adam Binford


Currently, higher-order functions are not candidates for subexpression 
elimination. This is because all higher-order functions have different semantic 
hashes, due to "exprId" and "value" in "NamedLambdaVariable". These are always 
unique, so the semanticHash of a NamedLambdaVariable is always unique. Also, 
[https://github.com/apache/spark/pull/32424] might throw a wrench in things 
too: depending on how you define your expressions, the name could be 
different.
{code:java}
scala> var d = transform($"a", x => x + 1)
d: org.apache.spark.sql.Column = transform(a, lambdafunction((x_2 + 1), x_2))

scala> var e = transform($"a", x => x + 1)
e: org.apache.spark.sql.Column = transform(a, lambdafunction((x_3 + 1), x_3))

scala> struct(d.alias("1"), d.alias("2")).expr
res9: org.apache.spark.sql.catalyst.expressions.Expression = 
struct(NamePlaceholder, transform('a, lambdafunction((lambda 'x_2 + 1), lambda 
'x_2, false)) AS 1#4, NamePlaceholder, transform('a, lambdafunction((lambda 
'x_2 + 1), lambda 'x_2, false)) AS 2#5)

scala> struct(d.alias("1"), e.alias("2")).expr
res10: org.apache.spark.sql.catalyst.expressions.Expression = 
struct(NamePlaceholder, transform('a, lambdafunction((lambda 'x_2 + 1), lambda 
'x_2, false)) AS 1#6, NamePlaceholder, transform('a, lambdafunction((lambda 
'x_3 + 1), lambda 'x_3, false)) AS 2#7)
{code}
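
A short follow-up sketch (an assumption about how to confirm the behavior, not text from the issue): comparing the two columns' expressions directly shows they are not considered semantically equal.
{code:java}
import org.apache.spark.sql.functions._

val d = transform(col("a"), x => x + 1)
val e = transform(col("a"), x => x + 1)

// Structurally identical lambdas, but each NamedLambdaVariable carries a fresh
// exprId, so this comparison is expected to return false per the issue.
println(d.expr.semanticEquals(e.expr))
{code}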



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354590#comment-17354590
 ] 

L. C. Hsieh commented on SPARK-35564:
-

> I don't really think this is much of a corner case, but a common case of 
> using a when expression for data validation. Most of our ETL process comes 
> down to normalizing, cleaning, and validating strings, which at the end of 
> the day usually looks like:

This is a corner case because it simplifies other possible cases, although you 
might actually use this pattern in your ETL process.

For example, if we treat an expression that is always evaluated at least once and 
optionally evaluated again as a subexpression, many expressions qualify. A child 
expression of the first predicate of "when", if it is also part of any conditional 
predicate/value, might also be treated as a subexpression. In the end we might have 
tons of subexpressions like that flooding the generated code.

On the other hand, how much gain can we get from this case? In the example, the 
worst case evaluates it twice, not 5 or 10 times, and it may be just a small piece 
of the entire ETL process. I feel it's not worth it, because we would pay a lot of 
cost, including making the code more complicated and creating tons of 
subexpressions, yet in the end we only gain a little, and only in the worst case.

> though currently higher order functions are always semantically different so 
> they don't get subexpressions regardless I think. That's something I plan to 
> look into as a follow up.

Oh, this is another issue. I noticed it last time when I worked on another PR 
recently, but don't have time to look at it yet.



> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35561) partition result is incorrect when insert into partition table with int datatype partition column

2021-05-31 Thread Franck Thang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354556#comment-17354556
 ] 

Franck Thang commented on SPARK-35561:
--

I personally don't expect 001 because the type is an INT; if I wanted 001, I 
would have used the type STRING.

> partition result is incorrect when insert into partition table with int 
> datatype partition column
> -
>
> Key: SPARK-35561
> URL: https://issues.apache.org/jira/browse/SPARK-35561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.1, 3.1.2
>Reporter: YuanGuanhu
>Priority: Major
>
> When inserting into a partitioned table with an INT partition column, if the 
> partition column value starts with 0, like 001, we get a wrong partition 
> result.
>  
> *How to reproduce the problem:*
> CREATE TABLE partitiontb04 (id INT, c_string STRING) STORED AS orc; 
>  insert into table partitiontb04 values (10001,'test1');
>  CREATE TABLE orc_part03(id INT, c_string STRING) partitioned by (p_int int) 
> STORED AS orc;
>  insert into table orc_part03 partition (p_int=001) select * from 
> partitiontb04 where id < 10006;
>  show partitions orc_part03;
> Expected result:
> p_int=001
>  
> Actual result:
> p_int=1
>  
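
A one-line sketch offered for context (an assumption, not from the report) showing why the leading zeros disappear once the value is interpreted as the declared INT type:
{code:java}
spark.sql("SELECT CAST('001' AS INT) AS p_int").show()
// +-----+
// |p_int|
// +-----+
// |    1|
// +-----+
{code}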



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35567) Explain cost is not showing statistics for all the nodes

2021-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35567:
---

Assignee: shahid

> Explain cost is not showing statistics for all the nodes
> 
>
> Key: SPARK-35567
> URL: https://issues.apache.org/jira/browse/SPARK-35567
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0, 3.1.2
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Attachments: image-2021-05-31-05-09-09-637.png
>
>
> The EXPLAIN COST command doesn't show statistics for all the nodes in most of 
> the TPC-DS queries.
> For example, Query 1:
> !image-2021-05-31-05-09-09-637.png!
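
A hedged sketch of how such a report is typically produced (the table name is a placeholder; whether statistics appear for every node is exactly what this issue tracks):
{code:java}
// Enable the cost-based optimizer so computed statistics are available,
// then request the cost-annotated plan.
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE some_table COMPUTE STATISTICS")   // some_table is a placeholder
spark.sql("SELECT * FROM some_table WHERE id > 10").explain("cost")
{code}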



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35567) Explain cost is not showing statistics for all the nodes

2021-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35567.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

> Explain cost is not showing statistics for all the nodes
> 
>
> Key: SPARK-35567
> URL: https://issues.apache.org/jira/browse/SPARK-35567
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0, 3.1.2
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: image-2021-05-31-05-09-09-637.png
>
>
> The EXPLAIN COST command doesn't show statistics for all the nodes in most of 
> the TPC-DS queries.
> For example, Query 1:
> !image-2021-05-31-05-09-09-637.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35578) Add a test case for a janino bug

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354508#comment-17354508
 ] 

Apache Spark commented on SPARK-35578:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32716

> Add a test case for a janino bug
> 
>
> Key: SPARK-35578
> URL: https://issues.apache.org/jira/browse/SPARK-35578
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35578) Add a test case for a janino bug

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354507#comment-17354507
 ] 

Apache Spark commented on SPARK-35578:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32716

> Add a test case for a janino bug
> 
>
> Key: SPARK-35578
> URL: https://issues.apache.org/jira/browse/SPARK-35578
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35578) Add a test case for a janino bug

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35578:


Assignee: Apache Spark

> Add a test case for a janino bug
> 
>
> Key: SPARK-35578
> URL: https://issues.apache.org/jira/browse/SPARK-35578
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35578) Add a test case for a janino bug

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35578:


Assignee: (was: Apache Spark)

> Add a test case for a janino bug
> 
>
> Key: SPARK-35578
> URL: https://issues.apache.org/jira/browse/SPARK-35578
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35579) Fix a bug in janino or work around it in Spark.

2021-05-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-35579:

Priority: Critical  (was: Major)

> Fix a bug in janino or work around it in Spark.
> ---
>
> Key: SPARK-35579
> URL: https://issues.apache.org/jira/browse/SPARK-35579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Critical
>
> See the test in SPARK-35578



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35576) Redact the sensitive info in the result of Set command

2021-05-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35576:
---
Affects Version/s: 3.1.2

> Redact the sensitive info in the result of Set command
> --
>
> Key: SPARK-35576
> URL: https://issues.apache.org/jira/browse/SPARK-35576
> Project: Spark
>  Issue Type: Task
>  Components: Security, SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the results of the following SQL queries are not redacted:
> ```
> SET [KEY];
> SET;
> ```
> For example:
> {code:java}
> scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
> ++--+
> | key| value|
> ++--+
> |javax.jdo.option|123456|
> ++--+
> scala> spark.sql("set").show()
> +++
> | key|   value|
> +++
> |javax.jdo.option|  123456|
> {code}
> We should hide the sensitive information and redact the query output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35579) Fix a bug in janino or work around it in Spark.

2021-05-31 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35579:
---

 Summary: Fix a bug in janino or work around it in Spark.
 Key: SPARK-35579
 URL: https://issues.apache.org/jira/browse/SPARK-35579
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan


See the test in SPARK-35578



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35578) Add a test case for a janino bug

2021-05-31 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35578:
---

 Summary: Add a test case for a janino bug
 Key: SPARK-35578
 URL: https://issues.apache.org/jira/browse/SPARK-35578
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35557) Adapt uses of JDK 17 Internal APIs

2021-05-31 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated SPARK-35557:
-
Summary: Adapt uses of JDK 17 Internal APIs  (was: Adapt uses of JDK 17 
Internal APIs (Unsafe, etc))

> Adapt uses of JDK 17 Internal APIs
> --
>
> Key: SPARK-35557
> URL: https://issues.apache.org/jira/browse/SPARK-35557
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with 
> Spark 2.12.4 on Java 17 and I found this exception:
> {code:java}
> java.lang.ExceptionInInitializerError
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
> private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
> not "opens java.nio" to unnamed module @110df513
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:357)
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:297)
>  at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
>  at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
>  at org.apache.spark.unsafe.Platform. (Platform.java:56)
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
> {code}
> It seems that Java 17 will be more strict about uses of JDK Internals 
> [https://openjdk.java.net/jeps/403]
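
A hedged workaround sketch (an assumption, not the fix tracked here): explicitly opening the JDK modules that Spark's Platform code reaches into, via launch-time JVM options.
{code}
# Illustrative entries for conf/spark-defaults.conf; the exact set of
# --add-opens flags Spark needs on Java 17 may be larger than shown here.
spark.driver.extraJavaOptions    --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
spark.executor.extraJavaOptions  --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
{code}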



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35557) Adapt uses of JDK 17 Internal APIs (Unsafe, etc)

2021-05-31 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated SPARK-35557:
-
Description: 
I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with Spark 
2.12.4 on Java 17 and I found this exception:
{code:java}
java.lang.ExceptionInInitializerError
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
...
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
not "opens java.nio" to unnamed module @110df513
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:357)
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:297)
 at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
 at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
 at org.apache.spark.unsafe.Platform. (Platform.java:56)
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
{code}
It seems that Java 17 will be more strict about uses of JDK Internals 
[https://openjdk.java.net/jeps/403]

  was:
I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with Spark 
2.13 on Java 17 and I found this exception:
{code:java}
java.lang.ExceptionInInitializerError
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
...
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
not "opens java.nio" to unnamed module @110df513
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:357)
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:297)
 at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
 at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
 at org.apache.spark.unsafe.Platform. (Platform.java:56)
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
{code}
It seems that Java 17 will be more strict about uses of JDK Internals 
[https://openjdk.java.net/jeps/403]


> Adapt uses of JDK 17 Internal APIs (Unsafe, etc)
> 
>
> Key: SPARK-35557
> URL: https://issues.apache.org/jira/browse/SPARK-35557
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with 
> Spark 2.12.4 on Java 17 and I found this exception:
> {code:java}
> java.lang.ExceptionInInitializerError
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
> private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
> not "opens java.nio" to unnamed module @110df513
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:357)
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:297)
>  at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
>  at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
>  at org.apache.spark.unsafe.Platform. (Platform.java:56)
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
> {code}
> It seems that Java 17 will be more strict about uses of JDK Internals 
> [https://openjdk.java.net/jeps/403]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-31 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354447#comment-17354447
 ] 

Adam Binford commented on SPARK-35564:
--

>Do you mean "Create a subexpression if an expression will always be evaluated 
>at least once AND will be evaluated at least once in conditional expression"?

Yeah you can think of it that way in terms of adding to existing functionality. 
I was trying to word it in a way that encompassed existing functionality as 
well.

>And this looks like a corner case, so I'm not sure if it is worth to do this.

I don't really think this is much of a corner case, but a common case of using 
a when expression for data validation. Most of our ETL process comes down to 
normalizing, cleaning, and validating strings, which at the end of the day 
usually looks like:
{code:java}
column = normalize_value(col('my_raw_value'))
result = when(column != '', column){code}
where "normalize_value" usually involves some combination of regexp_repace's, 
lower/upper, and trim.

And things get worse when you are dealing with arrays of strings and want to 
minimize your data:
{code:java}
column = filter(transform(col('my_raw_array_value'), lambda x: 
normalize_value(x)), lambda x: x != '')
result = when(size(column) > 0, column){code}
though currently higher-order functions are always semantically different, so I 
think they don't get subexpressions regardless. That's something I plan to look 
into as a follow-up.

It's natural for users to think that these expressions only get evaluated once, 
not that they are doubling their runtime trying to clean their data. To me, the 
edge case is the one where creating a subexpression here actually decreases 
throughput: it would require a very large percentage of the rows to fail the 
conditional check, since the additional calculation is much more expensive than 
the additional function call. I'm playing around with an implementation, so 
we'll see how far I can get with it.

 

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35557) Adapt uses of JDK 17 Internal APIs (Unsafe, etc)

2021-05-31 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated SPARK-35557:
-
Description: 
I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with Spark 
2.13 on Java 17 and I found this exception:
{code:java}
java.lang.ExceptionInInitializerError
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
...
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
not "opens java.nio" to unnamed module @110df513
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:357)
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:297)
 at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
 at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
 at org.apache.spark.unsafe.Platform. (Platform.java:56)
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
{code}
It seems that Java 17 will be more strict about uses of JDK Internals 
[https://openjdk.java.net/jeps/403]

  was:
I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with Spark 
2.13 on Java 17 and I found this exception:

{code:borderStyle=solid}
java.lang.ExceptionInInitializerError
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
...
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
not "opens java.nio" to unnamed module @110df513
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:357)
 at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
(AccessibleObject.java:297)
 at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
 at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
 at org.apache.spark.unsafe.Platform. (Platform.java:56)
 at org.apache.spark.unsafe.array.ByteArrayMethods. 
(ByteArrayMethods.java:54)
 at org.apache.spark.internal.config.package$. (package.scala:1149)
 at org.apache.spark.SparkConf$. (SparkConf.scala:654)
 at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
{code}

Not sure if this is the case here but it seems that Java 17 will be more strict 
about uses of JDK Internals https://openjdk.java.net/jeps/403


> Adapt uses of JDK 17 Internal APIs (Unsafe, etc)
> 
>
> Key: SPARK-35557
> URL: https://issues.apache.org/jira/browse/SPARK-35557
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with 
> Spark 2.13 on Java 17 and I found this exception:
> {code:java}
> java.lang.ExceptionInInitializerError
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
> private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
> not "opens java.nio" to unnamed module @110df513
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:357)
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:297)
>  at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
>  at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
>  at org.apache.spark.unsafe.Platform. (Platform.java:56)
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
> {code}
> It seems that Java 17 will be more strict about uses of JDK Internals 
> [https://openjdk.java.net/jeps/403]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (SPARK-35557) Adapt uses of JDK 17 Internal APIs (Unsafe, etc)

2021-05-31 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated SPARK-35557:
-
Summary: Adapt uses of JDK 17 Internal APIs (Unsafe, etc)  (was: Adapt uses 
of JDK Internal APIs (Unsafe, etc))

> Adapt uses of JDK 17 Internal APIs (Unsafe, etc)
> 
>
> Key: SPARK-35557
> URL: https://issues.apache.org/jira/browse/SPARK-35557
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with 
> Spark 2.13 on Java 17 and I found this exception:
> {code:borderStyle=solid}
> java.lang.ExceptionInInitializerError
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)
> ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make 
> private java.nio.DirectByteBuffer(long,int) accessible: module java.base does 
> not "opens java.nio" to unnamed module @110df513
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:357)
>  at java.lang.reflect.AccessibleObject.checkCanSetAccessible 
> (AccessibleObject.java:297)
>  at java.lang.reflect.Constructor.checkCanSetAccessible (Constructor.java:188)
>  at java.lang.reflect.Constructor.setAccessible (Constructor.java:181)
>  at org.apache.spark.unsafe.Platform. (Platform.java:56)
>  at org.apache.spark.unsafe.array.ByteArrayMethods. 
> (ByteArrayMethods.java:54)
>  at org.apache.spark.internal.config.package$. (package.scala:1149)
>  at org.apache.spark.SparkConf$. (SparkConf.scala:654)
>  at org.apache.spark.SparkConf.contains (SparkConf.scala:455)}}
> {code}
> Not sure if this is the case here but it seems that Java 17 will be more 
> strict about uses of JDK Internals https://openjdk.java.net/jeps/403



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35577) Allow to log container output for docker integration tests

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354429#comment-17354429
 ] 

Apache Spark commented on SPARK-35577:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32715

> Allow to log container output for docker integration tests
> --
>
> Key: SPARK-35577
> URL: https://issues.apache.org/jira/browse/SPARK-35577
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, the Docker integration tests don't log their container 
> output.
> Having the container logs would be useful for debugging, especially on GA 
> (GitHub Actions).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35577) Allow to log container output for docker integration tests

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35577:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Allow to log container output for docker integration tests
> --
>
> Key: SPARK-35577
> URL: https://issues.apache.org/jira/browse/SPARK-35577
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, the Docker integration tests don't log their container 
> output.
> Having the container logs would be useful for debugging, especially on GA 
> (GitHub Actions).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35577) Allow to log container output for docker integration tests

2021-05-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35577:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Allow to log container output for docker integration tests
> --
>
> Key: SPARK-35577
> URL: https://issues.apache.org/jira/browse/SPARK-35577
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In the current master, the Docker integration tests don't log their container 
> output.
> Having the container logs would be useful for debugging, especially on GA 
> (GitHub Actions).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35577) Allow to log container output for docker integration tests

2021-05-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354428#comment-17354428
 ] 

Apache Spark commented on SPARK-35577:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32715

> Allow to log container output for docker integration tests
> --
>
> Key: SPARK-35577
> URL: https://issues.apache.org/jira/browse/SPARK-35577
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, the Docker integration tests don't log their container 
> output.
> Having the container logs would be useful for debugging, especially on GA 
> (GitHub Actions).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


