[jira] [Created] (SPARK-34326) "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" fails in some edge-case

2021-02-01 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-34326:


 Summary: "SPARK-31793: FileSourceScanExec metadata should contain 
limited file paths" fails in some edge-case
 Key: SPARK-34326
 URL: https://issues.apache.org/jira/browse/SPARK-34326
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Jungtaek Lim


Our internal build failed with this test, and it looks like the calculation in 
the UT misses some aspects of the location format.






[jira] [Commented] (SPARK-34326) "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" fails in some edge-case

2021-02-01 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276918#comment-17276918
 ] 

Jungtaek Lim commented on SPARK-34326:
--

Will provide a PR shortly.

> "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" 
> fails in some edge-case
> 
>
> Key: SPARK-34326
> URL: https://issues.apache.org/jira/browse/SPARK-34326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Our internal build failed with this test, and it looks like the calculation 
> in the UT misses some aspects of the location format.






[jira] [Commented] (SPARK-34293) kubernetes executor pod unable to access secure hdfs

2021-02-01 Thread Manohar Chamaraju (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276913#comment-17276913
 ] 

Manohar Chamaraju commented on SPARK-34293:
---

Update:
 # In client mode, adding fs.defaultFS to core-site.xml fixed the issue for me 
(a sketch of an equivalent configuration is shown below).
 # What did not work was using the hadoop-conf configmap in client mode.
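A minimal sketch of an equivalent client-mode setup, assuming the namenode address from the reproduction steps below; it passes fs.defaultFS through Spark's spark.hadoop.* passthrough instead of editing core-site.xml, which should amount to the same thing for this purpose:

{code:python}
# Hedged sketch: equivalent of adding fs.defaultFS to core-site.xml, done via
# Spark's spark.hadoop.* config prefix. The namenode URI and path come from the
# reproduction steps and are purely illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("secure-hdfs-client-mode")
    .config("spark.hadoop.fs.defaultFS", "hdfs://hdfs-namenode:30820")
    .getOrCreate()
)

# Simple smoke test: read something back from the secure HDFS cluster.
spark.read.text("hdfs://hdfs-namenode:30820/staging-directory").show(5)
{code}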

> kubernetes executor pod unable to access secure hdfs
> 
>
> Key: SPARK-34293
> URL: https://issues.apache.org/jira/browse/SPARK-34293
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Manohar Chamaraju
>Priority: Major
> Attachments: driver.log, executor.log, 
> image-2021-01-30-00-13-18-234.png, image-2021-01-30-00-14-14-329.png, 
> image-2021-01-30-00-14-45-335.png, image-2021-01-30-00-20-54-620.png, 
> image-2021-01-30-00-33-02-109.png, image-2021-01-30-00-34-05-946.png
>
>
> Steps to reproduce
>  # Configure a secure HDFS (Kerberos) cluster running as containers in 
> Kubernetes.
>  # Configure a KDC on CentOS and create a keytab for the user principal hdfs, 
> in hdfsuser.keytab.
>  # Generate the Spark image (v3.0.1) and spawn a container from it.
>  # Inside the Spark container, run export HADOOP_CONF_DIR=/etc/hadoop/conf/ 
> with the core-site.xml configuration as below 
>  !image-2021-01-30-00-13-18-234.png!
>  # Create configmap kbr-conf 
>  !image-2021-01-30-00-14-14-329.png!
>  # Run the command /opt/spark/bin/spark-submit \
>  --deploy-mode client \
>  --executor-memory 1g\
>  --executor-memory 1g\
>  --executor-cores 1\
>  --class org.apache.spark.examples.HdfsTest \
>  --conf spark.kubernetes.namespace=arcsight-installer-lh7fm\
>  --master k8s://[https://172.17.17.1:443|https://172.17.17.1/] \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.app.name=spark-hdfs \
>  --conf spark.executer.instances=1 \
>  --conf spark.kubernetes.node.selector.spark=yes\
>  --conf spark.kubernetes.node.selector.Worker=label\
>  --conf spark.kubernetes.container.image=manohar/spark:v3.0.1 \
>  --conf spark.kubernetes.kerberos.enabled=true \
>  --conf spark.kubernetes.kerberos.krb5.configMapName=krb5-conf \
>  --conf spark.kerberos.keytab=/data/hdfsuser.keytab \
>  --conf spark.kerberos.principal=h...@dom047600.lab \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar \
>  hdfs://hdfs-namenode:30820/staging-directory.
>  # On running this command, the driver is able to connect to HDFS with 
> Kerberos, but the executor fails to connect to secure HDFS; the logs are below 
> !image-2021-01-30-00-34-05-946.png!
>  # Some observations:
>  ## In client mode, --conf spark.kubernetes.hadoop.configMapName=hadoop-conf 
> has no effect; it only works after HADOOP_CONF_DIR is set. Below are the 
> contents of the hadoop-conf configmap.
>  !image-2021-01-30-00-20-54-620.png!
>  ## Ran the command in cluster mode as well; there too, the executor could 
> not connect to secure HDFS.
>  






[jira] [Resolved] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines

2021-02-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34199.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31286
[https://github.com/apache/spark/pull/31286]

> Block `count(table.*)` to follow ANSI standard and other SQL engines
> 
>
> Key: SPARK-34199
> URL: https://issues.apache.org/jira/browse/SPARK-34199
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark, count(table.*) may produce a very weird result, for example:
> select count(*) from (select 1 as a, null as b) t;
> output: 1
> select count(t.*) from (select 1 as a, null as b) t;
> output: 0
>  
> After checking the ANSI standard, count(*) is always treated as count(1), 
> while count(t.*) is not allowed. What's more, this is also not allowed by 
> common databases, e.g. MySQL and Oracle.
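For reference, a minimal PySpark sketch of the pre-change behaviour described above; after this change, the count(t.*) form is expected to be rejected rather than returning 0:

{code:python}
# Hedged sketch reproducing the discrepancy described above on a pre-fix build.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-star-demo").getOrCreate()

# count(*) counts rows regardless of NULLs -> 1
spark.sql("SELECT count(*) FROM (SELECT 1 AS a, null AS b) t").show()

# count(t.*) expands to count(a, b), which skips rows containing NULLs -> 0
# before this change; blocked (query rejected) after it.
spark.sql("SELECT count(t.*) FROM (SELECT 1 AS a, null AS b) t").show()
{code}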






[jira] [Assigned] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines

2021-02-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34199:
---

Assignee: Linhong Liu

> Block `count(table.*)` to follow ANSI standard and other SQL engines
> 
>
> Key: SPARK-34199
> URL: https://issues.apache.org/jira/browse/SPARK-34199
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
>
> In Spark, count(table.*) may produce a very weird result, for example:
> select count(*) from (select 1 as a, null as b) t;
> output: 1
> select count(t.*) from (select 1 as a, null as b) t;
> output: 0
>  
> After checking the ANSI standard, count(*) is always treated as count(1), 
> while count(t.*) is not allowed. What's more, this is also not allowed by 
> common databases, e.g. MySQL and Oracle.






[jira] [Commented] (SPARK-33591) NULL is recognized as the "null" string in partition specs

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276908#comment-17276908
 ] 

Apache Spark commented on SPARK-33591:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31434

> NULL is recognized as the "null" string in partition specs
> --
>
> Key: SPARK-33591
> URL: https://issues.apache.org/jira/browse/SPARK-33591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED 
> BY (p1);
> spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
> spark-sql> SELECT isnull(p1) FROM tbl5;
> false
> {code}
> The *p1 = null* spec is not recognized as a partition with a NULL value.






[jira] [Commented] (SPARK-33591) NULL is recognized as the "null" string in partition specs

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276907#comment-17276907
 ] 

Apache Spark commented on SPARK-33591:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31434

> NULL is recognized as the "null" string in partition specs
> --
>
> Key: SPARK-33591
> URL: https://issues.apache.org/jira/browse/SPARK-33591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED 
> BY (p1);
> spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
> spark-sql> SELECT isnull(p1) FROM tbl5;
> false
> {code}
> The *p1 = null* spec is not recognized as a partition with a NULL value.






[jira] [Assigned] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34319:


Assignee: wuyi

> Self-join after cogroup applyInPandas fails due to unresolved conflicting 
> attributes
> 
>
> Key: SPARK-34319
> URL: https://issues.apache.org/jira/browse/SPARK-34319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
>  
> {code:java}
> df = spark.createDataFrame([(1, 1)], ("column", "value"))
> row = df.groupby("ColUmn").cogroup(
> df.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, "column long, value long")
> row.join(row).show()
> {code}
> {code:java}
> Conflicting attributes: column#163321L,value#163322L
> ;;
> 'Join Inner
> :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
> :  :- Project [ColUmn#163312L, column#163312L, value#163313L]
> :  :  +- LogicalRDD [column#163312L, value#163313L], false
> :  +- Project [COLUMN#163312L, column#163312L, value#163313L]
> : +- LogicalRDD [column#163312L, value#163313L], false
> +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
>    :- Project [ColUmn#163312L, column#163312L, value#163313L]
>    :  +- LogicalRDD [column#163312L, value#163313L], false
>    +- Project [COLUMN#163312L, column#163312L, value#163313L]
>   +- LogicalRDD [column#163312L, value#163313L], false
> {code}
>  
>  






[jira] [Resolved] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34319.
--
Fix Version/s: 3.1.2
   3.0.2
   Resolution: Fixed

Issue resolved by pull request 31429
[https://github.com/apache/spark/pull/31429]

> Self-join after cogroup applyInPandas fails due to unresolved conflicting 
> attributes
> 
>
> Key: SPARK-34319
> URL: https://issues.apache.org/jira/browse/SPARK-34319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.2, 3.1.2
>
>
>  
> {code:java}
> df = spark.createDataFrame([(1, 1)], ("column", "value"))
> row = df.groupby("ColUmn").cogroup(
> df.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, "column long, value long")
> row.join(row).show()
> {code}
> {code:java}
> Conflicting attributes: column#163321L,value#163322L
> ;;
> 'Join Inner
> :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
> :  :- Project [ColUmn#163312L, column#163312L, value#163313L]
> :  :  +- LogicalRDD [column#163312L, value#163313L], false
> :  +- Project [COLUMN#163312L, column#163312L, value#163313L]
> : +- LogicalRDD [column#163312L, value#163313L], false
> +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
>    :- Project [ColUmn#163312L, column#163312L, value#163313L]
>    :  +- LogicalRDD [column#163312L, value#163313L], false
>    +- Project [COLUMN#163312L, column#163312L, value#163313L]
>   +- LogicalRDD [column#163312L, value#163313L], false
> {code}
>  
>  






[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-01 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276893#comment-17276893
 ] 

L. C. Hsieh commented on SPARK-34198:
-

Thanks [~kabhwan] for your point.

Besides the maintenance cost of the extra code, I remember one concern about 
adding it is the RocksDB dependency. I think that concern is valid, so there 
actually is a difference between putting it in the sql core module and shipping 
it as an external module. IIUC, that is why we have external modules.

If raising a discussion on the dev mailing list helps, I think I will do it.

The RocksDB StateStore we are working with is also based on the existing 
implementation, plus our bug fix, so the review cost should be as low as 
possible even if we submit the changed code. Of course, if the original author 
can contribute the code, that would be great too. And sure, this depends on the 
consensus we eventually reach.











> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.
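For context, a hedged sketch of how an alternative StateStore implementation is plugged into Structured Streaming via the existing spark.sql.streaming.stateStore.providerClass setting; the RocksDB provider class name below is hypothetical, since the module discussed in this thread has not been merged:

{code:python}
# Hedged sketch: selecting a non-default StateStore provider. The default is
# the built-in HDFSBackedStateStoreProvider mentioned above; the RocksDB class
# name here is illustrative only.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.example.state.RocksDBStateStoreProvider",  # hypothetical provider class
)
{code}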






[jira] [Closed] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2021-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-29220.
-

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> 

[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276874#comment-17276874
 ] 

Attila Zsolt Piros edited comment on SPARK-34194 at 2/2/21, 6:59 AM:
-

Yes, that is the reason.

[~nchammas] so based on this you should consider closing this issue. 


was (Author: attilapiros):
Yes.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069] (sketched below). It 
> works, but it is a lot of extra work compared to the elegant query against 
> {{file_date}} that users actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.
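A hedged sketch of the catalog-based workaround linked above, assuming the dataset is registered in the metastore as a partitioned table named some_db.some_dataset (a path-only dataset would first need to be registered):

{code:python}
# Hedged sketch: list partitions from the catalog instead of scanning files.
# The table name is illustrative.
partitions = spark.sql("SHOW PARTITIONS some_db.some_dataset")

# Each row looks like 'file_date=2017-05-01'; extract the value and take the max.
latest = (
    partitions
    .selectExpr("split(partition, '=')[1] AS file_date")
    .orderBy("file_date", ascending=False)
    .limit(1)
)
latest.show()
{code}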






[jira] [Resolved] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2021-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29220.
---
Resolution: Duplicate

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> 

[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2021-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276891#comment-17276891
 ] 

Dongjoon Hyun commented on SPARK-29220:
---

I agree with you, [~attilapiros]. 

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> 

[jira] [Commented] (SPARK-33734) Spark Core ::Spark core versions upto 3.0.1 using interdependency on Jackson-core-asl version 1.9.13, which is having security issues reported.

2021-02-01 Thread Aparna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276890#comment-17276890
 ] 

Aparna commented on SPARK-33734:


Hi,

Please provide an update on this: spark-core 3.1.0 is also using 
[org.apache.avro|https://mvnrepository.com/artifact/org.apache.avro] version 
1.8.2, which depends on 
[jackson-core-asl|https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl]
 version 1.9.13.
Details of the security issues were shared in previous comments. Please update 
on the same.

> Spark Core ::Spark core versions upto 3.0.1 using interdependency on 
> Jackson-core-asl version 1.9.13, which is having security issues reported. 
> 
>
> Key: SPARK-33734
> URL: https://issues.apache.org/jira/browse/SPARK-33734
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Aparna
>Priority: Major
>
> spark-core versions up to the latest 3.0.1 use the dependency 
> [org.apache.avro|https://mvnrepository.com/artifact/org.apache.avro] version 
> 1.8.2, which depends on 
> [jackson-core-asl|https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl]
>  version 1.9.13, which has reported security issues.
> Please fix and share the new version.






[jira] [Commented] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276887#comment-17276887
 ] 

Apache Spark commented on SPARK-34325:
--

User 'offthewall123' has created a pull request for this issue:
https://github.com/apache/spark/pull/31433

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used, can remove it.






[jira] [Assigned] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34325:


Assignee: Apache Spark

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Assignee: Apache Spark
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used, can remove it.






[jira] [Commented] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276886#comment-17276886
 ] 

Apache Spark commented on SPARK-34325:
--

User 'offthewall123' has created a pull request for this issue:
https://github.com/apache/spark/pull/31433

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used, can remove it.






[jira] [Assigned] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34325:


Assignee: (was: Apache Spark)

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used, can remove it.






[jira] [Assigned] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34325:


Assignee: Apache Spark

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Assignee: Apache Spark
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used, can remove it.






[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276885#comment-17276885
 ] 

Dongjoon Hyun commented on SPARK-34309:
---

Oh my. :(

> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava Cache, but with better performance; the 
> comparison results are on the [caffeine benchmarks 
> |https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  






[jira] [Updated] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Xudingyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudingyu updated SPARK-34325:
-
Description: shuffleBlockResolver in SortShuffleWriter is not used, can 
remove it.  (was: shuffleBlockResolver in SortShuffleWriter is not used.)

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used, can remove it.






[jira] [Updated] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables

2021-02-01 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-34322:

Description: 
For a view, there might be several underlying tables.

In long-running Spark server use cases, such as Zeppelin, Kyuubi, or Livy, if a 
table is updated we need to refresh it in the current long-running Spark 
session.

But if the table is a view, we need to refresh its underlying tables one by one 
(a manual sketch of this is shown below).

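A hedged sketch of the manual refresh this issue wants to automate; the database, table, and view names are illustrative:

{code:python}
# Hedged sketch: today, each underlying table of a non-temporary view has to be
# refreshed by hand before re-querying the view. Names are illustrative.
for table in ["db.underlying_table_a", "db.underlying_table_b"]:
    spark.catalog.refreshTable(table)

# Only then re-query the view so it picks up the new data/files.
spark.table("db.some_view").show()
{code}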

> When refreshing a non-temporary view, also refresh its underlying tables
> 
>
> Key: SPARK-34322
> URL: https://issues.apache.org/jira/browse/SPARK-34322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Priority: Major
>
> For a view, there might be several underlying tables.
> In long-running Spark server use cases, such as Zeppelin, Kyuubi, or Livy, if 
> a table is updated we need to refresh it in the current long-running Spark 
> session.
> But if the table is a view, we need to refresh its underlying tables one by 
> one.






[jira] [Updated] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Xudingyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudingyu updated SPARK-34325:
-
Description: shuffleBlockResolver in SortShuffleWriter is not used.

> remove_shuffleBlockResolver_in_SortShuffleWriter
> 
>
> Key: SPARK-34325
> URL: https://issues.apache.org/jira/browse/SPARK-34325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xudingyu
>Priority: Major
>
> shuffleBlockResolver in SortShuffleWriter is not used.






[jira] [Created] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter

2021-02-01 Thread Xudingyu (Jira)
Xudingyu created SPARK-34325:


 Summary: remove_shuffleBlockResolver_in_SortShuffleWriter
 Key: SPARK-34325
 URL: https://issues.apache.org/jira/browse/SPARK-34325
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Xudingyu









[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-01 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276883#comment-17276883
 ] 

Jungtaek Lim commented on SPARK-34198:
--

The external modules mean modules in the external directory.

Personally, I don't think there's a huge difference between adding it to the 
spark-sql core module and adding it via an external module. The major point is 
whether we want to add the functionality to the Spark codebase at all. As we 
already confirmed there are concerns about adding this to the Spark codebase, 
unless you raise the discussion on the dev@ mailing list and gather consensus, 
the effort can easily be wasted. Please make sure we don't end up in that 
situation.

And once we decide to add this, I'd rather see us either persuade the repo 
owner to contribute the well-known existing implementation 
(https://github.com/chermenin/spark-states) to the ASF, or open a new PR based 
on #24922. I wouldn't like to review multiple PRs again and again for the same 
functionality.



> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.






[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-01 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276882#comment-17276882
 ] 

L. C. Hsieh commented on SPARK-34198:
-

By external module here, I mean putting the related code under external/ along 
with the other external modules like avro, kafka-0-10-sql, etc.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.






[jira] [Assigned] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34324:


Assignee: Apache Spark  (was: L. C. Hsieh)

> FileTable should not list TRUNCATE in capabilities by default
> -
>
> Key: SPARK-34324
> URL: https://issues.apache.org/jira/browse/SPARK-34324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, 
> but {{FileTable}} does not know if an implementation really supports 
> truncation or not. Specifically, we can check existing {{FileTable}} 
> implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No 
> one implementation really implements {{SupportsTruncate}} in its writer 
> builder.






[jira] [Assigned] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34324:


Assignee: L. C. Hsieh  (was: Apache Spark)

> FileTable should not list TRUNCATE in capabilities by default
> -
>
> Key: SPARK-34324
> URL: https://issues.apache.org/jira/browse/SPARK-34324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, 
> but {{FileTable}} does not know if an implementation really supports 
> truncation or not. Specifically, we can check existing {{FileTable}} 
> implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No 
> one implementation really implements {{SupportsTruncate}} in its writer 
> builder.






[jira] [Commented] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276880#comment-17276880
 ] 

Apache Spark commented on SPARK-34324:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31432

> FileTable should not list TRUNCATE in capabilities by default
> -
>
> Key: SPARK-34324
> URL: https://issues.apache.org/jira/browse/SPARK-34324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, 
> but {{FileTable}} does not know if an implementation really supports 
> truncation or not. Specifically, we can check existing {{FileTable}} 
> implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No 
> one implementation really implements {{SupportsTruncate}} in its writer 
> builder.






[jira] [Created] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default

2021-02-01 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-34324:
---

 Summary: FileTable should not list TRUNCATE in capabilities by 
default
 Key: SPARK-34324
 URL: https://issues.apache.org/jira/browse/SPARK-34324
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, 
but {{FileTable}} does not know if an implementation really supports truncation 
or not. Specifically, we can check existing {{FileTable}} implementations 
including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No one 
implementation really implements {{SupportsTruncate}} in its writer builder.






[jira] [Commented] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276878#comment-17276878
 ] 

Apache Spark commented on SPARK-34322:
--

User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/31431

> When refreshing a non-temporary view, also refresh its underlying tables
> 
>
> Key: SPARK-34322
> URL: https://issues.apache.org/jira/browse/SPARK-34322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Priority: Major
>







[jira] [Assigned] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34322:


Assignee: (was: Apache Spark)

> When refreshing a non-temporary view, also refresh its underlying tables
> 
>
> Key: SPARK-34322
> URL: https://issues.apache.org/jira/browse/SPARK-34322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Priority: Major
>







[jira] [Commented] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276877#comment-17276877
 ] 

Apache Spark commented on SPARK-34322:
--

User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/31431

> When refreshing a non-temporary view, also refresh its underlying tables
> 
>
> Key: SPARK-34322
> URL: https://issues.apache.org/jira/browse/SPARK-34322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Priority: Major
>







[jira] [Assigned] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34322:


Assignee: Apache Spark

> When refreshing a non-temporary view, also refresh its underlying tables
> 
>
> Key: SPARK-34322
> URL: https://issues.apache.org/jira/browse/SPARK-34322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34316) Support spark.kubernetes.executor.disableConfigMap

2021-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34316:
--
Summary: Support spark.kubernetes.executor.disableConfigMap  (was: Optional 
Propagation of SPARK_CONF_DIR in K8s)

> Support spark.kubernetes.executor.disableConfigMap
> --
>
> Key: SPARK-34316
> URL: https://issues.apache.org/jira/browse/SPARK-34316
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> In shared Kubernetes clusters, Spark could be restricted from creating and 
> deleting config maps in job namespaces.
> It would be helpful if the currently mandatory config map creation could be 
> made optional. Users may still take responsibility for handling Spark conf files 
> separately.
>  
>  
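A minimal sketch of how such an option might be supplied, assuming the configuration 
key from the updated summary (spark.kubernetes.executor.disableConfigMap); the 
application name is a placeholder.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical usage: skip creating the executor config map in restricted namespaces.
val conf = new SparkConf()
  .setAppName("k8s-app")
  .set("spark.kubernetes.executor.disableConfigMap", "true")
{code}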



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s

2021-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34316:
-

Assignee: Dongjoon Hyun

> Optional Propagation of SPARK_CONF_DIR in K8s
> -
>
> Key: SPARK-34316
> URL: https://issues.apache.org/jira/browse/SPARK-34316
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
>
> In shared Kubernetes clusters, Spark could be restricted from creating and 
> deleting config maps in job namespaces.
> It would be helpful if the currently mandatory config map creation could be 
> made optional. Users may still take responsibility for handling Spark conf files 
> separately.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s

2021-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34316.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31428
[https://github.com/apache/spark/pull/31428]

> Optional Propagation of SPARK_CONF_DIR in K8s
> -
>
> Key: SPARK-34316
> URL: https://issues.apache.org/jira/browse/SPARK-34316
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> In shared Kubernetes clusters, Spark could be restricted from creating and 
> deleting config maps in job namespaces.
> It would be helpful if the currently mandatory config map creation could be 
> made optional. Users may still take responsibility for handling Spark conf files 
> separately.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34314) Wrong discovered partition value

2021-02-01 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-34314:
---
Affects Version/s: 3.1.0
   3.0.2
   2.4.8

> Wrong discovered partition value
> 
>
> Key: SPARK-34314
> URL: https://issues.apache.org/jira/browse/SPARK-34314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The example below illustrates the issue:
> {code:scala}
>   val df = Seq((0, "AA"), (1, "-0")).toDF("id", "part")
>   df.write
> .partitionBy("part")
> .format("parquet")
> .save(path)
>   val readback = spark.read.parquet(path)
>   readback.printSchema()
>   readback.show(false)
> {code}
> It writes the partition value as a string:
> {code}
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
> ├── _SUCCESS
> ├── part=-0
> │   └── part-1-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> └── part=AA
> └── part-0-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> {code}
> *"-0"* and "AA".
> but when Spark reads data back, it transforms "-0" to "0"
> {code}
> root
>  |-- id: integer (nullable = true)
>  |-- part: string (nullable = true)
> +---++
> |id |part|
> +---++
> |0  |AA  |
> |1  |0   |
> +---++
> {code}
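A minimal sketch of a possible workaround (not the fix itself), assuming the same 
{{path}} as above: with partition type inference disabled, the discovered value 
should stay the literal string "-0".

{code:scala}
// Workaround sketch: keep partition values as strings instead of inferring numeric types.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

val readback = spark.read.parquet(path)
readback.show(false)  // expected to print "-0" rather than "0" with inference disabled
{code}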



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-01 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276875#comment-17276875
 ] 

Cheng Su commented on SPARK-34198:
--

Sorry, I think my question was not very clear. I am aware of 
[https://github.com/apache/spark/pull/24922] and the concern that was raised back 
then. What I am unsure about is what "external module" means here. Do you mind 
explaining that a bit more (or is there an existing example in the current codebase 
I can refer to)?

More context: we are also working on a RocksDB state store internally, based on the 
PR above, so I would like to know whether there is anything here we should watch out 
for when backporting. Thanks.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> streaming applications multiply, some of them need to keep large state in 
> stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large-state usage. But 
> Spark SS still lacks a built-in state store that meets this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS. To address the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276874#comment-17276874
 ] 

Attila Zsolt Piros commented on SPARK-34194:


Yes.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but it is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.
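A minimal sketch of the catalog-based workaround mentioned above, assuming the 
dataset is registered in the metastore as a partitioned table named {{some_table}} 
(a hypothetical name).

{code:scala}
import org.apache.spark.sql.functions.max

// SHOW PARTITIONS reads only partition metadata from the catalog, so it does not
// touch the individual data files.
val latest = spark.sql("SHOW PARTITIONS some_table")
  .agg(max("partition"))
  .collect()(0)
  .getString(0)  // e.g. "file_date=2017-05-03"
{code}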



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34323:


Assignee: (was: Apache Spark)

> Upgrade zstd-jni to 1.4.8-3
> ---
>
> Key: SPARK-34323
> URL: https://issues.apache.org/jira/browse/SPARK-34323
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276873#comment-17276873
 ] 

Apache Spark commented on SPARK-34323:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31430

> Upgrade zstd-jni to 1.4.8-3
> ---
>
> Key: SPARK-34323
> URL: https://issues.apache.org/jira/browse/SPARK-34323
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34323:


Assignee: Apache Spark

> Upgrade zstd-jni to 1.4.8-3
> ---
>
> Key: SPARK-34323
> URL: https://issues.apache.org/jira/browse/SPARK-34323
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3

2021-02-01 Thread William Hyun (Jira)
William Hyun created SPARK-34323:


 Summary: Upgrade zstd-jni to 1.4.8-3
 Key: SPARK-34323
 URL: https://issues.apache.org/jira/browse/SPARK-34323
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables

2021-02-01 Thread feiwang (Jira)
feiwang created SPARK-34322:
---

 Summary: When refreshing a non-temporary view, also refresh its 
underlying tables
 Key: SPARK-34322
 URL: https://issues.apache.org/jira/browse/SPARK-34322
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: feiwang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869
 ] 

Nicholas Chammas edited comment on SPARK-34194 at 2/2/21, 5:56 AM:
---

Interesting reference, [~attilapiros]. It looks like that config is internal to 
Spark and was [deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.


was (Author: nchammas):
Interesting reference, [~attilapiros]. It looks like that config was 
[deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but it is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-02-01 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-34295:
---

Assignee: L. C. Hsieh

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by specifying the hosts with configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=,,..,.
> Spark seems to lack a similar option, so job submission fails if YARN fails to 
> renew the DelegationToken for any of the remote HDFS clusters. The DT renewal can 
> fail for many reasons, for example when the remote HDFS does not trust the 
> Kerberos identity of YARN.
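A minimal sketch of what a Spark analogue of the MapReduce setting might look like; 
the configuration key and host list below are purely hypothetical, since no such 
option exists yet.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical Spark counterpart of mapreduce.job.hdfs-servers.token-renewal.exclude;
// the key name and cluster addresses are placeholders, not an existing configuration.
val conf = new SparkConf()
  .set("spark.yarn.hdfs-servers.token-renewal.exclude",
    "hdfs://remote-cluster-1,hdfs://remote-cluster-2")
{code}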



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869
 ] 

Nicholas Chammas commented on SPARK-34194:
--

Interesting reference, [~attilapiros]. It looks like that config was 
[deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but it is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34321) Fix the guarantee of foreachBatch

2021-02-01 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276865#comment-17276865
 ] 

L. C. Hsieh commented on SPARK-34321:
-

Err...I made a mistake when reading the document and code. This is invalid.

> Fix the guarantee of foreachBatch
> -
>
> Key: SPARK-34321
> URL: https://issues.apache.org/jira/browse/SPARK-34321
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Similar to SPARK-28650, the {{foreachBatch}} API documentation also states the 
> guarantee:
> The batchId can be used to deduplicate and transactionally write the output 
> (that is, the provided Dataset) to external systems. The output Dataset is 
> guaranteed to be exactly the same for the same batchId
> But, for the same reason the {{ForeachWriter}} documentation was fixed in 
> SPARK-28650, this guarantee is easy to break, for example by changing the number 
> of partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12497) thriftServer does not support semicolon in sql

2021-02-01 Thread xinzhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276864#comment-17276864
 ] 

xinzhang edited comment on SPARK-12497 at 2/2/21, 5:44 AM:
---

[~kabhwan]

Sorry for the mixed-up tests.

Please recheck the new test:
 # It works with Spark 3.0.0. (BTW: the semicolon is handled fine in beeline.)
 # It is still a bug with Spark 2.4.7.

[root@actuatorx-dispatcher-172-25-48-173 spark]# env|grep spark
 SPARK_HOME=/opt/spark/spark-bin
 
PATH=/root/perl5/bin:/opt/scala/scala-bin//bin:/opt/spark/spark-bin/bin:172.25.52.34:/opt/hive/hive-bin/bin/:172.31.10.86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/swosbf/bin:/usr/local/swosbf/bin/system:/usr/java/jdk/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/root/bin
 PWD=/opt/spark
 [root@actuatorx-dispatcher-172-25-48-173 spark]# ll
 total 4
 -rw-r--r-- 1 root root 646 Feb 1 17:44 derby.log
 drwxr-xr-x 5 root root 133 Feb 1 17:44 metastore_db
 drwxr-xr-x 14 root root 255 Sep 22 13:57 spark-2.3.0-bin-hadoop2.6
 drwxr-xr-x 14 1000 1000 240 Feb 2 13:32 spark-2.4.7-bin-hadoop2.6
 drwxr-xr-x 14 root root 240 Feb 2 13:26 spark-3.0.0-bin-hadoop2.7
 lrwxrwxrwx 1 root root 25 Feb 1 15:42 spark-bin -> spark-2.4.7-bin-hadoop2.6
 [root@actuatorx-dispatcher-172-25-48-173 spark]# jps
 3348544 RunJar
 3354564 Jps
 3354234 RunJar
 984853 JarLauncher
 [root@actuatorx-dispatcher-172-25-48-173 spark]# sh 
spark-bin/sbin/start-thriftserver.sh 
 starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/opt/spark/spark-bin/logs/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-actuatorx-dispatcher-172-25-48-173.out

[root@actuatorx-dispatcher-172-25-48-173 spark]# jps
3362650 Jps
984853 JarLauncher
3355197 SparkSubmit
3362444 RunJar
 [root@actuatorx-dispatcher-172-25-48-173 spark]# netstat -anp|grep 3355197
 tcp 0 0 172.25.48.173:21120 0.0.0.0:* LISTEN 3355197/java 
 tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 3355197/java 
 tcp 0 0 172.25.48.173:22219 0.0.0.0:* LISTEN 3355197/java 
 tcp 0 0 0.0.0.0:50031 0.0.0.0:* LISTEN 3355197/java 
 tcp 0 0 172.25.48.173:51797 172.25.48.231:6033 ESTABLISHED 3355197/java 
 tcp 0 0 172.25.48.173:51795 172.25.48.231:6033 ESTABLISHED 3355197/java 
 tcp 0 0 172.25.48.173:51787 172.25.48.231:6033 ESTABLISHED 3355197/java 
 tcp 0 0 172.25.48.173:51789 172.25.48.231:6033 ESTABLISHED 3355197/java 
 unix 3 [ ] STREAM CONNECTED 534110569 3355197/java 
 unix 3 [ ] STREAM CONNECTED 534110568 3355197/java 
 unix 2 [ ] STREAM CONNECTED 534050562 3355197/java 
 unix 2 [ ] STREAM CONNECTED 534110572 3355197/java 
 [root@actuatorx-dispatcher-172-25-48-173 spark]# 
/opt/spark/spark-bin/bin/beeline -u jdbc:hive2://172.25.48.173:50031/tools -n 
tools 
 Connecting to jdbc:hive2://172.25.48.173:50031/tools
 21/02/02 13:38:57 INFO jdbc.Utils: Supplied authorities: 172.25.48.173:50031
 21/02/02 13:38:57 INFO jdbc.Utils: Resolved authority: 172.25.48.173:50031
 21/02/02 13:38:57 INFO jdbc.HiveConnection: Will try to open client transport 
with JDBC Uri: jdbc:hive2://172.25.48.173:50031/tools
 Connected to: Spark SQL (version 2.4.7)
 Driver: Hive JDBC (version 1.2.1.spark2)
 Transaction isolation: TRANSACTION_REPEATABLE_READ
 Beeline version 1.2.1.spark2 by Apache Hive
 0: jdbc:hive2://172.25.48.173:50031/tools> select '\;';
 Error: org.apache.spark.sql.catalyst.parser.ParseException: 
 no viable alternative at input 'select ''(line 1, pos 7)

== SQL ==
 select '\
 ---^^^ (state=,code=0)
 0: jdbc:hive2://172.25.48.173:50031/tools> !exit
 Closing: 0: jdbc:hive2://172.25.48.173:50031/tools
 [root@actuatorx-dispatcher-172-25-48-173 spark]#


was (Author: zhangxin0112zx):
[~kabhwan]

Sorry for the mixed up Tests.

Please recheck the new test. 
 # It's good with Spark 3.0.0 . (BTW: semicolon is good in beeline
 # It's still a bug with Spark 2.4.7 . 

[root@actuatorx-dispatcher-172-25-48-173 spark]# env|grep spark
SPARK_HOME=/opt/spark/spark-bin
PATH=/root/perl5/bin:/opt/scala/scala-bin//bin:/opt/spark/spark-bin/bin:172.25.52.34:/opt/hive/hive-bin/bin/:172.31.10.86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/swosbf/bin:/usr/local/swosbf/bin/system:/usr/java/jdk/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/root/bin
PWD=/opt/spark
[root@actuatorx-dispatcher-172-25-48-173 spark]# ll
total 4
-rw-r--r-- 1 root root 646 Feb 1 17:44 derby.log
drwxr-xr-x 5 root root 133 Feb 1 17:44 metastore_db
drwxr-xr-x 14 root root 255 Sep 22 13:57 spark-2.3.0-bin-hadoop2.6
drwxr-xr-x 14 1000 1000 240 Feb 2 13:32 spark-2.4.7-bin-hadoop2.6
drwxr-xr-x 14 root root 240 Feb 2 13:26 spark-3.0.0-bin-hadoop2.7
lrwxrwxrwx 1 root root 25 Feb 1 15:42 spark-bin -> spark-2.4.7-bin-hadoop2.6
[root@actuatorx-dispatcher-172-25-48-173 spark]# jps
3348544 RunJar
3354564 Jps
3354234 RunJar
984853 JarLauncher
[root@actuatorx-dispatcher-172-25-48-173 spark]# sh 
spark-bin/sbin/start-thriftserver.sh 

[jira] [Resolved] (SPARK-34321) Fix the guarantee of foreachBatch

2021-02-01 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34321.
-
Resolution: Invalid

> Fix the guarantee of foreachBatch
> -
>
> Key: SPARK-34321
> URL: https://issues.apache.org/jira/browse/SPARK-34321
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Similar to SPARK-28650, the {{foreachBatch}} API documentation also states the 
> guarantee:
> The batchId can be used to deduplicate and transactionally write the output 
> (that is, the provided Dataset) to external systems. The output Dataset is 
> guaranteed to be exactly the same for the same batchId
> But, for the same reason the {{ForeachWriter}} documentation was fixed in 
> SPARK-28650, this guarantee is easy to break, for example by changing the number 
> of partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34321) Fix the guarantee of foreachBatch

2021-02-01 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-34321:
---

 Summary: Fix the guarantee of foreachBatch
 Key: SPARK-34321
 URL: https://issues.apache.org/jira/browse/SPARK-34321
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Similar to SPARK-28650, the {{foreachBatch}} API documentation also states the 
guarantee:

The batchId can be used to deduplicate and transactionally write the output 
(that is, the provided Dataset) to external systems. The output Dataset is 
guaranteed to be exactly the same for the same batchId

But, for the same reason the {{ForeachWriter}} documentation was fixed in 
SPARK-28650, this guarantee is easy to break, for example by changing the number of 
partitions.
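A minimal sketch of the pattern the guarantee is about, assuming {{streamingDF}} is 
an existing streaming DataFrame; the sink options and path are placeholders for a 
transactional sink. The point of this issue is that the batch contents are not 
strictly guaranteed to be identical on replay (e.g. after changing the partition 
number), so the documented guarantee needs to be hedged.

{code:scala}
import org.apache.spark.sql.DataFrame

// Idempotent-write sketch: the external system uses (appId, batchId) to deduplicate.
val writeBatch: (DataFrame, Long) => Unit = (batch: DataFrame, batchId: Long) => {
  batch.write
    .mode("overwrite")
    .option("txnAppId", "my-app")            // placeholder option names
    .option("txnVersion", batchId.toString)  // placeholder option names
    .save("/tmp/output")                     // placeholder path
}

val query = streamingDF.writeStream
  .foreachBatch(writeBatch)
  .start()
{code}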




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12497) thriftServer does not support semicolon in sql

2021-02-01 Thread xinzhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276864#comment-17276864
 ] 

xinzhang commented on SPARK-12497:
--

[~kabhwan]

Sorry for the mixed-up tests.

Please recheck the new test:
 # It works with Spark 3.0.0. (BTW: the semicolon is handled fine in beeline.)
 # It is still a bug with Spark 2.4.7.

[root@actuatorx-dispatcher-172-25-48-173 spark]# env|grep spark
SPARK_HOME=/opt/spark/spark-bin
PATH=/root/perl5/bin:/opt/scala/scala-bin//bin:/opt/spark/spark-bin/bin:172.25.52.34:/opt/hive/hive-bin/bin/:172.31.10.86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/swosbf/bin:/usr/local/swosbf/bin/system:/usr/java/jdk/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/root/bin
PWD=/opt/spark
[root@actuatorx-dispatcher-172-25-48-173 spark]# ll
total 4
-rw-r--r-- 1 root root 646 Feb 1 17:44 derby.log
drwxr-xr-x 5 root root 133 Feb 1 17:44 metastore_db
drwxr-xr-x 14 root root 255 Sep 22 13:57 spark-2.3.0-bin-hadoop2.6
drwxr-xr-x 14 1000 1000 240 Feb 2 13:32 spark-2.4.7-bin-hadoop2.6
drwxr-xr-x 14 root root 240 Feb 2 13:26 spark-3.0.0-bin-hadoop2.7
lrwxrwxrwx 1 root root 25 Feb 1 15:42 spark-bin -> spark-2.4.7-bin-hadoop2.6
[root@actuatorx-dispatcher-172-25-48-173 spark]# jps
3348544 RunJar
3354564 Jps
3354234 RunJar
984853 JarLauncher
[root@actuatorx-dispatcher-172-25-48-173 spark]# sh 
spark-bin/sbin/start-thriftserver.sh 
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/opt/spark/spark-bin/logs/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-actuatorx-dispatcher-172-25-48-173.out
[root@actuatorx-dispatcher-172-25-48-173 spark]# netstat -anp|grep 3355197
tcp 0 0 172.25.48.173:21120 0.0.0.0:* LISTEN 3355197/java 
tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 3355197/java 
tcp 0 0 172.25.48.173:22219 0.0.0.0:* LISTEN 3355197/java 
tcp 0 0 0.0.0.0:50031 0.0.0.0:* LISTEN 3355197/java 
tcp 0 0 172.25.48.173:51797 172.25.48.231:6033 ESTABLISHED 3355197/java 
tcp 0 0 172.25.48.173:51795 172.25.48.231:6033 ESTABLISHED 3355197/java 
tcp 0 0 172.25.48.173:51787 172.25.48.231:6033 ESTABLISHED 3355197/java 
tcp 0 0 172.25.48.173:51789 172.25.48.231:6033 ESTABLISHED 3355197/java 
unix 3 [ ] STREAM CONNECTED 534110569 3355197/java 
unix 3 [ ] STREAM CONNECTED 534110568 3355197/java 
unix 2 [ ] STREAM CONNECTED 534050562 3355197/java 
unix 2 [ ] STREAM CONNECTED 534110572 3355197/java 
[root@actuatorx-dispatcher-172-25-48-173 spark]# 
/opt/spark/spark-bin/bin/beeline -u jdbc:hive2://172.25.48.173:50031/tools -n 
tools 
Connecting to jdbc:hive2://172.25.48.173:50031/tools
21/02/02 13:38:57 INFO jdbc.Utils: Supplied authorities: 172.25.48.173:50031
21/02/02 13:38:57 INFO jdbc.Utils: Resolved authority: 172.25.48.173:50031
21/02/02 13:38:57 INFO jdbc.HiveConnection: Will try to open client transport 
with JDBC Uri: jdbc:hive2://172.25.48.173:50031/tools
Connected to: Spark SQL (version 2.4.7)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://172.25.48.173:50031/tools> select '\;';
Error: org.apache.spark.sql.catalyst.parser.ParseException: 
no viable alternative at input 'select ''(line 1, pos 7)

== SQL ==
select '\
---^^^ (state=,code=0)
0: jdbc:hive2://172.25.48.173:50031/tools> !exit
Closing: 0: jdbc:hive2://172.25.48.173:50031/tools
[root@actuatorx-dispatcher-172-25-48-173 spark]#

> thriftServer does not support semicolon in sql 
> ---
>
> Key: SPARK-12497
> URL: https://issues.apache.org/jira/browse/SPARK-12497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: nilonealex
>Priority: Major
>
> 0: jdbc:hive2://192.168.128.130:14005> SELECT ';' from tx_1 limit 1 ;
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '' '' '' in select clause; line 1 pos 8 (state=,code=0)
> 0: jdbc:hive2://192.168.128.130:14005> 
> 0: jdbc:hive2://192.168.128.130:14005> select '\;' from tx_1 limit 1 ; 
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '' '' '' in select clause; line 1 pos 9 (state=,code=0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-01 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276863#comment-17276863
 ] 

L. C. Hsieh commented on SPARK-34198:
-

If you are asking why we add it as an external module instead of directly into the 
streaming codebase: one earlier concern was that this introduces an extra dependency 
on RocksDB, so adding it as an external module is meant to address that concern.

We will add the RocksDB StateStore code as an external module, as the JIRA title 
describes. Spark SS can already use a configurable provider class to choose which 
StateStore provider to use, so I think there won't be too many tasks involved.
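A minimal sketch of what choosing the provider looks like from the user side; the 
RocksDB provider class name below is a placeholder for whatever the external module 
would ship.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-statestore-demo")
  // Existing knob for swapping the StateStore implementation; the value is a
  // hypothetical class from the proposed external module.
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
{code}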

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> streaming applications multiply, some of them need to keep large state in 
> stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large-state usage. But 
> Spark SS still lacks a built-in state store that meets this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS. To address the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-01 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276857#comment-17276857
 ] 

Cheng Su commented on SPARK-34198:
--

[~viirya] - could you elaborate on the benefit of adding this as an 
external module? Also, do you mind sharing a list of potential things/sub-tasks 
that need to be done to make it work? Thanks.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> streaming applications multiply, some of them need to keep large state in 
> stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large-state usage. But 
> Spark SS still lacks a built-in state store that meets this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS. To address the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34320) Migrate ALTER TABLE drop columns command to the new resolution framework

2021-02-01 Thread Terry Kim (Jira)
Terry Kim created SPARK-34320:
-

 Summary: Migrate ALTER TABLE drop columns command to the new 
resolution framework
 Key: SPARK-34320
 URL: https://issues.apache.org/jira/browse/SPARK-34320
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


Migrate ALTER TABLE drop columns command to the new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276818#comment-17276818
 ] 

Apache Spark commented on SPARK-34319:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31429

> Self-join after cogroup applyInPandas fails due to unresolved conflicting 
> attributes
> 
>
> Key: SPARK-34319
> URL: https://issues.apache.org/jira/browse/SPARK-34319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> df = spark.createDataFrame([(1, 1)], ("column", "value"))
> row = df.groupby("ColUmn").cogroup(
> df.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, "column long, value long")
> row.join(row).show()
> {code}
> {code:java}
> Conflicting attributes: column#163321L,value#163322L
> ;;
> 'Join Inner
> :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
> :  :- Project [ColUmn#163312L, column#163312L, value#163313L]
> :  :  +- LogicalRDD [column#163312L, value#163313L], false
> :  +- Project [COLUMN#163312L, column#163312L, value#163313L]
> : +- LogicalRDD [column#163312L, value#163313L], false
> +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
>    :- Project [ColUmn#163312L, column#163312L, value#163313L]
>    :  +- LogicalRDD [column#163312L, value#163313L], false
>    +- Project [COLUMN#163312L, column#163312L, value#163313L]
>   +- LogicalRDD [column#163312L, value#163313L], false
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34319:


Assignee: Apache Spark

> Self-join after cogroup applyInPandas fails due to unresolved conflicting 
> attributes
> 
>
> Key: SPARK-34319
> URL: https://issues.apache.org/jira/browse/SPARK-34319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> df = spark.createDataFrame([(1, 1)], ("column", "value"))
> row = df.groupby("ColUmn").cogroup(
> df.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, "column long, value long")
> row.join(row).show()
> {code}
> {code:java}
> Conflicting attributes: column#163321L,value#163322L
> ;;
> 'Join Inner
> :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
> :  :- Project [ColUmn#163312L, column#163312L, value#163313L]
> :  :  +- LogicalRDD [column#163312L, value#163313L], false
> :  +- Project [COLUMN#163312L, column#163312L, value#163313L]
> : +- LogicalRDD [column#163312L, value#163313L], false
> +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
>    :- Project [ColUmn#163312L, column#163312L, value#163313L]
>    :  +- LogicalRDD [column#163312L, value#163313L], false
>    +- Project [COLUMN#163312L, column#163312L, value#163313L]
>   +- LogicalRDD [column#163312L, value#163313L], false
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34319:


Assignee: (was: Apache Spark)

> Self-join after cogroup applyInPandas fails due to unresolved conflicting 
> attributes
> 
>
> Key: SPARK-34319
> URL: https://issues.apache.org/jira/browse/SPARK-34319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> df = spark.createDataFrame([(1, 1)], ("column", "value"))
> row = df.groupby("ColUmn").cogroup(
> df.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, "column long, value long")
> row.join(row).show()
> {code}
> {code:java}
> Conflicting attributes: column#163321L,value#163322L
> ;;
> 'Join Inner
> :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
> :  :- Project [ColUmn#163312L, column#163312L, value#163313L]
> :  :  +- LogicalRDD [column#163312L, value#163313L], false
> :  +- Project [COLUMN#163312L, column#163312L, value#163313L]
> : +- LogicalRDD [column#163312L, value#163313L], false
> +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
> (column#163312L, value#163313L, column#163312L, value#163313L), 
> [column#163321L, value#163322L]
>    :- Project [ColUmn#163312L, column#163312L, value#163313L]
>    :  +- LogicalRDD [column#163312L, value#163313L], false
>    +- Project [COLUMN#163312L, column#163312L, value#163313L]
>   +- LogicalRDD [column#163312L, value#163313L], false
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes

2021-02-01 Thread wuyi (Jira)
wuyi created SPARK-34319:


 Summary: Self-join after cogroup applyInPandas fails due to 
unresolved conflicting attributes
 Key: SPARK-34319
 URL: https://issues.apache.org/jira/browse/SPARK-34319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 3.1.0, 3.2.0
Reporter: wuyi


 
{code:java}
df = spark.createDataFrame([(1, 1)], ("column", "value"))
row = df.groupby("ColUmn").cogroup(
    df.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, "column long, value long")
row.join(row).show()
{code}
{code:java}
Conflicting attributes: column#163321L,value#163322L
;;
'Join Inner
:- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
(column#163312L, value#163313L, column#163312L, value#163313L), 
[column#163321L, value#163322L]
:  :- Project [ColUmn#163312L, column#163312L, value#163313L]
:  :  +- LogicalRDD [column#163312L, value#163313L], false
:  +- Project [COLUMN#163312L, column#163312L, value#163313L]
: +- LogicalRDD [column#163312L, value#163313L], false
+- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], 
(column#163312L, value#163313L, column#163312L, value#163313L), 
[column#163321L, value#163322L]
   :- Project [ColUmn#163312L, column#163312L, value#163313L]
   :  +- LogicalRDD [column#163312L, value#163313L], false
   +- Project [COLUMN#163312L, column#163312L, value#163313L]
  +- LogicalRDD [column#163312L, value#163313L], false
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276806#comment-17276806
 ] 

Yang Jie commented on SPARK-34309:
--

There is already a patch that replaces Guava Cache in all of those places.

However, when using a RemovalListener, the timing behavior seems inconsistent: data 
was deleted 3~5 ms later than expected :(
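For reference, a minimal sketch of the kind of substitution being tested, with 
illustrative key/value types and TTL; note that Caffeine fires removal listeners 
asynchronously on an executor and reclaims expired entries lazily, which could 
account for small delays relative to Guava's notifications.

{code:scala}
import java.util.concurrent.TimeUnit
import com.github.benmanes.caffeine.cache.{Caffeine, RemovalCause, RemovalListener}

val listener = new RemovalListener[String, String] {
  override def onRemoval(key: String, value: String, cause: RemovalCause): Unit =
    println(s"removed $key ($cause)")  // invoked asynchronously after eviction/expiry
}

val cache = Caffeine.newBuilder()
  .expireAfterWrite(10, TimeUnit.MILLISECONDS)
  .removalListener(listener)
  .build[String, String]()

cache.put("k", "v")
Thread.sleep(20)
cache.cleanUp()                       // expired entries are reclaimed lazily otherwise
println(cache.getIfPresent("k"))      // expected: null after expiry
{code}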

> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava Cache, but with better performance; 
> comparison results are available on the [caffeine benchmarks 
> |https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used by open source projects such as 
> Cassandra, HBase, Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34316:


Assignee: (was: Apache Spark)

> Optional Propagation of SPARK_CONF_DIR in K8s
> -
>
> Key: SPARK-34316
> URL: https://issues.apache.org/jira/browse/SPARK-34316
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Zhou JIANG
>Priority: Major
>
> In shared Kubernetes clusters, Spark could be restricted from creating and 
> deleting config maps in job namespaces.
> It would be helpful if the currently mandatory config map creation could be 
> made optional. Users may still take responsibility for handling Spark conf files 
> separately.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s

2021-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34316:
--
Affects Version/s: (was: 3.0.1)
   3.2.0

> Optional Propagation of SPARK_CONF_DIR in K8s
> -
>
> Key: SPARK-34316
> URL: https://issues.apache.org/jira/browse/SPARK-34316
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Zhou JIANG
>Priority: Major
>
> In shared Kubernetes clusters, Spark could be restricted from creating and 
> deleting config maps in job namespaces.
> It would be helpful if the currently mandatory config map creation could be 
> made optional. Users may still take responsibility for handling Spark conf files 
> separately.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s

2021-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276782#comment-17276782
 ] 

Dongjoon Hyun commented on SPARK-34316:
---

Thank you for filing a Jira issue, [~zhou_jiang].

> Optional Propagation of SPARK_CONF_DIR in K8s
> -
>
> Key: SPARK-34316
> URL: https://issues.apache.org/jira/browse/SPARK-34316
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Zhou JIANG
>Priority: Major
>
> In shared Kubernetes clusters, Spark could be restricted from creating and 
> deleting config maps in job namespaces.
> It would be helpful if the currently mandatory config map creation could be 
> made optional. Users may still take responsibility for handling Spark conf files 
> separately.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276766#comment-17276766
 ] 

Dongjoon Hyun commented on SPARK-34309:
---

Thank you for pinging me, [~LuciferYang]. The benchmark seems to have been written 
in 2015.

We use Guava Cache in multiple places. Which of them are you testing?

> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava Cache, but with better performance; 
> comparison results are available on the [caffeine benchmarks 
> |https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used by open source projects such as 
> Cassandra, HBase, Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34300) Fix of typos in documentation of pyspark.sql.functions and output of lint-python

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34300:


Assignee: David Toneian

> Fix of typos in documentation of pyspark.sql.functions and output of 
> lint-python
> 
>
> Key: SPARK-34300
> URL: https://issues.apache.org/jira/browse/SPARK-34300
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: David Toneian
>Assignee: David Toneian
>Priority: Trivial
>
> Minor documentation and standard output issues:
> * {{dev/lint-python}} contains a typo when printing a warning regarding bad 
> Sphinx version ("lower then 3.1" rather than "lower than 3.1")
> * The documentation of the functions {{lag}} and {{lead}} of 
> {{pyspark.sql.functions}} refers to a parameter {{defaultValue}}, which in 
> reality is named {{default}}.
> * The documentation strings of functions in {{pyspark.sql.functions}} make 
> reference to the {{Column}} class, which is not resolved by Sphinx unless 
> fully qualified as {{pyspark.sql.Column}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34300) Fix of typos in documentation of pyspark.sql.functions and output of lint-python

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34300.
--
Fix Version/s: 3.1.2
   Resolution: Fixed

Issue resolved by pull request 31401
[https://github.com/apache/spark/pull/31401]

> Fix of typos in documentation of pyspark.sql.functions and output of 
> lint-python
> 
>
> Key: SPARK-34300
> URL: https://issues.apache.org/jira/browse/SPARK-34300
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: David Toneian
>Assignee: David Toneian
>Priority: Trivial
> Fix For: 3.1.2
>
>
> Minor documentation and standard output issues:
> * {{dev/lint-python}} contains a typo when printing a warning regarding bad 
> Sphinx version ("lower then 3.1" rather than "lower than 3.1")
> * The documentation of the functions {{lag}} and {{lead}} of 
> {{pyspark.sql.functions}} refers to a parameter {{defaultValue}}, which in 
> reality is named {{default}}.
> * The documentation strings of functions in {{pyspark.sql.functions}} make 
> reference to the {{Column}} class, which is not resolved by Sphinx unless 
> fully qualified as {{pyspark.sql.Column}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34306) Use Snake naming rule across the function APIs

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34306.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31408
[https://github.com/apache/spark/pull/31408]

> Use Snake naming rule across the function APIs
> --
>
> Key: SPARK-34306
> URL: https://issues.apache.org/jira/browse/SPARK-34306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> There are some functions missed in SPARK-10621.
> This JIRA targets renaming everything under the functions APIs to use the 
> snake_case naming rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34310) Replaces map and flatten with flatMap

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34310:
-
Fix Version/s: 3.1.2
   3.0.2

> Replaces map and flatten with flatMap
> -
>
> Key: SPARK-34310
> URL: https://issues.apache.org/jira/browse/SPARK-34310
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.0.2, 3.2.0, 3.1.2
>
>
> Replaces collection.map(f1).flatten(f2) with collection.flatMap if possible.
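A typical instance of the pattern, as a small Scala sketch (the values are made up for illustration):
{code:scala}
val xs = Seq(1, 2, 3)

// before: map followed by flatten builds an intermediate collection
val before = xs.map(x => Seq(x, x * 10)).flatten

// after: a single flatMap produces the same result without the intermediate step
val after = xs.flatMap(x => Seq(x, x * 10))

assert(before == after)  // Seq(1, 10, 2, 20, 3, 30)
{code}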



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34310) Replaces map and flatten with flatMap

2021-02-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34310:
-
Fix Version/s: 2.4.8

> Replaces map and flatten with flatMap
> -
>
> Key: SPARK-34310
> URL: https://issues.apache.org/jira/browse/SPARK-34310
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 2.4.8, 3.0.2, 3.2.0, 3.1.2
>
>
> Replaces collection.map(f1).flatten(f2) with collection.flatMap if possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276728#comment-17276728
 ] 

Apache Spark commented on SPARK-34209:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31427

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Priority: Trivial
>
> SPARK-30885 removed the ability for tables in session catalogs that are queried 
> with SQL to have multiple namespaces. This seems to have been added as a 
> follow-up, not as part of the core change. We should explore whether this 
> restriction can be relaxed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276727#comment-17276727
 ] 

Apache Spark commented on SPARK-34209:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31427

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Priority: Trivial
>
> SPARK-30885 removed the ability for tables in session catalogs that are queried 
> with SQL to have multiple namespaces. This seems to have been added as a 
> follow-up, not as part of the core change. We should explore whether this 
> restriction can be relaxed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34209:


Assignee: Apache Spark

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Trivial
>
> SPARK-30885 removed the ability for tables in session catalogs that are queried 
> with SQL to have multiple namespaces. This seems to have been added as a 
> follow-up, not as part of the core change. We should explore whether this 
> restriction can be relaxed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34209:


Assignee: (was: Apache Spark)

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Priority: Trivial
>
> SPARK-30885 removed the ability for tables in session catalogs that are queried 
> with SQL to have multiple namespaces. This seems to have been added as a 
> follow-up, not as part of the core change. We should explore whether this 
> restriction can be relaxed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)

2021-02-01 Thread Daniel Himmelstein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276711#comment-17276711
 ] 

Daniel Himmelstein edited comment on SPARK-26325 at 2/1/21, 10:53 PM:
--

Here's the code from the original post, but using an RDD rather than a JSON file 
and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'":
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ")
    .json(path=rdd)
){code}
The output I get with pyspark 3.0.1 is `DataFrame[time_field: string]`. So it 
looks like the issue remains.

I'd be interested in whether there are any examples where Spark infers a date or 
timestamp from a JSON string, or whether dateFormat and timestampFormat do not 
work at all.


was (Author: dhimmel):
Here's the code from the original post, but using an RDD rather than a JSON file 
and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'":
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ")
    .json(path=rdd)
){code}
The output I get with pyspark 3.0.1 is `DataFrame[time_field: string]`. So it 
looks like the issue remains.

I'd be interested in whether there are any examples where Spark infers a timestamp from 
a JSON string, or whether timestampFormat does not work at all.

> Interpret timestamp fields in Spark while reading json (timestampFormat)
> 
>
> Key: SPARK-26325
> URL: https://issues.apache.org/jira/browse/SPARK-26325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Veenit Shah
>Priority: Major
>
> I am trying to read a pretty-printed JSON file which has time fields in it. I want 
> to interpret the timestamp columns as timestamp fields while reading the 
> JSON itself. However, they are still read as strings when I call {{printSchema}}.
> E.g. Input json file -
> {code:java}
> [{
> "time_field" : "2017-09-30 04:53:39.412496Z"
> }]
> {code}
> Code -
> {code:java}
> df = spark.read.option("multiLine", 
> "true").option("timestampFormat","yyyy-MM-dd 
> HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')
> {code}
> Output of df.printSchema() -
> {code:java}
> root
>  |-- time_field: string (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)

2021-02-01 Thread Daniel Himmelstein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276711#comment-17276711
 ] 

Daniel Himmelstein commented on SPARK-26325:


Here's the code from the original post, but using an RDD rather than a JSON file 
and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'":
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ")
    .json(path=rdd)
){code}
The output I get with pyspark 3.0.1 is `DataFrame[time_field: string]`. So it 
looks like the issue remains.

I'd be interested in whether there are any examples where Spark infers a timestamp from 
a JSON string, or whether timestampFormat does not work at all.
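One way to get a timestamp column out of this example is to skip inference entirely and supply an explicit schema, so that {{timestampFormat}} is applied while parsing the value rather than while inferring the type. A minimal Scala sketch (the exact pattern letters are an assumption and may need adjusting):
{code:scala}
import spark.implicits._
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

val ds = Seq("""{"time_field" : "2017-09-30 04:53:39.412496Z"}""").toDS()

// fix the schema up front instead of letting Spark infer it
val schema = StructType(Seq(StructField("time_field", TimestampType)))

val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSX")
  .json(ds)

df.printSchema()  // time_field: timestamp (the schema is fixed, not inferred)
{code}
Recent releases also have an {{inferTimestamp}} JSON read option that controls whether schema inference considers timestamps at all; its default has changed between releases, which may explain why inference falls back to string here.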

> Interpret timestamp fields in Spark while reading json (timestampFormat)
> 
>
> Key: SPARK-26325
> URL: https://issues.apache.org/jira/browse/SPARK-26325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Veenit Shah
>Priority: Major
>
> I am trying to read a pretty-printed JSON file which has time fields in it. I want 
> to interpret the timestamp columns as timestamp fields while reading the 
> JSON itself. However, they are still read as strings when I call {{printSchema}}.
> E.g. Input json file -
> {code:java}
> [{
> "time_field" : "2017-09-30 04:53:39.412496Z"
> }]
> {code}
> Code -
> {code:java}
> df = spark.read.option("multiLine", 
> "true").option("timestampFormat","yyyy-MM-dd 
> HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')
> {code}
> Output of df.printSchema() -
> {code:java}
> root
>  |-- time_field: string (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34318:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Dataset.colRegex should work with column names and qualifiers which contain 
> newlines
> 
>
> Key: SPARK-34318
> URL: https://issues.apache.org/jira/browse/SPARK-34318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, Dataset.colRegex doesn't work with column names or 
> qualifiers which contain newlines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276708#comment-17276708
 ] 

Apache Spark commented on SPARK-34318:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31426

> Dataset.colRegex should work with column names and qualifiers which contain 
> newlines
> 
>
> Key: SPARK-34318
> URL: https://issues.apache.org/jira/browse/SPARK-34318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current master, Dataset.colRegex doesn't work with column names or 
> qualifiers which contain newlines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34318:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Dataset.colRegex should work with column names and qualifiers which contain 
> newlines
> 
>
> Key: SPARK-34318
> URL: https://issues.apache.org/jira/browse/SPARK-34318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In the current master, Dataset.colRegex doesn't work with column names or 
> qualifiers which contain newlines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines

2021-02-01 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-34318:
--

 Summary: Dataset.colRegex should work with column names and 
qualifiers which contain newlines
 Key: SPARK-34318
 URL: https://issues.apache.org/jira/browse/SPARK-34318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current master, Dataset.colRegex doesn't work with column names or 
qualifiers which contain newlines.
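For context, basic usage of the API in question looks like this (illustrative column names, not taken from a report); the problem arises when the backtick-quoted name or qualifier itself contains a newline character:
{code:scala}
val df = Seq((1, "a"), (2, "b")).toDF("col1", "col2")

// colRegex selects every column whose name matches the backtick-quoted regex
df.select(df.colRegex("`col.*`")).show()
{code}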



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276682#comment-17276682
 ] 

Apache Spark commented on SPARK-34315:
--

User 'timhughes' has created a pull request for this issue:
https://github.com/apache/spark/pull/31425

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Tim Hughes
>Priority: Critical
> Attachments: full-logs.txt
>
>
> When building the docker containers using docker-image-tool.sh, there is a 
> RUN step `apt install -y bash tini libc6 libpam-modules krb5-user libnss3`, which 
> leads to `debconf` trying to configure Kerberos. I have tried entering 
> nothing, EXAMPLE.COM, and my corporate Kerberos realm, and none of them work; it 
> just hangs after Enter is pressed.
>  
> {{Setting up krb5-config (2.6) ...}}
> {{debconf: unable to initialize frontend: Dialog}}
> {{debconf: (TERM is not set, so the dialog frontend is not usable.)}}
> {{debconf: falling back to frontend: Readline}}
> {{debconf: unable to initialize frontend: Readline}}
> {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install 
> the Term::ReadLine module) (@INC contains: /etc/perl 
> /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 
> /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 
> /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 
> /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at 
> /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}}
> {{debconf: falling back to frontend: Teletype}}
> {{Configuring Kerberos Authentication}}
> {{---}}
> {{When users attempt to use Kerberos and specify a principal or user name 
> without}}
> {{specifying what administrative Kerberos realm that principal belongs to, 
> the}}
> {{system appends the default realm. The default realm may also be used as 
> the}}
> {{realm of a Kerberos service running on the local machine. Often, the 
> default}}
> {{realm is the uppercase version of the local DNS domain.}}
> {{Default Kerberos version 5 realm: EXAMPLE.ORG}}
> {{^CFailed to build Spark JVM Docker image, please refer to Docker build 
> output for details.}}
>  
>  
> {{## Steps to reproduce}}
> {{```}}
> {{wget -qO- 
> https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
>  | tar -xzf -}}
> {{cd spark-3.0.1-bin-hadoop3.2/}}
> {{./bin/docker-image-tool.sh build}}
> {{```}}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34315:


Assignee: Apache Spark

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Tim Hughes
>Assignee: Apache Spark
>Priority: Critical
> Attachments: full-logs.txt
>
>
> When building the docker containers using docker-image-tool.sh, there is a 
> RUN step `apt install -y bash tini libc6 libpam-modules krb5-user libnss3`, which 
> leads to `debconf` trying to configure Kerberos. I have tried entering 
> nothing, EXAMPLE.COM, and my corporate Kerberos realm, and none of them work; it 
> just hangs after Enter is pressed.
>  
> {{Setting up krb5-config (2.6) ...}}
> {{debconf: unable to initialize frontend: Dialog}}
> {{debconf: (TERM is not set, so the dialog frontend is not usable.)}}
> {{debconf: falling back to frontend: Readline}}
> {{debconf: unable to initialize frontend: Readline}}
> {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install 
> the Term::ReadLine module) (@INC contains: /etc/perl 
> /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 
> /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 
> /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 
> /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at 
> /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}}
> {{debconf: falling back to frontend: Teletype}}
> {{Configuring Kerberos Authentication}}
> {{---}}
> {{When users attempt to use Kerberos and specify a principal or user name 
> without}}
> {{specifying what administrative Kerberos realm that principal belongs to, 
> the}}
> {{system appends the default realm. The default realm may also be used as 
> the}}
> {{realm of a Kerberos service running on the local machine. Often, the 
> default}}
> {{realm is the uppercase version of the local DNS domain.}}
> {{Default Kerberos version 5 realm: EXAMPLE.ORG}}
> {{^CFailed to build Spark JVM Docker image, please refer to Docker build 
> output for details.}}
>  
>  
> {{## Steps to reproduce}}
> {{```}}
> {{wget -qO- 
> https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
>  | tar -xzf -}}
> {{cd spark-3.0.1-bin-hadoop3.2/}}
> {{./bin/docker-image-tool.sh build}}
> {{```}}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34315:


Assignee: (was: Apache Spark)

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Tim Hughes
>Priority: Critical
> Attachments: full-logs.txt
>
>
> When building the docker containers using docker-image-tool.sh, there is a 
> RUN step `apt install -y bash tini libc6 libpam-modules krb5-user libnss3`, which 
> leads to `debconf` trying to configure Kerberos. I have tried entering 
> nothing, EXAMPLE.COM, and my corporate Kerberos realm, and none of them work; it 
> just hangs after Enter is pressed.
>  
> {{Setting up krb5-config (2.6) ...}}
> {{debconf: unable to initialize frontend: Dialog}}
> {{debconf: (TERM is not set, so the dialog frontend is not usable.)}}
> {{debconf: falling back to frontend: Readline}}
> {{debconf: unable to initialize frontend: Readline}}
> {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install 
> the Term::ReadLine module) (@INC contains: /etc/perl 
> /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 
> /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 
> /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 
> /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at 
> /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}}
> {{debconf: falling back to frontend: Teletype}}
> {{Configuring Kerberos Authentication}}
> {{---}}
> {{When users attempt to use Kerberos and specify a principal or user name 
> without}}
> {{specifying what administrative Kerberos realm that principal belongs to, 
> the}}
> {{system appends the default realm. The default realm may also be used as 
> the}}
> {{realm of a Kerberos service running on the local machine. Often, the 
> default}}
> {{realm is the uppercase version of the local DNS domain.}}
> {{Default Kerberos version 5 realm: EXAMPLE.ORG}}
> {{^CFailed to build Spark JVM Docker image, please refer to Docker build 
> output for details.}}
>  
>  
> {{## Steps to reproduce}}
> {{```}}
> {{wget -qO- 
> https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
>  | tar -xzf -}}
> {{cd spark-3.0.1-bin-hadoop3.2/}}
> {{./bin/docker-image-tool.sh build}}
> {{```}}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Tim Hughes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276681#comment-17276681
 ] 

Tim Hughes commented on SPARK-34315:


Created a pull request https://github.com/apache/spark/pull/31425

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Tim Hughes
>Priority: Critical
> Attachments: full-logs.txt
>
>
> When building the docker containers using docker-image-tool.sh, there is a 
> RUN step `apt install -y bash tini libc6 libpam-modules krb5-user libnss3`, which 
> leads to `debconf` trying to configure Kerberos. I have tried entering 
> nothing, EXAMPLE.COM, and my corporate Kerberos realm, and none of them work; it 
> just hangs after Enter is pressed.
>  
> {{Setting up krb5-config (2.6) ...}}
> {{debconf: unable to initialize frontend: Dialog}}
> {{debconf: (TERM is not set, so the dialog frontend is not usable.)}}
> {{debconf: falling back to frontend: Readline}}
> {{debconf: unable to initialize frontend: Readline}}
> {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install 
> the Term::ReadLine module) (@INC contains: /etc/perl 
> /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 
> /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 
> /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 
> /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at 
> /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}}
> {{debconf: falling back to frontend: Teletype}}
> {{Configuring Kerberos Authentication}}
> {{---}}
> {{When users attempt to use Kerberos and specify a principal or user name 
> without}}
> {{specifying what administrative Kerberos realm that principal belongs to, 
> the}}
> {{system appends the default realm. The default realm may also be used as 
> the}}
> {{realm of a Kerberos service running on the local machine. Often, the 
> default}}
> {{realm is the uppercase version of the local DNS domain.}}
> {{Default Kerberos version 5 realm: EXAMPLE.ORG}}
> {{^CFailed to build Spark JVM Docker image, please refer to Docker build 
> output for details.}}
>  
>  
> {{## Steps to reproduce}}
> {{```}}
> {{wget -qO- 
> https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
>  | tar -xzf -}}
> {{cd spark-3.0.1-bin-hadoop3.2/}}
> {{./bin/docker-image-tool.sh build}}
> {{```}}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Tim Hughes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276680#comment-17276680
 ] 

Tim Hughes commented on SPARK-34315:


[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L34]

 

Prefixing the line that installs krb5-user with 
`DEBIAN_FRONTEND=noninteractive` allows the container to be built.
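A sketch of that kind of change against the RUN instruction linked above (the surrounding commands are elided with "..." and the exact placement is only illustrative):
{code}
# before: krb5-config's debconf prompt blocks the build
RUN ... && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && ...

# after: suppress the interactive prompt for this install step
RUN ... && DEBIAN_FRONTEND=noninteractive apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && ...
{code}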

 

 

 

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Tim Hughes
>Priority: Critical
> Attachments: full-logs.txt
>
>
> When building the docker containers using docker-image-tool.sh, there is a 
> RUN step `apt install -y bash tini libc6 libpam-modules krb5-user libnss3`, which 
> leads to `debconf` trying to configure Kerberos. I have tried entering 
> nothing, EXAMPLE.COM, and my corporate Kerberos realm, and none of them work; it 
> just hangs after Enter is pressed.
>  
> {{Setting up krb5-config (2.6) ...}}
> {{debconf: unable to initialize frontend: Dialog}}
> {{debconf: (TERM is not set, so the dialog frontend is not usable.)}}
> {{debconf: falling back to frontend: Readline}}
> {{debconf: unable to initialize frontend: Readline}}
> {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install 
> the Term::ReadLine module) (@INC contains: /etc/perl 
> /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 
> /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 
> /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 
> /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at 
> /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}}
> {{debconf: falling back to frontend: Teletype}}
> {{Configuring Kerberos Authentication}}
> {{---}}
> {{When users attempt to use Kerberos and specify a principal or user name 
> without}}
> {{specifying what administrative Kerberos realm that principal belongs to, 
> the}}
> {{system appends the default realm. The default realm may also be used as 
> the}}
> {{realm of a Kerberos service running on the local machine. Often, the 
> default}}
> {{realm is the uppercase version of the local DNS domain.}}
> {{Default Kerberos version 5 realm: EXAMPLE.ORG}}
> {{^CFailed to build Spark JVM Docker image, please refer to Docker build 
> output for details.}}
>  
>  
> {{## Steps to reproduce}}
> {{```}}
> {{wget -qO- 
> https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
>  | tar -xzf -}}
> {{cd spark-3.0.1-bin-hadoop3.2/}}
> {{./bin/docker-image-tool.sh build}}
> {{```}}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276655#comment-17276655
 ] 

Apache Spark commented on SPARK-34317:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/31424

> Introduce relationTypeMismatchHint to UnresolvedTable for a better error 
> message
> 
>
> Key: SPARK-34317
> URL: https://issues.apache.org/jira/browse/SPARK-34317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if 
> the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" 
> is resolved to a view, the error message will also contain a hint, "Please use 
> ALTER VIEW instead."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34317:


Assignee: (was: Apache Spark)

> Introduce relationTypeMismatchHint to UnresolvedTable for a better error 
> message
> 
>
> Key: SPARK-34317
> URL: https://issues.apache.org/jira/browse/SPARK-34317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if 
> the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" 
> is resolved to a view, the error message will also contain a hint, "Please use 
> ALTER VIEW instead."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34317:


Assignee: Apache Spark

> Introduce relationTypeMismatchHint to UnresolvedTable for a better error 
> message
> 
>
> Key: SPARK-34317
> URL: https://issues.apache.org/jira/browse/SPARK-34317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if 
> the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" 
> is resolved to a view, the error message will also contain a hint, "Please use 
> ALTER VIEW instead."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276654#comment-17276654
 ] 

Apache Spark commented on SPARK-34317:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/31424

> Introduce relationTypeMismatchHint to UnresolvedTable for a better error 
> message
> 
>
> Key: SPARK-34317
> URL: https://issues.apache.org/jira/browse/SPARK-34317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if 
> the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" 
> is resolved to a view, the error message will also contain a hint, "Please use 
> ALTER VIEW instead."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Tim Hughes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Hughes updated SPARK-34315:
---
Environment: (was: # 
 ## Full logs

 

{{}}{{}})

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Tim Hughes
>Priority: Critical
> Attachments: full-logs.txt
>
>
> When building the docker containers using docker-image-tool.sh, there is a 
> RUN step `apt install -y bash tini libc6 libpam-modules krb5-user libnss3`, which 
> leads to `debconf` trying to configure Kerberos. I have tried entering 
> nothing, EXAMPLE.COM, and my corporate Kerberos realm, and none of them work; it 
> just hangs after Enter is pressed.
>  
> {{Setting up krb5-config (2.6) ...}}
> {{debconf: unable to initialize frontend: Dialog}}
> {{debconf: (TERM is not set, so the dialog frontend is not usable.)}}
> {{debconf: falling back to frontend: Readline}}
> {{debconf: unable to initialize frontend: Readline}}
> {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install 
> the Term::ReadLine module) (@INC contains: /etc/perl 
> /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 
> /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 
> /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 
> /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at 
> /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}}
> {{debconf: falling back to frontend: Teletype}}
> {{Configuring Kerberos Authentication}}
> {{---}}
> {{When users attempt to use Kerberos and specify a principal or user name 
> without}}
> {{specifying what administrative Kerberos realm that principal belongs to, 
> the}}
> {{system appends the default realm. The default realm may also be used as 
> the}}
> {{realm of a Kerberos service running on the local machine. Often, the 
> default}}
> {{realm is the uppercase version of the local DNS domain.}}
> {{Default Kerberos version 5 realm: EXAMPLE.ORG}}
> {{^CFailed to build Spark JVM Docker image, please refer to Docker build 
> output for details.}}
>  
>  
> {{## Steps to reproduce}}
> {{```}}
> {{wget -qO- 
> https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
>  | tar -xzf -}}
> {{cd spark-3.0.1-bin-hadoop3.2/}}
> {{./bin/docker-image-tool.sh build}}
> {{```}}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message

2021-02-01 Thread Terry Kim (Jira)
Terry Kim created SPARK-34317:
-

 Summary: Introduce relationTypeMismatchHint to UnresolvedTable for 
a better error message
 Key: SPARK-34317
 URL: https://issues.apache.org/jira/browse/SPARK-34317
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if 
the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" 
is resolved to a view, the error message will also contain a hint, "Please use 
ALTER VIEW instead."
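An illustrative sketch of the intended behavior, via the SQL API (the object names are made up):
{code:scala}
// "v" resolves to a view, not a table
spark.sql("CREATE OR REPLACE TEMPORARY VIEW v AS SELECT 1 AS id")

// This command expects a table, so it fails to resolve; with the proposed hint the
// error message is expected to additionally say: "Please use ALTER VIEW instead."
spark.sql("ALTER TABLE v ADD COLUMNS (extra INT)")
{code}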



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s

2021-02-01 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-34316:
--

 Summary: Optional Propagation of SPARK_CONF_DIR in K8s
 Key: SPARK-34316
 URL: https://issues.apache.org/jira/browse/SPARK-34316
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.0.1
Reporter: Zhou JIANG


In shared Kubernetes clusters, Spark could be restricted from creating and 
deleting config maps in job namespaces.

It would be helpful if the currently mandatory config map creation could be 
made optional. Users may still take responsibility for handling Spark conf files 
separately. 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Tim Hughes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Hughes updated SPARK-34315:
---
Attachment: full-logs.txt

> docker-image-tool.sh debconf trying to configure kerberos
> -
>
> Key: SPARK-34315
> URL: https://issues.apache.org/jira/browse/SPARK-34315
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
> Environment: ## Full logs
>  
> {{}}{{$ bin/docker-image-tool.sh build}}
> {{Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet 
> msg.}}
> {{STEP 1: FROM openjdk:8-jre-slim}}
> {{STEP 2: ARG spark_uid=185}}
> {{--> Using cache 
> d24913e4f80a167a2682380bc0565b0eefac2e7e5b94f1491b99712e1154dd3b}}
> {{--> d24913e4f80}}
> {{STEP 3: RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' 
> /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install 
> -y bash tini libc6 libpam-modules krb5-user libnss3 && mkdir -p /opt/spark && 
> mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch 
> /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth 
> required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && 
> chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*}}
> {{+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list}}
> {{+ apt-get update}}
> {{Get:1 http://security.debian.org/debian-security buster/updates InRelease 
> [65.4 kB]}}
> {{Get:2 https://deb.debian.org/debian buster InRelease [121 kB] }}
> {{Get:3 https://deb.debian.org/debian buster-updates InRelease [51.9 kB]}}
> {{Get:4 http://security.debian.org/debian-security buster/updates/main amd64 
> Packages [271 kB]}}
> {{Get:5 https://deb.debian.org/debian buster/main amd64 Packages [7907 kB]}}
> {{Get:6 https://deb.debian.org/debian buster-updates/main amd64 Packages 
> [7848 B]}}
> {{Fetched 8426 kB in 4s (1995 kB/s) }}
> {{Reading package lists... Done}}
> {{+ ln -s /lib /lib64}}
> {{+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3}}
> {{Reading package lists... Done}}
> {{Building dependency tree }}
> {{Reading state information... Done}}
> {{bash is already the newest version (5.0-4).}}
> {{bash set to manually installed.}}
> {{libc6 is already the newest version (2.28-10).}}
> {{libc6 set to manually installed.}}
> {{libpam-modules is already the newest version (1.3.1-5).}}
> {{libpam-modules set to manually installed.}}
> {{The following package was automatically installed and is no longer 
> required:}}
> {{ lsb-base}}
> {{Use 'apt autoremove' to remove it.}}
> {{The following additional packages will be installed:}}
> {{ bind9-host geoip-database krb5-config krb5-locales libbind9-161 libcap2}}
> {{ libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}}
> {{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}}
> {{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}}
> {{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libprotobuf-c1 libsqlite3-0}}
> {{ libxml2}}
> {{Suggested packages:}}
> {{ krb5-k5tls geoip-bin krb5-doc}}
> {{The following NEW packages will be installed:}}
> {{ bind9-host geoip-database krb5-config krb5-locales krb5-user libbind9-161}}
> {{ libcap2 libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 
> libicu63}}
> {{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}}
> {{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}}
> {{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libnss3 libprotobuf-c1}}
> {{ libsqlite3-0 libxml2 tini}}
> {{0 upgraded, 32 newly installed, 0 to remove and 2 not upgraded.}}
> {{Need to get 18.1 MB of archives.}}
> {{After this operation, 61.3 MB of additional disk space will be used.}}
> {{Get:1 https://deb.debian.org/debian buster/main amd64 libcap2 amd64 
> 1:2.25-2 [17.6 kB]}}
> {{Get:2 https://deb.debian.org/debian buster/main amd64 libfstrm0 amd64 
> 0.4.0-1 [20.8 kB]}}
> {{Get:3 https://deb.debian.org/debian buster/main amd64 libgeoip1 amd64 
> 1.6.12-1 [93.1 kB]}}
> {{Get:4 https://deb.debian.org/debian buster/main amd64 libjson-c3 amd64 
> 0.12.1+ds-2+deb10u1 [27.3 kB]}}
> {{Get:5 https://deb.debian.org/debian buster/main amd64 liblmdb0 amd64 
> 0.9.22-1 [45.0 kB]}}
> {{Get:6 https://deb.debian.org/debian buster/main amd64 libprotobuf-c1 amd64 
> 1.3.1-1+b1 [26.5 kB]}}
> {{Get:7 https://deb.debian.org/debian buster/main amd64 libicu63 amd64 
> 63.1-6+deb10u1 [8300 kB]}}
> {{Get:8 https://deb.debian.org/debian buster/main amd64 libxml2 amd64 
> 2.9.4+dfsg1-7+deb10u1 [689 kB]}}
> {{Get:9 https://deb.debian.org/debian buster/main amd64 libisc1100 amd64 
> 1:9.11.5.P4+dfsg-5.1+deb10u2 [458 kB]}}
> {{Get:10 https://deb.debian.org/debian buster/main amd64 libkeyutils1 amd64 
> 1.6-6 [15.0 kB]}}
> {{Get:11 

[jira] [Updated] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Tim Hughes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Hughes updated SPARK-34315:
---
Environment: 
# 
 ## Full logs

 

{{}}{{}}

  was:
## Full logs

 

{{}}{{$ bin/docker-image-tool.sh build}}
{{Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet 
msg.}}
{{STEP 1: FROM openjdk:8-jre-slim}}
{{STEP 2: ARG spark_uid=185}}
{{--> Using cache 
d24913e4f80a167a2682380bc0565b0eefac2e7e5b94f1491b99712e1154dd3b}}
{{--> d24913e4f80}}
{{STEP 3: RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' 
/etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y 
bash tini libc6 libpam-modules krb5-user libnss3 && mkdir -p /opt/spark && 
mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch 
/opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth 
required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && 
chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*}}
{{+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list}}
{{+ apt-get update}}
{{Get:1 http://security.debian.org/debian-security buster/updates InRelease 
[65.4 kB]}}
{{Get:2 https://deb.debian.org/debian buster InRelease [121 kB] }}
{{Get:3 https://deb.debian.org/debian buster-updates InRelease [51.9 kB]}}
{{Get:4 http://security.debian.org/debian-security buster/updates/main amd64 
Packages [271 kB]}}
{{Get:5 https://deb.debian.org/debian buster/main amd64 Packages [7907 kB]}}
{{Get:6 https://deb.debian.org/debian buster-updates/main amd64 Packages [7848 
B]}}
{{Fetched 8426 kB in 4s (1995 kB/s) }}
{{Reading package lists... Done}}
{{+ ln -s /lib /lib64}}
{{+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3}}
{{Reading package lists... Done}}
{{Building dependency tree }}
{{Reading state information... Done}}
{{bash is already the newest version (5.0-4).}}
{{bash set to manually installed.}}
{{libc6 is already the newest version (2.28-10).}}
{{libc6 set to manually installed.}}
{{libpam-modules is already the newest version (1.3.1-5).}}
{{libpam-modules set to manually installed.}}
{{The following package was automatically installed and is no longer required:}}
{{ lsb-base}}
{{Use 'apt autoremove' to remove it.}}
{{The following additional packages will be installed:}}
{{ bind9-host geoip-database krb5-config krb5-locales libbind9-161 libcap2}}
{{ libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}}
{{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}}
{{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}}
{{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libprotobuf-c1 libsqlite3-0}}
{{ libxml2}}
{{Suggested packages:}}
{{ krb5-k5tls geoip-bin krb5-doc}}
{{The following NEW packages will be installed:}}
{{ bind9-host geoip-database krb5-config krb5-locales krb5-user libbind9-161}}
{{ libcap2 libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}}
{{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}}
{{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}}
{{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libnss3 libprotobuf-c1}}
{{ libsqlite3-0 libxml2 tini}}
{{0 upgraded, 32 newly installed, 0 to remove and 2 not upgraded.}}
{{Need to get 18.1 MB of archives.}}
{{After this operation, 61.3 MB of additional disk space will be used.}}
{{Get:1 https://deb.debian.org/debian buster/main amd64 libcap2 amd64 1:2.25-2 
[17.6 kB]}}
{{Get:2 https://deb.debian.org/debian buster/main amd64 libfstrm0 amd64 0.4.0-1 
[20.8 kB]}}
{{Get:3 https://deb.debian.org/debian buster/main amd64 libgeoip1 amd64 
1.6.12-1 [93.1 kB]}}
{{Get:4 https://deb.debian.org/debian buster/main amd64 libjson-c3 amd64 
0.12.1+ds-2+deb10u1 [27.3 kB]}}
{{Get:5 https://deb.debian.org/debian buster/main amd64 liblmdb0 amd64 0.9.22-1 
[45.0 kB]}}
{{Get:6 https://deb.debian.org/debian buster/main amd64 libprotobuf-c1 amd64 
1.3.1-1+b1 [26.5 kB]}}
{{Get:7 https://deb.debian.org/debian buster/main amd64 libicu63 amd64 
63.1-6+deb10u1 [8300 kB]}}
{{Get:8 https://deb.debian.org/debian buster/main amd64 libxml2 amd64 
2.9.4+dfsg1-7+deb10u1 [689 kB]}}
{{Get:9 https://deb.debian.org/debian buster/main amd64 libisc1100 amd64 
1:9.11.5.P4+dfsg-5.1+deb10u2 [458 kB]}}
{{Get:10 https://deb.debian.org/debian buster/main amd64 libkeyutils1 amd64 
1.6-6 [15.0 kB]}}
{{Get:11 https://deb.debian.org/debian buster/main amd64 libkrb5support0 amd64 
1.17-3+deb10u1 [65.8 kB]}}
{{Get:12 https://deb.debian.org/debian buster/main amd64 libk5crypto3 amd64 
1.17-3+deb10u1 [122 kB]}}
{{Get:13 https://deb.debian.org/debian buster/main amd64 libkrb5-3 amd64 
1.17-3+deb10u1 [369 kB]}}
{{Get:14 https://deb.debian.org/debian buster/main amd64 libgssapi-krb5-2 amd64 
1.17-3+deb10u1 [158 kB]}}
{{Get:15 https://deb.debian.org/debian buster/main amd64 libdns1104 amd64 
1:9.11.5.P4+dfsg-5.1+deb10u2 [1223 kB]}}

[jira] [Created] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos

2021-02-01 Thread Tim Hughes (Jira)
Tim Hughes created SPARK-34315:
--

 Summary: docker-image-tool.sh debconf trying to configure kerberos
 Key: SPARK-34315
 URL: https://issues.apache.org/jira/browse/SPARK-34315
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1
 Environment: ## Full logs

 

{{}}{{$ bin/docker-image-tool.sh build}}
{{Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet 
msg.}}
{{STEP 1: FROM openjdk:8-jre-slim}}
{{STEP 2: ARG spark_uid=185}}
{{--> Using cache 
d24913e4f80a167a2682380bc0565b0eefac2e7e5b94f1491b99712e1154dd3b}}
{{--> d24913e4f80}}
{{STEP 3: RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' 
/etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y 
bash tini libc6 libpam-modules krb5-user libnss3 && mkdir -p /opt/spark && 
mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch 
/opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth 
required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && 
chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*}}
{{+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list}}
{{+ apt-get update}}
{{Get:1 http://security.debian.org/debian-security buster/updates InRelease 
[65.4 kB]}}
{{Get:2 https://deb.debian.org/debian buster InRelease [121 kB] }}
{{Get:3 https://deb.debian.org/debian buster-updates InRelease [51.9 kB]}}
{{Get:4 http://security.debian.org/debian-security buster/updates/main amd64 
Packages [271 kB]}}
{{Get:5 https://deb.debian.org/debian buster/main amd64 Packages [7907 kB]}}
{{Get:6 https://deb.debian.org/debian buster-updates/main amd64 Packages [7848 
B]}}
{{Fetched 8426 kB in 4s (1995 kB/s) }}
{{Reading package lists... Done}}
{{+ ln -s /lib /lib64}}
{{+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3}}
{{Reading package lists... Done}}
{{Building dependency tree }}
{{Reading state information... Done}}
{{bash is already the newest version (5.0-4).}}
{{bash set to manually installed.}}
{{libc6 is already the newest version (2.28-10).}}
{{libc6 set to manually installed.}}
{{libpam-modules is already the newest version (1.3.1-5).}}
{{libpam-modules set to manually installed.}}
{{The following package was automatically installed and is no longer required:}}
{{ lsb-base}}
{{Use 'apt autoremove' to remove it.}}
{{The following additional packages will be installed:}}
{{ bind9-host geoip-database krb5-config krb5-locales libbind9-161 libcap2}}
{{ libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}}
{{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}}
{{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}}
{{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libprotobuf-c1 libsqlite3-0}}
{{ libxml2}}
{{Suggested packages:}}
{{ krb5-k5tls geoip-bin krb5-doc}}
{{The following NEW packages will be installed:}}
{{ bind9-host geoip-database krb5-config krb5-locales krb5-user libbind9-161}}
{{ libcap2 libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}}
{{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}}
{{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}}
{{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libnss3 libprotobuf-c1}}
{{ libsqlite3-0 libxml2 tini}}
{{0 upgraded, 32 newly installed, 0 to remove and 2 not upgraded.}}
{{Need to get 18.1 MB of archives.}}
{{After this operation, 61.3 MB of additional disk space will be used.}}
{{Get:1 https://deb.debian.org/debian buster/main amd64 libcap2 amd64 1:2.25-2 
[17.6 kB]}}
{{Get:2 https://deb.debian.org/debian buster/main amd64 libfstrm0 amd64 0.4.0-1 
[20.8 kB]}}
{{Get:3 https://deb.debian.org/debian buster/main amd64 libgeoip1 amd64 
1.6.12-1 [93.1 kB]}}
{{Get:4 https://deb.debian.org/debian buster/main amd64 libjson-c3 amd64 
0.12.1+ds-2+deb10u1 [27.3 kB]}}
{{Get:5 https://deb.debian.org/debian buster/main amd64 liblmdb0 amd64 0.9.22-1 
[45.0 kB]}}
{{Get:6 https://deb.debian.org/debian buster/main amd64 libprotobuf-c1 amd64 
1.3.1-1+b1 [26.5 kB]}}
{{Get:7 https://deb.debian.org/debian buster/main amd64 libicu63 amd64 
63.1-6+deb10u1 [8300 kB]}}
{{Get:8 https://deb.debian.org/debian buster/main amd64 libxml2 amd64 
2.9.4+dfsg1-7+deb10u1 [689 kB]}}
{{Get:9 https://deb.debian.org/debian buster/main amd64 libisc1100 amd64 
1:9.11.5.P4+dfsg-5.1+deb10u2 [458 kB]}}
{{Get:10 https://deb.debian.org/debian buster/main amd64 libkeyutils1 amd64 
1.6-6 [15.0 kB]}}
{{Get:11 https://deb.debian.org/debian buster/main amd64 libkrb5support0 amd64 
1.17-3+deb10u1 [65.8 kB]}}
{{Get:12 https://deb.debian.org/debian buster/main amd64 libk5crypto3 amd64 
1.17-3+deb10u1 [122 kB]}}
{{Get:13 https://deb.debian.org/debian buster/main amd64 libkrb5-3 amd64 
1.17-3+deb10u1 [369 kB]}}
{{Get:14 https://deb.debian.org/debian buster/main amd64 libgssapi-krb5-2 amd64 

[jira] [Commented] (SPARK-34314) Wrong discovered partition value

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276612#comment-17276612
 ] 

Apache Spark commented on SPARK-34314:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31423

> Wrong discovered partition value
> 
>
> Key: SPARK-34314
> URL: https://issues.apache.org/jira/browse/SPARK-34314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The example below portrays the issue:
> {code:scala}
>   val df = Seq((0, "AA"), (1, "-0")).toDF("id", "part")
>   df.write
> .partitionBy("part")
> .format("parquet")
> .save(path)
>   val readback = spark.read.parquet(path)
>   readback.printSchema()
>   readback.show(false)
> {code}
> It writes the partition value as a string:
> {code}
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
> ├── _SUCCESS
> ├── part=-0
> │   └── part-1-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> └── part=AA
> └── part-0-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> {code}
> *"-0"* and "AA".
> but when Spark reads data back, it transforms "-0" to "0"
> {code}
> root
>  |-- id: integer (nullable = true)
>  |-- part: string (nullable = true)
> +---++
> |id |part|
> +---++
> |0  |AA  |
> |1  |0   |
> +---++
> {code}
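
For illustration only, a minimal read-side sketch (not the fix proposed in the linked PR): disabling partition column type inference keeps the discovered value as the raw directory string, assuming the same `path` as in the description above; exact behavior may vary by Spark version.

{code:scala}
// Hypothetical workaround sketch: skip type inference for partition columns
// so the discovered value stays the literal string from the directory name.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

val readback = spark.read.parquet(path)  // same `path` the example wrote to
readback.printSchema()
readback.show(false)
// With inference disabled, `part` remains a string column and the second row
// is expected to read back as "-0" rather than "0".
{code}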



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34314) Wrong discovered partition value

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34314:


Assignee: Apache Spark

> Wrong discovered partition value
> 
>
> Key: SPARK-34314
> URL: https://issues.apache.org/jira/browse/SPARK-34314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The example below portrays the issue:
> {code:scala}
>   val df = Seq((0, "AA"), (1, "-0")).toDF("id", "part")
>   df.write
> .partitionBy("part")
> .format("parquet")
> .save(path)
>   val readback = spark.read.parquet(path)
>   readback.printSchema()
>   readback.show(false)
> {code}
> It writes the partition value as a string:
> {code}
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
> ├── _SUCCESS
> ├── part=-0
> │   └── part-1-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> └── part=AA
> └── part-0-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> {code}
> *"-0"* and "AA".
> but when Spark reads data back, it transforms "-0" to "0"
> {code}
> root
>  |-- id: integer (nullable = true)
>  |-- part: string (nullable = true)
> +---++
> |id |part|
> +---++
> |0  |AA  |
> |1  |0   |
> +---++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34314) Wrong discovered partition value

2021-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34314:


Assignee: (was: Apache Spark)

> Wrong discovered partition value
> 
>
> Key: SPARK-34314
> URL: https://issues.apache.org/jira/browse/SPARK-34314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The example below portrays the issue:
> {code:scala}
>   val df = Seq((0, "AA"), (1, "-0")).toDF("id", "part")
>   df.write
> .partitionBy("part")
> .format("parquet")
> .save(path)
>   val readback = spark.read.parquet(path)
>   readback.printSchema()
>   readback.show(false)
> {code}
> It writes the partition value as a string:
> {code}
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
> ├── _SUCCESS
> ├── part=-0
> │   └── part-1-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> └── part=AA
> └── part-0-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> {code}
> *"-0"* and "AA".
> but when Spark reads data back, it transforms "-0" to "0"
> {code}
> root
>  |-- id: integer (nullable = true)
>  |-- part: string (nullable = true)
> +---++
> |id |part|
> +---++
> |0  |AA  |
> |1  |0   |
> +---++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34314) Wrong discovered partition value

2021-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276610#comment-17276610
 ] 

Apache Spark commented on SPARK-34314:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31423

> Wrong discovered partition value
> 
>
> Key: SPARK-34314
> URL: https://issues.apache.org/jira/browse/SPARK-34314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The example below portrays the issue:
> {code:scala}
>   val df = Seq((0, "AA"), (1, "-0")).toDF("id", "part")
>   df.write
> .partitionBy("part")
> .format("parquet")
> .save(path)
>   val readback = spark.read.parquet(path)
>   readback.printSchema()
>   readback.show(false)
> {code}
> It writes the partition value as a string:
> {code}
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
> ├── _SUCCESS
> ├── part=-0
> │   └── part-1-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> └── part=AA
> └── part-0-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
> {code}
> *"-0"* and "AA".
> but when Spark reads data back, it transforms "-0" to "0"
> {code}
> root
>  |-- id: integer (nullable = true)
>  |-- part: string (nullable = true)
> +---++
> |id |part|
> +---++
> |0  |AA  |
> |1  |0   |
> +---++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


