[jira] [Updated] (SPARK-37513) date +/- interval with only day-time fields returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37513:
---
Summary: date +/- interval with only day-time fields returns different data 
type between Spark3.2 and Spark3.1  (was: Additive expression of date and 
interval returns different data type between Spark3.2 and Spark3.1)

> date +/- interval with only day-time fields returns different data type 
> between Spark3.2 and Spark3.1
> -
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously this returned the date type; now it returns the timestamp type.
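> A quick way to observe the reported type change (illustrative sketch; the 
> result types are taken from the description above):
> {code:java}
> // Sketch: inspect the result type of date + a day-time interval
> spark.sql("select date '2011-11-11' + interval 12 hours as result").printSchema()
> // Spark 3.1 (per the description): result is of date type
> // Spark 3.2 and later (per the description): result is of timestamp type
> {code}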



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37210) An error occurred while concurrently writing to different static partitions

2021-11-30 Thread Zhen Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451582#comment-17451582
 ] 

Zhen Wang commented on SPARK-37210:
---

I made some adjustments to the test cases to make conflicts, and the resulting 
exception, more likely. For details, see  
[https://github.com/apache/spark/pull/34489]:
 # Increase the size of the inserted data.
 # Have different tasks write different amounts of data, so that task execution 
durations differ.
 # Modify the {{spark.sql.test.master}} configuration to execute tasks in 
parallel.

> An error occurred while concurrently writing to different static partitions
> ---
>
> Key: SPARK-37210
> URL: https://issues.apache.org/jira/browse/SPARK-37210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Zhen Wang
>Priority: Major
> Attachments: 
> [SPARK-37210]_Write_to_static_partition_in_dynamic_write_mode.patch, 
> image-2021-11-05-15-28-41-393.png
>
>
> An error occurs while concurrently writing to different static partitions.
> When writing to a static partition, committerOutputPath is the location path 
> of the table. When multiple tasks write to the same table concurrently, the 
> _temporary path is deleted after one task ends, causing another task to 
> fail.
>  
> test code:
>  
> {code:java}
> import java.util
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
>
> object HiveTests {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("HiveTests")
>   .enableHiveSupport()
>   .getOrCreate()
> //rows
> val users1 = new util.ArrayList[Row]()
> users1.add(Row(1, "user1", "2021-11-03", 10))
> users1.add(Row(2, "user2", "2021-11-03", 10))
> users1.add(Row(3, "user3", "2021-11-03", 10))
> //schema
> val structType = StructType(Array(
>   StructField("id", IntegerType, true),
>   StructField("name", StringType, true),
>   StructField("dt", StringType, true),
>   StructField("hour", IntegerType, true)
> ))
> spark.sql("set hive.exec.dynamic.partition=true")
> spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
> spark.sql("drop table if exists default.test")
> spark.sql(
>   """
> |create table if not exists default.test (
> |  id int,
> |  name string)
> |partitioned by (dt string, hour int)
> |stored as parquet
> |""".stripMargin)
> spark.sql("desc formatted default.test").show()
> spark.sqlContext
>   .createDataFrame(users1, structType)
>   .select("id", "name")
>   .createOrReplaceTempView("user1")
> val thread1 = new Thread(() => {
>   spark.sql("INSERT OVERWRITE TABLE test PARTITION(dt = '2021-11-03', 
> hour=10) select * from user1")
> })
> thread1.start()
> val thread2 = new Thread(() => {
>   spark.sql("INSERT OVERWRITE TABLE test PARTITION(dt = '2021-11-04', 
> hour=10) select * from user1")
> })
> thread2.start()
> thread1.join()
> thread2.join()
> spark.sql("select * from test").show()
> spark.stop()
>   }
> }
> {code}
>  
> error message:
>  
> {code:java}
> 21/11/04 19:01:21 ERROR Utils: Aborting task
> ExitCodeException exitCode=1: chmod: cannot access 
> '/data/spark-examples/spark-warehouse/test/_temporary/0/_temporary/attempt_202111041901182933014038999149736_0001_m_01_
> 4/dt=2021-11-03/hour=10/.part-1-95895b03-45d2-4ac6-806b-b76fd1dfa3dc.c000.snappy.parquet.crc':
>  No such file or directory
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
> at org.apache.hadoop.util.Shell.run(Shell.java:901)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:324)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:294)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:439)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:428)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:459)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:437)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:521)
> at 
> 

[jira] [Commented] (SPARK-37210) An error occurred while concurrently writing to different static partitions

2021-11-30 Thread Zhen Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451580#comment-17451580
 ] 

Zhen Wang commented on SPARK-37210:
---

There seem to be two bugs:
 # For an insert overwrite of a fully static partition, when the inserted data 
size is 0, the existing partition data is deleted (a sketch of this case is 
shown after this list).

 # For concurrent writes to fully static partitions, exceptions may occur in 
{{insert into}} or {{insert overwrite}}, because the temporary path 
'${tableLocation}/_temporary' is deleted in {{FileOutputCommitter.cleanupJob}}.
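
A minimal sketch of the first case, assuming the {{default.test}} table and 
{{user1}} view from the test code quoted below (illustration only; the values 
come from that test code):
{code:java}
// seed the static partition with some rows
spark.sql("INSERT OVERWRITE TABLE test PARTITION(dt='2021-11-03', hour=10) select * from user1")
// overwrite the same static partition with an empty result set;
// per bug #1 above, the existing partition data ends up deleted
spark.sql("INSERT OVERWRITE TABLE test PARTITION(dt='2021-11-03', hour=10) select * from user1 where 1 = 0")
spark.sql("select * from test where dt='2021-11-03' and hour=10").show()
{code}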

> An error occurred while concurrently writing to different static partitions
> ---
>
> Key: SPARK-37210
> URL: https://issues.apache.org/jira/browse/SPARK-37210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Zhen Wang
>Priority: Major
> Attachments: 
> [SPARK-37210]_Write_to_static_partition_in_dynamic_write_mode.patch, 
> image-2021-11-05-15-28-41-393.png
>
>
> An error occurs while concurrently writing to different static partitions.
> When writing to a static partition, committerOutputPath is the location path 
> of the table. When multiple tasks write to the same table concurrently, the 
> _temporary path is deleted after one task ends, causing another task to 
> fail.
>  
> test code:
>  
> {code:java}
> import java.util
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
>
> object HiveTests {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("HiveTests")
>   .enableHiveSupport()
>   .getOrCreate()
> //rows
> val users1 = new util.ArrayList[Row]()
> users1.add(Row(1, "user1", "2021-11-03", 10))
> users1.add(Row(2, "user2", "2021-11-03", 10))
> users1.add(Row(3, "user3", "2021-11-03", 10))
> //schema
> val structType = StructType(Array(
>   StructField("id", IntegerType, true),
>   StructField("name", StringType, true),
>   StructField("dt", StringType, true),
>   StructField("hour", IntegerType, true)
> ))
> spark.sql("set hive.exec.dynamic.partition=true")
> spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
> spark.sql("drop table if exists default.test")
> spark.sql(
>   """
> |create table if not exists default.test (
> |  id int,
> |  name string)
> |partitioned by (dt string, hour int)
> |stored as parquet
> |""".stripMargin)
> spark.sql("desc formatted default.test").show()
> spark.sqlContext
>   .createDataFrame(users1, structType)
>   .select("id", "name")
>   .createOrReplaceTempView("user1")
> val thread1 = new Thread(() => {
>   spark.sql("INSERT OVERWRITE TABLE test PARTITION(dt = '2021-11-03', 
> hour=10) select * from user1")
> })
> thread1.start()
> val thread2 = new Thread(() => {
>   spark.sql("INSERT OVERWRITE TABLE test PARTITION(dt = '2021-11-04', 
> hour=10) select * from user1")
> })
> thread2.start()
> thread1.join()
> thread2.join()
> spark.sql("select * from test").show()
> spark.stop()
>   }
> }
> {code}
>  
> error message:
>  
> {code:java}
> 21/11/04 19:01:21 ERROR Utils: Aborting task
> ExitCodeException exitCode=1: chmod: cannot access 
> '/data/spark-examples/spark-warehouse/test/_temporary/0/_temporary/attempt_202111041901182933014038999149736_0001_m_01_
> 4/dt=2021-11-03/hour=10/.part-1-95895b03-45d2-4ac6-806b-b76fd1dfa3dc.c000.snappy.parquet.crc':
>  No such file or directory
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
> at org.apache.hadoop.util.Shell.run(Shell.java:901)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:324)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:294)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:439)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:428)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:459)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:437)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:521)
> at 
> 

[jira] [Updated] (SPARK-37392) Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)"

2021-11-30 Thread Francois MARTIN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois MARTIN updated SPARK-37392:

Affects Version/s: 3.2.0

> Catalyst optimizer very time-consuming and memory-intensive with some 
> "explode(array)" 
> ---
>
> Key: SPARK-37392
> URL: https://issues.apache.org/jira/browse/SPARK-37392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Francois MARTIN
>Priority: Major
>
> The problem occurs with the simple code below:
> {code:java}
> import session.implicits._
> Seq(
>   (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", 
> "x", "x", "x", "x", "x", "x")
> ).toDF()
>   .checkpoint() // or save and reload to truncate lineage
>   .createOrReplaceTempView("sub")
> session.sql("""
>   SELECT
> *
>   FROM
>   (
> SELECT
>   EXPLODE( ARRAY( * ) ) result
> FROM
> (
>   SELECT
> _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, 
> _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
>   FROM
> sub
> )
>   )
>   WHERE
> result != ''
>   """).show() {code}
> It takes several minutes and very high Java heap usage, when it should be 
> immediate.
> It does not occur when the single integer value (1) is replaced with a string 
> value (_"x"_).
> All the time is spent in the _PruneFilters_ optimization rule.
> Not reproduced in Spark 2.4.1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-11-30 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-37326:


Assignee: Ivan Sadikov

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-11-30 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37326.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34596
[https://github.com/apache/spark/pull/34596]

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37480:
-

Assignee: Yikun Jiang

> Configurations in docs/running-on-kubernetes.md are not uptodate
> 
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37480.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34734
[https://github.com/apache/spark/pull/34734]

> Configurations in docs/running-on-kubernetes.md are not uptodate
> 
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37513) Additive expression of date and interval returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451542#comment-17451542
 ] 

Apache Spark commented on SPARK-37513:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34766

> Additive expression of date and interval returns different data type between 
> Spark3.2 and Spark3.1
> --
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously this returned the date type; now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37513) Additive expression of date and interval returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37513:


Assignee: Apache Spark

> Additive expression of date and interval returns different data type between 
> Spark3.2 and Spark3.1
> --
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously this returned the date type; now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37513) Additive expression of date and interval returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37513:


Assignee: (was: Apache Spark)

> Additive expression of date and interval returns different data type between 
> Spark3.2 and Spark3.1
> --
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously this returned the date type; now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37513) Additive expression of date and interval returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451541#comment-17451541
 ] 

Apache Spark commented on SPARK-37513:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34766

> Additive expression of date and interval returns different data type between 
> Spark3.2 and Spark3.1
> --
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously this returned the date type; now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37513) Additive expression of date and interval returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37513:
---
Summary: Additive expression of date and interval returns different data 
type between Spark3.2 and Spark3.1  (was: additive expression of date and 
interval returns different data type between Spark3.2 and Spark3.1)

> Additive expression of date and interval returns different data type between 
> Spark3.2 and Spark3.1
> --
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously this returned the date type; now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37513) additive expression of date and interval returns different data type between Spark3.2 and Spark3.1

2021-11-30 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37513:
--

 Summary: additive expression of date and interval returns 
different data type between Spark3.2 and Spark3.1
 Key: SPARK-37513
 URL: https://issues.apache.org/jira/browse/SPARK-37513
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng



{code:java}
select date '2011-11-11' + interval 12 hours;
{code}
 Previously this returned the date type; now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451530#comment-17451530
 ] 

Apache Spark commented on SPARK-37487:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34765

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-37487: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregates report twice the number of rows:
> {code}
> [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with 
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37487:


Assignee: (was: Apache Spark)

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-37487: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregates report twice the number of rows:
> {code}
> [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with 
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37487:


Assignee: Apache Spark

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-37487: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregates report twice the number of rows:
> {code}
> [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with 
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37330) Migrate ReplaceTableStatement to v2 command

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37330:


Assignee: Apache Spark

> Migrate ReplaceTableStatement to v2 command
> ---
>
> Key: SPARK-37330
> URL: https://issues.apache.org/jira/browse/SPARK-37330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37330) Migrate ReplaceTableStatement to v2 command

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451527#comment-17451527
 ] 

Apache Spark commented on SPARK-37330:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34764

> Migrate ReplaceTableStatement to v2 command
> ---
>
> Key: SPARK-37330
> URL: https://issues.apache.org/jira/browse/SPARK-37330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37330) Migrate ReplaceTableStatement to v2 command

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37330:


Assignee: (was: Apache Spark)

> Migrate ReplaceTableStatement to v2 command
> ---
>
> Key: SPARK-37330
> URL: https://issues.apache.org/jira/browse/SPARK-37330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37330) Migrate ReplaceTableStatement to v2 command

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451529#comment-17451529
 ] 

Apache Spark commented on SPARK-37330:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34764

> Migrate ReplaceTableStatement to v2 command
> ---
>
> Key: SPARK-37330
> URL: https://issues.apache.org/jira/browse/SPARK-37330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37389) Check unclosed bracketed comments

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451524#comment-17451524
 ] 

Apache Spark commented on SPARK-37389:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34763

> Check unclosed bracketed comments
> -
>
> Key: SPARK-37389
> URL: https://issues.apache.org/jira/browse/SPARK-37389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> The SQL below has an unclosed bracketed comment.
> {code:java}
> /*abc*/
> select 1 as a
> /*
> 2 as b
> /*abc*/
> , 3 as c
> /**/
> ;
> {code}
> But Spark will output:
> a
> 1
> PostgreSQL also supports this check, and outputs:
> {code:java}
> SQL Error [42601]: Unterminated block comment started at position 47 in SQL 
> /*abc*/ -- block comment
> select 1 as a
> /*
> 2 as b
> /*abc*/
> , 3 as c
> /**/
> . Expected */ sequence
> {code}
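> For comparison, a fully closed nested bracketed comment parses fine (small 
> sketch for contrast with the unclosed case above):
> {code:java}
> /* outer /* nested */ still inside the outer comment */
> select 1 as a;
> {code}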



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37389) Check unclosed bracketed comments

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451525#comment-17451525
 ] 

Apache Spark commented on SPARK-37389:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34763

> Check unclosed bracketed comments
> -
>
> Key: SPARK-37389
> URL: https://issues.apache.org/jira/browse/SPARK-37389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> The SQL below has an unclosed bracketed comment.
> {code:java}
> /*abc*/
> select 1 as a
> /*
> 2 as b
> /*abc*/
> , 3 as c
> /**/
> ;
> {code}
> But Spark will output:
> a
> 1
> PostgreSQL also supports this check, and outputs:
> {code:java}
> SQL Error [42601]: Unterminated block comment started at position 47 in SQL 
> /*abc*/ -- block comment
> select 1 as a
> /*
> 2 as b
> /*abc*/
> , 3 as c
> /**/
> . Expected */ sequence
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index

2021-11-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37512:


 Summary: Support TimedeltaIndex creation given a timedelta 
Series/Index
 Key: SPARK-37512
 URL: https://issues.apache.org/jira/browse/SPARK-37512
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


 

To solve the issues below:


{code:java}
>>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
>>> idx
TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
dtype='timedelta64[ns]', freq=None)
>>> ps.TimedeltaIndex(idx)
Traceback (most recent call last):
...
    raise TypeError("astype can not be applied to %s." % self.pretty_name)
TypeError: astype can not be applied to timedelta.
 {code}
 

 

 
{code:java}
>>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
>>> s
10          1 days 00:00:00
20   0 days 00:00:00.02
dtype: timedelta64[ns]
>>> ps.TimedeltaIndex(s)
Traceback (most recent call last):
...
    raise TypeError("astype can not be applied to %s." % self.pretty_name)
TypeError: astype can not be applied to timedelta.
 {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37376) Introduce a new DataSource V2 interface HasPartitionKey

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37376:
-

Assignee: Chao Sun

> Introduce a new DataSource V2 interface HasPartitionKey 
> 
>
> Key: SPARK-37376
> URL: https://issues.apache.org/jira/browse/SPARK-37376
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> One of the prerequisites for the feature is to allow V2 input partitions to 
> report their partition values to Spark, which can use them to check whether 
> both sides of a join are co-partitioned, and also optionally group input 
> partitions together.
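> A sketch of what such an interface could look like (an assumption for 
> illustration; see the resolving PR for the actual definition):
> {code:java}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.connector.read.InputPartition
>
> // Input partitions that can report their partition value to Spark (sketch)
> trait HasPartitionKey extends InputPartition {
>   // The partition key of this input partition; Spark can compare these values
>   // across both sides of a join and group partitions that share a key.
>   def partitionKey(): InternalRow
> }
> {code}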



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37376) Introduce a new DataSource V2 interface HasPartitionKey

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37376.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34656
[https://github.com/apache/spark/pull/34656]

> Introduce a new DataSource V2 interface HasPartitionKey 
> 
>
> Key: SPARK-37376
> URL: https://issues.apache.org/jira/browse/SPARK-37376
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> One of the prerequisites for the feature is to allow V2 input partitions to 
> report their partition values to Spark, which can use them to check whether 
> both sides of a join are co-partitioned, and also optionally group input 
> partitions together.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37490) Show hint if analyzer fails due to ANSI type coercion

2021-11-30 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-37490.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34747
[https://github.com/apache/spark/pull/34747]

> Show hint if analyzer fails due to ANSI type coercion
> -
>
> Key: SPARK-37490
> URL: https://issues.apache.org/jira/browse/SPARK-37490
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Show a hint in the error message if analysis fails only under ANSI type 
> coercion rules:
> {code:java}
> To fix the error, you might need to add explicit type casts.
> To bypass the error with lenient type coercion rules, set 
> spark.sql.ansi.enabled as false. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-11-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451496#comment-17451496
 ] 

Hyukjin Kwon commented on SPARK-37476:
--

Can you try {{java.lang.Double}}, which can legitimately be {{null}}?

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> I need a dataframe aggregation to return a nullable case class. There seems 
> to be no way to get this to work. The suggestion to wrap the result in an 
> option doesn't work either.
> First attempt, using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> This gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> I don't really understand the error; this result is not a top-level row object.
> Anyhow, taking the advice to heart and using Option, we get to the second 
> attempt, using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], 
> Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): 
> Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> This gives:
> {code:java}
> root
>  |-- $anon$4(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1493.0 (TID 1930)
> java.lang.AssertionError: index (1) should < 1
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(AggregationIterator.scala:260)
>     at 
> 

[jira] [Assigned] (SPARK-37504) pyspark should not pass all options to session states.

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37504:


Assignee: (was: Apache Spark)

> pyspark should not pass all options to session states.
> --
>
> Key: SPARK-37504
> URL: https://issues.apache.org/jira/browse/SPARK-37504
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current session.py has code like this:
> {code}
> for key, value in self._options.items():
>     session._jsparkSession.sessionState().conf().setConfString(key, value)
> {code}
> It looks like it will pass all options to an existing or newly created session.
> In the Scala code, we check whether each option is a static conf.
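> A sketch of the kind of Scala-side guard referred to above (an assumption for 
> illustration; {{options}} and {{session}} stand for the builder's state, and 
> {{SQLConf.isStaticConfigKey}} is used as the static-conf check):
> {code:java}
> import org.apache.spark.sql.internal.SQLConf
>
> // Apply only non-static confs to the session state; static confs cannot be
> // changed on an already-created session.
> options.foreach { case (k, v) =>
>   if (!SQLConf.isStaticConfigKey(k)) {
>     session.sessionState.conf.setConfString(k, v)
>   }
> }
> {code}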



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37504) pyspark should not pass all options to session states.

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451483#comment-17451483
 ] 

Apache Spark commented on SPARK-37504:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34757

> pyspark should not pass all options to session states.
> --
>
> Key: SPARK-37504
> URL: https://issues.apache.org/jira/browse/SPARK-37504
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current session.py has code like this:
> {code}
> for key, value in self._options.items():
>     session._jsparkSession.sessionState().conf().setConfString(key, value)
> {code}
> It looks like it will pass all options to an existing or newly created session.
> In the Scala code, we check whether each option is a static conf.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37504) pyspark should not pass all options to session states.

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37504:


Assignee: Apache Spark

> pyspark should not pass all options to session states.
> --
>
> Key: SPARK-37504
> URL: https://issues.apache.org/jira/browse/SPARK-37504
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> The current session.py has code like this:
> {code}
> for key, value in self._options.items():
>     session._jsparkSession.sessionState().conf().setConfString(key, value)
> {code}
> It looks like it will pass all options to an existing or newly created session.
> In the Scala code, we check whether each option is a static conf.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451482#comment-17451482
 ] 

Hyukjin Kwon commented on SPARK-37487:
--

cc [~beliefer] FYI

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-37487: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregates report twice the number of rows:
> {code}
> [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with 
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37498) test_reuse_worker_of_parallelize_range is flaky

2021-11-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451481#comment-17451481
 ] 

Hyukjin Kwon commented on SPARK-37498:
--

Maybe we should add eventually 
https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/testing/utils.py#L57
 for the time being. I suspect that the worker sometimes dies due to intensive 
CPU usage, or a lack of available file descriptors.

>  test_reuse_worker_of_parallelize_range is flaky
> 
>
> Key: SPARK-37498
> URL: https://issues.apache.org/jira/browse/SPARK-37498
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
>  
> {code:java}
> ERROR [2.132s]: test_reuse_worker_of_parallelize_range 
> (pyspark.tests.test_worker.WorkerReuseTest)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/tests/test_worker.py", line 195, in 
> test_reuse_worker_of_parallelize_range
>     self.assertTrue(pid in previous_pids)
> AssertionError: False is not true
> --
> Ran 12 tests in 22.589s
> {code}
>  
>  
> [1] https://github.com/apache/spark/runs/1182154542?check_suite_focus=true
> [2] https://github.com/apache/spark/pull/33657#issuecomment-893969310
> [3] https://github.com/Yikun/spark/runs/4362783540?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37503) Improve SparkSession startup issue

2021-11-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37503:


Assignee: angerszhu

> Improve SparkSession startup issue
> --
>
> Key: SPARK-37503
> URL: https://issues.apache.org/jira/browse/SPARK-37503
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37497:
--
Issue Type: Task  (was: Improvement)

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-11-30 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37511:
-
Description: 
Introduce TimedeltaIndex to pandas API on Spark.

Properties, functions, and basic operations of TimedeltaIndex will be supported 
in follow-up PRs.

  was:Introduce TimedeltaIndex to pandas API on Spark.


> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Introduce TimedeltaIndex to pandas API on Spark.
> Properties, functions, and basic operations of TimedeltaIndex will be 
> supported in follow-up PRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37497.
---
Fix Version/s: 3.3.0
   3.2.1
   3.1.3
   Resolution: Fixed

Issue resolved by pull request 34751
[https://github.com/apache/spark/pull/34751]

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37497:
-

Assignee: Dongjoon Hyun

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37511:


Assignee: Apache Spark

> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Introduce TimedeltaIndex to pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451472#comment-17451472
 ] 

Apache Spark commented on SPARK-37511:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34657

> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Introduce TimedeltaIndex to pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37511:


Assignee: (was: Apache Spark)

> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Introduce TimedeltaIndex to pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-11-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37511:


 Summary: Introduce TimedeltaIndex to pandas API on Spark
 Key: SPARK-37511
 URL: https://issues.apache.org/jira/browse/SPARK-37511
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Introduce TimedeltaIndex to pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37510) Support TimedeltaIndex in pandas API on Spark

2021-11-30 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37510:


 Summary: Support TimedeltaIndex in pandas API on Spark
 Key: SPARK-37510
 URL: https://issues.apache.org/jira/browse/SPARK-37510
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Since DayTimeIntervalType is supported in PySpark, we may add TimedeltaIndex 
support in pandas API on Spark accordingly.
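As background, DayTimeIntervalType already round-trips through Python's datetime.timedelta, which is what a pandas-on-Spark timedelta index would build on. A minimal sketch, assuming a Spark 3.2+ local session:

{code:python}
# Sketch only (assumes Spark >= 3.2): a day-time interval column surfaces in
# Python as datetime.timedelta values.
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import DayTimeIntervalType, StructField, StructType

spark = SparkSession.builder.master("local[2]").getOrCreate()
schema = StructType([StructField("delta", DayTimeIntervalType())])
df = spark.createDataFrame([(datetime.timedelta(days=1, hours=12),)], schema)

df.printSchema()             # delta: interval day to second
print(df.first()["delta"])   # 1 day, 12:00:00
{code}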



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37509) Improve Fallback Storage upload speed by avoiding S3 rate limiter

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37509.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34762
[https://github.com/apache/spark/pull/34762]

> Improve Fallback Storage upload speed by avoiding S3 rate limiter
> -
>
> Key: SPARK-37509
> URL: https://issues.apache.org/jira/browse/SPARK-37509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37509) Improve Fallback Storage upload speed by avoiding S3 rate limiter

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37509:
-

Assignee: Dongjoon Hyun

> Improve Fallback Storage upload speed by avoiding S3 rate limiter
> -
>
> Key: SPARK-37509
> URL: https://issues.apache.org/jira/browse/SPARK-37509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37505) mesos module is missing log4j.properties file for UT

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37505.
---
Fix Version/s: 3.3.0
   3.2.1
   3.1.3
   Resolution: Fixed

Issue resolved by pull request 34759
[https://github.com/apache/spark/pull/34759]

> mesos module is missing log4j.properties file for UT
> 
>
> Key: SPARK-37505
> URL: https://issues.apache.org/jira/browse/SPARK-37505
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>
> Run
> {code:java}
> mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests
> mvn test -pl resource-managers/mesos -Pmesos     {code}
> {code:java}
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.(FileInputStream.java:138)
>     at java.io.FileInputStream.(FileInputStream.java:93)
>     at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
>     at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
>     at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
>     at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>     at org.apache.log4j.LogManager.(LogManager.java:127)
>     at org.slf4j.impl.Log4jLoggerFactory.(Log4jLoggerFactory.java:66)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:72)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:45)
>     at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
>     at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at org.apache.spark.internal.Logging.log(Logging.scala:49)
>     at org.apache.spark.internal.Logging.log$(Logging.scala:47)
>     at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62)
>     at org.apache.spark.SparkFunSuite.(SparkFunSuite.scala:74)
>     at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.(MesosCoarseGrainedSchedulerBackendSuite.scala:43)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at java.lang.Class.newInstance(Class.java:442)
>     at 
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
>     at 
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at org.scalatest.tools.DiscoverySuite.(DiscoverySuite.scala:37)
>     at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132)
>     at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
>     at 
> 

[jira] [Assigned] (SPARK-37505) mesos module is missing log4j.properties file for UT

2021-11-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37505:
-

Assignee: Yang Jie

> mesos module is missing log4j.properties file for UT
> 
>
> Key: SPARK-37505
> URL: https://issues.apache.org/jira/browse/SPARK-37505
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Run
> {code:java}
> mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests
> mvn test -pl resource-managers/mesos -Pmesos     {code}
> {code:java}
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.(FileInputStream.java:138)
>     at java.io.FileInputStream.(FileInputStream.java:93)
>     at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
>     at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
>     at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
>     at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>     at org.apache.log4j.LogManager.(LogManager.java:127)
>     at org.slf4j.impl.Log4jLoggerFactory.(Log4jLoggerFactory.java:66)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:72)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:45)
>     at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
>     at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at org.apache.spark.internal.Logging.log(Logging.scala:49)
>     at org.apache.spark.internal.Logging.log$(Logging.scala:47)
>     at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62)
>     at org.apache.spark.SparkFunSuite.(SparkFunSuite.scala:74)
>     at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.(MesosCoarseGrainedSchedulerBackendSuite.scala:43)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at java.lang.Class.newInstance(Class.java:442)
>     at 
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
>     at 
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at org.scalatest.tools.DiscoverySuite.(DiscoverySuite.scala:37)
>     at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132)
>     at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
>     at 
> org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1482)
>     at 
> 

[jira] [Commented] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-30 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451376#comment-17451376
 ] 

Kousuke Saruta commented on SPARK-37487:


[~tanelk] Thank you for pinging me.
I think the sampling job for the global sort performs the extra CollectMetrics 
(the operations before the sort are executed twice).
Let me look into it more.
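The same double-execution effect can be reproduced at the RDD level, since range partitioning for a global sort first runs a sampling job over the child plan. A rough sketch of my own (not code from the ticket):

{code:python}
# Rough illustration (assumes a local session): sortBy() builds a RangePartitioner,
# which runs a sampling job over the parent RDD before the shuffle, so the side
# effect in map() below fires roughly twice per element.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

counter = sc.accumulator(0)

def tag(v):
    counter.add(1)
    return v

sc.parallelize(range(1000), 4).map(tag).sortBy(lambda x: x).collect()
print(counter.value)   # noticeably larger than 1000 because of the sampling pass
{code}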

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> It is best examplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-37487: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregate report twice the number of rows:
> {code}
> [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happes. Hopefully the UT can help with 
> debugging



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37509) Improve Fallback Storage upload speed by avoiding S3 rate limiter

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451308#comment-17451308
 ] 

Apache Spark commented on SPARK-37509:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34762

> Improve Fallback Storage upload speed by avoiding S3 rate limiter
> -
>
> Key: SPARK-37509
> URL: https://issues.apache.org/jira/browse/SPARK-37509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37509) Improve Fallback Storage upload speed by avoiding S3 rate limiter

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37509:


Assignee: (was: Apache Spark)

> Improve Fallback Storage upload speed by avoiding S3 rate limiter
> -
>
> Key: SPARK-37509
> URL: https://issues.apache.org/jira/browse/SPARK-37509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37509) Improve Fallback Storage upload speed by avoiding S3 rate limiter

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37509:


Assignee: Apache Spark

> Improve Fallback Storage upload speed by avoiding S3 rate limiter
> -
>
> Key: SPARK-37509
> URL: https://issues.apache.org/jira/browse/SPARK-37509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37509) Improve Fallback Storage upload speed by avoiding S3 rate limiter

2021-11-30 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37509:
-

 Summary: Improve Fallback Storage upload speed by avoiding S3 rate 
limiter
 Key: SPARK-37509
 URL: https://issues.apache.org/jira/browse/SPARK-37509
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26589) proper `median` method for spark dataframe

2021-11-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283
 ] 

Nicholas Chammas edited comment on SPARK-26589 at 11/30/21, 6:17 PM:
-

I think there is a potential solution using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].


was (Author: nchammas):
I'm going to try to implement this using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> the lack of a median function, so I am filing this feature request for a 
> proper, exact median function rather than an approximation. I am aware of the 
> difficulties that a distributed environment causes when computing a median; 
> nevertheless, I don't think those difficulties are a good enough reason to 
> drop the `median` function from the scope of Spark. I am not asking for an 
> efficient median, only an exact one.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-11-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283
 ] 

Nicholas Chammas commented on SPARK-26589:
--

I'm going to try to implement this using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> the lack of a median function, so I am filing this feature request for a 
> proper, exact median function rather than an approximation. I am aware of the 
> difficulties that a distributed environment causes when computing a median; 
> nevertheless, I don't think those difficulties are a good enough reason to 
> drop the `median` function from the scope of Spark. I am not asking for an 
> efficient median, only an exact one.
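As a baseline for what "exact median" means here, a brute-force sketch (assuming a local session; this is not the distributed algorithm linked in the comments and makes no efficiency claims):

{code:python}
# Naive exact-median sketch: sort globally, then index into the middle element(s).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(1, 12).toDF("x")            # values 1..11, exact median is 6

values = df.rdd.map(lambda r: r["x"]).sortBy(lambda v: v)
n = values.count()
indexed = values.zipWithIndex().map(lambda t: (t[1], t[0]))   # (position, value)

if n % 2 == 1:
    median = indexed.lookup(n // 2)[0]
else:
    lo, hi = indexed.lookup(n // 2 - 1)[0], indexed.lookup(n // 2)[0]
    median = (lo + hi) / 2.0

print(median)   # 6
{code}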



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames

2021-11-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-12185:
-
Labels:   (was: bulk-closed)

> Add Histogram support to Spark SQL/DataFrames
> -
>
> Key: SPARK-12185
> URL: https://issues.apache.org/jira/browse/SPARK-12185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Holden Karau
>Priority: Minor
>
> While we have the ability to compute histograms on RDDs of Doubles it would 
> be good to also directly support histograms in Spark SQL (see 
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions
>  ).
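For comparison, a sketch of the two workarounds available today (assuming a local session); the ticket asks for a first-class equivalent in Spark SQL/DataFrames:

{code:python}
# Existing workarounds, sketched: RDD.histogram() and manual DataFrame bucketing.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# 1) RDDs of doubles already expose histogram().
buckets, counts = spark.sparkContext.parallelize([1.0, 2.0, 2.5, 7.0, 9.0]).histogram(3)
print(buckets, counts)

# 2) A manual DataFrame equivalent: fixed-width buckets, then a count per bucket.
df = spark.range(0, 100).withColumn("v", F.col("id").cast("double"))
width = 10.0
df.groupBy(F.floor(F.col("v") / width).alias("bucket")).count().orderBy("bucket").show()
{code}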



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames

2021-11-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-12185:
--

Reopening this because I think it's a valid improvement that mirrors the 
existing {{RDD.histogram}} method.

> Add Histogram support to Spark SQL/DataFrames
> -
>
> Key: SPARK-12185
> URL: https://issues.apache.org/jira/browse/SPARK-12185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Holden Karau
>Priority: Minor
>  Labels: bulk-closed
>
> While we have the ability to compute histograms on RDDs of Doubles it would 
> be good to also directly support histograms in Spark SQL (see 
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions
>  ).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451203#comment-17451203
 ] 

Apache Spark commented on SPARK-37508:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34761

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> {{contains()}} is a common convenient function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}
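Until such a function exists, the stated equivalence can be exercised directly with LIKE; a minimal sketch, assuming a local session:

{code:python}
# Sketch of the equivalence described above; contains() itself is only proposed here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.sql("SELECT 'haystack with a needle inside' LIKE '%needle%' AS contains_result").show()
# |contains_result|
# |           true|
{code}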



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37508:


Assignee: Apache Spark

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> {{contains()}} is a common convenient function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37508:


Assignee: (was: Apache Spark)

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> {{contains()}} is a common convenient function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451183#comment-17451183
 ] 

angerszhu commented on SPARK-37508:
---

I'll raise a PR soon.

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> {{contains()}} is a common convenient function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37461) yarn-client mode client's appid value is null

2021-11-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37461:
--
Description: 
In yarn-client mode, *Client.appId* variable is not assigned, it is always 
`null`, in cluster mode, this variable will be assigned to the true value. In 
this patch, we assign true application id to `appId` too.

For client mode, it directly call submitApplication and return appId to 
YarnClientSchedulerBackend.buildtoYarn(). So for client mode, we only can 
assign ApplicationId to appId in submitApplication. Then since this value is 
assigned. so We don't need to add a new variable appId in 
`createContainerLaunchContext()`. and don need assign `this.appId` in `run()`.

> yarn-client mode client's appid value is null
> -
>
> Key: SPARK-37461
> URL: https://issues.apache.org/jira/browse/SPARK-37461
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>
> In yarn-client mode, *Client.appId* variable is not assigned, it is always 
> `null`, in cluster mode, this variable will be assigned to the true value. In 
> this patch, we assign true application id to `appId` too.
> For client mode, it directly call submitApplication and return appId to 
> YarnClientSchedulerBackend.buildtoYarn(). So for client mode, we only can 
> assign ApplicationId to appId in submitApplication. Then since this value is 
> assigned. so We don't need to add a new variable appId in 
> `createContainerLaunchContext()`. and don need assign `this.appId` in `run()`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37461) yarn-client mode client's appid value is null

2021-11-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37461:
--
Description: In yarn-client mode, *Client.appId* variable is not assigned, 
it is always {*}null{*}, in cluster mode, this variable will be assigned to the 
true value. In this patch, we assign true application id to `appId` too.  (was: 
In yarn-client mode, *Client.appId* variable is not assigned, it is always 
{*}null{*}, in cluster mode, this variable will be assigned to the true value. 
In this patch, we assign true application id to `appId` too.

For client mode, it directly call submitApplication and return appId to 
YarnClientSchedulerBackend.buildtoYarn(). So for client mode, we only can 
assign ApplicationId to appId in submitApplication. Then since this value is 
assigned. so We don't need to add a new variable appId in 
`createContainerLaunchContext()`. and don need assign `this.appId` in `run()`.)

> yarn-client mode client's appid value is null
> -
>
> Key: SPARK-37461
> URL: https://issues.apache.org/jira/browse/SPARK-37461
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>
> In yarn-client mode, *Client.appId* variable is not assigned, it is always 
> {*}null{*}, in cluster mode, this variable will be assigned to the true 
> value. In this patch, we assign true application id to `appId` too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37461) yarn-client mode client's appid value is null

2021-11-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37461:
--
Description: 
In yarn-client mode, *Client.appId* variable is not assigned, it is always 
{*}null{*}, in cluster mode, this variable will be assigned to the true value. 
In this patch, we assign true application id to `appId` too.

For client mode, it directly call submitApplication and return appId to 
YarnClientSchedulerBackend.buildtoYarn(). So for client mode, we only can 
assign ApplicationId to appId in submitApplication. Then since this value is 
assigned. so We don't need to add a new variable appId in 
`createContainerLaunchContext()`. and don need assign `this.appId` in `run()`.

  was:
In yarn-client mode, *Client.appId* variable is not assigned, it is always 
`null`, in cluster mode, this variable will be assigned to the true value. In 
this patch, we assign true application id to `appId` too.

For client mode, it directly call submitApplication and return appId to 
YarnClientSchedulerBackend.buildtoYarn(). So for client mode, we only can 
assign ApplicationId to appId in submitApplication. Then since this value is 
assigned. so We don't need to add a new variable appId in 
`createContainerLaunchContext()`. and don need assign `this.appId` in `run()`.


> yarn-client mode client's appid value is null
> -
>
> Key: SPARK-37461
> URL: https://issues.apache.org/jira/browse/SPARK-37461
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>
> In yarn-client mode, *Client.appId* variable is not assigned, it is always 
> {*}null{*}, in cluster mode, this variable will be assigned to the true 
> value. In this patch, we assign true application id to `appId` too.
> For client mode, it directly call submitApplication and return appId to 
> YarnClientSchedulerBackend.buildtoYarn(). So for client mode, we only can 
> assign ApplicationId to appId in submitApplication. Then since this value is 
> assigned. so We don't need to add a new variable appId in 
> `createContainerLaunchContext()`. and don need assign `this.appId` in `run()`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37461) yarn-client mode client's appid value is null

2021-11-30 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451172#comment-17451172
 ] 

Thomas Graves commented on SPARK-37461:
---

[~angerszhuuu] please add a description to this issue.

> yarn-client mode client's appid value is null
> -
>
> Key: SPARK-37461
> URL: https://issues.apache.org/jira/browse/SPARK-37461
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames

2021-11-30 Thread Joao Miguel Pinto (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451155#comment-17451155
 ] 

Joao Miguel Pinto edited comment on SPARK-10925 at 11/30/21, 2:34 PM:
--

Hi guys. I had the same problem, and the only way I found to solve it was to 
rename the columns used as join keys back to their original names. It sounds 
weird, but it worked for me :D.

I managed that using an alias:

SELECT field1 as field1 FROM tableX;

where field1 is a key that is later used in the hypothetical join.

Hope this helps.

Regards

 


was (Author: JIRAUSER280995):
Hi guys. I had the same problem and the only way I found to figure out was 
rename the columns used as join keys to the original name. It's sounds weird 
but worked for me :D. 

I managed that using alias:

SELECT field1 as field1 FROM tableX; 

Where field1 is a key used after for an hypothetic Join.

Hope I can help.

Regards

 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
>  Labels: bulk-closed
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> 

[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames

2021-11-30 Thread Joao Miguel Pinto (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451155#comment-17451155
 ] 

Joao Miguel Pinto edited comment on SPARK-10925 at 11/30/21, 2:20 PM:
--

Hi guys. I had the same problem and the only way I found to figure out was 
rename the columns used as join keys to the original name. It's sounds weird 
but worked for me :D. 

I managed that using alias:

SELECT field1 as field1 FROM tableX; 

Where field1 is a key used after for an hypothetic Join.

Hope I can help.

Regards

 


was (Author: JIRAUSER280995):
Hi guys. I had the same problem and the only way I found to figure out was 
rename the columns used as join keys for the original name. It's sounds weird 
but worked for me :D. 

I managed that using alias:

SELECT field1 as field1 FROM tableX; 

Where field1 is a key used after for an hypothetic Join.

Hope I can help.

Regards

 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
>  Labels: bulk-closed
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> 

[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2021-11-30 Thread Joao Miguel Pinto (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451155#comment-17451155
 ] 

Joao Miguel Pinto commented on SPARK-10925:
---

Hi guys. I had the same problem and the only way I found to figure out was 
rename the columns used as join keys for the original name. It's sounds weird 
but worked for me :D. 

I managed that using alias:

SELECT field1 as field1 FROM tableX; 

Where field1 is a key used after for an hypothetic Join.

Hope I can help.

Regards
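A rough PySpark rendering of that workaround, with hypothetical column names (a sketch of what the commenter describes, not a guaranteed fix):

{code:python}
# Hypothetical sketch of the workaround above: re-select the join key under an
# explicit alias before re-joining the aggregate back onto the original frame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["name", "value"])

agg = df.groupBy("name").agg(F.sum("value").alias("total"))
# The "SELECT field1 AS field1" trick: alias the key back to its own name.
agg_aliased = agg.select(F.col("name").alias("name"), "total")

df.join(agg_aliased, on="name").show()
{code}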

 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
>  Labels: bulk-closed
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at 

[jira] [Commented] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451149#comment-17451149
 ] 

Max Gekk commented on SPARK-37508:
--

[~angerszhuuu] [~xiaopenglei] @jiaan.geng If you would like to work on this, 
please, leave a comment here.
 

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> {{contains()}} is a common convenience function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake 
> Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451149#comment-17451149
 ] 

Max Gekk edited comment on SPARK-37508 at 11/30/21, 2:09 PM:
-

[~angerszhuuu] [~xiaopenglei] [~beliefer]  If you would like to work on this, 
please, leave a comment here.
 


was (Author: maxgekk):
[~angerszhuuu] [~xiaopenglei] @jiaan.geng If you would like to work on this, 
please, leave a comment here.
 

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> {{contains()}} is a common convenience function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake 
> Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-37508:
-
Description: 
{{contains()}} is a common convenience function supported by a number of 
database systems:
 # 
[https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
 # [CONTAINS — Snowflake 
Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]

Proposed syntax:
{code:java}
contains(haystack, needle)
return type: boolean {code}

It is semantically equivalent to {{haystack like '%needle%'}}

  was:
{{contains()}} is a common convenient function supported by a number of 
database systems:

[!https://www.gstatic.com/devrel-devsite/prod/v7824338a80ec44166704fb131e1860a66ed443b0ce02adfe8171907535d63bde/cloud/images/favicons/onecloud/super_cloud.png!Expressions,
 functions, and operators  |  BigQuery  |  Google 
Cloud|https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]

[!https://docs.snowflake.com/en/_static/apple-touch-icon.png!CONTAINS — 
Snowflake 
Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]

Proposed syntax:
{{contains(haystack, needle)}}
{{return type: boolean}}
It is semantically equivalent to {{haystack like '%needle%'}}


> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> {{contains()}} is a common convenience function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [CONTAINS — Snowflake 
> Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37508) Add CONTAINS() function

2021-11-30 Thread Max Gekk (Jira)
Max Gekk created SPARK-37508:


 Summary: Add CONTAINS() function
 Key: SPARK-37508
 URL: https://issues.apache.org/jira/browse/SPARK-37508
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


{{contains()}} is a common convenience function supported by a number of 
database systems:

[Expressions, functions, and operators | BigQuery | Google 
Cloud|https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]

[CONTAINS — Snowflake 
Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]

Proposed syntax:
{{contains(haystack, needle)}}
{{return type: boolean}}
It is semantically equivalent to {{haystack like '%needle%'}}
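
A minimal, illustrative sketch of how the proposed check can already be expressed today with the SQL {{LIKE}} form and the existing DataFrame {{Column.contains}} method; the sample data and column name are made up for the example:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("contains-sketch").getOrCreate()
import spark.implicits._

val df = Seq("Spark SQL", "Flink SQL", "Presto").toDF("haystack")
df.createOrReplaceTempView("t")

// SQL today: the semantic equivalent spelled with LIKE.
spark.sql("SELECT haystack, haystack LIKE '%SQL%' AS has_sql FROM t").show()

// DataFrame API today: Column.contains performs the same substring check.
df.select($"haystack", $"haystack".contains("SQL").as("has_sql")).show()
{code}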



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37507) Add the TO_BINARY() function

2021-11-30 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451138#comment-17451138
 ] 

Max Gekk commented on SPARK-37507:
--

[~beliefer] Would you like to work on this? If so, please, leave a comment here.

> Add the TO_BINARY() function
> 
>
> Key: SPARK-37507
> URL: https://issues.apache.org/jira/browse/SPARK-37507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> to_binary(expr, fmt) is a common function available in many other systems to 
> provide a unified entry for string to binary data conversion, where fmt can 
> be utf8, base64, hex and base2 (or whatever the reverse operation 
> to_char() supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37507) Add the TO_BINARY() function

2021-11-30 Thread Max Gekk (Jira)
Max Gekk created SPARK-37507:


 Summary: Add the TO_BINARY() function
 Key: SPARK-37507
 URL: https://issues.apache.org/jira/browse/SPARK-37507
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


to_binary(expr, fmt) is a common function available in many other systems to 
provide a unified entry for string to binary data conversion, where fmt can be 
utf8, base64, hex and base2 (or whatever the reverse operation 
to_char() supports).

[https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]

[https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]

[https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]

[https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]

Related Spark functions: unbase64, unhex
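
A small sketch, assuming Spark 3.x, of the existing building blocks that a unified to_binary(expr, fmt) would wrap; the literal values below are illustrative only:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, encode, unbase64, unhex}

val spark = SparkSession.builder().master("local[*]").appName("to-binary-sketch").getOrCreate()
import spark.implicits._

// "U3Bhcms=" is the base64 form of "Spark"; "537061726b" is its hex form.
val df = Seq(("U3Bhcms=", "537061726b", "Spark")).toDF("b64", "hex", "utf8")

df.select(
  unbase64(col("b64")).as("from_base64"),       // fmt = 'base64'
  unhex(col("hex")).as("from_hex"),             // fmt = 'hex'
  encode(col("utf8"), "UTF-8").as("from_utf8")  // fmt = 'utf8'
).printSchema()  // all three result columns are of binary type
{code}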



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37506) Change the never changed 'var' to 'val'

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37506:


Assignee: Apache Spark

> Change the never changed 'var' to 'val'
> ---
>
> Key: SPARK-37506
> URL: https://issues.apache.org/jira/browse/SPARK-37506
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Similar to SPARK-33346, there are still some `var`s that can be replaced by 
> `val` in the current code base.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37506) Change the never changed 'var' to 'val'

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37506:


Assignee: (was: Apache Spark)

> Change the never changed 'var' to 'val'
> ---
>
> Key: SPARK-37506
> URL: https://issues.apache.org/jira/browse/SPARK-37506
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar to SPARK-33346, there are still some `var`s that can be replaced by 
> `val` in the current code base.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37506) Change the never changed 'var' to 'val'

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451086#comment-17451086
 ] 

Apache Spark commented on SPARK-37506:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34760

> Change the never changed 'var' to 'val'
> ---
>
> Key: SPARK-37506
> URL: https://issues.apache.org/jira/browse/SPARK-37506
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar to SPARK-33346, there are still some `var`s that can be replaced by 
> `val` in the current code base.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37506) Change the never changed 'var' to 'val'

2021-11-30 Thread Yang Jie (Jira)
Yang Jie created SPARK-37506:


 Summary: Change the never changed 'var' to 'val'
 Key: SPARK-37506
 URL: https://issues.apache.org/jira/browse/SPARK-37506
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


Similar to SPARK-33346, there are still some `var`s that can be replaced by 
`val` in the current code base.
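
An illustrative example of the change, not a specific occurrence from the code base:
{code:java}
import org.apache.spark.SparkConf

// Before: declared mutable although it is assigned exactly once.
// var conf = new SparkConf().setAppName("example")

// After: an immutable binding expresses the intent and lets the compiler enforce it.
val conf: SparkConf = new SparkConf().setAppName("example")
{code}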

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37505) mesos module is missing log4j.properties file for UT

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37505:


Assignee: (was: Apache Spark)

> mesos module is missing log4j.properties file for UT
> 
>
> Key: SPARK-37505
> URL: https://issues.apache.org/jira/browse/SPARK-37505
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Run
> {code:java}
> mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests
> mvn test -pl resource-managers/mesos -Pmesos     {code}
> {code:java}
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.(FileInputStream.java:138)
>     at java.io.FileInputStream.(FileInputStream.java:93)
>     at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
>     at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
>     at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
>     at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>     at org.apache.log4j.LogManager.(LogManager.java:127)
>     at org.slf4j.impl.Log4jLoggerFactory.(Log4jLoggerFactory.java:66)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:72)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:45)
>     at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
>     at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at org.apache.spark.internal.Logging.log(Logging.scala:49)
>     at org.apache.spark.internal.Logging.log$(Logging.scala:47)
>     at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62)
>     at org.apache.spark.SparkFunSuite.(SparkFunSuite.scala:74)
>     at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.(MesosCoarseGrainedSchedulerBackendSuite.scala:43)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at java.lang.Class.newInstance(Class.java:442)
>     at 
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
>     at 
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at org.scalatest.tools.DiscoverySuite.(DiscoverySuite.scala:37)
>     at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132)
>     at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
>     at 
> org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1482)
>     at 
> 

[jira] [Assigned] (SPARK-37505) mesos module is missing log4j.properties file for UT

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37505:


Assignee: Apache Spark

> mesos module is missing log4j.properties file for UT
> 
>
> Key: SPARK-37505
> URL: https://issues.apache.org/jira/browse/SPARK-37505
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Run
> {code:java}
> mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests
> mvn test -pl resource-managers/mesos -Pmesos     {code}
> {code:java}
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.(FileInputStream.java:138)
>     at java.io.FileInputStream.(FileInputStream.java:93)
>     at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
>     at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
>     at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
>     at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>     at org.apache.log4j.LogManager.(LogManager.java:127)
>     at org.slf4j.impl.Log4jLoggerFactory.(Log4jLoggerFactory.java:66)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:72)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:45)
>     at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
>     at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at org.apache.spark.internal.Logging.log(Logging.scala:49)
>     at org.apache.spark.internal.Logging.log$(Logging.scala:47)
>     at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62)
>     at org.apache.spark.SparkFunSuite.(SparkFunSuite.scala:74)
>     at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.(MesosCoarseGrainedSchedulerBackendSuite.scala:43)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at java.lang.Class.newInstance(Class.java:442)
>     at 
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
>     at 
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at org.scalatest.tools.DiscoverySuite.(DiscoverySuite.scala:37)
>     at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132)
>     at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
>     at 
> org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1482)
>     at 
> 

[jira] [Commented] (SPARK-37505) mesos module is missing log4j.properties file for UT

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451077#comment-17451077
 ] 

Apache Spark commented on SPARK-37505:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34759

> mesos module is missing log4j.properties file for UT
> 
>
> Key: SPARK-37505
> URL: https://issues.apache.org/jira/browse/SPARK-37505
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Run
> {code:java}
> mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests
> mvn test -pl resource-managers/mesos -Pmesos     {code}
> {code:java}
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.(FileInputStream.java:138)
>     at java.io.FileInputStream.(FileInputStream.java:93)
>     at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
>     at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
>     at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
>     at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>     at org.apache.log4j.LogManager.(LogManager.java:127)
>     at org.slf4j.impl.Log4jLoggerFactory.(Log4jLoggerFactory.java:66)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:72)
>     at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:45)
>     at 
> org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
>     at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
>     at 
> org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
>     at 
> org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
>     at org.apache.spark.internal.Logging.log(Logging.scala:49)
>     at org.apache.spark.internal.Logging.log$(Logging.scala:47)
>     at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62)
>     at org.apache.spark.SparkFunSuite.(SparkFunSuite.scala:74)
>     at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.(MesosCoarseGrainedSchedulerBackendSuite.scala:43)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at java.lang.Class.newInstance(Class.java:442)
>     at 
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
>     at 
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>     at org.scalatest.tools.DiscoverySuite.(DiscoverySuite.scala:37)
>     at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132)
>     at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
>     at 
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
>     at 
> 

[jira] [Created] (SPARK-37505) mesos module is missing log4j.properties file for UT

2021-11-30 Thread Yang Jie (Jira)
Yang Jie created SPARK-37505:


 Summary: mesos module is missing log4j.properties file for UT
 Key: SPARK-37505
 URL: https://issues.apache.org/jira/browse/SPARK-37505
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Tests
Affects Versions: 3.3.0
Reporter: Yang Jie


Run
{code:java}
mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests
mvn test -pl resource-managers/mesos -Pmesos     {code}
{code:java}
log4j:ERROR Could not read configuration file from URL 
[file:src/test/resources/log4j.properties].
java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.(FileInputStream.java:138)
    at java.io.FileInputStream.(FileInputStream.java:93)
    at 
sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at 
sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
    at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.(LogManager.java:127)
    at org.slf4j.impl.Log4jLoggerFactory.(Log4jLoggerFactory.java:66)
    at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:72)
    at org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:45)
    at 
org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222)
    at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127)
    at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111)
    at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
    at 
org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
    at 
org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
    at 
org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
    at 
org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62)
    at org.apache.spark.internal.Logging.log(Logging.scala:49)
    at org.apache.spark.internal.Logging.log$(Logging.scala:47)
    at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62)
    at org.apache.spark.SparkFunSuite.(SparkFunSuite.scala:74)
    at 
org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.(MesosCoarseGrainedSchedulerBackendSuite.scala:43)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at 
org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66)
    at 
org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.scalatest.tools.DiscoverySuite.(DiscoverySuite.scala:37)
    at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132)
    at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226)
    at 
org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
    at 
org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
    at 
org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1482)
    at 
org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
    at org.scalatest.tools.Runner$.main(Runner.scala:775)
    at org.scalatest.tools.Runner.main(Runner.scala)
log4j:ERROR Ignoring configuration file 
[file:src/test/resources/log4j.properties].
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
{code}
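
A possible shape for the missing file, assuming the test-logging pattern used elsewhere in the build; the appender name and log path below are illustrative, not the actual fix:
{code:java}
# src/test/resources/log4j.properties (sketch)
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
{code}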



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37501:


Assignee: Apache Spark

> CREATE/REPLACE TABLE should qualify location for v2 command
> ---
>
> Key: SPARK-37501
> URL: https://issues.apache.org/jira/browse/SPARK-37501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37501:


Assignee: (was: Apache Spark)

> CREATE/REPLACE TABLE should qualify location for v2 command
> ---
>
> Key: SPARK-37501
> URL: https://issues.apache.org/jira/browse/SPARK-37501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451057#comment-17451057
 ] 

Apache Spark commented on SPARK-37501:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34758

> CREATE/REPLACE TABLE should qualify location for v2 command
> ---
>
> Key: SPARK-37501
> URL: https://issues.apache.org/jira/browse/SPARK-37501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451005#comment-17451005
 ] 

Apache Spark commented on SPARK-37054:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34757

> Porting "pandas API on Spark: Internals" to PySpark docs.
> -
>
> Key: SPARK-37054
> URL: https://issues.apache.org/jira/browse/SPARK-37054
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We have a 
> [document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing]
>  on pandas API on Spark internal features, apart from the official PySpark 
> documentation.
>  
> Since pandas API on Spark is officially released in Spark 3.2, it would be good 
> to port this internal document to the official PySpark documentation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37504) pyspark should not pass all options to session states.

2021-11-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37504:
--
Description: 
The current session.py
has such code:
{code}
for key, value in self._options.items():
    session._jsparkSession.sessionState().conf().setConfString(key, value)
{code}
It looks like it will pass all options to an existing or newly created session. 
In the Scala code, we check whether it is a static conf.
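
For illustration, a minimal Scala sketch of the static-conf check that the direct sessionState().conf() write above bypasses; the conf key below is just one example of a static entry:
{code:java}
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("static-conf-check").getOrCreate()

try {
  // spark.sql.warehouse.dir is a static SQL conf, so the public runtime-conf
  // API refuses to modify it on an existing session.
  spark.conf.set("spark.sql.warehouse.dir", "/tmp/other-warehouse")
} catch {
  case e: AnalysisException => println(s"Rejected: ${e.getMessage}")
}
{code}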


> pyspark should not pass all options to session states.
> --
>
> Key: SPARK-37504
> URL: https://issues.apache.org/jira/browse/SPARK-37504
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The current session.py
> has such code:
> {code}
> for key, value in self._options.items():
>     session._jsparkSession.sessionState().conf().setConfString(key, value)
> {code}
> It looks like it will pass all options to an existing or newly created session. 
> In the Scala code, we check whether it is a static conf.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37504) pyspark should not pass all options to session states.

2021-11-30 Thread angerszhu (Jira)
angerszhu created SPARK-37504:
-

 Summary: pyspark should not pass all options to session states.
 Key: SPARK-37504
 URL: https://issues.apache.org/jira/browse/SPARK-37504
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34568) enableHiveSupport should ignore if SparkContext is created

2021-11-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-34568:
--
Parent: SPARK-37503
Issue Type: Sub-task  (was: Bug)

> enableHiveSupport should ignore if SparkContext is created
> --
>
> Key: SPARK-34568
> URL: https://issues.apache.org/jira/browse/SPARK-34568
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> If SparkContext is created, 
> SparkSession.builder.enableHiveSupport().getOrCreate() won't load hive 
> metadata.
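
A hedged sketch of the scenario in the description (assuming spark-hive is on the classpath; master and app name are illustrative):
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// A SparkContext is created first...
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("pre-created"))

// ...and only afterwards is a session requested with Hive support.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Per the description above, Hive metadata is not loaded in this case; inspecting
// the catalog implementation shows what actually took effect.
println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))
{code}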



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37346) Link migration guide for structured stream.

2021-11-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37346.
--
  Assignee: angerszhu
Resolution: Fixed

> Link migration guide for structured stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37346) Link migration guide for structured stream.

2021-11-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37346:
-
Fix Version/s: 3.1.3
   3.2.1

> Link migration guide for structured stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37291) PySpark init SparkSession should copy conf to sharedState

2021-11-30 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37291:
--
Parent: SPARK-37503
Issue Type: Sub-task  (was: Bug)

>  PySpark init SparkSession should copy conf to sharedState
> --
>
> Key: SPARK-37291
> URL: https://issues.apache.org/jira/browse/SPARK-37291
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> PySpark SparkSession.config should respect enableHiveSupport



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37503) Improve SparkSession startup issue

2021-11-30 Thread angerszhu (Jira)
angerszhu created SPARK-37503:
-

 Summary: Improve SparkSession startup issue
 Key: SPARK-37503
 URL: https://issues.apache.org/jira/browse/SPARK-37503
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37502) Support cast aware output partitioning and required if it can up cast

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37502:


Assignee: Apache Spark

> Support cast aware output partitioning and required if it can up cast
> -
>
> Key: SPARK-37502
> URL: https://issues.apache.org/jira/browse/SPARK-37502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> If a `Cast` is an up cast, it should not involve any truncation, precision 
> loss, or possible runtime failures. So the output partitioning should be the 
> same with/without the `Cast` if the `Cast` is an up cast.
> Let's say we have a query:
> {code:java}
> -- v1: c1 int
> -- v2: c2 long
> SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
> v2.c2
> {code}
> The executed plan contains three shuffle nodes which looks like:
> {code:java}
> SortMergeJoin
>   Exchange(cast(c1 as bigint))
> HashAggregate
>   Exchange(c1)
> Scan v1
>   Exchange(c2)
> Scan v2
> {code}
> We can simplify the plan using two shuffle nodes:
> {code:java}
> SortMergeJoin
>   HashAggregate
> Exchange(c1)
>   Scan v1
>   Exchange(c2)
> Scan v2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37502) Support cast aware output partitioning and required if it can up cast

2021-11-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37502:


Assignee: (was: Apache Spark)

> Support cast aware output partitioning and required if it can up cast
> -
>
> Key: SPARK-37502
> URL: https://issues.apache.org/jira/browse/SPARK-37502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> If a `Cast` is an up cast, it should not involve any truncation, precision 
> loss, or possible runtime failures. So the output partitioning should be the 
> same with/without the `Cast` if the `Cast` is an up cast.
> Let's say we have a query:
> {code:java}
> -- v1: c1 int
> -- v2: c2 long
> SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
> v2.c2
> {code}
> The executed plan contains three shuffle nodes which looks like:
> {code:java}
> SortMergeJoin
>   Exchange(cast(c1 as bigint))
> HashAggregate
>   Exchange(c1)
> Scan v1
>   Exchange(c2)
> Scan v2
> {code}
> We can simplify the plan using two shuffle nodes:
> {code:java}
> SortMergeJoin
>   HashAggregate
> Exchange(c1)
>   Scan v1
>   Exchange(c2)
> Scan v2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37502) Support cast aware output partitioning and required if it can up cast

2021-11-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450966#comment-17450966
 ] 

Apache Spark commented on SPARK-37502:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34755

> Support cast aware output partitioning and required if it can up cast
> -
>
> Key: SPARK-37502
> URL: https://issues.apache.org/jira/browse/SPARK-37502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> If a `Cast` is an up cast, it should not involve any truncation, precision 
> loss, or possible runtime failures. So the output partitioning should be the 
> same with/without the `Cast` if the `Cast` is an up cast.
> Let's say we have a query:
> {code:java}
> -- v1: c1 int
> -- v2: c2 long
> SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
> v2.c2
> {code}
> The executed plan contains three shuffle nodes which looks like:
> {code:java}
> SortMergeJoin
>   Exchange(cast(c1 as bigint))
> HashAggregate
>   Exchange(c1)
> Scan v1
>   Exchange(c2)
> Scan v2
> {code}
> We can simplify the plan using two shuffle nodes:
> {code:java}
> SortMergeJoin
>   HashAggregate
> Exchange(c1)
>   Scan v1
>   Exchange(c2)
> Scan v2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37502) Support cast aware output partitioning and required if it can up cast

2021-11-30 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37502:
--
Description: 
If a `Cast` is an up cast, it should not involve any truncation, precision 
loss, or possible runtime failures. So the output partitioning should be the 
same with/without the `Cast` if the `Cast` is an up cast.

Let's say we have a query:
{code:java}
-- v1: c1 int
-- v2: c2 long

SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
v2.c2
{code}
The executed plan contains three shuffle nodes which looks like:
{code:java}
SortMergeJoin
  Exchange(cast(c1 as bigint))
HashAggregate
  Exchange(c1)
Scan v1
  Exchange(c2)
Scan v2
{code}
We can simplify the plan using two shuffle nodes:
{code:java}
SortMergeJoin
  HashAggregate
Exchange(c1)
  Scan v1
  Exchange(c2)
Scan v2
{code}

  was:
if a `Cast` is up cast then it should be without any truncating or precision 
lose or possible runtime failures. So the output partitioning should be same 
with/without `Cast` if the `Cast` is up cast.

Let's say we have a query:
{code:java}
-- v1: c1 int
-- v2: c2 long

SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
v2.c2
{code}
The executed plan contains three shuffle nodes which looks like:
{code:java}
SortMergeJoin
  Exchange(cast(c1 as bigint))
HashAggregate
  Exchange(c1)
Scan v1
  Exchange(c2)
Scan v2
{code}

We can simply the plan using two shuffle nodes:
{code:java}
SortMergeJoin
  HashAggregate
Exchange(c1)
  Scan v1
  Exchange(c2)
Scan v2
{code}


> Support cast aware output partitioning and required if it can up cast
> -
>
> Key: SPARK-37502
> URL: https://issues.apache.org/jira/browse/SPARK-37502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> If a `Cast` is an up cast, it should not involve any truncation, precision 
> loss, or possible runtime failures. So the output partitioning should be the 
> same with/without the `Cast` if the `Cast` is an up cast.
> Let's say we have a query:
> {code:java}
> -- v1: c1 int
> -- v2: c2 long
> SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
> v2.c2
> {code}
> The executed plan contains three shuffle nodes which looks like:
> {code:java}
> SortMergeJoin
>   Exchange(cast(c1 as bigint))
> HashAggregate
>   Exchange(c1)
> Scan v1
>   Exchange(c2)
> Scan v2
> {code}
> We can simplify the plan using two shuffle nodes:
> {code:java}
> SortMergeJoin
>   HashAggregate
> Exchange(c1)
>   Scan v1
>   Exchange(c2)
> Scan v2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command

2021-11-30 Thread PengLei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PengLei updated SPARK-37501:

Summary: CREATE/REPLACE TABLE should qualify location for v2 command  (was: 
CREATE TABLE should qualify location for v2 command)

> CREATE/REPLACE TABLE should qualify location for v2 command
> ---
>
> Key: SPARK-37501
> URL: https://issues.apache.org/jira/browse/SPARK-37501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37502) Support cast aware output partitioning and required if it can up cast

2021-11-30 Thread XiDuo You (Jira)
XiDuo You created SPARK-37502:
-

 Summary: Support cast aware output partitioning and required if it 
can up cast
 Key: SPARK-37502
 URL: https://issues.apache.org/jira/browse/SPARK-37502
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


If a `Cast` is an up cast, it should not involve any truncation, precision 
loss, or possible runtime failures. So the output partitioning should be the 
same with/without the `Cast` if the `Cast` is an up cast.

Let's say we have a query:
{code:java}
-- v1: c1 int
-- v2: c2 long

SELECT * FROM v2 JOIN (SELECT c1, count(*) FROM v1 GROUP BY c1) v1 ON v1.c1 = 
v2.c2
{code}
The executed plan contains three shuffle nodes which looks like:
{code:java}
SortMergeJoin
  Exchange(cast(c1 as bigint))
HashAggregate
  Exchange(c1)
Scan v1
  Exchange(c2)
Scan v2
{code}

We can simplify the plan using two shuffle nodes:
{code:java}
SortMergeJoin
  HashAggregate
Exchange(c1)
  Scan v1
  Exchange(c2)
Scan v2
{code}
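
A runnable sketch (illustrative table and column names, matching the query above) for reproducing the three-shuffle plan so the extra Exchange introduced by the up cast can be inspected:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("upcast-partitioning").getOrCreate()

// Force a sort-merge join so the exchanges are visible in the plan.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// v1.c1 is int, v2.c2 is long, as in the description.
spark.range(0, 100).selectExpr("cast(id as int) as c1").createOrReplaceTempView("v1")
spark.range(0, 1000).selectExpr("id as c2").createOrReplaceTempView("v2")

spark.sql(
  """SELECT * FROM v2
    |JOIN (SELECT c1, count(*) AS cnt FROM v1 GROUP BY c1) v1
    |ON v1.c1 = v2.c2""".stripMargin).explain()
{code}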



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests

2021-11-30 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450947#comment-17450947
 ] 

dch nguyen edited comment on SPARK-37381 at 11/30/21, 8:33 AM:
---

[~xiaopenglei] it's ok.  I'll try the other one. Thank you.


was (Author: dchvn):
[~xiaopenglei] it's ok. Thank you.

> Unify v1 and v2 SHOW CREATE TABLE  tests
> 
>
> Key: SPARK-37381
> URL: https://issues.apache.org/jira/browse/SPARK-37381
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests

2021-11-30 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450947#comment-17450947
 ] 

dch nguyen commented on SPARK-37381:


[~xiaopenglei] it's ok. Thank you.

> Unify v1 and v2 SHOW CREATE TABLE  tests
> 
>
> Key: SPARK-37381
> URL: https://issues.apache.org/jira/browse/SPARK-37381
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


