[jira] [Created] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-06-11 Thread ulysses you (Jira)
ulysses you created SPARK-31975:
---

 Summary: Throw user facing error when use WindowFunction directly
 Key: SPARK-31975
 URL: https://issues.apache.org/jira/browse/SPARK-31975
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you









[jira] [Commented] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133984#comment-17133984
 ] 

Apache Spark commented on SPARK-26905:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28807

> Revisit reserved/non-reserved keywords based on the ANSI SQL standard
> -
>
> Key: SPARK-26905
> URL: https://issues.apache.org/jira/browse/SPARK-26905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, 
> spark-nonReserved.txt, spark-strictNonReserved.txt, 
> sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, 
> sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, 
> sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
>
>







[jira] [Commented] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133983#comment-17133983
 ] 

Apache Spark commented on SPARK-26905:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28807

> Revisit reserved/non-reserved keywords based on the ANSI SQL standard
> -
>
> Key: SPARK-26905
> URL: https://issues.apache.org/jira/browse/SPARK-26905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, 
> spark-nonReserved.txt, spark-strictNonReserved.txt, 
> sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, 
> sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, 
> sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
>
>







[jira] [Resolved] (SPARK-31912) Normalize all binary comparison expressions

2020-06-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-31912.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28734
[https://github.com/apache/spark/pull/28734]

> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> This test will fail:
> {code:scala}
>   test("SPARK-31912 Normalize all binary comparison expressions") {
> val original = testRelation
>   .where('a === 'b && Literal(13) >= 'b).as("x")
> val optimized = testRelation
>   .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
> 13).as("x")
> comparePlans(Optimize.execute(original.analyze), optimized.analyze)
>   }
> {code}
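
For context, a minimal sketch of the kind of normalization this test depends on: comparisons written with the literal on the left (such as Literal(13) >= 'b) would be flipped into the attribute-on-the-left form so that equivalent predicates compare equal during plan comparison. This is only an illustration of the idea, not the actual Spark rule.

{code:scala}
// Illustration only: flip literal-on-the-left comparisons so that
// `Literal(13) >= 'b` and `'b <= 13` end up in the same canonical form.
import org.apache.spark.sql.catalyst.expressions._

def normalizeComparisons(e: Expression): Expression = e transform {
  case GreaterThanOrEqual(l: Literal, r) => LessThanOrEqual(r, l)
  case GreaterThan(l: Literal, r)        => LessThan(r, l)
  case LessThanOrEqual(l: Literal, r)    => GreaterThanOrEqual(r, l)
  case LessThan(l: Literal, r)           => GreaterThan(r, l)
}
{code}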






[jira] [Assigned] (SPARK-31912) Normalize all binary comparison expressions

2020-06-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-31912:
---

Assignee: Yuming Wang

> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> This test will fail:
> {code:scala}
>   test("SPARK-31912 Normalize all binary comparison expressions") {
> val original = testRelation
>   .where('a === 'b && Literal(13) >= 'b).as("x")
> val optimized = testRelation
>   .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
> 13).as("x")
> comparePlans(Optimize.execute(original.analyze), optimized.analyze)
>   }
> {code}






[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133931#comment-17133931
 ] 

Apache Spark commented on SPARK-31967:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28806

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Commented] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133930#comment-17133930
 ] 

Apache Spark commented on SPARK-31967:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28806

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Assigned] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31967:


Assignee: Apache Spark

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Assigned] (SPARK-31967) Loading jobs UI page takes 40 seconds

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31967:


Assignee: (was: Apache Spark)

> Loading jobs UI page takes 40 seconds
> -
>
> Key: SPARK-31967
> URL: https://issues.apache.org/jira/browse/SPARK-31967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Gengliang Wang
>Priority: Blocker
> Attachments: load_time.jpeg, profile.png
>
>
> In the latest master branch, I find that the job list page becomes very slow.
> To reproduce in local setup:
> {code:java}
> spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1")
> spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2")
> (1 to 1000).map(_ =>  spark.sql("select * from t1, t2 where 
> t1.value=t2.value").show())
> {code}
> After that, open the live UI: http://localhost:4040/
> The loading time is about 40 seconds.
> If we comment out the function call for `drawApplicationTimeline`, then the 
> loading time is around 1 second.






[jira] [Commented] (SPARK-31887) Date casting to string is giving wrong value

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133925#comment-17133925
 ] 

Hyukjin Kwon commented on SPARK-31887:
--

{code}
➜  ~ cat /tmp/test1.csv/*
a,b
2020-02-19,2020-02-19T05:11:00.000+09:00
{code}
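
A minimal sketch of how to make the run deterministic, assuming the discrepancy comes from the driver and the dockerised executors resolving dates and timestamps with different time zones (the "UTC" value here is only illustrative):

{code:scala}
// Pin the session time zone so casting and CSV formatting use one zone,
// regardless of the JVM default on the driver or the executors.
spark.conf.set("spark.sql.session.timeZone", "UTC")

val x = Seq(("2020-02-19", "2020-02-19 05:11:00")).toDF("a", "b")
  .select('a.cast("date"), 'b.cast("timestamp"))
x.write.mode("overwrite").option("header", true).csv("/tmp/test1.csv")
{code}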

> Date casting to string is giving wrong value
> 
>
> Key: SPARK-31887
> URL: https://issues.apache.org/jira/browse/SPARK-31887
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
> Environment: Spark is running in cluster mode on Mesos.
>  
> Mesos agents are dockerised, running on Ubuntu 18.
>  
> Timezone setting of docker instance: UTC
> Timezone of server hosting docker: America/New_York
> Timezone of driver machine: America/New_York
>Reporter: Amit Gupta
>Priority: Major
>
> The code converts the strings to a date and a timestamp and then writes them to CSV.
> {code:java}
> val x = Seq(("2020-02-19", "2020-02-19 05:11:00")).toDF("a", 
> "b").select('a.cast("date"), 'b.cast("timestamp"))
> x.show()
> +----------+-------------------+
> |         a|                  b|
> +----------+-------------------+
> |2020-02-19|2020-02-19 05:11:00|
> +----------+-------------------+
> x.write.mode("overwrite").option("header", true).csv("/tmp/test1.csv")
> {code}
>  
> The date written in CSV file is different:
> {code:java}
> > snakebite cat "/tmp/test1.csv/*.csv"
> a,b
> 2020-02-18,2020-02-19T05:11:00.000Z{code}
>  






[jira] [Resolved] (SPARK-31887) Date casting to string is giving wrong value

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31887.
--
Resolution: Cannot Reproduce

I can't reproduce this in the current master. It was probably fixed somewhere in 
master. It would be good if we could identify the fix and consider porting it back.

> Date casting to string is giving wrong value
> 
>
> Key: SPARK-31887
> URL: https://issues.apache.org/jira/browse/SPARK-31887
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
> Environment: Spark is running in cluster mode on Mesos.
>  
> Mesos agents are dockerised, running on Ubuntu 18.
>  
> Timezone setting of docker instance: UTC
> Timezone of server hosting docker: America/New_York
> Timezone of driver machine: America/New_York
>Reporter: Amit Gupta
>Priority: Major
>
> The code converts the strings to a date and a timestamp and then writes them to CSV.
> {code:java}
> val x = Seq(("2020-02-19", "2020-02-19 05:11:00")).toDF("a", 
> "b").select('a.cast("date"), 'b.cast("timestamp"))
> x.show()
> +----------+-------------------+
> |         a|                  b|
> +----------+-------------------+
> |2020-02-19|2020-02-19 05:11:00|
> +----------+-------------------+
> x.write.mode("overwrite").option("header", true).csv("/tmp/test1.csv")
> {code}
>  
> The date written in CSV file is different:
> {code:java}
> > snakebite cat "/tmp/test1.csv/*.csv"
> a,b
> 2020-02-18,2020-02-19T05:11:00.000Z{code}
>  






[jira] [Resolved] (SPARK-31928) Flaky test: StreamingDeduplicationSuite.test no-data flag

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31928.
--
Resolution: Duplicate

> Flaky test: StreamingDeduplicationSuite.test no-data flag
> -
>
> Key: SPARK-31928
> URL: https://issues.apache.org/jira/browse/SPARK-31928
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Test failed: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123621/
> {code:java}
> [info]   with spark.sql.streaming.noDataMicroBatches.enabled = false 
> [info]   Assert on query failed: : 
> [info]   Assert on query failed: 
> [info]   
> [info]   == Progress ==
> [info]  
> StartStream(ProcessingTimeTrigger(0),org.apache.spark.util.SystemClock@372edb19,Map(spark.sql.streaming.noDataMicroBatches.enabled
>  -> false),null)
> [info]  AddData to MemoryStream[value#437541]: 10,11,12,13,14,15
> [info]  CheckAnswer: [10],[11],[12],[13],[14],[15]
> [info]  AssertOnQuery(, Check total state rows = List(6), 
> updated state rows = List(6))
> [info]  AddData to MemoryStream[value#437541]: 25
> [info]  CheckNewAnswer: [25]
> [info]  AssertOnQuery(, Check total state rows = List(7), 
> updated state rows = List(1))
> [info]   => AssertOnQuery(, )
> [info]   
> [info]   == Stream ==
> [info]   Output Mode: Append
> [info]   Stream state: {MemoryStream[value#437541]: 1}
> [info]   Thread state: alive
> [info]   Thread stack trace: java.lang.Thread.sleep(Native Method)
> [info]   
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:241)
> [info]   
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$Lambda$1375/882607691.apply$mcZ$sp(Unknown
>  Source)
> [info]   
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
> [info]   
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185)
> [info]   
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334)
> [info]   
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245)
> [info]   
> [info]   
> [info]   == Sink ==
> [info]   0: [11] [14] [13] [10] [15] [12]
> [info]   1: [25]
> [info]   
> [info]   
> [info]   == Plan ==
> [info]   == Parsed Logical Plan ==
> [info]   WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@158ccd13
> [info]   +- Project [cast(eventTime#437544-T1ms as bigint) AS 
> eventTime#437548L]
> [info]  +- Deduplicate [value#437541, eventTime#437544-T1ms]
> [info] +- EventTimeWatermark eventTime#437544: timestamp, 10 seconds
> [info]+- Project [value#437541, cast(value#437541 as timestamp) 
> AS eventTime#437544]
> [info]   +- StreamingDataSourceV2Relation [value#437541], 
> org.apache.spark.sql.execution.streaming.MemoryStreamScanBuilder@1802eea6, 
> MemoryStream[value#437541], 0, 1
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   
> [info]   WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@158ccd13
> [info]   +- Project [cast(eventTime#437544-T1ms as bigint) AS 
> eventTime#437548L]
> [info]  +- Deduplicate [value#437541, eventTime#437544-T1ms]
> [info] +- EventTimeWatermark eventTime#437544: timestamp, 10 seconds
> [info]+- Project [value#437541, cast(value#437541 as timestamp) 
> AS eventTime#437544]
> [info]   +- StreamingDataSourceV2Relation [value#437541], 
> org.apache.spark.sql.execution.streaming.MemoryStreamScanBuilder@1802eea6, 
> MemoryStream[value#437541], 0, 1
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@158ccd13
> [info]   +- Project [cast(eventTime#437544-T1ms as bigint) AS 
> eventTime#437548L]
> [info]  +- Deduplicate [value#437541, eventTime#437544-T1ms]
> [info] +- EventTimeWatermark eventTime#437544: timestamp, 10 seconds
> [info]+- Project [value#437541, cast(value#437541 as timestamp) 
> AS eventTime#437544]
> [info]   +- StreamingDataSourceV2Relation [value#437541], 
> org.apache.spark.sql.execution.streaming.MemoryStreamScanBuilder@1802eea6, 
> MemoryStream[value#437541], 0, 1
> [info]   
> [info]   == Physical Plan ==
> [info]   WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@158ccd13
> [info]   +- *(2) Project [cast(eventTime#437544-T1ms as bigi

[jira] [Commented] (SPARK-31930) Pandas_udf does not properly return ArrayType

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133921#comment-17133921
 ] 

Hyukjin Kwon commented on SPARK-31930:
--

Seems like it depends on which version you use. I can't reproduce this in the 
latest master:

{code}
+-----+----------------------------------------------------------------+
|group|list_col                                                        |
+-----+----------------------------------------------------------------+
|B    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
|C    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
|A    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
+-----+----------------------------------------------------------------+
{code}

Let's identify which JIRA fixed this and see if we can port it back. Or it 
might have been fixed in a newer version of pyarrow or pandas.

> Pandas_udf does not properly return ArrayType
> -
>
> Key: SPARK-31930
> URL: https://issues.apache.org/jira/browse/SPARK-31930
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: Azure Databricks
>Reporter: Julia Maddalena
>Priority: Major
>
> Attempting to return an ArrayType() from pandas_udf reveals a consistent 
> error where specific list elements are skipped on return. 
> We were able to create a reproducible example, as below. 
> {code:java}
> df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 
> 10)], ['group', 'val'])
> @pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
> def get_list(x):
> return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]
> df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False) 
> {code}
> {code:java}
> +-+-+
> |group|list_col |
> +-+-+
> |B|[[1, 1],, [7, 7], [8, 8]]|
> |C|[[1, 1],, [7, 7], [8, 8]]|
> |A|[[1, 1],, [7, 7], [8, 8]]|
> +-+-+
> {code}
>  
>  
> In every example we've come up with, it consistently replaces elements 2-6 
> with None (as well as some later elements too). 
>  






[jira] [Resolved] (SPARK-31930) Pandas_udf does not properly return ArrayType

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31930.
--
Resolution: Cannot Reproduce

> Pandas_udf does not properly return ArrayType
> -
>
> Key: SPARK-31930
> URL: https://issues.apache.org/jira/browse/SPARK-31930
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: Azure Databricks
>Reporter: Julia Maddalena
>Priority: Major
>
> Attempting to return an ArrayType() from pandas_udf reveals a consistent 
> error where specific list elements are skipped on return. 
> We were able to create a reproducible example, as below. 
> {code:java}
> df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 
> 10)], ['group', 'val'])
> @pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
> def get_list(x):
> return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]
> df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False) 
> {code}
> {code:java}
> +-+-+
> |group|list_col |
> +-+-+
> |B|[[1, 1],, [7, 7], [8, 8]]|
> |C|[[1, 1],, [7, 7], [8, 8]]|
> |A|[[1, 1],, [7, 7], [8, 8]]|
> +-+-+
> {code}
>  
>  
> In every example we've come up with, it consistently replaces elements 2-6 
> with None (as well as some later elements too). 
>  






[jira] [Commented] (SPARK-31947) Solve string value error about Date/Timestamp in ScriptTransform

2020-06-11 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133915#comment-17133915
 ] 

angerszhu commented on SPARK-31947:
---

[~hyukjin.kwon]

Yea, later

> Solve string value error about Date/Timestamp in ScriptTransform
> 
>
> Key: SPARK-31947
> URL: https://issues.apache.org/jira/browse/SPARK-31947
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Updated] (SPARK-31930) Pandas_udf does not properly return ArrayType

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31930:
-
Priority: Major  (was: Blocker)

> Pandas_udf does not properly return ArrayType
> -
>
> Key: SPARK-31930
> URL: https://issues.apache.org/jira/browse/SPARK-31930
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: Azure Databricks
>Reporter: Julia Maddalena
>Priority: Major
>
> Attempting to return an ArrayType() from pandas_udf reveals a consistent 
> error where specific list elements are skipped on return. 
> We were able to create a reproducible example, as below. 
> {code:java}
> df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 
> 10)], ['group', 'val'])
> @pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
> def get_list(x):
> return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]
> df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False) 
> {code}
> {code:java}
> +-+-+
> |group|list_col |
> +-+-+
> |B|[[1, 1],, [7, 7], [8, 8]]|
> |C|[[1, 1],, [7, 7], [8, 8]]|
> |A|[[1, 1],, [7, 7], [8, 8]]|
> +-+-+
> {code}
>  
>  
> In every example we've come up with, it consistently replaces elements 2-6 
> with None (as well as some later elements too). 
>  






[jira] [Resolved] (SPARK-31943) SPARK-31500 introduces breaking changes in 2.4.6

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31943.
--
Resolution: Won't Fix

{{Collect}} is under the catalyst package, which is private. It isn't a public API.

> SPARK-31500 introduces breaking changes in 2.4.6
> 
>
> Key: SPARK-31943
> URL: https://issues.apache.org/jira/browse/SPARK-31943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jefferson V
>Priority: Major
>  Labels: backward-incompatible
>
> [SPARK-31500|https://github.com/apache/spark/pull/28351/files] introduced 
> unimplemented fields of the `Collect` class that cause client extensions of 
> that class to fail unless the Spark version is pinned to `2.4.5`. Since this 
> was a minor version bump, that seems unintended.
> I believe we should at least be able to provide default values:
> `convertToBufferElement(value: Any): Any = InternalRow.copyValue(value)`
> `bufferElementType: DataType = child.dataType`
> and restore the `override def eval` in `Collect`, to support compatibility 
> with 2.4.5 while allowing implementers in that file to override them to fix 
> the bug. Since the abstract `Collect` is currently not designed to fix the 
> bug (it just provides tools that can be implemented to fix it), this change 
> wouldn't undermine the bug fix, just add backwards compatibility with 2.4.5.






[jira] [Commented] (SPARK-31947) Solve string value error about Date/Timestamp in ScriptTransform

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133905#comment-17133905
 ] 

Hyukjin Kwon commented on SPARK-31947:
--

Can you fill the JIRA description?

> Solve string value error about Date/Timestamp in ScriptTransform
> 
>
> Key: SPARK-31947
> URL: https://issues.apache.org/jira/browse/SPARK-31947
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Resolved] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31961.
--
Resolution: Won't Fix

> Add a class in spark with all Kafka configuration key available as string
> -
>
> Key: SPARK-31961
> URL: https://issues.apache.org/jira/browse/SPARK-31961
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Gunjan Kumar
>Priority: Minor
>  Labels: kafka, sql, structured-streaming
>
> Add a class in Spark with all Kafka configuration keys available as strings.
> See the highlighted class below for what I want.
> e.g.:
> Current code:
> val df_cluster1 = spark
> .read
> .format("kafka")
> .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
> .option("subscribe", "topic1")
> Expected code:
> val df_cluster1 = spark
> .read
> .format("kafka")
> .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port")
> .option(*KafkaConstantClass*.SUBSCRIBE, "topic1")
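
For illustration, a minimal sketch of what such a constants holder could look like; the object and member names are hypothetical, since Spark does not currently ship this class:

{code:scala}
// Hypothetical constants holder for Kafka source option keys.
object KafkaConstantClass {
  val KAFKA_BOOTSTRAP_SERVERS = "kafka.bootstrap.servers"
  val SUBSCRIBE = "subscribe"
}

val df_cluster1 = spark.read
  .format("kafka")
  .option(KafkaConstantClass.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port")
  .option(KafkaConstantClass.SUBSCRIBE, "topic1")
  .load()
{code}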






[jira] [Updated] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31961:
-
Target Version/s:   (was: 2.4.6)

> Add a class in spark with all Kafka configuration key available as string
> -
>
> Key: SPARK-31961
> URL: https://issues.apache.org/jira/browse/SPARK-31961
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Gunjan Kumar
>Priority: Minor
>  Labels: kafka, sql, structured-streaming
>
> Add a class in Spark with all Kafka configuration keys available as strings.
> See the highlighted class below for what I want.
> e.g.:
> Current code:
> val df_cluster1 = spark
> .read
> .format("kafka")
> .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
> .option("subscribe", "topic1")
> Expected code:
> val df_cluster1 = spark
> .read
> .format("kafka")
> .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port")
> .option(*KafkaConstantClass*.SUBSCRIBE, "topic1")






[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133902#comment-17133902
 ] 

Hyukjin Kwon commented on SPARK-31962:
--

cc [~kabhwan] FYI

> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming with a FileDataSource, I've encountered a 
> number of occasions where I want to be able to stream from a folder 
> containing any number of historical delta files in CSV format. When I start 
> reading from a folder, however, I might only care about files that were created 
> after a certain time.
> {code:java}
> spark.readStream
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
>  
> In 
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala],
>  there is a method, _checkAndGlobPathIfNecessary_, which appears to create an 
> in-memory index of files for a given path. There may be a rather clean 
> opportunity to add options here.
> Having the ability to provide an option specifying a timestamp by which to 
> begin globbing files would remove a good deal of complexity for a consumer who 
> streams from a folder path but has no interest in reading what could be 
> thousands of irrelevant files.
> One example could be a "createdFileTime" option accepting a UTC datetime, like 
> the snippet below.
> {code:java}
> spark.readStream
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .option("createdFileTime", "2020-05-01 00:00:00")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
>  
> If this option is specified, the expected behavior would be that files within 
> the _"/mnt/Deltas/"_ path must have been created at or later than the 
> specified time in order to be consumed for purposes of reading the files in 
> general or for purposes of structured streaming.
>  
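
For reference, a rough way to approximate this today for a batch read by filtering the file listing manually (assumes a Hadoop-compatible filesystem and uses modification time; the proposed "createdFileTime" option itself does not exist):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only files modified at or after the cutoff, then read just those paths.
val cutoffMillis = java.sql.Timestamp.valueOf("2020-05-01 00:00:00").getTime
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val recentFiles = fs.listStatus(new Path("/mnt/Deltas"))
  .filter(_.getModificationTime >= cutoffMillis)
  .map(_.getPath.toString)
  .toSeq

val df = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .csv(recentFiles: _*)
{code}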






[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-11 Thread Lin Gang Deng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133898#comment-17133898
 ] 

Lin Gang Deng commented on SPARK-31955:
---

[~hyukjin.kwon], the difference between my SQL file and [~younggyuchun]'s is 
that there is no EOL at the end of my script.

My example is the exact reproducer. 

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via beeline and the result returned is wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a beeline bug in parsing the SQL file.






[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-11 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133894#comment-17133894
 ] 

Yuming Wang commented on SPARK-31955:
-

It seems this is a Hive issue. Maybe we should fix it on the Hive side.

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via beeline and the result returned is wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a beeline bug in parsing the SQL file.






[jira] [Resolved] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31880.
--
Resolution: Invalid

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs






[jira] [Reopened] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-31880:
--

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs






[jira] [Resolved] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31880.
--
Resolution: Not A Problem

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs






[jira] [Updated] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31880:
-
Parent: SPARK-31408
Issue Type: Sub-task  (was: Bug)

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs






[jira] [Commented] (SPARK-28169) Spark can’t push down partition predicate for OR expression

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133886#comment-17133886
 ] 

Apache Spark commented on SPARK-28169:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/28805

> Spark can’t push down partition predicate for OR expression
> ---
>
> Key: SPARK-28169
> URL: https://issues.apache.org/jira/browse/SPARK-28169
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>  Labels: SQL
>
> Spark can't push down the filter condition of an OR expression.
> For example, if I have a table {color:#d04437}default.test{color} whose partition 
> column is "{color:#d04437}dt{color}"
> and I run the query:
> {code:java}
> select * from default.test where dt=20190625 or (dt = 20190626 and id in 
> (1,2,3) )
> {code}
> In this case, Spark resolves the OR condition as a single expression, and since 
> this expression has a reference to "id", 
> it can't be pushed down.
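
One way to see the intended optimization: the OR condition implies a partition-only predicate, which can be conjoined without changing the result so that partition pruning still applies. A hedged illustration of the effect (not the actual optimizer rule):

{code:scala}
// Manually adding the partition-only predicate that a pruning rule could
// derive from the OR condition; the conjunction is equivalent to the original.
spark.sql(
  """SELECT * FROM default.test
    |WHERE (dt = 20190625 OR dt = 20190626)
    |  AND (dt = 20190625 OR (dt = 20190626 AND id IN (1, 2, 3)))
    |""".stripMargin)
{code}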






[jira] [Commented] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133887#comment-17133887
 ] 

Hyukjin Kwon commented on SPARK-31880:
--

Okay, thanks for the clarification. Let's be diligent about updating JIRA for 
trackability.

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs






[jira] [Commented] (SPARK-28169) Spark can’t push down partition predicate for OR expression

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133885#comment-17133885
 ] 

Apache Spark commented on SPARK-28169:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/28805

> Spark can’t push down partition predicate for OR expression
> ---
>
> Key: SPARK-28169
> URL: https://issues.apache.org/jira/browse/SPARK-28169
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>  Labels: SQL
>
> Spark can't push down the filter condition of an OR expression.
> For example, if I have a table {color:#d04437}default.test{color} whose partition 
> column is "{color:#d04437}dt{color}"
> and I run the query:
> {code:java}
> select * from default.test where dt=20190625 or (dt = 20190626 and id in 
> (1,2,3) )
> {code}
> In this case, Spark resolves the OR condition as a single expression, and since 
> this expression has a reference to "id", 
> it can't be pushed down.






[jira] [Resolved] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31955.
--
Resolution: Incomplete

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via beeline and the result returned is wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a beeline bug in parsing the SQL file.






[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133884#comment-17133884
 ] 

Hyukjin Kwon commented on SPARK-31955:
--

I am going to leave it resolved until enough information is provided to analyze 
further, for JIRA management purposes.

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via beeline and the result returned is wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a beeline bug in parsing the SQL file.






[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133883#comment-17133883
 ] 

Hyukjin Kwon commented on SPARK-31955:
--

[~denglg] Please show the __exact__ reproducer. From reading [~younggyuchun]'s 
comment, it isn't clear what issue you mean.
Also please check whether the behaviour is consistent with beeline in Hive. If it 
also exists in Hive, this isn't a Spark-specific issue.

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via beeline and the result returned is wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a beeline bug in parsing the SQL file.






[jira] [Updated] (SPARK-25186) Stabilize Data Source V2 API

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25186:
-
Target Version/s: 3.1.0

> Stabilize Data Source V2 API 
> -
>
> Key: SPARK-25186
> URL: https://issues.apache.org/jira/browse/SPARK-25186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-25186) Stabilize Data Source V2 API

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25186:
-
Target Version/s:   (was: 3.0.0)

> Stabilize Data Source V2 API 
> -
>
> Key: SPARK-25186
> URL: https://issues.apache.org/jira/browse/SPARK-25186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133880#comment-17133880
 ] 

Kent Yao commented on SPARK-31880:
--

This should be resolved as 'Not a Problem' because we have forbidden week-based 
fields in the new datetime formatter we built: 
https://issues.apache.org/jira/browse/SPARK-31892

It's better to move it back and mark it as resolved. Thanks [~hyukjin.kwon]

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs






[jira] [Commented] (SPARK-19169) columns changed orc table encouter 'IndexOutOfBoundsException' when read the old schema files

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133879#comment-17133879
 ] 

Hyukjin Kwon commented on SPARK-19169:
--

[~angerszhuuu] can you file a new JIRA with a reproducer?

> columns changed orc table encouter 'IndexOutOfBoundsException' when read the 
> old schema files
> -
>
> Key: SPARK-19169
> URL: https://issues.apache.org/jira/browse/SPARK-19169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: roncenzhao
>Priority: Major
>
> We have an ORC table called orc_test_tbl and have inserted some data into it.
> After that, we change the table schema by dropping some columns.
> When reading the old schema file, we get the following exception.
> ```
> java.lang.IndexOutOfBoundsException: toIndex = 65
> at java.util.ArrayList.subListRangeCheck(ArrayList.java:962)
> at java.util.ArrayList.subList(ArrayList.java:954)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
> at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> ```
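
For reference, one possible reproducer sketch (not verified; the table name, path
and the convertMetastoreOrc setting are assumptions, not taken from the report):
write ORC files with a wider schema, then read them through a Hive table that
declares fewer columns, which exercises the Hive ORC read path from the stack
trace above.

{code:scala}
// Hypothetical sketch: files carry columns (id, a, b), the table declares only (id, a).
val spark = org.apache.spark.sql.SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreOrc", "false") // force the Hive ORC reader
  .getOrCreate()

spark.range(10).selectExpr("id", "id AS a", "id AS b")
  .write.mode("overwrite").orc("/tmp/orc_old_schema")

spark.sql("""
  CREATE EXTERNAL TABLE orc_test_tbl (id BIGINT, a BIGINT)
  STORED AS ORC LOCATION '/tmp/orc_old_schema'
""")

spark.sql("SELECT * FROM orc_test_tbl").show() // reads files written with the wider schema
{code}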



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30399) Bucketing is not compatible with partitioning in practice

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133878#comment-17133878
 ] 

Hyukjin Kwon commented on SPARK-30399:
--

Okay, but Spark 2.3.0 is EOL. Mind checking if the issue exists in the latest 
Spark version? It would also be nice to have a reproducer.

> Bucketing is not compatible with partitioning in practice
> ---
>
> Key: SPARK-30399
> URL: https://issues.apache.org/jira/browse/SPARK-30399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: HDP 2.7
>Reporter: Shay Elbaz
>Priority: Minor
>
> When using a Spark bucketed table, Spark uses as many partitions as the 
> number of buckets for the map-side join 
> (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" 
> tables, but is quite disastrous for _time-partitioned_ tables. In our use case, 
> a daily-partitioned key-value table grows by 100GB of data every day, so after 
> 100 days there are 10TB of data we want to join with. For this scenario we 
> need thousands of buckets if we want every task to successfully *read and 
> sort* all of its data in a map-side join. But in that case, every daily 
> increment would emit thousands of small files, leading to other big issues.
> In practice, and with a hope for some hidden optimization, we set the number 
> of buckets to 1000 and backfilled such a table with 10TB. When trying to join 
> with the smallest input, every executor was killed by Yarn due to 
> over-allocating memory in the sorting phase. Even without such failures, it 
> would take every executor an unreasonable amount of time to locally sort all 
> its data.
> A question on SO remained unanswered for a while, so I thought I'd ask here: 
> is it by design that buckets cannot be used with time-partitioned tables, or 
> am I doing something wrong?
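
For reference, a write-path sketch of the setup being described (table, column
names and sizes are assumptions, not taken from the report): a daily-partitioned,
bucketed key-value table where each write emits up to one file per bucket per
writing task under the day's directory.

{code:scala}
// Illustrative only: bucketed + daily-partitioned writes multiply the file count.
val df = spark.range(0, 1000000L)
  .selectExpr("uuid() AS key", "id AS value", "current_date() AS dt")

df.write
  .mode("append")
  .partitionBy("dt")          // one directory per day
  .bucketBy(1000, "key")      // up to 1000 bucket files per writing task per day
  .sortBy("key")
  .saveAsTable("kv_bucketed")

// A single daily increment still emits on the order of a thousand small files,
// while a map-side join later reads every day's slice of a bucket in one task.
{code}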



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133874#comment-17133874
 ] 

Hyukjin Kwon commented on SPARK-31880:
--

I converted this into a separate issue because SPARK-31408 is fixed in 3.0.0. 
Let's file another parent issue if there are many of them, and then add a link 
between the parent tickets, [~Qin Yao].

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31880) Adjacent value parsing not supported for Localized Patterns because of JDK bug

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31880:
-
Parent: (was: SPARK-31408)
Issue Type: Bug  (was: Sub-task)

> Adjacent value parsing not supported for Localized Patterns because of JDK bug
> --
>
> Key: SPARK-31880
> URL: https://issues.apache.org/jira/browse/SPARK-31880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('202011', 'ww');
> {code}
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '202011' could not 
> be parsed at index 0
>   at 
> java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
>   at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1777)
>   at 
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:79)
>   ... 99 more
> {code}
> {code:java}
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('202011', 'wu');
> 2019-12-30 00:00:00
> spark-sql> select to_timestamp('202011', 'ww');
> 2020-03-08 00:00:00
> {code}
> The result could vary between different JDKs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31899) Forbid datetime pattern letter u

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31899.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28728

> Forbid datetime pattern letter u
> 
>
> Key: SPARK-31899
> URL: https://issues.apache.org/jira/browse/SPARK-31899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31827) Fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31827:
-
Summary: Fail datetime parsing/formatting if detect the Java 8 bug of 
stand-alone form  (was: fail datetime parsing/formatting if detect the Java 8 
bug of stand-alone form)

> Fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
> -
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31899) Forbid datetime pattern letter u

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31899:
-
Summary: Forbid datetime pattern letter u  (was: forbid datetime pattern 
letter u)

> Forbid datetime pattern letter u
> 
>
> Key: SPARK-31899
> URL: https://issues.apache.org/jira/browse/SPARK-31899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31827) fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31827.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in 

> fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
> -
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31827) fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133868#comment-17133868
 ] 

Hyukjin Kwon edited comment on SPARK-31827 at 6/12/20, 3:09 AM:


Fixed in https://github.com/apache/spark/pull/28646


was (Author: hyukjin.kwon):
Fixed in 

> fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
> -
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31799) Spark Datasource Tables Creating Incorrect Hive Metadata

2020-06-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133866#comment-17133866
 ] 

Hyukjin Kwon commented on SPARK-31799:
--

JSON itself isn't a completely self-describing format: you still have to infer 
the schema and collect the field list after scanning all of the data. You can go 
ahead with a PR if you have a way to ensure it.
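
A small sketch of that point (the records and the inferred schema shown in the
comments are illustrative): each JSON record can carry a different field set, so
the table schema only emerges after scanning the data and unioning the fields.

{code:scala}
import spark.implicits._

// Two records with different field sets; inference scans them and merges the fields.
val records = Seq("""{"id": 1}""", """{"id": 2, "name": "b"}""").toDS()
val df = spark.read.json(records)
df.printSchema()
// root
//  |-- id: long (nullable = true)
//  |-- name: string (nullable = true)
{code}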

> Spark Datasource Tables Creating Incorrect Hive Metadata
> 
>
> Key: SPARK-31799
> URL: https://issues.apache.org/jira/browse/SPARK-31799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Anoop Johnson
>Priority: Major
>
> I found that if I create a CSV or JSON table using Spark SQL, it writes the 
> wrong Hive table metadata, breaking compatibility with other query engines 
> like Hive and Presto. Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id String, name String)
> USING csv
>   LOCATION  's3://[...]'
> ;
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array COMMENT 'from deserializer')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
> WITH SERDEPROPERTIES ( 
>   'path'='s3://[...]') 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.mapred.SequenceFileInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4', 
>   'spark.sql.sources.provider'='csv', 
>   'spark.sql.sources.schema.numParts'='1', 
>   
> 'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>  
>   'transient_lastDdlTime'='1590196086')
>   ;
> {code}
>  The table location is set to a placeholder value - the schema is always set 
> to _col array_. The serde/inputformat is wrong - it says 
> _SequenceFileInputFormat_ and _LazySimpleSerDe_ even though the requested 
> format is CSV.
> But all the right metadata is written to the custom table properties with 
> prefix _spark.sql_. However, Hive and Presto do not understand these table 
> properties, and this breaks them. I could reproduce this with JSON too, but 
> not with Parquet. 
> I root-caused this issue to CSV and JSON tables not being handled 
> [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66]
>  in HiveSerde.scala. As a result, these default values are written.
> Is there a reason why CSV and JSON are not handled? I could send a patch to 
> fix this, but the caveat is that the CSV and JSON Hive serdes should be in 
> the Spark classpath, otherwise the table creation will fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-06-11 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31705:

Fix Version/s: 3.1.0

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, 
>   
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> 
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), 
>   
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
> DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*) 
> FROM   lineitem, 
>orders 
> WHERE  l_orderkey = o_orderkey 
>AND NOT ( ( l_suppkey > 3 
>AND ( l_suppkey > 2 
>   OR o_custkey > 13 ) ) 
>   OR ( l_suppkey > 1 
>AND o_custkey > 11 ) ) 
>AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>  
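
For illustration, a minimal sketch (not Spark's actual optimizer rule) of how an
OR-over-AND predicate is rewritten into conjunctive normal form so that
single-table conjuncts can be pushed below the join, as in the plans above:

{code:scala}
// Toy CNF rewrite over a tiny expression tree (no NOT handling; a real rule
// must also guard against the potential exponential blow-up).
sealed trait Expr
case class Leaf(s: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

def toCnf(e: Expr): Expr = e match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) => (toCnf(l), toCnf(r)) match {
    case (And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c))) // (a AND b) OR c
    case (a, And(b, c)) => And(toCnf(Or(a, b)), toCnf(Or(a, c))) // a OR (b AND c)
    case (a, b)         => Or(a, b)
  }
  case leaf => leaf
}

// (l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)
val cond = Or(And(Leaf("l_suppkey > 3"), Leaf("o_custkey > 13")),
              And(Leaf("l_suppkey > 1"), Leaf("o_custkey > 11")))
// toCnf(cond) contains the conjunct (l_suppkey > 3 OR l_suppkey > 1), which
// references only lineitem and can therefore be pushed into its scan.
{code}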

[jira] [Created] (SPARK-31974) Stop tracking speculative shuffle files in ExecutorMonitor

2020-06-11 Thread Holden Karau (Jira)
Holden Karau created SPARK-31974:


 Summary: Stop tracking speculative shuffle files in ExecutorMonitor
 Key: SPARK-31974
 URL: https://issues.apache.org/jira/browse/SPARK-31974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Holden Karau






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31955) Beeline discards the last line of the sql file when submitted to thriftserver via beeline

2020-06-11 Thread Lin Gang Deng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133822#comment-17133822
 ] 

Lin Gang Deng commented on SPARK-31955:
---

[~dongjoon] As you said, the EOL is the key to the problem. Sometimes there is no 
newline character when the SQL is submitted, or the newline character is removed 
by a third-party component. Maybe Spark or Beeline should parse the SQL correctly 
whether or not there is a trailing EOL; otherwise, users will be bitten by wrong 
results.

> Beeline discards the last line of the sql file when submitted to thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via Beeline and the returned result was wrong. After 
> many tests, I found that the SQL executed by Spark discards the last line. 
> This looks like a bug in how Beeline parses the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31973) Add ability to disable Sort,Spill in Partial aggregation

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133810#comment-17133810
 ] 

Apache Spark commented on SPARK-31973:
--

User 'karuppayya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28804

> Add ability to disable Sort,Spill in Partial aggregation 
> -
>
> Key: SPARK-31973
> URL: https://issues.apache.org/jira/browse/SPARK-31973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Priority: Major
>
> In hash aggregation, a partial aggregation (update) is done first, followed by 
> a final aggregation (merge).
> During partial aggregation we sort and spill to disk every time the fast hash 
> map (when enabled) and the UnsafeFixedWidthAggregationMap get exhausted.
> *When the cardinality of the grouping column is close to the total number of 
> records being processed*, sorting and spilling the data to disk is not 
> required: the partial aggregation is effectively a no-op and its output can be 
> used directly in the final aggregation.
> When users know the nature of their data, they currently have no way to 
> disable this sort-and-spill step.
> This is similar to the following issues in Hive:
> https://issues.apache.org/jira/browse/HIVE-223
> https://issues.apache.org/jira/browse/HIVE-291
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31916) StringConcat can overflow `length`, leads to StringIndexOutOfBoundsException

2020-06-11 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31916:
-
Fix Version/s: 3.1.0

> StringConcat can overflow `length`, leads to StringIndexOutOfBoundsException
> 
>
> Key: SPARK-31916
> URL: https://issues.apache.org/jira/browse/SPARK-31916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jeffrey Stokes
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> We have query plans that, through multiple transformations, can grow extremely 
> long. These would eventually throw OutOfMemory exceptions 
> (https://issues.apache.org/jira/browse/SPARK-26103 & related 
> https://issues.apache.org/jira/browse/SPARK-25380).
>  
> We backported the changes from [https://github.com/apache/spark/pull/23169] 
> into our distribution of Spark, based on 2.4.4, and attempted to use the 
> added `spark.sql.maxPlanStringLength`. While this works in some cases, large 
> query plans can still lead to issues stemming from `StringConcat` in 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala.
>  
> The following unit test exhibits the issue, which continues to fail in the 
> master branch of spark:
>  
> {code:scala}
>   test("StringConcat doesn't overflow on many inputs") {
> val concat = new StringConcat(maxLength = 100)
> 0.to(Integer.MAX_VALUE).foreach { _ =>  
>   concat.append("hello world")
>  }
> assert(concat.toString.length === 100)  
> } 
> {code}
>  
> Looking at the append method here: 
> [https://github.com/apache/spark/blob/fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L118-L128]
>  
> It seems that, regardless of whether the string to be appended is added fully 
> to the internal buffer, added as a substring to reach `maxLength`, or not added 
> at all, the internal `length` field is incremented by the length of `s`. 
> Eventually this will overflow an int and cause L123 to call substring with a 
> negative index.
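
A hedged illustration of that arithmetic (a standalone sketch, not Spark's actual
class): once the running Int length wraps around, the "remaining capacity"
derived from it becomes a negative substring bound.

{code:scala}
// Tracking the total appended length in an Int overflows after ~2^31 characters.
val maxLength = 100
var trackedLength: Int = Int.MaxValue          // pretend ~2 billion chars were appended
trackedLength += "hello world".length          // wraps around to a large negative value

val available = maxLength - trackedLength      // overflows again; ends up negative
// "hello world".substring(0, available)       // -> StringIndexOutOfBoundsException
println(s"length=$trackedLength available=$available")
{code}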



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31973) Add ability to disable Sort,Spill in Partial aggregation

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133809#comment-17133809
 ] 

Apache Spark commented on SPARK-31973:
--

User 'karuppayya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28804

> Add ability to disable Sort,Spill in Partial aggregation 
> -
>
> Key: SPARK-31973
> URL: https://issues.apache.org/jira/browse/SPARK-31973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Priority: Major
>
> In hash aggregation, a partial aggregation (update) is done first, followed by 
> a final aggregation (merge).
> During partial aggregation we sort and spill to disk every time the fast hash 
> map (when enabled) and the UnsafeFixedWidthAggregationMap get exhausted.
> *When the cardinality of the grouping column is close to the total number of 
> records being processed*, sorting and spilling the data to disk is not 
> required: the partial aggregation is effectively a no-op and its output can be 
> used directly in the final aggregation.
> When users know the nature of their data, they currently have no way to 
> disable this sort-and-spill step.
> This is similar to the following issues in Hive:
> https://issues.apache.org/jira/browse/HIVE-223
> https://issues.apache.org/jira/browse/HIVE-291
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31973) Add ability to disable Sort,Spill in Partial aggregation

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31973:


Assignee: Apache Spark

> Add ability to disable Sort,Spill in Partial aggregation 
> -
>
> Key: SPARK-31973
> URL: https://issues.apache.org/jira/browse/SPARK-31973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Assignee: Apache Spark
>Priority: Major
>
> In hash aggregation, a partial aggregation (update) is done first, followed by 
> a final aggregation (merge).
> During partial aggregation we sort and spill to disk every time the fast hash 
> map (when enabled) and the UnsafeFixedWidthAggregationMap get exhausted.
> *When the cardinality of the grouping column is close to the total number of 
> records being processed*, sorting and spilling the data to disk is not 
> required: the partial aggregation is effectively a no-op and its output can be 
> used directly in the final aggregation.
> When users know the nature of their data, they currently have no way to 
> disable this sort-and-spill step.
> This is similar to the following issues in Hive:
> https://issues.apache.org/jira/browse/HIVE-223
> https://issues.apache.org/jira/browse/HIVE-291
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31973) Add ability to disable Sort,Spill in Partial aggregation

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31973:


Assignee: (was: Apache Spark)

> Add ability to disable Sort,Spill in Partial aggregation 
> -
>
> Key: SPARK-31973
> URL: https://issues.apache.org/jira/browse/SPARK-31973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Priority: Major
>
> In hash aggregation, a partial aggregation (update) is done first, followed by 
> a final aggregation (merge).
> During partial aggregation we sort and spill to disk every time the fast hash 
> map (when enabled) and the UnsafeFixedWidthAggregationMap get exhausted.
> *When the cardinality of the grouping column is close to the total number of 
> records being processed*, sorting and spilling the data to disk is not 
> required: the partial aggregation is effectively a no-op and its output can be 
> used directly in the final aggregation.
> When users know the nature of their data, they currently have no way to 
> disable this sort-and-spill step.
> This is similar to the following issues in Hive:
> https://issues.apache.org/jira/browse/HIVE-223
> https://issues.apache.org/jira/browse/HIVE-291
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31916) StringConcat can overflow `length`, leads to StringIndexOutOfBoundsException

2020-06-11 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31916.
--
Fix Version/s: 3.0.1
 Assignee: Dilip Biswal
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28750

> StringConcat can overflow `length`, leads to StringIndexOutOfBoundsException
> 
>
> Key: SPARK-31916
> URL: https://issues.apache.org/jira/browse/SPARK-31916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jeffrey Stokes
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.1
>
>
> We have query plans that, through multiple transformations, can grow extremely 
> long. These would eventually throw OutOfMemory exceptions 
> (https://issues.apache.org/jira/browse/SPARK-26103 & related 
> https://issues.apache.org/jira/browse/SPARK-25380).
>  
> We backported the changes from [https://github.com/apache/spark/pull/23169] 
> into our distribution of Spark, based on 2.4.4, and attempted to use the 
> added `spark.sql.maxPlanStringLength`. While this works in some cases, large 
> query plans can still lead to issues stemming from `StringConcat` in 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala.
>  
> The following unit test exhibits the issue, which continues to fail in the 
> master branch of spark:
>  
> {code:scala}
>   test("StringConcat doesn't overflow on many inputs") {
> val concat = new StringConcat(maxLength = 100)
> 0.to(Integer.MAX_VALUE).foreach { _ =>  
>   concat.append("hello world")
>  }
> assert(concat.toString.length === 100)  
> } 
> {code}
>  
> Looking at the append method here: 
> [https://github.com/apache/spark/blob/fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L118-L128]
>  
> It seems that, regardless of whether the string to be appended is added fully 
> to the internal buffer, added as a substring to reach `maxLength`, or not added 
> at all, the internal `length` field is incremented by the length of `s`. 
> Eventually this will overflow an int and cause L123 to call substring with a 
> negative index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31973) Add ability to disable Sort,Spill in Partial aggregation

2020-06-11 Thread Karuppayya (Jira)
Karuppayya created SPARK-31973:
--

 Summary: Add ability to disable Sort,Spill in Partial aggregation 
 Key: SPARK-31973
 URL: https://issues.apache.org/jira/browse/SPARK-31973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Karuppayya


In hash aggregation, a partial aggregation (update) is done first, followed by 
a final aggregation (merge).

During partial aggregation we sort and spill to disk every time the fast hash 
map (when enabled) and the UnsafeFixedWidthAggregationMap get exhausted.

*When the cardinality of the grouping column is close to the total number of 
records being processed*, sorting and spilling the data to disk is not required: 
the partial aggregation is effectively a no-op and its output can be used 
directly in the final aggregation.

When users know the nature of their data, they currently have no way to disable 
this sort-and-spill step.

This is similar to the following issues in Hive:

https://issues.apache.org/jira/browse/HIVE-223

https://issues.apache.org/jira/browse/HIVE-291
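
To make the scenario above concrete, a small sketch (column names and sizes are
illustrative, not from this ticket) of an aggregation whose grouping key is
nearly unique, so the partial aggregate barely reduces the data but still
hash-aggregates, sorts and spills once its maps fill up:

{code:scala}
// Nearly-unique grouping key: partial aggregation reduces almost nothing.
val events = spark.range(0, 100000000L)
  .selectExpr("uuid() AS id", "1 AS v")        // cardinality of `id` ~ row count

val agg = events.groupBy("id").agg(org.apache.spark.sql.functions.sum("v"))
agg.explain()
// HashAggregate(partial) -> Exchange -> HashAggregate(final); the partial step
// still sorts and spills even though it is effectively a pass-through.
{code}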

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard

2020-06-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133215#comment-17133215
 ] 

Takeshi Yamamuro edited comment on SPARK-26905 at 6/11/20, 11:48 PM:
-

-Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]).-
-Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`.-

Ur, sorry.., but I misread them. Yea, we need more work to fix this. I'll open 
a PR for that.


was (Author: maropu):
-Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]).-
-Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`.-

Ur, sorry.., but I misread them. Yea, we need more work to fix this.

> Revisit reserved/non-reserved keywords based on the ANSI SQL standard
> -
>
> Key: SPARK-26905
> URL: https://issues.apache.org/jira/browse/SPARK-26905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, 
> spark-nonReserved.txt, spark-strictNonReserved.txt, 
> sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, 
> sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, 
> sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard

2020-06-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133215#comment-17133215
 ] 

Takeshi Yamamuro edited comment on SPARK-26905 at 6/11/20, 11:47 PM:
-

-Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]).-
-Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`.-

Ur, sorry.., but I misread them. Yea, we need more work to fix this.


was (Author: maropu):
-Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]);-
{code:java}
scala> sql("SET spark.sql.ansi.enabled=false")

scala> sql("create table t1 (anti int)")
res10: org.apache.spark.sql.DataFrame = []
scala> sql("create table t2 (semi int)")
res11: org.apache.spark.sql.DataFrame = []
scala> sql("create table t3 (minus int)")
res12: org.apache.spark.sql.DataFrame = []

scala> sql("SET spark.sql.ansi.enabled=true")

scala> sql("create table t4 (anti int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'anti'(line 1, pos 17)

== SQL ==
create table t4 (anti int)
-^^^

scala> sql("create table t5 (semi int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'semi'(line 1, pos 17)

== SQL ==
create table t5 (semi int)
-^^^

scala> sql("create table t6 (minus int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'minus'(line 1, pos 17)

== SQL ==
create table t6 (minus int)
-^^^
{code}
-Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`.-

Ur, sorry.., but I misread them. Yea, we need more work to fix this.

> Revisit reserved/non-reserved keywords based on the ANSI SQL standard
> -
>
> Key: SPARK-26905
> URL: https://issues.apache.org/jira/browse/SPARK-26905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, 
> spark-nonReserved.txt, spark-strictNonReserved.txt, 
> sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, 
> sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, 
> sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard

2020-06-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133215#comment-17133215
 ] 

Takeshi Yamamuro edited comment on SPARK-26905 at 6/11/20, 11:46 PM:
-

-Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]);-
{code:java}
scala> sql("SET spark.sql.ansi.enabled=false")

scala> sql("create table t1 (anti int)")
res10: org.apache.spark.sql.DataFrame = []
scala> sql("create table t2 (semi int)")
res11: org.apache.spark.sql.DataFrame = []
scala> sql("create table t3 (minus int)")
res12: org.apache.spark.sql.DataFrame = []

scala> sql("SET spark.sql.ansi.enabled=true")

scala> sql("create table t4 (anti int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'anti'(line 1, pos 17)

== SQL ==
create table t4 (anti int)
-^^^

scala> sql("create table t5 (semi int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'semi'(line 1, pos 17)

== SQL ==
create table t5 (semi int)
-^^^

scala> sql("create table t6 (minus int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'minus'(line 1, pos 17)

== SQL ==
create table t6 (minus int)
-^^^
{code}
-Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`.-

Ur, sorry.., but I misread them. Yea, we need more work to fix this.


was (Author: maropu):
~Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]);~
{code:java}
scala> sql("SET spark.sql.ansi.enabled=false")

scala> sql("create table t1 (anti int)")
res10: org.apache.spark.sql.DataFrame = []
scala> sql("create table t2 (semi int)")
res11: org.apache.spark.sql.DataFrame = []
scala> sql("create table t3 (minus int)")
res12: org.apache.spark.sql.DataFrame = []

scala> sql("SET spark.sql.ansi.enabled=true")

scala> sql("create table t4 (anti int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'anti'(line 1, pos 17)

== SQL ==
create table t4 (anti int)
-^^^

scala> sql("create table t5 (semi int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'semi'(line 1, pos 17)

== SQL ==
create table t5 (semi int)
-^^^

scala> sql("create table t6 (minus int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'minus'(line 1, pos 17)

== SQL ==
create table t6 (minus int)
-^^^
{code}
Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`. Yay, I think we already follow the standard SQL2016, haha

> Revisit reserved/non-reserved keywords based on the ANSI SQL standard
> -
>
> Key: SPARK-26905
> URL: https://issues.apache.org/jira/browse/SPARK-26905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, 
> spark-nonReserved.txt, spark-strictNonReserved.txt, 
> sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, 
> sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, 
> sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard

2020-06-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133215#comment-17133215
 ] 

Takeshi Yamamuro edited comment on SPARK-26905 at 6/11/20, 11:44 PM:
-

~Ah, I got noticed that they are already reserved in the ANSI mode (the 
document says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]);~
{code:java}
scala> sql("SET spark.sql.ansi.enabled=false")

scala> sql("create table t1 (anti int)")
res10: org.apache.spark.sql.DataFrame = []
scala> sql("create table t2 (semi int)")
res11: org.apache.spark.sql.DataFrame = []
scala> sql("create table t3 (minus int)")
res12: org.apache.spark.sql.DataFrame = []

scala> sql("SET spark.sql.ansi.enabled=true")

scala> sql("create table t4 (anti int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'anti'(line 1, pos 17)

== SQL ==
create table t4 (anti int)
-^^^

scala> sql("create table t5 (semi int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'semi'(line 1, pos 17)

== SQL ==
create table t5 (semi int)
-^^^

scala> sql("create table t6 (minus int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'minus'(line 1, pos 17)

== SQL ==
create table t6 (minus int)
-^^^
{code}
Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`. Yay, I think we already follow the standard SQL2016, haha


was (Author: maropu):
Ah, I got noticed that they are already reserved in the ANSI mode (the document 
says so, too: 
[https://github.com/apache/spark/blob/master/docs/sql-ref-ansi-compliance.md]);
{code:java}
scala> sql("SET spark.sql.ansi.enabled=false")

scala> sql("create table t1 (anti int)")
res10: org.apache.spark.sql.DataFrame = []
scala> sql("create table t2 (semi int)")
res11: org.apache.spark.sql.DataFrame = []
scala> sql("create table t3 (minus int)")
res12: org.apache.spark.sql.DataFrame = []

scala> sql("SET spark.sql.ansi.enabled=true")

scala> sql("create table t4 (anti int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'anti'(line 1, pos 17)

== SQL ==
create table t4 (anti int)
-^^^

scala> sql("create table t5 (semi int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'semi'(line 1, pos 17)

== SQL ==
create table t5 (semi int)
-^^^

scala> sql("create table t6 (minus int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'minus'(line 1, pos 17)

== SQL ==
create table t6 (minus int)
-^^^
{code}
Actually, the Spark reserved keywords in the ANSI mode are computed by 
`spark-keywords-list.txt – spark-ansiNonReserved.txt – "strict non-reserved 
keywords"`. Yay, I think we already follow the standard SQL2016, haha

> Revisit reserved/non-reserved keywords based on the ANSI SQL standard
> -
>
> Key: SPARK-26905
> URL: https://issues.apache.org/jira/browse/SPARK-26905
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, 
> spark-nonReserved.txt, spark-strictNonReserved.txt, 
> sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, 
> sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, 
> sql2016-14-nonreserved.txt, sql2016-14-reserved.txt
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31972) Improve heuristic for selecting nodes for scale down to take into account graceful decommission cost

2020-06-11 Thread Holden Karau (Jira)
Holden Karau created SPARK-31972:


 Summary: Improve heuristic for selecting nodes for scale down to 
take into account graceful decommission cost
 Key: SPARK-31972
 URL: https://issues.apache.org/jira/browse/SPARK-31972
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Holden Karau


Once SPARK-31198 is in, we should see if we can come up with a better 
graceful-decommissioning-aware heuristic for selecting nodes to scale down.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-06-11 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133776#comment-17133776
 ] 

Holden Karau commented on SPARK-31197:
--

I'm working on this, will post PR soon.

> Exit the executor once all tasks & migrations are finished
> --
>
> Key: SPARK-31197
> URL: https://issues.apache.org/jira/browse/SPARK-31197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling

2020-06-11 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133777#comment-17133777
 ] 

Holden Karau commented on SPARK-31198:
--

I'm going to start working on this (on top of my SPARK-31197 code)

> Use graceful decommissioning as part of dynamic scaling
> ---
>
> Key: SPARK-31198
> URL: https://issues.apache.org/jira/browse/SPARK-31198
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31765) Upgrade HtmlUnit >= 2.37.0

2020-06-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31765.
--
Fix Version/s: 3.1.0
 Assignee: Kousuke Saruta
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28585

> Upgrade HtmlUnit >= 2.37.0
> --
>
> Key: SPARK-31765
> URL: https://issues.apache.org/jira/browse/SPARK-31765
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> Recently, a security issue that affects HtmlUnit was reported.
> [https://nvd.nist.gov/vuln/detail/CVE-2020-5529]
> According to the report, arbitrary code can be run by malicious users.
> HtmlUnit is used for tests, so the impact might not be large, but it's better 
> to upgrade it just in case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31924:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: BoYang
>Priority: Major
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan to 
> build a reference implementation to help demonstrate some basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle 
> service reference implementation will help to validate those shuffle metadata 
> APIs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31924:
--
Target Version/s:   (was: 3.0.0)

> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: BoYang
>Priority: Major
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan to 
> build a reference implementation to help demonstrate some basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle 
> service reference implementation will help to validate those shuffle metadata 
> APIs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31924:
--
Fix Version/s: (was: 3.0.0)

> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: BoYang
>Priority: Major
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan to 
> build a reference implementation to help demonstrate some basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle 
> service reference implementation will help to validate those shuffle metadata 
> APIs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31935.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28776
[https://github.com/apache/spark/pull/28776]

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0, 2.4.7
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Data source options should be propagated into the Hadoop configuration used by 
> the method `checkAndGlobPathIfNecessary`.
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws 
> IOException {
> String scheme = uri.getScheme();
> String authority = uri.getAuthority();
> if (scheme == null && authority == null) { // use default FS
>   return get(conf);
> }
> if (scheme != null && authority == null) { // no authority
>   URI defaultUri = getDefaultUri(conf);
>   if (scheme.equals(defaultUri.getScheme())// if scheme matches 
> default
>   && defaultUri.getAuthority() != null) {  // & default has authority
> return get(defaultUri, conf);  // return default
>   }
> }
> 
> String disableCacheName = String.format("fs.%s.impl.disable.cache", 
> scheme);
> if (conf.getBoolean(disableCacheName, false)) {
>   return createFileSystem(uri, conf);
> }
> return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme and authority related configurations for 
> scanning file systems.
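
A minimal usage sketch of what this enables (the S3A keys and the bucket path below are 
purely illustrative placeholders, not values taken from this issue):
{code:scala}
// Per-read Hadoop file-system settings passed as data source options, so they
// reach the FileSystem lookup used while resolving/globbing the input paths,
// instead of relying only on the session-wide hadoopConfiguration.
val df = spark.read
  .option("fs.s3a.access.key", "<access-key>")   // illustrative fs.* keys
  .option("fs.s3a.secret.key", "<secret-key>")
  .parquet("s3a://some-bucket/some/path")
{code}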



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31935:
-

Assignee: Gengliang Wang

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0, 2.4.7
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Data source options should be propagated into the hadoop configuration of 
> method `checkAndGlobPathIfNecessary`
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws 
> IOException {
> String scheme = uri.getScheme();
> String authority = uri.getAuthority();
> if (scheme == null && authority == null) { // use default FS
>   return get(conf);
> }
> if (scheme != null && authority == null) { // no authority
>   URI defaultUri = getDefaultUri(conf);
>   if (scheme.equals(defaultUri.getScheme())// if scheme matches 
> default
>   && defaultUri.getAuthority() != null) {  // & default has authority
> return get(defaultUri, conf);  // return default
>   }
> }
> 
> String disableCacheName = String.format("fs.%s.impl.disable.cache", 
> scheme);
> if (conf.getBoolean(disableCacheName, false)) {
>   return createFileSystem(uri, conf);
> }
> return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme and authority related configurations for 
> scanning file systems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-21117.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28764
[https://github.com/apache/spark/pull/28764]

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717
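
A minimal example of the semantics (assuming the built-in is exposed as {{width_bucket}} 
in Spark SQL after this change; the values are illustrative):
{code:scala}
// Bucket width = (10.6 - 0.2) / 5 = 2.08, so 5.3 falls into the 3rd bucket.
// Values below min_value map to bucket 0, values above max_value to num_buckets + 1.
spark.sql("SELECT width_bucket(5.3, 0.2, 10.6, 5) AS bucket").show()  // expected: 3
{code}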



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-06-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-21117:
-

Assignee: Takeshi Yamamuro

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31971) Add pagination support for all jobs timeline

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31971:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Add pagination support for all jobs timeline
> 
>
> Key: SPARK-31971
> URL: https://issues.apache.org/jira/browse/SPARK-31971
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> If there are lots of jobs, rendering performance of the all jobs timeline can 
> degrade significantly. This issue is reported in SPARK-31967.
>  For example, the following operation can take >40 sec.
> {code:java}
> (1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect) {code}
> Although it's not the fundamental solution, pagination can mitigate the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31971) Add pagination support for all jobs timeline

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31971:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Add pagination support for all jobs timeline
> 
>
> Key: SPARK-31971
> URL: https://issues.apache.org/jira/browse/SPARK-31971
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> If there are lots of jobs, rendering performance of the all jobs timeline can 
> degrade significantly. This issue is reported in SPARK-31967.
>  For example, the following operation can take >40 sec.
> {code:java}
> (1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect) {code}
> Although it's not the fundamental solution, pagination can mitigate the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31971) Add pagination support for all jobs timeline

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133679#comment-17133679
 ] 

Apache Spark commented on SPARK-31971:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28803

> Add pagination support for all jobs timeline
> 
>
> Key: SPARK-31971
> URL: https://issues.apache.org/jira/browse/SPARK-31971
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> If there are lots of jobs, rendering performance of the all jobs timeline can 
> degrade significantly. This issue is reported in SPARK-31967.
>  For example, the following operation can take >40 sec.
> {code:java}
> (1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect) {code}
> Although it's not the fundamental solution, pagination can mitigate the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31971) Add pagination support for all jobs timeline

2020-06-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-31971:
---
Description: 
If there are lots of jobs, rendering performance of the all jobs timeline can 
degrade significantly. This issue is reported in SPARK-31967.
 For example, the following operation can take >40 sec.
{code:java}
(1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect) {code}
Although it's not the fundamental solution, pagination can mitigate the issue.

  was:
If there are lots of jobs, rendering performance of the all jobs timeline can 
degrade significantly. This issue is reported in SPARK-31967.
For example, the following operation can take >40 sec.
{code:java}
(1 to 301).foreach(_ => sc.parallelize(1 to 10).collect) {code}
Although it's not the fundamental solution, pagination can mitigate the issue.


> Add pagination support for all jobs timeline
> 
>
> Key: SPARK-31971
> URL: https://issues.apache.org/jira/browse/SPARK-31971
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> If there are lots of jobs, rendering performance of the all jobs timeline can 
> degrade significantly. This issue is reported in SPARK-31967.
>  For example, the following operation can take >40 sec.
> {code:java}
> (1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect) {code}
> Although it's not the fundamental solution, pagination can mitigate the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31971) Add pagination support for all jobs timeline

2020-06-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-31971:
---
Description: 
If there are lots of jobs, rendering performance of the all jobs timeline can 
degrade significantly. This issue is reported in SPARK-31967.
For example, the following operation can take >40 sec.
{code:java}
(1 to 301).foreach(_ => sc.parallelize(1 to 10).collect) {code}
Although it's not the fundamental solution, pagination can mitigate the issue.

  was:
If there are lots of jobs, rendering performance of the all jobs timeline can 
degrade significantly. This issue is reported in SPARK-31967.

Although it's not the fundamental solution, pagination can mitigate the issue.


> Add pagination support for all jobs timeline
> 
>
> Key: SPARK-31971
> URL: https://issues.apache.org/jira/browse/SPARK-31971
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> If there are lots of jobs, rendering performance of the all jobs timeline can 
> degrade significantly. This issue is reported in SPARK-31967.
> For example, the following operation can take >40 sec.
> {code:java}
> (1 to 301).foreach(_ => sc.parallelize(1 to 10).collect) {code}
> Although it's not the fundamental solution, pagination can mitigate the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31971) Add pagination support for all jobs timeline

2020-06-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-31971:
--

 Summary: Add pagination support for all jobs timeline
 Key: SPARK-31971
 URL: https://issues.apache.org/jira/browse/SPARK-31971
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1, 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


If there are lots of jobs, rendering performance of the all jobs timeline can 
degrade significantly. This issue is reported in SPARK-31967.

Although it's not the fundamental solution, pagination can mitigate the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

2020-06-11 Thread Yaroslav Tkachenko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaroslav Tkachenko updated SPARK-30616:
---
Description: 
From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this 
metadata cache. It can be disabled by default (-1), so it doesn't change the 
existing behaviour. 

  was:
From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this 
metadata cache. Its default value can be pretty high (an hour? a few hours?), 
so it doesn't alter the existing behaviour much. 


> Introduce TTL config option for SQL Parquet Metadata Cache
> --
>
> Key: SPARK-30616
> URL: https://issues.apache.org/jira/browse/SPARK-30616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yaroslav Tkachenko
>Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
> cumbersome. Assuming frequently generated new Parquet files, hundreds of 
> tables and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing metadata for each table is not an optimal 
> solution. And this is a pretty common use-case for streaming ingestion of 
> data.    
> I propose to introduce a new option in Spark (something like 
> "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of 
> this metadata cache. It can be disabled by default (-1), so it doesn't change 
> the existing behaviour. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

2020-06-11 Thread Yaroslav Tkachenko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaroslav Tkachenko updated SPARK-30616:
---
Description: 
From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this 
metadata cache. Its default value can be pretty high (an hour? a few hours?), 
so it doesn't alter the existing behaviour much. 

  was:
From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this 
metadata cache. Its default value can be pretty high (an hour? a few hours?), 
so it doesn't alter the existing behavior much. When it's set to 0 the cache is 
effectively disabled (could be useful for testing or some edge cases). 


> Introduce TTL config option for SQL Parquet Metadata Cache
> --
>
> Key: SPARK-30616
> URL: https://issues.apache.org/jira/browse/SPARK-30616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yaroslav Tkachenko
>Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
> cumbersome. Assuming frequently generated new Parquet files, hundreds of 
> tables and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing metadata for each table is not an optimal 
> solution. And this is a pretty common use-case for streaming ingestion of 
> data.    
> I propose to introduce a new option in Spark (something like 
> "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of 
> this metadata cache. Its default value can be pretty high (an hour? a few 
> hours?), so it doesn't alter the existing behaviour much. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

2020-06-11 Thread Yaroslav Tkachenko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaroslav Tkachenko updated SPARK-30616:
---
Description: 
From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this 
metadata cache. Its default value can be pretty high (an hour? a few hours?), 
so it doesn't alter the existing behavior much. When it's set to 0 the cache is 
effectively disabled (could be useful for testing or some edge cases). 

  was:
From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of 
this metadata cache. Its default value can be pretty high (an hour? a few 
hours?), so it doesn't alter the existing behavior much. When it's set to 0 the 
cache is effectively disabled (could be useful for testing or some edge cases). 


> Introduce TTL config option for SQL Parquet Metadata Cache
> --
>
> Key: SPARK-30616
> URL: https://issues.apache.org/jira/browse/SPARK-30616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yaroslav Tkachenko
>Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
> cumbersome. Assuming frequently generated new Parquet files, hundreds of 
> tables and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing metadata for each table is not an optimal 
> solution. And this is a pretty common use-case for streaming ingestion of 
> data.    
> I propose to introduce a new option in Spark (something like 
> "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of 
> this metadata cache. Its default value can be pretty high (an hour? a few 
> hours?), so it doesn't alter the existing behavior much. When it's set to 0 
> the cache is effectively disabled (could be useful for testing or some edge 
> cases). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

2020-06-11 Thread Yaroslav Tkachenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133596#comment-17133596
 ] 

Yaroslav Tkachenko commented on SPARK-30616:


The fix can be adding "expireAfterWrite" to this cache initialization 
[https://github.com/apache/spark/blob/f6dd8e0e1673aa491b895c1f0467655fa4e9d52f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala#L132-L136]
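
A rough sketch of that idea (a minimal, self-contained illustration of Guava's 
{{expireAfterWrite}}, not the actual FileStatusCache code; the {{ttlSeconds}} parameter 
is a hypothetical stand-in for the proposed config):
{code:scala}
import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.hadoop.fs.{FileStatus, Path}

// Build a file-status cache whose entries expire ttlSeconds after being written.
// A non-positive TTL keeps today's behaviour: entries never expire.
def buildCache(ttlSeconds: Long): Cache[Path, Array[FileStatus]] = {
  val builder = CacheBuilder.newBuilder().maximumSize(10000L)
  val withTtl =
    if (ttlSeconds > 0) builder.expireAfterWrite(ttlSeconds, TimeUnit.SECONDS)
    else builder
  withTtl.build[Path, Array[FileStatus]]()
}
{code}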

> Introduce TTL config option for SQL Parquet Metadata Cache
> --
>
> Key: SPARK-30616
> URL: https://issues.apache.org/jira/browse/SPARK-30616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yaroslav Tkachenko
>Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
> cumbersome. Assuming frequently generated new Parquet files, hundreds of 
> tables and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing metadata for each table is not an optimal 
> solution. And this is a pretty common use-case for streaming ingestion of 
> data.    
> I propose to introduce a new option in Spark (something like 
> "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of 
> this metadata cache. Its default value can be pretty high (an hour? a few 
> hours?), so it doesn't alter the existing behavior much. When it's set to 0 
> the cache is effectively disabled (could be useful for testing or some edge 
> cases). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31960:

Parent: SPARK-31582
Issue Type: Sub-task  (was: Bug)

> Only populate Hadoop classpath for no-hadoop build
> --
>
> Key: SPARK-31960
> URL: https://issues.apache.org/jira/browse/SPARK-31960
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31955) Beeline discards the last line of the SQL file when submitted to thriftserver via beeline

2020-06-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133521#comment-17133521
 ] 

Dongjoon Hyun commented on SPARK-31955:
---

[~denglg]. Please add a new line character at the end of the line. Your script 
doesn't have it.
Thank you for investigating, [~younggyuchun].

> Beeline discards the last line of the SQL file when submitted to thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via Beeline and the result returned was wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-06-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31705.

Resolution: Fixed

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, 
>   
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> 
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), 
>   
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
> DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*) 
> FROM   lineitem, 
>orders 
> WHERE  l_orderkey = o_orderkey 
>AND NOT ( ( l_suppkey > 3 
>AND ( l_suppkey > 2 
>   OR o_custkey > 13 ) ) 
>   OR ( l_suppkey > 1 
>AND o_custkey > 11 ) ) 
>AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem

[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-06-11 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133435#comment-17133435
 ] 

Gengliang Wang commented on SPARK-31705:


The issue is resolved in https://github.com/apache/spark/pull/28733
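
For readers skimming the digest, the effect of the rewrite can be sketched with the 
predicate from the quoted example below. Distributing the disjunction over the 
conjunctions (a hand-worked sketch, not the optimizer's literal output) gives:
{noformat}
(l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)
  ==>   (l_suppkey > 3 OR l_suppkey > 1)     -- only lineitem columns: pushed below the join
    AND (l_suppkey > 3 OR o_custkey > 11)    -- mixes both sides: cannot be pushed
    AND (o_custkey > 13 OR l_suppkey > 1)    -- mixes both sides: cannot be pushed
    AND (o_custkey > 13 OR o_custkey > 11)   -- only orders columns: pushed below the join
{noformat}
which matches the single-table filters in the PostgreSQL plans quoted below.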

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, 
>   
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> 
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), 
>   
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
> DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*) 
> FROM   lineitem, 
>orders 
> WHERE  l_orderkey = o_orderkey 
>AND NOT ( ( l_suppkey > 3 
>AND ( l_suppkey > 2 
>   OR o_custkey > 13 ) ) 
>   OR ( l_suppkey > 1 
>AND o_custkey > 11 ) ) 
>AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))

[jira] [Assigned] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-06-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-31705:
--

Assignee: Gengliang Wang  (was: Yuming Wang)

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, 
>   
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> 
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), 
>   
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
> DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*) 
> FROM   lineitem, 
>orders 
> WHERE  l_orderkey = o_orderkey 
>AND NOT ( ( l_suppkey > 3 
>AND ( l_suppkey > 2 
>   OR o_custkey > 13 ) ) 
>   OR ( l_suppkey > 1 
>AND o_custkey > 11 ) ) 
>AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>

[jira] [Commented] (SPARK-31950) Extract SQL keywords from the generated parser class in TableIdentifierParserSuite

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133363#comment-17133363
 ] 

Apache Spark commented on SPARK-31950:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28802

> Extract SQL keywords from the generated parser class in 
> TableIdentifierParserSuite
> --
>
> Key: SPARK-31950
> URL: https://issues.apache.org/jira/browse/SPARK-31950
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31970) Make MDC configuration step be consistent between setLocalProperty and log4j.properties

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133351#comment-17133351
 ] 

Apache Spark commented on SPARK-31970:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28801

> Make MDC configuration step be consistent between setLocalProperty and 
> log4j.properties
> ---
>
> Key: SPARK-31970
> URL: https://issues.apache.org/jira/browse/SPARK-31970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> It's weird that we use "mdc.XXX" as key to set MDC value via 
> `setLocalProperty` while we use "XXX" as key to set MDC pattern in 
> log4j.properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31970) Make MDC configuration step be consistent between setLocalProperty and log4j.properties

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31970:


Assignee: (was: Apache Spark)

> Make MDC configuration step be consistent between setLocalProperty and 
> log4j.properties
> ---
>
> Key: SPARK-31970
> URL: https://issues.apache.org/jira/browse/SPARK-31970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> It's weird that we use "mdc.XXX" as key to set MDC value via 
> `setLocalProperty` while we use "XXX" as key to set MDC pattern in 
> log4j.properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31970) Make MDC configuration step be consistent between setLocalProperty and log4j.properties

2020-06-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31970:


Assignee: Apache Spark

> Make MDC configuration step be consistent between setLocalProperty and 
> log4j.properties
> ---
>
> Key: SPARK-31970
> URL: https://issues.apache.org/jira/browse/SPARK-31970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> It's weird that we use "mdc.XXX" as key to set MDC value via 
> `setLocalProperty` while we use "XXX" as key to set MDC pattern in 
> log4j.properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31970) Make MDC configuration step be consistent between setLocalProperty and log4j.properties

2020-06-11 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31970:
-
Summary: Make MDC configuration step be consistent between setLocalProperty 
and log4j.properties  (was: Make MDC configuration step be consistent between 
setLocalProperty and log4j)

> Make MDC configuration step be consistent between setLocalProperty and 
> log4j.properties
> ---
>
> Key: SPARK-31970
> URL: https://issues.apache.org/jira/browse/SPARK-31970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> It's weird that we use "mdc.XXX" as key to set MDC value via 
> `setLocalProperty` while we use "XXX" as key to set MDC pattern in 
> log4j.properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31970) Make MDC configuration step be consistent between setLocalProperty and log4j

2020-06-11 Thread wuyi (Jira)
wuyi created SPARK-31970:


 Summary: Make MDC configuration step be consistent between 
setLocalProperty and log4j
 Key: SPARK-31970
 URL: https://issues.apache.org/jira/browse/SPARK-31970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: wuyi


It's weird that we use "mdc.XXX" as key to set MDC value via `setLocalProperty` 
while we use "XXX" as key to set MDC pattern in log4j.properties.
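
A hedged illustration of the mismatch (the key name {{taskName}} and the pattern below are 
made-up examples; the only point is the prefix asymmetry described above):
{code:scala}
// The value is registered through a local property whose key carries an "mdc." prefix...
sc.setLocalProperty("mdc.taskName", "ingest-batch-42")
{code}
...while the corresponding log4j.properties pattern refers to the same entry without the prefix:
{noformat}
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %X{taskName}: %m%n
{noformat}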



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31939) Fix Parsing day of year when year field pattern is missing

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133309#comment-17133309
 ] 

Apache Spark commented on SPARK-31939:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28800

> Fix Parsing day of year when year field pattern is missing
> --
>
> Key: SPARK-31939
> URL: https://issues.apache.org/jira/browse/SPARK-31939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> If a datetime pattern contains no year field, the day of year field should 
> not be ignored if it exists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31939) Fix Parsing day of year when year field pattern is missing

2020-06-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133308#comment-17133308
 ] 

Apache Spark commented on SPARK-31939:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28800

> Fix Parsing day of year when year field pattern is missing
> --
>
> Key: SPARK-31939
> URL: https://issues.apache.org/jira/browse/SPARK-31939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> If a datetime pattern contains no year field, the day of year field should 
> not be ignored if it exists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31955) Beeline discards the last line of the SQL file when submitted to thriftserver via beeline

2020-06-11 Thread YoungGyu Chun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133298#comment-17133298
 ] 

YoungGyu Chun commented on SPARK-31955:
---

[~denglg] 

 

I cannot reproduce this locally:


Whole data:
{code:java}
0: jdbc:hive2://localhost:1> select * from info_dev.beeline_test;
+--++--+
| beeline_test.id  | beeline_test.name  |
+--++--+
| 1| aaa|
| 2| bbb|
| 3| ccc|
| 1| aaa|
+--++--+
4 rows selected (0.239 seconds)
0: jdbc:hive2://localhost:1>
{code}
 

test2.sql:
{code:java}
jun562@CHUNYLT:~/spark-2.4.4-bin-hadoop2.7/bin$ cat test2.sql
select * from info_dev.beeline_test where name='bbb'
jun562@CHUNYLT:~/spark-2.4.4-bin-hadoop2.7/bin$
{code}
 

Execute a test2.sql file on Beeline by running a "run" command:
{code:java}
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:1> !run 
/home/jun562/apache-hive-1.2.1-bin/bin/test2.sql
>>>  select * from info_dev.beeline_test where name='bbb';
+--++--+
| beeline_test.id  | beeline_test.name  |
+--++--+
| 2| bbb|
+--++--+
1 row selected (0.406 seconds)
0: jdbc:hive2://localhost:1>
{code}
 

Execute SQL on Beeline:
{code:java}
0: jdbc:hive2://localhost:1> select * from info_dev.beeline_test where 
name='bbb';
+--++--+
| beeline_test.id  | beeline_test.name  |
+--++--+
| 2| bbb|
+--++--+
1 row selected (0.233 seconds)
0: jdbc:hive2://localhost:1>
{code}
 

cc [~dongjoon] [~hyukjin.kwon] [~srowen]

> Beeline discards the last line of the SQL file when submitted to thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
>
> I submitted a SQL file via Beeline and the result returned was wrong. After 
> many tests, it was found that the SQL executed by Spark would discard the 
> last line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31954) delete duplicate test cases in hivequerysuite

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31954:
-
Fix Version/s: (was: 3.0.0)
   3.0.1

> delete duplicate test cases in hivequerysuite
> -
>
> Key: SPARK-31954
> URL: https://issues.apache.org/jira/browse/SPARK-31954
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
> Fix For: 3.0.1, 2.4.7
>
>
> remove duplicated test cases and result files in HiveQuerySuite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31954) delete duplicate test cases in hivequerysuite

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31954.
--
Fix Version/s: 3.0.0
   2.4.7
   Resolution: Fixed

Issue resolved by pull request 28782
[https://github.com/apache/spark/pull/28782]

> delete duplicate test cases in hivequerysuite
> -
>
> Key: SPARK-31954
> URL: https://issues.apache.org/jira/browse/SPARK-31954
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
> Fix For: 2.4.7, 3.0.0
>
>
> remove duplicated test cases and result files in HiveQuerySuite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31954) delete duplicate test cases in hivequerysuite

2020-06-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31954:


Assignee: philipse

> delete duplicate test cases in hivequerysuite
> -
>
> Key: SPARK-31954
> URL: https://issues.apache.org/jira/browse/SPARK-31954
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: philipse
>Assignee: philipse
>Priority: Minor
>
> remove duplicated test cases and result files in HiveQuerySuite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31969) StreamingJobProgressListener threw an exception java.util.NoSuchElementException for Long Running Streaming Job

2020-06-11 Thread ThimmeGowda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ThimmeGowda updated SPARK-31969:

Component/s: Web UI
 Spark Core
 Scheduler

> StreamingJobProgressListener threw an exception 
> java.util.NoSuchElementException for Long Running Streaming Job
> ---
>
> Key: SPARK-31969
> URL: https://issues.apache.org/jira/browse/SPARK-31969
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Scheduler, Spark Core, Web UI
>Affects Versions: 2.4.0
> Environment: Kubernetes
> Spark 2.4.0
>Reporter: ThimmeGowda
>Priority: Major
> Attachments: driver_log, executor_log
>
>
> We are running a long-running streaming job and the exception below is seen 
> continuously after some time. After the job starts, all of a sudden our Spark 
> streaming application's batch durations start to increase. At around the same 
> time an error log starts to appear that does not refer to the application code 
> at all. We couldn't find any other significant errors in the driver logs.
> We referred to https://issues.apache.org/jira/browse/SPARK-21065 for a similar 
> issue; in our case we are not setting anything for 
> spark.streaming.concurrentJobs and the default value is taken.
>  \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2020-06-09T04:31:43.918Z", "timezone":"UTC", 
> "class":"spark-listener-group-appStatus", 
> "method":"streaming.scheduler.StreamingListenerBus.logError(91)", 
> "log":"Listener StreamingJobProgressListener threw an 
> exception\u000Ajava.util.NoSuchElementException: key not found: 159167710 
> ms\u000A\u0009at 
> scala.collection.MapLike$class.default(MapLike.scala:228)\u000A\u0009at 
> scala.collection.AbstractMap.default(Map.scala:59)\u000A\u0009at 
> scala.collection.mutable.HashMap.apply(HashMap.scala:65)\u000A\u0009at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)\u000A\u0009at
>  
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)\u000A\u0009at
>  
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at
>  
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at
>  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)\u000A\u0009at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)\u000A\u0009at
>  
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)\u000A"}
> java.util.NoSuchElementException: key not found: 159167710 ms
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.AbstractMap.default(Map.scala:59) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65) 
> ~[scala-library-2.11.12.jar:?]
> at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)
>  ~[spark-streaming_2.11-2.4.0.jar:2.4.0]
> at 
> org.apache.spark.streaming.scheduler.Stream
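
A minimal sketch of where spark.streaming.concurrentJobs would be set, for reference against the report above. Only the property key comes from the report (when nothing is set, the default is 1); the application name, master, batch interval, the value of 2, and the queue-backed input stream are illustrative assumptions, not a suggested fix for this issue.

{code:scala}
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    // Only the property key below comes from the report; the app name,
    // master, value of 2 and batch interval are illustrative.
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-sketch")
      .setMaster("local[2]")                        // local run, for illustration only
      .set("spark.streaming.concurrentJobs", "2")   // default is 1 when unset

    val ssc = new StreamingContext(conf, Seconds(10))

    // A queue-backed stream keeps the sketch self-contained; a real job
    // would read from Kafka, sockets, etc.
    val queue = mutable.Queue.empty[RDD[Int]]
    ssc.queueStream(queue).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}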

[jira] [Updated] (SPARK-31969) StreamingJobProgressListener threw an exception java.util.NoSuchElementException for Long Running Streaming Job

2020-06-11 Thread ThimmeGowda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ThimmeGowda updated SPARK-31969:

Attachment: executor_log

> StreamingJobProgressListener threw an exception 
> java.util.NoSuchElementException for Long Running Streaming Job
> ---
>
> Key: SPARK-31969
> URL: https://issues.apache.org/jira/browse/SPARK-31969
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
> Environment: Kubernetes
> Spark 2.4.0
>Reporter: ThimmeGowda
>Priority: Major
> Attachments: driver_log, executor_log
>
>
> We are running a long-running streaming job, and the exception below is seen 
> continuously after some time. After the job starts, all of a sudden our Spark 
> streaming application's batch durations start to increase. At around the same 
> time an error log starts to appear that does not refer to the 
> application code at all. We couldn't find any other significant errors in the 
> driver logs.
> Referred ticket: https://issues.apache.org/jira/browse/SPARK-21065 for a 
> similar issue; in our case we are not setting anything for 
> spark.streaming.concurrentJobs, so the default value is taken.
>  \{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", 
> "time":"2020-06-09T04:31:43.918Z", "timezone":"UTC", 
> "class":"spark-listener-group-appStatus", 
> "method":"streaming.scheduler.StreamingListenerBus.logError(91)", 
> "log":"Listener StreamingJobProgressListener threw an 
> exception\u000Ajava.util.NoSuchElementException: key not found: 159167710 
> ms\u000A\u0009at 
> scala.collection.MapLike$class.default(MapLike.scala:228)\u000A\u0009at 
> scala.collection.AbstractMap.default(Map.scala:59)\u000A\u0009at 
> scala.collection.mutable.HashMap.apply(HashMap.scala:65)\u000A\u0009at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)\u000A\u0009at
>  
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)\u000A\u0009at
>  
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)\u000A\u0009at
>  
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)\u000A\u0009at
>  
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)\u000A\u0009at
>  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)\u000A\u0009at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)\u000A\u0009at
>  
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)\u000A\u0009at
>  
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)\u000A"}
> java.util.NoSuchElementException: key not found: 159167710 ms
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.AbstractMap.default(Map.scala:59) 
> ~[scala-library-2.11.12.jar:?]
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65) 
> ~[scala-library-2.11.12.jar:?]
> at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)
>  ~[spark-streaming_2.11-2.4.0.jar:2.4.0]
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
>  ~[spark-streaming_2.
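
The stack traces above end in scala.collection.mutable.HashMap.apply, which throws java.util.NoSuchElementException when the requested key is absent; the error message shows the missing key is a batch time, suggesting the completion event arrived after the listener's per-batch entry had already been removed. The standalone sketch below reproduces only that key-not-found failure mode and is not Spark's listener code; the map contents and key value are invented for illustration.

{code:scala}
import java.util.NoSuchElementException

import scala.collection.mutable

object KeyNotFoundSketch {
  def main(args: Array[String]): Unit = {
    // Per-batch state keyed by batch time in milliseconds; the key and
    // value here are placeholders, not taken from the logs above.
    val runningBatches = mutable.HashMap(1591677100000L -> "batch info")

    // Simulate the entry being dropped before the completion event arrives.
    runningBatches.remove(1591677100000L)

    // apply() on a missing key throws
    // java.util.NoSuchElementException: key not found: ...
    // which is the exception type reported above.
    try {
      println(runningBatches(1591677100000L))
    } catch {
      case e: NoSuchElementException => println(s"lookup failed: ${e.getMessage}")
    }

    // get() returns an Option instead of throwing, so a late event can be
    // ignored rather than killing the listener thread.
    runningBatches.get(1591677100000L) match {
      case Some(info) => println(info)
      case None       => println("batch entry already removed; ignoring event")
    }
  }
}
{code}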
